PhyNEST
Documentation for a Julia package PhyNEST: Phylogenetic Network Estimation using SiTe patterns.
Tutorials
A step-by-step wiki tutorial for the major functions in PhyNEST
is available.
Reference
Sungsik Kong, David Swofford, and Laura Kubatko (2022) Inference of Phylogenetic Networks from Sequence Data using Composite Likelihood. Preprint available online on BioRxiv at https://doi.org/10.1101/2022.11.14.516468.
Getting help
Please use google group to report bugs or post questions and/or suggestions.
Functions
PhyNEST.greet
— Functiongreet()
Displays a greeting with citation information. No argument needed.
Example
julia> greet()
Thank you for using PhyNEST: Phylogenetic Network Estimation using SiTe patterns.
Please report bugs or make suggestions to https://groups.google.com/g/phynest-users.
If you conduct an analysis using PhyNEST, please cite:
Sungsik Kong, David Swofford, and Laura Kubatko (2022) Inference of Phylogenetic Networks from Sequence Data using Composite Likelihood.
Preprint available online on BioRxiv at https://doi.org/10.1101/2022.11.14.516468.
PhyNEST.readPhylip
— FunctionreadPhylip(inputfile::AbstractString;
writecsv=false::Bool,
csvname=""::AbstractString,
showProgress=true::Bool,
tdigits=3::Integer,
checkpoint=false::Bool)
Import, read, and parse the input phylip file. File name must be specified as string. By default, progress bar is displayed, and .csv
and .ckp
files are NOT produced. See optional arguments below and modify if needed.
Mandatory argument
inputfile
Name of the phylip file as a string
Optional arguments
writecsv (default=false)
A boolean arguement to allow writing site pattern frequencies in.csv
filecsvname (default=sitePatternCounts_inputfile.csv)
A string that will be name of the.csv
fileshowProgress (default=true)
A boolean argument for visualizing the progress of alignment parsingtdigits (default=3)
Number of decimal points to record timme taken to parse the file in secondscheckpoint (default=false)
A boolean argument to store the Phylip object as a.ckp
file. (Warning:.ckp
file can be large)
PhyNEST.readPhylipFile
— FunctionreadPhylipFile(inputfile::AbstractString,
writecsv::Bool,
csvname::AbstractString,
showProgress::Bool)
Exceuted while running the function readPhylip
. Fills in the attributes in the Phylip
object, except for the Phylip.time
.
PhyNEST.PhylipFileInfo
— FunctionPhylipFileInfo(inputfile::AbstractString,
p::Phylip,
showProgress::Bool)
Function that shortens?summarizes? the sequence alignment into two matrices:
- The one with unique sites and
- Another with how many times each column in the previous matrix occurs throughout the alignment.
PhyNEST.getUniqueQuartets
— FunctiongetUniqueQuartets(p::Phylip)
Getting some combinations for four sequences (quartets). This will result in n! quartets where n is the number of sequences. I believe a better way to do this exists.
PhyNEST.sitePatternCounts
— FunctionsitePatternCounts(p::Phylip,
ppbase::Array,
counts::Array)
Computes observed site pattern frequencies from UniqueBase and BaseCounts obtained from the function PhylipFileInfo
.
PhyNEST.spRearrange
— FunctionspRearrange(p::Phylip)
Rearranges the elemetns of each unique quartet obtained using getUniqueQuartets
, and also suffles the site pattern frequencies accordingly. This prevents 24 redundant computations, saving computation time.
PhyNEST.show_sp
— Functionshow_sp(p::Phylip)
Pretty name for displaying all quartet and the corresponding site pattern frequencies on screen in the DataFrame format. It may result in excessively long table when there are many sequences in the input file.
Mandatory argument
p
Phylip object
PhyNEST.sitePatternsToDF
— FunctionsitePatternsToDF(p::Phylip)
Extracts the quartet and site pattern information from the Phyliip object, and reorganizes them in the DataFrame format. This function does all hard work work for show_sp.
PhyNEST.write_sp
— Functionwrite_sp(p::Phylip;
csvname="PhyNEST_sp"::AbstractString)
Pretty name for exporting all quartet and the corresponding site pattern frequencies in a .csv format.
Mandatory argument
p
Phylip object
Optional argument
csvname
Filename for the .csv output can be given, otherwise we use PhyNEST_sp.csv by default.
PhyNEST.storeCheckPoint
— FunctionstoreCheckPoint(p::Phylip)
Stores the phylip object into a .ckp file so a user does not have to repeat the phylip parsing again if working with the same dataset next time. By default it creates a .ckp file that has the same filename as the input phylip file. File size can get pretty large...
Mandatory argument
p
Phylip object
PhyNEST.readCheckPoint
— FunctionreadCheckPoint(ckpfile::AbstractString)
Reads in .ckp file and creates a filled in phylip object.
Mandatory argument
ckpfile
Name of the checkpoint file
PhyNEST.GetTrueProbsSymm
— FunctionGetTrueProbsSymm(myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64)
GetTrueProbsSymm(myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64,alpha::Float64)
Computes true site pattern probabilities for the symmetric quartet tree: e.g., ((1,2),(3,4));
. The fifteen quartet site pattern probabilities are returned in the order of:
- AAAA
- AAAB
- AABA
- AABB
- AABC
- ABAA
- ABAB
- ABAC
- ABBA
- BAAA
- ABBC
- CABC
- BACA
- BCAA
- ABCD
Three speciation times (node ages) in coalescent unit and theta must be provided; alpha is assumed to be 4/3 if unspecified. See the manuscript and/or Chifman and Kubatko (2015)[10.1016/j.jtbi.2015.03.006] for more information.
Mandatory arguments
myt1
Speciation time for the common ancestor of species 1 and 2 in coalescent unitmyt2
Speciation time for the common ancestor of species 3 and 4 in coalescent unitmyt3
Root node age in coalescent unittheta
Effective population size parameteralpha
set dafault=4/3
PhyNEST.GetTrueProbsAsymm
— FunctionGetTrueProbsAsymm(myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64)
GetTrueProbsAsymm(myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64,alpha::Float64)
Computes true site pattern probabilities for the asymmetric quartet tree: e.g., (1,(2,(3,4)));. The fifteen quartet site pattern probabilities are returned in the order of:
- AAAA
- AAAB
- AABA
- AABB
- AABC
- ABAA
- ABAB
- ABAC
- ABBA
- BAAA
- ABBC
- CABC
- BACA
- BCAA
- ABCD
Three speciation times (node ages) in coalescent unit and theta must be provided; alpha is assumed to be 4/3 if unspecified. See the manuscript and/or Chifman and Kubatko (2015)[10.1016/j.jtbi.2015.03.006] for more information.
Mandatory arguments
myt1
Speciation time for the common ancestor of species 3 and 4 in coalescent unitmyt2
Speciation time for the common ancestor of species 2 and (3,4) in coalescent unitmyt3
Root node age in coalescent unittheta
Effective population size parameteralpha
set dafault=4/3
```
PhyNEST.GetTrueProbsAsymmTypes
— FunctionGetTrueProbsAsymmTypes(type::Integer,myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64)
GetTrueProbsAsymmTypes(type::Integer,myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64,alpha::Float64)
Computes true site pattern probabilities for any of the four the asymmetric quartet trees e.g.,:
- Type 1: (1,((2,3),4));
- Type 2: ((1,(2,3)),4);
- Type 3: (1,(2,(3,4)));
- Type 4: (((1,2),3),4);
The fifteen quartet site pattern probabilities are returned in the order of:
- AAAA
- AAAB
- AABA
- AABB
- AABC
- ABAA
- ABAB
- ABAC
- ABBA
- BAAA
- ABBC
- CABC
- BACA
- BCAA
- ABCD
Three speciation times (node ages) in coalescent unit and theta must be provided; alpha is assumed to be 4/3 if unspecified. See the manuscript and/or Chifman and Kubatko (2015)[10.1016/j.jtbi.2015.03.006] for more information.
Mandatory arguments
type
Type of the asymmetric quartet in interest. See above.myt1
Speciation time for the common ancestor of species 3 and 4 in coalescent unitmyt2
Speciation time for the common ancestor of species 2 and (3,4) in coalescent unitmyt3
Root node age in coalescent unittheta
Effective population size parameteralpha
set dafault=4/3
```
PhyNEST.sim_sp_freq
— Functionsim_sp_freq(type::Integer,myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64)
sim_sp_freq(type::Integer,myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64,alpha::Float64,length::Integer)
Generates site pattern frequencies for the five possible quartet topologies modeled as a multinomial random variable under the assumption that the observed sites are independent, conditional on the species tree. The five topologies, for X/in{1,2,3,4} for e.g., are:
- Type 0: ((1,2),(3,4));
- Type 1: (1,((2,3),4));
- Type 2: ((1,(2,3)),4);
- Type 3: (1,(2,(3,4)));
- Type 4: (((1,2),3),4);
The fifteen quartet site pattern frequencies are returned in the order of:
- AAAA
- AAAB
- AABA
- AABB
- AABC
- ABAA
- ABAB
- ABAC
- ABBA
- BAAA
- ABBC
- CABC
- BACA
- BCAA
- ABCD
Mandatory arguments
type
Type of the asymmetric quartet in interest. See above.myt1
Speciation time for the common ancestor of species 3 and 4 in coalescent unitmyt2
Speciation time for the common ancestor of species 2 and (3,4) in coalescent unitmyt3
Root node age in coalescent unittheta
Effective population size parameteralpha
set dafault=4/3length
set dafault=1000000
```
PhyNEST.phyne!
— Functionphyne!(starting_topology::HybridNetwork,p::Phylip,outgroup::String;
hmax=1::Integer,
maximum_number_of_steps=250000::Integer,
maximum_number_of_failures=100::Integer,
number_of_itera=1000::Integer,
number_of_runs=10::Integer,
nniST=false::Bool,
do_hill_climbing=true::Bool,
number_of_burn_in=25::Integer,
k=10::Integer,
cons=0.5::Float64,
alph=0.5::Float64,
filename=""::AbstractString)
phyne!
is fine. phyne!
executes function initiate_search(args)
.
Estimate the species network (or tree if hmax=0
) using maximum composite likelihood. The search begins from the starting_topology
which can be either estimated or randomly generated. Starting topology can be either tree or a network with <=hamx
. Outgroup taxon must be specified to root the network.
Mandatory arguments
starting_topology
Starting topology inHybridNetwork
object created using the functionreadTopology
p
Sequence alignment parsed inPhylip
object using the functionreadPhylip
outgroup
Name of the outgroup taxa
Optional arguments
Generic
hmax (dafault=1)
Maximum number of hybridizations to be included in the final networkdo_hill_climbing (default=true)
When=true
, network is searched using hill climbing and when=false
, it searches using simulated annealingnumber_of_runs (default=10)
Number of independent runs
Network space search
maximum_number_of_steps (default=250000)
Maximum number of steps before the search terminatesmaximum_number_of_failures (default=100)
Maximum number of consecutive failures (i.e., rejecting the proposed topology) before the search teminatesnniST (default=false)
Conducts NNI operation on the specified starting topology if set astrue
Optimization
number_of_itera (default=1000)
For simulated annealing
number_of_burn_in (default=25)
k (default=10)
Specifies the number of best networks to be stored at each runcons (default=0.5)
alph (default=0.5)
Output
filename (default="")
Specifies the name of the output file. If unspecified, it will usePhyNEST_hc
orPhyNEST_sa
depending on the heuristic method applied.- The best network estimated throughout the entire runs is written in the extended Newick format and stored in file with an extension
.out
.
First line in the file is readable by PhyloNetworks
that can be visualized using PhyloPlots
. Third (last) line is the identical network but readable by DendroScope
.
- Specifics of the network search from each independent run is stored in the file with an extension
.log
.
It summarizes how many of each network 'moves' were made during the search, the reason for terminating the search (e.g., researched maximum number of steps, reached maximum number of consecutive failures, or else), and the final network estimated and its composite likelihood. When do_hill_climbing=true
, a single best network is selected for each run, but when do_hill_climbing=false
it prints k
best networks visited.
PhyNEST.Dstat
— FunctionDstat(p::Phylip, outgroup::String;
alpha=0.05::Float64,
displayall=false::Bool,
writecsv=false::Bool,
filename=""::AbstractString)
Conducts Patterson's D-statistic test. The result prints site pattern frequencies ABAB and ABBA used to compute the D-statistic, Z-score, and the p-value for each quartet tested. Significance is marked with an asterisk. Function showall(df)
can be subsequently used to show all rows.
Mandatory arguments
p
ThePhylip
objectoutgroup
Name of the outgroup taxa
Optional arguments
alpha (default=0.05)
Alpha level for significancedisplay_all (default=false)
If set astrue
, the function print test results for every quartet. By default, it only prints those quartets where the signficance was found.writecsv (default=false)
Iftrue
, the result is exported and stored in.csv
file in the working directoryfilename
Specifies.csv
file name ifwritecsv=true
. If unspecified, the result is stored asDstat-out.csv
Example
julia> p=readPhylip("sample_n4h1.txt")
julia> df=Dstat(p,"4")
Tip: if neccessary, use showall(df) function to see all the rows.
2×10 DataFrame
Row │ outgroup taxa1 taxa2 taxa3 ABAB ABBA Dstat Zscore pvalue significance
│ String String String String Int64 Int64 Float64 Float64 Float64 String
─────┼──────────────────────────────────────────────────────────────────────────────────────────
1 │ 4 3 2 1 1427 7852 0.692424 66.6995 0.0 *
2 │ 4 1 2 3 1427 7836 0.691892 66.5908 0.0 *
julia> df=Dstat(p,"4",display_all=true)
Tip: if neccessary, use showall(df) function to see all the rows.
6×10 DataFrame
Row │ outgroup taxa1 taxa2 taxa3 ABAB ABBA Dstat Zscore pvalue significance
│ String String String String Int64 Int64 Float64 Float64 Float64 String
────┼─────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 4 3 1 2 7852 1427 -0.692424 -66.6995 1.0
2 │ 4 3 2 1 1427 7852 0.692424 66.6995 0.0 *
3 │ 4 1 3 2 7836 1427 -0.691892 -66.5908 1.0
4 │ 4 1 2 3 1427 7836 0.691892 66.5908 0.0 *
5 │ 4 2 3 1 7836 7852 0.00101989 0.127743 0.449176
6 │ 4 2 1 3 7852 7836 -0.00101989 -0.127743 0.550824
PhyNEST.showallDF
— FunctionshowallDF(df::DataFrame)
Print all rows of the DataFrame object using the package CSV
.
PhyNEST.HyDe
— FunctionHyDe(p::Phylip, outgroup::AbstractString;
alpha=0.05::Float64,
display_all=true::Bool, #filter
map=""::AbstractString,
writecsv=false::Bool,
filename=""::AbstractString
Conducts HyDe: Hybrid Detection using Phylogenetic invariants. See Blischak et al., (2018) (https://doi.org/10.1093/sysbio/syy023) and the manual (https://hybridization-detection.readthedocs.io) for more information. This function replicates run_hyde.py
in the original python package. The map file, a two-column table with individual names in the first column and the name of the population that it belongs to in the second column (see example below) is not necessary to start the analysis. If the map file is not provided, each sequnece in the data is assumed to represent distinct species.
Map file is a simple text file where each line contains the name of the sequencea and the assignment of the sequence to a species, delimited by a tab. See an example below.
Mandatory arguments
p
ThePhylip
objectoutgroup
Name of the outgroup taxa. Even when there are muliple outgroup individuals, but if their assignment to the species is provided in themap
file, simply specify one of the outgroup species as listed in the alignment.
Optional arguments
alpha (default=0.05)
Alpha level for significancedisplay_all (default=false)
If set astrue
, results are also filtered based on whether there is significant evidence for hybridization.map (default=no map file)
Specify a map file, if available.writecsv (default=false)
Iftrue
, the result is stored in.csv
file in the working directoryfilename
Specifies.csv
file name ifwritecsv=true
. If unspecified, the result is stored asHyDe-out.csv
Example
julia> HyDe(p,"5",display_all=true)
24×11 DataFrame
Row │ outgroup P1 Hybrid P2 AABB ABAB ABBA Gamma Zscore Pvalue significance
│ String String String String Int64 Int64 Int64 Float64 Float64 Float64 String
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 5 4 3 1 18168 2573 2611 0.00243076 0.528406 0.298609
2 │ 5 4 3 2 23742 2022 2064 0.00192997 0.657666 0.255376
3 │ 5 4 1 2 23703 2069 2069 0.0 -99999.9 1.0
4 │ 5 3 1 2 8005 8057 1991 0.9915 -0.412067 0.659855
5 │ 5 4 1 3 18168 2611 2573 -0.00244861 -99999.9 1.0
6 │ 5 3 4 1 2573 18168 2611 0.49939 -216.327 1.0
7 │ 5 3 1 4 2573 2611 18168 1.00245 -0.527118 0.700944
8 │ 5 1 4 3 2611 18168 2573 0.50061 -216.327 1.0
9 │ 5 1 3 4 2611 2573 18168 0.997569 0.528406 0.298609
10 │ 5 4 2 3 23742 2064 2022 -0.00194121 -99999.9 1.0
11 │ 5 3 4 2 2022 23742 2064 0.499516 -339.45 1.0
12 │ 5 3 2 4 2022 2064 23742 1.00194 -0.656395 0.744215
13 │ 5 2 4 3 2064 23742 2022 0.500484 -339.45 1.0
14 │ 5 2 3 4 2064 2022 23742 0.99807 0.657666 0.255376
15 │ 5 4 2 1 23703 2069 2069 0.0 -99999.9 1.0
16 │ 5 1 4 2 2069 23703 2069 0.5 -336.307 1.0
17 │ 5 1 2 4 2069 2069 23703 NaN 0.0 0.5
18 │ 5 2 4 1 2069 23703 2069 0.5 -336.307 1.0
19 │ 5 2 1 4 2069 2069 23703 NaN 0.0 0.5
20 │ 5 3 2 1 8005 1991 8057 0.502152 47.6571 0.0 *
21 │ 5 1 3 2 8057 8005 1991 1.00872 -99999.9 1.0
22 │ 5 1 2 3 8057 1991 8005 0.497848 47.6571 0.0 *
23 │ 5 2 3 1 1991 8005 8057 -0.00872191 -0.408534 0.658559
24 │ 5 2 1 3 1991 8057 8005 0.00849951 -0.412067 0.659855
julia> HyDe(p,"5",display_all=false)
2×11 DataFrame
Row │ outgroup P1 Hybrid P2 AABB ABAB ABBA Gamma Zscore Pvalue significance
│ String String String String Int64 Int64 Int64 Float64 Float64 Float64 String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 5 3 2 1 8005 1991 8057 0.502152 47.6571 0.0 *
2 │ 5 1 2 3 8057 1991 8005 0.497848 47.6571 0.0 *
julia> HyDe(p,"5",map="map.txt",display_all=false)
Map file [map.txt] provided.
2×11 DataFrame
Row │ outgroup P1 Hybrid P2 AABB ABAB ABBA Gamma Zscore Pvalue significance
│ String String String String Int64 Int64 Int64 Float64 Float64 Float64 String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
1 │ sp5out sp3 sp2 sp1 15841 3418 15909 0.501365 49.4337 0.0 *
2 │ sp5out sp1 sp2 sp3 15909 3418 15841 0.498635 49.4337 0.0 *
where the map.txt
used in above example is:
shell> cat map.txt
5 sp5out
4 sp5out
3 sp3
1 sp1
2 sp2
PhyNEST.cct
— Functionfunction cct(pvals::AbstractArray; weights=Float64[]::Array)
A function to perform the Cauchy combination test. It takes a list of p-values and a list of weights and return the global p-value. Example below shows how the p-values computed from the function HyDe
can be directly used for cct
See [https://github.com/rhaque62/pyghdet] for more information. This function is the PhyNEST
implementation of the Cauchy combination test proposed in Haque and Kubatko (2023) [https://www.biorxiv.org/content/biorxiv/early/2023/02/27/2023.02.24.529943.full.pdf], a method that consider hybridization and coalescence in a unified framework that can detect whether there are any hybrid species in a given set of species. Based on this global test of hybridization, one can decide whether a tree or network analysis is appropriate for a given data set.
PhyNEST.cmc
— Functionfunction cmc(pvals::Array)
A function to perform the CMC test. It takes a list of p-values and return the global p-value.
Types
PhyNEST.Phylip
— TypePhylip
Subtype of abstract INPUT
type with the following attributes:
filename
Name of the input phylip alignment filetime
Time taken for parsing the input in secondsnumtaxa
Number of taxa given at the first line of the input fileseqleng
Sequence length given at the first line of the input filenametaxa
Sequence names given in the input filecounttaxa
Unique integer identifier given to each individual in the order of appearance in the input fileallquartet
All combinations of quartets for counttaxaindex
Just convert allquartet elements into arbitrary index numbersspcounts
Arrays of 15 site pattern frequencies for each quartet in allquartet
PhyNEST.quartets
— Typequartets
Subtype of abstract Quartet
type with the following attributes:
number
List of individuals quartets in the order of 1,2,3,4displayed_tree
nth displayed tree that the quartet was extracted fromquartet
List of quartets in the order of 1,2,3,4 using the leaf numbers in HybridNetworktquartet
List of quartets in the order of 1,2,3,4 using the leaf numbers in Phylipgamma
Inhertiance probability information provided in HybridNetworkmspcountsNET
We can directly use this counts for likelihood calculation.mrca
List of common ancesters of two taxa in the order of 1 and 2 (12),13,14,23,24,and 34ntau
Unique number of the taus in the tau used in branchlengthsmomestlength
Branch length identified for each quartet given the mspcounts using moment estimatoraverage_mom_est_bl
Branch length that is averaged for the entire tree/network - Will get filled later because theta is requiredsymtype
Type of each quartet. It can be either symmetric (type 0) or asymmetric. Asymmetric quartets have four possible topologies:- Type 0: ((1,2),(3,4));
- Type 1: (1,((2,3),4));
- Type 2: ((1,(2,3)),4);
- Type 3: (1,(2,(3,4)));
- Type 4: (((1,2),3),4);
logLik
It's just a negative likelihood for the quartet in interest
PhyNEST.Network
— TypeNetwork
Subtype of abstract Quartet
type with the following attributes:
leafname
Name of the leaves in the order of appearance in the input newickleafnumber
Assigned number for each leaf when reading in the input newickgamma
Gamma as appears on the newick, although irrelevant for our computationtheta
Estimated theta using the site pattern frequencies. If Network object is
created without Phylip, theta is not estimated byt default theta=0.001 is displayed.quartet
An array of all quartets in the given topology
Index
PhyNEST.Network
PhyNEST.Phylip
PhyNEST.quartets
PhyNEST.Dstat
PhyNEST.GetTrueProbsAsymm
PhyNEST.GetTrueProbsAsymmTypes
PhyNEST.GetTrueProbsSymm
PhyNEST.HyDe
PhyNEST.PhylipFileInfo
PhyNEST.cct
PhyNEST.cmc
PhyNEST.getUniqueQuartets
PhyNEST.greet
PhyNEST.phyne!
PhyNEST.readCheckPoint
PhyNEST.readPhylip
PhyNEST.readPhylipFile
PhyNEST.show_sp
PhyNEST.showallDF
PhyNEST.sim_sp_freq
PhyNEST.sitePatternCounts
PhyNEST.sitePatternsToDF
PhyNEST.spRearrange
PhyNEST.storeCheckPoint
PhyNEST.write_sp