PhyNEST

Documentation for a Julia package PhyNEST: Phylogenetic Network Estimation using SiTe patterns.


Tutorials

A step-by-step wiki tutorial for the major functions in PhyNEST is available.

Reference

Sungsik Kong, David Swofford, and Laura Kubatko (2022) Inference of Phylogenetic Networks from Sequence Data using Composite Likelihood. Preprint available online on BioRxiv at https://doi.org/10.1101/2022.11.14.516468.

Getting help

Please use google group to report bugs or post questions and/or suggestions.

Functions

PhyNEST.greetFunction
greet()

Displays a greeting with citation information. No argument needed.

Example

julia> greet()
Thank you for using PhyNEST: Phylogenetic Network Estimation using SiTe patterns.
Please report bugs or make suggestions to https://groups.google.com/g/phynest-users.
If you conduct an analysis using PhyNEST, please cite:
Sungsik Kong, David Swofford, and Laura Kubatko (2022) Inference of Phylogenetic Networks from Sequence Data using Composite Likelihood.
Preprint available online on BioRxiv at https://doi.org/10.1101/2022.11.14.516468.
source
PhyNEST.readPhylipFunction
readPhylip(inputfile::AbstractString; 
            writecsv=false::Bool,
            csvname=""::AbstractString,
            showProgress=true::Bool,
            tdigits=3::Integer,
            checkpoint=false::Bool)

Import, read, and parse the input phylip file. File name must be specified as string. By default, progress bar is displayed, and .csv and .ckp files are NOT produced. See optional arguments below and modify if needed.

Mandatory argument

  • inputfile Name of the phylip file as a string

Optional arguments

  • writecsv (default=false) A boolean arguement to allow writing site pattern frequencies in .csv file
  • csvname (default=sitePatternCounts_inputfile.csv) A string that will be name of the .csv file
  • showProgress (default=true) A boolean argument for visualizing the progress of alignment parsing
  • tdigits (default=3) Number of decimal points to record timme taken to parse the file in seconds
  • checkpoint (default=false) A boolean argument to store the Phylip object as a .ckp file. (Warning: .ckp file can be large)
source
PhyNEST.readPhylipFileFunction
readPhylipFile(inputfile::AbstractString,
                writecsv::Bool,
                csvname::AbstractString,
                showProgress::Bool)

Exceuted while running the function readPhylip. Fills in the attributes in the Phylip object, except for the Phylip.time.

source
PhyNEST.PhylipFileInfoFunction
PhylipFileInfo(inputfile::AbstractString, 
                p::Phylip, 
                showProgress::Bool)

Function that shortens?summarizes? the sequence alignment into two matrices:

  • The one with unique sites and
  • Another with how many times each column in the previous matrix occurs throughout the alignment.
source
PhyNEST.getUniqueQuartetsFunction
getUniqueQuartets(p::Phylip)

Getting some combinations for four sequences (quartets). This will result in n! quartets where n is the number of sequences. I believe a better way to do this exists.

source
PhyNEST.sitePatternCountsFunction
sitePatternCounts(p::Phylip,
                ppbase::Array,
                counts::Array)

Computes observed site pattern frequencies from UniqueBase and BaseCounts obtained from the function PhylipFileInfo.

source
PhyNEST.spRearrangeFunction
spRearrange(p::Phylip)

Rearranges the elemetns of each unique quartet obtained using getUniqueQuartets, and also suffles the site pattern frequencies accordingly. This prevents 24 redundant computations, saving computation time.

source
PhyNEST.show_spFunction
show_sp(p::Phylip)

Pretty name for displaying all quartet and the corresponding site pattern frequencies on screen in the DataFrame format. It may result in excessively long table when there are many sequences in the input file.

Mandatory argument

  • p Phylip object
source
PhyNEST.sitePatternsToDFFunction
sitePatternsToDF(p::Phylip)

Extracts the quartet and site pattern information from the Phyliip object, and reorganizes them in the DataFrame format. This function does all hard work work for show_sp.

source
PhyNEST.write_spFunction
write_sp(p::Phylip; 
        csvname="PhyNEST_sp"::AbstractString)

Pretty name for exporting all quartet and the corresponding site pattern frequencies in a .csv format.

Mandatory argument

  • p Phylip object

Optional argument

  • csvname Filename for the .csv output can be given, otherwise we use PhyNEST_sp.csv by default.
source
PhyNEST.storeCheckPointFunction
storeCheckPoint(p::Phylip)

Stores the phylip object into a .ckp file so a user does not have to repeat the phylip parsing again if working with the same dataset next time. By default it creates a .ckp file that has the same filename as the input phylip file. File size can get pretty large...

Mandatory argument

  • p Phylip object
source
PhyNEST.readCheckPointFunction
readCheckPoint(ckpfile::AbstractString)

Reads in .ckp file and creates a filled in phylip object.

Mandatory argument

  • ckpfile Name of the checkpoint file
source
PhyNEST.GetTrueProbsSymmFunction
GetTrueProbsSymm(myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64)
GetTrueProbsSymm(myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64,alpha::Float64)

Computes true site pattern probabilities for the symmetric quartet tree: e.g., ((1,2),(3,4));. The fifteen quartet site pattern probabilities are returned in the order of:

  • AAAA
  • AAAB
  • AABA
  • AABB
  • AABC
  • ABAA
  • ABAB
  • ABAC
  • ABBA
  • BAAA
  • ABBC
  • CABC
  • BACA
  • BCAA
  • ABCD

Three speciation times (node ages) in coalescent unit and theta must be provided; alpha is assumed to be 4/3 if unspecified. See the manuscript and/or Chifman and Kubatko (2015)[10.1016/j.jtbi.2015.03.006] for more information.

Mandatory arguments

  • myt1 Speciation time for the common ancestor of species 1 and 2 in coalescent unit
  • myt2 Speciation time for the common ancestor of species 3 and 4 in coalescent unit
  • myt3 Root node age in coalescent unit
  • theta Effective population size parameter
  • alpha set dafault=4/3
source
PhyNEST.GetTrueProbsAsymmFunction
GetTrueProbsAsymm(myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64)
GetTrueProbsAsymm(myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64,alpha::Float64)

Computes true site pattern probabilities for the asymmetric quartet tree: e.g., (1,(2,(3,4)));. The fifteen quartet site pattern probabilities are returned in the order of:

  • AAAA
  • AAAB
  • AABA
  • AABB
  • AABC
  • ABAA
  • ABAB
  • ABAC
  • ABBA
  • BAAA
  • ABBC
  • CABC
  • BACA
  • BCAA
  • ABCD

Three speciation times (node ages) in coalescent unit and theta must be provided; alpha is assumed to be 4/3 if unspecified. See the manuscript and/or Chifman and Kubatko (2015)[10.1016/j.jtbi.2015.03.006] for more information.

Mandatory arguments

  • myt1 Speciation time for the common ancestor of species 3 and 4 in coalescent unit
  • myt2 Speciation time for the common ancestor of species 2 and (3,4) in coalescent unit
  • myt3 Root node age in coalescent unit
  • theta Effective population size parameter
  • alpha set dafault=4/3

```

source
PhyNEST.GetTrueProbsAsymmTypesFunction
GetTrueProbsAsymmTypes(type::Integer,myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64)
GetTrueProbsAsymmTypes(type::Integer,myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64,alpha::Float64)

Computes true site pattern probabilities for any of the four the asymmetric quartet trees e.g.,:

  • Type 1: (1,((2,3),4));
  • Type 2: ((1,(2,3)),4);
  • Type 3: (1,(2,(3,4)));
  • Type 4: (((1,2),3),4);

The fifteen quartet site pattern probabilities are returned in the order of:

  • AAAA
  • AAAB
  • AABA
  • AABB
  • AABC
  • ABAA
  • ABAB
  • ABAC
  • ABBA
  • BAAA
  • ABBC
  • CABC
  • BACA
  • BCAA
  • ABCD

Three speciation times (node ages) in coalescent unit and theta must be provided; alpha is assumed to be 4/3 if unspecified. See the manuscript and/or Chifman and Kubatko (2015)[10.1016/j.jtbi.2015.03.006] for more information.

Mandatory arguments

  • type Type of the asymmetric quartet in interest. See above.
  • myt1 Speciation time for the common ancestor of species 3 and 4 in coalescent unit
  • myt2 Speciation time for the common ancestor of species 2 and (3,4) in coalescent unit
  • myt3 Root node age in coalescent unit
  • theta Effective population size parameter
  • alpha set dafault=4/3

```

source
PhyNEST.sim_sp_freqFunction
sim_sp_freq(type::Integer,myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64)
sim_sp_freq(type::Integer,myt1::Float64,myt2::Float64,myt3::Float64,theta::Float64,alpha::Float64,length::Integer)

Generates site pattern frequencies for the five possible quartet topologies modeled as a multinomial random variable under the assumption that the observed sites are independent, conditional on the species tree. The five topologies, for X/in{1,2,3,4} for e.g., are:

  • Type 0: ((1,2),(3,4));
  • Type 1: (1,((2,3),4));
  • Type 2: ((1,(2,3)),4);
  • Type 3: (1,(2,(3,4)));
  • Type 4: (((1,2),3),4);

The fifteen quartet site pattern frequencies are returned in the order of:

  • AAAA
  • AAAB
  • AABA
  • AABB
  • AABC
  • ABAA
  • ABAB
  • ABAC
  • ABBA
  • BAAA
  • ABBC
  • CABC
  • BACA
  • BCAA
  • ABCD

Mandatory arguments

  • type Type of the asymmetric quartet in interest. See above.
  • myt1 Speciation time for the common ancestor of species 3 and 4 in coalescent unit
  • myt2 Speciation time for the common ancestor of species 2 and (3,4) in coalescent unit
  • myt3 Root node age in coalescent unit
  • theta Effective population size parameter
  • alpha set dafault=4/3
  • length set dafault=1000000

```

source
PhyNEST.phyne!Function
phyne!(starting_topology::HybridNetwork,p::Phylip,outgroup::String;
        hmax=1::Integer,
        maximum_number_of_steps=250000::Integer,
        maximum_number_of_failures=100::Integer,
        number_of_itera=1000::Integer,
        number_of_runs=10::Integer,
        nniST=false::Bool,
        do_hill_climbing=true::Bool,
        number_of_burn_in=25::Integer,
        k=10::Integer,
        cons=0.5::Float64,
        alph=0.5::Float64,
        filename=""::AbstractString)

phyne! is fine. phyne! executes function initiate_search(args).

Estimate the species network (or tree if hmax=0) using maximum composite likelihood. The search begins from the starting_topology which can be either estimated or randomly generated. Starting topology can be either tree or a network with <=hamx. Outgroup taxon must be specified to root the network.

Mandatory arguments

  • starting_topology Starting topology in HybridNetwork object created using the function readTopology
  • p Sequence alignment parsed in Phylip object using the function readPhylip
  • outgroup Name of the outgroup taxa

Optional arguments

Generic

  • hmax (dafault=1) Maximum number of hybridizations to be included in the final network
  • do_hill_climbing (default=true) When =true, network is searched using hill climbing and when =false, it searches using simulated annealing
  • number_of_runs (default=10) Number of independent runs

Network space search

  • maximum_number_of_steps (default=250000) Maximum number of steps before the search terminates
  • maximum_number_of_failures (default=100) Maximum number of consecutive failures (i.e., rejecting the proposed topology) before the search teminates
  • nniST (default=false) Conducts NNI operation on the specified starting topology if set as true

Optimization

  • number_of_itera (default=1000)

For simulated annealing

  • number_of_burn_in (default=25)
  • k (default=10) Specifies the number of best networks to be stored at each run
  • cons (default=0.5)
  • alph (default=0.5)

Output

  • filename (default="") Specifies the name of the output file. If unspecified, it will use PhyNEST_hc or PhyNEST_sa depending on the heuristic method applied.
  • The best network estimated throughout the entire runs is written in the extended Newick format and stored in file with an extension .out.

First line in the file is readable by PhyloNetworks that can be visualized using PhyloPlots. Third (last) line is the identical network but readable by DendroScope.

  • Specifics of the network search from each independent run is stored in the file with an extension .log.

It summarizes how many of each network 'moves' were made during the search, the reason for terminating the search (e.g., researched maximum number of steps, reached maximum number of consecutive failures, or else), and the final network estimated and its composite likelihood. When do_hill_climbing=true, a single best network is selected for each run, but when do_hill_climbing=false it prints k best networks visited.

source
PhyNEST.DstatFunction
Dstat(p::Phylip, outgroup::String;
        alpha=0.05::Float64, 
        displayall=false::Bool,
        writecsv=false::Bool, 
        filename=""::AbstractString)

Conducts Patterson's D-statistic test. The result prints site pattern frequencies ABAB and ABBA used to compute the D-statistic, Z-score, and the p-value for each quartet tested. Significance is marked with an asterisk. Function showall(df) can be subsequently used to show all rows.

Mandatory arguments

  • p The Phylip object
  • outgroup Name of the outgroup taxa

Optional arguments

  • alpha (default=0.05) Alpha level for significance
  • display_all (default=false) If set as true, the function print test results for every quartet. By default, it only prints those quartets where the signficance was found.
  • writecsv (default=false) If true, the result is exported and stored in .csv file in the working directory
  • filename Specifies .csv file name if writecsv=true. If unspecified, the result is stored as Dstat-out.csv

Example

julia> p=readPhylip("sample_n4h1.txt")
julia> df=Dstat(p,"4")
Tip: if neccessary, use showall(df) function to see all the rows.
2×10 DataFrame
 Row │ outgroup  taxa1   taxa2   taxa3   ABAB   ABBA   Dstat     Zscore   pvalue   significance
     │ String    String  String  String  Int64  Int64  Float64   Float64  Float64  String
─────┼──────────────────────────────────────────────────────────────────────────────────────────
   1 │ 4         3       2       1        1427   7852  0.692424  66.6995      0.0  *
   2 │ 4         1       2       3        1427   7836  0.691892  66.5908      0.0  *

julia> df=Dstat(p,"4",display_all=true)
Tip: if neccessary, use showall(df) function to see all the rows.
6×10 DataFrame
Row │ outgroup  taxa1   taxa2   taxa3   ABAB   ABBA   Dstat        Zscore      pvalue    significance
    │ String    String  String  String  Int64  Int64  Float64      Float64     Float64   String
────┼─────────────────────────────────────────────────────────────────────────────────────────────────
  1 │ 4         3       1       2        7852   1427  -0.692424    -66.6995    1.0
  2 │ 4         3       2       1        1427   7852   0.692424     66.6995    0.0       *
  3 │ 4         1       3       2        7836   1427  -0.691892    -66.5908    1.0
  4 │ 4         1       2       3        1427   7836   0.691892     66.5908    0.0       *
  5 │ 4         2       3       1        7836   7852   0.00101989    0.127743  0.449176
  6 │ 4         2       1       3        7852   7836  -0.00101989   -0.127743  0.550824
source
PhyNEST.showallDFFunction
showallDF(df::DataFrame)

Print all rows of the DataFrame object using the package CSV.

source
PhyNEST.HyDeFunction
HyDe(p::Phylip, outgroup::AbstractString; 
    alpha=0.05::Float64, 
    display_all=true::Bool, #filter
    map=""::AbstractString,
    writecsv=false::Bool, 
    filename=""::AbstractString

Conducts HyDe: Hybrid Detection using Phylogenetic invariants. See Blischak et al., (2018) (https://doi.org/10.1093/sysbio/syy023) and the manual (https://hybridization-detection.readthedocs.io) for more information. This function replicates run_hyde.py in the original python package. The map file, a two-column table with individual names in the first column and the name of the population that it belongs to in the second column (see example below) is not necessary to start the analysis. If the map file is not provided, each sequnece in the data is assumed to represent distinct species.

Map file is a simple text file where each line contains the name of the sequencea and the assignment of the sequence to a species, delimited by a tab. See an example below.

Mandatory arguments

  • p The Phylip object
  • outgroup Name of the outgroup taxa. Even when there are muliple outgroup individuals, but if their assignment to the species is provided in the map file, simply specify one of the outgroup species as listed in the alignment.

Optional arguments

  • alpha (default=0.05) Alpha level for significance
  • display_all (default=false) If set as true, results are also filtered based on whether there is significant evidence for hybridization.
  • map (default=no map file) Specify a map file, if available.
  • writecsv (default=false) If true, the result is stored in .csv file in the working directory
  • filename Specifies .csv file name if writecsv=true. If unspecified, the result is stored as HyDe-out.csv

Example

julia> HyDe(p,"5",display_all=true)
24×11 DataFrame
 Row │ outgroup  P1      Hybrid  P2      AABB   ABAB   ABBA   Gamma         Zscore         Pvalue    significance
     │ String    String  String  String  Int64  Int64  Int64  Float64       Float64        Float64   String
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 5         4       3       1       18168   2573   2611    0.00243076       0.528406  0.298609
   2 │ 5         4       3       2       23742   2022   2064    0.00192997       0.657666  0.255376
   3 │ 5         4       1       2       23703   2069   2069    0.0         -99999.9       1.0
   4 │ 5         3       1       2        8005   8057   1991    0.9915          -0.412067  0.659855
   5 │ 5         4       1       3       18168   2611   2573   -0.00244861  -99999.9       1.0
   6 │ 5         3       4       1        2573  18168   2611    0.49939       -216.327     1.0
   7 │ 5         3       1       4        2573   2611  18168    1.00245         -0.527118  0.700944
   8 │ 5         1       4       3        2611  18168   2573    0.50061       -216.327     1.0
   9 │ 5         1       3       4        2611   2573  18168    0.997569         0.528406  0.298609
  10 │ 5         4       2       3       23742   2064   2022   -0.00194121  -99999.9       1.0
  11 │ 5         3       4       2        2022  23742   2064    0.499516      -339.45      1.0
  12 │ 5         3       2       4        2022   2064  23742    1.00194         -0.656395  0.744215
  13 │ 5         2       4       3        2064  23742   2022    0.500484      -339.45      1.0
  14 │ 5         2       3       4        2064   2022  23742    0.99807          0.657666  0.255376
  15 │ 5         4       2       1       23703   2069   2069    0.0         -99999.9       1.0
  16 │ 5         1       4       2        2069  23703   2069    0.5           -336.307     1.0
  17 │ 5         1       2       4        2069   2069  23703  NaN                0.0       0.5
  18 │ 5         2       4       1        2069  23703   2069    0.5           -336.307     1.0
  19 │ 5         2       1       4        2069   2069  23703  NaN                0.0       0.5
  20 │ 5         3       2       1        8005   1991   8057    0.502152        47.6571    0.0       *
  21 │ 5         1       3       2        8057   8005   1991    1.00872     -99999.9       1.0
  22 │ 5         1       2       3        8057   1991   8005    0.497848        47.6571    0.0       *
  23 │ 5         2       3       1        1991   8005   8057   -0.00872191      -0.408534  0.658559
  24 │ 5         2       1       3        1991   8057   8005    0.00849951      -0.412067  0.659855

julia> HyDe(p,"5",display_all=false)
2×11 DataFrame
 Row │ outgroup  P1      Hybrid  P2      AABB   ABAB   ABBA   Gamma     Zscore   Pvalue   significance
     │ String    String  String  String  Int64  Int64  Int64  Float64   Float64  Float64  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 5         3       2       1        8005   1991   8057  0.502152  47.6571      0.0  *
   2 │ 5         1       2       3        8057   1991   8005  0.497848  47.6571      0.0  *

julia> HyDe(p,"5",map="map.txt",display_all=false)
Map file [map.txt] provided.
2×11 DataFrame
 Row │ outgroup  P1      Hybrid  P2      AABB   ABAB   ABBA   Gamma     Zscore   Pvalue   significance
     │ String    String  String  String  Int64  Int64  Int64  Float64   Float64  Float64  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ sp5out    sp3     sp2     sp1     15841   3418  15909  0.501365  49.4337      0.0  *
   2 │ sp5out    sp1     sp2     sp3     15909   3418  15841  0.498635  49.4337      0.0  *

where the map.txt used in above example is:

shell> cat map.txt
5	sp5out
4	sp5out
3	sp3
1	sp1
2	sp2
source
PhyNEST.cctFunction
function cct(pvals::AbstractArray; weights=Float64[]::Array)

A function to perform the Cauchy combination test. It takes a list of p-values and a list of weights and return the global p-value. Example below shows how the p-values computed from the function HyDe can be directly used for cct

See [https://github.com/rhaque62/pyghdet] for more information. This function is the PhyNEST implementation of the Cauchy combination test proposed in Haque and Kubatko (2023) [https://www.biorxiv.org/content/biorxiv/early/2023/02/27/2023.02.24.529943.full.pdf], a method that consider hybridization and coalescence in a unified framework that can detect whether there are any hybrid species in a given set of species. Based on this global test of hybridization, one can decide whether a tree or network analysis is appropriate for a given data set.

source
PhyNEST.cmcFunction
function cmc(pvals::Array)

A function to perform the CMC test. It takes a list of p-values and return the global p-value.

source

Types

PhyNEST.PhylipType
Phylip

Subtype of abstract INPUT type with the following attributes:

  • filename Name of the input phylip alignment file
  • time Time taken for parsing the input in seconds
  • numtaxa Number of taxa given at the first line of the input file
  • seqleng Sequence length given at the first line of the input file
  • nametaxa Sequence names given in the input file
  • counttaxa Unique integer identifier given to each individual in the order of appearance in the input file
  • allquartet All combinations of quartets for counttaxa
  • index Just convert allquartet elements into arbitrary index numbers
  • spcounts Arrays of 15 site pattern frequencies for each quartet in allquartet
source
PhyNEST.quartetsType
quartets

Subtype of abstract Quartet type with the following attributes:

  • number List of individuals quartets in the order of 1,2,3,4
  • displayed_treenth displayed tree that the quartet was extracted from
  • quartet List of quartets in the order of 1,2,3,4 using the leaf numbers in HybridNetwork
  • tquartet List of quartets in the order of 1,2,3,4 using the leaf numbers in Phylip
  • gamma Inhertiance probability information provided in HybridNetwork
  • mspcountsNET We can directly use this counts for likelihood calculation.
  • mrca List of common ancesters of two taxa in the order of 1 and 2 (12),13,14,23,24,and 34
  • ntau Unique number of the taus in the tau used in branchlengths
  • momestlength Branch length identified for each quartet given the mspcounts using moment estimator
  • average_mom_est_bl Branch length that is averaged for the entire tree/network - Will get filled later because theta is required
  • symtype Type of each quartet. It can be either symmetric (type 0) or asymmetric. Asymmetric quartets have four possible topologies:
    • Type 0: ((1,2),(3,4));
    • Type 1: (1,((2,3),4));
    • Type 2: ((1,(2,3)),4);
    • Type 3: (1,(2,(3,4)));
    • Type 4: (((1,2),3),4);
  • logLik It's just a negative likelihood for the quartet in interest
source
PhyNEST.NetworkType

Network

Subtype of abstract Quartet type with the following attributes:

  • leafname Name of the leaves in the order of appearance in the input newick
  • leafnumber Assigned number for each leaf when reading in the input newick
  • gamma Gamma as appears on the newick, although irrelevant for our computation
  • theta Estimated theta using the site pattern frequencies. If Network object is
    created without Phylip, theta is not estimated byt default theta=0.001 is displayed.
  • quartet An array of all quartets in the given topology
source

Index