This stage includes phylogenomics stages.

Phylo Quick-Start

HAPHPIPE includes three modules for phylogenomics: multiple_align, model_test, and build_tree. These three modules are sufficient to turn your consensus and/or haplotype sequences from the other modules into a phylogenetic tree! For purposes of this quick-start guide, we will demonstrate the modules to create a tree from HIV pol consensus sequences.

Step 1: Alignment

After running either of the assembly pipelines, final.fna files will be located in directories named ./<SampleID>/haphpipe_assemble_0[1|2]. For the multiple_align module, we need to create a list of all of these directories. We can do so easily with one command (shown for haphpipe_assemble_01 output:

ls -d ./SRR*/haphpipe_assemble_01 > ./dir_list.txt

Now, we will align all of these final.fna files:

haphpipe multiple_align --dir_list dir_list.txt --ref_gtf refs/HIV_B.K03455.HXB2.gtf 

The output will be located in a new directory, hp_multiple_align. The alignment of pol sequences is the file alignment_region00.fasta.

Step 2: Model Selection

Now, we will use the model_test module to determine the best-fit evolutionary model for our data. This is an input to the tree building module. We will use this command to generate best-fit models available in RAxML:

haphpipe model_test --seqs hp_multiple_align/alignment_region00.fasta --run_id alignment_region00 --template raxml

The ModelTest output will be written to a file called modeltest_results.out and a summary of all the best models will be written to modeltest_results_summary.tsv. Examples of both are below.

ModelTest-NG Output
    --------------------------------------------------------------------------------
    ModelTest-NG vx.y.z

    Input data:
      MSA:        multiple_align/alignment.fasta
      Tree:       Maximum likelihood
        file:           -
      #taxa:            6
      #sites:           1975
      #patterns:        114
      Max. thread mem:  0 MB

    Output:
      Log:           /var/folders/lv/dqdkd8957_3fv6yxsyfsvn0r0000gn/T/tmpHP_model_testud4sc4ya/samp12_modeltest_results.log
      Starting tree: /var/folders/lv/dqdkd8957_3fv6yxsyfsvn0r0000gn/T/tmpHP_model_testud4sc4ya/samp12_modeltest_results.tree
      Results:       /var/folders/lv/dqdkd8957_3fv6yxsyfsvn0r0000gn/T/tmpHP_model_testud4sc4ya/samp12_modeltest_results.out

    Selection options:
      # dna schemes:      11
      # dna models:       88
      include model parameters:
        Uniform:         true
        p-inv (+I):      true
        gamma (+G):      true
        both (+I+G):     true
        free rates (+R): false
        fixed freqs:     true
        estimated freqs: true
        #categories:     4
      gamma rates mode:   mean
      asc bias:           none
      epsilon (opt):      0.01
      epsilon (par):      0.05
      keep branches:      false

    Additional options:
      verbosity:        very low
      threads:          1/6
      RNG seed:         12345
      subtree repeats:  enabled
    --------------------------------------------------------------------------------

    BIC       model              K            lnL          score          delta    weight
    --------------------------------------------------------------------------------
           1  HKY                4     -5380.7535     10860.1551         0.0000    0.7049
           2  TrN                5     -5378.2013     10862.6391         2.4839    0.2036
           3  TPM1uf             5     -5379.8699     10865.9763         5.8211    0.0384
           4  TPM3uf             5     -5380.7175     10867.6716         7.5164    0.0164
           5  HKY+G4             5     -5380.8440     10867.9246         7.7694    0.0145
           6  HKY+I              5     -5381.4917     10869.2200         9.0648    0.0076
           7  TIM3               6     -5378.1637     10870.1523         9.9971    0.0048
           8  TrN+G4             6     -5378.3069     10870.4387        10.2836    0.0041
           9  TPM2uf+G4          6     -5378.9908     10871.8064        11.6512    0.0021
          10  TrN+I              6     -5379.0181     10871.8610        11.7059    0.0020
    --------------------------------------------------------------------------------
    Best model according to BIC
    ---------------------------
    Model:              HKY
    lnL:                -5380.7535
    Frequencies:        0.3695 0.1709 0.2219 0.2377
    Subst. Rates:       1.0000 2.4741 1.0000 1.0000 2.4741 1.0000 
    Inv. sites prop:    -
    Gamma shape:        -
    Score:              10860.1551
    Weight:             0.7049
    ---------------------------
    Parameter importances
    ---------------------------
    P.Inv:              0.0100
    Gamma:              0.0216
    Gamma-Inv:          0.0002
    Frequencies:        1.0000
    ---------------------------
    Model averaged estimates
    ---------------------------
    P.Inv:              0.0215
    Alpha:              94.2337
    Alpha-P.Inv:        91.7960
    P.Inv-Alpha:        0.0214
    Frequencies:        0.3700 0.1702 0.2226 0.2372 

    Commands:
      > phyml  -i multiple_align/alignment.fasta -m 010010 -f m -v 0 -a 0 -c 1 -o tlr
      > raxmlHPC-SSE3 -s multiple_align/alignment.fasta -c 1 -m GTRCATX -n EXEC_NAME -p PARSIMONY_SEED
      > raxml-ng --msa multiple_align/alignment.fasta --model HKY
      > paup -s multiple_align/alignment.fasta
      > iqtree -s multiple_align/alignment.fasta -m HKY

    AIC       model              K            lnL          score          delta    weight
    --------------------------------------------------------------------------------
           1  TrN                5     -5378.2013     10784.4026         0.0000    0.2246
           2  TIM2+G4            7     -5376.4972     10784.9944         0.5919    0.1671
           3  TIM3               6     -5378.1637     10786.3274         1.9249    0.0858
           4  TIM2+I             7     -5377.2570     10786.5141         2.1115    0.0782
           5  TrN+G4             6     -5378.3069     10786.6138         2.2113    0.0744
           6  TIM1+G4            7     -5377.3549     10786.7098         2.3073    0.0709
           7  HKY                4     -5380.7535     10787.5069         3.1044    0.0476
           8  TPM1uf             5     -5379.8699     10787.7398         3.3372    0.0423
           9  TPM2uf+G4          6     -5378.9908     10787.9815         3.5790    0.0375
          10  TrN+I              6     -5379.0181     10788.0362         3.6336    0.0365
    --------------------------------------------------------------------------------
    Best model according to AIC
    ---------------------------
    Model:              TrN
    lnL:                -5378.2013
    Frequencies:        0.3720 0.1679 0.2249 0.2352
    Subst. Rates:       1.0000 2.2008 1.0000 1.0000 3.1065 1.0000 
    Inv. sites prop:    -
    Gamma shape:        -
    Score:              10784.4026
    Weight:             0.2246
    ---------------------------
    Parameter importances
    ---------------------------
    P.Inv:              0.1262
    Gamma:              0.3943
    Gamma-Inv:          0.0384
    Frequencies:        1.0000
    ---------------------------
    Model averaged estimates
    ---------------------------
    P.Inv:              0.0216
    Alpha:              93.2595
    Alpha-P.Inv:        94.4193
    P.Inv-Alpha:        0.0216
    Frequencies:        0.3718 0.1684 0.2238 0.2359 

    Commands:
      > phyml  -i multiple_align/alignment.fasta -m 010020 -f m -v 0 -a 0 -c 1 -o tlr
      > raxmlHPC-SSE3 -s multiple_align/alignment.fasta -c 1 -m GTRCATX -n EXEC_NAME -p PARSIMONY_SEED
      > raxml-ng --msa multiple_align/alignment.fasta --model TrN
      > paup -s multiple_align/alignment.fasta
      > iqtree -s multiple_align/alignment.fasta -m TrN

    AICc      model              K            lnL          score          delta    weight
    --------------------------------------------------------------------------------
           1  TrN                5     -5378.2013     10784.4026         0.0000    0.2246
           2  TIM2+G4            7     -5376.4972     10784.9944         0.5919    0.1671
           3  TIM3               6     -5378.1637     10786.3274         1.9249    0.0858
           4  TIM2+I             7     -5377.2570     10786.5141         2.1115    0.0782
           5  TrN+G4             6     -5378.3069     10786.6138         2.2113    0.0744
           6  TIM1+G4            7     -5377.3549     10786.7098         2.3073    0.0709
           7  HKY                4     -5380.7535     10787.5069         3.1044    0.0476
           8  TPM1uf             5     -5379.8699     10787.7398         3.3372    0.0423
           9  TPM2uf+G4          6     -5378.9908     10787.9815         3.5790    0.0375
          10  TrN+I              6     -5379.0181     10788.0362         3.6336    0.0365
    --------------------------------------------------------------------------------
    Best model according to AICc
    ---------------------------
    Model:              TrN
    lnL:                -5378.2013
    Frequencies:        0.3720 0.1679 0.2249 0.2352
    Subst. Rates:       1.0000 2.2008 1.0000 1.0000 3.1065 1.0000 
    Inv. sites prop:    -
    Gamma shape:        -
    Score:              10784.4026
    Weight:             0.2246
    ---------------------------
    Parameter importances
    ---------------------------
    P.Inv:              0.1262
    Gamma:              0.3943
    Gamma-Inv:          0.0384
    Frequencies:        1.0000
    ---------------------------
    Model averaged estimates
    ---------------------------
    P.Inv:              0.0216
    Alpha:              93.2595
    Alpha-P.Inv:        94.4193
    P.Inv-Alpha:        0.0216
    Frequencies:        0.3718 0.1684 0.2238 0.2359 

    Commands:
      > phyml  -i multiple_align/alignment.fasta -m 010020 -f m -v 0 -a 0 -c 1 -o tlr
      > raxmlHPC-SSE3 -s multiple_align/alignment.fasta -c 1 -m GTRCATX -n EXEC_NAME -p PARSIMONY_SEED
      > raxml-ng --msa multiple_align/alignment.fasta --model TrN
      > paup -s multiple_align/alignment.fasta
      > iqtree -s multiple_align/alignment.fasta -m TrN
    Done

ModelTest-NG Output Summary
    File    Criteria    Best Model
    hp_multiple_align/alignment.fasta   BIC HKY
    hp_multiple_align/alignment.fasta   AIC TrN
    hp_multiple_align/alignment.fasta   AICc    TrN


Step 3: Build a Tree

Now, we will use build_tree_NG to build our tree! You should use the best model outputted in model_test for the --model argument (here we are using GTR). The --all option will automatically run a full maximum likelihood & bootstrapping analysis for us:

haphpipe build_tree_NG --seqs hp_multiple_align/alignment_region00.fasta --all --model GTR

The output will be written to a new directory, hp_build_tree. The best tree file from RAxML will be outputted as hp_tree.raxml.support. This tree can then be annotated in programs such as FigTree or iTOL.

Phylogenomics Pipelines

For users who would like to build a full pipeline to run assembly and phylogenetics stages in one go, we recommend adapting the demo pipeline (haphpipe_demo) for this purpose. See the demo page for more details.

multiple_align

Align consensus sequences using MAFFT (documentation). Input can be a list of directories which contain final.fna and/or ph_haplotypes.fna files or a fasta file, or both (in which case the sequences in the FASTA file are combined with the final.fna and/or ph_haplotypes.fna files retreived before the alignment. Sequences will be separated by amplicons using a supplied GTF file before alignment (unless the --alignall option is specified). This module may also be used to separate files by amplicons (without aligning) by specifying the --fastaonly option. Alignments are by default outputted as FASTA files, although PHYLIP (--phylipout) or CLUSTAL (--clustalout) output options are also available. Many options from MAFFT are available in this module. Please refer to the MAFFT documentation above for information about these options.

Usage:

haphpipe multiple_align [MAFFT OPTIONS] [HAPHPIPE OPTIONS] --seqs <FASTA> --dir_list <TXT> --ref_gtf <GTF> [--outdir]

(or):

hp_multiple_align [MAFFT OPTIONS] [HAPHPIPE OPTIONS] --seqs <FASTA> --dir_list <TXT> --ref_gtf <GTF> [--outdir]

Output files: alignment files in FASTA format (default), one per amplicon (or one alignment.fasta file if using --alignall option)

Note: MAFFT stores intermediate files in a temporary directory located in /tmp. More information is available here.

Input/Output Arguments:

Option Description
--seqs SEQS FASTA file with sequences to be aligned
--dir_list DIR_LIST List of directories which include either a final.fna or ph_haplotypes.fna file, one on each line
--ref_gtf REF_GTF Reference GTF file
--out_align OUT_ALIGN Name for alignment file
--nuc Assume nucleotide (default: False)
--amino Assume amino (default: False)
--clustalout Clustal output format (default: False)
--phylipout PHYLIP output format (default: False)
--inputorder Output order same as input (default: False)
--reorder Output order aligned (default: False)
--treeout Guide tree is output to the input.tree file (default:False)
--quiet_mafft Do not report progress (default: False)
--outdir OUTDIR Output directory

MAFFT Options:

Option Description
--algo ALGO Use different algorithm in command: linsi, ginsi, einsi, fftnsi, fftns, nwns, nwnsi
--auto Automatically select algorithm (default: False)
--sixmerpair Calculate distance based on shared 6mers, on by default (default: False)
--globalpair Use Needleman-Wunsch algorithm (default: False)
--localpair Use Smith-Waterman algorithm (default: False)
--genafpair Use local algorithm with generalized affine gap cost (default: False)
--fastapair Use FASTA for pairwise alignment (default: False)
--weighti WEIGHTI Weighting factor for consistency term
--retree RETREE Number of times to build guide tree
--maxiterate MAXITERATE Number of cycles for iterative refinement
--noscore Do not check alignment score in iterative alignment (default: False)
--memsave Use Myers-Miller algorithm (default: False)
--parttree Use fast tree-building method with 6mer distance (default: False)
--dpparttree Use PartTree algorithm with distances based on DP (default: False)
--fastaparttree Use PartTree algorithm with distances based on FASTA (default: False)
--partsize PARTSIZE Number of partitions for PartTree
--groupsize GROUPSIZE Max number of sequences for PartTree

MAFFT Parameters:

Option Description
--lop LOP Gap opening penalty
--lep LEP Offset value
--lexp LEXP Gap extension penalty
--LOP LOP Gap opening penalty to skip alignment
--LEXP LEXP Gap extension penalty to skip alignment
--bl BL BLOSUM matrix: 30, 45, 62, or 80
--jtt JTT JTT PAM number >0
--tm TM Transmembrane PAM number >0
--aamatrix AAMATRIX Path to user-defined AA scoring matrix
--fmodel Incorporate AA/nuc composition info into scoring matrix (default: False)

Options:

Option Description
--ncpu NCPU Number of CPU to use (default: 1)
--quiet Do not write output to console (silence stdout and stderr) (default: False)
--logfile LOGFILE Name for log file (output)
--debug Print commands but do not run (default: False)
--fastaonly Output fasta files separated by region but do not align (default: False)
--alignall Do not separate files by region, align entire file (default: False)

Example usage:

haphpipe multiple_align --dir_list demo_sra_list.txt --ref_gtf HIV_B.K03455.HXB2.gtf --phylipout --logfile demo_multiple_align.log

model_test

Select the best-fit model of evolution from an alignment file using ModelTest-NG (documentation). Input is an alignment in FASTA or PHYLIP format. Output is ModelTest-NG results (text file) containing information for the best performing models.

Usage:

haphpipe model_test [ModelTest-NG OPTIONS] [HAPHPIPE OPTIONS] --seqs <FASTA> [--outdir]

(or):

hp_model_test [ModelTest-NG OPTIONS] [HAPHPIPE OPTIONS] --seqs <FASTA> [--outdir]

Output files: ModelTest-NG output file (modeltest_results.out).

Input/Output Arguments:

Option Description
--seqs SEQS Alignment in FASTA or PHYLIP format
--outname Name for output file
--outdir OUTDIR Output directory

ModelTest-NG Options:

Option Description
--data_type Data type: nt or aa (default: nt)
--partitions Partitions file
--seed Seed for random number generator
--topology TOPOLOGY Starting topology: ml, mp, fixed-ml-jc, fixed-ml-gtr, fixed-mp, random, or user (default: ml)
--utree User-defined starting tree
--force Force output overriding (default: False)
--asc_bias ASC_BIAS Ascertainment bias correction: lewis, felsenstein, or stamatakis
--frequencies FREQUENCIES Candidate model frequencies: e (estimated) or f (fixed)
--het HET Set rate heterogeneity: u (uniform), i (invariant sites +I), g (gamma +G), or f (bothinvariant sites and gamma +I+G)
--models MODELS Text file with candidate models, one per line
--schemes SCHEMES Number of predefined DNA substitution schemes evaluated: 3, 5, 7, 11, or 203
--template TEMPLATE Set candidate models according to a specified tool: raxml, phyml, mrbayes, or paup

Options:

Option Description
--ncpu NCPU Number of CPU to use (default: 1)
--quiet Do not write output to console (silence stdout and stderr) (default: False)
--logfile LOGFILE Name for log file (output)
--debug Print commands but do not run (default: False)
--keep_tmp Keep temporary directory (default: False)

Example usage:

haphpipe model_test --seqs multiple_align/alignment.fasta

build_tree_NG

Phylogeny reconstruction with RAxML-NG (documentation). Input is an alignment (FASTA or PHYLIP format). Output is a tree file. Please see the RAxML-NG documentation for a full description of RAxML-NG options.

Usage:

haphpipe build_tree_NG [RAxML OPTIONS] [HAPHPIPE OPTIONS] --seqs <FASTA> --output_name <TXT> [--outdir]

(or):

hp build_tree_NG [RAxML OPTIONS] [HAPHPIPE OPTIONS] --seqs <FASTA> --output_name <TXT> [--outdir]

Output files:

File Description
%PREFIX.raxml.bestTree Outputs the best-scoring Maximum Likelihood tree
%PREFIX.raxml.bestPartitionTrees Best-scoring ML tree for each partition
%PREFIX.raxml.bestModel Optimized parameters for the highest-scoring ML tree
%PREFIX.raxml.bootstraps Trees generated for every bootstrap replicate
%PREFIX.raxml.bootstrapMSA..phy Bootstrap replicate alignments
%PREFIX.raxml.ckp Contains last log record if RAxML-NG has not finished successfully
%PREFIX.raxml.consensusTree Consensus tree: estimates support for each clade of the final tree
%PREFIX.raxml.log Screen log
%PREFIX.raxml.mlTrees Maximum Likelihood trees for each starting tree
%PREFIX.raxml.startTree The starting trees for each Maximum Likelihood inference
%PREFIX.raxml.support Best-scoring Maximum Likelihood tree with bootstrap values
%PREFIX.raxml.terrace Trees residing on a terrace (same likelihood or parsimony score) in compressed Newick form
%PREFIX.raxml.terraceNewick Trees residing on a terrace in multi-line standard Newick form

Input/Output Arguments:

Option RAxML-NG Equivalent Description
--seqs SEQS --msa Input alignment in PHYLIP or FASTA format
--output_name NAME --prefix Run name for trees (default: build_tree.tre)
--outdir OUTDIR --prefix Output directory (default: .)

RAxML-NG Options:

Option RAxML Equivalent-NG Description
--model MODEL --model Substitution model OR path to partition file
--all --all Run bootstrap search and find best ML tree
--branch_length BRANCH_LENGTH --brlen Specify branch linkage model
--consense --consense Build a consensus tree
--rand_tree RAND_TREE --tree rand Start tree option: start from a random topology
--pars_tree PARS_TREE --tree pars Start tree option: parsimony-based randomized stepwise addition algorithm
--user_tree USER_TREE --tree Start tree option: upload custom tree in Newick format
--search --search Predefined start tree: 10 random and 10 parsimony in v0.8.0 and later
--search_1random --search1 Predefined start tree: 1 random
--constraint_tree CONSTRAINT_TREE --tree-constraint Specify topological constraint tree
--outgroup OUTGROUP --outgroup Outgroup to root inferred tree
--bsconverge --bsconverge Posteriori bootstrap convergence test
--bs_msa --bsmsa Bootstrap replicate alignments
--bs_trees BS_TREES --bs-trees Number of bootstrap trees OR autoMRE
--bs_tree_cutoff BS_TREE_CUTOFF --bs-cutoff Specify bootstopping cutoff value
--bs_metric BS_METRIC --bs-metric Compare bootstrap support values
--bootstrap --bootstrap Non-parametric bootstrap analysis
--check --check Alignment sanity check
--log LOG --log Options for output verbosity
--loglh --loglh Compute and print the likelihood of the tree(s) without optimization or generating output files
--terrace TERRACE --terrace Check if a tree is on a phylogenetic terrace.

Options:

Option Description
--version check RAxML-NG version
--seed SEED Seed for random numbers
--redo Run even if there are existing files with the same name
--keep_tmp Keep temporary directory
--quiet Do not write output to console (silence stdout and stderr) (default: False)
--logfile LOGFILE Name for log file (output)
--debug Print commands but do not run (default: False)
--ncpu NCPU Number of CPU to use (default: 1)

Example usage:

haphpipe build_tree_NG --all --seqs hp_alignments/alignment.fasta --model GTR