Phylogenetic Analysis by Maximum Likelihood (PAML)

Introduction

PAML is a program package for phylogenetic analyses of DNA or protein sequences using maximum likelihood. It is maintained and distributed for academic use free of charge by Ziheng Yang. ANSI C source code is distributed for UNIX/Linux/Mac OS X, and executables are provided for MS Windows.

This document is about downloading and compiling PAML and getting started. See the manual (pamlDOC.pdf) for more information about running programs in the package.

Possible uses of the programs are

Estimation of branch lengths in a phylogenetic tree and parameters in the evolutionary model such as the transition/transversion rate ratio, the shape parameter of the gamma distribution for variable evolutionary rates among sites, and rate parameters for different genes;
Test of hypotheses concerning sequence evolution, such as rate constancy and independence among nucleotide or amino acid sites, rate constancy among lineages (the molecular clock), and homogeneity of evolutionary process in multiple genes;
Calculation of substitution rates at sites;
Reconstruction of ancestral nucleotide or amino acid sequences;
Simulation of nucleotide, codon, and amino acid sequence data sets;
Phylogenetic tree reconstruction by maximum likelihood and Bayesian methods (??).

A summary of the types of analyses performed by different programs in the package is given below.

baseml: ML analysis of nucleotide sequences: estimation of tree topology, branch lengths, and substitution parameters under a variety of nucleotide substitution models (JC69, K80, F81, F84, HKY85, TN93, REV); constant or gamma rates for sites; molecular clock (rate constancy among lineages) or no clock, among-gene and within-gene variation of substitution rates; models for combined analyses of multiple sequence data sets; calculation of substitution rates at sites; reconstruction of ancestral nucleotides.
basemlg: ML analysis of nucleotide sequences under the model of gamma rates among sites. The (continuous) gamma model is used with one of the following substitution models: JC69, K80, F81, F84, HKY85, TN93, and REV.
codonml (codeml with seqtype = 1): ML analysis of protein-coding DNA sequences using codon substitution models (e.g., Goldman and Yang 1994); calculation of the codon-usage table; estimation of synonymous and nonsynonymous substitution rates; likelihood ratio test of positive selection or relaxed selective constraints along lineages based on the d_N/d_S rate ratios; identification of amino acid sites or evolutionary lineages potentially under positive selection; reconstruction of ancestral codon sequences.
aaml (codeml with seqtype = 2): ML analysis of amino acid sequences under a number of amino acid substitution models (Poisson, Proportional, empirical models such as those of Dayhoff et al., Jones et al., mtREV24, and mtmam, and REV); constant or gamma-distributed rates among sites; molecular clock (rate constancy among lineages) or no clock, among-gene and within-gene variation of substitution rates; models for combined analyses of multiple gene data; calculation of substitution rates at sites; reconstruction of ancestral amino acid sequences.
pamp: Parsimony-based analyses for a given tree topology, estimation of the substitution pattern by the method of Yang and Kumar (1996); estimation of the gamma parameter for variable rates among sites by the method of moments, the method of Sullivan et al. (1995), and the method of Yang and Kumar (1996); reconstruction of ancestral character states using the algorithm of Hartigan (1973) and an unpublished "improved parsimony" method.
mcmctree: Bayesian estimation of phylogenies using DNA sequence data (Rannala and Yang, 1996; Yang and Rannala, 1997). Markov chain Monte Carlo calculation of posterior probabilities of trees. The algorithm is too slow to be usable.
evolver: This program used to be named listtree and does miscellaneous things, such as listing all rooted and unrooted trees for a given number of species, generating random trees with branch lengths from a birth-death process with species sampling, and calculating tree bipartition distances. It now also simulates nucleotide, codon, or amino acid sequence data sets. Parameters for the simulation are specified in the files MCbase.dat, MCcodon.dat, and MCaa.dat. You can run the program to see the main menu, and then consult one of those files to see the details. This program can easily fill your hard disk.
yn00: This program implements the method of Yang and Nielsen (2000) for estimating synonymous and nonsynonymous substitution rates in pairwise comparison of protein-coding DNA sequences. The method of Nei and Gojobori (1986) is also included in the program. Run yn00 and have a look at the control file yn00.ctl and the default result file yn. No further documentation is included for this program.
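
The evolver entry above mentions listing all rooted and unrooted trees for a given number of species. As a rough illustration of how quickly such a listing grows, the sketch below counts bifurcating tree topologies using the standard double-factorial formulas: (2n-5)!! unrooted trees for n >= 3 species and (2n-3)!! rooted trees for n >= 2. The function names are hypothetical and are not part of PAML.

```python
def odd_double_factorial(m):
    """Return m * (m-2) * ... * 3 * 1 for odd m; 1 for m <= 1 (empty product)."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def count_unrooted_trees(n_species):
    """Number of unrooted bifurcating tree topologies: (2n-5)!! for n >= 3."""
    if n_species < 3:
        raise ValueError("need at least 3 species for an unrooted tree")
    return odd_double_factorial(2 * n_species - 5)


def count_rooted_trees(n_species):
    """Number of rooted bifurcating tree topologies: (2n-3)!! for n >= 2."""
    if n_species < 2:
        raise ValueError("need at least 2 species for a rooted tree")
    return odd_double_factorial(2 * n_species - 3)


# Print the counts for a few data-set sizes.
for n in (4, 6, 8, 10):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

The counts pass two million unrooted trees at ten species, so it is worth checking the number before asking any program for a complete listing.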

What does PAML do?

PAML is not good for tree making. There are a few options for heuristic tree search, but they do not work well except for small data sets of only a few species. If you hope to use PAML to compare trees from relatively large data sets, one possibility is to get a collection of candidate trees and then compare them using more sophisticated models implemented in PAML. You can get candidate trees by using other programs/methods implemented in PAUP*, PHYLIP, MOLPHY etc.

PAML may be useful if you are interested in the process of sequence evolution. The two main programs, baseml and codeml, implement a number of sophisticated models, which you can use to construct likelihood ratio tests of evolutionary hypotheses. Right now, the following options/models do not seem to be available in other packages.

Codon-based likelihood analysis for estimating synonymous and nonsynonymous rates, and testing hypotheses concerning d_N/d_S rate ratios (Goldman and Yang 1994 MBE 11:725-736; Yang and Nielsen 1998 JME 46:409-418; Nielsen and Yang 1998 Genetics 148:929-936; Yang 1998 MBE 15:568-573; Yang et al. 2000 Genetics 155: 431-449).
Amino acid-based likelihood analysis with rate variation among sites.
Ancestral sequence reconstruction based on nucleotide, codon, or amino acid substitution models (Yang, Kumar, and Nei 1995 Genetics 141:1641-1650). Nucleotide sequence reconstruction is probably available in DNAML and PAUP* too.
Nucleotide-based models useful for combined analysis of multiple gene data (Yang 1996 JME 42:587-596).
Simulating nucleotide, codon, or amino acid sequence data sets.

Downloading and Compiling PAML

The ftp site for downloading is ftp://abacus.gene.ucl.ac.uk/pub/paml/. Some network setups do not allow you to visit an anonymous ftp site, so you may have to speak to your system administrator. The archive, named something like paml*.*.tar.gz, is for all platforms. It includes executables for MS Windows, and has source files that you can compile for Mac OS X and UNIX/Linux.

Windows 9x/NT/2000/XP. Download the archive. Unpack it into a folder, using Winzip, say. The Windows executables are in the folder WinEXE/. The programs are simple Win32 Console applications, and do not support mice or menus. Open a "command prompt" box (Start - Programs - Accessories - Command Prompt). Run the program by typing its name rather than by double-clicking the program name in Windows Explorer. You may have to specify the full path, as in either of the following (depending on where you put the paml folder on your hard disk):

/paml3.14/winEXE/codeml
../../paml3.14/winEXE/codeml

UNIX, linux, MAC OS X or other systems. Save the archive on the disk. Unpack it into a folder, as follows

gzip -d paml3.14.tar.gz
tar xf paml3.14.tar

Then cd to the paml folder (you have to remember where you saved the files) and again cd to the src/ folder and compile the programs.

cd src
make -f Makefile
ls -lF
rm *.o
mv baseml basemlg codeml pamp evolver yn00 chi2 ..
cd ..
baseml
codeml
evolver

You might have to open and edit the file Makefile before compiling using make. For example, you can change cc to gcc and -fast to -O3 or -O4. Also see readme.txt in the same folder for compiling instructions. You might want to mv the executables into the bin/ folder on your account rather than the paml main folder. And finally, if your current folder is not on your search path, you will have to add ./ in front of the executable file name; that is, use ./codeml instead of codeml to run codeml.

MAC OS X. You should open a command terminal (Applications-Utilities-Terminal) and then compile and run the programs from the terminal. cd to the paml folder, look at readme.txt, Makefile, or Makefile.UNIX, and then follow the UNIX instructions above. You will need the Mac Developer Toolkit, which is not included in a standard installation of OS X; without it you will get a "Command not found" error with either cc or make. So you should go to the Apple web site to download and install the Toolkit (http://developer.apple.com/tools/index.html). There are some more notes about running programs on MAC OS X or UNIX at the FAQ page.

For linux administrators, a spec file paml.spec for the linux installer rpm has been kindly prepared by Hunter Matthews.

PowerMacs (PPC or G3 prior to OS X). Since OS X is now common, I have stopped distributing executables for MACs running OS 9 or earlier. MAC executables for two old versions, 3.0a and 3.0c, are still in the OldVersions/ folder at the ftp site.

Running Programs in PAML

The programs in the distribution are essentially the copies I work on every day, as I make only minor changes before release to the public. So the programs are not always well tested. Models that I have never used myself, even if they look sensible or possible from options in the control file, should be treated with great caution. I have included example data sets that were used in our papers for the purpose of error checking. You are encouraged to duplicate our analysis first to check that the program works and also to get familiar with the format of the data file and the interpretation of results.

Programs baseml and codeml estimate parameters and calculate the log likelihood values, but do not calculate the likelihood ratio statistics. You need to do the subtraction yourself. The theory is like this. If a more-general model involves p parameters and has log likelihood l₁, and a simpler model (which is a special case of the general model) has q parameters with log-likelihood value l₀, then 2(l₁ - l₀) can be compared with a chi-square distribution with d.f. = p - q. Suppose we want to test whether the transition/transversion rate ratio kappa = 1. We run the JC69 model and get l₀, and run K80 to get l₁. Then we compare 2(l₁ - l₀) with the chi-square distribution with one degree of freedom.
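
The subtraction and the chi-square comparison described above can be sketched in a few lines of Python. This is only an illustration, not part of PAML: the function names are hypothetical, the log likelihood values in the example are made up, and the p-value routine handles integer degrees of freedom only, using the closed forms for d.f. = 1 and 2 and stepping up two degrees at a time.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q = math.erfc(math.sqrt(half))  # closed form for df = 1
        k = 1
    else:
        q = math.exp(-half)             # closed form for df = 2
        k = 2
    # Recurrence: Q(x; k+2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def likelihood_ratio_test(lnL0, lnL1, df):
    """Return (2*(lnL1 - lnL0), p-value) for nested models differing by df parameters."""
    stat = 2.0 * (lnL1 - lnL0)
    return stat, chi2_sf(stat, df)


# Made-up log likelihoods standing in for a JC69 run (l0) and a K80 run (l1).
stat, p = likelihood_ratio_test(lnL0=-1710.58, lnL1=-1704.33, df=1)
print(f"2(l1 - l0) = {stat:.2f}, d.f. = 1, p = {p:.3g}")
```

In practice you would copy l₀ and l₁ from the result files of the two runs and set df to p - q.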

Running PAML. Most programs in the PAML package have control files that specify the names of the sequence data file, the tree structure file, and models and options for the analysis. The default control files are baseml.ctl for baseml and basemlg, codeml.ctl for codeml, pamp.ctl for pamp, and mcmctree.ctl for mcmctree. The program evolver does not have a control file, and uses a simple user interface. All you do is type evolver and then choose the options. For other programs, you should prepare a sequence data file and a tree structure file, and modify the appropriate control files before running the programs. The formats of those files are detailed in the documentation in the package.

You need to prepare a sequence data file (e.g., brown.nuc) and modify the options in the appropriate control file. If you have chosen runmode = 0 or 1 in the control file, which means that the tree topologies are specified, you also need to prepare a tree structure file (e.g., trees.4s). On UNIX or Windows systems, you run the programs from a command prompt by

ProgramName [for example, baseml]

ProgramName ControlFileName

On the Mac, you simply click on the program name or icon. You can do this on a Windows machine too, but it is better if you open a command box and run the program from there.
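
If you run many analyses, the two steps above (edit an option in the control file, then run ProgramName ControlFileName) can be scripted. The following is a minimal sketch, not part of PAML. It assumes that options appear in the control file one per line as "name = value", possibly followed by a comment, and the helper names set_option and build_command are hypothetical. Check the control files shipped with the package before relying on it.

```python
import os
import re
import shutil


def set_option(ctl_text, name, value):
    """Return ctl_text with the first line 'name = old' changed to 'name = value'.

    Text after the old value on that line (e.g. a trailing comment) is kept.
    Raises KeyError if no such option line is found.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(name)}[ \t]*=[ \t]*)(\S+)(.*)$", re.MULTILINE
    )
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}{m.group(3)}", ctl_text, count=1
    )
    if count == 0:
        raise KeyError(name)
    return new_text


def build_command(program, control_file=None):
    """Build the argument list for 'ProgramName [ControlFileName]'.

    If the program is not on the search path, fall back to the current
    folder (./program), as suggested in the compiling notes.
    """
    exe = shutil.which(program) or os.path.join(os.curdir, program)
    return [exe] if control_file is None else [exe, control_file]


# Made-up control-file fragment, for illustration only.
sample = "seqtype = 1   * illustrative comment\nrunmode = 0\n"
print(set_option(sample, "runmode", 1))
print(build_command("baseml", "baseml.ctl"))
# To actually run the program:
#   import subprocess
#   subprocess.run(build_command("baseml", "baseml.ctl"), check=True)
```

Keeping an untouched copy of the default control file and writing each modified version under a new name makes it easier to see later which options produced which result file.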

PAML Resources on the web

PAML discussion group
Redhat linux rpm in the Biorpms repository
A C program for summarizing codeml results for sites under positive selection

Resources for UNIX (Mac OSX) and DOS/Windows

You need to know a few basic UNIX or DOS commands and know how to run a program from the command line. I have listed a few in the documentation included in the package. There are many little books that teach you the basics of UNIX, with titles like "UNIX basics", "Teach yourself UNIX or linux in 24 hours", etc. Get one of those from the local bookstore. There are also a lot of resources on the web. Search for "UNIX basics", for example.

Linux Tutorial from Workshop on Molecular Evolution, Woods Hole

Questions and Bug Reports

Update history and bug fixes collected here.

Please try to solve your problems by reading the manual (pamlDOC.pdf) and the paml FAQ page. Try to identify where the problem lies. You should be able to tell from the screen output whether the program reads the sequence and tree files correctly and to correct mistakes in those files.

If you can't find answers to your questions in those resources, you can post your question at the Genetics Software Forum (http://www.rannala.org/gsf/), set up by Dr Bruce Rannala. This was set up to hopefully reduce the amount of time I have to spend in answering user questions.

When reporting problems, please mention the version number, what you did and what happened. In particular copy any error message on the screen into the message. Please use an informative "Subject:" field such as "paml/baseml: oom".

Try not to send messages to me. I will try to read and answer questions at the discussion site at least once a week. Messages sent to my mail box will be delayed by 2 weeks before I try to reply, and I will ignore some messages if the answers are readily available from the documentation. Sorry for the inadequate support but I receive more messages than I can answer.

