The past few months I have been working with a microbial ecology toolkit called mothur. So far, it is the most flexible tool I have found to calculate distance matrices from 16S alignments and, subsequently, cluster these sequences into OTUs. Many people have asked me about how to use mothur so this post will serve as a tutorial.
If you want to process your reads from the FASTQ stage, the people at mothur have put together some nice tutorials for 454 and Illumina reads. However, if you already have your own quality filtering pipeline that has generated a FASTA file with sequences that you want to cluster into OTUs, I will guide you on how to jump into the mothur pipeline without having to start from the very beginning.
1) Downloading mothur – Developers guide for downloading and running mothur
The subsequent steps are described as if you are running the mothur program. For shell scripting of mother see here.
2) Performing an alignment in mothur
When working with 16S sequences, the initial alignment is arguably the most important step in the entire analysis. Picking the “right’ alignment method has been discussed in a variety of publications yet, despite the variety of methods available, a pairwise alignment to a reference 16S database is by far the most used alignment method and the method implemented in mothur using the align.seqs() command. In order to run align.seqs(), you will first need to download an aligned, 16S database such as this GreenGenes alignment. Once you have this databse, begin mothur. At this point your terminal should look like this:
from here you will enter: align.seqs(fasta=your_fasta_file, template=16S_database_file). In executing this command, mothur will go through every sequence in your_fasta_file, find the best match (think BLAST) in the16S_database_file. After the best match is identified, a pairwise alignment between these two sequences is performed. Gaps are added, post pairwise alignments, to the sequences using the NAST algorithm so that this alignment is compatible with the alignment of the 16S database you provided. After this step is complete your alignment will be written to an *.align file.
Notably, the pairwise aligner in mothur may not be the best tool for your study and I would recommend reading the “Caveat emptor” section in the mothur manual and this Google Group discussion to help you decide what alignment methods is right for you.
3) Calculating a distance matrix
Irregardless of whether or not you decide to use the align.seqs() command as your aligner, you can proceed to generating a distance matrix from your alignment using the dist.seqs(fasta=alignment_fasta) command. The “distance” is a user defined metric how dissimilar 2 sequences are based on mismatches and gaps in the alignment. In plain english, it is the number of mismatches and gaps divided by the total number of nucleotides (nt) in the shorter sequence. Notably, dist.seqs() has a lot of flexibility on how you want to penalize gaps between sequences in your alignment file.
To illustrate some of the various options, we will use the same sequences outlined here:
Sequence1 ATGCATGCATGC Sequence2 ---CAAGTA---
Under default settings, a string of consecutive gaps is given the same penalty as a single gap. Under this rule, when the shorter Sequence2 is compared to Sequence1 the alignment has 2 gaps, 2 mismatches, and 4 matches which mothur would define this as a distance of 4/8 or 0.50.
If you want to give a higher penalty to gaps in an alignment, I would recommend mothur’s “eachgap” option. In this option, there are 6 gaps, 2 mismatches and 4 matches equivalent to a distance of 8/12 or 0.67. On the other hand, if you did not want to count the gaps on the ends of Sequence2 (perhaps you did not think they were biologically meaningful), the option “countends=F” calculate the alignment as 2 mismatches and 4 matches — distance of 2/6 or 0.33.
4) Clustering OTUs
Once you have completed the distance matrix in step 3, you can then cluster OTUs. If you are just jumping in at points (2) or (3) on this list, you will be unable to use mothur’s hcluster() command until you have generated either a name file or count table. However, don’t stress, these files can be quickly made using in python.
To make a *.names file, just simply follow these commands on python:
from Bio import SeqIO input = open('myfastafile','r') namesfile = open('namesfile','w') for rec in SeqIO.parse(input,'fasta'): namesfile.write(rec.id+'\t'+rec.id+'\n') input.close() namesfile.close()
The count table is just as easy to make once you understand the format (see here). In the settings for hcluster(), I would recommend using average linkage clustering (default) and fill in the options as follow: hcluster(column=’*.dist’, name=’*names’,cutoff=0.15). If you are working with a large 16S dataset, use the cluster.split() command instead and the maximum distance “cutoff” helps reduce the computation time. When hcluster() finishes, you will have 3 files: 1) a *list file that indicates what sequences belong to each OTU, 2) a *rabund file that tells you how many sequences belong to each OTU (hint: this is perfect for the vegan package in R), 3) an *sabund file which tells you how many sequences are in your largest OTU and the number of singleton, doubleton, tripleton etc OTUs.