The Command Line & Some Tricks

Almost all metagenomic tools are executed from the command line so, unsurprisingly, learning how to use the command line was the first (and essential!) step for me to be able to analyze metagenomic data. Not only does the command line allow you to write and execute scripts, but it also saves time and frustration when viewing and moving through large fasta/fastq files.

For those reading, I am executing all commands through the Mac Terminal on OS X version 10.9.

How to Begin:

1. To access the command line, go to Spotlight and search for “Terminal”.

Once in the terminal, you can use general commands like cd, ls, cp, mv, and mkdir to move through directories and move files, much like you would in the Finder on the Mac.
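Here is a minimal sketch of those commands in action. The directory and file names (project, reads.fasta, backup.fasta) are made up for illustration, and everything happens in a throwaway temporary directory so no real files are touched:

```shell
# Work in a temporary directory (hypothetical sandbox for this demo).
workdir=$(mktemp -d)
cd "$workdir"

mkdir project                          # make a new directory
printf '>seq1\nACGT\n' > reads.fasta   # create a tiny example fasta file
cp reads.fasta backup.fasta            # copy a file
mv reads.fasta project/                # move a file into the new directory
ls                                     # list what is in the current directory
cd project                             # move into the new directory
ls                                     # list what is in it
```

After this runs, backup.fasta sits in the temporary directory and reads.fasta has moved into project.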

Some of my favorite commands that are very useful to view large fasta/fastq files without actually opening them in a text editor are:

more <filename>: opens the file you request directly in the terminal, without needing a text editor. You can then scroll through the file one line at a time using “enter” or one screen at a time using the space bar. To exit before reaching the end of the file, press “q” (Ctrl-C also works).

head -n <n> <filename>: returns the first n lines of your file

tail -n <n> <filename>: returns the last n lines of your file

example: head -n 10 myfile.fasta will return the first 10 lines of your file (the shorthand head -10 myfile.fasta does the same)
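To see both commands side by side, here is a small sketch. It builds a hypothetical 20-line file of numbers (standing in for a real fasta file) so the result is easy to check by eye:

```shell
# Build a small example file containing the numbers 1 to 20, one per line.
demo=$(mktemp)
seq 1 20 > "$demo"

head -n 3 "$demo"    # prints the first 3 lines: 1, 2, 3
tail -n 3 "$demo"    # prints the last 3 lines: 18, 19, 20
```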

A neat trick to analyze subsets of your fasta files easily is to use the head command and “>” to generate a new file. Remember that fasta files are often formatted as sets of 2 lines per sequence (this holds only when sequences are not wrapped across several lines). The first line is the header/description of the sequence and the second is the sequence. Therefore:

head -2000 myfile.fasta > 1000seqs_myfile.fasta

will make a new file of the first 2000 lines (equivalent to 1000 sequences) of myfile.fasta.
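The head trick relies on every sequence taking exactly two lines. If your sequences are wrapped across several lines, one possible alternative is to count header lines with awk instead. This is only a sketch: the input file, its three records, and the variable n are all invented for the example.

```shell
# Hypothetical input: 3 records, the second one wrapped over two lines.
fasta=$(mktemp)
printf '>seq1\nACGT\n>seq2\nAC\nGT\n>seq3\nTTTT\n' > "$fasta"

# Keep the first n records: count each header line as it goes by,
# stop as soon as the count exceeds n, and print everything before that.
n=2
subset=$(mktemp)
awk -v n="$n" '/^>/ { count++ } count > n { exit } { print }' "$fasta" > "$subset"

cat "$subset"
```

Here the subset keeps seq1 and seq2 in full (five lines), even though seq2 spans two sequence lines.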

To search for specific information in my fasta files, I like to use the grep command.  It is much faster than opening the fasta file and using the “Find” tool in the text editor.  My favorite command to execute using grep is to count the number of sequences I have in a fasta file.  For example:

grep '>' myfile.fasta | wc -l

will search myfile.fasta for the “>” character and then count the number of lines on which it appears (the “wc -l” command), thus returning how many sequences are in your fasta file.
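grep can also do the counting itself with its -c flag, and anchoring the pattern with “^” restricts the match to lines that start with “>”. A small sketch on a made-up two-sequence file:

```shell
# Hypothetical fasta file with two sequences.
fasta=$(mktemp)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$fasta"

# -c prints the number of matching lines; ^> matches only header lines.
grep -c '^>' "$fasta"
```

For this file the command prints 2.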

Finally, the sed and awk commands are extremely useful for editing fasta and fastq headers (as well as other text editing). They require a bit more explanation for implementation so I will refer you here to learn more about these commands. A neat trick using the awk command is turning a fastq into a fasta file. Please note that this script depends on the first few characters of your fastq header, to avoid confusion with the “@” symbol that can also appear in the quality values of the fastq sequences. In the following example, all headers in my fastq file started with @D4Z:

awk '/^@D4Z/{gsub(/^@/,">",$1);print;getline;print}' filename.fastq
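If your headers do not share a common prefix, another possible approach is to pick lines by position instead. This sketch assumes a strict four-line fastq layout (header, sequence, “+”, quality) with no wrapped sequences; the two example reads are invented, and the first one has a quality line that starts with “@” on purpose:

```shell
# Hypothetical fastq file with two reads, four lines per read.
fastq=$(mktemp)
printf '@D4Z:1\nACGT\n+\n@III\n@D4Z:2\nGGCC\n+\nIIII\n' > "$fastq"

# Line 1 of every 4 is the header: swap the leading @ for >.
# Line 2 of every 4 is the sequence: print it unchanged.
# Lines 3 and 4 (the + line and the qualities) are skipped.
fasta=$(mktemp)
awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }' "$fastq" > "$fasta"

cat "$fasta"
```

Because lines are selected by position, the “@” at the start of the quality line is never mistaken for a header.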
