MetaVelvet

In my last post, I described what a De Bruijn Graph assembler is; here I will give a short tutorial on how to begin using MetaVelvet.  MetaVelvet is an extension of Velvet, the popular single-genome De Bruijn Graph assembler, optimized to handle the varying coverage and diversity of genomes in metagenomic samples.  It is executed in three steps: velveth, velvetg, and meta-velvetg.
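As a sketch of those three steps (the k-mer length, file names, and insert length below are placeholders; tune them to your own library):

```shell
# 1. velveth hashes the reads into k-mers (here k = 51)
velveth assembly_dir 51 -fastq -shortPaired reads.fastq

# 2. velvetg builds the De Bruijn graph, tracking reads for the next step
velvetg assembly_dir -exp_cov auto -read_trkg yes

# 3. meta-velvetg partitions the graph by coverage peaks and scaffolds
meta-velvetg assembly_dir -ins_length 260
```

Assembled contigs and scaffolds end up inside `assembly_dir` once the third step completes.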


De Bruijn Graph Assembly

When our lab got its first metagenomic dataset, the first thing we did was upload our QC-filtered and merged paired-end Illumina reads (mean length 160 bp) to MG-RAST for annotation.  However, when the annotations came back, some organisms whose genomes were known to be present in both our sample and the m5nr reference dataset were missing, and for those sequences that were annotated, the e-values centered around 1e-10.  In order to improve the annotation of our data, we decided to perform an assembly.  Searching the literature, I found that a class of assemblers called De Bruijn Graph assemblers was the popular choice for assembling short-read metagenomic data; however, the intuition behind how these assemblers work was a little less clear.
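To make that intuition concrete, here is a toy sketch in plain Python (not any particular assembler) of the core idea: chop every read into overlapping k-mers, then connect each k-mer's prefix to its suffix. Contigs correspond to paths walked through the resulting graph.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes it connects to."""
    graph = defaultdict(list)
    for read in reads:
        # slide a window of length k across the read
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Two overlapping toy "reads"; a real dataset has millions
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_edges(reads, 4)
print(dict(graph))
```

Note how the overlap between the two reads shows up as repeated edges (CGT appears twice as a prefix), which is exactly the coverage signal assemblers like Velvet exploit.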

Alignment with Bowtie (and Bowtie2)

For those who only know a bow tie as something worn by hipsters or really fancy people: Bowtie is a very powerful bioinformatics tool with a diverse array of applications.  Most often I use Bowtie to map RNA transcripts back to a known genome; however, you can also use Bowtie to assess how well your assembly performed, or in any instance where you want to find how many of your high-throughput sequences map back to a [longer] sequence or genome of interest.

What makes Bowtie special is that it requires little RAM (it can easily run on your laptop) and is very fast, or, as the creators of Bowtie declare, ultrafast (aligning more than 25 million reads to the human genome in 1 CPU hour).
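A minimal run looks something like this (file names and the index prefix are placeholders; Bowtie2 is shown, and the classic Bowtie commands are analogous):

```shell
# Build the index once from your reference (a genome, or your assembly)
bowtie2-build genome.fasta genome_index

# Align single-end reads; alignments are written in SAM format
bowtie2 -x genome_index -U reads.fastq -S alignments.sam
```

The summary Bowtie2 prints at the end (overall alignment rate) is the quick answer to "how many of my reads map back?".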

Python for biologists

Python is a great programming language for processing genomic data.  Instead of wasting mindless hours copying, pasting, and clicking through Excel spreadsheets, this easy-to-learn language has given me an avenue to write my own scripts to quickly organize, analyze, and process large genomic datasets.  Some of the reasons I love Python are:

1. BioPython Package – Within this package lie the tools to easily manipulate and process your sequence data.  Time and time again I have found myself needing to count the number of bases, find the lengths of the sequences, etc. in a sequence file.  Bio.SeqIO can do just that.  SeqIO.parse is a function built into the BioPython package that iterates over every sequence in a fasta file (or another genomic format you may need).  Check out the example in the following code.  Here, I can open a fasta file containing thousands of sequences and retrieve the sequence name and the number of adenines in each sequence in a matter of seconds, without ever leaving the terminal!
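A minimal version of that script might look like this (the fasta here is built in memory so the snippet is self-contained; point SeqIO.parse at your own file path instead):

```python
from io import StringIO
from Bio import SeqIO

# A tiny in-memory fasta stands in for a real file; replace the
# StringIO handle with a path like "my_reads.fasta" for real data.
fasta = StringIO(">seq1\nACGTA\n>seq2\nAAAT\n")

# SeqIO.parse yields one SeqRecord per fasta entry
a_counts = {}
for record in SeqIO.parse(fasta, "fasta"):
    a_counts[record.id] = record.seq.count("A")

print(a_counts)  # sequence name -> number of adenines
```

The same loop scales to thousands of records, and `record.seq` supports `len()`, slicing, and other counts just as easily.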


Additions for your terminal

Something I realized when I began to work with the command line is that there are a few extra additions that can be installed to make everything run smoothly within the terminal (Mac OS X 10.9).

1. Xcode 5.0 - This is an integrated development environment made by Apple that provides a nice space to write and debug code.  More importantly, Xcode contains the Command Line Tools needed to properly compile and install packages you may need in the future.  Note that you may need an Apple Developer ID to access the download site.

2. Xcode’s Command Line Tools - This package can be added via Xcode or directly downloaded from here.  Note that you may need an Apple Developer ID to access the download site.

3. MacPorts - Allows for easy installation of over 17,000 packages (e.g. emacs, SciPy, NumPy) you may end up needing, using a simple <sudo port install packagename> command.

4. Python - Python is a programming language extremely good at handling sequence data.  Most Macs are already set up with Python version 2.7.  To test whether you have Python, simply open the terminal and type “python”:

python

The Command Line & Some Tricks

Almost all metagenomic tools are executed from the command line so, unsurprisingly, learning how to use the command line was the first (and essential!) step for me to be able to perform analysis on metagenomic data.  Not only does the command line allow you to write and execute scripts, but it also saves time and frustration when viewing and moving through large fasta/fastq files.

For those reading along, I am executing all commands through the Mac Terminal on OS X version 10.9.
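As a taste, here are a few one-liners (using a throwaway fasta file; the file name is arbitrary) that beat scrolling through a spreadsheet:

```shell
# Make a small fasta file to practice on
printf '>seq1\nACGT\n>seq2\nTTGA\n' > example.fasta

# Count the sequences (each header line starts with ">")
grep -c '>' example.fasta

# Peek at the first record without opening the whole file
head -n 2 example.fasta
```

`grep`, `head`, `less`, and `wc` cover most day-to-day inspection of fasta/fastq files, no matter how large.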


What we know so far…


A bacterial biofilm growing on nutrient rich fracture water 1.3 km below the surface in Beatrix Gold Mine, South Africa. NOTE: This image does not represent the bacterial communities I sample.

In order to study life kilometers below the earth’s surface, subterranauts travel underground through deep mine shafts around the globe.  These scientists collect and analyze fracture waters that have been locked away for thousands of years, completely removed from the sun.  The deepest and most well-studied mines are located in South Africa.  Scientists studying these deep sites use a “prawn” or similar device attached to a borehole to sample water that hides meters beyond the mine’s walls.

One of the most recognized discoveries of deep subsurface research is the unprecedented identification of “an ecosystem of one”.  Here, scientists performed a metagenomic study of a 2.8 km deep fracture water community and found that a novel bacterium, Candidatus Desulforudis audaxviator, accounted for >99.9% of the microbial community (Chivian et al., 2008).  In order to survive on its own, the genome of D. audaxviator reveals…

Welcome!

The subterranaut (pronounced: sub·terrain·not) was inspired by my work on high throughput sequence data of the deep terrestrial subsurface.

Coming from a background in biochemistry, I realized that there was a steep learning curve to begin working with high throughput sequence data.  Raw data files were larger than my iTunes library, and scanning through endless rows of an Excel file felt like a hopeless task.  I knew there must be a better way to handle and analyze my large datasets.  This blog contains the steps I took to understand my sequence data.  Many of the posts are direct answers to the wide range of questions I had as I began to enter this field.  I hope others can use this blog as a roadmap to ease their transition from working in the lab to working on high throughput sequence data.

I hope you enjoy!