Python for biologists

Python is a great programming language for processing genomic data.  Instead of wasting mindless hours copying, pasting and clicking through Excel spreadsheets, I can use this easy-to-learn language to write my own scripts that quickly organize, analyze, and process large genomic datasets.  Some of the reasons I love Python are:

1. BioPython Package – Within this package lie the tools to easily manipulate and process your sequence data.  Time and time again I have found myself needing to count the number of bases, find the lengths of the sequences, etc. in a sequence file.  Bio.SeqIO can do just that.  SeqIO.parse is a function built into the BioPython package that iterates over all the sequences in a FASTA file (or whichever other sequence format you need), so you can apply the same operation to each one.  With it, I can open a FASTA file containing thousands of sequences and retrieve the sequence name and the number of adenines in each sequence in a matter of seconds, without ever leaving the terminal!
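As a rough illustration of the adenine count described above, here is a minimal sketch.  It is not the code from the full post: the function names count_adenines and adenines_in_fasta and the file name sequences.fasta are made up for this example.  The only BioPython call it relies on is SeqIO.parse(handle, "fasta").

```python
def count_adenines(records):
    """Yield (sequence id, number of adenines) for each record.

    Works on any iterable of objects with .id and .seq attributes,
    such as the SeqRecord objects produced by Bio.SeqIO.parse.
    """
    for record in records:
        # Upper-case first so lower-case (soft-masked) bases are counted too.
        yield record.id, str(record.seq).upper().count("A")


def adenines_in_fasta(path):
    """Parse a FASTA file with BioPython and count adenines per sequence."""
    # Imported here so count_adenines can be reused without BioPython installed.
    from Bio import SeqIO

    with open(path) as handle:
        # list(...) finishes the counting before the file is closed.
        return list(count_adenines(SeqIO.parse(handle, "fasta")))


# Example use (hypothetical file name):
# for name, n_adenines in adenines_in_fasta("sequences.fasta"):
#     print("{}\t{}".format(name, n_adenines))
```

Keeping the counting step separate from the file parsing means the same helper can be pointed at other formats that SeqIO.parse understands by changing only the format string.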


Additions for your terminal

Something I realized when I began to work with the command line is that there are a few additions that can be installed to make everything run smoothly within the terminal (Mac OS X 10.9).

1. Xcode 5.0 – This is an integrated development environment made by Apple that provides a nice space to write and debug code.  More importantly, Xcode contains the Command Line Tools that are needed to properly compile and install packages you may need in the future.  Note that you may need to have an Apple Developer ID to access the download site.

2. Xcode’s Command Line Tools – This package can be added via Xcode or downloaded directly from Apple’s developer download site.  Note that you may need to have an Apple Developer ID to access the download site.

3. MacPorts – allows for easy installation of over 17,000 packages (e.g. emacs, SciPy, NumPy) you may end up needing, using a simple <sudo port install packagename> command.

4. Python – Python is a programming language that is extremely good at handling sequence data.  Most Macs are already set up to use Python version 2.7.  To test whether you have Python, simply open the terminal and type “python”.
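Once the interpreter starts, a couple of lines are enough to confirm which Python you are actually running.  This is only a small sketch using the standard library; the helper name describe_interpreter is made up for this example.

```python
import platform
import sys


def describe_interpreter():
    """Return a short string naming the running Python version and its location."""
    return "Python {} at {}".format(platform.python_version(), sys.executable)


print(describe_interpreter())
```

If MacPorts or another installer has added a second copy of Python, the path in the output shows which one the terminal picked up.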


The Command Line & Some Tricks

Almost all metagenomic tools are executed from the command line so, unsurprisingly, learning how to use the command line was the first (and essential!) step for me to be able to analyze metagenomic data.  Not only does the command line allow you to write and execute scripts, but it also saves time and frustration when viewing and moving through large FASTA/FASTQ files.

For those reading, I am executing all commands through the Mac Terminal on OS X version 10.9.


What we know so far…


A bacterial biofilm growing on nutrient-rich fracture water 1.3 km below the surface in Beatrix Gold Mine, South Africa. NOTE: This image does not represent the bacterial communities I sample.

In order to study life kilometers below the earth’s surface, subterranauts travel underground through deep mine shafts around the globe.  These scientists collect and analyze fracture waters that have been locked away for thousands of years, completely removed from the sun.  The deepest and most well-studied mines are located in South Africa.  Scientists who study these deep sites use a “prawn” or similar device (see below) attached to a borehole to sample water that hides meters beyond the mine’s walls.

One of the most recognized discoveries of deep subsurface research is the unprecedented identification of “an ecosystem of one”.  Here, scientists performed a metagenomic study on a 2.8 km deep fracture water community and found that a novel bacterium, Candidatus Desulforudis audaxviator, accounted for >99.9% of the microbial community (Chivian et al., 2008).  In order to survive on its own, the genome of D. audaxviator reveals …


The subterranaut (pronounced: sub·terrain·not) was inspired by my work on high-throughput sequence data of the deep terrestrial subsurface.

Coming from a background in biochemistry, I realized that there was a steep learning curve to begin working with high-throughput sequence data.  Raw data files were larger than my iTunes library, and scanning through endless rows in an Excel file felt like a hopeless task.  I knew that there must be a better way to handle and analyze my large datasets.  This blog contains the steps I took to understand my sequence data.  Many of the posts are direct answers to the wide range of questions that I had as I began to enter this field.  I hope others can use this blog as a roadmap to ease their transition from working in the lab to working on high-throughput sequence data.

I hope you enjoy!