Python for biologists

Python is a great programming language for processing genomic data.  Instead of wasting mindless hours copying, pasting and clicking through Excel spreadsheets, I can use this easy-to-learn language to write my own scripts that quickly organize, analyze, and process large genomic datasets.  Some of the reasons I love Python are:

1. BioPython Package – Within this package lie the tools to easily manipulate and process your sequence data.  Time and time again I have found myself needing to count the number of bases, find the lengths of the sequences, etc. in a sequence file.  Bio.SeqIO can do just that.  SeqIO.parse is a function in the BioPython package that iterates over all the sequences in a FASTA file (or whichever other sequence format you need), so the same operation can be applied to each one.  Check out the example in the following code.  Here, I can open a FASTA file containing thousands of sequences and retrieve the sequence name and the number of adenines in each sequence in a matter of seconds, without ever leaving the terminal!

SeqIO
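In text form, the adenine count described above might look roughly like the sketch below. It assumes BioPython is installed; the function name count_adenines and the two short example sequences are made up for illustration. SeqIO.parse accepts either a file name or an open handle, so an in-memory handle stands in for a real FASTA file here.

```python
from io import StringIO

from Bio import SeqIO


def count_adenines(fasta_source):
    """Return a list of (sequence name, adenine count) pairs.

    fasta_source may be a path to a FASTA file or an open text handle.
    """
    results = []
    for record in SeqIO.parse(fasta_source, "fasta"):
        # upper() so that lowercase (soft-masked) bases are counted too
        sequence = str(record.seq).upper()
        results.append((record.id, sequence.count("A")))
    return results


# Made-up example data standing in for a real FASTA file
example = StringIO(">seq1\nATGCAA\n>seq2\nGGGA\n")
for name, n_adenines in count_adenines(example):
    print(name, n_adenines)
```

To run it on a real file, pass the file path instead of the StringIO handle.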

More uses for BioPython will be added in later posts.

For information about installing, please refer here.  Please make sure NumPy is installed where your Python interpreter can find it, as it is a dependency of the BioPython package.  To check, simply try importing NumPy from Python:

numpy_installed
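The quickest check is to type import numpy at the Python prompt. As a slightly more general sketch, the snippet below reports whether a module can be found without raising an error; the helper name is_installed is my own, not part of NumPy or BioPython.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# "Bio" is the import name of the BioPython package
for module_name in ("numpy", "Bio"):
    status = "installed" if is_installed(module_name) else "NOT installed"
    print(module_name, "is", status)
```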

If NumPy has not been installed, you will receive an error message (an ImportError).

2. Automation – Hours of copy and paste, search and replace, and fighting formatting with Excel, begone!  Most of the work you want to do with sequence data is counting, searching, and aligning.  Python can do this for you in a matter of seconds, as long as you tell it how.  For example, if you want to make a file summarizing sequence statistics that can be read by R, Excel, Matlab, etc., you can simply write a script like:

pyscript
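A summary script of this kind could be sketched as follows. This is only an illustration, not the original script: the function name write_sequence_stats, the column names, and the choice of statistics are my assumptions. It takes (name, sequence) pairs and writes one tab-delimited row per sequence, a layout that R, Excel, and Matlab can all import.

```python
import csv


def write_sequence_stats(records, out_path):
    """Write one tab-delimited row of basic statistics per sequence.

    records is any iterable of (name, sequence) string pairs.
    """
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["name", "length", "A", "C", "G", "T", "gc_fraction"])
        for name, seq in records:
            seq = seq.upper()
            # Counts in the order A, C, G, T
            counts = [seq.count(base) for base in "ACGT"]
            # Guard against empty sequences to avoid dividing by zero
            gc = (counts[1] + counts[2]) / len(seq) if seq else 0.0
            writer.writerow([name, len(seq), *counts, round(gc, 3)])
```

With BioPython, the pairs could come straight from a FASTA file, for example ((rec.id, str(rec.seq)) for rec in SeqIO.parse("sequences.fasta", "fasta")), where the file name is hypothetical.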

The generated text file can then be opened and edited wherever you choose.  For example, this is what the output would look like when viewed in RStudio:

pythonEx

3. Dictionaries – This is a data structure that is incredibly useful for storing and looking up information.  I like to use them when translating between gene names and gene IDs.  Check out the upcoming posts to see how dictionaries can be used.
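As a small preview, here is a minimal sketch of a gene name to gene ID lookup. The names and IDs are invented placeholders, not real identifiers.

```python
# Hypothetical gene names and IDs, made up for illustration only
name_to_id = {
    "geneA": "ID0001",
    "geneB": "ID0002",
    "geneC": "ID0003",
}

# Build the reverse mapping so IDs can be translated back to names
id_to_name = {gene_id: name for name, gene_id in name_to_id.items()}

print(name_to_id["geneB"])                    # look up an ID by name
print(id_to_name["ID0003"])                   # look up a name by ID
print(name_to_id.get("geneX", "not found"))   # .get() avoids a KeyError
```

Lookups in a dictionary stay fast even with many thousands of entries, which is what makes them convenient for this kind of translation.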
