Python is a great programming language for processing genomic data. Instead of wasting mindless hours copying, pasting, and clicking through Excel spreadsheets, I can use this easy-to-learn language to write my own scripts that quickly organize, analyze, and process large genomic datasets. Some of the reasons I love Python are:
1. BioPython Package – Within this package lie the tools to easily manipulate and process your sequence data. Time and time again I have found myself needing to count the number of bases, find the lengths of the sequences, etc. in a sequence file. Bio.SeqIO can do just that. SeqIO.parse is a function built into the BioPython package that reads a fasta file (or whichever other sequence format you need) and hands back its records one at a time, so you can loop over every sequence and apply the same operation to each. Check out the example in the following code. Here, I can open a fasta file containing thousands of sequences and retrieve the sequence name and the number of adenines in each sequence in a matter of seconds, without ever leaving the terminal!
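A minimal sketch of such a loop, assuming BioPython is installed. The two-record fasta text and the helper name `adenine_counts` are made up for illustration; they stand in for a real file and are not part of BioPython itself.

```python
from io import StringIO

from Bio import SeqIO

# Tiny made-up fasta text standing in for a real file (illustrative only).
EXAMPLE_FASTA = """>seq1
ATGCAAATTT
>seq2
GGGACCA
"""


def adenine_counts(handle):
    """Return (sequence name, number of adenines) for each fasta record."""
    counts = []
    for record in SeqIO.parse(handle, "fasta"):
        # Upper-case first so lower-case (soft-masked) bases are counted too.
        counts.append((record.id, str(record.seq).upper().count("A")))
    return counts


for name, n_adenines in adenine_counts(StringIO(EXAMPLE_FASTA)):
    print(name, n_adenines)
```

To run this on your own data, pass a file path instead of the `StringIO` handle, for example `adenine_counts("my_sequences.fasta")`, where the file name is hypothetical.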
More uses for BioPython will be added in later posts.
For information about installing BioPython, please refer here. Please make sure NumPy has been installed on your Python path, as it is a dependency of the BioPython package. To check, simply try importing NumPy from Python:
If NumPy has not been installed, you will receive an error message.
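The check described above can also be wrapped in a few lines so that it prints a friendly message instead of a traceback. This is only a sketch; the helper name `can_import` is my own and not part of NumPy or BioPython.

```python
import importlib


def can_import(module_name):
    """Return True if the module imports cleanly, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if can_import("numpy"):
    print("NumPy is installed")
else:
    print("NumPy is missing; install it before installing BioPython")
```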
2. Automation – Hours of copy and paste, search and replace, and fighting formatting with Excel, be gone! Most of the work you want to do with sequence data is counting, searching, and aligning. Python can do this for you in a matter of seconds, as long as you tell it how. For example, if you want to make a file summarizing sequence statistics that can be read by R, Excel, Matlab, etc., you can simply write code like:
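One way such a summary script could look, using only the standard library. The sequences, the column names, and the output file name `sequence_stats.txt` are assumptions made for illustration; in a real script the (name, sequence) pairs would more likely come from looping over SeqIO.parse.

```python
import csv

# Made-up (name, sequence) pairs, purely for illustration.
EXAMPLE_SEQUENCES = [
    ("seq1", "ATGCAAATTT"),
    ("seq2", "GGGACCA"),
]


def sequence_stats(name, sequence):
    """Return one row of summary statistics for a single sequence."""
    sequence = sequence.upper()
    length = len(sequence)
    gc = sequence.count("G") + sequence.count("C")
    # Guard against empty sequences to avoid dividing by zero.
    gc_percent = round(100 * gc / length, 1) if length else 0.0
    return [name, length, sequence.count("A"), gc_percent]


def write_summary(sequences, out_path):
    """Write a tab-delimited table with a header and one row per sequence."""
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
        writer.writerow(["name", "length", "adenines", "gc_percent"])
        for name, sequence in sequences:
            writer.writerow(sequence_stats(name, sequence))


write_summary(EXAMPLE_SEQUENCES, "sequence_stats.txt")
```

A tab-delimited file with a header row like this can be loaded in R with `read.delim("sequence_stats.txt")` or opened directly in Excel.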
The text file generated can then be opened and edited wherever you choose. For example, this is what the output would look like when viewed in RStudio:
3. Dictionaries – This is a data structure that is incredibly useful for storing and looking up information. I like to use them when translating between gene names and gene IDs. Check out the upcoming posts to see how dictionaries can be used.
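As a small taste of the idea, here is a sketch of a name-to-ID lookup. The gene names and IDs are invented placeholders, not real identifiers.

```python
# Made-up gene names and IDs, purely for illustration.
gene_name_to_id = {
    "geneA": "ID0001",
    "geneB": "ID0002",
    "geneC": "ID0003",
}

# Invert the mapping so IDs can be translated back to names.
gene_id_to_name = {gene_id: name for name, gene_id in gene_name_to_id.items()}

print(gene_name_to_id["geneB"])
print(gene_id_to_name["ID0003"])
# .get returns a default instead of raising KeyError for unknown genes.
print(gene_name_to_id.get("geneZ", "unknown"))
```

Inverting the dictionary like this only works cleanly when each ID belongs to exactly one name; if two names shared an ID, the later one would overwrite the earlier one.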