The World Cup is underway and I am back to the blog! I was recently inspired by the World Cup Paninis to write a post about sequencing depth. In sequencing, the problem we often face is whether enough sequences have been generated to be representative of a population. The tools we often use to judge whether we have sampled enough are rarefaction curves and the Chao estimate, but how many sequences would we need to generate in order to capture a 16S from every organism present in an environment? Continue reading
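To make the sticker-album analogy concrete, here is a minimal sketch of the classic "coupon collector" calculation. It assumes, purely for illustration, that every organism is equally abundant and that each read is an independent random draw. Real communities are uneven, which pushes the number higher, so treat the result as an optimistic floor rather than a sequencing target. The function name is my own.

```python
def expected_reads_to_see_all(n_taxa):
    """Expected number of reads needed to observe every taxon at least once.

    Assumes all taxa are equally abundant and reads are independent draws
    (the coupon collector model): E = n * (1 + 1/2 + ... + 1/n).
    """
    if n_taxa < 1:
        raise ValueError("n_taxa must be a positive integer")
    harmonic = sum(1.0 / k for k in range(1, n_taxa + 1))
    return n_taxa * harmonic


if __name__ == "__main__":
    # Hypothetical community sizes, chosen only to show how the number grows.
    for n in (10, 100, 1000):
        print(n, round(expected_reads_to_see_all(n)))
```

Even in this idealized case, the last few unseen taxa account for most of the reads, which is the same reason the last few stickers in an album take so long to find.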
The N50 is a metric that is often associated with the length of contigs post-assembly. I find it best to think of it as a weighted median where longer contigs are given a greater weight. The N50 is defined as follows: Continue reading
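The post's own definition sits behind the link, so the sketch below is an assumption on my part. It uses the commonly cited definition: sort the contigs from longest to shortest, then report the length of the contig at which the running total first reaches half of the total assembly length. The function name is my own.

```python
def n50(contig_lengths):
    """Return the N50 for an iterable of contig lengths.

    Assumed definition: the length of the contig at which the cumulative
    length, counted from the longest contig down, first reaches at least
    half of the summed length of all contigs.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

This is where the "weighted median" picture comes from: each contig counts in proportion to its length, so a handful of long contigs can carry the N50 well above the ordinary median.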
Python is a great programming language for processing genomic data. Instead of wasting mindless hours copying, pasting and clicking through Excel spreadsheets, I can use this easy-to-learn language to write my own scripts that quickly organize, analyze, and process large genomic datasets. Some of the reasons I love Python are:
1. BioPython Package – Within this package lie the tools to easily manipulate and process your sequence data. Time and time again I have found myself needing to count the number of bases, find the lengths of the sequences, etc. in a sequence file. Bio.SeqIO can do just that. SeqIO.parse is a function built into the BioPython package that iterates over all the sequences in a FASTA file or whatever other sequence format you may need, so you can apply the same operation to each record. Check out the example in the following code. Here, I can open a FASTA file containing thousands of sequences and retrieve the sequence name and the number of adenines in each sequence in a matter of seconds, without ever leaving the terminal!
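As a minimal sketch of that kind of script (not necessarily the exact code from the post), the snippet below counts adenines per record. The helper names are my own, and the counting function accepts any records with `id` and `seq` attributes so it can be tried without a real file. Only `adenine_counts_from_fasta` needs Biopython installed.

```python
def adenine_counts(records):
    """Yield (record id, number of adenines) for each sequence record.

    Works with Biopython SeqRecord objects or anything else that exposes
    `id` and `seq` attributes.
    """
    for record in records:
        # str() handles both Biopython Seq objects and plain strings;
        # upper() makes the count case-insensitive (soft-masked bases included).
        yield record.id, str(record.seq).upper().count("A")


def adenine_counts_from_fasta(path):
    """Parse a FASTA file with Bio.SeqIO and return a list of (id, count) pairs."""
    # Imported here so adenine_counts() stays usable without Biopython.
    from Bio import SeqIO

    return list(adenine_counts(SeqIO.parse(path, "fasta")))
```

With a file on disk, say a hypothetical `sequences.fasta`, calling `adenine_counts_from_fasta("sequences.fasta")` and printing each pair gives the name and adenine count for every sequence in the file.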