Calculate the N50 of Assembled Contigs

The N50 is a metric often used to describe the lengths of contigs after assembly. I find it best to think of it as a weighted median in which longer contigs are given greater weight. The N50 is defined as follows:

Given a set of contigs, each with its own length, the N50 is the “length N for which 50% of all bases in the sequences are in a sequence of length L < N.  This can be found mathematically as follows: Take a list L of positive integers. Create another list L’ , which is identical to L, except that every element n in L has been replaced with n copies of itself. Then the median of L’ is the N50 of L. For example: If L = {2, 2, 2, 3, 3, 4, 8, 8}, then L’ consists of six 2’s, six 3’s, four 4’s, and sixteen 8’s; the N50 of L is the median of L’ , which is 6.” (Broad Institute).
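
To make the quoted definition concrete, here is a minimal Python sketch that follows it literally: it builds the expanded list L’ and takes its median. The function name `n50_by_expansion` is my own, and the approach is only practical for small toy inputs, since real contigs would make L’ enormous.

```python
from statistics import median


def n50_by_expansion(lengths):
    """Return the median of the list in which each length n appears n times."""
    expanded = []
    for n in lengths:
        # Replace the element n with n copies of itself, as in the definition.
        expanded.extend([n] * n)
    return median(expanded)


# The example from the definition: L' has 32 elements, and its two middle
# elements are 4 and 8, so the median is their average.
print(n50_by_expansion([2, 2, 2, 3, 3, 4, 8, 8]))  # prints 6.0
```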

Based on this definition, I came up with this Python script to calculate the N50 of a contigs file.
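
What follows is a sketch of how such a script could look, not necessarily the original one. It assumes the contigs file is in FASTA format (header lines starting with `>`, sequence possibly wrapped over several lines) and that the file path is passed as the first command-line argument. The function names `read_contig_lengths` and `n50` are my own. Instead of building the expanded list, it walks the sorted lengths and tracks how many positions of the expanded list have been covered so far.

```python
import sys


def read_contig_lengths(lines):
    """Yield the sequence length of each record in FASTA-formatted lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Weighted median: the median of the list where each n appears n times."""
    ordered = sorted(n for n in lengths if n > 0)
    total = sum(ordered)  # size of the expanded list = total number of bases
    if total == 0:
        raise ValueError("no contigs with positive length")
    # 0-based positions of the middle element(s) of the expanded list;
    # they coincide when the total is odd.
    lo_pos, hi_pos = (total - 1) // 2, total // 2
    lo_val = hi_val = None
    covered = 0
    for n in ordered:
        covered += n  # positions [covered - n, covered) all hold the value n
        if lo_val is None and lo_pos < covered:
            lo_val = n
        if hi_pos < covered:
            hi_val = n
            break
    return (lo_val + hi_val) / 2


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fasta:
        print(n50(read_contig_lengths(fasta)))
```

Note that this follows the definition quoted above, which averages the two middle values when the total number of bases is even. Many assembly tools instead report the length of the shortest contig in the smallest set of longest contigs covering at least half of the bases, and that convention would give 8 rather than 6 for the example list.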


