When our lab got its first metagenomic dataset, the first thing we did was upload our QC filtered and merged paired-end Illumina reads (mean length 160 bp) onto MG-RAST for annotation. However, when the annotations came back, some organisms whose genomes were known to be present in both our sample and the m5nr reference dataset were missing and, for those sequences that were annotated, the designated e-values centered around 1e-10. In order to improve the annotation of our data, we decided to perform an assembly. Searching the literature, I found that a class of assemblers — called De Bruijn Graph assemblers — were the popular choice for assembly of short read metagenomic data; however, the intuition behind how these assemblers worked was a little less clear.
To begin to answer the question of what a De Bruijn Graph is and why it is used, I found it easiest to think of genome assembly as a string reconstruction problem. For example, if we were to take the line, “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness.” from Charles Dickens’s A Tale of Two Cities, copy it five times, and chop it in 5 word sections, we would have a problem similar to genomic assembly.
In the example above, your first thought might be to pick a 5 word substring, and then find the substring that shares four of the 5 words to begin to reconstruct the sentence:
However, this method will quickly run into problems if you have a repeat in substrings. For example, when you reach: “of times, it was the”, you are presented with the option to proceed to the substring “times, it was the worst” OR “times, it was the age”:
The confusion of whether “the worst of times” or “it was the age” comes next can be resolved by either 1) Dividing the sentence into substrings of more than 5 words OR 2) Using a graphical approach to solve the problem. For the sake of example, let’s combine both approaches. Since our repeat is 5 words long, we will increase our substring length to 6 and construct a graph where every edge is a 6 word substring between its prefix and suffix.
If we represent all substrings in this manner, we can then glue all nodes with identical prefix/suffixes and build something known as a De Bruijn Graph.
If you take a close look at the De Bruijn Graph, you will see that the path through the graph is Eulerian. A Eulerian Path is a path through a graph that visits edge only once. The fact that the path through the De Bruijn Graph is Eulerian means that it can be found in linear time. (An interesting side note is that if we go back to the overlap method described earlier where each substring is a node, we would have a Hamiltonian path that is NP-complete and a run time that is exponential in the number of nodes). For biologists, the Eulerian path of the graph may not seem very important; however, it is the reason we are able to assemble millions of reads in matter of minutes rather than years.