Abstract:
The ability of next-generation sequencers to produce enormous volumes of data at moderate cost has changed sequence-based research areas such as metagenomics and studies that estimate microbial diversity using the 16S rRNA gene. While early studies investigated relatively small samples in isolation, current studies target questions that require deeper sequencing of a larger number of samples. As a consequence, it is becoming increasingly difficult to perform the computational component of the analysis on a desktop computer. Even if the computationally intensive parts are outsourced to a more powerful environment, users still face datasets that exceed the capacity of their home computers.
This development conflicts with the design of MEGAN, a widely accepted, powerful and user-friendly tool for metagenomics, which performs qualitative analysis on local data files. To overcome this limitation, we developed MEGANServer, which allows bioinformaticians to keep data files on a server with sufficient resources. We also extended MEGAN to communicate with MEGANServer, enabling researchers to perform their analysis on a home computer regardless of the actual data size. To manage the complexity introduced by the growing number of samples, the selection of datasets of interest is automated by metadata-based grouping. In addition, following the analysis strategy of 16S rRNA studies, datasets can be opened using different strategies, for instance as merged data, to provide deeper insight into the taxonomic and/or functional distribution.
Furthermore, because metagenomics and 16S rRNA studies are converging, we extended MEGAN to handle sequences that stem from a targeted approach. More precisely, we developed a pipeline that covers the entire workflow, from pre-processing through to qualitative analysis in MEGAN. To this end, we took advantage of a novel aligner, MALT, in combination with the Majority Vote LCA, a placement algorithm recently introduced in MEGAN. Together, they not only assign more than 99\% of reads to the correct genus, but also lower the false-positive rate to a value close to 0\%.
We believe that, with the addition of the different data access strategies implemented in MEGANServer, MEGAN in combination with MALT and the Majority Vote algorithm is now fully capable of serving as a powerful yet user-friendly analysis tool for 16S rRNA sequencing data.