Abstract:
Next-generation sequencing technologies, with their low cost and high throughput, have greatly benefited the field of microbial research. Applying whole-genome shotgun sequencing to DNA extracted from an environmental sample avoids the usually complex process of cultivating pure cultures of microorganisms in the laboratory. This approach is referred to as whole-genome shotgun metagenomic sequencing. The analysis of the sequencing data mainly aims at the taxonomic and functional characterization of the microbial sample, and many algorithms and tools have been developed for this purpose. The design of the analysis pipeline is usually dictated by the specific project at hand.
In this thesis, we describe several aspects of analyzing whole-genome shotgun metagenomic data. Analysis usually begins with a quality check of the raw sequencing data, followed by preprocessing to improve read quality. For datasets containing many large samples, preprocessing can take considerable time and effort. However, if the aim is to bin reads into taxonomic and functional categories, a low-quality read is filtered out automatically, which would make the initial preprocessing unnecessary. We therefore first examine the effect of preprocessing on the ensuing analysis of metagenomic samples. Next, we assess the correspondence between the different functional classification systems typically used in metagenomic analyses. A reference protein in a database such as NCBI-NR may have no identifier, or multiple identifiers, belonging to a particular classification system. Consequently, the functional group into which a read is placed depends on how the reference it aligns to is mapped to functional identifiers. We study the correspondence between the different classification systems using a few metagenomic samples.
Further, we describe the analysis of a dataset of human gut metagenomic samples obtained from obese patients undergoing a weight-loss diet intervention. These patients had also been diagnosed with non-alcoholic fatty liver disease (NAFLD) and metabolic syndrome. The analysis is carried out using the popular metagenomic analysis tools DIAMOND and MEGAN. The aim of this study was to track the effect of the diet intervention on the composition of the gut flora and to relate clinical parameters such as weight loss, NAFLD and metabolic syndrome to the microbiome.
A metagenomic sample can be analyzed either directly from the reads or from an assembly. Both methods have their pros and cons. We explore the differences in taxonomic and functional composition between these two strategies and conclude that both provide similar results, with minor differences depending on the sample being assembled. Finally, we describe how a gene-centric assembly can be carried out with the tools DIAMOND and MEGAN, and we demonstrate its usefulness in a metagenomic analysis pipeline by applying it across different gene families and metagenomic samples.