NGS MM 13
Milan Meeting on Next Generation Sequencing
September 25th 2013, Aula Martini, U6-4, Università degli Studi di Milano, Piazza dell'Ateneo Nuovo 1, 20126 Milano.
Directions to the meeting venue can be found here.
- Marco Antoniotti, DISCo, Università degli Studi di Milano Bicocca
- Alessandro Guffanti, Unità di Bioinformatica, Genomnia Srl
- Heiko Muller, Center for Genomic Science, Istituto Italiano di Tecnologia (IIT@SEMM).
- Elia Stupka, Centro di Genomica Traslazionale e Bioinformatica Ospedale San Raffaele (CTGB@OSR)
- 9:00 Stefano Ceri, DEIB, Politecnico di Milano, Disclosing Genome Properties with Genometric Query Language
- Stefano Ceri, Marco Masseroli, Politecnico di Milano
- Next Generation Sequencing (NGS) is about to offer a fast (hours to minutes) and inexpensive (hundreds to tens of dollars) technology for reading and sequencing the human genome. Within a ten-year time frame, a massive, inexpensive, complete and accurate procedure will allow the reading of millions of genomes. The corresponding potential for data querying, analysis and sharing is the biggest and most important "big data problem" facing mankind.
- The data management community is not ready to face this challenge. NGS data has so far been managed in physical formats and standards, which are shaped by the data production process of sequencing machines but do not carry explicit information about high-level properties of the human genome. As a consequence, asking high-level queries on DNA-related information is not adequately supported.
- In this exciting framework, we propose a new paradigm for raising the level of abstraction in NGS data management. We introduce a Genometric Data Model (GDM), which encodes experiment results in a format that takes into account the organization of the genome into regions, and a Genometric Query Language (GQL), which uses such regions as its main data abstraction and computes high-level operations to extract regions of interest. The main difficulty in dealing with NGS data is scalability; therefore, the design of GQL is inspired by Pig Latin, a high-level, declarative language that can be executed over a Hadoop cloud computing architecture; GQL can be mapped to Pig for efficient execution. The use of GQL will enable the discovery of new biological phenomena that cannot be observed at the small experimental scale of current bioinformatics languages and standards.
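- To make the idea of regions as the main data abstraction concrete, here is a minimal Python sketch of a region record and one high-level operation, selecting the regions of an experiment that overlap a set of reference regions. It is not GQL syntax and not the GDM schema, neither of which is given in this abstract; the `Region` fields, the half-open coordinate convention and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Region:
    """A hypothetical genomic region: chromosome plus a half-open interval."""
    chrom: str
    start: int          # 0-based, inclusive (assumed convention)
    end: int            # exclusive (assumed convention)
    score: float = 0.0  # an example per-region attribute


def overlaps(a: Region, b: Region) -> bool:
    """True when both regions lie on the same chromosome and share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def select_overlapping(experiment: List[Region], reference: List[Region]) -> List[Region]:
    """Keep the experiment regions that overlap at least one reference region."""
    return [r for r in experiment if any(overlaps(r, ref) for ref in reference)]


# Made-up toy data, for illustration only.
peaks = [
    Region("chr1", 100, 200, 5.0),
    Region("chr1", 500, 600, 2.0),
    Region("chr2", 100, 200, 7.0),
]
promoters = [Region("chr1", 150, 300), Region("chr2", 900, 1000)]

hits = select_overlapping(peaks, promoters)
```

- The quadratic scan is only for readability; at the scale discussed in the abstract, such an operation would presumably be sorted or partitioned by chromosome and distributed.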
- This work is part of a joint PoliMi-IEO-IIT research project on Genomic Computing, which today involves many faculty members and 8 new dedicated PhD students at PoliMi, and a comparable number of scientists at IEO-IIT. It is also one of the first results of GenData 2020, a PRIN project funded by the Italian Government (February 2013 - February 2016), which involves nine of the most prestigious database groups in Italy (see )
- 9:30 Maurizio Casiraghi, ZooPlantLab, BtBs, Università degli Studi di Milano Bicocca, Identifying what? Biological entities in the era of Next Generation Sequencing
- The world of biology is undergoing a methodological revolution. The techniques collectively known as "Next Generation Sequencing" (NGS) make it possible to analyze in fine detail complex matrices (water, soil, gut content) in which many organisms are present at the same time. The NGS approach combines molecular analyses with bioinformatic methods, but there are two kinds of problems: (1) the reference and comparison matrices are largely incomplete, and much biodiversity effectively remains unclassified; (2) the ability to identify entities can be affected by the choice of marker and by problems intrinsic to the algorithms used for discrimination in the bioinformatic analysis.
- At the ZooPlantLab of Milano-Bicocca, Department of Biotechnology and Biosciences, the analysis of complex matrices by NGS is addressed both at a practical level and on the conceptual side of the problem. Open issues, ongoing work and planned work will be presented.
- 10:00 Davide Cittaro, CTGB@OSR, Exome Sequencing and Disease Gene Prioritization in Clinical Setting
- Exome sequencing (ES) is a cost-effective strategy for the identification of single nucleotide variants (SNVs) at high depth of coverage. The output of a typical ES experiment includes thousands of SNVs, hundreds of which potentially affect coding sequences. Many tools are available for the identification of such variants, while strategies for prioritizing them according to biological relevance are mainly left to manual curation by researchers. This approach is impractical in a clinical context, where the number of samples may be limited and curation bias may impact actionable findings.
- We suggest here a semi-automated pipeline that relies only on existing tools to improve the prioritization step. We also describe a case study that shows how the pipeline, together with a proper experimental design, may lead to clinically valuable results.
- 10:30 Stefano De Pretis, IIT@SEMM, RNAs production and degradation dynamics from NGS time course data
- Stefano De Pretis, Marco Morelli, Mattia Pelizzola, IIT@SEMM
- We present an implementation of a technique to infer the production and degradation dynamics of RNAs from NGS data of newly transcribed RNA (4sU labeling or chromatin-associated RNA) and classical NGS data of total cellular RNA, over a time series. This technique fits two different models to the data (constant and varying degradation) and determines which one better describes the behavior of each gene.
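- As a rough illustration of the two model families, the sketch below assumes the common first-order kinetic model dR/dt = s(t) - k·R, where R is total RNA, s is the synthesis rate (taken here as directly measurable from the newly transcribed fraction) and k is the degradation rate. This model, the midpoint finite-difference scheme and all function names are assumptions of this sketch, not the speakers' implementation; the statistical step that chooses between the two models is not shown.

```python
from typing import List, Sequence


def _interval_terms(times: Sequence[float], total_rna: Sequence[float],
                    synthesis: Sequence[float]):
    """Yield (s_mid - dR/dt, R_mid) for each pair of consecutive time points."""
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        slope = (total_rna[i + 1] - total_rna[i]) / dt
        r_mid = 0.5 * (total_rna[i] + total_rna[i + 1])
        s_mid = 0.5 * (synthesis[i] + synthesis[i + 1])
        yield s_mid - slope, r_mid


def varying_degradation_rates(times, total_rna, synthesis) -> List[float]:
    """Varying-degradation model: one rate k_i per time interval."""
    return [lhs / r_mid for lhs, r_mid in _interval_terms(times, total_rna, synthesis)]


def constant_degradation_rate(times, total_rna, synthesis) -> float:
    """Constant-degradation model: single k minimizing sum((s - dR/dt - k*R)^2)."""
    num = 0.0
    den = 0.0
    for lhs, r_mid in _interval_terms(times, total_rna, synthesis):
        num += lhs * r_mid
        den += r_mid * r_mid
    return num / den


# Made-up steady-state example: RNA level is flat, so synthesis balances decay.
times = [0.0, 1.0, 2.0, 3.0]
total_rna = [10.0, 10.0, 10.0, 10.0]
synthesis = [5.0, 5.0, 5.0, 5.0]

k_const = constant_degradation_rate(times, total_rna, synthesis)
k_per_interval = varying_degradation_rates(times, total_rna, synthesis)
```

- In this steady-state toy case both models agree on k = 0.5; a gene whose per-interval rates drift over time would be a candidate for the varying-degradation model.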
- Network and chat with your colleagues and with the speakers in front of refreshments and pastries – courtesy of Perkin Elmer Inc. Italy
- 11:30 Alessandro Guffanti, Genomnia, Bioinformatics analysis of NGS datasets: challenges and opportunities
- The analysis of Next Generation Sequencing datasets is one of the most difficult, and rewarding, bioinformatics activities. The development of bioinformatics algorithms and analysis procedures in this field is extremely active. The heterogeneity of the molecular species and signals associated with the eukaryotic transcriptome, genome and epigenome, many still uncharacterized and unannotated, requires a complex integrated analytical approach involving bioinformatics, statistics, computational biology and a strong interaction between the analyst and the experimentalist. Finally, different biological systems seemingly require different bioinformatic analysis procedures.
- In this presentation I will illustrate the challenges and the opportunities that emerged from the analysis of different NGS datasets (RNA-seq; ChIP-seq; small RNA; 'ab initio' small genome) sequenced at Genomnia.
- 12:00 Arnaud Ceol, IIT@SEMM, From genome to molecules, mapping NGS data to the structures of molecular interaction to identify disease-causing mutations
- Next-generation sequencing has been applied to identify genomic variations possibly associated with many diseases. The next challenge is to identify the role of those mutations.
- Mutations located in protein interaction interfaces are often associated with loss of function or gain of function. We developed an extension for the Integrated Genome Browser, a widely used tool for the visualization and analysis of genomic data, which automatically retrieves from public repositories all the known molecular interactions that involve the products of a selection of genes. The structures available for those interactions can be retrieved and displayed, and the mutated residues highlighted, providing the opportunity to identify those that lie at the binding interface and may be the cause of a loss of function.
- 12:30 Eduardo González Couto, Integromics, RNA-seq analysis and advanced visualization using SeqSolve and the ExpressionExplorer from the OmicsOffice Suite
- While RNA-seq becomes mainstream, the analysis, flexible visualization and exploration of the results is still a challenge. The OmicsOffice Suite brings the power of the most advanced bioinformatics tools developed and used by bioinformaticians to the whole life sciences community.
- SeqSolve, the RNA-seq workflow of OmicsOffice, integrates SAMtools, Cufflinks, and the Integrative Genomics Viewer in the industry-strong Spotfire platform. Complex processes like advanced quality control, critical for scientists using sequencing services or core facilities, new transcription events discovery, and differential gene or transcript isoform expression analysis are fully automated. The resulting visual results can then be explored, mined interactively, validated using other "Omics" technologies and even combined in time series or multi-sample large experiments in the OmicsExplorer.
- 14:30 Marco Previtali, DISCo Università Milano Bicocca, Lightweight assembling of splicing graphs from RNA-seq data using the Burrows-Wheeler Transform
- S. Beretta, P. Bonizzoni, G. Della Vedova, M. Previtali, R. Rizzi, DISCo Università Milano Bicocca
- In this talk we discuss a novel algorithmic approach to transcriptome analysis from RNA-seq data based on assembling RNA-seq data into splicing graphs using the Burrows-Wheeler Transform (BWT). More precisely, we employ an external memory algorithm to construct the BWT, so that primary memory (RAM) is used efficiently to store, manage and assemble RNA-seq data into transcripts. In this talk we will discuss some possible further uses of the BWT in the design of new algorithms for alternative splicing prediction.
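- For readers unfamiliar with the Burrows-Wheeler Transform, the following Python sketch computes it by sorting all rotations of a string and inverts it by repeated sorting. This is the textbook in-memory construction, shown only to illustrate what the BWT is; it is not the external memory algorithm described in the talk, and the `$` end marker is an assumption of the sketch.

```python
def bwt(text: str, end_marker: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (naive, quadratic memory)."""
    if end_marker in text:
        raise ValueError("end marker must not occur in the input")
    s = text + end_marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, end_marker: str = "$") -> str:
    """Invert the BWT by rebuilding the sorted rotation table column by column."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        # Prepend the BWT column, then re-sort; after n rounds the rows are the rotations.
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(end_marker))
    return original[:-1]


encoded = bwt("banana")           # "annb$aa"
decoded = inverse_bwt(encoded)    # "banana"
```

- Equal characters tend to cluster in the output (`nn`, `aa`), which is what makes the transform useful for compressed indexing of large read collections.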
- 15:00 Heiko Muller, IIT@SEMM, Identification of mutational signatures in human cancer using bipartite graph models
- Next-generation sequencing has recently been applied to the sequencing of cancer genomes. Cancer genomes are known to be unstable and to accumulate varying numbers of somatic rearrangements and mutations. The challenge is to distinguish events that are relevant for the onset and the progression of the disease from simple bystander events.
- We downloaded level 2 and level 3 data from the TCGA data portal and focused our attention on data regarding somatic mutations. We employed bipartite graph models in conjunction with bi-binomial statistics that are suited to cope with largely varying numbers of mutations per sample. We applied a jackknifing strategy combined with a k-nearest neighbor classifying scheme. The analyses were performed using NetCutter software (1).
- We find that TCGA samples can be classified according to tissue of origin with high efficiency. We find known tumor suppressors and oncogenes at the top of the list of molecules that, when mutated, contribute the most to successful classification. We also find a large number of mutations that perform well as classifiers and are located in genes whose role in the pathogenesis of cancer is unclear at the moment. Interestingly, removal of 500 known cancer genes diminished the quality of sample classifications only marginally, suggesting that many cancer genes are yet to be discovered.
- 1. Müller H, Mancuso F (2008) Identification and Analysis of Co-Occurrence Networks with NetCutter. PLoS ONE 3(9): e3178. [doi:10.1371/journal.pone.0003178]
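- The combination of jackknifing with a k-nearest neighbor classifier can be sketched as follows, using leave-one-out resampling (one common form of the jackknife). Each sample is represented by the set of genes mutated in it, which corresponds to one side of a sample-gene bipartite graph. The Jaccard similarity, the function names and the toy data are assumptions of this sketch; it does not reproduce NetCutter or the bi-binomial statistics used in the talk.

```python
from collections import Counter
from typing import List, Set, Tuple

Sample = Tuple[Set[str], str]  # (mutated genes, tissue label)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Similarity between two gene sets: shared genes over all genes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_predict(query: Set[str], training: List[Sample], k: int) -> str:
    """Majority label among the k training samples most similar to the query."""
    ranked = sorted(training, key=lambda item: jaccard(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(samples: List[Sample], k: int = 3) -> float:
    """Classify each sample using all the others; return the fraction correct."""
    correct = 0
    for i, (genes, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if knn_predict(genes, rest, k) == label:
            correct += 1
    return correct / len(samples)


# Made-up toy data: hypothetical genes G1..G6 and tissues "A" and "B".
samples = [
    ({"G1", "G2"}, "A"),
    ({"G1", "G2", "G3"}, "A"),
    ({"G2", "G3"}, "A"),
    ({"G4", "G5"}, "B"),
    ({"G4", "G5", "G6"}, "B"),
    ({"G5", "G6"}, "B"),
]

accuracy = leave_one_out_accuracy(samples, k=3)
```

- Repeating the evaluation after removing a gene from every sample gives a simple way to ask how much that gene contributes to classification, in the spirit of the gene-removal experiment described above.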
- 15:30 Ermanno Rizzi, ITB CNR, Ancient Population study by Next Generation Sequencing
- Ancient genetic studies provide a unique means to test assumptions based on modern genetic data, but they are often based on DNA extracted from only one or a few individuals and, therefore, do not lend themselves to statistical inference. Here I present an ancient population genetics project that studies Italian genetic history over a time span of thousands of years, from the Palaeolithic to medieval times. The project focuses on the analysis of a population genetic marker, mitochondrial DNA, recovered from a large number of ancient human remains. The study is performed using a next generation sequencing (NGS) platform, which allows the recovery of ancient DNA (aDNA) sequences while overcoming the intrinsic limitations of aDNA.
- 16:00 Silvia Bonfiglio, CTGB@OSR, MeDIP-seq: a cost-effective and efficient method for genome-wide DNA methylation studies
- Although whole genome bisulfite sequencing (BS-seq) is considered the gold standard for methylome studies, methylated DNA immunoprecipitation sequencing (MeDIP-seq) is a valid, cost-effective alternative. In this assay, a monoclonal antibody raised against 5-methylcytosine (5mC) is used to precipitate the methylated fraction of a genome. MeDIP-seq provides high-quality methylomes at typically 100-300 bp resolution. We optimized our MeDIP-seq library preparation protocol in order to improve both throughput and scalability.
- Protocol optimization allowed us to scale down at least four-fold the starting DNA quantity required by commercial library-preparation kits while preserving high specificity and reproducibility. A further protocol modification, currently under development, will improve our throughput by pooling libraries for the immunoprecipitation step. We have already successfully applied our optimized protocol to the study of pancreatic neuroendocrine tumors in collaboration with Prof. Scarpa of ARC-Net Research Centre at Verona University.
- 16:30 Alessandro Pietrelli, ITB CNR, Surfing into NGS data: Challenges and solutions for exome/target sequencing data analysis
- Recent advances in genome sequencing technologies provide unexpected opportunities to characterize genomic landscapes and identify mutations relevant for diagnosis and therapy. Specifically, whole-exome sequencing (WES) and gene target sequencing, both based on capture technology, are powerful methods for obtaining relevant information on the coding part of the genome at moderate laboratory cost compared with a whole-genome sequencing approach. While next-generation sequencing (NGS) capture strategies are becoming standard methods for studying complex and Mendelian diseases, data analysis and data management remain challenging and are the real bottleneck in NGS projects. In this talk, I will give an overview of the data analysis workflow used for exome and target sequencing projects, focusing on the steps needed to drive the interpretation of data from raw reads to a list of variants, and I will highlight the pitfalls, challenges and solutions for this particular application of NGS.
- 17:00 Conclusions and closing remarks