NGS MM 16
Next Generation Sequencing Milan Meeting 2016
The next NGS Milan Meeting (technologies, data, analysis) will be held in Room Sironi, U4-08, Università degli Studi di Milano Bicocca, on April 8th, 2016.
NGS MM is an informal day-long meeting, where researchers of the greater Milan area who are involved in various aspects of NGS and Bioinformatics meet to present their work, share experiences and initiate new collaborations.
The 2016 edition of NGS MM is graciously sponsored by
We kindly ask you to register by filling in this form. The meeting is free, but registering helps us set up the catering properly.
- 9:00 Riccardo Bellazzi, Università degli Studi di Pavia; GeMoNe: The Genomic Data Portal for Myeloid Neoplasms
- Next-generation sequencing technologies generate large amounts of genomic data that vary depending on the sequencing application. DNA-seq, RNA-seq, ChIP-Seq and Bisulfite-Seq are complementary sequencing approaches that allow measuring different genetic signals from tumor samples. The reduction in sequencing costs makes it possible to combine these technologies in order to achieve a complete molecular picture, unraveling the regulation of gene expression, related somatic mutations and, in general, the identification of new genetic signatures able to better define diagnosis and to explain particular prognostic trajectories in tumor regression/progression [1].
- We developed a Genomic Data Portal for Myeloid Neoplasms (GeMoNe) that allows collecting and integrating large amounts of genomic data coming from the aforementioned sequencing applications. Data can be combined in order to test the prognostic outcome of different genomic signatures on large cohorts of patients.
- Patients’ cohorts can be created by grouping and selecting the corresponding samples and filtering on patient phenotypes such as gender, age and WHO diagnosis. Genomic signatures correspond to genes coupled with genomic data types (e.g. somatic mutations, expression levels). Cohorts and signatures are finally combined in order to generate and statistically compare Kaplan-Meier survival curves.
- The platform consists of a web application as front end and a RESTful API as back end that retrieves genomic data from a new-generation database built to store and quickly manage huge amounts of data (Cassandra, a NoSQL database).
- A total of 499 patients have been integrated into the platform: 200 AML cases from The Cancer Genome Atlas (TCGA) project, with somatic variants, CNVs, expression levels and methylation data; and 299 patients from the IRCCS Policlinico S. Matteo of Pavia with AML, MDS and MDS/MPN, with somatic variants obtained by DNA sequencing of 111 genes [2].
- The platform has been implemented in collaboration with the Interdisciplinary Center for Biotechnology Research at the University of Florida and it is freely available online under restricted access (http://gemone.unipv.it).
- Joint work with Ivan Limongelli, Alberto Riva, Matteo G. Della Porta, Paolo Magni and Mario Cazzola
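The Kaplan-Meier curves mentioned in the GeMoNe abstract can be illustrated with a minimal sketch. This is not the portal's code: the function name `kaplan_meier` and the toy follow-up data are assumptions made for illustration, and the statistical comparison between cohorts (for example a log-rank test) is not shown.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time of each patient (any consistent unit).
    events: 1 if the event (e.g. death) was observed, 0 if censored.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # patients leaving the risk set at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical cohort: follow-up in months, 1 = event observed, 0 = censored.
months = [5, 8, 8, 12, 15]
observed = [1, 1, 0, 1, 0]
curve = kaplan_meier(months, observed)
```

Censored patients lower the number at risk without producing a step in the curve, which is why the time point 15 does not appear in the result.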
- 9:20 Maurizio Casiraghi, ZooPlantLab, BtBs, Università degli Studi di Milano Bicocca; "Water biome": HTS tools for the analysis of aquatic environments
- Different kinds of water are characterized by extremely diversified biomes. The advent of High Throughput DNA Sequencing (HTS) introduced new tools to investigate this overall diversity. However, many approaches are still limited to a qualitative level and exploit only a fraction of the diversity. We propose here the integration of different tools, both at the molecular lab stage and in the post-run analyses, to fully uncover the value of the research findings.
- Joint work with Antonella Bruno and Anna Sandionigi
- 9:40 Marco Masseroli, Politecnico di Milano; Next generation genomic data management and computing
- Modern sequencing technologies and data processing pipelines are providing, quickly and at low cost, an increasing amount of sequencing data and associated (epi)genomic features of many individual genomes in several biological and clinical conditions. Answers to essential biomedical questions are hidden in such valuable information, which is usually publicly available within well-curated repositories; yet, its efficient management and integrative processing is becoming the biggest and most important “big data” problem of mankind. The processing of multiple heterogeneous samples can help data-driven discoveries and biomolecular sense making, such as finding how diverse genomic, transcriptomic and epigenomic features cooperate to characterize biomolecular functions; however, it needs state-of-the-art “big data” computing strategies, with abstractions beyond the capabilities of today's commonly used tools.
- Recently, we introduced a new paradigm in next generation sequencing (NGS) data management and processing (http://www.bioinformatics.deib.polimi.it/genomic_computing/); we proposed a simple Genomic Data Model (GDM) that uses a few general abstractions for genomic region data and associated experimental, biological and clinical metadata, which guarantee interoperability between existing data formats. Taking advantage of the GDM, we developed a next-generation, high-level, declarative GenoMetric Query Language (GMQL) for genomics data (http://www.bioinformatics.deib.polimi.it/GMQL/); we proved its usefulness, flexibility and simplicity of use through several biological query examples. GMQL works downstream of NGS raw data preprocessing pipelines and supports queries over thousands of heterogeneous samples; computational efficiency and high scalability are obtained using parallel computing on clusters or public clouds. GDM and GMQL open doors to the seamless integration of descriptive statistics and high-level data analysis of NGS experimental results. Furthermore, they are applicable to federated repositories and can be used to offer integrated access to curated data, provided by large consortia such as ENCODE, Epigenomics Roadmap and TCGA, through easy-to-use search services and web interfaces. GMQL will shortly be available in the CINECA cloud, and an intuitive web application is publicly accessible at http://www.bioinformatics.deib.polimi.it/GMQL/queries/ to use a set of predefined parametric GMQL queries on ENCODE and Epigenomics Roadmap data.
- Joint work with Stefano Ceri, Abdulrahman Kaitoua, Pietro Pinoli, and Arif Canakoglu.
- 10:00 Eugenio Montini, Ospedale San Raffaele; Genomic Integration Site Analysis of 7 Metachromatic Leukodystrophy Patients up to 48 Months Follow-up After Lentiviral Hematopoietic Stem Cell Gene Therapy
- Background: The molecular analysis of the genomic distribution of viral vector genomic integration sites (IS) is a key step in hematopoietic stem cell-based (HSC) gene therapy applications, allowing one to assess the safety and the efficacy of the treatment and to study the basic aspects of hematopoiesis and stem cell biology by monitoring the clonal diversity and dynamics of multiple hematopoietic lineages after HSC transplantation.
- Results: In our ongoing clinical trial for metachromatic leukodystrophy (MLD) with a self-inactivating lentiviral vector (LV), we retrieved and sequenced >300 million proviral/host genomic junctions from cells of 7 patients with a 48-month follow-up after treatment. We observed diverse IS profiles in every patient in this study, indicating polyclonal reconstitution. The genome-wide distribution of LV IS among patients was consistent with previous pre-clinical and early clinical data, showing the characteristic bias to target expressed genes without preferences towards regulatory regions. Clonal abundance analyses showed no sustained clonal dominance in any patient. Statistical analysis of frequently targeted genomic regions indicated that integrations in these regions were the product of an intrinsic LV integration bias rather than of insertional mutagenesis. We estimated an average clonal HSC population of >7000 active stem cells from 9 months post treatment and observed only minor fluctuations in the clonal population dynamics of hematopoietic system reconstitution after this time point.
- Conclusion: IS analysis of 7 MLD patients shows a polyclonal repertoire and common molecular dynamics up to 48 months following gene therapy without any sign of genotoxicity.
- Joint work with Andrea Calabria, Giulio Spinozzi, Stefano Brasca, Fabrizio Benedicenti, Erika Tenderini, Stefano Beretta, Ivan Merelli, Luciano Milanesi, Luigi Naldini, and Alessandra Biffi
- 10:20 Michele Iacono, Roche Italia; Sequencing and Tissue Diagnostics at Roche
- 10:40 Raoul Bonnal, INGM; Docker: new opportunities for bioinformatics, research institutes and enterprises
- Nowadays computing power is no longer a problem: research labs can buy commodity hardware, build their own supercomputer or use cloud services. What is really challenging is how quickly those environments can be adapted to different research needs and used efficiently. Some of those needs, such as reproducibility, versioning, maintainability, collaboration and openness, can be tackled using container technologies such as Docker and by adopting the DevOps culture. To use the available computing power efficiently, Docker can be combined with traditional “research” resource managers such as Torque, SLURM, SGE and LSF, or with the cloud and more general resource managers such as Apache Mesos, which is used by enterprises in heterogeneous environments. The seminar will show the benefits of converting bioinformatics pipelines into a series of containers, scratch the surface of Docker and its tools, and present the advantages of using Mesos as a resource manager.
- 11:00 COFFEE BREAK; Network and chat with your colleagues and with the speakers in front of refreshments and pastries; courtesy of
- 11:30 Arnaud Ceol, IIT@SEMM; NGS Management And Analysis: From Sample To Molecular And Network Biology
- The Genomic Unit of the Center for Genomic Science of IIT@SEMM processes thousands of samples on Next Generation Sequencing platforms. We will briefly present how we manage the experimental flow and data with our dedicated LIMS and facilitate primary and secondary analyses with HTS-flow, a workflow management system that has been standardized and made easily accessible to both dry and wet lab scientists. Finally, we will show how we are extending genome visualization tools to enable the integration of NGS data with molecular, network and structural biology.
- 11:50 Rocco Piazza, Scuola di Medicina, Università degli Studi di Milano Bicocca; RNA-Seq is a valuable complement of conventional diagnostic tools in newly diagnosed AML patients
- Somatic alterations in AML represent critical elements for diagnosis and treatment. Cytogenetics is the de facto standard to detect chromosomal abnormalities, despite requiring mitotic cells, having limited resolution and being unable to detect cryptic fusions. Genotyping is usually performed with PCR/Sanger sequencing, which is expensive and detects only a predefined set of known variants. The net result is that few molecular lesions are routinely tested in clinical settings. In this study, 20 AML patients at diagnosis were subjected to RNA-Seq to identify gene fusions, NPM1 insertions, FLT3 internal tandem duplications and single nucleotide variants, and the results were compared with the routine clinical approach. All 5 fusions identified by standard cytogenetics (1 RUNX1-RUNX1T1, 3 PML-RARA and 1 CBFB-MYH11) were confirmed by RNA-Seq; however, RNA-Seq identified 4 additional events. Three of them were known from the literature (ZMYM2-FGFR1, KMT2A-MLLT10 and ETS2-ERG) and 1 was new (KDM2B-ETV6). The ZMYM2-FGFR1 fusion was particularly interesting, because it predicted sensitivity to ponatinib, which was confirmed by proliferation assays on fresh blasts. A total of 3/3 NPM1 insertions and 4/6 FLT3 internal tandem duplications were detected by standard techniques and RNA-Seq, respectively. These data indicate that RNA-Seq is a valuable complement of conventional diagnostic tools in newly diagnosed AML patients.
- 12:10 Paolo Provero, Università degli Studi di Torino e Ospedale San Raffaele; Patterns of somatic mutations in cancer predict disease genes
- Many recent studies show that genes associated with genetic diseases are often found recurrently mutated, at the somatic level, in cancer. These observations could in principle be useful in understanding the origin of both genetic diseases and cancer. However, they are currently anecdotal in nature. We systematically explored this issue by asking whether patterns of recurrent somatic mutations in cancer can be used to predict the involvement of genes in genetic diseases. We built logistic models that use cancer somatic mutation data from TCGA to predict the involvement of genes in genetic diseases as reported in databases such as Disease Ontology and Human Phenotype Ontology. The models achieve significant predictive power on most diseases, and appear to be especially effective for diseases related to higher cognitive functions, such as intellectual disability and autism spectrum disorder. These results will be useful in identifying disease-causing mutations from sequencing data and will provide insight into the common genetic underpinnings of cancer and hereditary diseases.
- Joint work with Davide Cittaro and Giovanni Tonon
- 13:00 LUNCH BREAK (free).
- 14:30 Marco Previtali, DISCo Università Milano Bicocca; Fast String Graph Construction for De Novo Assembly of NGS Data
- The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm.
- In this work, we explore a novel approach to compute the string graph, based on the FM-index and Burrows-Wheeler Transform. We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads.
- Our algorithm has been integrated into the SGA assembler as a standalone module to construct the string graph. The new integrated assembler has been assessed on a standard benchmark, showing that our module, FSG (Fast String Graph), is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads.
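The FM-index primitive that the string graph abstract builds on can be sketched as follows. This is not the FSG algorithm and does not compute overlaps between reads: it only shows backward search, which counts the occurrences of a pattern using the Burrows-Wheeler Transform alone, without consulting the original text. The function names, the naive rotation-sort construction and the toy sequence are assumptions made for illustration.

```python
def bwt_from_text(text):
    """Naive BWT via sorted rotations; text must end with a unique '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(bwt):
    """Return (C, occ) tables for backward search over the given BWT."""
    alphabet = sorted(set(bwt))
    # C[c]: number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_occurrences(pattern, C, occ, n):
    """Count occurrences of pattern by backward search (no text access)."""
    lo, hi = 0, n  # half-open range of rows in the sorted rotation matrix
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


# Toy sequence standing in for a read collection (illustrative only).
text = "GATTACATTAC$"
bwt = bwt_from_text(text)
C, occ = build_fm_index(bwt)
hits = count_occurrences("TTAC", C, occ, len(bwt))
```

The occurrence table here takes space proportional to the alphabet size times the text length; practical FM-indexes sample it and use compressed rank structures instead.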
- 14:50 Stefano De Pretis, IIT@SEMM; Dynamics of transcriptional regulation: the contribution of epigenetic and regulatory factors
- The dynamics of transcriptional regulation were characterized in mouse fibroblasts in response to Myc activation. The transcriptional response was characterized in terms of RNA synthesis, processing and degradation rates. Integrative analysis of these data with information on the binding of regulatory proteins and epigenetic marks allowed us to elucidate the contribution of these factors to the dynamics of transcriptional regulation. In the last part, we will introduce recent work done in our group on the semantic annotation of publicly available NGS datasets.
- Joint work with Mattia Pelizzola.
- 15:10 Daniele Ramazzotti, DISCo Università Milano Bicocca; Inferring Selective Advantage Relationships to Reconstruct Cancer Progression Models
- Cancer is a disease of evolution whose process is characterized by the accumulation of somatic alterations to the genome, which selectively make a cancer cell fitter to survive. Progression models for cancer, i.e., the sequences of mutations that lead to the emergence of the disease, are still poorly understood. The problem of reconstructing such progression models is not new; in fact, several methods to extract progression models from cross-sectional samples have been developed since the late 90s.
- In the past two and a half years, we have proposed two novel algorithms called CAPRESE (CAncer PRogression Extraction with Single Edges) and CAPRI (CAncer PRogression Inference) to reconstruct models of the order in which mutations accumulate, which characterizes cancer evolution. To the best of our knowledge, the existing techniques are based either on correlation or on maximum likelihood. In contrast, we perform the reconstruction by exploiting the notion of probabilistic causation in the spirit of Suppes’ causality theory. We note that in the context of biological systems and cancer progression, the notion of causality can be interpreted as the notion of "selective advantage" of the occurrence of a mutation.
- In this setting, we prove the correctness of our algorithms and characterize their performance. Finally, we discuss how our R Bioconductor package TRanslational ONCOlogy (TRONCO) is being used on real cancer datasets, e.g. atypical Chronic Myeloid Leukemia (aCML), Colorectal Cancer (CRC) and others, and how it highlights possibly biologically significant patterns in the inferred progressions.
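One common reading of Suppes' probabilistic causation can be sketched on cross-sectional data. This is not the CAPRESE or CAPRI implementation and does not use the TRONCO API: the function name, the use of marginal frequency as a stand-in for temporal priority, and the gene labels `geneA` and `geneB` are all assumptions made for illustration.

```python
def suppes_prima_facie(samples, cause, effect):
    """Check two Suppes-style conditions for 'cause' -> 'effect'.

    samples: list of dicts mapping an alteration name to 0/1 per tumor.
    Temporal priority is approximated by marginal frequency, which is an
    assumption of this sketch: the earlier event should be more frequent.
    """
    n = len(samples)
    n_cause = sum(1 for s in samples if s[cause])
    n_effect = sum(1 for s in samples if s[effect])
    n_both = sum(1 for s in samples if s[cause] and s[effect])
    if n_cause == 0 or n_cause == n:
        return False  # a conditional probability would be undefined
    p_effect_given_cause = n_both / n_cause
    p_effect_given_not_cause = (n_effect - n_both) / (n - n_cause)
    temporal_priority = n_cause > n_effect
    probability_raising = p_effect_given_cause > p_effect_given_not_cause
    return temporal_priority and probability_raising


# Hypothetical cross-sectional cohort of six tumors.
cohort = [
    {"geneA": 1, "geneB": 1},
    {"geneA": 1, "geneB": 1},
    {"geneA": 1, "geneB": 0},
    {"geneA": 1, "geneB": 0},
    {"geneA": 0, "geneB": 0},
    {"geneA": 0, "geneB": 0},
]
a_before_b = suppes_prima_facie(cohort, "geneA", "geneB")
b_before_a = suppes_prima_facie(cohort, "geneB", "geneA")
```

Probability raising is symmetric for positively associated events, so in this sketch it is the frequency-based ordering that decides the direction of a candidate edge.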
- 15:30 Saverio Vicario, CNR - ITB (Institute of Biomedical Technology), Bari; Algebra of biodiversity: how to represent change in communities with a special focus on HTC data
- Joint work with Nicola Mosca and Vito Renò
- 15:50 Giorgio EM Melloni, IIT@SEMM; CancerPanelSimulation: a resource to analyze and design genomic gene panels for cancer clinical trials
- In the era of precision medicine and personalized health, a tremendous effort to bring NGS-based technology to the clinical routine is currently ongoing. Nevertheless, the knowledge coming from genome-wide research studies is, in most cases, of little relevance for clinical practice and costly to apply in routine diagnostics. To bridge the gap between cancer research and clinical genomics, we propose CancerPanelSimulation, a suite of simple tools that drastically simplify the design and adaptation of cancer gene panels in a clinical setting. Sequencing a relatively small number of genes presents a cost advantage compared to a more comprehensive sequencing approach (e.g. whole exome sequencing, whole genome sequencing), but it also raises the question of which and how many genes to target. Our tool is able to integrate different sources of information (mutations, expression, copy number and translocation data) and simulate an in-silico clinical trial using publicly available or in-house data to infer the alteration frequency to be expected and the percentage of patient treatment allocation, and to compare drug coverage over multiple cohorts. Given a panel and tumor types to analyze, CancerPanelSimulation could represent a valuable resource to create a pilot basket or umbrella design, predicting patient allocation in a genomic-driven clinical trial and finalizing a targeted library to be submitted for sequencing.
- Joint work with Alessandro Guida, Laura Riva, Giuseppe Curigliano, Piergiuseppe Pelicci and Luca Mazzarella.
- 16:10 Conclusions and Final Greetings
Directions to reach Edificio U4 of the Università degli Studi di Milano Bicocca can be found here. Room U4-08 Sironi can be found at Floor -1. Turn left in the hallway in front of the bottom of the stairs.