Data Science in Life Sciences

Team Leader Data Science in Life Sciences: Prof. Dr. Abdullah Kahraman

Computational identification of new cancer and disease mechanisms for superior clinical diagnostics and therapeutics.

Continuous new developments in DNA sequencing and mass spectrometry have opened up the possibility of probing the molecular landscape of single and bulk cells on an unprecedented scale. 

For example, The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have applied Next Generation Sequencing (NGS) technologies to screen the tumor mutational landscape of more than 85,000 cancer patients, generating over 3 petabytes of data. Using this large dataset, multi-institutional research teams have identified hundreds of cancer driver genes, which can drive cancerous growth in normal cells when mutated. With this knowledge, pharma companies have started developing novel targeted therapies and immunotherapies for precision medicine.

Our research aligns with these international, multi-institutional research programs for precision medicine. We have extensive experience in developing novel omics-based diagnostic algorithms and in analyzing non-coding mutations in cancer patients. We focus in particular on detecting cancer-specific alternative splicing events for diagnostic purposes. For our investigations, we integrate whole-genome next-generation sequencing data, RNA sequencing data, mass spectrometry-based proteomics data, experimental X-ray protein structures, and protein interaction network data.

Cancer-specific Alternative Splicing Events

Alternative RNA splicing is a cellular regulatory mechanism that creates multiple mRNA molecules from the same gene and is often disrupted in disease. In the international Pan-Cancer Analysis of Whole Genomes (PCAWG) study, we showed that such disruptions in alternative splicing patterns are widespread in cancer. By developing new computer algorithms, machine learning models, and databases (www.caniso.net), we aim to further understand the origins and consequences of these disruptions and to work towards new splicing biomarkers for superior diagnostics, treatment, and medication.

Alternative splicing overview
Overview of the number of cancer-specific Most Dominant Transcripts (cMDT) in 27 different cancer types.
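
To illustrate the idea of a cancer-specific most dominant transcript, the following minimal sketch works under stated assumptions. It treats a transcript as "most dominant" when its abundance is at least twice that of the runner-up in the same gene. It calls a dominant transcript "cancer-specific" when the tumor and normal samples each have a dominant transcript for a gene and the two differ. The fold threshold, the function names, and the toy expression values are hypothetical and are not taken from the group's published method.

```python
from collections import defaultdict


def most_dominant_transcripts(expression, min_fold=2.0):
    """Return {gene: transcript} for genes with one clearly dominant transcript.

    expression maps (gene, transcript) -> abundance (for example TPM).
    min_fold is an assumed threshold: the top transcript must be at least
    min_fold times as abundant as the second most abundant one.
    """
    by_gene = defaultdict(list)
    for (gene, transcript), value in expression.items():
        by_gene[gene].append((value, transcript))

    dominant = {}
    for gene, entries in by_gene.items():
        entries.sort(reverse=True)  # highest abundance first
        top_value, top_transcript = entries[0]
        runner_up = entries[1][0] if len(entries) > 1 else 0.0
        if top_value > 0 and top_value >= min_fold * runner_up:
            dominant[gene] = top_transcript
    return dominant


def cancer_specific_mdts(tumor, normal, min_fold=2.0):
    """Return genes whose dominant transcript differs between tumor and normal."""
    tumor_mdt = most_dominant_transcripts(tumor, min_fold)
    normal_mdt = most_dominant_transcripts(normal, min_fold)
    return {
        gene: transcript
        for gene, transcript in tumor_mdt.items()
        if gene in normal_mdt and normal_mdt[gene] != transcript
    }


# Toy data with made-up gene and transcript identifiers.
tumor = {
    ("GENE_A", "A-201"): 5.0,
    ("GENE_A", "A-202"): 40.0,
    ("GENE_B", "B-201"): 30.0,
    ("GENE_B", "B-202"): 3.0,
}
normal = {
    ("GENE_A", "A-201"): 50.0,
    ("GENE_A", "A-202"): 4.0,
    ("GENE_B", "B-201"): 25.0,
    ("GENE_B", "B-202"): 2.0,
}

print(cancer_specific_mdts(tumor, normal))  # {'GENE_A': 'A-202'}
```

In this toy example only GENE_A switches its dominant transcript between normal and tumor, so it is the only gene reported.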

NGS Data Interpretation

Molecular tumor boards are interdisciplinary meetings in hospitals, where oncologists, pathologists, bioinformaticians, and molecular biologists meet to discuss Next Generation Sequencing (NGS) results of tumor biopsies. Since costs for NGS assays are dropping, hospitals have started to screen larger regions of cancer genomes for actionable mutations. The increasing size of the assays, however, increases the complexity of NGS results. To streamline these discussions, our group has developed the MTPpilot software (www.MTPpilot.org), which supports the interpretation of complex NGS results at molecular tumor boards. We are continuously improving the software, adding new functionality, and creating novel visualization tools.

Ideogram
Ideogram of a melanoma patient in MTPpilot (www.mtppilot.org) with an NRAS mutation and multiple gene copy number amplifications.

Data Driven Molecular Modelling

Cancer cells have many deregulated protein complexes. Traditionally, these multimeric protein complexes have been studied by X-ray crystallography and cryo-electron microscopy (cryo-EM). Recently, structural proteomics techniques such as chemical cross-linking mass spectrometry (CX-MS) and limited proteolysis coupled to targeted mass spectrometry (LiP-SRM) have emerged as powerful complementary techniques. By integrating data from these techniques with data-driven modeling via ROSETTA and AlphaFold, we aim to predict the structure of protein isoforms and large protein complexes in cancer cells. The structural information will help us understand the functional impact of mutations and protein isoforms on cellular complexes and pathways.

Xwalk example.
Solvent Accessible Surface Distance (SASD) as computed with Xwalk (www.xwalk.org)
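
The intuition behind a solvent accessible surface distance is that a cross-linker cannot pass through the protein, so the relevant distance between two residues is the shortest path through the solvent rather than the straight line. The toy sketch below illustrates this with a breadth-first search on a small 2D grid in which `#` marks protein interior and `.` marks solvent. It is only an illustration under simplifying assumptions (2D instead of 3D, four-connected moves, a hypothetical function name and grid spacing) and does not reproduce Xwalk's actual implementation.

```python
from collections import deque


def shortest_solvent_path(grid, start, goal, spacing=1.0):
    """Length of the shortest path from start to goal that avoids '#' cells.

    grid is a list of equal-length strings; start and goal are (row, col).
    spacing is the assumed edge length of one grid cell.
    Returns None when the goal cannot be reached through solvent.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), steps = queue.popleft()
        if (r, c) == goal:
            return steps * spacing
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), steps + 1))
    return None


# A 3x3 block of "protein" surrounded by solvent.
grid = [
    ".....",
    ".###.",
    ".###.",
    ".###.",
    ".....",
]

# The two end points sit on opposite sides of the block.
print(shortest_solvent_path(grid, (2, 0), (2, 4)))  # 8.0
```

The straight-line distance between the two end points is 4 grid units, while the path around the block is 8. A difference of this kind is what separates a surface distance from a Euclidean one.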