Data Integration and Computer Modelling

The technologies in SMP Protein (MS, protein-DNA and protein-protein interactions, etc.) will generate large and diverse data sets that require sophisticated tools for data standardization, normalization and integration, as well as for the modeling and simulation of relevant biological processes such as signaling pathways. For example, different MS manufacturers store data in different proprietary formats, which severely limits data analysis, the exchange of raw data sets and software development. Furthermore, the many existing strategies for assigning peptides to mass spectra yield partly conflicting results. Protein-protein interaction assays and the networks constructed from them suffer from high false positive and false negative rates (incorrect folding, inadequate subcellular localization, etc.), which greatly degrade the reproducibility and comparability of these approaches. The aims of this project are

- the design of easily transferable data formats for different proteomics technologies
- exploratory data analysis and normalization for a core set of experimental methods
- the installation of the modeling and simulation resource PyBioS for the analysis of in silico experiments and the dynamical characterization of biological models

We will implement a three-level approach with defined interfaces:

1.) Standardization and normalization of experimental data
The rapid development and increasing complexity of proteomics techniques lead to a large amount of semi-structured high-throughput data. Data integration requires the standardization of experimental techniques and the development of a common descriptive language for the heterogeneous data formats. To handle this situation we will develop an XML-based data format that covers an initial stock of experiments and serves as a common idiom for the import of high-throughput data into the analysis platform. Where possible we will adopt existing international standards. On this first level of data integration there will be tools for the normalization and grouping of homogeneous experimental data. The consistency of the primary data will be checked at the experimental level. Normalization requires the systematic identification of the influence factors that determine the experimental outcome and the elimination of technical bias.
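
As a rough illustration of this first level, the sketch below reads a hypothetical XML record of peptide intensities and applies a simple log2/median-centering normalization. The element names and the normalization strategy are placeholders chosen for this example and are not the actual SMP Protein format or pipeline.

# Illustrative sketch only: element names (experiment, peptide, intensity)
# and the normalization strategy are hypothetical placeholders.
import xml.etree.ElementTree as ET
import math

RECORD = """
<experiment technology="MS" sample="S1">
  <peptide sequence="AAGELK"  intensity="1520.4"/>
  <peptide sequence="VLDTK"   intensity="310.7"/>
  <peptide sequence="GILSEQK" intensity="990.2"/>
</experiment>
"""

def read_intensities(xml_text):
    """Import one hypothetical XML record into a simple Python mapping."""
    root = ET.fromstring(xml_text)
    return {p.get("sequence"): float(p.get("intensity"))
            for p in root.findall("peptide")}

def median_center_log2(values):
    """Log-transform and median-center intensities to remove a global
    technical bias between runs (one simple normalization strategy)."""
    logged = {k: math.log2(v) for k, v in values.items()}
    median = sorted(logged.values())[len(logged) // 2]
    return {k: v - median for k, v in logged.items()}

print(median_center_log2(read_intensities(RECORD)))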

2.) Correlation and integration rules
Quality control tools will check the consistency and redundancy of the imported data. On this level heterogeneous data types will be correlated based on predefined integration rules. These rules will be iteratively extended and improved over the course of the project. Using the integration rules, the secondary data will be transformed into condensed, higher-structured data types that combine different primary data types.
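
The following toy example indicates what such an integration rule could look like: it merges two hypothetical primary data types (MS abundances and yeast two-hybrid interaction partners) into a condensed secondary record and flags unsupported entries for review. All identifiers and field names are invented for illustration and do not reflect the project's actual rule set.

# Illustrative sketch: a toy "integration rule" combining two primary data
# types into one condensed secondary record. All names are hypothetical.

def integrate(abundances, interactions):
    """Keep proteins supported by both data types and attach the evidence
    from each source; unmatched entries are flagged for review."""
    integrated, unmatched = {}, []
    for protein, abundance in abundances.items():
        partners = interactions.get(protein)
        if partners:
            integrated[protein] = {"abundance": abundance,
                                   "partners": sorted(partners)}
        else:
            unmatched.append(protein)
    return integrated, unmatched

ms_data  = {"P53": 12.4, "MDM2": 3.1, "ACTB": 88.0}
y2h_data = {"P53": {"MDM2", "EP300"}, "MDM2": {"P53"}}

combined, to_review = integrate(ms_data, y2h_data)
print(combined)    # proteins with evidence from both sources
print(to_review)   # e.g. ['ACTB'], flagged as unsupported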

3.) Modeling and simulation
On the third data integration level the consistency of the different secondary data structures will be checked. The data structures will be integrated into biological processes, for example signaling pathways or regulatory networks. These objects will either be predefined or result from the simulation and modeling approaches of the platform. Here we incorporate our modeling and simulation platform PyBioS, developed in the course of NGFN-1.
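
To indicate the kind of dynamical characterization meant here, the sketch below simulates a minimal, invented mass-action model of protein activation and deactivation with a plain Euler integrator. It does not use the PyBioS API; species, reactions and rate constants are chosen purely for illustration.

# Illustrative sketch only: a minimal mass-action kinetic model of the kind
# handled by modeling platforms such as PyBioS, simulated with a simple
# Euler integrator. All quantities are invented for this example.

def simulate(k_act=0.5, k_deact=0.1, s_total=1.0, dt=0.01, t_end=20.0):
    """Simulate activation of a signaling protein S -> S* (rate k_act)
    and its deactivation S* -> S (rate k_deact)."""
    s_active = 0.0
    trajectory = []
    t = 0.0
    while t <= t_end:
        s_inactive = s_total - s_active
        ds = k_act * s_inactive - k_deact * s_active   # mass-action kinetics
        s_active += ds * dt
        trajectory.append((round(t, 2), s_active))
        t += dt
    return trajectory

# In silico experiment: compare the reached steady state for two activation rates.
for k in (0.5, 2.0):
    print(k, simulate(k_act=k)[-1])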

The sub-project will be closely linked, through the analysis and modeling of primary data, to the experimental sub-projects, in particular to projects 2.2, 3.1, 3.2 and 3.3.