Quality Management "Clinic, Data Management and Data Analysis"

Quality management coordinator:
Christian Lawerenz
German Cancer Research Center (DKFZ)
Div. "Intelligent Bioinformatics Systems" (iBioS)
Phone: +49 6221 422724
E-Mail: c.lawerenz@dkfz.de
 

1 Introduction


Since the initiation of the National Genome Research Network in 2001, the requirements for data management, data integration and data exploration have changed substantially. The need for a comprehensive, integrated representation of NGFN and NGFN-related data for the scientific community has yet to be met. In contrast to the successful large-scale operation of centralized data facilities such as the NCBI or the EBI, the NGFN needs a unifying but flexible concept that covers the local needs for data management and analysis as well as a global framework for decentralized access to the data sources distributed over many projects within the collaborative framework of the NGFN. While promising solutions for the more centralized resources, such as the systematic methodological platforms or the RZPD, appear feasible within the time frame of NGFN II, the dispersed, diverse and heterogeneous data of the clinical networks still await a suitable and manageable technical solution. Because the data are heterogeneous and distributed over a large number of decentralized nodes, the design and implementation call for an advanced but uncomplicated concept to meet these technical challenges. In the past, several individual solutions have been implemented (e.g. RZPD Gene Matrix, DKFZ iCHIP, GSF-MIPS data repository).

2 Challenges and solutions

The information technology challenge for the NGFN is not unique but rather reflects the general situation in the life sciences, where an unprecedented amount of data is generated at a rapidly increasing rate. These data have to be set in context, i.e. related to other relevant data generated by other sources under different conditions and standard operating procedures (SOPs). These issues have recently been the subject of an EU workshop and will become part of the Framework VII programme (Innovative Medicine, subheading "Knowledge Management"). In the past, the dissemination of suitable (and even available) solutions has been handicapped by (i) the lack of local human resources and know-how, (ii) the reluctance of the experimental groups to invest in professional solutions, (iii) the still evolving industrial standards for data integration and middleware for decentralized data exchange, and (iv) the general underestimation of the task with respect to time and cost.

In the future, the need for general-purpose database solutions within research institutions will inevitably increase. The NGFN Bioinformatics Platform has addressed some of the key issues and has initiated activities at the national and international level.

1) Quality control issues
Representatives of the clinical networks, GEM and the SMPs have developed a common nomenclature for data capture and harmonized exchange, represented by the NGFN Ontology. More than one hundred NGFN SOPs enable the reuse of methods and the comparability of experimental conditions and data formats. Well-established international standards, such as MIAME, MGED and mzData, have been integrated into essential NGFN platforms (iCHIP, SMP Protein database, ...). Newly evolving standards in the areas of RNAi and cellular assays have also been initiated by NGFN partners (MARIE, MIACA). These activities have led to substantial contributions in international committees (LifeDB, PSI etc.).

2) Data integration platforms of different experimental technologies
On a national level, several database initiatives for experimental data exist both within and outside the NGFN. Together they cover the most important biological entities. The scope of these databases comprises microarray, gel electrophoresis, mass spectrometry, array-CGH, tissue microarray, cellular assay and RNAi experiments.

An annual database workshop was initiated in 2005 to deal with coordination issues and, in particular, to avoid redundant development efforts. The incorporation of the enhanced quality standards described above has been a necessary and important factor for the integration of clinical and experimental data. The standardized exchange features already incorporated enable the dissemination and reuse of data.
The database workshop 2006 was intended to highlight the variety of database solutions developed and employed within the NGFN. A primary focus was the technical implementation of these solutions and an overview of the groups and persons involved. Another topic was the initiation of high-level cooperation within the NGFN.

4 Mandatory conceptual requirements
Information and knowledge from any NGFN project need to be accessible in a coherent manner, independent of their origin, using state-of-the-art internet technologies. Information and knowledge should be distributed in order to avoid unnecessary redundant storage. Information and knowledge should be accessible through a common web interface as well as through standardized programming interfaces. The NGFN information needs to become compliant with the evolving international standards (PSI-MI, SBML, MIRIAM, ...).
So far, severe obstacles have arisen from the implementation of monolithic software that is only capable of importing and processing information for specialized tasks but unable to provide reusable distributed components. This is not a new problem and has long been addressed with mature solutions in information technology. It is therefore natural to subdivide the problem into the following separate concerns (as also concluded by the EU workshop):

1. Formal data representation schemes and standards (in XML)
2. Business topics, business objects, data elements (software components capable of accessing and processing information according to 1.)
3. Core vocabularies, thesauri, taxonomies, ontologies, etc.
4. Associated rules / methods capable of inferring knowledge from the underlying information and ontologies
5. Knowledge representations
6. Functional services, processes (Web Services and workflows)

The urgent need for the further integration and dissemination of the NGFN data lies in the implementation of available but only rarely applied technologies, such as XML-based web services (see 1. and 6.), layered database and retrieval architectures (see 2.), modular and flexible open communication between data resources and services (Web Service technology, see 6.), and the formalized description of information and knowledge according to accepted internet standards such as XML and the Semantic Web (see 1., 3. and 4.).
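As an illustration of the Web Service approach outlined above, the following minimal Python sketch queries a hypothetical XML-based service endpoint and parses the returned records. The URL, query parameters and XML element names are placeholders and do not refer to an existing NGFN interface.

# Minimal sketch of accessing a distributed, XML-based data resource.
# The endpoint URL and XML element names are hypothetical placeholders.
import urllib.request
import xml.etree.ElementTree as ET

ENDPOINT = "https://example.org/ngfn-node/genes"  # placeholder service URL

def fetch_gene_records(gene_symbol: str):
    """Query a (hypothetical) Web Service node and return parsed gene records."""
    url = f"{ENDPOINT}?symbol={gene_symbol}&format=xml"
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    records = []
    for record in tree.getroot().findall("geneRecord"):  # assumed element name
        records.append({
            "id": record.findtext("identifier"),
            "network": record.findtext("network"),
            "experiment": record.findtext("experimentType"),
        })
    return records

if __name__ == "__main__":
    for rec in fetch_gene_records("TP53"):
        print(rec)

Such a layered setup keeps the local data resource behind a standardized interface, so that the portal or other services can consume it without knowledge of the underlying database schema.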

One of the primary aims of this quality management project is to enhance the value of resources and data generated within the NGFN. A key focus in this regard lies in the validation of data output quality and the assurance of data clarity. Utilizing comprehensible and reusable methodologies together with data derived from defined protocols/documentation, we seek to ensure ‘research compatibility’ among independent research institutions.

In the context of strengthening quality assurance within the NGFN, we have created specific clinical parameter sets and incorporated standard operating procedures (SOPs) for the different networks (e.g., cancer, inflammation/infection, neurobiology). International standards together with data and workflow models have been incorporated and improved, and quality standards for handling high-throughput data have been established.

5 Recommendations
The following recommendations have to be met in a future layout for the NGFN data and application services:

1. Use and distribution of generally accepted standards and service-oriented middleware technologies for data access and exchange across the internet
Information within the NGFN on any gene or gene-related topic under investigation should be accessible. This requirement asks for a structured organisation of data and retrieval systems. The strategy for data management in the NGFN should avoid the development of proprietary solutions for basic technologies such as data warehouses, retrieval systems, and middleware components. Layered design and the use of open industrial standard technologies should be encouraged. Training for the basic design of data resources and the application of suitable IT technologies should be provided. Tasks for the design of NGFN databases should be standardized and cover the issues discussed above. Compliance with the internationally accepted standards for data exchange should be mandatory. Local information resources need to be represented with these standards and be capable of distributing their information based on Web Service technology.

2. Global integration; installation of suitable interfaces and Web Services
To propagate NGFN research and results, NGFN data should be accessible to the scientific community through the NGFN portal in order to improve the international visibility of the NGFN. This complex task requires the integration of independent heterogeneous data linked by semantic (e.g. gene identifiers), functional, or methodological classification (e.g. disease networks). In a first step, the NGFN portal will provide a framework based on the input provided by the various disease networks, identify the individual objects (e.g. genes under investigation), and link this information to publicly available information such as GO classification and interaction data (e.g. functional modules or gene sets). The aim is to access distributed information based on Web Service enabled resources.
Obviously, this concept has to be extended within a future NGFN network to accommodate new data, new types of information and information available from external sources. Complementary to the local sources providing detailed experimental information, the portal should evolve into an internationally acknowledged information resource.
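As a minimal sketch of the semantic linkage described in this recommendation, the example below joins local experimental records with public GO annotation on a shared gene identifier. All identifiers and values are invented placeholders, not NGFN data.

# Sketch: joining local experimental records with public annotation
# on a shared gene identifier. All data below are invented placeholders.
local_records = [
    {"gene_id": "ENSG000001", "network": "cancer net", "result": "up-regulated"},
    {"gene_id": "ENSG000002", "network": "inflammation net", "result": "down-regulated"},
]
public_go_annotation = {
    "ENSG000001": ["GO:0006915 apoptotic process"],
    "ENSG000002": ["GO:0006954 inflammatory response"],
}

def link_by_gene_id(records, annotation):
    """Attach public GO terms to each local record via the gene identifier."""
    return [
        {**rec, "go_terms": annotation.get(rec["gene_id"], [])}
        for rec in records
    ]

for linked in link_by_gene_id(local_records, public_go_annotation):
    print(linked)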
3. Integration and knowledge-management for the NGFN
Relations among independent data sets are critical for the display of complex information such as biological knowledge related to genes or functional modules as the basis of cellular processes and molecular networks. The advent of Systems Biology has to be supported by the next generation of data structures and tools that allow for dynamics and modelling (e.g. MIRIAM, SBML, CellML). Theoretical modellers and practical experimentalists need well-defined structures in all phases of a systems biology endeavour for the best possible collaboration. The different entities of systems biology, i.e. experimental data for the verification of models, components and reactions of biological systems, and mathematical models, should be easily accessible to the community. This area is part of active research at several centres (e.g. DKFZ, EMBL, GSF).
4. Integration of translational research
A crucial issue in the area of translational research is the missing link between the clinic and molecular biological experiments. On the one hand, there is the necessity to relate experimental results to the clinical case: basic scientific discoveries should be translated into clinical applications. On the other hand, clinical observations should be used in basic research. Furthermore, there is an essential need to utilise the experimental facilities provided by the NGFN. The establishment and support of integration platforms for translational research should therefore be pursued, and process-oriented service strategies need to be developed. Efforts should be made to reduce clinical complexity by cutting the numerous specifications down to the essential requirements. This harmonization process will allow the use of clinical data pools. The policy for the exchange of clinical and research data with respect to data privacy and data security has to be elaborated very carefully. Once these prerequisites are met, data from both the experimental researcher and the clinician will be available for translational research. With such a platform in place, high-throughput experiments based on numerous samples coming from biomaterial banks of clinical partners outside the NGFN could be performed thanks to the integration platforms and harmonization efforts. This infrastructure will enable a more efficient and productive research process for clinical centers.
 
Conclusion:
Data management and data integration are key issues for the future development of the NGFN. The rigorous adoption of industrial standards for the communication with and between data, tools, and knowledge resources is necessary to overcome the current limitations of data dissemination. This has to be realized in compliance with international standards and should become an active contribution to the European efforts in this area (e.g. Framework VII). Future developments should continue the current developments of the Bioinformatics Platform but overcome their limitations, in particular with respect to access, availability and NGFN-wide integration of local data resources (e.g. iCHIP, data integration and NGFN Portal).
 



Standardisation of clinical data & data management

To meet the current requirements of clinical and basic research, we have provided a core parameter set for generic use across all networks and more specific parameter sets for individual networks.
The core parameter set includes attributes that are relevant for all disease networks such as age, diagnosis, sample type and date of biopsy. These basic data are supplemented by specific parameter sets for the individual indication areas.
The pool of basic parameters has been developed for almost all of our clinical partners in NGFN2, e.g., cardiovascular diseases, epidemiological studies including environmental and neurological specifications, and neuroblastoma/cancer. These standards make it possible to transfer data within the NGFN and to build databases in which patient records involving different disease pictures are merged. Thus, all network partners are able to analyze the data obtained from all patients and studies conducted. Together with the users, we continually develop these parameter sets further.
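As a minimal sketch of how such a core parameter set might be represented for exchange, the following Python example serializes one record to XML. The field names follow the attributes listed above (age, diagnosis, sample type, date of biopsy), while the element names and values are illustrative assumptions, not the actual NGFN schema.

# Sketch: a core clinical parameter record serialized to XML for exchange.
# Element names and values are illustrative assumptions, not the NGFN schema.
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class CoreParameters:
    patient_pseudonym: str
    age: int
    diagnosis: str
    sample_type: str
    biopsy_date: str  # ISO 8601 date

    def to_xml(self) -> ET.Element:
        root = ET.Element("coreParameterSet")
        for field, value in vars(self).items():
            ET.SubElement(root, field).text = str(value)
        return root

record = CoreParameters("P-0001", 54, "neuroblastoma", "fresh frozen tissue", "2006-03-15")
print(ET.tostring(record.to_xml(), encoding="unicode"))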

To ensure smooth data transfer, all participants must use a common vocabulary. We have therefore expressed the structure and terminology of the NGFN ontology in a dedicated XML-based language, the Web Ontology Language (OWL). Thanks to the semantic description capabilities of OWL, the ontology is specified precisely in terms of rules and terminologies and can serve as a template for data transfer. NGFN partners can now view the standardized set using the Protégé browser, which is installed at the Resource Center for Genome Research (RZPD) and can be accessed by logging onto the NGFN intranet, thus circumventing the need to install the software locally.
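As a sketch of how an OWL representation of the NGFN ontology could be consumed programmatically, the following Python example loads an OWL file with the third-party rdflib library and lists the classes it defines. The file name is a placeholder.

# Sketch: loading an OWL ontology and listing its classes with rdflib.
# The file name "ngfn_ontology.owl" is a placeholder.
from rdflib import Graph, RDF, OWL

graph = Graph()
graph.parse("ngfn_ontology.owl", format="xml")  # OWL serialized as RDF/XML

# Enumerate all classes defined in the ontology.
for cls in graph.subjects(RDF.type, OWL.Class):
    print(cls)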

The NGFN ontology is already integrated into the NGFN portal for high-throughput studies (MIPSexpress) and into iCHIP as a local database platform for various disease-related networks. Furthermore, important databases, such as the epidemiological databases in Munich, Kiel and Bonn, the upcoming NGFN metadatabase and the upcoming cardiovascular database in Munich, are based upon the NGFN ontology.

A final important issue relates to data protection and the need to ensure constant compliance with it. Clinical data must be anonymized or pseudonymized. The realization of NGFN database systems will therefore be in accordance with the respective ethics commissions and bodies such as the Deutsche Forschungsgemeinschaft (DFG) and the Telematikplattform für Medizinische Forschungsnetze (TMF e. V.). We have developed a data security concept addressing legal, ethical and organisational issues of the establishment and maintenance of databases, the recruitment of patients, cooperation contracts and experimental design. The concept is closely aligned with the generic data security concept developed by the TMF working groups "data security" and "biobanks". According to this concept, research results and patient data should preferably be kept in disconnected databases. Based on TMF applications, the implementation of a pseudonymization process in the field of clinical medicine will be developed.
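A minimal sketch of the kind of pseudonymization step described above: a keyed hash turns a patient identifier into a stable pseudonym, so that research data can be stored separately from identifying data. The key handling shown is purely illustrative and is not the actual TMF procedure.

# Sketch: deriving a pseudonym from a patient identifier with a keyed hash,
# so that identifying data and research data can be kept in separate databases.
# Key handling here is purely illustrative, not the actual TMF procedure.
import hmac
import hashlib

SECRET_KEY = b"replace-with-securely-managed-key"  # held only by a trusted third party

def pseudonymize(patient_id: str) -> str:
    """Return a stable pseudonym for a patient identifier."""
    digest = hmac.new(SECRET_KEY, patient_id.encode("utf-8"), hashlib.sha256)
    return "PSN-" + digest.hexdigest()[:16]

# The mapping patient_id -> pseudonym is stored only in the identity database;
# the research database records results under the pseudonym alone.
print(pseudonymize("patient-000123"))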
We also focus on the complete logging of data flow for clinical studies in order to follow the good clinical practice guidelines.


Standard operating procedures

In order to improve quality assurance, existing standard operating procedures (SOPs) have been refined and new ones defined. A minimal set of protocols was selected, covering different practical issues such as the preparation and storage of clinical sample material and the associated technical processes. Together with the working group "Microarrays" we developed protocols covering different aspects of a microarray experiment (see Quality Management "Microarrays").
These SOPs will in future be extended to include other methodologies. The protocols cover important aspects of experimental design and quality control. Regular updates incorporating the feedback from protocol users will be implemented on a continuous basis.

  • Collection and storage of blood for DNA preparation
  • Preparation of DNA from blood using silica particles
  • Preparation of DNA from blood using Invisorb Spin Blood Midi
  • DNA quality and quantity
  • Recommendations for normalization of microarray data
  • Recommendations for Profiling using Time of Flight Mass Spectrometry


    List of authors involved in the development of standard protocols:



    Websites of SMPs and Disease-oriented Genome Networks involved in this quality management project:
