Data Integration

InterMine is designed to integrate multiple types of data into a single data warehouse. Each type of data to be loaded is defined as a 'source'; a source is a directory that contains everything needed to parse and integrate a particular type of data. Common sources are provided for several biological data types, and you can create your own.

Sources are loaded in the order in which they are listed in the project.xml file. If an object is loaded and an object representing the same entity already exists in the database, the two are merged. Each class can be configured with one or more integration keys, which define how merging is performed. Where multiple sources provide values for the same field, priorities must be defined to determine the outcome of potential conflicts.
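The merging behaviour described above can be illustrated with a minimal sketch. This is not InterMine's implementation or API: the function `integrate`, the dictionary-based records, the source names and the field values are all hypothetical, chosen only to show how an integration key identifies matching objects and how a per-field priority list settles a conflict.

```python
# Hypothetical sketch only; none of these names come from InterMine itself.
def integrate(sources, key_fields, priorities):
    """Load records source by source, merging on an integration key.

    sources:    list of (source_name, [record dict, ...]) in load order
    key_fields: field names that together form the integration key
    priorities: {field_name: [source names, highest priority first]}
    """
    store = {}       # key tuple -> merged record
    provenance = {}  # key tuple -> {field: source that supplied the value}
    for source_name, records in sources:
        for record in records:
            key = tuple(record[f] for f in key_fields)
            merged = store.setdefault(key, {})
            origin = provenance.setdefault(key, {})
            for name, value in record.items():
                if name not in merged:
                    # First value seen for this field: just store it.
                    merged[name] = value
                    origin[name] = source_name
                elif merged[name] != value:
                    # Conflict: a priority list for the field is required.
                    order = priorities.get(name)
                    if order is None:
                        raise ValueError(f"no priority for conflicting field {name!r}")
                    if order.index(source_name) < order.index(origin[name]):
                        merged[name] = value
                        origin[name] = source_name
    return store


# Two made-up sources describing the same gene, with a conflicting length.
sources = [
    ("gff-source", [{"primaryIdentifier": "gene1", "length": 1200, "symbol": "abc"}]),
    ("fasta-source", [{"primaryIdentifier": "gene1", "length": 1203, "sequence": "ATGC"}]),
]
result = integrate(sources, ["primaryIdentifier"], {"length": ["fasta-source", "gff-source"]})
print(result[("gene1",)])
```

In this sketch both records share the key value "gene1", so they become a single object; the conflicting "length" is taken from "fasta-source" because it is listed first in the priority list for that field. A conflict on a field with no priority list raises an error rather than silently picking a value.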

See Also: PrimaryKeys, PriorityConfig


Loading Data

The InterMine system is designed to integrate data from multiple sources. Because it is a data warehouse, it should be rebuilt from scratch each time the data need to be updated; there is no facility to curate or update data.

Sources and Mines

Each type of data to be loaded is defined as a 'source'; a source is a directory that contains everything needed to parse and integrate a particular type of data. Common sources are provided for several biological data types, and you can create your own. The sources to include, and the files they load, are configured in a project XML file in each mine.

Data model

InterMine uses an object-based data model which can easily adapt to include new types of data. Each source can include extra fields or classes as 'additions' in a model XML format. The database schema and Java object code are automatically created from the model XML.
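The idea of 'additions' can be sketched as follows. This is an illustration under stated assumptions, not InterMine's model format or code: the real model is written in XML, whereas here a model is a plain dictionary mapping class names to their fields, and `apply_additions`, the class names and the field types are hypothetical.

```python
import copy

# Hypothetical sketch: a model is {class name: {field name: type name}}.
def apply_additions(core_model, additions):
    """Return a new model containing the core classes plus a source's additions."""
    model = copy.deepcopy(core_model)  # leave the core model untouched
    for class_name, fields in additions.items():
        # Extend an existing class, or create the class if it is new.
        model.setdefault(class_name, {}).update(fields)
    return model


core = {"Gene": {"primaryIdentifier": "String", "length": "Integer"}}
extra = {
    "Gene": {"symbol": "String"},   # extra field on an existing class
    "Pathway": {"name": "String"},  # entirely new class
}
model = apply_additions(core, extra)
print(sorted(model))
```

Here the additions extend the existing "Gene" class with one field and introduce a new "Pathway" class, while the core model itself is left unchanged.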

Data Integration

It is likely that data loaded from different sources will provide information about the same objects. For example, a GFF file may provide the genome location of a feature while a FASTA file provides its sequence. The data integration system makes sure that corresponding genes loaded from the two sources become the same gene and that any conflicts between data are resolved.


Tutorials

  • GettingStarted - in-depth guide to building a data warehouse and web interface with example P. falciparum data
  • QuickStart - brief overview on how to create a new InterMine instance
  • SourceHowto - guide to creating a new 'source' to load your own data

Reference Documents