Chado
We have developed an InterMine data source that can use a GMOD Chado database as a source for an InterMine warehouse. The eventual aim is to allow import of any Chado database with some configuration. This will provide a web environment to perform rapid, complex queries on Chado databases with minimal development effort.
# Converter

The converter for this source is the `ChadoDBConverter` class. This class controls which `ChadoProcessor`s are run. A `ChadoProcessor` class corresponds to a Chado module; for example, the sequence module is processed by the `SequenceProcessor` and the stock module by the `StockProcessor`.
# Chado tables

The `chado-db` source is able to integrate objects from a Chado database. Currently, only tables from the Chado sequence and stock modules are read.
These tables are queried from the chado database:

- `feature`: used to create objects in the ObjectStore.
  - The default configuration only supports features that have a Sequence Ontology type (eg. `gene`, `exon`, `chromosome`).
  - Each new feature in InterMine will be a sub-class of `SequenceFeature`.
- `featureloc`: used to create `Location` objects to set the `chromosomeLocation` reference in each `SequenceFeature`.
- `feature_relationship`: used to find `part_of` relationships between features.
  - This information is used to create parent-child references and collections.
  - Examples include setting the `transcripts` collection in the `Exon` objects and the `gene` reference in the `Transcript` class.
- `dbxref` and `feature_dbxref`: used to create `Synonym` objects for external identifiers of features.
  - The `Synonym`s will be added to the `synonyms` collection of the relevant `SequenceFeature`.
- `featureprop`: used to set fields in features based on properties.
  - An example from the FlyBase database: the `SequenceFeature.cytoLocation` field is set using the `cyto_range` featureprop.
- `synonym` and `feature_synonym`: used to create extra `Synonym` objects for chado synonyms and to set fields in features.
  - The `Synonym`s will be added to the `synonyms` collection of the relevant `SequenceFeature`.
- `cvterm` and `feature_cvterm`: used to set fields in features and to create synonyms based on CV terms.
- `pub`, `feature_pub` and `db`: used to set the `publications` collection in the new `SequenceFeature` objects.
Additionally, the `StockProcessor` class reads the tables from the chado stock module, eg. `stockcollection`, `stock` and `stock_genotype`.
# Default configuration

By default, `ChadoDBConverter` restricts its query on the `feature` table to a limited list of feature types. The list can be changed by sub-classing the `ChadoDBConverter` class and overriding the `getFeatureList()` method. The `featureloc`, `feature_relationship` and `pub` tables will then be queried to create locations, parent-child relationships and publications (respectively).
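As a toy illustration of the override (this is not InterMine code: the base class here is a stand-in for the real `ChadoDBConverter`, and the default type list is invented for the example), widening the feature-type list looks like:

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for the real ChadoDBConverter (which needs a live Chado
// database); it only models the getFeatureList() hook described above.
class DefaultConverter {
    protected List<String> getFeatureList() {
        // hypothetical default: a few Sequence Ontology feature types
        return new ArrayList<>(List.of("gene", "mRNA", "exon", "chromosome"));
    }
}

// Sub-class that widens the list of feature types to load.
class MyConverter extends DefaultConverter {
    @Override
    protected List<String> getFeatureList() {
        List<String> types = super.getFeatureList();
        types.add("pseudogene"); // extra type this mine wants
        return types;
    }
}

public class FeatureListDemo {
    public static void main(String[] args) {
        System.out.println(new MyConverter().getFeatureList());
        // prints: [gene, mRNA, exon, chromosome, pseudogene]
    }
}
```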
# Converter configuration

Sub-classes can control how the Chado tables are used by overriding the `getConfig()` method and returning a configuration map.
# Source configuration

Example source configuration for reading from the ''C.elegans'' Chado database:
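A sketch of such a source element (the source name and the property names are assumptions following the usual InterMine `project.xml` pattern; check them against your InterMine version):

```xml
<source name="chado-db-wormbase-c_elegans" type="chado-db">
  <!-- name of the chado database block defined in MINE_NAME.properties -->
  <property name="source.db.name" value="wormbase"/>
  <!-- NCBI taxon id for C. elegans -->
  <property name="organisms" value="6239"/>
  <property name="dataSourceName" value="WormBase"/>
</source>
```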
# Sub-classing the converter

The processor classes can be sub-classed to allow organism- or database-specific configuration. To do that, create your class (perhaps in `bio/sources/chado-db/main/src/`) and set the `processors` property in your source element. For example, for reading the FlyBase Chado database there is a `FlyBaseProcessor`, which can be configured like this:
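A sketch of the source element (property names follow the standard InterMine pattern and are assumptions; `org.intermine.bio.dataconversion` is the package the bio processors live in):

```xml
<source name="chado-db-flybase-dmel" type="chado-db">
  <property name="source.db.name" value="flybase"/>
  <!-- NCBI taxon id for D. melanogaster -->
  <property name="organisms" value="7227"/>
  <!-- use the FlyBase-specific processor instead of the default -->
  <property name="processors"
            value="org.intermine.bio.dataconversion.FlyBaseProcessor"/>
</source>
```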
# Current uses

FlyMine uses the `chado-db` source for reading the ''Drosophila'' genomes from the FlyBase chado database. The `FlyBaseProcessor` sub-class is used for configuration and to handle FlyBase special cases.

modMine, for the modENCODE project, uses `ChadoDBSource` for reading ''D. melanogaster'' data from FlyBase and ''C. elegans'' data from the WormBase chado database. The `WormBaseProcessor` sub-class is used for configuration when reading from WormBase.
# Implementation notes for the chado-db source

The `chado-db` source is implemented by the `ChadoDBConverter` class, which runs the `ChadoProcessor`s that have been configured in the `project.xml`. The configuration looks like this:
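As a sketch, the relevant part is the `processors` property of the source element in `project.xml`, giving a space-separated list of processor classes to run (class names are the ones mentioned above; the exact element layout may differ by InterMine version):

```xml
<source name="chado-db" type="chado-db">
  <!-- space-separated list of ChadoProcessor classes to run -->
  <property name="processors"
            value="org.intermine.bio.dataconversion.SequenceProcessor org.intermine.bio.dataconversion.StockProcessor"/>
</source>
```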
`ChadoDBConverter.process()` will create an object for each `ChadoProcessor` in turn, then call `ChadoProcessor.process()`.
# Chado sequence module table processing

`ChadoSequenceProcessor` processes the sequence module from Chado. The `process()` method handles each table in turn by calling `processFeatureTable()`, `processFeatureCVTermTable()`, etc.
Each table-processing method calls a corresponding result set method, eg. `processFeatureTable()` calls `getFeatureTableResultSet()`, and then processes each row. The returned `ResultSet` may not always include all rows from the Chado table. For example, the `getFeatures()` method returns a sub-set of the possible feature types, and that list is used when querying the feature table.
Generally each row is made into an appropriate object; eg. in `processFeatureTable()`, `feature` table rows correspond to `BioEntity` objects. Some rows of some tables are ignored (i.e. not turned into objects) based on configuration.
# Reading the feature table

Handled by `ChadoSequenceProcessor.processFeatureTable()`.

For each feature it calls `ChadoSequenceProcessor.makeFeatureData()`, which may be overridden by subclasses. If `makeFeatureData()` returns null (eg. because the sub-class does not need that feature) the row is discarded; otherwise processing of the feature continues.

Based on the configuration, fields in the `BioEntity` are set using `feature.uniquename` and `feature.name` from Chado.

If the `residues` column of the feature is set, a `Sequence` object is created and added to the new `BioEntity`.
# Reading the featureloc table

Handled by `ChadoSequenceProcessor.processLocationTable()`.

This method is passed a result set with start position, end position and other information from the `featureloc` table. For each row from the result set it will:

- store a `Location` object
- set `chromosomeLocation` in the associated `SequenceFeature`
- set the `chromosome` reference in the `SequenceFeature` if the `srcfeature` from the `featureloc` table is a chromosome feature
# Reading the feature_relationship table

Handled by `ChadoSequenceProcessor.processRelationTable()`.

This method calls `getFeatureRelationshipResultSet()` to return the relations of interest. The relations are used to create references and collections.

The method will automatically attempt to find and set the appropriate references and collections for `part_of` relations. As an example, if there is a `part_of` relation in the table between `Gene` and `Transcript`, and there is a `Gene.transcript` reference or a `Gene.transcripts` collection, it will be set.
There are two modes of operation, controlled by the `subjectFirst` parameter. If true, the query is ordered by the `subject_id` of the `feature_relationship` table, so we get results like:

    gene1_feature_id | relation_type | protein1_feature_id
    gene1_feature_id | relation_type | protein2_feature_id
    gene2_feature_id | relation_type | protein1_feature_id
    gene2_feature_id | relation_type | protein2_feature_id

(assuming the unlikely case where two genes are related to two proteins). If `subjectFirst` is false we get results like:

    gene1_feature_id | relation_type | protein1_feature_id
    gene2_feature_id | relation_type | protein1_feature_id
    gene1_feature_id | relation_type | protein2_feature_id
    gene2_feature_id | relation_type | protein2_feature_id

The first case is used when we need to set a collection in the gene, the second when we need to set a collection in the proteins.
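In SQL terms, the two modes correspond to different `ORDER BY` clauses over the Chado `feature_relationship` table (a sketch using the standard Chado column names; the actual query InterMine issues selects more columns and filters by relation type):

```sql
-- subjectFirst = true: rows grouped by the subject feature
SELECT subject_id, type_id, object_id
  FROM feature_relationship
 ORDER BY subject_id;

-- subjectFirst = false: rows grouped by the object feature
SELECT subject_id, type_id, object_id
  FROM feature_relationship
 ORDER BY object_id;
```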
# Reading the cvterm table

Handled by `ChadoSequenceProcessor.processFeatureCVTermTable()`.
# Using the default chado source

Add the chado database to your MINE_NAME.properties file, eg:
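A sketch of the property block, following InterMine's standard `db.*` database-configuration pattern (the database name `flybase`, host and credentials are placeholders; check the property names against your InterMine version):

```properties
db.flybase.datasource.class=org.postgresql.ds.PGPoolingDataSource
db.flybase.datasource.dbname=flybase
db.flybase.datasource.serverName=dbserver.example.org
db.flybase.datasource.user=USERNAME
db.flybase.datasource.password=SECRET
db.flybase.driver=org.postgresql.Driver
db.flybase.platform=PostgreSQL
```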
The chado database has to be on the local network.

1. Add the source to your project XML file.
2. Run the build.

See Database Building for more information on running builds.

This will load the data using the default chado loader. If you want to load more data, you will have to write a custom chado converter. FlyMine uses a FlyBase chado "processor" to parse interactions, etc. See `FlyBaseProcessor.java` for an example.
# Tripal

The Chado-specific tables are not in the default "public" PostgreSQL schema of the database. Instead, Tripal puts them in a schema named "chado".

To work around this, you need to alter your Chado processor to run this query first, before running any SELECT statements:
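The query is the standard PostgreSQL way to put the Tripal schema on the search path:

```sql
-- make the Tripal "chado" schema visible to unqualified table names
SET search_path TO chado, public;
```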
Starting with InterMine 1.8, you can instead define the schema directly in the database properties in your properties file, for example:
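A sketch of such a property (the exact property name is an assumption; check the InterMine documentation for your version, and use your own `db.*` prefix):

```properties
# tell InterMine that the Chado tables live in the "chado" schema
db.flybase.datasource.schema=chado
```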