Id Resolvers
The ID resolver builds a large map from the files in the specified directory. Each key in the map is a unique identifier (the MOD ID, for example an MGI:, RGD:, FBgn or ZFIN: identifier). Each value is the set of symbols, old identifiers and dbxrefs (e.g. Ensembl IDs) associated with that key.
unique gene identifier | symbol, name, ensembl ID … |
---|---|
MGI:97490 | pax6, paired box gene 6 … |
The ID resolver then uses this map to replace old or non-unique identifiers with the unique identifier. This allows genes to be merged correctly into the database, and lets each mine be interoperable with other friendly mines.
The ID resolver is used in several data sources, Homologene for example.
If you look at the Homologene data, you'll see they don't use the MGI identifier. See:
1212 | 10090 | 18508 | Pax6 | 7305369 NP_038655.1 |
1212 | 10116 | 25509 | Pax6 | 6981334 NP_037133.1 |
When parsing the Homologene data file, the ID resolver replaces the symbol "Pax6" with the MGI identifier. The parser sets MGI:97490 to be the primary identifier then stores the gene to the database. Similarly, it replaces Pax6 with "RGD:3258" for the rat gene. And so on.
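The replacement step described above can be sketched in a few lines. This is a minimal illustration, not InterMine code: the names `ID_MAP`, `build_lookup` and `primary_ids_for_row` are hypothetical, the synonym sets are illustrative, and the rows are assumed to be pipe-delimited exactly as displayed above.

```python
from collections import defaultdict

# Hypothetical, hand-written map for illustration only:
# taxon ID -> {primary identifier -> symbols, names, old IDs, dbxrefs}
ID_MAP = {
    "10090": {"MGI:97490": {"Pax6", "paired box gene 6"}},
    "10116": {"RGD:3258": {"Pax6"}},
}


def build_lookup(id_map):
    """Invert the map so each (taxon, synonym) points at its primary ID(s)."""
    lookup = defaultdict(set)
    for taxon_id, genes in id_map.items():
        for primary_id, synonyms in genes.items():
            for synonym in synonyms:
                lookup[(taxon_id, synonym)].add(primary_id)
    return lookup


def primary_ids_for_row(row, lookup):
    """Return the primary identifiers matching one pipe-delimited row."""
    fields = [field.strip() for field in row.split("|")]
    taxon_id, symbol = fields[1], fields[3]
    return lookup.get((taxon_id, symbol), set())


LOOKUP = build_lookup(ID_MAP)
mouse_row = "1212 | 10090 | 18508 | Pax6 | 7305369 NP_038655.1 |"
rat_row = "1212 | 10116 | 25509 | Pax6 | 6981334 NP_037133.1 |"

print(primary_ids_for_row(mouse_row, LOOKUP))
print(primary_ids_for_row(rat_row, LOOKUP))
```

Keying the lookup on the taxon ID as well as the symbol is what lets the same symbol, Pax6, map to a different primary identifier for mouse and for rat.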
# ID resolvers available in InterMine
Resolver factory | Covers | Data source |
---|---|---|
EntrezGeneIdResolverFactory | NCBI gene info for a collection of organisms | ftp://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz |
FlyBaseIdResolverFactory | flybase chado db, for D. melanogaster only | ftp://ftp.flybase.net/releases/current/psql flybase chado |
WormBaseChadoIdResolverFactory | wormbase chado db, for C. elegans only | modENCODE specific |
ZfinIdentifiersResolverFactory | zebrafish ids | http://zfin.org/downloads/identifiersForIntermine.txt |
MgiIdentifiersResolverFactory | mouse ids | ftp://ftp.informatics.jax.org/pub/reports/MRK_List2.rpt |
RgdIdentifiersResolverFactory | rat ids | ftp://rgd.mcw.edu/pub/data_release/GENES_RAT.txt |
HgncIdResolverFactory | HGNC human gene ids | Uses the biomart service at http://www.genenames.org |
EnsemblIdResolverFactory | Ensembl id | customised |
HumanIdResolverFactory | human ids | customised |
# Using ID Resolvers in InterMine data converters
Many data converters use the Entrez (NCBI) Gene ID resolver:
1. Download the identifier file (see the source listed for EntrezGeneIdResolverFactory above).
2. Unzip the file to /DATA_DIR/ncbi/gene_info
3. Create a subdirectory /DATA_DIR/idresolver/ as the file root path, and create a symbolic link named entrez to the file.
4. Add the root path to ~/.intermine/MINE.properties
ID resolvers and the corresponding symbolic link to each data file:
Resolver | Symbolic link |
---|---|
EntrezGeneIdResolverFactory | entrez |
WormBaseChadoIdResolverFactory | wormid |
ZfinIdentifiersResolverFactory | zfin |
MgiIdentifiersResolverFactory | mgi |
RgdIdentifiersResolverFactory | rgd |
HgncIdResolverFactory | hgnc |
EnsemblIdResolverFactory | ensembl |
HumanIdResolverFactory | humangene |
In the data converter, the ID resolver is given an identifier. The resolver then looks in the map for the identifier.
number of matches | returns |
---|---|
0 | NULL |
1 | new identifier |
>1 | NULL |
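The rule in the table above can be written as a small function. This is a sketch under assumptions: `resolve` and the `lookup` dictionary are hypothetical names, and `ID:A` and `ID:B` are placeholder identifiers rather than real database entries.

```python
def resolve(lookup, identifier):
    """Return the single primary identifier for `identifier`, else None."""
    matches = lookup.get(identifier, set())
    if len(matches) == 1:
        return next(iter(matches))
    # Zero matches (unknown) or more than one match (ambiguous).
    return None


# Placeholder data: "Pax6" has one match, "ambiguous1" has two.
lookup = {
    "Pax6": {"MGI:97490"},
    "ambiguous1": {"ID:A", "ID:B"},
}

print(resolve(lookup, "Pax6"))
print(resolve(lookup, "ambiguous1"))
print(resolve(lookup, "unknown1"))
```

Returning nothing for an ambiguous identifier is the conservative choice: picking one of several matches arbitrarily could merge data onto the wrong gene.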
# Using ID Resolvers in your data converters
A factory finds the data root path in ~/.intermine/MINE_NAME.properties; the path needs to be absolute.
For a file source, the key and the symbolic link name of the data file need to be hard-coded in the factory class, e.g. in EntrezGeneIdResolverFactory
For a database source, e.g. the flybase chado database, the key also needs to be hard-coded in the factory class, e.g. in FlyBaseIdResolverFactory
# Configuration
The Entrez gene identifier source has a configuration file, entrezIdResolver_config.properties. You shouldn't have to edit this file.
This config will parse fruit fly identifiers, e.g. FLYBASE:FBgn0088803
If you don't want to strip the prefix from the identifier, use this config:
Warning: The EBI changed how they format their data. If you have a recent data file, you do NOT want the above configuration for MGI.
To replace a taxonomy identifier with a strain, use the following:
To ignore certain organisms, do this:
# IdResolverService
IdResolverService is a Java class providing static methods that return an ID resolver directly. It is also the most straightforward way to create an ID resolver. For example, to create a fish ID resolver by taxon id in a converter:
You can use the IdResolverService to create a resolver by taxon id, by a list of taxon ids, or by organism, e.g.
# Resolve an Id
As the resolver maintains Java maps of identifiers for one or more organisms, you must explicitly tell it which organism to resolve for, e.g.
An identifier may also match two or more primary identifiers. In that case, discard the identifier, e.g.
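Both points can be modelled in a standalone sketch. It is written in Python rather than against InterMine's Java classes, so `SketchResolver`, `count_resolutions`, `resolve_id` and `resolve_gene` are hypothetical names chosen for illustration, and `ID:A` and `ID:B` are placeholder identifiers.

```python
class SketchResolver:
    """Hypothetical stand-in for a resolver holding one map per organism."""

    def __init__(self, maps_by_taxon):
        # taxon ID -> {identifier -> set of matching primary identifiers}
        self._maps = maps_by_taxon

    def resolve_id(self, taxon_id, identifier):
        """Return every primary identifier matching within one organism."""
        return self._maps.get(taxon_id, {}).get(identifier, set())

    def count_resolutions(self, taxon_id, identifier):
        """Return how many primary identifiers match within one organism."""
        return len(self.resolve_id(taxon_id, identifier))


def resolve_gene(resolver, taxon_id, identifier):
    """Return the unique primary identifier, or None to discard the gene."""
    if resolver.count_resolutions(taxon_id, identifier) != 1:
        return None
    return next(iter(resolver.resolve_id(taxon_id, identifier)))


resolver = SketchResolver({
    "10090": {"Pax6": {"MGI:97490"}, "ambiguous1": {"ID:A", "ID:B"}},
    "10116": {"Pax6": {"RGD:3258"}},
})

# A converter would skip any record whose identifier does not resolve.
records = [("10090", "Pax6"), ("10116", "Pax6"), ("10090", "ambiguous1")]
kept = []
for taxon_id, identifier in records:
    primary_id = resolve_gene(resolver, taxon_id, identifier)
    if primary_id is not None:
        kept.append(primary_id)
print(kept)
```

Passing the taxon ID on every call keeps organisms from colliding: the symbol Pax6 is valid for both mouse and rat, and only the organism decides which primary identifier is returned.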
# Writing a New ID resolver
An IdResolver factory creates an IdResolver, which reads and parses a file or database containing identifier information, stores the identifiers in a Java map, and writes that map to a cache file.
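The parse-then-cache flow can be illustrated with a standalone sketch. Several things here are assumptions rather than InterMine behaviour: the function names, the JSON cache format, and the input layout (tab-delimited lines with the primary identifier first, followed by its synonyms).

```python
import json
import os
import tempfile


def parse_identifier_file(path):
    """Parse tab-delimited lines: primary ID first, then its synonyms."""
    id_map = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or not fields[0]:
                continue  # skip blank lines and lines without synonyms
            synonyms = [field for field in fields[1:] if field]
            id_map.setdefault(fields[0], []).extend(synonyms)
    return id_map


def create_id_map(data_path, cache_path):
    """Load the map from the cache if present, else parse and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as handle:
            return json.load(handle)
    id_map = parse_identifier_file(data_path)
    with open(cache_path, "w", encoding="utf-8") as handle:
        json.dump(id_map, handle)
    return id_map


# Usage with a throwaway directory and a one-line data file.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "ids.tsv")
cache_path = os.path.join(workdir, "ids.cache.json")
with open(data_path, "w", encoding="utf-8") as handle:
    handle.write("MGI:97490\tPax6\tpaired box gene 6\n")

first = create_id_map(data_path, cache_path)   # parses the file, writes the cache
os.remove(data_path)                           # the source file is no longer needed
second = create_id_map(data_path, cache_path)  # served from the cache
print(first == second)
```

Caching the parsed map means a large identifier file is parsed once, and later runs only pay the cost of reading the cache.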
The new factory class needs to extend the superclass IdResolverFactory:
createIdResolver method:
createFromFile method:
createFromDb method:
Multiple taxon ids:
Multiple classes:
Multiple files or mixture of file and db:
Add resolver factory to IdResolverService:
# Future Plans
- A generalized resolver factory that reads a configuration file describing which identifier information each column holds, e.g. type=tab, column.0=mainId, etc.