Version: 5.0.0


Types of data loaded#

genes, proteins, GO annotation, protein domains, publications, UniProt features, comments, synonyms, cross references, EC numbers, components

How to download the data#

This source loads data from the UniProt website here:

The UniProt source expects the data files to be in a special format:


To download a single taxon, you can use this URL:

taxonomy: e.g. 9606 for human
reviewed: yes for Swiss-Prot, no for TrEMBL
compress: if yes, the download is zipped

How to load the data into your mine#


Gene identifier fields#

You can specify which gene fields are assigned when UniProt data is loaded. An example entry:

10116.uniqueField = primaryIdentifier
10116.primaryIdentifier.dbref = RGD
10116.secondaryIdentifier.dbref = Ensembl
10116.symbol.name = primary
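As a rough illustration of how such keys break down, the Python sketch below groups `taxon.setting = value` lines by taxon ID. It is not InterMine's own parser: the function name `parse_gene_config` is hypothetical, and the handling of blank lines and `#` comments is an assumption based on ordinary properties-file conventions.

```python
def parse_gene_config(text):
    """Group 'taxon.setting = value' lines by taxon ID.

    Blank lines and lines starting with '#' are skipped (assumed convention).
    """
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, then trim both sides.
        key, value = (part.strip() for part in line.split("=", 1))
        # Everything before the first '.' is the taxon ID.
        taxon, _, setting = key.partition(".")
        config.setdefault(taxon, {})[setting] = value
    return config


EXAMPLE = """
10116.uniqueField = primaryIdentifier
10116.primaryIdentifier.dbref = RGD
"""

rat = parse_gene_config(EXAMPLE)["10116"]
print(rat)
```

Keeping the part after the taxon ID as one string (for example `primaryIdentifier.dbref`) leaves the gene field and the lookup kind together, so a later step can split them as needed.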

The format for the file is:


An example

A rat UniProt entry:

The second line of that configuration would set the ID value as the gene.primaryIdentifier:

<dbReference type="RGD" id="619834" key="33">
<property type="gene designation" value="Acf"/>

The third line would set this ID value as gene.secondaryIdentifier:

<dbReference type="Ensembl" id="ENSRNOG00000033195" key="30">
<property type="organism name" value="Rattus norvegicus"/>

The last line would set the value between the <name/> tags as gene.symbol:

<name type="primary">A1cf</name>
<name type="synonym">Acf</name>
<name type="synonym">Asp</name>

The value for the name type can be primary, ORF or ordered locus.
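To make the mapping concrete, here is a minimal Python sketch that applies the rat configuration to a trimmed entry. It assumes an un-namespaced fragment assembled from the snippets above (real UniProt XML is namespaced and much larger), and it assumes the symbol is taken from the `<name>` element whose `type` is `primary`. The names `RAT_CONFIG` and `resolve_gene_fields` are hypothetical and not part of InterMine.

```python
import xml.etree.ElementTree as ET

# Trimmed, un-namespaced fragment built from the snippets above (illustrative only).
ENTRY = """
<entry>
  <gene>
    <name type="primary">A1cf</name>
    <name type="synonym">Acf</name>
    <name type="synonym">Asp</name>
  </gene>
  <dbReference type="RGD" id="619834" key="33">
    <property type="gene designation" value="Acf"/>
  </dbReference>
  <dbReference type="Ensembl" id="ENSRNOG00000033195" key="30">
    <property type="organism name" value="Rattus norvegicus"/>
  </dbReference>
</entry>
"""

# Hypothetical in-memory form of the rat (10116) configuration:
# gene field -> (lookup kind, value to match)
RAT_CONFIG = {
    "primaryIdentifier": ("dbref", "RGD"),
    "secondaryIdentifier": ("dbref", "Ensembl"),
    "symbol": ("name", "primary"),
}


def resolve_gene_fields(entry_xml, config):
    """Return {gene field: value} by looking up each configured source."""
    root = ET.fromstring(entry_xml)
    fields = {}
    for field, (kind, wanted) in config.items():
        if kind == "dbref":
            # Use the id attribute of the matching <dbReference>.
            node = root.find(f"dbReference[@type='{wanted}']")
            fields[field] = node.get("id") if node is not None else None
        else:
            # Use the text of the matching <name> inside <gene>.
            node = root.find(f"gene/name[@type='{wanted}']")
            fields[field] = node.text if node is not None else None
    return fields


gene_fields = resolve_gene_fields(ENTRY, RAT_CONFIG)
print(gene_fields)
```

A field whose source is missing from the entry comes back as `None` in this sketch; how the real converter treats that case is not stated here.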

Protein feature types#

You can also configure which protein features to load.

To load specific feature types only, specify them like so:

# in
feature.types = helix

To load NO feature types:

# in
feature.types = NONE

To load ALL feature types, do not specify any: remove the feature.types line from the config file. Loading all feature types is the default behaviour.
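The three cases above can be summarised in a small Python sketch. This is not the converter's actual code: `should_load_feature` is a hypothetical helper, and splitting multiple types on commas or whitespace is an assumption, since only a single-type example is shown here.

```python
import re


def should_load_feature(feature_type, configured):
    """Decide whether a feature type is loaded.

    configured is the raw feature.types value, or None when the line is absent.
    """
    if configured is None:
        return True  # no feature.types line: load everything (the default)
    if configured.strip() == "NONE":
        return False  # explicit opt-out: load no features
    # Assumption: several types may be separated by commas and/or whitespace.
    allowed = re.split(r"[,\s]+", configured.strip())
    return feature_type in allowed


print(should_load_feature("helix", "helix"))
print(should_load_feature("helix", "NONE"))
print(should_load_feature("helix", None))
```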


<source name="uniprot" type="uniprot" >
<property name="uniprot.organisms" value="7227 9606"/>
<property name="" location="/data/uniprot"/>
<property name="creatego" value="true"/>
<property name="creategenes" value="true"/>
<property name="allowduplicates" value="false"/>
<property name="loadfragments" value="false"/>
<property name="loadtrembl" value="true"/>
</source>
creategenes: if TRUE, process genes. Default: true
creatego: if TRUE, process GO annotation. Default: false
allowduplicates: if TRUE, allow proteins with duplicate sequences to be processed. Default: false
loadfragments: if TRUE, load all proteins even if isFragment = true. Default: false
loadtrembl: if FALSE, do not load TrEMBL data for the given organisms; load Swiss-Prot data only. Default: true
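One way to read these flags together is as a per-protein filter. The Python sketch below encodes the defaults listed above and applies the three filtering flags. It is an interpretation, not the converter's implementation: `LoaderOptions`, `should_load_protein` and the `sequence_seen_before` argument are hypothetical, and how duplicates are actually detected is not described here.

```python
from dataclasses import dataclass


@dataclass
class LoaderOptions:
    # Defaults as listed in the property descriptions above.
    creategenes: bool = True
    creatego: bool = False
    allowduplicates: bool = False
    loadfragments: bool = False
    loadtrembl: bool = True


def should_load_protein(opts, is_trembl, is_fragment, sequence_seen_before):
    """Return True if a protein passes the TrEMBL, fragment and duplicate filters."""
    if is_trembl and not opts.loadtrembl:
        return False  # Swiss-Prot only
    if is_fragment and not opts.loadfragments:
        return False  # fragments are skipped by default
    if sequence_seen_before and not opts.allowduplicates:
        return False  # duplicate sequences are skipped by default
    return True


defaults = LoaderOptions()
print(should_load_protein(defaults, is_trembl=True, is_fragment=False,
                          sequence_seen_before=False))
print(should_load_protein(defaults, is_trembl=False, is_fragment=True,
                          sequence_seen_before=False))
```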


This source loads FASTA data for isoforms. The UniProt entry does not contain the sequences for isoforms.

<source name="uniprot-fasta" type="fasta">
<property name="fasta.taxonId" value="7227 9606"/>
<property name="fasta.className" value=""/>
<property name="fasta.classAttribute" value="primaryAccession"/>
<property name="fasta.dataSetTitle" value="UniProt data set"/>
<property name="fasta.dataSourceName" value="UniProt"/>
<property name="" location="/data/uniprot/current"/>
<property name="fasta.includes" value="uniprot_sprot_varsplic.fasta"/>
<property name="fasta.sequenceType" value="protein" />
<property name="fasta.loaderClassName" value=""/>
</source>

UniProt keywords#

This source loads the names of the UniProt keywords referenced by the main UniProt converter.

<source name="uniprot-keywords" type="uniprot-keywords">
<property name="" location="/data/uniprot/current"/>
<property name="" value="keywlist.xml"/>
</source>