UniProt

Types of data loaded

genes, proteins, GO annotation, protein domains, publications, UniProt features, comments, synonyms, cross references, EC numbers, components

How to download the data

This source loads data from the UniProt FTP site: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release
The UniProt source expects the data files to be in a special format:
To download a single taxon, you can use this URL:
http://www.uniprot.org/uniprot/?format=xml&query=taxonomy%3A9606+AND+reviewed%3Ayes&compress=yes

parameter | value |
---|---|
taxonomy | NCBI taxon ID, e.g. 9606 for human |
reviewed | yes for Swiss-Prot, no for TrEMBL |
compress | if yes, the download is compressed |
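
The three parameters in the table can be combined programmatically. The following is a minimal Python sketch, not part of the UniProt source itself: the function name `build_uniprot_url` is hypothetical, and passing `compress=no` to turn compression off is an assumption, since the table only documents the `yes` case.

```python
from urllib.parse import urlencode

# Base URL taken from the single-taxon example above.
BASE_URL = "http://www.uniprot.org/uniprot/"


def build_uniprot_url(taxon_id: int, reviewed: bool = True, compress: bool = True) -> str:
    """Build a single-taxon download URL from the parameters in the table.

    build_uniprot_url is a hypothetical helper name used for illustration.
    """
    query = f"taxonomy:{taxon_id} AND reviewed:{'yes' if reviewed else 'no'}"
    params = {
        "format": "xml",
        "query": query,
        # Assumption: "no" disables compression; only "yes" is documented.
        "compress": "yes" if compress else "no",
    }
    # urlencode percent-encodes ':' as %3A and turns spaces into '+'.
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_uniprot_url(9606))                  # reviewed (Swiss-Prot) human entries
    print(build_uniprot_url(9606, reviewed=False))  # unreviewed (TrEMBL) human entries
```

With the defaults, the function reproduces the example URL shown above for taxon 9606.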

How to load the data into your mine

Configuration

Gene identifier fields

You can specify which gene fields are assigned when UniProt data is loaded. An example entry:
The format for the file is:
<TAXONID>.<IDENTIFIER_FIELD> = <VALUE>
An example, using a rat UniProt entry: http://www.uniprot.org/uniprot/Q923K9.xml
The second line of that configuration would set the ID value as the gene.primaryIdentifier:
The third line would set this ID value as gene.secondaryIdentifier:
The last line would set the value between the <name/> tags as gene.symbol:
The values for symbol.name can be primary, ORF or ordered locus.
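
To make the `<TAXONID>.<IDENTIFIER_FIELD> = <VALUE>` format concrete, here is a minimal Python sketch of how such lines could be read. It is not the parser InterMine uses: the function name `parse_gene_id_config` is hypothetical, treating `#` lines as comments is an assumption, and the sample entries are illustrative only (they reuse the `symbol.name` key and the values listed above).

```python
from typing import Dict


def parse_gene_id_config(text: str) -> Dict[str, Dict[str, str]]:
    """Parse '<TAXONID>.<IDENTIFIER_FIELD> = <VALUE>' lines.

    Returns {taxon_id: {identifier_field: value}}. Hypothetical helper.
    """
    config: Dict[str, Dict[str, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' comments are ignored
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in line: {raw!r}")
        # The taxon ID is everything before the first dot; the rest is the field.
        taxon_id, dot, field = key.strip().partition(".")
        if not dot or not taxon_id.isdigit() or not field:
            raise ValueError(f"expected <TAXONID>.<IDENTIFIER_FIELD> in line: {raw!r}")
        config.setdefault(taxon_id, {})[field] = value.strip()
    return config


# Illustrative entries only; 10116 is the NCBI taxon ID for rat.
SAMPLE = """
# gene symbol source per organism
10116.symbol.name = primary
9606.symbol.name = ORF
7227.symbol.name = ordered locus
"""

if __name__ == "__main__":
    for taxon, fields in parse_gene_id_config(SAMPLE).items():
        print(taxon, fields)
```

Splitting on the first dot only keeps dotted field names such as `symbol.name` intact.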

Protein feature types

You can also configure which protein features to load.
To load specific feature types only, specify them like so:
To load NO feature types:
To load ALL feature types, do not specify any feature types: remove that line from the config file. Loading all feature types is the default behaviour.
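
The three cases above (specific types, none, all) amount to a simple selection rule. The Python sketch below models only that rule; it does not show how the setting is spelled in the config file, and the function name `select_feature_types` is hypothetical. Here `None` stands for "the line is absent from the config file" and an empty set stands for "load no feature types".

```python
from typing import Iterable, List, Optional, Set


def select_feature_types(available: Iterable[str],
                         configured: Optional[Set[str]]) -> List[str]:
    """Return the feature types that would be loaded.

    configured is None   -> setting absent, load ALL types (the default)
    configured is empty  -> load NO types
    otherwise            -> load only the listed types
    """
    if configured is None:
        return list(available)
    return [ftype for ftype in available if ftype in configured]


if __name__ == "__main__":
    types_in_file = ["helix", "strand", "turn"]
    print(select_feature_types(types_in_file, None))       # all types
    print(select_feature_types(types_in_file, set()))      # no types
    print(select_feature_types(types_in_file, {"helix"}))  # helix only
```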

Project.xml

property | description | default |
---|---|---|
creategenes | if TRUE, process genes | true |
creatego | if TRUE, process GO annotation | false |
allowduplicates | if TRUE, allow proteins with duplicate sequences to be processed | false |
loadfragments | if TRUE, load all proteins even if isFragment = true | false |
loadtrembl | if FALSE, do not load TrEMBL data for the given organisms; load Swiss-Prot data only | true |
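
As a rough illustration of how these properties and their defaults fit together, the Python sketch below merges user overrides with the defaults from the table and renders them as XML. The element layout (`<source name="uniprot" type="uniprot">` with `<property name="..." value="..."/>` children) is an assumption about the project.xml format, and `uniprot_source` is a hypothetical helper name; check your own mine's project.xml before relying on it.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Defaults copied from the property table above.
DEFAULTS: Dict[str, str] = {
    "creategenes": "true",
    "creatego": "false",
    "allowduplicates": "false",
    "loadfragments": "false",
    "loadtrembl": "true",
}


def uniprot_source(overrides: Optional[Dict[str, str]] = None) -> ET.Element:
    """Build a <source> element with one <property> per setting.

    Hypothetical helper; the XML layout is an assumption, not confirmed here.
    """
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown properties: {sorted(unknown)}")
    settings = {**DEFAULTS, **overrides}
    source = ET.Element("source", {"name": "uniprot", "type": "uniprot"})
    for name, value in settings.items():
        ET.SubElement(source, "property", {"name": name, "value": value})
    return source


if __name__ == "__main__":
    # Turn on GO annotation and skip TrEMBL; everything else keeps its default.
    element = uniprot_source({"creatego": "true", "loadtrembl": "false"})
    print(ET.tostring(element, encoding="unicode"))
```

Rejecting unknown property names catches typos such as `loadTrembl` early instead of silently ignoring them.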

FASTA

This source loads FASTA data for isoforms. The UniProt entry does not contain the sequences for isoforms. The isoform sequences are available at:
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniprot_sprot_varsplic.fasta.gz
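
For readers who want to inspect the isoform file before loading it, here is a minimal Python sketch of a FASTA reader. It is not the loader InterMine uses; `read_fasta` and `read_fasta_gz` are hypothetical helper names, and the sample record in the usage section is made up for illustration.

```python
import gzip
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map the first token of each FASTA header (without '>') to its sequence."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("empty FASTA header")
            current = parts[0]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)  # sequence may span several lines
    return {name: "".join(seq) for name, seq in chunks.items()}


def read_fasta_gz(path: str) -> Dict[str, str]:
    """Read a gzip-compressed FASTA file such as a downloaded .fasta.gz."""
    with gzip.open(path, "rt") as handle:
        return read_fasta(handle)


if __name__ == "__main__":
    # Made-up record, for illustration only.
    sample = [">sp|P00001-2|TEST_HUMAN Isoform 2 of Example protein", "MKT", "AYI"]
    print(read_fasta(sample))
```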

UniProt keywords

Loads the names for the UniProt keywords contained in the main UniProt converter.
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs