GFF3
InterMine comes with a GFF parser that loads GFF3 data files into your mine without you having to write any Perl or Java code. The parser is not a source itself, but genome annotation from GFF files can be loaded easily by creating a new source of type gff. See redfly, malaria-gff and tiffin for examples.
Configuration is added to the `project.properties` file, and an optional handler can be added to deal with data in the attributes section of the GFF file.
# Types of data loaded
Sequence features.
# How to download the data
N/A. The parser will read any file in GFF3 format.
# How to load the data into your mine
- Place valid GFF3 files into a directory
- Add entry to project XML file
- Run build
If you follow the above steps with this data file, the following will happen:
- gene and mRNA objects created
- "MAL1" will be the identifier
- start = 183057, end = 184457
- the gene will be located on the -1 strand and the mRNA on the 1 strand
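The outcome listed above can be sketched in plain Java. The two sample lines are hypothetical, written only to match the values in the list (the data file itself is not reproduced here), and `Gff3ColumnsSketch` with its helpers is an illustrative name, not part of InterMine. The sketch shows how the seqid, type, start, end and strand columns map to those values, with `+` read as strand 1 and `-` as strand -1.

```java
class Gff3ColumnsSketch {

    /** Minimal holder for the columns discussed above (hypothetical, not an InterMine class). */
    static final class Feature {
        final String seqId;
        final String type;
        final int start;
        final int end;
        final int strand;

        Feature(String seqId, String type, int start, int end, int strand) {
            this.seqId = seqId;
            this.type = type;
            this.start = start;
            this.end = end;
            this.strand = strand;
        }
    }

    /** Convert the strand column to a number: "+" gives 1, "-" gives -1, anything else 0. */
    static int toStrand(String column) {
        if ("+".equals(column)) {
            return 1;
        }
        if ("-".equals(column)) {
            return -1;
        }
        return 0;
    }

    /** Split one tab-separated GFF3 line into its nine columns and keep the ones we need. */
    static Feature parseLine(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length != 9) {
            throw new IllegalArgumentException("Expected 9 columns but found " + cols.length);
        }
        return new Feature(cols[0], cols[2],
                Integer.parseInt(cols[3]), Integer.parseInt(cols[4]), toStrand(cols[6]));
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, made up to match the values in the list above.
        String[] lines = {
            "MAL1\tApiDB\tgene\t183057\t184457\t.\t-\t.\tID=gene.1",
            "MAL1\tApiDB\tmRNA\t183057\t184457\t.\t+\t.\tID=mRNA.1;Parent=gene.1"
        };
        for (String line : lines) {
            Feature f = parseLine(line);
            System.out.printf("%s on %s: start=%d end=%d strand=%d%n",
                    f.type, f.seqId, f.start, f.end, f.strand);
        }
    }
}
```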
# Configuration File
By default, the "type", "start", "end" and "strand" columns and the "ID" field in the "attributes" column are parsed automatically. To do more processing or to access other attributes, add configuration to `gff_config.properties`. This file should live in your mine's `dbmodel/resources` directory.
For more advanced processing, you will have to write your own GFF3 parser.
# Parent relationship
The parent-child relationship between features can also be handled automatically if you set it up properly. Take MalariaGFF3RecordHandler for example:
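The handler's own code is not reproduced here, so what follows is only a plain-Java sketch of the idea, not the MalariaGFF3RecordHandler implementation. It assumes a lookup from a child feature class to the name of the reference or collection that should point at the parent. The class names (`MRNA`, `Exon`), the field names (`gene`, `transcripts`) and `ParentReferenceSketch` itself are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class ParentReferenceSketch {

    /** Child feature class mapped to the field that should point at its parent (assumed names). */
    static final Map<String, String> PARENT_FIELDS = new HashMap<>();

    static {
        PARENT_FIELDS.put("MRNA", "gene");
        PARENT_FIELDS.put("Exon", "transcripts");
    }

    /**
     * Describe how a Parent= value would be applied to a child feature.
     * Returns null when no parent field is configured for the child's class.
     */
    static String resolve(String childClass, String childId, String parentId) {
        String field = PARENT_FIELDS.get(childClass);
        if (field == null) {
            return null;
        }
        return childId + "." + field + " -> " + parentId;
    }

    public static void main(String[] args) {
        // An mRNA whose Parent= attribute names a gene gets its "gene" field set.
        System.out.println(resolve("MRNA", "mRNA.1", "gene.1"));
        // A class with no configured parent field is left alone.
        System.out.println(resolve("Gene", "gene.1", "MAL1"));
    }
}
```

The point of the design is that the relationship is declared once per feature class rather than handled line by line.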
# Project XML
Here is an example GFF3 entry in the project XML file:
Here are the descriptions of the properties available:
property | example | definition |
---|---|---|
gff3.seqClsName | Chromosome | the ids in the first column represent Chromosome objects, e.g. MAL1 |
gff3.taxonId | 36329 | NCBI taxon ID of the organism |
gff3.dataSourceName | PlasmoDB | the data source for the features and their identifiers; it is used for the DataSet (evidence) and synonyms |
gff3.seqDataSourceName | PlasmoDB | the source of the seqids (chromosomes), which is sometimes different from the source of the features described |
gff3.dataSetTitle | PlasmoDB P. falciparum genome | a DataSet object is created as evidence for the features, it is linked to a DataSource (PlasmoDB) |
gff3.licence | https://creativecommons.org/licenses/by-sa/3.0/ | URL to a standard data licence |
# Writing a custom GFF parser
You can extend the generic parser by writing your own Java code to process the GFF3 data.
# Make Source script
Create your custom source by running the create source script:
The script creates a new source for you in the `bio/sources` directory.
# Java code
The Java file you now want to edit is in: bio/sources/SOURCE_NAME/main/src/org/intermine/bio/dataconversion
The `process()` method is called for every line of the GFF3 file(s) being read. Features and their locations have already been created but not yet stored, so you can make changes here. Attributes from the last column of the file are available in a map with the attribute name as the key. For example:
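A standalone sketch of such an attribute map, using plain Java rather than the InterMine classes. The value type (a list of strings per attribute name), the class name `Gff3AttributesSketch` and the sample attribute values are assumptions for illustration. Inside a real handler the map would come from the record passed to `process()` rather than being parsed by hand.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class Gff3AttributesSketch {

    /** Parse a GFF3 attributes column ("key=v1,v2;key2=v3") into a map keyed by attribute name. */
    static Map<String, List<String>> parseAttributes(String column) {
        Map<String, List<String>> attributes = new LinkedHashMap<>();
        for (String pair : column.split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip empty or malformed entries
            }
            String key = pair.substring(0, eq).trim();
            List<String> values = attributes.computeIfAbsent(key, k -> new ArrayList<>());
            for (String value : pair.substring(eq + 1).split(",")) {
                // GFF3 uses percent-encoding; protect "+" so URLDecoder does not turn it into a space.
                values.add(URLDecoder.decode(value.replace("+", "%2B"), StandardCharsets.UTF_8));
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        // Hypothetical attributes column, not taken from a real data file.
        String column = "ID=gene.1;description=hypothetical%20protein;Dbxref=DB1:123,DB2:456";
        Map<String, List<String>> attributes = parseAttributes(column);

        // Single-valued attribute: take the first element, guarding against a missing key.
        List<String> descriptions = attributes.get("description");
        if (descriptions != null && !descriptions.isEmpty()) {
            System.out.println(descriptions.get(0));
        }
        // Multi-valued attribute: iterate over every value.
        for (String xref : attributes.getOrDefault("Dbxref", List.of())) {
            System.out.println(xref);
        }
    }
}
```

Checking for a missing key before calling `get(0)` matters because not every line in a GFF3 file carries every attribute.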
Any new Items created can be stored by calling addItem(). For example:
Make sure that any new Items you create are unique, e.g. by keeping them in a map keyed by some identifier.
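One way to keep new Items unique is a get-or-create lookup, sketched here in plain Java. The `Item` class and the `addItem()` method in this block are simplified stand-ins for the InterMine ones, and `UniqueItemsSketch` and `getOrCreate()` are hypothetical names. The only idea carried over from the text is the map keyed by an identifier.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UniqueItemsSketch {

    /** Simplified stand-in for an InterMine Item. */
    static final class Item {
        final String className;
        final String identifier;

        Item(String className, String identifier) {
            this.className = className;
            this.identifier = identifier;
        }
    }

    private final Map<String, Item> itemsByIdentifier = new HashMap<>();
    private final List<Item> stored = new ArrayList<>();

    /** Stand-in for the handler's addItem(): here it just records the item. */
    void addItem(Item item) {
        stored.add(item);
    }

    /** Return the existing Item for this identifier, or create and store a new one exactly once. */
    Item getOrCreate(String className, String identifier) {
        Item existing = itemsByIdentifier.get(identifier);
        if (existing != null) {
            return existing;
        }
        Item created = new Item(className, identifier);
        itemsByIdentifier.put(identifier, created);
        addItem(created);
        return created;
    }

    int storedCount() {
        return stored.size();
    }

    public static void main(String[] args) {
        UniqueItemsSketch handler = new UniqueItemsSketch();
        // The same identifier seen on two different lines yields one stored Item.
        Item first = handler.getOrCreate("Publication", "12345");
        Item second = handler.getOrCreate("Publication", "12345");
        System.out.println(first == second);
        System.out.println(handler.storedCount());
    }
}
```

If identifiers can collide across classes, the map key could combine the class name and the identifier instead.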
It may be helpful to look at current GFF3 parsers:
- LongOligoGFF3RecordHandler.java
- MirandaGFF3RecordHandler.java
- RedFlyGFF3RecordHandler.java
- FlyRegGFF3RecordHandler.java
- DrosDelGFF3RecordHandler.java
See Tutorial for more information on how to run a GFF source.