Table of Contents:

Ideas list:

Migration of natural language query translation code to OpenNMT 2.0

Description: InterMine is an open source biological data warehouse used by different organisations around the world. It allows you to specify a data model specific to your use-case, and provides a framework for populating your data warehouse with the integrated data. This data can then be accessed using the webapp, web services or API libraries available for various programming languages.

It is known that the number of InterMine instances as well as their size and complexity is constantly increasing. This creates a barrier to use especially for non-experts, who have to come to grips with the nature of the data, the way it has been represented in the database, and how to use the interfaces by which data are accessed. One approach to improving this situation is to allow users to pose their queries in natural language.

At InterMine, we have developed a prototype system that in a general way, supports the mapping of natural language searches to database queries by employing a sequence-to-sequence machine learning approach. We have a proof of concept working with HumanMine, an InterMine instance covering human data, as well as with other common database engines. The system is based on statistical natural language processing techniques, and makes use of the OpenNMT library.

Currently, the system uses the previous version of OpenNMT (1.2), therefore migrating it to the new OpenNMT release (2.0) would provide major benefits, especially in terms of performance and modeling flexibility. Additionally, as a stretch goal if time allows, the intern could work on extending our current approach to be a sequence-to-graph prediction, which we believe has the potential to significantly improve performance.

Required skills:

  • Python
  • Software engineering skills with experience of object-oriented programming.

Useful skills:

  • Basic machine learning concepts desired.
  • Basic understanding of sequence-to-sequence models desired.
  • For the stretch goal, advanced knowledge of machine learning-based natural language processing techniques would help.

Possible mentors:

  • Adrián Rodríguez-Bazaga (ar989@cam.ac.uk)
  • Rachel Lyne (rachel@intermine.org)

Please always email all mentors in the same mail if you would like to ask questions or discuss the project.

Expected outcome: We would like the current model training and prediction code to be migrated to use the new OpenNMT 2.0 version. As a stretch goal, the intern could go further beyond and extend the current prediction pipeline to work as a sequence-to-graph problem

Difficulty level: Hard.


InterMine query builder library

Description: InterMine is an open source biological data warehouse used by different organisations around the world. It allows you to specify a data model specific to your use-case, and provides a framework for loading the integrated biological data you want your data warehouse to be populated with. This data can then be accessed using the webapp, web services or API libraries available for various programming languages.

To tell InterMine what data you want, you build a query using the PathQuery API standard. A client will usually build the XML representation of a PathQuery and transmit it to a web service, which will then respond with the appropriate data.

The creation of a user interface to build PathQuery XML is oft repeated (a recent example is BlueGene’s Query Builder- which is set to replace the legacy webapp soon). Another example is im-tables, which is a library used in the InterMine webapp to display the results of a query and provide rudimentary editing of the PathQuery. There exist also alternative user interfaces like Data Browser, which provides a slightly different approach to querying InterMine servers and displaying the results. Third-party organisations have expressed interest in developing similar use-case specific user interfaces to their InterMine server.

All these interfaces redo the task of creating a user interface for generating PathQuery XML. Therefore we believe there lies value in creating a reusable library for building and modifying PathQuery.

Rough specification:

  • Preferably written in React.js (for consistency with existing user interfaces) but allow easy usage from a plain JavaScript environment
  • Can both load a previously created query encoded in PathQuery XML, and allow the creation of a query from scratch
  • Supports exporting the modified or newly created query to PathQuery XML, such that the application using the library can pass it to the InterMine server to obtain data
  • Parses the InterMine server’s Data Model encoded in JSON, which specifies the structure of the data present

For an example of the JSON you will need to parse, open iodocs, search for Data Model, click it, change the parameter format to JSON and click run. Another important JSON is Summary Fields, which you can view in the same manner.

The Data Model basically tells you what options to provide in the user interface, as we don’t want to make it possible to create an invalid query. Summary Fields specifies the attributes we want to show by default, unless the user wants something more specific.

Required skills:

  • Experience with frontend development and modern JavaScript tooling
  • Familiar with building SPAs in a functional UI framework (e.g. React, Vue)
  • Creative interest in building user interfaces
  • Note: no biology skills needed

Possible mentors:

  • Kevin Herald Reierskog (khr29@cam.ac.uk)
  • Daniela Butano (daniela@intermine.org)
  • Adrián Rodríguez-Bazaga (ar989@cam.ac.uk)

Please always email all mentors in the same mail if you would like to ask questions or discuss the project.

Expected outcome: A library easily used from React or plain JavaScript which provides a query building UI and passes PathQuery XML to its caller.

How could this integrate with existing InterMine tools? This will be used directly within BlueGenes’ im-tables and will be available for use by other user interfaces.

Difficulty level: Hard.


End to End and integration testing with Cypress

Description: BlueGenes (GitHub, Demo) is a ClojureScript browser-based UI for InterMine. im-tables-3 (GitHub) is a dynamic result table library used to show InterMine query results in BlueGenes.

Test libraries like Cypress which automate a series of interactions with the webapp through a controlled web browser, are very useful to avoid regression in functionality during development. By using selectors and methods which correspond to common actions done in a web browser, you are able to describe these automated interactions in JavaScript. The Cypress test runner will then run these interactions on the running webapp, and produce a test report along with artifacts like a video recording or images.

BlueGenes has a few integration tests written in JavaScript with Cypress, while im-tables-3 has none. Both could use vastly more of this type of tests. These tests would then be run in automated CI environments like CircleCI.

Required skills:

  • JavaScript
  • Cypress or other frontend interaction/integration testing library
  • Note: no biology skills needed

Possible mentors:

  • Kevin Herald Reierskog (khr29@cam.ac.uk)
  • Rachel Lyne (rachel@intermine.org)

Please always email all mentors in the same mail if you would like to ask questions or discuss the project.

Expected outcome: A vast pool of tests that covers the majority of use cases.

Difficulty level: Easy to Medium, depending on your experience with JavaScript and Cypress.