Source code for intermine.query

import re
from copy import deepcopy
from xml.dom import minidom, getDOMImplementation

from intermine.util import openAnything, ReadableException
from intermine.pathfeatures import PathDescription, Join, SortOrder, SortOrderList
from intermine.model import Column, Class, Model, Reference, ConstraintNode

import intermine.constraints as constraints

try:
    from functools import reduce
except ImportError:
    pass

"""
Classes representing queries against webservices
================================================

Representations of queries, and templates.

"""

__author__ = "Alex Kalderimis"
__organization__ = "InterMine"
__license__ = "LGPL"
__contact__ = "dev@intermine.org"

LOGIC_OPS = ["and", "or"]
LOGIC_PRODUCT = [(x, y) for x in LOGIC_OPS for y in LOGIC_OPS]
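For reference, LOGIC_PRODUCT is simply the cartesian product of the two logic operators; _set_questionable_logic iterates over these pairs to collapse doubled operators left behind when irrelevant constraint codes are stripped from a logic string. A quick sketch of what the comprehension evaluates to:

```python
# Every ordered pairing of "and"/"or", in comprehension order.
LOGIC_OPS = ["and", "or"]
LOGIC_PRODUCT = [(x, y) for x in LOGIC_OPS for y in LOGIC_OPS]

print(LOGIC_PRODUCT)
# → [('and', 'and'), ('and', 'or'), ('or', 'and'), ('or', 'or')]
```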

[docs]class Query(object): """ A Class representing a structured database query ================================================ Objects of this class have properties that model the attributes of the query, and methods for performing the request. SYNOPSIS -------- example: >>> service = Service("http://www.flymine.org/query/service") >>> query = service.new_query() >>> >>> query.add_view("Gene.symbol", "Gene.pathways.name", "Gene.proteins.symbol") >>> query.add_sort_order("Gene.pathways.name") >>> >>> query.add_constraint("Gene", "LOOKUP", "eve") >>> query.add_constraint("Gene.pathways.name", "=", "Phosphate*") >>> >>> query.set_logic("A or B") >>> >>> for row in query.rows(): ... handle_row(row) OR, using an SQL style DSL: >>> s = Service("www.flymine.org/query") >>> query = s.query("Gene").\\ ... select("*", "pathways.*").\\ ... where("symbol", "=", "H").\\ ... outerjoin("pathways").\\ ... order_by("symbol") >>> for row in query.rows(start=10, size=5): ... handle_row(row) OR, for a more SQLAlchemy-like, ORM style: >>> for gene in s.query(s.model.Gene).filter(s.model.Gene.symbol == ["zen", "H", "eve"]).add_columns(s.model.Gene.alleles): ... handle(gene) Query objects represent structured requests for information over the database housed at the data warehouse whose webservice you are querying. They utilise some of the concepts of relational databases within an object-relational (ORM) context. If you don't know what that means, don't worry: you don't need to write SQL, and the queries will be fast. To make things slightly more familiar to those with knowledge of SQL, some syntactic sugar is provided to make constructing queries a bit more recognisable. PRINCIPLES ---------- The data model represents tables in the database as classes, with records within tables as instances of that class. 
The columns of the database are the fields of that object:: The Gene table - showing two records/objects +---------------------------------------------------+ | id | symbol | length | cyto-location | organism | +----------------------------------------+----------+ | 01 | eve | 1539 | 46C10-46C10 | 01 | +----------------------------------------+----------+ | 02 | zen | 1331 | 84A5-84A5 | 01 | +----------------------------------------+----------+ ... The organism table - showing one record/object +----------------------------------+ | id | name | taxon id | +----------------------------------+ | 01 | D. melanogaster | 7227 | +----------------------------------+ Columns that contain a meaningful value are known as 'attributes' (in the tables above, that is everything except the id columns). The other columns (such as "organism" in the gene table) are ones that reference records of other tables (ie. other objects), and are called references. You can refer to any field in any class, that has a connection, however tenuous, with a table, by using dotted path notation:: Gene.organism.name -> the name column in the organism table, referenced by a record in the gene table These paths, and the connections between records and tables they represent, are the basis for the structure of InterMine queries. 
THE STRUCTURE OF A QUERY ------------------------ A query has two principal sets of properties: - its view: the set of output columns - its constraints: the set of rules for what to include A query must have at least one output column in its view, but constraints are optional - if you don't include any, you will get back every record from the table (every object of that type) In addition, the query must be coherent: if you have information about an organism, and you want a list of genes, then the "Gene" table should be the basis for your query, and as such the Gene class, which represents this table, should be the root of all the paths that appear in it: So, to take a simple example:: I have an organism name, and I want a list of genes: The view is the list of things I want to know about those genes: >>> query.add_view("Gene.name") >>> query.add_view("Gene.length") >>> query.add_view("Gene.proteins.sequence.length") Note I can freely mix attributes and references, as long as every view ends in an attribute (a meaningful value). As a short-cut I can also write: >>> query.add_views("Gene.name", "Gene.length", "Gene.proteins.sequence.length") or: >>> query.add_views("Gene.name Gene.length Gene.proteins.sequence.length") They are all equivalent. You can also use common SQL style shortcuts such as "*" for all attribute fields: >>> query.add_views("Gene.*") You can also use "select" as a synonym for "add_view" Now I can add my constraints. As we mentioned, I have information about an organism, so: >>> query.add_constraint("Gene.organism.name", "=", "D. melanogaster") (note, here I can use "where" as a synonym for "add_constraint") If I run this query, I will get literally millions of results - it needs to be filtered further: >>> query.add_constraint("Gene.proteins.sequence.length", "<", 500) If that doesn't restrict things enough I can add more filters: >>> query.add_constraint("Gene.symbol", "ONE OF", ["eve", "zen", "h"]) Now I am guaranteed to get only information on genes I am interested in. Note, though, that because I have included the link (or "join") from Gene -> Protein, this, by default, means that I only want genes that have protein information associated with them. If in fact I want information on all genes, and just want to know the protein information if it is available, then I can specify that with: >>> query.add_join("Gene.proteins", "OUTER") And if perhaps my query is not as simple as a strict cumulative filter, but I want all D. mel genes that EITHER have a short protein sequence OR come from one of my favourite genes (as unlikely as that sounds), I can specify the logic for that too: >>> query.set_logic("A and (B or C)") Each letter refers to one of the constraints - the codes are assigned in the order you add the constraints. If you want to be absolutely certain about the constraints you mean, you can use the constraint objects themselves: >>> gene_is_eve = query.add_constraint("Gene.symbol", "=", "eve") >>> gene_is_zen = query.add_constraint("Gene.symbol", "=", "zen") >>> >>> query.set_logic(gene_is_eve | gene_is_zen) By default the logic is a straight cumulative filter (ie: A and B and C and D and ...) Putting it all together: >>> query.add_view("Gene.name", "Gene.length", "Gene.proteins.sequence.length") >>> query.add_constraint("Gene.organism.name", "=", "D. melanogaster") >>> query.add_constraint("Gene.proteins.sequence.length", "<", 500) >>> query.add_constraint("Gene.symbol", "ONE OF", ["eve", "zen", "h"]) >>> query.add_join("Gene.proteins", "OUTER") >>> query.set_logic("A and (B or C)") This can be made more concise and readable with a little DSL sugar: >>> query = service.query("Gene") >>> query.select("name", "length", "proteins.sequence.length").\ ... where('organism.name', '=', 'D. melanogaster').\ ... where("proteins.sequence.length", "<", 500).\ ... where('symbol', 'ONE OF', ['eve', 'h', 'zen']).\ ... outerjoin('proteins').\ ... set_logic("A and (B or C)") And the query is defined. Result Processing: Rows ----------------------- Calling ".rows()" on a query will return an iterator of rows, where each row is a ResultRow object, which can be treated as both a list and a dictionary. Which means you can refer to columns by name: >>> for row in query.rows(): ... print("name is %s" % (row["name"])) ... print("length is %d" % (row["length"])) As well as using list indices: >>> for row in query.rows(): ... print("The first column is %s" % (row[0])) Iterating over a row iterates over the cell values as a list: >>> for row in query.rows(): ... for column in row: ... do_something(column) Here each row will have a gene name, a gene length, and a sequence length, eg: >>> print(row.to_l) ["even skipped", "1359", "376"] To make that clearer, you can ask for a dictionary instead of a list: >>> for row in query.rows(): ... print(row.to_d) {"Gene.name":"even skipped","Gene.length":"1359","Gene.proteins.sequence.length":"376"} If you just want the raw results, for printing to a file, or for piping to another program, you can request the results in one of these formats: 'json', 'rr', 'tsv', 'jsonobjects', 'jsonrows', 'list', 'dict', 'csv' >>> for row in query.result("<format name>", size = <size>): ... print(row) Result Processing: Results -------------------------- Results can also be processed on a record by record basis. 
If you have a query that has output columns of "Gene.symbol", "Gene.pathways.name" and "Gene.proteins.proteinDomains.primaryIdentifier", then processing it by records will return one object per gene, and that gene will have a property named "pathways" which contains objects which have a name property. Likewise there will be a proteins property which holds a list of proteinDomains which all have a primaryIdentifier property, and so on. This allows a more object orientated approach to database records, familiar to users of other ORMs. This is the format used when you choose to iterate over a query directly, or can be explicitly chosen by invoking L{intermine.query.Query.results}: >>> for gene in query: ... print(gene.name, [x.name for x in gene.pathways]) The structure of the object and the information it contains depends entirely on the output columns selected. The values may be None, of course; fields that are valid according to the data model but were not selected for output will also be None. Attempts to access invalid properties (such as gene.favourite_colour) will cause exceptions to be thrown. Getting us to Generate your Code -------------------------------- Not that you have to actually write any of this! The webapp will happily generate the code for any query (and template) you can build in it. A good way to get started is to use the webapp to generate your code, and then run it as scripts to speed up your queries. You can always tinker with and edit the scripts you download. To get generated queries, look for the "python" link at the bottom of query-builder and template form pages; it looks a bit like this:: 
+=====================================+============= | | | Perl | Python | Java [Help] | | | +============================================== """ SO_SPLIT_PATTERN = re.compile("\s*(asc|desc)\s*", re.I) LOGIC_SPLIT_PATTERN = re.compile("\s*(?:and|or|\(|\))\s*", re.I) TRAILING_OP_PATTERN = re.compile("\s*(and|or)\s*$", re.I) LEADING_OP_PATTERN = re.compile("^\s*(and|or)\s*", re.I) ORPHANED_OP_PATTERN = re.compile("(?:\(\s*(?:and|or)\s*|\s*(?:and|or)\s*\))", re.I) def __init__(self, model, service=None, validate=True, root=None): """ Construct a new Query ===================== Construct a new query for making database queries against an InterMine data warehouse. Normally you would not need to use this constructor directly, but instead use the factory method on intermine.webservice.Service, which will handle construction for you. @param model: an instance of L{intermine.model.Model}. Required @param service: an instance of l{intermine.service.Service}. Optional, but you will not be able to make requests without one. @param validate: a boolean - defaults to True. If set to false, the query will not try and validate itself. You should not set this to false. 
""" self.model = model if root is None: self.root = root else: self.root = model.make_path(root).root self.name = '' self.description = '' self.service = service self.prefetch_depth = service.prefetch_depth if service is not None else 1 self.prefetch_id_only = service.prefetch_id_only if service is not None else False self.do_verification = validate self.path_descriptions = [] self.joins = [] self.constraint_dict = {} self.uncoded_constraints = [] self.views = [] self._sort_order_list = SortOrderList() self._logic_parser = constraints.LogicParser(self) self._logic = None self.constraint_factory = constraints.ConstraintFactory() # Set up sugary aliases self.c = self.column self.filter = self.where self.add_column = self.add_view self.add_columns = self.add_view self.add_views = self.add_view self.add_to_select = self.add_view self.order_by = self.add_sort_order self.all = self.get_results_list self.size = self.count self.summarize = self.summarise def __iter__(self): """Return an iterator over all the objects returned by this query""" return self.results("jsonobjects") def __len__(self): """Return the number of rows this query will return.""" return self.count() def __sub__(self, other): """Construct a new list from the symmetric difference of these things""" return self.service._list_manager.subtract([self], [other]) def __xor__(self, other): """Calculate the symmetric difference of this query and another""" return self.service._list_manager.xor([self, other]) def __and__(self, other): """ Intersect this query and another query or list """ return self.service._list_manager.intersect([self, other]) def __or__(self, other): """ Return the union of this query and another query or list. """ return self.service._list_manager.union([self, other]) def __add__(self, other): """ Return the union of this query and another query or list """ return self.service._list_manager.union([self, other]) @classmethod
[docs] def from_xml(cls, xml, *args, **kwargs): """ Deserialise a query serialised to XML ===================================== This method is used to instantiate serialised queries. It is used by intermine.webservice.Service objects to instantiate Template objects and it can be used to read in queries you have saved to a file. @param xml: The xml as a file name, url, or string @raise QueryParseError: if the query cannot be parsed @raise ModelError: if the query has illegal paths in it @raise ConstraintError: if the constraints don't make sense @rtype: L{Query} """ obj = cls(*args, **kwargs) obj.do_verification = False f = openAnything(xml) doc = minidom.parse(f) f.close() queries = doc.getElementsByTagName('query') if len(queries) != 1: raise QueryParseError("wrong number of queries in xml. " + "Only one <query> element is allowed. Found %d" % len(queries)) q = queries[0] obj.name = q.getAttribute('name') obj.description = q.getAttribute('description') obj.add_view(q.getAttribute('view')) for p in q.getElementsByTagName('pathDescription'): path = p.getAttribute('pathString') description = p.getAttribute('description') obj.add_path_description(path, description) for j in q.getElementsByTagName('join'): path = j.getAttribute('path') style = j.getAttribute('style') obj.add_join(path, style) for c in q.getElementsByTagName('constraint'): args = {} args['path'] = c.getAttribute('path') if not args['path']: # getAttribute returns '' for missing attributes if c.parentNode.tagName != "node": msg = "Constraints must have a path" raise QueryParseError(msg) args['path'] = c.parentNode.getAttribute('path') args['op'] = c.getAttribute('op') args['value'] = c.getAttribute('value') args['code'] = c.getAttribute('code') args['subclass'] = c.getAttribute('type') args['editable'] = c.getAttribute('editable') args['optional'] = c.getAttribute('switchable') args['extra_value'] = c.getAttribute('extraValue') args['loopPath'] = c.getAttribute('loopPath') values = [] for val_e in c.getElementsByTagName('value'): texts = [] for node in val_e.childNodes: if node.nodeType == node.TEXT_NODE: texts.append(node.data) values.append(' '.join(texts)) if len(values) > 0: args["values"] = values args = dict((k, v) for k, v in list(args.items()) if v is not None and v != '') if "loopPath" in args: args["op"] = { "=" : "IS", "!=": "IS NOT" }.get(args["op"]) con = obj.add_constraint(**args) if not con: raise ConstraintError("error adding constraint with args: " + repr(args)) def group(iterator, count): itr = iter(iterator) while True: try: yield tuple([next(itr) for i in range(count)]) except StopIteration: return if q.getAttribute('sortOrder'): sos = Query.SO_SPLIT_PATTERN.split(q.getAttribute('sortOrder')) if len(sos) == 1: if sos[0] in obj.views: # Be tolerant of irrelevant sort-orders obj.add_sort_order(sos[0]) else: sos.pop() # Get rid of empty string at end for path, direction in group(sos, 2): if path in obj.views: # Be tolerant of irrelevant so. obj.add_sort_order(path, direction) if q.getAttribute('constraintLogic'): obj._set_questionable_logic(q.getAttribute('constraintLogic')) obj.verify() return obj
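The group helper above pairs up the tokens of a split sortOrder attribute. The same pairing can be sketched standalone with itertools.islice, which terminates cleanly on an exhausted iterator regardless of PEP 479 generator semantics:

```python
from itertools import islice

def group(iterator, count):
    # Yield successive `count`-sized tuples, discarding any short tail.
    # from_xml uses this shape to turn ["Gene.symbol", "asc", "Gene.length",
    # "desc"] into (path, direction) pairs for add_sort_order.
    itr = iter(iterator)
    while True:
        chunk = tuple(islice(itr, count))
        if len(chunk) < count:
            return
        yield chunk

pairs = list(group(["Gene.symbol", "asc", "Gene.length", "desc"], 2))
# → [('Gene.symbol', 'asc'), ('Gene.length', 'desc')]
```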
def _set_questionable_logic(self, questionable_logic): """Attempts to sanity check the logic argument before it is set""" logic = questionable_logic used_codes = set(self.constraint_dict.keys()) logic_codes = set(Query.LOGIC_SPLIT_PATTERN.split(questionable_logic)) if "" in logic_codes: logic_codes.remove("") irrelevant_codes = logic_codes - used_codes for c in irrelevant_codes: pattern = re.compile("\\b" + c + "\\b", re.I) logic = pattern.sub("", logic) # Remove empty groups logic = re.sub("\((?:and|or|\s)*\)", "", logic) # Remove trailing and leading operators logic = Query.LEADING_OP_PATTERN.sub("", logic) logic = Query.TRAILING_OP_PATTERN.sub("", logic) for x in range(2): # repeat, as this process can leave doubles for left, right in LOGIC_PRODUCT: if left == right: repl = left else: repl = "and" pattern = re.compile(left + "\s*" + right, re.I) logic = pattern.sub(repl, logic) logic = Query.ORPHANED_OP_PATTERN.sub(lambda x: "(" if "(" in x.group(0) else ")", logic) logic = logic.strip() logic = Query.LEADING_OP_PATTERN.sub("", logic) logic = Query.TRAILING_OP_PATTERN.sub("", logic) try: if len(logic) > 0 and logic not in ["and", "or"]: self.set_logic(logic) except Exception as e: raise Exception("Error parsing logic string " + repr(questionable_logic) + " (which is " + repr(logic) + " after irrelevant codes have been removed)" + " with available codes: " + repr(list(used_codes)) + " because: " + str(e)) def __str__(self): """Return the XML serialisation of this query""" return self.to_xml()
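The cleanup performed by _set_questionable_logic is easiest to follow in isolation. Below is a simplified sketch of the same idea (the function name and the reduced set of patterns are illustrative, not part of the class): codes unused by any constraint are removed, then the operators their removal orphans are tidied away.

```python
import re

LOGIC_SPLIT_PATTERN = re.compile(r"\s*(?:and|or|\(|\))\s*", re.I)

def strip_irrelevant_codes(logic, used_codes):
    # Remove codes that no constraint on the query actually uses...
    for code in set(LOGIC_SPLIT_PATTERN.split(logic)) - set(used_codes) - {""}:
        logic = re.sub(r"\b%s\b" % code, "", logic, flags=re.I)
    # ...then tidy what their removal leaves behind: empty groups,
    # leading/trailing operators, and operators orphaned against a paren.
    logic = re.sub(r"\((?:and|or|\s)*\)", "", logic, flags=re.I)
    logic = re.sub(r"^\s*(?:and|or)\s*|\s*(?:and|or)\s*$", "", logic, flags=re.I)
    logic = re.sub(r"\(\s*(?:and|or)\s+", "(", logic, flags=re.I)
    logic = re.sub(r"\s+(?:and|or)\s*\)", ")", logic, flags=re.I)
    return logic.strip()

print(strip_irrelevant_codes("A and (B or C)", {"A", "B"}))  # → A and (B)
```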
[docs] def verify(self): """ Validate the query ================== Invalid queries will fail to run, and it is not always obvious why. The validation routine checks to see that the query will not cause errors on execution, and tries to provide informative error messages. This method is called immediately after a query is fully deserialised. @raise ModelError: if the paths are invalid @raise QueryError: if there are errors in query construction @raise ConstraintError: if there are errors in constraint construction """ self.verify_views() self.verify_constraint_paths() self.verify_join_paths() self.verify_pd_paths() self.validate_sort_order() self.do_verification = True
[docs] def select(self, *paths): """ Replace the current selection of output columns with this one ============================================================= example:: query.select("*", "proteins.name") This method is intended to provide an API familiar to those with experience of SQL or other ORM layers. This method, in contrast to other view manipulation methods, replaces the selection of output columns, rather than appending to it. Note that any sort orders that are no longer in the view will be removed. @param paths: The output columns to add """ self.views = [] self.add_view(*paths) so_elems = self._sort_order_list self._sort_order_list = SortOrderList() for so in so_elems: if so.path in self.views: self._sort_order_list.append(so) return self
[docs] def add_view(self, *paths): """ Add one or more views to the list of output columns =================================================== example:: query.add_view("Gene.name Gene.organism.name") This is the main method for adding views to the list of output columns. As well as appending views, it will also split a single, space or comma delimited string into multiple paths, and flatten out lists, or any combination. It will also immediately try to validate the views. Output columns must be valid paths according to the data model, and they must represent attributes of tables Also available as: - add_views - add_column - add_columns - add_to_select @see: intermine.model.Model @see: intermine.model.Path @see: intermine.model.Attribute """ views = [] for p in paths: if isinstance(p, (set, list)): views.extend(list(p)) elif isinstance(p, Class): views.append(p.name + ".*") elif isinstance(p, Column): if p._path.is_attribute(): views.append(str(p)) else: views.append(str(p) + ".*") elif isinstance(p, Reference): views.append(p.name + ".*") else: views.extend(re.split("(?:,?\s+|,)", str(p))) views = list(map(self.prefix_path, views)) views_to_add = [] for view in views: if view.endswith(".*"): view = re.sub("\.\*$", "", view) scd = self.get_subclass_dict() def expand(p, level, id_only=False): if level > 0: path = self.model.make_path(p, scd) cd = path.end_class add_f = lambda x: p + "." + x.name vs = [p + ".id"] if id_only and cd.has_id else [add_f(a) for a in cd.attributes] next_level = level - 1 rs_and_cs = list(cd.references) + list(cd.collections) for r in rs_and_cs: rp = add_f(r) if next_level: self.outerjoin(rp) vs.extend(expand(rp, next_level, self.prefetch_id_only)) return vs else: return [] depth = self.prefetch_depth views_to_add.extend(expand(view, depth)) else: views_to_add.append(view) if self.do_verification: self.verify_views(views_to_add) self.views.extend(views_to_add) return self
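The delimiter handling in add_view can be seen in isolation; a minimal sketch using the same regex the method applies to string arguments (the function name is illustrative):

```python
import re

def split_views(view_string):
    # add_view accepts space-delimited, comma-delimited, or mixed strings;
    # this regex breaks any of those into individual paths.
    return re.split(r"(?:,?\s+|,)", view_string)

print(split_views("Gene.name, Gene.length Gene.symbol"))
# → ['Gene.name', 'Gene.length', 'Gene.symbol']
```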
[docs] def prefix_path(self, path): if self.root is None: if self.do_verification: # eg. not when building from XML if path.endswith(".*"): trimmed = re.sub("\.\*$", "", path) else: trimmed = path self.root = self.model.make_path(trimmed, self.get_subclass_dict()).root return path else: if path.startswith(self.root.name): return path else: return self.root.name + "." + path
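The behaviour of prefix_path on an already-rooted query is simple enough to sketch standalone (a hypothetical helper, with the root class name passed explicitly rather than taken from the query):

```python
def prefix_path(path, root_name):
    # Paths already rooted at the query's root class pass through unchanged;
    # bare paths get the root prepended, so "symbol" becomes "Gene.symbol".
    if path.startswith(root_name):
        return path
    return root_name + "." + path

print(prefix_path("symbol", "Gene"))       # → Gene.symbol
print(prefix_path("Gene.symbol", "Gene"))  # → Gene.symbol
```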
[docs] def clear_view(self): """ Clear the output column list ============================ Deletes all entries currently in the view list. """ self.views = []
[docs] def verify_views(self, views=None): """ Check to see if the views given are valid ========================================= This method checks to see if the views: - are valid according to the model - represent attributes @see: L{intermine.model.Attribute} @raise intermine.model.ModelError: if the paths are invalid @raise ConstraintError: if the paths are not attributes """ if views is None: views = self.views for path in views: path = self.model.make_path(path, self.get_subclass_dict()) if not path.is_attribute(): raise ConstraintError("'" + str(path) + "' does not represent an attribute")
[docs] def add_constraint(self, *args, **kwargs): """ Add a constraint (filter on records) ==================================== example:: query.add_constraint("Gene.symbol", "=", "zen") This method will try to make a constraint from the arguments given, trying each of the classes it knows of in turn to see if they accept the arguments. This allows you to add constraints of different types without having to know or care what their classes or implementation details are. All constraints derive from intermine.constraints.Constraint, and they all have a path attribute, but are otherwise diverse. Before adding the constraint to the query, this method will also try to check that the constraint is valid by calling Query.verify_constraint_paths() @see: L{intermine.constraints} @rtype: L{intermine.constraints.Constraint} """ if len(args) == 1 and len(kwargs) == 0: if isinstance(args[0], tuple): con = self.constraint_factory.make_constraint(*args[0]) else: try: con = self.constraint_factory.make_constraint(*args[0].vargs, **args[0].kwargs) except AttributeError: con = args[0] else: if len(args) == 0 and len(kwargs) == 1: k, v = list(kwargs.items())[0] d = {"path": k} if v in constraints.UnaryConstraint.OPS: d["op"] = v else: d["op"] = "=" d["value"] = v kwargs = d if len(args) and args[0] in self.constraint_factory.reference_ops: args = [self.root] + list(args) con = self.constraint_factory.make_constraint(*args, **kwargs) con.path = self.prefix_path(con.path) if self.do_verification: self.verify_constraint_paths([con]) if hasattr(con, "code"): self.constraint_dict[con.code] = con else: self.uncoded_constraints.append(con) return con
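The single-keyword sugar in add_constraint (eg. query.where(symbol="eve")) amounts to building an argument dict; a sketch, where UNARY_OPS is a stand-in for constraints.UnaryConstraint.OPS and the function name is illustrative:

```python
UNARY_OPS = {"IS NULL", "IS NOT NULL"}  # stand-in for UnaryConstraint.OPS

def kwargs_to_constraint_args(**kwargs):
    # Mirrors the single-kwarg branch of add_constraint: the keyword is the
    # path, and the value is either a unary operator or an "=" comparison.
    (path, value), = kwargs.items()
    if value in UNARY_OPS:
        return {"path": path, "op": value}
    return {"path": path, "op": "=", "value": value}

print(kwargs_to_constraint_args(symbol="eve"))
# → {'path': 'symbol', 'op': '=', 'value': 'eve'}
```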
[docs] def where(self, *cons, **kwargs): """ Return a new query like this one but with an additional constraint ================================================================== In contrast to add_constraint, this method returns a new object with the given constraint added; it does not mutate the Query it is invoked on. Also available as Query.filter """ c = self.clone() try: for conset in cons: codeds = c.coded_constraints lstr = str(c.get_logic()) + " AND " if codeds else "" start_c = chr(ord(codeds[-1].code) + 1) if codeds else 'A' for con in conset: c.add_constraint(*con.vargs, **con.kwargs) try: c.set_logic(lstr + conset.as_logic(start = start_c)) except constraints.EmptyLogicError: pass for path, value in list(kwargs.items()): c.add_constraint(path, "=", value) except AttributeError: c.add_constraint(*cons, **kwargs) return c
[docs] def column(self, col): """ Return a Column object suitable for using to construct constraints with ======================================================================= This method is part of the SQLAlchemy style API. Also available as Query.c """ return self.model.column(self.prefix_path(str(col)), self.get_subclass_dict(), self)
[docs] def verify_constraint_paths(self, cons=None): """ Check that the constraints are valid ==================================== This method will check the path attribute of each constraint. In addition it will: - Check that BinaryConstraints and MultiConstraints have an Attribute as their path - Check that TernaryConstraints have a Reference as theirs - Check that SubClassConstraints have a correct subclass relationship - Check that LoopConstraints have a valid loopPath, of a compatible type - Check that ListConstraints refer to an object - Don't even try to check RangeConstraints: these have variable semantics @param cons: The constraints to check (defaults to all constraints on the query) @raise ModelError: if the paths are not valid @raise ConstraintError: if the constraints do not satisfy the above rules """ if cons is None: cons = self.constraints for con in cons: pathA = self.model.make_path(con.path, self.get_subclass_dict()) if isinstance(con, constraints.RangeConstraint): pass # No verification done on these, beyond checking its path, of course. 
elif isinstance(con, constraints.IsaConstraint): if pathA.get_class() is None: raise ConstraintError("'" + str(pathA) + "' does not represent a class, or a reference to a class") for c in con.values: if c not in self.model.classes: raise ConstraintError("Illegal constraint: " + repr(con) + " '" + str(c) + "' is not a class in this model") elif isinstance(con, constraints.TernaryConstraint): if pathA.get_class() is None: raise ConstraintError("'" + str(pathA) + "' does not represent a class, or a reference to a class") elif isinstance(con, constraints.BinaryConstraint) or isinstance(con, constraints.MultiConstraint): if not pathA.is_attribute(): raise ConstraintError("'" + str(pathA) + "' does not represent an attribute") elif isinstance(con, constraints.SubClassConstraint): pathB = self.model.make_path(con.subclass, self.get_subclass_dict()) if not pathB.get_class().isa(pathA.get_class()): raise ConstraintError("'" + con.subclass + "' is not a subclass of '" + con.path + "'") elif isinstance(con, constraints.LoopConstraint): pathB = self.model.make_path(con.loopPath, self.get_subclass_dict()) for path in [pathA, pathB]: if not path.get_class(): raise ConstraintError("'" + str(path) + "' does not refer to an object") (classA, classB) = (pathA.get_class(), pathB.get_class()) if not classA.isa(classB) and not classB.isa(classA): raise ConstraintError("the classes are of incompatible types: " + str(classA) + "," + str(classB)) elif isinstance(con, constraints.ListConstraint): if not pathA.get_class(): raise ConstraintError("'" + str(pathA) + "' does not refer to an object")
@property def constraints(self): """ Returns the constraints of the query ==================================== Query.constraints S{->} list(intermine.constraints.Constraint) Constraints are returned in the order of their code (normally the order they were added to the query) and with any subclass constraints at the end. @rtype: list(Constraint) """ ret = sorted(list(self.constraint_dict.values()), key=lambda con: con.code) ret.extend(self.uncoded_constraints) return ret
[docs] def get_constraint(self, code): """ Returns the constraint with the given code ========================================== Returns the constraint with the given code, if it exists. If no such constraint exists, it throws a ConstraintError @return: the constraint corresponding to the given code @rtype: L{intermine.constraints.CodedConstraint} """ if code in self.constraint_dict: return self.constraint_dict[code] else: raise ConstraintError("There is no constraint with the code '" + code + "' on this query")
[docs] def add_join(self, *args, **kwargs): """ Add a join statement to the query ================================= example:: query.add_join("Gene.proteins", "OUTER") A join statement is used to determine if references should restrict the result set by only including records where those references exist. For example, if one had a query with the view:: "Gene.name", "Gene.proteins.name" Then in the normal case (that of an INNER join), we would only get Genes that also have at least one protein that they reference. Simply by asking for this output column you are placing a restriction on the information you get back. If in fact you wanted all genes, regardless of whether they had proteins associated with them or not, but if they did you would rather like to know _what_ proteins, then you need to specify this reference to be an OUTER join:: query.add_join("Gene.proteins", "OUTER") Now you will get many more rows of results, some of which will have "null" values where the protein name would have been. This method will also attempt to validate the join by calling Query.verify_join_paths(). Joins must have a valid path; the style can be either INNER or OUTER (defaults to OUTER, as the user does not need to specify inner joins, since all references start out as inner joins), and the path must be a reference. @raise ModelError: if the path is invalid @raise TypeError: if the join style is invalid @rtype: L{intermine.pathfeatures.Join} """ join = Join(*args, **kwargs) join.path = self.prefix_path(join.path) if self.do_verification: self.verify_join_paths([join]) self.joins.append(join) return self
[docs] def outerjoin(self, column): """Alias for add_join(column, "OUTER")""" return self.add_join(str(column), "OUTER")
[docs] def verify_join_paths(self, joins=None): """ Check that the joins are valid ============================== Joins must have valid paths, and they must refer to references. @raise ModelError: if the paths are invalid @raise QueryError: if the paths are not references """ if joins is None: joins = self.joins for join in joins: path = self.model.make_path(join.path, self.get_subclass_dict()) if not path.is_reference(): raise QueryError("'" + join.path + "' is not a reference")
[docs] def add_path_description(self, *args, **kwargs): """ Add a path description to the query =================================== example:: query.add_path_description("Gene.proteins.proteinDomains", "Protein Domain") This allows you to alias the components of long paths to improve the way they display column headers in a variety of circumstances. In the above example, if the view included the unwieldy path "Gene.proteins.proteinDomains.primaryIdentifier", it would (depending on the mine) be displayed as "Protein Domain > DB Identifier". These settings are taken into account by the webservice when generating column headers for flat-file results with the columnheaders parameter given, and always supplied when requesting jsontable results. @rtype: L{intermine.pathfeatures.PathDescription} """ path_description = PathDescription(*args, **kwargs) path_description.path = self.prefix_path(path_description.path) if self.do_verification: self.verify_pd_paths([path_description]) self.path_descriptions.append(path_description) return path_description
[docs] def verify_pd_paths(self, pds=None): """ Check that the path of the path description is valid ==================================================== Checks for consistency with the data model @raise ModelError: if the paths are invalid """ if pds is None: pds = self.path_descriptions for pd in pds: self.model.validate_path(pd.path, self.get_subclass_dict())
@property def coded_constraints(self): """ Returns the list of constraints that have a code ================================================ Query.coded_constraints S{->} list(intermine.constraints.CodedConstraint) This returns an up to date list of the constraints that can be used in a logic expression. The only kind of constraint that this excludes, at present, is SubClassConstraints @rtype: list(L{intermine.constraints.CodedConstraint}) """ return sorted(list(self.constraint_dict.values()), key=lambda con: con.code)
[docs] def get_logic(self): """ Returns the logic expression for the query ========================================== This returns the up to date logic expression. The default value is the representation of all coded constraints and'ed together. If the logic is empty and there are no constraints, returns an empty string. The LogicGroup object stringifies to a string that can be parsed to obtain itself (eg: "A and (B or C or D)"). @rtype: L{intermine.constraints.LogicGroup} """ if self._logic is None: if len(self.coded_constraints) > 0: return reduce(lambda x, y: x+y, self.coded_constraints) else: return "" else: return self._logic
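The default logic above is built by reducing the coded constraints with `+`, which the constraint objects interpret as logical AND. A minimal toy sketch of that mechanism (`ToyConstraint` is invented here; the real class is `intermine.constraints.CodedConstraint`, whose `__add__` behaves along these lines):

```python
# Sketch of how get_logic's default works: reduce folds the coded
# constraints together with '+', producing one AND-ed logic group.
from functools import reduce

class ToyConstraint:
    def __init__(self, code):
        self.code = code
    def __add__(self, other):
        # Combining two constraints yields an AND expression.
        combined = ToyConstraint(None)
        combined.expr = "%s and %s" % (self, other)
        return combined
    def __str__(self):
        return getattr(self, "expr", self.code)

cons = [ToyConstraint(c) for c in "ABC"]
default_logic = reduce(lambda x, y: x + y, cons)
print(default_logic)  # A and B and C
```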
[docs] def set_logic(self, value): """ Sets the logic given the appropriate input ========================================== example:: Query.set_logic("A and (B or C)") This sets the logic to the appropriate value. If the value is already a LogicGroup, it is accepted, otherwise the string is tokenised and parsed. The logic is then validated with a call to validate_logic() @raise LogicParseError: if there is a syntax error in the logic """ if isinstance(value, constraints.LogicGroup): logic = value else: try: logic = self._logic_parser.parse(value) except constraints.EmptyLogicError: if self.coded_constraints: raise else: return self if self.do_verification: self.validate_logic(logic) self._logic = logic return self
[docs] def validate_logic(self, logic=None): """ Validates the query logic ========================= Attempts to validate the logic by checking that every coded_constraint is included at least once @raise QueryError: if not every coded constraint is represented """ if logic is None: logic = self._logic logic_codes = set(logic.get_codes()) for con in self.coded_constraints: if con.code not in logic_codes: raise QueryError("Constraint " + con.code + repr(con) + " is not mentioned in the logic: " + str(logic))
[docs] def get_default_sort_order(self): """ Gets the sort order when none has been specified ================================================ This method is called to determine the sort order if none is specified @raise QueryError: if the view is empty @rtype: L{intermine.pathfeatures.SortOrderList} """ try: v0 = self.views[0] for j in self.joins: if j.style == "OUTER": if v0.startswith(j.path): return "" return SortOrderList((self.views[0], SortOrder.ASC)) except IndexError: raise QueryError("Query view is empty")
[docs] def get_sort_order(self): """ Return a sort order for the query ================================= This method returns the sort order if set, otherwise it returns the default sort order @raise QueryError: if the view is empty @rtype: L{intermine.pathfeatures.SortOrderList} """ if self._sort_order_list.is_empty(): return self.get_default_sort_order() else: return self._sort_order_list
[docs] def add_sort_order(self, path, direction=SortOrder.ASC): """ Adds a sort order to the query ============================== example:: Query.add_sort_order("Gene.name", "DESC") This method adds a sort order to the query. A query can have multiple sort orders, which are assessed in sequence. If a query has two sort-orders, for example, the first being "Gene.organism.name asc", and the second being "Gene.name desc", you would have the list of genes grouped by organism, with the lists within those groupings in reverse alphabetical order by gene name. This method will try to validate the sort order by calling validate_sort_order() Also available as Query.order_by """ so = SortOrder(str(path), direction) so.path = self.prefix_path(so.path) if self.do_verification: self.validate_sort_order(so) self._sort_order_list.append(so) return self
[docs] def validate_sort_order(self, *so_elems): """ Check the validity of the sort order ==================================== Checks that the sort order paths are: - valid paths - in the view @raise QueryError: if the sort order is not in the view @raise ModelError: if the path is invalid """ if not so_elems: so_elems = self._sort_order_list from_paths = self._from_paths() for so in so_elems: p = self.model.make_path(so.path, self.get_subclass_dict()) if p.prefix() not in from_paths: raise QueryError("Sort order element %s is not in the query" % so.path)
def _from_paths(self): scd = self.get_subclass_dict() froms = set([self.model.make_path(x, scd).prefix() for x in self.views]) for c in self.constraints: p = self.model.make_path(c.path, scd) if p.is_attribute(): froms.add(p.prefix()) else: froms.add(p) return froms
[docs] def get_subclass_dict(self): """ Return the current mapping of class to subclass =============================================== This method returns a mapping of classes used by the model for assessing whether certain paths are valid. For instance, if you subclass MicroArrayResult to be FlyAtlasResult, you can refer to the .presentCall attributes of fly atlas results. MicroArrayResults do not have this attribute, and a path such as:: Gene.microArrayResult.presentCall would be marked as invalid unless the dictionary is provided. Users most likely will not need to ever call this method. @rtype: dict(string, string) """ subclass_dict = {} for c in self.constraints: if isinstance(c, constraints.SubClassConstraint): subclass_dict[c.path] = c.subclass return subclass_dict
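The construction above can be sketched with stand-in constraint classes (the real ones live in `intermine.constraints`; these toy versions exist only to show which constraints contribute to the mapping):

```python
# Only subclass constraints contribute path -> subclass entries;
# ordinary binary constraints are ignored by get_subclass_dict.
class SubClassConstraint:
    def __init__(self, path, subclass):
        self.path = path
        self.subclass = subclass

class BinaryConstraint:
    def __init__(self, path, op, value):
        self.path, self.op, self.value = path, op, value

query_constraints = [
    BinaryConstraint("Gene.symbol", "=", "eve"),
    SubClassConstraint("Gene.microArrayResults", "FlyAtlasResult"),
]

subclass_dict = {
    c.path: c.subclass
    for c in query_constraints
    if isinstance(c, SubClassConstraint)
}
print(subclass_dict)  # {'Gene.microArrayResults': 'FlyAtlasResult'}
```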
[docs] def results(self, row="object", start=0, size=None, summary_path=None): """ Return an iterator over result rows =================================== Usage:: >>> query = service.model.Gene.select("symbol", "length") >>> total = 0 >>> for gene in query.results(): ... print gene.symbol # handle strings ... total += gene.length # handle numbers >>> for row in query.results(row="rr"): ... print row["symbol"] # handle strings by dict index ... total += row["length"] # handle numbers by dict index ... print row["Gene.symbol"] # handle strings by full dict index ... total += row["Gene.length"] # handle numbers by full dict index ... print row[0] # handle strings by list index ... total += row[1] # handle numbers by list index >>> for d in query.results(row="dict"): ... print d["Gene.symbol"] # handle strings ... total += d["Gene.length"] # handle numbers >>> for l in query.results(row="list"): ... print l[0] # handle strings ... total += l[1] # handle numbers >>> import csv >>> csv_reader = csv.reader(query.results(row="csv"), delimiter=",", quotechar='"') >>> for row in csv_reader: ... print row[0] # handle strings ... total += int(row[1]) # handle numbers >>> tsv_reader = csv.reader(query.results(row="tsv"), delimiter="\t") >>> for row in tsv_reader: ... print row[0] # handle strings ... total += int(row[1]) # handle numbers This is the general method that allows access to any of the available result formats. The example above shows the ways these differ in terms of accessing fields of the rows, as well as dealing with different data types. Results can either be retrieved as typed values (jsonobjects, rr ['ResultRows'], dict, list), or as lists of strings (csv, tsv) which then require further parsing. The default format for this method is "object", where information is grouped by its relationships. The other main format is "rr", which stands for 'ResultRows', and can be accessed directly through the L{rows} method. 
Note that when requesting object based results (the default), if your query contains any kind of collection, it is highly likely that start and size won't do what you think, as they operate only on the underlying rows used to build up the returned objects. If you want rows back, you are recommended to use the simpler rows method. If no views have been specified, all attributes of the root class are selected for output. @param row: The format for each result. One of "object", "rr", "dict", "list", "tsv", "csv", "jsonrows", "jsonobjects" @type row: string @param start: the index of the first result to return (default = 0) @type start: int @param size: The maximum number of results to return (default = all) @type size: int @param summary_path: A column name to optionally summarise. Specifying a path will force "jsonrows" format, and return an iterator over a list of dictionaries. Use this when you are interested in processing a summary in order of greatest count to smallest. @type summary_path: str or L{intermine.model.Path} @rtype: L{intermine.webservice.ResultIterator} @raise WebserviceError: if the request is unsuccessful """ to_run = self.clone() if len(to_run.views) == 0: to_run.add_view(to_run.root) if "object" in row: for c in self.coded_constraints: p = to_run.column(c.path)._path from_p = p if p.end_class is not None else p.prefix() if not [v for v in to_run.views if v.startswith(str(from_p))]: if p.is_attribute(): to_run.add_view(p) else: to_run.add_view(p.append("id")) path = to_run.get_results_path() params = to_run.to_query_params() params["start"] = start if size: params["size"] = size if summary_path: params["summaryPath"] = to_run.prefix_path(summary_path) row = "jsonrows" view = to_run.views cld = to_run.root return to_run.service.get_results(path, params, row, view, cld)
[docs] def rows(self, start=0, size=None): """ Return the results as rows of data ================================== This is a shortcut for results("rr") Usage:: >>> for row in query.rows(start=10, size=10): ... print row["proteins.name"] @param start: the index of the first result to return (default = 0) @type start: int @param size: The maximum number of results to return (default = all) @type size: int @rtype: iterable<intermine.webservice.ResultRow> """ return self.results(row="rr", start=start, size=size)
[docs] def summarise(self, summary_path, **kwargs): """ Return a summary of the results for this column. ================================================ Usage:: >>> query = service.select("Gene.*", "organism.*").where("Gene", "IN", "my-list") >>> print query.summarise("length")["average"] ... 12345.67890 >>> print query.summarise("organism.name")["Drosophila simulans"] ... 98 This method allows you to get statistics summarising the information from just one column of a query. For numerical columns you get dictionary with four keys ('average', 'stdev', 'max', 'min'), and for non-numerical columns you get a dictionary where each item is a key and the values are the number of occurrences of this value in the column. Any key word arguments will be passed to the underlying results call - so you can limit the result size to the top 100 items by passing "size = 100" as part of the call. @see: L{intermine.query.Query.results} @param summary_path: The column to summarise (either in long or short form) @type summary_path: str or L{intermine.model.Path} @rtype: dict This method is sugar for particular combinations of calls to L{results}. """ p = self.model.make_path(self.prefix_path(summary_path), self.get_subclass_dict()) results = self.results(summary_path = summary_path, **kwargs) if p.end.type_name in Model.NUMERIC_TYPES: return dict((k, float(v)) for k, v in list(next(results).items())) else: return dict((r["item"], r["count"]) for r in results)
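The numeric branch of `summarise` returns a dictionary with the keys 'average', 'stdev', 'max', and 'min'. The sketch below reproduces that result shape locally (the function and its statistics are illustrative; in the real method the server computes the summary and decides the exact statistic used):

```python
# Local sketch of the numeric summary dictionary that summarise()
# returns for numeric columns. Here the values are computed in-process
# purely to show the shape of the result.
import math

def summarise_numeric(values):
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation, for illustration only.
    stdev = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return {"average": mean, "stdev": stdev,
            "max": max(values), "min": min(values)}

stats = summarise_numeric([10.0, 20.0, 30.0])
print(stats["average"], stats["min"], stats["max"])  # 20.0 10.0 30.0
```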
[docs] def one(self, row="jsonobjects"): """Return one result, and raise an error if the result size is not 1""" if row == "jsonobjects": if self.count() == 1: return self.first(row) else: ret = None for obj in self.results(): if ret is not None: raise QueryError("More than one result received") else: ret = obj if ret is None: raise QueryError("No results received") return ret else: c = self.count() if (c != 1): raise QueryError("Result size is not one: got %d results" % (c)) else: return self.first(row)
[docs] def first(self, row="jsonobjects", start=0, **kw): """Return the first result, or None if the results are empty""" if row == "jsonobjects": size = None else: size = 1 try: return next(self.results(row, start=start, size=size, **kw)) except StopIteration: return None
[docs] def get_results_list(self, *args, **kwargs): """ Get a list of result rows ========================= This method is a shortcut so that you do not have to do a list comprehension yourself on the iterator that is normally returned. If you have a very large result set (and these can get up to 100's of thousands or rows pretty easily) you will not want to have the whole list in memory at once, but there may be other circumstances when you might want to keep the whole list in one place. It takes all the same arguments and parameters as Query.results Also available as Query.all @see: L{intermine.query.Query.results} """ return list(self.results(*args, **kwargs))
[docs] def get_row_list(self, start=0, size=None): """Return a list of the rows returned by this query; a shortcut for get_results_list("rr")""" return self.get_results_list("rr", start, size)
[docs] def count(self): """ Return the total number of rows this query returns ================================================== Obtain the number of rows a particular query will return, without having to fetch and parse all the actual data. This method makes a request to the server to report the count for the query, and is sugar for a results call. Also available as Query.size @rtype: int @raise WebserviceError: if the request is unsuccessful. """ count_str = "" for row in self.results(row = "count"): count_str += row try: return int(count_str) except ValueError: raise ResultError("Server returned a non-integer count: " + count_str)
[docs] def get_list_upload_uri(self): """ Returns the uri to use to create a list from this query ======================================================= Query.get_list_upload_uri() -> str This method is used internally when performing list operations on queries. @rtype: str """ return self.service.root + self.service.QUERY_LIST_UPLOAD_PATH
[docs] def get_list_append_uri(self): """ Returns the uri to use to create a list from this query ======================================================= Query.get_list_append_uri() -> str This method is used internally when performing list operations on queries. @rtype: str """ return self.service.root + self.service.QUERY_LIST_APPEND_PATH
[docs] def get_results_path(self): """ Returns the path section pointing to the REST resource ====================================================== Query.get_results_path() -> str Internally, this just calls a constant property in intermine.service.Service @rtype: str """ return self.service.QUERY_PATH
[docs] def children(self): """ Returns the child objects of the query ====================================== This method is used during the serialisation of queries to xml. It is unlikely you will need access to this as a whole. Consider using "path_descriptions", "joins", "constraints" instead @see: Query.path_descriptions @see: Query.joins @see: Query.constraints @return: the child element of this query @rtype: list """ return sum([self.path_descriptions, self.joins, self.constraints], [])
[docs] def to_query(self): """ Implementation of trait that allows use of these objects as queries (casting). """ return self
[docs] def make_list_constraint(self, path, op): """ Implementation of trait that allows use of these objects in list constraints """ l = self.service.create_list(self) return ConstraintNode(path, op, l.name)
[docs] def to_query_params(self): """ Returns the parameters to be passed to the webservice ===================================================== The query is responsible for producing its own query parameters. These consist simply of: - query: the xml representation of the query @rtype: dict """ xml = self.to_xml() params = {'query' : xml } return params
[docs] def to_Node(self): """ Returns a DOM node representing the query ========================================= This is an intermediate step in the creation of the xml serialised version of the query. You probably won't need to call this directly. @rtype: xml.minidom.Node """ impl = getDOMImplementation() doc = impl.createDocument(None, "query", None) query = doc.documentElement query.setAttribute('name', self.name) query.setAttribute('model', self.model.name) query.setAttribute('view', ' '.join(self.views)) query.setAttribute('sortOrder', str(self.get_sort_order())) query.setAttribute('longDescription', self.description) if len(self.coded_constraints) > 1: query.setAttribute('constraintLogic', str(self.get_logic())) for c in self.children(): element = doc.createElement(c.child_type) for name, value in list(c.to_dict().items()): if isinstance(value, (set, list)): for v in value: subelement = doc.createElement(name) text = doc.createTextNode(v) subelement.appendChild(text) element.appendChild(subelement) else: element.setAttribute(name, value) query.appendChild(element) return query
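The DOM construction pattern used by `to_Node` can be shown with a self-contained stdlib snippet: create a document whose root is a `query` element, set its attributes, then append one child element per join or constraint. The attribute values below are invented for the example:

```python
# Minimal sketch of the minidom pattern used by to_Node(): build the
# <query> root, set attributes, append child elements, serialise.
from xml.dom import getDOMImplementation

impl = getDOMImplementation()
doc = impl.createDocument(None, "query", None)
query = doc.documentElement
query.setAttribute("name", "example")
query.setAttribute("view", "Gene.symbol Gene.length")

join = doc.createElement("join")
join.setAttribute("path", "Gene.proteins")
join.setAttribute("style", "OUTER")
query.appendChild(join)

xml = query.toxml()
print(xml)
```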
[docs] def to_xml(self): """ Return an XML serialisation of the query ======================================== This method serialises the current state of the query to an xml string, suitable for storing, or sending over the internet to the webservice. @return: the serialised xml string @rtype: string """ n = self.to_Node() return n.toxml()
[docs] def to_formatted_xml(self): """ Return a readable XML serialisation of the query ================================================ This method serialises the current state of the query to an xml string, suitable for storing, or sending over the internet to the webservice, only more readably. @return: the serialised xml string @rtype: string """ n = self.to_Node() return n.toprettyxml()
[docs] def clone(self): """ Performs a deep clone ===================== This method will produce a clone that is independent, and can be altered without affecting the original, but starts off with the exact same state as it. The only shared elements should be the model and the service, which are shared by all queries that refer to the same webservice. @return: same class as caller """ newobj = self.__class__(self.model) for attr in ["joins", "views", "_sort_order_list", "_logic", "path_descriptions", "constraint_dict", "uncoded_constraints"]: setattr(newobj, attr, deepcopy(getattr(self, attr))) for attr in ["name", "description", "service", "do_verification", "constraint_factory", "root"]: setattr(newobj, attr, getattr(self, attr)) return newobj
[docs]class Template(Query): """ A Class representing a predefined query ======================================= Templates are ways of saving queries and allowing others to run them simply. They are the main interface to querying in the webapp SYNOPSIS -------- example:: service = Service("http://www.flymine.org/query/service") template = service.get_template("Gene_Pathways") for row in template.results(A={"value":"eve"}): process_row(row) ... A template is a subclass of query that comes predefined. They are typically retrieved from the webservice and run by specifying the values for their existing constraints. They are a concise and powerful way of running queries in the webapp. Being subclasses of query, everything is true of them that is true of a query. They are just less work, as you don't have to design each one. Also, you can store your own templates in the web-app, and then access them as a private webservice method, from anywhere, making them a kind of query in the cloud - for this you will need to authenticate by providing log in details to the service. The most significant difference is how constraint values are specified for each set of results. @see: L{Template.results} """ def __init__(self, *args, **kwargs): """ Constructor =========== Instantiation is identical to that of queries. As with queries, these are best obtained from the intermine.webservice.Service factory methods. @see: L{intermine.webservice.Service.get_template} """ super(Template, self).__init__(*args, **kwargs) self.constraint_factory = constraints.TemplateConstraintFactory() @property def editable_constraints(self): """ Return the list of constraints you can edit =========================================== Template.editable_constraints -> list(intermine.constraints.Constraint) Templates have a concept of editable constraints, which is a way of hiding complexity from users. An underlying query may have five constraints, but only expose the one that is actually interesting. 
This property returns this subset of constraints that have the editable flag set to true. """ return [c for c in self.constraints if c.editable]
[docs] def to_query_params(self): """ Returns the query parameters needed for the webservice ====================================================== Template.to_query_params() -> dict(string, string) Overrides the method of the same name in query to provide the parameters needed by the templates results service. These are slightly more complex: - name: The template's name - for each constraint: (where [i] is an integer incremented for each constraint) - constraint[i]: the path - op[i]: the operator - value[i]: the value - code[i]: the code - extra[i]: the extra value for ternary constraints (optional) @rtype: dict """ p = {'name' : self.name} i = 1 for c in self.editable_constraints: if not c.switched_on: continue for k, v in list(c.to_dict().items()): if k == "extraValue": k = "extra" if k == "path": k = "constraint" p[k + str(i)] = v i += 1 return p
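The parameter naming scheme described above can be reconstructed with stand-in constraint dictionaries (the real values come from `editable_constraints`; the template name and constraint values below are invented for the example). The keys "path" and "extraValue" are renamed to "constraint" and "extra" before the per-constraint index is appended:

```python
# Toy reconstruction of Template.to_query_params' parameter layout,
# using plain dicts in place of real constraint objects.
editable = [
    {"path": "Gene", "op": "LOOKUP", "value": "eve",
     "code": "A", "extraValue": "D. melanogaster"},
    {"path": "Gene.length", "op": ">", "value": "5000", "code": "B"},
]

params = {"name": "Gene_Pathways"}
for i, con in enumerate(editable, start=1):
    for k, v in con.items():
        if k == "extraValue":
            k = "extra"
        if k == "path":
            k = "constraint"
        params[k + str(i)] = v

print(params["constraint1"], params["extra1"], params["op2"])
```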
[docs] def get_results_path(self): """ Returns the path section pointing to the REST resource ====================================================== Template.get_results_path() S{->} str Internally, this just calls a constant property in intermine.service.Service This overrides the method of the same name in Query @return: the path to the REST resource @rtype: string """ return self.service.TEMPLATEQUERY_PATH
[docs] def get_adjusted_template(self, con_values): """ Gets a template to run ====================== Template.get_adjusted_template(con_values) S{->} Template When templates are run, they are first cloned, and their values are changed to those desired. This leaves the original template unchanged so it can be run again with different values. This method does the cloning and changing of constraint values @raise ConstraintError: if the constraint values specify values for a non-editable constraint. @rtype: L{Template} """ clone = self.clone() for code, options in list(con_values.items()): con = clone.get_constraint(code) if not con.editable: raise ConstraintError("There is a constraint '" + code + "' on this query, but it is not editable") try: for key, value in list(options.items()): setattr(con, key, value) except AttributeError: setattr(con, "value", options) return clone
[docs] def results(self, row="object", start=0, size=None, **con_values): """ Get an iterator over result rows ================================ This method returns the same values with the same options as the method of the same name in Query (see intermine.query.Query). The main difference is in the arguments. The template result methods also accept a key-word pair set of arguments that are used to supply values to the editable constraints. eg:: template.results( A = {"value": "eve"}, B = {"op": ">", "value": 5000} ) The keys should be codes for editable constraints (you can inspect these with Template.editable_constraints) and the values should be a dictionary of constraint properties to replace. You can replace the values for "op" (operator), "value", and "extra_value" and "values" in the case of ternary and multi constraints. @rtype: L{intermine.webservice.ResultIterator} """ clone = self.get_adjusted_template(con_values) return super(Template, clone).results(row, start, size)
[docs] def get_results_list(self, row="object", start=0, size=None, **con_values): """ Get a list of result rows ========================= This method performs the same as the method of the same name in Query, and it shares the semantics of Template.results(). @see: L{intermine.query.Query.get_results_list} @see: L{intermine.query.Template.results} @rtype: list """ clone = self.get_adjusted_template(con_values) return super(Template, clone).get_results_list(row, start, size)
[docs] def get_row_list(self, start=0, size=None, **con_values): """Return a list of the rows returned by this query""" clone = self.get_adjusted_template(con_values) return super(Template, clone).get_row_list(start, size)
[docs] def rows(self, start=0, size=None, **con_values): """Get an iterator over the rows returned by this query""" clone = self.get_adjusted_template(con_values) return super(Template, clone).rows(start, size)
[docs] def count(self, **con_values): """ Return the total number of rows this template returns ===================================================== Obtain the number of rows a particular query will return, without having to fetch and parse all the actual data. This method makes a request to the server to report the count for the query, and is sugar for a results call. @rtype: int @raise WebserviceError: if the request is unsuccessful. """ clone = self.get_adjusted_template(con_values) return super(Template, clone).count()
[docs]class QueryError(ReadableException): pass
[docs]class ConstraintError(QueryError): pass
[docs]class QueryParseError(QueryError): pass
[docs]class ResultError(ReadableException): pass