blog.mynarz.net: 2017

2017-12-26

Copy-pasting the history of public procurement

They say that “those who do not learn history are doomed to repeat it.” However, those who machine-learn from history are also doomed to repeat it. Machine learning is in many ways copy-pasting history. A key challenge is thus to learn from history without copying its biases.

For the first time in the history we have open data on past experiences in making public contracts. Contracting authorities and bidders have a chance of learning from these experiences to improve the efficiency of agreeing on future contracts by matching contracts to the most suitable bidders. I wrote my Ph.D. thesis on this subject.

The thing is that we know little when learning from awarded contracts. What we have is basically this: Similar contracts are usually awarded to these bidders. This has two principal flaws: the first is tenuous similarity. Similarity between contracts hinges on the how informative their description is. However, the features comprising the contract description may be incomplete or uninformative. Civil servants administering contracts may either intentionally or inadvertently withhold key information from the contract description or fill it with information that ultimately amounts to noise.

What is perhaps more detrimental is that the associations between bidders and their awarded contracts need not be positive. If we learn from contracts awarded in the past, we assume that the awarded bidders are the best (or simply good enough) matches for the contracts. This assumption is fundamentally problematic. The winning bidder may be awarded on the basis of adverse selection, which is not in favour of the public, but the public decision maker. There are many ways in which the conditions of adverse selection can arise. Relevant information required for informed decision-making can be unevenly distributed between the participants of the public procurement market. Rival bidders can agree to cooperate for mutual benefit and limit open competition, such as by submitting overpriced fake bids to make real bids more appealing in comparison. When a bidder repeatedly wins contracts from the same contracting authority, it needs not be a sign of its exceptional prowess, but instead a sign of clientelism or institutional inertia. In most public procurement datasets the limitations of award data are further pronounced by missing data on unsuccessful bidders. In this light, the validity of contract awards recommended by machine learning from data with these flaws can be easily contested by unsuccessful bidders perceiving themselves to be a subject of discrimination. Consequently, blindly mimicking human decisions in matching bidders to contracts may not be the best course of action.

Some suspect we can solve this by learning from more kinds of data. Besides award data, we can learn from post-award data, including the evaluations of bidders’ performance. However, post-award data is hardly ever available. Without it, the record of a contract is incomplete. Moreover, in case of learning from data for decision support we combine conflicting incentives of decision arbiters and data collectors, who are often the same in public procurement. For instance, a winning bidder may receive a favourable evaluation despite its meagre delivery on the awarded contracts thanks to its above-standard relationship with the civil servant administering the contracts.

In Logics and practices of transparency and opacity in real-world applications of public sector machine learning Michael Veale suggests that we need to carefully select what data gets disclosed to combine “transparency for better decision-making and opacity to mitigate internal and external gaming.” We need to balance publishing open data to achieve transparency with the use of opacity to avoid gaming the rules of public procurement. Apart from data selection, pre-processing training data for decision support is of key importance. In our data-driven culture, engineering data may displace engineering software, as Andrej Karpathy argues in Software 2.0. Careful data pre-processing may help avoid overfitting to biases that contaminate the public procurement data.

Neither technology nor history is neutral. In fact, history always favours the winners. Winners of public contracts are thus always seen in favourable light in public procurement data. Moreover, data is not an objective record of the public procurement history. Therefore, we cannot restrict ourselves to copy-pasting from history of public procurement by a mechanical application of machine learning oblivious of the limitations of its training data. On this note in the context of the criminal justice system, the Raw data podcast asks: “Given that algorithms are trained on historical data – data that holds the weight of hundreds of years of racial discrimination – how do we evaluate the fairness and efficacy of these tools?” Nevertheless, how to escape this weight of history in machine learning is not yet well explored.

2017-06-17

What I would like to see in SPARQL 1.2

SPARQL 1.1 is now 4 years old. While considered by many as a thing of timeless beauty, developers are a fickle folk, and so they already started coming up with wish lists for SPARQL 1.2, asking for things like variables in SPARQL property paths. Now, while some RDF stores still struggle with implementing SPARQL 1.1, others keep introducing new features. For example, Stardog is extending SPARQL solutions to include collections. In fact, we could invoke the empirical method and look at a corpus of SPARQL queries to extract the most used non-standard features, such as the Virtuoso's bif:datediff(). If these features end up getting used and adopted among the SPARQL engines, we might as well have them standardized. The little things that maintain backwards compatibility may be added via small accretive changes to make up the 1.2 version of SPARQL. On the contrary, breaking changes that would require rewriting SPARQL 1.1 queries should wait for SPARQL 2.0.

In the past few years SPARQL has been the primary language I code in, so I had the time to think about what I would like to see make its way into SPARQL 1.2. Here is a list of these things; ordered perhaps from the more concrete to the more nebulous ones.

The REDUCED modifier is underspecified. Current SPARQL engines treat it "like a 'best effort' DISTINCT" (source). What constitutes the best effort differs between the implementations, which makes queries using REDUCED unportable (see here). Perhaps due to its unclear interpretation and limited portability queries using REDUCED are rare. Given its limited uptake, future versions of SPARQL may consider dropping it. Alternatively, REDUCED may be explicitly specified as eliminating consecutive duplicates.
The GROUP_CONCAT aggregate is nondeterministic. Unlike in SQL, there is no way to prescribe the order of the concatenated group members, which would make the results of GROUP_CONCAT deterministic. While some SPARQL engines concatenate in the same order, some do not, so adding an explicit ordering would help. It would find its use in many queries, such as in hashing for data fusion. The separator argument of GROUP_CONCAT is optional, so that SPARQL 1.2 can simply add another optional argument for order by.
SPARQL 1.2. can add a property path quantifier elt{n, m} to retrieve n to m hops large graph neighbourhood. This syntax, which mirrors the regular expression syntax for quantifiers, is already specified in SPARQL 1.1 Property Paths draft and several SPARQL engines implement it. While you can substitute this syntax by using repeated optional paths, it would provide a more compact notation for expressing the exact quantification, which is fairly common, for example in query expansion.
(Edit: This is a SPARQL 1.1 implementation issue.) The functions uuid(), struuid(), and rand() should be evaluated once per binding, not once per result set. Nowaydays, some SPARQL engines return the same value for these functions in all query results. I think the behaviour of these functions is underspecified in SPARQL 1.1, which only says that “different numbers can be produced every time this function is invoked” (see the spec) and that “each call of UUID() returns a different UUID” (spec). Again, this specification leads to inconsistent behaviour in SPARQL engines, in some to much chagrin. Specifying that these functions must be evaluated per binding would help address many uses, such as minting unique identifiers in SPARQL Update operations.
SPARQL 1.2 Graph Store HTTP Protocol could support quads, in serializations such as TriG or N-Quads. In such case, the graph query parameter could be omitted if the sent payload explicitly provides its named graph. I think that many tasks that do not fit within the bounds of triples would benefit from being able to work with quad-based multi-graph payloads.
SPARQL 1.2 could adopt the date time arithmetic from XPath. For instance, in XPath the difference between dates returns an xsd:duration. Some SPARQL engines, such as Apache Jena, behave the same. However, in others you have to reach for extension functions, such as Virtuoso's bif:datediff(), or conjure up convoluted ways for seemingly simple things like determining the day of week. Native support for date time arithmetic would also make many interesting temporal queries feasible.
(UPDATE: This is wrong, see here.) Perhaps a minor pet peeve of mine is the lack of support for float division. Division of integers in SPARQL 1.1 produces an integer. As a countermeasure, I always cast one of the operands of division to a decimal number; i.e. ?n/xsd:decimal(?m). I suspect there may be a performance benefit in maintaining the type of division's arguments, yet I also consider the usability of this operation. Nevertheless, changing the interpretation of division is not backwards compatible, so it may need to wait for SPARQL 2.0.
The CONSTRUCT query form could support the GRAPH clause to produce quads. As described above, many tasks with RDF today revolve around quads and not only triples. Being able to construct data with named graphs would be helpful for such tasks.
Would you like SPARQL to be able to group a cake and have it too? I would. Unfortunately, aggregates in SPARQL 1.1 produce scalar values, while many use cases call for structured collections. As mentioned above, Stardog is already extending SPARQL solutions to arrays and objects to support such use cases. Grassroots implementations are great, but having this standardized may help avoid the SPARQL engines to reinvent this particular wheel in many slightly incompatible ways.
Finally, while SPARQL is typically written by people, many applications write SPARQL too. For them having an alternative syntax based on structured data instead of text would enable easier programmatic manipulation of SPARQL. Suggestions for improvement and a detailed analysis of the ways developers cope with the lack of data syntax for SPARQL can be found in my post on generating SPARQL.

2017-02-05

Publishing temporal RDF data as create/delete event streams

Today I wondered about publishing temporal data as create/delete event streams. The events can be replayed by a database to produce state valid at a desired time. Use cases for this functionality can be found everywhere. For example, I've been recently working with open data from company registers. When companies relocate, their registered addresses change. It is thus useful to be able to trace back in history to find out what a company's address was at a particular date. When a company register provides only the current state of its data, tracing back to previous addresses is not possible.

Event streams are not a new idea. It corresponds with the well-known event sourcing pattern. This pattern was used in RDF too (for example, in shodan by Michael Hausenblas), although it is definitely not widespread. In 2013, I wrote an article about capturing temporal dimension of linked data. I think most of it holds true to date. In particular, there is still no established way of capturing temporal data in RDF. Event streams might be an option to consider.

Many answers to questions about temporal data in RDF are present in RDF 1.1 Concepts, one of the fundamental documents on RDF. For start, "the RDF data model is atemporal: RDF graphs are static snapshots of information." Nevertheless, "a snapshot of the state can be expressed as an RDF graph," which lends itself as a crude way of representing temporal data through time-indexed snapshots encoded as RDF graphs. There is also the option of reifying everything into observations, which is what statisticians and the Data Cube Vocabulary do. Alternatively, we can reify everything that happens into events.

Events

Events describe actions on data in a dataset. You can also think about them as database transactions, along with the associated properties. RDF triples are immutable, since "an RDF statement cannot be changed – it can only be added and removed" (source). This is why we need only two types of events:

:Create (addition)
:Delete (retraction)

Events can be represented as named graphs. Each event contains its type, which can be either :Create or :Delete, and timestamps, in particular valid time. Valid time tells us when the event's data is valid in the modelled world. The dcterms:valid property seems good enough to specify the valid time. Events may additionally describe other metadata, such as provenance. For example, dcterms:creator may link the person who created the event data. Encapsulating event's metadata in its named graph makes it self-contained, but it mixes operational data with data about the described domain, so an alternative to worth considering is to store the metadata in a separate graph.

The following example event stream describes that Alice was a member of ACME since 2000, while in 2001 she left to work for the Umbrella Corp, and then returned to ACME in 2003. The example is serialized in TriG, which allows to describe quads with named graphs instead of mere triples. You can use this example to test the queries discussed further on.

@prefix :         <http://example.com/> .
@prefix dcterms:  <http://purl.org/dc/terms/> .
@prefix org:      <http://www.w3.org/ns/org#> .
@prefix schema:   <http://schema.org/> .
@prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .

:event-1 {
  :event-1 a :Create ;
    dcterms:valid "2000-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :ACME ;
    schema:name "Alice" .
}

:event-2 {
  :event-2 a :Delete ;
    dcterms:valid "2001-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :ACME .
}

:event-3 {
  :event-3 a :Create ;
    dcterms:valid "2001-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :UmbrellaCorp .
}

:event-4 {
  :event-4 a :Delete ;
    dcterms:valid "2003-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :UmbrellaCorp .
}

:event-5 {
  :event-5 a :Create ;
    dcterms:valid "2003-01-01T09:00:00Z"^^xsd:dateTime .

  :Alice org:memberOf :ACME .
}

Limitations

Describing event streams in the afore-mentioned way has some limitations. One of the apparent issues is the volume of data that is needed to encode seemingly simple facts. There are several ways how we can deal with this. Under the hood, RDF stores may implement structural sharing as in persistent data structures to avoid duplicating substructures present across multiple events. We can also make several assumptions that save space. :Create can be made the default event type, so that it doesn't need to be provided explicitly. In some limited cases, we can assume that valid time is the same as the transaction time. For example, in some countries, public contracts become valid only after they are published.

Another limitation of this approach is that it doesn't support blank nodes. You have to know the IRIs of the resources your want to describe.

Since named graphs are claimed for events, they cannot be used to distinguish datasets, as they typically are. Datasets need to be distinguished as RDF datasets. Having multiple datasets may hence mean having multiple SPARQL endpoints. Cross-dataset queries then have to be federated, or alternatively, current snapshots of the queried datasets can be loaded into a single RDF store as named graphs.

Integrity constraints

To illustrate properties of the proposed event representation, we can define integrity constraints that the event data must satisfy.

Union of delete graphs must be a subset of the union of create graphs. You cannot delete non-existent data. The following ASK query must return false:

PREFIX : <http://example.com/>

ASK WHERE {
  GRAPH ?delete {
    ?delete a :Delete .
    ?s ?p ?o .
  }
  FILTER NOT EXISTS {
    GRAPH ?create {
      ?create a :Create .
      ?s ?p ?o .
    }
  }
}

Each event graph must contain its type. The following ASK query must return true for each event:

ASK WHERE {
  GRAPH ?g {
    ?g a [] .
  }
}

The event type can be either :Create or :Delete. The following ASK query must return true for each event:

PREFIX : <http://example.com/>

ASK WHERE {
  VALUES ?type {
    :Create
    :Delete
  }
  GRAPH ?g {
    ?g a ?type .
  }
}

Events cannot have multiple types. The following ASK query must return false:

ASK WHERE {
  {
    SELECT ?g
    WHERE {
      GRAPH ?g {
        ?g a ?type .
      }
    }
    GROUP BY ?g
    HAVING (COUNT(?type) > 1)
  }
}

Querying

Querying over event streams is naturally more difficult than querying reconciled dataset snapshots. Nevertheless, the complexity of the queries may be hidden behind a proxy offering a more convenient syntax that extends SPARQL. An easy way to try the following queries is to use Apache Jena's Fuseki with an in-memory dataset loaded from the example event stream above: ./fuseki-server --file data.trig --update /ds.

Queries over the default graph, defined as the union of all graphs, query what has been true at some point in time:

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  # Fuseki uses <urn:x-arq:UnionGraph> to denote the union graph,
  # unless tdb:unionDefaultGraph is set to true.
  # (https://jena.apache.org/documentation/tdb/assembler.html#union-default-graph)
  GRAPH <urn:x-arq:UnionGraph> {
    ?s ?p ?o .
  }
}

Current valid data is a subset of the :Create graphs without the triples in the subsequent :Delete graphs:

PREFIX :        <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  GRAPH ?create {
    ?create a :Create ;
            dcterms:valid ?createValid .
    ?s ?p ?o .
  }
  FILTER NOT EXISTS {
    GRAPH ?delete {
      ?delete a :Delete ;
              dcterms:valid ?deleteValid .
      FILTER (?deleteValid > ?createValid)
      ?s ?p ?o .
    }
  }
}

We can also roll back and query data at a particular moment in time. This functionality is what Datomic provides as the asOf filter. For instance, the data valid at January 1, 2001 at 9:00 is union of the :Create events preceding this instant without the :Delete events that followed them until the chosen time:

PREFIX :        <http://example.com/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  GRAPH ?create {
    ?create a :Create ;
      dcterms:valid ?validCreate .
    FILTER (?validCreate < "2001-01-01T09:00:00Z"^^xsd:dateTime)
    ?s ?p ?o .
  }
  MINUS {
    GRAPH ?delete {
      ?delete a :Delete ;
        dcterms:valid ?validDelete .
      FILTER (?validDelete < "2001-01-01T09:00:00Z"^^xsd:dateTime)
      ?s ?p ?o .
    }
  }
}

Event resolution proxy

Manipulation with event streams following the proposed representation can be simplied by an event resolution proxy. This proxy may be based on the SPARQL 1.1 Graph Store HTTP Protocol, which provides a standard way to work with named graphs. However, the Graph Store Protocol doesn't support quad-based RDF formats, so the proxy thus needs to partition multi-graph payloads into several transactions.

The proxy can provide several conveniences. It can prune event payloads by removing retractions of non-existent triples or additions of existing triples, or by dropping complete events if found redundant. It can automatically add transaction time; for example by using BIND (now() AS ?transactionTime) in SPARQL. Simplifying even further, the proxy can automatically mint event identifiers as URNs produced by the uuid() function in SPARQL. No event metadata can be provided explicitly in such case, although some metadata may be created automatically. Event type can be inferred from the HTTP method the proxy receives. HTTP PUT may correspond with the :Create type, while HTTP DELETE should indicate the :Delete type. Additionally, the proxy can assume that valid time is the same as transaction time.

Publishing

Create/delete event streams can be effectively split to batches by time intervals, suggesting several ways of publishing such data. An event stream should be published as an updated append-only quad dump. Additionally, there may be quads dumps of events from shorter periods of time, such as a day or month, to enable more responsive data syndication. Currently valid dataset may be materialized and published as a periodically updated dump. Instead of updating the current dataset in place, it may be published in snapshots. A snapshot from a given date can be used as a basis when replaying events, so that you don't have to replay the whole history of events, but only those that came after the snapshot. Any quad-based RDF serialization, such as TriG or N-Quads, will do for the dumps. Finally, in the absence of structural sharing, the dumps should be compressed to avoid size bloat caused by duplication of shared data structures.

The next challenge is to refine and test this idea. We can also wrap the event streams in a convenient abstraction that reduces the cognitive effort that comes with their manipulation. I think this is something developers of RDF store can consider to include in their products.

Subscribe to: Posts ( Atom )