This post is the second in a sequence intended to introduce developers to the Trill streaming query engine, its programming model, and its capabilities. We introduced in the previous post the concept of snapshot semantics for temporal query processing. Here, we go deeper into the mechanics of snapshot semantics by showing its impact on one of the most basic relational operations in query processing, the Join.
The join operation and how it is handled is one of the key differences between a standard relational query engine and one that operates on temporal data. What’s more, how an engine accommodates a notion of time in its query language speaks volumes about how that system thinks about time management. So here, we take the time to demonstrate Trill’s take on joining data.
A join is, of course, a binary operation, taking two inputs and producing a single output stream. Snapshot semantics tells us that we want to be looking at join results at each point in time. To see this operation in action graphically, consider the following simple pair of streams with events we want to join:
With snapshot semantics, computing the join between these two sources is straightforward. A join result is present at any time when there is a matching event in each source. So, what we conceptually look for is interval overlap, where each event exists at the same time:
Points, points everywhere
Of course, a lot of naturally temporal data does not appear as intervals but rather as points. A sensor reading, for instance, comes into the system with a reading taken at a single time, say 10:15am. Whether that reading should be valid until the next reading, or for a certain period, or some other heuristically-driven policy is a matter for the query designer. But before any lifetime alterations are made in the query, the input data to a join would be just points, looking like this:
A common join operation over point events such as these might look something like, “join the left and right data where the points are within 10 minutes of each other.” Indeed, for engines whose query and lifetime operations are not separate, a join must be specified in that form, with an explicit temporal condition.
But as with all things Trill, we need to think of data as intervals. We will need to adjust the lifetime of these points to bring about the join result that we want. To do so, we need to answer two questions:
- Which points on the left and right should match each other?
- What lifetime do we expect the result to have?
To answer both simultaneously, we need to reformulate the initial question. “Join left and right data within 10 minutes” alone only addresses the first point.
Here is one way to reformulate the question: “Return all data from the left, matched with any data from the right within 10 minutes.” The asymmetric nature of this new question gives us the lifetime information that we want – we’re using the data on the left as a pivot, so we want to preserve whatever the lifetime information was in the left data.
To effect the correct result for this question, we need to take the data on the right and adjust its lifetime to include the plus/minus 10 minute time window. Because we use the left data as the pivot, we leave it alone. The resulting lifespans, along with the join query result, look like this:
Conversely, we could have used the data on the right of the join as the pivot. Had we instead asked, “Return all data from the right, matched with any data from the left within 10 minutes,” we alter the left input data lifespans with the 10 minute time window, getting a different diagram and query result:
Both join results yield the same two data points, just at different times: one with the original times of the events from the left stream, and one from the events on the right.
Time is of the essence
There is yet another way to formulate the question, which also happens to be the default way that temporal joins over points works in many systems: “Return data from the left and right that occur within 10 minutes of each other, as quickly as possible with whatever timestamp is convenient.” That sentence is a mouthful, but its intent is clear: What’s important is finding whatever data matches, and the resulting lifetimes are irrelevant so long as latency is minimal.
The problem with latency in the left-pivot and right-pivot cases is that the width of the lifetime extensions may impact latency. Intuitively, the wider one expands a lifetime in a join input, the more latency as the system must wait longer for potential matches. There is a way to change the event lifetimes to get the same results but with smaller overall time windows, as follows:
In this diagram, instead of extending one side’s events by 10 minutes in both directions, we extend events on both sides by 10 minutes in one direction. We get the same two results thanks to the overlap in time intervals. The lifespans of the result events are… complicated, roughly belonging to whichever of the joined events happened second.
What’s the takeaway here? One big advantage of having the join operation use snapshot semantics is that there is a wide variety of possibilities for returning results that allow the query author control over the result. The temporal join offered by other streaming systems is possible, but so are other variants that may be more semantically correct if desired.
Like siblings on a car trip: I’m only going if you’re not
With this understanding of join in hand, we can take our newfound knowledge and apply it to other binary operations as well. What if we wanted to take an anti-semijoin of our two data streams instead?
The result stream will consist of intervals that represent when a left event exists, but a right event does not. What we end up with is interval subtraction instead of intersection. One could then ask the same question as before with Join: If we have point streams instead of intervals, what are my possibilities for computing an anti-semijoin in Trill?
How one defines anti-semijoin here depends entirely on desired semantics and latency requirements. One possibility, motivated by the first point-join arrangement before, would use the left stream as a pivot and adjust the lifetimes of right-stream data:
Is this “correct”? The answer is in the eye of the beholder. The good news, though, is that irrespective of your requirements for correctness and for latency, there is a way to map those requirements to lifespan intervals in Trill.
Are we there yet?
Between this post and the previous one, we have covered a fair amount of Trill’s basic query surface. We still need to show a little more about how aggregations work, as well as Trill’s support for automata and regular expressions. But before that, one big piece remaining is describing how to get data into and out of Trill, so our next tutorial post will cover ingress policies and temporal order. Until then, as always, please feel free to look at our code, skim through our usage samples, and let us know if there’s something you’d like to see more detail on.