Temporal Computation

Computing values that change over time

Time plays a critical role when working with event-based data and event-based models. The ability to calculate point-in-time historical feature values is one of the core features of Kaskada.

Rather than computing a single value, Fenl expressions produce temporal streams describing the result of a given computation as it changes over time.

Solving the Challenge of Event-Based Data

Kaskada is an event-based computation engine. An "event" can be any fact about the world associated with a time, for example, a user signing up for a service, or a customer purchasing a product. Most sources of event-based data change over time as events occur and are added to the system. Computing values from a set of events that changes over time means that the results must change as well.

Traditional data processing systems are designed to answer questions about the current state of a dataset, for instance, "how many purchases has a given user made?". This approach has some drawbacks: the answer to a given question changes based on when it is asked, and the only time at which you can ask questions is "now".

These limitations are reasonable for many use cases, but they make it difficult to build feature examples for training many machine learning models. A common error is accidentally using information that is known "now" to build training examples intended to describe the information available in the past.

The way traditional computations are expressed doesn't help matters. Query languages like SQL and data-processing interfaces like DataFrames were designed to answer questions about tabular (rather than temporal) data. Seemingly simple questions like "how many fraud reports had been filed against each purchase's vendor at the time of purchase?" can require complex windowing and partitioning operations.

How Fenl Deals with Time

Fenl takes a different approach by designing awareness of time into the query language.

Value Streams

Rather than answering a question with a single value, Fenl produces a stream of values describing the answer as it changes over time. For example, the answer to the question "how many purchases has a given user made?" might be the following table:

Time         Purchase | count()
2012-02-23   1
2012-05-10   2
2018-11-03   3
2019-10-26   4

From this table we can see that if the question had been asked in 2015 the answer would have been "the user has made two purchases", but if the question were asked now the answer would be "the user has made four purchases".
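The value-stream idea can be sketched in plain Python (hypothetical data; this is a model of the semantics, not Kaskada's API): each event appends a (time, running count) pair to the stream.

```python
from datetime import date

# Hypothetical purchase events for a single user.
purchases = [date(2012, 2, 23), date(2012, 5, 10),
             date(2018, 11, 3), date(2019, 10, 26)]

def count_stream(events):
    """Emit a (time, running count) pair each time an event arrives."""
    return [(t, n) for n, t in enumerate(sorted(events), start=1)]

stream = count_stream(purchases)
# stream mirrors the table above: count 1 in Feb 2012 rising to 4 in Oct 2019.
```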

Limiting Streams to Specific Times

Fenl allows these questions to be asked at explicit points in time. For example, we can observe the value at the beginning of 2015 using the at operation.

Time         Purchase | count() | at("2015-01-01")
2015-01-01   2

This approach ensures that results are reproducible and that questions can be easily and safely asked at arbitrary times in the past.
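The semantics of at can be modeled as "the latest value in the stream at or before the given time" — a minimal Python sketch over the count stream above (again, not Kaskada's API):

```python
from datetime import date

# A value stream of (time, running purchase count) pairs, as in the table above.
stream = [(date(2012, 2, 23), 1), (date(2012, 5, 10), 2),
          (date(2018, 11, 3), 3), (date(2019, 10, 26), 4)]

def at(stream, as_of):
    """Return the latest value at or before `as_of`, or None if no value exists yet."""
    value = None
    for t, v in sorted(stream):
        if t <= as_of:
            value = v
    return value
```

Because the result depends only on the stream and the requested time, re-running the query always yields the same answer.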

Temporally-Correct Joins

A core feature of Fenl is the ability to compute temporal joins across datasets. For example, the question "how many fraud reports had been filed against each purchase's vendor at the time of purchase?" can be written in a single line.

FraudReport | count() | lookup(Purchase.vendor_id)
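A temporally-correct join can be modeled as: for each purchase, count only the fraud reports for the same vendor filed at or before the purchase time. A Python sketch with made-up data (the vendor and purchase identifiers are hypothetical):

```python
from datetime import date

# Hypothetical events: fraud reports per vendor, and purchases referencing a vendor.
fraud_reports = [("v1", date(2012, 1, 1)), ("v1", date(2014, 6, 1)),
                 ("v2", date(2013, 3, 15))]
purchases = [("p1", "v1", date(2013, 1, 1)),
             ("p2", "v1", date(2015, 1, 1)),
             ("p3", "v2", date(2012, 6, 1))]

def fraud_count_at_purchase(purchases, fraud_reports):
    """For each purchase, count the vendor's fraud reports filed at or before it."""
    return {pid: sum(1 for vendor, filed in fraud_reports
                     if vendor == v and filed <= t)
            for pid, v, t in purchases}

counts = fraud_count_at_purchase(purchases, fraud_reports)
```

Note that purchase "p3" sees zero reports even though a report against its vendor exists — the report was filed after the purchase, so it was not known at that time.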

Basic Temporal Aggregation

Temporal computation is most important when dealing with aggregations, because aggregations incorporate values associated with different times.

We can think of aggregations as consuming a stream of input values and producing a stream of output values. By default each time an aggregation consumes an input it produces an output. In this case the time associated with each output is the same as the time associated with the corresponding input, and the output's value is the result of applying the aggregation to all the inputs consumed up to that time.

Purchase.amount | sum()

Time         Purchase.amount   ... | sum()
2012-02-23   5                 5
2012-05-10   2                 7
2018-11-03   13                20
2019-10-26   4                 24
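The default behavior described above can be modeled as a running fold over the time-sorted events — a Python sketch with hypothetical amounts matching the table:

```python
from datetime import date

# Hypothetical (time, amount) purchase events, matching the table above.
purchases = [(date(2012, 2, 23), 5), (date(2012, 5, 10), 2),
             (date(2018, 11, 3), 13), (date(2019, 10, 26), 4)]

def cumulative_sum(events):
    """Each consumed input produces an output: the sum of all inputs so far."""
    total, out = 0, []
    for t, amount in sorted(events):
        total += amount
        out.append((t, total))
    return out

out = cumulative_sum(purchases)
```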

Windowed Aggregations

The default behavior of aggregations is to produce an output whose value is an aggregation of all inputs seen to date, each time an input is consumed. Windowed aggregations allow two aspects of this behavior to be controlled: which inputs are aggregated, and when outputs are produced.

Controlling What is Aggregated

The first aspect describes the set of input values used in an aggregation. The default behavior is for every input value to contribute. In some cases it may be preferable to only include the N most recent inputs, or to include every input since a particular event occurred.

Controlling When Outputs are Produced

The second aspect describes when the result of the aggregation should be produced. The default behavior is to produce an output value every time an input value is consumed. In some cases it may be preferable to produce an output value at the end of each day, or when a particular event occurs.

Windowing Examples

Aggregations may be windowed by providing a window generator for the aggregation's window parameter. For example, passing the prev(2) window generator to sum computes the sum of the two most recent purchase amounts.

The prev(n) window generator affects what is aggregated but retains the default when behavior of producing an output associated with each input.

Purchase.amount | sum(window = prev(2))

Time         Purchase.amount   ... | sum(window = prev(2))
2012-02-23   5                 null
2012-05-10   2                 7
2018-11-03   13                15
2019-10-26   4                 17
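The prev(2) semantics can be modeled with a bounded buffer — a Python sketch in which null becomes None, and the window only produces a sum once n inputs have arrived:

```python
from collections import deque
from datetime import date

# Hypothetical (time, amount) purchase events, matching the table above.
purchases = [(date(2012, 2, 23), 5), (date(2012, 5, 10), 2),
             (date(2018, 11, 3), 13), (date(2019, 10, 26), 4)]

def sum_prev(events, n):
    """Sum of the n most recent inputs; None (null) until n inputs have arrived."""
    window, out = deque(maxlen=n), []
    for t, amount in sorted(events):
        window.append(amount)
        out.append((t, sum(window) if len(window) == n else None))
    return out

out = sum_prev(purchases, 2)
```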

The yearly() window generator can be used to produce, at the beginning of each year, the total of all purchases made during the preceding year.

Purchase.amount | sum(window = yearly())

Time         ... | sum(window = yearly())
2013-01-01   7
2014-01-01   0
2015-01-01   0
2016-01-01   0
2017-01-01   0
2018-01-01   0
2019-01-01   13
2020-01-01   4
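The yearly() behavior can be modeled by emitting, at each January 1st between the oldest and newest events, the sum of the preceding year's amounts — a Python sketch:

```python
from datetime import date

# Hypothetical (time, amount) purchase events, matching the table above.
purchases = [(date(2012, 2, 23), 5), (date(2012, 5, 10), 2),
             (date(2018, 11, 3), 13), (date(2019, 10, 26), 4)]

def yearly_sum(events):
    """At each January 1st, emit the sum of amounts from the preceding year."""
    events = sorted(events)
    years = range(events[0][0].year, events[-1][0].year + 1)
    return [(date(y + 1, 1, 1), sum(a for t, a in events if t.year == y))
            for y in years]

out = yearly_sum(purchases)
```

Bounding the range of years by the oldest and newest events is what keeps the result finite, as discussed below.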

Going Deeper

Yearly windows produce values at the end of the window, but when should we stop producing windows? The set of times associated with events is finite and known when a computation takes place, but there are an unbounded number of year boundaries.

To avoid producing unbounded results, Fenl limits "cron-style" windows to time intervals that begin before the newest event and end after the oldest event in the dataset, across all entities.

Repeated Aggregation

Events may be aggregated multiple times. The events themselves are a sequence of timestamped values for each entity, and the result of a first aggregation has the same shape — a sequence of timestamped values for each entity. Applying an additional aggregation simply aggregates over those timestamped results. For example, we can compute the running mean of the running purchase sum.

Purchase.amount | sum() | mean()

Time         Purchase.amount   ... | sum()   ... | mean()
2012-02-23   5                 5             5
2012-05-10   2                 7             6
2018-11-03   13                20            10.666
2019-10-26   4                 24            14
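Repeated aggregation can be modeled as feeding one stream into the next — a Python sketch of sum() followed by mean(), over the same hypothetical purchases:

```python
from datetime import date

# Hypothetical (time, amount) purchase events, matching the table above.
purchases = [(date(2012, 2, 23), 5), (date(2012, 5, 10), 2),
             (date(2018, 11, 3), 13), (date(2019, 10, 26), 4)]

def running_sum(stream):
    """Fold a stream of (time, value) pairs into a stream of running sums."""
    total, out = 0, []
    for t, v in sorted(stream):
        total += v
        out.append((t, total))
    return out

def running_mean(stream):
    """Fold a stream of (time, value) pairs into a stream of running means."""
    total, out = 0, []
    for n, (t, v) in enumerate(sorted(stream), start=1):
        total += v
        out.append((t, total / n))
    return out

sums = running_sum(purchases)   # values: 5, 7, 20, 24
means = running_mean(sums)      # the mean is taken over the sums, not the amounts
```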

© Copyright 2021 Kaskada, Inc. All rights reserved.