Thinking in Sets of Sets

6 10 2009

Over the summer, I built the Chronos temporal toolkit for Oracle. And for the last while I’ve been porting it to Postgres and working on the documentation.

The SQL standard defines several data types to model instances of times. But rarely does anything happen instantaneously. Often what we need to model is a period of time; for instance an airline flight, a hotel stay, a bus route, an employee’s shift, a doctor’s appointment, a leave request. And yet there is no standard defined to handle this. So instead we use some combination of start and end time, an anchor time and interval, or an anchor time and a number representing seconds, minutes or days.

Dr Snodgrass wrote quite a bit about temporal relations in databases and in the early 90′s there was an effort to add temporal extensions to the SQL standard. Unfortunately, that never made it out of committee. However, they laid a solid foundation that we can implement in an extensible database like Oracle and Postgres. You can read Dr Snodgrass’ work here.

Postgres already has a temporal extension which provides the period data type. The biggest problem with temporal is that it really needs to be a contrib module. If it were, it would be compiled and distributed with the server. To install it, users would only have to run a simple sql script. Instead, we have to compile from source. Its not terribly hard on Linux but not something Windows users are used to or typically have the tools and or know how to do.

So the first thing I did was implement the period data type as a composite type. It is a bit slower for some things and doesn’t have GiST support, but it is fully compatible with the temporal period and can be installed by running a simple sql script. I recommend using the temporal period type because of the GiST index. But the rest of the Chronos toolkit doesn’t care which period type you use.

I’m not going to go into great detail here about the period data type. Suffice it to say that it allows us to do really cool things when modelling a timespan. We can test for overlap, adjacency, and do set operations like intersect and union.

But that is only scratching the surface. Each period is a set or interval (the math interval not the database interval). We’ve got the basic functionality down, but we need to expand our thinking to sets of sets.  Joe Celko is an opinionated a$$… in my opinion :)  But one of his catch phrases is “thinking in sets.” And that is just what we are going to do here.

So think about this problem: an employee enters a leave request with a start and end time. In order to figure out how much leave he should be charged, we need to find his scheduled shifts, subtract any existing leave requests and holidays, figure out if the employee needs to take a lunch each day and then tally the leave taken on a day by day basis (since the employee could have split shifts or be working swing shift).

The image below illustrates some possible scenarios.  We may have a single shift (inclusion) and a single leave request (exclusion). There could be multiple leave requests for a single shift. And there could be multiple shifts for a single request. So we need to take the set of inclusions and a set of exclusions and somehow come up with the result set.

timespan_collections

How do we solve the problem? The correct answer is, “find a different project to work on, and leave this $*(! to someone else.” But if you happen to be like me and are more stubborn than you are smart, then we’ll have to work through this.

If your first response is to pop in to some procedural language and start walking cursors, then you are not “thinking in sets.” If you’ve taken a moment to think about how this problem might be solved, then you have figured out that doing this with only the base data types provided by the SQL spec would be inordinately difficult. Being able to treat each shift, leave request and day as a single unit makes the problem manageable. And being able to treat all shifts, all leave request and all days as arrays of periods makes the problem downright simple.

The Oracle implementation of Chronos is pretty much complete. The Postgres side isn’t quite production ready. But the next time you’re working on a project that involves scheduling, time tracking or calendaring, check out the Chronos temporal toolkit. And forget the cursors, think in sets… of sets.

About these ads

Actions

Information

12 responses

7 10 2009
Srgjan Srepfler

Sounds really nice. Any documentation on how it’s used? If you know Joda Time, a java library that maps temporal concepts, do you see the API mappable to ChronosDB? Joda Time should become the basis for the next Date API for Java. I’m glad we’re getting great API’s to work with time.

7 10 2009
Scott Bailey

Documentation? Damit Jim, I’m a developer not a miracle worker! LOL. It should be forthcoming in the next few weeks. I’m just slammed at the moment. The source is documented though.

Joda Time and Chronos are pretty similar. Many of the things Joda does are already provided by the db though. There are some big semantic differences though. Joda’s Period is actually an interval in databases, and Joda’s Interval is the same concept as the period data type here.

7 10 2009
davidfetter

Scott,

Have you checked out Jeff Davis’s patch to add generalized constraints? There’s also a “period” data type on pgfoundry.org that’d probably be worth checking out.

7 10 2009
Scott Bailey

David,

Yes the period data type you are referring to (also written by Jeff Davis) is part of the temporal project that I mentioned. Chronos can be used with either Jeff’s period type or it’s own period type.

7 10 2009
Jeff Davis

Hi Scott,

Thank you for promoting more discussion on this topic.

The patch for “operator exclusion constraints” (formerly “generalized index constraints”) can be used to implement temporal keys, and I expect it to be in 8.5. Next on my radar are interval merge joins, for temporal joins.

You make a good case for sets of time that are not necessarily contiguous intervals. I’ve thought briefly about this before, but I’d like to spend more time thinking about it; do you have any specific references that I should read? Do you think support for sets of time will require any backend support or a different approach to temporal joins or temporal keys? My first thought is that it can be done with a more sophisticated type and indexing strategy.

Also, does ChronosDB allow temporal joins, temporal keys, and sets of non-contiguous time?

Feel free to contact me with any further thoughts.

7 10 2009
Scott Bailey

Thanks Jeff. I was hoping to get some face time with you next week at PG West. Chronos uses period arrays to represent sets of non-contiguous time, which is where it really becomes powerful.

For instance period_exclude(period[], period[]) will return a period array containing every value contained in p1 that is not excluded by a period in p2. So in the case of employee shifts. You’d gather all of your shifts for the week into a period array and all of the leave and holidays into another array. You’d run period_exclude() and to get an array of all the periods you would be working.

I don’t think there is much (any) value in being able to index non-continuous sets because you wouldn’t typically store a period array on disk.

As for references, just Snodgrass and Celko, which I’m sure you are thoroughly familiar with.

7 10 2009
Jeff Davis

> Chronos uses period arrays to represent sets of non-contiguous time

If that’s enough, then I think we’re in pretty good shape already, I just need some extra functions and perhaps operators that act on arrays.

> I don’t think there is much (any) value in being able to index non-continuous sets

One thought I had is that it might be important for imposing a constraint, but I don’t have a practical example. As long as this doesn’t impact other work in the backend, I’m fine leaving this be until there is a real use case.

> Snodgrass and Celko, which I’m sure you are thoroughly familiar with.

I wouldn’t say “thoroughly”. I’m more familiar with “Temporal Data and the Relational Model” by C.J. Date, which is great conceptually, but doesn’t give much guidance for the SQL syntax.

“Developing Time-Oriented Database Applications in SQL” by Snodgrass seems to be showing how to contort various dialects of SQL to apply some of the temporal concepts. Unfortunately, the concepts don’t shine through very clearly in the process.

One particular issue that I ran into is trying to construct periods from timestamps with different combinations of exclusivity (e.g. open-open, open-closed, etc.). This is fairly awkward, and I ended up making 4 functions to do it. I thought about using unary operators or something, but I thought it would be too hard to remember. Trying to put intuitive syntax like ( ) ( ] [ ) [ ] is probably not workable for a language like SQL.

7 10 2009
Scott Bailey

I’m going to move the rest of this conversation to email.

When you asked for references, I forgot to mention the failed TSQL2 specification.

ftp://ftp.cs.arizona.edu/tsql/tsql2/spec.pdf

7 11 2012
nandasaurabh

Scott, I find myself working on a project which is a scheduling app and its core functionality is to figure out when resources are available and when they aren’t.

I was researching on the new range types introduced in PostgreSQL 9.2 — thank you for your hard work. It has made my life easier!

I was wondering about the efficiency / performance of the new range data types. Any insights on that?

7 11 2012
Scott Bailey

I can’t take credit for the range types in 9.2. That was mostly Jeff Davis. But with GiST index support, performance for the types of queries you’ll need to do to determine scheduling availability will be better than with separate from/to fields. And it will be easier to write queries against.

But if you want to hedge your bets, keep the existing table structure and then write a view utilizing the range types.

7 11 2012
nandasaurabh

I *want* to use the range data type. So, no hedging bets here. I’m just wondering about performance with large amounts of data and operations.

Also, is there some formal math theory behind this that you could point me to? I will be dealing mostly with non-contiguous ranges which are currently not supported. So, may have to implement some stuff on my own. Known the basic theory would really help.

Thanks!

7 11 2012
Scott Bailey

Then is Oracle an option? Postgres took the range type down a path that that will make NCS functionality pretty hard to implement. After they did that, I focused on Oracle with a wait-and-see attitude towards Postgres until the range types were fully implemented.

If you want resources, Temporal Data & the Relational Model by CJ Date is a good book; Dr Snodgrass wrote quite a bit about it and his papers are free to download; and there are several Temporal groups on LinkedIn.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s




Follow

Get every new post delivered to your Inbox.

%d bloggers like this: