XML in Postgres – The Game Changer

1 06 2009

About the time that Postgres 8.3 came out, Adobe decided to open-source something called Flex. Up to that point, I had a non-committal relationship with Postgres, meaning I didn’t have a strong preference between Postgres and MySQL, and I didn’t participate in either community. I had heard some really good things about Adobe Flex, but I never had much interest in it until it was open-sourced. But when the two came together… it was like the Reese’s Peanut Butter Cup commercials from the 80’s or 90’s. Or the KY his-and-hers commercials of late… but maybe not quite that good 🙂 Read the rest of this entry »

In-database ETL

28 05 2009

I do a lot of work with data warehousing and operational data stores. A big part of that process is doing the Extract Transform Load (ETL). There are dozens of commercial and open-source ETL tools around, written in Java, Python, PHP, etc. Each has varying capabilities and learning curves. But the fact is, none of these tools or languages will be as efficient at transforming your data as the actual DBMS itself. That is what it was designed and optimized for.

This is the method I’ve developed. It may not be the right tool for everyone or every application, but it does have its advantages. As I already mentioned, it is efficient because this is exactly the kind of work the database was designed for. It is simpler, because you are after all a SQL expert, and there is no overhead in learning a new tool or a new language. It is more manageable because we will use views to do the heavy lifting instead of complicated config files, which makes alterations rather easy. Read the rest of this entry »
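To give a flavor of the views-as-transforms approach (all table and column names here are invented for illustration): the transform rules live in a view over the staging table, and the load step is just a plain INSERT … SELECT.

```sql
-- Hypothetical staging table populated by the extract step:
--   CREATE TABLE stg_customers (raw_name text, raw_signup text);

-- The "T" in ETL is just a view
CREATE VIEW v_customers_clean AS
SELECT initcap(trim(raw_name))            AS customer_name,
       to_date(raw_signup, 'MM/DD/YYYY')  AS signup_date
FROM   stg_customers
WHERE  raw_name IS NOT NULL;

-- The "L" is a plain INSERT ... SELECT from the view
INSERT INTO dim_customers (customer_name, signup_date)
SELECT customer_name, signup_date
FROM   v_customers_clean;
```

The payoff is maintainability: changing a cleansing rule means editing one view with CREATE OR REPLACE VIEW, not hunting through a tool’s config files.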

Postgres – adding custom aggregates most()

22 05 2009

Postgres allows you to define custom aggregates. In most cases the stock aggregate functions (AVG, MIN, MAX, SUM, etc.) will meet your needs. But let’s say we have a table of teachers joined to the classes they teach, and we want to group by teacher and see what subject area each teacher teaches. What results will you get if a teacher has 5 English classes and 1 Science class? If you use MAX() she’ll show up as a Science teacher. What we want is the subject area that shows up most often.

Enter the most() aggregate. We’ll use it to find the value that occurs most often in our group. We need a few helper functions for this. Read the rest of this entry »
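The full post shows the actual helpers; as a hedged sketch of the general shape (the function name most_final is my own, and unnest requires Postgres 8.4+), the state function collects values into an array and the final function picks the one that occurs most often:

```sql
-- Final function: return the element that occurs most often in the state array
CREATE FUNCTION most_final(anyarray) RETURNS anyelement AS $$
  SELECT v
  FROM   unnest($1) AS v
  GROUP  BY v
  ORDER  BY count(*) DESC
  LIMIT  1;
$$ LANGUAGE sql IMMUTABLE;

CREATE AGGREGATE most(anyelement) (
  SFUNC     = array_append,  -- built-in: appends each row's value to the state array
  STYPE     = anyarray,
  INITCOND  = '{}',
  FINALFUNC = most_final
);
```

With that in place, the teacher query becomes something like `SELECT teacher_id, most(subject) FROM teacher_classes GROUP BY teacher_id;` — our English teacher now shows up as an English teacher.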

Timestamp Math in Postgres

20 05 2009

Postgres allows us to add or subtract integers to a DATE type. So ‘2009-01-01’ + 1 = ‘2009-01-02’. But what we can’t do is add NUMERIC to DATE or add anything to TIMESTAMP. So ‘2009-01-01’ + 1.25 and ‘2009-01-01 12:30:00’ + 1 both fail.

I use this feature a lot in Oracle, so let’s add that functionality to Postgres.
Read the rest of this entry »
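The full post has the details; as a sketch (the function name ts_plus_num is my own), the idea is a small SQL function that treats the numeric as a number of days, exposed through a custom + operator:

```sql
-- Treat the numeric as a (possibly fractional) number of days, Oracle-style
CREATE FUNCTION ts_plus_num(timestamp, numeric) RETURNS timestamp AS $$
  SELECT $1 + ($2 * interval '1 day');
$$ LANGUAGE sql IMMUTABLE;

CREATE OPERATOR + (
  LEFTARG   = timestamp,
  RIGHTARG  = numeric,
  PROCEDURE = ts_plus_num
);

-- Now TIMESTAMP + NUMERIC works:
-- SELECT timestamp '2009-01-01 12:30:00' + 1.25;  -- 2009-01-02 18:30:00
```

A matching operator for DATE + NUMERIC (and the commuted NUMERIC + TIMESTAMP form) can be defined the same way.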

Custom sorts

20 05 2009

Problem: You need to do a non-natural sort in Postgres. For example, you may want your automotive results sorted Car, Truck, SUV, then Motorcycle.

Solution: If we were using MySQL (and we’re not), we’d do something like ORDER BY find_in_set(category, 'Car,Truck,SUV,Motorcycle'). That lets you sort on the index position of category in the set. In Postgres, we’ll use arrays instead of sets.
Read the rest of this entry »
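One way this can be sketched (the helper name idx is my own; generate_subscripts needs Postgres 8.4+, and 9.5+ ships a built-in array_position that does the same job):

```sql
-- Return the position of an element in an array (NULL if not found)
CREATE FUNCTION idx(anyarray, anyelement) RETURNS int AS $$
  SELECT i
  FROM   generate_subscripts($1, 1) AS i
  WHERE  $1[i] = $2
  LIMIT  1;
$$ LANGUAGE sql IMMUTABLE;

-- Sort vehicles in the business-defined order, not alphabetically
SELECT *
FROM   vehicles
ORDER BY idx(ARRAY['Car','Truck','SUV','Motorcycle'], category);
```

Categories not in the array sort together (NULLs last by default), which is usually what you want for stragglers.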

ETL With Postgres Arrays

20 05 2009

One area where Postgres stands head and shoulders above Oracle is arrays. Oracle arrays/collections are quite cumbersome to use, but Postgres provides us with a simple-to-use and incredibly powerful tool in arrays.

Recently I had to do some ETL work where the source was a real mess. (There’s a real surprise.) It’s always fun when the database designer graduated from the MS Excel school of normalization.

In this particular table, each row contains up to 8 test scores for each of 3 terms. So 1 row from the import table needs to be transformed into up to 24 rows, each mapped to the appropriate test and term, while eliminating null entries.

The solution is to use multi-dimensional arrays to map each entry. One dimension will map the term and the second dimension will map the test.
Read the rest of this entry »
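A hedged sketch of the shape of the query (column names are invented, and only 2 of the 8 tests are shown; function calls in FROM referencing an earlier item need Postgres 9.3+): pack the scores into a 2-D array, then cross-join against its subscripts so each cell becomes its own row.

```sql
SELECT r.student_id,
       term,                          -- dimension 1 subscript
       test,                          -- dimension 2 subscript
       r.scores[term][test] AS score
FROM (
    SELECT student_id,
           -- dimension 1 = term, dimension 2 = test
           ARRAY[[t1_read, t1_math],
                 [t2_read, t2_math],
                 [t3_read, t3_math]] AS scores
    FROM   raw_import
) r,
generate_subscripts(r.scores, 1) AS term,
generate_subscripts(r.scores, 2) AS test
WHERE r.scores[term][test] IS NOT NULL;   -- eliminate empty cells
```

The array literal itself is the term/test mapping, so the unpivot logic is visible in one place instead of being scattered across 24 UNION ALL branches.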

Timespan – Creating custom data types in PostgreSQL

19 05 2009

Working with time has always been the bane of database development, and one that it seems every DBMS has implemented differently. Postgres and MySQL have a DATE type that really just stores the date. Oracle also has a DATE type, but it stores time to the second. The equivalent in MySQL is called DATETIME. Postgres doesn’t have a DATETIME; instead you have to use TIMESTAMP(0) (technically TIMESTAMP WITHOUT TIME ZONE, but I like the shorthand personally). And while Oracle and Postgres can handle any reasonable timestamp, MySQL’s TIMESTAMP is limited to the range of the Unix timestamp. Of course, MS SQL Server deserves honorable mention as having the absolute worst handling of dates and times.

So after writing dozens of applications that deal with time, I (finally) had an epiphany. Most often I am modeling a span of time, with a start time and end time or a start time and duration. And while a timestamp is ideal for storing a specific point in time (creation time, last login), it is less ideal for recording spans (employee shift, appointment, trip time). And depending on the application (and even the table), sometimes I would model it as a start time and end time, sometimes as a start time and duration in minutes or seconds, sometimes as a start time and an interval.

But since Postgres and Oracle make it easy to create custom data types, let’s simplify things by bundling the start and end times into a single timespan type. For now we’ll implement it in Postgres, and later I’ll show it in Oracle.

There are several ways we could do this in Postgres, but let’s stay away from creating types in C. Actually, we don’t even need to use pl/pgsql, as straight SQL seems to perform faster and has the power to do what we need. Read the rest of this entry »
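A minimal sketch of where this is heading (the field and function names are my own guesses, not necessarily what the full post uses): a composite type bundling the two endpoints, plus an accessor written in straight SQL.

```sql
-- Composite type bundling the two endpoints of a span
CREATE TYPE timespan AS (
    begin_ts timestamp,
    end_ts   timestamp
);

-- Accessor in straight SQL -- no pl/pgsql needed
CREATE FUNCTION duration(timespan) RETURNS interval AS $$
    SELECT $1.end_ts - $1.begin_ts;
$$ LANGUAGE sql IMMUTABLE;

-- Usage: a shifts table with a single timespan column
--   CREATE TABLE shifts (employee_id int, worked timespan);
--   INSERT INTO shifts
--   VALUES (1, ROW('2009-05-19 09:00', '2009-05-19 17:30')::timespan);
--   SELECT employee_id, duration(worked) FROM shifts;  -- 08:30:00
```

One type then serves all three modeling styles: end time and duration are both derivable, so the table schema no longer has to pick one.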