In-database ETL

28 05 2009

I do a lot of work with data warehousing and operational data stores. A big part of that process is the Extract Transform Load (ETL). There are dozens of commercial and open source ETL tools around written in Java, Python, PHP, etc. Each has varying capabilities and learning curves. But the fact is, none of these tools or languages will be as efficient at transforming your data as the DBMS itself. That is what it was designed and optimized for.

This is the method I’ve developed. It may not be the right tool for everyone or every application, but it does have its advantages. As I already mentioned, it is efficient, because this is exactly the kind of work the database was designed for. It is simpler, because you are after all a SQL expert, and there is no overhead in learning a new tool or a new language. And it is more manageable, because we will use views to do the heavy lifting instead of complicated config files, which makes alterations rather easy.

I originally developed the process on Oracle, but it was easy to adapt to Postgres and should work on any other DBMS. There are basically four steps to the process.

Step 1 – Getting access to the data

If we are going to work with the data, then the db needs to have access to it. If you are selecting the data over a db-link or from tables already on the db, you’ve already completed step 1. If you will be importing from text files, my advice is to keep it simple. Doing direct copies in Oracle and Postgres is much faster than doing inserts. Don’t go crazy on formatting and validating. Just get the data in and leave the validation and formatting for later. I’d suggest creating an ETL schema where all of your target tables reside. If you have many files to import, it may be worth your while to write a script in Python, Ruby or PHP to parse all of the text files in a directory and make corresponding target tables for you.
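A minimal sketch of the load step in Postgres, assuming a hypothetical tab-delimited extract of test scores (the table and file names are illustrative, not from the original post). Everything comes in as text; validation waits for step 2.

```sql
-- Staging schema and table for the raw extract. All columns are text
-- on purpose: just get the data in, validate and cast later.
CREATE SCHEMA etl;

CREATE TABLE etl.raw_test_scores (
    test_id       text,
    student_id    text,
    score         text,
    activity_date text
);

-- Postgres bulk load from a tab-delimited file (server-side path).
-- Oracle users would reach for SQL*Loader or an external table here.
COPY etl.raw_test_scores FROM '/tmp/test_scores.txt';
```

Note that a server-side COPY reads the file as the database server, so permissions matter; `\copy` in psql is the client-side alternative.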

Step 2 – Transforming the data

This is the core of the whole process. Create a view that transforms the source data into the target table. The source view should look exactly like the target table, with all the same column types and names. Do all of your casting, referential integrity checks and lookups here. My practice is to name the view etl.src_<table_name>.
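Continuing the hypothetical test-scores example, a source view might look like this (the staging table, `students` lookup table, and column names are all assumptions for illustration):

```sql
-- Source view: shaped exactly like the target table test_scores.
-- Casting and the referential lookup happen here, not at load time.
CREATE VIEW etl.src_test_scores AS
SELECT r.test_id::integer      AS test_id,
       s.student_id            AS student_id,   -- join doubles as an RI check
       r.score::numeric(5,2)   AS score,
       r.activity_date::date   AS activity_date
  FROM etl.raw_test_scores r
  JOIN students s ON s.external_id = r.student_id;
```

Rows that fail the lookup simply drop out of the inner join; switch to a LEFT JOIN if you would rather surface them for error reporting.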

Step 3 – Comparing the data

Make a second view that joins your target table with your source view. The merge view should only return rows that need to be inserted, updated or deleted in the target table. You need to watch out for NULLs when doing comparisons: NULL compared to anything is NULL, so the comparison won’t see the change when the table contains a NULL. So write a null-to-string function that returns an empty string (”) when you get a NULL value. I call the function ‘nts’ because you end up writing it a lot 🙂 My naming convention is etl.mrg_<table_name>. If you are using Oracle, the next part is optional. For Postgres, add a column called ‘action’ that indicates whether the row needs to be inserted, updated or deleted. A big advantage of this approach is that you can always query the merge view to see exactly what needs to change, so having the action column in Oracle can be handy too.
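A sketch of the helper and the merge view in Postgres, again using the hypothetical test-scores tables (a delete branch would follow the same pattern):

```sql
-- Null-to-string helper: lets NULLs compare like empty strings.
-- Oracle's NVL serves the same purpose.
CREATE FUNCTION nts(text) RETURNS text AS $$
    SELECT coalesce($1, '')
$$ LANGUAGE sql IMMUTABLE;

-- Merge view: only rows that need to change, tagged with an action.
CREATE VIEW etl.mrg_test_scores AS
SELECT v.*, 'I' AS action               -- in source, not yet in target
  FROM etl.src_test_scores v
  LEFT JOIN test_scores t ON t.test_id = v.test_id
 WHERE t.test_id IS NULL
UNION ALL
SELECT v.*, 'U' AS action               -- in both, but values differ
  FROM etl.src_test_scores v
  JOIN test_scores t ON t.test_id = v.test_id
 WHERE nts(v.score::text)         <> nts(t.score::text)
    OR nts(v.activity_date::text) <> nts(t.activity_date::text);
```

Querying `etl.mrg_test_scores` at any time shows exactly what the next merge run would do, which is the manageability win mentioned above.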

Step 4 – Merging the data

In Oracle, write a procedure that contains a MERGE statement and optionally does some logging. Oracle’s MERGE statement can do inserts, updates and deletes in a single statement. (I also use it just to do updates, because the syntax is so much nicer than their update-with-subquery syntax.) The basic merge statement looks like:

MERGE INTO test_scores t
USING mrg_test_scores v
ON (t.test_id = v.test_id)
WHEN MATCHED THEN UPDATE
   SET score = v.score,
       activity_date = v.activity_date
WHEN NOT MATCHED THEN INSERT
   (test_id, student_id, score, activity_date)
   VALUES (seq_test_id.nextval, v.student_id, v.score, v.activity_date);

For Postgres, you’ll have to write a function that does the inserts, updates and possibly deletes individually, filtering on the action column of the merge view.
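For the hypothetical example above, such a Postgres function might be sketched like this (the sequence name `seq_test_id` is assumed to exist; a delete branch is omitted for brevity):

```sql
-- Merge function: applies the changes identified by the merge view.
-- Each branch filters on the action column ('U' = update, 'I' = insert).
CREATE FUNCTION etl.merge_test_scores() RETURNS void AS $$
BEGIN
    UPDATE test_scores t
       SET score = m.score,
           activity_date = m.activity_date
      FROM etl.mrg_test_scores m
     WHERE m.action = 'U'
       AND t.test_id = m.test_id;

    INSERT INTO test_scores (test_id, student_id, score, activity_date)
    SELECT nextval('seq_test_id'), student_id, score, activity_date
      FROM etl.mrg_test_scores
     WHERE action = 'I';
END;
$$ LANGUAGE plpgsql;
```

Run it with `SELECT etl.merge_test_scores();` and wrap the call in a transaction if partial loads are unacceptable.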

There you have it. ETL that is powerful, manageable, free and no new tools to learn!




6 responses

28 05 2009

darn, I want MERGE in postgresql now 😛

btw, thanks dude.

28 05 2009
Robert Young

Dang. It is more than a bit refreshing to read that someone else has called BS on all those ETL baubles. SQL, that’s it.

29 05 2009

Sorry, you’re missing the point of ETL, focusing only on the load and transform part (BTW the new buzzword is ELT: extract, load and then transform) and losing the requirement to control the whole process.

29 05 2009
Scott Bailey


That’s funny, I thought the point of ETL was to get external data into your database. In this industry buzzwords are a dime a dozen. Although doing ETL in the db does indeed necessitate that the data be loaded before being transformed.

But if the extract part of ETL (or ELT for the next 15 minutes) is the whole point for you, then perhaps you should walk a mile in my shoes.

It is not unusual to have little to no control over the extract part. Occasionally you’ll have full db access to the source data, in which case I use a db-link to extract, like I said. But more often, either the vendor provides you with a text or XML extract, or the data resides in flat files on some legacy COBOL system that someone telnets into and runs a report that is transferred using gopher.

30 05 2009
David Fetter


There are lots and lots of approaches to this kind of thing, and they’ve spawned a veritable alphabet soup of approaches, ETL, ELT, ETLT, etc. The trick is to find the approaches that work best for you for the task at hand 🙂

18 11 2009
Creating a Dynamic Update Statement - dBforums

[…] Here is a blog about how I do this type of thing. Using SQL and views to determine which rows to insert, update and delete. In-database ETL Flex and Specs() […]
