I do a lot of work with data warehousing and operational data stores. A big part of that process is the Extract Transform Load (ETL) step. There are dozens of commercial and open source ETL tools around, written in Java, Python, PHP, etc. Each has varying capabilities and learning curves. But the fact is, none of these tools or languages will be as efficient at transforming your data as the actual DBMS itself. That is what it was designed and optimized for.
This is the method I’ve developed. It may not be the right tool for everyone or every application, but it does have its advantages. As I already mentioned, it is efficient, because this is exactly the kind of work the database was designed for. It is simpler, because you are, after all, a SQL expert, and there is no overhead in learning a new tool or a new language. And it is more manageable, because we use views to do the heavy lifting instead of complicated config files, which makes alterations rather easy.
I originally developed the process on Oracle, but it was easy to adapt to Postgres and should work on any other DBMS. There are basically four steps to the process.
Step 1 – Getting access to the data
If we are going to work with the data, then the db needs to have access to it. If you are selecting the data over a db-link or from tables already on the db, you’ve already completed step 1. If you will be importing from text files, my advice is to keep it simple: doing direct copies in Oracle and Postgres is much faster than doing inserts. Don’t go crazy on formatting and validating; just get the data in and leave the validation and formatting for later. I’d suggest creating an ETL schema where all of your target tables reside. If you have many files to import, it may be worth your while to write a script in Python, Ruby or PHP to parse all of the text files in a directory and create the corresponding target tables for you.
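As a minimal Postgres sketch, assuming a hypothetical pipe-delimited file test_scores.txt and a staging table in the etl schema (the file path, table and column names are just placeholders):

CREATE SCHEMA IF NOT EXISTS etl;

-- Staging table: keep every column as text; cast and validate later in the views.
CREATE TABLE etl.test_scores_raw (
    test_id       text,
    student_id    text,
    score         text,
    activity_date text
);

-- Load the file as-is; COPY is far faster than row-by-row INSERTs.
COPY etl.test_scores_raw FROM '/data/test_scores.txt'
    WITH (FORMAT csv, DELIMITER '|', HEADER true);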
Step 2 – Transform the data
This is the core of the whole process. Create a view that transforms the source data into the shape of the target table. The source view should look exactly like the target table, with all the same column types and names. Do all of your casting, referential integrity checks and lookups here. My practice is to name the view etl.src_<table_name>.
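A sketch of such a view in Postgres, continuing the hypothetical test_scores example above (the casts, the students lookup table and its columns are assumptions; your own transformations will differ):

-- Source view: same column names and types as the target test_scores table.
CREATE OR REPLACE VIEW etl.src_test_scores AS
SELECT
    r.test_id::integer     AS test_id,
    s.student_id           AS student_id,   -- join to students acts as the referential integrity check
    r.score::numeric(5,2)  AS score,
    r.activity_date::date  AS activity_date
FROM etl.test_scores_raw r
JOIN students s ON s.student_id = r.student_id::integer;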
Step 3 – Comparing the data
Make a second view that joins your target table with your source view. The merge view should only return rows that need to be inserted, updated or deleted in the target table. You need to watch out for nulls when doing comparisons: NULL compared to anything is NULL, so the comparison won’t see the change when the table contains a NULL. So write a null-to-string function that returns something like '' when you get a NULL value. I call the function ‘nts’ because you end up writing it a lot 🙂 My naming convention is etl.mrg_<table_name>. For Postgres, add a column called ‘action’ that indicates whether the row needs to be inserted, updated or deleted; in Oracle this is optional. But a big advantage of this approach is that you can always query the merge view to see exactly what needs to change, so having the action column in Oracle can be handy too.
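Here is a minimal Postgres sketch of both pieces for the same hypothetical example, assuming the target test_scores table has exactly the same columns as the source view (the ‘I’/‘U’/‘D’ action flags are just a convention I’m using for illustration):

-- NULL-to-string helper: NULL compared to anything is NULL, so normalize first.
CREATE OR REPLACE FUNCTION nts(val text) RETURNS text AS $$
    SELECT COALESCE(val, '');
$$ LANGUAGE sql IMMUTABLE;

-- Merge view: only rows that need to be inserted, updated or deleted.
CREATE OR REPLACE VIEW etl.mrg_test_scores AS
SELECT v.*, 'I' AS action                          -- in source, not in target: insert
FROM etl.src_test_scores v
LEFT JOIN test_scores t ON t.test_id = v.test_id
WHERE t.test_id IS NULL
UNION ALL
SELECT v.*, 'U' AS action                          -- in both, but a value differs: update
FROM etl.src_test_scores v
JOIN test_scores t ON t.test_id = v.test_id
WHERE nts(t.score::text)         <> nts(v.score::text)
   OR nts(t.activity_date::text) <> nts(v.activity_date::text)
UNION ALL
SELECT t.*, 'D' AS action                          -- in target, not in source: delete
FROM test_scores t
LEFT JOIN etl.src_test_scores v ON v.test_id = t.test_id
WHERE v.test_id IS NULL;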
Step 4 – Merging the data
In Oracle, write a procedure that contains a merge statement and optionally does some logging. Oracle’s MERGE statement can do inserts, updates and deletes in a single statement. (I also use it just to do updates, because the syntax is so much nicer than Oracle’s update-with-subquery syntax.) The basic merge statement looks like:
MERGE INTO test_scores t
USING mrg_test_scores v
ON (t.test_id = v.test_id)
WHEN MATCHED THEN
  UPDATE SET score = v.score,
             activity_date = v.activity_date
WHEN NOT MATCHED THEN
  INSERT (test_id, student_id, score, activity_date)
  VALUES (seq_test_id.nextval, v.student_id, v.score, v.activity_date);
For Postgres, you’ll have to write a function that does the inserts, updates and possibly deletes individually, filtering on the action column of the merge view.
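A sketch of what that function might look like in PL/pgSQL, again against the hypothetical mrg_test_scores view and the ‘I’/‘U’/‘D’ flags from the examples above (the function name and the seq_test_id sequence are assumptions):

CREATE OR REPLACE FUNCTION etl.merge_test_scores() RETURNS void AS $$
BEGIN
    -- Deletes: rows in the target that no longer exist in the source.
    DELETE FROM test_scores t
    USING etl.mrg_test_scores m
    WHERE m.action = 'D' AND t.test_id = m.test_id;

    -- Updates: rows present in both but with differing values.
    UPDATE test_scores t
    SET score = m.score,
        activity_date = m.activity_date
    FROM etl.mrg_test_scores m
    WHERE m.action = 'U' AND t.test_id = m.test_id;

    -- Inserts: rows in the source that are not yet in the target.
    INSERT INTO test_scores (test_id, student_id, score, activity_date)
    SELECT nextval('seq_test_id'), m.student_id, m.score, m.activity_date
    FROM etl.mrg_test_scores m
    WHERE m.action = 'I';
END;
$$ LANGUAGE plpgsql;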
There you have it: ETL that is powerful, manageable, free, and requires no new tools to learn!