Working with Multiple Databases from Python

6 08 2013

As a database developer, I work with many different databases in the course of a week. Our ERP database is on Oracle and has over 4,000 tables and I spend the bulk of my time working with Oracle. But we also have databases on Postgres, MySQL, MS SQL Server and maybe even some MS Access lurking around. And when I have to script against the databases, or load a one-off spreadsheet the tool I typically reach for is Python.

But even though Python has a nice specification in DB API 2, there is enough variability between the different adapters that you can’t switch from an database adapter that you use all the time to one used infrequently without consulting Google. Is the connection parameter named ‘pass’ or ‘password’, ‘passwd’ or ‘pwd’? What was the MySQL adapter’s name and how many upper case letters were there? And how do you make a DSN for Oracle again?

Additionally, the extra features of each adapter vary quite a bit. PostgreSQL’s psycopg2 adapter has very nice dictionary and namedtuple cursor support. Oracle’s cx_Oracle adapter gives you exactly what the spec mandates, result sets returned as a list of tuples and nothing more.

And because I spend much more time working with Oracle than the others, I set out to level the playing field. So I wrote the Python DBMS (DataBases Made Simpler) module. I wanted memory efficient dictionary-style and namedtuple cursors but I needed to do it in a way that would work with any DB API 2.0 compatible adapter, not just cx_Oracle. And I wanted a uniform connection method regardless of the underlying database adapter. And while I’m at it, I might as well add database inspection so I can see what databases, schemas, tables, views and columns are available.

Because we don’t know what type of connection or cursor object we will be using, I couldn’t just subclass one of the adapter’s connection and cursor classes. But Python makes this very easy to handle using delegation. The connection and cursor objects store a reference to the actual adapter’s objects in _connection and _cursor respectively.

Lets see it in action:

>>> import dbms
>>> db = dbms.OraConnect('myUser', 'myPass', 'myInstance')
>>> cur = db.cursor()
>>> dir(db)
['action', 'autocommit', 'begin', 'cancel', 'changepassword', 'client_identifier',
'clientinfo', 'close', 'commit', 'current_schema', 'cursor', 'dsn', 'encoding',
'inputtypehandler', 'interface', 'maxBytesPerCharacter', 'module',
'nencoding', 'outputtypehandler', 'parmHelp', 'password', 'ping', 'placeholder',
'prepare', 'register', 'rollback', 'server', 'showColumns', 'showDatabases', 'showSchemas',
'showServerVersion', 'showTables', 'showViews', 'shutdown', 'startup', 'stmtcachesize',
'subscribe', 'tnsentry', 'unregister', 'username', 'version']
>>> db.interface.__name__

The call to dbms.OraConnect imports the adapter cx_Oracle, sets the environmental variable “NSL_LANG” (something that is easy to forget and causes lots of problems if not set) and returns a Connection object. For the most part it looks and behaves like a typical DB API 2 Connection object.  The first difference is that the Connection object keeps a reference to the adapter in the interface attribute.  This is important when debugging queries and writing dynamic queries.  The other significant difference is that there are several functions to help us find our way around the database.  Lets try them out on a Postgres database

>>> import dbms
>>> pg = dbms.PgConnect('myUser', 'myPass', 'myDatabase')
>>> pg.showServerVersion()
Record1(version='PostgreSQL 9.2.3, compiled by Visual C++ build 1600, 32-bit')
>>> pg.showSchemas()
>>> pg.showTables('information_schema')
Record3(schema='information_schema', table_name='sql_features')
Record3(schema='information_schema', table_name='sql_sizing_profiles')
>>> pg.showViews('information_schema')
Record4(schema='information_schema', name='_pg_foreign_data_wrappers')
Record4(schema='information_schema', name='views')
>>> pg.showColumns('information_schema.tables')
['table_catalog', 'table_schema', 'table_name', 'table_type', 'self_referencing_column_name', 'reference_generation', 'user_defined_type_catalog', 'user_defined_type_schema', 'user_defined_type_name', 'is_insertable_into', 'is_typed', 'commit_action']
>>> pg.showParmHelp()
Your adapter's parameter style is "pyformat"
'SELECT * FROM people WHERE last_name = %(name)s AND age > %(age)s', {'name': 'Smith', 'age': 30}
'SELECT * FROM people WHERE last_name = %s AND age > %s', ('Smith', 30)

Easy enough. You can see from the examples above that the results are being returned as namedtuples. Lets take a closer look at the Cursor object. The dbms module defines a memory efficient DictRow type that is a cross between a dict and a namedtuple. It is the default return type for dbms cursors and is subclassed for each recordset.

>>> cur = pg.cursor()
>>> cur.execute('SELECT * FROM information_schema.tables')
>>> row = cur.fetchone()
>>> print row
Record12{'table_catalog': 'pods', 'table_schema': 'pg_catalog', 'table_name': 'pg_type', 'table_type': 'BASE TABLE', 'self_referencing_column_name': None, 'reference_generation': None, 'user_defined_type_catalog': None, 'user_defined_type_schema': None, 'user_defined_type_name': None, 'is_insertable_into': 'YES', 'is_typed': 'NO', 'commit_action': None}
>>> row.table_name
>>> row['table_name']
>>> row[2]
>>> row[0:3]
['pods', 'pg_catalog', 'pg_type']

Scheduling with Date Mod Part 2

20 10 2009

So as the title suggests, this is the second part of a series on scheduling. In the first part we did some basic setup. Now we are ready to dig in. The goal here is to be able to model any recurring pattern no matter the cycle.

The linchpin in this solution is to define an anchor date and the cycle length for every schedule. Given those two pieces of information, we can calculate the day of cycle for any given date using date mod. So lets create a couple of tables.
Read the rest of this entry »

Scheduling with Date Mod Part 1

19 10 2009

One question that seems to come up a lot is how to model recurring schedules. That’s understandable, because schedules can get out of hand pretty quickly.

We need a template so that we can define a schedule and then some how project it out on a repeating basis. The single week schedules are pretty easy. We just record the day of week and for each schedule and day we look up that day of week to get the shift for that day.
Read the rest of this entry »

Thinking in Sets of Sets

6 10 2009

Over the summer, I built the Chronos temporal toolkit for Oracle. And for the last while I’ve been porting it to Postgres and working on the documentation.

The SQL standard defines several data types to model instances of times. But rarely does anything happen instantaneously. Often what we need to model is a period of time; for instance an airline flight, a hotel stay, a bus route, an employee’s shift, a doctor’s appointment, a leave request. And yet there is no standard defined to handle this. So instead we use some combination of start and end time, an anchor time and interval, or an anchor time and a number representing seconds, minutes or days. Read the rest of this entry »

Internal Data Representation

3 10 2009

Have you ever wondered how your database represents the timestamp you just entered or how many bytes it takes to store a boolean or interval?

Oracle has a handy dump function that lets us peer into the internal representation of our data. Dump takes any type of input and returns a varchar2 describing the data type, length and the internal representation. Output might look like this:
‘Typ=96 Len=4: 84,101,99,104’
Read the rest of this entry »

Simplify XML Parsing in Postgres

19 06 2009

I was reading this article over at Postgres Online Journal about processing GPX XML files and it reminded me how much cruft is involved with extracting values from XML in Postgres.

The XPath function is great for some things, especially when you want to return an array of XML nodes. But it ALWAYS returns an XML array. If you want to iterate over a chunk of XML, for example pulling all of the waypoint nodes, then XPath is the perfect tool for the job. But if you want to extract a single value, it becomes clunky. So if you want to get the latitude for a node, you’ve got to get the 1st element in the XML array, cast it as varchar in order to cast it again to numeric. It is not exactly straight forward. And if you spend a lot of time parsing XML files, that cruft can really add up.

So we are going to pull a(nother) page from Oracle’s book and create a few functions that will make it easier to work with XML in Postgres.  The first thing we need is a way to extract a single value from somewhere in an XML node. Oracle’s extractvalue() function does precisely this. Here’s what it looks like in Postgres: Read the rest of this entry »

XML in Postgres – The Game Changer

1 06 2009

About the time that Postgres 8.3 came out, Adobe decided to open-source something called Flex. Up to that point, I had a non-committal relationship with Postgres. Meaning that I didn’t have a real strong preference between Postgres vs MySQL, and I didn’t participate in either community. I had heard some really good things about Adobe Flex, but I never had much interest in it until it was open-sourced. But when the two came together… it was like the Reese’s Penut Butter Cup commercials from the 80’s or 90’s. Or the KY his and her’s commercials of late… but maybe not quite that good 🙂 Read the rest of this entry »