I was reading this article over at Postgres Online Journal about processing GPX XML files and it reminded me how much cruft is involved with extracting values from XML in Postgres.
The XPath function is great for some things, especially when you want to return an array of XML nodes. But it ALWAYS returns an XML array. If you want to iterate over a chunk of XML, for example pulling all of the waypoint nodes, then XPath is the perfect tool for the job. But if you want to extract a single value, it becomes clunky. So if you want to get the latitude for a node, you’ve got to get the 1st element in the XML array, cast it as varchar in order to cast it again to numeric. It is not exactly straight forward. And if you spend a lot of time parsing XML files, that cruft can really add up.
So we are going to pull a(nother) page from Oracle’s book and create a few functions that will make it easier to work with XML in Postgres. The first thing we need is a way to extract a single value from somewhere in an XML node. Oracle’s extractvalue() function does precisely this. Here’s what it looks like in Postgres:
CREATE OR REPLACE FUNCTION extract_value( VARCHAR, XML ) RETURNS TEXT AS $$ SELECT CASE WHEN $1 ~ '@[[:alnum:]_]+$' THEN (xpath($1, $2)) WHEN $1 ~* '/text()$' THEN (xpath($1, $2)) WHEN $1 LIKE '%/' THEN (xpath($1 || 'text()', $2)) ELSE (xpath($1 || '/text()', $2)) END::text; $$ LANGUAGE 'sql' IMMUTABLE;
This function basically checks to see if we are looking for a value in an attribute or an element. If it is an element, it makes sure that it ends with ‘text()’. Then it returns the first value it finds.
So the query to extract the waypoints from the GPX file originally looked like this:
SELECT CAST((xpath('/wpt/name/text()', wayp.pt)) As varchar(20)) As Name, CAST(CAST((xpath('/wpt/@lon', wayp.pt)) As varchar) As numeric) As longitude, CAST(CAST((xpath('/wpt/@lat', wayp.pt)) As varchar) As numeric) As latitude, CAST(CAST((xpath('/wpt/ele/text()', wayp.pt)) As varchar) As numeric) As Elevation FROM (SELECT (xpath('/gpx/wpt',g.object_value))[it.i] As pt FROM (SELECT generate_series(1,array_upper(xpath('/gpx/wpt',g.object_value),1)) As i FROM GPX As g WHERE object_name = 'gpxtestrevised.gpx') As it CROSS JOIN (SELECT object_value FROM GPX WHERE object_name = 'gpxtestrevised.gpx') As g OFFSET 0) As wayp;
And it can now be rewritten as this:
SELECT extract_value('/wpt/name', node) AS name, extract_value('/wpt/@lon', node)::numeric AS lon, extract_value('/wpt@lat', node)::numeric AS lat, extract_value('/wpt/ele', node)::numeric AS ele FROM ( SELECT unnest(xpath('/gpx/wpt', object_value)) AS node FROM gpx WHERE object_name = 'fellsLoop' ) sub
In addition to the extract_value(), I also used unnest() to convert the xml array to a set. It is a new feature in 8.4 (which should have been there all along). If you need to add it to 8.3 or below, see this post.
The next helper function tests if the xpath traversal returned any nodes, without actually returning the nodes.
CREATE OR REPLACE FUNCTION node_exists( VARCHAR, XML ) RETURNS BOOLEAN AS $$ SELECT CASE WHEN array_upper(xpath($1, $2), 1) > 0 THEN true ELSE false END; $$ LANGUAGE 'sql' IMMUTABLE;
We could use it like so:
SELECT CASE WHEN node_exists('/gpx/wpt[ele > 10000]', object_value) THEN 'dress warm' WHEN node_exists('/gpx/wpt[@lat < 30.0 and ele < 40]', object_value) THEN 'bring a towel' ELSE 'have fun' END AS instructions FROM gpx WHERE object_name = 'fells_loop'
Users coming from Oracle, take note that these are not drop in replacements. I changed the functions from
extractvalue(xml, xpath_expression) to extract_value(xpath_expression, xml) and
existsnode(xml, xpath_expression) to node_exists(xpath_expression, xml)
to keep it consistant with the postgres conventions and the fn signature of xpath.
If you want to play along, you can download a sample gpx file here. I removed the namespace declarations at the top just like they did at Postgres Online Journal. If you want to use namespaces, you’ll need write new functions with that parameter.
Default namespaces can be troublesome to work with. The original xml has a default namespace of xmlns=”http://www.topografix.com/GPX/1/0″. And all nodes not bound to another namespace are bound to the default namespace. You’ll need to alias the default namespace to get it to work. Here I alias it as foo.
SELECT xpath(‘//foo:wpt’, object_value, array[array[‘foo’, ‘http://www.topografix.com/GPX/1/0′%5D%5D)