Simplify XML Parsing in Postgres

19 06 2009

I was reading this article over at Postgres Online Journal about processing GPX XML files and it reminded me how much cruft is involved with extracting values from XML in Postgres.

The XPath function is great for some things, especially when you want to return an array of XML nodes. But it ALWAYS returns an XML array. If you want to iterate over a chunk of XML, for example pulling all of the waypoint nodes, then XPath is the perfect tool for the job. But if you want to extract a single value, it becomes clunky. So if you want to get the latitude for a node, you’ve got to get the 1st element in the XML array, cast it as varchar in order to cast it again to numeric. It is not exactly straight forward. And if you spend a lot of time parsing XML files, that cruft can really add up.

So we are going to pull a(nother) page from Oracle’s book and create a few functions that will make it easier to work with XML in Postgres.  The first thing we need is a way to extract a single value from somewhere in an XML node. Oracle’s extractvalue() function does precisely this. Here’s what it looks like in Postgres:

CREATE OR REPLACE FUNCTION extract_value(
   VARCHAR,
   XML
) RETURNS TEXT AS
$$
   SELECT CASE WHEN $1 ~ '@[[:alnum:]_]+$'
   THEN (xpath($1, $2))[1]
   WHEN $1 ~* '/text()$'
   THEN (xpath($1, $2))[1]
   WHEN $1 LIKE '%/'
   THEN (xpath($1 || 'text()', $2))[1]
   ELSE (xpath($1 || '/text()', $2))[1]
   END::text;

$$ LANGUAGE 'sql' IMMUTABLE;

This function basically checks to see if we are looking for a value in an attribute or an element. If it is an element, it makes sure that it ends with ‘text()’. Then it returns the first value it finds.

So the query to extract the waypoints from the GPX file originally looked like this:

SELECT CAST((xpath('/wpt/name/text()', wayp.pt))[1] As varchar(20)) As Name,
	CAST(CAST((xpath('/wpt/@lon', wayp.pt))[1] As varchar) As numeric) As longitude,
	CAST(CAST((xpath('/wpt/@lat', wayp.pt))[1] As varchar) As numeric) As latitude,
	CAST(CAST((xpath('/wpt/ele/text()', wayp.pt))[1] As varchar) As numeric) As Elevation
FROM (SELECT (xpath('/gpx/wpt',g.object_value))[it.i] As pt
FROM (SELECT generate_series(1,array_upper(xpath('/gpx/wpt',g.object_value),1)) As i
FROM GPX As g WHERE object_name = 'gpxtestrevised.gpx') As it
 CROSS JOIN (SELECT object_value
	FROM GPX WHERE object_name = 'gpxtestrevised.gpx') As g OFFSET 0) As wayp;

And it can now be rewritten as this:

SELECT extract_value('/wpt/name', node)  AS name,
   extract_value('/wpt/@lon', node)::numeric AS lon,
   extract_value('/wpt@lat', node)::numeric AS lat,
   extract_value('/wpt/ele', node)::numeric AS ele
FROM (
   SELECT unnest(xpath('/gpx/wpt', object_value)) AS node
   FROM gpx WHERE object_name = 'fellsLoop'
) sub

In addition to the extract_value(), I also used unnest() to convert the xml array to a set. It is a new feature in 8.4 (which should have been there all along). If you need to add it to 8.3 or below, see this post.

The next helper function tests if the xpath traversal returned any nodes, without actually returning the nodes.

CREATE OR REPLACE FUNCTION node_exists(
 VARCHAR,
 XML
) RETURNS BOOLEAN AS
$$
 SELECT CASE WHEN array_upper(xpath($1, $2), 1) > 0 
 THEN true ELSE false END; 
$$ LANGUAGE 'sql' IMMUTABLE;

We could use it like so:

SELECT CASE WHEN node_exists('/gpx/wpt[ele > 10000]', object_value)
THEN 'dress warm'
WHEN node_exists('/gpx/wpt[@lat < 30.0 and ele < 40]', object_value)
THEN 'bring a towel' ELSE 'have fun' END AS instructions
FROM gpx
WHERE object_name = 'fells_loop'

Users coming from Oracle, take note that these are not drop in replacements.  I changed the functions from
extractvalue(xml, xpath_expression) to  extract_value(xpath_expression, xml) and
existsnode(xml, xpath_expression) to node_exists(xpath_expression, xml)
to keep it consistant with the postgres conventions and the fn signature of xpath.

If you want to play along, you can download a sample gpx file here. I removed the namespace declarations at the top just like they did at Postgres Online Journal. If you want to use namespaces, you’ll need write new functions with that parameter.

Default namespaces can be troublesome to work with. The original xml has a default namespace of xmlns=”http://www.topografix.com/GPX/1/0&#8243;. And all nodes not bound to another namespace are bound to the default namespace. You’ll need to alias the default namespace to get it to work. Here I alias it as foo.
SELECT xpath(‘//foo:wpt’, object_value, array[array['foo', 'http://www.topografix.com/GPX/1/0']])

About these ads

Actions

Information

4 responses

19 06 2009
Postgres OnLine Journal

Loading and Processing GPX XML files using PostgreSQL…

Simon Greener, wrote an article on how to load GPX xml files into Oracle XMLDB. That
got me thinking that I haven’t really explored all the XML features that PostgreSQL has to offer
and to some extent I’ve been reticent about XML processed in any …

8 07 2009
Draper L. Kauffman » Blog Archive » SQL/XML

[...] Simplify XML Parsing in Postgres « Flex and Specs() [...]

29 10 2009
Prakash

GOOD aproach

3 07 2012
kacong

IT’S WORK’S GOOD
can you explain me about your function line by line ?
please answer to my email
thanks….

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s




Follow

Get every new post delivered to your Inbox.

%d bloggers like this: