XML to CSV in Python

I'm having a lot of trouble converting an XML file to CSV in Python. I've gone through many forums and tried both lxml and xmlutils.xml2csv, but I can't get it to work. The file contains GPS data from a Garmin GPS device.

Here's what my XML file looks like, shortened of course:

    <?xml version="1.0" encoding="utf-8"?>
    <gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
         xmlns="http://www.topografix.com/GPX/1/1"
         version="1.1"
         creator="TC2 to GPX11 XSLT stylesheet"
         xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
      <trk>
        <name>2013-12-03T21:08:56Z</name>
        <trkseg>
          <trkpt lat="45.4852855" lon="-122.6347885">
            <ele>0.0000000</ele>
            <time>2013-12-03T21:08:56Z</time>
          </trkpt>
          <trkpt lat="45.4852961" lon="-122.6347926">
            <ele>0.0000000</ele>
            <time>2013-12-03T21:09:00Z</time>
          </trkpt>
          <trkpt lat="45.4852982" lon="-122.6347897">
            <ele>0.2000000</ele>
            <time>2013-12-03T21:09:01Z</time>
          </trkpt>
        </trkseg>
      </trk>
    </gpx>

In my massive XML file there are several track tags, but I can separate those out; they represent different "segments" or disconnections on the GPS device. All I want is a CSV file that looks something like this:

    LAT        LON          TIME          ELE
    45.4...    -122.6...    2013-12...    0.00...
    ...        ...          ...           ...

Here is the code I have so far:

    ## Call libraries
    import csv
    from xmlutils.xml2csv import xml2csv

    inputs = "myfile.xml"
    output = "myfile.csv"

    converter = xml2csv(inputs, output)
    converter.convert(tag="WHATEVER_GOES_HERE_RENDERS_EMPTY_CSV")

Here is an alternative attempt. It writes a CSV file that contains only the lat and lon headers and no data.

    import csv
    import lxml.etree

    x = '''<?xml version="1.0" encoding="utf-8"?>
    <gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
         xmlns="http://www.topografix.com/GPX/1/1"
         version="1.1"
         creator="TC2 to GPX11 XSLT stylesheet"
         xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
      <trk>
        <name>2013-12-03T21:08:56Z</name>
        <trkseg>
          <trkpt lat="45.4852855" lon="-122.6347885">
            <ele>0.0000000</ele>
            <time>2013-12-03T21:08:56Z</time>
          </trkpt>
          <trkpt lat="45.4852961" lon="-122.6347926">
            <ele>0.0000000</ele>
            <time>2013-12-03T21:09:00Z</time>
          </trkpt>
          <trkpt lat="45.4852982" lon="-122.6347897">
            <ele>0.2000000</ele>
            <time>2013-12-03T21:09:01Z</time>
          </trkpt>
        </trkseg>
      </trk>
    </gpx>
    '''

    with open('output.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerow(('lat', 'lon'))
        root = lxml.etree.fromstring(x)
        for trkpt in root.iter('trkpt'):
            row = trkpt.get('lat'), trkpt.get('lon')
            writer.writerow(row)

How can I do it? Please understand that I'm a beginner, so a more complete explanation would be super awesome!

+6
2 answers

This is a namespaced XML document. Therefore, you need to address the nodes using their respective namespaces.

The namespaces used in the document are declared on the root element at the top:

    xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
    xmlns="http://www.topografix.com/GPX/1/1"

So the first namespace is mapped to the prefix tc2 and would be used in an element such as <tc2:foobar/>. The last one, which has no prefix after xmlns, is called the default namespace. It applies to every element in the document that does not carry an explicit prefix, which includes your <trkpt /> elements.

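Therefore, you need to write root.iter('{http://www.topografix.com/GPX/1/1}trkpt') to select these elements. The plain root.iter('trkpt') from your code matches nothing, which is why only the header row was written.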
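To see the difference for yourself, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml (an assumption made only so it runs without extra installs; the {namespace}tag syntax is the same in both), and SAMPLE is a shortened stand-in for the file in the question.

```python
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"

# Shortened stand-in for the GPX file from the question (assumption: two points).
SAMPLE = (
    '<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">'
    "<trk><trkseg>"
    '<trkpt lat="45.4852855" lon="-122.6347885">'
    "<ele>0.0000000</ele><time>2013-12-03T21:08:56Z</time></trkpt>"
    '<trkpt lat="45.4852961" lon="-122.6347926">'
    "<ele>0.0000000</ele><time>2013-12-03T21:09:00Z</time></trkpt>"
    "</trkseg></trk></gpx>"
)

root = ET.fromstring(SAMPLE)

# The parser stores the namespace as part of every tag name.
print(root.tag)  # {http://www.topografix.com/GPX/1/1}gpx

# A bare tag name does not match elements in the default namespace ...
plain = list(root.iter("trkpt"))
# ... but the fully qualified name does.
qualified = list(root.iter("{%s}trkpt" % GPX_NS))

print(len(plain), len(qualified))  # 0 2
```

Printing root.tag on your own file is a quick way to check which namespace you have to put in front of the element names.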

To get the time and elevation, you can use trkpt.find() to access those elements below the trkpt node, and then element.text to get their text content (as opposed to attributes such as lat and lon, which you read with .get()). In addition, since the time and ele elements also live in the default namespace, you need the {namespace}element syntax to select them as well.
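As a small illustration of attributes versus child elements, the sketch below reads both. It again uses the standard library's ElementTree rather than lxml, and it passes a prefix mapping to find() as an alternative to spelling out {namespace}element every time. The prefix "gpx" and the helper child_text are names made up for this example; the helper also returns an empty string when a point has no such child, which is an assumption about how you would want gaps handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical prefix chosen for this example; any name works as long as
# it maps to the document's default namespace URI.
NSMAP = {"gpx": "http://www.topografix.com/GPX/1/1"}

# Stand-in data: the second point deliberately has no <ele> child.
SAMPLE = (
    '<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">'
    "<trk><trkseg>"
    '<trkpt lat="45.4852855" lon="-122.6347885">'
    "<ele>0.0000000</ele><time>2013-12-03T21:08:56Z</time></trkpt>"
    '<trkpt lat="45.4852961" lon="-122.6347926">'
    "<time>2013-12-03T21:09:00Z</time></trkpt>"
    "</trkseg></trk></gpx>"
)


def child_text(parent, path):
    """Return the text of the first matching child, or '' if it is missing."""
    node = parent.find(path, NSMAP)
    return node.text if node is not None else ""


root = ET.fromstring(SAMPLE)
rows = []
for trkpt in root.findall(".//gpx:trkpt", NSMAP):
    rows.append((
        trkpt.get("lat"),               # attribute
        trkpt.get("lon"),               # attribute
        child_text(trkpt, "gpx:ele"),   # child element text
        child_text(trkpt, "gpx:time"),  # child element text
    ))

for row in rows:
    print(row)
```

Without the None check, calling .text on the result of find() raises AttributeError for any point that lacks the element.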

So you can use something like this:

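    import csv
    import lxml.etree

    NS = 'http://www.topografix.com/GPX/1/1'
    header = ('lat', 'lon', 'ele', 'time')

    # newline='' keeps the csv module from writing blank lines on Windows (Python 3)
    with open('output.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        # x is the XML string from the question; on Python 3 lxml rejects a str
        # that carries an encoding declaration, so pass bytes instead
        root = lxml.etree.fromstring(x.strip().encode('utf-8'))
        for trkpt in root.iter('{%s}trkpt' % NS):
            # lat and lon are attributes of <trkpt>
            lat = trkpt.get('lat')
            lon = trkpt.get('lon')
            # ele and time are child elements in the default namespace
            ele = trkpt.find('{%s}ele' % NS).text
            time = trkpt.find('{%s}time' % NS).text
            row = lat, lon, ele, time
            writer.writerow(row)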
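Since the question mentions a massive file on disk rather than a string, here is a hedged sketch of the same idea reading straight from a file. It swaps in the standard library's ElementTree.iterparse so that points are handled as they are parsed; the function name gpx_to_csv and the column order (taken from the layout asked for in the question) are choices made for this example, not part of any library.

```python
import csv
import xml.etree.ElementTree as ET

# Default namespace of GPX 1.1 files, in the {uri} form used in tag names.
GPX_NS = "{http://www.topografix.com/GPX/1/1}"


def gpx_to_csv(xml_path, csv_path):
    """Write one CSV row per <trkpt> in xml_path; return the number of rows."""
    count = 0
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(("lat", "lon", "time", "ele"))
        # An "end" event fires once an element and all its children are parsed.
        for _event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag != GPX_NS + "trkpt":
                continue
            writer.writerow((
                elem.get("lat"),
                elem.get("lon"),
                # findtext returns the default when the child is missing
                elem.findtext(GPX_NS + "time", default=""),
                elem.findtext(GPX_NS + "ele", default=""),
            ))
            count += 1
            # Drop the point's children and text to limit memory use;
            # the emptied element itself stays attached to the tree.
            elem.clear()
    return count
```

Calling gpx_to_csv("myfile.xml", "myfile.csv") with the file names from the question would then write the header plus one row per track point, across all tracks and segments in the file.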

For more information about XML namespaces, see the Namespaces section of the lxml tutorial and the Wikipedia article on XML namespaces. Also see GPS eXchange Format for details on the .gpx format.

+18

Sorry for relying on ready-made tools here rather than writing code, but converting the XML to JSON and then the JSON to CSV worked like a charm with your data:

    ele,time,_lat,_lon
    0.0000000,2013-12-03T21:08:56Z,45.4852855,-122.6347885
    0.0000000,2013-12-03T21:09:00Z,45.4852961,-122.6347926
    0.2000000,2013-12-03T21:09:01Z,45.4852982,-122.6347897

So if you want to code this yourself, I believe XML > JSON > CSV might be a good approach. You will find many relevant examples mentioned in these links.
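The tools used for the conversion above are not named here, so the following is only a rough sketch of what an XML > JSON > CSV pipeline could look like with the standard library. The "_" prefix on attribute names is an assumption copied from the header in the output above, and local_name and trkpt_to_record are hypothetical helpers written for this example. It also assumes every point has the same child elements as the first one.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

GPX_NS = "{http://www.topografix.com/GPX/1/1}"

# Stand-in for the file from the question (assumption: two points).
SAMPLE = (
    '<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">'
    "<trk><trkseg>"
    '<trkpt lat="45.4852855" lon="-122.6347885">'
    "<ele>0.0000000</ele><time>2013-12-03T21:08:56Z</time></trkpt>"
    '<trkpt lat="45.4852961" lon="-122.6347926">'
    "<ele>0.0000000</ele><time>2013-12-03T21:09:00Z</time></trkpt>"
    "</trkseg></trk></gpx>"
)


def local_name(tag):
    """Strip the {namespace} part from a qualified tag name."""
    return tag.rsplit("}", 1)[-1]


def trkpt_to_record(trkpt):
    """Flatten one <trkpt> into a dict: child text first, then attributes."""
    record = {local_name(child.tag): (child.text or "").strip() for child in trkpt}
    record.update({"_" + name: value for name, value in trkpt.attrib.items()})
    return record


# Step 1: XML -> list of dicts -> JSON text
root = ET.fromstring(SAMPLE)
records = [trkpt_to_record(p) for p in root.iter(GPX_NS + "trkpt")]
as_json = json.dumps(records, indent=2)

# Step 2: JSON text -> CSV (written to a string buffer here instead of a file)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(json.loads(as_json))
csv_text = buffer.getvalue()

print(csv_text)
```

The JSON step is not strictly needed to get from XML to CSV; it is kept here only to mirror the suggested approach and can be handy if you want to inspect the intermediate data.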

+1
