How can I capture data series from an XML or TCX file?

Question

I want to use Python to process the data between specific tags in a .tcx file (XML format).
The file looks like this:

    <Track>
      <Trackpoint>
        <Time>2015-08-29T22:04:39.000Z</Time>
        <Position>
          <LatitudeDegrees>37.198049426078796</LatitudeDegrees>
          <LongitudeDegrees>127.07204628735781</LongitudeDegrees>
        </Position>
        <AltitudeMeters>34.79999923706055</AltitudeMeters>
        <DistanceMeters>7.309999942779541</DistanceMeters>
        <HeartRateBpm>
          <Value>102</Value>
        </HeartRateBpm>
        <Cadence>76</Cadence>
        <Extensions>
          <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
            <Watts>112</Watts>
          </TPX>
        </Extensions>
      </Trackpoint>
      ....Lots of <Trackpoint> ... </Trackpoint>
    </Track>

In the end, I want to build a data table with the columns "Latitude, Altitude, ... Watts".
At first I tried to collect the data into a list (e.g. every value between <Watts> and </Watts>) using BeautifulSoup, XPath, etc., but I am new to these tools. How can I capture the data between tags in an XML file using Python?

python xml parsing xpath beautifulsoup

Young dong kwon 10 sept. '15 at 13:53

3 answers

gtlambert · Answer 1 · 2015-09-10T14:29:07+0000

You can use the lxml module with XPath. lxml is good for parsing XML/HTML, traversing element trees, and returning element text and attributes. With XPath you can select specific elements, sets of elements, or element attributes. Using the example data:

    content = '''
    <Track>
      <Trackpoint>
        <Time>2015-08-29T22:04:39.000Z</Time>
        <Position>
          <LatitudeDegrees>37.198049426078796</LatitudeDegrees>
          <LongitudeDegrees>127.07204628735781</LongitudeDegrees>
        </Position>
        <AltitudeMeters>34.79999923706055</AltitudeMeters>
        <DistanceMeters>7.309999942779541</DistanceMeters>
        <HeartRateBpm>
          <Value>102</Value>
        </HeartRateBpm>
        <Cadence>76</Cadence>
        <Extensions>
          <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
            <Watts>112</Watts>
          </TPX>
        </Extensions>
      </Trackpoint>
      ....Lots of <Trackpoint> ... </Trackpoint>
    </Track>
    '''

    from lxml import etree

    tree = etree.XML(content)
    time = tree.xpath('Trackpoint/Time/text()')
    print(time)

Output

 ['2015-08-29T22:04:39.000Z']
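The same call needs one extra step for Watts, because the TPX element in the sample declares a default namespace and a plain path does not match it. The following is a minimal sketch, assuming lxml is installed; the prefix `ax` is an arbitrary label chosen for this example (only the URI has to match the file), and the XML is a cut-down copy of the sample.

```python
from lxml import etree

# Cut-down copy of the sample with a single Trackpoint.
content = '''
<Track>
  <Trackpoint>
    <Time>2015-08-29T22:04:39.000Z</Time>
    <Extensions>
      <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
        <Watts>112</Watts>
      </TPX>
    </Extensions>
  </Trackpoint>
</Track>
'''

# 'ax' is a prefix made up for this example; the URI comes from the sample.
NS = {'ax': 'http://www.garmin.com/xmlschemas/ActivityExtension/v2'}

tree = etree.XML(content)

# Without a prefix the namespaced elements are not matched.
plain = tree.xpath('Trackpoint/Extensions/TPX/Watts/text()')

# With the prefix map, TPX and Watts are both found.
watts = tree.xpath('Trackpoint/Extensions/ax:TPX/ax:Watts/text()',
                   namespaces=NS)

print(plain)
print(watts)
```

Extensions itself carries no namespace in the sample, so only TPX and Watts take the prefix.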

Parfait · Answer 2 · 2015-09-10T16:32:11+0000

You can even use the lxml module to convert the XML to CSV (for later import into a data frame, spreadsheet, or database table), using a nested Python list to hold the values from the different XPaths.

Note that the last node, Watts, needs a special, longer XPath that matches on name(), because it sits under a namespace (the xmlns on TPX) that has no registered prefix in the XML sample.

    import os, csv
    import lxml.etree as ET

    # SET DIRECTORY
    cd = os.path.dirname(os.path.abspath(__file__))

    # LOAD XML FILE
    xmlfile = 'trackXML.xml'
    dom = ET.parse(os.path.join(cd, xmlfile))

    # DEFINING COLUMNS
    columns = ['latitude', 'longitude', 'altitude', 'distance', 'watts']

    # OPEN CSV FILE
    with open(os.path.join(cd, 'trackData.csv'), 'w') as m:
        writer = csv.writer(m)
        writer.writerow(columns)

        nodexpath = dom.xpath('//Trackpoint')

        dataline = []   # FOR ONE-ROW CSV APPENDS
        datalines = []  # FOR FINAL OUTPUT
        for j in range(1, len(nodexpath) + 1):
            dataline = []
            # LOCATE PATH OF EACH NODE VALUE
            latitudexpath = dom.xpath('//Trackpoint[{0}]/Position/LatitudeDegrees/text()'.format(j))
            dataline.append('') if latitudexpath == [] else dataline.append(latitudexpath[0])

            longitudexpath = dom.xpath('//Trackpoint[{0}]/Position/LongitudeDegrees/text()'.format(j))
            dataline.append('') if longitudexpath == [] else dataline.append(longitudexpath[0])

            altitudexpath = dom.xpath('//Trackpoint[{0}]/AltitudeMeters/text()'.format(j))
            dataline.append('') if altitudexpath == [] else dataline.append(altitudexpath[0])

            distancexpath = dom.xpath('//Trackpoint[{0}]/DistanceMeters/text()'.format(j))
            dataline.append('') if distancexpath == [] else dataline.append(distancexpath[0])

            wattsxpath = dom.xpath("//Trackpoint[{0}]/*[name()='Extensions']/*[name()='TPX']/*[name()='Watts']/text()".format(j))
            dataline.append('') if wattsxpath == [] else dataline.append(wattsxpath[0])

            datalines.append(dataline)
            writer.writerow(dataline)

    print(datalines)

Besides writing the CSV file, the script prints the datalines list holding the selected columns:

 [['37.198049426078796', '127.07204628735781', '34.79999923706055', '7.309999942779541', '112']]
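A variation on the script above, offered as a sketch rather than a replacement: instead of re-querying the whole document with `//Trackpoint[j]` for every column, each Trackpoint element is visited once and the values are read with paths relative to it. The names `FIELDS`, `NS`, and `trackpoint_rows` are made up for this example, the sample values are shortened, and the paths assume Trackpoint has no namespace, as in the sample; if a full file declares a default namespace on its root element, those paths would need a prefix too.

```python
from lxml import etree

# 'ax' is an arbitrary prefix for this example; the URI comes from the sample.
NS = {'ax': 'http://www.garmin.com/xmlschemas/ActivityExtension/v2'}

# Column name -> path relative to a single <Trackpoint> element.
FIELDS = {
    'latitude': 'Position/LatitudeDegrees',
    'longitude': 'Position/LongitudeDegrees',
    'altitude': 'AltitudeMeters',
    'distance': 'DistanceMeters',
    'watts': 'Extensions/ax:TPX/ax:Watts',
}

def trackpoint_rows(root):
    """Return one list of strings per Trackpoint; missing values become ''."""
    rows = []
    for point in root.iter('Trackpoint'):
        rows.append([point.findtext(path, default='', namespaces=NS)
                     for path in FIELDS.values()])
    return rows

# Two shortened Trackpoints; the second one has only a latitude.
sample = '''
<Track>
  <Trackpoint>
    <Position>
      <LatitudeDegrees>37.1</LatitudeDegrees>
      <LongitudeDegrees>127.0</LongitudeDegrees>
    </Position>
    <AltitudeMeters>34.8</AltitudeMeters>
    <DistanceMeters>7.3</DistanceMeters>
    <Extensions>
      <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
        <Watts>112</Watts>
      </TPX>
    </Extensions>
  </Trackpoint>
  <Trackpoint>
    <Position>
      <LatitudeDegrees>37.2</LatitudeDegrees>
    </Position>
  </Trackpoint>
</Track>
'''

rows = trackpoint_rows(etree.XML(sample))
print(list(FIELDS))
print(rows)
```

The resulting rows can be passed to `csv.writer(...).writerows(rows)` with `list(FIELDS)` as the header row, in the same way the script above writes its output.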

cast42 · Answer 3 · 2016-12-16T14:00:56+0000

The Python program https://github.com/cast42/vpower/blob/master/vpower.py iterates over the TCX file given on the command line and adds a power field to every measurement of a cycling event. It uses the lxml library for speed and because it handles namespaces. In previous versions of this program I used xml.etree.ElementTree, but I ran into problems with namespaces.
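For readers who want to stay with the standard library, here is a small sketch of two ways xml.etree.ElementTree can address namespaced elements such as Watts. It is not taken from vpower.py; the prefix `ax`, the variable names, and the shortened sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

URI = 'http://www.garmin.com/xmlschemas/ActivityExtension/v2'

# Shortened sample with two Trackpoints that carry only a Watts value.
content = '''
<Track>
  <Trackpoint>
    <Extensions>
      <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
        <Watts>112</Watts>
      </TPX>
    </Extensions>
  </Trackpoint>
  <Trackpoint>
    <Extensions>
      <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
        <Watts>115</Watts>
      </TPX>
    </Extensions>
  </Trackpoint>
</Track>
'''

root = ET.fromstring(content)

# Option 1: map an arbitrary prefix ('ax' here) to the namespace URI.
ns = {'ax': URI}
watts_by_prefix = [w.text for w in
                   root.findall('Trackpoint/Extensions/ax:TPX/ax:Watts', ns)]

# Option 2: spell out the namespace in the tag name as {uri}tag.
watts_by_uri = [w.text for w in root.iter('{%s}Watts' % URI)]

print(watts_by_prefix)
print(watts_by_uri)
```

Both options return the values as strings, so convert them with `int()` or `float()` before doing arithmetic on them.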
