Retrieving data from an HTML table using Python

I want to extract data from an HTML table with a Python script and save it as variables in a separate file, so that the same script can later load them (if the file exists) and use them. The script should also skip the first row of the table (Component, Status, Time / Error). I would prefer not to use external libraries.

The output to the new file should be like this:

    SAVE_DOCUMENT_STATUS = "OK"
    SAVE_DOCUMENT_TIME = "0.408"
    GET_DOCUMENT_STATUS = "OK"
    GET_DOCUMENT_TIME = "0.361"
    ...

And here is the input to the script:

    <table border=1>
    <tr>
    <td><b>Component</b></td>
    <td><b>Status</b></td>
    <td><b>Time / Error</b></td>
    </tr>
    <tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.408 s</td></tr>
    <tr><td>GET_DOCUMENT</td><td>OK</td><td>0.361 s</td></tr>
    <tr><td>DVK_SEND</td><td>OK</td><td>0.002 s</td></tr>
    <tr><td>DVK_RECEIVE</td><td>OK</td><td>0.002 s</td></tr>
    <tr><td>GET_USER_INFO</td><td>OK</td><td>0.135 s</td></tr>
    <tr><td>NOTIFICATIONS</td><td>OK</td><td>0.002 s</td></tr>
    <tr><td>ERROR_LOG</td><td>OK</td><td>0.001 s</td></tr>
    <tr><td>SUMMARY_STATUS</td><td>OK</td><td>0.913 s</td></tr>
    </table>

I tried to do this in bash, but I need to compare the *_TIME variables with a maximum time, and that fails in bash because the values are floating-point numbers.

2 answers

Using lxml:

    import lxml.html as lh

    content = '''\
    <table border=1>
    <tr>
    <td><b>Component</b></td>
    <td><b>Status</b></td>
    <td><b>Time / Error</b></td>
    </tr>
    <tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.408 s</td></tr>
    <tr><td>GET_DOCUMENT</td><td>OK</td><td>0.361 s</td></tr>
    <tr><td>DVK_SEND</td><td>OK</td><td>0.002 s</td></tr>
    <tr><td>DVK_RECEIVE</td><td>OK</td><td>0.002 s</td></tr>
    <tr><td>GET_USER_INFO</td><td>OK</td><td>0.135 s</td></tr>
    <tr><td>NOTIFICATIONS</td><td>OK</td><td>0.002 s</td></tr>
    <tr><td>ERROR_LOG</td><td>OK</td><td>0.001 s</td></tr>
    <tr><td>SUMMARY_STATUS</td><td>OK</td><td>0.913 s</td></tr>
    </table>
    '''

    tree = lh.fromstring(content)
    # The header cells keep their text inside <b>, so '//td/text()' skips them.
    # zip(*[iter(...)] * 3) groups the remaining cell texts three at a time.
    for key, status, t in zip(*[iter(tree.xpath('//td/text()'))] * 3):
        print('{k}_STATUS = "{s}"\n{k}_TIME = "{t}"'.format(
            k=key, s=status, t=t.rstrip(' s')))

gives

    SAVE_DOCUMENT_STATUS = "OK"
    SAVE_DOCUMENT_TIME = "0.408"
    GET_DOCUMENT_STATUS = "OK"
    GET_DOCUMENT_TIME = "0.361"
    DVK_SEND_STATUS = "OK"
    DVK_SEND_TIME = "0.002"
    DVK_RECEIVE_STATUS = "OK"
    DVK_RECEIVE_TIME = "0.002"
    GET_USER_INFO_STATUS = "OK"
    GET_USER_INFO_TIME = "0.135"
    NOTIFICATIONS_STATUS = "OK"
    NOTIFICATIONS_TIME = "0.002"
    ERROR_LOG_STATUS = "OK"
    ERROR_LOG_TIME = "0.001"
    SUMMARY_STATUS_STATUS = "OK"
    SUMMARY_STATUS_TIME = "0.913"
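The question also asks about saving these values to a separate file and loading them again later to compare the times as numbers. A minimal sketch of that step is below. The file name `status_vars.txt`, the helper names `format_vars` and `load_vars`, and the `MAX_TIME` threshold are assumptions made for illustration, and two rows from the table are hard-coded in place of the parsed results.

```python
import os
import tempfile


def format_vars(rows):
    """Turn (component, status, time) tuples into NAME = "value" lines."""
    lines = []
    for key, status, t in rows:
        lines.append('{k}_STATUS = "{s}"'.format(k=key, s=status))
        lines.append('{k}_TIME = "{t}"'.format(k=key, t=t))
    return "\n".join(lines) + "\n"


def load_vars(path):
    """Read NAME = "value" lines back into a dict; empty dict if no file."""
    if not os.path.exists(path):
        return {}
    result = {}
    with open(path) as fh:
        for line in fh:
            name, sep, value = line.partition(" = ")
            if sep:  # ignore lines that are not assignments
                result[name.strip()] = value.strip().strip('"')
    return result


# Two rows taken from the table in the question.
rows = [("SAVE_DOCUMENT", "OK", "0.408"), ("GET_DOCUMENT", "OK", "0.361")]

# Hypothetical output location.
path = os.path.join(tempfile.gettempdir(), "status_vars.txt")
with open(path, "w") as fh:
    fh.write(format_vars(rows))

loaded = load_vars(path)

MAX_TIME = 0.4  # hypothetical threshold, in seconds
too_slow = sorted(
    name for name, value in loaded.items()
    if name.endswith("_TIME") and float(value) > MAX_TIME
)
print(too_slow)
```

Reading the file into a dict rather than executing it keeps the loading step safe if the file is edited by hand, and `float()` handles the comparison that was awkward in bash.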

Well, if your HTML document really has such a stable structure (which makes me scratch my head, because it's pretty rare), you can use regular expressions:

    >>> import re
    >>> r = re.compile('<tr><td>(.*)</td><td>(.*)</td><td>(.*) s</td></tr>')

The regular expression above captures the three values you want in groups. Then you call the sub() method of the compiled pattern object. If the text is in a variable (e.g. content), just do it like this:

 r.sub(r'\1_STATUS = "\2"\n\1_TIME = \3', content) 

Result:

    >>> print(r.sub(r'\1_STATUS = "\2"\n\1_TIME = \3', content))
    <table border=1>
    <tr>
    <td><b>Component</b></td>
    <td><b>Status</b></td>
    <td><b>Time / Error</b></td>
    </tr>
    SAVE_DOCUMENT_STATUS = "OK"
    SAVE_DOCUMENT_TIME = 0.408
    GET_DOCUMENT_STATUS = "OK"
    GET_DOCUMENT_TIME = 0.361
    DVK_SEND_STATUS = "OK"
    DVK_SEND_TIME = 0.002
    DVK_RECEIVE_STATUS = "OK"
    DVK_RECEIVE_TIME = 0.002
    GET_USER_INFO_STATUS = "OK"
    GET_USER_INFO_TIME = 0.135
    NOTIFICATIONS_STATUS = "OK"
    NOTIFICATIONS_TIME = 0.002
    ERROR_LOG_STATUS = "OK"
    ERROR_LOG_TIME = 0.001
    SUMMARY_STATUS_STATUS = "OK"
    SUMMARY_STATUS_TIME = 0.913
    </table>

Of course, there is still a lot of garbage in the string (the table tags and the header row are left untouched), but this gives the idea :)
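One way to avoid the leftover markup is to collect only the matching rows with findall() instead of substituting in place. The sketch below is a variation on the pattern above, not the exact code from this answer: it uses non-greedy groups, works on a shortened copy of the table, and the `MAX_TIME` threshold is a hypothetical value added to show the float comparison.

```python
import re

# Shortened copy of the table from the question.
content = '''\
<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.408 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.361 s</td></tr>
<tr><td>DVK_SEND</td><td>OK</td><td>0.002 s</td></tr>
</table>
'''

# Same idea as the pattern above, with non-greedy groups.
row_re = re.compile(r'<tr><td>(.*?)</td><td>(.*?)</td><td>(.*?) s</td></tr>')

# findall() returns one (component, status, time) tuple per matching row;
# the header row does not match, so it is skipped automatically.
rows = row_re.findall(content)

lines = []
times = {}
for key, status, t in rows:
    lines.append('{}_STATUS = "{}"'.format(key, status))
    lines.append('{}_TIME = "{}"'.format(key, t))
    times[key] = float(t)  # numeric value for later comparisons

output = "\n".join(lines)
print(output)

MAX_TIME = 0.4  # hypothetical threshold, in seconds
too_slow = [key for key, t in times.items() if t > MAX_TIME]
print(too_slow)
```

Because only the captured groups are kept, nothing from the surrounding table ends up in the output.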

If your HTML documents are not that stable, you should really consider an XML parser or, better yet, BeautifulSoup, because processing loosely structured HTML by hand is thankless work.
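Since the question prefers to avoid external libraries, a parser-based approach can also be sketched with the standard library's html.parser module (Python 3 is assumed here). The class name `TableParser` and the shortened sample table are illustrative choices, not part of either answer.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text nested in <b> inside a cell also arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened copy of the table from the question.
content = '''\
<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.408 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.361 s</td></tr>
</table>
'''

parser = TableParser()
parser.feed(content)

results = []
for key, status, t in parser.rows[1:]:  # [1:] skips the header row
    if t.endswith(" s"):
        t = t[:-2]  # drop the trailing unit
    results.append('{}_STATUS = "{}"'.format(key, status))
    results.append('{}_TIME = "{}"'.format(key, t))

print("\n".join(results))
```

Unlike the regular expression, this does not depend on each row sitting on one line, though it still assumes three cells per row.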
