How to create a table of contents for HTML text in Python?

Question

Suppose I have some kind of HTML code like this (generated from Markdown or Textile or something else):

<h1>A header</h1>
<p>Foo</p>
<h2>Another header</h2>
<p>More content</p>
<h2>Different header</h2>
<h1>Another toplevel header</h1>
<!-- and so on -->

How can I generate a table of contents for it using Python?

+1

Leafstorm Feb 05 '10 at 20:40

2 answers

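Here is an example using lxml and XPath.

from lxml import etree

# Parse with the HTML parser so markup that is not well-formed XML is accepted.
doc = etree.parse("test.xml", etree.HTMLParser())
# Select every header element in document order.
for node in doc.xpath('//h1|//h2|//h3|//h4|//h5|//h6'):
    print(node.tag, node.text)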
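
The loop above yields a flat sequence of tag/text pairs. As a rough sketch of how that could become an indented outline, the helper below (format_toc is a hypothetical name, not part of lxml) indents each entry by its header level. It assumes the pairs have already been collected, for example as (node.tag, node.text) tuples.

```python
def format_toc(headers, indent="  "):
    """Render (tag, text) pairs such as ("h2", "Another header") as an indented outline."""
    lines = []
    for tag, text in headers:
        level = int(tag[1])  # "h2" -> 2
        # node.text can be None for an empty header, so fall back to "".
        lines.append(indent * (level - 1) + (text or "").strip())
    return "\n".join(lines)


# Pairs as they would come out of the sample HTML in the question.
headers = [
    ("h1", "A header"),
    ("h2", "Another header"),
    ("h2", "Different header"),
    ("h1", "Another toplevel header"),
]
print(format_toc(headers))
```

Indenting by level keeps the output readable as plain text; emitting nested lists would need extra bookkeeping for opening and closing each level.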

+3

kloffy Feb 05 '10 at 22:26

Ignacio Vazquez-Abrams · Accepted Answer · 2010-02-05T20:41:09+0000

Use an HTML parser such as lxml or BeautifulSoup to find all the header elements (h1 through h6).
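
As a minimal sketch of the same idea, the example below swaps in the standard library's html.parser instead of lxml or BeautifulSoup, so it runs without third-party packages. HeaderCollector and collect_headers are hypothetical names chosen for illustration, and the sample string is the HTML from the question with the last header closed.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeaderCollector(HTMLParser):
    """Collect (level, text) pairs for every header element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.headers = []    # finished (level, text) pairs
        self._level = None   # level of the header currently open, if any
        self._parts = []     # text fragments seen inside that header

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an open header.
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS and self._level is not None:
            self.headers.append((self._level, "".join(self._parts).strip()))
            self._level = None


def collect_headers(html):
    parser = HeaderCollector()
    parser.feed(html)
    parser.close()
    return parser.headers


sample = """<h1>A header</h1>
<p>Foo</p>
<h2>Another header</h2>
<p>More content</p>
<h2>Different header</h2>
<h1>Another toplevel header</h1>"""

# Print one line per header, indented by its level.
for level, text in collect_headers(sample):
    print("  " * (level - 1) + text)
```

Because the text fragments are joined when the header closes, inline markup inside a header (for example an em element) still contributes its text to the entry.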