First tip: DON'T USE REGULAR EXPRESSIONS FOR HTML / XML PARSING!
Now that we've cleared that up, I suggest you look at Beautiful Soup. There are other SGML / XML / HTML parsers available for Python. However, this one is a favorite for dealing with the messy "tag soup" that most of us run into in the real world. It does not require the input to be standards-conformant or well-formed. If your browser can manage to render it, then Beautiful Soup can probably manage to parse it for you.
(Still tempted to use regular expressions for this task? Thinking "it can't be that bad, I just want to extract exactly what is in the <h1>...</h1> and <h2>...</h2> containers" and "I'll never need to handle any other corner cases"? That way lies madness. The code you write based on that line of reasoning will be fragile. It will work just well enough to pass your tests, and then it will break every time you need to fix "just one more thing." Seriously, import a real parser and use it.)
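For what it's worth, here is a minimal sketch of the <h1> / <h2> extraction with Beautiful Soup. It assumes the `bs4` package is installed (`pip install beautifulsoup4`) and uses the standard library's `html.parser` backend; the sample markup and the `heading_texts` helper are made up for illustration and are not part of the library.

```python
from bs4 import BeautifulSoup

# Made-up, deliberately sloppy markup: upper-case tags, unclosed <p> elements,
# an attribute on one heading, and an inline tag nested inside another heading.
html = """
<html><body>
<H1>Main <em>title</em></H1>
<p>Intro paragraph with no closing tag
<h2 class="sub">First section</h2>
<p>More text
<h2>Second section</h2>
</body></html>
"""

# "html.parser" is the stdlib backend, so nothing besides bs4 is required.
soup = BeautifulSoup(html, "html.parser")


def heading_texts(soup):
    """Return (tag name, text) pairs for every <h1> and <h2>, in document order.

    Hypothetical helper, named here only for the example.
    """
    return [
        (tag.name, tag.get_text(" ", strip=True))
        for tag in soup.find_all(["h1", "h2"])
    ]


for name, text in heading_texts(soup):
    print(name, text)
```

A regular expression for the same job would have to account for tag case, attributes, and nested inline tags on its own; here the parser deals with those, and the code only asks for the headings.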