Broken HTML with golang

I need to find elements in an HTML string. Unfortunately, the HTML is badly broken (for example, it contains closing tags that have no matching opening tag).

I tried using XPath with launchpad.net/xmlpath, but it cannot parse the HTML and fails with an error.

How can I find elements in broken HTML with Go? I would prefer to use XPath, but I am open to other solutions as long as I can search for tags with a specific id or class.

1 answer

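It seems golang.org/x/net/html does the job.

Here is what I am doing now: parse the broken HTML with html.Parse, render the repaired tree back to a string, and feed that string to xmlpath: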

package main

import (
	"bytes"
	"log"
	"strings"

	"golang.org/x/net/html"
	"gopkg.in/xmlpath.v2"
)

func main() {
	// The <p> element is never closed, so this input is not well-formed.
	brokenHtml := `<!DOCTYPE html><html><body><h1 id="someid">My First Heading</h1><p>paragraph</body></html>`

	// html.Parse tolerates broken markup and builds a complete node tree.
	reader := strings.NewReader(brokenHtml)
	root, err := html.Parse(reader)
	if err != nil {
		log.Fatal(err)
	}

	// Render the repaired tree back to well-formed HTML.
	var b bytes.Buffer
	if err := html.Render(&b, root); err != nil {
		log.Fatal(err)
	}
	fixedHtml := b.String()

	// Parse the repaired HTML with xmlpath.
	reader = strings.NewReader(fixedHtml)
	xmlroot, xmlerr := xmlpath.ParseHTML(reader)
	if xmlerr != nil {
		log.Fatal(xmlerr)
	}

	// Look up the heading by its id with an XPath expression.
	xpath := `//h1[@id='someid']`
	path := xmlpath.MustCompile(xpath)
	if value, ok := path.String(xmlroot); ok {
		log.Println("Found:", value)
	}
}
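The program above matches on id only, while the question also asks about classes. A class attribute holds a space-separated list of names, so an XPath predicate such as [@class='note'] compares the whole attribute value and misses elements that carry several classes, and a plain substring test would also match "footnote". The following is only a minimal sketch of a token-wise check, using the standard library alone; hasClass is a hypothetical helper name, not part of net/html or xmlpath.

```go
package main

import (
	"fmt"
	"strings"
)

// hasClass reports whether the space-separated class attribute value
// classAttr contains name as a whole token.
// (hasClass is a hypothetical helper, not part of any library.)
func hasClass(classAttr, name string) bool {
	for _, c := range strings.Fields(classAttr) {
		if c == name {
			return true
		}
	}
	return false
}

func main() {
	// "note" is one of the two class tokens, so this matches.
	fmt.Println(hasClass("note warning", "note"))
	// "footnote" only contains "note" as a substring, so this does not match.
	fmt.Println(hasClass("footnote", "note"))
	// A plain substring test is too loose: it reports a match here.
	fmt.Println(strings.Contains("footnote", "note"))
}
```

One way to use such a helper, assuming you walk the *html.Node tree returned by html.Parse yourself, would be to apply it to the Val of each attribute whose Key is "class".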