C / C++ equivalent of BeautifulSoup, especially for handling invalid HTML

Are there any recommendations for a C or C++ library that is as easy to use as possible for parsing, iterating over, and manipulating HTML streams or files, assuming some of them may be malformed (unclosed tags, etc.)?

BeautifulSoup

+4
3 answers

libxml2's HTMLParser is easy to use (a simple tutorial follows) and works well even with garbled HTML.

Edit: The original blog post is no longer available, so I have copied its content here.

Parsing (X)HTML in C is often seen as a daunting task. It is true that C is not the easiest language in which to write a parser. Fortunately, the libxml2 HTMLParser module comes to the rescue. So, as promised, here is a small tutorial explaining how to use the libxml2 HTMLParser to parse (X)HTML.

First you need to create a parser context. There are several functions for this, depending on how you want to feed the data into the parser. I will use htmlCreatePushParserCtxt(), since it works with memory buffers.

 htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, 0); 

You can then set a number of options on this parser context.

 htmlCtxtUseOptions(parser, HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET); 

Now we are ready to parse the (X)HTML document.

 // char *data : buffer containing part of the web page
 // int len    : number of bytes in data
 // Last argument is 0 if the web page isn't complete, and 1 for the final call.
 htmlParseChunk(parser, data, len, 0);

Once you have pushed all your data, call htmlParseChunk() one last time with a NULL buffer and 1 as the last argument. This ensures that the parser has processed everything.

Finally, how do you get at the data you parsed? It is easier than it sounds. You just need to walk the tree the parser created.

 void walkTree(xmlNode *a_node)
 {
     xmlNode *cur_node = NULL;
     xmlAttr *cur_attr = NULL;

     for (cur_node = a_node; cur_node; cur_node = cur_node->next) {
         // do something with that node information, like printing the tag name and attributes
         printf("Got tag : %s\n", cur_node->name);
         for (cur_attr = cur_node->properties; cur_attr; cur_attr = cur_attr->next) {
             printf("  -> with attribute : %s\n", cur_attr->name);
         }
         walkTree(cur_node->children);
     }
 }

 walkTree(xmlDocGetRootElement(parser->myDoc));

And that's it! Isn't that simple? From there, you can do anything you like: for example, find all the referenced images (by looking at the img tags) and fetch them, or whatever else you can think of.

In addition, you should be aware that you can walk the XML tree at any time, even if you have not yet parsed the entire HTML document.

If you need to parse (X)HTML in C, you should use the libxml2 HTMLParser. It will save you a lot of time.

+6

I have used libcurl with C++ for this kind of thing and found it to be pretty good and useful. I don't know how it deals with broken HTML, though.

0

Try using SIP and run BeautifulSoup. Perhaps this will help.

More in the thread below: OpenFrameworks + Python

-3

Source: https://habr.com/ru/post/1414173/

