Parsing (X)HTML in C is often seen as a daunting task. It is true that C is not the easiest language in which to write a parser. Fortunately, libxml2's HTMLParser module comes to the rescue. So, as promised, here is a small tutorial explaining how to use the libxml2 HTMLParser to parse (X)HTML.
First you need to create a parser context. There are several functions for this, depending on how you want to feed data to the parser. I will use htmlCreatePushParserCtxt() since it works with memory buffers.
htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, 0);
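You can then set a number of options on this parser context. The ones below drop ignorable whitespace nodes, silence error and warning reports, and forbid network access.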
htmlCtxtUseOptions(parser, HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);
Now we are ready to parse the (X)HTML document.
// char *data : buffer containing part of the web page
// int len    : number of bytes in data
// Last argument is 0 if the web page isn't complete, and 1 for the final call.
htmlParseChunk(parser, data, len, 0);
Once you have fed in all your data, call this function again with a NULL buffer and 1 as the last argument. This ensures that the parser processes everything.
Finally, how do you get at the data you parsed? This is easier than it sounds. You just need to walk the resulting XML tree.
void walkTree(xmlNode *a_node)
{
    xmlNode *cur_node = NULL;
    xmlAttr *cur_attr = NULL;
    for (cur_node = a_node; cur_node; cur_node = cur_node->next) {
        // do something with that node's information, like printing the tag name and attributes
        printf("Got tag : %s\n", cur_node->name);
        for (cur_attr = cur_node->properties; cur_attr; cur_attr = cur_attr->next) {
            printf("  -> with attribute : %s\n", cur_attr->name);
        }
        walkTree(cur_node->children);
    }
}

walkTree(xmlDocGetRootElement(parser->myDoc));
And that's it! Simple, isn't it? From there you can do anything you like: for example, find all the referenced images (by looking at img tags) and fetch them, or anything else you can think of.
In addition, you should be aware that you can walk the XML tree at any time, even if you have not yet parsed the entire HTML document.
If you need to parse (X)HTML in C, you should use the libxml2 HTMLParser. It will save you a lot of time.