HTML parsing on iPhone

Can someone recommend a C or Objective-C library for parsing HTML? It needs to handle messy HTML that does not fully validate.

Is there such a library, or is it better for me to just use regular expressions?

+68
html iphone parsing html-content-extraction
Jan 02 '09 at 0:37
9 answers

It looks like libxml2.2 is included in the SDK, and libxml/HTMLparser.h states the following:

This module implements an HTML 4.0 non-verifying parser with an API compatible with the XML parser ones. It should be able to parse "real world" HTML, even if severely broken from a specification point of view.

That sounds like what I need, so I will probably use it.

+48
Jan 02 '09 at 5:35

I have found hpple quite useful for parsing messy HTML. Hpple is an Objective-C wrapper on the XPathQuery library for parsing HTML. With it you can send an XPath query and receive the result.

Requirements

-Add the libxml2 includes to your project

  • Project menu -> Edit Project Settings
  • Search for the "Header Search Paths" setting
  • Add a new search path "${SDKROOT}/usr/include/libxml2"
  • Enable the recursive option

-Add the libxml2 library to your project

  • Project menu -> Edit Project Settings
  • Search for the "Other Linker Flags" setting
  • Add a new linker flag "-lxml2"

-From hpple, get the following source code files and add them to your project:

  • TFHpple.h
  • TFHpple.m
  • TFppleElement.h
  • TFppleElement.m
  • XPathQuery.h
  • XPathQuery.m

-Work through the w3schools XPath tutorial to get comfortable with XPath.

Code example

    #import "TFHpple.h"

    NSData *data = [[NSData alloc] initWithContentsOfFile:@"example.html"];

    // Create parser
    TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:data];

    // Get all the cells of the 2nd row of the 3rd table
    NSArray *elements = [xpathParser searchWithXPathQuery:@"//table[3]/tr[2]/td"];

    // Access the first cell
    TFHppleElement *element = [elements objectAtIndex:0];

    // Get the text within the cell tag
    NSString *content = [element content];

    [xpathParser release];
    [data release];

Known Issues

Since hpple is a wrapper on top of XPathQuery, which is itself another wrapper, this option is probably not the most efficient. If performance is an issue in your project, I recommend writing your own lightweight solution based on the hpple and XPathQuery source code.

+88
Oct 24 '09 at 15:30

In case anyone got here by searching for a good XPath parser and went off and used TFHpple, note that TFHpple uses XPathQuery. It is pretty good, but it has a memory leak.

In the function *PerformXPathQuery, if the node set turns out to be nil, it returns before cleaning up.

So where you see this bit of code, add the two lines of cleanup:

    xmlNodeSetPtr nodes = xpathObj->nodesetval;
    if (!nodes) {
        NSLog(@"Nodes was nil.");
        /* Cleanup */
        xmlXPathFreeObject(xpathObj);
        xmlXPathFreeContext(xpathCtx);
        return nil;
    }

If you do a lot of parsing, this is a vicious leak. Now ... how can I get my night back :-)

+20
Mar 09

I wrote a little wrapper around libxml that might be useful:

Objective-C-HMTL-Parser

+12
May 10, '10 at 21:18

It probably depends on how messy the HTML is and what you want to extract. But Tidy usually does a good job. It is written in C, and I think you should be able to build it and statically link it for the iPhone. You can easily install the command-line version and test the results first.

+5
Jan 02 '09 at 2:14

You may want to check out ElementParser. It provides "just enough" parsing of HTML and XML. Its interfaces make traversing XML/HTML documents very easy. http://touchtank.wordpress.com/

+5
Apr 29 '09 at 20:46

How about using the WebKit component, and possibly third-party packages such as jQuery, for tasks like this? Couldn't you load the HTML data into an invisible component and take advantage of the mature selectors of the JavaScript frameworks?

+4
Jan 27

The Google GData Objective-C API reimplements NSXMLElement and other related classes that Apple removed from the iPhone SDK. You can find it here: http://code.google.com/p/gdata-objectivec-client/ . I have used it for messaging over Jabber. Of course, if your HTML is malformed (missing closing tags), this may not help much.

+3
Jan 02 '09 at 6:09

We use Convertigo to parse the HTML on the server side and return clean, tidy JSON web services to our mobile apps.

+2
Jan 12 '12 at 18:18


