Screen Scraper: Regular Expressions or XQuery Expressions?

I was recently asked an interview question about how I would do screen scraping: that is, extracting content from a web page, assuming you don't have a more structured way to request the information directly (for example, a web service).

My solution was to use an XQuery expression. The expression was fairly long because the content I needed was pretty deep in the HTML hierarchy; I had to walk up a fair number of ancestors before I found an element with an id attribute. For example, scraping an Amazon.com page for product dimensions looks like this:

 //a[@id="productDetails"]
     /following-sibling::table
     //h2[contains(child::text(), "Product Details")]
     /following-sibling::div
     //li
     /b[contains(child::text(), "Product Dimensions:")]
     /following-sibling::text()

This is a pretty nasty expression, but that's why Amazon provides a web service API. In any case, it's just one example: the question wasn't about Amazon, it was about screen scraping.

The interviewer did not like my solution. He thought it was fragile, because a redesign of the Amazon page could require the XQuery expression to be rewritten, and debugging an XQuery expression that matches nothing on the page it's applied to is hard.

I didn't disagree with those points, but I didn't think his solution was any improvement: he thought it was better to use a regular expression and match the content and markup immediately around the product dimensions, for example, using Perl:

 $html =~ m{<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>}s; 

My counterargument was that this is also vulnerable to Amazon changing its HTML. They could write tags in capitals ( <LI> ), add CSS attributes, change <b> to <span>, change the label "Product Dimensions:" to "Dimensions:", or make many other kinds of changes. My point was that regular expressions don't fix the weaknesses he called out in my XQuery solution.

On top of that, regular expressions can find false positives unless you add enough context to the expression, and they can inadvertently match content that is inside a comment, an attribute value, or a CDATA section.
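To make the comment pitfall concrete, here's a minimal Python sketch (the HTML fragment is invented for illustration): the interviewer's regex happily matches a commented-out copy of the markup first.

 import re

 # Invented fragment: a commented-out copy of the listing followed by the live one.
 # A naive regex has no idea the first <li> is inside an HTML comment.
 html = """
 <!-- <li><b>Product Dimensions:</b> 1 x 1 x 1 inches</li> -->
 <li><b>Product Dimensions:</b> 9.7 x 7.5 x 0.5 inches</li>
 """

 pattern = re.compile(r"<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>", re.S)
 print(pattern.search(html).group(1))  # prints "1 x 1 x 1 inches" -- the commented-out copy!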

My question is: what technique do you use for screen scraping? Why did you choose it? Is there a good reason to use one approach, or never to use the other? Is there a third choice besides the two I showed above?

PS: For the sake of argument, assume that no web service API or other more direct way to get the desired content exists.

+6
regex xquery screen-scraping
8 answers

I would use a regular expression, for the reasons the manager gave, plus a few more (it's more portable, easier for outside programmers, etc.).

Your counterargument misses the point that his solution was fragile with respect to local changes, while yours is fragile with respect to global changes. Anything that breaks his will probably break yours, but not vice versa.

Finally, it's much easier to build slop/flex into his solution (if, for example, you have to deal with several minor variations in the input).
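For example, a sketch in Python of what that slop might look like (the variations it tolerates are my assumptions, not an exhaustive list):

 import re

 # Case-insensitive, tolerant of extra attributes, of <b> becoming <span> or
 # <strong>, and of "Product Dimensions:" being shortened to "Dimensions:".
 pattern = re.compile(
     r"<li[^>]*>\s*<(?:b|span|strong)[^>]*>\s*"
     r"(?:Product\s+)?Dimensions:\s*</(?:b|span|strong)>\s*(.*?)</li>",
     re.I | re.S,
 )

 for fragment in (
     '<li><b>Product Dimensions:</b> 9.7 x 7.5 x 0.5 inches</li>',
     '<LI class="detail"><SPAN>Dimensions:</SPAN> 9.7 x 7.5 x 0.5 inches</LI>',
 ):
     m = pattern.search(fragment)
     print(m.group(1) if m else "no match")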

+3

I would use a regex, but only because most HTML pages are not valid XML, so you'd never get the XQuery to work.

I don't know XQuery, but that looks like an XPath expression to me. If so, it looks a little expensive with so many // operators in it.

+4

Try JTidy or BeautifulSoup; they work fine for me. Of course, an XPath expression full of // operators is quite expensive.

+2

I use BeautifulSoup for screen scraping.
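For what it's worth, a minimal sketch of that approach against the kind of fragment from the question (the HTML is invented, and bs4 is assumed to be installed):

 from bs4 import BeautifulSoup

 html = '<li><b>Product Dimensions:</b> 9.7 x 7.5 x 0.5 inches</li>'  # invented fragment

 soup = BeautifulSoup(html, "html.parser")
 label = soup.find("b", string="Product Dimensions:")  # find the label element
 if label is not None:
     print(label.next_sibling.strip())  # the text node after it: the dimensions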

+1

I actually find CSS selector expressions easier to read than either of those. There's probably at least one library in your language of choice that will parse a page and let you write CSS selectors to find particular elements. If there's an appropriate class or id nearby, the expression is fairly trivial. Otherwise, grab the elements that seem appropriate and iterate over them to find the ones you need, as in the sketch below.
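As an illustration, a small sketch using BeautifulSoup's CSS selector support (the id and markup are assumed for the example):

 from bs4 import BeautifulSoup

 html = """
 <div id="productDetails">
   <ul><li><b>Product Dimensions:</b> 9.7 x 7.5 x 0.5 inches</li></ul>
 </div>
 """  # invented fragment; real pages will differ

 soup = BeautifulSoup(html, "html.parser")
 # Anchor on the nearby id, then iterate the list items for the one we want.
 for li in soup.select("#productDetails li"):
     b = li.find("b")
     if b and b.get_text(strip=True).startswith("Product Dimensions"):
         print(b.next_sibling.strip())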

As for fragility, well, they're all fragile. Screen scraping, by definition, depends on the author of the page not changing its layout drastically. Go with a solution that's readable and can easily be changed later.

+1

A non-fragile screen-scraping solution? Good luck to the interviewer with that: just because regular expressions throw away a lot of context doesn't mean they're any less fragile; they're just fragile in other ways. Fragility isn't even necessarily a flaw: if something changes on the source web page, you're often better off if your solution raises an alarm than if it tries to compensate in a clever (and unpredictable) way. As you noted, these things always depend on your assumptions: in this case, on what constitutes a likely change.

I rather like the HTML Agility Pack: you get tolerance for non-XHTML-compliant web pages combined with the expressive power of XPath.

+1

Regular expressions are very fast and work on non-XML documents. Those are genuinely good points against XQuery. However, I think that using some converter to XHTML, plus a neater and possibly simpler XQuery consisting of only the last part of yours:

 //b[contains(child::text(), "Product Dimensions:")]/following-sibling::text() 

is a very good alternative.
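To sanity-check that shortened expression, here's a rough Python sketch using lxml, whose lenient HTML parser plays the role of the XHTML converter here (the document is invented for the example):

 from lxml import html  # lxml tolerates tag soup, standing in for an HTML-to-XHTML converter

 doc = html.fromstring("""
 <html><body><ul>
   <li><b>Product Dimensions:</b> 9.7 x 7.5 x 0.5 inches</li>
 </ul></body></html>
 """)

 # The shortened expression from above, applied as XPath:
 dims = doc.xpath('//b[contains(child::text(), "Product Dimensions:")]'
                  '/following-sibling::text()')
 print(dims[0].strip() if dims else "no match")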

Regards,

Rafal Rusin

+1

For working with HTML pages, the best thing I've used is the HTML Agility Pack (with some LINQ code). It's a great way to parse all the elements and/or do a direct search with XPath. In my opinion, it's more accurate than regexes and easier to program with. I was a little reluctant to use it at first, but it's very easy to add to a project, and I think it's the de facto standard for working with HTML. http://htmlagilitypack.codeplex.com/

Good luck

+1
