Missing characters using Text.Regex.PCRE to parse a web page header

Question


I am building a website for which I need to retrieve talk titles from the TED website.

So far, the problem is specific to this talk: Francis Collins: We need better drugs -- now

From the source of the webpage, I get:

    <title>Francis Collins: We need better drugs -- now | Video on TED.com</title>
    <span id="altHeadline" >Francis Collins: We need better drugs -- now</span>

Now, in ghci, I tried this:

    λ> :m +Network.HTTP Text.Regex.PCRE
    λ> let uri = "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
    λ> body <- (simpleHTTP $ getRequest uri) >>= getResponseBody
    λ> body =~ "<span id=\"altHeadline\" >(.+)</span>" :: [[String]]
    [["id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>\n\t\t</h","s Collins: We need better drugs -- now</span"]]
    λ> body =~ "<title>(.+)</title>" :: [[String]]
    [["tle>Francis Collins: We need better drugs -- now | Video on TED.com</title>\n<l","ncis Collins: We need better drugs -- now | Video on TED.com</t"]]

In both cases, the parsed title is missing some characters on the left and picks up some unintended characters on the right. This seems to be related to the -- in the talk title. However, matching the same pattern against a string literal works as expected:

    λ> let body' = "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>"
    λ> body' =~ "<title>(.+)</title>" :: [[String]]
    [["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]

Fortunately, the problem does not occur with Text.Regex.Posix:

    λ> import qualified Text.Regex.Posix as P
    λ> body P.=~ "<title>(.+)</title>" :: [[String]]
    [["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]


regex haskell pcre

rnons Mar 27 '13 at 11:49


1 answer

Michael Snoyman · Accepted Answer · 2013-03-27T12:25:11+0000

My recommendation: do not use regexes to parse HTML. Use a proper HTML parser instead. Here is an example that uses the html-conduit parser together with the xml-conduit cursor library (and http-conduit for the download).

    {-# LANGUAGE OverloadedStrings #-}
    import           Data.Monoid          (mconcat)
    import           Network.HTTP.Conduit (simpleHttp)
    import           Text.HTML.DOM        (parseLBS)
    import           Text.XML.Cursor      (attributeIs, content, element,
                                           fromDocument, ($//), (&//), (>=>))

    main = do
        lbs <- simpleHttp "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
        let doc = parseLBS lbs
            cursor = fromDocument doc
        print $ mconcat $ cursor $// element "title" &// content
        print $ mconcat $ cursor $// element "span" >=> attributeIs "id" "altHeadline" &// content
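To try the same cursor queries without a network connection, the sketch below runs them against an inline document. It is only an illustration: `sampleHtml` is a hypothetical stand-in that contains just the two elements quoted in the question, and the helper names `pageTitle` and `altHeadline` are made up for this example. It assumes the html-conduit, xml-conduit, text, and bytestring packages are installed.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy as L
import           Data.Text            (Text)
import qualified Data.Text.IO         as T
import           Text.HTML.DOM        (parseLBS)
import           Text.XML.Cursor      (Cursor, attributeIs, content, element,
                                       fromDocument, ($//), (&//), (>=>))

-- Hypothetical stand-in for the downloaded page: only the two elements
-- quoted in the question, wrapped in a minimal document.
sampleHtml :: L.ByteString
sampleHtml = L.concat
    [ "<html><head>"
    , "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>"
    , "</head><body>"
    , "<span id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>"
    , "</body></html>"
    ]

-- Concatenated text found under every <title> element.
pageTitle :: Cursor -> Text
pageTitle cursor = mconcat $ cursor $// element "title" &// content

-- Concatenated text found under every <span id="altHeadline"> element.
altHeadline :: Cursor -> Text
altHeadline cursor =
    mconcat $ cursor $// element "span" >=> attributeIs "id" "altHeadline" &// content

main :: IO ()
main = do
    let cursor = fromDocument (parseLBS sampleHtml)
    T.putStrLn (pageTitle cursor)
    T.putStrLn (altHeadline cursor)
```

Keeping the queries as functions of a `Cursor` means the same code can be pointed at the real page by swapping `sampleHtml` for the bytes returned by `simpleHttp`.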

The code is also available as active code on the School of Haskell.
