I recently built a site that needs to fetch talk titles from the TED website.
So far, the problem is specific to this talk: Francis Collins: We need better drugs -- now
From the source of the webpage, I get:
<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>
<span id="altHeadline" >Francis Collins: We need better drugs -- now</span>
Now, in ghci, I tried this:
λ> :m +Network.HTTP Text.Regex.PCRE
λ> let uri = "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
λ> body <- (simpleHTTP $ getRequest uri) >>= getResponseBody
λ> body =~ "<span id=\"altHeadline\" >(.+)</span>" :: [[String]]
[["id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>\n\t\t</h","s Collins: We need better drugs -- now</span"]]
λ> body =~ "<title>(.+)</title>" :: [[String]]
[["tle>Francis Collins: We need better drugs -- now | Video on TED.com</title>\n<l","ncis Collins: We need better drugs -- now | Video on TED.com</t"]]
In both cases, the matched text skips some characters on the left and includes some unintended characters on the right. It seems to be related to the -- in the talk's title. However, the same pattern works on a literal string:
λ> let body' = "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>"
λ> body' =~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]
Also, Text.Regex.Posix does not have this problem:
λ> import qualified Text.Regex.Posix as P
λ> body P.=~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]
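To quantify how far off the PCRE results are, here is a minimal sketch that compares the text I expected with the text PCRE returned. `shiftOf` is a hypothetical helper written for this post, not part of any regex library, and the strings are copied from the ghci output above.

```haskell
import Data.List (isPrefixOf, tails)

-- | How many leading characters of the expected match are missing from the
-- observed match: the smallest n such that the expected text, with its
-- first n characters dropped, is a prefix of the observed text.
-- (Hypothetical helper, for illustration only.)
shiftOf :: String -> String -> Maybe Int
shiftOf expected observed =
  case [ n | (n, rest) <- zip [0 ..] (tails expected)
           , not (null rest)
           , rest `isPrefixOf` observed ] of
    (n : _) -> Just n
    []      -> Nothing

main :: IO ()
main = do
  -- Expected vs. observed full match for the <title> pattern.
  print $ shiftOf
    "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>"
    "tle>Francis Collins: We need better drugs -- now | Video on TED.com</title>\n<l"
  -- Expected vs. observed full match for the <span> pattern.
  print $ shiftOf
    "<span id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>"
    "id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>\n\t\t</h"
```

This prints `Just 3` for the title and `Just 6` for the span, and in each case the same number of extra characters appears on the right. So the match looks like a window of the correct length that is shifted to the right, and shifted further for the element that I assume comes later in the page. That may mean the -- is not the real cause, but this is only a guess.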