I tried to learn how to extract data from HTML files in Haskell and hit a wall. I don't really understand Haskell at all; my previous experience is with Python (and BeautifulSoup for HTML parsing).
I am using TagSoup to parse my HTML (it seems to be the recommended library) and I have a basic understanding of how it works. Here is the main segment of the code in question (self-contained; it prints the tags for testing):
    import System.IO
    import Network.HTTP
    import Text.HTML.TagSoup
    import Data.List

    main :: IO ()
    main = do
        http <- simpleHTTP (getRequest "http://www.cbssports.com/nba/scoreboard/20130310") >>= getResponseBody
        let tags = dropWhile (~/= TagOpen "div" []) (parseTags http)
        done tags
      where
        done xs = case xs of
            [] -> putStrLn $ "\n"
            _ -> do
                putStrLn $ show $ head xs
                done (tail xs)
However, I don't want to stop at just any div tag. I want to drop everything up to a tag in this format:
    TagOpen "div" [("id","scores-1997830"),("class","scoreBox spanCol2")]
    TagOpen "div" [("id","scores-1997831"),("class","scoreBox spanCol2 lastCol")]
I tried to write it:
let tags = dropWhile (~/= TagOpen "div" [("id", "scores-[0-9]+"), ("class", "scoreBox( spanCol[0-9]?)+( lastCol)?")]) (parseTags http)
But then it tries to match the literal text [0-9]+. I haven't figured out a workaround with the Text.Regex.Posix module yet, and escaping the characters does not help. What is the solution here?
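To make the goal concrete: since the tag literal is compared attribute by attribute as plain strings, one possible direction is to replace it with a predicate that inspects the attributes itself. What follows is only a minimal sketch, assuming the regex-posix package is installed; isScoreBox and attrMatches are hypothetical helper names (they are not part of TagSoup), and the anchored patterns are adapted from the attempt above.

```haskell
import Text.HTML.TagSoup (Tag (..))
import Text.Regex.Posix ((=~))

-- Hypothetical helper (not part of TagSoup): True only for an opening
-- <div> whose "id" and "class" attributes both match the patterns below.
isScoreBox :: Tag String -> Bool
isScoreBox (TagOpen "div" attrs) =
    attrMatches "id" "^scores-[0-9]+$"
        && attrMatches "class" "^scoreBox( spanCol[0-9]?)+( lastCol)?$"
  where
    -- Look the attribute up by name and test its value against a
    -- POSIX extended regular expression; a missing attribute fails.
    attrMatches :: String -> String -> Bool
    attrMatches name pattern = case lookup name attrs of
        Just value -> value =~ pattern
        Nothing    -> False
-- Any other tag (text, closing tags, non-div opening tags) is rejected.
isScoreBox _ = False
```

With a predicate like this, the dropWhile line would presumably become dropWhile (not . isScoreBox) (parseTags http), so the regular expressions are applied to the attribute values instead of being embedded in a tag literal.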