Parsing tags with TagSoup in Haskell

I am trying to learn how to extract data from HTML files in Haskell and have hit a wall. I don't really know Haskell yet; my background is in Python (with BeautifulSoup for HTML parsing).

I am using TagSoup to parse the HTML (it seems to be the recommended library) and I have a basic understanding of how it works. Here is the relevant part of my code (it is stand-alone and prints the tags for testing):

    import System.IO
    import Network.HTTP
    import Text.HTML.TagSoup
    import Data.List

    main :: IO ()
    main = do
        http <- simpleHTTP (getRequest "http://www.cbssports.com/nba/scoreboard/20130310") >>= getResponseBody
        let tags = dropWhile (~/= TagOpen "div" []) (parseTags http)
        done tags
      where
        done xs = case xs of
            [] -> putStrLn $ "\n"
            _  -> do
                putStrLn $ show $ head xs
                done (tail xs)

However, I don't want to stop at just any div tag. I want to drop everything up to the first tag in this format:

    TagOpen "div" [("id","scores-1997830"),("class","scoreBox spanCol2")]
    TagOpen "div" [("id","scores-1997831"),("class","scoreBox spanCol2 lastCol")]

I tried writing it like this:

    let tags = dropWhile (~/= TagOpen "div" [("id", "scores-[0-9]+"), ("class", "scoreBox( spanCol[0-9]?)+( lastCol)?")])
                         (parseTags http)

But then it looks for the literal string [0-9]+ instead of treating it as a pattern. I haven't figured out a workaround with the Text.Regex.Posix module yet, and escaping the characters does not help. What is the solution here?

1 answer

~== does not do regular-expression matching, so you will need to write the matcher yourself, something along these lines:

    import Data.Maybe (isJust)
    import Text.HTML.TagSoup
    import Text.Regex (mkRegex, matchRegex)

    -- TagOpen is a constructor, not a type, so the argument type is Tag String.
    goodTag :: Tag String -> Bool
    goodTag tag = tag ~== TagOpen "div" [] && fromAttrib "id" tag `matches` "scores-[0-9]+"

    -- Just a wrapper around Text.Regex.matchRegex
    matches :: String -> String -> Bool
    matches string regex = isJust $ mkRegex regex `matchRegex` string
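
To plug a predicate like this back into the dropWhile from the question, something along the following lines should work. It is only a sketch: the names matchesWhole, isScoreBox, fromFirstScoreBox and scoreBoxes are made up for illustration, it assumes the tagsoup and regex-compat packages are installed, and it wraps each pattern in ^(...)$ so that the whole attribute value has to match (matchRegex on its own accepts a match anywhere in the string). The class pattern is the one from the question.

```haskell
import Data.Maybe (isJust)
import Text.HTML.TagSoup
import Text.Regex (mkRegex, matchRegex)

-- Hypothetical helper: True when the whole string matches the regex.
-- The pattern is wrapped in ^(...)$ so partial matches are rejected.
matchesWhole :: String -> String -> Bool
matchesWhole regex str = isJust (matchRegex (mkRegex ("^(" ++ regex ++ ")$")) str)

-- Hypothetical predicate: an opening <div> whose id and class both
-- match the patterns from the question. Missing attributes count as no match.
isScoreBox :: Tag String -> Bool
isScoreBox (TagOpen "div" attrs) =
    attrMatches "id" "scores-[0-9]+"
        && attrMatches "class" "scoreBox( spanCol[0-9]?)+( lastCol)?"
  where
    attrMatches name regex = maybe False (matchesWhole regex) (lookup name attrs)
isScoreBox _ = False

-- Drop everything before the first score box, like the dropWhile in the question.
fromFirstScoreBox :: [Tag String] -> [Tag String]
fromFirstScoreBox = dropWhile (not . isScoreBox)

-- Alternatively, keep only the opening tags of all score boxes.
scoreBoxes :: [Tag String] -> [Tag String]
scoreBoxes = filter isScoreBox
```

In the original main, the let line would then become let tags = fromFirstScoreBox (parseTags http). Pattern matching on TagOpen directly also avoids calling fromAttrib on a tag that is not an opening tag, which raises an error.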