How to find all substrings of a string with start and end indices

Question

How to find all substrings of a string with start and end indices

I recently wrote Scala code that processes String, finding all its substrings and keeping a list of those that are in the dictionary. The beginning and end of substrings in a common line also needs to be saved for later use, so the easiest way to do this is to simply use nested loops, something like this:

for (i <- 0 until word.length) for (j <- i until word.length) { val sub = word.substring(i, j + 1) // lookup sub in dictionary here and add new match if found }

As an exercise, I decided to do the same thing in Haskell. It seems simple enough, without the need for substring indices - I can use something like this approach to get substrings, and then call a recursive function to accumulate matches. But if I need indexes too, it seems more complicated.

How can I write a function that returns a list containing each continuous substring, along with its start and end index inside the "parent" string?

For example, tokens "blah" will give [("b",0,0), ("bl",0,1), ("bla",0,2), ...]

Update

Great choice of answers and lots of new things to learn. Having a little worried, I went to the first answer, asking Daniel to allow the use of [0..] .

 data Token = Token String Int Int continuousSubSeqs = filter (not . null) . concatMap tails . inits tokenize xs = map (\(s, l) -> Token s (head l) (last l)) $ zip s ind where s = continuousSubSeqs xs ind = continuousSubSeqs [0..]

It seemed relatively easy to understand, given my limited knowledge of Haskell.

+3

haskell

Shaun the sheep Jul 15 '12 at 21:44

source share

4 answers

My version:

 import Data.List tokens = map join . filter (not . null) . concatMap inits . tails . zip [0..] where join s@ ((i, _):t) = (map snd s, i, foldl' (\_ i -> i) i (map fst t)) main = putStrLn $ show $ tokens "blah" -- [("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]

UPDATE

 import Control.Arrow ... tokens = map join . filter (not . null) . concatMap inits . tails . zip [0..] where join s = (s', i, j) where ((i, j), s') = (first (head &&& last)) $ unzip s ...

+1

Victor moroz Jul 16 '12 at 0:14

source share

The two nested loops you wrote are a great starting point. That is, we can write a tokens function that delegates its work to two recursive functions outer and inner that correspond to your loops:

 type Token a = ([a], Int, Int) tokens :: [a] -> [Token a] tokens = outer 0 where outer _ [] = [] outer i l@ (_ : xs) = inner i [] l ++ outer (i + 1) xs where inner _ _ [] = [] inner j acc (x : xs) = (acc ++ [x], i, j) : inner (j + 1) (acc ++ [x]) xs

Here outer iterates over the line and for each starting position inside this line calls inner to collect all the segments that begin at that position along with their ending positions.

Although this feature meets your requirements,

 > tokens "blah" [("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]

it is rather inefficient due to repeated concatenation of lists. A more efficient version will accumulate its results in the so-called difference lists :

 type Token a = ([a], Int, Int) tokens :: [a] -> [Token a] tokens l = outer 0 l [] where outer _ [] = id outer i l@ (_ : xs) = inner i id l . outer (i + 1) xs where inner _ _ [] = id inner j acc (x : xs) = ((acc [x], i, j) :) . inner (j + 1) (acc . (x :)) xs

How to build a dictionary, of course, depends on how you imagine it. Here's an approach that uses simple ordered lists of associations,

 type Dict a = [([a], [(Int, Int)])] empty :: Dict a empty = [] update :: Ord a => Token a -> Dict a -> Dict a update (xs, i, j) [] = [(xs, [(i, j)])] update (xs, i, j) ((ys, ns) : dict) = case compare xs ys of LT -> (xs, [(i, j)]) : (ys, ns) : dict EQ -> (ys, (i, j) : ns) : dict GT -> (ys, ns) : update (xs, i, j) dict toDict :: Ord a => [a] -> Dict a toDict = foldr update empty . tokens

but since your keys are strings, trying (aka prefix trees) is probably the best choice.

If this is an effective subtitle that you want about, I would recommend looking for suffix trees, although their implementation is somewhat involved. you can check

Robert Gigerich and Stefan Kurtz. Comparison of imperative and purely functional suffix constructions. Computer Science Science 25 (2-3): 187-218, 1995

and Bryan O'Sullivan suffixtree package for Hackage.

+1

Stefan holdermans Jul 16 '12 at 7:00

source share

Another version, which is easier to read from left to right, looks like unix channels

 import Data.List import Control.Category tokens = tailsWithIndex >>> concatMap (\(i,str) -> zip (repeat i) (initsWithIndex str)) >>> map adjust where tailsWithIndex = tails >>> init >>> zip [0..] initsWithIndex = inits >>> tail >>> zip [0..] adjust (i, (j, str)) = (str, i, i+j)

Run example

 >tokens "blah" [("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]

If concatMap is lazy, then all calculations are lazy and will be efficient, with the exception of using the Data.List functions instead of accessing the raw list.

+1

kusimari Jul 16 '12 at 9:22

source share

applicative_functor · Accepted Answer · 2012-07-15T22:05:06+0000

 import Data.List continuousSubSeqs = filter (not . null) . concatMap inits . tails tokens xs = map (\(s, l) -> (s, head l, last l)) $ zip s ind where s = continuousSubSeqs xs ind = continuousSubSeqs [0..length(xs)-1]

It works as follows:

 tokens "blah" [("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]

How to find all substrings of a string with start and end indices

Update

More articles: