How to eliminate cost centers in String Taversals and List Comprehensions

Question

How to eliminate cost centers in String Taversals and List Comprehensions

I am implementing a bioinformatics motivation search algorithm using Haskell. I will not go into the details of the algorithm, except to say that it is a branch and a related median string search. I planned to make my implementation more interesting by implementing a parallel approach (and then the STM approach) in order to get multi-core speed, but after compilation using the following flags

$ ghc -prof -auto-all -O2 -fllvm -threaded -rtsopts --make main

and printing out the profile, I saw something interesting (and possibly obvious):

 COST CENTRE entries %time %alloc hammingDistance 34677951 47.6 14.7 motifs 4835446 43.8 71.1

It is clear that remarkable acceleration can be achieved without approaching multi-core programming (although this was done, and I just need to find good test data and select a Criterion for them).

In any case, both of these functions are purely functional and are in no way parallel. They also do fairly simple things, so I was surprised that they took a long time. Here is the code for them:

 data NukeTide = A | T | C | G deriving (Read, Show, Eq, Ord, Enum) type Motif = [NukeTide] hammingDistance :: Motif -> Motif -> Int hammingDistance [] [] = 0 hammingDistance xs [] = 0 -- optimistic hammingDistance [] ys = 0 -- optimistic hammingDistance (x:xs) (y:ys) = case (x == y) of True -> hammingDistance xs ys False -> 1 + hammingDistance xs ys motifs :: Int -> [a] -> [[a]] motifs n nukeTides = [ take n $ drop k nukeTides | k <- [0..length nukeTides - n] ]

Note that of the two arguments for hammingDistance, I can assume that xs will be long x and that ys will be less than or equal to it if this opens up opportunities for improvement.

As you can see, hammingDistance calculates the distance between two motifs, which are lists of nucleotides. The motifs function takes a number and a list and returns all substrings of this length, for example:

 > motifs 3 "hello world" ["hel","ell","llo","lo ","ow"," wo","wor","orl","rld"]

Since the involved algorithmic processes are so simple, I can’t figure out how to optimize this further. However, I have two guesses about where I should go:

HammingDistance: The data types that I use (NukeTides and []) are slow / awkward. This is just an assumption, since I am not familiar with their implementations, but I believe that the definition of my own data type, although more legible, is probably associated with a lot of overhead, I intend. Also, pattern matching is alien to me, I don't know if this is trivial or expensive.
Motives: if I read them correctly, 70% of all memory allocations are motivated, and I assume that this needs to be garbage collected at some time. Using the list of all goals again can slow me down or understand the list, since the cost of this is incredibly unclear to me.

Does anyone have any tips on the usual procedure? If data types are a problem, will arrays be the correct answers? (I heard that they hit the boxes)

Thanks for the help.

Edit: It just occurred to me that it would be useful if I described a way to call these two functions:

 totalDistance :: Motif -> Int totalDistance motif = sum $ map (minimum . map (hammingDistance motif) . motifs l) dna

This function is the result of another function and is passed around nodes in the tree. On each node in the tree, a nucleotide is evaluated (of length <= n, that is, if == n, then this is a leaf of the node) using totalDistance to evaluate the node. From now on, this is your typical branch and related algorithm.

Edit: John asked me to print out the change I made, which virutally eliminated the cost of motives:

 scoreFunction :: DNA -> Int -> (Motif -> Int) scoreFunction dna l = totalDistance where -- The sum of the minimum hamming distance in each line of dna -- is given by totalDistance motif totalDistance motif = sum $ map (minimum . map (hammingDistance motif)) possibleMotifs possibleMotifs = map (motifs l) dna -- Previously this was computed in the line above

I did not explain my initial message, but scoreFunction is called only once, and the result is passed around the tree / branch and is bound and used to evaluate the nodes. Reassessment of motives at every step, in retrospect, is not one of the most striking things that I have done.

+7

optimization profiling haskell bioinformatics branch-and-bound

Dave Oct 9 '11 at 15:20

source share

3 answers

Your definition of motifs seems to be much more than necessary because every drop application should cross the list from the very beginning. I would use it instead of Data.List.tails :

 motifs2 :: Int -> [a] -> [[a]] motifs2 n nukeTides = map (take n) $ take count $ tails nukeTides where count = length nukeTides - n + 1

A quick comparison in GHCi shows the difference (using sum . map length for a forced estimate):

 *Main> let xs = concat (replicate 10000 [A, T, C, G]) (0.06 secs, 17914912 bytes) *Main> sum . map length $ motifs 5 xs 199980 (3.47 secs, 56561208 bytes) *Main> sum . map length $ motifs2 5 xs 199980 (0.15 secs, 47978952 bytes)

+7

hammar Oct 9 '11 at 15:36

source share

That's right ... I could not resist the limit and wrote a realistic implementation of "transparent metals":

 {-# language TypeSynonymInstances #-} {-# language BangPatterns #-} import Data.Bits import Data.Word data NukeTide = A | T | C | G deriving (Read, Show, Eq, Ord, Enum) type UnpackedMotif = [NukeTide] type PackageType = Word32 nukesInPackage = 16 :: Int allSetMask = complement 0 :: PackageType -- Be careful to have length of motif == nukesInPackage here! packNukesToWord :: UnpackedMotif -> PackageType packNukesToWord = packAt 0 where packAt _ [] = 0 packAt i (m:ml) = (b0 m .&. bit i) .|. (b1 m .&. bit (i+1)) .|. packAt (i+2) ml b0 A = 0 b0 T = allSetMask b0 C = 0 b0 G = allSetMask b1 A = 0 b1 T = 0 b1 C = allSetMask b1 G = allSetMask unpackNukesWord :: PackageType -> UnpackedMotif unpackNukesWord = unpackNNukesFromWord nukesInPackage unpackNNukesFromWord :: Int -> PackageType -> UnpackedMotif unpackNNukesFromWord = unpackN where unpackN 0 _ = [] unpackN iw = (nukeOf $ w .&. r2Mask):(unpackN (i-1) $ w`shiftR`2) nukeOf bs | bs == 0 = A | bs == bit 0 = T | bs == bit 1 = C | otherwise = G r2Mask = (bit 1 .|. bit 0) :: PackageType data PackedMotif = PackedMotif { motifPackets::[PackageType] , nukesInLastPack::Int } -- note nukesInLastPack will never be zero; motifPackets must be [] to represent empty motifs. packNukes :: UnpackedMotif -> PackedMotif packNukes m = case remain of [] -> PackedMotif [packNukesToWord takeN] (length takeN) r -> prAppend (packNukesToWord takeN) (packNukes r) where (takeN, remain) = splitAt nukesInPackage m prAppend w (PackedMotif li) = PackedMotif (w:l) i unpackNukes :: PackedMotif -> UnpackedMotif unpackNukes (PackedMotif li) = unpack li where unpack [l] i = unpackNNukesFromWord il unpack (l:ls) i = unpackNukesWord l ++ unpack ls i unpack [] _ = [] instance Show PackedMotif where show = show . unpackNukes class Nukes a where pLength :: a -> Int shiftLN1 :: a -> a hammingDistance :: a -> a -> Int motifs :: Int -> a -> [a] instance Nukes PackageType where pLength _ = nukesInPackage shiftLN1 = (`shiftR`2) hammingDistance !x !y = fromIntegral $ abt (x `xor` y) where abt !b = bbt(b.&.a0Mask .|. ((b.&.a1Mask) `shiftR` 1)) bbt !b = sbt $ (b.&.r16Mask) + (b `shiftR` nukesInPackage) sbt !b = (r2Mask .&. b) + (r2Mask .&. (b`shiftR`2)) + (r2Mask .&. (b`shiftR`4)) + (r2Mask .&. (b`shiftR`6)) + (r2Mask .&. (b`shiftR`8)) + (r2Mask .&. (b`shiftR`10)) + (r2Mask .&. (b`shiftR`12)) + (r2Mask .&. (b`shiftR`14)) a0Mask = 0x55555555 :: PackageType a1Mask = 0xAAAAAAAA :: PackageType r16Mask = 0xFFFF :: PackageType r2Mask = 0x3 :: PackageType motifs 0 _ = [] motifs lx = x : motifs (l-1) (shiftLN1 x) maskNukesBut :: Int -> PackageType -> PackageType maskNukesBut i = ( ( allSetMask `shiftR` (2*(nukesInPackage - i)) ) .&.) instance Nukes PackedMotif where pLength (PackedMotif (x:xs) ix) = nukesInPackage * (length xs) + ix pLength _ = 0 shiftLN1 ξ@(PackedMotif [] _) = ξ shiftLN1 (PackedMotif [x] ix) | ix>1 = PackedMotif [x`shiftR`2] (ix-1) | otherwise = PackedMotif [] nukesInPackage shiftLN1 (PackedMotif (x:x':xs) ix) = PackedMotif (( shiftLN1 x .|. pnext ):sxs) resLMod where sxs = motifPackets $ shiftLN1 (PackedMotif (x':xs) ix) pnext = shiftL (x'.&.0x3) 30 resLMod = if ix > 1 then (ix-1) else nukesInPackage hammingDistance xs ys = go 0 xs ys where go :: Int -> PackedMotif -> PackedMotif -> Int go !acc (PackedMotif [x] ix) (PackedMotif [y] iy) | ix > iy = acc + (hammingDistance y $ maskNukesBut iy x) | otherwise = acc + (hammingDistance x $ maskNukesBut ix y) go !acc (PackedMotif [x] ix) (PackedMotif (y:ys) iy) = acc + (hammingDistance x $ maskNukesBut ix y) go !acc (PackedMotif (x:xs) ix) (PackedMotif [y] iy) = acc + (hammingDistance y $ maskNukesBut iy x) go !acc (PackedMotif (x:xs) ix) (PackedMotif (y:ys) iy) = go (acc + hammingDistance xy) (PackedMotif xs ix) (PackedMotif ys iy) go !acc _ _ = acc motifs l ξ | l>0 = fShfts (min nukesInPackage $ pLength ξ + 1 - l) ξ >>= ct | otherwise = [] where fShfts k χ | k > 0 = χ : fShfts (k-1) (shiftLN1 χ) | otherwise = [] ct (PackedMotif ys iy) = case remain of [] -> if (length takeN - 1) * nukesInPackage + iy >= l then [PackedMotif takeN lMod] else [] _ -> PackedMotif takeN lMod : ct(PackedMotif (tail ys) iy) where (takeN, remain) = splitAt lQuot ys (lQuot,lMod) = case l `quotRem` nukesInPackage of (i,0) -> (i, nukesInPackage) (i,m) -> (i+1, m)

It can be used from UnpackedMotif = [NukeTide] using the packNukes function, for example

 *BioNuke0> motifs 23 $ packNukes $ take 27 $ cycle [A,T,G,C,A] [[A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G],[T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C],[G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A],[C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A],[A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T]] *BioNuke0> hammingDistance (packNukes [A,T,G,C,A,A,T,G]) (packNukes [A,T,C,C,A,T,G]) 3 *BioNuke0> map (hammingDistance (packNukes $ take 52 $ cycle [A,T,C,C,A,T,G])) (motifs 52 $ packNukes $ take 523 $ cycle [A,T,C,C,A,T,G]) [0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44]

I have not yet compared performance with the original version, but it should be slightly faster than any algebraic data type implementation. In addition, it easily offers a compact storage format.

+2

leftaroundabout Oct 10 '11 at 2:12

source share

John l · Accepted Answer · 2011-10-09T19:41:57+0000

Your definition of hammingDistance is probably much less effective than it could be.

 hammingDistance (x:xs) (y:ys) = case (x == y) of True -> hammingDistance xs ys False -> 1 + hammingDistance xs ys

Due to laziness laziness, it will be expanded to (in the worst case):

 (1 + (1 + (1 + ...)))

which will exist as a hit in the stack, and will be reduced only when it is used. Whether this is actually a problem depends on the call site, compiler options, etc. Therefore, it is often good practice to write your code in a form that avoids this problem as a whole.

A common solution is to create a tail recursive shape with a strict battery, but in this case you can use higher order functions, for example:

 hammingDistance :: Motif -> Motif -> Int hammingDistance xs ys = length . filter (uncurry (==)) $ zip xs ys

here is a tail recursive implementation, for comparison

 hammingDistance :: Motif -> Motif -> Int hammingDistance xs ys = go 0 xs ys where go !acc [] [] = acc go !acc xs [] = acc -- optimistic go !acc [] ys = acc -- optimistic go !acc (x:xs) (y:ys) = case (x == y) of True -> go acc xs ys False -> go (acc+1) xs ys

In this case, the BangPatterns extension is used to force the battery to accumulate, otherwise it will have the same problem as the current definition.

To answer your other questions:

Matching patterns is trivial
Whether lists or arrays should be used depends largely on how the data is generated and how it is used. In this case, it is possible that lists may be the best type. In particular, if your lists are all consumed as they are created, and you do not need the entire list in memory, they should be in order. If you save the lists in memory, although they have a lot of space overhead.

Usage patterns

I think the way of using these functions also does the extra work:

 (minimum . map (hammingDistance motif) . motifs l

Since you need a minimum of hammingDistance , you can calculate many additional values that are not needed. I can present two solutions:

Option 1. Define a new function hammingDistanceThresh :: Motif -> Int -> Motif -> Int , which stops when it exceeds a threshold. The order of a slightly strange type is to facilitate its use in a fold, for example:

 let motifs' = motifs l in foldl' (hammingDistanceThresh motif) (hammingDistance motif $ head motifs') (tail motifs')

Option 2. If you determine the type of a lazy integer, you can use this instead of Int for the result of hammingDistance . Then it will be calculated only as much as necessary for the distance from the interference.

-auto-all final note: using -auto-all will very often generate much slower code than other profiling options. First I will try to use only -auto , and then add manual SCC annotations if necessary.

How to eliminate cost centers in String Taversals and List Comprehensions

More articles: