Difference between revisions of "Data.List.Split"
(Fix bug in simplest version of "chunk") 

(19 intermediate revisions by 11 users not shown)  
Line 1:  Line 1:  
−  A theoretical module which contains implementations/combinators for implementing every possible method of listsplitting known to man. This way no one has to argue about what the correct interface for split is, we can just have them all. 
+  A [http://hackage.haskell.org/cgibin/hackagescripts/package/split (no longer theoretical) module] which contains implementations/combinators for implementing every possible method of listsplitting known to man. This way no one has to argue about what the correct interface for split is, we can just have them all. 
+  
+  This has been brought up many times on the mailing lists, and there's a summary of the split proposals on [[List function suggestions]]. Just for reference: 

+  
+  * [http://www.haskell.org/pipermail/libraries/2008January/thread.html#8922 haskellcafe thread January 2008] 

+  * [http://www.haskell.org/pipermail/haskellcafe/2006July/thread.html#16559 haskellcafe thread July 2006] 

+  * [http://www.haskell.org/pipermail/libraries/2004July/thread.html#2342 libraries thread July 2004] 

+  
Some possible ways to split a list, to get your creative juices flowing: 
Some possible ways to split a list, to get your creative juices flowing: 

Line 14:  Line 14:  
** keep the separators as their own separate pieces of the result list 
** keep the separators as their own separate pieces of the result list 

** what to do with separators at the beginning/end? create a blank split before/after, or not? 
** what to do with separators at the beginning/end? create a blank split before/after, or not? 

+  ** keep blank splits between consecutive delimiters, or merge multiple consecutive delimiters into one delimiter? 

An important caveat: we should strive to keep things flexible yet SIMPLE. The more complicated things get, the closer this gets to just being a general parsing or regex library. So the right balance needs to be struck. 
An important caveat: we should strive to keep things flexible yet SIMPLE. The more complicated things get, the closer this gets to just being a general parsing or regex library. So the right balance needs to be struck. 

Line 21:  Line 22:  
<haskell> 
<haskell> 

{# LANGUAGE ViewPatterns #} 
{# LANGUAGE ViewPatterns #} 

−  
+  
import Data.List (unfoldr) 
import Data.List (unfoldr) 

−  
+  
−  
+  
 intercalate :: [a] > [[a]] > [a] 
 intercalate :: [a] > [[a]] > [a] 

 intercalate x [a,b,c,x,y,z] = [a,x,b,x,c,x,x,y,x,z,x] 
 intercalate x [a,b,c,x,y,z] = [a,x,b,x,c,x,x,y,x,z,x] 

−  
+  
 unintercalate :: [a] > [a] > [[a]] 
 unintercalate :: [a] > [a] > [[a]] 

 unintercalate x [a,x,b,x,c,x,x,y,x,z,x] = [a,b,c,[],y,z] 
 unintercalate x [a,x,b,x,c,x,x,y,x,z,x] = [a,b,c,[],y,z] 

−  
+  
 unintercalate is the "inverse" of intercalate 
 unintercalate is the "inverse" of intercalate 

Line 37:  Line 38:  
match (p:ps) (q:qs)  p == q = match ps qs 
match (p:ps) (q:qs)  p == q = match ps qs 

match (_:_) (_:_)  otherwise = Nothing 
match (_:_) (_:_)  otherwise = Nothing 

−  
+  
chopWith delimiter (match delimiter > Just tail) = return ([], tail) 
chopWith delimiter (match delimiter > Just tail) = return ([], tail) 

−  chopWith delimiter (c:cs) = chopWith delimiter cs 
+  chopWith delimiter (c:cs) = case chopWith delimiter cs of 
−  return (c:head, tail) 
+  Just (head, tail) > return (c:head, tail) 
+  Nothing > return (c:cs, []) 

chopWith delimiter [] = Nothing 
chopWith delimiter [] = Nothing 

−   note: chopWith could be make 'more efficient' i.e. remove the >>=\> bit 

−   by adding an accumulator 

−  
unintercalate delimiter = unfoldr (chopWith delimiter) 
unintercalate delimiter = unfoldr (chopWith delimiter) 

+   NOTE: this unintercalate used to have bugs, don't trust it 

−   > unintercalate "x" "axbxcxxyxzx" 

+   this one discards separators and combines multiple adjacent separators 

−   
+   splitOn (==',') "foo,bar,,,baz" = ["foo", "bar", "baz"] 
−  +   this is the behavior you want for List.words 

+   

splitOn :: (a > Bool) > [a] > [[a]] 
splitOn :: (a > Bool) > [a] > [[a]] 

splitOn _ [] = [] 
splitOn _ [] = [] 

Line 56:  Line 55:  
 f x = splitOn f xs 
 f x = splitOn f xs 

 otherwise = let (h,t) = break f l in h:(splitOn f t) 
 otherwise = let (h,t) = break f l in h:(splitOn f t) 

+  
+   this variant discards separators but does not combine adjacent separators 

+   splitOn' (==',') "foo,bar,,,baz" = ["foo", "bar", "", "", "baz"] 

+   this is the behavior you want for List.lines 

+   

+  splitOn' :: (a > Bool) > [a] > [[a]] 

+  splitOn' f xs = split xs 

+  where split xs = case break f xs of 

+  (chunk,[]) > chunk : [] 

+  (chunk,_:rest) > chunk : split rest 

+  
+   this variant keeps the separators but combines them 

+   splitOn'' Char.isSpace "foo bar \t baz" = ["foo"," ","bar"," \t ","baz"] 

+   

+  splitOn'' :: (a > Bool) > [a] > [[a]] 

+  splitOn'' sep xs = split xs 

+  where split [] = [] 

+  split xs = case break sep xs of 

+  (chunk,[]) > chunk : [] 

+  (chunk,rest) > 

+  case span sep rest of 

+  (seps, rest) 

+   null chunk > seps : split rest 

+   otherwise > chunk : seps : split rest 

 take the element who make predict true as delimiter 
 take the element who make predict true as delimiter 

Line 62:  Line 85:  
  like String split, except for any element that obeys Eq 
  like String split, except for any element that obeys Eq 

+   This trick works for all the 'splitOn' variants above. 

+   

splitEq :: Eq a => a > [a] > [[a]] 
splitEq :: Eq a => a > [a] > [[a]] 

splitEq e = splitOn (==e) 
splitEq e = splitOn (==e) 

Line 67:  Line 92:  
  split at regular intervals 
  split at regular intervals 

chunk :: Int > [a] > [[a]] 
chunk :: Int > [a] > [[a]] 

−  chunk _ [] = 
+  chunk _ [] = [] 
chunk n xs = y1 : chunk n y2 
chunk n xs = y1 : chunk n y2 

where 
where 

(y1, y2) = splitAt n xs 
(y1, y2) = splitAt n xs 

+  
+   another version (CPS) of chunk 

+  chunk n list = case list of { [] > [] ; (y:ys) > ch' ys (n1) (y:) } where 

+  ch' [] _ k = k [] : [] 

+  ch' (y:ys) 0 k = k [] : ch' ys (n1) (y:) 

+  ch' (y:ys) (c+1) k = ch' ys c (k . (y:)) 

</haskell> 
</haskell> 

Line 102:  Line 133:  
splitWhen = split . whenElt 
splitWhen = split . whenElt 

+  </haskell> 

+  
+  == Splits of known lengths == 

+  
+  I frequently require two types of splits, splitting into blocks of fixed length and splitting into lists of sizes of increasing powers of 2. My implementation was designed to be fold/builded as much as possible, so here goes: 

+  
+  <haskell> 

+  splitEvery :: Int > [e] > [[e]] 

+  splitEvery i l = map (take i) (build (splitter l)) where 

+  splitter [] _ n = n 

+  splitter l c n = l `c` splitter (drop i l) c n 

+  
+  For more general splits with foreknown lengths, 

+  splitPlaces :: [Int] > [e] > [[e]] 

+  splitPlaces ls xs = build (splitPlacer ls xs) where 

+  splitPlacer [] _ _ n = n 

+  splitPlacer _ [] _ n = n 

+  splitPlacer (l:ls) xs c n = let (x1, x2) = splitAt l xs 

+  in x1 `c` splitPlacer ls x2 c n 

+  
+  splitPowersOf2 :: [e] > [[e]] 

+  splitPowersOf2 = splitPlaces (iterate (*2) 1) 

+  </haskell> 

+  To be sure, neither is a good consumer, but I don't think that's avoidable, given that drop isn't a good consumer either. 

+  
+  Here splitEvery is equivalent to "chunks" above, but it is a much better producer, I think. (It is also intended to be mapped on, given that the (map (take i)) makes every element of the list into a producer. 

+  
+  == Break on Nothing == 

+  
+  <haskell> 

+  import Data.Maybe 

+  import Data.Either 

+  import Data.List (find, isPrefixOf) 

+  
+  breaks :: [Maybe a] > [[a]] 

+  breaks xs = (if null cur then id else ((map fromJust cur):)) 

+  (if null rem then [] else breaks (tail rem)) 

+  where (cur, rem) = break isNothing xs 

+  
+  replaces :: (Eq a) => [([a], [b])] > (a > b) > [a] > [b] 

+  replaces reps f = process 

+  where process [] = [] 

+  process l@(x:xs) = 

+  case find ((`isPrefixOf` xs).fst) reps of 

+  Nothing > f x : process xs 

+  Just (pat, rep) > rep ++ (process $ drop (length pat) l) 

+  
+  split :: (Eq a) => [([a], [Maybe a])] > [a] > [[a]] 

+  split reps = breaks . replaces reps Just 

+  
+  onSeq, onSeqKeep :: (Eq a) => [a] > ([a], [Maybe a]) 

+  onSeq xs = (xs, [Nothing]) 

+  onSeqKeep xs = (xs, Nothing : (map Just xs ++ [Nothing])) 

+  
+  onElt, onEltKeep :: (Eq a) => a > ([a], [Maybe a]) 

+  onElt x = onSeq [x] 

+  onEltKeep x = onSeqKeep [x] 

+  
+  insertAfter :: [Int] > a > [a] > [a] 

+  insertAfter [] _ xs = xs 

+  insertAfter (i:_) _ []  i > 0 = [] 

+  insertAfter (i:is) x xs = pre ++ [x] ++ insertAfter is x post 

+  where (pre, post) = splitAt i xs 

+  
+  splitEvery :: Int > [a] > [[a]] 

+  splitEvery i = splitPlaces (repeat i) 

+  
+  splitPlaces :: [Int] > [a] > [[a]] 

+  splitPlaces is xs = breaks $ insertAfter is Nothing $ map Just xs 

+  
+  splitPowersOf2 = splitPlaces (iterate (*2) 1) 

+  </haskell> 

+  
+  Implements all of the above ideas (except predicate matching). In order to split up an arithmetic expression, for example: 

+  
+  <haskell> 

+  split (onElt ' ' : map onEltKeep "+/*^()") 

+  </haskell> 

+  
+  == Split of known lengths again == 

+  I've written once something simply: 

+  
+  <haskell> 

+  fixFields :: [Int] > String > [String] 

+  fixFields _ [] = [] 

+  fixFields (n:ns) ls = xs:fixFields ns ys 

+  where (xs,ys) = splitAt n ls 

+  fixFields _ ls = [ls] 

+  </haskell> 

+  
+  == Simple split at known length == 

+  Simple recursive definition: 

+  
+  <haskell> 

+  chunk :: Int > [a] > [[a]] 

+  chunk n [] = [] 

+  chunk n xs = ys : chunk n zs 

+  where (ys,zs) = splitAt n xs 

+  </haskell> 

+  
+  Simple version using iterate: 

+  
+  <haskell> 

+   Author: mm_freak irc.freenode.net #haskell 

+  chunk' :: Int > [a] > [[a]] 

+  chunk' n = takeWhile (not.null) . map (take n) . iterate (drop n) 

+  </haskell> 

+  
+  Simple version using unfoldr: 

+  
+  <haskell> 

+  chunk'' :: Int > [a] > [[a]] 

+  chunk'' n = takeWhile (not.null) . List.unfoldr (Just . splitAt n) 

</haskell> 
</haskell> 
Latest revision as of 13:43, 22 February 2012
A (no longer theoretical) module which contains implementations/combinators for implementing every possible method of listsplitting known to man. This way no one has to argue about what the correct interface for split is, we can just have them all.
This has been brought up many times on the mailing lists, and there's a summary of the split proposals on List function suggestions. Just for reference:
Some possible ways to split a list, to get your creative juices flowing:
 what to split on?
 singleelement separator
 sublist separator
 use a list of possible separators instead of just one
 use a predicate on elements or sublists instead of giving explicit separators
 use approximate matching?
 chunks of fixed length
 how to split?
 discard the separators
 keep the separators with the preceding or following splits
 keep the separators as their own separate pieces of the result list
 what to do with separators at the beginning/end? create a blank split before/after, or not?
 keep blank splits between consecutive delimiters, or merge multiple consecutive delimiters into one delimiter?
An important caveat: we should strive to keep things flexible yet SIMPLE. The more complicated things get, the closer this gets to just being a general parsing or regex library. So the right balance needs to be struck.
Add your implementations below! Once we converge on something good we can upload it to hackage.
{# LANGUAGE ViewPatterns #}
import Data.List (unfoldr)
 intercalate :: [a] > [[a]] > [a]
 intercalate x [a,b,c,x,y,z] = [a,x,b,x,c,x,x,y,x,z,x]
 unintercalate :: [a] > [a] > [[a]]
 unintercalate x [a,x,b,x,c,x,x,y,x,z,x] = [a,b,c,[],y,z]
 unintercalate is the "inverse" of intercalate
match [] string = Just string
match (_:_) [] = Nothing
match (p:ps) (q:qs)  p == q = match ps qs
match (_:_) (_:_)  otherwise = Nothing
chopWith delimiter (match delimiter > Just tail) = return ([], tail)
chopWith delimiter (c:cs) = case chopWith delimiter cs of
Just (head, tail) > return (c:head, tail)
Nothing > return (c:cs, [])
chopWith delimiter [] = Nothing
unintercalate delimiter = unfoldr (chopWith delimiter)
 NOTE: this unintercalate used to have bugs, don't trust it
 this one discards separators and combines multiple adjacent separators
 splitOn (==',') "foo,bar,,,baz" = ["foo", "bar", "baz"]
 this is the behavior you want for List.words

splitOn :: (a > Bool) > [a] > [[a]]
splitOn _ [] = []
splitOn f l@(x:xs)
 f x = splitOn f xs
 otherwise = let (h,t) = break f l in h:(splitOn f t)
 this variant discards separators but does not combine adjacent separators
 splitOn' (==',') "foo,bar,,,baz" = ["foo", "bar", "", "", "baz"]
 this is the behavior you want for List.lines

splitOn' :: (a > Bool) > [a] > [[a]]
splitOn' f xs = split xs
where split xs = case break f xs of
(chunk,[]) > chunk : []
(chunk,_:rest) > chunk : split rest
 this variant keeps the separators but combines them
 splitOn'' Char.isSpace "foo bar \t baz" = ["foo"," ","bar"," \t ","baz"]

splitOn'' :: (a > Bool) > [a] > [[a]]
splitOn'' sep xs = split xs
where split [] = []
split xs = case break sep xs of
(chunk,[]) > chunk : []
(chunk,rest) >
case span sep rest of
(seps, rest)
 null chunk > seps : split rest
 otherwise > chunk : seps : split rest
 take the element who make predict true as delimiter
 > splitOn even [1,3,5,6,7,3,3,2,1,1,1]
 [[1,3,5],[7,3,3],[1,1,1]]
  like String split, except for any element that obeys Eq
 This trick works for all the 'splitOn' variants above.

splitEq :: Eq a => a > [a] > [[a]]
splitEq e = splitOn (==e)
  split at regular intervals
chunk :: Int > [a] > [[a]]
chunk _ [] = []
chunk n xs = y1 : chunk n y2
where
(y1, y2) = splitAt n xs
 another version (CPS) of chunk
chunk n list = case list of { [] > [] ; (y:ys) > ch' ys (n1) (y:) } where
ch' [] _ k = k [] : []
ch' (y:ys) 0 k = k [] : ch' ys (n1) (y:)
ch' (y:ys) (c+1) k = ch' ys c (k . (y:))
Contents
A combinator approach?
Here are some initial thoughts on a combinator approach. The trick is to find nice implementations of the declarations below. Please add your own thoughts, other combinators, etc.
data Splitter a
split :: Splitter a > [a] > [[a]]
onElts :: [a] > Splitter a  split on any of these elements
onSublist :: [a] > Splitter a  split on this exact subsequence
whenElt :: (a > Bool) > Splitter a
keepingDelims :: Splitter a > Splitter a
collapsingNulls :: Splitter a > Splitter a
 other basic combinators?
 now you can write things like

 split (collapsingNulls $ onElts " ,") "abc,def , gh"

 which should evaluate to ["abc", "def", "gh"].
 some convenience functions can be provided, such as...
splitOn = split . onElts
splitWhen = split . whenElt
Splits of known lengths
I frequently require two types of splits, splitting into blocks of fixed length and splitting into lists of sizes of increasing powers of 2. My implementation was designed to be fold/builded as much as possible, so here goes:
splitEvery :: Int > [e] > [[e]]
splitEvery i l = map (take i) (build (splitter l)) where
splitter [] _ n = n
splitter l c n = l `c` splitter (drop i l) c n
For more general splits with foreknown lengths,
splitPlaces :: [Int] > [e] > [[e]]
splitPlaces ls xs = build (splitPlacer ls xs) where
splitPlacer [] _ _ n = n
splitPlacer _ [] _ n = n
splitPlacer (l:ls) xs c n = let (x1, x2) = splitAt l xs
in x1 `c` splitPlacer ls x2 c n
splitPowersOf2 :: [e] > [[e]]
splitPowersOf2 = splitPlaces (iterate (*2) 1)
To be sure, neither is a good consumer, but I don't think that's avoidable, given that drop isn't a good consumer either.
Here splitEvery is equivalent to "chunks" above, but it is a much better producer, I think. (It is also intended to be mapped on, given that the (map (take i)) makes every element of the list into a producer.
Break on Nothing
import Data.Maybe
import Data.Either
import Data.List (find, isPrefixOf)
breaks :: [Maybe a] > [[a]]
breaks xs = (if null cur then id else ((map fromJust cur):))
(if null rem then [] else breaks (tail rem))
where (cur, rem) = break isNothing xs
replaces :: (Eq a) => [([a], [b])] > (a > b) > [a] > [b]
replaces reps f = process
where process [] = []
process l@(x:xs) =
case find ((`isPrefixOf` xs).fst) reps of
Nothing > f x : process xs
Just (pat, rep) > rep ++ (process $ drop (length pat) l)
split :: (Eq a) => [([a], [Maybe a])] > [a] > [[a]]
split reps = breaks . replaces reps Just
onSeq, onSeqKeep :: (Eq a) => [a] > ([a], [Maybe a])
onSeq xs = (xs, [Nothing])
onSeqKeep xs = (xs, Nothing : (map Just xs ++ [Nothing]))
onElt, onEltKeep :: (Eq a) => a > ([a], [Maybe a])
onElt x = onSeq [x]
onEltKeep x = onSeqKeep [x]
insertAfter :: [Int] > a > [a] > [a]
insertAfter [] _ xs = xs
insertAfter (i:_) _ []  i > 0 = []
insertAfter (i:is) x xs = pre ++ [x] ++ insertAfter is x post
where (pre, post) = splitAt i xs
splitEvery :: Int > [a] > [[a]]
splitEvery i = splitPlaces (repeat i)
splitPlaces :: [Int] > [a] > [[a]]
splitPlaces is xs = breaks $ insertAfter is Nothing $ map Just xs
splitPowersOf2 = splitPlaces (iterate (*2) 1)
Implements all of the above ideas (except predicate matching). In order to split up an arithmetic expression, for example:
split (onElt ' ' : map onEltKeep "+/*^()")
Split of known lengths again
I've written once something simply:
fixFields :: [Int] > String > [String]
fixFields _ [] = []
fixFields (n:ns) ls = xs:fixFields ns ys
where (xs,ys) = splitAt n ls
fixFields _ ls = [ls]
Simple split at known length
Simple recursive definition:
chunk :: Int > [a] > [[a]]
chunk n [] = []
chunk n xs = ys : chunk n zs
where (ys,zs) = splitAt n xs
Simple version using iterate:
 Author: mm_freak irc.freenode.net #haskell
chunk' :: Int > [a] > [[a]]
chunk' n = takeWhile (not.null) . map (take n) . iterate (drop n)
Simple version using unfoldr:
chunk'' :: Int > [a] > [[a]]
chunk'' n = takeWhile (not.null) . List.unfoldr (Just . splitAt n)