Question

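I'm implementing a motif-finding algorithm from the bioinformatics domain using Haskell. I won't go into the specific details of the algorithm other than to say it is a branch and bound median string search. I had planned to make my implementation more interesting by implementing a concurrent approach (and later an STM approach) in order to get a multicore speedup, but after compiling with the following flags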

$ ghc -prof -auto-all -O2 -fllvm -threaded -rtsopts --make main

and printing the profile, I saw something interesting (and perhaps obvious):

COST CENTRE      entries  %time %alloc  
hammingDistance  34677951  47.6   14.7  
motifs           4835446   43.8   71.1

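It's clear that a remarkable speedup could be had without going anywhere near multicore programming (although that part has already been done; I just need to find some good test data and get that sorted out).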

Anyway, both of these functions are purely functional and in no way concurrent. They're also doing quite simple stuff, so I was surprised that they took so much time. Here's the code for them:

data NukeTide = A | T | C | G deriving (Read, Show, Eq, Ord, Enum)

type Motif = [NukeTide] 

hammingDistance :: Motif -> Motif -> Int
hammingDistance [] [] = 0
hammingDistance xs [] = 0 -- optimistic
hammingDistance [] ys = 0 -- optimistic
hammingDistance (x:xs) (y:ys) = case (x == y) of
    True  -> hammingDistance xs ys
    False -> 1 + hammingDistance xs ys

motifs :: Int -> [a] -> [[a]]
motifs n nukeTides = [ take n $ drop k nukeTides | k <- [0..length nukeTides - n] ]

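It's worth noting that, of the two arguments to hammingDistance, I can actually assume that xs has a fixed length and that ys is no longer than xs, if that opens up room for improvements.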

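As you can see, hammingDistance calculates the Hamming distance between two motifs (which are lists of nucleotides). The motifs function takes a number and a list and returns all the substrings of that length, e.g.: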

> motifs 3 "hello world"
["hel","ell","llo","lo ","o w"," wo","wor","orl","rld"]

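Since the algorithmic processes involved are so simple, I can't think of a way to optimize this further. I do, however, have two guesses as to where I should be headed: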

HammingDistance: The data types I'm using (NukeTides and []) are slow/clumsy. This is just a guess, since I'm not familiar with their implementations, but I think defining my own datatype, although more legible, probably involves more overhead than I intend. Also, the pattern matching is foreign to me; I don't know if that is trivial or costly.
Motifs: If I'm reading this correctly, 70% of all memory allocations are done by motifs, and I'd assume that has to be garbage collected at some time. Again, using the all-purpose list might be slowing me down, or the list comprehension might be, since the cost of that is incredibly unclear to me.

Does anybody have any advice on the usual procedure here? If data types are the problem, would arrays be the right answer? (I've heard they come in boxes)

Thanks for any help.

Edit: It just occurred to me that it might help if I describe how these two functions are called:

totalDistance :: Motif -> Int
totalDistance motif = sum $ map (minimum . map (hammingDistance motif) . motifs l) dna

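This function is the result of another function, and it is passed around the nodes of a tree. At each node in the tree, the nucleotide sequence (of length <= n; if its length equals n, the node is a leaf) is evaluated, using totalDistance to score the node. From then on it's your typical branch and bound algorithm.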

Edit: John asked that I print out the change I made which virtually eliminated the cost of motifs:

scoreFunction :: DNA -> Int -> (Motif -> Int)
scoreFunction dna l = totalDistance
    where
        -- The sum of the minimum hamming distance in each line of dna
        -- is given by totalDistance motif
        totalDistance motif = sum $ map (minimum . map (hammingDistance motif)) possibleMotifs
        possibleMotifs = map (motifs l) dna -- Previously this was computed in the line above

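I didn't make it clear in the original post, but scoreFunction is only called once, and the result is passed around in the tree traversal/branch and bound and used to evaluate nodes. Recomputing motifs at every step of the way was, in retrospect, not one of the brightest things I've done.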

Answer 1

Your definition of hammingDistance is probably a lot less efficient than it could be.

hammingDistance (x:xs) (y:ys) = case (x == y) of
    True  -> hammingDistance xs ys
    False -> 1 + hammingDistance xs ys

Because of laziness, this will be expanded (in the worst case) to:

(1 + (1 + (1 + ...)))

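That chain of additions is only reduced when the result is used. Whether this is actually a problem depends on the call site, compiler options, etc., so it's usually good practice to write your code in a form that avoids the problem entirely.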

A common solution is to create a tail-recursive form with a strict accumulator, but in this case you can use higher-order functions, like this:

hammingDistance :: Motif -> Motif -> Int
hammingDistance xs ys = length . filter (uncurry (/=)) $ zip xs ys

Here's the tail-recursive implementation, for comparison:

{-# LANGUAGE BangPatterns #-}

hammingDistance :: Motif -> Motif -> Int
hammingDistance xs ys = go 0 xs ys
  where
    go !acc [] [] = acc
    go !acc xs [] = acc -- optimistic
    go !acc [] ys = acc -- optimistic
    go !acc (x:xs) (y:ys) = case (x == y) of
      True  -> go acc xs ys
      False -> go (acc+1) xs ys

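This uses the BangPatterns extension to force the accumulator to be evaluated strictly; otherwise it would have the same problem as your current definition.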
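The same strict-accumulator idea can also be written with foldl' from Data.List, which forces the running count at every step without any language extension. This is only a minimal sketch: the name hammingDistanceFold is made up for illustration, and the NukeTide type is repeated so the block stands alone.

```haskell
import Data.List (foldl')

data NukeTide = A | T | C | G deriving (Show, Eq)

-- Hypothetical name; counts the positions at which the two lists differ.
-- zip stops at the shorter list, which matches the "optimistic" cases
-- of the hand-written versions.
hammingDistanceFold :: [NukeTide] -> [NukeTide] -> Int
hammingDistanceFold xs ys = foldl' step 0 (zip xs ys)
  where
    -- foldl' evaluates the accumulator before each step,
    -- so no chain of (1 + (1 + ...)) thunks builds up.
    step acc (x, y)
      | x == y    = acc
      | otherwise = acc + 1
```

Whether this beats the explicit go loop depends on how well the intermediate zip list is optimized away, so it is worth profiling both.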

To directly answer some of your other questions:

Pattern matching is trivial
Whether you should use lists or arrays depends mostly on how the data is created and how it's consumed. For this case, it's possible that lists may be the best type. In particular, if your lists are all consumed as they're created, and you don't ever need the whole list in memory, they should be fine. If you do retain lists in memory though, they have a lot of space overhead.

Use pattern

I think the way you're using these functions is doing some extra work as well:

(minimum . map (hammingDistance motif) . motifs l)

Since you only need the minimum hammingDistance, you may be calculating a lot of extra values that aren't necessary. I can think of two solutions for this:

Option 1. Define a new function hammingDistanceThresh :: Motif -> Int -> Motif -> Int, which stops when it exceeds the threshold. The slightly odd type ordering is to facilitate using it in a fold, like this:

let motifs' = motifs l dnaLine  -- dnaLine stands for one element of dna
in foldl' (hammingDistanceThresh motif) (hammingDistance motif $ head motifs') (tail motifs')
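Option 1 gives only the type of hammingDistanceThresh, so here is one way it could look. Treat it as a sketch under assumptions: the threshold is taken to be the best distance seen so far, the function returns that threshold unchanged as soon as the running count reaches it, and minDistance is a hypothetical helper that seeds the fold with maxBound instead of the first candidate's distance.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

data NukeTide = A | T | C | G deriving (Show, Eq)
type Motif = [NukeTide]

-- Returns min thresh (hamming distance), but stops comparing as soon as
-- the running count can no longer beat the threshold.
hammingDistanceThresh :: Motif -> Int -> Motif -> Int
hammingDistanceThresh motif thresh candidate = go 0 motif candidate
  where
    go !acc _ _ | acc >= thresh = thresh   -- cannot improve on the current best
    go !acc (x:xs) (y:ys)
      | x == y    = go acc xs ys
      | otherwise = go (acc + 1) xs ys
    go !acc _ _ = acc                      -- one list ran out ("optimistic")

-- Hypothetical helper: minimum distance from motif to any candidate.
-- Yields maxBound for an empty candidate list.
minDistance :: Motif -> [Motif] -> Int
minDistance motif = foldl' (hammingDistanceThresh motif) maxBound
```

Once a candidate at distance 0 is found, every remaining candidate is rejected after a single comparison of the accumulator against the threshold.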

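Option 2. If you define a lazy natural number type, you can use that instead of Ints for the result of hammingDistance. Then only as much of each Hamming distance as is needed will be computed.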
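To make Option 2 concrete, here is a rough sketch using Peano-style naturals. All names here (Nat, minNat, natToInt, hammingDistanceLazy, minDistanceLazy) are invented for illustration and are not taken from any library.

```haskell
{-# LANGUAGE BangPatterns #-}

data NukeTide = A | T | C | G deriving (Show, Eq)
type Motif = [NukeTide]

-- Lazy naturals: a number is a chain of S constructors produced on demand.
data Nat = Z | S Nat

-- Lazy minimum: peels one S off both arguments at a time and stops
-- as soon as either side reaches Z.
minNat :: Nat -> Nat -> Nat
minNat Z _ = Z
minNat _ Z = Z
minNat (S a) (S b) = S (minNat a b)

natToInt :: Nat -> Int
natToInt = go 0
  where
    go !acc Z     = acc
    go !acc (S n) = go (acc + 1) n

-- Each mismatch emits one S; the rest of the comparison stays unevaluated
-- until somebody asks for the next constructor.
hammingDistanceLazy :: Motif -> Motif -> Nat
hammingDistanceLazy (x:xs) (y:ys)
  | x == y    = hammingDistanceLazy xs ys
  | otherwise = S (hammingDistanceLazy xs ys)
hammingDistanceLazy _ _ = Z

-- Partial: foldr1 fails on an empty candidate list.
minDistanceLazy :: Motif -> [Motif] -> Int
minDistanceLazy motif = natToInt . foldr1 minNat . map (hammingDistanceLazy motif)
```

Because minNat never looks further into an argument than the smaller one reaches, candidates far from the motif are abandoned early. The price is one heap-allocated constructor per mismatch, so this may or may not beat the threshold version in practice.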

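One last note: using -auto-all will very often generate code that is much slower than other profiling options. I'd suggest you try using just -auto first, and then add manual cost-centre annotations as necessary.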
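For reference, a manual annotation is an SCC pragma placed in front of an expression. The sketch below assumes you compile with -prof but without any automatic cost-centre flag, so that only the two labelled expressions appear in the profile; the label strings are arbitrary. As far as I know, recent GHC releases spell -auto and -auto-all as -fprof-auto-exported and -fprof-auto, so check the user guide for your version.

```haskell
data NukeTide = A | T | C | G deriving (Show, Eq)
type Motif = [NukeTide]

hammingDistance :: Motif -> Motif -> Int
hammingDistance xs ys =
  -- Costs of this expression are attributed to the label "hammingDistance".
  {-# SCC "hammingDistance" #-} length (filter id (zipWith (/=) xs ys))

motifs :: Int -> [a] -> [[a]]
motifs n xs =
  -- A second, separately reported cost centre.
  {-# SCC "motifs" #-} [ take n (drop k xs) | k <- [0 .. length xs - n] ]
```

Without -prof the pragmas are ignored, so they can stay in the source.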

Answer 2

Your definition of motifs looks like it's doing a lot more traversing than necessary, because each application of drop has to traverse the list from the beginning. I would implement it using Data.List.tails instead:

motifs2 :: Int -> [a] -> [[a]]
motifs2 n nukeTides = map (take n) $ take count $ tails nukeTides
  where count = length nukeTides - n + 1

A quick comparison in GHCi shows the difference (using sum . map length to force evaluation):

*Main> let xs = concat (replicate 10000 [A, T, C, G])
(0.06 secs, 17914912 bytes)
*Main> sum . map length $ motifs 5 xs
199980
(3.47 secs, 56561208 bytes)
*Main> sum . map length $ motifs2 5 xs
199980
(0.15 secs, 47978952 bytes)
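If computing length up front is undesirable (it walks the whole input before the first window is produced), the window count can be bounded lazily instead. The variant below is only a sketch: the name motifs3 is made up, and it assumes n >= 1.

```haskell
import Data.List (tails)

-- Hypothetical variant of motifs2 that never calls length.
-- drop (n - 1) xs has exactly as many elements as there are full windows,
-- so zipping the tails against it cuts the list of tails off at the right place.
motifs3 :: Int -> [a] -> [[a]]
motifs3 n xs = map (take n) (zipWith const (tails xs) (drop (n - 1) xs))
```

Since nothing forces the whole input, this also works incrementally, e.g. take 2 (motifs3 2 [1..]) yields [[1,2],[2,3]].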

Answer 3

Right... I couldn't resist pushing this to the limit, and wrote a bare-metal bit-packing implementation:

{-# language TypeSynonymInstances #-}
{-# language BangPatterns #-}

import Data.Bits
import Data.Word


data NukeTide = A | T | C | G deriving (Read, Show, Eq, Ord, Enum)

type UnpackedMotif = [NukeTide] 

type PackageType = Word32
nukesInPackage = 16 :: Int
allSetMask = complement 0 :: PackageType


-- Be careful to have length of motif == nukesInPackage here!
packNukesToWord :: UnpackedMotif -> PackageType
packNukesToWord = packAt 0
    where packAt _ [] = 0
          packAt i (m:ml) =    (b0 m .&. bit i)
                           .|. (b1 m .&. bit (i+1))
                           .|. packAt (i+2) ml
          b0 A = 0
          b0 T = allSetMask
          b0 C = 0
          b0 G = allSetMask
          b1 A = 0
          b1 T = 0
          b1 C = allSetMask
          b1 G = allSetMask

unpackNukesWord :: PackageType -> UnpackedMotif
unpackNukesWord = unpackNNukesFromWord nukesInPackage

unpackNNukesFromWord :: Int -> PackageType -> UnpackedMotif
unpackNNukesFromWord = unpackN
    where unpackN 0 _ = []
          unpackN i w = (nukeOf $ w .&. r2Mask):(unpackN (i-1) $ w`shiftR`2)
          nukeOf bs
           | bs == 0      = A
           | bs == bit 0  = T
           | bs == bit 1  = C
           | otherwise    = G
          r2Mask = (bit 1 .|. bit 0) :: PackageType


data PackedMotif = PackedMotif { motifPackets::[PackageType]
                               , nukesInLastPack::Int        }
 -- note nukesInLastPack will never be zero; motifPackets must be [] to represent empty motifs.
packNukes :: UnpackedMotif -> PackedMotif
packNukes m = case remain of
               [] -> PackedMotif [packNukesToWord takeN] (length takeN)
               r  -> prAppend (packNukesToWord takeN) (packNukes r)
    where (takeN, remain) = splitAt nukesInPackage m
          prAppend w (PackedMotif l i) = PackedMotif (w:l) i

unpackNukes :: PackedMotif -> UnpackedMotif
unpackNukes (PackedMotif l i) = unpack l i
  where unpack [l] i = unpackNNukesFromWord i l
        unpack (l:ls) i = unpackNukesWord l ++ unpack ls i
        unpack [] _ = []

instance Show PackedMotif where
  show = show . unpackNukes



class Nukes a where
  pLength :: a -> Int
  shiftLN1 :: a -> a
  hammingDistance :: a -> a -> Int
  motifs :: Int -> a -> [a]

instance Nukes PackageType where
  pLength _ = nukesInPackage
  shiftLN1 = (`shiftR`2)
  hammingDistance !x !y = fromIntegral $ abt (x `xor` y)
      where abt !b = bbt(b.&.a0Mask .|. ((b.&.a1Mask) `shiftR` 1))
            bbt !b = sbt $ (b.&.r16Mask) + (b `shiftR` nukesInPackage)
            sbt !b = (r2Mask .&. b)             + (r2Mask .&. (b`shiftR`2))
                   + (r2Mask .&. (b`shiftR`4))  + (r2Mask .&. (b`shiftR`6))
                   + (r2Mask .&. (b`shiftR`8))  + (r2Mask .&. (b`shiftR`10))
                   + (r2Mask .&. (b`shiftR`12)) + (r2Mask .&. (b`shiftR`14))
            a0Mask = 0x55555555 :: PackageType
            a1Mask = 0xAAAAAAAA :: PackageType
            r16Mask = 0xFFFF :: PackageType
            r2Mask = 0x3 :: PackageType
  motifs 0 _ = []
  motifs l x = x : motifs (l-1) (shiftLN1 x)


maskNukesBut :: Int -> PackageType -> PackageType
maskNukesBut i = ( ( allSetMask `shiftR` (2*(nukesInPackage - i)) ) .&.)

instance Nukes PackedMotif where
  pLength (PackedMotif (x:xs) ix) = nukesInPackage * (length xs) + ix
  pLength _ = 0
  shiftLN1 ξ@(PackedMotif [] _) = ξ
  shiftLN1 (PackedMotif [x] ix) | ix>1       = PackedMotif [x`shiftR`2] (ix-1)
                                | otherwise  = PackedMotif [] nukesInPackage
  shiftLN1 (PackedMotif (x:x':xs) ix)
        = PackedMotif (( shiftLN1 x .|. pnext ):sxs) resLMod
      where sxs = motifPackets $ shiftLN1 (PackedMotif (x':xs) ix)
            pnext = shiftL (x'.&.0x3) 30
            resLMod = if ix > 1 then (ix-1) else nukesInPackage
  hammingDistance xs ys = go 0 xs ys
    where
      go :: Int -> PackedMotif -> PackedMotif -> Int
      go !acc (PackedMotif [x] ix) (PackedMotif [y] iy)
       | ix > iy    = acc + (hammingDistance y $ maskNukesBut iy x)
       | otherwise  = acc + (hammingDistance x $ maskNukesBut ix y)
      go !acc (PackedMotif [x] ix) (PackedMotif (y:ys) iy)
        = acc + (hammingDistance x $ maskNukesBut ix y)
      go !acc (PackedMotif (x:xs) ix) (PackedMotif [y] iy)
        = acc + (hammingDistance y $ maskNukesBut iy x)
      go !acc (PackedMotif (x:xs) ix) (PackedMotif (y:ys) iy)
        = go (acc + hammingDistance x y) (PackedMotif xs ix) (PackedMotif ys iy)
      go !acc _ _ = acc
  motifs l ξ
     | l>0        = fShfts (min nukesInPackage $ pLength ξ + 1 - l) ξ >>= ct
     | otherwise  = []
    where fShfts k χ | k > 0      = χ : fShfts (k-1) (shiftLN1 χ)
                     | otherwise  = []
          ct (PackedMotif ys iy) = case remain of
                [] -> if (length takeN - 1) * nukesInPackage + iy >= l
                       then [PackedMotif takeN lMod] else []
                _  -> PackedMotif takeN lMod : ct(PackedMotif (tail ys) iy)
            where (takeN, remain) = splitAt lQuot ys
          (lQuot,lMod) = case l `quotRem` nukesInPackage of
                   (i,0) -> (i, nukesInPackage)
                   (i,m) -> (i+1, m)

This can be used on UnpackedMotif = [NukeTide] values via the packNukes function, e.g.



*BioNuke0> motifs 23 $ packNukes $ take 27 $ cycle [A,T,G,C,A]
[[A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G],[T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C],[G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A],[C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A],[A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T,G,C,A,A,T]]

*BioNuke0> hammingDistance (packNukes [A,T,G,C,A,A,T,G]) (packNukes [A,T,C,C,A,T,G])
3

*BioNuke0> map (hammingDistance (packNukes $ take 52 $ cycle [A,T,C,C,A,T,G])) (motifs 52 $ packNukes $ take 523 $ cycle [A,T,C,C,A,T,G])
[0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44,38,52,0,52,36,45,44]


I haven't compared the performance to the original version yet, but it should be quite a bit faster than any algebraic-datatype implementation. Plus, it readily offers a space-efficient storage format.
