I need to separate natural, coherent text/sentences in emails from lists, signatures, greetings and so on before further processing.



last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.

  • list item 2
  • list item 3
  • list item 3

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit in voluptate velit

Ideally, the algorithm would match only the prose paragraphs (shown in bold in the original post), not the list items.

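Is there a suggested approach, or even an existing solution, for this problem? Should I try regular expressions, or something heavier and statistical based on punctuation counts, length, and so on?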


You will need some serious NLP to get this kind of processing done. How much depends on the level of precision you expect and on how random and vague the input email data is.

See this paper: http://research.microsoft.com/en-us/um/people/joshuago/conference/papers-2004/135.pdf. See also other related work.


In the example you posted, line length suffices.


I suggest you just use line length until you find a counterexample, at which point revise your algorithm. Repeat until the problem is solved to your satisfaction.
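As a rough illustration of the line-length idea, here is a minimal Python sketch. The function name `extract_prose` and the 60-character threshold are assumptions made for this example, not anything given in the question; the threshold in particular would need tuning against real emails.

```python
def extract_prose(text, min_length=60):
    """Return the lines of `text` that look like running prose.

    Heuristic (an assumption for this sketch): a line counts as prose
    when, after stripping surrounding whitespace, it has at least
    `min_length` characters. Short lines such as list items,
    greetings, and signatures fall below the threshold.
    """
    prose = []
    for line in text.splitlines():
        stripped = line.strip()
        if len(stripped) >= min_length:
            prose.append(stripped)
    return prose


sample = """last monday we did bla bla, Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt.

  • list item 1
  • list item 2

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat.
"""

for paragraph in extract_prose(sample):
    print(paragraph)
```

This only works when each paragraph arrives as one long line, as in the example above. Hard-wrapped emails, where prose lines are about as short as list items, would be the first counterexample to revise the heuristic against.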



