Question

I need to separate natural, coherent text/sentences in emails from lists, signatures, greetings and so on before further processing.

For example:

Morning,

last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.

list item 2

list item 3

list item 3

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit in voluptate velit

Page: 1

Page: 1

e.g., c.

33 evil street, london

mobile: 00 234534/234345

Ideally the algorithm would match only the bold parts (the two full-sentence paragraphs above).

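Is there a recommended approach, or even an existing algorithm that solves this problem? Should I try regular expressions, or something more statistical based on the number of punctuation marks, length and so on?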

Answer 1

You need to go through some serious NLP stuff to get the desired processing done (how much depends on the level of precision you expect and on how random and vague the input email data is).

See this: http://research.microsoft.com/en-us/um/people/joshuago/conference/papers-2004/135.pdf. See also the other related sections.

Answer 2

In the example you post, line length suffices.

There is no perfect algorithm; even humans will classify the lines differently.

I suggest just using line length until you find a counterexample, at which point you revise your algorithm. Repeat until the problem is solved to your satisfaction.
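
A minimal sketch of the line-length idea, in Python. The function name `prose_lines`, the 80-character threshold, and the shortened sample email are assumptions made for illustration; the threshold is the first thing to tune once a counterexample shows up.

```python
def prose_lines(text, min_length=80):
    """Return the lines of `text` that are at least `min_length` characters long."""
    # min_length=80 is an assumed starting value, not a recommendation.
    return [line.strip() for line in text.splitlines()
            if len(line.strip()) >= min_length]


# Shortened, hypothetical version of the email from the question.
email_body = """Hi,

last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.

list item 2

list item 3

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit in voluptate velit

33 evil street, london

mobile:00 234534/234345
"""

for line in prose_lines(email_body):
    print(line)
```

On this sample only the two long paragraphs pass the filter. A short sentence such as "See you tomorrow." would be dropped, which is the kind of counterexample that should trigger a revision.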

Answer 3

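You will need a lot of heuristics to get an approximation of a solution, and here is one: you can safely cut off anything after the delimiter (hyphen-hyphen-space), which in standards-compliant email messages separates the message text from the signature.

Another approach you can take is to store copies of emails from the same sender; this should let you extract the content that is identical or similar in every message (such as greetings and signatures), and figure out how their mail client quotes.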
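
A rough sketch of the delimiter heuristic in Python. The function name `strip_signature` and the `lenient` flag are made up for this example. It assumes the delimiter sits on a line of its own, and it cuts at the first such line.

```python
def strip_signature(body, lenient=False):
    """Return the part of `body` that precedes the signature delimiter line."""
    kept = []
    for line in body.splitlines():
        # The conventional delimiter is exactly two hyphens plus one space.
        # With lenient=True (an assumption for clients that strip trailing
        # whitespace) a bare "--" line is accepted as well.
        if line == "-- " or (lenient and line.rstrip() == "--"):
            break
        kept.append(line)
    return "\n".join(kept)


message = "Hello,\nsee you monday.\n-- \nJohn\nmobile: 123"
print(strip_signature(message))
```

Messages without a delimiter line come back unchanged, so this only helps for senders whose mail client actually inserts one.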
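
One possible way to sketch the same-sender comparison in Python. The names `repeated_lines` and `remove_repeated` and the 0.8 share are hypothetical, and this version only catches lines that are exactly identical after trimming whitespace; "similar" lines would need some fuzzy matching on top.

```python
from collections import Counter


def repeated_lines(messages, min_share=0.8):
    """Return lines that occur in at least `min_share` of the given messages."""
    counts = Counter()
    for body in messages:
        # Count each distinct non-empty line once per message.
        counts.update({line.strip() for line in body.splitlines() if line.strip()})
    threshold = min_share * len(messages)
    return {line for line, n in counts.items() if n >= threshold}


def remove_repeated(body, boilerplate):
    """Drop every line of `body` whose trimmed form is in `boilerplate`."""
    return "\n".join(line for line in body.splitlines()
                     if line.strip() not in boilerplate)


# Hypothetical stored messages from one sender.
stored = [
    "Hi,\nfirst text\nRegards\nJohn",
    "Hi,\nsecond text\nRegards\nJohn",
    "Hi,\nthird text\nRegards\nJohn",
]
boilerplate = repeated_lines(stored)
print(remove_repeated(stored[0], boilerplate))
```

The more messages per sender are stored, the more reliable the repeated set becomes; with only one or two messages almost every line looks repeated.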

Answer 4

If your only task is to fish out the bold parts, look at how the bold text is technically implemented in your mail database. For example, if it's HTML, you could have something like this:

Morning,

<b>last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.</b>
list item 2
list item 3
list item 3
<b>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit in voluptate velit</b>

Page: 1

Page: 1

Then you can use the following code:

import re
# save the mail above as the string variable MailAbove
print(re.findall(r'<b>(.*?)</b>', MailAbove))

Result:

['last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.', 'Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit in voluptate velit']

Edit: It follows from the comment that I misunderstood the question. Generally, such tasks are a multi-stage process: you apply some methods, then look at the result to see what is missing or what got in by mistake, then you make fixes or add new methods and see what the outcome is.
I recommend reading this - an excellent tutorial/book on solving tasks like yours and beyond.
