Algorithm to match natural text in mail

I need to separate natural, coherent text/sentences in emails from lists, signatures, greetings and so on before further processing.

For example:

Morning,

last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.

  • list item 2
  • list item 3
  • list item 3

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit in voluptate velit

Page: 1

Page: 1

e.g. c.

33 Sin Street, London

Mobile: 00 234534/234345

Ideally the algorithm would match only the bold parts, that is, the two paragraphs of running prose.

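Is there a recommended approach, or even an existing solution, for this problem? Should I try regular expressions, or something more statistical based on punctuation counts, line length and so on?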

Best answer

You need to go through serious NLP stuff to get the desired processing done (depends on what level of precision you are expecting and the randomness and vagueness of the input email data for your code).

See this paper: http://research.microsoft.com/en-us/um/people/joshuago/conference/papers-2004/135.pdf. See the other related sections as well.

Answers

In the example you post, line length suffices.

There is no perfect algorithm; even humans will classify the lines differently.

I suggest just using line length until you find a counterexample, at which point revise your algorithm. Repeat until the problem is solved to your satisfaction.
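
As a starting point, the line-length heuristic could be sketched like this. The function name `prose_lines` and the 80-character threshold are assumptions made for illustration; the threshold in particular would need tuning against real mail.

```python
def prose_lines(text, min_length=80):
    """Return the lines of `text` that look like running prose.

    A line counts as prose when, after trimming whitespace, it is at
    least `min_length` characters long. The default of 80 is an
    assumed cut-off, not a value taken from the discussion.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if len(stripped) >= min_length:
            kept.append(stripped)
    return kept
```

Short list items, greetings and phone numbers fall under the threshold and are dropped. A long list item or a short sentence would be misclassified, which is the kind of counterexample that should prompt a revision.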

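You will need a lot of heuristics to approximate a solution. Here is one: you can safely discard anything after the "-- " delimiter (hyphen, hyphen, space), which standards-compliant email messages use to separate the message body from the signature.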
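
A minimal sketch of the signature cut, assuming the message body is already available as a plain-text string. The name `strip_signature` and the `strict` flag are hypothetical; the flag is there only because some mail software drops the trailing space from the delimiter.

```python
def strip_signature(body, strict=True):
    """Return `body` without the signature block.

    The signature is assumed to start at the first line that is exactly
    "-- " (hyphen, hyphen, space). With strict=False a bare "--" line,
    with the trailing space lost in transit, is accepted as well.
    """
    lines = body.splitlines()
    for index, line in enumerate(lines):
        is_delimiter = line == "-- " or (not strict and line.rstrip() == "--")
        if is_delimiter:
            # Keep only what comes before the delimiter line.
            return "\n".join(lines[:index])
    # No delimiter found: leave the message untouched.
    return body
```

Messages whose senders do not use the delimiter pass through unchanged, so this only helps as one heuristic among several.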

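Another approach is to store copies of emails from the same sender; this should let you extract the content that is identical or similar in every message (such as valedictions and signatures) and work out how their mail client quotes replies.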
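
The per-sender comparison could be sketched as follows. This version only catches lines that are exactly identical across messages; finding merely similar lines or detecting quoting styles is not covered. The names `repeated_lines` and `strip_repeated` and the 0.8 share are assumptions for illustration.

```python
from collections import Counter


def repeated_lines(messages, min_share=0.8):
    """Return the lines that occur in at least `min_share` of `messages`.

    `messages` is a list of plain-text bodies from one sender. Each
    distinct non-empty line is counted once per message.
    """
    counts = Counter()
    for body in messages:
        unique = {line.strip() for line in body.splitlines() if line.strip()}
        counts.update(unique)
    threshold = min_share * len(messages)
    return {line for line, seen in counts.items() if seen >= threshold}


def strip_repeated(body, boilerplate):
    """Drop every line of `body` whose trimmed text is in `boilerplate`."""
    return "\n".join(
        line for line in body.splitlines() if line.strip() not in boilerplate
    )
```

With only a handful of stored messages the counts are unreliable, so the result is best treated as a hint to combine with the other heuristics.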




