Question

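This particular problem is easy to solve, but I'm not sure that the solution I'd arrive at would be computationally efficient. So I'm asking the experts!

What would be the best way to go through a large file, collecting stats (for the entire file) on how often two words occur in the same line?

For instance, if the text contained only the following two lines: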

"This is the white baseball."
"These guys have white baseball bats."

You would end up collecting the following stats: (this, is: 1), (this, the: 1), (this, white: 1), (this, baseball: 1), (is, the: 1), (is, white: 1), (is, baseball: 1) ... and so forth.

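For the entry (baseball, white: 2), the value is 2 because this pair of words occurs in the same line a total of 2 times.

Ideally, the stats should be placed in a dictionary whose keys are alphabetized at the tuple level (i.e., you would not want separate entries for "this, is" and "is, this"). We don't care about the order here: we just want to see how often each possible pair of words occurs throughout the text.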
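To illustrate the kind of key I mean, here is a minimal sketch. The name `pair_key` is made up for illustration, and the lowercasing is an assumption based on the example stats above, where "This" is counted as "this".

```python
def pair_key(a, b):
    # Hypothetical helper: lowercase both words and put them in
    # alphabetical order, so either word order maps to the same key.
    a, b = a.lower(), b.lower()
    return (a, b) if a <= b else (b, a)

print(pair_key("This", "is"))  # ('is', 'this')
print(pair_key("is", "this"))  # ('is', 'this')
```

With keys built this way, the dictionary holds one entry per unordered pair.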

Answer 1

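from collections import defaultdict
import itertools as it
import re

# Assumes `lines` is an iterable of strings, e.g. an open file object.
pairs = defaultdict(int)

for line in lines:
    # Lowercase, deduplicate and sort the words so that each pair is
    # counted once per line and its key is in alphabetical order.
    words = sorted(set(re.findall(r'\w+', line.lower())))
    for pair in it.combinations(words, 2):
        pairs[pair] += 1

# Flatten to (word1, word2, count) tuples.
resultList = [pair + (occurrences,) for pair, occurrences in pairs.items()]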
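As a quick check against the two sample lines from the question, the same idea can be sketched with `collections.Counter`. This is only one possible variant, not necessarily the fastest: the function name `count_pairs` is made up here, and it assumes a pair should be counted at most once per line and that words are compared case-insensitively.

```python
from collections import Counter
from itertools import combinations
import re

def count_pairs(lines):
    """Count, for each unordered word pair, how many lines contain both words."""
    counts = Counter()
    for line in lines:
        # Unique lowercase words, sorted so every pair tuple is alphabetized.
        words = sorted(set(re.findall(r"\w+", line.lower())))
        counts.update(combinations(words, 2))
    return counts

sample = [
    "This is the white baseball.",
    "These guys have white baseball bats.",
]
stats = count_pairs(sample)
print(stats[("baseball", "white")])  # 2
print(stats[("is", "this")])         # 1
```

Because `count_pairs` only iterates over its argument, an open file object can be passed in directly and the file is read line by line rather than loaded at once. The dictionary itself still grows with the number of distinct pairs, which is quadratic in the number of unique words per line.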
