Question

I have a series of texts that are instances of a custom WebText class. Each text is an object with a rating (-10 to +10) and a word count (an nltk.FreqDist) attached:

>>> trainingTexts = [WebText('train1.txt'), WebText('train2.txt'), WebText('train3.txt'), WebText('train4.txt')]
>>> trainingTexts[1].rating
10
>>> trainingTexts[1].freq_dist
<FreqDist: 'the': 60, ',': 49, 'to': 38, 'is': 34, ...>

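How can I now get two lists (or dictionaries) of words that hold the total word counts over all positively rated texts and all negatively rated texts respectively, so that I end up with something like this: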

>>> only_positive_words
[('sky', 10), ('good', 9), ('great', 2), ...]
>>> only_negative_words
[('earth', 10), ('ski', 9), ('food', 2), ...]

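I considered using sets, since a set contains only unique items, but I don't see how to do that with nltk.FreqDist, and besides, a set would not be ordered by word frequency. Any ideas?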

Answer 1

Okay, let's say you start with this for testing purposes:

import nltk

# Minimal stand-in for WebText: just a rating and a frequency distribution
class Rated(object): 
  def __init__(self, rating, freq_dist): 
    self.rating = rating
    self.freq_dist = freq_dist

a = Rated(5, nltk.FreqDist('the boy sees the dog'.split()))
b = Rated(8, nltk.FreqDist('the cat sees the mouse'.split()))
c = Rated(-3, nltk.FreqDist('some boy likes nothing'.split()))

trainingTexts = [a,b,c]

Then your code would look something like this:

from collections import defaultdict
from operator import itemgetter

# dictionaries for keeping track of the counts
pos_dict = defaultdict(int)
neg_dict = defaultdict(int)

for r in trainingTexts:
  rating = r.rating
  freq = r.freq_dist

  # choose the appropriate counts dict
  if rating > 0:
    partition = pos_dict
  elif rating < 0: 
    partition = neg_dict
  else:
    continue

  # add the information to the correct counts dict
  for word, count in freq.items():
    partition[word] += count

# Turn the counts dictionaries into lists of descending-frequency words
def only_list(counts, filtered):
  # keep only the words that never occur in the other partition
  return sorted(((w, c) for w, c in counts.items() if w not in filtered),
                key=itemgetter(1),
                reverse=True)

only_positive_words = only_list(pos_dict, neg_dict)
only_negative_words = only_list(neg_dict, pos_dict)

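The result is (words with equal counts may come out in a different order depending on your Python version):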

>>> only_positive_words
[('the', 4), ('sees', 2), ('dog', 1), ('cat', 1), ('mouse', 1)]
>>> only_negative_words
[('nothing', 1), ('some', 1), ('likes', 1)]
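
For what it's worth, the same idea can probably be written more compactly with Counter methods. This is only a sketch under a few assumptions: it uses collections.Counter as a stand-in for nltk.FreqDist (in NLTK 3, FreqDist is a Counter subclass, so update() and most_common() should behave the same way), and the names split_counts and exclusive are made up for illustration.

```python
from collections import Counter


class Rated:
    """Hypothetical stand-in for WebText: a rating plus word counts."""
    def __init__(self, rating, freq_dist):
        self.rating = rating
        self.freq_dist = freq_dist


def split_counts(texts):
    """Sum word counts separately over positive and negative texts."""
    pos, neg = Counter(), Counter()
    for t in texts:
        if t.rating > 0:
            pos.update(t.freq_dist)   # update() adds counts, it does not replace them
        elif t.rating < 0:
            neg.update(t.freq_dist)
        # texts rated exactly 0 are ignored
    return pos, neg


def exclusive(counts, other):
    """Words that occur in counts but never in other, most frequent first."""
    kept = Counter({w: c for w, c in counts.items() if w not in other})
    return kept.most_common()


# Counter is used here in place of nltk.FreqDist so the sketch runs without NLTK
texts = [
    Rated(5, Counter("the boy sees the dog".split())),
    Rated(8, Counter("the cat sees the mouse".split())),
    Rated(-3, Counter("some boy likes nothing".split())),
]

pos, neg = split_counts(texts)
only_positive_words = exclusive(pos, neg)
only_negative_words = exclusive(neg, pos)
print(only_positive_words)
print(only_negative_words)
```

Note that 'boy' drops out of both lists, because it occurs in a positive and in a negative text. If you would rather keep shared words, skip the exclusive() step and call pos.most_common() and neg.most_common() directly.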
