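The difficulty with trying to classify interesting/good forum posts using a Bayesian classifier or some other automated classification system is that there may be little correlation between a post's words and/or word structure and its relative value or usefulness.
Spam filters work mostly because the word choice and structure of spam are, in general, systematically unusual: spammers are trying to promote specific products, services, and so on. Although spammers can use various techniques to make detection harder, there are still reasonable correlations and patterns to learn.
Such word/structure patterns may not exist for distinguishing good forum posts from bad ones. However, there is an approach that may be useful, which reframes the problem: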
1. Allow users to classify posts as good or bad, or otherwise rank them as you described.
2. Use a Bayesian classifier or some other statistical inference method to identify the forum users whose rankings correlate most strongly with the ranking behavior of the overall community, i.e., the users who have the best taste and are good predictors of how the community as a whole would view the content.
3. Use the post rankings from the pool of good-predictor users identified in step 2 to filter forum posts. This requires that one or more of these users actually rank new content at some point, so the pool needs to be reasonably large and include regular users for the filtering to be useful.
4. Rebuild the classifier periodically, since the community of users is presumably dynamic, has changing interests, etc.
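
The steps above can be sketched roughly as follows. This is only a minimal illustration, assuming binary ratings (1 = good, 0 = bad) and using a plain Pearson correlation between each user's ratings and the mean rating of everyone else as a simple stand-in for the Bayesian or other statistical inference. The function names and the `min_overlap`, `top_k`, and `threshold` parameters are hypothetical and would need tuning for a real forum.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series has no variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def predictor_scores(ratings, min_overlap=3):
    """ratings: {user: {post_id: score}}.

    For each user, correlate their scores with the mean score the *other*
    users gave the same posts. Users with too few shared posts are skipped.
    """
    scores = {}
    for user, own in ratings.items():
        mine, others = [], []
        for post, score in own.items():
            rest = [r[post] for u, r in ratings.items()
                    if u != user and post in r]
            if rest:
                mine.append(score)
                others.append(mean(rest))
        if len(mine) >= min_overlap:
            scores[user] = pearson(mine, others)
    return scores


def select_predictors(ratings, top_k=3, min_overlap=3):
    """Step 2: keep the top_k users most correlated with the community."""
    scores = predictor_scores(ratings, min_overlap)
    return set(sorted(scores, key=scores.get, reverse=True)[:top_k])


def filter_posts(new_ratings, predictors, threshold=0.5):
    """Step 3: new_ratings is {post_id: {user: score}}.

    Keep a post only if at least one predictor rated it and the mean
    predictor rating reaches the (assumed) threshold.
    """
    kept = []
    for post, by_user in new_ratings.items():
        votes = [s for u, s in by_user.items() if u in predictors]
        if votes and mean(votes) >= threshold:
            kept.append(post)
    return kept
```

Re-running `select_predictors` on recent rating data at some interval corresponds to the periodic rebuild in the last step. Note that a post no predictor has rated is simply left out, which is why the pool of predictors needs to include active, regular users.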
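How well the approach I'm proposing would work for your problem depends heavily on the nature of the forum, how willing users are to participate in ranking content, and how much their views on the value of posted content have in common. The overall size of the user community may also be a factor: if it is too small, there may not be enough data to work with; if it is too large, running classifier inference over the ranking data may run into computational scaling problems.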