English 中文(简体)
Naive Bayesian for Topic detection using "Bag of Words" approach
原标题:

I am trying to implement a naive bayseian approach to find the topic of a given document or stream of words. Is there are Naive Bayesian approach that i might be able to look up for this ?

Also, i am trying to improve my dictionary as i go along. Initially, i have a bunch of words that map to a topics (hard-coded). Depending on the occurrence of the words other than the ones that are already mapped. And depending on the occurrences of these words i want to add them to the mappings, hence improving and learning about new words that map to topic. And also changing the probabilities of words.

How should i go about doing this ? Is my approach the right one ?

Which programming language would be best suited for the implementation ?

最佳回答

Existing Implementations of Naive Bayes

You would probably be better off just using one of the existing packages that supports document classification using naive Bayes, e.g.:

Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.

Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here s sample code that detects whether Family Guy quotes are funny or not-funny.

Perl - Perl has the Algorithm::NaiveBayes module, complete with a sample usage snippet in the package synopsis.

C# - C# programmers can use nBayes. The project s home page has sample code for a simple spam/not-spam classifier.

Java - Java folks have Classifier4J. You can see a training and scoring code snippet here.

Bootstrapping Classification from Keywords

It sounds like you want to start with a set of keywords that are known to cue for certain topics and then use those keywords to bootstrap a classifier.

This is a reasonably clever idea. Take a look at the paper Text Classication by Bootstrapping with Keywords, EM and Shrinkage by McCallum and Nigam (1999). By following this approach, they were able to improve classification accuracy from the 45% they got by using hard-coded keywords alone to 66% using a bootstrapped Naive Bayes classifier. For their data, the latter is close to human levels of agreement, as people agreed with each other about document labels 72% of the time.

问题回答

暂无回答




相关问题
Java Stanford NLP: Part of Speech labels?

The Stanford NLP, demo d here, gives an output like this: Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./. What do the Part of Speech tags mean? I am unable to find an official list. Is it ...

Java Stanford NLP: Find word frequency?

I m using the Stanford NLP Parsing toolkit. Given a word in the lexicon, how can I find its frequency*? Or, given a frequency rank, how can I determine the corresponding word? *in the entire language,...

c/c++ NLP library [closed]

I am looking for an open source Natural Language Processing library for c/c++ and especially i am interested in Part of speech tagging.

Clustering text in Python [closed]

I need to cluster some text documents and have been researching various options. It looks like LingPipe can cluster plain text without prior conversion (to vector space etc), but it s the only tool I ...

Natural language rendering

Do you know any frameworks that implement natural language rendering concept ? I ve found several NLP oriented frameworks like Anthelope or Open NLP but they have only parsers but not renderers or ...

热门标签