I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmatically categorize them.
I've been exploring NLTK and its Naive Bayes Classifier. It seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).
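For reference, here is roughly how I'm feeding documents to NLTK: each training example is a `(feature_dict, label)` pair, with a simple bag-of-words feature extractor (the tiny training set below is made-up illustrative data, not my real categories):

```python
from nltk.classify import NaiveBayesClassifier

def word_features(text):
    # Boolean "contains(word)" features on whitespace-split tokens --
    # the common NLTK idiom (no stemming or stopword removal here).
    return {f"contains({w})": True for w in text.lower().split()}

# Hypothetical mini corpus standing in for my real documents/categories.
train = [
    (word_features("interest rates and inflation"), "economy"),
    (word_features("stock market earnings report"), "economy"),
    (word_features("football match final score"), "sports"),
    (word_features("tennis open championship"), "sports"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(word_features("inflation and the stock market")))
```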
My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categories/300k documents at once (training on 5 categories used 8GB). Furthermore, the accuracy of the classifier seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).
Should I just train a classifier on 5 categories at a time, and run all 150k documents through each classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives: documents that don't really match any of the categories would get shoehorned into one by the classifier just because it's the best match available. Is there a way to have a "none of the above" option for the classifier, in case a document doesn't fit into any of the categories?
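The kind of thing I was imagining for "none of the above" is rejecting low-confidence predictions via `prob_classify`, something like the sketch below (the threshold of 0.8 and the two-document training set are made up for illustration; I don't know if this is the right approach):

```python
from nltk.classify import NaiveBayesClassifier

def word_features(text):
    return {f"contains({w})": True for w in text.lower().split()}

# Hypothetical training data for illustration.
train = [
    (word_features("interest rates and inflation"), "economy"),
    (word_features("football match final score"), "sports"),
]
classifier = NaiveBayesClassifier.train(train)

def classify_with_reject(text, threshold=0.8):
    # prob_classify returns a probability distribution over labels;
    # if the best label's posterior is below the threshold, give up.
    dist = classifier.prob_classify(word_features(text))
    best = dist.max()
    if dist.prob(best) < threshold:
        return None  # "none of the above"
    return best
```

A document whose words the classifier has never seen ends up with a posterior equal to the label priors, so it falls below the threshold and gets rejected rather than shoehorned into the nearest category.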
Here is my test class: http://gist.github.com/451880