Orange vs NLTK for Content Classification in Python [closed]
Closed. This question is opinion-based. It is not currently accepting answers.

Closed 9 years ago.

We need a content classification module. A Bayesian classifier seems to be what I am looking for. Should we go for Orange or NLTK?

Best answer

Well, as evidenced by the documentation, the Naive Bayes implementation in each library is easy to use, so why not run your data through both and compare the results?

Orange and NLTK are both mature, stable libraries (10+ years in development each) that originated in large universities; they share some common features, primarily machine learning algorithms. Beyond that, they are quite different in scope, purpose, and implementation.

Orange is domain agnostic: it is not directed towards a particular academic discipline or commercial domain; instead, it advertises itself as a full-stack data mining and ML platform. Its focus is on the tools themselves, not on the application of those tools in a particular discipline.

Its features include IO, data analysis algorithms, and a data visualization canvas.

NLTK, on the other hand, began as and remains an academic project in the computational linguistics department of a large university. The task you mentioned (document content classification) and your algorithm of choice (Naive Bayes) are right at the core of NLTK's functionality. NLTK does include ML and data mining algorithms, but only because they are useful in computational linguistics; they sit alongside the document parsers, tokenizers, part-of-speech analyzers, and other tools that make up NLTK.

Even if the Naive Bayes implementation in Orange is just as good, I would still choose NLTK's implementation because it is clearly optimized for the particular task you mentioned.

There are numerous tutorials on NLTK, and in particular on using its Naive Bayes classifier for content classification. A blog post by Jim Plus and another on streamhacker.com, for instance, present excellent tutorials on NLTK's Naive Bayes; the second includes a line-by-line discussion of the code required to use this module. The authors of both posts report good results using NLTK (92% in the former, 73% in the latter).
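For a feel of what such a tutorial boils down to, here is a minimal sketch of training NLTK's Naive Bayes classifier. The six-sentence corpus, the labels, and the bag_of_words helper are made up for illustration and are not taken from either blog post; a real project would train on far more data and hold out a test set.

```python
import nltk

def bag_of_words(text):
    """Hypothetical helper: map each lowercase word to True.
    NLTK classifiers take featuresets as plain dicts."""
    return {word: True for word in text.lower().split()}

# Tiny invented corpus, for illustration only.
train_docs = [
    ("the team won the match", "sports"),
    ("a late goal decided the game", "sports"),
    ("the striker scored twice", "sports"),
    ("the bank raised interest rates", "finance"),
    ("shares fell as the market closed", "finance"),
    ("investors sold bonds and shares", "finance"),
]

# Each training item is a (featureset, label) pair.
train_set = [(bag_of_words(text), label) for text, label in train_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify an unseen document using the same feature extractor.
predicted = classifier.classify(bag_of_words("the market and interest rates"))
print(predicted)
```

On a real dataset, nltk.classify.accuracy(classifier, test_set) gives the kind of accuracy figure quoted above, and classifier.show_most_informative_features() helps to sanity-check what the model has learned.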

Other answers

I don't know Orange, but +1 for NLTK:

I've successfully used the classification tools in NLTK to classify text and related metadata. A Bayesian classifier is the default, but there are alternatives such as Maximum Entropy. Also, being a toolkit, it lets you customize as you see fit, e.g. by creating your own features (which is what I did for the metadata).
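As a rough sketch of what "creating your own features" can look like: NLTK classifiers accept each document as a dict of feature names to values, so metadata can simply be merged into the same dict as the word features. The function name, the feature naming scheme, and the metadata keys below are hypothetical, not the ones I actually used.

```python
def document_features(text, metadata):
    """Build one featureset (a plain dict) from a document's words and its metadata."""
    features = {}
    # Word-presence features from the text itself.
    for word in set(text.lower().split()):
        features["contains({})".format(word)] = True
    # Metadata features, prefixed so they cannot collide with word features.
    for key, value in metadata.items():
        features["meta_{}".format(key)] = value
    return features

example = document_features(
    "Orange juice prices rise",
    {"source": "newswire", "has_attachment": False},  # hypothetical metadata
)
print(sorted(example))
```

A list of (document_features(text, metadata), label) pairs can then be passed to nltk.NaiveBayesClassifier.train or nltk.MaxentClassifier.train, which makes it easy to swap classifiers without touching the feature code.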

NLTK also has a couple of good books, one of which is available under Creative Commons (as well as from O'Reilly).

NLTK is a toolkit that supports a four-stage model of natural language processing:

  1. Tokenizing. This is grouping characters into words. It ranges from trivial regex stuff to dealing with contractions like "can't".
  2. Tagging. This is applying part-of-speech tags to the tokens (e.g. "NN" for noun, "VBG" for verb gerund). This is typically done by training a model (e.g. a Hidden Markov Model) on a training corpus (i.e. a large list of hand-tagged sentences).
  3. Chunking/Parsing. This is taking each tagged sentence and extracting features into a tree (e.g. noun phrases). This can be done according to a hand-written grammar or one trained on a corpus.
  4. Information extraction. This is traversing the tree and extracting the data. This is where your specific orange=fruit would be done.
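The four stages can be sketched end to end on a single sentence. To keep the example free of corpus downloads, it swaps a Hidden Markov tagger for a unigram tagger trained on two invented hand-tagged sentences, and uses a one-rule hand-written chunk grammar; the sentences, tags, and grammar are all assumptions for illustration.

```python
import nltk
from nltk.tokenize import TreebankWordTokenizer

sentence = "I can't eat the sweet orange"

# 1. Tokenizing: the Treebank tokenizer splits the contraction "can't".
tokens = TreebankWordTokenizer().tokenize(sentence)

# 2. Tagging: train a unigram tagger on a tiny hand-tagged corpus
#    (invented here); words it has not seen fall back to "NN".
tagged_corpus = [
    [("I", "PRP"), ("ca", "MD"), ("n't", "RB"), ("eat", "VB"),
     ("a", "DT"), ("ripe", "JJ"), ("apple", "NN")],
    [("she", "PRP"), ("ate", "VBD"), ("the", "DT"),
     ("sweet", "JJ"), ("orange", "NN")],
]
tagger = nltk.UnigramTagger(tagged_corpus, backoff=nltk.DefaultTagger("NN"))
tagged = tagger.tag(tokens)

# 3. Chunking: a hand-written grammar for simple noun phrases
#    (optional determiner, any adjectives, then a noun).
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(tagged)

# 4. Information extraction: walk the tree and collect the noun phrases.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(tagged)
print(noun_phrases)
```

In practice the tagger would be trained on a real tagged corpus, or replaced by one of NLTK's pretrained taggers, but the hand-off between stages stays the same: tokens, then (word, tag) pairs, then a tree.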

NLTK supports WordNet, a huge semantic dictionary that classifies words. There are 5 noun definitions for orange (fruit, tree, pigment, color, river in South Africa). Each of these has one or more hypernym paths, which are hierarchies of classifications. E.g. the first sense of orange has two paths:

  • orange/citrus/edible_fruit/fruit/reproductive_structure/plant_organ/plant_part/natural_object/whole/object/physical_entity/entity

and

  • orange/citrus/edible_fruit/produce/food/solid/matter/physical_entity/entity

Depending on your application domain, you can identify orange as a fruit, a food, or a plant part. Then you can use the chunked tree structure to determine more (who did what to the fruit, etc.).
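The lookup itself is simple once the paths are in hand. The sketch below works on the two paths quoted above as plain strings, so it runs without the WordNet corpus; the categorize helper and the category list are hypothetical. With NLTK and the WordNet corpus installed, the paths would instead come from a synset's hypernym_paths() method.

```python
# Hypernym paths for the first noun sense of "orange", copied from the text above.
ORANGE_PATHS = [
    "orange/citrus/edible_fruit/fruit/reproductive_structure/plant_organ/"
    "plant_part/natural_object/whole/object/physical_entity/entity",
    "orange/citrus/edible_fruit/produce/food/solid/matter/physical_entity/entity",
]

def categorize(paths, categories):
    """Return the categories that appear as a node on any hypernym path,
    in the order the categories were given."""
    nodes = set()
    for path in paths:
        nodes.update(path.split("/"))
    return [category for category in categories if category in nodes]

# The application domain decides which categories are worth checking.
matches = categorize(ORANGE_PATHS, ["fruit", "food", "plant_part", "animal"])
print(matches)
```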




