English 中文(简体)
Automatically pick tags from context using Python
原标题:
  • 时间:2009-11-21 14:38:13
  •  标签:
  • python
  • tags

How can I pick tags from an article or a user s post using Python?

Is the following method ok?

  1. Build a list of word frequency from the text and sort them.

  2. Remove some common words and pick the top 10 words remained in the list as the tags.

If the above method is ok, what library can detect if which words are common, like "the, if, you, etc" and which are descriptive words?

最佳回答

Here s an article on removing stop words. The link to the stop word list in the article is broken but here s another one.

问题回答

The Natural Language Toolkit offers a broad variety of methods for this kind of stuff. I can t give you hands-on advice as I m not familiar with this subject, but I think it s worth the effort to read a few articles about this topic first before you start: just picking words from the text directly won t get you very far I think, you should probably try to find similar words to the ones for that tags already exist. And of course you need to filter out the common words of the language like "the" and stuff. Again, this Python library can help you with this, at least for a few common languages.

I d suggest you download the Stack Overflow data dump. There you get a lot of real world posts, with appropriate tags, to test different algorithms of tag selection.

But generally I doubt it will work too well. For your own question "words" is the clear winner in word count, followed by a list of words with two appearances each, like "common", "list", "method", "pick" and "tags". Which of those would you automatically choose as tags? Also the tags you chose manually contain "python" and "context", none of which shows up with high word frequency.

Train Bayes or Fischer filter with already tagged data (e.g. with Stackoverflow data dump suggested by sth) and use it to classify new posts. I d recommend reading excellent Programming Collective Intelligence book by Toby Segaran for more information and python examples on this topic.

Instead of blacklisting words that shouldn t be tags, why don t you instead build a whitelist of words that would make for good tags?

Start with an handful of tags that you would like to have, like Python, off-topic, football, rickroll or whatnot (depends on the kind of site you are building!) and have the system only suggest between those, then let users handpick appropriate tags and also let them type in their own tags.

When enough users suggest a tag, it gets into the pool of "known good" tags for auto suggestion -- maybe after some sort of moderation, so that you can still blacklist stupid tags like the, lolol, or typoed tags like objectoriented when you have object-oriented.

Only show few suggestions. Offer autocompletion. Limit the number of tags per item. If this will be about coding, maybe some sort of language detection system (the file linux command is not too shabby on this) will help your suggestion system.





相关问题
Can Django models use MySQL functions?

Is there a way to force Django models to pass a field to a MySQL function every time the model data is read or loaded? To clarify what I mean in SQL, I want the Django model to produce something like ...

An enterprise scheduler for python (like quartz)

I am looking for an enterprise tasks scheduler for python, like quartz is for Java. Requirements: Persistent: if the process restarts or the machine restarts, then all the jobs must stay there and ...

How to remove unique, then duplicate dictionaries in a list?

Given the following list that contains some duplicate and some unique dictionaries, what is the best method to remove unique dictionaries first, then reduce the duplicate dictionaries to single ...

What is suggested seed value to use with random.seed()?

Simple enough question: I m using python random module to generate random integers. I want to know what is the suggested value to use with the random.seed() function? Currently I am letting this ...

How can I make the PyDev editor selectively ignore errors?

I m using PyDev under Eclipse to write some Jython code. I ve got numerous instances where I need to do something like this: import com.work.project.component.client.Interface.ISubInterface as ...

How do I profile `paster serve` s startup time?

Python s paster serve app.ini is taking longer than I would like to be ready for the first request. I know how to profile requests with middleware, but how do I profile the initialization time? I ...

Pragmatically adding give-aways/freebies to an online store

Our business currently has an online store and recently we ve been offering free specials to our customers. Right now, we simply display the special and give the buyer a notice stating we will add the ...

Converting Dictionary to List? [duplicate]

I m trying to convert a Python dictionary into a Python list, in order to perform some calculations. #My dictionary dict = {} dict[ Capital ]="London" dict[ Food ]="Fish&Chips" dict[ 2012 ]="...

热门标签