Question

Closed 10 years ago.

I'm working on a project that consists of a website that connects to the NCBI (National Center for Biotechnology Information) and searches for articles there. The thing is that I have to do some text mining on all the results. I'm using Java for the text mining and AJAX with ICEfaces for the development of the website.

What I have: a list of articles returned from a search. Each article has an ID and an abstract. The idea is to get keywords from each abstract text, then compare the keywords from all abstracts and find the ones that are repeated most often, so that the website can show the words related to the search.

Any ideas? I have searched the web a lot, and I know there is Named Entity Recognition, Part-of-Speech tagging, and the GENIA thesaurus for NER on genes and proteins. I have already tried stemming, stop word lists, etc. I just need to know the best approach to solve this problem. Thanks a lot.

Answer 1

I would recommend using a combination of POS tagging and string tokenizing to extract all the nouns from each abstract. Then use some sort of dictionary/hash to count the frequency of each of these nouns and output the N most prolific ones. Combining that with some other intelligent filtering mechanisms should do reasonably well at giving you the important keywords from the abstracts.

For POS tagging, check out the POS tagger at http://nlp.stanford.edu/software/index.shtml
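
The counting step could be sketched roughly as follows. This is only an illustration, not the tagger's own API: it assumes the abstracts have already been tagged and written out as whitespace-separated `word_TAG` tokens with Penn Treebank tags, where noun tags start with `NN`. The class and method names (`NounCounter`, `countNouns`, `topN`) are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NounCounter {
    // Counts lowercased nouns across all tagged abstracts.
    // Assumption: each abstract is a string of word_TAG tokens separated by whitespace,
    // and noun tags are the Penn Treebank ones (NN, NNS, NNP, NNPS).
    static Map<String, Integer> countNouns(List<String> taggedAbstracts) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tagged : taggedAbstracts) {
            for (String token : tagged.trim().split("\\s+")) {
                int sep = token.lastIndexOf('_');
                if (sep <= 0) {
                    continue; // no tag (or no word) on this token, skip it
                }
                String word = token.substring(0, sep).toLowerCase();
                String tag = token.substring(sep + 1);
                if (tag.startsWith("NN")) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns up to n nouns, most frequent first; ties are broken alphabetically.
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```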

However, if you expect to have a lot of multi-word terms in your documents, you could instead take the most frequent n-grams, for n = 2 to 4.
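
A minimal sketch of the n-gram variant, under some simplifying assumptions: tokenization is a naive lowercase split on anything that is not a letter or digit (so hyphenated gene or protein names get broken apart), and no stop word filtering is applied. The names `NGramCounter` and `countNGrams` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NGramCounter {
    // Counts every word n-gram with minN <= n <= maxN, summed over all texts.
    // N-grams never span two different texts.
    static Map<String, Integer> countNGrams(List<String> texts, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : texts) {
            // Naive tokenization (an assumption of this sketch, not a recommendation).
            List<String> words = new ArrayList<>();
            for (String w : text.toLowerCase().split("[^a-z0-9]+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
            for (int n = minN; n <= maxN; n++) {
                for (int i = 0; i + n <= words.size(); i++) {
                    String gram = String.join(" ", words.subList(i, i + n));
                    counts.merge(gram, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Calling `countNGrams(abstracts, 2, 4)` gives a map from each 2- to 4-word phrase to its count, which can then be sorted by value to pick the top entries. In practice you would probably also want to drop n-grams that start or end with a stop word.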

Answer 2

There's an Apache project for that too. I haven't used it, but OpenNLP is an open source Apache project. It's in the incubator, so it may be a bit raw.

This post from jeff's searchstart cafe has a number of other suggestions as well.

Answer 3

This might be relevant as well: https://github.com/jdf/cue.language

It has stop words, and word and n-gram frequencies.

It comes from Wordle.

Answer 4

I ended up using