Find subject in incomplete sentence with NLTK

I have a list of products that I am trying to classify into categories. They are described with incomplete sentences like:

“Solid State Drive Housing”

“Hard Drive Cable”

“1TB Hard Drive”

“500GB Hard Drive, Refurbished from Manufacturer”

How can I use Python and NLP to get an output like "Housing, Cable, Drive, Drive", or a tree that describes which word is modifying which? Thank you in advance.

Best answer

NLP techniques are relatively ill-equipped to deal with this kind of text.

Phrased differently: it is quite possible to build a solution which includes NLP processes to implement the desired classifier, but the added complexity doesn't necessarily pay off in terms of speed of development or classifier precision.
If one really insists on using NLP techniques, POS-tagging and its ability to identify nouns is the most obvious idea, but chunking and access to WordNet or other lexical sources are other plausible uses of NLTK.

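Instead, an ad-hoc solution based on simple regular expressions and a few heuristics, such as suggested by NoBug, is probably an appropriate approach to the problem. Certainly, such solutions bear two main risks: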

  • over-fitting to the portion of the text reviewed/considered in building the rules
  • possible messiness/complexity of the solution if too many rules and sub-rules are introduced.

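Running some plain statistical analysis on the complete text to be considered (or on a very big sample of it) should help guide the selection of a few heuristics and also ease the over-fitting concern. I am quite sure that a relatively small number of rules, combined with a custom dictionary, should be sufficient to produce a classifier with appropriate precision as well as speed/resource performance.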
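As a rough illustration of that kind of analysis, here is a minimal sketch using only the Python standard library. The sample list reuses the four descriptions from the question, and the helper names (`ngrams`, `count_ngrams`) are made up for this example; a real run would read a much larger portion of the product list.

```python
from collections import Counter

# Hypothetical sample: the four descriptions from the question.
# In practice, load a sizable portion of the real product list instead.
descriptions = [
    "Solid State Drive Housing",
    "Hard Drive Cable",
    "1TB Hard Drive",
    "500GB Hard Drive, Refurbished from Manufacturer",
]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(texts, max_n=3):
    """Count lower-cased word n-grams (n = 1..max_n) across all texts."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for text in texts:
        # Simplification: commas are dropped, so n-grams may span a comma.
        tokens = text.lower().replace(",", " ").split()
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts


counts = count_ngrams(descriptions)
print(counts[1].most_common(5))  # most frequent single words
print(counts[2].most_common(5))  # most frequent bi-grams
```

The most frequent words and bi-grams are the natural candidates for dictionary entries and for the strictest rules.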

A few ideas:

  • count all the words (and possibly all the bi-grams and tri-grams) in a sizable portion of the corpus at hand. This information can drive the design of the classifier by letting you allocate the most effort and the strictest rules to the most common patterns.
  • manually introduce a short dictionary which associates the most popular words with:
    • their POS function (mostly a binary matter here, i.e. nouns vs. modifiers and other non-nouns)
    • their synonym root [if applicable]
    • their class [if applicable]
  • if the pattern holds for most of the input text, consider using the last word before the end of the text or before the first comma as the main key to class selection. If the pattern doesn't hold, just give more weight to the first and to the last word.
  • consider a first pass where the text is re-written with the most common bi-grams replaced by a single word (even an artificial code word) which would be in the dictionary
  • consider also replacing the most common typos or synonyms with their corresponding synonym root. Adding regularity to the input helps improve precision and also helps a few rules or a few dictionary entries yield a big return on precision.
  • for words not found in dictionary, assume that words which are mixed with numbers and/or preceded by numbers are modifiers, not nouns. Assume that the
  • consider a two-tier classification whereby inputs which cannot be plausibly assigned a class are put in the "manual pile" to prompt additional review, which results in the addition of rules and/or dictionary entries. After a few iterations the classifier should require fewer and fewer improvements and tweaks.
  • look for non-obvious features. For example, some corpora are made from a mix of sources, and some of the sources may include particular regularities which help identify the source and/or serve as classification hints. For example, some sources may contain only, say, uppercase text (or text typically longer than 50 characters, or truncated words at the end, etc.).
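Several of these ideas can be combined in a few lines of plain Python. The following is only a sketch under stated assumptions: the `LEXICON` and `BIGRAMS` contents, the class names, and the function names are all invented for illustration, and the "last known noun before the first comma" rule is just one of the heuristics listed above, not a tested classifier.

```python
import re

# Hypothetical hand-built dictionary: word -> (is_noun, synonym root, class).
# Every entry is an illustrative assumption, not data from the question.
LEXICON = {
    "housing": (True, "housing", "enclosure"),
    "enclosure": (True, "housing", "enclosure"),
    "cable": (True, "cable", "cable"),
    "hard_drive": (True, "drive", "storage"),
    "drive": (True, "drive", "storage"),
    "hard": (False, None, None),
}

# Common bi-grams rewritten as a single artificial code word before lookup.
BIGRAMS = {"hard drive": "hard_drive", "solid state": "solid_state"}


def normalize(text):
    """Lower-case, keep the part before the first comma, merge known bi-grams."""
    head = text.lower().split(",", 1)[0]
    for phrase, code in BIGRAMS.items():
        head = re.sub(r"\b" + re.escape(phrase) + r"\b", code, head)
    return head.split()


def classify(text):
    """Return (synonym root, class), or None to send the input to the manual pile."""
    for token in reversed(normalize(text)):
        if re.search(r"\d", token):
            continue  # words mixed with numbers are assumed to be modifiers
        entry = LEXICON.get(token)
        if entry is None:
            return None  # unknown candidate head word: needs manual review
        is_noun, root, cls = entry
        if is_noun:
            return root, cls
    return None


def sort_inputs(texts):
    """Two-tier pass: classified items on one side, manual pile on the other."""
    classified, manual_pile = {}, []
    for text in texts:
        result = classify(text)
        if result is None:
            manual_pile.append(text)
        else:
            classified[text] = result
    return classified, manual_pile


classified, manual_pile = sort_inputs(
    ["Hard Drive Cable", "1TB Hard Drive", "USB Widget"]
)
print(classified)
print(manual_pile)
```

Reviewing the manual pile and adding the missing words to the dictionary is what makes each iteration cheaper than the previous one.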

I'm afraid this answer falls short of providing Python/NLTK snippets as a primer towards a solution, but frankly such simple NLTK-based approaches are likely to be disappointing at best. Also, we would need a much bigger sample of the input text to guide the selection of plausible approaches, including ones that are based on NLTK or NLP techniques at large.

Other answers

Install spaCy

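python -m spacy download en

import spacy

# Note: spaCy v3 and later no longer accept the 'en' shortcut;
# download and load the full model name (e.g. 'en_core_web_sm') instead.
nlp = spacy.load('en')
sent = "INCOMPLETE SENTENCE HERE"
doc = nlp(sent)
# Keep the token(s) that the dependency parser marks as the root of the phrase
sub_toks = [tok for tok in doc if tok.dep_ == "ROOT"]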

Examples:

sent = "Solid State Drive Housing"
doc=nlp(sent)
sub_toks = [tok for tok in doc if (tok.dep_ == "ROOT") ]

Output:

sent = "Hard Drive Cable"
doc=nlp(sent)
sub_toks = [tok for tok in doc if (tok.dep_ == "ROOT") ]

Output: [Cable]

sent = "1TB Hard Drive"
doc=nlp(sent)
sub_toks = [tok for tok in doc if (tok.dep_ == "ROOT") ]

Output: [Drive]

sent = "500GB Hard Drive, Refurbished from Manufacturer"
doc=nlp(sent)
sub_toks = [tok for tok in doc if (tok.dep_ == "ROOT") ]

Output: [Drive]

Change the text to your incomplete sentence.

import spacy
import en_core_web_sm

# Load the small English model by name
nlp = spacy.load('en_core_web_sm')
sentence = "I need to be able to log into the Equitable siteI tried my username and password from the AXA Equitable site which worked fine yesterday but it won't allow me to log in and when"
nlp_doc = nlp(sentence)
# Collect the tokens the dependency parser labels as nominal subjects
subject = [tok for tok in nlp_doc if tok.dep_ == "nsubj"]
print(subject)



