Question

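I am using NLTK to parse out persons, organizations, and their relationships. Following an example, I was able to create chunks of persons and organizations; however, I get an error in nltk.sem: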

AttributeError: 'Tree' object has no attribute 'text'

Here is the complete code:

import nltk
import re
#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences)

# tried plain ne_chunk instead of batch_ne_chunk as given in the book
#chunked_sentences = [nltk.ne_chunk(sentence) for sentence in tagged_sentences]

# pattern to find <person> served as <title> in <org>
IN = re.compile(r'.+\s+as\s+')
for doc in chunked_sentences:
    for rel in nltk.sem.extract_rels('ORG', 'PERSON', doc, corpus='ieer', pattern=IN):
        print nltk.sem.show_raw_rtuple(rel)

This example is very similar to the one given in the book.

My ultimate goal is to extract persons, organizations, titles (dates) for some companies; then create network maps of persons and organizations.

Answer 1

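It looks like, to be a "Parsed Doc", an object needs a headline member and a text member, both of which are lists of tokens, where some of the tokens are marked up as trees. For example, this (hacky) example works: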

import nltk
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

class doc():
    pass

doc.headline = ['foo']
doc.text = [nltk.Tree('ORGANIZATION', ['WHYY']), 'in', nltk.Tree('LOCATION', ['Philadelphia']), '.', 'Ms.', nltk.Tree('PERSON', ['Gross']), ',']

for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
    print nltk.sem.relextract.show_raw_rtuple(rel)

Output:

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']

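Obviously you can't actually code it like this, but it provides a working example of the data format that extract_rels expects. You just need to work out the preprocessing steps that get your data into that format.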
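
One way to approach that preprocessing, as a rough sketch: collect the children of each chunked sentence into a single token list and hang it on an object with `text` and `headline` attributes. `ParsedDocLike` and `flatten_chunked_sentences` are hypothetical names made up for this illustration, not part of NLTK. The sketch assumes that iterating over a chunk tree yields its children (plain tokens and NE subtrees), which should hold for `nltk.Tree` since it is a list subclass.

```python
class ParsedDocLike(object):
    """Hypothetical stand-in with the two attributes the 'ieer' code path reads."""

    def __init__(self, text, headline=None):
        self.text = list(text)
        # Assumption: an empty headline is acceptable and simply adds no entities.
        self.headline = list(headline) if headline is not None else []


def flatten_chunked_sentences(chunked_sentences):
    """Concatenate the children of every chunked sentence into one flat list."""
    tokens = []
    for sentence in chunked_sentences:
        tokens.extend(sentence)
    return tokens


# Plain lists stand in for chunk trees so the sketch runs without NLTK data.
demo_sentences = [[("Bill", "NNP"), ("founded", "VBD")], [("Microsoft", "NNP")]]
demo_doc = ParsedDocLike(flatten_chunked_sentences(demo_sentences))
print(len(demo_doc.text), demo_doc.headline)
```

With real data, something like `ParsedDocLike(flatten_chunked_sentences(chunked_sentences))` could then be passed as `doc` with `corpus='ieer'`. The tokens would be (word, tag) tuples rather than plain strings, so the fillers would probably come out as word/TAG text and the pattern may need adjusting.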

Answer 2

The source code of the nltk.sem.extract_rels function:

def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10):
    """
    Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.

    The parameters ``subjclass`` and ``objclass`` can be used to restrict the
    Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',
    'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE').

    :param subjclass: the class of the subject Named Entity.
    :type subjclass: str
    :param objclass: the class of the object Named Entity.
    :type objclass: str
    :param doc: input document
    :type doc: ieer document or a list of chunk trees
    :param corpus: name of the corpus to take as input; possible values are
        'ieer' and 'conll2002'
    :type corpus: str
    :param pattern: a regular expression for filtering the fillers of
        retrieved triples.
    :type pattern: SRE_Pattern
    :param window: filters out fillers which exceed this threshold
    :type window: int
    :return: see ``mk_reldicts``
    :rtype: list(defaultdict)
    """
    ....

So if you pass the corpus parameter as 'ieer', the nltk.sem.extract_rels function expects the doc parameter to be an IEERDocument object. You should pass corpus as 'ace', or just not pass it (the default is 'ace'). In that case it expects a list of chunk trees (that's what you wanted). I modified the code as below:

import nltk
import re
from nltk.sem import extract_rels,rtuple

#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read().decode('utf-8')

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]

# here I changed the regex, and below I swapped the subject and object classes
OF = re.compile(r'.*of.*')

for i, sent in enumerate(tagged_sentences):
    sent = nltk.ne_chunk(sent)  # ne_chunk expects one tagged sentence
    rels = extract_rels('PER', 'ORG', sent, corpus='ace', pattern=OF, window=7)  # extract_rels expects one chunked sentence
    for rel in rels:
        print('{0:<5}{1}'.format(i, rtuple(rel)))

It gives the result:

[PER: u'Chairman/NNP'] u'and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT' [ORG: u'Company/NNP']
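
The result above also hints at why a pattern written for plain text can silently match nothing: with (word, tag) tokens, the filler between two entities is rendered as word/TAG text. The sketch below mimics the filler check in plain Python. `filler_passes` is a hypothetical helper, the second filler string is made up for illustration, and the sketch assumes (from a reading of the NLTK source, not verified here) that `extract_rels` applies `pattern.match` to the filler and compares its whitespace-separated token count against `window`.

```python
import re


def filler_passes(filler, pattern, window=10):
    """Hypothetical mimic of the filler test: token count plus regex match."""
    return len(filler.split()) <= window and pattern.match(filler) is not None


OF = re.compile(r'.*of.*')               # pattern from the modified code
AS_PLAIN = re.compile(r'.+\s+as\s+')     # pattern from the question
AS_TAGGED = re.compile(r'.*\bas/IN\b')   # assumed tag-aware variant

filler = 'and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT'
made_up = 'served/VBD as/IN chairman/NN of/IN'  # illustrative filler, not from the data

print(filler_passes(filler, OF, window=7))   # 6 tokens, contains "of"
print(filler_passes(made_up, AS_PLAIN))      # "as" is followed by "/IN", not a space
print(filler_passes(made_up, AS_TAGGED))     # the tag is part of the pattern
```

If that assumption holds, a "served as" pattern for the 'ace' code path would need to account for the POS tags attached to each word.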

Answer 3

This is an NLTK version problem. Your code should work in NLTK 2.x, but for NLTK 3 you should write it like this:

import re
import nltk

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))
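
If the same script has to run against both versions, one option is to look the formatter up by name instead of hard-coding it. This is only a sketch: `first_available` is a hypothetical helper, and the only names taken from this thread are `show_raw_rtuple` (used in the question) and `rtuple` (used here). A `SimpleNamespace` stands in for the module so the snippet runs without NLTK installed.

```python
from types import SimpleNamespace


def first_available(namespace, *names):
    """Return the first attribute of `namespace` that exists among `names`."""
    for name in names:
        candidate = getattr(namespace, name, None)
        if candidate is not None:
            return candidate
    raise AttributeError('none of %s found' % ', '.join(names))


# Stand-in for a module that only provides the older function name.
fake_old_module = SimpleNamespace(show_raw_rtuple=lambda rel: 'old-style: %r' % (rel,))

format_rel = first_available(fake_old_module, 'rtuple', 'show_raw_rtuple')
print(format_rel({'subjtext': 'WHYY'}))
```

With NLTK itself, the call would presumably be `first_available(nltk.sem.relextract, 'rtuple', 'show_raw_rtuple')`.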
