Question

To find names in a large text, I have the following regex:

([A-Z][a-z]*)[\s-]([A-Z][a-z]*)

For ordinary names such as "Jack Oneill" or "John Guidetti" this works fine. But I would like to match a few more possibilities and can't work out how, such as:

Chandler Murial Bing
Gandalf the Gray
Pieter van den Woude

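With my limited knowledge of regular expressions I can't seem to get this right. Can anyone help me (and please point me to a good website/manual on the subject)?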

Answer 1

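The best way to approach a regular expression problem is to describe the match you expect (often called a grammar).

For example, going by your question, I might say: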

A capitalized word is defined as one capital letter and 1+ letters/dashes or one capital letter and a . (an initial).

An uncapitalized word is defined as 1 letter and 1+ letters/dashes (not perfect, because that could allow ending in a dash).

First word starts with a capital letter

Last word starts with a capital letter

0+ capitalized words between first and last word

Then 0-2 uncapitalized words between first capitalized words and last word

At least two words.

Words are broken by whitespace

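If this gives a reasonably close approximation of the results you expect (and to be clear, names vary so much that you will always get either false positives or false negatives), then you can start building the following expressions:

Capitalized word: [A-Z]([a-z]+|\.)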

Uncapitalized word: [a-z][a-z-]+

Result:

[A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*(?:\s+[a-z][a-z-]+){0,2}\s+[A-Z]([a-z]+|\.)

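Sample text:

Hello my name is Chandler Muriel Bing. I have a friend who is named Pieter van den Woude and he has another friend, A. A. Milne. Gandalf the Gray joins us. Together, we make up the Friends Cast and Crew.

Matches: Chandler Muriel Bing; Pieter van den Woude; A. A. Milne; Gandalf the Gray; Friends Cast and Crew.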
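As a quick sanity check, the expression can be assembled from the two building blocks and run over the sample text. This is only a minimal Python sketch and not part of the original answer: the names CAP, LOW, NAME and find_names are made up for illustration, and the groups are written as non-capturing so that the whole match is easy to read back.

```python
import re

# Building blocks from the grammar above (names are illustrative only).
CAP = r"[A-Z](?:[a-z]+|\.)"   # capitalized word, or an initial such as "A."
LOW = r"[a-z][a-z-]+"         # uncapitalized word, e.g. "van" or "the"

# First word, 0+ capitalized words, 0-2 uncapitalized words, last word.
NAME = re.compile(rf"{CAP}(?:\s+{CAP})*(?:\s+{LOW}){{0,2}}\s+{CAP}")

def find_names(text):
    """Return every substring of text that fits the name grammar."""
    return [m.group(0) for m in NAME.finditer(text)]

sample = ("Hello my name is Chandler Muriel Bing. I have a friend who is "
          "named Pieter van den Woude and he has another friend, A. A. Milne. "
          "Gandalf the Gray joins us. Together, we make up the Friends Cast "
          "and Crew.")

for name in find_names(sample):
    print(name)
```

Using finditer with group(0) sidesteps the fact that findall would return only the captured groups if the original capturing parentheses were kept.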

Problems:

Because you want to match Gandalf the Gray and Pieter van den Woude you will inevitably match other sets that consist of names with uncapitalized words in between (Friends Cast and Crew). The above grammar attempts to limit the problem by limiting it to 2 uncapitalized words. You could also create a set of allowed uncapitalized words instead ("van", "der", "the"), and only match those words.

Doesn't allow for non-Latin-alphabet letters, ligatures, diacritics, etc.
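The allow-list idea from the first problem could look roughly like the Python sketch below. The names PARTICLES, NAME_STRICT and find_names_strict are hypothetical, and the list itself is an assumption: "van", "der" and "the" come from the text above, while "den" and "de" are added here only so that the sample names still match.

```python
import re

CAP = r"[A-Z](?:[a-z]+|\.)"   # capitalized word, or an initial such as "A."

# Assumed allow-list of uncapitalized words; extend it for your own data.
PARTICLES = ("van", "der", "den", "de", "the")
LOW = "(?:" + "|".join(map(re.escape, PARTICLES)) + ")"

# Same grammar as before, but only listed particles may sit in the middle.
NAME_STRICT = re.compile(rf"{CAP}(?:\s+{CAP})*(?:\s+{LOW}){{0,2}}\s+{CAP}")

def find_names_strict(text):
    """Return matches where every uncapitalized word is on the allow-list."""
    return [m.group(0) for m in NAME_STRICT.finditer(text)]

print(find_names_strict("Pieter van den Woude met Gandalf the Gray."))
print(find_names_strict("we make up the Friends Cast and Crew."))
```

With this variant the match no longer bridges "and", although "Friends Cast" is still picked up, because any two adjacent capitalized words fit the grammar.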

As I and others have pointed out, regular expressions will never be perfect for this situation, but as you said, you want something to get you most of the way there. In this case, the above expression should do a pretty good job, but consider it a blunt instrument! You've been warned.

Answer 2

In your case, just append one more of these to your pattern:

[\s-]([A-Z][a-z]*)
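Read literally, that means appending the fragment to the pattern from the question so that a third word is matched. A small Python sketch of that reading follows; the variable names are illustrative, and making the extra word optional is an assumption added here, not something this answer states.

```python
import re

BASE = r"([A-Z][a-z]*)[\s-]([A-Z][a-z]*)"   # pattern from the question
EXTRA = r"[\s-]([A-Z][a-z]*)"               # fragment to append

# Exactly three capitalized words.
three_words = re.compile(BASE + EXTRA)

# Assumption: wrap the fragment in an optional group so that
# two-word names keep matching as well.
two_or_three = re.compile(BASE + "(?:" + EXTRA + ")?")

print(three_words.fullmatch("Chandler Murial Bing") is not None)
print(two_or_three.fullmatch("Jack Oneill") is not None)
print(three_words.search("Gandalf the Gray") is not None)
```

Names with lowercase words in the middle, such as "Gandalf the Gray", are still not covered by this approach.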

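In general, regex is not a good fit for this problem: there are too many special cases, and you would end up having to compile a list of the names.

For complex names, look into natural language processing.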
