Find names with Regular Expression
  • Time: 2011-10-04 20:58:38
  • Tags:
  • regex

To find names in a large text, I have the following regex:

([A-Z][a-z]*)[\s-]([A-Z][a-z]*)

This works fine for ordinary names such as "Jack Oneill" or "John Guidetti". But there are a few other forms I would also like to find and cannot, such as:

Chandler Murial Bing
Gandalf the Gray
Pieter van den Woude
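
As a quick check, here is a minimal Python sketch that runs the pattern above (written with `\s` for whitespace) against each example. Python and the helper name `matches_whole_name` are assumptions made for illustration; the question does not say which regex engine is used.

```python
import re

# The pattern from the question: two capitalized words joined by
# a whitespace character or a dash.
PATTERN = re.compile(r"([A-Z][a-z]*)[\s-]([A-Z][a-z]*)")

def matches_whole_name(name: str) -> bool:
    """Return True if the pattern covers the entire name (hypothetical helper)."""
    return PATTERN.fullmatch(name) is not None

for name in ["Jack Oneill", "John Guidetti", "Chandler Murial Bing",
             "Gandalf the Gray", "Pieter van den Woude"]:
    print(f"{name!r}: {matches_whole_name(name)}")
```

Only the two-word names are covered in full; the three longer examples are not.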

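With my limited knowledge of regular expressions, I can't seem to get this right. Can anyone help me (and please point me to a good website or manual on the subject)?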

Best answer

The best way to approach a regular expression problem is to describe what you expect to match (commonly called a grammar).

For example, from your question I could say:

  1. A capitalized word is defined as one capital letter and 1+ letters/dashes or one capital letter and a . (an initial).
  2. An uncapitalized word is defined as 1 letter and 1+ letters/dashes (not perfect, because that could allow ending in a dash).
  3. First word starts with a capital letter
  4. Last word starts with a capital letter
  5. 0+ capitalized words directly after the first word
  6. Then 0-2 uncapitalized words between those capitalized words and the last word
  7. At least two words.
  8. Words are broken by whitespace

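If this comes reasonably close to the results you expect (and to be clear, names vary so much that you will always get either false positives or false negatives), then you can start constructing the following expressions: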

  1. Capitalized word: [A-Z]([a-z]+|\.)
  2. Uncapitalized word: [a-z][a-z-]+

Result:

 [A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*(?:\s+[a-z][a-z-]+){0,2}\s+[A-Z]([a-z]+|\.)
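
One way to keep an expression like this readable is to assemble it from the two building blocks. The sketch below assumes Python's `re` module; the names `CAP`, `UNCAP`, and `NAME` are made up for illustration, and the inner groups are written as non-capturing `(?:...)`, which is a small departure from the expression above.

```python
import re

# Building blocks from the grammar (names are illustrative).
CAP = r"[A-Z](?:[a-z]+|\.)"   # capitalized word, or an initial such as "A."
UNCAP = r"[a-z][a-z-]+"       # uncapitalized word, dashes allowed after the first letter

# First word, 0+ capitalized words, 0-2 uncapitalized words, last word.
# Doubled braces produce a literal {0,2} inside the f-string.
NAME = re.compile(rf"{CAP}(?:\s+{CAP})*(?:\s+{UNCAP}){{0,2}}\s+{CAP}")

for candidate in ["Pieter van den Woude", "A. A. Milne", "Gandalf"]:
    print(candidate, "->", bool(NAME.fullmatch(candidate)))
```

A single word such as "Gandalf" is rejected because the grammar requires at least two words.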

Matches (bold):

Hello my name is **Chandler Muriel Bing**. I have a friend who is named **Pieter van den Woude** and he has another friend, **A. A. Milne**. **Gandalf the Gray** joins us. Together, we make up the **Friends Cast and Crew**.
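
To reproduce this, a minimal sketch assuming Python's `re` module (the variable names are illustrative). Because the expression contains capturing groups, `re.findall` would return tuples of group contents, so the sketch uses `finditer` and reads the whole match with `group(0)`.

```python
import re

# The expression from the answer, with \s for whitespace and \. for a literal dot.
NAME = re.compile(
    r"[A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*(?:\s+[a-z][a-z-]+){0,2}\s+[A-Z]([a-z]+|\.)"
)

TEXT = (
    "Hello my name is Chandler Muriel Bing. I have a friend who is named "
    "Pieter van den Woude and he has another friend, A. A. Milne. "
    "Gandalf the Gray joins us. Together, we make up the Friends Cast and Crew."
)

# group(0) is the full matched span, independent of the inner groups.
matches = [m.group(0) for m in NAME.finditer(TEXT)]
for found in matches:
    print(found)
```

This prints the four names plus "Friends Cast and Crew", which is the false positive discussed in the issues.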

Issues:

  • Because you want to match Gandalf the Gray and Pieter van den Woude you will inevitably match other sets that consist of names with uncapitalized words in between (Friends Cast and Crew). The above grammar attempts to limit the problem by limiting it to 2 uncapitalized words. You could also create a set of allowed uncapitalized words instead ("van", "der", "the"), and only match those words.
  • Doesn't allow for non-Latin-alphabet letters, ligatures, diacritics, etc.
  • As I and others have pointed out, regular expressions will never be perfect for this situation, but as you said, you want something to get you most of the way there. In this case, the above expression should do a pretty good job, but consider it a blunt instrument! You've been warned.
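
The allowed-word idea from the first issue could look roughly like the sketch below. It assumes Python's `re` module; the `PARTICLES` list and the helper `find_names` are hypothetical, and a real list would need to be much longer.

```python
import re

# Assumed, deliberately short list of lowercase words allowed inside a name.
PARTICLES = ("van", "den", "der", "de", "the")

CAP = r"[A-Z](?:[a-z]+|\.)"                        # capitalized word or initial
PARTICLE = r"(?:" + "|".join(PARTICLES) + r")"     # only whitelisted words
NAME = re.compile(rf"{CAP}(?:\s+{CAP})*(?:\s+{PARTICLE}){{0,2}}\s+{CAP}")

def find_names(text: str) -> list[str]:
    """Return every full match of NAME in text (hypothetical helper)."""
    return [m.group(0) for m in NAME.finditer(text)]

sample = ("Gandalf the Gray met Pieter van den Woude "
          "and the Friends Cast and Crew.")
print(find_names(sample))
```

With this list, "and" no longer bridges two capitalized words, so the last phrase shrinks to "Friends Cast". Two adjacent capitalized words are still reported, so the whitelist narrows the false positives without removing them.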
Other answers

In your case, just append one more:

[\s-]([A-Z][a-z]*)
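
A minimal sketch of that suggestion, assuming Python's `re` module (the names `TWO_WORDS` and `THREE_WORDS` are illustrative): the extra piece is appended to the two-word pattern from the question.

```python
import re

# Two-word pattern from the question, plus the appended third word.
TWO_WORDS = r"([A-Z][a-z]*)[\s-]([A-Z][a-z]*)"
THREE_WORDS = re.compile(TWO_WORDS + r"[\s-]([A-Z][a-z]*)")

m = THREE_WORDS.fullmatch("Chandler Murial Bing")
print(m.groups() if m else None)

# The appended word is mandatory here, so two-word names no longer match.
print(THREE_WORDS.fullmatch("Jack Oneill"))
```

Written this way the third word is required. Wrapping the appended part in `(?:...)?` would make it optional, but that is an assumption beyond what the answer states, and lowercase words such as "the" or "van" are still not handled.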

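In general, regex is not well suited to this problem: there are too many special cases, and you would need to compile a list of these names.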

For complex names, see [natural language processing].




