Human Name parsing

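I have a bunch of human names. They are all "Western" names, and I only need American conventions and abbreviations (e.g., "Mr." rather than a foreign-language equivalent). Unfortunately, the people I'm sending things to did not enter their own names, so I can't ask them what they would like to be called. I know each person's gender and full name, but I haven't parsed anything out more specifically.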

Some examples:

  1. John Smith
  2. John Smith, Jr.
  3. John Smith Jr.
  4. John Smith XIV
  5. Dr. John Smith, Ph.D.

I'd like to be able to parse each name into its parts:

name = Name.new("John Smith Jr.")
name.first_name # => John
name.greeting   # => Mr. Smith

If I look at the "greeting" (perhaps not the best term), what I want is "Mr. Smith" for examples 1-4 and "Dr. Smith" for example 5.

A Ruby gem for this would be ideal. I was inspired to ask for something this strange by Chronic, a Ruby gem that handles time in a remarkably human way, letting me tell it "last Tuesday" and having it come up with something sensible. An algorithm that hits most of the corner cases would also suffice.

I'm trying to handle

Best answer

Since you are only dealing with Western names, I think a few rules will get you most of the way there:

  1. If a comma appears, delete the leftmost one and everything after.
  2. Continue removing words from the beginning while, after converting to lowercase and removing any full stops, they belong to the set { mr mrs miss ms rev dr prof } and any more you can think of. Using a table of title "scores" (e.g. [mr=1, mrs=1, rev=2, dr=3, prof=4] -- order them however you want), record the highest-scoring title that was deleted.
  3. Continue removing words from the end while they belong to the set { jr phd } or are Roman numerals of value roughly 50 or less (/[XVI]+/ is probably a good enough regex).
  4. If one or more titles having nonzero scores were deleted in step 2, use the highest-scoring one. Otherwise, use "Mr." or "Mrs." according to the supplied gender.
  5. As the surname, use the last word.
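
The five rules above can be sketched in Ruby roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the method name `greeting`, the `:male`/`:female` gender symbols, the scores for "miss" and "ms", the display forms of the titles, and the guard that always leaves at least one word are all choices made here for illustration. The Roman-numeral regex is the one from step 3, anchored so it must match the whole word.

```ruby
# Title scores from step 2; the values for "miss" and "ms" are assumed.
TITLE_SCORES = {
  "mr" => 1, "mrs" => 1, "miss" => 1, "ms" => 1,
  "rev" => 2, "dr" => 3, "prof" => 4
}.freeze

# How each title is written in the greeting (an illustrative choice).
TITLE_DISPLAY = {
  "mr" => "Mr.", "mrs" => "Mrs.", "miss" => "Miss", "ms" => "Ms.",
  "rev" => "Rev.", "dr" => "Dr.", "prof" => "Prof."
}.freeze

SUFFIXES = %w[jr phd].freeze
ROMAN = /\A[XVI]+\z/ # step 3's regex, anchored to the whole word

# Lowercase a word and drop full stops, e.g. "Ph.D." -> "phd".
def normalize(word)
  word.downcase.delete(".")
end

def greeting(full_name, gender)
  # Step 1: drop the leftmost comma and everything after it.
  words = full_name.to_s.split(",", 2).first.to_s.split
  return nil if words.empty?

  # Step 2: strip leading titles, remembering the highest-scoring one.
  # (Assumption: always leave at least one word to serve as the surname.)
  best = nil
  while words.size > 1 && TITLE_SCORES.key?(normalize(words.first))
    title = normalize(words.shift)
    best = title if best.nil? || TITLE_SCORES[title] > TITLE_SCORES[best]
  end

  # Step 3: strip trailing suffixes and small Roman numerals.
  while words.size > 1 &&
        (SUFFIXES.include?(normalize(words.last)) || words.last.match?(ROMAN))
    words.pop
  end

  # Step 4: prefer a title found in the name, else fall back on gender.
  title = best ? TITLE_DISPLAY[best] : (gender == :female ? "Mrs." : "Mr.")

  # Step 5: the last remaining word is taken as the surname.
  "#{title} #{words.last}"
end

puts greeting("John Smith, Jr.", :male)
puts greeting("Dr. John Smith, Ph.D.", :male)
```

Because the regex is case-sensitive, a surname such as "Xi" is not mistaken for a numeral, but an all-caps input would need extra care.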

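There will never be a guarantee that a name like "John Baxter Smith" is parsed correctly, because not all double-barrelled surnames use a hyphen. Is "Baxter Smith" the surname, or is "Baxter" a middle name? I think it is safe to assume that middle names are more common than double-barrelled but unhyphenated surnames, so defaulting to the last word as the surname will be wrong less often. You may still want to compile a list of common double-barrelled surnames and check against it.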
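
The list check suggested above could look something like this. It is only a sketch: the method name `surname_with_lookup` and the two list entries are illustrative assumptions, and a real list would have to be compiled from your own data.

```ruby
require "set"

# Hypothetical lookup list of unhyphenated double-barrelled surnames,
# stored lowercase. These two entries are examples only.
DOUBLE_BARRELLED = Set["lloyd webber", "bonham carter"].freeze

# words: name tokens with titles and suffixes already removed.
def surname_with_lookup(words)
  last_two = words.last(2).join(" ")
  # Only treat the last two words as a surname if a given name remains.
  if words.size > 2 && DOUBLE_BARRELLED.include?(last_two.downcase)
    last_two
  else
    words.last # default: the last word is the surname
  end
end

puts surname_with_lookup(%w[Andrew Lloyd Webber])
puts surname_with_lookup(%w[John Baxter Smith])
```

Anything not on the list falls back to the last-word default, so "Baxter" is still treated as a middle name.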

Other answers

Lingua::EN::NameParse (Perl): http://search.cpan.org/~kimryan/Lingua-EN-NameParse/

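I ran your examples through it and got the results below. It only handles Roman numeral suffixes up to 12 (XII), and it does not recognize the "Ph.D." suffix as written, so I had to change both in your input data.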

Input                  Title  First name  Surname  Suffix
JOHN SMITH                    John        Smith
JOHN SMITH, JR.               John        Smith    Jr
JOHN SMITH JR.                John        Smith    Jr
JOHN SMITH XII                John        Smith    XII
DR. JOHN SMITH, PHD    Dr.    John        Smith    Phd

Humanparser

Parses a human name into first name, middle name, last name, and suffix.

Install

npm install humanparser

Usage

var human = require('humanparser');

var fullName = 'Mr. William R. Jenkins, III'
    , attrs = human.parseName(fullName);

console.log(attrs);

//produces the following output

{ saluation: 'Mr.',
  firstName: 'William',
  suffix: 'III',
  lastName: 'Jenkins',
  middleName: 'R.',
  fullName: 'Mr. William R. Jenkins, III' }

Have you tried a dedicated name parser?

Parsing names is complex, so I would recommend using an API service that parses names into components. You can integrate the RESTful API into your project or use the web app to run a list in your browser. That way you get the first and last name, validated and accompanied by further details such as salutation, nationality, and gender.

I realize this is an old question, but I ended up here with the same problem, so I thought I would share my solution.

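If your data is clean and US-only, a rule-based approach built on tokenizing and pattern matching can do an excellent job. However, if your names merely share a Latin writing system, purely rule-based approaches tend to break down, even when every name follows Western naming conventions. Consider the following: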

  • Initials, e.g., J. R. R. Tolkien, J. Edgar Hoover, etc.
  • Double-barreled names, e.g., Jean-Pierre Flamel, Catherine Zeta-Jones
  • Particles in names, e.g., Fernando de la Vega, Bashar al Assad
  • Middle name versus second last name ambiguity, e.g., Juan García Lopez
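
To make these cases concrete, here is a small Ruby sketch of a rule that attaches particles to the family name. It rests on assumptions: the particle list is illustrative and far from complete, and the method name `split_given_and_family` is made up for this example. Splitting on whitespace keeps hyphenated names and initials intact, but a rule like this cannot settle the last bullet: for "Juan García Lopez" it would simply return "Lopez" as the family name.

```ruby
# Illustrative, incomplete particle list (an assumption, not a standard).
PARTICLES = %w[de la del van von der al bin di da].freeze

def split_given_and_family(name)
  words = name.split
  family = [words.pop]
  # Pull preceding particles into the family name, but always leave
  # at least one word as the given name.
  while words.size > 1 && PARTICLES.include?(words.last.downcase)
    family.unshift(words.pop)
  end
  { given: words.join(" "), family: family.join(" ") }
end

p split_given_and_family("Fernando de la Vega")
p split_given_and_family("J. R. R. Tolkien")
p split_given_and_family("Catherine Zeta-Jones")
```

Every new naming tradition means another entry in the list or another special case, which is where rule-based parsers start to strain.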

I really like the article you linked! I would add that there is another piece discussing some of the ideas above.

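If you find the rule-based approach failing, the next best option is a statistical parser. These tools make no hard-coded assumptions about case, name order, and so on. Instead, they pattern-match against a large database of real names, which lets them resolve ambiguities, recognize particles in names, and more. Such systems also tend to produce demographic estimates and structured name data, which you can use to enrich and/or verify your names (since you already have the gender).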

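If you need a data-driven name parser, HumanGraphics is a good choice. It is built on name data describing 7 billion people, and it reliably parses names across a variety of writing systems and naming traditions.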

Disclaimer: I built HumanGraphics for exactly the reasons you cite!
