  • ruby


str = "foo bar jim jam. jar jee joon."

I need output of all the 2-word and 3-word phrases, with the words separated by spaces:

[ "foo bar", "bar jim", "jim jam", "jar jee", "jee joon",
  "foo bar jim", "bar jim jam", "jar jee joon" ]

Note in particular that, because of the period, the output above omits "jam jar", "jim jam jar", and "jam jar jee".

I can't use str.scan(/\w+/).each_cons(2).map { |a| a.join(" ") }, because that includes "jam jar".

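Scanning with /\w+ \w+/ yields ["foo bar", "jim jam", "jar jee"], which notably lacks "bar jim" and "jee joon". That highlights the problem: scan consumes each match, so matches cannot overlap.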

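The real-world application for this is generating a phrase-based index for a search engine. I want every run of truly consecutive words as a phrase, excluding runs where punctuation separates the words.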

Edit: It seems there may be a way to do this with a regex and scan, using some variation of:

"a b c d".scan(/(?=([abc] [abc]) )[abc]/)
#=> [["a b"], ["b c"]]
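One way that lookahead idea might be extended to real words is sketched below. It assumes words are separated by exactly one space, and the variable names (pairs, triples, phrases) are only illustrative. Because the phrase is captured inside a lookahead, only the first word is consumed, so the next match can begin at the following word; punctuation such as "." breaks the \w+ run and therefore the phrase.

```ruby
str = "foo bar jim jam. jar jee joon."

# \b stops a match from starting in the middle of a word.
# The lookahead captures the whole phrase; only the first word is consumed.
pairs   = str.scan(/\b(?=(\w+ \w+))\w+/).flatten
triples = str.scan(/\b(?=(\w+ \w+ \w+))\w+/).flatten

phrases = pairs + triples
p phrases
```

For the sample string this gives the five pairs and three triples listed in the question, with no phrase crossing the period.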


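str.split(".").flat_map do |s|
  pairs_and_triples = []
  # Collect every run of 2, then 3, consecutive words within this sentence
  s.split.each_cons(2) { |words| pairs_and_triples << words.join(" ") }
  s.split.each_cons(3) { |words| pairs_and_triples << words.join(" ") }
  pairs_and_triples
end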

EDIT: Or, slightly less convoluted:

str.split(".").flat_map do |s|
  [2, 3].flat_map do |i|
    s.split.each_cons(i).map { |words| words.join(" ") }
  end
end

str = "foo bar jim jam. jar jee joon."
arr = str.split(" ").each_cons(2).map do |a|
  a.join(" ") if a.join(" ").match(/\w+ \w+/)
end
p arr.compact
#=> ["foo bar", "bar jim", "jim jam.", "jar jee", "jee joon."]

It looks like you have changed your question to ask for 3-word phrases as well.


  1. split on sentence boundaries
  2. scan for words (ignoring uninteresting punctuation like commas)
  3. use each_cons to process the variations on that array

In code:

max_words_per_phrase = 5
str = "foo bar, jim jam. jar: jee joon."

phrases = str.split(/[.!?]+/).flat_map do |sentence|
  words = sentence.scan(/\w+/)
  2.upto(max_words_per_phrase).flat_map do |i|
    words.each_cons(i).map{ |a| a.join(" ") }
  end
end

p phrases
#=> ["foo bar", "bar jim", "jim jam", "foo bar jim", "bar jim jam",
#=>  "foo bar jim jam", "jar jee", "jee joon", "jar jee joon"]

After removing the punctuation:

str = "foo bar jim jam jar jee joon"


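r2 = /(\w+)(?=(\s+\w+))/
r3 = /(\w+)(?=(\s+\w+)(\s+\w+))/

# Each match is an array of captures; join them back into one phrase.
(str.scan(r2) + str.scan(r3)).map(&:join)
  #=> ["foo bar", "bar jim", "jim jam", "jam jar", "jar jee", "jee joon",
  #    "foo bar jim", "bar jim jam", "jim jam jar", "jam jar jee", "jar jee joon"]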
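Because the punctuation is removed first, that result includes phrases that cross the sentence boundary, such as "jam jar". A minimal sketch of one way to keep the boundaries, assuming sentences end in ".", "!" or "?", is to apply the same two regexes to each sentence separately. The names sentences, pairs, triples and phrases are illustrative only.

```ruby
str = "foo bar jim jam. jar jee joon."

r2 = /(\w+)(?=(\s+\w+))/
r3 = /(\w+)(?=(\s+\w+)(\s+\w+))/

# Split first so that no lookahead can reach into the next sentence.
sentences = str.split(/[.!?]+/)

# scan returns one array of captures per match; join rebuilds the phrase.
pairs   = sentences.flat_map { |s| s.scan(r2).map(&:join) }
triples = sentences.flat_map { |s| s.scan(r3).map(&:join) }

phrases = pairs + triples
p phrases
```

With the original punctuated string this yields the eight phrases requested in the question. Note that the captured whitespace is kept as-is, so runs of several spaces between words would appear unchanged in the phrases.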

