Scan for groups of consecutive words
  • Time: 2012-05-26 02:50:19
  • Tags: ruby

Given the input:

str = "foo bar jim jam. jar jee joon."

I need as output all 2- and 3-word phrases whose words are separated only by spaces:

[ "foo bar", "bar jim", "jim jam", "jar jee", "jee joon",
  "foo bar jim", "bar jim jam", "jar jee joon" ]

Note in particular that, because of the period, the output above omits "jam jar", "jim jam jar", and "jam jar jee".

I can't use str.scan(/\w+/).each_cons(2).map{ |a| a.join(' ') }, because that includes "jam jar".

Scanning for /\w+ \w+/ produces ["foo bar", "jim jam", "jar jee"], which notably lacks "bar jim" and "jee joon" and highlights the problem: scan does not return overlapping matches.
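To make the two failed attempts concrete, here is a minimal sketch that runs both against the sample input. The variable names all_pairs and non_overlapping are only illustrative and do not come from the question.

```ruby
str = "foo bar jim jam. jar jee joon."

# Attempt 1: extract every word, then pair up neighbours.
# The period is discarded by \w+, so the pair spanning it is included.
all_pairs = str.scan(/\w+/).each_cons(2).map { |a| a.join(' ') }
p all_pairs
#=> ["foo bar", "bar jim", "jim jam", "jam jar", "jar jee", "jee joon"]

# Attempt 2: scan for two words at a time.
# Each match consumes both words, so overlapping pairs are skipped.
non_overlapping = str.scan(/\w+ \w+/)
p non_overlapping
#=> ["foo bar", "jim jam", "jar jee"]
```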

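The real-world application for this is generating a phrase-based index for a search engine. I want to treat every run of truly consecutive words as a phrase, excluding words that are separated by punctuation.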

Edit: It seems there is a way to do this within regex/scan, using a variation on:

"a b c d".scan(/(?=([abc] [abc]) )[abc]/)
#=> [["a b"], ["b c"]]
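One way the lookahead idea above might be extended from single letters to whole words is sketched below. The helper phrases_of is hypothetical (it is not from the question), and the sketch assumes words are separated by exactly one space, as in the sample input.

```ruby
str = "foo bar jim jam. jar jee joon."

# Hypothetical helper: capture n space-separated words inside a lookahead,
# then consume only the first word so the next match can overlap.
def phrases_of(str, n)
  # \b anchors the capture to whole words; {n-1} repeats " word" n-1 times.
  pattern = /(?=\b(\w+(?: \w+){#{n - 1}})\b)\w+/
  str.scan(pattern).flatten
end

phrases = phrases_of(str, 2) + phrases_of(str, 3)
p phrases
#=> ["foo bar", "bar jim", "jim jam", "jar jee", "jee joon",
#    "foo bar jim", "bar jim jam", "jar jee joon"]
```

Because the space inside the pattern must be matched literally, a period (or any other punctuation) between two words prevents a phrase from spanning it.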
Best answer

I believe this works, although it assumes that the only punctuation is periods:

str.split(".").map do |s|
  pairs_and_triples = []
  # each_cons yields one array of words per window, so no splat is needed
  s.split.each_cons(2){ |words| pairs_and_triples << words.join(" ") }
  s.split.each_cons(3){ |words| pairs_and_triples << words.join(" ") }
  pairs_and_triples
end.flatten

EDIT: Or, slightly less roundabout:

str.split(".").map do |s|
  [2,3].map do |i|
    s.split.each_cons(i).map{ |words| words.join(" ") }
  end.flatten
end.flatten
Answers
str = "foo bar jim jam. jar jee joon."
arr = str.split(' ').each_cons(2).map do |a|
  a.join(' ') if a.join(' ').match(/\w+ \w+/)
end
p arr.compact
#=> ["foo bar", "bar jim", "jim jam.", "jar jee", "jee joon."]

It looks like you have changed your question to ask for 3-word phrases as well.

Here is a solution suggested by @muistososhort and sketched out by @ChrisRice:

  1. split on sentence boundaries
  2. scan for words (ignoring uninteresting punctuation like commas)
  3. use each_cons to process the variations on that array

In code:

max_words_per_phrase = 5
str = "foo bar, jim jam. jar: jee joon."

phrases = str.split(/[.!?]+/).flat_map do |sentence|
  words = sentence.scan(/\w+/)
  2.upto(max_words_per_phrase).flat_map do |i|
    words.each_cons(i).map{ |a| a.join(' ') }
  end
end

p phrases
#=> ["foo bar", "bar jim", "jim jam", "foo bar jim", "bar jim jam",
#=>  "foo bar jim jam", "jar jee", "jee joon", "jar jee joon"]

After removing the punctuation:

str = "foo bar jim jam jar jee joon"

As you suggested in your question, you can use a positive lookahead:

r2 = /(\w+)(?=(\s+\w+))/
r3 = /(\w+)(?=(\s+\w+)(\s+\w+))/
str.scan(r2).concat(str.scan(r3)).map(&:join)
  #=> ["foo bar", "bar jim", "jim jam", "jam jar", "jar jee", "jee joon",
  #    "foo bar jim", "bar jim jam", "jim jam jar", "jam jar jee", "jar jee joon"] 
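Since stripping the punctuation lets "jam jar" back in, one possible variation is to apply the same lookahead patterns to each sentence separately. This is only a sketch: splitting on /[.!?]+/ is an assumption about what counts as a sentence boundary, and the names sentences, pairs, and triples are illustrative.

```ruby
str = "foo bar jim jam. jar jee joon."

r2 = /(\w+)(?=(\s+\w+))/
r3 = /(\w+)(?=(\s+\w+)(\s+\w+))/

# Assumed sentence boundaries: runs of '.', '!' or '?'.
sentences = str.split(/[.!?]+/)

# Scan each sentence on its own so no phrase can cross a boundary.
pairs   = sentences.flat_map { |s| s.scan(r2).map(&:join) }
triples = sentences.flat_map { |s| s.scan(r3).map(&:join) }

p pairs + triples
#=> ["foo bar", "bar jim", "jim jam", "jar jee", "jee joon",
#    "foo bar jim", "bar jim jam", "jar jee joon"]
```

Other punctuation inside a sentence, such as commas, would still need to be handled separately under this approach.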