Regex to match URLs generating partially invalid output

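I am trying to use the following regex code in my Ruby application to match HTTP links, but it generates invalid output: it appends a period, and sometimes a period and a word, to the end of the link, and the resulting link turns out to be invalid when tested on the web.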

URL_PATTERN  = Regexp.new %r{http://[\w/.%-]+}i
<input>.to_s.scan( URL_PATTERN ).uniq

Is there something wrong with the above code for scanning links?

Application code:

require 'bundler/setup'
require 'twitter'

RECORD_LIMIT = 100
URL_PATTERN  = Regexp.new %r{http://[\w/.%-]+}i

def usage
  warn "Usage: ruby #{File.basename $0} <hashtag>"  
  exit 64
end

# Ensure that the hashtag has a hash symbol. This makes the leading '#'
# optional, which avoids the need to quote or escape it on the command line.
def format_hashtag(hashtag)  
  (hashtag.scan(/^#/).empty?) ? "##{hashtag}" : hashtag
end

# Return a sorted list of unique URLs found in the list of tweets.
def uniq_urls(tweets)  
  tweets.map(&:text).grep( %r{http://}i ).to_s.scan( URL_PATTERN ).uniq
end

def search(hashtag)  
  Twitter.search(hashtag, rpp: RECORD_LIMIT, result_type: 'recent')
end

if __FILE__ == $0
  usage unless ARGV.size >= 1
  hashtag = format_hashtag(ARGV[0])
  tweets = search(hashtag)
  puts uniq_urls(tweets)
end

Answers

TL;DR

People post bad links all the time. Links also suffer from bit rot.

The Likely Answer

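Have you manually validated the tweets? Are you sure the original tweet did not contain an invalid URL? If someone posts:

http://foo.Any more toast?

then you are certainly going to get an invalid result, because the regular expression expects whitespace around the URL. If you want to weed out invalid results, you will need a link checker that can handle redirects to validate each link you find.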
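
A minimal sketch of such a link checker follows; it is not part of the original extractor. The names `resolve_link` and `DEFAULT_FETCH` and the redirect limit of 5 are assumptions made for illustration. The fetcher is injectable so the redirect logic can be exercised without touching the network.

```ruby
require 'net/http'
require 'uri'

# Hypothetical default fetcher: performs one GET and returns
# [status_code, Location header or nil].
DEFAULT_FETCH = lambda do |url|
  response = Net::HTTP.get_response(URI(url))
  [response.code.to_i, response['location']]
end

# Follows redirects up to `limit` hops. Returns the final URL if the chain
# ends in a 2xx response, otherwise nil.
def resolve_link(url, limit: 5, fetch: DEFAULT_FETCH)
  limit.times do
    status, location = fetch.call(url)
    return url if (200..299).cover?(status)
    return nil unless (300..399).cover?(status) && location

    # Location may be relative, so resolve it against the current URL.
    url = URI.join(url, location).to_s
  end
  nil # too many redirects
rescue StandardError
  nil # unreachable host, malformed URL, timeout, and so on
end
```

Something like `uniq_urls(tweets).map { |url| resolve_link(url) }.compact` would then keep only the links that resolve, at the cost of one or more HTTP requests per link.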

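Author's Disclaimer

The code you reposted is mine, from CodeGnome/twitter_url_extractor. I deliberately did not do any link checking, because I was interested in extracting URLs, not validating them.

"Works for me. Your mileage may vary."SM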

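The problem is that your regex will include a trailing period, because you are indiscriminately matching arbitrary sequences of word characters, slashes, percent signs, hyphens (aka "minus") and periods. This catches a trailing period, which is in fact punctuation when the URL sits at the end of a sentence, as well as anything that follows it if people omit the space after the period, as CodeGnome correctly remarked (https://stackoverflow.com/a/10788022/990363). You can partially alleviate the issue by excluding such trailing punctuation (note that this will still be tripped up by non-URL text directly following the period):

http://\w+(?:[./%-]\w+)+$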
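
To make the difference concrete, here is a small comparison of the question's pattern with the one above. The sample sentences and the constant names are made up for illustration, and the trailing `$` is dropped in this sketch so that `scan` can find URLs in the middle of a line.

```ruby
# Pattern from the question: any run of word chars, slashes, dots, % and -.
ORIGINAL = %r{http://[\w/.%-]+}i
# Pattern from this answer without the trailing `$` (an adaptation for this
# sketch): every piece of punctuation must be followed by word characters.
TRIMMED = %r{http://\w+(?:[./%-]\w+)+}i

end_of_sentence = 'Read http://example.com/page. It is good.'
missing_space   = 'Read http://example.com/page.It is good.'

p end_of_sentence.scan(ORIGINAL) # => ["http://example.com/page."]
p end_of_sentence.scan(TRIMMED)  # => ["http://example.com/page"]

# Without a space after the period, both patterns swallow the next word.
p missing_space.scan(ORIGINAL)   # => ["http://example.com/page.It"]
p missing_space.scan(TRIMMED)    # => ["http://example.com/page.It"]
```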

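However, this will still miss a large share of existing URLs and catch plenty of invalid ones: URLs are quite complex beasts (http://www.w3.org/Addressing/URL/). If you want a near-perfect match, John Gruber has posted a regex (http://daringfireball.net/2010/07/improved_regex_for_matching_urls) that matches anything used as a URL today, not just http(s) ones. To get closer for the bulk of the web's ordinary URLs, including the HTTPS variants, while making sure you have a well-formed domain at the start and also capturing queries and fragment identifiers, use this regex:

https?://[\w-]+(?:\.[\w-]+)+(?:/[\w-]+)*(?:(?:[./%?=&#-]\w+)+)?

This will still catch invalid stuff and exclude a fair share of existing URLs (and an even larger share of valid ones; see the RFC linked above), but it will get you closer.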
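
A short sketch of what this broader pattern does and does not capture; the sample text and the constant name `BROADER` are assumptions made for illustration.

```ruby
BROADER = %r{https?://[\w-]+(?:\.[\w-]+)+(?:/[\w-]+)*(?:(?:[./%?=&#-]\w+)+)?}

text = 'See https://www.example.com/search?q=ruby&page=2#top. ' \
       'Also http://example.org/docs/.'

# The query string and fragment are kept and the sentence-ending periods are
# dropped, but so is the trailing slash of the second URL.
p text.scan(BROADER)
# => ["https://www.example.com/search?q=ruby&page=2#top", "http://example.org/docs"]

# A valid URL that is excluded: the host must contain at least one dot.
p 'http://localhost:3000/'.scan(BROADER) # => []
```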

Why not use Ruby's URI.extract? It comes bundled with Ruby.

Documentation:

Synopsis

URI::extract(str[, schemes][,&blk])

Args

str     String to extract URIs from.
schemes Limit URI matching to a specific schemes.

Description

Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.

Usage

require "uri"

URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.")
# => ["http://foo.example.org/bla", "mailto:test@example.com"]

If you only want HTTP URLs:

[3] (pry) main: 0> URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.", %w[http])
=> ["http://foo.example.org/bla"]
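
As a rough sketch of how this could replace the regex in the question's `uniq_urls`, the helper below runs `URI.extract` over each tweet text separately. The name `uniq_urls_from_texts` and the sample texts are hypothetical. Note that newer Ruby documentation marks `URI.extract` as obsolete, so it may print a warning in verbose mode.

```ruby
require 'uri'

# Hypothetical variant of uniq_urls that takes an array of tweet texts and
# collects the unique http/https URLs found in them.
def uniq_urls_from_texts(texts)
  texts.flat_map { |text| URI.extract(text, %w[http https]) }.uniq
end

texts = [
  'ruby news http://foo.example.org/bla via a friend',
  'again http://foo.example.org/bla and https://bar.example.org/x too',
  'mail me at mailto:someone@example.org please'
]

p uniq_urls_from_texts(texts)
# => ["http://foo.example.org/bla", "https://bar.example.org/x"]
```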