Question

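I'm trying to use the following regex in my Ruby application to match HTTP links, but it produces invalid output: it appends a period, and sometimes a period and a word, to the end of the link, and the result turns out to be invalid when tested on the web.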

URL_PATTERN  = Regexp.new %r{http://[\w/.%-]+}i
<input>.to_s.scan( URL_PATTERN ).uniq

Is there anything wrong with the code above for scanning links?

Application code:

require 'bundler/setup'
require 'twitter'

RECORD_LIMIT = 100
URL_PATTERN  = Regexp.new %r{http://[\w/.%-]+}i

def usage
  warn "Usage: ruby #{File.basename $0} <hashtag>"  
  exit 64
end

# Ensure that the hashtag has a hash symbol. This makes the leading '#'
# optional, which avoids the need to quote or escape it on the command line.
def format_hashtag(hashtag)  
  (hashtag.scan(/^#/).empty?) ? "##{hashtag}" : hashtag
end

# Return a sorted list of unique URLs found in the list of tweets.
def uniq_urls(tweets)  
  tweets.map(&:text).grep( %r{http://}i ).to_s.scan( URL_PATTERN ).uniq
end

def search(hashtag)  
  Twitter.search(hashtag, rpp: RECORD_LIMIT, result_type: 'recent')
end

if __FILE__ == $0
  usage unless ARGV.size >= 1
  hashtag = format_hashtag(ARGV[0])
  tweets = search(hashtag)
  puts uniq_urls(tweets)
end

Answer 1

TL;DR

People post malformed links all the time. Links also suffer from bit rot.

The Likely Answer

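Have you manually verified the tweets? Are you sure the original tweets don't contain malformed URLs? If someone posts:

http://foo.Any more toast?

then you will certainly get an invalid result, because the regex expects URLs to be delimited by whitespace. If you want to weed out invalid results, you need a link checker that can follow redirects to validate each link you find.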
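For what it's worth, a link checker of the kind described above might be sketched roughly as follows. This is only an illustration using Ruby's standard Net::HTTP; the method name `link_alive?`, the `MAX_REDIRECTS` limit, and the 5-second timeouts are assumptions made for the example and are not part of the original extractor.

```ruby
require 'net/http'
require 'openssl'
require 'uri'

# Hypothetical cap on how many redirects we are willing to follow.
MAX_REDIRECTS = 5

# Returns true if the URL (after following redirects) answers with a 2xx
# status, false for anything else, including unparsable or non-HTTP URLs.
def link_alive?(url, redirects_left = MAX_REDIRECTS)
  return false if redirects_left.negative?

  uri = URI.parse(url)
  return false unless uri.is_a?(URI::HTTP) && uri.host

  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: 5, read_timeout: 5) do |http|
    http.head(uri.request_uri) # HEAD avoids downloading the body
  end

  case response
  when Net::HTTPSuccess
    true
  when Net::HTTPRedirection
    location = response['location']
    return false unless location

    # Location may be relative, so resolve it against the current URL.
    link_alive?(URI.join(url, location).to_s, redirects_left - 1)
  else
    false
  end
rescue URI::InvalidURIError, SocketError, SystemCallError,
       Timeout::Error, OpenSSL::SSL::SSLError
  false
end
```

Something like `urls.select { |u| link_alive?(u) }` would then keep only the links that resolve. Note that some servers reject HEAD requests, so a real checker may need to fall back to GET.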

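Author's Disclaimer

The code you reposted is mine, from CodeGnome/twitter_url_extractor. I deliberately did not do any link checking, because I was interested in extracting URLs, not in validating them.

"Works for me, your mileage may vary." (SM)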

Answer 2

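The problem is that your regex will include a trailing period, because you are indiscriminately matching any sequence of word characters, slashes, percent signs, hyphens (aka "minus") and periods. This will pick up a trailing period, which does in fact occur whenever a URL sits at the end of a sentence, plus anything that follows if people omit the space after that period, as CodeGnome correctly points out (https://stackoverflow.com/a/10788022/990363). You can partly alleviate this by excluding such trailing punctuation (note this will still be thrown off by non-URL text directly following the URL):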

http://\w+(?:[./%-]\w+)+$

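However, this will still miss a large share of existing URLs and catch a lot of invalid stuff: URLs are rather complex beasts (see http://www.w3.org/Addressing/URL/). If you are after a perfect match, John Gruber has posted a regex (http://daringfireball.net/2010/07/improved_regex_for_matching_urls) that matches anything used as a URL today, not just http(s). For the vast majority of plain web URLs, including the HTTPS variant, the following regex gets a bit closer: it makes sure you have a well-formed domain at the start, and it also captures query strings and fragment identifiers: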

https?://[\w-]+(?:\.[\w-]+)+(?:/[\w-]+)*(?:(?:[./%?=&#-]\w+)+)?

This will still catch invalid stuff and exclude a fair share of existing URLs (and an even greater share of valid URLs; see the RFC linked above), but it will get you closer.
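To make the difference concrete, here is a small sketch that runs the three patterns discussed in this thread over the same sample text. The constant names and the sample string are made up for illustration, and the `$` anchor is left off the second pattern here (an assumption on my part) so that `scan` can find URLs in the middle of a line.

```ruby
# Pattern from the question: periods are allowed anywhere, including at the end.
ORIGINAL_PATTERN = %r{http://[\w/.%-]+}i

# Punctuation is only accepted when more word characters follow it.
TRIMMED_PATTERN = %r{http://\w+(?:[./%-]\w+)+}i

# Also accepts https, requires a dotted domain, and captures query/fragment.
BROADER_PATTERN = %r{https?://[\w-]+(?:\.[\w-]+)+(?:/[\w-]+)*(?:(?:[./%?=&#-]\w+)+)?}i

# Hypothetical sample text containing the problem cases from this thread.
SAMPLE = 'See http://example.com/page. Also http://foo.Any more toast? ' \
         'And https://example.com/search?q=ruby#top.'

p SAMPLE.scan(ORIGINAL_PATTERN)
# => ["http://example.com/page.", "http://foo.Any"]

p SAMPLE.scan(TRIMMED_PATTERN)
# => ["http://example.com/page", "http://foo.Any"]

p SAMPLE.scan(BROADER_PATTERN)
# => ["http://example.com/page", "http://foo.Any", "https://example.com/search?q=ruby#top"]
```

The sentence-ending period disappears with the second and third patterns, but `http://foo.Any` is still captured by all three, because no regex can tell that `Any` was meant to be the next word rather than part of the host name.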

Answer 3

Why not use Ruby's URI.extract? It comes bundled with Ruby.

From the documentation:

Synopsis

URI::extract(str[, schemes][,&blk])

Args

str     String to extract URIs from.
schemes Limit URI matching to a specific schemes.

Description

Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.

Usage

require "uri"

URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.")
# => ["http://foo.example.org/bla", "mailto:test@example.com"]

If you only want HTTP URLs:

[3] (pry) main: 0> URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.", %w[http])
=> ["http://foo.example.org/bla"]
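Applied to the question's use case, this could replace the hand-rolled regex along the following lines. It is only a sketch: the helper name `extract_http_urls` and the sample strings are invented for the example, and newer Ruby versions mark `URI.extract` as obsolete (it prints a warning when `$VERBOSE` is set), so treat it as a convenience rather than a long-term API.

```ruby
require 'uri'

# Hypothetical helper: collect the unique http/https URLs found in a list
# of strings (for example, the text of each tweet).
def extract_http_urls(texts)
  texts.flat_map { |text| URI.extract(text, %w[http https]) }.uniq
end

# Made-up sample input standing in for tweet texts.
texts = [
  'text here http://foo.example.org/bla and here mailto:test@example.com',
  'again http://foo.example.org/bla and https://bar.example.org/x?y=1 too'
]

p extract_http_urls(texts)
# => ["http://foo.example.org/bla", "https://bar.example.org/x?y=1"]
```

Passing the scheme list filters out the `mailto:` address, and `uniq` drops the repeated link, much like the `.scan( URL_PATTERN ).uniq` chain in the question.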
