Getting Wikipedia infoboxes in a format that Ruby can understand

I am trying to get the data from Wikipedia's infoboxes into a hash or a similar structure so that I can use it in my Ruby on Rails program. Specifically, I'm interested in Infobox company and Infobox person. The example I have been using is "Ford Motor Company". I want to get the company info for that article and the person info for the people linked to in Ford's company box.

I've tried figuring out how to do this with the Wikipedia API or DBpedia, but I haven't had much luck. I know Wikipedia can return some things as JSON, which I could parse with Ruby, but I haven't been able to figure out how to get the infobox. In the case of DBpedia, I am lost on how to even query it to get the info for Ford Motor Company.

Best answer

I vote for DBpedia.

A simple explanation is:

The DBpedia naming scheme is http://dbpedia.org/resource/WikipediaArticleName (the unique identifier), with spaces replaced by _.

http://dbpedia.org/page/ArticleName is the HTML preview, and http://dbpedia.org/data/ArticleName with a .json or .jsod extension is the JSON representation of the information about the article you want. (.rdf etc. might be confusing for you right now.)

For Ford Motor Company you should ask for:

http://dbpedia.org/data/Ford_Motor_Company.json

or:

http://dbpedia.org/data/Ford_Motor_Company.jsod

(Whichever is simpler for you)
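
As a rough sketch of the naming scheme above, the following Ruby builds the data URL from an article title and fetches it with open-uri. The helper names `dbpedia_data_url` and `fetch_dbpedia` are made up for illustration, and the fetch assumes DBpedia still serves JSON at that address.

```ruby
require 'json'
require 'open-uri'

# Hypothetical helper: turn a Wikipedia article title into a DBpedia data URL.
# Spaces become underscores, as described above. Titles containing other
# special characters may need extra escaping, which this sketch does not do.
def dbpedia_data_url(article_name, ext = 'json')
  "http://dbpedia.org/data/#{article_name.strip.tr(' ', '_')}.#{ext}"
end

# Hypothetical helper: download the document and parse it into a Ruby hash.
def fetch_dbpedia(article_name)
  JSON.parse(URI.open(dbpedia_data_url(article_name), &:read))
end

puts dbpedia_data_url('Ford Motor Company')
# data = fetch_dbpedia('Ford Motor Company')   # needs network access
```

The URL builder is kept separate from the download so it can be checked without a network connection.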

Now, depending on the article type (person or company), different properties describe the resource; these are defined by the DBpedia ontology (http://wiki.dbpedia.org/Ontology).
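
To illustrate how such properties might be read once the JSON is parsed, here is a minimal sketch. It assumes the document is laid out as resource URI, then property URI, then a list of entries with a "value" field; the sample data is hand-written rather than real DBpedia output, and both the `foundedBy` property URI and the helper `property_values` are assumptions for illustration.

```ruby
require 'json'

# Hand-written sample in the assumed layout; not real DBpedia output.
SAMPLE = JSON.parse(<<~DATA)
  {
    "http://dbpedia.org/resource/Ford_Motor_Company": {
      "http://dbpedia.org/ontology/foundedBy": [
        { "type": "uri", "value": "http://dbpedia.org/resource/Henry_Ford" }
      ],
      "http://www.w3.org/2000/01/rdf-schema#label": [
        { "type": "literal", "value": "Ford Motor Company", "lang": "en" }
      ]
    }
  }
DATA

# Hypothetical helper: collect the values of one property for one resource.
# Returns an empty array when the resource or property is missing.
def property_values(data, resource, property)
  data.fetch(resource, {}).fetch(property, []).map { |entry| entry['value'] }
end

resource = 'http://dbpedia.org/resource/Ford_Motor_Company'
founders = property_values(SAMPLE, resource, 'http://dbpedia.org/ontology/foundedBy')
p founders
```

Values of type "uri" point at other resources, so the same lookup could be repeated for each linked person.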

A more advanced step could be to use SPARQL queries to get your data.
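
A SPARQL request could look roughly like the sketch below, which only builds the request URL. It assumes the public endpoint is http://dbpedia.org/sparql and that it accepts `query` and `format` parameters; the helper `sparql_url` is a made-up name.

```ruby
require 'uri'

# Assumed public endpoint; check the DBpedia documentation before relying on it.
SPARQL_ENDPOINT = 'http://dbpedia.org/sparql'

# Hypothetical helper: build a GET URL that asks for JSON-formatted results.
def sparql_url(query)
  params = URI.encode_www_form(
    'query'  => query,
    'format' => 'application/sparql-results+json'
  )
  "#{SPARQL_ENDPOINT}?#{params}"
end

# List up to 50 property/value pairs for the Ford Motor Company resource.
QUERY = <<~SPARQL
  SELECT ?property ?value WHERE {
    <http://dbpedia.org/resource/Ford_Motor_Company> ?property ?value .
  } LIMIT 50
SPARQL

url = sparql_url(QUERY)
puts url
```

The resulting URL can be fetched with open-uri and the response parsed with JSON.parse, in the same way as the data URLs above.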

Other answers

Don't try to parse HTML with regular expressions.

See: RegEx match open tags except XHTML self-contained tags

Use XPath or something similar.

I looked at their API, and it looks like there's a lot of detail, but the complexity is a hurdle. For long-term use it'd be best to figure it out, but for quick and dirty, here's a way to get at the data.

I'm using Nokogiri, which is an XML/HTML parser and very flexible. For ease of use I'm using CSS accessors.

#!/usr/bin/env ruby

require 'open-uri'
require 'nokogiri'
require 'uri'

URL = 'http://en.wikipedia.org/wiki/Ford_Motor_Company'
# URI.open is needed on Ruby 3.0+, where a bare open() no longer fetches URLs
doc = Nokogiri::HTML(URI.open(URL))
infobox = doc.at('table[class="infobox vcard"]')
infobox_caption = infobox.at('caption').text

# Map each linked name in the "agent" cells to its absolute URL
uri = URI.parse(URL)
infobox_agents = Hash[ *infobox.search('td.agent a').map{ |a| [ a.text, uri.merge(a['href']).to_s ] }.flatten ]

# ap is the pretty-printer from the awesome_print gem
require 'ap'
ap infobox_caption
ap infobox_agents

The output looks like:

"Ford Motor Company"
{
              "Henry Ford" => "http://en.wikipedia.org/wiki/Henry_Ford",
    "William C. Ford, Jr." => "http://en.wikipedia.org/wiki/William_Clay_Ford,_Jr.",
      "Executive Chairman" => "http://en.wikipedia.org/wiki/Chairman",
        "Alan R. Mulally" => "http://en.wikipedia.org/wiki/Alan_Mulally",
              "President" => "http://en.wikipedia.org/wiki/President",
                    "CEO" => "http://en.wikipedia.org/wiki/Chief_executive_officer"
}

So, it's pulled the text of the caption and returned a hash of the people, where the keys are their names and the values are the URLs.

You can use open-uri to download the HTML of a wiki page and then extract the fields with a Regexp. For example:

require 'open-uri'

infobox = {}
# URI.open is needed on Ruby 3.0+, where a bare open() no longer fetches URLs
URI.open('http://en.wikipedia.org/wiki/Wikipedia') do |page|
  # Capture each header cell and the data cell that follows it
  page.read.scan(/<th scope="row" style="text-align:left;">(.*?)<\/th>.<td class="" style="">(.*?)<\/td>/m) do |key, value|
    infobox[key.gsub(/<.*?>/, '').strip] = value.gsub(/<.*?>/, '').strip # Removes tags (such as hyperlinks)
  end
end
infobox["Slogan"]                #=> "The free encyclopedia that anyone can edit."
infobox["Available language(s)"] #=> "257 active editions (276 in total)"

There is probably a better method, but this works.




