Question

I am trying to get the data from Wikipedia's infoboxes into a hash or something so that I can use it in my Ruby on Rails program. Specifically I'm interested in the Infobox company and Infobox person. The example I have been using is "Ford Motor Company". I want to get the company info for that and the person info for the people linked to in Ford's company box.

I've tried figuring out how to do this from the Wikipedia API or DBpedia but I haven't had much luck. I know Wikipedia can return some things as JSON, which I could parse with Ruby, but I haven't been able to figure out how to get the infobox. In the case of DBpedia I am kind of lost on how to even query it to get the info for Ford Motor Company.

Answer 1

I vote for DBpedia.

A simple explanation is:

The dbpedia naming scheme is http://dbpedia.org/resource/WikipediaArticleName (unique identifier) with spaces replaced by _.

http://dbpedia.org/page/ArticleName is the HTML preview, and http://dbpedia.org/data/ArticleName(.json/.jsod) gives the JSON representation of the information about the article you want. (.rdf etc. might be confusing for you right now.)

For Ford Motor Company you should ask for:

http://dbpedia.org/data/Ford_Motor_Company.json

or:

http://dbpedia.org/data/Ford_Motor_Company.jsod

(Whichever is simpler for you)
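
As a rough sketch of how that could look from Ruby, the snippet below builds the data URL from an article title and pulls one resource's properties out of the parsed JSON. The helper names (dbpedia_name, dbpedia_data_url, properties_for, fetch_properties) are made up for illustration. The response layout it expects (resource URI, then property URI, then a list of objects with a "value" key) is an assumption, so check it against a real response before relying on it.

```ruby
require 'json'
require 'open-uri'

# Hypothetical helper: Wikipedia title -> DBpedia name (spaces become underscores).
def dbpedia_name(title)
  title.strip.tr(' ', '_')
end

# Hypothetical helper: URL of the JSON data for an article.
def dbpedia_data_url(title)
  "http://dbpedia.org/data/#{dbpedia_name(title)}.json"
end

# Assumes data is shaped like:
#   { resource_uri => { property_uri => [ { "value" => ... }, ... ] } }
# Returns { property_uri => [value, ...] } for the article's own resource.
def properties_for(data, title)
  resource = "http://dbpedia.org/resource/#{dbpedia_name(title)}"
  data.fetch(resource, {}).transform_values do |objects|
    objects.map { |object| object['value'] }
  end
end

# Performs the network request; defined here but not called.
def fetch_properties(title)
  json_text = URI.open(dbpedia_data_url(title)).read
  properties_for(JSON.parse(json_text), title)
end
```

Calling fetch_properties('Ford Motor Company') would request http://dbpedia.org/data/Ford_Motor_Company.json and return a plain Ruby hash keyed by property URI.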

Now, depending on the article type (person or company), different properties describe it; these are defined by the DBpedia ontology (http://wiki.dbpedia.org/Ontology).

A more advanced step could be to use SPARQL queries to get your data.
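
For illustration, a SPARQL request might be put together as below. This is a sketch rather than a tested client: it assumes the public endpoint at http://dbpedia.org/sparql accepts `query` and `format` parameters, and that results come back in the standard SPARQL JSON results layout (results, then bindings). The helper names are hypothetical.

```ruby
require 'json'
require 'uri'
require 'open-uri'

SPARQL_ENDPOINT = 'http://dbpedia.org/sparql' # assumed public endpoint

# Hypothetical helper: query for every property/value pair of one resource.
def resource_query(name)
  "SELECT ?property ?value WHERE { <http://dbpedia.org/resource/#{name}> ?property ?value }"
end

# Builds the request URL, asking for SPARQL JSON results.
def sparql_url(query)
  params = URI.encode_www_form(query: query, format: 'application/sparql-results+json')
  "#{SPARQL_ENDPOINT}?#{params}"
end

# Flattens SPARQL JSON results into an array of { variable => value } hashes.
def bindings_to_rows(json_text)
  bindings = JSON.parse(json_text).dig('results', 'bindings') || []
  bindings.map do |row|
    row.transform_values { |cell| cell['value'] }
  end
end

# Performs the network request; defined here but not called.
def run_query(query)
  bindings_to_rows(URI.open(sparql_url(query)).read)
end
```

With this in place, run_query(resource_query('Ford_Motor_Company')) would return one hash per result row, with "property" and "value" keys.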

Answer 2

Don't try to parse HTML with RegExp.

See: RegEx match open tags except XHTML self-contained tags

Use xpath or something similar.

Answer 3

I looked at their API, and it looks like there's a lot of detail, but the complexity is a hurdle. For long-term use it'd be best to figure it out, but for quick and dirty, here's a way to get at the data.

I'm using Nokogiri, which is an XML/HTML parser, and very flexible. For ease of use I'm using CSS accessors.

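#!/usr/bin/env ruby

require 'open-uri'
require 'nokogiri'
require 'uri'

URL = 'http://en.wikipedia.org/wiki/Ford_Motor_Company'
# URI.open is needed on Ruby 3.0+, where a bare open no longer accepts URLs.
doc = Nokogiri::HTML(URI.open(URL))
infobox = doc.at('table[class="infobox vcard"]')
infobox_caption = infobox.at('caption').text

# Resolve each relative href against the page URL to get absolute links.
uri = URI.parse(URL)
infobox_agents = Hash[ *infobox.search('td.agent a').map{ |a| [ a.text, uri.merge(a['href']).to_s ] }.flatten ]

require 'ap' # awesome_print gem, used only for pretty output
ap infobox_caption
ap infobox_agents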

The output looks like:

"Ford Motor Company"
{
              "Henry Ford" => "http://en.wikipedia.org/wiki/Henry_Ford",
    "William C. Ford, Jr." => "http://en.wikipedia.org/wiki/William_Clay_Ford,_Jr.",
      "Executive Chairman" => "http://en.wikipedia.org/wiki/Chairman",
         "Alan R. Mulally" => "http://en.wikipedia.org/wiki/Alan_Mulally",
               "President" => "http://en.wikipedia.org/wiki/President",
                     "CEO" => "http://en.wikipedia.org/wiki/Chief_executive_officer"
}

So, it's pulled the text of the caption, and returned a hash built from the links in the infobox's agent cells, where the keys are the link texts (people's names and their titles) and the values are the URLs.

Answer 4

You can use open-uri to download the HTML of a wiki page and then extract the fields with a Regexp. For example:

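require 'open-uri'
infobox = {}
# URI.open is needed on Ruby 3.0+, where a bare open no longer accepts URLs.
URI.open('http://en.wikipedia.org/wiki/Wikipedia') do |page|
  # %r{} avoids having to escape the slashes in the closing tags.
  page.read.scan(%r{<th scope="row" style="text-align:left;">(.*?)</th>.<td class="" style="">(.*?)</td>}m) do |key, value|
    infobox[key.gsub(/<.*?>/, '').strip] = value.gsub(/<.*?>/, '').strip # Removes tags (such as hyperlinks)
  end
end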
infobox["Slogan"]                #=> "The free encyclopedia that anyone can edit."
infobox["Available language(s)"] #=> "257 active editions (276 in total)"

There should be a better method, but this works.
