Getting a large number (but not all) of Wikipedia pages

For an NLP project of mine, I want to download a large number of pages (say, 10,000) at random from Wikipedia. Without downloading the entire XML dump, this is what I can think of:

  1. Open a Wikipedia page
  2. Parse the HTML for links in a Breadth First Search fashion and open each page
  3. Recursively open links on the pages obtained in step 2

In steps 2 and 3, I will quit once I have reached the number of pages I want.

How would you do it? Please suggest any better ideas you can think of.
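
A rough sketch of the crawl I have in mind (the seed page, the link pattern, and the one-second delay are just placeholders):

# BFS crawl sketch -- seed page, link regex and delay are placeholders.
import re
import time
import urllib2
from collections import deque

TARGET = 10000
seen = set()
queue = deque(["/wiki/Main_Page"])   # assumed starting point

while queue and len(seen) < TARGET:
    path = queue.popleft()
    if path in seen:
        continue
    seen.add(path)
    req = urllib2.Request("http://en.wikipedia.org" + path,
                          headers={"User-agent": "Mozilla/5.0"})
    html = urllib2.urlopen(req).read()
    # ... save html to disk here ...
    # Queue outgoing article links; skip Special:, File:, anchors, etc.
    for link in re.findall(r'href="(/wiki/[^":#]+)"', html):
        queue.append(link)
    time.sleep(1)   # be polite to the servers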

ANSWER: This is my Python code:

# Get 10000 random pages from Wikipedia.
import urllib2
import os
import shutil

# Make a fresh directory to store the HTML pages.
if os.path.exists("randompages"):
    print "Deleting the old randompages directory"
    shutil.rmtree("randompages")

os.mkdir("randompages")
print "Created the directory for storing the pages"

num_page = raw_input("Number of pages to retrieve: ")

for i in range(0, int(num_page)):
    opener = urllib2.build_opener()
    opener.addheaders = [("User-agent", "Mozilla/5.0")]
    infile = opener.open("http://en.wikipedia.org/wiki/Special:Random")

    page = infile.read()

    # Write it to a file.
    # TODO: Strip HTML from page
    f = open("randompages/file" + str(i) + ".html", "w")
    f.write(page)
    f.close()

    print "Retrieved and saved page", i + 1
Best answer
for i = 1 to 10000
    get "http://en.wikipedia.org/wiki/Special:Random"
Other answers

Wikipedia has an API. With this API you can get random articles in a given namespace:

http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=5

and for each article you can also get the wiki text:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Main%20Page&rvprop=content
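
Putting the two calls together in Python (this assumes you append format=json to each URL; the key names below match the classic JSON result shape):

# Combine the two API calls above; format=json is added so the result
# can be parsed with the json module.
import json
import urllib
import urllib2

API = "http://en.wikipedia.org/w/api.php"

# 1. Get 5 random main-namespace article titles.
rand = json.load(urllib2.urlopen(
    API + "?action=query&list=random&rnnamespace=0&rnlimit=5&format=json"))
titles = [r["title"] for r in rand["query"]["random"]]

# 2. Fetch the wiki text of each title.
for title in titles:
    url = (API + "?action=query&prop=revisions&rvprop=content&format=json"
           + "&titles=" + urllib.quote(title.encode("utf-8")))
    data = json.load(urllib2.urlopen(url))
    for page in data["query"]["pages"].values():
        print title, len(page["revisions"][0]["*"])   # wikitext is under "*"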

I'd go the opposite way: start with the XML dump, and then throw away what you don't want.

In your case, if you are looking to do natural language processing, I would assume that you are interested in pages that have complete sentences, and not lists of links. If you spider the links in the manner you describe, you'll be hitting a lot of link pages.

And why avoid the XML, when you get the benefit of XML parsing tools that will make your selection process easier?
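
For example, the dump can be streamed rather than loaded whole; here is a sketch with ElementTree (the filename, the export namespace version 0.10, and the prose-length cutoff are assumptions, so check the root element of your dump):

# Stream a pages-articles dump and keep main-namespace pages only.
# Filename, namespace version and length threshold are assumptions.
import bz2
import xml.etree.cElementTree as etree

NS = "{http://www.mediawiki.org/xml/export-0.10/}"
dump = bz2.BZ2File("enwiki-latest-pages-articles.xml.bz2")

kept = 0
for event, elem in etree.iterparse(dump):
    if elem.tag == NS + "page":
        ns = elem.findtext(NS + "ns")
        text = elem.findtext(NS + "revision/" + NS + "text") or ""
        # Keep articles that look like prose, not short link lists.
        if ns == "0" and len(text) > 2000:
            kept += 1
            # ... save elem.findtext(NS + "title") and text here ...
        elem.clear()   # free memory as we go
        if kept >= 10000:
            break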

You may be able to do an end run around most of the requirement:

http://cs.fit.edu/~mmahoney/compression/enwik8.zip

is a ZIP file containing 100 MB of Wikipedia, already pulled out for you. The linked file is ~16 MB in size.
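
Fetching and unpacking it is only a few lines; a sketch (the archive should contain a single file named enwik8):

# Download and unpack the enwik8 sample linked above.
import urllib2
import zipfile

data = urllib2.urlopen("http://cs.fit.edu/~mmahoney/compression/enwik8.zip").read()
f = open("enwik8.zip", "wb")
f.write(data)
f.close()
zipfile.ZipFile("enwik8.zip").extractall(".")   # should yield a file "enwik8"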

I know it has been a long time, but for those who are still looking for an efficient way to crawl and download a large number of Wikipedia pages (or all of Wikipedia) without violating robots.txt, the Webb library is useful. Here is the link:

Webb Library for Web Crawling and Scraping

Look at the DBpedia project.

There are small downloadable chunks with at least some of the article URLs. Once you have parsed 10,000 of them, you can batch-download them carefully ...
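
A sketch of that careful batch download (the URL list filename and the one-second delay are assumptions):

# Politely batch-download article URLs parsed from a DBpedia chunk.
import time
import urllib2

urls = [line.strip() for line in open("article_urls.txt")][:10000]

for i, url in enumerate(urls):
    req = urllib2.Request(url, headers={"User-agent": "Mozilla/5.0"})
    page = urllib2.urlopen(req).read()
    f = open("randompages/file" + str(i) + ".html", "w")
    f.write(page)
    f.close()
    time.sleep(1)   # throttle so the batch stays polite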




