For an NLP project of mine, I want to download a large number of pages (say, 10,000) from Wikipedia at random. Without downloading the entire XML dump, this is what I can think of:
1. Open a Wikipedia page
2. Parse the HTML for links in a breadth-first-search fashion and open each page
3. Recursively open the links on the pages obtained in step 2
In steps 2 and 3, I will stop once I have reached the number of pages I want.
How would you do it? Can you suggest any better ideas?
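For reference, here is a rough sketch of the crawl I have in mind, assuming Python 2 and only the standard library. The start URL, the crude regex used to pick out article links, and the page limit are illustrative choices only, not a tested crawler.

import collections
import re
import urllib2

def bfs_crawl(start_url, max_pages):
    # Breadth-first crawl: fetch a page, queue its article links, repeat.
    seen = set([start_url])
    queue = collections.deque([start_url])
    pages = []
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = opener.open(url).read()
        except urllib2.URLError:
            continue  # skip pages that fail to load
        pages.append((url, html))
        # Very rough link extraction: keep /wiki/Article links, drop
        # namespaced pages (File:, Help:, ...) and fragment links.
        for path in re.findall(r'href="(/wiki/[^"#:]+)"', html):
            link = 'http://en.wikipedia.org' + path
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# Example: collect 10 pages starting from the Main Page.
pages = bfs_crawl('http://en.wikipedia.org/wiki/Main_Page', 10)
print 'Fetched', len(pages), 'pages'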
ANSWER: This is my Python code:
# Get random pages from Wikipedia via Special:Random.
import urllib2
import os
import shutil

# Make (or remake) the directory that stores the HTML pages.
if os.path.exists('randompages'):
    print "Deleting the old randompages directory"
    shutil.rmtree('randompages')
os.mkdir('randompages')
print "Created the directory for storing the pages"

num_pages = raw_input("Number of pages to retrieve: ")

# Set a User-agent header; Wikipedia may reject the default urllib2 one.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

for i in range(int(num_pages)):
    # Special:Random redirects to a random article; urllib2 follows it.
    infile = opener.open('http://en.wikipedia.org/wiki/Special:Random')
    page = infile.read()

    # Write it to a file.
    # TODO: Strip HTML from page
    f = open('randompages/file' + str(i) + '.html', 'w')
    f.write(page)
    f.close()
    print "Retrieved and saved page", i + 1