Screen Scraping Wikipedia with Hpricot

Posted by shane
on Sunday, October 01

Wikipedia is one of the few sites that should have an API but doesn’t. It’s a shame, considering it is one of the best sources of free, quality content. Because of this limitation, we have to resort to the prehistoric art of screen scraping. _why’s Hpricot is my favorite tool for the job. It uses a fast HTML scanner written in C using Ragel, the same technology that makes Mongrel so fast. It lets you query the parsed HTML with either CSS selectors or XPath, in a similar vein to jQuery.

The few Wikipedia clients out there only output data as Wikitext or plain text. I wanted something that would reproduce the content with its basic styling intact, so it can be republished in a similar fashion to Answers.com. Here is the code to do this:

require 'hpricot'
require 'open-uri'

items_to_remove = [
  "#contentSub",        #redirection notice
  "div.messagebox",     #cleanup data
  "#siteNotice",        #site notice
  "#siteSub",           #"From Wikipedia..." 
  "table.infobox",      #sidebar box
  "#jump-to-nav",       #jump-to-nav
  "div.editsection",    #edit blocks
  "table.toc",          #table of contents 
  "#catlinks"           #category links
  ]

doc = Hpricot open('wikipedia url')
@article = (doc/"#content").each do |content|
  #change /wiki/ links to point to full wikipedia path
  (content/:a).each do |link|
    href = link.attributes['href']
    #sub! rewrites the href string in place
    href.sub!('/wiki/', 'http://en.wikipedia.org/wiki/') if href && href[0..5] == "/wiki/"
  end

  #remove unnecessary content and edit links
  items_to_remove.each { |x| (content/x).remove }

  #replace links to create new entries with plain text
  (content/"a.new").each do |link|
    link.parent.insert_before Hpricot.make(link.attributes['title']), link
  end.remove
end 

puts @article.inner_html
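
The link-rewriting step can also be pulled out into a small plain-Ruby helper, which makes it easier to try without fetching a page. This is only a sketch: the name absolutize_wiki_href and its lang parameter are my own inventions for illustration and are not part of Hpricot, and the host pattern assumes the usual http://<lang>.wikipedia.org layout.

```ruby
# Hypothetical helper (not part of Hpricot): turns a site-relative
# /wiki/ href into an absolute URL on a given Wikipedia language edition.
# The lang parameter is an assumption added for illustration.
def absolutize_wiki_href(href, lang = 'en')
  # Leave missing hrefs and anything that is not a /wiki/ path untouched
  return href if href.nil? || href[0..5] != '/wiki/'
  "http://#{lang}.wikipedia.org#{href}"
end

puts absolutize_wiki_href('/wiki/Ragel')
puts absolutize_wiki_href('/wiki/Ragel', 'de')
puts absolutize_wiki_href('http://www.example.com/')
```

Unlike the sub! call in the script, this returns a new string instead of editing the attribute in place, so inside the loop the result would have to be assigned back to the link's href.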

For comparison, here is a Wikipedia article scraped using this script, and its cousin at Answers.com. Here is the original Wikipedia article. There are still a few things to be worked out, like filtering out JavaScript and working with other languages. But I will leave that as an exercise for the reader!

If you are going to use this script, please don’t forget to give credit to Wikipedia and include a link to the original article. The scraped page already seems to include a Notes section with a link back to the article, which doesn’t appear on the original site. Wikipedia must be checking the user agent and adding that section when it detects an unknown browser. You still have to include a link to the GNU Free Documentation License.

Update: _why makes some suggestions to improve this script, and adds a new method swap which eliminates the ugly end.remove syntax.

Comments


  1. Wikipedia API, October 02, 2006 @ 05:18 AM
    Wikipedia does have an API. See [en.wikipedia.org/w/query.php](http://en.wikipedia.org/w/query.php) for an example and [meta.wikimedia.org/wiki/Query](http://meta.wikimedia.org/wiki/Query) for details.
  2. Andy, October 02, 2006 @ 05:57 AM
    Regardless of whether there is an API or not, I believe that this is useful as it provides a helpful, real-world example of Hpricot usage.
  3. Brian Eng, October 02, 2006 @ 08:19 AM
    Good stuff man!
  4. shane, October 03, 2006 @ 01:10 PM
    Commenter #1, thanks for the links. Unfortunately that API doesn't give what I want either. I would have to download a page in XML (or one of the supported formats), process it, and transform it to HTML. Also, according to the 2nd link: "pagenames in the output are not linked." So it looks like that approach would be more work. Hpricot meets my needs in this case.
  5. derek, December 16, 2006 @ 03:22 PM
    A good pairing for this would be the FUTEF Wikipedia Search API - http://api.futef.com/apidocs.html - which allows for basic keyword searches, fielded searches, and faceted searching/browsing/navigation. Additionally, all of it is returned in easy-to-use JSON.