I have needed to scrape various data lately and went looking for a good Ruby library that makes it quick and easy to write small code snippets for grabbing the data I need. I found the Nokogiri gem to be exceptional for this purpose. I also found SelectorGadget, a bookmarklet that pinpoints the CSS selector you need to grab a data item. You could also use Google Chrome’s Inspector, but this gadget is super easy to use.
Prerequisites:
1. Make sure Ruby is installed on your machine. If you are an OS X or Linux user you should be set (Windows users will have to Google instructions for installing Ruby and RubyGems).
2. Open a console window.
3. Run: sudo gem install nokogiri
4. Go to the SelectorGadget website and follow the drag-n-drop instructions for installing the bookmarklet.
Here is a quick example
First we need to get the CSS selector that we will use to pull data out of the page.
Let’s look at the page below which is from the following URL: http://www.trackinfo.com/dog.jsp?runnername=noble+sky

Greyhound Dog Info Page
One cool thing about adopting a retired racer is that there is data on the web about your dog. Each retired racer is uniquely identified by two strings of data: the tattoo in their left ear is their litter id, and the tattoo in their right ear encodes their birth date and their position in the litter. In this case my retired racer was born in litter 46422, and his right ear tattoo (28E) indicates that he was born in February 2008 (28) and was the 5th pup born in the litter (E). Not that useful but a fun fact.
Note: if you are interested in adopting a retired racer, please check out the web site of an adoption group in your area. They really are the best dogs. I have some greyhound info on a Pinterest board.
So today let’s scrape and print the name of any retired racer that is sent to the script, and also use the litter id (left ear tattoo) to list the names of all of the litter mates.
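To make the right ear tattoo concrete, here is a minimal sketch of a decoder. It assumes the layout generalizes from my one example as birth month, then the last digit of the birth year, then a letter for the position in the litter. The names RightEarTattoo and decode_right_ear are made up for illustration, and a single year digit cannot tell you the decade.

```ruby
# Sketch only: assumes the tattoo layout is
# <birth month 1-12><last digit of birth year><litter-position letter>.
RightEarTattoo = Struct.new(:month, :year_digit, :litter_position)

def decode_right_ear(tattoo)
  match = /\A(1[0-2]|[1-9])(\d)([A-Z])\z/.match(tattoo.strip.upcase)
  raise ArgumentError, "unrecognized tattoo: #{tattoo.inspect}" unless match

  month = match[1].to_i
  year_digit = match[2].to_i
  # "A" is the first pup, "B" the second, and so on.
  litter_position = match[3].ord - "A".ord + 1
  RightEarTattoo.new(month, year_digit, litter_position)
end

info = decode_right_ear("28E")
puts "month=#{info.month}, year ends in #{info.year_digit}, pup number #{info.litter_position}"
```

For 28E this gives month 2, a year ending in 8, and pup number 5, which matches the February 2008 reading.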
This will take two requests, one for the dog info, and one for the litter info. The litter information web page looks like this…

Greyhound Litter Information Page
Using the Selector Gadget
1. In your browser go to the first URL: http://www.trackinfo.com/dog.jsp?runnername=noble+sky
2. Click on your “SelectorGadget” bookmarklet in your bookmark bar.
3. Click on the number to the right of the Left Tattoo label, it is 46422 in my example.
4. You should see a green highlight on the item you selected and a lot of yellow on other items. We will see later with the litter page that selecting more than the item you clicked makes sense when you want multiple similar items, such as the rows of a table. Here we only want the number we selected, so next click on the yellow boxes until they turn red, leaving only 46422 highlighted in green (ignore the yellow that follows your cursor around).
5. If you did this correctly (and they haven’t changed the web page structure since I wrote this) you should see the selector in a small box near the bottom of your browser that lists the selector: tr:nth-child(2) .it2 a
Now that we have a selector, let’s look at the code snippet for scraping the page:
# Code snippet for demonstrating the use of Nokogiri
# The code will look up your retired racing greyhound by its racing name
# and return info about your dog as well as a list of its litter mates.
#
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
# Use the dog's name for lookup of dog information
dogname = ARGV[0] || "Noble Sky"
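# URI.escape and open() on a URL were removed in Ruby 3.0, so use their replacements.
get_dog_info_url = "http://www.trackinfo.com/dog.jsp?runnername=" + URI.encode_www_form_component(dogname)
dog_doc = Nokogiri::HTML(URI.open(get_dog_info_url))
# Get the left ear tag, which is also the litter id used to identify all litter mates.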
left_ear_tag = dog_doc.css("tr:nth-child(2) .it2 a").text
name = dog_doc.css("h1").text
right_ear_tag = dog_doc.css("tr:nth-child(2) .it4").text
starts = dog_doc.css(".detail-wrap tr:nth-child(4) td:nth-child(2)").text
dob = dog_doc.css(".it4 em").text
sex_color = dog_doc.css(".it2 em").text
sex = sex_color[0,1]
color = sex_color[6,4]
sire = dog_doc.css("tr:nth-child(1) .it2 a").text
dam = dog_doc.css(".it4 a").text
# Look up all litter mates
get_litter_info_url = "http://www.trackinfo.com/dog-search.jsp?keyword=" + URI.encode_www_form_component(left_ear_tag) + "&by=ltattoo"
litter_doc = Nokogiri::HTML(URI.open(get_litter_info_url))
puts("Your dog #{dogname} has the following siblings:")
# Parse the resulting document for key data elements.
# Then print the info to the console.
litter_doc.css(".item-table tbody tr").each do |tr|
  if tr.css(".name").text.casecmp(dogname) != 0
    puts(tr.css(".name").text)
  end
end
You can grab the snippet of code from GitHub Gist and save it to a file called gh_lookup.rb, or just copy and paste the snippet above into your favorite editor to try it out.
The first part of the code is the call to get the dog information page.

1. This is pretty straightforward. Use the argument passed in as the dog name, or “Noble Sky” if no dog was passed in.
2. Create the URL string, making sure to escape the dog name using Ruby’s uri library.
3. Request the document with open-uri and parse it with the Nokogiri gem.
In the next section of code we simply parse out the items we want using the CSS selectors we got from the SelectorGadget earlier. The first is the left ear tattoo, whose selector we worked out above. The rest you can verify yourself.
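Here is the URL-building step on its own, as a small sketch you can run without hitting the site. URI.escape was removed in Ruby 3.0, so this uses URI.encode_www_form_component from the standard library, which turns the space into a plus sign just like the URL at the top of this post. BASE_URL and dog_info_url are names I made up for the example.

```ruby
require 'uri'

BASE_URL = "http://www.trackinfo.com/dog.jsp"

# Build the lookup URL, escaping the dog's name for use in a query string.
def dog_info_url(dogname)
  "#{BASE_URL}?runnername=#{URI.encode_www_form_component(dogname)}"
end

puts dog_info_url("Noble Sky")
```

Escaping matters for more than spaces: a name containing an ampersand would otherwise be read as the start of a second query parameter.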

The final section of code makes another request, parsed with the Nokogiri gem, to get a list of dogs from the identified litter (46422), and loops over the rows to print out the name of each of the other dogs.
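Run the Ruby script in your console window, passing the dog’s racing name in quotes as the only argument. With no argument it looks up “Noble Sky”.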

That’s it, pretty easy. Don’t let the lack of a clean API stop you from getting data when it is this easy to grab what you need.
Possible Additional Steps:
Simply printing to the console is not very useful. For real projects you probably want to store the data somewhere. I typically take one of two paths:
1. Store the data as objects in a persistent store like a SQL database. In this case you would use the data to create “Dog” objects. In Rails the Dog would be a model, and this code could go in the model for Dog.create(), or possibly Dog.find() if you were using find like a cache: check your DB first, and scrape only if the dog isn’t there. Live scraping like this isn’t always a good idea for a live multiuser service, but it works well for many cases.
2. Lately I have been using a NoSQL database for this type of data collection. CouchDB is usually my choice because it has a web interface to the db. CouchDB is a document database, so you could save dogs as documents and then query them later. For this case you would probably also save a litter document, which would serve to connect the dog hierarchy if it included all dogs born in the litter plus the dam and sire.
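The “check your store first, scrape only on a miss” idea from the first path can be sketched in plain Ruby. Everything here is hypothetical: DogCache is a made-up class, the Hash stands in for a SQL table or a document database, and the scraper is any callable you pass in (in the test run below it is a fake that never touches the network).

```ruby
# Sketch of a cache-first lookup. Names are illustrative, not a real API.
class DogCache
  def initialize(scraper)
    @scraper = scraper  # callable: takes a dog name, returns a Hash of dog data
    @store = {}         # stand-in for a real persistent store
  end

  # Return the stored record, scraping and saving it only on a miss.
  def find(name)
    key = name.downcase
    @store[key] ||= @scraper.call(name)
  end
end

scrape_count = 0
fake_scraper = lambda do |name|
  scrape_count += 1
  { name: name, left_ear_tag: "46422", right_ear_tag: "28E" }
end

cache = DogCache.new(fake_scraper)
cache.find("Noble Sky")
cache.find("noble sky")  # served from the store, no second scrape
puts scrape_count
```

Passing the scraper in keeps the storage logic testable without the site being up, which is handy given that the page structure can change at any time.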
JavaScript Alternatives
If you prefer JavaScript to Ruby you probably want to use NodeJS with jQuery to achieve the same goal. If I get a chance I will create and post a version written in NodeJS/jQuery so you can see that version as well. The nice thing about NodeJS is that your code is written just like on the client side, so you are building skills that work on both the front end and the back end.