Getting Started with Nokogiri
Created at: 14.01.2010 20:00, source: Engine Yard Blog, tagged: Technology css html Nokogiri ParseTree XML XPath
Nokogiri is a library for dealing with XML and HTML documents. I wrote Nokogiri along with my (more attractive) partner in crime, Mike Dalessio. We both use and enjoy working with Nokogiri for dealing with HTML and XML on a daily basis, and I’d like to share it with you! In this post, we’ll be covering:
- Getting Nokogiri installed
- Basic document parsing
- Basic data extraction
Hopefully, by the end of this article, you too will be able to use and enjoy Nokogiri on a day-to-day basis!
Installation
Nokogiri is actually a wrapper around Daniel Veillard’s excellent HTML/XML parsing library, libxml2. Since Nokogiri simply wraps and builds upon this existing library, installing libxml2 is a prerequisite for installing Nokogiri. Fortunately, libxml2 has been ported to most systems, so the installation is pretty easy.
OS X
OS X ships with libxml2 installed, but the version in MacPorts is more up to date, so I recommend installing libxml2 from MacPorts instead.
To install libxml2 from MacPorts:
$ sudo port install libxml2 libxslt
Then to install nokogiri:
$ sudo gem install nokogiri
And that should be it!
Linux
On Linux, we still need to install libxml2. The command for installing libxml2 depends on the package manager and Linux distribution you’re using, but we’ll cover Fedora and Ubuntu here.
On Fedora:
$ sudo yum install libxml2-devel libxslt-devel
$ gem install nokogiri
On Ubuntu:
$ sudo apt-get install libxml2 libxml2-dev libxslt libxslt-dev
$ gem install nokogiri
Windows
Dealing with libxml2 on Windows is so much work that we built libxml2 for you, and now ship it along with Nokogiri. On Windows, to install, simply run gem install nokogiri.
Oh Noes! Something Went Wrong!
Nokogiri ships with some basic intelligence for finding your installation of libxml2, but clever developers can easily fool it! If you have problems, first check that the libxml2 and libxslt development packages are installed. If everything seems OK, and Nokogiri still won’t install, send an email to the Nokogiri mailing list. We’re here to help!
Basic Parsing
Now that we have installation out of the way, it’s time to get Nokogiri to do some work for us. Nokogiri lets you parse an HTML or XML document using a few different strategies:
- DOM
- SAX
- Reader
- Pull
Each of these strategies has different advantages and disadvantages. We won’t go through all the differences in this post; the DOM interface is the most common, and generally regarded as the easiest to use, so that’s what we’ll focus on here.
There are two main entry points to Nokogiri depending on the kind of document you wish to parse: one for HTML documents and one for XML documents. Parsing HTML documents looks like this:
doc = Nokogiri::HTML(html_document)
Parsing XML documents looks like this:
doc = Nokogiri::XML(xml_document)
Both of these functions will take an IO object or a String object. Since both forms accept IO objects, we can even feed open-uri straight into Nokogiri like this:
# URI.open comes from open-uri on Ruby 2.5+; older Rubies used plain open(...)
doc = Nokogiri::HTML(URI.open("http://www.google.com/search?q=doughnuts"))
Feeding Nokogiri an IO object is slightly more efficient than using a String, but you should choose the one that is most convenient.
Data Structures
To become data extraction Zen Masters, we first need to understand the data structure returned by Nokogiri. Notably, we need to understand that Nokogiri converts HTML and XML documents into a tree data structure.
For example, an HTML document that looks like this:
<html>
<head>
<title>Hello!</title>
</head>
<body id="uniq">
<h1>Hello World!</h1>
</body>
</html>
…will be represented in memory with a tree that looks like this:

Any data extraction technique used is simply a way for traversing this in-memory tree. If we keep this structure in mind while trying to do data extraction, we can enter data extraction nirvana!
Data Extraction
We’ve seen how to turn an HTML or XML document into an in-memory tree. Now we’re going to try to do something useful with this tree: extract some data. Let’s take a look at a few different strategies for unlocking the data in our tree.
There are three different ways to traverse our in-memory tree. The first two, XPath and CSS, are small languages built specifically for tree traversal. The last one we’ll examine is the Nokogiri API for manual tree traversal.
Basic XPath
The XPath language was written to easily traverse an XML tree structure, but we can use it with HTML trees as well. Here’s a sample program for extracting search result links from a Google search. We’ll use XPath to find the data we want, and then pick apart the XPath syntax:
require 'open-uri'
require 'nokogiri'
# URI.open comes from open-uri on Ruby 2.5+; older Rubies used plain open(...)
doc = Nokogiri::HTML(URI.open("http://www.google.com/search?q=doughnuts"))
doc.xpath('//h3/a').each do |node|
  puts node.text
end
The XPath used in this program is:
//h3/a
In English, this XPath says:
Find all “a” tags with a parent tag whose name is “h3”
Thus, our program finds all “a” tags with “h3” parents, loops over them, and prints out the text content.
XPath works like a directory structure where the leading “/” indicates the root of the tree. Slashes separate the tag matching information. When there’s nothing between two slashes (“//”), it’s a sort of wildcard meaning “any tag matches”, so the tags that follow can appear at any depth. The “h3” and “a” are tag name matchers, and only match when the tag name matches.
Finding tag names is great, but if you run the previous program, you might find that it returns more “a” tags than we actually want. We need to narrow down our search based on some attributes of the tags, specifically the “class” values. To match attribute values in XPath, we use brackets. Now let’s look at a couple of examples.
To match “h3” tags that have a class attribute, we write:
h3[@class]
To match “h3” tags whose class attribute is equal to the string “r”, we write:
h3[@class = "r"]
Using the attribute matching construct, we can modify our previous query to:
//h3[@class = "r"]/a[@class = "l"]
which in English terms is:
Find all “a” tags with a class attribute equal to “l” and an immediate parent tag “h3” that has a class attribute equal to “r”
If we substitute that XPath back into our original program, we’ll get the expected results.
For more information on doing XPath queries, I recommend checking out the tutorial at w3schools as well as the w3 recommendation.
For more information on using XPath within Nokogiri, check out the Nokogiri tutorials as well as the RDoc.
Next, let’s look at CSS syntax.
Basic CSS
CSS is similar to XPath in that it’s another language for searching a tree data structure. In this section, we’ll perform the same task as the XPath section, but we’ll examine the CSS syntax.
CSS does not separate tag matching patterns by slashes, but rather by whitespace or “greater than” characters (actually, there are more, but we’re just going to talk about those two for now). Let’s rewrite our previous XPath as CSS and examine the syntax.
//h3/a
…can be written in CSS as:
h3 > a
The “>” character indicates that the “a” tag must be a direct child of the “h3” tag. Most CSS that I see uses space separators like this:
h3 a
Using a space indicates that there could be any number of tags between the “h3” tag and the “a” tag. The space is similar to “//” in XPath, and this CSS query could be written in XPath like this:
//h3//a
Similar to XPath, CSS can use brackets for matching attributes. Let’s do a couple more XPath to CSS translations. On the left is XPath, on the right is CSS:
h3[@class]       => h3[class]
h3[@class = "r"] => h3[class = "r"]
This syntax works, but CSS provides us with a shorthand for matching the “class” attribute. To find all h3 tags whose class attribute contains “r”, we can say:
h3.r
There’s a subtle difference between the two previous examples. The selector h3[@class = "r"] must be an exact match; the class value must exactly equal the string r. In the second example, the selector h3.r means “the class attribute must contain the value r”. That means h3.r will match the following tag, but h3[@class = "r"] will not:
<h3 class="r foo">Hi!</h3>
The XPath selector and our translated CSS selector would not match this tag, but the “h3.r” selector would. Most of the time, the CSS class selectors do what we want. Only when I need something very specific do I use the bracket form in my CSS selectors.
With this knowledge in hand, we can rewrite our original program using CSS selectors:
# URI.open comes from open-uri on Ruby 2.5+; older Rubies used plain open(...)
doc = Nokogiri::HTML(URI.open("http://www.google.com/search?q=doughnuts"))
doc.css('h3.r > a.l').each do |node|
  puts node.text
end
I think the CSS selectors usually result in more concise and clear queries than XPath, so I usually stick to CSS queries in my code. There are some tasks which CSS cannot accomplish that XPath can though, so it’s nice to be able to fall back to XPath queries when I need to.
Next, let’s look at some basic node APIs provided by Nokogiri.
Basic Node API
Since we’re dealing with a tree data structure, Nokogiri provides methods for navigating that tree. In fact, all of the tree traversal we’ve seen so far using XPath and CSS can be accomplished manually via Ruby. Manual tree traversal is, however, cumbersome and verbose, which is why languages like XPath and CSS exist. Sometimes a combination of XPath or CSS plus manual tree traversal is easiest, so it is still important to know the API.
Every tag in a document is represented by a class called Node. Each node in the tree has 0 or more children, 0 or 1 parent, 0 or more siblings, and 0 or more attributes. Nokogiri provides methods for accessing all of these things on any particular node. We can access any of those relative nodes like so:
node.parent           #=> parent node
node.children         #=> children nodes
node.next_sibling     #=> next sibling node
node.previous_sibling #=> previous sibling node
These node access methods can be used for manually traversing a tree, but I tend to leave the hard work to XPath or CSS queries and only use manual tree access when I have to.
When it comes to accessing attributes of a tag, the node may be treated like a normal Ruby Hash. We can get and set attributes on a node like so:
node['class']         #=> the value of the class attribute
node['class'] = 'foo'
We can even get a list of attributes or values of attributes like so:
node.keys   #=> list of attribute names
node.values #=> list of attribute values
For more information on things you can do with Nodes, check out the Node Documentation and also the Nokogiri tutorials section.
Conclusion
I hope this article has you on your way to HTML and XML parsing nirvana. Remember the tree data structure, and remember that XPath and CSS can be performed on HTML documents and XML documents.
Make sure to check out our documentation, and if you have any problems make sure to join the mailing list!
The State of XML Parsing in Ruby (Circa 2009)
Created at: 20.11.2009 20:30, source: Engine Yard Blog, tagged: Technology Nokogiri REXML XML
It’s almost the end of 2009, and I have to ask: are we through dealing with XML yet?
Although many of us wish we could consume the web through a magic programmer portal that shields us and our code from all the pointy angle brackets, the reality that is the legacy of HTML, Atom and RSS on the web leaves us little choice but to soldier on. So let’s take a look at what Ruby-colored armor is available to us as we continue our quest to slay the XML dragons.
Background
Historically, Ruby has had a number of options for dealing with structured markup, though oddly none have reached a solid consensus among Ruby developers as the “go to” library. The earliest available library seems to be Yoshida Masato’s XMLParser, which wraps Expat and was first released around the time that Expat itself was released, back in 1998. A pure Ruby parser by Jim Menard called NQXML appeared in 2001, though it never matured to the level of a robust XML parser.
In late 2001, Matz expressed his desire for out of the box XML support, but sadly, nothing appeared in Ruby’s standard library until 2003, when REXML was imported for the 1.8.0 release. After reading bike-shed discussions like this one on ruby-talk in November 2001, or this wayback-machine page from the old RubyGarden wiki, it’s not hard to see why. Meanwhile, other language runtimes, such as Python and Java, moved along and built solid, acceptable foundations, making Ruby’s omission seem more glaring.
But all was not lost: Ruby has always had a quality without a name that made it a great language for distilling an API. All that was needed was an infusion of interest and talent in Ruby, and a few more experiments and iterations.
Fast forward to the present time, and all those chips have fallen. We’ve seen evolution from REXML to libxml-ruby to Hpricot, and finally to Nokogiri. So, is the XML landscape on Ruby so dire? Certainly not, as you’ll see by the end of this article! While the standard library support for XML hasn’t progressed beyond REXML yet, state-of-the-art solutions are a few keystrokes away.
XML APIs
A big part of what makes XML such a pain to work with is the APIs. We Rubyists tend to have an especially low tolerance for friction in API design, and we really feel it when we work with XML. If XML is just a tree structure, why isn’t navigating it as simple and elegant as traversing a Ruby Enumerable?
The canonical example of API craptasticism is undoubtedly the W3C DOM API. For proof, observe the meteoric rise of jQuery in the JavaScript world. While it would be easy to fill an entire article with criticisms regarding the DOM, it’s been done before. (Incidentally, read the whole series of interviews with Elliotte Rusty Harold for a series of insights on API design, schema evolution, and more.)
Instead, we’ll take a brief exploratory tour of some Ruby XML APIs using code examples. Though some of the examples may seem trivially short, don’t underestimate their power. Conciseness and readability are Ruby’s gifts to the library authors and they’re being put to good use.
The libraries we’ll use for comparison are REXML, Nokogiri, and JAXP, Java’s XML parsing APIs (via JRuby).
Parsing
The simplest possible thing to do in XML is to hand the library some XML and get back a document.
REXML
require 'rexml/document'
document = REXML::Document.new(xml)
Nokogiri
require 'nokogiri'
document = Nokogiri::XML(xml)
Both REXML and Nokogiri more or less get this right. What’s also nice is that they both transparently accept either an IO-like object or a string. Contrast this to Java:
JAXP/JRuby
factory = javax.xml.parsers.DocumentBuilderFactory.newInstance
factory.namespace_aware = true # unfortunately, important!
parser = factory.newDocumentBuilder
# String
document = parser.parse(xml)
# IO
document = parser.parse(xml.to_inputstream)
In that familiar Java style, the JAXP approach forces you to choose from many options and write more code for the happy path. JRuby helps you a little bit by converting a Ruby string into a Java string, but needs a little help with intent for converting an IO to a Java InputStream.
XPath
Now that we’ve got a document object, let’s query it via XPath, assuming the underlying format is an Atom feed. Here is the code to grab the entries’ titles and store them as an array of strings:
REXML
elements = REXML::XPath.match(document.root, "//atom:entry/atom:title/text()",
                              "atom" => "http://www.w3.org/2005/Atom")
titles = elements.map {|el| el.value }
Nokogiri
elements = document.xpath("//atom:entry/atom:title/text()",
                          "atom" => "http://www.w3.org/2005/Atom")
titles = elements.map {|e| e.to_s}
Again, both REXML and Nokogiri clock in at similar code sizes, but subtle differences begin to emerge. Nokogiri’s use of #xpath as an instance method on the document object feels more natural as a way of drilling down for further detail. Also, note that both APIs return DOM objects for the text, so we need to take one more step to convert them to pure Ruby strings. Here, Nokogiri’s use of the standard String#to_s method is more intuitive; REXML::Text’s version returns the raw text without the entities replaced.
Unfortunately, doing XPath in Java gets a bit more complicated. First we need to construct an XPath object. At least JRuby helps us a bit here–we can create an instance of the NamespaceContext interface completely in Ruby, and omit the methods we don’t care about.
JAXP/JRuby
xpath = javax.xml.xpath.XPathFactory.newInstance.newXPath
ns_context = Object.new
def ns_context.getNamespaceURI(prefix)
  {"atom" => "http://www.w3.org/2005/Atom"}[prefix]
end
xpath.namespace_context = ns_context
Next, we evaluate the expression and construct the array titles:
JAXP/JRuby
nodes = xpath.evaluate("//atom:entry/atom:title/text()",
                       document, javax.xml.xpath.XPathConstants::NODESET)
titles = []
0.upto(nodes.length-1) do |i|
  titles << nodes.item(i).node_value
end
That last bit where we need to externally iterate the DOM API is particularly un-Ruby-like. With JRuby we can mix in some methods to the NodeList class:
JAXP/JRuby
module org::w3c::dom::NodeList
  include Enumerable
  def each
    0.upto(length - 1) do |i|
      yield item(i)
    end
  end
end
And replace the external iteration with a more natural internal one:
JAXP/JRuby
titles = nodes.map {|e| e.node_value}
This kind of technique tends to become a fairly common occurrence when coding Ruby to Java libraries in JRuby. Fortunately Ruby makes it simple to hide away the ugliness in the Java APIs!
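The same trick works in plain Ruby, which makes it easy to try without a JVM. In this sketch, IndexedList is a hypothetical stand-in for an index-based collection like NodeList; it is not part of any library:

```ruby
# Hypothetical stand-in for an index-based collection such as a DOM NodeList.
class IndexedList
  def initialize(items)
    @items = items
  end

  def length
    @items.length
  end

  def item(index)
    @items[index]
  end
end

# Reopen the class and define #each; Enumerable then supplies map, select, etc.
class IndexedList
  include Enumerable

  def each
    0.upto(length - 1) do |i|
      yield item(i)
    end
  end
end

nodes = IndexedList.new(%w[First Second Third])
upcased = nodes.map(&:upcase)
long_names = nodes.select { |name| name.length > 5 }

puts upcased.inspect
puts long_names.inspect
```

Defining a single #each method is all Enumerable needs; everything else comes along for free.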
Walking the DOM
Say we’d like to explore the DOM. Both REXML and Nokogiri provide multiple ways of doing this, with parent/child/sibling navigation methods. They also each sport a recursive descent method, which is quite convenient.
REXML
titles = []
document.root.each_recursive do |elem|
  titles << elem.text.to_s if elem.name == "title"
end
Nokogiri
titles = []
document.root.traverse do |elem|
  titles << elem.content if elem.name == "title"
end
Needless to say, Java’s DOM API has no such convenience method, so we have to write one. But again, JRuby makes it easy to Rubify the code. Note that our #traverse method makes use of our Enumerable-ization of NodeList above as well.
JAXP/JRuby
module org::w3c::dom::Node
  def traverse(&blk)
    blk.call(self)
    child_nodes.each do |e|
      e.traverse(&blk)
    end
  end
end
titles = []
document.traverse do |elem|
  titles << elem.text_content if elem.node_name == "title"
end
Pull parsing
All three libraries have a pull parser (also called a stream parser or reader) as well. Pull parsers are efficient because they behave like a cursor scrolling through the document, but usually result in more verbose code because of the need to implement a small state machine on top of lower-level XML events. They are best employed on very large documents where it’s impractical to store the entire DOM tree in memory at once.
REXML
parser = REXML::Parsers::PullParser.new(xml_stream)
titles = []
text = ''
grab_text = false
parser.each do |event|
  case event.event_type
  when :start_element
    grab_text = true if event[0] == "title"
  when :text
    text << event[1] if grab_text
  when :end_element
    if event[0] == "title"
      titles << text
      text = ''
      grab_text = false
    end
  end
end
Nokogiri
reader = Nokogiri::XML::Reader(xml_stream)
titles = []
text = ''
grab_text = false
reader.each do |elem|
  if elem.name == "title"
    if elem.node_type == 1 # start element?
      grab_text = true
    else # elem.node_type == 15 # end element?
      titles << text
      text = ''
      grab_text = false
    end
  elsif grab_text && elem.node_type == 3 # text?
    text << elem.value
  end
end
(Aside to the Nokogiri team: where are the reader node type constants?)
JAXP/JRuby
include javax.xml.stream.XMLStreamConstants
factory = javax.xml.stream.XMLInputFactory.newInstance
reader = factory.createXMLStreamReader(xml_stream.to_inputstream)
titles = []
text = ''
grab_text = false
while reader.has_next
  case reader.next
  when START_ELEMENT
    grab_text = true if reader.local_name == "title"
  when CHARACTERS
    text << reader.text if grab_text
  when END_ELEMENT
    if reader.local_name == "title"
      titles << text
      text = ''
      grab_text = false
    end
  end
end
Not surprisingly, all three pull parser examples end up looking very similar. The subtleties of the pull parser APIs end up getting blurred in the loops and conditionals. Only write this code when you have to.
Performance
At the end of the day, it comes down to performance, doesn’t it? Although the topic of Ruby XML parser performance has been discussed before, I thought it would be instructive to do another round of comparisons with JRuby and Ruby 1.9 thrown into the mix.
System Configuration
- Mac OS X 10.5 on a MacBook Pro 2.53 GHz Core 2 Duo
- Ruby 1.8.6p287
- Ruby 1.9.1p243
- JRuby 1.5.0.dev (rev c7b3348) on Apple JDK 5 (32-bit)
- Nokogiri 1.4.0
- libxml2 2.7.3
Here are results comparing Nokogiri and Hpricot on the three implementations along with the JAXP version which only runs on JRuby (smaller is better).

The REXML results were over an order of magnitude slower, so it’s easier to view them on a separate graph. Note the number of iterations here is 100 vs. 1000 for the results above.

While these results don’t paint a complete picture of XML parser performance, they should give you enough of a guideline to make a decision on which parser to use once you take portability and readability into account. In summary:
- Use REXML when your parsing needs are minimal and you want the widest portability (across all implementations) with the smallest install footprint.
- Use JRuby with the JAXP APIs for portability across any operating system that supports the Java platform (including Google AppEngine).
- Use Nokogiri for everything else. It’s the fastest implementation, and produces the most programmer-friendly code of all Ruby XML parsers to date.
(As a footnote, we on the Nokogiri and JRuby teams are looking for community help to further develop the pure-Java backend for Nokogiri so that AppEngine and other JVM deployment scenarios that don’t allow loading native code can benefit from Nokogiri’s awesomeness. Please leave a comment or contact the JRuby team on the mailing list if you’re interested.)
The source code for this article is available if you’d like to examine the code or run the benchmarks yourself. Keep an eye on the Engine Yard blog for an upcoming post on Nokogiri, and as always, leave questions and thoughts in the comments!
