February 17, 2011

CSS selector goodness in Mojo::DOM

Now that we've seen how easy Mojolicious' Mojo::DOM makes parsing HTML, let's take a closer look at the CSS selector goodness it provides.

Here's a fairly verbose HTML sample for us to work with:

First, we initialize and parse the file:

use File::Slurp 'slurp';
use Mojo::DOM;
my $dom = Mojo::DOM->new->parse(scalar slurp 'some.html');

Getting all the articles' contents, of course, is easy:

$dom->find('li a');
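Since find hands back a collection rather than a plain list, it may help to see what comes out of it. Here is a minimal sketch, assuming a recent Mojolicious (where collections offer map and each) and a small made-up stand-in for the markup; the titles and URLs are hypothetical:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical stand-in markup, not the sample from this post.
my $html = <<'HTML';
<ul>
  <li><a href="#first">First article</a></li>
  <li><a href="http://example.net/second">Second article</a></li>
</ul>
HTML

my $dom = Mojo::DOM->new($html);

# find() returns a Mojo::Collection of matching elements;
# map('text') calls ->text on each of them.
my @titles = $dom->find('li a')->map('text')->each;
print "$_\n" for @titles;
```

Working with the collection directly keeps the selector as the only place where the document structure is spelled out.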

But we can do better than that. Let's say we want only the article titles that have page anchors:
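One selector that would fit, assuming the list links point at in-page anchors through an href that begins with "#" (both the markup and the selector below are illustrative guesses, not the post's original snippet):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: one in-page link, one external link.
my $dom = Mojo::DOM->new(<<'HTML');
<ul>
  <li><a href="#first">First article</a></li>
  <li><a href="http://example.com/second">Second article</a></li>
</ul>
HTML

# [href^="#"] keeps only links whose href begins with "#".
my @anchored = $dom->find('li a[href^="#"]')->map('text')->each;
print "$_\n" for @anchored;
```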


Nah, let's get the article titles that link to external URLs:
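A sketch of how that could look, assuming external links are the ones whose href starts with "http" (again an illustrative guess over hypothetical markup, not the original snippet). It also pulls out the href attribute instead of the link text:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup mixing in-page and external links.
my $dom = Mojo::DOM->new(<<'HTML');
<ul>
  <li><a href="#first">First article</a></li>
  <li><a href="http://example.com/second">Second article</a></li>
  <li><a href="https://example.net/third">Third article</a></li>
</ul>
HTML

# [href^="http"] matches both http:// and https:// links;
# map(attr => 'href') calls ->attr('href') on each match.
my @external = $dom->find('li a[href^="http"]')->map(attr => 'href')->each;
print "$_\n" for @external;
```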


How about only article titles that link to .net domains?
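A substring match such as [href*=".net"] is one way to approximate this, though it is loose: it matches ".net" anywhere in the URL. The sketch below, over hypothetical markup, pairs that selector with a stricter check on the parsed host using Mojo::URL; the filtering step is an addition for illustration, not something the selector does by itself:

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::URL;

# Hypothetical markup: a real .net host and a .com host with ".net" in the path.
my $dom = Mojo::DOM->new(<<'HTML');
<ul>
  <li><a href="http://example.net/one">One</a></li>
  <li><a href="http://example.com/dot.net.html">Two</a></li>
</ul>
HTML

# Loose: ".net" may appear anywhere in the href, so both links qualify.
my $loose = $dom->find('li a[href*=".net"]');

# Stricter: keep only links whose parsed host ends in ".net".
my $strict = $loose->grep(
  sub { (Mojo::URL->new($_->attr('href'))->host // '') =~ /\.net$/ }
);

printf "%d loose, %d strict\n", $loose->size, $strict->size;
```

For markup you control, the selector alone is usually enough; the host check only earns its keep when paths or query strings might contain ".net" too.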


We can also get the page anchors themselves:

$dom->find('div.article a[name]');

It could be that some articles have no text content; let's single those out:

$dom->find('div.article p:empty');

Or, if we want only the articles with text content:

$dom->find('div.article p:not(:empty)');
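The two selectors split the paragraphs into complementary sets, which a quick count makes visible. A minimal sketch over hypothetical markup, assuming a recent Mojolicious where collections provide size:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: two articles with text, one without.
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article"><p>Some text.</p></div>
<div class="article"><p></p></div>
<div class="article"><p>More text.</p></div>
HTML

my $empty  = $dom->find('div.article p:empty');
my $filled = $dom->find('div.article p:not(:empty)');

# Every <p> lands in exactly one of the two collections.
printf "%d empty, %d with content, %d total\n",
  $empty->size, $filled->size, $dom->find('div.article p')->size;
```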

Let's get the articles that are only snippets (class name ends with 'snippet'):

$dom->find('div.article p[class$=snippet]');
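Note that [class$=snippet] tests the whole attribute string, not individual class names, so it behaves differently from a class selector when an element carries several classes. A small sketch with hypothetical class names shows the difference:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: a suffixed class, a multi-class element, and neither.
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article"><p class="short-snippet">Teaser one</p></div>
<div class="article"><p class="snippet featured">Teaser two</p></div>
<div class="article"><p class="full">Whole story</p></div>
HTML

# Attribute value must END with "snippet": only "short-snippet" qualifies,
# because "snippet featured" ends with "featured".
my @by_suffix = $dom->find('div.article p[class$=snippet]')->map('text')->each;

# Class selector matches "snippet" as one whitespace-separated class name.
my @by_class = $dom->find('div.article p.snippet')->map('text')->each;

print "suffix: @by_suffix\n";
print "class:  @by_class\n";
```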

There's an advertisement in the markup; let's look at the article immediately following it:

$dom->find('a.advertisement + div.article');
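The + combinator only reaches the very next sibling element, while ~ reaches every later sibling. A minimal sketch contrasting the two, using hypothetical markup and assuming a recent Mojolicious:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: one article before the ad, two after it.
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article"><p>Before the ad</p></div>
<a class="advertisement" href="http://ads.example.com/">Buy now</a>
<div class="article"><p>Right after the ad</p></div>
<div class="article"><p>Later on</p></div>
HTML

# Adjacent sibling: only the article directly following the ad.
my $next = $dom->find('a.advertisement + div.article');

# General sibling: every article that comes after the ad.
my $later = $dom->find('a.advertisement ~ div.article');

printf "%d adjacent, %d following\n", $next->size, $later->size;
print $next->first->at('p')->text, "\n";
```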

If you're looking to be particularly awesome, Mojolicious 1.1 also lets you use all of these selectors from the command line.

Mojo::DOM currently implements all the selectors from jQuery that make contextual sense; if you run into a use case for something that's not implemented, pop into #mojo or the mailing list and make your case. Join the revolution!

As always, it's one-step easy to install:

curl -L get.mojolicio.us | sh

Mojo::DOM docs
