Now that we've seen how easy Mojolicious' Mojo::DOM makes parsing HTML, let's take a closer look at the CSS selector goodness it provides.
Here's a fairly verbose HTML sample for us to work with:
First, we initialize and parse the file:
use File::Slurp 'slurp';
use Mojo::DOM;
my $dom = Mojo::DOM->new->parse(scalar slurp 'some.html');
Getting all the articles' contents, of course, is easy:
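A minimal sketch of what that could look like, assuming each article is a `div` with the class `article` (the class name is borrowed from the adjacent-sibling selector later in this post; the inline markup is a made-up stand-in for the sample file, and the calls assume a reasonably recent Mojolicious):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup standing in for the sample file
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article"><h1>First</h1><p>Some text.</p></div>
<div class="article"><h1>Second</h1><p>More text.</p></div>
HTML

# find() returns a Mojo::Collection; each() without a callback
# returns the matched elements as a plain list
my @articles = $dom->find('div.article')->each;

# all_text gathers the text of the element and its descendants
print $_->all_text, "\n" for @articles;
```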
But we can do better than that. Let's say we want only the article titles that have page anchors:
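One way to sketch this, assuming the titles are `h1` elements wrapping a link and that a "page anchor" is a link whose `href` starts with `#` (both are guesses about the sample markup, not something the post confirms):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: only the first title links to a page anchor
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article"><h1><a href="#first">First</a></h1></div>
<div class="article"><h1><a href="http://example.com/second">Second</a></h1></div>
<div class="article"><h1>Third</h1></div>
HTML

# [href^="#"] matches attribute values that begin with "#"
my @titles = $dom->find('div.article h1 a[href^="#"]')->map('text')->each;

print "$_\n" for @titles;
```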
Nah, let's get the article titles that link to external URLs:
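A rough sketch, assuming "external" means the `href` starts with `http` (which covers both `http://` and `https://`) and that titles are links inside an `h1`; the markup below is invented for illustration:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: one external link, one page anchor
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article"><h1><a href="#first">First</a></h1></div>
<div class="article"><h1><a href="http://example.com/second">Second</a></h1></div>
HTML

# The prefix match skips relative links and "#" anchors
my @external = $dom->find('div.article h1 a[href^="http"]')->map('text')->each;

print "$_\n" for @external;
```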
How about only article titles that link to .net domains?
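A sketch using the suffix attribute selector, again on made-up markup. Keep the selector in single quotes so Perl doesn't try to interpolate `$=`:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: one .net link, one .com link, one page anchor
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article"><h1><a href="http://example.net">Net</a></h1></div>
<div class="article"><h1><a href="http://example.com">Com</a></h1></div>
<div class="article"><h1><a href="#local">Local</a></h1></div>
HTML

# [href$=".net"] matches attribute values that end with ".net"
my @net = $dom->find('div.article h1 a[href$=".net"]')->map('text')->each;

print "$_\n" for @net;
```

Note that a suffix match only catches URLs that end in `.net`; a link such as `http://example.net/page` would need a different test, for instance a regex over the `href` values.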
We can also get the page anchors themselves:
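For instance, assuming page anchors are links whose `href` begins with `#`, we could pull out the `href` values rather than the link text (hypothetical markup again):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup with two page anchors and one external link
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article"><h1><a href="#first">First</a></h1></div>
<div class="article"><h1><a href="http://example.com">Second</a></h1></div>
<div class="article"><h1><a href="#third">Third</a></h1></div>
HTML

# map() can call a method with arguments on every matched element
my @anchors = $dom->find('a[href^="#"]')->map(attr => 'href')->each;

print "$_\n" for @anchors;
```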
It could be that some articles have no text content; let's single those out:
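A sketch with the `:empty` pseudo-class, assuming an article without content is a completely empty `div.article` (the `id` attributes are only there so we can see which one matched):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: the second article has no content at all
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article" id="a1">Plenty to read here.</div>
<div class="article" id="a2"></div>
HTML

# :empty matches elements that have no children
my @empty_ids = $dom->find('div.article:empty')->map(attr => 'id')->each;

print "$_\n" for @empty_ids;
```

Under CSS semantics, text counts as a child, so an article that holds only whitespace would not be treated as empty.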
Or, if we want only the articles with text content:
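Negating `:empty` with `:not()` is one way to sketch this, under the same assumption that a contentless article is a completely empty `div.article`:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: only the first article has content
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article" id="a1">Plenty to read here.</div>
<div class="article" id="a2"></div>
HTML

# :not(:empty) keeps only elements that have at least one child
my @full_ids = $dom->find('div.article:not(:empty)')->map(attr => 'id')->each;

print "$_\n" for @full_ids;
```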
Let's get the articles that are only snippets (class name ends with 'snippet'):
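A sketch with the suffix attribute selector on `class`. The class names in the markup are invented; the only thing taken from the text is that snippet classes end in `snippet`:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: two of the three class attributes end in "snippet"
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article" id="full">A full article.</div>
<div class="article snippet" id="s1">A short one.</div>
<div class="article news-snippet" id="s2">Another short one.</div>
HTML

# Single quotes keep Perl from interpolating the "$=" in the selector
my @snippet_ids = $dom->find('div[class$="snippet"]')->map(attr => 'id')->each;

print "$_\n" for @snippet_ids;
```

The match is against the whole `class` attribute value, so it only works when the snippet class is the last one listed.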
There's an advertisement in the markup; let's look at the article immediately following it:
$dom->find('a.advertisement + div.article');
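To see that selector in action without the sample file, here's a small self-contained sketch on made-up markup (the `id` attributes and link target are hypothetical):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: an ad sitting between articles
my $dom = Mojo::DOM->new(<<'HTML');
<div class="article" id="before">Before the ad</div>
<a class="advertisement" href="http://example.com/ad">Buy things</a>
<div class="article" id="after">Right after the ad</div>
<div class="article" id="later">Further down</div>
HTML

# "+" selects only the sibling element directly after the advertisement
my @ids = $dom->find('a.advertisement + div.article')->map(attr => 'id')->each;

print "$_\n" for @ids;
```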
If you're looking to be particularly awesome, with Mojolicious 1.1, you can use all of these selectors from the command line.
Mojo::DOM currently implements all the selectors from jQuery that make contextual sense; if you run into a use case for something that's not implemented, pop into #mojo or the mailing list and make your case. Join the revolution!
As always, it's one-step easy to install:
curl -L get.mojolicio.us | sh