Web scraping in Java with Jsoup, Part 2 (How-to)
Web scraping refers to programmatically downloading a page and traversing its DOM to extract the data you are interested in. I wrote a parser class in Java to perform the web scraping for my blog analyzer project. In Part 1 of this how-to I explained how I set up the calling mechanism for executing the parser against blog URLs. Here, I explain the parser class itself.
But before getting into the code, it is important to take note of the HTML structure of the document that will be parsed. The pages of The Dish are quite heavy, full of menus, JavaScript, and other clutter, but the area of interest is the set of blog posts themselves. This example shows the HTML structure of each blog post on The Dish:
<article>
  <aside>
    <ul class="entryActions" id="meta-6a00d83451c45669e2014e885e4354970d">
      <li class="entryEmail ir">
        <div class="st_email_custom maildiv" st_url="http://andrewsullivan.thedailybeast.com/2011/05/fac-5.html" st_title="Face Of The Day">email</div>
      </li>
      <li class="entryLink ir">
        <a href="http://andrewsullivan.thedailybeast.com/2011/05/fac-5.html" title="permalink this entry">permalink</a>
      </li>
      <li class="entryTweet"></li>
      <li class="entryLike"></li>
    </ul>
    <time datetime="2011-05-12T23:37:00-4:00" pubdate>12 May 2011 07:37 PM</time>
  </aside>
  <div class="entry">
    <h1>
      <a href="http://andrewsullivan.thedailybeast.com/2011/05/fac-5.html">Face Of The Day</a>
    </h1>
    <p>
      <a href="http://dailydish.typepad.com/.a/6a00d83451c45669e2014e885e4233970d-popup" onclick="window.open( this.href, '_blank', 'width=640,height=480,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0' ); return false" style="display: inline;">
        <img alt="GT_WWII-VET-JEWISH-110511" class="asset asset-image at-xid-6a00d83451c45669e2014e885e4233970d" src="http://dailydish.typepad.com/.a/6a00d83451c45669e2014e885e4233970d-550wi" style="width: 515px;" title="GT_WWII-VET-JEWISH-110511" />
      </a>
    </p>
    <p>
      A decorated veteran takes part [truncated]
    </p>
  </div>
</article>
Blog posts are each contained within an HTML5 article tag. There is a time tag holding the date and time the post was published. A div with the class entry holds both the title and body of the post. The title is within an h1, which also contains the permalink for the post.
Now, the code to parse this page.
The simple blog parser interface again:
public interface BlogParser {
    public List<Link> parseURL(URL url) throws ParseException;
}
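The Link class itself, along with InnerLink, Image, and Video, is part of the project and is not shown in this post. Purely as a hypothetical sketch inferred from the setters used further down (the field types here are assumptions), it might look something like this:

// Hypothetical sketch of the Link data holder, inferred from the setters
// used in DishBlogParser below -- the real class is not shown in this post.
public class Link {
    private String title;
    private String url;
    private DateTime linkDate;          // org.joda.time.DateTime
    private String fullText;            // indexed for search, not stored
    private String excerpt;
    private List<InnerLink> innerLinks;
    private List<Image> images;
    private List<Video> videos;

    public String getTitle()           { return title; }
    public void setTitle(String title) { this.title = title; }
    // ...the remaining getters and setters follow the same JavaBean pattern
}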
Now to talk about the implementation class: DishBlogParser. The goal is to return a list of Link objects (a “Link” in this context represents one blog URL and its associated data). DishBlogParser will extract the title and body text of each blog post along with the post date, images, videos, and links contained therein. I’ll go through the class a section at a time. Starting from the top:
@Component("blogParser")
public class DishBlogParser implements BlogParser {

    @Value("${config.excerptLength}")
    private int excerptLength;

    @Autowired
    private DateTimeFormatter blogDateFormat;

    private final Cleaner cleaner;
    private final UrlValidator urlvalidator;

    public DishBlogParser() {
        Whitelist clean = Whitelist.simpleText()
                .addTags("blockquote", "cite", "code", "p", "q", "s", "strike");
        cleaner = new Cleaner(clean);
        urlvalidator = new UrlValidator(new String[]{"http", "https"});
    }
The excerptLength field defines the maximum length for post body excerpts. The @Value annotation pulls in the value from a properties file configured in applicationContext.xml.
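Wiring a property like this typically takes two pieces: a properties file entry and a property placeholder in applicationContext.xml. The file name and value below are illustrative assumptions, not the project's actual configuration:

<!-- hypothetical example: load properties so ${config.excerptLength} resolves -->
<context:property-placeholder location="classpath:app.properties"/>

<!-- app.properties would then contain a line such as:
     config.excerptLength=500 -->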
The blogDateFormat is a Joda formatter configured also in applicationContext.xml to match the date/time format used on The Dish. It will be used to parse dates from HTML into Joda DateTime objects. Here is how blogDateFormat is configured in applicationContext.xml:
<bean id="blogDateFormat" class="org.joda.time.format.DateTimeFormat" factory-method="forPattern">
    <constructor-arg value="dd MMM yyyy hh:mm aa"/>
</bean>
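As a quick sanity check of the pattern, here is a small standalone example showing how that formatter behaves against the timestamp from the sample HTML above:

import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

public class DateFormatDemo {
    public static void main(String[] args) {
        // the same pattern the blogDateFormat bean is built with
        DateTimeFormatter fmt = DateTimeFormat.forPattern("dd MMM yyyy hh:mm aa");

        // "12 May 2011 07:37 PM" is the text of the <time> tag shown earlier
        DateTime dt = fmt.parseDateTime("12 May 2011 07:37 PM");
        System.out.println(dt); // 2011-05-12T19:37:00.000 in the default zone
    }
}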
The Cleaner object is a Jsoup class that applies a whitelist filter to HTML. In this case, the cleaner is used to whitelist tags that will be allowed to appear in blog body excerpts.
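To make the Cleaner's behavior concrete, here is a minimal, standalone example (the input HTML is made up for illustration):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

public class CleanerDemo {
    public static void main(String[] args) {
        // allow only simple text markup plus the p tag
        Whitelist wl = Whitelist.simpleText().addTags("p");
        Cleaner cleaner = new Cleaner(wl);

        Document dirty = Jsoup.parse("<div><p>Keep <b>bold</b> text</p><img src=\"x.png\"></div>");
        Document clean = cleaner.clean(dirty);

        // prints: <p>Keep <b>bold</b> text</p>
        // the div wrapper and the img are dropped; whitelisted tags and text survive
        System.out.println(clean.body().html());
    }
}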
Finally, the UrlValidator comes from Apache Commons and will be used to validate the syntax of URLs contained within blog posts.
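Its behavior is straightforward; for instance:

import org.apache.commons.validator.routines.UrlValidator;

public class UrlValidatorDemo {
    public static void main(String[] args) {
        // only http and https URLs are considered valid
        UrlValidator validator = new UrlValidator(new String[]{"http", "https"});

        System.out.println(validator.isValid("http://example.com/post")); // true
        System.out.println(validator.isValid("ftp://example.com/file"));  // false: scheme not allowed
        System.out.println(validator.isValid("not a url"));               // false: malformed
    }
}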
Now, for the parseURL method:
public List<Link> parseURL(URL url) throws ParseException {
    try {
        // retrieve the document using Jsoup
        Connection conn = Jsoup.connect(url.toString());
        conn.timeout(12000);
        conn.userAgent("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)");
        Document doc = conn.get();

        // select all article tags
        Elements posts = doc.select("article");

        // base URI will be used within the loop below
        String baseUri = (new StringBuilder())
                .append(url.getProtocol())
                .append("://")
                .append(url.getHost())
                .toString();

        // initialize a list of Links
        List<Link> links = new ArrayList<Link>();
Here, Jsoup is used to connect to the URL. I set a generous connection timeout, because at times The Dish server is not very snappy. I also set a common user agent, just as a general practice when requesting a web page programmatically.
When conn.get() is called, the Document is retrieved; this is a DOM representation of the entire page. For this project, only the blog posts themselves are needed. Because each blog post is contained in an article tag, the set of posts is obtained by calling doc.select("article"). We're about to loop through them, but first we need to define the base URI of our URL (the protocol and host, e.g. http://andrewsullivan.thedailybeast.com) for something a bit further down, and also initialize the List which will hold our extracted Link objects.
Now, the loop. It starts like this:
        // loop through, extracting relevant data
        for (Element post : posts) {
            Link link = new Link();

            // extract the title of the post
            Elements elms = post.select(".entry h1");
            String title = (elms.isEmpty() ? "No Title" : elms.first().text().trim());
            link.setTitle(title);
First, an empty Link object is initialized. Then we extract the title. Recall that post is a Jsoup Element pointing to an article tag in the DOM; post.select(".entry h1") grabs the h1 title tag, from which we get the title string.
In a similar fashion, we grab the URL and the date:
            // extract the URL of the post
            elms = post.select("aside .entryLink a");
            if (elms.isEmpty()) {
                Logger.getLogger(DishBlogParser.class.getName()).log(Level.WARNING,
                        "UNABLE TO LOCATE PERMALINK, TITLE = " + title + ", URL = " + url);
                continue;
            }
            link.setUrl(elms.first().attr("href"));

            // extract the date of the post
            elms = post.select("aside time");
            if (elms.isEmpty()) {
                Logger.getLogger(DishBlogParser.class.getName()).log(Level.WARNING,
                        "UNABLE TO LOCATE DATE, TITLE = " + title + ", URL = " + url);
                continue;
            }

            // parse the date string into a Joda DateTime object
            DateTime dt = blogDateFormat.parseDateTime(elms.first().text().trim());
            link.setLinkDate(dt);
Failure to extract either the URL or the date is treated as fatal for that post: a warning is logged and further processing of the post is skipped. Note also that blogDateFormat is used to parse the date string from the HTML into a Joda DateTime object.
Next, let’s grab the body of the post and create an excerpt from it:
            // extract the body of the post (includes title tag at this point)
            Elements body = post.select(".entry");

            // remove the "more" link
            body.select(".moreLink").remove();

            // remove the title (h1) now from the body
            body.select("h1").remove();

            // set full text on link, used for indexing/searching (not stored)
            link.setFullText(body.text());

            // create a body "Document"
            Document bodyDoc = Document.createShell(baseUri);
            for (Element bodyEl : body)
                bodyDoc.body().appendChild(bodyEl);

            // remove unwanted tags by applying a tag whitelist;
            // the whitelisted tags will appear when displaying excerpts
            String bodyhtml = cleaner.clean(bodyDoc).body().html();

            if (bodyhtml.length() > excerptLength) {
                // we need to trim it down to excerptLength
                bodyhtml = trimExcerpt(bodyhtml, excerptLength);
                // parse again to fix possible unclosed tags caused by trimming
                bodyhtml = Jsoup.parseBodyFragment(bodyhtml).body().html();
            }
            link.setExcerpt(bodyhtml);
Recall that the body is contained in a div classed entry. The body may contain a "read on" link that expands the content; that link, if present, is removed first. The title's h1 tag is also removed, and the remaining text is stored via setFullText. This full text is not destined to be stored in the database; instead it will be indexed by our search engine.
To create the excerpt, unwanted HTML tags must be removed. This is where the Jsoup Cleaner comes in. Because the Cleaner only processes Document objects, a dummy Document is created for the post (this is also where baseUri is used).
If, after processing the post body through the Cleaner, its length exceeds excerptLength, it must be trimmed down to size. The trimExcerpt method does this. Because trimming might truncate closing HTML tags, Jsoup is used once more to parse the excerpt string, correcting any unbalanced tags. Finally, we have our excerpt.
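The tag-fixing step works because Jsoup normalizes whatever it parses. A small illustration, using a made-up truncated excerpt:

// trimming can leave tags open, e.g. a <b> cut off mid-excerpt
String truncated = "<p>A decorated veteran takes <b>part";

// re-parsing the fragment balances the markup
String fixed = Jsoup.parseBodyFragment(truncated).body().html();
// fixed is now "<p>A decorated veteran takes <b>part</b></p>"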
Here is the trimExcerpt method called from the block above:
    private String trimExcerpt(String str, int maxLen) {
        if (str.length() <= maxLen)
            return str;
        int endIdx = maxLen;
        while (endIdx > 0 && str.charAt(endIdx) != ' ')
            endIdx--;
        return str.substring(0, endIdx);
    }
The idea is to use maxLen as a suggestion, and keep backing up until a space character is found. In this way, words will not be cut off in the middle.
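A quick worked example:

// maxLen = 10 lands on the 'b' of "brown" (index 10), so the method
// backs up to the space at index 9 and cuts there
String s = trimExcerpt("The quick brown fox", 10);
// s == "The quick"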
Continuing the loop, the links contained within the post are extracted next. They are represented by InnerLink objects; any invalid links or self-links are skipped.
            // extract the links within the post
            List<InnerLink> inlinks = new ArrayList<InnerLink>();
            Elements innerlinks = body.select("a[href]");

            // loop through each link, discarding self-links and invalids
            for (Element innerlink : innerlinks) {
                String linkUrl = innerlink.attr("abs:href").trim();
                if (linkUrl.equals(link.getUrl()))
                    continue;
                else if (urlvalidator.isValid(linkUrl)) {
                    InnerLink inlink = new InnerLink();
                    inlink.setUrl(linkUrl);
                    inlinks.add(inlink);
                } else
                    Logger.getLogger(DishBlogParser.class.getName()).log(Level.INFO,
                            "INVALID URL: " + linkUrl);
            }
            link.setInnerLinks(inlinks);
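One detail worth highlighting is the abs:href attribute key: it tells Jsoup to resolve the link against the document's base URI, turning relative links into absolute ones. A small illustration with made-up input:

// parse a fragment containing a relative link, supplying a base URI
Document d = Jsoup.parse("<a href=\"/2011/05/fac-5.html\">permalink</a>",
        "http://andrewsullivan.thedailybeast.com");

// attr("href") returns the raw value; attr("abs:href") resolves it
String raw = d.select("a").first().attr("href");     // "/2011/05/fac-5.html"
String abs = d.select("a").first().attr("abs:href"); // "http://andrewsullivan.thedailybeast.com/2011/05/fac-5.html"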
Next, extract any images:
            // extract the images from the post
            List<Image> linkimgs = new ArrayList<Image>();
            Elements images = body.select("img");
            for (Element image : images) {
                Image img = new Image();
                img.setOrigUrl(image.attr("src"));
                img.setAltText(image.attr("alt").replaceAll("_", " "));
                linkimgs.add(img);
            }
            link.setImages(linkimgs);
Finally, extract any Youtube or Vimeo videos (the two most popular types). Note that this requires a more complex selector, in particular because over the years several different embed markup styles have been used:
            // extract Youtube and Vimeo videos from the post
            elms = body.select("iframe[src~=(youtube\\.com|vimeo\\.com)], "
                    + "object[data~=(youtube\\.com|vimeo\\.com)], "
                    + "embed[src~=(youtube\\.com|vimeo\\.com)]");
            List<Video> videos = new ArrayList<Video>(2);
            for (Element video : elms) {
                // Jsoup's attr() returns an empty string (not null) when the
                // attribute is absent, so fall back to data on an empty src
                String vidurl = video.attr("src");
                if (vidurl.trim().isEmpty())
                    vidurl = video.attr("data");
                if (vidurl.trim().isEmpty())
                    continue;
                Video vid = new Video();
                vid.setUrl(vidurl);
                if (vidurl.toLowerCase().contains("vimeo.com"))
                    vid.setProvider(VideoProvider.VIMEO);
                else
                    vid.setProvider(VideoProvider.YOUTUBE);
                videos.add(vid);
            }
            link.setVideos(videos);
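The [attr~=regex] form of Jsoup's selector syntax matches an attribute's value against a regular expression, which is what lets one query cover all three embed styles. A small demonstration with made-up markup:

// one regex-based selector matches both video iframes below
Document d = Jsoup.parse(
        "<iframe src=\"http://www.youtube.com/embed/abc123\"></iframe>" +
        "<iframe src=\"http://player.vimeo.com/video/123456\"></iframe>" +
        "<iframe src=\"http://example.com/other\"></iframe>");

Elements vids = d.select("iframe[src~=(youtube\\.com|vimeo\\.com)]");
// vids.size() == 2 -- the example.com iframe is not matched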
Finally, all data for the post has been gathered, so this Link object is added to the List; the loop ends and the list is returned:
            links.add(link);
        }

        return links;

    } catch (IOException ex) {
        Logger.getLogger(DishBlogParser.class.getName()).log(Level.SEVERE,
                "IOException when attempting to parse URL " + url, ex);
        throw new ParseException(ex);
    }
}
In conclusion…
This post has demonstrated web scraping using the open-source Jsoup library. Specifically, we loaded a page from a URL and used Jsoup’s selector syntax to extract the desired pieces of data. In a future post, I will write about what happens next: the list of Links is processed by a service bean and stored in the database.