How to extract titles from web pages in Java
Let’s say you have a set of URLs and you want the web page titles associated with them. Maybe you’ve data-mined a bunch of links from HTML pages, or acquired a flat file listing URLs. How would you go about getting the corresponding page titles, and associating them with the URLs using Java?
You could use an HTML parser such as Jsoup to request the HTML document associated with each URL and parse it into a DOM document. Once obtained, you could navigate the document and select the text from the title tag, like so:
String titleText = document.select("title").first().text();
Elegant, but a lot of overhead for such a simple task. You’d be loading the whole page into memory and parsing it into a DOM structure just to extract the title. Alternatively, you could use the Apache HTTP Client library, which provides a robust API for requesting resources over the HTTP protocol, but that would also be unnecessary in this case. Let’s keep it simple and rely only on the Java standard library.
To extract the title from a web page, you need to open up a URLConnection. With this connection, you’ll be able to read response headers from the server as well as the response body (which ought to contain a title tag). Before attempting to grab the page title, you should consider the Content-Type response header. Validate that the URL does indeed reference a document of type text/html, otherwise your URL may be referencing an image file, PDF or other type of resource.
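The Content-Type check described above can be sketched as a small helper. This is a minimal illustration, not part of the TitleExtractor class: the class name HtmlContentTypeCheck and the method isHtml are hypothetical, and the sketch assumes the header value has already been read from the connection, for example with URLConnection.getContentType().

```java
import java.util.Locale;

class HtmlContentTypeCheck {
    // Returns true when a Content-Type header value names text/html,
    // ignoring case, surrounding whitespace, and parameters such as charset.
    // (Hypothetical helper for illustration only.)
    static boolean isHtml(String headerValue) {
        if (headerValue == null) {
            return false; // the server sent no Content-Type header
        }
        int semicolon = headerValue.indexOf(';');
        String mediaType = semicolon == -1
                ? headerValue
                : headerValue.substring(0, semicolon);
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8")); // true
        System.out.println(isHtml("image/png"));                // false
    }
}
```

Comparing only the part before the first semicolon keeps a charset parameter from causing a false negative, and the null check covers servers that omit the header entirely.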
Next, it is good practice to determine the character set of the HTML page. The server frequently sends it as part of the Content-Type header value, but not always; it may instead be declared in an HTML meta tag. For this example we’ll look only at the Content-Type header, and if the character set is not specified there, we will fall back to the platform’s default character set.
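As a rough sketch of that fallback rule, the helper below pulls a charset parameter out of a Content-Type header value and returns the platform default when none is usable. The class name HeaderCharset and the method fromContentType are hypothetical, and the sketch deliberately ignores meta tags, as the article does.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderCharset {
    // Matches a charset parameter such as: charset=UTF-8
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset=([-_a-zA-Z0-9]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or the platform
    // default when the header is missing, has no charset parameter,
    // or names a charset this JVM does not support.
    // (Hypothetical helper for illustration only.)
    static Charset fromContentType(String headerValue) {
        if (headerValue != null) {
            Matcher m = CHARSET_PARAM.matcher(headerValue);
            if (m.find()) {
                String name = m.group(1);
                try {
                    if (Charset.isSupported(name)) {
                        return Charset.forName(name);
                    }
                } catch (IllegalArgumentException e) {
                    // syntactically illegal charset name: fall through
                }
            }
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1")); // ISO-8859-1
        System.out.println(fromContentType("text/html")); // platform default, varies by system
    }
}
```

Charset.isSupported throws for syntactically illegal names rather than returning false, which is why the lookup is wrapped in a try block.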
After you’ve verified that the URL points to an HTML page and have determined the character set, the next step is to extract the title text from the response body. In this example, I use regular expressions to extract and clean up the title. Have a look at this TitleExtractor class, with comments to explain what is going on:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    /* the CASE_INSENSITIVE flag accounts for
     * sites that use uppercase title tags.
     * the DOTALL flag accounts for sites that have
     * line feeds in the title text.
     * the reluctant (.*?) stops at the first closing
     * title tag rather than the last one */
    private static final Pattern TITLE_TAG =
        Pattern.compile("\\<title>(.*?)\\</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /**
     * @param url the HTML page
     * @return title text (null if document isn't HTML or lacks a title tag)
     * @throws IOException
     */
    public static String getPageTitle(String url) throws IOException {
        URL u = new URL(url);
        URLConnection conn = u.openConnection();

        // ContentType is an inner class defined below
        ContentType contentType = getContentTypeHeader(conn);
        // don't continue if the header is missing or the document is not HTML
        if (contentType == null || !contentType.contentType.equalsIgnoreCase("text/html"))
            return null;

        // determine the charset, or use the default
        Charset charset = getCharset(contentType);
        if (charset == null)
            charset = Charset.defaultCharset();

        // read the response body, using BufferedReader for performance
        StringBuilder content = new StringBuilder();
        InputStream in = conn.getInputStream();
        // try-with-resources closes the reader even if a read fails
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            int n = 0, totalRead = 0;
            char[] buf = new char[1024];
            // read until EOF or first 8192 characters
            while (totalRead < 8192 && (n = reader.read(buf, 0, buf.length)) != -1) {
                content.append(buf, 0, n);
                totalRead += n;
            }
        }

        // extract the title
        Matcher matcher = TITLE_TAG.matcher(content);
        if (matcher.find()) {
            /* replace any occurrences of whitespace (which may
             * include line feeds and other uglies) as well
             * as HTML brackets with a space */
            return matcher.group(1).replaceAll("[\\s\\<>]+", " ").trim();
        }
        return null;
    }

    /**
     * Loops through response headers until Content-Type is found.
     * @param conn
     * @return ContentType object representing the value of
     * the Content-Type header (null if the header is absent)
     */
    private static ContentType getContentTypeHeader(URLConnection conn) {
        int i = 0;
        boolean moreHeaders = true;
        do {
            String headerName = conn.getHeaderFieldKey(i);
            String headerValue = conn.getHeaderField(i);
            // header names are case-insensitive
            if (headerName != null && headerName.equalsIgnoreCase("Content-Type"))
                return new ContentType(headerValue);
            i++;
            moreHeaders = headerName != null || headerValue != null;
        } while (moreHeaders);
        return null;
    }

    private static Charset getCharset(ContentType contentType) {
        if (contentType != null && contentType.charsetName != null
                && Charset.isSupported(contentType.charsetName))
            return Charset.forName(contentType.charsetName);
        else
            return null;
    }

    /**
     * Class holds the content type and charset (if present)
     */
    private static final class ContentType {
        private static final Pattern CHARSET_HEADER =
            Pattern.compile("charset=([-_a-zA-Z0-9]+)", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        private String contentType;
        private String charsetName;

        private ContentType(String headerValue) {
            if (headerValue == null)
                throw new IllegalArgumentException("ContentType must be constructed with a not-null headerValue");
            int n = headerValue.indexOf(";");
            if (n != -1) {
                // trim so that "text/html ; charset=..." still compares equal to "text/html"
                contentType = headerValue.substring(0, n).trim();
                Matcher matcher = CHARSET_HEADER.matcher(headerValue);
                if (matcher.find())
                    charsetName = matcher.group(1);
            } else
                contentType = headerValue.trim();
        }
    }
}
Making use of this class is simple:
String title = TitleExtractor.getPageTitle("http://en.wikipedia.org/");
System.out.println(title);
Output: Wikipedia, the free encyclopedia
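The extract-and-clean-up step can also be tried on its own, without any network access. The following is a minimal sketch that applies a title regex to an in-memory string; the class name TitleRegexDemo, the method extractTitle, and the sample HTML are made up for illustration. It assumes a reluctant quantifier, (.*?), so that the match ends at the first closing title tag even if the page contains more than one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleRegexDemo {
    // CASE_INSENSITIVE handles <TITLE>; DOTALL lets '.' match line feeds.
    private static final Pattern TITLE_TAG =
            Pattern.compile("<title>(.*?)</title>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the cleaned-up title text, or null when no title tag is found.
    // (Hypothetical helper for illustration only.)
    static String extractTitle(CharSequence html) {
        Matcher m = TITLE_TAG.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Collapse runs of whitespace and stray angle brackets into one space.
        return m.group(1).replaceAll("[\\s<>]+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE>\n  Example\n  Page </TITLE></head>"
                + "<body></body></html>";
        System.out.println(extractTitle(html)); // Example Page
    }
}
```

Like any regex-based approach, this does not handle title tags that carry attributes or markup hidden inside comments; an HTML parser remains the more robust choice for such pages.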
So in this example, we used the Java standard library to look up a web page and extract its title. Normally I would recommend using an HTML parser, but for this simple task it was not necessary.
is there any like button here 😀
thnx alot 🙂
Thanks a lot very helpful. Saved me some time.
Thank you for this code, it works fine.
great post, thank you!
Nice contribution. Ty.
nice post
Thanks thats really helpful 🙂
Thank you!
I attempted to do this on Android with an API level past 3.0, where this code runs on the main thread. I used it before, but at present that is not possible.