Validate that URLs exist using jQuery / PHP
It would be nice to have a pure JavaScript method of validating that URLs exist. One imagines you could use an AJAX call and verify the HTTP status code (200, 404, etc.) returned. However, the browser’s same-origin policy blocks cross-domain AJAX calls unless the remote server explicitly allows them (for example through CORS headers). So, in general, this method only works if you are validating that URLs exist on the same domain.
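As a rough sketch of the same-domain case, the snippet below first checks that a candidate URL shares the page’s origin, then issues a HEAD request and looks at the status. The names `isSameOrigin` and `checkSameOriginUrl` are made up for this illustration, and the sketch assumes an environment that provides the standard `URL` and `fetch` APIs, which postdate the jQuery version used later in this post.

```javascript
// Hypothetical helper: true when `candidate` resolves to the same
// scheme, host and port as `pageUrl`.
function isSameOrigin(candidate, pageUrl) {
  try {
    return new URL(candidate, pageUrl).origin === new URL(pageUrl).origin;
  } catch (e) {
    return false; // one of the inputs could not be parsed as a URL
  }
}

// Hypothetical helper: resolves to true/false for same-origin URLs,
// or to null when the URL is cross-domain and cannot be checked this way.
function checkSameOriginUrl(candidate, pageUrl) {
  if (!isSameOrigin(candidate, pageUrl)) {
    return Promise.resolve(null);
  }
  var absolute = new URL(candidate, pageUrl).href;
  return fetch(absolute, { method: 'HEAD' })
    .then(function (res) { return res.ok; })   // 2xx status
    .catch(function () { return false; });     // network failure
}
```

Anything that comes back as `null` here is exactly the case the rest of this post handles with a server-side script.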
Perhaps there is a way to use a hidden iframe to test the existence of a URL. I am not aware of a way to get the HTTP status code of a page that loads inside of an iframe, though, and I’m not sure it is possible. So, for URLs on other domains, you must rely on JavaScript plus a server-side programming language to perform this validation. I chose jQuery and PHP.
Let’s start with the PHP code. We need a script that will be provided a URL and will validate whether the URL exists. Briefly: the script will do a regex check upfront to ensure the URL has a valid structure, then will make an HTTP request to the URL and check the HTTP status code returned.
You can use cURL to make the request, but PHP’s fsockopen function is a bit more efficient for this quick check. Also note, it is not necessary to issue an HTTP GET command. HEAD will suffice, because we are not concerned with the body of the response at all. We are only interested in the HTTP status code, and so HEAD is more efficient.
Here is the PHP code. The fsockopen block of code comes from another source. I’ve included comments in the code, but I’ll elaborate further on some points below.
    // Strip out any leading http:// or https://
    // if other protocols such as ftp are present,
    // the intention is they'll fail regex further down
    $url = trim(isset($_GET['url']) ? $_GET['url'] : '');
    if (stripos($url, 'http://') === 0)
      $url = substr($url, 7);
    else if (stripos($url, 'https://') === 0)
      $url = substr($url, 8);

    // Get the string index of the first forward slash
    // if there is none, add one at the end
    $slashIdx = strpos($url, '/');
    if ($slashIdx === false) {
      $slashIdx = strlen($url);
      $url .= '/';
    }

    // Regex validation of URL string
    if (!preg_match('/^[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4}\/?([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~\(\)])*$/i', $url))
      echo '20 invalid URL string';
    else {
      // Prepare for fsockopen call
      // Split the URL into domain and path
      $domain = substr($url, 0, $slashIdx);
      $path = substr($url, $slashIdx);
      $portno = 80;
      $method = "HEAD"; // HEAD request is more efficient
      $http_response = "";
      $http_request = $method ." ". $path ." HTTP/1.0\r\n";
      $http_request .= "User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)\r\n";
      $http_request .= "Host: ". $domain ."\r\n\r\n"; // Host header for virtual hosts, then the blank line that ends the request

      // Attempt to connect to this domain
      $fp = @fsockopen($domain, $portno, $errno, $errstr);
      if ($fp) {
        fputs($fp, $http_request);
        // Read the status line (fgets returns at most 63 bytes here)
        $http_response = fgets($fp, 64);
        fclose($fp);

        // regex out the HTTP status code
        // then validate whether the code is a 200 OK or 301/302 redirect
        $found = preg_match('/HTTP\/\d\.\d (\d{3})/', $http_response, $matches);
        if ($found && in_array(intval($matches[1]), array(200, 301, 302)))
          echo '10 valid';
        else
          echo '30 http error';
      }
      else
        echo '40 unknown host';
    }
On Lines 12 to 16, I check for the existence of a forward slash. If there isn’t one, the only valid possibility is that the user supplied a domain name. In that case, a forward slash should be appended.
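To make the slash handling concrete, here is a minimal sketch of the same normalization ported to JavaScript, for instance as a quick pre-check before calling the server. The function name `splitHostAndPath` and the shape of the returned object are assumptions made for this example. They are not part of the PHP script.

```javascript
// Hypothetical helper mirroring the PHP steps: strip a leading
// http:// or https://, then split at the first forward slash.
function splitHostAndPath(rawUrl) {
  var url = rawUrl.trim().replace(/^https?:\/\//i, '');
  var slashIdx = url.indexOf('/');
  if (slashIdx === -1) {
    // No slash at all: the input can only be a bare domain name,
    // so the path to request is the site root.
    return { domain: url, path: '/' };
  }
  return { domain: url.slice(0, slashIdx), path: url.slice(slashIdx) };
}
```

With this sketch, both `example.com` and `http://example.com/` end up as domain `example.com` with path `/`.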
Line 19 is a regular expression for URL validation that I found using (what else?) Google. I’ve modified it a bit, and it should work for most URLs, although it only accepts top-level domains of two to four letters. If the regex check fails, I echo out '20 invalid URL string'. Why? Well, for this script I’ve invented a few response codes:
- 10 valid means we have successfully validated the URL (yay!)
- 20 invalid URL string means our regex validation failed
- 30 http error means we requested the URL but got an unfriendly HTTP status code back
- 40 unknown host means our call to fsockopen failed, probably due to a bad domain name
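On the client side, these four codes could be decoded with something like the sketch below. The names `RESPONSE_LABELS` and `parseValidatorResponse` are hypothetical, and the sketch assumes the response text always starts with the two-digit code, as the PHP script echoes it.

```javascript
// Labels for the four made-up response codes the PHP script echoes.
var RESPONSE_LABELS = {
  10: 'valid',
  20: 'invalid URL string',
  30: 'http error',
  40: 'unknown host'
};

// Hypothetical helper: turn a raw response such as "30 http error"
// into a small object that is easier to branch on.
function parseValidatorResponse(response) {
  var code = parseInt(String(response).trim().slice(0, 2), 10);
  var known = Object.prototype.hasOwnProperty.call(RESPONSE_LABELS, code);
  return {
    code: known ? code : null,
    label: known ? RESPONSE_LABELS[code] : 'unrecognized response',
    ok: code === 10
  };
}
```

Anything that does not start with one of the four codes (an empty body, a PHP error page) comes back with `code` set to `null`, so the caller can treat it as a failure rather than guess.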
On Line 30 I set the User-Agent header, in this case to MSIE 8. It’s a good idea to fake a common user agent when requesting a URL because some sites are set up to do funny things if they see an unexpected or missing user agent. We just care about how the site responds to a routine browser request.
On Line 38, $http_response will be set to something like HTTP/1.1 200 OK. A few lines down I use regex to extract the “200” portion (or whatever other status code is returned). Here, I’m accepting 200, 301 and 302 as valid responses. 301 and 302 are redirects, but for this exercise I consider them an acceptable signal that the URL exists. An enhancement might be to actually follow the redirect and validate that response.
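For illustration only, the status-line check can be sketched in JavaScript as well. This is a port of the PHP logic described above, not part of the validator itself, and the function names `statusFromLine` and `looksLikeExistingUrl` are invented for the example.

```javascript
// Hypothetical helper: pull the three-digit status code out of a
// status line such as "HTTP/1.1 200 OK".
function statusFromLine(statusLine) {
  var m = /HTTP\/\d\.\d (\d{3})/.exec(statusLine);
  return m ? parseInt(m[1], 10) : null; // null when no status code is found
}

// Same acceptance rule as in the PHP script: 200, or a 301/302 redirect.
function looksLikeExistingUrl(statusLine) {
  var status = statusFromLine(statusLine);
  return status !== null && [200, 301, 302].indexOf(status) !== -1;
}
```

Returning `null` for an unparseable line keeps a garbled or empty response from being mistaken for a valid status.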
Okay, we have our server-side URL validator. Now for the JavaScript.
This is the easy part.
    <form>
    <input type="text" name="u" id="utext">
    <input type="button" id="ubtn" value="Test">
    </form>

    <script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js"></script>
    <script type="text/javascript">
    $(document).ready(function() {
      $('#ubtn').click(function() {
        alert("URL: "+ $('#utext').val());
        $.get('/validateURL.php', {url: $('#utext').val()}, function(response) {
          alert(response);
        });
      });
    });
    </script>
The above code demonstrates using jQuery to call our URL validator. I’ve got a text input field (for entering a URL to test) and a button. When the button is clicked, jQuery makes an AJAX call (Line 11) to our script (named “validateURL.php” in this example). One of the four status codes will then be alerted. This jQuery block can be modified to be part of your general form validation. You would test to see if the response begins with “10” to identify a URL that is syntactically valid and exists.
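One possible way to fold this into form validation is sketched below. The helper name `checkUrlExists` and its callback signature are assumptions for this example. The request function is passed in as a parameter so the logic does not depend on a particular jQuery version.

```javascript
// Hypothetical helper: ask the validator about `url` and report a
// boolean to `done`. `getFn` is expected to behave like jQuery's
// $.get(url, data, successCallback).
function checkUrlExists(getFn, url, done) {
  getFn('/validateURL.php', { url: url }, function (response) {
    // Only a response starting with "10" means the URL passed both
    // the syntax check and the existence check.
    done(String(response).indexOf('10') === 0, response);
  });
}

// Possible usage on the page from this post (assumes jQuery is loaded):
//
//   checkUrlExists(
//     function (path, data, cb) { return $.get(path, data, cb); },
//     $('#utext').val(),
//     function (ok, raw) { if (!ok) { alert('URL check failed: ' + raw); } }
//   );
```

Because the call is asynchronous, the form should only be submitted from inside the `done` callback, not right after calling `checkUrlExists`.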
And there you have it. We’ve validated not only the syntax of a URL using regex, but actually tested whether the URL exists and is valid, using jQuery and PHP.