Core Java Technologies Technical Tips

April 22, 2003

Welcome to the Core Java Technologies Tech Tips, April 22, 2003. Here you'll get tips on using core Java technologies and APIs, such as those in Java 2 Platform, Standard Edition (J2SE).

This issue covers:

Validating URL Links
Reusing Exceptions

These tips were developed using Java 2 SDK, Standard Edition, v 1.4.

This issue of the Core Java Technologies Tech Tips is written by John Zukowski, president of JZ Ventures, Inc.

VALIDATING URL LINKS

A frequent problem that web site maintainers have is making sure that links on a site remain valid. Sometimes a resource that is the target of a link is removed. For example, consider links to technical articles. Over time, these articles can get out of date, and so are sometimes removed. After a resource like this is dropped, any link to it is no longer valid. Validating these links, especially links to resources on other sites, can keep an entire team of people busy, so automating the process can save a lot of time. The following tip presents a programmatic technique for validating URL links. Specifically, it presents a program that checks the response codes for all the foreign URLs on a web page, and then generates a report. The report has more information than simply the status of a link. It also includes things like what a redirected URL actually points to, so you can use the report to update the web page.

If you've used the Web at all, it's likely that you've encountered the dreaded 404 error. This means that the target of a link on a web page, that is, a destination page, is not found. The 404 in the error is a response code, one of the many response codes covered in the HTTP/1.1 specification, RFC 2616, which was published by the Internet Engineering Task Force (IETF) with participation from the World Wide Web Consortium (W3C). Section 6.1.1 of RFC 2616 (page 40) shows the complete list of response codes.

There are three classes in the java.net package that are useful in checking the response code for a URL link: URL, URLConnection, and HttpURLConnection. The URL class allows you to create a URL object for an http/https string. (You can create URLs for other protocols such as ftp, but the response code is only valid for HTTP connections.) The URLConnection class represents the communication link between the program and the resource that a URL refers to. The HttpURLConnection class is a URLConnection for HTTP requests, and it is this class that provides the getResponseCode method. A subclass, javax.net.ssl.HttpsURLConnection, handles HTTPS requests.

To get a URLConnection for a URL, you open a connection on a URL object using the openConnection method. This gives you the connection object, but it doesn't yet make the connection to the URL, so you have the option to configure the connection first, for instance, by setting special header fields. To make the connection to the object associated with the URL, you call the connect method. To check the response code, you call the getResponseCode method. By default, HttpURLConnection uses the HTTP GET method when retrieving an object. This means the actual contents of the object are returned. You should read the data from the input stream, and then close the stream when finished. This avoids the possibility of leaving the connection hanging, with the data only partially read.

Here's what a simple check for the response code associated with a URL looks like:

    import java.net.*;
    import java.io.*;

    public class SimpleURLCheck {

      public static void main(String args[]) {
        if (args.length == 0) {
          System.err.println
            ("Please provide a URL to check");
        } else {
          String urlString = args[0];
          try {
            URL url = new URL(urlString);
            URLConnection connection = 
            url.openConnection();
            if (connection instanceof HttpURLConnection) {
              HttpURLConnection httpConnection = 
                 (HttpURLConnection)connection;
              httpConnection.connect();
              int response = 
                 httpConnection.getResponseCode();
              System.out.println(
                 "Response: " + response);
              InputStream is = 
                httpConnection.getInputStream();
              byte[] buffer = new byte [256];
              while (is.read (buffer) != -1) {}
              is.close();
            }
          } catch (IOException e) {
            e.printStackTrace();
          }
        }
      }
    }

If you run the program with the URL http://java.sun.com/jdc/:

    java SimpleURLCheck http://java.sun.com/jdc/

it should return a response code of 200.
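One gap worth knowing about: for response codes of 400 and above, HttpURLConnection.getInputStream() throws an IOException instead of returning the body, so the read loop in SimpleURLCheck is never reached for a broken link. The following is a minimal sketch of one way to drain whichever stream actually carries the body. The class and method names (ResponseDrainer, drain, drainBody) are made up for this illustration and are not part of the original program.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;

// Hypothetical helper, not part of the original SimpleURLCheck program.
class ResponseDrainer {

    // Reads and discards everything left on the stream, then closes it.
    // Returns the number of bytes discarded; a null stream counts as 0.
    static long drain(InputStream in) throws IOException {
        if (in == null) {
            return 0;
        }
        long total = 0;
        byte[] buffer = new byte[256];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        } finally {
            in.close();
        }
        return total;
    }

    // Picks the stream that carries the body for this response code.
    static long drainBody(HttpURLConnection connection)
            throws IOException {
        int code = connection.getResponseCode();
        InputStream body = (code >= 400)
            ? connection.getErrorStream()  // null if no body was sent
            : connection.getInputStream();
        return drain(body);
    }
}
```

In SimpleURLCheck, a call such as ResponseDrainer.drainBody(httpConnection) could stand in for the getInputStream/read/close lines.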

Note that if you're behind a firewall, you need to set the http.proxyHost and http.proxyPort system properties as appropriate for your proxy. In other words, you need to add code in the program that looks something like this (the Properties class is in the java.util package):

    Properties prop = System.getProperties(); 
    prop.put("http.proxyHost","your-proxy-host-name");
    prop.put("http.proxyPort","your-proxy-port-number");
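As a small variation on the snippet above, the same properties can be set with System.setProperty, which avoids the extra import. This is only a sketch: the class name ProxySettings and the method useProxy are invented for the example, and the host and port you pass in are whatever your own proxy requires.

```java
// Hypothetical helper; the class and method names are illustrative.
class ProxySettings {

    // Sets the JDK networking properties for both http and https
    // connections. Call this before opening any connection.
    static void useProxy(String host, int port) {
        String portText = String.valueOf(port);
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", portText);
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", portText);
    }
}
```

The properties can also be supplied on the command line with the -D option of the java launcher, for example -Dhttp.proxyHost=your-proxy-host-name, which leaves the program itself unchanged.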

There's a specific reason why http://java.sun.com/jdc/ was picked as the URL. If you enter that URL in a browser, you'll notice that the browser gets redirected to http://developer.java.sun.com/developer/. The browser then loads the page to which it is redirected. That page loads without problem, so the server sends back a response code of 200 (HttpURLConnection.HTTP_OK), and that is the code SimpleURLCheck reports.

Why doesn't HttpURLConnection report that this is a redirected URL? By default, HttpURLConnection follows redirects. If you want to find out whether a URL redirects, you have to turn off the default behavior. You can do this for all HttpURLConnection objects by using the static setFollowRedirects method, or for a specific HttpURLConnection by using the setInstanceFollowRedirects method. In either case, providing an argument of false turns off the automatic redirect behavior. You can test this by adding the line:

    HttpURLConnection.setFollowRedirects(false);

to the SimpleURLCheck program. If you rerun the program:

    java SimpleURLCheck http://java.sun.com/jdc/

the response code should be 301.

The 301 response code means that the URL has moved permanently (by comparison, a response code of 302 represents a temporary redirect). That means that if the original URL was saved, a smart application could update the saved URL by getting the target of the redirect. To get that redirected URL, you need to retrieve the Location header from the HttpURLConnection. You do this with the getHeaderField method.
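The per-connection alternative, setInstanceFollowRedirects, is not shown in the listings in this tip. Here is a minimal sketch of how it might be used. The class name PerConnectionRedirects and its methods are invented for the illustration, and the cast assumes the URL uses http or https.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper; names are illustrative only.
class PerConnectionRedirects {

    // Opens (but does not yet connect) a connection that will not
    // follow redirects. Other connections keep the default behavior.
    static HttpURLConnection openWithoutRedirects(String urlString)
            throws IOException {
        HttpURLConnection connection = (HttpURLConnection)
            URI.create(urlString).toURL().openConnection();
        connection.setInstanceFollowRedirects(false);
        return connection;
    }

    // True for 301, the "moved permanently" code.
    static boolean isPermanentRedirect(int code) {
        return code == HttpURLConnection.HTTP_MOVED_PERM;
    }

    // True for 302, the temporary redirect code.
    static boolean isTemporaryRedirect(int code) {
        return code == HttpURLConnection.HTTP_MOVED_TEMP;
    }
}
```

Because the setting applies to one object only, a program can check redirects on some links while still letting other connections follow them automatically.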

Here's the SimpleURLCheck program with setFollowRedirects and getHeaderField methods added:

    import java.net.*;
    import java.io.*;

    public class SimpleURLCheck {

      public static void main(String args[]) {
        if (args.length == 0) {
          System.err.println(
              "Please provide a URL to check");
        } else {
          HttpURLConnection.setFollowRedirects(false);
          String urlString = args[0];
          try {
            URL url = new URL(urlString);
            URLConnection connection = 
              url.openConnection();
            if (connection instanceof HttpURLConnection) {
              HttpURLConnection httpConnection = 
               (HttpURLConnection)connection;
              httpConnection.connect();
              int response = 
               httpConnection.getResponseCode();
              System.out.println("Response: " + response);
              String location =          
               httpConnection.getHeaderField("Location");
              if (location != null) {
                System.out.println(
                 "Location: " + location);
              }
              InputStream is = 
                httpConnection.getInputStream();
              byte[] buffer = new byte [256];
              while (is.read (buffer) != -1) {}
              is.close();
            }
          } catch (IOException e) {
            e.printStackTrace();
          }
        }
      }
    }

Now, when you run the application with a URL of http://java.sun.com/jdc/:

    java SimpleURLCheck http://java.sun.com/jdc/

it should produce the following results:

    Response: 301
    Location: http://developer.java.sun.com/developer/

The check even works for https URLs (the following command should be entered on one line):

   java SimpleURLCheck 
   https://www.madonotcall.govconnect.com/

should produce the following results:

 
   Response: 302
   Location: cookiestest.asp

Remember to add proxy settings if you're behind a firewall, that is, for https.proxyHost and https.proxyPort.

Notice that this particular site runs a quick cookie test. If you rerun the program with the new URL, appending cookiestest.asp to the end of the first URL, you see the next redirection (again, the following command goes on one line):

    java SimpleURLCheck 
    https://www.madonotcall.govconnect.com/cookiestest.asp
    
    Response: 302
    Location: cookies_error.htm

Of course, this little command-line program doesn't support cookies, so the web site redirects to an error page. Had the URL been entered into a browser, the response would have been a redirect to https://www.madonotcall.govconnect.com/Welcome.asp.
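Note that the Location values in these two runs (cookiestest.asp and cookies_error.htm) are relative, while the earlier java.sun.com redirect returned an absolute URL. A report that wants to show or follow the real target has to resolve the header against the URL that was requested. The sketch below shows one way to do that with java.net.URI. The class name LocationResolver is made up for this example.

```java
import java.net.URI;

// Hypothetical helper; not part of the original program.
class LocationResolver {

    // Resolves a Location header value against the requested URL.
    // An absolute Location is returned unchanged.
    static String resolve(String requestedUrl, String location) {
        return URI.create(requestedUrl).resolve(location).toString();
    }
}
```

For example, resolving cookiestest.asp against https://www.madonotcall.govconnect.com/ gives https://www.madonotcall.govconnect.com/cookiestest.asp, which is the URL used in the second run above.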

Yet another thing to add to the link checker program is a smarter way to check status codes. Because the program doesn't care what the content actually is, you can set the request method to HEAD. A HEAD request asks only for the headers of the response, not the actual data. By default, an HTTP request is a GET request, in which any request parameters are embedded in the URL and the full content is returned. You can make a HEAD request for the HttpURLConnection by specifying setRequestMethod("HEAD"). For example, you can add the following line to the SimpleURLCheck program:

   httpConnection.setRequestMethod("HEAD");

Instead of showing the full program again here, you'll see the setRequestMethod method in use in the enhanced URL check report later.

Unlike setFollowRedirects, which you can specify to cover all connections, setRequestMethod needs to be specified for each connection.
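Since the request method has to be set on every connection, it can be convenient to wrap that step in a small factory method. The following is a sketch under a few assumptions: the name HeadRequest.openHead is invented, the five-second timeouts are arbitrary, and setConnectTimeout and setReadTimeout were added to URLConnection in a later release than the 1.4 SDK these tips were developed with.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper; names and timeout values are illustrative.
class HeadRequest {

    // Opens (but does not yet connect) a connection configured to
    // send a HEAD request. Assumes an http or https URL.
    static HttpURLConnection openHead(String urlString)
            throws IOException {
        HttpURLConnection connection = (HttpURLConnection)
            URI.create(urlString).toURL().openConnection();
        connection.setRequestMethod("HEAD");
        // Assumed values: give up after 5 seconds so that one dead
        // server cannot stall a whole link report.
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(5000);
        return connection;
    }
}
```

A caller would then invoke connect and getResponseCode on the returned object exactly as in SimpleURLCheck.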

With the SimpleURLCheck program, you are able to check a single URL at a time. By moving the checking code to a method, and automating the scanning for URLs from a web page, you can generate a report on the validity of the URLs. The report can also track redirects, that is, show where redirected URLs now point. Because the check uses HEAD requests, it is not necessary to have code that gets the InputStream, reads the data, and then closes the stream. However, there is no problem if you leave that code in the program, because the read returns -1 immediately.

An earlier Tech Tip titled "Extracting Links from an HTML File" presented a program that fetches URLs from a web page. You can combine that program with the SimpleURLCheck program to generate the link check report.

For a simple report, let's just print out the response code for each link (and the redirect URL for those that have one). To understand the report, you need to understand the range of response codes for HTTP requests. Going back to RFC 2616, you'll notice the following categories of response codes:

    1xx - informational
    2xx - successful
    3xx - redirection
    4xx - client error
    5xx - server error
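These categories depend only on the first digit of the code, so a report can label each result with a simple integer division. The sketch below is illustrative; the class name StatusCategory and the method describe are not part of the original program.

```java
// Hypothetical helper; maps a response code to its RFC 2616 category.
class StatusCategory {

    static String describe(int code) {
        switch (code / 100) {
            case 1:  return "informational";
            case 2:  return "successful";
            case 3:  return "redirection";
            case 4:  return "client error";
            case 5:  return "server error";
            default: return "unknown";
        }
    }
}
```

With this in place, a 404 is reported as a client error and a 301 as a redirection.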

If you were to generate a smarter report, you could ignore the links that return response codes in the 100 and 200 range, and report the rest as errors. Another way to make the report smarter is to also check internal links, not just external ones. And, if you want the program to be truly smart, you might want to ignore errors like redirecting http://sun.com to http://www.sun.com/, or at least flag them differently. Another enhancement, though a tricky one, is automating the tagging of URLs with session information, such as when you visit a URL like http://www.networksolutions.com/.
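Two of these ideas are easy to sketch: skipping codes in the 100 and 200 range, and recognizing a redirect that only adds or removes a leading "www." or a trailing slash on an empty path. The class name ReportFilter and its rules are assumptions made for this illustration. In particular, the decision to treat such a redirect as trivial only when the scheme and path match is one possible policy, not something the tip prescribes.

```java
import java.net.URI;

// Hypothetical helper; the filtering rules are illustrative only.
class ReportFilter {

    // Codes in the 1xx and 2xx ranges are not worth reporting.
    static boolean isReportable(int code) {
        return code < 100 || code >= 300;
    }

    // True if 'to' differs from 'from' only by a leading "www." on
    // the host or by "" versus "/" as the path. Both arguments are
    // assumed to be absolute URLs.
    static boolean isTrivialRedirect(String from, String to) {
        URI a = URI.create(from);
        URI b = URI.create(to);
        return String.valueOf(a.getScheme())
                   .equalsIgnoreCase(String.valueOf(b.getScheme()))
            && stripWww(a.getHost()).equals(stripWww(b.getHost()))
            && normalizePath(a.getPath())
                   .equals(normalizePath(b.getPath()));
    }

    private static String stripWww(String host) {
        if (host == null) {
            return "";
        }
        String lower = host.toLowerCase(java.util.Locale.ROOT);
        return lower.startsWith("www.") ? lower.substring(4) : lower;
    }

    private static String normalizePath(String path) {
        return (path == null || path.isEmpty()) ? "/" : path;
    }
}
```

Under these rules the http://sun.com to http://www.sun.com/ redirect counts as trivial, while the redirect from http://java.sun.com/jdc/ to http://developer.java.sun.com/developer/ does not.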

Here's what the enhanced URL check program looks like.

    import java.io.*;
    import java.net.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;

    class EnhancedURLCheck {
      public static void main(String[] args) {
        HttpURLConnection.setFollowRedirects(false);
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // The Document class does not yet 
        // handle charset's properly.
        doc.putProperty("IgnoreCharsetDirective", 
          Boolean.TRUE);

        try {

          // Create a reader on the HTML content.
          Reader rd = getReader(args[0]);

          // Parse the HTML.
          kit.read(rd, doc, 0);

          // Iterate through the elements 
          // of the HTML document.
          ElementIterator it = new ElementIterator(doc);
          javax.swing.text.Element elem;
          while ((elem = it.next()) != null) {
            SimpleAttributeSet s = (SimpleAttributeSet)
              elem.getAttributes().getAttribute(
                HTML.Tag.A);
            if (s != null) {
              validateHref(
                (String)s.getAttribute(
                  HTML.Attribute.HREF));
            }
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
        // Exit with status 0; a nonzero status would signal failure.
        System.exit(0);
      }
   
      // Returns a reader on the HTML data. If 'uri' begins
      // with "http:", it's treated as a URL; otherwise,
      // it's assumed to be a local filename.
      static Reader getReader(String uri) 
        throws IOException {
        if (uri.startsWith("http:")) {

          // Retrieve from Internet.
          URLConnection conn = 
            new URL(uri).openConnection();
          return new 
            InputStreamReader(conn.getInputStream());
        } else {

          // Retrieve from file.
          return new FileReader(uri);
        }
      }

      private static void validateHref(String urlString) {
        if ((urlString != null) && 
         urlString.startsWith("http://")) {
          try {
            URL url = new URL(urlString);
            URLConnection connection = 
              url.openConnection();
            if (connection instanceof HttpURLConnection) {
              HttpURLConnection httpConnection = 
                (HttpURLConnection)connection;
              httpConnection.setRequestMethod("HEAD");
              httpConnection.connect();
              int response = 
                httpConnection.getResponseCode();
              System.out.println("[" + response + "]" + 
                urlString);
              String location = 
                httpConnection.getHeaderField("Location");
              if (location != null) {
                System.out.println(
                  "Location: " + location);
              }
              System.out.println();
            }
          } catch (IOException e) {
            e.printStackTrace();
          }
        }
      }
    }

If you run the report on the http://java.sun.com page:

    java EnhancedURLCheck http://java.sun.com

it should produce output similar to the following (only the first few lines of the output are shown):

    [200]http://java.sun.com/

    [200]http://search.java.sun.com/search/java/
    advanced.jsp

    [200]http://java.sun.com/

    [302]http://servlet.java.sun.com/logRedirect/
    frontpage-nav/http://java.sun.com/products/
    Location: http://java.sun.com/products/

    [302]http://servlet.java.sun.com/logRedirect/
    frontpage-nav/http://java.sun.com/j2ee/
    Location: http://java.sun.com/j2ee/

For more information about using URLs, see the lesson "Working with URLs" in the Java Tutorial. Also see the lesson "Reading from and Writing to a URLConnection".

REUSING EXCEPTIONS

When an exceptional condition occurs in a program, it's typical to throw an exception. The syntax for throwing an exception often looks something like this:

  throw new MyException("String");

Literally, you create the exception when the exceptional condition happens, and immediately throw it to the caller (or to some enclosing try-catch block).

If you print a stack trace at that point with printStackTrace(), the information you see is filled with the details of the stack at the time the exception was created. This might sound obvious. However, if you want to avoid creating new objects when an exception occurs, you can create the exception once, and reuse it when the exceptional condition reoccurs. A common design pattern for this is called a Singleton. You can create the object the first time you need it, and then keep reusing the same object. For example:

    MyObject myObject = null;

    private synchronized MyObject getMyObject() {
      if (myObject == null) {
        myObject = new MyObject();
      }
      return myObject;
    }

What happens if you use a pattern similar to this with an Exception object (or in fact anything that subclasses java.lang.Throwable)? In this case, the stack trace is filled with the trace from when the exception was created. Typically, that isn't what you want because you probably want to see the exact trace for each exception, that is, when each exception happened. In order to share exception objects in this way, you have to refill the stack trace for the new stack when the exception happens. The method of Throwable to do this is fillInStackTrace.

Here's what the getMyObject pattern looks like after adjusting it to work with exception objects:

    MyExceptionObject myExceptionObject = null;
  
    private synchronized MyExceptionObject 
      getMyExceptionObject() {
      if (myExceptionObject == null) {
        myExceptionObject = new MyExceptionObject();
      } else {
        myExceptionObject.fillInStackTrace();
      }
      return myExceptionObject;
    }

Notice that you don't have to fill in the stack manually the first time. It's done for you. You only have to fill in the stack for subsequent occurrences.

At this point, you might ask why you would want to reuse exception objects. One reason to reuse (or even just pre-create) exception objects is to minimize the number of objects created when an exception actually happens. Of course, filling in the stack trace takes time and memory of its own, so reuse doesn't eliminate the cost entirely. And, if you truly don't need an accurate stack trace, this pattern lets you skip refilling it on reuse.
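If the stack trace is never needed at all, a related technique (not the one used in this tip) is to override fillInStackTrace in the exception class so that it does nothing. The Throwable constructor calls fillInStackTrace, so an exception of such a class is created without walking the stack. The class name StacklessException below is invented for this sketch.

```java
// Hypothetical exception type that never records a stack trace.
class StacklessException extends Exception {

    StacklessException(String message) {
        super(message);
    }

    // Called by the Throwable constructor; returning 'this' without
    // calling super skips the stack walk entirely.
    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;
    }
}
```

The trade-off is that printStackTrace on such an exception shows only the class name and message, so this variant suits exceptions used purely as signals. The program that follows in this tip does not use it; it keeps the stack trace and refills it.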

To demonstrate this behavior, the following program shows two ways of reusing exception objects. For each approach, the exception is printed three times. In the first case, the same stack is shown three times, even though three different methods print the trace. The latter case demonstrates the difference in each stack when you refill the stack trace with each call.

    import java.io.*;

    public class ReuseException {

        IOException exception1 = null;

        private synchronized IOException 
          getException1() {
            if (exception1 == null) {
                exception1 = new IOException();
            }
            return exception1;
        }

        IOException exception2 = null;

        private synchronized IOException 
          getException2() {
            if (exception2 == null) {
                exception2 = new IOException();
            } else {
                exception2.fillInStackTrace();
            }
            return exception2;
        }

        void exception1Method1() {
            getException1().printStackTrace();
        }

        void exception1Method2() {
            getException1().printStackTrace();
        }

        void exception1Method3() {
            getException1().printStackTrace();
        }

        void exception2Method1() {
            getException2().printStackTrace();
        }

        void exception2Method2() {
            getException2().printStackTrace();
        }

        void exception2Method3() {
            getException2().printStackTrace();
        }

        public static void main(String[] args) {
            ReuseException reuse = 
              new ReuseException();
            reuse.exception1Method1();
            reuse.exception1Method2();
            reuse.exception1Method3();
            System.out.println("---");
            reuse.exception2Method1();
            reuse.exception2Method2();
            reuse.exception2Method3();
        }
    }

When you run the program, your output should look something like this:

    java.io.IOException
        at ReuseException.getException1
        (ReuseException.java:9)
        at ReuseException.exception1Method1
        (ReuseException.java:26)
        at ReuseException.main
        (ReuseException.java:51)
    java.io.IOException
        at ReuseException.getException1
        (ReuseException.java:9)
        at ReuseException.exception1Method1
        (ReuseException.java:26)
        at ReuseException.main
        (ReuseException.java:51)
    java.io.IOException
        at ReuseException.getException1
        (ReuseException.java:9)
        at ReuseException.exception1Method1
        (ReuseException.java:26)
        at ReuseException.main
        (ReuseException.java:51)
    ---
    java.io.IOException
        at ReuseException.getException2
        (ReuseException.java:18)
        at ReuseException.exception2Method1
        (ReuseException.java:38)
        at ReuseException.main
        (ReuseException.java:55)
    java.io.IOException
        at ReuseException.getException2
        (ReuseException.java:20)
        at ReuseException.exception2Method2
        (ReuseException.java:42)
        at ReuseException.main
        (ReuseException.java:56)
    java.io.IOException
        at ReuseException.getException2
        (ReuseException.java:20)
        at ReuseException.exception2Method3
        (ReuseException.java:46)
        at ReuseException.main
        (ReuseException.java:57)
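The same difference can be checked in code rather than by reading printed traces. The sketch below looks at the second frame of the recorded trace, which is the method that asked for the exception. The class name ReusedExceptionFrames and its methods are invented for this illustration, and the refill flag merges the two getter methods of ReuseException into one.

```java
import java.io.IOException;

// Hypothetical demo; mirrors ReuseException but returns frame names.
class ReusedExceptionFrames {

    private IOException shared;

    // Creates the exception on first use. On later calls, refreshes
    // the stack trace only if 'refill' is true.
    private synchronized IOException get(boolean refill) {
        if (shared == null) {
            shared = new IOException("reused");
        } else if (refill) {
            shared.fillInStackTrace();
        }
        return shared;
    }

    String callerA(boolean refill) {
        return callerOf(get(refill));
    }

    String callerB(boolean refill) {
        return callerOf(get(refill));
    }

    // Frame 0 is get(); frame 1 is the method that called get() when
    // the trace was last filled in.
    private static String callerOf(Throwable t) {
        return t.getStackTrace()[1].getMethodName();
    }
}
```

Calling callerA(false) on a new instance and then callerB(false) reports callerA both times, because the trace still dates from creation. A following call to callerB(true) reports callerB, because the trace was refilled at that point.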

For more information about working with exceptions, see the lesson "Handling Errors with Exceptions" in the Java Tutorial. Also see the documentation for the Throwable class.

IMPORTANT: Please read our Terms of Use, Privacy, and Licensing policies:
http://www.sun.com/share/text/termsofuse.html
http://www.sun.com/privacy/
http://developer.java.sun.com/berkeley_license.html

Comments? Send your feedback on the Core Java Technologies Tech Tips to: jdc-webmaster@sun.com

ARCHIVES: You'll find the Core Java Technologies Tech Tips archives at:
http://java.sun.com/jdc/TechTips/index.html

Copyright 2003 Sun Microsystems, Inc. All rights reserved.
4150 Network Circle, Santa Clara, CA 95054 USA.

This document is protected by copyright. For more information, see:
http://java.sun.com/jdc/copyright.html

Java, J2SE, J2EE, J2ME, and all Java-based marks are trademarks or registered trademarks (http://www.sun.com/suntrademarks/) of Sun Microsystems, Inc. in the United States and other countries.
