J D C   T E C H   T I P S

TIPS, TECHNIQUES, AND SAMPLE CODE

WELCOME to the Java Developer Connection(sm) (JDC) Tech Tips,
September 23, 1999. This issue covers:

    * Extracting Links from an HTML File
    * Sorting Arrays

This issue of the JDC Tech Tips is written by Patrick Chan, the
author of the publication "The Java(tm) Developers Almanac"
(http://www.amazon.com/exec/obidos/ASIN/0201432986/xeo).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

EXTRACTING LINKS FROM AN HTML FILE

There are many applications that fetch an HTML page from the Web
and then extract the links from the page. For example, a
link-checker application fetches a page, extracts the links, and
then checks the links to see if they refer to actual pages. The
HTML 3.2 support in the Java(tm) 2 platform makes it fairly easy
to find and parse links. This tip demonstrates how to use that
support.

The first step is to create an editor kit. The purpose of an editor
kit is to parse data in some format, such as HTML or RTF, and store
the information in a data structure that fully represents the data.
This data structure, called a Document, allows you to examine and
modify the data in a convenient way. Let's look at an example.
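As a smaller warm-up, here is a minimal sketch of the parse step on
its own: an editor kit reads HTML from an in-memory string into a
Document, and the Document is then asked for its plain text. The
class name ParseToDocument, the method textOf(), and the sample HTML
are made up for illustration and are not part of the original tip.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, for illustration only.
class ParseToDocument {
    // Parses 'html' into a Document and returns the document's text content.
    static String textOf(String html)
            throws IOException, BadLocationException {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // Same workaround as in the tip: ignore any charset directive.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // The editor kit parses the HTML and fills in the Document.
        kit.read(new StringReader(html), doc, 0);

        // Examine the result: the text held by the Document, without tags.
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input.
        String html = "<html><body><p>Hello, "
            + "<a href=\"http://example.com/\">world</a></p></body></html>";
        System.out.println(textOf(html).trim());
    }
}
```

The sketch only examines the Document. The tip's own program goes
further and walks the Document's elements to find the links.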
In the following example program, we're going to examine the HTML
data in a Document object. The program looks for A (anchor) tags
and extracts the HREF attribute information from these tags.

    import java.io.*;
    import java.net.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;

    class GetLinks {
        public static void main(String[] args) {
            EditorKit kit = new HTMLEditorKit();
            Document doc = kit.createDefaultDocument();

            // The Document class does not yet handle charsets properly.
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            try {
                // Create a reader on the HTML content.
                Reader rd = getReader(args[0]);

                // Parse the HTML.
                kit.read(rd, doc, 0);

                // Iterate through the elements of the HTML document.
                ElementIterator it = new ElementIterator(doc);
                javax.swing.text.Element elem;
                while ((elem = it.next()) != null) {
                    SimpleAttributeSet s = (SimpleAttributeSet)
                        elem.getAttributes().getAttribute(HTML.Tag.A);
                    if (s != null) {
                        System.out.println(
                            s.getAttribute(HTML.Attribute.HREF));
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
                // Report failure to the caller.
                System.exit(1);
            }
            // Exit explicitly with a success status.
            System.exit(0);
        }

        // Returns a reader on the HTML data. If 'uri' begins
        // with "http:", it's treated as a URL; otherwise,
        // it's assumed to be a local filename.
        static Reader getReader(String uri) throws IOException {
            if (uri.startsWith("http:")) {
                // Retrieve from Internet.
                URLConnection conn = new URL(uri).openConnection();
                return new InputStreamReader(conn.getInputStream());
            } else {
                // Retrieve from file.
                return new FileReader(uri);
            }
        }
    }

This program takes one parameter from the command line. If the
parameter starts with "http:", the program treats the parameter as
a URL and fetches the HTML from that URL. Otherwise, the parameter
is treated as a filename and the HTML is fetched from that file.
For example,

    $ java GetLinks http://java.sun.com

retrieves the HTML from the main page at java.sun.com.

The editor kit is an HTMLEditorKit object that contains an HTML
parser. It creates a Document object that can represent HTML.
And it's the editor kit's read() method that parses the HTML and
stores the information in the Document.

Once the HTML data is saved in the Document object, we're ready to
look for links. This is done by creating an iterator (using
ElementIterator) that iterates over all the visible text pieces
(called elements) in the HTML. For each text piece, we check to see
if it has been formatted for linking, in other words, whether the
text is formatted with the A (anchor) tag. We do this by calling
getAttributes().getAttribute(HTML.Tag.A). If the text piece has
been formatted with the A tag, the method call returns the set of
attributes of the A tag used to format that text piece. Otherwise
the method call simply returns null.

Note: The name getAttributes() is a little confusing because it has
nothing to do with HTML attributes; the "attributes" in this case
are all the HTML tags (such as an A tag) that were used to format
that text piece.

Now we have the set of attributes of the A tag used to format a
piece of text; it's stored in a SimpleAttributeSet object. So we
just need to get the value of the HREF attribute and we're done. We
can do this by calling getAttribute(HTML.Attribute.HREF) on the A
tag's attribute set.

SORTING ARRAYS

This tip discusses how you can sort data in arrays. Sorting arrays
of primitive types is easy. There are seven methods in the class
Arrays for sorting arrays of each of the seven primitive types:
byte, char, double, float, int, long, and short. Here's an example
that sorts an array of doubles.

    import java.util.*;
    import java.awt.*;

    class Sort1 {
        // Sorts an array of random double values.
        public static void main(String[] args) {
            double[] dblarr = new double[10];
            for (int i=0; i