Date: Tue, 27 Jun 2000 20:44:18 GMT
From: JDCTechTips@sun.com
Subject: JDC Tech Tips June 27, 2000
To: JDCMember@sun.com
Reply-To: JDCTechTips@sun.com

J D C   T E C H   T I P S

TIPS, TECHNIQUES, AND SAMPLE CODE

WELCOME to the Java Developer Connection(sm) (JDC) Tech Tips, June 27, 2000. This issue covers some aspects of using the Java(tm) programming language with XML. First there's a short introduction to XML, followed by tips on how to use two APIs designed for use with XML. The tips are:

* Using the SAX API
* Using the DOM API

These tips were developed using Java(tm) 2 SDK, Standard Edition, v 1.3.

This issue of the JDC Tech Tips is written by Stuart Halloway, a Java specialist at DevelopMentor (http://www.develop.com/java).

You can view this issue of the Tech Tips on the Web at http://developer.java.sun.com/developer/TechTips/2000/tt0627.html

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
XML INTRODUCTION

The Extensible Markup Language (XML) is a way of specifying the content elements of a page to a Web browser. XML is syntactically similar to HTML. In fact, XML can be used in many of the places in which HTML is used today. Here's an example. Imagine that the JDC Tech Tip index was stored in XML instead of HTML.
Instead of HTML coding such as this:

JDC Tech Tip Index

Random Access for Files

It might look something like this:

Notice the coding similarities between XML and HTML. In each case, the document is organized as a hierarchy of elements, where each element is demarcated by angle brackets. As is true for most HTML elements, each XML element consists of a start tag, followed by some data, followed by an end tag:

    <element>element data</element>

Also as in HTML, XML elements can be annotated with attributes. In the XML example above, each <tip> element has several attributes. The 'title' attribute is the name of the tip, the 'author' attribute gives a short form of the author's name, and the 'htmlURL' and 'textURL' attributes contain links to different archived formats of the tip.

The similarities between the two markup languages are an important advantage as the world moves to XML, because hard-earned HTML skills continue to be useful. However, they do raise the question "Why bother to switch to XML at all?" To answer this question, look again at the XML example above, and this time consider the semantics instead of the syntax. Where HTML tells you how to format a document, XML tells you about the content of the document. This capability is very powerful. In an XML world, clients can reorganize data in the way most useful to them. They are not restricted to the presentation format delivered by the server.

Importantly, the XML format has been designed for the convenience of parsers, without sacrificing readability. XML imposes strong guarantees about the structure of documents. To name a few: begin tags must have end tags, elements must nest properly, and all attributes must have values. This strictness makes parsing and transforming XML much more reliable than attempting to manipulate HTML.

The similarities between XML and HTML stem from a shared history. HTML is a simplified vocabulary of a powerful markup language called SGML. SGML is the "kitchen sink" of markup, allowing you to do almost anything, including the ability to define your own domain-specific vocabularies.
HTML is a dim shadow of SGML, with a predefined vocabulary. Thus HTML is basically a static snapshot of some presentation features that seemed useful circa 1992. Both SGML and HTML are problematic: SGML does everything, but is too complex. HTML is simple, but its parsing rules are loose, and its vocabulary does not provide a standard mechanism for extension.

XML, by comparison, is a streamlined version of SGML. It aims to meet the most important objectives of SGML without too much complexity. If SGML is the "kitchen sink," XML is a "Swiss Army knife." Given its advantages, XML does far more than simply displace HTML in some applications. It can also displace SGML, and open new opportunities where the complexity of SGML had been a barrier.

Regardless of how you plan to use XML, the programming language of choice is likely to be the Java programming language. Although you could write your own code to parse XML directly, the Java language provides higher level tools to parse XML documents through the Simple API for XML (SAX) and the Document Object Model (DOM) interfaces. SAX and DOM are standard interfaces that are implemented in several different languages. In the Java programming language, you can instantiate the parsers by using the Java(tm) API for XML Parsing (JAXP).

To execute the code in this tip, you will need to download JAXP and a reference implementation of the SAX and DOM parsers from http://java.sun.com/xml/download.html. You will also need to download SAX 2.0 from http://www.megginson.com/SAX/Java. Remember to update your class path to include the jaxp, parser, and sax2 JAR files.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
USING THE SAX API

The SAX API provides a serial mechanism for accessing XML documents. It was developed by members of the XML-DEV mailing list as a standard set of interfaces to allow different vendor implementations.
The SAX model allows for simple parsers by letting a parser read through a document in a linear way and call an event handler every time a markup event occurs. The original SAX implementation was released in May 1998. It was superseded by SAX 2.0 in May 2000. (The code in this tip is SAX2 compliant.)

To receive notification of markup events through SAX2, all you have to do is implement a few methods and interfaces. The ContentHandler interface is the most important of these interfaces. It declares a number of methods for different steps in parsing an XML document. In many cases, you will only be interested in a few of these methods. For example, the code below handles only a single ContentHandler method (startElement), and uses it to build an HTML page from the XML Tech Tip Index:

    import java.io.*;
    import java.net.*;
    import java.util.*;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    /**
     * Builds a simple HTML page which lists tip titles
     * and provides links to HTML and text versions
     */
    public class UseSAX2 extends DefaultHandler {
        StringBuffer htmlOut;

        public String toString() {
            if (htmlOut != null)
                return htmlOut.toString();
            return super.toString();
        }

        // Called once for every start tag; only tip elements produce output
        public void startElement(String namespace, String localName,
                                 String qName, Attributes atts) {
            if (localName.equals("tip")) {
                String title = atts.getValue("title");
                String html = atts.getValue("htmlURL");
                String text = atts.getValue("textURL");
                htmlOut.append("<BR>");
                htmlOut.append("<A HREF=\"" + html + "\">HTML</A> "
                               + "<A HREF=\"" + text + "\">TEXT</A> ");
                htmlOut.append(title);
            }
        }

        public void processWithSAX(String urlString) throws Exception {
            System.out.println("Processing URL " + urlString);
            htmlOut = new StringBuffer(
                "<HTML><BODY><H1>JDC Tech Tips Archive</H1>");
            SAXParserFactory spf = SAXParserFactory.newInstance();
            SAXParser sp = spf.newSAXParser();
            // Wrap the SAX1 parser so that it delivers SAX2 events
            ParserAdapter pa = new ParserAdapter(sp.getParser());
            pa.setContentHandler(this);
            pa.parse(urlString);
            htmlOut.append("</BODY></HTML>");
        }

        public static void main(String[] args) {
            try {
                UseSAX2 us = new UseSAX2();
                us.processWithSAX(args[0]);
                String output = us.toString();
                System.out.println("Saving result to " + args[1]);
                FileWriter fw = new FileWriter(args[1]);
                fw.write(output, 0, output.length());
                fw.flush();
                fw.close();
            } catch (Throwable t) {
                t.printStackTrace();
            }
        }
    }

To test the program, you can use the XML fragment in the XML Introduction that precedes this tip, or download a longer version from http://staff.develop.com/halloway/TechTips/TechTipArchive.xml. Save the XML fragment or the longer XML version in your local directory as TechTipArchive.xml. You can then produce an HTML version with the command:

    java UseSAX2 file:TechTipArchive.xml SimpleList.html

Then use your browser of choice to view SimpleList.html, and follow links to either text or HTML versions of recent Tech Tips. (In a production scenario you would probably merge this code into a client browser or into a servlet or JSP page on the server.)

There are several interesting points about the code above. Notice the steps in creating the parser.

    SAXParserFactory spf = SAXParserFactory.newInstance();
    SAXParser sp = spf.newSAXParser();

In JAXP, the SAXParser class is not created directly, but instead through the factory method newSAXParser(). This allows different implementations to be plug-compatible without source code changes. The factory also provides control over more advanced parsing features such as namespace support and validation. Even after you have the JAXP parser instance, you still aren't ready to parse. The current JAXP parser only supports SAX 1.0; to get SAX 2.0 support, you must wrap the parser in a ParserAdapter.
    ParserAdapter pa = new ParserAdapter(sp.getParser());

The ParserAdapter class adds SAX2 functionality to an existing SAX1 parser and is part of the SAX2 download.

Notice that instead of implementing the ContentHandler interface, UseSAX2 extends the DefaultHandler class. DefaultHandler is an adapter class that provides an empty implementation of all the ContentHandler methods, so only the methods that are of interest need to be overridden.

The startElement() method does the real work. Because the program only wants to list the tips by title, the <tip> element is all-important, and the other elements (such as <author>) are ignored. The startElement method checks the element name and continues only if the current element is <tip>. The method also provides access to an element's attributes via an Attributes reference, so it is easy to extract the tip name, htmlURL, and textURL.

The end result of this exercise is an HTML document that allows you to browse the list of recent Tech Tips. You could have done this directly by coding in HTML. But keeping the data in XML and writing the SAX code provides additional flexibility. If another person wanted to view the Tech Tips sorted by date, or by author, or filtered by some constraint, then various views could be generated from a single XML file, with different parsing code for each view.

Unfortunately, as the XML data gets more complicated, the sample above becomes more difficult to code and maintain. The example suffers from two problems. First, the code to generate the HTML output is just raw string manipulation, which makes it easy to lose a '>' or a '/' somewhere. Second, the SAX API doesn't remember much; if you need to refer back to some earlier element, then you have to build your own state machine to remember the elements that have already been parsed. The Document Object Model (DOM) API solves both of these problems.
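To make the second problem concrete, here is a minimal sketch of a SAX handler that groups tip titles by author and therefore has to carry its own state between callbacks. It is not code from this tip. The class name TipsByAuthor, the root element name <tips>, the "ex" author entry, and the pairing of titles with authors are all hypothetical sample data, and the sketch assumes a JAXP release whose SAXParser supports SAX2 directly, so no ParserAdapter wrapper is used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Groups tip titles by author with SAX; the handler keeps its own state. */
class TipsByAuthor extends DefaultHandler {
    // Hypothetical sample data: the <tips> root and the "ex" author are assumptions.
    static final String SAMPLE_XML =
        "<tips>"
        + "<author id='stu' fullName='Stuart Halloway'/>"
        + "<author id='ex' fullName='Example Author'/>"
        + "<tip title='Random Access for Files' author='ex'/>"
        + "<tip title='Using the SAX API' author='stu'/>"
        + "<tip title='Using the DOM API' author='stu'/>"
        + "</tips>";

    // State that SAX does not keep for us: author ids seen so far,
    // and the titles collected for each author name.
    private final Map<String, String> namesById = new LinkedHashMap<>();
    private final Map<String, List<String>> titlesByName = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (localName.equals("author")) {
            String name = atts.getValue("fullName");
            namesById.put(atts.getValue("id"), name);
            titlesByName.put(name, new ArrayList<>());
        } else if (localName.equals("tip")) {
            String name = namesById.get(atts.getValue("author"));
            if (name != null) { // skip tips whose author was not declared earlier
                titlesByName.get(name).add(atts.getValue("title"));
            }
        }
    }

    static Map<String, List<String>> group(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // ensures localName is populated
        SAXParser sp = spf.newSAXParser();
        TipsByAuthor handler = new TipsByAuthor();
        sp.parse(new InputSource(new StringReader(xml)), handler);
        return handler.titlesByName;
    }

    public static void main(String[] args) throws Exception {
        group(SAMPLE_XML).forEach(
            (name, titles) -> System.out.println(name + ": " + titles));
    }
}
```

Note that the sketch only works because every <author> element appears before the tips that refer to it. If the document order were different, the handler would need still more bookkeeping, which is exactly the kind of burden described above.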
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
USING THE DOM API

The DOM API is based on an entirely different model of document processing than the SAX API. Instead of reading a document one piece at a time (as with SAX), a DOM parser reads an entire document. It then makes the tree for the entire document available to program code for reading and updating. Simply put, the difference between SAX and DOM is the difference between sequential, read-only access, and random, read-write access.

At the core of the DOM API are the Document and Node interfaces. A Document is a top level object that represents an XML document. The Document holds the data as a tree of Nodes, where a Node is a base type that can be an element, an attribute, or some other type of content. The Document also acts as a factory for new Nodes. Each Node represents a single piece of data in the tree, and provides all of the popular tree operations. You can query nodes for their parent, their siblings, or their children. You can also modify the document by adding or removing Nodes.

To demonstrate the DOM API, let's process the same XML document that got "SAXed" above. This time, let's group the output by author. This will take a little more work. Here's the code:

    //UseDOM.java
    import java.io.*;
    import java.net.*;
    import java.util.*;
    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    public class UseDOM {
        private Document outputDoc;
        private Element body;
        private Element html;
        private HashMap authors = new HashMap();

        public String toString() {
            if (html != null) {
                return html.toString();
            }
            return super.toString();
        }

        public void processWithDOM(String urlString) throws Exception {
            System.out.println("Processing URL " + urlString);
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(urlString);
            Element elem = doc.getDocumentElement();
            NodeList nl = elem.getElementsByTagName("author");
            for (int n=0; n

JDC Tech Tips Archive

");

Using the DOM API to build documents isn't as terse or as fast as direct String manipulation, but it is much less error-prone, especially in larger documents.

The important part of the UseDOM example is the processWithDOM method. This method does two things: (1) it finds the author elements and provides them as output, and (2) it finds the tips and provides them as output organized by their respective author. Each of these steps requires access to the top level element of the document. This is done via the getDocumentElement() method.

The author information is in <author> elements. These elements are found by calling getElementsByTagName("author") on the top-level element. The getElementsByTagName method returns a NodeList; this is a simple collection of Nodes. Each Node is then cast to an Element in order to use the convenience method getAttribute(). The getAttribute method gets the author's id and fullName. Each author is listed as a second-level heading; to do this, the output document is used to create an <H2> element containing the author's fullName. Adding a Node requires two steps. First the output document is used to create the Node with a factory method such as createElement(). Then the node is added with appendChild(). Nodes can only be added to the document that created them.

After the author headings are in place, it is time to create the links for individual tips. The <tip> elements are found in the same way as the <author> elements, that is, via getElementsByTagName(). The logic for extracting the tip attributes is also similar. The only difference is deciding where to add the Nodes. Tips by different authors should be added to different lists. The groundwork for this was laid back when the author elements were processed, by adding a list node for each author and storing it in a HashMap indexed by author id. Now, the author id attribute of the tip can be used to look up the appropriate list node for adding the tip.

For more in-depth coverage of XML, see The XML Companion, by Neil Bradley, Addison-Wesley 2000. For more information about JAXP, see the Java(tm) Technology and XML page at http://java.sun.com/xml/index.html. For more information about SAX2, see http://www.megginson.com/SAX/index.html. The DOM standard is available at http://www.w3.org/TR/REC-DOM-Level-1.

. . . . . . . . . . . . . . . . . . . . . . .

- NOTE

The names on the JDC mailing list are used for internal Sun Microsystems(tm) purposes only. To remove your name from the list, see Subscribe/Unsubscribe below.

- FEEDBACK

Comments? Send your feedback on the JDC Tech Tips to: jdc-webmaster@sun.com

- SUBSCRIBE/UNSUBSCRIBE

The JDC Tech Tips are sent to you because you elected to subscribe when you registered as a JDC member. To unsubscribe from JDC email, go to the following address and enter the email address you wish to remove from the mailing list: http://developer.java.sun.com/unsubscribe.html

To become a JDC member and subscribe to this newsletter go to: http://java.sun.com/jdc/

- ARCHIVES

You'll find the JDC Tech Tips archives at: http://developer.java.sun.com/developer/TechTips/index.html

- COPYRIGHT

Copyright 2000 Sun Microsystems, Inc. All rights reserved. 901 San Antonio Road, Palo Alto, California 94303 USA.

This document is protected by copyright. For more information, see: http://developer.java.sun.com/developer/copyright.html

JDC Tech Tips June 27, 2000