J D C   T E C H   T I P S
TIPS, TECHNIQUES, AND SAMPLE CODE

WELCOME to the Java Developer Connection(sm) (JDC) Tech Tips, April 23, 2002. This issue covers:

    * Pattern Matching
    * Creating a HelpSet with JavaHelp(tm) software

These tips were developed using Java 2 SDK, Standard Edition, v 1.4.

This issue of the JDC Tech Tips is written by John Zukowski, president of JZ Ventures, Inc. (http://www.jzventures.com).

You can view this issue of the Tech Tips on the Web at http://java.sun.com/jdc/JDCTechTips/2002/tt0423.html

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

PATTERN MATCHING

The java.util.regex package is new in Java 2 Platform, Standard Edition version 1.4. The package provides a regular expression library. A regular expression is a pattern of characters that describes a set of strings, and is often used in pattern matching. The classes in the java.util.regex package let you match sequences of characters against a regular expression. These classes use a pattern syntax closely modeled on Perl 5 regular expressions, and provide a much more powerful way of parsing text than was previously available with the java.io.StreamTokenizer and java.util.StringTokenizer classes.
The regular expression library has three classes: Pattern, Matcher, and PatternSyntaxException. Ignoring the exception class, what you really have is one class that defines the regular expression you want to match (Pattern), and another class (Matcher) that searches for the pattern in a given string. Most of the work of using the regular expression library is understanding its pattern syntax. The actual parsing is the easy part. So let's look at what makes up a regular expression.

The simplest kind of regular expression is a literal. A literal is a character that stands for itself, that is, a character that is not part of some special grouping or expression within the regular expression. For instance, the literal "x" is a regular expression. Using the literal, a matcher, and a string, you can ask "Does the regular expression 'x' match the entire string?" Here's an expression that asks the question:

    boolean b = Pattern.matches("x", someString);

If someString refers to the string "x", then b is true. Otherwise, b is false. By themselves, literals are not that complicated to understand.

Notice that the matches method used here is defined by the Pattern class, not the Matcher class. The Pattern class defines it as a convenience for when a regular expression is used just once. Normally, you would compile a Pattern object, get a Matcher object for the Pattern, and then use the matches method defined by the Matcher class:

    Pattern p = Pattern.compile("x");
    Matcher m = p.matcher("sometext");
    boolean b = m.matches();

The tip will cover those steps later.

Of course, regular expressions can be more complex than literals. Adding to the complexity are wildcards and quantifiers. There is only one wildcard used in regular expressions. It is the period (.) character. The wildcard matches any single character. It matches a line terminator only if the pattern is compiled with the Pattern.DOTALL flag. The quantifier characters are + and *. (Technically, the question mark is also a quantifier character. It allows zero or one occurrence.)
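The two ways of asking the whole-string question described above can be compared side by side. The following is a minimal sketch rather than part of the original tip: the class name LiteralMatch and its helper methods are invented for illustration, while Pattern.matches, Pattern.compile, Pattern.DOTALL, and Matcher.matches are the java.util.regex calls themselves.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the java.util.regex calls are real API.
public class LiteralMatch {

    // Compiled once and reused for every call to reusable().
    private static final Pattern X = Pattern.compile("x");

    // One-shot convenience method: compiles the regex on every call.
    static boolean oneShot(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Reuses the precompiled pattern; true only if the ENTIRE input is "x".
    static boolean reusable(String input) {
        Matcher m = X.matcher(input);
        return m.matches();
    }

    // The wildcard matches a newline only when DOTALL is set.
    static boolean dotMatchesNewline(boolean dotAll) {
        Pattern p = dotAll ? Pattern.compile(".", Pattern.DOTALL)
                           : Pattern.compile(".");
        return p.matcher("\n").matches();
    }

    public static void main(String[] args) {
        System.out.println(oneShot("x", "x"));         // true
        System.out.println(oneShot("x", "sometext"));  // false
        System.out.println(reusable("x"));             // true
        System.out.println(reusable("xx"));            // false
        System.out.println(dotMatchesNewline(false));  // false
        System.out.println(dotMatchesNewline(true));   // true
    }
}
```

Both routes give the same answer for the same input. The difference is only in how often the pattern is compiled.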
The + character placed after a regular expression allows that expression to be matched one or more times. The * is like the + character, but works zero or more times. For instance, if you want to find a string with a j at the beginning, a z at the end, and at least one character between the two, you use the expression "j.+z". If there don't have to be any characters between the j and the z, you use "j.*z" instead.

Note that pattern matching tries to find the largest possible "hit" within a string. So if you request a match against the pattern "j.*z", using the string "jazjazjazjaz", it returns the entire string, not just a single "jaz". This is called "greedy behavior." It is the default in a regular expression unless you specify otherwise.

Now let's get a little more complex. By placing multiple expressions in parentheses, you can request a match against multi-character patterns. For instance, to match a j followed by a z, you can use the "(jz)" pattern. By itself, that doesn't buy you much. It is the same as "jz". But by using parentheses, you can apply a quantifier to the group and match any number of "jz" patterns: "(jz)+".

Another way of working with patterns is through character classes. With character classes, you specify a range of possible characters instead of specifying individual characters. For instance, if you want to match against any letter from j to z, you specify the range j-z in square brackets: "[j-z]". You could also attach a quantifier to the expression, for example, "[j-z]+", to get an expression matching at least one character between j and z, inclusive.

Certain character classes are predefined. These represent classes that are common, and so they have a common shorthand.
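The quantifier, grouping, and character class behavior described above can be checked with a short program. This is a minimal sketch: the class name QuantifierDemo and the firstHit helper are made up for the example, and the reluctant form "j.*?z" is included only as one way of "specifying otherwise"; the tip itself does not cover it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the java.util.regex calls are real API.
public class QuantifierDemo {

    // Returns the first substring of input that matches regex, or null.
    static String firstHit(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // + needs at least one character between j and z; * does not.
        System.out.println(Pattern.matches("j.+z", "jz"));       // false
        System.out.println(Pattern.matches("j.*z", "jz"));       // true

        // Greedy default: the hit is the longest possible one.
        System.out.println(firstHit("j.*z", "jazjazjazjaz"));    // jazjazjazjaz
        // Reluctant form (*?), shown for contrast only.
        System.out.println(firstHit("j.*?z", "jazjazjazjaz"));   // jaz

        // Parentheses let a quantifier apply to a multi-character pattern.
        System.out.println(Pattern.matches("(jz)+", "jzjzjz"));  // true

        // Character class with a quantifier: one or more letters from j to z.
        System.out.println(Pattern.matches("[j-z]+", "joy"));    // true
        System.out.println(Pattern.matches("[j-z]+", "jazz"));   // false ('a' is outside j-z)
    }
}
```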
Some of the predefined character classes are:

    \d    A digit: [0-9]
    \D    A non-digit: [^0-9]
    \s    A whitespace character: [ \t\n\x0B\f\r]
    \S    A non-whitespace character: [^\s]
    \w    A word character: [a-zA-Z_0-9]
    \W    A non-word character: [^\w]

Notice that inside a character class, a leading ^ negates the class.

There is a second set of predefined character classes, called POSIX character classes. These are taken from the POSIX specification, and work with US-ASCII characters only:

    \p{Lower}     A lower-case alphabetic character: [a-z]
    \p{Upper}     An upper-case alphabetic character: [A-Z]
    \p{ASCII}     All ASCII: [\x00-\x7F]
    \p{Alpha}     An alphabetic character: [\p{Lower}\p{Upper}]
    \p{Digit}     A decimal digit: [0-9]
    \p{Alnum}     An alphanumeric character: [\p{Alpha}\p{Digit}]
    \p{Punct}     Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    \p{Graph}     A visible character: [\p{Alnum}\p{Punct}]
    \p{Print}     A printable character: [\p{Graph}]
    \p{Blank}     A space or a tab: [ \t]
    \p{Cntrl}     A control character: [\x00-\x1F\x7F]
    \p{XDigit}    A hexadecimal digit: [0-9a-fA-F]
    \p{Space}     A whitespace character: [ \t\n\x0B\f\r]

The final set of constructs listed here is the boundary matchers. These are meant to match the beginning or end of a sequence of characters, specifically a line, word, or pattern.

    ^     The beginning of a line
    $     The end of a line
    \b    A word boundary
    \B    A non-word boundary
    \A    The beginning of the input
    \G    The end of the previous match
    \Z    The end of the input but for the final terminator, if any
    \z    The end of the input

The key thing to understand about all these expressions is the use of the \. When you compose a regular expression as a Java string, you must escape the \ character. Otherwise, the javac compiler treats the \ and the character that follows it as a string escape sequence, and rejects sequences it does not recognize, such as \d or \p. To escape the \ character, specify a double \\. By placing a double \\ in the string, you are saying you want the actual \ character there.
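A few of the predefined classes and boundary matchers from the tables above, written as Java string literals with the doubled backslash, are sketched below. The class name EscapeDemo, its two helper methods, and the sample strings are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the java.util.regex calls are real API.
public class EscapeDemo {

    // True if the whole input matches the regex.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Index where the first match starts, or -1 if there is none.
    static int firstStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // The Java literal "\\d" holds two characters: a backslash and a d.
        System.out.println("\\d".length());                          // 2

        System.out.println(whole("\\d+", "2002"));                   // true
        System.out.println(whole("\\w+", "tech_tips4"));             // true
        System.out.println(whole("\\p{Upper}\\p{Lower}+", "Java"));  // true
        System.out.println(whole("\\p{Alnum}*", "JDK14"));           // true
        System.out.println(whole("\\p{Alnum}*", "JDK 1.4"));         // false

        // \b is a word boundary, so the "cat" inside "concat" is skipped.
        System.out.println(firstStart("\\bcat\\b", "concat cat"));   // 7
        // ^ and $ anchor the match to the start and end of the line.
        System.out.println(firstStart("^tips$", "tech tips"));       // -1
    }
}
```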
For instance, if you want to use a pattern for any string of alphanumeric characters, simply having a string containing \p{Alnum}* is not sufficient. You must escape the \ as follows:

    boolean b = Pattern.matches("\\p{Alnum}*", someString);

As the name implies, the Pattern class is for defining patterns, that is, it defines the regular expression you want to match. Instead of using matches to see if a pattern matches the whole string, you normally check whether the pattern matches the next part of the string.

To use a pattern you must compile it. You do this with the compile method.

    Pattern pattern = Pattern.compile(somePattern);

Pattern compilation can take some time, so it is wise to do it only once. The matches method of the Pattern class compiles the pattern with each call. If you want to use a pattern many times, you can avoid repeated compilation by compiling the Pattern once, getting a Matcher from it, and then using the Matcher.

After you compile the pattern, you can request a Matcher for a specific string.

    Matcher matcher = pattern.matcher(someString);

The Matcher provides a matches method that checks against the entire string. The class also provides a find() method that tries to find the next sequence, possibly not at the beginning of the string, that matches the pattern.

After you know you have a match, you can get the matched text with the group method:

    if (matcher.find()) {
        System.out.println(matcher.group());
    }

You can also use the matcher as a search and replace mechanism. For instance, to replace all occurrences of a pattern within a string, you use the following statement:

    String newString = matcher.replaceAll("replacement words");

Here, all occurrences of the pattern in question are replaced by the replacement words.

Here's a demonstration of pattern matching. The following program takes three command line arguments. The first argument is a string to search. The second is a pattern for the search. The third is the replacement string.
The replacement string replaces each occurrence of the pattern found in the search string.

    import java.util.regex.*;

    public class MyMatch {
        public static void main(String args[]) {
            if (args.length != 3) {
                System.out.println(
                    "Pass in source string, pattern, " +
                    "and replacement string");
                System.exit(-1);
            }
            String sourceString = args[0];
            String thePattern = args[1];
            String replacementString = args[2];

            // Compile the pattern once, then bind it to the source string
            Pattern pattern = Pattern.compile(thePattern);
            Matcher match = pattern.matcher(sourceString);

            // Print the result only if the pattern occurs at least once
            if (match.find()) {
                System.out.println(
                    match.replaceAll(replacementString));
            }
        }
    }

For example, if you compile the program, and then run it like this:

    java MyMatch "I want to be in lectures" "lect" "pict"

It prints:

    I want to be in pictures

Notice that when you run the program, it is unnecessary to escape the \ character on the command line. That's because command line arguments are never processed by the javac compiler. For example, if the search string is:

    "I want to be in lectures\I want to be a star"

and you run the program with the same pattern ("lect") and replacement string ("pict"), it prints:

    I want to be in pictures\I want to be a star

For more information about pattern matching and regular expressions, see the technical article Regular Expressions and the Java Programming Language (http://java.sun.com/jdc/technicalArticles/releases/1.4regex/).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

CREATING A HELPSET WITH JAVAHELP SOFTWARE

JavaHelp software allows you to add online help to any system that has a Java Runtime Environment (JRE). With JavaHelp software, you can embed online documentation inside your client-side programs. This includes the obvious applets and applications, but you can also use JavaHelp software with JavaBeans(tm) technology components or as standalone help for third-party systems.

Getting started with the JavaHelp software is easy. Just go to http://java.sun.com/products/javahelp/download_binary.html.
You can download either the user version, which includes a JRE, or a developer-centric version (as a Zip or self-extracting executable). There is also a JavaHelp User Guide that comes with the software downloads. If you view the JavaHelp User Guide, you'll see the JavaHelp system in action.

Once started, the Swing-based help viewer for JavaHelp presents information in a series of views. You'll find a table of contents, an index of topics, and a search facility. These navigation views, together with the help topic files they present, make up the HelpSet.

Essentially, it is your job to create the help topic files and the navigation files, mapping each help topic to the file with the necessary information. The topic files are basic HTML, and the navigation files are formatted in XML. You can, however, use a third-party tool to automatically produce the necessary files. For example, a tool such as RoboHELP generates the necessary files in the JavaHelp format. See the list of tools supporting the JavaHelp software format at http://java.sun.com/products/javahelp/industry.html.

To demonstrate the JavaHelp system in action, let's create a "Hello, JavaHelp" HelpSet. To do this, you'll need to configure a special directory structure. It helps if you work in a subdirectory to start, so that you don't mix up the HelpSet files with any others. Navigation files go in the top-level directory, and topic and image files go in subdirectories.

To get started, create a directory named help. Under help, create a directory named Hello. In the Hello directory, you create subdirectories for subtopics to hold the actual help files. For the "Hello, JavaHelp" demonstration, create one directory named First and another named Last. Once the directory structure is created, you can start creating the navigation and help files. The directory structure now looks as follows:

    + help
      + Hello
        + First
        + Last

The DTD for the main HelpSet file is available at http://java.sun.com/products/javahelp/helpset_1_0.dtd.
In the HelpSet file, you create entries for the map as well as for the table of contents and index views. There is really no magic in the filenames. Just be sure the HelpSet file ends with the extension .hs. Here's what the HelpSet file, hello.hs, might look like, where the map is in Map.jhm, the table of contents is in toc.xml, and the index is in index.xml. Create this hello.hs file in the help directory.

    <helpset version="1.0">
      <title>Hello, JavaHelp</title>
      <maps>
        <homeID>overview</homeID>
        <mapref location="Map.jhm"/>
      </maps>
      <view>
        <name>TOC</name>
        <type>javax.help.TOCView</type>
        <data>toc.xml</data>
      </view>
      <view>
        <name>Index</name>
        <type>javax.help.IndexView</type>
        <data>index.xml</data>
      </view>
    </helpset>

For the map file, you need to create a mapping from map ID to files, similar to the following:

    <mapID target="overview" url="Hello/overview.htm"/>

Be sure the help files are specified as relative locations from the HelpSet. You could hard code complete paths, but then as soon as you JAR up the HelpSet, all paths would be wrong. Of course, these could be complete URLs to resources on the Web.

If you want to have one "overview" help file at the top, and two help files in each of the First and Last directories, your XML mapping needs five such entries, one per file. Create the Map.jhm file in the help directory.

The table of contents and index files are next. These provide alternate means of working through the various help files. Again, these are described in XML files. For the table of contents, each target from the map is mapped to text to appear in the table of contents. Create the toc.xml file in the help directory.

The index is just another way of presenting the data. As you create the index.xml file, you must list terms in the order you want them presented, typically alphabetical. Simply create the XML file with a set of hierarchical entries. In each entry, provide a value for the text attribute and a value for the target attribute. The value for the text attribute specifies what to display to the user in the index. The value for the target attribute specifies what help to display. Create the index.xml file in the help directory.
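To make the role of the map file concrete, here is a minimal sketch, not taken from the JavaHelp distribution, that reads map entries with the JDK's own XML parser and lists which file each map ID points to. The class name HelpMapReader is invented, and the sample map is an assumption: the file locations are the five used in this tip, but the target IDs other than "overview" are placeholders chosen for illustration. The DOCTYPE line is left out so that the parser does not try to fetch the DTD.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; it uses only JDK XML classes, not the JavaHelp API.
public class HelpMapReader {

    // ASSUMED map content: the IDs "one" through "four" are placeholders.
    static final String SAMPLE_MAP =
          "<map version=\"1.0\">\n"
        + "  <mapID target=\"overview\" url=\"Hello/overview.htm\"/>\n"
        + "  <mapID target=\"one\" url=\"Hello/First/one.htm\"/>\n"
        + "  <mapID target=\"two\" url=\"Hello/First/two.htm\"/>\n"
        + "  <mapID target=\"three\" url=\"Hello/Last/three.htm\"/>\n"
        + "  <mapID target=\"four\" url=\"Hello/Last/four.htm\"/>\n"
        + "</map>\n";

    // Returns map ID -> relative file location, in document order.
    static Map<String, String> readTargets(String mapXml) {
        try {
            NodeList ids = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(mapXml)))
                .getElementsByTagName("mapID");
            Map<String, String> targets = new LinkedHashMap<String, String>();
            for (int i = 0; i < ids.getLength(); i++) {
                Element id = (Element) ids.item(i);
                targets.put(id.getAttribute("target"), id.getAttribute("url"));
            }
            return targets;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse map XML", e);
        }
    }

    public static void main(String[] args) {
        // Print one "ID -> file" line per map entry.
        for (Map.Entry<String, String> e : readTargets(SAMPLE_MAP).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Every location in the sample is relative to the directory that holds the HelpSet file, which is what keeps the paths valid after the HelpSet is packaged into a JAR.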
The map file mentions five HTML files:

    Hello/overview.htm
    Hello/First/one.htm
    Hello/First/two.htm
    Hello/Last/three.htm
    Hello/Last/four.htm

So you must create them. Make sure to create the files in the appropriate Hello directory or subdirectory. Try to create the files with something interesting in them, for example, a few sentences of overview information in the overview.htm file.

The whole directory structure now looks like this:

    + help
      hello.hs
      index.xml
      Map.jhm
      toc.xml
      + Hello
        overview.htm
        + First
          one.htm
          two.htm
        + Last
          three.htm
          four.htm

To test if you have everything connected properly, run the hsviewer utility that comes with the JavaHelp software, and have it load the hello.hs file. You can find the utility in the demos/bin (Unix) or demos\bin (Windows) subdirectory of your JavaHelp installation directory. For example, in Unix change to the demos/bin subdirectory, and enter:

    hsviewer -helpset hello.hs -classpath path

Replace "path" with the path to the hello.hs HelpSet.

After starting up hsviewer, click on the Browse button to locate the hello.hs file. Then click on the Display button to bring up the help viewer. Because hello.hs has two <view> tags, you'll find two tabs on the left side: one for the TOC and one for the index. The right side displays the HTML associated with the item selected on the left.

You can also add a search tab. To do this, run the jhindexer program and add another <view> tag to the HelpSet. Enter the jhindexer command as follows in the directory that contains the hello.hs file.

    jhindexer Hello

If the command isn't in your path, you'll need to prefix the command with its full path. You can find the command in the javahelp/bin (Unix) or javahelp\bin (Windows) subdirectory of your JavaHelp installation directory.

Here's the tag you need to add to hello.hs. JavaHelpSearch is the name of the directory in which the search index support files are saved.
    <view>
      <name>Search</name>
      <type>javax.help.SearchView</type>
      <data>JavaHelpSearch</data>
    </view>

For more information about JavaHelp software, see the JavaHelp software page (http://java.sun.com/products/javahelp/).

. . . . . . . . . . . . . . . . . . . . . . .

IMPORTANT: Please read our Terms of Use, Privacy, and Licensing policies:
http://www.sun.com/share/text/termsofuse.html
http://www.sun.com/privacy/
http://developer.java.sun.com/berkeley_license.html

* FEEDBACK
Comments? Send your feedback on the JDC Tech Tips to: jdc-webmaster@sun.com

* ARCHIVES
You'll find the JDC Tech Tips archives at: http://java.sun.com/jdc/TechTips/index.html

* COPYRIGHT
Copyright 2002 Sun Microsystems, Inc. All rights reserved. 901 San Antonio Road, Palo Alto, California 94303 USA.

This document is protected by copyright. For more information, see: http://java.sun.com/jdc/copyright.html

JDC Tech Tips April 23, 2002

Sun, Sun Microsystems, Java, Java Developer Connection, JavaHelp, and JavaBeans are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.