From: JDCTechTips@sun.com
Subject: JDC Tech Tips, June 13, 2000
Date: Tue, 13 Jun 2000 19:57:26 GMT

 J D C   T E C H   T I P S

 TIPS, TECHNIQUES, AND SAMPLE CODE

WELCOME to the Java Developer Connection(sm) (JDC) Tech Tips,
June 13, 2000. This issue covers:

    * Using BreakIterator to Parse Text
    * Goto Statements and Java(tm) Programming

These tips were developed using Java(tm) 2 SDK, Standard Edition,
v 1.2.2.

You can view this issue of the Tech Tips on the Web at
http://developer.java.sun.com/developer/TechTips/2000/tt0613.html

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
USING BREAKITERATOR TO PARSE TEXT

The standard Java(tm) packages such as java.util include several
classes that you can use to break text into words or other logical
units. One of these classes is java.util.StringTokenizer. When you
use StringTokenizer, you specify a set of delimiter characters;
instances of StringTokenizer then return words delimited by these
characters. java.io.StreamTokenizer is a class that does something
similar.

These classes are quite useful. However, they have some limitations.
This is especially true when you're trying to parse text that
represents human language.
For example, the classes don't have built-in knowledge of
punctuation rules, and the classes might define a "word" as simply
a string of contiguous non-whitespace characters.

java.text.BreakIterator is a class specifically designed to parse
human language text into words, lines, and sentences. To see how it
works, here's a simple example:

    import java.text.BreakIterator;

    public class BreakDemo1 {
        public static void main(String args[]) {

            // string to be broken into sentences
            String str = "\"Testing.\" \"???\" (This is a test.)";

            // create a sentence break iterator
            BreakIterator brkit =
                BreakIterator.getSentenceInstance();
            brkit.setText(str);

            // iterate across the string
            int start = brkit.first();
            int end = brkit.next();
            while (end != BreakIterator.DONE) {
                String sentence = str.substring(start, end);
                System.out.println(start + " " + sentence);
                start = end;
                end = brkit.next();
            }
        }
    }

The input string is:

    "Testing." "???" (This is a test.)

It is immediately apparent that parsing this input is not trivial.
For example, suppose you follow a simple rule that a sentence ends
with a period. That rule does not always hold, as the following two
sentences show; both are considered correct:

    "This is a test."

    "This is a test".

The first of these sentences is more standard relative to
long-standing English usage. BreakIterator applies a set of rules
to handle situations such as this.

When you run the BreakDemo1 program in the United States locale,
the result is:

    0 "Testing."
    11 "???"
    17 (This is a test.)

The numbers are offsets into the string where each sentence starts.
In other words, BreakIterator returns a series of offsets that tell
where some particular unit (sentence, word) starts in a string.

BreakIterator is particularly useful in applications such as word
processing, where, for example, you might be trying to find the
location of the next sentence in some currently displayed text.
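
The "find the next sentence" case can be sketched with
BreakIterator.following(), which returns the first boundary after a
given offset. This is only an illustration, not part of the original
tip: the class name NextSentenceDemo, the helper nextSentenceStart,
and the sample text are made up for the example.

```java
import java.text.BreakIterator;

// Hypothetical helper class, not part of the original tip.
public class NextSentenceDemo {

    // Returns the offset at which the next sentence starts after
    // "position", or -1 if no later sentence begins in the text.
    public static int nextSentenceStart(String text, int position) {
        BreakIterator brkit = BreakIterator.getSentenceInstance();
        brkit.setText(text);

        // first boundary strictly after the given offset
        int boundary = brkit.following(position);

        // the boundary at the end of the text does not start a sentence
        if (boundary == BreakIterator.DONE || boundary >= text.length()) {
            return -1;
        }
        return boundary;
    }

    public static void main(String[] args) {
        String str = "This is a test. Here is another one. And a third.";

        // pretend the cursor sits inside the first sentence
        int cursor = 3;
        System.out.println(nextSentenceStart(str, cursor));   // 16

        // cursor inside the last sentence: nothing follows
        System.out.println(nextSentenceStart(str, 40));       // -1
    }
}
```

Note that following() throws IllegalArgumentException if the offset
lies outside the text, so a real editor would validate the cursor
position first.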
The BreakDemo1 program uses default locale settings, but it could
have named a specific locale, for example:

    ...
    BreakIterator.getSentenceInstance(Locale.GERMAN);

Another way you can use BreakIterator is to find line breaks, that
is, locations in text where a line could be broken for text
formatting. Here's an example:

    import java.text.BreakIterator;

    public class BreakDemo2 {
        public static void main(String args[]) {

            // string to be broken into lines
            String str = "This sen-tence con-tains hyphenation.";

            // create a line break iterator
            BreakIterator brkit =
                BreakIterator.getLineInstance();
            brkit.setText(str);

            // iterate across the string
            int start = brkit.first();
            int end = brkit.next();
            while (end != BreakIterator.DONE) {
                String line = str.substring(start, end);
                System.out.println(start + " " + line);
                start = end;
                end = brkit.next();
            }
        }
    }

Program output is:

    0 This
    5 sen-
    9 tence
    15 con-
    19 tains
    25 hyphenation.

BreakIterator applies punctuation rules about where text can be
broken, such as between words or within a hyphenated word (but not
between a word and a following ".").

You can also use BreakIterator to find word and character breaks.
It's important to note that in finding breaks, BreakIterator
analyzes characters independently of how they are stored. A
"character" in a human language is not necessarily equivalent to a
single Java 16-bit char. For example, an accented character might
be stored as a base character along with a mark. BreakIterator
analyzes these kinds of composite characters as a single character.

One final note about BreakIterator: it's intended for use with
human languages, not computer ones. For example, a "sentence" in
programming language source code has little meaning.
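
Word and character breaks can be sketched in the same way. The
example below is an illustration only: the class WordCharDemo and
its two helper methods are invented for this sketch, and it assumes
a current JDK (it uses generics, which the 1.2.2 SDK does not have).
The word iterator reports spaces and punctuation as separate pieces,
so the sketch keeps only pieces that start with a letter or digit.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original tip.
public class WordCharDemo {

    // Collects the words in text, skipping spaces and punctuation.
    public static List<String> words(String text) {
        BreakIterator brkit = BreakIterator.getWordInstance(Locale.US);
        brkit.setText(text);

        List<String> result = new ArrayList<>();
        int start = brkit.first();
        int end = brkit.next();
        while (end != BreakIterator.DONE) {
            String piece = text.substring(start, end);
            // keep only pieces that begin with a letter or digit
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                result.add(piece);
            }
            start = end;
            end = brkit.next();
        }
        return result;
    }

    // Counts characters as a reader sees them, so a base character
    // plus a combining mark counts once.
    public static int characterCount(String text) {
        BreakIterator brkit =
            BreakIterator.getCharacterInstance(Locale.US);
        brkit.setText(text);

        int count = 0;
        brkit.first();
        while (brkit.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(words("This is a test."));

        // each accented e is stored as 'e' plus a combining acute accent
        String accented = "re\u0301sume\u0301";
        System.out.println(accented.length());          // chars stored
        System.out.println(characterCount(accented));   // characters seen
    }
}
```

On a typical JDK this prints [This, is, a, test], then 8, then 6:
the string holds eight 16-bit chars but only six characters as a
reader would count them.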
For more information about BreakIterator, see
http://java.sun.com/products//jdk/1.2/docs/api/java/text/BreakIterator.html

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
GOTO STATEMENTS AND JAVA(TM) PROGRAMMING

Suppose you write a C/C++ program that searches a 5 x 5 array to
find the first occurrence of a particular value. You might use the
following approach:

    #include <stdio.h>

    /* 5 x 5 array of numbers */
    #define N 5
    static int vec[N][N] = {
        {1, 2, 3, 4, 5},
        {2, 3, 4, 5, 6},
        {3, 4, 5, 6, 7},
        {4, 5, 6, 7, 8},
        {5, 6, 7, 8, 9}
    };

    /* target number to be searched for */
    static int TARGET = 8;

    int main() {
        int i = 0;
        int j = 0;
        int found = 0;

        /* iterate through the array, looking for the target */
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                if (vec[i][j] == TARGET) {
                    found = 1;
                    goto done;
                }
            }
        }

        done:
        if (found) {
            printf("Found at %d %d\n", i, j);
        }

        return 0;
    }

If you run the program, you get the result:

    Found at 3 4

In this example, a loop nested in another loop is used to find the
matching array element. If the program finds the element, it needs
to "break" from the nested loops. It's not sufficient to simply
break from the inner loop. Doing that only takes the program to the
outer loop; it does not terminate both loops. So a goto is used to
jump out of the inner loop and transfer control to the "done:"
label. Using a goto is not the only way to solve the problem in
C/C++, but this is one place where a goto is sometimes used.

Goto statements are controversial. One problem is that it's hard to
control the program logic effectively if you use these statements.
For example, look again at the program above. It's clear that the
"found" test that is just after the "done:" label is intended for
use after the loop has terminated (that is, after the loop
terminates normally or through the goto). But there's no way to
enforce this rule; control can be transferred to this label from
anywhere in the function.
In the Java(tm) programming language, goto is a reserved word; the
Java programming language does not have a goto statement. However,
there are alternative statements that you can use in the Java
programming language in place of the goto statement. This tip
demonstrates three alternative statements. The first of these is a
rewrite of the C program above:

    public class ControlDemo1 {

        // 5 x 5 array of numbers
        static int vec[][] = {
            {1, 2, 3, 4, 5},
            {2, 3, 4, 5, 6},
            {3, 4, 5, 6, 7},
            {4, 5, 6, 7, 8},
            {5, 6, 7, 8, 9}
        };
        static final int N = 5;

        // target number to be searched for
        static final int TARGET = 8;

        public static void main(String args[]) {
            int i = 0;
            int j = 0;
            boolean found = false;

            // iterate through the array, looking for the target
            outer:
            for (i = 0; i < N; i++) {
                for (j = 0; j < N; j++) {
                    if (vec[i][j] == TARGET) {
                        found = true;
                        break outer;
                    }
                }
            }

            if (found) {
                System.out.println("Found at " + i + " " + j);
            }
        }
    }

The key point in this example is that break statements can be
labeled, that is, a break can designate a labeled loop. Specifying
"break outer" in the above example terminates the loop labeled
"outer". In other words, the break statement terminates both loops.

The same idea applies to continue statements, for example:

    public class ControlDemo2 {
        public static void main(String args[]) {
            outer:
            for (int i = 1; i <= 3; i++) {
                for (int j = 1; j <= 3; j++) {
                    System.out.println(i + " " + j);
                    if (i == 2 && j == 2) {
                        continue outer;
                    }
                }
            }
        }
    }

Output here is:

    1 1
    1 2
    1 3
    2 1
    2 2
    3 1
    3 2
    3 3

Break statements are normally used in loop and switch statements,
but you can also use them in any labeled block.
Here's an example that illustrates this idea:

    public class ControlDemo3 {

        // add two numbers together, a >= 0 and b >= 0
        // throw IllegalArgumentException if a or b out of range
        static int add(int a, int b) {
            block1: {
                if (a < 0) {
                    break block1;
                }
                if (b < 0) {
                    break block1;
                }
                return a + b;
            }
            throw new IllegalArgumentException("a < 0 || b < 0");
        }

        public static void main(String args[]) {

            // legal case
            try {
                int a = 37;
                int b = 47;
                int c = add(a, b);
                System.out.println(c);
            }
            catch (IllegalArgumentException e) {
                System.err.println(e);
            }

            // illegal case
            try {
                int a = 37;
                int b = -47;
                int c = add(a, b);
                System.out.println(c);
            }
            catch (IllegalArgumentException e) {
                System.err.println(e);
            }
        }
    }

In this example there's a block labeled "block1". The program
handles errors by breaking out of the block. If there are no
errors, the program returns normally from within the block. An
error causes an exception to be thrown after the block is exited.

Note in this example that there are other ways of structuring the
code. For example, you might simply say:

    if (a < 0 || b < 0) {
        throw new IllegalArgumentException("a < 0 || b < 0");
    }
    return a + b;

Which approach is "correct" depends a lot on the complexity of the
logic, and what style you prefer.

The final example illustrates the case where you'd like to perform
some actions, and then somehow gain control for cleanup processing.
You want to do this whether the actions succeed, fail, or trigger
an exception. This case is sometimes implemented in C/C++ by using
a goto to jump to the end of a function, where there is some
cleanup code.
Here's an example of how you can do this using a Java(tm) program:

    public class ControlDemo4 {

        // add two numbers together, a >= 0 and b >= 0
        // throw IllegalArgumentException if a or b out of range
        static int traceadd(int a, int b) {
            try {
                if (a < 0 || b < 0) {
                    throw new IllegalArgumentException(
                        "a < 0 || b < 0");
                }
                return a + b;
            }
            finally {
                System.out.println("trace: leaving traceadd");
            }
        }

        public static void main(String args[]) {

            // legal case
            try {
                int a = 37;
                int b = 47;
                int c = traceadd(a, b);
                System.out.println(c);
            }
            catch (IllegalArgumentException e) {
                System.err.println(e);
            }

            // illegal case
            try {
                int a = 37;
                int b = -47;
                int c = traceadd(a, b);
                System.out.println(c);
            }
            catch (IllegalArgumentException e) {
                System.err.println(e);
            }
        }
    }

This example does program tracing. It prints a message when the
traceadd method exits. The exit can be normal, through the return
statement, or abnormal, through an exception. Using try...finally
(no catch) like this:

    try {
        statement 1
        statement 2
        statement 3
        ...
    }
    finally {
        cleanup
    }

is a way to get control for cleanup, no matter what happens in the
try clause.

For further reading, see chapter 14 in "The Java(tm) Language
Specification" by James Gosling, Bill Joy, and Guy Steele
(http://java.sun.com/docs/books/jls/).

. . . . . . . . . . . . . . . . . . . . . . .

- NOTE

The names on the JDC mailing list are used for internal Sun
Microsystems(tm) purposes only. To remove your name from the list,
see Subscribe/Unsubscribe below.

- FEEDBACK

Comments? Send your feedback on the JDC Tech Tips to:

jdc-webmaster@sun.com

- SUBSCRIBE/UNSUBSCRIBE

The JDC Tech Tips are sent to you because you elected to subscribe
when you registered as a JDC member.
To unsubscribe from JDC email, go to the following address and
enter the email address you wish to remove from the mailing list:

http://developer.java.sun.com/unsubscribe.html

To become a JDC member and subscribe to this newsletter go to:

http://java.sun.com/jdc/

- ARCHIVES

You'll find the JDC Tech Tips archives at:

http://developer.java.sun.com/developer/TechTips/index.html

- COPYRIGHT

Copyright 2000 Sun Microsystems, Inc. All rights reserved.
901 San Antonio Road, Palo Alto, California 94303 USA.

This document is protected by copyright. For more information, see:

http://developer.java.sun.com/developer/copyright.html

This issue of the JDC Tech Tips is written by Glen McCluskey.

JDC Tech Tips
June 13, 2000