
Patterns

Patterns in awk control the execution of rules: a rule is executed when its pattern matches the current input record. This chapter describes how to write patterns.

Kinds of Patterns

Here is a summary of the types of patterns supported in awk.

/regular expression/
A regular expression as a pattern. It matches when the text of the input record fits the regular expression. (See section Regular Expressions as Patterns.)

expression
A single expression. It matches when its value is nonzero (if a number) or nonnull (if a string). (See section Expressions as Patterns.)

pat1, pat2
A pair of patterns separated by a comma, specifying a range of records. (See section Specifying Record Ranges with Patterns.)

BEGIN
END
Special patterns to supply start-up or clean-up information to awk. (See section BEGIN and END Special Patterns.)

null
The empty pattern matches every input record. (See section The Empty Pattern.)
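
The kinds of patterns summarized above can be seen side by side in one small program. This is only a sketch: the four-record input and the labels printed by each rule are made up for illustration.

```shell
# Hypothetical four-record input: one letter per line.
out=$(printf '%s\n' a b c d |
  awk 'BEGIN   { print "start" }            # special pattern, runs before any input
       /b/     { print "regexp:", $0 }      # regular expression pattern
       NR == 1 { print "expression:", $0 }  # expression pattern
       /b/,/c/ { print "range:", $0 }       # range pattern
               { n++ }                      # empty pattern, matches every record
       END     { print "records:", n }')
printf '%s\n' "$out"
```

Each record is tested against every rule in the order the rules are written, so record `b' is reported by both the regexp rule and the range rule.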

Regular Expressions as Patterns

A regular expression, or regexp, is a way of describing a class of strings. A regular expression enclosed in slashes (`/') is an awk pattern that matches every input record whose text belongs to that class.

The simplest regular expression is a sequence of letters, numbers, or both. Such a regexp matches any string that contains that sequence. Thus, the regexp `foo' matches any string containing `foo'. Therefore, the pattern /foo/ matches any input record containing `foo'. Other kinds of regexps let you specify more complicated classes of strings.

How to Use Regular Expressions

A regular expression can be used as a pattern by enclosing it in slashes. The regular expression is then tested against the entire text of each record, and it succeeds if it matches any part of that text. For example, this prints the second field of each record that contains `foo' anywhere:

awk '/foo/ { print $2 }' BBS-list

Regular expressions can also be used in comparison expressions. Then you can specify the string to match against; it need not be the entire current input record. These comparison expressions can be used as patterns or in if, while, for, and do statements.

exp ~ /regexp/
This is true if the expression exp (taken as a character string) is matched by regexp. The following example matches, or selects, all input records with the upper-case letter `J' somewhere in the first field:

awk '$1 ~ /J/' inventory-shipped

So does this:

awk '{ if ($1 ~ /J/) print }' inventory-shipped

exp !~ /regexp/
This is true if the expression exp (taken as a character string) is not matched by regexp. The following example matches, or selects, all input records whose first field does not contain the upper-case letter `J':

awk '$1 !~ /J/' inventory-shipped

The right hand side of a `~' or `!~' operator need not be a constant regexp (i.e., a string of characters between slashes). It may be any expression. The expression is evaluated, and converted if necessary to a string; the contents of the string are used as the regexp. A regexp that is computed in this way is called a dynamic regexp. For example:

identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*"
$0 ~ identifier_regexp

sets identifier_regexp to a regexp that describes awk variable names, and tests if the input record matches this regexp.
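
A dynamic regexp can be tried out as follows. This is a sketch under some assumptions that are not in the text above: the regexp is anchored with `^' and `$' so that it must describe the whole string, it is tested against the first field rather than the whole record, and the three input records are invented.

```shell
# The regexp lives in an ordinary string variable and is used on the
# right-hand side of the ~ operator.
out=$(printf '%s\n' 'count = 1' '9lives = 2' '_tmp = 3' |
  awk 'BEGIN { identifier_regexp = "^[A-Za-z_][A-Za-z_0-9]*$" }
       $1 ~ identifier_regexp { print $1 }')
printf '%s\n' "$out"
```

Only `count' and `_tmp' are printed; `9lives' is rejected because it starts with a digit.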

Regular Expression Operators

You can combine regular expressions with the following characters, called regular expression operators, or metacharacters, to increase the power and versatility of regular expressions.

Here is a table of metacharacters. All characters not listed in the table stand for themselves.

^
This matches the beginning of the string or the beginning of a line within the string. For example:

^@chapter

matches the `@chapter' at the beginning of a string, and can be used to identify chapter beginnings in Texinfo source files.

$
This is similar to `^', but it matches only at the end of a string or the end of a line within the string. For example:

p$

matches a record that ends with a `p'.

.
This matches any single character except a newline. For example:

.P

matches any single character followed by a `P' in a string. Using concatenation we can make regular expressions like `U.A', which matches any three-character sequence that begins with `U' and ends with `A'.

[...]
This is called a character set. It matches any one of the characters that are enclosed in the square brackets. For example:

[MVX]

matches any one of the characters `M', `V', or `X' in a string.

Ranges of characters are indicated by using a hyphen between the beginning and ending characters, and enclosing the whole thing in brackets. For example:

[0-9]

matches any digit.

To include one of the characters `\', `]', `-' or `^' in a character set, put a `\' in front of it. For example:

[d\]]

matches either `d', or `]'.

This treatment of `\' is compatible with other awk implementations, and is also mandated by the POSIX Command Language and Utilities standard. The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional egrep utility.

In egrep syntax, backslash is not syntactically special within square brackets. This means that special tricks have to be used to represent the characters `]', `-' and `^' as members of a character set.

In egrep syntax, to match `-', write it as `---', which is a range containing only `-'. You may also give `-' as the first or last character in the set. To match `^', put it anywhere except as the first character of a set. To match a `]', make it the first character in the set. For example:

[]d^]

matches either `]', `d' or `^'.

[^ ...]
This is a complemented character set. The first character after the `[' must be a `^'. It matches any single character except those in the square brackets (or newline). For example:

[^0-9]

matches any character that is not a digit.

|
This is the alternation operator and it is used to specify alternatives. For example:

^P|[0-9]

matches any string that matches either `^P' or `[0-9]'. This means it matches any string that contains a digit or starts with `P'.

The alternation applies to the largest possible regexps on either side.

(...)
Parentheses are used for grouping in regular expressions as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, `|'.

*
This symbol means that the preceding regular expression is to be repeated as many times as possible to find a match. For example:

ph*

applies the `*' symbol to the preceding `h' and looks for matches to one `p' followed by any number of `h's. This will also match just `p' if no `h's are present.

The `*' repeats the smallest possible preceding expression. (Use parentheses if you wish to repeat a larger expression.) It finds as many repetitions as possible. For example:

awk '/\(c[ad][ad]*r x\)/ { print }' sample

prints every record in the input containing a string of the form `(car x)', `(cdr x)', `(cadr x)', and so on.

+
This symbol is similar to `*', but the preceding expression must be matched at least once. This means that:

wh+y

would match `why' and `whhy' but not `wy', whereas `wh*y' would match all three of these strings. This is a simpler way of writing the last `*' example:

awk '/\(c[ad]+r x\)/ { print }' sample

?
This symbol is similar to `*', but the preceding expression can be matched once or not at all. For example:

fe?d

will match `fed' and `fd', but nothing else.

\
This is used to suppress the special meaning of a character when matching. For example:

\$

matches the character `$'.

The escape sequences used for string constants (see section Constant Expressions) are valid in regular expressions as well; they are also introduced by a `\'.

In regular expressions, the `*', `+', and `?' operators have the highest precedence, followed by concatenation, and finally by `|'. As in arithmetic, parentheses can change how operators are grouped.
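
The operators and the precedence rules can be checked with a small helper. This is a sketch, not part of awk: the shell function try is a hypothetical name, and it passes the regexp to awk as a dynamic regexp, so the examples avoid backslashes.

```shell
# try REGEXP STRING...
# Hypothetical helper: prints, one per line, the STRINGs matched by REGEXP.
try() {
  re=$1
  shift
  printf '%s\n' "$@" | awk -v re="$re" '$0 ~ re'
}

try 'wh+y' why whhy wy              # prints why, whhy
try 'wh*y' why whhy wy              # prints why, whhy, wy
try 'fe?d' fed fd feed              # prints fed, fd
try '^P|[0-9]' Pat 'room 7' apple   # prints Pat, room 7

# Concatenation binds tighter than alternation, so these two differ:
try 'ab|cd' ab cd abd acd           # prints all four strings
try 'a(b|c)d' ab cd abd acd         # prints abd, acd
```

The last two calls show the effect of parentheses: without them the alternatives are `ab' and `cd'; with them the choice is only between `b' and `c'.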

Case-sensitivity in Matching

Case is normally significant in regular expressions, both when matching ordinary characters (i.e., not metacharacters), and inside character sets. Thus a `w' in a regular expression matches only a lower case `w' and not an upper case `W'.

The simplest way to do a case-independent match is to use a character set: `[Ww]'. However, this can be cumbersome if you need to use it often; and it can make the regular expressions harder for humans to read. There are two other alternatives that you might prefer.

One way to do a case-insensitive match at a particular point in the program is to convert the data to a single case, using the tolower or toupper built-in string functions (which we haven't discussed yet; see section Built-in Functions for String Manipulation). For example:

tolower($1) ~ /foo/  { ... }

converts the first field to lower case before matching against it.
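
The two portable approaches described so far can be compared directly. In this sketch the input records are invented, and the character set is applied to the first letter only, to show why that approach becomes cumbersome.

```shell
input=$(printf '%s\n' 'Foo 1' 'FOO 2' 'bar 3' 'foo 4')

# Convert the field to lower case, then match a lower-case regexp.
out_lower=$(printf '%s\n' "$input" | awk 'tolower($1) ~ /foo/ { print $2 }')

# A character set handles only the positions where it is written.
out_set=$(printf '%s\n' "$input" | awk '$1 ~ /[Ff]oo/ { print $2 }')

printf 'tolower: %s\n' "$out_lower"
printf 'set:     %s\n' "$out_set"
```

The tolower version selects records 1, 2 and 4. The character set version misses `FOO', because only the first letter was given in both cases.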

Another method is to set the variable IGNORECASE to a nonzero value (see section Built-in Variables). When IGNORECASE is not zero, all regexp operations ignore case. Changing the value of IGNORECASE dynamically controls the case sensitivity of your program as it runs. Case is significant by default because IGNORECASE (like most variables) is initialized to zero.

x = "aB"
if (x ~ /ab/) ...   # this test will fail

IGNORECASE = 1
if (x ~ /ab/) ...   # now it will succeed

In general, you cannot use IGNORECASE to make certain rules case-insensitive and other rules case-sensitive, because there is no way to set IGNORECASE just for the pattern of a particular rule. To do this, you must use character sets or tolower. However, one thing you can do only with IGNORECASE is turn case-sensitivity on or off dynamically for all the rules at once.

IGNORECASE can be set on the command line, or in a BEGIN rule. Setting IGNORECASE from the command line is a way to make a program case-insensitive without having to edit it.

The value of IGNORECASE has no effect if gawk is in compatibility mode (see section Invoking awk). Case is always significant in compatibility mode.

Comparison Expressions as Patterns

Comparison patterns test relationships such as equality between two strings or numbers. They are a special case of expression patterns (see section Expressions as Patterns). They are written with relational operators, which are a superset of those in C. Here is a table of them:

x < y
True if x is less than y.

x <= y
True if x is less than or equal to y.

x > y
True if x is greater than y.

x >= y
True if x is greater than or equal to y.

x == y
True if x is equal to y.

x != y
True if x is not equal to y.

x ~ y
True if x matches the regular expression described by y.

x !~ y
True if x does not match the regular expression described by y.

The operands of a relational operator are compared as numbers if they are both numbers. Otherwise they are converted to, and compared as, strings (see section Conversion of Strings and Numbers, for the detailed rules). Strings are compared by comparing the first character of each, then the second character of each, and so on, until there is a difference. If the two strings are equal until the shorter one runs out, the shorter one is considered to be less than the longer one. Thus, "10" is less than "9", and "abc" is less than "abcd".
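
The difference between string and numeric comparison can be seen with constants. This is a minimal sketch; the variable names a, b and c are arbitrary.

```shell
# Each comparison yields 1 (true) or 0 (false).
out=$(awk 'BEGIN {
  a = ("10" < "9")       # strings: "1" sorts before "9"
  b = (10 < 9)           # numbers: 10 is not less than 9
  c = ("abc" < "abcd")   # the shorter string is the lesser one
  print a, b, c
}')
printf '%s\n' "$out"     # prints 1 0 1
```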

The left operand of the `~' and `!~' operators is a string. The right operand is either a constant regular expression enclosed in slashes (/regexp/), or any expression, whose string value is used as a dynamic regular expression (see section How to Use Regular Expressions).

The following example prints the second field of each input record whose first field is precisely `foo'.

awk '$1 == "foo" { print $2 }' BBS-list

Contrast this with the following regular expression match, which would accept any record with a first field that contains `foo':

awk '$1 ~ "foo" { print $2 }' BBS-list

or, equivalently, this one:

awk '$1 ~ /foo/ { print $2 }' BBS-list
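
The contrast is easier to see on a few records whose first fields merely contain `foo'. The input below is invented for this sketch and stands in for a real file such as `BBS-list'.

```shell
input=$(printf '%s\n' 'foo 1' 'foobar 2' 'xfoo 3')

# Exact string comparison: only a first field equal to "foo" qualifies.
out_eq=$(printf '%s\n' "$input" | awk '$1 == "foo" { print $2 }')

# Regexp match: any first field containing "foo" qualifies.
out_re=$(printf '%s\n' "$input" | awk '$1 ~ /foo/ { print $2 }')

printf 'equal:   %s\n' "$out_eq"
printf 'matches: %s\n' "$out_re"
```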

Boolean Operators and Patterns

A boolean pattern is an expression which combines other patterns using the boolean operators "or" (`||'), "and" (`&&'), and "not" (`!'). Whether the boolean pattern matches an input record depends on whether its subpatterns match.

For example, the following command prints all records in the input file `BBS-list' that contain both `2400' and `foo'.

awk '/2400/ && /foo/' BBS-list

The following command prints all records in the input file `BBS-list' that contain either `2400' or `foo', or both.

awk '/2400/ || /foo/' BBS-list

The following command prints all records in the input file `BBS-list' that do not contain the string `foo'.

awk '! /foo/' BBS-list

Note that boolean patterns are a special case of expression patterns (see section Expressions as Patterns); they are expressions that use the boolean operators. See section Boolean Expressions, for complete information on the boolean operators.

The subpatterns of a boolean pattern can be constant regular expressions, comparisons, or any other awk expressions. Range patterns are not expressions, so they cannot appear inside boolean patterns. Likewise, the special patterns BEGIN and END, which never match any input record, are not expressions and cannot appear inside boolean patterns.
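
The following sketch runs the three boolean operators over the same input, and adds a fourth pattern that mixes a regexp with a comparison. The four records are invented and stand in for a file such as `BBS-list'.

```shell
input=$(printf '%s\n' 'alpha 2400 foo' 'beta 2400' 'gamma 1200 foo' 'delta 300')

both=$(printf '%s\n' "$input"   | awk '/2400/ && /foo/')
either=$(printf '%s\n' "$input" | awk '/2400/ || /foo/')
none=$(printf '%s\n' "$input"   | awk '! /foo/')

# Subpatterns need not be regexps: here one of them is a comparison.
mixed=$(printf '%s\n' "$input"  | awk '/2400/ && $1 != "alpha"')

printf '%s\n' "$both" --- "$either" --- "$none" --- "$mixed"
```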

Expressions as Patterns

Any awk expression is also valid as an awk pattern. Then the pattern "matches" if the expression's value is nonzero (if a number) or nonnull (if a string).

The expression is reevaluated each time the rule is tested against a new input record. If the expression uses fields such as $1, the value depends directly on the new input record's text; otherwise, it depends only on what has happened so far in the execution of the awk program, but that may still be useful.

Comparison patterns are actually a special case of this. For example, the expression $5 == "foo" has the value 1 when the value of $5 equals "foo", and 0 otherwise; therefore, this expression as a pattern matches when the two values are equal.

Boolean patterns are also special cases of expression patterns.

A constant regexp as a pattern is also a special case of an expression pattern. /foo/ as an expression has the value 1 if `foo' appears in the current input record; thus, as a pattern, /foo/ matches any record containing `foo'.

Other implementations of awk that are not yet POSIX compliant are less general than gawk: they allow comparison expressions, and boolean combinations thereof (optionally with parentheses), but not necessarily other kinds of expressions.
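
A few expressions that are neither comparisons nor regexps can serve as patterns. The sketch below uses invented input with one empty record; the variable seen is a hypothetical flag that shows a pattern depending on earlier records.

```shell
input=$(printf '%s\n' a '' 'b c' d)

# NF is zero for an empty record, so this prints the non-empty ones.
nonempty=$(printf '%s\n' "$input" | awk 'NF')

# NR % 2 is nonzero for odd-numbered records.
odd=$(printf '%s\n' "$input" | awk 'NR % 2')

# The flag is null until a record containing b has been read, so only
# records after that one are printed.
after=$(printf '%s\n' "$input" | awk 'seen { print } /b/ { seen = 1 }')

printf '%s\n' "$nonempty" --- "$odd" --- "$after"
```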

Specifying Record Ranges with Patterns

A range pattern is made of two patterns separated by a comma, of the form begpat, endpat. It matches ranges of consecutive input records. The first pattern begpat controls where the range begins, and the second one endpat controls where it ends. For example,

awk '$1 == "on", $1 == "off"'

prints every record between `on'/`off' pairs, inclusive.

A range pattern starts out by matching begpat against every input record; when a record matches begpat, the range pattern is turned on and matches that record. As long as it stays turned on, it automatically matches every input record read. It also matches endpat against every input record, including the one that turned it on; when that succeeds, the range pattern is turned off again for the following record. It then goes back to checking begpat against each record.

The record that turns on the range pattern and the one that turns it off both match the range pattern. If you don't want to operate on these records, you can write if statements in the rule's action to distinguish them.

It is possible for a pattern to be turned both on and off by the same record, if both conditions are satisfied by that record. Then the action is executed for just that record.
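
The behavior described above can be observed on a short input. This is a sketch with invented records; the third command uses NR == 2 for both patterns simply to force one record to turn the range on and off.

```shell
input=$(printf '%s\n' x on a off y on b off)

# Whole range, delimiter records included.
range_all=$(printf '%s\n' "$input" | awk '$1 == "on", $1 == "off"')

# Same range, but an if statement in the action skips the delimiters.
range_inner=$(printf '%s\n' "$input" |
  awk '$1 == "on", $1 == "off" { if ($1 != "on" && $1 != "off") print }')

# Record 2 satisfies both patterns, so the range covers that record only.
range_single=$(printf '%s\n' "$input" | awk 'NR == 2, NR == 2')

printf '%s\n' "$range_all" --- "$range_inner" --- "$range_single"
```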

BEGIN and END Special Patterns

BEGIN and END are special patterns. They are not used to match input records. Rather, they are used for supplying start-up or clean-up information to your awk script. A BEGIN rule is executed, once, before the first input record has been read. An END rule is executed, once, after all the input has been read. For example:

awk 'BEGIN { print "Analysis of \"foo\"" }
     /foo/ { ++foobar }
     END   { print "\"foo\" appears " foobar " times." }' BBS-list

This program finds the number of records in the input file `BBS-list' that contain the string `foo'. The BEGIN rule prints a title for the report. There is no need to use the BEGIN rule to initialize the counter foobar to zero, as awk does this for us automatically (see section Variables).

The second rule increments the variable foobar every time a record containing the pattern `foo' is read. The END rule prints the value of foobar at the end of the run.

The special patterns BEGIN and END cannot be used in ranges or with boolean operators (indeed, they cannot be used with any operators).

An awk program may have multiple BEGIN and/or END rules. They are executed in the order they appear, all the BEGIN rules at start-up and all the END rules at termination.

Multiple BEGIN and END sections are useful for writing library functions, since each library can have its own BEGIN or END rule to do its own initialization and/or cleanup. Note that the order in which library functions are named on the command line controls the order in which their BEGIN and END rules are executed. Therefore you have to be careful to write such rules in library files so that the order in which they are executed doesn't matter. See section Invoking awk, for more information on using library functions.

If an awk program only has a BEGIN rule, and no other rules, then the program exits after the BEGIN rule has been run. (Older versions of awk used to keep reading and ignoring input until end of file was seen.) However, if an END rule exists as well, then the input will be read, even if there are no other rules in the program. This is necessary in case the END rule checks the NR variable.

BEGIN and END rules must have actions; there is no default action for these rules since there is no current record when they run.
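
The ordering of multiple rules, and the two cases of a program with and without an END rule, can be sketched as follows. The input and the printed messages are invented for illustration.

```shell
# Two BEGIN rules run in the order written; the END rule sees the final NR.
report=$(printf '%s\n' a b c |
  awk 'BEGIN { print "first" }
       BEGIN { print "second" }
       END   { print NR, "records" }')

# A program with only a BEGIN rule. Standard input is redirected from
# /dev/null as a precaution, so the command cannot wait on a terminal.
begin_only=$(awk 'BEGIN { print "no input read" }' < /dev/null)

printf '%s\n' "$report" --- "$begin_only"
```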

The Empty Pattern

An empty pattern is considered to match every input record. For example, the program:

awk '{ print $1 }' BBS-list

prints the first field of every record.
