Patterns in awk control the execution of rules: a rule is executed when its pattern matches the current input record. This chapter tells all about how to write patterns.
Here is a summary of the types of patterns supported in awk.
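/regular expression/
A regular expression as a pattern. It matches when the text of the input record fits the regular expression.
expression
A single expression. It matches when its value is nonzero (if a number) or nonnull (if a string).
pat1, pat2
A pair of patterns separated by a comma, specifying a range of records.
BEGIN
END
Special patterns for supplying start-up or clean-up information to awk. (See section BEGIN and END Special Patterns.)
null
The empty pattern matches every input record.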
A regular expression, or regexp, is a way of describing a class of strings. A regular expression enclosed in slashes (`/') is an awk pattern that matches every input record whose text belongs to that class.
The simplest regular expression is a sequence of letters, numbers, or both. Such a regexp matches any string that contains that sequence. Thus, the regexp `foo' matches any string containing `foo'. Therefore, the pattern /foo/ matches any input record containing `foo'. Other kinds of regexps let you specify more complicated classes of strings.
A regular expression can be used as a pattern by enclosing it in slashes. Then the regular expression is matched against the entire text of each record. (Normally, it only needs to match some part of the text in order to succeed.) For example, this prints the second field of each record that contains `foo' anywhere:
awk '/foo/ { print $2 }' BBS-list
Regular expressions can also be used in comparison expressions. Then you can specify the string to match against; it need not be the entire current input record. These comparison expressions can be used as patterns or in if, while, for, and do statements.
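exp ~ /regexp/
This is true if the expression exp (taken as a character string) is matched by regexp. The following example matches, or selects, all input records with the upper-case letter `J' somewhere in the first field: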
awk '$1 ~ /J/' inventory-shipped
So does this:
awk '{ if ($1 ~ /J/) print }' inventory-shipped
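exp !~ /regexp/
This is true if the expression exp (taken as a character string) is not matched by regexp. The following example matches, or selects, all input records whose first field does not contain the upper-case letter `J':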
awk '$1 !~ /J/' inventory-shipped
The right hand side of a `~' or `!~' operator need not be a constant regexp (i.e., a string of characters between slashes). It may be any expression. The expression is evaluated, and converted if necessary to a string; the contents of the string are used as the regexp. A regexp that is computed in this way is called a dynamic regexp. For example:
identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*"
$0 ~ identifier_regexp
sets identifier_regexp to a regexp that describes awk variable names, and tests if the input record matches this regexp.
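A dynamic regexp is most useful when the pattern comes from outside the program. The following is a minimal sketch, not part of the original examples: the shell variable `pat`, the awk variable `re`, and the three input lines are all made up for illustration. The regexp is anchored with `^' and `$' so that the whole record must look like a variable name.

```shell
# Hypothetical pattern, passed to awk as the variable "re".
pat='^[A-Za-z_][A-Za-z_0-9]*$'

# Hypothetical input: three made-up one-word records.
out=$(printf 'count\n9lives\n_tmp\n' |
      awk -v re="$pat" '$0 ~ re { print $0 " looks like a variable name" }')
echo "$out"
```

Only `count' and `_tmp' are reported; `9lives' begins with a digit and so does not match.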
You can combine regular expressions with the following characters, called regular expression operators, or metacharacters, to increase the power and versatility of regular expressions.
Here is a table of metacharacters. All characters not listed in the table stand for themselves.
^
This matches the beginning of the string. For example:
^@chapter
matches the `@chapter' at the beginning of a string, and can be used to identify chapter beginnings in Texinfo source files.
$
This is similar to `^', but it matches only at the end of a string. For example:
p$
matches a record that ends with a `p'.
.
This matches any single character. For example:
.P
matches any single character followed by a `P' in a string. Using concatenation we can make regular expressions like `U.A', which matches any three-character sequence that begins with `U' and ends with `A'.
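[...]
This is called a character set. It matches any one of the characters that are enclosed in the square brackets. For example: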
[MVX]
matches any one of the characters `M', `V', or `X' in a string.
Ranges of characters are indicated by using a hyphen between the beginning and ending characters, and enclosing the whole thing in brackets. For example:
[0-9]
matches any digit.
To include the character `\', `]', `-' or `^' in a character set, put a `\' in front of it. For example:
[d\]]
matches either `d', or `]'.
This treatment of `\' is compatible with other awk implementations, and is also mandated by the POSIX Command Language and Utilities standard. The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional egrep utility.
In egrep syntax, backslash is not syntactically special within square brackets. This means that special tricks have to be used to represent the characters `]', `-' and `^' as members of a character set.
In egrep syntax, to match `-', write it as `---', which is a range containing only `-'. You may also give `-' as the first or last character in the set. To match `^', put it anywhere except as the first character of a set. To match a `]', make it the first character in the set. For example:
[]d^]
matches either `]', `d' or `^'.
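[^ ...]
This is a complemented character set. The first character after the `[' must be a `^'. It matches any character except those in the square brackets. For example: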
[^0-9]
matches any character that is not a digit.
|
This is the alternation operator and it is used to specify alternatives. For example:
^P|[0-9]
matches any string that matches either `^P' or `[0-9]'. This means it matches any string that contains a digit or starts with `P'.
The alternation applies to the largest possible regexps on either side.
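(...)
Parentheses are used for grouping in regular expressions, as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, `|'.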
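*
This symbol means that the preceding regular expression is to be repeated as many times as possible to find a match. For example: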
ph*
applies the `*' symbol to the preceding `h' and looks for matches to one `p' followed by any number of `h's. This will also match just `p' if no `h's are present.
The `*' repeats the smallest possible preceding expression. (Use parentheses if you wish to repeat a larger expression.) It finds as many repetitions as possible. For example:
awk '/\(c[ad][ad]*r x\)/ { print }' sample
prints every record in the input containing a string of the form `(car x)', `(cdr x)', `(cadr x)', and so on.
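+
This symbol is similar to `*', but the preceding expression must be matched at least once. This means that: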
wh+y
would match `why' and `whhy' but not `wy', whereas `wh*y' would match all three of these strings. This is a simpler way of writing the last `*' example:
awk '/\(c[ad]+r x\)/ { print }' sample
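?
This symbol is similar to `*', but the preceding expression can be matched once or not at all. For example: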
fe?d
will match `fed' and `fd', but nothing else.
\
This is used to suppress the special meaning of a character when matching. For example:
\$
matches the character `$'.
The escape sequences used for string constants (see section Constant Expressions) are valid in regular expressions as well; they are also introduced by a `\'.
In regular expressions, the `*', `+', and `?' operators have the highest precedence, followed by concatenation, and finally by `|'. As in arithmetic, parentheses can change how operators are grouped.
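The effect of precedence and grouping is easiest to see side by side. The sketch below uses four made-up input lines (they are not from any sample file in this manual) and compares `^P|[0-9]' with the parenthesized form `^(P|[0-9])'.

```shell
# Hypothetical input lines, one per record.
input='Pat
xP
a7
P3'

# No parentheses: the record starts with P, OR contains a digit anywhere.
loose=$(printf '%s\n' "$input" | awk '/^P|[0-9]/')

# With parentheses: the first character is P or a digit.
tight=$(printf '%s\n' "$input" | awk '/^(P|[0-9])/')

echo "loose:"; echo "$loose"
echo "tight:"; echo "$tight"
```

The first command selects `Pat', `a7' and `P3'; the second selects only `Pat' and `P3', because the `^' now applies to both alternatives.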
Case is normally significant in regular expressions, both when matching ordinary characters (i.e., not metacharacters), and inside character sets. Thus a `w' in a regular expression matches only a lower case `w' and not an upper case `W'.
The simplest way to do a case-independent match is to use a character set: `[Ww]'. However, this can be cumbersome if you need to use it often; and it can make the regular expressions harder for humans to read. There are two other alternatives that you might prefer.
One way to do a case-insensitive match at a particular point in the program is to convert the data to a single case, using the tolower or toupper built-in string functions (which we haven't discussed yet; see section Built-in Functions for String Manipulation). For example:
tolower($1) ~ /foo/ { ... }
converts the first field to lower case before matching against it.
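Here is a small runnable sketch of the same idea. The three two-field records are invented for illustration. Note that tolower returns a converted copy; the field itself is printed with its original capitalization.

```shell
# Hypothetical records: a name and a number.
out=$(printf 'Foo 1\nbar 2\nFOOD 3\n' |
      awk 'tolower($1) ~ /foo/ { print $1, $2 }')
echo "$out"
```

This prints `Foo 1' and `FOOD 3', and skips `bar 2'.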
Another method is to set the variable IGNORECASE to a nonzero value (see section Built-in Variables). When IGNORECASE is not zero, all regexp operations ignore case. Changing the value of IGNORECASE dynamically controls the case sensitivity of your program as it runs. Case is significant by default because IGNORECASE (like most variables) is initialized to zero.
x = "aB"
if (x ~ /ab/) ...   # this test will fail
IGNORECASE = 1
if (x ~ /ab/) ...   # now it will succeed
In general, you cannot use IGNORECASE to make certain rules case-insensitive and other rules case-sensitive, because there is no way to set IGNORECASE just for the pattern of a particular rule. To do this, you must use character sets or tolower. However, one thing you can do only with IGNORECASE is turn case-sensitivity on or off dynamically for all the rules at once.
IGNORECASE can be set on the command line, or in a BEGIN rule. Setting IGNORECASE from the command line is a way to make a program case-insensitive without having to edit it.
The value of IGNORECASE has no effect if gawk is in compatibility mode (see section Invoking awk). Case is always significant in compatibility mode.
Comparison patterns test relationships such as equality between two strings or numbers. They are a special case of expression patterns (see section Expressions as Patterns). They are written with relational operators, which are a superset of those in C. Here is a table of them:
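x < y     True if x is less than y.
x <= y    True if x is less than or equal to y.
x > y     True if x is greater than y.
x >= y    True if x is greater than or equal to y.
x == y    True if x is equal to y.
x != y    True if x is not equal to y.
x ~ y     True if the string x matches the regexp denoted by y.
x !~ y    True if the string x does not match the regexp denoted by y.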
The operands of a relational operator are compared as numbers if they are both numbers. Otherwise they are converted to, and compared as, strings (see section Conversion of Strings and Numbers, for the detailed rules). Strings are compared by comparing the first character of each, then the second character of each, and so on, until there is a difference. If the two strings are equal until the shorter one runs out, the shorter one is considered to be less than the longer one. Thus, "10" is less than "9", and "abc" is less than "abcd".
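The difference between string and numeric comparison can be checked directly. This sketch uses only constants in a BEGIN rule, so it needs no input file; a true comparison prints 1 and a false one prints 0.

```shell
out=$(awk 'BEGIN {
    print ("10" < "9")      # string comparison: "1" sorts before "9"
    print (10 < 9)          # numeric comparison
    print ("abc" < "abcd")  # the shorter string is less
}')
echo "$out"
```

The three lines printed are 1, 0 and 1.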
The left operand of the `~' and `!~' operators is a string. The right operand is either a constant regular expression enclosed in slashes (/regexp/), or any expression, whose string value is used as a dynamic regular expression (see section How to Use Regular Expressions).
The following example prints the second field of each input record whose first field is precisely `foo'.
awk '$1 == "foo" { print $2 }' BBS-list
Contrast this with the following regular expression match, which would accept any record with a first field that contains `foo':
awk '$1 ~ "foo" { print $2 }' BBS-list
or, equivalently, this one:
awk '$1 ~ /foo/ { print $2 }' BBS-list
A boolean pattern is an expression which combines other patterns using the boolean operators "or" (`||'), "and" (`&&'), and "not" (`!'). Whether the boolean pattern matches an input record depends on whether its subpatterns match.
For example, the following command prints all records in the input file `BBS-list' that contain both `2400' and `foo'.
awk '/2400/ && /foo/' BBS-list
The following command prints all records in the input file `BBS-list' that contain either `2400' or `foo', or both.
awk '/2400/ || /foo/' BBS-list
The following command prints all records in the input file `BBS-list' that do not contain the string `foo'.
awk '! /foo/' BBS-list
Note that boolean patterns are a special case of expression patterns (see section Expressions as Patterns); they are expressions that use the boolean operators. See section Boolean Expressions, for complete information on the boolean operators.
The subpatterns of a boolean pattern can be constant regular expressions, comparisons, or any other awk expressions. Range patterns are not expressions, so they cannot appear inside boolean patterns. Likewise, the special patterns BEGIN and END, which never match any input record, are not expressions and cannot appear inside boolean patterns.
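Because the subpatterns may be any expressions, a regexp match and a numeric comparison can be mixed freely. The following sketch assumes made-up two-field records (a name and a number) rather than the `BBS-list' file.

```shell
# Hypothetical records: a name and a baud rate.
out=$(printf 'alpha 2400\nfoonly 1200\nfoobar 2400\n' |
      awk '$1 ~ /^foo/ && $2 > 1200 { print $1 }')
echo "$out"
```

Only `foobar' satisfies both subpatterns, so only that name is printed.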
Any awk expression is also valid as an awk pattern. Then the pattern "matches" if the expression's value is nonzero (if a number) or nonnull (if a string).
The expression is reevaluated each time the rule is tested against a new input record. If the expression uses fields such as $1, the value depends directly on the new input record's text; otherwise, it depends only on what has happened so far in the execution of the awk program, but that may still be useful.
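As an illustration of a pattern that does not look at the record's text at all, the sketch below uses the expression `NR % 2', which is nonzero for odd-numbered records. The five input lines are invented for the example.

```shell
# Hypothetical input: five one-word records.
out=$(printf 'one\ntwo\nthree\nfour\nfive\n' | awk 'NR % 2')
echo "$out"
```

This prints `one', `three' and `five'. No action is given, so the default action of printing the record applies.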
Comparison patterns are actually a special case of this. For example, the expression $5 == "foo" has the value 1 when the value of $5 equals "foo", and 0 otherwise; therefore, this expression as a pattern matches when the two values are equal.
Boolean patterns are also special cases of expression patterns.
A constant regexp as a pattern is also a special case of an expression pattern. /foo/ as an expression has the value 1 if `foo' appears in the current input record; thus, as a pattern, /foo/ matches any record containing `foo'.
Other implementations of awk that are not yet POSIX compliant are less general than gawk: they allow comparison expressions, and boolean combinations thereof (optionally with parentheses), but not necessarily other kinds of expressions.
A range pattern is made of two patterns separated by a comma, of the form begpat, endpat. It matches ranges of consecutive input records. The first pattern begpat controls where the range begins, and the second one endpat controls where it ends. For example,
awk '$1 == "on", $1 == "off"'
prints every record between `on'/`off' pairs, inclusive.
A range pattern starts out by matching begpat against every input record; when a record matches begpat, the range pattern becomes turned on. The range pattern matches this record. As long as it stays turned on, it automatically matches every input record read. It also matches endpat against every input record; when that succeeds, the range pattern is turned off again for the following record. Now it goes back to checking begpat against each record.
The record that turns on the range pattern and the one that turns it off both match the range pattern. If you don't want to operate on these records, you can write if statements in the rule's action to distinguish them.
It is possible for a pattern to be turned both on and off by the same record, if both conditions are satisfied by that record. Then the action is executed for just that record.
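Both points can be tried out with a short sketch. The input lines here are made up. The first command uses an if statement to skip the `on' and `off' records themselves; the second uses the same condition for begpat and endpat, so the range is turned on and off by a single record.

```shell
# Hypothetical input with two on/off pairs.
out=$(printf 'a\non\nb\nc\noff\nd\non\ne\noff\n' |
      awk '$1 == "on", $1 == "off" {
               # Skip the two boundary records of each range.
               if ($1 != "on" && $1 != "off")
                   print
           }')
echo "$out"

# Range that starts and ends on the same record.
single=$(printf 'p\nq\nr\n' | awk 'NR == 2, NR == 2')
echo "$single"
```

The first command prints `b', `c' and `e'; the second prints only `q'.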
BEGIN and END Special Patterns
BEGIN and END are special patterns. They are not used to match input records. Rather, they are used for supplying start-up or clean-up information to your awk script. A BEGIN rule is executed, once, before the first input record has been read. An END rule is executed, once, after all the input has been read. For example:
awk 'BEGIN { print "Analysis of \"foo\"" }
     /foo/ { ++foobar }
     END   { print "\"foo\" appears " foobar " times." }' BBS-list
This program finds the number of records in the input file `BBS-list' that contain the string `foo'. The BEGIN rule prints a title for the report. There is no need to use the BEGIN rule to initialize the counter foobar to zero, as awk does this for us automatically (see section Variables).
The second rule increments the variable foobar every time a record containing the pattern `foo' is read. The END rule prints the value of foobar at the end of the run.
The special patterns BEGIN and END cannot be used in ranges or with boolean operators (indeed, they cannot be used with any operators).
An awk program may have multiple BEGIN and/or END rules. They are executed in the order they appear, all the BEGIN rules at start-up and all the END rules at termination.
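The ordering can be observed with a small sketch in which the rules are deliberately interleaved. Input is taken from `/dev/null' so that the END rules run immediately; the messages are made up for the example.

```shell
out=$(awk 'BEGIN { print "first BEGIN" }
           END   { print "first END" }
           BEGIN { print "second BEGIN" }
           END   { print "second END" }' </dev/null)
echo "$out"
```

Both BEGIN messages appear before either END message, each group in the order written.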
Multiple BEGIN and END sections are useful for writing library functions, since each library can have its own BEGIN or END rule to do its own initialization and/or cleanup. Note that the order in which library functions are named on the command line controls the order in which their BEGIN and END rules are executed. Therefore you have to be careful to write such rules in library files so that the order in which they are executed doesn't matter. See section Invoking awk, for more information on using library functions.
If an awk program only has a BEGIN rule, and no other rules, then the program exits after the BEGIN rule has been run. (Older versions of awk used to keep reading and ignoring input until end of file was seen.) However, if an END rule exists as well, then the input will be read, even if there are no other rules in the program. This is necessary in case the END rule checks the NR variable.
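A program consisting of nothing but an END rule therefore makes a simple record counter. The sketch below feeds it three made-up lines.

```shell
# Hypothetical input: three records.
count=$(printf 'a\nb\nc\n' | awk 'END { print NR }')
echo "$count"
```

All three records are read before the END rule runs, so the value printed is 3.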
BEGIN and END rules must have actions; there is no default action for these rules since there is no current record when they run.
An empty pattern is considered to match every input record. For example, the program:
awk '{ print $1 }' BBS-list
prints the first field of every record.