Arrays in awk
An array is a table of values, called elements. The elements of an array are distinguished by their indices. Indices may be either numbers or strings. Each array has a name, which looks like a variable name, but must not be in use as a variable name in the same awk program.
The awk language has one-dimensional arrays for storing groups of related strings or numbers.
Every awk array must have a name. Array names have the same syntax as variable names; any valid variable name would also be a valid array name. But you cannot use one name in both ways (as an array and as a variable) in one awk program.
Arrays in awk superficially resemble arrays in other programming languages, but there are fundamental differences. In awk, you don't need to specify the size of an array before you start to use it. Additionally, any number or string in awk may be used as an array index.
In most other languages, you have to declare an array and specify how many elements or components it contains. In such languages, the declaration causes a contiguous block of memory to be allocated for that many elements. An index in the array must be a nonnegative integer; for example, the index 0 specifies the first element in the array, which is actually stored at the beginning of the block of memory. Index 1 specifies the second element, which is stored in memory right after the first element, and so on. It is impossible to add more elements to the array, because it has room for only as many elements as you declared.
A contiguous array of four elements might look like this, conceptually, if the element values are 8, "foo", "" and 30:
    +---------+---------+--------+---------+
    |    8    |  "foo"  |   ""   |    30   |    value
    +---------+---------+--------+---------+
         0         1         2        3         index
Only the values are stored; the indices are implicit from the order of the values. 8 is the value at index 0, because 8 appears in the position with 0 elements before it.
Arrays in awk are different: they are associative. This means that each array is a collection of pairs: an index, and its corresponding array element value:
    Element 4     Value 30
    Element 2     Value "foo"
    Element 1     Value 8
    Element 3     Value ""
We have shown the pairs in jumbled order because their order is irrelevant.
One advantage of an associative array is that new pairs can be added at any time. For example, suppose we add to the above array a tenth element whose value is "number ten". The result is this:
    Element 10    Value "number ten"
    Element 4     Value 30
    Element 2     Value "foo"
    Element 1     Value 8
    Element 3     Value ""
Now the array is sparse (i.e., some indices are missing): it has elements 1 through 4 and 10, but doesn't have elements 5, 6, 7, 8, or 9.
Another consequence of associative arrays is that the indices don't have to be positive integers. Any number, or even a string, can be an index. For example, here is an array which translates words from English into French:
    Element "dog" Value "chien"
    Element "cat" Value "chat"
    Element "one" Value "un"
    Element 1     Value "un"
Here we decided to translate the number 1 in both spelled-out and numeric form, thus illustrating that a single array can have both numbers and strings as indices.
When awk creates an array for you, e.g., with the split built-in function, that array's indices are consecutive integers starting at 1. (See section Built-in Functions for String Manipulation.)
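As a small illustration of the point above, the following sketch calls split on a made-up string and prints each index with its value. The array name parts and the sample text are assumptions chosen only for this example.

```shell
out=$(awk 'BEGIN {
    # split() fills "parts" with elements at indices 1, 2, 3, ...
    n = split("red green blue", parts, " ")
    for (i = 1; i <= n; i++)
        printf "%d=%s\n", i, parts[i]
}')
printf '%s\n' "$out"
```

Counting from 1 up to the value returned by split visits the pieces in their original order.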
The principal way of using an array is to refer to one of its elements. An array reference is an expression which looks like this:
    array[index]
Here, array is the name of an array. The expression index is the index of the element of the array that you want.
The value of the array reference is the current value of that array element. For example, foo[4.3] is an expression for the element of array foo at index 4.3.
If you refer to an array element that has no recorded value, the value of the reference is "", the null string. This includes elements to which you have not assigned any value, and elements that have been deleted (see section The delete Statement). Such a reference automatically creates that array element, with the null string as its value. (In some cases, this is unfortunate, because it might waste memory inside awk.)
You can find out if an element exists in an array at a certain index with the expression:
    index in array
This expression tests whether or not the particular index exists, without the side effect of creating that element if it is not present. The expression has the value 1 (true) if array[index] exists, and 0 (false) if it does not exist.
For example, to test whether the array frequencies contains the index "2", you could write this statement:
    if ("2" in frequencies)
        print "Subscript \"2\" is present."
Note that this is not a test of whether or not the array frequencies contains an element whose value is "2". (There is no way to do that except to scan all the elements.) Also, this does not create frequencies["2"], while the following (incorrect) alternative would do so:
    if (frequencies["2"] != "")
        print "Subscript \"2\" is present."
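The difference between the two tests can be observed directly. The sketch below is only an illustration: it runs the in test, then the reference-based test, then the in test again on an array that starts out empty.

```shell
out=$(awk 'BEGIN {
    # The "in" test does not create the element.
    if ("2" in frequencies)
        print "present"
    else
        print "absent"
    # A plain reference creates frequencies["2"] with a null value...
    if (frequencies["2"] != "")
        print "nonempty"
    else
        print "empty"
    # ...so the same "in" test now succeeds.
    if ("2" in frequencies)
        print "present"
    else
        print "absent"
}')
printf '%s\n' "$out"
```

The last test succeeds only because the reference in the middle created the element as a side effect.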
Array elements are lvalues: they can be assigned values just like awk variables:
    array[subscript] = value
Here array is the name of your array. The expression subscript is the index of the element of the array that you want to assign a value. The expression value is the value you are assigning to that element of the array.
The following program takes a list of lines, each beginning with a line number, and prints them out in order of line number. The line numbers are not in order, however, when they are first read: they are scrambled. This program sorts the lines by making an array using the line numbers as subscripts. It then prints out the lines in sorted order of their numbers. It is a very simple program, and gets confused if it encounters repeated numbers, gaps, or lines that don't begin with a number.
    {
      if ($1 > max)
        max = $1
      arr[$1] = $0
    }

    END {
      for (x = 1; x <= max; x++)
        print arr[x]
    }
The first rule keeps track of the largest line number seen so far; it also stores each line into the array arr, at an index that is the line's number.
The second rule runs after all the input has been read, to print out all the lines.
When this program is run with the following input:
    5  I am the Five man
    2  Who are you?  The new number two!
    4  . . . And four on the floor
    1  Who is number one?
    3  I three you.
its output is this:
    1  Who is number one?
    2  Who are you?  The new number two!
    3  I three you.
    4  . . . And four on the floor
    5  I am the Five man
If a line number is repeated, the last line with a given number overrides the others.
Gaps in the line numbers can be handled with an easy improvement to the program's END rule:
    END {
      for (x = 1; x <= max; x++)
        if (x in arr)
          print arr[x]
    }
In programs that use arrays, often you need a loop that executes once for each element of an array. In other languages, where arrays are contiguous and indices are limited to nonnegative integers, this is easy: the largest index is one less than the length of the array, and you can find all the valid indices by counting from zero up to that value. This technique won't do the job in awk, since any number or string may be an array index. So awk has a special kind of for statement for scanning an array:
    for (var in array)
      body
This loop executes body once for each different value that your program has previously used as an index in array, with the variable var set to that index.
Here is a program that uses this form of the for statement. The first rule scans the input records and notes which words appear (at least once) in the input, by storing a 1 into the array used with the word as index. The second rule scans the elements of used to find all the distinct words that appear in the input. It prints each word that is more than 10 characters long, and also prints the number of such words. See section Built-in Functions, for more information on the built-in function length.
    # Record a 1 for each word that is used at least once.
    {
      for (i = 1; i <= NF; i++)
        used[$i] = 1
    }

    # Find number of distinct words more than 10 characters long.
    END {
      for (x in used)
        if (length(x) > 10) {
          ++num_long_words
          print x
        }
      print num_long_words, "words longer than 10 characters"
    }
See section Sample Program, for a more detailed example of this type.
The order in which elements of the array are accessed by this statement is determined by the internal arrangement of the array elements within awk and cannot be controlled or changed. This can lead to problems if new elements are added to array by statements in body; you cannot predict whether or not the for loop will reach them. Similarly, changing var inside the loop can produce strange results. It is best to avoid such things.
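Because the scan order cannot be controlled, a program that needs predictable output has to impose an order itself. One possible workaround, which is an assumption here and not something the text above prescribes, is to pipe the output through the external sort utility. The two input lines are made up for illustration.

```shell
out=$(printf 'b a\nc a\n' | awk '
# Note each distinct word, as in the example program above.
{ for (i = 1; i <= NF; i++) used[$i] = 1 }
# The scan order is unspecified, so the words are sorted afterwards.
END { for (w in used) print w }' | sort)
printf '%s\n' "$out"
```

Each distinct word is printed once, and sort puts the words in a fixed order regardless of how awk arranged them internally.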
The delete Statement
You can remove an individual element of an array using the delete statement:
    delete array[index]
Once an element has been deleted, it is as if you had never referred to it and had never given it any value: you can no longer obtain any value the element once had.
Here is an example of deleting elements in an array:
    for (i in frequencies)
      delete frequencies[i]
This example removes all the elements from the array frequencies.
If you delete an element, a subsequent for statement to scan the array will not report that element, and the in operator to check for the presence of that element will return 0:
    delete foo[4]
    if (4 in foo)
      print "This will never be printed"
It is not an error to delete an element which does not exist.
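The behavior described above can be checked with a short sketch. It stores one element, deletes it twice (the second delete targets an element that no longer exists), and then counts what a scanning loop still finds. The stored value is arbitrary.

```shell
out=$(awk 'BEGIN {
    foo[4] = "x"
    if (4 in foo)
        print "before: present"
    delete foo[4]
    if (!(4 in foo))
        print "after: absent"
    delete foo[4]          # deleting a missing element is not an error
    n = 0
    for (i in foo)         # the scan no longer reports index 4
        n++
    print "elements left:", n
}')
printf '%s\n' "$out"
```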
An important aspect of arrays to remember is that array subscripts are always strings. If you use a numeric value as a subscript, it will be converted to a string value before it is used for subscripting (see section Conversion of Strings and Numbers).
This means that the value of CONVFMT can potentially affect how your program accesses elements of an array. For example:
    a = b = 12.153
    data[a] = 1
    CONVFMT = "%2.2f"
    if (b in data)
        printf "%s is in data", b
    else
        printf "%s is not in data", b
This should print "12.15 is not in data". The first statement gives both a and b the same numeric value. Assigning to data[a] first gives a the string value "12.153" (using the default conversion value of CONVFMT, "%.6g"), and then assigns 1 to data["12.153"]. The program then changes the value of CONVFMT. The test (b in data) forces b to be converted to a string, this time "12.15", since the new value of CONVFMT allows only two digits after the decimal point. This test fails, since "12.15" is a different string from "12.153".
According to the rules for conversions (see section Conversion of Strings and Numbers), integer values are always converted to strings as integers, no matter what the value of CONVFMT may happen to be. So the usual case of
    for (i = 1; i <= maxsub; i++)
        do something with array[i]
will work, no matter what the value of CONVFMT.
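To see the integer rule in action, the following sketch sets CONVFMT to the same "%2.2f" used earlier and then fills an array with integer subscripts. The array name arr and the stored values are assumptions for the example.

```shell
out=$(awk 'BEGIN {
    CONVFMT = "%2.2f"
    for (i = 1; i <= 3; i++)
        arr[i] = i * 10
    # Integer subscripts are stored as "1", "2", "3", not "1.00" etc.
    if ("2" in arr)
        print "index 2 found"
    if (!("2.00" in arr))
        print "index 2.00 not found"
}')
printf '%s\n' "$out"
```

Both messages should appear, since the integer value 2 converts to the string "2" whatever CONVFMT holds.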
As with many things in awk, the majority of the time things work as you would expect them to work. But it is useful to have a precise knowledge of the actual rules, since sometimes they can have a subtle effect on your programs.
A multi-dimensional array is an array in which an element is identified by a sequence of indices, not a single index. For example, a two-dimensional array requires two indices. The usual way (in most languages, including awk) to refer to an element of a two-dimensional array named grid is with grid[x,y].
Multi-dimensional arrays are supported in awk through concatenation of indices into one string. What happens is that awk converts the indices into strings (see section Conversion of Strings and Numbers) and concatenates them together, with a separator between them. This creates a single string that describes the values of the separate indices. The combined string is used as a single index into an ordinary, one-dimensional array. The separator used is the value of the built-in variable SUBSEP.
For example, suppose we evaluate the expression foo[5,12]="value" when the value of SUBSEP is "@". The numbers 5 and 12 are converted to strings and concatenated with an "@" between them, yielding "5@12"; thus, the array element foo["5@12"] is set to "value".
Once the element's value is stored, awk has no record of whether it was stored with a single index or a sequence of indices. The two expressions foo[5,12] and foo[5 SUBSEP 12] always have the same value.
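The equivalence can be checked with a short sketch that reuses the "@" separator from the example above. It stores one element with a two-part index and then reads it back through the two single-index spellings.

```shell
out=$(awk 'BEGIN {
    SUBSEP = "@"
    foo[5,12] = "value"
    print foo["5@12"]          # the combined string written out by hand
    print foo[5 SUBSEP 12]     # the same string built by concatenation
    for (k in foo)             # only one element exists
        print k
}')
printf '%s\n' "$out"
```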
The default value of SUBSEP is the string "\034", which contains a nonprinting character that is unlikely to appear in an awk program or in the input data.
The usefulness of choosing an unlikely character comes from the fact that index values that contain a string matching SUBSEP lead to combined strings that are ambiguous. Suppose that SUBSEP were "@"; then foo["a@b", "c"] and foo["a", "b@c"] would be indistinguishable because both would actually be stored as foo["a@b@c"]. Because SUBSEP is "\034", such confusion can arise only when an index contains the character with ASCII code octal 034, which is a rare event.
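The collision described above is easy to reproduce. This sketch sets SUBSEP to "@" on purpose and stores two values under index sequences that combine into the same string; the stored values are arbitrary labels.

```shell
out=$(awk 'BEGIN {
    SUBSEP = "@"
    foo["a@b", "c"] = "first"
    foo["a", "b@c"] = "second"   # same combined index "a@b@c"
    n = 0
    for (k in foo) {
        n++
        print k, foo[k]
    }
    print n                      # number of distinct elements
}')
printf '%s\n' "$out"
```

Only one element survives, holding the value from the second assignment.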
You can test whether a particular index-sequence exists in a "multi-dimensional" array with the same operator in used for single dimensional arrays. Instead of a single index as the left-hand operand, write the whole sequence of indices, separated by commas, in parentheses:
    (subscript1, subscript2, ...) in array
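A minimal sketch of this test, using a hypothetical two-dimensional array named grid with a single stored element:

```shell
out=$(awk 'BEGIN {
    grid[1, 2] = "x"
    if ((1, 2) in grid)
        print "(1,2) present"
    if (!((2, 1) in grid))
        print "(2,1) absent"
}')
printf '%s\n' "$out"
```

As with a single index, the test does not create the element it asks about.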
The following example treats its input as a two-dimensional array of fields; it rotates this array 90 degrees clockwise and prints the result. It assumes that all lines have the same number of elements.
    awk '{
         if (max_nf < NF)
              max_nf = NF
         max_nr = NR
         for (x = 1; x <= NF; x++)
              vector[x, NR] = $x
    }

    END {
         for (x = 1; x <= max_nf; x++) {
              for (y = max_nr; y >= 1; --y)
                   printf("%s ", vector[x, y])
              printf("\n")
         }
    }'
When given the input:
    1 2 3 4 5 6
    2 3 4 5 6 1
    3 4 5 6 1 2
    4 5 6 1 2 3
it produces:
    4 3 2 1
    5 4 3 2
    6 5 4 3
    1 6 5 4
    2 1 6 5
    3 2 1 6
There is no special for statement for scanning a "multi-dimensional" array; there cannot be one, because in truth there are no multi-dimensional arrays or elements; there is only a multi-dimensional way of accessing an array.
However, if your program has an array that is always accessed as multi-dimensional, you can get the effect of scanning it by combining the scanning for statement (see section Scanning all Elements of an Array) with the split built-in function (see section Built-in Functions for String Manipulation). It works like this:
    for (combined in array) {
      split(combined, separate, SUBSEP)
      ...
    }
This finds each concatenated, combined index in the array, and splits it into the individual indices by breaking it apart where the value of SUBSEP appears. The split-out indices become the elements of the array separate.
Thus, suppose you have previously stored a value in array[1, "foo"]; then an element with index "1\034foo" exists in array. (Recall that the default value of SUBSEP contains the character with code 034.) Sooner or later the for statement will find that index and do an iteration with combined set to "1\034foo". Then the split function is called as follows:
    split("1\034foo", separate, "\034")
The result of this is to set separate[1] to 1 and separate[2] to "foo". Presto, the original sequence of separate indices has been recovered.
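The same steps can be run end to end. This sketch stores one element under the index sequence from the text, scans the array, and splits the combined index on the default SUBSEP; the stored value "v" is arbitrary.

```shell
out=$(awk 'BEGIN {
    array[1, "foo"] = "v"
    for (combined in array) {
        # Break the combined index apart at each SUBSEP character.
        n = split(combined, separate, SUBSEP)
        print n, separate[1], separate[2]
    }
}')
printf '%s\n' "$out"
```

The count returned by split gives the number of indices in the sequence, which helps when different elements use different numbers of dimensions.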