Complicated awk programs can often be simplified by defining your own functions. User-defined functions can be called just like built-in ones (see section Function Calls), but it is up to you to define them--to tell awk what they should do.
Definitions of functions can appear anywhere between the rules of the awk program. Thus, the general form of an awk program is extended to include sequences of rules and user-defined function definitions.
The definition of a function named name looks like this:
function name (parameter-list) { body-of-function }
name is the name of the function to be defined. A valid function name is like a valid variable name: a sequence of letters, digits and underscores, not starting with a digit. Functions share the same pool of names as variables and arrays.
parameter-list is a list of the function's arguments and local variable names, separated by commas. When the function is called, the argument names are used to hold the argument values given in the call. The local variables are initialized to the null string.
The body-of-function consists of awk statements. It is the most important part of the definition, because it says what the function should actually do. The argument names exist to give the body a way to talk about the arguments; local variables, to give the body places to keep temporary values.
Argument names are not distinguished syntactically from local variable names; instead, the number of arguments supplied when the function is called determines how many argument variables there are. Thus, if three argument values are given, the first three names in parameter-list are arguments, and the rest are local variables.
It follows that if the number of arguments is not the same in all calls to the function, some of the names in parameter-list may be arguments on some occasions and local variables on others. Another way to think of this is that omitted arguments default to the null string.
Usually when you write a function you know how many names you intend to use for arguments and how many you intend to use as locals. By convention, you should write an extra space between the arguments and the locals, so other people can follow how your function is supposed to be used.
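As a minimal sketch of this convention, consider the hypothetical function greet below (the function and its variable names are our own illustration). The names name and greeting are meant to be arguments, and msg, written after the extra space, is meant to be a local variable. When the call omits greeting, it starts out as the null string, so the function can supply a default:

```shell
locals_out=$(awk '
# greet is a hypothetical example function.
# name and greeting are intended as arguments; msg (after the extra
# spaces) is intended as a local variable.
function greet(name, greeting,    msg) {
    if (greeting == "")      # an omitted argument is the null string
        greeting = "Hello"
    msg = greeting ", " name "!"
    return msg
}

BEGIN {
    print greet("world")             # one argument given
    print greet("world", "Howdy")    # two arguments given
}')
echo "$locals_out"
```

The first call prints `Hello, world!' and the second prints `Howdy, world!'.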
During execution of the function body, the argument and local variable values hide, or shadow, any variables of the same names used in the rest of the program. The shadowed variables are not accessible in the function definition, because there is no way to name them while their names have been taken away for the local variables. All other variables used in the awk program can be referenced or set normally in the function definition.
The arguments and local variables last only as long as the function body is executing. Once the body finishes, the shadowed variables come back.
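The following sketch illustrates shadowing. The function bump and the variables n and total are hypothetical names chosen for this example. Inside bump, the parameter n hides the global n, while total is not in the parameter list and so refers to the global variable:

```shell
shadow_out=$(awk '
# bump is a hypothetical function; its parameter n shadows the global n.
function bump(n) {
    n = n + 1            # changes only the parameter
    total = total + n    # total is not shadowed, so this sets the global
    return n
}

BEGIN {
    n = 10
    total = 0
    r = bump(5)
    print r, n, total    # the global n is back, unchanged
}')
echo "$shadow_out"
```

This prints `6 10 6': the function returned 6 and added 6 to total, and the global n still holds 10 after the call.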
The function body can contain expressions which call functions. They can even call this function, either directly or by way of another function. When this happens, we say the function is recursive.
There is no need in awk to put the definition of a function before all uses of the function. This is because awk reads the entire program before starting to execute any of it.
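Here is a small sketch of a call that appears before the definition it uses. The function name double is a hypothetical helper invented for this illustration:

```shell
order_out=$(awk '
BEGIN { print double(21) }    # the call appears before the definition

# double is a hypothetical helper, defined after its first use.
function double(x) { return 2 * x }
')
echo "$order_out"
```

This prints `42', even though the BEGIN rule comes first in the program text.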
In many awk implementations, the keyword function may be abbreviated func. However, POSIX only specifies the use of the keyword function. This actually has some practical implications.
If gawk is in POSIX-compatibility mode (see section Invoking awk), then the following statement will not define a function:
func foo() { a = sqrt($1) ; print a }
Instead it defines a rule that, for each record, concatenates the value of the variable `func' with the return value of the function `foo', and based on the truth value of the result, executes the corresponding action. This is probably not what was desired. (awk accepts this input as syntactically valid, since functions may be used before they are defined in awk programs.)
Here is an example of a user-defined function, called myprint, that takes a number and prints it in a specific format.
function myprint(num) { printf "%6.3g\n", num }
To illustrate, here is an awk rule which uses our myprint function:
$3 > 0 { myprint($3) }
This program prints, in our special format, all the third fields that contain a positive number in our input. Therefore, when given:
 1.2   3.4    5.6   7.8
 9.10 11.12 -13.14 15.16
17.18 19.20  21.22 23.24
this program, using our function to format the results, prints:
   5.6
  21.2
Here is a rather contrived example of a recursive function. It prints a string backwards:
function rev (str, len) {
    if (len == 0) {
        printf "\n"
        return
    }
    printf "%c", substr(str, len, 1)
    rev(str, len - 1)
}
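To try the function out, it has to be called with the string and the string's length. The BEGIN rule and the test string "awk" in this sketch are our own additions for illustration:

```shell
rev_out=$(awk '
function rev(str, len) {
    if (len == 0) {
        printf "\n"
        return
    }
    printf "%c", substr(str, len, 1)    # print the last remaining character
    rev(str, len - 1)                   # recurse on the rest of the string
}

# Hypothetical driver: reverse a fixed test string.
BEGIN { rev("awk", length("awk")) }')
echo "$rev_out"
```

This prints `kwa'.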
Calling a function means causing the function to run and do its job. A function call is an expression, and its value is the value returned by the function.
A function call consists of the function name followed by the arguments in parentheses. What you write in the call for the arguments are awk expressions; each time the call is executed, these expressions are evaluated, and the values are the actual arguments. For example, here is a call to foo with three arguments (the first being a string concatenation):
foo(x y, "lose", 4 * z)
Caution: whitespace characters (spaces and tabs) are not allowed between the function name and the open-parenthesis of the argument list. If you write whitespace by mistake, awk might think that you mean to concatenate a variable with an expression in parentheses. However, it notices that you used a function name and not a variable name, and reports an error.
When a function is called, it is given a copy of the values of its arguments. This is called call by value. The caller may use a variable as the expression for the argument, but the called function does not know this: it only knows what value the argument had. For example, if you write this code:
foo = "bar"
z = myfunc(foo)
then you should not think of the argument to myfunc as being "the variable foo." Instead, think of the argument as the string value, "bar".
If the function myfunc alters the values of its local variables, this has no effect on any other variables. In particular, if myfunc does this:
function myfunc (win) {
    print win
    win = "zzz"
    print win
}
to change its first argument variable win, this does not change the value of foo in the caller. The role of foo in calling myfunc ended when its value, "bar", was computed.
If win also exists outside of myfunc, the function body cannot alter this outer value, because it is shadowed during the execution of myfunc and cannot be seen or changed from there.
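The pieces above can be put together into a complete program. The BEGIN rule here is our own addition; it calls myfunc as a statement and then prints foo to show that the caller's variable is untouched:

```shell
byvalue_out=$(awk '
function myfunc(win) {
    print win
    win = "zzz"    # changes only the copy held in win
    print win
}

BEGIN {
    foo = "bar"
    myfunc(foo)
    print foo      # still "bar"
}')
echo "$byvalue_out"
```

This prints `bar', then `zzz', then `bar' again.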
However, when arrays are the parameters to functions, they are not copied. Instead, the array itself is made available for direct manipulation by the function. This is usually called call by reference. Changes made to an array parameter inside the body of a function are visible outside that function. This can be very dangerous if you do not watch what you are doing. For example:
function changeit (array, ind, nvalue) {
    array[ind] = nvalue
}

BEGIN {
    a[1] = 1 ; a[2] = 2 ; a[3] = 3
    changeit(a, 2, "two")
    printf "a[1] = %s, a[2] = %s, a[3] = %s\n", a[1], a[2], a[3]
}
This program prints `a[1] = 1, a[2] = two, a[3] = 3', because calling changeit stores "two" in the second element of a.
The return Statement
The body of a user-defined function can contain a return statement. This statement returns control to the rest of the awk program. It can also be used to return a value for use in the rest of the awk program. It looks like this:
return expression
The expression part is optional. If it is omitted, then the returned value is undefined and, therefore, unpredictable.
A return statement with no value expression is assumed at the end of every function definition. So if control reaches the end of the function body, then the function returns an unpredictable value. awk will not warn you if you use the return value of such a function; you will simply get unpredictable or unexpected results.
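To avoid relying on an unpredictable value, make sure every path through a function whose result you use ends in a return with an expression. In this sketch, the function larger is a hypothetical example written for this purpose:

```shell
return_out=$(awk '
# larger is a hypothetical function that returns the larger of a and b.
function larger(a, b) {
    if (a > b)
        return a
    return b       # every path returns an explicit value
}

BEGIN { print larger(3, 7), larger(10, 2) }')
echo "$return_out"
```

This prints `7 10'.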
Here is an example of a user-defined function that returns a value for the largest number among the elements of an array:
function maxelt (vec,   i, ret) {
    for (i in vec) {
        if (ret == "" || vec[i] > ret)
            ret = vec[i]
    }
    return ret
}
You call maxelt with one argument, which is an array name. The local variables i and ret are not intended to be arguments; while there is nothing to stop you from passing two or three arguments to maxelt, the results would be strange. The extra space before i in the function parameter list is to indicate that i and ret are not supposed to be arguments. This is a convention which you should follow when you define functions.
Here is a program that uses our maxelt function. It loads an array, calls maxelt, and then reports the maximum number in that array:
awk '
function maxelt (vec,   i, ret) {
    for (i in vec) {
        if (ret == "" || vec[i] > ret)
            ret = vec[i]
    }
    return ret
}

# Load all fields of each record into nums.
{
    for (i = 1; i <= NF; i++)
        nums[NR, i] = $i
}

END {
    print maxelt(nums)
}'
Given the following input:
 1 5 23 8 16
44 3 5 2 8 26
256 291 1396 2962 100
-6 467 998 1101
99385 11 0 225
our program tells us (predictably) that:
99385
is the largest number in our array.