[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6. Patterns, Actions, and Variables

As you have already seen, each awk statement consists of a pattern with an associated action. This chapter describes how you build patterns and actions, what kinds of things you can do within actions, and awk's built-in variables.

The pattern-action rules and the statements available for use within actions form the core of awk programming. In a sense, everything covered up to here has been the foundation that programs are built on top of. Now it's time to start building something useful.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.1 Pattern Elements

Patterns in awk control the execution of rules--a rule is executed when its pattern matches the current input record. The following is a summary of the types of awk patterns:

/regular expression/

A regular expression. It matches when the text of the input record fits the regular expression. (See section Regular Expressions.)

expression

A single expression. It matches when its value is nonzero (if a number) or non-null (if a string). (See section Expressions as Patterns.)

pat1, pat2

A pair of patterns separated by a comma, specifying a range of records. The range includes both the initial record that matches pat1 and the final record that matches pat2. (See section Specifying Record Ranges with Patterns.)

BEGIN
END

Special patterns for you to supply startup or cleanup actions for your awk program. (See section The BEGIN and END Special Patterns.)

empty

The empty pattern matches every input record. (See section The Empty Pattern.)


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.1.1 Regular Expressions as Patterns

Regular expressions are one of the first kinds of patterns presented in this book. This kind of pattern is simply a regexp constant in the pattern part of a rule. Its meaning is `$0 ~ /pattern/'. The pattern matches when the input record matches the regexp. For example:

 
/foo|bar|baz/  { buzzwords++ }
END            { print buzzwords, "buzzwords seen" }

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.1.2 Expressions as Patterns

Any awk expression is valid as an awk pattern. The pattern matches if the expression's value is nonzero (if a number) or non-null (if a string). The expression is reevaluated each time the rule is tested against a new input record. If the expression uses fields such as $1, the value depends directly on the new input record's text; otherwise, it depends on only what has happened so far in the execution of the awk program.

Comparison expressions, using the comparison operators described in Variable Typing and Comparison Expressions, are a very common kind of pattern. Regexp matching and nonmatching are also very common expressions. The left operand of the `~' and `!~' operators is a string. The right operand is either a constant regular expression enclosed in slashes (/regexp/), or any expression whose string value is used as a dynamic regular expression (see section Using Dynamic Regexps). The following example prints the second field of each input record whose first field is precisely `foo':

 
$ awk '$1 == "foo" { print $2 }' BBS-list

(There is no output, because there is no BBS site with the exact name `foo'.) Contrast this with the following regular expression match, which accepts any record with a first field that contains `foo':

 
$ awk '$1 ~ /foo/ { print $2 }' BBS-list
-| 555-1234
-| 555-6699
-| 555-6480
-| 555-2127

A regexp constant as a pattern is also a special case of an expression pattern. The expression /foo/ has the value one if `foo' appears in the current input record. Thus, as a pattern, /foo/ matches any record containing `foo'.

Boolean expressions are also commonly used as patterns. Whether the pattern matches an input record depends on whether its subexpressions match. For example, the following command prints all the records in `BBS-list' that contain both `2400' and `foo':

 
$ awk '/2400/ && /foo/' BBS-list
-| fooey        555-1234     2400/1200/300     B

The following command prints all records in `BBS-list' that contain either `2400' or `foo' (or both, of course):

 
$ awk '/2400/ || /foo/' BBS-list
-| alpo-net     555-3412     2400/1200/300     A
-| bites        555-1675     2400/1200/300     A
-| fooey        555-1234     2400/1200/300     B
-| foot         555-6699     1200/300          B
-| macfoo       555-6480     1200/300          A
-| sdace        555-3430     2400/1200/300     A
-| sabafoo      555-2127     1200/300          C

The following command prints all records in `BBS-list' that do not contain the string `foo':

 
$ awk '! /foo/' BBS-list
-| aardvark     555-5553     1200/300          B
-| alpo-net     555-3412     2400/1200/300     A
-| barfly       555-7685     1200/300          A
-| bites        555-1675     2400/1200/300     A
-| camelot      555-0542     300               C
-| core         555-2912     1200/300          C
-| sdace        555-3430     2400/1200/300     A

The subexpressions of a Boolean operator in a pattern can be constant regular expressions, comparisons, or any other awk expressions. Range patterns are not expressions, so they cannot appear inside Boolean patterns. Likewise, the special patterns BEGIN and END, which never match any input record, are not expressions and cannot appear inside Boolean patterns.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.1.3 Specifying Record Ranges with Patterns

A range pattern is made of two patterns separated by a comma, in the form `begpat, endpat'. It is used to match ranges of consecutive input records. The first pattern, begpat, controls where the range begins, while endpat controls where the pattern ends. For example, the following:

 
awk '$1 == "on", $1 == "off"' myfile

prints every record in `myfile' between `on'/`off' pairs, inclusive.

A range pattern starts out by matching begpat against every input record. When a record matches begpat, the range pattern is turned on and the range pattern matches this record as well. As long as the range pattern stays turned on, it automatically matches every input record read. The range pattern also matches endpat against every input record; when this succeeds, the range pattern is turned off again for the following record. Then the range pattern goes back to checking begpat against each record.

The record that turns on the range pattern and the one that turns it off both match the range pattern. If you don't want to operate on these records, you can write if statements in the rule's action to distinguish them from the records you are interested in.

It is possible for a pattern to be turned on and off by the same record. If the record satisfies both conditions, then the action is executed for just that record. For example, suppose there is text between two identical markers (e.g., the `%' symbol), each on its own line, that should be ignored. A first attempt would be to combine a range pattern that describes the delimited text with the next statement (not discussed yet, see section The next Statement). This causes awk to skip any further processing of the current record and start over again with the next input record. Such a program looks like this:

 
/^%$/,/^%$/    { next }
               { print }

This program fails because the range pattern is both turned on and turned off by the first line, which just has a `%' on it. To accomplish this task, write the program in the following manner, using a flag:

 
/^%$/     { skip = ! skip; next }
skip == 1 { next } # skip lines with `skip' set

In a range pattern, the comma (`,') has the lowest precedence of all the operators (i.e., it is evaluated last). Thus, the following program attempts to combine a range pattern with another, simpler test:

 
echo Yes | awk '/1/,/2/ || /Yes/'

The intent of this program is `(/1/,/2/) || /Yes/'. However, awk interprets this as `/1/, (/2/ || /Yes/)'. This cannot be changed or worked around; range patterns do not combine with other patterns:

 
$ echo Yes | gawk '(/1/,/2/) || /Yes/'
error--> gawk: cmd. line:1: (/1/,/2/) || /Yes/
error--> gawk: cmd. line:1:           ^ parse error
error--> gawk: cmd. line:2: (/1/,/2/) || /Yes/
error--> gawk: cmd. line:2:                   ^ unexpected newline

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.1.4 The BEGIN and END Special Patterns

All the patterns described so far are for matching input records. The BEGIN and END special patterns are different. They supply startup and cleanup actions for awk programs. BEGIN and END rules must have actions; there is no default action for these rules because there is no current record when they run. BEGIN and END rules are often referred to as "BEGIN and END blocks" by long-time awk programmers.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.1.4.1 Startup and Cleanup Actions

A BEGIN rule is executed once only, before the first input record is read. Likewise, an END rule is executed once only, after all the input is read. For example:

 
$ awk '
> BEGIN { print "Analysis of \"foo\"" }
> /foo/ { ++n }
> END   { print "\"foo\" appears", n, "times." }' BBS-list
-| Analysis of "foo"
-| "foo" appears 4 times.

This program finds the number of records in the input file `BBS-list' that contain the string `foo'. The BEGIN rule prints a title for the report. There is no need to use the BEGIN rule to initialize the counter n to zero, since awk does this automatically (see section Variables). The second rule increments the variable n every time a record containing the pattern `foo' is read. The END rule prints the value of n at the end of the run.

The special patterns BEGIN and END cannot be used in ranges or with Boolean operators (indeed, they cannot be used with any operators). An awk program may have multiple BEGIN and/or END rules. They are executed in the order in which they appear: all the BEGIN rules at startup and all the END rules at termination. BEGIN and END rules may be intermixed with other rules. This feature was added in the 1987 version of awk and is included in the POSIX standard. The original (1978) version of awk required the BEGIN rule to be placed at the beginning of the program, the END rule to be placed at the end, and only allowed one of each. This is no longer required, but it is a good idea to follow this template in terms of program organization and readability.

Multiple BEGIN and END rules are useful for writing library functions, because each library file can have its own BEGIN and/or END rule to do its own initialization and/or cleanup. The order in which library functions are named on the command line controls the order in which their BEGIN and END rules are executed. Therefore, you have to be careful when writing such rules in library files so that the order in which they are executed doesn't matter. See section Command-Line Options, for more information on using library functions. See section A Library of awk Functions, for a number of useful library functions.

If an awk program has only a BEGIN rule and no other rules, then the program exits after the BEGIN rule is run.(28) However, if an END rule exists, then the input is read, even if there are no other rules in the program. This is necessary in case the END rule checks the FNR and NR variables.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.1.4.2 Input/Output from BEGIN and END Rules

There are several (sometimes subtle) points to remember when doing I/O from a BEGIN or END rule. The first has to do with the value of $0 in a BEGIN rule. Because BEGIN rules are executed before any input is read, there simply is no input record, and therefore no fields, when executing BEGIN rules. References to $0 and the fields yield a null string or zero, depending upon the context. One way to give $0 a real value is to execute a getline command without a variable (see section Explicit Input with getline). Another way is simply to assign a value to $0.

The second point is similar to the first but from the other direction. Traditionally, due largely to implementation issues, $0 and NF were undefined inside an END rule. The POSIX standard specifies that NF is available in an END rule. It contains the number of fields from the last input record. Most probably due to an oversight, the standard does not say that $0 is also preserved, although logically one would think that it should be. In fact, gawk does preserve the value of $0 for use in END rules. Be aware, however, that Unix awk, and possibly other implementations, do not.

The third point follows from the first two. The meaning of `print' inside a BEGIN or END rule is the same as always: `print $0'. If $0 is the null string, then this prints an empty line. Many long time awk programmers use an unadorned `print' in BEGIN and END rules, to mean `print ""', relying on $0 being null. Although one might generally get away with this in BEGIN rules, it is a very bad idea in END rules, at least in gawk. It is also poor style, since if an empty line is needed in the output, the program should print one explicitly.

Finally, the next and nextfile statements are not allowed in a BEGIN rule, because the implicit read-a-record-and-match-against-the-rules loop has not started yet. Similarly, those statements are not valid in an END rule, since all the input has been read. (See section The next Statement, and see Using gawk's nextfile Statement.)


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.1.5 The Empty Pattern

An empty (i.e., nonexistent) pattern is considered to match every input record. For example, the program:

 
awk '{ print $1 }' BBS-list

prints the first field of every record.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.2 Using Shell Variables in Programs

awk programs are often used as components in larger programs written in shell. For example, it is very common to use a shell variable to hold a pattern that the awk program searches for. There are two ways to get the value of the shell variable into the body of the awk program.

The most common method is to use shell quoting to substitute the variable's value into the program inside the script. For example, in the following program:

 
echo -n "Enter search pattern: "
read pattern
awk "/$pattern/ "'{ nmatches++ }
     END { print nmatches, "found" }' /path/to/data

the awk program consists of two pieces of quoted text that are concatenated together to form the program. The first part is double-quoted, which allows substitution of the pattern variable inside the quotes. The second part is single-quoted.

Variable substitution via quoting works, but can be potentially messy. It requires a good understanding of the shell's quoting rules (see section Shell-Quoting Issues), and it's often difficult to correctly match up the quotes when reading the program.

A better method is to use awk's variable assignment feature (see section Assigning Variables on the Command Line) to assign the shell variable's value to an awk variable's value. Then use dynamic regexps to match the pattern (see section Using Dynamic Regexps). The following shows how to redo the previous example using this technique:

 
echo -n "Enter search pattern: "
read pattern
awk -v pat="$pattern" '$0 ~ pat { nmatches++ }
       END { print nmatches, "found" }' /path/to/data

Now, the awk program is just one single-quoted string. The assignment `-v pat="$pattern"' still requires double quotes, in case there is whitespace in the value of $pattern. The awk variable pat could be named pattern too, but that would be more confusing. Using a variable also provides more flexibility, since the variable can be used anywhere inside the program--for printing, as an array subscript, or for any other use--without requiring the quoting tricks at every point in the program.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.3 Actions

An awk program or script consists of a series of rules and function definitions interspersed. (Functions are described later. See section User-Defined Functions.) A rule contains a pattern and an action, either of which (but not both) may be omitted. The purpose of the action is to tell awk what to do once a match for the pattern is found. Thus, in outline, an awk program generally looks like this:

 
[pattern] [{ action }]
[pattern] [{ action }]
…
function name(args) { … }
…

An action consists of one or more awk statements, enclosed in curly braces (`{…}'). Each statement specifies one thing to do. The statements are separated by newlines or semicolons. The curly braces around an action must be used even if the action contains only one statement, or if it contains no statements at all. However, if you omit the action entirely, omit the curly braces as well. An omitted action is equivalent to `{ print $0 }':

 
/foo/  { }     match foo, do nothing --- empty action
/foo/          match foo, print the record --- omitted action

The following types of statements are supported in awk:

Expressions

Call functions or assign values to variables (see section Expressions). Executing this kind of statement simply computes the value of the expression. This is useful when the expression has side effects (see section Assignment Expressions).

Control statements

Specify the control flow of awk programs. The awk language gives you C-like constructs (if, for, while, and do) as well as a few special ones (see section Control Statements in Actions).

Compound statements

Consist of one or more statements enclosed in curly braces. A compound statement is used in order to put several statements together in the body of an if, while, do, or for statement.

Input statements

Use the getline command (see section Explicit Input with getline). Also supplied in awk are the next statement (see section The next Statement), and the nextfile statement (see section Using gawk's nextfile Statement).

Output statements

Such as print and printf. See section Printing Output.

Deletion statements

For deleting array elements. See section The delete Statement.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.4 Control Statements in Actions

Control statements, such as if, while, and so on, control the flow of execution in awk programs. Most of the control statements in awk are patterned on similar statements in C.

All the control statements start with special keywords, such as if and while, to distinguish them from simple expressions. Many control statements contain other statements. For example, the if statement contains another statement that may or may not be executed. The contained statement is called the body. To include more than one statement in the body, group them into a single compound statement with curly braces, separating them with newlines or semicolons.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.4.1 The if-else Statement

The if-else statement is awk's decision-making statement. It looks like this:

 
if (condition) then-body [else else-body]

The condition is an expression that controls what the rest of the statement does. If the condition is true, then-body is executed; otherwise, else-body is executed. The else part of the statement is optional. The condition is considered false if its value is zero or the null string; otherwise, the condition is true. Refer to the following:

 
if (x % 2 == 0)
    print "x is even"
else
    print "x is odd"

In this example, if the expression `x % 2 == 0' is true (that is, if the value of x is evenly divisible by two), then the first print statement is executed; otherwise, the second print statement is executed. If the else keyword appears on the same line as then-body and then-body is not a compound statement (i.e., not surrounded by curly braces), then a semicolon must separate then-body from the else. To illustrate this, the previous example can be rewritten as:

 
if (x % 2 == 0) print "x is even"; else
        print "x is odd"

If the `;' is left out, awk can't interpret the statement and it produces a syntax error. Don't actually write programs this way, because a human reader might fail to see the else if it is not the first thing on its line.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.4.2 The while Statement

In programming, a loop is a part of a program that can be executed two or more times in succession. The while statement is the simplest looping statement in awk. It repeatedly executes a statement as long as a condition is true. For example:

 
while (condition)
  body

body is a statement called the body of the loop, and condition is an expression that controls how long the loop keeps running. The first thing the while statement does is test the condition. If the condition is true, it executes the statement body. After body has been executed, condition is tested again, and if it is still true, body is executed again. This process repeats until the condition is no longer true. If the condition is initially false, the body of the loop is never executed and awk continues with the statement following the loop. This example prints the first three fields of each record, one per line:

 
awk '{ i = 1
       while (i <= 3) {
           print $i
           i++
       }
}' inventory-shipped

The body of this loop is a compound statement enclosed in braces, containing two statements. The loop works in the following manner: first, the value of i is set to one. Then, the while statement tests whether i is less than or equal to three. This is true when i equals one, so the i-th field is printed. Then the `i++' increments the value of i and the loop repeats. The loop terminates when i reaches four.

A newline is not required between the condition and the body; however using one makes the program clearer unless the body is a compound statement or else is very simple. The newline after the open-brace that begins the compound statement is not required either, but the program is harder to read without it.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.4.3 The do-while Statement

The do loop is a variation of the while looping statement. The do loop executes the body once and then repeats the body as long as the condition is true. It looks like this:

 
do
  body
while (condition)

Even if the condition is false at the start, the body is executed at least once (and only once, unless executing body makes condition true). Contrast this with the corresponding while statement:

 
while (condition)
  body

This statement does not execute body even once if the condition is false to begin with. The following is an example of a do statement:

 
{      i = 1
       do {
          print $0
          i++
       } while (i <= 10)
}

This program prints each input record 10 times. However, it isn't a very realistic example, since in this case an ordinary while would do just as well. This situation reflects actual experience; only occasionally is there a real use for a do statement.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.4.4 The for Statement

The for statement makes it more convenient to count iterations of a loop. The general form of the for statement looks like this:

 
for (initialization; condition; increment)
  body

The initialization, condition, and increment parts are arbitrary awk expressions, and body stands for any awk statement.

The for statement starts by executing initialization. Then, as long as the condition is true, it repeatedly executes body and then increment. Typically, initialization sets a variable to either zero or one, increment adds one to it, and condition compares it against the desired number of iterations. For example:

 
awk '{ for (i = 1; i <= 3; i++)
          print $i
}' inventory-shipped

This prints the first three fields of each input record, with one field per line.

It isn't possible to set more than one variable in the initialization part without using a multiple assignment statement such as `x = y = 0'. This makes sense only if all the initial values are equal. (But it is possible to initialize additional variables by writing their assignments as separate statements preceding the for loop.)

The same is true of the increment part. Incrementing additional variables requires separate statements at the end of the loop. The C compound expression, using C's comma operator, is useful in this context but it is not supported in awk.

Most often, increment is an increment expression, as in the previous example. But this is not required; it can be any expression whatsoever. For example, the following statement prints all the powers of two between 1 and 100:

 
for (i = 1; i <= 100; i *= 2)
  print i

If there is nothing to be done, any of the three expressions in the parentheses following the for keyword may be omitted. Thus, `for (; x > 0;)' is equivalent to `while (x > 0)'. If the condition is omitted, it is treated as true, effectively yielding an infinite loop (i.e., a loop that never terminates).

In most cases, a for loop is an abbreviation for a while loop, as shown here:

 
initialization
while (condition) {
  body
  increment
}

The only exception is when the continue statement (see section The continue Statement) is used inside the loop. Changing a for statement to a while statement in this way can change the effect of the continue statement inside the loop.

The awk language has a for statement in addition to a while statement because a for loop is often both less work to type and more natural to think of. Counting the number of iterations is very common in loops. It can be easier to think of this counting as part of looping rather than as something to do inside the loop.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.4.5 The switch Statement

NOTE: This subsection describes an experimental feature added in gawk 3.1.3. It is not enabled by default. To enable it, use the `--enable-switch' option to configure when gawk is being configured and built. See section Additional Configuration Options, for more information.

The switch statement allows the evaluation of an expression and the execution of statements based on a case match. Case statements are checked for a match in the order they are defined. If no suitable case is found, the default section is executed, if supplied.

Each case contains a single constant, be it numeric, string, or regexp. The switch expression is evaluated, and then each case's constant is compared against the result in turn. The type of constant determines the comparison: numeric or string do the usual comparisons. A regexp constant does a regular expression match against the string value of the original expression. The general form of the switch statement looks like this:

 
switch (expression) {
case value or regular expression:
    case-body
default:
    default-body
}

Control flow in the switch statement works as it does in C. Once a match to a given case is made, case statement bodies are executed until a break, continue, next, nextfile or exit is encountered, or the end of the switch statement itself. For example:

 
switch (NR * 2 + 1) {
case 3:
case "11":
    print NR - 1
    break

case /2[[:digit:]]+/:
    print NR

default:
    print NR + 1

case -1:
    print NR * -1
}

Note that if none of the statements specified above halt execution of a matched case statement, execution falls through to the next case until execution halts. In the above example, for any case value starting with `2' followed by one or more digits, the print statement is executed and then falls through into the default section, executing its print statement. In turn, the -1 case will also be executed since the default does not halt execution.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.4.6 The break Statement

The break statement jumps out of the innermost for, while, or do loop that encloses it. The following example finds the smallest divisor of any integer, and also identifies prime numbers:

 
# find smallest divisor of num
{
   num = $1
   for (div = 2; div*div <= num; div++)
     if (num % div == 0)
       break
   if (num % div == 0)
     printf "Smallest divisor of %d is %d\n", num, div
   else
     printf "%d is prime\n", num
}

When the remainder is zero in the first if statement, awk immediately breaks out of the containing for loop. This means that awk proceeds immediately to the statement following the loop and continues processing. (This is very different from the exit statement, which stops the entire awk program. See section The exit Statement.)

Th following program illustrates how the condition of a for or while statement could be replaced with a break inside an if:

 
# find smallest divisor of num
{
  num = $1
  for (div = 2; ; div++) {
    if (num % div == 0) {
      printf "Smallest divisor of %d is %d\n", num, div
      break
    }
    if (div*div > num) {
      printf "%d is prime\n", num
      break
    }
  }
}

The break statement has no meaning when used outside the body of a loop. However, although it was never documented, historical implementations of awk treated the break statement outside of a loop as if it were a next statement (see section The next Statement). Recent versions of Unix awk no longer allow this usage. gawk supports this use of break only if `--traditional' has been specified on the command line (see section Command-Line Options). Otherwise, it is treated as an error, since the POSIX standard specifies that break should only be used inside the body of a loop. (d.c.)


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.4.7 The continue Statement

As with break, the continue statement is used only inside for, while, and do loops. It skips over the rest of the loop body, causing the next cycle around the loop to begin immediately. Contrast this with break, which jumps out of the loop altogether.

The continue statement in a for loop directs awk to skip the rest of the body of the loop and resume execution with the increment-expression of the for statement. The following program illustrates this fact:

 
BEGIN {
     for (x = 0; x <= 20; x++) {
         if (x == 5)
             continue
         printf "%d ", x
     }
     print ""
}

This program prints all the numbers from 0 to 20--except for 5, for which the printf is skipped. Because the increment `x++' is not skipped, x does not remain stuck at 5. Contrast the for loop from the previous example with the following while loop:

 
BEGIN {
     x = 0
     while (x <= 20) {
         if (x == 5)
             continue
         printf "%d ", x
         x++
     }
     print ""
}

This program loops forever once x reaches 5.

The continue statement has no meaning when used outside the body of a loop. Historical versions of awk treated a continue statement outside a loop the same way they treated a break statement outside a loop: as if it were a next statement (see section The next Statement). Recent versions of Unix awk no longer work this way, and gawk allows it only if `--traditional' is specified on the command line (see section Command-Line Options). Just like the break statement, the POSIX standard specifies that continue should only be used inside the body of a loop. (d.c.)


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.4.8 The next Statement

The next statement forces awk to immediately stop processing the current record and go on to the next record. This means that no further rules are executed for the current record, and the rest of the current rule's action isn't executed.

Contrast this with the effect of the getline function (see section Explicit Input with getline). That also causes awk to read the next record immediately, but it does not alter the flow of control in any way (i.e., the rest of the current action executes with a new input record).

At the highest level, awk program execution is a loop that reads an input record and then tests each rule's pattern against it. If you think of this loop as a for statement whose body contains the rules, then the next statement is analogous to a continue statement. It skips to the end of the body of this implicit loop and executes the increment (which reads another record).

For example, suppose an awk program works only on records with four fields, and it shouldn't fail when given bad input. To avoid complicating the rest of the program, write a "weed out" rule near the beginning, in the following manner:

 
NF != 4 {
  err = sprintf("%s:%d: skipped: NF != 4\n", FILENAME, FNR)
  print err > "/dev/stderr"
  next
}

Because of the next statement, the program's subsequent rules won't see the bad record. The error message is redirected to the standard error output stream, as error messages should be. For more detail see Special File Names in gawk.

According to the POSIX standard, the behavior is undefined if the next statement is used in a BEGIN or END rule. gawk treats it as a syntax error. Although POSIX permits it, some other awk implementations don't allow the next statement inside function bodies (see section User-Defined Functions). Just as with any other next statement, a next statement inside a function body reads the next record and starts processing it with the first rule in the program. If the next statement causes the end of the input to be reached, then the code in any END rules is executed. See section The BEGIN and END Special Patterns.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.4.9 Using gawk's nextfile Statement

gawk provides the nextfile statement, which is similar to the next statement. However, instead of abandoning processing of the current record, the nextfile statement instructs gawk to stop processing the current data file.

The nextfile statement is a gawk extension. In most other awk implementations, or if gawk is in compatibility mode (see section Command-Line Options), nextfile is not special.

Upon execution of the nextfile statement, FILENAME is updated to the name of the next data file listed on the command line, FNR is reset to one, ARGIND is incremented, and processing starts over with the first rule in the program. (ARGIND hasn't been introduced yet. See section Built-in Variables.) If the nextfile statement causes the end of the input to be reached, then the code in any END rules is executed. See section The BEGIN and END Special Patterns.

The nextfile statement is useful when there are many data files to process but it isn't necessary to process every record in every file. Normally, in order to move on to the next data file, a program has to continue scanning the unwanted records. The nextfile statement accomplishes this much more efficiently.

While one might think that `close(FILENAME)' would accomplish the same as nextfile, this isn't true. close is reserved for closing files, pipes, and coprocesses that are opened with redirections. It is not related to the main processing that awk does with the files listed in ARGV.

If it's necessary to use an awk version that doesn't support nextfile, see Implementing nextfile as a Function, for a user-defined function that simulates the nextfile statement.

The current version of the Bell Laboratories awk (see section Other Freely Available awk Implementations) also supports nextfile. However, it doesn't allow the nextfile statement inside function bodies (see section User-Defined Functions). gawk does; a nextfile inside a function body reads the next record and starts processing it with the first rule in the program, just as any other nextfile statement.

Caution: Versions of gawk prior to 3.0 used two words (`next file') for the nextfile statement. In version 3.0, this was changed to one word, because the treatment of `file' was inconsistent. When it appeared after next, `file' was a keyword; otherwise, it was a regular identifier. The old usage is no longer accepted; `next file' generates a syntax error.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.4.10 The exit Statement

The exit statement causes awk to immediately stop executing the current rule and to stop processing input; any remaining input is ignored. The exit statement is written as follows:

 
exit [return code]

When an exit statement is executed from a BEGIN rule, the program stops processing everything immediately. No input records are read. However, if an END rule is present, as part of executing the exit statement, the END rule is executed (see section The BEGIN and END Special Patterns). If exit is used as part of an END rule, it causes the program to stop immediately.

An exit statement that is not part of a BEGIN or END rule stops the execution of any further automatic rules for the current record, skips reading any remaining input records, and executes the END rule if there is one.

In such a case, if you don't want the END rule to do its job, set a variable to nonzero before the exit statement and check that variable in the END rule. See section Assertions, for an example that does this.

If an argument is supplied to exit, its value is used as the exit status code for the awk process. If no argument is supplied, exit returns status zero (success). In the case where an argument is supplied to a first exit statement, and then exit is called a second time from an END rule with no argument, awk uses the previously supplied exit value. (d.c.)

For example, suppose an error condition occurs that is difficult or impossible to handle. Conventionally, programs report this by exiting with a nonzero status. An awk program can do this using an exit statement with a nonzero argument, as shown in the following example:

 
BEGIN {
       if (("date" | getline date_now) <= 0) {
         print "Can't get system date" > "/dev/stderr"
         exit 1
       }
       print "current date is", date_now
       close("date")
}

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.5 Built-in Variables

Most awk variables are available to use for your own purposes; they never change unless your program assigns values to them, and they never affect anything unless your program examines them. However, a few variables in awk have special built-in meanings. awk examines some of these automatically, so that they enable you to tell awk how to do certain things. Others are set automatically by awk, so that they carry information from the internal workings of awk to your program.

This section documents all the built-in variables of gawk, most of which are also documented in the chapters describing their areas of activity.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.5.1 Built-in Variables That Control awk

The following is an alphabetical list of variables that you can change to control how awk does certain things. The variables that are specific to gawk are marked with a pound sign (`#').

BINMODE #

On non-POSIX systems, this variable specifies use of binary mode for all I/O. Numeric values of one, two, or three specify that input files, output files, or all files, respectively, should use binary I/O. Alternatively, string values of "r" or "w" specify that input files and output files, respectively, should use binary I/O. A string value of "rw" or "wr" indicates that all files should use binary I/O. Any other string value is equivalent to "rw", but gawk generates a warning message. BINMODE is described in more detail in Using gawk on PC Operating Systems.

This variable is a gawk extension. In other awk implementations (except mawk, see section Other Freely Available awk Implementations), or if gawk is in compatibility mode (see section Command-Line Options), it is not special.

CONVFMT

This string controls conversion of numbers to strings (see section Conversion of Strings and Numbers). It works by being passed, in effect, as the first argument to the sprintf function (see section String-Manipulation Functions). Its default value is "%.6g". CONVFMT was introduced by the POSIX standard.

FIELDWIDTHS #

This is a space-separated list of columns that tells gawk how to split input with fixed columnar boundaries. Assigning a value to FIELDWIDTHS overrides the use of FS for field splitting. See section Reading Fixed-Width Data, for more information.

If gawk is in compatibility mode (see section Command-Line Options), then FIELDWIDTHS has no special meaning, and field-splitting operations occur based exclusively on the value of FS.

FS

This is the input field separator (see section Specifying How Fields Are Separated). The value is a single-character string or a multi-character regular expression that matches the separations between fields in an input record. If the value is the null string (""), then each character in the record becomes a separate field. (This behavior is a gawk extension. POSIX awk does not specify the behavior when FS is the null string.)

The default value is " ", a string consisting of a single space. As a special exception, this value means that any sequence of spaces, tabs, and/or newlines is a single separator.(29) It also causes spaces, tabs, and newlines at the beginning and end of a record to be ignored.

You can set the value of FS on the command line using the `-F' option:

 
awk -F, 'program' input-files

If gawk is using FIELDWIDTHS for field splitting, assigning a value to FS causes gawk to return to the normal, FS-based field splitting. An easy way to do this is to simply say `FS = FS', perhaps with an explanatory comment.

IGNORECASE #

If IGNORECASE is nonzero or non-null, then all string comparisons and all regular expression matching are case independent. Thus, regexp matching with `~' and `!~', as well as the gensub, gsub, index, match, split, and sub functions, record termination with RS, and field splitting with FS, all ignore case when doing their particular regexp operations. However, the value of IGNORECASE does not affect array subscripting and it does not affect field splitting when using a single-character field separator. See section Case Sensitivity in Matching.

If gawk is in compatibility mode (see section Command-Line Options), then IGNORECASE has no special meaning. Thus, string and regexp operations are always case-sensitive.

LINT #

When this variable is true (nonzero or non-null), gawk behaves as if the `--lint' command-line option is in effect. (see section Command-Line Options). With a value of "fatal", lint warnings become fatal errors. With a value of "invalid", only warnings about things that are actually invalid are issued. (This is not fully implemented yet.) Any other true value prints nonfatal warnings. Assigning a false value to LINT turns off the lint warnings.

This variable is a gawk extension. It is not special in other awk implementations. Unlike the other special variables, changing LINT does affect the production of lint warnings, even if gawk is in compatibility mode. Much as the `--lint' and `--traditional' options independently control different aspects of gawk's behavior, the control of lint warnings during program execution is independent of the flavor of awk being executed.

OFMT

This string controls conversion of numbers to strings (see section Conversion of Strings and Numbers) for printing with the print statement. It works by being passed as the first argument to the sprintf function (see section String-Manipulation Functions). Its default value is "%.6g". Earlier versions of awk also used OFMT to specify the format for converting numbers to strings in general expressions; this is now done by CONVFMT.

OFS

This is the output field separator (see section Output Separators). It is output between the fields printed by a print statement. Its default value is " ", a string consisting of a single space.

ORS

This is the output record separator. It is output at the end of every print statement. Its default value is "\n", the newline character. (See section Output Separators.)

RS

This is awk's input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text. (See section How Input Is Split into Records.)

The ability for RS to be a regular expression is a gawk extension. In most other awk implementations, or if gawk is in compatibility mode (see section Command-Line Options), just the first character of RS's value is used.

SUBSEP

This is the subscript separator. It has the default value of "\034" and is used to separate the parts of the indices of a multidimensional array. Thus, the expression foo["A", "B"] really accesses foo["A\034B"] (see section Multidimensional Arrays).

TEXTDOMAIN #

This variable is used for internationalization of programs at the awk level. It sets the default text domain for specially marked string constants in the source text, as well as for the dcgettext, dcngettext and bindtextdomain functions (see section Internationalization with gawk). The default value of TEXTDOMAIN is "messages".

This variable is a gawk extension. In other awk implementations, or if gawk is in compatibility mode (see section Command-Line Options), it is not special.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.5.2 Built-in Variables That Convey Information

The following is an alphabetical list of variables that awk sets automatically on certain occasions in order to provide information to your program. The variables that are specific to gawk are marked with a pound sign (`#').

ARGC, ARGV

The command-line arguments available to awk programs are stored in an array called ARGV. ARGC is the number of command-line arguments present. See section Other Command-Line Arguments. Unlike most awk arrays, ARGV is indexed from 0 to ARGC - 1. In the following example:

 
$ awk 'BEGIN {
>         for (i = 0; i < ARGC; i++)
>             print ARGV[i]
>      }' inventory-shipped BBS-list
-| awk
-| inventory-shipped
-| BBS-list

ARGV[0] contains "awk", ARGV[1] contains "inventory-shipped", and ARGV[2] contains "BBS-list". The value of ARGC is three, one more than the index of the last element in ARGV, because the elements are numbered from zero.

The names ARGC and ARGV, as well as the convention of indexing the array from 0 to ARGC - 1, are derived from the C language's method of accessing command-line arguments.

The value of ARGV[0] can vary from system to system. Also, you should note that the program text is not included in ARGV, nor are any of awk's command-line options. See section Using ARGC and ARGV, for information about how awk uses these variables.

ARGIND #

The index in ARGV of the current file being processed. Every time gawk opens a new data file for processing, it sets ARGIND to the index in ARGV of the file name. When gawk is processing the input files, `FILENAME == ARGV[ARGIND]' is always true.

This variable is useful in file processing; it allows you to tell how far along you are in the list of data files as well as to distinguish between successive instances of the same file name on the command line.

While you can change the value of ARGIND within your awk program, gawk automatically sets it to a new value when the next file is opened.

This variable is a gawk extension. In other awk implementations, or if gawk is in compatibility mode (see section Command-Line Options), it is not special.

ENVIRON

An associative array that contains the values of the environment. The array indices are the environment variable names; the elements are the values of the particular environment variables. For example, ENVIRON["HOME"] might be `/home/arnold'. Changing this array does not affect the environment passed on to any programs that awk may spawn via redirection or the system function.

Some operating systems may not have environment variables. On such systems, the ENVIRON array is empty (except for ENVIRON["AWKPATH"], see section The AWKPATH Environment Variable).

ERRNO #

If a system error occurs during a redirection for getline, during a read for getline, or during a close operation, then ERRNO contains a string describing the error.

ERRNO works similarly to the C variable errno. In particular gawk never clears it (sets it to zero or ""). Thus, you should only expect its value to be meaningful when an I/O operation returns a failure value, such as getline returning -1. You are, of course, free to clear it yourself before doing an I/O operation.

This variable is a gawk extension. In other awk implementations, or if gawk is in compatibility mode (see section Command-Line Options), it is not special.

FILENAME

The name of the file that awk is currently reading. When no data files are listed on the command line, awk reads from the standard input and FILENAME is set to "-". FILENAME is changed each time a new file is read (see section Reading Input Files). Inside a BEGIN rule, the value of FILENAME is "", since there are no input files being processed yet.(30) (d.c.) Note, though, that using getline (see section Explicit Input with getline) inside a BEGIN rule can give FILENAME a value.

FNR

The current record number in the current file. FNR is incremented each time a new record is read (see section Explicit Input with getline). It is reinitialized to zero each time a new input file is started.

NF

The number of fields in the current input record. NF is set each time a new record is read, when a new field is created or when $0 changes (see section Examining Fields).

Unlike most of the variables described in this section, assigning a value to NF has the potential to affect awk's internal workings. In particular, assignments to NF can be used to create or remove fields from the current record: See section Changing the Contents of a Field.

NR

The number of input records awk has processed since the beginning of the program's execution (see section How Input Is Split into Records). NR is incremented each time a new record is read.

PROCINFO #

The elements of this array provide access to information about the running awk program. The following elements (listed alphabetically) are guaranteed to be available:

PROCINFO["egid"]

The value of the getegid system call.

PROCINFO["euid"]

The value of the geteuid system call.

PROCINFO["FS"]

This is "FS" if field splitting with FS is in effect, or it is "FIELDWIDTHS" if field splitting with FIELDWIDTHS is in effect.

PROCINFO["gid"]

The value of the getgid system call.

PROCINFO["pgrpid"]

The process group ID of the current process.

PROCINFO["pid"]

The process ID of the current process.

PROCINFO["ppid"]

The parent process ID of the current process.

PROCINFO["uid"]

The value of the getuid system call.

PROCINFO["version"]

The version of gawk. This is available from version 3.1.4 and later.

On some systems, there may be elements in the array, "group1" through "groupN" for some N. N is the number of supplementary groups that the process has. Use the in operator to test for these elements (see section Referring to an Array Element).

This array is a gawk extension. In other awk implementations, or if gawk is in compatibility mode (see section Command-Line Options), it is not special.

RLENGTH

The length of the substring matched by the match function (see section String-Manipulation Functions). RLENGTH is set by invoking the match function. Its value is the length of the matched string, or -1 if no match is found.

RSTART

The start-index in characters of the substring that is matched by the match function (see section String-Manipulation Functions). RSTART is set by invoking the match function. Its value is the position of the string where the matched substring starts, or zero if no match was found.

RT #

This is set each time a record is read. It contains the input text that matched the text denoted by RS, the record separator.

This variable is a gawk extension. In other awk implementations, or if gawk is in compatibility mode (see section Command-Line Options), it is not special.

Advanced Notes: Changing NR and FNR

awk increments NR and FNR each time it reads a record, instead of setting them to the absolute value of the number of records read. This means that a program can change these variables and their new values are incremented for each record. (d.c.) This is demonstrated in the following example:

 
$ echo '1
> 2
> 3
> 4' | awk 'NR == 2 { NR = 17 }
> { print NR }'
-| 1
-| 17
-| 18
-| 19

Before FNR was added to the awk language (see section Major Changes Between V7 and SVR3.1), many awk programs used this feature to track the number of records in a file by resetting NR to zero when FILENAME changed.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

6.5.3 Using ARGC and ARGV

Built-in Variables That Convey Information, presented the following program describing the information contained in ARGC and ARGV:

 
$ awk 'BEGIN {
>        for (i = 0; i < ARGC; i++)
>            print ARGV[i]
>      }' inventory-shipped BBS-list
-| awk
-| inventory-shipped
-| BBS-list

In this example, ARGV[0] contains `awk', ARGV[1] contains `inventory-shipped', and ARGV[2] contains `BBS-list'. Notice that the awk program is not entered in ARGV. The other special command-line options, with their arguments, are also not entered. This includes variable assignments done with the `-v' option (see section Command-Line Options). Normal variable assignments on the command line are treated as arguments and do show up in the ARGV array:

 
$ cat showargs.awk
-| BEGIN {
-|     printf "A=%d, B=%d\n", A, B
-|     for (i = 0; i < ARGC; i++)
-|         printf "\tARGV[%d] = %s\n", i, ARGV[i]
-| }
-| END   { printf "A=%d, B=%d\n", A, B }
$ awk -v A=1 -f showargs.awk B=2 /dev/null
-| A=1, B=0
-|        ARGV[0] = awk
-|        ARGV[1] = B=2
-|        ARGV[2] = /dev/null
-| A=1, B=2

A program can alter ARGC and the elements of ARGV. Each time awk reaches the end of an input file, it uses the next element of ARGV as the name of the next input file. By storing a different string there, a program can change which files are read. Use "-" to represent the standard input. Storing additional elements and incrementing ARGC causes additional files to be read.

If the value of ARGC is decreased, that eliminates input files from the end of the list. By recording the old value of ARGC elsewhere, a program can treat the eliminated arguments as something other than file names.

To eliminate a file from the middle of the list, store the null string ("") into ARGV in place of the file's name. As a special feature, awk ignores file names that have been replaced with the null string. Another option is to use the delete statement to remove elements from ARGV (see section The delete Statement).

All of these actions are typically done in the BEGIN rule, before actual processing of the input begins. See section Splitting a Large File into Pieces, and see Duplicating Output into Multiple Files, for examples of each way of removing elements from ARGV. The following fragment processes ARGV in order to examine, and then remove, command-line options:

 
BEGIN {
    for (i = 1; i < ARGC; i++) {
        if (ARGV[i] == "-v")
            verbose = 1
        else if (ARGV[i] == "-d")
            debug = 1
        else if (ARGV[i] ~ /^-./) {
            e = sprintf("%s: unrecognized option -- %c",
                    ARGV[0], substr(ARGV[i], 2, 1))
            print e > "/dev/stderr"
        } else
            break
        delete ARGV[i]
    }
}

To actually get the options into the awk program, end the awk options with `--' and then supply the awk program's options, in the following manner:

 
awk -f myprog -- -v -d file1 file2 …

This is not necessary in gawk. Unless `--posix' has been specified, gawk silently puts any unrecognized options into ARGV for the awk program to deal with. As soon as it sees an unknown option, gawk stops looking for other options that it might otherwise recognize. The previous example with gawk would be:

 
gawk -f myprog -d -v file1 file2 …

Because `-d' is not a valid gawk option, it and the following `-v' are passed on to the awk program.


[ << ] [ >> ]           [Top] [Contents] [Index] [ ? ]

This document was generated on December, 30 2007 using texi2html 1.76.