AWK: A Powerful Text Processing Tool and Programming Language

Overview:
AWK is a robust text processing tool and programming language, primarily used for formatting, analyzing, and processing text on Unix and Linux systems. It excels at handling structured text like tables, CSV files, and logs.

Syntax:

awk -f 'scripts' -v var=value filename
awk 'BEGIN{ print "start" } pattern{ commands } END{ print "end" }' filename

Explanation:
AWK reads files or input streams (including stdin) line by line, processing text data based on user-specified patterns and actions. It is particularly useful for structured text. While AWK can be used directly from the command line, it is more often employed through scripts. As a programming language, AWK shares many features with C, such as arrays and functions.

Options:

  • -F: Specifies the field separator (can be a string or regular expression).
  • -f 'scripts': Reads AWK commands from the script file 'scripts'.
  • -v var=value: Assigns a value to a variable, passing external variables to AWK.

AWK Script Structure:

  • pattern: Matches specific lines.
  • { commands }: Executes actions on matching lines.
  • filename: The file to be processed by AWK.

An AWK script typically consists of three optional parts: a BEGIN block, pattern matching, and an END block. The workflow proceeds as follows:

  1. Execute the BEGIN statement.
  2. Process each line from the file or standard input, executing the pattern matching.
  3. Execute the END statement.

AWK Built-in Variables:

  • $0: The current record (line).
  • $n: The nth field (column) of the current record, where $1 is the first column and $n is the nth column.
  • FS: The field separator (default is a space or tab), can be customized with the -F option.
  • OFS: Output field separator (used for formatted output).
  • RS: Record separator (default is newline).
  • ORS: Output record separator (default is newline).
  • NR: Current line number (starting from 1).
  • NF: Number of fields (columns) in the current line.

AWK Operators:

  • Arithmetic Operators:
    +, -, *, /, %, ^
    Increment and decrement operators (++, --) can be used as prefixes or suffixes.
  • Assignment Operators:
    =, +=, -=, *=, /=, %=, ^=
  • Regular Expression Operators:
    ~: Matches regular expression.
    !~: Does not match regular expression.
  • Logical Operators:
    ||: Logical OR
    &&: Logical AND
  • Relational Operators:
    <, <=, >, >=, !=, ==
  • Other Operators:
    $: Refers to a field by its number.
    Space: Concatenates strings.
    ?:: Ternary operator.
    in: Checks if a key exists in an array.

AWK Regular Expression Syntax:

  • ^: Matches the start of a line.
  • $: Matches the end of a line.
  • .: Matches any single character.
  • *: Matches zero or more occurrences of the preceding character.
  • +: Matches one or more occurrences of the preceding character.
  • ?: Matches zero or one occurrence of the preceding character.
  • []: Matches any character in the specified range.
  • [^]: Matches any character not in the specified range.
  • () and |: Subexpressions and alternations.
  • \: Escape character.
  • {m}: Matches exactly m occurrences of a character.
  • {m,}: Matches at least m occurrences.
  • {m,n}: Matches between m and n occurrences.

AWK Built-in Functions:

  • toupper(): Converts all lowercase letters to uppercase.
  • length(): Returns the length of a string.

Custom Functions in AWK: AWK scripts can include user-defined functions. For example:

function square(x) {
  return x * x;
}

To use the function:

awk '{ print square($1) }' file.txt

AWK Control Flow Statements:

  • if-else: Conditional statements.
  • while and do-while: Loops.
  • for: Standard loops, including array traversal with for-in.
  • break and continue: Loop control.
  • exit: Terminates the script execution.
  • next: Skips the remaining commands for the current line.
  • return: Returns a value from a function.
  • ?:: Ternary operator for conditional expressions.

AWK Arrays: AWK supports associative arrays, meaning array indexes can be strings as well as numbers. Arrays in AWK don’t need to be declared or sized; they are created as soon as you assign a value to an index.

Examples:

1. Basic Example:

$ echo "hello" | awk 'BEGIN{ print "start" } END{ print "end" }'
start
end

2. Using Built-in Variables: To print the first and third columns of a file:

awk '{ print $1, $3 }' test.txt

3. Using External Variables:

$ a=100
$ b=100
$ echo | awk '{ print v1 * v2 }' v1=$a v2=$b
10000

4. Using Regular Expressions: To print the second column of lines starting with “a”:

awk '/^a/ { print $2 }' test.txt

5. Using Built-in Functions: Convert all lowercase letters to uppercase:

awk '{ print toupper($0) }' test.txt

6. Handling Different Delimiters: For CSV files with comma-separated values:

awk -F ',' '{ print $1, $2 }' test.csv

7. Writing and Running AWK Scripts: Save an AWK script to a file (e.g., script.awk):

BEGIN { FS=","; OFS=" - " }
{ print $1, $3 }

Run the script:

awk -f script.awk test.csv

Conclusion:

AWK is a versatile and powerful tool for text processing, offering rich features like pattern matching, regular expressions, and scripting capabilities. From simple one-liners to complex data analysis scripts, AWK excels at processing structured text efficiently and flexibly.

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注