Text Processing with awk in UNIX

Introduction to awk

awk is one of the most powerful text-processing tools available in UNIX and Linux systems. It is designed for pattern scanning, data extraction, report generation, and text manipulation. The name "awk" comes from the surnames of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan.

Unlike simple commands such as grep, which only search for patterns, awk can search, extract, calculate, format, and generate reports from structured text data. It is particularly useful when working with tables, log files, CSV files, configuration files, and system-generated reports.

The basic strength of awk lies in its ability to process text line by line and divide each line into fields automatically.

How awk Works

When an awk command is executed, it performs the following steps:

  1. Reads input one line at a time.

  2. Splits each line into fields.

  3. Compares each line against specified patterns.

  4. Executes actions when patterns match.

  5. Produces output according to the instructions.

For example, if a file contains:

John 25 Manager
Alice 30 Engineer
David 28 Analyst

Each line is treated as a record, and each whitespace-separated word is treated as a field. For the first line:

Field    Value
$1       John
$2       25
$3       Manager

For the second line:

Field    Value
$1       Alice
$2       30
$3       Engineer
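
The record and field splitting described above can be observed directly. The following is a minimal sketch, assuming the sample data is saved as employees.txt in the current directory (the file name used by the later examples). It also uses $NF, which refers to the last field of the record.

```shell
# Recreate the three-line sample file used throughout this tutorial.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# For every record, report its number (NR), its field count (NF),
# and its first and last fields ($1 and $NF).
awk '{ print "record " NR ": " NF " fields, first=" $1 ", last=" $NF }' employees.txt
```

For the sample data, the first line of output should read "record 1: 3 fields, first=John, last=Manager".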

Basic Syntax

The general syntax of awk is:

awk 'pattern { action }' filename

Where:

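  • Pattern specifies the condition a record must satisfy.

  • Action specifies what should be done with each matching record.

  • Filename is the input file; if it is omitted, awk reads standard input.

Either part may be omitted. Without a pattern, the action runs for every record; without an action, awk prints each matching record unchanged.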
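
The three forms an awk rule can take (pattern only, action only, and both together) can be compared side by side. This is an illustrative sketch that assumes the sample data from the previous section is stored in employees.txt; the threshold of 28 is chosen only for demonstration.

```shell
# Recreate the sample file.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# Pattern only: the default action prints the whole matching record.
awk '$2 >= 28' employees.txt

# Action only: with no pattern, the action runs for every record.
awk '{ print $1 }' employees.txt

# Pattern and action together: print the name from each matching record.
awk '$2 >= 28 { print $1 }' employees.txt
```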

Example:

awk '{print $1}' employees.txt

Output:

John
Alice
David

This command prints only the first field from each line.

Understanding Records and Fields

In awk:

  • A record is a line of text.

  • A field is a column within that line.

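Commonly used field references and built-in variables include:

Name     Meaning
$0       Entire current record (line)
$1       First field
$2       Second field
$3       Third field
NF       Number of fields in the current record
NR       Number of records read so far (the current record number)
FS       Input field separator
OFS      Output field separator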

Example:

awk '{print $0}' employees.txt

Prints the entire line.

Example:

awk '{print NF}' employees.txt

Output:

3
3
3

This shows the number of fields in each record.

Printing Specific Fields

To print selected columns:

awk '{print $1, $3}' employees.txt

Output:

John Manager
Alice Engineer
David Analyst

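This extracts only the name and designation columns. The comma in the print statement inserts the output field separator (OFS), which is a single space by default.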

Using Patterns

awk can process only those records that match a specific condition.

Example:

awk '$2 > 27' employees.txt

Output:

Alice 30 Engineer
David 28 Analyst

The condition checks whether the second field is greater than 27. Because no action is given, awk prints every line for which the condition is true.
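
Conditions can also be combined with the logical operators && (and) and || (or). The following sketch assumes the same employees.txt sample; the age limits are arbitrary values picked for illustration.

```shell
# Recreate the sample file.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# Both conditions must hold: age strictly between 25 and 30.
awk '$2 > 25 && $2 < 30' employees.txt

# Either condition may hold: age below 26 or above 29.
awk '$2 < 26 || $2 > 29 { print $1, $2 }' employees.txt
```

With the sample data, the first command should print only David's record, and the second should print "John 25" and "Alice 30".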

Pattern Matching with Strings

Example:

awk '/Engineer/' employees.txt

Output:

Alice 30 Engineer

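This prints every line that contains the string Engineer. The text between the slashes is a regular expression, and it is matched against the whole line, so it would also match Engineer inside a longer word or in any other field.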
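
When a match should apply to one field rather than the whole line, the ~ (matches) and !~ (does not match) operators or a plain string comparison can be used. A short sketch, again assuming the employees.txt sample:

```shell
# Recreate the sample file.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# Match the regular expression against the third field only.
awk '$3 ~ /Engineer/' employees.txt

# Exact string comparison on a field.
awk '$3 == "Analyst" { print $1 }' employees.txt

# Negated match: names whose third field does not contain "Engineer".
awk '$3 !~ /Engineer/ { print $1 }' employees.txt
```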

BEGIN and END Blocks

awk provides two special blocks:

BEGIN

Executed once, before any input is read. A program that consists only of a BEGIN block does not need an input file.

Example:

awk 'BEGIN {print "Employee Report"}'

Output:

Employee Report

END

Executed once, after all records have been processed.

Example:

awk 'END {print "Processing Complete"}' employees.txt

Output:

Processing Complete

Combined Example

awk '
BEGIN {print "Employee List"}
{print $1}
END {print "End of Report"}
' employees.txt

Output:

Employee List
John
Alice
David
End of Report

Built-in Variables

NR (Number of Records)

awk '{print NR, $0}' employees.txt

Output:

1 John 25 Manager
2 Alice 30 Engineer
3 David 28 Analyst

NF (Number of Fields)

awk '{print $1, NF}' employees.txt

Output:

John 3
Alice 3
David 3

FILENAME (Name of the Current Input File)

awk '{print FILENAME}' employees.txt

Output:

employees.txt
employees.txt
employees.txt

Field Separators

By default, fields are separated by one or more spaces or tabs, and leading and trailing whitespace on a line is ignored.

A custom separator can be specified using -F.

Consider a file:

John,25,Manager
Alice,30,Engineer
David,28,Analyst

To process comma-separated values:

awk -F ',' '{print $1, $3}' employees.csv

Output:

John Manager
Alice Engineer
David Analyst
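
The output separator can be changed in the same way as the input separator. The sketch below assumes the comma-separated sample is stored in employees.csv and sets FS and OFS either in a BEGIN block or on the command line with -v. Note that splitting on a bare comma does not handle quoted CSV fields that themselves contain commas.

```shell
# Recreate the comma-separated sample file.
printf '%s\n' 'John,25,Manager' 'Alice,30,Engineer' 'David,28,Analyst' > employees.csv

# Read comma-separated input and write comma-separated output.
awk 'BEGIN { FS = ","; OFS = "," } { print $1, $3 }' employees.csv

# Same input, tab-separated output, with OFS set through -v.
awk -F ',' -v OFS='\t' '{ print $1, $3 }' employees.csv
```

The first command should print lines such as "John,Manager"; the second prints the same two fields separated by a tab.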

Arithmetic Operations

awk can perform calculations directly.

Example:

awk '{sum += $2} END {print sum}' employees.txt

Output:

83

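Each record adds its second field to sum, and the END block prints the total once the last record has been read (25 + 30 + 28 = 83).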

Calculating Average

awk '{sum += $2} END {print sum/NR}' employees.txt

Output:

27.6667
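
The average calculation divides by NR, which is zero when the input is empty. A slightly more defensive variant, offered here as a sketch rather than the only approach, checks NR first and uses printf to fix the number of decimal places.

```shell
# Recreate the sample file.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# Sum the ages, then print the average rounded to two decimals.
# The NR > 0 guard avoids a division by zero on empty input.
awk '{ sum += $2 } END { if (NR > 0) printf "%.2f\n", sum / NR }' employees.txt
```

For the sample data this should print 27.67; for an empty file it prints nothing.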

Conditional Statements

awk supports if statements.

Example:

awk '
{
if($2 > 28)
print $1
}
' employees.txt

Output:

Alice

Only employees older than 28 are displayed.
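
An else branch can be added to handle the records that fail the condition. The following sketch assumes the employees.txt sample; the message wording is purely illustrative.

```shell
# Recreate the sample file.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# Label every employee according to the same age test used above.
awk '{
    if ($2 > 28)
        print $1, "is older than 28"
    else
        print $1, "is 28 or younger"
}' employees.txt
```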

Loops in awk

for Loop

awk '
{
for(i=1;i<=NF;i++)
print $i
}
' employees.txt

Output:

John
25
Manager
Alice
30
Engineer
David
28
Analyst

while Loop

awk '
{
i=1
while(i<=NF)
{
print $i
i++
}
}
' employees.txt

This produces the same output as the for loop above.
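
Loops become more useful when the loop body builds a result instead of printing each field on its own line. As one possible illustration, the sketch below walks the fields from last to first and prints each record with its fields reversed; the variable name out is arbitrary.

```shell
# Recreate the sample file.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# Start with the last field, then append the remaining fields
# in descending order, separated by single spaces.
awk '{
    out = $NF
    for (i = NF - 1; i >= 1; i--)
        out = out " " $i
    print out
}' employees.txt
```

The first line of output for the sample data should be "Manager 25 John".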

Formatting Output

awk provides the printf function.

Example:

awk '{printf "%-10s %-5s %-10s\n",$1,$2,$3}' employees.txt

Output:

John       25    Manager
Alice      30    Engineer
David      28    Analyst

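This creates properly aligned columns. In each format specifier, the number sets the minimum column width and the - flag left-justifies the value within it.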
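
A header row can be printed from a BEGIN block with the same widths, and numeric columns can be right-aligned with %d. This is a sketch under the assumption that the employees.txt sample is in use; the column titles are chosen for illustration.

```shell
# Recreate the sample file.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# Header and data rows share the same column widths.
# %-10s left-justifies text; %5s and %5d right-justify within 5 characters.
awk 'BEGIN { printf "%-10s %5s  %-12s\n", "Name", "Age", "Designation" }
     { printf "%-10s %5d  %-12s\n", $1, $2, $3 }' employees.txt
```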

Data Filtering

Consider a file:

101 John 50000
102 Alice 70000
103 David 45000

Find employees earning more than 50000:

awk '$3 > 50000' salary.txt

Output:

102 Alice 70000

Generating Reports

Example:

awk '
BEGIN {
print "Salary Report"
}
{
total += $3
}
END {
print "Total Salary:", total
}
' salary.txt

Output:

Salary Report
Total Salary: 165000

This demonstrates how awk can generate summary reports from raw data.
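
The same structure extends to richer summaries. The sketch below, which assumes the salary.txt sample shown earlier, adds an employee count, an average, and the highest earner; the variable names max and top and the report labels are illustrative choices.

```shell
# Recreate the salary sample file.
printf '%s\n' '101 John 50000' '102 Alice 70000' '103 David 45000' > salary.txt

awk '
BEGIN { print "Salary Report" }
{
    total += $3
    # Track the highest salary seen so far and the matching name.
    if (NR == 1 || $3 > max) { max = $3; top = $2 }
}
END {
    print "Employees:", NR
    print "Total Salary:", total
    if (NR > 0) printf "Average Salary: %.2f\n", total / NR
    print "Highest Paid:", top, "(" max ")"
}
' salary.txt
```

For the sample data, the report should show 3 employees, a total of 165000, an average of 55000.00, and Alice (70000) as the highest paid.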

Combining awk with Other UNIX Commands

Using with ps

ps -ef | awk '{print $1,$2}'

Displays the user and process ID columns of every process, including the header line.

Using with ls

ls -l | awk '{print $9}'

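Displays the ninth column of the long listing, which normally holds the file name. The leading "total" line of ls -l has no ninth field and produces an empty line, and a name that contains spaces is cut off at the first space, so this form is best kept for quick interactive use.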

Using with df

df -h | awk '{print $1,$5}'

Displays file systems and their usage percentages.
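
Command output usually begins with a header line, which can be skipped with the pattern NR > 1. The sketch below assumes the POSIX -P option of df, which requests a fixed column layout in which the fifth column is the capacity percentage, and it assumes that file system names contain no spaces.

```shell
# NR > 1 skips the header line; -P asks df for the portable POSIX layout.
# Prints the file system name and its capacity percentage.
df -P | awk 'NR > 1 { print $1, $5 }'
```

The actual output depends on the file systems mounted on the machine where the command is run.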

Practical Applications of awk

Log Analysis

System administrators use awk to analyze server logs.

Example:

awk '/ERROR/' logfile.txt

Displays every line that contains the string ERROR.
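
Beyond filtering, awk can summarize a log with an associative array. The following sketch uses a hypothetical log format (date, time, level, message) invented for this example; real logs will differ, and the field number would need to be adjusted accordingly.

```shell
# Hypothetical sample log: date, time, level, message.
cat > logfile.txt <<'EOF'
2024-01-01 10:00:01 INFO service started
2024-01-01 10:00:05 ERROR disk full
2024-01-01 10:00:09 WARN retrying write
2024-01-01 10:00:12 ERROR write failed
EOF

# Count records per level; the array is keyed by the third field.
# The order of a for-in loop is unspecified, so the result is sorted.
awk '{ count[$3]++ } END { for (level in count) print level, count[level] }' logfile.txt | sort
```

For this sample, the output should be "ERROR 2", "INFO 1", and "WARN 1", one per line.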

Report Generation

awk can create employee reports, sales reports, and inventory summaries.

Data Cleaning

It can remove unwanted fields, reformat records, and prepare data for databases.

Monitoring System Performance

awk is frequently used to extract performance-related information from the output of commands such as top, vmstat, iostat, and sar.

Advantages of awk

  1. Fast and efficient text processing.

  2. Available by default on almost every UNIX/Linux system, since it is part of the POSIX standard.

  3. Supports variables, loops, and conditions.

  4. Handles large files effectively, because input is processed one record at a time rather than loaded into memory as a whole.

  5. Excellent for report generation.

  6. Integrates easily with shell scripts.

  7. Supports arithmetic calculations and data analysis.

Limitations of awk

  1. Complex programs can become difficult to maintain.

  2. Not ideal for graphical applications.

  3. Less suitable for very large software projects.

  4. Data structures are limited compared to modern programming languages: standard awk offers only associative arrays.

Conclusion

awk is a powerful UNIX text-processing utility that goes far beyond simple searching. It enables users to extract information, perform calculations, generate reports, filter records, and automate data-processing tasks. Because of its speed, flexibility, and availability on almost every UNIX system, awk remains one of the most important tools for system administrators, developers, and data analysts who work extensively with text-based data.