Text Processing with awk in UNIX
Introduction to awk
awk is one of the most powerful text-processing tools available in UNIX and Linux systems. It is designed for pattern scanning, data extraction, report generation, and text manipulation. The name "awk" comes from the surnames of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan.
Unlike simple commands such as grep, which only search for patterns, awk can search, extract, calculate, format, and generate reports from structured text data. It is particularly useful when working with tables, log files, CSV files, configuration files, and system-generated reports.
The basic strength of awk lies in its ability to process text line by line and divide each line into fields automatically.
How awk Works
When an awk command is executed, it performs the following steps:
- Reads input one line at a time.
- Splits each line into fields.
- Compares each line against specified patterns.
- Executes actions when patterns match.
- Produces output according to the instructions.
For example, if a file contains:
John 25 Manager
Alice 30 Engineer
David 28 Analyst
Each line is treated as a record, and each whitespace-separated word is treated as a field. For the first line:
| Field Number | Value |
|---|---|
| $1 | John |
| $2 | 25 |
| $3 | Manager |
For the second line:
| Field Number | Value |
|---|---|
| $1 | Alice |
| $2 | 30 |
| $3 | Engineer |
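The read, split, match, and act cycle described above can be sketched in a single command. This is only an illustration: it assumes the three sample lines are saved as employees.txt, and the snippet recreates that file so it runs on its own.

```shell
# Recreate the sample data so the example is self-contained.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# For each record, awk splits the line into fields, tests the pattern
# ($3 == "Engineer"), and runs the action only when the pattern matches.
awk '$3 == "Engineer" { print "record " NR ": " $1 " (" NF " fields)" }' employees.txt
```

Only the second record matches, so the command prints `record 2: Alice (3 fields)`.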
Basic Syntax
The general syntax of awk is:
awk 'pattern { action }' filename
Where:
- Pattern specifies the condition.
- Action specifies what should be done.
- Filename is the input file.
Example:
awk '{print $1}' employees.txt
Output:
John
Alice
David
This command prints only the first field from each line.
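Either part of the `pattern { action }` pair may be left out. The sketch below, which again assumes the sample data lives in employees.txt, shows the three possible forms side by side.

```shell
# Recreate the sample data so the example is self-contained.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# Pattern and action together: print the name of everyone aged 28 or more.
awk '$2 >= 28 { print $1 }' employees.txt

# Pattern only: the default action prints the whole matching record.
awk '$2 >= 28' employees.txt

# Action only: with no pattern, the action runs for every record.
awk '{ print $1 }' employees.txt
```

The first command prints Alice and David, the second prints their complete lines, and the third prints all three names.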
Understanding Records and Fields
In awk:
- A record is a line of text.
- A field is a column within that line.
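Field references and built-in variables include:
| Variable | Meaning |
|---|---|
| $0 | Entire line |
| $1 | First field |
| $2 | Second field |
| $3 | Third field |
| NF | Number of fields in the current record |
| NR | Number of the current record (count of records read so far) |
| FS | Input field separator (whitespace by default) |
| OFS | Output field separator (a single space by default) |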
Example:
awk '{print $0}' employees.txt
Prints the entire line.
Example:
awk '{print NF}' employees.txt
Output:
3
3
3
This shows the number of fields in each record.
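The table lists OFS and NF, and a short sketch may make their effect more concrete. It assumes the same employees.txt sample; the "-" separator is an arbitrary choice for illustration.

```shell
# Recreate the sample data so the example is self-contained.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# OFS is written wherever a comma separates arguments in a print statement.
awk 'BEGIN { OFS = "-" } { print $1, $3 }' employees.txt

# Because NF holds the field count, $NF refers to the last field.
awk '{ print $NF }' employees.txt
```

The first command prints John-Manager, Alice-Engineer, and David-Analyst; the second prints Manager, Engineer, and Analyst.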
Printing Specific Fields
To print selected columns:
awk '{print $1, $3}' employees.txt
Output:
John Manager
Alice Engineer
David Analyst
This extracts only the name and designation columns.
Using Patterns
awk can process only those records that match a specific condition.
Example:
awk '$2 > 27' employees.txt
Output:
Alice 30 Engineer
David 28 Analyst
The condition checks whether the second field is greater than 27.
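Conditions can also be combined with the logical operators && (and) and || (or). A minimal sketch, assuming the same employees.txt sample:

```shell
# Recreate the sample data so the example is self-contained.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# Both conditions must hold: age strictly between 25 and 30.
awk '$2 > 25 && $2 < 30' employees.txt

# Either condition may hold: age under 26, or designation Analyst.
awk '$2 < 26 || $3 == "Analyst"' employees.txt
```

The first command prints only David's line; the second prints the lines for John and David.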
Pattern Matching with Strings
Example:
awk '/Engineer/' employees.txt
Output:
Alice 30 Engineer
This prints every line that contains the string Engineer anywhere in the record.
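A regular expression can also be tested against a single field with the ~ (matches) and !~ (does not match) operators, which avoids accidental matches in other columns. A small sketch, assuming the same employees.txt sample:

```shell
# Recreate the sample data so the example is self-contained.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# Match only when the third field starts with "Eng".
awk '$3 ~ /^Eng/' employees.txt

# Print names that do NOT start with "A".
awk '$1 !~ /^A/ { print $1 }' employees.txt
```

The first command prints Alice's line; the second prints John and David.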
BEGIN and END Blocks
awk provides two special blocks:
BEGIN
Executed once, before any input is read. A program that consists only of a BEGIN block needs no input file.
Example:
awk 'BEGIN {print "Employee Report"}'
Output:
Employee Report
END
Executed after processing all records.
Example:
awk 'END {print "Processing Complete"}' employees.txt
Output:
Processing Complete
Combined Example
awk '
BEGIN {print "Employee List"}
{print $1}
END {print "End of Report"}
' employees.txt
Output:
Employee List
John
Alice
David
End of Report
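Because NR still holds the number of the last record when the END block runs, the closing line of a report can include a record count. The following is a sketch under the same employees.txt assumption; the numbering format and the wording of the summary line are illustrative choices.

```shell
# Recreate the sample data so the example is self-contained.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

awk '
BEGIN { print "Employee List" }            # runs once, before any input
      { print NR ". " $1 }                 # runs for every record
END   { print "Total employees: " NR }     # runs once, after the last record
' employees.txt
```

This prints the heading, three numbered names, and the line `Total employees: 3`.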
Built-in Variables
NR (Number of Records)
awk '{print NR, $0}' employees.txt
Output:
1 John 25 Manager
2 Alice 30 Engineer
3 David 28 Analyst
NF (Number of Fields)
awk '{print $1, NF}' employees.txt
Output:
John 3
Alice 3
David 3
FILENAME
awk '{print FILENAME}' employees.txt
Output:
employees.txt
employees.txt
employees.txt
Field Separators
By default, fields are separated by runs of spaces or tabs, and leading and trailing whitespace is ignored.
A custom separator can be specified using -F.
Consider a file:
John,25,Manager
Alice,30,Engineer
David,28,Analyst
To process comma-separated values:
awk -F ',' '{print $1, $3}' employees.csv
Output:
John Manager
Alice Engineer
David Analyst
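The separators can also be set inside the program through the FS and OFS variables, which is convenient when the output needs a different delimiter from the input. This sketch assumes the comma-separated sample is saved as employees.csv and uses ":" as an arbitrary output separator. Note that plain comma splitting does not understand quoted CSV fields that themselves contain commas.

```shell
# Recreate the comma-separated sample so the example is self-contained.
printf '%s\n' 'John,25,Manager' 'Alice,30,Engineer' 'David,28,Analyst' > employees.csv

# FS controls how input lines are split; OFS controls how print joins fields.
awk 'BEGIN { FS = ","; OFS = ":" } { print $1, $3 }' employees.csv
```

The command prints John:Manager, Alice:Engineer, and David:Analyst.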
Arithmetic Operations
awk can perform calculations directly.
Example:
awk '{sum += $2} END {print sum}' employees.txt
Output:
83
The ages are added together.
Calculating Average
awk '{sum += $2} END {print sum/NR}' employees.txt
Output:
27.6667
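The value 27.6667 appears because print formats non-integer numbers with six significant digits by default. When a fixed number of decimals is wanted, printf can be used instead, and a check on NR avoids dividing by zero on empty input. A minimal sketch, assuming the same employees.txt sample:

```shell
# Recreate the sample data so the example is self-contained.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

# Sum the second field, then print the average with two decimal places.
# The NR > 0 guard prevents a division-by-zero error when the input is empty.
awk '{ sum += $2 } END { if (NR > 0) printf "%.2f\n", sum / NR }' employees.txt
```

With the sample data this prints 27.67.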
Conditional Statements
awk supports if statements.
Example:
awk '
{
    if ($2 > 28)
        print $1
}
' employees.txt
Output:
Alice
Only employees older than 28 are displayed.
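An else branch handles the records that fail the test. In the sketch below, the labels "over 28" and "28 or under" are made up for illustration, and the data is the same employees.txt sample.

```shell
# Recreate the sample data so the example is self-contained.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

awk '
{
    # Label every employee according to the age in the second field.
    if ($2 > 28)
        print $1, "over 28"
    else
        print $1, "28 or under"
}
' employees.txt
```

Here every record produces a line: Alice is labelled "over 28", while John and David are labelled "28 or under".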
Loops in awk
for Loop
awk '
{
    for (i = 1; i <= NF; i++)
        print $i
}
' employees.txt
Output:
John
25
Manager
Alice
30
Engineer
David
28
Analyst
while Loop
awk '
{
    i = 1
    while (i <= NF) {
        print $i
        i++
    }
}
' employees.txt
This produces the same output as the for loop above.
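Loops over fields are not limited to printing them in order. As a small sketch under the same employees.txt assumption, the loop below walks the fields backwards to print each record in reverse.

```shell
# Recreate the sample data so the example is self-contained.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

awk '
{
    # Count down from the last field to the first. Each field is followed
    # by a space, except the final one printed, which ends the line.
    for (i = NF; i >= 1; i--)
        printf "%s%s", $i, (i > 1 ? " " : "\n")
}
' employees.txt
```

The first output line is `Manager 25 John`, and the other records are reversed in the same way.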
Formatting Output
awk provides the printf function.
Example:
awk '{printf "%-10s %-5s %-10s\n",$1,$2,$3}' employees.txt
Output:
John       25    Manager
Alice      30    Engineer
David      28    Analyst
This creates properly aligned columns.
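The same idea extends to a header row and to numeric conversions. In this sketch the column titles NAME, AGE, and ROLE are invented for illustration; %5d right-aligns the age in a five-character column, while %-10s left-aligns the name.

```shell
# Recreate the sample data so the example is self-contained.
printf '%s\n' 'John 25 Manager' 'Alice 30 Engineer' 'David 28 Analyst' > employees.txt

awk '
BEGIN { printf "%-10s %5s  %s\n", "NAME", "AGE", "ROLE" }   # header row
      { printf "%-10s %5d  %s\n", $1, $2, $3 }              # one row per record
' employees.txt
```

Using the same widths in the header and in the data rows keeps the columns lined up.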
Data Filtering
Consider a file:
101 John 50000
102 Alice 70000
103 David 45000
Find employees earning more than 50000:
awk '$3 > 50000' salary.txt
Output:
102 Alice 70000
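Hard-coding the threshold works for a one-off query. For reuse, the value can be passed in with the -v option, which assigns an awk variable before the program starts. In this sketch the variable name min is an arbitrary choice, and the data is the salary.txt sample shown above.

```shell
# Recreate the salary sample so the example is self-contained.
printf '%s\n' '101 John 50000' '102 Alice 70000' '103 David 45000' > salary.txt

# -v min=50000 sets the awk variable "min" before any input is read.
# Note the >= comparison: John, at exactly 50000, is included here.
awk -v min=50000 '$3 >= min { print $2, $3 }' salary.txt
```

This prints `John 50000` and `Alice 70000`.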
Generating Reports
Example:
awk '
BEGIN {
    print "Salary Report"
}
{
    total += $3
}
END {
    print "Total Salary:", total
}
' salary.txt
Output:
Salary Report
Total Salary: 165000
This demonstrates how awk can generate summary reports from raw data.
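A report can carry more than one summary figure. The sketch below extends the same pattern to an employee count, an average, and the highest salary. The variable names max and top_name and the report labels are illustrative, and the data is the salary.txt sample.

```shell
# Recreate the salary sample so the example is self-contained.
printf '%s\n' '101 John 50000' '102 Alice 70000' '103 David 45000' > salary.txt

awk '
BEGIN { print "Salary Report" }
{
    total += $3
    # Remember the largest salary seen so far and who earns it.
    if (NR == 1 || $3 > max) { max = $3; top_name = $2 }
}
END {
    if (NR > 0) {                       # skip the summary for empty input
        print "Employees:", NR
        print "Average Salary:", total / NR
        print "Highest Salary:", max, "(" top_name ")"
    }
}
' salary.txt
```

With the sample data the report shows 3 employees, an average of 55000, and a highest salary of 70000 (Alice).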
Combining awk with Other UNIX Commands
Using with ps
ps -ef | awk '{print $1,$2}'
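Displays the user name and process ID of each process. The header line of ps is passed through as well, so the first output line is UID PID.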
Using with ls
ls -l | awk '{print $9}'
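Displays file names, provided they contain no spaces: a name with spaces is cut off at the first space, and the leading "total" line of ls -l produces an empty output line.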
Using with df
df -h | awk '{print $1,$5}'
Displays file systems and their usage percentages, assuming the percentage is the fifth column; the column layout of df can differ between platforms.
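Command output usually starts with a header line, and percentage columns carry a trailing % sign. The sketch below shows one way to handle both. Because real df output varies by machine, it uses a made-up sample (the function sample_df and its contents are invented for illustration) in place of the live command.

```shell
# Made-up text standing in for `df -h` output, so the result is reproducible.
sample_df() {
    printf '%s\n' \
        'Filesystem Size Used Avail Use% Mounted on' \
        '/dev/sda1 50G 20G 30G 40% /' \
        'tmpfs 2.0G 0 2.0G 0% /dev/shm'
}

# NR > 1 skips the header line. Adding 0 to $5 converts "40%" to the
# number 40, because awk uses the leading numeric part of a string.
sample_df | awk 'NR > 1 && $5 + 0 >= 10 { print $1, $5 }'
```

With this sample only `/dev/sda1 40%` is printed. On a real system, `df -h` would take the place of sample_df.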
Practical Applications of awk
Log Analysis
System administrators use awk to analyze server logs.
Example:
awk '/ERROR/' logfile.txt
Displays every line that contains the string ERROR.
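Beyond listing matching lines, awk can count them in the same pass. The log format below (date, time, level, message) and its contents are hypothetical, so the field number would need adjusting for a real log.

```shell
# Hypothetical log file: date, time, level, then a free-text message.
printf '%s\n' \
    '2024-01-01 10:00:01 INFO service started' \
    '2024-01-01 10:00:05 ERROR disk full' \
    '2024-01-01 10:00:09 WARN retrying' \
    '2024-01-01 10:00:12 ERROR write failed' > logfile.txt

# Test the level field only, print each matching line, and keep a count.
# "n + 0" makes the total print as 0 instead of an empty string when
# no line matches.
awk '$3 == "ERROR" { n++; print } END { print n + 0, "error lines" }' logfile.txt
```

This prints the two ERROR lines followed by `2 error lines`.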
Report Generation
awk can create employee reports, sales reports, and inventory summaries.
Data Cleaning
It can remove unwanted fields, reformat records, and prepare data for databases.
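One common cleaning step is normalizing irregular spacing and converting to a delimiter that a database loader accepts. The sketch below relies on the fact that assigning to any field makes awk rebuild the record using OFS. The file name messy.txt and its contents are invented for illustration.

```shell
# Hypothetical input with leading, trailing, and repeated spaces.
printf '%s\n' '  101   John    50000 ' '102 Alice   70000' > messy.txt

# The assignment $1 = $1 changes nothing in the field itself, but it forces
# awk to rebuild $0 by joining the fields with OFS (a comma here).
awk 'BEGIN { OFS = "," } { $1 = $1; print }' messy.txt
```

The output is `101,John,50000` and `102,Alice,70000`, with all stray whitespace removed.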
Monitoring System Performance
awk is frequently used with commands such as top, vmstat, iostat, and sar to extract performance-related information.
Advantages of awk
- Fast and efficient text processing.
- Built into almost every UNIX/Linux system.
- Supports variables, loops, and conditions.
- Handles large files effectively.
- Excellent for report generation.
- Integrates easily with shell scripts.
- Supports arithmetic calculations and data analysis.
Limitations of awk
- Complex programs can become difficult to maintain.
- Not ideal for graphical applications.
- Less suitable for very large software projects.
- Advanced data structures are limited compared to modern programming languages.
Conclusion
awk is a powerful UNIX text-processing utility that goes far beyond simple searching. It enables users to extract information, perform calculations, generate reports, filter records, and automate data-processing tasks. Because of its speed, flexibility, and availability on almost every UNIX system, awk remains one of the most important tools for system administrators, developers, and data analysts who work extensively with text-based data.