How to and Benchmark of awk, grep and sed on Multiple (AND) Keywords

Introduction

grep is useful for searching text with regular expressions in one or more files on Linux. One example is finding information in logs that is related to a particular incident: we provide a unique identifier, such as a user ID, and a timestamp to narrow the scope of the details retrieved. In this tutorial, we compare grep with two other Linux commands, awk and sed, in a test that extracts data from a large log file.

Test Environment

  • Type: KVM
  • Processor: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  • CPU(s): 4
  • Memory: 8 GB RAM
  • Hard drive: SSD
  • OS: AlmaLinux 8.7 (Stone Smilodon)
  • Data source: FIFA 23 complete player dataset @kaggle.com

Sample ‘log’ File Used

We selected the largest CSV file, male_players.csv, which is 5.6 GB. Below is a small sample of this file.

248360,/player/248360/tobias-klysner/230009,23,9,2023-01-13,T. Klysner,Tobias Klysner Breuner,"RM, LM",65,72,1400000,3000,20,2001-07-03,183,78,1,Superliga,1,1786,Randers,SUB,18,,2018-07-01,2024,13,Denmark,,,,Right,3,3,1,Medium/Medium,Normal (170-185),No,2000000,,Speed Dribbler (AI),80,56,57,67,49,70,55,55,45,62,55,66,55,48,53,65,83,78,75,61,79,64,66,74,68,49,71,48,60,57,49,62,48,52,52,8,7,12,7,14,,61+2,61+2,61+2,63,63,63,63,63,62+2,62+2,62+2,64+2,60+2,60+2,60+2,64+2,60+2,58+2,58+2,58+2,60+2,59+2,56+2,56+2,56+2,59+2,15+2,https://cdn.sofifa.net/players/248/360/23_120.png

248498,/player/248498/khalid-al-ghannam/230009,23,9,2023-01-13,K. Al Ghannam,Khalid Essa Al Ghannam,"LM, RM",65,76,1600000,7000,21,2000-11-08,173,64,350,Pro League,1,112139,Al Nassr,RES,11,,2020-01-29,2025,183,Saudi Arabia,,,,Right,3,4,1,High/Low,Lean (170-185),No,3000000,,"Flair, Technical Dribbler (AI)",75,53,54,74,29,45,52,54,43,56,42,76,50,39,55,70,77,74,87,58,79,56,38,67,39,51,34,24,62,54,51,66,29,26,31,8,13,14,8,5,,59+2,59+2,59+2,64,63,63,63,64,63+2,63+2,63+2,64+2,57+2,57+2,57+2,64+2,50+2,44+2,44+2,44+2,50+2,47+2,36+2,36+2,36+2,47+2,14+2,https://cdn.sofifa.net/players/248/498/23_120.png

Test Script

We have two filter arrays that store a couple of keywords, and the bash script must only retrieve data that contains all the keywords. It is worth noting that this is an “AND” operation test only.

#!/bin/bash
# https://stackoverflow.com/questions/75588420/how-to-grep-all-keywords-from-array-in-bash-script/

# Find these set of keyword(s) in log (male_players.csv)
filter_list1=("Premier League")
filter_list2=("Premier League" "ST, CF" "Injury Prone")

#######################################
# awk
#######################################
# 1 keyword
awk -v w="$(printf '%s\n' "${filter_list1[@]}")" '
  BEGIN {IGNORECASE = 1; split(w,res,"\n"); for(i in res) res[i] = "\\<" res[i] "\\>"}
  {for(i in res) if($0 !~ res[i]) next; print}' "archive/male_players.csv" >> awk-extracted1.log

# 3 keywords - A && B && C
awk -v w="$(printf '%s\n' "${filter_list2[@]}")" '
  BEGIN {IGNORECASE = 1; split(w,res,"\n"); for(i in res) res[i] = "\\<" res[i] "\\>"}
  {for(i in res) if($0 !~ res[i]) next; print}' "archive/male_players.csv" >> awk-extracted2.log

#######################################
# sed
#######################################
# 1 keyword
printf -v sedcmd '/\\b%s\\b/{' "${filter_list1[@]}"
printf -v toadd '%*s' ${#filter_list1[@]}
sed -ne "$sedcmd"p${toadd// /\}} < archive/male_players.csv >> sed-extracted1.log

# 3 keywords - A && B && C
printf -v sedcmd '/\\b%s\\b/{' "${filter_list2[@]}"
printf -v toadd '%*s' ${#filter_list2[@]}
sed -ne "$sedcmd"p${toadd// /\}} < archive/male_players.csv >> sed-extracted2.log

#######################################
# grep
#######################################
# 1 keyword
grep -Ewi 'Premier League' archive/male_players.csv >> grep-extracted1.log

# 3 keywords - A && B && C
grep -Ewi 'Premier League' archive/male_players.csv | grep -Ewi 'ST, CF' | grep -Ewi 'Injury Prone' >> grep-extracted2.log
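
The sed section is the least obvious part of the script. As a minimal sketch (the variable names sed_program, sample and sed_matches, and the three-line sample itself, are assumptions added for illustration), the result of the two printf calls can be printed to see the program that sed actually receives: one nested /\bkeyword\b/{ block per keyword, then p, then one closing brace per keyword. Note that this sed program, unlike the awk and grep variants, matches case-sensitively.

```bash
#!/bin/bash
filter_list2=("Premier League" "ST, CF" "Injury Prone")

# One opening block per keyword: /\bKEYWORD\b/{
printf -v sedcmd '/\\b%s\\b/{' "${filter_list2[@]}"
# As many spaces as there are keywords; each becomes a closing brace below
printf -v toadd '%*s' "${#filter_list2[@]}"

# sed_program is a hypothetical name, used only to inspect the result
sed_program="$sedcmd"p${toadd// /\}}
printf '%s\n' "$sed_program"

# Hypothetical sample: only the first line contains all three keywords
sample='"ST, CF",Premier League,"Injury Prone, Flair"
"ST, CF",Premier League,Flair
CB,Premier League,Injury Prone'
sed_matches=$(printf '%s\n' "$sample" | sed -ne "$sed_program")
printf '%s\n' "$sed_matches"
```

With the three keywords above, the generated program is /\bPremier League\b/{/\bST, CF\b/{/\bInjury Prone\b/{p}}}, so a line is printed only when every nested address matches. The \b word boundary is a GNU sed extension.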

Captured Logs – awk, grep, sed

We executed the script 12 times, twice for each of the six test cases (three commands, each with 1 and 3 keywords), to obtain an average. Six logs were generated and kept after each second run. As you can see, the extracted files for each test case have the same size, which suggests that all three Linux commands (awk, grep and sed) extract the same lines.

$ ls -l *.log
-rw-rw-r-- 1 user user 319877802 Mar  3 11:35 awk-extracted1.log
-rw-rw-r-- 1 user user    121597 Mar  3 11:24 awk-extracted2.log
-rw-rw-r-- 1 user user 319877802 Mar  3 11:30 grep-extracted1.log
-rw-rw-r-- 1 user user    121597 Mar  3 11:33 grep-extracted2.log
-rw-rw-r-- 1 user user 319877802 Mar  3 11:26 sed-extracted1.log
-rw-rw-r-- 1 user user    121597 Mar  3 11:28 sed-extracted2.log
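
Equal file sizes are a good sign, but they do not prove that the contents are identical. A minimal sketch of a stricter check, assuming the six log files listed above are in the current directory (the helper name same_content is an assumption for illustration), compares the files byte by byte with cmp:

```bash
#!/bin/bash
# same_content is a hypothetical helper: it succeeds only if every file
# given after the first is byte-for-byte identical to the first one.
same_content() {
  local reference=$1 other
  shift
  for other in "$@"; do
    cmp -s -- "$reference" "$other" || return 1
  done
}

for n in 1 2; do
  if same_content "awk-extracted$n.log" "grep-extracted$n.log" "sed-extracted$n.log"; then
    echo "test case $n: outputs are identical"
  else
    echo "test case $n: outputs differ (or a file is missing)"
  fi
done
```

cmp -s prints nothing and only sets the exit status, which makes it convenient inside an if.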

Extracted log snippet – single keyword filter

Two player profiles (davit and murilo) are shown; both contain the single keyword ‘Premier League’.

239904,/player/239904/davit-khocholava/210010,21,10,2020-11-06,D. Khocholava,Davit Khocholava,CB,70,73,1800000,550,27,1993-02-08,192,92,332,Premier League,1,101059,Shakhtar Donetsk,RCB,5,,2017-07-01,2025,20,Georgia,,,,Right,3,2,1,Low/Medium,Normal (185+),No,3900000,,,63,47,47,56,69,77,31,51,75,67,23,56,27,21,51,54,64,63,56,63,66,72,75,72,82,25,71,69,40,36,39,65,66,70,73,10,9,7,9,5,,58+2,58+2,58+2,52,54,54,54,52,52+2,52+2,52+2,53+2,55+2,55+2,55+2,53+2,62+2,65+2,65+2,65+2,62+2,64+2,70+2,70+2,70+2,64+2,14+2,https://cdn.sofifa.net/players/239/904/21_120.png
240115,/player/240115/murilo-cerqueira-paim/210010,21,10,2020-11-06,Murilo,Murilo Cerqueira Paim,"CB, CDM",70,80,3300000,19000,23,1997-03-27,188,78,67,Premier League,1,100765,Lokomotiv Moskva,LCB,27,,2019-06-18,2023,54,Brazil,,,,Right,2,2,1,Medium/Medium,Normal (185+),No,5900000,,,69,36,52,55,72,70,40,31,71,70,26,51,41,24,64,56,68,70,58,71,57,60,68,73,73,25,61,72,38,34,44,68,69,75,73,11,7,12,11,9,,53+2,53+2,53+2,51,52,52,52,51,52+2,52+2,52+2,55+2,57+2,57+2,57+2,55+2,66+2,67+2,67+2,67+2,66+2,68+2,70+2,70+2,70+2,68+2,16+2,https://cdn.sofifa.net/players/240/115/21_120.png

Extracted log snippet – 3 keyword filters

Two random player profiles (samed and aleksey) are shown; both contain all three keywords: ‘Premier League’, ‘ST, CF’ and ‘Injury Prone’.

204782,/player/204782/samed-yesil/150004,15,4,2014-09-26,S. Yesil,Samed Yeşil,"ST, CF",63,77,300000,5000,20,1994-05-25,180,72,13,Premier League,1,9,Liverpool,RES,36,,2012-08-01,2017,21,Germany,,,,Right,3,3,1,Medium/Medium,Lean (170-185),No,,,"Injury Prone, Beat Offside Trap",75,63,44,71,26,47,34,72,45,57,68,71,55,41,35,67,76,75,83,56,79,56,65,64,43,49,33,20,69,38,61,,25,25,25,14,8,11,6,6,,63,63,63,61,62,62,62,61,60,60,60,57,50,50,50,57,45,41,41,41,45,40,36,36,36,40,12,https://cdn.sofifa.net/players/204/782/15_120.png
224717,/player/224717/aleksey-pugin/150004,15,4,2014-09-26,A. Pugin,Aleksey Pugin,"LW, RM, ST, CF",59,63,150000,3000,27,1987-03-07,182,74,67,Premier League,1,100768,Torpedo Moscow,SUB,18,,2014-07-04,2016,40,Russia,,,,Left,4,2,1,High/Low,Normal (170-185),No,,,Injury Prone,65,66,50,65,30,46,58,68,54,52,62,70,51,47,41,64,61,69,59,38,65,80,43,42,55,55,30,23,47,47,61,,29,26,31,15,12,9,7,15,,60,60,60,59,59,59,59,59,56,56,56,55,47,47,47,55,44,40,40,40,44,40,37,37,37,40,14,https://cdn.sofifa.net/players/224/717/15_120.png

Test Results

The benchmark was conducted by running the above script on the CSV log with $ time main.sh, with only a single command enabled at any one time. Below are the recorded results.

Command                   1 keyword (1st run, 2nd run)  Avg     3 keywords (1st run, 2nd run)  Avg
awk                       0m23.131s, 0m24.440s          23.79s  1m58.739s, 2m3.509s            121.12s
grep                      0m11.696s, 0m11.508s          11.60s  0m11.370s, 0m11.461s           11.42s
sed (preferred solution)  0m11.354s, 0m10.551s          10.95s  0m10.492s, 0m9.955s            10.22s

The "real" time reported by time, averaged over two runs of awk, grep and sed with 1 and 3 keywords.
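
Each average is the simple mean of the two "real" values reported by time. As a small worked sketch (the helper names to_seconds and average are assumptions, not part of the original script), the awk figure for three keywords can be reproduced like this:

```bash
#!/bin/bash
# to_seconds (hypothetical helper): convert a value such as "1m58.739s"
# into seconds.
to_seconds() {
  awk -v t="$1" 'BEGIN {
    split(t, part, "m")        # part[1] = minutes, part[2] = "58.739s"
    sub(/s$/, "", part[2])     # drop the trailing "s"
    printf "%.3f\n", part[1] * 60 + part[2]
  }'
}

# average (hypothetical helper): mean of two values in seconds, 2 decimals.
average() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", (a + b) / 2 }'
}

first=$(to_seconds 1m58.739s)
second=$(to_seconds 2m3.509s)
echo "awk, 3 keywords: $(average "$first" "$second")s"
```

Here 1m58.739s and 2m3.509s become 118.739 and 123.509 seconds, whose mean is 121.12s after rounding, matching the recorded average.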

Optional Readings

grep: match all keywords, regardless of the order in which they appear on a line.

This is a search for “AND” conditions on the command line.

$ grep -Ewi "Premier League" "male_players.csv" | grep -Ewi "ST, CF" | grep -Ewi "Injury Prone"
  • -E, --extended-regexp Interpret PATTERN as an extended regular expression.
  • -e PATTERN, --regexp=PATTERN Use PATTERN as the pattern. This can be used to specify multiple search patterns, or to protect a pattern beginning with a hyphen (-).
  • -i, --ignore-case Ignore case distinctions when matching.
  • -w, --word-regexp Select only lines in which the match forms a whole word.
  • -H, --with-filename Print the file name for each match. This is the default when there is more than one file to search.
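
To see what -w, -i and the chained pipes do together, here is a minimal sketch on a hypothetical three-line sample (the sample lines and the variable names sample and and_matches are made up for illustration). Each grep passes on only the lines that contain its keyword, so only lines containing all keywords survive, whatever the order of the keywords on the line:

```bash
#!/bin/bash
# Line 1: all keywords, lower-case league name (kept because of -i)
# Line 2: "Premier Leagues" is not a whole-word match (dropped because of -w)
# Line 3: all keywords in a different order (kept)
sample='"ST, CF",premier league,"Injury Prone, Flair"
Injury Prone,"ST, CF",Premier Leagues
"Injury Prone",Premier League,"ST, CF"'

and_matches=$(printf '%s\n' "$sample" |
  grep -Ewi 'Premier League' |
  grep -Ewi 'ST, CF' |
  grep -Ewi 'Injury Prone')
printf '%s\n' "$and_matches"
```

Only the first and third sample lines are printed.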

Putting Command in a Variable

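You can use an array to store the arguments for a command, but I found that this does not help when the “parameters” are dynamic or are not really arguments at all, such as a pipe (|) followed by another command. The script below fails with the error grep: ___ No such file or directory in the later part of the code, because the "|" and "grep" elements stored in the array are passed to the first grep as literal arguments (and treated as file names) instead of being interpreted by the shell. It does work when you copy the string printed by the # debug line and paste it on the command line, because the shell then parses the pipes and quotes itself.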

#!/bin/bash

# http://mywiki.wooledge.org/BashFAQ/050
args=(-s "$subject" --flag "arg with spaces")
mail "${args[@]}"

# for loop here to form | grep keyword2 | grep keyword 3 | grep...
filter_list=(mod_jk "Dec 04") # array
for i in "${!filter_list[@]}" # with array keys
do
  if [ $i -eq 0 ]; then
    grep_args=(-Ewi "\"${filter_list[$i]}\"" "\"$log_path\"")
  else
    grep_args+=("|") # syntax error near unexpected token `|' if added below instead
    grep_args+=(grep -Ewi "\"${filter_list[$i]}\"") # cannot include pipe | here
  fi
done

grep "${grep_args[@]}" # actual
echo "grep ${grep_args[@]}" # debug
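
One way around this, shown here as a minimal sketch rather than as the answer from the Stack Overflow thread, is to let the shell build the pipeline itself instead of storing "|" in an array. The helper grep_all, the variable dynamic_matches and the sample log are assumptions for illustration; the function pipes its input through one grep -Ewi per keyword by calling itself for the remaining keywords:

```bash
#!/bin/bash
# grep_all (hypothetical helper): keep only the stdin lines that contain
# every keyword passed as an argument (AND condition).
grep_all() {
  if (( $# == 0 )); then
    cat                        # no keywords left: pass the lines through
  else
    local keyword=$1
    shift
    grep -Ewi -- "$keyword" | grep_all "$@"
  fi
}

# Hypothetical sample log, created only to make the sketch runnable
log_path=$(mktemp)
printf '%s\n' \
  'Dec 04 10:00:01 mod_jk worker error' \
  'Dec 05 10:00:02 mod_jk worker error' \
  'Dec 04 10:00:03 mod_ssl handshake' > "$log_path"

filter_list=(mod_jk "Dec 04")
dynamic_matches=$(grep_all "${filter_list[@]}" < "$log_path")
printf '%s\n' "$dynamic_matches"
rm -f -- "$log_path"
```

Because each keyword stays a separate quoted argument, keywords containing spaces need no extra escaping, and the array can hold any number of keywords.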

Conclusion

I spent 48 hours searching the internet for a solution and found that most answers revolve around “OR” conditions. When I tried to put the rest of the command line, which I assumed to be just ‘parameters’, into an array, it did not work as intended. I finally asked a question on Stack Overflow, because I needed an “AND” condition over keywords looped from an array. The benchmark showed that sed and grep perform similarly (about 10 to 12 seconds in every test), while awk took roughly twice as long with one keyword (23.79s) and more than ten times as long with three keywords (121.12s), because the awk solution loops over the keywords for every line in the log. I hope that the preferred solution identified here, sed, will help others as it helped me.

4 Comments

  1. Ricardo

    Awesome! I really appreciate seeing that these old tools keep alive and I try to see and use them always.
    Have you ever seen the ripgrep? the command is rg only, I did some tests and it’s really faster than grep.

  2. Ricardo

    I’m testing your script and used the ripgrep (rg).. it was the fastest, so beautiful!!

  3. Ricardo

    also, created another way to match the AND in awk. It’ll look for the pattern only once a record. It reduced the time a lot.

    # 3 keywords – A && B && C
    time awk 'BEGIN {IGNORECASE = 1} /\<Premier League\>/&&/\<ST, CF\>/&&/\<Injury Prone\>/' "male_players.csv" > awk-extracted2.log
