Assume a large log file of several GBs and several million lines where each line contains a token identifying the user account that generated the line.
All tokens have the same length and can be found at the same position within each log line.
The goal is to figure out the number of bytes logged by each account.
One way of doing this is in multiple steps, like this:
awk -F "|" '{ print $5 }' trace.log | sort | uniq | xargs -l sh -c 'echo -n $0 && grep "$0" trace.log | wc -c'
where awk extracts the token (the 5th field when splitting on '|'), sort | uniq produces the list of unique tokens appearing in the file, and finally xargs greps for each token and counts the bytes of the matching lines.
Now this works, but it is terribly inefficient because the same (huge) file gets grepped once per unique token.
Is there a smarter way of achieving the same via shell commands? (where by smarter I mean faster and without consuming tons of RAM or temporary storage, like sorting the whole file in RAM or sorting it to a tmp file).
Answer
Try:
awk -F "|" '{ a[$5]+=1+length($0) } END{for (name in a) print name,a[name]}' trace.log
Example
Let's consider this test file:
$ cat trace.log
1|2|3|4|jerry|6
a|b|c|d|phil|f
1|2|3|4|jerry|6
The original command produces this output:
$ awk -F "|" '{ print $5 }' trace.log | sort | uniq | xargs -l sh -c 'echo -n $0 && grep "$0" trace.log | wc -c'
jerry32
phil15
The suggested command, which loops through the file just once, produces this output:
$ awk -F "|" '{ a[$5]+=1+length($0) } END{for (name in a) print name,a[name]}' trace.log
jerry 32
phil 15
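If a sorted report is handy, the output of the one-pass command (one short line per account) can simply be piped through sort. A minimal sketch, assuming a standard sort and the whitespace-separated output shown above, ordering accounts by byte count in descending order:
$ awk -F "|" '{ a[$5]+=1+length($0) } END{for (name in a) print name,a[name]}' trace.log | sort -k2,2 -rn
jerry 32
phil 15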
How it works
-F "|"
This sets the field separator for input.
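For instance, extracting just the 5th field from one of the test lines above (a quick sanity check, not part of the final command):
$ echo '1|2|3|4|jerry|6' | awk -F "|" '{ print $5 }'
jerry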
a[$5]+=1+length($0)
For each line, we add the length of the line to the running total stored in associative array a under this line's user name. The quantity length($0) does not include the newline that ends the line. Consequently, we add one to account for the \n.
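To see the missing newline byte concretely, compare awk's length($0) with wc -c for a single test line (the exact whitespace in wc's output may differ between implementations):
$ printf '1|2|3|4|jerry|6\n' | awk '{ print length($0) }'
15
$ printf '1|2|3|4|jerry|6\n' | wc -c
16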
END{for (name in a) print name,a[name]}
After we have read through the file once, we print out the sums.