A summary and explanation of several ways to remove duplicate lines from a file at the Bash command line on Linux machines/servers.
If you look at the following reference, you'll see there are many commands that can remove duplicate lines from a file.
shell - Remove duplicate entries using a Bash script - Stack Overflow
This time, comparing the output of each command produced some interesting results, so I'd like to introduce the commands along with those findings.
First, let's test with fuga.txt, which contains a duplicated hoge line as shown below.
$ cat > fuga.txt
hoge
fuga
foo
hoge
bar
$ cat fuga.txt | sed '$!N; /^\(.*\)\n\1$/!P; D'
hoge
fuga
foo
hoge
bar
$ cat fuga.txt | sort -u
bar
foo
fuga
hoge
$ cat fuga.txt | awk '!a[$0]++'
hoge
fuga
foo
bar
With sed '$!N; /^\(.*\)\n\1$/!P; D', the duplicate line 'hoge' was not removed, because the two occurrences are not consecutive.
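For the record, the sed one-liner behaves like uniq: it only collapses *consecutive* duplicates. A quick sketch with adjacent duplicates (my own test input, not from the original):

```shell
# Adjacent duplicates collapse; the later, non-adjacent hoge survives.
printf 'hoge\nhoge\nfuga\nhoge\n' | sed '$!N; /^\(.*\)\n\1$/!P; D'
# hoge
# fuga
# hoge

# uniq shows the same adjacency-only behavior.
printf 'hoge\nhoge\nfuga\nhoge\n' | uniq
```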
sort -u sorts before removing duplicate lines, so the result differs significantly from the original file content.
awk '!a[$0]++' seems to work without any particular problems.
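The idiom works because awk evaluates the pattern !a[$0]++ for every input line: a[$0] is zero (false) the first time a line is seen, so the negation is true and the default action, printing the line, fires; the post-increment then marks the line as seen. A minimal sketch (the longhand form is my own equivalent, not from the original):

```shell
# Short form: print each line only the first time it appears.
printf 'hoge\nfuga\nhoge\nhoge\n' | awk '!a[$0]++'
# hoge
# fuga

# Longhand equivalent of the same logic.
printf 'hoge\nfuga\nhoge\nhoge\n' \
  | awk '{ if (!seen[$0]) { print; seen[$0] = 1 } }'
```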
Next, let's verify with the following input. The results were interesting.
$ cat > fuga.txt
ใใใ
ใฑใใ
[ใใใ]
ใใใ
ใฑใทใใ
[ใใใ]
ใชใ
$ cat fuga.txt | sed '$!N; /^\(.*\)\n\1$/!P; D'
ใใใ
ใฑใใ
[ใใใ]
ใใใ
ใฑใทใใ
[ใใใ]
ใชใ
$ cat fuga.txt | sort -u
[ใใใ]
ใชใ
ใใใ
ใฑใทใใ
$ cat fuga.txt | awk '!a[$0]++'
ใใใ
ใฑใใ
[ใใใ]
ใฑใทใใ
ใชใ
Did you notice? The number of output lines differs between sort -u and awk '!a[$0]++'. The non-duplicate line 'ใฑใใ', which the awk version keeps, has disappeared from the sort version.
I thought about it, but couldn't pin down the cause; it is tempting to write this off as a bug.
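One plausible explanation, offered here as an assumption rather than a confirmed diagnosis: GNU sort compares lines using the locale's collation rules (LC_COLLATE), and with -u it keeps only one line from each run that compares equal, so in some UTF-8 locales distinct byte sequences can collate as equal and get merged. Forcing the C locale makes the comparison byte-wise, shown here on the ASCII sample from earlier:

```shell
# LC_ALL=C makes sort compare raw bytes, so distinct lines never merge.
printf 'hoge\nfuga\nfoo\nhoge\nbar\n' | LC_ALL=C sort -u
# bar
# foo
# fuga
# hoge
```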
Based on these results, awk '!a[$0]++' appears to be the most suitable command for removing duplicate lines from a file: it preserves the original order and showed none of the surprises above.
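If you do want to rely on sort while keeping the original order, a classic decorate/sort/undecorate trick (variants of which appear in the linked Stack Overflow thread) is to number the lines, deduplicate on the content field, then restore the original numbering. A sketch:

```shell
# Number lines, keep the first line for each distinct content (field 2+),
# then sort back into original order and strip the numbers.
printf 'hoge\nfuga\nfoo\nhoge\nbar\n' \
  | cat -n | sort -uk2 | sort -nk1 | cut -f2-
# hoge
# fuga
# foo
# bar
```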