331

I have a ksh script that returns a long list of values, newline separated, and I want to see only the unique/distinct values. Is it possible to do this?

For example, say my output is file suffixes in a directory:

tar
gz
java
gz
java
tar
class
class

I want to see a list like:

tar
gz
java
class

8 Answers

579

You might want to look at the uniq and sort applications.

./yourscript.ksh | sort | uniq

(FYI: yes, the sort is necessary in this command line; uniq only strips duplicate lines that are immediately after each other.)

EDIT:

Contrary to what Aaron Digulla posted about uniq's command-line options:

Given the following input:

class
jar
jar
jar
bin
bin
java

uniq will output all lines exactly once:

class
jar
bin
java

uniq -d will output all lines that appear more than once, and it will print them once:

jar
bin

uniq -u will output all lines that appear exactly once, and it will print them once:

class
java
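
If you want to reproduce this yourself, here is a quick sketch (suffixes.txt is just a hypothetical file holding the input above; since the duplicates are already adjacent, no sort is needed for the demonstration):

printf '%s\n' class jar jar jar bin bin java > suffixes.txt
uniq suffixes.txt      # every line once: class jar bin java
uniq -d suffixes.txt   # only duplicated lines, once each: jar bin
uniq -u suffixes.txt   # only lines that appear exactly once: class java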

6 Comments

Just an FYI for latecomers: @AaronDigulla's answer has since been corrected.
Very good point, this: `the sort is necessary in this command line; uniq only strips duplicate lines that are immediately after each other`, which I have just learnt!
GNU sort features a -u version for giving the unique values too.
I figured out that uniq seems to process only adjacent lines (at least by default), meaning one must sort the input before feeding it to uniq.
I did some testing on 400 MB of data: sort | uniq took 95 seconds, sort -u took 77, and awk '!a[$0]++' from @ajak6 took 9 seconds. So awk wins, but it is also the hardest to remember.
117
./script.sh | sort -u

This is the same as monoxide's answer, but a bit more concise.
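
A quick sketch, using the suffix list from the question in place of the script's output:

printf 'tar\ngz\njava\ngz\njava\ntar\nclass\nclass\n' | sort -u

which prints the four suffixes in sorted order (class, gz, java, tar).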

5 Comments

You're being modest: your solution will also perform better (probably only noticeable with large data sets).
I think that should be more efficient than ... | sort | uniq because it is performed in one shot
@AdrianAntunez maybe it's also because sort -u doesn't need to update the sorted list each time it finds a value it has already encountered, while sort | uniq has to sort all items before passing them to uniq.
@mklement0 @AdrianAntunez At the first time I thought sort -u could be faster because any optimal comparison sort algorithm has O(n*log(n)) complexity, but it is possible to find all unique values with O(n) complexity using Hash Set data structure. Nonetheless, both sort -u and sort | uniq have almost the same performance and they both are slow. I have conducted some tests on my system, more info at gist.github.com/sda97ghb/690c227eb9a6b7fb9047913bfe0e431d
Thanks! Your solution worked for me, while ./script.sh | sort | uniq -u didn't output anything. Maybe because the output was too large? Although it wasn't so big, the output had 50_000 lines, with just 4 distinct values.
17

With AWK you can do:

 ./yourscript.ksh | awk '!a[$0]++'

I find it faster than sort and uniq.
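
For anyone wondering what the one-liner does, here is an expanded, commented equivalent (seen is just a more readable name for the array called a above):

./yourscript.ksh | awk '
    # seen is an associative array keyed by the whole input line ($0).
    # seen[$0]++ returns the previous count (0 on the first occurrence),
    # so the pattern !seen[$0]++ is true only the first time a line shows
    # up, and awk prints any line matching a true pattern by default.
    !seen[$0]++
'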

3 Comments

That's definitely my favorite way to do the job, thanks a lot! Especially for larger files, the sort | uniq solutions are probably not what you want.
I did some testing and this was 10 times faster than other solutions, but also 10x harder to remember :-)
Yeah, I'm not quite sure what awk is doing here. But thanks for the solution!!
16

With zsh you can do this:

% cat infile 
tar
more than one word
gz
java
gz
java
tar
class
class
zsh-5.0.0[t]% print -l "${(fu)$(<infile)}"
tar
more than one word
gz
java
class

Or you can use AWK:

% awk '!_[$0]++' infile    
tar
more than one word
gz
java
class
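
For reference, a commented breakdown of the zsh expansion used above, written as a tiny script (a sketch; infile is the same sample file):

#!/usr/bin/env zsh
# $(<infile)  command substitution that reads the whole file
# (f)         splits the result on newlines into separate elements
# (u)         keeps only the first occurrence of each element
# print -l    prints each remaining element on its own line
print -l "${(fu)$(<infile)}"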

5 Comments

Clever solutions that do not involve sorting the input. Caveats: The very-clever-but-cryptic awk solution (see stackoverflow.com/a/21200722/45375 for an explanation) will work with large files as long as the number of unique lines is small enough (as unique lines are kept in memory). The zsh solution reads the entire file into memory first, which may not be an option with large files. Also, as written, only lines with no embedded spaces are handled correctly; to fix this, use IFS=$'\n' read -d '' -r -A u <file; print -l ${(u)u} instead.
Correct. Or: (IFS=$'\n' u=($(<infile)); print -l "${(u)u[@]}")
Thanks, that's simpler (assuming you don't need to set variables needed outside the subshell). I'm curious as to when you need the [@] suffix to reference all elements of an array - seems that - at least as of version 5 - it works without it; or did you just add it for clarity?
@mklement0, you're right! I didn't think of it when I wrote the post. Actually, this should be sufficient: print -l "${(fu)$(<infile)}"
Fantastic, thanks for updating your post - I took the liberty of fixing the awk sample output, too.
14

Pipe them through sort and uniq. This removes all duplicates.

uniq -d gives only the duplicated lines; uniq -u gives only the lines that occur exactly once (it strips anything that has a duplicate).

3 Comments

gotta sort first by the looks of it
Yes, you do. Or more accurately, you need to group all the duplicate lines together. Sorting does this by definition though ;)
Also, uniq -u is NOT the default behaviour (see the edit in my answer for details)
11

For larger data sets where sorting may not be desirable, you can also use the following perl script:

./yourscript.ksh | perl -ne 'if (!defined $x{$_}) { print $_; $x{$_} = 1; }'

This basically just remembers every line it has already output so that it doesn't output it again.

It has the advantage over the "sort | uniq" solution that no sorting is required up front.
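
If it is easier to remember, the same idea can be written a bit more compactly (a sketch; it relies on the same hash-based bookkeeping, with the post-increment doing the marking):

./yourscript.ksh | perl -ne 'print unless $seen{$_}++'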

6 Comments

Note that sorting a very large file is not an issue per se with sort; it can sort files that are larger than the available RAM+swap. Perl, OTOH, will fail if there are only a few duplicates, because it keeps every unique line in memory.
Yes, it's a trade-off depending on the expected data. Perl is better for huge dataset with many duplicates (no disk-based storage required). Huge dataset with few duplicates should use sort (and disk storage). Small datasets can use either. Personally, I'd try Perl first, switch to sort if it fails.
Since sort only gives you a benefit if it has to swap to disk.
This is great when I want the first occurrence of every line. Sorting would break that.
Ultimately perl will be sorting the entries in some form to put into its dictionary (or whatever it is called in perl), so you can't actually avoid the processing time of a sort.
1

Unique, as requested (but not sorted);
uses fewer system resources for fewer than ~70 elements (as tested with time);
written to take input from stdin
(or modify it and include it in another script).
(Bash)

bag2set () {
    # Reduce a_bag to a_set by keeping only the first occurrence of each value.
    local -i i j n=${#a_bag[@]}
    for ((i = 0; i < n; i++)); do
        if [[ -n ${a_bag[i]} ]]; then
            a_set[i]=${a_bag[i]}
            a_bag[i]=''
            for ((j = i + 1; j < n; j++)); do
                # Blank out every later duplicate of the value just kept.
                [[ ${a_set[i]} == "${a_bag[j]}" ]] && a_bag[j]=''
            done
        fi
    done
}
declare -a a_bag=() a_set=()
declare -i i=0
# Read stdin line by line; IFS= and -r preserve whitespace and backslashes.
while IFS= read -r e; do
    a_bag[i]=$e
    i+=1
done
bag2set
echo "${a_set[@]}"

Comments

-1

I have a better tip for getting non-duplicate entries in a file:

awk '$0 != x ":FOO" && NR>1 {print x} {x=$0} END {print}' file_name | uniq -f1 -u

Comments
