331

I have a ksh script that returns a long list of values, newline separated, and I want to see only the unique/distinct values. Is it possible to do this?

For example, say my output is file suffixes in a directory:

tar
gz
java
gz
java
tar
class
class

I want to see a list like:

tar
gz
java
class

8 Answers

579

You might want to look at the uniq and sort applications.

./yourscript.ksh | sort | uniq

(FYI: yes, the sort is necessary in this command line; uniq only strips duplicate lines that are immediately after each other.)

EDIT:

Contrary to what Aaron Digulla posted about uniq's command-line options:

Given the following input:

class
jar
jar
jar
bin
bin
java

uniq will output all lines exactly once:

class
jar
bin
java

uniq -d will output all lines that appear more than once, and it will print them once:

jar
bin

uniq -u will output all lines that appear exactly once, and it will print them once:

class
java
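
If you want to reproduce this yourself, here is a quick sketch (suffixes.txt is just a hypothetical file holding the input above; since the duplicates are already adjacent, no sort is needed for the demonstration):

printf '%s\n' class jar jar jar bin bin java > suffixes.txt
uniq suffixes.txt      # every line once: class jar bin java
uniq -d suffixes.txt   # only duplicated lines, once each: jar bin
uniq -u suffixes.txt   # only lines that appear exactly once: class java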

6 Comments

Just an FYI for latecomers: @AaronDigulla's answer has since been corrected.
Very good point, this: `the sort is necessary in this command line; uniq only strips duplicate lines that are immediately after each other`, which I have just learnt!
GNU sort features a -u version for giving the unique values too.
I figured out that uniq seems to process only adjacent lines (at least by default), meaning one must sort the input before feeding it to uniq.
I did some testing on 400 MB of data: sort | uniq took 95 seconds, sort -u took 77, and awk '!a[$0]++' from @ajak6 took 9 seconds. So awk wins, but it is also the hardest to remember.
117
./script.sh | sort -u

This is the same as monoxide's answer, but a bit more concise.
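
A quick sketch, using the suffix list from the question in place of the script's output:

printf 'tar\ngz\njava\ngz\njava\ntar\nclass\nclass\n' | sort -u

which prints the four suffixes in sorted order (class, gz, java, tar).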

5 Comments

You're being modest: your solution will also perform better (probably only noticeable with large data sets).
I think that should be more efficient than ... | sort | uniq because it is performed in one shot
@AdrianAntunez maybe it's also because sort -u doesn't need to update the sorted list each time it finds a value it has already encountered, while sort | uniq has to sort all items before passing them to uniq.
@mklement0 @AdrianAntunez At the first time I thought sort -u could be faster because any optimal comparison sort algorithm has O(n*log(n)) complexity, but it is possible to find all unique values with O(n) complexity using Hash Set data structure. Nonetheless, both sort -u and sort | uniq have almost the same performance and they both are slow. I have conducted some tests on my system, more info at gist.github.com/sda97ghb/690c227eb9a6b7fb9047913bfe0e431d
Thanks! Your solution worked for me, while ./script.sh | sort | uniq -u didn't output anything. Maybe because the output was too large? Although it wasn't so big, the output had 50_000 lines, with just 4 distinct values.
17

With AWK you can do:

 ./yourscript.ksh | awk '!a[$0]++'

I find it faster than sort and uniq.
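
For anyone wondering what the one-liner does, here is an expanded, commented equivalent (seen is just a more readable name for the array called a above):

./yourscript.ksh | awk '
    # seen is an associative array keyed by the whole input line ($0).
    # seen[$0]++ returns the previous count (0 on the first occurrence),
    # so the pattern !seen[$0]++ is true only the first time a line shows
    # up, and awk prints any line matching a true pattern by default.
    !seen[$0]++
'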

3 Comments

That's definitely my favorite way to do the job, thanks a lot! Especially for larger files, the sort | uniq solutions are probably not what you want.
I did some testing and this was 10 times faster than other solutions, but also 10x harder to remember :-)
Yeah, I'm not quite sure what awk is doing here. But thanks for the solution!!
16

With zsh you can do this:

% cat infile 
tar
more than one word
gz
java
gz
java
tar
class
class
zsh-5.0.0[t]% print -l "${(fu)$(<infile)}"
tar
more than one word
gz
java
class

Or you can use AWK:

% awk '!_[$0]++' infile    
tar
more than one word
gz
java
class
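
For reference, a commented breakdown of the zsh expansion used above, written as a tiny script (a sketch; infile is the same sample file):

#!/usr/bin/env zsh
# $(<infile)  command substitution that reads the whole file
# (f)         splits the result on newlines into separate elements
# (u)         keeps only the first occurrence of each element
# print -l    prints each remaining element on its own line
print -l "${(fu)$(<infile)}"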

5 Comments

Clever solutions that do not involve sorting the input. Caveats: The very-clever-but-cryptic awk solution (see stackoverflow.com/a/21200722/45375 for an explanation) will work with large files as long as the number of unique lines is small enough (as unique lines are kept in memory). The zsh solution reads the entire file into memory first, which may not be an option with large files. Also, as written, only lines with no embedded spaces are handled correctly; to fix this, use IFS=$'\n' read -d '' -r -A u <file; print -l ${(u)u} instead.
Correct. Or: (IFS=$'\n' u=($(<infile)); print -l "${(u)u[@]}")
Thanks, that's simpler (assuming you don't need to set variables needed outside the subshell). I'm curious as to when you need the [@] suffix to reference all elements of an array - seems that - at least as of version 5 - it works without it; or did you just add it for clarity?
@mklement0, you're right! I didn't think of it when I wrote the post. Actually, this should be sufficient: print -l "${(fu)$(<infile)}"
Fantastic, thanks for updating your post - I took the liberty of fixing the awk sample output, too.
14

Pipe them through sort and uniq. This removes all duplicates.

uniq -d gives only the duplicated lines; uniq -u gives only the lines that occur exactly once (it strips anything that has a duplicate).

3 Comments

gotta sort first by the looks of it
Yes, you do. Or more accurately, you need to group all the duplicate lines together. Sorting does this by definition though ;)
Also, uniq -u is NOT the default behaviour (see the edit in my answer for details)
11

For larger data sets where sorting may not be desirable, you can also use the following perl script:

./yourscript.ksh | perl -ne 'if (!defined $x{$_}) { print $_; $x{$_} = 1; }'

This basically just remembers every line it has already output so that it doesn't output it again.

It has the advantage over the "sort | uniq" solution that no sorting is required up front.
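
If it is easier to remember, the same idea can be written a bit more compactly (a sketch; it relies on the same hash-based bookkeeping, with the post-increment doing the marking):

./yourscript.ksh | perl -ne 'print unless $seen{$_}++'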

6 Comments

Note that sorting a very large file is not an issue per se with sort; it can sort files that are larger than the available RAM+swap. Perl, OTOH, will fail if there are only a few duplicates, because it keeps every unique line in memory.
Yes, it's a trade-off depending on the expected data. Perl is better for huge dataset with many duplicates (no disk-based storage required). Huge dataset with few duplicates should use sort (and disk storage). Small datasets can use either. Personally, I'd try Perl first, switch to sort if it fails.
Since sort only gives you a benefit if it has to swap to disk.
This is great when I want the first occurrence of every line. Sorting would break that.
Ultimately perl will be sorting the entries in some form to put into its dictionary (or whatever it is called in perl), so you can't actually avoid the processing time of a sort.
1

Unique, as requested (but not sorted);
uses fewer system resources for fewer than ~70 elements (as tested with time);
written to take input from stdin
(or modify it and include it in another script).
(Bash)

bag2set () {
    # Reduce a_bag to a_set by keeping only the first occurrence of each value.
    local -i i j n=${#a_bag[@]}
    for ((i = 0; i < n; i++)); do
        if [[ -n ${a_bag[i]} ]]; then
            a_set[i]=${a_bag[i]}
            a_bag[i]=''
            for ((j = i + 1; j < n; j++)); do
                # Blank out every later duplicate of the value just kept.
                [[ ${a_set[i]} == "${a_bag[j]}" ]] && a_bag[j]=''
            done
        fi
    done
}
declare -a a_bag=() a_set=()
declare -i i=0
# Read stdin line by line; IFS= and -r preserve whitespace and backslashes.
while IFS= read -r e; do
    a_bag[i]=$e
    i+=1
done
bag2set
echo "${a_set[@]}"

Comments

-1

I have a better tip for getting non-duplicate entries in a file:

awk '$0 != x ":FOO" && NR>1 {print x} {x=$0} END {print}' file_name | uniq -f1 -u

Comments
