How to know if a text file is a subset of another

Question

I am trying to find a way to determine if a text file is a subset of another.

For example:

foo
bar

is a subset of

foo
bar
pluto

While:

foo
pluto

and

foo
bar

are not subsets of each other.

Is there a way to do this with a command?

This check must be a cross check, and it has to return:

file1 subset of file2 :    True
file2 subset of file1 :    True
otherwise             :    False

Potentially more efficient solution (if files are also ordered): github.com/barrycarter/bcapps/blob/master/…
– user2267, Commented Dec 26, 2014 at 16:03

Stéphane Chazelas · Accepted Answer · 2023-05-16 07:48:55Z

15

If the three example files are named file1, file2 and file3 in order of appearance, you can do it with the following one-liner, which tests whether the bytes of one file occur contiguously inside the other:

 # python3 -c "x=open('file1', mode='rb').read(); y=open('file2', mode='rb').read(); print(x in y or y in x)"
 True
 # python3 -c "x=open('file2', mode='rb').read(); y=open('file1', mode='rb').read(); print(x in y or y in x)"
 True
 # python3 -c "x=open('file1', mode='rb').read(); y=open('file3', mode='rb').read(); print(x in y or y in x)"
 False
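
Note that the in operator on bytes objects looks for a contiguous run of bytes, so the one-liner treats each file as a single string. If the files are instead meant as unordered sets of lines, a minimal sketch could look like the following. The names read_lines and crosscheck are made up for illustration, and the result is meant to follow the True/False table from the question.

```python
import os
import tempfile


def read_lines(path):
    """Return the set of lines in a file, without line terminators."""
    with open(path, "rb") as f:
        return set(f.read().splitlines())


def crosscheck(path_a, path_b):
    """True if the line set of either file is contained in the other's."""
    a, b = read_lines(path_a), read_lines(path_b)
    return a <= b or b <= a


if __name__ == "__main__":
    # Recreate the three example files from the question in a scratch directory.
    samples = {
        "file1": b"foo\nbar\n",
        "file2": b"foo\nbar\npluto\n",
        "file3": b"foo\npluto\n",
    }
    with tempfile.TemporaryDirectory() as tmp:
        paths = {}
        for name, data in samples.items():
            paths[name] = os.path.join(tmp, name)
            with open(paths[name], "wb") as f:
                f.write(data)
        print(crosscheck(paths["file1"], paths["file2"]))  # True
        print(crosscheck(paths["file1"], paths["file3"]))  # False
```

Under this reading file3 counts as a subset of file2, whereas a byte-substring test reports False for that pair because foo and pluto are not adjacent in file2.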

edited May 16, 2023 at 7:48

Stéphane Chazelas


answered Feb 12, 2014 at 13:03

Timo


Thanks for your answer, +1. I don't know whether I should accept my own answer, because yours is not Unix/Linux-specific and mine is a bit faster, as far as I tested it. What do you think?
– gc5, Commented Feb 12, 2014 at 13:12

You're welcome. There are of course other solutions with more Unix-specific tools, but this seems a good use of Python's in operator.
– Timo, Commented Feb 12, 2014 at 13:21

There is a Python command-line wrapper named pyp that makes it more Unix-like, with piping built in: code.google.com/p/pyp. I think it would be trivial to turn this solution into a more Unix-like one-liner tool.
– IBr, Commented Nov 14, 2014 at 9:15


Stéphane Chazelas · Accepted Answer · 2019-06-02 08:21:51Z

With perl:

if perl -0777 -e '$n = <>; $h = <>; exit(index($h,$n)<0)' needle.txt haystack.txt
then echo needle.txt is found in haystack.txt
fi

-0octal sets the input record separator. When the octal number is greater than 0377 (the maximum byte value), there is no separator, which is equivalent to $/ = undef. In that case, <> returns the full content of a single file; this is slurp mode.

Once we have the content of the files in the two variables $h and $n, we can use index() to determine if one is found in the other.

That means, however, that both files are held in memory in full, so this method won't work for very large files.

For mmappable files (usually includes regular files and most seekable files like block devices), that can be worked around by using mmap() on the files, like with the Sys::Mmap perl module:

if 
  perl -MSys::Mmap -le '
    open N, "<", $ARGV[0] or die "$ARGV[0]: $!";
    open H, "<", $ARGV[1] or die "$ARGV[1]: $!";
    mmap($n, 0, PROT_READ, MAP_SHARED, N);
    mmap($h, 0, PROT_READ, MAP_SHARED, H);
    exit (index($h, $n) < 0)' needle.txt haystack.txt
then
  echo needle.txt is found in haystack.txt
fi
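
For comparison with the Python answer above, the same memory-mapping idea can be sketched with Python's standard mmap module. This is only an illustration under a few assumptions: both arguments are regular files (so os.path.getsize is meaningful), and contains is a made-up helper name. Empty files are handled before mapping because a zero-length file cannot be mapped.

```python
import mmap
import os


def contains(haystack_path, needle_path):
    """True if the bytes of needle_path occur contiguously in haystack_path."""
    needle_size = os.path.getsize(needle_path)
    if needle_size == 0:
        return True   # the empty byte string is found in anything
    if needle_size > os.path.getsize(haystack_path):
        return False  # also covers an empty haystack, which cannot be mapped
    with open(needle_path, "rb") as nf, open(haystack_path, "rb") as hf:
        with mmap.mmap(nf.fileno(), 0, access=mmap.ACCESS_READ) as n, \
             mmap.mmap(hf.fileno(), 0, access=mmap.ACCESS_READ) as h:
            # find() accepts any bytes-like object, including another mapping,
            # so neither file is copied into a Python bytes object first.
            return h.find(n) != -1
```

Calling contains("haystack.txt", "needle.txt") plays the role of index($h, $n) >= 0 in the perl version.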

alecbz · Accepted Answer · 2018-01-24 15:56:37Z

4

From http://www.catonmat.net/blog/set-operations-in-unix-shell/:

Comm compares two sorted files line by line. It may be run in such a way that it outputs lines that appear only in the first specified file. If the first file is a subset of the second, then all the lines in the 1st file also appear in the 2nd, so no output is produced:
$ comm -23 <(sort subset | uniq) <(sort set | uniq) | head -1
# comm returns no output if subset ⊆ set
# comm outputs something if subset ⊈ set

answered Jan 24, 2018 at 15:56

alecbz


Given a three line file { a, b, c } and a two line file { b, a } your proposed solution will wrongly claim that the second is contained in the first. It will also fail for { a, b, c } and { a, a, b, a, a, b }, claiming that the second is contained in the first.
– Chris Davies, Commented Sep 19, 2025 at 5:28

@ChrisDavies I assumed OP is considering these files as sets of lines. You're correct that this does not work if considering the files as multisets of lines, or if you're looking for substring or subsequence matching instead of set membership.
– alecbz, Commented Oct 7, 2025 at 0:23
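
The difference between these readings can be made concrete with a small sketch, using the example files from the comments as lists of lines. The helper names is_multiset_subset and is_subsequence are hypothetical, chosen only for this illustration.

```python
from collections import Counter


def is_multiset_subset(a_lines, b_lines):
    """Every line of a occurs in b at least as many times as it does in a."""
    # Counter subtraction keeps only positive counts, so an empty result
    # means b has enough copies of every line in a.
    return not (Counter(a_lines) - Counter(b_lines))


def is_subsequence(a_lines, b_lines):
    """The lines of a occur in b in the same order, with gaps allowed."""
    remaining = iter(b_lines)
    # `line in remaining` consumes the iterator up to the first match,
    # so each search resumes after the previous hit.
    return all(line in remaining for line in a_lines)


big = ["a", "b", "c"]
swapped = ["b", "a"]
dupes = ["a", "a", "b", "a", "a", "b"]

print(set(swapped) <= set(big))         # True  (set of lines)
print(is_subsequence(swapped, big))     # False (order matters)
print(set(dupes) <= set(big))           # True  (set of lines)
print(is_multiset_subset(dupes, big))   # False (counts matter)
```

Which of these is the right test depends on what "subset" is supposed to mean for the files at hand.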


Community · Accepted Answer · 2017-04-13 12:37:03Z

2

I found a solution thanks to this question

Basically I am testing two files a.txt and b.txt with this script:

#!/bin/bash

first_cmp=$(diff --unchanged-line-format= --old-line-format= --new-line-format='%L' "$1" "$2" | wc -l)
second_cmp=$(diff --unchanged-line-format= --old-line-format= --new-line-format='%L' "$2" "$1" | wc -l)

if [ "$first_cmp" -eq 0 ] || [ "$second_cmp" -eq 0 ]
then
    echo "Subset"
    exit 0
else
    echo "Not subset"
    exit 1
fi

If one file is a subset of the other, the script returns 0 (true); otherwise it returns 1.

edited Apr 13, 2017 at 12:37

CommunityBot


answered Feb 12, 2014 at 13:07

gc5


What does %L do? This script doesn't seem to work, and I am trying to debug it...
– Alex, Commented May 24, 2017 at 16:18

I actually don't remember the meaning of %L, it was three years ago. From man diff (current version) %L means "contents of line".
– gc5, Commented May 24, 2017 at 18:56

%L prints the contents of the "new" line. In other words, don't print anything for unchanged-lines or old-lines, but print the contents of the line for new-lines.
– PLG, Commented Sep 26, 2017 at 11:44

This script works for me, out of the box!
– PLG, Commented Sep 26, 2017 at 17:56


Chris Davies · Accepted Answer · 2025-09-19 05:34:20Z

If f1 is a subset of f2, then f1 - f2 is the empty set. Building on that, we can write an is_subset function on top of a set_diff function, following "Set difference between 2 text files":

sort_files () {
  f1_sorted="$1.sorted"
  f2_sorted="$2.sorted"

  if [ ! -f "$f1_sorted" ]; then
    sort -u "$1" > "$f1_sorted"
  fi

  if [ ! -f "$f2_sorted" ]; then
    sort -u "$2" > "$f2_sorted"
  fi
}

remove_sorted_files () {
  f1_sorted="$1.sorted"
  f2_sorted="$2.sorted"
  rm -f "$f1_sorted"
  rm -f "$f2_sorted"
}

set_union () {
  sort_files "$1" "$2"
  cat "$1.sorted" "$2.sorted" | sort | uniq
  remove_sorted_files "$1" "$2"
}

set_diff () {
  sort_files "$1" "$2"
  cat "$1.sorted" "$2.sorted" "$2.sorted" | sort | uniq -u
  remove_sorted_files "$1" "$2"
}

rset_diff () {
  sort_files "$1" "$2"
  cat "$1.sorted" "$2.sorted" "$1.sorted" | sort | uniq -u
  remove_sorted_files "$1" "$2"
}

is_subset () {
  # set_diff sorts the inputs and removes its temporary files itself
  output=$(set_diff "$1" "$2")

  # f1 is a subset of f2 exactly when f1 - f2 is empty
  [ -z "$output" ]
}
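
When the inputs are already sorted and free of duplicates, like the .sorted files above, the subset test can also be done in a single pass without computing the whole difference. The following is only a sketch of that idea in Python, not part of the script above. It assumes the files were sorted in plain byte order (for example with LC_ALL=C sort -u) so that Python's bytes comparison agrees with the sort order, and the function names are made up for illustration.

```python
def sorted_subset(small_lines, big_lines):
    """One-pass subset test over two sorted, duplicate-free iterables."""
    big_iter = iter(big_lines)
    for wanted in small_lines:
        for candidate in big_iter:
            if candidate == wanted:
                break           # found; move on to the next wanted line
            if candidate > wanted:
                return False    # passed the spot where it would have been
        else:
            return False        # big ran out before wanted was found
    return True


def sorted_file_subset(small_path, big_path):
    """Apply sorted_subset to two files, reading them line by line."""
    with open(small_path, "rb") as small, open(big_path, "rb") as big:
        return sorted_subset(small, big)
```

Each file is read at most once and only one line of each is held in memory at a time, which helps when the files are large.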

Should this script start with #!/bin/bash?
– Alex, Commented May 24, 2017 at 16:20

Alex Escodro · Accepted Answer · 2023-05-16 07:33:34Z

I had to do this just now, and while searching for an answer, I thought of an approach using diff + grep in bash:

#!/bin/bash

subset()
{
    ! diff --ignore-blank-lines "$1" "$2" | grep '^<' > /dev/null
}

crosscheck()
{
    subset "$1" "$2" || subset "$2" "$1" 
}

echo -e 'foo\nbar'        > file1
echo -e 'foo\nbar\npluto' > file2
echo -e 'foo\npluto'      > file3

echo; echo '  file1'; cat file1
echo; echo '  file2'; cat file2
echo; echo '  file3'; cat file3

echo
crosscheck file1 file2 && echo file1 is a subset of file2, or file2 is a subset of file1, or they are the same
crosscheck file2 file3 && echo file2 is a subset of file3, or file3 is a subset of file2, or they are the same
crosscheck file3 file1 || echo file3 and file1 are neither one subset of the other

rm file1 file2 file3

August Karlstrom · Accepted Answer · 2023-11-24 16:56:01Z

0

Here is a (POSIX-compatible) solution in AWK that checks whether file1 is a superset of file2 (exit status 0 if it is, 1 otherwise):

awk 'FILENAME == ARGV[1] { lines[$0] = 1; next } \
    FILENAME == ARGV[2] && ! lines[$0] { exit 1 }' file1 file2

edited Nov 24, 2023 at 16:56

answered Nov 24, 2023 at 16:48

August Karlstrom


