How can I find all XML files in the current directory and all subdirectories that do not start with < on the first line?
I have tried this, but the grep doesn't work (it matches the filenames that find prints, not the first line of each file):
find . -type f -name '*.xml' | grep "^[^<]" | head -n 1
You have some solid answers already, but I'll offer an alternative: the XML spec is quite strict, and a file that doesn't start with < isn't actually XML at all.
So a simple approach might be to test whether the file is valid or not. Any XML parser can do this; here's an example using XML::Twig:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
foreach my $filename (@ARGV) {
    eval { XML::Twig->new->parsefile($filename); };
    print "File: $filename is not valid XML $@\n" if $@;
}
This can be condensed into a one-liner (note the loop variable is $_ here, and the foreach block needs its closing brace):
perl -MXML::Twig -e 'foreach (@ARGV) { eval { XML::Twig->new->parsefile($_) }; print "File: $_ is not valid XML $@\n" if $@; }' *.xml
If recursive traversal is important, then File::Find will also help:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
use File::Find;
sub check_valid_xml {
    # skip any files that don't end in '.xml' - use return, not next,
    # because this is an ordinary subroutine called by File::Find
    return unless m/\.xml$/;
    # validate this file
    eval { XML::Twig->new->parsefile($File::Find::name); };
    # report errors if detected - the parser dies on invalid XML
    if ($@) { print "File $File::Find::name is not valid XML $@"; }
}
find( \&check_valid_xml, "." );
This will detect any 'bad XML' which will include the files you've specified in your question.
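If installing a Perl module isn't an option, the xmllint tool from libxml2 can do the same well-formedness check from the shell. This is a sketch assuming xmllint is installed (it usually is on Linux and macOS, but it isn't part of POSIX):

```shell
# Report every .xml file under the current directory that xmllint
# cannot parse. xmllint --noout parses without printing the document;
# a non-zero exit status means the file is not well-formed XML.
find . -type f -name '*.xml' -exec sh -c '
    for f do
        xmllint --noout "$f" 2>/dev/null || printf "%s is not valid XML\n" "$f"
    done' sh {} +
```

Like the Twig version, this flags genuinely malformed XML, not just files whose first byte isn't <.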
To test the first line of each file and print it when it doesn't start with <, you can use xargs and awk:
find . -type f -name "*.xml" -print0 | xargs -0 -I{} awk 'NR==1&&!/^</' {}
To print just the filenames instead:
find . -type f -name "*.xml" -print0 | xargs -0 -I{} awk 'NR==1&&!/^</{print FILENAME}' {}
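As a quick sanity check (with made-up filenames in a scratch directory), the second pipeline prints only the file whose first line doesn't begin with <:

```shell
# Create one well-formed and one bad file, then list only the offender.
cd "$(mktemp -d)"
mkdir -p sub
printf '<?xml version="1.0"?>\n<root/>\n' > sub/good.xml
printf 'oops, not xml\n' > sub/bad.xml
find . -type f -name "*.xml" -print0 | xargs -0 -I{} awk 'NR==1&&!/^</{print FILENAME}' {}
# prints: ./sub/bad.xml
```

Because xargs -I{} runs one awk per file, NR==1 reliably refers to each file's first line.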
If your awk supports the nextfile statement (most do):
find . -name '*.xml' -type f \( -size 0 -print -o -exec awk '
!/^</ {print FILENAME}; {nextfile}' {} + \)
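A small scratch-directory demo (made-up filenames) shows why the -size 0 branch matters: an empty file has no first line for awk to test, so find has to report it directly:

```shell
# An empty file and a non-XML file should both be listed;
# a well-formed file should not.
cd "$(mktemp -d)"
printf '<root/>\n' > good.xml
printf 'not xml\n' > bad.xml
: > empty.xml    # caught by the -size 0 branch, never reaches awk
find . -name '*.xml' -type f \( -size 0 -print -o -exec awk '
    !/^</ {print FILENAME}; {nextfile}' {} + \)
# lists ./bad.xml and ./empty.xml (order depends on find)
```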
find .//. -name \*.xml -type f -exec head -n1 {} + |
sed -ne:n -e'\|^==> \.//\.|!{H;$!d' -e\} \
-ex -e'\|\.xml <==\n|!{G;x;d' -e\} \
-e's|[^/]*//\(.*\) <==\n[^<]*$|\1|p'
head prints ==> filename <== headers all on its own when given multiple files. So you can just -exec it and have sed watch its input for those headers, printing the names of the files whose first line doesn't start with a <.
If you just want to avoid listing filenames for files whose first character is <, that is as easily done. In fact, with GNU grep it can be easier:
find .//. -name \*.xml -type f -exec grep -EHaom1 '^.?' {} +|
sed -ne'\|^\.//\.|!{H;$!d' -e\} \
-e'x;\|\.xml:|!{G;x;d' -e\} \
-e's|:[^<]*$||p;$!d;x;s|||p'
We shouldn't directly test for the < char with grep because we might end up testing the third or fourth line if the first begins with a <, but what we can do is tell grep to stop after 1 match (-m1) and print only the match (-o) of 0 or 1 characters at the head of a line (-E '^.?'). This means grep will print our filenames like...
.//./path/to/xml.xml:.
And so all sed must do is ensure it gathers the entire filename (in case it contains newline characters), then test whether the last character in the string is <. If it is not, sed strips the last two characters and prints the result.
Pure bash:
shopt -s globstar
for i in **/*.xml; do
    read -N 1 h < "$i"
    if [[ $h != "<" ]]; then
        echo "found $i"
        # or do other stuff with "$i"
    fi
done
read -N 1 reads a single character from the file without having to fork or exec anything. If you just need a list of filenames, something built around find -print0 may be easier.
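If a NUL-delimited list is what you're after, the same read -N 1 test can be combined with find -print0. This is a sketch assuming bash 4+ (read -N and read -d '' are bashisms):

```shell
# Emit a NUL-delimited list of .xml files whose first byte is not '<',
# safe to feed into xargs -0 and friends even with odd filenames.
find . -type f -name '*.xml' -print0 |
while IFS= read -r -d '' f; do
    IFS= read -r -N 1 h < "$f"
    [ "$h" != "<" ] && printf '%s\0' "$f"
done
```

Empty files are included too: read fails and leaves h empty, which is not <.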
grep only, or is piping it through head an option? If so it's a piece of cake to just loop over all files and check for an unsuccessful grep after head -n 1 and then print the file name.
A file that doesn't start with < isn't actually XML at all, it's just something that looks a bit like it.
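The loop that comment sketches could look like this (a sketch, not a tuned solution; it forks head and grep once per file, so it's slower than the pure-bash read approach above):

```shell
# Print every .xml file whose first line does not start with '<'.
find . -type f -name '*.xml' | while IFS= read -r f; do
    head -n 1 "$f" | grep -q '^<' || printf '%s\n' "$f"
done
```

grep -q exits 0 on a match and non-zero otherwise, so the || branch fires exactly for the files we want.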