How can I find all XML files in the current directory and all subdirectories that do not start with < on the first line?
I have tried this, but the grep doesn't work (it matches the filenames that find prints, not the first line of each file):
find . -type f -name '*.xml' | grep "^[^<]" | head -n 1
You have some solid answers already, but I'll offer an alternative: the XML spec is quite strict, and a file that doesn't start with < isn't actually XML at all.
So a simple approach might be to test whether the file is valid or not. Any XML parser can do this; here's an example using XML::Twig:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
foreach my $filename (@ARGV) {
    eval { XML::Twig->new->parsefile($filename); };
    print "File: $filename is not valid XML $@\n" if $@;
}
This can be condensed into a one-liner (note the loop variable is $_ here, and the foreach block needs its closing brace):
perl -MXML::Twig -e 'foreach (@ARGV) { eval { XML::Twig->new->parsefile($_) }; print "File: $_ is not valid XML $@\n" if $@; }' *.xml
If recursive traversal is important, then File::Find will also help:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
use File::Find;
sub check_valid_xml {
    # skip any files that don't end in '.xml' - use return, not next,
    # because this is an ordinary subroutine called by File::Find
    return unless m/\.xml$/;
    # validate this file
    eval { XML::Twig->new->parsefile($File::Find::name); };
    # report errors if detected - the parser dies on invalid XML
    if ($@) { print "File $File::Find::name is not valid XML $@"; }
}
find( \&check_valid_xml, "." );
This will detect any 'bad XML' which will include the files you've specified in your question.
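If installing a Perl module isn't an option, the xmllint tool from libxml2 can do the same well-formedness check from the shell. This is a sketch assuming xmllint is installed (it usually is on Linux and macOS, but it isn't part of POSIX):

```shell
# Report every .xml file under the current directory that xmllint
# cannot parse. xmllint --noout parses without printing the document;
# a non-zero exit status means the file is not well-formed XML.
find . -type f -name '*.xml' -exec sh -c '
    for f do
        xmllint --noout "$f" 2>/dev/null || printf "%s is not valid XML\n" "$f"
    done' sh {} +
```

Like the Twig version, this flags genuinely malformed XML, not just files whose first byte isn't <.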
To test the first line of each file and print it when it doesn't start with <, you can use xargs and awk:
find . -type f -name "*.xml" -print0 | xargs -0 -I{} awk 'NR==1&&!/^</' {}
To print just the filenames instead:
find . -type f -name "*.xml" -print0 | xargs -0 -I{} awk 'NR==1&&!/^</{print FILENAME}' {}
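As a quick sanity check (with made-up filenames in a scratch directory), the second pipeline prints only the file whose first line doesn't begin with <:

```shell
# Create one well-formed and one bad file, then list only the offender.
cd "$(mktemp -d)"
mkdir -p sub
printf '<?xml version="1.0"?>\n<root/>\n' > sub/good.xml
printf 'oops, not xml\n' > sub/bad.xml
find . -type f -name "*.xml" -print0 | xargs -0 -I{} awk 'NR==1&&!/^</{print FILENAME}' {}
# prints: ./sub/bad.xml
```

Because xargs -I{} runs one awk per file, NR==1 reliably refers to each file's first line.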
If your awk supports the nextfile statement (most do):
find . -name '*.xml' -type f \( -size 0 -print -o -exec awk '
!/^</ {print FILENAME}; {nextfile}' {} + \)
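A small scratch-directory demo (made-up filenames) shows why the -size 0 branch matters: an empty file has no first line for awk to test, so find has to report it directly:

```shell
# An empty file and a non-XML file should both be listed;
# a well-formed file should not.
cd "$(mktemp -d)"
printf '<root/>\n' > good.xml
printf 'not xml\n' > bad.xml
: > empty.xml    # caught by the -size 0 branch, never reaches awk
find . -name '*.xml' -type f \( -size 0 -print -o -exec awk '
    !/^</ {print FILENAME}; {nextfile}' {} + \)
# lists ./bad.xml and ./empty.xml (order depends on find)
```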
find .//. -name \*.xml -type f -exec head -n1 {} + |
sed -ne:n -e'\|^==> \.//\.|!{H;$!d' -e\} \
-ex -e'\|\.xml <==\n|!{G;x;d' -e\} \
-e's|[^/]*//\(.*\) <==\n[^<]*$|\1|p'
head prints ==> filename <== headers all on its own when given multiple files. So you can just -exec it and have sed watch its input for those headers, printing the names of the files whose first line doesn't start with a <.
If you just want to avoid listing filenames for files whose first character is <, that is as easily done. In fact, with GNU grep it can be easier:
find .//. -name \*.xml -type f -exec grep -EHaom1 '^.?' {} +|
sed -ne'\|^\.//\.|!{H;$!d' -e\} \
-e'x;\|\.xml:|!{G;x;d' -e\} \
-e's|:[^<]*$||p;$!d;x;s|||p'
We shouldn't directly test for the < char with grep because we might end up testing the third or fourth line if the first begins with a <, but what we can do is tell grep to stop after 1 match (-m1) and print only the match (-o) of 0 or 1 characters at the head of a line (-E '^.?'). This means grep will print our filenames like...
.//./path/to/xml.xml:.
And so all sed must do is ensure it gathers the entire filename (in case it contains newline characters), then test whether the last character in the string is <. If it is not, sed strips the last two characters and prints the result.
Pure bash:
shopt -s globstar
for i in **/*.xml; do
    read -N 1 h < "$i"
    if [[ $h != "<" ]]; then
        echo "found $i"
        # or do other stuff with "$i"
    fi
done
read -N 1 reads a single character from the file without having to fork or exec anything. If you just need a list of filenames, something built around find -print0 may be easier.
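If a NUL-delimited list is what you're after, the same read -N 1 test can be combined with find -print0. This is a sketch assuming bash 4+ (read -N and read -d '' are bashisms):

```shell
# Emit a NUL-delimited list of .xml files whose first byte is not '<',
# safe to feed into xargs -0 and friends even with odd filenames.
find . -type f -name '*.xml' -print0 |
while IFS= read -r -d '' f; do
    IFS= read -r -N 1 h < "$f"
    [ "$h" != "<" ] && printf '%s\0' "$f"
done
```

Empty files are included too: read fails and leaves h empty, which is not <.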
grep only, or is piping it through head an option? If so it's a piece of cake to just loop over all files and check for an unsuccessful grep after head -n 1 and then print the file name.
A file that doesn't start with < isn't actually XML at all, it's just something that looks a bit like it.
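The loop that comment sketches could look like this (a sketch, not a tuned solution; it forks head and grep once per file, so it's slower than the pure-bash read approach above):

```shell
# Print every .xml file whose first line does not start with '<'.
find . -type f -name '*.xml' | while IFS= read -r f; do
    head -n 1 "$f" | grep -q '^<' || printf '%s\n' "$f"
done
```

grep -q exits 0 on a match and non-zero otherwise, so the || branch fires exactly for the files we want.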