1

My files contain data that is not well indented, for example:

<?xml version="1.0" encoding="UTF-8" ?><ns0:collection
xmlns:ns0="http://namspace/Service/1.0"><Record>
.
.</Record></ns0:collection>

I have to merge N such files into one file. To do that, I need the following:

  1. I need to remove only </ns0:collection> closing tag from the first file
  2. remove both <?xml version="1.0" encoding="UTF-8" ?><ns0:collection xmlns:ns0="http://namspace/Service/1.0"> and </ns0:collection> in the next (n-1) files
  3. Have to remove only <?xml version="1.0" encoding="UTF-8" ?><ns0:collection xmlns:ns0="http://namspace/Service/1.0"> in the last file and merge all of them together

I tried using sed to process the first file, but it produces nothing: merged.xml is empty.

sed '/<\/ns0:collection>/d' $file1 > merged.xml

Any suggestions?

2
  • extend your input showing actual Record nodes Commented Jan 10, 2018 at 9:25
  • This actually sounds like an XY problem - you don't want to delete tags from your XML, you want to merge some files. Commented Jan 12, 2018 at 10:36

3 Answers 3

5

You didn't say that you can only use sed, so if you have access to xml_grep (see Merge multiple XML files from command line, second answer), I would recommend it: it does most of the heavy lifting for you, and a simple merge like this can be done in one command:

xml_grep --cond Record --wrap "ns0:collection" --descr 'xmlns:ns0="http://namespace/Service/1.0"' --encoding "UTF-8" *.xml

Test files:

test.xml

<?xml version="1.0" encoding="UTF-8" ?><ns0:collection
xmlns:ns0="http://namespace/Service/1.0"><Record>
Test
</Record></ns0:collection>

test1.xml

<?xml version="1.0" encoding="UTF-8" ?><ns0:collection
xmlns:ns0="http://namespace/Service/1.0"><Record>
Test 1<a>a</a><b c="c">d</b>
</Record></ns0:collection>

Result

<?xml version="1.0" encoding="UTF-8" ?>
<ns0:collection xmlns:ns0="http://namespace/Service/1.0">
<Record>
Test 1<a>a</a><b c="c">d</b></Record><Record>
Test
</Record>
</ns0:collection>

I prefer to use XML-aware tools when dealing with XML files, because the chance of messing up the structure with sed and friends is quite high and you can easily end up with a malformed XML document!
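As a small illustration of how easily this goes wrong, here is a sketch using the sample layout from the question (the variable names are only for this example). sed's d command deletes the whole matching line, so it takes </Record> along with </ns0:collection>, while s/// removes only the matched text:

```shell
#!/bin/sh
# Sample document laid out like the one in the question.
sample=$(printf '%s\n' \
    '<?xml version="1.0" encoding="UTF-8" ?><ns0:collection' \
    'xmlns:ns0="http://namespace/Service/1.0"><Record>' \
    'Test' \
    '</Record></ns0:collection>')

# d deletes every line that matches, so </Record> on the same line is lost too.
deleted=$(printf '%s\n' "$sample" | sed '/<\/ns0:collection>/d')

# s/// removes only the matched text and leaves the rest of the line alone.
substituted=$(printf '%s\n' "$sample" | sed 's/<\/ns0:collection>//')

printf '%s\n' "--- after d ---" "$deleted" "--- after s ---" "$substituted"
```

Neither command knows anything about XML structure; both depend entirely on where the line breaks happen to fall.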

1
  • Agreed. XML is a contextual language, and regular expressions ... aren't. They're always going to be a bit hacky. Commented Jan 12, 2018 at 10:50
0

I would suggest that sed isn't a good fit for processing XML, and that you use a parser instead.

I would also suggest that you have an XY problem here - it's not about deleting tags, but rather merging XML files.

Personally - I like perl and XML::Twig:

#!/usr/bin/env perl
use strict;
use warnings;

#load the parser
use XML::Twig; 

#get our file list - we use the "first" file as the basis.
#can use sort on this list if desired. 
my ( $first_file, @other_files ) = glob ( 'C://tmp//xmltest/*.xml' ); 

#Our 'parent' document. 
my $doc = XML::Twig -> new -> parsefile ( $first_file ); 


foreach my $file ( @other_files ) { 
   my $mergedoc = XML::Twig -> new -> parsefile ( $file ); 

   #//Record means any <Record> node anywhere in the tree. 
   foreach my $record ( $mergedoc -> get_xpath ( '//Record' ) ) {
      $record -> cut;
      #paste it into our parent doc, as the last child of the root
      #(this also works when the root has no children yet).
      $record -> paste ( last_child => $doc -> root );
   }
}

#set output formatting (optional)
$doc -> set_pretty_print ('indented_a'); 

#print to STDOUT.
$doc -> print;

#write to output file too
open ( my $output, '>', 'combined.xml' ) or die $!;
print {$output} $doc -> sprint;
close ( $output );

This deliberately extracts the Record elements from the target XML, and just merges those between the documents. However it's a flexible approach - xpath is quite powerful, and is the XML equivalent of regex - but works better because it's context aware, where regex is not.

0

Solutions:

  1. Remove only the closing tag from the first file:

    sed -i.bak -e 's/<\/ns0:collection>/ /' -e 's/<\/Record>/ /' n0
    
  2. remove both <?xml version="1.0" encoding="UTF-8" ?><ns0:collection xmlns:ns0="http://namspace/Service/1.0"> and </ns0:collection> in the next (n-1) files:

    sed -i.bak -e 's/<?xml version="1.0" encoding="UTF-8" ?>.*<ns0:collection/ /' -e 's/xmlns.*/ /' -e 's/<\/R.*>.*>/ /' n1
    
    • Doing so for file name range:

      find . -type f -name "n[1-3]" -exec sed -i.bak -e 's/<?xml version="1.0" encoding="UTF-8" ?>.*<ns0:collection/ /' -e 's/xmlns.*/ /' -e 's/<\/R.*>.*>/ /' {} \;
      
  3. Remove only <?xml version="1.0" encoding="UTF-8" ?><ns0:collection xmlns:ns0="http://namspace/Service/1.0"> in the last file, then merge all of them together:

    sed -i.bak -e 's/<?xml version="1.0" encoding="UTF-8" ?>.*<ns0:collection/ /' -e 's/xmlns.*/ /' ne
    

Then finally join them:

cat n0 n[1-3] ne > joined

I used the following files: n0, n1, n2, n3, and ne. I added the following text to each:

<?xml version="1.0" encoding="UTF-8" ?><ns0:collection
xmlns:ns0="http://namspace/Service/1.0"><Record>
hello from nigeria
</Record></ns0:collection>

The resulting file, joined, looked like this:

<?xml version="1.0" encoding="UTF-8" ?><ns0:collection
xmlns:ns0="http://namspace/Service/1.0"><Record>
hello from nigeria



hello from nigeria



hello from nigeria



hello from nigeria



hello from nigeria
</Record></ns0:collection>

Note:

  1. From the first problem, I see that you need to remove both </Record></ns0:collection>, not just </ns0:collection>. So I took the liberty of doing that; otherwise we would have a duplicate </Record> entry when the files are merged.

  2. You will have to adjust the filenames so that one command can run over them all; here I used n[1-3]. Pick whatever works best for you.

  3. PLEASE RUN A TEST FIRST AND CHECK THE RESULTS. Here I used -i.bak so sed creates a backup automatically.
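If you would rather leave the input files untouched and keep every <Record>...</Record> pair intact, the three steps can also be sketched as one function that writes the merged document to standard output. This is only a sketch: the names strip_open, strip_close and merge_collections are made up for this example, and the patterns assume the opening tag is split across two lines exactly as in the question. An empty line is left behind where each removed header started.

```shell
#!/bin/sh
# Remove the XML declaration and the two-line <ns0:collection ...> opening tag.
# Assumes the tag is broken right after "<ns0:collection", as in the question.
strip_open() {
    sed -e 's/<?xml[^>]*?>//' \
        -e 's/<ns0:collection$//' \
        -e 's/^xmlns:ns0="[^"]*">//'
}

# Remove only the closing </ns0:collection> tag, keeping </Record>.
strip_close() {
    sed -e 's/<\/ns0:collection>//'
}

# merge_collections FIRST [MIDDLE...] LAST
# Prints the merged document on standard output; inputs are not modified.
merge_collections() {
    total=$#
    index=0
    for file in "$@"; do
        index=$((index + 1))
        if [ "$total" -eq 1 ]; then
            cat "$file"                          # single file: nothing to strip
        elif [ "$index" -eq 1 ]; then
            strip_close < "$file"                # first file: keep the header
        elif [ "$index" -lt "$total" ]; then
            strip_open < "$file" | strip_close   # middle files: strip both ends
        else
            strip_open < "$file"                 # last file: keep the closing tag
        fi
    done
}
```

With the test files above it could be called as merge_collections n0 n1 n2 n3 ne > joined. It is still plain text processing, so check the result with an XML parser before relying on it.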

1
  • Please note that you didn't provide any text between the nodes. To help you properly, I need the text between those nodes so that I can adjust the sed command accordingly. I used my own text to show how I might do what you're asking, so please update your question so we can be more helpful. Commented Jan 10, 2018 at 14:15
