To remove a tag from a xml file

Question

My file contains data that is not well indented, for example:

<?xml version="1.0" encoding="UTF-8" ?><ns0:collection
xmlns:ns0="http://namspace/Service/1.0"><Record>
.
.</Record></ns0:collection>

I have to merge N such files into a single file, so I need to do the following:

Remove only the </ns0:collection> closing tag from the first file.
Remove both <?xml version="1.0" encoding="UTF-8" ?><ns0:collection xmlns:ns0="http://namspace/Service/1.0"> and </ns0:collection> from the next (n-1) files.
Remove only <?xml version="1.0" encoding="UTF-8" ?><ns0:collection xmlns:ns0="http://namspace/Service/1.0"> from the last file, then merge all of them together.

I tried using sed to process the first file, but it produces nothing: "merged.xml" is empty.

sed '/<\/ns0:collection>/d' $file1 > merged.xml

Any suggestions?

This actually sounds like an XY problem: you don't want to delete tags from your XML, you want to merge some files.
– Sobrique, Commented Jan 12, 2018 at 10:36

Varsha Gowda · Accepted Answer · 2018-04-04 07:09:17Z

You didn't specify that you could only use sed, so if you have access to xml_grep (see the second answer to "Merge multiple XML files from command line"), I would recommend it: it does a lot of the heavy lifting for you, and a simple merge job like this can be done in one command:

xml_grep --cond Record --wrap "ns0:collection" --descr 'xmlns:ns0="http://namespace/Service/1.0"' --encoding "UTF-8" *.xml

Test files:

test.xml

<?xml version="1.0" encoding="UTF-8" ?><ns0:collection
xmlns:ns0="http://namespace/Service/1.0"><Record>
Test
</Record></ns0:collection>

test1.xml

<?xml version="1.0" encoding="UTF-8" ?><ns0:collection
xmlns:ns0="http://namespace/Service/1.0"><Record>
Test 1<a>a</a><b c="c">d</b>
</Record></ns0:collection>

Result

<?xml version="1.0" encoding="UTF-8" ?>
<ns0:collection xmlns:ns0="http://namespace/Service/1.0">
<Record>
Test 1<a>a</a><b c="c">d</b></Record><Record>
Test
</Record>
</ns0:collection>

I prefer to use XML-aware tools when dealing with XML files, because the chance of messing up the structure with sed and friends is quite high and you can easily end up with a malformed XML document!
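To make that risk concrete, here is a minimal Python sketch (Python and the helper name is_well_formed are my own choices for illustration, not something used in this thread). It mimics what a line-deleting sed command such as the one in the question does to the test file, and also tries a plain concatenation of two complete documents, then asks a parser whether each result is still well-formed.

```python
import xml.etree.ElementTree as ET

# Same layout as test.xml above: the closing tags share one line.
DOC = (
    '<?xml version="1.0" encoding="UTF-8" ?><ns0:collection\n'
    'xmlns:ns0="http://namespace/Service/1.0"><Record>\n'
    "Test\n"
    "</Record></ns0:collection>\n"
)


def is_well_formed(xml_text):
    """Return True if xml_text parses as one well-formed XML document."""
    try:
        ET.fromstring(xml_text.encode("utf-8"))
    except ET.ParseError:
        return False
    return True


# Mimic sed '/<\/ns0:collection>/d': drop every line containing the closing tag.
# That line also holds </Record>, so the record is left unclosed.
line_deleted = "\n".join(
    line for line in DOC.splitlines() if "</ns0:collection>" not in line
)

# Naive concatenation of two complete documents yields two roots.
concatenated = DOC + DOC

print(is_well_formed(DOC))           # True
print(is_well_formed(line_deleted))  # False
print(is_well_formed(concatenated))  # False
```

The same effect probably explains the empty merged.xml in the question: sed's d command deletes the whole matching line, so if the file is effectively a single line, nothing is left.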

Agreed. XML is a contextual language, and regular expressions ... aren't. They're always going to be a bit hacky.
– Sobrique, Commented Jan 12, 2018 at 10:50

Sobrique · Accepted Answer · 2018-01-12 10:47:04Z

I would suggest that sed isn't a good tool for processing XML, and that you use a parser instead.

I would also suggest that you have an XY problem here: it's not about deleting tags, but rather about merging XML files.

Personally, I like Perl and XML::Twig:

#!/usr/bin/env perl
use strict;
use warnings;

#load the parser
use XML::Twig; 

#get our file list - we use the "first" file as the basis.
#can use sort on this list if desired. 
my ( $first_file, @other_files ) = glob ( 'C://tmp//xmltest/*.xml' ); 

#Our 'parent' document. 
my $doc = XML::Twig -> new -> parsefile ( $first_file ); 


foreach my $file ( @other_files ) { 
   my $mergedoc = XML::Twig -> new -> parsefile ( $file ); 

   #//Record means any <Record> node anywhere in the tree. 
   foreach my $record ( $mergedoc -> get_xpath ( '//Record' ) ) {
      $record -> cut;
      #paste it into our parent doc, as the last node. 
      $record -> paste ( after => $doc -> root -> last_child );
   }
}

#set output formatting (optional)
$doc -> set_pretty_print ('indented_a'); 

#print to STDOUT.
$doc -> print;

#write to output file too
open ( my $output, '>', 'combined.xml' ) or die $!;
print {$output} $doc -> sprint;
close ( $output );

This deliberately extracts only the Record elements from each source document and merges those into the first one. It's a flexible approach, though: XPath is quite powerful and is roughly the XML equivalent of regex, but it works better because it's context-aware, whereas regex is not.
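If Perl isn't available, roughly the same merge can be sketched with Python's standard xml.etree.ElementTree. Treat this as an illustration under assumptions rather than a drop-in replacement: the two input files are recreated in a temporary directory with the same layout as the test files from the xml_grep answer, merge_records is a hypothetical helper name, and the output name combined.xml is borrowed from the Perl script.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed input layout, copied from the test files shown earlier in the thread.
TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8" ?><ns0:collection\n'
    'xmlns:ns0="http://namespace/Service/1.0"><Record>\n'
    "{body}\n"
    "</Record></ns0:collection>\n"
)


def merge_records(paths, record_tag="Record"):
    """Append every <Record> from the later files to the first file's root."""
    first, *others = paths
    tree = ET.parse(first)
    root = tree.getroot()
    for path in others:
        # iter() visits matching elements at any depth, like '//Record'.
        for record in ET.parse(path).getroot().iter(record_tag):
            root.append(record)
    return tree


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, body in [("test.xml", "Test"), ("test1.xml", "Test 1")]:
        path = Path(tmp) / name
        path.write_text(TEMPLATE.format(body=body), encoding="utf-8")
        paths.append(str(path))

    merged = merge_records(paths)
    out = str(Path(tmp) / "combined.xml")
    merged.write(out, encoding="UTF-8", xml_declaration=True)
    combined = Path(out).read_text(encoding="utf-8")

print(combined)
```

As in the Perl version, the first file provides the root element and the others only contribute their Record elements. ElementTree regenerates namespace prefixes when it writes the file, so check that the prefix on the root element is what your consumer expects.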

Varsha Gowda · Accepted Answer · 2018-04-03 10:58:58Z

Solutions:

I need to remove only the </ns0:collection> closing tag from the first file. Solution:

sed -i.bak -e 's/<\/ns0:collection>/ /' -e 's/<\/Record>/ /' n0

remove both <?xml version="1.0" encoding="UTF-8" ?><ns0:collection xmlns:ns0="http://namspace/Service/1.0"> and </ns0:collection> in the next (n-1) files:

sed -i.bak -e 's/<?xml version="1.0" encoding="UTF-8" ?>.*<ns0:collection/ /' -e 's/xmlns.*/ /' -e 's/<\/R.*>.*>/ /' n1

Doing so for a range of file names:

find . -type f -name "n[1-3]" -exec sed -i.bak -e 's/<?xml version="1.0" encoding="UTF-8" ?>.*<ns0:collection/ /' -e 's/xmlns.*/ /' -e 's/<\/R.*>.*>/ /' {} \;

Have to remove only <?xml version="1.0" encoding="UTF-8" ?><ns0:collection xmlns:ns0="http://namspace/Service/1.0"> in the last file and merge all of them together:

sed -i.bak -e 's/<?xml version="1.0" encoding="UTF-8" ?>.*<ns0:collection/ /' -e 's/xmlns.*/ /' ne

Then finally join them:

cat n0 n[1-3] ne > joined

I used the following files: n0, n1, n2, n3, and ne. I added the following text to each:

<?xml version="1.0" encoding="UTF-8" ?><ns0:collection
xmlns:ns0="http://namspace/Service/1.0"><Record>
hello from nigeria
</Record></ns0:collection>

The resulting file joined was as seen below:

<?xml version="1.0" encoding="UTF-8" ?><ns0:collection
xmlns:ns0="http://namspace/Service/1.0"><Record>
hello from nigeria



hello from nigeria



hello from nigeria



hello from nigeria



hello from nigeria
</Record></ns0:collection>

Note:

I see from the first problem that you need to remove both </Record></ns0:collection>, not just </ns0:collection>, so I took the liberty of doing so; otherwise we would have duplicate </Record> entries when the files are merged.
You will have to adapt the filenames so that one command can run over them all; here I used n[1-3]. Pick whatever works best for you.
PLEASE RUN A TEST FIRST AND CHECK THE RESULTS. Here I used -i.bak so that sed creates a backup automatically.
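One way to run such a test is to parse the joined file and count what is in it. The sketch below is only an illustration (Python is my choice here, not something this answer used, and JOINED is an approximation of the joined output shown above, with a single space on each line where a tag was replaced). It checks how many Record elements survive the merge.

```python
import xml.etree.ElementTree as ET

# Approximation of the joined output shown above: five greetings separated
# by the near-empty lines that the sed substitutions leave behind.
JOINED = (
    '<?xml version="1.0" encoding="UTF-8" ?><ns0:collection\n'
    'xmlns:ns0="http://namspace/Service/1.0"><Record>\n'
    + "\n \n \n \n".join(["hello from nigeria"] * 5)
    + "\n</Record></ns0:collection>\n"
)

root = ET.fromstring(JOINED.encode("utf-8"))
records = list(root.iter("Record"))

record_count = len(records)
greeting_count = records[0].text.count("hello from nigeria")

print(record_count, greeting_count)  # 1 5
```

The result is well-formed, but all five inputs end up inside a single Record, because the xmlns.* substitution also removes the opening <Record> that follows it on the same line. If you need one Record per input file, the substitutions would have to keep the <Record> and </Record> tags in every file.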

Please note that you didn't provide any of the text between the tags. To help you, we need that text so I can adjust the sed command accordingly. I used my own text to show how I might do what you're asking, so please update your question so we can be more helpful.
– George Udosen, Commented Jan 10, 2018 at 14:15
