I have 1000+ HTML files, each with more than 1000 lines, on a Linux server.
Most of the files contain a particular block of HTML code that needs to be deleted.

The part that I need to delete looks roughly like this:

<div class="LoginOuterCssClass" id="ctl07">
    ...
</div>

Is there some script or command-line solution for this?

Commands like the following didn't help:

X,Ys/search/replace/g
1,2s/\([a-z]*\), \([a-z]*\)/\2 \1/ig
s/<[^]*>//g

Help would be much appreciated!

  • What command did you try for this? The examples show only patterns, not a real command. Commented Dec 14, 2012 at 15:35
  • See this question for using sed and grep to delete one line of text from several files: stackoverflow.com/q/1182756/1284631 Commented Dec 14, 2012 at 15:36
  • What you're talking about is parsing HTML, and simple command line tools are not up to the task. What if there's a <div> inside of the <div> you want deleted, for example? What if the closing </div> isn't on a line by itself? You need a proper HTML parser. Commented Dec 14, 2012 at 16:49
  • I used the find | xargs sed command. There are 42 lines of HTML and several divs inside the div I want to delete, none of them on the same line. Andy, you mention a proper HTML parser; what could I use? Commented Dec 17, 2012 at 8:17

1 Answer

Try the following sed command on one file and see if it does what you want:

sed -n '/<div class="LoginOuterCssClass" id="ctl07">/{:a;N;/<\/div>/!ba;N;s/.*\n//};p' file.html

To run this on multiple files and edit them in-place, you run find and pass the files to sed via xargs as shown below:

find /some/path -name "*.html" -print0 | xargs -0 sed -i -n '/<div class="LoginOuterCssClass" id="ctl07">/{:a;N;/<\/div>/!ba;N;s/.*\n//};p'
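
One caveat, given the nested divs mentioned in the comments: the sed loop stops at the first `</div>` it meets, so a block with inner divs may be removed only partially. The following is a minimal sketch of an alternative that tracks nesting depth with awk. It is not a real HTML parser. It assumes each `<div ...>` tag and each `</div>` is written in the usual form and that nothing you want to keep shares a line with the block. The function name `strip_login_div` is made up for illustration.

```shell
#!/bin/sh
# strip_login_div is a hypothetical helper name, not part of the original answer.
# It prints FILE to stdout without the LoginOuterCssClass block.
strip_login_div() {
    awk '
        # Start skipping at the opening line of the target block.
        !inside && /<div class="LoginOuterCssClass" id="ctl07">/ { inside = 1 }

        inside {
            # gsub() returns the number of matches; replacing with "&"
            # (the matched text itself) leaves the line unchanged.
            depth += gsub(/<div[ >]/, "&")
            depth -= gsub(/<\/div>/, "&")
            # Depth back at zero means the outer div has been closed.
            if (depth <= 0) { inside = 0; depth = 0 }
            next    # drop every line that belongs to the block
        }

        { print }
    ' "$1"
}

# Example: rewrite one file through a temporary copy (uncomment to use).
# strip_login_div file.html > file.html.tmp && mv file.html.tmp file.html
```

As with the sed version, try it on a single copy first and compare the result with the original using diff before touching all the files.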

3 Comments

I tried find /some/path -name "*.html" -print0 | xargs -0 sed -in '/<div class="LoginOuterCssClass" id="ctl07">/{:a;N;/<\/div>/!ba;N;s/.*\n//};p', and it works almost perfectly! Thank you very much! But the content of the file I tested now has every line doubled. For example, <div class="input"> is now <div class="input"> <div class="input">. How could this happen?
I have fixed the sed command. It should have been -i -n, not -in.
You just made me very happy! :D Thank you very much!
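
For anyone wondering why -in doubled every line: GNU sed reads text attached to -i as a backup suffix, so -in means "edit in place and keep a backup with suffix n", and -n (suppress automatic printing) is never switched on. Each line is then printed once automatically and once more by p. Here is a small throwaway demonstration, assuming GNU sed as found on most Linux systems; the file names are invented for the demo.

```shell
#!/bin/sh
# Throwaway demo in a temporary directory; assumes GNU sed.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one.html"
printf 'a\nb\n' > "$dir/two.html"

# -in is read as -i with the backup suffix "n": auto-print stays on,
# so the explicit p prints every line a second time.
sed -in 'p' "$dir/one.html"

# -i -n edits in place with auto-print off: p prints each line once.
sed -i -n 'p' "$dir/two.html"

wc -l "$dir/one.html" "$dir/two.html"   # line counts to compare
ls "$dir"                               # with GNU sed a backup named one.htmln appears
```

If the -in form was already run on real files, the backups with the extra n in their names still hold the original content.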