
Most of the commandline tools I'm looking at have the ability to pick a field delimiter. However, I'd like to pick one delimiter to start, and a different one to end the segment of text I'd like to remove from each line I'm processing.

1text [blah blah blah] text number punctuation text text
2text text text
3text text (text) [blah blah blah] number text
4text <url> <email> text [blah blah blah] text

I'd like to remove all the 'blah blah blah' from those lines.

Blah can contain anything except newlines, EOFs, other breaky-things, and '['. That is, I shouldn't have '[[' (nor '[blah[') in any of the data.

I only have one (optional) instance of [] per line. So, for line 2 there is nothing to remove, and this shouldn't cause a halt, stop or failure.

I'm almost 100% positive that if I've got a start '[' I also have a ']'. That might be nice to check for, however.

There are other forms of punctuation, so I don't want to use something that just looks for non-alphanumeric characters to decide where to start removing (e.g. line 4).

Bonus points for being able to figure out if I'm putting together two (now adjacent) whitespaces at that particular point - but without removing double whitespaces at any other point.

I'm pretty sure I'll have to use awk or sed, but if there were a way to do this via regular commandline tools, to make it as portable as possible, that would be ideal.

Also, explaining what you're doing (if you're using regex / sed) would certainly help, as:


A suggestion here says:

sed 's/^.*%\([^ ]*\) .*\$\([^$]*\)$/\1 \2/' infile

I got that kinda working with this bit of monkeying:

cat data | sed 's/^.*\[\([^ ]*\) .*\]\([^$]*\)$/\1 \2/'

However, it doesn't take out the whole swath of 'blah blah blah', and it leaves an extra line break.


Using cut/awk/sed with two different delimiters

Doesn't really answer the question in a general sense (or, at least I wasn't able to figure something out after reading it - maybe just a fail on my part), but seems to be (too) specifically tailored to that person's data.

  • Don, this looks like exactly what I was looking for, but it wasn't suggested to me and I couldn't find it. I ended up using Terdon's first answer: sed 's/\[.*\]//g' Commented Jan 14, 2015 at 18:04

1 Answer


This is very simple. You don't need delimiters as such; a simple regular expression will do. Just look for an opening [, followed by as many characters that are neither ] nor [ as possible, up to the closing ]. For example:

  1. Perl

    If you know there are no [[ or other strange things:

    perl -pe 's/\[.+?\]//g' file
    

    If you can have strange things:

    perl -pe 's/\[[^\[\]]*\]//g' file
    
  2. sed

    sed  's/\[[^]]*\]//g' file
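
Since the question asked for an explanation, here is a minimal sketch that runs the sed command above on the sample lines from the question, with the pattern broken down in comments. The function names (strip_brackets, strip_brackets_ws, find_unclosed) are hypothetical and only for illustration. The second function is one possible take on the "bonus" request, assuming a single [] per line as the question states, and the third is one possible way to spot a '[' that is never closed.

```shell
#!/bin/sh
# Sample lines copied from the question.
sample='1text [blah blah blah] text number punctuation text text
2text text text
3text text (text) [blah blah blah] number text
4text <url> <email> text [blah blah blah] text'

# The sed command from the answer, wrapped in a function that reads stdin.
#   \[      a literal opening bracket
#   [^]]*   zero or more characters that are not ']'
#   \]      a literal closing bracket
# The whole match is replaced with nothing.
strip_brackets() {
    sed 's/\[[^]]*\]//g'
}

# Hypothetical variant for the "bonus" request: if the bracketed text has a
# space on both sides, replace " [...] " with a single space; otherwise fall
# back to plain removal. Double spaces elsewhere on the line are untouched.
strip_brackets_ws() {
    sed -e 's/ \[[^]]*\] / /g' -e 's/\[[^]]*\]//g'
}

# Hypothetical check: print, with line numbers, every line that has a '['
# with no ']' anywhere after it.
find_unclosed() {
    grep -n '\[[^]]*$'
}

printf '%s\n' "$sample" | strip_brackets
printf '%s\n' "$sample" | strip_brackets_ws
printf '%s\n' 'text [blah blah' | find_unclosed
```

Lines without brackets, such as line 2, pass through unchanged, because sed simply prints a line when the substitution finds nothing to match.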
    
  • Costas, do you mean 'in this case'? I was also hoping to get enough of an explanation that any person could use it for any two delimiters, in a command pipe (cat | sed | grep | cut -- or whatever) Commented Jan 14, 2015 at 18:00
  • Terdon, perhaps you should list both answers. Greedy vs. non-greedy matching (and when you might want to use one or the other, non-greedy sounds good in case someone has messed up and put in some extra []). Commented Jan 14, 2015 at 18:02
  • @anon3202 you can. For example, to use 8 and 2 as the start and end delimiters, you would run sed 's/8[^82]*2//g'. As for greedy or not, Costas's suggestion is not non-greedy as such. It is just a better way than my original. It can do everything the original could and more, so there's little point in posting both. The perl ones are non-greedy. Commented Jan 14, 2015 at 18:03
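
To make the "any two delimiters" point from the comments concrete, here is a rough sketch. The strip_between function is a hypothetical helper, not something from the answer, and it assumes each delimiter is a single ordinary character (not '/', and not one that is special in a sed pattern or bracket expression). The last two commands contrast a greedy .* with the negated character class on a line that has two bracketed segments.

```shell
#!/bin/sh
# Hypothetical helper: remove everything from an opening delimiter ($1)
# through the next closing delimiter ($2), reading stdin.
# Assumption: both delimiters are single ordinary characters, i.e. not '/'
# and not special to sed (such as [ ] ^ - . * \ $).
strip_between() {
    sed "s/$1[^$1$2]*$2//g"
}

# The 8 and 2 example from the comment.
printf '%s\n' 'ab8cd2ef' | strip_between 8 2             # prints: abef

# Another pair of delimiters.
printf '%s\n' 'text <url> text' | strip_between '<' '>'  # prints: text  text

# Greedy .* versus the negated class, on a line with two bracketed segments.
line='a [x] b [y] c'
# .* runs from the first '[' to the LAST ']', so " b " is lost as well.
printf '%s\n' "$line" | sed 's/\[.*\]//g'                # prints: a  c
# [^]]* stops at the first ']', so each segment is removed separately.
printf '%s\n' "$line" | sed 's/\[[^]]*\]//g'             # prints: a  b  c
```

With only one [] per line, as in the question, both forms give the same result; the difference only shows up when a line contains more than one closing delimiter.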
