I am cleaning up data stored in text files. Each line starts with a category label, followed by the actual data that I want to clean up. The files are spread over many subfolders, so I use egrep to find the relevant files and pass their names on to sed. Some example lines:
CON: the Unix and Linux question
SEM: eins, the zwei, drei
AUTH: , the
AFF: The holy seat
TTITLE: As we go, the Kuckuck comes too
Now, in every line starting with (SEM|AFF|CON), I want to remove (T|t)he[ ]* when it follows (:|\,). That is, the data should afterwards look like this:
CON: Unix and Linux question
SEM: eins, zwei, drei
AUTH: , the
AFF: holy seat
TTITLE: As we go, the Kuckuck comes too
So far I have tried to do this in two steps, one for the : case and one for the , case. But I am already stuck on the first step.
First part
The command/pattern to identify the files is egrep -rl ^"(SEM|CON|AFF)\: (t|T)he". This works as intended.
Now when I do
egrep -rl ^"(SEM|CON|AFF)\: (t|T)he" | xargs sed -i 's/\((SEM|CON|AFF)\: \)(t|T)he[ ]*/\1/g'
nothing happens. Is my sed part wrong? Can't I use \1 as a back-reference to ((SEM|CON|AFF)\: ?
Second part
The command/pattern to identify the files is egrep -rl ^"(SEM|CON|AFF)\:.*\,[ ]*(t|T)he". This also works as intended. But every sed expression that I have tried so far deletes the content.
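A possible sketch for this second step, assuming a sed that supports -E (GNU and BSD sed both do). Instead of matching the text before the comma with .*, it uses an address to select the SEM, CON and AFF lines and then substitutes only the comma part. The file name demo.txt and the variable names are made up for the illustration, and "[Tt]he +" (at least one space after the article) is my assumption about what should be removed.

```shell
# Sample lines from the question; demo.txt is a hypothetical file name.
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.txt" <<'EOF'
CON: the Unix and Linux question
SEM: eins, the zwei, drei
AUTH: , the
AFF: The holy seat
TTITLE: As we go, the Kuckuck comes too
EOF

# The address /^(SEM|CON|AFF):/ limits the substitution to the wanted lines.
# ",( *)" keeps the comma and the spaces after it; "[Tt]he +" is dropped.
result=$(sed -E '/^(SEM|CON|AFF):/ s/,( *)[Tt]he +/,\1/g' "$tmpdir/demo.txt")
printf '%s\n' "$result"
rm -r "$tmpdir"
```

With this input only the SEM line changes, to "SEM: eins, zwei, drei". The AUTH and TTITLE lines are skipped by the address, and the CON and AFF lines contain no comma.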
In a basic RE, ( is treated as a literal (. Try \(\(SEM\|CON\|AFF\): \)\(t\|T\) (with GNU sed, the alternation in a basic RE also needs a backslash, \|). Or use extended REs (sed -r), and replace every \( with just (. Since you used egrep, you're getting extended REs for free.
-e when I use egrep?
egrep is like grep -E. -e is used to denote an expression.
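To make the extended-RE suggestion concrete, here is a minimal sketch for the first step, assuming a sed that accepts -E (GNU sed also accepts -r). The file name demo.txt is hypothetical, and I swapped (t|T)he[ ]* for [Tt]he + so that words such as "there" are left alone; that change is my assumption, not part of the original command.

```shell
# Sample lines from the question; demo.txt is a hypothetical file name.
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.txt" <<'EOF'
CON: the Unix and Linux question
SEM: eins, the zwei, drei
AUTH: , the
AFF: The holy seat
TTITLE: As we go, the Kuckuck comes too
EOF

# With -E, ( ) and | are grouping and alternation without backslashes.
# \1 refers to the outer group, i.e. "CON: ", "SEM: " or "AFF: ".
result=$(sed -E 's/^((SEM|CON|AFF): )[Tt]he +/\1/' "$tmpdir/demo.txt")
printf '%s\n' "$result"
rm -r "$tmpdir"
```

Here the CON line becomes "CON: Unix and Linux question" and the AFF line becomes "AFF: holy seat"; the other lines are unchanged. The same expression could be passed to sed -i through xargs as in the question, though the form of the -i option differs between GNU and BSD sed.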