Remove pattern between two symbols that exist more than once in single line

Question

I have a number of files similar to the following file (which I will call file1):

(((1824663.AST0201.AST0202.AST0016...AST0087:0.2575,225845.AST0201.AST0202.AST0016...AST0087:0.717227):0.45328,190304.AST0201.AST0202.AST0016...AST0087:...........

I want to remove every run of .AST* entries, from the . before the first AST* up to (but not including) the :, which always follows the last .AST*. This pattern appears many times on one line. The AST* entries shown here repeat exactly, but I have many more such lines with different AST entries.

What I am currently doing is splitting the line after each : and ,, removing all the AST entries, then joining the lines back into one.

sed 's/:/:\n/g;s/,/,\n/g' file1 | sed 's/\.AST.*:/:/g' | sed -z 's/\n//g'

Expected output:

(((1824663:0.2575,225845:0.717227):0.45328,190304:...........

Is there a shorter sed command for this?

When asking a question on text processing, please be sure to include not only an example of the input, but also the corresponding desired output, so that contributors can verify their solutions against your example. — AdminBee
– AdminBee, Commented Nov 1, 2022 at 13:51
What sort of intervening pattern is seen, if any? From your example, it's unclear whether the AST entries are occasionally separated by a literal ... (three dots) or not. Also, do you mean * to represent digits only? If you meant to write a regex, your AST* represents a literal A followed by a literal S followed by zero-or-more Ts. — jubilatious1
– jubilatious1, Commented Nov 1, 2022 at 14:03
@jubilatious1 The ... between the AST* entries was meant to show that there are a number of .AST entries in between, not a literal .... I intended to remove everything from each \.AST up to the first following :, regardless of what's in between. I got stuck on the sed command because I didn't know how to stop the removal at the first : after each AST batch. Thanks for the response anyway! :) — web
– web, Commented Nov 2, 2022 at 3:34

sseLtaH · Accepted Answer · 2022-11-01 12:52:00Z

1

You can try this sed command:

$ sed -E 's/\.AST[^:]*//g' input_file
(((1824663:0.2575,225845:0.717227):0.45328,190304:...........


Thanks! Would you mind elaborating on the command? What does the [^:] do here?
– web, Commented Nov 2, 2022 at 3:20

The command matches a literal period . immediately followed by AST, then everything up to, but not including, the next colon; that is the `[^:]*` part. @web
– sseLtaH, Commented Nov 2, 2022 at 3:37
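
The difference between a greedy .* and the negated class [^:]* can be checked on a small line. The sketch below uses a made-up sample in the same shape as file1 (it is not taken from the real data), and the variable names are illustrative only.

```shell
# Hypothetical one-line sample modelled on the question's input
line='((1.AST01.AST02:0.25,2.AST03:0.71):0.45'

# Greedy .* runs to the LAST colon on the line, swallowing real data
greedy=$(printf '%s\n' "$line" | sed 's/\.AST.*:/:/')

# [^:]* cannot cross a colon, so each match stops just before the nearest one
bounded=$(printf '%s\n' "$line" | sed 's/\.AST[^:]*//g')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

This is why the original pipeline had to split the line first: with one record per line, the greedy match has only one colon to reach. The negated class makes the split unnecessary.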


jubilatious1 · Accepted Answer · 2022-11-01 22:59:27Z

Using Raku (formerly known as Perl_6)

~$ raku -pe  's:g/ [ AST \d+ ]+ % \.+ //;'  file

Sample Input:

(((1824663.AST0201.AST0202.AST0016...AST0087:0.2575,225845.AST0201.AST0202.AST0016...AST0087:0.717227):0.45328,190304.AST0201.AST0202.AST0016...AST0087:...........

Sample Output:

(((1824663.:0.2575,225845.:0.717227):0.45328,190304.:...........

Raku is a Perl-family programming language. Here, the familiar -pe autoprinting linewise flags are used (sed-like). The Regex above is written assuming that the . dot is a record or element separator. Therefore it makes more sense not to look for .AST but to leave the leading . dot untouched.

A new feature of Raku regexes is the modified quantifier for repeating elements. Simply put, if you have a pattern like AST \d+, you can group it with brackets to make [ AST \d+ ], then add a quantifier to indicate the number of repeats: [ AST \d+ ]+.

Normally the above would just recognize multiple instances of the pattern run together. However, you can now follow the pattern with a modified quantifier indicator % \.+ to indicate that the [ AST \d+ ] pattern % ("is separated by") \.+ one-or-more . dots. This construct avoids the problem seen when using just an optional \.? regex, namely that the separator is lost and patterns like AST0AST1 (if authentic) are deleted. Furthermore, the modified quantifier can be combined with a range quantifier such as **2..* so that only runs of at least that many separated elements are deleted:

Sample Input:

echo '(((1824663.AST0101:AST0201.AST0202:AST301.AST302.AST303:AST0401.AST0402.AST0403.AST0404:)))' > test_AST.txt

One-liners and Output (spaces inserted):

~$ raku -pe  's:g/ [ AST \d+ ]+  //;' test_AST.txt
(((1824663.:.:..:...:)))
 
~$ raku -pe  's:g/ [ AST \d+ ]+ % \.  //;' test_AST.txt
(((1824663.::::)))
 
~$ raku -pe  's:g/ [ AST \d+ ]**2..* % \.  //;' test_AST.txt
(((1824663.AST0101::::)))
 
~$ raku -pe  's:g/ [ AST \d+ ]**3..* % \.  //;' test_AST.txt
(((1824663.AST0101:AST0201.AST0202:::)))
 
~$ raku -pe  's:g/ [ AST \d+ ]**4..* % \.  //;' test_AST.txt
(((1824663.AST0101:AST0201.AST0202:AST301.AST302.AST303::)))
 
~$ raku -pe  's:g/ [ AST \d+ ]**5..* % \.  //;' test_AST.txt
(((1824663.AST0101:AST0201.AST0202:AST301.AST302.AST303:AST0401.AST0402.AST0403.AST0404:)))
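
For readers without Raku, roughly the same "element separated by dots" idea can be sketched in sed's extended regex syntax. This is an approximation of the first two separated-by examples, not a translation of Raku's % operator, and the variable names are illustrative.

```shell
# Same sample line as written to test_AST.txt above
line='(((1824663.AST0101:AST0201.AST0202:AST301.AST302.AST303:AST0401.AST0402.AST0403.AST0404:)))'

# One AST entry, then zero or more "dot + AST entry" repetitions;
# the dot in front of the first entry is left in place
all_runs=$(printf '%s\n' "$line" | sed -E 's/AST[0-9]+(\.AST[0-9]+)*//g')

# Requiring at least one repetition (+) deletes only runs of two or more entries
runs_of_two=$(printf '%s\n' "$line" | sed -E 's/AST[0-9]+(\.AST[0-9]+)+//g')

printf '%s\n%s\n' "$all_runs" "$runs_of_two"
```

Spelling the separator out inside the repeated group is more verbose than % \. but keeps the same guarantee: entries are only joined into one match when a dot actually sits between them.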

https://docs.raku.org/language/regexes
https://raku.org
