Revisions to how to de-duplicate block (timestamp+command) from bash history?

added 707 characters in body

edited May 13, 2024 at 16:34

kos

Read "in";
Split it into records, using ^#[0-9]{10}\n as a record separator, without consuming the separator (<> =~ /^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z)/smg; \z is there only for the last record, which isn't followed by another record separator, so the regex is also allowed to match at the end of the file);
Process all records;
For each record, remove the separator (the line including the timestamp); if the remainder (the command) is in a list of already processed commands, ignore it; otherwise store the command exclusive of the separator in the list of already processed commands and store the command inclusive of the separator in an array;
Join the elements of the array on an empty string, printing the resulting string.

Breakdown of the regex:

^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z): matches a line consisting of a # character followed by 10 digits and a newline, then lazily matches anything (including newlines) up to the next occurrence of ^#[0-9]{10}\n or the end of the string (\z). The zero-length look-ahead assertion (?=...) keeps that next occurrence out of the current match, so the following match can start on it. The s modifier lets . match newlines, m lets ^ and $ match right after and right before a newline, and g returns every occurrence of the pattern in the string.

added 12 characters in body

edited May 13, 2024 at 16:00

kos

perl -ge '
    my @u = map {
        $c = $_;
        $c =~ s/^#[0-9]{10}\n//;

        exists($d{$c}) ?
            () :
            ($d{$c} = 1 && $_)
        ;
    } <> =~ /^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z)/smg;

    print join("", @u);
' in

perl -ge '
    my @u = map {
        $c = $_;
        $c =~ s/^#[0-9]{10}\n//;

        exists($d{$c}) ? () : ($d{$c} = 1 && $_);
    } <> =~ /^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z)/smg;

    print join("", @u);
' in

perl -ge '
    @u = map {
        $c = $_;
        $c =~ s/^#[0-9]{10}\n//;

        exists($d{$c}) ?
            () :
            ($d{$c} = 1 && $_)
        ;
    } <> =~ /^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z)/smg;

    print join("", @u);
' in

answered May 13, 2024 at 15:35

kos

Using Perl (5.36 or later, where -g is shorthand for -0777), you may do:

perl -ge '
    my @u = map {
        $c = $_;
        $c =~ s/^#[0-9]{10}\n//;

        exists($d{$c}) ? () : ($d{$c} = 1 && $_);
    } <> =~ /^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z)/smg;

    print join("", @u);
' in

The big assumption is that ^#[0-9]{10}\n will always positively identify the start of an entry in the file; a command line made up of just # followed by ten digits would be mistaken for a timestamp.

The command is a bit dense, but the logic behind it is:

Read "in";
Split it into records, using ^#[0-9]{10}\n as a record separator (<> =~ /^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z)/smg; \z is there only for the last record, which isn't followed by another record separator, so the regex is also allowed to match at the end of the file);
Process all records;
For each record, remove the separator (the line including the timestamp); if the remainder (the command) is in a list of already processed commands, ignore it; otherwise store the command exclusive of the separator in the list of already processed commands and store the command inclusive of the separator in an array;
Join the elements of the array on an empty string, printing the resulting string.

It works well on your sample input; I've also tested it with empty commands (a timestamp following another timestamp).

If duplicate entries are found, the first entry will be kept and later ones will be discarded.

% cat in
#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308689
cpio -v -t -F init.cpio
#1713308690
ls
#1713308691
ls
% perl -ge '
        my @u = map {
                $c = $_;
                $c =~ s/^#[0-9]{10}\n//;

                exists($d{$c}) ? () : ($d{$c} = 1 && $_);
        } <> =~ /^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z)/smg;

        print join("", @u);
' in
#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308690
ls
