-1

I'm working with a bash_history file containing blocks in the following format: #unixtimestamp\ncommand\n

Here's a sample of the bash_history file:

#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308689
cpio -v -t -F init.cpio
#1713308690
ls
#1713308691
ls

My goal is to remove duplicate blocks entirely, meaning both the timestamp and the associated command. I've tried awk, but my attempt processed lines individually instead of treating them as part of a block.

I've heard that using ignoredups prevents duplicates, but it won't work in this case (unless you retype the exact command) because the duplicate commands are already in the file.

I'd appreciate suggestions on a more effective way to achieve this de-duplication.

EDIT: As suggested by Ed Morton in the comments, here's the expected output:

#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308690
ls

As a workaround, I added delete functionality to this program, but I'm still open to other approaches that use existing commands.

  • Does this answer your question? Sorting Bash History for Redundancy Removal Commented May 13, 2024 at 8:10
  • Didn't you already ask this? unix.stackexchange.com/questions/775988/… Commented May 13, 2024 at 8:11
  • The previous questions are intended for sorting (so I can manually review and remove commands that aren't exactly duplicates from bash history). Now I'm looking for a way to automatically remove commands that are exactly duplicates using a Linux command that already exists. Commented May 13, 2024 at 8:19
  • I doubt if there's "a Linux command that already exists" but it may be simple to write one. Type the line #1234567890 at your command prompt. Now type a here-document including that same string. Now update your example to show your history file contents including those lines and a few others, including some duplicates. That is to test if we can robustly use a regexp like ^#[0-9]{10}$ or similar as a delimiter between records. If either of those strings in your history file are formatted indistinguishably from your timestamps like #1713308636 then it becomes a much harder problem to solve. Commented May 13, 2024 at 11:18
  • I see you added expected output for your original sample input, thanks, but please provide sample input/output per my second comment so we can see what your history file looks like if it contains commands or parts of commands that follow the format of your timestamp lines. Commented May 13, 2024 at 12:58

2 Answers

3

You didn't show your awk attempt, but the following awk program prints each entry of the form

#[number]
[command consisting of one or more lines]

where the command is unique. The program is:

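# dedup.awk
# Print the pending entry if its command has not been printed before.
function emit() {
    if (length(body) > 0 && !(body in seen)) {
        seen[body] = 1
        print header body
    }
    body = ""
}

# A timestamp line closes the previous entry and starts a new one.
/^#[[:digit:]]+$/ {
    emit()
    header = $0
    next
}

# Any other line belongs to the current command.
{
    body = body "\n" $0
}

# Handle the final entry, which no later timestamp line closes.
END {
    emit()
}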

The output:

$ cat test.file
#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308689
cpio -v -t -F init.cpio
#1713308690
ls
#1713308691
ls
$ awk -f dedup.awk test.file
#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308690
ls

Note that if "commands" such as

$ #12345678

are saved in the file, the above awk program will just skip them, e.g.:

$ cat test.file.2
#1234512312
#1231231233
#1231231231
cd
#1237192388
ls
#1231231231
cd
$ awk -f dedup.awk test.file.2
#1231231231
cd
#1237192388
ls

The program can be adjusted to accommodate cases like this one, but it requires more precise specification of the problem. For example, how to deal with:

#1234512312
#1231231233
#1231231231
#1231231233
#1231231231
#1231231233
#1231231231
cd
#1237192388
ls
#1231231231
cd
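
Before deciding how to treat that case, it may help to check whether a given history file contains it at all. The following is a minimal sketch in Python (not awk), and it is only an illustration: the function name timestamp_runs is made up here, and it assumes the same "# followed only by digits" pattern that dedup.awk uses. It reports every run of two or more consecutive timestamp-like lines, together with the line number where the run starts.

```python
import re

# Same shape as the awk pattern /^#[[:digit:]]+$/ (an assumption carried over
# from dedup.awk, not a guaranteed description of every history file).
TIMESTAMP = re.compile(r"#[0-9]+")


def timestamp_runs(text):
    """Return (first_line_number, lines) for each run of 2+ timestamp-like lines."""
    runs = []
    current = []
    start = 0
    for number, line in enumerate(text.splitlines(), start=1):
        if TIMESTAMP.fullmatch(line):
            if not current:
                start = number
            current.append(line)
        else:
            # A command line ends the run; keep it only if it was ambiguous.
            if len(current) > 1:
                runs.append((start, current))
            current = []
    # A run may also sit at the very end of the file.
    if len(current) > 1:
        runs.append((start, current))
    return runs
```

An empty result means every timestamp-like line is followed directly by a command line, so the ambiguity does not arise for that file.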

Edit: Thanks to @G-Man Says 'Reinstate Monica' for the optimization suggestion.

  • You could at least salvage comments made of any number of digits but 10 by changing the pattern to ^#[[:digit:]]{10}$, although then of course we would still hit the same wall of indistinguishable timestamps and comments. Commented May 13, 2024 at 17:43
  • Good job! But why do you need the in_header variable? Why not just move the header = $0 and next statements into the first code block? Commented May 14, 2024 at 18:12
  • Thanks for the optimization hint, @G-ManSays'ReinstateMonica'. I updated the answer. Commented May 15, 2024 at 16:42
1

Using Perl, you may do:

perl -ge '
    @u = map {
        $c = $_;
        $c =~ s/^#[0-9]{10}\n//;

        exists($d{$c}) ?
            () :
            ($d{$c} = 1 && $_)
        ;
    } <> =~ /^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z)/smg;

    print join("", @u);
' in

The big assumption is that ^#[0-9]{10}\n will always positively identify the start of an entry in the file.

The command is a bit dense, but the logic behind it is:

  • Read "in";
  • Split it in records, using ^#[0-9]{10}\n as a record separator, without consuming the separator (<> =~ /^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z)/smg);
  • Process all records;
  • For each record, remove the separator (the line containing the timestamp); if the remainder (the command) is in the list of already processed commands, ignore the record; otherwise store the command, without the separator, in the list of already processed commands and store the record, with the separator, in an array;
  • Join the elements of the array on an empty string, printing the resulting string.
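
For readers who find the one-liner hard to follow, the same steps can be sketched in Python. This is an illustrative translation under the same big assumption, not the Perl command itself: the names RECORD and dedup_history are made up here, and Python's \Z stands in for Perl's \z.

```python
import re

# One record = a 10-digit timestamp line plus everything up to the next
# timestamp line or the end of the input (assumed format, as in the answer).
RECORD = re.compile(r"^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\Z)", re.S | re.M)


def dedup_history(text):
    """Keep the first record for each distinct command, timestamp included."""
    seen = set()
    kept = []
    for record in RECORD.findall(text):
        # Drop the timestamp line; what is left is the command text.
        command = record.split("\n", 1)[1]
        if command not in seen:
            seen.add(command)
            kept.append(record)
    return "".join(kept)
```

Calling dedup_history on the contents of the history file returns the de-duplicated text. Like the Perl version, it keeps the first occurrence of each command and treats an empty command as just another command.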

Breakdown of the regex:

  • ^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z): matches a line consisting of a # character followed by 10 digits and a newline, then lazily matches anything (including newlines) up to the next occurrence of ^#[0-9]{10}\n or the end of the string (\z). The zero-length look-ahead assertion (?=...) keeps that next occurrence out of the current match, so the next match can start with it. The s flag lets . match newlines, the m flag lets ^ and $ match after and before a newline, and the g flag returns every occurrence of the pattern in the string.
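
To see what the pattern captures on its own, here is a small Python sketch of the splitting step only. It is an approximation for illustration: split_records is a made-up name, re.S and re.M play the role of the s and m flags, findall plays the role of g, and \Z is Python's spelling of \z.

```python
import re

RECORD = re.compile(r"^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\Z)", re.S | re.M)


def split_records(text):
    """Return every timestamp-plus-command record found in text."""
    return RECORD.findall(text)


# The second timestamp is followed directly by a third one, so its record
# holds the timestamp line only (an empty command).
records = split_records("#1713308690\nls\n#1713308691\n#1713308692\necho a\necho b\n")
for record in records:
    print(repr(record))
```

Each record starts with its own timestamp line, because the look-ahead leaves the next timestamp for the following match.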

It works well on your sample input; I've also tested it with empty commands (a timestamp following another timestamp).

If duplicate entries are found, the first entry will be kept and later ones will be discarded.

% cat in
#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308689
cpio -v -t -F init.cpio
#1713308690
ls
#1713308691
ls
% perl -ge '
        my @u = map {
                $c = $_;
                $c =~ s/^#[0-9]{10}\n//;

                exists($d{$c}) ? () : ($d{$c} = 1 && $_);
        } <> =~ /^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z)/smg;

        print join("", @u);
' in
#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308690
ls
  • This solution currently prints timestamps of "commands" which are integer-comments, but doesn't print the comments themselves: printf "#1234512312\n#1231231233\n#1231231231\ncd\n#1237192388\nls\n#1231231231\ncd\n" | perl -ge ... produces #1234512312\n#1231231231\ncd\n#1237192388\nls\n (the "command" #1231231233 is omitted). As I noted in my solution, the specification from the question needs to be clearer in this case. Commented May 13, 2024 at 17:10
  • @Vilinkameni I know, it's written under "The big assumption being that that ^#[0-9]{10}\n will always positively identify the start of an entry in the file". The other way of tackling that would be to print all comments, but that would not discard empty commands. Handling both cases correctly is obviously impossible. Are you suggesting the second approach would be better? Commented May 13, 2024 at 17:18
  • As I said, the "correctness" of the approach depends on the specification, which currently doesn't cover this case. There is a comment by the OP in reply to @Ed Morton that the unix timestamp can be ignored and I just need to compare the command, in which case both this solution and my solution are correct, but it should be part of the question IMHO. Commented May 13, 2024 at 17:22
  • To be more precise, even that comment doesn't specify if "ignoring the timestamp" means leaving out the timestamps of empty commands (like in my solution), or having timestamps without the commands (like in this solution). Commented May 13, 2024 at 17:25
  • @Vilinkameni I loaded the page before you edited your comment, so I didn't see the last part of your first comment; I agree, it's unclear. Possibly your approach is more sensible. To be honest, I don't even know if empty entries could appear, while certainly a comment made of 10 digits could. Commented May 13, 2024 at 17:27
