-1

I have data in the format below. In the first line the lower (low) and upper (up) values are the same; in the second line they form a range (here, 73760 to 73796). I need three figures: the number of lines where low and up are the same, the number of lines where low and up form a range, and the expanded count of the range data (i.e., the number of values within the range).

Expected output:

individual data : 1
range data : 1
range expanded : 37 (expansion of 73760 to 73796)

Input data:

TMI=012813 FCI=00 low=000654 up=000654 sor=0E
TMI=012813 FCI=00 low=073760 up=073796 sor=0E
  • It is unclear what the output should look like when there is more than one range of non-zero width. Commented Oct 23, 2024 at 6:53
  • Hi, I didn't get your query. Duplicate entries should be removed so that the data is unique. It would be better if we could get the count of duplicate values as well. Commented Oct 23, 2024 at 14:30
  • I'm not talking about any duplicated entries. I asked what the output should look like if there's a third line in the input, TMI=012813 FCI=00 low=093760 up=093796 sor=0E. Commented Oct 23, 2024 at 18:28
  • Hi @Kusalananda thanks for your time. The output should have only the count of lines as stated above in the query. I just need the count of lines which are unique, next the count of the lines having low and up in range, and then the count of range expanded to some figure. Commented Oct 24, 2024 at 4:38
  • "count of range expanded to some figure" does not make much sense to me. It would probably help if you expanded your sample to something that includes duplicates and more than one line with a range along with the expected output. Please edit your question to add the requested details. Commented Oct 24, 2024 at 5:17

4 Answers

2

With perl:

perl -lne '
  if (/ low=(\d+) up=(\d+) /) {
    $range = $2 - $1 + 1;
    $count[$range == 1]++;
    $max_range = $range if $range > $max_range;
  }
  END {
    printf "individual data: %d\nrange data: %d\nrange expanded: %d\n",
      $count[1], $count[0], $max_range;
  }' < your-file

Or, if "range expanded" is meant to be the overall span covered by all the non-discrete lines, as opposed to the maximum length of the range on any single line:

perl -lne '
  if (/ low=(\d+) up=(\d+) /) {
    $count[$2 == $1]++;
    unless ($1 == $2) {
      # $min_low starts undefined (numerically 0), so seed it from the first range
      $min_low = $1 if !defined($min_low) || $1 < $min_low;
      $max_up = $2 if $2 > $max_up;
    }
  }
  END {
    printf "individual data: %d\nrange data: %d\nrange expanded: %d\n",
      $count[1], $count[0], $max_up - $min_low + 1;
  }' < your-file
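
To make the difference between these readings concrete, here is a small awk sketch (not part of the answer above) that reports the longest single range, the sum of all range lengths, and the overall span from the lowest low to the highest up. The function name range_summary is made up for illustration, and the third sample line is the one suggested in the comments under the question. It assumes every line has the exact field layout shown in the question and that up is never smaller than low.

```shell
# Hypothetical helper: compare three readings of "range expanded".
# Assumes the field layout from the question and up >= low on every line.
range_summary() {
  awk -F'[ =]' '
    $6 != $8 {                        # only lines where low and up differ
      len = $8 - $6 + 1               # numeric context drops the leading zeros
      if (len > longest) longest = len
      total += len
      if (!n++ || $6 + 0 < min) min = $6 + 0   # seed min from the first range
      if ($8 + 0 > max) max = $8 + 0
    }
    END {
      printf "longest single range: %d\n", longest
      printf "sum of range lengths: %d\n", total
      printf "overall span: %d\n", (n ? max - min + 1 : 0)
    }' "$@"
}

printf '%s\n' \
  'TMI=012813 FCI=00 low=000654 up=000654 sor=0E' \
  'TMI=012813 FCI=00 low=073760 up=073796 sor=0E' \
  'TMI=012813 FCI=00 low=093760 up=093796 sor=0E' | range_summary
# longest single range: 37
# sum of range lengths: 74
# overall span: 20037
```

With only one range in the input, all three figures coincide, which is why the sample in the question cannot tell the readings apart.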
  • Thanks for all your help, but instead of perl can there be a Linux command for this? I need to understand the approach as well, to get the exact count so that I can log the individual data for all. In addition, we can have duplicate entries. I would be happy to get that as well, along with an individual log for all such data. Commented Oct 23, 2024 at 15:06
  • @SuryaShukla? What's a Linux command? Linux is just an operating system kernel. perl predates Linux and has been available for Linux-based operating systems since about as long as Linux has existed. Commented Oct 23, 2024 at 15:31
  • @SuryaShukla perl is a Linux command. Commented Oct 23, 2024 at 18:29
  • @StéphaneChazelas Thanks, I understand that perl is a Linux command. But since I am not a hard-core Linux user, I just thought something using commands like grep/awk/cut could be given. Thanks once again. The perl one worked, but there seems to be an issue in the range figure: there are also duplicate entries in the file, and I have to consider only the unique ones for the range. For example, if a figure already appears as an individual entry and the range expansion produces the same figure again, it should be ignored. Commented Oct 24, 2024 at 4:41
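
The duplicate handling asked for in these comments was never pinned down, so the following awk sketch is only one possible reading, not something confirmed by the asker. It assumes that a repeated individual value or a repeated (low, up) pair is counted once, and that a value inside a range is not counted as expanded if it also appears as an individual entry. The function name unique_summary and the sample lines are made up for illustration. Note that it stores every expanded value in memory, so it is only suitable for modest range sizes.

```shell
# Hypothetical helper: count each distinct value once (an assumed reading
# of the duplicate requirement). Assumes up >= low on every line.
unique_summary() {
  awk -F'[ =]' '
    NF < 8 { next }                        # skip lines without low/up fields
    { lo = $6 + 0; hi = $8 + 0 }
    lo == hi { single[lo] = 1; next }      # individual value, stored once
    !seen[lo, hi]++ {                      # first time this exact range is seen
      ranges++
      for (v = lo; v <= hi; v++) covered[v] = 1
    }
    END {
      for (v in single) singles++
      for (v in covered) if (!(v in single)) expanded++
      printf "individual data : %d\n", singles
      printf "range data : %d\n", ranges
      printf "range expanded : %d\n", expanded
    }' "$@"
}

printf '%s\n' \
  'TMI=012813 FCI=00 low=000654 up=000654 sor=0E' \
  'TMI=012813 FCI=00 low=000654 up=000654 sor=0E' \
  'TMI=012813 FCI=00 low=073760 up=073796 sor=0E' \
  'TMI=012813 FCI=00 low=073760 up=073796 sor=0E' \
  'TMI=012813 FCI=00 low=073790 up=073800 sor=0E' \
  'TMI=012813 FCI=00 low=073770 up=073770 sor=0E' | unique_summary
# individual data : 2
# range data : 2
# range expanded : 40
```

In the sample, the two overlapping ranges cover 73760 to 73800 (41 values), and 73770 is left out of the expanded count because it also occurs as an individual entry.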
2

Using Miller (mlr):

$ mlr --ifs space put -q -f script file
Range expanded: 37 (73760 to 73796)
Single data: 1
Ranged data: 1

The input data looks very much like DKVP ("delimited key-value pairs"), Miller's native data format, but uses spaces in place of commas for field delimiters, which is why we use --ifs space on the command line of mlr to set the input field separator.

The script in the file named script is inspired by Stéphane's answer, but interprets the question as asking for the expanded size of every range of length greater than 1.

lo = int($low, 10);
hi = int($up, 10);
range = 1 + hi - lo;

@count[int(range == 1)] += 1;

if (range > 1) {
    print "Range expanded: " . range . " (" . lo . " to " . hi  . ")";
}

end {
    print "Single data: " . @count[1];
    print "Ranged data: " . @count[0];
}
0

You can use the FPAT variable of awk to extract the numbers on each line as fields. Then simply compare fields 3 and 4.

$ awk 'BEGIN { FPAT = "[0-9]+" }; $4>$3{c=c+1;r=$4-$3+1;print "count="c"  range expanded="r} END{print "Total ranges (expanded)="c;print "count of lines unexpanded="NR-c}' input_file
count=1  range expanded=37
Total ranges (expanded)=1
count of lines unexpanded=1
  • Note that FPAT is a GNU awk extension. Commented Oct 24, 2024 at 7:43
0

Using any awk:

$ cat tst.awk
BEGIN { FS="[ =]" }
$8 != $6 {
    beg[++numRange] = $6
    end[numRange] = $8
}
END {
    printf "individual data : %d\n", (NR ? NR - numRange : 0)
    printf "range data : %d\n", numRange
    for ( i=1; i<=numRange; i++ ) {
        # index by i so that each stored range is reported, not just the last one
        printf "range expanded : %d (expansion of %d to %d)\n", \
            1 + end[i] - beg[i], beg[i], end[i]
    }
}

$ awk -f tst.awk file
individual data : 1
range data : 1
range expanded : 37 (expansion of 73760 to 73796)

The above assumes the 2nd input number on each line is always greater than or equal to the 1st input number. If that's not the case then tweak the END logic to suit.
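
As a rough sketch of one such tweak (an assumption about what is wanted, not part of the script above), the bounds can be swapped when they arrive in the wrong order, before they are stored. The wrapper function name range_report and the reversed sample line are made up for illustration.

```shell
# Hypothetical variant of tst.awk that tolerates up < low by swapping the bounds.
range_report() {
  awk -F'[ =]' '
    $8 != $6 {
      lo = $6 + 0; hi = $8 + 0
      if (lo > hi) { tmp = lo; lo = hi; hi = tmp }   # normalise the order
      beg[++numRange] = lo
      end[numRange] = hi
    }
    END {
      printf "individual data : %d\n", NR - numRange
      printf "range data : %d\n", numRange
      for (i = 1; i <= numRange; i++)
        printf "range expanded : %d (expansion of %d to %d)\n",
               1 + end[i] - beg[i], beg[i], end[i]
    }' "$@"
}

printf '%s\n' \
  'TMI=012813 FCI=00 low=000654 up=000654 sor=0E' \
  'TMI=012813 FCI=00 low=073796 up=073760 sor=0E' | range_report
# individual data : 1
# range data : 1
# range expanded : 37 (expansion of 73760 to 73796)
```

Swapping in the main rule rather than in END keeps the stored arrays consistent, so any later processing can rely on beg[i] <= end[i].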
