Counting lines with identical and ranged values, including range expansion

Question

I have data in the format below, where in the first line, the lower and upper values are the same, but in the next line, the lower and upper values form a range (i.e., from 73760 to 73796). I need to calculate the count of all lines where the lower and upper values are the same, and, on the other hand, the count of all lines where the lower and upper values form a range, along with the expanded count of the range data (i.e., the number of values within the range).

Output as below:

individual data : 1
range data : 1
range expanded : 37 (expansion of 73760 to 73796)

Input data as below:

TMI=012813 FCI=00 low=000654 up=000654 sor=0E
TMI=012813 FCI=00 low=073760 up=073796 sor=0E

It is unclear what the output should look like when there is more than one range of non-zero width.
– Kusalananda ♦, Commented Oct 23, 2024 at 6:53
Hi, I didn't get your query. The duplicate entries should be removed and the data should be unique. Better if we get the count for duplicate values as well.
– Surya Shukla, Commented Oct 23, 2024 at 14:30
I'm not talking about any duplicated entries. I asked what the output should look like if there's a third line in the input, TMI=012813 FCI=00 low=093760 up=093796 sor=0E.
– Kusalananda ♦, Commented Oct 23, 2024 at 18:28
Hi @Kusalananda, thanks for your time. The output should have only the count of lines as stated above in the query. I just need the count of lines which are unique, next the count of the lines having low and up in range, and then the count of range expanded to some figure.
– Surya Shukla, Commented Oct 24, 2024 at 4:38
"count of range expanded to some figure" does not make much sense to me. It would probably help if you expanded your sample to something that includes duplicates and more than one line with a range along with the expected output. Please edit your question to add the requested details.
– Stéphane Chazelas, Commented Oct 24, 2024 at 5:17

Stéphane Chazelas · Accepted Answer · 2024-10-23 07:44:53Z


With perl:

perl -lne '
  if (/ low=(\d+) up=(\d+) /) {
    $range = $2 - $1 + 1;
    $count[$range == 1]++;
    $max_range = $range if $range > $max_range;
  }
  END {
    printf "individual data: %d\nrange data: %d\nrange expanded: %d\n",
      $count[1], $count[0], $max_range;
  }' < your-file

Or, if "range expanded" is meant to be the overall span covered by all the non-single ranges (from the lowest low to the highest up) rather than the maximum length of the range on any single line:

perl -lne '
  if (/ low=(\d+) up=(\d+) /) {
    $count[$2 == $1]++;
    unless ($1 == $2) {
      # $min_low starts undefined (numerically 0), so seed it from the first range seen
      $min_low = $1 if !defined($min_low) || $1 < $min_low;
      $max_up = $2 if $2 > $max_up;
    }
  }
  END {
    printf "individual data: %d\nrange data: %d\nrange expanded: %d\n",
      $count[1], $count[0], defined($min_low) ? $max_up - $min_low + 1 : 0;
  }' < your-file
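To make the difference between the two readings concrete, here is a small awk sketch that reports both figures for the same input, using the third line suggested in the comments on the question. The function name range_summary is made up for illustration, and the sketch assumes up is never smaller than low.

```shell
# Sketch only: print both readings of "range expanded" side by side.
# range_summary is a hypothetical helper name, not part of the answer above.
range_summary() {
  awk -F'[ =]' '
    $8 != $6 {
      len = $8 - $6 + 1                        # number of values covered by this line
      if (len > longest) longest = len         # reading 1: longest single range
      if (!seen || $6 + 0 < min_low) min_low = $6 + 0
      if (!seen || $8 + 0 > max_up)  max_up  = $8 + 0
      seen = 1
    }
    END {
      printf "longest single range: %d\n", longest
      # reading 2: lowest low to highest up across all ranged lines
      printf "overall span: %d\n", (seen ? max_up - min_low + 1 : 0)
    }
  ' "$@"
}

range_summary <<'EOF'
TMI=012813 FCI=00 low=000654 up=000654 sor=0E
TMI=012813 FCI=00 low=073760 up=073796 sor=0E
TMI=012813 FCI=00 low=093760 up=093796 sor=0E
EOF
```

With that three-line input, the first reading gives 37 and the second gives 20037 (93796 - 73760 + 1), so the two interpretations diverge as soon as there is more than one ranged line.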

edited Oct 23, 2024 at 7:44

answered Oct 23, 2024 at 6:53

Stéphane Chazelas


Thanks for all your help, but instead of perl can there be a Linux command for this? I need to understand the approach as well, to get the exact count so that I can log the individual data for all. In addition, we can have duplicate entries as well. I would be happy to get that too, along with an individual log for all such data.
– Surya Shukla, Commented Oct 23, 2024 at 15:06
@SuryaShukla? What's a Linux command? Linux is just an operating system kernel. perl predates Linux and has been available for Linux-based operating systems for about as long as Linux has existed.
– Stéphane Chazelas, Commented Oct 23, 2024 at 15:31
@SuryaShukla perl is a Linux command.
– Kusalananda ♦, Commented Oct 23, 2024 at 18:29
@StéphaneChazelas Thanks, I understand that perl is a Linux command. But since I am not a hardcore Linux user, I thought something using commands like grep/awk/cut could be given. Thanks once again. The perl one worked, but there seems to be an issue with the range figure: the file also contains duplicate entries, and I have to consider only the unique ones for the range. For example, if the figure for an individual entry is 1 and the range expansion yields the same figure again, it should be ignored.
– Surya Shukla, Commented Oct 24, 2024 at 4:41
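
For the duplicate handling and the awk-style request raised in the comments above, one possible sketch in plain awk follows. It rests on assumptions the thread never settles: a "duplicate" is taken to mean a line whose low/up pair has already been seen, and "range expanded" is taken to be the sum of the lengths of the distinct ranges. The function name count_unique is hypothetical.

```shell
# Sketch only: count each distinct low/up pair once and report skipped repeats.
# count_unique is a hypothetical helper name; the meaning of "duplicate" is assumed.
count_unique() {
  awk -F'[ =]' '
    NF < 8 { next }                            # skip blank or malformed lines
    seen[$6 + 0, $8 + 0]++ { dup++; next }     # low/up pair already counted
    $6 + 0 == $8 + 0 { single++; next }        # low equals up: individual entry
    { ranged++; expanded += $8 - $6 + 1 }      # assumed: sum of distinct range lengths
    END {
      printf "individual data : %d\n", single
      printf "range data : %d\n", ranged
      printf "range expanded : %d\n", expanded
      printf "duplicates skipped : %d\n", dup
    }
  ' "$@"
}

count_unique <<'EOF'
TMI=012813 FCI=00 low=000654 up=000654 sor=0E
TMI=012813 FCI=00 low=073760 up=073796 sor=0E
TMI=012813 FCI=00 low=073760 up=073796 sor=0E
EOF
```

Ranges that overlap without being identical would still be counted in full under this sketch; handling that would need the ranges to be merged first.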


Kusalananda · Accepted Answer · 2024-10-23 07:52:26Z

Using Miller (mlr):

$ mlr --ifs space put -q -f script file
Range expanded: 37 (73760 to 73796)
Single data: 1
Ranged data: 1

The input data looks very much like DKVP ("delimited key-value pairs"), Miller's native data format, but uses spaces in place of commas for field delimiters, which is why we use --ifs space on the command line of mlr to set the input field separator.

The script in the file named script is inspired by Stéphane's answer, but interprets the question as asking for the expanded size of every range of length greater than 1.

lo = int($low, 10);
hi = int($up, 10);
range = 1 + hi - lo;

@count[int(range == 1)] += 1;

if (range > 1) {
    print "Range expanded: " . range . " (" . lo . " to " . hi  . ")";
}

end {
    print "Single data: " . @count[1];
    print "Ranged data: " . @count[0];
}
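
Miller addresses the fields by key name ($low, $up) rather than by position. If Miller is not available, roughly the same idea can be sketched in plain awk by splitting each key=value field at its "=" sign. The function name get_bounds is made up for this illustration.

```shell
# Sketch only: look up low/up by key name instead of by field position.
# get_bounds is a hypothetical helper name.
get_bounds() {
  awk '
    NF {
      for (i = 1; i <= NF; i++) {
        eq = index($i, "=")                    # position of the first "=" in the field
        if (eq) kv[substr($i, 1, eq - 1)] = substr($i, eq + 1)
      }
      print kv["low"] + 0, kv["up"] + 0        # "+ 0" drops the leading zeros
      split("", kv)                            # portable way to empty the array
    }
  ' "$@"
}

get_bounds <<'EOF'
TMI=012813 FCI=00 low=000654 up=000654 sor=0E
TMI=012813 FCI=00 low=073760 up=073796 sor=0E
EOF
```

Because the lookup is by name, this keeps working if the fields appear in a different order on some lines.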

canupseq · Accepted Answer · 2024-10-24 07:03:31Z


You can use the FPAT variable of GNU awk to extract the numbers on each line as fields, then simply compare fields 3 and 4.

$ awk 'BEGIN { FPAT = "[0-9]+" }; $4>$3{c=c+1;r=$4-$3+1;print "count="c"  range expanded="r} END{print "Total ranges (expanded)="c;print "count of lines unexpanded="NR-c}' input_file
count=1  range expanded=37
Total ranges (expanded)=1
count of lines unexpanded=1

answered Oct 24, 2024 at 7:03

canupseq


Note that FPAT is a GNU awk extension.
– Stéphane Chazelas, Commented Oct 24, 2024 at 7:43


Ed Morton · Accepted Answer · 2024-10-26 10:44:06Z

Using any awk:

$ cat tst.awk
BEGIN { FS="[ =]" }
$8 != $6 {
    beg[++numRange] = $6
    end[numRange] = $8
}
END {
    printf "individual data : %d\n", (NR ? NR - numRange : 0)
    printf "range data : %d\n", numRange
    for ( i=1; i<=numRange; i++ ) {
        printf "range expanded : %d (expansion of %d to %d)\n", \
            1 + end[i] - beg[i], beg[i], end[i]
    }
}

$ awk -f tst.awk file
individual data : 1
range data : 1
range expanded : 37 (expansion of 73760 to 73796)

The above assumes the 2nd input number on each line is always greater than or equal to the 1st input number. If that's not the case then tweak the END logic to suit.
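
As one possible way to relax that assumption, the sketch below swaps the two bounds when the first is larger than the second before computing the length. It is an illustration only, not part of the script above, and the function name range_lengths is hypothetical.

```shell
# Sketch only: tolerate lines where low is greater than up by swapping the bounds.
# range_lengths is a hypothetical helper name.
range_lengths() {
  awk -F'[ =]' '
    $8 != $6 {
      lo = $6 + 0; hi = $8 + 0
      if (lo > hi) { tmp = lo; lo = hi; hi = tmp }   # reversed bounds: swap them
      printf "range expanded : %d (expansion of %d to %d)\n", hi - lo + 1, lo, hi
    }
  ' "$@"
}

range_lengths <<'EOF'
TMI=012813 FCI=00 low=073796 up=073760 sor=0E
EOF
```

For the reversed line in the example, this prints the same 37-value expansion of 73760 to 73796 as the original input does.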
