How to remove lines with nonsense format numbers?

Question

I am processing the following data to extract the 1st and 5th columns, convert the D exponent format to E format, and delete rows that contain malformed numbers such as 9.410-316.

DEG =      1.500
     2.600D+01     0.000D+00     0.000D+00     0.000D+00     0.000D+00
     2.700D+01     8.720-304     2.369-316     7.556-316     9.410-316
     4.300D+01     1.208D-83     4.156D-96     7.360D-96     6.984D-96
     1.590D+02     8.002D-07     6.555D-19     7.748D-19     7.376D-19
     1.600D+02     1.173D-06     9.669D-19     1.143D-18     1.089D-18
     1.610D+02     1.709D-06     1.417D-18     1.676D-18     1.596D+01
     1.620D+02     2.468D-06     2.058D-18     2.436D-18     2.320D-10
DEG =     18.500 
     2.700D+01     2.794-314     0.000D+00     0.000D+00     0.000D+00
     2.800D+01     4.352-285     1.224-297     3.685-297     4.412-297
     8.800D+01     1.371D-02     6.564D-15     7.852D-15     7.275D-15

My problem is in identifying the number formats that I want to delete. So far, I have tried

maxa=18.5
maxangle=$(printf "%.3f" $maxa)
if (( $(echo "$maxa < 10" | bc -l) )); then
  txt2search="DEG =      $maxangle"
  # 6 spaces between "=" and the value if the angle is < 10, otherwise only 5
else
  txt2search="DEG =     $maxangle"
fi

line=$(grep -n "$txt2search" file  | cut -d : -f 1)

# Once the line number is read for the string, skip a few lines (4) and read next several lines(1000)
beginline=$((line + 4))
endline=$((line + 1002))
awk -v a="$beginline" -v b="$endline" 'NR==a, NR==b {print $1, $5}' file > fileoutput
sed -i 's/D/E/g' fileoutput

Then, to discard the rows with the nonsense numbers, I tried each of the following commands, one at a time, without success.

sed -ni '/E/p' fileoutput
sed -E '/(E)/!d' fileoutput > spec2.tempdata
sed '/E/!d' fileoutput > spec2.tempdata
awk '!/E/' fileoutput > spec2.tempdata

How can I identify and remove lines with such nonsense numbers? The versions are

sed (GNU sed) 4.7
grep (GNU grep) 3.4
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)

The expected output would be

2.600D+01     0.000D+00     0.000D+00     0.000D+00     0.000D+00
4.300D+01     1.208D-83     4.156D-96     7.360D-96     6.984D-96
1.590D+02     8.002D-07     6.555D-19     7.748D-19     7.376D-19
1.600D+02     1.173D-06     9.669D-19     1.143D-18     1.089D-18
1.610D+02     1.709D-06     1.417D-18     1.676D-18     1.596D+01
1.620D+02     2.468D-06     2.058D-18     2.436D-18     2.320D-10

EDIT: The solution that I was looking for is (see first comment)

grep -v '[0-9]-'
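
For anyone landing here, a minimal sketch of what that filter does on a few of the sample rows. The function name drop_malformed is hypothetical, and the sketch assumes every malformed field has a digit directly followed by a minus sign; a field such as 1.234+100 would not be caught.

```shell
#!/bin/sh
# drop_malformed: hypothetical helper. Removes every line in which a digit is
# directly followed by "-" (e.g. 9.410-316). Well-formed fields such as
# 6.984D-96 survive because their "-" follows a "D", not a digit.
drop_malformed() {
  grep -v '[0-9]-'
}

# Three rows taken from the sample data; only the second one is dropped.
printf '%s\n' \
  '     2.600D+01     0.000D+00     0.000D+00     0.000D+00     0.000D+00' \
  '     2.700D+01     8.720-304     2.369-316     7.556-316     9.410-316' \
  '     4.300D+01     1.208D-83     4.156D-96     7.360D-96     6.984D-96' |
  drop_malformed
```

The filter works on whole lines, so a row is removed even if only one of its fields is malformed.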

If all your "nonsense" numbers are like x.y-z, you can probably just use grep -v '[0-9]-'
– muru, Commented Apr 11, 2023 at 17:01
Please edit your question and specify what exactly you consider "nonsense numbers" and show the expected output for your example input. Where do the magic numbers 4 and 1002 in your code come from? Are they related to your input?
– Bodo, Commented Apr 11, 2023 at 17:51
As someone who deals with large amounts of numbers on the regular: I'm worried about the validity of your whole dataset if you got numbers in there that you can't interpret. These numbers just seem to be omitting the D, and would then be very close to the machine epsilon of double precision IEEE754 floating point numbers. That's a bit much of a coincidence. I don't believe in computers producing gibberish out of thin air – these numbers came into that dataset somehow, and you erasing a lot of very small numbers sounds like you're falsifying a statistic just because you're too lazy.
– Marcus Müller, Commented Apr 11, 2023 at 17:53
You claim "they cannot be interpreted as valid numbers", to which I say, "They can, you just have decided to not investigate them properly". I even offer you an (in my humble opinion, reasonable) interpretation: For numbers with an exponent < -99, the D is omitted for space reasons!
– Marcus Müller, Commented Apr 11, 2023 at 18:04
Can you please explain why 9.410-316 cannot be interpreted as a valid number? I mean, 9.410-316=−306.59. Could we think of it another way, can we simply remove any entry if it has a - unless it is the first character or it follows a D or E? And what should we do with the removed fields? Leave them blank? Add some filler? What is the expected output here?
– terdon ♦, Commented Apr 11, 2023 at 18:23

jubilatious1 · Accepted Answer · 2023-04-15 06:31:27Z

Using Raku (formerly known as Perl_6)

~$  raku -e 'my @a; for lines.join("\n").split(/ \n <?before DEG> /) { @a.push: %(.split("\n").[0].words.[2] => \
             .split("\n")[1..*].map(*.words[0,4])>>.map(*.subst( / (\d+) (<[+-]>) /, {$0 ~ "e" ~ $1} ).subst(/D/, "e") )>>.Num) };  \
             .raku.put for @a;'  file

Sample Output for visualization purposes:

${"1.500" => $($(26e0, 0e0), $(27e0, 9.41e-316), $(43e0, 6.984e-96), $(159e0, 7.376e-19), $(160e0, 1.089e-18), $(161e0, 15.96e0), $(162e0, 2.32e-10))}
${"18.500" => $($(27e0, 0e0), $(28e0, 4.412e-297), $(88e0, 7.275e-15))}

Raku is a programming language in the Perl family that features rational numbers and Unicode support built-in. Above, the general strategy is to create an array-of-hashes, with the DEG value as key and measurements in columns 1 and 5 (index [0,4]) as value.

An @-sigiled array is declared (@a). The Raku code reads auto-chomped lines in, joining them back together on \n newlines. From here we break into records by splitting on \n newlines that occur before DEG. Entering the { … } block, each record is again split on \n newlines, with the .[0].words.[2] third-word-of-the-first element becoming a key. The => fat-arrow denotes "pair" construction, with everything after becoming a value. Note the two .subst calls: the first to insert an "e" between a \d digit and a bespoke character-class consisting of <[+-]> plus-or-minus sign, and the second to change "D" to "e". Values are converted to .Num, and a %( … ) hash is pushed onto the @a array. The .raku method is added to the output line to enable visualization of Raku's internal representation of the data (note .perl also works as a synonym).
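
For readers without Raku, the two substitutions can be sketched with sed. This is only an illustration under the assumption raised in the comments, namely that a field like 9.410-316 is a number whose exponent letter was dropped; the function name restore_exponent is hypothetical.

```shell
#!/bin/sh
# restore_exponent: hypothetical helper. Assumes a field such as 9.410-316 is
# a number whose exponent letter was dropped, so it is rewritten as 9.410E-316.
# Header lines starting with DEG are left alone so "DEG" is not changed.
restore_exponent() {
  sed '/^DEG/!{
s/\([0-9]\)\([-+][0-9]\)/\1E\2/g
s/D/E/g
}'
}

printf '%s\n' '     2.700D+01     8.720-304     9.410-316' | restore_exponent
# prints:      2.700E+01     8.720E-304     9.410E-316
```

The first substitution inserts an E between a digit and a sign, the second turns the remaining D exponents into E, which is the same order as the two .subst calls above. Repaired fields become one character wider, so fixed-width column alignment is not preserved.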

Actual Output for Plotting:

Change the final line .raku.put for @a to get your desired output for plotting. A few examples below (alternatively you can use Raku's printf or sprintf if desired):

1. Replace the output line above to return only the first DEG block:

for @a[0].kv -> $k,$v {put ([Z] $k xx $v.elems, $v).join: "\n"}

#Returns 3-columns:

1.500 26 0
1.500 27 9.41e-316
1.500 43 6.984e-96
1.500 159 7.376e-19
1.500 160 1.089e-18
1.500 161 15.96
1.500 162 2.32e-10

2. Or return the whole 3-column table at once with the following output line:

for @a { for ($_.kv) -> $k,$v {put ([Z] $k xx $v.elems, $v).join: "\n"}};

#Returns:

1.500 26 0
1.500 27 9.41e-316
1.500 43 6.984e-96
1.500 159 7.376e-19
1.500 160 1.089e-18
1.500 161 15.96
1.500 162 2.32e-10
18.500 27 0
18.500 28 4.412e-297
18.500 88 7.275e-15

3. FINALLY: Raku has a =~= "tolerance" operator which can be used to determine whether values are approximately equal to zero (the tolerance defaults to 1e-15; see the links below). Putting it all together:

~$ raku -e 'my @a; for lines.join("\n").split(/ \n <?before DEG> /) { @a.push: %(.split("\n").[0].words.[2] => \
            .split("\n")[1..*].map(*.words[0,4])>>.map(*.subst( / (\d+) (<[+-]>) /, {$0 ~ "e" ~ $1} ).subst(/D/, "e") )>>.Num) };  \
            for @a {  for ($_.kv) -> $k,$v {put ([Z] $k xx $v.elems,  $v>>.map( -> $i { ($i =~= 0) ?? 0 !! $i } )).join: "\n"}};'  file
1.500 26 0
1.500 27 0
1.500 43 0
1.500 159 0
1.500 160 0
1.500 161 15.96
1.500 162 2.32e-10
18.500 27 0
18.500 28 0
18.500 88 7.275e-15
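
The same tolerance idea can be sketched in awk. This is a rough equivalent rather than a reimplementation of =~=: it assumes two-column input that is already in E notation, and the function name flush_small and its optional tolerance argument are hypothetical.

```shell
#!/bin/sh
# flush_small: hypothetical helper. Reads "x y" pairs in E notation and prints
# y as 0 when |y| is below the tolerance. The tolerance is the optional first
# argument and defaults to 1e-15, mirroring the Raku default.
flush_small() {
  awk -v tol="${1:-1e-15}" '
    BEGIN { tol += 0 }                 # force a numeric tolerance
    {
      y = $2 + 0
      if (y < 0) y = -y                # absolute value
      print $1 + 0, (y < tol ? 0 : $2 + 0)
    }'
}

printf '%s\n' '1.610E+02 1.596E+01' '1.600E+02 1.089E-18' | flush_small
# prints:
# 161 15.96
# 160 0
```

Unlike deleting rows, this keeps every x value and only replaces the tiny y values, so nothing is silently dropped from the data.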

https://docs.raku.org/language/hashmap.html
https://docs.raku.org/language/5to6-nutshell.html#=%3E_Fat_comma
https://docs.raku.org/routine/=~=.html
https://raku.org

Given the answer you accepted, Raku can do that too: raku -e 'for lines() { .grep( {! /\d"-"/} ) ?? .say !! next }' file ;-)
– librasteve, Commented Apr 17, 2023 at 20:51
@p6steve nice one! Or maybe even: ~$ raku -ne '.put unless / \d \- /;' file. But please don't take this comment as endorsing throwing out data; it is much better to use a tolerance criterion: docs.raku.org/routine/=~=.html
– jubilatious1, Commented Apr 17, 2023 at 22:57
Not sure why, but I prefer talking about control flow for filtering: raku -pe 'next unless /^\d/;'
– Matt Oates, Commented Apr 18, 2023 at 10:58
I had forgotten the raku command flags -n and -p ... thanks for the reminder. I just made a minor tweak to Matt's to get it to work on my data: raku -pe 'next if / \d \- /' file (see docs.raku.org/language/5to6-nutshell.html#Command-line_flags)
– librasteve, Commented Apr 18, 2023 at 16:56

Ed Morton · Accepted Answer · 2023-04-12 14:38:31Z

FWIW here's how to change the Ds to Es in your input and then tell if a field is a number or not by comparing its value before and after adding 0 to it (a number will retain its value, a non-number won't):

$ awk 'NF>3{gsub(/D/,"E"); for (i=1; i<=NF; i++) if ($i != $i+0) print "not a number:", $i}' file
not a number: 8.720-304
not a number: 2.369-316
not a number: 7.556-316
not a number: 9.410-316
not a number: 2.794-314
not a number: 4.352-285
not a number: 1.224-297
not a number: 3.685-297
not a number: 4.412-297

and so, to print only the lines in which every field is a number:

$ awk 'NF>3{gsub(/D/,"E"); for (i=1; i<=NF; i++) if ($i != $i+0) next; print}' file
     2.600E+01     0.000E+00     0.000E+00     0.000E+00     0.000E+00
     4.300E+01     1.208E-83     4.156E-96     7.360E-96     6.984E-96
     1.590E+02     8.002E-07     6.555E-19     7.748E-19     7.376E-19
     1.600E+02     1.173E-06     9.669E-19     1.143E-18     1.089E-18
     1.610E+02     1.709E-06     1.417E-18     1.676E-18     1.596E+01
     1.620E+02     2.468E-06     2.058E-18     2.436E-18     2.320E-10
     8.800E+01     1.371E-02     6.564E-15     7.852E-15     7.275E-15

or:

$ awk 'NF>3{gsub(/D/,"E"); for (i=1; i<=NF; i++) if ($i != $i+0) next} 1' file
DEG =      1.500
     2.600E+01     0.000E+00     0.000E+00     0.000E+00     0.000E+00
     4.300E+01     1.208E-83     4.156E-96     7.360E-96     6.984E-96
     1.590E+02     8.002E-07     6.555E-19     7.748E-19     7.376E-19
     1.600E+02     1.173E-06     9.669E-19     1.143E-18     1.089E-18
     1.610E+02     1.709E-06     1.417E-18     1.676E-18     1.596E+01
     1.620E+02     2.468E-06     2.058E-18     2.436E-18     2.320E-10
DEG =     18.500
     8.800E+01     1.371E-02     6.564E-15     7.852E-15     7.275E-15

depending on whether you want those DEG lines output or not.
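
Building on the same number test, here is a sketch of the whole task from the question in one awk call: pick one DEG block, keep columns 1 and 5 in E notation, and skip rows with a non-numeric field. The function name extract_block and its argument are hypothetical, and the sketch assumes every block starts with a "DEG = value" header line.

```shell
#!/bin/sh
# extract_block: hypothetical helper. Prints columns 1 and 5 (in E notation)
# of the block whose header is "DEG = <value>", skipping rows in which any
# field fails the "$i != $i+0" number test.
extract_block() {
  awk -v want="$1" '
    $1 == "DEG" { inblock = ($3 + 0 == want + 0); next }  # header: switch block on or off
    inblock && NF >= 5 {
      gsub(/D/, "E")                       # Fortran D exponent -> E
      for (i = 1; i <= NF; i++)
        if ($i != $i + 0) next             # non-numeric field: drop the row
      print $1, $5
    }'
}

printf '%s\n' \
  'DEG =      1.500' \
  '     2.600D+01     0.000D+00     0.000D+00     0.000D+00     0.000D+00' \
  '     2.700D+01     8.720-304     2.369-316     7.556-316     9.410-316' \
  '     1.610D+02     1.709D-06     1.417D-18     1.676D-18     1.596D+01' \
  'DEG =     18.500' \
  '     8.800D+01     1.371D-02     6.564D-15     7.852D-15     7.275D-15' |
  extract_block 1.5
# prints:
# 2.600E+01 0.000E+00
# 1.610E+02 1.596E+01
```

Comparing the header value numerically means the number of spaces after the "=" no longer matters, and no line offsets have to be computed.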
