6

In Linux there is a file numbers.txt. It contains a few numbers, all separated with space.

There are numbers like this: 5476089856 71788143 9999744134 114731731 3179237376

In this example only the first and the last numbers are divisible by 4096. All others are not divisible by 4096. I need all numbers divisible by 4096.

The numbers which are not divisble by 4096 should be rounded up. The other ones should be untouched.

All of them I need written to a new file numbers_4096.txt.

Numbers like the second 71788143 must be rounded up to 71790592. 9999744134 to 9999745024 and so on...

Numbers like 5476089856 and 3179237376 should not be changed, they are divisible by 4096.

This script can do this with a single number:

#!/bin/sh

number=9999744134
divisor=4096

remainder=$((number % divisor))
new_number=$((number - remainder))
roundedup=$((new_number + divisor))

echo "Original number: $number"
echo "Divisor: $divisor"
echo "New number (divisible by $divisor): $new_number"
echo "Roundedup Number (divisible by $divisor): $roundedup"

But how to do that with a whole list that has to be rewritten and rounded up into a new file?

This script adds 4096 to numbers which are divisible by 4096, that is not what I want.

1
  • 4
    How many numbers in each line or does it vary? Commented Jun 17 at 1:12

7 Answers 7

15

in awk it is possible to avoid logical checks and rely on pure math in "one line":

awk  '{ for(i=1;i<=NF;i++) printf "%i ", int(($i+4095)/4096)*4096 ;print ""}' input_file

The logic is:

  1. add 4096-1
  2. get integer part of division to 4096
  3. multiply to 4096

With default Awk precision settings, this incorrectly rounds up for 2^n - 4096 for n >= 54, due to the limitations of double-precision floating point which Awk uses for all its numbers. When the FP exponent field = 53, only even numbers are representable, so adding 4095 actually rounds to the nearest-even number, which is upwards to 2^54. For larger numbers, the representable values are even farther apart, larger powers of 2.

Also, rounding can occur on input into Awk for huge integers (greater than 2^54), so the same problem is possible even with huge inputs in the original text. For a large-enough exponent, 2^n - 4096 wouldn't be representable and would have rounded up to 2^n in the first place, before we try adding 4095.

GNU Awk can do arbitrary-precision arithmetic with -M / --bignum, but that actually doesn't help here (GNU Awk 5.3.1). So be careful with this if your numbers are much bigger than the example.

Adding 4095 can't overflow to +Infinity, but incorrectly rounding up when a huge number was 2^n - 4096 is possible.

The smallest number with that problem is 2^54 - 4096 = 0x3ffffffffff000 = 18014398509477888. This answer works for all smaller numbers, so should be fine for many use-cases. But not for the full range of [u]int64_t, let alone of double.

7
  • 2
    Can this ever incorrectly round up for huge numbers? Awk uses floating-point for all numbers, IIRC. At some exponent, representable doubles will be 2048 apart, so adding 4095 will actually round up to a multiple of 2048, so it will be like adding 4096. But that's fine because the number is either a multiple of 4096 to start with, or 2048 away from that, not 1. Maybe an addition that crosses an exponent boundary? Will it fail for DBL_MAX + 4095 giving +Infinity, or does that round back down to DBL_MAX? (With integer math a common trick is ((x-1)|4095) + 1 to avoid overflow) Commented Jun 17 at 9:35
  • In IEEE binary64 math, DBL_MAX + 4095 == DBL_MAX;. (godbolt.org/z/bjznaWPj4). So yeah this is safe from overflow even for DBL_MAX which is already a multiple of a large power of 2. Still possible there could be a round-up error at the exponent value where representable numbers are 2 apart. 2^e - 4096 is representable for that exponent; adding 4095 rounds up to the next power of 2. That's a problem for many exponent values, when representable numbers are between 2 and 4096 apart, inclusive. en.wikipedia.org/wiki/Double-precision_floating-point_format Commented Jun 17 at 9:46
  • @PeterCordes, let say it on this way: the above code will work when the inputs are integer numbers less than 10^308 (which is huuuge number). At the end my answer is based on the OP input which show integer numbers which are not so big :) Commented Jun 17 at 9:49
  • 3
    The smallest number it fails for is 2^54 - 4096 (0x3ffffffffff000 = 18014398509477888), which is about 1.8 x 10^16. So there are multiple int64_t values where this fails. That's larger than the OP's examples, but small enough that it should definitely be mentioned as a limitation. (see the limits section of the wiki article). I tested just now with GNU Awk 5.3.1, and indeed it rounds 18014398509477888 up to 18014398509481984 Commented Jun 17 at 9:52
  • @PeterCordes, correct, just checked. But still this is big number :) Commented Jun 17 at 9:56
7

With perl, finding all sequences of decimal digits and rounding them up to multiple of 4096:

$ echo 'foo123bar 4095-4097 1e+4 0x1f2 004123 3.4' |
   perl -pe 's{\d+}{int(($& + 4095)/4096)*4096}ge'
foo4096bar 4096-8192 4096e+4096 0x4096f4096 8192 4096.4096

If you wanted to consider those -4097, 1e+4, 0x1f2, 3.4 as the negative / floating point / hexadecimal numbers they represent instead of only considering the sequences of decimal digits within, you could change it to:

$ echo '0x1p13 foo123bar 4095-4097 4096.001 1e+4 0x1f2 004123 3.4 infinity nan' |
   perl -MPOSIX=floor,strtold -pe '
     s{0x[\da-f]+(p[-+]?\d+)?|[-+]?(\d+\.?\d*|\.\d+)(e[-+]?\d+)?|\b(inf(inity)?|nan)\b}{
       ($x) = strtold($&);
       $y = floor($x / 4096) * 4096;
       $y == $x ? $y : $y+4096
     }gie'
8192 foo4096bar 4096-4096 8192 12288 4096 8192 4096 Inf NaN

Change ? $y to ? $& to not reformat the numbers that are multiples of 4096 as decimal integer (as in keep for instance 0x1000, 0X1P12, +40.96E+2, 4096.000 as is and not change them to 4096).

5
  • 1
    And perl -Mbigint -pe ... will transparently make it work for huge integers Commented Jun 18 at 7:15
  • Yes, though not for the ones that are constrained to long double by POSIX' strtold(). Commented Jun 18 at 8:21
  • Sure, but I don't see why you even added that section. "Divisible by 4096" makes no sense with floats. Commented Jun 18 at 9:07
  • 2
    @pipe, it may still make sense for 0x1p15 or 4.096e+3 which are other representations of integers that are multiple of 4096 and for which we use strtold to interpret. Commented Jun 18 at 9:43
  • @pipe, why not? 8192 is divisible by 4096, while 4000 and 0.5 aren't. 1.25 is divisible by 0.25, while 1.2 and 0.125 aren't. Commented Jun 18 at 11:21
6

shell (for one number)

#!/bin/sh

number=9999744134
divisor=4096

remainder=$((number % divisor))
echo "Original number: $number"
echo "Divisor: $divisor"

if [ "$remainder" -eq 0 ]; then
    echo "Unchanged number (divisible by $divisor): ${number}"
else
    new_number=$((number - remainder))
    roundedup=$((new_number + divisor))
    echo "New number (divisible by ${divisor}): ${new_number}"
    echo "Roundedup Number (divisible by ${divisor}): ${roundedup}"
fi

awk (for the whole file)

echo 5476089856 71788143 9999744134 114731731 3179237376 |
  awk -v divisor=4096 '
    { for (i=1;i<=NF;i++)  {
        num=$i; remainder = num % divisor;
        if (remainder == 0) new_num = num;
        else new_num = num + divisor - remainder;
        printf "%i ", new_num;
      }
      print "";
    }'
5476089856 71790592 9999745024 114733056 3179237376
7
  • This might be a possible use case for the factor command:- factor 8192 outputs '8192: 2 2 2 2 2 2 2 2 2 2 2 2 2' Commented Jun 17 at 1:13
  • 2
    @JeremyBoden, factor feels a bit awkward here, it gives both too much information (the whole breakdown to prime factors, while just divisibility by 4096 is needed), and too little information (it doesn't help in rounding up) Commented Jun 17 at 8:41
  • You could also use if (remainder != 0) num = num + divisor - remainder; printf "%i ", num, since the other branch is a no-op Commented Jun 17 at 9:40
  • 2
    ((x-1)|4095)+1 uses bitwise OR to safely round up without overflow, and without an expensive division operation or checking remainders. This of course only works for power-of-2 divisors because it relies on binary integer representations. Does /bin/sh support $(( )) math at all? I think that's not part of POSIX sh, so your script should be using /bin/bash or /bin/ksh or anything other than /bin/sh. Bash supports bitwise-OR with |. If you want to check whether rounding-up happened, check for equality of input vs. output. Commented Jun 17 at 10:25
  • 3
    @PeterCordes, $(( )) is standard and the standard refers to C operators in the context of arithmetic expansion, so ((x-1)|4095)+1 should be fine. (It explicitly doesn't require ++ and --, and the "standalone" (( .. )) construct isn't standard.) Commented Jun 18 at 11:16
1

If you are always using a power of 2, like 4096, which has 12 bits.

For bash:

for i in $(cat numbers.txt); do
   echo $(( (i & ~4095) + ((!(!(i & 4095))) << 12) ))
done

where the first mask rounds down. The second mask checks the lower bits and the logical nots create a 0 or 1 for the remainder. The left shift converts the remainder to 0 or 4096.

In my experience, some Linux distributions implicitly replace bash for sh.

1
  • 4
    if you're doing it in the shell, micro-optimizations with this sort of bit-fiddling aren't too useful since the shell is going to be slow anyway, and all this ends up just obfuscating the matter to future readers Commented Jun 18 at 11:19
1

numfmt from GNU coreutils can do this in 2 passes:

< numbers.txt \
  numfmt --to-unit=4096 --round=up --field=- |
  numfmt --from-unit=4096 --field=-

The first pass does the division (rounding up):

% print 5476089856 71788143 9999744134 114731731 3179237376 |
> numfmt --to-unit=4096 --round=up --field=-
1336936    17527    2441344     28011     776181

The second pass does the multiplication:

% print 5476089856 71788143 9999744134 114731731 3179237376 |
> numfmt --to-unit=4096 --round=up --field=- |
> numfmt --from-unit=4096 --field=-
5476089856 71790592 9999745024 114733056 3179237376

The above assumes numbers.txt only has space/tab separated decimal integers (no trailing space, optional - sign). If there are non-numeric fields, --invalid=ignore (for both numfmt invocations) should leave those fields unchanged. Or perhaps you could specify which fields are to be converted (e.g. --field=2,3 to convert only fields 2 and 3). If the input may contain a decimal point as in 71788143.0, then add --format=%.0f to the first numfmt invocation. In this case, if that decimal point character shall be . (dot), then you may want to use the C locale (e.g. with LC_ALL=C numfmt ...). Otherwise the decimal point should match the locale (see locale decimal_point).

1
  • 1
    Nice. Worth noting it assumes space/tab separated decimal integers (with optional - sign) and no trailing space. echo 1.0 | numfmt --to-unit=4096 --round=up --field=- returning 0.1 sounds like a bug to me. Adding --format=%.0f seems to work around it. With that and --invalid=ignore (to both numfmt invocations) you'd be able to accept more variation in input (and maybe LC_ALL=C to force period as decimal radix character) Commented Jun 21 at 10:15
0

You can avoid the rounding issues from other answers using bc, which supports arbitrarily large numbers. In interactive mode, you could enter an expression like (5476089856+4095)/4096*4096 to round up to the nearest multiple of 4096; so for non-interactive mode we'll just convert all the numbers into that form using sed.

sed 's^\S\+^(\0+4095)/4096*4096;^g' | bc

Quick explanation of the sed argument: s^X^Y^g says to replace all the occurrences of the pattern X with the result Y. The X part in the above command says to look for non-whitespace characters (\S) in a non-empty sequence (\+); the Y part says to copy in whatever the pattern matched (\0) and surround it with our rounding computation (( and +4095)/4096*4096;).

For example, to read numbers from in.txt and write them to out.txt, you would do

sed 's^\S\+^(\0+4095)/4096*4096;^g' <in.txt | bc >out.txt

The outputs of this will be on newlines; if it's important to have spaces instead of newlines, you can do that using the (GNU-specific) bc extension print:

sed 's^\S\+^print (\0+4095)/4096*4096, " ";^g' | bc
0

Using Raku (formerly known as Perl_6)

~$ echo "4095 4096 5476089856 71788143 9999744134 114731731 3179237376\n18014398509477887 18014398509477888" |\
    raku -ne '.words.map( { my $remainder = $_ % 4096; 
              $remainder == 0 ?? $_ !! $_ + (4096 - $remainder) }).put;'
4096 4096 5476089856 71790592 9999745024 114733056 3179237376
18014398509477888 18014398509477888

OR:

~$ echo "4095 4096 5476089856 71788143 9999744134 114731731 3179237376\n18014398509477887 18014398509477888" |\
    raku -e '.words.map( { my $remainder = $_ % 4096; 
             $remainder == 0 ?? $_ !! $_ + (4096 - $remainder) }).put for lines();'
4096 4096 5476089856 71790592 9999745024 114733056 3179237376
18014398509477888 18014398509477888

Raku is a programming language in the Perl-family. Raku uses $_ as topic variable--same as Perl. Above also uses Raku's new built-in ternary operator (which is spelled slightly differently than Perl's): Test ?? True !! False. Some people find Raku's ternary easier to read.

Spurious input: Only a tiny bit of testing above--adding an alphabetic character to the end of one number gives error Cannot convert string to number... (with the description trailing characters after number and the location of the alphabetic character). Not a bad error message!

ALSO, Raku does reasonably well with alternative numeric representations, e.g. the forms given in the Perl answer by @StéphaneChazelas. See below:

~$ echo "-4097 4096.001 1e+4 0x1f2 004123 3.4 Inf NaN \n $(seq 4095 4097)" |\
    raku -e '.words.map( { my $remainder = $_ % 4096;
             $remainder == 0 ?? $_ !! $_ + (4096 - $remainder) }).put for lines();'
-4096 8192 12288 4096 8192 4096 NaN NaN
4096
4096
8192

https://docs.raku.org/language/operators#infix_??_!!
https://docs.raku.org
https://raku.org

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.