Skip to main content
20 x speedup
Source Link
DarthGizka
  • 2.8k
  • 1
  • 14
  • 30

Belated addition: a few days after the fact an idea hit me out of the blue while I was having a smoke, and I'm reporting it here because it can speed up things by a factor of 20 for the SPOJ limit of 50000 (much more for higher inputs). This idea is based on the observation that half of the primes p satisfy

(n / 2) < p <= n

and hence

1 <= (n / p) < 2

which is just a coy way of saying

n / p = 1

Knowing that there are k primes that satisfy this condition (bisect_right() to the rescue) I can compute the contribution for all these primes in one fell swoop as

(1 + 1)**k

This readily extends to other small quotients in the remaining range of primes, and the fun only stops when the primes get so small that their squares appear as divisors (i.e. p <= sqrt(n)).

Here's how you explain this to your computer in the pythonic manner:

def v6b (n):
    """number of divisors of n! (SPOJ DIVFACT)"""
    divisor_count = 1
    hi = bisect.bisect_right(primes, n)
    for m in range(2, int(math.sqrt(n))):
        lo = bisect.bisect_right(primes, n / m, 0, hi)
        k  = hi - lo
        hi = lo
        divisor_count = (divisor_count * m**k) % 1000000007
        if k < 2:
            break
    # the rest is algorithm v5b but without regular modulo trimming since growth is contained
    for current_prime in primes[0:hi]:
        q = n
        e = 0
        while q >= current_prime:
            q /= current_prime
            e += q
        divisor_count *= e + 1
    return divisor_count % 1000000007

Here are the timings for the usual 500 worst-case inputs, first for the earlier version (here called 'v5b') and then for this new version. I went beyond the SPOJ limit of 50000 to give some idea of the growth behaviour.

# timing v5b ...
500 @    500:    0.016  // cksum 230711749546
500 @   1000:    0.016  // cksum 256088767231
500 @   5000:    0.078  // cksum 248659270107
500 @  10000:    0.156  // cksum 255839576411
500 @  50000:    1.281  // cksum 252317442043
500 @ 100000:    3.313  // cksum 246710058695
500 @ 500000:   44.376  // cksum 244700225412

# timing v6b ...
500 @    500:    0.000  // cksum 230711749546
500 @   1000:    0.015  // cksum 256088767231
500 @   5000:    0.016  // cksum 248659270107
500 @  10000:    0.031  // cksum 255839576411
500 @  50000:    0.063  // cksum 252317442043
500 @ 100000:    0.078  // cksum 246710058695
500 @ 500000:    0.266  // cksum 244700225412

Beyond 100000 or thereabouts the code gets faster with a hand-coded powering function (via repeated squaring) but the builtin exponentiation operator is faster for lower limits, and hence on SPOJ. In C++ and C# things were faster when bailing out of the first loop for k < 12 but in Python it worked best with k < 2.

Note: the time for 500 test cases at the SPOJ limit of 50000 is virtually the same as for the C# version running under Mono on my box, which is no mean feat for an interpreted language...

Belated addition: a few days after the fact an idea hit me out of the blue while I was having a smoke, and I'm reporting it here because it can speed up things by a factor of 20 for the SPOJ limit of 50000 (much more for higher inputs). This idea is based on the observation that half of the primes p satisfy

(n / 2) < p <= n

and hence

1 <= (n / p) < 2

which is just a coy way of saying

n / p = 1

Knowing that there are k primes that satisfy this condition (bisect_right() to the rescue) I can compute the contribution for all these primes in one fell swoop as

(1 + 1)**k

This readily extends to other small quotients in the remaining range of primes, and the fun only stops when the primes get so small that their squares appear as divisors (i.e. p <= sqrt(n)).

Here's how you explain this to your computer in the pythonic manner:

def v6b (n):
    """number of divisors of n! (SPOJ DIVFACT)"""
    divisor_count = 1
    hi = bisect.bisect_right(primes, n)
    for m in range(2, int(math.sqrt(n))):
        lo = bisect.bisect_right(primes, n / m, 0, hi)
        k  = hi - lo
        hi = lo
        divisor_count = (divisor_count * m**k) % 1000000007
        if k < 2:
            break
    # the rest is algorithm v5b but without regular modulo trimming since growth is contained
    for current_prime in primes[0:hi]:
        q = n
        e = 0
        while q >= current_prime:
            q /= current_prime
            e += q
        divisor_count *= e + 1
    return divisor_count % 1000000007

Here are the timings for the usual 500 worst-case inputs, first for the earlier version (here called 'v5b') and then for this new version. I went beyond the SPOJ limit of 50000 to give some idea of the growth behaviour.

# timing v5b ...
500 @    500:    0.016  // cksum 230711749546
500 @   1000:    0.016  // cksum 256088767231
500 @   5000:    0.078  // cksum 248659270107
500 @  10000:    0.156  // cksum 255839576411
500 @  50000:    1.281  // cksum 252317442043
500 @ 100000:    3.313  // cksum 246710058695
500 @ 500000:   44.376  // cksum 244700225412

# timing v6b ...
500 @    500:    0.000  // cksum 230711749546
500 @   1000:    0.015  // cksum 256088767231
500 @   5000:    0.016  // cksum 248659270107
500 @  10000:    0.031  // cksum 255839576411
500 @  50000:    0.063  // cksum 252317442043
500 @ 100000:    0.078  // cksum 246710058695
500 @ 500000:    0.266  // cksum 244700225412

Beyond 100000 or thereabouts the code gets faster with a hand-coded powering function (via repeated squaring) but the builtin exponentiation operator is faster for lower limits, and hence on SPOJ. In C++ and C# things were faster when bailing out of the first loop for k < 12 but in Python it worked best with k < 2.

Note: the time for 500 test cases at the SPOJ limit of 50000 is virtually the same as for the C# version running under Mono on my box, which is no mean feat for an interpreted language...

added time for the original code, for comparison
Source Link
DarthGizka
  • 2.8k
  • 1
  • 14
  • 30

improved the time for a worst-case input file from 1.92 s to 1.39 s on my box, so it's something worth keeping in mind. For comparison: the original code clocked 2.64 s for this file.

improved the time for a worst-case input file from 1.92 s to 1.39 s on my box, so it's something worth keeping in mind.

improved the time for a worst-case input file from 1.92 s to 1.39 s on my box, so it's something worth keeping in mind. For comparison: the original code clocked 2.64 s for this file.

corrected a formula
Source Link
DarthGizka
  • 2.8k
  • 1
  • 14
  • 30
static ushort[] primes_up_to_LIMIT = U16.SmallPrimesUpTo(50000).ToArray();

...

uintlong divisor_count = 1;

foreach (uintint prime in primes_up_to_LIMITprimes)
{
    ulongif (prime > n)
        break;

    int e = 0;

    for (uintint q = n; q >= prime; )
        e += q /= prime; 

    divisor_count = (uint)((divisor_count * (e + 1)) % 1000000007u);1000000007;
}

return (int)divisor_count;

It isn't very fast - it clocked 0.2112 s on SPOJ - but it should show the principle well enough and it can serve as the basis of a more performant solution.

Note: detailed investigation shows that for once, the poor showinglackluster performance of my C# submission on SPOJ is not due to usual problems inherent in .Net, with the lion's share going to poor input/output performance (and a small part owing to the poor performance of unsigned types in newer versions of e.Net)g.

Unfortunately I do not know enough about Python to tell whether custom poor input/output code can have the same dramatic effect as it hasperformance) but rather to some strange problem with .Net..Mono. It may be worth looking intoOn my laptop a worst-case input file clocks 13. See SPOJ's7 ms under INTEST.Net and INOUTEST which are useful for tuning one's input/output performance136. At CodeChef you can even look at the code of submissions, to see how the fastest timings came about4 ms under Mono.

Here is an attempt at rending the inner loop in Python (renaming the num from your code to current_primesome things for clarity, and pr to divisor_count):

divisor_count = 1

for current_prime in primes:

    if current_prime > n:
        break

    q = n
    e = 0
    while q >= current_prime:
        q /= current_prime
        e += q

    divisor_count *= e + 1

print divisor_count % 1000000007

One thing worth looking into is the underlying data type of the divisor count (pr in the original code). This can grow quickly, far beyond anything representable in a plain integer, and hence it will go big integer internally. This makes the multiplication pr *= s + 1 more expensive, and the statement has a cost multiplier of max(T) * pi(max(N)) = 500 * 5133.

Reducing the division count modulo 1000000007 after every update by changing

divisor_count *= e + 1

to

divisor_count = (divisor_count * (e + 1)) % 1000000007

improved the time for a worst-case input file from 1.92 s to 1.39 s on my box, so it's something worth keeping in mind.

static ushort[] primes_up_to_LIMIT = U16.SmallPrimesUpTo(50000).ToArray();

...

uint divisor_count = 1;

foreach (uint prime in primes_up_to_LIMIT)
{
    ulong e = 0;

    for (uint q = n; q >= prime; )
        e += q /= prime; 

    divisor_count = (uint)((divisor_count * (e + 1)) % 1000000007u);
}

return divisor_count;

It isn't very fast - it clocked 0.21 s on SPOJ - but it should show the principle well enough and it can serve as the basis of a more performant solution.

Note: the poor showing of my submission on SPOJ is due to problems inherent in .Net, with the lion's share going to poor input/output performance (and a small part owing to the poor performance of unsigned types in newer versions of .Net).

Unfortunately I do not know enough about Python to tell whether custom input/output code can have the same dramatic effect as it has with .Net... It may be worth looking into. See SPOJ's INTEST and INOUTEST which are useful for tuning one's input/output performance. At CodeChef you can even look at the code of submissions, to see how the fastest timings came about.

Here is an attempt at rending the inner loop in Python (renaming the num from your code to current_prime for clarity, and pr to divisor_count):

q = n
e = 0
while q >= current_prime
    q /= current_prime
    e += q
divisor_count *= (e + 1)
static ushort[] primes_up_to_LIMIT = U16.SmallPrimesUpTo(50000).ToArray();

...

long divisor_count = 1;

foreach (int prime in primes)
{
    if (prime > n)
        break;

    int e = 0;

    for (int q = n; q >= prime; )
        e += q /= prime; 

    divisor_count = (divisor_count * (e + 1)) % 1000000007;
}

return (int)divisor_count;

It isn't very fast - it clocked 0.12 s on SPOJ - but it should show the principle well enough and it can serve as the basis of a more performant solution.

Note: detailed investigation shows that for once, the lackluster performance of my C# submission on SPOJ is not due to usual problems inherent in .Net (e.g. poor input/output performance) but rather to some strange problem with Mono. On my laptop a worst-case input file clocks 13.7 ms under .Net and 136.4 ms under Mono.

Here is an attempt at rending the loop in Python (renaming some things for clarity):

divisor_count = 1

for current_prime in primes:

    if current_prime > n:
        break

    q = n
    e = 0
    while q >= current_prime:
        q /= current_prime
        e += q

    divisor_count *= e + 1

print divisor_count % 1000000007

One thing worth looking into is the underlying data type of the divisor count (pr in the original code). This can grow quickly, far beyond anything representable in a plain integer, and hence it will go big integer internally. This makes the multiplication pr *= s + 1 more expensive, and the statement has a cost multiplier of max(T) * pi(max(N)) = 500 * 5133.

Reducing the division count modulo 1000000007 after every update by changing

divisor_count *= e + 1

to

divisor_count = (divisor_count * (e + 1)) % 1000000007

improved the time for a worst-case input file from 1.92 s to 1.39 s on my box, so it's something worth keeping in mind.

added pythonic loop
Source Link
DarthGizka
  • 2.8k
  • 1
  • 14
  • 30
Loading
more simplification/clarification
Source Link
DarthGizka
  • 2.8k
  • 1
  • 14
  • 30
Loading
changed exponentiation to Python style
Source Link
DarthGizka
  • 2.8k
  • 1
  • 14
  • 30
Loading
added remarks about the poor input/output performance of .Net
Source Link
DarthGizka
  • 2.8k
  • 1
  • 14
  • 30
Loading
Source Link
DarthGizka
  • 2.8k
  • 1
  • 14
  • 30
Loading