12

Okay this is not a question of "how to get all uniques" or "How to remove duplicates from my array in php". This is a question about the time complexity.

I figured that the array_unique is somewhat O(n^2 - n) and here's my implementation:

function array_unique2($array) 
{ 
    $to_return = array(); 
    $current_index = 0;

    for ( $i = 0 ; $i < count($array); $i++ ) 
    { 
        $current_is_unique = true; 

        for ( $a = $i+1; $a < count($array); $a++ ) 
        { 
            if ( $array[$i] == $array[$a] ) 
            { 
                $current_is_unique = false; 
                break; 
            } 
        } 
        if ( $current_is_unique ) 
        { 
            $to_return[$current_index] = $array[$i];
        } 

    } 

    return $to_return; 
}

However when benchmarking this against the array_unique i got the following result:

Testing (array_unique2)... Operation took 0.52146291732788 s.

Testing (array_unique)... Operation took 0.28323101997375 s.

Which makes the array_unique twice as fast, my question is, why ( Both had the same random data ) ?

And a friend of mine wrote the following:

function array_unique2($a)
{
    $n = array();
    foreach ($a as $k=>$v)
        if (!in_array($v,$n))
            $n[$k]=$v;
    return $n;
}

which is twice as fast as the built in one in php.

I'd like to know, why?

What is the time-complexity of array_unique and in_array?

Edit I removed the count($array) from both loops and just used a variable in the top of the function, that gained 2 seconds on 100 000 elements!

7
  • 1
    Is PHP really a good choice if your concern is efficient execution? Commented Jan 25, 2009 at 18:14
  • Haha, guess not, but that's not the point here. I generally don't use PHP. But i found this very interesting while playing around. Commented Jan 25, 2009 at 18:15
  • Your first function isn’t correct: array_unique2(array(1, 2)) just returns array(2). Commented Jan 25, 2009 at 18:22
  • Oh did it.. i tested it with: $array_with_multiple = array("Filip", "Jenny", "Filip", "Tarzan"); Commented Jan 25, 2009 at 18:24
  • How were you performing the benchmarks? Commented Jan 25, 2009 at 18:44

8 Answers 8

13

While I can't speak for the native array_unique function, I can tell you that your friends algorithm is faster because:

  1. He uses a single foreach loop as opposed to your double for() loop.
  2. Foreach loops tend to perform faster than for loops in PHP.
  3. He used a single if(! ) comparison while you used two if() structures
  4. The only additional function call your friend made was in_array whereas you called count() twice.
  5. You made three variable declarations that your friend didn't have to ($a, $current_is_unique, $current_index)

While none of these factors alone is huge, I can see where the cumulative effect would make your algorithm take longer than your friends.

Sign up to request clarification or add additional context in comments.

10 Comments

Actually, count() is called many times -- once per loop iteration for each loop.
Thank you, by removing count() from both places i gained 2 seconds on 100 000 elements.
Your point 1 is not valid: in_array() seems to loop over the entries as well, so there's a hidden for-loop in Filip's friend's code as well!
@Christoph: Could you provide a reference? I would be very interested to read up on that myself.
In my time tests the "friend" function underperforms in large array situations.
|
9

The time complexity of in_array() is O(n). To see this, we'll take a look at the PHP source code.

The in_array() function is implemented in ext/standard/array.c. All it does is call php_search_array(), which contains the following loop:

while (zend_hash_get_current_data_ex(target_hash, (void **)&entry, &pos) == SUCCESS) {

    // checking the value...

    zend_hash_move_forward_ex(target_hash, &pos);
}

That's where the linear characteristic comes from.

This is the overall characteristic of the algorithm, becaus zend_hash_move_forward_ex() has constant behaviour: Looking at Zend/zend_hash.c, we see that it's basically just

*current = (*current)->pListNext;

As for the time complexity of array_unique():

  • first, a copy of the array will be created, which is an operation with linear characteristic
  • then, a C array of struct bucketindex will be created and pointers into our array's copy will be put into these buckets - linear characteristic again
  • then, the bucketindex-array will be sorted usign quicksort - n log n on average
  • and lastly, the sorted array will be walked and and duplicate entries will be removed from our array's copy - this should be linear again, assuming that deletion from our array is a constant time operation

Hope this helps ;)

1 Comment

the code for array.c can be found here: google.com/… (thanks to Gumbo); including the link in the answer somehow breaks stackoverflow :(
4

Try this algorithm. It takes advantage of the fact that the key lookup is faster than in_array():

function array_unique_mine($A) {
    $keys = Array();
    $values = Array();
    foreach ($A as $k => $v) {
        if (!array_key_exists($v, $values)) {
            $keys[] = $k;
            $values[$v] = $v;
        }
    }
    return array_combine($keys, $values);
}

4 Comments

I'm not sure the OP is looking for an efficient algorithm, but an explanation as to why the methods have different time complexities.
this will only work if the values in $A are all strings or integers!
true. I suppose you could serialize the $values[serialize($v)] = $v part.
@joe_mucchiello added your method to my post
2

Gabriel's answer has some great points about why your friend's method beats yours. Intrigued by the conversation following Christoph's answer, I decided to run some tests of my own.

Also, I tried this with differing lengths of random strings and although the results were different, the order was the same. I used 6 chars in this example for brevity.

Notice that array_unique5 actually has the same keys as native, 2 and 3, but just outputs in a different order.

Results...

Testing 10000 array items of data over 1000 iterations:
array_unique6:  1.7561039924622 array ( 9998 => 'b',    9992 => 'a',    9994 => 'f',    9997 => 'e',    9993 => 'c',    9999 => 'd',    )
array_unique4:  1.8798060417175 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   4 => 'c',   5 => 'd',   )
array_unique5:  7.5023629665375 array ( 10 => 'd',  0 => 'b',   3 => 'e',   2 => 'f',   9 => 'c',   1 => 'a',   )
array_unique3:  11.356487989426 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   9 => 'c',   10 => 'd',  )
array_unique:   22.535032987595 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   9 => 'c',   10 => 'd',  )
array_unique2:  62.107122898102 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   9 => 'c',   10 => 'd',  )
array_unique7:  71.557286024094 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   9 => 'c',   10 => 'd',  )

And The Code...

set_time_limit(0);
define('HASH_TIMES', 1000);

header('Content-Type: text/plain');

$aInput  = array();
for ($i = 0; $i < 10000; $i++) {
    array_push($aInput, chr(rand(97, 102)));
}

function array_unique2($a) {
    $n = array();
    foreach ($a as $k=>$v)
        if (!in_array($v,$n))
            $n[$k]=$v;
    return $n;
}

function array_unique3($aOriginal) {
    $aUnique = array();

    foreach ($aOriginal as $sKey => $sValue) {
        if (!isset($aUnique[$sValue])) {
            $aUnique[$sValue] = $sKey;
        }
    }

    return array_flip($aUnique);
}

function array_unique4($aOriginal) {
    return array_keys(array_flip($aOriginal));
}

function array_unique5($aOriginal) {
    return array_flip(array_flip(array_reverse($aOriginal, true)));
}

function array_unique6($aOriginal) {
    return array_flip(array_flip($aOriginal));
}

function array_unique7($A) {
    $keys = Array();
    $values = Array();
    foreach ($A as $k => $v) {
        if (!array_key_exists($v, $values)) {
            $keys[] = $k;
            $values[$v] = $v;
        }
    }
    return array_combine($keys, $values);
}

function showResults($sMethod, $fTime, $aInput) {
    echo $sMethod . ":\t" . $fTime . "\t" . implode("\t", array_map('trim', explode("\n", var_export(call_user_func($sMethod, $aInput), 1)))) . "\n";
}

echo 'Testing ' . (count($aInput)) . ' array items of data over ' . HASH_TIMES . " iterations:\n";

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique($aInput);
$aResults['array_unique'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique2($aInput);
$aResults['array_unique2'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique3($aInput);
$aResults['array_unique3'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique4($aInput);
$aResults['array_unique4'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique5($aInput);
$aResults['array_unique5'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique6($aInput);
$aResults['array_unique6'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique7($aInput);
$aResults['array_unique7'] = microtime(1) - $fTime;

asort($aResults, SORT_NUMERIC);
foreach ($aResults as $sMethod => $fTime) {
    showResults($sMethod, $fTime, $aInput);
}

Results using Christoph's data set from the comments:

$aInput = array(); for($i = 0; $i < 1000; ++$i) $aInput[$i] = $i; for($i = 500; $i < 700; ++$i) $aInput[10000 + $i] = $i;

Testing 1200 array items of data over 1000 iterations:
array_unique6:  0.83235597610474
array_unique4:  0.84050011634827
array_unique5:  1.1954448223114
array_unique3:  2.2937450408936
array_unique7:  8.4412341117859
array_unique:   15.225166797638
array_unique2:  48.685120105743

15 Comments

interesting; in my test with the following input array ` $array = array(); for($i = 0; $i < 1000; ++$i) $array[$i] = $i; for($i = 500; $i < 700; ++$i) $array[10000 + $i] = $i; ` array_unique() was second fastest...
ALso, could you include joe's algorithm ( stackoverflow.com/questions/478002/… ) in your tests?
or just check if a test for array_key_exists() would speed up array_unique3()?
That is interesting. I got similar results as what I'd posted using your data.
@Christoph I'll add it. I doubt it though as array_key_exists is a lot slower than isset.
|
1

PHP's arrays are implemented as hash tables, i.e. their performance characteristics are different from what you'd expect from 'real' arrays. An array's key-value-pairs are additionally stored in a linked list to allow fast iteration.

This explains why your implementation is so slow compared to your friend's: For every numeric index, your algorithm has to do a hash table lookup, whereas a foreach()-loop will just iterate over a linked list.

The following implementation uses a reverse hash table and might be the fastest of the crowd (double-flipping courtesy of joe_mucchiello):

function array_unique2($array) {
    return array_flip(array_flip($array));
}

This will only work if the values of $array are valid keys, ie integers or strings.

I also reimplemented your algorithm using foreach()-loops. Now, it will actually be faster than your friend's for small data sets, but still slower than the solution via array_flip():

function array_unique3($array) {
    $unique_array = array();

    foreach($array as $current_key => $current_value) {
        foreach($unique_array as $old_value) {
            if($current_value === $old_value)
                continue 2;
        }
        $unique_array[$current_key] = $current_value;
    }

    return $unique_array;
}

For large data sets, the built-in version array_unique() will outperform all other's except the double-flipping one. Also, the version using in_array() by your friend will be faster than array_unique3().

To summarize: Native code for the win!


Yet another version, which should preserve keys and their ordering:

function array_flop($array) {
    $flopped_array = array();

    foreach($array as $key => $value) {
        if(!isset($flopped_array[$value]))
            $flopped_array[$value] = $key;
    }

    return $flopped_array;
}

function array_unique4($array) {
    return array_flip(array_flop($array));
}

This is actually enobrev's array_unique3() - I didn't check his implementations as thoroughly as I should have...

11 Comments

This does not preserve the keys. You mean return array_flip(array_flip($array));
Nice idea, but I wouldn't expect that to be faster. A benchmark could prove me wrong, though.
Actually it smokes the other suggestion in this thread. It just doesn't produce the same results.
@joe: Filip's code assumed numeric keys, so I did as well; and no, I didn't mean a double flip: I only cared for the values, and these are the keys of the flipped array...
Yes, but you can't only care about the values since the builtin array_unique preserves the keys of the input array. Not preserving them is providing a different algorithm. At which point, speed comparisons are irrelevant.
|
0

PHP is slower to execute than raw machine code (which is most likely executed by array_unique).

Your second example function (the one your friend wrote) is interesting. I do not see how it would be faster than the native implementation, unless the native one is removing elements instead of building a new array.

Comments

0

I'll admit I don't understand the native code very well, but it seems to copy the entire array, sort it, then loop through it removing duplicates. In that case your second piece of code is actually a more efficient algorithm, since adding to the end of an array is cheaper than deleting from the middle of it.

Keep in mind the PHP developers probably had a good reason for doing it the way they do. Does anyone want to ask them?

Comments

0

The native PHP function array_unique is implemented in C. Thus it is faster than PHP, that has to be translated first. What’s more, PHP uses an different algorithm than you do. As I see it, PHP first uses Quick sort to sort the elements and then deletes the duplicates in one run.

Why his friend’s implementation is faster has his own? Because it uses more built-in functionality that trying to recreate them.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.