PHP Arrays - Remove duplicates ( Time complexity )

Question

Okay this is not a question of "how to get all uniques" or "How to remove duplicates from my array in php". This is a question about the time complexity.

I figured that array_unique is roughly O(n^2 - n), i.e. O(n^2), and here's my implementation:

function array_unique2($array) 
{ 
    $to_return = array(); 
    $current_index = 0;

    for ( $i = 0 ; $i < count($array); $i++ ) 
    { 
        $current_is_unique = true; 

        for ( $a = $i+1; $a < count($array); $a++ ) 
        { 
            if ( $array[$i] == $array[$a] ) 
            { 
                $current_is_unique = false; 
                break; 
            } 
        } 
        if ( $current_is_unique ) 
        { 
            $to_return[$current_index] = $array[$i];
        } 

    } 

    return $to_return; 
}

However, when benchmarking this against array_unique I got the following result:

Testing (array_unique2)... Operation took 0.52146291732788 s.

Testing (array_unique)... Operation took 0.28323101997375 s.

That makes array_unique almost twice as fast. My question is: why? (Both were given the same random data.)

And a friend of mine wrote the following:

function array_unique2($a)
{
    $n = array();
    foreach ($a as $k=>$v)
        if (!in_array($v,$n))
            $n[$k]=$v;
    return $n;
}

which is twice as fast as the one built into PHP.

I'd like to know, why?

What is the time-complexity of array_unique and in_array?

Edit: I removed the count($array) calls from both loops and used a variable set at the top of the function instead; that gained 2 seconds on 100 000 elements!

Is PHP really a good choice if your concern is efficient execution?
– Chris Ballance, Commented Jan 25, 2009 at 18:14
Haha, guess not, but that's not the point here. I generally don't use PHP. But i found this very interesting while playing around.
– Filip Ekberg, Commented Jan 25, 2009 at 18:15
Your first function isn’t correct: array_unique2(array(1, 2)) just returns array(2).
– Gumbo, Commented Jan 25, 2009 at 18:22
Oh did it.. i tested it with: $array_with_multiple = array("Filip", "Jenny", "Filip", "Tarzan");
– Filip Ekberg, Commented Jan 25, 2009 at 18:24
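
Gumbo's observation follows from $current_index never being incremented: every unique element is written to index 0, so only the last one survives. Below is a minimal sketch of one possible fix, assuming the intent was to collect the values into a fresh, reindexed list. The name array_unique_nested is made up for this illustration, and count() is hoisted out of the loops as described in the edit above.

```php
<?php
// Hypothetical corrected version of the nested-loop approach.
// Assumption: keys do not matter, so the input is reindexed first.
function array_unique_nested(array $array)
{
    $values = array_values($array); // guarantee indexes 0..n-1
    $n = count($values);            // count once, not on every iteration
    $to_return = array();

    for ($i = 0; $i < $n; $i++) {
        $current_is_unique = true;

        // Look for the same value later in the list.
        for ($a = $i + 1; $a < $n; $a++) {
            if ($values[$i] == $values[$a]) {
                $current_is_unique = false;
                break;
            }
        }

        if ($current_is_unique) {
            $to_return[] = $values[$i]; // append instead of overwriting index 0
        }
    }

    return $to_return;
}
```

Note that this keeps the last occurrence of each value, so array("Filip", "Jenny", "Filip", "Tarzan") comes back as array("Jenny", "Filip", "Tarzan"), whereas array_unique keeps the first occurrence.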

Noah Goodrich · Accepted Answer · 2009-01-25 18:40:59Z

13

While I can't speak for the native array_unique function, I can tell you that your friend's algorithm is faster because:

1. He uses a single foreach loop as opposed to your double for() loop.
2. Foreach loops tend to perform faster than for loops in PHP.
3. He used a single if(! ) comparison while you used two if() structures.
4. The only additional function call your friend made was in_array, whereas you called count() twice.
5. You made three variable declarations that your friend didn't have to ($a, $current_is_unique, $current_index).

While none of these factors alone is huge, I can see where the cumulative effect would make your algorithm take longer than your friend's.

answered Jan 25, 2009 at 18:40

Noah Goodrich


10 Comments

strager Over a year ago

Actually, count() is called many times -- once per loop iteration for each loop.

Filip Ekberg Over a year ago

Thank you, by removing count() from both places i gained 2 seconds on 100 000 elements.

Christoph Over a year ago

Your point 1 is not valid: in_array() seems to loop over the entries as well, so there's a hidden for-loop in Filip's friend's code as well!

Noah Goodrich Over a year ago

@Christoph: Could you provide a reference? I would be very interested to read up on that myself.

jmucchiello Over a year ago

In my time tests the "friend" function underperforms in large array situations.


Christoph · Accepted Answer · 2009-01-25 23:25:28Z

The time complexity of in_array() is O(n). To see this, we'll take a look at the PHP source code.

The in_array() function is implemented in ext/standard/array.c. All it does is call php_search_array(), which contains the following loop:

while (zend_hash_get_current_data_ex(target_hash, (void **)&entry, &pos) == SUCCESS) {

    // checking the value...

    zend_hash_move_forward_ex(target_hash, &pos);
}

That's where the linear characteristic comes from.

This is the overall characteristic of the algorithm, because zend_hash_move_forward_ex() has constant behaviour: looking at Zend/zend_hash.c, we see that it's basically just

*current = (*current)->pListNext;
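
To make the consequence for the in_array()-based function concrete, here is a small sketch in PHP. It is not the engine's code: linear_search and count_dedup_comparisons are hypothetical helpers that imitate the scan above and count how many value comparisons a deduplication run performs.

```php
<?php
// Hypothetical stand-in for in_array(): walk the entries one by one.
function linear_search($needle, array $haystack, &$comparisons)
{
    foreach ($haystack as $entry) {
        $comparisons++;
        if ($entry == $needle) {
            return true;
        }
    }
    return false;
}

// Same structure as the in_array()-based function from the question,
// but returning the number of comparisons instead of the result.
function count_dedup_comparisons(array $input)
{
    $comparisons = 0;
    $unique = array();
    foreach ($input as $key => $value) {
        if (!linear_search($value, $unique, $comparisons)) {
            $unique[$key] = $value;
        }
    }
    return $comparisons;
}

echo count_dedup_comparisons(range(1, 100)), "\n";           // 4950: all values distinct
echo count_dedup_comparisons(array_fill(0, 100, 'x')), "\n"; // 99: all values equal
```

With n distinct values the count is n(n-1)/2, i.e. quadratic; with only a handful of distinct values the list being searched stays short and the run is close to linear. That is one reason benchmarks of this function depend so heavily on the input data.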

As for the time complexity of array_unique():

first, a copy of the array will be created, which is an operation with linear characteristic
then, a C array of struct bucketindex will be created and pointers into our array's copy will be put into these buckets - linear characteristic again
then, the bucketindex-array will be sorted using quicksort - n log n on average
and lastly, the sorted array will be walked and duplicate entries will be removed from our array's copy - this should be linear again, assuming that deletion from our array is a constant time operation
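
The four steps above can be sketched in PHP itself. This is only an illustration of the copy / index / sort / walk idea, not a port of the C code; array_unique_sorted is a made-up name, and the sketch assumes all values can be cast to strings (array_unique compares values as strings by default).

```php
<?php
// Hypothetical userland sketch of a sort-based array_unique.
function array_unique_sorted(array $array)
{
    $result = $array; // step 1: work on a copy

    // step 2: build an index of (position, key, value) triples
    $index = array();
    $pos = 0;
    foreach ($array as $key => $value) {
        $index[] = array($pos++, $key, $value);
    }

    // step 3: sort by value; ties are broken by original position so that
    // the first occurrence of each value comes first in its run
    usort($index, function ($x, $y) {
        $c = strcmp((string) $x[2], (string) $y[2]);
        return $c !== 0 ? $c : $x[0] - $y[0];
    });

    // step 4: walk the sorted index and delete every repeat from the copy
    $prev = null;
    foreach ($index as $entry) {
        if ($prev !== null && (string) $prev[2] === (string) $entry[2]) {
            unset($result[$entry[1]]);
        } else {
            $prev = $entry;
        }
    }

    return $result;
}
```

The sort dominates, so the whole thing is O(n log n) on average, independent of how many distinct values there are.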

Hope this helps ;)

the code for array.c can be found here: google.com/… (thanks to Gumbo); including the link in the answer somehow breaks stackoverflow :(

jmucchiello · Accepted Answer · 2009-01-25 18:50:48Z

4

Try this algorithm. It takes advantage of the fact that the key lookup is faster than in_array():

function array_unique_mine($A) {
    $keys = Array();
    $values = Array();
    foreach ($A as $k => $v) {
        if (!array_key_exists($v, $values)) {
            $keys[] = $k;
            $values[$v] = $v;
        }
    }
    return array_combine($keys, $values);
}

edited Jan 25, 2009 at 18:50

answered Jan 25, 2009 at 18:45

jmucchiello


4 Comments

strager Over a year ago

I'm not sure the OP is looking for an efficient algorithm, but an explanation as to why the methods have different time complexities.

Christoph Over a year ago

this will only work if the values in $A are all strings or integers!

jmucchiello Over a year ago

true. I suppose you could serialize the $values[serialize($v)] = $v part.
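
A minimal sketch of that serialize() idea, under the assumption that every value is serializable (closures and some resources are not). array_unique_serialized is a hypothetical name, and the sketch writes straight into the result array instead of using array_combine().

```php
<?php
// Hypothetical variant that uses serialize($value) as the lookup key,
// so arrays and other non-scalar values can be deduplicated as well.
function array_unique_serialized(array $input)
{
    $seen = array();   // fingerprint => true
    $result = array(); // original key => original value

    foreach ($input as $key => $value) {
        $fingerprint = serialize($value);
        if (!isset($seen[$fingerprint])) {
            $seen[$fingerprint] = true;
            $result[$key] = $value;
        }
    }

    return $result;
}
```

One difference to keep in mind: the serialized form encodes the type, so the integer 1 and the string '1' count as different values here, while array_unique treats them as duplicates.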

enobrev Over a year ago

@joe_mucchiello added your method to my post

Community · Accepted Answer · 2017-05-23 11:48:41Z

Gabriel's answer has some great points about why your friend's method beats yours. Intrigued by the conversation following Christoph's answer, I decided to run some tests of my own.

Also, I tried this with differing lengths of random strings and although the results were different, the order was the same. I used 6 chars in this example for brevity.

Notice that array_unique5 actually has the same keys as native, 2 and 3, but just outputs in a different order.

Results...

Testing 10000 array items of data over 1000 iterations:
array_unique6:  1.7561039924622 array ( 9998 => 'b',    9992 => 'a',    9994 => 'f',    9997 => 'e',    9993 => 'c',    9999 => 'd',    )
array_unique4:  1.8798060417175 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   4 => 'c',   5 => 'd',   )
array_unique5:  7.5023629665375 array ( 10 => 'd',  0 => 'b',   3 => 'e',   2 => 'f',   9 => 'c',   1 => 'a',   )
array_unique3:  11.356487989426 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   9 => 'c',   10 => 'd',  )
array_unique:   22.535032987595 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   9 => 'c',   10 => 'd',  )
array_unique2:  62.107122898102 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   9 => 'c',   10 => 'd',  )
array_unique7:  71.557286024094 array ( 0 => 'b',   1 => 'a',   2 => 'f',   3 => 'e',   9 => 'c',   10 => 'd',  )
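
The differing keys in the output above are easier to see on a tiny input. The following sketch applies the same expressions used by array_unique4, array_unique5 and array_unique6 to a three-element array; the variable names are only for this illustration.

```php
<?php
$in = array('a', 'b', 'b');

$native     = array_unique($in);                                  // keeps the first key of each value
$flipKeys   = array_keys(array_flip($in));                        // array_unique4: keys are thrown away
$reverseDbl = array_flip(array_flip(array_reverse($in, true)));   // array_unique5
$doubleFlip = array_flip(array_flip($in));                        // array_unique6: keeps the last key

var_export($native);     // array (0 => 'a', 1 => 'b')
var_export($flipKeys);   // array (0 => 'a', 1 => 'b'), but only because it was reindexed
var_export($reverseDbl); // array (1 => 'b', 0 => 'a'): same pairs as native, different order
var_export($doubleFlip); // array (0 => 'a', 2 => 'b')
```

So only the reversed double flip yields the same key/value pairs as the native function, and even then the element order can differ.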

And The Code...

set_time_limit(0);
define('HASH_TIMES', 1000);

header('Content-Type: text/plain');

$aInput  = array();
for ($i = 0; $i < 10000; $i++) {
    array_push($aInput, chr(rand(97, 102)));
}

function array_unique2($a) {
    $n = array();
    foreach ($a as $k=>$v)
        if (!in_array($v,$n))
            $n[$k]=$v;
    return $n;
}

function array_unique3($aOriginal) {
    $aUnique = array();

    foreach ($aOriginal as $sKey => $sValue) {
        if (!isset($aUnique[$sValue])) {
            $aUnique[$sValue] = $sKey;
        }
    }

    return array_flip($aUnique);
}

function array_unique4($aOriginal) {
    return array_keys(array_flip($aOriginal));
}

function array_unique5($aOriginal) {
    return array_flip(array_flip(array_reverse($aOriginal, true)));
}

function array_unique6($aOriginal) {
    return array_flip(array_flip($aOriginal));
}

function array_unique7($A) {
    $keys = Array();
    $values = Array();
    foreach ($A as $k => $v) {
        if (!array_key_exists($v, $values)) {
            $keys[] = $k;
            $values[$v] = $v;
        }
    }
    return array_combine($keys, $values);
}

function showResults($sMethod, $fTime, $aInput) {
    echo $sMethod . ":\t" . $fTime . "\t" . implode("\t", array_map('trim', explode("\n", var_export(call_user_func($sMethod, $aInput), 1)))) . "\n";
}

echo 'Testing ' . (count($aInput)) . ' array items of data over ' . HASH_TIMES . " iterations:\n";

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique($aInput);
$aResults['array_unique'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique2($aInput);
$aResults['array_unique2'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique3($aInput);
$aResults['array_unique3'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique4($aInput);
$aResults['array_unique4'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique5($aInput);
$aResults['array_unique5'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique6($aInput);
$aResults['array_unique6'] = microtime(1) - $fTime;

$fTime = microtime(1);
for ($i = 0; $i < HASH_TIMES; $i++) array_unique7($aInput);
$aResults['array_unique7'] = microtime(1) - $fTime;

asort($aResults, SORT_NUMERIC);
foreach ($aResults as $sMethod => $fTime) {
    showResults($sMethod, $fTime, $aInput);
}

Results using Christoph's data set from the comments:

$aInput = array();
for($i = 0; $i < 1000; ++$i) $aInput[$i] = $i;
for($i = 500; $i < 700; ++$i) $aInput[10000 + $i] = $i;

Testing 1200 array items of data over 1000 iterations:
array_unique6:  0.83235597610474
array_unique4:  0.84050011634827
array_unique5:  1.1954448223114
array_unique3:  2.2937450408936
array_unique7:  8.4412341117859
array_unique:   15.225166797638
array_unique2:  48.685120105743

interesting; in my test with the following input array ` $array = array(); for($i = 0; $i < 1000; ++$i) $array[$i] = $i; for($i = 500; $i < 700; ++$i) $array[10000 + $i] = $i; ` array_unique() was second fastest...
Also, could you include joe's algorithm ( stackoverflow.com/questions/478002/… ) in your tests?
or just check if a test for array_key_exists() would speed up array_unique3()?
That is interesting. I got similar results as what I'd posted using your data.
@Christoph I'll add it. I doubt it though as array_key_exists is a lot slower than isset.

9 revs · Accepted Answer · 2009-01-26 00:03:45Z

PHP's arrays are implemented as hash tables, i.e. their performance characteristics are different from what you'd expect from 'real' arrays. An array's key-value-pairs are additionally stored in a linked list to allow fast iteration.

This explains why your implementation is so slow compared to your friend's: For every numeric index, your algorithm has to do a hash table lookup, whereas a foreach()-loop will just iterate over a linked list.
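
A small sketch of the behavioural side of this: because the keys are hash table entries rather than positions, a counting for() loop only works when the keys happen to be 0..n-1, while foreach() simply walks the entries in insertion order. The array below is an assumption made up for the illustration.

```php
<?php
// Keys are neither sequential nor sorted.
$sparse = array(5 => 'x', 2 => 'y', 9 => 'z');

// foreach() visits every entry in insertion order, whatever the keys are.
$viaForeach = array();
foreach ($sparse as $key => $value) {
    $viaForeach[] = $value;
}

// for() has to look up each candidate index in the hash table,
// and here most of the indexes 0..2 simply do not exist.
$viaFor = array();
for ($i = 0, $n = count($sparse); $i < $n; $i++) {
    if (array_key_exists($i, $sparse)) {
        $viaFor[] = $sparse[$i];
    }
}

var_export($viaForeach); // array (0 => 'x', 1 => 'y', 2 => 'z')
var_export($viaFor);     // array (0 => 'y')
```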

The following implementation uses a reverse hash table and might be the fastest of the crowd (double-flipping courtesy of joe_mucchiello):

function array_unique2($array) {
    return array_flip(array_flip($array));
}

This will only work if the values of $array are valid keys, i.e. integers or strings.
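
Even for strings there is a subtle side effect worth knowing about. The sketch below wraps the double flip in a hypothetical function array_unique_flip and shows that numeric strings come back as integers, because PHP normalizes such array keys.

```php
<?php
// Hypothetical wrapper around the double flip, for illustration only.
function array_unique_flip(array $array)
{
    return array_flip(array_flip($array));
}

$in = array('1', 'x');

var_export(array_unique($in));      // array (0 => '1', 1 => 'x')
var_export(array_unique_flip($in)); // array (0 => 1, 1 => 'x'): '1' became the integer 1
```

Whether that matters depends on the caller; with loose comparisons it usually goes unnoticed, with === it does not.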

I also reimplemented your algorithm using foreach()-loops. Now, it will actually be faster than your friend's for small data sets, but still slower than the solution via array_flip():

function array_unique3($array) {
    $unique_array = array();

    foreach($array as $current_key => $current_value) {
        foreach($unique_array as $old_value) {
            if($current_value === $old_value)
                continue 2;
        }
        $unique_array[$current_key] = $current_value;
    }

    return $unique_array;
}

For large data sets, the built-in version array_unique() will outperform all others except the double-flipping one. Also, the version using in_array() by your friend will be faster than array_unique3().

To summarize: Native code for the win!

Yet another version, which should preserve keys and their ordering:

function array_flop($array) {
    $flopped_array = array();

    foreach($array as $key => $value) {
        if(!isset($flopped_array[$value]))
            $flopped_array[$value] = $key;
    }

    return $flopped_array;
}

function array_unique4($array) {
    return array_flip(array_flop($array));
}

This is actually enobrev's array_unique3() - I didn't check his implementations as thoroughly as I should have...

This does not preserve the keys. You mean return array_flip(array_flip($array));
Nice idea, but I wouldn't expect that to be faster. A benchmark could prove me wrong, though.
Actually it smokes the other suggestion in this thread. It just doesn't produce the same results.
@joe: Filip's code assumed numeric keys, so I did as well; and no, I didn't mean a double flip: I only cared for the values, and these are the keys of the flipped array...
Yes, but you can't only care about the values since the builtin array_unique preserves the keys of the input array. Not preserving them is providing a different algorithm. At which point, speed comparisons are irrelevant.

strager · Accepted Answer · 2009-01-25 18:43:03Z

0

PHP is slower to execute than raw machine code (which is most likely executed by array_unique).

Your second example function (the one your friend wrote) is interesting. I do not see how it would be faster than the native implementation, unless the native one is removing elements instead of building a new array.

answered Jan 25, 2009 at 18:43

strager


user42092 · Accepted Answer · 2009-01-25 20:15:26Z

0

I'll admit I don't understand the native code very well, but it seems to copy the entire array, sort it, then loop through it removing duplicates. In that case your second piece of code is actually a more efficient algorithm, since adding to the end of an array is cheaper than deleting from the middle of it.

Keep in mind the PHP developers probably had a good reason for doing it the way they do. Does anyone want to ask them?

answered Jan 25, 2009 at 20:15

user42092


Gumbo · Accepted Answer · 2009-01-25 20:32:08Z

0

The native PHP function array_unique is implemented in C. Thus it is faster than PHP code, which has to be translated first. What’s more, PHP uses a different algorithm than you do. As I see it, PHP first uses quicksort to sort the elements and then deletes the duplicates in one run.

Why is his friend’s implementation faster than his own? Because it uses more built-in functionality rather than trying to recreate it.

answered Jan 25, 2009 at 20:32

Gumbo
