Here is a variation of quicksort in which the pivot is chosen as the average of the highest and lowest values in the current range.

This approach performs an extra pass at every partitioning step to determine the high and low values.

My understanding is that this approach requires at most \$2 \cdot n \cdot \log_2(n)\$ operations: at each of the \$\log_2(n)\$ levels, one pass to find the high and low values and one pass to partition.

Example (an evenly spread sequence): 1,2,3,4,5,6,7,8.

Example (a worst-case sequence of successive powers of two): 1,2,4,8,16,32,64,128,256,512,...

However, given that numbers are typically represented as byte (8 bits), short (16), int (32), long (64), for a given datatype, e.g. a 32-bit integer, the largest power of two in the worst-case sequence is 2^30 = 1,073,741,824. So for an integer datatype the sequence 1, 2, 4, 8, 16, ... reaches that maximum after 30 doublings, after which the sequence must repeat.

1,2,4,8,16,32,64,128,1,2,4,8,16,32,64,128,1,2,4,8,16,32,64,128,...

simply because the type is bounded: an unsigned byte, for example, cannot hold anything larger than 255, so the largest power of two it can represent is 128.

As such, even for worst-case input, the (high+low)/2 approach would still only have to deal with a recursion depth of \$\log_2(n)\$, because the numbers would repeat.

Although the above would not hold where the number of elements in the array is small compared to the range of the datatype itself (e.g. 10% of the total capacity, where it is still possible to induce worst-case scenarios for the given data-set), for data-sets with sizes comparable to the maximum value supported by the data-type the approach would work.

What is less clear is:

Given that the best-case scenario is in \$\mathcal{O}(n \cdot \log_2(n))\$, and given that the worst case also appears to be \$n \cdot \log_2(n)\$ for data-sets with sizes comparable to the maximum value of the datatype, is the same true for the average case as well?

From what I can tell, it's true and as such the entire approach is in \$\mathcal{O}(n \cdot \log_2(n))\$.

However, I need confirmation on the approach, understanding and the conclusion.

Basically, the goal is to determine whether it's comparable to the median-of-three approach that Java uses in its Arrays.sort() implementation.

In test cases that I've run, it appears to be comparable in time to the median-of-three algorithm, even for random data sets. So I want to know whether the concept holds in general, or whether it's just by chance and there is a data set where median-of-three is better than this approach.
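
Below is a minimal, self-contained sketch of the scheme described above (not the original implementation; class and method names are illustrative): one extra pass per partition to find the lowest and highest values, and a virtual pivot equal to their average. Note that the midpoint here is computed in 64-bit arithmetic, which sidesteps the int-overflow issue discussed in clarification (7) further down, rather than using the zero split described there.

```java
import java.util.Arrays;

public class SavpQuickSortSketch {

    /** Sorts the array in place using the simple-averaged virtual pivot idea. */
    public static void sort(int[] a) {
        if (a != null && a.length > 1) {
            sort(a, 0, a.length - 1);
        }
    }

    private static void sort(int[] a, int from, int to) {
        if (from >= to) {
            return;
        }

        // Extra pass per partition: find the lowest and highest values in a[from..to].
        int low = a[from], high = a[from];
        for (int k = from + 1; k <= to; k++) {
            low = Math.min(low, a[k]);
            high = Math.max(high, a[k]);
        }
        if (low == high) {
            return; // all elements equal: already sorted, avoids a degenerate split
        }

        // Virtual pivot = midpoint of the value range. Widening to long avoids
        // int overflow; the resulting pivot satisfies low <= pivot < high.
        long pivot = low + (((long) high - low) / 2);

        // Partition: values <= pivot to the left, values > pivot to the right.
        int i = from, j = to;
        while (i <= j) {
            if (a[i] <= pivot) {
                i++;
            } else if (a[j] > pivot) {
                j--;
            } else {
                int tmp = a[i];
                a[i] = a[j];
                a[j] = tmp;
                i++;
                j--;
            }
        }

        // Because low <= pivot < high, both halves are non-empty,
        // so every recursive call works on a strictly smaller range.
        sort(a, from, j);
        sort(a, i, to);
    }

    public static void main(String[] args) {
        int[] data = {5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, -7, 0};
        sort(data);
        System.out.println(Arrays.toString(data)); // prints the array in ascending order
    }
}
```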

Update 17 Nov 2018

Below are some clarifications in response to greybeard's queries:

(1)

  • Idea behind creating it: simply to see if I could come up with a sorting algorithm by myself, and to have it confirmed as a unique one.

  • I have previously posted about a different algorithm as well, although that was only about its complexity and I did not post any code for it.

  • What one could do with it: the licence in the comment was intended to convey that; I couldn't think of a better way to do it.

(2)

  • Concerns: in the context of this variation of pivot selection, only whether my understanding of the worst-case scenario is correct, or whether there are more (in which case I would have to start figuring out a fix, unless one is also suggested).
  • Update: see the post-script at the end for a justification of the approach w.r.t. the worst-case scenario.

(3)

  • Comparing quicksort's pivot-selection schemes: a few of quicksort's pivot-selection schemes include random pivot selection and median-of-three (or dual/triple/multi-pivot schemes).
  • However, I believe they are known to have worst-case scenarios; in any case they appear to be mitigation steps that do not fully guarantee performance will not degrade to quadratic.
  • This is an attempt to make it more predictable.

(4)

  • How the scheme presented relates to prior work: I have primarily compared it against median-of-three's performance on random elements.
  • Since median-of-three can, in odd cases, still result in \$n^2\$ performance (in theory at least), I wanted to see if there was a more predictable way.
  • This seems to be better, at least in the sense that it is easier to analyze a few of its worst-case scenarios (numbers in a sequence of powers of 2, identical numbers).
  • Note: there is an updated understanding w.r.t. the worst-case analysis; see the post-script below.

(5)

  • Reason for posting code: originally I wanted to just post the technique and check whether it is unique, but wasn't sure where to post it.
  • Since I had built the code as part of analyzing the technique, and based on suggestions from meta, I posted it here.
  • As suggested, in retrospect, computer-science might have been a better place to ask.

(6)

  • QS-SAVP acronym: SAVP is short for simple-averaged virtual pivot. I used 'simple' average since I couldn't find a better name for the (high+low)/2 approach.
  • The logical extension of this approach (a different technique) would be to take the sum of all the elements and divide by the total number of elements.
  • The resulting value would then be the pivot to be considered.
  • The first approach is similar to the statistical median (which uses the middle element instead of the highest and lowest values), and the second approach is exactly the statistical/arithmetic mean.

(7)

  • Pivot calculation: the technique used is pivot = low + ((high - low) / 2), which doesn't work if the array elements are a mix of positive and negative numbers (e.g. the maximum possible value and the minimum possible value for the given bit length, where high - low overflows).

  • Similarly, the 'first-iteration' flag exists to ensure that a completely negative array is not skipped, since the current logic (intended to handle arrays with identical elements) would otherwise do precisely that.

  • To summarize: first, the pivot value '0' is used to separate positive and negative values; the check that the 'temp_low_index' count equals 'high_index' ensures performance doesn't degrade to \$n^2\$ for arrays with equal elements; and finally, the 'first-iteration' flag ensures an array with completely negative elements is not skipped.
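
To illustrate the overflow worked around above (the values below are arbitrary, not taken from the original code): with plain int arithmetic, high - low wraps around for a mixed-sign pair, which can push the computed pivot outside the [low, high] range; widening to long before subtracting is one overflow-safe alternative to the zero split described above.

```java
public class PivotOverflowDemo {
    public static void main(String[] args) {
        int low = -1_000_000_000;
        int high = 2_000_000_000;

        // int arithmetic: high - low = 3_000_000_000 wraps around to a negative value,
        // so the "midpoint" ends up below low, i.e. outside the [low, high] range.
        int broken = low + ((high - low) / 2);
        System.out.println(broken);   // -1647483648 (less than low)

        // Widening to long before subtracting keeps the midpoint inside the range.
        long safe = low + (((long) high - low) / 2);
        System.out.println(safe);     // 500000000
    }
}
```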

(8)

  • With regard to expanding Arrays.copyOf to System.arraycopy: it's just what I've picked up (also, Arrays.copyOf was only introduced around JDK 6).
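
For reference, the two calls mentioned here are equivalent for the common "copy a prefix into a new array" case; a tiny illustration with arbitrary values:

```java
import java.util.Arrays;

public class CopyDemo {
    public static void main(String[] args) {
        int[] src = {3, 1, 4, 1, 5, 9};

        // Arrays.copyOf (since JDK 6): allocates and copies in one call.
        int[] viaCopyOf = Arrays.copyOf(src, 4);

        // System.arraycopy: the older, lower-level equivalent.
        int[] viaArraycopy = new int[4];
        System.arraycopy(src, 0, viaArraycopy, 0, 4);

        System.out.println(Arrays.equals(viaCopyOf, viaArraycopy)); // true
    }
}
```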

(9)

  • Notes: I believe a technique for determining a worst-case scenario for the median-of-three algorithm can be found here, although I have not analyzed it myself.

(10)

Updated understanding w.r.t. the worst-case scenario:

  • PS: now that I think about it, using the virtual-pivot technique it seems impossible for the recursion to go beyond \$\log_2(n) + 1\$ calls. The justification is as follows:
    • Consider a 32-bit unsigned integer.

    • Min is 0 and Max is 4294967295 (let's denote it 'n').

    • The first level would use 2147483648 (n/2) as the pivot.

    • It would split the array into two halves (each of them being a level-2 sub-array):

    • 0 to 2147483648 and 2147483648 to 4294967295 (the lower bound being inclusive and the upper bound exclusive).

    • Subsequent splits at level 3 would be in steps of (n/4) : 0 to 1073741824, 1073741824 to 2147483648, 2147483648 to 3221225472, 3221225472 to 4294967295.

    • Subsequent splits at level 4 would be in steps of (n/8)

    • Subsequent splits at level 5 would be in steps of (n/16)

    • ...

    • and so on until the final split at around level 33 for (n/4294967296).

    • Note that the above refers to virtual pivots. The elements may or may not be present; if they are present they will get sorted, and if not it is effectively a non-event. I'm merely trying to assert that, in terms of complexity, it will not degrade to quadratic.

    • In the above technique, I'm merely doing an optimization involving the highest and lowest valued elements, bringing the value of the virtual pivot closer to reality by calculating it as pivot = (high+low)/2.
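
Putting the level-by-level argument above into symbols (a sketch that assumes each level performs one pass to find the high/low values and one pass to partition, touching at most \$n\$ elements per pass):

\$\$\text{depth} \;\le\; \log_2\!\left(2^k\right) + 1 \;=\; k + 1, \qquad \text{work per level} \;\le\; 2n \quad\Longrightarrow\quad \text{total} \;\le\; 2n(k+1)\$\$

where \$k\$ is the bit width of the datatype and the extra level accounts for the initial positive/negative split; this is consistent with the correction in the 26 Nov 2018 update below.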

(11) With regard to the logging package: I would typically have used log4j, but I am trying to avoid external dependencies for the POC here.

Update 26 Nov 2018

Correction: the worst case should be \$2n(k+1)\$, where k is the number of bits in the datatype and the additional 1 is due to handling negative values.

Update 25 Dec 2018

From what I've been able to gather from authors (through emails), I believe this is equivalent to what is described as radix exchange sort (binary radix sort) in Knuth Vol. 3. So effectively the complexity would be \$n \cdot k\$, where k is the number of bits in the datatype.
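
For comparison, here is a minimal sketch (not the original code; names are illustrative) of radix exchange sort: it partitions on one bit per level, starting from the most significant bit, so the recursion depth is bounded by the bit width k.

```java
import java.util.Arrays;

public class RadixExchangeSortSketch {

    /** Sorts the array in place, partitioning on one bit per level of recursion. */
    public static void sort(int[] a) {
        if (a != null && a.length > 1) {
            sort(a, 0, a.length - 1, Integer.SIZE - 1);
        }
    }

    // Flipping the sign bit makes the bitwise order match the signed order,
    // so negative numbers sort before positive ones.
    private static int bitAt(int value, int bit) {
        return ((value ^ Integer.MIN_VALUE) >>> bit) & 1;
    }

    private static void sort(int[] a, int from, int to, int bit) {
        if (from >= to || bit < 0) {
            return;
        }
        // Partition: elements whose current bit is 0 go left, bit 1 goes right.
        int i = from, j = to;
        while (i <= j) {
            if (bitAt(a[i], bit) == 0) {
                i++;
            } else if (bitAt(a[j], bit) == 1) {
                j--;
            } else {
                int tmp = a[i];
                a[i] = a[j];
                a[j] = tmp;
                i++;
                j--;
            }
        }
        // Recurse on both halves with the next lower bit; the depth is bounded
        // by the number of bits k, giving the n*k behaviour mentioned above.
        sort(a, from, j, bit - 1);
        sort(a, i, to, bit - 1);
    }

    public static void main(String[] args) {
        int[] data = {512, -3, 7, 0, 256, -128, 42};
        sort(data);
        System.out.println(Arrays.toString(data)); // [-128, -3, 0, 7, 42, 256, 512]
    }
}
```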

Regards

Ravindra
