Below is a variation of quicksort wherein the pivot is selected as the average of the highest and lowest values in the range:
pivot = (high + low)/2
The pivot as such is usually a virtual one, i.e. it need not be an element of the array.
The approach below performs an extra pass at every level of recursion to determine the high and low values.
My understanding is that this approach requires a maximum of \$2 \cdot n \cdot \log_2(n)\$ operations: one pass to find the high and low values plus one pass to partition, at each of roughly \$\log_2(n)\$ levels.
The rationale is as follows:
The best-case scenario would be one where the numbers are sorted and sequential.
Eg: 1,2,3,4,5,6,7,8.
The first run would yield \$\frac{1+8}{2}=4.5\$, which integer division truncates to 4. The second run would then yield \$\frac{1+4}{2}=2.5\$ for the left sub-array and \$\frac{5+8}{2}=6.5\$ for the right sub-array, and so on.
The worst case for the above would then be a sequence that doubles at every step, i.e. a sequence of powers of 2.
Eg: 1,2,4,8,16,32,64,128,256,512....
However, given that numbers are typically represented as byte (8 bits), short (16), int (32) or long (64), the worst-case sequence is bounded for a given datatype. Eg: for Java's signed 32-bit integer, the largest power of two that fits is \$2^{30} = 1{,}073{,}741{,}824\$, so the sequence 1,2,4,8,16,... reaches its maximum after 30 doublings, after which it must repeat.
To illustrate the same with an unsigned byte, whose maximum value is 255, the sequence would look something like this:
1,2,4,8,16,32,64,128,1,2,4,8,16,32,64,128,1,2,4,8,16,32,64,128,...
simply because the byte can't hold anything larger.
As such, even for worst-case input, the (high+low)/2 approach only has to deal with a depth of \$\log_2(n)\$, because the values (and hence the pivots) must repeat.
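To make the repetition argument concrete, here is a minimal harness (my own sketch, not part of the posted code; it mirrors the commented-out POWERS_OF_TWO test case in main() of the QSortSAVP listing further below, and assumes that class is on the classpath):

    public class WorstCaseDemo {
        public static void main(String[] args) {
            int[] powersOfTwo = { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 };
            int[] worstCase = new int[1 << 20];
            for (int i = 0; i < worstCase.length; i++) {
                // the doubling sequence wraps around, so only ten distinct values occur
                worstCase[i] = powersOfTwo[i % powersOfTwo.length];
            }
            QSortSAVP.sort(worstCase); // the class from the listing further below
        }
    }

Since only ten distinct values occur, the recursion depth on such an input should stay around \$\log_2(512)\$ levels rather than growing with the array length.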
The above would not hold where the number of elements in the array is small compared to the range of the datatype itself (Eg: 10% of the total capacity), in which case it is still possible to induce worst-case scenarios for the given dataset; but for datasets with sizes comparable to the maximum value supported by the datatype, the approach would work.
What is less clear is: given that the best-case scenario is in \$\mathcal{O}(n \cdot \log_2(n))\$, and given that the worst-case scenario also appears to be \$n \cdot \log_2(n)\$ for datasets with sizes comparable to the maximum value of the datatype, is the same true for the average case?
From what I can tell it is, and as such the entire approach is in \$\mathcal{O}(n \cdot \log_2(n))\$.
However, I need confirmation of the approach, the understanding and the conclusion!
Below is the implementation in Java.
package org.example.so.sorts.qs;
//import java.util.Arrays;
import java.util.Random;
/**
*
*
* <pre>
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
* </pre>
*
* <p>
* Simple Averaged Virtual Pivot - Quick Sort
* </p>
*
* <p>
* Unstable, In-place, Big-O-Notation Classification : n*log(n)
* </p>
*
* <p>
* A variation of quick sort where the pivot is calculated using simple average
* of highest and lowest values.
* </p>
*
*
*
* @author Ravindra HV
* @version 0.2.1
* @since 2013
*/
public class QSortSAVP {

    /*
     * The pivot calculation works only for numbers with the same sign. As such,
     * the first step is to partition positive and negative numbers, thus
     * preventing arithmetic overflow.
     */
    private static final int INITIAL_PIVOT = 0;

    private static volatile int RECURSION_COUNT = 0;
    private static volatile int MAX_RECURSION_DEPTH = 0;
    private static volatile int RECURSION_DEPTH = 0;

    private static final int[] POWERS_OF_TWO = { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384,
            32768, 65536 };

    public static void main(String[] args) {
        // int[] arr = {0,4,2,6,-1,-5,-3,-7};
        // int[] arr = {0,4,2,6,-1,-5,-3,-7,35,41,2,6,-34,76};
        // int[] arr = {1,2,4,8,16,32,64,128,256,512};
        // int[] arr = {1024,32,64,1,2,4,8,16,128,256,512};
        // int[] arr = {-256,-512,-1,-2,-4,-8,-16,-32,-64,-128,};
        // int[] arr =
        // {1,2,4,8,16,32,64,128,256,512,1,2,4,8,16,32,64,128,256,512,1,2,4,8,16,32,64,128,256,512,1,2,4,8,16,32,64,128,256,512,1,2,4,8,16,32,64,128,256,512,1,2,4,8,16,32,64,128,256,512,1,2,4,8,16,32,64,128,256,512,1,2,4,8,16,32,64,128,256,512,1,2,4,8,16,32,64,128,256,512,1,2,4,8,16,32,64,128,256,512};
        // int[] arr =
        // {-2,1,1,1,1,1,-1,-1,-1,-1,-1,0,0,0,0,0,-1,-1,-1,-1,-1,0,0,0,0,0,1,1,1,1,1,2};
        int[] arr = new int[1024 * 1024];
        Random random = new Random();
        for (int i = 0; i < arr.length; i++) {
            arr[i] = random.nextInt(arr.length);
            // arr[i] = i;
            // arr[i] = arr.length-i;
            // arr[i] = arr[i] % 1024;
            // int j = i % POWERS_OF_TWO.length;
            // arr[i] = POWERS_OF_TWO[j];
            // if( i % 2 == 0) {
            // arr[i] = arr.length-i;
            // }
            // else {
            // arr[i] = random.nextInt(arr.length);
            // }
        }
        // handlePrintLine(Arrays.toString(arr));
        long start = System.currentTimeMillis();
        sort(arr);
        // Arrays.sort(arr);
        long end = System.currentTimeMillis();
        System.out.println("Time taken : " + (end - start) + "... Recursion count :" + RECURSION_COUNT
                + ", Recursion-Depth :" + RECURSION_DEPTH + ", MaxRecursionDepth :" + MAX_RECURSION_DEPTH);
        // handlePrintLine(Arrays.toString(arr));
        validate(arr);
        // handlePrintLine("Recursion count : "+RECURSION_COUNT );
    }

    /**
     * Sorts the given array in ascending order.
     *
     * @param arr the array to sort in place
     */
    public static void sort(int[] arr) {
        if (arr.length < 2) {
            return;
        }
        sort(arr, INITIAL_PIVOT, 0, arr.length, true);
    }

    /**
     * Sorts the given array in ascending order, optionally on a copy.
     *
     * @param arr        the array to sort
     * @param createCopy whether to sort a copy (to ensure correctness in the
     *                   event of concurrent modification of the given array)
     * @return the sorted array (the copy, if one was requested)
     */
    public static int[] sort(int[] arr, boolean createCopy) {
        int[] resArr = null;
        if (createCopy) {
            int[] tempArr = new int[arr.length];
            System.arraycopy(arr, 0, tempArr, 0, tempArr.length);
            sort(tempArr);
            resArr = tempArr;
        } else {
            sort(arr);
            resArr = arr;
        }
        return resArr;
    }

    private static void sort(int[] arr, int pivotVal, int lowIndex, int highIndex, boolean firstIteration) {
        RECURSION_COUNT++;
        // handlePrintLine("Low-Index : " + lowIndex);
        // handlePrintLine("High-Index : " + highIndex);
        // print(arr, lowIndex, highIndex);
        // handlePrintLine("Pivot : " + pivotVal);
        int tempLowIndex = lowIndex;
        int tempHighIndex = highIndex;
        while (tempLowIndex < tempHighIndex) {
            // advance past entries that already belong to the left (<= pivot) side
            while ((tempLowIndex < highIndex) && (arr[tempLowIndex] <= pivotVal)) {
                tempLowIndex++;
            }
            if (!firstIteration && tempLowIndex == highIndex) {
                return; // all entries in given range are less than or equal to pivot..
            }
            // retreat past entries that already belong to the right (> pivot) side
            while ((tempHighIndex > tempLowIndex) && (arr[tempHighIndex - 1] > pivotVal)) {
                tempHighIndex--;
            }
            if (tempLowIndex < tempHighIndex) {
                swap(arr, tempLowIndex, tempHighIndex - 1);
                tempLowIndex++;
                tempHighIndex--;
            }
        }
        // handlePrintLine("Final-Low-Index : " + tempLowIndex);
        // handlePrintLine("Final-High-Index : " + tempHighIndex);
        // print(arr, lowIndex, highIndex);
        if ((tempLowIndex - lowIndex) > 1) {
            int leftPartPivotVal = determinePivot(arr, lowIndex, tempLowIndex);
            RECURSION_DEPTH++;
            MAX_RECURSION_DEPTH = (RECURSION_DEPTH > MAX_RECURSION_DEPTH) ? RECURSION_DEPTH : MAX_RECURSION_DEPTH;
            sort(arr, leftPartPivotVal, lowIndex, tempLowIndex, false);
            RECURSION_DEPTH--;
        }
        if ((highIndex - tempLowIndex) > 1) {
            int rightPartPivotVal = determinePivot(arr, tempLowIndex, highIndex);
            RECURSION_DEPTH++;
            MAX_RECURSION_DEPTH = (RECURSION_DEPTH > MAX_RECURSION_DEPTH) ? RECURSION_DEPTH : MAX_RECURSION_DEPTH;
            sort(arr, rightPartPivotVal, tempLowIndex, highIndex, false);
            RECURSION_DEPTH--;
        }
    }

    /**
     * <p>
     * The pivot is calculated as the simple average of the highest and lowest
     * elements, while ensuring that there is no overflow.
     * </p>
     *
     * @param arr       the array to scan
     * @param lowIndex  the start of the range (inclusive)
     * @param highIndex the end of the range (exclusive)
     * @return the pivot value for the given range
     */
    private static int determinePivot(int[] arr, int lowIndex, int highIndex) {
        int lowVal = arr[lowIndex];
        int highVal = lowVal;
        for (int i = lowIndex; i < highIndex; i++) {
            if (arr[i] < lowVal) {
                lowVal = arr[i];
            }
            if (arr[i] > highVal) {
                highVal = arr[i];
            }
        }
        // overflow-safe midpoint; works because all values in the range share a sign
        int pivotVal = lowVal + ((highVal - lowVal) / 2);
        // pivotVal = lowVal + ((highVal - lowVal) >> 1);
        return pivotVal;
    }

    private static void swap(int[] arr, int lowIndex, int highIndex) {
        int tempVal = arr[lowIndex];
        arr[lowIndex] = arr[highIndex];
        arr[highIndex] = tempVal;
    }

    private static void print(int[] arr, int lowIndex, int highIndex) {
        for (int i = lowIndex; i < highIndex; i++) {
            if (i == 0) {
                handlePrint(arr[i]);
            } else {
                handlePrint(" " + arr[i]);
            }
        }
        handlePrintLine("");
    }

    private static void validate(int[] arr) {
        boolean sorted = true;
        for (int i = 0; i < arr.length - 1; i++) {
            if (arr[i] > arr[i + 1]) {
                sorted = false;
                break;
            }
        }
        if (sorted) {
            handlePrintLine("SUCCESS : ARRAY SORTED. Length : " + arr.length);
        } else {
            handlePrintLine("ERROR : ARRAY NOT SORTED. Length : " + arr.length);
        }
    }

    private static void handlePrint(Object object) {
        System.out.print(object.toString());
    }

    private static void handlePrintLine(Object object) {
        System.out.println(object.toString());
    }
}
PS: I blogged about the algorithm previously, around 2013.
Basically, the goal is to determine whether it is comparable to the median-of-three approach that Java has traditionally used in its Arrays.sort() implementation.
In the test cases that I have run, it appears comparable in time to the median-of-three algorithm, even for random datasets. So I want to know whether the concept holds in general, or whether it is just by chance and there is a dataset for which median-of-three is better than this approach.
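A rough side-by-side comparison can be reproduced with a harness along these lines (my sketch, not the exact setup behind the timings above; array size and seed are arbitrary, and a single timed run without JIT warm-up is only indicative):

    import java.util.Arrays;
    import java.util.Random;

    public class CompareSorts {
        public static void main(String[] args) {
            // identical random input for both sorts (size and seed chosen arbitrarily)
            int[] data = new Random(42).ints(1 << 20, 0, 1 << 20).toArray();
            int[] copyA = Arrays.copyOf(data, data.length);
            int[] copyB = Arrays.copyOf(data, data.length);

            long t0 = System.nanoTime();
            QSortSAVP.sort(copyA);
            long t1 = System.nanoTime();
            Arrays.sort(copyB);
            long t2 = System.nanoTime();

            System.out.println("QSortSAVP   : " + (t1 - t0) / 1_000_000 + " ms");
            System.out.println("Arrays.sort : " + (t2 - t1) / 1_000_000 + " ms");
        }
    }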
Update 17 Nov 2018
Below are some clarifications in response to greybeard's queries:
(1)
Idea behind creating it: simply to see if I could come up with a sorting algorithm by myself, and have it confirmed as being a unique one.
I have posted previously about a different algorithm as well, although that was only about its complexity and I have not posted any code for it.
What one could do with it: the licence in the comment was intended to convey that; I couldn't think of a better way to do it.
(2)
- Concerns: in the context of this variation of pivot selection, only whether my understanding of the worst-case scenario is correct, or whether there are further worst cases (in which case I would have to start figuring out a fix, unless a fix is also suggested).
- Update: see the post-script at the end for a justification of the approach with respect to the worst-case scenario.
(3)
- Comparing quicksort's pivot-selection schemes: a few of quicksort's pivot-selection schemes include random-pivot selection and median-of-three (or dual/triple/multi-pivot schemes).
- However, I believe these are known to have worst-case scenarios; in any case they appear to be mitigation steps that do not fully guarantee the performance will not degrade to quadratic.
- This is an attempt to make it more predictable.
(4)
- How the scheme presented relates to prior work: I have primarily compared it against median-of-three's performance on random elements.
- Since median-of-three can, in odd cases, still result in \$n^2\$ performance (in theory at least), I wanted to see if there was a more predictable way.
- This seems better, at least in the sense that a few of its worst-case scenarios (numbers in a sequence of powers of 2, identical numbers) are easier to analyze.
- Note: there is an updated understanding of the worst-case analysis; see the post-script below.
(5)
- Reason for posting code: I originally wanted to just post the technique and check whether it was unique, but wasn't sure where to post it.
- Since I had built the code as part of analyzing the technique, and based on suggestions from meta, I posted it here.
- As suggested, in retrospect, Computer Science might have been a better place to ask.
(6)
- QS-SAVP acronym: SAVP is short for simple-averaged virtual pivot. I used 'simple' average since I couldn't find a better name for the (high+low)/2 approach.
- The logical extension of this approach (a different technique) would be to take the sum of all the elements and then divide by the total number of elements; the resulting value would then be the pivot to be considered. A sketch of this extension follows below.
- The first approach is similar in spirit to the statistical median (which uses the middle element, whereas this uses the highest and lowest), while the second approach is exactly the arithmetic mean.
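A sketch of that extension, assuming a hypothetical determinePivotByMean that would slot in next to determinePivot in the class above (not part of the posted code):

    // Hypothetical mean-based pivot for the extension described above.
    // The sum is accumulated in a long, so it cannot overflow for int inputs,
    // which means this variant would not need the sign-partitioning first pass.
    private static int determinePivotByMean(int[] arr, int lowIndex, int highIndex) {
        long sum = 0;
        for (int i = lowIndex; i < highIndex; i++) {
            sum += arr[i];
        }
        return (int) (sum / (highIndex - lowIndex)); // mean of all elements in range
    }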
(7)
Pivot calculation: the technique used is pivot = low + ((high - low) / 2), which doesn't work if the array elements are a mix of positive and negative numbers (Eg: the maximum possible value and the minimum possible value for the given bit length, where high - low itself overflows).
Similarly, the 'first-iteration' flag exists to ensure that an array of entirely negative elements is not skipped, since the early-return logic (intended to handle arrays with identical elements) would otherwise do precisely that.
To summarize: first, the pivot value 0 is used to separate positive and negative values; the check that tempLowIndex has reached highIndex ensures performance does not degrade to \$n^2\$ for arrays of equal elements; and the 'first-iteration' flag ensures an array of entirely negative elements is not skipped. The overflow cases are illustrated below.
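The overflow behaviour described above is easy to demonstrate (a standalone snippet of mine, not part of the posted class):

    public class MidpointOverflowDemo {
        public static void main(String[] args) {
            int high = Integer.MAX_VALUE;
            int low = 1;
            System.out.println((high + low) / 2);       // high + low overflows: prints -1073741824
            System.out.println(low + (high - low) / 2); // safe while signs match: prints 1073741824

            low = Integer.MIN_VALUE;                     // now the signs differ...
            System.out.println(low + (high - low) / 2); // high - low overflows: prints -2147483648
        }
    }

This is why the first pass partitions around 0: once a range is all-positive or all-negative, high - low fits in an int and the midpoint formula is safe.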
(8)
- With regard to expanding Arrays.copyOf into System.arraycopy: it's just what I had picked up (also, Arrays.copyOf was only introduced around JDK 6).
(9)
- Notes: I believe a technique for determining a worst-case scenario for the median-of-three algorithm can be found here, although I have not analyzed it myself.
(10)
Updated understanding w.r.t. the worst-case scenario:
- PS: Now that I think about it, using the virtual-pivot technique, it seems impossible for the recursion to go beyond about \$\log_2(n)+1\$ levels.
The justification is as follows:
Consider a 32-bit unsigned integer.
Min is 0 and Max is 4,294,967,295 (let us denote it 'n').
The first level would use 2,147,483,648 (n/2) as the pivot.
It would split the array into two halves (each being a level-2 sub-array):
0 to 2,147,483,648 and 2,147,483,648 to 4,294,967,295 (LHS inclusive, RHS exclusive).
Subsequent splits at level 3 would be in steps of n/4: 0 to 1,073,741,824; 1,073,741,824 to 2,147,483,648; 2,147,483,648 to 3,221,225,472; 3,221,225,472 to 4,294,967,295.
Subsequent splits at level 4 would be in steps of n/8, at level 5 in steps of n/16,
and so on until the final split at around level 33, in steps of n/4,294,967,296.
Note that the above concerns virtual pivots. The elements may or may not be present: if they are present they get sorted; if not, it is effectively a non-event. I am merely trying to assert that, in terms of complexity, it will not degrade to quadratic.
In the above technique, I am merely applying an optimization involving the highest- and lowest-valued elements, effectively bringing the value of the virtual pivot closer to reality by calculating it as pivot = (high+low)/2.
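That depth bound can be sanity-checked in a few lines (my sketch): halving the 32-bit value range collapses it to a single value after exactly 32 steps, which matches the 'around level 33' figure above.

    public class DepthBoundDemo {
        public static void main(String[] args) {
            long range = 1L << 32; // number of distinct 32-bit values
            int levels = 0;
            while (range > 1) {
                range = (range + 1) / 2; // each level at least halves the remaining value range
                levels++;
            }
            System.out.println(levels); // prints 32
        }
    }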
(11)
- With regard to using a logging package: I would typically use log4j, but am trying to avoid external dependencies for the POC here.