
Thursday, November 13, 2014

Treapy Sort --Getting a Bit Treapy with a Treap

Treap

A treap is a randomized data structure, a hybrid of a heap and a binary search tree. It was created by Cecilia R. Aragon and Raimund Seidel in 1989. What is fascinating is the use of random priorities to maintain the heap structure, or heap property. Randomness is used to bring order and structure. As Mr. Spock would say, "Fascinating..."

The basic operations of access, insert, and remove all run in O(lg N) expected time. Hence a treap holding 2**32 = 4,294,967,296 (about 4 billion) elements requires only lg(2**32) = 32 operations. So a treap, while a hybrid of a tree and a heap, is very efficient in performance time.
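As a hedged illustration of these operations (a minimal sketch, not the post's actual code), a treap in Java might look like this: each node carries a key and a random priority; insertion is a standard BST insert followed by rotations to restore the heap property.

```java
import java.util.Random;

// Minimal treap sketch: BST-ordered by key, heap-ordered by a random priority.
// The class and field names are illustrative assumptions.
class Treap {
    static final Random RNG = new Random();

    static class Node {
        int key;
        int count = 1;              // duplicate counter
        int priority = RNG.nextInt();
        Node left, right;
        Node(int key) { this.key = key; }
    }

    Node root;

    void insert(int key) { root = insert(root, key); }

    // Standard BST insert, then rotate up to restore the heap property.
    static Node insert(Node t, int key) {
        if (t == null) return new Node(key);
        if (key == t.key) { t.count++; return t; }
        if (key < t.key) {
            t.left = insert(t.left, key);
            if (t.left.priority > t.priority) t = rotateRight(t);
        } else {
            t.right = insert(t.right, key);
            if (t.right.priority > t.priority) t = rotateLeft(t);
        }
        return t;
    }

    static Node rotateRight(Node t) {  // left child becomes the subtree root
        Node l = t.left; t.left = l.right; l.right = t; return l;
    }

    static Node rotateLeft(Node t) {   // right child becomes the subtree root
        Node r = t.right; t.right = r.left; r.left = t; return r;
    }
}
```

Because priorities are random, the expected depth of the tree stays logarithmic regardless of insertion order, which is where the O(lg N) expected bound comes from.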

Infimum and Supremum

One operation on a treap is to get the minimal or maximal element--the mathematical terms are infimum and supremum. I use the term extrema to mean either the maximum or minimum element.

Not all treap data structure implementations have extrema methods such as getMin() or getMax(), but such methods are necessary for the treapy sort.
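A sketch of what such extrema methods might look like, assuming a conventional node layout with left and right children (the names here are hypothetical, not from any particular implementation):

```java
// Hypothetical node layout for illustration.
class TreapNode {
    int key;
    TreapNode left, right;
    TreapNode(int key) { this.key = key; }
}

class Extrema {
    // The minimum lives at the end of the leftmost spine of the BST.
    static int getMin(TreapNode root) {
        TreapNode n = root;
        while (n.left != null) n = n.left;
        return n.key;
    }

    // Symmetrically, the maximum is at the end of the rightmost spine.
    static int getMax(TreapNode root) {
        TreapNode n = root;
        while (n.right != null) n = n.right;
        return n.key;
    }
}
```

Both walks follow a single root-to-leaf path, so each costs O(lg N) expected time on a treap.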

Sorting by Treap

The treap as a method of sorting is discussed in A Modular Calculus for the Average Cost of Data Structuring by Michel Schellekens in 2008. Using the extrema methods, the sorting process is simply to call getMin() or getMax() repeatedly and delete each element as it is retrieved.

But after deletion, the treap requires rebalancing to maintain the heap property. Hence deletion carries the expense of rebalancing the treap.

Wikipedia explains, "To delete a node x from the treap, if x is a leaf of the tree, simply remove it. If x has a single child z, remove x from the tree and make z be the child of the parent of x (or make z the root of the tree if x had no parent). Finally, if x has two children, swap its position in the tree with the position of its immediate successor z in the sorted order, resulting in one of the previous cases. In this final case, the swap may violate the heap-ordering property for z, so additional rotations may need to be performed to restore this property."
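The deletion Wikipedia describes can be sketched in code: a node with two children is rotated toward its higher-priority child until one of the easy cases applies. This is an assumed illustration of the standard technique, not the post's code, and the node layout is hypothetical.

```java
import java.util.Random;

// Rotate-to-leaf treap deletion, as described in the quoted passage.
class DelTreap {
    static final Random RNG = new Random();

    static class Node {
        int key;
        int priority = RNG.nextInt();
        Node left, right;
        Node(int key) { this.key = key; }
    }

    static Node rotateRight(Node t) { Node l = t.left;  t.left = l.right;  l.right = t; return l; }
    static Node rotateLeft(Node t)  { Node r = t.right; t.right = r.left;  r.left = t;  return r; }

    static Node delete(Node t, int key) {
        if (t == null) return null;
        if (key < t.key)          t.left  = delete(t.left, key);
        else if (key > t.key)     t.right = delete(t.right, key);
        else if (t.left == null)  return t.right;   // leaf or single child: unlink
        else if (t.right == null) return t.left;
        else {
            // Two children: rotate the higher-priority child up, push t down,
            // then recurse until t reaches an easy case.
            if (t.left.priority > t.right.priority) {
                t = rotateRight(t);
                t.right = delete(t.right, key);
            } else {
                t = rotateLeft(t);
                t.left = delete(t.left, key);
            }
        }
        return t;
    }
}
```

The rotations on the way down are exactly the rebalancing expense the treapy sort aims to avoid.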

Getting Treapy

The treapy sort works by getting the extreme element, but avoids the cost of rebalancing the treap--however small the Big-Oh cost--while preserving the structure of the treap during sorting. How does it do this? That's why it's not a treap sort but a treapy sort.

One important feature of the treapy sort (and likewise the treap sort) is the handling of duplicate elements: each node in the treap maintains a count of duplicates (which also simplifies treap operations overall). Hence for a list of N elements the space is O(M), where M <= N is the number of distinct elements.

The treapy sort is linearithmic, or O(N lg N).

For a list of 2**32 = 4,294,967,296 data elements, the Big-Oh time is O(2**32 * lg(2**32)), which simplifies to O(2**32 * 32) or O(2**32 * 2**5) or O(2**37).

One feature is that if the elements are partially ordered, or even totally sorted, the treapy sort's performance is unchanged--it remains linearithmic.

The treapy sort is akin to the heap sort, but uses a treap, a hybrid of a heap and a tree. One thesis compares the heap sort to the treap sort in greater detail. The abstract states, "It was discovered that treap sort is much more faster than heap sort in sorting and searching for elements."

Source Code for Treapy Sort

The idea for the treapy sort came to me while examining diagrams, figures, and code for a treap. One of the best things about being a computer scientist is that to test an idea is to code, execute, test, and refine or optimize. I've implemented the treapy sort in Java to verify that it works.

Writing a research monograph, and doing a performance comparison against other linearithmic sorting algorithms--merge sort, quick sort, heap sort--is an experiment for the future. Perhaps even a comparison with a treap sort that pays the cost of deletion and rebalancing the treap.

What is fascinating is that randomness is used to maintain the heap property, and yet order emerges...the ordering of the sort.

I *might* post the Java source code, depending upon the level of interest. I *might* write a research monograph, again depending upon the level of interest. The downside of being a computer scientist is that publishing results is involved, requiring a research monograph, whereas writing source code is simpler and quicker.

Advantages of Treapy Sort

The primary advantages of the treapy sort are:

  1. The treapy sort maintains the treap structure (the treap is left unmodified).
  2. No need or expense to delete elements and rebalance the treap.
  3. Linearithmic performance time, independent of the source data.

Another advantage is that if a treap is already used in an application (for example, a compiler symbol table, or I considered using it in SEQu-RAmA), the treapy sort can extract all the data in ascending or descending sorted order--say, for a debug operation, or simply to dump the contents of the treap data structure.
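The post keeps the treapy sort's mechanism private, so the following is only an assumed illustration of the dump-in-sorted-order use case: a plain in-order traversal reads a treap out in ascending key order without deleting or rebalancing anything. The node layout (key plus duplicate count) is a hypothetical stand-in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node: key plus a duplicate counter, as discussed above.
class DumpNode {
    int key, count;
    DumpNode left, right;
    DumpNode(int key, int count) { this.key = key; this.count = count; }
}

class TreapDump {
    // In-order traversal of a BST-ordered treap yields ascending key order;
    // the treap itself is never modified.
    static void inorder(DumpNode n, List<Integer> out) {
        if (n == null) return;
        inorder(n.left, out);
        for (int i = 0; i < n.count; i++) out.add(n.key);  // expand duplicates
        inorder(n.right, out);
    }
}
```

Visiting every node once makes the dump O(N); whether the treapy sort itself works this way is not stated in the post.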

Saturday, October 9, 2010

Have Data, Will Sort

I like to create, analyze, test, and tinker with algorithms, especially sorting algorithms. Lately, sorting has pursued hybridized sorts, like the Introsort in C++, and more recently the TimSort. The TimSort is slated to become the default sort algorithm in the JDK. TimSort is a hybrid of merge sort and insertion sort that first emerged in the Python realm. Another hybrid algorithm is the FlashSort, which uses the insertion sort to finish after partial sorting. Another is the Dual-Pivot Quicksort. It seems elegance is lost in sorting, and algorithms are spliced together.

The TimSort requires a Comparator to be implemented: a specific class with a function to compare any two objects that can be defined in Java. No more magic comparison defined on integers, floats, doubles, or characters from the encoding, and likewise on a string (an array of characters, from a C perspective).
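For concreteness, here is a minimal sketch of such a Comparator; the Point class and its fields are hypothetical examples, but Arrays.sort with a Comparator is the TimSort-backed path for object arrays in the JDK.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical value class used only to illustrate a Comparator.
class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

class ComparatorDemo {
    // Order points by x, breaking ties by y -- the explicit comparison
    // function that replaces the "magic" ordering on primitives.
    static final Comparator<Point> BY_XY =
        Comparator.<Point>comparingInt(p -> p.x).thenComparingInt(p -> p.y);
}
```

Usage: `Arrays.sort(points, ComparatorDemo.BY_XY);` dispatches to the JDK's TimSort for object arrays.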

One sorting algorithm I created and tested was the Binar Sort, a sorting algorithm that uses the binary digits to partition at each binary digit position. Further contemplation, tinkering, and experimenting led to another sorting algorithm, the WillSort. I won't explain how the algorithm works--that's for a research monograph--but my testing has compared it to the TimSort, QuickSort, MergeSort, FlashSort, and Dual-Pivot Quicksort. The test was on 32-bit integers in a random permutation, to avoid the worst case of QuickSort and to force all the algorithms to work without the special case of a partly or nearly sorted data set.

To push the performance comparison, the test used large data sets of 100-million to 500-million integers. After each sort, a static method checked that the array of integers was sorted, independently of the sorting algorithm, as a cross-check for any anomalous or unexpected results. The same data set was used each time, randomly shuffled with the same seed to the pseudo-random number generator. Thus, each sorting algorithm worked on the same-size, same-permutation data set of 32-bit integers. All things were equal except the sorting algorithm used.
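A sketch of that kind of harness follows; the names (isSorted, a seeded Fisher-Yates shuffle) are illustrative assumptions, not the author's actual code.

```java
import java.util.Random;

class Harness {
    // Independent cross-check: verify ascending order without relying on
    // the sorting algorithm under test.
    static boolean isSorted(int[] a) {
        for (int i = 1; i < a.length; i++)
            if (a[i - 1] > a[i]) return false;
        return true;
    }

    // Seeded Fisher-Yates shuffle: the same seed reproduces the same
    // permutation, so every algorithm sees identical input.
    static void shuffle(int[] a, long seed) {
        Random rng = new Random(seed);
        for (int i = a.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
        }
    }
}
```

Reshuffling with the same seed before each run is what makes the timings directly comparable across algorithms.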

The platform was a JVM with a 6-gigabyte heap, on a Windows PC running at 2.7 GHz. The results are measured in milliseconds. Here is some raw data from a random permutation:

Size          Quick    Merge    Flash     Tim      2Pivot   Will

100-million   16926    24008    32105     54709    55116     9706
200-million   35849    53352    82336    126391   126329    19594
300-million   55661    77454   108218    238945   386350    32308
400-million   73289   149573   182769    426597   407020    40482
500-million   97515   307663   224577    990444   984796    51761

Interestingly, but not surprisingly, QuickSort (and not the 2-pivot QuickSort) maintains a speed advantage. WillSort is the best, and does not have any sudden spikes or performance quirks. The MergeSort and FlashSort are in the middle. The TimSort and 2-pivot QuickSort lag behind, even though they are hybridized versions of the MergeSort and QuickSort. The FlashSort partially sorts and then finishes with the Insertion Sort, so it is less a composite of two full sorts than a partial sort completed by another.

The ratio of the QuickSort time to the WillSort time is 1.7 to 1.8, fluctuating up to 1.88 at 510-million (the increment between each interval was 10-million integers). Hence the time to process each element (processing for the QuickSort means selecting a pivot element and partitioning using comparison) differs by a factor of 1.7 to 1.8 between the QuickSort and the WillSort.

It seems perplexing that the focus is on the TimSort and 2-Pivot Quicksort when the plain QuickSort remains more efficient, and there are other potential algorithms (including, of course, the WillSort).

In the worst case, ignoring other possibilities, given that the Introsort is time-tested by the wider group of C++ Standard Template Library developers, a port of the Introsort from C++ to Java seems more plausible.

But have data, will sort using WillSort.