
Thursday, July 9, 2015

Binar--Binar Sort, Binar Shuffle, and Binar Search: the Big Three of Algorithms

The three areas of algorithms are sort (order data into a specific permutation), shuffle (scramble data into a random permutation), and search (structure data into an indexed permutation).

Binar

Two of the three algorithms are already in the "binar" category. By binar is meant "binarY," not the two associations often made with the word "binar": the Bynar aliens from Star Trek: The Next Generation, or the Goa'uld Bynarr from Stargate SG-1.

Binar Algorithms

The binar sort orders data by using a bit extraction function to partition the N elements into M partitions by exchange. The binar sort partitions like a quicksort, only with no pivot element, and with 2 <= M < N sub-arrays.
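
As a minimal sketch of the partitioning idea (my own simplification, assuming non-negative 32-bit integers and one extracted bit per pass, so M = 2; the class and method names are mine, not the monograph's):

    // Sketch of a binary radix-exchange partition in the spirit of the
    // binar sort: partition by one extracted bit, then recurse on each
    // sub-array with the next lower bit. Assumes non-negative ints.
    public final class BinarSortSketch {

        // Bit extraction function: the value of bit `bit` of datum x.
        private static int extract(int x, int bit) {
            return (x >>> bit) & 1;
        }

        public static void sort(int[] a) {
            sort(a, 0, a.length - 1, 30); // bit 31 is the sign bit; skip it
        }

        private static void sort(int[] a, int lo, int hi, int bit) {
            if (lo >= hi || bit < 0) return;
            int i = lo, j = hi;
            // Exchange elements so 0-bits precede 1-bits: a partition
            // pass with no pivot element, unlike a quicksort.
            while (i <= j) {
                while (i <= j && extract(a[i], bit) == 0) i++;
                while (i <= j && extract(a[j], bit) == 1) j--;
                if (i < j) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
            }
            sort(a, lo, j, bit - 1); // the 0-bit partition
            sort(a, i, hi, bit - 1); // the 1-bit partition
        }
    }

Extracting k bits per pass instead of one would give M = 2^k partitions, within the 2 <= M < N range described above.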

The binar shuffle scrambles data by using a bit extraction function, but to randomly move and exchange data elements. One interesting property of the binar shuffle is reversibility--the binar shuffle is invertible. The final permutation can be returned to the original permutation by an inverse function (something I still need to add, along with a basic proof, to the research monograph).
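
The binar shuffle itself is in the monograph; purely as a toy illustration of how a bit-level mapping can be invertible (this is not the binar shuffle), consider moving each element by XOR-ing its index with a random key, assuming a power-of-two length. The mapping is its own inverse, so applying it twice restores the original permutation:

    import java.util.Random;

    // Toy illustration of an invertible, bit-driven shuffle (not the
    // binar shuffle itself): move the element at index i to index
    // i ^ key. XOR with a fixed key is a bijection on [0, n) when n is
    // a power of two, and it is self-inverse: shuffling twice restores
    // the original permutation.
    public final class XorShuffleSketch {

        public static int[] shuffle(int[] a, int key) {
            int[] out = new int[a.length]; // a.length must be a power of two
            for (int i = 0; i < a.length; i++) {
                out[i ^ key] = a[i]; // bitwise index mapping
            }
            return out;
        }

        public static void main(String[] args) {
            int[] a = {10, 20, 30, 40, 50, 60, 70, 80};
            int key = new Random().nextInt(a.length); // key in [0, n)
            int[] scrambled = shuffle(a, key);
            int[] restored = shuffle(scrambled, key); // same map is the inverse
            // restored is now element-for-element equal to a
        }
    }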

Binar Search

The simplest search is to use the binar sort to organize the data elements into an ordered permutation, and then use a binary search to find an element. The Big-Oh for such an algorithm is O(c*N) + O(M*lg N), where M is the number of searches and N is the number of elements.
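
In standard Java terms the composition looks like this, with the JDK's built-in sort standing in for the binar sort:

    import java.util.Arrays;

    // The sort-then-search composition: pay the ordering cost once,
    // then O(lg N) per binary search, for M searches in total.
    public final class SortThenSearch {
        public static void main(String[] args) {
            int[] data = {42, 7, 19, 3, 88, 55};
            Arrays.sort(data); // one-time ordering cost
            int[] queries = {19, 60}; // M = 2 searches here
            for (int q : queries) {
                int at = Arrays.binarySearch(data, q);
                System.out.println(q + (at >= 0 ? " found at " + at : " not found"));
            }
        }
    }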

Consider the following values:

  • 4096 search operations: M = 4096 = 2^12
  • 65536 elements: N = 65536 = 2^16, so lg N = 16 = 2^4
  • Complexity: O(c*2^16) + O(2^12 * 2^4) = O(c*2^16) + O(2^16) = O(2*c*2^16), or 2*c*2^16, or c*2^17

Thus the binar sort is efficient, but for a sufficiently large data set and many searches, the complexity of the binar sort plus binary search approaches that of sorting the data elements twice. If the value of M increases beyond the value of N (which is logical to presume--more searches on the data set once it is ordered), the total cost becomes linearithmic: when M >= N, O(c*N) + O(M*lg N) is at least O(c*N) + O(N*lg N).

Variation of Shuffle and Sort

The binar search is a variation upon the binar sort and binar shuffle. Only the bit extraction function is changed--instead of creating an ordered or random permutation, the bit function maps each data element to an index, and the data elements are organized accordingly. The array of data elements is initially unorganized, in a random permutation. The binar search arranges the elements by an index computed by the bit extraction function, positioning each element at that index--in effect, hashing the elements into a hash table.
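
Here is a minimal sketch of that mapping under my own simplifying assumptions (the class and method names are mine): the bit extraction function takes the top bits of each non-negative integer as its index, and colliding elements share a bucket at that index:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the binar-search mapping idea: the bit extraction
    // function takes the top `bits` bits of a non-negative int as an
    // index into 2^bits buckets. The same function later locates an
    // element, so a search repeats the placement computation.
    public final class BinarMapSketch {
        private final int bits;                  // bits used for the index
        private final List<List<Integer>> table; // 2^bits buckets

        public BinarMapSketch(int bits) {
            this.bits = bits;
            this.table = new ArrayList<>();
            for (int i = 0; i < (1 << bits); i++) table.add(new ArrayList<>());
        }

        // Bit extraction function: top `bits` bits of a 31-bit value.
        private int index(int x) {
            return x >>> (31 - bits);
        }

        public void insert(int x) { table.get(index(x)).add(x); }

        // A search is a "virtual insert": the same index computation,
        // but probing rather than storing.
        public boolean contains(int x) { return table.get(index(x)).contains(x); }
    }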

Search and Sort

This might seem unusual, but the proxmap sort by Standish also has a proxmap search. The binar sort simply puts data elements into an ordered permutation, where the organization depends upon the predecessor and successor of each element in the permutation. The binar search first maps the data elements, and then each data element is accessed, or searched for, using the mapping.

The complexity of searching involves:

  1. index data, pre-process, create permutation - O(N)
  2. search data, use search function - constant O(c)

For the number of searches M, there are two possibilities:

  1. O(c*N) + O(M*c) when M < N
  2. O(c*N) + O(c*M) = O(2*c*N) when M = N, and still linear for M > N (it does not become linearithmic)

Searching

A search after mapping the data in an array can use the same bit extraction function to find data elements. With the right design of the bit extraction function, the data elements that the function maps into one partition are related, and accessible as a group. Instead of the complete bit extraction function at a constant cost c, a lesser constant k < c suffices to find a data element, or a group of data elements.

Thus the complexity when M < N is O(c*N) + O(k*M) for the searches, or O(c*N) + O(k*N) when M >= N. More simply, O((c+k)*N) for a search function that does not require the entire bit extraction function to find an element.
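
A sketch of the partial extraction idea, again under my own assumptions (a table of 2^bits buckets as in the sketch above, with k <= bits; names are mine):

    import java.util.ArrayList;
    import java.util.List;

    // Partial extraction sketch: instead of the full `bits`-bit index
    // (cost c), use only the top k bits (cost k < c) to reach the whole
    // group of buckets sharing that k-bit prefix.
    public final class PartialExtraction {
        public static List<Integer> group(List<List<Integer>> table,
                                          int bits, int x, int k) {
            int prefix = (x >>> (31 - bits)) >>> (bits - k); // top k bits only
            List<Integer> hits = new ArrayList<>();
            int lo = prefix << (bits - k);       // first bucket of the group
            int hi = (prefix + 1) << (bits - k); // one past the last bucket
            for (int i = lo; i < hi; i++) hits.addAll(table.get(i));
            return hits;
        }
    }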

Bit Extraction Function

The core of a binar search is the bit extraction function: map a data element to an index, then use the same bit extraction function to access an element or elements. In effect, a search is an insert--but instead of putting an element into the array, it is a virtual insert that verifies whether an element is in the array.

Number of Permutations

Each operation has a characteristic number of significant, possible permutations.

  1. Sort - create ordered permutation (2 permutations) only ascending/descending
  2. Shuffle - create random permutation (N!-2 permutations) excluding the 2 sorted permutations
  3. Search - create indexed permutation (N! permutations) any possible permutation

The ranking 2 < N!-2 < N! is an interesting relationship among the permutations. For example, with N = 3 there are 3! = 6 permutations: 2 sorted, leaving 3!-2 = 4 random. Considered as permutation rankings: sort < shuffle < search.

Binar Search--the Missing Algorithm

The binar sort algorithm was the first binar algorithm, but with the right bit extraction function it could also unsort, or shuffle--the second binar algorithm. The third algorithm is the binar search: using the bit extraction function, it can take an array of elements and map them into a particular permutation that can be searched after processing. Hence searching, shuffling, and sorting are the same algorithm, a binar algorithm--just change or modify the bit extraction function.

Someday I'll have to write a research monograph about the binar search, just as I have for the binar sort and the binar shuffle. Until that day then...someday.

Saturday, October 9, 2010

Have Data, Will Sort

I like to create, analyze, test, and tinker with algorithms, especially sorting algorithms. Lately, sorting has pursued hybridized sorts, like the Introsort in C++, and now the TimSort. The TimSort is slated to become the default sort algorithm in the JDK. TimSort is a merge sort hybridized with insertion sort that first emerged in the Python realm. Another hybrid algorithm is the FlashSort, which uses the insertion sort to finish sorting after a partial sort. Yet another hybridized algorithm is the Dual-Pivot Quicksort. It seems elegance is lost in sorting, and algorithms are spliced together.

The TimSort requires a Comparator to be implemented--a specific class with a function to compare any two objects that can be defined in Java. No more magic comparison defined on integers, floats, doubles, or characters from the encoding, and likewise strings (arrays of characters, from a C perspective).
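
For example, with the standard Java API (an anonymous class, in the pre-lambda style of the era), a descending sort of boxed integers spells the comparison out as an object:

    import java.util.Arrays;
    import java.util.Comparator;

    // A Comparator spells out the comparison as an object, instead of
    // the built-in ordering of primitive ints. Arrays.sort on an
    // Object[] dispatches to TimSort in JDK 7 and later.
    public final class ComparatorExample {
        public static void main(String[] args) {
            Integer[] data = {42, 7, 19, 3, 88};
            Comparator<Integer> descending = new Comparator<Integer>() {
                public int compare(Integer a, Integer b) {
                    return b.compareTo(a); // reverse the natural order
                }
            };
            Arrays.sort(data, descending);
            System.out.println(Arrays.toString(data)); // [88, 42, 19, 7, 3]
        }
    }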

One sorting algorithm I created and tested was the Binar Sort, a sorting algorithm that uses the binary digits to partition at each binary digit position. Further contemplation, tinkering, and experimenting led to another sorting algorithm, the WillSort. I won't explain how the algorithm works--that's for a research monograph--but my testing has compared it to the TimSort, QuickSort, MergeSort, FlashSort, and Dual-Pivot Quicksort. The test was on 32-bit integers in a random permutation, to avoid the worst case of QuickSort and to force all the algorithms to work without the special case of a partly or nearly sorted data set.

To push the performance comparison, the test was with large data sets of 100-million to 500-million integers. After each sort, a static method checked that the array of integers was sorted, independently of the sorting algorithm, as a cross-check for any anomalous or unexpected results. The same data set was used, randomly shuffled with the same seed to the pseudo-random number generator each time. Thus each sorting algorithm worked on a data set of 32-bit integers of the same size and same permutation. All things were equal except the sorting algorithm used.
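
A minimal reconstruction of that kind of harness (my sketch of the setup described, not the original test code):

    import java.util.Arrays;
    import java.util.Random;

    // Benchmark harness sketch: a fixed seed gives every sorting
    // algorithm the same permutation, and an independent sorted-check
    // runs after each sort as a cross-check.
    public final class SortHarnessSketch {

        // Fisher-Yates shuffle with a fixed seed for reproducibility.
        static void shuffle(int[] a, long seed) {
            Random r = new Random(seed);
            for (int i = a.length - 1; i > 0; i--) {
                int j = r.nextInt(i + 1);
                int t = a[i]; a[i] = a[j]; a[j] = t;
            }
        }

        // Independent cross-check, separate from any sorting algorithm.
        static boolean isSorted(int[] a) {
            for (int i = 1; i < a.length; i++)
                if (a[i - 1] > a[i]) return false;
            return true;
        }

        public static void main(String[] args) {
            int[] data = new int[100000000]; // 100-million integers
            for (int i = 0; i < data.length; i++) data[i] = i;
            shuffle(data, 42L); // same seed, same permutation, each run

            long start = System.currentTimeMillis();
            Arrays.sort(data); // swap in the sort under test here
            long elapsed = System.currentTimeMillis() - start;

            System.out.println(elapsed + " ms, sorted=" + isSorted(data));
        }
    }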

The platform was a JVM with a 6-gigabyte heap, on a Windows PC running at 2.7GHz. The results are measured in milliseconds. Here is some raw data from a random permutation:

Size          Quick    Merge     Flash     Tim      2-Pivot   Will

100-million   16926    24008     32105     54709    55116     9706
200-million   35849    53352     82336     126391   126329    19594
300-million   55661    77454     108218    238945   386350    32308
400-million   73289    149573    182769    426597   407020    40482
500-million   97515    307663    224577    990444   984796    51761

Interestingly, but not surprisingly, QuickSort (and not the 2-pivot QuickSort) maintains a speed advantage. WillSort is the best, and does not have any sudden spikes or performance quirks. The MergeSort and FlashSort are in the middle. The TimSort and 2-pivot QuickSort lag behind, though they are hybridized versions of the MergeSort and QuickSort respectively. The FlashSort only partly sorts before finishing with the Insertion Sort, so it too is a composite of two sorts.

The ratio of the QuickSort time to the WillSort time is 1.7 to 1.8, fluctuating to 1.88 at 510-million (the increment between intervals was 10-million integers); for example, at 100-million it is 16926/9706, about 1.74. Hence the time to process each element (processing for the QuickSort being selecting a pivot element and partitioning by comparison) is 1.7 to 1.8 times greater for the QuickSort than for the WillSort.

It seems perplexing that the focus is on the TimSort and 2-Pivot Quicksort when the plain QuickSort remains more efficient, and there are other potential algorithms (of course, the WillSort).

In the worst case, ignoring other possibilities, given that the Introsort is time-tested by the wider group of C++ developers using the Standard Template Library, a port of the Introsort from C++ to Java seems more plausible.

But have data, will sort using WillSort.

Sunday, March 14, 2010

Binar Sort--the One Thing

I put the draft monograph for a new sorting algorithm, the binar sort, online. The responses from readers varied in the specifics, but were invariably of the form "impossible to sort in linear time," or "no, it's not" (a double negative), or "because of this, it proves your algorithm is not linear."

All negative statements--proving a point by making negative assertions that are impossible to prove (remember basic logic: you cannot prove a negative, that something is not). One individual with incredibly stunning powers of observation said (to paraphrase): it will only work with binary data. I didn't bother to explain digital logic, bits, or what is under the hood inside memory chips, coprocessors, a microprocessor, and other computer hardware. A criticism by making an assertion of the obvious.

I consider myself a scientist, in particular a computer scientist. If I can be proven wrong, so be it. The broadening of knowledge and greater understanding are what matter. But the critics seem to prove by making unprovable or obvious statements, which only demonstrates one-dimensional thinking lacking any depth or understanding. Even worse, all the critics did one thing consistently--not one asked for the source code in Java, C#, or C to try the algorithm for themselves.

Algorithms have a theoretical basis and analysis, but unlike other scientific domains, computer science is applied for an algorithm--it can be run, tested, and evaluated. Yet not one person did that--it was all assertions and commentary.

Of course, many technical papers and research monographs are theoretical, and often a practical implementation is non-existent. Once, after reading about a data structure, I e-mailed the authors about an implementation. They had used the abstract pseudo-code notation popular in algorithms books, but had not implemented the data structure. The response to my query was that it was "trivial to implement," in the usual dismissive condescension. The response itself was so trivial as to be banal and predictable. In this case, theory did not transfer into the praxis of application--an applied example.

But in the case of the binar sort, I have a Java implementation. So it would only take the chutzpah to compile the binar sort, create a bytecode class, and run it with different test sets. Any incongruous results would pique my interest--a case where theory and practice do not correspond, hence an unknown factor. Learning and discovering the unknown factor increases knowledge and experience--adding to the scientific knowledge.

Unfortunately, the verbiage only demonstrates that new things are more easily dismissed as impossible than individuals are willing to re-learn and re-think what they have known.

Computer science, in that sense of being immediately testable and runnable, is one of the better sciences. An intrepid person can tinker with an algorithm, data structure, or library; refine and improve it; find anomalous behavior; and rewrite it in another programming language for a different platform. Hence the science is tangible, immediate, and testable.

The critics are demonstrating intellectual and scientific sloth--laziness and idleness by negative criticism. Dismiss something as impossible, and it removes the need to reconsider existing knowledge and integrate the new fact.

The wonderful book "The Programmer's Stone" discusses this in detail, and the different mindsets--packers and mappers.

The binar sort is an algorithm, but another kind of experiment was implicitly conducted--on the behavior of those who comment. It is somewhat interesting (although too familiar) to see that everyone did the same thing--inaction, never bothering to test the code and evaluate it for themselves first hand.

Questioning and challenging a scientific experiment is part of science, but so is replicating an experiment to verify results. That's pretty obvious, particularly in high school, where you replicate experiments in physics and chemistry to verify well-established scientific laws and constants.

But when the scientific process is conducted only through rebuttal and intangible, unprovable assertions--that's the realm of politics, word games, and inane chatter: political science (talk about an oxymoron) and sophistry. In a nutshell, the binar sort is an applied example illustrating the adage--you can lead a horse/jackass/donkey to water, but you cannot make it drink. Or as Mark Twain once said, "Don't let education get in the way of your learning." Too late.