Thursday, July 9, 2015

Binar--Binar Sort, Binar Shuffle, and Binar Search--the Big Three of Algorithms

The three areas of algorithms are sort (order data into a specific permutation), shuffle (scramble data into a random permutation), and search (structure data into an indexed permutation).

Binar

Two of the three algorithms are already in the "binar" category. By binar is meant "binarY," not the two associations often made with the word "binar": the Bynar aliens from Star Trek: The Next Generation, or the Goa'uld Bynarr from Stargate SG-1.

Binar Algorithms

The binar sort orders any data by using a bit extraction function to partition the data into M partitions through exchange of the N elements. The binar sort is like quicksort in its partitioning, only there is no pivot element, and there are from 2 <= M < N sub-arrays.
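
To make the partitioning concrete, here is a minimal sketch in Java, assuming the simplest case of a two-way (M = 2) partition on one bit per pass--the binar sort proper generalizes the bit extraction function to M partitions:

    public final class BinarSortSketch {

        // Partition a[lo..hi] by one bit: zero-bits left, one-bits right,
        // by exchange only--no pivot element--then recurse on each partition.
        static void sort(int[] a, int lo, int hi, int bit) {
            if (lo >= hi || bit < 0) return;
            int i = lo, j = hi;
            while (i <= j) {
                if (((a[i] >> bit) & 1) == 0) {
                    i++;                                  // bit is 0: stays left
                } else {
                    int t = a[i]; a[i] = a[j]; a[j] = t;  // bit is 1: exchange right
                    j--;
                }
            }
            sort(a, lo, j, bit - 1);                      // zero-bit sub-array
            sort(a, i, hi, bit - 1);                      // one-bit sub-array
        }

        public static void main(String[] args) {
            int[] a = {5, 3, 7, 1, 6, 2, 4, 0};
            sort(a, 0, a.length - 1, 2);                  // 3-bit keys: bits 2..0
            System.out.println(java.util.Arrays.toString(a));  // [0, 1, ..., 7]
        }
    }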

The binar shuffle scrambles data by using a bit extraction function, but to randomly move and exchange data elements. One interesting property of the binar shuffle is reversibility--the binar shuffle is invertible. The final permutation can be returned to the original permutation by an inverse function (something I need to add, along with a basic proof, to the research monograph).
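
The reversibility is easy to demonstrate. A minimal sketch--an illustration, not the binar shuffle itself: permute by XOR-ing each index with a fixed bit pattern, a bijection on [0, N) when N is a power of two; since XOR is its own inverse, running the same pass again restores the original permutation:

    import java.util.Arrays;

    public final class ShuffleInverseSketch {

        // XOR each index with a fixed key; a bijection that is self-inverse.
        static int[] permute(int[] a, int key) {
            int[] out = new int[a.length];
            for (int i = 0; i < a.length; i++) {
                out[i ^ key] = a[i];          // move element i to index i XOR key
            }
            return out;
        }

        public static void main(String[] args) {
            int[] a = {10, 20, 30, 40, 50, 60, 70, 80};   // N = 8, a power of two
            int[] shuffled = permute(a, 5);               // scramble
            int[] restored = permute(shuffled, 5);        // same pass inverts it
            System.out.println(Arrays.toString(shuffled));
            System.out.println(Arrays.equals(a, restored)); // true: invertible
        }
    }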

Binar Search

The simplest search is to use the binar sort to organize the data elements into an ordered permutation, and then use a binary search to find an element. The Big-Oh for such an algorithm is O(c*N) + O(M*lg N), where M is the number of searches and N is the number of elements.

Consider the following values:

  • 4096-search operations, M=4096 or 2^12
  • 65536-elements, N=65536 or 2^16, lg N = 16 = 2^4

  • Complexity: O(c*2^16) + O(2^12 * 2^4) = O(c*2^16) + O(2^16), which is at most O(2*c*2^16), or 2*c*2^16, or c*2^17

Thus the binar sort is efficient, but for a sufficiently large data set and many searches, the complexity of the binar sort-binary search combination approaches that of sorting the data elements twice. If the value of M increases beyond the value of N (which is logical to presume...more searches on the data set once it is ordered), the binary search cost becomes linearithmic when M >= N, at O(c*N) + O(N*lg N).
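
A sketch of the sort-once, search-many pipeline just analyzed (Arrays.sort stands in for the binar sort here; the cost structure is one sorting pass plus M searches of lg N probes each):

    import java.util.Arrays;
    import java.util.Random;

    public final class SortThenSearch {
        public static void main(String[] args) {
            final int N = 1 << 16;                 // 65536 elements
            final int M = 1 << 12;                 // 4096 searches, M * lg N = N
            Random rng = new Random(42);
            int[] data = new int[N];
            for (int i = 0; i < N; i++) data[i] = rng.nextInt();

            Arrays.sort(data);                     // the single sorting pass

            int found = 0;
            for (int i = 0; i < M; i++) {          // M searches, lg N = 16 probes each
                if (Arrays.binarySearch(data, rng.nextInt()) >= 0) found++;
            }
            System.out.println(found + " of " + M + " probes hit");
        }
    }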

Variation of Shuffle and Sort

The binar search is a variation upon the binar sort and binar shuffle. Only the bit extraction function is changed--instead of creating an ordered or random permutation, the bit function maps data elements, creating an index and organizing the data elements accordingly. The array of data elements is initially unorganized, in a random permutation. The binar search arranges the elements by an index computed by the bit extraction function, positioning each element by that index--in effect, hashing the elements into a hash table.
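
A minimal sketch of the mapping step, assuming the bit extraction function simply takes the top B bits of a key as the table index (the real function would be designed to fit the data's distribution):

    import java.util.ArrayList;
    import java.util.List;

    public final class BinarMapSketch {
        static final int B = 4;                        // 2^4 = 16 partitions
        static final int SHIFT = Integer.SIZE - B;

        static int extract(int key) {                  // the bit extraction function
            return key >>> SHIFT;                      // top B bits as an index
        }

        public static void main(String[] args) {
            List<List<Integer>> table = new ArrayList<>();
            for (int i = 0; i < (1 << B); i++) table.add(new ArrayList<>());

            int[] data = {0x10000000, 0x7FFFFFFF, 0x20000001, 0x11111111};
            for (int d : data) table.get(extract(d)).add(d);  // hash into the table

            for (int i = 0; i < table.size(); i++)
                if (!table.get(i).isEmpty())
                    System.out.println("partition " + i + ": " + table.get(i));
        }
    }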

Search and Sort

This might seem unusual, but the proxmap sort by Standish likewise has a proxmap search. The binar sort simply puts data elements into an ordered permutation, where the organization depends upon the predecessor and successor elements in the permutation. The binar search first maps the data elements, and then each data element is accessed or searched for after the mapping.

The complexity of searching involves:

  1. index data, pre-process, create permutation - O(N)
  2. search data, use search function - constant O(c)

For the number of searches M, there are two possibilities:

  1. O(c*N) + O(M*c) when M < N
  2. O(c*N) + O(c*M), which is O(c*(N+M)) when M >= N (still linear; it doesn't become linearithmic)

Searching

A search after mapping the data in an array can use the same bit extraction function to find data elements. With the right design of the bit extraction function, data elements related by the mapping function and mapped into a partition are accessible together. Instead of the complete bit extraction function at a constant time c, a lesser constant k < c suffices to find a data element, or a group of data elements.

Thus the complexity when M < N is O(c*N) + O(k*M) for a search, or O(c*N) + O(k*N) when M >= N. More simply, O((c+k)*N) for a search function that does not require the entire bit extraction function to find an element.

Bit Extraction Function

The core of a binar search is the bit extraction function: map a data element to an index, then use the bit extraction function again to access an element or elements. In effect, a search is an insert--but instead of putting an element into the array, it is a virtual insert to verify whether an element is in the array.
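
As a sketch of search-as-virtual-insert (hypothetical names; collision handling omitted), both operations run the same bit extraction to locate the slot--only the last step differs:

    public final class VirtualInsertSketch {
        private final int[] table = new int[16];       // 0 marks an empty slot

        private static int extract(int key) {          // the shared bit extraction
            return key >>> 28;                         // top 4 bits -> slot index
        }

        void insert(int key) {                         // the real insert: store
            table[extract(key)] = key;
        }

        boolean contains(int key) {                    // the virtual insert: verify
            return table[extract(key)] == key;
        }

        public static void main(String[] args) {
            VirtualInsertSketch s = new VirtualInsertSketch();
            s.insert(0x7ABC0001);
            System.out.println(s.contains(0x7ABC0001)); // true
            System.out.println(s.contains(0x10000002)); // false
        }
    }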

Number of Permutations

For the number of permutations, each operation produces a characteristic count of possible permutations:

  1. Sort - create ordered permutation (2 permutations) only ascending/descending
  2. Shuffle - create random permutation ((N!-2) permutations) excluding the sorted 2 permutations
  3. Search - create indexed permutation (N! permutations) any possible permutation

The ranking of permutation counts is 2 < N!-2 < N!, which is an interesting relationship among the operations. Considered as permutation rankings: sort < shuffle < search.

Binar Search--the Missing Algorithm

The binar sort algorithm was the first binar algorithm, but with the right bit extraction function it could also unsort, or shuffle--the second binar algorithm. The third algorithm is the binar search, which uses the bit extraction function to take an array of elements and map them into a particular permutation that can be searched after processing. Hence searching, shuffling, and sorting are the same algorithm, a binar algorithm--just change or modify the bit extraction function.

Someday I'll have to write a research monograph about the binar search, just as I have for the binar sort and the binar shuffle. Until that day then...someday.

Wednesday, February 4, 2015

Whither SEQu-RAmA, Quo Vadis???

SEQu-RAmA has the goal of aggregation and then identification of the various types of results for a search query. SEQu-RAmA works, but historically it is like Henry Ford's Quadricycle, or the Wright Brothers' Flyer--a prototype to determine if the concept is feasible, and to experience some of the difficulties in design/implementation/test.

Hence there are improvements to be made to increase SEQu-RAmA's efficiency, but it works for me, making my searching more effective. This leaves the question: what next for SEQu-RAmA?

I will always tinker with and improve the source code, especially when inspired. But for now it works fine, with a trade-off of speed against organizing links by type and status code with some color coding--think of ROY G. BIV with grey/black colored links.

One decision was to limit the number of links returned--or SEQu-RAmA would never stop, processing on and on for hours. I'm tweaking that parameter, currently about 100 links from each search engine queried. With 20 search engines that is 2000 links or less, but I asked myself: would I really click on all the thousands of references from a server? I might make the number of links a parameter set in the query page, or simply find the right balance between results and the number of links.

Faster, Faster

One performance area was checking for redundant links before processing them. Originally I used a treap--a tree-heap with randomization to keep the heap property--where performance is logarithmic, O(lg N). But constant performance O(c) seemed better--hence a hash table. I don't perform any threading mutex or locking--if thread 1 is inserting a link into the hash table and gets preempted by thread 2, which inserts the same link, then thread 1 will not process the link, while thread 2 does. One thread or the other processes the link and puts it into the resultant data structure, and the other does not.
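
A minimal sketch of that first-inserter-wins check, assuming Java's ConcurrentHashMap (which synchronizes internally, so no explicit mutex appears in the code): putIfAbsent is atomic, so exactly one thread sees null and processes the link, whatever the preemption order:

    import java.util.concurrent.ConcurrentHashMap;

    public final class LinkDedupe {
        private final ConcurrentHashMap<String, Boolean> seen = new ConcurrentHashMap<>();

        void handle(String url) {
            if (seen.putIfAbsent(url, Boolean.TRUE) == null) {
                process(url);                       // only the first inserter gets here
            }                                       // any later thread skips the duplicate
        }

        private void process(String url) {
            System.out.println("processing " + url);
        }

        public static void main(String[] args) {
            LinkDedupe d = new LinkDedupe();
            d.handle("http://example.com/a");
            d.handle("http://example.com/a");       // duplicate: silently skipped
        }
    }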

The last improvement uses an algorithm I created some time ago (with controversy--one fellow told me about his two PhDs, and his best paper of the last 25 years...and that the algorithm was 'impossible'...a classic mapper vs. packer clash of mental models): the hash sort. Sorting by hashing? Oh yes, although in another twist, I asked NIST to include the hash sort algorithm in its list of algorithms and data structures. The person responded by e-mail with "it's not a real sorting algorithm"...so now there is a gatekeeper for what is "real" as an algorithm--whatever that means...*shrug* As Mark Twain once said, "Don't let education get in the way of your learning."

But it was not the hash sort specifically that helped, it was the data structure--a matrix that is useful for storing links in an ordered way, where each server thread can hash a link and determine if it is already processed. Two operations--hashing to check if the link is already there, and storing it in an organized, ordered way--both in O(c) constant time.
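
A rough sketch of those two operations, assuming the hash splits into a row index and a column index (the actual hash sort structure and hash function are in the hash sort monograph, not here):

    public final class LinkMatrix {
        private static final int ROWS = 1 << 10, COLS = 1 << 10;
        private final String[][] cell = new String[ROWS][COLS];

        // Hash to check if the link is already there; if not, store it in
        // its hashed position--both operations in O(c) constant time.
        boolean checkAndStore(String url) {
            int h = url.hashCode() & 0xFFFFF;              // 20 bits of hash
            int row = h >>> 10, col = h & (COLS - 1);      // split: 10 bits each
            if (url.equals(cell[row][col])) return false;  // already processed
            cell[row][col] = url;
            return true;
        }

        public static void main(String[] args) {
            LinkMatrix m = new LinkMatrix();
            System.out.println(m.checkAndStore("http://example.com")); // true
            System.out.println(m.checkAndStore("http://example.com")); // false
        }
    }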

The Fewer, the Higher

One potential optimization is to give a fixed number of results, but continue processing. The results page would show the links processed thus far, but also include a link to get more results. With 20 servers returning 100 links each, the user does not wait for the 2000 links, but gets 50 links while the SEQu-RAmA server processes links in the background. Perhaps the number of initial links is a parameter set by the user--fewer initial links is faster, but means more clicking on the link to get more links in the resultant.
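
A sketch of the fixed-first-page idea (names hypothetical): server threads keep feeding a queue while the results page drains only the first batch; the "more results" link drains the next:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public final class IncrementalResults {
        static final int PAGE = 50;                        // links per results page
        final BlockingQueue<String> links = new LinkedBlockingQueue<>();

        List<String> nextPage() throws InterruptedException {
            List<String> page = new ArrayList<>(PAGE);
            while (page.size() < PAGE) {
                String link = links.poll(250, TimeUnit.MILLISECONDS);
                if (link == null) break;                   // fetches still running
                page.add(link);
            }
            return page;                                   // shown while work continues
        }

        public static void main(String[] args) throws InterruptedException {
            IncrementalResults r = new IncrementalResults();
            for (int i = 0; i < 120; i++) r.links.add("http://example.com/" + i);
            System.out.println(r.nextPage().size());       // 50: first batch
            System.out.println(r.nextPage().size());       // 50: the "more" click
        }
    }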

Where To Now?

Now a remaining question is what to do with SEQu-RAmA, besides using it when I'm doing research on a particular topic and looking for PDFs, image files, or web pages.

It seems I'm at a fork in the road travelled in creating, tweaking, and using SEQu-RAmA. One fork is open-source, the other is closed-source. I have some personal experience with open-source from other projects in the past--a plugin for JBuilder, and an XML tokenizer.

Open source is a team effort, where the team is out there, somewhere on the Internet. Others take the open-source code, improve it, adapt it, and return the resultant improved source code...a team effort.

I had this experience with the plugin (the plugin took a "snapshot" of a project and stored it in a ZIP file...very useful when loading the project to work on it, and before exiting JBuilder when done coding). In this case the open-source was win-win, and helped make the JBuilder experience better.

My other experience was with an XML tokenizer that I wrote (in some instances, why use an XML parser if you only want to do some processing of an XML document?). I put it online as open source, and people used it.

When I moved the source code from the initial website, I received many e-mails, some bordering on anger at my actions. No improvements came back--only users of the software library, with me, the author, beholden to them.

So open-source was not effective, and I had dozens of freeloading customers using my product that I had to support and explain. When I posted on an XML news group, one of the original creators of XML said he was nervous that the XML tokenizer did not validate the structure of a tokenized XML document. No duh!...it is a tokenizer, a lexer/scanner in compiler parlance. Adding a validation feature is simple enough, but it turns the simple-but-fast tokenizer into a crude parser. So not only angry freebie customers, but a rebuke from an XML luminary.

The distinction between the two experiences is that one had a team of contributors who were also users improving the open-source software, while the other had users but no improvement.

Capitalism

I am a capitalist, and no pay, no way. I could use a standard proprietary license and sell SEQu-RAmA myself to others. The only difference is that paying users would still be angry customers, now with a financial string tying them to SEQu-RAmA and myself. I'd still be making the improvements and fixing software glitches or adding new features. The other fork is thus the same as open-source, only with a proprietary license and license fees in the equation.

The important criterion is that most users want to use SEQu-RAmA, not tinker with the source code. SEQu-RAmA is a software app that a paying user wants to install and use. SEQu-RAmA itself is a mix of web server and multi-threaded code that a web browser accesses. But open-sourcing it and giving it away would be giving the big search engines access to the software technology without any improvement or optimization returned.

Released either way, once the source code is out on the Internet, there is no way of going back...the Internet is a STORM--STore Online Read Many.

Silicon Venture Valley

It is ironic that I am not in the Computer Valley (a.k.a. Silicon Valley) as Robert X. Cringely calls it in his magnum opus "Accidental Empires: How the Boys of Silicon Valley Make Their Millions, Battle Foreign Competition, and Still Can't Get a Date".

SEQu-RAmA is a meta-search or super-search engine. I have the server on my MacBook and a desktop PC, but a user wants to search the search engines with only a browser. Thus, in Silicon Valley: find a backer, and use a hosting service to have many instances of SEQu-RAmA running, responding to user queries. SEQu-RAmA is a web proxy that connects to many search engines. A user does not want to install and run a server on their computer; they want to simply search across many systems. Since the search goes through a web proxy, the search engine queried does not know the user information--browser, screen size, IP address, and so forth.

The profit potential is similar to that of search engines: advertising based upon the query given. Other profit potential is simply in tracking link clicks relating to a query. If a user returns and clicks on another link, then the discarded link is not as significant as the other; a time value for how long a user stays on a link before returning to click another could be useful information for tracking which links from a server are most significant. Other possibilities are specialized searches for books, autos, etcetera, but using SEQu-RAmA to search many other search engines.

Going Elastic

There is an ironic if strange similarity to a project, Elasticsearch, that is a JSON search engine queried by HTTP. Elasticsearch stores documents which are then queried. SEQu-RAmA does not store files searched from the Internet; it retrieves, organizes, and then presents the web sites/pages with an indication of status--efficient integration of information.

The irony is that in Silicon Valley one could get funding for the SEQu-RAmA server, and possibly release an open-source version that could be used by enterprises--connecting all company sites in a single search for information within an organization. A university with different branch sites could be unified in searching for various academic, administrivia, and other materials. One major goal would be to set up on a leased server host for web search, and then make income from ads, collecting browsing data for subjects, among other things. Another goal would be optimization, making SEQu-RAmA faster...the major trade-off is accuracy for speed. On a server cluster, caching might allow for data mining, retaining popular web-wide search queries--in essence the users become the key in search.

Where To?

So where is SEQu-RAmA going? It will remain a search tool I utilize for more efficient search across multiple servers. I'll tweak and improve SEQu-RAmA--for example, when a link is dead or unavailable, use the link to the cache on the search server, or perhaps even the Wayback Machine, or maybe Mr. Peabody? Another possible improvement is to automatically rank search engines by returned links but also by speed. If SEQu-RAmA, based on the geographic location of the query, has to take longer to access an international web server, it might automatically adjust the timeout for that server to respond to the query.

For now SEQu-RAmA works for me, so here's to search--SEQu-RAmA for you (mostly me right now though...)