Thursday, November 13, 2014

Treapy Sort--Getting a Bit Treapy with a Treap

Treap

A treap is a randomized data structure that is a hybrid of a heap and a binary search tree, created by Cecilia R. Aragon and Raimund Seidel in 1989. What is fascinating is the use of a random priority to maintain the heap structure, or heap property. Randomness is used to bring order and structure. As Mr. Spock would say, "Fascinating..."

The basic operations of access, insert, and remove all have an expected cost of O(lg N). Hence a treap holding 2**32 = 4,294,967,296 (about 4 billion) elements requires lg(4294967296) = lg(2**32) = 32 operations per basic operation. A treap, while a hybrid of a tree and a heap, is thus very efficient in performance time.
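
As a minimal Java sketch (the class and method names are my own illustration, not the post's code), a treap node carries a key and a random priority, and insertion is a binary-search-tree insert followed by rotations to restore the heap property:

    import java.util.Random;

    // Minimal treap sketch: BST ordering on key, max-heap ordering on a random priority.
    class Treap {
        static final Random RNG = new Random();

        static class Node {
            int key;
            double priority;            // random priority maintains the heap property
            Node left, right;
            Node(int key) { this.key = key; this.priority = RNG.nextDouble(); }
        }

        Node root;

        void insert(int key) { root = insert(root, key); }

        private Node insert(Node node, int key) {
            if (node == null) return new Node(key);
            if (key < node.key) {
                node.left = insert(node.left, key);
                if (node.left.priority > node.priority) node = rotateRight(node);
            } else {
                node.right = insert(node.right, key);
                if (node.right.priority > node.priority) node = rotateLeft(node);
            }
            return node;
        }

        // Rotations preserve the BST order while moving the higher-priority child up.
        private Node rotateRight(Node y) {
            Node x = y.left;  y.left = x.right;  x.right = y;  return x;
        }

        private Node rotateLeft(Node x) {
            Node y = x.right;  x.right = y.left;  y.left = x;  return y;
        }
    }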

Infimum and Supremum

One operation on a treap is to get the minimal or maximal element--the mathematical terms are infimum and supremum. I use the term extrema to mean either the maximum or minimum element.

Not all treap implementations have extrema methods such as getMin() or getMax(), but such methods are necessary for the treapy sort.
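
A sketch of those extrema methods, added to the Treap class sketched above (the method names getMin() and getMax() follow the post; the walk itself is the standard one): the minimum is the leftmost node and the maximum the rightmost, at expected O(lg N) cost.

    // Extrema methods: minimum is the leftmost node, maximum the rightmost.
    Integer getMin() {
        if (root == null) return null;
        Node n = root;
        while (n.left != null) n = n.left;
        return n.key;
    }

    Integer getMax() {
        if (root == null) return null;
        Node n = root;
        while (n.right != null) n = n.right;
        return n.key;
    }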

Sorting by Treap

The treap as a method of sorting is discussed in A Modular Calculus for the Average Cost of Data Structuring by Michel Schellekens (2008). Using the extrema methods, the sorting process is simply to getMin() or getMax() repeatedly and delete each extracted element.

But after deletion, the treap must rebalance its elements to maintain the heap property. Hence deletion carries the expense of rebalancing the treap.

Wikipedia explains, "To delete a node x from the treap, if x is a leaf of the tree, simply remove it. If x has a single child z, remove x from the tree and make z be the child of the parent of x (or make z the root of the tree if x had no parent). Finally, if x has two children, swap its position in the tree with the position of its immediate successor z in the sorted order, resulting in one of the previous cases. In this final case, the swap may violate the heap-ordering property for z, so additional rotations may need to be performed to restore this property."

Getting Treapy

The treapy sort works by getting the extreme element, but avoids the cost of rebalancing the treap--however small that Big-Oh cost--while preserving the structure of the treap as it sorts. How does it do this? That's why it's not a treap sort but a treapy sort.

One important feature of the treapy sort (and likewise the treap sort) is handling duplicate elements: each node in the treap maintains a count of duplicates (which also simplifies treap operations overall). Hence for a list of N elements the space is O(M), where M <= N is the number of distinct values.
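
A sketch of the duplicate-count idea, assuming the Node class from the earlier sketch gains an "int count = 1" field: inserting an existing key only increments that count, so the treap holds M nodes for M distinct values.

    // Variation on insert: a duplicate key increments a per-node count instead of adding a node.
    private Node insert(Node node, int key) {
        if (node == null) return new Node(key);           // new node starts with count = 1
        if (key == node.key) {
            node.count++;                                 // duplicate: no new node, no rotation
            return node;
        }
        if (key < node.key) {
            node.left = insert(node.left, key);
            if (node.left.priority > node.priority) node = rotateRight(node);
        } else {
            node.right = insert(node.right, key);
            if (node.right.priority > node.priority) node = rotateLeft(node);
        }
        return node;
    }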

The treapy sort is linearithmic, or O(N lg N).

For a list of data elements 2**32 = 4,294,967,296 the Big-Oh time is O(2**32 lg(2**32)) which simplifies to O(2**32 * 32) or O(2**32 * 2**5) or O(2**37).

One feature is that if the elements are already in some partial ordering, or even totally sorted, the treapy sort's performance is unchanged--it remains linearithmic.

The treapy sort is akin to the heap sort, but uses a treap, a hybrid of a heap and tree. One thesis compares the heap sort to the treap sort in greater detail. Its abstract states, "It was discovered that treap sort is much more faster than heap sort in sorting and searching for elements."

Source Code for Treapy Sort

The idea for the treapy sort came to me while examining diagrams, figures, and code for a treap. One of the best things about being a computer scientist is that to test an idea is to code, execute, test, and refine or optimize. I've implemented the treapy sort in Java to see that it works.

Writing a research monograph, and doing a performance comparison against other linearithmic sorting algorithms--merge sort, quick sort, heap sort--is an experiment for the future. Perhaps even a comparison with a treap sort that pays the cost of deletion and rebalancing.

What is fascinating is that randomness is used to maintain the heap property, and yet order emerges...the ordering of the sort.

I *might* post the Java source code, depending upon the level of interest. I *might* write a research monograph, again depending upon the level of interest. The downside of being a computer scientist is that publishing results is involved--it requires writing a research monograph--whereas writing source code is simpler and quicker.

Advantages of Treapy Sort

The primary advantages of the treapy sort are:

  1. The treapy sort maintains the treap structure (hence immutability).
  2. There is no need for the expense of deleting elements and rebalancing the treap.
  3. Linearithmic performance time, independent of the source data.

Another advantage is that if a treap is already used in another application (for example a compiler symbol table, or--as I considered--in SEQu-RAmA), the treapy sort is useful to extract all the data in ascending or descending sorted order--for example, as a debug operation, or simply to dump the information in the treap data structure.

Wednesday, October 29, 2014

Data, Data Everywhere But Not a Link to Click

The title of the web log post is an allusion to "The Rime of the Ancient Mariner."

Yes, me and SEQu-RAmA are still here!

One problem with SEQu-RAmA as a meta-search or super search engine is that a group of search engines returns a flood of hyperlinks of different media types (html, pdf, xml) that might be valid or invalid. Since one function is to check and identify hyperlinks (for example, determine whether a link is valid or missing), taking all the raw results returned and processing them efficiently is critical.
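
One way to check a hyperlink (a sketch, not the SEQu-RAmA code; the 999 "unknown" convention follows the post below) is an HTTP HEAD request, falling back to 999 when the status cannot be determined:

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Returns the HTTP status code for a link, or 999 (ZZZZ) if the status cannot be verified.
    class LinkChecker {
        static int linkStatus(String link) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
                conn.setRequestMethod("HEAD");      // status only, no body download
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                return conn.getResponseCode();      // e.g. 200, 404
            } catch (Exception e) {
                return 999;                         // unknown: could not verify the link
            }
        }
    }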

Hyperlink Processing

One approach is for each search thread (each search engine gets its own thread) to process the links and create the necessary entries in a common data structure. It works, but it is woefully inefficient and slow--it reminds me of downloading files back in the days of dial-up with a 56K modem.

Consider N search engines that do a search, and then process the results. The raw time to process a hyperlink is Tp. The same link is returned by M search engines (where M <= N), but each search thread processes that same link. Hence the same hyperlink is processed M times redundantly, wasting roughly (M - 1) x Tp seconds, since only one processing pass is needed.

Very inefficient although it works. Another aspect is determining the rank or frequency of a hyperlink, the rank taking advantage of the shared information from all the search engines.

Datum Ranking

Duplicate hyperlinks are inefficient for each search thread to process, but if many different web search engines return the same hyperlink, that is a meaningful commonality. The question arises of whether different search engines use the same web indexing algorithm, but various search engines are like automobiles.

Automobiles are essentially the same, but different automakers have different design, engineering, and implementation approaches. A look at the many "top 25 best automobiles ever" and "top 25 worst automobiles ever" lists shows an extreme contrast of features and properties. The contrast illustrates that while doing the same thing--moving people from point A to point B--various automobiles were very different.

A safety precaution is to analyze the link results from each search engine queried, and if too many are duplicates of the overall results beyond some threshold, that search engine might be excluded. One possibility is searching the "metacrawler" engines, which search existing search engines and thus _will_ duplicate results from the search engines queried individually.

Still, the emphasis is on the collective results giving a rank to a hyperlink as a collective property, not an individual property. Individual properties include type (html, pdf, xml, zzzz), status (202, 404, 999), and hyperlink text using string ordering (www.fubar.org < zzz.asleep-at-the-switch.info).

This collective ranking is not unlike auctioneering in embedded/control systems, or multiple processors on a space probe that “vote” to determine overall processor operation.

Hyperlink Efficient Processing

While the original duplicated effort works, an important skill a software engineer/computer scientist must have is to approach a problem from many different perspectives.

My original approach was to create a micro-database as a common data store for all the returned hyperlinks from the various search engines. The problem is that a micro-database is overkill in functionality; the SEQu-RAmA results do not need database functionality.

But it is the start of the right idea. I liken the approach to Futurama, and the "master in pile" at the Central Bureaucracy in the episode "How Hermes Requisitioned His Groove Back."

The results do not need database flexibility; the store clusters results by type and orders hyperlink text alphabetically. Another important feature is that the common data store--the "master in pile"--must compute a rank for duplicate links.

The final approach is what I term a "cardinal map" data structure: a map that stores a node consisting of the type, status code, and hyperlink text, but adds a count or coefficient.

The cardinal map is used primarily for inserting data, with two possible cases. The first case is a node not yet in the cardinal map, which is simply inserted with its count equal to 1. The second case is a node whose data is already in the cardinal map; instead of inserting the node, the existing node is retrieved and its count incremented.

The cardinal map functions like an ordered set, avoiding duplicates and storing the nodes ordered by their data. The cardinal map is akin to the symbol table in a compiler: not just one data structure but a composition. A compiler symbol table allows the compiler to access the information a specific programming language needs to enforce its rules.
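
A minimal sketch of the cardinal map (the class and field names are mine for illustration): an ordered map keyed on the hyperlink text, where inserting an existing link simply increments its count.

    import java.util.Map;
    import java.util.TreeMap;

    // Cardinal map sketch: ordered by hyperlink text, with a count instead of duplicate entries.
    class CardinalMap {
        static class Entry {
            final String type;      // html, pdf, xml, ...
            final int status;       // 202, 404, 999, ...
            final String link;
            int count = 1;          // how many search engines returned this link
            Entry(String type, int status, String link) {
                this.type = type; this.status = status; this.link = link;
            }
        }

        private final Map<String, Entry> map = new TreeMap<>();

        // Case 1: not present, insert with count = 1.  Case 2: present, increment the count.
        void insert(String type, int status, String link) {
            Entry existing = map.get(link);
            if (existing == null) {
                map.put(link, new Entry(type, status, link));
            } else {
                existing.count++;
            }
        }

        Map<String, Entry> entries() { return map; }
    }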

Results Organization

The results are clustered together by type, color coded by the status of the link, and placed in alphabetic order by the hyperlink text. The rank is used to order the links first, and then the alphabetic text.

Embedded links let the user click and jump to a cluster of links by type, arranged by rank and hyperlink text. The color-coded link then displays the status of the link.
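
A sketch of the ordering within a type cluster, assuming the CardinalMap.Entry sketch above: rank (the count) descending first, ties broken alphabetically by hyperlink text.

    import java.util.Comparator;
    import java.util.List;

    // Order a cluster of links: higher rank first, ties broken alphabetically by link text.
    class ResultOrdering {
        static void orderCluster(List<CardinalMap.Entry> cluster) {
            cluster.sort(Comparator
                    .comparingInt((CardinalMap.Entry e) -> e.count).reversed()
                    .thenComparing(e -> e.link));
        }
    }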

Two Special Cases in Results Organization

There are two special cases (or at least the two most obvious) for the HTTP status code. The two status codes are:

  1. HTTP Not Found (404) - web page not found at the hyperlink
  2. HTTP ZZZZ (999) - special case of unknown link status

An obvious question is, "Why bother to include links that are not found and those that are unknown in status?"

HTTP 404 Not Found

The status code of HTTP 404 seems a dead end. After all, why click on a hyperlink that is not found at the site?

For a 404 Not Found link, substitute a cached copy from the search engine, or from the Wayback Machine, if one is available.

ZZZZ 999 Unknown

The custom status code of ZZZZ 999 seems like an HTTP 404 status code, but there is a significant difference. HTTP 404 means the web page was not found--it does not exist--whereas ZZZZ 999 means the status is unknown: the HTTP status code of the link could not be verified.

999 ZZZZ is the special, unknown case, so take a chance by visiting the hyperlink and/or including a link to a cached copy...something like a 40404.

Color Coding of Hyperlinks

The color coding of the hyperlinks is a matter of choosing a color for each link status. One part is aesthetics: choosing colors that are not harsh on the eyes. The other part is color logic, for example black for a status code of "web page not found" (404) and grey for a status code of "unknown status" (999).

The choice of color is like the color coding of resistors in electronics; the difficult question is which colors to assign to which HTTP status codes.

For example, black for a dead HTTP 404 link gives forewarning, and perhaps grey serves as a warning for the 999 ZZZZ unknown status. But the question remains of what color to use for other HTTP status codes...
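
One illustrative status-to-color table (the colors here are placeholders, not a decision on the open question above), with grey as the fallback warning for the unknown 999 case:

    import java.util.Map;

    // Illustrative status-to-color mapping; 404 is black (dead link), 999 is grey (unknown).
    class LinkColors {
        static final Map<Integer, String> COLORS = Map.of(
                200, "green",       // page found
                404, "black",       // dead link: forewarning
                999, "grey"         // ZZZZ unknown: warning
        );

        static String colorFor(int status) {
            return COLORS.getOrDefault(status, "blue");   // placeholder for other status codes
        }
    }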

Saturday, June 28, 2014

May Your Filters Bloom

I'm rewriting SEQuRAmA for cleaner code in version 2.0; version 1.0 was a prototype to see if a super search engine was possible. (But how often in software, the pressure to take the first prototype and continue to use it...) One goal is to organize search, aggregation, and reporting into self-contained, i.e. encapsulated, modules, and to keep the multi-threaded search on each search engine self-contained and mutually independent.

One method for efficiency is to avoid repeated links from the various search engines--repeated when the links are aggregated into the final result set of links. But the result set is a set--it avoids duplicates by checking whether a link already exists.

I initially tried a Bloom filter, a probabilistic hash structure that reports the possibility of an element being in a set. For N search engines queried by the SEQuRAmA engine, there is a 1/N probability that a link is already in the resultant set. Part of my initial motivation was Google's BigTable use of a Bloom filter.

After implementing it, I realized there was one problem. A Bloom filter can give a false positive for a link, but never a false negative. Hence a link that is not in the set of results might still be reported, probabilistically, as being in the results.
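
A small sketch of that behavior, assuming Google Guava's BloomFilter (not the code SEQuRAmA actually used): mightContain() never returns false for a link that was put(), but it may return true for a link that was never inserted.

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    // Bloom filter sketch: no false negatives, but possible false positives.
    public class BloomDemo {
        public static void main(String[] args) {
            BloomFilter<String> seen = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000, 0.01);

            seen.put("http://www.fubar.org/page.html");

            // Always true for a link that was inserted (no false negatives).
            System.out.println(seen.mightContain("http://www.fubar.org/page.html"));

            // Usually false, but may be true even though this link was never inserted;
            // such a false positive would wrongly exclude a new link from the result set.
            System.out.println(seen.mightContain("http://zzz.asleep-at-the-switch.info/"));
        }
    }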

If this were reversed--no false positives, but possible false negatives--a negative would indicate the link is not in the set, though that could lead to duplicated effort for repeated links.

While a Bloom filter has possibilities (no pun intended), it is not the best data structure here, even with a 1/N probability of a link being in the result set. My thinking was guided by probabilities, but the result set requires "crisp" logic, not probabilistic logic.

I've been looking at the code and theoretical features of other data structures for use in the final result set. I like to think of the SEQuRAmA result set as akin to the Futurama Central Bureaucracy "Master Inpile" that needs to be collated and filed. With a Bloom filter, the pile of links would be a probability, not a certainty, and a link could be excluded because it was only probabilistically in the result set.

Another attraction of the Bloom filter was performance: an efficient hash (and I like hashing...one reason I created a sorting algorithm to sort by hashing--The Hash Sort), so it would process links very quickly.

Fortunately this phase of link processing is modular, so changing from a Bloom filter to data structure "X" is simple, without any coupling to other functionality.

I'll have to evaluate the next data structure, often against various alternatives, before deciding upon the final one to use. When in doubt, test it, and then try it. But without trying, the limits of a Bloom filter would not have surfaced. Good data structure, but incompatible with the functionality of the SEQuRAmA engine. To err is human, to repeat it is...DOH! (with an allusion to Homer Simpson).

Thursday, May 15, 2014

Seek, Locate, SEQuRAmA: Weave a Search Web by Super-Search Engine

SEQuRAmA

I've continued tweaking and adding search-within-search features to SEQuRAmA--the Search Engine QUery Result AccuMulator Aggregator. In doing so, I organized the various features under five categories.

Five Categories

The five categories are:
  1. verification
  2. presentation
  3. operation
  4. optimization
  5. improvements

List of Features by Category

Some of the features I have implemented, others are on the "to-do" list of software improvements. But by organizing the features into categories, it is easier to prioritize. Some features lead to other features: implementing one facilitates another, or, more simply, without one feature already working, another cannot easily be implemented. (Never say the word impossible in software--famous last words of many who uttered it...)

The Features of SEQuRAmA

    Verification:

    1. Verify link media type (xml, pdf, html)
    2. Verify link exists

    Presentation:

    1. Color code link for reliability (exists, unknown)
    2. Cluster links together by type (xml, pdf, html, gif)
    3. Cluster similar links within domain
    4. Internal link to each cluster for easy navigation

    Operation:

    1. Cookie control from search engines (delete, store)
    2. No search engine advertisements in output results
    3. Query multiple search engines in tandem

    Optimization:

    1. Permutations on search keywords--for "Richard M. Nixon", variants such as "Richard Nixon" and "Nixon"
    2. Rank resultant link by commonality--search 12-search engines 9/12 = 0.75 rank
    3. Connect a private, internal resource (such as an internal employee website portal) to an external, publicly accessible resource

    Improvements:

    1. For non-existent link, replace with "The Wayback Machine" link if available, or search engine cache, or both
    2. Using most ranked results, tweak search using content in web pages (for example ID3 algorithm to classify web page)
    3. Determine other search keywords from words/markup on web pages (from Nixon, get Watergate, trip to China)
    4. Store advertisements returned as separate results accessed outside primary results
    5. Use a presentation template that specifies how to organize and structure the results

Modular, Multi-threaded, Multi-Class

One important implementation and design consideration is avoiding a monolithic block of Java source code. Each search is its own class, and a thread, implementing an interface, so that each search engine is a module. One change I've considered is dynamically loading a bytecode .class file, and then unloading it if the search engine is down or unresponsive within a wall-clock time threshold. Other functionality accesses the raw results, stored in an ordered map (or multi-map, since the resultant data is stored within nested ordered maps). The original "store" (using a term from Babbage's Analytical Engine) used the standard Java data structures, but for more performance (the need for which creeps up as I add more search engine modules...) I used open-source code, and implemented a more specialized interface for specific functionality rather than general-purpose operation.
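
The dynamic loading idea might look like this sketch (the directory and class name are hypothetical, not SEQuRAmA's actual layout), using a URLClassLoader to bring in a search-engine module's bytecode at runtime:

    import java.net.URL;
    import java.net.URLClassLoader;

    // Hypothetical sketch: load a search-engine module's .class file at runtime, so an
    // unresponsive engine can be left unloaded without rebuilding the whole server.
    class ModuleLoader {
        static Runnable loadEngine(String className) throws Exception {
            URLClassLoader loader = URLClassLoader.newInstance(
                    new URL[] { new URL("file:./modules/") });   // directory of compiled modules
            Class<?> clazz = loader.loadClass(className);        // e.g. "FooSearchEngine"
            return (Runnable) clazz.getDeclaredConstructor().newInstance();
        }
    }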

Do It Yourself

I implemented SEQuRAmA as a mix of fun, challenge, and a very useful tool for more efficient search online. One later possibility is to have SEQuRAmA work as a database of results, searchable through a web browser--turning a super search engine into a local database, perhaps even extracting keywords from the query (and using the search terms that led to the resulting link) and the result to organize the data internally. Of course, every software engineer often suffers from "creeping featuritis," and I'll have to reclassify these possible enhancements, but before then I'll continue to tweak and improve the features already on the "to-do" list. Other features are superfluous...such as a quote of the day, important historical events by date, or jokes from search results...nice, but they add nothing to efficient search across multiple search engines. Seek, locate, accumulate (a pun and paraphrase of the Daleks from Dr. Who...)

Thursday, January 9, 2014

Seek, and You Shall--Aggregate the Results

I search the Internet (who doesn't in the 21st century?--rhetorical question) using the many search engines to find information. An Internet search is not necessarily "one size fits all" with a search engine. But it is tedious and tiresome to search on a variety of search engines, or even some search engines that search using other search engines. I had the goal to search more efficiently, do more with less.

Project

This project has the goal of more efficient search--search using other search engines but from a single point, a locus or nexus. The operation is to query many search engines, then accumulate and aggregate the results. Thus Search Engine QUery-Results AccuMulator Aggregator, or SEQu-RAmA, pronounced "Seek you rama"--remember the "Futurama" exhibit of autos by one of the Big Three automakers, or the cartoon with Philip J. Fry?

The search engine is a simple web server written in Java (portable across the different computers I have) that runs locally.

The web server prompts for input with an HTML5 web page that uses JavaScript for some simple checks (no empty search) and functions (clear input).

To use the search engine, I just configured a browser to access it by default as "http://localhost:PORT" when the web browser is run. The web server uses the identity of the web browser when it connects--in effect, the web server is a proxy for the browser, an extension of it in a web server.

Implementation

Each search engine has particular idiosyncrasies, hence there is an abstract class that is sub-classed for each particular search engine. Each search engine has its own sub-class, which also implements threading using Runnable. Search results are returned in a "public String[] results" field along with a "public boolean readyFlag = false" flag.

Once search parameters are submitted, the web server creates a series of threads to search in parallel, held as an array "SearchEngine[] engine" of the threads. Synchronization is unnecessary, as the web server checks each array element, so that "engine[3].readyFlag" indicates that search is complete. Each search always stores the results in the "String[] results" and then sets the flag. Hence if the web server reads readyFlag while it is false, just as the thread is updating it to true, there is no problem: the web server will simply check it again.
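
A sketch of that arrangement (the results and readyFlag fields come from the post; the class name is mine, and I mark the flag volatile so its write is immediately visible to the polling thread):

    // Per-engine thread sketch: the subclass fills results, then raises readyFlag.
    abstract class SearchEngine implements Runnable {
        public String[] results = new String[0];
        public volatile boolean readyFlag = false;   // write becomes visible to the poller

        @Override
        public void run() {
            try {
                results = query();                   // engine-specific search; may time out
            } catch (Exception e) {
                results = new String[0];             // default: zero results on error
            } finally {
                readyFlag = true;                    // set only after results are stored
            }
        }

        // Each search engine subclass implements its own query idiosyncrasies.
        protected abstract String[] query() throws Exception;
    }

The web server can then poll engine[i].readyFlag in a loop and read engine[i].results once the flag is true, as described above.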

Results

Each search engine defaults to returning an array of zero results and setting readyFlag to true if a connection times out or some other error occurs--a log file is kept for each web search.

When "engine[3].readyFlag" is true, the web server reads "engine[3].results" using a loop from 0 to "engine[3].results.length", inserting them into an index structure that is a set--no duplicates.

Once all the searches are complete and the results are in the index structure, the results are processed by type--.html, .xml, .pdf, .jpeg, .asp, etcetera. The extension extraction algorithm extracts the extension, then inserts the result into an ordered map-set...each extension has a set containing all results with that extension. There are no predefined extensions; they are created dynamically for each extension encountered. This has the quirk that .htm and .html get their own sets, and sometimes an extension might have only one result.
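
A sketch of that ordered map-set (the names and the naive extension extraction are illustrative): a TreeMap keyed by extension, each entry holding an ordered set of links, with new extensions created on first sight.

    import java.util.TreeMap;
    import java.util.TreeSet;

    // Ordered map-set sketch: one ordered set of links per extension, created dynamically.
    class ResultIndex {
        private final TreeMap<String, TreeSet<String>> byExtension = new TreeMap<>();

        void add(String link) {
            int dot = link.lastIndexOf('.');
            String ext = dot >= 0 ? link.substring(dot + 1).toLowerCase() : "unknown";
            // computeIfAbsent creates the per-extension set on first use
            // (so .htm and .html do get their own sets).
            byExtension.computeIfAbsent(ext, k -> new TreeSet<>()).add(link);
        }

        TreeMap<String, TreeSet<String>> clusters() { return byExtension; }
    }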

After all links are stored in the ordered map-set, the results are presented in a predefined HTML5 format, with embedded links to allow quickly jumping to results by type. The actual links are presented in a table, with each type having a sub-table. The output format is functional and navigable, but lacks the previews and other features of the main search engines.

Kickin' Off

I have SEQu-RAmA on my Mac, Linux, and Windows computers. I've created a "kicker" program--a simple application that first starts the web server, delays for 3 seconds, and then fires up the web browser. The fun (or hard) part was creating a native kicker using g++ on my Mac, gcc on my Linux box, and C# on the Windows machine. The icon is pretty basic on the desktop, but it works: the browser automatically connects to the web server. Click the icon, connect to the web server, and search in parallel for data.
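
In Java (the native kickers were written with g++, gcc, and C#; this cross-platform sketch uses a hypothetical jar name and a placeholder port), the kicker boils down to: start the server, wait three seconds, open the browser.

    import java.awt.Desktop;
    import java.net.URI;

    // Kicker sketch: start the SEQu-RAmA web server, pause, then open the default browser on it.
    public class Kicker {
        public static void main(String[] args) throws Exception {
            new ProcessBuilder("java", "-jar", "sequrama.jar").start();    // hypothetical server jar
            Thread.sleep(3000);                                            // give the server 3 seconds
            Desktop.getDesktop().browse(new URI("http://localhost:8080")); // placeholder port
        }
    }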

Future SEQu-RAmA

Of course this is the initial prototype, so some improvements, tweaks, and alterations will be made over time. For example, configurable exclusions so that types of results like images, videos, and sound can be filtered out.

Another possibility is to use the many results to determine the significance of links among the various search engines--a weight for each result. A result repeated across search engines increases its weight--for M out of N search engines--and results of a type are then sorted accordingly.

Variations on input are possible, such as searching on each parameter and on permutations of the parameters. Consider a search for "Richard Nixon Watergate": also search "Richard", "Nixon", "Watergate", and then "Richard Nixon", "Richard Watergate"--and then determine, for each result from all search engines, the weight of each result.
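
A sketch of generating those variants (not SEQu-RAmA's algorithm): all non-empty, order-preserving subsets of the query terms, so "Richard Nixon Watergate" also yields "Richard", "Nixon Watergate", "Richard Watergate", and so on.

    import java.util.ArrayList;
    import java.util.List;

    // Generate all non-empty, order-preserving subsets of the query terms as variant queries.
    class QueryVariants {
        static List<String> variants(String query) {
            String[] terms = query.trim().split("\\s+");
            List<String> out = new ArrayList<>();
            for (int mask = 1; mask < (1 << terms.length); mask++) {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < terms.length; i++) {
                    if ((mask & (1 << i)) != 0) {
                        if (sb.length() > 0) sb.append(' ');
                        sb.append(terms[i]);
                    }
                }
                out.add(sb.toString());
            }
            return out;
        }
    }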

One wild possibility is a "deep search," where the initial results for an extension are used for a deeper search--extract data from the results (not possible for images/sounds/videos) and revise the query. Such a deeper search would require an existing query, with a button or link to trigger it, and waiting for the results would require actually fetching the web data.

Nothing New



An acquaintance sneered at the SEQu-RAmA project/server, saying that "XYZ search engine works for me." I simply pointed out that if they're that happy, why be so disdainful--and that I use the "XYZ search engine" in a multiple parallel search anyway. I recently read an article in Slate by Jessica Olien, Inside the Box: People Don't Actually Like Creativity, about how creativity and innovation are supposedly prized but actually despised. No big surprise to me; asking "why" with the implication of wasted effort and time illustrates a lack of imagination, a dumbing down of thought.

But I will continue tweaking, improving, and revising SEQu-RAmA for my own use. It might be possible as a tablet app, or even a browser plug-in, although I prefer a separate web server as a proxy for the web browser. Seek and you shall accumulate, aggregate, and then find.