Thursday, July 9, 2015

Binar--Binar Sort, Binar Shuffle, and Binar Search: the Big Three of Algorithms

The three areas of algorithms are sort (order data into a specific permutation), shuffle (scramble data into a random permutation), and search (structure data into an indexed permutation).

Binar

Two of the three algorithms are already in the "binar" category. By "binar" is meant "binary," not the two associations often made with the word "binar," which are the Bynar aliens from Star Trek: The Next Generation, and the Goa'uld Bynarr from Stargate SG-1.

Binar Algorithms

The binar sort orders data using a bit extraction function to partition the N elements into M partitions by exchange. The binar sort is like quicksort in its partitioning, only there is no pivot element, and there are 2 <= M < N sub-arrays.
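
As a rough illustration of the bit-extraction partitioning idea, here is a minimal Java sketch of a two-way (M = 2) partition on one extracted bit, recursing on the next lower bit. It is the classic radix-exchange pattern used only to convey the idea; it is not the binar sort itself, and the class and method names are mine.

    // Sketch: partition non-negative ints by one extracted bit, then recurse on the
    // next lower bit -- a two-way (M = 2) reading of bit-extraction partitioning.
    public final class BitPartitionSortSketch {

        // bit extraction function: the value of bit 'bit' in x (0 or 1)
        private static int extractBit(int x, int bit) {
            return (x >>> bit) & 1;
        }

        // exchange-based partition on one bit: like quicksort partitioning, but no pivot
        private static void sort(int[] a, int lo, int hi, int bit) {
            if (lo >= hi || bit < 0) return;
            int i = lo, j = hi;
            while (i <= j) {
                while (i <= j && extractBit(a[i], bit) == 0) i++;   // already in the 0-partition
                while (i <= j && extractBit(a[j], bit) == 1) j--;   // already in the 1-partition
                if (i < j) { int t = a[i]; a[i] = a[j]; a[j] = t; } // exchange across the boundary
            }
            sort(a, lo, j, bit - 1);      // recurse into the 0-partition
            sort(a, i, hi, bit - 1);      // recurse into the 1-partition
        }

        public static void sort(int[] a) {
            sort(a, 0, a.length - 1, 30); // highest magnitude bit of a non-negative int
        }

        public static void main(String[] args) {
            int[] a = {9, 3, 7, 1, 8, 2};
            sort(a);
            System.out.println(java.util.Arrays.toString(a)); // [1, 2, 3, 7, 8, 9]
        }
    }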

The binar shuffle scrambles data, also using a bit extraction function, but to randomly move and exchange data elements. One interesting property of the binar shuffle is reversibility--the binar shuffle is invertible. The final permutation can be returned to the original permutation by an inverse function (something I need to add, along with a basic proof, to the research monograph).

Binar Search

The simplest search is to use the binar sort to organize the data elements into an ordered permutation, and then use a binary search to find an element. The Big-Oh for such an algorithm is O(c*N) + O(M*lg N), where M is the number of searches and N is the number of elements.

Consider the following values:

  • 4096 search operations, M = 4096 or 2^12
  • 65,536 elements, N = 65536 or 2^16, lg N = 16 = 2^4

  • Complexity: O(c*2^16) + O(2^12 * 2^4) = O(c*2^16) + O(2^16), at most 2*c*2^16 or c*2^17 for c >= 1

Thus the binar sort is efficient, but for a sufficiently large data set and many searches, the complexity of the binar sort plus binary search approaches that of sorting the data elements twice. If the value of M grows beyond the value of N (which is logical to presume...more searches on the data set once it is ordered), the combined cost becomes linearithmic when M >= N, or O(c*N) + O(N * lg N).

Variation of Shuffle and Sort

The binar search is a variation upon the binar sort and binar shuffle. The actual bit extraction function is changed--instead of creating an ordered or random permutation, the bit function will map data elements, or create an index and organize the data elements accordingly. The array of data elements is unorganized initially, in a random permutation. The binar search will arrange the elements by an index computed by the bit extraction function, and position each element by the index--hash the elements into a hash table.
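
As a minimal sketch of that mapping idea, assume a toy bit extraction function (the top k bits of a non-negative int) and illustrative class names; this is not the actual binar search, just the hash-the-elements-by-extracted-bits pattern in Java.

    // Sketch: index elements by a bit extraction function, then "search" by recomputing
    // the same index. The extraction function (top k bits of the value) is illustrative.
    import java.util.ArrayList;
    import java.util.List;

    public final class BinarIndexSketch {
        private final int bits;                  // k extracted bits -> 2^k buckets
        private final List<List<Integer>> table;

        public BinarIndexSketch(int bits) {
            this.bits = bits;
            this.table = new ArrayList<>();
            for (int i = 0; i < (1 << bits); i++) table.add(new ArrayList<>());
        }

        // bit extraction function: map a non-negative int to a bucket index
        private int extract(int x) {
            return x >>> (31 - bits);            // top k bits of the 31-bit magnitude
        }

        public void insert(int x) {              // position the element by its computed index
            table.get(extract(x)).add(x);
        }

        public boolean contains(int x) {         // a "virtual insert": compute the index, look inside
            return table.get(extract(x)).contains(x);
        }
    }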

Search and Sort

This might seem unusual, but the proxmap sort by Standish also has a proxmap search. The binar sort simply puts data elements into an ordered permutation where the organization is dependent upon the predecessor and successor element in the permutation. The binar search first maps the data elements, and then each data element is accessed or searched for after the mapping.

The complexity of searching involves:

  1. index data, pre-process, create permutation - O(N)
  2. search data, use search function -- constant O(c)

For the number of searches M, there are two possibilities:

  1. O(c*N) + O(M*c) when M < N
  2. O(c*N) + O(c*M) = O(c*(N+M)) when M >= N (still linear; it does not become linearithmic)

Searching

A search after mapping the data in an array can use the same bit extraction function to find data elements. With the right design of the bit extraction function, data elements related by the mapping are placed in the same partition and are accessible together. Instead of applying the complete bit extraction function at a constant time c, a partial extraction with a lesser constant k < c is enough to find a data element, or a group of data elements.

Thus the complexity is O(c*N) + O(k*M) for the searches, which when M is on the order of N is O(c*N) + O(k*N), or more simply O((c+k)*N), for a search function that does not require the entire bit extraction function to find an element.

Bit Extraction Function

The core of the binar search is the bit extraction function: it maps a data element to an index, and the same function is then used to access an element or elements. In effect, a search is an insert--but instead of putting an element into the array, it is a virtual insert that verifies whether an element is in the array.

Number of Permutations

For the number of permutations, each operation works with a different number of significant, possible permutations.

  1. Sort - create ordered permutation (2 permutations) only ascending/descending
  2. Shuffle - create random permutation ((N!-2) permutations) excluding the two sorted permutations
  3. Search - create indexed permutation (N! permutations) any possible permutation

The ranking of permutation counts is 2 < N!-2 < N!, which is an interesting relationship among the operations. Ranked by permutation count, sort < shuffle < search.

Binar Search--the Missing Algorithm

The binar sort algorithm was the first binar algorithm, but with the right bit extraction function it could also unsort, or shuffle--the second binar algorithm. The third algorithm is the binar search, which uses the bit extraction function to take an array of elements and map them into a particular permutation that can be searched after processing. Hence searching, shuffling, and sorting are the same algorithm--a binar algorithm; just change or modify the bit extraction function.

Someday I'll have to write a research monograph about the binar search, just as I have for the binar sort and the binar shuffle. Until that day then...someday

Wednesday, February 4, 2015

Whither SEQu-RAmA, Quo Vadis???

SEQu-RAmA has the goal of aggregation and then identification of the various types for a search query. SEQu-RAmA works, but historically it is like Henry Ford's Quadricycle, or the Wright Brothers' Flyer--a prototype to determine if the concept is feasible, and to experience some of the difficulties in design, implementation, and test.

Hence some improvements remain to increase SEQu-RAmA's efficiency, but it works for me, making my search more effective. This leaves the question: what next for SEQu-RAmA?

I will always tinker with and improve the source code, especially when inspired. But for now it works fine, with a trade-off of speed against organizing links by type and status code with some color coding--think of ROY G. BIV plus grey/black colored links.

One decision was to limit the number of links returned--or SEQu-RAmA would never stop, processing on and on for hours. I'm tweaking that parameter, currently about 100 links from each search engine queried. With 20 search engines that is 2,000 links or fewer, but I asked myself: would I really click on all the thousands of references from a server? I might make the number of links a parameter set in the query page, or simply find the right balance between results and the number of links returned.

Faster, Faster

One performance area was checking for redundant links before processing them. Originally I used a treap--a tree-heap with randomization to keep the heap property--whose performance is logarithmic, O(lg N). But constant performance O(c) seemed better--hence a hash table. I don't perform any thread mutex or locking--if thread 1 is inserting a link into the hash table and gets preempted by thread 2, which inserts the same link, then thread 1 will not process the link while thread 2 does. Either way, one thread processes the link and puts it into the resultant data structure, and the other does not.
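
For illustration, here is a small Java sketch of that duplicate-link check. Unlike the lock-free description above, it leans on ConcurrentHashMap.putIfAbsent so the "first insert wins" decision is atomic; the class and method names are placeholders, not SEQu-RAmA source.

    import java.util.concurrent.ConcurrentHashMap;

    public final class LinkDeduper {
        private final ConcurrentHashMap<String, Boolean> seen = new ConcurrentHashMap<>();

        /** Returns true only for the one thread that should process this link. */
        public boolean claim(String url) {
            return seen.putIfAbsent(url, Boolean.TRUE) == null;   // null means we inserted it first
        }
    }

    // Inside a search-engine thread, roughly:
    //     if (deduper.claim(link)) { processLink(link); }        // O(c) check per link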

The last improvement uses an algorithm I created some time ago--the hash sort (not without controversy: one guy told me about his two PhDs and his best paper of the last 25 years...and that the algorithm was 'impossible'...classic mapper vs. packer mental models). Sorting by hashing? Oh yes, although in another twist, I asked NIST to include the hash sort in its list of algorithms and data structures. The person in the e-mail responded with "it's not a real sorting algorithm"...so now there is a gatekeeper for what is "real" as an algorithm--whatever that means...*shrug* As Mark Twain once said, "Don't let education get in the way of your learning."

But it was not the hash sort specifically, it was the data structure--a matrix that was useful for storing links in an ordered way, so each server thread can hash a link and determine if it has already been processed. Two operations--hashing to check if a link is already there, and storing it in an organized, ordered way--both in O(c) constant time.

The Fewer, the Higher

One potential optimization is to give a fixed number of results but continue processing. The results page would show the links processed thus far, but also include a link to get more results. With 20 servers returning 100 links each, the user does not wait for the 2,000 links, but gets, say, 50 links while the SEQu-RAmA server processes the rest in the background. Perhaps the number of initial links is a parameter set by the user: fewer initial links is faster, but means more clicking on the link to get more results.

Where To Now?

Now a remaining question is what to do with SEQu-RAmA besides use it when I'm doing research on a particular topic and looking for PDF's, image files, or web pages.

It seems I'm at a fork in the road travelled in creating, tweaking, and using SEQu-RAmA. One fork is open-source, the other is closed-source. I have some personal experience in open-source with other projects in the past--a plugin for JBuilder, and an XML tokenizer.

Open source is a team effort, where the team is out there, somewhere on the Internet. Others take the open-source code, improve it, adapt it, and return the resultant improved source code...a team effort.

I had this experience with the plugin (the plugin took a "snapshot" of a project and stored it in a ZIP file...very useful when loading the project to work on it, and before exiting JBuilder when done coding). Thus in this case the open source was win-win, and helped make the JBuilder experience better.

My other experience was with an XML tokenizer that I wrote (in some instances, why use an XML parser if you only want to do some processing of an XML document?). I put it online as open source, and people used it.

When I moved the source code from the initial website, I received many e-mails, some bordering on anger at my actions. There were no improvements, just users of the software library, and as the author I was apparently beholden to them.

So open source was not effective, and I had dozens of freeloading customers using my product that I had to support and explain. When I posted on an XML newsgroup, one of the original creators of XML said they were nervous that the XML tokenizer did not validate the structure of a tokenized XML document. No duh!...it is a tokenizer--a lexer/scanner in compiler parlance. Adding validation is simple enough, but it turns the simple but fast tokenizer into a crude parser. So not only angry freebie customers, but a rebuke from an XML luminary.

The distinction between the two experiences is that one had a team of contributors who were also users improving the open-source software; the other had users but no improvements.

Capitalism

I am a capitalist: no pay, no way. I could use a standard proprietary license and sell SEQu-RAmA myself to others. The only difference is that paying users would still be angry customers, now with a financial string attached to SEQu-RAmA and to me. I'd still be making the improvements, fixing software glitches, and adding new features. In effect it is the other fork, open source, only with a proprietary license and license fees in the equation.

The important criterion is that most users want to use SEQu-RAmA, not tinker with the source code. SEQu-RAmA is a software app that a paying user wants to install and use; it is a multi-threaded web server that a web browser accesses. But open-sourcing it and giving it away would give the big search engines access to the software technology without any improvement or optimization returned.

Released either way, once the source code is out on the Internet, there is no way of going back...the Internet is a STORM--STore Online Read Many.

Silicon Venture Valley

It is ironic that I am not in the Computer Valley (a.k.a. Silicon Valley), as Robert X. Cringely calls it in his magnum opus "Accidental Empires: How the Boys of Silicon Valley Make Their Millions, Battle Foreign Competition, and Still Can't Get a Date".

SEQu-RAmA is a meta-search or super-search engine. I have the server on my MacBook and on a desktop PC, but a user wants to search the search engines with just a browser. So, in Silicon Valley, find a backer and use a hosting service to run many instances of SEQu-RAmA responding to user queries. SEQu-RAmA is a web proxy that connects to many search engines; a user does not want to install and run a server on their own computer, they simply want to search across many systems. Since the search goes through a web proxy, the search engine queried does not see the user's information--browser, screen size, IP address, and so forth.

The profit potential is similar to that of the search engines: advertising based upon the query given. Other profit potential lies in tracking link clicks relating to a query. If a user returns and clicks on another link, then the abandoned link is not as significant as the other; how long a user spends on a link before returning to click another could be useful information for tracking which links from a server are most significant. Other possibilities are specialized searches for books, autos, etcetera, still using SEQu-RAmA to search many other search engines.

Going Elastic

There is an ironic if strange similarity to a project, Elasticsearch, which is a JSON search engine queried over HTTP. Elasticsearch stores documents which are then queried. SEQu-RAmA does not store the files searched from the Internet; it retrieves, organizes, and then presents the web sites/pages with an indication of status--efficient integration of information.

The irony is that in Silicon Valley one could get funding for the SEQu-RAmA server, and possibly release an open-source version that could be used by enterprises--connecting all company sites in a single search for information within an organization. A university with different branch sites could be unified in a search across academics, administrivia, and so forth. One major goal would be to set up on a leased server host for web search, and then make income from ads, collecting browsing data for subjects, among other things. Another goal would be optimization, making SEQu-RAmA faster...the major trade-off is accuracy against speed. On a server cluster, caching might allow for data mining and retaining popular web-wide search queries--in essence, the users become the key in search.

Where To?

So where is SEQu-RAmA going? It will remain a search tool I utilize for more efficient search across multiple servers. I'll tweak and improve SEQu-RAmA--for example, when a link is dead or unavailable, use the link to the cache on the search server, or perhaps even the Wayback Machine (or maybe Mr. Peabody?). Another possible improvement is to automatically rank search engines by the links returned but also by speed. If SEQu-RAmA, based on the geographic location of the query, takes longer to reach an international web server, it might automatically adjust the timeout for that server to respond to the query.

For now SEQu-RAmA works for me, so how do you search, too? SEQu-RAmA for you (mostly me right now though...)

Thursday, November 13, 2014

Treapy Sort--Getting a Bit Treapy with a Treap

Treap

A treap is a randomized data structure that is a hybrid of a heap and a tree. It was created by Cecilia R. Aragon and Raimund Seidel in 1989. What is fascinating is the use of a random priority for maintaining the heap structure, or heap property. Randomness is used to bring order and structure. As Mr. Spock would say, "Fascinating..."

The basic operations of access, insert, and remove all have a Big-Oh of O(lg N). Hence 2**32 = 4,294,967,296, or about 4 billion, elements in a treap will require lg(4294967296) = lg(2**32) = 32 operations. So a treap, while a hybrid of a tree and heap, is very efficient in performance time.
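
For reference, here is a minimal treap sketch in Java--insert with rotations driven by a random priority, plus getMin()--to show how randomness maintains the heap property. It is a textbook-style illustration, not code from the treapy sort, and it omits deletion and duplicate counts.

    // Minimal treap sketch: BST ordering on the key, min-heap ordering on a random priority.
    import java.util.Random;

    public final class TreapSketch {
        private static final Random RNG = new Random();

        private static final class Node {
            int key;
            double priority = RNG.nextDouble();   // the random priority that keeps the heap property
            Node left, right;
            Node(int key) { this.key = key; }
        }

        private Node root;

        private static Node rotateRight(Node n) { Node l = n.left;  n.left = l.right;  l.right = n; return l; }
        private static Node rotateLeft(Node n)  { Node r = n.right; n.right = r.left;  r.left = n;  return r; }

        private static Node insert(Node n, int key) {
            if (n == null) return new Node(key);
            if (key < n.key) {
                n.left = insert(n.left, key);
                if (n.left.priority < n.priority) n = rotateRight(n);   // rotate the child up
            } else {
                n.right = insert(n.right, key);
                if (n.right.priority < n.priority) n = rotateLeft(n);
            }
            return n;
        }

        public void insert(int key) { root = insert(root, key); }

        public int getMin() {                     // leftmost node by BST ordering; assumes non-empty
            Node n = root;
            while (n.left != null) n = n.left;
            return n.key;
        }
    }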

Infimum and Supremum

One operation on a treap is to get the minimal or maximal element--the mathematical terms are infimum and supremum. I use the term extrema to mean either the maximum or minimum element.

Not all treap data structure implementations have extrema methods such as getMin() or getMax(), but such methods are necessary for the treapy sort.

Sorting by Treap

The treap as a method of sorting is discussed in A Modular Calculus for the Average Cost of Data Structuring by Michel Schellekens (2008). Sorting with the extrema methods is simple: repeatedly getMin() or getMax() and then delete that element.

But, after deletion, the treap requires rebalancing of elements to maintain the heap property. Hence deletion requires the expense of rebalancing the treap.

Wikipedia explains, "To delete a node x from the treap, if x is a leaf of the tree, simply remove it. If x has a single child z, remove x from the tree and make z be the child of the parent of x (or make z the root of the tree if x had no parent). Finally, if x has two children, swap its position in the tree with the position of its immediate successor z in the sorted order, resulting in one of the previous cases. In this final case, the swap may violate the heap-ordering property for z, so additional rotations may need to be performed to restore this property."

Getting Treapy

The treapy sort works by getting the extreme element, but it avoids the cost of rebalancing the treap--however small that Big-Oh cost is--while preserving the structure of the treap during sorting. How does it do this? That's why it's not a treap sort but a treapy sort.

One important feature of the treapy sort (and likewise the treap sort) is that, for duplicate elements, each node in the treap maintains a count (which also simplifies the treap operations overall). Hence for a list of N elements the space is O(M), where M <= N distinct values.

The treapy sort is linearithmic, or O(N lg N).

For a list of data elements 2**32 = 4,294,967,296 the Big-Oh time is O(2**32 lg(2**32)) which simplifies to O(2**32 * 32) or O(2**32 * 2**5) or O(2**37).

One feature is that if the elements are in any partial ordering, or even totally sorted, the treapy sort is unchanged--it remains linearithmic.

The treapy sort is akin to the heap sort, but uses a treap, a hybrid of a heap and a tree. One thesis examines the heap sort versus the treap sort in greater detail. Its abstract states, "It was discovered that treap sort is much more faster than heap sort in sorting and searching for elements."

Source Code for Treapy Sort

The idea for the treapy sort came to me while examining diagrams, figures, and code for a treap. One of the best things about being a computer scientist is that to test an idea you code, execute, test, and refine or optimize. I've implemented the treapy sort in Java to see that it works.

Writing a research monograph and doing a performance comparison against other linearithmic sorting algorithms--merge sort, quick sort, heap sort--is an experiment for the future. Perhaps even a comparison with a treap sort that pays the cost of deletion and rebalancing.

What is fascinating is that randomness is used to maintain the heap property, yet order emerges...the ordering of the sort property.

I *might* post the Java source code, depending upon the level of interest. I *might* write a research monograph, again depending upon the level of interest. The downside of being a computer scientist is that publishing results is involved, requiring a research monograph, whereas writing source code is simpler and quicker.

Advantages of Treapy Sort

The primary advantages of the treapy sort are:

  1. The treapy sort maintains the treap structure (hence immutability).
  2. There is no need or expense to delete and rebalance the treap.
  3. Linearithmic performance time, independent of the source data.

Another advantage is that if a treap is used in other applications (for example a compiler symbol table, or--as I considered--in SEQu-RAmA), the treapy sort is useful for extracting all the data in ascending or descending sorted order, such as for a debug operation, or to simply dump the information in the treap data structure.

Wednesday, October 29, 2014

Data, Data Everywhere But Not a Link to Click

The title of the web log post is an allusion to "The Rime of the Ancient Mariner."

Yes, me and SEQu-RAmA are still here!

One problem with SEQu-RAmA as a meta-search or super search engine is that a group of search engines returns a flood of hyperlinks of different media types (html, pdf, xml) that might be valid or invalid. Since one function is to check and identify hyperlinks (for example, determine if a link is valid or missing), taking all the raw results returned and processing them efficiently is critical.

Hyperlink Processing

One approach is that each search thread (each search engine gets its own thread) processes its links and creates the necessary entry in a common data structure. It works, but is woefully inefficient and slow--it reminds me of downloading files back in the days of dial-up with a 56K modem.

Consider N search engines that do a search and then process the results. The raw time to process a hyperlink is Tp. The same link is returned by M search engines (where M <= N), but each search thread processes the same link. Hence there is M-fold redundant, repeated processing of the same hyperlink, and (M - 1) x Tp seconds of wasted time.

Very inefficient although it works. Another aspect is determining the rank or frequency of a hyperlink, the rank taking advantage of the shared information from all the search engines.

Datum Ranking

The duplicate hyperlinks are inefficient to process by each search thread, but if many different web search engines return the same hyperlink it is a commonality. The question of different search engines using the same web search indexing algorithm arises, but various search engines are like automobiles.

Automobiles are essentially the same, but different automakers had different design, engineering, and implementation approaches. A look at the many "top 25 best automobiles ever" and "top 25 worst automobiles ever" lists shows an extreme contrast of features and properties of autos. The contrast illustrates that while doing the same thing--moving people from point A to point B--various automobiles were very different.

A safety precaution is to analyze the link results from each search engine queried, and if too many duplicate the overall resultants beyond some threshold, that search engine might be excluded. One possibility is searching the "metacrawler" search engines, which search existing search engines and thus _will_ duplicate results from the search engines queried individually.

Still, the emphasis is on the collective results giving a rank to a hyperlink--a collective property, not an individual property. Individual properties include type (html, pdf, xml, zzzz), status (202, 404, 999), and hyperlink text using string ordering (www.fubar.org < zzz.asleep-at-the-switch.info) of the link text.

This collective ranking is not unlike auctioneering in embedded/control systems, or multiple processors on a space probe that “vote” to determine overall processor operation.

Hyperlink Efficient Processing

While the original duplicated effort works, an important skill a software engineer/computer scientist must have is to approach a problem from many different perspectives.

My original approach was to create a micro-database as a common data store for all the returned hyperlinks from the various search engines. The problem is that a micro-database is overkill in functionality; the SEQu-RAmA results do not need database functionality.

But it is the start of the right idea. I liken the approach to Futurama, and the "Master In-Pile" at the Central Bureaucracy in the episode "How Hermes Requisitioned His Groove Back."

The results do not need to be flexible; they are clustered by type and then by hyperlink text alphabetically. Another important feature is that the common data store--the "Master In-Pile"--must compute a rank for duplicate links.

The final approach is what I term a "cardinal map" data structure: a map that stores a node consisting of the type, status code, and hyperlink text, but adds a count or coefficient.

The cardinal map is primarily for inserting data, with two possible cases. The first case is a node not yet in the cardinal map, so it is simply inserted with the count equal to 1. The second case is a node whose data is already in the cardinal map; instead of inserting the node, the existing node is retrieved and its count incremented.

The cardinal map functions like an ordered set in avoiding duplicates and storing the nodes ordered by the data. It is akin to the symbol table in a compiler--not just one data structure but a composition. A compiler symbol table lets the compiler access the information a specific programming language requires to enforce the language rules.
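
A minimal Java sketch of that insert-or-increment behavior, using a TreeMap keyed by the hyperlink text; the class and field names are illustrative, not taken from the SEQu-RAmA source.

    import java.util.TreeMap;

    public final class CardinalMapSketch {

        public static final class LinkNode {
            final String url;
            final String type;     // e.g. "html", "pdf"
            final int status;      // e.g. 200, 404, 999
            int count = 1;         // how many search engines returned this link
            LinkNode(String url, String type, int status) {
                this.url = url; this.type = type; this.status = status;
            }
        }

        private final TreeMap<String, LinkNode> map = new TreeMap<>();   // ordered by link text

        public void insert(String url, String type, int status) {
            LinkNode existing = map.get(url);
            if (existing == null) {
                map.put(url, new LinkNode(url, type, status));   // case 1: not present, count = 1
            } else {
                existing.count++;                                // case 2: present, increment the count
            }
        }
    }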

Results Organization

The results are clustered together by type, color coded by the status of the link, and in alphabetic order by the hyperlink text. Now the rank is used to order the links, and then the alphabetic text.

Embedded links let you click and go to a cluster of links by type, arranged by rank and hyperlink text. The color-coded link then displays the status of the link.

Two Special Cases in Results Organization

There are two special cases (or at least the most obvious) for HTTP status code. The two status codes are:

  1. HTTP Not Found (404) - web page is not found at hyperlink
  2. HTTP ZZZZ       (999) - special case of unknown link status

An obvious question is, "Why bother to include links that are not found, and those whose status is unknown?"

HTTP 404 Not Found

The status code of HTTP 404 seems a dead end. After all, why click on a hyperlink that is not found at the site?

The answer for a 404 Not Found link: use the cache link from the search engine, or from the Wayback Machine.

ZZZZ 999 Unknown

The custom status code of ZZZZ 999 seems like an HTTP 404 status code, but there is a significant difference. HTTP 404 means the web page is not found, does not exist; ZZZZ 999 means the status is unknown--SEQu-RAmA was unable to verify the HTTP status code of the link.

999 ZZZZ is a special case: the status is unknown, so take a chance by visiting the hyperlink and/or including a link to the cache...something like a 40404.

Color Coding of Hyperlinks

The color coding of the hyperlinks mixes two concerns in choosing a color for a hyperlink's status. One part is aesthetics: choosing a color that is not harsh on the eyes. The other part is color logic: for example, black for a status code of "web page not found" (404), and grey for a status code of "unknown status" (999).

The choice of color is like the color coding of resistors in electronics; the difficult question and choice is which colors for which HTTP status codes.

For example, black for a dead HTTP status 404 link gives forewarning, and perhaps grey serves as a warning for 999 ZZZZ status unknown. But the question remains: what color for the other HTTP status codes...??
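
One possible mapping, sketched in Java, following the logic above (black for 404, grey for 999); the remaining color choices are only placeholders for that open question.

    public final class LinkColors {
        static String colorFor(int status) {
            switch (status) {
                case 200: return "green";    // link verified and reachable (placeholder choice)
                case 404: return "black";    // dead link, forewarning
                case 999: return "grey";     // custom ZZZZ code, status unknown
                default:  return "orange";   // everything else, pending a better scheme
            }
        }
    }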

Saturday, June 28, 2014

May Your Filters Bloom

I'm rewriting SEQuRAmA for cleaner code in version 2.0; version 1.0 was a prototype to see if a super search engine was possible. (But how often in software there is pressure to take the first prototype and just continue to use it...) One goal is to organize search, aggregation, and reporting into self-contained, i.e. encapsulated, modules, and to keep the multi-threaded search on each search engine self-contained and mutually independent.

One method for efficiency is to avoid repeated links from the various search engines--repeated when the links are aggregated into the final result set of links. But the result set is a set--duplicates are avoided by checking whether a link already exists or not.

I initially tried a Bloom filter, a probabilistic hash structure that reports the possibility of an element being in a set. For N search engines queried by the SEQuRAmA engine, there is a 1/N probability that a link is already in the resultant set. Part of my initial motivation was Google BigTable's use of a Bloom filter.

After implementing it, I realized there was one problem. A Bloom filter can give a false positive for a link, but never a false negative. Hence a link that is not in the set of results might be reported, probabilistically, as being in the results.

If this were reversed, there would be no false positives, but possible false negatives. A false negative would indicate the link is not in the set, which could lead to duplicate effort for repeated links.

While a Bloom filter has possibilities (no pun intended), it is not the best data structure here, even with a 1/N probability of a link being in the result set. My thinking was guided by probabilities, but the result set requires "crisp logic," not probabilistic logic.
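
A short sketch of where the probabilistic check bites, assuming Guava's BloomFilter is on the classpath: a false positive from mightContain() means a brand-new link is silently skipped.

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public final class BloomDedupSketch {
        private final BloomFilter<String> seen =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000, 0.01);

        /** Process a link only if the filter says it has not been seen. */
        public boolean shouldProcess(String link) {
            if (seen.mightContain(link)) {
                return false;   // possibly a FALSE positive: a never-seen link gets dropped here
            }
            seen.put(link);
            return true;        // no false negatives: a link already put() never reaches this branch
        }
    }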

I've been looking at the code and theoretical features of other data structures for use in the final result set. I like to think of the SEQuRAmA result set as akin to the Futurama Central Bureaucracy "Master In-Pile" that needs to be collated and filed. With a Bloom filter, the pile of links would be a probability, not a certainty, and a link could be excluded because it was only probabilistically in the result set.

Another attraction of the Bloom filter was performance--an efficient hash (and I like hashing...one reason I created a sorting algorithm that sorts by hashing, the hash sort)--so it would process links very quickly.

Fortunately this phase of link processing is modular, so changing from a Bloom filter to data structure "X" is simple, without any coupling to other functionality.

The next data structure I'll have to evaluate against various alternatives before deciding upon the final one to use. When in doubt, test it, then try it. But without trying, the limits of a Bloom filter would not have surfaced. A good data structure, but incompatible with the functionality of the SEQuRAmA engine. To err is human, to repeat it is...DOH! (with an allusion to Homer Simpson).

Thursday, May 15, 2014

Seek, Locate, SEQuRAmA: Weave a Search Web by Super-Search Engine

SEQuRAmA

I've continued tweaking and adding search within search features to SEQuRAmA--the Search Engine QUery Result AccuMulator Aggregator. But in doing so I organized the various features under five possible categories.

Five Categories

The five categories are:
  1. verification
  2. presentation
  3. operation
  4. optimization
  5. improvements

List of Features by Category

Some of the features I have implemented; others are on the "to-do" list of software improvements. But by organizing the features into categories, it is easier to prioritize. Some features lead to other features--implementing one facilitates another implementation, or, more simply, without one feature already working, another cannot be so easily implemented. (Never say the word impossible in software--famous last words of many who uttered the word...)

The Features of SEQuRAmA

    Verification:

    1. Verify link media type (xml, pdf, html)
    2. Verify link exists

    Presentation:

    1. Color code link for reliability (exists, unknown)
    2. Cluster links together by type (xml, pdf, html, gif)
    3. Cluster similar links within domain
    4. Internal link to each cluster for easy navigation

    Operation:

    1. Cookie control from search engines (delete, store)
    2. No search engine advertisements in output results
    3. Query multiple search engines in tandem

    Optimization:

    1. Permutations on search keywords--"Richard M. Nixon" one variant "Richard Nixon" and "Nixon"
    2. Rank resultant link by commonality--search 12-search engines 9/12 = 0.75 rank
    3. Connect a private, internal resource (such as an internal employee website portal) to an external, publicly accessible resource

    Improvements:

    1. For non-existent link, replace with "The Wayback Machine" link if available, or search engine cache, or both
    2. Using most ranked results, tweak search using content in web pages (for example ID3 algorithm to classify web page)
    3. Determine other search keywords from words/markup on web pages (from Nixon, get Watergate, trip to China)
    4. Store advertisements returned as separate results accessed outside primary results
    5. Use a presentation template that specifies how to organize and structure the results

Modular, Multi-threaded, Multi-Class

One important implementation and design consideration is avoiding a monolithic block of Java source code. Each search is its own class, and a thread, implementing an interface, so that each search engine is a module. One change I've considered is dynamically loading a bytecode .class file, and then unloading it if the search engine is down or unresponsive within a wall-clock time threshold. Other functionality accesses the raw results, stored in an ordered map (or multi-map, since the resultant data are stored within nested ordered maps). The original "store" (to use a term from Babbage's Analytical Engine) used the standard Java data structures, but for more performance (the need for which crept up as I added more search engine modules...) I used open-source code and implemented a more specialized interface for specific functionality rather than general-purpose operation.

Do It Yourself

I implemented SEQuRAmA as a mix of fun, challenge, and a very useful tool for more efficient search online. One later possibility is to have SEQuRAmA work as a database of results, searchable through a web browser--turning a super search engine into a local database, perhaps even extracting keywords from the query (and the search terms that led to each resulting link) to organize the data internally. Of course, every software engineer suffers from "creeping featuritis," and I'll have to reclassify these possible enhancements, but before then I will continue to tweak and improve the features already on the "to-do" list. Other features are superfluous...such as quote of the day, important historical events by date, jokes from search results...nice, but they add nothing to efficient search across multiple search engines. Seek, locate, accumulate (a pun and paraphrase of the Daleks from Dr. Who...)

Thursday, January 9, 2014

Seek, and You Shall--Aggregate the Results



I search the Internet (who doesn't in the 21st century?--rhetorical question) using the many search engines to find information. An Internet search is not necessarily "one size fits all" with a search engine. But it is tedious and tiresome to search on a variety of search engines, or even some search engines that search using other search engines. I had the goal to search more efficiently, do more with less.

Project



This project has the goal of more efficient search--search using other search engines but from a single point, a locus or nexus. The operation is to query many search engines, then accumulate and aggregate the results. Thus Search Engine QUery-Results AccuMulator Aggregator, or SEQu-RAmA, or "Seek you rama"--remember "Futurama," the Big Three automaker's exhibit of the future of autos, or the cartoon with Philip J. Fry?

The search engine is a simple web server written in Java (portable across the different computers I have) that runs locally on the computer.

The web server always prompts for input using an HTML5 web page using JavaScript for some simple checks (no empty search) and functions (clear input).


To use the search engine, I just configured a browser to access it by default as "http://localhost:PORT" when the web browser is run. The web server uses the identity of the web browser when it connects--in effect, the web server is a proxy for the browser, an extension of the browser in a web server.

Implementation



Each search engine has particular idiosyncrasies, hence there is an abstract class that is sub-classed for each particular search engine. Each search engine has its own sub-class, which also implements threading using Runnable. Search results are returned in a "public String[] results" and a "public boolean readyFlag = false".

Once search parameters are submitted, the web server creates a series of threads to search in parallel, held in an array "SearchEngine[] engine". Synchronization is unnecessary, as the web server checks each entry of the array, so that "engine[3].readyFlag" indicates that search is complete. Each search always stores the results in "String[] results" first and then sets the flag. Hence if the web server reads readyFlag as false while the thread is updating it to true, there is no problem, as the web server will simply check it again.
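
A skeleton of that arrangement--abstract class, one subclass per engine, a results array, and a readyFlag the server polls. The engine-specific query code is omitted, and readyFlag is marked volatile here (my addition, for cross-thread visibility in Java) even though the original description relies on plain polling.

    public abstract class SearchEngine implements Runnable {
        public String[] results = new String[0];
        public volatile boolean readyFlag = false;    // volatile added for visibility between threads

        protected String terms;                       // the submitted search parameters
        public void setTerms(String terms) { this.terms = terms; }

        protected abstract String[] query(String terms) throws Exception;   // per-engine idiosyncrasies

        @Override
        public void run() {
            try {
                results = query(terms);       // store the results first...
            } catch (Exception e) {
                results = new String[0];      // default: zero results on timeout or other error
            } finally {
                readyFlag = true;             // ...then set the flag, so a true flag means results are ready
            }
        }
    }

The server-side loop then walks the "SearchEngine[] engine" array and reads "engine[i].results" only once "engine[i].readyFlag" is true, as described.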

Results



Each search engine defaults to returning an array of zero results and setting readyFlag to true if a connection times out or some other error occurs--a log file is kept for each web search.

When "engine[3].readyFlag" is true, the web server reads "engine[3].results" using a loop from 0 to "engine[3].results.length", inserting them into an index structure that is a set--no duplicates.

Once all the searches are complete and the results are in the index structure, the results are processed, looking at the kind of result by type--.html, .xml, .pdf, .jpeg, .asp, etcetera. The extension extraction algorithm extracts the extension, then inserts the result into an ordered map-set...each extension has a set containing all results with that extension. There are no predefined extensions; they are created dynamically as each extension appears. This has the quirk that .htm and .html get their own sets, and sometimes one extension might have only one result.
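
A sketch of that ordered map-set in Java: a TreeMap from extension to a TreeSet of links, with extension keys created dynamically. The extension extraction here is deliberately naive and only illustrative.

    import java.util.TreeMap;
    import java.util.TreeSet;

    public final class ExtensionClusters {
        private final TreeMap<String, TreeSet<String>> byExtension = new TreeMap<>();

        // naive extraction: last '.' of the last path component; ignores query strings
        private static String extensionOf(String link) {
            String tail = link.substring(link.lastIndexOf('/') + 1);
            int dot = tail.lastIndexOf('.');
            return dot < 0 ? "(none)" : tail.substring(dot + 1).toLowerCase();
        }

        public void add(String link) {
            byExtension.computeIfAbsent(extensionOf(link), k -> new TreeSet<>()).add(link);
        }
    }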

After all links are stored in the ordered map-set, the results are presented in a predefined HTML5 format, with embedded links to allow quickly accessing results by type. The actual links are presented in a table, with each type having a sub-table. The output format is functional and navigable, but lacks the preview and other niceties of the main search engines.

Kickin' Off



I have SEQu-RAmA on my Mac, Linux, and Windows computers. I've created a "kicker" program--a simple application that first starts the web server, delays for 3 seconds, and then fires up the web browser. The fun (or hard) part was creating a native kicker using g++ on my Mac, gcc on my Linux box, and C# on the Windows machine. The desktop icon is pretty basic, but it works: the browser automatically connects to the web server. Click the icon, connect to the web server, and search in parallel for data.
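
A Java stand-in for the kicker (the real ones were native C/C++/C# programs): start the server, wait three seconds, then point the default browser at it. The jar name and port below are placeholders.

    import java.awt.Desktop;
    import java.net.URI;

    public final class Kicker {
        public static void main(String[] args) throws Exception {
            new ProcessBuilder("java", "-jar", "sequrama-server.jar").start();   // start the web server
            Thread.sleep(3000);                                                  // give it time to bind its port
            Desktop.getDesktop().browse(new URI("http://localhost:8080"));       // then fire up the browser
        }
    }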

Future SEQu-RAmA



Of course this is the initial prototype, so some improvements, tweaks, and alterations will be made over time. For example, configurable exclusions so that types of results like images, videos, and sound can be excluded.

Another possibility is to use the many results to determine the significance of links among the various search engines--a weight for each result. A result repeated across search engines increases its weight--M out of N search engines--and results of a type are then sorted accordingly.

Variations on input are possible, such as search on each parameter, and then create permutations of the parameter. Consider a search "Richard Nixon Watergate" and then search "Richard", "Nixon", "Watergate" and then "Richard Nixon", "Richard Watergate"--and then determine for each result from all search engines the weight of each result.

One wild possibility is a "deep search," where the initial results for an extension are used for a deeper search--extract data from the results (not possible for images/sounds/videos) and revise the query. Such a deeper search would require an existing query, with a button or link to trigger it, and a wait for the results since it requires actually fetching web data.

Nothing New



An acquaintance sneered at the SEQu-RAmA project/server, saying that "XYZ search engine works for me." I simply pointed out that if they're that happy, why be so disdainful--and that I use the "XYZ search engine" within a multiple parallel search. I recently read an article in Slate by Jessica Olien, "Inside the Box: People Don't Actually Like Creativity," about how creativity and innovation are supposedly prized but actually despised. No big surprise to me; asking "why" with the implication of wasted effort/time illustrates a lack of imagination, a dumbing down of thought.

But I will continue tweaking, improving, and revising SEQu-RAmA for my own use. It might be possible as a tablet app, or even a browser plug-in, although I prefer a separate web server as a proxy for the web browser. Seek and you shall accumulate, aggregate, and then find.