Wednesday, February 4, 2015

Whither SEQu-RAmA, Quo Vadis???

SEQu-RAmA has the goal of aggregating results from many search engines and then identifying the various result types for a search query. SEQu-RAmA works, but historically it is like Henry Ford's Quadricycle, or the Wright Brothers' Flyer--a prototype to determine if the concept is feasible, and to experience some of the difficulties in design/implementation/test.

Hence there is room for improvements to increase SEQu-RAmA's efficiency, but it already works for me, making my searches more effective. That leaves the question: what next for SEQu-RAmA?

I will always tinker with and improve the source code, especially when inspired. But for now it works fine, with a trade-off of speed against organizing links by type and status code with some color coding--think of ROY G. BIV plus grey/black colored links.

One decision was to limit the number of links returned--otherwise SEQu-RAmA would never stop, processing on and on for hours. I'm tweaking that parameter, currently about 100 links from each search engine queried. With 20 search engines that is 2000 links or less, but I asked myself: would I really click on all the thousands of references from a server? I might make the number of links a parameter set in the query page, or simply find the right balance between result quality and the number of links returned.
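A minimal sketch of that query-page parameter in Java--note that the parameter name and the clamp range are hypothetical illustrations of mine, not anything SEQu-RAmA currently defines:

    // Hedged sketch: read an optional per-engine link cap from the query
    // string, defaulting to 100 when absent or malformed. The clamp keeps
    // a user from requesting an unbounded number of links per engine.
    static int maxLinksPerEngine(String rawValue) {
        try {
            int n = Integer.parseInt(rawValue);
            return (n > 0 && n <= 1000) ? n : 100;
        } catch (NumberFormatException e) {
            return 100;   // Integer.parseInt(null) also lands here
        }
    }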

Faster, Faster

One performance area was checking for redundant links before processing them. Originally I used a treap--a tree-heap with randomization to keep the heap property--which gives logarithmic O(log N) performance. But constant O(1) performance seemed better--hence a hash table. I don't perform any thread mutex or locking--if thread 1 is about to insert a link into the hash table but gets preempted by thread 2, which inserts that same link first, then thread 2 processes the link and thread 1 skips it. Whichever thread wins the insert processes the link and puts it into the resultant data structure; the other does not.
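A minimal sketch of the idea in Java using ConcurrentHashMap--the class and method names here are my own illustration, not the actual SEQu-RAmA source:

    import java.util.concurrent.ConcurrentHashMap;

    // Each fetch thread calls claim() before processing a link. The map
    // handles synchronization internally, so no explicit mutex is needed:
    // whichever thread inserts the link first "wins" and processes it;
    // every later thread sees a non-null prior value and skips it.
    public class LinkDeduper {
        private final ConcurrentHashMap<String, Boolean> seen =
                new ConcurrentHashMap<>();

        /** Returns true only for the first thread to see this link. */
        public boolean claim(String url) {
            return seen.putIfAbsent(url, Boolean.TRUE) == null;
        }
    }

A fetch thread then simply writes: if (deduper.claim(url)) process(url);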

The last improvement uses an algorithm I created some time ago, the hash sort (with controversy--one fellow told me about his two PhDs and his best-paper award of the last 25 years, then declared the algorithm "impossible"...a classic mapper-versus-packer clash of mental models). Sorting by hashing? Oh yes. In another twist, I asked NIST to include the hash sort in their list of algorithms and data structures. The person responded by e-mail with "it's not a real sorting algorithm"...so now there is a gatekeeper for what is "real" as an algorithm--whatever that means...*shrug* As Mark Twain supposedly said, "Never let schooling interfere with your education."

But it was not the hash sort specifically that helped here--it was its data structure, a matrix, which is useful for storing links in an ordered way; each server thread can hash a link and determine if it has already been processed. Two operations--hashing to check if a link is already there, and storing it in an organized, ordered way--both in O(1) constant time.
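A hedged sketch of the matrix idea in Java--this is my own simplification for illustration, not the actual hash sort implementation:

    // Links hash to a fixed row of the matrix, so a membership test only
    // scans one short row, and an insert appends to it--effectively O(1)
    // for bounded row lengths. Walking the matrix row by row yields the
    // links in a stable, grouped order.
    public class LinkMatrix {
        private final String[][] matrix;
        private final int[] count;   // links stored per row

        public LinkMatrix(int rows, int cols) {
            matrix = new String[rows][cols];
            count = new int[rows];
        }

        private int row(String url) {
            return Math.abs(url.hashCode() % matrix.length);
        }

        public boolean contains(String url) {
            int r = row(url);
            for (int c = 0; c < count[r]; c++)
                if (matrix[r][c].equals(url)) return true;
            return false;
        }

        /** Returns false if already present or the row is full. */
        public boolean insert(String url) {
            int r = row(url);
            if (contains(url) || count[r] == matrix[r].length) return false;
            matrix[r][count[r]++] = url;
            return true;
        }
    }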

The Fewer, the Higher

One potential optimization is to give a fixed number of results but continue processing. The results page would show the links processed thus far, but also include a link to get more results. With 20 servers returning 100 links each, the user does not wait for all 2000 links, but gets the first 50 while the SEQu-RAmA server processes the rest in the background. Perhaps the number of initial links becomes a parameter set by the user: fewer initial links is faster, but means more clicking on the "more results" link.
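A minimal sketch of that buffering in Java--the class name and paging scheme are my own illustration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Fetch threads keep appending links in the background while the
    // results page serves whatever has arrived so far, one page at a
    // time; the "more results" link simply requests page + 1.
    public class ResultBuffer {
        private final List<String> links = new CopyOnWriteArrayList<>();

        public void add(String link) {   // called by the fetch threads
            links.add(link);
        }

        public List<String> page(int page, int pageSize) {
            Object[] snap = links.toArray();   // atomic snapshot
            int from = Math.min(page * pageSize, snap.length);
            int to = Math.min(from + pageSize, snap.length);
            List<String> out = new ArrayList<>(to - from);
            for (int i = from; i < to; i++) out.add((String) snap[i]);
            return out;
        }
    }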

Where To Now?

A remaining question is what to do with SEQu-RAmA besides using it when I'm researching a particular topic and looking for PDFs, image files, or web pages.

It seems I'm at a fork in the road travelled in creating, tweaking, and using SEQu-RAmA. One fork is open-source, the other is closed-source. I have some personal experience with open-source from past projects--a plugin for JBuilder, and an XML tokenizer.

Open source is a team effort, where the team is out there, somewhere on the Internet. Others take the open-source code, improve it, adapt it, and return the improved source code.

I had this experience with the plugin (it took a "snapshot" of a project and stored it in a ZIP file--very useful when loading the project to work on it, and before exiting JBuilder when done coding). In this case open-source was win-win, and helped make the JBuilder experience better.

My other experience was with an XML tokenizer I wrote (in some instances, why use a full XML parser when you only want to do some light processing of an XML document?). I put it online as open source, and people used it.

When I moved the source code from its initial website, I received many e-mails, some bordering on anger at my actions. No improvements came back--just users of the software library, with me as the author beholden to them.

So open-source was not effective; I had dozens of freeloading customers using my product, which I had to support and explain. When I posted on an XML newsgroup, one of the original creators of XML said he was nervous that the XML tokenizer did not validate the structure of a tokenized XML document. No duh!...it is a tokenizer--a lexer/scanner in compiler parlance. Adding validation is simple enough, but it turns the simple, fast tokenizer into a crude parser. So not only angry freebie customers, but a rebuke from an XML luminary.

The distinction between the two experiences: one had a team of contributors who were also users, improving the open-source software; the other had users but no improvements.

Capitalism

I am a capitalist: no pay, no way. I could use a standard proprietary license and sell SEQu-RAmA myself. The difference is that paying users would still be angry customers, only now with a financial string attached to SEQu-RAmA and to me; I'd still be making the improvements, fixing software glitches, and adding new features. This fork is the open-source one all over again, just with a proprietary license and license fees in the equation.

The important criterion is that most users want to use SEQu-RAmA, not tinker with the source code. SEQu-RAmA is a software app that a paying user wants to install and use. SEQu-RAmA itself is a multi-threaded web server that the user accesses with a web browser. But open-sourcing it and giving it away would hand the big search engines the software technology without any improvement or optimization returned.

Released either way, once the source code is out on the Internet, there is no way of going back...the Internet is a STORM--STore Online Read Many.

Silicon Venture Valley

It is ironic that I am not in the Computer Valley (a.k.a. Silicon Valley), as Robert X. Cringely calls it in his magnum opus "Accidental Empires: How the Boys of Silicon Valley Make Their Millions, Battle Foreign Competition, and Still Can't Get a Date".

SEQu-RAmA is a meta-search or super-search engine. I have the server on my MacBook and a desktop PC, but a user wants to search the search engines from a browser. Thus the Silicon Valley move: find a backer, and use a hosting service to run many instances of SEQu-RAmA responding to user queries. SEQu-RAmA is a web proxy that connects to many search engines. A user does not want to install and run a server on their computer; they want to simply search across many systems. Since the search goes through a web proxy, the search engine queried never sees the user's information--browser, screen size, IP address, and so forth.

The profit potential is similar to search engines: advertising based upon the query given. Another possibility is simply tracking link clicks relative to a query. If a user returns and clicks on another link, then the abandoned link is less significant than the next; the time between clicking a link and returning to click another could be useful for tracking which links from a server are most significant. Other possibilities are specialized search--for books, autos, etcetera--but using SEQu-RAmA to search many other search engines.
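A minimal sketch of the proxy effect in Java--the engine URL parameter and the user-agent string are placeholders of mine:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    // Because SEQu-RAmA issues the query itself, the engine only ever
    // sees the SEQu-RAmA server's IP address and headers--never the
    // details of the user's browser.
    public class ProxyQuery {
        public static InputStream query(String engineUrl, String terms)
                throws IOException {
            URL url = new URL(engineUrl + "?q="
                    + URLEncoder.encode(terms, "UTF-8"));
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // One fixed identity for every query, whoever asked.
            conn.setRequestProperty("User-Agent", "SEQu-RAmA/1.0");
            return conn.getInputStream();
        }
    }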

Going Elastic

There is an ironic, if strange, similarity to a project called Elasticsearch, a JSON search engine queried over HTTP. Elasticsearch stores documents which are then queried. SEQu-RAmA does not store the files searched from the Internet; it retrieves, organizes, and then presents the web sites/pages with an indication of status--efficient integration of information.

The irony is the Silicon Valley path: get funding for the SEQu-RAmA server, and possibly release an open-source version that enterprises could use--connecting all company sites in a single search for information within an organization. A university with different branch sites could be unified in its search for academics, administrivia, and so forth. One major goal would be to set up a leased server host for web search, then make income from ads, collecting browsing data by subject, among other things. Another goal would be optimization, making SEQu-RAmA faster...the major trade-off is accuracy for speed. On a server cluster, caching might allow for data mining, retaining popular web-wide search queries--in essence, the users become the key to search.

Where To?

So where is SEQu-RAmA going? It will remain a search tool I use for more efficient search across multiple servers. I'll tweak and improve SEQu-RAmA--for example, when a link is dead or unavailable, use the search server's cached copy of the link, or perhaps even the Wayback Machine, or maybe Mr. Peabody? Another possible improvement is to automatically rank search engines by the links they return but also by their speed. If SEQu-RAmA, based on the geographic location of the query, takes longer to reach an international web server, it might automatically adjust the timeout for that server to respond to the query.
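A hedged sketch of the dead-link fallback in Java: prefixing a URL with the Wayback Machine's web.archive.org/web/ path redirects to the most recent capture, if one exists (whether to first check availability through archive.org's Wayback APIs is left open here):

    // If a link came back dead (non-2xx status), hand the user a
    // Wayback Machine URL instead of the broken original.
    static String fallback(String deadUrl, int statusCode) {
        if (statusCode >= 200 && statusCode < 300) {
            return deadUrl;   // link is alive; keep it as-is
        }
        return "https://web.archive.org/web/" + deadUrl;
    }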

For now SEQu-RAmA works for me. So how do you search? SEQu-RAmA for you (mostly me right now, though...).