Thursday, May 15, 2014

Seek, Locate, SEQuRAmA: Weave a Search Web by Super-Search Engine

SEQuRAmA

I've continued tweaking SEQuRAmA--the Search Engine QUery Result AccuMulator Aggregator--and adding search-within-search features. In doing so, I organized the various features into five categories.

Five Categories

The five categories are:
  1. verification
  2. presentation
  3. operation
  4. optimization
  5. improvements

List of Features by Category

Some of the features I have implemented; others are on the "to-do" list of software improvements. Organizing the features into categories makes it easier to prioritize. Some features lead to others: implementing one facilitates implementing another, or, more simply, without one feature already working another cannot be so easily implemented. (Never say the word "impossible" in software; those are the famous last words of many who uttered it...)

The Features of SEQuRAmA

    Verification:

    1. Verify link media type (xml, pdf, html)
    2. Verify link exists

    Presentation:

    1. Color code link for reliability (exists, unknown)
    2. Cluster links together by type (xml, pdf, html, gif)
    3. Cluster similar links within domain
    4. Internal link to each cluster for easy navigation

    Operation:

    1. Cookie control from search engines (delete, store)
    2. No search engine advertisements in output results
    3. Query multiple search engines in tandem

    Optimization:

    1. Permutations on search keywords--for "Richard M. Nixon", variants are "Richard Nixon" and "Nixon"
    2. Rank each resultant link by commonality--for example, a link returned by 9 of 12 search engines has rank 9/12 = 0.75
    3. Connect a private, internal resource (such as an internal employee website portal) to an external, publicly accessible resource

    Improvements:

    1. For non-existent link, replace with "The Wayback Machine" link if available, or search engine cache, or both
    2. Using the highest-ranked results, tweak the search using content in the web pages (for example, the ID3 algorithm to classify a web page)
    3. Determine other search keywords from words/markup on web pages (from Nixon, get Watergate, trip to China)
    4. Store advertisements returned as separate results accessed outside primary results
    5. Use a presentation template that specifies how to organize and structure the results
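
The commonality rank under Optimization (9 of 12 search engines gives 0.75) can be sketched in a few lines of Java. This is only a minimal illustration, not SEQuRAmA's actual code: the class name CommonalityRank, the method rank, and the shape of the input map (engine name to the links it returned) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: ranks each link by the fraction of engines that returned it. */
class CommonalityRank {

    /**
     * @param resultsByEngine engine name -> links that engine returned (assumed input shape)
     * @return link -> rank between 0 and 1, ordered from highest rank to lowest
     */
    public static Map<String, Double> rank(
            Map<String, ? extends Collection<String>> resultsByEngine) {
        int engineCount = resultsByEngine.size();
        Map<String, Integer> hits = new HashMap<>();
        for (Collection<String> links : resultsByEngine.values()) {
            // A set ensures a link repeated by one engine counts once for that engine.
            for (String link : new HashSet<>(links)) {
                hits.merge(link, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(hits.entrySet());
        // Most engines first; ties are broken alphabetically so the order is deterministic.
        sorted.sort((a, b) -> {
            int byHits = Integer.compare(b.getValue(), a.getValue());
            return byHits != 0 ? byHits : a.getKey().compareTo(b.getKey());
        });
        Map<String, Double> ranked = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : sorted) {
            // Rank = engines that returned the link / engines queried.
            ranked.put(e.getKey(), e.getValue() / (double) engineCount);
        }
        return ranked;
    }
}
```

Counting each link once per engine keeps the rank between 0 and 1, so a link that every engine returns gets exactly 1.0.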

Modular, Multi-threaded, Multi-Class

One important implementation and design consideration is avoiding a monolithic block of Java source code. Each search is its own class and its own thread, implementing a common interface, so that each search engine is a module. One change I've considered is dynamically loading a bytecode .class file, and then unloading it if the search engine is down or unresponsive within a wall-clock time threshold. Other functionality accesses the raw results, stored in an ordered map (or multi-map, since each resultant datum is stored within nested ordered maps). The original "store" (using a term from Babbage's Analytical Engine) used the standard Java data structures, but for more performance (a need which creeps up as I add more search engine modules...) I used open-source code and implemented a more specialized interface for specific functionality rather than general-purpose operation.
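
To make the module idea concrete, here is a rough sketch of one way it could look; it is not the actual SEQuRAmA source. The names SearchModule, Accumulator, and queryAll are hypothetical, and the sketch swaps in a standard ExecutorService thread pool with a timeout in place of hand-managed threads or dynamic class loading and unloading. A module that fails or misses the wall-clock threshold is simply left out of the results.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical module contract: one implementation per search engine. */
interface SearchModule {
    String name();
    List<String> search(String query) throws Exception;
}

/** Hypothetical accumulator: runs every module on its own thread, dropping slow or failed ones. */
class Accumulator {

    public static Map<String, List<String>> queryAll(
            List<SearchModule> modules, String query, long timeoutMillis)
            throws InterruptedException {
        // Insertion-ordered map, standing in for the ordered "store" of raw results.
        Map<String, List<String>> results = new LinkedHashMap<>();
        if (modules.isEmpty()) {
            return results;
        }
        ExecutorService pool = Executors.newFixedThreadPool(modules.size());
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (SearchModule module : modules) {
                tasks.add(() -> module.search(query));
            }
            // invokeAll waits at most the timeout; tasks still running are cancelled.
            List<Future<List<String>>> futures =
                    pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (int i = 0; i < modules.size(); i++) {
                Future<List<String>> future = futures.get(i);
                if (future.isCancelled()) {
                    continue; // unresponsive within the time threshold
                }
                try {
                    results.put(modules.get(i).name(), future.get());
                } catch (ExecutionException e) {
                    // The module threw (engine down, bad response); skip it.
                }
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Adding a search engine then means writing one more SearchModule implementation and adding it to the list; nothing else in the accumulator changes.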

Do It Yourself

I created SEQuRAmA as a mix of fun, challenge, and a very useful tool for more efficient search online. One later possibility is to have SEQuRAmA work as a database of results, searchable through a web browser: turning a super-search engine into a local database, perhaps even extracting keywords from the query (using the search terms that led to the resulting link) and from the result to organize the data internally. Of course, every software engineer suffers from "creeping featuritis," and I'll have to reclassify these possible enhancements, but before then I will continue to tweak and improve the features already on the "to-do" list. Other features are superfluous...such as quote of the day, important historical events by date, jokes from search results...nice, but they add nothing to efficient search across multiple search engines. Seek, locate, accumulate (a pun on and paraphrase of the Daleks from Doctor Who...)