Wednesday, October 29, 2014

Data, Data Everywhere But Not a Link to Click

The title of this web log post is an allusion to "The Rime of the Ancient Mariner."

Yes, SEQu-RAmA and I are still here!

One problem with SEQu-RAmA as a meta-search or super search engine is that a group of search engines returns a flood of hyperlinks of different media types (html, pdf, xml) that might be valid or invalid. Since one function is to check/identify hyperlinks (for example, determine if a link is valid or missing), taking all the raw results returned and processing them efficiently is critical.

Hyperlink Processing

One approach is that each search thread (each search engine gets its own thread) processes its links and creates the necessary entries in a common data structure. It works, but it is woefully inefficient and slow; it reminds me of downloading files back in the days of dial-up with a 56K modem.

Consider N search engines that do a search and then process the results. The raw time to process a hyperlink is Tp. The same link is returned by M search engines (where M <= N), but each search thread processes the same link. Hence M redundant, repeated passes over the same hyperlink, and since only one of the M passes is actually needed, (M - 1) x Tp seconds of wasted time.
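
A minimal sketch of that redundant approach, in Java with placeholder names (checkLink, linkStatus) rather than the actual SEQu-RAmA code; the duplicate link is checked once per thread that receives it:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the original, redundant approach: one thread per search engine,
// and every thread checks each link it gets back, even a link that another
// engine's thread already checked. The checkLink() cost (Tp) is paid once
// per engine that returns the duplicate link.
public class NaiveLinkProcessing {

    // Shared store of processed links; names here are illustrative only.
    static final Map<String, String> linkStatus = new ConcurrentHashMap<>();

    // Stand-in for a real HTTP status check costing Tp per call.
    static String checkLink(String url) {
        return url.endsWith(".pdf") ? "200 pdf" : "200 html";
    }

    static void processResults(List<String> engineResults) {
        for (String url : engineResults) {
            linkStatus.put(url, checkLink(url)); // duplicate urls are re-checked
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Two "engines" return an overlapping link, so it is checked twice.
        List<String> engineA = List.of("http://www.fubar.org", "http://example.org/a.pdf");
        List<String> engineB = List.of("http://www.fubar.org", "http://example.org/b.html");

        Thread threadA = new Thread(() -> processResults(engineA));
        Thread threadB = new Thread(() -> processResults(engineB));
        threadA.start();
        threadB.start();
        threadA.join();
        threadB.join();

        System.out.println(linkStatus);
    }
}
```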

Very inefficient, although it works. Another aspect is determining the rank or frequency of a hyperlink, with the rank taking advantage of the shared information from all the search engines.

Datum Ranking

The duplicate hyperlinks are inefficient for each search thread to process, but if many different web search engines return the same hyperlink, that is a meaningful commonality. The question arises of whether different search engines use the same web search indexing algorithm, but the various search engines are like automobiles.

Automobiles are essentially the same, but different automakers take different design, engineering, and implementation approaches. A look at the many “top 25 best automobiles ever” and “top 25 worst automobiles ever” lists shows an extreme contrast in the features and properties of autos. The contrast illustrates that while doing the same thing, moving people from point A to point B, various automobiles are very different.

A safety precaution is to analyze the link results from each search engine queried, and if too many are duplicates of the overall results by some threshold, the search engine might be excluded. One possibility is searching the “metacrawler” search engines, which search existing search engines and thus _will_ duplicate results from the search engines queried individually.
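
Here is a sketch of such an overlap check in Java; the 0.9 cutoff and the names are assumptions for illustration, not values from SEQu-RAmA:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of a duplicate-overlap check: if an engine's links are almost all
// already present in the combined results, flag it as a candidate for
// exclusion. The 0.9 threshold is an assumed value, not a SEQu-RAmA setting.
public class OverlapCheck {

    static boolean mostlyDuplicate(Set<String> combined, List<String> engineLinks, double threshold) {
        if (engineLinks.isEmpty()) {
            return false;
        }
        long duplicates = engineLinks.stream().filter(combined::contains).count();
        return (double) duplicates / engineLinks.size() >= threshold;
    }

    public static void main(String[] args) {
        Set<String> combined = new HashSet<>(
                List.of("http://a.org", "http://b.org", "http://c.org"));
        // A metacrawler-style engine returning only links already seen.
        List<String> metacrawler = List.of("http://a.org", "http://b.org", "http://c.org");
        System.out.println(mostlyDuplicate(combined, metacrawler, 0.9)); // true
    }
}
```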

Still, the emphasis is on the collective results to give a rank to a hyperlink as a collective property, not an individual property. Individual properties include type (html, pdf, xml, zzzz), status (202, 404, 999), and the hyperlink text using string ordering (www.fubar.org < zzz.asleep-at-the-switch.info) of the link text.
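
A sketch of a node carrying those individual properties, ordered by the hyperlink text; the class and field names are only illustrative:

```java
// Sketch of a result node holding the individual (per-link) properties:
// media type, HTTP status, and the hyperlink text, with ordering on the
// text so that www.fubar.org sorts before zzz.asleep-at-the-switch.info.
// The class and field names are illustrative, not SEQu-RAmA's actual code.
public class LinkNode implements Comparable<LinkNode> {

    final String type;   // e.g. "html", "pdf", "xml", "zzzz"
    final int    status; // e.g. 202, 404, 999
    final String text;   // the hyperlink text itself

    LinkNode(String type, int status, String text) {
        this.type = type;
        this.status = status;
        this.text = text;
    }

    @Override
    public int compareTo(LinkNode other) {
        return this.text.compareTo(other.text); // string ordering of link text
    }

    @Override
    public String toString() {
        return type + " " + status + " " + text;
    }

    public static void main(String[] args) {
        LinkNode a = new LinkNode("html", 202, "www.fubar.org");
        LinkNode b = new LinkNode("pdf", 404, "zzz.asleep-at-the-switch.info");
        System.out.println(a.compareTo(b) < 0); // true
    }
}
```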

This collective ranking is not unlike auctioneering in embedded/control systems, or multiple processors on a space probe that “vote” to determine overall processor operation.

Hyperlink Efficient Processing

While the original duplicated effort works, an important skill a software engineer/computer scientist must have is to approach a problem from many different perspectives.

The approach I originally had was to create a micro-database as a common data store for all the returned hyperlinks from the various search engines. The problem is that a micro-database is overkill in functionality; the SEQu-RAmA results do not need database functionality.

But it is the start of the right idea. I liken the approach to Futurama, and the “master in-pile” at the Central Bureaucracy in the episode “How Hermes Requisitioned His Groove Back.”

The results do not need to be flexible; the store clusters results by type and orders the hyperlink text alphabetically. Another important feature is that the common data store, the “master in-pile,” must compute a rank for duplicate links.

The final approach is what I term a “cardinal map” data structure: a map that stores a node consisting of the type, status code, and hyperlink text, but adds a count or coefficient.

The cardinal map is primarily for inserting data, with two possible cases. The first case is a node not in the cardinal map, so it is simply inserted with the count equal to 1. The second case is that a node with the same data is already in the cardinal map; instead of inserting the node, it is retrieved and the count incremented.
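
A minimal sketch of the two insert cases, assuming a Java TreeMap keyed by hyperlink text; the class and method names are illustrative, not the actual SEQu-RAmA implementation:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the cardinal map insert: a node holds type, status, and link
// text plus a count; inserting a node already in the map just bumps its
// count. Class and method names are illustrative, not SEQu-RAmA's code.
public class CardinalMap {

    static final class Node {
        final String type;   // html, pdf, xml, zzzz
        final int    status; // HTTP status code, or 999 for unknown
        final String text;   // hyperlink text
        int count = 1;       // the cardinal count or coefficient

        Node(String type, int status, String text) {
            this.type = type;
            this.status = status;
            this.text = text;
        }
    }

    // A TreeMap keeps nodes ordered by hyperlink text, like an ordered set.
    private final Map<String, Node> map = new TreeMap<>();

    // Case 1: node not present -> insert with count 1.
    // Case 2: node present     -> retrieve it and increment the count.
    public void insert(String type, int status, String text) {
        Node existing = map.get(text);
        if (existing == null) {
            map.put(text, new Node(type, status, text));
        } else {
            existing.count++;
        }
    }

    public int rankOf(String text) {
        Node node = map.get(text);
        return node == null ? 0 : node.count;
    }

    public static void main(String[] args) {
        CardinalMap cm = new CardinalMap();
        cm.insert("html", 202, "www.fubar.org");
        cm.insert("html", 202, "www.fubar.org"); // duplicate from another engine
        cm.insert("pdf", 404, "zzz.asleep-at-the-switch.info");
        System.out.println(cm.rankOf("www.fubar.org")); // 2
    }
}
```

Keying on the hyperlink text is what gives the ordered-set behavior without the machinery of a micro-database.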

The cardinal map functions like an ordered set in avoiding duplicates and storing the nodes ordered by the data. The cardinal map is akin to the symbol table in a compiler, not just one data structure but a composition. A compiler symbol table allows the compiler to access the information that a specific programming language requires to enforce the language rules.

Results Organization

The results are clustered together by type, color coded by the status of the link, and listed in alphabetic order by the hyperlink text. Now the rank is used to order the links first, and then the alphabetic text.

Embedded links let the user click and go to a cluster of links by type, arranged by rank and hyperlink text. The color-coded link then displays the status of the link.
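
A sketch of that organization in Java, clustering by type and sorting by rank then hyperlink text; the record and method names are assumptions for illustration:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Sketch of organizing results: cluster by type, then order each cluster by
// rank (highest first) and then by hyperlink text. The record and its fields
// are illustrative stand-ins for the cardinal map nodes.
public class ResultsOrdering {

    record Result(String type, int status, String text, int rank) {}

    static Map<String, List<Result>> organize(List<Result> results) {
        Comparator<Result> byRankThenText =
                Comparator.comparingInt(Result::rank).reversed()
                          .thenComparing(Result::text);
        return results.stream()
                      .sorted(byRankThenText)
                      .collect(Collectors.groupingBy(Result::type,
                                                     TreeMap::new,
                                                     Collectors.toList()));
    }

    public static void main(String[] args) {
        List<Result> results = List.of(
                new Result("html", 202, "zzz.asleep-at-the-switch.info", 1),
                new Result("html", 202, "www.fubar.org", 3),
                new Result("pdf", 404, "www.fubar.org/paper.pdf", 2));
        organize(results).forEach((type, links) ->
                System.out.println(type + " -> " + links));
    }
}
```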

Two Special Cases in Results Organization

There are two special cases (or at least the two most obvious) for the HTTP status code. The two status codes are:

  1. HTTP Not Found (404) - web page is not found at hyperlink
  2. HTTP ZZZZ (999) - special case of unknown link status

An obvious question is “Why bother to include links that are not found and those that are unknown in status?”

HTTP 404 Not Found

The status code of HTTP 404 seems a dead end. After all, why click on a hyperlink that is not found at the site?

For a 404 not found, the answer is to use a cached copy of the page, either the cache link from the search engine or from the Wayback Machine.

ZZZZ 999 Unknown

The custom status code of ZZZZ 999 seems like an HTTP 404 status code, but there is a significant difference. HTTP 404 means the web page is not found, it does not exist; ZZZZ 999 means the status is unknown, it was not possible to verify the HTTP status code of the link.

999 ZZZZ is a special case: the status is unknown, so take a chance by visiting the hyperlink and/or including a link to a cache...something like a 40404.
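
A sketch of handling the two special cases when emitting a link, assuming the common web.archive.org/web/ URL form for an archived copy; the markup and names are illustrative:

```java
// Sketch of emitting the two special cases: a 404 link points at an archived
// copy using the common https://web.archive.org/web/<url> form, and a 999
// link is kept but labeled as unknown. The markup and method are illustrative.
public class SpecialStatusLinks {

    static String render(int status, String url) {
        switch (status) {
            case 404: // dead page: fall back to an archive/cache copy
                return "<a href=\"https://web.archive.org/web/" + url + "\">"
                        + url + " (404, archived copy)</a>";
            case 999: // unknown status: take a chance on the original link
                return "<a href=\"" + url + "\">" + url + " (status unknown)</a>";
            default:
                return "<a href=\"" + url + "\">" + url + "</a>";
        }
    }

    public static void main(String[] args) {
        System.out.println(render(404, "http://www.fubar.org/gone.html"));
        System.out.println(render(999, "http://zzz.asleep-at-the-switch.info"));
    }
}
```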

Color Coding of Hyperlinks

The color coding of the hyperlinks mixes two considerations in choosing a color for the status of a hyperlink. One part is aesthetics, choosing a color that is not harsh on the eyes. The other part is color logic, for example black for a status code of “web page not found” or 404, and grey for a status code of “unknown status” or 999.
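
A sketch of such a status-to-color table; black for 404 and grey for 999 follow the logic above, while the other colors are assumed placeholders:

```java
import java.util.Map;

// Sketch of a status-to-color table: black for 404 (dead link) and grey for
// 999 (unknown) follow the color logic above; the 202 entry and the default
// are assumed placeholder choices, not colors the post settles on.
public class StatusColors {

    static final Map<Integer, String> COLOR = Map.of(
            404, "black", // web page not found
            999, "grey",  // unknown status
            202, "green"  // assumed color for a valid, reachable link
    );

    static String colorFor(int status) {
        return COLOR.getOrDefault(status, "blue"); // assumed default color
    }

    public static void main(String[] args) {
        System.out.println(colorFor(404)); // black
        System.out.println(colorFor(503)); // blue (assumed default)
    }
}
```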

The choice of color is like the color coding of resistors in electronics, only the difficult question and choice is which colors for which HTTP status codes.

For example, black for a dead HTTP status 404 link gives forewarning, and perhaps grey as a warning for the 999 ZZZZ unknown status. But the question remains of what color to use for the other HTTP status codes...??