Saturday, June 28, 2014

May Your Filters Bloom

I'm rewriting SEQuRAmA for more cleaner code in version 2.0; version 1.0 was a prototype to see if a super search engine was possible. (But how often in software, the pressure to take the first prototype and continue to use it...) One goal is to organize search, aggregation, and reporting into self-contained i.e. encapsulated modules, and keep the multi-threaded search on each search engine self contained and mutually .

One method for efficiency is to avoid repeated links from the various search engines--repeated when the links are aggregated into the final result set of links. But, the result set, is a set--avoid duplicates during a check for if a link exists or not.

I initially tried the Bloom filter, a probabilistic hash structure giving a possibility of an element in a set. For N-search engines queried by the SEQuRAmA engine, there is a 1/N probability that a link is already in the resultant set. Part of my initial use was the Google's BigTable use of a Bloom Filter.

After implementing, I realized there was one problem. A Bloom filter gives a possible false positive for a link, but no false negative. Hence a link that is not in the set of results, might be given as in the results probabilistically.

If this were reversed, no false positives, but possible false negative. A negative would indicate the link is not in the set, but that could lead to duplicate efforts for repeated links.

While a Bloom filter has possibilities (no pun intended) it is not the best data structure, even with 1/N probability of a link in the result set. My thinking was guided by probabilities, but for the result set it requires "crisp logic" not probabilistic.

I've been looking at the code/theoretical features of other data structures for use in the final result set. I like to think of the SEQuRAmA result set is akin to the Futurama Central Bureaucracy "Master Inpile" that needs to be collated and filed. In this case the pile of links would be a probability, not a certainty, and a link is excluded because it was probabilistically in the result set.

Another feature of the Bloom filter was performance, an efficient hash (and I like hashing...on reason I created a sorting algorithm to sort by hashing--The Hash Sort), so very quickly would process links.

Fortunately the phase in the processing of links is more modular, so changing from a Bloom filter into data structure "X" is simple without any coupling to other functionality.

The next data structure I'll have to evaluate, often against various data structures, before deciding upon the final one to use. When in doubt, test it, and then try it. But without trying, the limits of a Bloom filter would not have surfaced. Good data structure, but incompatible with the functionality of the SEQuRAmA engine. To err is human, to repeat it is...DOH! (with an allusion to Homer Simpson).