Thursday, May 15, 2014

Seek, Locate, SEQuRAmA: Weave a Search Web by Super-Search Engine

SEQuRAmA

I've continued tweaking and adding search-within-search features to SEQuRAmA--the Search Engine QUery Result AccuMulator Aggregator. In doing so, I organized the various features under five possible categories.

Five Categories

The five categories are:
  1. verification
  2. presentation
  3. operation
  4. optimization
  5. improvements

List of Features by Category

Some of the features I have implemented; others are on the "to-do" list of software improvements. But by organizing the features into categories, it is easier to prioritize. Some features lead to other features: implementing one facilitates another, or, more simply, without one feature already working, another cannot be so easily implemented. (Never say the word impossible in software--famous last words of many who uttered it...)

The Features of SEQuRAmA

    Verification:

    1. Verify link media type (xml, pdf, html)
    2. Verify link exists

    Presentation:

    1. Color code link for reliability (exists, unknown)
    2. Cluster links together by type (xml, pdf, html, gif)
    3. Cluster similar links within domain
    4. Internal link to each cluster for easy navigation

    Operation:

    1. Cookie control from search engines (delete, store)
    2. No search engine advertisements in output results
    3. Query multiple search engines in tandem

    Optimization:

    1. Permutations on search keywords--for "Richard M. Nixon", variants such as "Richard Nixon" and "Nixon"
    2. Rank each resultant link by commonality--for example, across 12 search engines, a link returned by 9 of 12 has rank 9/12 = 0.75
    3. Connect private, internal resources (such as an internal employee website portal) to external, publicly accessible resources

    Improvements:

    1. For a non-existent link, replace it with a link to "The Wayback Machine" if available, or to a search engine cache, or both
    2. Using the highest-ranked results, tweak the search using content in the web pages (for example, the ID3 algorithm to classify a web page)
    3. Determine other search keywords from words/markup on web pages (from Nixon, get Watergate, trip to China)
    4. Store advertisements returned as separate results accessed outside primary results
    5. Use a presentation template that specifies how to organize and structure the results

Modular, Multi-threaded, Multi-Class

One important implementation and design consideration is avoiding a monolithic block of Java source code. Each search engine is its own class and its own thread, implementing a common interface, so that each search engine is a module. One change I've considered is dynamically loading a bytecode .class file, and then unloading it if the search engine is down, or unresponsive within a wall-clock time threshold. Other functionality accesses the raw results, stored in an ordered map (or multi-map, since the resultant datum is stored within nested ordered maps). The original "store" (using a term from Babbage's Analytical Engine) used the standard Java data structures, but for more performance (the need crept up as I added more search engine modules...) I used open-source code, implemented with a more specialized interface for specific functionality rather than general-purpose operation.
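As a rough sketch of what I mean by a module (the names here are illustrative, not the actual SEQuRAmA source), each search engine reduces to a small interface that the web server drives; dynamic loading is then a matter of Class.forName or a URLClassLoader instantiating the module by name, and an unresponsive module is simply dropped from the array.

    // Illustrative module interface--hypothetical names, not the actual SEQuRAmA classes.
    import java.util.List;

    public interface SearchModule extends Runnable {
        String name();              // e.g. "SomeEngineModule" (hypothetical)
        void submit(String query);  // hand the module its query before its thread starts
        boolean isReady();          // true once the raw results are stored
        List<String> rawResults();  // raw result links; empty if the engine was down or timed out
    }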

Do It Yourself

I implemented SEQuRAmA as a mix of fun, a challenge, and a very useful tool for more efficient search online. One later possibility is to have SEQuRAmA work as a database of results, searchable through a web browser. Turning a super search engine into a local database, perhaps even extracting keywords from the query (and using the search terms that led to the resulting link) and from the result to organize the data internally. Of course, every software engineer suffers from "creeping featuritis," and I'll have to reclassify these possible enhancements, but before then continue to tweak and improve the features already on the "to-do" list. Other features are superfluous...such as a quote of the day, important historical events by date, jokes from search results...nice, but they add nothing to efficient search across multiple search engines. Seek, locate, accumulate (a pun on and paraphrase of the Daleks from Dr. Who...)

Thursday, January 9, 2014

Seek, and You Shall--Aggregate the Results



I search the Internet (who doesn't in the 21st century?--rhetorical question) using the many search engines to find information. An Internet search is not necessarily "one size fits all" with a single search engine. But it is tedious and tiresome to search on a variety of search engines, or even on search engines that search using other search engines. My goal was to search more efficiently--do more with less.

Project



This project has the goal of more efficient search--search using other search engines, but from a single point--a locus or nexus. The operation is: query many a search engine, then accumulate and aggregate the results. Thus Search Engine QUery-Results AccuMulator Aggregator, or SEQu-RAmA--"Seek you rama". Remember the Big Three automaker's "Futurama" exhibit, or the cartoon with Philip J. Fry?

The search engine is a simple web server written in Java (so it runs on the different computers I have), running locally on each machine.

The web server always prompts for input using an HTML5 web page, with JavaScript for some simple checks (no empty search) and functions (clear input).


To use the search engine, I just configure a browser to open "http://localhost:PORT" by default when the web browser is run. The web server uses the identity of the web browser when it connects--in effect, the web server is a proxy for the browser, an extension of the browser in a web server.
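As a rough illustration of the web server side (not the actual SEQu-RAmA source--the port and page content are placeholders), the JDK's built-in com.sun.net.httpserver package is enough to serve the search form:

    // Bare-bones illustration using the JDK's built-in HttpServer; port and page are placeholders.
    import com.sun.net.httpserver.HttpExchange;
    import com.sun.net.httpserver.HttpHandler;
    import com.sun.net.httpserver.HttpServer;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;

    public class MiniSearchServer {
        public static void main(String[] args) throws IOException {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/", new HttpHandler() {
                public void handle(HttpExchange exchange) throws IOException {
                    // Serve the HTML5 search form; a real handler also parses the query parameters.
                    byte[] page = "<!DOCTYPE html><html><body><form><input name='q'></form></body></html>"
                            .getBytes("UTF-8");
                    exchange.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
                    exchange.sendResponseHeaders(200, page.length);
                    OutputStream out = exchange.getResponseBody();
                    out.write(page);
                    out.close();
                }
            });
            server.start();  // the browser's default page is http://localhost:8080
        }
    }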

Implementation



Each search engine has particular idiosyncrasies, hence there is an abstract class that is sub-classed for each particular search engine. Each search engine has its own sub-class, which also implements threading using Runnable. Search results are returned in a "public String[] results" along with a "public boolean readyFlag = false".

Once search parameters are submitted, the web server creates a series of threads to search in parallel, held as an array "SearchEngine[] engine". Synchronization is unnecessary, as the web server polls each element, so that "engine[3].readyFlag" being true indicates that search is complete. Each search always stores the results in "String[] results" first and only then sets the flag. Hence if the web server reads readyFlag as false while the thread is updating it to true, there is no problem--the web server will simply check it again.
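A minimal sketch of that pattern, following the names in the post (the sub-class details are illustrative):

    // Minimal sketch; the field names follow the post, the rest is illustrative.
    public abstract class SearchEngine implements Runnable {
        public String[] results = new String[0];     // default: zero results
        public volatile boolean readyFlag = false;   // volatile, so the polling web server sees the update
        protected final String query;

        public SearchEngine(String query) { this.query = query; }

        // Each sub-class fetches and parses its own engine's result page.
        protected abstract String[] search(String query) throws Exception;

        public void run() {
            try {
                results = search(query);
            } catch (Exception e) {
                results = new String[0];              // timeout or parse error: keep the zero-result default
            } finally {
                readyFlag = true;                     // set last, only after the results are stored
            }
        }
    }

    // In the web server: one thread per engine, then poll the flags.
    //   SearchEngine[] engine = ...;                 // one sub-class instance per search engine
    //   for (SearchEngine se : engine) new Thread(se).start();
    //   for (SearchEngine se : engine) while (!se.readyFlag) Thread.sleep(10);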

Results



Each search engine defaults to returning a zero-length array of results and setting readyFlag to true if a connection times out or some other error occurs--a log file is kept for each web search.

When "engine[3].readyFlag" is true, the web server reads "engine[3].results" using a loop from 0 to "engine[3].results.length", inserting them into an index structure that is a set--no duplicates.

Once all the searches are complete and the results are in the index structure, the results are processed looking for the kind of result by type--.html, .xml, .pdf, .jpeg, .asp, etcetera. The extension extraction algorithm extracts the extension, then inserts the result into an ordered map of sets...each extension has a set containing all results with that extension. There are no predefined extensions; they are created dynamically for each extension encountered. This has the quirk that .htm and .html have their own sets, and sometimes one extension might have only one result.
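A sketch of the extension grouping might look like the method below (illustrative, not the actual source); the ordered map of sets keeps the extensions sorted, and each set keeps its links unique:

    // Illustrative sketch of grouping links by extension into an ordered map of sets.
    import java.util.Set;
    import java.util.SortedMap;
    import java.util.SortedSet;
    import java.util.TreeMap;
    import java.util.TreeSet;

    public static SortedMap<String, SortedSet<String>> groupByExtension(Set<String> allLinks) {
        SortedMap<String, SortedSet<String>> byExtension = new TreeMap<String, SortedSet<String>>();
        for (String link : allLinks) {                 // allLinks: the de-duplicated index set
            String path = link;
            int q = path.indexOf('?');                 // drop any query string
            if (q >= 0) path = path.substring(0, q);
            int slash = path.lastIndexOf('/');
            String last = (slash >= 0) ? path.substring(slash + 1) : path;
            int dot = last.lastIndexOf('.');
            String ext = (dot >= 0) ? last.substring(dot + 1).toLowerCase() : "(none)";
            SortedSet<String> bucket = byExtension.get(ext);
            if (bucket == null) {                      // extensions are created dynamically
                bucket = new TreeSet<String>();
                byExtension.put(ext, bucket);
            }
            bucket.add(link);                          // a set, so no duplicate links per extension
        }
        return byExtension;
    }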

After all links are stored in the ordered map of sets, the results are presented in a predefined HTML5 format, with embedded internal links to allow quickly accessing results by type. The actual links are presented in a table, with each type having a sub-table. The output format is functional and navigable, but lacks the previews and other extras of the main search engines.

Kickin' Off



I have SEQu-RAmA on my Mac, Linux, and Windows computers. I've created a "kicker" program--a simple application that first starts the web server, delays for 3-seconds, and then fires up the web browser. The fun (or hard) part was to create a native kicker using g++ on my Mac, gcc on Linux, and C# on the Windows machine. The icon is pretty basic on the desktop, but it works: the browser automatically connects to the web server. Click the icon, connect to the web server, and search in parallel for data.
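The native kickers are C, C++, and C# binaries, but the three steps are the same everywhere; here sketched in Java for illustration (the server jar name and port are hypothetical):

    // The same three steps as the native kickers, sketched in Java; names are hypothetical.
    import java.awt.Desktop;
    import java.net.URI;

    public class Kicker {
        public static void main(String[] args) throws Exception {
            new ProcessBuilder("java", "-jar", "sequrama-server.jar").start(); // 1. start the web server
            Thread.sleep(3000);                                                // 2. delay for 3 seconds
            Desktop.getDesktop().browse(new URI("http://localhost:8080"));     // 3. fire up the browser
        }
    }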

Future SEQu-RAmA



Of course this is the initial prototype, so some improvements, tweaks, and alterations will be made over time. For example, configurable exclusions so that types of results like images, videos, and sound are excluded.

Another area of possibility is to use the many results to determine the significance of links among the various search engines--a weight for each result. A result repeated across search engines increases the weight--appearing in M out of N search engines--and results of a type are then sorted accordingly.

Variations on input are possible, such as searching on each parameter, and then creating permutations of the parameters. Consider a search "Richard Nixon Watergate": then search "Richard", "Nixon", "Watergate", and then "Richard Nixon", "Richard Watergate"--and then determine, for each result from all search engines, the weight of each result.
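Generating the keyword combinations is straightforward; a sketch (illustrative only):

    // Sketch: every non-empty subset of the keywords becomes a candidate query.
    import java.util.ArrayList;
    import java.util.List;

    public static List<String> combinations(String[] keywords) {
        List<String> queries = new ArrayList<String>();
        int n = keywords.length;
        for (int mask = 1; mask < (1 << n); mask++) {
            StringBuilder q = new StringBuilder();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    if (q.length() > 0) q.append(' ');
                    q.append(keywords[i]);
                }
            }
            queries.add(q.toString());
        }
        return queries;
    }
    // combinations(new String[] {"Richard", "Nixon", "Watergate"}) yields "Richard", "Nixon",
    // "Richard Nixon", "Watergate", "Richard Watergate", "Nixon Watergate", "Richard Nixon Watergate".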

One wild possibility is a "deep search" where the initial results for an extension are used for a deeper search--extract data from the results (not possible for images/sounds/videos) and revise the query. Such a deeper search would require an existing query, with a button or link to trigger it, and a wait for the results, since it requires an actual fetch of web data.

Nothing New



An acquaintance sneered at the SEQu-RAmA project/server, saying that "XYZ search engine works for me." I simply pointed out that if they're that happy, why be so disdainful--and that I use the "XYZ search engine" within a multiple parallel search. I recently read an article from Slate by Jessica Olien, "Inside the Box: People Don't Actually Like Creativity," about how creativity and innovation are supposedly prized, but actually despised. No big surprise to me; asking "why" with the implication of wasted effort/time illustrates a lack of imagination, a dumbing down of thought.

But I will continue tweaking, improving, and revising SEQu-RAmA for my own use. It might be possible as a tablet app, or even a browser plug-in, although I prefer a separate web server as a proxy for the web browser. Seek and you shall accumulate, aggregate, and then find.

Thursday, March 24, 2011

A Scan in Time Saves a Lexeme per Nanosecond in this Rhyme

I have implemented a scanner/lexical analyzer for a compiler of a programming language (the language itself is not relevant; the discussion of implementation and language design issues is for a later web log entry or monograph). Originally, I implemented the compiler using JavaCC, but now have written the scanner (and the rest of the compiler) from scratch. After some tweaking and improving of the hand-implemented scanner, I decided to test the performance by comparing the two scanners on a test file.
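To give the flavor of it, a hand-written scanner is essentially a loop over the characters with a branch per lexeme class; the sketch below is illustrative, not the actual scanner for my language:

    // Minimal sketch of the shape of a hand-written scanner loop (illustrative only).
    public final class Scanner {
        private final String src;
        private int pos = 0;

        public Scanner(String src) { this.src = src; }

        public String nextLexeme() {                     // returns null at end of input
            while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
            if (pos >= src.length()) return null;
            int start = pos;
            char c = src.charAt(pos);
            if (Character.isLetter(c)) {                 // identifier or keyword
                while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
            } else if (Character.isDigit(c)) {           // number
                while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
            } else {
                pos++;                                   // single-character operator or punctuation
            }
            return src.substring(start, pos);
        }
    }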

Performance Tests



The test file contains 10,000-lexemes, or tokens--a random pastiche of gibberish built from the words of the programming language. Each test run is repeated several times to find an average time per token for each scanner. JavaCC is efficient and pretty cool in that the scanner is included with the parser in the generated source code (wish I'd had JavaCC when I took a compiler course in college using the infamous/famous "Dragon Book" and Lex/Yacc).

Each test run was repeated many times, taking the minimum of the average time per lexeme, in nanoseconds.
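The timing loop is roughly the shape below (illustrative; RUNS and testFileContents are placeholders, and the real harness drives both scanners the same way):

    // Rough shape of the timing loop: many runs, keep the minimum per-lexeme time.
    long best = Long.MAX_VALUE;
    for (int run = 0; run < RUNS; run++) {
        Scanner scanner = new Scanner(testFileContents);    // the 10,000-lexeme test file
        int count = 0;
        long t0 = System.nanoTime();
        while (scanner.nextLexeme() != null) count++;
        long perLexeme = (System.nanoTime() - t0) / count;  // average nanoseconds per lexeme this run
        if (perLexeme < best) best = perLexeme;
    }
    System.out.println("minimum time per lexeme: " + best + " ns");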


  1. JavaCC scanner: 7873-nanoseconds

  2. Custom scanner: 537-nanoseconds



The ratio is 14.66, or approximately 15-to-1, on a Windows 7 AMD64 desktop.


  1. JavaCC scanner: 9246-nanoseconds

  2. Custom scanner: 733-nanoseconds



The ratio is 12.61, or approximately 13-to-1, on a Mac OS X 10.6.7 Intel notebook.

I expected slightly better performance from the custom scanner, but the approximate ratio of 14:1 was surprising. In retrospect, the JavaCC generator creates Java source code for the most general case of any potential programming language--hence not the most optimal code.

Consider a program of 100,000-lines of source code, with 10-lexemes per line. The source code is then 1-million lexemes. At the per-lexeme times above, the total time to scan the file on Windows 7:


  1. JavaCC: 7873-milliseconds or 7.873-seconds, approximately 8-seconds.

  2. Custom: 537-milliseconds or 0.537-seconds, approximately 0.5-seconds, or a half-second.



Total time to scan the file on Mac:


  1. JavaCC: 9246-milliseconds or 9.246-seconds, approximately 9-seconds.

  2. Custom: 733-milliseconds or 0.733-seconds, approximately 0.75-seconds, or three-quarters of a second.



Optimize Bit by Bit



I had a work colleague chide me for optimizing code (including legacy code 10-years old, a mix of C and C++ for Windows, Solaris, Linux, and some other archaic systems) as I fixed legacy bugs and other glitches. Eventually a team lead for another group stopped by, asking about the performance--nearly 3x that of the older versions of the software package.

I made a joke about the Beta version leaking out, but later informed the team lead that it would require a reinstall of the software on systems on-site and at overseas sites. About a week later, I sat down with the team lead and worked out a schedule and plan to install the latest, optimized build on temporary systems, then reinstall and replace the older, less optimal systems.

But the secret was to optimize and improve code whenever a bug, defect, or glitch needed to be corrected. I remember that in some places the comments and the reported error messages were cryptic at best and misleading at worst, so I optimized those as well. I left the old comments in place (at the time, a popular software methodology called for purging all comments to wipe the slate clean, but that is like learning from a mistake by getting amnesia) but added other comments to explain the change. Over several months, as I fixed problems in the legacy code, performance also improved bit by bit.

Optimize Gradually Design Implementation



For the language compiler, I'm trying to get each stage of the compiler as optimal as possible. Each slight increase in performance buys some time for other, more complex compiler functionality, such as semantic checks or optimizing code generation, etcetera. Optimize software, but a piece at a time.

An old rule of thumb (or heuristic) from the structured software paradigm was not to try to optimize performance in one particular place; it was lots of little things to optimize. Optimize by the ingredients, not the overall recipe after it has baked/boiled/fried (to use a cooking metaphor, as I love to cook...)

Saturday, October 9, 2010

Have Data, Will Sort

I like to create, analyze, test, and tinker with algorithms, especially sorting algorithms. Lately, sorting has pursued hybridized sorts, like the Introsort in C++, and more recently the TimSort. The TimSort is slated to become the default sort algorithm for object arrays in the JDK. TimSort is a hybridized merge sort with insertion sort that first emerged in the Python realm. Another hybrid algorithm is the FlashSort, which uses insertion sort to finish sorting after a partial sort. Yet another hybridized algorithm is the Dual-Pivot Quicksort. It seems elegance is lost in sorting, and algorithms are spliced together.

The TimSort requires a Comparator to be implemented--a specific class with a function to compare any two possible objects that can be defined in Java. No more magic comparison defined on integers, floats, doubles, or characters from the encoding--and likewise strings (an array of characters, from a C perspective).
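A Comparator in Java is just a class with a compare method that orders any two objects of a type; for example (illustrative), ordering strings by length:

    // A Comparator: a class whose compare method orders any two objects of a type.
    import java.util.Arrays;
    import java.util.Comparator;

    public class ByLength implements Comparator<String> {
        public int compare(String a, String b) {
            return a.length() - b.length();   // lengths are small and non-negative, so no overflow
        }
    }

    // Usage: Arrays.sort on an object array, which is where the JDK plans to use TimSort.
    String[] words = { "pivot", "merge", "insertion", "sort" };
    Arrays.sort(words, new ByLength());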

One sorting algorithm I created and tested was the Binar Sort, a sorting algorithm that uses the binary digits of the keys, partitioning on each binary digit position. Further contemplation, tinkering, and experimenting led to another sorting algorithm, the WillSort. I won't explain how the algorithm works--that's for a research monograph--but my testing has compared it to the TimSort, QuickSort, MergeSort, FlashSort, and Dual-Pivot Quicksort. The test was on 32-bit integers in a random permutation, to avoid the worst case of QuickSort and to force all the algorithms to work without the special case of a partly or nearly sorted data set.

To push the performance comparison, the test was with large data sets of 100-million to 500-million integers. After each sort, a static method checked that the array of integers was sorted, independently of the sorting algorithm, as a cross-check for any anomalous or unexpected results. The same data set was used, randomly shuffled with the same seed to the pseudo-random number generator each time. Thus, each sorting algorithm worked on the same size, same permutation data set of 32-bit integers. All things were equal except the sorting algorithm used.
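The harness was along these lines (a sketch, not the actual test code--the seed and size are illustrative, and Arrays.sort stands in for whichever algorithm is under test):

    // Sketch of the test harness: same seed, same permutation, independent verification.
    import java.util.Arrays;
    import java.util.Random;

    public class SortHarness {
        static void shuffle(int[] a, long seed) {          // Fisher-Yates with a fixed seed
            Random rng = new Random(seed);
            for (int i = a.length - 1; i > 0; i--) {
                int j = rng.nextInt(i + 1);
                int t = a[i]; a[i] = a[j]; a[j] = t;
            }
        }

        static boolean isSorted(int[] a) {                  // independent cross-check after the sort
            for (int i = 1; i < a.length; i++)
                if (a[i - 1] > a[i]) return false;
            return true;
        }

        public static void main(String[] args) {
            int[] data = new int[100000000];                 // 100-million 32-bit integers
            for (int i = 0; i < data.length; i++) data[i] = i;
            shuffle(data, 42L);                              // illustrative seed, identical for every algorithm
            long start = System.currentTimeMillis();
            Arrays.sort(data);                               // stand-in for the algorithm under test
            long elapsed = System.currentTimeMillis() - start;
            if (!isSorted(data)) throw new AssertionError("sort failed");
            System.out.println(elapsed + " ms");
        }
    }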

The platform was a JVM with a 6-gigabyte heap, on a Windows PC running at 2.7GHz. Here is some raw data from a random permutation (all times in milliseconds):

Size             Quick     Merge     Flash       Tim    2Pivot      Will
100-million      16926     24008     32105     54709     55116      9706
200-million      35849     53352     82336    126391    126329     19594
300-million      55661     77454    108218    238945    386350     32308
400-million      73289    149573    182769    426597    407020     40482
500-million      97515    307663    224577    990444    984796     51761

Interestingly, but not surprisingly, QuickSort (and not the 2-pivot QuickSort) maintains a speed advantage over the other standard algorithms. WillSort is the best, and does not have any sudden spike or performance quirk. MergeSort and FlashSort are in the middle. TimSort and the 2-pivot QuickSort lag behind, even though they are hybridized descendants of MergeSort and QuickSort. The FlashSort partially sorts and then uses the insertion sort to finish, so it is not a composite of two full sorts--the flash pass only partly sorts.

The ratio between QuickSort and WillSort is 1.7 to 1.8, fluctuating up to 1.88 at 510-million (the increment between each interval was 10-million integers). Hence the time spent processing each element (processing, for QuickSort, being pivot selection and comparison-based partitioning) differs by a factor of 1.7 to 1.8 between QuickSort and WillSort.

It seems perplexing that the focus is on TimSort and the 2-Pivot Quicksort, when the plain QuickSort remains more efficient, and there are other potential algorithms (of course, the WillSort).

In the worst case, ignoring other possibilities, and given that the Introsort has been time-tested by the wider group of C++ Standard Template Library developers, a port of the Introsort from C++ to Java seems the more plausible choice.

But have data, will sort using WillSort.

Sunday, March 14, 2010

Binar Sort--the One Thing

I put the draft monograph for a new sorting algorithm, the binar sort, online. The response from readers varied in the specifics, but was invariably of the form "impossible to sort in linear time," or "no, it's not" (double negative), or "because of this, it proves your algorithm is not linear."

All negative statements, proving a point by making impossible-to-prove negative assertions (remember basic logic: you cannot prove a negative, that something is not). One individual with incredibly stunning powers of observation said (to paraphrase): it will only work with binary data. I didn't bother to explain digital logic, bits, or what is under the hood inside memory chips, coprocessors, a microprocessor, or other computer hardware. A criticism by making an assertion of the obvious.

I consider myself a scientist, in particular a computer scientist. If I can be proven wrong, so be it. The broadening of knowledge and greater understanding is what matters. But the critics seem to argue by making unprovable or obvious statements, which only demonstrates one-dimensional thinking, lacking any depth or understanding. Even worse, all the critics did one thing consistently--never asked for the source code in Java, C#, or C to try the algorithm for themselves.

Algorithms have a theoretical basis and analysis, but unlike other scientific domains, computer science is applied for an algorithm--it can be run, tested, and evaluated. But not one person did that--it was all assertions and commentary.

Of course, many technical papers and research monographs are theoretical, and often a practical implementation is non-existent. Once, after reading about a data structure, I e-mailed the authors about an implementation. They used the abstract pseudo-code notation popular in algorithms books, but had not implemented the data structure. The response to my query was "it is trivial to implement," in the usual dismissive condescension. The response was so trivial as to be banal and predictable. In this case, theory did not transfer into the praxis of application--an applied example.

But in the case of the binar sort, I have a Java implementation. So it would only take the chutzpah to codify the binar sort, create a bytecode class, and run it with different test sets. Any incongruous results would pique my interest--there would be a case where theory and practice do not correspond, hence an unknown factor. Learning and discovering the unknown factor increases knowledge and experience--adding to scientific knowledge.
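To give a sense of how little code it takes to try such a thing, here is a minimal sketch in the spirit of "partition by each binary digit position"--the draft monograph has the actual binar sort; this illustration only handles non-negative 32-bit integers:

    // Minimal bitwise-partition sketch (illustrative; non-negative ints, highest bit first).
    static void bitPartitionSort(int[] a, int lo, int hi, int bit) {
        if (lo >= hi || bit < 0) return;
        int i = lo, j = hi;
        while (i <= j) {                                  // 0-bit values to the left, 1-bit values to the right
            while (i <= j && ((a[i] >> bit) & 1) == 0) i++;
            while (i <= j && ((a[j] >> bit) & 1) == 1) j--;
            if (i < j) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
        }
        bitPartitionSort(a, lo, j, bit - 1);              // recurse on each half with the next bit
        bitPartitionSort(a, i, hi, bit - 1);
    }
    // Usage: bitPartitionSort(data, 0, data.length - 1, 30);  // bits 30 down to 0 for non-negative ints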

Unfortunately, the verbiage only demonstrates that new things are more easily dismissed as impossible than an individual is willing to re-learn and re-think what they have known.

Computer science, in that sense of being testable and immediately runnable, is one of the better sciences. An intrepid person can tinker with an algorithm, data structure, or library; refine and improve it; find anomalous behavior; and rewrite it in another programming language for a different platform. Hence the science is tangible, immediate, and testable.

The critics are demonstrating intellectual and scientific sloth and laziness--idleness by negative criticism. Dismissing something as impossible removes the need to reconsider existing knowledge and integrate the new fact.

The wonderful book "The Programmer's Stone" discusses this in detail, and the different mindsets--packers and mappers.

The binar sort is an algorithm, but another kind of experiment was implicitly conducted--on the behavior of those who comment. It is somewhat interesting (although too familiar) to see that everyone did the same thing--inaction, never bothering to test the code and evaluate it for themselves first hand.

Questioning and challenging a scientific experiment is part of science, but so is replicating an experiment to verify results. That's pretty obvious, particularly in high school where you replicate the experiments in physics and chemistry to verify well established scientific laws and constants.

But when the scientific process is only through rebuttal and making intangible, unprovable assertions--that's the realm of politics, word games, and inane chatter--political science (talk about an oxymoron) and sophistry. In a nutshell, the binar sort is an applied example illustrating the adage--you can lead a horse/jackass/donkey to water, but you cannot make it drink. Or as Mark Twain once said, "Don't let education get in the way of your learning." Too late.

Saturday, February 27, 2010

Schadenfreude of False Friends or Schaden-Freunde

The German word "schadenfreude" means joy at another's misfortune--a shameful joy--but life is replete with what I have coined the "schaden-Freund" or "shameful friend." These are the friends that are there only in passing, or worse, the ones that put a person down if only to feel superior. As Frank Sinatra croons in his song "That's Life": "Some people get their kicks / Stomping on a dream"--in this case the "dream" is the aspirations of a friend.

Such friends are kept at a distance, as they are far, far, worse than an enemy.

These "friends" reveal themselves, often at a critical point where friendship is needed.

For example, I had a friend I met at a previous job, called Don. Looking back, he was a schaden-Freund, as it was always conversational one-ups, one better, always besting whatever I said. At the time, I naively took such talk as bravado, someone trying to impress. I was a friend to Don, helping him, and being generous in sharing books and papers, even copying some free material and Linux software so he did not have to find it, download it, and get it to work.

Later, when I was laid off and asked Don for a reference, the true colors of the schaden-Freund were revealed. In an e-mail, Don gave me his "I only give references to people I've worked with at least three years." How very considerate of him. Not!

Later when I began work on my computer architecture book, Don in an e-mail said "Why would you want to write a book with so many out there?" Stomping on a friend's dream. Simple--I want to, same reason people run, or climb mountains, or compose music, or play the piano. But I owed this false friend no explanations.

Later, after a few years, I got an e-mail from him, and again he said "I remember you were writing a book...". I did not respond. Don would want something--after all, he suddenly contacts me after several years--and when I would not have it, he would put me down or make some nasty comment, then disappear again feeling superior. A schaden-Freund needs someone to vomit their human excrement upon.

When he tried to contact me over LinkedIn, I blocked him. I also added his name to my e-mail blacklist. Don was a schaden-Freund, and the only way to deal with them is to avoid them. Having no reason to contact or communicate, I severed all ties. A person that questions anything you do, does not share any sense of accomplishment, and does not help--all out of their own need for self-importance and self-aggrandizement--is a person you do not need. I would be glad for a friend that was writing a book to be published, and if I could, would give a reference for a job to help them. The complete opposite applied to Don, so I need that like I need the swine flu.

Another schaden-Freund was Joe, whom again I met through work at a previous job. We were friends after both of us had left the company. But I noticed Joe always took a condescending view of patronizing superiority. When I considered going back to college once, Joe immediately started telling me about deadlines, forms, and documents--things I was well aware of. I made a point of asking Joe if he was my guidance counselor, telling him that, having been to college and some graduate school, I was well aware of admission requirements and deadlines. That was the first, but not the last, time.

Later I had moved away, but Joe in his e-mails was always dispensing pithy advice, like to join the Peace Corps and move to a third-world country. Unfortunately, my background is computer science, not much use in the bush of Africa. Also, I like air conditioning, hamburgers, milk shakes, Internet access--all the things of the United States, a first-world nation. I pointed out to Joe that a quote of mine is "free advice is often worth the price you pay for it." It was as if Joe did not know me, even though he talked to me, worked with me, and interacted with me--as if I did not exist.

His response was "I was only trying to help," to which I responded, trying to at least keep a dialog. But no response...if he can't dispense his pithy wisdom, he does not communicate. A friend who constantly offers advice, but only because it makes them feel self-important--and the worst kind of advice, based on nothing about you. That is one reason I NEVER offer advice, even if directly asked by a friend. I can empathize and relate my own situation in a similar event, perhaps state what I did, but not say what someone should do, or how they should do it, simply to fill my own empty, rotten insides with a feeling of self-importance.

Some schaden-Freunde are former coworkers Vasudevan, Ravin, and Nathan, who after Hurricane Katrina sent the token "I'm glad you survived" but no other e-mails again. Not interested in communicating; having left work, you are no longer a lemming marching to the cliff in the same mindless shit job at the same equally mindless shit company (the movie Office Space comes to mind).

Questions to ask to determine if a "friend" is a schaden-Freund:


  1. For anything you do, is this friend in competition--one better, or outdoing you?

  2. Is this friend happy for your achievements, and does this friend encourage you?

  3. Does this friend support your efforts by vouching for you?

  4. Will this friend empathize with you, or simply tell you what to do?



Life can be miserable, and a schaden-Freund only adds to the misery by being a traitor in your own ranks, a stabber in your back when you least expect it and certainly when you do not need it--usually with a big smile of friendship and a carving knife ready to strike downward. Francis Bacon said, "If a man have a true friend, he may rest almost secure that the care of those things (things of great importance such as an unfinished work) will continue after him."

Dealing with such people is usually the more difficult thing, but the first step is to recognize them. I have true friends, but I can count them on one hand. I'd call a schaden-Freund (borrowing from Francis Bacon's definition of a true friend) a share-not-anything, whereas Francis Bacon, in his "Essays" (Essayes: Religious Meditations. Places of Perswasion and Disswasion. Seene and Allowed), states: "I have given the rule, where a man cannot fitly play his own part; if he have not a friend, he may quit the stage." True friends make life worth living; schaden-Freunde make life unlivable.

Friday, October 23, 2009

Anti-virus: Three Tools are Better than One

A friend of mine asked why I have three anti-virus programs (each free for private use) installed on my desktop PC (an old clunker running Windows XP).

I use a rule of thumb (a heuristic, to be formal) that an anti-virus program detects 99% of all computer viruses (provided it's up to date, and the maker keeps the computer virus definitions current). Which means only 1% of computer viruses go undetected.

Adding another anti-virus program (with the same heuristic) detects 99% of that remaining 1%, leaving 1% of the 1% undetected:

  1. 1-anti-virus 1% undetected computer viruses or 0.01. (1 in 100)

  2. 2-anti-virus 1% of 1% undetected computer viruses or 0.0001. (1 in 10,000)

  3. 3-anti-virus 1% of 1% of 1% undetected computer viruses or 0.000001 (1 in 1,000,000)
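A quick sanity check of the arithmetic, as a sketch in Java (it assumes, as the heuristic does, that each product independently misses 1%):

    // Sanity check: 1% missed, compounded per independent product.
    public class UndetectedOdds {
        public static void main(String[] args) {
            for (int n = 1; n <= 3; n++) {
                double undetected = Math.pow(0.01, n);
                System.out.println(n + " product(s): " + undetected
                        + " undetected (1 in " + Math.round(1.0 / undetected) + ")");
            }
        }
    }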


One could install more anti-virus products, but it becomes difficult to update each one, and there are interactions between them. Occasionally one product flags another's update file as a potential virus, but otherwise the three programs work well together.

But one in a million is a favorable probability for detecting any possible computer virus. Better odds are possible with more anti-virus products, but there is an increase in the difficulty of using them. It is a diminishing margin of return: the numerical probability decreases, but the practical gain is implausible.

That does not preclude the possibility of a computer virus, but then I use common sense...don't open unknown attachments, scan downloads for a possible computer virus, run routine virus scans, keep regular backups of data and regular restore points, etc. I do other routine things, but the point is avoiding obvious blunders and developing effective habits for computer use. Avoid laziness and laxness to avoid a computer virus, or for that matter spyware.

So far, the three anti-virus products, together with these defensive measures, have kept out any computer virus, although they have detected and eradicated computer viruses along the way.