Mathematical differences between a database and a search system

I get asked this question often: what is the actual difference between a search system (like Solr) and a traditional RDBMS… or even a NoSQL datastore like MongoDB? It almost seems like a rule of thumb that every webapp needs a database as its store and a search solution for free-text search. But why?

What are the inherent differences between the two problems – retrieving data and retrieving search results – that require two different software systems? Can we not model a decent search stack using just the database?

The answer is in the underlying mathematics. Databases work in terms of sets, where each element is a tuple consisting of various fields.


A join between two tables is, at heart, an intersection of two sets.

Theoretically, you could progressively intersect sets until you arrive at the result set you were looking for. So what is Lucene really good for?

Lucene, Solr and Elasticsearch operate on ranked sets. Let's look at how that works:

The most fundamental data structure in Lucene is the inverted index – think of it as a mapping from each term to the list of ids of the documents that contain it (also known as a postings list); in practice, this is implemented with skip lists.
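
A toy sketch of the idea in Java follows (a plain HashMap of term to sorted doc ids; real Lucene postings are compressed, block-encoded and augmented with skip pointers, so treat this purely as an illustration):

    import java.util.*;

    // Toy inverted index: term -> sorted set of document ids (the "postings list").
    // Lucene stores postings far more compactly, but the lookup idea is the same.
    public class ToyInvertedIndex {
        private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

        public void addDocument(int docId, String text) {
            for (String term : text.toLowerCase().split("\\W+")) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }

        public SortedSet<Integer> lookup(String term) {
            return postings.getOrDefault(term, new TreeSet<>());
        }

        public static void main(String[] args) {
            ToyInvertedIndex index = new ToyInvertedIndex();
            index.addDocument(1, "search systems rank documents");
            index.addDocument(2, "databases join sets of tuples");
            index.addDocument(3, "search and databases differ");
            System.out.println(index.lookup("search"));    // [1, 3]
            System.out.println(index.lookup("databases")); // [2, 3]
        }
    }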

Basically, when you perform a search query, the output is a bitvector that tells you which documents matched the search (1 for a match, 0 otherwise). Facets themselves are also bitvectors. So, if you want to know how many results fall in a certain facet, all you need to do is a bitwise AND.
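
Here is a hedged sketch of that bitwise trick using plain java.util.BitSet (Lucene has its own optimized bitset implementations, but the principle is the same):

    import java.util.BitSet;

    // Query results and facets as bitvectors: bit i is set if document i matches.
    // A facet count is then a bitwise AND followed by a population count.
    public class FacetCount {
        public static void main(String[] args) {
            BitSet queryHits = new BitSet();
            queryHits.set(0); queryHits.set(3); queryHits.set(7); // docs matching the query

            BitSet facet = new BitSet();
            facet.set(3); facet.set(5); facet.set(7);             // docs tagged with the facet

            BitSet overlap = (BitSet) queryHits.clone();
            overlap.and(facet);                                   // bitwise AND
            System.out.println(overlap.cardinality());            // 2 -> the count shown next to the facet
        }
    }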

Now, many claim that these bitset operations are extremely highly optimized in search technologies. I am not sure that this alone is the difference, since an RDBMS could emulate a lot of this as well. However, there might be differences in caching behavior that lead to massive performance differences.

The crux of the difference between relational mathematics and search-based retrieval is this: the real optimizations in Lucene come from the fact that you never really want all the documents that match a query, only the top K. This is also where the performance difference comes from.
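
The standard way to exploit "top K only" is a bounded min-heap: keep the K best-scoring hits seen so far and discard everything else. A minimal sketch (the scores are made up; this is not Lucene's actual collector code):

    import java.util.*;

    // Top-K retrieval sketch: instead of materialising every matching document,
    // keep a small min-heap of the K best hits while scanning the matches.
    public class TopK {
        record Hit(int docId, float score) {}

        static List<Hit> topK(Iterable<Hit> hits, int k) {
            PriorityQueue<Hit> heap = new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
            for (Hit h : hits) {
                heap.offer(h);
                if (heap.size() > k) heap.poll();   // evict the current worst hit
            }
            List<Hit> best = new ArrayList<>(heap);
            best.sort(Comparator.comparingDouble(Hit::score).reversed());
            return best;
        }

        public static void main(String[] args) {
            List<Hit> hits = List.of(new Hit(1, 0.3f), new Hit(2, 0.9f),
                                     new Hit(3, 0.5f), new Hit(4, 0.7f));
            System.out.println(topK(hits, 2)); // the two highest-scoring docs
        }
    }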

There is no boolean in search

One of the frequent misconceptions is to model a search query in terms of boolean logic – AND, OR and NOT. However, the same concept comes into play again – you only want the top K. Which is why Lucene models clauses in terms of MUST, SHOULD and MUST_NOT.

Any clause you add with MUST signifies that any document returned must contain that term. A SHOULD clause simply bumps the scores of documents that match it. Remember that this gets more and more complex if you are using Solr's DisMax or eDisMax parser.
Which means that when there is a complex query with multiple boolean subqueries, you need to start diving deeper (e.g. don't mix SHOULD and MUST in the same BooleanQuery if strict boolean matching is needed: it will match all documents containing the MUST clause and merely "boost" the ones that also match the SHOULD terms).
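
Here is a minimal sketch of those clause types using Lucene's BooleanQuery.Builder API (the field and term names are invented for illustration):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // MUST / SHOULD / MUST_NOT instead of AND / OR / NOT:
    // MUST filters, SHOULD only influences the score, MUST_NOT excludes.
    public class OccurExample {
        public static void main(String[] args) {
            BooleanQuery query = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("body", "lucene")), Occur.MUST)         // required
                .add(new TermQuery(new Term("body", "ranking")), Occur.SHOULD)      // boosts score only
                .add(new TermQuery(new Term("body", "deprecated")), Occur.MUST_NOT) // excluded
                .build();
            System.out.println(query); // roughly: +body:lucene body:ranking -body:deprecated
        }
    }

In a mix like this, every document containing "lucene" (and not "deprecated") matches; the SHOULD clause only reorders them, which is exactly the trap described above.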


So now you can point to this article if management questions why you can't run search through database queries.

What is a good technology strategy for ?

EDIT: The moment this blog post was published, my entire blog account was suspended for suspected spam. I suspect it is because of the references in the post. That's how bad it is.

I have been thinking about the recent spate of acquisitions and was intrigued by what the other old search portals were doing.

One of the most interesting of these started as a semantic search engine (where you could ask your questions in plain English). As the world soon found out, people hate typing – so when Google innovated with search suggestions (technically known as typeahead), it pretty much killed the market for semantic search since – guess what – Google suggested reasonably well. Plus, it implemented typeahead on a per-word basis.

Suggestion for just the word "industry"

However, that strategy changed a long time ago, when they (rightfully) understood that they could not compete with Google. They then closed down the entire search technology team and pivoted to a question-and-answer portal – similar to Quora but without the cultivated community. There is a claim that they use their search technology to power answers to questions – however, they have two problems with that:

  1. their backend falls back to Google for queries it cannot answer
  2. their SEO is awful compared to newer upstarts like Quora. I will investigate this in a little more detail below.

At this moment, their relevance to the larger internet landscape is similar to Yahoo's – a distinguished pedigree, but doubtful importance.

If I were the CTO, I would note that the company still has impressive IP around. A brief search reveals 60 patents covering user-specific search results, advertising and query interpretation. The problem is that this cannot be an effective weapon against Google or Bing – their golden weapon is not the algorithm but the massive technology infrastructure around search.

The big problem with their Q&A business: SEO. Quora incentivizes users to write about, share and blog about a question-and-answer interaction. However, this incentivization works because Quora has seeded a superlative community (hey – Werner Vogels, Amazon's CTO, posts there). The result is abysmal SEO for the long tail – which is where most of the traffic for a Q&A site rests.

However, it can still use this technology to build products for several different verticals. Let us look at some of these verticals.

File sharing

There are many file-sharing products out there, including Dropbox, but none of them has truly powerful semantic search incorporated into it. Semantic search truly shines when you are trying to make sense of the text inside documents. The best way to understand this is to see how the Nepomuk project thinks about the semantic desktop (which is all about files and documents) and ontologies.

Plus, recent entrants like Mega clearly show that there is a lot of innovation still left on the table for personal document sharing. Hell, I would say build a Flickr (with 1TB) for your personal documents and layer on semantic search.

News Reader + Bookmarks + Web Clips + Passwords

The opportunity: Google Reader is dead. People love LastPass and Xmarks. Evernote is indispensable. Yahoo is going crazy with article-summarizing apps like Summly. They could think about a LastPass + Xmarks acquisition, which would give them a fanatical community. Unify those and build an Evernote-like product on top that lets you clip stories as well as subscribe to blogs in a single click, effectively supplanting Google Reader (possibly by acquiring one of the upstarts – though none of them have the leverage to work with data the way Google could, or maybe still can). Then leverage the semantic search technology to kick the shit out of summarizing – my opinion: relevant search kicks the butt of interesting recommendations.

The Big One – compete with Siri/Google Now

Yes, it can still be done. The problem is the dataset. However, if the above two projects fall into place, one can use the Xmarks data + Q&A data + web clips + personal document store + semantic search technology to power a more effective personal assistant. The differentiator is the human, self-generated content.

Technology Reorg

To achieve this, I would reorganize the tech organization into something similar to Amazon's: a core semantic search + big data team, with individual product + tech teams that roll up to twin product and tech heads.

I see an 18-month time-to-pivot given the above.

Effective way to build production-quality Hadoop jobs (with Mahout)



As I started working on building usable data analytics with Hadoop, I saw that there were several query languages out there – Pig and Hive being the primary ones. My goal is to define a reasonable framework for experimentation on Hadoop: the priority is deriving insights from data, and the least important metric is cool new languages. (Which, incidentally, fits the conclusion I arrive at.)

Hive seemed to be the closest fit to the relational model and so serves as a good starting point. A very interesting post by Alan Gates @Yahoo made me realize that for most of the tasks out there, the big challenge is to take data and massage it to be usable for big data analysis (for example creating Vectors for Mahout’s recommenders).

Pig was the next point of experimentation – especially with its ability to work with JRuby UDFs.

This runs fine… on your laptop. However, the Pig problem appears when you actually prepare to run the job on a cluster. That is when all the hidden things start to hit you – JARs not being on the classpath, JARs not being in the distributed cache – and it all adds up to a pain. The impact grows if you are combining more than one toolkit – for example Hadoop + HBase + Mahout.

That said, one of the very interesting points of integration that I stumbled upon was embedding Pig inside Python (via Jython) or JavaScript. (A lot of this comes via Hortonworks, which is doing interesting things around Hadoop and Pig.) However, if you look closely, there is a very quirky piece of code in there:

P = Pig.compile("""register udf.jar
DEFINE find_centroid FindCentroid('$centroids');
-- ... rest of the Pig script ...
""")

Essentially, you write a UDF in Python, then call Pig through Python, and the Pig script in turn calls the UDF. All of this finally gets compiled on the fly into a JAR (because that's what Pig does) and sent to Hadoop. This was the step that convinced me that this tooling was unnecessarily complex and that there ought to be a simpler way.

I realized that Pig and Hive are actually step two in the whole process of implementing MapReduce on Hadoop.

Step one has got to be writing your own custom jobs in Java or Scala and creating standalone job JARs.

Yes, this is suboptimal, since you end up with large JARs (a simple job JAR with Hadoop and Mahout ended up being 32 MB). Remember that these JARs will be copied out to your cluster (and hopefully persisted in the distributed cache for subsequent runs), but when you are starting out you don't really need to worry about that optimization. Secondly, at a later stage you can still use this very workflow to create a non-standalone JAR and push the dependent libraries out using one of the options below (a minimal job-driver sketch follows the list):

  • the -Dmapred.cache.files property, to force distributed caching of the libraries,
  • the handy -libjars option, or
  • the HADOOP_CLASSPATH environment variable.
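
As a rough illustration of step one, here is a minimal Hadoop job driver in Java (the identity Mapper and Reducer and the path arguments are placeholders; extending Configured and Tool is what lets ToolRunner parse generic options such as -libjars and -D properties before your code runs):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Minimal custom-job skeleton: build the Job, point it at input/output paths,
    // and let ToolRunner handle the generic Hadoop command-line options.
    public class MyJobDriver extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            Job job = Job.getInstance(getConf(), "my-job");
            job.setJarByClass(MyJobDriver.class);
            job.setMapperClass(Mapper.class);     // identity mapper; swap in your own
            job.setReducerClass(Reducer.class);   // identity reducer; swap in your own
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new MyJobDriver(), args));
        }
    }

You would then submit it with something like "hadoop jar my-job.jar MyJobDriver -libjars mahout-math.jar input/ output/" (the JAR names here are purely illustrative).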

My Scala code here is a simple Mahout vector-creation algorithm (based on sample data from here) and produces a standalone JAR (using sbt-assembly).
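
The Scala code itself is not reproduced here, but as a hedged Java sketch of what "vector creation for Mahout" typically involves: turn each input record into a NamedVector and write <Text, VectorWritable> pairs into a SequenceFile, the input format most Mahout jobs expect (the output path and feature values below are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.VectorWritable;

    // Rough sketch: massage raw records into Mahout vectors and persist them in
    // the <Text, VectorWritable> SequenceFile format used by Mahout's jobs.
    public class VectorWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("vectors/part-00000");   // made-up output path

            SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
            double[] features = {1.0, 0.0, 3.5};          // made-up feature values for one record
            NamedVector vector = new NamedVector(new DenseVector(features), "record-1");
            writer.append(new Text(vector.getName()), new VectorWritable(vector));
            writer.close();
        }
    }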

Complexity of clicking on a map – real estate based map UXes in India

So you have a Google Maps widget and you have plotted a few locations on it – say, the new properties for sale in your city. Awesome! How long did it take you? One week.

Now tell me: what happens when you click on any property? The answer seems obvious, right – you show some stuff about the location that was clicked.

Wrong answer.

The best way to understand this is to ask the following question: what exactly did you click on? The answer is very contextual to India. What you get in India are huge housing projects with almost a thousand residential units – but with a large number of different unit types (e.g. 1 bedroom / 2 toilets, or 1 bedroom / 1 toilet / 1 servant quarter, etc.). Therefore, this information cannot be aggregated.

The US way.

Remember when I said this problem is not present in the US, due to the homogeneity of housing units plus the spatial distribution of properties? Well, there is one context in which you can see the same problem – mobile. At that screen size, the density of information in the US runs into the same interaction-design problems as desktop maps do in India.

Take, for example, Trulia's Android app:

trulia mobile app touch select

What you see here is the interaction that happens when you touch a particular point on the mobile screen. The circle (and the list below, which corresponds to the radius of the circle) is part of that interaction. Let's analyze the reasons.

Several map-based sites/apps load all the data at once. This is a by-product of having a limited data set to work with. If this is the case, then you can zoom in/out and drag/pan as you like to focus on a particular point. Magicbricks is one of several examples.

However, if you have a large data set (as we had at Proptiger), then you need to plot the data based on the viewport you have at that moment. (This requires an API that supports viewport-based queries.) Now, any interaction on the map – zoom, pan, etc. – refetches results from your database. Which means that when you zoom in, you get an entirely new set of results (with its own sorting), and your original point of interest can vanish. This is an extremely important aspect of map-based UXes with significant amounts of data.
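
To make "viewport-based queries" concrete, here is a hedged Java sketch of the server-side idea (class and field names are invented): the client sends its current map bounds, and the server filters, re-ranks and truncates the result set for that box.

    import java.util.List;
    import java.util.stream.Collectors;

    // Sketch of a viewport query: given the map's south-west and north-east
    // corners, return only the top-N properties inside that bounding box,
    // re-sorted for that viewport.
    public class ViewportQuery {
        record Property(String name, double lat, double lng, double score) {}

        static List<Property> inViewport(List<Property> all,
                                         double swLat, double swLng,
                                         double neLat, double neLng, int limit) {
            return all.stream()
                    .filter(p -> p.lat() >= swLat && p.lat() <= neLat
                              && p.lng() >= swLng && p.lng() <= neLng)
                    .sorted((a, b) -> Double.compare(b.score(), a.score())) // viewport-local ranking
                    .limit(limit)
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            // made-up sample data
            List<Property> all = List.of(
                new Property("Project A", 28.41, 77.04, 0.9),
                new Property("Project B", 28.46, 77.08, 0.7));
            System.out.println(inViewport(all, 28.40, 77.00, 28.45, 77.10, 10));
        }
    }

The limit and the viewport-local sort are exactly why a point of interest that was visible at one zoom level can vanish after the next fetch.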

Do note that there is another data interaction possible – something that vanilla Google Maps does very well. The map is secondary and the list on the left is primary: once people are looking at a particular paged set of results on the left, the map does not re-query on pan or zoom. This works well for known-item browsing behavior (in UX design terms), but not for exploratory, map-centric behavior. In the real-estate paradigm, one of the big problems this causes is seemingly empty areas on the map even after you have zoomed in, because the list is still static. The best way to illustrate this is to look at how Airbnb presents its map results – the trade-off between reloading on zoom and keeping the result list static shows up in its "Zoom in to see more properties – Redo search in map must be checked to see new results as you move the map" behavior. I'm not really sure how usable that is in India.

airbnb map UX - reload and zoom

Fun fact: implementing the JavaScript hooks to handle panning/zooming and refreshing of data is a very hairy task, stemming primarily from the quirky interaction between fitBounds and setZoom.

So what's the solution? Put simply, it is this: somehow keep people from having to zoom in just to isolate a data point. This is a UX problem – you need to offer an interaction that makes that panning and zooming unnecessary. In my understanding, there are two ways to achieve that:

  • Radius of interest – draw a small circle around the point of interest (whether touched or clicked) and show a separate popup with a small, limited set of results from that area. Using this popup, you can zero in on the exact point of interest. This is what Trulia does in its mobile app and is starting to do on its desktop site.
  • Spidering – the same idea as the OS X fan of icons (technically called Stacks). This is something we have attempted at Proptiger: clicking on a tight group of icons spreads them out as a fan, allowing you to pick and choose.

Map icons - grouped (before) map icons spidered view - like stacks

This is just the first step for building UXes with large amounts of data on a map. Wonder what’s the next step ;)

Configuring the TP-Link travel router with DD-WRT (and solving the single port problem)

The TP-Link MR3020, WR702N and WR703N are brilliant tiny devices that fit in the palm of your hand. (I chose the MR3020 especially for its USB compatibility.) It is a superb replacement for – and improvement on – the sucky WTR54GS travel router.

Not only does the MR3020 have a great form factor (the size of a pack of cards), it can also connect a NAS or a 3G dongle through its USB port, is powered over a USB cable (!!!!) and, best of all, is 100% DD-WRT compatible.

This last factor turns the router from a simple device into a super-powered one.

I downloaded the firmware from the DD-WRT site. There are many guides to installing DD-WRT, like this one here.

Now, there is one problem with the MR3020 – in fact, it is the same problem with any travel router that has DD-WRT installed on it: there is only a single Ethernet port on the device (marked WAN). For some reason, the DD-WRT defaults assume that this port is bridged with the LAN… which means you will not be able to use it like a conventional router. In fact, if you set the WAN connection to "Automatic Configuration – DHCP", the router seemingly stops working (including no longer assigning IP addresses). It took a long time (and blogs like this one) to finally understand why this happens and how to fix it. I suggest that whoever attempts this read the official DD-WRT documentation here.

The way to fix this is to break the bridging between the internal LAN (which carries the WiFi) and the Ethernet port. To do this, follow these steps without missing anything:

(Assuming you know your device's current IP address and can reach its DD-WRT web interface.)

  • go to the router's web interface
  • go to the section "Bridging"
  • Click "Add", let the page reload, then fill in "br1" in the empty field at the beginning. Go down to the bottom of the page and click "Save".
  • The "Bridging" section should now show a "br1" row with some additional fields below it (IP address and subnet mask).
  • Fill in an IP address and subnet mask for the new bridge; I chose a separate private subnet for it.
  • Go to the bottom of the page and click "Save", then go to the bottom again and click "Apply Settings".
  • Go to the section "Assign to Bridge", click "Add" and let the page reload.
  • Select "br1" from the dropdown and "ath0" as the interface. Go to the bottom of the page and click "Save", then click "Apply Settings".
  • Go down to the section called "DHCPD", click "Add" and let the page reload.
  • Select "br1" from the first dropdown. Go to the bottom of the page and click "Save", then click "Apply Settings".
  • Then go to the basic setup page and, under "Local IP Address", fill in the new router IP you want to use. Go to the bottom of the page and click "Save", then click "Apply Settings".
  • At this point your router's IP has changed to that new address, so browse there, go to "Connection Type" and select "Automatic Configuration – DHCP". Go to the bottom of the page and click "Save", then click "Apply Settings".

Voila. A brilliant, portable travel router running DD-WRT, for cheap!

About Culture – overheard on HN

“But culture is not a set of specific abstract principles. It’s the set of mutual values, expectations, conventions, and methods of communication that define the way people within your organization interact with each other and work together to achieve the company’s goals.

You guide the development of your culture not by fiat and ideology, but by carefully structuring the organization in a way conducive to healthy growth, by selecting the people with the right mindset to put in positions of responsibility, and most importantly, by setting the example in the way that you act and make decisions.

If you proclaim your company officially vegetarian, here are the effects I see to culture, as understood in this way:

* Alienate members on your staff who are not vegetarian.

* Blur the distinction between personal preferences and business strategy.

* Establish a precedent that the founders’s personal opinions on issues wholly orthogonal to the business are still valid grounds for policy-making (imagine the consequences of future managers emulating this).

In the long term, you’re probably moving toward a culture with an unhealthy amount of internal contention, and a staff that may not feel fully invested in the company’s goals.”

Hacking the SD960/IXUS110 for scuba photography (using RAW)

The SD960 is the cheapest camera I could buy that has a decent scuba housing – the WP-DC32. However, the reason this camera gets dissed is its inability to shoot RAW, which is near essential for underwater photography of any kind.

My recent trip to the underwater wrecks of Sri Lanka did not yield good pictures because of this problem. However, there is a solution – the CHDK (Canon Hack Development Kit) port for the SD960 (see the forum). The best part is that it does not touch your camera's firmware at all – every time the camera boots, CHDK is temporarily loaded into memory.

The way to hack the SD960 IS is fairly simple. First, identify your current firmware version – I used the ACID utility to do that, and mine was 101d. Incidentally, ACID downloaded the matching CHDK build for me, but you can download it yourself here. Unzip the archive somewhere; inside it you will find files/folders such as:

  • PS.FI2
  • vers.req
  • CHDK/ (folder)

Prepare your SD card:

  • Make sure your SD card is NOT locked (yes – I wasted many an hour because I forgot this).
  • Partition your SD card into two partitions (I used GParted; use any tool you want): a 16 MB FAT16 partition, and the rest as a FAT32 partition. The larger partition is where your pictures will be stored.
  • Make sure the 16 MB partition has the "boot" flag set. In GParted, right-click -> "Manage Flags" -> check "boot".
  • Copy the files/folders above to BOTH partitions.

Start your camera by pressing the play button, not the on/off switch. In the play menu, at the very bottom, there will be a new option called "Firmware Update". Start the firmware update; CHDK should now be temporarily loaded into camera memory, and you should see a brief flash of the CHDK splash screen. Now press the "play" and then the "menu" button. You should see a different menu.

Go down to "Miscellaneous Stuff -> Make Card Bootable". If this does not work for you, the Linux command that does the same thing is:

 echo -n BOOTDISK | sudo dd bs=1 count=8 seek=64 of=/dev/sdb1   # where /dev/sdb1 is your 16 MB partition

Now switch off your camera, take the card out and slide the SD card's tab to the lock position (yes, the lock position: CHDK ignores it and lets you take photos without a problem, and this is what makes the camera load CHDK on every start).

Now your SD card will boot CHDK every time the camera is switched on – if you get a new SD card, you’ll have to redo the steps above, since the camera is unaffected by any of this.

Enable RAW mode – this is what we have been waiting for! Press "play" and then "menu" and enter "RAW Options". Check the "Shoot in RAW" option and you will get a RAW picture every time, stored in a separate folder called "100CANON". The RAW format used is DNG, which is a universally accepted RAW format.

The very first thing you should do is press the shutter to get into shooting mode. Now pull up the CHDK menu (play + menu), go to "RAW Options" and select "Create badpixel.bin". The camera will automatically take a couple of pictures for you; this eliminates bad pixels when you are shooting RAW.


Why Ubuntu should use Android's SurfaceFlinger instead of investing in Mir and pulling a "Bada" move

Canonical recently released the spec for a new display server called Mir. This comes on the heels of the discovery by Phoronix that Ubuntu Touch was using SurfaceFlinger under the covers, as well as the announcement that Unity will switch to Qt/QML for its next version.

I think investing in Mir is a poor move, and Ubuntu seems to be all over the place. Most top-end device SoC manufacturers will only release Android-compatible binary drivers for a long time to come. Even if they support desktop Linux, they will most likely stick to X.Org-compatible drivers (because the big enterprise Linux gorilla – Red Hat – is still invested in X.Org).

If one has already made a significant investment in porting the Ubuntu userspace to Android + SurfaceFlinger (on touch), why would you NOT reuse that work and port it over to the desktop? There is significant opportunity for innovation on the Android stack, and though Google would need a little pushing to accept Ubuntu as a special technology partner, it can still be done.

And here's the most important reason of all: yesterday I had to attend an important GoToMeeting. I used my Ubuntu desktop and clicked the link, only to be faced with an "unsupported platform" message. I don't know why, but I decided to try it on my Android device. I installed the GoToMeeting app, clicked the link in the email, and less than two minutes after getting the Linux-unsupported message I was connected to my meeting.

Android already has a huge, huge, HUGE ecosystem that works frikking well – why would you not want to leverage that? Ubuntu is trying to pull an Opera Browser or a Bada here with its Mir announcement – and we all know how those panned out.

