Sunday, September 24, 2006

Human Computation

A friend of mine recently pointed me to this video by a guy called Luis von Ahn. It looks like his work has gotten a lot of coverage in the blogosphere. I would highly recommend the talk. Thought-provoking, to say the very least. You don’t really need a technical background to be able to understand what he is talking about.

The gist of von Ahn’s work lies in cleverly leveraging what he calls wasted human cycles. People spend gazillions of hours playing games that have no obvious use. So why not devise games that end up solving real world problems? That’s what von Ahn has done with the ESP Game and Peekaboom. The ESP Game pairs you up with an anonymous partner and supplies you with a series of images that it asks you to label. You win points if you and your partner agree upon the same labels. So what’s the deal with leveraging human cycles here? In the process of playing, players inadvertently end up improving the quality of image search results. The vast majority of images on the Internet are unlabeled, and state of the art search engines use (not so clever) heuristics such as the name of the image file and words that can be found in the proximity of the image. Surprise, surprise: in the aftermath of von Ahn’s talk at Google, the folks in Mountain View have licensed his game to improve the quality of Google image search.
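To make the agreement mechanic concrete, here is a minimal sketch in Python. The function name, the case-insensitive matching and the sample guesses are my own illustrative assumptions, not von Ahn's implementation; the real game surely has more rules than this.

```python
def normalize(label):
    """Lower-case a guess and trim surrounding whitespace."""
    return label.strip().lower()


def agreed_labels(guesses_a, guesses_b):
    """Return the labels that both players typed.

    Hypothetical helper: the real game's matching rules are not described
    in this post, so this only illustrates the agreement idea.
    """
    seen_by_a = {normalize(g) for g in guesses_a}
    matches = []
    # Walk player B's guesses in order so the result is deterministic.
    for guess in guesses_b:
        label = normalize(guess)
        if label in seen_by_a and label not in matches:
            matches.append(label)
    return matches


# Made-up guesses for a single image.
player_a = ["Lake", "mountains", "scenery"]
player_b = ["water", "scenery ", "lake"]
print(agreed_labels(player_a, player_b))  # ['scenery', 'lake']
```

Only the labels both strangers independently came up with survive, which is what makes them plausible descriptions of the image.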

While the notion of human computation (as applied to image search) is interesting, I suspect that it will not work well in isolation. Why? Because it will be ineffectual at catering to the long tail. Given the picture of a person, most players are likely to use the labels “man” or “woman” since they have no clue as to who that person is. For example, a label of “man” to describe a picture of the President of Uruguay is unhelpful. Here’s another example. Since there is very little visual information to distinguish this picture of Lake Tahoe from this one of Crater Lake, most players (who have not been to either place) are likely to use tags like “scenery” and “lake” that, while accurate in describing the picture, are way too generic to make a qualitative difference to the search experience.

On the flip side, the notion of human computation did not seem totally unfamiliar. That’s what del.icio.us users have been doing for several years now. They end up improving search results by tagging/bookmarking their URLs. I suspect that del.icio.us results are likely more accurate than those generated by the ESP Game. The reason is that folks using del.icio.us have more information to describe the page that a URL refers to (having actually read it) than people who are presented with a random image of a person whom they have never seen before.

Wednesday, August 16, 2006

Handwaving on the Economy

Fred Wilson had this recent post on the state of the economy. I dropped a line in the comments section of that post (faithfully reproduced here):

“Making predictions about the economy is an exercise fraught with uncertainty (especially for someone who writes code for a living!). But here goes...

It seems like older economic models that guided conventional wisdom can no longer be relied upon in an increasingly globalized world. For example, the inverted yield curve is no longer reason to proclaim recession (so long as the Chinese continue to subsidize American long term interest rates). The US economy is holding up quite well to the high price of oil which is no longer reason enough to predict stagflation.

Likewise, it’s hard to say that the current boom in web services will burst even if there is a recession. Why? Because monetization for these companies is primarily ad-driven and they will continue to make money as consumers continue to spend more time online (or using their cell phones and iPods) and advertisers respond to this trend.

The burst of 2000 was not so much about economic cycles as it was about a cycle of greed. The fact that investors are maintaining a laser sharp focus on monetization makes a burst less likely this time round.

I am not disputing the cyclical nature of the economy. All I am saying is conventional wisdom cannot be used to guide investment decisions.”

While I still stand by most of what I say here, the last comment about the “cyclical nature of the economy” is what I am going to talk about in this post. While I was sleeping, the community of economists seems to have developed a vigorous culture of blogging, and a post I read on Prof. Greg Mankiw’s blog got me thinking about something we take for granted – namely business cycles. I am sure this topic has generated much ink in academic circles. But like most other people who have little more than a dilettante’s interest in economics, I look for simple mental models to explain economic phenomena and rely on articles like this one which speculate on the question of whether “the Fed will go too far.” Even modest changes in the Dow and the Nasdaq that follow the Fed Chairman’s speeches are interpreted by the mainstream press as “signs” of the market reading the FOMC’s mind. But the answer is, the Fed will always go too far. The Fed’s only job (by congressional mandate) is to maintain price stability. Do you remember the Fed Chairman talking about anything other than the need to contain inflation? I am thinking that the Fed would much rather bring a recession upon us than have the inflation dragon rear its head. They will of course do this by making monetary policy less accommodative than it needs to be. But theories about recessions in the mainstream are limited to “bubbles bursting” and “infectious greed”. Hence this rant : )

Saturday, August 05, 2006

del.icio.us API/Hack

A while after my last post on del.icio.us, Techcrunch wrote about it, raising a couple of the issues that I talked about. They published some disappointing traffic stats and questioned del.icio.us’s ability to achieve mainstream adoption. Later in the day, they published what amounted to a retraction. Oh well…

I am going to talk a bit more in this post about what I meant by referring to a del.icio.us API in my previous post. I wrote this fairly straightforward Python script that takes your del.icio.us username, password and a URL and spits out the tags that people have used to describe it. Definitely install the relevant Python packages (ClientForm and ClientCookie) before expecting this to work :)

# Python 2 script; needs the ClientForm and ClientCookie packages.
import re
import ClientForm
import ClientCookie

# Keep cookies across requests so the login session persists.
cookieJar = ClientCookie.CookieJar()
opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cookieJar))
opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible)")]
ClientCookie.install_opener(opener)

# Fetch the login page and parse the forms on it.
fp = ClientCookie.urlopen("https://secure.del.icio.us/login")
forms = ClientForm.ParseResponse(fp)
fp.close()

# Fill in the login form; replace the placeholders with your own values.
form = forms[0]
form["user_name"] = "username"
form["password"] = "password"
mainurl = "http://del.icio.us/username?url="
url = "URL"

# Submit the form; the session cookie ends up in cookieJar.
fp = ClientCookie.urlopen(form.click())
fp.close()

# Request the page for the URL we are interested in.
fp = ClientCookie.urlopen(mainurl + url)
items = fp.readlines()
fp.close()

# The tags are assigned to two JavaScript variables, tagRec and tagPop.
# Print whatever follows each assignment.
for item in items:
    item_s = item.strip()
    for pattern in (r"var\stagRec\s=\s", r"var\stagPop\s=\s"):
        if len(re.findall(pattern, item_s)) == 1:
            print re.split(pattern, item_s)[-1]

The power of Python ensures that this does enough to log in, maintain cookie state, parse out the relevant HTML and spit out the tags. So this script does a little more than its length might imply :-) This is one example of an API call that a user might want: using del.icio.us tags not just to search but to classify as well. Note that traditional classification methods rely almost exclusively on algorithms (as opposed to user-generated content).
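For anyone who only cares about the parsing step, here is a small Python 3 sketch of the extraction in isolation, with no login and no network access. The sample lines are made up for illustration; the only thing I am assuming about the page is that the tags are assigned to variables named tagRec and tagPop, which is what the script above looks for.

```python
import re

# Matches lines such as "var tagRec = ..." or "var tagPop = ...".
TAG_VAR = re.compile(r"var\s(tagRec|tagPop)\s=\s(.*)")


def extract_tag_vars(lines):
    """Map each tag variable name to the raw text assigned to it."""
    found = {}
    for line in lines:
        match = TAG_VAR.search(line.strip())
        if match:
            found[match.group(1)] = match.group(2)
    return found


# Made-up sample input; the real page markup may well differ.
sample = [
    "<script>",
    "  var tagRec = ['python', 'scripting']",
    "  var tagPop = ['programming', 'web']",
    "</script>",
]
for name, value in extract_tag_vars(sample).items():
    print(name, "=", value)
```

Returning a dictionary instead of printing makes it easier to feed the tags into something else, such as a classifier.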

Why would Yahoo! want to go this route in the first place? The answer lies in the fact that while tagged search may be **one** way to improve search results, it most certainly is not the only way. When Yahoo! does end up integrating del.icio.us into their search engine, tags will probably be just one of many factors (each with its own weight) being considered. So by staying the course, YHOO ends up losing del.icio.us in the noise of the search wars. But if they make their tag data available to their competitors, they actually end up making money off this thing.
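To illustrate what "one of many factors, each with its own weight" could look like, here is a toy scoring function in Python. The factor names and weights are entirely made up; nothing here reflects how Yahoo! or anyone else actually ranks results.

```python
def combined_score(signals, weights):
    """Weighted sum of ranking signals; a missing signal counts as zero."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)


# Hypothetical weights: tag data is just one input among several.
weights = {"text_match": 0.5, "link_score": 0.25, "tag_match": 0.25}

page_a = {"text_match": 0.5, "link_score": 0.5, "tag_match": 1.0}
page_b = {"text_match": 0.5, "link_score": 0.5}  # nobody has tagged this page

print(combined_score(page_a, weights))  # 0.625
print(combined_score(page_b, weights))  # 0.375
```

With numbers like these the tags nudge a result up the list but never decide the ranking on their own, which is the sense in which del.icio.us could get lost in the noise.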

The Techcrunch retraction mentioned the fact that del.icio.us now has over 100 servers in action. The kind of infrastructure that can handle gazillions of queries per day from GOOG, MSFT, etc. perhaps?

Tuesday, August 01, 2006

Monetizing del.icio.us

It seems like it’s a little late in the day to be talking about del.icio.us, especially since it’s been 6 months or so since they got bought by YHOO. So what has Yahoo! done with their new toy since then? Not much from what I can see. Not even a unified login (which is the case with flickr). To be sure, del.icio.us does have some new features (importing/exporting bookmarks). But what I am trying to get at is the fact (the beauty of blogging lies in my right to peddle my opinions as facts ... doesn’t it ? :-P ) that YHOO hasn’t really used del.icio.us to contribute to its bottom line. And that’s what this blog post is about.

Besides the obvious (ads), there are a couple of ways del.icio.us can be used to make $$. The first approach is somewhat indirect. It was mentioned at the time of the acquisition as a major driver behind the deal and continues to find currency in the blogosphere: namely, the notion of tagged search. It is a given that any company that relies on search as a source of revenue needs to constantly improve the quality of its search results to grow its user base (and by implication, its ad revenue). Algorithmic approaches (to text search at least) have hit something of a plateau of late, and one way to show some real improvement is by turning to non-traditional approaches such as the one offered by del.icio.us. Though the article I linked to claimed that this is being done, it’s not obvious to me how. But if YHOO is indeed taking definitive steps towards integrating del.icio.us into their search engine, then good for them.

The other approach to monetizing del.icio.us is somewhat less obvious (to me at least). Offer an API to the startup community and potentially to competitors as well. The latter (making del.icio.us results available to competitors) I think is more powerful. Needless to say, this site has a lot of momentum going in terms of being the leading social bookmarking service. It’s hard to see Google replicating the level of success enjoyed by this piece of Internet real estate. Ergo, the only reasonable way for Google to improve their search results via tagged search would be to use del.icio.us. At the very least, the del.icio.us API should pop out the tag(s) that describe an article given its URL. A cursory glance at the del.icio.us API reveals that this may not have been done yet. What I am asking for is something along the lines of what is offered by technorati.

Of course, all this talk about tagged search improving search results is well and good. But what really needs to happen to make that a reality is mainstream adoption of del.icio.us. A simple experiment reveals that this may not be the case (yet). Articles from technology centric blogs tend to be heavily bookmarked. But try bookmarking something from a blog on economics or art. You will probably find that you are the first person doing so. No less than the front page article of the New York Times is likely to yield similar results. This suggests that social bookmarking has a ways to go before it Crosses the Chasm.

Wednesday, July 26, 2006

Time for GOOG to give up on Atom?

A friend complained to me recently about how blogspot doesn’t have something quite as basic as category tags. Since that’s something I have thought of in the past as being a drawback of this platform, I figured it may be worth it to dig a little deeper. On the face of it, providing category tags seems like a simple enough feature. So why don’t they do it? The most obvious reason is the well documented fact that Google, despite all the hoopla, has done little outside its core area of search. However, I think this has more to do with their choice of syndication standard. It’s been known for some time now that Google has chosen to promote Atom over RSS. In fact, each blogspot account comes with its own automatically generated Atom feed. A closer look at the specs reveals that RSS allows categories whereas Atom does not. And category tags make little sense when used outside the context of a feed (which can then be mined for information by the likes of technorati). With the advent of the tagging revolution brought about by the del.icio.us/flickr gang, it seems like it makes little sense for Google to continue down this route. Maybe it’s time, then, to give up on Atom? Or at least offer RSS as an alternative for blogspot users. Turns out that an RSS Feed is indeed available. Me ineptum. In any case, I will leave this post up for its conspiracy theoretic value :)

Saturday, June 17, 2006

Random Ruminations on Search

A couple of random opinions on the state of search…

Much has been made of Ask.com’s recent resurgence. It has been said that Ask has better features than Google. While this may be true, great algorithms do not a successful search engine make. What really matters to the users of a search engine is how well it caters to the long tail. Take this recent search I made (to put this in context, refer to my previous post). The results returned by Ask are quite obviously inadequate; compare them with those returned by my friendly neighborhood search engine. See what I mean? It really boils down to the quality of a search engine’s crawling infrastructure, which is going to be on par only if a company has the muscle to pull this off. Does anyone out there see Ask plonking down $2 billion to upgrade their datacenter? I think not.

Reading this recent article (if you don’t subscribe to the BBC tech news RSS feed, do so – they have some of the best articles out there) got me thinking about how cool it would be if search engines had a feature that validates search results for safety. I wanted to blog about it sometime (complete with mock up screen shots). Just as well I didn’t do it since scandoo already does it for you. While the creators of the product admit that it is somewhat nascent, a superficial test showed up some interesting results. Try searching scandoo for something as innocuous as your name followed by a search term related to the world’s favorite 3-letter word (I will refrain from giving suggestions in an attempt to keep this space G-rated). See the difference? I was impressed by how scandoo managed to whitelist a wikipedia entry on my search term while blacklisting a site whose credentials were somewhat more suspect. While it is not clear at this point how credible their techniques are, a mature product really has the potential to disrupt the consumer AV and spyware software market. Why? Because unlike enterprises, consumers don’t need to be secure. They just need to feel secure. If Google, Yahoo! and MSN plug a major entry point for spyware, it may end up removing desktop AV/spyware software from a consumer’s list of “must-haves”.

Wednesday, June 14, 2006

In Pursuit of Interoperability

Microsoft has often been accused of messing with protocol standards and such. I recently found out what this really meant. Ever heard of WINS? No? Well if you are a *nix person, I don’t really blame you. It is a naming service specific to NetBIOS. As with a lot of other things, Microsoft attempts to mishmash existing standards and technologies with proprietary stuff. They have this feature that allows you to send a standard DNS request containing a NetBIOS name to a server. The DNS server, unable to resolve this request (duh!), can be configured to forward it on to a WINS server for resolution. Makes sense? What can go wrong here? Well, turns out that if a *nix client makes a zone transfer (AXFR) request to a Windows DNS server, the WINS resource record (RR) will be sent along with the rest of the zone file. The *nix client, not knowing how to interpret this, crashes and dies. This was in fact a bug in NT and was fixed in one of the innumerable Service Packs that were released for that product. Nowadays, if you want to look at the WINS RR via an AXFR, you will need to declare yourself as a Microsoft DNS client by appending the characters “MS” to a vanilla AXFR request. As if this quirk were not enough, the hex value 0xc00c is to be found pervasively in the results of an AXFR to a Microsoft client. This bizarre-looking separator is actually DNS name compression at work: if a string occurs frequently, the server saves bandwidth by simply giving a pointer to the very first instance of that string in the packet, as explained by this paper. To their credit however, Microsoft does engineer for interoperability. I am not saying it’s not clever. Just quirky.
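The 0xc00c value can be decoded by hand. As far as I can tell, this pointer scheme is the message compression described in the DNS specification (RFC 1035, section 4.1.4), so the Python sketch below assumes that layout: two bytes whose top two bits are set, followed by a 14-bit offset. The function name is my own.

```python
def decode_pointer(two_bytes):
    """Return the offset a DNS compression pointer refers to, or None.

    A pointer is two bytes whose top two bits are both set; the other
    14 bits are an offset from the start of the DNS message.
    """
    if len(two_bytes) != 2:
        raise ValueError("expected exactly two bytes")
    value = int.from_bytes(two_bytes, "big")
    if (value & 0xC000) != 0xC000:
        return None  # an ordinary label length, not a pointer
    return value & 0x3FFF


print(decode_pointer(b"\xc0\x0c"))  # 12
```

Offset 12 is the first byte after the fixed 12-byte DNS header, which is where the first name in the message starts. That would explain why this particular value shows up all over the place in a zone transfer: every record under the zone can point back at the zone name instead of repeating it.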

I am Baaacck

So Jim called me up this evening and guilt tripped me into ensuring that my blog does not die a premature death (“Hey Bharath, what’s up? Long time, no blog”). Given that I am just about ready to hit the sack **yaaawn**, I figured that a good way to go would be to dispense with the customary navel gazing and blog about something that doesn’t require too much thinking – a much delayed addendum to my previous post. So what (else) makes you realize you’re in Silicon Valley?
  • Where else in the country do you have sports stadiums that are named the McAfee Coliseum, HP Pavilion and Monster Park?
  • Advertisement hoardings at the local baseball game plug Genentech, Applied Materials and Juniper Networks.
  • The cable guy shows up for an install, sees that I am Indian and leaves his shoes outside the door without me asking him to do it.
  • Chirayu and I were walking down a nondescript street in Palo Alto looking for the HP garage when this guy who fits the mould of a pizza delivery guy walks up to us and says – “You guys looking for the HP place? There it is, right there.”