Datawocky

On Teasing Patterns from Data, with Applications to Search, Social Media, and Advertising


Google Chrome: A Masterstroke or a Blunder?
September 8, 2008

The internet world has been agog over Google's entry into the browser wars with Chrome. When we look back on this event several years from now with the benefit of hindsight, we might see it either as a masterstroke or as Google's biggest strategic misstep.

The potential advantages to the internet community as a whole are considerable. The web has evolved beyond its roots as a collection of HTML documents and dumb frontends to database applications. We now expect everything from a web application that we do from a desktop application, and then some more: the added bonus of connectivity to vast computing resources in the cloud. In this context, browsers need to evolve from HTML renderers to runtime containers, much as web servers evolved from simple servers of static files and CGI scripts to modern application servers with an array of plugins that provide a variety of services. Chrome is the first browser to explicitly acknowledge this transition and make it the centerpiece of its efforts, and it will force other browsers to follow suit. We will all benefit.

The potential advantages to Google also are considerable. If the stars and planets align, they can challenge Microsoft's dominance on the desktop by making the desktop irrelevant. Even otherwise, they can hope to use their dominance in search to promote Chrome, gaining significant browser marketshare and ensuring that Micro

Bridging the Gap between Relational Databases and MapReduce: Three New Approaches
September 6, 2008

Popularized by Google, the MapReduce paradigm has proven to be a powerful way to analyze large datasets by harnessing the power of commodity clusters. While it provides a straightforward computational model, the approach suffers from certain key limitations, as discussed in a prior post:

  • The restriction to a rigid data flow model (Map followed by Reduce). Sometimes you need other flows, e.g., map-reduce-map, union-map-reduce, join-reduce.
  • Common data analysis operations, which database systems provide as primitives, need to be recoded by hand each time in Java or C/C++: e.g., join, filter, common aggregates, group by, union, distinct.
  • The programmer has to hand-optimize the execution plan, for example by deciding how many map and reduce nodes are needed. For complex chained flows, this can become a nightmare. Databases provide query optimizers for this purpose -- the precise sequence of operations is decided by the optimizer rather than by a programmer.
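
To make the first two limitations concrete, here is a minimal sketch of a join followed by a group-by sum, hand-coded as two chained map-reduce jobs. It is a toy, single-process stand-in written in Python, not Google's implementation or any real framework's API; the `map_reduce` helper, the `users` and `clicks` datasets, and every function name are hypothetical and exist only for illustration.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy single-process stand-in for a MapReduce job:
    map each record, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical input datasets.
users = [("u1", "US"), ("u2", "IN")]        # (user_id, country)
clicks = [("u1", 3), ("u2", 5), ("u1", 4)]  # (user_id, click_count)

# Job 1: a hand-coded reduce-side join on user_id.
def join_mapper(record):
    tag, user_id, payload = record
    yield user_id, (tag, payload)

def join_reducer(user_id, values):
    countries = [payload for tag, payload in values if tag == "user"]
    counts = [payload for tag, payload in values if tag == "click"]
    return [(country, n) for country in countries for n in counts]

tagged = [("user", u, c) for u, c in users] + [("click", u, n) for u, n in clicks]
joined = map_reduce(tagged, join_mapper, join_reducer)

# Job 2: group by country and sum, chained onto the output of job 1.
def sum_mapper(record):
    country, n = record
    yield country, n

def sum_reducer(country, values):
    return sum(values)

pairs = [pair for rows in joined.values() for pair in rows]
totals = map_reduce(pairs, sum_mapper, sum_reducer)
print(totals)
```

A relational database expresses the same computation as a single declarative query along the lines of `SELECT country, SUM(n) FROM users JOIN clicks USING (user_id) GROUP BY country`, and leaves the choice of execution plan to the optimizer.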

Three approaches have emerged to bridge the gap between relational databases and MapReduce. Let's examine each approach in turn and then discuss their pros and cons.

The first approach is to create a new higher-level scripting language that uses Map and Reduce as primitive operations. Using suc

Stop Email Overload and Break Silos Using Wikis, Blogs, and IM
July 22, 2008

Email is the central nervous system of most modern organizations, from startups to large corporations. Every communication, from the most important (planning for the big client meeting tomorrow) to the most trivial (fresh donuts in the kitchen) takes place through the corporate email system. The results: email overload and lowered productivity for the entire organization. Employees are tethered to their email via Blackberries even over the weekend, leading to communications burnout.

The biggest single reason for this is the inherent nature of email itself: it is a point-to-point communication medium. The sender has to decide both the content of the message and who the recipients are. If the recipient list is too large, it contributes to email overload. If it is too small, that can lead to communication gaps and "informational silos" in the organization, where one group in the company doesn't really know what another group is doing. Another problem is that each email message is a single unit, making it hard to track conversations among multiple parties. Many email readers thread conversations, but that is done at a syntactic rather than a semantic level. Finally, putting everything in email makes it difficult to build institutional memory.

We hit the email wall at my company Kosmix recently. When we were less than 30 people, managing by email worked reasonably well. The team was small enough that everyone knew w

The Real Long Tail: Why both Chris Anderson and Anita Elberse are Wrong
July 10, 2008

A new study by Anita Elberse, published in the Harvard Business Review, raises questions about the validity of Chris Anderson's Long Tail theory. In case you're related to Rip Van Winkle: the Long Tail theory suggests that the dramatically lower distribution costs for media (such as music and movies) enabled by the internet have the potential to reshape the demand curve for media. Traditionally, these businesses have been hits-driven, with the majority of revenue and profits attributable to a small number of items (the hits). Anderson argues that the internet's ability to serve niches cost-effectively increases the demand for items further down the "tail" of the demand curve, making the aggregate demand for the tail comparable to that for the head.
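
As a rough illustration of what "comparable" aggregate demand could mean, the sketch below splits total demand into head and tail shares under an assumed power-law demand curve, where the item at popularity rank r has demand proportional to 1/r^s. The power-law form, the catalog size, the head size, and the exponents are all assumptions chosen for illustration; none of them comes from Anderson's or Elberse's data.

```python
def head_tail_share(n_items, head_size, exponent):
    """Return (head_share, tail_share) of total demand, assuming the item at
    rank r has demand proportional to 1 / r**exponent (an assumed power law)."""
    demand = [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]
    total = sum(demand)
    head = sum(demand[:head_size])
    return head / total, (total - head) / total

# Hypothetical catalog: 1,000,000 items, of which the top 1,000 are the "hits".
shallow_head, shallow_tail = head_tail_share(1_000_000, 1_000, exponent=1.0)
steep_head, steep_tail = head_tail_share(1_000_000, 1_000, exponent=2.0)

print(f"exponent 1.0: head {shallow_head:.2f}, tail {shallow_tail:.2f}")
print(f"exponent 2.0: head {steep_head:.4f}, tail {steep_tail:.4f}")
```

Under these assumptions, the shallower curve leaves the tail with roughly as much aggregate demand as the head, while the steeper curve leaves the tail with almost none. Whether the tail matters therefore depends on the shape of the real demand curve, which is an empirical question.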

Anderson's insight resonated instantly with the digerati. It is said that Helen of Troy's face launched a thousand ships; the Long Tail theory certainly launched more than a thousand startups, all with an obligatory Long Tail slide in their investor pitches. Recently, however, there has been a creeping suspicion that the data don't support the theory; the backlash has been spearheaded,

Searching for a Needle or Exploring the Haystack?
June 27, 2008

Note: This post is about a new product we're testing at my company Kosmix.

Search engines are great at finding the needle in a haystack. And that's perfect when you are looking for a needle. Often though, the main objective is not so much to find a specific needle as to explore the entire haystack.

When we're looking for a single fact, a single definitive web page, or the answer to a specific question, then the needle-in-haystack search engine model works really well. Where it breaks down is when the objective is to learn about, explore, or understand a broad topic. For example:

  • Hiking the Continental Divide Trail.
  • A loved one recently diagnosed with arthritis.
  • You read The Da Vinci Code and have an irresistible urge to learn more about the Priory of Sion.
  • Saddened by George Carlin's death, you want to reminisce over his career.

The web contains a trove of information on all these topics. Moreover, the information of interest is not just facts (e.g., Wikipedia), but also opinion, community, multimedia, and products. What's missing is a service that organizes all the information on a topic so that you can explore it easily. The Kosmix team has been working for the past year on building just such a service, and we put out an alpha yesterday. You enter a topic, and our algorithms assemble a "topic page" for that topic. Check out the pages for