Datawocky

On Teasing Patterns from Data, with Applications to Search, Social Media, and Advertising


Google Chrome: A Masterstroke or a Blunder?
September 7

The internet world has been agog over Google's entry into the browser wars with Chrome. When we look back on this event several years from now with the benefit of hindsight, we might see it either as a masterstroke or as Google's biggest strategic misstep.

The potential advantages to the internet community as a whole are considerable. The web has evolved beyond its roots as a collection of HTML documents and dumb frontends to database applications. We now expect everything from a web application that we do from a desktop application, and then some more: the added bonus of connectivity to vast computing resources in the cloud. In this context, browsers need to evolve from HTML renderers to runtime containers, much as web servers evolved from simple servers of static files and CGI scripts to modern application servers with an array of plugins that provide a variety of services. Chrome is the first browser to explicitly acknowledge this transition and make it the centerpiece of its efforts, and it will force other browsers to follow suit. We will all benefit.

The potential advantages to Google are also considerable. If the stars and planets align, they can challenge Microsoft's dominance on the desktop by making the desktop irrelevant. Even otherwise, they can hope to use their dominance in search to promote Chrome, gaining significant browser market share and ensuring that Micro

Bridging the Gap between Relational Databases and MapReduce: Three New Approaches
September 5

Popularized by Google, the MapReduce paradigm has proven to be a powerful way to analyze large datasets by harnessing the power of commodity clusters. While it provides a straightforward computational model, the approach suffers from certain key limitations, as discussed in a prior post:

  • The restriction to a rigid data flow model (Map followed by Reduce). Sometimes you need other flows, e.g., map-reduce-map, union-map-reduce, join-reduce.
  • Common data analysis operations, which are provided by database systems as primitives, need to be recoded by hand each time in Java or C/C++: e.g., join, filter, common aggregates, group by, union, distinct. 
  • The programmer has to hand-optimize the execution plan, for example by deciding how many map and reduce nodes are needed. For complex chained flows, this can become a nightmare. Databases provide query optimizers for this purpose -- the precise sequence of operations is decided by the optimizer rather than by a programmer.
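
To make the first two limitations concrete, here is a minimal in-memory sketch in Python. It is not Hadoop or Google's MapReduce; the `map_reduce` helper, the sample `log` data, and the threshold of 10 are all hypothetical, chosen only for illustration. It hand-codes what a database would express as a single GROUP BY with a filter on the aggregate, and the filter has to be bolted on as an extra map step (a map-reduce-map flow).

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one Map -> shuffle -> Reduce pass in memory (illustrative only)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": group values by key
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


# Hand-coded equivalent of: SELECT page, SUM(clicks) FROM log GROUP BY page
def sum_mapper(record):
    page, clicks = record
    yield page, clicks


def sum_reducer(page, values):
    yield page, sum(values)


# Filtering on the aggregate needs a further map step after the reduce.
def popular_filter(record, threshold=10):
    page, total = record
    if total >= threshold:
        yield page, total


# Hypothetical click log: (page, clicks) pairs.
log = [("home", 3), ("about", 1), ("home", 9), ("about", 2), ("faq", 10)]

totals = map_reduce(log, sum_mapper, sum_reducer)
popular = [out for rec in totals for out in popular_filter(rec)]
```

Every piece here (the grouping, the sum, the chaining of the extra map step) is something the programmer writes and sequences by hand, which is the work a query language and optimizer would otherwise do.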

Three approaches have emerged to bridge the gap between relational databases and MapReduce. Let's examine each approach in turn and then discuss their pros and cons.

The first approach is to create a new higher-level scripting language that uses Map and Reduce as primitive operations. Using suc

Stop Email Overload and Break Silos Using Wikis, Blogs, and IM
July 21

Email is the central nervous system of most modern organizations, from startups to large corporations. Every communication, from the most important (planning for the big client meeting tomorrow) to the most trivial (fresh donuts in the kitchen) takes place through the corporate email system. The results: email overload and lowered productivity for the entire organization. Employees are tethered to their email via Blackberries even over the weekend, leading to communications burnout.

The biggest single reason for this is the inherent nature of email itself: it is a point-to-point communication medium. The sender has to decide both the content of the message and who the recipients are. If the recipient list is too large, it contributes to email overload. If it is too small, that can lead to communication gaps and "informational silos" in the organization, where one group in the company doesn't really know what another group is doing. Another problem is that each email message is a single unit, making it hard to track conversations among multiple parties. Many email readers thread conversations, but that is done at a syntactic rather than semantic level. Finally, putting everything in email makes it difficult to build institutional memory.

We hit the email wall at my company, Kosmix, recently. When we were fewer than 30 people, managing by email worked reasonably well. The team was small enough that everyone knew w

Why Google Doesn't Provide Earnings Forecasts
July 17

Most public companies provide forecasts of revenue and earnings in the upcoming quarters. These forecasts (sometimes called "guidance") form the basis of the work most stock analysts do to make buy and sell recommendations. Much to the consternation of these analysts, Google is among the few companies that have refused to follow this practice. As a result, estimates of Google's revenue by analysts using publicly available data, like comScore numbers, have often been spectacularly wrong. Today's earnings call may be no different.

A Google executive once explained to me why Google doesn't provide forecasts. To understand it, you have to think about the engineers at Google who work on optimizing AdWords. How do they know they're doing a good job? We know that Google is constantly bucket-testing tweaks to its AdWords algorithms. An ad optimization project is considered successful if it has one of two results:

  • Increase revenue per search (RPS), while not using additional ad real estate on the search results page (SERP).
  • Reduce the ad real estate on each SERP, while not reducing RPS.
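
The two success criteria above can be written as a small decision rule. This is only a sketch of the logic as described here, not Google's actual evaluation code; the function name and the idea of passing each measurement as a signed change from a bucket test are assumptions made for illustration.

```python
def is_clear_success(rps_change: float, ad_space_change: float) -> bool:
    """Return True if a bucket test meets one of the two stated criteria.

    rps_change:      change in revenue per search (positive = more revenue).
    ad_space_change: change in ad real estate per SERP (positive = more space).
    Both parameters are hypothetical measurements from a bucket test.
    """
    # Criterion 1: RPS goes up without using additional ad real estate.
    raises_rps = rps_change > 0 and ad_space_change <= 0
    # Criterion 2: ad real estate goes down without reducing RPS.
    frees_space = ad_space_change < 0 and rps_change >= 0
    return raises_rps or frees_space
```

A tweak that changes both numbers in the same direction, for example `is_clear_success(0.02, 0.05)`, returns `False`: it satisfies neither criterion, so the rule alone cannot settle it.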

The tricky cases are the ones that increase RPS, while also using more ad real estate. It then becomes a judgment call on whether they should be rolled out across the site. If Google were to make earnings forecasts, the thinking went, there would be huge temptation to roll out tw

The Real Long Tail: Why both Chris Anderson and Anita Elberse are Wrong
July 10

A new study by Anita Elberse, published in the Harvard Business Review, raises questions about the validity of Chris Anderson's Long Tail theory. If you're related to Rip Van Winkle, the Long Tail theory suggests that the dramatically lower distribution costs for media (such as music and movies) enabled by the internet have the potential to reshape the demand curve for media. Traditionally, these businesses have been hits-driven, with the majority of revenue and profits being attributable to a small number of items (the hits). Anderson argues that the internet's ability to serve niches cost-effectively increases the demand for items further down the "tail" of the demand curve, making the aggregate demand for the tail comparable to that for the head.
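
The head-versus-tail comparison can be made concrete with a few lines of Python. This is a toy illustration, not data from Anderson or Elberse: the power-law shape, the exponents, the catalog size, and the function names are all assumptions picked only to show how the head's share of total demand is computed and how it shifts when the curve flattens.

```python
def head_share(demand, head_size):
    """Fraction of total demand captured by the top `head_size` items."""
    ranked = sorted(demand, reverse=True)
    return sum(ranked[:head_size]) / sum(ranked)


def power_law(n_items, exponent):
    """Synthetic demand curve: demand at rank r is proportional to 1 / r**exponent."""
    return [1.0 / rank**exponent for rank in range(1, n_items + 1)]


# Two hypothetical catalogs of 10,000 items each.
hits_driven = power_law(10_000, 1.5)  # steep curve: demand concentrated in hits
flatter = power_law(10_000, 0.8)      # flatter curve: more demand in the tail

hits_head = head_share(hits_driven, 100)
flat_head = head_share(flatter, 100)
tail_is_larger = (1 - flat_head) > flat_head
```

With the steeper curve the top 100 items account for most of the demand; with the flatter one the tail's aggregate demand exceeds the head's. Which shape real sales data follow is exactly what the two sides dispute.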

Anderson's insight resonated instantly with the digerati. It is said that Helen of Troy's face launched a thousand ships; the Long Tail theory certainly launched more than a thousand startups, all with an obligatory Long Tail slide in their investor pitches. Recently, however, there has been a creeping suspicion that the data don't support the theory; the backlash has been spearheaded,