
Geeking with Greg

Exploring the future of personalized information


Yahoo's automatic content optimization
December 1
Deepak Agarwal, along with many others from Yahoo Research, has a paper at the upcoming NIPS 2008 conference, "Online Models for Content Optimization", with a fun peek inside of a system at Yahoo that automatically tests and optimizes which content to show to their readers.

It is not made entirely clear which pages at Yahoo use the system, but the paper says that it is "deployed on a major Internet portal and selects articles to serve to hundreds of millions of user visits per day."

The system picks which article to show in a slot from 16 potential candidates, where the pool of candidates is picked by editors and changes rapidly. The system seeks to optimize the clickthrough rate in the slot. The problem is made more difficult by the way the clickthrough rate on a given article changes rapidly as the article ages and as the audience coming to Yahoo changes over the course of a day, which means the system needs to adapt rapidly to new information.

The paper describes a few variations of explore/exploit algorithms that show the article that performed best recently while constantly testing the other articles to see if they might perform better.
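
To make the explore/exploit idea concrete, here is a minimal Python sketch. It uses epsilon-greedy selection with exponentially decayed counts, which is a generic textbook technique swapped in for illustration and not necessarily one of the variations in the paper. The class name, `epsilon`, and `decay` are all assumptions.

```python
import random


class EpsilonGreedySlot:
    """Toy explore/exploit chooser for one content slot (illustrative only)."""

    def __init__(self, article_ids, epsilon=0.1, decay=0.99, rng=None):
        self.epsilon = epsilon  # assumed fraction of traffic spent exploring
        self.decay = decay      # assumed per-impression decay of old evidence
        self.rng = rng or random.Random()
        self.clicks = {a: 0.0 for a in article_ids}
        self.views = {a: 0.0 for a in article_ids}

    def estimated_ctr(self, article_id):
        views = self.views[article_id]
        return self.clicks[article_id] / views if views > 0 else 0.0

    def choose(self):
        articles = list(self.views)
        # Explore: occasionally show a random candidate to keep testing it.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(articles)
        # Exploit: otherwise show the article with the best recent CTR.
        return max(articles, key=self.estimated_ctr)

    def record(self, article_id, clicked):
        # Decay all counts so recent behavior outweighs stale data.
        for a in self.views:
            self.views[a] *= self.decay
            self.clicks[a] *= self.decay
        self.views[article_id] += 1.0
        self.clicks[article_id] += 1.0 if clicked else 0.0
```

The decay is one simple way to adapt as an article ages or the audience shifts: older impressions count for less each time a new one is recorded.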

The result was that the CTR increased by 30-60% over editors manually selecting the content that was shown. Curiously, their attempt to show different content to different user segments (a coarse-grained version of personalization) did not







Finding task boundaries in search logs
November 20
There has been more talk lately, it seems to me, about moving away from stateless search where each search is independent and toward a search engine that pays attention to your previous searches when it tries to help you find the information you seek.

That makes a CIKM 2008 paper by Rosie Jones and Kristina Klinkner from Yahoo Research all the more relevant: "Beyond the Session Timeout: Automatic Hierarchical Segmentation of Search Topics in Query Logs" (PDF).

Rosie and Kristina looked at how to accurately determine when a searcher stops working on one task and starts looking for something new. The standard technique people have used in the past for finding task boundaries is to simply assume that all searches within a fixed period of time are part of the same task. But, in their experiments, they find that "timeouts, whatever their length, are of limited utility in identifying task boundaries, achieving a maximum precision of only 70%."
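
For reference, the timeout baseline can be sketched in a few lines of Python. This version starts a new task whenever the gap between consecutive queries exceeds a threshold, which is one common reading of the technique. The function name, the input format, and the 30-minute default are assumptions, not details from the paper.

```python
def split_by_timeout(queries, timeout_seconds=1800):
    """Group one user's queries into tasks using an inactivity timeout.

    queries: list of (timestamp_seconds, query_text), sorted by timestamp.
    Returns a list of tasks, each a list of (timestamp_seconds, query_text).
    """
    tasks = []
    for timestamp, text in queries:
        # Continue the current task if this query follows the previous one
        # closely enough; otherwise start a new task.
        if tasks and timestamp - tasks[-1][-1][0] <= timeout_seconds:
            tasks[-1].append((timestamp, text))
        else:
            tasks.append([(timestamp, text)])
    return tasks
```

Because the split depends only on time, two interleaved tasks issued within the timeout always land in the same group, whatever threshold is chosen.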

Looking at the Yahoo query logs more closely to explain this low accuracy, they find some surprises, such as the high number of searchers who work on multiple tasks simultaneously, even interleaving the searches corresponding to one task with the searches for another.

So, when the simple stuff fails, what do most people do? Think up a bunch of features and train a classifier. And, there you go, that's what Rosie an







Detecting spam just from HTTP headers
November 19
You have to love research work that takes some absurdly simple idea and shows that it works much better than anyone would have guessed.

Steve Webb, James Caverlee, and Calton Pu had one of these papers at CIKM 2008, "Predicting Web Spam with HTTP Session Information" (PDF). They said, everyone else seems to think we need the content of a web page to see if it is spam. I wonder how far we can get just from the HTTP headers?

Turns out surprisingly far. From the paper:
In our proposed approach, the [crawler] only reads the response line and HTTP session headers ... then ... employs a classifier to evaluate the headers ... If the headers are classified as spam, the [crawler] closes the connection ... [and] ignores the [content] ... saving valuable bandwidth and storage.

We were able to detect 88.2% of the Web spam pages with a false positive rate of only 0.4% ... while only adding an average of 101 [microseconds] to each HTTP retrieval operation .... [and saving] an average of 15.4K of bandwidth and storage.

It appears that web spammers tend to use specific IP ranges and put unusual gunk into their headers (e.g. "X-Powered-By" and "Link" fields), which makes it fairly easy to pick them out just from their headers. As one person suggested during the Q&A for the talk, spammers probably would quickly correct these oversights if it became impo
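
The crawl-time flow quoted above (read the headers, classify, skip the body if it looks like spam) can be sketched in Python as follows. The hand-set weights and threshold are hypothetical stand-ins for the trained classifier in the paper, and the function names are made up for illustration. Only the two header names come from the discussion above.

```python
import http.client

# Hypothetical hand-set weights; the paper trains a classifier instead.
SUSPICIOUS_HEADER_WEIGHTS = {"x-powered-by": 1.0, "link": 1.0}
SPAM_THRESHOLD = 1.5  # assumed cutoff, not a value from the paper


def spam_score(headers):
    """Sum the weights of suspicious header names present in the response."""
    names = {name.lower() for name in headers}
    return sum(w for h, w in SUSPICIOUS_HEADER_WEIGHTS.items() if h in names)


def looks_like_spam(headers):
    return spam_score(headers) >= SPAM_THRESHOLD


def fetch_unless_spam(host, path="/"):
    """Return the page body, or None if the headers alone look like spam."""
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        conn.request("GET", path)
        # getresponse() parses the status line and headers; the body is
        # not consumed until read() is called.
        response = conn.getresponse()
        if looks_like_spam(dict(response.getheaders())):
            return None  # skip the body; the connection is closed below
        return response.read()
    finally:
        conn.close()
```

Some body bytes may already have arrived in the socket buffer by the time the headers are parsed, so the bandwidth saved by closing early is approximate in this sketch.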





Measuring offline ads by their online impact
November 17
Googlers Diane Lambert and Daryl Pregibon had a paper at AdKDD 2008, "Online Effects of Offline Ads" (PDF), with a fun look at how far we can get measuring the impact of offline advertising by increases in search queries or website visits.

Some excerpts:
One measure of offline ad effectiveness is an increase in brand related online activity .... As people spend more time on the web, [the] steps toward purchase increasingly include searching for the advertiser's brand or visiting the advertiser's websites, even if the ad campaign was in an offline medium such as print, radio, or TV.

There are two obvious strategies for estimating the [gain] ... We can assume that the past is like the present and use daily outcomes before the campaign ran ... [but] the "before" number of visits ... is not necessarily a good estimate ... if interest in the product is expected to change over time even if no ad campaign is run. For example, if an ad is more likely to be run when product interest is high, then comparing counts-after to counts-before overstates the effect of the campaign.

Alternatively, we could estimate the [gain] ... by the outcome in control DMAs, which are markets in which the ad did not appear ... One problem, though, is that the advertiser may be more likely to advertise in DMAs where the interest in the product is likely to be hig
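
The two obvious strategies in the excerpts both reduce to a difference of averages. A minimal Python sketch, where the function names and the lists of daily counts (searches or visits) are illustrative assumptions and not anything taken from the paper:

```python
from statistics import mean


def before_after_gain(daily_before, daily_during):
    """Naive gain: average daily count during the campaign minus before it."""
    return mean(daily_during) - mean(daily_before)


def control_market_gain(daily_treated, daily_control):
    """Naive gain: average daily count in markets that ran the ad minus
    the average over the same days in control markets that did not."""
    return mean(daily_treated) - mean(daily_control)
```

Both estimates inherit the biases the excerpts describe: the first if interest would have changed without any campaign, the second if the advertised markets differ from the control markets to begin with.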





Learning not to advertise
November 14
Andrei Broder and a large crew from Yahoo Research had a paper at CIKM 2008, "To Swing or not to Swing: Learning when (not) to Advertise" (PDF), that is a joy to see for those of us who are hoping to make advertising more useful and less annoying.

The paper starts by motivating the idea of sometimes not showing ads:
In Web advertising it is acceptable, and occasionally even desirable, not to show any [ads] if no "good" [ads] are available.

If no ads are relevant to the user's interests, then showing irrelevant ads should be avoided since they impair the user experience [and] ... may drive users away or "train" them to ignore ads.

The paper looks at a couple of approaches for deciding when to show ads, one based on a simple threshold on the relevance score produced by Yahoo's ad ranking system, another training a more specialized classifier based on a long list of features.
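
The threshold approach is easy to picture in code. A minimal Python sketch, assuming each candidate ad arrives with a relevance score from some upstream ranker. The function name, the per-ad filtering, the default threshold, and the cap on ads shown are all assumptions for illustration and not details from the paper.

```python
def ads_to_show(scored_ads, threshold=0.5, max_ads=3):
    """Keep only ads whose relevance score clears the threshold.

    scored_ads: list of (ad_id, relevance_score) pairs.
    Returns ad ids, best first; an empty list means show no ads at all.
    """
    relevant = [(ad_id, score) for ad_id, score in scored_ads
                if score >= threshold]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [ad_id for ad_id, _ in relevant[:max_ads]]
```

An empty result is the "not to swing" case: when nothing clears the bar, the slot stays empty and no irrelevant ad is shown.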

An unfortunate flaw in the paper is that the system was evaluated using a manually labeled set of relevant and irrelevant ads. As the paper itself says, it would have been better to consider expected revenue and user utility, preferably using data from actual Yahoo users. But, with the exception of a brief mention of "preliminary experiments ... using click-through data" that they "are unable to include ... due to space constraints", they leave the question of the revenue and user satisfact