What is Toluu?
Toluu is a free service for sharing the feeds you read and discovering new ones.
Get Invite

0x0000


Waiting at the doorMay 5 2008
IMG_1798 When running in the underground portions of the muni metro system, the trains are controlled by computer and because of this, they stop at the same place along the platform every time. Unlike bart, which is similarly controlled by computer, the spots where the doors will open are not marked out. Even without these markings it is still possible to tell where one will need to board. As people enter and leave the trains, their feet wipe clean the dirt form the very edge of the train platform leaving clean marks at each door opening. By observing and counting off the marks in sets of four it is possible to tell where to stand to enter at a particular door on a particular train.
The 5 minute ruleDecember 27 2007
There is a great paper every computer engineer should read, The 5 minute rule for trading memory for disc accesses and the 10 byte rule for trading memory for CPU time:
Abstract If an item is accessed frequently enough, it should be main memory resident. For current technology, "frequently enough" means about every five minutes. Along a similar vein, one can frequently trade memory space for CPU time. For example, bits can be packed in a byte at the expense of extra instructions to extract the bits. It makes economic sense to spend ten bytes of main memory to save one instruction per second. These results depend on current price ratios of processors, memory and disc accesses. These ratios are changing and hence the constants in the rules are changing.
Published by Tandem in 1986 it is still relevant today. Quick and fun, the paper provides a analytical tool that is powerful beyond it's original scope. It has been updated twice, ten years later (1997), and twenty years later (2007). The ten year paper show how the rule still stood at that time. The twenty year later paper provides new insights in to the topic, has lots of good analytical work, and should be read if you are trying to apply the 5 minute rule to modern systems. If I was teaching com
Corporate Police and Due ProcessDecember 25 2007
We are entering a time when it is becoming clear that corporations are a means to sidestep the judicial process. Today, and to a lesser extent in the past, companies are put into the position of adjudicating decisions of legal concern regarding peoples' actions within their domain of control. This post will use Flickr and issues around its Safe Search for our examples; but the symptoms described are common.

Safe Search is a feature meant to partition the photos uploaded to Flickr into three groups which they describe as follows:

1. Safety Level
  • Safe - Content suitable for a global, public audience
  • Moderate - If you're not sure whether your content is suitable for a global, public audience but you think that it doesn't need to be restricted per se, this category is for you
  • Restricted - This is content you probably wouldn't show to your mum, and definitely shouldn't be seen by kids
Where there are many reasons for the rating system the one we will concern ourselves with is that of compliance with the laws of Singapore, Hong Kong, Korea, and Germany. In these countries people's ability to view content is limited:

Note: If your Yahoo! ID is based in Singapore, Hong Kong or Korea you will only be able to view safe content based on your local Terms of Service so won't be able t




Simple ConsistencyDecember 24 2007
For this discussion we are interested in a database system which can provide a availability guarantee for data reads and writing such that a reader must be able obtain the information supplied by the last successful write with some probability n and a writer must be able to record its data to the database with some other probability n. To achieve this we must guarantee that the probability of critical component failure is less then n for either reads or writes. This can be done either by acquiring components which have a failure rate less then n or by by combining components in a redundant configuration such that the probability of all redundant components failing is less then  n. This article concerns it self with the second approach and the issues that arise from a system where one redundantly stores data on more then one distinct node in a cluster with to goal of achieving a low probability of failure.

With a data replication approach to availability we are bought the the primary concern of this work, detecting and handling error states which cause data to become inconsistent across nodes replicating data. Inconsistencies in replication create ambiguity in the definition of a successful wright and obfuscate the value of the last wright.

There are three different errors which can endanger consistency of writes:
  1. Partial write: The writ




Scaling Reads: Replication vs PartitioningNovember 28 2007

The common method of read scaling is to copy read heavy data to more nodes. This increases the available read throughput of a datum by increasing the total amount of resources dedicated to serving it. This will always be necessary when the read load for a single datum surpasses the resources of a single node (IO, CPU, Network, etc). This method scales well but has the down side that it is often not efficient with respect to storage space. This is not so often a issue for data on disk but is a often issue for data cached in memory. The following example demonstrates the difference:

Let us posit that there two nodes { node1, node2 } and a data set having four datums, { A, B, C, D } with the load distribution <A:40%, B:40%, C:10%, D:10%>. To scale reads one could fully replicate the datums across both nodes, node1:{ A, B, C, D }, node2:{ A, B, C, D }. We would then distribute the query load evenly between the two nodes so they were each handling one half of the total load for the data. A second option would be to share the load between the two nodes by partitioning the data across the nodes. One such repartitioning is node1:{ A, C }, node2:{ B, D } resulting in the load distribution of  node1:<A:40%, C:10%>, node2:<B:40%, D:10%>.  With this partitioning the read load is still evenly distributed across the nodes but only half as much data is stored at each node.

Even though this example appears to assumes that reliable load data is availabl