Stochastic vs. Authoritative Generation of Unique Identifiers

William R Soley
Sun Microsystems
July 1, 2000   [Rev 4]

Introduction

Distributed information systems that require assignment of globally and perpetually unique identifiers face many challenges. One attractive solution is to assign identifiers at random from a sufficiently large space to assure the desired probability of uniqueness. (I will call this the "stochastic" method.) I compare this to the popular solution based on a centralized authority that assigns and keeps a record of all identifiers. (I will call this the "authoritative" method.)

The stochastic method is often criticized based on the perception that authoritative methods, being deterministic, are more reliable. In fact, authoritative algorithms are only deterministic if we assume that undetected errors never occur. Undetected errors do occur, but the probability is so low that we have become accustomed to ignoring them. They are possible due to programming bugs, human operational error, hardware failure, cosmic rays, noise on network circuits, and many other causes. In most cases, computers employ algorithms to detect and correct errors when they occur; however, a detection algorithm has only a certain probability of catching any given error. For example, there is some probability that a data block containing an error will by chance have a valid checksum. The combined effect of all possible sources of error in the authoritative method is nearly impossible to estimate. On the other hand, the probability of an identifier collision in the stochastic method can be precisely calculated and controlled, so the behavior of the system is predictable.

A clear advantage of the stochastic approach is availability. Since we desire a distributed system, we do not want a single point of failure that could prevent identifiers from being assigned. There are techniques that can mitigate the problem in the authoritative model (such as preassignment and caching), but they add significant complexity to the system. In the stochastic model, identifiers are assigned locally as needed, which is much simpler and guarantees they are available when needed.

A major factor limiting the acceptance of stochastic methods is the perception that collisions are more likely than equivalent failures of the authoritative method. Even professionals who work with very large and very small numbers on a daily basis often find it difficult to visualize their relative size. The following example is presented in the hope that it will help readers visualize the relative size of the numbers that might be proposed for a stochastic solution and so make it easier to form an accurate opinion about its reliability compared to an authoritative solution. I also hope that we can have some fun in the process.

An Example

The example describes a 119-bit random number encoded using radix 62 to form a 20-character alphanumeric identifier. Radix 62 consists of the digits 0-9 and the letters A-Z and a-z. It is chosen over the more common radix 64 (Base64) to eliminate punctuation characters, which may be problematic in some contexts. The 119-bit number would be generated using a cryptographic-grade random number generator (such as cryptorand). Ordinary library random number functions must not be used for this application, as they usually have only a few bits of entropy.
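
To make the construction concrete, here is a minimal modern Python sketch (not part of the original design): it draws 119 bits from a cryptographic-grade source and encodes them in radix 62, using the standard secrets module in place of cryptorand; the function name is only illustrative. Since 62^20 is slightly larger than 2^119, 20 characters always suffice.

    import secrets

    # Radix 62 alphabet: digits, upper case, lower case (no punctuation).
    ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

    def new_identifier(bits=119, length=20):
        # Draw `bits` bits from a cryptographic-grade source.
        n = secrets.randbits(bits)
        # Encode in radix 62, zero-padded on the left to `length` characters.
        chars = []
        for _ in range(length):
            n, r = divmod(n, 62)
            chars.append(ALPHABET[r])
        return "".join(reversed(chars))

    print(new_identifier())    # e.g. "0aH7..." : 20 alphanumeric characters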

The identifier in our example, having 119 bits, would have 2^119 possible values:
        2^119 ≈ 6.6 × 10^35

This number is hard to visualize.

If you write that out without scientific notation, it comes to:
         664,613,997,892,457,936,451,903,530,140,172,288
which we can all agree is a big number, but exactly how big is it?

Let's try to map that onto something that is easier to visualize. Consider that a California SuperLotto Plus ticket is about 0.005 inches thick:
        0.005 in ≈ 1.27 × 10^-4 m

Then, the total thickness of a stack of 2^119 tickets (one for each possible random identifier) is:
        2^119 × 1.27 × 10^-4 m ≈ 8.4 × 10^31 m

This number is still hard to visualize.

Since the speed of light is approximately:
        c ≈ 3.0 × 10^8 m/s

A lightyear is defined as the distance light travels in a year:
        1 lightyear ≈ 3.0 × 10^8 m/s × 3.16 × 10^7 s ≈ 9.5 × 10^15 m

Then, the combined thickness of the lottery tickets would be:
        (8.4 × 10^31 m) / (9.5 × 10^15 m/lightyear) ≈ 8.9 × 10^15 lightyears

This number is still hard to visualize.

Given that the thickness of the disk of the Milky Way Galaxy is about 1000 lightyears:
        1 Milky Way disk thickness ≈ 1.0 × 10^3 lightyears

Then, the total thickness of the 2^119 lottery tickets would be about 9 trillion times the thickness of the Milky Way's disk:
        (8.9 × 10^15 lightyears) / (1.0 × 10^3 lightyears) ≈ 9 × 10^12

And that number is still hard to visualize.
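
For readers who would rather check the arithmetic than eyeball the zeros, here is a short Python sketch that reproduces the numbers above; the rounded constants (ticket thickness, meters per lightyear, disk thickness) are the same ones used in the text.

    N = 2**119                     # possible 119 bit identifiers
    ticket_m = 0.005 * 0.0254      # ticket thickness: 0.005 inch, in meters
    lightyear_m = 9.46e15          # meters per lightyear
    disk_ly = 1000                 # Milky Way disk thickness, in lightyears

    stack_m = N * ticket_m
    stack_ly = stack_m / lightyear_m
    print(f"{N:.2e} tickets")                            # ~6.65e+35
    print(f"{stack_m:.2e} m of tickets")                 # ~8.4e+31 m
    print(f"{stack_ly:.2e} lightyears")                  # ~8.9e+15 lightyears
    print(f"{stack_ly / disk_ly:.2e} disk thicknesses")  # ~9e+12 (9 trillion)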

The Need for Speed

Trying to visualize the distance isn't helping much, so let's try another approach. So far, we have a stack of lottery tickets, each ticket representing a single possible 119-bit identifier. The stack is 9 trillion times the thickness of the Milky Way's disk. Let's try thinking in terms of time. If we assume that we can somehow travel at the speed of light, it would take 9 quadrillion years to travel past all the tickets:
        (8.4 × 10^31 m) / c ≈ 2.8 × 10^23 s ≈ 8.9 × 10^15 years        (c is the speed of light)

The age of the Earth is estimated to be about 4.55 billion years:
        4.55 × 10^9 years

To put 9 quadrillion years in perspective, consider that it is 2 million times the age of the Earth:
        (8.9 × 10^15 years) / (4.55 × 10^9 years) ≈ 2 × 10^6

I am not sure I can visualize that, but it certainly is impressive.
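
Continuing the same sort of check, a few more lines (using the rounded figures above) confirm the travel time and the comparison to the age of the Earth:

    c = 3.0e8               # speed of light, m/s
    year_s = 3.156e7        # seconds per year
    stack_m = 8.4e31        # stack thickness from above, in meters
    earth_age = 4.55e9      # age of the Earth, in years

    travel_years = stack_m / c / year_s
    print(f"{travel_years:.1e} years")                    # ~8.9e+15 (9 quadrillion years)
    print(f"{travel_years / earth_age:.1e} Earth ages")   # ~2e+06 (2 million)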

Since we are using lottery tickets as an example (no coincidence), let's consider the odds of winning the jackpot. The probability that a ticket will win the California SuperLotto Plus jackpot is:
        1 / 41,416,353 ≈ 2.4 × 10^-8

So given that huge stack of lottery tickets, we expect many winning tickets:
        2^119 tickets × 2.4 × 10^-8 ≈ 1.6 × 10^28 winning tickets

Since we are traveling at the speed of light, we would fly past an average of 57 thousand jackpot winning tickets every second, and we could continue to do so for 2 million times the age of the Earth:
        (1.6 × 10^28 winning tickets) / (2.8 × 10^23 s) ≈ 5.7 × 10^4 winning tickets per second

Of course, the average jackpot is $9 million, so you would fly past more than $500 billion in jackpots every second, over $60 trillion just in the time it took you to read this silly example (Bill Gates would be jealous):
        5.7 × 10^4 winners/s × $9 × 10^6 per jackpot ≈ $5.1 × 10^11 per second
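
One last sketch for this section, assuming the roughly 1-in-41-million SuperLotto Plus jackpot odds and the $9 million average jackpot quoted above:

    N = 2**119
    p_jackpot = 1 / 41_416_353      # SuperLotto Plus jackpot odds (approximate)
    travel_s = 2.8e23               # seconds to fly past the whole stack, from above
    jackpot = 9e6                   # average jackpot, in dollars

    winners = N * p_jackpot
    per_second = winners / travel_s
    print(f"{winners:.1e} winning tickets")              # ~1.6e+28
    print(f"{per_second:.1e} winners per second")        # ~5.7e+04
    print(f"${per_second * jackpot:.1e} per second")     # ~$5e+11 in jackpots every second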

Getting Serious

These comparisons are fun, but what we really want to know is the probability that a system that randomly generates 119-bit identifiers will fail (where failure is defined as accidentally generating a duplicate identifier). To compute that, we have to know how many identifiers we expect to generate in the life of the system. Obviously, if we generate only one identifier, the probability of a duplicate is 0. After generating i unique identifiers, the probability of the next identifier being a duplicate is:
        i / 2^119

So let's assume that our generator has an expected lifetime of five billion identifiers. I picked this number to be conservative. It works out to three per second continuously for 53 years. It's also enough to assign several identifiers to every "connected" person on the planet:
        3 identifiers/s × 3.16 × 10^7 s/year × 53 years ≈ 5 × 10^9 identifiers

The duplicate random identifier problem is essentially identical to the well-studied birthday problem. The probability of any duplicate in a population of n identifiers is given by:
        P_duplicate(n) = 1 - (1 - 1/2^119)(1 - 2/2^119) ... (1 - (n-1)/2^119)

Unfortunately, that expression is not tractable for such large numbers, so I will use an approximation that is accurate to several digits for populations in our range of interest:
        P_duplicate(n) ≈ n(n-1) / 2^120

The approximation, n(n-1)/2^120, always yields an upper bound on the true probability, so it is conservative for our application. We also know a lower bound, so in fact we can say that for P_duplicate(n) < 0.0007, the approximation is accurate to at least 3 digits:
        1 - e^(-n(n-1)/2^120)  ≤  P_duplicate(n)  ≤  n(n-1)/2^120

So the probability that our example stochastic model would generate one or more duplicates over the course of its design life is:
        P_duplicate(5 × 10^9) ≈ (5 × 10^9)(5 × 10^9 - 1) / 2^120 ≈ 1.9 × 10^-17
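
Here is a small Python sketch of the bounds above; math.expm1 is used so the lower bound is not lost to floating point round-off at these tiny probabilities, and the function name is only illustrative:

    import math

    def p_duplicate_bounds(n, bits=119):
        # Bounds on the birthday-problem collision probability for n identifiers
        # drawn uniformly at random from 2**bits possible values.
        upper = n * (n - 1) / 2**(bits + 1)   # union bound (the approximation above)
        lower = -math.expm1(-upper)           # 1 - exp(-upper), computed without round-off
        return lower, upper

    lo, hi = p_duplicate_bounds(5 * 10**9)
    print(f"{lo:.3e} <= P_duplicate <= {hi:.3e}")   # both ~1.9e-17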

This number represents the chance of an error in the life of the system. If you are used to undetected bit error rate statistics, then for comparison purposes, the equivalent undetected bit error rate would be approximately:
        1.9 × 10^-17 / (5 × 10^9 identifiers × 119 bits/identifier) ≈ 3 × 10^-29 errors/bit

That is about 30 times worse than disk, and about 30 times better than tape or network.

A duplicate is 1 billion times less likely to occur over the whole design life than winning the jackpot from a single lottery ticket purchase:
        (2.4 × 10^-8) / (1.9 × 10^-17) ≈ 1 × 10^9

Some are fond of expressing reliability in terms of some number of nines. The reliability of the stochastic generator in our example is:
        (1 - 1.9 × 10^-17) × 100 %  =  99.9999999999999981 %

Wow! That's much more accurate than some of Intel's floating point units.
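
The conversions in this section are one-liners. The sketch below reproduces them from the numbers above; the decimal module is used only so the long string of nines is not rounded away by ordinary floating point arithmetic:

    from decimal import Decimal

    p_dup = 1.9e-17                # lifetime duplicate probability computed above
    life = 5e9                     # identifiers generated over the design life
    bits = 119                     # bits per identifier
    p_jackpot = 2.4e-8             # SuperLotto Plus jackpot probability

    print(f"{p_dup / (life * bits):.1e}")   # ~3e-29 equivalent undetected bit error rate
    print(f"{p_jackpot / p_dup:.1e}")       # ~1e+09: a jackpot win is a billion times more likely
    print((1 - Decimal("1.9e-17")) * 100, "%")   # 99.999999999999998100 %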

Compared to Authoritative Model

The authoritative model uses deterministic algorithms to assign identifiers and track those that have already been assigned. Although there is no random number directly involved in this process, chance enters in the form of analog noise on communication circuits, bugs in the software, human error in operations, media failure in storage systems, cosmic ray interference with electronics, and other factors which we routinely ignore. Unfortunately, these factors are very hard to quantify, especially in their combined effect. To keep things simple (and once again be conservative) I will only consider the network factor, which is relatively easy to quantify.

I assume the data must pass over the network, probably at least a total of ten hops. The probability of an undetected bit error in the network is about 10^-27 per bit per hop for ATM over fiber (one of the most reliable media). Thus, the total probability of an undetected bit error occurring in one or more identifiers over the design life of the system is:
        10 hops × 119 bits × 5 × 10^9 identifiers × 10^-27 errors/bit/hop ≈ 6 × 10^-15

This number is a very rough guess; however, it is a very conservative one that considers only one of the many possible sources of error. The conclusion is that although one may intuitively expect the authoritative method to yield a lower risk due to its "deterministic" nature, it is in fact at least 300 times more likely to produce an error than the stochastic method:
        (6 × 10^-15) / (1.9 × 10^-17) ≈ 300
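
The same back-of-the-envelope estimate, using the hop count and undetected bit error rate assumed above:

    hops = 10          # network hops each identifier traverses
    bits = 119         # bits per identifier
    life = 5e9         # identifiers generated over the design life
    p_bit = 1e-27      # undetected bit error rate for ATM over fiber, per bit per hop

    p_authoritative = hops * bits * life * p_bit
    p_stochastic = 1.9e-17          # duplicate probability of the stochastic method
    print(f"{p_authoritative:.0e}")                  # ~6e-15
    print(f"{p_authoritative / p_stochastic:.0f}")   # ~300 times more likely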

Sensitivity to External Errors

Even if you have designed your system to use stochastic identifier generation, there is still the matter of what the identifiers are used for. Almost certainly they will be transmitted over networks, stored on disks, possibly read over the phone and typed into another system. In each of these situations, there is a possibility of error. Since the stochastic method guarantees that the allocation of identifiers is sparse and evenly distributed, it reduces the chance that such an error would produce another valid identifier to:
        5 × 10^9 identifiers in use / 2^119 possible identifiers ≈ 7.5 × 10^-27

This fact alone is a compelling reason in favor of stochastic generation.

Design Guidelines

Once you have decided on a stochastic method, the three variables that determine your design are:
        life     - total number of identifiers consumed in the life of the system
        bits     - size of the identifier in bits
        P_error  - probability of an error over the life of the system

These variables are related by the following approximation, which is accurate to at least 3 digits for the range 0 < P_error(life, bits) < 0.0007:
        P_error(life, bits) ≈ life^2 / 2^(bits+1)

For comparison purposes, the equivalent undetected bit error rate is approximately:
        BER_equivalent ≈ P_error / (life × bits)

The following table shows the required number of bits and the equivalent undetected bit error rate for the indicated lifetime error performance and lifetime identifier usage. Recommended values are shaded. For example, a system designed to use one billion (1E+09) identifiers over its lifetime, with a requirement for a 10^-16 (1E-16) lifetime error performance, requires 112 bits of entropy in the identifier, and is comparable to a 10^-27 (1E-27) undetected bit error rate.
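
The approximation is easy to invert into a sizing helper. The sketch below (the function names are only illustrative) finds the smallest identifier size that meets a lifetime error budget, and it reproduces the worked example above: one billion identifiers at 10^-16 requires 112 bits, comparable to roughly a 10^-27 undetected bit error rate.

    import math

    def required_bits(life, p_error):
        # Smallest identifier size (bits) such that life**2 / 2**(bits + 1) <= p_error.
        return math.ceil(math.log2(life * life / p_error)) - 1

    def equivalent_ber(life, p_error, bits):
        # Undetected bit error rate giving the same lifetime error probability.
        return p_error / (life * bits)

    life, p_error = 1e9, 1e-16
    bits = required_bits(life, p_error)
    print(bits)                                          # 112
    print(f"{equivalent_ber(life, p_error, bits):.1e}")  # ~8.9e-28, about 1e-27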

Conclusion

Stochastic identifier generation is at least as reliable as a traditional authoritative approach. It is more available because identifiers are generated locally, with no need to communicate with a service that might be down. It is simpler to implement and exhibits faster response time since no synchronization or communication is needed. It reduces sensitivity to external errors. Its behavior can be precisely predicted by a mathematical model, whereas an authoritative generator's behavior depends on numerous variables which are difficult to estimate. By carefully controlling the identifier size, stochastic generation can be made arbitrarily reliable to meet the most stringent requirements.