Resolution of Toom's rule paradox

A few days ago our Ghost Pontiff Dave Bacon wondered how Toom's noisy but highly fault-tolerant 2-state classical cellular automaton can get away with violating the Gibbs phase rule, according to which a finite-dimensional locally interacting system, at generic points in its phase diagram, can have only one thermodynamically stable phase.  The Gibbs rule is well illustrated by the low-temperature ferromagnetic phases of the classical Ising model in two or more dimensions: both phases are stable at zero magnetic field, but an arbitrarily small field breaks the degeneracy between their free energies, making one phase metastable with respect to nucleation and growth of islands of the other.  In the Toom model, by contrast, the two analogous phases are absolutely stable over a finite area of the phase diagram, despite biased noise that would seem to favor one phase over the other.  Of course Toom's rule is not microscopically reversible, so it is not bound by the laws of equilibrium thermodynamics.
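For readers who want to play with it, here is a minimal simulation sketch of the Toom rule, in which each site is updated to the majority of itself, its northern neighbor, and its eastern neighbor, and is then hit by noise.  The biased-noise parameterization (independent flips toward +1 with probability p and toward -1 with probability q after the majority vote) is my own illustrative convention, not necessarily the exact one used in the papers discussed here.

import numpy as np

rng = np.random.default_rng(0)

def toom_step(spins, p=0.05, q=0.02):
    """One noisy update of a 2-state (+1/-1) lattice under Toom's NEC majority rule."""
    north = np.roll(spins, -1, axis=0)   # periodic boundary conditions
    east = np.roll(spins, -1, axis=1)
    majority = np.sign(spins + north + east)   # majority of self, north, east (sum is never zero)
    # biased noise: overwrite with +1 with probability p, with -1 with probability q
    r = rng.random(spins.shape)
    return np.where(r < p, 1, np.where(r < p + q, -1, majority))

# Start in the all -1 phase; for weak enough noise it survives even though the noise favors +1.
lattice = -np.ones((64, 64), dtype=int)
for _ in range(200):
    lattice = toom_step(lattice)
print("fraction of -1 sites after 200 steps:", np.mean(lattice == -1))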

Nevertheless, as Dave points out, the distribution of histories of any locally interacting d-dimensional system, whether microscopically reversible or not, can be viewed  as an equilibrium Gibbs distribution of a d+1 dimensional system, whose local Hamiltonian is chosen so that the d dimensional system’s transition probabilities are given by Boltzmann exponentials of interaction energies between consecutive time slices.  So it might seem, looking at it from the d+1 dimensional viewpoint, that the Toom model ought to obey the Gibbs phase rule too.
The resolution of this paradox, described in my 1985 paper with Geoff Grinstein,  lies in the fact that the d to d+1 dimensional mapping is not surjective.  Rather it is subject to the normalization constraint that for every configuration X(t) at time t, the sum over configurations X(t+1) at time t+1 of transition probabilities P(X(t+1)|X(t)) is exactly 1.    This in turn forces the d+1 dimensional free energy to be identically zero, regardless of how the d dimensional system’s transition probabilities are varied.  The Toom model is able to evade the Gibbs phase rule because

  • being irreversible, its d dimensional free energy is ill-defined, and
  • the normalization constraint allows two phases to have exactly equal  d+1 dimensional free energy despite noise locally favoring one phase or the other.

Just outside the Toom model’s bistable region is a region of metastability (roughly within the dashed lines in the above phase diagram) which can be given an interesting interpretation in terms of the  d+1 dimensional free energy.  According to this interpretation, a metastable phase’s free energy is no longer zero, but rather -ln(1-Γ)≈Γ, where Γ is the nucleation rate for transitions leading out of the metastable phase.  This reflects the fact that the transition probabilities no longer sum to one, if one excludes transitions causing breakdown of the metastable phase.  Such transitions, whether the underlying d-dimensional model is reversible (e.g. Ising) or not (e.g. Toom), involve critical fluctuations forming an island of the favored phase just big enough to avoid being collapsed by surface tension.  Such critical fluctuations occur at a rate
$\Gamma \approx \exp(-\mathrm{const}/s^{d-1})$
where $s>0$ is the distance in parameter space from the bistable region (or, in the Ising example, from the bistable line).  This expression, from classical homogeneous nucleation theory, makes the d+1 dimensional free energy a smooth but non-analytic function of s, identically zero wherever a phase is stable, but lifting off very smoothly from zero as one enters the region of metastability.
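A tiny numerical check makes the "very smooth lift-off" concrete; the constant and the choice d = 2 below are arbitrary, purely illustrative values.

import math

# f(s) = exp(-const / s^(d-1)) for s > 0, and f(s) = 0 for s <= 0;
# every derivative of f vanishes at s = 0, so the lift-off is smooth but non-analytic.
def metastable_free_energy(s, const=1.0, d=2):
    return 0.0 if s <= 0 else math.exp(-const / s ** (d - 1))

for s in [0.0, 0.05, 0.1, 0.2, 0.5]:
    print(f"s = {s:4.2f}   f(s) = {metastable_free_energy(s):.3e}")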
 

 
Below, we compare  the d and d+1 dimensional free energies of the Ising model with the d+1 dimensional free energy of the Toom model on sections through the bistable line or region of the phase diagram.

We have been speaking so far only of classical models.  In the world of quantum phase transitions another kind of d to d+1 dimensional mapping is much more familiar, the quantum Monte Carlo method, nicely described in these lecture notes, whereby a d dimensional zero-temperature quantum system is mapped to a d+1 dimensional finite-temperature classical Monte Carlo problem.   Here the extra dimension, representing imaginary time, is used to perform a path integral, and unlike the classical-to-classical mapping considered above, the mapping is bijective, so that features of the d+1 dimensional classical system can be directly identified with corresponding ones of the d dimensional quantum one.
 
 

Non-chaotic irregularity

In principle, barring the intervention of chance, identical causes lead to identical effects.  And except in chaotic systems, similar causes lead to similar effects.  Borges' story "Pierre Menard" exemplifies an extreme version of this idea: an early 20th-century writer studies Cervantes' life and times so thoroughly that he is able to recreate several chapters of "Don Quixote" without mistakes and without consulting the original.
Meanwhile, back at the ShopRite parking lot in Croton on Hudson, NY,  they’d installed half a dozen identical red and white parking signs, presumably all from the same print run, and all posted in similar environments, except for two in a sunnier location.

The irregular patterns of cracks that formed as the signs weathered were so similar that at first I thought the cracks had also been printed, but then I noticed small differences. The sharp corners on letters like S and E,  apparently points of high stress, usually triggered near-identical cracks in each sign, but not always, and in the sunnier signs many additional fine cracks formed. 
Another example of reproducibly irregular dynamics was provided over 30 years ago by Ahlers and Walden’s experiments on convective turbulence, where a container of normal liquid helium, heated from below, exhibited nearly the same sequence of temperature fluctuations in several runs of the experiment.

 
 
 

What increases when a self-organizing system organizes itself? Logical depth to the rescue.

(An earlier version of this post appeared in the latest newsletter of the American Physical Society’s special interest group on Quantum Information.)
One of the most grandly pessimistic ideas from the 19th century is that of  “heat death” according to which a closed system, or one coupled to a single heat bath at thermal  equilibrium,  eventually inevitably settles into an uninteresting state devoid of life or macroscopic motion.  Conversely, in an idea dating back to Darwin and Spencer, nonequilibrium boundary conditions are thought to have caused or allowed the biosphere to self-organize over geological time.  Such godless creation, the bright flip side of the godless hell of heat death, nowadays seems to worry creationists more than Darwin’s initially more inflammatory idea that people are descended from apes. They have fought back, using superficially scientific arguments, in their masterful peanut butter video.
Self-organization versus heat death
Much simpler kinds of complexity generation occur in toy models with well-defined dynamics, such as this one-dimensional reversible cellular automaton.  Started from a simple initial condition at the left edge (periodic, but with a symmetry-breaking defect), it generates a deterministic wake-like history of growing size and complexity.  (The automaton obeys a second-order transition rule, with a site's future differing from its past iff exactly two of its first and second neighbors in the current time slice, not counting the site itself, are black and the other two are white.)  A small simulation sketch appears below the figure.

Fig. 2 (Time →)
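Here is a small simulation sketch of that rule.  The seed below (an alternating pattern with a single defect, copied into the two starting time slices that a second-order rule requires) is my guess at "simple, periodic, with a symmetry-breaking defect"; the post does not spell out the exact initial condition.

import numpy as np

def step(past, present):
    """One update of the second-order rule on a periodic ring of 0/1 cells."""
    neighbors = sum(np.roll(present, k) for k in (-2, -1, 1, 2))
    flip = (neighbors == 2).astype(int)      # exactly two of the four neighbors are black
    return (past + flip) % 2                 # future = past XOR flip

n, steps = 64, 200
present = np.array([i % 2 for i in range(n)])   # simple periodic pattern...
present[0] = 1 - present[0]                     # ...with a symmetry-breaking defect
past = present.copy()                           # arbitrary choice for the second starting slice
history = [past, present]
for _ in range(steps):
    past, present = present, step(past, present)
    history.append(present)
# Swapping the final two slices and running the very same rule retraces the whole
# history back to the seed, which is what makes the automaton reversible.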
But just what is it that increases when a self-organizing system organizes itself?
Such organized complexity is not a thermodynamic potential like entropy or free energy.  To see this, consider transitions between a flask of sterile nutrient solution and the bacterial culture it would become if inoculated by a single seed bacterium.  Without the seed bacterium, the transition from sterile nutrient to bacterial culture is allowed by the Second Law, but prohibited by a putative “slow growth law”, which prohibits organized complexity from increasing quickly, except with low probability.
Fig. 3

The same example shows that organized complexity is not an extensive quantity like free energy.  The free energy of a flask of sterile nutrient would be little altered by adding a single seed bacterium, but its organized complexity must have been greatly increased by this small change.  The subsequent growth of many bacteria is accompanied by a macroscopic drop in free energy, but little change in organized complexity.
The relation between universal computer programs and their outputs has long been viewed as a formal analog of the relation between theory and phenomenology in science, with the various programs generating a particular output x being analogous to alternative explanations of the phenomenon x.  This analogy draws its authority from the ability of universal computers to execute all formal deductive processes and their presumed ability to simulate all processes of physical causation.
In algorithmic information theory the Kolmogorov complexity of a bit string x is defined as the size in bits of its minimal program x*, the smallest (and lexicographically first, in case of ties) program causing a standard universal computer U to produce exactly x as output and then halt:
$x^* = \min\{p : U(p) = x\}$
Because of the ability of universal machines to simulate one another, a string's Kolmogorov complexity is machine-independent up to a machine-dependent additive constant, and similarly is equal, to within an additive constant, to the string's algorithmic entropy $H_U(x)$, the negative log of the probability that U would output exactly x and halt if its program were supplied by coin tossing.  Bit strings whose minimal programs are no smaller than the string itself are called incompressible, or algorithmically random, because they lack internal structure or correlations that would allow them to be specified more concisely than by a verbatim listing.  Minimal programs themselves are incompressible to within O(1), since otherwise their minimality would be undercut by a still shorter program.  In contrast to minimal programs, any program p that is significantly compressible is intrinsically implausible as an explanation for its output, because it contains internal redundancy that could be removed by deriving it from the more economical hypothesis p*.  In terms of Occam's razor, a program that is compressible by s bits is deprecated as an explanation of its output because it suffers from s bits' worth of ad hoc assumptions.
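Kolmogorov complexity is uncomputable, but any lossless compressor gives a crude upper bound that already illustrates the compressible/incompressible distinction; in the sketch below zlib stands in, very imperfectly, for the minimal program.

import os
import zlib

def compressed_size(s: bytes) -> int:
    """Length of the zlib-compressed string: a rough upper bound on its Kolmogorov complexity."""
    return len(zlib.compress(s, 9))

structured = b"01" * 5000          # a highly regular 10,000-byte string
random_ish = os.urandom(10000)     # incompressible with overwhelming probability

print("structured:", compressed_size(structured), "bytes")   # far fewer than 10,000
print("random-ish:", compressed_size(random_ish), "bytes")   # about 10,000, plus overhead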
Though closely related[1] to statistical entropy, Kolmogorov complexity itself is not a good measure of organized complexity because it assigns high complexity to typical random strings generated by coin tossing, which intuitively are trivial and unorganized.  Accordingly many authors have considered modified versions of Kolmogorov complexity—also measured in entropic units like bits—hoping thereby to quantify the nontrivial part of a string’s information content, as opposed to its mere randomness.  A recent example is Scott Aaronson’s notion of complextropy, defined roughly as the number of bits in the smallest program for a universal computer to efficiently generate a probability distribution relative to which x  cannot efficiently be recognized as atypical.
However, I believe that entropic measures of complexity are generally unsatisfactory for formalizing the kind of complexity found in intuitively complex objects found in nature or gradually produced from simple initial conditions by simple dynamical processes, and that a better approach is to characterize an object’s complexity by the amount of number-crunching (i.e. computation time, measured in machine cycles, or more generally other dynamic computational resources such as time, memory, and parallelism) required to produce the object from a near-minimal-sized description.
This approach, which I have called  logical depth, is motivated by a common feature of intuitively organized objects found in nature: the internal evidence they contain of a nontrivial causal history.  If one accepts that an object’s minimal program represents its most plausible explanation, then the minimal program’s run time represents the number of steps in its most plausible history.  To make depth stable with respect to small variations in x or U, it is necessary also to consider programs other than the minimal one, appropriately weighted according to their compressibility, resulting in the following two-parameter definition.

  • An object x is called d-deep with s bits of significance iff every program for U to compute x in time <d is compressible by at least s bits.  This formalizes the idea that every hypothesis for x to have originated more quickly than in time d contains s bits' worth of ad hoc assumptions.

Dynamic and static resources, in the form of the parameters d and s, play complementary roles in this definition: d as the quantifier and s as the certifier of the object's nontriviality.  Invoking the two parameters in this way not only stabilizes depth with respect to small variations of x and U, but also makes it possible to prove that depth obeys a slow growth law, without which any mathematical definition of organized complexity would seem problematic.

  • A fast deterministic process cannot convert shallow objects to deep ones, and a fast stochastic process can only do so with low probability.  (For details see Bennett88.)

 
Logical depth addresses many infelicities and problems associated with entropic measures of complexity.

  • It does not impose an arbitrary rate of exchange between the independent variables of strength of evidence and degree of nontriviality of what the evidence points to, nor an arbitrary maximum complexity that an object can have relative to its size.  Just as a microscopic fossil can validate an arbitrarily long evolutionary process, so a small fragment of a large system, one that has evolved over a long time to a deep state, can contain evidence of the entire depth of the large system, which may be more than exponential in the size of the fragment.
  • It helps explain the increase of complexity at early times and its decrease at late times by providing different mechanisms for these processes.  In figure 2, for example, depth increases steadily at first because it reflects the duration of the system's actual history so far.  At late times, when the cellular automaton has run for a generic time comparable to its Poincaré recurrence time, the state becomes shallow again, not because the actual history was uneventful, but because evidence of that history has become degraded to the point of statistical insignificance, allowing the final state to be generated quickly from a near-incompressible program that short-circuits the system's actual history.
  • It helps explain why some systems, despite being far from thermal equilibrium, never self-organize.  For example in figure 1, the gaseous sun, unlike the solid earth, appears to lack means of remembering many details about its distant past.  Thus while it contains evidence of its age (e.g. in its hydrogen/helium ratio), almost all evidence of particular details of its past, e.g. the locations of sunspots, is probably obliterated fairly quickly by the sun's hot, turbulent dynamics.  On the other hand, systems with less disruptive dynamics, like our earth, could continue increasing in depth for as long as their nonequilibrium boundary conditions persisted, up to an exponential maximum imposed by Poincaré recurrence.
  • Finally, depth is robust with respect to transformations that greatly alter an object's size, Kolmogorov complexity, and many other entropic quantities, provided the transformation leaves intact significant evidence of a nontrivial history.  Even a small sample of the biosphere, such as a single DNA molecule, contains such evidence.  Mathematically speaking, the depth of a string x is not much altered by replicating it (like the bacteria in the flask), padding it with zeros or random digits, or passing it through a noisy channel (although the latter treatment decreases the significance parameter s).  If the whole history of the earth were derandomized, by substituting deterministic pseudorandom choices for all its stochastic accidents, the complex objects in this substitute world would have very little Kolmogorov complexity, yet their depth would be about the same as if they had resulted from a stochastic evolution.

The remaining infelicities of logical depth as a complexity measure are those afflicting computational complexity and algorithmic entropy theories generally.

  • Lack of tight lower bounds: because of the open P=PSPACE question, one cannot exhibit a system that provably generates depth more than polynomial in the space used.
  • Semicomputability:  deep objects can be proved deep (with exponential effort) but shallow ones can’t be proved shallow.  The semicomputability of depth, like that of Kolmogorov complexity, is an unavoidable consequence of the unsolvability of the halting problem.

The following observations partially mitigate these infelicities.

  • Using the theory of cryptographically strong pseudorandom functions one can argue (if such functions exist) that deep objects can be produced efficiently, in time polynomial and space polylogarithmic in their depth, and indeed that they are produced efficiently by some physical processes.
  • Semicomputability does not render a complexity measure entirely useless. Even though a particular string cannot be proved shallow, and requires an exponential amount of effort to prove it deep, the depth-producing properties of stochastic processes can be established, assuming the existence of cryptographically strong pseudorandom functions. This parallels the fact that while no particular string can be proved to be algorithmically random (incompressible), it can be proved that the statistically random process of coin tossing produces algorithmically random strings with high probability.

 
Granting that a logically deep object is one plausibly requiring a lot of computation to produce, one can consider various related notions:

  • An object  y  is deep relative to  x  if all near-minimal sized programs for computing  y  from  x  are slow-running.  Two shallow objects may be deep relative to one another, for example a random string and its XOR with a deep string.
  • An object can be called cryptic if it is computationally difficult to obtain a near- minimal sized program for the object from the object itself, in other words if any near-minimal sized program for x is deep relative to x.  One-way functions, if they exist, can be used to define cryptic objects; for example, in a computationally secure but information theoretically insecure cryptosystem, plaintexts should be cryptic relative to their ciphertexts.
  • An object can be called ambitious if, when presented to a universal computer as input, it causes the computer to embark on a long but terminating computation. Such objects, though they determine a long computation, do not contain evidence of it actually having been done.  Indeed they may be shallow and even algorithmically random.
  • An object can be called wise if it is deep and a large and interesting family of other deep objects are shallow relative to it. Efficient oracles for hard problems, such as the characteristic function of an NP-complete set, or the characteristic set K of the halting problem, are examples of wise objects.  Interestingly, Chaitin’s omega is an exponentially more compact oracle for the halting problem than K is, but it is so inefficient to use that it is shallow and indeed algorithmically random.

Further details about these notions can be found in Bennett88.  K.W. Regan, in Dick Lipton's blog, discusses the logical depth of Bill Gasarch's recently discovered solutions to the 17×17 and 18×18 four-coloring problems.
I close with some comments on the relation between organized complexity and thermal disequilibrium, which since the 19th century has been viewed as an important, perhaps essential, prerequisite for self-organization.   Broadly speaking, locally interacting systems at thermal equilibrium obey the Gibbs phase rule, and its generalization in which the set of independent parameters is enlarged to include not only intensive variables like temperature, pressure and magnetic field, but also all parameters of the system’s Hamiltonian, such as local coupling constants.   A consequence of the Gibbs phase rule is that for generic values of the independent parameters, i.e. at a generic point in the system’s phase diagram, only one phase is thermodynamically stable.  This means that if a system’s independent parameters are set to generic values, and the system is allowed to come to equilibrium, its structure will be that of this unique stable Gibbs phase, with spatially uniform properties and typically short-range correlations.   Thus for generic parameter values, when a system is allowed to relax to thermal equilibrium, it entirely forgets its initial condition and history and exists in a state whose structure can be adequately approximated by stochastically sampling the distribution of microstates characteristic of that stable Gibbs phase.  Dissipative systems—those whose dynamics is not microscopically reversible or whose boundary conditions prevent them from ever attaining thermal equilibrium—are exempt from the Gibbs phase rule for reasons discussed in BG85, and so are capable, other conditions being favorable, of producing structures of unbounded depth and complexity in the long time limit. For further discussion and a comparison of logical depth with other proposed measures of organized complexity, see B90.
 


[1] An elementary result of algorithmic information theory is that for any probability ensemble of bit strings (representing e.g. physical microstates), the ensemble's Shannon entropy differs from the expectation of its members' algorithmic entropy by at most the number of bits required to describe a good approximation to the ensemble.

 
 

Time After Time

Ole Peters was a postdoc at the Santa Fe Institute during the time I was also a postdoc there. In addition to being a world class windsurfer, Ole likes to think about critical phenomena and stochastic processes. And in the TEDxGoodenoughCollege talk below he almost convinces me that I need to think harder about ensemble versus time averages 🙂

Immanants

Recently computer scientist Leslie Valiant won the ACM's Turing Award, considered one of the most prestigious prizes in computer science. Valiant is famous for many results, not the least of which are his results on the permanent of a matrix. Over at Gödel's Lost Letter, the iced tea man has a nice collection of interesting permanent-related complexity facts. Recall that the permanent of an n by n matrix A is given by

$\mathrm{per}\, A = \sum_{\pi \in S_n} \prod_{i=1}^n A_{i,\pi(i)}$

where $S_n$ is the symmetric group on n elements, and similarly the determinant of an n by n matrix A is given by

$\det A = \sum_{\pi \in S_n} (-1)^{\mathrm{sgn}\,\pi} \prod_{i=1}^n A_{i,\pi(i)}$

where $\mathrm{sgn}\,\pi$ is 0 if the permutation is made up of an even number of transpositions and 1 if the permutation is made up of an odd number of transpositions. One day I was sitting in my office when a physics undergraduate came by (a former sailor from Alaska) and said, "Hey Dave, what if we replace the function in front of each term in the permanent and determinant by a character of a symmetric group irrep?" Which of course knocked me off my chair, first because what undergrad physics major knows about symmetric group irreps, and second because I had never thought about this interesting twist on the permanent and determinant.
After a little Google-fu, we quickly found the answer. For an n by n matrix A the immanant is given by

$\mathrm{imm}_\lambda\, A = \sum_{\pi \in S_n} \chi_{\lambda}(\pi) \prod_{i=1}^n A_{i,\pi(i)}$

where $\lambda$ labels the irrep of $S_n$ and $\chi_\lambda(\pi)$ is the character of the irrep $\lambda$ at group element $\pi$. Recall that the irreps of the symmetric group $S_n$ are labeled by partitions of n. A partition of n is a sequence of non-increasing positive integers $(\lambda_1, \lambda_2, \dots, \lambda_r)$ with $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_r$ such that $\sum_{i=1}^r \lambda_i = n$. The partition $(n)$ corresponds to the trivial irrep, in which $\chi_{(n)}(\pi)=1$, and at the opposite end of the spectrum, the partition $(1,1,\dots,1)$ corresponds to the alternating irrep, where $\chi_{(1,1,\dots,1)}(\pi)=(-1)^{\mathrm{sgn}\,\pi}$. Thus we see that the permanent and determinant are the two ends of a spectrum of polynomials known as the immanants.
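For small matrices one can evaluate these sums directly from the definitions. Below is a brute-force sketch (exponential in n, of course) in which the character is passed in as a function, so that the trivial and alternating characters recover the permanent and determinant; the helper names are my own.

from itertools import permutations

def immanant(A, chi):
    """Sum over permutations pi of chi(pi) * prod_i A[i][pi(i)]."""
    n = len(A)
    total = 0
    for pi in permutations(range(n)):
        prod = 1
        for i in range(n):
            prod *= A[i][pi[i]]
        total += chi(pi) * prod
    return total

def parity(pi):
    """+1 for even permutations, -1 for odd ones, via cycle decomposition."""
    sign, seen = 1, [False] * len(pi)
    for i in range(len(pi)):
        if not seen[i]:
            j, length = i, 0
            while not seen[j]:
                seen[j] = True
                j, length = pi[j], length + 1
            if length % 2 == 0:
                sign = -sign
    return sign

def permanent(A):
    return immanant(A, lambda pi: 1)      # trivial character, partition (n)

def determinant(A):
    return immanant(A, parity)            # alternating character, partition (1,1,...,1)

A = [[1, 2], [3, 4]]
print(permanent(A), determinant(A))       # prints 10 -2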
One of Valiant's most well known results is that evaluating the permanent of a matrix with 0 and 1 as its possible matrix entries is #P-complete, a.k.a. really hard. On the other hand, evaluating the determinant is not computationally difficult at all. At first this seems odd, because the determinant has those minus signs, which you would think would make it harder, not easier, but alas this is not so. So what about the computational complexity of the immanant? The Computational Complexity of Immanants by Peter Bürgisser (sorry, I could only find a paywalled version) shows that there is a sense in which certain immanants are also #P-complete to evaluate. Roughly, the idea is that if one defines a class of immanants whose partitions have a "step" in them that grows polynomially in n (the step for the permanent is n), then these will be #P-complete to evaluate.
So what good are immanants? Well I’m sure mathematicians have some fine uses for them. One interesting thing to note is that immanantal polynomials of adjacency matrices of graphs give you graph invariants (for the determinant this is the same as saying that the eigenvalues of the adjacency matrix are a graph invariant.) However it is known that this, like the eigenvalues, is not a complete set of graph invariants and so is not a route towards efficiently solving graph isomorphism. So no real use there, but I’m sure an object so symmetric must be of some use, no?

A Mathematical Definition of News?

Lately I’ve been thinking about the news. Mostly this involves me shouting obscenities at the radio or the internet for wasting my time with news items the depth of which couldn’t drown an ant and whose factual status makes fairy tales look like rigorous mathematical texts (you know the kind labeled “Introductory X”.) But also (and less violently) I’ve been pondering my favorite type of question, the quantification question: how would one “measure” the news?
Part of the motivation for even suggesting that there is a measure of "news" is that if someone had asked me whether there was a measure of "information" back when I was a wee lad, I would have said they were crazy. How could one "measure" something so abstract and multifaceted as "information?" However there is a nice answer to how to measure information, and this answer is given by the Shannon entropy. Of course this answer doesn't satisfy everyone, but the nice thing about it is that it is the answer to a well defined operational question about resources.
Another thought that strikes me is that, of course, Google knows the answer. Or at least there is an algorithm for Google News. Similarly Twitter has an algorithm for spotting trending topics. And of course there are less well known examples like Thoora, which seeks to deliver news that is trending in social media. And probably there is academic literature out there about these algorithms; the best I could find with some small Google-fu is TwitterMonitor: trend detection over the twitter stream. But all of this is very algorithm-centered. The question I want to ask is what quantity are these services attempting to maximize (is it even the same quantity?)
The first observation is that clearly news has a very strong temporal component. If I took all of the newspapers, communications, books, letters, etc. that mankind has produced and regarded them without respect to time, you wouldn't convince many that there is news in this body of raw data (except that there are some monkeys who can type rather well). Certainly also it seems that news has a time-frame. That is, one could easily imagine a quantity that discusses the news of the day, the news of the week, etc.
A second observation is that we can probably define some limits. Suppose that we are examining tweets and that we are looking for news items on a one-day time scale. We could take the words in each day's tweets and make a frequency table for all of these words. A situation in which there is a maximum amount of news on the second day is then one where on the first day the frequency distribution over words is peaked on one word, while on the second day it is all concentrated on another word. One could probably also argue that, on the day time scale, if both frequency distributions were peaked on the same word, then this would not be (day scale) news (it might be week scale news, however).
This all suggests that our friend, the news, is nothing more than the total variation distance. For two probability distributions $p(x)$ and $q(x)$, the variation distance between these distributions is $d(p,q)=\frac{1}{2} \sum_{x} |p(x)-q(x)|$. This is also equal to $\sup_{E \subset X} |P(E)-Q(E)|$, where $P(E)=\sum_{x \in E} p(x)$ and similarly for $Q(E)$. Ah, so perhaps this is not as exciting as I'd hoped 🙂 But at least it gives me a new way to talk about the variational distance between two probability distributions: it is a measure of the news that we could associate with changing from one probability distribution to another.
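Here is a minimal sketch of this day-scale "news" measure on bag-of-words frequency tables. Note that it is completely blind to word order, which already hints at one of the drawbacks asked about below.

from collections import Counter

def news(day1_words, day2_words):
    """Total variation distance between the word-frequency distributions of two days."""
    p, q = Counter(day1_words), Counter(day2_words)
    n1, n2 = sum(p.values()), sum(q.values())
    return 0.5 * sum(abs(p[w] / n1 - q[w] / n2) for w in set(p) | set(q))

print(news("dog bites man".split(), "dog bites man".split()))      # 0.0: no news
print(news("dog bites man".split(), "man bites dog".split()))      # 0.0: same word frequencies, alas
print(news("dog bites man".split(), "aliens land today".split()))  # 1.0: maximal news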
Of course this is just one approach to thinking about how to quantify "news." What are the drawbacks of my method, and what should a real measure have that this one lacks? I mean, what's the worst that could happen in thinking about this problem? Okay, so maybe you would learn how many holes it takes to fill the Albert Hall.

Mandelbrot 1924-2010

Benoît B. Mandelbrot, a mathematical maverick and curmudgeon extraordinaire, has passed away at the age of 85 (NYTimes obit.)  Mandelbrot is most well known for coining the word "fractal" and for studying the set which now bears his name, but he was also one of the first people to recognize that price changes are empirically not well described by a Gaussian distribution.  Mandelbrot's middle initial was self-assigned and, apparently, didn't stand for anything.  However, I've always liked to imagine that, actually, the "B" stood for "Benoît B. Mandelbrot".
Like many, I'm sure, my first encounter with the Mandelbrot set was through a Scientific American column by A.K. Dewdney (who, I'm sad to report, is now a 9/11 doubter.)  For many years, the Mandelbrot set was the first program I'd write when encountering a new computer or learning a new programming language.  One could get an idea of the speed of the computer by doing this in a few short lines of code, but also you got to test out the number of colors on the new machine (which included things like figuring out how to cycle the Apple IIGS palette so as to achieve 256 colors…all at the same time!)  Raise your hand if you've ever written a Mandelbrot set program for a programmable calculator 🙂
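For the record, here is the few-short-lines version, as a plain ASCII sketch (no Apple IIGS palette cycling); the viewing window and iteration cap are arbitrary choices.

# Bare-bones ASCII Mandelbrot set, in the spirit of a first program on a new machine.
for row in range(24):
    line = ""
    for col in range(78):
        c = complex(-2.1 + 2.8 * col / 78, -1.2 + 2.4 * row / 24)
        z, it = 0j, 0
        while abs(z) < 2 and it < 50:
            z, it = z * z + c, it + 1
        line += " " if it == 50 else ".-+#*"[it % 5]
    print(line)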
A less well known Mandelbrot story is the one that occurred in the journal Science.  There, David Avnir, Ofer Biham, Daniel Lidar (with whom I wrote a bunch of papers in grad school, in a galaxy far far away), and Ofer Malcai wrote a Perspective titled Is the Geometry of Nature Fractal? (sorry, pay-walled for those not involved in the racket that is scientific publishing.)  These authors did a survey of fractals presented in the Physical Review journals and looked at how many decades the claimed fractals spanned.  The results, let's just say, were not very positive for those who wrote books called The Fractal Geometry of Nature.  This provoked a spirited response from Mandelbrot and Peter Pfeifer.  In the annals of catty responses, these documents surely are up there among the top ever written.  My favorite part is where Mandelbrot implies that one of the authors of the original Perspective must implicitly be withdrawing his own work on fractals spanning only a small range of sizes by writing the Perspective itself.  Ha, curmudgeon indeed!
Another fact I find fun about Mandelbrot is that he obtained his first tenured position at age 75.  Take that anyone complaining about the modern oligarchy known as academia!

Reading List: Graph Isomorphism

Warning: this idiosyncratic list of reading material is in no way meant to be comprehensive, nor is it even guaranteed to focus on the most important papers concerning graph isomorphism.  Suggestions for other papers to add to the list are greatly appreciated: leave a comment!

The graph isomorphism problem is an intriguing problem. A graph is a set of vertices along with the edges connecting these vertices. The graph isomorphism problem is to determine, given the descriptions of two graphs, whether these descriptions really describe the same graph. For example, is it possible to relabel and move the vertices in the two graphs below so as to make them look the same?

Give up? Okay, well it wasn't so hard, this one, but yes, these graphs are isomorphic 🙂 Usually we aren't given two pictures of the graphs, but instead are given a succinct description of which vertices are connected to each other. This is given by either an adjacency matrix or an adjacency list.  The adjacency matrix of a graph with n vertices is an $n \times n$ matrix whose $(i,j)$th entry is the number of edges from vertex i to vertex j.  An algorithm for the graph isomorphism problem is efficient if its running time is polynomial in the number of vertices of the graphs involved.
One reason why graph isomorphism is interesting is its current status in computational complexity. It is not likely to be NP-complete (as this would cause a collapse of the polynomial hierarchy), yet despite a considerable amount of work, there is no known polynomial time algorithm for this problem. Graph isomorphism is known to be in the complexity class NP ∩ co-AM, which is kind of similar to where the decision version of factoring lies (NP ∩ co-NP). This similarity, along with the fact that a natural generalization of the quantum problem solved by factoring (the hidden subgroup problem) is related to graph isomorphism (efficiently solving the hidden subgroup problem for the symmetric group would yield an efficient algorithm for graph isomorphism), has led to the hope that quantum computers might be able to efficiently solve graph isomorphism. I will admit, over the years, I have slowly but surely developed quite a severe case of the graph isomorphism disease, albeit my case appears to be of the quantum kind. Like any disease, it is important to spread the disease. So here is a collection of readings about the graph isomorphism problem.
One interesting aspect of the graph isomorphism problem is that there are classes of graphs for which there do exist efficient polynomial time algorithms for deciding isomorphism of graphs in these classes. Because one can get downbeat pretty quickly bashing one's head against the problem, I think it would be nice to start out with an easy case: tree isomorphism.  The efficient algorithm for this is usually attributed to Aho, Hopcroft and Ullman (1974), but I really like

  • “Tree Isomorphism Algorithms: Speed vs. Clarity” by Douglas M. Campbell and David Radford, Mathematics Magazine, Vol. 64, No. 4 (Oct., 1991), pp. 252-261 (paper here)

It’s a fun easy read that will get you all excited about graph isomorphism, I’m sure.
Okay, well after that fun start, maybe staying within the realm of polynomial time algorithms continues to feel good.  One of the classic ways to attempt to solve graph isomorphism is as follows: compute the eigenvalues of the adjacency matrices of the two graphs, and sort these eigenvalues.  If the eigenvalues are different, then the graphs cannot be isomorphic.  Why?  Well, if the graphs are isomorphic, there exists an $n \times n$ permutation matrix $P$ (a matrix made of zeros and ones that has only a single one in each row and in each column) such that $A_2 = P A_1 P^{-1}$ (here $A_1$ and $A_2$ are the adjacency matrices of the two graphs).  And recall that the eigenvalues of $X$ and $M X M^{-1}$ are the same for invertible matrices $M$.  Thus isomorphic graphs must have the same eigenvalues.  Note, however, that this does not imply that non-isomorphic graphs have different eigenvalues.  For example, consider the following two trees

These trees are clearly not isomorphic.  On the other hand, if you take the eigenvalues of their adjacency matrices you will find out that they are both given by

$\left\{\tfrac{1}{2}\left(-1-\sqrt{13}\right),\ \tfrac{1}{2}\left(1+\sqrt{13}\right),\ \tfrac{1}{2}\left(1-\sqrt{13}\right),\ \tfrac{1}{2}\left(-1+\sqrt{13}\right),\ 0,\ 0,\ 0,\ 0\right\}.$
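(As an aside, here is a quick numpy sketch of the sorted-spectrum test just described, applied to what I believe is the classic smallest cospectral pair, the star K_{1,4} and a 4-cycle plus an isolated vertex, rather than to the trees pictured above.)

import numpy as np

def same_spectrum(A1, A2, tol=1e-9):
    """Necessary (but not sufficient) test for isomorphism: compare sorted adjacency spectra."""
    e1 = np.sort(np.linalg.eigvalsh(np.array(A1, dtype=float)))
    e2 = np.sort(np.linalg.eigvalsh(np.array(A2, dtype=float)))
    return np.allclose(e1, e2, atol=tol)

star = [[0, 1, 1, 1, 1],            # K_{1,4}: one center joined to four leaves
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]]
c4_plus_point = [[0, 1, 0, 1, 0],   # a 4-cycle plus an isolated vertex
                 [1, 0, 1, 0, 0],
                 [0, 1, 0, 1, 0],
                 [1, 0, 1, 0, 0],
                 [0, 0, 0, 0, 0]]
print(same_spectrum(star, c4_plus_point))   # True, even though the graphs are not isomorphic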

So there exist isospectral (same eigenvalues of the adjacency matrix) but non-isomorphic graphs.  Thus taking the eigenvalues of the graph isn't enough to distinguish non-isomorphic graphs.  In practice, however, this usually works pretty well.  Further, if we consider graphs where the maximum multiplicity of the eigenvalues (recall that an eigenvalue may appear multiple times in the list of eigenvalues of a matrix—the multiplicity is the number of times an eigenvalue appears) is bounded (that is, the multiplicity does not grow with the graph size), then there is an efficient algorithm for graph isomorphism.  This was first shown for multiplicity one graphs, i.e. where all the eigenvalues are distinct, by Leighton and Miller.  This work is unpublished, but you can find the hand written notes at Gary Miller's website:

  • “Certificates for Graphs with Distinct Eigen Values,” by F. Thomson Leighton and Gary l. Miller, unpublished, 1979 (paper here)

Another place to learn about this, without all the smudge marks, is at the Yale course site of Dan Spielman:

  • Spectral Graph Theory, Dan Spielman, Fall 2009 (course website , notes on multiplicity free graph isomorphism)

Here would be a good place to mention the general case for bounded multiplicity

  • “Isomorphism of graphs with bounded eigenvalue multiplicity” by Laszlo Babai, D. Yu. Grigoryev, and David M. Mount, STOC 1982 (paper here)

Though jumping in to that paper after the previous ones is a bit of a leap.
Okay, so now you're all warmed up.  Indeed you're probably so stoked that you think a polynomial time classical algorithm is just around the corner.  Time to put a damper on those thoughts.
First let's start with the paper that describes what is currently the best algorithm for general graphs.  Well, actually, for the original best algorithm I'm not sure there is an actual paper, but there is a paper which achieves a similar bound; see

  • “Canonical labeling of graphs” by László Babai and Eugene M. Luks, STOC 1983 (paper here)

This describes a subexponential (or, err, what did Scott decide to call this?), $\exp(n^{1/2+c})$, time algorithm for a graph with n vertices.  Now as a reading list I don't recommend jumping into this paper quite yet, but I just wanted to douse your optimism for a classical algorithm by showing you (a) the best we have in general is a subexponential time algorithm, (b) that record has stood for nearly three decades, and (c) that if you look at the paper you can see it's not easy.  Abandon hope all ye who enter?
Okay, so apparently you haven’t been discouraged, so you’re pressing onward.  So what to read next?
Well, one strategy would be to go through the list of all the classes of graphs for which there are known efficient algorithms.  This is quite a list, and it is useful to at least gain some knowledge of these papers, if only to not re-solve one of these cases (above we noted the paper for bounded eigenvalue multiplicity):

  • Planar graphs: “Linear time algorithm for isomorphism of planar graphs” by J.E. Hopcroft and J.K. Wong, STOC 1974 (paper here)
  • or more generally graphs of bounded genus: “Isomorphism testing for graphs of bounded genus” by Gary Miller, STOC 1980 (paper here)
  • Graphs with bounded degree: “Isomorphism of graphs of bounded valence can be tested in polynomial time” by Eugene M. Luks, Journal of Computer and System Sciences 25: 42–65, 1982 (paper here)

Along a similar vein, there is a subexponential time algorithm for strongly regular graphs that is better than the best general algorithm described above:

  • “Faster Isomorphism Testing of Strongly Regular Graphs” by Daniel A. Spielman STOC 1996 (paper here)

At this point, you’re probably overloaded, so a good thing to do when you’re overloaded is to find a book.  One interesting book is

  • “Group-Theoretic Algorithms and Graph Isomorphism ” by C.M. Hoffman 1982 (link here)

As you can see, this book was written before the best algorithms were announced.  Yet it is fairly readable and begins to set up the group theory you'll need if you want to conquer the later papers.  Another book with some nice results is

  • “The graph isomorphism problem: its structural complexity” by Johannes Köbler, Uwe Schöning, Jacobo Torán (1993) (buy it here)

where you will find some nice basic complexity theory results about graph isomorphism.
Well, now that you've gone and started learning about some computational complexity in relation to graph isomorphism, it's probably a good time to stop and look at actual practical algorithms for graph isomorphism.  The king of the hill here, as far as I know, is the program nauty (No AUTomorphism, Yes?) by Brendan D. McKay.  Sadly, a certain search engine seems to be a bit too weighted to the word "naughty" (note to self: consider porn-related keywords when naming a software package):

Nauty’s main task is determining the automorphism group of a graph (the group of permutations that leave the graph representation unchanged) but nauty also produces a canonical label that can be used for testing isomorphism.  Nauty can perform isomorphism tests of graphs of 100 vertices in about a second.  There is an accompanying paper describing how nauty works:

  • “Practical Graph Isomorphism”, by Brendan D. McKay, Congressus Numerantium, 30, 45-87, 1981. (paper here)

A canonical label for a graph is a function from graphs to a set of labels such that for two graphs the label is the same if and only if the two graphs are isomorphic.  This is what nauty produces: if you want to know whether the graphs are isomorphic you just compute the canonical forms for these graphs and compare the results.  If they have the same canonical form they are isomorphic, if they don’t they are not isomorphic.

Which leads one to thinking about other labeling schemes for solving graph isomorphism.  One of the simplest attempts (an attempt because it does not work) to make a canonical labeling scheme is the following: begin by labeling the vertices of the two graphs by their degrees.  If you then sort these lists and they are different, then the graphs cannot be isomorphic.  Next one can replace each vertex label by a label made up of the multiset of the current vertex label and the labels of the neighboring vertices (recall a multiset is a set where entries can appear multiple times).  One can then sort these multisets lexicographically and compare the two sorted lists for both graphs.  Again, if the lists are not the same, the graphs are not isomorphic.  If you still haven't discovered that the graphs are not isomorphic you can continue, but first, to keep the labels small, you should replace the lexicographically ordered labels by small integers representing the sorted order of the labels.  Then one can iterate the construction of a new multiset labeling from these new integer labelings.  This is a simple scheme to implement (a code sketch appears after the WL reference below).  Below, for example, I perform it for two very simple graphs.  First:

Next:

At which point we stop, because if we compare the sorted lists of these multisets they are different (one has {2,2,2} while the one on the right does not).  One can show that the above labeling procedure will always stabilize after a polynomial number of iterations (can you see why?), but also it's quite clear that it doesn't work in general (i.e. it doesn't provide a canonical labeling for all graphs), since it gets stuck right off the bat with regular graphs (graphs whose vertices all have the same degree).  Here are two 3-regular graphs that are not isomorphic:

The above algorithm doesn't even get off the ground with these non-isomorphic graphs 🙂  One can imagine, however, more sophisticated versions of the above algorithm, for example by labeling edges instead of vertices.  It turns out there is a very general set of algorithms of this form, known as the k-dimensional Weisfeiler-Lehman algorithm, that work along the lines of what we have described above.  Boris Weisfeiler, as I mentioned in a previous post, went missing in Chile in 1985 under presumably nefarious circumstances.  For a good introduction to the WL algorithm, I recommend the paper

  • “On Construction and Identification of Graphs (with contributions by A. Lehman, G. M. Adelson-Velsky, V. Arlazarov, I. Faragev, A. Uskov, I. Zuev, M. Rosenfeld and B. Weisfeiler. Edited by B. Weisfeiler)”, Lecture Notes in Mathematics, 558 1976 (paper here)
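As promised above, here is a minimal Python sketch of the vertex-refinement scheme just described, which is essentially the 1-dimensional case of the WL algorithm.  The graph representation (a dict mapping each vertex to its neighbor list) and the helper name are my own choices.

def wl1_colors(adj):
    """Naive vertex refinement (1-dimensional Weisfeiler-Lehman).
    Returns the sorted list of stable vertex colors.  Different results prove
    non-isomorphism; identical results prove nothing (e.g. for regular graphs)."""
    colors = {v: len(adj[v]) for v in adj}   # start with vertex degrees
    while True:
        # new label: current color plus the multiset of the neighbors' colors
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        # compress the (lexicographically sorted) labels back down to small integers
        relabel = {sig: i for i, sig in enumerate(sorted(set(sigs.values())))}
        new_colors = {v: relabel[sigs[v]] for v in adj}
        if sorted(new_colors.values()) == sorted(colors.values()):   # refinement has stabilized
            return sorted(new_colors.values())
        colors = new_colors

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(wl1_colors(triangle) == wl1_colors(path))   # False, so definitely not isomorphic;
# two non-isomorphic regular graphs of the same size would come out identical (and useless).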

The dimension k in the WL algorithm refers to the fact that the basic objects being considered are subsets of cardinality k of $\{1,2,\dots,n\}$.  If the k-dimensional WL algorithm solved graph isomorphism for constant k, then this would give an efficient algorithm for graph isomorphism.  For many years whether this was true or not was not known, and a large body of work was developed (mostly in Russia) around this method, including the introduction of topics such as cellular algebras and coherent configurations.  However, in 1991, Cai, Fürer, and Immerman showed that this approach does not yield an efficient graph isomorphism algorithm.  The paper where they do that is very readable and gives a good history of the WL algorithm:

  • “An optimal lower bound on the number of variables for graph identification” by Jin-Yi Cai, Martin Fürer, and Neil Immerman, Combinatorica, 12, 389-410 1991 (paper here)

At this point you might be thinking to yourself, "Hey, isn't this the Quantum Pontiff?  Where is the discussion of quantum approaches to graph isomorphism?"  Okay, maybe that's not what you're thinking, but what the hell, now is as good a time as any to talk about quantum approaches to the problem.  There are two main approaches to attacking the graph isomorphism problem on quantum computers.  The first is related to the hidden subgroup problem and the second is related to state preparation and the swap test.  A third "spin-off", inspired by quantum physics, also exists, which I will briefly mention.
Recall that the hidden subgroup problem is as follows:

Hidden Subgroup Problem (HSP): You are given query access to a function f from a group G to a set S such that f is constant and distinct on the left cosets of an unknown subgroup H.  Find H by querying f.

Quantum computers can efficiently (efficiently here means in a time polynomial in the logarithm of the size of the group) solve the hidden subgroup problem when the group G is a finite Abelian group.  Indeed Shor’s algorithm for factoring and discrete logarithm can be seen as solving such Abelian HSPs.  A nice introduction to the Abelian version of this problem, though it is now a little out of date is

A more up to date introduction to the problem is provided by

  • “Quantum algorithms for algebraic problems” by Andrew Childs and Wim van Dam, Reviews of Modern Physics 82, 1-52 2010 arXiv:0812.0380

What is the connection to the graph isomorphism problem?  Well, the basic result is that if one could efficiently solve the HSP over the symmetric group (or the wreath product group $S_n \wr S_2$), then one would be able to efficiently solve graph isomorphism.  A place to find this is in

  • “A Quantum Observable for the Graph Isomorphism Problem,” by Mark Ettinger and Peter Hoyer (1999) arXiv:quant-ph/9901029

This paper establishes that there is a measurement one can perform on a polynomial number of qubits that solves the graph isomorphism problem.  Unfortunately it is not known how to efficiently implement this measurement (by a circuit of polynomial depth.)  Even more unfortunately there is a negative result that says that you really have to do something non-trivial across the entire system when you implement this measurement.  This is the culminating work reported in

  • “Limitations of quantum coset states for graph isomorphism” by Sean Hallgren, Cristopher Moore, Martin Rotteler, Alexander Russell, and Pranab Sen, STOC 2006 (paper here)

At this point it seems that new ideas are needed to make progress along this line of attack.  I have tried my own pseudo-new ideas, but alas they have come up wanting.
At this point it is useful to mention that there is a variation on the hidden subgroup problem, the hidden shift problem, which is arguably more naturally connected to the graph isomorphism problem.  You can learn about this here

  • “On the quantum hardness of solving isomorphism problems as nonabelian hidden shift problems,” by Andrew Childs and Pawel Wocjan, Quantum Information and Computation, 7 (5-6) 504-521, 2007 arXiv:quant-ph/0510185

I could go on and on about the hidden subgroup problem, but I think I’ll spare you.
Beyond the hidden subgroup problem, what other approaches are there to finding efficient quantum algorithms for the graph isomorphism problem?  A lesser studied, but very interesting, approach relates graph isomorphism to state preparation.  Let $A_1$ and $A_2$ denote the adjacency matrices of the two graphs to be tested.  Now imagine that we can create the states

$|P_i\rangle = \sqrt{\frac{|\mathrm{Aut}(G_i)|}{n!}}\ \sum_{P} |P A_i P^{-1}\rangle$

where $\mathrm{Aut}(G_i)$ is the automorphism group of graph $G_i$, and the sum is over all $n \times n$ permutation matrices.  Now notice that if $G_1$ and $G_2$ are isomorphic, then these two states are identical.  If, on the other hand, $G_1$ and $G_2$ are not isomorphic, then these states are orthogonal, $\langle P_1|P_2\rangle=0$, since the superpositions above cannot contain a common term, or this would yield an isomorphism.  Using this fact one can use the swap test to solve graph isomorphism…given the ability to prepare $|P_1\rangle$ and $|P_2\rangle$.  (See here for a discussion of the swap test.)  Unfortunately, no one knows how to efficiently prepare $|P_1\rangle$ and $|P_2\rangle$!  A good discussion, and one attack on this problem, is given in the paper

  • “Adiabatic quantum state generation and statistical zero knowledge” by Dorit Aharonov and Amnon Ta-Shma, STOC 2003 arXiv:quant-ph/0301023

Finally, let me mention a "quantum inspired" approach to graph isomorphism which has recently led to a sort of dead end, but which I find interesting.  This approach is described in

The basic idea is as follows.  As I described above, using the spectrum (the list of eigenvalues) of an adjacency matrix is not enough to distinguish non-isomorphic graphs.  Now, the adjacency matrix is closely related to random walks on the graph being considered.  Indeed, from the adjacency matrix and the diagonal matrix of vertex degrees one can construct what is called the Laplacian of a graph, and this is the infinitesimal generator of a random walk on the graph (I'm being sketchy here).  So thinking about this, one can ask: well, what graph describes two random walkers?  Well, if these walkers take turns moving, then one can see that two random walkers on a graph walk on the graph described by

$A \otimes I + I \otimes A$

where $A$ is the original one-walker adjacency matrix.  But of course the spectrum of this new walk doesn't help you in making a better invariant for graphs, as the eigenvalues of this new walk are just the various sums of pairs of the prior eigenvalues.  But, aha!, you say, what if we think like a quantum physicist and make these walkers either fermions or bosons.  This corresponds to either antisymmetrizing the above walk or symmetrizing it:

$S_{\pm} (A \otimes I + I \otimes A) S_{\pm}$

where $S_{\pm}=\frac{1}{2}(I \pm \mathrm{SWAP})$, and SWAP exchanges the two subsystems.  Well, it is easy to check that this doesn't help either: if you look at all of the eigenvalues of the fermion and boson walks, you really just have the original eigenvalue information and no more.  What Terry (Rudolph) did, however, was to think more like a physicist.  He said: well, what if we consider the walkers to be bosons, but now I make them what physicists call hard-core bosons (I wouldn't recommend googling that): bosons that can't sit on top of each other.  This means that in addition to symmetrizing the $A \otimes I + I \otimes A$ matrix, you also remove the part of the matrix where the two walkers sit on the same vertex (a sketch of this construction appears after the list below).  Terry shows that when he does this for certain non-isomorphic strongly regular graphs that are isospectral, if he considers three walkers, the eigenvalues are no longer the same.  Very cool.  Some follow-up papers examined this in more detail:

  • “Symmetric Squares of Graphs” by Koenraad Audenaert, Chris Godsil, Gordon Royle, Terry Rudolph Journal of Combinatorial Theory, Series B, 97, 74-90 2007 arXiv:math/0507251
  • “Physically-motivated dynamical algorithms for the graph isomorphism problem” by Shiue-yuan Shiau, Robert Joynt, and S.N. Coppersmith, Quantum Information and Computation, 5 (6) 492-506 2005 arXiv:quant-ph/0312170
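As mentioned above, here is a sketch of the two-walker hard-core-boson matrix: unordered pairs of distinct vertices are the basis states, and two pairs sharing a vertex are connected by a single-particle hop along an edge.  It is meant only to illustrate the construction, not to reproduce every convention of the papers just cited.

from itertools import combinations
import numpy as np

def hardcore_two_boson_matrix(A):
    """Two hard-core bosons hopping on the graph with adjacency matrix A."""
    A = np.array(A)
    pairs = list(combinations(range(len(A)), 2))   # basis: unordered pairs of distinct vertices
    index = {p: k for k, p in enumerate(pairs)}
    M = np.zeros((len(pairs), len(pairs)))
    for (i, j) in pairs:
        for (k, l) in pairs:
            common = {i, j} & {k, l}
            if len(common) == 1:                    # one boson stays put...
                a, = {i, j} - common
                b, = {k, l} - common
                M[index[(i, j)], index[(k, l)]] = A[a, b]   # ...the other hops along an edge
    return M

# Sorted spectra of such k-walker matrices are graph invariants; for certain strongly
# regular cospectral pairs the three-walker version finally tells the graphs apart.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(np.linalg.eigvalsh(hardcore_two_boson_matrix(triangle)))   # roughly [-1, -1, 2]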

(Also note another line of sort of similar investigation arXiv:quant-ph/0505026.)  So does this method lead anywhere?  Unfortunately the answer appears to be negative.  Here is a paper

  • “Spectra of symmetric powers of graphs and the Weisfeiler-Lehman refinements” by Alfredo Alzaga, Rodrigo Iglesias, and Ricardo Pignol arXiv:0801.2322

that demonstrates that the k hard-core boson walker matrix provides no more information than the 2k-dimensional WL algorithm (which, by the result of Cai, Fürer, and Immerman, cannot give an efficient algorithm in general).  Strangely, it appears that Greg Kuperberg, like Babe Ruth before him, called this one.  Nice!  Recently a different approach has also emerged to showing that this doesn't work:

  • “Non-Isomorphic Graphs with Cospectral Symmetric Powers” by Amir Rahnamai Barghi and Ilya Ponomarenko, The Electronic Journal of Combinatorics, 16(1) R120 2009 (paper here)

using the nice theory of schemes.  So, a good physicist-inspired idea, but so far, no breakthrough.
Well, I've gone on for a long while now.  Perhaps one final reading to perform is related to graph-isomorphism-complete problems.  Since graph isomorphism is not known to be in P nor known to be NP-complete, it is "natural" to define the complexity class related to graph isomorphism, GI, which is made up of problems with a polynomial time reduction to the graph isomorphism problem.  Then, similar to NP-complete problems, there are GI-complete problems.  Wikipedia has a nice list of such problems, but it doesn't contain my favorite.  I like this one because it doesn't seem similar to graph isomorphism upon first reading:

  • “The Complexity of Boolean Matrix Root Computation” by Martin Kutz, Theoretical Computer Science, 325(3) 373-390, 2004 (paper here)

Sadly Martin Kutz passed away in 2007 at what appears to be a quite young age, considering his condensed publication record.  Read his paper for a nice square root of matrices problem that is graph isomorphism complete.
So see how much fun studying graph isomorphism can be?  Okay, for some definition of fun!  Please feel free to post other papers in the comment section.

Not True in Any Base

Yes, dear Gray Lady, you certainly sound more sophisticated when you use the word “prime number” in your newspaper. But perhaps you might want to look up the actual meaning of the word before placing those words prominently beside two times five times five.