Replies

Thank you for your post! Indeed, that statement of motive was quite troubling to me also.

With regard to your question about complexity/entropy and Shakespeare/DNA – I’m not finding a lot of agreement, but the information is fascinating! Here’s what I’ve got so far, emphasis mine.

Lurkers: KC is short for Kolmogorov Complexity.

Entropy

Consider a monkey typing randomly at a typewriter versus a monkey at a computer. It can be shown that the monkey at a computer is more likely to produce more ordered or "interesting" results. Consider the probability that either monkey will produce all the works of Shakespeare. Assuming the text is 1 million bits long, the monkey at a typewriter has probability about 2^-1,000,000 of producing the text. The monkey at the computer has probability 2^-KC(Shakespeare) or 2^-250,000 of producing the text assuming the Shakespeare is conpressable by four times. Neither of these probabilities is large but one is still many, many orders of magnitude more likely than the other. In this sense, we can say that a computer is an amplifier of order, information or "sense".

Problem of Complexity

Of particular importance to this paper is the concept of Kolmogorov Complexity (also known as Algorithmic Information Complexity) introduced independently by Kolmogorov[10], Chaitin[2] and Solomonoff[15]. Given a particular universal Turing Machine1 (UTM) U, the Kolmogorov complexity of a string of characters (ie a description) is the length of the shortest program running on U that generates the description.

There are two main problems with Kolmogorov complexity:

1. The dependence on U, as there is no unique way of specifying this. Even though the Invariance theorem[11, Thm 2.1.1] guarantees that any two UTMs U and V will agree on the complexity of a string x up to a constant independent of x, for any descriptions x and y, there will be two machines U and V disagreeing on whether x is more complex that y, or vice-versa.

2. Random sequences have maximum complexity, as by definition a random sequence can have no generating algorithm shorter than simply listing the sequence. As Gell-Mann[8] points out, this contradicts the notion that random sequences should contain no information.

The first problem of what reference machine to choose is a symptom of context dependence of complexity. Given a description x, any value of complexity can be chosen for it by choosing an appropriate reference machine. It would seem that complexity is in the eye of the beholder[5]. Is complexity completely subjective? Is everything lost?

Rather than trying to hide this context dependence, I would prefer to make it a feature. Instead of asserting complexity is a property of some system, it is a property of descriptions (which may or may not be about a system). There must also be an interpreter of these descriptions that can answer the question of whether two descriptions are equivalent or not. Consider Shakespeare's Romeo and Juliet. In Act II,ii, line 58, Juliet says ``My ears have yet not drunk a hundred words''. If we change the word ``drunk'' to ``heard'', your average theatre goer will not spot the difference. Perhaps the only one to notice would be a professional actor who has played the scene many times. Therefore the different texts differing by the single word ``drunk/heard'' in this scene are considered equivalent by our hypothetical theatre goer. There will be a whole equivalence class of texts that would be considered to be Romeo and Juliet by our theatre goer.

Once the set of all possible descriptions are given (strings of letters on a page, base pairs in a genome or bits on a computer harddisk for example), and an equivalence class between descriptions given, then one can apply the Shannon entropy formula to determine the complexity of that description, under that interpretation: …

New perspectives on Zipf’s law in linguistics

I highly recommend this one!!! The pdf is 101 pages, and begins speaking of both Shakespeare and DNA at page 9. They look at the issue from the language view (which is the correct view, IMHO) and at about page 17 offer a hierarchical improvement to figuring complexity.

Information Theory and Algorithmic Complexity: Applications to Linguistic Discourses and DNA Sequences as Complex Systems. Part I: Efficiency of the Genetic Code of Dna --- S. Naranan and V.K. Balasubrahmanyan

The genetic code is a mapping of 64 possible triplet codons from a 4-letter alphabet (A, C, G, U) into 20 amino acids and one STOP signal for protein synthesis. The pattern of degeneracies in codon assignments to amino acids suggests that there is an underlying variable length code, which meets all the optimal coding criteria dictated by Shannon's Information theory. The genetic code can be viewed as an instantaneous, prefix code with unique decipherability and compactness. Optimal codon assignments and average code lengths for 20 species from 10 groups in the phylogenetic tree have been determined from the available data on codon usage, using the Huffman algorithm.

The average binary code length of the genetic code exceeds the optimal average length of the Huffman code only by 2 to 5%, showing that the genetic code is close to optimal. But functionally, the genetic code is a fixed length code of 6 binary bits (3 bases). This provides the needed redundancy (? 25%) for tolerance of errors in the DNA sequences due to mutations.

This hybrid character of the genetic code, combining the advantages of variable and fixed length codes, supports a speculation that in the past the genetic code could have been a variable length code which has evolved to its modern version. The DNA sequence bears striking similarities to linguistic discourses from an Information Theoretic view. Both are complex adaptive systems with intrinsic elements of order and randomness. The complexity parameter, which we had defined earlier for linguistic discourses, is close to maximal in both DNA and natural languages. In this article, the first of two parts, we have focused on the variable length genetic code. In Part II, we deal with DNA sequence treated as a complex adaptive system and some possible correlation of certain parameters with evolution.

I haven’t read that last article yet, but I’m extremely curious what creatures with variable length genetic code might have looked like (LOL!)

Any string generation system that has a "blank" character that creates "words" rather than runtogethersentences will follow Zipf's law whether the genration is random or directed. There's a paper from the Santa Fé Institute (or of their first) that discusses the matter. (I don't remember the paper or author.) Zipf's law doesn't have much content as far as I can see. (Perhaps someone will find some expnanatory power in the law someday and I'll change my mind.) Plotting frequency as a function of rank is a bit like measuring gasses and plotting the pressure vs the log of the pressure.

There are two main problems with Kolmogorov complexity:

1. The dependence on U, as there is no unique way of specifying this.

Yes, that is one problem that has bothering me about it. The rules are algorithms and the 'shortness' and ability of an algorithm to produce a result is highly dependent on the abilities of the programmer, and yes even on the intuitional abilities of the person writing it. Seems there is something highly subjective in the formula.

2. Random sequences have maximum complexity

Yes, that bothers me the most as I pointed out before regarding the content of information. Someone may spend his life studying how complex the grains of sand on a beach are, but that will be of little use. On the other hand spending time studying the DNA code can have great benefits.

BTW - we discovered DNA some 50 years ago and we still have barely begun to uncover its secrets.

The monkey at the computer has probability 2^-KC(Shakespeare) or 2^-250,000 of producing the text assuming the Shakespeare is conpressable by four times.

As a technical nit, the real value for the compressibility of Shakespeare is probably around 40x rather than 4x, this based on real compressibility data points. Not terribly important, except that this would be an order of magnitude reduction of the exponent.

Even though the Invariance theorem[11, Thm 2.1.1] guarantees that any two UTMs U and V will agree on the complexity of a string x up to a constant independent of x, for any descriptions x and y, there will be two machines U and V disagreeing on whether x is more complex that y, or vice-versa.

This isn't a correct application for our universe. In our universe, two arbitrary machines will always agree on the complexity of a string. I touched on this in previous posts. The Invariance Theorem is misleading in practical application; our universe doesn't contain UTMs as far as we can tell, which has consequences.