Darwin's Doubt Critical Reviews: The measure of the information in the genetic message: The road not taken

Extracted from the book: Information Theory, Evolution, and the Origin of Life,
by: Hubert Yockey, PhD

It is almost universally believed that the number of sequences in polypeptide chains of length N, composed of the twenty common amino acids that form proteins, can be calculated by the following expression:

20^(N)    (4.1)

Expression 4.1 gives the total number of sequences we must be concerned with if and only if all events are equally probable. However, many events in general, and amino acids in particular, do not have the same probability. Unfortunately, many distinguished authors have neglected that fact and led their readers and students astray.

But let us take the road less traveled; it will make all the difference and lead to the correct way to calculate the number of sequences in a family of nucleic acid and polypeptide chains. Shannon (1948) addressed this problem as follows: Let us consider a long sequence of N symbols selected from an alphabet of A symbols. In the present case, the symbols will be the alphabet of either codons or amino acids. Just as in the toss of dice, there is no intersymbol influence in the formation of these sequences. Let p(i) be the probability of the ith symbol. The sequence will contain Np(i) of the ith symbol. Let P be the probability of the sequence. Then, because the probabilities of independent events are multiplied (Shannon, 1948):

P = ∏_i p(i)^(N p(i))

Taking the logarithm of both sides changes multiplication to addition:

log2 P = N Σ_i p(i) log2 p(i) = −NH

where

H = −Σ_i p(i) log2 p(i)    (4.5)

H is called the Shannon entropy of the sequence of events. ∗

In communication, genetics, and molecular biology, we are interested in long sequences. Accordingly, the probability of a long sequence of N independent symbols or events taken from a finite alphabet is:

P = 2^(−NH)

The number of sequences of length N is:

2^(NH)    (4.7)

Notice that the expression for H was not introduced ad hoc; rather, it comes out of the woodwork, so to speak. Logarithms to base 2 can be calculated by the use of a pocket calculator: log2 y = log10 y / log10 2.
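As a rough numerical illustration of how the entropy-based count relates to the equal-probability count, the following Python sketch computes H from a list of probabilities and reports log10 of 2^(NH). The function names `shannon_entropy` and `log10_sequence_count` are illustrative and do not come from the book, and the `skewed` distribution is invented purely for the example. When all twenty symbols are equally probable, the result matches log10 of 20^(N); any unequal distribution gives a smaller number.

```python
import math

def shannon_entropy(probs):
    """Return H = -sum p(i) log2 p(i) in bits, taking 0 log 0 as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def log10_sequence_count(probs, n):
    """Return log10 of 2^(N*H), the entropy-based count of sequences."""
    return n * shannon_entropy(probs) * math.log10(2)

# Twenty equally probable symbols: the count reduces to 20^N.
uniform = [1 / 20] * 20

# Hypothetical unequal distribution (an assumption for illustration only):
# one common symbol and nineteen rarer ones; the values sum to 1.
skewed = [0.24] + [0.04] * 19

n = 113
print(log10_sequence_count(uniform, n))  # same as 113 * log10(20), up to rounding
print(log10_sequence_count(skewed, n))   # smaller than the uniform case
```

Working with log10 of the count rather than the count itself keeps the numbers within floating-point range for large N.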

Let us compare Expression 4.1 and Expression 4.7 by calculating the number of sequences in one hundred throws of a pair of dice, where the probabilities of all events are known exactly and are not all equal. For a given throw, the probability of a 2 or of a 12 is 1/36 each, whereas the probability of 7 is 6/36 because there are six ways to roll a 7 and only one way to roll 2 or 12. With the eleven possible totals weighted in this way, H is about 3.27 bits per throw, compared with log2 11, about 3.46 bits, for eleven equally probable outcomes. So we see that the number of sequences calculated by Expression 4.7 is only 2.69 × 10^(−6) of that calculated from Expression 4.1, namely, 11^(N).
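The dice comparison can be checked with a short Python sketch. It enumerates the 36 equally likely outcomes of two fair dice, computes the entropy of the eleven totals, and forms the ratio 2^(NH) / 11^(N) for N = 100. The variable names are illustrative, and the ratio is computed through logarithms, which is a convenience of this sketch rather than anything prescribed by the text.

```python
import math
from collections import Counter

# Count the ways each total (2..12) arises from two fair six-sided dice.
ways = Counter(a + b for a in range(1, 7) for b in range(1, 7))
probs = [count / 36 for count in ways.values()]

# Shannon entropy of a single throw, in bits.
entropy_bits = -sum(p * math.log2(p) for p in probs)

n_throws = 100
# log2 of the ratio 2^(N*H) / 11^N, kept in log form until the last step.
log2_ratio = n_throws * (entropy_bits - math.log2(len(probs)))
ratio = 2.0 ** log2_ratio

print(f"H = {entropy_bits:.4f} bits per throw")
print(f"2^(NH) / 11^N = {ratio:.3g}")
```

The printed ratio should agree with the figure of 2.69 × 10^(−6) quoted above.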

We have calculated the number of sequences of length N in two apparently correct ways and the question arises: What happened to the sequences left out by the second method? This is explained by the Shannon–McMillan–Breiman Theorem (Breiman, 1957; McMillan, 1953; Shannon, 1948):

For sufficiently long sequences of length N, all chosen from an alphabet of A symbols, the ensemble of sequences can be divided into two groups such that:

1. The probability P of any sequence in the first group is equal to 2^(−NH).

2. The sum of the probabilities of all sequences in the second group is less than ε, a very small number.
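The two groups can be made concrete with a small numerical sketch. The example below is an assumption-laden illustration, not the book's calculation: it uses a two-symbol alphabet with probabilities 0.1 and 0.9, length N = 1000, and a tolerance `delta` that defines "equal to 2^(−NH)" as "within delta bits per symbol of H". The function name `typical_set_summary` is hypothetical. Because every sequence with the same number k of the rarer symbol has the same probability, the sketch groups sequences by k instead of enumerating them.

```python
import math

def typical_set_summary(p, n, delta):
    """Return (H, count, mass) for the high-probability group.

    A binary sequence of length n belongs to the group when
    -(1/n) log2 P(sequence) lies within delta of the entropy H.
    """
    h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    count = 0    # number of sequences in the high-probability group
    mass = 0.0   # their total probability
    for k in range(n + 1):
        # log2 probability of any one sequence with k rare symbols
        log2_seq_prob = k * math.log2(p) + (n - k) * math.log2(1 - p)
        if abs(-log2_seq_prob / n - h) < delta:
            ways = math.comb(n, k)  # sequences sharing this k
            count += ways
            mass += 2.0 ** (math.log2(ways) + log2_seq_prob)
    return h, count, mass

h, count, mass = typical_set_summary(p=0.1, n=1000, delta=0.1)
print(f"H = {h:.4f} bits per symbol")
print(f"probability of the first group = {mass:.4f}")
print(f"log2(number in first group) = {math.log2(count):.1f}, versus N = 1000 for all sequences")
```

Under these assumptions the first group carries nearly all of the probability while containing only a minute fraction of the 2^(1000) possible sequences, which is the division the theorem describes.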

The Shannon–McMillan–Breiman Theorem is a surprising result. It tells us that the number of sequences in the first, or high, probability group is 2^(NH) and they are all nearly equally probable. We can ignore all those in the second, or low, probability group because, if N is large, their total probability is very small. The number of sequences in the high probability group is almost always many orders of magnitude smaller than that given by Expression 4.1, which contains an enormous number of “junk” sequences. In a fast-forward to Section 6.4, I find that the information content of 1-iso-cytochrome c, a small protein of 113 amino acids, is 233.19 bits. The number of 1-iso-cytochrome c sequences is 6.42392495176 × 10^(111). Calculating this number by Expression 4.1, we find 20^(113) = 1.03845927171 × 10^(147). The 1-iso-cytochrome c sequences are only a very tiny fraction, 6.1968577266 × 10^(−36), of the total possible sequences. Thus, one sees that Expression 4.1 is extremely misleading.

One must further remember that the word entropy is the name of a mathematical function, nothing more. One must not ascribe meaning to the function that is not in the mathematics. For example, the word information in this book is never used to connote knowledge or any other dictionary meaning of the word information not specifically stated here. The road we have taken, the one less traveled, has led us to the Shannon–McMillan–Breiman Theorem. It is, almost without exception, unknown to authors in molecular biology, and without it they have been led to many false conclusions. As in the sequences of throws of a pair of dice, all DNA, mRNA, and protein sequences are in the high probability group and are a very tiny fraction of the total possible number of such sequences.

======================

∗ Some authors are confused by the minus sign in Equation 4.5, and that leads them to believe in negative entropy (see Section 4.4). The probabilities of all events being considered must sum to 1. Probabilities lie between zero and one. Logarithms of numbers in that range are negative or zero, so log2 P is always zero or negative and so are the terms log2 p(i). We always take 0 log 0 to be zero. Therefore, Shannon entropy is always positive or zero.

Saturday, September 14, 2013
