Thursday, September 19, 2013

Charles R. Marshall reviews Darwin's Doubt: When Prior Belief Trumps Scholarship

The power of scientific reasoning derives from the complex interplay between the desire to know, the ability to reason, and the ability to evaluate ideas with data. As scientists, we have learned how to make ideas dance with reality, and we expect them to be transformed in the process. We typically add to what we already know, often showing along the way that old ideas are incomplete or, occasionally, wrong. And so we collectively build an understanding of the world that is accurate, reliable, and useful.
In Darwin's Doubt, Stephen Meyer (who runs the Discovery Institute's Center for Science and Culture) also tries to build. He aims to construct the philosophical and scientific case for intelligent design. I am not a philosopher, so I will not attempt to evaluate his philosophical argument that in principle it might be possible to recognize the action of a designer in the history of life. But I am willing to evaluate his scientific case for the participation of such a designer. It centers on one of the most remarkable events in that history, the relatively rapid emergence of animal phyla in the Cambrian.
Meyer's scientific approach is negative. He argues that paleontologists are unable to explain the Cambrian explosion, thus opening the door to the possibility of a designer's intervention. This, despite his protest to the contrary, is a (sophisticated) “god of the gaps” approach, an approach that is problematic in part because future developments often provide solutions to once apparently difficult problems.
[Figure: Scree slope site. Walcott quarry in the Burgess Shale, which beautifully preserves soft-bodied animals from shortly after the Cambrian explosion. Credit: Mark A. Wilson (Department of Geology, Wooster College)/Wikimedia Commons]
Darwin's Doubt begins with a very readable review of our knowledge of the Cambrian explosion. Despite its readability and a plethora of scholarly references, however, there are substantial omissions and misrepresentations. For example, Meyer completely omits mention of the Early Cambrian small shelly fossils and misunderstands the nuances of molecular phylogenetics, both of which cause him to exaggerate the apparent suddenness of the Cambrian explosion.
I like to read the arguments of those who hold fundamentally different views from my own in the hope of discovering weaknesses in my thinking. And so even after reading the flawed first part of his book, I dared hope that Meyer might point the way to fundamental problems in the way we paleontologists think about the Cambrian explosion.
However, my hope soon dissipated into disappointment. His case against current scientific explanations of the relatively rapid appearance of the animal phyla rests on the claim that the origin of new animal body plans requires vast amounts of novel genetic information coupled with the unsubstantiated assertion that this new genetic information must include many new protein folds. In fact, our present understanding of morphogenesis indicates that new phyla were not made by new genes but largely emerged through the rewiring of the gene regulatory networks (GRNs) of already existing genes (1). Now Meyer does touch on this: He notes that manipulation of such networks is typically lethal, thus dismissing their role in explaining the Cambrian explosion. But today's GRNs have been overlain with half a billion years of evolutionary innovation (which accounts for their resistance to modification), whereas GRNs at the time of the emergence of the phyla were not so encumbered. The reason for Meyer's idiosyncratic fixation with new protein folds is that one of his Discovery Institute colleagues has claimed that those are mathematically impossibly hard to evolve on the timescale of the Cambrian explosion.
As Meyer points out, he is not a biologist; so perhaps he could be excused for basing his scientific arguments on an outdated understanding of morphogenesis. But my disappointment runs deeper than that. It stems from Meyer's systematic failure of scholarship. For instance, while I was flattered to find him quote one of my own review papers (2)—although the quote is actually a chimera drawn from two very different parts of my review—he fails to even mention the review's (and many other papers') central point: that new genes did not drive the Cambrian explosion. His scholarship, where it matters most, is highly selective.
Meyer's book ends with a heart-warming story of his normally fearless son losing his orientation on the impressive scree slopes that cradle the Burgess Shale, the iconic symbol of the Cambrian explosion, and his need to look back to his father for security. I was puzzled: why the parable in a book ostensibly about philosophy and science? Then I realized that the book's subtext is to provide solace to those who feel their faith undermined by secular society and by science in particular. If the reviews on Amazon.com are any indication, it is achieving that goal. But when it comes to explaining the Cambrian explosion, Darwin's Doubt is compromised by Meyer's lack of scientific knowledge, his “god of the gaps” approach, and selective scholarship that appears driven by his deep belief in an explicit role of an intelligent designer in the history of life.

References:

  1. The Cambrian Explosion: The Construction of Animal Biodiversity (Roberts and Company, Greenwood Village, CO, 2013); reviewed in (3).
  2. Annu. Rev. Earth Planet. Sci. 34, 355 (2006).
  3. Science 340, 1170 (2013).

Saturday, September 14, 2013

The measure of the information in the genetic message: The road not taken

Extracted from the book: Information Theory, Evolution, and the Origin of Life,
by: Hubert Yockey, PhD

It is almost universally believed that the number of sequences in polypeptide chains of length N, composed of the twenty common amino acids that form protein, can be calculated by the following expression:

$$20^{N} \tag{4.1}$$

Expression 4.1 gives the total number of sequences we must be concerned with if and only if all events are equally probable. However, many events in general, and amino acids in particular, do not have the same probability. Unfortunately, many distinguished authors have neglected that fact and led their readers and students astray.

But let us take the road less traveled; it will make all the difference and lead to the correct way to calculate the number of sequences in a family of nucleic acid and polypeptide chains. Shannon (1948) addressed this problem as follows: Let us consider a long sequence of N symbols selected from an alphabet of A symbols. In the present case, the symbols will be the alphabet of either codons or amino acids. Just as in the toss of dice, there is no intersymbol influence in the formation of these sequences. Let p(i) be the probability of the ith symbol. The sequence will contain Np(i) of the ith symbol. Let P be the probability of the sequence. Then, because the probabilities of independent events are multiplied (Shannon, 1948):

$$P = \prod_{i=1}^{A} p(i)^{N p(i)}$$

Taking the logarithm of both sides changes multiplication to addition:

$$\log_2 P = N \sum_{i=1}^{A} p(i) \log_2 p(i) = -NH$$

where

$$H = -\sum_{i=1}^{A} p(i) \log_2 p(i) \tag{4.5}$$

H is called the Shannon entropy of the sequence of events. ∗
In communication, genetics, and molecular biology, we are interested in long sequences. Accordingly, the probability of a long sequence of N independent symbols or events taken from a finite alphabet is:

$$P = 2^{-NH}$$

The number of sequences of length N is:

$$2^{NH} \tag{4.7}$$

Notice that the expression for H was not introduced ad hoc; rather, it comes out of the woodwork, so to speak. Logarithms to base 2 can be calculated by the use of a pocket calculator: log2 y = log10 y/ log10 2.

Let us compare Expression 4.1 and Expression 4.7 by calculating the number of sequences in one hundred throws of a pair of dice, where the probabilities of all events are known exactly and are not all equal. For a given throw, the probability of 2 and of 12 is 1/36 each, whereas the probability of 7 is 6/36 because there are six ways to roll a 7 and only one way to roll 2 or 12. Summed over the eleven possible totals, the Shannon entropy is H ≈ 3.27 bits per throw, compared with log2 11 ≈ 3.46 bits if all totals were equally probable. So we see that the number of sequences calculated by Expression 4.7 is only 2.69 × 10^(−6) of that calculated from Expression 4.1, namely, 11^(N).
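
The dice comparison can be checked numerically. What follows is a minimal Python sketch, not taken from the book: the function names are illustrative, and it assumes N = 100 independent throws of two fair six-sided dice. The ratio is computed in log space so that neither 2^(NH) nor 11^(N) has to be formed directly.

```python
import math
from collections import Counter
from fractions import Fraction


def two_dice_distribution():
    """Return {total: probability} for the sum of two fair six-sided dice."""
    counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
    return {total: Fraction(ways, 36) for total, ways in counts.items()}


def shannon_entropy_bits(probabilities):
    """H = -sum p(i) log2 p(i), in bits per symbol (0 log 0 is taken as 0)."""
    return -sum(float(p) * math.log2(float(p)) for p in probabilities if p > 0)


def sequence_count_ratio(n_throws):
    """Return (H, 2^(N*H) / A^N) for N throws, where A = 11 possible totals."""
    dist = two_dice_distribution()
    h = shannon_entropy_bits(dist.values())
    # log2 of the ratio: N*H - N*log2(A)
    log2_ratio = n_throws * (h - math.log2(len(dist)))
    return h, 2.0 ** log2_ratio


if __name__ == "__main__":
    h, ratio = sequence_count_ratio(100)
    print(f"H = {h:.4f} bits per throw")
    print(f"2^(NH) / 11^N = {ratio:.3g}")
```

Run as a script, this should give H of roughly 3.27 bits and a ratio of roughly 2.69 × 10^(−6), in line with the figure quoted in the excerpt.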

We have calculated the number of sequences of length N in two apparently correct ways and the question arises: What happened to the sequences left out by the second method? This is explained by the Shannon–McMillan–Breiman Theorem (Breiman, 1957; McMillan, 1953; Shannon, 1948):

For sequences of length N being sufficiently long, all sequences being chosen from an alphabet of A symbols, the ensemble of sequences can be divided into two groups such that:
1. The probability P of any sequence in the first group is equal to 2^(−NH).
2. The sum of the probabilities of all sequences in the second group is less than ε, a very small number.

The Shannon–McMillan–Breiman Theorem is a surprising result. It tells us that the number of sequences in the first, or high, probability group is 2^(NH) and they are all nearly equally probable. We can ignore all those in the second or low probability group because, if N is large, their total probability is very small. The number of sequences in the high probability group is almost always many orders of magnitude smaller than that given by Expression 4.1, which contains an enormous number of “junk” sequences. In a fast-forward to Section 6.4, I find that the information content of 1-iso-cytochrome c, a small protein of 113 amino acids, is 233.19 bits. The number of 1-iso-cytochrome c sequences is 6.42392495176 × 10^(111). Calculating this number by Expression 4.1, we find 20^(113) = 1.03845927171 × 10^(147). The 1-iso-cytochrome c sequences are only a very tiny fraction, 6.1968577266 × 10^(−36), of the total possible sequences. Thus, one sees that Expression 4.1 is extremely misleading.
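
The split into a high probability group and a negligible remainder can be seen in a small numerical experiment. The sketch below is an illustration under stated assumptions, not the book's calculation: the source is a biased coin with p(heads) = 0.9, a sequence counts as "high probability" when −(1/N) log2 P lies within a tolerance delta of H, and the function name and parameters are hypothetical. Because every sequence with the same number of heads has the same probability, the sum runs over head counts rather than over all 2^(N) sequences.

```python
import math


def typical_set_summary(p_heads, n, delta):
    """Split all 2^n coin-toss sequences into a high probability group and the rest.

    A sequence is placed in the high probability group when its per-symbol
    surprisal, -(1/n) log2 P(sequence), lies within delta of the entropy H.
    Returns (H, total probability of the group, log2(group size) / n).
    """
    q = 1.0 - p_heads
    log2_p, log2_q = math.log2(p_heads), math.log2(q)
    h = -(p_heads * log2_p + q * log2_q)

    mass = 0.0   # summed probability of the group
    count = 0    # number of sequences in the group (exact integer)
    for k in range(n + 1):  # k = number of heads in the sequence
        log2_prob = k * log2_p + (n - k) * log2_q  # log2 P of each such sequence
        if abs(-log2_prob / n - h) <= delta:
            ways = math.comb(n, k)
            count += ways
            # Work in log space to avoid floating-point underflow.
            mass += 2.0 ** (math.log2(ways) + log2_prob)

    if count == 0:
        raise ValueError("delta is too small: no sequence qualifies")
    return h, mass, math.log2(count) / n


if __name__ == "__main__":
    h, mass, rate = typical_set_summary(p_heads=0.9, n=2000, delta=0.05)
    print(f"H = {h:.3f} bits per toss")
    print(f"probability of the high probability group = {mass:.3f}")
    print(f"log2(group size)/N = {rate:.3f} (all sequences: 1.000)")
```

With these assumed parameters the group should carry nearly all of the probability while its size grows like 2^(N(H ± delta)), a vanishing fraction of the 2^(N) sequences that a count in the style of Expression 4.1 would include.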

One must further remember that the word entropy is the name of a mathematical function, nothing more. One must not ascribe meaning to the function that is not in the mathematics. For example, the word information in this book is never used to connote knowledge or any other dictionary meaning of the word information not specifically stated here. The road we have taken, the one less traveled, has led us to the Shannon–McMillan–Breiman Theorem. It is, almost without exception, unknown to authors in molecular biology, and without it they have been led to many false conclusions. As in the sequences of throws of a pair of dice, all DNA, mRNA, and protein sequences are in the high probability group and are a very tiny fraction of the total possible number of such sequences.

======================

∗ Some authors are confused by the minus sign in Equation 4.5 and that leads them to believe in negative entropy (see Section 4.4). The probabilities of all events being considered must sum to 1. Probabilities lie between zero and one. Logarithms in that range are negative or zero, so log2 P is always zero or negative and so are the terms log2 p(i). We always take 0 log 0 to be zero. Therefore, Shannon entropy is always positive or zero.

Rates of evolution during the Cambrian are "consistent with Darwin's theory of evolution"

Original post by: Phys.org

A new study led by Adelaide researchers has estimated, for the first time, the rates of evolution during the "Cambrian explosion" when most modern animal groups appeared between 540 and 520 million years ago.
The findings, published online today in the journal Current Biology, resolve "Darwin's dilemma": the sudden appearance of a plethora of modern animal groups in the fossil record during the early Cambrian period.
"The abrupt appearance of dozens of animal groups during this time is arguably the most important evolutionary event after the origin of life," says lead author Associate Professor Michael Lee of the University of Adelaide's School of Earth and Environmental Sciences and the South Australian Museum.
"These seemingly impossibly fast rates of evolution implied by this Cambrian explosion have long been exploited by opponents of evolution. Darwin himself famously considered that this was at odds with the normal evolutionary processes.
"However, because of the notorious imperfection of the ancient fossil record, no-one has been able to accurately measure rates of evolution during this critical interval, often called evolution's Big Bang.
"In this study we've estimated that rates of both morphological and genetic evolution during the Cambrian explosion were five times faster than today – quite rapid, but perfectly consistent with Darwin's theory of evolution."
The team, including researchers from the Natural History Museum in London, quantified the anatomical and genetic differences between living animals, and established a timeframe over which those differences accumulated with the help of the fossil record and intricate mathematical models. Their modelling showed that moderately accelerated evolution was sufficient to explain the seemingly sudden appearance of many groups of advanced animals in the fossil record during the Cambrian explosion.


Biologists measure evolution's Big Bang

The research focused on arthropods (insects, crustaceans, arachnids and their relatives), which are the most diverse animal group in both the Cambrian period and present day.
"It was during this Cambrian period that many of the most familiar traits associated with this group of animals evolved, like a hard exoskeleton, jointed legs, and compound (multi-faceted) eyes that are shared by all arthropods. We even find the first appearance in the fossil record of the antenna that insects, millipedes and lobsters all have, and the earliest biting jaws," says co-author Dr Greg Edgecombe of the Natural History Museum.
======================
You can find the original research paper here: Rates of Phenotypic and Genomic Evolution during the Cambrian Explosion

Monday, September 9, 2013

Exploring the protein universe: a response to Doug Axe

Original post by: Steve Matheson

One of the goals of the intelligent design (ID) movement is to show that evolution cannot be random and/or unguided, and one way to demonstrate this is to show that an evolutionary transition is impossibly unlikely without guidance or intervention. Michael Behe has attempted to do this, without success. And Doug Axe, the director of Biologic Institute, is working on a similar problem. Axe's work (most recently with a colleague, Ann Gauger) aims (in part, at least) to show that evolutionary transitions at the level of protein structure and function are so fantastically improbable that they could not have occurred "randomly."

Recently, Axe has been writing on this issue. First, he and Gauger just published some experimental results in the ID journal BIO-Complexity. Second, Axe wrote a blog post at the Biologic site in which he defends his approach against critics like Art Hunt and me. Here are some comments on both.

1. Like my friend Todd Wood, I am encouraged by the fact that Biologic Institute is doing good scientific work and generating publishable data. Axe and Gauger seem to be smart and capable scientists, and they are asking good questions. May their Institute and its scientific work live long and prosper.

2. Axe is primarily interested in the evolution of protein folds. That question is both intensely interesting and important. And difficult.

3. Like Todd, I found the BIO-Complexity paper to be interesting technically but badly flawed in its theoretical approach and conclusions. Specifically, I note what I think any evolutionary biologist would immediately see: that Axe and Gauger did not test an evolutionary hypothesis. Todd explains this very well, but here's the basic problem. To test an evolutionary hypothesis, as I mentioned above, one must study an evolutionary transition. In other words, one must study a change or transition from an ancestral state to a current (or later) state. Joe Thornton's work is a great example: his group examined protein function in a reconstruction of an evolutionary transition. What Axe and Gauger did was study a "transition" that has never been proposed to have happened. They examined a transition from one currently-existing protein to another currently-existing protein. It's as though they analyzed the "transition" from a cat to a dog, when they should have analyzed the transition from ancestral mammals to dogs and/or cats. Their conclusions tell us something about protein structure and function but, crucially, not about the evolution of those proteins.

This does not mean that Axe and Gauger are incorrect in their hypothesis, namely that different proteins are separated by vast evolutionary wastelands that can only be traversed with the help of "design." That may be the case. But the newly-published work in BIO-Complexity gets them no closer to establishing that hypothesis as reasonable or even likely.

4. In his blog post, Axe continues to insist that evidence for rarity of function in the protein universe is evidence for isolation of individual functions in the protein universe. His arguments from probability, which have been used so many times before, simply do not convince me because, as I wrote before: isolation and rarity are not the same thing. I don't happen to think that Axe's data tell us much about the rarity of function (more on this below), but even if I did, I would find that insufficient to undermine the proposal that proteins are linked in a phylogenetic tree the way species are. Again, this is not to say that I know that Axe is wrong. I'm saying that his arguments are unconvincing to me, and that the experiments needed to test his conjecture have yet to be done.

5. Axe claims that I was wrong to describe his 2004 experiments as "whopping mutations on crippled proteins." But that's what they were. He nicely explains why that was the best way to do his experiment, and I think he's right about that. But the fact remains that his analysis doesn't help us understand evolution if his experiment involved a barely-functioning enzyme subjected to mutagenesis that changed ten amino acids at a time. As I think Art Hunt tries to make clear, this doesn't mean that his experiment was stupid or poorly designed. It does mean, clearly in my view, that the experiment tells us little about evolutionary change. And Axe himself seems to agree: he explains that he wasn't attempting to simulate evolution, only to estimate the rarity of protein function in the protein universe (or the protein-fold universe).

6. In my opinion, Axe significantly overstates his findings on the topic of "function." So for example, in both the 2004 paper and the new BIO-Complexity paper, the experiments involve measuring a single function for each enzyme. It seems to me (and I could be wrong) that when the authors see that a particular variant (mutant) of the protein stops performing that one function, they conclude that the protein "has no function." (In the BIO-Complexity paper, it's two proteins and two functions, but the point is the same.) But of course we don't know that, and evolutionary explanations would propose that new functions frequently arise when an enzyme has more than one function (or is broad-based in its function, or is modular in its structure and function). This is why I think that Axe and colleagues can't make any headway in their efforts to understand the evolution of protein function until they focus intentionally on evolutionary transitions. Instead of showing us that mutated proteins no longer do what they used to do, they should invert their reasoning to look something like this:

Here are the proteins in a postulated evolutionary trajectory. What can we learn about the functions of the intermediates during the transition?

Those would be extensive and demanding experiments, to be sure, but they're the only kinds of experiments that can address the difficult questions that Axe wants to ask. This, by the way, is the same critique I gave Mike Behe in response to his erroneous claims in his most recent book.

7. I'm not so sure that function is as rare as Axe (and others) think. It turns out that completely novel (and foreign) protein sequences can be shown to have function, in living bacterial cells. We may be mistaken in our assumption that islands of function in the protein universe are fantastically rare.

8. Axe and his colleagues do good work, and they're asking important questions. I hope they are in close contact with scientists working on similar questions. There are many strong labs working hard on protein evolution, from various angles, and I'm sure that the scientists at Biologic Institute would profit immensely from regular interactions with the scientific community. (Consider, for example, the authors of a 2010 PLoS ONE paper on "Evolutionary Innovations and the Organization of Protein Functions in Genotype Space.") Perhaps this is happening, and if so, great. But it needs to be emphasized.

So, kudos to the scientists of Biologic Institute for working hard in the lab, and for tackling an important and formidable problem. They haven't shown us anything important about evolution yet, but I hope they keep at it, with a little more careful thought and a lot more input from colleagues.

How to evolve a new protein in (about) 8 easy steps

Original post by: Steve Matheson

If you have only read the more superficial descriptions of intelligent design theory, and specifically the descriptions of irreducible complexity, you might (reasonably) conclude that Michael Behe and other devotees of ID have claimed that any precise interaction between two biological components (two parts of a flagellum, two enzymes in the blood clotting cascade, or a hormone and its receptor) cannot arise through standard Darwinian evolution. (If you don't know anything about the term 'irreducible complexity' you should probably read a little about it before proceeding.) In other words, you may be under the impression that Behe doesn't think that such a system could arise through a stepwise process of mutation and selection. You may even be under the impression that Behe has demonstrated the near impossibility of such a system coming to be through naturalistic means.

You would be mistaken, albeit (in my opinion) understandably so. Behe has not claimed this -- though he's often come pretty close -- and recently he has made it clear that this is not his position. Unfortunately, many of the critiques of irreducible complexity contain significant errors, including the claim that Behe rejects all stepwise accounts of molecular evolution, and you have to look pretty hard to find well-reasoned examinations of the problems with Behe's interesting but fruitless challenge to evolutionary theory.

My purpose in the preamble above is to make it clear that this Journal Club is not intended to refute Behe's claims regarding the ability of Darwinian mechanisms to generate irreducibly complex structures. (In my view, his claims are wholly mistaken, and Christian enthusiasm for his natural theology is a disastrous mistake. But that's for another time.) Rather, it is to discuss a superb recent example of the kind of experimental molecular analysis of evolution that can be done in this postgenomic era. Experiments like this are revealing how evolutionary adaptation actually comes about at the molecular level, thereby addressing the very questions raised by ID thinkers. ID apologists are, in a sense, wise to attack the work described here, because these experiments are the first fruits of the types of analysis that will usher ID into permanent scientific ignominy.

So, to our two papers.

How, exactly, does a protein acquire a new function during evolution? This is one of those "big questions" in evolutionary biology. Broad concepts such as gene duplication are quite helpful in formulating explanations, but the specific question raised is focused on the details -- the actual steps -- that must occur during the step-by-step modification of a protein such that it performs a different job than the proteins from which it has descended. The constraints on the process of change are significant, and the issues are similar to those I discussed when describing the concept of fitness landscapes in morphospace. The problem, basically, is this: how can you change a protein without wrecking it in the process? In other words, can you get from function A to function B, step by step, without passing through an intermediate form, call it protein C, which is worthless (or even harmful)?

These are precisely the questions addressed in an elegant set of experiments described in two reports over the last year or so. The second article, by Ortlund et al., was reported in the 14 September issue of Science, and built on work reported in Science in April 2006. Their studies focused on two closely-related proteins that are receptors for steroid hormones. In this case, the steroids of interest are corticosteroids (the kind often used to treat inflammation; Ortlund et al. studied receptors for cortisol, which is of course quite similar to cortisone) and a mineralocorticoid (a less well-known hormone, aldosterone, that regulates fluid and salt intake). The hormones are structurally similar (being steroids).

Joseph Thornton, at the University of Oregon, has been studying the origins of these receptors for about 10 years, and has assembled an interesting (and detailed) account of their history. The basic outline is as follows: the original steroid receptor was an estrogen receptor, and is extremely ancient, apparently arising "before the origin of bilaterally symmetric animals" (Thornton et al., Science 2003). (That's seriously ancient, sometime in the Cambrian or earlier.) The progesterone receptor seems to have arisen next, followed by the androgen (i.e., testosterone) receptor. (Now that's intriguing.) Fairly late in this game, the two receptors of interest to us here, the corticosteroid receptor and the mineralocorticoid receptor, were added to the vertebrate repertoire. The two modern receptors are thought to descend from an ancestral corticosteroid receptor, which underwent a gene duplication. Hereafter, I'll refer to the receptors as the corticosteroid receptor and the aldosterone receptor, hoping that all the jargon won't obscure the message.

In a widely-discussed paper published in Science a year ago (Bridgham et al., Science 2006), Thornton's group determined the most likely DNA sequence of this ancestral gene, then "resurrected" it, meaning simply that they created that very DNA sequence in the lab. (Determining the ancestral sequence was a nifty piece of work; actually making the DNA is quite straightforward, especially if you have a little dough.)

Their experiments showed that the ancestral receptor could bind to a hormone that didn't exist yet (aldosterone) while it was functioning as a receptor for corticosteroids. In other words, the receptor was available for activation by aldosterone long before aldosterone was around. (All jawed vertebrates make corticosteroids, but only tetrapods make and use aldosterone, an innovation that occurred at least 50 million years later.) The modern corticosteroid receptor has since lost its ability to interact with aldosterone, and Bridgham et al. chart the most likely evolutionary path, at the molecular level, by which we and other tetrapods came to have a corticosteroid receptor that won't bind to aldosterone. The surprising result, however, is the fact that the ancient receptor was able to bind aldosterone, millions of years before aldosterone is thought to have been present.

The 2006 paper is, I think, more notable as an illustration of an important evolutionary principle ("molecular exploitation" is the authors' term) than as a set of observations; Michael Behe's trashing of the group's work is disgusting, but it's true that the findings are limited in scope. It's worth having a look at the whole paper, though (and I believe it's freely available with free registration), because the authors very clearly explain the rationale for their continuing work, which is to begin to address one of the major "gaps in evolutionary knowledge": the mechanisms underlying stepwise evolution of "complex systems that depend on specific interactions among the parts."

If you're well-read on ID thought, that last sentence should sound pretty familiar. So let's note that prominent papers in science's premier journals are acknowledging that the evolutionary mechanisms that generate complex structures -- including "irreducibly complex" systems -- are as yet poorly understood. And let's give ID credit for asking a good question. (Not a new one...but a good one.)

The 2006 paper did not, as advertised, utterly destroy ID arguments, and again Behe is right to criticize the near-hysteria surrounding that work. But I find Behe's bravado otherwise unconvincing. Because that paper did set up the most recent work, and the whole story illustrates rather clearly how ID's question will (soon) be answered.

The most recent paper adds significantly to the picture, and introduces some genetic concepts that Behe's fans should pray he understands. The authors (Ortlund et al.) took their analysis to a far more detailed level, by extending their previous observations to include much more of the receptor family tree. In the 2006 work, they had assembled a detailed family tree for the receptors, by looking at DNA sequences from living species known to represent various branches on the tree of life. In other words, they chose organisms such as lampreys, bony fish, amphibians and mammals, and examined their DNA codes (for the receptors) to find the changes that occurred in each branch of the lineage. Now, please stop and think about this, because it's really cool. What the authors did was mine existing databases of DNA sequence data, pulling out the sequences of the steroid receptors from 29 different vertebrate species. You could repeat this part of the experiment right now, by referring to their list of organisms in Supplemental Table S5, which provides the ID codes needed to locate the DNA sequences in the Entrez Gene database. Then they charted the changes in the DNA sequence in the context of the tree of life as sketched out in the fossil record. The tree they assembled includes all the steroid receptors, and I've annotated it a little if you want to have a look. They used this tree to guide their further experiments, as I'll explain below. What the most recent paper added to the story was an analysis of the 3-D structure of the various postulated intermediates in the evolutionary pathway. The authors accomplished this by making proteins from the "resurrected" genes, then crystallizing them and using X-ray diffraction techniques to determine their precise structures.

Examination of their receptor family tree revealed something interesting. Most vertebrates have highly specific receptors: the corticosteroid receptor isn't strongly stimulated by aldosterone, and vice versa. But some living vertebrates (skates, in particular) show a different pattern: the corticosteroid receptor isn't all that specific for cortisol. Because the ancestral receptor also lacked specificity (as shown in the 2006 paper), the authors concluded that the receptor acquired its discriminating taste at some point between the branching-off of skates (and their kin) and the separation of fish from tetrapods. Their Figure 1 is a little crowded, but it illustrates this nicely:


To follow the evolutionary narrative in this graph, start at the blue circle, which represents the ancestral receptor that was "resurrected" in the 2006 paper and that happily binds to both corticosteroids and aldosterone. (The graphs on the right side of the figure demonstrate the specificity, or lack thereof, of the receptors at different times in history.) There's a branch leading up and to the left, to the various GRs (corticosteroid receptors), and one leading up and to the right, to the MRs (aldosterone receptors). At the green circle, another branching event occurred, 440 million years ago, at which point certain groups of fishes (skates among them) branched off, up and to the right. The receptor at that point is an ancestral corticosteroid receptor, and it still isn't specific for corticosteroids. But the receptor at the yellow circle, in the common ancestor of tetrapods and bony fishes, is specific. The authors conclude that specificity arose between those two points, between 420 and 440 million years ago. With some (deliberate?) irony, they indicate that process with a black box.

The rest of the paper explores the pathway by which the receptor might have been successively altered so as to install specificity for cortisol. During those 20 million years of evolution, at least 36 different changes were introduced in the makeup of the receptors. By looking at the 3-D structures of the ancestral forms, the authors were able to discern the specific functional ramifications of these various changes, and they found that the alterations fell into three groups:
  • Group 'X' alterations included the changes reported in the 2006 article. These are the biggies, that account for much of the functional 'switch' between GRs and MRs. These alterations don't account for the specificity change that occurred inside the black box in Figure 1.
  • Group 'Y' alterations are all strongly conserved (meaning that they were permanent changes), and occurred during the black box time period. Moreover, this group of changes is always seen together: modern receptors have all of these alterations, while ancestral receptors have none of them.
  • Group 'Z' alterations are also conserved changes, but they don't always occur together like group 'Y'.
The authors set about the work of examining the function of "resurrected" receptors bearing these groups of changes. When they introduced group 'X' changes into the ancestral receptor, they got a receptor that was almost modern (i.e., specifically tuned to cortisol) but not quite; this was what the previous work had indicated. Then they hypothesized that the group 'Y' changes, because they were so highly conserved and because they all occurred together, would make the transition complete. But no: instead, the group 'Y' alterations made the receptor worthless, unable to bind any hormone at all. Surprise! Looking at their 3-D structures, they figured out what this meant. The group 'Y' changes were somehow important, but they could only have a beneficial influence in the presence of another set of alterations, group 'Z', which had to occur in advance. The biophysical details don't concern us, but the basic idea is that the group 'Z' changes created a permissive environment for the group 'Y' changes, which are the alterations that complete the development of the modern specific form of the receptor for cortisol.

In genetics, we have a word for this type of interaction between genetic influences: epistasis. The fascinating history of steroid receptor evolution includes examples of what the authors call "conformational epistasis," meaning that some alterations in 3-D structure are required in advance for other alterations to ever get off the ground. Specifically, some alterations are evolutionary dead ends, because they yield worthless proteins, unless those alterations follow another set of changes that generated a different -- and more fruitful -- environment.

The authors then construct a map of what they call "restricted evolutionary paths through sequence space," showing how you can get there from here, without traversing an evolutionary no-man's-land of non-function. The path includes changes that don't apparently improve the receptor, but that yielded the right environment for the changes that did improve function. Their map is in Figure 3:


The idea is that you want to get from the lower left corner of the cube (the ancestral receptor) to the upper right corner (the modern receptor) without hitting a stop sign (a worthless receptor). The green arrows indicate a change in function of some kind, the white arrows no change. Yes, you can get there from here.
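
To make the "can you get there from here" question concrete, here is a toy sketch in Python. It is not the authors' analysis: it treats each group of changes (X, Y, Z) as a single step, and the only rule it encodes is the one described above, namely that group 'Y' changes wreck the receptor unless group 'Z' changes are already in place. The function names are mine, and the real Figure 3 may well mark other combinations as non-functional.

```python
from itertools import permutations

# Each group of alterations is treated as one step (a simplifying assumption).
GROUPS = ("X", "Y", "Z")


def is_functional(state):
    """Toy rule from the text: Y changes without Z changes give a dead receptor."""
    return not ("Y" in state and "Z" not in state)


def viable_paths(groups=GROUPS):
    """Return every ordering of the groups whose intermediates all stay functional."""
    paths = []
    for order in permutations(groups):
        state = set()
        ok = True
        for group in order:
            state.add(group)
            if not is_functional(state):
                ok = False  # hit a "stop sign": a worthless receptor
                break
        if ok:
            paths.append(order)
    return paths


if __name__ == "__main__":
    for path in viable_paths():
        print(" -> ".join(path))
```

Under that single rule, three of the six possible orderings survive, and in every one of them Z comes before Y. That is the sense in which epistasis narrows the set of accessible paths without closing them off.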

The authors note that their data "shed light on long-standing issues in evolutionary genetics," firstly the question of whether adaptation proceeds through "large-effect" changes (mutations), or through baby steps. Their conclusion:
Our findings are consistent with a model of adaptation in which large-effect mutations move a protein from one sequence optimum to the region of a different function, which smaller-effect substitutions then fine-tune; permissive substitutions of small intermediate effect, however, precede this process.
They note that the large-effect changes are inherently easier to identify (of course), and that the painstaking work of "resurrecting" the ancestral proteins and studying their function is the only way to identify the critical small-effect alterations that made the "big jump" work.

The authors also comment on the big "contingency" debate. I'll write more on the whole "rewinding the tape of life" question some other time; for now, we'll just consider the authors' words:
A second contentious issue is whether epistasis makes evolutionary histories contingent on chance events. We found several examples of strong epistasis, where substitutions that have very weak effects in isolation are required for the protein to tolerate subsequent mutations that yield a new function. Such permissive mutations create “ridges” connecting functional sequence combinations and narrow the range of selectively accessible pathways, making evolution more predictable.
If you have read my summary of the wormholes in morphospace story, this metaphor of "ridges" should make a little sense. The authors here are describing the same concept: an evolutionary exploration of a design space, with paths meandering through a map of the possibilities. But:
Whether a ridge is followed, however, may not be a deterministic outcome. If there are few potentially permissive substitutions and these are nearly neutral, then whether they will occur is largely a matter of chance. If the historical “tape of life” could be played again, the required permissive changes might not happen, and a ridge leading to a new function could become an evolutionary road not taken.
The history of the steroid hormone receptor, then, appears to include several different aspects of evolutionary biology combined: "chance" creating opportunity, leading (via epistasis) to selection for improvement, all done step by step, with some steps generating more apparently dramatic change than others.

Amazingly, Michael Behe is pretending that this analysis is utterly unimportant, with no implications at all for ID proposals, because the receptor-hormone system isn't "irreducibly complex." Some critics of ID claim that the goalposts are being regularly moved, and I'm inclined to agree. But let's just grant Behe the difference between protein-hormone interactions and protein-protein interactions. Does anyone really believe that Joseph Thornton's work doesn't show us exactly how the "irreducible complexity" challenge is going to fare in the near future?

Sunday, September 8, 2013

Axe (2004) and the evolution of enzyme function


[Preface - the subject of protein evolution pops up on a regular basis in ID circles.  Recently, William Dembski mentioned the study alluded to in the title of this essay as an improved argument/piece of evidence for intelligent design.  Specifically, Dembski said:
"(2) The challenge for determining whether a biological structure exhibits CSI is to find one that’s simple enough on which the probability calculation can be convincingly performed but complex enough so that it does indeed exhibit CSI. The example in NFL ch. 5 doesn’t fit the bill. The example from Doug Axe in ch. 7 of THE DESIGN OF LIFE (www.thedesignoflife.net) is much stronger."

"The example from Doug Axe in ch. 7 of THE DESIGN OF LIFE" would appear to be Axe's 2004 paper in the Journal of Molecular Biology, the subject of my first ever essay on The Panda's Thumb.  Since I have been a bit remiss in re-posting older essays here, I thought I would use this excuse to put this here.  It's "published" without change, so as to maintain some sort of continuity.  As always, enjoy.]

Douglas Axe recently (well, sort of) published an article in the Journal of Molecular Biology entitled “Estimating the Prevalence of Protein Sequences Adopting Functional Enzyme Folds” (Axe, J Mol Biol 341, 1295-1315, 2004). In his discussion of the experimental observations, Dr. Axe mentions some numbers that are likely to generate much discussion amongst Intelligent Design advocates and critics. For example, Stephen Meyer (2004) cites Axe at a key point in the argument in his recent article advocating Intelligent Design, “The Origin of Biological Information and the Higher Taxonomic Categories,” much discussed in previous Panda’s Thumb threads (here).
“Axe (2004) has performed site directed mutagenesis experiments on a 150-residue protein-folding domain within a B-lactamase enzyme. His experimental method improves upon earlier mutagenesis techniques and corrects for several sources of possible estimation error inherent in them. On the basis of these experiments, Axe has estimated the ratio of (a) proteins of typical size (150 residues) that perform a specified function via any folded structure to (b) the whole set of possible amino acids sequences of that size. Based on his experiments, Axe has estimated his ratio to be 1 to 10^77. Thus, the probability of finding a functional protein among the possible amino acid sequences corresponding to a 150-residue protein is similarly 1 in 10^77.”

More recently, Dembski cited Axe in his Expert Witness Report for the Dover trial (see this).
“Recent research by Douglas Axe (see Appendix 3) provides such evidence in the form of a rigorous experimental assessment of the rarity of function-bearing protein sequences. By addressing this problem at the level of single protein molecules, this work provides an empirical basis for deeming functional proteins and systems of functional proteins to be unequivocally beyond Darwinian explanation.”

Given that this subject is often raised by ID proponents (such as this), and that the Biologic Institute (where Axe works) has made some news accounts, it seems appropriate to review Axe’s work. The purpose of this PT blog entry is to try and lay out the study cited above (Axe DD, J Mol Biol 341, 1295-1315, 2004) in a form that is accessible to most interested parties, and to discuss a larger context into which this work might be placed. Needless to say, the grand pronouncements being made by the ID camp are not warranted.


Section 1. What Axe did

First, a brief overview of the experiment and results. The object of interest was the so-called large domain of the TEM-1 penicillinase, an enzyme that breaks down antibiotics related to penicillin. (Antibiotics such as penicillin are called, collectively, beta-lactams, and enzymes that break down these antibiotics and confer drug resistance are called beta-lactamases, which is why the term beta-lactamase may pop up in this blog entry from time to time.) Axe was interested in using a mutational approach to explore the constraints for forming a functional large TEM-1 domain, and applying these results to estimate the density of functional sequences in the space of all possible amino acid sequences.

The approach taken was to generate collections of randomized mutant sequence variants in a functional TEM-1 variant and “count” the numbers of mutants that retained some measure of activity. Activity was measured by growth of bacteria containing the variants on relatively low levels of ampicillin, a target (or substrate) of TEM-1. (Cells with active TEM-1 can break down the ampicillin and thus survive, whereas cells with mutant TEM-1 variants that can no longer maintain a stably-folded enzyme cannot break down the antibiotic, and thus will not grow.)

Axe anticipated that the native TEM-1 would be rather “resistant” to random mutagenesis, owing to a “buffering effect” contributed by what is probably a robust structural fold. This would preclude a proper assessment of the constraints governing low-level function, which in turn are the constraints relevant to the question of the emergence of functional sequences. Accordingly, he first isolated, by targeted mutagenesis, a so-called “reference sequence”, a TEM-1 variant that was expected to be much more susceptible to the effects of mutational change. (This is a crucial aspect of the experiment, the ramifications of which are discussed in Section 2.) The variant was identified as a temperature-sensitive enzyme that permitted growth of bacteria on selective (ampicillin-containing) media at a permissive temperature (25°C), and differs from the wild-type at 22% of the 153 positions. (“Temperature-sensitive” enzymes lose function after a small change in temperature. Here, the enzyme had some modicum of activity at a lower temperature — 25°C — but was inactive at elevated temperatures — e.g., 37°C, the temperature at which E. coli is usually grown.)

Having generated a mutation-sensitive TEM-1 variant, Axe then set about to do the mutagenesis. For this, four ten-amino-acid clusters (each of which is spatially separate from the others in the 3-dimensional structure) were partially randomized. The variations that were introduced in each cluster were designed so as to retain the general hydropathic profile (see [1]) at the positions being varied. Populations of randomized pools were plated on selective media (at the permissive temperature) and the numbers of colonies counted. From this (and from other measurements that established the total numbers of screened mutants), the relative frequency of functional mutants was determined. For two of the four clusters, a recovery rate of about 0.03% (i.e., 3×10^-4) was found. For one of the clusters, a rate of 1 in 10^5 was seen. No functional variants for one of the clusters were isolated; based on the total numbers of clones analyzed, this sets a limit of 2×10^-5 as the frequency of functional mutants for the large domain of TEM-1.

These are the experiments and “raw data”. Axe averaged these four values and derived a mean per-residue tolerance to change; this value is 0.38 (roughly speaking, this is the fraction of variants at a given position that will yield a functional enzyme, and thus a functional fold). From this, he calculated that the fraction of all possible variants in the 153 amino-acid TEM-1 fold that will be functional is about 1 in 10^64 (i.e., 0.38 raised to the 153rd power).
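
For readers who want to check the arithmetic, the following Python sketch reproduces the numbers above. One assumption is mine and not necessarily Axe's procedure: I treat each cluster's recovery rate as the tenth power of a per-residue tolerance (ten positions per cluster) and average the four clusters geometrically. The cluster that gave no survivors is entered at its upper limit.

```python
import math

CLUSTER_SIZE = 10      # residues randomized per cluster
DOMAIN_LENGTH = 153    # residues in the large TEM-1 domain

# Recovery rates quoted in the text; the last one is an upper limit.
cluster_frequencies = [3e-4, 3e-4, 1e-5, 2e-5]


def per_residue_tolerance(freqs, cluster_size=CLUSTER_SIZE):
    """Geometric-mean tolerance per position (assumed averaging method)."""
    mean_log10 = sum(math.log10(f) for f in freqs) / len(freqs)
    return 10 ** (mean_log10 / cluster_size)


def log10_functional_fraction(tolerance, length=DOMAIN_LENGTH):
    """log10 of tolerance**length, computed in log space to avoid underflow."""
    return length * math.log10(tolerance)


if __name__ == "__main__":
    tolerance = per_residue_tolerance(cluster_frequencies)
    print(f"per-residue tolerance: {tolerance:.2f}")
    print(f"0.38^153 is about 10^{log10_functional_fraction(0.38):.1f}")
```

The point of working in logarithms is simply that numbers this small are awkward to handle directly; the exponent is the quantity everyone ends up quoting anyway.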

This number represents the number of functional variants that are related to the specific reference sequence that was randomized. Axe also compared a great many naturally-occurring “relatives” of the TEM-1 fold and derived a general hydropathic “signature”. From this work, Axe estimated that about 1 in 10^33 of all sequences will possess the TEM-1 hydropathy signature, and hence a fold related structurally to the TEM-1 domain. Since each of the properly-folded (1 in 10^33) variants might be expected to possess a similar range of individual functionality (i.e., 1 in 10^64 of the family of sequences related to each variant is expected to be functional), we can estimate that 1 in 10^97 of all possible 153-mers will possess a functional TEM-1 fold.


Section 2. Going from Axe’s work to “functional proteins are isolated in sequence space”

A claim that is being made by ID proponents (as in Meyer’s paper) is that work such as this shows that functional proteins are so rare in sequence space that the natural origin of new proteins is so improbable as to be effectively impossible. Briefly, the argument is (or will be) that, if function occurs only once every 10^77 sequences (to use one of the numbers from Axe’s work), then it is rather unlikely that new functions can arise in the biosphere. However, Axe’s approach does not permit such a conclusion. The following hopefully conveys this.

Put pictorially, the issue that ID proponents are arguing about is the relative structure, or shape, of the topography of functional sequences in all of sequence space. To illustrate, the issue becomes one of the parameters of the hill shown in this figure (we’ll call it Figure 1):


[Figure 1: newerAxeF1b.png]


In this illustration, the base formed by the X and Y axes represents the sequence “space”, each hypothetical point or patch would depict a different sequence, and the Z-axis depicts some measure of activity. The “accessibility” of function, using this illustration, is a matter of the area of the base of the hill shown — the broader the base, the greater the number of related and functional sequences, and the greater the number of ways that function may be “found”. The idea that ID proponents push is that, if such a hill has a narrow enough base, then it is not likely that random processes can “find” even the base of the hill, let alone the peak. The experimental approach used by Axe is predicated on the assumption that the shape of this hill can be determined by assessing its susceptibility to mutation. Thus, the greater the sensitivity, the narrower the base, and the less likely is it that function can arise. ID proponents argue that Axe’s work shows that, indeed, the base is very narrow. This follows a. from the numbers given in the preceding; and b. from the nature of Axe’s experimental design.

There is, however, a fly in the ointment. (Actually, there are many.) Recall that Axe did not work with the native TEM-1 penicillinase, but rather with a variant that had a lower activity. The assay system made this necessary. (Scoring bacteria on antibiotic-containing media isn’t particularly discriminating, and it’s hard to tell if, say, a wild-type detoxifying enzyme has lost 90% of its activity.) In other words, Axe decided to select a particular part of the “hill” such as that shaded in black in the following illustration (Figure 2):


[Figure 2: newerAxeF2b.png]

(Look carefully – the black patch isn’t very big, because Axe has limited his scope in an analogous fashion.)

In addition, Axe deliberately identified and chose for study a temperature sensitive variant. In altering the enzyme in this way, he molded a variant that would be exquisitely sensitive to mutation. In terms of our illustrations, Axe’s TEM-1 variant is a tiny “hill” with very steep sides, as shown in the following (Figure 3):


[Figure 3: newerAxeF3b.png]


Obviously, from these considerations, we can see that assertions that the tiny base of the “hill” in Figure 3 in any way reflects that of a normal enzyme are not appropriate. On this basis alone, we may conclude that the claims of ID proponents vis-à-vis Axe 2004 are exaggerated and wrong. Axe’s numbers tell us about the apparent isolation of the low-activity variant, but reveal little (nor can they be expected to) about the “isolation” or evolution of TEM-1 penicillinase. (Or any other enzyme, for that matter.)

Of course, there is more. Most naturally-occurring enzymes are not isolated activities as Figure 1 would imply. Something like the next illustration (Figure 4) is a better depiction — distinct activities and enzymes are often derived from common structural and sequence themes. This expands the base of the “hill” to include those of the neighboring activities; this may be considerable indeed. (In the example of TEM-1 penicillinases, the neighbors would include DD-peptidases; Knox et al, 1996; Adediran et al., 2005.)


[Figure 4: newerAxeF4.png]


But there is even more. Since the “goal” of the evolutionary exercise is a catalytic activity, and not a particular structure, the possible existence of totally unrelated structures and sequences that possess a similar activity complicates matters even more. This is pertinent for penicillinases, which are beta-lactamases. We know of a number of other families of structures that include beta-lactamases (Helfand and Bonomo, 2003). One of these, the metallo-beta-lactamases (Daiyasu et al., 2001), is quite unrelated to the TEM-1 enzymes. Axe’s study does not “count” these families of enzymes (or their neighbors), nor does it acknowledge that many more such structures are at least hypothetically possible.


Section 3. So how broad is the base of the hill?


That is the real question that Axe, ID proponents, and others who follow this sort of discussion would ask. To get some idea, we can turn to Axe’s paper. Axe mentions two other studies: one deals with experiments done with the lambda repressor, and the other with chorismate mutase. Work with the lambda repressor (Reidhaar-Olson and Sauer, 1990) yielded a “value” for the frequency of functional variants of 1 in 10^63 (roughly) for the 92-mer. Work with chorismate mutase (Taylor et al., PNAS 98, 10596-10601, 2001) gave a value of 1 in 10^24 for the 93 amino acid enzyme. Scaled for a similar size protein, Axe’s work gives a value of 1 in 10^59, which falls within the range established by previous work. (The literature in this area is rather large, far beyond the scope of this article to review. Suffice to say that the range of “probability” stated here is representative of the numerous studies in this area.)
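
The “scaled for a similar size protein” step can be sketched in Python as follows. The assumption here is mine, not something stated in Axe’s paper: the exponent is taken to scale linearly with chain length, i.e., each residue contributes the same average factor. The function name is hypothetical.

```python
def scale_log10_density(log10_density, from_length, to_length):
    """Rescale a log10 density to another chain length, assuming a constant per-residue factor."""
    return log10_density * to_length / from_length


# 1 in 10^97 for the 153-residue TEM-1 domain, rescaled to a 93-residue protein.
scaled = scale_log10_density(-97, from_length=153, to_length=93)

# Published values quoted above: lambda repressor (10^-63), chorismate mutase (10^-24).
within_published_range = -63 <= scaled <= -24

if __name__ == "__main__":
    print(f"scaled estimate: 10^{scaled:.0f}")
    print(f"within the published range: {within_published_range}")
```

Under that assumption, the 1 in 10^97 figure for 153 residues rescales to roughly 1 in 10^59 for a protein the size of chorismate mutase, which sits between the two earlier estimates.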

Studies such as these involve what Axe calls a “reverse” approach: one starts with known, functional sequences, introduces semi-random mutants, and estimates the size of the functional sequence space from the numbers of “surviving” mutants. Studies involving the “forward” approach can and have been done as well. Briefly, this approach involves the synthesis of collections of random sequences and isolation of functional polymers (e.g., polypeptides or RNAs) from these collections. Historically, these studies have involved rather small oligomers (7-12 or so), owing to technical reasons (this is the size range that can be safely accommodated by the “tools” used). However, a relatively recent development, the so-called “mRNA display” technique, allows one to screen random sequences that are much larger (approaching 100 amino acids in length). What is interesting is that the forward approach typically yields a “success rate” in the 10^-10 to 10^-15 range: one usually needs to screen between 10^10 and 10^15 random sequences to identify a functional polymer. This is true even for mRNA display. These numbers are a direct measurement of the proportion of functional sequences in a population of random polymers, and are estimates of the same parameter that Axe is after (the density of sequences of minimal function in sequence space).

10^-10 to 10^-63 (or thereabout): this is the range of estimates of the density of functional sequences in sequence space that can be found in the scientific literature. The caveats given in Section 2 notwithstanding, Axe’s work does not extend or narrow the range. To give the reader a sense of the higher end (10^-10) of this range, it helps to keep in mind that 1000 liters of a typical pond will likely contain some 10^12 bacterial cells of various sorts. If each cell gives rise to just one new protein-coding region or variant (by any of a number of processes) in the course of several thousands of generations, then the probability of occurrence of a function that occurs once in every 10^10 random sequences is going to be pretty nearly 1. In other words, 1 in 10^10 is a pretty large number when it comes to “probabilities” in the biosphere.
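
The pond estimate is easy to check. The Python sketch below assumes, as the paragraph above does, that each of the 10^12 cells contributes one independent “draw” from sequence space, and that each draw has a 1 in 10^10 chance of being functional. Treating the draws as independent is a simplification, and the function name is mine.

```python
import math


def prob_at_least_one(p_success, n_trials):
    """P(at least one success) = 1 - (1 - p)^n, computed stably for tiny p and huge n."""
    return -math.expm1(n_trials * math.log1p(-p_success))


p_functional = 1e-10   # upper end of the literature range
n_cells = 1e12         # bacterial cells in roughly 1000 liters of pond water

expected_hits = p_functional * n_cells
p_any = prob_at_least_one(p_functional, n_cells)

if __name__ == "__main__":
    print(f"expected number of functional variants: {expected_hits:.0f}")
    print(f"probability of at least one: {p_any}")
```

With about a hundred expected successes, the chance of seeing none at all is on the order of e^-100, which is why the probability of at least one comes out indistinguishable from 1.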

The uncertainties in estimating the densities of functional sequences are very high. Obviously, we all would like to home in on a narrower range. This is complicated by the technical and theoretical shortcomings of the various approaches. The “reverse” approach is tied to a single family of sequences and functions and makes assumptions that may not be warranted (Section 2 here is an example). The “forward” approach may find too many things, some (many?) of which may have no biological relevance. Sorting these things out is a tough nut to crack experimentally.


Summary

To summarize, the claims that have been and will be made by ID proponents regarding protein evolution are not supported by Axe’s work. As I show, it is not appropriate to use the numbers Axe obtains to make inferences about the evolution of proteins and enzymes. Thus, this study does not support the conclusion that functional sequences are extremely isolated in sequence space, or that the evolution of new protein function is an impossibility that is beyond the capacity of random mutation and natural selection.


Endnote

1. The hydropathic signature is technical-ese for a particular pattern of polar and apolar amino acid residues in a structure or sequence. In the case of this study, the signature was the guide in determining the extent of variation that was introduced in the ten-amino-acid clusters. Also, an aside to help readers with terminology: when we speak of sequence space, we are talking about nothing more than some collection of possible sequences. Thus, for example, the total “sequence space” of polypeptides of 153 amino acids in length is the number of possible polypeptides: 20^153, or about 10^199.
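
The size of that sequence space can be verified in a couple of lines of Python; this is just a check of the arithmetic, nothing more.

```python
import math

ALPHABET_SIZE = 20   # standard amino acids
LENGTH = 153         # residues in the large TEM-1 domain

# log10 of 20^153, plus the exact integer for a digit count.
log10_size = LENGTH * math.log10(ALPHABET_SIZE)
exact_size = ALPHABET_SIZE ** LENGTH

if __name__ == "__main__":
    print(f"20^153 is about 10^{log10_size:.1f}")
    print(f"digits in the exact value: {len(str(exact_size))}")
```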

Acknowledgements:

Thanks to the efforts of the PT crew, and particularly Ian Musgrave, who helped me keep this on topic. Also, many thanks are due to Douglas Axe, who graciously helped me with early drafts of this essay. Please note that all of these ideas are mine, and I make no claim that any of these thoughts represent Axe’s views.

References:

Axe DD, J Mol Biol 341, 1295-1315, 2004
Meyer SC, Proc Biol Soc Washington 117, 219-239, 2004
Knox JR, Moews PC, Frere JM. Chem Biol. 3, 937-47, 1996
Adediran SA, Zhang Z, Nukaga M, Palzkill T, Pratt RF, Biochemistry 44, 7543-52, 2005
Helfand MS, Bonomo RA., Curr Drug Targets Infect Disord 3, 9-23, 2003. (A review of different types of known beta-lactamases.)
Daiyasu H et al., FEBS Lett 503, 1-6, 2001. (A short review that describes the “connectedness” of metallo-beta-lactamases with other activities.)
Reidhaar-Olson and Sauer, Proteins: Struct Funct Genet 7, 306-316, 1990
Taylor et al., Proc Natl Acad Sci USA 98, 10596-10601, 2001
Cho et al., J Mol Biol 297, 309-319, 2000 (this describes one of the first successes in using mRNA display; Pubmed searches for “mRNA display” will yield many other papers)

[Update on 12-27-2008:  Added a link that was missing.  Sorry about that.]

[Update on 10-20-2009:  corrected a typo that was noted by Uncommon Descent commenter Nakashima]