May 2009

One of the most common arguments used when people debate evolution (in my opinion it’s not much of a debate), is that there is simply too much information contained in the genome to “randomly” show up. A common misconception is that one generation may succeed because of one singular mutation. Somewhat logically, the next step is to assume that each of the 4 billion genes resulted from billions of generations, one at a time. The earth has only been around for 5 billion years, life much less, so on a first pass, the creationists might seem to have a point. The key, though, is realizing that mutations happen in parallel and compound really, really fast.

During gene duplication, each gene has a certain probability of mutating. In reality, there are endless variables that determine this probability, but DNA is assembled one base pair (0.5 second Bio review, a base pair is one of AT, CG, GC, and TA) at a time, so one mutation upstrand shouldn’t affect base pairs downstrand. Turns out that this greatly reduces generations required to get to a “fit” goal gene.

All in all it’s pretty simple, and per suggestion of a book my brother and father recently read (I can’t think of the title). You can simulate evolution in quite a terse chunk of code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import string, random
 
goal = "To be or not to be, that is the question"
gen_size = 100
gen_count = 100000000
 
bases = ''.join(set(list(goal)))
# Our Adam/Eve
seed = {'gene': ''.join([random.choice(bases) for i in xrange(len(goal))]),
        'fitness': 0}
print 'Start: ' + seed['gene'] + "(" + str(seed['fitness']) + ")"
# For each generation
for i in xrange(gen_count):
    children = []
    # Make each baby
    for j in xrange(gen_size):
        child = {'gene': '', 'fitness': 0}
        # Copy each letter
        for k in xrange(len(seed['gene'])):
            nextbase = seed['gene'][k]
            # Mutation!
            if random.random() > 0.999:
                nextbase = random.choice(bases)
            # Add base to gene and calculate fitness boost
            child['gene'] += nextbase
            if nextbase == goal[k]:
                child['fitness'] += 1
        # Append each child to the generation
        children.append(child)
    # Choose a winner
    newseed = max(children,key=(lambda x : x['fitness']))
    if newseed != seed:
            print '#' + str(i) + ' ' + newseed['gene'] + \
                "(" + str(newseed['fitness']) + "/" + str(len(goal)) + ")"
    seed = newseed
    # Finish if we're done
    if seed['gene'] == goal:
        break

What this does, is starts out with a random string of letters, and then for each generation, makes 100 copies of the previous winning species, with a finite probability of mutation for each base pair (for simplicity I only consider point mutations). At the end of each generation, it evaluates the fitness of each child by comparing it to the goal string and selects the “most fit” to father the next generation. An important point in making this realistic is that no base pair is immune from mutation. That means, that if all of the children randomly progressive backwards, then the species as a whole progresses backwards.

So running this script, we find the output to be:

Start: iTmZ bReNyaQDtJWPbBkZSQmJPYNjCerARSqEPNE(0)
#0 iTmZ bReNyaQDtJWPbBkZSQmJPYNjCerARSqEPNE(2/40)
#2 iTmZ nReNyaQDtJWPbBkZSQmJPYNjCerARSqEPNE(2/40)
#23 iTmZ nReNyaQDtJWPbBkZSQmJPYNjCerARSsEPNE(3/40)
#29 iTmZ nReNyaQDtJWPbBktSQmJPYNjCerARSsEPNE(4/40)
#40 iTmZ nReNyaQ tJWPbBktSQmJPYNjCerARSsEPNE(5/40)
#53 iTmZ nReNyaQ tJWPbBktSQtJPYNjCerARSsEPNE(6/40)
#62 iTmZ nReNyaQ tJWPbBktSQtJPYNtCerARSsEPNE(7/40)
#70 iTmZ nReNyQQ tJWPbBktSQtJPYNtCerARSsEPNE(7/40)
#71 iTmZ nReNyQQ tJWPbB tSQtJPYNtCerARSsEPNE(8/40)
#78 iTmZ nReNyQQ tJWPbB tSQtJPYNtherARSsEPNE(9/40)
#87 iTmZ nReNyQQ tJWPbB tSQtJPYNtherARSstPNE(10/40)
#99 iTmZ nReNyQQ tJWPbB tSQtJPYNtherARSstPoE(11/40)
#107 iTmZ nReNyQQ tJWPb, tSQtJPYNtherARSstPoE(12/40)
#111 iTmZ nReNyQQ toWPb, tSQtJPYNtherARSstPoE(13/40)
#140 iomZ nReNyQQ toWPb, tSQtJPYNtherARSstPoE(14/40)
#161 iomZ nReNiQQ toWPb, tSQtJPYNtherARSstPoE(14/40)
#164 iomZ nReNiQQ toWPe, tSQtJPYNtherARSstPoE(15/40)
#185 domZ nReNiQQ toWPe, tSQtJPYNtherARSstPoE(15/40)
#205 domZ nReNiQQ toWPe, tSJtJPYNtherARSstPoE(15/40)
#217 domZ nReNiQQ toWPe, tSatJPYNtherARSstPoE(16/40)
#234 domZ noeNiQQ toWPe, tSatJPYNtherARSstPoE(17/40)
#249 domZ  oeNiQQ toWPe, tSatJPYNtherARSstPoE(18/40)
#278 domZ  oeNiQQ toWPe, tSatJPYNtherAuSstPoE(19/40)
#285 domZ  oeNiQQ toWPe, tSatJkYNtherAuSstPoE(19/40)
#297 domZ  oeNiQQ toWPe, thatJkYNtherAuSstPoE(20/40)
#314 domZe oeNiQQ toWPe, thatJkYNtherAuSstPoE(21/40)
#315 domZe oeNYQQ toWPe, thatJkYNtherAuSstPoE(21/40)
#365 domZe oeNYQQ toWPe, thatJiYNtherAuSstPoE(22/40)
#372 TomZe oeNYQQ toWPe, thatJiYNtherAuSstPoE(23/40)
#386 TomZe oeNYQj toWPe, thatJiYNtherAuSstPoE(23/40)
#424 TomZe oeNYQc toWPe, thatJiYNtherAuSstPoE(23/40)
#431 TomZe oe YQc toWPe, thatJiYNtherAuSstPoE(24/40)
#432 TomZe oe YQc toWPe, thatJiYNtherAuSstPoY(24/40)
#461 TomZe oe YQc toWPe, thatJiYNthebAuSstPoY(24/40)
#462 TomZe oe YQc toWPe, thatJiYNthebquSstPoY(25/40)
#481 TomZe oe YQc totPe, thatJiYNthebquSstPoY(25/40)
#486 TomZe oe YQc totPe, thatJiYNthebquestPoY(26/40)
#613 TomZe oe Yoc totPe, thatJiYNthebquestPoY(27/40)
#647 TomZe or Yoc totPe, thatJiYNthebquestPoY(28/40)
#695 TomZe or Yoc totPe, thatJiYNthebquestPon(29/40)
#726 Tombe or Yoc totPe, thatJiYNthebquestPon(30/40)
#727 Tombe or Yoc totbe, thatJiYNthebquestPon(31/40)
#729 Tombe or Yoc totbe, thatJiYLthebquestPon(31/40)
#830 Tombe or noc totbe, thatJiYLthebquestPon(32/40)
#855 Tombe or not totbe, thatJiYLthebquestPon(33/40)
#879 Tombe or not totbe, thatJiYLthebquestion(34/40)
#880 Tombe or not totbe, thatJiYLthe question(35/40)
#899 Tombe or not tojbe, thatJiYLthe question(35/40)
#978 Tombe or not to be, thatJiYLthe question(36/40)
#1108 Tombe or not to be, thatJipLthe question(36/40)
#1276 To be or not to be, thatJipLthe question(37/40)
#2171 To be or not to be, thatJisLthe question(38/40)
#2383 To be or not to be, thatMisLthe question(38/40)
#2543 To be or not to be, thatMisethe question(38/40)
#2714 To be or not to be, thatMis the question(39/40)
#3096 To be or not to be, thatfis the question(39/40)
#3577 To be or not to be, that is the question(40/40)

So, as you can see, we started with the random string “iTmZ bReNyaQDtJWPbBkZSQmJPYNjCerARSqEPNE”, and after 3577 generations of 100 children with a 0.1% chance of mutation when copying each letter/base pair, we arrived at exactly the desired result. Doing some quick math, 3577*100 = 357,700 total copies of the DNA were explored (or alternatively, children born). For a completely random sequence to find our desired phrase “To be or not to be, that is the question”, it must exhaustively search 28^40 (~=7.7*10^57) solutions. That’s is a reduction of 53 orders of magnitude! This is the beauty of evolution.

You can play around with the script if you want, changing the mutation probability and generation size. For longer “genes”, a lower mutation probability is better, while a higher probability is better for shorter genes. You could even evolve these parameters alongside the genes themselves!

There are a lot of fun things you can “evolve” with this script. Here’s a few things I evolved from random garbage data:

  1. The entire ‘To Be or Not To Be’ Soliloquy by Shakespeare. 1411 characters long, took 13,033 generations. Again thats, 1,303,300 children versus the random brute force space of 32^1411 (~10^2124), a reduction of 2118 orders of magnitude.
  2. The entire Swine Flu Genome. 1381 bases, took 1125 generations. This is much faster than Shakespeare because there are only 4 bases.
  3. Valid arithmetic computations from a random selection of numbers and “*-/+”, using Python to interpret the arithmetic. One example it came up with was “613/6-1++0075*4+67-2-3*+77+6-59376/90641++1172083714/958+341*64626/4*483/+2+-583*642395*2948712+3110+989*0+77326+394*781991900*295*553318*1978*30+30-51-900553+4986368-0263-7/9354+96633652950757+9-220645036950122*+85+6508012+87/-06-40+212+49/830*22*05″, which is equal to 2984302412387195345265239. Another fun example it generated was “77805*+–7″ which is equal to 544635. I didn’t even know that was valid python!

For anything beyond string comparisons, by far, the hardest part of “evolving” data is creating good tests for fitness. My third example was actually heavily reduced from a first attempt to “evolve” a valid C program. The problem was, I couldn’t think of an easy way to rank compilation errors in terms of fitness. For my third example, since effectively the only errors were syntax errors and python tells you at which character it dies, I used the depth by which python parsed before dying as my fitness test. So in general, if you can figure out how to phrase a problem in terms of a fitness test, evolution could potentially save you a lot of work.

I remember from somewhere, that someone used evolution to design a radio antenna, and it turned out the antenna was better than anything a human could have conceived. Now I just need to “evolve” my homework…