Wednesday, December 26, 2007

Where Does All the Computer Power Go?

The human genome has some 3.1647 billion base pairs. While human genomes are 99.9 percent identical from one person to another, that still leaves some three million places where nucleotides differ from person to person. There are some 30,000 genes. However, non-gene portions of the DNA affect the way the genes are expressed in the cells. (Moreover, during the life of the cell, chemical changes accumulate which also affect the way the genes are expressed.) Craig Venter says that some 45 percent of his genes are heterozygous, that is, they carry two different alleles of the same gene. Only a few individual genomes have been sequenced completely, but there are plans to sequence the DNA of 10,000 people in the next decade. There is a prize on offer for the first person (or group) to develop technology that can sequence an individual's DNA for $1,000. The era of individualized medicine, in which treatment will be matched to an individual's genes, will probably come in the next decade or two. By then there will be many times as many genomes available for study.
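
To give a feel for the numbers above, here is a rough back-of-the-envelope sketch in Python. The genome size, the 99.9 percent figure, and the 10,000-person target come from the paragraph above; the 2-bits-per-base packing (enough to tell A, C, G, and T apart) is my own assumption for illustration, counts only one copy of the sequence per person, and is not how any particular sequencing project stores its data.

```python
# Figures quoted in the post.
GENOME_BP = 3_164_700_000      # base pairs in the human genome
IDENTITY = 0.999               # fraction identical between two people
PLANNED_GENOMES = 10_000       # individual genomes planned for the next decade

# Assumption for illustration only: each base (A, C, G, T) packed into 2 bits.
BITS_PER_BASE = 2


def differing_sites(genome_bp, identity):
    """Approximate number of positions where two genomes differ."""
    return genome_bp * (1 - identity)


def packed_size_bytes(genome_bp, bits_per_base=BITS_PER_BASE):
    """Bytes needed to store one copy of the sequence at the assumed packing."""
    return genome_bp * bits_per_base / 8


if __name__ == "__main__":
    print(f"Differing sites: about {differing_sites(GENOME_BP, IDENTITY):,.0f}")
    one_genome = packed_size_bytes(GENOME_BP)
    print(f"One packed genome: about {one_genome / 1e6:,.0f} MB")
    print(f"{PLANNED_GENOMES:,} packed genomes: about "
          f"{one_genome * PLANNED_GENOMES / 1e12:.1f} TB")
```

Even under this very compact encoding, the raw sequence alone for 10,000 people runs to several terabytes, before any quality data, annotation, or analysis results are stored alongside it.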

There are maybe three pounds of microorganisms in the average human being, and their behavior affects their human hosts (and vice versa). There is now a program to sequence the genomes of these organisms, to assemble the larger combined genome of the human and its associated microorganisms. That combined genome will be a couple of orders of magnitude larger than the genome of the individual human. One assumes that it will also be more diverse.

There are at least 1000 diseases now classified in the International Classification of Diseases. We can think of diseases as the result of the functioning of the genes, or of the interaction of organisms with different genomes. Thus an infectious disease is the result of the response of the human, and his associated organisms, to the infecting agent, each determined by its genome and its history.

And of course we are interested not only in the interrelationship of genes and disease, but also in development and in all of the traits of interest to people.

The Celera human genome project established a benchmark in the use of computer power.
Upon the establishment of the genome project at Celera in 1998, the company purchased and connected 700 CPUs and 70 terabytes of hard drive space. This computing system was established to run the initial test of their algorithm code, which was used to sequence the genome of the Drosophila fruit fly with a 13-fold coverage of the genome successfully in 1999. The most surprising thing about this approach was that it succeeded in coding the algorithm and sequencing the 120-megabase-pair genome of the fruit fly to that extent of completeness in just 11 months. Myers (Gene Myers, a professor of Computer Science at Berkeley) then modified the process so that the Whole Genome Shotgun Sequencing process would make a 5-fold coverage of the human genome, as he believed it would be adequate to provide a complete sequence of the human genome. In addition, Venter purchased 4 supercomputers referred to as the GeneMatcher from a company called Paracel Inc. Paracel, a company that typically produces computers for government agencies such as the NSA, created this machine specifically for matching character strings, such as putting together sequences of DNA like a puzzle. It was composed of 7000 processors arranged to perform over 1000 times faster than any Pentium computer. With this new technology, on September 8, 1999, Celera began its sequencing of the human genome using this approach, and completed the first assembly of the whole human genome on June 17, 2000, only 9 months after the project began.
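
To make the coverage figures and the "puzzle" description a little more concrete, here is a small Python sketch. It is not Celera's assembler or the GeneMatcher's method. The first part uses the standard idealized model in which reads land uniformly at random, so the chance that a given base is missed at coverage c is about e^(-c). The second part is a toy suffix-to-prefix overlap check of the kind that shotgun assembly depends on. All function names and the example reads are made up for illustration.

```python
import math


def total_bases(genome_size_bp, coverage):
    """Raw bases that must be read to reach the given fold coverage."""
    return genome_size_bp * coverage


def expected_fraction_covered(coverage):
    """Idealized model (reads placed uniformly at random):
    the chance that a base is hit by no read is exp(-coverage)."""
    return 1.0 - math.exp(-coverage)


def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest_possible = min(len(left), len(right))
    for k in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    k = overlap_length(left, right, min_overlap)
    return left + right[k:] if k else None


if __name__ == "__main__":
    # Fruit fly: 120 megabase pairs at 13-fold coverage (figures from the text).
    print(f"Fly bases read: {total_bases(120_000_000, 13):,}")
    # Human: 5-fold coverage, as described in the text.
    print(f"Fraction covered at 5x: {expected_fraction_covered(5):.4f}")
    # Two hypothetical reads that share the three letters "TTG".
    print(merge_reads("ACGTTG", "TTGCA"))
```

Under that idealized model, 5-fold coverage already reaches more than 99 percent of the bases, which helps explain why it could be judged adequate. The hard part is the overlap matching: a real assembly has to compare tens of millions of reads against each other while coping with sequencing errors and repeated regions, which is where machines built for string matching earn their keep.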


Understanding the relationship of the genome to disease will involve statistical analysis of the health histories and individual genomes of tens or hundreds of thousands of people. It is expected that few if any conditions will be explained by a single gene. Even eye color is a complex phenomenon under the control of several genes. So think about the computer power that will be used in the coming generation to clarify the genetic basis of disease and behavior.
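
One way to see why multi-gene conditions are so computationally demanding is simply to count the combinations. The sketch below uses the rough figure of 30,000 genes quoted earlier in this post; treating every pair or triple of genes as a candidate to test is my own simplification, not a description of how any real study is designed.

```python
import math

GENE_COUNT = 30_000  # rough gene count quoted earlier in the post


def gene_combinations(gene_count, genes_per_combination):
    """Number of distinct unordered sets of genes of the given size."""
    return math.comb(gene_count, genes_per_combination)


if __name__ == "__main__":
    for size in (1, 2, 3):
        count = gene_combinations(GENE_COUNT, size)
        print(f"Combinations of {size} gene(s): {count:,}")
```

Going from single genes to pairs takes the count from 30,000 to roughly 450 million, and triples push it into the trillions. Each candidate would then have to be checked against the records of many thousands of people.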
