Health News

13 Feb 2009 03:00 AM

Unique New Tool For Analyzing And Comparing Data - Created By Berkeley Lab Scientists
What does uncovering the true authorship of plays attributed to Shakespeare have to do with identifying our genetic ancestors or classifying new life forms? All involve the comparative analysis of long sets of data and all will benefit from a unique new analytical tool developed by researchers at Berkeley Lab.

Sung-Hou Kim, a chemist who holds a joint appointment with Berkeley Lab's Physical Biosciences Division and UC Berkeley's Chemistry Department, led the development of a technique called "feature frequency profiles" (FFP) that makes it possible to compare, classify, index and catalog just about any type of linear information that can be stored electronically. The kinds of information that can be analyzed with the FFP technique include nucleotide base and amino acid sequences, books, documents and possibly images. It could even prove to be the ultimate music organizer.

"I call our technique a tool for demographic phylogeny because it enables us to organize large sets of data into groups and find relationships among these groups," says Kim. "The idea is to organize data sets into groups based on the frequency at which key features occur and then look for relationships. This is the reverse of what is usually done, where you find relationships in the data set then organize the data set into groups based on those relationships."

Using the FFP technique, Kim and his colleagues can create "family trees" that put into easy-to-see perspective the relationships between groups within a data set, whether those groups are books or genomes. The key is to identify the "optimal features" for profiling. For books, the optimal feature consisted of sequences of text about eight letters in length. For mammalian genomes, the optimal feature consisted of sequences of nucleotide bases about 18 base pairs in length. However, to keep their genomic computations manageable, Kim and his colleagues reduced the four-letter DNA alphabet (adenine, guanine, thymine and cytosine) to a two-letter alphabet, using R for the purine bases (adenine and guanine) and Y for the pyrimidine bases (thymine and cytosine). In a series of tests run on books and genomes, the FFP technique provided a more comprehensive and in some cases more accurate analysis than the standard analytical tools.
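To make the idea concrete, the profiling step described above can be sketched in a few lines of Python. This is only an illustration, not the Berkeley Lab implementation: the function names to_ry and feature_frequency_profile are hypothetical, and the sliding one-position window used to collect features is an assumption the article does not spell out.

```python
from collections import Counter

# Purines (A, G) become R; pyrimidines (C, T) become Y.
RY_MAP = str.maketrans("AGCT", "RRYY")


def to_ry(dna: str) -> str:
    """Reduce a four-letter DNA string to the two-letter R/Y alphabet."""
    return dna.upper().translate(RY_MAP)


def feature_frequency_profile(sequence: str, length: int) -> dict[str, float]:
    """Return the relative frequency of every feature of the given length.

    Assumption: features are read with a sliding window that advances
    one position at a time, so windows overlap.
    """
    windows = len(sequence) - length + 1
    if windows <= 0:
        return {}
    counts = Counter(sequence[i:i + length] for i in range(windows))
    return {feature: count / windows for feature, count in counts.items()}


# Toy input; the article cites features of about 18 bases for mammalian
# genomes, far longer than this short example allows.
reduced = to_ry("AGCTTAGC")
profile = feature_frequency_profile(reduced, 3)
```

Here the eight-base toy sequence is reduced to R and Y and profiled with three-letter features. For real genomes the feature length would be chosen as the article describes, around 18 bases after the alphabet reduction.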

For example, Kim and his colleagues used the FFP technique to create a book tree composed of more than two dozen selected works under the categories of philosophy, mythology, religion, 19th Century fiction, science fiction and children's fiction. Their FFP-based book tree correctly grouped all books by category and author including some, such as the Koran, that were misplaced in a book tree based on a standard word frequency profile analysis. In the case of the Koran, the FFP-based tree placed it in the religion category on the same branch as the King James Bible and the Book of Mormon, whereas the word frequency book tree grouped it in the philosophy category, on the same branch as Plato's The Republic and Socrates' The Apology.
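Building a tree of this kind requires a way to score how different two profiles are. The article does not say which measure the team used, so the sketch below assumes the Jensen-Shannon divergence, a common choice for comparing frequency distributions. The names text_profile and js_divergence are hypothetical, and stripping everything except letters before profiling is likewise an assumption.

```python
import math
from collections import Counter


def text_profile(text: str, length: int = 8) -> dict[str, float]:
    """Relative frequencies of letter sequences of the given length.

    Assumption: case, spaces and punctuation are discarded first.
    """
    letters = "".join(ch for ch in text.lower() if ch.isalpha())
    windows = len(letters) - length + 1
    if windows <= 0:
        return {}
    counts = Counter(letters[i:i + length] for i in range(windows))
    return {feature: count / windows for feature, count in counts.items()}


def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical profiles,
    1 for profiles that share no features."""
    total = 0.0
    for feature in set(p) | set(q):
        pf = p.get(feature, 0.0)
        qf = q.get(feature, 0.0)
        mid = (pf + qf) / 2
        if pf > 0:
            total += 0.5 * pf * math.log2(pf / mid)
        if qf > 0:
            total += 0.5 * qf * math.log2(qf / mid)
    return total


# Toy strings stand in for whole books.
same = js_divergence(text_profile("aaaaaaaaaa"), text_profile("aaaaaaaaaa"))
disjoint = js_divergence(text_profile("aaaaaaaaaa"), text_profile("bbbbbbbbbb"))
```

Computing this score for every pair of books gives a distance matrix, which a standard tree-building method could then turn into a book tree like the one described above.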

Kim and his colleagues later applied the FFP technique to a comparative analysis of the works of William Shakespeare, contemporaries such as Christopher Marlowe, plus several works from the Jacobean era that were once attributed to Shakespeare but whose authorships are now in question. The results cast new doubt on Shakespeare having been the author of the play Pericles, Prince of Tyre, and point to his authorship of the comedy Two Noble Kinsmen, for which in the past he has only received partial credit.

"I was stunned when I saw how well the technique worked with books," Kim says.

The next step was the successful application of the technique to the whole genomes of mammals whose phylogenetic tree is well established, then on to whole genomes of prokaryotic organisms (bacteria and Archaea) and finally on to viruses, for which current comparative genomic analytic tools sometimes cannot be applied…