Modern Drug Discovery
January/February 2000
Modern Drug Discovery, 2000, 3(1) 28-29, 31, 32, 34.
© 2000 American Chemical Society.


Bioinformatics battles breast cancer
Large-scale screening tools are targeting gene mutations.

BY MONA MORT

In the not-too-distant future, the entire sequence of the human genome, 3 billion base pairs, will be in the hands of researchers eager to find disease-related genes with an eye toward possible cures. An estimated 100,000 genes account for only 2% of the human genome; the function of the remaining 98% is unknown. Mining the human genome to identify genetic mutations that cause complex diseases such as breast cancer is like looking for needles in a haystack. And after finding the needles, or coding regions, researchers must then work to find disease-related sequences within them.

To aid the search for “needles,” scientists have applied information technology and software to biological research, giving rise to the new field of bioinformatics. With bioinformatics, it is now possible to search the “haystack” of 3 billion base pairs for genetic defects.

Years of research show that genetic mutations, whether caused by exposure to environmental mutagens or inherited as defective gene copies, are inherent to the onset of cancer. Cancer’s characteristic uncontrolled cell growth usually involves some combination of an impaired DNA repair pathway, the transformation of a normal gene to an oncogene, and a malfunctioning tumor suppressor gene. About 5% of all breast cancer cases are hereditary. Women carrying defective copies of the BRCA1 or BRCA2 gene are at an increased risk, perhaps as high as 85%, of developing the disease (see box, BRCA1 and BRCA2: Genetic links to breast cancer).

But something else is going on, says Andrew Futreal, assistant professor of surgery and genetics at the Duke University Medical Center in Durham, NC, and co-discoverer of BRCA1 and BRCA2. Futreal points out that many women from high-risk families develop breast cancer but do not carry BRCA1 or BRCA2 mutations. And what about the other 95% of breast cancer cases that are not hereditary?

Determining the cause of breast cancer in cases that are not hereditary, and searching for other genetic defects that lead to breast cancer, require large-scale screening of tumor tissue from all types of breast cancer cases, not just those in high-risk families, says geneticist Gertraud Robinson of the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) in Bethesda, MD. NIDDK sponsors the Mammary Genome Anatomy Project, which aims to understand the normal and cancerous development of breast tissue.

Producing large quantities of information on cancer genes and their expression creates a challenge of its own: cataloging the information and making it available. In August 1997, Vice President Al Gore and the National Cancer Institute (NCI) in Bethesda, MD, launched the Tumor Gene Index, an initiative to compile, on the Internet, the first comprehensive index of genes involved in human cancer. The Tumor Gene Index is part of the NCI’s larger Cancer Genome Anatomy Project (CGAP), which develops publicly available databases and technologies to support the search for cancer-related genes. In contrast to the Mammary Genome Anatomy Project, the Tumor Gene Index and CGAP deal with all types of tumor tissue, a much broader goal.

It seems to be working. According to Susan Greenhut, technology coordinator in the NCI’s Office of Science Policy, “Of the [known] 73,000 genes in the public domain, 30,000 have been found by CGAP.” About 40,500 of the 66,400 new and previously defined genes cataloged by CGAP in its first two years are expressed in one or more cancers. Of the 5500 genes identified in breast tissue, 5327 are active in cancer, and more than 200 appear to be unique to breast tissue. Greenhut admits that, right now, nobody knows what all these genes do. “We just put the information on the Web,” in hopes of stimulating research, she says.

What is CGAP?
The none-too-modest goal of CGAP is to “identify all the genes responsible for the establishment and growth of cancer.” Achieving this goal means building libraries of cDNA, complementary DNA produced by reverse transcriptase from messenger RNA transcripts of normal, precancerous, and malignant cell DNA. Finding genes linked to the onset or prevention of cancer requires solving the gene expression pattern of cells (Figure 1). Researchers compare the production of gene products in normal and malignant cells to determine which genes may be overexpressed in cancer cells. The isolated mRNA from those gene products is then used to create cDNA.

Scientists use computer algorithms and bioinformatics technology to find consensus sequences—common features of promoter sequences of genes—that match known cancer genes or are similar to genes that play a key role in regulating cell growth. UniGene, an experimental computing program sponsored by the NIH’s National Center for Biotechnology Information (NCBI) in Bethesda, MD, automatically organizes the NIH’s GenBank sequences into gene families, or clusters, on the basis of their sequence similarity. GenBank is a database containing an annotated collection of all publicly available DNA sequences. The goal, according to NCBI’s Alex Lash, is “to have each UniGene cluster represent a unique gene,” and to incorporate related information such as the tissue type in which the gene is expressed and its location on the human chromosome map.

Sequence clustering is essential to making sense of the thousands of sequences available from the cDNA libraries that are now under study by CGAP. Lash oversees an elaborate network of information processing and bioinformatics tools which, at first glance, seem Rube Goldbergian in complexity and prone to mishap. But they work, and they get high marks from the research community. Duke’s Futreal sees CGAP as integral to determining which genes are up- or downregulated when cells become cancerous and to developing bioinformatics tools vital to complex pattern recognition when searching for cancer-related changes in gene expression.

NIDDK’s Robinson says that her research group is capitalizing on CGAP resources by co-developing, with the NCI, bioinformatics tools such as DNA microarray readers. Microarrays comprise thousands of DNA sequences on a small glass, nylon, or silicon chip surface of just a few square centimeters, to which fluorescent probes bind and give expression profiles (for an explanation of microarrays, see Modern Drug Discovery, September/October 1999, p. 30). Because microarrays allow rapid screening of several thousand genes in attempts to find those implicated in cancer, Robinson views them as essential to understanding breast cancer genetics.

Bioinformatics supreme
Automated DNA sequencers spew out sequences. Then, in the first step of the gene-sorting process, CGAP receives the genetic information as expressed sequence tags (ESTs), sequences that uniquely identify a certain gene, which CGAP puts into the EST database (dbEST). The ESTs are linked to cDNA libraries of cells that have been generated from normal, pre-cancerous, or cancerous tissues isolated by laser capture microdissection (Figure 2). ESTs unique to a specific tissue are like books in a library. You can search all the libraries cataloged in the dbEST “by subject heading, as if you were using a card catalog” in a physical library instead of a virtual one, says Lash. In the next step, the dbEST sequences are submitted to UniGene, which is the real key to finding cancer genes.

UniGene is a clustering algorithm that tries to find the best way to classify the EST, searching for similarity to known protein-coding regions and then naming the EST on the basis of the clustering results, says Lash. CGAP takes the information from UniGene and submits it to LocusLink, which catalogs the new sequence information and acts as “a clearinghouse for information on genes,” Lash adds.
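
To make the idea of clustering by sequence similarity concrete, here is a minimal Python sketch. It is not UniGene's actual method, which the article does not describe in detail; it assumes a simple stand-in technique (shared 8-base substrings, single-linkage grouping), and the sequences, identifiers, and threshold are invented for illustration.

```python
from itertools import combinations


def kmers(seq, k=8):
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=8):
    """Jaccard similarity of the two sequences' k-mer sets (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def cluster_ests(ests, threshold=0.25, k=8):
    """Single-linkage clustering: join ESTs whose similarity >= threshold.

    `ests` maps a (hypothetical) EST identifier to its sequence.
    Returns a sorted list of sorted identifier lists, one per cluster.
    """
    parent = {name: name for name in ests}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ests, 2):
        if similarity(ests[a], ests[b], k) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in ests:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Invented toy data: two overlapping fragments from each of two made-up genes.
gene_a = "ATGGCGTACGTTAGCCTAGGCTTAACGGATCCGTA"
gene_b = "TTGACCGGAATCTCGAGTTTCCAAGGTGCACTGAA"
example_ests = {
    "est_1": gene_a[:25], "est_2": gene_a[10:],
    "est_3": gene_b[:25], "est_4": gene_b[10:],
}
print(cluster_ests(example_ests))  # [['est_1', 'est_2'], ['est_3', 'est_4']]
```

Overlapping fragments of the same transcript end up in one cluster, which is the sense in which each cluster is meant to stand for a single gene. A production system would use proper sequence alignment rather than exact substring matches.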

Because independent research groups often give the same sequence different names, the catalog provides a way for researchers to determine whether a given sequence has already been found, and if so, what is known about it.
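
The lookup described here can be pictured with a small Python sketch. This is only an illustration of the clearinghouse idea, not LocusLink's design: the class, its method names, and the gene names are hypothetical, and it matches sequences exactly where a real catalog would have to match by similarity.

```python
from collections import defaultdict


class GeneCatalog:
    """Toy clearinghouse: one record per sequence, however many names it has."""

    def __init__(self):
        self._names_by_sequence = defaultdict(set)

    @staticmethod
    def _normalize(sequence):
        # Ignore case and whitespace so trivially different submissions match.
        return "".join(sequence.split()).upper()

    def register(self, name, sequence):
        """Record a name for a sequence; return True if the sequence is new."""
        key = self._normalize(sequence)
        is_new = key not in self._names_by_sequence
        self._names_by_sequence[key].add(name)
        return is_new

    def lookup(self, sequence):
        """Return the sorted names already on file for this sequence."""
        key = self._normalize(sequence)
        return sorted(self._names_by_sequence.get(key, ()))


catalog = GeneCatalog()
catalog.register("labA-7", "atgg cgta")   # first group's name (made up)
catalog.register("onc-22", "ATGGCGTA")    # second group, same sequence
print(catalog.lookup("ATGGCGTA"))         # ['labA-7', 'onc-22']
```

A researcher holding a newly isolated sequence can call `lookup` first and learn that two groups have already reported it under different names.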

But what about sequences that, even after submission to UniGene, are not related to any known gene families? The catalog labels those sequences, according to Lash, simply “EST.” It would be like going to a library and finding shelves filled with books that look different but all have the word “book” printed on their spines. These sequences are intriguing because they may turn out to be important. So, says Lash, even if they are black boxes at the moment, including all ESTs in the libraries is a resource meant to assist people sequencing new genes.

Two additional informatics tools for finding cancer-related genes are xProfiler (designed by NCBI for CGAP) and Serial Analysis of Gene Expression (SAGE, owned by Genzyme Molecular Oncology). xProfiler, which is short for expression profiler, compares the cDNA libraries from two different tissue types and searches for genes expressed in one tissue type but not the other. SAGE, on the other hand, is an enhanced version of EST production. Researchers isolate the desired mRNA and synthesize the cDNA. The program tags all the cDNA molecules of the library, and the tags identify each transcript. Using these tags, SAGE estimates the number of times an expressed sequence is found in a cell or tissue.
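
Both ideas, comparing two libraries and counting tags, reduce to simple bookkeeping once each library is a list of tags. The Python sketch below assumes exactly that simplified representation; it does not reflect how xProfiler or SAGE is actually implemented, and the function names, tag labels, and pseudocount are illustrative assumptions.

```python
from collections import Counter


def count_tags(tags):
    """SAGE-style tally: how many times each tag was observed in a library."""
    return Counter(tags)


def expressed_only_in(library_a, library_b, min_count=1):
    """xProfiler-style comparison: tags seen at least `min_count` times in
    library_a and never in library_b, returned in sorted order."""
    counts_a, counts_b = count_tags(library_a), count_tags(library_b)
    return sorted(tag for tag, n in counts_a.items()
                  if n >= min_count and counts_b[tag] == 0)


def fold_change(library_a, library_b, tag, pseudocount=1):
    """Ratio of a tag's share of library_a to its share of library_b.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a tag is absent from one library.
    """
    counts_a, counts_b = count_tags(library_a), count_tags(library_b)
    share_a = (counts_a[tag] + pseudocount) / (len(library_a) + pseudocount)
    share_b = (counts_b[tag] + pseudocount) / (len(library_b) + pseudocount)
    return share_a / share_b


# Invented tag lists standing in for a tumor library and a normal library.
tumor = ["T1", "T1", "T1", "T2", "T3"]
normal = ["T2", "T2", "T4"]

print(expressed_only_in(tumor, normal))               # ['T1', 'T3']
print(expressed_only_in(tumor, normal, min_count=2))  # ['T1']
print(fold_change(tumor, normal, "T1") > 1)           # True
```

Raising `min_count` filters out tags seen only once, which in real data are as likely to be sequencing noise as genuine tissue-specific transcripts.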

On the eve of its second birthday, CGAP’s arsenal of bioinformatics techniques has already identified 30,000 genes from about 600,000 sequences isolated from its 142 libraries. CGAP catalogs libraries from other researchers, so it has a total of 15 breast tissue libraries, 5 from normal tissue and 10 from cancer tissue (see Table 1). In these libraries, the CGAP algorithm found one gene that is unique to breast cancer tissues. Although this gene is not yet definitively associated with breast cancer, its ubiquitous presence in the CGAP tumor libraries makes it a strong candidate for future research. In this way, researchers find the needles in the haystack and examine them further—fulfilling the vision of CGAP’s originators.

The battle rages
Is mining cDNA libraries for breast cancer genes a promising way of searching for genetic weapons to combat the disease? Success stories suggest that the answer is a resounding yes (see box, Victories in the battle against breast cancer). Bioinformatics facilitates data management, so that researchers can develop statistical associations between genes before laboratory or clinical investigations begin. Lash points out that current EST technologies are good at discovering new genes but not at quantifying gene expression, an important step in finding overexpressed genes in cancer cells.

Genes do not operate in a vacuum, however, but in a genetic background of great complexity. Population-wide analysis of how breast cancer relates to the presence of certain genes is critical, as is looking at “the interaction of genes” that predispose to cancer, according to Robinson.

Nonetheless, recent breakthroughs from Robert Weinberg’s laboratory at the Whitehead Institute in Cambridge, MA, culminated in defining the genetic events necessary to produce cancer cells. In the culture dish, normal cells became cancer cells when Weinberg and colleagues introduced three genetic changes. The first consisted of artificially expressing the catalytic domain of telomerase, a chromosome length-maintenance enzyme. This procedure alone produces immortal cells. Introducing two well-known oncogenes—activated ras and simian virus 40 large-T antigen (an inhibitor of tumor suppressor proteins)—into these immortal cells converted them into cancer cells. All three genetic changes, which affect at least four distinct pathways, were required to produce tumor cells in vitro, but no one knows the minimum number of genetic hits needed for tumor formation in vivo or the order in which they must occur. Elucidating how cells become cancerous only begins with the identification of one or two genes involved in the onset of cancer, because, as Weinberg’s results demonstrate, finding that those genes are important to a specific kind of cancer requires the simultaneous investigation of many interacting genes.

What is the next step in looking for breast cancer genes in those haystack needles? A close look at interactions between genetics and the environment, says Futreal. “[It’s] population genetics meets cancer genetics.”


For further reading

  • Weitzman, J. B.; Yaniv, M. Rebuilding the road to cancer. Nature 1999, 400, 401–402.
  • Chin, G. J. DNA repair in breast cancer; Blocking estrogen. Science 1999, 285, 637.
  • Hagmann, M. New model for hereditary breast cancer. Science 1999, 284, 723–725.


Mona Mort is a freelance science writer living in Tucson, AZ. Comments and questions for the author may be addressed to the Editorial Office by e-mail at mdd@acs.org, by fax at 202-776-8166 or by post at 1155 16th Street, NW; Washington, DC 20036.
