Objective 1. Continued Elucidation of Genome Structure and Organization

In the past five years, important guiding concepts have emerged from sequencing projects involving plants, animals and microbes. One is the importance of focusing on “reference genomes”. The finished yeast, worm and Arabidopsis genomes are references that serve as assembly templates for subsequent draft sequences of genomes from related organisms. Another is the use of comparative sequence analysis. While a single genome sequence is useful, its utility increases dramatically as additional sequences become available. Comparative genomics uses these resources to mine extensively for all of the genes and regulatory sequences in a genome. For example, sequencing two mammalian genomes (human and mouse) revealed genes that had not been found by sequencing a single genome alone. Comparative genomics research will increase our understanding, at the sequence level, of the events that gave rise to new species or to the emergence of specific traits. The sequence resources developed over the next five years can be used both to describe the structure of individual genomes and to clarify the dynamic processes that shape genomes.

Contribute to the international effort to finish the rice genome sequence

A deep and highly accurate draft rice sequence was completed in December 2002, representing the first publicly available and most complete rice sequence information to date. However, additional sequencing to close remaining gaps will be required to finish this genome and facilitate its use as a reference genome for cereals. A complete, finished rice sequence will be an essential tool for the broader plant research community, both basic and applied.

Complete sequencing of the gene-rich regions of the maize genome

Like many of the cereals, maize has a large, complex genome of about 2,800 million base pairs (Mbp) of DNA, about the same size as the human genome and roughly 21 times larger than the Arabidopsis genome. Its organization is complex, with more than 80% of the genomic DNA consisting of repetitive sequences and only about 15%, or about 300 Mbp, encoding genes. While it was not realistic to contemplate sequencing a genome of this size and complexity in 1998, it is now possible to sequence the gene-rich regions of the maize genome. Technical challenges are being surmounted by developing efficient methods to enrich for genes prior to sequencing and then assembling and mapping the sequences onto the existing master maize genome map. The technologies being developed for sequencing the maize genome can then be applied to other large and complex genomes, not only those of plants. A complete sequence of the gene-rich regions of the maize genome would augment available genomic tools to address fundamental questions about gene function, evolution, development and physiology across all the cereals.

Detailed genome analysis of a few key plant species

At present, sequencing all plant genomes remains prohibitively expensive because many are large and complex (Table 1). The current cost is approximately $0.09 per base pair. At this price, a finished (99.99% accuracy) sequence for wheat, at about 16,000 Mbp, would cost $1.44 billion. The National Human Genome Research Institute intends to develop sequencing technology in the next decade that will produce complete genome sequences for $1,000 each. Until then, the most efficient use of NPGI resources will be to develop a set of draft sequences for the gene-rich regions of key plant species, building on the concept derived from the first five years that reference genomes are essential genomics tools.
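The cost figure above follows from simple arithmetic, which the short Python sketch below reproduces for a few of the genomes listed in Table 1. It assumes, as a simplification, that cost scales linearly with genome size at the quoted $0.09 per base pair; the function and variable names are illustrative only and are not part of the plan.

```python
# Rough sequencing-cost estimate at a flat per-base price.
# Assumption: cost scales linearly with genome size; real projects
# would also depend on coverage depth, repeat content and finishing effort.

COST_PER_BP_USD = 0.09  # approximate current cost per base pair (from the text)

# Estimated genome sizes in millions of base pairs (Mbp), from Table 1.
GENOME_SIZE_MBP = {
    "Arabidopsis": 130,
    "Rice": 430,
    "Barley": 5_000,
    "Wheat": 16_000,
}


def sequencing_cost_usd(size_mbp, cost_per_bp=COST_PER_BP_USD):
    """Return the estimated cost in US dollars for a genome of size_mbp Mbp."""
    return size_mbp * 1_000_000 * cost_per_bp


for plant, size_mbp in GENOME_SIZE_MBP.items():
    cost_millions = sequencing_cost_usd(size_mbp) / 1e6
    print(f"{plant}: about ${cost_millions:,.0f} million")
```

For wheat, the calculation is 16,000 Mbp multiplied by $0.09 per base pair, which gives the $1.44 billion quoted above.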

Criteria for selecting plant species for sequencing will include, at a minimum, the following considerations: (1) Experimental tractability; (2) Complexity of genome structure; (3) Potential for serving as a reference; and (4) Usefulness of the sequence information to advance plant science.

Table 1
Size of sample plant genomes

Plant Genome	Estimated Size (Mbp)
Arabidopsis	130
Rice	430
Medicago	550
Poplar	550
Apple	770
Tomato	950
Sorghum	1,000
Soybean	1,000-2,000
Cotton	2,110
Maize	2,500-3,000
Barley	5,000
Wheat	16,000
Onion	18,000
Fern	160,000

Genome analysis resources for a broad spectrum of plants of biological and economic importance

The majority of plants will not be candidates for detailed genome analysis in the next five years. In these cases, research needs can be met by the development of deep genetic and physical maps, Expressed Sequence Tags (ESTs) and Bacterial Artificial Chromosome (BAC) libraries. BAC libraries are relatively inexpensive to construct and are useful to many researchers who work on unique plant systems and to all researchers for comparative genomics research. A recently developed process called “Targeted Comparative Sequencing” uses BAC libraries as a promising tool to provide insight into genome evolution. ESTs prepared for unique cell types or plants grown under specific conditions are especially useful to identify networks of genes involved in specialized plant processes such as production of secondary metabolites or responses to specific stimuli.

Understanding the structural basis for plant genome organization

Plants are well suited for studying the structural basis of complex genome organization. Genome organization contains a record of the evolutionary history of the plant. Thus, comparison of select examples can reveal the processes that led to the current structure and organization of plant genomes. In the next few years, additional genome sequences, EST sequences, and other structural genomics resources will become available. These resources will make it possible to generate detailed, comparative maps for finding all genes and regulatory sequences and for studying genome evolution across a broad range of plants. Comparative studies will increase our understanding of the relationships between genome structure and organization and allow us to begin to address major unanswered questions in plant sciences, such as:

Impact of domestication on genome structure and vice versa: Plant genomes, especially those of cultivated plants, are often radically different from other eukaryotic genomes, both in structure and in organization. It is likely that many of these differences reflect the strong selection applied during domestication over thousands of years. During domestication, whole genome duplication, segmental genome duplication or loss, and genome rearrangements have occurred in a number of crop plants. Understanding the basic biology of the domestication process will help researchers develop rational strategies for future crop improvement.

Role of subgenomes in allopolyploids: Many plants are hybrids of two or more progenitor plants, called “allopolyploids”. For example, bread wheat (Triticum aestivum) contains three ancestral genomes termed A, B and D. The D genome is derived from Aegilops tauschii and contains genes for bread quality. Having sequence information for these genomes would provide scientists with the tools to understand how diverse genomes combine to generate new plant species.
