Part I Key Organisms Handbook of Genome Research. Genomics, Proteomics, Metabolomics, Bioinformatics, Ethical and Legal Issues. Edited by Christoph W. Sensen Copyright © 2005 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim ISBN: 3-527-31348-6



1

Genome Projects on Model Organisms

Alfred Pühler, Doris Jording, Jörn Kalinowski, Detlev Buttgereit, Renate Renkawitz-Pohl, Lothar Altschmied, Antoine Danchin, Agnieszka Sekowska, Horst Feldmann, Hans-Peter Klenk, and Manfred Kröger

1.1

Introduction

Genome research makes it possible to establish the complete genetic information of organisms. The first complete genome sequences established were those of prokaryotic and eukaryotic microorganisms, followed by those of plants and animals (see, for example, the TIGR web page at http://www.tigr.org/). The organisms selected for genome research were mostly those which were already important in scientific analysis and thus can be regarded as model organisms. In general, organisms are defined as model organisms when a large amount of scientific knowledge about them has been accumulated in the past. For this chapter on genome projects of model organisms, several experts in genome research have been asked to give an overview of specific genome projects and to report on the respective organism from their specific point of view. The organisms selected include prokaryotic and eukaryotic microorganisms, and plants and animals.

We have chosen the prokaryotes Escherichia coli, Bacillus subtilis, and Archaeoglobus fulgidus as representative model organisms. The E. coli genome project is described by M. KRÖGER (Giessen, Germany). He gives a historical outline of the intensive research on the microbiology and genetics of this organism, which culminated in the E. coli genome project. Many of the technological tools currently available were developed during the course of the E. coli genome project. E. coli is without doubt the best-analyzed microorganism of all. Knowledge of the complete sequence of E. coli has confirmed its reputation as the leading model organism of Gram− eubacteria.

A. DANCHIN and A. SEKOWSKA (Paris, France) report on the genome project of the environmentally and biotechnologically relevant Gram+ eubacterium B. subtilis. The contribution focuses on the results and analysis of the sequencing effort and gives several examples of specific and sometimes unexpected findings of this project. Special emphasis is given to genomic data which support the understanding of general features such as translation and of specific traits relevant for living in its general habitat or for its usefulness in industrial processes.

A. fulgidus is the subject of the contribution by H.-P. KLENK (Feldafing, Germany). Although this genome project was started before the genetic properties of the organism had been extensively studied, its unique lifestyle as a hyperthermophilic and sulfate-reducing organism makes it a model for a large number of environmentally important microorganisms and species with high biotechnological potential. The structure and results of the genome project are described in the contribution.

The yeast Saccharomyces cerevisiae has been selected as a representative eukaryotic microorganism. The yeast project is presented by H. FELDMANN (Munich, Germany). S. cerevisiae has a long tradition in biotechnology and a long research history as a eukaryotic model organism per se. It was the first eukaryote to be completely sequenced and has led the way to sequencing other eukaryotic genomes. The value of the yeast sequence information as a reference for plant, animal, or human sequence comparisons is outlined in the contribution.

Among the plants, the small crucifer Arabidopsis thaliana was identified as the classical model plant, because of its simple cultivation and short generation time. Its genome was originally considered to be the smallest in the plant kingdom and was therefore selected for the first plant genome project, which is described here by L. ALTSCHMIED (Gatersleben, Germany). The sequence of A. thaliana helped to identify that part of the genetic information unique to plants. In the meantime, other plant genome sequencing projects were started, many of which focus on specific problems of crop cultivation and nutrition.

The roundworm Caenorhabditis elegans and the fruit fly Drosophila melanogaster have been selected as animal models, because of their specific model character for higher animals and also for humans. The genome project of C. elegans is summarized by D. JORDING (Bielefeld, Germany). The contribution describes how the worm, despite its simple appearance, became an interesting model organism for features such as neuronal growth, apoptosis, or signaling pathways. This genome project has also provided several bioinformatic tools which are widely used for other genome projects.

The genome project concerning the fruit fly D. melanogaster is described by D. BUTTGEREIT and R. RENKAWITZ-POHL (Marburg, Germany). D. melanogaster is currently the best-analyzed multicellular organism and can serve as a model system for features such as the development of limbs, the nervous system, circadian rhythms, and even for complex human diseases. The contribution gives examples of the genetic homology and similarities between Drosophila and the human, and outlines perspectives for studying features of human diseases using the fly as a model.

1.2

Genome Projects of Selected Prokaryotic Model Organisms

1.2.1

The Gram− Enterobacterium Escherichia coli

1.2.1.1

The Organism

The development of the most recent field of molecular genetics is directly connected with one of the best described model organisms, the eubacterium Escherichia coli. There is no textbook in biochemistry, genetics, or microbiology which does not contain extensive sections describing numerous basic observations first noted in E. coli cells, or the respective bacteriophages, or using E. coli enzymes as a tool. Consequently, several monographs solely devoted to E. coli have been published. Although it seems impossible to name or count the number of scientists involved in the characterization of E. coli, Tab. 1.1 is an attempt to name some of the most deserving people in chronological order.

Table 1.1. Chronology of the most important primary detection and method applications with E. coli.

1886   “bacterium coli commune” by T. Escherich
1922   Lysogeny and prophages by d’Herelle
1940   Growth kinetics for a bacteriophage by M. Delbrück (Nobel prize 1969)
1943   Statistical interpretation of phage growth curve (game theory) by S. Luria (Nobel prize 1969)
1947   Conjugation by E. Tatum and J. Lederberg (Nobel prize 1958)
       Repair of UV-damage by A. Kelner and R. Dulbecco (Nobel prize for tumor virology)
1954   DNA as the carrier of genetic information, proven by use of radioisotopes by M. Chase and A. Hershey (Nobel prize 1969)
1959   Phage immunity as the first example of gene regulation by A. Lwoff (Nobel prize 1965)
       Transduction of gal-genes (first isolated gene) by E. and J. Lederberg
       Host-controlled modification of phage DNA by G. Bertani and J.J. Weigle
1959   DNA-polymerase I by A. Kornberg (Nobel prize 1959)
       Polynucleotide-phosphorylase (RNA synthesis) by M. Grunberg-Manago and S. Ochoa (Nobel prize 1959)
1960   Semiconservative duplication of DNA by M. Meselson and F. Stahl
1961   Operon theory and induced fit by F. Jacob and J. Monod (Nobel prize 1965)
1964   Restriction enzymes by W. Arber (Nobel prize 1978)
1965   Physical genetic map with 99 genes by A.L. Taylor and M.S. Thoman
       Strain collection by B. Bachmann
1968   DNA-ligase by several groups contemporaneously
1976   DNA-hybrids by P. Lobban and D. Kaiser
1977   Recombinant DNA from E. coli and SV40 by P. Berg (Nobel prize 1980)
       Patent on genetic engineering by H. Boyer and S. Cohen
1978   Sequencing techniques using lac operator by W. Gilbert and E. coli polymerase by F. Sanger (Nobel prize 1980)
1979   Promoter sequence by H. Schaller
       Attenuation by C. Yanofsky
       General ribosome structure by H.G. Wittmann
1979   Rat insulin expressed in E. coli by H. Goodman
       Synthetic gene expressed by K. Itakura and H. Boyer
1980   Site directed mutagenesis by M. Smith (Nobel prize 1993)
1985   Polymerase chain reaction by K.B. Mullis (Nobel prize 1993)
1988   Restriction map of the complete genome by Y. Kohara and K. Isono
1990   Organism-specific sequence data base by M. Kröger
1995   Total sequence of Haemophilus influenzae using an E. coli comparison
1999   Systematic sequence finished by a Japanese consortium under leadership of H. Mori
2000   Systematic sequence finished by F. Blattner
2000   Three-dimensional structure of ribosome by four groups contemporaneously

The scientific career of E. coli (Fig. 1.1) started in 1885 when the German pediatrician T. Escherich described isolation of the first strain from the feces of new-born babies. As late as 1958 this discovery was recognized internationally by use of his name to classify this group of bacterial strains. In 1921 the very first report on virus formation was published for E. coli. Today we call the respective observation “lysis by bacteriophages”. In 1935 these bacteriophages became the most powerful tool in defining the characteristics of individual genes. Because of their small size, they were found to be ideal tools for statistical calculations performed by the former theoretical physicist M. Delbrück. His very intensive and successful work attracted many others to this area of research. In addition, Delbrück’s extraordinary capability to catalyze the exchange of ideas and methods yielded the legendary Cold Spring Harbor Phage course. Everybody interested in basic genetics has attended this famous summer course or at least came to the respective annual phage meeting. This course, which was an ideal combination of joy and work, became an ideal means of spreading practical methods. For many decades it was the most important exchange forum for results and ideas, and for strains and mutants. Soon, the so-called “phage family” was formed, which interacted almost like one big laboratory; for example, results were communicated preferentially by means of preprints. Finally, 15 Nobel prize-winners have their roots in this summer school (Tab. 1.1).

The substrain E. coli K12 was first used by E. Tatum as a prototrophic strain. It was chosen more or less by chance from the strain collection of the Stanford Medical School. Because it was especially easy to cultivate and because it is, as an inhabitant of our gut, a nontoxic organism by definition, the strain became very popular. Because of the vast knowledge already acquired and because it did not form fimbriae, E. coli K12 was chosen in 1975 at the famous Asilomar conference on biosafety as the only organism on which early cloning experiments were permitted [1]. No wonder that almost all subsequent basic observations in the life sciences were obtained either with or within E. coli. What started as the “phage family”, however, dramatically split into hundreds of individual groups working in tough competition. As one of the most important outcomes, sequencing of E. coli was performed more than once. Because of the separate efforts, the genome was finished only as number seven [2–4]. The amount of knowledge acquired, however, is certainly second to none, and the way this knowledge was acquired is interesting, both in the history of sequencing methods and bioinformatics, and because of its influence on national and individual pride.

Fig. 1.1 Scanning electron micrograph (SEM) of Escherichia coli cells. (Image courtesy of Shirley Owens, Center for Electron Optics, MSU; found at http://commtechab.msu.edu/sites/dlc-me/zoo/zah0700.html#top#top)

Work on E. coli is not finished with completion of the DNA sequence; data will be continuously acquired to fully characterize the genome in terms of genetic function and protein structures [5]. This is very important, because several toxic E. coli strains are known. Thus research on E. coli has turned from basic science into applied medical research. Consequently, the human toxic strain O157 has been completely sequenced, again more than once (unpublished).

1.2.1.2

Characterization of the Genome and Early Sequencing Efforts

With its history in mind and realizing the impact of the data, it is obvious that an ever growing number of colleagues worldwide worked with or on E. coli. Consequently, there was an early need for organization of the data. This led to the first physical genetic map of any living organism, comprising 99 genes, published by Taylor and Thoman [6]. This map was improved and refined over several decades by Bachmann [7] and Berlyn [8]. These researchers still maintain a very useful collection of strains and mutants at Yale University. One thousand and twenty-seven loci had been mapped by 1983 [7]; these were used as the basis of the very first sequence database specific to a single organism [4]. As shown in Fig. 2 of Kröger and Wahl [4], sequencing of E. coli started as early as 1967 with one of the first tRNA sequences ever characterized. Immediately after DNA sequencing had been established, numerous laboratories started to determine sequences of their personal interest.

1.2.1.3

Structure of the Genome Project

In 1987 Isono’s group published a very informative and incredibly exact restriction map of the entire genome [9]. With the help of K. Rudd it was possible to locate sequences quite precisely [8, 10]. But only very few saw any advantage in closing the sometimes very small gaps, and so a worldwide joint sequencing approach could not be established. Two groups, one in Kobe, Japan [3] and one in Madison, Wisconsin [2], started systematic sequencing of the genome in parallel, and another laboratory, at Harvard University, used E. coli as a target to develop new sequencing technology. Several meetings, organized especially on E. coli, did not result in a unified systematic approach; thus many genes have been sequenced two or three times. Although specific databases have been maintained to bring some order into the increasing chaos, even this type of tool has been developed several times in parallel [4, 10]. Whenever a new contiguous sequence was published, approximately 75 % had already been submitted to the international databases by other laboratories. The progress of data acquisition followed a classical e-curve, as shown in Fig. 2 of Kröger and Wahl [4]. Thus in 1992 it was possible to predict the completion of the sequence for 1997 without knowledge of the enormous technical innovations in between [4].
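If the “e-curve” mentioned above is read as exponential growth of the known sequence, the kind of extrapolation described can be sketched with a simple log-linear fit. The following is only an illustration under that assumption: the yearly totals are placeholder values generated with an arbitrary two-year doubling time, not the data of Kröger and Wahl [4], and the function names are hypothetical. The placeholder numbers therefore do not reproduce the 1997 forecast.

```python
import math

GENOME_SIZE_BP = 4_639_221  # total genome size quoted in Tab. 1.2


def fit_exponential(years, totals):
    """Least-squares fit of ln(total) = a + b * year; returns (a, b)."""
    logs = [math.log(t) for t in totals]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predicted_completion_year(a, b, target_bp=GENOME_SIZE_BP):
    """Year at which the fitted exponential reaches target_bp."""
    return (math.log(target_bp) - a) / b


# Placeholder totals (NOT real data): known base pairs, doubling every 2 years.
years = [1982, 1984, 1986, 1988, 1990, 1992]
totals = [100_000 * 2 ** ((y - 1982) / 2) for y in years]

a, b = fit_exponential(years, totals)
doubling_time = math.log(2) / b
completion = predicted_completion_year(a, b)
print(f"doubling time: {doubling_time:.2f} years, completion: {completion:.1f}")
```

With real yearly totals of nonredundant sequence in place of the placeholder list, the same two functions would give the fitted doubling time and the year in which the curve crosses the genome size.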

Both the Japanese consortium and the group of F. Blattner started early; some people say they started too early. They subcloned the DNA first and used manual sequencing and older informatic systems. Sequencing was performed semi-automatically, and many students were employed to read and monitor the X-ray films. When the first genome sequence, that of Haemophilus influenzae, appeared in 1995, the science foundations wanted to discontinue support of E. coli projects, which received their grant support mainly because of the model character of the sequencing techniques developed.

Three facts and truly international protest convinced the juries to continue financial support. First, in contrast with the other completely sequenced organisms, E. coli is an autonomously living organism. Second, when the first complete very small genome sequence was released, even the longest contiguous sequence for E. coli was already longer. Third, the other laboratories could only finish their sequences because the E. coli sequences were already publicly available. Consequently, the two main competing laboratories were allowed to purchase several of the sequencing machines already developed and use the shotgun approach to complete their efforts. Finally, they finished almost at the same time. H. Mori and his colleagues included already published sequences from other laboratories in their sequence data and sent them to the international databases on December 28th, 1996 [3], and F. Blattner reported an entirely new sequence on January 16th, 1997 [2]. They added the last changes and additions as late as October, 1998. Very sadly, at the end E. coli had been sequenced almost three times [4]. Nowadays, however, most people forget about all the other sources and refer to the Blattner sequence.

1.2.1.4

Results from the Genome Project

When the sequences were finally finished, most of the features of the genome were already known. Consequently, people no longer celebrate the E. coli sequence as a major breakthrough. At that time everybody knew the genome was almost completely covered with genes, although fewer than half had been genetically characterized. Tab. 1.2 illustrates this and shows the counting differences. Because of this high density of genes, F. Blattner and coworkers defined “gray holes” whenever they found a noncoding region of more than 2 kb [2]. It was found that the termination of replication is almost exactly opposite to the origin of replication. No special differences have been found for either direction of replication. Approximately 40 formerly described genetic features could not be located or supported by the sequence [4, 8]. On the other hand, there are several examples of multiple functions encoded by the same gene. It was found that the multifunctional genes are mostly involved in gene expression and used as a general control factor. M. Riley determined the number of gene duplications, which is, not unexpectedly, also low when the ribosomal operons are neglected [10].
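The “gray hole” criterion, a noncoding region of more than 2 kb, can be expressed as a scan over sorted gene coordinates. The sketch below is a minimal illustration and not the procedure used by Blattner and coworkers [2]; the function name, the coordinate convention (1-based, inclusive), and the toy coordinates are assumptions made for the example, and the stretch spanning the coordinate origin of the circular chromosome is deliberately ignored.

```python
def find_gray_holes(genes, min_gap_bp=2000):
    """Return (first_bp, last_bp) of gene-free stretches longer than min_gap_bp.

    genes: iterable of (start, end) pairs, 1-based inclusive, start <= end.
    Only gaps between annotated genes are reported; the region that wraps
    around the origin of the circular coordinate system is not examined.
    """
    ordered = sorted(genes)
    holes = []
    if not ordered:
        return holes
    covered_to = ordered[0][1]  # right-most position covered so far
    for start, end in ordered[1:]:
        gap = start - covered_to - 1
        if gap > min_gap_bp:
            holes.append((covered_to + 1, start - 1))
        covered_to = max(covered_to, end)  # handles overlapping genes
    return holes


# Hypothetical coordinates, for illustration only.
toy_genes = [(100, 1200), (1300, 2500), (5000, 6200), (6100, 7000), (9500, 9900)]
print(find_gray_holes(toy_genes))
```

Tracking the right-most covered position, rather than only the end of the previous gene, keeps overlapping or nested genes from producing spurious gaps.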

Everybody is convinced that the real work is starting only now. Several strain differences might be the cause of the deviations between the different sequences available. Thus the numbers of genes and nucleotides differ slightly (Tab. 1.2). Everybody would like to know the function of each of the open reading frames [5], but nobody has received the grant money to work on this important problem. Seemingly, other model organisms are of more public interest; thus it might well be that research on other organisms will now help our understanding of E. coli, in just the same way that E. coli provided information enabling understanding of them. In contrast with yeast, it is very hard to produce knock-out mutants. Thus, we might have the same situation in the postgenomic era as we had before the genome was finished. Several laboratories will continue to work with E. coli; they will constantly characterize one or the other open reading frame, but there will be no mutual effort [5]. A simple and highly efficient method using PCR products to inactivate chromosomal genes was recently developed [11]. This method has greatly facilitated systematic mutagenesis approaches in E. coli.

1.2.1.5

Follow-up Research in the Postgenomic Era

Today it seems more attractive to work with toxic E. coli strains, for example O157, than with E. coli K12. Strain O157 has recently been completely sequenced; the data are available via the internet. Comparison of toxic and nontoxic strains will certainly help us to understand the toxic mechanisms. It was, on the other hand, found to be correct to use E. coli K12 as the most intensively used strain for biological safety regulations [1]. No additional features changed this. This E. coli strain is subject to comprehensive transcriptomics and proteomics studies. For global gene expression profiling, different systems such as an Affymetrix GeneChip and several oligonucleotide sets for the printing of microarrays are available. These tools have already been extensively used by researchers during recent years. Proteomics studies resulted in a comprehensive reference map for the E. coli K-12 proteome (SWISS-2DPAGE, Two-dimensional polyacrylamide gel electrophoresis database, http://www.expasy.org/ch2d). The “Encyclopedia of Escherichia coli K-12 Genes and Metabolism” (EcoCyc) (www.ecocyc.org) is a very useful and constantly growing E. coli metabolic pathway database for the scientific community [12].

Table 1.2 Some statistical features of the E. coli genome.

Total size: 4,639,221 bp 1)

Feature                  Category              According to Regulon 4)   According to Blattner 5)
Transcription units      Proven                528
                         Predicted             2328
Genes                    Total found           4408                      4403
                         Regulatory            85
                         Essential             200
                         Nonessential 2)       2363                      1897
                         Unknown 3)            1761                      2376
                         tRNA                  84                        84
                         rRNA                  29                        29
Promoters                Proven                624
                         Predicted             4643
Sites                                          469
Regulatory interactions  Found                 642
                         Predicted             275
Terminators              Found                 96
RBS                                            98
Gene products            Regulatory proteins   85
                         RNA                   115                       115
                         Other peptides        4190                      4201

1) Additional 63 bp compared with the original sequence
2) Genes with known or predicted function
3) No other data available other than the existence of an open reading frame with a start sequence and more than 100 codons
4) Data from http://tula.cifn.unam.mx/Computational_Genomics/regulondb/
5) Data from http://www.genome.wisc.edu
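Footnote 3 of Tab. 1.2 defines an “unknown” gene by nothing more than an open reading frame with a start sequence and more than 100 codons. How such a criterion can be applied to a sequence is sketched below. This is a deliberately simplified illustration and not the annotation pipeline behind the table: it assumes ATG as the only start codon, uses the three standard stop codons, scans only the forward strand, and the function name and toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_long_orfs(seq, min_codons=100):
    """Return (start, end, n_codons) for forward-strand ORFs above min_codons.

    start is 0-based, end is exclusive and includes the stop codon;
    n_codons counts the start codon and all sense codons, not the stop.
    In each frame, an ORF runs from the first ATG after the previous stop
    to the next in-frame stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons > min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None
    return orfs


# Synthetic test sequence: one ORF of 121 codons followed by one of 51 codons.
long_orf = "ATG" + "GCT" * 120 + "TAA"
short_orf = "ATG" + "GCT" * 50 + "TAA"
print(find_long_orfs("CC" + long_orf + short_orf))
```

A realistic gene finder would also scan the reverse complement, allow alternative start codons, and look for a ribosome binding site upstream of the start, which is presumably what “start sequence” refers to in the footnote.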

Surprisingly, colleagues from mathematics or informatics have shown the most interest in the bacterial sequences. They have performed all kinds of statistical analysis and tried to discover evolutionary roots. Here another fear of the public is already formulated: people are afraid of attempts to reconstruct the first living cell. So there are at least some attempts to find the minimum set of genes for the most basic needs of a cell. We have to ask again the very old question: Do we really want to “play God”? If so, E. coli could indeed serve as an important milestone.

1.2.2

The Gram+ Spore-forming Bacillus subtilis

1.2.2.1

The Organism

Self-taught ideas have a long life: articles about Bacillus subtilis (Fig. 1.2) almost invariably begin with words such as “B. subtilis, a soil bacterium …”, nobody taking the elementary care to check on what type of experimental observation this is based. Bacillus subtilis, first identified in 1885, is named ko so kin in Japanese and laseczka sienna in Polish, or “hay bacterium”, and this refers to the real biotope of the organism, the surface of grass or low-lying plants [13]. Interestingly, it took the sequencing of its genome for the organism to be assigned its correct biotope again. Of course, plant leaves fall on the soil surface, and one must naturally find B. subtilis there, but its normal niche is the surface of leaves, the phylloplane. Hence, if one wishes to use this bacterium in industrial processes, to engineer its genome, or simply to understand the functions coded by its genes, it is of fundamental importance to understand where it normally thrives, and which environmental conditions control its life-cycle and the corresponding gene expression. Among other important ancillary functions, B. subtilis has thus to explore, colonize, and exploit local resources, while at the same time it must maintain itself, dealing with congeners and with other organisms: understanding B. subtilis requires understanding the general properties of its normal habitat.

Fig. 1.2 Electron micrograph of a thin section of Bacillus subtilis. The dividing cell is surrounded by a relatively dense wall (CW), enclosing the cell membrane (cm). Within the cell, the nucleoplasm (n) is distinguishable by its fibrillar structure from the cytoplasm, densely filled with 70S ribosomes (r).

1.2.2.2

A Lesson from Genome Analysis: The Bacillus subtilis Biotope

The genome of B. subtilis (strain 168), sequenced by a team in European and Japanese laboratories, is 4,214,630 bp long (http://genolist.pasteur.fr/SubtiList/). Of more than 4100 protein-coding genes, 53 % are represented once. One quarter of the genome corresponds to several gene families which have probably been expanded by gene duplication. The largest family contains 77 known and putative ATP-binding cassette (ABC) permeases, indicating that, despite its large number of metabolic genes, B. subtilis has to extract a variety of compounds from its environment [14]. In general, the permeating substrates are unchanged during permeation. Group transfer, in which substrates are modified during transport, plays an important role in B. subtilis, however. Its genome codes for a variety of phosphoenolpyruvate-dependent systems (PTS) which transport carbohydrates and regulate general metabolism as a function of the nature of the supplied carbon source. A functionally related catabolite repression control, mediated by a unique system (not cyclic AMP), exists in this organism [15]. Remarkably, apart from the expected presence of glucose-mediated regulation, it seems that carbon sources related to sucrose play a major role, via a very complicated set of highly regulated pathways, indicating that this plant-associated carbon supply is often encountered by the bacteria. In the same way, B. subtilis can grow on many of the carbohydrates synthesized by grass-related plants.
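Figures such as “53 % are represented once” or “the largest family contains 77 members” follow from a grouping of genes into paralog families. Assuming such a grouping is already available as a mapping from gene to family label, the summary can be sketched as follows. The gene names, family labels, and function name are invented for illustration; how the families themselves were derived is not described here, and the toy numbers bear no relation to the B. subtilis values.

```python
from collections import Counter


def family_statistics(gene_to_family):
    """Summarize paralog families from a mapping gene name -> family label."""
    sizes = Counter(gene_to_family.values())  # family label -> member count
    singletons = sum(1 for n in sizes.values() if n == 1)
    largest_family, largest_size = max(sizes.items(), key=lambda kv: kv[1])
    return {
        "genes": len(gene_to_family),
        "singleton_fraction": singletons / len(gene_to_family),
        "largest_family": largest_family,
        "largest_size": largest_size,
    }


# Hypothetical assignment of eight genes to five families.
toy_assignment = {
    "g1": "abc_like", "g2": "abc_like", "g3": "abc_like",
    "g4": "pts_like", "g5": "pts_like",
    "g6": "unique_1", "g7": "unique_2", "g8": "unique_3",
}
stats = family_statistics(toy_assignment)
print(stats)
```

Here a gene counts as “represented once” when its family has exactly one member, so the singleton fraction is the number of one-member families divided by the total number of genes.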

In addition to carbon, oxygen, nitrogen, hydrogen, sulfur, and phosphorus are the core atoms of life. Some knowledge about the metabolism of these other elements in B. subtilis has accumulated, but significantly less than in its E. coli counterpart. Knowledge of its genome sequence is, however, rapidly changing the situation, making B. subtilis a model of similar general use to E. coli. A frameshift mutation is present in an essential gene for surfactin synthesis in strain 168 [16], but it has been found that including a small amount of a detergent in plates enabled these bacteria to swarm and glide extremely efficiently (C.-K. Wun and A. Sekowska, unpublished observations). The first lesson of genome text analysis is thus that B. subtilis must be tightly associated with the plant kingdom, with grasses in particular [17]. This should be given priority when devising growth media for this bacterium, in particular in industrial processes.

Another aspect of the B. subtilis life cycle consistent with a plant-associated life is that it can grow over a wide range of different temperatures, up to 54–55 °C – an interesting feature for large-scale industrial processes. This indicates that its biosynthetic machinery comprises control elements and molecular chaperones that enable this versatility. Gene duplication might enable adaptation to high temperature, with isozymes having low- and high-temperature optima. Because the ecological niche of B. subtilis is linked to the plant kingdom, it is subjected to rapid alternating drying and wetting. Accordingly, this organism is very resistant to osmotic stress, and can grow well in media containing 1 M NaCl. Also, the high oxygen concentrations reached during daytime are met with protection systems – B. subtilis seems to have as many as six catalase genes, both of the heme-containing type (katA, katB, and katX in spores) and of the manganese-containing type (ydbD, PBX phage-associated yjqC, and cotJC in spores).

The obvious conclusion from these observations is that the normal B. subtilis niche is the surface of leaves [18]. This is consistent with the old observation that B. subtilis makes up the major population of the bacteria of rotting hay. Furthermore, consistent with the extreme variety of conditions prevailing on plants, B. subtilis is an endospore-forming bacterium, making spores highly resistant to the lethal effects of heat, drying, many chemicals, and radiation.

1.2.2.3

To Lead or to Lag: First Laws of Genomics

Analysis of repeated sequences in the B. subtilis genome revealed an unexpected feature: strain 168 does not contain insertion sequences. A strict constraint on the spatial distribution of repeats longer than 25 bp was found in the genome, in contrast with the situation in E. coli. Correlation of the spatial distribution of repeats and the absence of insertion sequences in the genome suggests that mechanisms aimed at their avoidance and/or elimination have been developed [19]. This observation is particularly relevant for biotechnological processes in which one has multiplied the copy number of genes to improve production. Although there is generally no predictable link between the structure and function of biological objects, the pressure of natural selection has adapted genes and gene products to each other. Biases in features of predictably unbiased processes are evidence of prior selective pressure. With B. subtilis one observes a strong bias in the polarity of transcription with respect to replication: 70 % of the genes are transcribed in the direction of the replication fork movement [14]. Global analysis of oligonucleotides in the genome demonstrated there is a significant bias not only in the base or codon composition of one DNA strand relative to the other, but, quite surprisingly, there is a strong bias at the level of the amino-acid content of the proteins. The proteins coded by the leading strand are valine-rich and those coded by the lagging strand are threonine and isoleucine-rich. This first law of genomics seems to extend to many bacterial genomes [20]. It must result from a strong selection pressure of a yet unknown nature, demonstrating that, contrary to an opinion frequently held, genomes are not, on a global scale, plastic structures. This should be taken into account when expressing foreign proteins in bacteria.
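The co-orientation bias described above (70 % of genes transcribed in the direction of fork movement) can be recomputed from any gene table once the origin and terminus of replication are known. The following Python sketch shows one way to do so. The `Gene` record, the function names, and the toy coordinates are assumptions made for illustration; they are not taken from SubtiList or from the cited analyses.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    start: int   # coordinate of the leftmost base
    end: int     # coordinate of the rightmost base
    strand: int  # +1 = forward strand, -1 = reverse strand

def is_co_oriented(gene, origin, terminus, genome_len):
    """True if the gene is transcribed in the direction of fork movement."""
    midpoint = (gene.start + gene.end) // 2
    # Distances measured clockwise (increasing coordinates) from the origin.
    offset = (midpoint - origin) % genome_len
    ter_offset = (terminus - origin) % genome_len
    clockwise_replichore = offset < ter_offset
    # On the clockwise replichore the fork moves toward increasing
    # coordinates, so forward-strand genes are co-oriented; on the other
    # replichore the reverse-strand genes are.
    return (gene.strand == 1) == clockwise_replichore

def co_orientation_fraction(genes, origin, terminus, genome_len):
    """Fraction of genes transcribed along the direction of replication."""
    hits = sum(is_co_oriented(g, origin, terminus, genome_len) for g in genes)
    return hits / len(genes)

# Toy circular genome of 1000 bp: origin at 0, terminus at 500.
toy_genes = [
    Gene(10, 100, +1),   # clockwise replichore, forward strand
    Gene(600, 700, -1),  # counter-clockwise replichore, reverse strand
    Gene(200, 300, -1),  # clockwise replichore, reverse strand
    Gene(800, 900, +1),  # counter-clockwise replichore, forward strand
]
fraction = co_orientation_fraction(toy_genes, origin=0, terminus=500, genome_len=1000)
print(f"co-oriented fraction: {fraction:.2f}")
```

In the toy data the first two genes are co-oriented with replication and the last two are not. Applied to a real annotation, a value well above 0.5 would reflect the kind of bias reported for B. subtilis.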

Three principal modes of transfer of genetic material – transformation, conjugation, and transduction – occur naturally in prokaryotes. In B. subtilis, transformation is an efficient process (at least in some B. subtilis strains such as strain 168) and transduction with the appropriate carrier phages is well understood.

The unique presence in the B. subtilis genome of local repeats, suggesting Campbell-like integration of foreign DNA, is consistent with strong involvement of recombination processes in its evolution. Recombination must, furthermore, be involved in mutation correction. In B. subtilis, MutS and MutL homologs occur, presumably for the purpose of recognizing mismatched base pairs [21]. No counterpart of MutH activity, which would enable the daughter strand to be distinguished from its parent, has, however, been identified. It is, therefore, not known how the long-patch mismatch repair system corrects mutations in the newly synthesized strand. One can speculate that the nicks caused in the daughter strands by excision of uracil newly misincorporated in place of thymine during replication might provide the appropriate signal. Ongoing fine studies of the distribution of nucleotides in the genome might substantiate this hypothesis.

The recently sequenced genome of the pathogen Listeria monocytogenes has many features in common with that of B. subtilis [22]. Preliminary analysis suggests that the B. subtilis genome might be organized around the genes of core metabolic pathways, such as that of sulfur metabolism [23], consistent with a strong correlation between the organization of the genome and the architecture of the cell.

1.2.2.4

Translation: Codon Usage and the Organization of the Cell’s Cytoplasm

Exploiting the redundancy of the genetic code, coding sequences show evidence of highly variable biases of codon usage. The genes of B. subtilis are split into three classes on the basis of their codon usage bias. One class comprises the bulk of the proteins, another is made up of genes expressed at a high level during exponential growth, and a third class, with A + T-rich codons, corresponds to portions of the genome that have been horizontally exchanged [14].
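To make the notion of codon usage bias concrete, the short Python sketch below counts codons in a coding sequence and measures how often the third codon position is A or T, a crude indicator of the A + T-rich class mentioned above. The function names and the example sequence are hypothetical, and this is not the multivariate classification used in the cited study [14]; it only illustrates the raw quantities such an analysis starts from.

```python
from collections import Counter

def codon_counts(cds):
    """Count codons in a coding sequence read in frame from its first base."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def at3_fraction(cds):
    """Fraction of codons whose third base is A or T."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    at3 = sum(n for codon, n in counts.items() if codon[2] in "AT")
    return at3 / total

# Hypothetical five-codon sequence: ATG AAA AAA TTT TAA
example = "ATGAAAAAATTTTAA"
print(codon_counts(example).most_common(1))
print(at3_fraction(example))
```

Comparing `at3_fraction` or the full codon frequency vectors across all genes of a genome is one simple way to look for groups of genes with distinct usage patterns.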

When mRNA threads are emerging from DNA they become engaged by the lattice of ribosomes, and ratchet from one ribosome to the next, like a thread in a wiredrawing machine [24]. In this process, nascent proteins are synthesized on each ribosome, spread throughout the cytoplasm by the linear diffusion of the mRNA molecule from ribosome to ribosome. If the environmental conditions change suddenly, however, the transcription complex must often break up. Truncated mRNA is likely to be a dangerous molecule because, if translated, it would produce a truncated protein. Such protein fragments are often toxic, because they can disrupt the architecture of multisubunit complexes. B. subtilis has a process that copes with this kind of accident. When the ribosome reaches the end of a truncated mRNA molecule, it stops translating and waits. A specialized RNA, tmRNA, that is folded and processed at its 3′ end like a tRNA and charged with alanine, comes in, inserts its alanine at the C-terminus of the nascent polypeptide, then replaces the mRNA within the ribosome, where it is translated as ASFNQNVALAA. This tail is a protein tag that is then used to direct the truncated tagged protein to a proteolytic complex (ClpA, ClpX), where it is degraded [25].

1.2.2.5

Post-sequencing Functional Genomics: Essential Genes and Expression-profiling Studies

Sequencing a genome is not a goal per se. Apart from trying to understand how genes function together it is most important, especially for industrial processes, to know how they interact. As a first step it was interesting to identify the genes essential for life in rich media. The European–Japanese functional genomics consortium endeavored to inactivate all the B. subtilis genes one by one [26]. As of 2004, the outcome of this work is still the first and only result in which we can list all the essential genes in bacteria. In this genome counting over 4100 genes, 271 seem to be essential for growth in rich medium under laboratory conditions (i.e. without being challenged by competition with other organisms or by changing environmental conditions). Most of these genes can be placed into a few large and predictable functional categories, for example information processing, cell envelope biosynthesis, shape, division, and energy management. The remaining genes, however, fall into categories not expected to be essential, for example some Embden–Meyerhof–Parnas pathway genes and genes involved in purine biosynthesis. This opens the perspective that these enzymes can have novel and unexpected functions in the cell. Interestingly, among the 26 essential genes that belong to either the “other functions” or the “unknown genes” category, seven belong to or carry the signature for (ATP/)GTP-binding proteins – several now seem to code for tRNA modifications [27], but some could be in charge of coordination of the essential processes listed above, as seems to be true for eukaryotes. A remarkable outcome of this project was the discovery that essential genes are grouped along the leading replication strands in the genome [28]. This feature is general and indicates that genes cannot, at least under strong competitive conditions, be shuffled randomly in the chromosome. This has important consequences for genetic manipulation of organisms of biotechnological interest.

Besides identifying the essential genes of B. subtilis, the European–Japanese project produced an almost complete representative collection of mutants of this bacterium. This collection is freely available to the scientific community, in particular for biotechnology-oriented studies. This strategy is a good example of genome-wide approaches that would have been unthinkable two decades ago. The obvious continuation in this line was the use of transcriptome analysis (identification of all transcripts on DNA arrays under a variety of experimental conditions). Several dozen reports appeared in the literature in those years dealing with data obtained by this global approach. With different technical solutions (from commercially available macroarrays with radioactive labeling through custom-made glass microarrays with fluorescent labeling) they offer an almost exhaustive view of transcription in answer to a given question, provided particular attention has been devoted to controlling all upstream experimental steps (RNA preparation, cDNA synthesis) and to making use of well-chosen statistical analyses. Many reports have been devoted to the study of heat-shock proteins but not much work was devoted to the equally important cold-shock proteins of the bacteria. The two-component system desKR was recently identified; this regulates expression of the des gene coding for a desaturase, which participates in cold adaptation through membrane lipid modification. To discover whether the desKR system is exclusively devoted to des regulation or constitutes a cold-triggered regulatory system of global relevance, macroarray studies seemed to be the method of choice [29]. A major outcome of this study was, it seemed, that the desKR system controls des gene expression only. Unexpectedly, this work uncovered many novel partners involved in the cold-shock response, with almost half of these genes annotated as carrying an unknown function. The categories of genes affected by the cold-shock response were, as expected, the cold-shock protein genes that were already known, but also heat-shock protein genes and genes involved in translation machinery, amino acid biosynthesis, nucleoid structure, ABC permeases (for acetoin in particular), purine and pyrimidine biosynthesis and glycolysis, the citric acid cycle, and ATP synthesis. Because these last categories are found in most transcriptome studies, it remains to be seen whether they are indeed specific to cold shock. Among genes of unknown function particularly interesting in biotechnology one can mention the yplP gene, which codes for a transcriptional regulator that belongs to the NtrC/NifA family and which, when inactivated, causes a cold-specific late-growth phenotype. However, the exact role of the YplP protein remains to be understood.

An essential complement to transcriptome studies is exploration of the bacterial proteome, which gives a detailed look at the behavior of the final players. Because what really counts for a cell is the final level of its mature proteins, and because many post-transcriptional modifications can alter the fate of mRNA translation products into mature proteins, transcriptome analysis alone provides only an approximation of what is really going to happen in the cell. Studying the proteome expressed under specific conditions can help uncover interesting links between different parts of metabolism. This has been explored under conditions important for biotechnology, for example the salt-stress response and iron metabolism [30]. This recent work aiming at establishing the network of proteins affected by osmotic stress has shown that B. subtilis cells growing under high salinity are subjected to iron limitation, as indicated by the increase of expression of several putative iron-uptake systems (fhuD, fhuB, feuA, ytiY and yfmC) or iron siderophore bacillibactin synthesis and modification genes (dhbABCE). The derepression of the dhb operon seems to be more a salt-specific effect than a general osmotic effect, because it is not produced by addition of iso-osmotic non-ionic osmolytes (sucrose or maltose) to the growth medium. Some high-salinity growth-defect phenotypes could, furthermore, be reversed by supplementation of the medium with excess iron. This work shows that two distinct factors important in fermentors – iron limitation and high-salinity stress – hitherto regarded as separate growth-limiting factors, are indeed not so separate.

1.2.2.6

Industrial Processes

Bacillus subtilis is generally recognized as safe (GRAS). It is much used industrially both for enzyme production and for food-supply fermentation. Riboflavin is derived from genetically modified B. subtilis by use of fermentation techniques. For some time high levels of heterologous gene expression in B. subtilis were difficult to achieve. In contrast with Gram-negatives, A + T-rich Gram-positive bacteria have optimized transcription and translation signals; although B. subtilis has a counterpart of the rpsA gene, this organism lacks the function of the corresponding ribosomal S1 protein which enables recognition of the ribosome-binding site upstream of the translation start codons [31]. Traditional techniques (e.g. random mutagenesis followed by screening; ad hoc optimization of poorly defined culture media) are important and will continue to be used in the food industry, but biotechnology must now include genomics to target artificial genes that follow the sequence rules of the genome at a precise position, adapted to the genome structure, and to modify intermediary metabolism while complying with the adapted niche of the organism, as revealed by its genome. As a complement to standard genetic engineering and transgenic technology, knowing the genome text has opened a whole new range of possibilities in food-product development, in particular enabling “humanization” of the content of food products (adaptation to the human metabolism, and even adaptation to sick or healthy conditions). These techniques provide an attractive means of producing healthier food ingredients and products that are currently not available or are very expensive. B. subtilis will remain a tool of choice in this respect.

1.2.2.7

Open Questions

The complete genome sequence of B. subtilis contains information that remains underutilized in the current prediction methods applied to gene functions, most of which are based on similarity searches of individual genes. In particular, it is now clear that the order of the genes in the chromosome is not random, and that some hot spots enable gene insertion without much damage whereas other regions are forbidden. For production of small molecules one must use higher-level information on metabolic pathways to reconstruct a complete functional unit from a set of genes. The reconstruction in silico of selected portions of metabolism using the existing biochemical knowledge of similar gene products has been undertaken. Although the core biosynthetic pathways of all twenty amino acids have been completely reconstructed in B. subtilis, many satellite or recycling pathways, in particular the synthesis of pyrimidines, have not yet been identified in sulfur and short-carbon-chain acid metabolism. Finally, there remain some 800 genes of completely unknown function in the genome of strain 168, including a few tens of “orphan” genes that have no counterpart in any known genome, and many more in the genome of related species. It remains to be understood whether they play an important role for biotechnological processes.

1.2.3

The Archaeon Archaeoglobus fulgidus

1.2.3.1

The Organism

Archaeoglobus fulgidus is a strictly anaerobic, hyperthermophilic, sulfate-reducing archaeon. It is the first sulfate-reducing organism for which the complete genome sequence has been determined and published [32]. Sulfate-reducing organisms are essential to the biosphere, because biological sulfate reduction is part of the global sulfur cycle. The ability to grow by sulfate reduction is restricted to a few groups of prokaryotes only. The Archaeoglobales are in two ways unique within this group:

1. they are members of the Archaea and therefore unrelated to all other sulfate reducers, which belong to the Bacteria; and

2. the Archaeoglobales are the only hyperthermophiles within the sulfate reducers, a feature which enables them to occupy extreme environments, for example hydrothermal fields and sub-surface oil fields.

The production of iron sulfide as an end product of high-temperature sulfate reduction by Archaeoglobus species contributes to oil-well “souring”, which causes corrosion of iron and steel in submarine oil- and gas-processing systems. A. fulgidus is also a model for hyperthermophilic organisms and for the Archaea, because it is only the second hyperthermophile whose genome has been completely deciphered (after Methanococcus jannaschii), and it is the third species of Archaea (after M. jannaschii and Methanobacterium thermoautotrophicum) whose genome has been completely sequenced and published.

A. fulgidus DSM4304 (Fig. 1.3) is the type strain of the Archaeoglobales [33]. Its glycoprotein-covered cells are irregular spheres (diameter 2 µm) with four distinct monopolar flagella. It grows not only organoheterotrophically, using a variety of carbon and energy sources, but also lithoautotrophically on hydrogen, thiosulfate, and carbon dioxide. Within the range 60–95 °C it grows best at 83 °C.

Fig. 1.3 Electron micrograph of A. fulgidus DSM4304 (strain VC-16), kindly provided by K.O. Stetter, University of Regensburg. The bar in the lower right corner represents 1 µm.


Before genome sequencing very little was known about the genomic organization of A. fulgidus. The first estimate of its genome size, obtained by use of pulsed field gel electrophoresis, was published after final assembly of the genome sequences had already been achieved. Because extra-chromosomal elements are absent from A. fulgidus, it was determined that the genome consists of only one circular chromosome. Although data about genetic or physical mapping of the genome were unknown before the sequencing project, a small-scale approach to physical mapping was performed late in the project for confirmation of the genome assembly. Sequences of only eleven genes from A. fulgidus had been published before the sequencing project started; these covered less than 0.7 % of the genome.

1.2.3.2

Structure of the Genome Project

The whole-genome random sequencing procedure was chosen as sequencing strategy for the A. fulgidus genome project. This procedure had previously been applied to four microbial genomes sequenced at The Institute for Genomic Research (TIGR): Haemophilus influenzae, Mycoplasma genitalium, M. jannaschii, and Helicobacter pylori [34]. Chromosomal DNA for the construction of libraries was prepared from a culture derived from a single cell isolated by means of optical tweezers and provided by K.O. Stetter. Three libraries were used for sequencing – two plasmid libraries (1.42 kbp and 2.94 kbp insert size) for mass sequence production and one large-insert λ-library (16.38 kbp insert size) for the genome scaffold. The initial random sequencing phase was performed with these libraries until 6.7-fold sequence coverage was achieved. At this stage the genome was assembled into 152 contigs separated by sequence gaps and five groups of contigs separated by physical gaps. Sequence gaps were closed by a combined approach of editing the ends of sequence traces and by primer walking on plasmid- and λ-clones spanning the gaps. Physical gaps were closed by direct sequencing of PCR fragments generated by combinatorial PCR reactions. Only 0.33 % of the genome (90 regions) was covered by a single sequence after the gap-closure phase. These regions were confirmed by additional sequencing reactions to ensure a minimum sequence coverage of two for the whole genome. The final assembly consisted of 29,642 sequencing runs which cover the genome sequence 6.8-fold.
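The coverage figures quoted above are linked by simple arithmetic: coverage equals the number of reads times the mean read length, divided by the genome size. The Python sketch below uses the 29,642 runs and 6.8-fold coverage from the text, together with the genome size reported in the results of the project (2,178,400 bp), to back-calculate the implied mean read length. The function names are hypothetical. The last function applies the idealized Lander-Waterman approximation (reads placed uniformly at random), which is an assumption added here and not part of the project description.

```python
import math

def coverage(n_reads, mean_read_len, genome_size):
    """Average fold coverage: total sequenced bases divided by genome size."""
    return n_reads * mean_read_len / genome_size

def implied_read_length(n_reads, fold_coverage, genome_size):
    """Mean read length implied by a read count and a reported coverage."""
    return fold_coverage * genome_size / n_reads

def expected_uncovered_fraction(fold_coverage):
    """Idealized fraction of bases hit by no read, exp(-c), for random reads."""
    return math.exp(-fold_coverage)

GENOME_SIZE = 2_178_400  # bp, A. fulgidus genome size from the text
N_READS = 29_642         # sequencing runs in the final assembly
FOLD = 6.8               # reported final coverage

mean_len = implied_read_length(N_READS, FOLD, GENOME_SIZE)
print(f"implied mean read length: {mean_len:.0f} bp")
print(f"coverage with 500 bp reads: {coverage(N_READS, 500, GENOME_SIZE):.2f}")
print(f"idealized uncovered fraction at 6.8x: {expected_uncovered_fraction(FOLD):.4f}")
```

The implied mean read length of roughly 500 bp is a derived estimate, not a figure stated in the text. Real shotgun libraries deviate from the random-placement model, which is one reason gap closure by primer walking and PCR was still required.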

The A. fulgidus genome project was financed by the US Department of Energy (DOE) within the Microbial Genome Program. This program financed several of the early microbial genome-sequencing projects performed at a variety of genome centers, for example M. jannaschii (TIGR), M. thermoautotrophicum (Genome Therapeutics), Aquifex aeolicus (Recombinant BioCatalysis, now DIVERSA), Pyrobaculum aerophilum (California Institute of Technology), Pyrococcus furiosus (University of Utah), and Deinococcus radiodurans (Uniformed Services University of the Health Sciences). Like the M. jannaschii project which was started one year earlier, the A. fulgidus genome was sequenced and analyzed in a collaboration between researchers at TIGR and Carl R. Woese and Gary J. Olsen at the Department of Microbiology at the University of Illinois, Champaign-Urbana. The plasmid libraries were constructed in Urbana, whereas the λ-library was constructed at TIGR. Sequencing and assembly were performed at TIGR using automated ABI sequencers and the TIGR Assembler, respectively. Confirmation of the assembly by mapping with large-size restriction fragments was performed in Urbana. Open reading frame (ORF) prediction and identification of functions, and the data mining and interpretation of the genome content, were performed jointly by both teams.

Coding regions in the final genome sequence were identified with a combination of two sets of ORF generated by programs developed by members of the two teams – GeneSmith, by H.O. Smith at TIGR and CRITICA, by G.J. Olsen and J.H. Badger in Urbana. The two sets of ORF identified by GeneSmith and CRITICA were merged into one consensus set containing all members of both initial sets. The amino acid sequences derived from the consensus set were compared with a non-redundant protein database using BLASTX. ORFs shorter than 30 codons were carefully inspected for database hits and eliminated when there was no significant database match. The results of the database comparisons were first inspected and categorized by TIGR’s microbial annotation team. This initial annotation database was then further analyzed and refined by a team of experts for all major biological role categories.
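The merging and filtering logic described above (union of two ORF sets, then removal of ORFs shorter than 30 codons that lack a database match) can be sketched in a few lines of Python. Everything in this sketch is hypothetical: the tuple representation of an ORF, the `consensus_orfs` function, and the `has_hit` callback standing in for a database search result. It does not reproduce the behavior or output formats of GeneSmith, CRITICA, or BLASTX, and in the project the short ORFs were inspected manually.

```python
def consensus_orfs(set_a, set_b, has_hit, min_codons=30):
    """Merge two ORF sets and drop short ORFs without a database match.

    Each ORF is a (start, end, strand) tuple with inclusive coordinates.
    `has_hit` is a callable returning True when the ORF has a significant
    database match (a stand-in for a real similarity search).
    """
    merged = set(set_a) | set(set_b)  # union keeps all members of both sets
    kept = []
    for orf in sorted(merged):
        start, end, _strand = orf
        n_codons = (end - start + 1) // 3
        if n_codons >= min_codons or has_hit(orf):
            kept.append(orf)
    return kept

# Toy predictions from two hypothetical gene finders.
finder_a = [(1, 300, 1), (400, 460, 1)]
finder_b = [(1, 300, 1), (500, 560, -1)]
orfs_with_hits = {(500, 560, -1)}

consensus = consensus_orfs(finder_a, finder_b, lambda orf: orf in orfs_with_hits)
print(consensus)
```

In the toy data the 20-codon ORF at 400 to 460 is removed because it has no match, whereas the equally short ORF at 500 to 560 is kept because it does.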

The sequencing strategy chosen for the A. fulgidus genome project has some advantages compared with alternative strategies applied in genome research:

1. Given the relatively large set of automated sequencers available at TIGR, the whole-genome random sequencing procedure is much faster than any strategy that includes a mapping step before the sequencing phase;

2. Within the DOE Microbial Genome Program the TIGR strategy and the sequencing technology used for the M. jannaschii and A. fulgidus genome projects proved to be clearly superior in competition with projects based on multiplex sequencing (M. thermoautotrophicum and P. furiosus), by finishing two genomes in less time than the competing laboratories needed for one genome each; and

3. The interactive annotation with a team of experts for the organism and for each biological category ensured a more sophisticated final annotation than any automated system could achieve at that time.

1.2.3.3

Results from the Genome Project

Although the initial characterization of the genome revealed all its basic features, annotation of biological functions for the ORF will continue to be updated for new functions identified either in A. fulgidus or for homologous genes characterized in other organisms. The size of the A. fulgidus genome was determined to be 2,178,400 bp, with an average G + C content of 48.5 %. Three regions with low G + C content (<39 %) were identified, two of which encode enzymes for lipopolysaccharide biosynthesis. The two regions with the highest G + C content (>53 %) contain the ribosomal RNA and proteins involved in heme biosynthesis. With the bioinformatics tools available when genome characterization was complete, no origin of replication could be identified. The genome contains only one set of genes for ribosomal RNA. Other RNA encoded in A. fulgidus are 46 species of tRNA (five of them with introns 15–62 bp long; no significant tRNA clusters), 7S RNA, and RNase P. Altogether 0.4 % of the genome is covered by genes for stable RNA. Three regions with short (<49 bp) non-coding repeats (42–60 copies) were identified. All three repeated sequences are similar to short repeated sequences found in M. jannaschii [35]. Nine classes of long, coding repeats (>95 % sequence identity) were identified within the genome; three of them might represent IS elements, and three other repeats encode conserved hypothetical proteins found previously in other genomes. The consensus set of ORF contains 2436 members with an average length of 822 bp, similar to M. jannaschii (856 bp), but shorter than in most bacterial genomes (average 949 bp). With 1.1 ORF per kbp, the gene density seems to be slightly higher than in other microbial genomes, although the fraction of the genome covered by protein-coding genes (92.2 %) is comparable with that for other genomes. The elevated number of ORF per kbp might be artificial, because of a lack of stop codons in high G + C organisms. Predicted start codons are 76 % ATG, 22 % GTG, and 2 % TTG. No inteins were identified in the genome. The isoelectric point of the predicted proteins in A. fulgidus is rather low (median pI is 6.3); for other prokaryotes distributions peak between 5.5 and 10.5. Putative functions could be assigned to about half of the predicted ORF (47 %) by significant matches in database searches. One quarter (26.7 %) of all ORF are homologous to ORF previously identified in other genomes (“conserved hypotheticals”), whereas the remaining quarter (26.2 %) of the ORF in A. fulgidus seem to be unique, without any significant database match. A. fulgidus contains an unusually large number of paralogous gene families: 242 families with 719 members (30 % of all ORF). This might explain why the genome is larger than most other archaeal genomes (average approximately 1.7 Mbp). Interestingly, one third of the identified families (85 out of 242) have no single member for which a biological function could be predicted. The largest families contain genes assigned to “energy metabolism”, “transporters”, and “fatty acid metabolism”.
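Several of the summary statistics above follow directly from the raw counts, and recomputing them is a quick consistency check. The Python sketch below uses only numbers quoted in the text; the variable names are arbitrary, and the coding-fraction estimate assumes that ORF do not overlap, which is a simplification.

```python
GENOME_SIZE_BP = 2_178_400   # genome size
N_ORF = 2436                 # ORF in the consensus set
MEAN_ORF_LEN_BP = 822        # average ORF length
PARALOG_MEMBERS = 719        # ORF belonging to paralogous families
N_FAMILIES = 242             # paralogous gene families
FAMILIES_NO_FUNCTION = 85    # families without any predicted function

# Gene density in ORF per kbp of genome sequence.
orf_per_kbp = N_ORF / (GENOME_SIZE_BP / 1000)

# Fraction of the genome covered by ORF, assuming no overlaps.
coding_fraction = N_ORF * MEAN_ORF_LEN_BP / GENOME_SIZE_BP

# Share of ORF in paralogous families, and share of families with no function.
paralog_fraction = PARALOG_MEMBERS / N_ORF
unknown_family_fraction = FAMILIES_NO_FUNCTION / N_FAMILIES

print(f"ORF per kbp:             {orf_per_kbp:.2f}")
print(f"coding fraction:         {coding_fraction:.1%}")
print(f"ORF in paralog families: {paralog_fraction:.1%}")
print(f"families, no function:   {unknown_family_fraction:.1%}")
```

The recomputed values (about 1.1 ORF per kbp, about 92 % coding, about 30 % of ORF in paralogous families, and about one third of families without a predicted function) agree with the rounded figures given in the text.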

The genome of A. fulgidus is neither the first archaeal genome to be sequenced completely nor is it the first genome of a hyperthermophilic organism. The novelties for both features had already been reported together with the genome of M. jannaschii [35]. A. fulgidus is, however, the first sulfate-reducing organism whose genome was completely deciphered. The next genome of a sulfate reducer followed almost seven years later with that of Desulfotalea psychrophila, an organism whose optimum growth temperature is 75 °C lower than that of A. fulgidus [36]. Model findings in respect of sulfur and sulfate metabolism were not expected from the genome, because sulfate metabolism had already been heavily studied in A. fulgidus before the genome project. The genes for most enzymes involved in sulfate reduction were already published, and new information from the genome confirmed only that the sulfur oxide reduction systems in Archaea and Bacteria were very similar. The single most exciting finding in the genome of A. fulgidus was identification of multiple genes for acetyl-CoA synthase and the presence of 57 β-oxidation enzymes. It has been reported that the organism is incapable of growth on acetate [37], and no system for β-oxidation has previously been described in the Archaea. It appears now that A. fulgidus can gain energy by degradation of a variety of hydrocarbons and organic acids, because genes for at least five types of ferredoxin-dependent oxidoreductases and at least one lipase were also identified. Interestingly, at about the same time as the unexpected genes for metabolizing enzymes were identified, it was also reported that a close relative of A. fulgidus is able to grow on olive oil (K. O. Stetter, personal communication), a feature that would require the presence of the genes just identified in A. fulgidus. On the other hand, not all genes necessary for the pathways described in the organism could be identified. Glucose has been described as a carbon source for A. fulgidus [38], but neither an uptake transporter nor a catabolic pathway for glucose could be identified in the genome. There is still a chance that the required genes are hidden in the pool of functionally uncharacterized ORFs. Other interesting findings in respect of the biology of A. fulgidus concern sensory functions and regulation of gene expression. A. fulgidus seems to have complex sensory and regulatory networks – a major difference from results reported for M. jannaschii – consistent with its extensive energy-producing metabolism and versatile system for carbon utilization. These networks contain more than 55 proteins with presumed regulatory functions and several iron-dependent repressor proteins. At least 15 signal-transducing histidine kinases were identified, but only nine response regulators.

1.2.3.4

Follow-up Research

Almost 30 papers about A. fulgidus were published between the initial description of the organism in 1987 and the genome sequence ten years later. In the six years since the genome was finished, A. fulgidus has been mentioned in the title or abstract of more than 165 research articles. Although functional genomics is now a hot topic at many scientific meetings and discussions, A. fulgidus does not seem to be a prime candidate for such studies. So far not a single publication has dealt with proteomics, transcriptomics, or serial mutagenesis in this organism. The recently published papers that refer to the A. fulgidus genome sequence fall into three almost equally represented categories: comparative genomics, structure of A. fulgidus proteins, and characterization of expressed enzymes. One of the most interesting follow-up stories is probably that of the flap endonucleases. In October 1998 Hosfield et al. described newly discovered archaeal flap endonucleases (FEN) from A. fulgidus, M. jannaschii, and P. furiosus with a structure-specific mechanism for DNA substrate binding and catalysis resembling that of human flap endonuclease [39]. In spring 1999 Lyamichev et al. showed how FEN could be used for polymorphism identification and quantitative detection of genomic DNA by invasive cleavage of oligonucleotide probes [40]. In May 2000 Cooksey et al. described an invader assay based on A. fulgidus FEN that enables linear signal amplification for identification of mutations [41]. This procedure could eventually become important as a non-PCR-based procedure for SNP detection. The identification of 86 candidates for small non-messenger RNA by Tang et al. [42] and identification of the unusual organization of the putative replication origin region by Maisnier-Patin et al. [43] also mark noteworthy progress in our knowledge about the model organism A. fulgidus.

1.3

Genome Projects of Selected Eukaryotic Model Organisms

1.3.1

The Budding Yeast Saccharomyces cerevisiae

1.3.1.1

Yeast as a Model Organism

The budding yeast, Saccharomyces cerevisiae (Fig. 1.4), can be regarded as one of the most important fungal organisms used in biotechnological processes. It owes its name to its ability to ferment saccharose, and has served mankind for several thousand years in the making of bread and alcoholic beverages. The introduction of yeast as an experimental system dates back to the 1930s, and it has attracted increasing attention ever since. Unlike more complex eukaryotes, yeast cells can be grown on defined media, giving the investigator complete control over environmental conditions. The elegance of yeast genetics and the ease of manipulation of yeast have substantially contributed to the explosive growth in yeast molecular biology. This success is also a consequence of the remarkable extent to which basic biological processes have been conserved throughout eukaryotic life, which makes yeast a unique unicellular model organism in which cell architecture and fundamental cellular mechanisms can be successfully investigated. No wonder, then, that yeast again reached the forefront of experimental molecular biology by being the first eukaryotic organism for which the entire genome sequence became available [44, 45]. The wealth of sequence information obtained in the yeast genome project was found to be extremely useful as a reference against which sequences of human, animal, or plant genes could be compared.

The first genetic map of S. cerevisiae was published by Lindegren in 1949 [46]; many revisions and refinements have appeared since. At the outset of the sequencing project approximately 1200 genes had been mapped, and detailed biochemical knowledge had accumulated about a similar number of genes encoding either RNA or protein products [47]. The existence of 16 chromosomes ranging in size between 250 and ~2500 kb was firmly established when it became feasible to separate all chromosomes by pulsed-field gel electrophoresis (PFGE). This also made it possible to define “electrophoretic karyotypes” of strains by sizing chromosomes [48]. Because of chromosome length polymorphisms and chromosomal rearrangements, karyotypes differ not only among laboratory strains but also among industrial strains. A defined laboratory strain (αS288C) was therefore chosen for the yeast sequencing project.

1.3.1.2

The Yeast Genome Sequencing Project

The yeast sequencing project was initiated in 1989 within the framework of EU biotechnology programs. It was based on a network approach in which 35 European laboratories were initially involved, and chromosome III, the first eukaryotic chromosome ever to be sequenced, was completed in 1992. In the following years, with many more laboratories engaged, the European network tackled the sequencing of further complete chromosomes. Soon after its beginning, laboratories in other parts of the world joined the project to sequence other chromosomes or parts thereof, which turned the project into a coordinated international enterprise. Finally, more than 600 scientists in Europe, North America, and Japan became involved in this effort. The sequence of the entire yeast genome was completed in early 1996 and released to public databases in April 1996.

Although the sequencing of chromosome III started from a collection of overlapping plasmid or phage lambda clones, it was expected that cosmid libraries would subsequently have to be used to aid large-scale sequencing [49]. Assuming an average insert length of 35–40 kb, a cosmid library containing 4600 random clones would represent the yeast genome at approximately twelve times the genome equivalent. The advantages of cloning DNA segments in cosmids were obvious: clones were found to be stable for many years, and the small number of clones was advantageous in setting up ordered yeast cosmid libraries or sorting out and mapping chromosome-specific sublibraries. To facilitate sequencing and assembly of the sequences, high-resolution physical maps of the chromosomes to be sequenced were constructed by application of classical mapping methods (fingerprints, cross-hybridization) or by novel methods developed for this program, for example site-specific chromosome fragmentation [50] or a high-resolution cross-hybridization matrix.

Fig. 1.4 Micrograph of the budding yeast Saccharomyces cerevisiae during spore formation. The cell wall, nucleus, vacuole, mitochondria, and spindle are indicated.
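The library-size estimate for the cosmid approach can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the 12.8 Mb genome size quoted elsewhere in this chapter, and the function names are hypothetical. The second function uses the standard Clarke-Carbon relation for random libraries, which is not taken from the text.

```python
def library_coverage(n_clones: int, insert_kb: float, genome_kb: float) -> float:
    """Library size in genome equivalents: total cloned DNA / genome size."""
    return n_clones * insert_kb / genome_kb


def probability_locus_present(n_clones: int, insert_kb: float, genome_kb: float) -> float:
    """Clarke-Carbon estimate (an added assumption, not from the text):
    chance that a given locus is in at least one of n random clones."""
    fraction = insert_kb / genome_kb
    return 1.0 - (1.0 - fraction) ** n_clones


GENOME_KB = 12_800  # assumed yeast genome size (12.8 Mb)

for insert_kb in (35, 40):
    cov = library_coverage(4600, insert_kb, GENOME_KB)
    p = probability_locus_present(4600, insert_kb, GENOME_KB)
    print(f"{insert_kb} kb inserts: {cov:.1f} genome equivalents, P(locus present) = {p:.6f}")
```

With 35 kb inserts the coverage comes out at about 12.6 genome equivalents, which agrees with the "approximately twelve times" stated in the text; 40 kb inserts give a somewhat higher value.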

In the European network chromosome-specific clones were distributed to the collaborating laboratories according to a scheme worked out by the DNA coordinators. Each contracting laboratory was free to apply sequencing strategies and techniques of its own, provided that the sequences were entirely determined on both strands and unambiguous readings were obtained. Two principal approaches were used to prepare subclones for sequencing:

1. generation of sublibraries by use of a series of appropriate restriction enzymes or from nested deletions of appropriate subfragments made by exonuclease III; and

2. generation of shotgun libraries from whole cosmids or subcloned fragments by random shearing of the DNA.

Sequencing by the Sanger technique was either performed manually, labeling with [35S]dATP being the preferred method of monitoring, or by use of automated devices (on-line detection with fluorescence labeling or a direct blotting electrophoresis system) following a variety of established procedures. Similar procedures were applied to the sequencing of the chromosomes contributed by the Sanger laboratory and by laboratories in the USA, Canada, and Japan. The American laboratories largely relied on machine-based sequencing.

Because of their repetitive substructure and the lack of appropriate restriction sites, the yeast chromosome telomeres were a particular problem. Conventional cloning procedures were successful in only a few cases. Telomeres were usually physically mapped relative to the terminal-most cosmid inserts by using the chromosome fragmentation procedure [50]. The sequences were then determined from specific plasmid clones obtained by “telomere trap cloning”, an elegant strategy developed by Louis and Borts [51].

Within the European network, all original sequences were submitted by the collaborating laboratories to the Martinsried Institute of Protein Sequences (MIPS), which acted as an informatics center. They were kept in a data library and assembled into progressively growing contigs, and the final chromosome sequences were derived in collaboration with the DNA coordinators. Quality controls were performed by anonymous resequencing of selected regions and of suspected or difficult zones (a total of 15–20 % per chromosome). Similar procedures were used for sequence assembly and quality control in the other laboratories. In recent years further quality controls were carried out and resulted in nearly absolute accuracy of the total sequence.

The sequences of the chromosomes were subjected to analysis by computer algorithms, identifying ORFs and other genetic entities, and monitoring compositional characteristics of the chromosomes (base composition, nucleotide pattern frequencies, GC profiles, ORF distribution profiles, etc.). Because the intron splice site/branchpoint pairs in yeast are highly conserved, they could be detected by using defined search patterns. It was finally found that only 4 % of the yeast genes contain (mostly short) introns. Centromere and telomere regions, tRNA genes, sRNA genes, and the retrotransposons were sought by comparison with previously characterized datasets or by use of appropriate search programs. All putative proteins were annotated by using previously established yeast data and by evaluating searches for similarity to entries in the databases or protein signatures detected by using the PROSITE dictionary.
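To make the idea of computational ORF identification concrete, the following minimal sketch scans all six reading frames of a DNA string for ATG-to-stop stretches. It is an illustration only, not the software used in the project; the function names and the 100-codon default cutoff are assumptions, and introns are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 100):
    """Yield (strand, frame, start, end) for each ORF that runs from the first
    ATG after a stop to the next in-frame stop codon.

    Coordinates are 0-based on the scanned strand; `end` is exclusive and
    includes the stop codon. `min_codons` counts codons before the stop.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3
                    start = None                   # close it and keep scanning


# Toy example: a 6-codon ORF followed by a stop codon.
toy = "ATG" + "GCT" * 5 + "TAA"
print(list(find_orfs(toy, min_codons=3)))
```

Real annotation pipelines add splice-site detection, codon-usage statistics, and similarity searches on top of such a scan; the sketch covers only the first step.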

1.3.1.3

Life with Some 6000 Genes

With its 12.8 Mb, the yeast genome is approximately a factor of 250 smaller than the human genome. The complete genome sequence now defines some 6000 ORFs which are likely to encode specific proteins in the yeast cell. A protein-encoding gene is found every 2 kb in the yeast genome, with nearly 70 % of the total sequence consisting of ORFs. This leaves only limited space for the intergenic regions, which can be thought to harbor the major regulatory elements involved in chromosome maintenance, DNA replication, and transcription. The genes are usually rather evenly distributed between the two strands of each chromosome, although arrays of more than eight genes that are transcriptionally oriented in the same direction can occasionally be found. With a few exceptions, transcribed genes on complementary strands do not overlap, and no “genes-in-genes” are observed. Although the intergenic regions between two consecutive ORFs are sometimes extremely short, the ORFs are normally maintained as separate units and not coupled for transcription. In “head-to-head” gene arrangements the intervals between the divergently transcribed genes might be interpreted to mean that their expression is regulated in a concerted fashion involving the common promoter region. This, however, seems not to be true for most genes and seems to be a principle reserved for a few cases. The sizes of the ORFs vary from 100 to more than 4000 codons; less than 1 % is estimated to be below 100 codons. In addition, the yeast genome contains some 120 ribosomal RNA genes in a large tandem array on chromosome XII, 40 genes encoding small nuclear RNA (sRNA), and 275 tRNA genes (belonging to 43 families) which are scattered throughout the genome. Overall, the yeast genome is remarkably poor in repeated sequences, except for the transposable elements (Tys), which account for approximately 2 % of the genome and, because of their genetic plasticity, are the major source of polymorphisms between different strains. Finally, the sequences of non-chromosomal elements, for example the 6 kb of the 2µ plasmid DNA, the killer plasmids present in some strains, and the yeast mitochondrial genome (ca. 75 kb), must be considered.
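The headline numbers in this paragraph are mutually consistent, as a short calculation shows. The sketch below uses the figures quoted in the text (12.8 Mb, some 6000 ORFs, nearly 70 % coding); the human genome size of about 3200 Mb is an added assumption, and the function name is hypothetical.

```python
def yeast_genome_summary(genome_bp: int = 12_800_000,
                         n_orfs: int = 6_000,
                         coding_fraction: float = 0.70,
                         human_genome_bp: int = 3_200_000_000) -> dict:
    """Derive simple density figures from the genome statistics in the text.

    human_genome_bp is an assumption (roughly 3200 Mb); the other defaults
    are the values quoted in this section.
    """
    mean_orf_bp = coding_fraction * genome_bp / n_orfs
    return {
        "kb_per_gene": genome_bp / n_orfs / 1000,          # average gene spacing
        "mean_orf_bp": mean_orf_bp,                        # implied mean ORF length
        "mean_orf_codons": mean_orf_bp / 3,
        "human_to_yeast_ratio": human_genome_bp / genome_bp,
    }


for key, value in yeast_genome_summary().items():
    print(f"{key}: {value:.1f}")
```

The result is one gene per roughly 2.1 kb and an implied mean ORF length of about 500 codons, which sits comfortably within the 100 to 4000 codon range given above.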

On completion of the yeast genome sequence it became possible for the first time to define the proteome of a eukaryotic cell. Detailed information was laid down in inventory databases, and most of the proteins could be classified according to function. It was seen that almost 40 % of the proteome consisted of membrane proteins, and that an estimated 8 to 10 % of nuclear genes encode mitochondrial functions. It came as an initial surprise that no function could be attributed to approximately 40 % of the yeast genes. However, even with the exponential growth of entries in protein databases and the refinement of in silico analyses, this figure could not be reduced substantially. The same was observed for all other genomes that have since been sequenced. As an explanation, we have to envisage that a considerable portion of every genome is reserved for species- or genus-specific functions.

An interesting observation made for the first time was the occurrence of regional variations of base composition with similar amplitudes along the chromosomes. Analysis of chromosomes III and XI revealed almost regular periodicity of the GC content, with a succession of GC-rich and GC-poor segments of ~50 kb each. Another interesting observation was that the compositional periodicity correlated with local gene density. Profiles obtained from similar analyses of chromosomes II and VIII again showed these phenomena, albeit with less pronounced regularity. Similar compositional variation has been found along the arms of other chromosomes, with pericentromeric and subtelomeric regions being AT-rich, though spacing between GC-rich peaks is not always regular. Usually, however, there is a broad correlation between high GC content and high gene density.
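Regional variation of base composition of this kind is usually visualized with a sliding-window GC profile. The following is a minimal sketch of that technique, not the analysis actually used for the yeast chromosomes; the function name and the default window and step sizes are assumptions, chosen to be smaller than the ~50 kb segments so that the alternation would remain visible.

```python
def gc_profile(seq: str, window: int = 10_000, step: int = 5_000):
    """Return a list of (window_start, gc_fraction) along a DNA sequence.

    Only complete windows are reported; characters other than G and C
    (including N) count as non-GC.
    """
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        profile.append((start, gc / window))
    return profile


# Toy example with a tiny window so the result can be checked by eye.
print(gc_profile("GGCCAATT", window=4, step=2))
# [(0, 1.0), (2, 0.5), (4, 0.0)]
```

Plotting the second element of each tuple against the first for a whole chromosome would give a GC profile; overlaying ORF positions on the same axis is one way to inspect the correlation with gene density described above.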

Comparison of all yeast sequences revealed that there is substantial internal genetic redundancy in the yeast genome, which at the protein level is approximately 40 %. Whereas an estimate of sequence similarity (at both the nucleotide and the amino acid level) is highly predictive, it is still difficult to correlate these values with functional redundancy. Interestingly, the same phenomenon has since been observed in all other genomes sequenced. Gene duplications in yeast are of different types. In many instances, the duplicated sequences are confined to nearly the entire coding region of these genes and do not extend into the intergenic regions. Thus, the corresponding gene products share high similarity in terms of amino acid sequence, and sometimes are even identical, and, therefore, might be functionally redundant. As suggested by sequence differences within the promoter regions or demonstrated experimentally, however, expression varies. It is possible that one gene copy is highly expressed whereas another is poorly expressed. Turning on or off expression of a particular copy within a gene family might depend on the differentiated status of the cell (for example mating type, sporulation, etc.). Biochemical studies also revealed that in particular instances “redundant” proteins can substitute for each other, thus accounting for the observation that a large number of single-gene disruptions in yeast do not impair growth or cause “abnormal” phenotypes. This does not imply, however, that these “redundant” genes are a priori dispensable. Rather, they might have arisen from the need to help adapt yeast cells to particular environmental conditions.

Subtelomeric regions in yeast are rich in duplicated genes which are of functional importance to carbohydrate metabolism or cell-wall integrity, but there is also a great variety of (single) genes internal to chromosomes that seem to have arisen from duplications. An even more surprising phenomenon became apparent when the sequences of complete chromosomes were compared with each other. This revealed 55 large chromosome segments (up to 170 kb) in which homologous genes are arranged in the same order, with the same relative transcriptional orientations, on two or more chromosomes [52]. The genome has continued to evolve since this ancient duplication occurred: genes have been inserted or deleted, and Ty elements and introns have been lost and gained between the two sets of sequences. If optimized for maximum coverage, up to 40 % of the yeast genome is found to be duplicated in clusters, not including Ty elements and subtelomeric regions. No observed clusters overlap, and intra- and interchromosomal cluster duplications have similar probabilities.

The availability of the complete yeast genome sequence not only provided further insight into genome organization and evolution in yeast but also offered a reference to search for orthologs in other organisms. Of particular interest were those genes that are homologs of genes that perform differentiated functions in multicellular organisms or that might be of relevance to malignancy. Comparing the catalog of human sequences available in the databases with the yeast ORFs reveals that more than 30 % of yeast genes have homologs among human genes of known function. Approximately 100 yeast genes are significantly similar to human disease genes [53], and some of the latter could even be predicted from comparison with the yeast genes.

1.3.1.4

The Yeast Postgenome Era

It was evident to anyone engaged in the project that determination of the entire sequence of the yeast genome should only be regarded as a prerequisite for functional studies of the many novel genes to be detected. Thus, a European functional analysis network (EUROFAN) was initiated in 1995, and similar activities were started in the international collaborating laboratories in 1996. The general goal was to systematically investigate yeast genes of unknown function by use of the following approaches:

1. improved data analysis by computer (in silico analysis);

2. systematic gene disruptions and gene overexpression;

3. analysis of phenotypes under different growth conditions, for example temperature, pH, nutrients, stress;

4. systematic transcription analysis by conventional methods; gene expression under different conditions;

5. in situ cellular localization and movement of proteins by the use of tagged proteins (GFP-fusions);

6. analysis of gene expression under different conditions by 2D gel-electrophoresis of proteins; and

7. complementation tests with genes from other organisms.

In this context, a most compelling approach is the genome-wide analysis of gene expression profiles by chip technology. High-density micro-arrays of all yeast ORFs were the first to be successfully used in studying various aspects of a transcriptome [54, 55]. Comprehensive and regularly updated information can be found in the yeast Genome Database (http://www.yeastgenome.org).

Now that the entire sequence of a laboratory strain of S. cerevisiae is available, the complete sequences of other yeasts of industrial or medical importance are within our reach. Such knowledge would considerably accelerate the development of productive strains needed in other areas (e.g. Kluyveromyces, Yarrowia) or the search for novel anti-fungal drugs. It might even be unnecessary to finish the entire genome if a yeast or fungal genome has substantial synteny with that of S. cerevisiae. A special program devoted to this problem, analysis of hemiascomycetes yeast genomes by tag-sequence studies for the approach of speciation mechanisms and preparation of tools for functional genomics, has recently been finalized by a French consortium [56].

In all, the yeast genome project has demonstrated that an enterprise like this can successfully be conducted in “small steps” and by teamwork. Clearly, the wealth of fresh and biologically relevant information collected from yeast sequences and from functional analyses has had an impact on other large-scale sequencing projects.

1.3.2

The Plant Arabidopsis thaliana

1.3.2.1

The Organism

Arabidopsis thaliana (Fig. 1.5) is a small cruciferous plant of the mustard family first described by the German physician Johannes Thal in 1577 in his Flora Book of the Harz mountains, Germany, and later named after him. In 1907 A. thaliana was recognized to be a versatile tool for genetic studies by Laibach [57] when he was a student with Strasburger in Bonn, Germany. More than 30 years later, in 1943, then Professor of Botany in Frankfurt, he published an influential paper [58] clearly describing the favorable features that make this plant a true model organism: (1) short generation time of only two months, (2) high seed yield, (3) small size, (4) simple cultivation, (5) self fertilization, but (6) easy crossing yielding fully fertile hybrids, (7) only five chromosomes in the haploid genome, and (8) the possibility of isolating spontaneous and induced mutants. An attempt by Rédei in the 1960s to convince funding agencies to develop Arabidopsis as a plant model system was unsuccessful, mainly because geneticists at that time had no access to genes at the molecular level and therefore no reason to work with a plant irrelevant to agriculture.

With the development of molecular biology two further properties of the A. thaliana genome made this little weed the superior choice as an experimental system. Laibach had already noted in 1907 that A. thaliana contained only one third of the chromatin of related Brassica species [57]. Much later it became clear that this weed has: (1) the smallest genome of any higher plant [61] and (2) a small amount of repetitive DNA. Within the plant kingdom, which is characterized by large variation in genome size (see, for example, http://www.rbgkew.org.uk/cval/database1.html), mainly because of different amounts of repetitive DNA, these features support efficient map-based cloning of genes for a detailed elucidation of their function at the molecular level. A first set of tools for that purpose became available during the 1980s with a comprehensive genetic map containing 76 phenotypic markers obtained by mutagenesis, RFLP maps, Agrobacterium-mediated transformation, and cosmid and yeast artificial chromosome (YAC) libraries covering the genome severalfold with only a few thousand clones.

Fig. 1.5 The model plant Arabidopsis thaliana. (a) Adult plant, approx. height 20 cm [59]; (b) flower, approx. height 4 mm [60]; (c) chromosome plate showing the five chromosomes of Arabidopsis [57].

It was soon realized that projects involving resources shared by many laboratories would profit from centralized collection and distribution of stocks and related information. “The Multinational Coordinated Arabidopsis thaliana Genome Research Project” was launched in 1990 and, as a result, two stock centers in Nottingham, UK, and Ohio, USA, and the Arabidopsis database at Boston, USA [62] were created in 1991. With regard to seed stocks these centers continued an effort already started by Laibach in 1951 [59] and carried on by Röbbelen and Kranz. The new, additional collections of clones and clone libraries later provided the basis for the genome sequencing project. With increased use of the internet at that time the database soon became a central tool for data storage and distribution, and it has served the community ever since as a one-stop source of information, despite its move to Stanford and its restructuring to become “The Arabidopsis Information Resource” (TAIR; http://www.arabidopsis.org).

In subsequent years many research tools were improved and new ones were added: mutant lines based on insertions of T-DNA and transposable elements were created in large numbers; random cDNA clones were partially sequenced; maps became available for different types of molecular marker such as RAPD, CAPS, microsatellites, AFLP, and SNP, and were integrated with each other; recombinant inbred lines were established to facilitate the mapping process; a YAC library with large inserts and BAC and P1 libraries were constructed; physical maps based on cosmids, YAC, and BAC were built; and tools for their display were developed.

1.3.2.2

Structure of the Genome Project

In August 1996 the stage was prepared for large-scale genome sequencing. At a meeting in Washington DC representatives of six research consortia from North America, Europe, and Japan launched the “Arabidopsis Genome Initiative”. They set the goal of sequencing the complete genome by the year 2004 and agreed on the strategy, the distribution of tasks, and guidelines for the creation and publication of sequence data. The genome of the ecotype Columbia was chosen for sequencing, because all large insert libraries had been prepared from this line and, besides Landsberg erecta (Ler), it was one of the most prominent ecotypes for all kinds of experiment worldwide. Because Ler is actually a mutant isolated from an inhomogeneous sample of the Landsberg ecotype after X-ray irradiation [63], it was not suitable to serve as a model genome. The sequencing strategy rested on BAC and P1 clones, from which DNA can be isolated more efficiently than from YAC and which, on average, contain larger inserts than cosmids. This strategy was chosen even though most attempts to create physical maps had been based on cosmid and YAC clones at that time and initial sequencing efforts had employed mostly cosmids. BAC and P1 clones for sequencing were chosen via hybridization to YAC and molecular markers of known and well-separated map positions. Later, information from BAC end sequences and from fingerprint and hybridization data, created while genome sequencing was already in progress, was used to minimize redundant sequencing caused by clone overlap. This multinational effort has been very fruitful and led to complete sequences for two of the five Arabidopsis chromosomes, chromosomes 2 [64] and 4 [65], except for their rDNA repeats and the heterochromatic regions around their centromeres. The genomic sequence was completed by the end of the year 2000, more than three years ahead of the original timetable. Sequences of the mitochondrial [66] and the plastid genome [67] have also been determined, so complete genetic information is available for Arabidopsis.

1.3.2.3

Results from the Genome Project

The sequenced chromosomes have yielded no surprises with regard to their structural organization. With the exception of one, the more than one hundred sequenced markers have been observed in the expected order. Repetitive elements and repeats of transposable elements are concentrated in the heterochromatic regions around the centromeres, where gene density and recombination frequency are below the average (22 genes/100 kbp, 1 cM/50–250 kbp). With a few minor exceptions these average values do not vary substantially in other regions of the chromosomes, which is in sharp contrast with larger plant genomes [68–70]. In addition, genomes such as that of maize do contain repetitive elements and transposons interspersed with genes [71], so Arabidopsis is certainly not a model for the structure of large plant genomes.

From the sequences available it has been calculated that the 120 Mbp gene-containing part of the nuclear genome of Arabidopsis contains approximately 25,000 genes (chr. 2: 4037, chr. 4: 3744 annotated genes; [72]), whereas the mitochondrial and plastid genomes carry only 57 and 78 genes on 366,924 and 154,478 bp of DNA, respectively. Most of the organellar proteins must therefore be encoded in the nucleus and are targeted to their final destinations via N-terminal transit peptides. It has recently been estimated that 10 or 14 % of the nuclear genes encode proteins located in mitochondria or plastids, respectively [73]. Only for a fraction of the predicted plastid proteins could homologous cyanobacterial proteins be identified [64, 65]; even so, lateral gene transfer from the endosymbiotic organelle to the nucleus has been assumed to be the main source of organellar proteins. These data indicate that either the large evolutionary distance between plants and cyanobacteria prevents the recognition of orthologs and/or many proteins from other sources in the eukaryotic cell have acquired plastid transit peptides, as suggested earlier on the basis of Calvin cycle enzymes [74]. A substantial number of proteins without predicted transit peptides but with higher homology to proteins from cyanobacteria than to those of any other organism have also been recognized [64, 65], indicating that plastids, at least, have contributed a significant part of the protein complement of other compartments. These data clearly show that plant cells have become highly integrated genetic systems during evolution, assembled from the genetic material of the eukaryotic host and two endosymbionts [75]. That this system integration is an ongoing process is revealed by the many small fragments of plastid origin in the nuclear genome [76] and the unexpected discovery of a recent gene-transfer event from the mitochondrial to the nuclear genome. Within the genetically defined centromere of chromosome 2, a 270-kbp fragment 99 % identical with the mitochondrial genome has been identified and its location confirmed via PCR across the junctions with unique nuclear DNA [64]. For a comprehensive list of references on Arabidopsis thaliana readers are referred to Schmidt [77]. This contribution cites only articles not listed by Schmidt or which are of utmost importance to the matters discussed here.
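The gene counts and genome sizes quoted at the start of this paragraph can be turned into comparable densities with a short calculation. The sketch below is an illustration under the figures given in the text; the dictionary layout and function name are hypothetical, and the counts of organelle-targeted proteins are simple percentages of the approximate 25,000 nuclear genes, not independent data.

```python
# (gene count, size in bp) as quoted in the text
GENOMES = {
    "nuclear (gene-containing part)": (25_000, 120_000_000),
    "mitochondrial": (57, 366_924),
    "plastid": (78, 154_478),
}


def genes_per_100_kbp(n_genes: int, size_bp: int) -> float:
    """Gene density expressed as genes per 100 kbp."""
    return n_genes / size_bp * 100_000


for name, (n_genes, size_bp) in GENOMES.items():
    print(f"{name}: {genes_per_100_kbp(n_genes, size_bp):.1f} genes/100 kbp")

# Nuclear genes predicted to encode organellar proteins (10 % and 14 %).
NUCLEAR_GENES = 25_000
mito_targeted = NUCLEAR_GENES * 10 // 100
plastid_targeted = NUCLEAR_GENES * 14 // 100
print(f"mitochondria-targeted: ~{mito_targeted}, plastid-targeted: ~{plastid_targeted}")
```

On these figures the nucleus would encode roughly 2500 mitochondrial and 3500 plastid proteins, compared with 57 and 78 genes retained in the organelles themselves, which illustrates why most organellar proteins must be nuclear-encoded.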


1.3.2.4

Follow-up Research in the Postgenome Era

Potential functional assignments can be made for up to 60 % of the predicted proteins on the basis of sequence comparisons, identification of functional domains and motifs, and structural predictions. Interestingly, 65 % of the proteins have no significant homology with proteins of the completely sequenced genomes of bacteria, yeast, C. elegans, and D. melanogaster [64], clearly reflecting the large evolutionary distance of the plant from other kingdoms and the independent development of multicellularity accompanied by a large increase in gene number. The discovery of protein classes and domains specific for plants, e.g. Ca2+-dependent protein kinases containing four EF-hand domains [65] or the B3 domain of ABI3, VP1, and FUS3 [78], and the significantly different abundance of several proteins or protein domains compared with C. elegans or D. melanogaster, for example myb-like transcription factors [79], C3HC4 ring finger domains, and P450 enzymes [65], further support this notion. The larger number of genes in the Arabidopsis genome (approx. 25,000) compared with the genomes of C. elegans (approx. 19,000) and D. melanogaster (approx. 14,000) alone seems indicative of different ways of evolving organisms of comparable complexity. Currently, the underlying reason for this large difference is unclear. It might simply reflect a larger proportion of duplicated genes, as it does for C. elegans compared with D. melanogaster [80]. The large number of observed tandem repeats [64, 65] and the large duplicated segments in chromosomes 2 and 4, 2.5 Mbp in total, and in chromosomes 4 and 5, a segment containing 37 genes [65], seem to favor this explanation. But other specific properties of plants, for example autotrophism, non-mobile life, rigid body structure, continuous organ development, successive gametophytic and sporophytic generations, and, compared with animals, different ways of processing information and responding to environmental stimuli and a smaller extent of combinatorial use of proteins, or any combination of these factors, might also contribute to their large number of genes. Functional assignment of proteins is not currently sufficiently advanced to enable us to distinguish between all these possibilities.

The data currently available indicate that the function of many plant genes differs from that of genes of animals and fungi. Besides the basic eukaryotic machinery which was present in the last common ancestor before endosymbiosis with cyanobacteria created the plant kingdom, and which can be delineated by identification of orthologs in eukaryotic genomes [65], all the plant-specific functions must be elucidated. Because more than 40 % of all predicted proteins from the genome of Arabidopsis have no assigned function and many others have not been investigated thoroughly, this will require an enormous effort using different approaches. New techniques from chip-based expression analysis and genotyping to high-throughput proteomics, protein–ligand interaction studies, and metabolite profiling have to be applied in conjunction with identification of the diversity present in nature or created via mutagenesis, transformation, gene knock-out, etc. Because multicellularity was established independently in all kingdoms, it might be wise to sequence the genome of a unicellular plant. Its gene content would help us identify all the proteins required for cell-to-cell communication and transport of signals and metabolites. To collect, store, evaluate, and access all the relevant data from these various approaches, many of which have been performed in parallel and designed for high-throughput analyses, new bioinformatic tools must be created. To coordinate such efforts, a first meeting of “The Multinational Coordinated Arabidopsis 2010 Project” (http://www.arabidopsis.org/workshop1.html), with the objective of functional genomics and creation of a virtual plant within the next decade, was held in January 2000 at the Salk Institute in San Diego, USA. From just two examples it is evident that the rapid progress in Arabidopsis research will continue in the future. A centralized facility for chip-based expression analysis has already been set up in Stanford, USA, and a company provides access to 39,000 potential single-nucleotide polymorphisms, which will speed up mapping and genotyping substantially. At the end the question remains the same as in the beginning: why should all these efforts be devoted to a model plant irrelevant to agriculture? The answer can be given again as a question: which other plant system would provide better tools for tackling all the basic questions of plant development and adaptation than Arabidopsis thaliana?

1.3.3

The Roundworm Caenorhabditis elegans

1.3.3.1

The Organism

The free-living nematode Caenorhabditis elegans (Fig. 1.6) is the first multicellular animal whose genome has been completely sequenced. This worm, although often viewed as a featureless tube of cells, has been studied as a model organism for more than 20 years. In the 1970s and 1980s, the complete cell lineage of the worm from fertilized egg to adult was determined by microscopy [81], and later the entire nervous system was reconstructed [82]. It has proved to have several advantages as an object of biological study, for example simple growth conditions, rapid generation time with an invariant cell lineage, and well developed genetic and molecular tools for its manipulation. Many of the discoveries made with C. elegans are particularly relevant to the study of higher organisms, because it shares many of the essential biological features, for example neuronal growth, apoptosis, intra- and intercellular signaling pathways, and food digestion, that are the focus of interest in, for example, human biology.

A special review by the C. elegans Genome Consortium [17] gives an interesting summary of how “the worm was won” and examines some of the findings from the near-complete sequence data.

The size of the C. elegans genome was deduced from a clone-based physical map to be 100 Mb. This map was initially based on cosmid clones using a fingerprinting approach. Later, yeast artificial chromosome (YAC) clones were incorporated to bridge the gaps between cosmid contigs, and provided coverage of regions that were not represented in the cosmid libraries. By 1990, the physical map consisted of fewer than 20 contigs and was useful for rescue experiments that were able to locate a phenotype of interest to a few kilobases of DNA [83]. Alignment of the existing genetic and physical maps into the C. elegans genome map was greatly facilitated by cooperation of the entire “worm community”. When the physical map had been nearly completed, the effort to sequence the entire genome became both feasible and desirable. At that time this undertaking was significantly larger than any previous sequencing project and was nearly two orders of magnitude more expensive than the mapping effort.

Fig. 1.6 The free-living bacteriovorous soil nematode Caenorhabditis elegans, which is a member of the Rhabditidae (found at http://www.nematodes.org).

1.3.3.2

The Structure of the Genome Project

In 1990 a three-year pilot project for sequencing 3 Mb of the genome was initiated as a collaboration between the Genome-Sequencing Center in St Louis, USA and the Sanger Center in Hinxton, UK. Funding was obtained from the NIH and the UK MRC. The genome sequencing effort initially focused on the central regions of the five autosomes which were well represented in cosmid clones, because most genes known at that time were contained in these regions. At the beginning of the project, sequencing was still based on standard radioisotopic methods, with “walking” primers and cosmid clones as templates for the sequencing reactions. Severe problems of this primer-directed sequencing approach on cosmids were, however, multiple priming events caused by repetitive sequences and the difficulty of efficiently preparing sufficient template DNA. To address these problems the strategy was changed to a more classic shotgun sequencing strategy based on cosmid subclones generally sequenced from universal priming sites in the subcloning vectors. Further developments in automation of the sequencing reactions, fluorescent sequencing methods, improvements in dye-terminator chemistry [84], and the generation of assembly and contig editing programs led to the phasing out of the earlier methods in favor of four-color, single-lane sequencing. The finishing phase then used a more ordered, directed sequencing strategy as well as the walking approach, to close specific remaining gaps and resolve ambiguities. In this way, the worm project grew into a collaboration among C. elegans Sequencing Consortium members and the entire international community of C. elegans researchers. In addition to the nuclear genome-sequencing effort, other researchers sequenced its 15-kb mitochondrial genome and performed extensive cDNA analyses that facilitated gene identification.

The implementation of high-throughput devices and semi-automated methods for DNA purification and sequencing reactions led to great success in scaling up the sequencing. The first 1 Mb threshold of finished C. elegans genome sequence was reached in May 1993. By August 1993 the total had already increased to over 2 Mb [85], and by December 1994 over 10 Mb of the C. elegans genome had been completed. Bioinformatics played an increasing role in the genome project. Software developments made the processing, analysis, and editing of thousands of data files per day a manageable task. Indeed, many of the software tools developed in the C. elegans project, for example ACeDB, PHRED, and PHRAP, have become key components in the current approach to sequencing the human genome.

The 50 Mb mark was passed in August 1996. At this point, it became obvious that 20 % of the genome was not covered by cosmid clones, and a closure strategy was implemented. For gaps in the central regions, either long-range PCR was used, or a fosmid library was probed in search of a bridging clone. For the remaining gaps in the central regions, and for regions of chromosomes contained only in YACs, purified YAC DNA was used as the starting material for shotgun sequencing. All of these final regions have been essentially completed with the exception of several repetitive elements. The final genome sequence of the worm is, therefore, a composite from cosmids, fosmids, YACs, and PCR products. The exact genome size remained approximate for a long time, mainly because of repetitive sequences that could not be sequenced in their entirety. Telomeres were sequenced from plasmid clones provided by Wicky et al. [86]. Of twelve chromosome ends, nine have been linked to the outermost YAC on the physical map.

1.3.3.3

Results from the Genome Project

Analysis of the approximately 100 Mb of the total C. elegans genome revealed nearly 20,000 predicted genes, considerably more than expected before sequencing began [87, 88], with an average density of one predicted gene per 5 kb. Each gene was estimated to have an average of five introns, and 27 % of the genome was estimated to reside in exons. The number of genes is approximately three times the number found in yeast [89] and approximately one-fifth to one-third the number predicted for the human genome [17].
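The headline figures for the worm genome can be checked with a few lines of arithmetic. The following is only a back-of-the-envelope sketch: it rounds the gene count up to 20,000, and it assumes that five introns per gene correspond to six exons per gene, which is an illustrative assumption rather than a value reported in the text.

```python
# Round figures quoted in the text; the exon count per gene is an assumption.
GENOME_BP = 100_000_000        # approximately 100 Mb
PREDICTED_GENES = 20_000       # "nearly 20,000", rounded up for illustration
EXON_FRACTION = 0.27           # fraction of the genome residing in exons
MEAN_INTRONS_PER_GENE = 5


def summarize(genome_bp, n_genes, exon_fraction, introns_per_gene):
    """Return rough per-gene statistics in base pairs."""
    bp_per_gene = genome_bp / n_genes
    exonic_bp_per_gene = genome_bp * exon_fraction / n_genes
    # Assumption: n introns separate n + 1 exons.
    exons_per_gene = introns_per_gene + 1
    return {
        "bp_per_gene": bp_per_gene,
        "exonic_bp_per_gene": exonic_bp_per_gene,
        "mean_exon_bp": exonic_bp_per_gene / exons_per_gene,
    }


stats = summarize(GENOME_BP, PREDICTED_GENES, EXON_FRACTION, MEAN_INTRONS_PER_GENE)
for name, value in stats.items():
    print(f"{name}: {value:.0f} bp")
```

With these inputs the sketch reproduces the quoted density of one gene per 5 kb and suggests roughly 1.35 kb of exon sequence per gene, spread over exons of a little more than 200 bp on average.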

Interruption of the coding sequence by introns and the relatively low gene density make accurate gene prediction more challenging than in microbial genomes. Valuable bioinformatics tools have been developed and used to identify putative coding regions and to provide an initial overview of gene structure. To refine computer-generated gene structure predictions, the available EST and protein similarities and the genomic sequence data from the related nematode Caenorhabditis briggsae were used for verification. Although approximately 60 % of the predicted genes have been confirmed by EST matches [17, 90], recent analyses have revealed that many predicted genes needed corrections in their intron-exon structures [91].

Similarities to known proteins provide an initial glimpse into the possible function of many of the predicted genes. Wilson [17] reported that approximately 42 % of predicted protein products have cross-phylum matches, most of which provide putative functional information [92]. Another 34 % of predicted proteins match other nematode proteins only, a few of which have been functionally characterized. The fraction of genes with informative similarities is far less than the 70 % observed for microbial genomes. This might reflect the smaller proportion of nematode genes devoted to core cellular functions [89], the comparative lack of knowledge of functions involved in building an animal, and the evolutionary divergence of nematodes from other animals extensively studied so far at the molecular level. Interestingly, genes encoding proteins with cross-phylum matches were more likely to have a matching EST (60 %) than those without cross-phylum matches (20 %). This observation suggested that conserved genes are more likely to be highly expressed, perhaps reflecting a bias for “house-keeping” genes among the conserved set. Alternatively, genes lacking confirmatory matches might be more likely to be false predictions, although the analyses did not suggest this.

In addition to the protein-coding genes, the genome contains several hundred genes for noncoding RNA. A total of 659 widely dispersed transfer RNA genes have been identified, and at least 29 tRNA-derived pseudogenes are present. Forty-four percent of the tRNA genes were found on the X chromosome, which represents only 20 % of the total genome. Several other noncoding RNA genes, for example those for spliceosomal RNA, occur in dispersed multigene families. Several RNA genes are located in the introns of protein-coding genes, which might indicate RNA gene transposition [17]. Other noncoding RNA genes are located in long tandem repeat regions; the ribosomal RNA genes were found solely in such a region at the end of chromosome I, and the 5S RNA genes occur in a tandem array on chromosome V.
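To put the tRNA distribution described above into numbers, the short sketch below compares the share of tRNA genes on the X chromosome with the share expected if tRNA genes were spread evenly along the genome. The even-spread expectation is an assumption made here for illustration; the text itself gives only the two percentages and the total count.

```python
# Figures quoted in the text.
TRNA_GENES = 659
FRACTION_OF_TRNA_ON_X = 0.44
FRACTION_OF_GENOME_ON_X = 0.20


def fold_enrichment(observed_fraction, expected_fraction):
    """Ratio of the observed share to the share expected under an even spread."""
    if expected_fraction <= 0:
        raise ValueError("expected_fraction must be positive")
    return observed_fraction / expected_fraction


# Approximate gene counts implied by the percentages (rounded to whole genes).
observed_on_x = round(TRNA_GENES * FRACTION_OF_TRNA_ON_X)
# Assumption: under an even spread, X would hold tRNA genes in proportion to its size.
expected_on_x = round(TRNA_GENES * FRACTION_OF_GENOME_ON_X)
enrichment = fold_enrichment(FRACTION_OF_TRNA_ON_X, FRACTION_OF_GENOME_ON_X)

print(f"tRNA genes on X: about {observed_on_x} observed, about {expected_on_x} expected")
print(f"fold enrichment on X: {enrichment:.1f}")
```

Under this simple comparison the X chromosome carries a little more than twice as many tRNA genes as its size alone would suggest.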

Extended regions of the genome code for neither proteins nor RNA. Among these are regions that are involved in gene regulation, replication, maintenance, and movement of chromosomes. Furthermore, a significant fraction of the C. elegans genome is repetitive and can be classified as either local repeats (e.g. tandem, inverted, and simple sequence repeats) or dispersed repeats. Tandem and inverted repeats amount to 2.7 and 3.6 % of the genome, respectively. These local repeats are distributed non-uniformly throughout the genome relative to genes. Not surprisingly, only a small percentage of tandem repeats are found within the 27 % of the genome that encodes proteins. Conversely, the density of inverted repeats is higher in regions predicted as intergenic [17]. Although local repeat structures are often unique in the genome, other repeats are members of families. Some repeat families have a chromosome-specific bias in representation. Altogether 38 dispersed repeat families have been recognized. Most of these dispersed repeats are associated with transposition in some form [93], and include the previously described transposons of C. elegans. In addition to multiple-copy repeat families, a significant number of simple duplications have been observed to involve segments that range from hundreds of bases to tens of kilobases. In one instance a segment of 108 kb containing six genes was duplicated tandemly with only ten nucleotide differences observed between the two copies. In another example, immediately adjacent to the telomere at the left end of chromosome IV, an inverted repeat of 23.5 kb was present, with only eight differences found between the two copies. There are many instances of smaller duplications, often separated by tens of kilobases or more, that might contain a coding sequence. This could provide a mechanism for copy divergence and the subsequent formation of new genes [17].

The GC content is more or less uniform at 36 % throughout all chromosomes, unlike human chromosomes that have different isochores [94]. There are no localized centromeres, as are found in most other metazoa. Instead, the extensive highly repetitive sequences that are involved in spindle attachment in other organisms might be represented by some of the many tandem repeats found scattered among the genes, particularly on the chromosome arms. Gene density is also uniform across the chromosomes, although some differences are apparent, particularly between the centers of the autosomes, the autosome arms, and the X chromosome.

More striking differences become evident on examination of other features. Both inverted and tandem repeat sequences are more frequent on the autosome arms than in the central regions or on the X chromosome. This abundance of repeats on the arms is probably the reason for difficulties in cosmid cloning and sequence completion in these regions. The fraction of genes with cross-phylum similarities tends to be smaller on the arms, as does the fraction of genes with EST matches. Local clusters of genes also seem to be more abundant on the arms.

1.3.3.4

Follow-up Research in the Postgenome Era

Although sequencing of the C. elegans genome has essentially been completed, analysis and annotation will continue and, hopefully, be facilitated by further information and better technologies as they become available. The recently completed genome sequencing of C. briggsae enabled comparative analysis of functional sequences of both genomes. The genomes are characterized by striking conservation of structure and function. Thus, many features could be verified in general. Although several specific characteristics had to be corrected for C. elegans, for example the number of predicted genes, the noncoding RNA, and repetitive sequences [95], it is now possible to describe many interesting features of the C. elegans genome on the basis of analysis of the completed genome sequence.

The observations and findings of the C. elegans genome project provide a preliminary glimpse of the biology of metazoan development. There is much left to be uncovered and understood in the sequence. Of primary interest, all of the genes necessary to build a multicellular organism are now essentially available, although their exact boundaries, relationships, and functional roles have to be elucidated more precisely. The basis for a better understanding of the regulation of these genes is also now within our grasp. Furthermore, many of the discoveries made with C. elegans are relevant to the study of higher organisms. This extends beyond fundamental cellular processes such as transcription, translation, DNA replication, and cellular metabolism. For these reasons, and because of its intrinsic practical advantages, C. elegans has proved to be an invaluable tool for understanding features such as vertebrate neuronal growth and pathfinding, apoptosis, and intra- and intercellular signaling pathways.

1.3.4

The Fruitfly Drosophila melanogaster

1.3.4.1

The Organism

In 1910, T.H. Morgan started his analysis of the fruit fly Drosophila melanogaster and identified the first white-eyed mutant fly strain [96]. Now, less than 100 years later, almost the complete genome of the insect has been sequenced, offering the ultimate opportunity to elucidate processes ranging from the development of an organism to its daily behavior.

Drosophila (Fig. 1.7), a small insect with a short lifespan and very rapid, well-characterized development, is one of the best analyzed multicellular organisms. During the last century, more than 1300 genes, mostly based on mutant phenotypes, were genetically identified, cloned, and sequenced. Surprisingly, most were found to have counterparts in other metazoa, including even humans. Soon it became evident that not only are the genes conserved among species, but also the functions of the encoded proteins. Processes like the development of limbs, the nervous system, the eyes and the heart, or the presence of circadian rhythms and innate immunity are highly conserved, even to the extent that genes taken from human sources can supplement the function of genes in the fly [80]. Early in the 1980s identification of the HOX genes led to the discovery that the anterior–posterior body patterning genes are conserved from fly to man [97]. Furthermore, genes essential for the development and differentiation of muscles like twist, MyoD, and MEF2 act in most of the organisms analyzed so far [98]. The most striking example is the capability of a certain class of genes, the PAX-6 family, to induce ectopic eye development in Drosophila, irrespective of the animal source from which the gene was taken [99]. Besides these developmental processes, human disease networks involved in replication, repair, translation, metabolism of drugs and toxins, neural disorders like Alzheimer’s, and also higher order functions like memory and signal transduction cascades have been shown to be highly conserved [97, 100, 101]. It is quite a surprising conclusion that, although simply an insect, Drosophila is capable of serving as a model system even for complex human diseases.

Fig. 1.7 The fruit fly Drosophila melanogaster. (a) Adult fly. (b) Stage 16 embryo. In dark-brown, the somatic muscles are visualized by using a β3-tubulin-specific antibody; in red-brown, expression of β1-tubulin in the attachment sites of the somatic muscles is shown.

The genomic organization of Drosophila melanogaster has been known for many years. The fly has four chromosomes, three large and one small, with an early genome size estimate of 1.1 × 10^8 bp. Using polytene chromosomes as tools, in 1935 and 1938 Bridges published maps of such accuracy that they are still used today [96]. Making extensive use of chromosomal rearrangements he constructed cytogenetic maps that assigned genes to specific sections and even specific bands. With the development of techniques such as in-situ hybridization to polytene chromosomes, genes could be mapped with a resolution of less than 10 kb. Another major advantage of Drosophila is the presence of randomly shuffled chromosomes, the so-called balancers, which enable easy monitoring and pursuit of given mutations on the homologous chromosome and, by suppressing meiotic recombination, guarantee the persistence of such a mutation at the chromosomal place where it initially occurred.

1.3.4.2

Structure of the Genome Project

The strategy chosen for sequencing was the whole-genome-shotgun sequencing (WGS) technique. For this technique, the whole genome is broken into small pieces, subcloned into suitable vectors and sequenced. For the Drosophila genome project, libraries containing 2 kb, 10 kb, and 130 kb inserts were chosen as templates. They were sequenced from the ends, and assembled by pairs of reads, called mates, from the ends of the 2 kb and 10 kb inserts [102]. Assembly of the sequences was facilitated by the absence of interspersed repetitive elements like the human ALU-repeat family. The ends of the large BAC clones were taken to unequivocally localize the sequences into large contigs and scaffolds. Three million reads of ~500 bp were used to assemble the Drosophila genome. The detailed strategies, techniques, and algorithms used are described in an issue of Science [103] published in March 2000. The work was organized by the Berkeley Drosophila Genome Project (BDGP) and performed at the BDGP, the European DGP, the Canadian DGP and at Celera Genomics under the auspices of the federally funded Human Genome Project. The combined DGP produced all the genomic resources and finished 29 Mbp of sequences. In total, the Drosophila genome is ~180 Mbp in size, a third of which is heterochromatin. Of the ~120 Mbp of euchromatin, 98 % has been sequenced with an accuracy of at least 99.5 %. Because of the structure of the heterochromatin, which is mainly built up of repetitive elements, retrotransposons, rRNA clusters, and some unique sequences, most of it could not be cloned or propagated in YACs, in contrast with the C. elegans genome project [102].
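The read count and read length quoted above imply a nominal sequencing depth that is easy to work out. The sketch below does so and, as an added idealization that is not taken from the text, applies the simple Poisson (Lander-Waterman style) model in which reads land uniformly at random, so that the expected fraction of bases hit by no read is exp(-c) at depth c.

```python
import math

# Figures quoted in the text.
READS = 3_000_000
READ_LENGTH_BP = 500            # "~500 bp"
EUCHROMATIN_BP = 120_000_000    # "~120 Mbp"
GENOME_BP = 180_000_000         # "~180 Mbp"


def coverage(n_reads, read_length_bp, target_bp):
    """Nominal depth: total sequenced bases divided by target size."""
    return n_reads * read_length_bp / target_bp


def expected_uncovered_fraction(depth):
    """Idealized Poisson model (assumption): reads placed uniformly at random."""
    return math.exp(-depth)


depth_euchromatin = coverage(READS, READ_LENGTH_BP, EUCHROMATIN_BP)
depth_genome = coverage(READS, READ_LENGTH_BP, GENOME_BP)
uncovered = expected_uncovered_fraction(depth_euchromatin)

print(f"depth over euchromatin: {depth_euchromatin:.1f}x")
print(f"depth over whole genome: {depth_genome:.1f}x")
print(f"idealized uncovered fraction of euchromatin: {uncovered:.2e}")
```

The idealized model predicts that almost nothing would be missed at this depth, whereas the text reports that 98 % of the euchromatin was sequenced. The difference is a reminder that real libraries are not uniform: repeats and regions that clone poorly leave gaps that depth alone does not close.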

1.3.4.3

Results of the Genome Project

As a major result, the number of genes was calculated to be ~13,600 using a Drosophila-optimized version of the program GENIE [100]. Over the last 20 years, 2500 of these have already been characterized by the fly community, and their sequences have been made available continuously during the course of the fly project. The average density is one gene in 9 kb, but ranges from none to nearly 30 genes per 50 kbp, and, in contrast with C. elegans, gene-rich regions are not clustered. Regions of high gene density correlate with G + C-rich sequences. Computational comparisons with known vertebrate genes have led to both expected findings and surprises. Thus, genes encoding the basic DNA replication machinery are conserved among eukaryotes; in particular, all the proteins known to be involved in recognition of the replication start point are present as single-copy genes, and the ORC3 and ORC6 proteins are closely similar to vertebrate proteins but very different from those in yeast and, apparently, even more different from those in the worm. Focusing on chromosomal proteins, the fly seems to lack orthologs to most of the mammalian proteins associated with centromeric DNA, for example the CENP-C/MIF-2 family. Furthermore, because Drosophila telomeres lack the simple repeats characteristic of most eukaryotic telomeres, the known telomerase components are absent. Concerning gene regulation, RNA polymerase subunits and co-factors are more closely related to their mammalian counterparts than to those of yeast; for example, the promoter interacting factors UBF and TIF-IA are present in Drosophila but not in yeast. The overall set of transcription factors in the fly seems to comprise approximately 700 members, about half of which are zinc-finger proteins, whereas in the worm only one-third of 500 factors belong to this family. Nuclear hormone receptors seem to be rarer, because only four new members were detected, bringing the total to 20 compared with more than 200 in C. elegans. As an example of metabolic processes, iron pathway components were analyzed. A third ferritin gene has been found that probably encodes a subunit belonging to cytosolic ferritin, the predominant type in vertebrates. Two newly discovered transferrins are homologs of the human melanotransferrin p97, which is of especial interest, because the iron transporters analyzed so far in the fly are involved in antibiotic response rather than in iron transport. Otherwise, proteins homologous to transferrin receptors seem to be absent from the fly, so the melanotransferrin homologs might mediate the main insect pathway for iron uptake (all data taken from Adams et al. [100]). The sequences and the data compiled are freely available to the scientific community on servers in several countries (see: http://www.fruitfly.org). In addition, large collections of EST (expressed sequence tags), cDNA libraries, and genomic resources are available, as are databases presenting expression patterns of identified genes or enhancer trap lines, for example flyview at the University of Münster (http://pbio07.uni-muenster.de), started by the group of W. Janning. Furthermore, by using more advanced transcription profiling approaches in combination with lower stringency gene prediction programs, approximately 2000 new genes could be detected, according to a recent report [104].


1.3.4.4

Follow-up Research in the Postgenome Era

Comparative analysis of the genomes of Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae has been performed [80]. It showed that the core proteome of Drosophila consists of 9453 proteins, which is only twice the number for S. cerevisiae (4383 proteins). Using stringent criteria, the fly genome is much more closely related to the human genome than to that of C. elegans. Interestingly, the fly has orthologs to 177 of 289 human disease genes examined so far and provides an excellent basis for rapid analysis of some basic processes involved in human disease. Furthermore, hitherto unknown counterparts were found for human genes involved in other disorders, for example, menin, tau, limb girdle muscular dystrophy type 2B, Friedreich ataxia, and parkin. Of the cancer genes surveyed, at least 68 % seem to have Drosophila orthologs, and even a p53-like tumor-suppressor protein was detected. All of these fly genes are present as single copies and can be genetically analyzed without uncertainty about additional functionally redundant copies. Hence, all the powerful Drosophila tools such as genetic analysis, exploitation of developmental expression patterns, and loss-of-function and gain-of-function phenotypes can be used to elucidate the function of human genes that are not yet understood. A very recent discovery, the function of a new class of regulatory RNA termed microRNA, has rapidly attracted attention to the non-coding parts of the genome [105]. Screens for miRNA targets in Drosophila [106] have already led to identification of a large collection of genes and will certainly be very helpful in elucidating the role of the miRNA during development. Furthermore, genome-wide protein–protein-interaction studies using the yeast two-hybrid system have been performed [107] and led to a draft map of about 20,000 interactions between more than 7,000 proteins. These results must certainly be refined but clearly emphasize the importance of genome-wide screening of both proteins and RNA in combination with elaborate computational methods.

1.4

Conclusions

This book chapter summarizes the genome projects of selected model organisms with completed, or almost completed, genomic sequences. The organisms represent the major phylogenetic lineages, the eubacteria, archaea, fungi, plants, and animals, thus covering unicellular prokaryotes with single, circular chromosomes, and uni- and multicellular eukaryotes with multiple linear chromosomes. Their genome sizes range from ~2 to ~180 Mbp with estimated gene numbers ranging from 2000 to 25,000 (Tab. 1.3).

The organization of the genome of each organism has its own specific and often unexpected characteristics. In B. subtilis, for example, a strong bias in the polarity of transcription of the genes with regard to the replication fork was observed, whereas in E. coli and S. cerevisiae the genes are more or less equally distributed on both strands. Furthermore, although insertion sequences are widely distributed in bacteria, none was found in B. subtilis. In A. thaliana, an unexpectedly high percentage of proteins showed no significant homology with proteins of organisms outside the plant kingdom, and thus appear to be specific to plants. Another interesting observation resulting from the genome projects concerns gene density. In prokaryotic organisms, i.e. eubacteria and archaea, genome sizes vary substantially. Their gene density is, however, relatively constant at approximately one gene per kbp. During evolution of eukaryotic organisms, genome sizes grew, but gene density decreased from one gene in two kbp in Saccharomyces to one gene in 10 kbp in Drosophila. This led to the surprising observation that some bacterial species can have more genes than lower eukaryotes and that the number of genes in Drosophila is only approximately three times higher than the number in E. coli.
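The gene-density figures quoted here can be checked roughly against the rounded genome sizes and gene estimates listed in Table 1.3. The following Python sketch is only an illustration of that arithmetic; the names GENOMES, kb_per_gene, and density_table are hypothetical and introduced for this example, and because the table values are rounded estimates the results are approximate.

```python
# Genome size (kb) and estimated number of genes/ORFs, copied from Table 1.3.
# The variable and function names below are illustrative assumptions only.
GENOMES = {
    "Escherichia coli": (4600, 4400),
    "Bacillus subtilis": (4200, 4100),
    "Archaeoglobus fulgidus": (2200, 2400),
    "Saccharomyces cerevisiae": (12800, 6200),
    "Arabidopsis thaliana": (130000, 25000),
    "Caenorhabditis elegans": (97000, 19000),
    "Drosophila melanogaster": (180000, 16000),
}


def kb_per_gene(size_kb, n_genes):
    """Average stretch of genome per gene, in kb (inverse of gene density)."""
    return size_kb / n_genes


def density_table(genomes=GENOMES):
    """Return {organism: kb per gene}, rounded to two decimal places."""
    return {
        name: round(kb_per_gene(size_kb, n_genes), 2)
        for name, (size_kb, n_genes) in genomes.items()
    }


if __name__ == "__main__":
    for name, value in density_table().items():
        print(f"{name:26s} {value:6.2f} kb per gene")
    fly_genes = GENOMES["Drosophila melanogaster"][1]
    coli_genes = GENOMES["Escherichia coli"][1]
    print(f"Drosophila / E. coli gene ratio: {fly_genes / coli_genes:.1f}")
```

With these table values the three prokaryotes come out at about 1 kb per gene, yeast at about 2 kb, A. thaliana and C. elegans at about 5 kb, and Drosophila at about 11 kb, in line with the approximate figures given in the text.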

Another interesting question concerns the general homology of genes or gene products between model organisms. Comparison of protein sequences has shown that some gene products can be found in a wide variety of organisms. Thus, comparative analysis of predicted protein sequences encoded by the genomes of C. elegans and S. cerevisiae suggested that most of the core biological functions are conducted by orthologous proteins that occur in comparable numbers in these organisms. Furthermore, comparing the yeast genome with the catalog of human sequences available in the databases revealed that a significant number of yeast genes have homologs among human genes of unknown function. Drosophila is of special importance in this respect, because many human disease networks have been shown to be highly conserved in the fruit fly. Hence, the insect is exceptionally well suited to serve as a model system even for complex human diseases. Because the respective genes in the fly are present as single copies, they can be genetically analyzed much more easily.

Genome research of model organisms has just begun. At the time this article was written, more than 100 genome sequences, mainly from prokaryotes, had been published and more than 200 genomes were being sequenced worldwide. The full extent of this broad and rapidly expanding field of genome research on model organisms cannot be covered by a single book chapter. The World-Wide Web, however, is an ideal platform for making available the outcome of large genome projects in an appropriate and timely manner. Besides several specialized entry points, two web pages are recommended as an introduction with a broad overview of model-organism research: the WWW Virtual Library: Model Organisms (http://ceolas.org/VL/mo) and the WWW Resources for Model Organisms (http://genome.cbs.dtu.dk/gorm/modelorganism.html). Both links are excellent starting sites for detailed information on model-organism research.

Table 1.3 Summary of information on the genomes of model organisms described in this chapter (status April 2004).

Organism                    Genome structure           Genome size (kb)    Estimated no. of genes/ORFs

Escherichia coli            1 chromosome, circular     4600                4400
Bacillus subtilis           1 chromosome, circular     4200                4100
Archaeoglobus fulgidus      1 chromosome, circular     2200                2400
Saccharomyces cerevisiae    16 chromosomes, linear     12,800              6200
Arabidopsis thaliana        5 chromosomes, linear      130,000             25,000
Caenorhabditis elegans      6 chromosomes, linear      97,000              19,000
Drosophila melanogaster     4 chromosomes, linear      180,000             16,000

References

1 Berg, P., Baltimore, D., Brenner, S., Roblin, R.O. 3rd, Singer, M.F. (1975), Asilomar conference on recombinant DNA molecules. Science 188, 991–994.

2 Blattner, F.R., Plunkett, G. III, Bloch, C.A., Perna, N.T., Burland, V. et al. (1997), The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1474.

3 Yamamoto, Y., Aiba, H., Baba, T., Hayashi, K., Inada, T. et al. (1997), Construction of a contiguous 874-kb sequence of the Escherichia coli-K12 genome corresponding to 50.0–68.8 min on the linkage map and analysis of its sequence features. DNA Res. 4, 91–113.

4 Kröger, M., Wahl, R. (1998), Compilation of DNA sequences of Escherichia coli K12: description of the interactive databases ECD and ECDC. Nucl. Acids Res. 26, 46–49.

5 Thomas, G.H. (1999), Completing the E. coli proteome: a database of gene products characterised since completion of the genome sequence. Bioinformatics 15, 860–861.

6 Taylor, A.L., Thoman, M.S. (1964), The genetic map of Escherichia coli K-12. Genetics 50, 659–677.

7 Bachmann, B.J. (1983), Linkage map of Escherichia coli K-12, edition 7. Microbiol. Rev. 47, 180–230.

8 Neidhardt, F.C. (1996), Escherichia coli and Salmonella typhimurium – Cellular and Molecular Biology, 2nd edn. American Society for Microbiology, Washington DC.

9 Kohara, Y., Akiyama, K., Isono, K. (1987), The physical map of the whole E. coli chromosome: Application of a new strategy for rapid analysis and sorting of a large genomic library. Cell 50, 495–508.

10 Roberts, R. (2000), Database issue of Nucleic Acids Research. Nucl. Acids Res. 28, 1–382.

11 Datsenko, K.A., Wanner, B.L. (2000), One-step inactivation of chromosomal genes in Escherichia coli K-12 using PCR products. Proc. Natl. Acad. Sci. USA 97, 6640–6645.

12 Karp, P.D., Riley, M.R., Saier, M., Paulsen, I.T., Collado-Vides, J., Paley, S.M., Pellegrini-Toole, A., Bonavides, C., Gama-Castro, S. (2002), The EcoCyc Database. Nucl. Acids Res. 30, 56–58.

13 Sekowska, A. (1999), Une rencontre du métabolisme du soufre et de l’azote: le métabolisme des polyamines chez Bacillus subtilis, thesis in Génétique cellulaire et moléculaire. Versailles-Saint-Quentin: Versailles, France. p. 202.

14 Kunst, F., Ogasawara, N., Moszer, I., Albertini, A.M., Alloni, G. et al. (1997), The complete genome sequence of the gram-positive bacterium Bacillus subtilis. Nature 390, 249–256.

15 Moreno, M.S., Schneider, B.L., Maile, R.R., Weyler, W., Saier, M.H., Jr. (2001), Catabolite repression mediated by the CcpA protein in Bacillus subtilis: novel modes of regulation revealed by whole-genome analyses. Mol. Microbiol. 39, 1366–1381.

16 Fabret, C., Quentin, Y., Guiseppi, A., Busuttil, J., Haiech, J., Denizot, F. (1995), Analysis of errors in finished DNA sequences: the surfactin operon of Bacillus subtilis as an example. Microbiology 141, 345–350.

17 Wilson, R.K. (1999), How the worm was won. The C. elegans genome sequencing project. Trends Genet. 15, 51–58.

18 Arias, R.S., Sagardoy, M.A., van Vuurde, J.W. (1999), Spatio-temporal distribution of naturally occurring Bacillus spp. and other bacteria on the phylloplane of soybean under field conditions. J. Basic Microbiol. 39, 283–292.

19 Rocha, E.P. (2003), DNA repeats lead to the accelerated loss of gene order in bacteria. Trends Genet. 19, 600–603.

20 Rocha, E.P., Danchin, A., Viari, A. (1999), Universal replication biases in bacteria. Mol. Microbiol. 32, 11–16.

21 Schofield, M.J., Hsieh, P. (2003), DNA mismatch repair: molecular mechanisms and biological function. Annu. Rev. Microbiol. 57, 579–608.

22 Glaser, P., Frangeul, L., Buchrieser, C., Rusniok, C., Amend, A. et al. (2001), Comparative genomics of Listeria species. Science 294, 849–852.

23 Rocha, E.P., Sekowska, A., Danchin, A. (2000), Sulphur islands in the Escherichia coli genome: markers of the cell’s architecture? FEBS Lett. 476, 8–11.

24 Danchin, A., Guerdoux-Jamet, P., Moszer, I., Nitschke, P. (2000), Mapping the bacterial cell architecture into the chromosome. Philos. Trans. R. Soc. Lond. B Biol. Sci. 355, 179–190.

25 Gottesman, S., Roche, E., Zhou, Y., Sauer, R.T. (1998), The ClpXP and ClpAP proteases degrade proteins with carboxy-terminal peptide tails added by the SsrA-tagging system. Genes Dev. 12, 1338–1347.

26 Kobayashi, K., Ehrlich, S.D., Albertini, A., Amati, G., Andersen, K.K. et al. (2003), Essential Bacillus subtilis genes. Proc. Natl. Acad. Sci. USA 100, 4678–4683.

27 Soma, A., Ikeuchi, Y., Kanemasa, S., Kobayashi, K., Ogasawara, N. et al. (2003), An RNA-modifying enzyme that governs both the codon and amino acid specificities of isoleucine tRNA. Mol. Cell 12, 689–698.

28 Rocha, E.P., Danchin, A. (2003), Gene essentiality determines chromosome organisation in bacteria. Nucleic Acids Res. 31, 6570–6577.

29 Beckering, C.L., Steil, L., Weber, M.H., Volker, U., Marahiel, M.A. (2002), Genomewide transcriptional analysis of the cold shock response in Bacillus subtilis. J. Bacteriol. 184, 6395–6402.

30 Hoffmann, T., Schutz, A., Brosius, M., Volker, A., Volker, U., Bremer, E. (2002), High-salinity-induced iron limitation in Bacillus subtilis. J. Bacteriol. 184, 718–727.

31 Danchin, A. (1997), Comparison between the Escherichia coli and Bacillus subtilis genomes suggests that a major function of polynucleotide phosphorylase is to synthesize CDP. DNA Res. 4, 9–18.

32 Klenk, H.-P., Clayton, R.A., Tomb, J.-F., White, O., Nelson, K.E. et al. (1997), The complete genome of the hyperthermophilic, sulfate-reducing archaeon Archaeoglobus fulgidus. Nature 390, 364–370.

33 Stetter, K.O. (1988), Archaeoglobus fulgidus gen. nov., sp. nov.: a new taxon of extremely thermophilic archaebacteria. System. Appl. Microbiol. 10, 172–173.

34 Tomb, J.-F., White, O., Kerlavage, A.R., Clayton, R.A., Sutton, G.G. et al. (1997), The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388, 539–547.

35 Bult, C.J., White, O., Olsen, G.J., Zhou, L., Fleischmann, R.D. et al. (1996), Complete genome sequence of the methanogenic archaeon Methanococcus jannaschii. Science 273, 1058–1073.

36 Rabus, R., Ruepp, A., Frickey, T., Rattei, T., Fartmann, B. et al. (2004), The genome of Desulfotalea psychrophila, a sulphate-reducing bacterium from permanently cold Arctic sediments. Environ. Microbiol., in press.

37 Vorholt, J., Kunow, J., Stetter, K.O., Thauer, R.K. (1995), Enzymes and coenzymes of the carbon monoxide dehydrogenase pathway for autotrophic CO2 fixation in Archaeoglobus lithotrophicus and the lack of carbon monoxide dehydrogenase in the heterotrophic A. profundus. Arch. Microbiol. 163, 112–118.

38 Stetter, K.O., Lauerer, G., Thomm, M., Neuner, A. (1987), Isolation of extremely thermophilic sulfate reducers: evidence for a novel branch of archaebacteria. Science 236, 822–824.

39 Hosfield, D.J., Frank, G., Weng, Y., Tainer, J.A., Shen, B. (1998), Newly discovered archaebacterial flap endonucleases show a structure-specific mechanism for DNA substrate binding and catalysis resembling human flap endonuclease-1. J. Biol. Chem. 273, 27154–27161.

40 Lyamichev, V., Mast, A.L., Hall, J.G., Prudent, J.R., Kaiser, M.W. et al. (1999), Polymorphism identification and quantitative detection of genomic DNA by invasive cleavage of oligonucleotide probes. Nat. Biotechnol. 17, 292–296.

41 Cooksey, R.C., Holloway, B.P., Oldenburg, M.C., Listenbee, S., Miller, C.W. (2000), Evaluation of the invader assay, a linear signal amplification method, for identification of mutations associated with resistance to rifampin and isoniazid in Mycobacterium tuberculosis. Antimicrob. Agents Chemother. 44, 1296–1301.

42 Tang, T.H., Bachellerie, J.P., Rozhdestvensky, T., Bortolin, M.L., Huber, H. et al. (2002), Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proc. Natl. Acad. Sci. USA 99, 7536–7541.

43 Maisnier-Patin, S., Malandrin, L., Birkeland, N.K., Bernander, R. (2002), Chromosome replication patterns in the hyperthermophilic euryarchaea Archaeoglobus fulgidus and Methanocaldococcus (Methanococcus) jannaschii. Mol. Microbiol. 45, 1443–1450.

44 Goffeau, A., Barrell, B.G., Bussey, H., Davis, R.W., Dujon, B. et al. (1996), Life with 6000 genes. Science 274, 546–567.

45 Anonymous (1997), Dictionary of the yeast genome. Nature 387 (suppl.).

46 Lindegren, C.C. (1949), The Yeast Cell, its Genetics and Cytology. Educational Publishers, St Louis, MO.

47 Mortimer, R.K., Contopoulou, R., King, J.S. (1992), Genetic and physical maps of Saccharomyces cerevisiæ, Edition 11. Yeast 8, 817–902.

48 Carle, G.F., Olson, M.V. (1985), An electrophoretic karyotype for yeast. Proc. Natl. Acad. Sci. USA 82, 3756.

49 Stucka, R., Feldmann, H. (1994), Cosmid cloning of Yeast DNA, in: Molecular Genetics of Yeast – A Practical Approach (Johnston, J. Ed.) Oxford Univ. Press, pp. 49–64.

50 Thierry, A., Dujon, B. (1992), Nested chromosomal fragmentation in yeast using the meganuclease I-Sce I: a new method for physical mapping of eukaryotic genomes. Nucleic Acids Res. 20, 5625–5631.

51 Louis, E.J., Borts, R.H. (1995), A complete set of marked telomeres in Saccharomyces cerevisiae for physical mapping and cloning. Genetics 139, 125–136.

52 Wolfe, K.H., Shields, D.C. (1997), Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387, 708–713.

53 Foury, F., Roganti, T., Lecrenier, N., Purnelle, B. (1998), Yeast genes and human disease. FEBS Lett. 440, 325–331.

54 DeRisi, J.L., Iyer, V.R., Brown, P.O. (1997), Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680–686.

55 Goffeau, A. (2000), Four years of postgenomic life with 6,000 yeast genes. FEBS Lett. 480, 37–41.

56 Feldmann, H. (ed.) (2000), Genolevures: Genomic exploration of the hemiascomycetous yeasts. FEBS Lett. 487, no. 1 (http://cbi.labri.fr/genolevures).

57 Laibach, F. (1907), Zur Frage nach der Individualität der Chromosomen im Pflanzenreich. Beih. Bot. Centralblatt, Abt. I, 22, 191–210.

58 Laibach, F. (1943), Arabidopsis thaliana (L.) Heynh. als Objekt für genetische und entwicklungsphysiologische Untersuchungen. Bot. Arch. 44, 439–455.

59 Laibach, F. (1951), Über sommer- und winterannuelle Rassen von Arabidopsis thaliana (L.) Heynh. Ein Beitrag zur Ätiologie der Blütenbildung. Beitr. Biol. Pflanzen 28, 173–210.

60 Müller, A. (1961), Zur Charakterisierung der Blüten und Infloreszenzen von Arabidopsis thaliana (L.) Heynh. Die Kulturpflanze 9, 364–393.

61 Arumuganathan, K., Earle, E.D. (1991), Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9, 208–218.

62 Cherry, J.M., Cartinhour, S.W., Goodman, H.M. (1992), AAtDB, an Arabidopsis thaliana database. Plant Mol. Biol. Rep. 10, 308–309, 409–410.

63 Rédei, G.P. (1992), A heuristic glance at the past of Arabidopsis genetics, in: Methods in Arabidopsis Research (Koncz, C., Chua, N.-H., Schell, J. Eds.), World Scientific Publishing, Singapore, pp. 1–15.

64 Lin, X., Kaul, S., Rounsley, S., Shea, T.P., Benito, M.I. et al. (1999), Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 402, 761–768.

65 Mayer, K., Schuller, C., Wambutt, R., Murphy, G., Volckaert, G. et al. (1999), Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 402, 769–777.

66 Unseld, M., Marienfeld, J.R., Brandt, P., Brennicke, A. (1997), The mitochondrial genome of Arabidopsis thaliana contains 57 genes in 366,924 nucleotides. Nature Genet. 15, 57–61.

67 Sato, S., Nakamura, Y., Kaneko, T., Asamizu, E., Tabata, S. (1999), Complete structure of the chloroplast genome of Arabidopsis thaliana. DNA Res. 6, 283–290.

68 Gill, K.S., Gill, B.S., Endo, T.R., Taylor, T. (1996a), Identification and high density mapping of gene-rich regions in the chromosome group 5 of wheat. Genetics 143, 1001–1012.

69 Gill, K.S., Gill, B.S., Endo, T.R., Taylor, T. (1996b), Identification and high density mapping of gene-rich regions in the chromosome group 1 of wheat. Genetics 144, 1883–1891.

70 Künzel, G., Korzun, L., Meister, A. (2000), Cytologically integrated physical restriction fragment length polymorphism maps for the barley genome based on translocation breakpoints. Genetics 154, 397–412.

71 SanMiguel, P., Tikhonov, A., Jin, Y.K., Motchoulskaia, N., Zakharov, D., Melake-Berhan, A., Springer, P.S., Edwards, K.J., Lee, M., Avramova, Z., Bennetzen, J.L. (1996), Nested retrotransposons in the intergenic regions of the maize genome. Science 274, 765–768.

72 Meyerowitz, E.M. (1999), Today we have the naming of the parts. Nature 402, 731–732.

73 Emanuelsson, O., Nielsen, H., Brunak, S., von Heijne, G. (2000), Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300, 1005–1016.

74 Martin, W., Schnarrenberger, C. (1997), The evolution of the Calvin cycle from prokaryotic to eukaryotic chromosomes: a case study of functional redundancy in ancient pathways through endosymbiosis. Curr. Genet. 32, 1–8.

75 Herrmann, R.G. (2000), Organelle genetics – part of the integrated plant genome. Vortr. Pflanzenzüchtg. 48, 279–296.

76 Bevan, M., Bennetzen, J.L., Martienssen, R. (1998), Genome studies and molecular evolution. Commonalities, contrasts, continuity and change in plant genomes. Curr Opin Plant Biol. 101–102.

77 Schmidt, R. (2000), The Arabidopsis genome. Vortr. Pflanzenzüchtg. 48, 228–237.

78 Suzuki, M., Kao, C.Y., McCarty, D.R. (1997), The conserved B3 domain of VIVIPAROUS1 has a cooperative DNA binding activity. Plant Cell 9, 799–807.

79 Jin, H., Martin, C. (1999), Multifunctionality and diversity within the plant MYB-gene family. Plant Mol. Biol. 41, 577–585.

80 Rubin, G.M., Yandell, M.D., Wortman, J.R., Gabor Miklos, G.L., Nelson, C.R. et al. (2000), Comparative genomics of eukaryotes. Science 287, 2204–2215.

81 Sulston, J. (1988), Cell lineage, in: The Nematode Caenorhabditis elegans (Wood, W.B. Ed.) pp. 123–155, CSHL Press.

82 Chalfie, M., White, J. (1988), The nervous system, in: The Nematode Caenorhabditis elegans (Wood, W.B. Ed.) pp. 337–391, CSHL Press.

83 Coulson, A., Waterston, R., Kiff, J., Sulston, J., Kohara, Y. (1988), Genome linking with yeast artificial chromosomes. Nature 335, 184–186.

84 Lee, L.G., Connell, C.R., Woo, S.L., Cheng, R.D., McArdle, B.F., Fuller, C.W., Halloran, N.D., Wilson, R.K. (1992), DNA sequencing with dye-labeled terminators and T7 DNA polymerase: Effect of dyes and dNTPs on incorporation of dye-terminators, and probability analysis of termination fragments. Nucleic Acids Res. 20, 2471–2483.

85 Wilson, R., Ainscough, R., Anderson, K., Baynes, C., Berks, M. et al. (1994), 2.2 Mb of contiguous nucleotide sequence from chromosome III of C. elegans. Nature 368, 32–38.

86 Wicky, C., Villeneuve, A.M., Lauper, N., Codourey, L., Tobler, H., Muller, F. (1996), Telomeric repeats (TTAGGC) are sufficient for chromosome capping in Caenorhabditis elegans. Proc. Natl. Acad. Sci. USA 93, 8983–8988.

87 Herman, R.K. (1988), Genetics, in: The Nematode Caenorhabditis elegans (Wood, W.B. Ed.) pp. 17–45, CSHL Press.

88 Waterston, R., Sulston, J. (1995), The genome of Caenorhabditis elegans. Proc. Natl. Acad. Sci. USA 92, 10836–10840.

89 Chervitz, S.A., Aravind, L., Sherlock, G., Ball, C.A., Koonin, E.V. et al. (1998), Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science 282, 2022–2027.

90 Stein, L., Sternberg, P., Durbin, R., Thierry-Mieg, J., Spieth, J. (2001), WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 29, 82–86.

91 Reboul, J., Vaglio, P., Rual, J.F., Lamesch, P., Martinez, M. et al. (2003), C. elegans ORFeome version 1.1: experimental verification of the genome annotation and resource for proteome-scale protein expression. Nature Genet. 34, 35–41.

92 Green, P., Lipman, D., Hillier, L., Waterston, R., States, D., Claverie, J.M. (1993), Ancient conserved regions in new gene sequences and the protein database. Science 259, 1711–1716.

93 Smit, A.F. (1996), The origin of interspersed repeats in the human genome. Curr. Opin. Genet. Dev. 6, 743–748.

94 Bernardi, G. (1995), The human genome: organization and evolutionary history. Annu. Rev. Genet. 29, 445–476.

95 Stein, L.D., Bao, Z., Blasiar, D., Blumenthal, T., Brent, M.R. et al. (2003), The genome sequence of Caenorhabditis briggsae: A platform for comparative genomics. PLOS Biology 1, 166–192.

96 Rubin, G.M., Lewis, E.B. (2000), A brief history of Drosophila's contributions to genome research. Science 287, 2216–2218.

97 Veraksa, A., Del Campo, M., McGinnis, W. (2000), Developmental patterning genes and their conserved functions: From model organisms to humans. Mol. Genet. Metab. 69, 85–100.

98 Zhang, J.M., Chen, L., Krause, M., Fire, A., Paterson, B.M. (1999), Evolutionary conservation of MyoD function and differential utilization of E proteins. Dev. Biol. 208, 465–472.

99 Halder, G., Callaerts, P., Gehring, W.J. (1995), Induction of ectopic eyes by targeted expression of the eyeless gene in Drosophila. Science 267, 1788–1792.

100 Adams, M.D., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D. et al. (2000), The genome sequence of Drosophila melanogaster. Science 287, 2185–2195.

101 Mayford, M., Kandel, E.R. (1999), Genetic approaches to memory storage. Trends Genet. 15, 463–470.

102 Myers, E.W., Sutton, G.G., Delcher, A.L., Dew, I.M., Fasulo, D.P. et al. (2000), A whole-genome assembly of Drosophila. Science 287, 2196–2204.

103 Anonymous (2000), The Drosophila Genome. Science 287, issue 5461.

104 Hild, M., Beckmann, B., Haas, S.A., Koch, B., Solovyev, V. et al. (2003), An integrated gene annotation and transcriptional profiling approach towards the full gene content of the Drosophila genome. Genome Biol. 5, R3.

105 Lai, E.C. (2003), microRNAs: Runts of the genome assert themselves. Curr. Biol. 13, R925–R936.

106 Stark, A., Brennecke, J., Russell, R.B., Cohen, S.M. (2003), Identification of Drosophila microRNA targets. PLOS Biology 1, 397–409.

107 Giot, L., Bader, J.S., Brouwer, C., Chaudhuri, A., Kuang, B. et al. (2003), A protein interaction map of Drosophila melanogaster. Science 302, 1727–1736.