Streptococcus oralis, a commensal species of the human oral cavity, belongs to the Mitis group of streptococci, which also includes the major human pathogen S. pneumoniae. We report here the first complete genome sequence of this species. S. oralis Uo5, a high-level penicillin- and multiple-antibiotic-resistant isolate from Hungary, is competent for genetic transformation under laboratory conditions. Comparative and functional genomics of Uo5 will be important in understanding the evolution of pathogenesis among Mitis streptococci and their potential to engage in interspecies gene transfer.
Leishmania species cause a spectrum of human diseases in tropical and subtropical regions of the world. We have sequenced the 36 chromosomes of the 32.8-megabase haploid genome of Leishmania major (Friedlin strain) and predict 911 RNA genes, 39 pseudogenes, and 8272 protein-coding genes, of which 36% can be ascribed a putative function. These include genes involved in host-pathogen interactions, such as proteolytic enzymes, and extensive machinery for synthesis of complex surface glycoconjugates. The organization of protein-coding genes into long, strand-specific, polycistronic clusters and lack of general transcription factors in the L. major, Trypanosoma brucei, and Trypanosoma cruzi (Tritryp) genomes suggest that the mechanisms regulating RNA polymerase II-directed transcription are distinct from those operating in other eukaryotes, although the trypanosomatids appear capable of chromatin remodeling. Abundant RNA-binding proteins are encoded in the Tritryp genomes, consistent with active posttranscriptional regulation of gene expression.
We have sequenced and annotated the genome of fission yeast (Schizosaccharomyces pombe), which contains the smallest number of protein-coding genes yet recorded for a eukaryote: 4,824. The centromeres are between 35 and 110 kilobases (kb) and contain related repeats including a highly conserved 1.8-kb element. Regions upstream of genes are longer than in budding yeast (Saccharomyces cerevisiae), possibly reflecting more-extended control regions. Some 43% of the genes contain introns, of which there are 4,730. Fifty genes have significant similarity with human disease genes; half of these are cancer related. We identify highly conserved genes important for eukaryotic cell organization including those required for the cytoskeleton, compartmentation, cell-cycle control, proteolysis, protein phosphorylation and RNA splicing. These genes may have originated with the appearance of eukaryotic life. Few similarly conserved genes that are important for multicellular organization were identified, suggesting that the transition from prokaryotes to eukaryotes required more new genes than did the transition from unicellular to multicellular organization.
Arabidopsis thaliana is an important model system for plant biologists. In 1996 an international collaboration (the Arabidopsis Genome Initiative) was formed to sequence the whole genome of Arabidopsis and in 1999 the sequence of the first two chromosomes was reported. The sequence of the last three chromosomes and an analysis of the whole genome are reported in this issue. Here we present the sequence of chromosome 3, organized into four sequence segments (contigs). The two largest (13.5 and 9.2 Mb) correspond to the top (long) and the bottom (short) arms of chromosome 3, and the two small contigs are located in the genetically defined centromere. This chromosome encodes 5,220 of the roughly 25,500 predicted protein-coding genes in the genome. About 20% of the predicted proteins have significant homology to proteins in eukaryotic genomes for which the complete sequence is available, pointing to important conserved cellular functions among eukaryotes.
The higher plant Arabidopsis thaliana (Arabidopsis) is an important model for identifying plant genes and determining their function. To assist biological investigations and to define chromosome structure, a coordinated effort to sequence the Arabidopsis genome was initiated in late 1996. Here we report one of the first milestones of this project, the sequence of chromosome 4. Analysis of 17.38 megabases of unique sequence, representing about 17% of the genome, reveals 3,744 protein coding genes, 81 transfer RNAs and numerous repeat elements. Heterochromatic regions surrounding the putative centromere, which has not yet been completely sequenced, are characterized by an increased frequency of a variety of repeats, new repeats, reduced recombination, lowered gene density and lowered gene expression. Roughly 60% of the predicted protein-coding genes have been functionally characterized on the basis of their homology to known genes. Many genes encode predicted proteins that are homologous to human and Caenorhabditis elegans proteins.
The plant Arabidopsis thaliana (Arabidopsis) has become an important model species for the study of many aspects of plant biology. The relatively small size of the nuclear genome and the availability of extensive physical maps of the five chromosomes provide a feasible basis for initiating sequencing of the five chromosomes. The YAC (yeast artificial chromosome)-based physical map of chromosome 4 was used to construct a sequence-ready map of cosmid and BAC (bacterial artificial chromosome) clones covering a 1.9-megabase (Mb) contiguous region, and the sequence of this region is reported here. Analysis of the sequence revealed an average gene density of one gene every 4.8 kilobases (kb), and 54% of the predicted genes had significant similarity to known genes. Other interesting features were found, such as the sequence of a disease-resistance gene locus, the distribution of retroelements, the frequent occurrence of clustered gene families, and the sequence of several classes of genes not previously encountered in plants.
The nucleotide sequence of the 948,061 base pairs of chromosome XVI has been determined, completing the sequence of the yeast genome. Chromosome XVI was the last yeast chromosome identified, and some of the genes mapped early to it, such as GAL4, PEP4 and RAD1 (ref. 2) have played important roles in the development of yeast biology. The architecture of this final chromosome seems to be typical of the large yeast chromosomes, and shows large duplications with other yeast chromosomes. Chromosome XVI contains 487 potential protein-encoding genes, 17 tRNA genes and two small nuclear RNA genes; 27% of the genes have significant similarities to human gene products, and 48% are new and of unknown biological function. Systematic efforts to explore gene function have begun.
The complete DNA sequence of the yeast Saccharomyces cerevisiae chromosome IV has been determined. Apart from chromosome XII, which contains the 1-2 Mb rDNA cluster, chromosome IV is the longest S. cerevisiae chromosome. It was split into three parts, which were sequenced by a consortium from the European Community, the Sanger Centre, and groups from St Louis and Stanford in the United States. The sequence of 1,531,974 base pairs contains 796 predicted or known genes, 318 (39.9%) of which have been previously identified. Of the 478 new genes, 225 (28.3%) are homologous to previously identified genes and 253 (32%) have unknown functions or correspond to spurious open reading frames (ORFs). On average there is one gene approximately every two kilobases. Superimposed on alternating regional variations in G+C composition, there is a large central domain with a lower G+C content that contains all the yeast transposon (Ty) elements and most of the tRNA genes. Chromosome IV shares with chromosomes II, V, XII, XIII and XV some long clustered duplications which partly explain its origin.
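The gene spacing and category percentages quoted for chromosome IV can be rechecked from the reported counts. The following is a minimal sketch in Python; the function and variable names are illustrative and do not come from the original work, and only the figures stated in the abstract are used as input.

```python
# Figures quoted in the chromosome IV abstract; names are illustrative only.
CHROMOSOME_LENGTH_BP = 1_531_974
TOTAL_GENES = 796
GENE_CATEGORIES = {
    "previously identified": 318,
    "new, homologous to known genes": 225,
    "new, unknown function or spurious ORF": 253,
}


def gene_spacing_kb(length_bp: int, n_genes: int) -> float:
    """Average distance between gene starts, in kilobases."""
    return length_bp / n_genes / 1000


def percentage_breakdown(counts: dict, total: int) -> dict:
    """Share of each category as a percentage of the total, to one decimal."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


spacing_kb = gene_spacing_kb(CHROMOSOME_LENGTH_BP, TOTAL_GENES)
shares = percentage_breakdown(GENE_CATEGORIES, TOTAL_GENES)

print(f"one gene every {spacing_kb:.2f} kb on average")
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The three categories sum to the 796 reported genes, and the average spacing comes out slightly under 2 kb, in line with the "approximately every two kilobases" stated in the abstract.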
The yeast Saccharomyces cerevisiae is the pre-eminent organism for the study of basic functions of eukaryotic cells. All of the genes of this simple eukaryotic cell have recently been revealed by an international collaborative effort to determine the complete DNA sequence of its nuclear genome. Here we describe some of the features of chromosome XII.
Bacillus subtilis is the best-characterized member of the Gram-positive bacteria. Its genome of 4,214,810 base pairs comprises 4,100 protein-coding genes. Of these protein-coding genes, 53% are represented once, while a quarter of the genome corresponds to several gene families that have been greatly expanded by gene duplication, the largest family containing 77 putative ATP-binding transport proteins. In addition, a large proportion of the genetic capacity is devoted to the utilization of a variety of carbon sources, including many plant-derived molecules. The identification of five signal peptidase genes, as well as several genes for components of the secretion apparatus, is important given the capacity of Bacillus strains to secrete large amounts of industrially important enzymes. Many of the genes are involved in the synthesis of secondary metabolites, including antibiotics, that are more typically associated with Streptomyces species. The genome contains at least ten prophages or remnants of prophages, indicating that bacteriophage infection has played an important evolutionary role in horizontal gene transfer, in particular in the propagation of bacterial pathogenesis.
The nucleotide sequences of five major regions from chromosome VII of Saccharomyces cerevisiae have been determined and analysed. These regions represent 203 kilobases corresponding to approximately one-fifth of the complete yeast chromosome VII. Two fragments originate from the left arm of this chromosome. The first one of about 15.8 kb starts approximately 75 kb from the left telomere and is bordered by the SK18 chromosomal marker. The second fragment covers the 72.6 kb region between the chromosomal markers CYH2 and ALG2. On the right chromosomal arm three regions, a 70.6 kb region between the MSB2 and the KSS1 chromosomal markers and two smaller regions dominated by the KRE11 marker and another one in the vicinity of the SER2 marker were sequenced. We found a total of 114 open reading frames (ORFs), 13 of which were completely overlapping with larger ORFs running in the opposite direction. A total of 44 yeast genes, the physiological functions of which are known, could be precisely mapped on this chromosome. Of the remaining 57 ORFs, 26 shared sequence homologies with known genes, among which were 13 other S. cerevisiae genes and five genes from other organisms. No homology with any sequence in the databases could be found for 31 ORFs. Furthermore, five Ty elements were found, one of which may not be functional due to a frame shift in its Ty1B amino acid sequence. The five chromosomal regions harboured five potential ARS elements and one sigma element together with eight tRNA genes and two snRNAs, one of which is encoded by an intron of a protein-coding gene.
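The ORF accounting in the chromosome VII abstract can be followed step by step. The sketch below, written in Python with illustrative variable names that are not part of the original work, only restates the counts given in the text and checks that they add up.

```python
# ORF counts quoted in the chromosome VII abstract; names are illustrative only.
total_orfs = 114
fully_overlapping = 13      # ORFs lying entirely inside larger opposite-strand ORFs
known_function = 44         # genes with known physiological function
with_homology = 26          # remaining ORFs similar to known genes
without_homology = 31       # remaining ORFs with no database match

# ORFs left once overlapping ORFs and known genes are set aside.
remaining_orfs = total_orfs - fully_overlapping - known_function

print(f"remaining ORFs: {remaining_orfs}")
print(f"classified remaining ORFs: {with_homology + without_homology}")
```

Both routes give 57 remaining ORFs, matching the figure stated in the abstract.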
The complete DNA sequence of the yeast Saccharomyces cerevisiae chromosome XI has been determined. In addition to a compact arrangement of potential protein coding sequences, the 666,448-base-pair sequence has revealed general chromosome patterns; in particular, alternating regional variations in average base composition correlate with variations in local gene density along the chromosome. Significant discrepancies with the previously published genetic map demonstrate the need for using independent physical mapping criteria.
In the framework of the EU genome-sequencing programmes, the complete DNA sequence of the yeast Saccharomyces cerevisiae chromosome II (807,188 bp) has been determined. At present, this is the largest eukaryotic chromosome entirely sequenced. A total of 410 open reading frames (ORFs) were identified, covering 72% of the sequence. Similarity searches revealed that 124 ORFs (30%) correspond to genes of known function, 51 ORFs (12.5%) appear to be homologues of genes whose functions are known, 52 others (12.5%) have homologues the functions of which are not well defined and another 33 of the novel putative genes (8%) exhibit a degree of similarity which is insufficient to confidently assign function. Of the genes on chromosome II, 37-45% are thus of unpredicted function. Among the novel putative genes, we found several that are related to genes that perform differentiated functions in multicellular organisms or are involved in malignancy. In addition to a compact arrangement of potential protein coding sequences, the analysis of this chromosome confirmed general chromosome patterns but also revealed particular novel features of chromosomal organization. Alternating regional variations in average base composition correlate with variations in local gene density along chromosome II, as observed in chromosomes XI and III. We propose that functional ARS elements are preferably located in the AT-rich regions that have a spacing of approximately 110 kb. Similarly, the 13 tRNA genes and the three Ty elements of chromosome II are found in AT-rich regions. In chromosome II, the distribution of coding sequences between the two strands is biased, with a ratio of 1.3:1. An interesting aspect regarding the evolution of the eukaryotic genome is the finding that chromosome II has a high degree of internal genetic redundancy, amounting to 16% of the coding capacity.