Heterorhabditis bacteriophora are entomopathogenic nematodes that have evolved a mutualism with Photorhabdus luminescens bacteria to function as highly virulent insect pathogens. The nematode provides a safe harbor for intestinal symbionts in soil and delivers the symbiotic bacteria into the insect blood. The symbiont provides virulence and toxins, metabolites essential for nematode reproduction, and antibiotic preservation of the insect cadaver. Approximately half of the 21,250 putative protein-coding genes identified in the 77 Mbp high-quality draft H. bacteriophora genome sequence encode novel proteins of unknown function lacking homologs in Caenorhabditis elegans or any other sequenced organism. Similarly, 317 of the 603 predicted secreted proteins are novel with unknown function, in addition to 19 putative peptidases, 9 peptidase inhibitors and 7 C-type lectins that may function in interactions with insect hosts or bacterial symbionts. A total of 134 proteins contained mariner transposase domains, of which there are none in C. elegans, suggesting an invasion and expansion of mariner transposons in H. bacteriophora. Fewer Kyoto Encyclopedia of Genes and Genomes Orthologies in almost all metabolic categories were detected in the genome compared with 9 other sequenced nematode genomes, which may reflect dependence on the symbiont or insect host for these functions. The H. bacteriophora genome sequence will greatly facilitate genetics, genomics and evolutionary studies to gain fundamental knowledge of nematode parasitism and mutualism. It also elevates the utility of H. bacteriophora as a bridge species between vertebrate parasitic nematodes and the C. elegans model.
Comparative analysis of multiple genomes in a phylogenetic framework dramatically improves the precision and sensitivity of evolutionary inference, producing more robust results than single-genome analyses can provide. The genomes of 12 Drosophila species, ten of which are presented here for the first time (sechellia, simulans, yakuba, erecta, ananassae, persimilis, willistoni, mojavensis, virilis and grimshawi), illustrate how rates and patterns of sequence divergence across taxa can illuminate evolutionary processes on a genomic scale. These genome sequences augment the formidable genetic tools that have made Drosophila melanogaster a pre-eminent model for animal genetics, and will further catalyse fundamental research on mechanisms of development, cell biology, genetics, disease, neurobiology, behaviour, physiology and evolution. Despite remarkable similarities among these Drosophila species, we identified many putatively non-neutral changes in protein-coding genes, non-coding RNA genes, and cis-regulatory regions. These may prove to underlie differences in the ecology and behaviour of these diverse species.
Escherichia coli is a model laboratory bacterium, a species that is widely distributed in the environment, as well as a mutualist and pathogen in its human hosts. As such, E. coli represents an attractive organism to study how environment impacts microbial genome structure and function. Uropathogenic E. coli (UPEC) must adapt to life in several microbial communities in the human body, and has a complex life cycle in the bladder when it causes acute or recurrent urinary tract infection (UTI). Several studies designed to identify virulence factors have focused on genes that are uniquely represented in UPEC strains, whereas the role of genes that are common to all E. coli has received much less attention. Here we describe the complete 5,065,741-bp genome sequence of a UPEC strain recovered from a patient with an acute bladder infection and compare it with six other finished E. coli genome sequences. We searched 3,470 ortholog sets for genes that are under positive selection only in UPEC strains. Our maximum likelihood-based analysis yielded 29 genes involved in various aspects of cell surface structure, DNA metabolism, nutrient acquisition, and UTI. These results were validated by resequencing a subset of the 29 genes in a panel of 50 urinary, periurethral, and rectal E. coli isolates from patients with UTI. These studies outline a computational approach that may be broadly applicable for studying strain-specific adaptation and pathogenesis in other bacteria.
Human chromosome 2 is unique to the human lineage in being the product of a head-to-head fusion of two intermediate-sized ancestral chromosomes. Chromosome 4 has received attention primarily related to the search for the Huntington's disease gene, but also for genes associated with Wolf-Hirschhorn syndrome, polycystic kidney disease and a form of muscular dystrophy. Here we present approximately 237 million base pairs of sequence for chromosome 2, and 186 million base pairs for chromosome 4, representing more than 99.6% of their euchromatic sequences. Our initial analyses have identified 1,346 protein-coding genes and 1,239 pseudogenes on chromosome 2, and 796 protein-coding genes and 778 pseudogenes on chromosome 4. Extensive analyses confirm the underlying construction of the sequence, and expand our understanding of the structure and evolution of mammalian chromosomes, including gene deserts, segmental duplications and highly variant regions.
Salmonella enterica serovars often have a broad host range, and some cause both gastrointestinal and systemic disease. But the serovars Paratyphi A and Typhi are restricted to humans and cause only systemic disease. It has been estimated that Typhi arose in the last few thousand years. The sequence and microarray analysis of the Paratyphi A genome indicates that it is similar to the Typhi genome but suggests that it has a more recent evolutionary origin. Both genomes have independently accumulated many pseudogenes among their approximately 4,400 protein coding sequences: 173 in Paratyphi A and approximately 210 in Typhi. The recent convergence of these two similar genomes on a similar phenotype is subtly reflected in their genotypes: only 30 genes are degraded in both serovars. Nevertheless, these 30 genes include three known to be important in gastroenteritis, which does not occur in these serovars, and four for Salmonella-translocated effectors, which are normally secreted into host cells to subvert host functions. Loss of function also occurs by mutation in different genes in the same pathway (e.g., in chemotaxis and in the production of fimbriae).
Human chromosome 7 has historically received prominent attention in the human genetics community, primarily related to the search for the cystic fibrosis gene and the frequent cytogenetic changes associated with various forms of cancer. Here we present more than 153 million base pairs representing 99.4% of the euchromatic sequence of chromosome 7, the first metacentric chromosome completed so far. The sequence has excellent concordance with previously established physical and genetic maps, and it exhibits an unusual amount of segmentally duplicated sequence (8.2%), with marked differences between the two arms. Our initial analyses have identified 1,150 protein-coding genes, 605 of which have been confirmed by complementary DNA sequences, and an additional 941 pseudogenes. Of genes confirmed by transcript sequences, some are polymorphic for mutations that disrupt the reading frame.
The soil nematodes Caenorhabditis briggsae and Caenorhabditis elegans diverged from a common ancestor roughly 100 million years ago and yet are almost indistinguishable by eye. They have the same chromosome number and genome sizes, and they occupy the same ecological niche. To explore the basis for this striking conservation of structure and function, we have sequenced the C. briggsae genome to a high-quality draft stage and compared it to the finished C. elegans sequence. We predict approximately 19,500 protein-coding genes in the C. briggsae genome, roughly the same as in C. elegans. Of these, 12,200 have clear C. elegans orthologs, a further 6,500 have one or more clearly detectable C. elegans homologs, and approximately 800 C. briggsae genes have no detectable matches in C. elegans. Almost all of the noncoding RNAs (ncRNAs) known are shared between the two species. The two genomes exhibit extensive colinearity, and the rate of divergence appears to be higher in the chromosomal arms than in the centers. Operons, a distinctive feature of C. elegans, are highly conserved in C. briggsae, with the arrangement of genes being preserved in 96% of cases. The difference in size between the C. briggsae (estimated at approximately 104 Mbp) and C. elegans (100.3 Mbp) genomes is almost entirely due to repetitive sequence, which accounts for 22.4% of the C. briggsae genome in contrast to 16.5% of the C. elegans genome. Few, if any, repeat families are shared, suggesting that most were acquired after the two species diverged or are undergoing rapid evolution. Coclustering the C. elegans and C. briggsae proteins reveals 2,169 protein families of two or more members. Most of these are shared between the two species, but some appear to be expanding or contracting, and there seem to be as many as several hundred novel C. briggsae gene families. The C. briggsae draft sequence will greatly improve the annotation of the C. elegans genome. Based on similarity to C. briggsae, we found strong evidence for 1,300 new C. elegans genes. In addition, comparisons of the two genomes will help to understand the evolutionary forces that mold nematode genomes.
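The gene counts quoted in the C. briggsae abstract above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: the category labels and the helper name `category_shares` are illustrative assumptions, and the counts are the rounded, approximate figures given in the abstract.

```python
# Counts quoted in the abstract above (all approximate in the source).
TOTAL_PREDICTED = 19_500

# Category labels are illustrative paraphrases, not terms from the paper.
CATEGORIES = {
    "clear C. elegans ortholog": 12_200,
    "detectable C. elegans homolog only": 6_500,
    "no detectable C. elegans match": 800,
}


def category_shares(categories, total):
    """Return each category's share of `total` as a percentage, rounded to 0.1."""
    return {name: round(100 * count / total, 1) for name, count in categories.items()}


if __name__ == "__main__":
    # The three categories should account for every predicted gene.
    assert sum(CATEGORIES.values()) == TOTAL_PREDICTED
    for name, share in category_shares(CATEGORIES, TOTAL_PREDICTED).items():
        print(f"{name}: {share}%")
```

Under these rounded figures, the three categories sum to the predicted total, and roughly 4% of predicted C. briggsae genes fall in the group with no detectable C. elegans match.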
Salmonella enterica subspecies I, serovar Typhimurium (S. typhimurium), is a leading cause of human gastroenteritis, and is used as a mouse model of human typhoid fever. The incidence of non-typhoid salmonellosis is increasing worldwide, causing millions of infections and many deaths in the human population each year. Here we sequenced the 4,857-kilobase (kb) chromosome and 94-kb virulence plasmid of S. typhimurium strain LT2. The distribution of close homologues of S. typhimurium LT2 genes in eight related enterobacteria was determined using previously completed genomes of three related bacteria, sample sequencing of both S. enterica serovar Paratyphi A (S. paratyphi A) and Klebsiella pneumoniae, and hybridization of three unsequenced genomes to a microarray of S. typhimurium LT2 genes. Lateral transfer of genes is frequent, with 11% of the S. typhimurium LT2 genes missing from S. enterica serovar Typhi (S. typhi), and 29% missing from Escherichia coli K12. The 352 gene homologues of S. typhimurium LT2 confined to subspecies I of S. enterica, which contains most mammalian and bird pathogens, are useful for studies of epidemiology, host specificity and pathogenesis. Most of these homologues were previously unknown, and 50 may be exported to the periplasm or outer membrane, rendering them accessible as therapeutic or vaccine targets.
The genome of the model plant Arabidopsis thaliana has been sequenced by an international collaboration, The Arabidopsis Genome Initiative. Here we report the complete sequence of chromosome 5. This chromosome is 26 megabases long; it is the second largest Arabidopsis chromosome and represents 21% of the sequenced regions of the genome. The sequences of chromosomes 2 and 4 have been reported previously, and those of chromosomes 1 and 3, together with an analysis of the complete genome sequence, are reported in this issue. Analysis of the sequence of chromosome 5 yields further insights into centromere structure and the sequence determinants of heterochromatin condensation. The 5,874 genes encoded on chromosome 5 reveal several new functions in plants, and the patterns of gene organization provide insights into the mechanisms and extent of genome evolution in plants.
The higher plant Arabidopsis thaliana (Arabidopsis) is an important model for identifying plant genes and determining their function. To assist biological investigations and to define chromosome structure, a coordinated effort to sequence the Arabidopsis genome was initiated in late 1996. Here we report one of the first milestones of this project, the sequence of chromosome 4. Analysis of 17.38 megabases of unique sequence, representing about 17% of the genome, reveals 3,744 protein coding genes, 81 transfer RNAs and numerous repeat elements. Heterochromatic regions surrounding the putative centromere, which has not yet been completely sequenced, are characterized by an increased frequency of a variety of repeats, new repeats, reduced recombination, lowered gene density and lowered gene expression. Roughly 60% of the predicted protein-coding genes have been functionally characterized on the basis of their homology to known genes. Many genes encode predicted proteins that are homologous to human and Caenorhabditis elegans proteins.