Genetic nomenclature for E. coli has evolved since the first formal proposal in 1966 by Demerec et al. The Instructions to Authors for the Journal of Bacteriology is considered to be an authoritative source for the most recent version of the genetic nomenclature guidelines, along with the nomenclature guide from Trends in Genetics.
The genetic nomenclature uses gene and locus interchangeably. A gene is named by a three-letter italicized lowercase locus mnemonic, such as lac for lactose, trp for tryptophan, or rpo for RNA polymerase. Different genes with the same mnemonic are distinguished by the addition of an uppercase italicized letter, e.g. lacZ, lacY, and lacA are the genes of the lac operon. Sharing a three-letter mnemonic does not mean that genes necessarily map close to one another. Conversely, some genes were named for their proximity to other genes, and their functions were later found not to be closely coupled. Thus, similar gene names are not a substitute for functional annotations.
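As a rough illustration of the naming pattern just described, the following sketch splits a gene name into its three-letter mnemonic and its optional uppercase letter. The function name `split_gene_name` is hypothetical, and the pattern deliberately ignores italics, allele numbers, and any historical names that do not fit the three-letter rule.

```python
import re

# Three lowercase letters (the locus mnemonic) followed by an optional
# uppercase letter that distinguishes genes sharing the mnemonic.
GENE_NAME = re.compile(r"([a-z]{3})([A-Z]?)")


def split_gene_name(name):
    """Return (mnemonic, letter); letter is None for a bare mnemonic."""
    match = GENE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a standard-form gene name: {name!r}")
    mnemonic, letter = match.groups()
    return mnemonic, letter or None


if __name__ == "__main__":
    for gene in ("lacZ", "lacY", "lacA", "rpo"):
        print(gene, split_gene_name(gene))
```

Grouping names by the first element of the returned pair collects genes that share a mnemonic; as noted above, that grouping says nothing reliable about map position or function.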
The nomenclature is somewhat inconsistent, for historical reasons, in how cis-elements in the DNA are described. For example, the major origin of replication is oriC, while sites related to the termination of replication include RhsA, RhsB, RhsC, RhsD, RhsE (Recombination hot spot), TerA, TerB, TerC, TerD, TerE, and TerF (Terminus), where the first letter is capitalized. Transcriptional control elements are to be designated with a lowercase letter following a nearby locus, e.g. lacZp. However, it is not unusual to see lacP, lacp or Plac in the literature (Demerec refers to lacO). Subscripts and superscripts in general are problematic for keeping the data in computer-friendly form.
In some cases, older names refer to a locus that was not resolved into separate genes when the name was given. For example, malA and malB refer to two clusters of mal (maltose) genes that map at 76.4 and 91.5 minutes, respectively. The malA cluster includes malT, malP and malQ, while malB includes malE, malF, malG, malK, lamB, and malM.
Traditional gene names derive from a phenotype mnemonic, or from the identification of a gene based on the protein or RNA product it makes. Thus, lac refers to genes that are needed for growth on lactose as a sole carbon source, while ssrA is for "small stable RNA". With genomics, many genes of unknown function were identified computationally as open reading frames. These have been given names like ybbN, where y indicates a gene of unknown function and the next two letters indicate the position of the gene on the standard map of E. coli K-12 strain MG1655. For y gene names, a=0, b=1, c=2, d=3, e=4, f=5, g=6, h=7, i=8, and j=9. Thus, ybbN is a gene of unknown function that maps at 11 minutes. As with traditional loci, the uppercase letter distinguishes ybbN from other genes in the region. The numbers recycle when more than 26 genes are found in a region.
Thus, ykgE is a gene of unknown function that maps at 06 minutes on the E. coli genetic map; here the letter k begins a second pass through the digits and stands for 0. Large clusters of y genes are found in cryptic prophage genomes that were discovered during genome sequencing. A y gene name is not reused when the gene is renamed after its function is assigned, when the name is retired because the ORF has been determined not to be a coding sequence, or when two y genes are fused by corrections in the DNA sequence.
A similar notation is used to designate loci that are only known by map location. In this case, a z is used as the first letter and the next two letters describe the coordinates. This is seen most often in designating the sites of transposon insertions, as in the mapping collection from Singer et al. (1989).
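The letter-to-minute encoding used by y and z names can be sketched in a few lines. The function name `map_minute` is hypothetical. The text only defines a through j; treating later letters as wrapping around modulo ten is an assumption inferred from the ykgE example (k read as 0), not a documented rule.

```python
import string


def map_minute(locus):
    """Decode the map minute (0-99) from the two letters after y or z."""
    if len(locus) < 3 or locus[0] not in "yz":
        raise ValueError(f"not a y or z locus name: {locus!r}")
    digits = []
    for letter in locus[1:3]:
        index = string.ascii_lowercase.find(letter)
        if index < 0:
            raise ValueError(f"unexpected character {letter!r} in {locus!r}")
        # a=0 ... j=9 as stated in the text. ASSUMPTION: letters past j
        # recycle, so k is read as 0, l as 1, and so on.
        digits.append(index % 10)
    return 10 * digits[0] + digits[1]


if __name__ == "__main__":
    print(map_minute("ybbN"))  # 11
    print(map_minute("ykgE"))  # 6
```

The decoded value is only the map region implied by the name; it does not replace a lookup of the actual coordinates.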
Standard names and synonyms
Standard gene names are currently managed by the E. coli Genetic Stock Center (CGSC) and can be found on a variety of E. coli database websites. Many genes have multiple names, or synonyms. This can occur when a gene was given different names based on mutations that were only later shown to be allelic, when y genes are renamed, or when authors publish new gene names without consulting the standard names and reviewers and editors do not catch the conflict. Databases have to keep track of synonyms to help find all of the relevant literature for a gene.
In some cases, the synonyms are still widely used.
The mnemonic sup was used extensively for suppressor, to indicate genes that, when mutated, suppress the phenotypes of mutations in other genes. Most, but not all, of the sup loci in E. coli are mutations that create nonsense suppressor tRNAs. These have been given different standard names, but the sup names are still used by E. coli biologists. These names are useful because they are shorthand for particular alleles. The same tRNA gene can often mutate to some combination of amber (UAG), ochre (UAA), or opal (UGA) suppressors.
Other commonly used synonyms
Strain descriptions in the literature or in laboratory strain lists often use str to indicate an allele of rpsL that confers resistance to streptomycin.
Databases also need to deal with the opposite problem: although the standard names are unique, the literature has instances where different groups used the same name to designate different genes.
In general the wild-type allele is implicit in the genotype description of a strain; only genes with mutations are listed. Allele designators are used to specify the mutations in a strain; for example, araD139 is a mutation in araD found in many commonly used laboratory strains. In some cases, noninteger allele designations are used, as in lacIq or nusAcs10. Under the J. Bacteriol. guidelines, qualifiers indicating temperature-sensitive, cold-sensitive, or hybrid alleles should be indicated in parentheses after the allele number. Using these rules nusAcs10 should probably be nusA10(Cs). However, many variants can be found in the literature. For example, argE(Am), argE-am, and argEam all indicate amber mutations in the argE gene. Variations in usage and capitalization can make it difficult to extract instances of the same alleles by text searching. This kind of problem is not unique to E. coli.
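One way to cope with these variants when text searching is to normalize them to a single form first. The sketch below rewrites the variants mentioned above into the parenthesized style; the function name `normalize_allele` is hypothetical, and the qualifier list (am, ts, cs) is an illustrative subset rather than a complete set drawn from any guideline.

```python
import re

# Gene name, optional allele number, qualifier in any of the observed
# spellings ("(Am)", "-am", "am"), optional trailing allele number.
# ASSUMPTION: only the qualifiers am, ts and cs are recognized here.
ALLELE = re.compile(
    r"([a-z]{3}[A-Z]?)"       # gene, e.g. argE or nusA
    r"(\d*)"                  # allele number before the qualifier
    r"[-(]?((?i:am|ts|cs))\)?"  # qualifier, case-insensitive
    r"(\d*)"                  # allele number after the qualifier
)


def normalize_allele(name):
    """Rewrite e.g. argE-am as argE(Am); return other names unchanged."""
    match = ALLELE.fullmatch(name)
    if match is None:
        return name
    gene, before, qualifier, after = match.groups()
    number = before or after
    return f"{gene}{number}({qualifier.capitalize()})"


if __name__ == "__main__":
    for variant in ("argE(Am)", "argE-am", "argEam", "nusAcs10", "araD139"):
        print(variant, "->", normalize_allele(variant))
```

Names without a recognized qualifier, such as araD139, pass through untouched, so the function can be applied to every token in a genotype string before indexing.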
After gene sequences became available and identifying the sequence changes for mutant alleles became the norm, alleles came to be described by the changes created in the sequence of the encoded polypeptide. Some papers expressed this as two single-letter amino acid codes followed by a position in the protein, but it is now more common to use two single-letter amino acid codes separated by the position. Thus, QL44 and Q44L would both indicate a change from a glutamine to a leucine at position 44 of the polypeptide encoded by the gene of interest.
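Both conventions carry the same three pieces of information, so they can be read into one representation. This is a minimal sketch with a hypothetical function name, `parse_substitution`; it accepts any uppercase letter and does not check that the letters are valid amino acid codes.

```python
import re

NEWER_STYLE = re.compile(r"([A-Z])(\d+)([A-Z])")  # e.g. Q44L
OLDER_STYLE = re.compile(r"([A-Z])([A-Z])(\d+)")  # e.g. QL44


def parse_substitution(label):
    """Return (original residue, position, new residue) for either style."""
    match = NEWER_STYLE.fullmatch(label)
    if match:
        original, position, replacement = match.groups()
        return original, int(position), replacement
    match = OLDER_STYLE.fullmatch(label)
    if match:
        original, replacement, position = match.groups()
        return original, int(position), replacement
    raise ValueError(f"unrecognized substitution label: {label!r}")


if __name__ == "__main__":
    print(parse_substitution("QL44"))
    print(parse_substitution("Q44L"))
```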
In principle, allele numbers should be registered with the CGSC, but many unregistered mutant alleles are in the literature.
Deletions, Insertions, and Fusions
Special notation is used for mutations that are not single base substitutions.
Deletions are indicated as Δ, [Delta] or spelled out as Delta (the last is more friendly to computer searching; the [Delta] version is particularly bad for wikis since it must be embedded in <nowiki> tags), as in Δ(ara-leu)7696 or Δ(lac)X74. This nomenclature does not specify which genes are missing in the deletion; one must find additional information about the specific allele.
Insertions are indicated by ::, as in pyrC103::Tn10 to indicate a Tn10 insertion into the pyrC gene. Insertions that are linked to, but not in, a nearby gene are sometimes connected by a dash, as in proC-Tn10.
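The deletion and insertion notations are regular enough to recognize mechanically. The following sketch classifies a single genotype token; the function name `classify_mutation` and the dictionary keys are hypothetical, and linked insertions written with a dash (such as proC-Tn10) are deliberately left unclassified because a dash is also used for other purposes.

```python
import re

# Accept the three spellings of the deletion symbol mentioned in the text.
DELETION = re.compile(r"(?:Δ|\[Delta\]|Delta)\s*\(([^)]+)\)(\S*)")


def classify_mutation(token):
    """Classify one genotype token as an insertion, a deletion, or other."""
    if "::" in token:
        target, element = token.split("::", 1)
        return {"kind": "insertion", "target": target, "element": element}
    match = DELETION.fullmatch(token)
    if match:
        region, allele = match.groups()
        return {"kind": "deletion", "region": region, "allele": allele}
    return {"kind": "other", "text": token}


if __name__ == "__main__":
    for token in ("pyrC103::Tn10", "Δ(ara-leu)7696", "Delta(lac)X74", "araD139"):
        print(classify_mutation(token))
```

As the text notes, the region inside the parentheses does not say which genes are actually missing, so the parsed result is a label to look up, not a description of the deletion.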
Gene fusions are powerful tools for a variety of purposes. They are usually noted with a dash, as in trp-lac or lamB-lacZ. Sometimes the fusion is generated via an insertion vector, in which case the insertion designation may be used.
Episomes and mobile genetic elements
Episomes include plasmids, like F, and integrated elements that have a free-living stage in their life cycle, such as prophage. These are indicated in various ways. For example, a strain that is W3110 with F'128 might be written W3110/F'128 or W3110(F'128). Lysogens are written with the integrated phage in parentheses, as in C600(λ), not C600 attλ::λ. F's and specialized transducing phage carry chromosomal genes. In this case the wild-type gene is often indicated. While this should be written as, for example, F'128 proAB+lac+, it is often just written as F'128 pro lac.
Phenotypes are given as nonitalicized mnemonics beginning with an uppercase letter. A mutant that is unable to use lactose as the sole carbon source would be designated as having a Lac or Lac- phenotype. Superscripts are also used to clarify specific phenotypes such as sensitivity or resistance; a tetracycline-resistant strain is Tetr, while a tetracycline-sensitive strain is Tets. Superscripts are also used for conditional phenotypes such as ts (temperature-sensitive) or cs (cold-sensitive). A superscript zero is sometimes used to designate a strain that does not contain nonsense suppressors: Sup0.
When referring to the gene product the mnemonic is first-letter capitalised and not italicized (e.g. DnaA refers to the protein coded by the dnaA gene).
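The gene-to-product convention is a simple capitalization rule, sketched below. The function name `protein_name` is hypothetical, and the check assumes a standard-form gene name (three lowercase letters plus an optional uppercase letter); italics are a typesetting matter and are not represented.

```python
import re

GENE_NAME = re.compile(r"[a-z]{3}[A-Z]?")


def protein_name(gene):
    """Return the product name for a gene name, e.g. dnaA -> DnaA."""
    if GENE_NAME.fullmatch(gene) is None:
        raise ValueError(f"not a standard-form gene name: {gene!r}")
    # Capitalize only the first letter; the rest of the name is unchanged.
    return gene[0].upper() + gene[1:]


if __name__ == "__main__":
    print(protein_name("dnaA"))
    print(protein_name("lacZ"))
```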
- Demerec, M et al. (1966) A proposal for a uniform nomenclature in bacterial genetics. Genetics 54:61-76
- Rudd, KE (1998) Linkage map of Escherichia coli K-12, edition 10: the physical map. Microbiol. Mol. Biol. Rev. 62:985-1019
- Besse, M et al. (1986) Synthetic lac operator mediates repression through lac repressor when introduced upstream and downstream from lac promoter. EMBO J. 5:1377-81
- Vogel, U & Jensen, KF (1997) NusA is required for ribosomal antitermination and for modulation of the transcription elongation rate of both antiterminated RNA and mRNA. J. Biol. Chem. 272:12265-71
- Zeeberg, BR et al. (2004) Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics 5:80
- Silhavy, TJ & Beckwith, JR (1985) Uses of lac fusions for the study of biological problems. Microbiol. Rev. 49:398-418