How many different protein coding genes are in the Human Biome?

There are approximately 20k protein-coding genes found in the human genome. This number is presumably very small when considering all the genomes found in the diverse microbes associated with the human body.

Is there an estimate for the total number of protein-coding genes found in the human microbiome?

The Human Microbiome Project collected samples from healthy volunteers.

They give an estimate of about 8 million protein-coding genes in the human microbiome, which is about 360 times the number of genes in the human genome. Their press release has much more information.

Human Microbiome Project

According to data published on the HMP website:

  • The human microbiome consists of all the microorganisms that reside in or on the human body.
  • They may cause illness, but some are necessary for good health.
  • Their total count is about 10 times the number of human cells.
  • ~800 genomes had been sequenced.
  • ~5,000 human samples had been used.
  • Includes 14 disease project samples.
  • The total data stored on the DACC resource is 14 terabytes.
  • The total number of protein-coding genes (~8 million) was found to be about 360 times the number of human genes.
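
As a quick sanity check on the 360-fold figure, the sketch below recomputes the ratio from the rounded numbers quoted in this thread: about 8 million microbial genes, and either the roughly 20,000 human protein-coding genes mentioned in the question or the count that a 360-fold ratio would imply. The variable names are illustrative only.

```python
# Rounded estimates quoted in the question and answers above.
MICROBIOME_GENES = 8_000_000
HUMAN_GENES_QUESTION = 20_000   # "approximately 20k" from the question
QUOTED_RATIO = 360              # ratio quoted for the HMP estimate

# Human gene count implied by the quoted 360x ratio.
implied_human_genes = MICROBIOME_GENES / QUOTED_RATIO

# Ratio obtained if the question's 20k figure is used instead.
ratio_with_20k = MICROBIOME_GENES / HUMAN_GENES_QUESTION

print(f"implied human gene count: {implied_human_genes:,.0f}")
print(f"ratio using 20k human genes: {ratio_with_20k:.0f}x")
```

Both results sit in the same range, so the 360-fold figure appears to assume a human gene count slightly above 20,000.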




The Human Genome at 20: How Biology’s Most-Hyped Breakthrough Led to Anticlimax and Arrests

When President Bill Clinton took to a White House lectern 20 years ago to announce that the human genome sequence had been completed, he hailed the breakthrough as “the most important, most wondrous map ever produced by humankind.” The scientific achievement was placed on par with the moon landings.

It was hoped that having access to the sequence would transform our understanding of human disease within 20 years, leading to better treatment, detection, and prevention. The famous journal article that shared our genetic ingredients with the world, published in February 2001, was welcomed as a “Book of Life” that could revolutionize medicine by showing which of our genes led to which illnesses.

But in the two decades since, the sequence has underwhelmed. The potential of our newfound genetic self-knowledge has not been fulfilled. Instead, what has emerged is a new frontier in genetic research: new questions for a new batch of researchers to answer.

Today, the gaps between our genes, and the switches that direct genetic activity, are emerging as powerful determinants behind how we look and how we get ill – perhaps deciding up to 90% of what makes us different from one another. Understanding this “genetic dark matter,” using the knowledge provided by the human genome sequence, will help us to push further into our species’ genetic secrets.

The announcement was first made in a joint press conference between President Bill Clinton and Prime Minister Tony Blair in 2000.

Unraveled code

Cracking the human genetic code took 13 years, US$2.7 billion (£1.9 billion) and hundreds of scientists peering through over 3 billion base pairs in our DNA. Once mapped, our genetic data helped projects like the Cancer Dependency Map and the Genome Wide Association Studies better understand the diseases that afflict humans.

But some results were disappointing. Back in 2000, as it was becoming clear the genome sequence was imminent, the genomics community began excitedly placing bets predicting how many genes the human genome would contain. Some bets were as high as 300,000, others as low as 40,000. For context, the onion genome contains 60,000 genes.

Dispiritingly, it turned out that our genome contains roughly the same number of genes as a mouse or a fruit fly (around 21,000), and three times less than an onion. Few would argue that humans are three times less complex than an onion. Instead, this discovery suggested that the number of genes in our genome had little to do with our complexity or our difference from other species, as had been previously assumed.

Great responsibility

Access to the human genome sequence also presented the scientific community with a huge number of important ethical questions, underscored in 2000 by Prime Minister Tony Blair when he cautioned: “With the power of this discovery comes the responsibility to use it wisely.”

Ethicists were particularly concerned about questions of “genetic discrimination,” like whether our genes could be used against us as evidence in a court of law, or as a basis for exclusion: a new kind of twisted hierarchy determined by our biology.

Some of these concerns were addressed by legislation against genetic discrimination, like the US Genetic Information Nondiscrimination Act of 2008. Other concerns, like those around so-called “designer babies,” are still being put to the test today.

In 2018, human embryos were gene edited by a Chinese scientist, using a method called CRISPR which allows targeted sections of DNA to be snipped off and replaced with others. The scientist involved was subsequently jailed, suggesting that there remains little appetite for human genetic experimentation.

On the other hand, to deny available genetic treatments to willing patients may one day be considered unethical – just as some countries have chosen to legalize euthanasia on ethical grounds. Questions remain about how humanity should handle its genetic data.

The Chinese scientist He Jiankui announced in 2018 that he had created gene-edited twins. He was jailed in 2019.

Disease diversions

With human gene editing still highly contentious, researchers have instead looked to find out which genes may be responsible for humanity’s illnesses. Yet when scientists investigated which genes are linked to human diseases, they were met with a surprise. After comparing huge samples of human DNA to find whether certain genes led to certain illnesses, they found that many unexpected sections of the genome were involved in the development of human disease.

The genome contains two sections: the coding genome, and the non-coding genome. The coding genome represents just 1.7% of our DNA, but is responsible for coding the proteins that are the essential building blocks of life. Genes are defined by their ability to code proteins: so 1.7% of our genome consists of genes.

The non-coding genome, which makes up the remaining 98.3% of our DNA, doesn’t code proteins. This largely unknown section of the genome was once dismissed as “junk DNA,” previously thought to be useless. It contained no protein-creating genes, so it was assumed the non-coding genome had little to do with the stuff of life.

Bewilderingly, scientists found that the non-coding genome was actually responsible for the majority of information that impacted disease development in humans. Such findings have made it clear that the non-coding genome is actually far more important than previously thought.

Enhanced capabilities

Within this non-coding part of the genome, researchers have subsequently found short regions of DNA called enhancers: gene switches that turn genes on and off in different tissues at different times. They found that enhancers needed to shape the embryo have changed very little during evolution, suggesting that they represent a major and important source of genetic information.

These studies inspired one of us, Alasdair, to explore the possible role of enhancers in behaviors such as alcohol intake, anxiety, and fat intake. By comparing the genomes of mice, birds, and humans we identified an enhancer that has changed relatively little over 350 million years – suggesting its importance in species’ survival.

When we used CRISPR genome editing to delete this enhancer from the mouse genome, those mice ate less fat, drank less alcohol, and displayed reduced anxiety. While these may all sound like positive changes, it’s likely that these enhancers evolved in calorically poor environments full of predators and threats. At the time, eating high-calorie food sources such as fat and fermented fruit, and being hyper-vigilant of predators, would have been key for survival. However, in modern society these same behaviors may now contribute to obesity, alcohol abuse, and chronic anxiety.

Intriguingly, subsequent genetic analysis of a major human population cohort has shown that changes in the same human enhancer were also associated with differences in alcohol intake and mood. These studies demonstrate that enhancers are not only important for normal physiology and health, but that changing them could result in changes in behavior that have major implications for human health.

Given these new avenues of research, we appear to be at a crossroads in genetic biology. The importance of gene enhancers in health and disease sits uncomfortably with our relative inability to identify and understand them.

And so in order to make the most of the sequencing of the human genome two decades ago, it’s clear that research must now look beyond the 1.7% of the genome that encodes proteins. In exploring uncharted genetic territory, like that represented by enhancers, biology may well locate the next swathe of healthcare breakthroughs.


Medical Implications of Detailed Human Genome Maps

Advances in molecular genetics made over the past two decades are already having a major impact on medical research and clinical care. The ability to clone and analyze individual genes and to deduce the amino acid sequences of encoded proteins has greatly increased our understanding of genetic disorders, the immune system, endocrine abnormalities, coronary artery disease, infectious diseases, and cancer. A few proteins produced on a commercial scale by recombinant DNA methods are available for therapeutic use or in clinical trials, and many more are in earlier developmental stages. Recent progress in determining the genetic basis for such neurological and behavioral disorders as Huntington's disease (Gusella et al., 1983), Alzheimer's disease (St George-Hyslop et al., 1987), and manic-depressive illness (Egeland et al., 1987) promises new insights into these common and serious conditions. Higher resolution maps of the human genome will accelerate progress in understanding disease pathogenesis and in developing new approaches to diagnosis, treatment, and prevention in many areas of medicine. In Chapter 3 the potential medical impact of a detailed human genomic map is discussed further.

SARS-CoV-2 Genome Study Uncovers Full Set of Protein-Coding Genes

On January 11, 2020, the first draft genome of the SARS-CoV-2 virus was posted online by Chinese researchers. Many of the virus’s genes were quickly determined because it shares a large amount of similarity with other coronaviruses. However, the full complement of protein-coding genes remained unresolved.

Now, researchers have generated what they describe as the most accurate and complete gene annotation of the SARS-CoV-2 genome. In the study, several protein-coding genes were confirmed and others were newly identified. The research team also analyzed nearly 2,000 mutations that have arisen in different SARS-CoV-2 isolates since it began infecting humans, allowing them to rate how important those mutations may be in changing the virus’ ability to evade the immune system or become more infectious.

“We were able to use this powerful comparative genomics approach for evolutionary signatures to discover the true functional protein-coding content of this enormously important genome,” said Manolis Kellis, PhD, professor of computer science in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) as well as a member of the Broad Institute of MIT and Harvard.

The team used comparative genomics to elucidate the SARS-CoV-2 genome. The SARS-CoV-2 virus belongs to a subgenus of viruses called Sarbecovirus, most of which infect bats. The researchers performed their analysis on SARS-CoV-2, SARS-CoV (the cause of the 2003 SARS outbreak), and 42 strains of bat sarbecoviruses.

Kellis has previously developed computational techniques for doing this type of analysis, which his team has also used to compare the human genome with genomes of other mammals. Using these techniques, the researchers confirmed six protein-coding genes in the SARS-CoV-2 genome in addition to the five that are well established in all coronaviruses. The team determined that the region that encodes a gene called ORF3a also encodes an additional gene, which they name ORF3c, that overlaps with ORF3a but in a different reading frame. The role of this new gene, as well as several other SARS-CoV-2 genes, is not known yet.
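
To make the idea of two genes sharing a region in different reading frames concrete, the sketch below translates one toy DNA string in two frames using the standard genetic code. The sequence is invented for illustration and is not taken from the SARS-CoV-2 genome; `translate` is a hypothetical helper name, not part of the study's tooling.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate `dna` starting at offset `frame` (0, 1, or 2); '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


toy = "ATGGCATGA"  # invented sequence, for illustration only
print(translate(toy, 0))  # reading frame starting at the first base
print(translate(toy, 1))  # same DNA, shifted by one base
```

The same nucleotides yield unrelated peptides in the two frames, which is why an overlapping gene can go unnoticed until each frame is examined separately.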

The authors wrote that they “find strong protein-coding signatures for ORFs 3a, 6, 7a, 7b, 8, 9b, and a novel alternate-frame gene, ORF3c, whereas ORFs 2b, 3d/3d-2, 3b, 9c, and 10 lack protein-coding signatures or convincing experimental evidence of protein-coding function.”

The researchers also showed that five other regions that had been proposed as possible genes do not encode functional proteins, and they also ruled out the possibility that there are any more conserved protein-coding genes yet to be discovered.

“We analyzed the entire genome and are very confident that there are no other conserved protein-coding genes,” said Irwin Jungreis, PhD, lead author of the study and a CSAIL research scientist. “Experimental studies are needed to figure out the functions of the uncharacterized genes, and by determining which ones are real, we allow other researchers to focus their attention on those genes rather than spend their time on something that doesn’t even get translated into protein.”

The researchers also recognized that many previous papers used not only incorrect gene sets, but sometimes also conflicting gene names. To remedy the situation, they brought together the SARS-CoV-2 community and presented a set of recommendations for naming SARS-CoV-2 genes, in a separate paper published a few weeks ago in Virology.

In addition to the annotation work, the researchers also analyzed more than 1,800 mutations that have arisen in SARS-CoV-2 since it was first identified. For each gene, they compared how rapidly that particular gene has evolved in the past with how much it has evolved since the current pandemic began.

They found that in most cases, genes that evolved rapidly for long periods of time before the current pandemic have continued to do so, and those that tended to evolve slowly have maintained that trend. However, the researchers also identified exceptions to these patterns, which may shed light on how the virus has evolved as it has adapted to its new human host, Kellis said.

In one example, the researchers identified a region of the nucleocapsid protein that had many more mutations than expected from its historical evolution patterns. This protein region is also classified as a target of human B cells. Therefore, mutations in that region may help the virus evade the human immune system, Kellis said.

They wrote, “Cross-strain and within-strain evolutionary pressures agree, except for fewer-than-expected within-strain mutations in nsp3 and S1, and more-than-expected in nucleocapsid, which shows a cluster of mutations in a predicted B-cell epitope, suggesting immune-avoidance selection.”

“The most accelerated region in the entire genome of SARS-CoV-2 is sitting smack in the middle of this nucleocapsid protein,” he added. “We speculate that those variants that don’t mutate that region get recognized by the human immune system and eliminated, whereas those variants that randomly accumulate mutations in that region are in fact better able to evade the human immune system and remain in circulation.”

The researchers also analyzed mutations that have arisen in variants of concern, such as the B.1.1.7 strain from England, the P.1 strain from Brazil, and the B.1.351 strain from South Africa. Many of the mutations that make those variants more dangerous are found in the spike protein, and help the virus spread faster and avoid the immune system. However, each of those variants carries other mutations as well.

“Each of those variants has more than 20 other mutations, and it’s important to know which of those are likely to be doing something and which aren’t,” Jungreis said. “So, we used our comparative genomics evidence to get a first-pass guess at which of these are likely to be important based on which ones were in conserved positions.”

The annotated gene set and mutation classifications are available in the University of California at Santa Cruz Genome Browser.

“We can now go and actually study the evolutionary context of these variants and understand how the current pandemic fits in that larger history,” Kellis said. “For strains that have many mutations, we can see which of these mutations are likely to be host-specific adaptations, and which mutations are perhaps nothing to write home about.”


This article was updated on February 21, 2021 to clarify that DNA base pairs are not made from proteins.


Identifying Orphans.

Our analysis requires studying the properties of human ORFs that lack cross-species counterparts, which we term “orphans.” Such study requires carefully filtering the human gene catalogs, to identify genes with counterparts and to eliminate a wide range of artifacts that would interfere with analysis of the orphans. For this reason, we undertook a thorough reanalysis of the human gene catalogs.

We focused on the Ensembl catalog (version 35), which lists 22,218 protein-coding genes with a total of 239,250 exons. Our analysis considered only the 21,895 genes on the human genome reference sequence of chromosomes 1–22 and X. (We thus omitted the mitochondrial chromosome, chromosome Y, and “unplaced contigs,” which involve special considerations; see below.)

We developed a computational protocol by which the putative genes are classified based on comparison with the human, mouse, and dog genomes (Fig. 1; see Materials and Methods). The mouse and dog genomes were used, because high-quality genomic sequence is available (7, 8), and the extent of sequence divergence is well suited for gene identification. The nucleotide substitution rate relative to human is ≈0.50 per base for mouse and ≈0.35 for dog, with insertion and deletion (indel) events occurring at a frequency that is ≈10-fold lower (8, 9). These rates are low enough to allow reliable sequence alignment but high enough to reveal the differential mutation patterns expected in coding and noncoding regions.

Fig. 1. Flowchart of the analysis. The central pipeline illustrates the computational analysis of 21,895 putative genes in the Ensembl catalog (v35). We then performed manual inspection of 1,178 cases to obtain the tables of likely valid and invalid genes. See text for details.

After the computational pipeline, we undertook visual inspection of ≈1,200 cases to detect instances of misclassification due to limitations of the algorithms or apparent errors in reported human gene annotations; this process revised the classification of 417 cases. We briefly summarize the results.

Class 0: Transposons, pseudogenes, and other artifacts.

Some of the putative genes consist of transposable elements or processed pseudogenes that slipped through the process used to construct the Ensembl catalog. Using a more stringent filter, we identified 1,538 such cases. These were 487 cases consisting of transposon-derived sequence, 483 processed pseudogenes derived from a multiexon parent gene (recognizable because the introns had been eliminated by splicing), and 568 processed pseudogenes derived from a single-exon parent gene (recognizable because the pseudogene sequence almost precisely interrupts the aligned orthologous sequence of human with mouse or dog).

Class 1: Genes with cross-species orthologs.

We next identified putative genes with a corresponding gene in the syntenic region of mouse or dog. We examined the orthologous DNA sequence in each species, checking whether an orthologous gene was already annotated in current gene catalogs for mouse or dog and, if not, whether we could identify an orthologous gene. Such cases are referred to as “simple orthology” (or 1:1 orthology). We then expanded the search to a surrounding region of 1 Mb in mouse and dog to allow for cases of local gene family expansion. Such cases are referred to as “complex orthology” (or “coorthology”). In both circumstances, the orthologous gene was required to have an ORF that aligns to a substantial portion (≥80%) of the human gene and have substantial peptide identity (≥50% for mouse, ≥60% for dog). Orthologous genes were identified for 18,752 of the putative human genes, with 16,210 involving simple orthology and 2,542 involving coorthology.
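
The acceptance criteria in this paragraph can be sketched as a simple filter. The thresholds are the ones quoted above; the `CandidateOrtholog` record and the function name are hypothetical and do not reflect the pipeline's actual data structures.

```python
from dataclasses import dataclass

# Thresholds quoted in the text.
MIN_ALIGNED_FRACTION = 0.80
MIN_PEPTIDE_IDENTITY = {"mouse": 0.50, "dog": 0.60}


@dataclass
class CandidateOrtholog:
    """Hypothetical summary of one human gene aligned to one candidate ORF."""
    species: str             # "mouse" or "dog"
    aligned_fraction: float  # fraction of the human gene covered by the ORF
    peptide_identity: float  # fraction of identical aligned residues


def passes_orthology_criteria(candidate: CandidateOrtholog) -> bool:
    """Return True if the candidate meets both quoted thresholds."""
    return (
        candidate.aligned_fraction >= MIN_ALIGNED_FRACTION
        and candidate.peptide_identity >= MIN_PEPTIDE_IDENTITY[candidate.species]
    )


print(passes_orthology_criteria(CandidateOrtholog("mouse", 0.85, 0.55)))
print(passes_orthology_criteria(CandidateOrtholog("dog", 0.85, 0.55)))
```

The same alignment statistics can pass for mouse and fail for dog, because the identity threshold is stricter for the less diverged species.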

Class 2: Genes with cross-species paralogs.

The pipeline then identified 155 cases of putative human genes that have a paralog within the human genome which, in turn, has an ortholog in mouse or dog. These genes largely represent nonlocal duplications in the human lineage (three-quarters lie in segmental duplications) or possibly gene losses in the other lineages. Among these genes, close inspection revealed eight cases in which a small change to the human annotation allowed the identification of a clear human ortholog.

Class 3: Genes with human-only paralogs.

The pipeline identified 68 cases of putative human genes that have one or more paralogs within the human genome, but with none of these paralogs having orthologs in mouse or dog. Close inspection eliminated 17 cases as additional retroposons or other artifacts (see SI Appendix). The remaining 51 cases appear to be valid genes, with 15 belonging to three known families of primate-specific genes (DUF1220, NPIP, and CDRT15 families) and the others occurring in smaller paralogous groups (two to eight members) that may also represent primate-specific families.

Class 4: Genes with Pfam domains.

The pipeline identified 97 cases of putative genes with homology to a known protein domain in the Pfam collection (10). Close inspection eliminated 21 cases as additional retroposons or other artifacts (see SI Appendix) and 40 cases in which a small change to the human annotation allowed the identification of a clear human ortholog. The remaining 36 genes appear to be valid genes, with 10 containing known primate-specific domains and 26 containing domains common to many species.

Class 5: Orphans.

A total of 1,285 putative genes remained after the above procedure. Close inspection identified 40 cases that were clear artifacts (long tandem repeats that happen to lack a stop codon) and 68 cases in which a cross-species ortholog could be assigned after a small correction to the human gene annotation. The remaining 1,177 cases were declared to be orphans, because they lack orthology, paralogy, or homology to known genes and are not obvious artifacts. We note that the careful review of the genes was essential to obtaining a “clean” set of orphans for subsequent analysis.
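
The counts reported for Classes 0 through 5 can be checked for internal consistency. The sketch below uses only numbers stated in the text and verifies that the six classes account for all 21,895 putative genes and that each within-class breakdown adds up; the dictionary keys are shorthand labels chosen for this example.

```python
CATALOG_TOTAL = 21_895  # putative genes on chromosomes 1-22 and X

# Number of putative genes assigned to each class, as stated in the text.
class_counts = {
    "class 0 (transposons, pseudogenes, artifacts)": 1_538,
    "class 1 (cross-species orthologs)": 18_752,
    "class 2 (cross-species paralogs)": 155,
    "class 3 (human-only paralogs)": 68,
    "class 4 (Pfam domains)": 97,
    "class 5 (remaining putative genes)": 1_285,
}

# The six classes should partition the catalog.
assert sum(class_counts.values()) == CATALOG_TOTAL

# Within-class breakdowns quoted in the text.
assert 487 + 483 + 568 == 1_538    # class 0 subtypes
assert 16_210 + 2_542 == 18_752    # simple orthology + coorthology
assert 68 - 17 == 51               # class 3 genes judged valid
assert 97 - 21 - 40 == 36          # class 4 genes judged valid
orphans = 1_285 - 40 - 68          # class 5 after close inspection
assert orphans == 1_177

print(f"all {CATALOG_TOTAL:,} putative genes accounted for; {orphans:,} orphans")
```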

Characterizing the Orphans.

We characterized the properties of the orphans to see whether they resemble those seen for protein-coding genes or expected for random ORFs arising in noncoding transcripts.

ORF lengths.

The orphans have a GC content of 55%, which is much higher than the average for the human genome (39%) and similar to that seen in protein-coding genes with cross-species counterparts (53%). The high-GC content reflects the orphans' tendency to occur in gene-rich regions.

We examined the ORF lengths of the orphans, relative to their GC-content. The orphans have relatively small ORFs (median = 393 bp), and the distribution of ORF lengths closely resembles the mathematical expectation for the longest ORF that would arise by chance in a transcript derived from human genomic DNA with the observed GC-content (SI Fig. 4).
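
To see why ORF length has to be judged relative to GC-content, consider a deliberately simplified model. It is an assumption made here for illustration, not the calculation behind SI Fig. 4: bases are drawn independently, with G and C each at frequency GC/2 and A and T each at (1 - GC)/2. Because the three stop codons (TAA, TAG, TGA) are AT-rich, they become rarer as GC-content rises, so open stretches arising by chance get longer. The function names are hypothetical.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def base_frequencies(gc: float) -> dict:
    """Per-base frequencies under an independent-base model with GC content `gc`."""
    at_each = (1.0 - gc) / 2.0
    gc_each = gc / 2.0
    return {"A": at_each, "T": at_each, "G": gc_each, "C": gc_each}


def stop_codon_probability(gc: float) -> float:
    """Probability that a random codon is a stop codon under the model."""
    freq = base_frequencies(gc)
    return sum(freq[c[0]] * freq[c[1]] * freq[c[2]] for c in STOP_CODONS)


def mean_codons_before_stop(gc: float) -> float:
    """Mean number of non-stop codons preceding the first stop (geometric model)."""
    p = stop_codon_probability(gc)
    return (1.0 - p) / p


# GC-content values quoted in the text: genome average vs. orphans.
for gc in (0.39, 0.55):
    print(f"GC = {gc:.2f}: P(stop) = {stop_codon_probability(gc):.4f}, "
          f"mean open run = {mean_codons_before_stop(gc):.1f} codons")
```

This models the mean open run in a single frame rather than the longest ORF in a transcript, so it only shows the direction of the effect, not the distribution the authors compared against.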

Conservation properties.

We then focused on cross-species conservation properties. To assess the sensitivity of various measures, we examined a set of 5,985 “well studied” genes defined by the criterion that they are discussed in more than five published articles. For each well studied gene, we selected a matched random control sequence from the human genome, having a similar number of “exons” with similar lengths, a similar proportion of repeat sequence and a similar proportion of cross-species alignment, but not overlapping with any putative genes.

The well studied genes and matched random controls differ with respect to all conservation properties studied (SI Fig. 5 and SI Table 1). The nucleotide identity and Ka/Ks ratio clearly differ, but the distributions are wide and have substantial overlap. The indel density has a tighter distribution: 97.3% of well studied genes, but only 2.8% of random controls, have an indel density of <10 per kb. The sharpest distinctions, however, were found for two measures that reflect the distinctive evolution of protein-coding genes: the reading frame conservation (RFC) score and the codon substitution frequency (CSF) score.

Reading frame conservation.

The RFC score reflects the percentage of nucleotides (ranging from 0% to 100%) whose reading frame is conserved across species (SI Fig. 6). The RFC score is determined by aligning the human sequence to its cross-species ortholog and calculating the maximum percentage of nucleotides with conserved reading frame, across the three possible reading frames for the ortholog. The results are averaged across sliding windows of 100 bases to limit propagation of local effects due to errors in sequence alignment and gene boundary annotation. We calculated separate RFC scores relative to both the mouse and dog genomes and focused on a joint RFC score, defined as the larger of the two scores. The RFC score was originally described in our work on yeast, but has been adapted to accommodate the frequent presence of introns in human sequence (see SI Appendix).
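
A minimal sketch of the frame-conservation idea follows, under simplifying assumptions that are not from the paper: it takes a single gapped pairwise alignment, ignores introns and the 100-base sliding-window averaging, and scores only columns in which both sequences have a base. The function name `rfc_score` and the toy alignments are hypothetical.

```python
def rfc_score(human_aln: str, other_aln: str) -> float:
    """Percentage of aligned human bases whose codon position is preserved,
    maximised over the three possible frame offsets of the other sequence.
    Both inputs are rows of one pairwise alignment, with '-' for gaps."""
    if len(human_aln) != len(other_aln):
        raise ValueError("alignment rows must have equal length")

    matches = [0, 0, 0]   # conserved-frame count for each ortholog offset
    aligned = 0           # columns where both sequences have a base
    i = j = 0             # ungapped positions in human / other sequence

    for h, o in zip(human_aln, other_aln):
        if h != "-" and o != "-":
            aligned += 1
            for offset in range(3):
                if i % 3 == (j + offset) % 3:
                    matches[offset] += 1
        if h != "-":
            i += 1
        if o != "-":
            j += 1

    return 100.0 * max(matches) / aligned if aligned else 0.0


# A 3-base deletion keeps the downstream frame; a 1-base deletion shifts it.
print(rfc_score("ATGGCATGA", "ATG---TGA"))
print(rfc_score("ATGGCATGAAAA", "ATGG-ATGAAAA"))
```

Indels whose length is a multiple of three leave the score unchanged, while a single-base indel splits the alignment between two frames and lowers it. This is the behaviour that separates coding sequence from random sequence.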

The RFC score shows virtually no overlap between the well studied genes and the random controls (SI Fig. 5). Only 1% of the random controls exceed the threshold of RFC >90, whereas 98.2% of the well studied genes exceed this threshold. The situation is similar for the full set of 18,752 genes with cross-species counterparts, with 97% exceeding the threshold (Fig. 2 a). The RFC score is slightly lower for more rapidly evolving genes, but the RFC distribution for even the top 1% of rapidly evolving genes is sharply separated from the random controls (SI Fig. 5).

Fig. 2. Cumulative distributions of RFC score. (Left) Human genes with cross-species orthologs (blue) versus matched random controls (black). (Right) Human orphans (red) versus matched random controls (black). RFC scores are calculated relative to mouse and dog together (Top), macaque (Middle) and chimpanzee (Bottom). In all cases, the orthologs are strikingly different from their matched random controls, whereas the orphans are essentially indistinguishable from their matched random controls.

By contrast, the orphans show a completely different picture. They are essentially indistinguishable from matched random controls (Fig. 2 b) and do not resemble even the most rapidly evolving subset of the 18,752 genes with cross-species counterparts. In short, the set of orphans shows no tendency whatsoever to conserve reading frame.

Codon substitution frequency.

The CSF score provides a complementary test for the evolutionary pattern of protein-coding genes. Whereas the RFC score is based on indels, the CSF score is based on the different patterns of nucleotide substitution seen in protein-coding vs. random DNA. Recently developed for comparative genomic analysis of Drosophila species (11), the method calculates a codon substitution frequency (CSF) score based on alignments across many species. We applied the CSF approach to alignments of human to nine mammalian species, consisting of high-coverage sequence (≈7×) from mouse, dog, rat, cow, and opossum and low-coverage sequence (≈2×) from rabbit, armadillo, elephant, and tenrec.

The results again showed strong differentiation between genes with cross-species counterparts and orphans. Among 16,210 genes with simple orthology, 99.2% yielded CSF scores consistent with the expected evolution of protein-coding genes. By contrast, the 1,177 orphans include only two cases whose codon evolution pattern indicated a valid gene. Upon inspection, these two cases were clear errors in the human gene annotation; by translating the sequence in a different frame, clear cross-species orthologs can be identified.

Orphans Do Not Represent Protein-Coding Genes.

The results above are consistent with the orphans being simply random ORFs, rather than valid human protein-coding genes. However, consistency does not constitute proof. Rather, we must rigorously reject the alternative hypothesis.

Suppose the orphans represent valid human protein-coding genes that lack corresponding ORFs in mouse and dog. The orphans would fall into two classes: (i) some may predate the divergence from mouse and dog—that is, they are ancestral genes that were lost in both mouse and dog, and (ii) some may postdate the divergence—that is, they are novel genes that arose in the lineage leading to the human. How can we exclude these possibilities? Our solution was to study two primate relatives: macaque and chimpanzee. We consider the alternatives in turn.

Suppose that the orphans are ancestral mammalian genes that were lost in dog and mouse but are retained in the lineage leading to human. If so, they would still be present and functional in macaque and chimpanzee, except in the unlikely event that they also underwent independent loss events in both macaque and chimpanzee lineages.

Suppose that the orphans are novel genes that arose in the lineage leading to the human, after the divergence from dog and mouse [≈75 million years ago (Mya)]. Assuming that the generation of new genes is a steady process, the birthdates should be distributed across this period. If so, most of the birthdates will predate the divergence from macaque (≈30 Mya) and nearly all will predate the divergence from chimpanzee (≈6 Mya) (12).

Under either of the above scenarios, the vast majority of the orphans must correspond to functional protein-coding genes in macaque or chimpanzee.
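The "most" and "nearly all" in the gene-birth scenario can be made concrete with a little arithmetic. The sketch below assumes that gene births are spread uniformly over the ≈75 million years since the divergence from dog and mouse, which is a simplification of the "steady process" described above; the function name is illustrative only.

```python
# Divergence times quoted in the text, in millions of years ago (Mya).
DOG_MOUSE_SPLIT = 75.0
MACAQUE_SPLIT = 30.0
CHIMP_SPLIT = 6.0


def fraction_born_before(split_mya, window_mya=DOG_MOUSE_SPLIT):
    """Fraction of genes that predate a later split, assuming (as a
    simplification) that births are uniform over the whole window."""
    return (window_mya - split_mya) / window_mya


before_macaque = fraction_born_before(MACAQUE_SPLIT)
before_chimp = fraction_born_before(CHIMP_SPLIT)

print(f"born before the macaque split:    {before_macaque:.0%}")
print(f"born before the chimpanzee split: {before_chimp:.0%}")
```

Under this assumption, 60% of novel genes would already exist in the common ancestor with macaque and 92% in the common ancestor with chimpanzee, so conservation should be detectable in those genomes.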

We therefore tested whether the orphans show any evidence of protein-coding conservation relative to either macaque or chimpanzee, using the RFC score. Strikingly, the distribution of RFC scores for the orphans is essentially identical to that for the random controls (Fig. 2 d and f). The distribution for the orphans does not resemble that seen even for the top 1% of most rapidly evolving genes with cross-species counterparts (SI Figs. 7–9).

The set of orphans thus shows no evidence whatsoever of reading-frame conservation even in our closest primate relatives. (It is of course possible that the orphans include a few valid protein-coding genes, but the proportion must be small enough that it has no discernable effect on the overall RFC distribution.) We conclude that the vast majority of orphans do not correspond to functional protein-coding genes in macaque and chimpanzee, and thus are neither ancestral nor newly arising genes.

If the orphans represent valid human protein-coding genes, we would have to conclude that the vast majority of the orphans were born after the divergence from chimpanzee. Such a model would require a prodigious rate of gene birth in mammalian lineages and a ferocious rate of gene death erasing the huge number of genes born before the divergence from chimpanzee. We reject such a model as wholly implausible. We thus conclude that the vast majority of orphans are simply randomly occurring ORFs that do not represent protein-coding genes.

Finally, we note that the careful filtering of the human gene catalog described earlier was essential to this analysis, because it eliminated pseudogenes and artifacts that would have prevented accurate analysis of the properties of the orphans.

Experimental Evidence of Encoded Proteins.

As an independent check on our conclusion, we reviewed the scientific literature for published articles mentioning the orphans to determine whether there was experimental evidence for encoded proteins. Whereas the vast majority of the well studied genes have been directly shown to encode a protein, we found articles reporting experimental evidence of an encoded protein in vivo for only 12 of 1,177 orphans, and some of these reports are equivocal (SI Table 2). The experimental evidence is thus consistent with our conclusion that the vast majority of nonconserved ORFs are not protein-coding. In the handful of cases where experimental evidence exists or is found in the future, the genes can be restored to the catalog on a case-by-case basis.

Revising the Human Gene Catalogs.

With strong evidence that the vast majority of orphans are not protein-coding genes, it is possible to revise the human gene catalogs in a principled manner.

Ensembl catalog.

Our analysis of the Ensembl (v35) catalog indicates that it contains 19,108 valid protein-coding genes on chromosomes 1–22 and X within the current genome assembly. The remaining 15% of the entries are eliminated as retroposons, artifacts or orphans. Together with the mitochondrial chromosome [well known to contain 13 protein-coding genes (13)] and chromosome Y [for which careful analysis indicates 78 protein-coding genes (14)], the total reaches 19,199.

We extended the analysis to the Ensembl (v38) catalog, in which 2,212 putative genes were added and many previous entries were revised or deleted. Our computational pipeline found 598 additional valid protein-coding genes based on cross-species counterparts, 1,135 retroposons, and 479 orphans. The RFC curves for the orphans again closely matched the expectation for random DNA.

Other catalogs.

We applied the same approach to the Vega (v34) and RefSeq (March 2007) catalogs. Both catalogs contain a substantial proportion of entries that appear not to be valid protein-coding genes (16% and 10%, respectively), based on the lack of a cross-species counterpart (see SI Fig. 10 and SI Appendix). If we restrict the RefSeq entries to those with the highest confidence (with the caveat that this set contains many fewer genes), only 1% appear invalid. Together, these two catalogs add an additional 673 protein-coding genes.

Combined analysis.

Combining the analysis of the three major gene catalogs, we find that only 20,470 of the 24,551 entries appear to be valid protein-coding genes.
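The combined figure can be reproduced from the per-catalog numbers quoted in the preceding subsections. The short tally below is only a bookkeeping sketch; the variable names are illustrative and every number is taken from the text.

```python
# Ensembl v35: valid genes on chromosomes 1-22 and X, plus mitochondrial and Y.
ensembl_v35_autosomes_and_x = 19_108
mitochondrial_genes = 13
chromosome_y_genes = 78
ensembl_v35_total = (
    ensembl_v35_autosomes_and_x + mitochondrial_genes + chromosome_y_genes
)

# Ensembl v38: the 2,212 added entries split into three classes.
v38_valid, v38_retroposons, v38_orphans = 598, 1_135, 479
v38_added_entries = v38_valid + v38_retroposons + v38_orphans

# Vega and RefSeq together contribute further valid genes.
vega_refseq_valid = 673

combined_valid = ensembl_v35_total + v38_valid + vega_refseq_valid
total_entries = 24_551
rejected_entries = total_entries - combined_valid

print("Ensembl v35 total:", ensembl_v35_total)
print("Ensembl v38 added entries:", v38_added_entries)
print("Combined valid genes:", combined_valid)
print(f"Rejected entries: {rejected_entries} ({rejected_entries / total_entries:.1%})")
```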

Limitations on the Analysis.

Our analysis of the current gene catalogs has certain limitations that should be noted.

First, we eliminated all pseudogenes and orphans. We found six reported cases in which a processed pseudogene or transposon underwent exaptation to produce a functional gene (SI Tables 1 and 3) and 12 reported cases of orphans with experimental evidence for an encoded protein. These 18 cases can be readily restored to the catalog (raising the count to 20,488). There are additional cases of potentially functional retroposons that are not present in the current gene catalogs (15). If any are found to produce protein, they should also be included.

Second, we have not considered the 197 putative genes that lie in the “unmapped contigs.” These regions are sequences that were omitted from the finished assembly of the human genome. They largely consist of segmental duplications, and most of the genes are highly similar to others in the assembly. Many of the sequences may represent alternative alleles or misassemblies of the genome. However, regions of segmental duplication are known to be nurseries of evolutionary innovation (16) and may contain some valid genes. They deserve focused attention.

Third and most importantly, the nonconserved ORFs studied here were typically included in current gene catalogs because they have the potential to encode at least 100 amino acids. We thus do not know whether our conclusions would apply to much shorter ORFs. In principle, there exist many additional protein-coding genes that encode short proteins, such as peptide hormones, which are usually translated from much larger precursors and may evolve rapidly. It should be possible to investigate the properties of smaller ORFs by using additional mammalian species beyond mouse and dog.

Improving Gene Annotations.

In the course of our work, we generated detailed graphical “report cards” for each of the 22,218 putative genes in Ensembl (v35). The report cards show the gene structure, sequence alignments, measures of evolutionary conservation, and our final classification (Fig. 3).

An example gene report card for a small gene, HAMP, on chromosome 19. Report cards for all 22,218 putative genes in Ensembl v35 are available at The report cards provide a visual framework for studying cross-species conservation and for spotting possible problems in the human gene annotation. Information at the top shows chromosomal location, alternative identifiers, and summary information, such as length, number of exons, and repeat content. Various panels below provide graphical views of the alignment of the human gene to the mouse and dog genomes. “Synteny” shows the large-scale alignment of genomic sequence, indicating both aligned and unaligned segments. The human sequence is annotated with the exons in white and repetitive sequence in dark gray. “Alignment detail” shows the complete DNA sequence alignment and protein alignment. In the DNA alignment, the human sequence is given at the top, bases in the other species are marked as matching (light gray) or nonmatching (dark gray), exon boundaries are marked by vertical lines, indels are marked by small triangles above the sequence (vertex down for insertions, vertex up for deletions, number indicating length in bases), the annotated start codon is in green, and the annotated stop codon is in purple. In the protein alignment, the human amino acid sequence is given at the top, and the sequences in the other species are marked as matching (light gray), similar (pink), or nonmatching (red). “Frame alignment” shows the distribution of nucleotide mismatches found in each codon position, with excess mutations expected in the third position. Matching bases are shown in light gray, and mismatches are shown in dark gray. “Indels, starts and stops” provides an overview of key events. Indels are indicated by triangles (vertex down for insertions, vertex up for deletions) and marked as frameshifting (red) or frame-preserving (gray). Start codons are marked in green and stop codons in purple.
“Splice sites” shows sequence conservation around splice sites, with two-base donor and acceptor sites highlighted in gray and mismatching bases indicated in red. “Summary data” lists various conservation statistics relative to mouse and dog, including RFC score, nucleotide identity, number of conserved splice sites, frameshifting and nonframeshifting indel density/kb, and gene neighborhood. The gene neighborhood shows a dot for the three upstream and downstream genes, which is colored gray if synteny is preserved and red otherwise.

The report cards are valuable for studying gene evolution and for refining gene annotation. By examining local anomalies by cross-species comparison, we have identified 23 clear errors in gene annotation (including cases in which changing the reading frame or coding strand reveals unambiguous cross-species orthologs) and 332 cases in which cross-species conservation suggests altering the start or stop codon, eliminating an internal exon, or moving a splice site. Of these latter cases, most are likely to be errors in the human gene annotation, although some may represent true cross-species differences. The report cards, together with search tools and summary tables, are available at

Human microbiome churns out thousands of tiny novel proteins

The bacteria in and on our bodies make thousands of tiny, previously unidentified proteins that could shed light on human health and advance drug development, Stanford researchers have found.

Your body is a wonderland. A wonderland teeming with trillions of bacteria, that is. But it’s not as horrifying as it might sound. In fact, there’s mounting evidence that many aspects of our health are closely intertwined with the composition and hardiness of our microscopic compatriots, though exactly how is still mostly unclear.

Now, researchers at the Stanford University School of Medicine have discovered that these microbial hitchhikers — collectively known as the human microbiome — are churning out tens of thousands of proteins so small that they’ve gone unnoticed in previous studies. The proteins belong to more than 4,000 new biological families predicted to be involved in, among other processes, the warfare waged among different bacterial strains as they vie for primacy in coveted biological niches, the cell-to-cell communication between microbes and their unwitting hosts, and the critical day-to-day housekeeping duties that keep the bacteria happy and healthy.

Because they are so small — fewer than 50 amino acids in length — it’s likely the proteins fold into unique shapes that represent previously unidentified biological building blocks. If the shapes and functions of these proteins can be recreated in the lab, they could help researchers advance scientific understanding of how the microbiome affects human health and pave the way for new drug discovery.

A paper describing the research findings was published Aug. 8 in Cell. Ami Bhatt, MD, PhD, assistant professor of medicine and of genetics, is the senior author. Postdoctoral scholar Hila Sberro, PhD, is the lead author.

‘A clear blind spot’

“It’s critically important to understand the interface between human cells and the microbiome,” Bhatt said. “How do they communicate? How do strains of bacteria protect themselves from other strains? These functions are likely to be found in very small proteins, which may be more likely than larger proteins to be secreted outside the cell.”

But the proteins’ minuscule size had made it difficult to identify and study them using traditional methods.

“We’ve been more likely to make an error than to guess correctly when trying to predict which bacterial DNA sequences contain these very small genes,” Bhatt said. “So until now, we’ve systematically ignored their existence. It’s been a clear blind spot.”

It might be intimidating for the uninitiated to think too deeply about the vast numbers of bacteria that live on and in each of us. They account for far more cells in and on the human body than actual human cells do. Yet these tiny passengers are rarely malicious. Instead, they help with our digestion, supplement our diet and generally keep us running at our peak. But in many cases, it’s been difficult to pick apart the molecular minutiae behind this partnership.

Bhatt and her colleagues wondered if answers might be found in the small proteins they knew were likely to wiggle through the nets cast by other studies focusing on the microbiome. Small proteins, they reasoned, are more likely than their larger cousins to slip through the cell membrane to ferry messages — or threats — to neighboring host or bacterial cells. But how to identify and study these tiny Houdinis?

“The bacterial genome is like a book with long strings of letters, only some of which encode the information necessary to make proteins,” Bhatt said. “Traditionally, we identify the presence of protein-coding genes within this book by searching for combinations of letters that indicate the ‘start’ and ‘stop’ signals that sandwich genes. This works well for larger proteins. But the smaller the protein, the more likely that this technique yields large numbers of false positives that muddy the results.”
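The false-positive problem Bhatt describes can be illustrated with a toy open-reading-frame scan. The sketch below is not the method used in the study: it assumes a single start codon (ATG), the three standard stop codons, and the forward strand only, and it runs on random DNA that by construction contains no genes. Every hit is therefore a false positive, and short hits vastly outnumber long ones.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons):
    """Return (start, end) pairs for forward-strand ORFs: an ATG followed
    in frame by a stop codon, with at least min_codons codons before the stop."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i  # first start codon since the last stop
            elif codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


# Random DNA: any ORF found here arises purely by chance.
rng = random.Random(0)
random_dna = "".join(rng.choices("ACGT", k=100_000))

short_orfs = find_orfs(random_dna, min_codons=10)
long_orfs = find_orfs(random_dna, min_codons=100)
print("chance ORFs with >= 10 codons: ", len(short_orfs))
print("chance ORFs with >= 100 codons:", len(long_orfs))
```

Lowering the length threshold to catch small proteins lets in many more chance ORFs, which is why start and stop signals alone are not enough for very short genes.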

A big surprise

To tackle the problem, Sberro decided to compare potential small-protein-coding genes among many different microbes and samples. Those that were identified repeatedly in several species and samples were more likely to be true positives, she thought. When she applied the analysis to large data sets, Sberro found not the hundreds of genes she and Bhatt had expected, but tens of thousands. The proteins predicted to be encoded by the genes could be sorted into more than 4,000 related groups, or families, likely to be involved in key biological processes such as intercellular communication and warfare, as well as maintenance tasks necessary to keep the bacteria healthy.
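The idea of trusting candidates that recur across samples can be sketched as a simple counting filter. This is a hypothetical illustration, not the study's pipeline: the sample names and peptide strings are invented, and exact string matching stands in for whatever sequence-similarity grouping a real analysis would use.

```python
from collections import Counter


def recurrent_candidates(samples, min_samples):
    """Keep candidate small proteins that appear in at least min_samples samples.

    samples maps a sample name to the list of candidate peptide sequences
    predicted in that sample.
    """
    counts = Counter()
    for candidates in samples.values():
        counts.update(set(candidates))  # count each candidate once per sample
    return {peptide for peptide, n in counts.items() if n >= min_samples}


# Invented toy data: three samples with partly overlapping predictions.
toy_samples = {
    "gut_A": ["MKTLLV", "MSNQ"],
    "gut_B": ["MKTLLV", "MRRA"],
    "skin_C": ["MKTLLV", "MSNQ", "MPPW"],
}

kept = recurrent_candidates(toy_samples, min_samples=2)
print(sorted(kept))
```

Candidates seen in only one sample ("MRRA" and "MPPW" here) are dropped as more likely to be chance predictions.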

“Honestly, we didn’t know what to expect,” Bhatt said. “We didn’t have any intuition about this. The fact that she found thousands of new protein families definitely surprised us all.”

The researchers confirmed the genes encoded true proteins by showing they are transcribed into RNA and shuttled to the ribosome for translation — key steps in the protein-making pathway in all organisms. They are now working with collaborators to learn more about the proteins’ functions and to identify those that might be important to the bacteria fighting for space in our teeming intestinal carpet, for example. Such proteins might serve as new antibiotics or drugs for human use, they believe.

“Small proteins can be synthesized rapidly and could be used by the bacteria as biological switches to toggle between functional states or to trigger specific reactions in other cells,” Bhatt said. “They are also easier to study and manipulate than larger proteins, which could facilitate drug development. We anticipate this to be a valuable new area of biology for study.”

Other Stanford co-authors are graduate student Brayon Fremin; postdoctoral scholars Soumaya Zlitni, PhD, and Fredrik Edfors, PhD; and Michael Snyder, PhD, professor and chair of genetics.

Researchers from One Codex, the Joint Genome Institute of the Department of Energy, the Alexander Fleming Biomedical Sciences Research Center in Greece, and Lawrence Berkeley National Laboratory also contributed to the study.

The study was supported by the National Institutes of Health (grants HG000044, K08CA184420, P30CA124435, and 1Ro1AT010123201), the PhRMA Foundation, the U.S. Department of Energy and a Damon Runyon Clinical Investigator Award.

Stanford’s departments of Medicine and of Genetics also supported the work.

What is gene expression profiling and who uses it?

Gene expression profiling measures which genes are being expressed in a cell at any given moment. This method can measure thousands of genes at a time; some experiments can measure the entire genome at once [3]. Gene expression profiling measures mRNA levels, showing the pattern of genes expressed by a cell at the transcription level [4]. This often means measuring relative mRNA amounts in two or more experimental conditions, then assessing which conditions resulted in specific genes being expressed.
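Comparing relative mRNA amounts between two conditions is commonly summarized as a log2 fold change. The sketch below assumes simple per-gene read counts; the gene names, counts, and the pseudocount of 1 (added to avoid division by zero) are illustrative choices, not values from any real experiment.

```python
import math


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of mRNA abundance in the treated vs. control condition."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


# Invented counts: gene -> (control, treated).
toy_counts = {
    "GENE_A": (99, 399),   # higher after treatment
    "GENE_B": (199, 49),   # lower after treatment
    "GENE_C": (100, 100),  # unchanged
}

fold_changes = {
    gene: log2_fold_change(treated, control)
    for gene, (control, treated) in toy_counts.items()
}

for gene, lfc in fold_changes.items():
    print(f"{gene}: log2 fold change = {lfc:+.1f}")
```

A value of +2 means roughly four times more mRNA in the treated condition, -2 means four times less, and 0 means no change.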

Gene expression profiling is used by a variety of biomedical researchers, from molecular biologists to environmental toxicologists. The technology provides accurate information on gene expression in support of a wide range of experimental goals.

Different techniques are used to determine gene expression. These include DNA microarrays and sequencing technologies. The former measures the activity of specific genes of interest and the latter enables researchers to determine all active genes in a cell [5].

Once a genome has been sequenced, we know what potential a cell has (what characteristics and function it might have) based on the genes it contains. However, sequencing the genome does not tell us which genes a cell is expressing, or the functions or processes it is carrying out at any given moment. To determine these, we need to work out its gene expression profile. If a gene is being used to make mRNA, it is considered ‘on’; if it is not being used to make mRNA, it is considered ‘off’.

A gene expression profile tells us how a cell is functioning at a specific time. This is because cell gene expression is influenced by external and internal stimuli, including whether the cell is dividing, what factors are present in the cell's environment, the signals it is receiving from other cells, and even the time of day [6].


As the number of protein-coding genes continues to be revised downward [25], there appears to be an ever-growing catalogue of ncRNAs. Nevertheless, there is an ongoing lack of clarity regarding the true number of ncRNAs within the genome. This is at least partly due to the inherent difficulties in discriminating ncRNAs from mRNAs and artifacts, especially amongst the thousands of long transcripts that defy categorization by even the most sophisticated of today's classification methods. This situation is further complicated by the emerging realization that the transcriptome may not consist of discrete separable species, but in reality comprise a series of overlapping clusters, of which many span large genomic regions [5],[71],[72], potentially comprising an information continuum. Looking ahead, we must also be prepared to cast off our historical biases toward what appears now to be an increasingly false dichotomy, and instead embrace the likelihood that RNA is a molecular multi-tasker, whose roles can simultaneously bridge both protein-coding and noncoding domains, and not only have more than one embedded function but also produce multiple products.
