[Genomics] A decade of human genome project conclusion: Scientific diffusion about our genome knowledge

Posted by Sunghwan Ji on November 03, 2019 · 5 mins read

I have been studying genetics lately, and I wanted to share what I learned about one of its biggest milestones, the Human Genome Project (HGP): what happened during that period, and the secrets humanity uncovered along the way.

A decade of human genome project conclusion: Scientific diffusion about our genome knowledge (link)
Lessons from the Human Genome Project (YouTube) (link)

If you have plenty of time, I recommend reading the original paper :) It is laid out neatly, like a history book.

Human Genome Project(HGP)

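The HGP brought together biologists, physicists, chemists, and engineers from 20 countries and was backed by 3 billion dollars of public funding. It started in 1990 and was completed in 2003, with the goal of sequencing the whole human genome.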

After 13 years of effort, the project produced several advances.

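  1. The genomes of unrelated people differ at about one in every 1,200 to 1,500 DNA bases.
  2. About 40% of human proteins resemble proteins of flies or worms, and about 50% of human genes are very similar to genomic sequences of other organisms. In other words, the human genome is neither more complex nor more special than those of other organisms.
  3. The biggest surprise was that there are only about 20,000 protein-coding genes (a mere 2% of the 3-billion-base-pair sequence). This was one of the most disappointing results of the HGP, because the earlier prediction had been around 100,000.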
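To put these figures in absolute terms, here is a rough back-of-the-envelope sketch. It assumes a rounded genome size of 3 billion base pairs, one difference per 1,200 to 1,500 bases between unrelated people, and a 2% protein-coding share; the function and variable names are made up for this illustration.

```python
# Back-of-the-envelope arithmetic for the HGP figures quoted above.
GENOME_SIZE_BP = 3_000_000_000  # ~3 billion base pairs (rounded, an assumption)


def expected_differences(bases_per_difference: int,
                         genome_size: int = GENOME_SIZE_BP) -> int:
    """Differences expected between two genomes at one per N bases."""
    return genome_size // bases_per_difference


def bases_for_percent(percent: int, genome_size: int = GENOME_SIZE_BP) -> int:
    """Number of bases that make up `percent` % of the genome."""
    return genome_size * percent // 100


diff_low = expected_differences(1_500)   # one difference per 1,500 bases
diff_high = expected_differences(1_200)  # one difference per 1,200 bases
coding_bp = bases_for_percent(2)         # protein-coding share of the genome

print(f"differing bases between two people: {diff_low:,} to {diff_high:,}")
print(f"protein-coding bases: {coding_bp:,}")
```

So two unrelated people differ at roughly 2.0 to 2.5 million positions, and only about 60 million of the 3 billion bases code for protein.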

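The project began with the expectation that sequencing the human genome would let us understand human complexity, but many questions remained unanswered. A fresh start was needed to tackle the secrets of life, our understanding of genetic diseases, and so on. One such effort is the ENCODE project described below.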

ENCyclopedia of DNA Elements(ENCODE) Project

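Once the HGP was finished, new questions came up: how does DNA actually work, which elements regulate it, and how does that regulation take place? In September 2003, the ENCODE project was launched to interpret the HGP data. Its goal was to build a map of all functional elements in the human genome, that is, to map protein-coding genes, noncoding genes, the elements that regulate transcription, and so on.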

From here on, I found it hard to follow….

The First Phase of the ENCODE Project
  1. Transcription occurs across almost the whole genome, such that most of its bases are associated with at least one primary transcript. Many transcripts link distal regions to protein‐coding regions.
  2. Various novel nonprotein coding transcripts were identified. Many of these transcripts originate from overlapping protein‐coding loci and from regions previously considered transcriptionally silent.
  3. Many transcription start sites were identified. Many of them present chromatin structure and protein‐binding specific sequences similar to the well‐known promoters.
  4. The regulatory sequences that surround the transcription start sites are symmetrically distributed, with no bias towards upstream regions.
  5. The accessibility to chromatin and histone modification patterns are highly predictive of both the presence and the activity of transcription start sites.
  6. The DNA replication timing is related to the chromatin structure.
  7. A total of 5% of the bases in the genome can be considered under evolutionary constraint in mammals. For 60% of these bases, there is evidence for function based on results of experimental tests accomplished to date.
  8. A general overlap between the genomic regions identified as functional by experimental tests and those under evolutionary constraint was not observed.

One of the most surprising conclusions from this first phase concerns the remarkable excess of experimentally identified functional elements which lack evolutionary constraint. This means that apparently many functional elements are not restricted to mammal evolution. The consortium suggested the existence of a large pool of neutral elements that are biochemically active, but that do not provide a particular benefit to the organism. This pool may serve as a storage to natural selection, potentially acting as a source of lineage specific elements. As concluded by the consortium, this surprise suggests that we take a more “neutral” view of many of the functions conferred by the genome.

The Second Phase of the ENCODE Project
  1. Most of the human genome (80.4%) takes part in at least one biochemical RNA and/or chromatin‐associated event in at least one kind of cell. A total of 99% of the known bases in the genome are within 1.7 kb of any ENCODE element, whereas 95% of bases are within 8 kb of a transcription factor binding motif.
  2. The classification of the genome into seven chromatin states (signature patterns of histone modification) pointed out a set of 399,124 regions with enhancer‐like features and 70,292 regions with promoter‐like features, as well as many quiescent regions.
  3. It is possible to correlate quantitatively RNA production and processing with both chromatin markers and transcription factor binding at promoters.
  4. Many non‐coding variants in individual genome sequences lie in ENCODE‐annotated functional regions. This number is at least as large as those that lie in protein‐coding genes.
  5. Single nucleotide polymorphisms (SNPs) associated with diseases are located mainly in non‐coding functional elements.
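As a rough sanity check on the scale of these numbers, the sketch below converts the 80.4% figure into base pairs and compares the two region counts. It assumes the rounded genome size of 3 billion base pairs mentioned earlier in this post and reads the region counts as 399,124 and 70,292; the constant names are made up for this illustration.

```python
# Rough scale of the ENCODE phase-two figures quoted above.
GENOME_SIZE_BP = 3_000_000_000   # ~3 billion base pairs (rounded, an assumption)
ACTIVE_PER_MILLE = 804           # 80.4% expressed in tenths of a percent
ENHANCER_LIKE_REGIONS = 399_124
PROMOTER_LIKE_REGIONS = 70_292

# Bases involved in at least one biochemical event in at least one cell type.
active_bp = GENOME_SIZE_BP * ACTIVE_PER_MILLE // 1000
# Bases with no such event recorded so far.
inactive_bp = GENOME_SIZE_BP - active_bp
# How many enhancer-like regions there are per promoter-like region.
enhancer_to_promoter = ENHANCER_LIKE_REGIONS / PROMOTER_LIKE_REGIONS

print(f"biochemically active bases: {active_bp:,}")
print(f"remaining bases: {inactive_bp:,}")
print(f"enhancer-like per promoter-like region: {enhancer_to_promoter:.2f}")
```

Under these assumptions about 2.4 billion bases take part in some biochemical event, and enhancer-like regions outnumber promoter-like regions by more than five to one.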

Undoubtedly, the verification that the human genome is pervasively transcribed and almost fully active remains as one of the most important molecular biology discoveries.

Final Consideration

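The conclusion of the HGP and the ENCODE project is that human complexity does not depend on protein-coding genes alone; rather, the whole genome and, beyond it, epigenetics dynamically give rise to that complexity. The traditional one-gene, one-protein view of the central dogma has to be redefined.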