Shapiro–Senapathy algorithm
The Shapiro–Senapathy algorithm (S&S) is an algorithm for predicting splice junctions in genes of animals and plants. This algorithm has been used to discover disease-causing splice site mutations and cryptic splice sites.
The algorithm
A splice site is the border between an exon and intron in a gene. These sites contain a particular sequence motif, which is necessary for recognition and processing by the RNA splicing machinery.
The S&S algorithm uses sliding windows of eight nucleotides, corresponding to the length of the splice site sequence motif, to identify these conserved sequences and thus potential splice sites. Using a weighted table of nucleotide frequencies, the S&S algorithm outputs a consensus-based percentage for the possibility of the window containing a splice site.
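The sliding-window scoring described above can be sketched as follows. This is a minimal illustration, not the published method: the frequency table is invented for the example (it is not the S&S weight table), and the min/max normalization to a percentage is an assumption about how a consensus-based percentage can be formed. The names `ILLUSTRATIVE_TABLE`, `window_score`, and `scan` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical 8-position table: percentage of each base at each position.
# These values are made up for illustration; they are NOT the S&S weights.
ILLUSTRATIVE_TABLE: List[Dict[str, float]] = [
    {"A": 60, "C": 13, "G": 13, "T": 14},
    {"A": 9, "C": 3, "G": 80, "T": 8},
    {"A": 0, "C": 0, "G": 100, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 100},
    {"A": 53, "C": 3, "G": 42, "T": 2},
    {"A": 71, "C": 8, "G": 12, "T": 9},
    {"A": 7, "C": 6, "G": 81, "T": 6},
    {"A": 16, "C": 17, "G": 21, "T": 46},
]


def window_score(window: str, table: List[Dict[str, float]]) -> float:
    """Score one window as a percentage between the worst and best possible sums."""
    observed = sum(column[base] for base, column in zip(window, table))
    lowest = sum(min(column.values()) for column in table)
    highest = sum(max(column.values()) for column in table)
    return 100.0 * (observed - lowest) / (highest - lowest)


def scan(sequence: str, table: List[Dict[str, float]],
         threshold: float = 70.0) -> List[Tuple[int, str, float]]:
    """Slide a window of len(table) bases over the sequence and keep high scorers."""
    width = len(table)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous bases
        score = window_score(window, table)
        if score >= threshold:
            hits.append((start, window, score))
    return hits
```

With this illustrative table, a window that matches the most frequent base at every position scores 100, and a window that matches the least frequent base at every position scores 0. The threshold of 70 is an arbitrary default chosen for the sketch.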
The S&S algorithm serves as the basis of other software tools, such as Human Splicing Finder, Splice-site Analyzer Tool, dbass (Ensembl), Alamut, and SROOGLE.
Cancer gene discovery using S&S
By using the S&S algorithm, mutations and genes that cause many different forms of cancer have been discovered. For example, genes causing commonly occurring cancers including breast cancer, ovarian cancer, colorectal cancer, leukemia, head and neck cancers, prostate cancer, retinoblastoma, squamous cell carcinoma, gastrointestinal cancer, melanoma, liver cancer, Lynch syndrome, skin cancer, and neurofibromatosis have been found. In addition, splicing mutations in genes causing less commonly known cancers including gastric cancer, gangliogliomas, Li-Fraumeni syndrome, Loeys–Dietz syndrome, osteochondromas (bone tumor), nevoid basal cell carcinoma syndrome, and pheochromocytomas have been identified.
Specific mutations in different splice sites in various genes causing breast cancer (e.g., BRCA1, PALB2), ovarian cancer (e.g., SLC9A3R1, COL7A1, HSD17B7), colon cancer (e.g., APC, MLH1, DPYD), colorectal cancer (e.g., COL3A1, APC, HLA-A), skin cancer (e.g., COL17A1, XPA, POLH), and Fanconi anemia (e.g., FANC, FANA) have been uncovered. The mutations in the donor and acceptor splice sites in different genes causing a variety of cancers that have been identified by S&S are shown in Table 1.
Disease type | Gene symbol | Mutation location | Original sequence | Mutated sequence | Splicing aberration |
---|---|---|---|---|---|
Breast cancer | BRCA1 | Exon 11 | AAGGTGTGT | AAAGTGTGT | Skipping of exon 12 |
Breast cancer | PALB2 | Exon 12 | CAGGCAAGT | CAAGCAAGT | Potentially weakening the canonical donor splicing site |
Ovarian cancer | SLC9A3R1 | Exon 2 | GAGGTGATG | GAGGCGATG | Significant effect in ‘splicing’ |
Colorectal Cancer | MLH1 | Exon 9 | TCGGTATGT | TCAGTATGT | Skipping of exon 8 and protein truncation |
Colorectal Cancer | MSH2 | Intron 8 | CAGGTATGC | CAGGCATGC | Intervening sequence, RNA processing, no amino acid change |
Colorectal Cancer | MSH6 | Intron 9 | TTTTTAATTTTAAGG | TTTTTAATTTTGAGG | Intervening sequence, RNA processing, no amino acid change |
Skin Cancer | TGFBR1 | Exon 5 | TTTTGATTCTTTAGG | TTTTGATTCTTTCGG | Exon 5 skipping |
Skin Cancer | ITGA6 | Intron 19 | TTATTTTCTAACAGG | TTATTTTCTAACACG | Skipping of exon 20, resulting in an in-frame deletion |
Birt–Hogg–Dubé (BHD) syndrome | FLCN | Exon 9 | GAAGTAAGC | GAAGGAAGC | Skipping of exon 9 and weak retention of 131 bp of intron 9 |
Nevoid basal cell carcinoma | PTCH1 | Intron 4 | CAGGTATAT | CAGGTGTAT | Exon 4 skipping |
Mesothelioma | BAP1 | Exon 16 | AAGGTGAGG | TAGGTGAGG | Creates a novel 5’ splice site that results in a 4 nucleotide deletion of the 3’ end of exon 16 |
Discovery of genes causing inherited disorders using S&S
Specific mutations in different splice sites in various genes that cause inherited disorders, including, for example, Type 1 diabetes (e.g., PTPN22, TCF1 (HCF-1A)), hypertension (e.g., LDL, LDLR, LPL), Marfan syndrome (e.g., FBN1, TGFBR2, FBN2), cardiac diseases (e.g., COL1A2, MYBPC3, ACTC1), eye disorders (e.g., EVC, VSX1) have been uncovered. A few example mutations in the donor and acceptor splice sites in different genes causing a variety of inherited disorders identified using S&S are shown in Table 2.
Disease type | Gene symbol | Mutation location | Original sequence | Mutated sequence | Splicing aberration |
---|---|---|---|---|---|
Diabetes | PTPN22 | Exon 18 | AAGGTAAAG | AACGTAAAG | Skipping of exon 18 |
Diabetes | TCF1 | Intron 4 | TTTGTGCCCCTCAGG | TTTGTGCCCCTCGGG | Skipping of exon 5 |
Hypertension | LDL | Intron 10 | TGGGTGCGT | TGGGTGCAT | Normolipidemic to classical heterozygous FH |
Hypertension | LDLR | Intron 2 | GCTGTGAGT | GCTGTGTGT | May cause splicing abnormalities, according to an in-silico analysis |
Hypertension | LPL | Intron 2 | ACGGTAAGG | ACGATAAGG | Cryptic splice sites are activated in vivo at the sites |
Marfan syndrome | FBN1 | Intron 46 | CAAGTAAGA | CAAGTAAAA | Exon skipping/cryptic splice site |
Marfan syndrome | TGFBR2 | Intron 1 | ATCCTGTTTTACAGA | ATCCTGTTTTACGGA | Abnormal splicing |
Marfan syndrome | FBN2 | Intron 45 | TGGGTAAGT | TGGGGAAGT | Splice site alterations leading to frameshift mutations, causing a truncated protein |
Cardiac disease | COL1A2 | Intron 46 | GCTGTAAGT | GCTGCAAGT | Permitted almost exclusive use of a cryptic donor site 17 nt upstream in the exon |
Cardiac disease | MYBPC3 | Intron 5 | CTCCATGCACACAGG | CTCCATGCACACCGG | Abnormal mRNA transcript with a premature stop codon will produce a truncated protein lacking the binding sites for myosin and titin |
Cardiac disease | ACTC1 | Intron 1 | TTTTCTTCTCATAGG | TTTTCTTCTTATAGG | No effect |
Eye disorder | ABCR | Intron 30 | CAGGTACCT | CAGTTACCT | Autosomal recessive RP and CRD |
Eye disorder | VSX1 | Intron 5 | TTTTTTTTTACAAGG | TATTTTTTTACAAGG | Aberrant splicing |
Genes causing immune system disorders
More than 100 immune system disorders affect humans, including inflammatory bowel diseases, multiple sclerosis, systemic lupus erythematosus, Bloom syndrome, familial cold autoinflammatory syndrome, and dyskeratosis congenita. The Shapiro–Senapathy algorithm has been used to discover genes and mutations involved in many immune disorders, including ataxia telangiectasia, B-cell defects, epidermolysis bullosa, and X-linked agammaglobulinemia.
Xeroderma pigmentosum, an autosomal recessive disorder, is caused by faulty proteins that form when a new preferred splice donor site, identified using the S&S algorithm, is used, resulting in defective nucleotide excision repair.
Type I Bartter syndrome (BS) is caused by mutations in the gene SLC12A1. The S&S algorithm helped reveal two novel heterozygous mutations, c.724+4A>G in intron 5 and c.2095delG in intron 16, leading to complete skipping of exon 5.
Mutations in the MYH gene, which is responsible for removing oxidatively damaged DNA lesions, make individuals susceptible to cancer. The IVS1+5C variant plays a causative role in the activation of a cryptic splice donor site and in the alternative splicing of intron 1. The S&S algorithm shows that guanine (G) at position IVS+5 is well conserved (at a frequency of 84%) among primates. This supports the finding that the G/C SNP in the conserved splice junction of the MYH gene causes the alternative splicing of intron 1 of the β type transcript.
Splice site scores were calculated according to S&S to investigate EBV infection in X-linked lymphoproliferative disease. Familial tumoral calcinosis (FTC), an autosomal recessive disorder characterized by ectopic calcifications and elevated serum phosphate levels, was likewise attributed to aberrant splicing.
Application of S&S in hospitals for clinical practice and research
Applying the S&S technology platform in modern clinical genomics research has advanced the diagnosis and treatment of human diseases.
In the modern era of Next Generation Sequencing (NGS) technology, S&S is applied extensively in clinical practice. Clinicians and molecular diagnostic laboratories apply S&S using various computational tools including HSF, SSF, and Alamut. It aids in the discovery of genes and mutations in patients whose disease is stratified, or when the disease in a patient is unknown based on clinical investigations.
In this context, S&S has been applied to cohorts of patients in different ethnic groups with various cancers and inherited disorders. A few examples are given below.
Cancers
No. | Cancer type | Publication title | Year | Ethnicity | Number of patients |
---|---|---|---|---|---|
1 | Breast cancer | The germline mutational landscape of BRCA1 and BRCA2 in Brazil | 2018 | Brazil | 649 Patients |
2 | Hereditary non-polyposis colorectal cancer | Prevalence and characteristics of hereditary non-polyposis colorectal cancer (HNPCC) syndrome in immigrant Asian colorectal cancer patients | 2017 | Asian Immigrant | 143 Patients |
3 | Nevoid basal cell carcinoma syndrome | Nevoid basal cell carcinoma syndrome caused by splicing mutations in the PTCH1 gene | 2016 | Japanese | 10 Patients |
4 | Prostate cancer | Identification of Two Novel HOXB13 Germline Mutations in Portuguese Prostate Cancer Patients | 2015 | Portuguese | 462 Patients, 132 Controls |
5 | Colorectal adenomatous polyposis | Identification of Novel Causative Genes for Colorectal Adenomatous Polyposis | 2015 | German | 181 Patients, 531 Controls |
6 | Renal cell cancer | Genetic screening of the FLCN gene identify six novel variants and a Danish founder mutation | 2016 | Danish | 143 individuals |
Inherited disorders
No. | Disease name | Publication title | Year | Ethnicity | Number of patients |
---|---|---|---|---|---|
1 | Bardet-Biedl Syndrome | The First Nationwide Survey and Genetic Analyses of Bardet-Biedl Syndrome in Japan | 2015 | Japan | 38 Patients (Disease identified in 9 Patients) |
2 | Odontogenesis Diseases | Genetic Evidence Supporting the Role of the Calcium Channel, CACNA1S, in Tooth Cusp and Root Patterning | 2018 | Thai families | 11 Patients, 18 Controls |
3 | Beta-Ketothiolase Deficiency | Clinical and Mutational Characterizations of Ten Indian Patients with Beta-Ketothiolase Deficiency | 2016 | Indian | 10 Patients |
4 | Unclear speech developmental delay | Progressive SCAR14 with unclear speech, developmental delay, tremor, and behavioral problems caused by a homozygous deletion of the SPTBN2 pleckstrin homology domain | 2017 | Pakistani family | 9 Patients, 12 controls |
5 | Dent's disease | Dent's disease in children: diagnostic and therapeutic consideration | 2015 | Poland | 10 Patients |
6 | Atypical Haemolytic Uraemic Syndrome | Genetics Atypical hemolytic-uremic syndrome | 2015 | Newcastle cohort | 28 Families, 7 Sporadic patients |
7 | Age-related Macular Degeneration and Stargardt disease | Genetics of Age-related Macular Degeneration and Stargardt disease in South African populations | 2015 | African Populations | 32 Patients |
S&S - the first algorithm for identifying splice sites, exons and split genes
Dr. Senapathy's original objective in developing a method for identifying splice sites was to find complete genes in raw, uncharacterized genomic sequence that could be used in the Human Genome Project. In the landmark paper with this objective, he described, for the first time, the basic method for identifying the splice sites within a given sequence based on the Position Weight Matrix (PWM) of the splicing sequences in different eukaryotic organism groups. He also created the first exon detection method by defining the basic characteristics of an exon as a sequence bounded by an acceptor and a donor splice site with S&S scores above a threshold, and contained within an ORF, which was mandatory for an exon. An algorithm for finding complete genes based on the identified exons was also described by Dr. Senapathy for the first time.
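The exon criteria described above (a scored acceptor site, a scored donor site, and an open reading frame between them) can be illustrated with a short sketch. It is a simplification under stated assumptions rather than the published procedure: splice site scores are taken as precomputed dictionaries keyed by position, the ORF test accepts any of the three forward frames, and the names `has_open_frame` and `candidate_exons`, the threshold, and the minimum length are all hypothetical.

```python
from typing import Dict, List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_open_frame(segment: str) -> bool:
    """Return True if at least one of the three forward frames has no stop codon."""
    for frame in range(3):
        codons = (segment[i:i + 3] for i in range(frame, len(segment) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            return True
    return False


def candidate_exons(sequence: str,
                    acceptor_scores: Dict[int, float],
                    donor_scores: Dict[int, float],
                    threshold: float = 70.0,
                    min_len: int = 30) -> List[Tuple[int, int]]:
    """List (start, end) pairs that satisfy the three exon criteria.

    Assumption: acceptor_scores maps the index of the first exon base to a
    splice site score, and donor_scores maps the index just past the last
    exon base to a score. Both are hypothetical precomputed inputs.
    """
    starts = sorted(p for p, s in acceptor_scores.items() if s >= threshold)
    ends = sorted(p for p, s in donor_scores.items() if s >= threshold)
    exons = []
    for start in starts:
        for end in ends:
            # Require a minimum length and a stop-free frame across the segment.
            if end - start >= min_len and has_open_frame(sequence[start:end]):
                exons.append((start, end))
    return exons
```

Pairing every acceptor with every downstream donor is quadratic in the number of sites; that is acceptable for a sketch but would need pruning on long genomic sequences.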
Dr. Senapathy demonstrated that only deleterious mutations in the donor or acceptor splice sites, those that would make the protein drastically defective, would reduce the splice site score (later known as the Shapiro–Senapathy score), and that other non-deleterious variations would not reduce the score. The S&S method was adapted for researching the cryptic splice sites caused by mutations leading to diseases. This method for detecting deleterious splicing mutations in eukaryotic genes has been used extensively in disease research in humans, animals and plants over the past three decades, as described above.
The basic method for splice site identification, and for defining exons and genes, was subsequently used by researchers in finding splice sites, exons and eukaryotic genes in a variety of organisms. These methods also formed the basis of all subsequent tool development for discovering genes in uncharacterized genomic sequences. They were also used in different computational approaches, including machine learning and neural networks, and in alternative splicing research.
Discovering the mechanisms of aberrant splicing in diseases
The Shapiro–Senapathy algorithm has been used to determine the various aberrant splicing mechanisms in genes due to deleterious mutations in the splice sites, which cause numerous diseases. Deleterious splice site mutations impair the normal splicing of the gene transcripts, and thereby make the encoded protein defective. A mutant splice site can become “weak” compared to the original site, due to which the mutated splice junction becomes unrecognizable by the spliceosomal machinery. This can lead to the skipping of the exon in the splicing reaction, resulting in the loss of that exon in the spliced mRNA (exon-skipping). On the other hand, a partial or complete intron could be included in the mRNA due to a splice site mutation that makes it unrecognizable (intron inclusion). A partial exon-skipping or intron inclusion can lead to premature termination of the protein from the mRNA, which will become defective leading to diseases. The S&S has thus paved the way to determine the mechanisms by which a deleterious mutation could lead to a defective protein, resulting in different diseases depending on which gene is affected.
Examples of splicing aberrations
Disease type | Gene symbol | Mutation location | Original donor/acceptor | Mutated donor/acceptor | Aberration effect |
---|---|---|---|---|---|
Colon Cancer | APC | Intron 2 | AAGGTAGAT | AAGGAAGAT | Skipping of Exon 3 |
Colorectal cancer | MSH2 | Exon 15 | GAGGTTTGT | GAGGTTTCT | Skipping of Exon 15 |
Retinoblastoma | RB1 | Intron 23 | TCTTAACTTGACAGA | TCTTAACGTGACAGA | New splice acceptor, intron inclusion |
Trophic benign epidermolysis bullosa | COL17A1 | Intron 51 | AGCGTAAGT | AGCATAAGT | lead to exon skipping, intron inclusion, or the use of a cryptic splice site, resulting in either a truncated protein or a protein lacking a small region of the coding sequence |
Choroideremia | CHM | Intron 3 | CAGGTAAAG | CAGATAAAG | Premature termination codon |
Cowden syndrome | PTEN | Intron 4 | GAGGTAGGT | GAGATAGGT | Premature termination codon within exon 5 |
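The donor-site rows in the table above can be compared programmatically to see which position of the splice junction each mutation hits. The sketch below assumes that each 9-nucleotide donor window consists of three exon bases (positions −3 to −1) followed by six intron bases (+1 to +6), so that the invariant GT falls at +1 and +2. That layout is an assumption inferred from the sequences shown, and `locate_donor_substitution` is a hypothetical helper.

```python
from typing import Tuple


def locate_donor_substitution(original: str, mutated: str) -> Tuple[int, str, str]:
    """Return (position, original_base, mutated_base) for a single-base change.

    Assumed layout of the 9-mer: indices 0-2 are exon positions -3..-1 and
    indices 3-8 are intron positions +1..+6 (there is no position 0).
    """
    if len(original) != 9 or len(mutated) != 9:
        raise ValueError("expected 9-nucleotide donor windows")
    diffs = [i for i, (a, b) in enumerate(zip(original, mutated)) if a != b]
    if len(diffs) != 1:
        raise ValueError("expected exactly one substitution")
    index = diffs[0]
    position = index - 3 if index < 3 else index - 2
    return position, original[index], mutated[index]


# Donor rows taken from the table above.
print(locate_donor_substitution("AAGGTAGAT", "AAGGAAGAT"))  # APC
print(locate_donor_substitution("CAGGTAAAG", "CAGATAAAG"))  # CHM
```

Under this assumed layout, the APC mutation changes the T at +2 and the CHM and PTEN mutations change the G at +1, that is, the invariant GT dinucleotide of the donor site.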
One example of a splicing aberration is the exon skipping caused by a mutation in the donor splice site of exon 8 of the MLH1 gene, which led to colorectal cancer. This example shows that a mutation in a splice site within a gene can have a profound effect on the sequence and structure of the mRNA, and on the sequence, structure and function of the encoded protein, leading to disease.
S&S in cryptic splice sites research and medical applications
Splice site identification has to be highly precise, as the consensus splice sequences are very short and gene sequences contain many other sequences similar to the authentic splice sites, which are known as cryptic, non-canonical, or pseudo splice sites. When an authentic or real splice site is mutated, any cryptic splice site present close to the original site could be erroneously used as the authentic site, resulting in an aberrant mRNA. The erroneous mRNA may include a partial sequence from the neighboring intron or lose part of an exon, which may result in a premature stop codon. The result may be a truncated protein that has lost its function completely.
The Shapiro–Senapathy algorithm can identify cryptic splice sites in addition to authentic splice sites. Cryptic sites can often be stronger than the authentic sites, with a higher S&S score. However, due to the lack of an accompanying complementary donor or acceptor site, such a cryptic site will not be active or used in a splicing reaction. When a neighboring real site is mutated to become weaker than the cryptic site, the cryptic site may be used instead of the real site, resulting in a cryptic exon and an aberrant transcript.
Numerous diseases have been caused by cryptic splice site mutations or usage of cryptic splice sites due to the mutations in authentic splice sites.
S&S in animal and plant genomics research
S&S has also been used in RNA splicing research in many animals and plants.
mRNA splicing plays a fundamental role in the regulation of gene function. It has been shown that A to G conversions at splice sites can lead to mRNA mis-splicing in Arabidopsis. Splicing and exon–intron junction predictions coincided with the GT/AG rule (S&S) in the molecular characterization and evolutionary study of class V β-1,3-glucanase in the carnivorous sundew (Drosera rotundifolia L.). Unspliced (LSDH) and spliced (SSDH) transcripts of NAD+-dependent sorbitol dehydrogenase (NADSDH) of strawberry (Fragaria ananassa Duch., cv. Nyoho) were investigated under phytohormonal treatments.
Ambra1 is a positive regulator of autophagy, a lysosome-mediated degradative process involved in both physiological and pathological conditions. To date, this function of Ambra1 has been characterized only in mammals and zebrafish. Diminution of rbm24a or rbm24b gene products by morpholino knockdown resulted in significant disruption of somite formation in mouse and zebrafish. The S&S algorithm was used extensively to study the intron–exon organization of fut8 genes. The intron–exon boundaries of Sf9 fut8 were found, using S&S, to agree with the consensus sequence for the splicing donor and acceptor sites.
The split-gene theory, introns and splice junctions
The motivation for Dr. Senapathy to develop a method for the detection of splice junctions came from his split-gene theory. If primordial DNA sequences had a random nucleotide organization, the random distribution of stop codons would allow only very short open reading frames (ORFs), as three stop codons out of 64 codons would result in an average ORF of ~60 bases. When Senapathy tested this in random DNA sequences, not only was this confirmed, but the longest ORFs, even in very long DNA sequences, were found to be ~600 bases, above which no ORFs existed. If so, a long coding sequence of even 1,200 bases (the average coding sequence length of genes from living organisms), and longer coding sequences of 6,000 bases (many of which occur in living organisms), would not occur in a primordial random sequence. Thus, genes had to occur in pieces in a split form, with short coding sequences (ORFs) that became exons, interrupted by very long random sequences that became introns. When eukaryotic DNA was tested for ORF length distribution, it exactly matched that from random DNA, with very short ORFs that matched the lengths of exons, and very long introns as predicted, supporting the split-gene theory.
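The arithmetic behind the ~60-base figure can be checked directly: if each of the 64 codons is equally likely and 3 of them are stops, the expected number of codons before a stop is 61/3, or about 20 codons (roughly 61 bases). The simulation below is a small sketch of the random-DNA test, assuming uniformly random, independent bases; it is not the original analysis, and the function names and sequence length are chosen for illustration.

```python
import random
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_lengths(sequence: str) -> List[int]:
    """Lengths in bases of stop-free stretches, over the three forward frames.

    Each stretch is measured up to (not including) the stop codon that ends
    it. A trailing stretch with no terminating stop is not counted.
    """
    lengths = []
    for frame in range(3):
        run = 0  # codons seen since the previous stop in this frame
        for i in range(frame, len(sequence) - 2, 3):
            if sequence[i:i + 3] in STOP_CODONS:
                lengths.append(run * 3)
                run = 0
            else:
                run += 1
    return lengths


def simulate(n_bases: int = 100_000, seed: int = 0) -> Tuple[float, int]:
    """Return (mean, longest) stop-free stretch length in a random sequence."""
    rng = random.Random(seed)
    sequence = "".join(rng.choice("ACGT") for _ in range(n_bases))
    lengths = orf_lengths(sequence)
    return sum(lengths) / len(lengths), max(lengths)


if __name__ == "__main__":
    mean_length, longest = simulate()
    print(f"mean: {mean_length:.1f} bases, longest: {longest} bases")
```

Because the stretch lengths follow a geometric distribution, long stop-free stretches become exponentially rare, which is why the longest stretch in such a simulation stays within a few hundred bases even when the sequence is very long.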
If this split-gene theory was true, then the ends of these ORFs, which have a stop codon by nature, would have become the ends of exons that would occur within introns, and that would define the splice junctions. When this hypothesis was tested, almost all splice junctions in eukaryotic genes were found to contain stop codons exactly at the ends of introns, bordering the exons. In fact, these stop codons were found to form the “canonical” AG:GT splicing sequence, with the three stop codons occurring as part of the strong consensus signals. The Nobel Laureate Dr. Marshall Nirenberg, who deciphered the codons, stated that these findings strongly showed that the split-gene theory for the origin of introns and the split structure of genes must be valid, and communicated the paper to PNAS. New Scientist covered this publication in “A long explanation for introns”.
This basic split-gene theory led to the hypothesis that the splice junctions originated from the stop codons. Besides the codon CAG, only TAG, which is a stop codon, was found at the ends of introns. Surprisingly, all three stop codons (TGA, TAA and TAG) were found after one base (G) at the start of introns. These stop codons are shown in the consensus canonical donor splice junction as AG:GT(A/G)GGT, wherein the TAA and TGA are the stop codons, and the additional TAG is also present at this position. The canonical acceptor splice junction is shown as (C/T)AG:GT, in which TAG is the stop codon. These consensus sequences clearly show the presence of the stop codons at the ends of introns bordering the exons in all eukaryotic genes. Dr. Marshall Nirenberg, who was the referee for this paper, again stated that these observations fully supported the split-gene theory for the origin of splice junction sequences from stop codons. New Scientist covered this publication in “Exons, Introns and Evolution”.
Dr. Senapathy wanted to detect the splice junctions in random DNA based on the consensus splice signal sequences, as he found that there were many sequences resembling splice sites that were not the real splice sites within genes. This Position Weight Matrix method turned out to be a highly accurate algorithm to detect the real splice sites and the cryptic sites in genes. He also formulated the first exon detection method, based on the requirement for splice junctions at the ends of exons, and the requirement for an open reading frame that would contain the exon. This exon detection method also turned out to be highly accurate, detecting most of the exons with few false positives and false negatives. He extended this approach to define a complete split gene in a eukaryotic genomic sequence. Thus, the PWM-based algorithm turned out to be sensitive enough not only to detect the real splice sites and cryptic sites, but also to distinguish deleterious splice site mutations from non-deleterious ones.
The stop codons within splice junctions turned out to be the strongest bases in splice junctions of eukaryotic genes, when tested using the PWMs of the consensus sequences. In fact, it was shown that mutations in these bases were the cause of diseases more often than mutations in other bases, as three of the four bases (bases 1, 3 and 4) of the canonical AG:GT were part of the stop codons. Senapathy showed that, when these canonical bases were mutated, the splice site score became weak, causing aberrations in the splicing process and in translation of the mRNA (as described under the diseases section above). Although the value of the splice site detection method in discovering genes with splicing mutations that cause disease has been realized over the years, its importance in clinical medicine has been increasingly realized in the Next Generation Sequencing era over the past five years, with its incorporation in several tools based on the S&S algorithm.
Dr. Senapathy is currently the President and CSO of Genome International Corporation (GIC), a genomics R&D company based in Madison, WI. His team has developed several databases and tools for the analysis of splice junctions, including EuSplice, AspAlt, ExDom and RoBust. AspAlt was commended by Biotechniques, which stated that it solved a difficult problem for scientists in the comparative analysis and visualization of alternative splicing across different genomes. GIC has most recently developed the clinical genomics analysis platform Genome Explorer®.
Selected publications
- Shapiro, Marvin B.; Senapathy, Periannan (1987). "RNA splice junctions of different classes of eukaryotes: sequence statistics and functional implications in gene expression". Nucleic Acids Research. 15 (17): 7155–7174. doi:10.1093/nar/15.17.7155. PMC 306199. PMID 3658675.
- Senapathy, P. (1988). "Possible evolution of splice-junction signals in eukaryotic genes from stop codons". Proc Natl Acad Sci U S A. 85 (4): 1129–33. Bibcode:1988PNAS...85.1129S. doi:10.1073/pnas.85.4.1129. PMC 279719. PMID 3422483.
- Senapathy, P; Shapiro, MB; Harris, NL (1990). "Splice junctions, branch point sites, and exons: sequence statistics, identification, and applications to genome project". Methods in Enzymology. 183: 252–78. doi:10.1016/0076-6879(90)83018-5. PMID 2314278.
- Harris, N.L.; Senapathy, P. (1990). "Distribution and consensus of branch point signals in eukaryotic genes: a computerized statistical analysis". Nucleic Acids Res. 18 (10): 3015–9. doi:10.1093/nar/18.10.3015. PMC 330832. PMID 2349097.
- Senapathy, P. (1986). "Origin of eukaryotic introns: a hypothesis, based on codon distribution statistics in genes, and its implications". Proc Natl Acad Sci U S A. 83 (7): 2133–7. Bibcode:1986PNAS...83.2133S. doi:10.1073/pnas.83.7.2133. PMC 323245. PMID 3457379.
- Regulapati, R.; Bhasi, A.; Singh, C.K.; Senapathy, P. (2008). "Origination of the Split Structure of Spliceosomal Genes from Random Genetic Sequences". PLOS ONE. 3 (10): 10. Bibcode:2008PLoSO...3.3456R. doi:10.1371/journal.pone.0003456. PMC 2565106. PMID 18941625.
- Senapathy, P. (1995). "Introns and the origin of protein-coding genes". Science. 268 (5215): 1366–7. Bibcode:1995Sci...268.1366S. doi:10.1126/science.7761858. PMID 7761858.