Refined annotation of the Arabidopsis thaliana genome by complete EST mapping
Wei Zhu, Shannon D. Schlueter and Volker Brendel (Plant Physiology 132, 469-484)
We have mapped the complete set of 176,915 publicly available Arabidopsis EST sequences
onto the Arabidopsis thaliana genome using GeneSeqer, a spliced alignment program
incorporating sequence similarity and splice site scoring.
About 96% of the available ESTs could be properly aligned with a genomic locus, with the
remaining ESTs deriving from organelle genomes and non-Arabidopsis sources, or
displaying insufficient sequence quality for alignment.
The mapping provides verified sets of EST clusters for evaluation of EST clustering programs.
Analysis of the spliced alignments suggests corrections to current gene structure annotation
and provides examples of alternative and non-consensus pre-mRNA splicing.
Particular applications are documented below.
|

|
This particular project inspired the database and web site you are currently using.
Please follow the link to the AtGDB "Data Sets, Methods, and Resources" page for more
information on what you are seeing and how it was derived.
|
|

|
Applications
The following links document particular results discussed in the cited manuscript.
|
|

|
EST assembly refers to the problem of finding the correct orientation and order of EST
sequences in a tiling path covering the cognate mRNA. Typically, ESTs are clustered
on the basis of pairwise sequence overlap and clone pair information. Here, we have
instead clustered 169,888 ESTs on the basis of their genomic locations; these are the ESTs
that could be assigned putative cognate locations by high-quality spliced alignment. The
link provides details and download options for various clusterings, which differ in the
minimal overlap condition applied.
|
|

|
With the maximal allowed gap between genomic end points of EST alignments set to 60 bases
for ESTs within the same cluster, 27,954 EST clusters were formed. Of these, 129 occur in
genomic regions without annotated protein-coding genes but contain open reading frames of
at least 100 codons.
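The gap-based grouping described above can be illustrated with a minimal sketch. It is not the GeneSeqer or AtGDB implementation. It assumes each alignment is reduced to a hypothetical `(est_id, start, end)` tuple on a single chromosome, ignores strand, and measures the gap from the furthest end point seen so far in the current cluster. The function name `cluster_by_location` and the sample coordinates are invented for illustration.

```python
from typing import List, Tuple

# Hypothetical record: (est_id, genomic_start, genomic_end) on one chromosome.
Alignment = Tuple[str, int, int]


def cluster_by_location(alignments: List[Alignment], max_gap: int = 60) -> List[List[str]]:
    """Group ESTs whose alignments lie within max_gap bases of each other.

    Assumption: a new cluster starts whenever the next alignment begins more
    than max_gap bases beyond the furthest end point of the current cluster.
    """
    clusters: List[List[str]] = []
    current: List[str] = []
    current_end = None
    # Sweep the alignments from left to right along the chromosome.
    for est_id, start, end in sorted(alignments, key=lambda a: (a[1], a[2])):
        if current and start - current_end > max_gap:
            clusters.append(current)
            current = []
            current_end = None
        current.append(est_id)
        current_end = end if current_end is None else max(current_end, end)
    if current:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    example = [("est_a", 100, 300), ("est_b", 250, 500),
               ("est_c", 550, 700), ("est_d", 800, 900)]
    for cluster in cluster_by_location(example):
        print(cluster)
```

In this toy input, `est_c` starts 50 bases after the end of `est_b` and therefore joins the first cluster, whereas `est_d` starts 100 bases after `est_c` ends and opens a new one.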
|
|

|
Analysis of Annotated Gene Structures
For the most part, current gene structure annotation has been established by application of
both ab initio gene prediction methods and mRNA-derived sequence alignment. Here we
show that 977 current GenBank gene structure annotations are in contradiction with EST
evidence. For 5,000 Ceres-TIGR full-length cDNAs used as a control set, 23 annotations were
found at odds with the spliced alignment.
EST evidence supports expression of about 65% of the annotated gene models. About half of
the EST clusters contain only one or two ESTs, but there are 147 genes supported by at least
100 ESTs. The EST distributions displayed at AtGDB can serve as rough "electronic Northerns."
2,023 genes have EST-confirmed potential 5'-UTR introns, and 487 genes have EST-confirmed
potential 3'-UTR introns. It remains to be studied what role in gene expression, if any,
these introns may have.
Full-length cDNAs provide the most decisive evidence for a particular gene structure
annotation. Unfortunately, the designation "full-length" appears to be applied non-uniformly,
referring in some cases to entire mRNA transcripts and in other cases merely to transcripts
covering all of the coding region. For example, in this database 1,100 putative cognate
spliced alignments from the Ceres-TIGR full-length cDNAs are embedded within longer EST
clusters. Alternative transcription start and end sites may account for some of these
discrepancies.
|
|

|
Unambiguous EST evidence supports 327 cases of apparent alternative splicing. The
alternative transcript isoforms can be classified relative to the dominant isoform as
displaying an alternative donor site (102 cases), an alternative acceptor site (190 cases),
alternative donor and acceptor sites (3 cases), exon skipping (21 cases), and others
(11 cases). There are also 338 cases of apparent intron retention (which may in part
represent sampling of incompletely or inefficiently spliced pre-mRNAs).
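One way to picture the classification above is to compare a single alternative intron with the introns of the dominant isoform. The sketch below is only an illustration under stated assumptions, not the procedure used for the counts above. Introns are hypothetical `(donor, acceptor)` coordinate pairs on the forward strand, the function name `classify_alternative_intron` is invented, and intron retention is not handled because it involves no alternative intron.

```python
from typing import List, Tuple

# Hypothetical record: (donor, acceptor) positions, forward strand, donor < acceptor.
Intron = Tuple[int, int]


def classify_alternative_intron(alt: Intron, dominant: List[Intron]) -> str:
    """Label an alternative intron relative to the dominant isoform's introns."""
    if alt in dominant:
        return "same as dominant"
    d_alt, a_alt = alt
    donors = {d for d, _ in dominant}
    acceptors = {a for _, a in dominant}
    # Exon skipping: the alternative intron joins the donor of one dominant
    # intron to the acceptor of a later one, spanning at least two introns.
    if d_alt in donors and a_alt in acceptors:
        spanned = [i for i in dominant if d_alt <= i[0] and i[1] <= a_alt]
        if len(spanned) >= 2:
            return "exon skipping"
    # Otherwise compare with the single dominant intron it overlaps, if any.
    overlapping = [(d, a) for d, a in dominant if d < a_alt and d_alt < a]
    if len(overlapping) == 1:
        d, a = overlapping[0]
        if d_alt != d and a_alt == a:
            return "alternative donor"
        if d_alt == d and a_alt != a:
            return "alternative acceptor"
        return "alternative donor and acceptor"
    return "other"


if __name__ == "__main__":
    dominant_introns = [(100, 200), (300, 400)]
    for candidate in [(110, 200), (100, 190), (110, 190), (100, 400), (500, 600)]:
        print(candidate, classify_alternative_intron(candidate, dominant_introns))
```

Real data would also need strand handling and a rule for choosing the dominant isoform, both of which are left out here.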
|
|

|
More than 98% of Arabidopsis introns are characterized by canonical GT...AG splice
sites. Stringent EST mapping supports 736 introns deviating from this consensus. The
distribution of non-canonical patterns is approximately as follows (in some cases, repeats at
the intron ends prevent unambiguous assignment of the splice sites). 23 non-canonical AT-AC
introns demonstrating U12 splicing characteristics are excluded from this list, which leaves
713 introns, and are studied in greater detail below.
|
Type | Number
GC-AG | 453
NN-AG (not GC-AG) | 99
GT-NN | 80
GC-NN (not GC-AG) | 14
Others (26 patterns) | 67
Total | 713
|
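The categories in the table can be illustrated with a short sketch that bins introns by their terminal dinucleotides. The order of the tests is an assumed reading of the row labels, not a documented procedure, and the function names `classify_termini` and `tally_intron_types` are hypothetical. As a simplification, every AT-AC intron is set aside in its own bin, whether or not it shows U12 characteristics.

```python
from collections import Counter
from typing import Iterable


def classify_termini(intron_seq: str) -> str:
    """Bin an intron by its first two and last two bases (forward-strand sequence)."""
    seq = intron_seq.upper()
    donor, acceptor = seq[:2], seq[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "GT-AG (canonical)"
    if (donor, acceptor) == ("AT", "AC"):
        return "AT-AC"  # set aside here; U12 status is not tested
    if (donor, acceptor) == ("GC", "AG"):
        return "GC-AG"
    if acceptor == "AG":
        return "NN-AG (not GC-AG)"
    if donor == "GT":
        return "GT-NN"
    if donor == "GC":
        return "GC-NN (not GC-AG)"
    return "Others"


def tally_intron_types(intron_seqs: Iterable[str]) -> Counter:
    """Count how many introns fall into each category."""
    return Counter(classify_termini(s) for s in intron_seqs)


if __name__ == "__main__":
    # Toy sequences, far shorter than real introns, used only to exercise each bin.
    toy_introns = ["GTAAGTTTTCAG", "GCAAGTTTTCAG", "CTAAGTTTTCAG",
                   "GTAAGTTTTCAT", "GCAAGTTTTCTT", "ATATCCTTTTAC", "TTAAGTTTTCCC"]
    for category, count in tally_intron_types(toy_introns).items():
        print(category, count)
```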
|

|
Motif analysis of the EST-defined Arabidopsis introns supports 41 cases of U12 or
U12-like intron splicing. Unlike the more common U2-type introns, U12-type introns
are more likely to be determined by conserved motifs around the donor site and the branch
site rather than by the dinucleotide termini of the intron.
|
|

|
Very short exons are typically hard to predict with ab initio gene structure
prediction programs. Furthermore, such exons are challenging for splicing models
designed from features of typical-sized exons and introns. Here we list 128 non-terminal
mini-exons (ranging in length from 5 to 25 bp) supported by EST evidence.
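The selection criterion above (non-terminal exons of 5 to 25 bp) can be sketched as a simple filter. This is only an illustration. It assumes each transcript is given as a list of hypothetical `(start, end)` exon coordinates, 1-based and inclusive, sorted along the transcript. The function name `find_mini_exons` is invented.

```python
from typing import List, Tuple

# Hypothetical record: (start, end), 1-based inclusive genomic coordinates.
Exon = Tuple[int, int]


def find_mini_exons(exons: List[Exon], min_len: int = 5, max_len: int = 25) -> List[Exon]:
    """Return internal (non-terminal) exons whose length lies in [min_len, max_len]."""
    internal = exons[1:-1]  # drop the first and last exon of the transcript
    return [(s, e) for s, e in internal if min_len <= e - s + 1 <= max_len]


if __name__ == "__main__":
    transcript = [(1, 100), (201, 210), (301, 400), (501, 520)]
    print(find_mini_exons(transcript))
```

In the toy transcript, the last exon is 20 bp long but is excluded because it is terminal.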
|
© 2006 Shannon D. Schlueter