Refined annotation of the Arabidopsis thaliana genome by complete EST mapping

Wei Zhu, Shannon D. Schlueter and Volker Brendel (Plant Physiology 132, 469-484)

We have mapped the complete set of 176,915 publicly available Arabidopsis EST sequences onto the Arabidopsis thaliana genome using GeneSeqer, a spliced alignment program incorporating sequence similarity and splice site scoring. About 96% of the available ESTs could be properly aligned with a genomic locus, with the remaining ESTs deriving from organelle genomes and non-Arabidopsis sources, or displaying insufficient sequence quality for alignment. The mapping provides verified sets of EST clusters for evaluation of EST clustering programs. Analysis of the spliced alignments suggests corrections to current gene structure annotation and provides examples of alternative and non-consensus pre-mRNA splicing. Particular applications are documented below.


Data Sets and Methods

This project inspired the database and web site you are currently using. Please follow the link to the AtGDB "Data Sets, Methods, and Resources" page for more information on what you are seeing and how it was derived.


Applications

The following links document particular results discussed in the cited manuscript.

EST Clustering and Assembly

EST assembly refers to the problem of finding the correct orientation and order of EST sequences in a tiling path covering the cognate mRNA. Typically, ESTs are clustered on the basis of pairwise sequence overlap and clone pair information. Here, we have instead clustered ESTs on the basis of their genomic locations, using the 169,888 ESTs that could be assigned a putative cognate location by high-quality spliced alignment. The link provides details and download options for various clusterings, which differ in the minimal overlap conditions imposed.

Putative Novel Genes

With the maximal allowed gap between genomic end points of EST alignments set to 60 bases for ESTs within the same cluster, 27,954 EST clusters were formed. Of these, 129 occur in genomic regions without annotated protein coding genes but contain open reading frames of at least 100 codons.
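The gap-based clustering described above can be sketched as a single pass over alignments sorted by genomic start. This is a minimal illustration, not the procedure used for AtGDB: the `Alignment` tuple layout, the function name `cluster_by_location`, and the definition of the gap as the number of uncovered bases between inclusive coordinates are all assumptions, and all alignments are assumed to lie on the same chromosome and strand.

```python
from typing import List, Tuple

# Hypothetical record layout: (est_id, genomic_start, genomic_end),
# inclusive coordinates, all on one chromosome and strand (assumption).
Alignment = Tuple[str, int, int]


def cluster_by_location(alignments: List[Alignment],
                        max_gap: int = 60) -> List[List[str]]:
    """Group ESTs whose genomic spans lie within max_gap bases of each other."""
    clusters: List[List[str]] = []
    current: List[str] = []
    current_end = 0  # rightmost genomic position covered by the open cluster

    for est_id, start, end in sorted(alignments, key=lambda a: (a[1], a[2])):
        # Uncovered bases between the open cluster and this alignment
        # (negative or zero when the spans overlap or abut).
        gap = start - current_end - 1
        if current and gap > max_gap:
            clusters.append(current)
            current = []
        if not current:
            current_end = end
        current.append(est_id)
        current_end = max(current_end, end)

    if current:
        clusters.append(current)
    return clusters


example = [("est_a", 1, 100), ("est_b", 161, 300), ("est_c", 362, 400)]
clusters = cluster_by_location(example, max_gap=60)
```

In this example, `est_a` and `est_b` are separated by exactly 60 uncovered bases and share a cluster, while `est_c` lies 61 bases beyond `est_b` and starts a new one.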

Analysis of Annotated Gene Structures

For the most part, current gene structure annotation has been established by application of both ab initio gene prediction methods and mRNA-derived sequence alignment. Here we show that 977 current GenBank gene structure annotations are in contradiction with EST evidence. For 5,000 Ceres-TIGR full-length cDNAs used as a control set, 23 annotations were found to be at odds with the spliced alignment.

EST evidence supports expression of about 65% of the annotated gene models. About half of the EST clusters contain only one or two ESTs, but there are 147 genes supported by at least 100 ESTs. The EST distributions displayed at AtGDB can serve as rough "electronic Northerns."

2,023 genes have EST-confirmed potential 5'-UTR introns, and 487 genes have EST-confirmed potential 3'-UTR introns. It remains to be studied what role in gene expression, if any, these introns may have.

Full-length cDNAs provide the most decisive evidence for a particular gene structure annotation. Unfortunately, the designation "full-length" appears to be applied non-uniformly, referring in some cases to entire mRNA transcripts and in other cases merely to transcripts covering all of the coding region. For example, in this database 1,100 putative cognate spliced alignments from the Ceres-TIGR full-length cDNAs are embedded within longer EST clusters. Alternative transcription start and end sites may account for some of these discrepancies.

Alternative Splicing

Unambiguous EST evidence supports 327 cases of apparent alternative splicing. The alternative transcript isoforms can be classified relative to the dominant isoform as displaying an alternative donor site (102 cases), an alternative acceptor site (190 cases), alternative donor and acceptor sites (3 cases), exon skipping (21 cases), and others (11 cases). There are also 338 cases of apparent intron retention (which may in part represent sampling of incompletely or inefficiently spliced pre-mRNAs).
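The classification relative to the dominant isoform can be illustrated by comparing intron coordinates. The sketch below is an assumption-laden simplification, not the method used in the cited study: introns are hypothetical `(donor, acceptor)` coordinate pairs on the forward strand, the function name `classify_alt_intron` is invented for illustration, and intron retention (the absence of an intron) is not handled.

```python
from typing import List, Tuple

Intron = Tuple[int, int]  # (donor position, acceptor position), donor < acceptor


def classify_alt_intron(dominant: List[Intron], alt: Intron) -> str:
    """Label one intron of an alternative isoform relative to the dominant introns."""
    if alt in dominant:
        return "matches dominant"

    donor, acceptor = alt
    donors = {d for d, _ in dominant}
    acceptors = {a for _, a in dominant}

    # A known donor joined to the acceptor of a different dominant intron
    # means at least one dominant exon was skipped.
    if donor in donors and acceptor in acceptors:
        return "exon skipping"

    # Dominant introns whose span overlaps the alternative intron.
    overlapping = [(d, a) for d, a in dominant if d <= acceptor and donor <= a]
    if not overlapping:
        return "other"
    if donor in donors:
        return "alternative acceptor site"
    if acceptor in acceptors:
        return "alternative donor site"
    if len(overlapping) == 1:
        return "alternative donor and acceptor sites"
    return "other"


dominant_introns = [(100, 200), (300, 400), (500, 600)]
labels = [classify_alt_intron(dominant_introns, i)
          for i in [(100, 220), (90, 200), (110, 190), (100, 400), (700, 800)]]
```

Real data would also need strand handling and tolerance for alignment ambiguity at repeated bases near the splice sites, which this sketch omits.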

Putative Non-canonical Splice Sites

More than 98% of Arabidopsis introns are characterized by canonical GT...AG splice sites. Stringent EST mapping supports 736 introns deviating from this consensus. The distribution of non-canonical patterns is approximately as follows (in some cases, repeats at the intron ends prevent unambiguous assignment of the splice sites). The 23 non-canonical AT-AC introns demonstrating U12 splicing characteristics are excluded from this list, leaving 713, and are studied in greater detail below.

Type                    Number
GC-AG                   453
NN-AG (not GC-AG)       99
GT-NN                   80
GC-NN (not GC-AG)       14
Others (26 patterns)    67
Total                   713
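As an illustration of how the categories in the table partition introns, the following sketch assigns an intron sequence to one of the listed types from its terminal dinucleotides. The function name `classify_splice_sites` and the order in which the categories are tested are assumptions chosen so that the types do not overlap. The sketch does not attempt to separate out the U12-type AT-AC introns that the list above excludes, so any AT-AC intron falls under "Others" here.

```python
def classify_splice_sites(intron_seq: str) -> str:
    """Classify an intron by its first two (donor) and last two (acceptor) bases."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron sequence must contain at least 4 bases")

    donor, acceptor = seq[:2], seq[-2:]

    if donor == "GT" and acceptor == "AG":
        return "GT-AG (canonical)"
    if donor == "GC" and acceptor == "AG":
        return "GC-AG"
    if acceptor == "AG":
        return "NN-AG (not GC-AG)"   # donor is neither GT nor GC
    if donor == "GT":
        return "GT-NN"               # acceptor is not AG
    if donor == "GC":
        return "GC-NN (not GC-AG)"
    return "Others"


site_types = [classify_splice_sites(s) for s in
              ["GTAAGTTTTCAG", "GCAAGTTTAG", "ATAAAAAG",
               "gtaaaacc", "GCAAAACC", "ATATCCTTAC"]]
```

Because repeats at the intron ends can shift the inferred boundaries, a classification like this is only as reliable as the splice site coordinates supplied to it.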

Putative U12 Spliced Introns

Motif analysis of the EST-defined Arabidopsis introns supports 41 cases of U12 or U12-like intron splicing. Unlike in the more common U2-type splicing, U12-type introns are more likely to be determined by conserved motifs around the donor site and the branch site than by the dinucleotide termini of the intron.

Mini-exons

Very short exons are typically hard to predict with ab initio gene structure prediction programs. Furthermore, such exons are challenging for splicing models designed from features of typical-sized exons and introns. Here we list 128 non-terminal mini-exons (ranging in length from 5 to 25 bp) supported by EST evidence.
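The selection criterion for mini-exons can be expressed compactly. The sketch below assumes a hypothetical representation of a gene structure as a list of `(start, end)` exon coordinates (inclusive) and an invented function name `find_mini_exons`; it shows the length and non-terminal filters only, not how the EST support was established.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive genomic coordinates


def find_mini_exons(exons: List[Exon],
                    min_len: int = 5,
                    max_len: int = 25) -> List[Exon]:
    """Return non-terminal exons whose length lies in [min_len, max_len]."""
    ordered = sorted(exons)
    internal = ordered[1:-1]  # drop the first and last (terminal) exons
    return [(start, end) for start, end in internal
            if min_len <= end - start + 1 <= max_len]


gene_exons = [(1, 100), (201, 212), (301, 320), (401, 600), (701, 710)]
mini_exons = find_mini_exons(gene_exons)
```

In this example the final exon is only 10 bp long but is terminal, so it is not reported; the 12 bp and 20 bp internal exons are.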
© 2006 Shannon D. Schlueter
