With the ultimate goal of characterizing the plant gene space, PlantGDB regularly assembles unique transcripts from plant mRNA sequences. Our procedure involves the Vmatch, PaCE, and CAP3 software programs. This page describes the assembly procedure in detail.
Data sources
Plant mRNA sequences are extracted from NCBI. More specifically, every GenBank record is downloaded from the NCBI FTP site. The plant mRNA sequences are extracted from the EST, HTC, and PLN divisions and sorted by species. Unless specifically requested by researchers, PlantGDB only assembles species-specific PUTs once the species' mRNA sequence count has reached 10,000. New releases of PUTs will be available every three months, shortly following the most recent GenBank release. The release version is indicated in the PUT identifier as described here.
Contamination and repetitive elements
Some sequences deposited in GenBank are contaminated by non-native sequences derived from cloning vectors, bacterial hosts, etc. In addition, abundant repetitive elements (e.g., transposons) in a sequence collection will also prevent accurate PUT assembly (for a recent review on this topic, see Comparative EST analyses in plant systems).
In our assembly pipeline, we use the Vmatch program to identify contamination and repetitive elements by comparing the mRNA sequences to vector, bacterial, and repeat databases.
Specifically, the NCBI UniVec database and the E. coli genome sequence are used for masking vector and bacterial contamination, respectively (Vmatch options: -qmaskmatch X -d -p -l 50 -exdrop 1 -identity 90). After trimming off the masked contamination nucleotides, the surviving sequence length must be at least 100 bp in order to proceed to the next step. Similarly, the TIGR plant repeat database is used for masking known repetitive elements (Vmatch options: -qmaskmatch X -d -p -l 100 -exdrop 2 -identity 80). In this case, if more than 50% of the nucleotides in a sequence are masked, the sequence is excluded from the subsequent assembly. Otherwise, the original unmasked sequence is used in the next assembly step (i.e., if we cannot confidently mask the majority of a sequence's nucleotides as repetitive elements, we treat the sequence as containing no repetitive elements at all, to avoid false masking).
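The two filtering rules above can be sketched in Python as follows. This is only an illustration, not the PlantGDB pipeline code: the function names are hypothetical, the input is assumed to be a sequence string in which Vmatch has already replaced masked nucleotides with X, and "trimming" is assumed here to mean keeping the longest unmasked stretch.

```python
from typing import Optional

MASK_CHAR = "X"            # character written by Vmatch -qmaskmatch X
MIN_CLEAN_LENGTH = 100     # bp that must survive contamination trimming
MAX_REPEAT_FRACTION = 0.5  # sequences masked beyond this fraction are dropped


def trim_contamination(masked: str) -> Optional[str]:
    """Return the surviving sequence, or None if it is shorter than 100 bp.

    Assumption: trimming keeps the longest stretch without mask characters.
    """
    longest = max(masked.split(MASK_CHAR), key=len)
    return longest if len(longest) >= MIN_CLEAN_LENGTH else None


def apply_repeat_filter(original: str, repeat_masked: str) -> Optional[str]:
    """Return the original (unmasked) sequence unless >50% of it was masked."""
    if len(original) != len(repeat_masked):
        raise ValueError("masked and original sequences must have equal length")
    if not original:
        return None
    masked_fraction = repeat_masked.count(MASK_CHAR) / len(repeat_masked)
    if masked_fraction > MAX_REPEAT_FRACTION:
        return None  # mostly repetitive: excluded from assembly
    return original  # repeat masking is discarded, as described above
```

Note that the repeat filter returns the original sequence rather than the masked one, mirroring the rule that a sequence that is not mostly repetitive is treated as containing no repeats.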
PolyA tail
The presence of polyA tails in the transcripts also prevents accurate PUT assembly (i.e., unrelated sequences can be linked together solely by their polyA sequences). Therefore, we first mask polyA sequences (Vmatch options: -d -p -v -l 15 -exdrop 1 -identity 90 -selfun end2end-match.so -qmaskmatch X -seedlength 10, where end2end-match.so is our own customized Vmatch selection function that ensures the masking is performed only on end-to-end regions). After trimming off the masked polyA region, the surviving sequence length must be at least 50 bp in order to proceed to the next step.
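A minimal sketch of the trimming rule, assuming the masked polyA region appears as a run of X characters at either end of the sequence (the function name and this end-only interpretation are illustrative assumptions):

```python
from typing import Optional


def trim_polya(masked: str, mask_char: str = "X", min_length: int = 50) -> Optional[str]:
    """Strip masked runs from both ends; return None if fewer than 50 bp remain."""
    trimmed = masked.strip(mask_char)  # only terminal mask characters are removed
    return trimmed if len(trimmed) >= min_length else None
```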
Removal of Duplicates
This step is currently perhaps unique to our assembly procedure (compared with other resources, e.g., TIGR Gene Indices). GenBank (especially the EST division) is known to contain many duplicates (identical or near-identical sequences). These duplicates waste considerable computational resources during assembly (i.e., through repeated alignment of identical and near-identical sequences). One goal at PlantGDB is to provide an accurate estimate of plant gene space by assembling mRNA fragments in a timely fashion. Unfortunately, current assembly programs (e.g., CAP3) slow down dramatically with large numbers of input sequences. In our experience, removing the duplicates greatly speeds up our assembly process, and the resulting consensus sequences are still compatible with those assembled from the entire data set.
To identify duplicates among the input sequences, we designed a filter to remove globally similar sequences. Specifically, the Vmatch program (with options: -d -p -l 50 -exdrop 1 -identity 99) is first applied to compare every input sequence against every other. Then, if a sequence A is contained in another sequence B ("contained" is defined as A being shorter than B and at least 99% of the nucleotides in A being matched to B), sequence A is excluded from the subsequent step while sequence B proceeds. Note that although A is excluded from later assembly (i.e., it does not participate in building the consensus sequence), its sequence information is already represented by B. In addition, the "contained" relationship between A and B is stored in a PlantGDB database table and displayed on the web so that we can trace the contribution of A. As a result, we no longer use "contig" or "singlet" to classify our FINAL assembly results, because any final "Unique Transcript" may also represent its contained sequences.
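The containment rule can be illustrated with the short Python sketch below. It assumes the Vmatch output has already been reduced to tuples of (query id, subject id, number of query nucleotides matched); that tuple layout and the function name are hypothetical and are not part of Vmatch or the PlantGDB code.

```python
from typing import Dict, Iterable, Set, Tuple


def find_contained(
    lengths: Dict[str, int],
    matches: Iterable[Tuple[str, str, int]],
) -> Tuple[Set[str], Dict[str, str]]:
    """Return (sequences that proceed to assembly, map of contained -> container)."""
    contained_in: Dict[str, str] = {}
    for a, b, matched in matches:
        if a == b or a in contained_in:
            continue
        # A is contained in B if A is shorter and >= 99% of A matches B.
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if lengths[a] < lengths[b] and matched * 100 >= 99 * lengths[a]:
            contained_in[a] = b
    survivors = set(lengths) - set(contained_in)
    return survivors, contained_in


lengths = {"A": 200, "B": 500, "C": 200}
matches = [("A", "B", 199), ("C", "B", 150), ("B", "A", 199)]
survivors, contained_in = find_contained(lengths, matches)
```

In this example A is dropped and recorded as contained in B, while C stays because only 75% of its nucleotides match B. The returned map plays the role of the stored "contained" relationship used to trace each excluded sequence.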
Clustering and Assembly
The sequences are first clustered by the PaCE program, which groups overlapping sequences by single-linkage clustering on parallel
computers (PaCE options: match 2, mismatch -4, gap -1, hgap -6, AlignmentWithN -1, LOADPERPROC 80, window 11, MinLen 100, ScratchMemory 250,
TranscriptsTogether 0, EndToEndScoreRatioThreshold 10, EndtoEndAlignLenThreshold 80, MaxScoreRatioThreshold 5, TranscriptCoverageThreshold 40,
ClonePairsFile None, Keep_Mbuf_Full 0, MPI_Block_Sends 1, ReportSplicedCandidates 0, ReportMaximalPairs 0, ReportMaximalSubstrings 0,
ReportAcceptedPairs 0, ReportGeneratedPairs 0, ReportMaximalRepeatCount 1, DumpClustersMidway 1).
Then, for each resulting PaCE cluster, the CAP3 program is used to perform the assembly (CAP3 options: -p 95 -o 49 -t 10000). The output is a set of
CAP3 contigs/singlets, where the contigs are consensus sequences derived from multiple member mRNA sequences.
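To illustrate what single-linkage clustering means here, the sketch below groups sequences with a union-find structure: two sequences end up in the same cluster whenever a chain of accepted overlaps connects them. It is a conceptual illustration only, not PaCE's parallel implementation; the function name and the pre-computed list of overlapping pairs are assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def single_linkage_clusters(
    ids: Iterable[str],
    overlapping_pairs: Iterable[Tuple[str, str]],
) -> List[List[str]]:
    """Group ids so that any chain of pairwise overlaps shares one cluster."""
    ids = list(ids)
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlapping_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


clusters = single_linkage_clusters(
    ["e1", "e2", "e3", "e4", "e5"],
    [("e1", "e2"), ("e2", "e3"), ("e4", "e5")],
)
```

Here e1 and e3 share a cluster even though they do not overlap directly, because each overlaps e2. Each resulting cluster would then be assembled separately.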
Refinement
This step is currently perhaps unique to our assembly procedure. The above PaCE parameters were chosen to balance sensitivity (clustering all the overlapping sequences together), specificity (clustering based on meaningful end-to-end or global overlap), and performance (clustering speed). There is no guarantee that PaCE will not produce false negatives (sequences that should be clustered together but are spread across different clusters). Any such false negatives are propagated into the CAP3 assembly step, since CAP3 operates only on individual PaCE clusters.
Therefore, to minimize such potential false negatives, the resulting CAP3 contigs/singlets are self-clustered using the Vmatch program (Vmatch options: -d -p -seedlength 15 -l 50 -exdrop 1 -identity 95 -selfun end2end-match.so -dbcluster 0 0, where end2end-match.so is our own customized Vmatch selection function that ensures the clustering is based on end-to-end overlap). If any CAP3 contigs/singlets are clustered together (e.g., the previous PaCE/CAP3 false negatives), their member mRNA sequences are pooled for re-assembly by CAP3. In other words, every set of potentially overlapping sequences gets a further opportunity to be grouped together and have its consensus sequence rebuilt.
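The pooling step can be sketched as follows, assuming the Vmatch self-clustering result is available as lists of contig/singlet identifiers and that a mapping from each contig/singlet to its member mRNA identifiers has been kept. Both data structures and the function name are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence


def pool_for_reassembly(
    contig_clusters: Iterable[Sequence[str]],
    members: Dict[str, Sequence[str]],
) -> List[List[str]]:
    """Return one pooled list of member mRNA ids per multi-contig cluster."""
    pools: List[List[str]] = []
    for cluster in contig_clusters:
        if len(cluster) < 2:
            continue  # an unclustered contig/singlet needs no re-assembly
        pooled = set()
        for contig in cluster:
            pooled.update(members[contig])
        pools.append(sorted(pooled))
    return pools


pools = pool_for_reassembly(
    [["Contig1", "Singlet7"], ["Contig2"]],
    {"Contig1": ["e1", "e2"], "Singlet7": ["e3"], "Contig2": ["e4", "e5"]},
)
```

Each pooled list would then be passed to CAP3 as a single input set, while Contig2 in this example is left unchanged.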
Final results
After the above refinement, we obtain a set of final CAP3 contigs/singlets with the assurance that they overlap minimally with each other. These final CAP3 contigs/singlets are designated PlantGDB-assembled Unique Transcripts (PUTs). The unique transcripts are subjected to automated functional annotation by BLASTX searches (BLAST option: -e 1e-20) against the UniProt protein database to identify significantly similar proteins. Besides the individual record display on the web, the final unique transcripts and their functional annotations can be accessed by: