Gene Groups


SRGD - Data and Methods:

SRGD was initially developed by Wang & Brendel (The ASRG database). Using their set of 395 splicing-related genes in Arabidopsis (ASRG395) as a starting point, Chen & Brendel (Identification and Survey of Splicing-related Proteins in 10 Plant Species (unsubmitted)) surveyed 10 plant genomes for splicing-related genes. This page provides links to the data sets, software, and scripts used in that work.


Our study covered 10 plant species: six dicots (Arabidopsis thaliana (At), Glycine max (Gm), Lotus japonicus (Lj), Medicago truncatula (Mt), Populus trichocarpa (Pt), and Vitis vinifera (Vv)), three monocots (Oryza sativa (Os, rice), Sorghum bicolor (Sb), and Zea mays (Zm)), and one moss (Physcomitrella patens (Pp)). Of these 10 species, one is a non-flowering plant (Pp) and three are legumes (Gm, Lj, and Mt).

The following table lists the data sources and download links. It has five columns: species; data source and version; genomic or gene sequences, which are needed to generate the CIWOG information files; protein sequences (the complete set of annotated protein sequences); and annotation files (GFF and XML files).

| Species | Source (version) | Genomic/Gene Sequences | Protein Sequences | Annotation Files |
| --- | --- | --- | --- | --- |
| At | TAIR (TAIR9_blastsets) | At (TAIR9_seq_20090619) | At_aa (TAIR9_pep_20090619) | TAIR9_3_utr_20090619 |
| Gm | JGI (Glyma1) | Gm (Gmax.main_genome.scaffolds.fasta.gz) | Gm_aa (Glyma1.pep.fa.gz) | Glyma1.gff3 |
| Lj | Kazusa.org (version_r1.0) | Lj (genome_sequence.gz) | Lj_aa (Protein_sequence.gz) | models_gff |
| Mt | Medicago.org (Mt2.0) | Mt (Mt2.0_pseudomolecule.tar.gz) | Mt_aa (20080227_imgag_protMAPPED_NO_OVERLAP.fa.tar.gz) | MT2.0_medicago_chrX_20080103_NoOverlap.xml.tar.gz |
| Os | Plantbiology.msu (version_6.1) | Os (all.seq) | Os_aa (all.pep) | all.gff3 |
| Pp | JGI (v5.0) | Pp (Physcomitrella_patens.1_1.fasta.gz) | Pp_aa (Phypal_1.FilteredModels3.fasta.gz) | Phypal_1.FilteredModels3.gff.gz |
| Pt | JGI (v1.1) | Pt (poplar.unmasked.fasta.gz) | Pt_aa (Proteins.Poptr1_1.JamboreeModels.fasta.gz) | Poptr1_1.JamboreeModels.gff |
| Sb | JGI (Sbi1_4) | Sb (Sorbi1_assembly_scaffolds.fasta.gz) | Sb_aa (Sorbil_GeneModels_FilteredModels6_aa.fasta.gz) | Sorbi1_GeneModels_FilteredModels6_FilteredModels6.gff.gz |
| Vv | genoscope.cns.fr (Unmarked) | Vv (unmasked) | Vv_aa (unmasked) | Vitis_vinifera_annotation_v1.gff |
| Zm | Maizesequence.org (release-4a.53) | Zm (ZmB73_4a.53_filtered_genes_500.fasta) | Zm_aa (ZmB73_4a.53_filtered_translations.fasta) | ZmB73_4a.53_filtered_genes.gff |

Tools (Software used):

Common software used in this work was obtained from the respective public distribution sites:

Scripts and Pipeline:

A three-round BLASTp search was used to identify pre-mRNA splicing-related proteins in 10 plant species as follows:

Step 1:

Initially, a comprehensive set of 395 pre-mRNA splicing-related proteins in Arabidopsis was downloaded from the ASRG database. This set is referred to as AtSRP (ASRG395). Complete sets of predicted protein sequences for the 10 plant species, derived from the respective genome annotations, were obtained from the sources listed in the data set table above.

Step 2:

AtSRP was then used as the query in a local BLASTp search against each of the annotated protein sets. All hits with an e-value of less than 10^-20 were retained for further analysis. The BLASTp results for each species can be downloaded from the table below, where they are named At_** (the BLASTp result of AtSRP against **), with ** standing for one of the 10 plant species.

  • The command lines are:
  • $ formatdb -i hugefasta -p T
  • $ blastall -i infile -d hugefasta -p blastp -o out -m 8
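The e-value cutoff described in this step can be applied to the tabular (-m 8) output after the search. The following is a minimal sketch, not the script used in this work. It assumes the standard 12-column tabular layout with the e-value in the 11th column; the function name and the example identifiers are hypothetical.

```python
import csv
from typing import Iterable, Iterator, List

EVALUE_CUTOFF = 1e-20  # cutoff stated in the text


def filter_hits(lines: Iterable[str], cutoff: float = EVALUE_CUTOFF) -> Iterator[List[str]]:
    """Yield tabular BLAST rows whose e-value is below the cutoff.

    Assumes 12 tab-separated columns per row, with the e-value in
    column 11 (index 10), as in BLAST -m 8 output.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        if float(row[10]) < cutoff:
            yield row


if __name__ == "__main__":
    # Hypothetical rows for illustration only; not real results.
    example = [
        "query1\tsubjectA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
        "query1\tsubjectB\t30.0\t100\t70\t0\t1\t100\t1\t100\t1e-05\t45.0",
    ]
    for hit in filter_hits(example):
        print(hit[0], hit[1], hit[10])
```

In practice the function could be given an open file handle on a downloaded At_** result instead of an in-memory list.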

Step 3:

To identify additional potential homologs not identified in the initial search, all hits from the first stage were retrieved, pooled, and then used as the query in a second local BLASTp search against the combined set of all annotated proteins from all 10 species. New hits at an e-value cutoff of 10^-20 were added to the set of candidate plant pre-mRNA splicing-related proteins. This BLASTp result is listed in the following table as all_all.

The following table provides links to Blast output files for each species:

Blast At_At At_Gm At_Lj At_Mt At_Os At_Pp At_Pt At_Sb At_Vv At_Zm all_all
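Pooling the first-round hits into a query for the second search could look roughly like the sketch below. It is an illustration under assumptions, not the pipeline's actual code: it assumes tabular (-m 8) BLAST output with the subject identifier in column 2 and the e-value in column 11, and plain FASTA protein files whose identifier is the first word of the header line. All function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def pooled_subject_ids(blast_outputs: Iterable[Iterable[str]], cutoff: float = 1e-20) -> Set[str]:
    """Collect the unique subject IDs of all hits below the e-value cutoff."""
    ids: Set[str] = set()
    for lines in blast_outputs:
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 12 and float(fields[10]) < cutoff:
                ids.add(fields[1])
    return ids


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {identifier: sequence}."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def to_fasta(ids: Iterable[str], proteins: Dict[str, str]) -> str:
    """Format the pooled proteins as FASTA text for the second-round query."""
    return "".join(f">{i}\n{proteins[i]}\n" for i in sorted(ids) if i in proteins)
```

The returned FASTA text would be written to a file and passed to blastall as the query against the combined protein database.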

Step 4:

All candidate splicing-related proteins were BLASTp-searched against themselves to obtain pairwise sequence similarities for input into OrthoMCL. The all-against-all BLASTp output (which is also the OrthoMCL input) and the OrthoMCL output are provided via the following links:

  • All-against-all BLASTp result: allblastall
  • OrthoMCL result (the clusters of interest are saved in separate files, as follows):
    • At395_orthoMCL (CSV format) contains the clusters with at least one ASRG395 gene
    • Novel_orthoMCL.out contains the clusters with newly identified splicing-related proteins
    • Command used to run OrthoMCL:
    • % orthomcl.pl --mode 3 --blast_file allblastall_20 --gg_file id2010.gg
    • The full result, all_orthoMCL, can be found under the OrthoMCL directory; it contains all genes from the all2all_blastp.out result. A CSV version is also available: all_orthoMCL.csv
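Separating clusters that contain at least one ASRG395 gene from clusters made up only of newly identified proteins can be sketched as follows. This is an assumption-laden illustration rather than the procedure actually used: it assumes each line of the OrthoMCL output has the form `ORTHOMCL<n>(<k> genes,<t> taxa):` followed by whitespace-separated `gene(taxon)` entries, and the function names and example identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Cluster = List[Tuple[str, str]]  # list of (gene_id, taxon) pairs


def parse_clusters(lines: Iterable[str]) -> Dict[str, Cluster]:
    """Parse OrthoMCL-style cluster lines into {cluster_name: members}."""
    clusters: Dict[str, Cluster] = {}
    for line in lines:
        header, sep, members = line.partition("):")
        if not sep:
            continue  # not a cluster line
        name = header.split("(")[0].strip()
        parsed: Cluster = []
        for token in members.split():
            gene, _, taxon = token.rpartition("(")
            parsed.append((gene, taxon.rstrip(")")))
        clusters[name] = parsed
    return clusters


def split_by_seed(clusters: Dict[str, Cluster], seed_ids: Set[str]):
    """Return (clusters with at least one seed gene, clusters with none)."""
    seeded, novel = {}, {}
    for name, members in clusters.items():
        has_seed = any(gene in seed_ids for gene, _ in members)
        (seeded if has_seed else novel)[name] = members
    return seeded, novel


if __name__ == "__main__":
    # Hypothetical lines for illustration only.
    demo = [
        "ORTHOMCL0(2 genes,2 taxa):\t geneA(At) geneB(Gm)",
        "ORTHOMCL1(2 genes,2 taxa):\t geneC(Gm) geneD(Pt)",
    ]
    seeded, novel = split_by_seed(parse_clusters(demo), {"geneA"})
    print(sorted(seeded), sorted(novel))
```

Here `seed_ids` would hold the ASRG395 protein identifiers, written exactly as they appear in the cluster file.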

Step 5:

For each gene cluster, CIWOG was used to identify the common intron positions and types. For each cluster, two files were built for processing with the CIWOG software. One file contained the MUSCLE alignment of the proteins in the cluster; the other contained, in the format CIWOG requires, the gene names, gene structures, transcription start and stop sites, translation start and stop codons, and genome sequences.

  • Perl scripts were written to process the annotation file, the genome file, and the GFF file to generate the CIWOG information file (for Mt, the genome sequences are already included in the GFF file). To run a Perl script, download these files from the data set section and place them together in one folder.
  • Because Lj gene names differ between the annotation file and PlantGDB, only 9 plants are used in the CIWOG result.
  • CIWOG scripts and output files:
    Perl Scripts ciwog_at.perl ciwog_gm.perl ciwog_lj.perl ciwog_mt.perl ciwog_os.perl ciwog_pp.perl ciwog_pt.perl ciwog_sb.perl ciwog_vv.perl ciwog_zm.perl
    CIWOG Result At.ciwog Gm.ciwog Lj.ciwog Mt.ciwog Os.ciwog Pp.ciwog Pt.ciwog Sb.ciwog Vv.ciwog Zm.ciwog

  • (1) The GFF and XML files were used to generate the CIWOG information file, which was then formatted for each cluster based on the information in PlantGDB.
  • (2) MUSCLE was used to generate the alignment file for each cluster.
  • (3) For each cluster, the alignment file and the CIWOG information file were saved in a single directory named after the cluster number. Both files take the directory's name and differ only in suffix. For example, for cluster 1 the directory is named 1 and contains two files: 1.aln (alignment file) and 1.ciwog (CIWOG information file), which can also be downloaded here.
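The per-cluster directory layout described in (3) can be set up and checked with a short script. The sketch below only illustrates the naming convention (directory N holding N.aln and N.ciwog). The function names are hypothetical, and the file contents in the demo are placeholders, not real MUSCLE or CIWOG data.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def cluster_paths(root: str, cluster_id: int) -> Tuple[Path, Path]:
    """Return the expected alignment and CIWOG information file paths."""
    directory = Path(root) / str(cluster_id)
    return directory / f"{cluster_id}.aln", directory / f"{cluster_id}.ciwog"


def write_cluster(root: str, cluster_id: int, alignment_text: str, ciwog_text: str) -> Path:
    """Create the cluster directory and write both files into it."""
    aln_path, info_path = cluster_paths(root, cluster_id)
    aln_path.parent.mkdir(parents=True, exist_ok=True)
    aln_path.write_text(alignment_text)
    info_path.write_text(ciwog_text)
    return aln_path.parent


def incomplete_clusters(root: str) -> List[str]:
    """List cluster directories that lack either of the two expected files."""
    missing = []
    for directory in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        aln = directory / f"{directory.name}.aln"
        info = directory / f"{directory.name}.ciwog"
        if not (aln.is_file() and info.is_file()):
            missing.append(directory.name)
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder contents for illustration only.
        write_cluster(tmp, 1, "placeholder alignment\n", "placeholder CIWOG information\n")
        print(sorted(p.name for p in (Path(tmp) / "1").iterdir()))
        print(incomplete_clusters(tmp))
```

Running a check like `incomplete_clusters` before starting CIWOG is one way to catch clusters whose alignment or information file was not generated.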
