Qunfeng Dong, Liang Ye, Xiaoqiu Huang, and Volker Brendel (2006)
Maize full-length gene identification and annotation
Plant Genome, submitted, 2006.
Hightlights
Supplementary materials
S1. Sequence data used in the study for gene finding (right-click the link to access the actual file; most are in .bz2 compressed format)
| Type | Count | Source |
| Maize BACs | 1,454 | Maize genomic sequences were downloaded from GenBank release 153. Total 1,459 sequences, whose length were no less than 10 kb, were initially considered as putative BAC sequences. Then gi|11990232, gi|12479, gi|40794996, gi|55851977, and gi|85724367 were excluded because they are derived from organelle genome. |
| Maize GSS contigs | 294,245 | Contigs assembled from 2,681,812 maize B73 GSS sequences by PlantGDB. See S2 below for more details. |
| Maize EST contigs | 140,616 | Unigene assembled from 728,957 maize EST sequences available at GenBank release 153 by PlantGDB. |
| Arabidopsis proteins | 30,609 | Arabidopsis protein sequences from TAIR v.6 (Nov 11, 2006). |
| rice proteins | 62,827 | Rice protein sequences from TIGR v.4 (Dec 30, 2005) |
S2. Arabidopsis and rice proteins, gene families and GO terms
#At and Os protein info
$ ref_pep.pl "Arabidopsis thaliana" ATpep > At.ref_pep.txt
$ ref_pep.pl "Oryza sativa" OSpep > Os.ref_pep.txt
load the above .txt files into MySQL table Ref_pep.
#Os gene family (described by TIGR rice.v4.paralogous.family.list)
$ OSpep_family.pl OSpep rice.v4.paralogous.family.list > OSpep_family.txt
Then load into the MySQL table OSpep_family.
#Os GO term (described by TIGR)
$ OSpep_GO.pl OSpep.GO.txt > OSpep_GO.txt
Then load into MySQL table OSpep_GO.
S3. Maize GSS sequences used as inputs for the PCAP program to assemble into the above GSS contigs (right-click the link to download)
A total of 2,681,812 maize (cultivar B73) GSS entries available in November 2004 were downloaded from NCBI, including:
- 2,296,370 with quality-scores downloaded from NCBI trace archive, including:
- 385,442 other B73 GSS without quality-scores at GenBank. For these reads, a Phred quality score of 25 was assigned to each base.
|
|
Click here to download the complete PCAP assembly inputs/parameters/outputs - total size 4.7G!
|
S4. Maize GSS contig length distribution. Total 294,425 contigs were obtained from the assembly (1,661,712 member GSSs formed the contigs). The total size of contigs is 503,497,339 bp. The table below shows their length distribution.
| Size | No. of contigs |
| <=1 kb | 66,408 |
| 1 kb > size <= 2 kb | 160,428 |
| 2 kb > size <= 3 kb | 38,460 |
| 3 kb > size <= 4 kb | 14,588 |
| 4 kb > size <= 5 kb | 6,856 |
| 5 kb > size <= 6 kb | 3,454 |
| 6 kb > size <= 7 kb | 1,864 |
| 7 kb > size <= 8 kb | 1,018 |
| 8 kb > size <= 9 kb | 576 |
| 9 kb > size <= 10 kb | 335 |
| > 10 kb | 438 |
We set out to estimate our assembly quality by matching the GSS contigs to the available BAC sequences. The matching was performed with the Vmatch program and the matching outputs were parsed with a Perl script - GSScontig_BAC.pl. Contigs were considered to match confidently to a BAC if at least 99% of the length of the contig is part of the match and the alignment contains at least 99% identities.
- With such stringent criteria, 9,354 GSS contigs (out of total 294,245) were able to match to 1,404 BAC (out of total 1,459), indicating that our GSS contigs sequences are in agreement with the high-quality assembled BAC sequences.
mysql> select distinct B.gi, s.id, GG.SCAF_lpos, GG.SCAF_rpos from BACgi B left join ZMGDB153.gseg G
on B.gi=G.gi left join ZMGDB153.gseg_gss GG on GG.SCAF_uid=G.uid left join ZMGDB153.gss s
on s.uid = GG.FRAG_uid where s.id is not null into outfile '/tmp/ZmBAC_match_ZmGSScontig.txt';
$ awk '{print $1}' ZmBAC_match_ZmGSScontig.txt | sort -u | wc -l
1404
$ awk '{print $2}' ZmBAC_match_ZmGSScontig.txt | sort -u | wc -l
9354
-
Of all the above 1,404 BAC, about 9.9% (18,909,602 bp) of their total regions (218,734,118 bp) were matched by the GSS contigs.
$ matchMask.pl Zea_mays.BAC.fasta.v.153.organellefree.fsa ZmBAC_match_ZmGSScontig.txt > Zea_mays.BAC.fasta.v.153.organellefree.matchGSScontig.fsa
$ countMask.pl Zea_mays.BAC.fasta.v.153.organellefree.matchGSScontig.fsa > countMask.txt
Total: 218734118 bp => total length of these matched 1404 BACs
Masked: 18909602 bp => the number nucleotides being matched by GSS contigs on those BACs!
Since the matched 9,354 contigs are contained in the BAC sequences, they were excluded from the subsequent analyses so that we do not count the same genomic locus more than once.
S5. GenomeThreader outputs of the splice-alignments of maize genomic DNAs with maize PUTs, Arabidopsis and rice proteins.
(The GenomeThreader XML specification is available at
http://www.genomethreader.org/).
The following steps are used to extract alignment info from the above xml files
$ All_gthxml_parser.pl ZmBAC /.../Output/ZmBAC
--the above command generates the following outputs:
PGL_ZmBAC.txt
PGL_ZmBAC.fsa
PPS_ZmBAC.txt
PPS_ZmBAC.fsa
ALN_ZmBAC.txt
COV_ZmBAC.txt
$ All_gthxml_parser.pl ZmGSS /.../Output/ZmGSS
--the above command generates the following outputs:
PGL_ZmGSS.txt
PGL_ZmGSS.fsa
PPS_ZmGSS.txt
PPS_ZmGSS.fsa
ALN_ZmGSS.txt
COV_ZmGSS.txt
The above .txt files were loaded into corresponding SQL tables (PGL, PPS, ALN, and COV).
*Since some GSS contigs were CONTAINED in the BACs, we delete PGLs identified from those contigs:
(Let's get a non-redundant set of PGL from ZmBAC and ZmGSScontigs)
$ deleteGSSSQL.pl ZmBAC_match_ZmGSScontig.txt > deleteGSSSQL.sql
$ mysql -ppassword -uusrname MaizeGeneSpace < deleteMatchGSS.sql
S6. Total number non-redundant PGL identified from BAC and non-overlapping GSS contigs
mysql> select count(distinct gDNA_id, PGL_id) from PGL;
+---------------------------------+
| count(distinct gDNA_id, PGL_id) |
+---------------------------------+
| 137438 | => total number non-redundant PGL identified from BAC and non-overlapping GSS contigs!
+---------------------------------+
1 row in set (15.93 sec)
mysql> select count(distinct gDNA_id, PGL_id) from allPGL; #allPGL includes loci identified from ALL GSS contigs.
+---------------------------------+
| count(distinct gDNA_id, PGL_id) |
+---------------------------------+
| 141222 | 141222 - 137438 = 3784 genes covered by those matched GSScontigs!
+---------------------------------+
1 row in set (23.83 sec)
Note: They are not ALL full-length or near full-length by our standards!
S7. Distribution of the matching PUT, AtPEP, and OsPEP
mysql> select * from PGL into outfile '/tmp/NonredundantPGL.txt';
*For all the BAC loc, how many of them match only to ...
$ PGLclass.pl ZmBAC NonredundantPGL.txt
$ wc -l ZmBAC.*PGL.*txt
91 ZmBAC.PGL.At.txt => only match to At
347 ZmBAC.PGL.Os_At.txt => only match to Os AND At at the same time
7117 ZmBAC.PGL.Os.txt => only match to Os
55 ZmBAC.PGL.PUT_At.txt => only match to PUT AND At
1371 ZmBAC.PGL.PUT_Os_At.txt => only match to PUT AND OS AND At
6210 ZmBAC.PGL.PUT_Os.txt => only match to PUT AND Os
24032 ZmBAC.PGL.PUT.txt => only match to PUT
39223 total
*For all the GSS loc (do not include the ones overlapping with BAC), how many of them match only to ...
$ PGLclass.pl ZmGSS NonredundantPGL.txt
$ wc -l ZmGSS.*PGL.*txt
434 ZmGSS.PGL.At.txt
1294 ZmGSS.PGL.Os_At.txt
23029 ZmGSS.PGL.Os.txt
239 ZmGSS.PGL.PUT_At.txt
3650 ZmGSS.PGL.PUT_Os_At.txt
5040 ZmGSS.PGL.PUT_Os.txt
64529 ZmGSS.PGL.PUT.txt
98215 total
S8. Full-length or near full-length maize genes based on matching to AtPEP or OsPEP
We define full-length or near full-length maize coding genes as the maize genic loci whose the predicted maize ORF contains one-segment whose length is at least 90% length of the matched protein!!!
mysql> select * from ALN into outfile '/tmp/ALN.txt';
$ StopCodonFreeAlign.pl ALN.txt > ALN.stopcodonfree.txt
And then loaded into the MySQL table fullALN.
mysql> select count(distinct gDNA_id, PGL_id) from fullALN;
+---------------------------------+
| count(distinct gDNA_id, PGL_id) |
+---------------------------------+
| 29759 | => this number of PGL that really has full-length or near-full length protein matching support!
+---------------------------------+
1 row in set (5.95 sec)
*Note: many of the above PGL actually match to annotated transposon proteins!!!
S9. Maize gene description
What are those full-lengh or near full-length genes? How many of them are matched by non-transposon proteins?
mysql> select distinct gDNA_id, PGL_id, AGS_id, ref_id from fullALN into outfile '/tmp/fullALN.PGLid.txt';
mysql> select distinct * from PGL into outfile '/tmp/PGL.location.txt';
$ fullALN_loc.pl fullALN.PGLid.txt PGL.location.txt > fullALN_loc.TXT
*fullALN_loc.TXT tab-delimited file that displays coordinates of the matched protein to the genic loci.*
Col1: gDNA_type;
Col2: gDNA_id;
Col3: gDNA_strand;
Col4: PGL_id;
Col5: PGL_location;
Col6: AGS_id;
Col7: AGS_structure;
Col8: ref_id;
Col9: alignment_score;
Col10: coverage;
Col11: coverage_type;
Col12 PGS_structure.
$ fullALNgene.pl Zea_mays.BAC.fasta.v.153.organellefree.fsa ZmB73GSStuc fullALN_loc.TXT
---Outputs: fullALN.PGL.fsa - FASTA-format DNA sequences for each maize gene (full-length or near full-length coding region).
fullALN.PGLregion.fsa - FASTA-format sequence file, corresponding to coding region plus the potential promoter and 3' UTR regions.
mysql> select Species, ref_id, ref_des from Ref_pep into outfile '/tmp/Ref_pep.des.txt';
$ fullALNref.pl Ref_pep.des.txt fullALN.PGLid.txt > fullALNref.TXT
Load fullALNref.TXT' into MySQL table fullALNref
*fullALNref.TXT is tab-delimited 4 column plain text file - maize gene description*
Col1: locusID;
Col2: matched protein;
Col3: species of the matched protein;
Col4: description of the matched protein*
mysql> select gDNA_id, PGL_id, AGS_id, ref_id, gDNA_match from fullALN into outfile '/tmp/fullALN.gDNA_match.txt';
mysql> select gDNA_id, PGL_id, AGS_id, ref_id, gDNA_Prot from fullALN into outfile '/tmp/fullALN.gDNA_Prot.txt';
$ maizePEP.pl fullALN.gDNA_Prot.txt > maizePEP.fsa
$ grep -v 'transposon' fullALNref.TXT > nontransposon_fullALNref.TXT
$ wc -l fullALNref.TXT
311585 fullALNref.TXT
$ wc -l nontransposon_fullALNref.TXT
14029 nontransposon_fullALNref.TXT => tentative non-transposon full-length maize genes
$ awk '{print $1}' nontransposon_fullALNref.TXT | sort -u | wc -l
5481 => total 5,481 non-transposon full-length or near full-length maize genes !!!
S10. Maize gene families
#How many maize full-length or near full-length non-transposon genic loci were identified through matching to OSpep?
mysql> select count(distinct Seq_PGL) from fullALNref
where ref_id like 'LOC_Os%' AND ref_des NOT like '%transposon%';
+-------------------------+
| count(distinct Seq_PGL) |
+-------------------------+
| 4558 |
+-------------------------+
#How many Os protein matches to the above loci?
mysql> select count(distinct ref_id) from fullALNref where ref_id like 'LOC_Os%' AND ref_des NOT like '%transposon%';
+------------------------+
| count(distinct ref_id) |
+------------------------+
| 4422 |
+------------------------+
1 row in set (1.00 sec)
#How many Os genes encode the above proteins?
mysql> select count(distinct substring(ref_id, 1, 14)) from fullALNref where ref_id like 'LOC_Os%'
AND ref_des NOT like '%transposon%';
+------------------------------------------+
| count(distinct substring(ref_id, 1, 14)) |
+------------------------------------------+
| 4096 |
+------------------------------------------+
1 row in set (3.95 sec)
#how many gene-families for those matched near-full length non-transposon OSpep genes?
mysql> select count(distinct family_id) from OSpep_family F left join fullALNref A on F.OSpep_locus=A.ref_id
where A.ref_id like "LOC_Os%" AND ref_des NOT like '%transposon%' ;
+---------------------------+
| count(distinct family_id) |
+---------------------------+
| 812 |
+---------------------------+
1 row in set (0.76 sec)
S11. Non-transposon full-length or near full-length Maize gene GO terms
#unformately, the TIGR GOslim term is still not high-level enough, we decided to parse from GO web site the upper level GO description
e.g., http://www.godatabase.org/cgi-bin/amigo/go.cgi?view=details&search_constraint=terms&depth=0&query=GO:0009416
mysql> select distinct GOSlim_id from OSpep_GO into outfile '/tmp/OSpep_GO.GOSlim_id.txt';
$ GOtermParserRootParent.pl OSpep_GO.GOSlim_id.txt > OSpep_GO.GOSlim_id.RootParent.txt
Then load OSpep_GO.GOSlim_id.RootParent.txt into MySQL table GO_Ancestor.
mysql> select A.Function, count(distinct F.Seq_PGL) as C from fullALNref F left join OSpep_GO G on substring(F.ref_id,1,14) = G.OSpep_locus
left join GO_Ancestor A on G.GOSlim_id=A.GO_id where G.GO_type='molecular_function'
AND F.ref_id like "LOC_Os%" AND F.ref_des NOT like '%transposon%' group by A.Function order by C desc into outfile '/tmp/full_length_gene_GOslimAncestor.funct.ZmLociCount.txt';
Query OK, 10 rows affected (6.01 sec)
[qfdong@gremlin3dev Analysis]$ more full_length_gene_GOslimAncestor.funct.ZmLociCount.txt
catalytic activity 1190
binding 1175
transporter activity 170
structural molecule activity 75
transcription regulator activity 72
antioxidant activity 50
signal transducer activity 43
\N 38
enzyme regulator activity 11
nutrient reservoir activity 3
The results were loaded into EXCEL to draw a pie
S12. Zm and Os synteny
$ OsZmSynteny.pl nontransposon_fullALNref.TXT OSGDB151.chr_tigr_tu.txt > OsZmSynteny.txt
Load OsZmSynteny.txt into MySQL table OsZmSynteny.
S13. Survey of molecular evolution dynamics for each gene families.
For each gene family, we surveyed their Ka, Ks, and Ka/Ks. The following section described our data set, methods, and results.
Step 1: The matching regions were extracted from the GenomeThreader ouputs.
$ All_gthxml_MatchExonProteinparser.pl ZmBAC > ZmBAC.MatchCDS.txt
$ All_gthxml_MatchExonProteinparser.pl ZmGSS > ZmGSS.MatchCDS.txt
Step 2: The maize gene and their top-matching rice homolog were paired
mysql> select distinct gDNA_id, PGL_id, AGS_id, ref_id from fullALN into outfile '/tmp/fullALN.PGLid.txt';
mysql> select distinct * from PGL into outfile '/tmp/PGL.location.txt';
$ fullALN_loc.pl fullALN.PGLid.txt PGL.location.txt > fullALN_loc.TXT
$ TopMatchRicePepLoci.pl fullALN_loc.TXT > OsTopMatchingLoci.txt
mysql> select * from chr_tigr_tu into outfile '/tmp/OSGDB151.chr_tigr_tu.txt';
$ TopMatchRicePepLoci.pl fullALN_loc.TXT > OsTopMatchingLoci.txt
$ codemlCDSpair.pl ZmBAC_CDSpair OSgenome OSGDB151.chr_tigr_tu.txt OsTopMatchingLoci.txt ZmBAC.MatchCDS.txt
---output is ZmBAC_CDSpair directory
$ codemlCDSpair.pl ZmGSS_CDSpair OSgenome OSGDB151.chr_tigr_tu.txt OsTopMatchingLoci.txt ZmGSS.MatchCDS.txt
---output is ZmGSS_CDSpair directory
Step 3: Compute Ka, Ks values from each maize and its top-matching rice homolog pair
1). Use the KaKs.cnt as the control file
for the codeml program in the PAML package.
2). codemlKaKs.pl ZmBAC_CDSpair > ZmBAC.codemlKaKs.txt
3). codemlKaKs.pl ZmGSS_CDSpair > ZmGSS.codemlKaKs.txt
the above ZmBAC.codemlKaKs.txt and ZmGSS.codemlKaKs.txt results are loaded in MySQL codemlKaKs table.
Step 4: coding sequence divergence between rice and maize
mysql> select C.* from fullALNref R left join codemlKaKs C
on (R.ref_id=C.ref_id AND R.Seq_PGL=CONCAT_WS(':',C.gDNA_id,C.PGL_id))
where C.dN_dS is not null AND C.dN is not null AND C.dS is not null
AND C.dN<=2 AND C.dS<=2 AND R.ref_des NOT like '%transposon%'
into outfile '/tmp/Zm.codemlKa2Ks2.txt';
[qfdong@gizmo3 Analysis]$ wc -l Zm.codemlKa2Ks2.txt
2293 Zm.codemlKa2Ks2.txt
> K=read.delim(file="C:\\Documents and Settings\\qfdong\\Desktop\\MaizeGeneSpace\\Zm.codemlKa2Ks2.txt", header=F)
> par(mfrow=c(1,3))
> hist(K[,8], xlab="Ka", main="")
> hist(K[,9], xlab="Ks", main="")
> hist(K[,7], xlab="Ka/Ks", main="")
Save as "Ka_Ks_KaKs_distribution.png"
Step 4: for each family (from non-transposon gene), what's their Ka, Ks, and Ka/Ks?
mysql> select F.family_id, F.OSpep_locus, C.gDNA_id, C.PGL_id, C.dN_dS, C.dN, C.dS
from fullALNref R left join codemlKaKs C on (R.ref_id=C.ref_id AND R.Seq_PGL=CONCAT_WS(':',C.gDNA_id,C.PGL_id))
left join OSpep_family F on C.ref_id = F.OSpep_locus
where F.family_id is NOT NULL AND C.dN_dS is not null AND C.dN is not null
AND C.dS is not null AND R.ref_des NOT like '%transposon%' into outfile '/tmp/KaKs.Zmfamily.txt';
*** get averageKa, averageKs, and averageKa/Ks
mysql> select F.family_id, avg(C.dS), avg(C.dN), avg(C.dN_dS) from fullALNref R
left join codemlKaKs C on (R.ref_id=C.ref_id AND R.Seq_PGL=CONCAT_WS(':',C.gDNA_id,C.PGL_id))
left join OSpep_family F on C.ref_id = F.OSpep_locus
where F.family_id is NOT NULL AND C.dN_dS is not null AND C.dN is not null
AND C.dS is not null AND R.ref_des NOT like '%transposon%'
group by F.family_id into outfile '/tmp/KaKs.Zmfamily.averageKaKs.txt';
Query OK, 692 rows affected (4.14 sec)
#discard those whose Ks>2 and Ka>2
mysql> select F.family_id, avg(C.dS), avg(C.dN), avg(C.dN_dS) from fullALNref R
left join codemlKaKs C on (R.ref_id=C.ref_id AND R.Seq_PGL=CONCAT_WS(':',C.gDNA_id,C.PGL_id))
left join OSpep_family F on C.ref_id = F.OSpep_locus
where F.family_id is NOT NULL AND C.dN_dS is not null AND C.dN is not null
AND C.dS is not null AND R.ref_des NOT like '%transposon%' AND C.dN<=2 AND C.dS <=2
group by F.family_id into outfile '/tmp/KaKs.Zmfamily.averageKa2Ks2.txt';
> KaKs<-matrix(scan("C:\\Documents and Settings\\qfdong\\Desktop\\MaizeGeneSpace\\KaKs.Zmfamily.averageKa2Ks2.txt"),byrow=T, ncol=4)
Read 1972 items
> mean(KaKs[,4])
[1] 0.1210681
> median(KaKs[,4])
[1] 0.1012
> var(KaKs[,4])
[1] 0.008286756
> sd(KaKs[,4])
[1] 0.09103162
Step 5: for each GO functional class (from non-transposon gene), what's their Ka, Ks, and Ka/Ks?
mysql> select A.Function, C.gDNA_id, C.PGL_id, C.ref_id, C.dN_dS, C.dN, C.dS from fullALNref R
left join codemlKaKs C on (R.ref_id=C.ref_id AND R.Seq_PGL=CONCAT_WS(':',C.gDNA_id,C.PGL_id))
left join OSpep_GO G on substring(C.ref_id,1,14) = G.OSpep_locus
left join GO_Ancestor A on G.GOSlim_id=A.GO_id
where G.GO_type='molecular_function' AND C.dN is not null and C.dS is not null
AND R.ref_des NOT like '%transposon%'
into outfile '/tmp/Zmfull_length_GO_KaKs.txt';
Query OK, 4830 rows affected (5.86 sec)
#if we discard the ones with Ka>2 or Ks>2
$ GO_KaKs.pl 2 Zmfull_length_GO_KaKs.txt > Zmfull_lengthGO_Ka2Ks2.txt
####remove outlier (Ka>0.35; Ka/Ks>0.35)
$ GO_KaKsNoOutlier.pl 2 Zmfull_length_GO_KaKs.txt > Zmfull_lengthGO_Ka2Ks2.KaKs0.35.Ka0.35.txt
> GOK = read.delim(file="C:\\Documents and Settings\\qfdong\\Desktop\\MaizeGeneSpace\\Zmfull_lengthGO_Ka2Ks2.KaKs0.35.Ka0.35.txt",header=T)
> attach(GOK)
> par(mar = c(7, 4, 4, 2) + 0.5)
> boxplot(Ka[func_id==1], Ka[func_id==2], Ka[func_id==3], Ka[func_id==4],Ka[func_id==5],Ka[func_id==6], Ka[func_id==7], Ka[func_id==8], outline=FALSE, notch=TRUE,xaxt = "n",xlab = "", ylab="Ka",main="")
> axis(1, 1:8, labels = FALSE)
> labels <- c("antioxidant", "binding", "catalytic", "enzyme_regulator","signal_transducer","structural_molecule","transcription_regulator", "transporter")
> text(1:8, par("usr")[3]-0.02, srt = 45, adj = 1,labels = labels, xpd = TRUE)
save as GO_Ka2.png
> par(mar = c(7, 4, 4, 2) + 0.5)
> boxplot(Ks[func_id==1], Ks[func_id==2], Ks[func_id==3], Ks[func_id==4],Ks[func_id==5],Ks[func_id==6],
Ks[func_id==7], Ks[func_id==8], outline=FALSE, notch=TRUE,xaxt = "n",xlab = "", ylab="Ks",main="")
> axis(1, 1:8, labels = FALSE)
> labels <- c("antioxidant", "binding", "catalytic", "enzyme_regulator","signal_transducer","structural_molecule","transcription_regulator", "transporter")
> text(1:8, par("usr")[3]-0.08, srt = 45, adj = 1,labels = labels, xpd = TRUE)
save as GO_Ks2.png
> par(mar = c(7, 4, 4, 2) + 0.5)
> boxplot(Ka_Ks[func_id==1], Ka_Ks[func_id==2], Ka_Ks[func_id==3], Ka_Ks[func_id==4],Ka_Ks[func_id==5],
Ka_Ks[func_id==6], Ka_Ks[func_id==7], Ka_Ks[func_id==8], outline=FALSE, notch=TRUE,xaxt = "n",xlab = "", ylab="Ka_Ks",main="")
> axis(1, 1:8, labels = FALSE)
> labels <- c("antioxidant", "binding", "catalytic", "enzyme_regulator","signal_transducer","structural_molecule","transcription_regulator", "transporter")
> text(1:8, par("usr")[3]-0.02, srt = 45, adj = 1,labels = labels, xpd = TRUE)
save as GO_Ka2Ks2.png
S14. Codon usage.
$ codemlCDSpairWithStopCodon.pl ZmBAC_CDSpairWithStopCodon OSgenome OSGDB151.chr_tigr_tu.txt OsTopMatchingLoci.txt ZmBAC.MatchCDS.txt
Output is ZmBAC_CDSpairWithStopCodon
$ codemlCDSpairWithStopCodon.pl ZmGSS_CDSpairWithStopCodon OSgenome OSGDB151.chr_tigr_tu.txt OsTopMatchingLoci.txt ZmGSS.MatchCDS.txt
Output is ZmGSS_CDSpairWithStopCodon
$ NonTransposon_CodonWinput.pl nontransposon_fullALNref.TXT ZmBAC_CDSpairWithStopCodon
$ NonTransposon_CodonWinput.pl nontransposon_fullALNref.TXT ZmGSS_CDSpairWithStopCodon
The above two jobs produced the following outputs accordingly
-------------------------------------------------------------
ZmBAC_CDSpairWithStopCodon.maize.codon.fsa
ZmBAC_CDSpairWithStopCodon.rice.codon.fsa
ZmGSS_CDSpairWithStopCodon.maize.codon.fsa
ZmGSS_CDSpairWithStopCodon.rice.codon.fsa
$ cat ZmBAC_CDSpairWithStopCodon.maize.codon.fsa ZmGSS_CDSpairWithStopCodon.maize.codon.fsa > Zm_CDSpair.maize.codon.fsa
$ cat ZmBAC_CDSpairWithStopCodon.rice.codon.fsa ZmGSS_CDSpairWithStopCodon.rice.codon.fsa > Zm_CDSpair.rice.codon.fsa
$ codonw Zm_CDSpair.maize.codon.fsa -nomenu -nowarn -cutot
[qfdong@gremlin3dev Analysis]$ more Zm_CDSpair.maize.codon.blk
Phe UUU12515 0.45 Ser UCU12571 0.67 Tyr UAU 9823 0.52 Cys UGU 6941 0.47
UUC43333 1.55 UCC32627 1.75 UAC28309 1.48 UGC22789 1.53
Leu UUA 5526 0.23 UCA 9992 0.54 TER UAA 11 0.34 TER UGA 52 1.59
UUG14067 0.58 UCG22451 1.20 UAG 35 1.07 Trp UGG19557 1.00
CUU15819 0.66 Pro CCU13059 0.61 His CAU 9965 0.57 Arg CGU 7340 0.43
CUC55149 2.29 CCC24229 1.14 CAC25082 1.43 CGC36463 2.12
CUA 7828 0.32 CCA12765 0.60 Gln CAA11461 0.53 CGA 5839 0.34
CUG46239 1.92 CCG35130 1.65 CAG31410 1.47 CGG24129 1.40
Ile AUU12103 0.70 Thr ACU10565 0.61 Asn AAU12073 0.59 Ser AGU 7528 0.40
AUC32159 1.86 ACC25986 1.49 AAC29148 1.41 AGC26740 1.43
AUA 7674 0.44 ACA 9867 0.57 Lys AAA11912 0.40 Arg AGA 8710 0.51
Met AUG33851 1.00 ACG23226 1.33 AAG46969 1.60 AGG20875 1.21
Val GUU13276 0.50 Ala GCU20476 0.52 Asp GAU22068 0.57 Gly GGU14779 0.51
GUC38790 1.47 GCC64960 1.64 GAC55605 1.43 GGC58864 2.01
GUA 6724 0.26 GCA17036 0.43 Glu GAA15368 0.42 GGA14930 0.51
GUG46592 1.77 GCG56373 1.42 GAG58126 1.58 GGG28422 0.97
1454281 codons in Average of genes (used Universal Genetic code)
***note***
The numbers in the tables for each codon is the number of that codon in the gene and the RSCU value.
The RSCU is the observed frequency of a codon divided by the frequency expected if all synonymous codons for that amino acid were used equally,
so RSCU values close to 1.0 indicate a lack of bias.
This may be examplified by the codons for Val in the table above:
codon GUU is used 4 times, GUC is used 1 time and GUA is used 5 times.
Val occur totally 5+1+4=10 times in the protein, and there are 4 synonymous codons for Val.
The RSCU values will be: for GUU 4/10*4=1.60, for GUC 1/10*4=0.40, for GUA 5/10*4=2.00 and for GUG 0/10*4=0.
This indicates that there is a codon usage bias towards using the codons GUU and GUA for the amino acid Val in this gene.
**********