Genome Analysis

Gene Discovery and Tissue-Specific Transcriptome Analysis in Chickpea with Massively Parallel Pyrosequencing and Web Resource Development1[W][OA]

Rohini Garg, Ravi K. Patel, Shalu Jhanwar, Pushp Priya, Annapurna Bhattacharjee, Gitanjali Yadav, Sabhyata Bhatia, Debasis Chattopadhyay, Akhilesh K. Tyagi, and Mukesh Jain*

National Institute of Plant Genome Research, Aruna Asaf Ali Marg, New Delhi 110067, India

Chickpea (Cicer arietinum) is an important food legume crop but lags in the availability of genomic resources. In this study, we have generated about 2 million high-quality sequences with an average length of 372 bp using pyrosequencing technology. The optimization of de novo assembly clearly indicated that hybrid assembly of long-read and short-read primary assemblies gave better results. The hybrid assembly generated a set of 34,760 transcripts with an average length of 1,020 bp representing about 4.8% (35.5 Mb) of the total chickpea genome. We identified more than 4,000 simple sequence repeats, which can be developed as functional molecular markers in chickpea. Putative function and Gene Ontology terms were assigned to at least 73.2% and 71.0% of chickpea transcripts, respectively. We have also identified several chickpea transcripts that showed tissue-specific expression and validated the results using real-time polymerase chain reaction analysis. Based on sequence comparison with other species within the plant kingdom, we identified two sets of lineage-specific genes, including those conserved in the Fabaceae family (legume specific) and those lacking significant similarity with any non-chickpea species (chickpea specific). Finally, we have developed a Web resource, Chickpea Transcriptome Database, which provides public access to the data and results reported in this study. The strategy for optimization of de novo assembly presented here may further facilitate transcriptome sequencing and characterization in other organisms. Most importantly, the data and results reported in this study will help to accelerate research in various areas of genomics and the implementation of breeding programs in chickpea.

Chickpea (Cicer arietinum) is an annual and self-pollinated plant that ranks third in legume production worldwide. The genome of chickpea is diploid (2n = 2x = 16) and of moderate size (approximately 740 Mb). Chickpea provides a protein-rich supplement to cereal-based diets and is a very important food crop for nutrition in the developing world. In addition, like other legumes, chickpea has the ability to fix atmospheric nitrogen biologically, which is very important for agricultural sustainability (Graham and Vance, 2003). Despite its economic importance, chickpea production has been very low due to significant yield losses caused by several biotic and abiotic factors. A few breeding programs have been implemented successfully to improve the yield of chickpea under various stress conditions (Millan et al., 2006). However, biotechnological approaches have not been applied due to the recalcitrant nature of chickpea and the very limited availability of genomic resources.

The amount of genomic information available for chickpea has been increasing recently. Efforts have been made to generate genetic linkage maps for chickpea based on microsatellites (Hüttel et al., 1999; Radhika et al., 2007; Upadhyaya et al., 2008; Millan et al., 2010; Gaur et al., 2011). Chickpea genome organization and composition have also been analyzed based on the 500-kb sequence from 11 bacterial artificial chromosome (BAC) clones (Rajesh et al., 2008). The construction of BAC libraries has been reported for the identification of clones associated with disease resistance (Rajesh et al., 2004; Lichtenzveig et al., 2005). Recently, a BAC/binary BAC-based physical map of cultivated chickpea (cv Hadas) has also been developed (Zhang et al., 2010). In addition, a few EST projects have provided a few thousand single-pass sequences (Buhariwalla et al., 2005; Gao et al., 2008; Ashraf et al., 2009; Varshney et al., 2009; Jain and Chattopadhyay, 2010). Although several genes/ESTs involved in various stress responses have been identified based on transcriptomic and proteomic studies (Pandey et al., 2006, 2008; Mantri et al., 2007; Molina et al., 2008, 2011; Varshney et al., 2009), gene discovery has been very limited in chickpea. So far, only a few candidate genes have been cloned and functionally validated (Kaur et al., 2008; Shukla et al., 2009; Tripathi et al., 2009; Peng et al., 2010).

EST generation provides a very useful and quick means of gene discovery, which can be used for functional genomic studies and understanding the molecular basis of agriculturally important traits in crop plants. With the advent of next generation sequencing (NGS) technologies, the deep sequencing of transcriptomes has become very rapid and economical (Morozova et al., 2009). Recently, we reported the sequencing and de novo assembly of the chickpea transcriptome using short-read data, which provided first insights into the gene content (Garg et al., 2011). However, for a better assembly and coverage of the transcriptome and to complement our ongoing whole genome sequencing and annotation efforts for chickpea, we have also generated a large collection of longer reads using pyrosequencing technology. A total of 1,931,224 high-quality reads from various tissue samples were obtained and assembled into 34,760 transcripts after optimization of de novo assembly. Furthermore, we have performed the functional annotation of the chickpea transcripts and analyzed the tissue-specific expression. This large collection of new sequences represents a highly diverse set of chickpea genes and constitutes an important resource for genomic studies in chickpea and other related legume species. We have also identified several simple sequence repeats (SSRs) for the development of functional markers. The identification of lineage-specific chickpea transcripts has also been reported. In addition, we have developed a user-friendly Web resource for public access to the data and results generated in this study.

1 This work was supported by the Department of Biotechnology, Government of India, New Delhi, under the Next Generation Challenge Programme on Chickpea Genomics.

* Corresponding author; e-mail [email protected].

The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Mukesh Jain ([email protected]).

[W] The online version of this article contains Web-only data.

[OA] Open Access articles can be viewed online without a subscription.

www.plantphysiol.org/cgi/doi/10.1104/pp.111.178616

Plant Physiology®, August 2011, Vol. 156, pp. 1661–1678, www.plantphysiol.org © 2011 American Society of Plant Biologists

Copyright © 2011 American Society of Plant Biologists. All rights reserved.

RESULTS AND DISCUSSION

Pyrosequencing of the Chickpea Transcriptome

In our previous study, we presented the first snapshot of the chickpea transcriptome based on short-read sequence data (Garg et al., 2011). To present a better picture of the chickpea transcriptome, we performed long-read transcriptome sequencing from five different tissue samples and one mixed tissue sample in four flow cells of the GS FLX Titanium sequencer. After separation of sequence reads for each tissue type, more than 2.5 million reads (0.15–0.76 million reads from individual tissue samples) of Q20 quality were obtained. The reads thus obtained were passed through several quality-control filters. After removing low-quality reads, trimming adapter/primer sequences, trimming sequences containing homopolymers of more than 7 bp, and removing sequences of less than 100 bp, a total of 2,347,832 reads were obtained (Table I). Furthermore, 416,608 reads corresponding to rRNA were removed as per the criteria described in “Materials and Methods.” Finally, a total of 1,931,224 high-quality reads corresponding to mRNA with an average length of 372 bp were obtained. The number of high-quality reads for different tissue samples ranged from 103,524 for the root sample to 551,932 for the shoot sample. Overall, these reads covered a total of 718,245,395 bases. The length distribution of these high-quality reads shows that most of them are more than 300 bp in length (Supplemental Fig. S1A). The average Phred quality score of 90% of the high-quality reads was found to be more than 30 (Supplemental Fig. S1B). The summary of sequencing data generated from different tissue samples and their quality filtering is given in Table I.
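The filtering steps described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name `filter_read` and the constants are hypothetical, adapter/primer trimming and rRNA removal are omitted, and the thresholds simply follow the criteria stated in the text and in the footnotes of Table I (Phred score below 20 in more than 30% of bases, homopolymers longer than 7 bp, minimum length of 100 bp).

```python
import re

MIN_LEN = 100            # reads shorter than this are discarded
MAX_HOMOPOLYMER = 7      # runs longer than this trigger trimming
MIN_PHRED = 20           # bases below this score count as low quality
MAX_LOW_FRACTION = 0.30  # tolerated fraction of low-quality bases

# One base followed by at least 7 copies of itself, i.e. a run of 8 or more.
_HOMOPOLYMER = re.compile(r"(.)\1{%d,}" % MAX_HOMOPOLYMER)


def filter_read(seq, quals):
    """Return the cleaned sequence, or None if the read is rejected.

    seq   -- nucleotide string
    quals -- list of per-base Phred scores, same length as seq
    """
    # Reject reads with Phred < 20 in more than 30% of their bases.
    low = sum(1 for q in quals if q < MIN_PHRED)
    if low > MAX_LOW_FRACTION * len(seq):
        return None
    # Trim from the first base of the first over-long homopolymer.
    match = _HOMOPOLYMER.search(seq)
    if match:
        seq = seq[:match.start()]
    # Discard reads that end up shorter than 100 bp.
    if len(seq) < MIN_LEN:
        return None
    return seq
```

Applying the quality check before trimming mirrors the order in which the steps are listed in the text; whether the original pipeline used the same order is an assumption.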

De Novo Assembly Optimization of 454 Sequencing Data

Previous studies of transcriptome sequencing and analysis based on Roche 454 pyrosequencing have used only a single program for de novo assembly without any optimization, barring a few exceptions (Weber et al., 2007; Papanicolaou et al., 2009; Kumar and Blaxter, 2010). We used all 1,931,224 high-quality sequence reads from different tissue samples to optimize the de novo assembly using eight assembly programs. Among these, MIRA, Newbler (v2.3 and v2.5p1), CAP3, and TGICL are based on the overlap-layout-consensus strategy, whereas CLC, Velvet, and ABySS are based on the de Bruijn graph algorithm. The parameters used for the various assemblers are described in “Materials and Methods.” We took several criteria into consideration to select the best de novo assembly, including assembly statistics, reads utilized for the assembly, and similarity/coverage of the reference sequences.
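For readers unfamiliar with the de Bruijn graph approach mentioned above, the following toy Python sketch shows the basic idea: each read is broken into overlapping k-mers, and every k-mer becomes an edge between its (k-1)-base prefix and suffix. The function name `de_bruijn_edges` is hypothetical, and the sketch ignores reverse complements, sequencing errors, and coverage, so it should not be read as a description of how Velvet, ABySS, or CLC are implemented.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer links its prefix node to its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return graph
```

Contigs correspond to unbranched paths through such a graph, which is one reason the choice of k affects the resulting assembly.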

Statistics of Assembly Output Generated by Various Assemblers

Among the assemblers used, Velvet and ABySS are primarily short-read assemblers and are not considered optimal for de novo assembly of 454 pyrosequencing reads. However, we still utilized these programs to check their performance on our data. The assembly using Velvet and ABySS was performed at different k-mer lengths, and the best assembly was found for k = 99 for Velvet and k = 49 for ABySS, taking into consideration the N50 length (the length of the smallest contig in the set comprising the largest contigs whose combined length represents 50% of the total assembly size) and the average length of the contigs of 100 bp or more in length (Supplemental Table S1). For the other assemblers also, the contigs of 100 bp or more in length were filtered and used for comparison. The singlets/nonassembled sequences were not taken into consideration for the comparison of output among different assemblers. The largest number of contigs was obtained using Velvet followed by MIRA, and the smallest number was obtained using Newbler v2.5p1. However, the largest assembly in terms of number of bases was obtained by MIRA (approximately 52 Mb), and the smallest was obtained by ABySS (approximately 22 Mb). The sizes of the assemblies generated by Newbler v2.3, CLC, and TGICL were comparable (36.5–40.1 Mb). Surprisingly, Newbler v2.5p1 produced an assembly of 28.5 Mb, which is about 38% smaller than Newbler v2.3. This is in sharp contrast to the study by Kumar and Blaxter (2010), which reported that the assembly size obtained by Newbler v2.5p1 was 39% larger than that of Newbler v2.3 on the transcriptome data set of a nematode and outperformed other assemblers. This indicates that the choice of the best assembler depends on the data set and needs to be optimized. Newbler v2.3 generated an assembly with the largest N50 of 1,617 bp and the largest average contig length of 1,200 bp. MIRA generated the largest number of contigs 1 kb or more in length, followed by Newbler v2.3 and TGICL. The average consensus quality of all the contigs generated by MIRA was 55, with 115 strong unresolved repeat positions and no weak unresolved repeat or mismatch positions. Velvet and ABySS did not perform well, as expected, and generated smaller contigs, as indicated by smaller N50 and average lengths. Our results indicate that the Velvet and ABySS programs are not optimal for the de novo assembly of 454 sequencing data. The output of the CLC assembly was moderate, with an N50 length of 1,325 bp and an average length of 961 bp. CAP3 produced fewer contigs and a smaller assembly size (32 Mb) as compared with TGICL. The assembly results of the different assemblers are summarized in Table II.
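The N50 length defined above can be computed with a few lines of Python. The sketch below is only an illustration of the definition; the function name `n50` is hypothetical, and it assumes that contigs shorter than 100 bp have already been filtered out, as was done for the comparison in Table II.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are taken from longest to shortest until their combined
    length reaches half of the total assembly size; the length of the
    last contig added is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input
```

For example, for contig lengths 2, 2, 2, 3, 3, 4, 8, and 8 (total 32), the two 8-bp contigs already account for half of the total, so the N50 is 8.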

Numbers of Reads Used for Assembly by Various Assemblers

Because different assemblers report the number of reads utilized/not utilized for assembly in different ways or do not report it at all, we used a common method to identify the number of reads utilized for the assembly. We mapped all the high-quality reads on the assembled contigs for each assembly using the CLC Genomics Workbench. The mapping results revealed that the largest number of reads was utilized by TGICL (95%) for the assembly, followed by MIRA (93%; Table II). The number of reads used for assembly by Velvet (16%) and ABySS (23%) was much smaller. Among the other assemblers, Newbler v2.5p1 used the smallest number of reads for assembly (73%), but most of them showed unique hits, indicating the least redundancy. The redundancy was highest for Newbler v2.3 followed by MIRA, with about 20% and 14% of the total reads showing multiple hits, respectively. CLC and CAP3 were comparable in terms of the number of reads used and redundancy (Table II).

Coverage of a Reference Proteome

The assembly output was also validated by a BLASTX search of the contigs generated by the various programs against reference sequences. We used the proteome of soybean (Glycine max) predicted from the genome sequence as a reference. The coverage of this proteome should reflect the quality of the assembly. Although a variable number of contigs exhibited significant similarity with soybean proteins, the percentage was comparable (Table II). Therefore, we identified the number of unique soybean proteins to which the contigs showed significant similarity. The largest number of soybean proteins was represented in the Velvet (49.4%) assembly followed by MIRA (46.6%). However, the coverage of soybean proteins was much lower in the Velvet assembly, which may be because of smaller contigs. The coverage of soybean proteins was highest for MIRA followed by TGICL, indicating better assemblies.
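One way to quantify how much of a reference protein is covered by a contig is to take the union of the aligned regions on the protein and divide by the protein length. The Python sketch below illustrates that idea only. The text does not state how coverage was calculated in this study, so the interval-union approach, the function name `covered_fraction`, and the input format (1-based inclusive start/end positions on the protein, for example from BLASTX alignments) are all assumptions.

```python
def covered_fraction(intervals, protein_length):
    """Fraction of a protein covered by the union of aligned intervals.

    intervals      -- iterable of (start, end) pairs, 1-based and inclusive
    protein_length -- length of the reference protein in residues
    """
    covered = 0
    current_end = 0  # last residue already counted
    for start, end in sorted(intervals):
        # Skip the part of this interval that overlaps earlier ones.
        start = max(start, current_end + 1)
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / protein_length
```

Under this definition, a protein of 100 residues with alignments spanning residues 1 to 50 and 40 to 90 has a covered fraction of 0.9 and would pass an 80% coverage threshold such as the one used in Table II.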

Overall, based on the statistics of assembly output, number of reads used for the assembly, and coverage of a reference sequence, the assemblies generated by TGICL, MIRA, and Newbler v2.3 were considered better. It was very difficult to decide the best assembly program among these, as one or the other outperformed the rest on one of several criteria.

Merged Super Assembly

Recently, in a comparison of de novo assemblers for 454 transcriptome data, it has been reported that the super assembly of two primary assemblies generated using different programs gives better results in terms of consistency and size of contigs and alignment to the reference sequences (Kumar and Blaxter, 2010). Therefore, we also attempted super assemblies using TGICL by merging the primary assemblies of two assemblers at a time among TGICL, MIRA, and Newbler v2.3 (hereafter Newbler). We analyzed the output of super

Table I. Summary of 454 sequencing data generated for chickpea transcriptome and quality filtering

| Library/Tissue Type | Total Reads (a) | Low-Quality Reads (b) | Trimmed Reads (c) | Trashed Reads (d) | rRNA Reads (e) | High-Quality mRNA Reads (f) | Average Length (g) |
|---|---|---|---|---|---|---|---|
| Shoot | 763,102 | 9,116 | 13,939 | 60,710 | 141,344 | 551,932 | 342 |
| Root | 145,133 | 1,074 | 3,487 | 11,150 | 29,385 | 103,524 | 383 |
| Mature leaf | 254,967 | 2,197 | 3,599 | 15,713 | 86,971 | 150,086 | 391 |
| Flower bud | 354,262 | 3,131 | 11,560 | 21,373 | 44,594 | 285,164 | 383 |
| Young pod | 473,321 | 3,289 | 10,976 | 30,174 | 109,984 | 329,874 | 398 |
| Mixed | 571,508 | 7,164 | 11,768 | 49,370 | 4,330 | 510,644 | 374 |
| Total | 2,562,293 | 25,971 | 55,329 | 188,490 | 416,608 | 1,931,224 | 372 |

(a) Total number of reads separated for each tissue sample. (b) Number of low-quality reads (Phred quality score < 20 in more than 30% of bases) removed. (c) Number of trimmed reads containing primer/adapter sequence and more than 7-bp-long homopolymer starting from the first base of the homopolymer. (d) Number of short reads (less than 100 bp) removed. (e) Number of reads identified as rRNA sequences. (f) Number of high-quality reads corresponding to mRNA sequences used for further analysis. (g) Average length of high-quality mRNA reads in bp.

assemblies for super contigs and singletons generated together after the assembly. The super contigs represent consensus sequences of two or more of the contigs, and singletons represent individual contig sequences from primary assemblies. The merging of primary assemblies resulted in fewer contigs than in the individual assemblies (Table III). The largest number of super contigs was formed in the super assembly of MIRA+TGICL. Although N50 and average lengths were higher in the super assembly of TGICL+Newbler, the total assembly size was largest for the MIRA+Newbler and MIRA+TGICL super assemblies. The largest number of soybean proteins showed significant similarity with the super contigs of the MIRA+TGICL super assembly. Furthermore, we attempted the super assembly of primary assemblies of all three best assemblers, TGICL, MIRA, and Newbler, using TGICL. This further increased the assembly size and number of soybean proteins showing significant similarity with super contigs (Table III). The N50 and average lengths of the super contigs were also further improved over those of the MIRA+TGICL super assembly. The coverage of the soybean proteome was better in the super assembly of three primary assemblies as compared with the super assembly of any two of the primary assemblies. Overall, based on the assembly statistics and the coverage of reference sequences, the super assembly of the three primary assemblies appeared to be the best for our 454 sequencing data. Interestingly, however, the mapping of 454 reads on the merged super assemblies revealed a high percentage of nonuniquely mapped reads, and the number was largest for the merged super assembly of the three primary assemblies (Table III). This indicates redundancy of the contigs or the formation of chimeric contigs in the super assemblies; such assemblies do not represent a good assembly and were thus rejected. The study by Kumar and Blaxter (2010), which recommended the merging of assemblies from different programs to get more credible results, did not report the statistics of read mapping on the merged super assemblies. The results of our study indicate that the output of merged assemblies should be considered very carefully.
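The redundancy argument above can be checked directly from the read-mapping counts in Table III. The short Python sketch below computes the percentage of nonuniquely mapped reads as (mapped − uniquely mapped) / total high-quality reads for each merged super assembly. The variable and function names are hypothetical; the numbers are copied from Table III and from the total of 1,931,224 high-quality reads.

```python
TOTAL_READS = 1_931_224  # high-quality 454 reads used for assembly

# (reads mapped, reads uniquely mapped), as reported in Table III
TABLE_III = {
    "MIRA+Newbler": (1_740_301, 1_214_991),
    "TGICL+Newbler": (1_815_449, 1_401_916),
    "MIRA+TGICL": (1_745_431, 1_262_790),
    "MIRA+TGICL+Newbler": (1_840_977, 1_179_585),
}


def nonunique_percent(mapped, unique, total=TOTAL_READS):
    """Percentage of all reads that map to more than one location."""
    return 100.0 * (mapped - unique) / total


redundancy = {
    name: nonunique_percent(mapped, unique)
    for name, (mapped, unique) in TABLE_III.items()
}
# Assembly with the highest share of nonuniquely mapped reads.
most_redundant = max(redundancy, key=redundancy.get)
```

With these counts, the three-way super assembly has the highest share of nonuniquely mapped reads (about 34%), in line with the observation in the text.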

Hybrid Assembly

It has been suggested that hybrid assembly of 454 long-read and Illumina short-read sequencing data helps improve de novo assemblies of genomes and transcriptomes (Schatz et al., 2010). Earlier, we reported the de novo assembly of the chickpea transcriptome using about 107 million high-quality short reads generated on the Illumina platform (Garg et al., 2011). We attempted hybrid assembly using both the 454 (this study) and Illumina short-read (Garg et al., 2011) data sets via the Velvet and ABySS programs but, as expected, the results were poor in all respects, including assembly statistics, coverage of reference soybean proteins, and mapping of reads (data not shown). Therefore, to further improve our above assembly of 454 data, we performed hybrid assemblies of the Newbler, MIRA, and TGICL primary assemblies with the 53,409 transcripts reported earlier based on the short-read data using the TGICL program. The Newbler+short-read and TGICL+short-read hybrid assemblies generated more contigs as compared with Newbler and TGICL alone, respectively (Table IV). However, for the

Table II. Summary of de novo assembly results of 454 sequence data using various programs

| Parameter | Velvet | ABySS | MIRA | Newbler v2.3 | Newbler v2.5p1 | CLC | CAP3 | TGICL |
|---|---|---|---|---|---|---|---|---|
| Contigs (100 bp or greater) | 83,335 | 52,095 | 55,282 | 31,879 | 29,877 | 38,041 | 32,157 | 40,849 |
| Total size (Mb) | 30.71 | 22.14 | 51.78 | 38.25 | 28.54 | 36.54 | 32.03 | 40.14 |
| Large contigs (1,000 bp or greater) | 1,433 | 4,525 | 16,319 | 15,194 | 10,600 | 12,878 | 11,635 | 14,336 |
| Maximum length (bp) | 3,179 | 4,873 | 12,197 | 15,658 | 15,658 | 13,146 | 11,477 | 12,323 |
| Average length (bp) | 369 | 425 | 937 | 1,200 | 955 | 961 | 996 | 983 |
| N50 (bp) | 398 | 640 | 1,169 | 1,617 | 1,393 | 1,325 | 1,159 | 1,252 |
| Reads mapped (%) | 315,439 (16.3) | 442,895 (22.9) | 1,796,444 (93.0) | 1,643,234 (85.1) | 1,412,622 (73.2) | 1,731,780 (89.7) | 1,749,894 (90.6) | 1,836,847 (95.1) |
| Uniquely mapped reads (%) | 306,201 (15.9) | 442,889 (22.9) | 1,522,393 (78.8) | 1,251,637 (64.8) | 1,391,651 (72.1) | 1,646,375 (85.3) | 1,621,452 (84.0) | 1,702,241 (88.1) |
| Contigs with significant hits (%) (a) | 69,380 (83.3) | 38,540 (74.0) | 46,379 (83.9) | 26,977 (84.6) | 23,301 (78.0) | 28,006 (73.6) | 28,352 (88.2) | 33,166 (81.2) |
| Contigs with 80% or greater coverage (%) (b) | 1,008 (1.2) | 2,877 (5.5) | 11,952 (21.6) | 11,638 (36.5) | 9,066 (30.3) | 9,048 (23.8) | 8,010 (24.9) | 10,124 (24.8) |
| Soybean protein hits (%) (c) | 22,917 (49.4) | 20,545 (44.3) | 21,617 (46.6) | 16,663 (35.9) | 17,191 (37.1) | 19,480 (42.0) | 18,109 (39.0) | 20,453 (44.1) |
| Soybean proteins with 80% or greater coverage (%) (d) | 967 (2.1) | 2,831 (6.1) | 9,884 (21.3) | 7,447 (16.1) | 8,585 (18.5) | 8,566 (18.5) | 7,484 (16.1) | 9,414 (20.3) |

(a) Contigs showing significant hits (E ≤ 1e-5) with soybean proteins. (b) Contigs showing 80% or greater coverage of soybean proteins. (c) Unique soybean proteins to which contigs show significant hits (E ≤ 1e-5). (d) Unique soybean proteins to which contigs show 80% or greater coverage.

MIRA+short-read hybrid assembly, the number of contigs was reduced as compared with MIRA alone. The hybrid assemblies increased the N50 and average lengths of the contigs significantly. The number of larger contigs also increased in the hybrid assemblies as compared with the primary assemblies. Among the three hybrid assemblies, the number of contigs and assembly size were smallest for Newbler+short-read, whereas its N50 and average contig lengths were largest. Notably, the coverage of soybean proteins also increased in the hybrid assemblies as compared with the primary assemblies. Although a significantly larger number (46.5%) of soybean proteins showed significant hits with the contigs of the MIRA+short-read hybrid assembly as compared with the other two hybrid assemblies (40.8% for TGICL+short-read and 39.9% for Newbler+short-read), the number of soybean proteins showing 80% or greater coverage was only marginally higher for the MIRA+short-read hybrid assembly (Table IV). Furthermore, mapping of 454 reads on the contigs generated after hybrid assemblies revealed that although the smallest number of reads mapped on the contigs of the Newbler+short-read hybrid assembly, most of them were mapped uniquely (Table IV). Only 7.5% of the total 454 reads were nonuniquely mapped for the Newbler+short-read hybrid assembly as compared with 21.2% and 15.3% for the MIRA+short-read and TGICL+short-read hybrid assemblies, respectively. Based on the above observations, we concluded that the hybrid assembly of the 454 and Illumina data improved the assembly quality and considered the Newbler+short-read hybrid assembly to be the best.

Features of the Chickpea Transcriptome

After a step-by-step optimization of the de novo assembly of the transcriptome, we report a total of 34,760 tentative consensus (TC) transcripts in chickpea representing 35,468,895 bp (35.5 Mb) of the sequence. A unique transcript identifier (ID) number has been assigned to all the 34,760 transcripts from TC00001 to TC34760. Although these transcripts represent a nonredundant set of sequences, some of these may represent alternatively spliced forms of the same gene locus. However, the exact picture will be clear only once the complete genome sequence of chickpea becomes available. Furthermore, the number of chickpea transcripts assembled here is lower than the total number of annotated protein-encoding loci in the completely sequenced genomes of the legumes soybean (55,787) and Medicago (53,423; Phytozome version 7.0; http://www.phytozome.net/). This difference might be due to the fact that chickpea transcripts generated here represent only the expressed sequences, and many of the annotated protein-encoding loci in soybean and Medicago genomes may not have the expression evidence, as has been reported for soybean (Libault et al., 2010; Severin et al., 2010). The average length of chickpea transcripts is 1,020 bp and the N50 size is 1,671 bp. The largest transcript length is 15,661 bp, and about 40% (13,803) of the transcripts were at least 1,000 bp in length (Supplemental Fig. S2). Only a very small fraction (220; 0.6%) of the transcripts were larger than the expected maximum length of 5,000 bp (Supplemental Fig. S2). Such large sequences were found to be present in the predicted coding sequences of Arabidopsis (Arabidopsis thaliana), rice (Oryza sativa), and soybean genes also (data not shown). A total of 1,720,477 (89.1%) 454 sequencing reads could be mapped (81.6% mapped uniquely) to these chickpea transcripts. In addition, a total of 88.3% Illumina reads generated earlier (Garg et al., 2011) could also be mapped (74.7% mapped uniquely) to these chickpea transcripts. In comparison with the chickpea transcriptome based on the de novo assembly of short

Table III. Summary of TGICL merged super assembly validation using 454 sequence data

Parameter                                              MIRA+Newbler       TGICL+Newbler      MIRA+TGICL         MIRA+TGICL+Newbler
Contigs (100 bp or greater) [a]                        20,918             20,639             32,654             34,314
Singletons (100 bp or greater) [b]                     27,781             18,981             17,854             17,020
Total contigs/singletons (100 bp or greater)           48,699             39,620             50,508             51,334
Assembly size (Mb)                                     54.3               45.3               54.2               58.2
Large contigs (1,000 bp or greater)                    19,367             17,007             19,226             20,986
Maximum contig length (bp)                             16,338             15,659             16,598             16,333
Average contig length (bp)                             1,115              1,144              1,072              1,133.2
N50 (bp)                                               1,494              1,552              1,406              1,517
Contigs with significant hits (%) [c]                  40,874 (83.9)      26,933 (68.0)      41,556 (82.3)      42,488 (82.8)
Contigs showing 80% or greater coverage (%) [d]        12,896 (26.5)      11,993 (30.3)      12,270 (24.3)      13,572 (26.4)
Soybean protein hits (%) [e]                           20,954 (45.2)      17,560 (37.9)      21,188 (45.7)      20,987 (45.2)
Soybean proteins with 80% or greater coverage (%) [f]  10,485 (22.6)      10,314 (22.2)      10,275 (22.2)      10,780 (23.2)
No. of 454 reads mapped (%)                            1,740,301 (90.1)   1,815,449 (94.0)   1,745,431 (90.4)   1,840,977 (95.3)
No. of 454 reads uniquely mapped (%)                   1,214,991 (62.9)   1,401,916 (72.6)   1,262,790 (65.4)   1,179,585 (61.1)

[a] Contigs generated by TGICL in merged super assembly.
[b] Contigs of primary assembly not assembled by TGICL in merged super assembly.
[c] Contigs showing significant hits (E ≤ 1e-5) with soybean proteins.
[d] Contigs showing 80% or greater coverage of soybean proteins.
[e] Unique soybean proteins to which contigs show significant hits (E ≤ 1e-5).
[f] Unique soybean proteins to which contigs show 80% or greater coverage.

Gene Discovery and Tissue-Specific Transcriptome of Chickpea

Plant Physiol. Vol. 156, 2011 1665 www.plantphysiol.orgon October 29, 2020 - Published by Downloaded from

Copyright © 2011 American Society of Plant Biologists. All rights reserved.


reads (Garg et al., 2011), we found the present transcriptome to be better in all respects, including size (1.3 times larger), average transcript size (1.95 times larger), and N50 size (1.86 times larger). In addition, the extent of coverage of the reference soybean protein sequences is more than twice that of the short-read transcriptome. About 46% of the chickpea transcripts in this study showed 50% or greater coverage of the coding sequence of the soybean proteins, as compared with about 20% of the short-read transcriptome. Furthermore, more than 34% of the contigs showed 80% or greater coverage of soybean proteins and might represent full-length transcripts. In addition, we found that more than 77% of the mRNA sequences and 66% of the ESTs available for chickpea at the National Center for Biotechnology Information (NCBI) mapped to the chickpea transcripts generated in this study. The average coverage of each chickpea transcript was 18.4 reads per kilobase per million (RPKM) for 454 reads and 22.8 RPKM for short reads. An average of 49.5 reads from the 454 data and 2,712.3 reads from the short-read data were mapped to each chickpea transcript. The average GC content of chickpea transcripts was 39.5%, which is similar to that of other legumes but slightly lower than that of Arabidopsis (42.5%), as reported earlier (Garg et al., 2011).
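The assembly statistics quoted above (average length and N50) can be recomputed from a list of transcript lengths. The following is a minimal sketch, assuming the common definition of N50 as the length L at which sequences of length L or greater together account for at least half of the total assembly size. The function names and the toy lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumed definition: sort lengths from longest to shortest and return
    the length at which the running total first reaches half of the
    summed length of all sequences.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def mean_length(lengths):
    """Return the arithmetic mean of the sequence lengths."""
    return sum(lengths) / len(lengths)


# Toy example (hypothetical lengths in bp, not real chickpea data).
toy_lengths = [100, 200, 300, 400, 500]
print("N50:", n50(toy_lengths))
print("Mean length:", mean_length(toy_lengths))
```

Because N50 weights long sequences more heavily than the mean does, an assembly can show an N50 well above its average length, as is the case for the values reported here.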

Identification of SSRs

SSRs consist of direct tandem repeats of short (two to six) nucleotide motifs and are considered the most suitable choice for a wide range of genetic and population studies based on cost, labor, and genetic diversity. Based on EST sequencing, only a few hundred SSRs (two to six nucleotides) have been identified in chickpea so far (Buhariwalla et al., 2005; Choudhary et al., 2009; Varshney et al., 2009). The availability of a limited number of SSR markers has been a constraint on many areas of chickpea research. To identify SSRs (dinucleotide to hexanucleotide repeats) in chickpea, we mined the transcripts generated in this study using the Perl script MISA. We did not consider mononucleotide repeats in this study, as we trimmed sequence reads with homopolymers of more than seven bases as a part of quality control. A total of 4,111 SSRs with a minimum length of 12 bp could be identified in 3,386 (9.7%) transcripts of chickpea (Table V). The average frequency of SSRs was one SSR per 8.63 kb of the chickpea transcriptome sequence. Furthermore, 545 transcripts contained more than one SSR, and 334 SSRs were present in compound formation. Although the number of reiterations of a SSR type varied from five to 48, only a small fraction (7.8%) showed more than 10 reiterations (Supplemental Table S2), and SSRs with five reiterations were the most abundant (35.3%). The largest fraction of SSRs identified were trinucleotides (57.9%), followed by dinucleotides (37.2%), as also reported by several studies in other plants (La Rota et al., 2005; Hisano et al., 2007; Cloutier et al., 2009). The predominance of trinucleotide repeats in ESTs/mRNAs has been attributed to the suppression of other types of repeats, whose expansion and contraction may lead to frameshift errors in protein-coding regions (Metzgar et al., 2000; Morgante et al., 2002). In contrast, dinucleotide SSRs were found to be most abundant in the transcriptome of pigeonpea (Cajanus cajan), which has been attributed to the greater representation of untranslated regions in the sequence data set (Dutta et al., 2011). Among the SSRs identified, AG/CT (27.7%) accounted for 74.6% of the dinucleotide repeats and AAG/CTT (17.2%), AAC/GTT (9.6%), and AAT/ATT (8.7%) accounted for 61.2% of the trinucleotide repeats

Table IV. Summary of TGICL hybrid assembly validation

Parameter                                              MIRA+Short-Read    TGICL+Short-Read   Newbler+Short-Read
Contigs (100 bp or greater) [a]                        30,432             21,064             18,968
Singletons (100 bp or greater) [b]                     21,361             22,027             15,792
Total contigs/singletons (100 bp or greater)           51,793             43,091             34,760
Assembly size (Mb)                                     51.99              43.61              35.47
Large contigs (1,000 bp or greater)                    18,047             16,038             13,803
Maximum contig length (bp)                             16,616             15,602             15,661
Average contig length (bp)                             1,004              1,012              1,020
N50 (bp)                                               1,424              1,485              1,671
Contigs with significant hits (%) [c]                  41,857 (80.8)      29,194 (67.7)      27,317 (78.6)
Contigs showing 80% or greater coverage (%) [d]        12,644 (24.4)      12,375 (28.7)      11,952 (34.4)
Soybean protein hits (%) [e]                           21,584 (46.5)      18,904 (40.8)      18,514 (39.9)
Soybean proteins with 80% or greater coverage (%) [f]  10,855 (23.4)      11,097 (23.9)      10,683 (23.0)
No. of 454 reads mapped (%)                            1,758,638 (91.1)   1,834,911 (95.0)   1,720,477 (89.1)
No. of 454 reads uniquely mapped (%)                   1,350,782 (69.9)   1,538,879 (79.7)   1,576,415 (81.6)
No. of short reads mapped (%)                          94,674,527 (88.8)  95,972,813 (90.0)  94,217,680 (88.3)
No. of short reads uniquely mapped (%)                 47,020,498 (44.1)  63,410,043 (59.5)  79,718,851 (74.7)

[a] Contigs generated by TGICL in hybrid assembly.
[b] Contigs of primary assembly not assembled by TGICL in hybrid assembly.
[c] Contigs showing significant hits (E ≤ 1e-5) with soybean proteins.
[d] Contigs showing 80% or greater coverage of soybean proteins.
[e] Unique soybean proteins to which contigs show significant hits (E ≤ 1e-5).
[f] Unique soybean proteins to which contigs show 80% or greater coverage.


(Supplemental Table S2). Tetranucleotide (88), pentanucleotide (36), and hexanucleotide (79) repeats were also identified. The total number of SSRs identified in this study is about 47% higher than in our previous study, excluding mononucleotide SSRs (Garg et al., 2011). Mapping of the EST sequences available for chickpea at NCBI revealed that most of the EST-based SSRs identified in previous studies (Buhariwalla et al., 2005; Choudhary et al., 2009; Varshney et al., 2009) are well represented in the transcriptome presented in this study. SSRs identified from transcribed regions have better potential for linkage to loci that contribute to agronomic traits than those identified from nontranscribed regions (Varshney et al., 2005). As a cost-effective option for the development of functional markers, the identified SSRs will have considerable utility for marker-assisted selection and for implementing breeding programs in chickpea.
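The kind of search performed when mining SSRs can be illustrated with a short regular-expression sketch. This is not the MISA script used in the study. It is a simplified stand-in that looks for perfect dinucleotide to hexanucleotide repeats, and the thresholds (`min_repeats=5`, `min_length=12`) are assumptions chosen to echo the minimum SSR length of 12 bp and the minimum of five reiterations mentioned in the text. Compound SSRs are not handled.

```python
import re


def is_lower_order(unit):
    """True if the motif is itself a repeat of a shorter motif (e.g. AGAG, AAA)."""
    n = len(unit)
    return any(
        n % d == 0 and unit == unit[:d] * (n // d)
        for d in range(1, n)
    )


def find_ssrs(seq, min_repeats=5, min_length=12):
    """Return (start, motif, repeat_count) for perfect SSRs with 2-6 bp motifs.

    Both thresholds are illustrative assumptions, not MISA's configuration.
    """
    seq = seq.upper()
    hits = []
    for k in range(2, 7):
        # A k-mer followed by at least (min_repeats - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_lower_order(unit):
                continue  # counted under the shorter motif, or a homopolymer
            if len(match.group(0)) < min_length:
                continue
            hits.append((match.start(), unit, len(match.group(0)) // k))
    return sorted(hits)


# Toy sequence (hypothetical): an (AG)6 repeat, an (AAG)5 repeat, and a poly(A) run.
toy = "TT" + "AG" * 6 + "CCGTCA" + "AAG" * 5 + "TT" + "A" * 15
for start, motif, count in find_ssrs(toy):
    print(start, motif, count)
```

The poly(A) run is skipped because mononucleotide repeats are excluded, mirroring the choice described above.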

Sequence Conservation with Other Plants and Functional Annotation

To study the sequence conservation of chickpea transcripts in other plant species, a BLASTX search was performed against the 15 available proteomes of completely sequenced plant genomes. The largest number (78.6%) of chickpea transcripts exhibited significant similarity with soybean proteins, and the smallest number (61.1%) exhibited significant similarity with proteins of the distantly related Physcomitrella, as expected (Supplemental Fig. S3). Overall, 28,478 (82%) chickpea transcripts showed significant similarity with at least one protein sequence from the 15 plants. These transcripts, being conserved in other plant species, might perform similar functions in chickpea. Furthermore, we were able to assign putative function to 23,491 chickpea transcripts based on their significant sequence similarity with Arabidopsis proteins. In addition, putative function was assigned to another 1,950 chickpea transcripts based on their significant hits in the UniRef90, UniRef100, NR, PFAM, SMART, KEGG, and COG databases. Overall, putative function could be assigned to a total of 25,441 chickpea transcripts, and another 9,319 transcripts were designated as expressed proteins. The orthologs of genes/gene families involved in a wide variety of cellular processes, such as protein turnover, signal transduction, DNA binding, metabolism, stress response, and plant growth and development, were well represented in the chickpea transcripts. Furthermore, a diverse set (3,117) of PFAM domains was represented in our transcript data set, as revealed by AutoFACT analysis. Among these, the protein kinase domain was most highly represented, followed by the pentatricopeptide repeat, protein Tyr kinase, zinc finger C3HC4, RNA recognition motif, and cytochrome P450. The top 20 PFAM domains represented in the chickpea transcripts are shown in Supplemental Figure S4.
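Counting transcripts with a significant hit, as in the similarity searches described above, amounts to filtering BLAST results by E value. The sketch below assumes the results were exported in the standard 12-column BLAST tabular layout, with the query ID in column 1 and the E value in column 11. The text does not state the output format used, so this is an assumption. The cutoff of 1e-5 follows the threshold reported for BLASTX searches in this study. The subject IDs in the example are hypothetical.

```python
E_VALUE_CUTOFF = 1e-5  # cutoff reported for BLASTX/TBLASTX hits in this study


def queries_with_significant_hit(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Return the set of query IDs having at least one hit with E <= cutoff.

    Assumes 12 tab-separated columns per line:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    significant = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, e_value = fields[0], float(fields[10])
        if e_value <= cutoff:
            significant.add(query_id)
    return significant


# Hypothetical example rows (not real search results).
example_lines = [
    "TC00001\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t300",
    "TC00002\tprotB\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t25",
    "TC00001\tprotC\t40.0\t80\t48\t0\t1\t240\t1\t80\t2e-3\t40",
    "TC00003\tprotD\t60.0\t120\t48\t0\t1\t360\t1\t120\t1e-5\t90",
]
hits = queries_with_significant_hit(example_lines)
print(sorted(hits))
print("Fraction of 3 queries with a hit: %.2f" % (len(hits) / 3))
```

Using a set means a transcript with several significant hits is counted once, which is the quantity reported when a percentage of transcripts with significant similarity is given.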

We assigned GOSlim terms to the chickpea transcripts based on their similarity to Arabidopsis proteins. The GOSlim terms for biological process, molecular function, and cellular component could be assigned to 65.3%, 65.7%, and 62.9% of the chickpea transcripts, respectively. In total, 71.9% of the chickpea transcripts were assigned at least one GOSlim term. Overall, the distribution of chickpea transcripts in the various GOSlim categories appears to be very similar to that of Arabidopsis (Fig. 1A). The GOSlim terms protein metabolism, transferase activity, and chloroplast were most represented among the biological process, molecular function, and cellular component categories, respectively. Furthermore, the KEGG pathway analysis revealed that a diversity of pathways was represented in our chickpea transcriptome data set (Supplemental Table S3). The biosynthesis of secondary metabolites, metabolic pathways, ribosome, spliceosome, and ubiquitin-mediated proteolysis were the five most represented pathways among the chickpea transcripts. The number and assortment of allocated GO categories and pathways provide a good indication of the large diversity of expressed genes sampled from the chickpea transcriptome.

Table V. Statistics of SSRs identified in chickpea transcripts

Parameter                                         Value
SSR mining
  Total no. of sequences examined                 34,760
  Total size of examined sequences (bp)           35,468,895
  Total no. of identified SSRs                    4,111
  No. of SSR-containing sequences                 3,386 (9.74%)
  No. of sequences containing more than one SSR   545
  No. of SSRs present in compound formation       334
  Frequency of SSRs                               One per 8.63 kb
Distribution of SSRs in different repeat types
  Dinucleotide                                    1,528 (37.17%)
  Trinucleotide                                   2,380 (57.89%)
  Tetranucleotide                                 88 (2.14%)
  Pentanucleotide                                 36 (0.88%)
  Hexanucleotide                                  79 (1.92%)

Transcription Factor-Encoding Genes in Chickpea

Transcription factors (TFs) are key proteins that bind to specific DNA sequences and regulate gene expression. TFs are represented by various multigene families and are highly conserved in eukaryotic organisms, especially plants. However, the number of genes encoding a particular TF family may vary among plant species due to evolutionary expansion and/or to perform species-specific function(s). Several studies have reported the evolutionary expansion/contraction and specific functions of TF families in various plant species, including legumes, at the genome level or the individual family level (Jain et al., 2008; Libault et al., 2009; Schmutz et al., 2010). We also analyzed the TF repertoire in the chickpea transcripts. We identified a total of 1,851 (5.33%) transcripts encoding TFs belonging to all 84 families searched for. The lower fraction of TF-encoding transcripts identified in this study as compared with our previous study (Garg et al., 2011) may be attributed to the better quality of the transcriptome presented here and the altogether different strategy used for the identification of TFs (stringent HMMER search in this study as compared with BLAST homology search in the previous study). Although the fraction of TF-encoding chickpea transcripts is similar to that of Arabidopsis and other plant species, it is much less than that of soybean (Libault et al., 2009; Schmutz et al., 2010). Interestingly, a much lower number of TF genes has also been reported in legumes such as Medicago and Lotus (Libault et al., 2009) as compared with soybean (Schmutz et al., 2010). The difference in the TF content of these legumes may reflect their evolutionary relationship. Chickpea, Medicago, and Lotus, which contain a lower number of TFs, belong to the galegoid clade, whereas soybean, which contains more TFs, belongs to the millettioid clade of the Papilionoideae subfamily of the legumes (Doyle and Luckow, 2003; Cannon et al., 2009). Although we anticipate that the lower number of TFs identified in this study does not represent an underestimation of the TF content in chickpea, the possibility of lower representation of tissue-/cell type-

Figure 1. GOSlim functional categorization of Arabidopsis and chickpea genes/transcripts, and distribution of TF-encoding genes/transcripts of soybean, Arabidopsis, and chickpea in different families. A, The percentage of Arabidopsis and chickpea genes/transcripts belonging to different GOSlim categories of biological process, molecular function, and cellular component. B, The number of soybean, Arabidopsis, and chickpea genes/transcripts belonging to different major TF families. The numbers of TFs for soybean and Arabidopsis are from Schmutz et al. (2010).


specific TFs expressed at a very low level in our transcriptome data set cannot be ruled out. Furthermore, the lower number may be due to species-specific evolutionary processes. However, this will only become clear once the complete genome sequence of chickpea is available. The number of transcripts belonging to the various families varied from one to 131 in chickpea. The largest TF family was the MYB/MYB-related family, with 131 transcripts (7.1%). A comparison of the number of chickpea transcripts identified in the major TF families in this study with those of soybean and Arabidopsis reported previously (Schmutz et al., 2010) showed a larger number of genes in all the TF families in soybean as compared with Arabidopsis and chickpea (Fig. 1B). However, several events of expansion and contraction of TF families were found in chickpea vis-à-vis Arabidopsis. For example, ABI3VP1, AP2/EREBP, bHLH, BTB/POZ, MADS box, and MYB/MYB-related family TFs were significantly greater in number in Arabidopsis as compared with chickpea, and SNF2, TCP, and CCAAT domain TFs were significantly greater in number in chickpea as compared with Arabidopsis. These differences in the abundance of TFs in Arabidopsis and chickpea might play an important role in regulating species-specific biological processes. Although legumes do not contain any specific TF family or preferential expansion of a particular TF family, legume-specific functions have been attributed to the evolution of promoter sequences leading to altered gene expression patterns (Libault et al., 2009). However, the possibility that a few TF-encoding genes have evolved to perform species-specific function(s) through neofunctionalization/subfunctionalization cannot be ruled out.

Legume- and Species-Specific Chickpea Transcripts

The identification of lineage-specific genes provides insights into species-specific functions and evolutionary processes such as speciation and adaptation (Domazet-Loso and Tautz, 2003). A few studies have been performed to identify lineage-specific genes in plants (Campbell et al., 2007; Lin et al., 2010). Legume-specific genes have also been identified based on the unigene sets from soybean, Medicago, and Lotus (Graham et al., 2004). We performed a series of BLAST searches to identify a core set of chickpea transcripts with conserved sequences within the Fabaceae that lack significant similarity to sequences outside the Fabaceae, as well as transcripts that lack similarity to sequences from any other plant species. A summary of the approach followed and the results obtained is presented in Figure 2A. In the first step, the chickpea transcripts showing any significant BLASTX hit with a protein sequence from 12 annotated plant genomes were removed. In the second step, the remaining 7,703 chickpea transcripts were searched via TBLASTX against non-Fabaceae plant transcript assemblies from 236 species available at The Institute for Genomic Research (TIGR) and unigenes from 48 species available at NCBI. This resulted in the removal of 328 transcripts that showed significant hits with at least one of these sequences, thus eliminating a total of 27,385 transcripts that were considered to be conserved in plant species other than the Fabaceae. Subsequently, in the third step, a TBLASTX search was performed against transcript assemblies of 13 Fabaceae species available at TIGR and ESTs/unigenes of five Fabaceae species available at NCBI, and BLASTX searches against the proteomes and TBLASTX/BLASTN searches against the genome sequences of soybean, Medicago, and Lotus were performed. A total of 3,743 transcripts showed significant similarity with at least one of the above sequences and were considered as candidate legume-specific genes. Another 3,632 transcripts did not show significant hits with any of the above sequences and might represent chickpea-specific (CS) genes. This fraction (10.5%) is less than that of rice genes (17.4%) that did not show significant similarity to sequences from the plant kingdom (Campbell et al., 2007), but much higher than that predicted in Arabidopsis (4.9%; Lin et al., 2010). This set of genes might represent novel genes that perform species-specific functions. As the last step, after a comparative analysis of transcripts, we identified 741 transcripts that have significant similarity to the protein/unigene/genome sequences of at least three of the five non-chickpea Fabaceae species simultaneously; these were identified as a set of core legume-specific (CLS) genes. Although the emergence and evolution of CLS transcripts is not clear, it may be speculated that these genes appeared in the Fabaceae after their divergence from non-Fabaceae plants or were lost in non-Fabaceae plants after their divergence from the Fabaceae. In addition, several hypotheses, including lateral gene transfer, gene duplication followed by sequence divergence, enhanced evolutionary rate, and de novo emergence of genes from noncoding DNA, have been proposed for the evolution of lineage-specific genes (Daubin et al., 2003; Cai et al., 2006; Levine et al., 2006).
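The transcript counts reported for the stepwise filtering can be cross-checked with simple bookkeeping. The sketch below uses only numbers stated in the text. The variable names and the `classify` helper are illustrative and are not part of the original analysis pipeline.

```python
TOTAL_TRANSCRIPTS = 34_760

# Step 1: transcripts with a BLASTX hit in the 12 annotated plant proteomes are removed.
remaining_after_step1 = 7_703
removed_step1 = TOTAL_TRANSCRIPTS - remaining_after_step1

# Step 2: transcripts with a TBLASTX hit in non-Fabaceae assemblies/unigenes are removed.
removed_step2 = 328
conserved_outside_fabaceae = removed_step1 + removed_step2
remaining_after_step2 = TOTAL_TRANSCRIPTS - conserved_outside_fabaceae

# Step 3: the remainder is split by similarity to Fabaceae sequences.
legume_specific_candidates = 3_743
chickpea_specific = remaining_after_step2 - legume_specific_candidates
chickpea_specific_percent = 100 * chickpea_specific / TOTAL_TRANSCRIPTS


def classify(hit_outside_fabaceae, hit_within_fabaceae):
    """Illustrative decision rule mirroring the three-step strategy."""
    if hit_outside_fabaceae:
        return "conserved outside Fabaceae"
    if hit_within_fabaceae:
        return "candidate legume-specific"
    return "chickpea-specific (CS)"


print(conserved_outside_fabaceae, remaining_after_step2, chickpea_specific)
print("CS fraction: %.2f%%" % chickpea_specific_percent)
print(classify(False, True))
```

The totals reproduce the 27,385 conserved transcripts and the 3,632 CS transcripts given in the text, and the computed CS fraction is close to the reported value of about 10.5%.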

The GC content distribution of all the chickpea transcripts is bimodal, with a broader range of 35% to 45% for most (79.4%) chickpea transcripts (Fig. 2B), as compared with that of Arabidopsis (40%–45%). The analysis of CLS and CS transcripts revealed that the GC distribution of CS transcripts is also bimodal but shifted toward a lower range (30%–40%) as compared with all transcripts (Fig. 2B). However, the GC content distribution of CLS transcripts was distinctly different from that of all chickpea transcripts and CS transcripts, showing a unimodal distribution with a peak in the GC content range of 35% to 40%. Similar results have been reported for core Poaceae-specific genes in rice (Campbell et al., 2007). The GC content of genes has been related to gene conversion, recombination rate, DNA thermostability, transcriptional activity, and gene/genome contraction/expansion phenomena (Carels


and Bernardi, 2000; Galtier et al., 2001; Vinogradov, 2003). Although CS transcripts did not show significant similarity to any sequence in the plant kingdom, a conserved PFAM/SMART domain could be identified in at least 89 transcripts, leading to their putative annotation (Supplemental Table S4). Some of these domains were overrepresented in the CS transcripts, for example, Atrophin-1 (PF03154), Extensin_2 (PF04554), Amelogenin (PF07174), and Serpentine-type 7TM GPCR chemoreceptor Srz (PF10325). Putative function has been assigned to 133 of the CLS transcripts (Supplemental Table S5). Among these, the largest number (12) of annotated CLS transcripts represent cyclin-like F-box domain-containing proteins. F-box domain-containing proteins have previously been found to be enriched in lineage-specific genes as well (Graham et al., 2004; Campbell et al., 2007), indicating their major role in evolutionary processes leading to lineage specificity. Events of extensive duplication followed by sequence divergence in plant F-box genes have already been reported (Jain et al., 2007; Xu et al., 2009). In addition, Atrophin-1, Extensin_2, NB-LRR-type disease resistance proteins, and WRKY TFs were also represented in the CLS transcript set. At least two and 12 TF-encoding transcripts were identified in the CS and CLS sets, respectively. It has been suggested that lineage-specific TFs, along with their binding DNA sequences, may lead to novel gene regulatory networks and might be responsible for specific phenotypes/traits (Nowick and Stubbs, 2010). The study of species-specific and lineage-specific genes will help in understanding the molecular mechanisms underlying characteristic biological processes of chickpea and other legumes.
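The GC content distributions compared in this section are obtained by computing the GC percentage of each transcript and counting transcripts per GC range. A minimal sketch follows, assuming 5% wide bins with an inclusive lower bound. The text does not specify the bin edges, so that convention and the toy sequences are assumptions.

```python
from collections import Counter


def gc_percent(seq):
    """GC content (%) of a nucleotide sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def gc_bin(percent, width=5):
    """Label of the GC range containing `percent` (lower bound inclusive)."""
    lower = int(percent // width) * width
    if lower >= 100:  # place exactly 100% GC in the top bin
        lower = 100 - width
    return "%d-%d%%" % (lower, lower + width)


def gc_distribution(seqs, width=5):
    """Percentage of sequences falling in each GC range."""
    counts = Counter(gc_bin(gc_percent(s), width) for s in seqs)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy sequences (hypothetical, not chickpea transcripts).
print(gc_percent("ATGC"))
print(gc_bin(37.5))
print(gc_distribution(["ATGC", "AATT"]))
```

Running the same function separately on all transcripts, the CLS set, and the CS set would give the three distributions that are compared in Figure 2B.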

Tissue-Specific Expression Analysis of the Chickpea Transcriptome

The use of EST abundance as an indicator of transcript abundance is a well-accepted method for differential gene expression analysis and has been implemented in numerous studies (Weber et al., 2007; Hale et al., 2009; Kristiansson et al., 2009). The results of differential gene expression analysis using NGS technologies have been found to be accurate and highly correlated with other methods such as real-time PCR analysis and microarray analysis (Cloonan et al., 2008; Wilhelm and Landry, 2009; Zenoni et al., 2010). Gene expression studies in chickpea have been limited to the single-gene level so far. In this study, we utilized RNA-Seq data to quantify the expression of all chickpea transcripts. To analyze the abundance of chickpea transcripts within a tissue sample, the high-quality reads from individual tissue samples were mapped to the transcripts and the number of reads mapped was normalized by the RPKM method. The RPKM method corrects for biases in total gene exon size and normalizes for the total number of read sequences obtained in each tissue library. We

Figure 2. Identification and GC content analysis of legume- and species-specific transcripts in chickpea. A, Strategy for the identification of legume- and species-specific transcripts in chickpea. The transcripts that showed significant hits with non-Fabaceae plant species are in dotted boxes. "Yes" represents a significant hit, and "No" represents no significant hit in BLAST searches as per the given criteria (E ≤ 1e-5 for BLASTX and TBLASTX and E ≤ 1e-10 for BLASTN). The transcripts identified as CS and CLS are highlighted in gray boxes. B, The percentage of all chickpea transcripts, CLS transcripts, and CS transcripts with GC content within a range.


classified gene expression into five arbitrary categories (very low, low, moderate, high, and very high) based on the RPKM value of each transcript in the different tissue samples (Fig. 3A). The largest fraction of transcripts showed moderate expression (RPKM > 10–50), followed by low expression (RPKM > 3–10), in all tissue samples. A small fraction (1.5%–3.2%) of transcripts were expressed at very high levels (RPKM > 100) in different tissue samples (Fig. 3A). Furthermore, we identified transcripts expressed preferentially in each tissue sample analyzed via tissue-by-tissue comparison. For this, we normalized the number of reads uniquely mapped to each transcript per million total reads (RPM) uniquely mapped for each tissue sample. Figure 3B gives the number of genes that show preferential expression (3-fold or greater) in each pairwise comparison of tissues. The fraction of transcripts that are preferentially expressed between any two of the tissues analyzed in this study varied from 1.2% to 9.2%. The largest number of transcripts showed preferential expression in root as compared with mature leaf, followed by flower bud in comparison with mature leaf. The identification of preferentially expressed genes will help to gain insight into gene functions and thereby into biological processes. Furthermore, we identified transcripts

Figure 3. Differential and tissue-specific expression analysis of the chickpea transcriptome. A, Number of transcripts with different expression abundances in various tissue samples based on the RPKM method. The transcripts showing RPKM values of 1 to 3, greater than 3 to 10, greater than 10 to 50, greater than 50 to 100, and greater than 100 have been classified as very lowly, lowly, moderately, highly, and very highly expressed, respectively. B, Transcripts preferentially expressed in each tissue sample as compared with others in a tissue-by-tissue comparison. Transcripts represented by 3-fold or greater RPM as compared with the other tissue sample are given. For the transcripts in each cell, expression is preferential in the column tissue as compared with the row tissue. C, Number of transcripts showing specific expression with different expression abundances in various tissue samples. The tissue-specific transcripts are represented by 3 or greater RPM in the tissue of interest but 0 RPM in the other tissue samples analyzed. The transcripts showing RPKM values of greater than 3 to 10, greater than 10 to 50, greater than 50 to 100, and greater than 100 have been classified as lowly, moderately, highly, and very highly expressed, respectively. D, Heat map of the transcripts specifically expressed in various tissue samples. The color scale (1–100) represents RPM values. The transcripts with greater than 100 RPM (very highly expressed) are not distinguished by the color scale (they are represented by the same color as that of transcripts with 100 RPM) so that genes with low to high expression can be differentiated. FB, Flower bud; ML, mature leaf; YP, young pod.


expressed specifically in a single tissue sample, with at least 3 RPM in the tissue of interest and zero in the others. The largest number of transcripts exhibited specific expression in flower bud (1,132), followed by young pod (695), with variable abundance (Fig. 3C). The heat map in Figure 3D shows the tissue-specific expression of these chickpea transcripts. These transcripts may play specific roles in the biology of the various tissues in chickpea. The results have been validated by quantitative real-time PCR analysis of representative transcripts (Supplemental Fig. S5A). All the primer pairs used in the study amplified a specific PCR product. A very good correlation was obtained between the results of RNA-Seq and real-time PCR analysis. Among the 18 representative transcripts analyzed by real-time PCR, which are specifically expressed in different tissue samples, the expression pattern of only three transcripts predicted to be expressed in mature leaf did not correlate. These results further confirm the potential of NGS technologies to quantify gene expression. In addition, agarose gel electrophoresis analysis revealed the expected size of PCR products in real-time PCR for all the transcripts analyzed (Supplemental Fig. S5B; Supplemental Table S6), which further validates the accuracy of the de novo assembly presented in this study.
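The normalization and classification rules used in this section can be summarized in a short sketch. The RPKM and RPM formulas below are the standard definitions. The category boundaries, the 3-fold criterion, and the tissue-specificity rule (at least 3 RPM in one tissue and 0 RPM in the others) follow the text and the legend of Figure 3. The function names, the handling of values below 1 RPKM, and the requirement of nonzero expression in the preferential comparison are assumptions made for illustration.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def rpm(read_count, total_mapped_reads):
    """Reads per million mapped reads."""
    return read_count * 1e6 / total_mapped_reads


def expression_category(rpkm_value):
    """Expression class using the RPKM ranges given in the Figure 3 legend."""
    if rpkm_value > 100:
        return "very high"
    if rpkm_value > 50:
        return "high"
    if rpkm_value > 10:
        return "moderate"
    if rpkm_value > 3:
        return "low"
    if rpkm_value >= 1:
        return "very low"
    return "below threshold"  # assumption: values under 1 RPKM are left unclassified


def preferentially_expressed(rpm_a, rpm_b, fold=3.0):
    """True if expression in tissue A is at least `fold` times that in tissue B."""
    return rpm_a > 0 and rpm_a >= fold * rpm_b


def specific_tissue(rpm_by_tissue, min_rpm=3.0):
    """Return the single tissue with >= min_rpm when all others are 0, else None."""
    expressed = [tissue for tissue, value in rpm_by_tissue.items() if value > 0]
    if len(expressed) == 1 and rpm_by_tissue[expressed[0]] >= min_rpm:
        return expressed[0]
    return None


# Hypothetical transcript: 50 reads on a 1,000-bp transcript, 1 million mapped reads.
value = rpkm(50, 1000, 1_000_000)
print(value, expression_category(value))
print(specific_tissue({"shoot": 0.0, "root": 4.2, "mature leaf": 0.0,
                       "flower bud": 0.0, "young pod": 0.0}))
```

Requiring exactly zero RPM in all other tissues is a stringent criterion, so a transcript with even a single uniquely mapped read elsewhere would be reported as preferentially expressed rather than tissue specific.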

Furthermore, we analyzed various GOSlim categories represented in tissue-specific chickpea transcripts. Although all the GOSlim terms were represented in the tissue-specific transcripts, a few terms were significantly overrepresented in some tissue samples (Supplemental Fig. S6). For example, among the biological process GOSlim terms, protein metabolism was most overrepresented in all the tissue samples except root. However, the GOSlim term response to stress was significantly overrepresented in the root-specific chickpea transcripts. Furthermore, developmental process-related transcripts were also overrepresented in flower bud and young pod tissues. Likewise, various molecular function GOSlim terms were significantly overrepresented in various tissue-specific transcripts. For example, protein binding was most overrepresented in the shoot-, root-, and mature leaf-specific transcripts, followed by transferase activity in shoot- and mature leaf-specific transcripts and hydrolase activity in root-specific transcripts. However, transferase activity followed by nucleotide binding was most represented in flower bud-specific transcripts, and hydrolase activity followed by transferase activity was most represented in young pod-specific transcripts. The cellular component term chloroplast was most overrepresented in the green tissues, shoot, and mature leaf. The term nucleus was significantly overrepresented in the flower bud- and root-specific transcripts. A large fraction of root-, young pod-, and flower bud-specific transcripts showed overrepresentation of the cellular component term plasma membrane.

We analyzed the statistically significant enrichment of specific Gene Ontology (GO) terms in the tissue-specific chickpea transcripts using the BiNGO tool (Fig. 4). In the root-specific set of transcripts, the biological process response to stimulus and its children terms (response to oxidative stress and defense response) and the molecular function oxidoreductase activity were significantly enriched, which are well related to the major role of roots in various stress responses (Fig. 4A). Notably, the biological process term nitrogen fixation and the molecular function terms electron carrier activity and heme binding/iron ion binding were also found to be significantly enriched in root-specific transcripts, which are related to the ability for atmospheric nitrogen fixation by legumes. In the flower bud-specific chickpea transcripts, the biological process developmental processes and its children terms (gametophyte development, pollen development, and microsporogenesis, etc.) were significantly enriched, which are related to flower development (Fig. 4B). Likewise, the molecular function term nutrient reservoir activity was significantly enriched in the young pod-specific transcripts, which is related to seed development (Fig. 4C). These observations further authenticate our gene expression analysis results and provide a clue toward specific functions of the chickpea transcripts in a particular tissue sample. Although we identified the tissue-specific transcripts based on sequence data from five different tissue samples using stringent criteria, their expression in other chickpea tissues not analyzed in this study cannot be ruled out. In addition, the possibility of the identification of novel transcripts not sequenced/assembled in this study also cannot be ruled out. The tissue-specific transcripts identified in this study may be analyzed further in functional genomics studies.
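To make the idea of GO term enrichment concrete, the sketch below computes a one-sided hypergeometric P value for a single term. The text does not state which statistical test or multiple-testing correction was selected in BiNGO, so the hypergeometric test is an assumption here, no correction is applied, and the counts in the example are invented.

```python
from math import comb


def hypergeom_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """Probability of seeing at least k annotated transcripts by chance.

    N: transcripts in the background set
    K: background transcripts annotated with the GO term
    n: transcripts in the tissue-specific set
    k: tissue-specific transcripts annotated with the GO term
    """
    total = comb(N, n)
    # Sum the upper tail P(X >= k) of the hypergeometric distribution.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


# Invented counts: 40 of 300 tissue-specific transcripts carry a term that
# annotates 500 of 20,000 background transcripts.
p_value = hypergeom_enrichment_p(k=40, n=300, K=500, N=20000)
```

In practice, P values for all tested terms would be corrected for multiple testing before calling a term significantly enriched.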

Chickpea Transcriptome Database

We developed a public data resource, the Chickpea Transcriptome Database (CTDB), which provides a searchable interface to the chickpea transcriptome data. CTDB is publicly available at http://www.nipgr.res.in/ctdb.html. The current release (release 1.0) of the database provides the transcriptome sequence of the chickpea (genotype ICC4958) reported in this study. The Web pages provide information about the project, data, research group, and other resources of chickpea transcriptome data. Figure 5 provides snapshots of the various tools/features of CTDB. The database can be queried based on transcript ID and keyword search for all the functional annotations. CTDB output can be displayed to the screen or saved as a tab-delimited file. The data available in the database can be mined/queried using a variety of search options provided. The sequence-based search has been implemented using NCBI BLAST search (version 2.2.24+), which provides the option of different BLAST algorithms to search nucleotide or protein sequence(s) against chickpea transcriptome sequence data reported in this study and previously (Garg et al., 2011) at user-defined parameters. In addition, we have provided the option to search unigenes generated by assembling (using TGICL) all the chickpea ESTs available at NCBI. Various annotation search facilities are provided, namely, putative function, PFAM domain, transcript ID, and GO search. The expression data for transcripts can be accessed in two ways: first, the expression level (RPM values) for individual transcripts can be retrieved using unique transcript IDs for all tissue types used in this study; second, highly expressed transcripts (on the basis of RPKM expression values) can be retrieved for a specific tissue type. In addition, the batch download (for multiple transcript IDs) utility for sequence and annotation data in a tab-delimited file has also been provided, which gives the user flexibility for downstream data analysis/interpretation. Links have also been provided to download the whole assembled sequence data as a single file. We anticipate that the database will be very useful for accelerating research in the area of chickpea functional genomics. Furthermore, we aim to update the database annually or as new sequence/annotation data set(s) become available for chickpea.

Figure 4. Significantly enriched GO categories in tissue-specific chickpea transcripts. A, Root specific. B, Flower bud specific. C, Young pod specific. The chickpea transcripts showing specific expression in each tissue type were analyzed using BiNGO, and the results for only the tissue type showing significant enrichment of one or more GO categories are shown. Node size is proportional to the number of transcripts in each category and colors are shaded according to the significance level (white, no significant difference; yellow, P = 0.05; orange, P < 0.0000005).

Figure 5. Snapshots of the public access resource CTDB showing its various utilities.

CONCLUSION

The generation and availability of genomic resources for chickpea, a very important food legume plant, lag significantly. The applications of NGS technologies in the characterization of the transcriptome of nonmodel species are well documented. In this study, we present the high-quality transcriptome sequence of the crop legume plant chickpea as a genomic resource using sequence data generated from Roche 454 and Illumina NGS platforms and de novo assembly optimization. The functional annotation of the transcriptome provides a greater insight into the gene content, biological processes, and pathways conserved in chickpea. The identification of lineage- and species-specific genes in chickpea will be very helpful in establishing evolutionary relationships among legumes. The SSRs identified in chickpea should provide breeders a very good resource for the future development of improved chickpea cultivars. The identification of tissue-specific transcripts should help accelerate the functional analysis of genes of interest in chickpea. Our data and results provide essential information for future genetic studies in chickpea and present a de novo transcriptome assembly workflow that should be applicable to other plants as well.

MATERIALS AND METHODS

Plant Material and RNA Isolation

Chickpea (Cicer arietinum genotype ICC4958) seeds were grown as described (Garg et al., 2010). Root and shoot tissue samples were collected from 15-d-old seedlings. The mature leaves, flower buds, and young pods were collected from plants grown in the field. At least three independent biological replicates of each tissue sample were harvested and immediately frozen in liquid nitrogen. Total RNA was extracted from all tissue samples using TRI Reagent (Sigma Life Science) according to the manufacturer’s instructions. The quality and quantity of each RNA sample were assessed using NanoVue (GE Healthcare) and the Agilent 2100 Bioanalyzer (Agilent Technologies) as described previously (Garg et al., 2010). Equal quantities of total RNA samples from the three biological replicates were pooled for mRNA purification followed by library preparation for each tissue sample.

Library Preparation and 454 Sequencing

The mRNA was purified from total RNA samples using the Dynabead mRNA purification kit according to the manufacturer’s instructions (Invitrogen, Dynal). We found significant rRNA contamination in the mRNA samples purified from the five individual tissue samples. Therefore, we did two additional stringent washes before elution of the mRNA from the mixed tissue sample to avoid rRNA contamination. Six cDNA libraries were generated using the cDNA rapid library preparation kit for the GS FLX Titanium series essentially following the manufacturer’s instructions (Roche Diagnostics). The double-stranded cDNA was synthesized using the SuperScript double-stranded cDNA synthesis kit (Invitrogen). Approximately 600 ng of double-stranded cDNA was nebulized and selected for the 300- to 800-bp fragment length range. The specific adapters were ligated to the fragmented cDNA and denatured to generate single-stranded cDNA followed by emulsion PCR amplification for sequencing. Five cDNA libraries were generated, one each from the mRNA isolated from shoot, root, mature leaf, flower bud, and young pod tissue samples, and a sixth cDNA library was generated from the mRNA purified from total RNA samples pooled in equal amounts from all the above tissue samples. The quality of libraries was assessed using the High Sensitivity DNA kit on the Agilent 2100 Bioanalyzer. All six cDNA libraries were sequenced in two runs (one complete and two half runs) using GS FLX Titanium series sequencing reagents and sequencer. Each of the cDNA libraries from shoot and mixed samples were sequenced in one complete flow cell, whereas two each of the other four cDNA libraries were tagged with unique RL multiplex identifier adaptors and mixed before emulsion PCR and sequencing in one complete flow cell.

Sequence Quality Controls and Preprocessing

First, Q20 data were extracted for all the flow cells using the gsRunProcessor command with a value of 0.01 for errorQscoreWindowTrim in the filterTemplate.xml file. The sequence data for different tissue samples sequenced in a single flow cell were separated using the sfffile command. The reads not belonging to the multiplex identifiers used were discarded. The sequence data generated in this study have been deposited in Standard Flowgram Format (SFF) at NCBI in the Short Read Archive database under the accession number SRA030696 (experiment accession nos. SRX048831–SRX048836). Various quality controls, including filtering of high-quality reads based on the score values given in .qual files, trimming of reads containing primer/adaptor sequences, trimming of reads containing homopolymers of more than seven bases, and removal of reads with length of less than 100 bp, were done using the in-house NGS QC tool kit (R.K. Patel and M. Jain, unpublished data). This was followed by filtering of rRNA sequences with an optimized highly stringent criterion. The sequences showing E-value cutoffs of 1e-10 or less and 100-bp alignment length with at least 80% identity in BLASTN search against a set of 147 rRNA sequences from various plant species downloaded from NCBI were considered as rRNA sequences and discarded.
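The rRNA filtering criterion can be sketched as a filter over BLASTN hits. The example assumes that the hits are available in the standard 12-column tabular BLAST format (an assumption, as the output format is not stated) and reads "100-bp alignment length" as a minimum of 100 bp. The function name and the read and subject IDs are hypothetical.

```python
from typing import Iterable, Set


def rrna_read_ids(blast_lines: Iterable[str],
                  max_evalue: float = 1e-10,
                  min_length: int = 100,
                  min_identity: float = 80.0) -> Set[str]:
    """Return IDs of reads whose BLASTN hit meets all three rRNA cutoffs.

    Expects tab-separated lines with the columns: query, subject,
    % identity, alignment length, mismatches, gap opens, q. start,
    q. end, s. start, s. end, E value, bit score.
    """
    flagged: Set[str] = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        identity = float(fields[2])
        length = int(fields[3])
        evalue = float(fields[10])
        if evalue <= max_evalue and length >= min_length and identity >= min_identity:
            flagged.add(fields[0])
    return flagged


# Made-up hits: only read1 passes all three cutoffs.
hits = [
    "read1\trRNA_a\t98.50\t250\t3\t0\t1\t250\t10\t259\t1e-120\t450",
    "read2\trRNA_b\t79.00\t150\t30\t1\t1\t150\t5\t154\t1e-20\t90",
    "read3\trRNA_c\t95.00\t60\t3\t0\t1\t60\t1\t60\t1e-15\t80",
]
to_discard = rrna_read_ids(hits)
```

Reads whose IDs are returned would then be removed from the data set before assembly.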

De Novo Assembly

We used various de novo assembly programs to obtain the best assembly results with the chickpea 454 sequence data set and generated a nonredundant set of transcripts. Among the various programs available, we used publicly available programs Velvet (version 1.0.14; http://www.ebi.ac.uk/~zerbino/velvet/), ABySS (version 1.1.2; http://www.bcgsc.ca/platform/bioinfo/software/abyss), and MIRA (version 3.2.0; http://sourceforge.net/projects/mira-assembler/). In addition, we used gsdenovo assembler (Newbler v2.3 and v2.5p1; http://www.454.com/products-solutions/analysis-tools/gs-de-novo-assembler.asp) supplied with the GS FLX Titanium sequencer and the commercially available CLC Genomics Workbench (version 3.7.1; http://www.clcbio.com/index.php?id=1240). These programs have been developed for the de novo assembly of long and/or short reads. We also used CAP3 (http://seq.cs.iastate.edu/cap3.html) and TGICL (version 2.0; http://sourceforge.net/projects/tgicl/) programs for the assembly of 454 sequence data.

All the assemblies were performed either on the server with 48 cores and 128 GB of random access memory or on the server with eight cores and 48 GB of random access memory. The high-quality reads and their quality values in fasta format were used as input in all the assemblers for 454 data. Velvet and ABySS assemblies were performed at various k-mer lengths. In MIRA, we used the -job=denovo,est,accurate,454 quick switch using 12 cores (-GE:not=12) with four passes (-AS:nop=4) for the assembly of 454 data. The two latest versions of Roche 454’s Newbler, v2.3 and v2.5p1, were used with the cdna option using multiple CPUs. The assembly on the CLC Genomics Workbench was performed using default settings. CAP3 assembly of the 454 data was performed using default parameters. The TGICL assembly was performed on 16 CPUs (-c 16) with minimum overlap length of 40 (-l 40), minimum percentage identity of 90 for overlaps (-p 90), and maximum length of unmatched overhangs of 20 (-v 20).

GC Content Analysis and SSR Identification

The GC content analysis was done using an in-house Perl script. We used MISA (MIcroSAtellite; http://pgrc.ipk-gatersleben.de/misa/) for the identification of SSRs. Dinucleotide repeats of more than six times and trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide repeats of more than five times were considered as search criteria for SSRs in the MISA script.
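The search criteria can be illustrated with a small regular-expression scan for perfect SSRs. This is a sketch of the idea, not MISA itself, and the function and variable names are hypothetical. The thresholds read "more than six/five times" literally (at least seven and six repeats, respectively); each value in `MIN_REPEATS` should be lowered by one if "at least six/five" was intended.

```python
import re
from typing import Dict, List, Tuple

# Minimum repeat count per motif length (2 = dinucleotide ... 6 = hexanucleotide).
# Assumption: "more than six/five times" is interpreted literally.
MIN_REPEATS: Dict[int, int] = {2: 7, 3: 6, 4: 6, 5: 6, 6: 6}


def find_ssrs(seq: str, min_repeats: Dict[int, int] = MIN_REPEATS) -> List[Tuple[int, str, int]]:
    """Return (0-based start, motif, repeat count) for each perfect SSR."""
    seq = seq.upper()
    hits: List[Tuple[int, str, int]] = []
    for unit_len, n in sorted(min_repeats.items()):
        # One captured motif followed by at least n - 1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, n - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit, e.g. "AA" or "ATAT".
            if motif in (motif + motif)[1:-1]:
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)


# Made-up sequence containing (AT)8 and (GAA)6.
ssrs = find_ssrs("GGC" + "AT" * 8 + "CCG" + "GAA" * 6 + "T")
```

For the made-up sequence, the scan reports the dinucleotide repeat starting at position 3 and the trinucleotide repeat starting at position 22.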

Sequence Conservation and Functional Annotation

The available proteome data sets for all the completely sequenced plant genomes were downloaded from their respective genome project Web sites. The criterion of an expect (E) value of 1e-5 or less was used for the identification of significant hits in a BLASTX search. To deduce the putative function of chickpea transcripts, they were subjected to BLASTX search against annotated protein sequences of Arabidopsis (Arabidopsis thaliana; available at The Arabidopsis Information Resource). The results of only the best hit were extracted, and the hits with E ≤ 1e-5 were considered to be significant. Each of the chickpea transcripts showing significant hits was assigned the same putative function as that of the corresponding Arabidopsis protein. Furthermore, the chickpea transcripts that did not show any significant hits were searched against UniRef90, UniRef100, NR, PFAM, SMART, KEGG, and COG databases to assign putative functions to them using the AutoFACT pipeline (http://megasun.bch.umontreal.ca/Software/AutoFACT.htm; Koski et al., 2005). The GOSlim terms for molecular function, biological process, and cellular component categories associated with the best BLASTX hit of the Arabidopsis protein were assigned to the corresponding chickpea transcript. GO enrichment analysis was done using the BiNGO plugin of Cytoscape (Maere et al., 2005).
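The best-hit annotation transfer described above can be sketched as follows. The input is assumed to be already parsed into (query, subject, E value) tuples; the function names, IDs, and descriptions are hypothetical placeholders rather than real annotations.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float]  # (query transcript, subject protein, E value)


def best_significant_hits(hits: Iterable[Hit],
                          max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float]]:
    """Keep the lowest-E-value hit per transcript, then apply the E cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {q: hit for q, hit in best.items() if hit[1] <= max_evalue}


def assign_functions(best: Dict[str, Tuple[str, float]],
                     subject_function: Dict[str, str]) -> Dict[str, str]:
    """Give each transcript the putative function of its best-hit protein."""
    return {q: subject_function.get(s, "unknown function") for q, (s, _e) in best.items()}


# Made-up hits and descriptions.
hits = [
    ("tr1", "protein_X", 1e-30),
    ("tr1", "protein_Y", 1e-50),
    ("tr2", "protein_Z", 1e-3),   # not significant at E <= 1e-5
]
best = best_significant_hits(hits)
functions = assign_functions(best, {"protein_Y": "example description"})
```

Transcripts without a significant hit (such as `tr2` here) are the ones that would be passed on to the secondary database searches.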

Identification of TF-Encoding Genes

The chickpea transcripts encoding for TFs were identified using the hidden Markov model profiles already available in the PFAM database or those generated from the domain alignments available at the Plant Transcription Factor Database (http://plntfdb.bio.uni-potsdam.de/v3.0/; Pérez-Rodríguez et al., 2010) for 84 families using HMMER search. We followed similar criteria as the Plant Transcription Factor Database to identify the TFs belonging to various families.


Identification of Legume-Specific and Species-Specific Genes

The plant transcript assemblies and EST/unigene data for non-Fabaceae/Fabaceae plant species were downloaded from the TIGR Plant Transcript Assemblies database (http://plantta.jcvi.org/; Childs et al., 2007) and NCBI (ftp://ftp.ncbi.nih.gov/repository/UniGene/), respectively. The data set of transcript assemblies was composed of 3,618,017 sequences from 236 non-Fabaceae species and 275,760 sequences from 13 Fabaceae species. The NCBI EST/unigene data set was composed of 11,974,373 sequences from 48 non-Fabaceae species and 2,004,299 sequences from five Fabaceae species. The genome sequences for all the completely sequenced plants were downloaded from their respective genome project Web sites. The final set of chickpea transcripts were subjected to various BLAST searches as per the strategy described in “Results and Discussion.” The criteria of E ≤ 1e-5 for BLASTX and TBLASTX searches and E ≤ 1e-10 for BLASTN search were used for filtering significant hits. In-house Perl scripts were used for filtering the BLAST results of significant and nonsignificant hits and their sequences.
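Once significant hits have been collected, the final classification reduces to simple set membership. The sketch below is a simplification under stated assumptions: it takes two precomputed sets of transcript IDs (those with any significant hit in non-Fabaceae data and those with any significant hit in non-chickpea Fabaceae data) and does not reproduce the stepwise search strategy described in "Results and Discussion." All names are hypothetical.

```python
from typing import Dict, Iterable, Set


def classify_lineage(transcript_ids: Iterable[str],
                     non_fabaceae_hits: Set[str],
                     fabaceae_hits: Set[str]) -> Dict[str, str]:
    """Label each transcript as conserved, legume specific, or chickpea specific."""
    labels: Dict[str, str] = {}
    for tid in transcript_ids:
        if tid in non_fabaceae_hits:
            labels[tid] = "conserved"            # similarity outside the legumes
        elif tid in fabaceae_hits:
            labels[tid] = "legume_specific"      # similarity only within Fabaceae
        else:
            labels[tid] = "chickpea_specific"    # no significant similarity found
    return labels


# Made-up transcript IDs.
labels = classify_lineage(
    ["t1", "t2", "t3"],
    non_fabaceae_hits={"t1"},
    fabaceae_hits={"t1", "t2"},
)
```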

Mapping of Sequence Reads onto Chickpea Transcripts

To quantify the expression of each transcript in individual tissue samples and identify differentially expressed genes, all the reads from six samples were mapped onto the nonredundant set of transcripts using CLC Genomics Workbench software. For mapping the 454 reads, the criterion of a minimum of 90% coverage of the total length was used, and for short reads, a maximum of two mismatches were allowed for mapping. The total number of reads, number of unique reads, and RPM corresponding to each transcript were determined. In addition, the coverage of each transcript was determined in terms of RPKM. The heat map showing tissue-specific expression was generated based on the RPM for each transcript in all the tissue samples using TIGR MultiExperiment Viewer.
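For reference, the two normalized measures can be computed as shown below, using the standard definitions of reads per million (RPM) and reads per kilobase per million (RPKM). The function names and the read counts are illustrative only; the values reported in the study came from the mapping software.

```python
def rpm(read_count: int, total_mapped_reads: int) -> float:
    """Reads per million mapped reads for one transcript in one sample."""
    return read_count * 1e6 / total_mapped_reads


def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (total_mapped_reads * transcript_length_bp)


# Illustrative numbers: 30 reads on a 1,500-bp transcript, 2 million mapped reads.
example_rpm = rpm(30, 2_000_000)
example_rpkm = rpkm(30, 1_500, 2_000_000)
```

RPM corrects only for sequencing depth, whereas RPKM additionally corrects for transcript length, which matters when comparing transcripts of different sizes within a sample.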

Real-Time PCR Validation

The gene-specific primers for real-time PCR analysis were designed using Primer Express (version 3.0) software (Applied Biosystems), and the specificity of primer pairs was confirmed by BLASTN with all the nucleotide sequences of the nonredundant set of chickpea transcripts generated. The primer sequences of all the genes used in this study are listed in Supplemental Table S6. The real-time PCR analysis using gene-specific primers was performed as described previously (Garg et al., 2010). In brief, cDNA was synthesized from 6 μg of total RNA in a final reaction volume of 100 μL for each sample. The real-time PCRs were performed using diluted cDNA, 200 nM of gene-specific primers, and SYBR Green PCR mix on 96-well optical PCR plates using the 7500 Sequence Detection System (Applied Biosystems). All the reactions were performed under default parameters, and the specificity of reactions was verified by dissociation curve analysis. Two independent biological replicates for each sample and three technical replicates of each biological replicate were analyzed for real-time PCR analysis. For a biological replicate of a tissue sample, the same cDNA pool was used for real-time PCR analysis of all the genes analyzed. The transcript level of each gene in different tissue samples was normalized with the transcript level of the most suitable internal control gene, EF1α (Garg et al., 2010).
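Normalization against an internal control gene is commonly expressed as a difference in threshold cycle (Ct) values. The sketch below uses the 2^(-ΔCt) model, which assumes 100% amplification efficiency; the text does not state which quantification model was applied, so this is an assumption, and the Ct values are invented.

```python
from statistics import mean
from typing import Sequence


def relative_level(target_ct: Sequence[float], reference_ct: Sequence[float]) -> float:
    """Transcript level of a target gene relative to the internal control.

    Technical replicates are averaged first; the result is 2 ** (-delta Ct),
    where delta Ct = mean Ct (target) - mean Ct (internal control).
    """
    delta_ct = mean(target_ct) - mean(reference_ct)
    return 2.0 ** (-delta_ct)


# Invented Ct values for three technical replicates of one biological replicate.
level = relative_level([24.1, 23.9, 24.0], [20.0, 20.1, 19.9])
```

The same calculation would be repeated for each biological replicate and tissue sample before comparing expression patterns with the RNA-Seq data.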

Construction of CTDB

CTDB is a public resource for chickpea transcriptome data. Web pages have been prepared using Perl-CGI on the Apache Tomcat (version 5.5.29) Web server application. The data regarding expression and annotation for each transcript are stored in the MySQL server (version 5.0.77). The database is currently hosted on a Sun workstation running the CentOS (version 5.4) Linux operating system with two Intel Xeon quad core processors and 12 GB of random access memory. The sequence data are stored in flat files.

Sequence data from this article can be found in the Short Read Archive database at NCBI under accession number SRA030696 (experiment accession nos. SRX048831–SRX048836).

Supplemental Data

The following materials are available in the online version of this article.

Supplemental Figure S1. Length (A) and average quality score (B) distribution of the total number of high-quality reads generated.

Supplemental Figure S2. Length distribution of chickpea transcripts generated from optimized hybrid assembly.

Supplemental Figure S3. Sequence conservation of chickpea transcripts with proteomes of completely sequenced plants.

Supplemental Figure S4. Top 20 PFAM domains represented in the chickpea transcripts.

Supplemental Figure S5. Relative transcript levels of representative tissue-specific chickpea transcripts validated by real-time PCR analysis (A), and agarose gel electrophoresis showing amplification of specific PCR products of desired size (B).

Supplemental Figure S6. GOSlim term assignment in different categories of biological process (A), molecular function (B), and cellular component (C) to the chickpea transcripts showing tissue-specific expression in the five tissue samples analyzed in this study.

Supplemental Table S1. Statistics of de novo assemblies of 454 sequence data using Velvet and ABySS at various k-mer lengths.

Supplemental Table S2. Frequency of SSRs identified in chickpea transcripts.

Supplemental Table S3. KEGG pathways represented in chickpea transcripts.

Supplemental Table S4. Transcript ID and annotations of CS transcripts.

Supplemental Table S5. Transcript ID and annotations of CLS transcripts.

Supplemental Table S6. Primer sequences used for real-time PCR analysis of tissue-specific expression of chickpea transcripts in the study.

Received April 19, 2011; accepted June 7, 2011; published June 8, 2011.

LITERATURE CITED

Ashraf N, Ghai D, Barman P, Basu S, Gangisetty N, Mandal MK, Chakraborty N, Datta A, Chakraborty S (2009) Comparative analyses of genotype dependent expressed sequence tags and stress-responsive transcriptome of chickpea wilt illustrate predicted and unexpected genes and novel regulators of plant immunity. BMC Genomics 10: 415

Buhariwalla HK, Jayashree B, Eshwar K, Crouch JH (2005) Development of ESTs from chickpea roots and their use in diversity analysis of the Cicer genus. BMC Plant Biol 5: 16

Cai JJ, Woo PC, Lau SK, Smith DK, Yuen KY (2006) Accelerated evolutionary rate may be responsible for the emergence of lineage-specific genes in ascomycota. J Mol Evol 63: 1–11

Campbell MA, Zhu W, Jiang N, Lin H, Ouyang S, Childs KL, Haas BJ, Hamilton JP, Buell CR (2007) Identification and characterization of lineage-specific genes within the Poaceae. Plant Physiol 145: 1311–1322

Cannon SB, May GD, Jackson SA (2009) Three sequenced legume genomes and many crop species: rich opportunities for translational genomics. Plant Physiol 151: 970–977

Carels N, Bernardi G (2000) Two classes of genes in plants. Genetics 154: 1819–1825

Childs KL, Hamilton JP, Zhu W, Ly E, Cheung F, Wu H, Rabinowicz PD, Town CD, Buell CR, Chan AP (2007) The TIGR Plant Transcript Assemblies database. Nucleic Acids Res (Database issue) 35: D846–D851

Choudhary S, Sethy NK, Shokeen B, Bhatia S (2009) Development of chickpea EST-SSR markers and analysis of allelic variation across related species. Theor Appl Genet 118: 591–608

Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, et al (2008) Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods 5: 613–619

Cloutier S, Niu Z, Datla R, Duguid S (2009) Development and analysis of EST-SSRs for flax (Linum usitatissimum L.). Theor Appl Genet 119: 53–63

Daubin V, Lerat E, Perrière G (2003) The source of laterally transferred genes in bacterial genomes. Genome Biol 4: R57


Domazet-Lošo T, Tautz D (2003) An evolutionary analysis of orphan genes in Drosophila. Genome Res 13: 2213–2219

Doyle JJ, Luckow MA (2003) The rest of the iceberg: legume diversity and evolution in a phylogenetic context. Plant Physiol 131: 900–910

Dutta S, Kumawat G, Singh BP, Gupta DK, Singh S, Dogra V, Gaikwad K, Sharma TR, Raje RS, Bandhopadhya TK, et al (2011) Development of genic-SSR markers by deep transcriptome sequencing in pigeonpea [Cajanus cajan (L.) Millspaugh]. BMC Plant Biol 11: 17

Galtier N, Piganeau G, Mouchiroud D, Duret L (2001) GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics 159: 907–911

Gao WR, Wang XS, Liu QY, Peng H, Chen C, Li JG, Zhang JS, Hu SN, Ma H (2008) Comparative analysis of ESTs in response to drought stress in chickpea (C. arietinum L.). Biochem Biophys Res Commun 376: 578–583

Garg R, Patel RK, Tyagi AK, Jain M (2011) De novo assembly of chickpea transcriptome using short reads for gene discovery and marker identification. DNA Res 18: 53–63

Garg R, Sahoo A, Tyagi AK, Jain M (2010) Validation of internal control genes for quantitative gene expression studies in chickpea (Cicer arietinum L.). Biochem Biophys Res Commun 396: 283–288

Gaur R, Sethy NK, Choudhary S, Shokeen B, Gupta V, Bhatia S (2011) Advancing the STMS genomic resources for defining new locations on the intraspecific genetic linkage map of chickpea (Cicer arietinum L.). BMC Genomics 12: 117

Graham MA, Silverstein KA, Cannon SB, VandenBosch KA (2004) Computational identification and characterization of novel genes from legumes. Plant Physiol 135: 1179–1197

Graham PH, Vance CP (2003) Legumes: importance and constraints to greater use. Plant Physiol 131: 872–877

Hale MC, McCormick CR, Jackson JR, Dewoody JA (2009) Next-generation pyrosequencing of gonad transcriptomes in the polyploid lake sturgeon (Acipenser fulvescens): the relative merits of normalization and rarefaction in gene discovery. BMC Genomics 10: 203

Hisano H, Sato S, Isobe S, Sasamoto S, Wada T, Matsuno A, Fujishiro T, Yamada M, Nakayama S, Nakamura Y, et al (2007) Characterization of the soybean genome using EST-derived microsatellite markers. DNA Res 14: 271–281

Hüttel B, Winter P, Weising K, Choumane W, Weigand F, Kahl G (1999) Sequence-tagged microsatellite site markers for chickpea (Cicer arietinum L.). Genome 42: 210–217

Jain D, Chattopadhyay D (2010) Analysis of gene expression in response to water deficit of chickpea (Cicer arietinum L.) varieties differing in drought tolerance. BMC Plant Biol 10: 24

Jain M, Nijhawan A, Arora R, Agarwal P, Ray S, Sharma P, Kapoor S, Tyagi AK, Khurana JP (2007) F-box proteins in rice: genome-wide analysis, classification, temporal and spatial gene expression during panicle and seed development, and regulation by light and abiotic stress. Plant Physiol 143: 1467–1483

Jain M, Tyagi AK, Khurana JP (2008) Genome-wide identification, classification, evolutionary expansion and expression analyses of homeobox genes in rice. FEBS J 275: 2845–2861

Kaur H, Shukla RK, Yadav G, Chattopadhyay D, Majee M (2008) Two divergent genes encoding L-myo-inositol 1-phosphate synthase1 (CaMIPS1) and 2 (CaMIPS2) are differentially expressed in chickpea. Plant Cell Environ 31: 1701–1716

Koski LB, Gray MW, Lang BF, Burger G (2005) AutoFACT: an automatic functional annotation and classification tool. BMC Bioinformatics 6: 151

Kristiansson E, Asker N, Förlin L, Larsson DG (2009) Characterization of the Zoarces viviparus liver transcriptome using massively parallel pyrosequencing. BMC Genomics 10: 345

Kumar S, Blaxter ML (2010) Comparing de novo assemblers for 454 transcriptome data. BMC Genomics 11: 571

La Rota M, Kantety RV, Yu JK, Sorrells ME (2005) Nonrandom distribution and frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and barley. BMC Genomics 6: 23

Levine MT, Jones CD, Kern AD, Lindfors HA, Begun DJ (2006) Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression. Proc Natl Acad Sci USA 103: 9935–9939

Libault M, Farmer A, Joshi T, Takahashi K, Langley RJ, Franklin LD, He J, Xu D, May G, Stacey G (2010) An integrated transcriptome atlas of the crop model Glycine max, and its use in comparative analyses in plants. Plant J 63: 86–99

Libault M, Joshi T, Benedito VA, Xu D, Udvardi MK, Stacey G (2009) Legume transcription factor genes: what makes legumes so special? Plant Physiol 151: 991–1001

Lichtenzveig J, Scheuring C, Dodge J, Abbo S, Zhang HB (2005) Construction of BAC and BIBAC libraries and their applications for generation of SSR markers for genome analysis of chickpea, Cicer arietinum L. Theor Appl Genet 110: 492–510

Lin H, Moghe G, Ouyang S, Iezzoni A, Shiu SH, Gu X, Buell CR (2010) Comparative analyses reveal distinct sets of lineage-specific genes within Arabidopsis thaliana. BMC Evol Biol 10: 41

Maere S, Heymans K, Kuiper M (2005) BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in biological networks. Bioinformatics 21: 3448–3449

Mantri NL, Ford R, Coram TE, Pang EC (2007) Transcriptional profiling of

chickpea genes differentially regulated in response to high-salinity, cold

and drought. BMC Genomics 8: 303

Metzgar D, Bytof J, Wills C (2000) Selection against frameshift mutations

limits microsatellite expansion in coding DNA. Genome Res 10: 72–80

Millan T, Clarke HJ, Siddique KHM, Buhariwalla HK, Gaur PM, Kumar

J, Gil J, Kahl G, Winter P (2006) Chickpea molecular breeding: new

tools and concepts. Euphytica 147: 81–103

Millan T, Winter P, Jungling R, Gil J, Rubio J, Cho S, Cobos MJ, Iruela M,

Rajesh PN, Tekeoglu M, et al (2010) A consensus genetic map of chick-

pea (Cicer arietinum L.) based on 10 mapping populations. Euphytica

175: 175–189

Molina C, Rotter B, Horres R, Udupa SM, Besser B, Bellarmino L, Baum

M, Matsumura H, Terauchi R, Kahl G, et al (2008) SuperSAGE:

the drought stress-responsive transcriptome of chickpea roots. BMC

Genomics 9: 553

Molina C, Zaman-Allah M, Khan F, Fatnassi N, Horres R, Rotter B,

Steinhauer D, Amenc L, Drevon JJ, Winter P, et al (2011) The salt-

responsive transcriptome of chickpea roots and nodules via deep-

SuperSAGE. BMC Plant Biol 11: 31

Morgante M, Hanafey M, Powell W (2002) Microsatellites are preferen-

tially associated with nonrepetitive DNA in plant genomes. Nat Genet

30: 194–200

Morozova O, Hirst M, Marra MA (2009) Applications of new sequencing

technologies for transcriptome analysis. Annu Rev Genomics Hum

Genet 10: 135–151

Nowick K, Stubbs L (2010) Lineage-specific transcription factors and

the evolution of gene regulatory networks. Brief Funct Genomics 9:

65–78

Pandey A, Chakraborty S, Datta A, Chakraborty N (2008) Proteomics

approach to identify dehydration responsive nuclear proteins from

chickpea (Cicer arietinum L.). Mol Cell Proteomics 7: 88–107

Pandey A, Choudhary MK, Bhushan D, Chattopadhyay A, Chakraborty

S, Datta A, Chakraborty N (2006) The nuclear proteome of chickpea

(Cicer arietinum L.) reveals predicted and unexpected proteins. J Pro-

teome Res 5: 3301–3311

Papanicolaou A, Stierli R, Ffrench-Constant RH, Heckel DG (2009) Next

generation transcriptomes for next generation genomes using est2as-

sembly. BMC Bioinformatics 10: 447

Peng H, Yu X, Cheng H, Shi Q, Zhang H, Li J, Ma H (2010) Cloning and characterization of a novel NAC family gene CarNAC1 from chickpea (Cicer arietinum L.). Mol Biotechnol 44: 30–40

Pérez-Rodríguez P, Riaño-Pachón DM, Corrêa LG, Rensing SA, Kersten B, Mueller-Roeber B (2010) PlnTFDB: updated content and new features of the plant transcription factor database. Nucleic Acids Res (Database issue) 38: D822–D827

Radhika P, Gowda SJ, Kadoo NY, Mhase LB, Jamadagni BM, Sainani MN, Chandra S, Gupta VS (2007) Development of an integrated intraspecific map of chickpea (Cicer arietinum L.) using two recombinant inbred line populations. Theor Appl Genet 115: 209–216

Rajesh PN, Coyne C, Meksem K, Sharma KD, Gupta V, Muehlbauer FJ (2004) Construction of a HindIII bacterial artificial chromosome library and its use in identification of clones associated with disease resistance in chickpea. Theor Appl Genet 108: 663–669

Rajesh PN, O’Bleness M, Roe BA, Muehlbauer FJ (2008) Analysis of genome organization, composition and microsynteny using 500 kb BAC sequences in chickpea. Theor Appl Genet 117: 449–458

Schatz MC, Delcher AL, Salzberg SL (2010) Assembly of large genomes using second-generation sequencing. Genome Res 20: 1165–1173

Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, et al (2010) Genome sequence of the palaeopolyploid soybean. Nature 463: 178–183

Severin AJ, Woody JL, Bolon YT, Joseph B, Diers BW, Farmer AD, Muehlbauer GJ, Nelson RT, Grant D, Specht JE, et al (2010) RNA-Seq atlas of Glycine max: a guide to the soybean transcriptome. BMC Plant Biol 10: 160

Shukla RK, Tripathi V, Jain D, Yadav RK, Chattopadhyay D (2009) CAP2 enhances germination of transgenic tobacco seeds at high temperature and promotes heat stress tolerance in yeast. FEBS J 276: 5252–5262

Tripathi V, Parasuraman B, Laxmi A, Chattopadhyay D (2009) CIPK6, a CBL-interacting protein kinase is required for development and salt tolerance in plants. Plant J 58: 778–790

Upadhyaya HD, Dwivedi SL, Baum M, Varshney RK, Udupa SM, Gowda CL, Hoisington D, Singh S (2008) Genetic structure, diversity, and allelic richness in composite collection and reference set in chickpea (Cicer arietinum L.). BMC Plant Biol 8: 106

Varshney RK, Graner A, Sorrells ME (2005) Genic microsatellite markers in plants: features and applications. Trends Biotechnol 23: 48–55

Varshney RK, Hiremath PJ, Lekha P, Kashiwagi J, Balaji J, Deokar AA, Vadez V, Xiao Y, Srinivasan R, Gaur PM, et al (2009) A comprehensive resource of drought- and salinity-responsive ESTs for gene discovery and marker development in chickpea (Cicer arietinum L.). BMC Genomics 10: 523

Vinogradov AE (2003) DNA helix: the importance of being GC-rich. Nucleic Acids Res 31: 1838–1844

Weber AP, Weber KL, Carr K, Wilkerson C, Ohlrogge JB (2007) Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiol 144: 32–42

Wilhelm BT, Landry JR (2009) RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing. Methods 48: 249–257

Xu G, Ma H, Nei M, Kong H (2009) Evolution of F-box genes in plants: different modes of sequence divergence and their relationships with functional diversification. Proc Natl Acad Sci USA 106: 835–840

Zenoni S, Ferrarini A, Giacomelli E, Xumerle L, Fasoli M, Malerba G, Bellin D, Pezzotti M, Delledonne M (2010) Characterization of transcriptional complexity during berry development in Vitis vinifera using RNA-Seq. Plant Physiol 152: 1787–1795

Zhang X, Scheuring CF, Zhang M, Dong JJ, Zhang Y, Huang JJ, Lee MK, Abbo S, Sherman A, Shtienberg D, et al (2010) A BAC/BIBAC-based physical map of chickpea, Cicer arietinum L. BMC Genomics 11: 501
