Crossref journal-article
American Association for the Advancement of Science (AAAS)
Science (221)
Abstract

Large segmental duplications cover much of the Arabidopsis thaliana genome. Little is known about their origins. We show that they are primarily due to at least four different large-scale duplication events that occurred 100 to 200 million years ago, a formative period in the diversification of the angiosperms. A better understanding of the complex structural history of angiosperm genomes is necessary to make full use of Arabidopsis as a genetic model for other plant species.

Bibliography

Vision, T. J., Brown, D. G., & Tanksley, S. D. (2000). The Origins of Genomic Duplications in Arabidopsis. Science, 290(5499), 2114–2117.

Authors 3
  1. Todd J. Vision (first)
  2. Daniel G. Brown (additional)
  3. Steven D. Tanksley (additional)
References 55 Referenced 817
  1. Arumuganathan K., Earle E. D., Plant Mol. Biol. Rep. 9, 208 (1991). (10.1007/BF02672069)
  2. 10.1126/science.282.5389.662
  3. McGrath J. M., Jancso M. M., Pichersky E., Theor. Appl. Genet. 86, 880 (1993). (10.1007/BF00212616) / Theor. Appl. Genet. by McGrath J. M. (1993)
  4. Bancroft I., Yeast 17, 1 (2000). (10.1002/(SICI)1097-0061(200004)17:1<1::AID-YEA3>3.0.CO;2-V) / Yeast by Bancroft I. (2000)
  5. Blanc G., Barakat A., Guyot R., Cooke R., Delseny M., Plant Cell 12, 1093 (2000). (10.1105/tpc.12.7.1093) / Plant Cell by Blanc G. (2000)
  6. 10.1073/pnas.070430597
  7. Kowalski S. P., Lan T. H., Feldmann K. A., Paterson A. H., Genetics 138, 499 (1994). (10.1093/genetics/138.2.499) / Genetics by Kowalski S. P. (1994)
  8. 10.1073/pnas.160271297
  9. 10.1038/45471
  10. Mayer K., et al., Nature 402, 769 (1999). (10.1038/47134) / Nature by Mayer K. (1999)
  11. 10.1038/ng1296-380
  12. Lan T.-H., et al., Genome Res. 10, 776 (2000). (10.1101/gr.10.6.776) / Genome Res. by Lan T.-H. (2000)
  13. Paterson A. H., et al., Plant Cell 12, 1523 (2000). (10.1105/tpc.12.9.1523) / Plant Cell by Paterson A. H. (2000)
  14. Terryn N., et al., FEBS Lett. 445, 237 (1999). (10.1016/S0014-5793(99)00097-6) / FEBS Lett. by Terryn N. (1999)
  15. Supplementary information is available at: www.igd.cornell.edu/~tvision/arab/science_supplement.html
  16. We used GENSCAN [
  17. 10.1006/jmbi.1997.0951
  18. ] to provide gene models in unannotated clones. Because we only require that most predicted exons overlap true exons in the same translation frame and that predicted gene densities are approximately correct, ab initio predictions are sufficient for our purposes.
  19. 10.1093/nar/25.17.3389
  20. BLASTP scores were obtained for all pairs of genes. The (i, j) element of matrix M was assigned the alignment score for proteins i and j if the score was 100 bits or greater. Row and column indices denote the position of each protein within each chromosome. Chromosome order and orientation are arbitrary.
  21. Matching genes within 15 positions of each other were collected into the row and column with the smallest index and were assigned the maximum of the component scores. This was iterated until convergence, thereby combining both tandemly duplicated genes and single-copy genes shared by overlapping clones. A single gene may occur in two positions if the tiling path information is wrong, but extensive clone overlap would be needed to generate a spurious duplicated block, and the high sequence similarity would have flagged the error.
  22. Only the five highest scores in each row and column were retained, a conservative approach that sacrifices sensitivity for specificity.
  23. To identify duplicated blocks, we first calculated a weight for each pair of nonzero elements $M_{i_1 j_1}$ and $M_{i_2 j_2}$, for all $j_2 > j_1$, with corresponding transcriptional orientations $T(i_1)$, $T(i_2)$, $T(j_1)$, $T(j_2)$, where $T \in \{-1, 1\}$. The weight was $W = -k + \ell (r + c) - \mathbf{1}[T(i_1) \cdot T(j_1) = T(i_2) \cdot T(j_2) = \operatorname{sgn}(i_2 - i_1)] \cdot m$, where $k$, $\ell$, and $m$ are constants; $r = |i_1 - i_2|$ and $c = |j_1 - j_2|$ are the row and column distances; $\mathbf{1}[T(i_1) \cdot T(j_1) = T(i_2) \cdot T(j_2) = \operatorname{sgn}(i_2 - i_1)]$ is an indicator function equaling 1 if the transcriptional orientations of the two pairs of linked cORFs are equivalent but otherwise equaling 0; and $\operatorname{sgn}(x)$ equals −1, 0, or 1 for $x$ less than, equal to, or greater than 0, respectively. When cORFs were composed of multiple genes, transcriptional orientations were taken to be those of the highest scoring gene pair. Weights were assigned to edges of a directed acyclic graph in which nodes were nonzero elements of M and edges connected all nodes $(i_1, j_1)$ and $(i_2, j_2)$ for which $j_2 > j_1$. We computed minimum-weight paths between every pair of nodes connected by an edge [T. H. Cormen, C. E. Leiserson, R. L. Rivest, Introduction to Algorithms (MIT Press, Cambridge, MA, 1990), pp. 536–538], identified paths with negative weight, combined overlapping paths, combined paths from either side of the diagonal, and accepted the resulting sets of nodes as duplicated blocks. Errors in the order and orientation of genes may hinder our ability to detect duplicated blocks but are unlikely to generate false ones.
  24. 10.1038/42711
  25. A. McLysaght, C. Seoighe, K. H. Wolfe, in Comparative Genomics, D. Sankoff, J. H. Nadeau, Eds. (Kluwer, New York, 2000), pp. 47–58. (10.1007/978-94-011-4309-7_6)
  26. Random matrices were derived from M by permutation of its rows. Using parameter values of k = 5, ℓ = 1.14, and m = 25, only one block defined by six or more pairs was identified in 1000 permutations of the chromosome 2 versus 4 submatrix, compared with 34 blocks of five pairs. Thus, blocks of seven pairs are unlikely to arise by chance, though real duplicated blocks may be overextended or erroneously merged.
  27. Amino acid alignments were obtained using CLUSTALW version 1.7 [
  28. 10.1093/nar/22.22.4673
  29. ]. Estimates of $d_A$ were obtained using PAML [Z. Yang, Phylogenetic Analysis by Maximum Likelihood (PAML), Version 3.0 (University College London, 2000)] with the JTT substitution matrix [
  30. Jones D. T., Taylor W. R., Thornton J. M., CABIOS 8, 275 (1992); / CABIOS by Jones D. T. (1992)
  31. ]. The smallest estimate of $d_A$ was used for matches between multiple genes. The median is more robust to outliers than the mean, though it will still be affected by the absence of highly diverged homologous genes from the sample.
  32. Homogeneity in $d_A$ among blocks was rejected by a single-classification analysis of variance (P < 0.0001).
  33. Mixture models of normal distributions with parameters for means, variances, and mixing proportions were fit using an expectation-maximization algorithm. Models were compared by likelihood ratio tests [M. Lynch, B. Walsh, Genetics and Analysis of Quantitative Traits (Sinauer, Sunderland, MA, 1997), pp. 359–364]. The medians of samples drawn from a population of any distribution approach a normal distribution as the sample size increases [W. Feller, An Introduction to Probability Theory and its Applications (John Wiley & Sons, New York, ed. 2, 1957), pp. 238–241]. The approximation should be adequate for samples of this size (average = 29), though it is not strictly valid because there are differing numbers of matches in each block. The log likelihoods of the one-, two-, and three-distribution models are −5.5, −128.5, and −183.6, respectively, and each differs from the next by three degrees of freedom.
  34. Because numerous small blocks were not counted and up to 20% of the genome sequence has not been analyzed, this is likely to be an underestimate.
  35. This estimate is the product of the average gene density of chromosomes 2 and 4, ~210 genes/megabase, and the estimated length of chromosome 1, which is 27.9 megabases [
  36. 10.1038/10334
  37. A rate of $9 \times 10^{-10} \pm 9 \times 10^{-10}$ nonsynonymous base substitutions·site$^{-1}$·lineage$^{-1}$·year$^{-1}$ has been estimated for nuclear genes in the grasses [
  38. Gaut B. S., Evol. Biol. 30, 93 (1998); / Evol. Biol. by Gaut B. S. (1998)
  39. 10.1023/A:1006319803002
  40. ], though this must be treated with caution due to many sources of uncertainty. Because 75% of all possible sense nucleotide substitutions in the genetic code are nonsynonymous and there are three positions in each codon, we assume that there are 2.25 nonsynonymous sites in each codon in converting from amino acid substitutions to nonsynonymous base substitutions. Accounting for patterns of codon usage in Arabidopsis (GenBank Release 119.0), one obtains a nearly identical conversion factor (2.30 nonsynonymous sites per codon) [
  41. Benson D. A., et al., Nucleic Acids Res. 28, 15 (2000)]. (10.1093/nar/28.1.15) / Nucleic Acids Res. by Benson D. A. (2000)
  42. Arabidopsis is a member of the rosid lineage of dicot angiosperms [
  43. Soltis P. S., Soltis D. E., Chase M. W., Nature 402, 402 (1999)]. (10.1038/46528) / Nature by Soltis P. S. (1999)
  44. 10.1073/pnas.86.16.6201
  45. Yang Y.-W., Lai K.-N., Tai P.-Y., Li W.-H., J. Mol. Evol. 48, 597 (1999). (10.1007/PL00006502) / J. Mol. Evol. by Yang Y.-W. (1999)
  46. The two presumed redundancies involve clones T6J4/F13B4 and F21N10/K17E7.
  47. Koch M., Bishop J., Mitchell-Olds T., Plant Biol. 1, 529 (1999). (10.1111/j.1438-8677.1999.tb00779.x) / Plant Biol. by Koch M. (1999)
  48. This may be inflated by undetected homologies, overextended blocks, and transposition of genes from their original positions.
  49. 10.1007/PL00006498
  50. C. Somerville, personal communication.
  51. S. J. Liljegren et al., Nature 404, 766 (2000). (10.1038/35008089)
  52. 10.1101/gr.9.9.825
  53. 10.1139/g99-033
  54. Copenhaver G. P., Browne W. E., Preuss D., Proc. Natl. Acad. Sci. U.S.A. 95, 247 (1998). (10.1073/pnas.95.1.247) / Proc. Natl. Acad. Sci. U.S.A. by Copenhaver G. P. (1998)
  55. We thank C. Aquadro, J. Doyle, R. Durrett, T. Mitchell-Olds, C. Somerville, M. Yanofsky, and L. Zhang for helpful comments. This research was funded by grants from the National Science Foundation and the Office of Naval Research. T.J.V. is supported in part by the Cornell Theory Center.
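The edge weight described in note 23 can be illustrated with a short Python sketch. This is not the authors' code: the `Match` record, the function names, and the example coordinates are hypothetical, and only the constants (k = 5, ℓ = 1.14, m = 25, from note 26) come from the notes. The sketch assumes the standard sign function and reads the weight as W = −k + ℓ(r + c) − 1[...]·m, so that a nearby pair of matches with consistent orientations receives a negative (favorable) weight.

```python
from dataclasses import dataclass

# Constants quoted in note 26; all names below are illustrative assumptions.
K, ELL, M_BONUS = 5.0, 1.14, 25.0


@dataclass(frozen=True)
class Match:
    """A nonzero element of M: row i, column j, and gene orientations."""
    i: int
    j: int
    t_i: int  # transcriptional orientation of the row gene, -1 or +1
    t_j: int  # transcriptional orientation of the column gene, -1 or +1


def sgn(x: int) -> int:
    """Standard sign function: -1, 0, or 1 for x < 0, x == 0, x > 0."""
    return (x > 0) - (x < 0)


def edge_weight(a: Match, b: Match) -> float:
    """Weight of the edge a -> b, defined only when b.j > a.j."""
    if b.j <= a.j:
        raise ValueError("edges require j2 > j1")
    r = abs(a.i - b.i)  # row distance
    c = abs(a.j - b.j)  # column distance
    # Indicator: both pairs share one relative orientation, and that
    # orientation matches the direction of travel along the rows.
    consistent = a.t_i * a.t_j == b.t_i * b.t_j == sgn(b.i - a.i)
    return -K + ELL * (r + c) - (M_BONUS if consistent else 0.0)


if __name__ == "__main__":
    a = Match(i=10, j=20, t_i=1, t_j=1)
    print(edge_weight(a, Match(i=12, j=23, t_i=1, t_j=1)))   # consistent pair
    print(edge_weight(a, Match(i=12, j=23, t_i=1, t_j=-1)))  # inconsistent pair
```

With these hypothetical coordinates, the consistent pair has a weight of about −24.3 and the inconsistent pair about 0.7. Under the negative-weight-path criterion in note 23, only the consistent pair could extend a candidate block.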
Dates
Type When
Created 23 years, 1 month ago (July 27, 2002, 5:52 a.m.)
Deposited 1 year, 7 months ago (Jan. 13, 2024, 4:26 a.m.)
Indexed 1 week, 3 days ago (Aug. 26, 2025, 3:10 a.m.)
Issued 24 years, 8 months ago (Dec. 15, 2000)
Published 24 years, 8 months ago (Dec. 15, 2000)
Published Print 24 years, 8 months ago (Dec. 15, 2000)
Funders 0

None

@article{Vision_2000, title={The Origins of Genomic Duplications in Arabidopsis}, volume={290}, ISSN={1095-9203}, url={http://dx.doi.org/10.1126/science.290.5499.2114}, DOI={10.1126/science.290.5499.2114}, number={5499}, journal={Science}, publisher={American Association for the Advancement of Science (AAAS)}, author={Vision, Todd J. and Brown, Daniel G. and Tanksley, Steven D.}, year={2000}, month=dec, pages={2114--2117} }