Abstract
Methods of computational linguistics are used to demonstrate that a natural language such as English and organic chemistry have the same structure in terms of the frequency of, respectively, text fragments and molecular fragments. This quantitative correspondence suggests that it is possible to extend the methods of computational corpus linguistics to the analysis of organic molecules. It is shown that, within organic molecules, the bonds that have the highest information content are the ones that 1) define repeat/symmetry subunits and 2) in asymmetric molecules, define the loci of potential retrosynthetic disconnections. Linguistics‐based analysis appears well‐suited to the analysis of complex structural and reactivity patterns within organic molecules.
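The fragment-frequency comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the fragment labels (`"C-C"`, `"C-O"`, and so on) and the toy counts are hypothetical, the extraction of fragments from real molecules is not reproduced, and surprisal (`-log2 p`) is used here only as a generic stand-in for "information content". The sketch ranks fragments by frequency, estimates the slope of the log-log rank-frequency line (a slope near -1 is the classic Zipf signature), and assigns each fragment a surprisal value.

```python
import math
from collections import Counter


def rank_frequency(fragments):
    """Return (fragment, count) pairs sorted by descending count."""
    counts = Counter(fragments)
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def zipf_slope(ranked):
    """Least-squares slope of log(count) against log(rank).

    Needs at least two distinct fragments; a slope near -1 suggests
    Zipf-like behaviour.
    """
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for _, count in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def surprisal(ranked):
    """Map each fragment to -log2(relative frequency), in bits."""
    total = sum(count for _, count in ranked)
    return {frag: -math.log2(count / total) for frag, count in ranked}


# Hypothetical toy corpus: counts 12, 6, 4, 3 follow 12 / rank.
corpus = ["C-C"] * 12 + ["C-O"] * 6 + ["C-N"] * 4 + ["C-S"] * 3
ranked = rank_frequency(corpus)
slope = zipf_slope(ranked)
info = surprisal(ranked)
```

In this toy corpus the rarest fragment receives the highest surprisal, which mirrors the general idea that infrequent units carry more information; whether this matches the paper's exact measure is an assumption.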
References
29
Referenced
73
10.1002/3527607439 / Supramolecular Chemistry: Concepts and Perspectives, 1 ed. by Lehn J. M. (1995)
/ Foundations of statistical language processing by Manning C. D. (1999)
/ Thesaurus, Encyclopedia of artificial intelligence, 2nd ed. by Jones K. S. (1992), p. 1605
/ Theory and Applications of Natural Language Processing, Multi‐source, Multilingual Information Extraction and Summarization by Saggion H. (2013)
/ The Psychobiology of Language: An Introduction to Dynamic Philology by Zipf G. (1935)
/ Human Behavior and the Principle of Least Effort by Zipf G. (1949)
- While Zipf’s law is the most universal characteristic of any natural language, it alone does not describe the complexity of languages, the relations between grammar and vocabulary, and so on. Zipf’s law is also applicable to phenomena outside of linguistics, such as dolphins’ sounds, brain waves, and stock markets. For additional literature, see:
10.3390/e11040688
10.1371/journal.pone.0053227
10.1016/j.physrep.2012.01.007
10.1162/003355399556133
/ Corpus Linguistics: Method, Theory and Practice by McEnery T. (2012)
/ Corpus Linguistics and Language Technology by Dash N. S. (2005)
/ The Teacher’s Word Book of 30 000 Words by Thorndike E. L.
- I. S. P. Nation, Vocabulary size, text coverage and word lists, in Schmitt; McCarthy, Vocabulary: Description, Acquisition and Pedagogy, Cambridge University Press, Cambridge, 1997.
10.1002/j.1538-7305.1948.tb01338.x
10.1002/j.1538-7305.1948.tb00917.x
/ Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology by Gusfield D. (1999)
10.1039/c2sc00011c
- For example, in Chematica’s retrosynthetic module, each potential disconnection is checked against a database of several thousand expert‐coded reaction mechanisms. The program also scrutinizes stereochemistry and regiochemistry before allowing any disconnection. Still, performing such analyses for all bonds in a molecule is computationally very costly, and the linguistic approach helps shorten calculation times by preselecting only the most likely bonds and preventing the exponential “explosion” of possible retrosynthetic “trees”.
10.1002/ange.201202209
10.1002/anie.201202209
10.1002/ange.201202210
10.1002/anie.201202210
- Disclosure: B. A. G. has a financial interest in Chematica which is distributed by GSI (Grzybowski Scientific Inventions).
Dates
Type | When |
---|---|
Created | July 10, 2014, 4:10 p.m. |
Deposited | Oct. 16, 2023, 10:59 a.m. |
Indexed | April 16, 2025, 10:05 a.m. |
Issued | July 10, 2014 |
Published | July 10, 2014 |
Published Online | July 10, 2014 |
Published Print | July 28, 2014 |
@article{Cadeddu_2014, title={Organic Chemistry as a Language and the Implications of Chemical Linguistics for Structural and Retrosynthetic Analyses}, volume={53}, ISSN={1521-3773}, url={http://dx.doi.org/10.1002/anie.201403708}, DOI={10.1002/anie.201403708}, number={31}, journal={Angewandte Chemie International Edition}, publisher={Wiley}, author={Cadeddu, Andrea and Wylie, Elizabeth K. and Jurczak, Janusz and Wampler‐Doty, Matthew and Grzybowski, Bartosz A.}, year={2014}, month=jul, pages={8108–8112} }