Monday, September 05, 2005

Interesting Article: Finding Structure in Raw Data

A computer program that can uncover structure in raw data (“corpora of raw symbolic sequential data”). The abstract quoted below lists examples.

Unsupervised Learning Of Natural Languages

Zach Solan*, David Horn*, Eytan Ruppin**, and Shimon Edelman***

*School of Physics and Astronomy and **School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel; and ***Department of Psychology, Cornell University, Ithaca, NY 14853

Proceedings of the National Academy of Sciences (PNAS) 2005 102: 11629-11634; published online before print as 10.1073/pnas.0409746102. [A subscription is needed for the full text online; university libraries carry the print journal, and academics can access it online through their library.] Free full text of lead-up publications at

Address correspondence to Dr. Shimon Edelman:


We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production.

Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function.

This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.
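
To make the two ingredients named in the abstract a little more concrete, here is a toy sketch in Python. It is not the ADIOS algorithm and is not taken from the paper; it only assumes that "pattern extraction" can be loosely illustrated by counting recurring word trigrams, and "generalization" by grouping words that show up between the same two neighbours. The function names `frequent_ngrams` and `slot_classes` and the three-sentence corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def frequent_ngrams(corpus, n=3, min_count=2):
    """Count word n-grams over all sentences; keep those seen at least min_count times.

    Hypothetical helper: a crude stand-in for statistical pattern extraction.
    """
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def slot_classes(corpus):
    """Group words that occur between the same left and right neighbours.

    Hypothetical helper: a crude stand-in for generalization, where words
    sharing a context are treated as interchangeable.
    """
    slots = defaultdict(set)
    for sentence in corpus:
        tokens = sentence.split()
        for left, middle, right in zip(tokens, tokens[1:], tokens[2:]):
            slots[(left, right)].add(middle)
    # Keep only contexts filled by more than one distinct word.
    return {ctx: words for ctx, words in slots.items() if len(words) > 1}


# Made-up corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat sat on the rug",
]

patterns = frequent_ngrams(corpus)
classes = slot_classes(corpus)

print(patterns)
print(classes)
```

On this tiny corpus the trigram "sat on the" recurs in every sentence, and "cat" and "dog" end up grouped because both appear between "the" and "sat". The real algorithm, as the abstract says, works recursively and builds hierarchical patterns, which this sketch does not attempt.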

NB: For further details see Zach Solan’s Ph.D. thesis:

"The Syntax of Nature" - "The Nature of Syntax": a study of the hidden structures in human language and in other raw sequential data such as music, proteins, DNA and more...

TonySeb: What use could you get out of ADIOS? Perhaps we could learn how to “talk” to birds and dolphins, à la Dr. Dolittle.

More on scienceblog:

