New algorithm for learning languages

Science Blog ^ | 8/31/2005 | Blogger

Posted on 09/01/2005 10:00:03 AM PDT by TChris

Cornell University and Tel Aviv University researchers have developed a method that lets a computer program scan text in any of a number of languages, including English and Chinese, and infer the underlying rules of grammar autonomously, with no prior information. The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences.

The development -- which has a patent pending -- has implications for speech recognition and for other applications in natural language engineering, as well as for genomics and proteomics. It also offers new insights into language acquisition and psycholinguistics.

"The algorithm -- the computational method -- for language learning and processing that we have developed can take a body of text, abstract from it a collection of recurring patterns or rules and then generate new material," explained Shimon Edelman, a computer scientist who is a professor of psychology at Cornell and co-author of a new paper, "Unsupervised Learning of Natural Languages," published in the Proceedings of the National Academy of Sciences (PNAS, Vol. 102, No. 33).

"This is the first time an unsupervised algorithm is shown capable of learning complex syntax, generating grammatical new sentences and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics," he said.

Unlike previous attempts at developing computer algorithms for language learning, the new method, called Automatic Distillation of Structure (ADIOS), successfully identifies complex patterns in raw texts. The algorithm discovers the patterns by repeatedly aligning sentences and looking for overlapping parts.

For example, the sentences "I would like to book a first-class flight to Chicago," "I want to book a first-class flight to Boston" and "Book a first-class flight for me, please" may give rise to the pattern "book a first-class flight" -- if this candidate pattern passes the novel statistical significance test that is the core of the algorithm.
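To make the "overlapping parts" idea concrete, here is a toy Python sketch that finds the longest run of words shared by all three example sentences. It is not ADIOS: it skips the statistical significance test entirely and just intersects word n-grams, and the function names (`tokens`, `ngrams`, `longest_shared_ngrams`) are made up for this illustration.

```python
import re


def tokens(sentence):
    # Lowercase and split into words, keeping internal hyphens
    # so that "first-class" stays a single token.
    return re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())


def ngrams(words, n):
    # All contiguous runs of n words, as a set of tuples.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def longest_shared_ngrams(sentences):
    # Hypothetical helper: returns the longest word sequences that
    # occur in every sentence. No significance test is applied.
    tokenized = [tokens(s) for s in sentences]
    if not tokenized:
        return []
    shortest = min(len(t) for t in tokenized)
    for n in range(shortest, 0, -1):
        shared = set.intersection(*(ngrams(t, n) for t in tokenized))
        if shared:
            return sorted(" ".join(g) for g in shared)
    return []


sentences = [
    "I would like to book a first-class flight to Chicago",
    "I want to book a first-class flight to Boston",
    "Book a first-class flight for me, please",
]
print(longest_shared_ngrams(sentences))
```

On these three sentences the only shared four-word run is "book a first-class flight"; the real algorithm would additionally decide whether that candidate is statistically significant before accepting it as a pattern.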

If the system also encounters the sentences "I need to book a direct flight from New York to Tel Aviv" and "I would like to book an economy flight," it may infer that the phrases "first-class," "direct" and "economy" are equivalent in the context of the new pattern. "Because such equivalence sets can contain other patterns -- in turn containing further patterns, and so on -- the resulting body of knowledge grows recursively, as a sort of forest of branching trees of possibilities," said Edelman.
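A rough Python sketch of the equivalence-set step, under stated assumptions: the pattern is already known and written by hand as a regular expression with one open slot, and "a" and "an" are treated as the same word. Neither assumption comes from the paper, and `slot_fillers` is a hypothetical helper name; ADIOS discovers both the pattern and the slot on its own.

```python
import re

# Hand-written stand-in for a learned pattern: "book a/an ___ flight".
# Treating "a" and "an" as interchangeable is an assumption of this sketch.
PATTERN = re.compile(r"\bbook an? ([a-z-]+) flight\b")


def slot_fillers(sentences, pattern=PATTERN):
    # Collect every distinct word that fills the open slot of the pattern.
    fillers = set()
    for sentence in sentences:
        match = pattern.search(sentence.lower())
        if match:
            fillers.add(match.group(1))
    return sorted(fillers)


corpus = [
    "I would like to book a first-class flight to Chicago",
    "I want to book a first-class flight to Boston",
    "Book a first-class flight for me, please",
    "I need to book a direct flight from New York to Tel Aviv",
    "I would like to book an economy flight",
]
print(slot_fillers(corpus))
```

The words collected this way ("direct," "economy," "first-class") form one equivalence set. In the method described above, a member of such a set can itself be a pattern, which is where the recursive "forest of trees" comes from.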

He added, "ADIOS relies on a statistical method for pattern extraction and on structured generalization -- two processes that have been implicated in language acquisition. Our experiments show that it can acquire intricate structures from raw data, including transcripts of parents' speech directed at 2- or 3-year-olds. This may eventually help researchers understand how children, who learn language in a similar item-by-item fashion and with very little supervision, eventually master the full complexities of their native tongue."

In addition to child-directed language, the algorithm has been tested on the full text of the Bible in several languages, on artificial context-free languages with thousands of rules and on musical notation. It also has been applied to biological data, such as nucleotide base pairs and amino acid sequences. In analyzing proteins, for example, the algorithm was able to extract from amino acid sequences patterns that were highly correlated with the functional properties of the proteins.

The new method was developed jointly with David Horn and Eytan Ruppin, professors of physics and computer science, respectively, at Tel Aviv University, and with Zach Solan, a doctoral student there and the lead author on the paper. Their collaboration with Edelman was supported in part by the U.S.-Israel Binational Science Foundation.

TOPICS: Culture/Society; Israel
KEYWORDS: algorithm; computers; cornell; genetics; language; science; technology; telaviv

This may not qualify as news, since the original paper was published in 2003, but I didn't find it here, and it's profoundly cool technology.

1 posted on 09/01/2005 10:00:12 AM PDT by TChris

To: TChris

Sounds like the beginnings of Star Trek's Universal Translator.

2 posted on 09/01/2005 10:09:33 AM PDT by radiohead (Proud member of the 'arrogant supermagt')

To: TChris

Most interesting! Will this method put translators out of business?

3 posted on 09/01/2005 10:10:02 AM PDT by Ciexyz (Let us always remember, the Lord is in control.)

To: TChris

Hey that's great
Now Lee Iacocca can understand Snoop Doggy Dogg
when they do more commercials together...

4 posted on 09/01/2005 10:11:57 AM PDT by joesnuffy (Save the whales. Redeem them for valuable prizes.)

To: radiohead

But will it work for Ebonics?

5 posted on 09/01/2005 10:12:01 AM PDT by scouse

To: scouse

But will it work for Ebonics?

It seems to work with data that has a consistent set of grammar rules. ...so maybe not. :-)

6 posted on 09/01/2005 10:14:24 AM PDT by TChris ("The central issue is America's credibility and will to prevail" - Goh Chok Tong)

To: TChris; scouse

But will it work for Ebonics?
It seems to work with data that has a consistent set of grammar rules. ...so maybe not.

Both y'all are very funny. : )

7 posted on 09/01/2005 10:58:32 AM PDT by radiohead (Proud member of the 'arrogant supermagt')

To: TChris

If as good as they say, this is a bit sooner than I thought. And, it's grammar, not meaning.

That said, it's going to be fun when they get good enough to start putting court decisions into the machine. It could lead to finding that many decisions are not consistent, and may be distinctly out of phase with each other and with higher court decisions.

But that's for next month.

8 posted on 09/01/2005 11:03:22 AM PDT by Blagden Alley

To: dennisw; Cachelot; Yehuda; Nix 2; veronica; Catspaw; knighthawk; Alouette; Optimist; weikel; ...

If you'd like to be on this middle east/political ping list, please FR mail me.

..................

The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences.

I'd love to run Cindy Sheehan's ravings through it. A few others too.

9 posted on 09/01/2005 11:08:49 AM PDT by SJackson (“I worry that I've seen this movie before”, Rep. Mark Kirk on aid to palestinians.)

To: TChris

"I have Al Gore Rythm!!"

10 posted on 09/01/2005 1:14:12 PM PDT by paltz

To: TChris

bump for later reading

11 posted on 09/01/2005 4:57:30 PM PDT by Kevin OMalley (No, not Freeper#95235, Freeper #1165: Charter member, What Was My Login Club.)

To: paltz

Is this an algorithm: "I invented the internet"

No, that's an AlGore-ism

Ohhhh .... I just thought that would help with learning languages.

12 posted on 09/01/2005 10:21:36 PM PDT by Optimist (I think I'm beginning to see a pattern here.)

