Teach Time Encyclopedia - Learn About Our World
Thursday, December 04, 2008

Traduki

Traduki is an open source machine translation program, developed with the Lua programming language and released under the GNU General Public License. It is a tool being developed to give free speech and translation to everyone. Traduki means "to translate" in Esperanto.

Development was suspended in mid-2002, but restarted in 2003.


Machine translation is a complex task. The following are preliminary ideas.

Table of contents
1 Input
2 Tokenization
3 Morphological analysis
4 Syntactical analysis
5 Disambiguation
6 Semantic Disambiguation
7 Translation to an interlanguage
8 Destination language synthesis
9 See also
10 External links and references

Input

Input is the reading of the original English text. This can come from a simple console, GUI, or web interface, but it can also come from more complicated sources such as OCR, handwriting recognition or speech recognition.

Tokenization

Tokenization is the division of the text into sentences, and of sentences into words and punctuation. The division of the text into sentences can be done using "!", "?" and "." as separators. However, "." is also used in numbers (e.g. 10.233), abbreviations (e.g. Dr.) and initials (e.g. A. C. Doyle), so it does not always mark the end of a sentence. The punctuation marks ",", ";", ":", quotation marks (including » «), "()" and "[]" can also be used to separate semi-independent sentences.
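The splitting rules above can be sketched in a few lines of Python. This is only an illustration, not Traduki's own code: the function names and the tiny abbreviation list are assumptions made for the example, and a real system would need a much larger list.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc."}


def is_initial(token):
    """True for a single capital letter followed by a period, e.g. 'A.'."""
    return re.fullmatch(r"[A-Z]\.", token) is not None


def split_sentences(text):
    """Split text at '.', '!' and '?', skipping periods that belong to
    known abbreviations or initials. Periods inside numbers such as
    10.233 never end a whitespace-separated token, so they are ignored."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] not in ".!?":
            continue
        if token.lower() in ABBREVIATIONS or is_initial(token):
            continue
        sentences.append(" ".join(current))
        current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


def split_words(sentence):
    """Split a sentence into words, numbers and punctuation marks."""
    tokens = []
    for chunk in sentence.split():
        if chunk.lower() in ABBREVIATIONS or is_initial(chunk):
            tokens.append(chunk)  # keep 'Dr.' and 'A.' whole
        else:
            tokens.extend(re.findall(r"\d+(?:\.\d+)*|\w+|[^\w\s]", chunk))
    return tokens
```

The sketch is deliberately naive: a sentence that really ends in an abbreviation or a single capital letter will not be split, which is one of the tokenization problems mentioned above.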

The article "What is a word, What is a sentence? Problems of Tokenization" is a good discussion of tokenization problems.

Morphological analysis

Each word must be analyzed to identify derived words. Dictionaries used in machine translation do not list words derived from simpler words, so derived words must be identified by the program itself. Verb forms and plurals are the most common derived words.
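As a rough illustration of this step, the sketch below strips a few common English suffixes and accepts a candidate only if it appears in a dictionary of roots. The root list, the rule table and the feature labels are all hypothetical, and the rules cover only a small part of English morphology (no irregular forms, no consonant doubling).

```python
# Hypothetical root dictionary: only root words are listed.
ROOTS = {"boy", "eat", "hamburger", "walk", "church", "try"}

# (suffix, replacement, feature) rules, tried in order.
# An illustrative subset, not a complete description of English.
SUFFIX_RULES = [
    ("ies", "y", "plural or third person singular"),
    ("es", "", "plural or third person singular"),
    ("s", "", "plural or third person singular"),
    ("ing", "", "present participle"),
    ("ied", "y", "past"),
    ("ed", "", "past"),
]


def analyze(word):
    """Return (root, feature) for a word, or None if no root is found."""
    word = word.lower()
    if word in ROOTS:
        return (word, "root")
    for suffix, replacement, feature in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in ROOTS:
                return (candidate, feature)
    return None
```

Checking each candidate against the root dictionary keeps the rules from producing non-words, at the cost of returning nothing for roots the dictionary does not know.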

The Natural Language Toolkit project[1] has some Python code that could be reused in Traduki. However, Natural Language Toolkit is released under the IBM Common Public License 0.5. Can we use the code?

Syntactical analysis

Syntactical analysis is the determination of the syntactic function of the words. The program should discover if a word is a "verb" or a "noun". A dictionary with the syntactic classification of all root words must be used. WordNet[1] is a good source of data to build a good English dictionary.
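A dictionary of this kind can be pictured as a mapping from each root word to the set of syntactic classes it may have. The entries below are a hand-made, hypothetical sample, not data taken from WordNet, and the function names are assumptions for the example.

```python
# Hypothetical hand-made dictionary; a real one might be built from
# a resource such as WordNet.
SYNTACTIC_CLASSES = {
    "the": {"article"},
    "fat": {"adjective", "noun"},
    "boy": {"noun"},
    "eat": {"verb"},
    "hamburger": {"noun"},
}


def classify(roots):
    """Pair each root word with its possible syntactic classes."""
    return [(root, SYNTACTIC_CLASSES.get(root, {"unknown"})) for root in roots]


def ambiguous(roots):
    """Return the roots that have more than one possible class."""
    return [root for root, classes in classify(roots) if len(classes) > 1]
```

Storing a set of classes per root, rather than a single class, makes it easy to see which words still need to be resolved by a later step.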

Disambiguation

A word can have more than one syntactic function. For example, "fat" can be an adjective ("The fat boy eats hamburgers") and can be a noun ("Hamburgers have lots of fat"). So, how do we know that "fat" in the sentence "Hamburgers have lots of fat" is a noun? There are two methods:

Semantic Disambiguation

Sometimes, some ambiguity may remain after the application of the methods described above. Semantic information may be used to solve the problem. That is why a good dictionary must have some semantic information. For example, words related to music should be marked as such.
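One simple way to use such marks, shown here only as a sketch, is to count the semantic domains of the surrounding words and prefer the sense whose domain occurs most often. The word "bass", its two senses, the domain tags and the function name are all hypothetical examples, not part of Traduki.

```python
from collections import Counter

# Hypothetical dictionary: each sense of a word carries a domain tag.
SENSES = {
    "bass": [
        ("low-pitched voice or instrument", "music"),
        ("kind of fish", "fishing"),
    ],
}

# Hypothetical domain marks for unambiguous context words.
DOMAINS = {
    "guitar": "music",
    "concert": "music",
    "river": "fishing",
    "hook": "fishing",
}


def choose_sense(word, context):
    """Pick the sense whose domain is most frequent among the context
    words; return None when there is no semantic evidence at all."""
    senses = SENSES.get(word, [])
    if not senses:
        return None
    counts = Counter(DOMAINS[w] for w in context if w in DOMAINS)
    best = max(senses, key=lambda sense: counts[sense[1]])
    if counts[best[1]] == 0:
        return None
    return best[0]
```

Returning None when no context word carries a matching mark leaves the decision to some other method instead of guessing.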

Translation to an interlanguage

All the syntactic, morphological and semantic information should be codified in an interlanguage. All the source language root words should be translated to interlanguage root words. Esperanto is often used as an intermediate language (including in Traduki) because 99% of Esperanto words have only one sense and because Esperanto is already somewhat of an interlanguage.
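To make the idea concrete, the sketch below maps an English root and its syntactic class to an Esperanto stem, then encodes the morphological information with regular Esperanto endings (-o for nouns, -a for adjectives, -j for plurals, -as, -is and -os for verb tenses). The root table and the function name are hypothetical, and this is not a description of Traduki's actual internal representation.

```python
# Hypothetical table from (English root, syntactic class) to Esperanto stem.
ROOT_TABLE = {
    ("boy", "noun"): "knab",
    ("dog", "noun"): "hund",
    ("fat", "adjective"): "dik",
    ("fat", "noun"): "gras",
    ("see", "verb"): "vid",
}

# Regular Esperanto verb endings for three tenses.
VERB_ENDINGS = {"present": "as", "past": "is", "future": "os"}


def to_interlanguage(root, word_class, plural=False, tense="present"):
    """Build an Esperanto word from a root, its class and its features."""
    stem = ROOT_TABLE[(root, word_class)]
    if word_class == "noun":
        return stem + "o" + ("j" if plural else "")
    if word_class == "adjective":
        return stem + "a" + ("j" if plural else "")
    if word_class == "verb":
        return stem + VERB_ENDINGS[tense]
    raise ValueError("unsupported word class: " + word_class)
```

Note that the table is keyed on the syntactic class as well as the root, so "fat" as an adjective and "fat" as a noun map to different stems once the earlier disambiguation steps have chosen a class.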

Ergane, a free-to-use multilingual dictionary that uses Esperanto as an interlanguage, can be useful for Traduki.

Destination language synthesis

The synthesis of the destination language from the interlanguage is an easy step. There are, however, some problems:

See also

External links and references

Useful resources for the Traduki project

Online articles

Books



Brought to you by NoChildLeftBehind.com and the Beaches and Towns Network, LLC.