May 26, 2011

Machine Translation - How It Works, What Users Expect, and What They Get

Machine Translation (MT) systems are now ubiquitous. This ubiquity is due to a combination of increased demand for translation in today's global marketplace and an exponential growth in computing power, which has made such systems viable. And under the right circumstances, MT systems are a powerful tool. They offer low-quality translations in situations where a low-quality translation is better than no translation at all, or where a rough translation of a large document delivered in seconds or minutes is more useful than a good translation delivered in three weeks.

Unfortunately, despite the wide availability of MT, it is clear that the purpose and limitations of these systems are often misunderstood, and their ability overestimated. In this article I want to give a brief overview of how MT systems work and how they can best be used. Then I will present some data on how Internet-based MT is being used right now, and show that there is a gap between the intended and actual use of such systems, and that users still need educating on how to use MT systems effectively.

How machine translation works

One might expect that a computer translation program would use the grammatical rules of the languages in question and combine them with some kind of in-memory "dictionary" to produce the resulting translation. And in fact that is, in essence, how some older systems worked. But most modern MT systems actually take a statistical approach that is completely "linguistically blind". In essence, the system is trained on a corpus of example translations. The result is a statistical model that contains information such as:

- "If the words (a, b, c) occur in succession in a sentence, there is an X% chance that the words (d, e, f) will occur in succession in the translation" (NB: there need not be the same number of words in each pair);
- "Given two consecutive words (a, b) in the target language, if word (a) ends in -X, there is an X% probability that word (b) will end in -Y".

Given a huge body of such observations, the system can translate a sentence by considering various candidate translations, made by stringing words together almost at random (in reality, via some "naive selection" process), and choosing the statistically most likely option.
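To make the idea concrete, here is a toy sketch in Python of "generate candidates, then pick the statistically most likely one". Everything in it is invented for illustration: the three-word vocabulary, the probabilities, and the names PHRASE_TABLE, BIGRAM_PROB and translate. Real systems use far larger models and smarter search, so treat this as a sketch of the principle rather than of any particular engine.

```python
import itertools
import math

# Hypothetical translation probabilities: source word -> {target word: probability}.
PHRASE_TABLE = {
    "la": {"the": 1.0},
    "casa": {"house": 0.8, "home": 0.2},
    "blanca": {"white": 0.9, "blank": 0.1},
}

# Hypothetical probabilities of one target word following another.
BIGRAM_PROB = {
    ("the", "white"): 0.05,
    ("white", "house"): 0.30,
    ("the", "house"): 0.10,
    ("house", "white"): 0.001,
}
UNSEEN_PROB = 1e-6  # small floor for word pairs never seen in "training"


def candidates(source_words):
    """Yield every choice of translations in every possible word order."""
    options = [list(PHRASE_TABLE[word].items()) for word in source_words]
    for choice in itertools.product(*options):
        for ordering in itertools.permutations(choice):
            yield ordering


def log_score(candidate):
    """Sum of log translation probabilities and log bigram probabilities."""
    words = [word for word, _ in candidate]
    total = sum(math.log(prob) for _, prob in candidate)
    for pair in zip(words, words[1:]):
        total += math.log(BIGRAM_PROB.get(pair, UNSEEN_PROB))
    return total


def translate(sentence):
    """Return the highest-scoring candidate as a string."""
    best = max(candidates(sentence.split()), key=log_score)
    return " ".join(word for word, _ in best)


print(translate("la casa blanca"))  # the white house
```

Note that nothing here knows that adjectives precede nouns in English. The word order comes out right only because "the white house" is the sequence best supported by the (made-up) statistics. A source word missing from the table would simply raise a KeyError in this sketch.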

On hearing this high-level description of how MT works, many people are surprised that such a "linguistically blind" approach works at all. What is more surprising is that it usually works better than rule-based systems. This is partly because relying on grammatical analysis itself introduces errors into the equation (automated analysis is not completely accurate, and people do not always agree on how to parse a sentence). And training a system on "bare text" lets you base it on far more data than would otherwise be possible: corpora of grammatically analysed texts are small and few and far between; pages of "bare text" are available in their trillions.

But what this approach does mean is that the quality of translation depends very much on how well elements of the source text are represented in the data originally used to train the system. If you accidentally type he will returned or vous avez demander (instead of he will return or vous avez demandé), the system will be hampered by the fact that sequences such as will returned are unlikely to have occurred often in the training data (or worse, may have occurred with a completely different meaning, as in they needed his will returned to the solicitor). And because the system has little notion of grammar (to work out, for example, that returned is a form of return, and that "the infinitive is likely after he will"), it has little to go on.
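A toy count shows why a mistyped sequence leaves a statistical system with so little evidence. The four-sentence "corpus" below is invented purely for illustration, as are the names TOY_CORPUS and bigram_counts; a real training corpus would contain many millions of sentences.

```python
from collections import Counter

# Invented miniature "training corpus" (an assumption for illustration only).
TOY_CORPUS = [
    "he will return tomorrow",
    "she will return the book",
    "they will return soon",
    "they needed his will returned to the solicitor",
]


def bigram_counts(sentences):
    """Count how often each pair of adjacent words occurs."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


COUNTS = bigram_counts(TOY_CORPUS)

print(COUNTS[("will", "return")])    # 3
print(COUNTS[("will", "returned")])  # 1
print(COUNTS[("he", "will")])        # 1
```

The well-formed pair is seen three times. The mistyped pair is seen only once, and in that one sentence "will" is a noun (a legal document), so what little evidence exists points towards the wrong meaning.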

Similarly, you may ask the system to translate a sentence that is perfectly grammatical and common in everyday use, but which includes features that happen not to have been common in the training data. MT systems are usually trained on the types of text for which human translations are readily available, such as technical or business documents, or transcripts of meetings of multilingual parliaments and conferences. This gives MT systems a natural bias towards certain types of formal or technical text. And even if everyday vocabulary is still covered by the training data, the grammar of everyday speech (such as using tú instead of usted in Spanish, or the present tense instead of the future tense in several languages) may not be.

MT systems in practice

Researchers and developers of translation systems have always been aware that one of the greatest dangers is public misperception of their purpose and limits. Somers (2003) [1], considering the use of MT on the web and in chat rooms, comments: "This increased visibility of MT has had a number of side effects [...] There is certainly a need to educate the general public about the low quality of raw MT, and, importantly, why the quality is so low." Looking at MT in use in 2009, there is unfortunately little evidence that awareness of these issues has improved.

To illustrate, I will present a small sample of data from a Spanish-English MT service that I provide on the Español-Inglés site. The service works by taking the user's input, applying some "cleanup" processes (such as correcting some common misspellings and decoding common instances of "SMS language"), and then looking for translations in (a) a bank of examples from the site's Spanish-English dictionary, and (b) an MT engine. Currently, Google Translate is used for the MT engine, although a custom engine may be used in the future. The figures I present here come from an analysis of 549 Spanish-English queries submitted to the system from machines in Mexico [2]; in other words, we assume that the majority of users are translating from their native language.
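The flow of such a service can be sketched roughly as follows. This is not the site's actual code: the tables, the lowercasing step, and the names clean and translate_query are assumptions made for illustration, and the MT engine is passed in as a plain function rather than a call to any real translation API.

```python
# Hypothetical cleanup tables (illustrative entries only).
SMS_EXPANSIONS = {"q": "que", "xq": "porque", "tb": "también"}
COMMON_MISSPELLINGS = {"dias": "días", "tambien": "también"}

# Hypothetical bank of example translations taken from a dictionary.
EXAMPLE_BANK = {"buenos días": "good morning"}


def clean(query):
    """Normalise spacing and case, then fix known misspellings and SMS forms."""
    fixed = []
    for word in query.lower().split():
        word = SMS_EXPANSIONS.get(word, word)
        word = COMMON_MISSPELLINGS.get(word, word)
        fixed.append(word)
    return " ".join(fixed)


def translate_query(query, mt_engine):
    """Try the example bank first, then fall back to the MT engine.

    mt_engine is any callable taking Spanish text and returning English text.
    Returns a (translation, source) pair so the caller can see which path ran.
    """
    cleaned = clean(query)
    if cleaned in EXAMPLE_BANK:
        return EXAMPLE_BANK[cleaned], "example bank"
    return mt_engine(cleaned), "mt engine"


def fake_engine(text):
    """Stand-in for a real MT engine, so the sketch runs offline."""
    return "MT(" + text + ")"


print(translate_query("Buenos   dias", fake_engine))
print(translate_query("xq no vienes", fake_engine))
```

Returning the source alongside the translation is a design choice for the sketch: it makes it easy to see when the curated examples answered the query and when the statistical engine had to.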

First, what do people use the MT system for? For each query, I attempted a "best guess" at the user's purpose in translating it. In many cases, the purpose is very clear; in a few cases, there is clearly ambiguity. With this caveat, I judge that in approximately 88% of cases the intended use is clear enough, and I categorize these uses as follows:

- Looking up a single word or term: 38%
- Translating a formal text: 23%
- Internet chat session: 18%
- Homework: 9%

A surprising (if not alarming!) observation is that in such a large proportion of cases, users are using the translator to look up a single word or term. In fact, 30% of the queries consisted of a single word. The finding is somewhat surprising given that the site in question also includes a Spanish-English dictionary, and it suggests that users confuse the purposes of dictionaries and translators. Although not shown in the raw figures, there were clearly some cases of consecutive searches where it seemed that a user was intentionally splitting up a sentence or phrase that would probably have been better translated if left together. Perhaps as a result of students being over-drilled in dictionary use, we see, for example, a query for cuarto para ("quarter to") followed immediately by a query for a number. There is a clear need to educate students and users in general about the difference between the electronic dictionary and the machine translator [3]: in particular, that a dictionary guides the user in choosing the appropriate translation given the context, but requires single-word or single-phrase lookups, while a translator usually works best on complete sentences and, given a single word or term, will simply report the statistically most frequent translation.

I estimate that in less than a quarter of cases, users are using the MT system for its "trained-for" purpose of translating, or getting a rough translation of, a formal text (and are entering a whole sentence, or at least a partial sentence rather than an isolated noun phrase). Of course, it is impossible to know whether any of these translations were then intended for publication without further proofreading, which is definitely not the intended use of the system.

Use for translating formal texts is now almost rivalled by use for translating informal online chat sessions, a context for which MT systems are usually not trained. Online chat poses special problems for MT systems, as features such as non-standard spelling, lack of punctuation, and informal expressions not found in other written contexts are common. Translating chat sessions effectively would probably require a dedicated system trained on a more suitable (and possibly custom-built) corpus.

It is not too surprising that students are using MT systems to do their homework. But it is interesting to note to what extent and how. In fact, use for homework shows a mixture of "fair use" (understanding an exercise) with attempts to "get the computer to do the homework" (with predictably disastrous results in some cases). Queries categorized as homework include sentences that are obviously instructions for exercises, plus certain sentences stating trivial generalities that would be unusual in a text or a conversation but are typical of beginners' homework exercises.

Whatever the use, a problem for users and designers alike is the frequency of errors in the source text that are likely to impede translation. In fact, over 40% of the queries contained such errors, with some queries containing several. The most common errors were as follows (queries for single words and terms were excluded when calculating these figures):

- Missing accents: 14% of queries
- Missing punctuation: 13%
- Other spelling errors: 8%
- Grammatically incomplete sentence: 8%

Bearing in mind that in most cases users were translating from their native language, users appear to underestimate the importance of using standard spelling to give the best chance of a good translation. More subtly, users do not always understand that the translation of one word can depend on another, and that the translator's task is harder if grammatical units are incomplete, so that queries such as hoy es día de are not uncommon. Such queries impede translation because the chance of a sentence in the training data having, for example, a "dangling" preposition like this is slim.

Teach ...?

Currently, there is a mismatch between the performance of MT systems and user expectations. I see the responsibility for addressing this mismatch as lying in the hands of developers, users and educators alike. Users need to think more about making their source sentences "MT-friendly" and learn how to assess the output of MT systems. Language courses should address these issues: learning to use computer translation tools effectively needs to be seen as an important part of learning a language. And developers, including myself, have to think about how to make the tools we provide better suited to the needs of language users.

Notes

[1] Somers (2003), "Machine Translation: Latest Developments", in The Oxford Handbook of Computational Linguistics, OUP.
[2] The odd number is simply because queries matching the selection criteria were captured with random probability within a fixed time frame. It should be noted that the system for deriving a machine's country from its IP address may not be entirely accurate.
[3] If the user enters only a single word into the system in question, a message is displayed beneath the translation suggesting that the user would get a better result by consulting the site's dictionary.
