To keep computational complexity limited, the subtrees of p that are considered, as well as the subtrees that occur on the source and target sides of the grammar G, have been restricted to horizontally complete subtrees, including bottom-up subtrees.
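To make the restriction concrete, the following is a minimal sketch of what "horizontally complete" means: a kept node keeps either all of its children or none of them. The Node class and function names are our illustration, not code from the PaCo-MT system.

```python
from itertools import product

class Node:
    """A minimal parse tree node; labels are e.g. categories or POS tags."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def horizontally_complete_subtrees(node):
    """Enumerate subtrees rooted at `node` in which every kept node keeps
    either all of its children or none (no partial sibling sets)."""
    yield Node(node.label)  # cut here: keep the node, drop all children
    if node.children:
        # keep all children, combining one complete variant per child
        variants = [list(horizontally_complete_subtrees(c)) for c in node.children]
        for combo in product(*variants):
            yield Node(node.label, list(combo))
```

The bottom-up subtrees mentioned above are the special case in which all children are kept recursively, down to the leaves.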
When no matching grammar rule is found, a horizontally complete subtree is constructed, as explained in Sect. Adding a simple bilingual word form dictionary is optional. When a word translation is not found in the transduction grammar, the word is looked up in this dictionary. If the word has multiple translations in the dictionary, each of these translations receives the same weight and is combined with the translated label (usually a part-of-speech tag).
When the word is not in the dictionary, or no dictionary is present, the source word is transferred as-is to Q. In a first step, the transducer performs bottom-up subtree matching, which is analogous to the use of phrases in phrase-based SMT, but restricted to linguistically meaningful phrases. Bottom-up subtree matching functions like a sub-sentential translation memory: every linguistically meaningful phrase that has been encountered in the data is considered in the transduction process, obliterating the distinction between a translation memory, a dictionary and a parallel corpus [45].
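The lexical back-off chain just described (grammar first, then the optional dictionary, finally copying the source word) can be summarised as follows. This is a hedged sketch over flat word lookups; the function name and the dict-based representations are our assumptions, while the real system operates on labelled subtrees.

```python
def translate_word(word, pos, grammar, dictionary=None):
    """Back-off chain: transduction grammar -> bilingual dictionary -> copy.
    `grammar` maps a source word to weighted (target word, target label)
    candidates; `dictionary` maps a source word to a list of target words."""
    if word in grammar:
        return grammar[word]  # weighted candidates from the induced grammar
    if dictionary and word in dictionary:
        targets = dictionary[word]
        weight = 1.0 / len(targets)  # every dictionary translation gets the same weight
        return [((t, pos), weight) for t in targets]  # combined with the translated label
    return [((word, pos), 1.0)]  # no translation found: transfer the source word as-is
```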
These matches include single-word translations together with their parts-of-speech. An example of a grammar rule with horizontally complete subtrees on both source and target sides was shown in Fig. This rule has three alignment points, as indicated by the indices. F_g is the frequency of occurrence of grammar rule g, and F_{d_g} is the frequency of occurrence of the source side d_g of grammar rule g; analogously, F_h is the frequency of occurrence of grammar rule h, and F_{d_h} is the frequency of occurrence of the source side d_h of grammar rule h. When constructing a horizontally complete subtree fails, a grammar rule is constructed by translating each child separately.
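These frequencies suggest weighting a rule by its relative frequency given its source side, i.e. F_g / F_{d_g}. The exact formula is not reproduced in this extract, so the sketch below is one plausible reading under that assumption, with counts harvested from the aligned training trees; all names are illustrative.

```python
from collections import Counter

rule_freq = Counter()    # F_g: occurrences of the full rule (source side, target side)
source_freq = Counter()  # F_{d_g}: occurrences of the source side d_g alone

def count_rule(source_side, target_side):
    """Collect counts while inducing rules from the aligned treebank."""
    rule_freq[(source_side, target_side)] += 1
    source_freq[source_side] += 1

def rule_weight(source_side, target_side):
    """Relative frequency of the rule given its source side: F_g / F_{d_g}."""
    return rule_freq[(source_side, target_side)] / source_freq[source_side]
```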
The main task of the target language generator is to determine word order, as the packed forest contains unordered trees. An additional task of the generator is to provide information for lexical selection, similar to the language model in phrase-based SMT [23]. The target language generator has been described in detail in [47], but the system has since been generalised, improved, and adapted to work with weighted packed forests as input.
For every node in the forest, the surface order of its children needs to be determined. Given, for example, a bag of three children under an NP, a large monolingual treebank is searched for NPs containing these three elements, and for the orders in which they occur most often, using the relative frequency of each permutation as a weight.
When still no match is found, all permutations are generated with equal weight, and a penalty based on the distance between the source language word order and the target language word order is applied, to avoid generating too many solutions with exactly the same weight.
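A sketch of this ordering strategy is given below, assuming treebank statistics are available as a mapping from (parent label, child bag) to permutation frequencies. The names and the exact shape of the distance penalty are our assumptions, not the paper's formula.

```python
from itertools import permutations

def order_children(parent_label, children, treebank_counts):
    """Weighted candidate orders for the children of a forest node."""
    key = (parent_label, tuple(sorted(children)))
    counts = treebank_counts.get(key)
    if counts:
        # attested orders, weighted by the relative frequency of each permutation
        total = sum(counts.values())
        return [(list(order), freq / total) for order, freq in counts.items()]
    # back-off: every permutation with an equal base weight, penalised by
    # its distance from the source language order (cf. distortion)
    candidates = []
    for perm in permutations(range(len(children))):
        distance = sum(abs(i - p) for i, p in enumerate(perm))
        candidates.append(([children[p] for p in perm], 1.0 / (1.0 + distance)))
    return candidates
```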
This is related to the notion of distortion in IBM Model 3 [5]. In the example bag, there are two types of information for each child: the part-of-speech and the word token, but as already pointed out in Sect. The functionality of the generator is similar to the one described in [17], but relative frequency of occurrence is used instead of n-grams of dependencies. Large monolingual target language treebanks have been built by using the target sides of the parallel corpora and adding the British National Corpus (BNC).
This dictionary is only used for words for which the grammar does not provide a translation. These results show that the best-scoring condition is trained on all the data apart from DGT, which seems to deteriorate performance when included. Adding the dictionary is beneficial under all conditions. Error analysis shows that the system often fails when using the back-off models, whereas it seems to function properly when horizontally complete subtrees are found. Comparing the results with Moses [24] shows that our syntax-based approach still has a long way to go before it is on a par with phrase-based SMT.
The difference in score is partly due to remaining bugs in the PaCo-MT system, which cause no output in 2. Nevertheless, the PaCo-MT system has not yet reached its full maturity, and there are several ways to improve the approach, as discussed in Sect. With the research presented in this paper, we wanted to investigate an alternative approach towards MT that does not use n-grams or any other techniques from phrase-based SMT systems.
A detailed error analysis and comparison between the different conditions will reveal what can be done to improve the system. Different alignment parameters can extract more useful information from the same set of data. Different approaches to grammar induction could also improve the system, as grammar induction is currently limited to horizontally complete subtrees. STSGs allow more complex grammar rules, including horizontally incomplete subtrees.
Another improvement can be expected from work on the back-off strategy in the transducer, such as the real-time construction of new grammar rules on the basis of partial grammar rules. The system could also be converted into a syntactic translation aid by only taking the decisions of which it is confident and backing off to human decisions in cases of data sparsity.
It remains to be tested whether this approach would be useful. Further investigation of the induced grammar could reduce the number of grammar rules by implementing a default inheritance hierarchy, similar to [13], speeding up the system without any negative effect on the output. Previous versions were described in [48] and [49].
Limited restructuring is applied to make the resulting parse trees more uniform. For instance, nouns are always placed under an NP.
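As an illustration of such a uniformity transform, the sketch below wraps bare nouns in an NP node. It reuses the Node class from the earlier sketch; the label conventions (N, NP) and the function itself are our illustration, not the system's actual normalisation code.

```python
def normalise(node, parent_label=None):
    """Restructure a parse tree so that every noun hangs under an NP."""
    node.children = [normalise(child, node.label) for child in node.children]
    if node.label == "N" and parent_label != "NP":
        return Node("NP", [node])  # insert an NP layer above a bare noun
    return node
```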
A similar restructuring of syntax trees is shown by [52] to improve translation results. This definition is inspired by [10]. The edge labels have been omitted from these examples, but were used in the actual rule induction. This phrase-based SMT system was trained on the same training data and evaluated on the same test set, using 5-grams without minimum error rate training.

Open Access.
This chapter is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Parse and Corpus-Based Machine Translation. First Online: 11 November.
A source language (SL) sentence gets syntactically analysed by a pre-existing parser, which leads to a source language parse tree, abstracting away from the surface order. This is described in Sect.
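Read together with the transducer and generator described above, the overall pipeline is a three-stage composition. The sketch below is our schematic rendering with placeholder stubs; parse, transduce and generate are not actual function names from the system.

```python
def parse(sentence):
    """Placeholder for the pre-existing source language parser."""
    raise NotImplementedError

def transduce(sl_tree):
    """Placeholder for the tree transducer (yields a weighted packed forest)."""
    raise NotImplementedError

def generate(forest):
    """Placeholder for the target language generator (fixes word order)."""
    raise NotImplementedError

def translate(sentence):
    sl_tree = parse(sentence)    # SL parse tree, abstracting from surface order
    forest = transduce(sl_tree)  # weighted packed forest of unordered TL trees
    return generate(forest)      # TL generator determines word order and output
```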