Bi-Lingual Machine Translation Approach using Long Short-Term Memory Model for Asian Languages

Objectives: To develop an appropriate machine translation model for translating text from English to Tamil. Methods: The proposed work uses a Gated Recurrent Unit (GRU) Long Short-Term Memory (LSTM) model. The Repeat Vector function is used for fitting both the decoder and encoder parts of the network model. The Adam optimizer is used because of its faster execution and lower memory consumption. The work mainly uses text corpora available in Internet repositories, namely Technology Development for Indian Languages (TDIL), Linguistic Data Consortium for Indian Languages (LDCIL), Kaggle, and Ishikahooda. Findings: The motivation for the proposed work emerged from identifying the regional language Tamil as one of the less frequently covered languages in the existing translation systems; the Tamil character set is one of the challenging factors behind the scarcity of such systems. The proposed system produces a BLEU score of 0.9, a Meteor score of 0.98, a TER score of 0.5, a WER of 20%, an Accuracy rating of 5, and an Adequacy rating of 5 (both on a 5-point grading scale), which are significantly better than the existing systems. Novelty: The space complexity of the proposed LSTM-based English-Tamil Translator is fine-tuned to 256 units of memory using the Adam optimizer to achieve lower storage consumption, and the number of layers is optimized to reduce the execution time. Unicode Transformation Format (UTF-8) encoding is used to incorporate Tamil language characters. The work has been implemented with a wide range of sentences numbering several thousand. The LSTM-based English-Tamil Translator is helpful for bilingual learners who are specifically learning the Tamil language.


Introduction
Machine translation is an interdisciplinary research domain encompassing various fields such as Artificial Intelligence, Natural Language Processing, Mathematics, and Statistics. The translation services provided by machine translation systems are highly valuable in this scenario (1)(2)(3). Translation tasks were primarily handled using classical and statistical methodologies. However, the classical approach was not preferred by many due to its pitfalls: the rules must be developed by an expert, and a huge number of strategies and exceptions are required. The data-driven statistical approach outperforms the classical approach (4)(5)(6). This technique also eliminates the need for a linguistic expert and requires only a text corpus of both source- and target-language examples. However, the results are superficial for languages with different word orders. Deep Neural Machine Translation models are being considered for a wider range of translation tasks in recent times. LSTM uses memory cells to retain values and requires a minimal amount of training data. Hence, the proposed system improves on the existing systems by using an NMT model involving GRU LSTM in both the encoding and decoding phases (7)(8)(9). The proposed work has been implemented using a large number of sentences, numbering several thousand. Both the human subjective scores (Adequacy, Fluency, Accuracy) and automated scores (BLEU, Meteor) have been used for analyzing the results (10)(11)(12).

Literature Review
An exhaustive review has been conducted on the existing machine translation models used in various language conversions and comparisons (5, 13-17). A neural machine translation system was developed for converting sentences from Marathi to English; the Byte Pair Encoding algorithm is used for representing the input sequences, and LSTMs are used to achieve the required translation (5). A machine learning-based translation model was developed for converting text from Hindi to English. A pattern recognizer using a quantum neural network is incorporated to recognize and learn the corpus patterns, and the system performs machine translation using Devanagari-Hindi and English sentence pairs; a dataset of 2600 sentences was used for the implementation (13). A model for converting English text into two Indian languages, Tamil and Punjabi, was developed using NMT. This model includes both an attention mechanism and a score function for effective translation. Human evaluators assessed the quality of the translated output with the Bilingual Evaluation Understudy (BLEU), considering the fluency and adequacy of the predicted output (14). A transformer-based NMT model was developed for translating English sentences into eight Asian languages; the model highlights the layer normalization component for the convergence of the different translations, and the eight translation approaches are combined in a multilingual NMT model (15).
A framework has been suggested for a neural machine algorithm for the translation of English resources. The model combines neural machine and statistical machine translation for effective word translation, and the decoder uses a vocabulary alignment structure to reduce vocabulary-related problems (16). A classical translation system known as the English-to-Bengali MT system was developed for translation between English and Bengali. The system performs well in some circumstances, but it was created with a tiny corpus, and scaling to a larger corpus makes building a large database difficult. Due to the structural similarities between Urdu and Hindi, an English-to-Urdu machine translation system was developed using Hindi as the medium of conversion: the input English sentences are first translated into Hindi, and the Hindi is then translated into Urdu. The system uses Interlingua and rule-based methodologies, and each Urdu term is mapped to the matching Hindi word using a Hindi-Urdu mapping database. According to industry standards, the system's BLEU score for translating from English to Urdu is 0.3544 (17).

Research Gap
It has been observed that some of the existing systems were developed through classical and statistical methods (5, 13-17). Existing research on language translation is found to be very minimal for the conversion of text from English to Tamil. This motivated the proposed system, and an exhaustive analysis was made of a few such existing systems. The space complexity of the proposed LSTM-based English-Tamil Translator is fine-tuned to 256 units of memory. The number of layers is optimized to achieve less time complexity in generating the desired output. Unicode Transformation Format (UTF-8) encoding is used to incorporate Tamil language characters. Even the few systems that were developed using NMT use a low number of sentences; this work supports bilingual learners, especially in learning the Tamil language, with a wide range of sentences numbering several thousand.
https://www.indjst.org/

Methodology
The proposed LSTM-based English-Tamil Translator is implemented using the LSTM model. Figure 1 depicts the outline of the proposed methodology. A detailed explanation of the working of the LSTM-based English-Tamil Translator system is provided in the following sections:

Preprocessing
Pre-processing of the text corpus is considered the most important phase in developing an NMT system. It is a two-step process comprising tokenization and cleaning. Tokenization divides the sentences into words separated by white spaces; it is done using the Keras API on the text corpus of the proposed system. Cleaning removes non-alphabetic characters, special characters, unnecessary spaces, and improper sentences from the text corpus, so that any special characters are eliminated before further processing. The text corpus is loaded in UTF-8 format because UTF-8 can efficiently store text containing characters from any human language.
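The cleaning and tokenization steps above can be sketched in a few lines of pure Python. This is a minimal illustrative sketch, not the paper's actual Keras-based pipeline; the regular expressions (and the Tamil Unicode block used for cleaning) are assumptions made for the example:

```python
import re

def clean_english(sentence):
    # Lowercase, replace anything that is not a letter or space, collapse whitespace
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-z\s]", " ", sentence)
    return re.sub(r"\s+", " ", sentence).strip()

def clean_tamil(sentence):
    # Keep only characters in the Tamil Unicode block (U+0B80-U+0BFF) and spaces
    sentence = re.sub(r"[^\u0B80-\u0BFF\s]", " ", sentence)
    return re.sub(r"\s+", " ", sentence).strip()

def tokenize(sentence):
    # Whitespace tokenization, the default behavior of the Keras Tokenizer
    return sentence.split()
```

For example, `tokenize(clean_english("I slept!"))` yields `["i", "slept"]`, ready to be mapped to vocabulary indices.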

Encoding
Encoding is done by the encoder of the LSTM-based English-Tamil Translator system. It acquires only one grapheme or word per time step; hence, an input sentence of length L (i.e., L words) takes L time steps to read. Each word is mapped to its index in the language vocabulary to create a thought vector or context vector, which represents the meaning of the source-language sentence converted into a fixed-size vector. This component also performs the proper UTF-8 and ASCII (American Standard Code for Information Interchange) encoding of the input sentences. Consider S to be a sentence in the native source language (English) and T its corresponding sentence in the target language (Tamil). The encoder converts S = (s1, s2, s3, ..., sm) into fixed-dimension vectors, which are in turn predicted literally by the decoder using the conditional probability given in Eq. (1).
The following notations are used: xt is the input at time step t; ht and ct are the hidden and cell states of the LSTM at time step t; yt is the output produced at time step t. Consider a source-language (English) sentence such as "I slept"; it is interpreted in the proposed system as a sequence of two words, s1 = "I" and s2 = "slept". This sequence is read in two time steps, as shown in Figure 2. At t = 1 the LSTM remembers "I", and at t = 2 it has read "I slept", so the states h2 and c2 remember the complete sequence "I slept".
The initial states h0 and c0 are initialized as zero vectors. The encoder takes the above sequence of words X = {xs1, xs2, xs3, ..., xsL} as input and calculates the thought vector v = {vh, vc}, where vh is the final external hidden state obtained after processing the final element of the input sequence and vc is the final cell state. Mathematically, vc = cL and vh = hL.
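The encoder's step-by-step reading of a sentence can be illustrated with a single NumPy LSTM cell. This is a hedged sketch for intuition only: the gate ordering, random weights, and the embedding/hidden sizes (D = 8, H = 4) are illustrative assumptions, not the configuration of the proposed system:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # One LSTM time step; the four gates are slices of one affine transform.
    # W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2*H])        # forget gate
    o = sigmoid(z[2*H:3*H])      # output gate
    g = np.tanh(z[3*H:4*H])      # candidate cell values
    c_t = f * c_prev + i * g     # new cell state
    h_t = o * np.tanh(c_t)       # new hidden state
    return h_t, c_t

# Reading a two-word sentence ("I slept") in two time steps,
# starting from zero vectors h0 and c0:
rng = np.random.default_rng(0)
D, H = 8, 4                      # illustrative embedding and hidden sizes
W = rng.standard_normal((4 * H, D))
U = rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)  # h0, c0
for x in rng.standard_normal((2, D)):   # stand-ins for xs1, xs2
    h, c = lstm_step(x, h, c, W, U, b)
v_h, v_c = h, c                  # thought vector v = {vh, vc}
```

After the loop, `v_h` and `v_c` play the role of vh = hL and vc = cL above.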
Here, s1, s2, ..., sm represent the fixed-size encoded vectors. Eq. (1) is then converted, using the chain rule, into the form written in Eq. (2).
NMT models developed using Eq. (2) are referred to as Left-to-Right (L2R) autoregressive NMT. The decoder predicts the next word by using the previously determined word vectors together with the source sentence vectors, as given in Eq. (1).

Decoding
The decoding is done by the decoder of the LSTM-based English-Tamil Translator system. Its role is to decode the context vector into the desired translation. The decoder is initialized with the context vector v = {vh, vc}, i.e., h0 = vh and c0 = vc. At each decoding step, the decoder uses the global attention mechanism by checking against the entire pool of source states, as shown in Figure 3.
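The global attention step described above can be sketched in NumPy. This sketch assumes a simple dot-product score function; the paper does not specify which score function its attention layer uses, so this is an illustrative choice:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(h_dec, enc_states):
    # Score the current decoder state against every source hidden state
    # (dot-product score), normalize the scores into attention weights,
    # and return the weighted sum of source states as the context vector.
    scores = enc_states @ h_dec          # (L,): one score per source position
    weights = softmax(scores)            # attention distribution over source
    context = weights @ enc_states       # (H,): weighted sum of source states
    return context, weights
```

Source positions whose encoder state aligns with the decoder state receive higher attention weight, and the weights always sum to 1.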

Results and Discussion
The proposed LSTM-based English-Tamil Translator system mainly uses the text corpora available in the Internet repositories, namely TDIL, LDCIL, and manythings.org. It contains around three thousand Tamil words and their English equivalents drawn from a repository of more than two hundred sentence pairs. The system is implemented on the TensorFlow platform. Table 1 shows the description of the datasets used in the proposed system.

An encoder-decoder LSTM model is used for the proposed English-to-Tamil translation system. The significant parameters used to configure the model are the sizes of the English and Tamil vocabularies and the total number of memory units used to hold the encoded and decoded words. The complexity of the translation is usually predicted using parameters such as the number of sentence pairs in the dataset, the length of each pair, and the size of the vocabulary. The training and testing datasets are used in a ratio of 70:30. The language tokenizers also determine the vocabulary size of each language and the maximum sentence length for a given language phrase. Figure 4 represents the proposed NMT model, and the language model used in the proposed system is represented in Table 2. The stepwise implementation of the language translation is as follows:

i) Preprocess the source sentence xs = (x1, x2, x3, ..., xL) and target sentence yt = (y1, y2, y3, ..., yL) pairs;
ii) Perform embedding using the embedding layer;
iii) Pass xs into the encoder;
iv) Compute the content vector v across the attention layer conditioned on xs;
v) Set the initial states (h0, c0) of the decoder from the content vector;
vi) Predict the target sentence yT = {y1T, y2T, ..., yMT} for the input sentence xs, where the m-th prediction in the target vocabulary is determined by ymT = softmax(w_softmax * hm + b_softmax);
vii) Determine the loss between the predictions and the actuals at the m-th position using categorical cross-entropy;
viii) Optimize the encoder and decoder by updating the weight matrices (W, U, V) and the softmax layer.

The encoder utilizes the input to produce a fixed-size vector from it. Using both the attention mechanism and the scoring function, the decoder assesses the probability of discovering a potential target word related to the source word. The target embedding involves the repeat vector functionality to generate the target Tamil language words corresponding to the input. The Repeat Vector layer uses the maximum word length of the source language, as specified in Table 1, along with 256 memory units. The Target Embedding layer then uses the same parameters along with the maximum word length of the source language specified in Table 1. The hidden and output layers are represented in Table 3. The Input layer is defined using the maximum length of the target-language (Tamil) words used in the proposed system. The Embedding and LSTM layers are created using the maximum phrase length of the generated output in the target language and the total number of memory units needed to configure the model. The Repeat Vector layer is similar to an adapter: it is used for fitting both the decoder and encoder parts of the available network model, and it simply repeats the given 2D input several times to create a 3D output. The Time Distributed wrapper reuses the same output layer for each element in the output sequence; it is generated using the maximum word length allowed in the source language, English, and the total number of memory units needed to configure the model.
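The behavior of the Repeat Vector layer and the Time Distributed wrapper can be reproduced with a minimal NumPy sketch. This is an assumption-level re-implementation for illustration, not the Keras code used in the paper:

```python
import numpy as np

def repeat_vector(x, n):
    # Keras RepeatVector: turn a 2D (batch, features) tensor into a
    # 3D (batch, n, features) tensor by repeating each row n times.
    return np.repeat(x[:, np.newaxis, :], n, axis=1)

def time_distributed(layer, x):
    # Keras TimeDistributed: apply the same layer to every time step
    # of a 3D (batch, time, features) tensor.
    return np.stack([layer(x[:, t, :]) for t in range(x.shape[1])], axis=1)
```

For example, a single encoded vector of 3 features becomes a 4-step sequence via `repeat_vector(encoded, 4)`, which the decoder LSTM can then consume; `time_distributed` shows why one output layer suffices for every position of the output sequence.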

Evaluating the proposed model
The model is evaluated on both the training and test datasets. This involves two steps: i) generating the output iteratively for many input sets of data, and ii) summarizing the model generation according to step i). Both the forward and backward mappings (converting words to numbers and numbers to words) are carried out simultaneously using this model when translating the source language into the target language (18).
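The forward and backward mappings mentioned above can be sketched as a pair of dictionaries. This is an illustrative sketch; index 0 is reserved for padding, following the Keras convention, and the helper name is an assumption:

```python
def build_vocab(sentences):
    # Forward map: word -> index (index 0 is reserved for padding)
    word_to_id = {}
    for sentence in sentences:
        for word in sentence.split():
            word_to_id.setdefault(word, len(word_to_id) + 1)
    # Backward map: index -> word, used when decoding predicted
    # index sequences back into target-language text
    id_to_word = {i: w for w, i in word_to_id.items()}
    return word_to_id, id_to_word
```

Encoding a sentence uses the forward map; converting the decoder's predicted indices back to words uses the backward map.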

Automatic and Human Evaluation Scores
Human subjective evaluations of aspects of the output, such as fluency and adequacy on a 5-point grading scale (with a highest score of 5), are used as human intrinsic measurements to establish quality. Automatic intrinsic measurements compare the corresponding MT output against a predetermined set of reference translations to rank MT systems using easily calculated phrase similarity measures such as Meteor, TER (Translation Edit Rate), and WER (Word Error Rate) (14). The above results are summarized in Table 4.
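Of the automatic measures above, WER is the simplest to state precisely: the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch (not the evaluation code used in the paper):

```python
def wer(reference, hypothesis):
    # Word Error Rate: word-level Levenshtein distance divided by
    # the number of words in the reference translation.
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)
```

A five-word reference with one substituted word gives a WER of 0.2, i.e., 20%, the figure reported for the proposed system.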

Comparison with the existing systems
The Bilingual Evaluation Understudy (BLEU) score is used for the comparison with the existing systems. The BLEU score is used to evaluate machine translations for different languages; it is determined from the number of words in the machine translation output that match the reference translation (5, 13-17). The proposed system achieves a score of 0.9, which is better than the available translation systems. A comparison of the observed BLEU scores of the proposed and existing systems is shown in Figure 5. From Figure 5, it is inferred that the LSTM-based English-Tamil Translator performs more efficiently than the existing translation systems that adopted NMT for converting between the English and Tamil languages.
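The word-matching idea behind BLEU can be illustrated with its core ingredient, modified unigram precision. Note this is only a sketch of one component: full BLEU combines clipped precisions up to 4-grams with a brevity penalty, which is omitted here:

```python
from collections import Counter

def modified_unigram_precision(reference, candidate):
    # Count candidate words that also appear in the reference,
    # clipping each word's count by its count in the reference
    # so repeated words cannot inflate the score.
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

For the reference "the cat is on the mat" and the candidate "the cat the cat", the clipped match count is 3 out of 4 candidate words, giving a precision of 0.75.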

Conclusion
This study concludes that the GRU LSTM model is the best method implemented for the proposed English-to-Tamil bilingual translation system. The LSTM-based English-Tamil Translator model performs the required translation effectively while achieving low time and space complexity. This translator is very supportive of bilingual learners, as it is implemented using large datasets. The results are evaluated using both human subjective and machine evaluation scores and are found to be better than those of the existing systems. Future research should widen the scope to support other Indian languages for learners whose primary language is Tamil.

Fig 3. Attention Mechanism

The m-th prediction of the translated sentence is determined by Eq. (3). Given the context cm and the attentional hidden state hm, the predicted target sequence is yT = {y1T, y2T, ..., yMT}, with

ymT = softmax(w_softmax * hm + b_softmax)   (3)

Table 1 .
Datasets used in the LSTM-based English-Tamil Translator

Table 2 .
Language Model of the LSTM-based English-Tamil Translator (columns: Size of the vocabulary of the languages; Total number of words used)
The LSTM-based English-Tamil Translator uses an encoder-decoder-based LSTM model in which the sequence of inputs is encoded by a front-end model known as the encoder.

Table 3 .
Representation of the Layers of the LSTM-based English-Tamil Translator

Table 4 .
Summary of the results of the proposed LSTM-based English-Tamil Translator