Second-language speakers pronounce words differently from native speakers. These deviations, be they phoneme substitutions, deletions, or insertions, can be modelled automatically with the method presented here. The method is based on a discrete hidden Markov model used as a word pronunciation model, initialized from a standard pronunciation dictionary.
The implementation and functionality of the methodology were verified with a test set of accented non-native English.
Statistical Pronunciation Modeling for Non-Native Speech Processing | Rainer E. Gruhn | Springer
Since the ASR decoder works better with long words, our method focuses on merging transcription words to increase the number of long words. For this purpose, we consider merging words according to their tags.
That is, we merge a noun that is followed by an adjective, and merge a preposition with the word that follows it. A tag is a word property such as noun, pronoun, verb, adjective, adverb, preposition, conjunction, or interjection; the tag set differs from language to language.
The tagger has 29 tags in total, of which only 13 were used in our method, as listed in Table 3. As mentioned, we focused on three kinds of tags: nouns, adjectives, and prepositions. In this work, we use Noun-Adjective as shorthand for a compound word generated by merging a noun with a following adjective, and Preposition-Word as shorthand for a compound word generated by merging a preposition with the subsequent word.
The prepositions used in our method include: Table 4 shows the tagger output for a simple non-diacritized sentence. The tagger output is used to generate compound words by searching for Noun-Adjective and Preposition-Word sequences. These two compound words are then represented in new sentences, as illustrated in Figure 9. The three sentences (the original and the two new ones) are then used, along with all other cases, to produce the enhanced language model and the enhanced pronunciation dictionary.
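The sentence-variant generation described above can be sketched as follows. This is a minimal illustration, assuming the tagger returns (word, tag) pairs and using invented tag names ("NOUN", "ADJ", "PREP"); the actual tagger's output format and tag inventory may differ.

```python
# Sketch: generate new sentence variants, one per compound word found.
# Tag names and the (word, tag) input format are illustrative assumptions.

def generate_compound_sentences(tagged, prepositions):
    """tagged: list of (word, tag) pairs for one sentence.
    prepositions: the fixed set of prepositions used by the method.
    Returns one new sentence per compound word found."""
    words = [w for w, _ in tagged]
    variants = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        noun_adj = (t1 == "NOUN" and t2 == "ADJ")
        prep_word = (t1 == "PREP" and w1 in prepositions)
        if noun_adj or prep_word:
            compound = w1 + w2  # merge the pair into a single token
            variants.append(" ".join(words[:i] + [compound] + words[i + 2:]))
    return variants
```

For example, a tagged sentence containing both a Noun-Adjective pair and a Preposition-Word pair yields two new sentences, each containing exactly one compound word.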
Figure 10 shows the process of generating a compound word: a noun followed by an adjective is merged to produce one compound word. It is worth noting that our method is independent of handling pronunciation variations that may occur at word junctures; that is, it does not consider the phonological rules that could apply between certain words. The steps for modeling the cross-word phenomenon are described by the algorithm pseudocode shown in the corresponding figure. In the figure, the Offline stage is implemented once before decoding, while the Online stage is repeatedly implemented after each decoding process.
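The offline stage of the algorithm can be sketched as a corpus-enhancement pass; the function below is an illustrative skeleton (the name `build_enhanced_corpus` and the `find_compounds` callback are assumptions, not the paper's actual code), with LM and dictionary training left to the usual toolchain.

```python
def build_enhanced_corpus(sentences, find_compounds):
    """Offline stage (run once before decoding): for every compound
    word found in a sentence, add one new sentence containing it,
    so each added sentence carries at most one compound word.
    find_compounds(sentence) -> list of sentence variants."""
    enhanced = []
    for s in sentences:
        enhanced.append(s)                  # keep the original sentence
        enhanced.extend(find_compounds(s))  # plus one variant per compound
    return enhanced  # the enhanced LM and dictionary are trained from this
```

The online stage then reduces to scanning each decoding result and splitting compound tokens back into their original word pairs.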
The proposed method was investigated on a speaker-independent Modern Standard Arabic speech recognition system using the Carnegie Mellon University Sphinx speech recognition engine. WER is computed using the following formula: WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words, respectively, and N is the number of words in the reference transcription.
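The WER formula can be computed from a word-level Levenshtein alignment. The sketch below is a generic reference implementation, not the Sphinx scorer itself.

```python
def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / N * 100, computed via
    word-level Levenshtein distance against the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)
```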
OOV words are a known source of recognition errors, which in turn can cause additional errors in the words that follow (Gallwitz et al.). In this research work, the baseline system is based on a closed vocabulary.
The closed vocabulary assumes that all words of the testing set are already included in the dictionary. In our method, we calculate OOV as the percentage of recognized words that do not belong to the testing set but do belong to the training set.
The following formula is used to find OOV: OOV = (number of recognized words found in the training set but not in the testing set) / (total number of recognized words) × 100. The perplexity of a language model is defined in terms of the inverse of the average log likelihood per word (Jelinek). Measuring perplexity is a common way to evaluate an N-gram language model; it measures the quality of a model independently of any ASR system. The measurement is, of course, performed on the testing set, and a lower-perplexity model is considered better than a higher-perplexity one.
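Given the definition of perplexity as the inverse of the average log likelihood per word, it can be computed from per-word log probabilities as sketched below (a generic illustration; an actual N-gram toolkit would report this directly).

```python
import math

def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities:
    PP = exp(-(1/N) * sum(log P(w_i | history)))."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)
```

For instance, a model that assigns every word probability 0.25 has perplexity 4, matching the intuition of a uniform choice among four words.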
The perplexity formula is: PP = exp(-(1/N) * sum_i log P(w_i | history)), where N is the number of words in the testing set. The WER of the baseline system and the boundaries of its confidence interval were computed; if a change in the classification error rate falls outside this interval, the change can be interpreted as statistically significant. Otherwise, it is most likely caused by chance. Table 5 shows the enhancements for the different experiments.
The other cases are similar; the Preposition-Word and Hybrid cases also achieved a significant improvement. Table 5 shows that the highest accuracy was achieved in the Noun-Adjective case. The reduction in accuracy in the Hybrid case is due to the ambiguity introduced in the language model: our method depends on adding new sentences to the transcription corpus that is used to build the language model.
Therefore, adding many sentences eventually biases the language model toward some n-grams (1-grams, 2-grams, and 3-grams) at the expense of others. The common way to evaluate an N-gram language model is perplexity. The perplexity measurements were taken on the testing set. The enhanced cases are clearly better, as their perplexities are lower; the low perplexities overall are due to the specific domains used in our corpus. The OOV was also measured for the performed experiments.
Our ASR system is based on a closed vocabulary, so we assume that there are no unknown words. The OOV was therefore calculated as the percentage of recognized words that do not belong to the testing set but do belong to the training set. Table 6 shows the resulting OOVs for the enhanced cases.
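Under the definition above, the OOV computation reduces to a simple set membership count; the sketch below is illustrative (the function name and vocabulary-set inputs are assumptions).

```python
def oov_rate(recognized, testing_vocab, training_vocab):
    """OOV as defined above: percentage of recognized words that are
    not in the testing-set vocabulary but are in the training set."""
    oov = [w for w in recognized
           if w not in testing_vocab and w in training_vocab]
    return 100.0 * len(oov) / len(recognized)
```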
Clearly, the lower the OOV, the better the performance; a reduction was achieved in all three cases. Table 7 shows some statistical information collected during the experiments. It was feasible to review the collected compound words manually, as our transcription corpus is small. For large corpora, the accuracy of the tagger is crucial for the results; Table 8 shows an error that occurred in the tagger output. Table 9 shows an illustrative example of the improvement achieved by the enhanced system.
Introducing a compound word in this sentence avoided the misrecognition that occurred in the baseline system. According to the proposed algorithm, each sentence in the enhanced transcription corpus can contain at most one compound word, since a sentence is added to the enhanced corpus as soon as a compound word is formed. Finally, after the decoding process, the results are scanned in order to decompose the compound words back to their original form (two separate words).
This process is performed using a lookup table. Table 10 shows comparison results of the suggested methods for cross-word modeling.
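The lookup-table decomposition can be sketched as below. The table entries here are invented transliterated placeholders standing in for the actual Arabic compound/word pairs, not examples from the paper.

```python
# Illustrative decomposition lookup: compound token -> original two words.
# The transliterated entries are placeholders, not the paper's actual data.
LOOKUP = {
    "ktabjdyd": "ktab jdyd",  # a Noun-Adjective compound
    "fyalbyt": "fy albyt",    # a Preposition-Word compound
}

def restore_words(hypothesis):
    """Scan the decoder output and split each compound token
    back into its original two words; other tokens pass through."""
    return " ".join(LOOKUP.get(tok, tok) for tok in hypothesis.split())
```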
It shows that the PoS tagging approach outperforms the other method, i.e., the phonological-rules approach demonstrated in AbuZeina et al. However, the comparison in Table 10 is subject to change, as more cases need to be investigated for both techniques: cross-word variation was modeled using only two Arabic phonological rules, while only two compounding schemes were applied in the PoS tagging approach. The recognition time was also compared with that of the baseline system.
The comparison covers the speech files of the testing set. The experiments were conducted on a desktop computer containing a single processing chip. We found that the recognition time of the enhanced method is almost the same as that of the baseline system, which means the proposed method is roughly equal to the baseline system in terms of time complexity.
As future work, we propose investigating more word-combination cases. A hybrid system could also be investigated: it is possible to use the different cross-word modeling approaches in one ASR system. It is also worthwhile to investigate how to model the compound words in the language model; in our method, we create a new sentence for each compound word.
Comprehensive research is needed to find how to represent compound words effectively in the language model. In addition, we highly recommend further research in PoS tagging for Arabic. The proposed knowledge-based approach to modeling the cross-word pronunciation variation problem achieved a feasible improvement. Mainly, the PoS tagging approach was used to form compound words, and the experimental results clearly showed that forming compound words from a noun and an adjective achieved better accuracy than merging a preposition with its following word.