Fast language-independent correction of interconnected typos to finding longest terms

Authors: Behzad Soleimani Neysiani
Conference Title: 24th International Conference of Information Technology (IVUS 2019)
Holding Date of Conference: 2019-04-25 - 2019-04-27
Event Place: 122 - Kaunas
Presented by: University of Lithuania
Presentation: SPEECH
Conference Level: International Conferences

Abstract

Triagers handle bug reports in software triage systems such as Bugzilla: they prioritize reports, find duplicates, and assign reports to developers. These processes should be automated, especially for huge open source projects. To automate them, bug reports must be mined with text mining, information retrieval, and natural language processing techniques. User bug reports contain many typos, which lower the accuracy of artificial intelligence techniques. These typos can be detected with standard dictionaries, but correcting them requires human knowledge of the context of the bug reports. Notably, neither Google Translator nor Microsoft Office Word can detect interconnected terms (a common type of typo in bug reports) that contain more than two meaningful terms. This research provides a novel language-independent approach for fast detection and correction of interconnected typos, based on natural language processing and the structure of the human neural network. A new tree-based method is proposed for term matching, and two algorithms are proposed for quickly finding the longest term in an interconnected typo. The dataset includes 180 thousand typos drawn from four well-known bug report datasets of the Android, Eclipse, Mozilla Firefox, and Open Office projects. The proposed method is then evaluated on these typos against the state of the art. The results show that the runtime performance of the proposed method is the same as that of related works, but the average word length is improved, and more than 57% of the typos in the dataset can be classified as interconnected typos.
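
To make the idea of tree-based term matching and longest-term finding more concrete, the following is a minimal illustrative sketch, not the algorithm from the paper. It assumes a plain character trie built from a word list and a greedy longest-match split with no backtracking; the names `TrieNode`, `build_trie`, `longest_term_length`, and `split_interconnected` are hypothetical and chosen only for this example.

```python
class TrieNode:
    """One node of a character trie (hypothetical helper for this sketch)."""

    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_term = False  # True if the path from the root spells a dictionary term


def build_trie(terms):
    """Insert every dictionary term into a trie and return the root."""
    root = TrieNode()
    for term in terms:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True
    return root


def longest_term_length(root, text, start):
    """Return the length of the longest dictionary term starting at text[start], or 0."""
    node = root
    best = 0
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break
        if node.is_term:
            best = i - start + 1
    return best


def split_interconnected(root, typo):
    """Greedily split a typo into longest dictionary terms.

    Returns the list of terms, or None when some position has no matching
    term (this greedy sketch does not backtrack).
    """
    parts = []
    pos = 0
    while pos < len(typo):
        length = longest_term_length(root, typo, pos)
        if length == 0:
            return None
        parts.append(typo[pos:pos + length])
        pos += length
    return parts


# Assumed toy dictionary; a real system would load a full word list.
trie = build_trie(["bug", "report", "reports", "file", "no", "not", "found"])
print(split_interconnected(trie, "filenotfound"))
print(split_interconnected(trie, "bugreports"))
```

Because the trie is keyed on characters only, a sketch like this does not depend on any particular language, as long as a word list is available. A greedy split can miss valid segmentations where a shorter first term is required, so a practical version would likely need backtracking or scoring of alternative splits.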

tags: Information Retrieval, Natural Language Processing, Duplicate Detection, Bug Reports, Typo Correction, Lexical Interconnected Typo, Trie