A context-aware lemmatization model for Setswana language using machine learning

Abstract:

Lemmatization is an important task concerned with helping computers understand the
relationships among words written in natural language. It is a prerequisite for the
development of natural language processing (NLP) systems such as machine translation
and information retrieval.

In particular, lemmatization is intended to reduce the variability in word forms by collapsing
related words to a standard lemma. There is limited research on the lemmatization of the
Setswana language. Much of the available research on Setswana lemmatization relies on a
rule-driven strategy, which takes time to construct, ignores the context in which words are
used, and demands highly specialized linguistic expertise. Moreover, it has been found that
hand-coded rules generalize poorly: they require continual redesign every time new data
appears, which complicates the scalability of systems. Given Setswana's rich vocabulary and
complex morphology, its lemmatization cannot easily be solved using explicit rules developed
by programmers.

In this thesis we describe how a supervised machine learning approach based on the
Naive Bayes algorithm can solve Setswana lemmatization with regard to how words are used in
sentences. The contributions of this study are as follows. First, we present a context-aware
lemmatization model that handles most of the morphologically productive classes. Second, we
experiment with Naive Bayes, a strong multi-class algorithm which, to the best of our
knowledge, has never been used to address lemmatization in Setswana. The lemmatization model
reached an accuracy of 70.32% in the experiments. The model moves away from entirely
hand-programmed rules and is able to lemmatize words based on the context in which they are
used, which matters because in Setswana lemmatization should follow the intended meaning of
the sentence. In addition, as long as the data is a good example of the goal concept, the
model builds its generalization from that data, which allows its future performance to
continue improving.
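To make the idea of context-aware lemmatization with Naive Bayes concrete, the following is a minimal sketch and not the implementation used in the thesis. It assumes that lemmatization is framed as classification over suffix-rewrite rules, that the features are the word form, its last three characters, and the neighbouring tokens, and that add-one smoothing is used. All names (`features`, `NaiveBayes`, `apply_rule`, `lemmatize`) are hypothetical. The handful of Setswana training examples are illustrative toy data, not drawn from the thesis dataset, and should be checked against a Setswana reference before any reuse.

```python
import math
from collections import Counter, defaultdict


def features(prev, word, nxt):
    # Assumed feature set: the word form, its last three characters,
    # and the neighbouring tokens that supply the context.
    return {"word": word, "suffix": word[-3:], "prev": prev, "next": nxt}


class NaiveBayes:
    """Categorical Naive Bayes with add-alpha smoothing (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)  # (label, feature name) -> value counts
        self.values = defaultdict(set)           # feature name -> values seen in training

    def fit(self, samples):
        for feats, label in samples:
            self.class_counts[label] += 1
            for name, value in feats.items():
                self.feat_counts[(label, name)][value] += 1
                self.values[name].add(value)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            for name, value in feats.items():
                count = self.feat_counts[(label, name)][value]
                vocab = len(self.values[name]) + 1  # one extra slot for unseen values
                score += math.log((count + self.alpha) / (n + self.alpha * vocab))
            if score > best_score:
                best, best_score = label, score
        return best


def apply_rule(word, rule):
    # A rule is (suffix to strip, suffix to append); ("", "") leaves the word unchanged.
    strip, append = rule
    if strip and word.endswith(strip):
        return word[: -len(strip)] + append
    return word


# Toy training data (illustrative only): (previous token, word, next token) -> rule.
ILE_TO_A, E_TO_A, KEEP = ("ile", "a"), ("e", "a"), ("", "")
TRAIN = [
    (("ke", "rekile", "dijo"), ILE_TO_A),
    (("o", "ratile", "ntja"), ILE_TO_A),
    (("ke", "bone", "motho"), E_TO_A),   # "bone" used as a verb form
    (("ba", "bone", "ntja"), E_TO_A),
    (("le", "bone", "</s>"), KEEP),      # "bone" used as a pronoun
    (("ke", "nna", "</s>"), KEEP),
    (("le", "ntja", "</s>"), KEEP),
]

model = NaiveBayes().fit([(features(*ctx), rule) for ctx, rule in TRAIN])


def lemmatize(model, prev, word, nxt):
    return apply_rule(word, model.predict(features(prev, word, nxt)))


if __name__ == "__main__":
    print(lemmatize(model, "o", "bone", "dijo"))   # bona
    print(lemmatize(model, "le", "bone", "</s>"))  # bone
```

In this sketch the same surface form receives different lemmas depending on its neighbours, which is the behaviour that a fixed, context-free rule table cannot express. Predicting a rewrite rule rather than the lemma itself is one possible design choice: it keeps the number of classes small and lets the classifier apply to word forms it has not seen.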

Furthermore, given that this is a young area of research with no standard datasets for training
and testing, we also contribute a medium-sized dataset, which is a much-needed resource for
the research community. The experimental results obtained from this study show that machine
learning approaches are more reliable than rule-based approaches in lemmatizing Setswana
inflected words with regard to the context in which they are used.
