Document Type

Dissertation

Publication Date

2025

Abstract

Swahili remains significantly underrepresented in natural language processing (NLP) despite being one of the most widely spoken languages in Africa. This dissertation, Computational Bridges: Enhancing Natural Language Processing of Swahili, addresses this gap through computational linguistics, corpus creation, and large-scale analysis of Swahili syntax and lexical structure. Central to the study is GUMZO, a novel corpus developed from spontaneous conversational data collected from YouTube videos, television panel discussions, political speeches, religious discourse, and unscripted broadcasts. Unlike many existing datasets that rely on formal or translated text, GUMZO captures authentic language use and provides a stronger foundation for NLP research involving low-resource languages.

Using statistical and computational methods, this research analyzes word frequency, sentence structure, lexical diversity, and syntactic complexity in Swahili while comparing findings with high-resource and other low-resource languages. The results reveal substantial lexical richness and distinctive syntactic patterns that remain inadequately represented in current NLP systems and language technologies. The study further demonstrates the importance of culturally grounded corpus creation for improving NLP development for low-resource languages.
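
As a rough illustration of the kind of measures named above, the following Python sketch computes word frequencies and a type-token ratio (TTR) for a short text. The abstract does not specify which lexical diversity measure, tokenizer, or tooling the dissertation uses, so TTR, the regex tokenizer, the function names, and the sample sentences are all assumptions chosen for illustration; the sample is not drawn from GUMZO.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters.

    Apostrophes are kept so that words such as "ng'ombe" stay whole.
    This simple regex tokenizer is an assumption, not the study's method.
    """
    return re.findall(r"[a-z']+", text.lower())


def lexical_stats(text):
    """Return (word frequency Counter, type-token ratio) for a text."""
    tokens = tokenize(text)
    freq = Counter(tokens)
    # Type-token ratio: distinct word forms divided by total tokens.
    ttr = len(freq) / len(tokens) if tokens else 0.0
    return freq, ttr


# Illustrative sentences only (hypothetical, not corpus data).
sample = "Watoto wanasoma kitabu. Watoto wanacheza mpira. Mama anasoma kitabu."
freq, ttr = lexical_stats(sample)

print(freq.most_common(3))
print(f"type-token ratio: {ttr:.3f}")
```

Raw TTR falls as texts get longer, so comparisons across corpora of different sizes would typically need a length-normalized measure; which one the study applies is not stated here.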

This dissertation contributes to computational linguistics and Swahili language research by expanding digital resources for Swahili and providing transferable methodologies for future studies involving low-resource languages. The findings also highlight broader implications for educational access, cultural preservation, public health communication, and technological inclusion within Swahili-speaking communities. Ultimately, this research advocates for more inclusive and linguistically diverse approaches to natural language processing and computational linguistics.

Program or Discipline Name

Data Sciences

Included in

Data Science Commons
