Automatic text classification model for the indigenous Wayuunaiki language that incorporates grammatical features
Abstract
Natural language processing (NLP) techniques for automatic text classification perform best on tasks such as ordering, labeling, and clustering when the texts are written in widely used languages such as English, Chinese, and Spanish. This performance rests on significant advances in machine learning and deep learning architectures, semantic representation strategies for pre-training, and the availability of large volumes of data. For the languages of indigenous communities, by contrast, few studies describe processing that takes into account both the language's grammatical features and the cultural identity of its speakers. This gap stems from the scarcity of datasets with enough high-quality records and from the lack of linguistic resources, such as dictionaries, lemmatizers, or taggers, that could be adapted from other NLP solutions for grammatical analysis. Against this backdrop, the present work outlines a proposal for an automatic text classification model for Wayuunaiki, the native language of the Wayuú community inhabiting Colombia and Venezuela. The model is developed using NLP techniques and the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology. It integrates Wayuunaiki's own grammatical features (prepositions, verb conjugations marked for person and gender, and agglutinative morphology) with the aim of achieving more accurate classification that supports other NLP tasks. Beyond its computational contribution, this work also seeks to provide a high-quality, labeled Wayuunaiki text corpus for research that fosters the conservation and teaching of the language.
