An improved Bulgarian natural language processing pipeline

Title	An improved Bulgarian natural language processing pipeline
Publication Type	Journal Article
Year of Publication	2023
Authors	Berbatova M, Ivanov F
Journal	Annuaire de l’Université de Sofia “St. Kliment Ohridski”. Faculté de Mathématiques et Informatique
Volume	110
Start Page	37
Pagination	37-50
ISSN	1313-9215 (Print) 2603-5529 (Online)
Keywords	language pipeline, natural language processing, word sense disambiguation
Abstract	In this paper, we present a language pipeline for processing Bulgarian language data. The pipeline consists of the following steps: tokenization, sentence splitting, part-of-speech tagging, dependency parsing, named entity recognition, lemmatization, and word sense disambiguation. The first two components are based on rules and lists of words specific to the Bulgarian language, while the remaining components use machine learning algorithms trained on Universal Dependencies data and pretrained word vectors. The pipeline is implemented with the Python library spaCy (https://spacy.io/) and achieves significant results on all the included subtasks. The pipeline is open source and is available on GitHub (https://github.com/melaniab/spacy-pipeline-bg/) for use by researchers and developers in a variety of natural language processing and text analysis tasks.
DOI	10.60063/GSU.FMI.110.37-50
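
The abstract describes the first two pipeline steps (tokenization and sentence splitting) as based on rules and Bulgarian-specific word lists. The sketch below illustrates that general idea in plain Python. It is not the authors' implementation and does not use spaCy: the abbreviation list, the regular expression, and the function names `tokenize` and `split_sentences` are all assumptions made for illustration.

```python
import re

# Hypothetical, tiny abbreviation list (an assumption for this sketch);
# the published pipeline's actual word lists are not reproduced here.
ABBREVIATIONS = {"г.", "гр.", "ул.", "проф.", "д-р", "т.н."}

# Punctuation tokens that close a sentence in this simplified rule set.
SENTENCE_END = {".", "!", "?"}

# A word (optionally hyphenated) or a single punctuation character.
WORD_OR_PUNCT = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping listed abbreviations intact."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            # Keep "гр." as one token so its period is not read
            # as a sentence boundary later.
            tokens.append(chunk)
        else:
            # Separate punctuation from words: "София." -> "София", "."
            tokens.extend(WORD_OR_PUNCT.findall(chunk))
    return tokens


def split_sentences(tokens):
    """Group tokens into sentences at standalone end punctuation."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without closing punctuation
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    text = "Проф. Иванов живее в гр. София. Той преподава математика."
    for sentence in split_sentences(tokenize(text)):
        print(sentence)
```

Because abbreviations are matched before punctuation is split off, the period in "гр." never becomes a standalone token and so cannot trigger a sentence break. A real rule set would need a far larger list and extra handling for cases such as an abbreviation that ends a sentence.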

Attachment	Size
110-037-050.pdf	667.13 KB
