Universitas Hasanuddin
Research output: Contribution to journal, Article, peer-review

The Anoa-L01 Benchmark: Prompt-Based Zero-Shot Evaluation for Sulawesi's Regional Languages Detection in LLMs

Yuyun

International Conference on Computer, Control, Informatics and Its Applications (IC3INA)

Published: 2025

Abstract

In recent years, large language models (LLMs) have demonstrated impressive performance across a wide range of natural language processing tasks. However, their performance on low-resource languages remains largely underexplored. This paper proposes the Language Detection Prompting (LDP) framework, a prompt-based zero-shot strategy designed to identify the languages in an input text without requiring fine-tuning for each target language. We introduce Anoa, a term we use to refer to regional languages spoken in Southern, Western, and Southeastern Sulawesi, Indonesia. To support this effort, we collected a dataset covering 13 languages by extracting text from traditional folktale books from these regions. We evaluate the performance of seven pretrained LLMs: Gemma 7B, LLaMA 2 7B, LLaMA 3.1 8B, and Mistral 7B Instruct, along with three Gemini variants (Gemini 1.5 Flash, Gemini 1.5 Pro, and Gemini 2.0 Flash). Two distinct types of prompts were used: the first was designed to identify the primary language of a given text, while the second aimed to name the languages present in a given sentence. We evaluate model predictions by comparing the output of prompt-based inference against the gold-standard labels (ground truth). Our experiments show that the Gemini models demonstrate superior zero-shot capabilities in identifying the primary language of texts. Our findings further reveal that these models not only succeed in language identification but also detect a high degree of linguistic relatedness among the identified languages.
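The evaluation described in the abstract (prompting a model, then comparing its answer with the gold label) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt wording, the `normalize` rule, the exact-match scoring, and the `model` callable are all assumptions made for illustration, since the abstract does not specify them. The labels in the usage example are placeholders, not entries from the Anoa dataset.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical prompt templates for the two prompt types named in the abstract.
# The paper's actual wording is not given there, so these are assumptions.
PROMPTS = {
    "primary": (
        "Identify the primary language of the following text. "
        "Answer with the language name only.\n\nText: {text}"
    ),
    "names": (
        "Name the languages present in the following sentence.\n\n"
        "Sentence: {text}"
    ),
}


def build_prompt(kind: str, text: str) -> str:
    """Fill the chosen template with the input text."""
    return PROMPTS[kind].format(text=text)


def normalize(label: str) -> str:
    """Lowercase, trim, drop trailing periods, and collapse whitespace (assumed rule)."""
    return " ".join(label.strip().lower().rstrip(".").split())


def accuracy(
    examples: Iterable[Tuple[str, str]],
    model: Callable[[str], str],
    kind: str = "primary",
) -> float:
    """Fraction of examples whose normalized prediction equals the gold label.

    `model` is any callable mapping a prompt string to a response string;
    in practice it would wrap a call to an LLM.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = sum(
        normalize(model(build_prompt(kind, text))) == normalize(gold)
        for text, gold in examples
    )
    return correct / len(examples)


if __name__ == "__main__":
    # Stand-in model that always gives the same answer; a real run would query an LLM.
    def stub_model(prompt: str) -> str:
        return " Language_A. "

    # Placeholder (text, gold label) pairs, not real dataset entries.
    data = [("sample text 1", "Language_A"), ("sample text 2", "Language_B")]
    print(accuracy(data, stub_model))
```

Exact matching after normalization is the simplest possible scoring rule. Free-form LLM answers often need looser matching (for example, accepting alternative names for the same language), which this sketch leaves out.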
