The Anoa-L01 Benchmark: Prompt-Based Zero-Shot Evaluation for Sulawesi's Regional Languages Detection in LLMs

Yuyun

doi:10.1109/IC3INA68387.2025.11325480

Universitas Hasanuddin

Research output: Contribution to journal › Article › peer-review

International Conference on Computer, Control, Informatics and Its Applications (IC3INA)

Published: 2025

Abstract

In recent years, large language models (LLMs) have demonstrated impressive performance across a wide range of natural language processing tasks. However, their performance on low-resource languages remains largely underexplored. This paper proposes the Language Detection Prompting (LDP) framework, a prompt-based zero-shot strategy designed to identify the languages in an input text without requiring fine-tuning for each target language. We introduce Anoa, a term we use to refer to the regional languages spoken in Southern, Western, and Southeastern Sulawesi, Indonesia. To support this effort, we collected a dataset covering 13 languages by extracting text from traditional folktale books from these regions. We evaluate the performance of seven pretrained LLMs: Gemma 7B, LLaMA 2 7B, LLaMA 3.1 8B, and Mistral 7B Instruct, as well as three variants of Gemini (Gemini 1.5 Flash, Gemini 1.5 Pro, and Gemini 2.0 Flash). Two distinct types of prompts were used: the first was designed to identify the primary language of a given text, while the second aimed to identify the language names for a provided sentence. We evaluate model predictions by comparing the output of prompt-based inference against the gold-standard labels (ground truth). Our experiments show that the Gemini model demonstrates superior zero-shot capabilities in identifying the primary language of texts. Our findings further reveal that the model not only succeeds in language identification but also detects a high degree of linguistic relatedness among the identified languages.
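
The evaluation loop described in the abstract (prompt the model, read off a language name, compare it with the gold label) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the prompt wording, the candidate language list (a few languages of Sulawesi, not the paper's 13-language set), the substring-based answer parsing, and the `generate` callable, which stands in for whichever LLM client is used.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical label set for illustration only; NOT the paper's 13 languages.
CANDIDATE_LANGUAGES = ["Bugis", "Makassarese", "Mandar", "Toraja", "Tolaki"]


def build_prompt(text: str) -> str:
    """Build a zero-shot prompt asking for the primary language (assumed wording)."""
    options = ", ".join(CANDIDATE_LANGUAGES)
    return (
        "Identify the primary language of the following text. "
        f"Answer with one language name from this list: {options}.\n\n"
        f"Text: {text}\nLanguage:"
    )


def normalize_prediction(raw_output: str) -> Optional[str]:
    """Map free-form model output to the earliest-mentioned candidate name, if any."""
    lowered = raw_output.lower()
    hits = [
        (lowered.find(name.lower()), name)
        for name in CANDIDATE_LANGUAGES
        if name.lower() in lowered
    ]
    return min(hits)[1] if hits else None


def accuracy(
    samples: Sequence[Tuple[str, str]],
    generate: Callable[[str], str],
) -> float:
    """Fraction of (text, gold_label) pairs whose parsed prediction equals the gold label."""
    if not samples:
        raise ValueError("samples must not be empty")
    correct = sum(
        normalize_prediction(generate(build_prompt(text))) == gold
        for text, gold in samples
    )
    return correct / len(samples)
```

Here `generate` is any function that takes a prompt string and returns the model's text response, so the same scoring code could be reused across models. Exact-match accuracy is only one possible metric; the abstract does not state which metrics the authors report.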
