NLP-Based Extraction of Bird Morphological Features from Indonesian Texts
Arifin S.R.
Proceedings, International Seminar on Intelligent Technology and Its Applications (ISITIA)
Abstract
This study presents a systematic computational approach for extracting, analyzing, and categorizing morphological features of birds from Indonesian-language descriptions. The methodology emphasizes empirical validation and statistical rigor, with quantitative evaluation that includes cross-validation reliability assessment and statistical significance testing. Using the Indonesian-translated CUB-200-2011 dataset, our analysis of 117,853 descriptions across 200 bird species achieved a cross-validation reliability of 0.848 and statistical significance (p < 0.001) across all feature categories. The findings reveal statistically validated patterns in Indonesian ornithological terminology: the prominence of beak morphology (53.50%, 95% CI: [53.2%, 53.8%]), high-contrast coloration (black 50.93%, white 48.65%), and significant size asymmetry patterns. Feature co-occurrence analysis, also statistically validated, reveals semantic relationships between anatomical features and their visual characteristics, with “perut putih” (white belly) emerging as the most common combination (9.21%). The framework consists of five main stages: dataset preparation and preprocessing, morphological feature extraction, statistical validation and reliability assessment, feature analysis and categorization, and visualization and database generation. The preprocessing pipeline standardizes Indonesian bird descriptions while preserving domain-specific terminology, and feature extraction employs context-aware pattern matching optimized for Indonesian morphology. Statistical validation through 5-fold cross-validation and chi-square significance testing supports the reliability and reproducibility of the methodology.
This empirically validated approach contributes essential groundwork to biodiversity informatics by providing reliable baseline measurements and linguistic insights that can inform the development of more advanced computational systems for multilingual biodiversity databases.
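As a rough illustration of the extraction and validation steps described in the abstract, the sketch below matches body-part and color terms in Indonesian descriptions, counts how often each pair co-occurs, and computes a confidence interval for a proportion. The term lists, function names, and sample sentences are hypothetical and are not taken from the paper. The normal-approximation (Wald) interval is likewise an assumption about how the reported interval might have been obtained. Under that assumption, and taking all 117,853 descriptions as the denominator, it reproduces the reported beak-morphology interval to three decimals.

```python
import math
import re
from collections import Counter

# Hypothetical vocabularies; the paper's actual term lists are not given in the abstract.
BODY_PARTS = ["paruh", "perut", "sayap", "kepala"]   # beak, belly, wing, head
COLORS = ["putih", "hitam", "kuning", "coklat"]      # white, black, yellow, brown

# In Indonesian the modifier follows the noun, so "perut putih" means "white belly".
PAIR_PATTERN = re.compile(
    r"\b(" + "|".join(BODY_PARTS) + r")\s+(" + "|".join(COLORS) + r")\b"
)


def extract_pairs(description):
    """Return the set of (body part, color) pairs found in one description."""
    return set(PAIR_PATTERN.findall(description.lower()))


def pair_prevalence(descriptions):
    """Fraction of descriptions that contain each (body part, color) pair."""
    counts = Counter()
    for text in descriptions:
        counts.update(extract_pairs(text))
    total = len(descriptions)
    return {pair: count / total for pair, count in counts.items()}


def proportion_ci(p, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion p observed over n items."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return (p - half_width, p + half_width)


# Made-up sample descriptions, for illustration only.
samples = [
    "Burung ini memiliki perut putih dan sayap hitam.",
    "Burung kecil dengan paruh kuning dan perut putih.",
    "Burung ini memiliki kepala hitam.",
    "Burung dengan paruh yang panjang.",
]

prevalence = pair_prevalence(samples)
low, high = proportion_ci(0.5350, 117853)
print(prevalence[("perut", "putih")])      # share of samples mentioning a white belly
print(round(low, 3), round(high, 3))       # interval for the reported 53.50% figure
```

A fixed regular expression is the simplest form of pattern matching. The context-aware matching described in the abstract presumably handles intervening words, affixes, and negation, which this sketch does not attempt.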