Recent years have witnessed rapid advancements in emerging technologies such as blockchain, the metaverse, the Internet of Things (IoT), 5G, virtual reality and extended reality (Samala et al., 2023; Samala et al., 2024; Choudhary et al., 2023). Among these, artificial intelligence (AI), particularly natural language processing (NLP), has driven transformative innovations, including advanced conversational agents such as ChatGPT (Generative Pre-trained Transformer) (Van Dis et al., 2023). These AI-driven chatbots hold significant potential to reshape both educational and medical domains (Leiter et al., 2024).
Developed by OpenAI, ChatGPT simulates human-like dialogue through NLP, offering adaptive, context-aware responses (OpenAI, 2024). Since its launch in November 2022, the platform has attracted over 100 million users within two months and is now accessible in more than 161 countries (Reuters, 2023; Choudhary et al., 2023). Its versatility has led to applications across medical education, general learning and agricultural practices, including veterinary medicine. By drawing upon vast, internet-derived datasets, ChatGPT can address a wide range of factual, procedural and analytical queries (Zhai, 2022).
The COVID-19 pandemic further accelerated the adoption of virtual teaching tools and digital farm-management technologies, catalysing a global network of AI users (Priya et al., 2024; AlZubi and Al-Zu'bi, 2023). In veterinary and animal sciences, ChatGPT's integration has the potential to advance academic instruction, livestock production techniques, disease management strategies and extension activities.
Despite these promising developments, systematic evaluation of ChatGPT’s accuracy, reliability and curriculum alignment in veterinary education remains limited. The present study aims to systematically assess the domain-specific accuracy of ChatGPT-generated responses within veterinary sciences, comparing these outputs with authoritative veterinary textbooks and peer-reviewed literature. It further seeks to identify the strengths and limitations of ChatGPT’s content generation across diverse veterinary subfields, highlighting areas of consistency and divergence from established knowledge. In addition, the study evaluates the suitability of ChatGPT as an educational aid for learners at varying academic levels, thereby providing evidence-based insights to guide its responsible integration into veterinary pedagogy and practice.
Study design
This study employed a comparative, descriptive, mixed-methods research design to evaluate the domain-specific accuracy and pedagogical relevance of ChatGPT, an AI-based large language model, in the context of veterinary and animal sciences education. The research combined quantitative accuracy scoring with qualitative content evaluation across core veterinary subdisciplines.
AI tools evaluated
• ChatGPT-3.5 (primary model).
• ChatGPT-4 (comparative model).
Both models were accessed via the OpenAI platform, with prompts submitted between 5 and 7 June 2025.
Reference framework
Ground-truth comparisons were made against the following authoritative and widely used resources:
• Livestock Production Management by Sastry and Thomas.
• Farm Animal Management by Singh and Islam.
• ICAR Guidelines and the Veterinary Council of India (VCI) UG Curriculum.
• Species-specific veterinary practical manuals and standard reference texts in clinical and paraclinical sciences.
These texts were selected to represent national curricular relevance and content validity in Indian veterinary education.
Domain categorization
To capture the full scope of veterinary education, the questions were stratified into the following three major academic domains (Table 1).
Question sampling strategy
• Source of questions:
o VCI-aligned undergraduate curriculum.
o Standard viva-voce banks and practical examination questions.
o Frequently asked questions from real classroom settings.
• Sample size: A minimum of 10 questions per domain, totalling 30 evaluated responses (Table 2).
• Validation: Each question was reviewed and approved by two subject-matter experts (SMEs) in the relevant discipline for appropriateness and curricular alignment.
• Prompting protocol (see the illustrative sketch below):
o Only initial responses generated by ChatGPT were recorded.
o No follow-up, regeneration, or iterative refinement was permitted, ensuring consistency and avoiding researcher bias.
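As context for the single-shot protocol above, the following minimal Python sketch shows how first-pass responses could be collected programmatically. It is illustrative only: the study used the OpenAI platform interface, and the model identifiers ("gpt-3.5-turbo", "gpt-4"), the example question and the helper function are assumptions rather than part of the original procedure.

from openai import OpenAI  # assumes the openai Python package (v1.x) and an API key in the environment

client = OpenAI()
MODELS = ["gpt-3.5-turbo", "gpt-4"]  # assumed identifiers corresponding to ChatGPT-3.5 and ChatGPT-4

def first_response(model: str, question: str) -> str:
    # Submit one prompt and keep only the initial answer; no regeneration or follow-up.
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return reply.choices[0].message.content

# Example usage with a hypothetical curriculum-style question.
question = "Describe the phases of the oestrous cycle in cattle."
answers = {m: first_response(m, question) for m in MODELS}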
Evaluation framework
Each AI-generated answer was evaluated independently by two blinded experts using a five-parameter rubric, assigning scores from 0 to 2 for each criterion (maximum score = 10), as detailed in Table 3.
• Flagging threshold: Responses scoring <6 out of 10 were flagged as unreliable or pedagogically unsuitable.
• Inter-rater reliability: Inter-rater agreement was computed using Cohen's kappa coefficient (κ) to validate scoring reliability across evaluators.
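A minimal computational sketch of this rubric aggregation and agreement check is given below; the file name, column labels and rater identifiers are hypothetical, and scikit-learn's cohen_kappa_score is used here simply to illustrate the κ calculation.

import pandas as pd
from sklearn.metrics import cohen_kappa_score

criteria = ["accuracy", "completeness", "relevance", "clarity", "depth"]  # assumed rubric labels
scores = pd.read_csv("rubric_scores.csv")  # hypothetical table: response_id, rater, plus the five criteria

scores["total"] = scores[criteria].sum(axis=1)  # each criterion is 0-2, so the maximum total is 10
scores["flagged"] = scores["total"] < 6         # flagging threshold: <6/10 deemed unreliable

# Agreement between the two evaluators on the flag decision.
wide = scores.pivot(index="response_id", columns="rater", values="flagged")
kappa = cohen_kappa_score(wide["rater_1"], wide["rater_2"])
print(f"Cohen's kappa = {kappa:.3f}")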
Data analysis
Quantitative analysis
• Descriptive statistics:
o Mean, standard deviation (SD) and distribution of scores across all domains.
o Domain-wise accuracy rates for both ChatGPT versions (3.5 vs 4).
• Comparative analysis: Independent-sample t-tests were used to detect significant differences in scoring between ChatGPT-3.5 and ChatGPT-4. ANOVA was applied to assess differences across academic domains (see the illustrative sketch below).
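As an illustration of the tests listed above, the sketch below uses scipy.stats; the score arrays are hypothetical placeholders, not the study data, and the paired variant (ttest_rel) is shown alongside one-way ANOVA because the results section reports paired comparisons.

import numpy as np
from scipy import stats

# Hypothetical per-question scores (0-10) for one domain.
gpt35 = np.array([8, 7, 9, 8, 7, 8, 8, 9, 7, 8])
gpt4 = np.array([10, 10, 9, 10, 10, 10, 9, 10, 10, 10])

t_stat, p_val = stats.ttest_rel(gpt4, gpt35)  # paired t-test on the same ten questions
print(f"paired t = {t_stat:.2f}, p = {p_val:.4f}")

# One-way ANOVA comparing scores across the three academic domains (placeholder data).
animal = [10, 10, 9, 10, 10, 10, 9, 10, 10, 10]
paraclinical = [10, 9, 10, 10, 9, 10, 10, 10, 10, 10]
clinical = [9, 10, 10, 9, 10, 10, 10, 9, 10, 10]
f_stat, p_anova = stats.f_oneway(animal, paraclinical, clinical)
print(f"F = {f_stat:.2f}, p = {p_anova:.4f}")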
Qualitative content analysis
• A thematic analysis was conducted on flagged responses to identify patterns of:
o Factual inaccuracies
o Conceptual misunderstandings
o Terminological or contextual confusion
o Mismatches with curriculum expectations
• Misleading content was categorized under recurring themes such as:
o Overgeneralization
o Misclassification of breeds or diseases
o Inappropriate depth for learner level
o Incorrect interpretation of physiological signs or behavioural cues
• Quotes and excerpts were used to support thematic insights.
Ethical considerations
No human or animal subjects were directly involved, and ethical approval was not required, as this is an educational content evaluation study. Expert evaluators participated voluntarily, and data anonymity was maintained.
Mean evaluation scores across academic domains
The comparative analysis of ChatGPT-3.5 and ChatGPT-4.0 demonstrated a marked improvement in the latter's performance across all academic domains (Table 4). In Animal Sciences, the mean score for ChatGPT-4.0 (9.90±0.32) was significantly higher than for ChatGPT-3.5 (7.90±0.74), yielding a mean difference of +2.00 points (p<0.001, paired t-test). Similarly, in Paraclinical Sciences, the mean score improved from 7.80±0.63 (ChatGPT-3.5) to 9.80±0.42 (ChatGPT-4.0), a mean difference of +2.00 points (p<0.001). For Clinical Sciences, ChatGPT-4.0 achieved a mean of 9.70±0.48 compared with 7.70±0.67 for ChatGPT-3.5, corresponding to a mean improvement of +2.00 points (p<0.001). The improvement was consistent and statistically significant across all domains, with minimal variability in ChatGPT-4.0 scores, as reflected by lower standard deviations.
Domain-wise performance distribution
Domain-specific evaluation revealed that ChatGPT-4.0 consistently provided more accurate, comprehensive and contextually relevant answers than ChatGPT-3.5 (Table 4). In Animal Sciences, 100% of responses by ChatGPT-4.0 scored ≥ 9/10, compared to only 40% for ChatGPT-3.5. For Paraclinical Sciences, ChatGPT-4.0 achieved ≥ 9/10 in 90% of cases, whereas ChatGPT-3.5 reached this threshold in only 30% of cases. In Clinical Sciences, ChatGPT-4.0 reached ≥ 9/10 in 80% of cases, while ChatGPT-3.5 achieved this in just 20%. This distribution underscores the reliability and uniformity of ChatGPT-4.0’s outputs across varying question complexity levels.
Paired t-test analysis
The paired t-test revealed highly significant differences between ChatGPT-3.5 and ChatGPT-4.0 in all domains (p<0.001; Table 4). The effect sizes (Cohen's d) were large in all cases (>2.0), indicating that the observed differences were not only statistically significant but also practically meaningful in the context of academic evaluation.
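For reference, the paired-design effect size reported here corresponds to the mean per-question score difference divided by the standard deviation of those differences (notation below is generic, not taken from the study's analysis code):

d = \frac{\bar{D}}{s_D}, \qquad D_i = x_i^{(4.0)} - x_i^{(3.5)}

With \bar{D} = 2.00 in every domain (Table 4), any difference standard deviation below 1.0 yields d > 2.0, consistent with the large effect sizes reported.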
One-way ANOVA
One-way ANOVA confirmed significant variation in scores between models across domains (Table 4; F-values > 100, p<0.001). Post hoc analysis using Tukey’s HSD test indicated that ChatGPT-4.0 outperformed ChatGPT-3.5 consistently in each individual domain, with no overlap in confidence intervals for mean scores.
Two-way ANOVA
Two-way ANOVA results showed a strong model effect (p<0.001), confirming that the primary source of score variation was attributable to the difference in model versions (Table 4). The domain effect was also significant (p<0.05), suggesting that some domains inherently influenced scoring levels regardless of model type. The interaction effect between model type and domain was statistically significant (p<0.05), indicating that the degree of improvement from ChatGPT-3.5 to ChatGPT-4.0 varied across domains, with the greatest improvement seen in Animal Sciences.
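A hedged sketch of how such a two-way model could be fitted in Python with statsmodels is shown below; the long-format file and column names (score, model, domain) are assumptions, not the study's actual analysis script.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("scores_long.csv")  # hypothetical long-format data: one row per response with score, model, domain

fit = smf.ols("score ~ C(model) * C(domain)", data=df).fit()  # main effects plus model x domain interaction
print(anova_lm(fit, typ=2))  # Type II ANOVA table with F-values and p-values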
Inter-rater agreement (Cohen’s kappa)
The agreement between evaluators in thematic coding of flagged responses was assessed using Cohen's kappa (Table 5). The coefficient was -0.018, indicating poor agreement and suggesting that evaluators frequently diverged in categorizing flagged responses. This poor inter-rater reliability implies that qualitative assessment criteria may require refinement or clearer operational definitions to ensure consistency in future evaluations.
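For interpretation, Cohen's kappa corrects observed agreement for agreement expected by chance,

\kappa = \frac{p_o - p_e}{1 - p_e},

where p_o is the observed proportion of agreement and p_e the proportion expected by chance; a value marginally below zero, as obtained here, therefore means the evaluators agreed slightly less often than chance alone would predict.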
Thematic coding of flagged responses
Thematic analysis of flagged responses revealed that discrepancies were often due to nuanced interpretation differences between raters rather than overt scoring errors (Table 6 and Fig 1). Common flagged categories included:
• Incomplete explanations (more common in ChatGPT-3.5).
• Contextual irrelevance (occasional in both models, but less frequent in ChatGPT-4.0).
• Terminology inaccuracy (more frequent in Paraclinical Sciences for ChatGPT-3.5).
• Ambiguity in reasoning (found in both models, but less often in ChatGPT-4.0).
Despite ChatGPT-4.0’s stronger quantitative performance, these qualitative coding results highlight that some limitations remain, particularly in domain-specific terminology precision and consistent contextual alignment.
Our comparative evaluation of ChatGPT-3.5 and ChatGPT-4.0 across multiple veterinary science domains revealed statistically significant differences in performance, with ChatGPT-4.0 consistently achieving higher mean scores. One-way ANOVA analyses demonstrated that domain-specific score variations were significant, a finding reinforced by two-way ANOVA results indicating a strong model effect and a notable interaction between model type and academic domain. These outcomes suggest that ChatGPT-4.0 not only provides generally more accurate and comprehensive answers but also exhibits greater adaptability across disciplinary contexts. Similar trends have been documented in prior AI benchmarking studies in medical and veterinary education, where GPT-4 outperformed earlier iterations in factual accuracy, reasoning complexity and context-aware responses (Nori et al., 2023; Wulcan et al., 2025).
Despite these performance gains, the inter-rater reliability for flagged responses, as assessed by Cohen's kappa, was extremely low (κ = -0.018), indicating poor agreement between evaluators. This aligns with earlier research in educational assessment demonstrating that subjective scoring of open-ended responses can result in substantial variability, particularly when raters weigh criteria such as factual correctness, clarity and domain relevance differently (McHugh, 2012). The negative kappa value suggests that rater judgments were not only inconsistent but, in some cases, inversely correlated, reflecting the inherent subjectivity in qualitative evaluation of AI-generated answers.
The thematic coding of flagged responses revealed recurring patterns of model shortcomings. The most frequent themes included factual inaccuracies, lack of domain specificity and overgeneralization, with a smaller proportion of flags related to ambiguous phrasing and irrelevant content. This distribution mirrors patterns observed by Zhang et al. (2024) in AI-generated medical education content, where even high-performing models occasionally produced domain-inaccurate statements or resorted to generic phrasing when specialized knowledge was insufficiently represented in the training data.
Interestingly, the magnitude of improvement from GPT-3.5 to GPT-4.0 was most pronounced in Animal Sciences and Clinical Sciences, whereas Paraclinical Sciences showed a smaller relative gain. This may be attributable to uneven domain exposure during model training; prior studies have shown that large language models tend to perform better in areas with abundant, high-quality, publicly available literature compared to niche subfields with limited training representation (Wulcan et al., 2025; Nori et al., 2023).
In contrast to findings from controlled medical question-answering benchmarks that reported near-perfect rater agreement for structured, fact-based queries (Kung et al., 2023), our results underscore the importance of robust scoring protocols and evaluator calibration when assessing AI performance in integrative, interdisciplinary domains like veterinary sciences. The low agreement in our study may be partly due to the complexity of veterinary problem-solving, which often requires synthesizing multi-domain knowledge, thus increasing the scope for interpretive variability among evaluators.
From a pedagogical perspective, our results support the integration of GPT-4 into veterinary education as a supplementary tool, particularly in areas where it demonstrated strong factual accuracy and contextual adaptation. However, the flagged content analysis highlights the need for human oversight and domain-specific fine-tuning before deploying such models in high-stakes educational or clinical contexts.
Collectively, the present findings both corroborate and extend the prior AI evaluation literature: while GPT-4 represents a measurable advancement over GPT-3.5 in knowledge accuracy and adaptability, evaluator disagreement and domain-dependent performance gaps remain significant limitations. Addressing these issues will require a dual approach: algorithmic improvement in domain-specific reasoning and methodological refinement of evaluation frameworks.