Veterinary practices generate vast amounts of unstructured data through clinical notes, treatment descriptions, and billing entries. However, this raw information is often inconsistent and fragmented, making it challenging for clinics, partners, and analytics teams to derive actionable insights.
To solve this problem, we developed an AI-driven categorization system that automatically classifies veterinary services into four standardized categories (Treatment, Lab, Inventory, and Additional Services) with high accuracy and scalability.
Leveraging advanced Natural Language Processing (NLP) and deep learning techniques, our model analyzes millions of service descriptions, captures contextual patterns from historical data, and produces clean, structured outputs that enable consistent reporting, cross-practice benchmarking, and data-driven decision-making.
This whitepaper explores the business need for service data categorization, the AI solution we designed, its tangible impact on veterinary operations, and how it empowers data-driven decision-making at scale.
Unlike fields such as dentistry, veterinary medicine lacks standardized procedure codes. This inconsistency hinders cross-practice uniformity, performance benchmarking, and data-driven insights. To address this, we developed a standardized categorization system for veterinary procedure descriptions, an initiative that aims to establish uniformity across practices, support performance benchmarking, and unlock data-driven insights.
Data Collection & Preprocessing: A dataset of over 300k unique procedure descriptions was processed using NLP techniques and regular expressions to standardize and clean the text. Manual validation was conducted to ensure accuracy and consistency in the descriptions.
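As a rough illustration of this cleaning step, the sketch below applies a few hypothetical regular-expression rules to a raw billing entry. The actual pipeline's rules and the manual validation workflow are more extensive than what is shown here.

```python
import re

def normalize_description(text: str) -> str:
    """Clean one raw service description before categorization (illustrative rules only)."""
    text = text.lower()
    # Drop billing artifacts such as quantity suffixes ("x2") and lot/invoice numbers ("#9921")
    text = re.sub(r"\bx\d+\b|#\d+", " ", text)
    # Strip stray punctuation, keeping hyphens that carry clinical meaning ("x-ray")
    text = re.sub(r"[^\w\s\-]", " ", text)
    # Collapse whitespace left over from exports and the removals above
    return re.sub(r"\s+", " ", text).strip()

print(normalize_description("RABIES VACC x2 (#9921)"))  # -> "rabies vacc"
```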
Categorization Framework:
Figure 1: Main Categories
The categorization system consists of four main categories—Treatment, Lab, Inventory, and Additional Services (Fig. 1)—along with 26 subcategories. A previously labeled dataset was utilized to train the model.
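To make the two-level structure concrete, the fragment below sketches one way to represent such a taxonomy. The four main categories come from Figure 1; the subcategory names are hypothetical placeholders, since the full list of 26 is not enumerated in this paper.

```python
# Hypothetical slice of the label taxonomy. The main categories are from Figure 1;
# the subcategory names below are illustrative placeholders only.
TAXONOMY = {
    "Treatment": ["Vaccination", "Surgery", "Dental"],
    "Lab": ["Bloodwork", "Urinalysis"],
    "Inventory": ["Medication", "Food"],
    "Additional Services": ["Boarding", "Grooming"],
}

def is_consistent(main: str, sub: str) -> bool:
    """Check that a predicted subcategory belongs to its predicted parent category."""
    return sub in TAXONOMY.get(main, [])
```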
Model Selection and Training:
Transformers are state-of-the-art models that have revolutionized NLP by employing self-attention mechanisms to capture contextual relationships within text. Unlike traditional sequence-based models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, transformers process entire input sequences in parallel, improving both efficiency and accuracy. For veterinary service text categorization, we leverage pretrained language models built on the transformer architecture. These models effectively interpret nuanced language variations, enabling precise classification of service descriptions and improving both consistency and efficiency in automated categorization.
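As a minimal example of this setup, the snippet below loads a pretrained transformer with a classification head via the Hugging Face transformers library. The checkpoint name is an assumption for illustration; the paper does not name the specific pretrained model used.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "distilbert-base-uncased"  # assumed checkpoint, for illustration only

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=4,  # Treatment, Lab, Inventory, Additional Services
)

# Tokenize one cleaned service description and score it against the main categories
inputs = tokenizer("rabies vacc", return_tensors="pt", truncation=True)
logits = model(**inputs).logits        # shape (1, 4): one score per main category
predicted_class = logits.argmax(dim=-1).item()
```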
To address the complexity of multi-level categorization, we evaluated two modeling approaches, hierarchical classification and a multi-task head, both implemented using transformer architectures.
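The multi-task variant can be pictured as one shared encoder feeding a separate linear head per level, trained jointly with a summed cross-entropy loss. A minimal sketch, in which the pooling choice and head sizes are assumptions:

```python
from torch import nn
from transformers import AutoModel

class MultiTaskClassifier(nn.Module):
    """Shared transformer encoder with one linear head per level (illustrative sketch)."""

    def __init__(self, model_name: str, n_main: int = 4, n_sub: int = 26):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.main_head = nn.Linear(hidden, n_main)  # main-category logits
        self.sub_head = nn.Linear(hidden, n_sub)    # subcategory logits

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]        # [CLS]-token representation
        return self.main_head(pooled), self.sub_head(pooled)

# Training would sum the losses of both heads, e.g.:
# loss = ce(main_logits, main_labels) + ce(sub_logits, sub_labels)
```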
After comparative evaluation, the hierarchical classification approach using a pretrained transformer model outperformed the multi-task alternative in both accuracy and inference speed.
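Under the hierarchical approach, inference proceeds top-down: the main category is predicted first and routes the description to the matching subcategory classifier. A minimal sketch, with hypothetical classifier objects standing in for the fine-tuned transformer models:

```python
def categorize(text: str, main_clf, sub_clfs: dict):
    """Two-stage hierarchical inference (hypothetical classifier interfaces)."""
    main = main_clf.predict(text)       # e.g. "Lab"
    sub = sub_clfs[main].predict(text)  # e.g. "Bloodwork", from the Lab-specific model
    return main, sub
```

One practical advantage of this routing is that each subcategory model only discriminates among the children of its own branch, keeping the label space at each stage small.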
Figure 2: Model Workflow
In multi-level classification with imbalanced data, traditional metrics like accuracy can be misleading: they are biased toward majority classes and can look strong even when important but infrequent classes are classified poorly. More informative metrics such as precision, recall, and F1-score, reported per class or with macro/micro averaging, give a clearer picture of performance across all levels of the hierarchy and ensure that minority classes are not overlooked.
The F1 score, the harmonic mean of precision and recall, was chosen to evaluate model performance as it provides a balanced measure that accounts for both false positives and false negatives. This metric is especially valuable in imbalanced classification tasks, offering a more meaningful assessment across all classes, including minority ones. By combining precision and recall, the F1 score fairly reflects the model’s ability to correctly identify relevant instances without over-predicting. An F1 score of 94% was achieved for the main category, and 88.7% for the subcategory and combined predictions.
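For reference, per-class and macro-averaged scores of this kind can be computed directly with scikit-learn; the labels below are invented solely to demonstrate the call.

```python
from sklearn.metrics import f1_score

# Invented example labels: ground truth vs. model predictions for five services
y_true = ["Treatment", "Lab", "Lab", "Inventory", "Additional Services"]
y_pred = ["Treatment", "Lab", "Inventory", "Inventory", "Additional Services"]

# Macro averaging weights every class equally, so minority categories count
# as much as frequent ones. Per class: F1 = 2 * P * R / (P + R).
print(f1_score(y_true, y_pred, average="macro"))
```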
Figure 3: Sample Industry Trends