
Fourth in a Series: Mastering data quality for safe and scalable AI
Saibal Banerjee, PhD and Steven Atneosen, JD
Introduction
AI practitioners often struggle with assessing the performance of multi-class classification models. When your model has to recognize multiple categories, especially if the populations of these categories are imbalanced – i.e., they vary widely – simple overall metrics like accuracy, or simple averages of precision and recall over all categories, are inadequate. That’s where weighted macro and micro-averaged metrics provide meaningful guidance that reflects the true nature of your dataset and model behavior.
Why Weighted Metrics Are Necessary
Weighted metrics are necessary to evaluate the overall performance of unbiased models on imbalanced datasets. Let’s say your classifier is being used to identify content violations on a large social platform: hate speech, spam, impersonation, and normal posts. In a real-life production environment, these classes may occur at different frequencies – spam might be 30% of the posts, hate speech 10%, impersonation 2%, and the rest normal. However, the true frequencies in a real-life environment may be unknown; they may vary from one environment to another, and they may vary over time in an environment where corrective actions are taken. Thus, to be agnostic to these effects and to avoid any systematic bias, the training data for a classifier is often balanced, with approximately equal proportions of each class used to train the classifier.

However, a classifier trained this way, though unbiased, will make tradeoffs that are not optimal in a production environment where the class proportions vary widely. In the production environment above, for example, one can envisage such a model making an excessive number of erroneous content violation calls, simply because its training leads it to expect only a quarter of the posts to be normal. Evaluating overall model performance in such environments with imbalanced datasets requires metric formulations that respect the environment’s class proportions.
Technical Overview: Macro vs. Micro Averaging
In our third blog in this series, Mastering Data Quality for Safe & Scalable AI, we defined precision and recall scores for each class k as a ratio of two counts Ak and Bk obtained from the confusion matrix count entries. In other words, the class performance scores Sk for both precision and recall have the form Sk = Ak / Bk for class value k. For the precision score Pk of the predicted class value k, the numerator Ak is the true positive count of k (the number of samples in the test dataset for which the actual and the predicted values both equal k) and the denominator Bk is the number of samples that are predicted to be k. Similarly, for the recall score Rk of the actual class value k, the numerator Ak is the true positive count of k and the denominator Bk is the number of samples that are actual instances of class k.
Given this form of the ratio of counts, macro and micro averaging are two ways to perform a weighted average of these class scores Sk in order to combine them into a single overall weighted score. The weights used in the averaging are given by an arbitrary convex combination weight vector w = [w1 ,…, wn], where n is the number of classes, all weights are non-negative, and their sum is one.
- The macro weighted average S̄(w) of scores Sk with respect to an arbitrary convex combination weight vector w is the weighted arithmetic mean of the individual scores S1 ,…, Sn. In other words:
S̄(w) = w1S1 +…+ wnSn    (1)
- The micro weighted average S̴(w) of scores Sk with respect to an arbitrary convex combination weight vector w is the weighted mediant of the ratios A1 / B1 ,…, An / Bn that define the scores S1 ,…, Sn respectively. In other words:
S̴(w) = (w1A1 +…+ wnAn) / (w1B1 +…+ wnBn)    (2)
- In particular, we shall use the notations P() and R() in place of S() to denote the precision and recall scores, respectively.
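The two averaging rules above can be sketched in a few lines of Python. This is a minimal illustration, using the per-class precision counts from the blog series’ dog-cat-pig example (Ak = true positives, Bk = predicted counts):

```python
def macro_avg(weights, scores):
    # Weighted arithmetic mean: w1*S1 + ... + wn*Sn
    return sum(w * s for w, s in zip(weights, scores))

def micro_avg(weights, numerators, denominators):
    # Weighted mediant of the per-class ratios Ak / Bk
    return (sum(w * a for w, a in zip(weights, numerators))
            / sum(w * b for w, b in zip(weights, denominators)))

# Precision counts from the dog-cat-pig example:
A = [2, 3, 2]   # true positives per class (dog, cat, pig)
B = [5, 4, 3]   # predicted counts per class
scores = [a / b for a, b in zip(A, B)]   # per-class precision
w = [1 / 3] * 3                          # uniform weights

print(round(macro_avg(w, scores), 4))    # 0.6056
print(round(micro_avg(w, A, B), 4))      # 0.5833
```

Note that the micro average divides sums of weighted counts, while the macro average sums weighted ratios; the two generally disagree unless all per-class scores are equal.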
Weights Demystified: Uniform, Actual, and Predicted Population Weights
When assessing the performance of classification models, particularly on datasets with imbalanced class distributions, selecting the correct weights is very important. Depending on your context, you might choose from:
- Uniformly Weighted Averaging
- Actual Population Weighted Averaging
- Predicted Population Weighted Averaging (rare)
Each of these weightings produces a single macro or micro-averaged metric from multiple class-specific scores, but they differ in how they weigh those class scores before averaging. Let’s break them down.
1. Uniformly Weighted Averages
Definition: Each class’s metric score Sk (precision and recall) is given equal weight in weighted macro or micro averaging, i.e. wk = 1 / n for 1 ≤ k ≤ n, regardless of how frequently that class appears in the test set.
Formulas (derived from equations (1) and (2) above):
- Uniform Macro Average Score S̄(n⁻¹) = (S1 +…+ Sn) / n
- Uniform Micro Average Score S̴(n⁻¹) = (A1 +…+ An) / (B1 +…+ Bn)
Where n⁻¹ = [1 / n,…, 1 / n] denotes the uniform convex combination weight vector of size n.
Key Insight: The uniform micro averages of precision P̴(n⁻¹) and recall R̴(n⁻¹) are both equal to the overall accuracy, a metric well known to be weak for imbalanced classes because it trades off the accuracy of minority classes in favor of the majority.
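This identity follows directly from the micro averaging formula: with uniform weights the common factor 1/n cancels, and for single-label classification the denominators Bk (predicted counts for precision, actual counts for recall) sum to the total number of test samples N:

$$
\tilde{S} = \frac{\tfrac{1}{n}\sum_{k=1}^{n} A_k}{\tfrac{1}{n}\sum_{k=1}^{n} B_k}
= \frac{\sum_{k} A_k}{\sum_{k} B_k}
= \frac{\sum_{k} A_k}{N}
= \text{Accuracy},
$$

since the total true positive count Σ Ak is exactly the number of correctly classified samples.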
Use Cases
Suppose your classifier mislabels hate speech 30% of the time while achieving nearly perfect accuracy on spam and impersonation. A uniform micro average precision or recall score, being equal to the overall accuracy, will hide this problem because the vast majority of samples are not hate speech. But uniform macro average scores will reveal this weakness clearly by treating all categories equally. If your priority is fairness and reducing harm, uniform macro metrics are better suited in this case.
Uniform macro metrics are also better if your AI content moderation tool carries significant reputational risk from false negatives on hate speech, spam, or impersonation. For example, even if “impersonation” occurs in only 2% of examples, failing to identify it has serious consequences.
Because they weigh all classes as equally important, uniform macro metrics are ideal for fairness-sensitive applications like:
- Legal risk analysis tools
- Hiring decision automation
- Health diagnostics for rare conditions
Example calculations
- From the dog-cat-pig classifier example in this blog series, we obtain the individual precision scores from Figure 3 of our third blog, “If it walks like a duck, talks like a duck, it’s a pig… The Confusion Matrix is a Simple Tool for Model Performance:”
Uniform Macro Average Precision P̄(n⁻¹) = (0.40 + 0.75 + 0.6667) / 3 = 60.56%
Uniform Micro Average Precision P̴(n⁻¹) = (2 + 3 + 2) / (5 + 4 + 3) = 58.33% = Accuracy
- Similarly, from the individual recall score entries in Table 6 of our related technical paper (available at the link below) we get:
Uniform Macro Average Recall R̄(n⁻¹) = (0.50 + 0.60 + 0.6667) / 3 = 58.89%
Uniform Micro Average Recall R̴(n⁻¹) = (2 + 3 + 2) / (4 + 5 + 3) = 58.33% = Accuracy
2. Actual Population Weighted Averages
Definition: Each class’s metric score Sk (precision and recall) is weighted according to how frequently the class actually appears (often called the ground truth frequency) in the test set. In other words, if vk represents the proportion of samples that are actual instances of class k, then wk = vk for 1 ≤ k ≤ n.
Formulas (derived from equations (1) and (2) above):
- Actual Population Weighted Macro Average Score S̄(v) = v1S1 +…+ vnSn
- Actual Population Weighted Micro Average Score S̴(v) = (v1A1 +…+ vnAn) / (v1B1 +…+ vnBn)
Where v = [v1 ,…, vn] denotes the convex combination weight vector of actual population frequencies.
Key Insight: The actual population weighted macro average recall score R̄(v) is equal to the overall accuracy, which shows that it, too, is a weak metric for imbalanced classes.
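The identity is a one-line calculation. Writing N for the total number of test samples, each weight is vk = Bk / N, where Bk is the actual count of class k, so the denominators of the per-class recalls Rk = Ak / Bk cancel against the weights:

$$
\bar{R}(v) = \sum_{k=1}^{n} v_k R_k
= \sum_{k=1}^{n} \frac{B_k}{N}\cdot\frac{A_k}{B_k}
= \frac{\sum_{k} A_k}{N}
= \text{Accuracy}.
$$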
Use Cases
A healthcare AI system trained on real-world data, where class distributions reflect patient populations (e.g., 90% of patients are healthy, 5% pre-diabetic, and 5% diabetic), may use this population weighted metric because it emphasizes majority class performance, which may reflect real operational loads. The risk of using this metric is that it hides poor performance on minority or rare classes, much like the overall accuracy score does. For example, a cancer detection model that excels on healthy patients but misses all cancer cases could have misleadingly high actual population weighted average precision and recall scores.
Example calculations
- From the dog-cat-pig classifier example in Table 4 of the technical paper, we obtain the actual population frequencies: vdog = 4/12, vcat = 5/12, and vpig = 3/12.
- From the individual precision score entries in Table 3 of the technical paper:
Act. Pop. Macro Average Precision P̄(v) = (4 x 0.40 + 5 x 0.75 + 3 x 0.6667) / 12 = 61.25%
Act. Pop. Micro Average Precision P̴(v) = (4 x 2 + 5 x 3 + 3 x 2) / (4 x 5 + 5 x 4 + 3 x 3) = 59.18%
- From the individual recall scores in Table 6 of the technical paper:
Act. Pop. Macro Average Recall R̄(v) = (4 x 0.50 + 5 x 0.60 + 3 x 0.6667) / 12 = 58.33% = Accuracy
Act. Pop. Micro Average Recall R̴(v) = (4 x 2 + 5 x 3 + 3 x 2) / (4 x 4 + 5 x 5 + 3 x 3) = 58.00%
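These calculations are easy to reproduce in a few lines of Python; a sketch using the dog-cat-pig counts quoted above from the technical paper:

```python
# Per-class counts from the dog-cat-pig example:
tp  = [2, 3, 2]   # true positives A_k (dog, cat, pig)
act = [4, 5, 3]   # actual class counts
N = sum(act)      # 12 test samples

v = [a / N for a in act]                    # actual-population weights
recall = [t / a for t, a in zip(tp, act)]   # per-class recall

macro_recall = sum(w * r for w, r in zip(v, recall))
micro_recall = (sum(w * t for w, t in zip(v, tp))
                / sum(w * a for w, a in zip(v, act)))
accuracy = sum(tp) / N

print(round(macro_recall, 4))  # 0.5833 -- equals accuracy, as claimed
print(round(micro_recall, 4))  # 0.58
print(round(accuracy, 4))      # 0.5833
```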
3. Predicted Population Weighted Averages
Definition: Each class’s metric score Sk (precision and recall) is weighted by the population frequency of the model’s predictions, not the actual test labels. In other words, if uk represents the proportion of samples that are predicted to be class k, then wk = uk for 1 ≤ k ≤ n.
Formulas (derived from equations (1) and (2)):
- Predicted Population Weighted Macro Average Score S̄(u) = u1S1 +…+ unSn
- Predicted Population Weighted Micro Average Score S̴(u) = (u1A1 +…+ unAn) / (u1B1 +…+ unBn)
Where u = [u1 ,…, un] denotes the convex combination weight vector of predicted label frequencies.
Key Insight: The predicted population weighted macro average precision score P̄(u) is equal to the overall accuracy. That’s a surprising and powerful equivalence: it connects what your model thinks it’s doing (prediction-based weighting) to what it’s achieving (accuracy).
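The algebra mirrors the recall case. Writing N for the total number of test samples, each weight is uk = Bk / N, where Bk is now the predicted count of class k, so the denominators of the per-class precisions Pk = Ak / Bk cancel against the weights:

$$
\bar{P}(u) = \sum_{k=1}^{n} u_k P_k
= \sum_{k=1}^{n} \frac{B_k}{N}\cdot\frac{A_k}{B_k}
= \frac{\sum_{k} A_k}{N}
= \text{Accuracy}.
$$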
Use Cases
This weighting applies when you are analyzing a production classifier’s prediction tendencies, not its real-world inputs. It is especially valuable during model introspection to identify whether the model is overly confident in certain predictions, because it highlights how your model allocates prediction attention across classes. This makes it useful for feedback systems, recommendation engines, and auto-tagging platforms.
Example calculations
- From the dog-cat-pig classifier example in Table 5 of the technical paper, we obtain the predicted population frequencies: udog = 5/12, ucat = 4/12, and upig = 3/12.
- From the individual precision score entries in Table 3 of the technical paper:
Pred. Pop. Macro Average Precision P̄(u) = (5 x 0.40 + 4 x 0.75 + 3 x 0.6667) / 12 = 58.33% = Accuracy
Pred. Pop. Micro Average Precision P̴(u) = (5 x 2 + 4 x 3 + 3 x 2) / (5 x 5 + 4 x 4 + 3 x 3) = 56.00%
- From the individual recall scores in Table 6 of the technical paper:
Pred. Pop. Macro Average Recall R̄(u) = (5 x 0.50 + 4 x 0.60 + 3 x 0.6667) / 12 = 57.50%
Pred. Pop. Micro Average Recall R̴(u) = (5 x 2 + 4 x 3 + 3 x 2) / (5 x 4 + 4 x 5 + 3 x 3) = 28 / 49 = 57.14%
Visual Summary: Weighting Strategies
| Strategy | Weights Based On | Best For Use In… |
| --- | --- | --- |
| Uniformly Weighted Macro | Equal for all classes | Ethics, compliance, fairness, rare events |
| Actual Population Weighted | Frequency in ground truth | Operational deployment, real-world load simulation |
| Predicted Population Weighted | Frequency in predictions | Debugging model behavior, calibration tuning |
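For readers who compute these metrics with scikit-learn (an assumption on our part; the blog does not prescribe a library), the `average` parameter of `recall_score` and `precision_score` maps onto the strategies above: `average='macro'` is the uniformly weighted macro average, `average='weighted'` is the actual population weighted macro average (weights = class support), and `average='micro'` is the uniform micro average, which equals accuracy for single-label classification. The labels below are hypothetical, purely for illustration:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical single-label predictions (not the dog-cat-pig data)
y_true = ["spam", "spam", "spam", "hate", "normal", "normal", "normal", "normal"]
y_pred = ["spam", "spam", "hate", "hate", "normal", "normal", "normal", "spam"]

macro_r    = recall_score(y_true, y_pred, average="macro")     # uniform macro
weighted_r = recall_score(y_true, y_pred, average="weighted")  # actual-pop macro
micro_r    = recall_score(y_true, y_pred, average="micro")     # uniform micro
acc        = accuracy_score(y_true, y_pred)

# Two identities from this post: actual-population weighted macro recall
# and uniform micro recall both equal the overall accuracy.
print(round(macro_r, 4), round(weighted_r, 4), round(micro_r, 4), round(acc, 4))
```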
Synthetic Testing to Understand Weight Impacts
With tomtA.ai’s privacy-safe data generation platform, you can simulate:
- Imbalanced vs. balanced datasets
- Real-world vs. model prediction class distributions
- Rare class presence to test under Uniform Macro Averaging
Use Case Example:
You want to audit bias in a credit scoring model. Use tomtA.ai to generate underrepresented groups (e.g., rural applicants) at higher proportions and test model recall for these groups using uniform macro averaging.
Conclusion: Choosing the Right Weighting
Your metric isn’t just a number—it’s a lens. Choosing the wrong averaging method will mislead you. With the right one, you can:
- Spot blind spots
- Align model outcomes with business values
- Ensure compliance with fairness mandates
Rule of Thumb:
- Use Uniformly Weighted Macro for fairness
- Use Actual Weighted for system performance
- Use Predicted Weighted for model introspection
And last, but not least: be sure to test them all with safely generated data from a true privacy-first data tool before you go live.