
Fourth in a Series: Mastering data quality for safe and scalable AI
Saibal Banerjee, PhD and Steven Atneosen, JD
Introduction
AI practitioners often struggle with assessing the performance of multi-class classification models. When your model has to recognize multiple categories, especially if the populations of these categories are imbalanced – i.e., they vary widely – simple overall metrics like accuracy, or simple averages of precision and recall over all categories, are inadequate. That’s where weighted macro and micro-averaged metrics provide meaningful guidance that reflects the true nature of your dataset and model behavior.
Why Weighted Metrics Are Necessary
Weighted metrics are necessary to evaluate the overall performance of unbiased models on imbalanced datasets. Let’s say your classifier is being used to identify content violations on a large social platform: hate speech, spam, impersonation, and normal posts. In a real-life production environment, these classes may occur at different frequencies – spam might be 30% of the posts, hate speech 10%, impersonation 2%, and the rest normal. However, the true frequencies in a real-life environment may be unknown; they may vary from one environment to another, and they may vary over time in an environment where corrective actions are taken. Thus, to be agnostic to these effects and to avoid any systematic bias, the training data for a classifier is often balanced, with approximately equal proportions of each class used to train the classifier.

However, a classifier trained this way, though unbiased, will make tradeoffs that are not optimal in a production environment where the class proportions vary widely. In the production environment above, for example, one can envisage such a model making an excessive number of erroneous content violation calls, simply because its training leads it to expect only a quarter of the posts to be normal. Evaluating overall model performance in such environments with imbalanced datasets requires metric formulations that respect the environment’s class proportions.
Technical Overview: Macro vs. Micro Averaging
In our third blog in this series, Mastering Data Quality for Safe & Scalable AI, we defined precision and recall scores for each class k as a ratio of two counts Ak and Bk obtained from the confusion matrix count entries. In other words, the class performance scores Sk for both precision and recall have the form Sk = Ak / Bk for class value k. For the precision score Pk of the predicted class value k, the numerator Ak is the true positive count of k (the number of samples in the test dataset for which the actual and the predicted values both equal k) and the denominator Bk is the number of samples that are predicted to be k. Similarly, for the recall score Rk of the actual class value k, the numerator Ak is the true positive count of k and the denominator Bk is the number of samples that are actual instances of class k.
Given this form of the ratio of counts, macro and micro averaging are two ways to perform a weighted average of these class scores Sk in order to combine them into a single overall weighted score. The weights used in the averaging are given by an arbitrary convex combination weight vector w = [w1 ,…, wn], where n is the number of classes, all weights are non-negative, and their sum is one.
- The macro weighted average S̄(w) of scores Sk with respect to an arbitrary convex combination weight vector w is the weighted arithmetic mean of the individual scores S1 ,…, Sn. In other words:
S̄(w) = w1S1 +…+ wnSn    (1)
- The micro weighted average S̴(w) of scores Sk with respect to an arbitrary convex combination weight vector w is the weighted mediant of the ratios A1 / B1 ,…, An / Bn that define the scores S1 ,…, Sn respectively. In other words:
S̴(w) = (w1A1 +…+ wnAn) / (w1B1 +…+ wnBn)    (2)
- In particular, we shall use the notations P() and R() in place of S() to denote the precision and recall scores, respectively.
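The two averaging rules above can be sketched in a few lines of Python. This is a minimal illustration, using the per-class precision counts from the blog series’ dog-cat-pig example (Ak = true positives, Bk = predicted counts):

```python
def macro_avg(weights, scores):
    # Weighted arithmetic mean: w1*S1 + ... + wn*Sn
    return sum(w * s for w, s in zip(weights, scores))

def micro_avg(weights, numerators, denominators):
    # Weighted mediant of the per-class ratios Ak / Bk
    return (sum(w * a for w, a in zip(weights, numerators))
            / sum(w * b for w, b in zip(weights, denominators)))

# Precision counts from the dog-cat-pig example:
A = [2, 3, 2]   # true positives per class (dog, cat, pig)
B = [5, 4, 3]   # predicted counts per class
scores = [a / b for a, b in zip(A, B)]   # per-class precision
w = [1 / 3] * 3                          # uniform weights

print(round(macro_avg(w, scores), 4))    # 0.6056
print(round(micro_avg(w, A, B), 4))      # 0.5833
```

Note that the micro average divides sums of weighted counts, while the macro average sums weighted ratios; the two generally disagree unless all per-class scores are equal.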
Weights Demystified: Uniform, Actual, and Predicted Population Weights
When assessing the performance of classification models, particularly on datasets with imbalanced class distributions, selecting the correct weights is very important. Depending on your context, you might choose from:
- Uniformly Weighted Averaging
- Actual Population Weighted Averaging
- Predicted Population Weighted Averaging (rare)
Each of these weightings produces a single macro or micro-averaged metric from multiple class-specific scores, but they differ in how they weigh those class scores before averaging. Let’s break them down.
1. Uniformly Weighted Averages
Definition: Each class’s metric score Sk (precision and recall) is given equal weight in weighted macro or micro averaging, i.e. wk = 1 / n for 1 ≤ k ≤ n, regardless of how frequently that class appears in the test set.
Formulas (derived from equations (1) and (2) above):
- Uniform Macro Average Score S̄(n⁻¹) = (S1 +…+ Sn) / n
- Uniform Micro Average Score S̴(n⁻¹) = (A1 +…+ An) / (B1 +…+ Bn)
Where n⁻¹ = [1 / n,…, 1 / n] denotes the uniform convex combination weight vector of size n.
Key Insight: The uniform micro averages of precision P̴(n⁻¹) and recall R̴(n⁻¹) are both equal to the overall accuracy, a metric well known to be weak for imbalanced classes because it trades off the accuracy of minority classes in favor of the majority.
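This identity follows directly from the micro averaging formula: with uniform weights the common factor 1/n cancels, and for single-label classification the denominators Bk (predicted counts for precision, actual counts for recall) sum to the total number of test samples N:

$$
\tilde{S} = \frac{\tfrac{1}{n}\sum_{k=1}^{n} A_k}{\tfrac{1}{n}\sum_{k=1}^{n} B_k}
= \frac{\sum_{k} A_k}{\sum_{k} B_k}
= \frac{\sum_{k} A_k}{N}
= \text{Accuracy},
$$

since the total true positive count Σ Ak is exactly the number of correctly classified samples.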
Use Cases
Suppose your classifier mislabels hate speech 30% of the time while achieving nearly perfect accuracy on spam and impersonation. A uniform micro average precision or recall score, being equal to the overall accuracy, will hide this problem because the vast majority of samples are not hate speech. But uniform macro average scores will reveal this weakness clearly by treating all categories equally. If your priority is fairness and reducing harm, uniform macro metrics are better suited in this case.
Uniform macro metrics are also better if your AI content moderation tool carries significant reputational risk from false negatives on hate speech, spam, or impersonation. For example, even if “impersonation” occurs in only 2% of examples, failing to identify it has serious consequences.
Because they weigh all classes as equally important, uniform macro metrics are ideal for fairness-sensitive applications like:
- Legal risk analysis tools
- Hiring decision automation
- Health diagnostics for rare conditions
Example calculations
- From the dog-cat-pig classifier example in this blog series, we obtain the individual precision scores from Figure 3 of our third blog, “If it walks like a duck, talks like a duck, it’s a pig… The Confusion Matrix is a Simple Tool for Model Performance:”
Uniform Macro Average Precision P̄(n⁻¹) = (0.40 + 0.75 + 0.6667) / 3 = 60.56%
Uniform Micro Average Precision P̴(n⁻¹) = (2 + 3 + 2) / (5 + 4 + 3) = 58.33% = Accuracy
- Similarly, from the individual recall score entries in Table 6 of our related technical paper (available at the link below) we get:
Uniform Macro Average Recall R̄(n⁻¹) = (0.50 + 0.60 + 0.6667) / 3 = 58.89%
Uniform Micro Average Recall R̴(n⁻¹) = (2 + 3 + 2) / (4 + 5 + 3) = 58.33% = Accuracy
2. Actual Population Weighted Averages
Definition: Each class’s metric score Sk (precision and recall) is weighted according to how frequently the class actually appears (often called the ground truth frequency) in the test set. In other words, if vk represents the proportion of samples that are actual instances of class k, then wk = vk for 1 ≤ k ≤ n.
Formulas (derived from equations (1) and (2) above):
- Actual Population Weighted Macro Average Score S̄(v) = v1S1 +…+ vnSn
- Actual Population Weighted Micro Average Score S̴(v) = (v1A1 +…+ vnAn) / (v1B1 +…+ vnBn)
Where v = [v1 ,…, vn] denotes the convex combination weight vector of actual population frequencies.
Key Insight: The actual population weighted macro average recall score R̄(v) is equal to the overall accuracy, which shows that it, too, is a weak metric for imbalanced classes.
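The identity is a one-line calculation. Writing N for the total number of test samples, each weight is vk = Bk / N, where Bk is the actual count of class k, so the denominators of the per-class recalls Rk = Ak / Bk cancel against the weights:

$$
\bar{R}(v) = \sum_{k=1}^{n} v_k R_k
= \sum_{k=1}^{n} \frac{B_k}{N}\cdot\frac{A_k}{B_k}
= \frac{\sum_{k} A_k}{N}
= \text{Accuracy}.
$$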
Use Cases
A healthcare AI system trained on real-world data, where class distributions reflect patient populations (e.g., 90% of patients are healthy, 5% pre-diabetic, and 5% diabetic), may use this population weighted metric because it emphasizes majority class performance, which may reflect real operational loads. The risk of using this metric is that it hides poor performance on minority or rare classes, much like the overall accuracy score does. For example, a cancer detection model that excels on healthy patients but misses all cancer cases could have misleadingly high actual population weighted average precision and recall scores.
Example calculations
- From the dog-cat-pig classifier example in Table 4 of the technical paper, we obtain the actual population frequencies: vdog = 4/12, vcat = 5/12, and vpig = 3/12.
- From the individual precision score entries in Table 3 of the technical paper:
Act. Pop. Macro Average Precision P̄(v) = (4 x 0.40 + 5 x 0.75 + 3 x 0.6667) / 12 = 61.25%
Act. Pop. Micro Average Precision P̴(v) = (4 x 2 + 5 x 3 + 3 x 2) / (4 x 5 + 5 x 4 + 3 x 3) = 59.18%
- From the individual recall scores in Table 6 of the technical paper:
Act. Pop. Macro Average Recall R̄(v) = (4 x 0.50 + 5 x 0.60 + 3 x 0.6667) / 12 = 58.33% = Accuracy
Act. Pop. Micro Average Recall R̴(v) = (4 x 2 + 5 x 3 + 3 x 2) / (4 x 4 + 5 x 5 + 3 x 3) = 58.00%
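These calculations are easy to reproduce in a few lines of Python; a sketch using the dog-cat-pig counts quoted above from the technical paper:

```python
# Per-class counts from the dog-cat-pig example:
tp  = [2, 3, 2]   # true positives A_k (dog, cat, pig)
act = [4, 5, 3]   # actual class counts
N = sum(act)      # 12 test samples

v = [a / N for a in act]                    # actual-population weights
recall = [t / a for t, a in zip(tp, act)]   # per-class recall

macro_recall = sum(w * r for w, r in zip(v, recall))
micro_recall = (sum(w * t for w, t in zip(v, tp))
                / sum(w * a for w, a in zip(v, act)))
accuracy = sum(tp) / N

print(round(macro_recall, 4))  # 0.5833 -- equals accuracy, as claimed
print(round(micro_recall, 4))  # 0.58
print(round(accuracy, 4))      # 0.5833
```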
3. Predicted Population Weighted Averages
Definition: Each class’s metric score Sk (precision and recall) is weighted by the population frequency of the model’s predictions, not the actual test labels. In other words, if uk represents the proportion of samples that are predicted to be class k, then wk = uk for 1 ≤ k ≤ n.
Formulas (derived from equations (1) and (2)):
- Predicted Population Weighted Macro Average Score S̄(u) = u1S1 +…+ unSn
- Predicted Population Weighted Micro Average Score S̴(u) = (u1A1 +…+ unAn) / (u1B1 +…+ unBn)
Where u = [u1 ,…, un] denotes the convex combination weight vector of predicted label frequencies.
Key Insight: The predicted population weighted macro average precision score P̄(u) is equal to the overall accuracy. That’s a surprising and powerful equivalence: it connects what your model thinks it’s doing (prediction-based weighting) to what it’s achieving (accuracy).
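The algebra mirrors the recall case. Writing N for the total number of test samples, each weight is uk = Bk / N, where Bk is now the predicted count of class k, so the denominators of the per-class precisions Pk = Ak / Bk cancel against the weights:

$$
\bar{P}(u) = \sum_{k=1}^{n} u_k P_k
= \sum_{k=1}^{n} \frac{B_k}{N}\cdot\frac{A_k}{B_k}
= \frac{\sum_{k} A_k}{N}
= \text{Accuracy}.
$$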
Use Cases
This weighting applies when you are analyzing a production classifier’s prediction tendencies, not its real-world inputs. It is especially valuable during model introspection to identify whether the model is overly confident in certain predictions, because it highlights how your model allocates prediction attention across classes. This makes it useful for feedback systems, recommendation engines, and auto-tagging platforms.
Example calculations
- From the dog-cat-pig classifier example in Table 5 of the technical paper, we obtain the predicted population frequencies: udog = 5/12, ucat = 4/12, and upig = 3/12.
- From the individual precision score entries in Table 3 of the technical paper:
Pred. Pop. Macro Average Precision P̄(u) = (5 x 0.40 + 4 x 0.75 + 3 x 0.6667) / 12 = 58.33% = Accuracy
Pred. Pop. Micro Average Precision P̴(u) = (5 x 2 + 4 x 3 + 3 x 2) / (5 x 5 + 4 x 4 + 3 x 3) = 56.00%
- From the individual recall scores in Table 6 of the technical paper:
Pred. Pop. Macro Average Recall R̄(u) = (5 x 0.50 + 4 x 0.60 + 3 x 0.6667) / 12 = 57.50%
Pred. Pop. Micro Average Recall R̴(u) = (5 x 2 + 4 x 3 + 3 x 2) / (5 x 4 + 4 x 5 + 3 x 3) = 28 / 49 = 57.14%
Visual Summary: Weighting Strategies
| Strategy | Weights Based On | Best For Use In… |
| --- | --- | --- |
| Uniformly Weighted Macro | Equal for all classes | Ethics, compliance, fairness, rare events |
| Actual Population Weighted | Frequency in ground truth | Operational deployment, real-world load simulation |
| Predicted Population Weighted | Frequency in predictions | Debugging model behavior, calibration tuning |
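For readers who compute these metrics with scikit-learn (an assumption on our part; the blog does not prescribe a library), the `average` parameter of `recall_score` and `precision_score` maps onto the strategies above: `average='macro'` is the uniformly weighted macro average, `average='weighted'` is the actual population weighted macro average (weights = class support), and `average='micro'` is the uniform micro average, which equals accuracy for single-label classification. The labels below are hypothetical, purely for illustration:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical single-label predictions (not the dog-cat-pig data)
y_true = ["spam", "spam", "spam", "hate", "normal", "normal", "normal", "normal"]
y_pred = ["spam", "spam", "hate", "hate", "normal", "normal", "normal", "spam"]

macro_r    = recall_score(y_true, y_pred, average="macro")     # uniform macro
weighted_r = recall_score(y_true, y_pred, average="weighted")  # actual-pop macro
micro_r    = recall_score(y_true, y_pred, average="micro")     # uniform micro
acc        = accuracy_score(y_true, y_pred)

# Two identities from this post: actual-population weighted macro recall
# and uniform micro recall both equal the overall accuracy.
print(round(macro_r, 4), round(weighted_r, 4), round(micro_r, 4), round(acc, 4))
```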
Synthetic Testing to Understand Weight Impacts
With tomtA.ai’s privacy-safe data generation platform, you can simulate:
- Imbalanced vs. balanced datasets
- Real-world vs. model prediction class distributions
- Rare class presence to test under Uniform Macro Averaging
Use Case Example:
You want to audit bias in a credit scoring model. Use tomtA.ai to generate underrepresented groups (e.g., rural applicants) at higher proportions and test model recall for these groups using uniform macro averaging.
Conclusion: Choosing the Right Weighting
Your metric isn’t just a number—it’s a lens. Choosing the wrong averaging method will mislead you. With the right one, you can:
- Spot blind spots
- Align model outcomes with business values
- Ensure compliance with fairness mandates
Rule of Thumb:
- Use Uniformly Weighted Macro for fairness
- Use Actual Weighted for system performance
- Use Predicted Weighted for model introspection
And last, but not least: be sure to test them all with safely generated data from a true privacy-first data tool before you go live.