Why we use F1 and AUC-ROC, not accuracy, for fault detection

If you train a fault detection model on real EV charging data and optimize for accuracy, you will build a very accurate model that is essentially useless.

Here's why.

The imbalance problem

In a healthy, well-maintained fleet, charger failures are rare events. Across our pilot data, fault conditions represent between 1% and 3% of all session-level observations, depending on how you define "fault." If your model predicts "no fault" for every single session, it will be 97–99% accurate. It will also miss every failure.
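
To make the arithmetic concrete, here is a minimal sketch with made-up numbers: 1,000 sessions at a 2% fault rate (an assumption chosen from inside the range above, not our pilot data) and a "model" that always says "no fault."

```python
# Hypothetical data: 1,000 sessions, 20 of them faults (2%).
n_sessions = 1000
n_faults = 20
y_true = [1] * n_faults + [0] * (n_sessions - n_faults)  # 1 = fault, 0 = healthy

# A "model" that predicts "no fault" for every session.
y_pred = [0] * n_sessions

# Accuracy: share of sessions where the prediction matches the label.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / n_sessions

# Recall: share of real faults that were flagged.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / n_faults

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.980
print(f"recall   = {recall:.3f}")    # recall   = 0.000
```

The accuracy looks excellent and the model has not caught a single fault.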

This is the class imbalance problem, and it's not unique to EV charging — it shows up in fraud detection, medical diagnosis, equipment monitoring, and anywhere else the event you care about is rare. The standard accuracy metric is blind to it.

What F1 score actually measures

F1 is the harmonic mean of precision and recall. Precision answers: of everything we flagged as a fault, how many actually were? Recall answers: of everything that was actually a fault, how many did we flag?

A model that fires on every session has perfect recall but terrible precision. A model that almost never fires has decent precision but terrible recall. F1 penalizes both extremes. You can also use the weighted variant, F-beta, to shift the balance toward precision or recall depending on your operational context: a beta above 1 favors recall, a beta below 1 favors precision. In our case, we lean slightly toward recall, because a missed failure is worse than a false alarm.
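
The two extremes are easy to check by hand. The sketch below computes precision, recall, F1, and F-beta from confusion counts. The counts are invented for illustration (1,000 sessions, 20 faults), and beta = 2 is only an example of a recall-leaning setting, not the value we use.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives F1; beta > 1 weights recall more heavily,
    beta < 1 weights precision more heavily.
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Hypothetical confusion counts for three behaviours on 1,000 sessions with 20 faults.
scenarios = {
    "fires on everything": (20, 980, 0),   # (tp, fp, fn)
    "almost never fires": (2, 0, 18),
    "reasonable middle": (15, 10, 5),
}

for name, (tp, fp, fn) in scenarios.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{name:20s} precision={p:.3f} recall={r:.3f} "
          f"F1={f_beta(p, r):.3f} F2={f_beta(p, r, beta=2.0):.3f}")
```

Both extreme models end up with a low F1, while the middle one scores well on F1 and slightly better on F2 because its recall is higher than its precision.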

Why AUC-ROC complements F1

AUC-ROC measures the model's ability to distinguish between classes across all possible classification thresholds. It answers: if I pick a random fault and a random non-fault, how often does the model score the fault higher? An AUC of 0.5 is no better than random ranking; 1.0 is perfect.
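
That ranking interpretation can be computed directly. The sketch below counts, over every (fault, non-fault) pair, how often the fault gets the higher score, with ties counted as half. The function name and the scores are hypothetical; the pairwise loop is quadratic, so it is only meant for small examples.

```python
def auc_by_pairs(scores, labels):
    """AUC-ROC as the fraction of (fault, non-fault) pairs ranked correctly.

    Ties count as half a correct pair. O(n_pos * n_neg), fine for a demo.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one fault and one non-fault")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical model scores: two faults, four healthy sessions.
labels = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1]

# The 0.9 fault outranks all four healthy sessions; the 0.4 fault outranks three.
auc = auc_by_pairs(scores, labels)
print(auc)  # 0.875 (7 of 8 pairs ranked correctly)
```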

F1 gives you a single-threshold view; AUC-ROC gives you the threshold-agnostic picture of how separable the classes are. We use both because they tell you different things. A model can have a high AUC-ROC but a mediocre F1 if the decision threshold is poorly chosen, and vice versa.
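
Here is a small sketch of the "high AUC-ROC, poor F1" case. The scores are invented: every fault outranks every healthy session (so the ranking is perfect), but all scores sit below 0.5, so a default 0.5 cutoff flags nothing. The helper name and the candidate thresholds are assumptions for illustration.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when every session scoring at or above `threshold` is flagged as a fault."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical scores: perfectly ranked, but compressed toward zero.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.30, 0.25, 0.10, 0.08, 0.05, 0.04, 0.02, 0.01]

f1_default = f1_at_threshold(scores, labels, 0.5)  # nothing is flagged
f1_tuned = f1_at_threshold(scores, labels, 0.2)    # both faults flagged, no false alarms

# Sweep the observed scores as candidate thresholds and keep the best F1.
best_threshold = max(sorted(set(scores)), key=lambda t: f1_at_threshold(scores, labels, t))

print(f1_default, f1_tuned, best_threshold)
```

The ranking never changed between the two F1 values; only the cutoff did. In practice the threshold should be chosen on validation data, not on the data used to report the final metric.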

How we handle the imbalance

We use a combination of approaches. At the data level, we apply SMOTE (Synthetic Minority Over-sampling Technique) selectively for certain model variants. At the training level, we use class-weighted loss functions that penalize misclassification of the minority class more heavily. At the evaluation level, we use stratified cross-validation to ensure fault cases are represented proportionally in every fold.
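
As a rough illustration of the training-level and evaluation-level pieces, the sketch below computes inverse-frequency class weights and assigns sessions to stratified folds. This is not our pipeline: the weighting formula (n_samples / (n_classes * class_count), the same heuristic scikit-learn uses for class_weight="balanced") and both function names are assumptions, and SMOTE is left out.

```python
import random
from collections import Counter, defaultdict


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def stratified_folds(labels, n_folds=5, seed=0):
    """Split indices into folds so each fold keeps roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Hypothetical data: 20 faults among 1,000 sessions.
labels = [1] * 20 + [0] * 980

weights = balanced_class_weights(labels)
print(weights[1])  # 25.0: a missed fault costs far more than a misclassified healthy session

folds = stratified_folds(labels, n_folds=5)
faults_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(faults_per_fold)  # [4, 4, 4, 4, 4]
```

Without stratification, a random split of 20 faults across 5 folds could easily leave one fold with only one or two faults, which makes per-fold recall very noisy. If oversampling is used as well, it should be applied only to the training portion of each split so that synthetic points never appear in the evaluation fold.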

None of these are magic. Class imbalance is a fundamental property of the problem. What these techniques do is prevent the model from taking the easy path of predicting the majority class and calling it done. The harder work is making sure the features we're giving the model actually encode the information that distinguishes pre-fault from healthy behavior — and that's a data and feature engineering problem, not a metrics problem.

The practical upshot

When an operator asks us "how accurate is your model?", we don't answer that question. We answer: here's our precision, here's our recall, here's our AUC-ROC on held-out data from your hardware class. Those numbers tell you something meaningful. Accuracy doesn't.