Understanding Is an Illusory Goal

Author's note early 2026: Below are some incomplete thoughts from early 2025. While I never turned them into a cohesive narrative, thinking about these issues directly informed what I went on to build at Coinbase: Combining Real Counterfactuals with Saliency Maps.

If you know of any resources which should be included here that I've missed, let me know. A lot of my original thoughts on this topic stem from this paper by Erasmus, Brunet, and Fisher.

Setting

People are scared of AI for many misguided reasons, one of which is a perceived lack of understanding of how models make decisions. The Erasmus paper enumerates calls for understanding and explainability in AI; read it if you want more context.

Understanding is an illusory goal. Most of the time, we actually want trust, not understanding. There is, I'm sure, a deep philosophical conversation to be had here. Last time I tried to make such a protrusion into the world of positive philosophical utterances, it felt like a process of chasing several rabbits far enough down their holes so that I could make my point. Alas, there is no philosophical rigor here, only a cesspool of gunk and residue, accumulated as a byproduct of so much useless pontification on that which I know not.

The distinction between local and global is helpful for this conversation. Local understanding and global understanding are distinct: they require different approaches, philosophically and mathematically, at both the conceptual level and the implementation level.

Local and Global Understanding - We Don't Know What We Want

Local understanding is when a predictor makes a certain prediction and you want to, somehow, ~understand~ why that prediction was made.

Global understanding is when you want to, somehow, ~understand~ how the predictor makes decisions.

Both are ill-defined, in humans and ML models alike.

Local Understanding

Local understanding is the more tractable of the two, and it ought to be comparative: what one typically wants is a counterfactual analysis. 'My loan was rejected, why?' What do I really want in this scenario? I want to understand how my application could have been different so that it would have been accepted.

I want a counterfactual analysis. 'Here are three loans that are similar to yours and were approved. As you can see, in each of the cases, either the applicant's income was higher or the loan amount was lower.' This is what DICE does.
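
For the curious, here's roughly what that looks like in code. A minimal sketch with the dice-ml library; the toy loan data and column names are inventions for illustration:

```python
import dice_ml
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy loan data; in reality this would be the lender's training set.
df = pd.DataFrame({
    "income":      [30_000, 85_000, 52_000, 41_000, 95_000, 38_000],
    "loan_amount": [20_000, 15_000, 10_000, 25_000, 30_000,  8_000],
    "approved":    [0, 1, 1, 0, 1, 1],
})
clf = RandomForestClassifier(random_state=0).fit(
    df[["income", "loan_amount"]], df["approved"])

data = dice_ml.Data(dataframe=df,
                    continuous_features=["income", "loan_amount"],
                    outcome_name="approved")
model = dice_ml.Model(model=clf, backend="sklearn")
explainer = dice_ml.Dice(data, model, method="random")

# "My loan was rejected, why?" -> three similar-but-approved applications.
rejected = df[["income", "loan_amount"]].iloc[[0]]
cfs = explainer.generate_counterfactuals(rejected, total_CFs=3,
                                         desired_class="opposite")
cfs.visualize_as_dataframe(show_only_changes=True)
```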

Attributive explanation methods, also called Saliency Maps, purport to tell you which parts of the input were ~actually important~ for the predictor outputting what it did. SHAP, LIME, and Integrated Gradients (IG) are all notable saliency mapping techniques. They do their mapping by comparing the input we're interested in with [a bunch of other inputs in the case of LIME/SHAP] or [a baseline input in the case of IG]. For each part of your input, you get a weight which purports to represent how important that part was to the output being what it was. These are all comparative tools using averages to pretend they aren't. All variants of the original Shapley Values paper share this quirk. Any saliency map which doesn't tell you explicitly that it is comparative (read: all saliency maps) is garbage.
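
To see how baked into the math the comparison is, here is a minimal IG sketch in PyTorch (`model` stands for any differentiable predictor; the function itself is generic). The baseline isn't a hidden implementation detail; it's a required argument:

```python
import torch

def integrated_gradients(model, x, baseline, steps=50):
    """Attribute model(x) - model(baseline) across input features.

    The weights are importance *relative to the baseline*; change
    the baseline and the 'importance' changes with it.
    """
    # Straight-line path from baseline to x.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    path = baseline + alphas * (x - baseline)
    path.requires_grad_(True)
    model(path).sum().backward()        # gradient at every point on the path
    avg_grad = path.grad.mean(dim=0)    # approximates the path integral
    return (x - baseline) * avg_grad    # one weight per input feature
```

By the completeness property, these weights sum (approximately) to model(x) - model(baseline): the "explanation" is literally a decomposition of a comparison.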

DICE is better for local explanations. DICE is explicitly comparative, and for most (all?) instances of local understanding, this is what we want. The only thing lacking from a comparative analysis with nearby points that have different outcomes, as we find in DICE, is a guarantee about which part of the difference in the input is actually causing the difference in the output. Luckily for us, this is precisely what attributive methods are good for. DICE can be combined with something like IG to provide the whole picture.
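
One way to wire the two together, roughly the shape of the Coinbase work mentioned in the author's note (a sketch reusing the `integrated_gradients` function above; the numbers and the tiny stand-in model are made up): use a real approved application, the kind DICE retrieves, as the IG baseline, so the weights decompose the rejected-vs-approved gap rather than a comparison against some arbitrary all-zeros input.

```python
import torch

# Stand-in differentiable scorer; IG needs gradients, so picture a
# logistic-regression or neural predictor trained on the loan data.
model = torch.nn.Sequential(torch.nn.Linear(2, 1), torch.nn.Sigmoid())

x  = torch.tensor([30_000.0, 20_000.0])  # the rejected application (income, loan_amount)
cf = torch.tensor([30_000.0,  8_000.0])  # a similar approved one, a la DICE
weights = integrated_gradients(model, x, baseline=cf)
# weights now decompose the rejected-vs-approved gap, feature by feature.
```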

A stronger argument for the vacuity of local explanations without comparison

There are two elements to this:

Philosophically, local understanding isn't helpful without a point of comparison.

Mathematically, ascribing weights based on comparative operations without making those comparisons explicit is obfuscating the truth to a degree. Perhaps this is too strong of a claim - anyone who wanted to understand the underlying math could. But the point is that we use methods like LIME, SHAP, and especially IG to say "this is what is important for this input" instead of "this is what is important for this input relative to this baseline / set of perturbations / reference distribution." The comparison is what's doing the explanatory work; hiding it just makes the explanation feel atomic when it isn't.

In prediction tasks with a finite range, and especially classification tasks with discrete outcomes, "understanding why your loan was rejected" is often substantively the same as "understanding why your loan was not approved." This seems trivial, but the point is real: asking 'why was my loan rejected?' is implicitly asking 'why was it rejected and not approved?' In tasks where ML is predominant, that counterfactual framing is integral to the question, not a reformulation of it.

To me, the most direct route to local understanding is counterfactual comparison.

Why was my loan denied?

Your credit score was too low.

What do you mean too low / what could it have been to get approved?

Find similar inputs with different outcomes.

Here's a table for you. In each case, the other person differed in [given fields], by [given delta (for continuous) or given value], and they had this outcome.

| Fields different | Delta / value | Outcome |
| --- | --- | --- |
| Credit score | +800 | Approve |
| Credit score, Loan amount | +400 / -5000 | Approve |
| Race | [white] | Hopefully the same, probably not! |

This 'finding similar inputs' is mathematically rich.
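
Rich, because every piece is a choice: the distance metric, the feature scaling, which fields are allowed to vary at all. A deliberately naive sketch, assuming numeric feature arrays:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def similar_with_different_outcome(x, X, y, my_outcome, k=3):
    """Return the k training points nearest to x whose outcome differs.

    Deliberately naive: Euclidean distance on standardized features.
    Real systems need a task-appropriate metric, categorical handling,
    and constraints (e.g. immutable fields like race held fixed).
    """
    scaler = StandardScaler().fit(X)
    other = y != my_outcome                      # keep only different outcomes
    nn = NearestNeighbors(n_neighbors=k).fit(scaler.transform(X[other]))
    dist, idx = nn.kneighbors(scaler.transform(x.reshape(1, -1)))
    return X[other][idx[0]], dist[0]
```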

Let's briefly contrast this with global understanding, which seems much more open; provable guarantees are really helpful for that kind of understanding.

Global Understanding

Last week a friend visited town. He is one of the only people I know who loves food more than I do. He said that when deciding whether or not to go to a restaurant, instead of just reading the reviews, he first looks at the photos. If they pass his vibe test, then he'll take a peek at the comments.

"I've gotten really good at seeing past fake shit and figuring out what actually seems good."

"How do you make a call like that?"

"IDK just vibes."

And yet, because of my vast experience with this friend, I trust his classifications despite my lack of understanding of his classification process. What do you do when you can't understand? Do you decide whether or not to trust? I think so.

Model architecture + weights would tell us the whole story if we could follow it. But that isn't the kind of understanding we seek, and neither are saliency maps. For global understanding, we need abstractions.

Provable guarantees are one such abstraction, and they double as a form of trust. They can't address every concern, but they can make real progress on some. Individual fairness guarantees, like the latent-space distance guarantees in LCIFR, are great. Guarantees about adversarial robustness, as discussed here, are great.

Provable guarantees are of the form: I can guarantee that any input which is ~similar~ to this input here will have the same output. You get to define ~similar~ however you want, so long as it is rigorous. For example, I want to know (with high probability) that two loan applicants with identical applications, one a man and the other a woman, get the same output. For another example, I want to know (with high probability) that my image classifier, given any image and a copy of it rotated by 3 degrees or less, will classify them the same.

This is possible (with high probability).
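
A genuine proof takes certification machinery (LCIFR-style training, the robustness methods above), but even stating the property is simple. A hypothetical empirical check for the man/woman example, where `predict` and the binary-encoded sex column are stand-ins; passing on a sample is evidence, not a guarantee:

```python
import numpy as np

def check_sex_invariance(predict, applications, sex_col):
    """Empirically test the guarantee: applications identical except for
    the (binary-encoded) sex field should get identical outputs.

    A sanity check, not a proof; the point of certification methods
    is to prove this for *all* such pairs, not a finite sample.
    """
    flipped = applications.copy()
    flipped[:, sex_col] = 1 - flipped[:, sex_col]   # swap man <-> woman
    return float(np.mean(predict(applications) == predict(flipped)))  # 1.0 = invariant here
```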

Local understanding is also made possible by the existence of provable guarantees. If I can't rule out that my loan was rejected because of, say, my race, then why should I engage in trying to improve my application at all? You probably wouldn't.

So What?


Interpretability Methods Taxonomy

Made with assistance from Claude.

1. Glass-box Models (inherently interpretable)
2. Post-hoc Explanation Methods (for any model)
   A. Feature Attribution Methods (individual feature importance)
   B. Feature Interaction Methods (how features work together in pairs)
   C. Complex Interaction Detection (detecting multi-feature interactions)
   D. Counterfactual Methods

Further reading