Stop Explaining Black Box Models and Use Interpretable Models Instead

The paper “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead” by Cynthia Rudin mixes technical and philosophical arguments and offers two main takeaways. First, it sharpens the distinction between explainability and interpretability and shows why the former may be problematic. Second, it provides useful pointers to techniques for creating truly interpretable models.

There has been an increasing trend in healthcare and criminal justice to leverage machine learning (ML) for high-stakes prediction applications that deeply affect human lives. The lack of transparency and accountability of predictive models can have (and has already had) severe consequences.

Defining Terms

A model can be a black box for one of two reasons. (1) The function that the model computes is far too complicated for any human to comprehend, or (2) the model actually may be simple but its details are proprietary and not available for inspection.

In explainable ML, predictions are made using a complicated black box model, and a second model is then created to explain what the first model is doing. A classic example is local interpretable model-agnostic explanations (LIME), which perturbs the input around a single prediction and fits a simple model to the black box's behavior in that local region.
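To make the local-surrogate idea concrete, here is a minimal sketch in the spirit of LIME. It does not use the LIME library. The function `black_box`, the helper `local_linear_surrogate`, and the parameters `width` and `n_samples` are all hypothetical, chosen only for illustration: a one-dimensional stand-in model is sampled near a point of interest, and a proximity-weighted straight line is fitted to those samples.

```python
import math
import random


def black_box(x):
    # Stand-in for an opaque model (an assumption for illustration only).
    return math.sin(x) + 0.1 * x * x


def local_linear_surrogate(f, x0, n_samples=500, width=0.3, seed=0):
    """Fit a proximity-weighted line to f around x0 (LIME-style sketch)."""
    rng = random.Random(seed)
    # Perturb the input around the point we want to explain.
    xs = [x0 + rng.gauss(0.0, width) for _ in range(n_samples)]
    ys = [f(x) for x in xs]
    # Gaussian kernel: perturbations closer to x0 get more weight.
    ws = [math.exp(-((x - x0) ** 2) / (2 * width ** 2)) for x in xs]
    w_sum = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(ws, ys)) / w_sum
    # Closed-form weighted least squares for a single feature.
    cov = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = local_linear_surrogate(black_box, x0=1.0)
# Compare the surrogate with the black box near x0 and far from it.
near_gap = abs((intercept + slope * 1.0) - black_box(1.0))
far_gap = abs((intercept + slope * 4.0) - black_box(4.0))
```

In this toy setup the fitted line tracks the stand-in model closely at `x0` (`near_gap` is small) but diverges from it away from `x0` (`far_gap` is large), which is the sense in which such a surrogate approximates the black box only locally.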

An interpretable model is a model used for predictions that can itself be directly inspected and interpreted by human experts.

Interpretability is a domain-specific notion, so there cannot be an all-purpose definition. Usually, however, an interpretable machine learning model is constrained in model form so that it is either useful to someone or obeys structural knowledge of the domain, such as monotonicity or physical constraints that come from domain knowledge.
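As one illustration of a structural constraint such as monotonicity, the sketch below fits a non-decreasing function to a sequence of values using the pool-adjacent-violators algorithm (isotonic regression with unit weights). This is not a method taken from the paper. The function name `isotonic_fit` and the input numbers are hypothetical, and the inputs are assumed to be already ordered by the feature of interest.

```python
def isotonic_fit(ys):
    """Least-squares non-decreasing fit to ys (pool adjacent violators)."""
    blocks = []  # each block holds [sum of values, number of values]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        # Every point in a block receives the block mean.
        fitted.extend([total / count] * count)
    return fitted


# Made-up values for four ordered feature levels; the third breaks monotonicity.
observed = [1.0, 3.0, 2.0, 4.0]
fitted = isotonic_fit(observed)  # [1.0, 2.5, 2.5, 4.0]
```

The fitted values can be read directly as a step function that never decreases, so a domain expert can check at a glance that the model respects the expected direction of the relationship.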

Explanations Don’t Really Explain

There has been a lot of research into producing explanations for the outputs of black box models. Rudin argues that this approach is fundamentally flawed. At the root of her argument is the observation that post hoc explanations are only really “guessing” at what the black box model is doing.

Explanations must be wrong. They cannot have perfect fidelity with respect to the original model. If the explanation were completely faithful to what the original model computes, the explanation would equal the original model and one would not need the original model in the first place, only the explanation.

Even the word “explanation” is problematic, because we are not really describing what the original model actually does. The example of Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) brings this distinction to life. A linear explanation model for COMPAS created by ProPublica, and dependent on race, was used to accuse COMPAS (which is a black box) of depending on race. But we don’t know whether or not COMPAS has race as a feature (though it may well have correlated variables).

Let us stop calling approximations to black box model predictions explanations. For a model that does not use race explicitly, an automated explanation—“This model predicts you will be arrested because you are black”—is not a model of what the model is actually doing and would be confusing to a judge, lawyer, or defendant.

In the image space, saliency maps show us where the network is looking, but even they don’t tell us what it is truly looking at. Saliency maps for many different classes can be very similar.
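A toy sketch can show how two classes may end up with the same saliency map. Everything here is hypothetical: `class_scores` is a made-up two-class model over a four-pixel "image", and `saliency` uses absolute finite-difference sensitivity as a simple stand-in for gradient-based saliency. It is not any particular published saliency method.

```python
def class_scores(pixels):
    # Hypothetical toy model: both classes read only pixels 1 and 2,
    # but they use that contrast in opposite ways.
    edge = pixels[1] - pixels[2]
    return {"class_a": 2.0 * edge, "class_b": -2.0 * edge}


def saliency(pixels, label, eps=1e-4):
    """Absolute finite-difference sensitivity of one class score per pixel."""
    base = class_scores(pixels)[label]
    result = []
    for i in range(len(pixels)):
        bumped = list(pixels)
        bumped[i] += eps  # nudge one pixel and see how much the score moves
        result.append(abs(class_scores(bumped)[label] - base) / eps)
    return result


image = [0.2, 0.9, 0.1, 0.5]
map_a = saliency(image, "class_a")
map_b = saliency(image, "class_b")
scores = class_scores(image)
```

Here `map_a` and `map_b` highlight the same two pixels with the same magnitude, even though the two class scores have opposite signs. The map indicates which pixels matter, not how they are being used.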

Because explanations are not really explaining, identifying and troubleshooting issues with black box models can be very difficult.
