This post is the second in a series of three based on interviews of Courtney Bowman and Anthony Bak of Palantir. The first post laid a foundation which can help any organization seeking to establish an ethics policy and program around AI and ML. This one picks up from there and dives into the role of humans in AI systems for the enterprise.
Gourley: The approach to ethics issues around AI and ML you introduced us to has a clear respect for the roles of humans in the equation. How do you articulate the roles of humans and computers in an enterprise AI system? What is it that humans are adding to the equation?
Bowman: Humans provide the understanding. In part, this involves the deeper semantics that comes from living in the world, having shared lives with other people, and being entrenched in linguistic, social, and cultural practices that are often ambiguous and very context-dependent. Understanding allows us to do much more than just tell subtly distorted images of school buses apart from ostriches. It also forms the basis of other, much more delicate forms of cognitive reasoning, like judging what’s just, fair, virtuous, or beautiful.
Gourley: That’s really interesting. It’s also kind of complicated. In your view, how do you get an institution to internalize these subtle but important distinctions? In other words, is it possible to reduce ethics and ethical AI to compliance frameworks and policies that can be embodied in process or even checklists?
Bowman: Policies are critical for establishing a baseline for what constitutes acceptable decisions. But I can tell from your question that you understand where policies fall short. You can’t enshrine every nuance of ethics in policy, nor can you rely on process to keep you in the clear. It’s just as critical to create a culture with shared principles, vocabulary, and frameworks for understanding and discussing ethical considerations. For a tech company like Palantir, we recommend starting with principles to ground your technical and business decisions (you can read ours here) and providing clear escalation paths and an open environment to ensure that developers see and raise flags when something doesn’t align with those principles. But being able to identify and raise those flags requires more than just operational policies. It’s a matter of culture. Building a culture of ethical engineering is like learning to speak a language: you have to practice using the language all the time, and you improve through repetition and lived experience. You need training and workshops, and you need to engage in transparent dialogue about the decisions you make and how they align with your principles.
Bak: So, I think what Courtney’s trying to say is, no, ethical engineering is not a compliance task. You don’t get to morally defensible technology applications just by ticking a lot of boxes, because those boxes alone won’t always tell you what to do when a truly novel situation presents itself, or when two seemingly countervailing considerations demand a decision one way or another, in the same way that rote memorization of vocabulary won’t tell you how to articulate a unique thought in a new language you’re studying.
Gourley: I see what you’re saying. But even skilled language instructors employ tools to help students learn. How do you help instruct computer scientists to become grounded in a culture of ethical engineering?
Bowman: One thing you do is focus design practices on methodologies and frameworks for architecting systems that have a vision of morally good outcomes, or at least of certain critical features of those outcomes. What does that mean in practice? Well, it means reminding yourself to grapple with potentially difficult questions at each stage of development, questions like: “What is this technology intended to do in a defensible way, but also what could it be re-purposed to do in ways that reasonable people would find objectionable?” “How do I minimize the risk of unintended consequences and harms?” “How do I keep my technology out of the hands of irresponsible or unsavory users?” And so on.
Having a framework for asking these sorts of questions at critical stages of AI/ML development is the most important task. But it’s also important to give engineers the resources, vocabulary, and sense of responsibility to act and build appropriately.
Gourley: I think I see what you’re suggesting in a broad sense, but can you make this more concrete in terms of frameworks that might be applied in AI/ML applications?
Bak: Sure. Applying AI ethics is multi-layered. The first layer is evaluating whether you should be using AI to solve a problem at all. We work in domains where decisions are “high consequence,” and we see poor problem selection all the time. For instance, there was a recent paper where researchers were trying to predict criminality using facial recognition software. Now, without knowing more, I could’ve told you at the outset that this was a problematic study, because my moral intuition tells me that there’s no such thing as a criminal face (I also hear my mother’s voice in my head reminding me, “never judge a book by its cover!”).
As it turns out, the flaws in this study were even simpler than you might suspect. In the images used to train the model, the researchers used mugshots for the criminals and actor “headshots” for the non-criminals. The software was able to achieve high accuracy just by picking up on cues such as “is the person smiling?” Of course, not many people are at their best in their mugshots and are therefore not inclined to smile. Nevertheless, I would argue this isn’t a problem worth solving in the first place, even as an academic exercise, and no matter how you try to control for training data bias (say, by using glamour shots of convicted felons and/or having actors stage mugshots), the dataset is still going to lead to reinforcement of one kind of entrenched systemic bias or another, be it race, ethnicity, or something else.
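A toy illustration of that failure mode, built on synthetic data rather than the actual study’s images, might look like the sketch below: a classifier given nothing but a “smiling” feature looks accurate simply because the two classes were collected in different ways.

```python
# Toy sketch with synthetic data (not the actual study): the label encodes the
# *source* of each photo, and the only feature is whether the subject smiles.
# The classifier scores well while learning nothing about criminality.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
is_mugshot = rng.integers(0, 2, size=n)                        # 1 = mugshot, 0 = actor headshot
smiling = rng.random(n) < np.where(is_mugshot == 1, 0.1, 0.9)  # headshots mostly smile

X = smiling.reshape(-1, 1).astype(float)  # sole feature: "is smiling"
y = is_mugshot                            # the study's proxy label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
print(f"accuracy from the smiling cue alone: {model.score(X_te, y_te):.2f}")  # ~0.90
```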
So if you’re dealing with a problem type where the cost of making wrong predictions (whether false positives or false negatives) is high, and the nature of the data is such that a certain level of wrong predictions is inevitable (for systemic or other reasons), AI/ML is not the approach you should be taking. We call this “problem selection” and it’s a threshold consideration.
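One illustrative way to make that threshold explicit (the numbers and the harm-budget framing below are our own, not Palantir’s actual criteria) is to compare the expected harm from a best-case model’s irreducible errors against what the application can tolerate:

```python
# Illustrative "problem selection" check with made-up numbers: if even a
# best-case model's irreducible error rates imply unacceptable harm, the
# answer is not to build the model, not to tune it harder.
def acceptable_to_model(best_case_fpr, best_case_fnr,
                        cost_false_positive, cost_false_negative,
                        harm_budget_per_decision):
    expected_harm = (best_case_fpr * cost_false_positive
                     + best_case_fnr * cost_false_negative)
    return expected_harm <= harm_budget_per_decision

# High-consequence domain: even a 2% floor on false positives is too costly.
print(acceptable_to_model(0.02, 0.05,
                          cost_false_positive=1000,
                          cost_false_negative=200,
                          harm_budget_per_decision=10))  # False
```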
Gourley: Okay, but let’s assume you’ve selected a problem that meets your threshold criteria. What next?
Bak: Assuming you’ve found a legitimate AI problem, you then need to ensure that the training data you’re building upon is appropriately representative. This means not just looking at distributions of variables within the data, but considering the system that generated the data and what the implications are for the inferences you’re trying to make from it. The data you use to train a model is the most important factor determining the model’s performance characteristics, both in terms of overall accuracy and in terms of fairness or other normative concerns.
This issue of where the data comes from is really subtle. One example is from image recognition, where you want to identify objects in images. A fairly common way to get image data is to use images posted on the web and social media as training data for AI; the task for the machine is to say what’s in an image: here’s a person, here’s a car, and so on. People then try to transfer these models to a robot so it can identify objects. What you find is that, even with similar photo sensors being used, the accuracy of the model drops significantly. This can be traced back to the fact that images posted online are purposely framed for aesthetic reasons, while the robot is more haphazardly scanning the environment. That small difference in how the data was generated can make a big difference in performance.
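A minimal sketch of that kind of train/deploy mismatch, using synthetic features in place of real images (the “web” and “robot” domains and the size of the shift are assumptions for illustration):

```python
# Sketch of a domain-shift check with synthetic data: the same model is scored
# on data drawn the way it was trained ("web") and on data generated
# differently ("robot"). The shift here stands in for framing differences.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_domain(n, center_offset):
    # Two classes of "image features"; the offset models a different capture process.
    X0 = rng.normal(0.0 + center_offset, 1.0, size=(n, 5))
    X1 = rng.normal(1.0 + center_offset, 1.0, size=(n, 5))
    X = np.vstack([X0, X1])
    y = np.array([0] * n + [1] * n)
    return X, y

X_web, y_web = make_domain(2000, center_offset=0.0)      # curated, well-framed images
X_robot, y_robot = make_domain(2000, center_offset=0.8)  # haphazardly captured frames

model = LogisticRegression().fit(X_web, y_web)
print(f"accuracy on web-like data:   {model.score(X_web, y_web):.2f}")
print(f"accuracy on robot-like data: {model.score(X_robot, y_robot):.2f}")  # noticeably lower
```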
Many ethical issues have similar roots, in that the process generating the data differs from the application domain. For example, if you use the same kind of dataset to find people in photographs, you may find, and in fact people do find, that the model doesn’t do as well at finding minorities because they are under-represented in the training data. Google famously released a model that identified many people of African descent as gorillas.
Gourley: It seems like there’s a whole cluster of challenges that arise in working with training datasets.
Bak: Right. As the robot example shows, it’s not just about over- or under-representation of objects of interest in the data, and as the second example shows, it’s not always obvious how to “fix” the training data (say, by gathering more photos with minority representation). This becomes particularly pernicious when you create a feedback loop between the data your model uses and the decisions it enables. Consider, for example, using records of stop-and-frisk encounters to build a model suggesting who to stop and frisk. The data is subject to human biases around race, gender, and class, and when you feed it into a system that doesn’t understand that, you end up with a model that says those are the types of people who should be stopped. If people follow the model, then new training data is created that further exacerbates the initial problem. Studies have even shown that there’s a tendency for ML to make biases even more pronounced than what might exist in the training data. Understanding what it means for the data to be biased is not just a matter of looking at the data itself; you have to consider the larger system in which it sits, both in terms of the training data used and in terms of how a model built on that training data will feed back into the larger system.
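A deliberately simplified toy simulation of that loop (all numbers invented, and not a model of any real policing system): two groups have identical underlying behavior, but one starts out over-stopped. Because each round of stops follows the records produced so far, the allocation never corrects itself and the cumulative disparity in records keeps growing.

```python
# Toy feedback-loop sketch with invented numbers: groups A and B have the same
# true hit rate, but A starts out over-stopped. Each round, 1,000 new stops
# are allocated in proportion to the records accumulated so far, so the system
# never un-learns the initial skew and the record gap widens every round.
true_hit_rate = 0.10
stops = {"A": 600.0, "B": 400.0}                         # biased initial allocation
records = {g: stops[g] * true_hit_rate for g in stops}   # records generated so far

for round_num in range(1, 6):
    total_records = sum(records.values())
    stops = {g: 1000 * records[g] / total_records for g in records}  # follow the data
    for g in records:
        records[g] += stops[g] * true_hit_rate            # new records from new stops
    print(f"round {round_num}: stops A={stops['A']:.0f}, B={stops['B']:.0f}, "
          f"cumulative record gap A-B={records['A'] - records['B']:.0f}")
```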
Gourley: So then what do you — as a tech company — do to address these AI training data problems?
Bak: For Palantir, one of the questions we ask when faced with these sorts of complicated phenomena is: are there ways in which we can build technology to help solve these issues? Up front, it’s clear that we won’t be able to completely obviate the need for careful thought, but we do look for ways to help people, or at least nudge them along, in their thought processes.
There are two areas where we see technology helping to avoid some of these ethical pitfalls: (1) model testing, validation, and monitoring, and (2) data management.
Our goal is to have Palantir provide industry-leading monitoring and validation for AI models. This means tracking the model’s performance, what it is getting right or wrong as well as how it performs on different subgroups of the data; providing visualizations of the data the model uses, to surface representation bias and other issues; and, where possible, helping the end user interpret how the model is making decisions, for example by examining the relative importance of different characteristics for a data point, or through other, more sophisticated methods. Part of this involves good out-of-the-box defaults so that, at a minimum, you are encouraged to track a wide range of performance issues and metrics. But we also understand that that alone will not be sufficient, because each model application must be contextually tuned. So we provide flexible, configurable tools to make it as easy as possible to evaluate ML bias, whether it comes from the data used to train the model or shows up in the model’s outcomes.
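As a rough sketch of the kind of subgroup tracking being described here (the column names and data are placeholders, and this is not Palantir’s actual tooling):

```python
# Sketch of per-subgroup model monitoring with pandas: the same predictions
# are broken out by a sensitive attribute so accuracy and false-positive
# rates can be compared across groups. Column names and values are placeholders.
import pandas as pd

df = pd.DataFrame({
    "group":      ["A", "A", "A", "A", "B", "B", "B", "B"],
    "label":      [1,   0,   1,   0,   0,   1,   0,   0],
    "prediction": [1,   0,   0,   0,   1,   1,   0,   1],
})

def subgroup_report(frame: pd.DataFrame) -> pd.Series:
    tp = ((frame.prediction == 1) & (frame.label == 1)).sum()
    fp = ((frame.prediction == 1) & (frame.label == 0)).sum()
    tn = ((frame.prediction == 0) & (frame.label == 0)).sum()
    fn = ((frame.prediction == 0) & (frame.label == 1)).sum()
    return pd.Series({
        "n": len(frame),
        "accuracy": (tp + tn) / len(frame),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
    })

# Groups with similar accuracy can still have very different false-positive rates.
print(df.groupby("group").apply(subgroup_report))
```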
For data management, our system is built around being able to easily and transparently track where data comes from, who has used it and how, including whether it has been used as training, testing, or live data for AI models, as well as any decisions made using the model.
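A minimal sketch of what such a lineage record might capture (the field names and values are our own illustration, not any particular product’s data model):

```python
# Illustrative lineage record: enough metadata to answer "where did this data
# come from, which models touched it, and what decisions relied on it."
# Field names are hypothetical, not a real product schema.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class DatasetLineage:
    dataset_id: str
    source: str                                               # originating system or collection process
    created_at: datetime
    transformations: List[str] = field(default_factory=list)  # cleaning / joins applied
    used_as: List[str] = field(default_factory=list)          # "training", "testing", "live"
    models_trained: List[str] = field(default_factory=list)   # model identifiers
    decisions_made: List[str] = field(default_factory=list)   # downstream decision records

record = DatasetLineage(dataset_id="example_dataset_v1",
                        source="upstream_export",
                        created_at=datetime(2020, 1, 1))
record.transformations.append("dedupe + normalize")
record.used_as.append("training")
record.models_trained.append("example_model_v2")
print(record)
```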
Model tracking and data management provide entry points for discovering and addressing bias issues. We understand that by themselves they are necessary but not sufficient for addressing all ethical issues.
Gourley: Does Palantir put in place other capabilities to make that easier?
Bak: We make it easier both for our developers creating solutions and for the customers operating them by providing methods for algorithmic explainability and interpretability. The same is true for more standard data protection features and capabilities, like granular access control, data provenance and lineage tracking, data retention and deletion management, and audit logging.
Bowman: Notice that what Anthony’s describing are practices and implementations of AI support tools that are intended to enforce both good data science hygiene and morally defensible outcomes. I can’t stress this enough: it’s our belief that most of the time, there isn’t really a trade-off to be made between doing high-quality AI/ML and doing ethical AI/ML. In other words, it turns out the methods and practices of qualitatively good AI/ML tend to be wholly consistent with the demands of ethical AI/ML.
Gourley: That sounds like a principle. Is that what you mean when you say you established principles?
Bowman: Exactly. It’s one of our guiding precepts in working on AI and ML.
Gourley: What would you say are some of your other principles?
Bak: Whether a solution involves fully automated decision-making or machine-assisted decision-making, we humans have to think through the inherent limitations of the solution. AI systems frequently encode human behavior, but without the understanding that an actual human provides. Our default position and guiding principle here is that, generally speaking and especially when the stakes are high, a human should be in the loop in the systems we field, exactly so that their understanding is retained. A further corollary is that the higher the consequences of a system, the more rigorous we should be in evaluating whether AI is appropriate at all.
Bowman: Another, related principle is that, in addition to carefully thinking through the inherent limitations of the solution, we also have a responsibility to evaluate how the solution could be re-purposed or extended in abusive ways. Evaluating a potential application of AI must include not just an assessment of what the technology can and cannot do, but also (and perhaps more importantly) what it should and should not be permitted to do.
The third post in our series based on the interview with Bowman and Bak will bring more focus to advice for the C-Suite.