When we talk about the use of artificial intelligence in war, we tend to assume that AI produces outputs that are either right or wrong. The archetypal example is an AI-enabled decision support system (AI-DSS) that labels a child playing with a stick as an adult carrying a gun.
But, in reality, AI systems do not fail. As long as they haven’t been hacked or experienced a strictly technical malfunction, computers produce outputs in accordance with their algorithms and their code. If a child is killed by humans acting on the recommendation of an AI system, the AI was doing its job exactly as it was designed to; it was the humans who were at fault.
This is not nit-picking. Militaries have acknowledged that machines are not legally responsible entities under the laws of war. Therefore, if we insist that machines do not positively apply the law by being right, then we must also accept that they cannot fail to apply it by being wrong.
Humans, on the other hand, can and do err. And when they do something wrong, they are (in theory at least) held responsible. In the case above, the human should catch the AI’s misidentification and stop the attack against the child. If they don’t catch it, they (or someone) should be held accountable. Otherwise, the whole system falls apart.
Put another way, human soldiers should always have the capacity to make the right decision, regardless of what the AI system tells them to do. As I explain in a new report published today by the International Committee of the Red Cross, this means that the human must be able to know how any given AI output relates to the particular context of the decision. And that, in turn, requires a knowledge and understanding of the uncertainties, assumptions, and biases (let’s call them UABs) embedded in that output.
The ABCs of UABs
All AI outputs that pass before a decision maker’s eyes have uncertainties, assumptions and biases that need to be accounted for in the decision.
First, there are uncertainties. The human who is using an AI system needs to be able to gauge exactly how much they don’t know.
For example, the feed from an infrared satellite does not include data on the color of objects. Likewise, a position derived from GPS signals is never millimetrically precise about where an object is located. If the color or exact location of an object could spell the difference between it being civilian or military, then those uncertainties would need to be carefully considered before the human acts on computerized outputs that use those data.
Similarly, the training data of machine learning systems is a source of uncertainty. How many examples of children wielding sticks were in the dataset? Is it possible that there were none at all? Ignoring these types of uncertainties in situations where they would be relevant could be a breach of the law of armed conflict.
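One way to surface this kind of uncertainty is simply to audit what the training data does and does not cover. The sketch below is purely illustrative; the labels and counts are hypothetical, not drawn from any real dataset.

```python
# A minimal sketch (hypothetical labels and counts) of auditing training-data coverage.
from collections import Counter

# Imagine each training example carries a human-assigned label; no "child_with_stick"
# examples were ever collected.
training_labels = ["adult_with_rifle"] * 4200 + ["adult_unarmed"] * 3800

coverage = Counter(training_labels)
for label in ["adult_with_rifle", "adult_unarmed", "child_with_stick"]:
    n = coverage.get(label, 0)
    # Zero or near-zero coverage means the system's behavior on that class is largely
    # unknown, an uncertainty the human decision maker has to account for.
    print(f"{label}: {n} examples" + ("  <-- effectively untested" if n < 50 else ""))
```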
Second, all decision-supporting computer systems operate under certain assumptions. This is because assumptions are what turn raw data (e.g., an assortment of pixels in a video) into information (“those pixels are a person, and that person is holding a gun”).
In traditional computer systems, these assumptions are deliberately codified by way of concrete rules. For example, a rule might be: “if an object is traveling at more than 300 knots and is heading towards you, it should always be identified as a potential incoming threat and marked with a red cursor.” In machine learning-based systems, these assumptions are generated by the computer itself during training. If a large enough share of the training examples labeled “incoming threat” involve objects traveling at more than 300 knots in a particular direction, then the system will probably label any similar examples it encounters in use as “incoming threat.”
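To make the contrast concrete, here is a minimal sketch of what such a hard-coded rule might look like. The 300-knot threshold, field names, and labels are illustrative only and are not drawn from any real system.

```python
# A hypothetical sketch of an assumption hard-coded as a rule.
from dataclasses import dataclass

@dataclass
class Track:
    speed_knots: float   # measured ground speed of the detected object
    closing: bool        # True if the object is heading toward the defended position

def classify_track(track: Track) -> str:
    """Rule-based assumption: fast, closing objects are potential incoming threats."""
    if track.speed_knots > 300 and track.closing:
        return "potential incoming threat"   # would be marked with a red cursor
    return "unclassified"

# A machine-learning system encodes a similar assumption implicitly: if most training
# examples labeled "incoming threat" were fast and closing, the learned model will tend
# to reproduce that boundary for anything similar it encounters in use.
print(classify_track(Track(speed_knots=480.0, closing=True)))  # potential incoming threat
print(classify_track(Track(speed_knots=430.0, closing=True)))  # a commercial jet triggers the same rule
```

The point is not the rule itself but that the assumption is explicit here, whereas a learned model buries an equivalent assumption in its parameters.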
Such assumptions play an enormously important role in data-driven warfare. In counter-insurgency operations—where the goal is to identify insurgents from among the broader civilian population—many computerized detection systems, including some of those currently being used in Gaza, are based on assumptions about particular patterns of behavior. According to recent reporting, if a person in Gaza regularly changes their phone number, for example, this can serve—in combination with other indicators—to code certain individuals as “combatants.”
Assumptions that turn data into information are always context-sensitive. Commercial airplanes often fly at more than 300 knots, and sometimes they do so in the direction of military installations. A system that marks all aircraft flying at more than 300 knots as threats would be problematic in mixed airspace unless humans take that assumption into account when looking at every red dot on their screen.
Assumptions also play out differently in different cultural or practical contexts. In some countries, roadwork teams that dig holes along roads—a behavior that might be coded as an indicator that they are insurgents planting roadside bombs—often operate at night because it is cooler than in the day. In Gaza, civilians often change their numbers because they lose their phones or they lose their service as a result of the military operations around them.
Finally, systems can be biased. Biases arise when there are systematic discrepancies between the properties of the theoretical world that a computer was coded or trained for and the real-life properties of the world to which it is deployed. A system that identifies people as threats based solely on whether they are carrying a gun will exhibit biases against people who are more likely to carry guns for reasons that have nothing to do with ongoing active hostilities (e.g., farmers protecting livestock).
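As a toy illustration, consider a sketch in which “carries a gun” is the sole threat indicator and the numbers are entirely hypothetical: the same rule that looks tolerable in the world it was tested in becomes heavily skewed in a world where many civilians carry guns.

```python
# A toy illustration of bias as a mismatch between the world a system was built for
# and the world it is deployed to. All numbers are hypothetical.

def flags_as_threat(person: dict) -> bool:
    # The system's embedded assumption: carrying a gun implies participation in hostilities.
    return person["carries_gun"]

def false_flag_rate(population) -> float:
    """Share of flagged people who are in fact civilians."""
    flagged = [p for p in population if flags_as_threat(p)]
    return sum(1 for p in flagged if not p["is_fighter"]) / len(flagged)

# World the system was designed and tested for: gun-carrying mostly coincides with fighters.
design_world = ([{"carries_gun": True, "is_fighter": True}] * 90
                + [{"carries_gun": True, "is_fighter": False}] * 10)

# World it is deployed to: many civilians (e.g., farmers protecting livestock) carry guns.
deployed_world = ([{"carries_gun": True, "is_fighter": True}] * 20
                  + [{"carries_gun": True, "is_fighter": False}] * 80)

print(false_flag_rate(design_world))    # 0.1: looks tolerable in testing
print(false_flag_rate(deployed_world))  # 0.8: the same rule is heavily biased in the field
```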
If anyone tries to tell you that AI systems can be “debiased,” don’t believe them. There is no such thing as a computer system that is not biased against something or someone in certain contexts. The best one can do is limit the use of the system to situations where biases are less likely to arise. This requires a deep knowledge of what those biases are and the physical, cultural, and demographic context in which the systems will be used.
The interplay between uncertainties, assumptions and biases further complicates this problem. Assumptions can result in implicit biases, and these biases may only be relevant in the context of specific uncertainties. Assumptions and biases that are not understood by the human user are, themselves, a source of uncertainty.
What’s to be Done?
Last year’s “Political Declaration on Responsible Military Use of Artificial Intelligence and Autonomy”—which was endorsed by 52 states—hints at the problem of UABs. It declares that service members who use AI must be capable of “mak[ing] appropriate context-informed judgments.” But that doesn’t capture the magnitude of the task that lies ahead. Teaching AI systems to distinguish between sticks and guns is difficult; teaching humans to reliably and responsibly judge their way around UABs is a monumental challenge.
Militaries say humans will always be “in the loop” when it comes to AI. If that is so, they need to be honest about the fact that, despite decades of hard scientific research on the matter, many of the questions around how to account for UABs remain entirely unanswered. Governments often talk about biases in AI systems, but they never talk about uncertainties and assumptions. States also need to be clear that lessons from the civilian uses of AI will only get you so far in battle. One of the primary military motivations for adopting AI is speed; when humans are expected to account for UABs quickly in the fog and fury of war, the challenges are multiplied.
In the absence of better answers, the big question will also remain unanswered: Who should be held accountable for harm that arises from imperfect interactions between humans and machines? The human operator could not be fully to blame if they were unaware of the system’s faulty statistical assumptions, its undisclosed biases, or its invisible uncertainties. In that case, are the people who designed the system likely to take the fall? Probably not; either they are indemnified from liability, or the UABs in question never emerged in testing. What about the leaders who decided that using the AI would be a good idea in the first place—could they be brought before a court martial? Unlikely.
This is what is referred to as an “accountability gap.” Such gaps can arise even when the technology generally works very well. If a military assures you that its human-machine systems get it right 99.9 percent of the time, but can’t tell you how they apply the law in the 0.1 percent of cases where they don’t, then that’s not good enough.
Given the intense interest in AI among militaries around the globe, and the broad way in which the discourse mischaracterizes errors, that’s bad news for applying law and ethics to human decision-making in war. All of which might make it sound like the arena of public discourse has failed us on the question of military AI. But that’s not true. There has been a lot of phenomenal scholarship on these issues lately. I see real progress. Many parties, and some countries, are asking the right questions. But please, let’s stop saying the machine got it wrong.