Many of you will have seen the furore caused by David Heinemeier Hansson’s tweet about the massive disparity in credit limits between him and his wife on their Apple Cards (story from the Reg here). As usual in these circumstances, the blame was laid on the algorithm used to determine credit-worthiness. And they’re not alone: Twitter is awash with similar stories, such as the HR department rejecting perfectly good applicants because their personality test said “no”. The world is becoming a slave to the algorithm.
If we’re not careful it is this, not The Terminator, that will prove the greatest threat from Artificial Intelligence: the surrender of control to increasingly automated systems, with little supervision of the system as a whole and insufficient feedback on whole-system efficacy, rather than simply on whether individual decisions are made correctly.
In the work we’ve done on AI, it has become obvious that there are effectively three planes of working. The first is the Application Layer: data is ingested and processed, and an action is triggered at the other end. This can happen with or without the intervention of AI and could even be a manual process. Above that sits the AI Layer, which takes feeds of training data combined with objectives set by the enterprise’s data scientists. The sophistication of the AI can vary, as can the extent to which it is applied to the various elements of the application. Above that sits the Oversight Layer, which has responsibility for setting objectives and for critically reviewing both those objectives and the means used to achieve them, i.e. the AI Layer below.
NB – this graphic is still a bit of a work in progress, so expect a few refinements.
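To make the three planes a little more concrete, here is a minimal sketch in Python of how they might fit together. It is purely illustrative: the class names, the approve/decline example and the simple threshold objective are my own shorthand, not a description of any particular product or of the graphic above.

```python
# Purely illustrative sketch of the three planes described above.
# All names here are my own; nothing is taken from a real system.

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Decision:
    subject_id: str
    outcome: str   # e.g. "approve" / "decline"
    score: float   # the model's raw score


class AILayer:
    """Wraps a model trained against objectives set by data scientists."""

    def __init__(self, model: Callable[[dict], float], threshold: float = 0.5):
        self.model = model
        self.threshold = threshold  # the objective reduced to a simple cut-off

    def decide(self, record: dict) -> Decision:
        score = self.model(record)
        outcome = "approve" if score >= self.threshold else "decline"
        return Decision(record["id"], outcome, score)


class ApplicationLayer:
    """Ingests data, asks the AI Layer for a decision, triggers an action."""

    def __init__(self, ai_layer: AILayer, act: Callable[[Decision], None]):
        self.ai_layer = ai_layer
        self.act = act  # e.g. set a credit limit, send a rejection letter

    def handle(self, record: dict) -> Decision:
        decision = self.ai_layer.decide(record)
        self.act(decision)
        return decision


@dataclass
class OversightLayer:
    """Sets objectives and critically reviews them and the AI's outcomes."""

    decision_log: list = field(default_factory=list)

    def record(self, record: dict, decision: Decision) -> None:
        self.decision_log.append((record, decision))

    def review(self) -> dict:
        # Whole-system questions live here: not "is the model accurate?"
        # but "are the aggregate outcomes consistent with the organisation's
        # objectives?" (see the parity check further down).
        approvals = sum(1 for _, d in self.decision_log if d.outcome == "approve")
        return {"decisions": len(self.decision_log), "approvals": approvals}
```

The point of the structure is that the Oversight Layer sees decisions in aggregate and asks whole-system questions, rather than re-checking each individual prediction.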
Failures of AI are almost universally due to weaknesses in, or the complete absence of, the Oversight Layer. Without it, the AI might be supervised, in as much as there are checks on whether it is functioning correctly. But there is a world of difference between an AI that is functioning correctly and one that is producing results in keeping with the overall organisation’s objectives (which should include, one would hope, “not being racist, sexist or otherwise prejudiced”).
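By way of illustration, here is a hedged sketch of the kind of check that belongs in the Oversight Layer: a model can score perfectly well on “is it functioning correctly?” and still fail a whole-system test of outcome parity. The field names and the four-fifths threshold below are my own assumptions, not a prescribed standard.

```python
# Illustrative only: an oversight-layer check that a model can pass on
# accuracy yet fail on parity of outcomes across groups.

from collections import defaultdict


def approval_rates_by_group(decisions: list) -> dict:
    """decisions: [{"group": "A", "outcome": "approve"}, ...]"""
    totals = defaultdict(int)
    approvals = defaultdict(int)
    for d in decisions:
        totals[d["group"]] += 1
        if d["outcome"] == "approve":
            approvals[d["group"]] += 1
    return {g: approvals[g] / totals[g] for g in totals}


def oversight_check(decisions: list, min_ratio: float = 0.8) -> bool:
    """Flag the system if any group's approval rate falls below
    min_ratio times the best-treated group's rate."""
    rates = approval_rates_by_group(decisions)
    return min(rates.values()) >= min_ratio * max(rates.values())


# Example: every individual decision may "work", but the aggregate
# outcomes should trigger a review by the Oversight Layer.
sample = (
    [{"group": "men", "outcome": "approve"}] * 90
    + [{"group": "men", "outcome": "decline"}] * 10
    + [{"group": "women", "outcome": "approve"}] * 45
    + [{"group": "women", "outcome": "decline"}] * 55
)
print(oversight_check(sample))  # False -> the Oversight Layer should intervene
```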
Consider Tay, the Microsoft AI bot developed to learn better conversational language, which was shut down for parroting offensive (to put it mildly) opinions on Twitter. It was, one must assume, functioning within the parameters set out for it. However, I don’t think anyone would call it a success! It was subsequently replaced by Zo, which went completely the other way and has been criticised as sanctimonious and judgemental in the extreme.
Analysis of the implementation of cutting-edge technology often sounds like a broken record: it’s not the hardware or the software, it’s the wetware. Half-baked AI that hasn’t been anywhere near stress-tested enough is let loose on the public, with predictable results. A larger dose of OT (rather than IT) principles would also help: don’t beta test in the field, stress test the capability properly, use robust feedback mechanisms, and so forth. And these need to be applied to whether the AI itself is functioning correctly, not merely to whether the application it drives is performing within some narrowly defined parameters.
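As a sketch of what one such OT-style gate might look like in practice, the snippet below replays applications that differ only in a protected attribute before anything is deployed. The model interface and attribute names are placeholders of my own, not a real API.

```python
# A hedged sketch of one pre-release gate: before the model goes anywhere
# near the public, replay applications that differ only in a protected
# attribute and block deployment if decisions diverge.

from typing import Callable


def counterfactual_stress_test(
    model: Callable[[dict], str],
    applications: list,
    protected_attr: str = "gender",
    swap: tuple = ("male", "female"),
) -> list:
    """Return every application whose decision changes when only the
    protected attribute is flipped. An empty list is a (weak) pass."""
    failures = []
    for app in applications:
        flipped = dict(app)
        flipped[protected_attr] = swap[1] if app[protected_attr] == swap[0] else swap[0]
        if model(app) != model(flipped):
            failures.append(app)
    return failures
```

Wired into a release pipeline, a non-empty result blocks deployment and feeds back to the data scientists, rather than being discovered on Twitter.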
It was interesting to see Applause, whose heritage is in app testing, launch a capability aimed at testing AI across a number of areas, including image recognition, voice and chatbots, and addressing bias. More on that here. This offers one option for addressing the issue: outsource bias testing to a third party, which has the benefit of aggregating experience from multiple clients to identify and avoid pitfalls.