Update your infrastructure
In the earlier revolutions, like the electricity revolution or the agricultural revolution, when you were initially starting to deploy a new technology, the existing systems were really not geared for it. Think about the agricultural revolution: the boundaries of fields were not nice and square, so you couldn’t just take a tractor, run it through for miles and miles, and get the efficiency of having a tractor. The same thing is happening now with artificial intelligence. You are introducing artificial intelligence into a system that was designed for human intelligence. Human intelligence has a limited scope, a limited speed of processing, and a limited amount of data that a person can assimilate; if you put the AI system in with that same set of constraints, it’s limited in how much impact it can have. And so what you need in order to really get the bang for the buck from any new technology is to re-engineer the system from the very beginning.
Test realistic use cases
There’s all this discussion, correctly, about how the machine hallucinates in some cases. So one of the things that is really important is to create realistic use cases that don’t allow the AI to fool me, because the AI can fool itself, and fool me, into thinking that it’s learned a meaningful signal by latching on to something that is not real but is purely an artifact of the data. For example, let’s imagine I’m trying to deploy a system that helps clinicians with diagnosis. I can train my model on interactions from some number of clinicians and then deploy it for those same clinicians, at which point I’ve anchored myself on the specific patterns of use that those clinicians have in this particular context, at this particular moment in time. I would be very wary of taking that system and deploying it to a different clinician the system has never seen before, because nothing in the way the system was tested suggests it would generalize to a clinician it had never seen.
Conversely, if I took the system, trained it on a hundred clinicians, gave it a hundred new clinicians it hadn’t seen before, and it was still helpful to them, I might feel more confident about deploying that system to the hundred and first new clinician, because I have evidence that the system generalizes from this use case to that one. So think about how far you would like the system to generalize — in machine learning jargon, we call this “generalizing out of distribution.” What did I train the system on? And, conversely, what are the use cases where I expect it to be deployed in practice, and is it tested in that context? If you don’t test the system in a regime that is truly reflective of where it will be deployed in practice, you run the risk of misleading yourself.
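The “hundred new clinicians” test above amounts to holding out entire clinicians, rather than individual interactions, when splitting data for evaluation. A minimal sketch, with synthetic records and an illustrative `clinician` field (both assumptions, not from the source):

```python
import random

# Synthetic interaction records, each tagged with a clinician id.
# The ids and record shape are illustrative only.
random.seed(0)
records = [{"clinician": random.randrange(200), "features": None}
           for _ in range(1000)]

# Hold out whole clinicians, not individual interactions, so the test
# set measures generalization to clinicians the system has never seen.
clinicians = sorted({r["clinician"] for r in records})
random.shuffle(clinicians)
held_out = set(clinicians[: len(clinicians) // 2])

train = [r for r in records if r["clinician"] not in held_out]
test = [r for r in records if r["clinician"] in held_out]

# Sanity check: no clinician appears on both sides of the split.
assert {r["clinician"] for r in train}.isdisjoint(
    {r["clinician"] for r in test})
```

A random split over individual interactions would leak each clinician’s patterns into both sides and overstate how well the system generalizes; the group-level split is what licenses confidence about the hundred and first clinician.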
Ensure quality prompting
There was this beautiful paper by Microsoft showing that a large language model can handle really challenging clinical diagnostic cases that most clinicians can’t. Then there was another article that took the same idea and asked what you would think is the same question, and the accuracy they got was less than half. So you ask yourself, what is the difference? The answer is that in the second paper, the prompting wasn’t of a similar quality. They basically just asked, “What would you do in this case?” There wasn’t the kind of intelligent prompting that an expert would do with the system. So it’s important to understand, if you wanted to deploy the system that Microsoft publicized, how it would be used in practice. Would there be an expert prompter who gives you the accuracy that you need?
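The gap between the two papers comes down to how the same case is wrapped. A hypothetical illustration of that difference — neither template comes from the papers mentioned; both prompts are assumptions for illustration:

```python
def naive_prompt(case: str) -> str:
    # The bare question the second article reportedly used.
    return f"What would you do in this case?\n\n{case}"


def expert_prompt(case: str) -> str:
    # Expert-style prompting typically adds role framing, a structured
    # reasoning request, and an explicit output format. This wording is
    # an illustrative assumption, not the Microsoft paper's prompt.
    return (
        "You are an experienced attending physician.\n"
        "Work through the case step by step: list the key findings, "
        "give a ranked differential diagnosis with reasoning, and state "
        "which single test would best discriminate between the top two "
        "candidates.\n\n"
        f"Case:\n{case}"
    )


case = "A 54-year-old presents with fever, weight loss, and a new murmur."
print(naive_prompt(case))
print(expert_prompt(case))
```

If an evaluation (or a deployment) sends only the naive version, it is measuring the model under conditions no expert user would accept, which is one plausible reading of why the second paper’s accuracy fell so far short.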