Discussions about artificial intelligence, particularly large language models (LLMs) like ChatGPT, have exploded over the last 12 months. They promise productivity and efficiency gains, but also open up novel risks that need to be considered.
Some of these risks relate to data and privacy – what are they doing with the prompts you feed them? Others relate to the quality of the output, with the model providing incorrect but plausible-sounding answers. That brings us to the double-edged findings of a recent research paper from the Harvard Business School.
In this blog, we will cover:
- A summary of the paper’s findings
- Some caveats and limitations
- What does this mean for risk managers?
- Looking beyond the risk team
Subscribe to our knowledge hub to get practical resources, eBooks, webinar invites and more showing the latest developments in risk, resilience and compliance, direct to your inbox.
What did the paper find?
The paper, lengthily titled Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality, sets out to answer questions about how useful LLMs (in this case GPT-4) were for a range of tasks. The subjects of the experiment were strategy consultants – one might reasonably assume there are some similarities in tasks across knowledge work, including risk management.
The experiment explored tasks ‘inside the frontier’ and ‘outside the frontier’, or, stripping the jargon, tasks that the GPT-4 model could do well and those it couldn’t. Some of the more interesting findings:
- For tasks it is naturally good at, AI improved the average score of all groups, but gave a bigger improvement to those who were less skilled at the outset – i.e. it closed a skill gap
- Those tasks were also completed 25% faster by the AI groups than the non-AI groups
- While the content of those tasks was better on average, it was much less varied than that of the control group who didn’t use AI (the answers were less original and more similar to one another)
- For tasks that it was not naturally good at, the use of AI actually decreased the number of correct answers compared to the control group who didn’t use AI
An important point that the authors highlight is that it is not just whether you decide to use LLMs that matters, but also how you use them. This leads us to their cute classification of some of the more productive participants as centaurs or cyborgs:
- Centaurs – Divide tasks and sub-tasks between the human and the AI, after identifying those they believe each is good at, and then integrate the outputs of both.
- Cyborgs – Take a more interactive and iterative approach. Cyborgs don’t simply accept the output, they use their expertise to continually challenge and shape the outputs.
This breakdown is clearly a little simplistic, and the paper acknowledges that how best to use LLMs in workflows is an area that needs further exploration.
What are some possible issues with the results?
I’ve naturally simplified my summary (you can read the full report here if you want to dig deeper), but a few specific things stood out as I walked through the findings:
- There was a financial incentive for participants to ensure they used the AI – could this have contributed to people using it when they didn’t actually want to, or didn’t perceive a benefit?
- The tasks, while assessed and aligned with workflow, were fabricated, including one that was specifically designed to be outside of the LLM’s capabilities (which apparently the researchers also found hard to create)
- The scoring model for the easy tasks was an ordinal scale from 1 to 10 – is a 6 really three times better than a 2? There is no reference to a final output or objective that matters. While I don’t think the improvement itself is in dispute, relying on the percentages alone might be questionable
- While you can make assumptions about knowledge work generally, this is based on a specific domain and use case
It’s also worth noting that developments in AI and LLMs are moving increasingly fast. Many providers are experimenting with multi-modal models, integrating image recognition, image creation, and other ‘languages’, which will also affect tasks and workflows.
What does this mean for risk managers?
The findings will have some application for all knowledge workers, but let’s focus on risk managers. Some tasks benefit from AI, and some get worse… Of the tasks that you or your team do, can you determine which fall inside or outside the frontier?
LLMs are good at confidently coming up with plausible-sounding answers – even when they are incorrect. One of the challenges with LLMs is that they are trained on a massive amount of data – some of which may be incorrect or superseded, or may produce an ‘average’ version of a topic that isn’t nuanced and doesn’t represent pioneering thought in that domain. If you ask a question or request output on the topic of risk or risk management with little context, you get a very generic answer.
Based on my personal experience, you need to have a certain level of expertise in order to pick up on cues or outputs that need to be challenged, have additional context added, or simply start over. ChatGPT can easily create a risk register with minimal information about a business or an activity. While some of the information might be relevant, most examples I’ve seen propose a risk rating even though there is almost no context provided, or a list of causes and impacts rather than risks. Its outputs always need to be vetted.
This brings up another interesting comment from the paper. The easy tasks can often be done faster by AI – why not get AI to do all of them, and leave only the harder stuff to the expert? But then how do you build new expertise beyond the frontier, except through experience at the tasks within that frontier?
Despite all of the above, there are still many areas within the risk domain that LLMs can help with. Here are a few that I like:
- Developing plausible scenarios and exercises for operational resilience and business continuity
- Defining specific templates and structures for ChatGPT to work within – e.g. a structure of risk event, causes and impacts to build a risk register (see the sketch after this list)
- Describing a risk in context, and asking for a breakdown of how to more accurately assess its potential impact or range of outcomes
- Asking for potential key risk indicators for identified risks (followed by asking it what poor outcomes or perverse incentives those key risk indicators might also create)
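To make the template idea concrete, here is a minimal sketch of what a structured risk register prompt might look like. It assumes the OpenAI Python SDK (v1-style client) with an API key in the environment; the model name, field names and wording are illustrative choices, not a prescribed standard.

```python
# Illustrative sketch: constraining an LLM to a fixed risk register structure.
# Assumes the OpenAI Python SDK (v1+) and OPENAI_API_KEY set in the environment.
# The model name and field layout are assumptions for illustration only.
from openai import OpenAI

RISK_REGISTER_TEMPLATE = """
You are assisting a risk manager. For the activity described below, return
exactly one risk register entry using this structure:

Risk event: <a single, clearly worded risk event - not a cause or an impact>
Causes: <bulleted list of plausible causes>
Impacts: <bulleted list of potential impacts>
Existing controls: <'To be confirmed' unless stated in the context>
Suggested rating: <only if enough context is given; otherwise 'Insufficient context'>

Context:
{context}
"""

def draft_risk_entry(context: str, model: str = "gpt-4o") -> str:
    """Ask the model for one templated risk register entry (a draft for expert review)."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": RISK_REGISTER_TEMPLATE.format(context=context)}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(draft_risk_entry("A mid-sized retailer migrating payment processing to a new cloud provider."))
```

The value here is the constraint rather than the specific fields: by forcing the model to separate risk events from causes and impacts, and to withhold a rating when context is thin, you reduce the failure modes described earlier. The output is still a first draft for an expert to challenge, not a finished register.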
This is just the tip of the iceberg. All of these need to consider another risk: ensuring you don’t share personal or commercially sensitive information that might be used to train the model. Either take appropriate care, or use a model that has acceptable privacy settings.
Beyond the risk team
Let’s step outside the risk team for a moment. The risk team should be aware that the above findings apply to all knowledge workers in their organisation. Does your organisation know who is using LLMs, and for what purpose? A challenging follow-up is: how would you know if they were?
If they are, how do you know which employees are using LLMs for tasks within the frontier, and which might be outside it? Is there any guidance? Consider whether you need to adopt an AI policy, or guidelines and templates for specific use cases, to improve consistency and quality.
Conclusions and next steps for your organisation
Large language models can enhance both efficiency and quality, within and beyond the risk team, but they must be used with care and expertise to avoid nonsensical outputs. For most organisations, we would recommend that this includes an AI policy and guidelines that set down clear expectations for employees to follow.
If you are considering implementing AI in your organisation – or indeed, considering the implications of it having already been implemented at a decentralised level – then you should download our IT Risk Management eBook, which includes a more extensive discussion of AI risks, how to fit them into your enterprise risk management framework, and a checklist for best-practice AI governance: