Measuring reputation is important for both valuation and understanding an organisation’s risk profile. Social media postings are an excellent source of information but the large amounts of data involved can make this a resource intensive process. Karim Derrick of Kennedys IQ looks at the potential of large language models to take on the task and how they compare to human analysis.
We are seeing growing interest in how corporates perform on ESG issues. There is demand from both investors and the public for funds that commit to investing in socially responsible, climate-conscious businesses, and a company’s ESG activities can have a major impact on how it is portrayed in the media.
How people view a company in ESG terms is a strong predictor of financial market sentiment, and positive ESG news builds trust among investors and enhances reputation. Negative stories involving poor treatment of employees, overstated climate change credentials or unethical use of data are just some of the ESG issues that can lead to heightened reputational risk.
At the same time, social media has exploded and provides a wealth of data with the potential to yield important insights into corporate reputation. Assessing and analysing this data brings the possibility of predicting the impact of ESG issues on corporate stock value and corporate risk, but it is also a hugely labour-intensive process that is open to error.
This is where the rapidly developing field of artificial intelligence and large language models comes in. If we can be confident that machines are as effective as, or perhaps even better than, humans at assessing sentiment, the ability to carry out this kind of work in an efficient and cost-effective way presents a considerable opportunity.
Measuring opinions, attitudes and emotions
Along with colleagues at the University of Manchester, we have sought to investigate this, carrying out a study that looked at how effective humans are at the simple task of assessing sentiment in a sample of ESG-related social media posts, and comparing this with the effectiveness of cutting-edge machine approaches to assessing the same posts.
We wanted to understand how consistent human ESG sentiment analysis is and whether large language models might in fact produce better performance.
Sentiment analysis, also known as opinion analysis, is the study of opinions, attitudes and emotions towards objects or entities, which might include products, organisations, stories, politics or individuals.
The main objective in sentiment analysis is to identify whether a text is positive or negative towards a subject (its polarity). Measuring sentiment is not always straightforward: the expression of sentiment can be subjective and is often ambiguous.
Computing sentiment
Until recently, the most widely used methods of computing sentiment attempted to derive meaning from text on the assumption that the order, and thus the context, of words is unimportant. These methods, typically known as “bag-of-words” approaches, reduce a collection of documents to a term matrix, with a row for each word and a column of word counts for each document.
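To make the idea concrete, here is a minimal sketch (not part of the study) of a bag-of-words term matrix built with scikit-learn’s CountVectorizer; the example tweets are invented:

```python
# A minimal bag-of-words sketch: documents are reduced to a matrix of word
# counts, with word order (and therefore context) thrown away entirely.
# Note that scikit-learn puts documents in rows and terms in columns,
# i.e. the transpose of the term-document layout described above.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Our new ESG report shows record emissions cuts",
    "Regulator fines company over misleading ESG claims",
]

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary
print(counts.toarray())                    # word counts per document
```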
Sentiment dictionaries are a variation on the “bag-of-words” concept: they consist of collections of words associated with a particular sentiment. The task is simple – make a long list of positive and negative words and then count how many words of each category occur. Dictionary-based approaches like this generally do not perform well. Some do when tailored for specific tasks, but they often suffer from false negatives.
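VADER, the dictionary-based approach used as a baseline later in this study, works broadly along these lines. A minimal sketch using the vaderSentiment package, with an invented tweet, might look like this:

```python
# Dictionary-based sentiment with VADER: words are looked up in a lexicon of
# scored terms and the scores are aggregated into a compound polarity score.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores(
    "Great to see the company finally publish its emissions data"
)
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# A common convention: compound >= 0.05 is positive, <= -0.05 negative, else neutral
label = ("positive" if scores["compound"] >= 0.05
         else "negative" if scores["compound"] <= -0.05
         else "neutral")
print(label)
```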
Latent semantic analysis is a development of the simpler “bag-of-words” approach that uses singular value decomposition to reduce the dimensionality of the term-document matrix and isolate latent topics. In effect, it is factor analysis for text.
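Again purely as an illustration, and not a reconstruction of any model used in the study, the same idea can be sketched with a TF-IDF term-document matrix reduced by truncated SVD:

```python
# Latent semantic analysis in miniature: build a TF-IDF matrix, then use
# truncated SVD to project documents onto a small number of latent "topics".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "board approves new climate transition plan",
    "union criticises treatment of warehouse employees",
    "investors welcome stronger emissions targets",
    "data breach raises questions over customer privacy",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)  # two latent topics
topic_space = svd.fit_transform(X)                  # documents in topic space
print(topic_space.round(2))
```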
Word classification using supervised machine learning is another method. These approaches use either researcher-defined rules and filters or some other factor, such as readers’ reaction to the text, as the dependent variable, which removes the researcher’s subjectivity from the classification itself. Numerous studies in the financial sector have used this approach to good effect.
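A minimal sketch of this supervised approach, with invented training examples rather than any dataset from the study, could look like the following:

```python
# Supervised sentiment classification: learn word weights from labelled
# examples rather than from a hand-built dictionary (illustrative data only).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "record profits and praise for the diversity programme",
    "fined for dumping waste and misleading investors",
    "strong governance praised by shareholders",
    "accused of greenwashing its climate pledges",
]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["regulator opens inquiry into emissions reporting"]))
```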
Language models
Language models are trained to predict the next word given a text as context. As such, they represent a significant development in the natural language processing field.
The base models are trained on very large corpora of text and can then be fine-tuned on domain- or task-specific texts, which can include sentiment analysis. Fine-tuning allows the model weights to be adjusted for downstream natural language tasks. Early attempts at this approach used BERT (Bidirectional Encoder Representations from Transformers), an early model built on the Transformer architecture. More recently, large language models such as the Generative Pretrained Transformers (GPT) have been trained on 45TB of text and feature some 175 billion parameters. They have demonstrated remarkable abilities out of the box.
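Out-of-the-box sentiment classification with a GPT-style model amounts to little more than asking the model for a label. The sketch below assumes the openai Python package (v1 or later) and an API key; the prompt is illustrative and is not the one used in the study:

```python
# Zero-shot ESG sentiment classification with a large language model,
# roughly in the spirit of the out-of-the-box GPT approaches discussed here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_sentiment(tweet: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Classify the ESG sentiment of the tweet as exactly one of: "
                        "positive, negative, neutral."},
            {"role": "user", "content": tweet},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(classify_sentiment("Are CEOs still greenwashing their ESG goals?"))
```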
Establishing a ‘gold standard’
For our study, tweets were randomly collected from X (formerly Twitter) based only on whether or not they contained a small number of preselected ESG-related keywords.
To evaluate machine-based approaches in sentiment analysis and to draw meaningful comparison against human performance, some form of objective measurement is required.
Our researchers classified the sentiment of 150 tweets and established a gold standard classification for each tweet based on the consensus of three researchers. This was not an easy process and demonstrated the fallibility of human judgement. It was hard to get agreement as to the sentiment of each tweet, even when the task was ostensibly simple.
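One simple way to formalise this kind of consensus, sketched here with invented annotations rather than the study’s actual data or reconciliation process, is a majority vote with disagreements flagged for discussion:

```python
# Deriving a consensus "gold standard" label from several annotators:
# take the majority vote and flag tweets where the researchers fully disagree.
from collections import Counter

annotations = {
    "tweet_1": ["positive", "positive", "neutral"],
    "tweet_2": ["negative", "neutral", "positive"],  # no majority: needs discussion
}

for tweet_id, labels in annotations.items():
    label, count = Counter(labels).most_common(1)[0]
    if count >= 2:
        print(tweet_id, "->", label)
    else:
        print(tweet_id, "-> no consensus, resolve by discussion")
```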
This process in itself highlighted a flaw in much of our relationship with AI, particularly in the professional services space. Too often new technologies are being used without a proper appreciation of what success looks like. Without establishing a ‘gold standard’ baseline of what we want to achieve, we cannot properly evaluate the efficacy of these models, but this is something humans are not especially good at.
A step change in performance
Once the gold standard data set was established, it was then used to measure the performance of different machine approaches: one based on the VADER dictionary approach to sentiment classification and then multiple language model approaches, including FinBERT, GPT3.5 and GPT4.
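With a gold standard in hand, comparing approaches reduces to scoring each model’s predicted labels against the consensus labels. A sketch using scikit-learn’s standard metrics, with invented labels rather than the study’s results, looks like this:

```python
# Evaluating an approach against the gold standard: compare its predicted
# labels with the consensus labels and report overall and per-class scores.
from sklearn.metrics import accuracy_score, classification_report

gold        = ["positive", "negative", "neutral", "negative", "positive"]
predictions = ["positive", "neutral",  "neutral", "negative", "positive"]  # one model's output

print("accuracy:", accuracy_score(gold, predictions))
print(classification_report(gold, predictions, zero_division=0))
```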
As expected, the dictionary-based approach to sentiment analysis was less effective on nearly every measure. Just because a tweet features ESG related language does not mean that the polarity of the language used is a good predictor of the polarity of the overall statement. The dictionary approach cannot discriminate and its ability to correctly gauge sentiment is not much better than chance.
The large language model-based approaches all fared better. FinBERT performed least well of these: good performance in identifying positive statements was offset by an inability to identify negative or neutral statements.
GPT3.5 was noticeably better, but not stellar. Sarcasm was readily identified and the ability to recognise statements that are merely informational was significantly improved compared with FinBERT, but it was not perfect. GPT4 enjoyed all the improvements of GPT3.5 and was better and more accurate on every count. Where GPT3.5 could not discern a subtle double negative, GPT4 got it right.
Tweets that take a stance only indirectly, by asking a question, can still imply a negative position. This subtlety was identified by GPT4 when analysing statements like: “Are #CEOs and corporate executives still greenwashing or rainbow-washing their #ESG / #SDG goals? Some thoughts on this from my closing remarks at #GEPInnovate2022”.
Before large language models, and with human judgement as the “gold standard” against which everything else is measured, algorithms were deemed to perform about as well as a random human judge, and only when trained for a specific domain, usually at great effort. In other words, their performance was flawed.
Given GPT4’s 95% out-of-the-box accuracy, it seems likely that human performance will be exceeded without effort, unless substantial effort is made to select and train human judges. The gold standard in this study was established only with substantial effort and much discussion; by comparison, the language models were effortless.
Innovative solutions
This has implications for our reliance on human professionals, who often lack consistency when it is needed most. Until now, creating statistical models of human performance has been onerous, requiring large datasets, much data wrangling and a lot of computing power. In this study, out-of-the-box GPT4 exceeded human performance without any additional training, with minimal computing and with little setup overhead.
The significant improvement in the accuracy of sentiment measurement shows the considerable potential of these models for predicting company performance and future risk.
It also highlights the need for proper benchmarking – what these large language models can do is impressive, but their output must be challenged and checked against a gold standard of what we want to achieve. Human input is still needed at the outset to establish the efficacy of new technologies that can augment and improve human judgement.
Given the inconsistency of humans when it comes to establishing sentiment, this is an area that needs development alongside the implementation of AI so that we can effectively measure success.
For the insurance industry these findings open up the possibility of being able to measure and anticipate reputational risk without the deployment of considerable computing or manpower. Tools are already in development that can analyse corporate documents and publicly available information to create a real-time reputational risk index – these are set to improve considerably as large language models mature.
Insurers should take notice. Innovative solutions are on their way and are set to transform how risks are measured and predicted. In this fast-paced new world, smart organisations will want to ensure they are part of that change.
About the author: Karim Derrick is Chief Products Officer at Kennedys IQ, the client-facing technology arm of Kennedys LLP, creating ‘baked in’ legal technology products for use by financial services clients. The paper this article was based on can be found here.