Artificial Intelligence (AI) and the literature review process: Generative AI Technology

Application of AI tools such as ChatGPT to searching and all aspects of the literature review process

Natural language processing is a research field that combines linguistics with computer science. It aims to enable computers to understand human language through research into tasks such as classifying text, summarizing text, tagging parts of speech, classifying documents, answering questions and sentiment analysis.
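As a small illustration of one of these tasks, the sketch below runs sentiment analysis with the Hugging Face Transformers pipeline API; the example sentences and the library choice are illustrative assumptions, not part of this guide.

```python
# A minimal sentiment-analysis sketch using the Hugging Face Transformers
# pipeline API; sentences are invented for illustration.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a small default model

print(classifier("This systematic review was thorough and well organised."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
print(classifier("The search strategy missed several key databases."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.98...}]
```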

Understanding Generative AI technology requires you to grasp key concepts such as: Language model, Pre-training, Fine-tuning, Tokens, Transformer and Generative pre-trained transformer (GPT).

Language models

Language models are "statistical models of word sequences" (Jurafsky and Martin, 2014: 85).

There are two types of language models:

(i) statistical language models, which compute the probability of the next token (character, word or string) based on the previous tokens or sequence of tokens (see the sketch below);

and

(ii) neural language models, which use the power of neural networks to model sequences of words.
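To make the first type concrete, here is a minimal sketch of a statistical (bigram) language model that estimates the probability of the next word from counts in a tiny corpus; the corpus and the words are invented for illustration.

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word (bigram counts)
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probability(prev: str, nxt: str) -> float:
    """P(next word | previous word), estimated from the bigram counts."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(next_word_probability("the", "cat"))  # "cat" follows "the" 2 times out of 4 -> 0.5
```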

 

Since the launch of ELMo in 2018, with 94 million parameters and a pretraining dataset of >1 billion tokens, these models have expanded in size because training larger models, especially those based on the transformer architecture (see below), has produced further improvements (Bender et al., 2021; Sanh et al., 2020).

"This trend of increasingly large LMs can be expected to continue as long as they correlate with an increase in performance" (Bender et al., 2021: 611).

These models are now collectively known as large language models (LLMs). Table 3 of Ray (2023) provides a comparison of LLMs. LLMs are typically trained on massive amounts of text data, such as Wikipedia, news articles, books, and social media posts. This allows them to learn the patterns and relationships that exist within language and use this knowledge to complete natural language processing tasks.

Pre-training

Pre-training is used to enhance language model performance. It consists of training the model on large amounts of data, such as text or images, before it is adapted to a specific task.

In natural language processing, there are five categories of pre-training tasks, as outlined in the survey of pre-training foundation models by Zhou et al. (2023). Pre-trained language models are classified according to their word-representation approach:

  1. Autoregressive
  2. Contextual
  3. Permuted (Zhou et al., 2023).

Autoregressive language models predict the next possible word based on the preceding word, or predict the last possible word based on the succeeding word. LLMs therefore rely on pattern recognition and probabilities to predict the most likely correct response. For common questions or calculations, LLMs often provide accurate answers because they have seen similar patterns. They do not have a direct counting function, and they can sometimes miscalculate or skip over small details. Asking them to count the exact number of letters in a word, or how many words are in an essay, can lead to errors (see OpenAI Developer Forum discussion).
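To make the autoregressive idea concrete, here is a minimal sketch of greedy next-word prediction over a toy probability table; the table, words and probabilities are invented for illustration and are not how a real LLM stores its knowledge.

```python
# Toy conditional probabilities P(next word | previous word), invented for illustration
next_word_probs = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"down": 0.9, "up": 0.1},
    "ran": {"away": 1.0},
}

def generate(start: str, max_words: int = 4) -> list[str]:
    """Greedy autoregressive generation: repeatedly pick the most likely next word."""
    words = [start]
    for _ in range(max_words):
        options = next_word_probs.get(words[-1])
        if not options:
            break  # no continuation known for this word
        words.append(max(options, key=options.get))
    return words

print(generate("the"))  # ['the', 'cat', 'sat', 'down']
```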

Fine-tuning

After pre-training, the language model is then fine-tuned using a smaller dataset to update the model's weights and biases to better fit the task, usually to improve efficiency, effectiveness and privacy. 

Computational cost and training duration are lower for fine-tuning (smaller datasets, days to weeks) than for pre-training (large datasets and complex models, months to years). Pre-training involves extensive training on enormous volumes of unlabelled data, while fine-tuning adapts the pre-trained model to meet precise requirements using smaller, labelled datasets to create accurate and contextually appropriate responses (see Parthasarathy et al., 2024 for the seven-stage process of fine-tuning).
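As a rough illustration of this step, the sketch below adapts a small pre-trained model to a labelled text-classification dataset with the Hugging Face Transformers Trainer; the model name, dataset and hyperparameters are illustrative assumptions, not the procedure described by Parthasarathy et al. (2024).

```python
# A minimal fine-tuning sketch: adapt a small pre-trained model to a labelled task.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # illustrative small pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# A small labelled dataset (here: movie-review sentiment), as an example task
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

# Fine-tuning only updates the pre-trained weights on the new task;
# it runs in hours on a single GPU rather than the months of pre-training.
args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
                  eval_dataset=tokenized["test"].select(range(500)))
trainer.train()
```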

 

Fine-tuning to align with human instruction

Language models need to be fine tuned so that they a) meet the user's explicit requirements and b) meet the implicit requirements such as being truthful and not being biased, toxic, or harmful (Ouyang et al., 2022). 

The foundational model created by Anthropic was designed to be "helpful, harmless, and honest" (Askell et al., 2021; Bai et al., 2022). Models created since then have tended to follow these intentions.

Reinforcement learning from human feedback (RLHF) was originally used to train robots in simulated environments and Atari games (Christiano et al., 2017) and has since been applied to align language models with human objectives. RLHF was used to train GPT-4, Claude, Bard (now Gemini) and Meta’s Llama 2-Chat. The training process using RLHF involves three key stages: supervised fine-tuning (SFT), reward model (RM) training, and proximal policy optimization (PPO) on this reward model (Zheng et al., 2023). There are key challenges of RLHF (see Casper et al., 2023):

  • challenges associated with human feedback - RLHF needs human evaluators to be selected and instructed. Human evaluators can pursue harmful goals, either maliciously or innocently; they bring their own biases; and the demographics of those selected may differ from those of the general population. In addition, evaluators are often paid by the number of examples they assess, so there is a financial incentive to get through as many as possible, which may come at the expense of accuracy. When examples are difficult or complex to evaluate, evaluators may provide poor feedback. LLMs sound confident even when they are incorrect, which can lead evaluators to give more positive feedback and raises concerns about AI tools manipulating humans without their designers intending it (Carroll et al., 2023).
  • challenges associated with training the reward model - reward models can be difficult to train even with high-quality human feedback. A single reward function cannot represent the diverse society in which we live. When preferences, expertise and capabilities differ, the majority wins, potentially placing under-represented groups at a disadvantage.
  • challenges associated with policy.

These challenges lie at the heart of the issues surrounding the use and ethics of AI tools which are outlined in a separate page.
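As a concrete glimpse of the reward-model stage mentioned above, here is a minimal sketch of the pairwise preference loss commonly used to train RLHF reward models; the tensor names and scores are invented for illustration and this is not the exact training code used for any particular model.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss for an RLHF reward model.

    reward_chosen / reward_rejected are the scalar scores the reward model
    gives to the response a human evaluator preferred and to the one they
    rejected. Minimising this loss pushes preferred responses to score higher.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: reward scores for three human preference pairs
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
print(reward_model_loss(chosen, rejected))  # a single scalar loss value
```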

Tokens

When processing prompts and training data, language models break down the information into small text chunks called tokens. For example, in English, a token could be a word (e.g., "cat"), a sub-word (e.g., "un" and "happiness" as separate tokens in "unhappiness") or even individual characters (e.g. @ # $ %). These tokens are the building blocks that LLMs can understand and manipulate to construct meaningful and fluent text.
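A minimal sketch of tokenisation using OpenAI's open-source tiktoken library is shown below; the tokenizer name is an assumption and the exact sub-word splits vary from tokenizer to tokenizer, so the printed pieces are only indicative.

```python
import tiktoken

# Tokenizer used by several recent OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

for text in ["cat", "unhappiness", "@#$%"]:
    token_ids = enc.encode(text)                 # integer IDs the model actually sees
    pieces = [enc.decode([t]) for t in token_ids]  # the sub-word pieces they represent
    print(text, "->", pieces)
# Each string becomes one or more tokens; short common words are often one token,
# while longer or rarer words are split into sub-word pieces.
```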

Models often have a maximum token limit for each input sequence. Exceeding this limit may require truncation or other strategies to fit the input into the model's context window.

There is also a token limit per session. This means AI agents can "forget" earlier prompts in a conversation, and you may have to re-state important information throughout the exchange. Some agents indicate this limit in their answers.
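As an illustration of working within such limits, the sketch below drops the oldest messages in a conversation until what remains fits a token budget; the budget, the messages and the tokenizer choice are illustrative assumptions rather than how any particular agent manages its context.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
TOKEN_BUDGET = 25  # illustrative only; real context windows are far larger

conversation = [
    "Please help me plan a literature search on machine learning in nursing.",
    "Which databases should I include and why?",
    "Now draft a search string combining the concepts we discussed.",
]

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# Drop the oldest messages until the whole conversation fits the budget;
# anything removed is effectively "forgotten" by the model.
while conversation and sum(count_tokens(m) for m in conversation) > TOKEN_BUDGET:
    conversation.pop(0)

print(conversation)  # only the most recent messages that fit the budget remain
```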

Transformer

Transformer architecture was introduced by Vaswani et al. (2017); six of the eight researchers were based at Google.

The architecture is based on the use of an attention mechanism and has applications in natural language processing, computer vision, graph learning and speech recognition. Transformer-based language models such as GPT-3 and GPT-4 are autoregressive models.
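To show what the attention mechanism computes, here is a minimal sketch of scaled dot-product attention, the core operation of the transformer described by Vaswani et al. (2017); the tensor shapes and random inputs are illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention (Vaswani et al., 2017).

    q, k, v: tensors of shape (sequence_length, d_model). Each position
    attends to every position, weighted by how similar its query is to
    the other positions' keys.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # similarity between positions
    weights = F.softmax(scores, dim=-1)            # attention weights sum to 1
    return weights @ v                             # weighted sum of values

# Toy example: a "sentence" of 4 tokens with 8-dimensional embeddings
x = torch.randn(4, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([4, 8])
```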

Generative pre-trained transformer (GPT)

Radford et al. (2018), working at OpenAI, demonstrated large improvements in natural language processing tasks through generative pre-training of a language model on a dataset of unlabelled text, followed by fine-tuning on each specific task. They used the BooksCorpus dataset of 7,000 unpublished books for pre-training. Their model was based on the transformer architecture.

The GPT (Generative Pre-trained Transformer) family created by OpenAI has shown improvements in capability from GPT-1 (117 million parameters) in 2018 to GPT-3 (175 billion parameters) in 2020. OpenAI is a world-leading artificial intelligence company founded in 2015 which has since received billions of dollars of investment from Microsoft.

Brown et al. (2020), working at OpenAI, trained GPT-3 on Common Crawl, a dataset of nearly one trillion words. The model showed strong performance on many natural language processing tasks and benchmarks in the zero-shot, one-shot and few-shot settings, as well as on tasks that require on-the-fly reasoning, such as unscrambling words, using a novel word in a sentence, or performing three-digit arithmetic.
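The difference between zero-shot and few-shot settings can be seen in how the prompt is written. The sketch below sends both styles of prompt to a model through the OpenAI Python client; the model name and the word-unscrambling task are illustrative assumptions.

```python
# A minimal sketch of zero-shot vs few-shot prompting via the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Zero-shot: the task is described but no worked examples are given
zero_shot = "Unscramble the letters to form an English word: 'tnicoa' ->"

# Few-shot: a handful of solved examples precede the new case
few_shot = (
    "Unscramble the letters to form an English word.\n"
    "'elppa' -> 'apple'\n"
    "'srta' -> 'star'\n"
    "'tnicoa' ->"
)

for prompt in (zero_shot, few_shot):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)
```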

This family expanded further with the launch of GPT-3.5 in March 2022 and GPT-4 (reported to have around 1 trillion parameters, and optimized for chat) in March 2023. ChatGPT has been fine-tuned using both supervised and reinforcement learning techniques (OpenAI, 2022). It is estimated to have reached 100 million monthly active users in January 2023, just two months after launch, making it the fastest-growing consumer application in history (Hu, 2023). GPT-4 is a model based on transformer architecture pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using reinforcement learning from human feedback (RLHF). As the technical report states, "given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar" (OpenAI, 2023).

OpenAI's GPT-4o model, launched in May 2024, accepts text, audio, image, and video as input, generating any combination of text, audio, and image outputs. Because of this multimodal nature, OpenAI is "still just scratching the surface of exploring what the model can do and its limitations" (OpenAI, 2024).