Artificial Intelligence (AI) and the literature review process: Data extraction
Data extraction tools automatically identify data within a paper and save it into a table or spreadsheet. However, whereas most natural language processing research has examined the screening phase of systematic literature reviews, far fewer studies have investigated the data extraction phase. There were just 26 articles relating to this topic from January 2000 to January 2015 (Jonnalagadda et al., 2015). Yet these tools can save you a considerable amount of time.
AI tools for data extraction
Elicit
Elicit will extract data from PDFs you upload, saving you time and allowing you to then synthesise the information. Elicit's own user survey found that 10% of respondents said it saves them 5 or more hours each week. Elicit also reports that, in pilot projects, it saved research groups 50% in costs and more than 50% in time by automating data extraction work they previously did manually (Elicit, 2023).
The free basic version allows you to upload your own papers and extract data from them. However, only paid versions of the product will give you summaries of papers and allow you to export the information in CSV and BIB formats.
Using Elicit
Elicit has been designed as an AI research assistant to help with the literature review workflow. It extracts information from papers into an organized table.
There are 24 different types of information that you can extract and view in columns in a table, including:
- The findings
- Details of participants
- Location/country
- Outcomes measured.
Viewing the information in a table makes it easier for you to synthesize the articles and increases your own understanding of the literature.
Elicit saves you time by doing this aspect of the review process for you.
In addition, by providing one-sentence summaries of the abstracts, Elicit allows you to decide whether to include the paper in your review or whether you need additional information about the paper.
Elicit works best with the prompt: What are the effects of ___ on ___? You do need to include a question mark in the search. It works less well for identifying facts.
Evidence
Generative AI tools such as ChatGPT can provide contextually relevant and personalized responses from clinical studies but struggle to extract detail or key information (Liu et al., 2023). In addition, ChatGPT tends to miss important attributes in its summaries, such as failing to refer to short-term or long-term outcomes that often have varying risks (Peng et al., 2023). Other AI tools have therefore been used to save time when extracting data from journal articles.
Elicit is currently seen "as a supplement to traditional library database searching for advanced searchers" (Whitfield and Hofmann, 2023: 207). Tools such as Elicit and SciSpace offer substantial assistance to the systematic literature review (SLR) process by extracting key insights from a large number of papers very quickly, reducing the risk of omissions and errors by researchers and saving them time (Dukic et al., 2024).
Elicit's limitations are that:
- it takes information from the Semantic Scholar Academic Graph dataset and relies on open access content via the Unpaywall plugin. This means that papers whose full text cannot be found via Unpaywall may have only limited bibliographic information. You may therefore need to copy and paste the title into Google Scholar to identify whether the paper is a journal article, conference paper, book chapter or thesis.
- it may replace some of the more routine tasks in reviewing the literature, such as data extraction, but "Elicit is not able to perform high-level cognitive functions that are required to create an understanding and synthesize the literature" (Whitfield and Hofmann, 2023: 204).
- "around 90% of the information you see in Elicit is accurate... it’s very important for you to check the work in Elicit closely" (Elicit, 2023). Elicit tries to make this easier for you by identifying all of the sources of the information generated with language models. But it does mean that around 10% of the information you see is incorrect.
- Elicit does not currently answer questions or surface information that is not written about in an academic paper. It tends to work less well for identifying facts and in theoretical or non-empirical domains.
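To put the accuracy limitation above into numbers, the following is a minimal Python sketch, not part of Elicit or any Elicit API. It estimates how many cells in an extraction table might be wrong at an error rate of roughly 10%, and picks a random set of cells to check against the original papers first. The function names, the example table size and the assumption that errors are spread evenly and independently across cells are all illustrative assumptions; the 10% figure is only the approximate rate quoted by Elicit.

```python
import random


def expected_incorrect_cells(n_papers, n_columns, error_rate=0.10):
    """Estimate how many extracted cells may be wrong.

    Assumes (hypothetically) that every cell has the same, independent
    chance of being wrong. error_rate=0.10 mirrors the 'around 10%'
    figure quoted above and is not a measured value.
    """
    total_cells = n_papers * n_columns
    return total_cells * error_rate


def sample_cells_to_check(n_papers, n_columns, sample_size, seed=0):
    """Pick random (paper_index, column_index) pairs to verify by hand first."""
    rng = random.Random(seed)  # fixed seed so the same sample can be reproduced
    all_cells = [(p, c) for p in range(n_papers) for c in range(n_columns)]
    return rng.sample(all_cells, min(sample_size, len(all_cells)))


if __name__ == "__main__":
    # Hypothetical review: 50 papers with 4 extracted columns each.
    print(expected_incorrect_cells(50, 4))
    print(sample_cells_to_check(50, 4, sample_size=10))
```

For a table of 50 papers and 4 columns this works out at about 20 cells that may need correcting. The random sample is only a starting point for checking and does not replace reviewing the extracted information closely.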