A case study in using ChatGPT to extract rare disease data from the literature

by Robert Allaway, Christina Conrad, Sasha Scott, Ann Novakowski, Jineta Banerjee, and Julie Bletz
Published on Nov 16, 2023. DOI 10.21428/4f83582b.bfc5e9cb
Introduction
Last year, with support from the Gilbert Family Foundation, Sage Bionetworks launched a database called Neurofibromatosis (NF) Research Tools Central. NF is a family of rare genetic disorders that affect the nervous system, manifesting primarily as the growth of tumors in the brain, spinal cord, and peripheral nerves. The variation in severity and symptoms among individuals with NF makes it a challenging and complex condition to study. Nevertheless, the NF community, consisting of researchers, patients, advocates, and medical professionals, stands resilient and dedicated to the mission of finding a cure for NF.
The vision for the database is to enable researchers to parse through large stores of information and streamline access to experimental resources, such as animal models and cell lines, that can support their day-to-day experiments. This becomes particularly pertinent given the rapid evolution of technology. For example, techniques such as transgenic animals, knockout/knock-in mice, and CRISPR-Cas9 genome editing each offer distinct advantages and limitations for modeling NF. The database provides several advantages, including the ability to share historical lessons learned from tools that have been used in the field. This is crucial because some of these resources have been employed for over a decade, resulting in a wealth of published information about their utility and effectiveness. Additionally, the database serves as a valuable resource for newcomers, enabling them to enter the realm of NF research more easily.
To begin building the database, we spent a few months curating data from the literature, other databases, and communications with domain experts. If you would like to learn more about the database, feel free to check out our paper.
Here, we want to highlight some of Sage Bionetworks’ recent exploration to make this database more useful for researchers. A crucial part of this database is providing a place for researchers to submit “observations” about tools. These could be natural history observations associated with a specific time point, such as ‘Nf1 Arg681*/Arg681* and Nf1 Δ4/Δ4 embryos failed to develop beyond embryonic day 9.5’, or they could be general observations about a tool, such as ‘Optic gliomas in mice harboring the Nf1 Arg681* allele are more aggressive than those in mice with the Nf1 Δ31&32/Δ31&32 mutation.’ We made a pretty timeline to visualize the time-based observations:
An example natural history timeline at http://tools.nf.synapse.org.
There was just one issue - we were having trouble actually collecting these observations from the community. People are busy, and free data curation is pretty low on most people’s priority list. We tried incentivizing contributions with gift cards and providing content for people’s CVs (listing them as database editors), but these incentives were not very successful. So, how can we add accurate data to this timeline without reading and curating hundreds of papers? GPT-4 to the rescue!
Methods
There are tons of tools out there for processing research papers using large language models (LLMs), but we had the most experience with the GPT-4 API and ChatGPT using GPT-4. Consequently, we didn’t do an extensive comparison of LLMs - this was not a benchmarking exercise. We knew we wanted to stick with ChatGPT for this proof of concept because it is easy to use and quick to get started with. However, ChatGPT alone cannot read scientific papers. There are a few options - including manually pasting in the text you want to extract information from, enabling browsing (a premium feature), or using a third-party plugin (also a premium feature). We noticed a plugin called ScholarAI, which purports to enable searching through the scientific literature and the retrieval of high-accuracy (and low-hallucination) results. The plugin allows you to provide a URL to a PDF and extract information from it interactively with ChatGPT. Under the hood, it seems like ScholarAI converts the PDF to text, takes the user’s input prompt, refines it, and uses that prompt and some sort of semantic search strategy to identify the chunks of the text that are most relevant to the query. A few chunks are returned to ChatGPT, and then GPT-4 does the actual task (e.g., extraction, summarization, translation) on the text chunks that are returned. ScholarAI’s materials indicate that this strategy reduces AI hallucinations, which makes sense. In our experience, the more accurate and specific contextual information you can provide, the less likely the model seems to make stuff up.
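To make that pattern concrete, here is a minimal sketch of what a retrieve-then-extract loop like this might look like. This is our guess at the pattern, not ScholarAI’s actual code - the chunk sizes, the embedding model, and all of the helper names are illustrative assumptions:

```python
# Sketch of a retrieve-then-extract loop (an assumption about the pattern,
# not ScholarAI's actual implementation). Requires: pip install openai numpy
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk_text(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Split the paper's plain text into overlapping chunks."""
    return [text[i : i + size] for i in range(0, len(text), size - overlap)]

def embed(texts: list[str]) -> np.ndarray:
    """Embed chunks as vectors so we can do semantic search over them."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_chunks(query: str, chunks: list[str], k: int = 4) -> list[str]:
    """Return the k chunks most similar to the query (cosine similarity)."""
    chunk_vecs = embed(chunks)
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

def extract_observations(paper_text: str, prompt: str) -> str:
    """Send only the most relevant chunks to GPT-4 for the actual extraction."""
    context = "\n\n".join(top_chunks(prompt, chunk_text(paper_text)))
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Extract observations only from the provided text."},
            {"role": "user", "content": f"{prompt}\n\nText:\n{context}"},
        ],
    )
    return resp.choices[0].message.content
```

The key design point is that GPT-4 only ever sees retrieved text, so it works on what is in front of it rather than “recalling” the paper - which is plausibly why this setup reduces hallucination.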
Results
Great - so now we have an extraction strategy - what next? We spent a couple of hours testing out different prompts and refining them until we were semi-consistently extracting the type of information we were seeking. For example, we’d occasionally get observations like ‘NF1 patients often get cutaneous neurofibromas,’ which was an observation in the source text, but not the kind of observation we wanted to extract. Or we’d get an observation that was purely methodological (‘Weights of control and experimental mice were measured for 4 weeks’). This is technically true, but it’s not a very helpful observation. We’d much rather answer the question: “Is there a difference in weight between the control and experimental groups?” With a bit of prompt refining, we were able to remove, or at least cut down on, these types of observations. We were also able to provide standard categories for each observation, ensuring that GPT-4 would assign a “tag” for each observation using the terms that we wanted it to use.
Finally, since we’re trying to make this easy on ourselves, we wanted the format to be something easy to work with. ChatGPT is great at formatting. Want it in a JSON? You got it! Want a bulleted list? Sure! Want a table? Awesome! We went with JSON because it was easy to read the individual observations to check for errors, and it’s easy to reformat into a table. The final prompt looked something like this:
I am interested in learning more about the mouse models in this manuscript. Please extract a comprehensive list of temporal “observations” relevant to neurofibromatosis from this publication (https://genesdev.cshlp.org/content/15/7/859.full.pdf) that describe the natural history of the model as they relate to development or age. For example, an observation could be “The pigs developed tumor type X at Y months of age.” Do not include observations about humans with NF1. The observations should be formatted like the following JSON:
[
  {
    observationText: This is an example sentence.,
    observationType: [Body Length, Body weight, Coat Color, Disease Susceptibility, Feed Intake, Feeding Behavior, Growth rate, Motor Activity, Organ Development, Reflex Development, Reproductive Behavior, Social Behavior, Swimming Behavior, Tumor Growth, Issue, Depositor Comment, Usage Instructions, General Comment or Review, Other],
    observationPhase: [prenatal, postnatal, null],
    observationTime: a double; the time during the development of the organism which the observation occurred (in combination with the observationTimeUnits attribute),
    observationTimeUnits: [days, weeks, months, years]
  }
]
The response from ChatGPT looks something like this:
Based on the additional chunks retrieved from the paper, here are more temporal observations related to the development of mouse models with NF1:
[ { "observationText": "NF1flox/-mice began to die at 10 months of age, and only 80% survived beyond 14 months.", "observationType": ["Disease Susceptibility", "Mortality"] "observationPhase": ["postnatal"], "observationTime": 10, "observationTimeUnits": "months" }, { "observationText": "NF1\u232c/-embryos are characterized by an enlarged heart, peripheral edema, hemorrhage, small unpigmented eyes, and die at E13.5 because of cardiac abnormalities." "observationType": ["Organ Development", "Disease Susceptibility"] "observationPhase": ["prenatal"] "observationTime": 13.5, "observationTimeUnits": "days" } ]These observations provide specific time points at which certain phenotypes are observed in the mouse models. The first observation indicates a survival rate issue starting at 10 months of age in NF1flox/- mice, while the second observation details the phenotype of embryonic lethality at 13.5 days of embryonic development in NF1Δ/- embryos.
Would you like to continue with the retrieval of more content from the paper for further observations?
This is pretty awesome! Not only does ChatGPT extract the type of observations we are looking for, but it also correctly assigns the phase (“prenatal” or “postnatal”), a type of observation (though note that it is suggesting a type - Mortality - that is not in our list of terms; it is very important to validate these results against our schema), the time, and the units of time for the observation. Looking at the paper for confirmation, we can see that both of these observations are correct. Then we take the data, format it for our database, link it to the right tools, validate it against our schema, slap an “🤖 AI-extracted” header on it, and all of a sudden NF Research Tools Central is AI-powered! For example, check out this NF1 minipig observation (a sketch of the validation step follows the figure):
An observation card in NF Research Tools Central describing minipig development.
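Because GPT-4 occasionally invents a category - like the “Mortality” tag above - every extracted record needs to be checked against our controlled vocabulary before it goes anywhere near the database. Here is a minimal sketch of what that validation and table-flattening step can look like; the field names and the list of terms come from our prompt, but the file names and everything else are illustrative assumptions, not our production code:

```python
# Check AI-extracted observations against the controlled vocabulary from our
# prompt, then flatten the valid ones into table rows. Illustrative sketch only.
import csv
import json

ALLOWED_TYPES = {
    "Body Length", "Body weight", "Coat Color", "Disease Susceptibility",
    "Feed Intake", "Feeding Behavior", "Growth rate", "Motor Activity",
    "Organ Development", "Reflex Development", "Reproductive Behavior",
    "Social Behavior", "Swimming Behavior", "Tumor Growth", "Issue",
    "Depositor Comment", "Usage Instructions", "General Comment or Review",
    "Other",
}
ALLOWED_PHASES = {"prenatal", "postnatal", None}
ALLOWED_UNITS = {"days", "weeks", "months", "years"}

def schema_problems(obs: dict) -> list[str]:
    """Return a list of schema violations for one observation (empty = valid)."""
    problems = []
    unknown = set(obs.get("observationType", [])) - ALLOWED_TYPES
    if unknown:
        problems.append(f"unknown observationType(s): {unknown}")  # e.g. "Mortality"
    for phase in obs.get("observationPhase") or [None]:
        if phase not in ALLOWED_PHASES:
            problems.append(f"unknown observationPhase: {phase}")
    if obs.get("observationTimeUnits") not in ALLOWED_UNITS:
        problems.append(f"unknown observationTimeUnits: {obs.get('observationTimeUnits')}")
    return problems

with open("chatgpt_response.json") as f:  # hypothetical file: the JSON above
    observations = json.load(f)

valid = [o for o in observations if not schema_problems(o)]
flagged = [o for o in observations if schema_problems(o)]  # review these by hand

fields = ["observationText", "observationType", "observationPhase",
          "observationTime", "observationTimeUnits"]
with open("observations.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    for obs in valid:
        # Join list-valued fields so each observation becomes one flat row.
        writer.writerow({k: "; ".join(v) if isinstance(v, list) else v
                         for k, v in obs.items()})
```

Run against the example response above, this would flag the first observation (because of its Mortality tag) for a human look and pass the second straight through.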
Discussion
As neat as this is, there are still several pain points:
It still takes several minutes per paper to extract the observations through the chat interface, and there’s no documented way to use ChatGPT plugins or interact with ScholarAI directly through an API. This prevents us from scaling the approach across the >1000 tools we have in the database, or from automating it when new publications that use a particular research tool appear. Using the GPT-4 API would also cost - roughly estimating - a few thousand dollars if we were to curate all of the publications this way (at 2023 GPT-4 pricing of about $0.03 per 1,000 prompt tokens, a paper chunked into a few tens of thousands of tokens runs on the order of a dollar, multiplied across the publications behind our >1000 tools). Still, this is comparable to, and probably cheaper than, the cost of manual curation.
We don’t know what the hallucination rate is - this approach seems to be very accurate, which is a good thing, but just because we haven’t seen hallucination doesn’t mean it won’t happen (it probably will)!
ScholarAI will only return a couple of chunks to ChatGPT at a time, which typically yields fewer than six “observations.” You need to have an ongoing dialogue with ChatGPT to explore more chunks of the paper, so there’s a collation step that needs to happen (see the sketch after this list).
Similarly, sometimes ScholarAI will keep returning the same chunks of the paper. We’re not sure why this happens, but when it does, we miss observations. It takes some manual inspection to figure out when this is happening.
It’s not clear when an entire publication has been “read” - i.e., how do we know we’ve extracted all of the observations? There’s no way to tell without checking manually, which defeats the purpose.
We can only do this with papers that have the full text freely available. At Sage Bionetworks, we believe that open science helps speed the translation of science into medicine; this is yet another example of why open access publications and/or posting research on preprint servers is so important.
This misses unpublished observations! We still need to figure out how to collect valuable observations that were never formally published and are just stuck in lab notebooks or brains! There is an opportunity here to further study this and to figure out how to incentivize the sharing of valuable observations that would otherwise be lost to time.
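As promised above, here is a rough sketch of the collation step: each chat turn yields a partial JSON list, and because ScholarAI sometimes re-serves the same chunks, we also de-duplicate. The file-naming scheme and the text normalization are illustrative assumptions - a real near-duplicate check would probably need to be fuzzier:

```python
# Collate observations gathered across multiple ChatGPT turns, dropping
# duplicates caused by ScholarAI returning the same chunks more than once.
import glob
import json

def normalize(text: str) -> str:
    """Crude normalization so trivially re-worded duplicates collide."""
    return " ".join(text.lower().split()).rstrip(".")

seen: set[str] = set()
collated: list[dict] = []

# Hypothetical layout: one JSON file saved per chat turn,
# e.g. paper123_turn1.json, paper123_turn2.json, ...
for path in sorted(glob.glob("paper123_turn*.json")):
    with open(path) as f:
        for obs in json.load(f):
            key = normalize(obs["observationText"])
            if key in seen:
                continue  # same chunk came back again; skip the repeat
            seen.add(key)
            collated.append(obs)

print(f"{len(collated)} unique observations collated")
with open("paper123_observations.json", "w") as f:
    json.dump(collated, f, indent=2)
```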
With all that said, we think that this is still a super-powerful approach and has the potential to become even more useful as LLM tools improve and become easier to customize. What’s next?
Foundation models like GPT-4 can now explain images, too - maybe we can extract observational data directly from figures without even relying on the text?! Show, don’t tell.
Often, multiple tools are used in a single paper - with careful prompting, we should be able to extract observations about all of them!
Based on our personal experience, we think this could empower data curation by people who are not subject matter experts. If you don’t understand something about a tool, you can just incorporate that into the conversation! For example: “Why is this mouse called ‘B6.129S6-Nf1<tm1Fcr>/Nci’, and what the heck does that mean?” Curation can become a learning exercise.
During the preparation of this article, OpenAI released “GPTs,” which allow us to create a customized and tuned ChatGPT; we could generate one that takes more standardized input, provides standardized output, and removes a lot of the prompt engineering that we described above. However, at the time of writing, GPTs do not work with ChatGPT plugins (like ScholarAI), so we are unable to do this right now.
Similarly, we’d love to run a data-thon to get these tools into the hands of NF researchers; build this database with help from the experts, and give NF researchers training on how to apply large language models to their day-to-day work (want to sponsor this? Send me an email! 😄)!
Acknowledgements
A large team of people is responsible for building all of the bits and pieces that comprise NF Research Tools Central. In no particular order, the following past and current individuals at Sage Bionetworks have contributed to its development: Ashley Clayton, James Eddy, Mialy DeFelice, Milen Nikolov, Brynn Zalmanek, Anh Nguyet Vu, Jay Hodgson, Kaitlin Throgmorton, Stockard Simon, Duminda Aluthgamage, Adam Hindman, Ljubomir Bradic, Marco Marasca, Kevin Boske, Bruce Hoff, and Andrew Lamb. The development of NF Research Tools Central was supported by the Gilbert Family Foundation (GFF). The former and current members of GFF who had a substantial guiding impact on this project include Caroline Morin, Kalyan Vinnakota, James Goss, and YooRi Kim. Thanks to Luca Foschini for reviewing and providing valuable comments. NF Research Tools Central is part of a broader neurofibromatosis initiative, the NF Open Science Initiative, which is sponsored by the Neurofibromatosis Therapeutic Acceleration Program (NTAP), the Gilbert Family Foundation (GFF), and the Children’s Tumor Foundation (CTF).