Artificial intelligence (AI) is increasingly transforming the way clinical tasks are performed in medicine. One notable advancement is GatorTronGPT, a generative language model trained on clinical data using the GPT-3 framework, the same architecture family that underpins ChatGPT. The model was trained on an extensive corpus of 277 billion words, including 82 billion words from de-identified clinical notes and 195 billion words from various English texts.
What sets GatorTronGPT apart is its ability to support clinical language understanding with less real clinical data. In addition to training on existing clinical data, the researchers took a further step: they used GatorTronGPT itself to generate a synthetic corpus of clinical text. This synthetic corpus, comprising over 20 billion words, closely resembles real clinical notes written by doctors and captures the relevant clinical content.
The synthetic data was then used to train a BERT-based model called GatorTron-S. In comparative evaluations, GatorTron-S exhibited exceptional performance in natural language understanding tasks, surpassing the records set by the original BERT-based model, GatorTron-OG, which was trained on 82 billion words of clinical data.
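The two-stage idea described above — train a generative model on real text, sample synthetic text from it, then use that synthetic text to train a downstream model — can be illustrated with a deliberately tiny, self-contained sketch. This is not the GatorTronGPT pipeline itself (which uses billion-parameter transformers on the Megatron-LM stack); it substitutes a toy bigram language model and invented example sentences purely to make the data flow concrete:

```python
import random
from collections import defaultdict

def train_bigram_model(corpus):
    """Stage 1: fit a toy generative model (word -> possible next words)."""
    model = defaultdict(list)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            model[a].append(b)
    return model

def generate(model, max_len=20, seed=0):
    """Stage 2: sample a synthetic sentence from the fitted model."""
    rng = random.Random(seed)
    token, out = "<s>", []
    while len(out) < max_len:
        token = rng.choice(model[token])
        if token == "</s>":
            break
        out.append(token)
    return " ".join(out)

# Stand-in "clinical-style" sentences, invented for illustration (no real PHI).
corpus = [
    "patient reports chest pain and shortness of breath",
    "patient denies chest pain on admission",
    "shortness of breath improved after treatment",
]
model = train_bigram_model(corpus)
synthetic_note = generate(model)
# The synthetic corpus built this way would then feed a downstream
# training run (GatorTron-S, in the article's terms) in place of
# privacy-sensitive real notes.
print(synthetic_note)
```

The sketch shows why the approach helps with privacy: the downstream model only ever sees text sampled from the generative model, not the original notes.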
Remarkably, these results were achieved using less real clinical data. By leveraging 560 NVIDIA A100 Tensor Core GPUs running the Megatron-LM package on the HiPerGator supercomputer at the University of Florida, the researchers demonstrated the potential of GatorTron-S and GatorTron-OG to drive advancements in medical language understanding.
The use of synthetic data generated by large language models addresses several challenges faced when training AI models. Large language models require substantial amounts of data, which can be limited when it comes to high-quality medical data. Synthetic data offers a solution, allowing models to be trained while complying with medical privacy regulations such as HIPAA.
The work on GatorTronGPT exemplifies the adaptability of large language models, which gained popularity with the widespread adoption of ChatGPT last year. These models hold promise for various fields, including medicine, and represent the significant progress facilitated by novel AI techniques and accelerated computing.
The efforts surrounding GatorTronGPT are a testament to the ambitious collaboration between the University of Florida and NVIDIA, which aims to establish the fastest AI supercomputer in the academic world. This initiative, fueled by a $50 million donation from NVIDIA founder Chris Malachowsky and the company itself, promises to unlock further innovations in medical sciences and other disciplines across the University of Florida.
Q: What is GatorTronGPT?
A: GatorTronGPT is a generative language model that utilizes artificial intelligence to understand clinical language.
Q: How was GatorTronGPT trained?
A: GatorTronGPT was trained using the GPT-3 framework, the same architecture family that underpins ChatGPT. It was trained on a large corpus of 277 billion words, including de-identified clinical notes and English texts.
Q: What is the significance of synthetic data in AI training?
A: Synthetic data allows AI models to be trained with limited high-quality medical data while complying with privacy regulations like HIPAA.
Q: What are the potential applications of GatorTronGPT and similar models?
A: GatorTronGPT and other large language models hold promise for advancing various fields, including medicine, through improved natural language understanding and accelerated computing.