Written by
Allison Gasparini, AI Lab and Center for Statistics and Machine Learning
March 26, 2025

On Feb. 28, more than 80 researchers from the Princeton Plasma Physics Laboratory and Princeton University gathered in the Lewis Science Library to test some of the latest artificial intelligence tools. Over Zoom, the Princeton group was joined by nearly 1,500 scientists across 12 Department of Energy (DOE) national labs. The national network of researchers worked simultaneously in a “jam session” to probe the capabilities and limits of the advanced reasoning models.

“This is one of the largest concurrent evaluations of the latest and greatest AI systems ever attempted,” said Shantenu Jha, the head of computational sciences at PPPL.

The goal of the session, which was cosponsored by the Princeton Laboratory for Artificial Intelligence, was to gain a deeper understanding of how the latest AI models can contribute to and aid scientific discovery. Participants in the jam session were provided access to OpenAI’s reasoning models, which are specialized language models built to perform complex tasks. "It's important to appreciate the models we are testing are not just regular ChatGPTs,” said Jha.

Overall, Jha estimated the amount of computing done by members of PPPL alone over the course of the day could be worth around a million dollars. “We're stress testing these models in ways that we couldn’t do without working with OpenAI,” said Jha.


Shantenu Jha (standing), the head of computational sciences at PPPL, helped to oversee the jam session, for which participants were given free access to OpenAI’s reasoning models. (Photo credit: Michael Livingston/PPPL Communications Department)


Scientists, researchers, and engineers across the national laboratories were invited by OpenAI to tackle their research questions using a range of tools. As scientists worked with the models, they filled out evaluation forms rating the experience and model performance in the face of complex research questions. 

Alvaro Sanchez-Villar, an associate research physicist at PPPL, came into the jam session hoping to use his time with the models to test their rigor in deriving mathematical expressions related to wave phenomena in plasmas. He had tested older models in the past and found that they failed at even simpler derivations. At the jam session, Sanchez-Villar found the newer models to be a significant improvement overall. But he still hit limits. “When it came to complex concepts, the model showed a kind of mathematical intuition as it sometimes found the right functions,” said Sanchez-Villar. “But the actual equations were far from correct.”

In the future, Sanchez-Villar said he believes AI models would be improved if different models were deeply trained for differing subjects, rather than one model trained to complete all different tasks. “In my view, these models are advancing quickly, but still far from the level that you need to tackle highly specialized scientific topics,” said Sanchez-Villar. “That said, we’re dealing with highly niche topics, often understood in depth by maybe a few hundred experts worldwide, so the performance we observed is still impressive.”

Overall, Sanchez-Villar felt optimistic about the tools' potential to continue aiding scientific discovery. “The reasoning technology is performing well and is a clear improvement over previous models,” said Sanchez-Villar. “I'm looking forward to seeing how it continues to evolve in the future.”


More than 80 researchers from PPPL and Princeton University gathered in the Lewis Science Library to evaluate model performance in the face of complex research questions. (Photo credit: Michael Livingston/PPPL Communications Department)

Yueling Ma, a postdoctoral researcher with the High Meadows Environmental Institute at Princeton University, spent her time at the jam session testing OpenAI’s o3-mini high model. In her own research, Ma works with machine learning tools to do large-scale groundwater modeling.

Working with o3-mini high during the session, Ma said she found that while the tool had an idea of how to design code, the code that was generated was ultimately filled with bugs that made it difficult to run without manually debugging it herself. “I feel like the models can solve some issues, but there is always some limitation on these tools,” said Ma.

Still, based on the growth in AI models Ma has seen over just the last couple of years, she sees a future where the tools overcome their current limitations. “I'm quite optimistic,” she said. “I feel like these tools will continue to grow more and more powerful.”

“The jam session was an excellent opportunity to showcase what these models can do and to test them against our experimental and theoretical work,” said Sanchez-Villar.

By evaluating the latest AI tools on the scientific questions that are of interest to PPPL and Princeton researchers today, Jha is hopeful that future models will grow even more attuned to accelerating the scientific discovery happening on the campuses. 

“The fact that our problems, how good, bad or ugly they are, are being taken into account, so that the next generation of tools are responsive to our problems is non-trivial,” said Jha. “It may not help us do science today, but the fact that the next system will have used our scientific problems of interest to be improved will help us in the long run.”