INTRODUCTION
Most biology instructors want their students to go beyond memorization to think critically, reason scientifically, and solve problems (1, 2). This range of cognitive tasks is often discussed through the lens of Bloom’s taxonomy (3, 4), which includes both lower-order cognitive skills (LOCS: knowledge and comprehension) and higher-order cognitive skills (HOCS: application, analysis, synthesis, and evaluation) (5, 6).
Among its other advantages, Bloom’s taxonomy provides a convenient language for discussing the cognitive requirements of biological tasks (7–9). Nevertheless, different instructors may perceive these cognitive requirements differently due to, for example, differing awareness of students’ prior activities (6). Likewise, students may perceive questions differently than their instructors do; one study found that for a set of 24 multiple-choice questions, only 63% of medical students agreed with the instructor (i.e., not much better than a coin flip) on whether each question was lower order or higher order (10). Such discrepancies have also emerged in student interviews: students asked to solve problems designed by instructors to elicit HOCS often avoided HOCS and instead took an LOCS approach of simply recalling related facts from memory (11).
To circumvent students’ overreliance on fact recall and encourage the application of information contained within the problems, Semsar et al. (11) found a solution in “us[ing] primarily novel and/or invented scenarios” (p. 24). As an example, a scenario about the familiar roles of angiotensin and aldosterone in regulating blood pressure was replaced with a novel scenario about fluid balance in mosquitos. Student interviews indicated that the novel scenarios were indeed more successful in eliciting HOCS (11). This qualitative result provides some empirical support for others’ stated preferences for unfamiliar scenarios in summative assessments (e.g., 12–15).
We know of one previous quantitative study that directly compared biology students’ scores on familiar problems and novel problems. Deane-Coe et al. (16) reported that students scored 7.5% lower on high-novelty/low-complexity questions (their category B) than on low-novelty/low-complexity questions (their category A). However, there was no such difference between high-novelty/high-complexity and low-novelty/high-complexity questions (their categories D and C, respectively). Moreover, while the study’s methods generally appear sound, it is not clear whether their category-A questions and category-B questions assessed the same lesson learning objectives (LLOs). For example, among the 64 questions used, four category-A questions asked about the meaning of R-code output, but no category-B questions did. Meanwhile, four category-B questions asked students whether various hypotheses were consistent with pilot data (on fish jaw anatomy) provided with the questions, but no category-A questions asked this.
When we attempted to analyze our own archived exam data as Deane-Coe et al. did (16), our attempts were similarly hampered by an inability to control for LLO. That is, our archived exams did not ask the same students a familiar question and a novel question both matched to the same LLO. In principle, though, such matching is achievable via the Test Question Templates (TQT) framework, which explicitly and purposefully links LLOs to specific examples of assessment questions (17–19). Since TQT LLOs are written as “Given X, do Y,” one can identify LLOs where “X” could reasonably be either a familiar starting point or a novel one and then generate both familiar and novel questions for those LLOs.
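To illustrate how such pairing can be organized, the sketch below represents a single LLO and its matched question variants as a simple data structure. This is only an illustrative scheme: the LLO text, question stems, and answers shown are invented placeholders, not items from our question banks.

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    stem: str       # full question text posed to the student or chatbot
    choices: list   # multiple-choice options
    answer: str     # letter of the correct option
    context: str    # "realistic" or "hypothetical"

@dataclass
class LLO:
    """A 'Given X, do Y' lesson learning objective with matched questions."""
    given_x: str
    do_y: str
    questions: list = field(default_factory=list)

# Invented placeholder LLO (not an actual item from our question banks).
llo = LLO(
    given_x="a hormone's source and target tissue",
    do_y="predict the downstream response to a change in hormone level",
)
llo.questions.append(Question(
    stem="In a healthy human, a rise in aldosterone would most directly ...",
    choices=["A ...", "B ...", "C ...", "D ..."],
    answer="B",
    context="realistic",
))
llo.questions.append(Question(
    stem="In a hypothetical animal whose hormone Q acts like aldosterone but "
         "targets the skin, a rise in hormone Q would most directly ...",
    choices=["A ...", "B ...", "C ...", "D ..."],
    answer="C",
    context="hypothetical",
))
```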
We therefore aimed to supplement previous evidence about familiar and novel questions (11, 16) with a new quantitative study using TQTs to control for LLO. To collect data under well-controlled conditions that avoid the confounding impacts of student and instructor variability, we used chatbots based on large language models (LLMs) as models of undergraduate students in two subdisciplines of biology (human physiology and cell biology). We posed the following research question: Do chatbots perform better on realistic (familiar) questions than on hypothetical (novel) questions matched to the same LLO? We hypothesized that the answer would be affirmative; such a result would strengthen previous suggestions that the two types of questions have different cognitive requirements and, in turn, would have clear implications for student training and assessment.
RESULTS
Since our primary expertise and interests are in science education, we did not characterize the chatbots’ functions as AI researchers might do. However, we did evaluate three basic assumptions that underpinned our goal of using chatbots as models of students: (i) chatbots understand the multiple-choice format, (ii) we can ask a chatbot a given question multiple times and take each response as an independent readout of its “understanding,” and (iii) chatbots can support their multiple-choice answers with good reasons.
Chatbots readily answer multiple-choice questions
When we asked chatbots our multiple-choice questions, they unambiguously chose a single answer with a frequency of 93% (YouChat), 96% (ChatGPT-c), 98% (ChatGPT-d), or 99% (Bard; this information was not recorded for ChatGPT-b). For the remaining 1%–7% of questions, the chatbot either reported that it did not have enough information to answer the question or picked 2–3 answers rather than a single option. All chatbots answered correctly at greater-than-chance frequencies, except for the redefinition subtype of hypothetical questions (see below; due to varying numbers of choices per question, random chance would have yielded scores of 32% and 22% for the human physiology and cell biology questions, respectively).
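The chance-level scores quoted above reflect the fact that random guessing on a question with n options succeeds 1/n of the time, so the expected chance score for a whole bank is the mean of 1/n across its questions. A minimal sketch of that calculation follows; the option counts shown are hypothetical, not those of our actual banks.

```python
def chance_score(options_per_question):
    """Expected score from uniform random guessing across a question bank."""
    return sum(1 / n for n in options_per_question) / len(options_per_question)

# Hypothetical example: a bank mixing 3-, 4-, and 5-option questions.
print(f"{chance_score([3, 4, 4, 5, 4, 3]):.0%}")  # prints "27%" for this mix
```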
Without feedback, chatbots appear to answer each question independently
Since we asked the chatbots large numbers of related questions, we wondered whether such cumulative exposure might result in improvement or “learning” over time, despite a lack of feedback. We explored this issue at two levels of granularity. First, if a chatbot is asked the same question 3–5 times in a row, does it score better on the later attempts? Second, as a chatbot answers an LLO’s set of eight related questions, does it score better on the later questions? For all chatbots tested, the answer to both questions was no in both human physiology and cell biology. When scores were regressed against attempt number, there was no relationship (slopes were not different from 0, P > 0.05). Additionally, we found no relationship between score and question number (slopes not different from 0, P > 0.05) regardless of whether each LLO’s realistic and hypothetical questions were analyzed together or separately.
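For readers who wish to run a similar check on their own data, the sketch below illustrates the kind of regression described, under the assumption that each response is scored 0 (incorrect) or 1 (correct) and regressed against attempt number; it is an illustrative reconstruction, not our actual analysis script.

```python
from scipy import stats

# Hypothetical data: repeated attempts at the same question, in order,
# with each response scored 1 (correct) or 0 (incorrect).
attempts = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
scores   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

result = stats.linregress(attempts, scores)
print(f"slope = {result.slope:.3f}, P = {result.pvalue:.3f}")
# A slope indistinguishable from 0 (P > 0.05) would indicate no improvement
# across attempts, i.e., no evidence of "learning" in the absence of feedback.
```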
Chatbots usually explain their correct answers well
Chatbots usually included an explanation of one to three paragraphs in their responses even when no explanation was requested. We therefore tested whether two chatbots’ explanations provided accurate information that was relevant to and consistent with their letter choices, i.e., whether the explanations reflected the kind of appropriate reasoning we might expect from human students.
First, we examined ChatGPT-c’s full responses to human physiology questions between 15 February and 6 March 2023. Of the 200 multiple-choice questions attempted, 127 were answered correctly. Of the explanations accompanying those 127 correct answers, 118 (93%) were entirely correct (with no errors) or mostly correct (with only minor errors), and 9 were judged incorrect. (We did not systematically examine explanations accompanying incorrect multiple-choice answers.) Explanations were similarly successful for the realistic questions and the hypothetical questions. For the realistic questions, of the 76 correct multiple-choice answers, 70 came with correct or largely correct explanations (minor flaws or no flaws), and 6 had incorrect explanations. For the hypothetical questions, of the 51 correct multiple-choice answers, 48 came with correct or largely correct explanations, and 3 had incorrect explanations.
Second, we examined YouChat’s full responses to cell biology questions on 1 March 2023, which yielded similar results. Here, of the 160 multiple-choice questions attempted, 79 were answered correctly. Of these, 7 lacked explanations; of the other 72, 62 (86%) had predominantly or entirely correct explanations, while 10 had incorrect explanations. For the realistic questions, of the 39 correct multiple-choice answers with explanations, 36 of the explanations were correct or predominantly correct, and 3 were incorrect. For the hypothetical questions, out of 33 correct multiple-choice answers with explanations, 26 explanations were largely or entirely correct, and 7 were incorrect.
Chatbots score better on realistic questions than on hypothetical questions
As noted above, this study’s main research question was as follows: Do chatbots perform better on realistic questions than on hypothetical questions matched to the same LLO? The answer was an overall “yes” across multiple rounds of testing chatbots on both human physiology questions (Table 4) and cell biology questions (Table 5).
We tested the performance of five chatbot versions on 25 LLO-matched sets of human physiology questions. Four versions (ChatGPT-b, ChatGPT-c, YouChat, and Bard) scored significantly higher on the realistic questions than on the hypothetical questions (Table 4; examples of responses to realistic and hypothetical questions matched by LLO are shown in Fig. 2). The difference in scores of the most recent ChatGPT version, ChatGPT-d, did not reach statistical significance (P = 0.079).
We also tested the performance of four chatbot versions on 22 LLO-matched sets of cell biology questions. Three versions (ChatGPT-c, YouChat, and Bard) returned significantly more correct answers on the realistic questions than on the hypothetical questions (Table 5). As above, the difference in scores of ChatGPT-d was not statistically significant (P = 0.15).
Overall, these results show that the average drop in score when questions were hypothetical rather than realistic was 15.5 percentage points for human physiology (i.e., chatbots averaged 69.7% on the realistic questions vs. 54.2% on the hypothetical questions) and 9.5 percentage points for cell biology (74.6% vs. 65.1%).
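To make the arithmetic behind these averages explicit, the sketch below computes an average realistic-versus-hypothetical score drop across chatbots. The per-chatbot scores shown are hypothetical placeholders; the reported values appear in Tables 4 and 5.

```python
# Hypothetical per-chatbot scores (%); the actual values are in Tables 4 and 5.
realistic    = {"ChatGPT-b": 62, "ChatGPT-c": 64, "ChatGPT-d": 85,
                "YouChat": 66, "Bard": 72}
hypothetical = {"ChatGPT-b": 45, "ChatGPT-c": 49, "ChatGPT-d": 80,
                "YouChat": 50, "Bard": 47}

mean_realistic = sum(realistic.values()) / len(realistic)
mean_hypothetical = sum(hypothetical.values()) / len(hypothetical)

print(f"realistic mean = {mean_realistic:.1f}%, "
      f"hypothetical mean = {mean_hypothetical:.1f}%, "
      f"drop = {mean_realistic - mean_hypothetical:.1f} percentage points")
```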
Among hypothetical questions, chatbots score better on invention questions than on redefinition questions
Eight human physiology LLOs had hypothetical questions of both subtypes (Table 1), permitting a small-scale comparison between these subtypes. Three of the five chatbots tested scored significantly lower on the redefinition questions than on the invention questions (Table 6). The exceptions were ChatGPT-d, which scored well on both subtypes, and Bard, which scored poorly on both (Table 6). Since the cell biology questions did not permit a similar comparison and since the overall number of redefinition questions was quite low, our finding that redefinition questions are harder should be considered preliminary and in need of further testing.
ChatGPT improved rapidly during this study
During February and March 2023, we were able to study three different versions of ChatGPT (Table 1). ChatGPT’s human physiology scores improved rapidly over this time period (Fig. 3), with ChatGPT-d showing especially large gains in hypothetical question scores. Similar trends were evident in the cell biology scores of ChatGPT-c and ChatGPT-d (Table 5).
DISCUSSION
This study employed LLM-driven chatbots as imperfect but useful models of biology students. While chatbots do not “think” like students, they readily fielded hundreds of questions apiece, yielding data that are unobtainable in typical classroom settings. These data showed that aside from the most advanced version of ChatGPT available to us, all chatbots tested scored significantly lower on hypothetical questions than on realistic questions matched to the same LLOs. Our findings support the development and use of more novel questions to prompt students to transfer their knowledge to new scenarios.
The chatbots we tested generally impressed us with their apparent knowledge of biology. Their frequent success in answering difficult multiple-choice questions and their often-lucid supporting explanations suggest that these chatbots can model the kind of understanding that we want our students to acquire. Our results are broadly consistent with previous reports of chatbots’ fluency with content from undergraduate biology (23, 24), the Medical College Admission Test (MCAT) (26), and medical school (27, 28). For instance, a ChatGPT version based on LLM GPT-3.5 (likely equivalent to our ChatGPT-a, ChatGPT-b, or ChatGPT-c) performed at or above median scores on the MCAT (26), and the scores we report here are roughly comparable to those (e.g., ~75% correct for questions from the MCAT section that is most analogous to our question banks, Biological and Biochemical Foundations of Living Systems). Our results are also roughly similar to those from a report (27) on neurosurgery board preparation exam questions, for which a GPT-3.5-based ChatGPT, a GPT-4-based ChatGPT (equivalent to our ChatGPT-d), and Bard had overall scores of 62.4%, 84.6%, and 44.2%, respectively.
Despite the chatbots’ frequent virtuosity, switching from realistic questions to hypothetical ones in our study lowered their scores by an average of 13 percentage points. If applied to student exams, an effect of this size would often correspond to a drop of 1.3 letter grades (e.g., from a B-plus to a C). Our finding that only ChatGPT-d did about equally well on hypothetical and realistic questions mirrored the neurosurgery board preparation exam question study, in which only the GPT-4-based ChatGPT did about equally well on higher-order and lower-order questions (27).
Our study’s distinction between “realistic” and “hypothetical” questions bears some (possibly misleading) similarity to a distinction between “abstract” and “applied” questions in a 20-question homeostasis concept inventory (29). In that study, McFarland et al. classified nine questions as abstract; six of these concerned the general meaning of terms like “control center” (question #16) and “effector” (question #13), and three concerned the hypothetical regulation of metabolite X in the blood of a new species of deer (questions #2–4). Their “abstract” category is therefore quite different from our “hypothetical” category, while their “applied” category corresponds closely to our “realistic” category, so their finding of similar scores on abstract questions and applied questions (29) is not directly comparable to our finding of different scores on hypothetical questions and realistic questions. Nonetheless, McFarland et al. offer relevant insight into the ways in which question formats may influence student responses (29). They cite a prior claim that concrete problems containing specific details may be especially hard for students because the details may appear to conflict with prior knowledge and/or “may trigger application of inaccurate mental models” (p. 3). In this light, it makes sense that the questions on which the chatbots scored worst were our “redefinition” questions, in which details were changed so as to directly clash with prior knowledge.
Taken together, our results and those of others (11, 16) have the practical implication that novel problems do indeed offer a unique window into student understanding. The general task of applying previously learned information to new contexts is known in the cognitive psychology literature by many terms, including HOCS (discussed above), analogical thinking (30), case comparison (31), and transfer (20, 32). Regardless of terminology, this task is widely understood to be a central focus of education, yet students often fall short of faculty expectations (32, 33).
One likely reason for this difficulty is that students may not receive enough practice with scenarios that are truly novel (yet LLO-aligned). Momsen et al. have reported that introductory biology courses usually have few exam questions novel enough to demand HOCS (5). Similarly, we have observed that in popular human anatomy and physiology textbooks, only ~0.2% of questions concern non-human animals or aliens (Sankar et al., unpublished observations). When instructors create these kinds of questions, we may get feedback such as the following (given to G.J.C. by a human physiology student): “Give real patient examples and stop with the alien or monsters or other creatures; not everybody is aware or knowledgeable of these creatures unless you are a marine biologist of some sort. In real life, I would like to save and evaluate real people/person, not a Loch Ness monster.”
This representative comment highlights the risk that novel problems, being unfamiliar to students, may be perceived as irrelevant and/or unfair. To avoid such misunderstandings, we urge instructors to be transparent with students on both the “what” and the “why” of these novel problems. That is, if students are likely to face exam problems about biological entities not previously covered, instructors should, well ahead of the exam, explicitly inform students of this, justify the inclusion of such novelty, and provide LLO-linked examples, perhaps via the TQT framework (17–19). Most broadly, instructors should help students appreciate that in both basic science and applied (e.g., clinical) science, we solve novel problems by applying what is known to what is not yet known. Novel or hypothetical problems thus serve as valuable practice for authentic challenges in research, medical care, etc. The invention subtype of hypothetical question may correspond to the discovery of a novel mechanism, the diagnosis of a patient with a novel disorder, or the treatment of a patient with a novel class of drug, while the redefinition subtype may correspond to situations where new test results overturn previous assumptions.
Finally, regarding cheating (34, 35), our results provide some reassurance about chatbots’ current limitations in answering exam questions, as well as some warning, given the ongoing evolution of their abilities. As of this writing, many chatbots seem to struggle with hypothetical questions and, in the absence of feedback, do not improve their answers when repeatedly asked the same question or similar questions; the latter finding suggests that chatbots cannot necessarily improve in real time during the course of a single exam. In addition, while we made all of our questions text-based, we presumably could have stumped the chatbots with image-based questions. However, the high success rate of GPT-4-driven ChatGPT (ChatGPT-d) on our and others’ hypothetical questions (36), as well as this ChatGPT version’s ability to analyze images (37), suggests that with continuing advances in AI, even hypothetical and image-based questions may soon become straightforward for many chatbots. We advise against a strategy of trying to “outsmart” chatbots by writing ever more convoluted questions; instead, we favor approaches to assessment that simultaneously prioritize fairness (equity), stress reduction, and student learning (38). While our study did not directly investigate equity issues, we had to pay $20 per month to use the highest-scoring chatbot, implying that students with different resources might have access to chatbots with different capabilities.
In conclusion, we used the framework of TQTs to create well-matched sets of realistic and hypothetical questions relevant to undergraduate courses in human physiology and cell biology. The fact that LLM-based chatbots usually scored lower on the hypothetical questions constitutes new evidence to support previous suggestions that novel scenarios pose unique cognitive challenges. We hope that future work will further explore the issue of question novelty, perhaps via fuller comparisons of the redefinition and invention subtypes of novel questions, to clarify how novel questions affect cognition and how they might be used optimally in instruction and assessment.