



Our study evaluates the ability of ChatGPT to answer questions in the natural sciences and engineering. The number of participants per department was as follows: AE: 25, AS: 41, CEG: 59, EEMCS: 36, 3mE: 37. Participants held the following positions at Delft University of Technology: assistant professors: 71, associate professors: 59, full-time professors: 47, lecturers: 9, doctoral students: 6, postdoctoral researchers: 4, others: 2. An overview of the evaluation of ChatGPT responses against the nine evaluation criteria is shown in Figure 1 and described below. The boxplots show the results for the nine criteria, grouped into three skill categories. For each criterion, we present separate ratings for the three educational levels. Results are averaged across departments, and black diamonds represent outliers.

Figure 1

Summary of evaluation results. The triangle indicates the average rating, and the red horizontal bar indicates the median value. The boxes range from the first quartile to the third quartile.
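The quantities drawn in such a boxplot can be reproduced from raw ratings with a few lines of Python. The following is a minimal sketch, not the authors' plotting code: the function name `boxplot_summary`, the example ratings, and the 1.5 × IQR outlier rule (a common default that the text does not confirm) are all assumptions made for illustration.

```python
from statistics import mean, median, quantiles


def boxplot_summary(ratings):
    """Return the summary statistics that a boxplot of `ratings` displays.

    Assumption: outliers are points beyond 1.5 * IQR from the box,
    which is a common convention but is not stated in the text.
    """
    # Quartiles with linear interpolation over the observed data range.
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": mean(ratings),      # drawn as the triangle
        "median": median(ratings),  # drawn as the red horizontal bar
        "q1": q1,                   # lower edge of the box
        "q3": q3,                   # upper edge of the box
        "outliers": sorted(r for r in ratings if r < low or r > high),
    }


# Hypothetical ratings on the 1-5 scale, for illustration only.
example_ratings = [4, 4, 4, 5, 5, 3, 4, 1, 4, 5]
summary = boxplot_summary(example_ratings)
```

Reporting both the mean and the median is useful on a bounded 1 to 5 scale, because a few very low ratings pull the mean down while leaving the median unchanged.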

We identify four main findings from the aggregated results (Figure 1). First, ChatGPT scores higher on average for basic and scientific skills than for skills beyond scientific knowledge. Second, the relevance (b.i) of the answers to bachelor's-level questions received the highest overall rating, with an average score of 4.46. Additionally, participants rated the level of English (a.ii) of the answers as high (mean score 4.17 across all education levels). This score corresponds to advanced use of academic English (with some technical terminology) in written communication. Third, the model's critical attitude (c.i) received the lowest score among the nine criteria. On average, the ratings indicate that ChatGPT is critical of some results, but that this is not its general attitude. Results should always be verified. It should be noted, however, that 50% of participants considered the criteria for skills beyond scientific knowledge (c) not applicable, compared to only 2.3% and 8.1% for basic skills (a) and scientific skills (b), respectively. Fourth, for seven of the nine evaluation criteria, responses to bachelor's-level questions were rated higher than responses to master's- and doctoral-level questions. For example, for completeness of answers (b.ii), participants gave an average score of 3.51 for bachelor's-level questions, but 2.93 for master's-level questions and 2.85 for doctoral-level questions.

Perhaps one of the most interesting criteria is scientific correctness (b.iii). Here, the average ChatGPT scores are 3.76 (bachelor's level), 3.35 (master's level), and 3.43 (doctoral level). These scores indicate that, on average, ChatGPT answers bachelor's-level questions mostly correctly and master's- and doctoral-level questions partially correctly. The distribution of ratings is shown in Figure 2. The bar graph shows the number of ratings for each assessment option of the scientific correctness criterion (b.iii) in the rubric. At the bachelor's and doctoral levels, most participants rated the answer as mostly correct (bachelor's: 69 times, doctoral: 82 times), whereas at the master's level, most participants rated the answer as partially correct (66 times). At all levels of education, the completely incorrect option was chosen least frequently (bachelor's: 10 times, master's: 15 times, doctoral: 12 times).

Figure 2

Evaluation results of scientific correctness.

Answers from ChatGPT can have consequences when they are implemented. We asked participants to assess to what extent implementing the answer would have a positive or negative impact (c.ii) and to what extent ChatGPT was aware of this potential impact (c.iii). If the criteria impact of response implementation (c.ii) and perception of impact (c.iii) were applicable, participants were additionally asked to describe the type of impact in a free text field. For 128 of the 594 ChatGPT responses, participants mentioned one or more impact types, which we aggregated into eight impact types. This coding process was performed in a consensus coding session by the three authors of this study, who were faculty members with no industry experience. Because the codes were agreed by consensus, the final result is unanimous, with full inter-coder reliability [41]. The impact types and the number of occurrences of each are shown in Table 2, sorted by the number of mentions in the free text field. The impact of implementing answers is rated from serious consequences (score: 1) to clearly positive consequences (score: 5). For most impact types, the boxes in the boxplots span scores from 2 to 4, but the first quartile for environmental and social/political impacts is relatively high, at a score of 3, and the third quartile for safety impacts is relatively low, at a score of 3 (Table 2). Environmental, economic, social/political, scientific, technological, educational, and health impacts are rated on average as neither positive nor negative, whereas for safety impacts, implementing ChatGPT's answers may cause harmful consequences. The most frequently cited impact type was environmental impact, which was mentioned 40 times. The least frequent impact type was health, which was mentioned five times. On average, the answers were rated as having the most positive impact on the environment (mean rating score 3.33) and the most negative impact on safety (mean rating score 2.39). All free text comments are provided in the Supplementary Information.

Table 2 Potential impact of implementing responses.

Impact of study variables on assessment scores

It is of particular interest to understand which variables influence how ChatGPT answers are perceived. For each level of education, we combine the criteria for scientific skills (b) and for skills beyond scientific knowledge (c), because a reliability analysis using Cronbach's \(\alpha \) showed that their measurements are consistent. We omit basic skills (a), whose criteria are inconsistent (Table 3). Note that the basic skills category (a) consists of answer format (a.i) and level of English (a.ii), and we expect their dependence to be slight.

Table 3 Reliability analysis.

For example, suppose a question asks for a code example, but the ChatGPT answer explains the underlying algorithm in correct academic English. This answer receives a low score for answer format (a.i) but a high score for level of English (a.ii). The criteria assessing scientific skills (b) and skills beyond scientific knowledge (c) show high consistency at all educational levels (Cronbach's \(\alpha \) > 0.7). As a result, the criteria within each of these categories consistently measure the same underlying skill.
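For readers unfamiliar with the reliability analysis, Cronbach's \(\alpha \) for a category with k criteria is k/(k-1) multiplied by one minus the ratio of the summed per-criterion variances to the variance of the summed scores. The sketch below implements this standard formula. The function name `cronbach_alpha` and the example ratings are hypothetical and are not taken from the study data.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of k criteria.

    `items` holds one list of ratings per criterion. All lists must rate
    the same responses in the same order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two criteria")
    # Total score per rated response: sum over the k criteria.
    totals = [sum(values) for values in zip(*items)]
    # Sum of the sample variances of the individual criteria.
    item_variance = sum(variance(ratings) for ratings in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


# Hypothetical ratings of four responses on three criteria of one category.
relevance = [4, 3, 5, 2]
completeness = [4, 2, 5, 3]
correctness = [3, 3, 4, 2]
alpha = cronbach_alpha([relevance, completeness, correctness])
```

Values above roughly 0.7 are conventionally read as acceptable internal consistency, which is the threshold referred to in the text. When criteria vary independently, as in the answer format versus level of English example, the variance of the totals stays close to the sum of the individual variances and \(\alpha \) drops toward zero.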

Figure 3 shows assessment scores as a function of skill category and education level. First, we test the effect of skill category on assessment scores. The ANOVA shows that skill category has a significant effect (F(1, 101) = 92.6, p < 0.001): ChatGPT's scores for scientific skills (b) are significantly higher than its scores for skills beyond scientific knowledge (c). Next, we test the null hypothesis that education level has no effect on assessment scores and obtain a p-value below 0.01 (F(2, 202) = 5.29). Education level therefore has a significant effect on assessment scores: answers to questions at lower education levels, such as the bachelor's level, are rated significantly higher than answers to questions at higher education levels. Additionally, we test for an interaction between the independent variables skill category and education level. The ANOVA shows a significant interaction in which the variables reinforce each other (F(2, 202) = 6.49, p < 0.01). Figure 3 shows that scientific skills for bachelor's-level questions are rated even higher than would be expected if the dependence of ratings on skill category and on education level were considered separately. Finally, we analyze the influence of the participants' department on assessment scores. Here, no significant effect was found (F(4, 101) = 0.79, p = 0.53).
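To make the F statistics above more concrete, the following sketch computes a one-way repeated measures ANOVA for a single within-subject factor, such as education level. It is a deliberate simplification of the two-factor analysis reported here and is not the authors' analysis code. The function name `rm_anova_one_way` and the example ratings are hypothetical.

```python
from statistics import mean


def rm_anova_one_way(scores):
    """One-way repeated measures ANOVA.

    `scores` has one row per rater and one column per condition.
    Returns (F, df_condition, df_error).
    """
    n = len(scores)      # number of raters (subjects)
    k = len(scores[0])   # number of conditions
    grand = mean(x for row in scores for x in row)
    condition_means = [mean(row[j] for row in scores) for j in range(k)]
    subject_means = [mean(row) for row in scores]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_condition = n * sum((m - grand) ** 2 for m in condition_means)
    ss_subject = k * sum((m - grand) ** 2 for m in subject_means)
    # Removing between-rater variance is what distinguishes the
    # repeated measures design from an ordinary one-way ANOVA.
    ss_error = ss_total - ss_condition - ss_subject

    df_condition = k - 1
    df_error = (n - 1) * (k - 1)
    f_value = (ss_condition / df_condition) / (ss_error / df_error)
    return f_value, df_condition, df_error


# Hypothetical mean ratings by four raters for bachelor's-, master's-,
# and doctoral-level answers (columns), for illustration only.
ratings = [
    [4, 3, 3],
    [5, 4, 3],
    [4, 4, 3],
    [3, 2, 3],
]
f_value, df1, df2 = rm_anova_one_way(ratings)
```

The two degrees of freedom returned by the function are the numbers that appear in parentheses in an expression of the form F(df1, df2). Obtaining a p-value additionally requires the F distribution, which is available in statistics packages but not in the Python standard library.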

Figure 3

Repeated measures ANOVA results. Shown are the average rating scores for the different combinations of skill category and education level. Error bars represent 95% confidence intervals.

Free text comments

In addition to the quantitative evaluation of ChatGPT responses, we allowed all participants to submit free text comments for each response. Participants submitted a total of 355 free text comments. A complete list of free text comments can be found in the Supplementary Information.

We manually assigned all free text comments to three inductive main categories: lack of detail, quality of response, and comparison with students. This coding process was performed in a consensus coding session by the three authors of this study, who were faculty members with no industry experience. Because the codes were agreed by consensus, the final result is unanimous, with full inter-coder reliability [41]. The largest group of comments (91 of 355) criticizes the lack of detail or the superficiality of the answers. For example, one participant commented: "The answers are mostly narrative and general. Although the answer makes sense, it does not provide a deep and profound answer and remains phenomenological." Regarding the quality of the response, 52 free text comments discuss the correctness of the ChatGPT answer: 28 comments state that the answer is wrong and 24 comments state that it is correct. Regarding the third inductive main category, 25 comments compare the quality of ChatGPT's answers with the quality of answers from students. In this context, we inductively determined three subcategories: (i) ChatGPT formulates its answers better than most students (e.g., "formulated better than most students would, somewhat generic but mostly true"), (ii) ChatGPT performs worse than would be expected of a student (e.g., "it would be surprising to see a real student make such a mistake when their overall level of knowledge is high"), and (iii) ChatGPT acts like a student guessing the answer (e.g., "a student who did not fully understand the content (under which conditions [Linear-Quadratic-Programming] and [Model Predictive Control] should be replaced) might give this answer").

Individual free text comments also touch on several other aspects of ChatGPT responses. One interesting example critically discusses the sources of the training data and the implications of these data: "The answer propagates [a] false and harmful perception of where quantum computing speedups come from [...]. The answer stems from clearly misleading statements [...] about quantum speedups that are often seen on the internet." Finally, a further category of free text comments emerged exclusively for doctoral-level questions. Eleven participants stated that, for questions that approximated open research questions, the model's answers listed established facts from the literature but did not interpret them or draw inferences from them. According to participants, ChatGPT therefore fails to provide a perspective on future research directions or a ranking of options. For example, one comment states that the answer "is essentially a combination of previously published approaches, and some are very limited. This answer actually addresses the question, but doesn't provide any new insight."

Finally, disruptive technologies such as ChatGPT can provoke emotional reactions. We therefore ran a manual sentiment analysis of the emotional tone of the free text comments, coding each comment as positive, neutral, or negative in tone. This coding process was performed in a consensus coding session by the three authors of this study, who were faculty members with no industry experience. Because the codes were agreed by consensus, the final result is unanimous, with full inter-coder reliability [41]. The majority of comments (287 out of 355) are written in a neutral and objective tone. In addition, there are 34 positive comments (e.g., "the answer is surprisingly good") and 34 negative comments (e.g., "the answer is quite bad"). There is thus no strong sentiment in the free text comments: 81% of the comments have a neutral tone, and positive and negative comments are equally frequent.

