
Test item creation - with or without AI?

I have to design a short test for a flipped reading activity, so I decided to take a closer look at test item creation. What kind of questions are worth asking when checking comprehension, and can AI help teachers with this?

Key concepts first - validity and reliability

I want my assessment to make sense, so my first goal is to create a valid and reliable test.

A valid test measures what it's supposed to measure. The reading my students have to do for next week is about Lexical Chunks in English Language Teaching. If I ask about anything else that's not related to the topic or that's not in the reading, my test is not valid.

A reliable test provides consistent, replicable information about the test takers' performance. If there's high inconsistency in who takes the test, when or where they take it, or who assesses them (the typical case of a strict teacher in a bad mood vs. a nice, lenient one), then the test is unreliable.

Question types - time to create vs. time to evaluate

Something we can easily agree on is that questions that are easier to create tend to take longer to evaluate. For example, it's much easier to come up with an open-ended question that asks students to reflect on their reading, but it takes much more time to read everyone's answers and to provide reliable, consistent feedback. The typical issue with open-ended questions is that it's hard to define what makes someone's response a 10 out of 10, or a 7, or a 4, and so on. What you'd really need in this case is a well-designed rubric that explains each criterion and assigns a score to the various performance levels (e.g., unsatisfactory, satisfactory, excellent). But that's a lot of work!

There are question types that are much easier and quicker to grade (closed questions, such as multiple choice, multiple selection, matching, and true or false), but they are typically more difficult to produce. Why? Because you still want to create a valid test. If the wrong answer options are too easy to spot, or the whole question is too easy to answer, then what's the point of the test? So you need to think hard to come up with the right distractors: options that are plausible but still incorrect.

Question types - authenticity vs. reliability

We can look at the issue above from a slightly different perspective. If I ask open-ended, personalised questions, then I will probably get more authentic answers that show not just whether my students understood the main concepts but also whether they can apply this knowledge. However, there is the problem of reliable grading. How do you decide whether one student's answer is more creative than another's, or whether two creative answers deserve the same score? Where do you draw the line between average and above average?

OK, let's do closed questions. They are super reliable and quicker to grade (which means less workload for you), but the questions are not very authentic, and they don't tell you much about your students' understanding. These tests typically just scratch the surface.
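To see why closed questions are so reliable to grade: scoring them is a mechanical key lookup, so any grader (human or machine) produces the same score. A minimal sketch in Python, with an invented answer key purely for illustration:

```python
# Hypothetical answer key for a closed-question quiz: question id -> correct option.
answer_key = {"q1": "b", "q2": "d", "q3": "a"}

def grade(responses: dict[str, str]) -> int:
    """Count how many responses match the key. No rubric, no judgement:
    the same responses always yield the same score."""
    return sum(1 for question, correct in answer_key.items()
               if responses.get(question) == correct)

# One (invented) student's responses: q2 is wrong, so the score is 2.
student = {"q1": "b", "q2": "c", "q3": "a"}
print(grade(student))  # 2
```

The flip side, as noted above, is that everything the lookup can check had to be anticipated when the key was written, which is exactly where the authenticity is lost.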

So, your best option would be to include both types.

Starting out

First, I acted as a traditional teacher. I read through the entire article and tried to come up with my own questions. Based on the previous points, these were my guiding principles:

  • no questions about people's names, publication dates, and other such data, as these sit at the lowest level of Bloom's taxonomy (remembering) and don't really serve a purpose in terms of understanding

  • mostly closed questions because I want this quiz to be machine-graded but I might still include one open question to learn more about my students' knowledge

  • for closed questions, I will choose from multiple choice, multiple selection, true-false, and short-answer questions

  • short-answer questions are best if the answer is indeed short (1-3 words); otherwise you have to provide all kinds of possible answer variants (capital vs. lower-case, word order, with or without linking words, punctuation...), and that takes a lot of time and forward-thinking

  • multiple choice and multiple selection questions require good distractors that are sometimes difficult to come up with
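Part of the short-answer variability problem above can be absorbed in software. A minimal sketch in Python (hypothetical, not from any of the tools mentioned) that normalizes case, punctuation, and spacing before comparing a student's answer against the accepted variants:

```python
import re

def normalize(answer: str) -> str:
    """Lower-case, trim, strip punctuation, and collapse whitespace,
    so that 'Lexical chunks!' and 'lexical  chunks' compare equal."""
    answer = answer.lower().strip()
    answer = re.sub(r"[^\w\s]", "", answer)   # drop punctuation
    answer = re.sub(r"\s+", " ", answer)      # collapse runs of spaces
    return answer

def is_correct(student_answer: str, accepted: list[str]) -> bool:
    """Check a short answer against every accepted variant."""
    return normalize(student_answer) in {normalize(a) for a in accepted}

print(is_correct("Lexical chunks!", ["lexical chunks"]))  # True
```

This only covers surface variation; different word orders, synonyms, or answers with linking words still have to be listed as separate accepted variants, which is exactly the forward-thinking workload the bullet points describe.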

(How) Can AI help?

So, I drafted my own questions and answer options, then used three AI tools (Quizbot, Quizgecko, ChatPDF) to do the same for me. I used their free versions. These are the things I noticed:

  • Speed!!! - AI-generated quizzes are a dream! You upload the file (if the free version lets you) or paste the text (character limits apply), select the age level, the difficulty, the language, and the number of questions, and bam! The tools generate questions in seconds and provide the answer key as well.

  • Closed question types - The free versions typically give you closed questions only (and within that category, multiple choice). But the questions are not bad, and the wrong answers are not blatantly wrong either. Quizbot definitely won this round because it not only gives you complete answer feedback but also produces more complex questions (Quizgecko asked some simple fact-recall questions). It also has more options for question creation, including questions from video or picture input, Bloom's taxonomy-ready questions, and Gardner's Learning Style-ready ones.

  • Open ended questions - As I've mentioned above, I would like to include at least one open-ended question too, so for that I got inspiration from ChatPDF, which came up with pretty good complex questions.

  • Teacher revision is a must - After all this work, the most important thing I'd add is that without your revision as a teacher, you might end up with a weak collection of questions. They won't be bad as such, because they all focus on the content of the reading and are relevant. But they might capture something different from what you think is important, and there's still the possibility that the tool misunderstands something.

UPDATE: Thank you, Julie Moore, for pointing out that in the Quizbot test, in question 4, both b) and c) could work. While this is true, the text only contained information about c), so the AI thought nothing was wrong. That's why double-checking is super important. I really like Jules's comment: "a proper editor is always invaluable ... however the content was created."


I would never hand out an AI-generated quiz without actually reading the text myself and double-checking all the questions and answers. However, I'm definitely going to work with such tools, because they are so much quicker at generating answer options, which I'm slowest at. And I think I might buy the premium subscription for Quizbot.




