
Assessment Theory

Project Overview

Project Description

Write a wiki-like entry defining an assessment concept. Define the concept, describe how the concept translates into practice, and provide examples. Concepts could include any of the following, or choose another concept that you would like to define. Please send a message to both admins through Scholar indicating which you would like to choose - if possible, we only want one or two people defining each concept so, across the group, we have good coverage of concepts.


Item-Based Testing and Design

Background/Context

When one thinks of item-based testing, typically the first thing that comes to mind is a test consisting of a series of questions requiring single-answer responses. However, this is a very narrow view of item-based testing; to better understand the process of item-based testing and item design, it is key to first be clear about what is meant by the term "item" in the context of assessment. According to the American Educational Research Association's Standards for Educational and Psychological Testing, an item is defined as:

"A statement, question, excercise, or task on a test for which the test taker is to select or construct a response, or perform a task" (177).

This testing-industry-approved definition demonstrates that an "item" is not just a "question"; the term can also refer to a task in which the supplied response does not have to be written or even verbal. A test item could be a multiple-choice question on a class quiz in which the test taker has to choose the correct answer among A, B, C, or D.

Multiple-choice item

A test item could be an essay on a national test that compares and synthesizes information gathered from multiple sources.

Sample essay from PARCC

Alternatively, a test item could be a performance-based, physical demonstration, such as a dancer being asked to show that s/he has mastered a dance routine in order to be admitted to a dance company, or a medical student taking care of a patient while an attending physician evaluates the treatment.

Choreographed Dance as an Example of a Non-Verbal Test Item

Here, we will focus on the types of test items used in the field of education, but it is also important to remember that testing is not limited to obligatory skills testing in classroom or workplace environments; item-based testing can have a broader, voluntary, and social aspect as well, such as the myriad "friend" quizzes found on social media sites like Facebook, where one's responses are shared among peers. In this manner, item-based testing can also be seen as something fun that may reveal aspects of one's personality or behaviors.

This work will attempt to:

  • Outline the process for developing test items for educational purposes
  • Give examples of item development in practice using W. James Popham's concept of learning progressions and the NAEP assessment
  • Outline some of the strengths and weaknesses in item-based testing
  • Explain some ways that technology has impacted item-based assessment
  • Provide a brief glossary of key terms

According to the Standards, there are four phases in test item development (p.37):

  1. Delineation of the purpose(s) of the test and the scope of the construct or the extent of the domain to be measured
  2. Development and evaluation of the test specifications
  3. Development, field testing, evaluation, and selection of items and scoring guides and procedures
  4. Assembly and evaluation of the test for operational use

Key to testing is this first step of defining the purpose of the test and the intended inferences to be made from test results, as well as the target test takers. Critical mistakes may occur when a test is designed for one purpose and/or target population but then gets used for another that was not the original intent of the test design. For example, an item-based summative test may have been originally written to evaluate a student's reading ability, but may be repurposed by a school or district to evaluate a teacher's instructional ability. Such misuse of tests can have unintended high-stakes negative consequences, such as denial of promotions or even job loss.

Test specifications serve as a blueprint for item development, specifying, for example, what types of items will be used (e.g., selected response, short constructed response, or extended response), whether there will be time limits, how many items there will be per section, what materials may be used (e.g., whether calculators, dictionaries, or other devices are allowed during the test), and how the test will be administered (e.g., individually or in a group, on paper or by computer, orally or in writing).
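
To make the idea of a blueprint concrete, the following is a minimal sketch (in Python) of how a test specification might be represented and checked programmatically; the field names and values are illustrative assumptions, not drawn from the Standards or from any particular testing program.

    # A hypothetical test specification ("blueprint") represented as plain data.
    test_specification = {
        "purpose": "Formative check of grade 4 reading comprehension",
        "item_types": {                      # number of items of each type
            "selected_response": 20,
            "short_constructed_response": 4,
            "extended_response": 1,
        },
        "time_limit_minutes": 45,
        "allowed_materials": ["scratch paper"],   # e.g., no dictionaries or calculators
        "administration": {"mode": "computer", "grouping": "individual"},
    }

    def total_items(spec):
        """Total number of items implied by the blueprint."""
        return sum(spec["item_types"].values())

    print(total_items(test_specification))   # -> 25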

The process of field testing involves trying out the items with either a small or large sample of the target population, depending on the scope of the testing, to see which items should be included in the final test product, which need to be revised, and which need to be removed. Examples of items that might be modified or removed are those that reveal a cultural bias (for example, only male or only female test takers got an item correct) and those that are statistical outliers; for example, if an item was intended to be "easy," yet test takers who got the "difficult" items correct scored poorly on it, there may be something inherently wrong with the item. To help determine whether an answer is correct, partially correct, or incorrect, scoring guides often include rubrics, particularly if the test items involve more complex responses, such as a written essay.
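
As a rough illustration of what field-test analysis can look like, the sketch below computes two classic item statistics, difficulty (the proportion of test takers answering correctly) and a simple discrimination index (how much better high scorers do on the item than low scorers), from a small, hypothetical 0/1 scoring matrix. Operational testing programs use more sophisticated psychometric models, but the logic of flagging problematic items is similar.

    # Hypothetical field-test data: rows are test takers, columns are items,
    # 1 = correct, 0 = incorrect.
    responses = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
        [0, 0, 0, 1],
    ]

    def item_difficulty(item):
        """Proportion of test takers answering the item correctly (higher = easier)."""
        return sum(row[item] for row in responses) / len(responses)

    def item_discrimination(item):
        """Difference in difficulty between the top and bottom half of scorers.
        Near-zero or negative values suggest something may be wrong with the item."""
        ranked = sorted(responses, key=sum, reverse=True)
        half = len(ranked) // 2
        top, bottom = ranked[:half], ranked[-half:]

        def prop(group):
            return sum(row[item] for row in group) / len(group)

        return prop(top) - prop(bottom)

    for i in range(len(responses[0])):
        print(f"item {i}: difficulty = {item_difficulty(i):.2f}, "
              f"discrimination = {item_discrimination(i):.2f}")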

The last phase of item development involves the final production of the test, for example, putting together an assessment kit, which might include guidelines for administering the test, the test itself, and any other supplemental materials needed to administer it, such as stimulus texts, manipulatives, and assessment forms.

Test designers may range from teams of professional test designers and researchers creating standardized commercial tests, such as the NWEA MAP tests, to individual instructors creating local tests. In either case, having a set of standards for test item development helps ensure that the tests created are purposeful and thorough.

Concept in Practice

In order to further unpack the concept of defining a purpose for an item-based test, it may be helpful to use Popham's model of learning progressions, particularly in regard to creating formative assessments, as these are the tests that inform a teacher's ongoing instruction and can lead to the most dramatic increases in learning. In essence, learning progressions are a sequence of knowledge or skills that build on one another in order to reach a final learning objective. The smaller steps leading to the target of the learning progression are considered "building blocks."

Learning Progression Model from Popham, 2011, p. 26

These learning progressions should represent long-term curricular goals. Test items may focus on the assessment of one or more of the subskills within a learning progression.

Once the purpose of the test and possible learning progressions have been defined, the test designer must then determine which type of knowledge an item is meant to address. For example, if a learning progression goal is that the student must be able to evaluate how two different authors address similar themes or topics, one subskill might be to identify the main ideas of a text, and a more advanced subskill might be to make comparisons across two or more texts (a minimal sketch of such a progression follows below). Yet a major challenge in designing assessment items is that, too often, they do not address higher-order thinking and may not represent increasing levels of task complexity.
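
To illustrate the structure being described, here is a minimal sketch of one learning progression modeled as an ordered list of building blocks, with hypothetical items tagged by the subskill they assess; the specific wording of the subskills and the item IDs are assumptions made for the example.

    # One learning progression, ordered from foundational to advanced subskills.
    learning_progression = {
        "target": "Evaluate how two different authors address similar themes or topics",
        "building_blocks": [
            "Identify the main ideas of a text",
            "Make comparisons across two or more texts",
            "Evaluate how authors address similar themes or topics",
        ],
    }

    # Hypothetical items, each tagged with the subskill it is meant to assess.
    items = [
        {"id": "Q1", "subskill": "Identify the main ideas of a text"},
        {"id": "Q2", "subskill": "Make comparisons across two or more texts"},
    ]

    # Which building blocks are not yet covered by any item?
    covered = {item["subskill"] for item in items}
    uncovered = [b for b in learning_progression["building_blocks"] if b not in covered]
    print(uncovered)   # -> ['Evaluate how authors address similar themes or topics']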

Another useful model for ensuring a systematic approach to item design is the revised Bloom's Taxonomy of Educational Objectives. Bloom's traditional model of Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation was revised in 2001 by editors Anderson et al. to further refine these categories while also adding a second dimension, which they refer to as the "knowledge dimension" and which is subdivided into Factual, Conceptual, Procedural, and Metacognitive Knowledge.

Revised Bloom's Taxonomy

By using such a framework, a test designer can ensure that the appropriate number and type of items are represented in an assessment. Without such a systematic approach, there is a risk of not including the appropriate type of question for testing a given subskill or learning progression, or of having the distribution of items be skewed, for example, having too many factual knowledge/remember questions (e.g., recalling or identifying details from a text) when the goal is actually to have the test taker critique the author's craft (conceptual knowledge/evaluate).
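
A small sketch of this blueprint check, treating the two dimensions of the revised taxonomy as cells in a matrix and tallying hypothetical items into them, might look like the following; the item tags are invented for illustration.

    from collections import Counter

    # Hypothetical items tagged with a knowledge dimension and a cognitive process.
    items = [
        {"id": "Q1", "knowledge": "factual",    "process": "remember"},
        {"id": "Q2", "knowledge": "factual",    "process": "remember"},
        {"id": "Q3", "knowledge": "factual",    "process": "remember"},
        {"id": "Q4", "knowledge": "conceptual", "process": "evaluate"},
    ]

    # Tally items into (knowledge dimension, cognitive process) cells.
    blueprint = Counter((item["knowledge"], item["process"]) for item in items)
    for cell, count in blueprint.items():
        print(cell, count)
    # -> ('factual', 'remember') 3
    #    ('conceptual', 'evaluate') 1
    # If the goal is to have test takers critique the author's craft, this
    # distribution is skewed toward factual recall and would need rebalancing.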

An example of an assessment using a similar type of framework is the NAEP (National Assessment of Educational Progress). The NAEP is a national test for students in grades 4, 8, and 12 that is administered once every four years. As an example of an item-based assessment, one of its main assessment subject areas is reading. According to NAEP: "The NAEP reading assessment measures the reading and comprehension skills of students in grades 4, 8, and 12 by asking them to read selected grade-appropriate passages and answer questions based on what they have read." The cognitive processes, or "cognitive targets," as NAEP calls them, for item development focus on the ability of students to:

  • Locate and recall (literal thinking with information from within the text)
  • Integrate and interpret (inferential thinking within and across given texts)
  • Critique and evaluate (critical thinking beyond the given text)

Although there are constructed-response items, most of the items are selected response with multiple-choice, single-answer questions, which may or may not actually get at student thinking, as explained further in the next section. Additionally, the readings selected for the assessment are relatively short (fourth-grade passages are 200-800 words each, eighth-grade passages 400-1000 words, and twelfth-grade passages 500-1500 words), which creates some limitations that will also be explored in the next section.

Strengths/Weaknesses of the Approach

Strengths of Item-Based Testing

  • Can be very time efficient, especially in the case of selected-response items, which may be scored by computers in real time
  • Encompasses a wide range of testing formats (selected response, constructed response, performance-based)
  • Can be adapted for use in the classroom, the workplace, and beyond
  • Standardized tests can be administered to a wide range of test takers in order to identify trends with large sample sets, which can help reveal systemic problems that need to be addressed
  • Can be well adapted for diagnostic, formative, and summative assessment

Weaknesses of Item-Based Testing

  • Tends to set up a paradigm of there being only one "right" answer; limited room for original thought/interpretation by the test taker
  • More complex items, such as extended-response essays, can be difficult to score reliably; they require detailed rubrics, and even then there may be great variability between scorers, which could lead to some test takers passing or failing an assessment based on the variability of scorers instead of the knowledge demonstrated (a simple way to quantify scorer agreement is sketched after this list)
  • The exact source of error is not always clear; for example, if a student gets a selected-response item "wrong," is it because s/he did not understand the construct being assessed, or because s/he did not understand the format of the item?
  • Conversely, if a test taker randomly guesses the correct answer on a selected-response item, s/he receives the same credit as if s/he had actually known the right answer, which may give a false picture of what the test taker actually "knows"
  • High-stakes testing tends to be selected-response based; e.g., under No Child Left Behind, if a school is not making "adequate yearly progress," it may be shut down, which may lead schools to "teach to the test" instead of following a set curriculum
  • Can be difficult to separate out cultural bias and the way it influences item design; for example, a national standardized test should be written in such a way as to reflect a wide variety of test takers (e.g., gender, ethnicity, socioeconomic background, urban/suburban/rural, East/West/Midwest/South, etc.), yet the actual test content may not reflect this cultural diversity, which could result in some test takers feeling alienated and, as a result, not performing as well. Such cultural bias in testing often takes subtle forms, for example, if the illustrations/photos in a test only show white males as subjects or if the settings for the readings are all suburban or rural.
  • Test taker has limited to non-existent input into the way item-based tests are created or interpreted; no room for synergistic feedback between the test taker and the test designer or the scorer of the test
  • Often set up in such a way as to try to "trick" the test taker with deceptive distractor choices; for example, if the correct answer to a selected-response item is "A. 125," a distractor choice might be one with the digits reversed, like "B. 152," so that at a glance a test taker may mistakenly choose B. However, such a mistaken choice does not prove that the test taker was unable to calculate the answer correctly, so much as that the test taker was not sensitive to the test design. Such items may create a stressful or antagonistic assessment experience for the test taker, and this kind of negative experience with item-based testing can lead to individuals feeling like outsiders who have a generalized dislike of testing, which may end up creating a self-fulfilling prophecy of "not testing well" for certain test takers.
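
As referenced in the list above, scorer variability on complex items can be quantified. The sketch below computes exact percent agreement and Cohen's kappa for two hypothetical scorers rating the same ten essay responses on a rubric; the scores and the choice of statistic are illustrative, not a description of any particular testing program's procedures.

    from collections import Counter

    # Hypothetical rubric scores (0-4 scale) from two scorers on the same ten essays.
    scorer_a = [4, 3, 2, 4, 1, 3, 2, 4, 3, 2]
    scorer_b = [4, 2, 2, 4, 1, 3, 3, 4, 3, 1]

    n = len(scorer_a)

    # Exact agreement: proportion of essays given identical scores by both scorers.
    observed = sum(a == b for a, b in zip(scorer_a, scorer_b)) / n

    # Agreement expected by chance, based on each scorer's score distribution.
    count_a, count_b = Counter(scorer_a), Counter(scorer_b)
    expected = sum(count_a[s] * count_b[s] for s in set(scorer_a) | set(scorer_b)) / n ** 2

    # Cohen's kappa: agreement above chance, scaled so that 1 is perfect agreement.
    kappa = (observed - expected) / (1 - expected)
    print(f"exact agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")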

Implications for Technology Mediated Environment

Greater accessibility

When items are administered on a computer, they have greater potential accessibility for test takers with special needs. Some examples of ways items can be adapted for the visually impaired on a computer include allowing text and graphics to be enlarged or offering the option to change the color scheme to create greater color contrast. For those with limited hearing, videos may have captioning turned on, and test takers may be able to pause and review in a way that would not be possible in a traditional, group-administered video viewing.

Multimodal/Interactive Items

With technology advancing every day, test item designers are taking steps to make technology-enhanced test items that are not just a digitized substitution for a paper-and-pencil item, but that leverage technology to transform the way we assess. A useful resource for exploring some of the possible item types is the Taxonomy of Intermediate Constraint Assessment created by University of Oregon professor Kathleen Scalise. This interactive taxonomy focuses on item types that exist on the spectrum between fully constrained selected-response and fully constructed-response items. These item types allow test takers to engage in a multimodal experience with test items in a way that was not previously possible. For example, in the figure below with the item on branches of government, the image capture shows how a concept map can be enhanced through use of the computer: the test taker is able to easily manipulate the graphics, draw arrows showing relationships, and label different parts; in a traditional pen-and-paper environment, this could get very messy and difficult to score due to illegibility.

Example of Digital Concept Mapping Item from Scalise

Additionally, having items be computer based allows for testing of multiple literacies, including being able to gather information from audiovisual sources, such as incorporating video into test items as in the example from PARCC below. The fact that such video items are individually administered makes it possible for test takers to view the video multiple times, pause, and re-watch certain parts that may help them better understand the information, in a way that would not be possible if, for example, the video were shown to the whole class.

Sample Video-Based Items from PARCC Assessment

Computer Adaptive Items

Computer-adaptive items, in theory, allow for differentiated assessment by providing the test taker with successively more difficult or successively easier questions, depending on whether the test taker got a "correct" or "incorrect" answer. However, the back-end decision-making is often not transparent: how does the system determine that getting a given item correct or incorrect will be predictive of how the test taker would do on the next item, and how many items of any given type does a test taker receive before the system makes a judgment of mastery (e.g., has a test taker "mastered" the concept of identifying the main idea of a passage after getting one "main idea" item correct, or after two, or four)?
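
The sketch below illustrates only the branching idea, moving to a harder item after a correct response and an easier item after an incorrect one, with an arbitrary "three in a row" mastery rule. Real computer-adaptive tests rely on item response theory to select items and decide when to stop, and the details of those rules are exactly what is often not transparent.

    import random

    def administer_item(difficulty):
        """Stand-in for presenting an item; simulates a response that is less
        likely to be correct as difficulty (1-10) increases."""
        return random.random() > difficulty / 10

    difficulty = 5          # start near the middle of the difficulty scale
    correct_streak = 0

    for _ in range(10):     # administer at most 10 items in this sketch
        if administer_item(difficulty):
            correct_streak += 1
            difficulty = min(10, difficulty + 1)   # next item is harder
        else:
            correct_streak = 0
            difficulty = max(1, difficulty - 1)    # next item is easier
        if correct_streak >= 3:                    # hypothetical mastery rule
            print(f"mastery judged at difficulty level {difficulty}")
            break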

Digital Data Management

With test data available in a digital format, it is much easier to run analyses to identify trends for individual learners, as well as for classrooms, schools, and districts. This data may be coarse or fine grained, depending on the quality of the assessment, and can help provide evidence for areas where additional instructional support is needed. Additionally, it is much easier to share this data, making it possible to keep long-term records of a student's performance level as he or she moves from grade to grade and school to school.
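
A minimal sketch of this kind of trend analysis, aggregating hypothetical item results by classroom and skill, is shown below; in practice, the records would come from the assessment platform's own database rather than being typed in by hand.

    from collections import defaultdict
    from statistics import mean

    # Hypothetical per-student results exported from a digital assessment system.
    records = [
        {"student": "S1", "classroom": "4A", "skill": "main_idea", "score": 0.80},
        {"student": "S2", "classroom": "4A", "skill": "main_idea", "score": 0.60},
        {"student": "S3", "classroom": "4B", "skill": "main_idea", "score": 0.40},
        {"student": "S4", "classroom": "4B", "skill": "inference", "score": 0.55},
    ]

    # Group scores by (classroom, skill) to surface where support may be needed.
    by_group = defaultdict(list)
    for record in records:
        by_group[(record["classroom"], record["skill"])].append(record["score"])

    for (classroom, skill), scores in sorted(by_group.items()):
        print(f"{classroom} / {skill}: mean = {mean(scores):.2f} over {len(scores)} student(s)")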

Example of Digital Management System Tracking Test Results in a Digital Environment

Reflections

While item-based testing has some powerful advantages, such as the efficiency of administering a selected-response test, there are also serious issues that can be difficult to address. A key example of one such difficult-to-address problem is determining the test taker's actual thinking when responding to an item, for example, why s/he chose a particular multiple-choice answer (e.g., was it an error of logic or an error of test-taking selection?). Nonetheless, item-based testing still serves a useful purpose: while it may not provide a multidimensional view of a student as a reader, it can provide a snapshot and allow for more systematic comparison between test takers to help identify trends. Such assessment should not be the exclusive manner in which a learner is judged to have mastered a skill or learning progression; it is one piece of the puzzle, and it is perhaps best used as a preliminary probe or diagnostic tool, or as a supplement to ongoing formative assessment in which richer assessment formats, where the student has more input into demonstrating knowledge (such as portfolio-based assessment), are also being used.


References

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (1999). Standards for Educational and Psychological Testing. Washington, DC: AERA.

Anderson, L.W., Krathwohl, D.R., Airasian, P.W., et al. (2001). A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. New York: Addison Wesley Longman.

Bloom, B.S. (1956). The Taxonomy of Educational Objectives, The Classification of Educational Goals, Handbook I: Cognitive Domain. New York: David McKay Company.

Kalantzis, M. and Cope, W. (2012). "Measuring Learning," New Learning. pp. 305-341.

National Assessment of Educational Progress (n.d.). NAEP reading assessment. Retrieved from http://nces.ed.gov/nationsreportcard/reading/

PARCC Assessment (n.d.). Retrieved from http://www.parcconline.org

Popham, W.J. (2011). "Learning Progressions: Blueprints for the Formative Assessment Process," Transformative Assessment in Action. Alexandria, VA: ASCD. pp. 24-47.

Scalise, K. (n.d.). Taxonomy of Assessment. Retrieved from http://pages.uoregon.edu/kscalise/taxonomy/taxonomy.html


Key Terms Glossary

Diagnostic assessment: assessment to identify areas of strength/weakness to create a baseline for measuring growth/decline

Extended response items: test items where the test taker must supply a response in a few sentences, paragraphs, or pages; an example might be an item that requires an essay response

Formative assessment: assessment that is ongoing that allows the instructor to adjust teaching based on testing results

Item: (AERA, 1999, p. 177) "A statement, question, exercise, or task on a test for which the test taker is to select or construct a response, or perform a task"

Learning progressions: (Popham, 2011, p. 25) "a sequenced set of subskills or bodies of enabling knowledge that, it is thought, students must master en route to mastering a more remote target curricular aim"

Portfolio-based assessment: in this form of assessment, the subject often has input as to the content of what will be included in the assessment; for example, the student may put together a collection or presentation of what s/he considers his/her best work, which may take multiple modes, such as any combination of written work, video collages, artwork, recorded interviews, etc.

Selected response items: test items where the test taker must choose one or more responses from a given selection of possible responses; an example might be a multiple-choice item where the test taker must choose a single answer among A, B, C, or D, or between "true" and "false"

Short constructed response items: test items that require the test taker to supply one or a few words (or numbers) for an answer, such as "429" or "the Battle of Gettysburg"

Summative assessment: assessment that is administered at the end of an instructional unit to demonstrate whether or not the test taker has mastered the skills that have been taught