When one thinks of item-based testing, typically the first thing to come to mind is a test consisting of a series of questions requiring single answer responses. However, this is a very narrow view of item-based testing; in order to better understand the process of item-based testing and item design, it is key to first be clear about what is meant by the term "item" in the context of assessment. According to the American Educational Research Association's Standards for Educational and Psychological Testing, an item is defined as:
"A statement, question, excercise, or task on a test for which the test taker is to select or construct a response, or perform a task" (177).
This testing-industry-approved definition demonstrates that an "item" is not just a "question": the term can refer to a task in which the supplied response does not have to be written or even verbal. A test item could be a multiple-choice question on a class quiz in which the test taker has to choose the correct answer among A, B, C, or D.
A test item could be an essay on a national test that compares and synthesizes information gathered from multiple sources.
Alternatively, a test item could be a performance-based, physical demonstration, such as a dancer being asked to show that s/he has mastered a dance routine in order to be admitted to a dance company, or a medical student taking care of a patient while an attending physician evaluates the treatment.
Here, we will focus on the types of test items used in the field of education, but it is also important to remember that testing is not limited to obligatory skills testing in classroom or workplace environments. Item-based testing can have a broader, voluntary, and social aspect as well, such as the myriad "friend" quizzes found on social media sites like Facebook, where one's responses are shared among peers. In this manner, item-based testing can also be seen as something fun that may reveal aspects of one's personality or behaviors.
This work will attempt to:
According to the Standards, there are four phases in test item development (p.37):
Key to testing is this first step of defining the purpose of the test, the intended inferences to be made from test results, and the target test takers. Critical mistakes may occur when a test is designed for one purpose and/or target population but then gets used for another that was not the original intent of the test design. For example, an item-based summative test may have been written to evaluate a student's reading ability but then be repurposed by a school or district to evaluate a teacher's instructional ability. Such misuse of tests can have unintended high-stakes negative consequences, such as denial of promotion or even job loss.
Test specifications serve as a blueprint for item development, specifying, for example, what types of items will be used (e.g., selected response, short constructed response, or extended response), whether there will be time limits, how many items there will be per section, what materials may be used (e.g., whether calculators, dictionaries, or other aids are permitted during the test), and how the test will be administered (e.g., individually or in a group, on paper or by computer, orally or in writing).
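The elements of a test specification listed above can be captured as a simple structured record. Below is a minimal sketch in Python; the field names and the example reading test are illustrative assumptions, not drawn from the Standards:

```python
from dataclasses import dataclass

@dataclass
class TestSpecification:
    """Blueprint for item development (illustrative fields only)."""
    item_types: list           # e.g. ["selected response", "extended response"]
    time_limit_minutes: int    # total time allowed for the test
    items_per_section: dict    # section name -> number of items
    permitted_materials: list  # e.g. ["calculator", "dictionary"]
    administration: dict       # e.g. {"grouping": "group", "medium": "paper"}

    def total_items(self) -> int:
        # Total item count is implied by the per-section counts
        return sum(self.items_per_section.values())

# Hypothetical blueprint for a two-section reading test
spec = TestSpecification(
    item_types=["selected response", "extended response"],
    time_limit_minutes=60,
    items_per_section={"literary text": 10, "informational text": 12},
    permitted_materials=["dictionary"],
    administration={"grouping": "group", "medium": "paper", "mode": "written"},
)
```

Writing the specification down in one structured place makes it easy to check later-drafted items against the blueprint.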
The process of field testing involves trying out the items with a small or large sample of the target population, depending on the scope of the testing, to see which items should be included in the final test product, which need to be revised, and which need to be removed. Items that might be modified or removed include those that reveal a cultural bias (for example, only males or only females answered an item correctly) and those that behave as outliers: if an item intended to be "easy" was answered incorrectly by test takers who got the "difficult" items correct, there may be something inherently wrong with the item. To help determine whether an answer is correct, partially correct, or incorrect, scoring guides often include rubrics, particularly if the test items involve more complex responses, such as a written essay.
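The kind of screening described above can be approximated with simple classical item statistics. A minimal sketch, assuming dichotomously scored (right/wrong) items; the data and the interpretation thresholds are illustrative, not prescribed by the Standards:

```python
def item_difficulty(responses):
    """Proportion of test takers answering the item correctly
    (the classical 'p-value' of an item)."""
    return sum(responses) / len(responses)

def item_discrimination(item_responses, total_scores):
    """Rough discrimination check: mean total score of those who got
    the item right minus mean total score of those who got it wrong,
    in raw score points (a simplification, not a full point-biserial
    correlation). A near-zero or negative value can flag the kind of
    outlier item described above."""
    right = [s for r, s in zip(item_responses, total_scores) if r == 1]
    wrong = [s for r, s in zip(item_responses, total_scores) if r == 0]
    if not right or not wrong:
        return 0.0
    return sum(right) / len(right) - sum(wrong) / len(wrong)

# 1 = correct, 0 = incorrect for one field-tested item, paired with
# each test taker's total score on the rest of the form
responses = [1, 1, 0, 1, 0, 0]
totals    = [9, 8, 3, 7, 2, 4]
p = item_difficulty(responses)               # 0.5: moderately difficult
d = item_discrimination(responses, totals)   # positive: stronger students got it right
```

An item intended to be "easy" but showing a low p-value, or an item with negative discrimination, would be a candidate for revision or removal.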
The last phase of item development involves the final production of the test: for example, putting together an assessment kit that might include guidelines for administering the test, the test itself, and any supplemental materials needed for administration, such as stimulus texts, manipulatives, and assessment forms.
Test designers may range from teams of professional designers and researchers working on standardized commercial tests, such as the NWEA MAP tests, to individual instructors creating local tests. In either case, having a set of standards for test item development helps ensure that the tests created are purposeful and thorough.
In order to further unpack the concept of defining a purpose for an item-based test, it may be helpful to use Popham's model of learning progressions, particularly in regard to creating formative assessments, as these are the tests that inform a teacher's ongoing instruction and can lead to the most dramatic increases in learning. In essence, learning progressions are a sequence of knowledge or skills that build on one another in order to reach a final learning objective. These smaller steps leading to the target of the learning progression are considered "building blocks."
These learning progressions should represent long-term curricular goals. Test items may focus on assessing one or more of the subskills within a learning progression.
Once the purpose of the test and the relevant learning progressions have been defined, the test designer must determine which type of knowledge an item is meant to address. For example, if a learning progression goal is "student must be able to evaluate how two different authors address similar themes or topics," one subskill might be "identify the main ideas of a text," and a more advanced subskill might be "make comparisons across two or more texts." Yet a major challenge in designing assessment items is that, too often, they do not address higher-order thinking and may not represent increasing levels of task complexity.
Another useful model for ensuring a systematic approach to item design is the revised Bloom's Taxonomy of Educational Objectives. Bloom's traditional model of Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation was revised in 2001 by editors Anderson et al. to further refine these categories while adding a second dimension, the "knowledge dimension," which is subdivided into Factual, Conceptual, Procedural, and Metacognitive Knowledge.
By using such a framework, a test designer can ensure that the appropriate number and types of items are represented in an assessment. Without a systematic approach, a test may fail to include the appropriate type of question for a given subskill or learning progression, or the distribution of items may be skewed: for example, having too many factual knowledge/remember questions (e.g., recalling or identifying details from a text) when the goal is actually to have the test taker critique the author's craft (conceptual knowledge/evaluate).
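A check of this sort can be operationalized by tallying drafted items against a two-way blueprint built on the revised taxonomy's two dimensions. A minimal sketch; the item tags and target counts are hypothetical:

```python
from collections import Counter

# Each drafted item tagged with (knowledge dimension, cognitive process)
items = [
    ("factual", "remember"),
    ("factual", "remember"),
    ("factual", "remember"),
    ("conceptual", "evaluate"),
]

# Target distribution from the test specification (hypothetical)
blueprint = {
    ("factual", "remember"): 2,
    ("conceptual", "evaluate"): 3,
}

# Count what was actually drafted and flag under-represented cells
actual = Counter(items)
gaps = {cell: target - actual.get(cell, 0)
        for cell, target in blueprint.items()
        if actual.get(cell, 0) < target}
# Here `gaps` would flag that two more conceptual/evaluate items are
# needed, exactly the skew described above: plenty of recall items,
# too few that ask the test taker to critique or evaluate.
```

The same tally also reveals over-representation (cells where `actual` exceeds the blueprint), which can prompt trimming redundant recall items.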
An example of an assessment using a similar framework is the NAEP (National Assessment of Educational Progress). The NAEP is a national test for students in grades 4, 8, and 12 that is administered once every four years. One of its main assessment subject areas, and an example of an item-based assessment, is reading. According to NAEP: "The NAEP reading assessment measures the reading and comprehension skills of students in grades 4, 8, and 12 by asking them to read selected grade-appropriate passages and answer questions based on what they have read." The cognitive processes, or "cognitive targets," as NAEP calls them, for item development focus on the ability of students to:
Although there are constructed response items, most of the items are selected response, multiple-choice, single-answer questions, which may or may not actually get at student thinking, as we will explain further in the next section. Additionally, the readings selected for the assessment are relatively short (fourth-grade passages are 200-800 words each, eighth-grade passages 400-1000 words, and twelfth-grade passages 500-1500 words), which also creates some limitations that will be explored further in the next section.
Strengths of Item-Based Testing
Greater accessibility
With items administered on a computer, there is a greater level of potential accessibility for test takers with special needs. For example, items can be adapted for the visually impaired by allowing text and graphics to be enlarged or by offering the option to change the color scheme for greater contrast. For those with limited hearing, videos may have captioning turned on, and test takers may be able to pause and review in a way that would not be possible in a traditional group-administered video viewing.
Multimodal/Interactive Items
With technology advancing every day, test item designers are taking steps to make technology-enhanced test items that are not just digitized substitutions for paper-and-pencil items, but that leverage technology to transform the way we assess. A useful resource for exploring some of the possible item types is the Taxonomy of Intermediate Constraint Assessment created by University of Oregon professor Kathleen Scalise. This interactive taxonomy focuses on item types that exist on the spectrum between fully constrained selected response and fully constructed response items. These item types allow test takers to engage in multimodal experiences with test items in a way that was not previously possible. For example, in the figure below with the item on branches of government, the image capture shows how a concept map can be enhanced through use of the computer: the test taker is able to easily manipulate the graphics, draw arrows showing relationships, and label different parts, whereas in a traditional pen-and-paper environment this could get very messy and difficult to score due to illegibility.
Additionally, having items be computer-based allows for testing of multiple literacies, including the ability to gather information from audiovisual sources, such as the video incorporated into the test item in the example from PARCC below. Because such video items are individually administered, test takers can view the video multiple times, pause, and re-watch the parts that may help them better understand the information, in a way that would not be possible if, for example, the video were shown to the whole class.
Computer Adaptive Items
Computer adaptive items, in theory, allow for differentiated assessment by providing the test taker with successively more difficult or successively easier questions, depending on whether the test taker got a "correct" or "incorrect" answer. However, the back-end decision-making is often not transparent: how does the system determine that a correct or incorrect response to a given item is predictive of how the test taker would do on the next item, and how many items of a given type must a test taker answer before the system makes a judgment of mastery (e.g., has a test taker "mastered" the concept of identifying the main idea of a passage after getting one "main idea" item correct, or after two, or four)?
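The basic branching logic is simple to sketch, though real adaptive systems rest on item response theory models that this sketch omits. In the toy version below, mastery is naively declared after a fixed streak of correct answers at the top difficulty level; the levels and streak length are arbitrary assumptions, which is exactly the kind of hidden back-end design decision questioned above:

```python
def run_adaptive_session(answers, levels=5, start=3, mastery_streak=2):
    """Step difficulty up after a correct answer and down after an
    incorrect one; declare mastery after `mastery_streak` consecutive
    correct answers at the top level. `answers` is a list of booleans
    simulating the test taker's responses, in order."""
    level, streak = start, 0
    for correct in answers:
        if correct:
            # Only correct answers at the top level count toward mastery
            streak = streak + 1 if level == levels else 0
            level = min(levels, level + 1)
        else:
            streak = 0
            level = max(1, level - 1)
        if streak >= mastery_streak:
            return "mastered", level
    return "undetermined", level
```

Changing `mastery_streak` from 2 to 4 changes who is judged to have "mastered" the skill, with no change in the test taker's actual ability, which illustrates why transparency about these parameters matters.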
Digital Data Management
With test data available in a digital format, it is much easier to run analyses that identify trends for individual learners, as well as for classrooms, schools, and districts. This data may be coarse or fine grained, depending on the quality of the assessment, and can help provide evidence for areas where additional instructional support is needed. Additionally, it is much easier to share this data, making it possible to keep long-term records of a student's performance level as they go from grade to grade and school to school.
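The simplest form of such trend analysis is aggregating scores by grouping. A minimal sketch with hypothetical exported records (the student names, classroom labels, and scores are invented for illustration):

```python
from collections import defaultdict

# (student, classroom, score) records exported from a digital test platform
records = [
    ("Ana", "4A", 78), ("Ben", "4A", 62),
    ("Cal", "4B", 91), ("Dee", "4B", 85),
]

# Group scores by classroom
by_class = defaultdict(list)
for student, classroom, score in records:
    by_class[classroom].append(score)

# Per-classroom means; a markedly lower mean can flag where additional
# instructional support may be needed
class_means = {c: sum(scores) / len(scores) for c, scores in by_class.items()}
```

The same grouping pattern extends upward (by school or district) or downward (by learning progression subskill per student), which is what makes digital data management attractive.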
Weaknesses of Item-Based Testing

While item-based testing has some powerful advantages, such as the efficiency of administering a selected response test, there are also serious issues that can be difficult to address. A key example is determining a test taker's actual thinking when responding to an item: for example, why they chose a particular multiple-choice answer (was it an error of logic or an error of test-taking selection?). Nonetheless, item-based testing still serves a useful purpose: while it may not provide a multidimensional view of a student as a reader, it can provide a snapshot and allow for more systematic comparison between test takers to help identify trends. Such assessment should not be the exclusive manner in which a learner is judged to have mastered a skill or learning progression, but it is a piece of the puzzle, perhaps best used as a preliminary probe or diagnostic tool, or as a supplement to ongoing formative assessment in which richer formats that give the student more input into demonstrating knowledge, such as portfolio-based assessment, are also used.
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (1999). Standards for Educational and Psychological Testing. Washington, DC: AERA.
Anderson, L.W., Krathwohl, D.R., Airasian, P.W., et al. (2001). A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. New York: Addison Wesley Longman.
Bloom, B.S. (1956). The Taxonomy of Educational Objectives, The Classification of Educational Goals, Handbook I: Cognitive Domain. New York: David McKay Company.
Kalantzis, M. and Cope, W. (2012). "Measuring Learning." New Learning. pp. 305-341.
NAEP reading assessment. (n.d.). National Assessment of Educational Progress. Retrieved from http://nces.ed.gov/nationsreportcard/reading/
PARCC Assessment. (n.d.). Retrieved from http://www.parcconline.org
Popham, W.J. (2011). "Learning Progressions: Blueprints for the Formative Assessment Process." Transformative Assessment in Action. Alexandria, VA: ASCD. pp. 24-47.
Scalise, K. (n.d.). Taxonomy of Assessment. Retrieved from http://pages.uoregon.edu/kscalise/taxonomy/taxonomy.html
Diagnostic assessment: assessment to identify areas of strength/weakness to create a baseline for measuring growth/decline
Extended response items: test items where the test taker must supply a response in a few sentences, paragraphs, or pages; an example might be an item that requires an essay response
Formative assessment: ongoing assessment that allows the instructor to adjust teaching based on testing results
Item: (AERA, 1999, p. 177) "A statement, question, exercise, or task on a test for which the test taker is to select or construct a response, or perform a task"
Learning progressions: (Popham, 2011, p. 25) "a sequenced set of subskills or bodies of enabling knowledge that, it is thought, students must master en route to mastering a more remote target curricular aim"
Portfolio-based assessment: in this form of assessment, the subject often has input as to the content of what will be included in the assessment; for example, the student may put together a collection or presentation of what s/he considers their best work, which may take multiple modes, such as any combination of written work, video collages, artwork, recorded interviews, etc.
Selected response items: test items where the test taker must choose one or more answers from a given selection of possible responses; an example might be a multiple-choice item where the test taker must choose a single answer among A, B, C, or D, or between "true" and "false"
Short constructed response items: test items that require the test taker to supply one or a few words (or numbers) for an answer, such as "429" or "the Battle of Gettysburg"
Summative assessment: assessment administered at the end of an instructional unit to demonstrate whether or not the test taker has mastered the skills that have been taught