
Assessment Theory

Project Overview

Project Description

Write a wiki-like entry defining an assessment concept. Define the concept, describe how the concept translates into practice, and provide examples. Concepts could include any of the following, or choose another concept that you would like to define. Please send a message to both admins through Scholar indicating which you would like to choose - if possible, we only want one or two people defining each concept so, across the group, we have good coverage of concepts.


Wiki Entry for Item Based Testing and Item Design

Overview and History

Item-based testing is a form of assessment that measures an individual's knowledge or skill in a given area based on his or her responses to a single question, or "item," or to a series of items. Item-based testing is the most common form of standardized assessment in elementary and secondary education settings in the United States, where it is used primarily for criterion-referenced and norm-referenced assessment.

Item-based testing originated with the work of Alfred Binet and his collaborator Theodore Simon. First issued in the early 20th century, Binet's intelligence test consisted of 30 items, increasing in difficulty, that measured individuals' mental ages against those of their peer group (Binet). Binet focused on finding items that would discriminate among students, and he based his assessment results strictly on how students fared relative to one another. This is what we now know as norm-referenced assessment (New Learning, 314).

Not long after Binet's test took off, a similar one emerged for use in the United States Army. Robert Yerkes, then president of the American Psychological Association, created a series of Army intelligence tests to screen recruits against intelligence criteria. Like Binet's test, these tests - the written Army Alpha for literate recruits and the pictorial Army Beta for illiterate ones (Gould) - were norm-referenced, used to determine individuals' "native intellectual ability" and "inborn intelligence" as compared to others. While both attempted to measure intelligence across a population, Yerkes' tests, unlike Binet's, used the results specifically to rank individuals by ability and, therefore, by their candidacy. Although Yerkes' tests carry strong racist and eugenicist connotations not often found in today's standardized tests, there are striking similarities in the use of tests as gateways into institutions, such as the Army or a university. Both Binet's and Yerkes' tests set the standard for the future of item-based tests in the Western world as norm-referenced assessment.

Item-Based Testing in the 20th Century

The use of standardized testing continued to rise in popularity in the decades after Binet's and Yerkes' tests, yet the practice lacked efficiency due to the limitations of scoring technology. It was nearly impossible to score large quantities of tests until the 1950s, when optical scoring machines became more widely available (Weiss, 2). These machines, using Optical Character Recognition (OCR) technology, were the early relatives of the extremely popular Scantron testing technology. Scantron, which popularized the use of #2 pencils in test settings, "reads pencil marks and converts them into a computer-usable form" and revolutionized the testing industry (Macmillan Science Library, 1). Without OCR technology, efficient grading of widespread standardized tests such as the SAT and GRE would not be possible.

With this advance in technology, test development became much more regulated, as large numbers of exams could be administered, collected, and analyzed quickly. As seen in the image below, the process is cyclical: tests are designed and delivered based on the item bank; once the results are collected, the data are analyzed to feed the item bank with statistical information about each item; changes to the item bank are made as needed, and the cycle continues each time the items are used.

The Test Development Cycle (Weiss, 2)
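To make the cycle concrete, here is a minimal Python sketch of one pass through it: items are drawn from the bank, delivered, scored, and the results fed back into the bank as statistics. The item fields and the simple proportion-correct statistic are illustrative assumptions, not a description of any actual testing vendor's system.

```python
# Minimal sketch of one pass through the test development cycle:
# draw items from the bank, deliver the test, collect responses,
# and feed simple statistics back into the bank.

item_bank = [
    {"id": "Q1", "prompt": "2 + 2 = ?", "answer": "4", "times_used": 0, "times_correct": 0},
    {"id": "Q2", "prompt": "Capital of France?", "answer": "Paris", "times_used": 0, "times_correct": 0},
]

def design_test(bank, n_items):
    """Assemble a test from the first n items in the bank."""
    return bank[:n_items]

def deliver_and_collect(test, student_answers):
    """Score each delivered item against a student's answers."""
    return [(item, student_answers.get(item["id"]) == item["answer"]) for item in test]

def analyze_and_update(bank, results):
    """Feed usage and correctness statistics back into the item bank."""
    for item, correct in results:
        item["times_used"] += 1
        if correct:
            item["times_correct"] += 1

test = design_test(item_bank, 2)
results = deliver_and_collect(test, {"Q1": "4", "Q2": "Lyon"})
analyze_and_update(item_bank, results)

for item in item_bank:
    p = item["times_correct"] / item["times_used"]
    print(item["id"], "proportion correct:", p)  # flags items that may need revision
```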

Even with this method in place, it was not until the 1960s and 1970s, with the advent of computers and, later, personal computers, that item performance could truly be measured effectively. And not until the mid-1980s could item banks, along with their statistical data, be stored effectively in spreadsheets and software. Item bank storage, organization, and analysis were not the only revolutionary uses of computers for item-based testing, however; today, item-based assessment continues to evolve through computer-based assessment and adaptive learning technologies (Weiss, 3-4).

Item-Based Testing Today

By the 1990s, state-mandated tests based on more rigorous content and performance standards in different subject areas had become widespread. Since the American Federation of Teachers began tracking the use of these standards-based tests in 1995, the number of states using or planning to use this form of assessment has risen by almost 50 percent, from 33 to 49 (Weiss, 2-3).

Through the 1980s and into the 1990s, state-mandated, criterion-referenced, item-based assessment rose to prominence, and it remains prominent today (Clarke). National standards-based testing initiatives, such as the Common Core State Standards (CCSS), dictate learning criteria for language arts and mathematics comprehension for elementary and secondary students across the country. In order to measure selected criterion-based standards, each item must measure a unique piece of knowledge, ability, or skill (Gorin). Item types vary from selection-based items such as multiple choice, true/false, and fill-in-the-blank, where the student selects the correct answer from a set of provided choices, to open-ended items, where the student responds to the question or prompt in his or her own words.

While selection-based item types were at one point considered to test only lower-level thinking (Hollingworth), improvements in assessment creation have resulted in more complex questions. Some strong examples of technology-enhanced standards-based items can be seen in those used in Partnership for Assessment of Readiness for College and Careers (PARCC) assessments. Their selection-based item types include the Evidence-Based Selected Response (EBSR) and the Technology-Enhanced Constructed Response (TECR). Below are examples from their language arts standards-based assessments, both of which closely correlate to a specific standard, such as Standard 1, which "asks students to support answers with explicit textual evidence" (PARCC, 2).

EBSR items follow the format of a traditional multiple-choice, selection-based item; where they differ is in how and what they measure (Conrad-Curry). The first part of the item tests accuracy and comprehension of a written passage, while the second part asks the student to select the portion of the passage that best supports that answer. EBSR items allow for partial and full credit, weighting the two parts equally at one point each.

Sample EBSR test item (PARCC)
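As a rough illustration of the scoring rule described above - one point per part, with partial credit possible - the short sketch below scores a hypothetical two-part EBSR-style response. The field names and answer keys are made up for the example; PARCC's actual scoring rubrics are more detailed.

```python
# Hedged sketch of scoring a two-part EBSR-style item: Part A (comprehension)
# and Part B (supporting evidence) are worth one point each, allowing
# partial credit. Field names and keys are illustrative, not PARCC's.

def score_ebsr(response, key):
    points = 0
    if response.get("part_a") == key["part_a"]:
        points += 1  # correct answer to the comprehension question
    if response.get("part_b") == key["part_b"]:
        points += 1  # correct supporting evidence selected
    return points  # 0 = no credit, 1 = partial credit, 2 = full credit

key = {"part_a": "B", "part_b": "D"}
print(score_ebsr({"part_a": "B", "part_b": "A"}, key))  # 1 (partial credit)
print(score_ebsr({"part_a": "B", "part_b": "D"}, key))  # 2 (full credit)
```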

The TECR item is very similar to the EBSR, except that it allows the student more autonomy. As seen in Parts B and C of the sample item, the student must drag and drop the correct passage rather than selecting from a given list of candidates.

Sample TECR Test Item (PARCC)

Where in the past a student's process was inferred from item selection, TECR items allow students to select the specific sentences of the passage from which their answer to Part A derives. While similar to the EBSR, this gives the student even more autonomy and provides a closer look into the student's process of selecting an answer. These items' ability to measure criterion standards more accurately and creatively illustrates the advances in item-based assessment technology (The PARCC Assessment).

The Future of Item-Based Assessment

While the TECR and EBSR test items use technology to reinvent item construction, technology is also being used to improve the measurement of the statistical data each item provides and thereby to adapt items to match students' abilities.

An example of the adaptive testing movement is the adaptive learning company Knewton. What stands out about Knewton's item development is its use of Item Response Theory (IRT). Today, IRT is preferred over its predecessor, Classical Test Theory (CTT), for understanding item performance in high-stakes testing, due to its higher level of sophistication, precision (Thompson, 1), and overall accuracy (Magno, 8). Whereas CTT measures performance based on the number of correct answers, IRT focuses on the difficulty level of each individual item (alejandro). According to Knewton's Tech Blog:

“Item response theory (IRT) attempts to model student ability using question level performance instead of aggregate test level performance. Instead of assuming all questions contribute equally to our understanding of a student’s abilities, IRT provides a more nuanced view on the information each question provides about a student.” (alejandro)

Basically, IRT uses comparisons of students' performance on specific items to estimate both item difficulty and student ability. If two students receive the same score on a test, it does not necessarily mean they know the same things; IRT distinguishes this difference, because different item parameters tell us specific things about students' knowledge. For example, suppose Student 1 has answered one question and Student 2 has answered two. Using IRT, we can infer which question was more difficult and, from the students' answers, which student has the higher ability. As seen in the chart below, Student 1 got Question 1 correct, while Student 2 got Question 1 incorrect but Question 2 correct. Since Student 2 got Question 1 incorrect, we can presume that it is more difficult than Question 2. Therefore, even though Student 1 answered only one question, we can presume he or she has a higher ability than Student 2 (alejandro).

Item Response Theory
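To make this reasoning concrete, the sketch below uses the simplest IRT formulation, the one-parameter logistic (Rasch) model, in which the probability of a correct answer depends only on the gap between a student's ability and an item's difficulty. The ability and difficulty values are invented for illustration and are not drawn from Knewton or any published dataset.

```python
import math

# One-parameter logistic (Rasch) IRT model: the probability of answering an
# item correctly depends on the gap between student ability (theta) and item
# difficulty (b). The numbers below are made-up values for illustration only.

def p_correct(theta, b):
    """Probability that a student of ability theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Question 1 is assumed harder than Question 2.
difficulty = {"Q1": 1.5, "Q2": -0.5}

# Hypothetical abilities consistent with the example above:
# Student 1 answered the harder Q1 correctly; Student 2 missed Q1 but got Q2.
abilities = {"Student 1": 2.0, "Student 2": 0.0}

for student, theta in abilities.items():
    for item, b in difficulty.items():
        print(f"{student}: P(correct on {item}) = {p_correct(theta, b):.2f}")
```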

In the past, IRT was commonly used in high-stakes testing environments, but its use in adaptive learning is only now becoming popular as a means of customizing learning experiences based on data gathered about targeted student needs.
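A minimal sketch of how an adaptive system might use IRT-style difficulty estimates follows: after each response the ability estimate is nudged up or down, and the next item delivered is the unused one whose difficulty sits closest to that estimate. The fixed step size is a simplifying assumption; real adaptive engines re-estimate ability with maximum-likelihood or Bayesian methods.

```python
# Simplified adaptive item selection using Rasch-style difficulties: nudge the
# ability estimate after each response and pick the unused item whose
# difficulty is closest to it. The fixed 0.5 step is a simplifying assumption.

item_difficulties = {"Q1": -1.0, "Q2": 0.0, "Q3": 1.0, "Q4": 2.0}

def next_item(theta, unused):
    """Choose the unused item whose difficulty is closest to the ability estimate."""
    return min(unused, key=lambda item: abs(item_difficulties[item] - theta))

theta = 0.0                       # initial ability estimate
unused = set(item_difficulties)
simulated_responses = {"Q2": True, "Q3": True, "Q4": False}  # pretend student answers

for _ in range(3):
    item = next_item(theta, unused)
    unused.remove(item)
    correct = simulated_responses.get(item, False)
    theta += 0.5 if correct else -0.5   # crude update toward the student's level
    print(f"Delivered {item}; correct={correct}; new ability estimate={theta:.1f}")
```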

Criticisms of Item-Based Assessment

While item-based assessment is a very popular assessment type, it is not without its drawbacks. The use of such a standardized (or standardizable) form of assessment, and its prominent place in today's classroom, leads many educators to criticize the emphasis on "teaching to the test," which many argue limits students' opportunities for higher-level learning. Criticism of the use of item-based assessment in standardized testing is commonly associated with the No Child Left Behind (NCLB) Act, which strictly enforces standards and penalizes schools whose students do not fully meet them (National Education Association). Another common argument against item-based testing is that it limits student creativity and reflection, a limitation that can be remedied with portfolio-based assessment, which "enables students to document and track their learning; develop an integrated, coherent picture of their learning experiences; and enhance their self-understanding" (Stanford Center for Innovations in Learning, as quoted in Rickards et al., 34).

Conclusion

Item-based assessment is one of the most common and widely used forms of testing today. With the advances in technology over the past century, item-based assessment has gone from being commonly considered a test of purely lower-level thinking (Hollingworth) to one that can measure process and decision making (The PARCC Assessment). Furthermore, this assessment type has risen to prominence due to its marked efficiency, standardization, and ease of customization and delivery via large (and often cloud-based) item banks (Weiss, 10). As assessment technology continues to improve and the use of adaptive learning technology grows, item-based assessment will hopefully find a balance between efficiency and autonomy that allows it to meet the needs of all types of learners.


References

alejandro. "The Mismeasure of Students: Using Item Response Theory Instead of Traditional Grading to Assess Student Proficiency." N Choose K: The Knewton Developer Blog. Knewton, Inc., 7 June 2012. Web. 19 Sept. 2014.

Binet, Alfred. “New Methods for the Diagnosis of the Intellectual Level of Subnormals.” The Development of Intelligence In Children. 1905 [1916]. n.p. Psych Central. Web. 19 Sept. 2014.

Clarke, Marguerite, et al. “The Marketplace for Educational Testing.” National Board on Educational Testing and Public Policy. Carolyn A. and Peter S. Lynch School of Education, Boston College 2.3 (2001): n.p. Web. 11 Sept. 2014.

Conrad-Curry, Dea. "HELP!! PARCC Testing Acronyms: EBSR/TECR." Partner in Education. 5 Nov. 2013. Web. 18 Sept. 2014.

Gorin, Joanna S. "Test Design with Cognition in Mind." Educational Measurement: Issues and Practice 25.4 (2006): 21-35. Web. 5 Sept. 2014.

Gould, Stephen Jay. The Mismeasure of Man. New York: W.W. Norton, 1981. Web. 5 Sept. 2014.

Hollingworth, Liz, Jonathan J. Beard, and Thomas P. Proctor. "An Investigation of Item Type in Standards-Based Assessment." Practical Assessment, Research & Evaluation 12.18 (2007). Web. 5 Sept. 2014.

Kalantzis, Mary, and Bill Cope. New Learning: Elements of a Science of Education. 2nd ed. Melbourne: Cambridge University Press, 2012. Kindle file.

Macmillan Science Library: Computer Sciences “Optical Character Recognition.” United Nations Statistics Division - Demographic and Social Statistics (2010): 1-2. Web. 19 Sept. 2014.

Magno, Carlo. “Demonstrating the Difference between Classical Test Theory and Item Response Theory Using Derived Test Data.” International Journal of Educational and Psychological Assessment 1.1 (2009): 1-11. Web. 19 Sept. 2014.

National Education Association. “NCLB - The Basics.” National Education Association, 2014. Web. 19 Sept. 2014.

PARCC. "PARCC Grade 3 Item Samples." Partnership for Assessment of Readiness for College and Careers (2013): 1-14. Web. 19 Sept. 2014.

Rickards, William H., et al. "Learning, Reflection, and Electronic Portfolios: Stepping Toward an Assessment Practice." The Journal of General Education 57.1 (2008): 31-50. Web. 11 Sept. 2014.

The PARCC Assessment. Partnership for Assessment of Readiness for College and Careers, 2014. Web. 19 Sept. 2014.

Thompson, Nathan A. “Ability Estimation with Item Response Theory.” St. Paul, MN: Assessment Systems Corporation. (2009). Web. 5 Sept. 2014.

Weiss, David J. “Item Banking, Test Development, and Test Delivery.” The APA Handbook on Testing and Assessment. (2011): 1-20. Web. 11 Sept. 2014.