
Assessment Theory

Project Overview

Project Description

Write a wiki-like entry defining an assessment concept. Define the concept, describe how the concept translates into practice, and provide examples. Concepts could include any of the following, or choose another concept that you would like to define. Please send a message to both admins through Scholar indicating which you would like to choose - if possible, we only want one or two people defining each concept so, across the group, we have good coverage of concepts.


Automated Essay Assessment

Full Essay Assessment with Minimum Grading

As educators, we look for the best methods to assess our students. Assessment takes many forms, each with its own pros and cons. In Foundations of Education and Instructional Assessment/Assessment Strategies/Essays, the authors state that "By utilizing essays as a mean of assessments, teachers are able to better survey what the student has learned. Multiple choice questions, by their very design, can be worked around. The student can guess, and has decent chance of getting the question right, even if they did not know the answer."

To get a true understanding of students' accomplishments or knowledge, essay tests are more commonly used than methods such as multiple choice or fill in the blank; the latter do not demand as much reasoning or give students the chance to discuss a topic or problem. The issue, as stated by Dikli (2006), is that "Revision and feedback are essential aspects of the writing process. Students need to receive feedback to increase their writing quality. However, responding to student papers can be a burden for teachers." In this vein, we need a tool that can offer both feedback, to improve writing style, and assessment of mastery of the skill or topic described.

In the evaluation of Automated Essay Assessment (AEA), success is defined in the following four areas:

  • Keep the value provided by the essays while allowing for larger enrollment without the need for additional instructors.
  • Improve speed of grading to allow for quicker feedback.
  • Ensure academic integrity is upheld throughout the process.
  • Provide valuable feedback to improve students' writing.

AEA is bound by the same rules applied to any assessment tool: validity, fairness, and reliability.
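
In practice, most AEA engines share the same supervised-learning shape: extract numeric features from essays that human raters have already scored, fit a model to those scores, and use that model to score new submissions. The Python sketch below is a minimal illustration of that pipeline using scikit-learn; the essays, scores, and feature choices are invented for the example and do not represent any vendor's actual method.

  # A minimal sketch of the supervised approach most AEA tools share:
  # learn from essays that human raters have already scored, then score
  # new essays with the learned model. Purely illustrative; no vendor's
  # actual features or model are shown. Assumes scikit-learn is installed.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import Ridge
  from sklearn.pipeline import make_pipeline

  # Hypothetical training data: essays paired with human-assigned scores.
  train_essays = [
      "The war began over long-standing economic and political disputes...",
      "i think the war happened because people were mad",
  ]
  human_scores = [4, 2]

  # Turn each essay into word/phrase features, then fit a regression model
  # that maps those features to the human scores.
  scorer = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge())
  scorer.fit(train_essays, human_scores)

  # Score an unseen essay. Validity and reliability are judged by how well
  # predictions like this agree with held-out human scores.
  new_essay = "The conflict grew out of disagreements over trade and territory."
  print(round(scorer.predict([new_essay])[0], 1))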

Origins and Use

Machine grading of essays, later called Automated Essay Assessment, can trace its roots to Ellis Page in 1966. Page created the forerunner of Project Essay Grade, one of the main tools still in use today. The first actual use of computer scoring came in 1997 with a tool called the Intelligent Essay Assessor (IEA).

Statistics on the dissemination and use of AEA are difficult to find. There are case studies, but no district-wide or education-system-wide studies are available at this time. Some of this is due to the proprietary nature of the software, and some to the criticism surrounding the use of AEA.

Measurement Incorporated (MI) provides the following metrics for Project Essay Grade (PEG) at http://www.measurementinc.com/Solutions/AssessmentTechnologies:

  • PEG has been used by MI to provide over two million scores to students over the past five years.
  • PEG is currently being used by one state as the sole scoring method on the state summative writing assessment, and we have conducted pilot studies with three other states.
  • PEG is currently being used in 1,000 schools and 3,000 public libraries as a formative assessment tool. 

Leading Tools

The number of tools available to those who wish to automate essay assessment is growing. The list below shows those that are more developed, in both the proprietary and open source realms.

Current Automated Essay Assessment Tools

Tool                              | Publisher                   | Type        | Notes
eRater                            | Educational Testing Service | Proprietary | https://www.ets.org/erater/about
Intellimetric                     | Vantage Learning            | Proprietary | http://www.mccanntesting.com/products-services/intellimetric/
Project Essay Grade               | Measurement, Inc            | Proprietary | http://www.measurementinc.com/Solutions/AssessmentTechnologies
LightBox                          | LightSide Labs              | Open Source | https://www.getlightbox.com/#/
EASE (Enhanced AI Scoring Engine) | edX                         | Open Source | https://github.com/edx/ease

Capabilities

In order to truly evaluate AEA, it is critical to understand the capabilities these programs have and how they relate to the overall "burden on faculty" that set the need for these tools. For most categories of software we would be able to list the main capabilities each tool offers; this field is in such infancy that these are not fully outlined. To underscore this, in 2012 the Hewlett Foundation (https://www.kaggle.com/c/asap-aes) created a competition in AEA to develop an automated scoring algorithm.

This challenge had three main parts:

  • Challenge developers of automated student assessment systems to demonstrate their current capabilities.
  • Compare the efficacy and cost of automated scoring to that of human graders (see the agreement sketch after this list).
  • Reveal product capabilities to state departments of education and other key decision makers interested in adopting them.
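
Comparisons like the second point are typically reported as an agreement statistic between machine and human scores; the ASAP competition used quadratic weighted kappa, which penalizes large disagreements more heavily than near misses. The short Python sketch below, with invented scores and scikit-learn assumed available, shows how that statistic can be computed.

  # A sketch of how agreement between an automated scorer and human raters is
  # typically quantified: quadratic weighted kappa, a chance-corrected statistic
  # that penalizes large disagreements more than near misses. Scores are invented.
  from sklearn.metrics import cohen_kappa_score

  human_scores   = [4, 3, 2, 4, 1, 3, 2, 4]   # scores assigned by human raters
  machine_scores = [4, 3, 3, 4, 1, 2, 2, 4]   # scores from an automated engine

  kappa = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
  print(f"quadratic weighted kappa: {kappa:.2f}")   # 1.0 = perfect, 0 = chance level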

Because the products above do not conform to a common set of capabilities, and because those capabilities are not fully disclosed, no rubric can readily be applied to assess which tool is best for which application.

While not an exhaustive list or a full compare and contrast, below are the self-described capabilities of each. Each of these lists is derived from the information provided on the websites listed above; many read more like corporate press releases than detailed descriptions of product capabilities.

eRater
  • Natural Language Processing - applies principles of linguistics and computer science to create computer applications that interact with human language
  • Provides a holistic essay score
  • Real-time feedback on grammar, usage, mechanics, style, and development
  • Feedback tailored toward student analysis
Intellimetric
  • Accuracy, consistency, and reliability greater than human expert scoring
  • Web-based tools that are accessible anytime, anywhere
  • Scoring of both short-answer and extended-response questions
  • Holistic and analytic scoring and feedback
  • Scoring capability in more than 20 different languages
  • Detection of non-legitimate essays (a simple illustration follows this list), such as those that:
    • are off topic;
    • are off task;
    • lack proper development;
    • are written in a language other than what was expected;
    • contain bad syntax;
    • copy the question;
    • are inappropriate;
    • contain messages of harm.
  • Immediate feedback that eliminates scoring delays and promotes greater use of data to target instruction
  • Analyzes written prose, calculates more than 300 measures that reflect the intrinsic characteristics of writing (fluency, diction, grammar, construction, etc.), and achieves results that are comparable to the human scorers in terms of reliability and validity.
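
Vendors generally do not disclose how checks such as the non-legitimate essay detection above are implemented. One simple way an off-topic check could work is to compare a submission's vocabulary with the prompt and known on-topic responses, flagging essays that are too dissimilar. The Python sketch below illustrates that idea with TF-IDF cosine similarity; the texts and the cutoff are invented, and this is not Intellimetric's actual method.

  # A simplistic illustration of off-topic detection: measure how similar a
  # submission's vocabulary is to the prompt and a known on-topic response,
  # and flag it if similarity falls below a threshold. Texts and the cutoff
  # are invented; commercial detectors are far more sophisticated.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  reference_texts = [
      "Describe the major causes of the American Civil War.",
      "The war grew out of disputes over slavery, states' rights, and the economy.",
  ]
  submission = "My favorite video game has great graphics and an exciting story."

  # Fit the vocabulary on everything so the vectors are comparable.
  vectorizer = TfidfVectorizer().fit(reference_texts + [submission])
  reference_vectors = vectorizer.transform(reference_texts)
  submission_vector = vectorizer.transform([submission])

  similarity = cosine_similarity(submission_vector, reference_vectors).max()
  threshold = 0.15   # invented cutoff for the example
  status = "possibly off topic" if similarity < threshold else "looks on topic"
  print(f"max similarity {similarity:.2f}: {status}")
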
LightSide
  • Provides assessments based on previously scored example responses to a prompt
  • Learns to value the same things that humans value in that context
  • LightSide’s automated assessment matches human reliability. Just like two humans might grade an essay differently, though, sometimes instructors will disagree with an automated assessment.
  • This doesn’t always result in “good” assessment; LightSide has occasionally worked with organizations where human graders’ scores are correlated closely with word counts – up to 90 percent!
EASE by edX
  • No specific webpage dedicated to the capabilities of this product
  • Built into many courses offered through edX including courses offered by Harvard University

Critics and Proponents

As with any technological advance, there are critics, and Automated Essay Assessment is no exception. In this case there are three major areas of criticism:

  • "the overreliance on surface features of responses, the insensitivity to the content of responses and to creativity, and the vulnerability to new types of cheating and test-taking strategies." - Yang
  • Students' motivation will be lacking as no human will ever read their work.
  • Software is fallible, and even intentionally gibberish essays can earn high scores (a simple illustration of this concern follows the list).
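
The first and third criticisms can be illustrated directly. If a model's score is dominated by surface features such as length (recall LightSide's observation that some human graders' scores correlate closely with word counts), then padding an essay with repeated or irrelevant text inflates its score. The deliberately naive Python sketch below trains a length-only model on invented scores to show the effect; real engines use many more features, but the concern about gaming surface features is the same.

  # A deliberately naive scorer that uses word count as its only feature,
  # to illustrate why models dominated by surface features can be gamed.
  # Training essays and scores are invented for the example.
  from sklearn.linear_model import LinearRegression

  train_essays = [
      "Dogs are loyal.",
      "Dogs are loyal companions that many families enjoy.",
      "Dogs are loyal companions. They offer friendship, protection, and joy, "
      "and caring for them teaches children responsibility and empathy.",
  ]
  human_scores = [1, 2, 4]

  # Fit a regression model on a single surface feature: the word count.
  word_counts = [[len(essay.split())] for essay in train_essays]
  model = LinearRegression().fit(word_counts, human_scores)

  honest = "Dogs are loyal companions that offer friendship and joy."
  padded = honest + " very " * 40   # meaningless padding inflates the length

  for essay in (honest, padded):
      n = len(essay.split())
      print(f"{n:3d} words -> predicted score {model.predict([[n]])[0]:.1f}")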

There are those who say that these programs achieve the main assumptions and goals set out at the beginning of this work: that these tools help reduce the burden on instructors while giving students greater, more reliable, and more immediate feedback, and help improve students' writing. Scott Jaschik (2011) reported on research suggesting that computer scoring can be more consistent than fallible human raters. Similarly, Peter Foltz promoted the idea that these tools give instant feedback that serves as formative assessment for students. In a 2013 New York Times article, Dr. Agarwal, president of edX, said he believed that the software was nearing the capability of human grading. He added, “This is machine learning and there is a long way to go, but it’s good enough and the upside is huge. We found that the quality of the grading is similar to the variation you find from instructor to instructor.”

A 2012 press release from the Automated Student Assessment Prize reported on how automated assessment tools stack up against human evaluation:

"Judges Concluded Project Essay Grade’s Ability To Score Short-Answer Constructed Responses Is Most Similar To Human Graders"

The rise of these programs in high-stakes assessment has even led to petitions to stop the use of automated essay assessment. A humanreaders.org petition calls on legislators and policy makers to "stop mandating essay scores generated by machines to make crucial decisions such as grade promotion, academic placement, graduation, school ranking, school accreditation, or teacher qualification, promotion, and pay".

The Next Steps

There are both proponents and critics of the automated essay assessment movement. As classrooms grow larger and require more writing samples, these tools will continue to be deployed. Because this is a very young area, there are more questions than answers. A few of the most pressing are:

  • Does peer review, which also lessens the burden on faculty, achieve the same or a better effect than machine grading?
  • Does combining faculty assessment with AEA provide more consistent and reliable grading?
  • Even if the results of automated essay assessment are similar to an instructor reading and assessing a work, is there some extra value in a human reading another human's work that makes up for the biases we bring to evaluation? Or is it simply that we are comfortable with human evaluation and not yet comfortable with machine grading?
  • Is there a common set of standards or capabilities that can be agreed on to rate the effectiveness of these tools?

To satisfy proponents and assuage critics, additional research into AEA tools, their effectiveness, and alternative human methods should be completed.

References

  • Automated Student Assessment Prize. (2012). Winners of Competition Announced [Press release]. Retrieved from http://www.measurementinc.com/sites/default/files/ASAP2%20Press%20Release%2010-04-12--Final.pdf
  • Dikli, S. (2006, January 1). Automated Essay Scoring. Retrieved September 10, 2014, from http://files.eric.ed.gov/fulltext/ED494415.pdf
  • Foltz, Peter. "Analysis of student ELA writing performance for a large scale implementation of formative assessment".
  • Foundations of Education and Instructional Assessment/Assessment Strategies/Essays. (n.d.). Retrieved September 10, 2014, from http://en.wikibooks.org/wiki/Foundations_of_Education_and_Instructional_Assessment/Assessment_Strategies/Essays
  • Jaschik, Scott (2011-02-21). "Can You Trust Automated Grading?". Retrieved 2013-04-12. "[ETS researcher Chaitanya] Ramineni said, one of the problems that surfaced in the review was that some humans doing the evaluation were not scoring students' essays on some prompts in consistent ways, based on the rubric used by NJIT."
  • Markoff, J. (2013, April 4). Essay-Grading Software Offers Professors a Break. Retrieved September 23, 2014, from http://www.nytimes.com/2013/04/05/science/new-test-for-computers-grading-essays-at-college-level.html?pagewanted=all&_r=0
  • Valenti, S., Neri, F., & Cucchiarelli, A. (2003). An Overview of Current Research on Automated Essay Grading. Journal of Information Technology Education, 2(2003), 319-330. Retrieved September 10, 2014, from https://www.google.com/url?url=http://scholar.google.com/scholar_url?hl=en&q=http://www.editlib.org/p/111481/article_111481.pdf&sa=X&scisig=AAGBfm09UZchWT1_azuG7tkYWOZuyX5Kgw&oi=scholarr&rct=j&q=&esrc=s&sa=X&ei=IZwQVIzbDoKMyASOnYDwBA&ved=0CCAQgAMoAjAA&usg=AFQjCNFL_HUJduLuMaxpsJmPh0lcC_bCnQ
  • Wood, J. (2013, May 24). Teacher and Automated Essay Scoring (AES)... A Winning Combination? Retrieved September 16, 2014, from http://www.nwea.org/blog/2013/teacher-and-automated-essay-scoring-aes-a-winning-combination/