Texas will use computers to grade written answers on this year’s STAAR tests

Students sitting for their STAAR exams this week will be part of a new method of evaluating Texas schools: Their written answers on the state’s standardized tests will be graded automatically by computers.

The Texas Education Agency is rolling out an “automated scoring engine” for open-ended questions on the State of Texas Assessment of Academic Readiness for reading, writing, science and social studies. The technology, which uses natural language processing, a building block of artificial intelligence chatbots such as GPT-4, will save the state agency about $15 million to $20 million per year that it would otherwise have spent on hiring human scorers through a third-party contractor.

The change comes after the STAAR test, which measures students’ understanding of state-mandated core curriculum, was redesigned in 2023. The test now includes fewer multiple choice questions and more open-ended questions — known as constructed response items. After the redesign, there are six to seven times more constructed response items.

“We wanted to keep as many constructed open ended responses as we can, but they take an incredible amount of time to score,” said Jose Rios, director of student assessment at the Texas Education Agency. In 2023, Rios said TEA hired about 6,000 temporary scorers, but this year, it will need fewer than 2,000.

To develop the scoring system, the TEA gathered 3,000 responses that went through two rounds of human scoring. From this field sample, the automated scoring engine learns the characteristics of responses, and it is programmed to assign the same scores a human would have given.

This spring, as students complete their tests, the computer will first grade all the constructed responses. Then, a quarter of the responses will be rescored by humans.

When the computer has “low confidence” in the score it assigned, those responses will be automatically reassigned to a human. The same thing will happen when the computer encounters a type of response that its programming does not recognize, such as one using lots of slang or words in a language other than English. “We have always had very robust quality control processes with humans,” said Chris Rozunick, division director for assessment development at the Texas Education Agency. With a computer system, the quality control looks similar.

Every day, Rozunick and other testing administrators will review a summary of results to check that they match what is expected. In addition to “low confidence” scores and responses that do not fit in the computer’s programming, a random sample of responses will also be automatically handed off to humans to check the computer’s work.
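The hand-off rules described above (unrecognized responses, "low confidence" scores, and a random sample all go to humans) can be sketched roughly as follows. This is only an illustration of the routing idea, not TEA's actual system: the names `EngineResult` and `route`, the 0.8 confidence cutoff and the 25% audit rate are all assumptions made up for the example, and the article does not say how the quarter of responses rescored by humans is selected.

```python
import random
from dataclasses import dataclass

# Illustrative assumptions only; TEA has not published these values.
CONFIDENCE_THRESHOLD = 0.8  # hypothetical cutoff for a "low confidence" score
AUDIT_RATE = 0.25           # hypothetical share of remaining responses spot-checked


@dataclass
class EngineResult:
    score: int         # score the engine assigned
    confidence: float  # engine's confidence in that score, from 0.0 to 1.0
    recognized: bool   # False if the response is a type the engine does not recognize


def route(result: EngineResult, rng: random.Random) -> str:
    """Decide whether the computer's score stands or a human rescores it."""
    if not result.recognized:
        # e.g. heavy slang or words in a language other than English
        return "human: unrecognized response"
    if result.confidence < CONFIDENCE_THRESHOLD:
        return "human: low confidence"
    if rng.random() < AUDIT_RATE:
        # random sample handed to humans to check the computer's work
        return "human: random audit"
    return "computer score stands"


if __name__ == "__main__":
    rng = random.Random(2024)
    for r in [EngineResult(2, 0.95, True), EngineResult(0, 0.40, True), EngineResult(1, 0.90, False)]:
        print(r, "->", route(r, rng))
```

The order of the checks is a design choice in this sketch: responses the engine cannot handle are pulled out first, so the random audit only samples scores the engine was reasonably sure about.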

TEA officials have been resistant to the suggestion that the scoring engine is artificial intelligence. It may use similar technology to chatbots such as GPT-4 or Google’s Gemini, but the agency has stressed that the process will have systematic oversight from humans. It won’t “learn” from one response to the next, but always defer to its original programming set up by the state.

“We are way far away from anything that’s autonomous or can think on its own,” Rozunick said. But the plan has still generated worry among educators and parents in a world still wary of the influence of machine learning, automation and AI.

Some educators across the state said they were caught by surprise at TEA’s decision to use automated technology — also known as hybrid scoring — to score responses.

“There ought to be some consensus about, hey, this is a good thing, or not a good thing, a fair thing or not a fair thing,” said Kevin Brown, the executive director for the Texas Association of School Administrators and a former superintendent at Alamo Heights ISD.

Representatives from TEA first mentioned interest in automated scoring in testimony to the Texas House Public Education Committee in August 2022. In the fall of 2023, the agency announced the move to hybrid scoring at a conference and during test coordinator training before releasing details of the process in December.

The STAAR test results are a key part of the accountability system TEA uses to grade school districts and individual campuses on an A-F scale. Students take the test every year from third grade through high school. When campuses within a district are underperforming on the test, state law allows the Texas education commissioner to intervene.

The commissioner can appoint a conservator to oversee campuses and school districts. State law also allows the commissioner to suspend and replace elected school boards with an appointed board of managers. If a campus receives failing grades for five years in a row, the commissioner is required to appoint a board of managers or close that school.

With the stakes so high for campuses and districts, there is a sense of uneasiness about a computer’s ability to score responses as well as a human can.

“There's always this sort of feeling that everything happens to students and to schools and to teachers and not for them or with them,” said Carrie Griffith, policy specialist for the Texas State Teachers Association. A former teacher in the Austin Independent School District, Griffith added that even if the automated scoring engine works as intended, “it's not something parents or teachers are going to trust.”

“The automation is only as good as what is programmed,” said Lori Rapp, superintendent at Lewisville ISD. School districts have not been given a detailed enough look at how the programming works, Rapp said.

The hybrid scoring system was already used on a limited basis in December 2023. Most students who take the STAAR test in December are retaking it after a low score. That’s not the case for Lewisville ISD, where high school students on an altered schedule test for the first time in December, and Rapp said her district saw a “drastic increase” in zeroes on constructed responses. “At this time, we are unable to determine if there is something wrong with the test question or if it is the new automated scoring system,” Rapp said.

The state overall saw an increase in zeroes on constructed responses in December 2023, but the TEA said there are other factors at play. In December 2022, the only way to score a zero was by not providing an answer at all. With the STAAR redesign in 2023, students can receive a zero for responses that may answer the question but lack any coherent structure or evidence.

The TEA also said that students who are retesting will perform at a different level than students taking the test for the first time. “Population difference is driving the difference in scores rather than the introduction of hybrid scoring,” a TEA spokesperson said in an email.

For $50, students and their parents can request a rescore if they think the computer or the human got it wrong. The fee is waived if the new score is higher than the initial score.

For grades 3-8, there are no consequences on a student’s grades or academic progress if they receive a low score. For high school students, receiving a minimum STAAR test score is a common way to fulfill one of the state graduation requirements, but it is not the only way.

Even with layers of quality control, Round Rock ISD Superintendent Hafedh Azaiez said he worries a computer could “miss certain things that a human being may not be able to miss,” and that room for error will impact students who Azaiez said are “trying to do his or her best.”

Test results will impact “how they see themselves as a student,” Brown said, and it can be “humiliating” for students who receive low scores. With human graders, Brown said, “students were rewarded for having their own voice and originality in their writing,” and he is concerned that computers may not be as good at rewarding originality.

Julie Salinas, director of assessment, research and evaluation at Brownsville ISD, said she has concerns about whether hybrid scoring is “allowing the students the flexibility to respond” in a way that they can demonstrate their “full capability and thought process through expressive writing.”

Brownsville ISD is overwhelmingly Hispanic. Students taking an assessment entirely in Spanish will have their tests graded by a human. If the automated scoring engine works as intended, responses that include some Spanish words or colloquial, informal terms will be flagged by the computer and assigned to a human so that more creative writing can be assessed fairly. The system is designed so that it “does not penalize students who answer differently, who are really giving unique answers,” Rozunick said.

With the computer scoring now a part of STAAR, Salinas is focused on adapting. The district is incorporating tools with automated scoring into how teachers prepare students for the STAAR test to make sure they are comfortable.

“Our district is on board and on top of the things that we need to do to ensure that our students are successful,” she said.

"...will save the state agency about $15 million to $20 million per year that it would otherwise have spent on hiring human scorers through a third-party contractor."

What is the NET savings, not the gross?

"...will be graded automatically by computers..."

Be careful. The real statement should be: "will be graded automatically by programs written by computer programmers."

If any of them came from Google/Amazon/Apple/Microsoft, then forget it.

OK, now how do I game the answers for the computer?

Just put in important words?

Will it check spelling and grammar?

What if I use AAVE; can it read that?

What if I insult or criticize the computer?

A story from the past. One day a Smart*ss student in my dorm said his English prof never read their papers. So he put some criticism and a statement in his paper: “If you read this, you have my permission to flunk me.”

He claims the prof never read the paper, or that part.

I taught many years ago with typewritten papers and blue books. I know someone who teaches at colleges and we fed student essay responses through ChatGPT to assess them.

It's interesting. You have to train the system to look for what you want. If you do a good job, it's actually fairer than human grading because it doesn't get tired or peeved by crappy answers. Depending on the assessment criteria you give it, it can go from failing almost everyone to giving everyone an A. But you can find a "sweet spot" that seems good. More importantly, you can train it so it "generously" interprets students who try to apply something but not quite correctly. It's actually better at seeing that than we were, because after covering the same subject repeatedly you formulate a template of what a "right" answer should look like and tend to ignore anything outside of that, even if it is mainly or partly applying the concepts correctly.

So you weed out the lazy and the bullshitters more easily but give credit to those who mostly get it right. We also asked it to answer the essay questions: it produces grammatically correct content that reads like an undergrad who scanned the material and bloviates an answer. So spotting ChatGPT-generated responses is actually quite easy.

... and just like that, BIPOC scores went up 25% while Asian and White scores went down 15%.

That can only happen if personal info is fed into the system, which I doubt. Or if it is, it would be easy to expose. It's likely to be fairer than human scoring, which means BIPOC students will probably be graded more appropriately.
