[Congressional Record Volume 147, Number 70 (Monday, May 21, 2001)]
[Extensions of Remarks]
[Pages E863-E866]
From the Congressional Record Online through the Government Publishing Office [www.gpo.gov]




                       CAN TESTERS PASS THE TEST?

                                 ______
                                 

                           HON. BARNEY FRANK

                            of massachusetts

                    in the house of representatives

                          Monday, May 21, 2001

  Mr. FRANK. Mr. Speaker, the House is about to vote on a plan to make 
annual testing of students from grades 3-8 mandatory throughout the 
nation. I hope that no one will vote on that proposal before reading 
the following excellent report on the great difficulties involved in 
implementing a national program of annual testing.

                [From The New York Times, May 20, 2001]

            Right Answer, Wrong Score: Test Flaws Take Toll

             (By Diana B. Henriques and Jacques Steinberg)

       One day last May, a few weeks before commencement, Jake 
     Plumley was pulled out of the classroom at Harding High 
     School in St. Paul and told to report to his guidance 
     counselor.
       The counselor closed the door and asked him to sit down. 
     The news was grim: Jake, a senior, had failed a standardized 
     test required for graduation. To try to salvage his diploma, 
     he had to give up a promising job and go to summer school. 
     ``It changed my whole life, that test,'' Jake recalled.
       In fact, Jake should have been elated. He actually had 
     passed the test. But the company that scored it had made an 
     error, giving Jake and 47,000 other Minnesota students lower 
     scores than they deserved.
       An error like this--made by NCS Pearson, the nation's 
     biggest test scorer--is every testing company's worst 
     nightmare. One executive called it ``the equivalent of a 
     plane crash for us.''
       But it was not an isolated incident. The testing industry 
     is coming off its three most problem-plagued years. Its 
     missteps have affected millions of students who took 
     standardized proficiency tests in at least 20 states.
       An examination of recent mistakes and interviews with more 
     than 120 people involved in the testing process suggest that 
     the industry cannot guarantee the kind of error-free, high-
     speed testing that parents, educators and politicians seem to 
     take for granted.
       Now President Bush is proposing a 50 percent increase in 
     the workload of this tiny industry--a handful of giants with 
     a few small rivals. The House could vote on the Bush plan 
     this week, and if Congress signs off, every child in grades 3 
     to 8 will be tested each year in reading and math. Neither 
     the Bush proposal nor the Congressional debate has addressed 
     whether the industry can handle the daunting logistics of 
     this additional business.
       Already, a growing number of states use these so-called 
     high-stakes exams--not to be confused with the SAT, the 
     college entrance exam--to determine whether students in 
     grades 3 to 12 can be promoted or granted a diploma. The 
     tests are also used to evaluate teachers and principals and 
     to decide how much tax money school districts receive. How 
     well schools perform on these tests can even affect property 
     values in surrounding neighborhoods.
       Each recent flaw had its own tortured history. But all 
     occurred as the testing industry was struggling to meet 
     demands from states to test more students, with custom-
     tailored tests of greater complexity, designed and scored 
     faster than ever.
       In recent years, the four testing companies that dominate 
     the market have experienced serious breakdowns in quality 
     control. Problems at NCS, for example, extend beyond 
     Minnesota. In the last three years, the company produced a 
     flawed answer key that incorrectly lowered multiple-choice 
     scores for 12,000 Arizona students, erred in adding up scores 
     of essay tests for students in Michigan and was forced with 
     another company to rescore 204,000 essay tests in Washington 
     because the state found the scores too generous. NCS also 
     missed important deadlines for delivering test results in 
     Florida and California.
       ``I wanted to just throw them out and hire a new company,'' 
     said Christine Jax, Minnesota's top education official. ``But 
     then my testing director warned me that there isn't a 
     blemish-free testing company out there. That really shocked 
     me.''
       One error by another big company resulted in nearly 9,000 
     students in New York City being mistakenly assigned to summer 
     school in 1999. In Kentucky, a mistake in 1997 by a smaller 
     company, Measured Progress of Dover, N.H., denied $2 million 
     in achievement awards to deserving schools. In California, 
     test booklets have been delivered to schools too late for the 
     scheduled test, were left out in the rain or arrived with 
     missing pages.
       Many industry executives attribute these errors to growing 
     pains.
       The boom in high-stakes tests ``caught us somewhat by 
     surprise,'' said Eugene T. Paslov, president of Harcourt 
     Educational Measurement, one of the largest testing 
     companies. ``We've turned around, and responded to these 
     issues, and made some dramatic improvements.''
       Despite the recent mistakes, the industry says, its error 
     rate is infinitesimal on the millions of multiple-choice 
     tests scored by machine annually. But that is only part of 
     the picture. Today's tests rely more heavily on essay-style 
     questions, which are more difficult to score. The number of 
     multiple-choice answer sheets scored by NCS more than doubled 
     from 1997 to 2000, but the number of essay-style questions 
     more than quadrupled in that period, to 84.4 million from 20 
     million.
       Even so, testing companies turn the scoring of these 
     writing samples over to thousands of temporary workers 
     earning as little as $9 an hour.
       Several scorers, speaking publicly for the first time about 
     problems they saw, complained in interviews that they were 
     pressed to score student essays without adequate training and 
     that they saw tests scored in an arbitrary and inconsistent 
     manner.
       ``Lots of people don't even read the whole test--the time 
     pressure and scoring pressure are just too great,'' said 
     Artur Golczewski, a

[[Page E864]]

     doctoral candidate, who said he has scored tests for NCS for 
     two years, most recently in April.
       NCS executives dispute his comments, saying that the 
     company provides careful, accurate scoring of essay questions 
     and that scorers are carefully supervised.
       Because these tests are subject to error and subjective 
     scoring, the testing industry's code of conduct specifies 
     that they not be the basis for life-altering decisions about 
     students. Yet many states continue to use them for that 
     purpose, and the industry has done little to stop it.
       When a serious mistake does occur, school districts rarely 
     have the expertise to find it, putting them at the mercy of 
     testing companies that may not be eager to disclose their 
     failings. The surge in school testing in the last five years 
     has left some companies struggling to find people to score 
     tests and specialists to design them.
       ``They are stretched too thin,'' said Terry Bergeson, 
     Washington State's top education official. ``The politicians 
     of this country have made education everybody's top priority, 
     and everybody thinks testing is the answer for everything.''


                 The Mistake--When 6 Wrongs Were Rights

       The scoring mistake that plagued Jake Plumley and his 
     Minnesota classmates is a window into the way even glaring 
     errors can escape detection. In fact, NCS did not catch the 
     error. A parent did.
       Martin Swaden, a lawyer who lives in Mendota Heights, 
     Minn., was concerned when his daughter, Sydney, failed the 
     state's basic math test last spring. A sophomore with average 
     grades, Sydney found math difficult and had failed the test 
     before.
       This time, Sydney failed by a single answer. Mr. Swaden 
     wanted to know why, so he asked the state to see Sydney's 
     test papers. ``Then I could say, `Syd, we gotta study maps 
     and graphs,' or whatever,'' he explained.
       But curiosity turned to anger when state education 
     officials sent him boilerplate e-mail messages denying his 
     request. After threatening a lawsuit, Mr. Swaden was finally 
     given an appointment. On July 21, he was ushered into a 
     conference room at the department's headquarters, where he 
     and a state employee sat down to review the 68 questions on 
     Sydney's test.
       When they reached Question No. 41, Mr. Swaden immediately 
     knew that his daughter's ``wrong'' answer was right.
       The question showed a split-rail fence, and asked which 
     parts of it were parallel. Sydney had correctly chosen two 
     horizontal rails; the answer key picked one horizontal rail 
     and one upright post.
       ``By the time we found the second scoring mistake, I knew 
     she had passed,'' Mr. Swaden said. ``By the third, I was 
     concerned about just how bad this was.''
       After questions being field-tested for future use were 
     included, someone at NCS failed to adjust the answer 
     key, resulting in 6 wrong answers out of 68 questions. Even 
     worse, two quality control checks that would have caught the 
     errors were never done.
       Eric Rud, an honor-roll student except in math, was one of 
     those students mislabeled as having failed. Paralyzed in both 
     legs at birth, Eric had achieved a fairly normal school life, 
     playing wheelchair hockey and dreaming of becoming an 
     architect. But when he was told he had failed, his spirits 
     plummeted, his father, Rick Rud, said.
       Kristle Glau, who moved to Minnesota in her senior year, 
     did not give up on high school when she became pregnant. She 
     persevered, and assumed she would graduate because she was 
     confident she had passed the April test, as, in fact, she had.
       ``I had a graduation party, with lots of presents,'' she 
     recalled angrily. ``I had my cap and gown. My invitations 
     were out.'' Finally, she said, her mother learned what her 
     teachers did not have the heart to tell her: according to 
     NCS, she had failed the test and would not graduate.
       When the news of NCS's blunder reached Ms. Jax, the state 
     schools commissioner, she wept. ``I could not believe,'' she 
     said, ``how we could betray children that way.''
       But when she learned that the error would have been caught 
     if NCS had done the quality control checks it had promised in 
     its bid, she was furious. She summoned the chief executive of 
     NCS, David W. Smith, to a news conference and publicly blamed 
     the company for the mistake.
       Mr. Smith made no excuses. ``We messed up,'' he said. ``We 
     are extremely sorry this happened.'' NCS has offered a $1,000 
     tuition voucher to the seniors affected, and is covering the 
     state's expenses for retesting. It also paid for a belated 
     graduation ceremony at the State Capitol.
       Jake Plumley and several other students are suing NCS on 
     behalf of Minnesota teenagers who they say were emotionally 
     injured by NCS's mistake. NCS has argued that its liability 
     does not extend to emotional damages.
       The court cases reflect a view that is common among parents 
     and even among some education officials: that standardized 
     testing should be, and can be, foolproof.


           The Task--Trying to Grade 300 Million Test Sheets

       The mistake that derailed Jake Plumley's graduation plans 
     occurred in a bland building in a field just outside Iowa 
     City. From the driveway on North Dodge Street, the structure 
     looks like an overgrown suite of medical offices with a small 
     warehouse in the back.
       Casually dressed workers, most of them hired for the spring 
     testing season, gather outside a loading dock to smoke, or 
     wander out for lunch at Arby's.
       This is ground zero for the testing industry, NCS's 
     Measurement Services unit. More of the nation's standardized 
     tests are scored here than anywhere else. Last year, nearly 
     300 million answer sheets coursed through this building, the 
     vast majority without mishap. At this facility and at other 
     smaller ones around the country, NCS scores a big chunk of 
     the exams from other companies. What the company does in this 
     building affects not only countless students, but the 
     reputation of the entire industry.
       Inside, machines make the soft sound of shuffling cards as 
     they scan in student answers to multiple-choice questions. 
     Hand-written answers are also scanned in, to be scored later 
     by workers.
       But behind the soft whirring and methodical procedures is 
     an often frenzied rush to meet deadlines, a rush that left 
     many people at the company feeling overwhelmed, current and 
     former employees said.
       ``There was a lack of personnel, a lack of time, too many 
     projects, too few people,'' sighed Nina Metzner, an education 
     assessment consultant who worked at NCS. ``People were spread 
     very, very thin.''
       Those concerns were echoed by other current and former NCS 
     employees, several of whom said those pressures had played a 
     role in the Minnesota error and other problems at the 
     company.
       Mr. Smith, the NCS chief executive, disputed those reports. 
     The company has sustained a high level of accuracy, he said, 
     by matching its staffing to the volume of its business. The 
     Minnesota mistake, he said, was not caused by the pressures 
     of a heavy workload but by ``pure human error caused by 
     individuals who had the necessary time to perform a quality 
     function they did not perform.''
       Betsy Hickok, a former NCS scoring director, said she had 
     worked hard to ensure the accurate scoring of essays. But 
     that became more difficult, she said, as she and her scorers 
     were pressed into working 12-hour days, six days a week.
       ``I became concerned,'' Ms. Hickok said, ``about my ability, 
     and the ability of the scorers, to continue making sound 
     decisions and keeping the best interest of the student in 
     mind.''
       Mr. Smith said NCS was ``committed to scoring every test 
     accurately.''


               The Workers--Some Questions About Training

       The pressures reported by NCS executives are affecting the 
     temporary workers who score the essay questions in vogue 
     today, said Mariah Steele, a former NCS scorer and a graduate 
     student in Iowa City.
       In today's tight labor markets, Ms. Steele is the testing 
     industry's dream recruit. She is college-educated but does 
     not have a full-time job; she lives near a major test-scoring 
     center and is willing to work for $9 an hour.
       For her first two evenings, she and nearly 100 other 
     recruits were trained to score math tests from Washington 
     State. This training is critical, scoring specialists say, to 
     make sure that scorers consistently apply a state's specific 
     standards, rather than their own.
       But one evening in late July, as the Washington project was 
     ending, Ms. Steele said, she was asked by her supervisor to 
     stop grading math and switch to a reading test from another 
     state, without any training.
       ``He just handed me a scoring rubric and said, `Start 
     scoring,' '' Ms. Steele said. Perhaps a dozen of her co-
     workers were given similar instructions, she added, and were 
     offered overtime as an inducement.
       Baffled, Ms. Steele said she read through the scoring guide 
     and scored tests for about 30 minutes. ``Then I left, and 
     didn't go back,'' she said. ``I really was not confident in 
     my ability to score that test.''
       Two other former scorers for NCS say they saw inconsistent 
     grading.
       Renee Brochu of Iowa City recalled when a supervisor 
     explained that a certain response should be scored as a 2 on 
     a two-point scale. ``And someone would gasp and say, `Oh, no, 
     I've scored hundreds of those as a 1,' '' Ms. Brochu said. 
     ``There was never the suggestion that we go back and change 
     the ones already scored.''
       Another former scorer, Mr. Golczewski, accused supervisors 
     of trying to manipulate results to match expectations. ``One 
     day you see an essay that is a 3, and the next day those are 
     to be 2's because they say we need more 2's,'' he said.
       He recalled that the pressure to produce worsened as 
     deadlines neared. ``We are actually told,'' he said, ``to 
     stop getting too involved or thinking too long about the 
     score--to just score it on our first impressions.''
       Mr. Smith of NCS dismissed these anecdotes as aberrations 
     that were probably caught by supervisors before they affected 
     scores.
       ``Mistakes will occur,'' he said. ``We do everything 
     possible to eliminate those mistakes before they affect an 
     individual test taker.''
       New York City did not use NCS to score its essay-style 
     tests; instead, like a few other states, it used local 
     teachers. But like the scorers in Iowa, they also complained 
     that they had not been adequately trained.
       One reading teacher said she was assigned to score eighth-
     grade math tests. ``I said I hadn't been in eighth-grade math 
     class since I was in eighth grade,'' she said.

[[Page E865]]

       Another teacher, she said, arrived late at the scoring 
     session and was put right to work without any training.
       Roseanne DeFablo, assistant education commissioner in New 
     York State, said she thought the complaints were exaggerated. 
     Annual state audits of 10 percent of the tests do not show 
     any major problems, she said, ``so I think it's unlikely that 
     there's any systemic problem with the scoring.''


              The Demand--States Pushing For More, Faster

       Testing specialists argue that educators and politicians 
     must share the blame for the rash of testing errors because 
     they are asking too much of the industry.
       They say schools want to test as late in the year as 
     possible to maximize student performance, while using tests 
     that take longer to score. Yet schools want the results 
     before the school year ends so they can decide about school 
     financing, teacher evaluations, summer school, promotions or 
     graduation.
       ``The demands may just be impossible,'' said Edward D. 
     Roeber, a former education official who is now vice president 
     for external affairs for Measured Progress.
       Case in point: California. On Oct. 9, 1997, Gov. Pete 
     Wilson signed into law a bill that gave state education 
     officials five weeks to choose and adopt a statewide 
     achievement test, called the Standardized Testing and 
     Reporting program.
       The law's ``unrealistic'' deadlines, state auditors said 
     later, contributed to the numerous quality control problems 
     that plagued the test contractor, Harcourt Educational 
     Measurement, for the next two years.
       That state audit, and an audit done for Harcourt by 
     Deloitte & Touche, paint a devastating portrait of what went 
     wrong. There was not time to test the computer link between 
     Harcourt, the test contractor, and NCS, the subcontractor. 
     When needed, it did not work, causing delays. Some test 
     materials were delivered so late that students could not take 
     the test on schedule.
       It got worse. Pages in test booklets were duplicated, 
     missing or out of order. One district's test booklets, more 
     than two tons of paper, were dumped on the sidewalk outside 
     the district offices at 5 p.m. on a Friday--in the rain. Test 
     administrators were not adequately trained. When school 
     districts got the computer disks from NCS that were supposed 
     to contain the test results, some of the data was inaccurate 
     and some of the disks were blank.
       In 1998, nearly 700 of the state's 8,500 schools got 
     inaccurate test results, and more than 750,000 students were 
     not included in the statewide analysis of the test results.
       Then, in 1999, Harcourt made a mistake entering demographic 
     data into its computer. The resulting scores made it appear 
     that students with a limited command of English were 
     performing better in English than they actually were, a 
     politically charged statistic in a state that had voted a 
     year earlier to eliminate bilingual education in favor of a 
     one-year intensive class in English.
       ``There's tremendous political pressure to get tests in 
     place faster than is prudent,'' said Maureen G. DiMarco, a 
     vice president at Houghton Mifflin, whose subsidiary, the 
     Riverside Publishing Company, was one of the unsuccessful 
     bidders for California's business.
       Dr. Paslov, who became president of Harcourt Educational 
     Measurement after the 1999 problems, said that the current 
     testing season in California is going smoothly and that 
     Harcourt has addressed concerns about errors and delays.
       But California is still sprinting ahead.
       In 1999, Gov. Gray Davis signed a bill directing state 
     education officials to develop another statewide test, the 
     California High School Exit Exam. Once again, industry 
     executives said, speed seemed to trump all other 
     considerations.
       None of the major testing companies had bid on the project 
     because of what Ms. DiMarco called ``impossible, unrealistic 
     time lines.''
       With no bidders, the state asked the companies to draft 
     their own proposals. ``We had just 10 days to put it 
     together,'' recalled George W. Bohrnstedt, senior vice 
     president of research at the American Institutes for 
     Research, which has done noneducational testing but is new to 
     school testing.
       Phil Spears, the state testing director, said A.I.R. faced 
     a ``monumental task, building and administering a test in 18 
     months.''
       ``Most states,'' Mr. Spears said, ``would take three-plus 
     years to do that kind of test.''
       The new test was given for the first time this spring.


                The Concern--Life Choices Based on Score

       States are not just demanding more speed; they are 
     demanding more complicated exams. Test companies once had a 
     steady business selling the same brand-name tests, like 
     Harcourt's Stanford Achievement Test or Riverside's Iowa Test 
     of Basic Skills, to school districts. These ``shelf'' tests, 
     also called norm-referenced tests, are the testing equivalent 
     of ready-to-wear clothing. Graded on a bell curve, they 
     measure how a student is performing compared with other 
     students taking the same tests.
       But increasingly, states want custom tailoring, tests 
     designed to fit their homegrown educational standards. These 
     ``criterion-referenced'' tests measure students against a 
     fixed yardstick, not against each other.
       That is exactly what Arizona wanted when it hired NCS and 
     CTB/McGraw-Hill in December 1998. What it got was more than 
     two years of errors, delays, escalating costs and angry 
     disappointment on all sides.
       Some of the problems Arizona encountered occurred because 
     the state had established standards that, officials later 
     conceded, were too rigorous. But the state blames other 
     disruptions on NCS.
       ``You can't trust the quality assurance going on now,'' 
     said Kelly Powell, the Arizona testing director, who is still 
     wrangling with NCS.
       For its part, NCS has thrown up its hands on Arizona. 
     ``We've given Arizona nearly $2 of service for every dollar 
     they have paid us,'' said Jeffrey W. Taylor, a senior vice 
     president of NCS. Mr. Taylor said NCS would not bid on future 
     business in that state.
       Each customized test a state orders must be designed, 
     written, edited, reviewed by state educators, field-tested, 
     checked for validity and bias, and calibrated to previous 
     tests--an arduous process that requires a battery of people 
     trained in educational statistics and psychometrics, the 
     science of measuring mental function.
       While the demand for such people is exploding, they are in 
     extremely short supply despite salaries that can reach into 
     the six figures, people in the industry said. ``All of us in 
     the business are very concerned about capacity,'' Mr. 
     Bohrnstedt of A.I.R. said.
       And academia will be little help, at least for a while, 
     because promising candidates are going into other, more 
     lucrative areas of statistics and computer programming, 
     testing executives say.
       Kurt Landgraf, president of the Educational Testing Service 
     in Princeton, N.J., the titan of college admission tests but 
     a newcomer to high-stakes state testing, estimated that there 
     are about 20 good people coming into the field every year.
       Already, the strain on the test-design process is showing. 
     A supplemental math test that Harcourt developed for 
     California in 1999 proved statistically unreliable, in part 
     because it was too short. Harcourt had been urged to add five 
     questions to the test, state auditors said, but that was 
     never done.
       Even more troubling, most test professionals say, is the 
     willingness of states like Arizona to use standardized tests 
     in ways that violate the testing industry's professional 
     standards. For example, many states use test scores for 
     determining whether students graduate. Yet the American 
     Educational Research Association, the nation's largest 
     educational research group, specifically warns educators 
     against making high-stakes decisions based on a single test.
       Among the reasons for this position, testing professionals 
     say, is that some students are emotionally overcome by the 
     pressure of taking standardized tests. And a test score, 
     ``like any other source of information about a student, is 
     subject to error,'' noted the National Research Council in a 
     comprehensive study of high-stakes testing in 1999.
       But industry executives insist that, while they try to 
     persuade schools to use tests appropriately, they are 
     powerless to enforce industry standards when their customers 
     are determined to do otherwise. A few executives say 
     privately that they have refused to bid on state projects 
     they thought professionally and legally indefensible.
       ``But we haven't come to the point yet, and I don't know if 
     we will, where we are going to tell California--where we sell 
     $44 million worth of business--`Nope! We don't like the way 
     you people are using these instruments, so we're not going to 
     sell you this test,' '' Dr. Paslov said.
       Besides, as one executive said, ``If I don't sell them, my 
     competitors will.''


             The Expectations--Bush Proposal Raises the Bar

       President Bush explained in a radio address on Jan. 24 why 
     he wanted to require annual testing of students in grades 3 
     to 8 in reading, math and science: ``Without yearly 
     testing,'' he said, ``we do not know who is falling behind 
     and who needs our help.''
       While many children will clearly need help, so will the 
     testing industry if it is called upon to carry out Mr. Bush's 
     plan, education specialists said.
       Currently, only 13 states test for reading and math in all 
     six grades required by the Bush plan. If Mr. Bush's plan is 
     carried out, the industry's workload will grow by more than 
     50 percent.
       Ms. Jax, Minnesota's top school official, says she is not 
     close to being ready. ``It's just impossible to find enough 
     people,'' she said. ``I will have to add at least four tests. 
     I don't have the capacity for that, and I'm not convinced 
     that the industry does either.''
       Certainly the industry has been generating revenues that 
     could support some expansion. In 1999, its last full year as 
     an independent company, NCS reported revenues of more than 
     $620 million, up 30 percent from the previous year. The other 
     major players, all corporate units, do not disclose revenues.
       Several of the largest testing companies have assured the 
     administration that the industry can handle the additional 
     work. ``It's taken the testing industry a while to gear up 
     for this,'' said Dr. Paslov of Harcourt. ``But we are 
     ready.''
       Other executives are far less optimistic. ``I don't know 
     how anyone can say that we can do this now,'' said Mr. 
     Landgraf of the Educational Testing Service.
       Russell Hagen, chief executive of the Data Recognition 
     Corporation, a midsize testing company in Maple Grove, Minn., 
     worries that the added workload from the Bush proposal would 
     create even more quality control

[[Page E866]]

     problems, with increasingly serious consequences for 
     students. ``Take the Minnesota experience and put it in 50 
     states,'' he said.
       The Minnesota experience is still a fresh fact of life for 
     students like Jake Plumley, who is working nights for Federal 
     Express and hoping to find another union job like the one he 
     gave up last summer.
       But despite his difficult experience, he does not oppose 
     the kind of testing that derailed his post-graduation plans. 
     ``The high-stakes test--it keeps kids motivated. So I 
     understand the idea of the test,'' he said. ``But they need 
     to do it right.''

     

                          ____________________