Public Good and Private Interest in Educational Evaluation

This is an invited presentation I gave at APA in August 2005…

ABSTRACT

Educational evaluation is by and large a public good. Although evaluation occurs in many fields and in many contexts, supported through many means, the genesis of educational evaluation lies in the stipulations of the Elementary and Secondary Education Act (ESEA), passed in 1965 as part of Lyndon Johnson’s War on Poverty. The ESEA provides federal assistance to schools, communities, and children in need. With current funding of about $9.5 billion annually, the ESEA continues to be the single largest source of federal funding to K-12 schools. Through its many Title programs, and especially Title I, ESEA has been a major force in shaping how and what is taught in schools, as well as the ways those activities are evaluated. In Johnson’s conceptualization of ESEA, educational evaluation was seen to be a public good (just like education and schooling) that should serve the common public good. What I want to illustrate is that although educational evaluation remains a public good (publicly funded), it increasingly serves private interests.

Public Good and Private Interest in Educational Evaluation
Sandra Mathison
University of British Columbia

If we share an apple, we can share a community.

So reads the caption on a poster I received upon completing an evaluation of an accelerated mini-school program in a high school. Over the past six months I have been working with this mini-school and its stakeholders (including students, teachers, parents, school administrators, and graduates) to evaluate this accelerated program for precocious 8th and 9th graders, who complete the grades 8 through 10 curriculum in two years. The school asked me to work with them to do the evaluation. The school had lots of ideas about what they were interested in and how to collect evidence, and some capacity to do so. They wanted a partner with expertise in educational evaluation to provide guidance and technical assistance along the way. The evaluation was stakeholder based; the types of evidence collected evolved as the evaluation proceeded; the students became interested in the process and worked with me to develop some data collection skills and then collect data themselves; and the final report, and where to go from the results, was a collaborative effort among the stakeholder groups.

This is the kind of educational evaluation I do.

A wide range of approaches is used in educational evaluation. But for this work, I did not get paid. The school has no money for evaluation. The school district has no money for evaluation. The evaluation approach the program wanted is neither a priority nor publicly funded.

Here is a quote from the User Friendly Guide to Identifying and Implementing Educational Practices Supported By Rigorous Evidence, published by the US Department of Education:

Well-designed and implemented randomized controlled trials are considered the “gold standard” for evaluating an intervention’s effectiveness, in fields such as medicine, welfare and employment policy, and psychology. This section discusses what a randomized controlled trial is, and outlines evidence indicating that such trials should play a similar role in education. (US Government, 2003)

And so the stage for the current focus on educational evaluation is set by the United States Department of Education. I will return to the current governmental preference for randomized controlled trials for doing evaluation in education later in this talk, but first I want to step further back in history and give you a sense of the ebb and flow of educational evaluation in the United States.

Educational evaluation is by and large a public good. Although evaluation occurs in many fields and in many contexts, supported through many means, the genesis of educational evaluation lies in the stipulations of the Elementary and Secondary Education Act (ESEA), passed in 1965 as part of Lyndon Johnson’s War on Poverty. The ESEA provides federal assistance to schools, communities, and children in need. With current funding of about $9.5 billion annually, the ESEA continues to be the single largest source of federal funding to K-12 schools. Through its many Title programs, and especially Title I, ESEA has been a major force in shaping how and what is taught in schools, as well as the ways those activities are evaluated. In Johnson’s conceptualization of ESEA, educational evaluation was seen to be a public good (just like education and schooling) that should serve the common public good. What I want to illustrate is that although educational evaluation remains a public good (publicly funded), it increasingly serves private interests.

It is important to note that educational evaluation is itself quite a diverse sub-area within evaluation as a discipline and a profession. Educational evaluation may focus on the value, merit, worth or effectiveness of programs, curriculum, teachers, student learning, schools, and school systems. I will talk about educational evaluation generally though and not dwell on any specific focus in educational evaluation.

While the passage of ESEA marks the beginning of the formalization of educational evaluation, one prior event, the Eight Year Study, also played an important role in educational evaluation, although it is more often associated with developments in curriculum theory and design. The Eight Year Study involved 30 high schools dispersed throughout the US and serving diverse communities. Each school developed its own curriculum suited to its community, and each was released from government regulations, as well as from the need for students to take college entrance examinations. With dissension early in the project about how its success should be evaluated, a young Ralph Tyler was brought on board to direct the evaluation, which was funded by the Rockefeller Foundation. Out of the Eight Year Study came what is now known as the Tyler Rationale, the commonsense idea that what students were supposed to learn should determine what happened in classrooms and how evaluation should be done.

Just as with all educational interventions, the intentions and expectations for the thirty participating schools were sometimes vague and value laden, and of course there were many (see Figure 1).

  • The development of effective methods of thinking
  • The cultivation of useful work habits and study skills
  • The inculcation of social attitudes
  • The acquisition of a wide range of significant interests
  • The development of increased appreciation of music, art, literature, and other aesthetic experiences
  • The development of social sensitivity
  • The development of better personal-social adjustment
  • The acquisition of important information
  • The development of physical health
  • The development of a consistent philosophy of life

    Figure 1. Ten major objectives for schools in the Eight Year Study.

Tyler’s evaluation team devised many curriculum-specific tests, helped build each school’s capacity to devise its own measures of context-specific activities and objectives, identified a role for learners in evaluation, and developed data records to serve intended purposes (including descriptive student report cards). All of these developments resonate with conceptual developments in evaluation from the 1970s to the present. The notion of opportunity to learn is related to the curriculum sensitivity of measures; the widespread focus on organizational evaluation capacity building resonates with the Tylerian commitment to helping schools help themselves in judging the quality and value of their work; democratic and empowerment approaches, indeed all stakeholder-based approaches, resonate with the learners’ active participation in evaluation; and the naturalistic approaches to evaluation resonate with the use of behavioral descriptive data.

The Eight Year Study ended in 1941 and was published in five volumes in 1942, an event overshadowed by its unfortunate coincidence with American troops taking an active role in World War II. Nonetheless, Ralph Tyler and the Eight Year Study evaluation staff provided a foundation, whether recognized or not, for future education evaluators.

When ESEA was passed in 1965, the requirement that the expenditure of public funds be accounted for thrust educators into a new and unfamiliar role. Educational researchers and educational psychologists stepped in to fill the need for evaluation created by ESEA. But the efforts of practitioners and researchers alike were generally considered to be only minimally successful at providing the kind of evaluative information envisioned. The compensatory programs supported by ESEA were complex and embedded in the complex organization of schooling. Research methods that focused on hypothesis testing were not well suited to the task at hand.

The late 1960s through the 1980s were the gold rush days of educational evaluation. During this time, models of evaluation proliferated and truly exciting intellectual work was being done, especially in education. Often, quite traditionally trained American educational psychologists experienced epiphanies that directed their thinking toward new ways of doing evaluation. For example, Robert Stake, a psychometrician who began his career at the Educational Testing Service, wrote a paper called “The Countenance of Educational Evaluation,” which reoriented thinking about the nature of educational interventions and what was important to pay attention to in determining their effectiveness. Egon Guba, a well-known educational change researcher, abandoned the research-development-diffusion approach for naturalistic and qualitative approaches that examined educational interventions carefully and contextually. Lee Cronbach, psychometric genius, focused not on the technical aspects of measurement in evaluation but on the policy-oriented nature of evaluation, an idea that led to a radical reconstruction of internal and external validity, including separating the two conceptually and conceptualizing external validity in relation to the usability and plausibility of conclusions, not as a technical feature of research or evaluation design.

While ESEA, now NCLB, is the driving force in what counts as evaluation in education, other developments occurred simultaneously and are important companion pieces for understanding the contemporary educational evaluation landscape. The National Assessment of Educational Progress (NAEP), sometimes referred to as the nation’s report card, was created in the mid-1960s, building on and systematizing a much longer history of efforts to use educational statistics to improve and expand public education. Francis Keppel, the U.S. Commissioner of Education from 1962 to 1965 and a former dean of the Harvard School of Education, lamented the lack of information about the academic achievement of American students:

    It became clear that American education had not yet faced up to the question of how to determine the quality of academic performance in the schools. There was a lack of information. Without a reporting system that alerted state or federal authorities to the need for support to shore up educational weakness, programs had to be devised on the basis of social and economic data…. Economic reports existed on family needs, but no data existed to supply similar facts on the quality and condition of what children learned. The nation could find out about school buildings or discover how many years children stay in school; it had no satisfactory way of assessing whether the time spent in school was effective. (Keppel, 1966, p. 108)

Under the direction of Ralph Tyler (whose intellectual legacy in evaluation is huge, as you may have noted), NAEP was developed as a system to test a sample of students on a range of test items, rather than testing all students with the same items. Thus matrix sampling was created. And, to allay fears that NAEP would be used to coerce local and state educational authorities, the results were initially released for four regions only. NAEP has continued to develop, early on largely with private funding from the Carnegie Corporation, and the early fears of superintendents and professional associations (such as the National Council of Teachers of English) turned out to be well founded: state-level NAEP scores are indeed now available. This shift in the use of NAEP occurred during the Reagan administration, with then Secretary of Education Terrel Bell’s infamous wall chart. With a desire to compare states’ educational performance, indicators available for all states were needed, and NAEP filled that bill. Southern states such as Arkansas, under then-governor Bill Clinton, applauded the use of such comparisons, which would encourage competition, a presumed condition for improvement.
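To make the matrix sampling idea concrete, here is a minimal sketch in Python. It is my own illustration, not NAEP’s actual procedure: the item counts and the rotation scheme are invented. The point is that a large item pool can be split into blocks so that each sampled student answers only a fraction of the items, yet every item is administered about equally often across the sample.

```python
# A toy illustration of matrix sampling (invented numbers, not NAEP's design).
ITEMS = [f"item_{i:02d}" for i in range(60)]          # a pool of 60 test items
BLOCKS = [ITEMS[i:i + 10] for i in range(0, 60, 10)]  # split into 6 blocks of 10

def assign_booklet(student_id: int) -> list:
    """Give each sampled student two blocks, rotating so all blocks are covered."""
    first = student_id % len(BLOCKS)
    second = (student_id + 1) % len(BLOCKS)
    return BLOCKS[first] + BLOCKS[second]

# Each student answers only 20 of the 60 items, but across 600 sampled
# students every item is administered the same number of times.
coverage = {item: 0 for item in ITEMS}
for student_id in range(600):
    for item in assign_booklet(student_id):
        coverage[item] += 1

print(min(coverage.values()), max(coverage.values()))  # 200 200: balanced coverage
```

Population-level estimates can then be built for all 60 items without any single student sitting a 60-item test, which is what made assessing a nation’s achievement feasible.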

During these halcyon years in educational evaluation, much evaluation was publicly funded, primarily by the US Department of Education but also by other agencies such as the National Science Foundation, in addition to many foundations, such as Carnegie, Rockefeller, Ford, and Weyerhaeuser. The dominance of public money and the liberal, progressive political era contributed significantly to the conceptualization of evaluation as a public good. The relatively small number of meta-evaluations conducted during this time focused primarily on whether the evaluation was fair and in the public interest. Two good examples are the meta-evaluation of Follow Through (which thoroughly criticized Alice Rivlin’s planned variation experiment as an evaluation method that did not do justice to the unique contributions of Follow Through models for local communities) and the meta-evaluation of PUSH-Excel, Jesse Jackson’s inspirational youth program that was undone by Charles Murray’s (of Bell Curve fame) evaluation, which failed to consider the program on its own terms in the context of local communities.

The recent reauthorization of ESEA, called No Child Left Behind (NCLB), reinforces the need for evaluation. But unlike the general expectation for evaluation that typified the original ESEA evaluation mandate crafted by Robert Kennedy, NCLB is decidedly prescriptive about how education should be evaluated. While the 1965 authorization of ESEA opened new frontiers and contributed significantly to the discipline of evaluation, NCLB has narrowed the scope of evaluation. Little federal money is spent on educational evaluation, and the burden of evaluation has been shifted to the state and local levels through student testing. NCLB mandates what counts as evaluation (acceptable indicators, what counts as progress, consequences for lack of progress) but provides no funding to carry out the mandate.

The current narrow evaluation focus of NCLB (standardized tests for evaluating student learning and schools) evolved as a result of changes in political values. The current public and governmental neo-liberal sentiment (an ideology shared by Republicans and Democrats) has had major implications for government policies, beginning in the 1970s but increasingly prominent since 1980. Neo-liberalism de-emphasizes government intervention in the economy, focusing instead on achieving progress (including social justice) through free market methods and fewer restrictions on business operations and economic development.

Concerns about and constructions of a crisis in American schools are formulated around constructs such as international competitiveness and worker productivity. In other words, our schools are meant to serve the interests of the economy. A Nation At Risk, published in 1983, was the clarion call for educational reform: “The educational foundations of our society are presently being eroded by a rising tide of mediocrity that threatens our very future as a nation and a people. . . . We have, in effect, been committing an act of unthinking, unilateral educational disarmament.” Although it took a few years, in 1989 President Bush and the state governors called an Education Summit in Charlottesville. That summit established six broad educational goals to be reached by the year 2000. Goals 2000 was signed into law in 1994 by President Clinton. Goals 3 and 4 related specifically to academic achievement and thus set the stage for both what educational evaluation should focus on and how.

    Goal 3: By the year 2000, American students will leave grades 4, 8, and 12 having demonstrated competency in challenging subject matter including English, mathematics, science, history, and geography; and every school in America will ensure that all students learn to use their minds well, so they may be prepared for responsible citizenship, further learning, and productive employment in our modern economy.

    Goal 4: By the year 2000, U.S. students will be first in the world in science and mathematics achievement.

In 1990, the federally funded procedures for moving the country toward accomplishment of these goals were established. The National Education Goals Panel (NEGP) and the National Council on Education Standards and Testing (NCEST) were created and charged with answering a number of questions: What is the subject matter to be addressed? What types of assessments should be used? What standards of performance should be set?

In 1996, a national education summit was attended by forty state governors and more than forty-five business leaders. They supported efforts to set clear academic standards in the core subject areas at the state and local levels, and the business leaders pledged to consider the existence of state standards when locating facilities. Another summit followed in 1999; it focused on three key challenges facing U.S. schools—improving educator quality, helping all students reach high standards, and strengthening accountability—and participants agreed to specify how each of their states would address these challenges. A final summit occurred in 2001, when governors and business leaders met at the IBM Conference Center in Palisades, New York, to provide guidance to states in creating and using tests, including the development of a national testing plan. The culminating event in this series, which began in the early 1980s, was the passage of NCLB.

The heavy hand of business interests and market metaphors in establishing what schools should do and how we should evaluate what they are doing is evident in the role business leaders have played in the education summits. The infrastructure that supports this perspective is broad and deep. The Business Roundtable, an association of chief executive officers of U.S. corporations, and the even more focused Business Coalition for Education Reform, a coalition of 13 business associations, are political supporters and active players in narrowing the evaluation of education to the use of standardized achievement tests.

Since the passage of NCLB, the US Department of Education has funded very little evaluation, because of a much-narrowed definition of what the government now considers good evaluation. A quick search of the US Department of Education website tells us it funded twelve evaluation studies in 2003 and another six in 2004. These are referred to as a “new generation of rigorous evaluations.” As reflected in the quote I used at the beginning of this talk, these evaluations must be randomized controlled trials, or perhaps quasi-experimental or regression discontinuity designs. Few if any educational evaluations have been of this sort; indeed, much of the work since the 1960s has been directed at creating different evaluation methods and models of evaluative inquiry (not just borrowed research methods) that answer evaluative questions: questions about feasibility, practicability, needs, costs, intended and unintended outcomes, ethics, and justifiability.

While neo-liberalism clearly surrounds NCLB in the characterization of education as a commodity, the use of single indicators, and the promotion of market systems to improve the quality of schooling, the connection to the US government’s mandate for randomized controlled trials is a little more tenuous. However, neo-liberalism is characterized by a reliance on specialized knowledge and by silencing, or at least muting, the voices of the populace. Unlike many approaches to evaluation that are built on the inclusion of stakeholders in directing and conducting the evaluation, experimental design is controlled by experts; stakeholders (especially service providers and recipients) become anonymous subjects rather than moral, socio-political actors.

The early 1980s saw heated debates at professional evaluation association meetings about the contributions and limits of experimental design for doing evaluation. A key player in those debates was Bob Boruch, now the principal investigator for the What Works Clearinghouse, an arm of the Institute of Education Sciences, which I will discuss in just a moment. By many accounts, the discipline of evaluation had settled the role of experimental design in evaluation: it was potentially useful, most of the time impractical, and often limited in answering the array of evaluative questions invariably asked. What was not clear was that the deep commitment to experimental design as the sine qua non of evaluation designs was only dormant, waiting for the political fertilizer that would germinate and grow this commitment.

Evaluation foci and methods that encourage private interests

Just as progressivism was the value context up to the late 1970s and even early 1980s, neo-liberalism is the value context that brings educational evaluation to where we are today in the United States. Schools are a business, education is a product, products should be created efficiently, and one should look to the bottom line in making decisions. Implicit in this neo-liberal perspective are values (and rhetoric) that motivate action. The most obvious of these values are that accountability is good, that simple, parsimonious means for holding schools accountable are also good, that choice and competition will increase quality, and that it is morally superior to seek employability. Econometrics drives thinking about what these simple, parsimonious means are—the appeal of single indicators like standardized tests, the concept of value added now promoted for evaluating teachers.
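To make that econometric logic concrete, here is a minimal sketch of the value-added idea in Python, with wholly simulated data (the teachers, scores, and effect sizes are invented; real value-added models are far more elaborate, and far more contested, than this two-step version): regress current scores on prior scores, then treat the mean residual of a teacher’s students as that teacher’s “value added.”

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 3 teachers, 50 students each (illustration only).
teachers = ["A", "B", "C"]
n_per = 50
teacher = np.repeat(teachers, n_per)
pre = rng.normal(500, 50, size=n_per * len(teachers))   # prior-year scores
true_effect = {"A": 0.0, "B": 5.0, "C": -5.0}           # invented teacher "effects"
post = (0.8 * pre + 110
        + np.array([true_effect[t] for t in teacher])
        + rng.normal(0, 20, size=pre.size))

# Step 1: predict this year's score from last year's (simple OLS fit).
slope, intercept = np.polyfit(pre, post, 1)
residual = post - (slope * pre + intercept)

# Step 2: a teacher's "value added" is the mean residual of their students.
for t in teachers:
    print(t, round(residual[teacher == t].mean(), 2))
```

The appeal to a neo-liberal sensibility is obvious: a single number per teacher, produced from tests that are already being administered.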

It is useful to look at two examples that illustrate this neo-liberal focusing of evaluation.

Case example 1: School Matters (Standard & Poor’s) http://www.schoolmatters.com/

School Matters describes its purpose thus:

    SchoolMatters gives policymakers, educators, and parents the tools they need to make better-informed decisions that improve student performance. SchoolMatters will educate, empower, and engage education stakeholders.

It is a product of Standard & Poor’s, which is in turn owned by The McGraw-Hill Companies, one of the biggest producers of educational tests, and it promises to provide, in one convenient location, the following:

  • Student Performance: SchoolMatters analyzes student achievement measures, including national and state test results, as well as participation, attendance, graduation, and dropout/promotion rates.
  • Spending, Revenue, Taxes, & Debt: SchoolMatters analyzes extensive financial data for each school district, along with state and county comparisons.
  • School Environment: SchoolMatters analyzes performance in the context of the learning environment, which includes class size, teacher qualifications, and student demographics.
  • Community Demographics: SchoolMatters analyzes community characteristics, such as adult education levels, household incomes, and labor force statistics. Independent research has shown that these factors affect student achievement.

And the website, which is highly interactive, delivers on this.

  • Indicators that are used (school size, reading scores, math scores, special needs (limited to ELL), teacher-student ratios, ethnicity, income and housing costs)

But there is much about schools and education that School Matters does not deliver, because the information is not considered necessary, or the data are not easily collected, or it does not reflect a narrow conception of the purpose of schools: to prepare skilled workers.

  • Indicators not used (types of school programming, health and fitness, quality of the physical plant, availability of resources such as books, paper, and pencils, attrition rates, proportion of dropouts earning a GED, volunteerism/community involvement)

In addition, decidedly different language is used to discuss factors outside school control versus those within school control. In the former case there are cautions about the importance of parents and communities in academic achievement.

    Research has shown that the education levels and contributions of parents are critical factors that impact a child’s academic performance. To help all students reach their full potential, it is necessary that students, teachers, families, and communities collectively engage in efforts to improve student performance.

The implication here is that parents should get themselves educated and do something to contribute to the improvement of student performance—an essentially moral message to others.

This contrasts with a factor that is within the school’s control, namely class size. When there is a potential change to the school that might improve student performance but at a cost, School Matters advises caution.

    Smaller class sizes may improve student performance in certain settings; for example, research has shown that low-income students in early grades may benefit from smaller classes. Yet, there is less agreement on across-the-board benefits of small classes. Deciding to implement a policy to create smaller classrooms means that more teachers must be hired, and not all communities have a pool of qualified teachers from which to draw.

This selective presentation of research on the benefits of reducing class size serves other purposes—it diverts parents and educators from contemplating changes that might increase costs and therefore threaten the production of educational products at the lowest possible cost. Indeed, S&P generally promotes “improving the return on educational resources” rather than increasing resources.

Case example 2: The Institute of Education Sciences’ What Works Clearinghouse

As I mentioned earlier, the What Works Clearinghouse promises “a new generation of rigorous evaluations” by specifying a single acceptable, desirable evaluation design: the randomized controlled trial. The WWC standards for identifying studies that show what works are based on an examination of design elements only.

    WWC Evidence Standards identify studies that provide the strongest evidence of effects: primarily well conducted randomized controlled trials and regression discontinuity studies, and secondarily quasi-experimental studies of especially strong design.

    “Meets Evidence Standards”—randomized controlled trials (RCTs) that do not have problems with randomization, attrition, or disruption, and regression discontinuity designs that do not have problems with attrition or disruption.

    “Meets Evidence Standards with Reservations”—strong quasi-experimental studies that have comparison groups and meet other WWC Evidence Standards, as well as randomized trials with randomization, attrition, or disruption problems and regression discontinuity designs with attrition or disruption problems.

    “Does Not Meet Evidence Screens”—studies that provide insufficient evidence of causal validity or are not relevant to the topic being reviewed.

    In addition, the standards rate other important characteristics of study design, such as intervention fidelity, outcome measures, and generalizability.
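For those unfamiliar with the design vocabulary in these standards, here is a minimal simulated sketch (in Python; all numbers invented) of the core logic of the randomized controlled trial, the design at the top of the WWC hierarchy: because assignment to the intervention is random, the two groups are comparable on average, so a simple difference in mean outcomes estimates the intervention’s effect.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated trial: 200 students, half randomly assigned to the intervention.
n = 200
treated = rng.permutation(np.array([True] * (n // 2) + [False] * (n // 2)))

# Invented outcomes: baseline achievement plus a 4-point treatment effect.
outcome = rng.normal(250, 15, size=n) + 4 * treated

# The effect estimate is the difference in group means, with a rough
# 95% confidence interval from the two-sample standard error.
effect = outcome[treated].mean() - outcome[~treated].mean()
se = np.sqrt(outcome[treated].var(ddof=1) / (n // 2)
             + outcome[~treated].var(ddof=1) / (n // 2))
print(f"estimated effect: {effect:.2f} ± {1.96 * se:.2f}")
```

The elegance of the design is real, and so are its limits: nothing in this arithmetic speaks to feasibility, cost, ethics, or the unintended outcomes the broader evaluation tradition asks about.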

These two examples demonstrate different but complementary ways in which evaluation as a public good has come to serve private interests. A database for rating schools, created by a market rating firm, implicitly reinforces and naturalizes the neo-liberal, market-oriented values that inform what schooling is and how we evaluate it. And the narrowly defined, highly specialized conception of evaluation design promoted by the Institute of Education Sciences, as manifest in the What Works Clearinghouse, delineates what counts as evaluative knowledge and thus what counts as worth knowing about education and schooling.

These are powerful forces, and evaluators working within public school districts across the nation would agree that they do little “evaluation” any more. They are too busy with the standardized testing of students and with trying to figure out how, if at all, they can conjure up an RCT so as to obtain much-needed money for local evaluation efforts.

There is still educational evaluation that builds on the explosive and exciting period of the 1970s and 80s—I still do it, but I get paid in posters and I am thankful I have a decent paying job as a university professor.

References

Keppel, F. (1966). The necessary revolution in American education. New York: Harper and Row, pp. 108–109.

US Government (2003). Identifying and implementing educational practices supported by rigorous evidence: A user friendly guide. Available online at http://www.ed.gov/rschstat/research/pubs/rigorousevid/index.html

Vinovskis, M. A. (1998). Overseeing the nation’s report card: The creation and evolution of the National Assessment Governing Board. Washington, DC: US Department of Education, National Assessment Governing Board.
