Future of Data in the Social Sciences

by Kevin Milligan

Speaking notes

Talk with data librarians: ACCOLEDS

Vancouver BC

November 27, 2014

“The future of data in the social sciences”

Slides here

---

Introduction

Two news stories this week perfectly capture the tension I want to emphasize in my time with you here this morning.

  • Economist Ross Finnie and his team at the University of Ottawa released an amazing new study on the earnings paths of UofO graduates, tracking multiple cohorts for over a decade after graduation. “Ross Finnie has put together the best new data source in Canadian PSE in a decade,” said Alex Usher, education analyst.
  • Data breach from CRA. The personal tax information of thousands of Canadians—including some very famous ones—was emailed to CBC in violation of these taxpayers’ privacy.

We are now in an era in which large administrative datasets give us previously unimaginable insights into some of the most important questions that social scientists want to answer about our society. In turn, these insights are vitally important for the design of public policy, so that we spend our tax dollars effectively where they are most needed.

But, these datasets can be dangerous. Citizens fear that the private information they trusted will not be handled carefully or will be misused by government officials or researchers. They fear the ‘big brother database.’

Fears of data breaches like this one put all of our research with administrative data at risk. The media, and the public, do a very poor job of discriminating between cases where big administrative data are being used with appropriate care and cases where they are not. With every case like the CRA leak, it gets harder to maintain public support for the research use of administrative data.

I’m here today to argue that you, as data librarians, and we, as social scientist data users, must stake out leadership roles in this debate. There are lots of people willing to stoke the fire of fear about data use. But we, as people who work with data every working day, know better than anyone the great value to society that can be unlocked through data analysis. There is no one better placed to make this case to the public than us. We should, and we must.

Here’s the plan for my talk with you today. Three items on the agenda:

  • First, I’ll let you know a bit more about me, so that you can understand my approach to the questions of data.
  • Second, I’ll identify four big trends that I think are driving the transformation we’re seeing in the data world, away from surveys and toward admin datasets.
  • Finally, I want to lay out a plan of action.

I’m aiming to get through all this smartly, so that we will have some time for discussion as well.

Where am I coming from?

I’m a data user. I did my graduate work in Toronto in economics, focusing on empirical questions of taxation and labour economics. I interacted a lot with UofT data librarian Laine Ruus, who I know was a strong and cherished leader in your community of data librarians. I worked with lots of PUMFs as I got going with my thesis, from the FAMEX to the SHS to the SCF to the Census.

But, my formative time at “Toronto” was actually spent in Ottawa.  I spent two entire summers (and a great deal of time over the winters) working on big administrative datasets in Ottawa. One was housed at the Department of Finance; an internal version of the LAD based on personal T1 taxfiler data. The other project was at Statistics Canada working on something called the Longitudinal Worker File, as well as really raw T4-based earnings records going back to 1971.

From these projects, I got a fantastic education about the features of administrative data.

  • The great value of what can be learned. We had huge sample sizes, with the actual relevant data for the questions we were asking. No guessing, no imputation. We could answer questions decisively.
  • The great challenges of working with such large datasets. Waiting two days for a big SAS merge to run only to find I mis-specified something in my program or forgot to end a command line with the necessary semi-colon. Figuring out why there was missing data for one person in one year, but three observations for that person in another.

I also learned a lot about the work culture of the people who work at Statistics Canada, and how deeply they value the trust that Canadians place in them when they respond to surveys or give permission to merge admin files.

After my graduate work, I moved on to a job in what’s now the Vancouver School of Economics at UBC, working with a tremendous set of colleagues and graduate students who use data every day.  It was my pleasure to work with UBC’s data librarian Mary Luebbe for a decade, and I’m enjoying now working with Darrel Bailie.

I was one of the first users at the RDC at UBC, and for 7 years, I have served as the director of the BCIRDC.

OK... Let’s talk about the data.

Surveys are dying; admin datasets are rising

I’m going to identify four big trends that I think are shaping the changes underway in the data world. Then I’ll talk about the implications.

  1. Response rates are sinking.

The trend in response rates for surveys since the 1990s is sharply down. In the chart, you can see the response rates from the main household expenditure surveys in Canada, the US, the UK, and Australia.

Social scientists who’ve looked at these trends have suggested different explanations.

  • Some cite the ‘Bowling Alone’ phenomenon that arose in our awareness in the 1990s. Data suggest that people seem much less socially engaged than they were in previous generations.
  • Of late, people don’t answer their phones and the kids these days don’t even have landlines. 90% of the calls I get at home are spam—the WestJet call or the ‘cruise ship horn’ call.
  • Trust in government and big institutions has plummeted. People are very wary of what government might do with their data. This was inflamed by the federal government’s execrable decision in the Census debate to fan the flames of fear about how researchers and public servants use data collected from citizens. To my mind, this is the greatest sadness from the loss of the census. The survey itself can be restored by passing a law. But restoring public trust that is now influenced by some partisan flavours will be harder.

Whatever the cause, it is happening.

Of course, we can somewhat correct for the sample response bias using weights derived from the Census.  But, as you all know, we’ve lost the Census. So, our ability to do that kind of weighting is deteriorating.
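
As a rough illustration of the weighting idea above, here is a minimal post-stratification sketch in Python. The age groups, Census shares, and survey responses are all hypothetical numbers made up for the example, and real survey weighting involves much more than this single adjustment.

```python
# Hypothetical Census population shares by age group (assumed, not real figures).
census_share = {"under_35": 0.30, "35_to_64": 0.50, "65_plus": 0.20}

# Hypothetical survey respondents as (age_group, reported_weekly_spending).
# Younger people are under-represented relative to the Census shares above.
respondents = [
    ("under_35", 40.0),
    ("35_to_64", 50.0), ("35_to_64", 54.0), ("35_to_64", 46.0), ("35_to_64", 50.0),
    ("65_plus", 30.0), ("65_plus", 34.0), ("65_plus", 26.0),
    ("65_plus", 30.0), ("65_plus", 30.0),
]


def poststratification_weights(respondents, census_share):
    """Weight each group by (population share) / (share of the sample)."""
    n = len(respondents)
    counts = {}
    for group, _ in respondents:
        counts[group] = counts.get(group, 0) + 1
    return {g: census_share[g] / (counts[g] / n) for g in counts}


def weighted_mean(respondents, weights):
    """Mean of the reported values, with each respondent given their group weight."""
    total_weight = sum(weights[g] for g, _ in respondents)
    return sum(weights[g] * y for g, y in respondents) / total_weight


weights = poststratification_weights(respondents, census_share)
raw_mean = sum(y for _, y in respondents) / len(respondents)
adjusted_mean = weighted_mean(respondents, weights)

print(f"unweighted mean: {raw_mean:.1f}")
print(f"Census-weighted mean: {adjusted_mean:.1f}")
```

In this made-up sample the single under-35 respondent counts three times as much once weighted, which pulls the estimate toward what a representative sample would have shown. Without a reliable Census, the `census_share` numbers themselves become guesses.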

  2. Funding squeeze

Statistics Canada has faced substantial cuts over the past few years. Not just them, but also the policy shops in other departments like ESDC have been cut back severely. Funding for many of the big surveys—I’m thinking the NLSCY here—came in part from other departments like HRSDC. That funding is gone.

Let’s talk politics for a moment, if I may. These changes have been made by the Conservative government and we should hold them to account for the governing choices they’ve made.

But, let’s move to the other side of the House—no one on the other side is talking about increasing taxes either, and they seem to have other big projects in mind for the projected surpluses that have arisen in economic forecasts.

If no one is talking about raising taxes, the funding squeeze on federal government operations will almost surely continue no matter who is elected in 2015. Apparently, that’s what Canadian voters want or else at least one of the parties would take a different tack.

  3. Supply side admin data factors.

The third trend I’d like to highlight is the supply-side factors pushing toward administrative data sets.  There are two factors here:

i. Data storage and dissemination are cheaper and easier.

We’re not dealing with big magnetic tapes any more, and we don’t physically have to move to Ottawa to access data. We have memory keys with terabytes of capacity and we have data pipelines through which we can access data.

ii. Number crunching and processing is much faster.

Earlier I mentioned the two-day-long data sorts and merges I was doing in the 1990s. I’m almost certain I could perform the same sort on a few million observations in minutes rather than days with today’s capacity. This makes the analysis of big administrative datasets feasible.

  4. Rise in methods that use admin data.

The fourth and final trend affecting the data world is the development of, and focus on, methods that require big administrative data sources, at least in economics.

One of these methods is called Regression Discontinuity. It was developed originally by psychologists who were studying education. One of the first papers to use the method was by Thistlethwaite and Campbell, published in the Journal of Educational Psychology in 1960. The graph here shows you how this method works.

The research question was to determine the impact of winning a merit scholarship on education aspirations. Those below the threshold did not get the scholarship; those above the threshold did.

Now, there is clearly a pre-existing relationship between grades and further education aspirations. That can be seen in the upward-sloping graph.  The magic of this RD method, though, comes from the discontinuity in that relationship at the point where the merit scholarship is awarded. We can use that discontinuity to infer the impact of the scholarship.
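
To make the idea concrete, here is a minimal sketch of an RD estimate in Python on simulated data. Everything in it is an assumption for illustration: the cutoff of 60, the true jump of 5, the noise level, and the bandwidth are invented, and fitting a separate straight line on each side of the cutoff is one common modern approach rather than the exact procedure used in the 1960 paper.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rd_estimate(scores, outcomes, cutoff, bandwidth):
    """Jump in the fitted outcome at the cutoff, using one line per side.

    Scores are re-centred on the cutoff, so each fitted intercept is the
    predicted outcome exactly at the threshold.
    """
    pairs = list(zip(scores, outcomes))
    below = [(s - cutoff, y) for s, y in pairs if cutoff - bandwidth <= s < cutoff]
    above = [(s - cutoff, y) for s, y in pairs if cutoff <= s <= cutoff + bandwidth]
    intercept_below, _ = fit_line(*zip(*below))
    intercept_above, _ = fit_line(*zip(*above))
    return intercept_above - intercept_below


# Simulated (hypothetical) data: aspirations rise smoothly with the test
# score, plus a jump for those at or above the scholarship cutoff.
random.seed(0)
CUTOFF = 60.0
TRUE_JUMP = 5.0
scores = [random.uniform(0, 100) for _ in range(20000)]
outcomes = [
    20 + 0.3 * s + (TRUE_JUMP if s >= CUTOFF else 0.0) + random.gauss(0, 2.0)
    for s in scores
]

estimate = rd_estimate(scores, outcomes, CUTOFF, bandwidth=10.0)
print(f"estimated jump at the cutoff: {estimate:.2f} (true value {TRUE_JUMP})")
```

Only observations within the bandwidth of the cutoff enter the estimate, which is why the method needs so many observations close to the threshold.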

This kind of method was adopted in economics about a decade ago and has quickly become one of the core tools used by empirical economists.

What kind of demands on the data does this method make?

  • You need the exact test scores—you would lose a lot of accuracy if you relied on self-reports.
  • It is data-hungry—you need a lot of observations close to the point of discontinuity. As an example, in the NLSCY you get about 200 births per month. This is simply not sufficient to use birthdate-based RD strategies. With big admin datasets, you can get thousands born on any birthdate.

Implications:

The key implication coming out of these trends is that there is a large shift going on away from survey data toward admin data. I have some evidence of this.

In the United States, there is an economic research hub called the National Bureau of Economic Research. It is run for academics and by academics; it can mostly be thought of as a research centre where profs can book their grants, house and pay their research assistants, and disseminate their research.

Every summer, the NBER holds a conference called the Summer Institute. The different research groups meet over several days and see presentations of the latest and greatest research in their areas. This is really one of the premier conferences in economics—the room is stuffed with professors from top 10 schools and the editors of many top journals are there.

Over the past two years I’ve taken count of the number of papers that use surveys, administrative data, or data from experiments. Here’s what I found.

There were 19 studies in 2013 and 16 in 2014 that used administrative data. Only 2 in 2013 and 4 in 2014 used survey data.

In 2014, two of the papers that used survey data were methods papers, developing and testing methods that might be applied rather than doing applied research themselves. In short, the market share for surveys at the top echelons of empirical economics has fallen to almost zero.

Why does this matter?

I think this matters a lot, for three reasons.

First, much of our whole data system is still built for surveys. From the DLI to our physical infrastructure in RDCs to our security protocols to how we do graduate student training. In all of these areas, we are not ready for a large shift toward administrative data.

Second, if Canada doesn’t keep up with this trend, we will fall behind. In the NBER Summer Institute and in top journals in recent years, the number of papers on Scandinavian data has skyrocketed. American academics didn’t suddenly get a taste for dried fish—these papers are being published because of the awesome power of the administrative register datasets that are available to Scandinavian researchers. We are losing market share. This matters because this will make it harder to attract and retain the best researchers and graduate students. If we want world-leading social scientists in Canada, they need access to world-leading tools.

Third, we will fall behind on what social science can contribute to society. Think back to the Ross Finnie paper on post-secondary education that I started with at the beginning of the talk. What wonderful insights come from that analysis! Without continued and secure access to administrative data the Canadian public will lose out on the kind of insights we can get from that kind of work. These are insights worth paying for.

Plan for the future

Those are the challenges faced by those of us who work with data every day. What should we do? I propose a three-pronged plan of action, all focused on redeveloping things that have been lost. I chose ‘redevelop’ over ‘restore’ or ‘rebuild’ because I think our efforts should be looking forward, not backward. We do not have a time machine; we cannot just press pause or put things back as they were in 2009, 1999, or whatever year you think was best. The world has changed; we should incorporate those changes as we redevelop our data infrastructure.

1. Redevelop trust: a Charter of Data Practices

To work on redeveloping the public’s trust in data, I think we need a Charter of Data Practices.

Think of the trust problem from a layperson’s point of view. They hear there is a big database in which their personal information sits. This kind of database can lead to large-font headlines on news websites about the security of private information. It is too hard for a layperson to tell whether the data in question actually compromise their information or are more innocuous.

We know, as data professionals, that with certain precautions, confidentiality can be maintained.

  • We can ensure no names or addresses are in the data; identifiers are fully anonymized.
  • We can take physical security precautions, use encryption, and take care about who accesses the data and how.
  • We can ensure there is a proper data management protocol for how data are stored and handled after the research project is complete.

With these kinds of practices, we can maintain a healthy balance between researcher needs and individual confidentiality.

Is there a way that we can communicate how we arrive at this balance to citizens and stakeholders?

If there were some kind of standard protocol available, perhaps through a Charter of Data Practices, these communications could be improved. If citizens had confidence in the Charter, they could simply invest their trust in projects that met the Charter protocols without having to investigate the details of every big-font headline news story.

The Charter of Data Practices could be developed by experts in consultation with all the stakeholders to find a balance between security and privacy concerns and the real benefits that come from research. It could be funded and driven by government, or it could be a purely non-governmental initiative. The goal is clear: forge social consensus about where that balance between research and confidentiality should be.

Such an effort in Canada may just duplicate some international efforts about which I am unaware, but we could certainly draw on the experiences of other countries.

But I think having such a Charter would allow us to improve on the status quo, where every admin data use needs to be evaluated and protocols developed de novo.

2. Redevelop the funding

The second proposal is to redevelop the support for funding data. Political parties don’t seem to be putting large efforts into making the case that good data needs real dollars. There is a good case to be made, and we should make it.

We need to do that with caution, though. One approach is to channel outrage, organize demonstrations, and use the tools of protest to fight for change. These tools have a role in society, but I must admit they don’t come naturally to me. I think they energize those who are already committed, but I’m not convinced how much they persuade those in the middle to change their minds—and in my view that should be the target.

We have a model for this. Armine Yalnizyan of the Canadian Centre for Policy Alternatives was instrumental in putting together the coalition that pushed back against the changes to the Census. These efforts could have simply become a left-wing coalition fighting against a Conservative government initiative. Instead, Armine put tremendous effort into reaching out to others in Canadian society, from business groups to big banks to more pro-market think tanks like the Canada West Foundation and the C.D. Howe Institute. Ultimately, we were not successful in stopping the killing of the Census. But in my view Armine played her cards in exactly the right way.

In my experience, there are good people in every caucus in Ottawa. MPs who care deeply about their communities; care about facts, research, and evidence.  These voices don’t always prevail in policy decisions, but I know they are there. In making the case for redeveloping funding for data, I believe we need to make a case that speaks to people of all political—and non-political—persuasions.

3. Redevelop the Census

The census is at the core of our data universe. We use it to build weights for the surveys we have. We use it to evaluate the representativeness of the admin data we have. It is the ultimate anchor for nearly everything we do with data.

We need it back.

We need it back, but I think we should allow ourselves to spend time thinking about whether we simply restore it in its 2006 form, or whether we look forward to how we might do it better. The central core that makes the census ‘the census’ is its mandatory coverage. That is the foundation. What we put on top of that, though, should be open to redevelopment.

In the UK, the Cameron Conservative-coalition government initiated an expert panel through the Office for National Statistics with the name ‘Beyond 2011’ to review how the 2021 Census in the UK might take form. Note that the planning process started 10 years in advance; these kinds of changes can’t be made on the fly. They were interested in thinking about how they might use administrative data, from public or private sources, in combination with or in place of traditional census-taking. I think this was an innovative and productive question to ask.

The Canadian government took some heat a year or two ago about a data project that tried to use information from the website Kijiji to learn about labour markets. That project was not successful, and this continues to be fodder for government critics.

I think some of this criticism is misplaced. I think we should be thinking about ways of incorporating new data sources and new techniques into our data collection system. Now, it’s important to work hard to make sure the information you get out of those sources is useable, but the effort itself should not be mocked.

The main recommendations coming out of that expert panel in the UK were to a) make greater use of existing government admin data on incomes and other information in concert with the survey instrument and b) make greater use of internet-based survey-taking methods.  As it turns out, those are both things Statistics Canada was already doing with the Census in Canada in 2006. But, I think this kind of process is one we should emulate as we push to redevelop the Canadian census.

I’d be remiss if I didn’t pay strong tribute to the efforts of Kingston and the Islands MP Ted Hsu. Dr. Hsu is sponsoring a Private Member’s Bill, C-626, to restore the long-form census. While this bill is unlikely to pass in the current parliament, his efforts to keep the Census on the front burner deserve our thanks.

Conclusion

We as social scientists have a lot to contribute to Canadian society. We can’t do that without the help of everyone who produces the data, from the survey-designers right through to the expert data librarians who help researchers find the data that meets their needs.

The trends afoot in the data world are pushing economists—and I expect other social scientists too—toward greater use of administrative data. We are not properly prepared for this change, but we need to adjust to it.

I’ve proposed a three-point plan to redevelop data for the social sciences: redevelop trust, funding, and the census. I’m very curious to hear your thoughts. Thank you.