Tuesday, August 10, 2010

Randomised field experiments

by Ajay Shah.

In recent years, many economists have been attracted by the possibility of obtaining better knowledge using randomised experiments, which are termed the `gold standard' for empirical analysis. I have long been skeptical about this approach, for three reasons:

  1. Reality is a complicated nonlinear function in many dimensions. Each randomised experiment illuminates the gradient vector in one small region, so it is hard to generalise the results (i.e. low external validity); a stylised statement of this point follows the list.
  2. I am quite worried about the bang for the buck obtained through this strategy. A lot of money is spent which could instead fund dataset creation or other research.
  3. Economics fares badly on standards of replication. The journals do not publish replications, even though replication is the foundation of science. Randomised experiments too often generate proprietary datasets controlled by the original authors, so the scientific progress that comes from multiple scholars working on common datasets does not come about easily.
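
To make the first point concrete, here is a stylised formulation (my own notation, not drawn from any particular study): write the outcome as $y = f(x, t)$, where $t$ is the treatment and $x$ is a vector of contextual covariates. An RCT run in a context where $x \approx x_0$ identifies the average treatment effect at that point:

$$\tau(x_0) = E\left[f(x_0, 1) - f(x_0, 0)\right],$$

which is, roughly, the local gradient of $f$ in the treatment direction at $x_0$. When $f$ is nonlinear in $x$, nothing guarantees that $\tau(x_1)$ has the same magnitude, or even the same sign, at some other context $x_1$. That, in one line, is the external validity problem.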

Jim Manzi has a great article on the difficulties of obtaining knowledge about social science questions. He tells the story of a field -- Criminology -- which experienced the Randomised Experiment Revolution in the 1980s:

In 1981 and 1982, Lawrence Sherman, a respected criminology professor at the University of Cambridge, randomly assigned one of three responses to Minneapolis cops responding to misdemeanor domestic-violence incidents: they were required to arrest the assailant, to provide advice to both parties, or to send the assailant away for eight hours. The experiment showed a statistically significant lower rate of repeat calls for domestic violence for the mandatory-arrest group. The media and many politicians seized upon what seemed like a triumph for scientific knowledge, and mandatory arrest for domestic violence rapidly became a widespread practice in many large jurisdictions in the United States.

But sophisticated experimentalists understood that because of the issue's high causal density, there would be hidden conditionals to the simple rule that `mandatory-arrest policies will reduce domestic violence.' The only way to unearth these conditionals was to conduct replications of the original experiment under a variety of conditions. Indeed, Sherman's own analysis of the Minnesota study called for such replications. So researchers replicated the RFT six times in cities across the country. In three of those studies, the test groups exposed to the mandatory-arrest policy again experienced a lower rate of rearrest than the control groups did. But in the other three, the test groups had a higher rearrest rate.

...

Criminologists at the University of Cambridge have done the yeoman work of cataloging all 122 known criminology RFTs with at least 100 test subjects executed between 1957 and 2004. By my count, about 20 percent of these demonstrated positive results: that is, a statistically significant reduction in crime for the test group versus the control group. That may sound reasonably encouraging at first. But only four of the programs that showed encouraging results in the initial RFT were then formally replicated by independent research groups. All failed to show consistent positive results.

I am all for more quasi-experimental econometrics applied to large datasets, to tease out better knowledge by exploiting natural experiments. By using large panel datasets, with treatments spread across space and time, I feel we gain greater external validity. And, there is very high bang for the buck in putting resources into creating large datasets which are used by the entire research community, with a framework of replication and competition between multiple researchers working on the same dataset.
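
As a sketch of what I have in mind (this is the standard two-way fixed-effects difference-in-differences specification, nothing more): with a panel of units $i$ observed over time $t$, and a policy $D_{it}$ that switches on in different places at different dates, one estimates

$$y_{it} = \alpha_i + \gamma_t + \beta D_{it} + \varepsilon_{it},$$

where the unit effects $\alpha_i$ absorb time-invariant differences across places, the time effects $\gamma_t$ absorb common shocks, and $\beta$ is identified from the variation in treatment across space and time. Because the underlying panel is a public dataset, any researcher can re-estimate and contest $\beta$.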

You might like to see a column in the Financial Express which I wrote a few months ago, telling the story of an interesting randomised experiment. In this case, two difficulties concerned me. First, the allocation to treatment/control was not randomised: there was selectivity. Second, it struck me as very poor bang for the buck. Very large sums of money were spent, and I can think of myriad ways to spend that money on datasets or research in Indian economics which would yield more knowledge.

4 comments:

  1. I share the concern that there are poorly done studies that cost a lot of money and yield little by way of new insights. However, I would not go so far as to label RCTs themselves as not worthwhile to pursue. In all economic phenomena there is an enormous amount of noise, and signal extraction is a real challenge -- this is as true of financial markets as it is of economic development. I think RCTs are a very interesting tool for doing this carefully.

    Now, what one does with the signal, or how the experiment is set up and the theory behind it, are all questions that deserve the same amount of thinking as in the case of any other research project. If one asks a poorly thought-out question, one is likely to get some pretty useless results -- RCTs are not unique in that respect.

    Professor Rohini Pande has, for example, carried out an RCT that helps us look "beneath the hood" of the whole "black magic" of zero defaults within JLGs. The insight that social networks are actually formed / strengthened during weekly meetings, and are not necessarily pre-existing, is a fascinating one and of immense practical value. I can't imagine how else one would have reached this conclusion except through an intervention design.

  2. Nachiket,

    I'm not discussing the poorly done studies. I'm discussing the best studies. The 3 problems:

    * Low external validity

    * Low bang for the buck

    * Lack of replication

    afflict the best studies.

    The Criminology story is quite pertinent. How do you know that a few other RCTs for the defaults-in-JLG question will not yield a different answer?

    As I say in the post, reality is a complicated nonlinear function in many dimensions. Each properly executed RCT gives the correct answer for the local gradient at a certain point on that function. But the answer lacks external validity: it is hard to generalise much, given the narrowness of the RCT.

    And, these things are hideously expensive. For a counter-example, I will say that Indian economics was vastly improved by the resources expended on building NFHS. This became a standard dataset used by hundreds of researchers, with the discipline and competition which comes from multiple people looking at the same dataset. There are fewer programming mistakes in papers which use standard datasets.

    I am all for a more experimental approach to econometrics, but my sense of the way forward is to push into quasi-experimental studies based on large panel datasets. This has more external validity, gives better bang for the very scarce buck, and generates the correct incentives within the research community on the key issues of replication and competition. You have to be more intelligent when you work with NFHS, because 1000 other researchers have the same dataset. The RCT space is afflicted by too little thinking, and by too much fund-raising coupled with administrative capability in field execution.

    The great empirical successes of economics lie in big publicly visible datasets - finance and labour economics come to mind.

  3. Great thoughts!

    A more general thought from me. I am wondering if you have ever covered the state of economics education in India. When I was young, economics as a subject was barely covered in school, and 'the best and the brightest' almost never chose to study economics. In fact, I see IITians and the like manning, and at the helm of, various sectors that have nothing to do with engineering but a lot to do with economics.

    Maybe things have changed now or I hope so.

