Copyright Sociological Research Online, 2003


Stephen Gorard (2003) 'Understanding Probabilities and Re-Considering Traditional Research Training'
Sociological Research Online, vol. 8, no. 1, <>

To cite articles published in Sociological Research Online, please reference the above information and include paragraph numbers if necessary

Received: 7/1/2003      Accepted: 30/1/2003      Published: 28/2/2003


Social science is facing increasing demands for research involving 'quantitative' approaches. Among these are the need, expressed by policy-makers, for practical evidence about what works, and the demand, exemplified by the new ESRC guidelines for research training, that all researchers learn something about techniques of analysis involving numbers. At the same time, however, traditional 'quantitative' approaches are facing a major upheaval caused by growing criticism of null hypothesis significance tests (NHSTs), the increasing availability of high quality numeric datasets, and the development of more and more complex forms of statistical analysis. This paper shows how a re-consideration of the nature and function of probabilities (or uncertainties) in research suggests a new approach to research training that will be more appropriate than traditional courses on statistics for all learners, and that will help researchers explain their findings to policy-makers.

Capacity-building, political arithmetic, significance testing, statistics


This paper is intended as a contribution to the theme introduced by the editors in the last issue (Crow and Ray 2002), in which what they term 'an era of radical restructuring of universities and the wider research environment... has profound consequences for the training of researchers'. This paper addresses some of these consequences for training. In particular, the paper argues that in trying to avoid what Mills (1959, p.33) called the 'useless heights' of grand theory there is a danger that current moves towards a social science of evidence-informed policy-making will inadvertently encourage 'abstracted empiricism' (p.51) instead. Or, more properly, it argues that an overly technical empiricism has many of the same negative attributes for the development of social science as grand theory does. Complex statistical approaches, like grand theory, have an important, if limited, role to play in research. In practice, however, their advocates tend to celebrate complexity per se rather than the cumulation of knowledge. Their work is therefore incomprehensible to many readers, and is appreciated, and typically refereed, by a small group of like-minded peers. Examples, mostly from public policy studies, can be found in Gorard (2000) and Gorard (2003).

A New Climate for Research?

Over the last decade, the value and effectiveness of many areas of social science research have been increasingly called into question in many developed countries, including the USA (e.g. NERPP 2000, Resnick 2000) and the UK (e.g. Lewis 2001, Tooley and Darby 1998). This crisis of confidence is not confined to public policy research (Pirrie 2001). Indeed it is currently characteristic of the relationship between the majority of professions and research, and there have been similar comments about the conduct of research in many public services (Dean 2000). Put simply, it seems to external observers, and some internal ones as well, that 'too many... researchers produce second rate work, and there are, for the most part, too few checks against this occurring' (Evans 2002, p.44).

Of course, despite their public appeal, the evidence base for these criticisms is often weak, but they are general and strident enough for us to have to examine the quality of our own research. Part of the problem is an apparent system-wide shortage in expertise in large-scale studies, especially field trials derived from laboratory experimental designs. Over the last twenty years, there has undoubtedly been a move towards much greater use of 'qualitative' approaches (Hayes 1992), even in traditionally numerate areas of research (Ellmore and Woehilke 1998). In addition, acceptance rates for 'qualitative' publications are higher than for 'quantitative' pieces, by a ratio of around two to one in one US journal (Taylor 2001). There is a danger therefore of applying different standards of rigour to studies depending on their method and, presumably, on their referees. In some important fields, the nineties were dominated by generally small-scale funding leading to predominantly qualitative thinking (McIntyre and McIntyre 2000).

Social science is, therefore, facing increasing demands for research involving 'quantitative' approaches. Among these are the need, expressed by policy-makers, for practical evidence about what works, and the demand, exemplified in the UK by the new ESRC guidelines for research training, that all researchers learn something about techniques of analysis involving numbers (Sooben 2002). Similar sentiments have been expressed in other developed countries (e.g. Diamond 2002). At the same time, however, traditional 'quantitative' approaches are facing a major upheaval caused by growing criticism of null hypothesis significance tests (NHSTs), the increasing availability of high quality numeric datasets, and the development of more and more complex forms of statistical analysis. This paper shows how a re-consideration of the nature and function of probabilities (or uncertainties) in research suggests a new approach to research training that will be more appropriate than traditional courses on statistics for all learners, and that will help researchers explain their findings to policy-makers. In arguing this, it does not seek to privilege work involving numbers. Rather, it should be seen as part of a more general move towards the development of mixed methods approaches (Gorard 2002). A key element of these mixed approaches is a greater understanding of the relatively simple use of numbers in research.

The Simple Role of Numbers

Not everyone understands why prices are as they are in shops, but everyone can check their change. A similar situation arises in research. Not all researchers will want, or be able, to conduct complex statistical analyses. Indeed, there is no need for them to be able to do so. But this is very different from an appreciation of the simple role of numbers in research. This paper is intended to show that confidence in dealing with numbers can be improved simply by learning to think about them, and to express them more naturally. The paper also shows that a variety of important situations, relevant to my own field of educational research, are being routinely misunderstood (although the examples themselves are taken from a variety of fields, especially health research, where there has been a stronger 'quantitative' tradition in the UK). Ignorance of false positives in diagnosis, the abuse known as the prosecutor's fallacy, the politician's error, missing comparators, pseudo-quantification and many other symptoms of societal innumeracy have real and tragic consequences. They are part of the reason why no citizen, and certainly no researcher, should be complacent enough to say 'I don't do numbers'.

The paper also shows that much of 'statistics', as traditionally taught, is of limited use in much current educational research. If research does not use a random sample then significance tests and the probabilities they generate are meaningless. It is, therefore, pointless for all researchers to learn a great deal about such statistics. This is especially so given that there is such general misunderstanding of the much simpler role of numbers in research. Most of the data we generate in our research is approximate and rough, and even the most elaborate statistical treatments cannot overcome this. In the long term, what we need, therefore, is to create better datasets, preferably through active approaches such as experiments, and to ensure that our treatment of these datasets is first of all simple and robust. 'One of the most tempting pitfalls... is to use over-sophisticated techniques' largely because the computer makes it so easy (Wrigley 1976, p.4). The most complicated statistical approaches cannot overcome irrationality, lack of comparators, or pseudo-quantification. Everyone needs to know more about numbers to overcome problems of reporting (such as missing denominators, or unspecified averages). 'I am not talking here about high level statistical skills... [but] concerned more with data and data quality than endless statistical tests' (Plewis 1997, p.10). Perhaps it is time for some of those who use complex techniques somewhat mindlessly to re-assess the simple role of numbers, as well as time for those who refuse to use numbers at all to find out what they are missing.

With a good dataset, analysis is usually rather easy. For example, given data for a population there are no probabilistic calculations necessary, no tests of significance, no confidence intervals, and so on. Similarly, running a simple experiment, for example, with two large randomly selected groups and one outcome measure means that all you need to do is compare the scores of one group with the scores of the other in order to produce an effect size. The key is to have the comparator to make the comparison with - an obvious point perhaps but one that appears to have eluded several writers, in the field of UK public policy research especially. Without a suitable comparison we must end up with an incomplete consideration of the plausible alternative explanations for any observed phenomena.
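To make this concrete, the comparison just described can be sketched in a few lines of code. The scores below are invented for illustration, and dividing by the standard deviation of the combined scores is only one of several conventions for producing an effect size:

```python
# A minimal sketch of comparing two randomly selected groups on
# one outcome measure. The scores are hypothetical.
from statistics import mean, stdev

treated = [52, 61, 58, 64, 55, 60, 57, 63]
control = [50, 54, 49, 57, 51, 53, 48, 55]

# The raw difference in mean scores between the two groups.
difference = mean(treated) - mean(control)

# A standardised effect size: the difference scaled by the spread
# of all the scores combined (one simple convention among several).
effect_size = difference / stdev(treated + control)

print(round(difference, 2), round(effect_size, 2))
```

With a good design, this really is the whole analysis: a difference and an indication of its scale, with no probabilistic machinery required.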

The Logic of Statistical Tests

This section of the paper considers that bugbear of many 'quantitative methods' courses - statistical tests. It reminds readers that the logic of these tests is, in reality, quite simple and within the capacity of all novice researchers to grasp (readers familiar with this logic could skip the section). In practice there is often little need for such tests, but general ignorance of their limitations means that they retain considerable rhetorical power.

Suppose that one of the background questions in a survey using a random sample of 100 adult residents in one city asked for the sex of each respondent, and whether they had received any work-based training in the past two years. The results might be presented as in Table 1, which shows that slightly more than half of the men received training (24/41 or around 59%), while fewer than half of the women did (29/59 or 49%).

Table 1: Cross-tabulation of sex by receiving training
          Training  No training  Total
Male      24        17           41
Female    29        30           59
Total     53        47           100

For our sample, therefore, we can draw safe conclusions about the relative prevalence of reported training in the two sex categories. The men are more likely to have received training. However, our main motive for using probability sampling was that we could then generalise from our sample to the larger population for the study. If the population in this example is all residents of the city from which the sample was taken, can we generalise from our sample finding about the relationship between sex and receiving training? Put another way, is it likely that men in the city (and not just in the sample) were more likely to receive training than women? In order to answer the question for the population we could imagine that the answer is 'no' and start our consideration from there (this is our 'null hypothesis'). If the answer were actually no, and men and women were equally likely to report training, then what would we expect the figures in Table 1 to look like?

The number of each sex, and the number of people receiving training, remains as defined in Table 1. From this outline we can calculate exactly what we expect the numbers in the other cells to be. We know that 41% of cases are male, and that 53% of cases received training. We would therefore expect 41% of 53% of the overall total to be both male and have received training. This works out at around 22%, or 22 cases (Table 2). We can do the same calculation for each cell of the table. For example, as 59% of the cases are female and 53% of the cases received training, we would expect 59% of 53% of the overall total to be females and have received training. This works out at around 31%, or 31 cases. But then we already knew that this must be so, since 53 people in our survey received training, of whom we expected 22 to be male, so by definition we expected the other 31 to be female. Similarly, 41 cases are male and we expected 22 of these to have received training, so we expected that the other 19 did not. Note that in practice all of these calculations would be generated automatically by a computer.

Table 2: The expected values for sex by receiving training
          Training  No training  Total
Male      22        19           41
Female    31        28           59
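The calculation behind Tables 1 and 2 can be reproduced directly. This is an illustrative sketch rather than part of the original analysis: the Pearson chi-square statistic and the erfc formula for its p-value (valid for the one degree of freedom of a 2x2 table) are standard, but the choice to compute them by hand here is mine:

```python
from math import erfc, sqrt

# Observed counts from Table 1 (rows: male, female;
# columns: training, no training).
observed = [[24, 17],
            [29, 30]]

row_totals = [sum(row) for row in observed]        # 41, 59
col_totals = [sum(col) for col in zip(*observed)]  # 53, 47
total = sum(row_totals)                            # 100

# Expected counts under the null hypothesis: each cell is the row
# share times the column share of the overall total, exactly the
# Table 2 calculation described in the text.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Pearson chi-square statistic. With one degree of freedom it is a
# squared standard normal, hence the erfc form for the p-value.
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
p = erfc(sqrt(chi2 / 2))

print([[round(e) for e in row] for row in expected])  # [[22, 19], [31, 28]]
print(round(p, 2))
```

The resulting probability is nowhere near conventional 'significance', which is the point of the example: a discrepancy of two cases in a sample of 100 is easily produced by the sampling alone.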

To recap, we obtained the figures in Table 1 from our survey (our 'observed' figures) and wanted to know whether the apparent difference in training rates for men and women was also likely to be true of the city as a whole. To work this out, we calculated how many men and women we would expect to have received training assuming that there was actually no difference, and obtained the figures in Table 2 (our 'expected' figures). If there were no difference in the city as a whole between the rates of receiving training for men and women then we would expect 22 of 41 males to have received training, but we actually found 24 of them. In each cell of the tables there is a discrepancy of two cases between the observed and expected figures. Is this convincing evidence that men are more likely than women in this city to report receiving training? Hardly. In selecting a sample of 100 cases at random it would be easy for us to have inadvertently introduced a bias equivalent to those two cases. We should therefore conclude that we have no evidence of differential training rates for men and women in this city.

This argument contains just about everything that all researchers need to know about the logic of significance-testing in statistical analysis. It is a form of logic that all social science researchers should be able to follow. There is, therefore, no reason why anyone should not read and understand statistical evidence of this sort. Everything about traditional statistics is built on this rather simple foundation, and yet nothing in statistics is more complicated than this. Much of what appears in traditional texts is simply the introduction of a technical shorthand for the concepts and techniques used in this introductory argument (see Gorard 2003). The key thing a statistical test does for us is to estimate how unlikely it is that what we observed would happen, assuming, for the sake of calculation, that a null hypothesis is true. The more unlikely our observation is then the less likely it is that the null hypothesis is true, and therefore the more likely it is that we need to look for an alternative explanation. The result is usually expressed as a probability (in the example above, the probability of obtaining a difference at least as large as the one observed, if the null hypothesis were true, is around 30%). However, the underlying logic is usually easier for beginners to see when expressed as frequencies, as it is here, rather than as percentages - and this is one of the themes of this paper. Percentages are surprisingly difficult to work with safely, and surprisingly easy to make mistakes with (see below).

Do We Need Statistical Tests?

This form of statistical testing has many historical roots, although many of the tests in common use today, such as those attributable to Fisher, were derived from agricultural studies (Porter 1986). These tests were developed for one-off use, in situations where the measurement error was negligible, in order to allow researchers to estimate the probability that two random samples drawn from the same population would have divergent measurements (men and women in the example above). In a roundabout way, this probability was then used to help decide whether the two samples actually come from two different populations. For example, vegetative reproduction could be used to create two colonies of what is effectively the same plant. One colony could be given an agricultural treatment, and the results (in terms of survival rates for example) compared between the two colonies. Statistics would help us estimate the probability that a sample of scores from each colony would diverge by the amount we actually observe, assuming that the treatment given to one colony was ineffective. If this probability is very small, therefore, we might conclude that the treatment appeared to have an effect. As we have seen, that, in a nutshell, is what significance tests are, and what they can do for us.

In light of current practice, it is important to emphasise what significance tests are not, and cannot do for us. Most simply, they cannot make a decision for us. The probabilities they generate are only estimates, and they are, after all, only probabilities. Standard limits for retaining or rejecting our null hypothesis of no difference between the two colonies, such as 5%, have no mathematical or empirical relevance. They are only arbitrary. A host of factors might affect our confidence in the probability estimate, or the dangers of deciding wrongly in one way or another. Therefore, there can, and should, be no universal standard. Each case must be judged on its merits. However, it is also often the case that we do not need a significance test to help us decide this (as in the first example). In the agricultural example, if all of the treated plants died and all of the others survived (or vice versa) then we do not need a significance test to tell us that the probability is very low (and precisely how low depends on the number of plants involved) that the treatment had no effect. If there were 1,000 plants in the sample for each colony, and one plant survived in the treated group, while one died in the other group, then again a significance test would be superfluous (and so on). All that the test is doing is formalising the estimates of relative probability that we make anyway in everyday situations. They are really only needed when the decision is not clear-cut (for example where 600/1000 survived in the treated group but only 550/1000 survived in the control), and since they do not make the decision for us, they are of limited practical use even then. Above all, significance tests give no idea about the real importance of the difference we observe. A large enough sample can be used to reject almost any null hypothesis on the basis of a very small 'effect' (see below).
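The 'not clear-cut' case just mentioned (600 of 1,000 treated plants surviving against 550 of 1,000 controls) can be sketched with the same expected-value logic as before. The code is an illustration only, using the standard Pearson chi-square for a 2x2 table:

```python
from math import erfc, sqrt

# Borderline case from the text: 600/1000 treated plants survive
# versus 550/1000 controls. Here a test genuinely adds information.
observed = [[600, 400],   # treated:  survived, died
            [550, 450]]   # control:  survived, died

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
p = erfc(sqrt(chi2 / 2))  # chi-square with 1 df as a squared normal

print(round(p, 3))
```

The probability falls below the conventional 5% cut-off, but as the text argues, nothing in that number makes the decision for us: whether a five-point difference in survival rates matters remains a substantive judgement.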

It is also important to re-emphasise that the probabilities generated by significance tests are based on random samples. If the researcher does not use a random sample then inferential statistics are of no use since the probabilities become meaningless. Researchers using significance tests with convenience, quota or snowball samples, for example, are making a key category mistake. Similarly, researchers using significance tests on populations (from official statistics perhaps) are generating meaningless probabilities. It is possible that a trawl of education journals would reveal very few technically correct uses of significance tests. Added to this is the problem that social scientists are not generally dealing with variables, such as plant survival rates, with minimal measurement error. In fact, many studies are based on latent variables, such as attitudes, of whose existence we cannot even be certain, let alone how to measure them. Added to this are the problems of non-response and participant dropout in social investigations, that also do not occur in the same way in agricultural applications. All of this means that the variation in observed measurements due to the chance factor of sampling (which is the only thing that significance tests take into account) is generally far less than the potential variation due to other factors. The probability from a test contains the unwritten proviso - assuming that the sample is random with full response, no dropout, and no measurement error. The number of social science studies meeting this proviso is very small indeed. To this must be added the caution that probabilities interact, and that most analyses in the IT age are no longer one-off. In fact many analysts use computers to dredge large datasets for 'significant' results.
In addition, most of us start each probability calculation as though nothing prior is known, whereas it may be more realistic and cumulative to build the results of previous work into new calculations (Roberts 2002).

Therefore, while it is important for novice social scientists to be taught about the use of significance tests, it is equally important that they are taught about the limitations as well, and alerted to possible alternatives, such as confidence intervals, effect sizes, and graphical approaches. But even these alternative statistics cannot be used post hoc to overcome design problems or deficiencies in datasets. If all of the treated plants in our example were placed on the lighter side of the greenhouse, with the control group on the other side, then the most sophisticated statistical analysis in the world could not overcome that bias. It is worth stating this because of the current popularity in some powerful circles of complex methods of probability-based analysis, whereas a more fruitful avenue for long-term progress would be the generation of better data, open to inspection through simpler and more transparent methods of accounting. Without adequate empirical information 'to attempt to calculate chances is to convert mere ignorance into dangerous error by clothing it in the garb of knowledge' (Mill 1843, in Porter 1986, pp.82-83). Null hypothesis significance tests (NHSTs) may therefore be a hindrance to scientific progress (Harlow et al. 1997).

Statistics is not, and should not be, reduced to a set of mechanical dichotomous decisions around a 'sacred' value such as 5%. Suggested alternatives to reporting NHSTs have been the use of more non-sampled work, effect sizes (Fitz-Gibbon 1985), meta-analyses, parameter estimation (Howard et al. 2000), or standard confidence intervals for results instead, or the use of more subjective judgements of the worth of results. In the US there has been a debate over whether the reporting of significance tests should be banned from academic journal articles, to encourage the growth of these alternatives (Thompson 2002). Both the American Psychological Society and the American Psychological Association have recommended reporting effect sizes and confidence intervals, and advocated the greater use of graphical approaches to examine data. Whereas a significance test is used to reject a null hypothesis, an effect size is an estimate of the scale of divergence from the null hypothesis. The larger the effect size, the more important the result. A confidence interval may be defined by a high and low limit between which we can be 95% confident (for example) that the 'true' value of our population estimate lies. The smaller the confidence interval the better quality the estimate is (de Vaus 2001).

Of course, several of the proposed replacements, including confidence intervals, are based on the same sort of probability calculations as significance tests. Therefore, they are still largely inappropriate for use with populations and non-random samples, and like significance tests they do nothing to overcome design bias or non-response. Most of these alternatives require considerable subjective judgement in interpretation anyway. For example, a standard effect size from a simple experiment might be calculated as the difference between the mean scores of the treatment and control groups proportional to the variance (or the standard deviation) for that score among the population. This sounds fine in principle, but in practice we will not know the population variance. If we had the population figures then we would not need to be doing this kind of calculation anyway! We could estimate the population variance in some way from the figures for the two groups, but this introduces a new source of error, and the cost may therefore over-ride the benefit on several occasions. There is at present no clear agreement on what to do, other than use intelligent judgement.
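The difficulty just described can be illustrated with a short sketch. The scores below are invented, and the pooled variance is exactly the kind of estimate the paragraph warns about, standing in for the unknown population value:

```python
# Hedged sketch: an effect size when the population variance is
# unknown and must be estimated by pooling the two groups.
from statistics import mean, variance

treatment = [14, 18, 15, 20, 17, 16]
control   = [12, 13, 15, 11, 14, 13]

n1, n2 = len(treatment), len(control)

# Pooled variance: a weighted average of the two sample variances.
# This stands in for the unknown population variance, and is itself
# a new source of error, as noted in the text.
pooled_var = ((n1 - 1) * variance(treatment) +
              (n2 - 1) * variance(control)) / (n1 + n2 - 2)

effect_size = (mean(treatment) - mean(control)) / pooled_var ** 0.5
print(round(effect_size, 2))
```

The arithmetic is trivial; the judgement about whether the pooled estimate is trustworthy, and whether the resulting effect matters, is not.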

Recent UK initiatives, perhaps most prominently the new funding arrangements for ESRC PhD students, have been designed to encourage a wider awareness of statistical techniques among social scientists. While these moves are welcome, the lack of agreement about the alternatives to NHSTs, the absence of textbooks dealing with them (Curtis and Araki 2002), and their need for greater skill and judgment means there is a consequent danger of simply re-visiting all of the debates about statistics that have taken place in other disciplines since at least 1994 (Howard et al. 2000). But perhaps the greatest danger is that further generations of potentially numerate social science researchers will be turned away from 'quantitative' work by trainers who over-emphasise artificial complexity (for whom the techniques themselves become more important than the research - a form of 'abstracted empiricism'), and under-emphasise the simple role of probabilities.

The Simple Role of Probabilities

In order to illustrate the need for greater general awareness of the simple role of probabilities in the comprehension of research evidence, consider the following problem. Imagine that around 1% of children have a particular specific educational need. If a child has that need, then they have a 90% probability of obtaining a positive result from a diagnostic test. Those without that specific need have only a 10% probability of obtaining a positive result from the diagnostic test. Therefore, the test is a very good discriminator. If a large group of children is tested, and a child you know has just obtained a positive result from the test, then what is the probability that they have that specific need?

We all ought to be able to compute the answer in our heads, but faced with problems such as these, most people are unable to calculate any valid estimate of the risk, even with the aid of a calculator. This inability applies to relevant professionals such as physicians, teachers and counsellors, as well as researchers (Gigerenzer 2002). Yet such a calculation is fundamental to the assessment of risk/gain in a multiplicity of real-life situations. Many people who do offer a solution claim that the probability is around 90%, and the most common justification for this is that the test is '90% accurate'. These people have confused the conditional probability of someone having the need given a positive test result with the conditional probability of getting a positive test result given that someone has the need. The two values are completely different. The problem relies for its solution on Bayes' Theorem which describes how to calculate conditional probabilities correctly (Roberts 2002). This theorem states that:

The probability of event A given event B equals the probability of A times the probability of B given A, divided by (the probability of A times the probability of B given A, plus the probability of not-A times the probability of B given not-A). Or in more formal terms (where p signifies probability and | signifies 'given'):

p(A|B) = p(A).p(B|A) / (p(A).p(B|A) + p(A').p(B|A'))

If we substitute having the need for A and testing positive for B, then we can calculate the probability of having the need given a positive result in the test. Unfortunately, many readers will probably be thinking that this approach is of very little help to them. It is still too complicated. Fortunately, we can solve the problem simply by looking at it in another way - a way that all readers should be able to follow. Of 1,000 children chosen at random, on average 10 will have this specific educational need (1%). Of these 10 children with the need, around 9 will obtain a positive result in a diagnostic test (90%). Of the 990 without the need, around 99 will also obtain a positive test result (10%). If all 1,000 children are tested, and a child you know is one of the 108 obtaining a positive result, what is the probability that they have this need? This is the same problem, with the same information as above. But by expressing it in frequencies for an imaginary 1,000 children we find that much of the computation has already been done for us (Table 3). Many more people will now be able to see that the probability of having the need given a positive test result is nothing like 90%. Rather, it is 9/108 or around 8%. This result still depends on Bayes' Theorem, and is of course the same answer as would be obtained by conducting the calculation with the probabilities above. But it is far easier to compute in simple frequencies rather than percentages. Re-expressing the problem has not, presumably, changed the computational ability of readers, but has changed the capability of many readers to see the solution, and the need to take the base rate (or comparator) into account.

Table 3: Probability of SEN having tested positive
          Test positive  Test negative  Total
SEN       9              1              10
Not SEN   99             891            990
Total     108            892            1000
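Both routes to the answer, the frequency table and Bayes' Theorem itself, can be checked against each other in a short sketch of the worked example above:

```python
# The diagnostic example: 1% base rate, 90% true positive rate,
# 10% false positive rate. Two equivalent routes to the answer.

# Route 1: frequencies for an imaginary 1,000 children (Table 3).
with_need, without_need = 10, 990
true_positives  = 0.9 * with_need      # 9 children
false_positives = 0.1 * without_need   # 99 children
p_frequencies = true_positives / (true_positives + false_positives)

# Route 2: Bayes' Theorem applied to the probabilities directly.
p_need, p_pos_given_need, p_pos_given_no_need = 0.01, 0.9, 0.1
p_bayes = (p_need * p_pos_given_need) / (
    p_need * p_pos_given_need + (1 - p_need) * p_pos_given_no_need)

print(round(p_frequencies, 3), round(p_bayes, 3))  # both about 0.083
```

The two answers agree, as they must: the frequency table is simply Bayes' Theorem with the multiplication already done.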

A similar generic problem involving misunderstood percentages concerns the use of symptoms in medical diagnosis (Dawes 2001). Imagine an illness that occurs in 20% of the population, and has two frequent symptoms. Symptom A occurs in 18% of the cases with this disease, and in 2% of cases without the disease. Symptom B occurs in 58% of the cases with the disease, and in 22% of cases otherwise. Which symptom is the better predictor?

This situation is more complex than the example of special educational need, because there are now two conditional probabilities to deal with. But the same approach of converting it into frequencies leads to greater understanding. Many practitioners would argue that symptom B is the more useful as it is more 'typical' of the disease. There is a 16% gap (18-2) between having and not having the disease with symptom A, whereas the gap is 36% (58-22) with symptom B. Symptom B, they will conclude, is the better predictor. But while it seems counter-intuitive to say so, this analysis is quite wrong because it ignores the base rate of the actual frequency of the disease in the population.

Looked at in terms of frequencies, in a group of 1,000 people, on average 200 people (20%) would have the disease and 800 would not. Of the 200 with the disease, 36 (18%) would have symptom A and 116 (58%) would have symptom B. Of the 800 without the disease, 16 (2%) would have symptom A, while 176 (22%) would have symptom B. Thus, if we take a person at random from the 1,000 then someone with symptom A is 2.25 times as likely to have the disease as not (36/16), whereas someone with symptom B is only 0.66 times as likely to have the disease as not (116/176). Put another way, someone with symptom A is more likely to have the disease than not (Table 4). Someone with symptom B, on the other hand, is most likely not to have the disease. What we need for diagnosis are discriminating, rather than typical, symptoms. The more general conclusion is the same as that drawn from consideration of the 'politician's error' (Gorard 2000): simple differences between percentages can give misleading, and potentially extremely harmful, results.

Table 4: Typical versus discriminating symptom
            Illness  No illness  Total
Symptom A   36       16          52
Symptom B   116      176         292
Total       200      800         1000
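The Table 4 calculation can be sketched as follows; the odds computed here are simply the 'times as likely' figures quoted in the text:

```python
# Frequencies for 1,000 people (Table 4): how many with each
# symptom have the illness, versus how many do not.
ill, well = 200, 800

symptom_a_ill, symptom_a_well = 0.18 * ill, 0.02 * well   # 36 vs 16
symptom_b_ill, symptom_b_well = 0.58 * ill, 0.22 * well   # 116 vs 176

# Odds of illness given each symptom: its discriminating power.
odds_a = symptom_a_ill / symptom_a_well
odds_b = symptom_b_ill / symptom_b_well

print(round(odds_a, 2), round(odds_b, 2))  # 2.25 and 0.66
```

Symptom A, the less 'typical' one, is the better discriminator: a person with symptom A is more likely ill than not, while a person with symptom B is more likely well.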

The enormity of such misunderstandings can be made clearer by another real-life example. A study of 280,000 women in Sweden assessed the impact of a screening programme for breast cancer. Over ten years, there were 4 deaths per 1,000 among the over 40s without screening, and 3 deaths per 1,000 with screening. There are several different ways the gain from screening could be expressed. The absolute risk reduction is 1 in 1,000 or 0.1%, while the relative risk reduction is 25%. It is the last that is routinely used by advocates and those standing to gain from screening programmes, perhaps because it sounds like a saving of 25 lives per 100. All versions are correct - but the relative risk is more likely to get funding and headlines because funders, policy-makers and media commentators are easily fooled by research findings expressed in percentages. In this example, the practical outcome is that information leaflets about the screening procedure mostly do not discuss false positives or other costs. Some even give the illusion that screening can reduce the incidence of the cancer. But to achieve even the level of success above requires a large number of false positives, and the distress and unnecessary operations that these entail. To these we must add the danger of cancer from the screening itself, and the financial cost of the programme (and therefore the lost opportunity to spend this money on reducing the risk in some other way). So viewed dispassionately, and with alternative ways of looking at the same data, a 1/1000 risk reduction may not seem worth it for this group.
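The two ways of expressing the same gain can be set side by side in a short sketch of the screening figures above:

```python
# Breast cancer screening example: 4 deaths per 1,000 unscreened,
# 3 deaths per 1,000 screened. The same gain, expressed two ways.
deaths_unscreened = 4 / 1000
deaths_screened   = 3 / 1000

absolute_risk_reduction = deaths_unscreened - deaths_screened
relative_risk_reduction = absolute_risk_reduction / deaths_unscreened

print(round(absolute_risk_reduction, 4))   # 0.001, i.e. 1 in 1,000
print(round(relative_risk_reduction, 2))   # 0.25, i.e. '25%'
```

Both numbers describe exactly the same outcome; only the rhetorical force differs.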

A similar issue of risk reduction, and the opportunity cost that it entailed, arose in the UK in 2002 concerning security checks on adults working with children. Children were not being supervised in schools (which were closed) for fear that they might be supervised by someone inappropriate. We cannot do the risk reduction calculation for this yet since no figures have been published for the number of problem cases actually uncovered compared to the number of cases examined.


We routinely face scares about health, food, pesticides, the environment, and education, with the figures presented by their peddlers in the most rhetorically convincing way possible. Media reports of epidemiological studies tend to use the rhetorical power of numbers expressed as percentages to make research reports appear more alarming or more flattering. We are unlikely to be able to change this directly. But our ability to see beyond the presentation, and to consider other equally valid presentations, is under our control. Improving the ability of the consumer is our best defence, and will harm only those for whom the ideology is more important than the evidence (or who stand to benefit in some way from our confusion). This improvement will be most marked in the apparently simple kinds of problems discussed so far. The lessons could perhaps be summarised as: demand a base rate or comparison group for any claim, and be prepared to re-work the figures yourself in different ways, particularly as simple frequencies, to lessen their rhetorical power.

Part of what this paper tries to do is show that standard approaches to significance testing, currently the cornerstone of many 'quantitative' methods courses, should no longer have automatic pride of place. There is a pressing need for more general awareness of the relatively simple role of numbers in those common social scientific situations for which random samples are not relevant. The importance of this ongoing debate about tests is that it suggests that we need to move away from a formulaic approach to research. However, we need to replace empty formulae for reporting results not with an 'anything goes' philosophy, but with the principle that almost anything goes as long as it can be described, justified and replicated. Above all, we need to remember that statistical analysis is not our final objective, but the starting point of the more interesting social science that follows. A 'significant' result is worth very little in real terms, and certainly does not enable us to generalise safely beyond a poor sample. The key issue in research is not significance but the quality of the research design.

Once a discipline or field, like social science, is mature enough, some of its arguments can be converted into formal structures involving numbers. This helps to reduce ambiguity, clarify reasoning and reveal errors (Boudon 1974). More generally, much of what we know about the social world is uncertain. Our knowledge, such as it is, is commonly expressed in terms of probabilities. The same situation applies to health research, engineering, judicial proceedings and many other fields. All professionals have their own craft knowledge, which is at least partially derived from research evidence. However, in a number of practical situations, more explicit use of research findings has been shown to lead to more beneficial outcomes than relying solely on professional experience. The practical problem is that these research findings are generally expressed in terms of risk reduction and uncertainty. If the probabilities are misunderstood by the professionals, as they have been in several of the examples in this paper, then it is difficult for the professionals to make their own judgments in practice, and the use of research findings could, in this situation, have far from beneficial outcomes. In building the capacity to generate and use research evidence relevant to public policy in the UK, it might not be an exaggeration to say that what we need above all else is a more general willingness and ability among interested parties to think about the nature of uncertainty.


This paper arose as part of the ESRC-funded Teaching and Learning Research Programme Research Capacity Building Network (L139251106).


BOUDON, R. (1974) The logic of sociological explanation, Harmondsworth: Penguin.

CROW, G. and RAY, L. (2002) Thinking and working sociologically: a call for contributions, Sociological Research Online, 7, 4.

CURTIS, D. and ARAKI, C. (2002) Effect size statistics: an analysis of statistics textbooks, presentation at AERA, New Orleans April 2002.

DAWES, R. (2001) Everyday irrationality, Oxford: Westview Press.

DE VAUS, D. (2001) Research design in social science, London: Sage.

DEAN, H. (2000) What's the evidence for 'evidence-based' social policy? Welfare reform, low-income families and the role of social science, presented at fifth ESRC seminar on Measuring Success: what counts is what works, Cardiff: September 2000.

DIAMOND, I. (2002) Towards a quantitative Europe, Social Sciences, 51, p.3.

ELLMORE, P. and WOEHILKE, P. (1998) Twenty years of research methods employed in American Educational Research Journal, Educational Researcher, and Review of Educational Research from 1978 to 1997, (mimeo) ERIC ED 420701.

EVANS, L. (2002) Reflective practice in educational research, London: Continuum.

FITZ-GIBBON, C. (1985) The implications of meta-analysis for educational research, British Educational Research Journal, 11, 1, 45-49.

GIGERENZER, G. (2002) Reckoning with risk, London: Penguin.

GORARD, S. (2000) Education and Social Justice, Cardiff: University of Wales Press.

GORARD, S. (2002c) Can we overcome the methodological schism?: combining qualitative and quantitative methods, Research Papers in Education, 17, 4.

GORARD, S. (2003) The role of numbers in social science research: quantitative methods made easy, London: Continuum.

HARLOW, L., MULAIK, S. and STEIGER, J. (1997) What if there were no significance tests?, Marwah, NJ: Lawrence Erlbaum.

HAYES, E. (1992) The impact of feminism on adult education publications: an analysis of British and American Journals, International Journal of Lifelong Education, 11, 2, 125-138.

HOWARD, G., MAXWELL, S. and FLEMING, K. (2000) The proof of the pudding: an illustration of the relative strengths of null hypothesis, meta-analysis, and Bayesian analysis, Psychological Methods, 5, 3, 315-332.

LEWIS, J. (2001) The fluctuating fortunes of the social sciences since 1945, mimeo.

MCINTYRE, D. and MCINTYRE, A. (2000) Capacity for research into teaching and learning, Swindon: Report to the ESRC Teaching and Learning Research Programme.

MILLS, C. Wright (1959) The sociological imagination, London: Oxford University Press.

NATIONAL EDUCATIONAL RESEARCH POLICY AND PRIORITIES BOARD (2000) Second policy statement with recommendations on research in education, Washington DC: NERPP.

PIRRIE, A. (2001) Evidence-based practice in education: the best medicine?, British Journal of Educational Studies, 49, 2, 124-136.

PLEWIS, I. (1997) Presenting educational data: cause for concern, Research Intelligence, 61, 9-10.

PORTER, T. (1986) The rise of statistical thinking, Princeton: Princeton University Press.

RESNICK, L. (2000) Strengthening the capacity of the research system: a report of the National Academy of Education, presentation at AERA, New Orleans, April 2000.

ROBERTS, K. (2002) Belief and subjectivity in research: an introduction to Bayesian theory, Building Research Capacity, 3, 5-6.

SOOBEN, P. (2002) Developing quantitative capacity in UK social science, Social Sciences, 50, p.8.

TAYLOR, E. (2001) From 1989 to 1999: A content analysis of all submissions, Adult Education Quarterly, 51, 4, 322-340.

THOMPSON, B. (2002) What future quantitative social science could look like: confidence intervals for effect sizes, Educational Researcher, 31, 3, 25-32.

TOOLEY, J. and DARBY, D. (1998) Educational research: a critique, London: OFSTED.

WRIGLEY, J. (1976) Pitfalls in educational research, Research Intelligence, 2, 2, 2-4.
