Statistical Methods in Psychology Journals
Guidelines and Explanations
Leland Wilkinson and the Task Force on Statistical Inference
APA Board of Scientific Affairs
In the light of continuing debate over the applications of significance testing in psychology journals and following the publication of Cohen's (1994) article, the Board of Scientific Affairs (BSA) of the American Psychological Association (APA) convened a committee called the Task Force on Statistical Inference (TFSI) whose charge was "to elucidate some of the controversial issues surrounding applications of statistics including significance testing and its alternatives; alternative underlying models and data transformation; and newer methods made possible by powerful computers" (BSA, personal communication, February 28, 1996). Robert Rosenthal, Robert Abelson, and Jacob Cohen (cochairs) met initially and agreed on the desirability of having several types of specialists on the task force: statisticians, teachers of statistics, journal editors, authors of statistics books, computer experts, and wise elders. Nine individuals were subsequently invited to join and all agreed. These were Leona Aiken, Mark Appelbaum, Gwyneth Boodoo, David A. Kenny, Helena Kraemer, Donald Rubin, Bruce Thompson, Howard Wainer, and Leland Wilkinson. In addition, Lee Cronbach, Paul Meehl, Frederick Mosteller, and John Tukey served as Senior Advisors to the Task Force and commented on written materials.

The TFSI met twice in two years and corresponded throughout that period. After the first meeting, the task force circulated a preliminary report indicating its intention to examine issues beyond null hypothesis significance testing. The task force invited comments and used this feedback in the deliberations during its second meeting.

After the second meeting, the task force recommended several possibilities for further action, chief of which would be to revise the statistical sections of the American Psychological Association Publication Manual (APA, 1994). After extensive discussion, the BSA recommended that "before the TFSI undertook a revision of the APA Publication Manual, it might want to consider publishing an article in American Psychologist, as a way to initiate discussion in the field about changes in current practices of data analysis and reporting" (BSA, personal communication, November 17, 1997).

This report follows that request. The sections in italics are proposed guidelines that the TFSI recommends could be used for revising the APA publication manual or for developing other BSA supporting materials. Following each guideline are comments, explanations, or elaborations assembled by Leland Wilkinson for the task force and under its review. This report is concerned with the use of statistical methods only and is not meant as an assessment of research methods in general. Psychology is a broad science. Methods appropriate in one area may be inappropriate in another.

The title and format of this report are adapted from a similar article by Bailar and Mosteller (1988). That article should be consulted, because it overlaps somewhat with this one and discusses some issues relevant to research in psychology. Further detail can also be found in the publications on this topic by several committee members (Abelson, 1995, 1997; Rosenthal, 1994; Thompson, 1996; Wainer, in press; see also articles in Harlow, Mulaik, & Steiger, 1997).

Method

Design

Make clear at the outset what type of study you are doing. Do not cloak a study in one guise to try to give it the assumed reputation of another. For studies that have multiple goals, be sure to define and prioritize those goals.

There are many forms of empirical studies in psychology, including case reports, controlled experiments, quasi-experiments, statistical simulations, surveys, observational studies, and studies of studies (meta-analyses). Some are hypothesis generating: They explore data to form or sharpen hypotheses about a population for assessing future hypotheses. Some are hypothesis testing: They assess specific a priori hypotheses or estimate parameters by random sampling from that population. Some are meta-analytic: They assess specific a priori hypotheses or estimate parameters (or both) by synthesizing the results of available studies.

Some researchers have the impression or have been taught to believe that some of these forms yield information that is more valuable or credible than others (see Cronbach, 1975, for a discussion). Occasionally proponents of some research methods disparage others. In fact, each form of research has its own strengths, weaknesses, and standards of practice.

Jacob Cohen died on January 20, 1998. Without his initiative and gentle persistence, this report most likely would not have appeared. Grant Blank provided Kahn and Udry's (1986) reference. Gerard Dallal and Paul Velleman offered helpful comments.

Correspondence concerning this report should be sent to the Task Force on Statistical Inference, c/o Sangeeta Panicker, APA Science Directorate, 750 First Street, NE, Washington, DC 20002-4242. Electronic mail may be sent to spanicker@apa.org.

594 August 1999 * American Psychologist
Copyright 1999 by the American Psychological Association. Inc. 0003-066X/99/$2.00
Vol. 54, No. 8, 594-604

Population

The interpretation of the results of any study depends on the characteristics of the population intended for analysis. Define the population (participants, stimuli, or studies) clearly. If control or comparison groups are part of the design, present how they are defined.

Psychology students sometimes think that a statistical population is the human race or, at least, college sophomores. They also have some difficulty distinguishing a class of objects versus a statistical population: that sometimes we make inferences about a population through statistical methods, and other times we make inferences about a class through logical or other nonstatistical methods. Populations may be sets of potential observations on people, adjectives, or even research articles. How a population is defined in an article affects almost every conclusion in that article.

Sample

Describe the sampling procedures and emphasize any inclusion or exclusion criteria. If the sample is stratified (e.g., by site or gender), describe fully the method and rationale. Note the proposed sample size for each subgroup.

Interval estimates for clustered and stratified random samples differ from those for simple random samples. Statistical software is now becoming available for these purposes. If you are using a convenience sample (whose members are not selected at random), be sure to make that procedure clear to your readers. Using a convenience sample does not automatically disqualify a study from publication, but it harms your objectivity to try to conceal this by implying that you used a random sample. Sometimes the case for the representativeness of a convenience sample can be strengthened by explicit comparison of sample characteristics with those of a defined population across a wide range of variables.

Assignment

Random assignment. For research involving causal inferences, the assignment of units to levels of the causal variable is critical. Random assignment (not to be confused with random selection) allows for the strongest possible causal inferences free of extraneous assumptions. If random assignment is planned, provide enough information to show that the process for making the actual assignments is random.

There is a strong research tradition and many exemplars for random assignment in various fields of psychology. Even those who have elucidated quasi-experimental designs in psychological research (e.g., Cook & Campbell, 1979) have repeatedly emphasized the superiority of random assignment as a method for controlling bias and lurking variables. "Random" does not mean "haphazard." Randomization is a fragile condition, easily corrupted deliberately, as we see when a skilled magician flips a fair coin repeatedly to heads, or innocently, as we saw when the drum was not turned sufficiently to randomize the picks in the Vietnam draft lottery. As psychologists, we also know that human participants are incapable of producing a random process (digits, spatial arrangements, etc.) or of recognizing one. It is best not to trust the random behavior of a physical device unless you are an expert in these matters. It is safer to use the pseudorandom sequence from a well-designed computer generator or from published tables of random numbers. The added benefit of such a procedure is that you can supply a random number seed or starting number in a table that other researchers can use to check your methods later.

Nonrandom assignment. For some research questions, random assignment is not feasible. In such cases, we need to minimize effects of variables that affect the observed relationship between a causal variable and an outcome. Such variables are commonly called confounds or covariates. The researcher needs to attempt to determine the relevant covariates, measure them adequately, and adjust for their effects either by design or by analysis. If the effects of covariates are adjusted by analysis, the strong assumptions that are made must be explicitly stated and, to the extent possible, tested and justified. Describe methods used to attenuate sources of bias, including plans for minimizing dropouts, noncompliance, and missing data.

Authors have used the term "control group" to describe, among other things, (a) a comparison group, (b) members of pairs matched or blocked on one or more nuisance variables, (c) a group not receiving a particular treatment, (d) a statistical sample whose values are adjusted post hoc by the use of one or more covariates, or (e) a group for which the experimenter acknowledges bias exists and perhaps hopes that this admission will allow the reader to make appropriate discounts or other mental adjustments. None of these is an instance of a fully adequate control group.

If we can neither implement randomization nor approach total control of variables that modify effects (outcomes), then we should use the term "control group" cautiously. In most of these cases, it would be better to forgo the term and use "contrast group" instead. In any case, we should describe exactly which confounding variables have been explicitly controlled and speculate about which unmeasured ones could lead to incorrect inferences. In the absence of randomization, we should do our best to investigate sensitivity to various untestable assumptions.

Measurement

Variables. Explicitly define the variables in the study, show how they are related to the goals of the study, and explain how they are measured. The units of measurement of all variables, causal and outcome, should fit the language you use in the introduction and discussion sections of your report.

A variable is a method for assigning to a set of observations a value from a set of possible outcomes. For example, a variable called "gender" might assign each of 50 observations to one of the values male or female. When we define a variable, we are declaring what we are prepared to represent as a valid observation and what we must consider as invalid.
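
The idea of declaring in advance what counts as a valid observation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `VariableSpec` class and the `find_invalid` function are hypothetical helpers invented for this sketch, not part of any statistical package, and the observed values are made up.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class VariableSpec:
    """Hypothetical helper: a variable name plus its declared set of possible outcomes."""
    name: str
    outcomes: FrozenSet[Any]

    def is_valid(self, value: Any) -> bool:
        # A value is a valid observation only if it was declared as a possible outcome.
        return value in self.outcomes


def find_invalid(spec: VariableSpec, values: Sequence[Any]) -> List[Tuple[int, Any]]:
    """Return (position, value) pairs for observations outside the declared outcomes."""
    return [(i, v) for i, v in enumerate(values) if not spec.is_valid(v)]


# Illustrative data only: "gender" declared with the two outcomes used in the text.
gender = VariableSpec(name="gender", outcomes=frozenset({"male", "female"}))
observed = ["male", "female", "female", "mael", "male", ""]
invalid = find_invalid(gender, observed)
print(invalid)
```

Writing the declaration down before looking at the data means that the list of invalid entries is a record of coding problems rather than a judgment made after the fact.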
If we define the range of a particular variable (the set of possible outcomes) to be from 1 to 7 on a Likert scale, for example, then a value of 9 is not an outlier (an unusually extreme value). It is an illegal value. If we declare the range of a variable to be positive real numbers and the domain to be observations of reaction time (in milliseconds) to an administration of electric shock, then a value of 3,000 is not illegal; it is an outlier.

Naming a variable is almost as important as measuring it. We do well to select a name that reflects how a variable is measured. On this basis, the name "IQ test score" is preferable to "intelligence" and "retrospective self-report of childhood sexual abuse" is preferable to "childhood sexual abuse." Without such precision, ambiguity in defining variables can give a theory an unfortunate resistance to empirical falsification. Being precise does not make us operationalists. It simply means that we try to avoid excessive generalization.

Editors and reviewers should be suspicious when they notice authors changing definitions or names of variables, failing to make clear what would be contrary evidence, or using measures with no history and thus no known properties. Researchers should be suspicious when code books and scoring systems are inscrutable or more voluminous than the research articles on which they are based. Everyone should worry when a system offers to code a specific observation in two or more ways for the same variable.

Instruments. If a questionnaire is used to collect data, summarize the psychometric properties of its scores with specific regard to the way the instrument is used in a population. Psychometric properties include measures of validity, reliability, and any other qualities affecting conclusions. If a physical apparatus is used, provide enough information (brand, model, design specifications) to allow another experimenter to replicate your measurement process.

There are many methods for constructing instruments and psychometrically validating scores from such measures. Traditional true-score theory and item-response test theory provide appropriate frameworks for assessing reliability and internal validity. Signal detection theory and various coefficients of association can be used to assess external validity. Messick (1989) provides a comprehensive guide to validity.

It is important to remember that a test is not reliable or unreliable. Reliability is a property of the scores on a test for a particular population of examinees (Feldt & Brennan, 1989). Thus, authors should provide reliability coefficients of the scores for the data being analyzed even when the focus of their research is not psychometric. Interpreting the size of observed effects requires an assessment of the reliability of the scores.

Besides showing that an instrument is reliable, we need to show that it does not correlate strongly with other key constructs. It is just as important to establish that a measure does not measure what it should not measure as it is to show that it does measure what it should.

Researchers occasionally encounter a measurement problem that has no obvious solution. This happens when they decide to explore a new and rapidly growing research area that is based on a previous researcher's well-defined construct implemented with a poorly developed psychometric instrument. Innovators, in the excitement of their discovery, sometimes give insufficient attention to the quality of their instruments. Once a defective measure enters the literature, subsequent researchers are reluctant to change it. In these cases, editors and reviewers should pay special attention to the psychometric properties of the instruments used, and they might want to encourage revisions (even if not by the scale's author) to prevent the accumulation of results based on relatively invalid or unreliable measures.

Procedure. Describe any anticipated sources of attrition due to noncompliance, dropout, death, or other factors. Indicate how such attrition may affect the generalizability of the results. Clearly describe the conditions under which measurements are taken (e.g., format, time, place, personnel who collected data). Describe the specific methods used to deal with experimenter bias, especially if you collected the data yourself.

Despite the long-established findings of the effects of experimenter bias (Rosenthal, 1966), many published studies appear to ignore or discount these problems. For example, some authors or their assistants with knowledge of hypotheses or study goals screen participants (through personal interviews or telephone conversations) for inclusion in their studies. Some authors administer questionnaires. Some authors give instructions to participants. Some authors perform experimental manipulations. Some tally or code responses. Some rate videotapes.

An author's self-awareness, experience, or resolve does not eliminate experimenter bias. In short, there are no valid excuses, financial or otherwise, for avoiding an opportunity to double-blind. Researchers looking for guidance on this matter should consult the classic book of Webb, Campbell, Schwartz, and Sechrest (1966) and an exemplary dissertation (performed on a modest budget) by Baker (1969).

Power and sample size. Provide information on sample size and the process that led to sample size decisions. Document the effect sizes, sampling and measurement assumptions, as well as analytic procedures used in power calculations. Because power computations are most meaningful when done before data are collected and examined, it is important to show how effect-size estimates have been derived from previous research and theory in order to dispel suspicions that they might have been taken from data used in the study or, even worse, constructed to justify a particular sample size. Once the study is analyzed, confidence intervals replace calculated power in describing results.

Largely because of the work of Cohen (1969, 1988), psychologists have become aware of the need to consider power in the design of their studies, before they collect data. The intellectual exercise required to do this stimulates authors to take seriously prior research and theory in their field, and it gives an opportunity, with incumbent risk, for a few to offer the challenge that there is no applicable research behind a given study. If exploration were not
disguised in hypothetico-deductive language, then it might have the opportunity to influence subsequent research constructively.

Computer programs that calculate power for various designs and distributions are now available. One can use them to conduct power analyses for a range of reasonable alpha values and effect sizes. Doing so reveals how power changes across this range and overcomes a tendency to regard a single power estimate as being absolutely definitive.

Many of us encounter power issues when applying for grants. Even when not asking for money, think about power. Statistical power does not corrupt.
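
The advice above about examining power over a range of alpha values and effect sizes can be sketched in Python. This is a minimal illustration, not the task force's procedure: it assumes a two-sided, two-sample comparison of means and uses a normal approximation, and the function names `approx_power` and `power_table`, the group size of 64, and the grids of values are hypothetical choices made for this sketch.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def approx_power(effect_size: float, n_per_group: int, alpha: float) -> float:
    """Approximate power of a two-sided, two-sample z test for a standardized
    mean difference (normal approximation; an assumption of this sketch)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region under the alternative.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def power_table(effect_sizes, alphas, n_per_group):
    """Power for every combination of effect size and alpha at a fixed group size."""
    return {
        (d, a): approx_power(d, n_per_group, a)
        for d in effect_sizes
        for a in alphas
    }


# Illustrative values only; effect sizes should come from prior research and theory.
table = power_table(
    effect_sizes=[0.2, 0.5, 0.8],
    alphas=[0.01, 0.05, 0.10],
    n_per_group=64,
)
for (d, a), p in sorted(table.items()):
    print(f"d={d:.1f}  alpha={a:.2f}  power={p:.2f}")
```

Printing the whole grid rather than one cell makes it harder to treat a single power estimate as definitive, which is the point of the paragraph above. For small samples, a calculation based on the noncentral t distribution would be more accurate than this approximation.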

Results

Complications

Before presenting results, report complications, protocol violations, and other unanticipated events in data collection. These include missing data, attrition, and nonresponse. Discuss analytic techniques devised to ameliorate these problems. Describe nonrepresentativeness statistically by reporting patterns and distributions of missing data and contaminations. Document how the actual analysis differs from the analysis planned before complications arose. The use of techniques to ensure that the reported results are not produced by anomalies in the data (e.g., outliers, points of high influence, nonrandom missing data, selection bias, attrition problems) should be a standard component of all analyses.

As soon as you have collected your data, before you compute any statistics, look at your data. Data screening is not data snooping. It is not an opportunity to discard data or change values to favor your hypotheses. However, if you assess hypotheses without examining your data, you risk publishing nonsense.

Computer malfunctions tend to be catastrophic: A system crashes; a file fails to import; data are lost. Less well-known are more subtle bugs that can be more catastrophic in the long run. For example, a single value in a file may be corrupted in reading or writing (often in the first or last record). This circumstance usually produces a major value error, the kind of singleton that can make large correlations change sign and small correlations become large.

Graphical inspection of data offers an excellent possibility for detecting serious compromises to data integrity. The reason is simple: Graphics broadcast; statistics narrowcast. Indeed, some international corporations that must defend themselves against rapidly evolving fraudulent schemes use real-time graphic displays as their first line of defense and statistical analyses as a distant second. The following example shows why.

Figure 1 shows a scatter-plot matrix (SPLOM) of three variables from a national survey of approximately 3,000 counseling clients (Chartrand, 1997). This display, consisting of pairwise scatter plots arranged in a matrix, is found in most modern statistical packages. The diagonal cells contain dot plots of each variable (with the dots stacked like a histogram) and scales used for each variable. The three variables shown are questionnaire measures of respondent's age (AGE), gender (SEX), and number of years together in current relationship (TOGETHER). The graphic in Figure 1 is not intended for final presentation of results; we use it instead to locate coding errors and other anomalies before we analyze our data. Figure 1 is a selected portion of a computer screen display that offers tools for zooming in and out, examining points, and linking to information in other graphical displays and data editors. SPLOM displays can be used to recognize unusual patterns in 20 or more variables simultaneously. We focus on these three only.

[Figure 1. Scatter-Plot Matrix of AGE, SEX, and TOGETHER. Note. M = male; F = female.]

There are several anomalies in this graphic. The AGE histogram shows a spike at the right end, which corresponds to the value 99 in the data. This coded value most likely signifies a missing value, because it is unlikely that this many people in a sample of 3,000 would have an age of 99 or greater. Using numerical values for missing value codes is a risky practice (Kahn & Udry, 1986).

The histogram for SEX shows an unremarkable division into two values. The histogram for TOGETHER is highly skewed, with a spike at the lower end presumably signifying no relationship. The most remarkable pattern is the triangular joint distribution of TOGETHER and AGE. Triangular joint distributions often (but not necessarily) signal an implication or a relation rather than a linear function with error. In this case, it makes sense that the span of a relationship should not exceed a person's age. Closer examination shows that something is wrong here,