Stats and Figures

 

AMBULATORY BLOCK
WINTER/SPRING 2005

 


PRACTICAL PRACTICE OF
MEDICINE

 

STATS & FIGURES

 

 

1. UpToDate: Glossary of common biostatistical and epidemiological terms

 

 


UpToDate Online 12.3

©2005 UpToDate®




 

Official reprint from UpToDate® www.uptodate.com

 

Glossary of common biostatistical
and epidemiological terms


Peter
A L Bonis, MD

UpToDate performs a
continuous review of over 330 journals and other resources. Updates are added
as important new information is published. The literature review for version
12.3 is current through August 2004; this topic was last changed on July 28, 2004.
The next version of UpToDate (13.1) will be released in February 2005.

INTRODUCTION — This topic review will provide
a catalog of common biostatistical and epidemiological terms encountered in the
medical literature. A list of textbooks that are geared toward health
professionals interested in these topics is provided in the references [1-8].

STATISTICS THAT
DESCRIBE HOW DATA ARE DISTRIBUTED

Measures of central
tendency
— Three
measures of central tendency are most frequently used to describe data:

• Mean equals the sum of observations divided by the number of observations.

• Median equals the observation in the middle when all observations are ordered from smallest to largest; when there is an even number of observations, the median is defined as the mean of the middle two data points.

• Mode equals the observation that occurs most frequently.

Measures of dispersion — Dispersion (or variance) refers
to the degree to which data are scattered around a specific value (such as the
mean). The most commonly used measures of dispersion are:

• Range — The range equals the difference between the largest and smallest observation.

• Standard deviation — The standard deviation measures the variability of data around the mean. It provides information on how much variability can be expected among individuals within a population. In a normally distributed population, approximately 68 and 95 percent of values fall within one and two standard deviations of the mean, respectively.

• Standard error of the mean — The standard deviation of a sample should be distinguished from the standard error of the mean, which describes how much variability can be expected when the mean is measured in several different samples.

• Percentile — The percentile equals the percentage of a distribution that is below a specific value. As an example, a child is in the 90th percentile for weight if only 10 percent of children the same age weigh more than she does.

• Interquartile range — The interquartile range refers to the upper and lower values defining the central 50 percent of observations. The boundaries are equal to the observations representing the 25th and 75th percentiles. The interquartile range is depicted in a box and whiskers plot (show figure 1).
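
These descriptive measures can be computed directly. The following minimal Python sketch (the blood pressure values are hypothetical and chosen only for illustration) calculates each of them with the standard library:

    # Hypothetical sample: systolic blood pressures (mmHg) from 11 patients.
    import statistics

    values = [112, 118, 118, 121, 124, 126, 130, 133, 137, 142, 155]

    mean = statistics.mean(values)                  # sum of observations / number of observations
    median = statistics.median(values)              # middle observation
    mode = statistics.mode(values)                  # most frequent observation (118)
    value_range = max(values) - min(values)         # largest minus smallest observation
    sd = statistics.stdev(values)                   # sample standard deviation
    sem = sd / len(values) ** 0.5                   # standard error of the mean
    q1, q2, q3 = statistics.quantiles(values, n=4)  # 25th, 50th, and 75th percentiles
    iqr = q3 - q1                                   # interquartile range

    print(mean, median, mode, value_range, sd, sem, iqr)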

TERMS USED TO DESCRIBE
THE FREQUENCY OF AN EVENT
— Incidence and prevalence are the two main terms used to describe
the frequency of an event.

Incidence — Incidence represents the number
of new events that have occurred in a specific time interval divided by the
population at risk at the beginning of the time interval. The result gives the
likelihood of developing an event in that time interval.

Prevalence — Prevalence refers to the number
of individuals with a given disease at a given point in time divided by the
population at risk at that point in time.
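
The arithmetic can be made concrete with a small sketch; the counts below are invented for illustration:

    # Incidence: new events during a time interval divided by the population at risk
    # at the beginning of the interval (hypothetical one-year follow-up of 1,000 people).
    new_cases_during_year = 25
    at_risk_at_start = 1000
    incidence = new_cases_during_year / at_risk_at_start   # 0.025, or 25 per 1,000 per year

    # Prevalence: existing cases at a point in time divided by the population at risk.
    existing_cases_today = 80
    population_today = 1000
    prevalence = existing_cases_today / population_today   # 0.08, or 8 percent

    print(incidence, prevalence)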

TERMS USED TO DESCRIBE
THE MAGNITUDE OF AN EFFECT
— The types of descriptors used to define the relationship among
variables of interest in a data set and the effect of one variable on another
depend upon the type of data. Important examples are the relative risk and odds
ratio, which are commonly encountered expressions describing the relationship
between nominal characteristics (ie, variables that are grouped as unique
categories) (show figure 2).

Relative risk and
cohort studies

— The relative risk (or risk ratio) equals the incidence in exposed individuals
divided by the incidence in unexposed individuals. The relative risk can be
calculated from studies in which the proportion of patients exposed and
unexposed to a risk is known. An example is a cohort study, in which a group of
patients who have variable exposure to a risk factor of interest are followed
over time for an outcome. The Nurses’ Health Study is an example of a cohort study: a large number of nurses report their dietary fiber intake and are then followed over time for an outcome such as colon cancer. The incidence of colon cancer in those with higher and lower fiber intake is compared to determine whether fiber intake is a risk factor (or a protective factor) for colon cancer.
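
A minimal sketch of the calculation, using invented counts rather than data from the Nurses’ Health Study:

    # Hypothetical cohort: 1,000 people with low fiber intake (the "exposed" group)
    # and 1,000 people with high fiber intake, followed for the same period.
    exposed_cases, exposed_total = 30, 1000
    unexposed_cases, unexposed_total = 15, 1000

    incidence_exposed = exposed_cases / exposed_total        # 0.030
    incidence_unexposed = unexposed_cases / unexposed_total  # 0.015
    relative_risk = incidence_exposed / incidence_unexposed  # 2.0

    print(relative_risk)   # exposed individuals have twice the risk of the outcome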

Odds ratio and
case-control studies

— The odds ratio equals the odds that an individual with a specific condition
has been exposed to a risk factor divided by the odds that a control has been
exposed. The odds ratio is used in case-control studies. In this type of study,
patients with a disease are identified and compared with matched controls for
exposure to a risk factor. This design does not permit measurement of the
proportion of the population who were exposed to the risk factor and then
developed or did not develop the disease; thus, the relative risk or the
incidence of disease cannot be calculated. However, in case-control studies,
the odds ratio provides a reasonable estimate of the relative risk (show figure 2).

If one were to perform a
case-control study to assess the role of dietary fiber in colon cancer as noted
above for the cohort study, a group of patients with colon cancer would be
compared with matched controls without colon cancer; the fiber intake in the
two groups would then be compared. The case-control study is most useful for
uncommon diseases in which a very large cohort would be required to accumulate
enough cases for analysis.

The relative risk and
odds ratio are interpreted relative to the number one. An odds ratio of 0.6,
for example, suggests that patients exposed to a variable of interest were 40
percent less likely to develop a specific outcome compared to the control
group. Similarly, an odds ratio of 1.5 suggests that the risk was increased by
50 percent.
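
A minimal sketch of the odds ratio calculation, using an invented 2x2 table (the counts are chosen so the result matches the 1.5 example above):

    # Hypothetical case-control counts; the exposure is low dietary fiber intake.
    cases_exposed, cases_unexposed = 60, 40          # patients with colon cancer
    controls_exposed, controls_unexposed = 50, 50    # matched controls

    odds_cases = cases_exposed / cases_unexposed           # 1.5
    odds_controls = controls_exposed / controls_unexposed  # 1.0
    odds_ratio = odds_cases / odds_controls                # 1.5

    print(odds_ratio)   # exposure is associated with a 50 percent increase in risk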

Absolute risk — The relative risk and odds
ratio provide an understanding of the magnitude of risk compared with a
standard. However, it is often also desirable to know the absolute risk. As an example, a 40 percent increase in mortality due to a
particular exposure does not provide direct insight into the likelihood that
exposure in an individual patient will lead to mortality.

The “attributable
risk” (also called the risk difference) is a measure of absolute risk. It
reflects the additional incidence of disease related to an exposure taking into
account the background rate of the disease. The attributable risk is calculated
by subtracting the incidence of a disease in nonexposed persons from the
incidence of disease in exposed persons.

A related term, the
“population attributable risk” is used to describe the contribution
that an exposure has on the incidence of a specific disease in a population. It
is calculated by multiplying the attributable risk by the prevalence of exposure
to a risk factor in a population. The population attributable risk is
particularly important when considering public health measures and the
allocation of resources intended to reduce the incidence of a disease.
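
A short sketch with invented incidences and exposure prevalence:

    # Hypothetical one-year incidences and prevalence of exposure.
    incidence_exposed = 0.030      # incidence of disease in exposed persons
    incidence_unexposed = 0.010    # background incidence in nonexposed persons
    exposure_prevalence = 0.25     # fraction of the population that is exposed

    attributable_risk = incidence_exposed - incidence_unexposed             # 0.020
    population_attributable_risk = attributable_risk * exposure_prevalence  # 0.005

    print(attributable_risk, population_attributable_risk)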

Number needed to treat — The benefit of an intervention
can be expressed by the “number needed to treat” (NNT). NNT is the
reciprocal of the absolute risk reduction (the absolute adverse event rate for
placebo minus the absolute adverse event rate for treated patients). Its
interpretation can be illustrated by the following sentence: “This study
suggests that I would have to treat five patients with a drug to prevent one
death.”

As an example, consider a
placebo-controlled trial involving 100 patients. Thirty patients died during
the study period (10 receiving active drug and 20 receiving placebo) giving a
mortality rate of 20 percent with active drug versus 40 percent with placebo (show figure 3). The difference between these two rates, the
“risk difference”, is used to calculate NNT.

• 40 percent minus 20 percent = 20 percent = 0.2

• 1 divided by 0.2 = 5

Thus, this study suggests
that only five patients need to be treated with the drug (compared with
placebo) to prevent one death.
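
The same arithmetic as a small sketch, assuming (consistent with the stated rates) that the 100 patients were split evenly between the two arms:

    # Worked example from the text: 100-patient placebo-controlled trial.
    deaths_placebo, n_placebo = 20, 50
    deaths_drug, n_drug = 10, 50

    rate_placebo = deaths_placebo / n_placebo           # 0.40
    rate_drug = deaths_drug / n_drug                    # 0.20
    absolute_risk_reduction = rate_placebo - rate_drug  # 0.20, the "risk difference"
    nnt = 1 / absolute_risk_reduction                   # 5 patients treated to prevent one death

    print(nnt)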

Because it is intuitive,
the NNT has become an increasingly popular expression of absolute benefit or
risk, potentially allowing for comparison of the relative benefit (or harm) of
different interventions. However, the NNT can be misleading:

• It implies that the option is to treat or not to treat rather than to treat or switch to another more effective treatment [9].

• There are variations on how NNT is determined; NNTs from different studies cannot be compared unless the methods used to determine them are identical [10]. This may be a particular consideration when NNTs are calculated for treatment of chronic diseases in which outcomes (such as mortality) do not cluster in time.

• Calculation of the NNT depends upon the control rate (ie, the rate of events in the control arm), which can be variable (particularly in small controlled trials, which are more vulnerable to random effects). As a result, the NNT may not accurately reflect the benefit of an intervention if events occurred in the control arm more or less often than would be expected based upon the biology of the disease. This effect can be particularly problematic when comparing the NNTs among placebo-controlled trials (show figure 3) [11].

TERMS USED TO DESCRIBE
THE QUALITY OF MEASUREMENTS
— The most commonly used measures to describe the quality of an
observation are reliability and validity.

Reliability — Reliability refers to the
extent to which repeated measurements of a relatively stable phenomenon fall
closely to each other. Several different types of reliability can be measured.
Examples include inter- and intraobserver reliability and test-retest
reliability.

Validity — Validity refers to the extent
to which an observation reflects the “truth” of the phenomenon being
measured. Several types can be measured such as content (the extent to which
the measure reflects the dimensions of a particular problem), construct (the
extent to which a measure is affirmed by an external established indicator),
and criterion validity (the extent to which a measure can predict an observable
phenomenon). These types of validity are often applied to questionnaires, in
which the truth is not physically verifiable.

MEASURES OF DIAGNOSTIC
TEST ACCURACY

The most common terms used to describe the accuracy of a diagnostic test are
sensitivity and specificity (show figure 4).

Sensitivity — The number of patients with a
positive test who have a disease divided by all patients who have the disease.
A test with high sensitivity will not miss many patients who have the disease
(ie, few false negative results).

Specificity — The number of patients who have
a negative test and do not have the disease divided by the number of patients
who do not have the disease. A test with high specificity will infrequently
identify patients as having a disease when they do not (ie, few false positive
results).
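
A minimal sketch using an invented 2x2 table of test results against a gold standard:

    # Hypothetical counts: 100 patients with the disease, 200 without it.
    true_positive = 90     # disease present, test positive
    false_negative = 10    # disease present, test negative
    true_negative = 170    # disease absent, test negative
    false_positive = 30    # disease absent, test positive

    sensitivity = true_positive / (true_positive + false_negative)   # 0.90
    specificity = true_negative / (true_negative + false_positive)   # 0.85

    print(sensitivity, specificity)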

Sensitivity and
specificity are properties of tests that should be considered when tests are
obtained. In addition, sensitivity and specificity are interdependent. Thus,
for a given test, an increase in sensitivity is accompanied by a decrease in
specificity and vice versa. This can be illustrated by the following example.
Consider two populations of patients: one has chronic hepatitis as defined by a
gold standard, and the other does not. The diagnostic test being used to
evaluate for chronic hepatitis is the serum alanine aminotransferase (ALT)
concentration. The sensitivity and specificity of the ALT depend upon the value chosen as a cutoff (show figure 5).

The interdependence of
sensitivity and specificity can be depicted graphically using a receiver
operating characteristic curve (ROC). The ROC curve plots sensitivity on the Y
axis, and 1-specificity (which is the false positive rate) on the X axis. The
area under the ROC curve gives an estimate of the accuracy of a test. An ideal test would have a cutoff value that perfectly discriminated those with disease from those without, and would have an area under the ROC curve of 1.00 (show figure 6). The ROC curve can be adapted to multivariate
analysis (such as logistic regression) in which it provides an estimate of the
accuracy of the statistical model (ie, how well it predicts an outcome).
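
The construction can be sketched in a few lines; the ALT values below are invented solely to show how each candidate cutoff contributes one point on the curve and how the area is obtained:

    # Hypothetical ALT values (IU/L) for patients with and without chronic hepatitis.
    alt_hepatitis = [35, 48, 55, 62, 70, 88, 95, 120, 150, 210]   # gold standard positive
    alt_healthy = [12, 15, 18, 20, 22, 25, 28, 30, 38, 52]        # gold standard negative

    points = []   # one (1 - specificity, sensitivity) point per candidate cutoff
    for cutoff in sorted(set(alt_hepatitis + alt_healthy), reverse=True):
        sensitivity = sum(v >= cutoff for v in alt_hepatitis) / len(alt_hepatitis)
        specificity = sum(v < cutoff for v in alt_healthy) / len(alt_healthy)
        points.append((1 - specificity, sensitivity))

    # Area under the ROC curve by the trapezoidal rule (1.00 = perfect discrimination).
    points = sorted(points + [(0.0, 0.0), (1.0, 1.0)])
    auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))
    print(auc)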

Predictive values — In addition to sensitivity and
specificity, the predictive values of a diagnostic test must be considered when
interpreting the results of a test. The positive predictive value of a test
represents the likelihood that a patient with a positive test has the disease.
Conversely, the negative predictive value represents the likelihood that a
patient who has a negative test is free of the disease (show figure 4).

The predictive values
(and the proportion of positive and negative evaluations that can be expected)
depend upon the prevalence of a disease within a population. Thus, for given
values of sensitivity and specificity, a patient with a positive test is more
likely to truly have the disease if the patient belongs to a population with a
high prevalence of the disease (show figure 7). This observation has significant implications
for screening tests, in which false positive results may lead to expensive and
sometimes dangerous testing, and false negative tests may be associated with
morbidity or mortality. As an example, a positive stool test for occult blood is much more likely to predict colon cancer in a seventy-year-old than in a twenty-year-old. Thus, routine screening of stools in young patients would mostly generate false positive results leading to unnecessary follow-up examinations, and is not recommended. The predictive values of a test should be considered when selecting
among diagnostic tests for an individual patient in whom demographic or other
clinical risk factors influence the likelihood that the disease is present (ie,
the “prior probability” of the disease).
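
The dependence on prevalence can be demonstrated with a short sketch; the test characteristics and prevalences below are hypothetical:

    # Fixed hypothetical test characteristics; only the prevalence changes.
    sensitivity, specificity = 0.90, 0.95

    def positive_predictive_value(prevalence):
        true_positives = sensitivity * prevalence
        false_positives = (1 - specificity) * (1 - prevalence)
        return true_positives / (true_positives + false_positives)

    print(positive_predictive_value(0.20))    # high-prevalence population: about 0.82
    print(positive_predictive_value(0.001))   # low-prevalence screening: about 0.02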

Likelihood ratio — As discussed above, a
limitation to predictive values as expressions of test characteristics is their
dependence upon disease prevalence. To overcome this limitation, the likelihood
ratio has been increasingly used as an expression of the performance of
diagnostic tests [12]. The likelihood ratio expresses how much a given test result changes the odds of having a disease relative to the pretest odds, and the estimate is independent of the disease prevalence. A positive likelihood ratio is calculated by dividing sensitivity by 1 minus specificity (sensitivity/(1-specificity)). Similarly, a negative likelihood ratio is calculated by dividing 1 minus sensitivity by specificity ((1-sensitivity)/specificity). Positive and negative likelihood ratios of 9 and 0.25, for example, mean that a positive result is seen 9 times as frequently, and a negative result 0.25 times as frequently, in those with a specific condition as in those without it.
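
A short sketch of the calculation, and of how a likelihood ratio converts pretest odds into post-test odds; the test characteristics and pretest probability are hypothetical:

    # Hypothetical test characteristics.
    sensitivity, specificity = 0.90, 0.90

    lr_positive = sensitivity / (1 - specificity)    # 9.0
    lr_negative = (1 - sensitivity) / specificity    # about 0.11

    # Post-test odds = pretest odds x likelihood ratio.
    pretest_probability = 0.30
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * lr_positive
    posttest_probability = posttest_odds / (1 + posttest_odds)

    print(lr_positive, lr_negative, posttest_probability)   # probability rises to about 0.79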

Accuracy — The performance of a diagnostic
test is sometimes expressed as accuracy, which refers to the number of true
positives and true negatives divided by the total number of observations (show figure 8). However, accuracy by itself is not a good
indicator of test performance since it obscures important information related
to its component parts.
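
Using the hypothetical 2x2 counts from the sensitivity and specificity sketch above:

    # Accuracy = (true positives + true negatives) / all observations.
    accuracy = (90 + 170) / (90 + 10 + 170 + 30)   # about 0.87
    print(accuracy)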

EXPRESSIONS USED WHEN
MAKING INFERENCES ABOUT DATA

Confidence interval — A point estimate (ie, a single
value) from a sample population may not reflect the “true” value from
the entire population. As a result, it is often helpful to provide a range that
is likely to include the true value. A confidence interval is a commonly used way to express such a range. The boundaries of a confidence interval give values within which
there is a high probability (95 percent by convention) that the true population
value can be found. The calculation of a confidence interval considers the
standard deviation of the data and the number of observations. Thus, a
confidence interval narrows as the number of observations increases, or its
variance (dispersion) decreases.
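
A minimal sketch for the mean of a sample, using the conventional large-sample multiplier of 1.96 (a t multiplier would give a slightly wider interval for a sample this small); the glucose values are hypothetical:

    import statistics

    # Hypothetical fasting glucose measurements (mg/dL).
    sample = [88, 92, 95, 97, 99, 101, 103, 105, 110, 118]

    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / len(sample) ** 0.5   # standard error of the mean

    # Approximate 95 percent confidence interval for the population mean.
    lower, upper = mean - 1.96 * sem, mean + 1.96 * sem
    print(mean, (lower, upper))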

Errors — Two potential errors are
commonly recognized when testing a hypothesis:

• A type I error (also known as alpha) is the probability of incorrectly concluding that there is a statistically significant difference in a dataset. Alpha is the threshold against which the reported p-value is compared. Thus, a statistically significant difference reported as p<0.05 means that there is less than a 5 percent probability that the observed difference occurred by chance.

• A type II error (also known as beta) is the probability of incorrectly concluding that there was no statistically significant difference in a dataset. This error often reflects insufficient power of the study.

Power — The term “power”
(calculated as 1 – beta) refers to the ability of a study to detect a true
difference. Negative findings in a study may reflect that the study was
underpowered to detect a difference. A “power calculation” should be
performed prior to conducting a study to be sure that there are a sufficient
number of observations to detect a desired degree of difference. The larger the difference, the fewer observations will be required. As an example,
it takes fewer patients to detect a 50 percent difference in blood pressure
from a new antihypertensive medication compared with placebo than a 5 percent
difference.
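
A sketch of a standard sample-size approximation for comparing two means with a two-sided alpha of 0.05 and 80 percent power (z multipliers of 1.96 and 0.84); the blood pressure figures are hypothetical:

    # Approximate patients per group: n = 2 * (z_alpha + z_beta)^2 * sd^2 / difference^2
    def n_per_group(sd, difference, z_alpha=1.96, z_beta=0.84):
        return 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / difference ** 2

    sd = 15.0                     # standard deviation of systolic blood pressure, mmHg
    print(n_per_group(sd, 20.0))  # large treatment effect: about 9 patients per group
    print(n_per_group(sd, 2.0))   # small treatment effect: about 880 patients per group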

TERMS USED IN
MULTIVARIATE ANALYSIS

— The effect of more than one variable often needs to be considered when
predicting an outcome. As an example, the effect of smoking status and age
needs to be simultaneously considered when assessing the risk of lung cancer.

Statistical methods that
can simultaneously account for multiple variables are known as
“multivariate” (or multivariable) analysis. Two of the most commonly
encountered are multiple regression and logistic regression.

Multiple regression — Multiple regression is used for
performing multivariate analysis when the outcome is a continuous variable,
such as blood pressure. Thus, for example, a patient’s systolic blood pressure
can be predicted from a multivariate model by adding together the appropriately
weighted variables (such as age, gender, diastolic blood pressure, weight, etc).
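
A toy sketch of that prediction step; the intercept and coefficients are invented and would in practice be estimated by fitting the model to data:

    # Hypothetical fitted multiple regression model for systolic blood pressure (mmHg).
    intercept = 60.0
    coefficients = {"age": 0.5, "weight_kg": 0.3, "diastolic_bp": 0.6}

    patient = {"age": 50, "weight_kg": 80, "diastolic_bp": 85}

    # Prediction = intercept plus each variable multiplied by its weight.
    predicted_systolic = intercept + sum(coefficients[k] * patient[k] for k in coefficients)
    print(predicted_systolic)   # 60 + 25 + 24 + 51 = 160 mmHg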

Logistic regression — Logistic regression is similar
to multiple regression except the outcome is dichotomous (eg, alive or dead, or
a complication occurs or does not occur).

SURVIVAL ANALYSIS — Many examples of medical
research deal with an event that may or may not occur in a given period of time
(such as death, stroke, myocardial infarction). During the study, several
outcomes are possible in addition to the outcome of interest (eg, patients
might die of other causes or drop out from the analysis). Furthermore, the
duration of follow-up can vary among individuals in the study. A patient who is
observed for five years should count more in the statistical analysis than one
observed for five months.

Several methods are
available to account for these considerations. The most commonly used in
medical research are Kaplan-Meier and Cox proportional hazards analyses.

Kaplan-Meier analysis — Kaplan-Meier analysis calculates, at each point in time, the ratio of surviving patients (or those free from an outcome) to the total number of patients still at risk for the outcome. Every time a patient has an
outcome, the ratio is recalculated. Using these calculations, a curve can be
generated that graphically depicts the probability of survival (show figure 9).
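
A minimal sketch of the calculation with invented follow-up data; censored patients simply leave the risk set without changing the survival estimate:

    # Each tuple is (months of follow-up, event), where event = 1 means the outcome
    # occurred and event = 0 means the patient was censored at that time.
    observations = [(2, 1), (4, 0), (6, 1), (9, 1), (12, 0), (15, 1), (20, 0)]

    at_risk = len(observations)
    survival = 1.0
    curve = [(0, survival)]
    for time, event in sorted(observations):
        if event:                                   # the ratio is recalculated only at events
            survival *= (at_risk - 1) / at_risk
            curve.append((time, round(survival, 3)))
        at_risk -= 1                                # each patient leaves the risk set after their time

    print(curve)   # step function giving the probability of remaining event-free over time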

In many studies, the
benefit of a drug or intervention on an outcome is compared with a control
population, permitting the construction of two or more Kaplan-Meier curves.
Curves that are close together or cross are unlikely to reflect a statistically
significant difference. Several formal statistical tests can be used to assess
a significant difference. Examples include the log-rank test and the Breslow
test.

Cox proportional
hazards analysis

— Cox proportional hazards analysis is similar to logistic regression because
it can account for many variables that are relevant for predicting a
dichotomous outcome. However, unlike logistic regression, Cox proportional
hazards analysis permits time to be included as a variable, and for patients to
be counted only for the period of time in which they were observed.

The term “hazard
ratio” is sometimes used when referring to variables included in the
analysis. A hazard ratio is analogous to an odds ratio. Thus, a hazard ratio of
ten means that a group of patients exposed to a specific risk factor has ten
times the chance of developing the outcome compared with unexposed controls.

 

Use of UpToDate is subject to the Subscription
and License Agreement.

REFERENCES

1. Dawson-Saunders, B, Trapp, RG. Basic Clinical Biostatistics, 2nd ed, Appleton & Lange, Connecticut 1994.

2. Shott, S. Statistics for Health Professionals, WB Saunders, Philadelphia 1990.

3. Hulley, SB, Cummings, SR. Designing Clinical Research, Williams & Wilkins, Baltimore 1988.

4. Hennekens, CH, Buring, JE. Epidemiology in Medicine, Little, Brown, Boston 1987.

5. Fletcher, RH, Fletcher, SW, Wagner, EH. Clinical Epidemiology: The Essentials, 2nd ed, Williams & Wilkins, Baltimore 1988.

6. Kleinbaum, DG. Logistic Regression: A Self-Learning Text, Springer-Verlag, New York 1994.

7. Kleinbaum, DG. Survival Analysis: A Self-Learning Text, Springer-Verlag, New York 1996.

8. Hopkins, WG. A new view of statistics. Internet Society for Sport Science 2000 (http://www.sportsci.org/resource/stats/index.html).

9. Moriarty, PM. Relative risk reduction versus number needed to treat as measures of lipid-lowering trial results (editorial). Am J Cardiol 1998; 82:505.

10. Lubsen, J, Hoes, A, Grobbee, D. Implications of trial results: The potentially misleading notions of number needed to treat and average duration of life gained. Lancet 2000; 356:1757.

11. de Craen, AJ, Vickers, AJ, Tijssen, JGP, Kleijnen, J. Number needed to treat and placebo controlled trials. Lancet 1998; 351:310.

12. Weissler, AM. A perspective on standardizing the predictive power of noninvasive cardiovascular tests by likelihood ratio computation: 1. Mathematical principles. Mayo Clin Proc 1999; 74:1061.

 

 


 


