Reliability and
Validity
Ed 510
Applications of Educational Research
Here are some
terms that will help you master course materials. Define them and
create examples of your own.
-
Statistical properties
of tests
-
Reliability
-
Measurement error
-
Validity
-
Content validity
-
Statistical validity
-
Estimation error
Introduction
All tests are
expected to demonstrate adequate statistical characteristics. These
include reliability, validity and the absence of measurement error (accuracy).
-
Validity is defined
as the extent to which a score measures the underlying construct that it
claims to measure.
-
Reliability is
the extent to which a score measures an underlying construct with stability
and consistency.
-
Accuracy is the
extent to which a score minimizes, to the greatest extent possible, the error
introduced by unpredictable individual circumstances, so that the score
represents reliable variance.
These three terms
have statistical meanings and hence are technical characteristics of sound
tests.
They are related
to test fairness, and a fair test must have all three characteristics. However,
none of the three (validity, reliability, accuracy) is itself a definition of fairness.
Two kinds
of validity
-
Logical validity
-
Statistical validity
Logical validity
includes content validity and face validity: Logical validity implies that
the content and look of a test correspond to the purposes of the test.
Content validity
is a type of validity that is relevant only to educational tests. A test
has achieved content validity when the following conditions are true: The
test
-
Represents the
content covered by a curriculum or in units of instruction that are associated
with a particular test.
-
Covers the objectives
of instruction as they are represented in a curriculum or unit of instruction
associated with a test.
-
Represents content
in the same proportion that is represented in a curriculum or unit of instruction
associated with a test.
-
Employs item difficulties
that correlate with the levels of difficulty of the knowledge and skill that are
represented in the curriculum or unit of instruction associated with a
test.
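The proportionality condition above can be checked with simple arithmetic. The following is a minimal sketch, assuming an invented three-topic unit; the topic names, the curriculum shares, the item counts and the helper `proportion_gaps` are all hypothetical and serve only as illustration.

```python
# Hypothetical share of instructional time per topic (sums to 1.0).
curriculum_share = {"fractions": 0.40, "decimals": 0.35, "percent": 0.25}

# Hypothetical number of test items written for each topic.
item_counts = {"fractions": 20, "decimals": 12, "percent": 8}


def proportion_gaps(curriculum_share, item_counts):
    """Return test share minus curriculum share for every topic.

    A gap near zero means the test represents the topic in the same
    proportion as the curriculum does.
    """
    total_items = sum(item_counts.values())
    return {
        topic: item_counts.get(topic, 0) / total_items - share
        for topic, share in curriculum_share.items()
    }


gaps = proportion_gaps(curriculum_share, item_counts)
for topic, gap in gaps.items():
    print(f"{topic}: test share differs from curriculum share by {gap:+.2f}")
```

A positive gap suggests a topic is over-represented on the test, a negative gap that it is under-represented. How large a gap is tolerable is a judgment the test developer has to make.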
-
Face validity
refers to the appearance of items. Face validity is demonstrated when items
look like they correspond to the intended content of a test. For example,
IQ tests consist of items that assess verbal knowledge, general knowledge
and mathematical skill. Since intelligence, the construct, is defined as
the extent to which individuals have learned from experience, particularly
school experiences, items on IQ tests are argued to have face validity.
However, some theorists prefer to define intelligence in ways that
are free of language content, arguing that language biases tests culturally.
Therefore, culture free tests of IQ use abstract content to assess intelligence.
It is difficult to say whether these tests have face validity because the
items themselves may or may not correspond to the given definition of culture
free intelligence.
Statistical validity
includes criterion related validity, predictive validity and construct
validity.
-
Criterion validity
is demonstrated whenever test scores that measure a construct are positively
correlated with scores on a related but independent measure of the same
construct. For example when the construct, self esteem, is measured using
a newly developed test it is important to show that the scores generated
by the new test are positively correlated with an established test of self
esteem, for example the Tennessee Self Concept Battery. When the scores
are correlated then one can infer that both tests evaluate the construct
of self esteem.
The correlation
coefficients should range between .60 and .75 to show similarity. Why would
coefficients higher than .75 begin to pose problems for the new test?
-
Predictive validity
is demonstrated whenever test scores from two different tests that measure
the same construct are positively correlated and the correlation holds
up over a period of time. For example, when SAT scores are positively correlated
with GRE scores, the predictive validity of the SAT test battery can
be said to be demonstrated. The interval of time is crucial. Studies of
predictive validity are frequently performed when it is important to show
that students will persist as undergraduates to the end of the freshman
year of college, or that law students will achieve a grade point average
that will ensure that they finish the law degree.
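Criterion and predictive validity both rest on a correlation between two sets of scores. The following is a minimal sketch of that calculation, assuming invented scores for eight examinees. The lists `new_test` and `established_test` and the helper `pearson_r` are illustrative names, not part of the course materials.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross products of deviations from each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores (invented for illustration only).
new_test = [12, 15, 9, 20, 17, 11, 14, 18]
established_test = [30, 31, 27, 38, 33, 31, 29, 36]

r = pearson_r(new_test, established_test)
print(f"correlation between the two tests: r = {r:.2f}")
```

For a predictive validity study the second list would hold scores collected after the interval of time of interest; the arithmetic is the same.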
Construct validity
is demonstrated in a two step process. Construct validity requires that
two other kinds of validity can be shown: convergent validity and
discriminant (or divergent) validity.
-
Convergent validity
is demonstrated when two independent but related measures of the same construct
are positively correlated. For example, scores on the Piers Harris self concept
test should be positively correlated with scores on the Tennessee subtest for
positive self esteem.
-
Discriminant validity
is demonstrated when unlike measures are uncorrelated, or when measures of
opposite constructs are inversely correlated. Thus the discriminant validity
of the Piers Harris self concept test would be demonstrated if there were
negative correlations between the Piers Harris score and the Tennessee subtest
of negative self esteem.
When these patterns of correlation are systematic and the differences between
the correlations are statistically significant, then a test has demonstrated
construct validity.
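The two step pattern can be sketched in Python as follows. All scores are invented, and the variable names and the helper `pearson_r` are hypothetical. The sketch checks only the direction of each correlation; it does not test whether the difference between the correlations is statistically significant, which the full argument for construct validity also requires.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores for six examinees (invented for illustration only).
self_concept = [55, 62, 48, 70, 66, 51]          # measure being validated
positive_self_esteem = [40, 45, 36, 52, 47, 39]  # like measure
negative_self_esteem = [30, 24, 35, 18, 22, 33]  # opposite measure

r_convergent = pearson_r(self_concept, positive_self_esteem)
r_discriminant = pearson_r(self_concept, negative_self_esteem)

print(f"convergent r = {r_convergent:.2f} (expected to be positive)")
print(f"discriminant r = {r_discriminant:.2f} (expected to be near zero or negative)")
```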
Reliability
refers to the consistency and stability of test scores. There are several
kinds of reliability. These include:
-
Reliability over
time - retest reliability
-
Internal consistency
- alternate form reliability, split halves, item total and inter item reliability
-
Inter subjective
agreement - consistency between raters
-
Intra subjective
agreement - consistency within the evaluations made by a single rater
The table
below compares the various types of reliability.
| Type | Method | Statistic(s) used | Key idea |
|---|---|---|---|
| Retest | The same test is administered to the same group of examinees on two different occasions. | Pearson product moment correlation coefficient | Scores at one administration of a test will predict scores at the second administration of the same test. Although individual scores may change in relation to the test mean, the relative position among examinees should remain stable. |
| Internal consistency: alternate forms | Groups of items are randomly assigned to two or more forms of a test. | Pearson product moment correlation coefficient | The likelihood of getting any item correct or incorrect on a test should be positively correlated with the performance on the test as a whole, or with other items on the same test, provided the assumption that all items are from the same logical domain is correct. |
| Internal consistency: split halves | Items on a test are randomly divided so that one test becomes two halves of the same test. | Spearman Brown prophecy formula | As for alternate forms. |
| Internal consistency: item total | Performance on individual items is correlated with total test scores. | Biserial correlations | As for alternate forms. |
| Internal consistency: inter item | Performance on each individual item is correlated with all other individual items. | Biserial or point biserial correlations | As for alternate forms. |
| Inter rater | Raters are trained to criterion levels of proficiency in judging performance. Then the judgments of two or more raters are compared. | Percentage of agreement or non parametric measures of concordance, for example the gamma statistic or Kendall's tau | Raters are expected to achieve a high level of agreement in the ratings they assign. |
| Intra rater | One rater, trained to criterion levels of proficiency in judging performance, evaluates performance over time. | Percentage of agreement | A trained rater should evaluate performance consistently over time. |
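Two of the statistics in the table are easy to sketch in Python: split half reliability stepped up with the Spearman Brown formula, and percentage of agreement between raters. The item responses and ratings below are invented, and the helper names (`pearson_r`, `spearman_brown`, `percent_agreement`) are hypothetical. The sketch splits items into odd and even positions for simplicity, which is an assumption; a random split works the same way.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation between two halves."""
    return 2 * r_half / (1 + r_half)


def percent_agreement(ratings_a, ratings_b):
    """Percentage of cases on which two sets of ratings are identical."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100 * matches / len(ratings_a)


# Hypothetical right (1) / wrong (0) responses: one row per examinee, one column per item.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1],
]

# Score each half of the test separately for every examinee.
odd_half = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]

r_half = pearson_r(odd_half, even_half)
r_full = spearman_brown(r_half)
print(f"half-test correlation = {r_half:.2f}, stepped-up reliability = {r_full:.2f}")

# Hypothetical judgments from two trained raters on the same five performances.
rater_a = ["pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail"]
print(f"inter rater agreement = {percent_agreement(rater_a, rater_b):.0f}%")
```

The same `percent_agreement` helper applies to intra rater reliability if the two lists hold one rater's judgments of the same performances on two occasions.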
Statistical
accuracy is measured using a statistic referred to as the standard error
of measurement.
The idea of
statistical error should be understood as a way to estimate the extent
to which the unpredictability of individuals diminishes the accuracy of
test scores. This happens because individuals take tests under less than
ideal conditions. They may become bored, sleepy or hungry. They may enter
the test situation in less than perfect condition themselves. They may
have been in an argument or a traffic jam on the way to the examination, they
may not have had the time to prepare for all parts of the examination,
or they may have the flu. The important inference to be made is that the
test cannot be designed to anticipate all of these individual circumstances.
Therefore, even the most soundly designed test will produce errors of measurement.
These errors reduce the accuracy with which test scores assess true levels
of knowledge or ability.
The standard
error of measurement describes how much error on average can be found in
an individual score selected at random from a group of scores achieved
on the same test.
Here is the
statistical formula that is used to evaluate the standard error of measurement.
$$\mathrm{sem} = S^2_{x_1,x_2}\,\left(1 - r^2_{x_1,x_2}\right)^{1/2}$$
An examination
of the parts of the formula tells a lot about the nature of errors of measurement.
-
The number 1 represents
a perfect correlation between tests or items on a test. No errors!
-
The squared value
of the reliability coefficient represents how much predictability there
is in fact between two tests or items on a test.
-
1 minus
the squared reliability measures the error. The difference between perfection
and reality!
-
The square root
is taken of 1 minus the squared reliability so that the error can be averaged
over all who took the test. This is the calculated error.
-
$S^2_{x_1,x_2}$
is the pooled variance for two forms of a test, two tests, or an item and
a total score. It represents the average amount of variability in all test
scores for all individuals. It is multiplied by the calculated error so
that the average amount of error can be spread over all the examinees on
all tests.
Thus the error
of measurement tells us how likely a test is to be wrong in the assessment
of individual knowledge or ability.
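A short worked example may help. The sketch below computes the formula as it is stated in this lesson (pooled variance multiplied by the square root of 1 minus the squared reliability). For comparison it also computes the form found in many measurement textbooks, which uses the standard deviation and the unsquared reliability coefficient. The function names and all numeric values are hypothetical.

```python
from math import sqrt


def sem_as_stated(pooled_variance, r):
    """SEM as written in this lesson: pooled variance times sqrt(1 - r squared)."""
    return pooled_variance * sqrt(1 - r ** 2)


def sem_textbook(standard_deviation, reliability):
    """Common textbook form: standard deviation times sqrt(1 - reliability)."""
    return standard_deviation * sqrt(1 - reliability)


# Hypothetical values, invented for illustration only.
pooled_variance = 16.0
r = 0.8

print(f"SEM (lesson formula)   = {sem_as_stated(pooled_variance, r):.2f}")
print(f"SEM (textbook formula) = {sem_textbook(sqrt(pooled_variance), r):.2f}")

# With a perfect correlation the calculated error vanishes under either form.
print(f"SEM when r = 1         = {sem_as_stated(pooled_variance, 1.0):.2f}")
```

In both forms the error shrinks as the reliability coefficient approaches 1 and grows as scores become more variable, which matches the reading of the formula's parts given above.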
Summarizing
question
Can you think
of examples where the terms used in this lesson have been misused in actual
school based practices?
Page created
January 5, 2001. Copyright Antonia D'Onofrio 2001/2002/2003.