Reliability and Validity


Ed 510 Applications of Educational Research
 
 
 
 
Here are some terms that will help you master course materials.  Define them and create examples of your own.
 
  • Statistical properties of tests
  • Reliability
  • Measurement error
  • Validity
  • Content validity
  • Statistical validity
  • Estimation error


 

Introduction

All tests are expected to demonstrate adequate statistical characteristics.  These include reliability, validity and the absence of measurement error (accuracy).
 
 

These three terms have statistical meanings and hence are technical characteristics of sound tests.

They are related to test fairness, and a fair test must have all three characteristics. However, neither validity, reliability, nor accuracy is defined as fairness.
 

Two kinds of validity

Logical validity includes content validity and face validity. It implies that the content and look of a test correspond to the purposes of the test.
 
Content validity is a type of validity that is relevant only to educational tests. A test has achieved content validity when the following conditions are true: The test
 
Face validity refers to the appearance of items. Face validity is demonstrated when items look like they correspond to the intended content of a test. For example, IQ tests consist of items that assess verbal knowledge, general knowledge and mathematical skill. Since intelligence, the construct, is defined as the extent to which individuals have learned from experience, particularly school experiences, items on IQ tests are argued to have face validity.

However, some definitions of IQ prefer to define intelligence in ways that are free of language content, arguing that language biases tests culturally. Culture-free tests of IQ therefore use abstract content to assess intelligence. It is difficult to say whether these tests have face validity, because the items themselves may or may not correspond to the given definition of culture-free intelligence.

Statistical validity includes criterion-related validity, predictive validity and construct validity.
 
  The correlation coefficients should range between .60 and .75 to show similarity. Why would coefficients higher than .75 begin to pose problems for the new test?
 
  Construct validity is demonstrated in a two-step process: two other kinds of validity must be shown. The first of these is convergent validity.
  Reliability refers to the consistency and stability of test scores. There are several kinds of reliability. These include:
 
The table below compares the various types of reliability.



 
 

Type: Retest
Method: The same test is administered to the same group of examinees on two different occasions.
Statistic(s) used: Pearson product moment correlation coefficient
Key idea: Scores at one administration of a test will predict scores at the second administration of the same test. Although individual scores may change in relation to the test mean, the relative position among examinees should remain stable.

Type: Internal consistency

  • alternate forms
    Method: Groups of items are randomly assigned to two or more forms of a test.
    Statistic(s) used: Pearson product moment correlation coefficient

  • split halves
    Method: Items on a test are randomly divided so that one test becomes two halves of the same test.
    Statistic(s) used: Spearman Brown prophecy formula

  • item total
    Method: Performance on individual items is correlated with total test scores.
    Statistic(s) used: Biserial correlations

  • inter item
    Method: Performance on individual items is correlated with all other individual items.
    Statistic(s) used: Biserial or point biserial correlations

Key idea: The likelihood of getting any item correct or incorrect on a test should be positively correlated with the performance on the test as a whole, or with other items on the same test, if the assumption that all items are from the same logical domain is correct.

Type: Inter rater
Method: Raters are trained to criterion levels of proficiency in judging performance. Then the judgments of two or more raters are compared.
Statistic(s) used: Percentage of agreement or non parametric measures of concordance, for example the gamma statistic or Kendall's tau
Key idea: Raters are expected to achieve a high level of agreement in the ratings they assign.

Type: Intra rater
Statistic(s) used: Percentage of agreement
Key idea: One rater who has been trained to criterion levels of proficiency in judging performance should evaluate performance consistently over time.
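
Three of the statistics named in the table can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the scores and ratings are made-up numbers rather than data from the course, and the Spearman Brown step assumes the usual case of a test split into two halves.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product moment correlation between two lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full-length test.

    Assumes the test was split into exactly two halves.
    """
    return 2 * r_half / (1 + r_half)


def percent_agreement(ratings_a, ratings_b):
    """Percentage of cases on which two sets of ratings match exactly."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100 * matches / len(ratings_a)


# Hypothetical data: five examinees tested on two occasions (retest).
first_occasion = [10, 12, 14, 16, 18]
second_occasion = [11, 13, 14, 17, 19]
print("retest r:", round(pearson_r(first_occasion, second_occasion), 3))

# Hypothetical half-test correlation of .60, stepped up to full length.
print("split-half reliability:", round(spearman_brown(0.60), 3))

# Hypothetical ratings from two trained raters on four performances.
print("agreement %:", percent_agreement([1, 2, 3, 4], [1, 2, 3, 5]))
```

The same `percent_agreement` function would serve for intra rater reliability if the two lists held one rater's judgments of the same performances at two different times.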

 
 
 
 
 

Statistical accuracy is measured using a statistic referred to as the standard error of measurement.

The idea of statistical error should be understood as a way to estimate the extent to which the unpredictability of individuals diminishes the accuracy of test scores. This happens because individuals take tests under less than ideal conditions. They may become bored, sleepy or hungry. They may enter the test situation in less than perfect condition themselves: they may have been in an argument or a traffic jam on the way to the examination, they may not have had time to prepare for all parts of the examination, or they may have the flu. The important inference is that a test cannot be designed to anticipate all of these individual circumstances. Therefore, even the most soundly designed test will produce errors of measurement. These errors reduce the accuracy with which test scores assess true levels of knowledge or ability.
 

The standard error of measurement describes how much error on average can be found in an individual score selected at random from a group of scores achieved on the same test.
 

Here is the statistical formula that is used to evaluate the standard error of measurement.
 
 

\[ \mathrm{sem} = \left[ S^{2}_{x_1,x_2} \left( 1 - r^{2}_{x_1,x_2} \right) \right]^{1/2} \]

An examination of the parts of the formula tells a lot about the nature of errors of measurement. Thus the error of measurement tells us how likely a test is to be wrong in the assessment of individual knowledge or ability.
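
To see how the parts of such a formula behave, here is a minimal Python sketch. It assumes the conventional textbook form of the standard error of measurement, the standard deviation of the observed scores multiplied by the square root of one minus the reliability coefficient, which may not match the notation of the formula printed in this lesson term for term. The function names and the numbers are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Conventional SEM: sd * sqrt(1 - reliability).

    Assumption: `sd` is the standard deviation of observed scores and
    `reliability` is a reliability coefficient between 0 and 1
    (for example, a retest correlation).
    """
    if sd < 0:
        raise ValueError("standard deviation cannot be negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1.0 - reliability)


def score_band(observed, sem, width=1.0):
    """Range of `width` SEMs on either side of an observed score."""
    return (observed - width * sem, observed + width * sem)


# Hypothetical test: standard deviation 15, reliability .91.
sem = standard_error_of_measurement(sd=15.0, reliability=0.91)
low, high = score_band(observed=100.0, sem=sem)
print(f"sem = {sem:.2f}; band around a score of 100: {low:.2f} to {high:.2f}")
```

Under this form, a perfectly reliable test (reliability of 1) has no error of measurement, and a test with no reliability (reliability of 0) has an error as large as the spread of the scores themselves.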
 

Summarizing question

Can you think of examples where the terms used in this lesson have been misused in actual school based practices?
 

 
 

Page created January 5, 2001. Copyright Antonia D'Onofrio 2001/2002/2003.