This is part two of my summary of Daniel Koretz's book Measuring Up: What Educational Testing Really Tells Us. Part one, How useful are tests?, is here. Part three, Why teaching to the test is so bad, is here.

Koretz gives the clearest and fullest explanations I've read of what reliability and validity mean in the context of assessment. Here's my attempt to summarise his points.

Reliability refers to the consistency of measurement: reliable scores show little inconsistency from one measurement to the next. So for a test to be reliable it has to provide consistent results across time. If I sit a GCSE in English one day and I sit it again the next day (obviously having done nothing in between that might improve my score), then if the test is reliable I should get the same score. If I weigh myself on a set of bathroom scales several times in a row, then if the scales are reliable I should get the same reading every time.

A test can be reliable but inaccurate. The GCSE could give me the same score both times, but it could be the wrong score. The scales could tell me I am the same weight every time, but it could be the wrong weight.

Validity does not properly refer to the properties of a test, but to the inferences you make from a test. The same test can allow you to make valid inferences about one thing, but less valid inferences about another. Koretz's first example of this is that an exam on statistics might enable you to make good inferences about someone's statistical abilities but weaker inferences about their general mathematical abilities.

Koretz gives a couple more examples of the way assessments can give you information to make a valid inference in one area but not another. He draws from his own experience of living and working in Israel and not speaking Hebrew very well. He imagines that during this time he had taken the PET, which is the Israeli college admissions test. He would have ended up with an 'appallingly low score' because of his poor Hebrew.

'What should the admissions officers have concluded about me based on my dismal scores? If they had inferred that I lacked the mathematical and other cognitive skills needed for university, they would have been wrong.'

'Suppose the admissions officers wanted an answer to a similar question: whether, with the additional time and language study, I could be a competent student in a Hebrew-language university programme.' Again, to infer from his poor test score that the answer was no would have been wrong.

'Suppose they had wanted to answer a third question: whether I was, at that time and with the proficiency I had then, likely to be successful in Hebrew-language university study.' In that case, his low score would have been right on the money: he would have been a weak student indeed.

Validity can be compromised in many ways – Koretz explains three. First, in order to make a valid inference about a pupil's ability in a particular domain, you have to make sure the test adequately samples from that domain. If you don't, this is called construct underrepresentation: your test fails to adequately represent the construct that you want to make inferences about. The opposite problem is also possible – when you test more than you are trying to make the inference about. This makes it hard to isolate an exact skill that you can make an inference about. As I said in the previous post, one of the foremost assessment experts in the 60s wanted tests to assess isolated and narrow skills and knowledge. For him, the advantage of this approach was that you knew exactly what you were testing. Finally, there are cases when a test that normally allows us to make valid inferences is used in such a way that it compromises that validity – teaching to the test and such like.
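As a side note, the reliable-but-inaccurate idea above is the familiar distinction between precision and accuracy in measurement, and it can be sketched in a few lines of code. This toy simulation is my own illustration, not from Koretz; the true weight, the bias, and the `biased_scales` function are invented for the example. It models a set of bathroom scales that gives the same reading every time (perfectly reliable) while being systematically wrong (inaccurate).

```python
# Toy illustration: a measurement can be perfectly reliable
# (consistent across repeated readings) yet inaccurate
# (systematically wrong). Values here are invented.
import statistics

TRUE_WEIGHT_KG = 70.0   # hypothetical true value
BIAS_KG = 3.0           # hypothetical miscalibration of the scales

def biased_scales(true_weight_kg: float) -> float:
    """Scales that always give the same reading - 3 kg too heavy."""
    return true_weight_kg + BIAS_KG

readings = [biased_scales(TRUE_WEIGHT_KG) for _ in range(5)]

spread = statistics.pstdev(readings)   # zero spread: perfectly reliable
error = readings[0] - TRUE_WEIGHT_KG   # constant error: inaccurate

print(readings)          # [73.0, 73.0, 73.0, 73.0, 73.0]
print(spread, error)     # 0.0 3.0
```

The point the sketch makes is Koretz's: consistency (zero spread) tells you nothing about whether the reading is the right one, which is why reliability alone cannot underwrite valid inferences.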
