How to score cognitive tasks

So far, we have thought a lot about different cognitive tasks for measuring psychological mechanisms, and about what kinds of mechanisms they might reflect (e.g., we might use a count span task to measure working memory, or a flanker task to measure executive function). But we haven't spent much time examining exactly how we can use those tasks to measure people's cognitive ability.

Ultimately, the purpose of these tasks is to label people. We want to be able to look at someone's performance on a reading span task (for example) and say "that's a person with high working memory". Or we want to be able to look at someone's performance on a Stroop task (for example) and say "that's a person with mid-low executive function". To do that, we need to be able to calculate a score for each person's performance on the task. Only after we have scores can we compare people (e.g., say "Person A got a score of 88, and Person C got a score of 62, so Person A has better working memory") or group/label them (e.g., say "These six people have high executive function, those eight people have medium-level executive function, and these last five people have low executive function").

Below I have put someone's responses from the Daneman & Carpenter reading span task (if you don't remember how this task works, check the example slides from the "working memory" section of this module). The column on the left shows the words that the person was supposed to recall in each block. The column on the right shows the words that the person actually recalled.

There is no right or wrong answer here. There's no rule for what scale these should be on (you can give points, you can give a score out of 100%, or any other idea you have). I have no rules for what aspects you should or should not count when you make your score. That's up to you. Just think of some way you can assign this person a specific, numerical score which will reflect how well you think they did at the task.

Describe your scoring system, in as much detail as possible. I.e., describe exactly what things you looked at and describe how you arrived at a score.

Also, state what score you gave this person.

When you had to make a system to give people a score, you probably had to deal with several issues or questions. Here are a few issues you might have encountered:

Should order matter? Should a person get more points for remembering words in the correct order than for remembering words in the wrong order? Should a person get points at all for words that were remembered but not in the right order/position?
Should the length of each block/group matter? Should a person get more points for remembering all the words in a difficult block of six words (like boat-wound-address-student-doctor-shoulder) than they would get for remembering all the words in two different easy blocks of three words (like picture-pain-unity and officer-train-desire)? If so, how can you set up the scoring system to make the bigger blocks be worth more?
Should people get partial credit? For example, if they correctly remember 3 out of 4 words in a block (like affair-surprise-context-support), should they get zero marks because they didn't remember the whole block (this is the way the digit span task was scored), or should they get 75% because they remembered 75% of the block?
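
To see how much these choices can matter, here is a rough sketch in Python that scores the same response sheet (the one given for this activity) under three different rules. This is only an illustration, not a recommended scheme: the function names and the three rules themselves (all-or-nothing, partial credit, and length-weighted all-or-nothing) are assumptions made up for this example, and all three ignore order.

```python
# Illustrative only: three order-free scoring rules applied to one response sheet.
# The rule names and functions are assumptions for this sketch, not a standard.

# (words the person should have recalled, words actually recalled) per block
SHEET = [
    ("affair surprise context support", "affair surprise context"),
    ("reason factory excuse pencil illness", "reason illness pencil"),
    ("picture pain unity", "picture pain unity"),
    ("boat wound address student doctor shoulder", "shoulder doctor"),
    ("owner fear training cupboard", "cupboard owner fear training"),
    ("officer train desire", "officer train desire"),
    ("husband wonder flower eyes", "wonder flower eyes"),
]
BLOCKS = [(target.split(), recalled.split()) for target, recalled in SHEET]


def n_recalled(target, recalled):
    """Number of target words that appear anywhere in the response."""
    return len(set(target) & set(recalled))


def all_or_nothing(blocks):
    """One point per block in which every target word was recalled."""
    return sum(n_recalled(t, r) == len(t) for t, r in blocks)


def partial_credit(blocks):
    """Mean proportion of target words recalled per block."""
    return sum(n_recalled(t, r) / len(t) for t, r in blocks) / len(blocks)


def length_weighted(blocks):
    """Like all-or-nothing, but a perfect block earns its length in points."""
    return sum(len(t) for t, r in blocks if n_recalled(t, r) == len(t))


print(all_or_nothing(BLOCKS))            # 3 (out of 7 blocks)
print(round(partial_credit(BLOCKS), 3))  # 0.776
print(length_weighted(BLOCKS))           # 10 (out of 29 words)
```

Under these assumed rules, the same person comes out at about 43% (3 of 7 blocks), 78%, or 34% (10 of 29 points), depending only on which rule is chosen.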

How did you handle the issues you encountered? Did you think about any of these concerns? If so, how did you decide how to handle them in your scoring system? Did you come up with any other problems that I didn't mention?

All of the scoring choices you make will have consequences, and sometimes those consequences are drastic.

Let's imagine that we decided we do want to consider correct order when we make our scores. In other words, we want people who remember words in the right order to get better scores than people who don't remember them in the right order.

If your previous scoring system already did this, then you don't need to change it now. If your previous scoring system did not consider order, then think about how you can include order in the scoring system now.

Now that you've done that, let's examine the results from four different people. Imagine that the participants were supposed to remember affair, surprise, context, support, in that order. Now, here are the words that four different people recalled:

Participant A: affair, surprise, context
Participant B: affair, context, support
Participant C: support, context, surprise, affair
Participant D: affair, context, surprise, support

Score these four participants strictly according to your scoring system. Then answer the following two questions:

Who got the highest score?
Do you think that's fair? (i.e., do you think it accurately reflects these people's working memory ability?)

Let's keep considering these participants' responses (when they were supposed to recall affair, surprise, context, support):

Participant A: affair, surprise, context
Participant B: affair, context, support
Participant C: support, context, surprise, affair
Participant D: affair, context, surprise, support

One popular way to calculate scores and include order is position-based scoring. In other words, a person gets a point if they recall "affair" in the first position, they get a point if they recall "surprise" in the second position, they get a point if they recall "context" in the third position, and they get a point if they recall "support" in the fourth position.

Under this scoring system, here are the points earned:

Participant A gets three points: they recalled "affair", "surprise", and "context" in the correct positions, but they didn't recall "support" at all.
Participant B gets only one point. They recalled "affair" in the first position, but "context" was wrong (they recalled it second, but it should have been third), as was "support".
Participant C gets no points. They recalled all the words, but none in the correct position.
Participant D gets two points. They recalled "affair" and "support" in the right position, but "context" and "surprise" in the wrong position.
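
The position-based rule described above can be written as a short Python sketch. The function name position_score is just a label chosen for this example; the rule itself is simply "one point for each word recalled in the same position as in the target list".

```python
# A minimal sketch of position-based scoring (names are illustrative).

TARGET = ["affair", "surprise", "context", "support"]

RESPONSES = {
    "A": ["affair", "surprise", "context"],
    "B": ["affair", "context", "support"],
    "C": ["support", "context", "surprise", "affair"],
    "D": ["affair", "context", "surprise", "support"],
}


def position_score(target, recalled):
    """One point for each recalled word that sits in its target position."""
    # zip stops at the shorter list, so missing words simply earn nothing
    return sum(t == r for t, r in zip(target, recalled))


for participant, recalled in RESPONSES.items():
    print(participant, position_score(TARGET, recalled))
# A 3
# B 1
# C 0
# D 2
```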

Does this seem fair?

One might argue that Participants A and B performed about equally well: each of them forgot just one word. But Participant B lost more points because they forgot an early word (which means the rest of the words get pushed to the wrong position), and Participant A lost fewer points because they forgot a late word, leaving the early words in the correct place.

One might also argue that Participant D got a pretty bad deal. They lost two points, because they had two words in the wrong position. But it's likely that they did that because they switched the order of two words. So in a way, it sounds like they're losing two points for one mistake. Does it really sound right that Participant D, who remembered all words but just made the mistake of switching the order, gets a lower score than Participant A, who made the [arguably] more serious mistake of forgetting a whole word entirely?

I have only discussed position-based scoring. There may be other ways to score this, too. You might have scored it in a different way.

The most important thing in the end is the ranking (which participant has the highest score, which has the second-highest score, etc.): eventually we would use these scores to identify who is a "high" scorer, who is a "low" scorer, etc. In the scoring method I just mentioned, Participant A got the top score, followed by Participant D, then Participant B, and the loser is Participant C.

Is there any other way you could score these, which would result in a different ranking of the participants? If you think of one, explain the scoring system and tell me what the ranking would be. (It's possible that your scoring system from the previous questions already does this.)

There are two things I hope you have learned from the previous reflection questions:

There are a lot of different ways you could choose to score participants in cognitive tasks (we have only looked at one working memory task, but for any other tasks you try you will also face lots of decisions like this);
Scoring the task in a different way can have drastic consequences, even changing the ranking of the participants (i.e., someone who was classified as having "high working memory" in my scoring system might be classified as having "low working memory" in your scoring system).
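
As a concrete illustration of the second point, the Python sketch below ranks the four participants under position-based scoring and under a simple order-free rule (one point for each target word recalled, wherever it appears). The order-free rule and all function names are assumptions chosen for this example; it is only one of many possible alternatives.

```python
# Illustrative comparison of two scoring rules (names are assumptions).

TARGET = ["affair", "surprise", "context", "support"]

RESPONSES = {
    "A": ["affair", "surprise", "context"],
    "B": ["affair", "context", "support"],
    "C": ["support", "context", "surprise", "affair"],
    "D": ["affair", "context", "surprise", "support"],
}


def position_score(target, recalled):
    """One point for each recalled word that sits in its target position."""
    return sum(t == r for t, r in zip(target, recalled))


def item_score(target, recalled):
    """One point for each target word recalled, ignoring order."""
    return len(set(target) & set(recalled))


def ranking(score_fn):
    """Participants from highest to lowest score (ties keep A-D order)."""
    return sorted(RESPONSES, key=lambda p: -score_fn(TARGET, RESPONSES[p]))


print(ranking(position_score))  # ['A', 'D', 'B', 'C']
print(ranking(item_score))      # ['C', 'D', 'A', 'B']
```

Under the order-free rule, Participants C and D tie at four points and Participants A and B tie at three, so Participant C moves from last place to (joint) first.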

Conway et al. (2005) offer a very detailed discussion of different scoring schemes for working memory span tasks, and the pros and cons of each.

But overall, this issue illustrates a much more serious problem with research in general. When we do research, we are usually interested in constructs. A construct is an abstract thing in the world that we are trying to study: something like "intelligence", "language proficiency", "working memory ability", "how difficult a sentence is to understand", etc. Generally a construct is something that you can intuitively feel (e.g., in the "Intro to psycholinguistics" module, you could probably feel that the center-embedded sentence is harder to understand than the non-center-embedded one) but can't measure exactly. In fact, constructs can be "felt" but can almost never be directly observed or exactly measured.

Consider some examples. If you want to know someone's intelligence, you might measure it with an IQ test. If you want to know how hard a sentence was to understand, you might measure it by recording how long a person takes to read it. If you want to know how good someone's English level is, you might measure it by looking at their TOEFL, IELTS, or HKDSE scores.

The problem, however, is that measures can never perfectly reflect constructs.

Do you know anyone who has a great TOEFL score but sucks at English? Do you know anyone who's really smart, but doesn't get good grades or test scores? Those are both examples of times when measures don't reflect constructs. Maybe someone gets a high IELTS score because they cram for the test a lot and memorize monologues they can recite, even though their English is not very good. Maybe a smart person gets bad grades because the university does not recognize their learning style, or because they had many other life events preventing them from being able to do their school work. In any case, the measure (test score, class grades) is not accurately indicating the underlying abstract construct (English proficiency, intelligence).

This issue applies to measurements of working memory, executive function, and everything else we do in psycholinguistics. (Really it applies to things in most sciences, and especially to psychological sciences, which are interested in measuring stuff about the human mind; the human mind is infamously complicated and hard to measure.) Working memory test scores might not accurately reflect someone's working memory because of the choices we made in how to calculate the scores. Measures of how long someone took to read a sentence might not accurately reflect how difficult the sentence is, because the person might have gotten distracted during reading or something like that.

Unfortunately there's no easy solution to this. The only thing we can do is always remember that there is a difference between measures and constructs, and we should always be skeptical and always ask whether the measure we are using (IELTS test scores, working memory scores, etc.) is really a good measurement of the abstract construct (language proficiency, memory capacity, etc.) that we are actually interested in.

Can you think of one more example of a situation where a measure does not accurately reflect a construct?

Last question! Please write a self-reflection about what you learned in this module. That could mean summarizing the main points in your own words, or it could mean raising questions about something you didn't understand, problems or criticisms, pointing out something you disagreed with, suggesting some further issue that builds off of the things in this module, etc.

When you finish this activity, you are done with the module (assuming all your work on this and the previous tasks has been satisfactory). If you are interested in leading a discussion on this module, you can go on to see the suggested discussion topics. Otherwise, you can return to the module homepage to review this module, or return to the class homepage to select a different module or assignment to do now.

Should have said	Said
affair surprise context support	affair surprise context
reason factory excuse pencil illness	reason illness pencil
picture pain unity	picture pain unity
boat wound address student doctor shoulder	shoulder doctor
owner fear training cupboard	cupboard owner fear training
officer train desire	officer train desire
husband wonder flower eyes	wonder flower eyes

How to score cognitive tasks (3 hours)