How Not To Do Psychological Science
A study in exaggerated claims, questionable methods, and overdrawn inferences
Recently, I've been impressed with psychology as a field. And no one impressed me more than Adam Mastroianni, who published a proper research investigation, eight experiments and all, in a medium where anybody can read it, without any intention of having it published in a “proper” journal. But what's more, he wrote it like he does his articles - with humor and irreverence, as if he were telling you a story over a beer in a local bar.
Having witnessed that, I was coasting on the cloud of “so proud to be (almost) a psychologist.” Until I read the following paper, called “Two Birds, One Stone: The Effectiveness of Health and Environmental Messages to Reduce Meat Consumption and Encourage Pro-environmental Behavioral Spillover”, that, well, is the opposite of what I'd like to see labeled as psychological research.
What you'll find below is a dissection of the paper — its main ideas, the methods employed, the results obtained, and how the authors interpreted the results in the discussion. I try to be as objective as possible — whilst sounding anything but, of course — still, I might have gone a bit overboard in some places. I welcome critique and feedback. To be clear, I have nothing against the people involved or this paper in particular. I can imagine there are dozens, heck, hundreds of papers that suffer from the same issues. I just happened to stumble upon this one during the research for my master's thesis. Having said that, let's dig in.
Introduction and main ideas
The paper tests two major hypotheses. The first is that information provision makes people change their behavior. Specifically, the authors tested the influence of the environmental, health-related, and combined (both health and environmental) messages on the consumption of red and processed meat. Here's an example of one such message:
If you eat only a small amount of red and processed meat, you will protect the environment from harmful greenhouse gases and you will protect your health by reducing the likelihood of developing cancer.
Now pause here for a moment and consider that. Why would such a puny message change something people do almost without thinking? What they see everyone doing in restaurants, school cafeterias, and supermarkets? What they consider to be healthy and necessary for a well-rounded diet? What they enjoy doing? Those are just some of the reasons people name when asked why they consume meat. And all of them sound more convincing than a simple message about the negative consequences of meat consumption. It sounds to me like telling a five-year-old to stop gobbling up ice cream because his tummy will hurt. He will probably look you straight in the eye while grabbing fistfuls of ice cream from the bucket and slathering them across his cute tiny face.
But let's move on and continue with the second idea of the paper, which is this: information provision might also lead to a positive behavioral spillover.
The main idea of positive behavioral spillover[1] is that benefits in one area are supposed to spill over into others. For instance, someone inundated with messages about the environmental consequences of factory farming might — through changes in their pro-environmental identity — also change other pro-environmental behaviors like recycling, buying sustainable products, etc.
So not only are the messages supposed to change people's habitual behavior (consumption of red and processed meat). They are also supposed to spill over to other areas related to sustainability.
Right from the outset, this feels like a tall order. But, to be fair, messaging on its own has been found to be effective in changing people's behaviors, as farfetched and counterintuitive as I make it sound. For instance, this meta-analysis (which I skimmed) found a reduction in the consumption of electricity, gas, and water by an average of 6.24% after an information-based intervention.
So how did the authors put the messages into practice?
Methods section
The authors randomized the participants (n = 320[2]) into four groups. The first group received the environmental messaging, the second group the health-related messaging, the third group received both, and the fourth group received none (the control group). This is a standard 2 x 2 between-subjects design. On top of that, the authors also measured stuff at three different time points - baseline before the messaging (T1), two weeks later (T2), and one month after T1 (T3).
The participants received the messages every day at 8 am and 5 pm, for 14 days (between T1 and T2), through Facebook Messenger, together with the prompt to track their meat consumption each day. This makes it a mixed design as it allows the authors to measure both the differences between the different kinds of messages and within the groups across time.
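To make the structure of that design concrete, here is what the resulting dataset skeleton looks like: four between-subjects groups crossed with three within-subjects measurement occasions. This is my own minimal sketch — the group labels are the paper's conditions, but the per-participant coin-flip randomization is a simplification of whatever procedure the authors actually used:

```python
# A sketch of the 2 x 2 mixed design: four between-subjects message groups
# and three within-subjects time points. Illustrative only, not the authors' code.
import itertools
import random

random.seed(0)

N = 320  # approximate initial sample size reported in the paper
conditions = ["control", "environment", "health", "combined"]
time_points = ["T1", "T2", "T3"]

# Randomize participants into the four groups (simplified as independent draws).
assignment = {pid: random.choice(conditions) for pid in range(N)}

# The long-format design matrix: one row per (participant, time) measurement.
rows = [
    {"pid": pid, "condition": cond, "time": t}
    for (pid, cond), t in itertools.product(assignment.items(), time_points)
]

print(len(rows))  # 320 participants x 3 time points = 960 rows
```

The long format above is exactly what a mixed model wants as input: between-group differences live in the `condition` column, within-person change lives in the repeated `time` rows per `pid`.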
The main outcome variable was the number of portions of red and processed meat, and it was measured as a self-report like this:
“How many servings of processed/red meat have you eaten in the previous week? If you cannot remember please give your best estimate”)[3]
This was measured, in T2 and T3, on a 15-point response scale ranging from 0 to 14 servings or more.
Let's pause here again for a bit: think about what you ate throughout the past week. And think for approximately 5-10 seconds, which is a generous estimate of how much time the participants spent on the task. Got it? How confident are you in your estimate of weekly meat (or cheese, or whatever else if you don't eat meat) consumption? Would you bet your best pair of socks that this guess is accurate? I wouldn't (but my best pair of socks is also awesome).
I also find this an odd decision from the side of the authors because the participants are already tracking their daily consumption of meat during the first two weeks. Why not let them continue doing so for the coming two weeks — without any messages — and have an outcome variable that is way, way more accurate? The participants were already familiar with how it's done and I can't imagine the costs of continuing this digital intervention were out of budget.
Instead, you have an outcome variable that is inaccurate because:
there is literally zero effort and nil consequences for the participants to report whatever they please;
the measurement takes place every two weeks — memory loss is very likely;
the participants are encouraged to take their best guess (“If you cannot remember please give your best estimate”).
You can imagine the accuracy of the answers.
Anyway, here's some more beef (bad meat-related pun #1): consider the initial instructions the participants received in the various conditions. Here they are (for the “combined” and for the control group; bolding is theirs):
Combined
For the next part of the study, you will be asked to record all of the food you consume by keeping a daily food diary for 14 consecutive days. You will be sent some information including a reminder and a link to the food diary each day via the private chat on Facebook messenger during this time.
During the study period we ask that you try to eat no more than two medium portions of red (including processed) meat each week. This is because several scientific studies have shown that red/processed meat consumption is linked to a series of negative health and environmental outcomes[4].
Control
For the next part of the study, you will be asked to record all of the food you consume by keeping a daily food diary for 14 consecutive days. You will be sent some information including a reminder and a link to the food diary each day via the private chat on Facebook messenger during this time.
We ask that you do not change your diet in anyway during the study period.
Let's linger here for a bit. They specifically asked the control group not to change their diets “in anyway” (yes, that's a copypasta from the paper), while they told the experimental group to “try to eat no more than two medium portions of red (including processed) meat each week.”
Here's what's bothering me: how in hell are they measuring the differences in the various types of messages — what they’re interested in — and not the differences in the instructions, irrespective of the message the people see? To me, this is a clear confounding variable[5]. To make a fair comparison between the groups, the request -- in my view -- must be the same.
The same thing happened with the messages people saw daily for a period of two weeks. Here are two examples (control vs. combined again):
As you can see, the participants in the experimental groups were asked to reduce their consumption of red/processed meat. The control group had no such request in their daily prompt.
So when the time comes — in two weeks and one month after — and you're asked about your weekly meat consumption, what will you fill in? Two dishes fewer, that's what (if you’re in the experimental group). It easily comes to mind since you've been marinating in it for two weeks. In psychology lingo, it is called demand characteristics: the participants guess the purpose of the study — what you want of them — and try to deliver it to you, either consciously or unconsciously. For a nice article about the subject, see here. Also, availability heuristic.
Anyway, leaving this morsel of meat aside (bad meat-related pun #2), recall the second goal of the intervention: positive behavioral spillover. How did the authors measure it? They asked the participants in T1 and T2 “how often they planned to perform the following behaviors in the following 6 months:”
“have shorter showers or infrequent baths,”
“Purchase an eco-friendly product,”
“buy a product with less packaging,”
“buy organic food produce,”
“Buy local rather than imported food produce,”
“eat seasonal fruit and vegetables,”
“reduce my consumption of meat and dairy products,”
“use public transport instead of driving my car,”
“volunteer for an environmental group,” and
“donate to an environmental group.”
The participants could select the following options for each of the above statements:
“not at all,”
“once,”
“2 to 3 times,”
“4 to 5 times,”
“6 to 7 times,”
“8 to 9 times” or
“more than 10 times.”
Okay, ehm, how does this measure actual spillover? At best, it measures the willingness for spillover (which isn't stated in the title or abstract, making it kind of misleading). At worst, it measures absolutely nothing, since it's unlikely that the female student took her sweet time, going soul-deep and reflecting on her values before giving an answer to each item (instead of simply ticking something to get the participation hours she needs as a psychology student). But you've probably done questionnaires yourself, so you know.
Why did I write “female student”? Oh, I forgot to mention the males were underrepresented by more than 4 to 1 in this study. This isn't an issue because of some fairness equality stuff many people are rankled about; it's more a problem because:
the sample isn't representative of the population, making the generalization (and the entire concept of inference statistics) quite dodgy and
it's known that females are much more receptive to meat reduction and more likely to be vegetarians/vegans/flexitarians, etc. See e.g. here, here, and here.
This means that the effects found might be due to the participants' sex, not the sexy messaging they were exposed to. This is easily accounted for in the stats section, btw, where you can plug the sex variable into the regressions and check if the effects remain. The authors don't mention they did this — and it's reported nowhere in the tables in the paper nor in the supplementary material — so I assume they didn't.
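To show what that check would look like, here is a hedged sketch: simulate data in which sex, not the message, drives the meat reduction, then fit the regression with and without sex as a covariate. Everything below is made up for illustration — the real analysis would use the authors' mixed model on actual data, and the deliberate sex-condition imbalance is my device for demonstrating the confound:

```python
# Sketch of a confounder check: does the "condition" effect survive once
# sex is in the model? Simulated data, not the paper's.
import numpy as np

rng = np.random.default_rng(42)
n = 320

# ~4:1 female majority, as in the study's sample.
female = rng.random(n) < 0.8

# Deliberately let assignment correlate with sex (females slightly more
# likely to be treated) so there is a confound to detect.
treated = rng.random(n) < np.where(female, 0.6, 0.4)

# Outcome: sex, NOT the message, drives meat reduction (true effect = -2).
portions = 7.0 - 2.0 * female + rng.normal(0.0, 1.0, n)

def ols_coef(y, *covariates):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *covariates])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_naive = ols_coef(portions, treated)             # condition only
b_adjusted = ols_coef(portions, treated, female)  # condition + sex

print(f"condition effect, unadjusted:   {b_naive[1]:+.2f}")
print(f"condition effect, sex-adjusted: {b_adjusted[1]:+.2f}")
```

If the condition effect shrinks toward zero once sex enters the model (as it does here by construction), sex was doing the work. Reporting exactly this comparison is the cheap robustness check I'm missing in the paper.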
In total, we have issues with the instructions, the messages themselves, and both the outcome variables. It can only get better in the results section, right? Wrong.
Results section
So what are the results? There are several to discuss, let's begin with the main hypothesis.
Here is a nice graphic (hello, Excel!):
What we're looking at here are the three measurement points on the X-axis (T1 - T3) and the portions of red/processed meat on the Y-axis. If the authors had measured only T1 and T2, their life would've been much easier -- what a nice and clear difference, much wow (not at all attributable to the differences in the instruction and the request not to consume meat each day!). Sadly, there's T3 too, which complicates the interpretation. As you can see, the clear differences between the control and the three experimental conditions nearly vanish. I could critique the graphic here, as it's impossible to distinguish the error bars, but there's already a lot of other critique so I just leave it at this short mention...[6]
Anyway, after plugging the above results into a linear mixed model (a good choice of statistical method in my view, accounting for the differences between the participants), they found “no significant main effect of condition when controlling for time” (meaning no differences between groups). They did find some interactions. First, the difference between the three experimental groups and the control group from T1 to T2 was significant (easily seen in the graphic): the 3 experimental groups reduced their meat consumption more than the control group. Second, all the groups reduced their meat consumption significantly from T1 to T3 over time (also visible in the graphic; from ~7 to ~4 - 5 portions). Lastly, the difference between the control group and the experimental group that received the combined kind of message was significant at T3 (comparing the blue and yellow lines). Authors: “Thus, the results showed that providing information on the health and/or environmental impacts of meat had a significant effect on reducing red and processed meat consumption during the intervention and one-month later, supporting Hypothesis 1.”
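To make that condition-by-time interaction concrete: from T1 to T2 it is, at heart, a difference-in-differences of group means — how much more the messaged groups changed than the control group did. A toy sketch, with group means eyeballed from the figure (~7 portions at baseline) rather than taken from the paper's data:

```python
# The "condition x time" interaction from T1 to T2 as a difference-in-
# differences of group means. Simulated numbers roughly matching the figure.
import numpy as np

rng = np.random.default_rng(7)
n_per_group = 80  # ~320 participants split over four groups

# Weekly portions of red/processed meat (illustrative means and spread).
control_T1 = rng.normal(7.0, 2.0, n_per_group)
control_T2 = rng.normal(6.5, 2.0, n_per_group)
treat_T1 = rng.normal(7.0, 2.0, n_per_group)
treat_T2 = rng.normal(5.0, 2.0, n_per_group)

# Difference-in-differences: how much MORE the treated group changed.
did = (treat_T2.mean() - treat_T1.mean()) - (control_T2.mean() - control_T1.mean())
print(f"interaction (diff-in-diff) estimate: {did:.2f} portions")
```

A linear mixed model does the same comparison while also modeling per-participant baselines, which is why it was a reasonable tool here; my gripe is with what got fed into it and how the mixed bag of significant and non-significant terms was summarized, not with the model.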
It doesn't sound as clear-cut to me as they would put it -- after all, there was no main effect of condition and all the groups reduced their meat consumption significantly by T3.[7] This is a good case for preregistration, by the way, where you need to specify which effects specifically count as confirmation of the hypothesis. To me, a stronger test of the theory would require those differences to be significant as well, but that's me, not having a stake in publishing papers. Again, I did not find any information as to whether the authors have done so.
Anyway, what about the second hypothesis regarding behavioral spillover? Here's a graphic again that tells the story:
First things first: why the hell is the T2 graphic (above) slightly bigger than the T3 graphic (below)? It makes for a weird perceptual disfluency. If you go with this, at least have the decency to spread the graphics apart in the paper so the people reading it aren't as likely to notice they are differently sized... Leaving that aside, recall that this represents the participants’ willingness for spillover, where the assumed/guessed/anticipated/planned/completely made-up willingness is measured on the Y-axis and the various pro-environmental behaviors are on the X-axis. And as you can see, there aren't any clear differences. In most cases, the combined condition is better than the control condition, sure, but the difference isn't big (or significant) in 9 out of 10 cases. The authors reported as much in the text and concluded:
"Thus, there was some evidence to support that a reduced consumption of red and processed meat one-month after the intervention led to an increased willingness to reduce ones’ meat and dairy consumption. However, a reduced consumption of red and processed meat did not predict any other, untargeted, proenvironmental behaviors at either time point. Thus, Hypothesis 2 was only partially supported."
Ehm, okay, no. I can't grant you that, sorry. How is that a “partial support of the hypothesis” when only 1 of the 10 (hypothetical) behaviors you measured was significant? It's like someone failing a math test completely and utterly and you give him a passing grade. Or going to a fridge and calling it a workout. Or eating fries and counting it as eating vegetables. Anyway, that's not even the part that makes me go:
In light of that, allow me to CAPS the following:
he pressed the capslock button, which nobody ever does because nobody ever needs to write long sentences in all caps
THE ONLY SIGNIFICANT DIFFERENCE BETWEEN THE GROUPS WAS ON THE ITEM THAT MEASURED THE WILLINGNESS TO “REDUCE MEAT AND DAIRY CONSUMPTION”. THAT'S LITERALLY NEARLY THE SAME THING YOU MEASURED AS YOUR OUTCOME VARIABLE (“RED/PROCESSED MEAT CONSUMPTION”)!!1! HOW IS THAT "LIMITED EVIDENCE FOR BEHAVIORAL SPILL-OVER"? THERE IS NO SPILLING-OVER HAPPENING!
NOT TO MENTION THAT YOU PUT TWO THINGS INTO ONE ITEM (A BIG NO-NO IN QUESTIONNAIRES) - MEAT AND DAIRY - SO YOU CAN'T EVEN TELL IF THE PARTICIPANTS RESPONDED ONLY TO ONE OF THEM. MAYBE THEY ARE JUST WILLING TO REDUCE MEAT, NOT DAIRY, BUT THEY CAN'T TELL YOU THAT BECAUSE YOU DIDN'T ALLOW THEM TO DO SO!
SO, IF WE ASSUME THAT MOST PEOPLE AREN'T WILLING TO GIVE UP CHEESE - WHICH, LET'S BE HONEST, MANY AREN'T - YOUR “LIMITED EVIDENCE” IS AS NON-EXISTENT AS THE NUMBER OF FUCKS I'VE GIVEN IN THE LAST 3 PARAGRAPHS
gently presses the capslock button again and takes a long inhale. A long exhale. The spirit of the FUUU meme leaves him and he feels deflated like a balloon
So, what we learned in the results section is that the authors:
are well-versed in overstating their claims whilst sounding perfectly scientific, and
can stretch the meaning of “partial support” and “limited evidence” more than I can stretch my sentences - which is a lot and which I will amply demonstrate - yet again - by making this sentence needlessly long just like the one that I had to stick into a footnote so that it wouldn’t clutter the main body of text.
Anyway, let's move to the last section — the discussion.
Discussion section
There are several points worth discussing. The first is this: why did the control group participants reduce their meat consumption as well, despite being specifically asked not to change their meat consumption “in anyway”? Recall that I was railing about the difference in the
instruction (“During the study period we ask that you try to eat no more than two medium portions of red (including processed) meat each week.)
and the prompt (“Remember to try and eat no more than two portions of red/processed meat this week.”)
between the groups throughout this article. Turns out, it didn't make much of a difference in the end (at T3).
The authors discuss why this might be the case and I like the rationale. Recall what the participants were asked to do each day, irrespective of the condition they were assigned to: track what they ate each day. This means that all the participants had to be more mindful of what they consumed each day. This might have led even the participants in the control group to consider their meat consumption, and reduce it. In short, what the authors might have found is not that messaging, but self-monitoring food intake works to reduce meat consumption, as it makes people more aware of their food-related choices. I call this a win (although completely unrelated to their hypotheses).
bites his fist and hovers around capslock again
"Second, the results suggested some limited evidence of behavioral spillover, partially supporting hypothesis 2. After correcting for multiple comparisons, there was only a significant effect where a reduced consumption of red and processed meat was associated with an increased willingness to eat less meat and dairy. We view this as partial evidence of spillover, considering the similarity between reducing ones’ red and processed meat consumption and reducing ones’ meat and dairy consumption."
Let's see if I can rewrite the above paragraph better. Warning, it's a bit savage:
Second, the results suggested some limited evidence of behavioral spillover, partially supporting hypothesis 2. After correcting for multiple comparisons,
- the people drinking Cola switched to Pepsi
- the people shopping in Zara went to H&M
- the people driving gasoline cars switched to electric cars (this one is both sad and true)
We view this as partial evidence of spillover, considering
the similaritytwo things being exactly the same for all intents and purposes.
Anyway, since that measure is useless in the first place, it doesn't really matter how you interpret it...
…But that apparently does NOT make the authors think, stop, and reassess before coming up with a title like this:
Honestly, if this were me, I'd report the variable descriptively ( = show the bar graph), and leave it at that. As it is now, every single thing regarding the spillover variable is questionable:
it's not behavioral spillover per se, it's just willingness for spillover;
there is no willingness for spillover (only 1 out of 10 hypothetical behaviors is significantly different in the hypothesized direction);
the one item that could show a difference actually doesn't allow it because it aggregates meat and dairy into one item;
the title.
You gotta give it to the authors, though — they are consistent.
Anyway, let's leave this sore spot and move on.
It is worth noting that there are some limitations of the current study.
Yep.
First, the measure of red and processed meat consumption required participants to indicate the number of servings of red and processed meat they had eaten in the previous week. Although participants were provided with example portion sizes for red and processed meat, this might not have been sufficient to ensure a precise measure participants’ meat consumption. Participants also may not have been able to accurately recall the amount of red and processed meat they had consumed retrospectively, during the previous week.
The authors rightly recommend better, more objective methods to measure meat intake in their discussion, such as counting receipts. I might be going too far here, but I assume the authors must've known about the problems of self-report research. This leads me to the following question: if they knew the method they employed was, well, bad, why didn't they use the better method — about which they evidently knew, since it's right there in the discussion — in the first place? Why did they put this on the shoulders of some undefined future researchers instead of doing it themselves?
Next, completely at the end, we have this:
Fourth, participants indicated their intentions to perform different pro-environmental behaviors in the upcoming months. However, there is often a gap between people’s intentions and actions (e.g., Hassan et al., 2016). Future research might therefore benefit from investigating spillover using observable measures of behavior to improve the accuracy of this measure. Finally, the reliance on a student sample means that the findings may not be generalisable to the wider public. Thus, future research might benefit from using different participant samples, for example members of the general public, to improve generalisability.
Well, thanks for this tiny disclaimer about the spillover measure. Again, as with the above, why didn't they use observable measures of behavior themselves? Second, why didn't they opt for a more representative participant sample, one that at least has the sexes equally represented?
Peppered throughout the discussion, we have statements such as these[8]:
“Nevertheless, this is a promising finding which suggests...”
“It is interesting to note that...”
“These findings contribute to...”
“Although this study showed limited evidence of positive spillover, it is worth noting that...”
“This is an encouraging finding, demonstrating that...”
Because without the authors needing to tell us, we might be at risk of seeing this study as utterly irrelevant, uninteresting, and perhaps even wasteful of the resources and (wo)man-hours spent on such an endeavor.
In the conclusion, the authors reiterate most of the above points so I'll spare you. Let's move on to the final paragraph.
The road ahead
I get that doing research is hard and critiquing easy. I've done some studies myself, partook in others, and plan on doing at least one more. I know that choosing the right method within the limited resources you have is difficult. I know that accounting for confounding variables is a pickle, especially since nearly anything can confound your results in psychology (sadly, we can't put our subjects in ampules; have them genetically modified to not exhibit certain traits; or throw them in a vacuum). I also get that people need to publish papers to further (or keep) their careers and that the entire system is flawed: peer review is questionable (it let this paper through); social sciences have issues with publishing just about anything (this is definitely a straw man argument and you should read it as such). I could go on and on with the reasons why this paper exists, but they — to me — don't justify its existence; I expect higher standards from the researchers, the reviewers (those poor souls doing work for free), and the editors of the scientific journals. I expect higher-quality research done for the sake of answering an interesting question, not an elaborate set of scientific-looking gestures that lay people — and perhaps grant donors — can't tell apart from real, substantial science[9].
And if you question the relevance of this critique, consider this:
The paper has been cited 22 times and is almost in the 90th percentile of all the articles in Frontiers. Bad science spreads.
To make it clear, and reiterate what I said in the introduction, this shouldn't read as a screed against the researchers involved. I don't know any of them nor do I have an agenda to take them down. I take this more as a failing of the system: provide a better system and the people will play the game better. Despite each of us thinking we are special snowflakes, we very rarely act like it. Oh, we might think outlier thoughts, and believe ourselves mavericks. But actually putting our livelihoods on the line to stand up for our values? That's the real meat of the stuff (bad meat-related pun #3).
We are of course already trying to improve (psychological) science, with platforms such as osf.io where the researchers can pre-register their hypotheses, and upload their data and manuscripts. Scientists are also calling for better theories, better ways of writing papers, and better ways of making the work accessible. The problem is that the incentives still aren't aligned properly, despite all these efforts. The old system is still alive and well. The hierarchies are strong. We have brave souls[10] who voice criticism despite being dependent on and embedded within the system. But these are rarer than pineapple on Italian pizza.
Inspired by the research of Adam Mastroianni I mentioned above, and to publicly put my money where my mouth is, I plan on writing my master's thesis avoiding the pitfalls of the study I dissected here and, further, to do it more or less in my style — with humor (such as it is) and puns and sarcasm and an altogether unscientific look. The question is: am I willing to get an F for putting a few jokes in and keeping the tone light? Am I willing to spend another few months writing another thesis? I'm not sure. A part of me is excited and scared by the opportunity. But another part simply wants to play the game where the rules — and outcomes — are clear and where the risk of rejection is low. So yeah, I definitely get how such research gets published. But that is not an excuse we can use forever.
[1] There is also negative behavioral spillover, which is sometimes also called moral licensing, where a good deed (say recycling) leads people to cut themselves some slack in other areas, like reducing their meat consumption.
[2] This is the initial count. As the experiment went on for a month, there was inevitable attrition. By the end of it, one month later, the number was down to 238 participants (74%).
[3] If you're wondering: "but Marek, where is the opening bracket to the closing bracket in the quote above? Can't you even copy-paste properly?" I assure you, my laziness is greater than my reporting rigor, meaning that the bracket isn't there in the paper. I left the quote intentionally as it is to further sketch the quality of (some) published literature.
[4] Here were some more sentences about the negative consequences, both health and environment-related, of meat consumption which I cut for brevity - and promptly lost the brevity because this explanatory sentence is getting way too long; oh God, I feel like David Foster Wallace, he just "ends" his sentences with a comma and continues writing and just when you think the sentence is over, there comes another "and" and you know you're fucked because you can't hold your breath for that long.
[5] a variable that influences the outcome of the experiment in a major way and isn't accounted for
[6] ...and this graphic that I put together in 10 minutes that looks - to me - a lot better. Sometimes, I'm evil.
[7] This is interesting if you consider the control group was asked not to change their diet, and they did so anyway (in T3), but more on that later in the section about the discussion.
[8] the obligatory mantras of science that help to justify the resources spent on this kind of study
[9] Heck, even the scientists who reviewed it were fooled.
[10] I don't count myself as a "brave soul" since I don't care about the impact of this article because a) I don't have the readership and the voice and b) I'm not dependent on the system working as it does. Nobody cares about the barking dog deep somewhere in an abandoned industrial area that nobody ever visits.