My curiosity turned me down a dark alley of oddly reported and interpreted statistics. It has fancy things in it, like effect sizes, and even confidence intervals, and “Wilcoxon sign tests”, and claims of large effects. Perhaps I’m not sophisticated enough to understand its meaning, but to me it seems more like a fun-house out of the twilight zone, or the research mirror world of that old post-modern bs academic writing, with statistical concepts as the lily-pads rather than obscure hipster-words.*
What lured me was a second posting of a data-table from Keith Laws, where he lamented that he still could not make sense of it. So I re-tweeted. Why not. A click is easy. Daniel Lakens responded, and I was now the witness to a conversation suggesting that this was some crappy analysis. Yes, I had looked at the data-table at one point before, but nothing of value had turned up in my head, so I had nothing to say without looking at the paper. And as schizophrenia and CBT and clinical trials are way off even my meandering paths, I had simply refrained from that. But after a bit of back and forth (and being copied in on a slur accusation, and, as usual, giggling about some remark from Dr Neil Martin) I just had to go look. Keith had kindly linked a Dropbox copy for anybody’s perusal.
For you, I give the title and a link to the abstract; the paper is paywalled. (I just don’t want to use the Dropbox copy for a blog.) High-Yield Cognitive Behavioral Techniques for Psychosis Delivered by Case Managers to Their Clients With Persistent Psychotic Symptoms: An Exploratory Trial.
It is a strange world. I wasn’t sure if I should giggle, or wonder if I had misunderstood something about the statistics they were doing, or get deeply depressed that this passed peer review, considering that clinical psychology is the one area where we have the most realistic opportunity to both do good, and to do great harm.
From my understanding, having looked at it now in my rabbit-grazing way, the question the group was interested in is whether it is feasible to train Case Managers to deliver a particular kind of Cognitive Behavioral Therapy to their psychotic patients.**
They assessed 69 patients for eligibility for the treatment, and ended up with a total of 38. They also trained 13 case managers to deliver the treatment. Training took five days, and all case managers received weekly supervision during treatment. I can’t find how long the treatment lasted when I look through it this time, but I assume it took place over several weeks. Patients were assessed on a number of functions at baseline and at follow-up. The scales or whatever protocols they use for assessment are unknown to me, so I have no means of knowing if they are good, but what they assess seems to be things that it would be reasonable to assess. Not explaining them is standard, of course. I wouldn’t bother to explain how the IAT works if I were submitting something to a Social Cognition Journal.
So far so good. Nothing is remarkable or out of the ordinary from this non-expert’s point of view. Doing clinical research seems like a massive job, and I appreciate that there is a great deal of difficulty doing it well.
But now comes the analysis. First there are four figures of histograms with error bars, showing before-and-after scores on the different measures. I can’t find any explicit inferentials about these results in the text, although they claim to have done both t-tests and Wilcoxon sign tests (I looked the latter up on Wikipedia. Both types seem reasonable for before-and-after assessment). But just looking at the graphs, it is clear that inferentials aren’t really needed, because it doesn’t look like anything happened. The means tend to be somewhat lower in the “after”, but the standard errors are rather large and overlapping (I even double checked my Cumming New Statistics book to make sure I understood this). It really looks like the intervention has had no effect whatsoever, at least when you aggregate across all 38 participants, which I assume they did. They claim that no data were missing.
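For readers who, like me, had to look these tests up: here is a minimal sketch of what a paired t-test and a Wilcoxon signed-rank test (which is what I assume “Wilcoxon sign test” refers to) compute for before-and-after scores. The scores are made up for illustration and have nothing to do with the paper, and the function names `paired_t` and `signed_rank_w` are my own. The sketch only computes the test statistics, not p-values.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t, df) for a paired t-test on the differences after - before."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return statistics.mean(diffs) / se, n - 1


def signed_rank_w(before, after):
    """Return (W_plus, W_minus), the Wilcoxon signed-rank sums."""
    # Zero differences are dropped, then the rest are ranked by absolute size.
    diffs = sorted((a - b for b, a in zip(before, after) if a != b), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j + 1 < len(diffs) and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        tied_rank = (i + j) / 2 + 1  # tied values share the average rank
        for k in range(i, j + 1):
            ranks[k] = tied_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical symptom scores for eight patients (NOT data from the paper).
before = [24, 30, 18, 27, 22, 35, 29, 21]
after = [22, 31, 15, 27, 20, 30, 30, 19]

t, df = paired_t(before, after)
print("paired t:", round(t, 3), "df:", df)
print("signed-rank sums (W+, W-):", signed_rank_w(before, after))
```

Both statistics are built from the per-patient differences rather than from the two group means separately.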
They also do an “effect size analysis” using “Cohen’s d-methodology”, referring to Cohen’s entire 1988 book. Well, fair enough, but I wanted to know if they meant something different by this than we do when we do t-tests and calculate effect size. I gather that this is what they are listing in that table that Keith tweeted, the one he could not make heads or tails of, and that Daniel thinks is just horrible, and that I think resembles a sinister hall of mirrors, or possibly a run-way made of bamboo in the south eastern war theatre in the late 40’s.
Now effect sizes are nice, of course. In this case they run from the middling to large, and also include a few negative ones (suggesting that things got worse). But one must remember that with only 38 participants, effect-size estimates are noisy and can easily look inflated, as the handy chart in Dan Simons’ blog shows (simulations of effect size estimates where the true effect size is zero; you can set that when you simulate).
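To get a feel for how noisy effect size estimates are at this sample size, here is a rough simulation sketch of my own (it is not the code behind the chart mentioned above). It assumes normally distributed before-after difference scores with a true mean of zero, and computes d as the mean difference divided by the standard deviation of the differences. The paper may well have computed its d differently, and the function name `simulate_null_d` is just something I made up.

```python
import random
import statistics


def simulate_null_d(n=38, reps=4000, seed=1):
    """Simulate effect size estimates when the true effect is exactly zero.

    Assumption: each study has n difference scores drawn from a normal
    distribution with mean 0 and SD 1.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        diffs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        estimates.append(statistics.mean(diffs) / statistics.stdev(diffs))
    return estimates


ds = simulate_null_d()
abs_ds = sorted(abs(d) for d in ds)

print("average estimate:", round(statistics.mean(ds), 3))
print("95th percentile of |d|:", round(abs_ds[int(0.95 * len(abs_ds))], 3))
print("share of studies with |d| >= 0.2:",
      round(sum(d >= 0.2 for d in abs_ds) / len(abs_ds), 3))
```

The average estimate sits near zero, as it should. The point is the spread: a single study of 38 can land a fair distance from zero in either direction by chance alone.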
The table also shows confidence intervals. I take it that it is for the effect sizes. I looked up how you calculate confidence intervals for effect sizes to try to make sense of this, and you can do it of course. It is a bit trickier than just calculating confidence intervals for estimated means – involving non-central non-symmetric t-distributions, but it can be done, and evidently there are nice R-algorithms for it.
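As a sketch of the idea (not the R routines, and not necessarily what the authors did), the interval can be approximated by simulation instead of evaluating the non-central t-distribution directly. I am assuming a paired design where d is the mean difference divided by the standard deviation of the differences, so that t = d·√n with n − 1 degrees of freedom. The function name `d_confidence_interval` and the example value d = 0.3 are hypothetical.

```python
import math
import random


def d_confidence_interval(d, n, level=0.95, draws=100_000, seed=1):
    """Approximate a confidence interval for a paired-design effect size d.

    A non-central t variable is (Z + delta) / sqrt(V / df), with Z standard
    normal and V chi-square with df degrees of freedom. So T <= t_obs exactly
    when delta <= t_obs * sqrt(V / df) - Z, and the limits for delta are
    quantiles of that right-hand side.
    """
    rng = random.Random(seed)
    df = n - 1
    t_obs = d * math.sqrt(n)
    pivots = []
    for _ in range(draws):
        z = rng.gauss(0.0, 1.0)
        v = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
        pivots.append(t_obs * math.sqrt(v / df) - z)
    pivots.sort()
    alpha = 1.0 - level
    lower_delta = pivots[int(alpha / 2 * draws)]
    upper_delta = pivots[int((1 - alpha / 2) * draws) - 1]
    # Convert the non-centrality limits back to the d scale.
    return lower_delta / math.sqrt(n), upper_delta / math.sqrt(n)


# Hypothetical effect size, with the paper's sample size of 38.
low, high = d_confidence_interval(0.3, 38)
print("approximate 95% CI for d:", round(low, 2), "to", round(high, 2))
```

Because this is a Monte Carlo approximation, the limits wobble slightly with the seed and the number of draws. A proper non-central t routine would give the exact values.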
The confidence intervals are wide, and all go from a number less than zero to one above it. That is, for every single effect size, “no effect whatsoever” is still within the range of plausible values. There are likely a couple of typos there also: two confidence intervals are identical, and one starts at the same number as the estimated effect size. (It is not the only place where the copy editors and proof readers missed something. Figure 5 lacks labels on the axes.)
None of this is backed up in the analysis section, which is all of 5 lines long and names a number of tests they claim they performed. Of course, looking at the graphs and the tables it really looks like there isn’t much to write about anyway, because I doubt anything would have reached that magical p<.05 level, but it would have been nice to actually see the values, and the df’s and all that in numbers, because my feeling right now is that I’m not sure the researchers know what they are doing, at least not statistically.
I wouldn’t have let this kind of reporting pass in my undergraduates (who have a good excuse for feeling wobbly about stats). It should be a fairly straightforward analysis with a before-and-after design.
Sure, perhaps it isn’t appropriate or feasible to write down all those numbers in all cases. Right now I have a paper out where I don’t report the inferentials, and only show means with standard errors, but the journal is one focusing on film, and is mostly using their kind of qualitative analyses. I wanted to illustrate that we can induce emotions with films, but showing the data was more supplementary than all the other things I wrote about.
I’m not sure this paper can get away with that excuse, especially as it starts its discussion claiming that the results showed large effect sizes (never mind those confidence intervals), and that the intervention showed good, significant results, never mind that the table suggests that nothing happened, or at least that if something happened it is so overwhelmed by that pesky crud-factor that the signal doesn’t make it outside the noise.
They don’t look at the training of the case managers, which I thought was part of the question. There are a lot of claims, but they don’t seem anchored in the data they show, and none of that should be particularly difficult to show.
And, yes, sure, they are aware that the sample is small, and there is nothing that seemed control-like, but they are confident they have shown some kind of feasibility for training case managers to deliver this type of therapy. It seems akin to reading palms.
Now, why, oh, why, did I dive into this sinister mirror world, when I don’t do clinical? I should have stayed in the fun-house of small n counter-intuitive findings in social psychology. We can snark and replicate one another, and nobody’s mental health is in danger.
Still I wonder, did I miss something? Is there some analysis method I don’t understand (yes, there are plenty of those, of course) that is pertinent to this one?
Am I, in the twilight zone?
* A quote from Katha Pollitt that I read in a Sokal book/article late last millennium keeps sticking with me as a perfect illustration of mindless attempts at influence by using particular keywords: A frog jumping from lily pad to lily pad. I finally found where it is from with a little bit of google-fu. The article is called Pomolotov Cocktail. The frog quote is at the very bottom.
**Never mind that CBT for schizophrenia seems to do very little based on this recent meta-analysis. There is a place to play with the data for the meta-analysis, but for the moment I have lost that link in my twitter flow.
On edit: I had hoped Keith would see that one, and provide me with the link, and he kindly did. Here. Go play with meta-analysis data.