"Brain Scans Diagnose Autism" - when is a test not a test?

Listening to the radio this morning you'd be forgiven for assuming that there had been some extraordinary breakthrough in the battle to better understand autism.  The Today program - BBC Radio 4's flagship news program - ran it as the third item in each of their headline recaps, and quoted, among other things:
 - "the scan works with 90% accuracy"
 - "researchers think it will work even better on children"
 - "the test should be available for general use within two years"
Parents ask me a lot in clinic if I think their child has autism.  I'm not an expert, because it's pretty difficult to diagnose - which is why I've recently been trying to commission an Interpretations paper in Archives E&P on this subject.  I had visions of parents saying to me "Well, can't you just do that new scan they've been talking about?"  So, I resolved to read a bit about it.  Especially as I've never heard of the statistical term "accuracy".

I started with the BBC.  Then Google news.  What was striking was that there was a lot of similarity between the news articles it threw up.  This isn't unusual; lots of journals have press releases about interesting papers.  But, there didn't seem to be any more depth to any of the articles.  One of them was kind enough to mention that this research was from the Journal of Neuroscience.  I went to their website.  No details, and a search for autism revealed nothing.  I've read enough of BadScience  to not find much about science reporting a real surprise, but I have to admit being a little flabbergasted that perhaps none of the news sources had access to the original paper.  So, nobody was able to examine this critically at the time the headlines were being written.

Finally, this evening, the paper is available.  (Thanks to the folk at http://trusttheevidence.net for the hint, via @cebmblog on twitter.)  It's here:  http://bit.ly/AutismScan  It's pretty hard to recognise this as the same study that the press furore is about.  I'm going to limit my comments to what I see - as a non-researching reader - as the major flaws; I'll try and ignore the other irritating aspects, like the absence of a structured abstract (including one is, surely, just polite in modern publishing).
1.  It's about 40 people.  That's 20 subjects and 20 controls. 
2.  This is why the sensitivity and specificity are such neat round numbers - quoted as 90% and 80% respectively.  Apparently sensitivity is the "accuracy" quoted.
3.  The subjects, as far as I can tell, are people with autistic spectrum disorder (ASD), and as far as I can tell - and sorry, I'm not great at all this particular detail - are pretty high functioning.  The authors do discuss this in their paper.  What's clear, however, is that they are not representative of what clinical practice or personal experience will encounter as a spectrum.  
4.  They then measured a whole bunch of stuff, and found that by combining five of these features - they call them dimensions - using a "support vector machine" - I think this was a program - they were able to tell one group from the other, with the sensitivity and the specificity above.
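
The arithmetic behind point 2 is easy to check.  What follows is only a sketch: the counts (18 of 20 and 16 of 20) are my assumption, back-calculated from the quoted percentages and the group sizes rather than read from the paper's tables, and the Wilson interval is a standard textbook formula that I have added, not something the authors report.

```python
import math

# Counts are an ASSUMPTION, inferred from 90% / 80% with 20 people per group.
GROUP_SIZE = 20
true_positives = 18   # people with ASD flagged by the classifier (assumed)
true_negatives = 16   # controls correctly cleared (assumed)

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

sensitivity = true_positives / GROUP_SIZE
specificity = true_negatives / GROUP_SIZE
sens_low, sens_high = wilson_interval(true_positives, GROUP_SIZE)
spec_low, spec_high = wilson_interval(true_negatives, GROUP_SIZE)

# With 20 people per group, one person moves either figure by 5 points.
print(f"sensitivity {sensitivity:.0%} (95% CI {sens_low:.0%} to {sens_high:.0%})")
print(f"specificity {specificity:.0%} (95% CI {spec_low:.0%} to {spec_high:.0%})")
```

On those assumed counts, the intervals run from roughly 70% to 97% for sensitivity and roughly 58% to 92% for specificity, which is a lot of wobble to hide behind "90% accuracy".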

And that's all they did.

They didn't do what most people would require researchers to do under these circumstances:  try out the technique on a completely new cohort and see how well it works.  To see the depth of the flaw in this, I'll give you the example I gave my wife just now.  (Yes, I know I should have more interesting conversations).  
"There were two people in our kitchen.  We knew one to be female, and one to be male.  We noted, from various variables, that wearing a red top identified 100% of the females - with 100% sensitivity and specificity.  We tried this test back on the two people in the kitchen.  It still worked.  Therefore red tops predict female gender..."

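For anyone who prefers code to kitchens, here is a toy version of the same trap.  It is not the paper's method (that was a support vector machine on brain measurements); everything here is made up for illustration.  The features are pure coin flips, the figure of 200 candidate features is arbitrary, and the only thing borrowed from the study is the cohort size of 40.

```python
import random

def make_cohort(rng, n_people, n_features):
    """Random labels and random yes/no features: no real signal anywhere."""
    labels = [rng.randint(0, 1) for _ in range(n_people)]
    features = [[rng.randint(0, 1) for _ in range(n_features)]
                for _ in range(n_people)]
    return labels, features

def accuracy(labels, features, column):
    """Fraction of people whose value in `column` matches their label."""
    hits = sum(1 for y, row in zip(labels, features) if row[column] == y)
    return hits / len(labels)

rng = random.Random(0)           # fixed seed so the run is repeatable
N_PEOPLE, N_FEATURES = 40, 200   # 40 echoes the study size; 200 is arbitrary

# "Discover" the rule on the first cohort: keep whichever feature fits best.
train_labels, train_features = make_cohort(rng, N_PEOPLE, N_FEATURES)
best = max(range(N_FEATURES),
           key=lambda c: accuracy(train_labels, train_features, c))
train_acc = accuracy(train_labels, train_features, best)

# Now try the same rule on people it has never seen.
new_labels, new_features = make_cohort(rng, N_PEOPLE, N_FEATURES)
new_acc = accuracy(new_labels, new_features, best)

print(f"same people: {train_acc:.0%}, new people: {new_acc:.0%}")
```

Because the rule is picked as the best of 200 coin-flip features, it will typically look impressive on the people it was picked from, while on the new cohort it can only be expected to do about as well as chance.  The exact figures depend on the seed.
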
And what about children? 
I can't see anything in the paper about children.
To extend my "kitchen" example, "We expect that this will work even better if we look for red tops in children".

There's a further extension to this.  Like I say, I'm no researcher, but one of the things you learn very early is to distinguish an association from cause and effect.  So, what if, as one of the many possible hypotheses you could generate here, the ASD behaviour causes your brain to develop in certain ways.  This would mean that you could have ASD, and then after years develop the brain changes.  So why would you postulate - strongly - that the brain changes would be the first thing?  
Back to the "kitchen" example.  It's a bit like saying: "Wearing the red top made the female person a female."  Or, "Because the person was female, the top turned red".    

The final aspect to this is something that Ben Goldacre covers with brilliant economy here:  http://bit.ly/crystalppv  
In brief:  The population you start with - the underlying risk of a condition in that population - is the ultimate arbiter of how good a test is.  If you need to remember this as a shorthand, then just memorise that the lower the prevalence, the less likely it is that a positive result from your test is a true positive.  (As Ben says, it's a good idea to do the arithmetic yourself with a pencil - it really helps get this into your brain.  Otherwise, just trust me on this.)  That's why you have to be very, very careful with new tests, and understand how to use them.  
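
If you'd rather not use a pencil, the arithmetic can be sketched in a few lines.  The sensitivity and specificity are the 90% and 80% quoted above; the three prevalence figures are illustrative assumptions of mine (50% simply mirrors the study's 20-versus-20 design) and are not estimates of how common autism actually is.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Chance that a positive result is a true positive (Bayes' theorem)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

SENS, SPEC = 0.90, 0.80   # figures quoted for the scan

# Prevalence values are ASSUMPTIONS chosen only to show the trend.
results = {}
for prevalence in (0.50, 0.10, 0.01):
    results[prevalence] = positive_predictive_value(SENS, SPEC, prevalence)
    print(f"prevalence {prevalence:.0%}: "
          f"a positive scan is right {results[prevalence]:.1%} of the time")
```

With half the group affected, a positive is right about 82% of the time.  At one in ten that drops to about 33%, and at one in a hundred to about 4%: the same test, with the same sensitivity and specificity, giving a very different answer to the question a parent actually cares about.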

So, who is at fault here?  A variety of possibles. 
1.  The media?  It's a slow news day.  I mean, there are only a few million people displaced by floods in Pakistan...  (That's sarcasm, for anyone who can't read it on a computer).  You have to say that it is pretty irresponsible giving this much credence to what is a pretty dense and highly technical research theory paper.  Especially if you've not read the paper.  Which I don't think they can have.
2.  The press office of the journal?  I don't know anything about The Journal of Neuroscience; it seems fairly dry - which is not in itself a criticism; perhaps they had a hack who was really bored of trying to spin their other papers, and this one potentially had a really sexy message.
3.  The researchers?  I've not been interviewed on live national radio, but they didn't seem that keen to play down these findings.  One of the people interviewed - I'm not sure if he was an author or someone wheeled out to give an opinion - stated [quote from memory so apologies for inaccuracy]  "Well, it's hard to diagnose autism; it takes a team of people hours, so this test will really help..."  I'm sorry, but I don't care how hard it is for you to get your Research Grant renewed; that's just irresponsibly put.  

Me?  I'm just assuming that for the next couple of years people will ask me about the scan, and I'll have to grit my teeth, not say what's actually on my mind and reply something like "Um, well, that was all a little optimistic..."

Caveats:  This is a really dense paper, in a journal that I'm not familiar with, and using research methods and terms I don't understand.  If I have made any serious error, then of course I will happily amend and apologise.