The number I get for 3 mismatches at 25 bases is 5.8E-11. The number I get for 7 mismatches at 35 bases is 1.3E-11. But since the genome compositions I am likely to see will perform worse than these ideal situations, I drop down to 6 mismatches.
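For anyone who wants to check figures of this kind, here is a minimal sketch. It assumes the numbers are the chance that a uniformly random sequence matches a given site with at most k mismatches, with all four symbols equally likely and independent. That is an assumption about how the figures were derived, not a description of the actual calculation, and the function name is made up for illustration. Under that assumption it gives about 5.8E-11 for 3 mismatches at 25 bases and about 1.35E-11 for 7 mismatches at 35 bases.

```python
from math import comb  # Python 3.8+


def random_match_probability(read_length, max_mismatches, alphabet_size=4):
    """Chance that a uniformly random sequence of read_length symbols
    matches a fixed target with at most max_mismatches mismatches.

    Assumes equally likely, independent symbols (the "ideal situation").
    """
    # Count the sequences within the mismatch limit: choose which k
    # positions differ, then one of the (alphabet_size - 1) other symbols
    # at each of those positions.
    matching = sum(
        comb(read_length, k) * (alphabet_size - 1) ** k
        for k in range(max_mismatches + 1)
    )
    return matching / alphabet_size ** read_length


for length, mismatches in [(25, 3), (35, 7), (35, 6)]:
    p = random_match_probability(length, mismatches)
    print(f"{mismatches} mismatches at {length} bases: {p:.2e}")
```

A skewed base composition makes chance matches more likely than this uniform model predicts, which is the reason for leaving some margin.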
A word about quality here. A trimmed Sanger read with 6 errors per 35 bases is terrible quality. However, SOLiD reads are dual-base encoded, so they should perform better for generating accurate consensuses than single-base encoded sequence reads. On the other hand, the quality values I'm dealing with here are generated by the SOLiD base caller. Do they accurately reflect the real chance of a miscall?
I don't know. (I would be very interested to hear from anyone who does know.) It would require quite a bit of work to test. I presume Applied Biosystems has tuned their base caller to give fairly accurate quality values. But I haven't verified this.
In the absence of an accurate reference sequence for any given project, I think this gives at least some information on how useful a data set will be for a given purpose. Or at least a general sense of the overall quality of a data set.
Finally, I mentioned to some of you that I suspected that the error rate for the last 10 bases of a 35mer might be much (like 5x) higher than for the rest of the read. If these quality values are to be believed, errors are more likely in the last 10 bases than in the first 10 bases, but not drastically so. For the Arabidopsis data set the mean number of projected errors is 1.4 bases out of 10 for the first 10 bases, whereas it is 2.3 bases out of 10 for the last 10 bases. That includes all the really bad reads in the data set, though.
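A rough sketch of how projected error counts like these can be derived from quality values follows. It assumes the quality values are Phred-scaled, so that a value q corresponds to a miscall probability of 10^(-q/10), and that the projected number of errors in a window is the sum of those probabilities. The function names and the two example reads are made up for illustration; they are not from the Arabidopsis data set.

```python
def error_probability(q):
    """Miscall probability for a Phred-scaled quality value (assumed scale)."""
    return 10 ** (-q / 10)


def projected_errors(quals, start, end):
    """Expected number of miscalls in quals[start:end]."""
    return sum(error_probability(q) for q in quals[start:end])


def mean_projected_errors(reads, window=10):
    """Mean projected errors in the first and last `window` bases of each read."""
    first = [projected_errors(q, 0, window) for q in reads]
    last = [projected_errors(q, len(q) - window, len(q)) for q in reads]
    return sum(first) / len(first), sum(last) / len(last)


# Made-up quality values for two hypothetical 35-base reads.
example_reads = [
    [30] * 25 + [10] * 10,  # good start, poor last 10 bases
    [20] * 35,              # uniform quality throughout
]

first_mean, last_mean = mean_projected_errors(example_reads)
print(f"first 10 bases: {first_mean:.3f} projected errors")
print(f"last 10 bases:  {last_mean:.3f} projected errors")
```

These projections are only as good as the base caller's quality values, so they describe what the data set claims about itself rather than a measured error rate.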
Phillip
[1] The problem is getting a valid reference sequence. Arabidopsis-Col-0 is sequenced, but most of our Arabidopsis sequence derives from Arabidopsis-Ler (not sequenced). Wheat is not sequenced. Honeybee has a draft sequence, but honeybee is very heterozygous and our queen bee would have large stretches of her genome from an African source. (The Honeybee reference is European.)