The following derives largely from an email I sent to some "fellow travelers" before I heard of this web site.
Consider three SOLiD runs we did, all genomic DNA (fragment libraries, 35mers, single flowcell, one region): Arabidopsis (Ath), Wheat and Honeybee (Bee37). From various measures (satay plots, etc.) they look good. But what does that mean?
The standard AB metric appears to be "mappable reads"--defined as the number of reads that align to the reference sequence with 3 or fewer mismatches. That doesn't work very well as a measure of the usability of any of our data sets[1].
One metric we use for Sanger sequencing is "how good does the data set think it is?", or more often "how good does the program phred think the data set is?" as recorded in quality values for each base.
While not perfect, it gives you a rough idea of where you stand. That is, the quality value for each base call in the quality file is an estimate of the probability that the base call is wrong (Phred-scaled, so Q = -10*log10(p)). Convert each quality value back to a probability and sum them over a read, and you have an estimate of the expected number of errors for that read.
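The conversion and sum can be sketched like this (a minimal Python sketch, not my actual script; it assumes standard Phred scaling, Q = -10*log10(p)):

```python
def expected_errors(quals):
    """Expected number of miscalls in a read, given its per-base
    Phred quality values (Q = -10*log10(p_error))."""
    return sum(10 ** (-q / 10) for q in quals)

# e.g. a 35-base read with every base at Q20 (1% error chance each)
# has an expected 0.35 errors
print(expected_errors([20] * 35))
```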
That sounds like a useful metric to me. I wrote a simple Perl script to take a quality file and print an errors/read histogram--that is, the number of reads in each category, where the categories are 0-1 errors/read, 1-2 errors/read, etc.:
http://www.genomics.purdue.edu/~pmig...ist_solid.perl
It is very slow on 150 million+ reads. So I just ran it on the first million reads from each data set.
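For anyone who wants to reimplement it, the logic is roughly this (a Python sketch, not the Perl script linked above; it assumes one whitespace-separated line of quality values per read, which may not match your quality-file format):

```python
from collections import Counter

def error_histogram(quality_lines, max_reads=1_000_000):
    """Bin reads by expected errors/read: bin 0 holds reads with
    0-1 expected errors, bin 1 holds 1-2, and so on."""
    hist = Counter()
    for i, line in enumerate(quality_lines):
        if i >= max_reads:
            break
        quals = [int(q) for q in line.split()]
        expected = sum(10 ** (-q / 10) for q in quals)
        hist[int(expected)] += 1
    return hist

# toy example: one good 35mer (all Q20) and one poor one (all Q5)
lines = ["20 " * 35, "5 " * 35]
print(sorted(error_histogram(lines).items()))  # -> [(0, 1), (11, 1)]
```

The Q20 read expects 0.35 errors (bin 0); the Q5 read expects about 11 (bin 11).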
What may be a useful metric is the number of beads/reads/bases at or below some given number of expected errors/read. I chose 3 and 6 errors per read for the table below. This would correspond to aligning against a perfect reference sequence with the mismatch limit set to 3 or 6. More about why I chose "3" and "6" below.
Note that the runs have quite different characteristics. First, they have different numbers of usable beads--from the standard 150 million up to 275 million for Ath. Their error frequencies per read also differ.
Why "3" and "6" errors (miscalls)? "3" is the number in the standard definition of "mappable". Since that definition was originally designed for 25mers, I wanted the same chance of mis-mapping for a 35mer read. A simplistic measure of this is the chance that a random 35mer aligns to a 35-base segment of reference sequence within the mismatch limit. Presuming equal A, C, G and T composition (which is roughly true in E. coli, but not in Wheat, Honeybee or Arabidopsis), I think the formula for calculating this would be (don't trust me on this though...)
p = sum over k from 0 to m of ( (3/4)^k * (1/4)^(n-k) * (n! / (k! * (n-k)!)) )

where "m" is the number of allowed mismatches, "n" is the read length, "k" is the number of mismatched positions in each term (each random base matches with probability 1/4 and mismatches with probability 3/4), and "p" is the probability of a spurious match in a single comparison.
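For what it's worth, this binomial tail sum is easy to evaluate directly (a quick Python sketch of the equal-composition model described above, nothing more):

```python
from math import comb

def p_spurious(n, m):
    """Probability that a random n-mer matches a fixed n-base
    reference segment with at most m mismatches, assuming each
    position matches independently with probability 1/4."""
    return sum(comb(n, k) * (3/4)**k * (1/4)**(n - k)
               for k in range(m + 1))

# compare the 25mer/3-mismatch case against 35mers at various limits
print(f"n=25, m=3: p = {p_spurious(25, 3):.2e}")
for m in range(3, 9):
    print(f"n=35, m={m}: p = {p_spurious(35, m):.2e}")
```

Allowing a read to match the whole segment (m = n) drives p to 1, which is a handy sanity check on the sum.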