Pappus' plane - cricket stats: June 2008

Monday, June 30, 2008

Mini-orders

Samir Chopra asked for a stats post on "mini-orders", and here it is.

A mini-order is defined, for this post, as a block of three players batting at three consecutive positions in the batting order, with the same players in the same positions each time. So, for instance, you could have Langer-Hayden-Ponting as a mini-order (with positions 1, 2, and 3). Now, I could fill up pages with the various possibilities (123, 345, 456, 567, etc.), but that seems like it might be excessive. So below I've listed the leading mini-orders by runs scored. This is, of course, a list heavily biased towards recent teams.

In the table below, the columns are: the number of team innings in which the triple appeared (i); total runs made in those innings by the batsmen in that mini-order (runs); their average in those innings (avg); the number of runs made in partnerships between those three batsmen in those innings (p-runs); and the average of those partnerships, adjusted for era and quality of the bowling relative to an overall average of 31.5 (adj part avg). The regular average and partnership average are typically close to each other. The partnership stats are not complete, since I ignore any team innings which look like they involved a retired hurt.

Note that the order is strict — Langer-Hayden-Ponting is considered separately from Hayden-Langer-Ponting. The latter only happened twice, by my count. If you ignore order, then Taylor-Slater-Boon would go into fifth place. Taylor and Slater alternated almost perfectly in which of the two faced the first ball.
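For anyone wanting to reproduce this sort of count, here is a minimal sketch of the strict-order tally in Python. It is not the code behind the table: the scorecard layout (a list of (name, runs, dismissed) tuples per team innings), the function names and the sample scores are all made up for illustration, and it simply skips innings where fewer than three batsmen reached the positions in question.

```python
from collections import defaultdict

def tally_mini_orders(innings_list, start_pos):
    """Tally innings, runs and dismissals for each strict-order triple.

    innings_list: one entry per team innings, each a list of
        (name, runs, dismissed) tuples in batting order (assumed layout).
    start_pos: 1-based position of the first batsman in the block,
        e.g. 1 for the "123" mini-order, 3 for "345".
    """
    totals = defaultdict(lambda: {"innings": 0, "runs": 0, "outs": 0})
    for batting_order in innings_list:
        block = batting_order[start_pos - 1:start_pos + 2]
        if len(block) < 3:
            continue  # fewer than three batsmen reached these positions
        key = tuple(name for name, _, _ in block)  # order matters
        entry = totals[key]
        entry["innings"] += 1
        entry["runs"] += sum(runs for _, runs, _ in block)
        entry["outs"] += sum(1 for _, _, dismissed in block if dismissed)
    return dict(totals)

def average(entry):
    """Runs per dismissal, or None if nobody was dismissed."""
    return entry["runs"] / entry["outs"] if entry["outs"] else None

# Made-up scorecards, purely for illustration.
sample = [
    [("Langer", 50, True), ("Hayden", 100, True), ("Ponting", 30, False)],
    [("Hayden", 10, True), ("Langer", 20, True), ("Ponting", 60, True)],
    [("Langer", 0, True), ("Hayden", 40, True), ("Ponting", 80, True)],
]
result = tally_mini_orders(sample, start_pos=1)
```

Because the key is an ordered tuple, Langer-Hayden-Ponting and Hayden-Langer-Ponting end up in separate rows, which is the strict behaviour described above. Ignoring order would only need the key to be sorted first.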


pos name1          name2            name3           i   runs  avg   p-runs  adj part avg
123 JL Langer      ML Hayden        RT Ponting      94  15034 60.6  10352   53.8
345 RS Dravid      SR Tendulkar     SC Ganguly      78  11319 55.8  5624    49.5
123 CG Greenidge   DL Haynes        RB Richardson   84  9778  43.8  6611    41.5
123 MS Atapattu    ST Jayasuriya    KC Sangakkara   58  7352  46.5  4807    35.2
456 SR Tendulkar   SC Ganguly       VVS Laxman      54  6742  49.9  2806    53.0
345 JL Langer      ME Waugh         SR Waugh        51  6405  47.1  2340    34.9
123 ME Trescothick MP Vaughan       MA Butcher      49  5956  44.8  4132    41.8
456 ME Waugh       SR Waugh         RT Ponting      43  5466  48.4  2343    54.2
456 PA de Silva    A Ranatunga      HP Tillakaratne 43  4688  40.1  2033    38.2
345 JH Kallis      DJ Cullinan      WJ Cronje       37  4412  43.7  2190    37.6
345 RR Sarwan      BC Lara          S Chanderpaul   35  4281  42.8  1560    40.3
123 SM Gavaskar    CPS Chauhan      DB Vengsarkar   35  4038  40.4  2555    41.3
456 DJ Cullinan    WJ Cronje        JN Rhodes       33  3999  46.0  1678    39.8
345 KC Sangakkara  DPMD Jayawardene TT Samaraweera  32  3992  45.4  1464    30.7
345 HM Amla        JH Kallis        AG Prince       29  3951  52.0  2209    46.4
123 CG Greenidge   DL Haynes        IVA Richards    32  3813  42.8  2990    47.2
345 Younis Khan    Inzamam-ul-Haq   Yousuf Youhana  25  3772  54.7  1778    48.5
123 L Hutton       C Washbrook      WJ Edrich       28  3729  49.7  2407    47.8
345 AP Gurusinha   PA de Silva      A Ranatunga     33  3721  42.8  2081    50.1
123 GR Marsh       MA Taylor        DC Boon         30  3675  43.2  2779    44.3
345 Younis Khan    Yousuf Youhana   Inzamam-ul-Haq  21  3600  66.7  1766    77.0

The constancy of the Australian batting lineup in recent years is well-known, of course, so it's perhaps no surprise to see that the Langer-Hayden-Ponting trio has appeared in more innings in that order than any other. Even allowing for the high scoring these days, they come out easily better than Greenidge-Haynes-Richardson.

Leading mini-order at each position by adjusted average of the batsmen, qualification 10 innings:
123: Woodfull-Ponsford-Bradman, 13 innings, avg 81.7, adj avg 75.4
345: Bradman-Kippax-McCabe, 12 innings, avg 78.8, adj avg 71.7
456: Hussey-Clarke-Symonds, 16 innings, avg 64.3, adj avg 57.8
567: Clarke-Symonds-Gilchrist, 11 innings, avg 53.1, adj avg 48.6

A very Australian affair.

# posted by David Barry : 13:04 7 Comments

Sunday, June 29, 2008

Followup on accuracy of averages

Russ pointed out a couple of things in the previous post. For those who missed the comments thread, here are the revised formulas for calculating uncertainties.

Batting: 0.9 * average / sqrt(# innings)
Bowling: 0.9 * average / sqrt(# wickets)

So, e.g., Mike Hussey becomes 68.4 +/- 9.5. About 68% of 'true' averages will lie within the range given. You need to double it to get it up to 95%.
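As a rough sketch, the revised formulas amount to the following few lines of Python. The function name and the example batsman (averaging 50 over 100 innings) are hypothetical; only the 0.9 constant and the square-root form come from the formulas above.

```python
import math

def average_uncertainty(average, n, k=0.9):
    """One-sigma uncertainty on an average: k * average / sqrt(n).

    n is the number of innings for a batsman, or wickets for a bowler.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return k * average / math.sqrt(n)

# Hypothetical batsman: averages 50 over 100 innings.
sigma = average_uncertainty(50.0, 100)
low_68, high_68 = 50.0 - sigma, 50.0 + sigma            # roughly 68% range
low_95, high_95 = 50.0 - 2 * sigma, 50.0 + 2 * sigma    # roughly 95% range
```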

I haven't made much of an effort to work out the underlying distribution of Australian players that Hussey comes from. To get a rough idea of what should happen, I found the mean and standard deviation of averages of Australian batsmen at batting positions 1 through 7, over the last ten years. There's a bit of a problem about what to do with players who only played a couple of Tests and averaged (say) 5 — clearly they could have averaged up around 20 or 30 if given more opportunities.

Anyway, I bumped those guys up to 20, and the result was something like mean 42, standard deviation 12. So, carrying on with the Hussey example, we crunch the numbers like this:

regressed average = (68.4/9.5² + 42 / 12²) / (1/9.5² + 1/12²)

uncertainty = 1 / sqrt(1/9.5² + 1/12²)

to estimate Hussey's 'true' average as about 58 +/- 7.
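The two formulas above are an inverse-variance weighted mean of the observed average and the population mean. A small Python sketch, using the numbers already quoted (68.4 +/- 9.5 for Hussey, and a population of mean 42 and standard deviation 12); the function and variable names are my own:

```python
import math

def regress_to_mean(observed, observed_sd, prior_mean, prior_sd):
    """Combine an observed average with a population mean,
    weighting each by the inverse of its variance."""
    w_obs = 1.0 / observed_sd ** 2
    w_prior = 1.0 / prior_sd ** 2
    estimate = (observed * w_obs + prior_mean * w_prior) / (w_obs + w_prior)
    uncertainty = 1.0 / math.sqrt(w_obs + w_prior)
    return estimate, uncertainty

estimate, uncertainty = regress_to_mean(68.4, 9.5, 42.0, 12.0)
print(f"{estimate:.1f} +/- {uncertainty:.1f}")  # 58.2 +/- 7.4
```

The smaller the uncertainty on the observed average, the more weight it gets and the less the estimate is pulled towards 42.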

Let's just hope that he can score runs in India.

# posted by David Barry : 11:09 1 Comments

Sunday, June 22, 2008

Accuracy of averages

Today I would like to relate some horrifying thoughts about averages. I would like to be wrong, so if you think that there are mistakes with what I've done, do comment. (Update: See the comments thread, and followup. The uncertainties I give below for batsmen are about twice as big as they should be. For bowlers they are also about two times too big.)

I started thinking about this as I started working my way through The Book: Playing the Percentages in Baseball (the authors blog here), trying to pick out the bits which can carry over to cricket, so that we don't have to re-invent wheels that the baseballers have already made for us.

One key point that they make is that a player's raw statistics aren't the best estimates of his true talent — you have to regress to the mean. The less reliable the stat, the more you regress. The less data you have, the more you regress. (And vice versa.) We know this intuitively in some cases — much though I love him, no-one really thinks that Mike Hussey is an 80-average batsman, and indeed in the West Indies his average has dropped to below 70.

But the question is, how many innings does a batsman have to bat before we can be confident that his average is accurately reflecting his talent (and not have to worry about regressing to the mean)? The short answer appears to be something on the order of 10000 innings, if we want to nail the average down to within a run or so.

That's an appallingly large number of innings, completely counter-intuitive for me. Averages seem to stabilise for batsmen after a hundred innings or so. But that intuition we have is based on the wrong thing. Career averages are stable because subsequent innings can't change the overall average much. A better way of thinking is, what would happen if the player re-ran his career from the start (so same opponents, etc.) but with different luck? Here, luck could be things like balls that beat the bat actually finding the edge (or vice versa), dropped catches, etc.

At this point I still would have thought that over a couple of hundred innings, you'd get the same average, to within a run. But the numbers are telling me different things.

To take an artificial example, suppose that a batsman's scores are exponentially distributed with mean 50, and no not-outs. I ran a few simulations of such a batsman over 300 innings, and here are the sample means that came out: 51.8, 54.4, 47.1, 48.4, 50.1.
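A simulation along these lines can be sketched in a few lines of Python. This is not the code that produced the numbers above, so it will not reproduce those five values; the seed is arbitrary and the function name is my own.

```python
import math
import random

def simulate_career_average(mean_score, innings, rng):
    """Average of `innings` exponentially distributed scores, no not-outs."""
    scores = [rng.expovariate(1.0 / mean_score) for _ in range(innings)]
    return sum(scores) / innings

rng = random.Random(2008)  # arbitrary seed, only so reruns give the same output
averages = [simulate_career_average(50.0, 300, rng) for _ in range(5)]
print([round(a, 1) for a in averages])

# For an exponential distribution the standard deviation equals the mean,
# so the standard error of a 300-innings average is 50 / sqrt(300).
standard_error = 50.0 / math.sqrt(300)
print(round(standard_error, 2))
```

The standard error works out to a little under 3 runs, which is in line with the spread of the five sample means quoted above.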

That's quite a wide range, even for a longer career than any in Test history. At 47.1, you're talking about a very good batsman. At 54.4, he's an all-time great (perhaps not in today's batting-friendly world). In practice, we would expect that it would be even worse than this, because batting scores are not exponentially distributed — the standard deviation for real cricket scores tends to be higher than for exponential scores.

So now let's look at some real cricket scores. The way I'll do this is to take a player, and compare one half of his career to the other. Now, you can't take the first half and second half of the career, because there might be a change in talent over that time (developing better technique, losing reflexes, etc.). So instead, I split the innings into odds and evens (further splitting by first and second innings in matches — I didn't do this perfectly, but it should be close enough). This way, any genuine slumps or good years will be split evenly into the two halves for comparison.
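A simplified sketch of the odds-and-evens split in Python. It alternates innings in career order only and does not attempt the further split by first and second innings of matches; the (runs, dismissed) layout and the six-innings career are made up for illustration.

```python
def batting_average(innings):
    """Runs per dismissal for a list of (runs, dismissed) tuples."""
    runs = sum(r for r, _ in innings)
    outs = sum(1 for _, dismissed in innings if dismissed)
    return runs / outs if outs else None

def split_alternate(innings):
    """Return (evens, odds), numbering innings 1, 2, 3, ... in career order."""
    odds = innings[0::2]   # 1st, 3rd, 5th, ...
    evens = innings[1::2]  # 2nd, 4th, 6th, ...
    return evens, odds

# Made-up career of six innings, in chronological order.
career = [(10, True), (50, True), (0, True), (100, False), (30, True), (20, True)]
evens, odds = split_alternate(career)
even_avg = batting_average(evens)
odd_avg = batting_average(odds)
```

Because neighbouring innings land in opposite halves, a purple patch or a slump contributes about equally to both averages, which is the point of splitting this way rather than by first and second half of the career.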

Allan Border in his 'even' innings (132 of them) averaged 49.5, and averaged 51.6 in his 133 odd innings. That's not too bad, I suppose. The two are pretty close together.

But what about Steve Waugh, who was almost as prolific in terms of innings? Evens 55.9, odds 46.3. Tendulkar: evens 52.6, odds 58.0. Viv Richards: evens 66.1, odds 36.5.

Those are some hefty differences (Richards' being one of the most striking). Here is a plot of the odds average against the evens average for all batsmen who played 50 or more Tests and averaged at least 30.

O the scatter.

That R-squared value drops even further (to 0.18) if you remove Bradman. If there were no luck at all involved, then R-squared would be 1, and the dots would make a nice little y = x line. Cricket is a lot more luck-filled than that.

We would like some kind of estimate of the uncertainty involved in batting averages. As we see from the graph above, they'll be pretty big. I'm not entirely sure if what I did was the best way of doing things, so if any stat-heads amongst you can suggest improvements, please do.

I took the odd averages, guessed an error that went like k * (odd avg) / sqrt(number of odd innings), and fiddled with the constant k until roughly 68% of the even averages fell within that margin. I got k = 1.7 or so. (If anyone could tell me where the 1.7 comes from, I'd be grateful. The average co-efficient of variation for batsmen is about 1.05, so by the Central Limit Theorem I would have expected k = 1.05.)
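The fiddling can be automated with a simple grid search. The sketch below is an assumption about how one might do it rather than what I actually ran, and it uses synthetic exponential careers as a stand-in for real data, so the k it finds says nothing about real Test batsmen. All names and the synthetic ranges are hypothetical.

```python
import math
import random

def coverage(k, pairs):
    """Fraction of players whose even average lies within
    k * odd_avg / sqrt(n_odd) of their odd average.

    pairs: list of (odd_avg, n_odd, even_avg) tuples.
    """
    hits = sum(
        1 for odd_avg, n_odd, even_avg in pairs
        if abs(even_avg - odd_avg) <= k * odd_avg / math.sqrt(n_odd)
    )
    return hits / len(pairs)

def calibrate_k(pairs, target=0.68, step=0.05, k_max=5.0):
    """Smallest k on a grid whose coverage reaches the target, else None."""
    for i in range(1, round(k_max / step) + 1):
        k = i * step
        if coverage(k, pairs) >= target:
            return k
    return None

# Synthetic stand-in for real careers: each fake player has one true
# mean and two independently simulated halves of equal length.
rng = random.Random(1)

def fake_half(mean, n):
    return sum(rng.expovariate(1.0 / mean) for _ in range(n)) / n

pairs = []
for _ in range(200):
    talent = rng.uniform(30.0, 55.0)
    n = rng.randint(40, 130)
    pairs.append((fake_half(talent, n), n, fake_half(talent, n)))

k = calibrate_k(pairs)
```

With real data, `pairs` would hold each batsman's odd average, number of odd innings, and even average.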

So, we can use this to estimate the uncertainty over whole careers, by 1.7 * avg / sqrt(innings).

Even for a career as long as Border's, that gives an uncertainty of about +/- 5.3 runs. Mike Hussey comes out to 68.4 +/- 17.9.

Now in Hussey's case, we'd lean much more towards the lower part of that estimated range — he's not an 85-average batsman. Why do we think that? Because only one man in history has been that good, and no-one else has ever got close. It's much more likely that Hussey is like everyone else than he's like Bradman.

To make estimates of this sort more rigorous, we need to know the distribution of the batsmen that Hussey is a part of. This won't be the overall mean and standard deviation of averages across all Test batsmen, because clearly the talent pool in Australia is much stronger than in Bangladesh. Probably what I'll do is use my adjusted averages and work by country (and possibly era — the standard deviation of averages is on a slow historical decline). But this will be for a later post.

I'll finish by saying that the story is similar for bowlers. Here is the even-odds graph for bowlers with at least 3 wickets per Test over 50 Tests:

Zaheer Khan is that outlier.

The uncertainties I make to be about 1.7 * avg / sqrt(wickets). Warne (for instance) becomes 25.5 +/- 1.6.

# posted by David Barry : 08:26 8 Comments

Saturday, June 14, 2008

Clarke when the pressure's off

Homer broke down Michael Clarke's innings to see what happened when he came in with the score less than 150, and when he came in with the score greater than or equal to 150. Clarke does much better when the going's easy. But that's not a proof that Clarke is special — we would expect that batsmen do better when the bowlers have been struggling to take wickets.

So I ran the numbers for all batsmen at 5 or 6. I grouped the innings into those worse than 3/150 or 4/200 (these seem reasonably equivalent), and those better. Then I took the difference of the averages. Then, to get some mileage out of this old monstrosity post of mine, I got an estimate of the probability that the "going's easy" average would arise by chance, given the "going's not easy" average, and the number of innings in each category. To give an example, Michael Clarke below gets a p-value of 0.20, meaning that only about one in five batsmen would have such a rise in average. If there's an asterisk, then it means that the difference was too large for my estimation algorithm, and I got a senseless result.

(In The Best of the Best, Charles Davis defines a 'pressure average', which takes into account the state of the match: 4/50 in the second innings isn't a pressure situation if you've got a lead of 250 on the first innings. I can't be bothered going into this much detail.)

Note that many of the batsmen below spent much of their career higher up the order. Also note that my stats are a couple of months out of date.

Qualification of at least 10 easy innings and at least 10 not-easy innings:


             worse than 3/150    better than 3/150
name          inns  runs  avg     inns  runs  avg   diff    p
SC Ganguly    77    2285  32.2    43    2069  54.4  -22.3   *
MJ Clarke     24    854   37.1    17    1037  74.1  -36.9   0.20
MV Boucher    12    342   28.5    11    599   66.6  -38.1   0.26
DR Martyn     23    619   31.0    14    787   60.5  -29.6   0.35
PH Parfitt    18    632   39.5    11    696   87.0  -47.5   0.37
DB Vengsarkar 16    439   33.8    11    581   72.6  -38.9   0.37
TE Bailey     25    653   29.7    13    543   60.3  -30.7   0.38
DI Gower      35    1262  39.4    16    926   71.2  -31.8   0.45
KR Miller     32    978   34.9    19    1000  55.6  -20.6   0.49
RP Arnold     13    215   16.5    11    331   30.1  -13.6   0.50

Clarke really has been pretty bad when the going's not easy (well, sort of: 37.1 is below average). In terms of the raw difference, he's fifth worst (Les Ames is just off this table, with a difference of -37.3).

And now those rare batsmen who do worse when the pressure's off:


             worse than 3/150    better than 3/150
name          inns  runs  avg     inns  runs  avg   diff    p
A Flower      80    3761  57.9    10    310   31.0  26.9    *
CH Lloyd      78    3700  52.1    30    987   35.3  16.9    *
ND McKenzie   29    1056  40.6    16    438   27.4  13.2    0.30
A Symonds     10    389   43.2    10    233   29.1  14.1    0.44
SJ McCabe     19    830   48.8    12    397   33.1  15.7    0.46
SE Gregory    37    1015  28.2    12    205   18.6  9.6     0.56
IVA Richards  45    2051  51.3    22    852   40.6  10.7    0.77
RT Ponting    33    1604  51.7    17    570   40.7  11.0    0.81
KD Walters    60    2653  51.0    28    1113  42.8  8.2     0.87
KF Barrington 21    878   43.9    14    409   37.2  6.7     0.90

When the p-value is higher than 0.5, it means that such a 'slump' would occur in the career of at least one in two batsmen, which is pretty unremarkable. Clive Lloyd's record is probably the most remarkable of these, given the relatively large number of innings.

In the set of 83 players, 52 have better averages in easy situations, and 31 in not-easy situations.

Sorry for the no-post last weekend. The problem with devoting only one day a week to cricket stats is that if I don't get something working, then it doesn't get done for a while. I will try to return to IPL analysis next weekend.

# posted by David Barry : 06:00 4 Comments
