Discussion of Racing Data: 2016 Edition

A topic I never tire of seeing discussed has surfaced again in the small but vocal Thoroughbred racing sphere. No, not rabbits, but there’s been no shortage on that topic this week! I’m talking about racing data, a topic that’s been discussed here and elsewhere for quite some time (see partial history here).

Before I go any further I want to disclose that Equibase is and has been a Hello Race Fans and Raceday360 advertiser (sites I co-own and operate).

The current tide of commentary started with a Paulick Report Q&A with Equibase CEO Jason Wilson. In a way you have to hand it to Wilson, as he had to know these discussions would pop up.

A few notable moments…

Start with the comment thread at the Q&A, which has a lot of good insight and valuable feedback should Equibase be inclined to listen. Here’s part of a comment from Tinky:

Furthermore, charging relatively high prices for basic information is a symptom of a larger problem in racing, namely an inability to recognize price sensitivity amongst its customers. There are many big players who have left the game because of high takeout rates, and yet insensitivity to that problem obviously remains.

And one from Ben Cheese:

Hardly a “new” technology but Equibase should FIRST update and re-engineer its website. It is primitive, incredibly clunky and not even close to the standard that any consumer expects today (DRF should do the same). While the industry is ripe for disruption, charging consumers to become acquainted with new products is not a winning or welcoming approach. Like most of the rest of the web, use the “freemium” model and then charge for upgrades. Both Equibase and DRF are years behind what most web players are doing. They’ve got a LOT of work to do.

Matt Gardner, a longtime commentator on racing data, posted a good blow-by-blow response to Wilson’s answers, and you should read the whole thing. Here’s one particularly notable point that refutes Wilson’s claim that the free data in other sports is used more for fan engagement than anything else:

When Mr. Wilson goes to FanGraphs, and he views stats like WAR (“Wins Above Replacement”) and xFIP (“Expected Fielding Independent Pitching”), does he think no one is setting their fantasy lineup or drafts off of those numbers? And, by the way, the formula for those stats are provided on the website and any fan could calculate them using raw data downloaded for free. Or they could download those stats for free to EXCEL right from FanGraphs, and not a pdf file.

This is a great time to point out a racing entity doing it right and what can come of it. You can go to Keeneland’s Handicapping Database right now and download a csv/spreadsheet of all racing dating back to 2006. Imagine if you could do that for every track or specific meets over a time period at Equibase? You’d see more things like this post that looks at winning post positions by distance on the turf at Keeneland going back to 2006. There’s also a data visualization. (I have to update the post and viz for the last meet, by the way.) I have no doubt that people who are more mathematically minded than me could take it further and come up with things like confidence ratings for each post at every distance.
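To make the "confidence ratings" idea a bit more concrete, here is a minimal sketch of how someone might compute a win rate and a rough confidence range for each post at each distance. The column names (`distance`, `post`, `finish`) and the sample rows are made up for illustration and are not the layout of Keeneland’s download or anyone else’s. The Wilson score interval is just one reasonable choice for small samples, not the only one.

```python
import csv
import io
import math
from collections import defaultdict

# Made-up sample rows. The column names are assumptions for this sketch,
# not the real layout of any track's data download.
SAMPLE_CSV = """distance,post,finish
8.0,1,1
8.0,1,4
8.0,1,2
8.0,1,1
8.0,2,3
8.0,2,1
8.5,1,5
8.5,2,1
"""


def wilson_interval(wins, starts, z=1.96):
    """Approximate 95% Wilson score interval for a win rate."""
    if starts == 0:
        return (0.0, 0.0)
    p = wins / starts
    denom = 1 + z * z / starts
    center = (p + z * z / (2 * starts)) / denom
    half = z * math.sqrt(p * (1 - p) / starts + z * z / (4 * starts * starts)) / denom
    return (center - half, center + half)


def post_position_stats(csv_text):
    """Tally wins and starts per (distance, post) and attach an interval."""
    counts = defaultdict(lambda: [0, 0])  # (distance, post) -> [wins, starts]
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["distance"], int(row["post"]))
        counts[key][1] += 1
        if int(row["finish"]) == 1:
            counts[key][0] += 1

    stats = {}
    for key, (wins, starts) in counts.items():
        low, high = wilson_interval(wins, starts)
        stats[key] = {
            "wins": wins,
            "starts": starts,
            "win_rate": wins / starts,
            "low": low,
            "high": high,
        }
    return stats


if __name__ == "__main__":
    for (distance, post), s in sorted(post_position_stats(SAMPLE_CSV).items()):
        print(distance, post, s["wins"], s["starts"],
              round(s["win_rate"], 3), round(s["low"], 3), round(s["high"], 3))
```

With only a handful of starts per post the ranges come out very wide, which is sort of the point: you need years of easily downloadable results before numbers like these mean much.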

Regardless of what you think of sites like FiveThirtyEight, it’s a great example of how historical data can be used to create compelling, informative content. And if something like that existed for racing it could be used to help you make handicapping and/or wagering decisions. If it were up to me Equibase.com would be part data-driven content with visualizations, part easier-to-access stats (with none of this sort of thing) as well as downloadable data in usable formats. Do a query, view the data with the option to export to csv. This doesn’t have to be proprietary data, but data that’s already available for free in the pdf charts.

There are plenty of other ways that one can look into historical data to find patterns, etc. Here’s a post I did last year on what wins the Breeders’ Cup Juvenile. I manually input the chart data, made freely available from the Breeders’ Cup stats site, and then looked to see if any patterns emerged for winners. Spoiler alert: patterns emerged. View and grab the data here for the Juvenile and here for the Juvenile Fillies.

Someone recently asked this question at Raceday360:

I have a theory that the roi on favorites in the last race of the day may be better than other races given folks trying to get out fit the day. Any stats for last race of the day?

If he had some historical data he could probably figure it out, and wouldn’t that be interesting to know? Who knows what else there is to determine when you make it easy to dig into data. I’d like to think we’ll find out some day.
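For what it’s worth, the question above is not hard to answer once results are in a usable format. Here’s a rough sketch, assuming a hypothetical CSV with one row per race: the date, the race number, whether the favorite won, and what the favorite paid to win on a $2 bet. Those column names and the sample rows are invented for illustration. It treats the highest race number on each date as the last race and compares a flat $2 win bet on every favorite in the last race against the same bet in all other races.

```python
import csv
import io

STAKE = 2.0  # flat win bet per race; payouts below are assumed to be per $2

# Invented sample data. Column names are assumptions for this sketch only.
SAMPLE_CSV = """date,race,fav_won,fav_payout
2016-04-01,1,1,4.20
2016-04-01,2,0,0
2016-04-01,3,1,5.00
2016-04-02,1,0,0
2016-04-02,2,1,3.80
2016-04-02,3,0,0
"""


def favorite_roi_by_slot(csv_text):
    """Return ROI on favorites for 'last' races vs. all 'other' races."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # The last race of each day is taken to be the highest race number seen.
    last_race = {}
    for row in rows:
        day = row["date"]
        last_race[day] = max(last_race.get(day, 0), int(row["race"]))

    totals = {"last": [0.0, 0.0], "other": [0.0, 0.0]}  # [wagered, returned]
    for row in rows:
        slot = "last" if int(row["race"]) == last_race[row["date"]] else "other"
        totals[slot][0] += STAKE
        if row["fav_won"] == "1":
            totals[slot][1] += float(row["fav_payout"])

    # ROI = (returned - wagered) / wagered; None if nothing was wagered.
    return {
        slot: ((returned - wagered) / wagered if wagered else None)
        for slot, (wagered, returned) in totals.items()
    }


if __name__ == "__main__":
    print(favorite_roi_by_slot(SAMPLE_CSV))
```

Six made-up races prove nothing, of course. A real answer would also need a track column (so days at different tracks aren’t lumped together) and enough seasons of results to say anything with confidence.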

Cohort, colleague, racing pal and friend Jessica Chapel has been writing on this topic for years. The earliest instance I found was 2005… ELEVEN YEARS AGO. Since then she’s written on the topic numerous times (see partial history here) and again brings incredible insights with this no-holds-barred Tweetstorm:

And about that pricing… it preemptively negates the ability of smaller, independent shops or individuals to create well-priced tools. As Chapel points out in her example, not everything is a robust, fully formed past performance product like TimeformUS or Formulator. I had a little idea a few years ago for a non-PP tool that would work well in a freemium/tiered model, but it would need a little data (horse name, sire/dam, age, sex, workouts, entries, results, scratches). Using the example pricing from Chapel’s tweets, anything near that 2005 quote means that I wouldn’t be able to charge a reasonable rate, which means I couldn’t sustain or grow the product because no one would want to pay what would need to be charged to cover those costs, let alone make any money. Which means the product will never exist. Which means that all of you out there with ideas for cool little useful products or apps will not be making them because the data is cost prohibitive. Which means that the incredibly data-rich landscape of racing will continue to be product and innovation poor until many of these issues are addressed.

Aside from the issues that are known to us as consumers of the data, there are other issues that are much harder to address:

And I hope for all of your sakes that he’s working on this now:

His insight is invaluable to understanding and eventually addressing longer-term structural data problems. Every organization that has data, and that has had data for a long time, faces these issues. You can’t blame Equibase for that, and it’s not an easy problem to solve, but at some point an aging, and apparently failing, data structure needs to be addressed. It takes time, money, resources and, most importantly, proper stewardship, governance and design. I’m not an insider, but nothing I’ve seen or heard has made me think this is internally known or that there is a desire to address it… but I would love to be wrong!