Back in July, I wrote an article about how we at Pontem decided to use Substack for all of our company announcements, ranging from updates and case studies to random yet interesting pieces (like this one). From our research at the time, there were a lot of compelling reasons to use the platform, and it accomplished many of the objectives we had in mind (which, as it turns out, many other companies likely share).
Since we made that decision and set up our own company Substack, it's been interesting to keep an eye on the number of writers, companies, influencers, and even politicians that have also moved to Substack or started a new one themselves (did you know that the famous R programmer Hadley Wickham recently started his own Substack to help share best programming practices?).
One of the interesting Substacks to subscribe to is called 'Fiddler on the Proof'. If you have ever skimmed through the website 538, you will know it has some interesting analysis across sports and politics, all generally through a statistical lens. In the past, every Friday they would post their ‘Riddler’ challenge, which was essentially a puzzle for anyone interested to try to solve. Depending on the particular riddle, I personally always found them an interesting way to practice coding solutions to problems in various programming languages.
Well, the Riddler series was recently canceled at 538. However, the host who maintained it (Zach Wissner-Gross) quickly announced his plans to keep the weekly challenge alive outside of 538's website. I can only speculate, but after doing a bit of his own research, he too seems to have decided that Substack would be a good place for his content. So now, every Friday the spirit of the Riddler lives on in the form of a new publication called 'Fiddler on the Proof'.
Because it's Friday and because this week's riddle looked interesting, I wanted to take a shot at solving it and I'll show my solution below.
About halfway through the current Major League Baseball season, all five teams in the American League East division had better records (i.e., winning percentages, or percent of games won) than all five teams in the American League Central division.
Inspired by this surprising fact, suppose Fiddler League Baseball has six divisions, with five teams in each division. For simplicity, further suppose each team has a winning percentage chosen randomly, uniformly, and independently between zero percent and 100 percent.
Let’s look at two divisions: The Enigma League East division and the Enigma League Central division. What is the probability that every team in the Enigma League East division has a higher winning percentage than every team in the Enigma League Central division?
First, we need to understand the distribution of a given team’s winning percentage. The key words in the problem statement are “randomly, uniformly, and independently”, which means that every winning percentage between 0 and 100 is equally likely for each team, regardless of how the other teams do: a team is just as likely to go winless as it is to go undefeated. Certainly this isn’t realistic, but as a first pass at understanding how rare this situation might be in the MLB, it will do for now.
dist = UniformDistribution[{0, 100}]
Now, the subtlety in this problem is that we are looking for situations where EVERY team within one division is better than EVERY team in another division. From a coding perspective, this means checking every pairwise comparison between an East team and a Central team and requiring that the East team wins all of them.
(* True only if every East team beats every Central team *)
compareDivisions[east_, central_] :=
 And @@ Flatten[Outer[Greater, east, central, 1]]
We can evaluate this symbolically with placeholders to confirm we are getting the logic we are looking for, in this example assuming only three teams per division.
compareDivisions[{a, b, c}, {d, e, f}]
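As a cross-check on the pairwise logic, note that every East team beating every Central team is the same as the worst East team beating the best Central team. Here is a minimal sketch of that shortcut for numeric inputs, where compareDivisionsMinMax is a hypothetical name that is not used elsewhere in this post:

```wolfram
(* Hypothetical alternative to compareDivisions, intended for numeric lists *)
compareDivisionsMinMax[east_, central_] := Min[east] > Max[central]

compareDivisionsMinMax[{60, 70, 80}, {10, 20, 30}]  (* True *)
compareDivisionsMinMax[{60, 70, 25}, {10, 20, 30}]  (* False, since 25 < 30 *)
```

The pairwise version is easier to inspect symbolically, while this one needs only a single comparison per simulated league.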
Now we are ready to perform some simulations. We’ll use our expected distribution and run 100,000 simulations of a league with 2 divisions and 5 teams in each division. Then, we can use our helper function above to evaluate how often every team in the East has a higher winning percentage than every team in the Central division.
results = compareDivisions @@@ RandomVariate[dist, {100000, 2, 5}];
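To see what that one-liner is doing, it can help to look at a much smaller sample first. The sketch below repeats the definitions from above so it runs on its own, and uses a hypothetical variable name (sample) with only 3 simulated leagues:

```wolfram
(* Definitions repeated so the block runs on its own *)
dist = UniformDistribution[{0, 100}];
compareDivisions[east_, central_] :=
 And @@ Flatten[Outer[Greater, east, central, 1]]

(* 3 simulated leagues, each with 2 divisions of 5 teams *)
sample = RandomVariate[dist, {3, 2, 5}];
Dimensions[sample]  (* {3, 2, 5} *)

(* @@@ replaces the head of each {east, central} pair with compareDivisions,
   giving one True or False per simulated league *)
compareDivisions @@@ sample
```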
To estimate the probability, we just need to count the number of Trues relative to the number of Falses.
Counts[results]
This gives a probability of around 0.36% that every team in the first division has a better record than every team in the second.
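As a sanity check on the simulation, the same quantity can be worked out by a counting argument. This is a sketch that assumes the ten winning percentages are independent draws from the same continuous distribution, so ties have probability zero and every ordering of the ten teams is equally likely. The five highest-ranked teams are then equally likely to be any five of the ten, and only one of those choices is exactly the East division:

```latex
P(\text{every East team} > \text{every Central team})
  = \frac{1}{\binom{10}{5}}
  = \frac{5!\,5!}{10!}
  = \frac{1}{252}
  \approx 0.397\%
```

That is in the same ballpark as the simulated value, and the difference is of the size one would expect from sampling noise with 100,000 trials.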
If we wanted to generalize this, we could easily adapt the code above to show how this likelihood changes as we introduce more or fewer teams to a given division. Starting with only a single team in each division, we would expect a 50% chance that the East will be better than the Central, because the only other option is the opposite. As the number of teams increases, however, that probability drops off roughly exponentially.
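One way this generalization might look is sketched below. The helper names simulatedProbability and exactProbability are hypothetical, and the exact formula 1/Binomial[2 n, n] rests on the assumption that all 2n winning percentages are independent draws from the same continuous distribution, so that every ranking of the teams is equally likely:

```wolfram
(* Definitions repeated so the block runs on its own *)
dist = UniformDistribution[{0, 100}];
compareDivisions[east_, central_] :=
 And @@ Flatten[Outer[Greater, east, central, 1]]

(* Hypothetical helper: simulated probability with n teams per division *)
simulatedProbability[n_, trials_ : 100000] :=
 N[Count[compareDivisions @@@ RandomVariate[dist, {trials, 2, n}], True]/trials]

(* Hypothetical helper: exact value under the equally-likely-rankings assumption *)
exactProbability[n_] := 1/Binomial[2 n, n]

(* Compare the two for 1 to 6 teams per division *)
TableForm[
 Table[{n, simulatedProbability[n], N[exactProbability[n]]}, {n, 1, 6}],
 TableHeadings -> {None, {"teams", "simulated", "exact"}}]
```

For one through five teams per division, the exact values are 1/2, 1/6, 1/20, 1/70, and 1/252, which shows how quickly the probability falls.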
Applying this to the situation in the MLB this season still requires a few additional steps. We would first want to increase the number of divisions in the league (which should increase the probability of seeing an outcome like this somewhere in the league), and we would also want to gain a better understanding of the typical winning percentage distribution in the MLB (which will not be uniform). I’ll save that work for another day, but I think what this analysis starts to tell us is that it may have been a fairly rare event for the American League East to be so much stronger than the American League Central this season.