Prior to the introduction of statistical analytics in sports, most domain experts would have based their predicted results on intuition and gut feelings. As data became more common, experts began to base their predictions on simple algorithms that factor in only a handful of variables, such as the historical results of a given matchup. Developments in data science have opened up a variety of new methods for collecting and analyzing data for sports.
Statistical analysis has become integral to many major sports. Statistics in baseball has become so well-entrenched that an award-winning movie starring Brad Pitt was made on the subject, titled Moneyball. In basketball, SportVU tracks the movements of every player and the ball to help teams determine specific locations on the court where individual players should shoot from. General managers such as Daryl Morey of the Rockets and players like Shane Battier have made extensive use of statistics to improve their teams or their own performance. In American football, though less prevalent in public consciousness, advanced statistics have informed both fan discussion, especially with regard to Fantasy Football, and team decision-making.
Soccer (or ‘football’ in most countries outside of the United States) is indisputably the world’s most popular sport. The English Premier League (EPL), just one of the numerous professional soccer leagues around the world, is televised in 225 broadcast territories worldwide and by itself draws an in-home audience of three billion viewers. The market value of the EPL is estimated to be more than a staggering 5 billion US dollars.
Despite its sizable lead in the popularity contest among world sports, soccer is late to the “analytics revolution”. But why? There is a tremendous amount of value in being able to predict outcomes with a high degree of confidence. A fan might decide whether or not to attend a game based on the likelihood of a victory for her team. A manager might decide which players to start or rest based on the predicted outcome of a match. The team owner might decide which players to add to his team and which are “surplus to requirements.” A sportsbook might decide where to set the spread of any game or a team’s seasonal over/under. Data has the potential to revolutionize soccer, as it has done for many other sports.
Although there’s no simple answer as to why soccer has not embraced the analytical revolution to the extent that other sports have, part of the answer lies within the quick, fluid, and undisturbed nature of the game itself. Baseball, basketball, and American football are separated into many different types of ‘events’ that can be distinguished by resets. In baseball, every pitch thrown is a discrete event. In basketball, there is a pause and reset after every shot scored and after most turnovers. In American football, every play has an explicitly defined beginning and end. These interval breaks allow for resets and re-strategizing, making these sports generally more predictable for analysts, as scenarios are played out from fixed circumstances. In soccer, however, the playing time remains largely uninterrupted over the course of each 45-minute half. Baseball in large part has benefited from its systematic gameplay, where the same scenarios are replicated over and over again, allowing statisticians to evaluate outcomes of these situations in a fairly straightforward way. The larger the number of meaningful ‘events’ that a sport generates, the larger the sample size and the easier it is to develop reasonable statistical models.
In addition, in baseball, basketball, and American football, there are many ‘traditional’ stats (hits, strikeouts, home runs, points scored, rebounds, completed passes, receiving yards, etc.) that are both useful by themselves and also factors in more ‘advanced’ stats, such as ‘Wins Above Replacement’ for baseball, ‘Plus/Minus’ for basketball, and ‘Expected Points’ for American football. In soccer, traditional numbers such as passes completed, distance run, and tackles made have been shown to have little predictive value.
There is only one clear measure of success: goals. The team that scores the most goals wins (barring any unusual circumstances) and if both teams score an equal number of goals or no goals at all, they share the honors (unless the match is a knockout match that requires a winner). The “holy grail” for soccer statisticians is thus the ‘Expected Goals’ metric. The ExpG metric is an attempt at boiling down all of the action that occurs over 90 minutes into a single number. In more detail, ExpG is the number of goals that a team should have scored, taking into consideration a host of factors such as shots on target, number of corners, or number of free kick attempts, in a logistic regression model. Comparing ExpG with the goals actually scored also indicates whether a team overperformed or underperformed during a given match.
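To make the idea concrete, here is a minimal sketch of an ExpG-style calculation in Python. It assumes one common formulation, in which a logistic model assigns each scoring chance a probability of becoming a goal and a team's ExpG is the sum of those probabilities. The feature names (`on_target`, `from_corner`, `from_free_kick`), the weights, and the intercept are all hypothetical values chosen for illustration; they are not fitted to real match data and do not come from any published model.

```python
import math

# Hypothetical weights and intercept, for illustration only (not fitted to data).
WEIGHTS = {"on_target": 1.2, "from_corner": -0.4, "from_free_kick": -0.2}
INTERCEPT = -2.5


def scoring_probability(chance):
    """Logistic model: probability that a single chance becomes a goal."""
    z = INTERCEPT + sum(WEIGHTS[name] * chance.get(name, 0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def expected_goals(chances):
    """ExpG for a team: the sum of the scoring probabilities of its chances."""
    return sum(scoring_probability(chance) for chance in chances)


# Made-up chances for one team in one match.
match_chances = [
    {"on_target": 1},
    {"on_target": 1, "from_corner": 1},
    {"from_free_kick": 1},
]

exp_g = expected_goals(match_chances)
actual_goals = 1  # made-up result
# A positive difference suggests overperformance, a negative one underperformance.
print(f"ExpG = {exp_g:.2f}, actual = {actual_goals}, difference = {actual_goals - exp_g:+.2f}")
```

In a real model the weights would be estimated from a large set of historical chances rather than set by hand.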
Unfortunately, current iterations of ExpG (at least the ones available to the public) leave much to be desired. Their predictive power is often just slightly better than simply multiplying shots on goal (regardless of position or context) by a constant value.
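The naive baseline mentioned here is easy to write down. The sketch below assumes the constant is simply the historical conversion rate (total goals divided by total shots on goal); the function names and the match figures are made up for illustration.

```python
def fit_constant(shots_on_goal, goals):
    """Estimate the constant as total goals divided by total shots on goal."""
    total_shots = sum(shots_on_goal)
    if total_shots == 0:
        raise ValueError("need at least one shot on goal to fit the constant")
    return sum(goals) / total_shots


def baseline_expected_goals(shots, k):
    """Naive ExpG: shots on goal times a constant, ignoring position and context."""
    return k * shots


# Made-up history of four matches: shots on goal and goals scored in each.
history_shots = [4, 6, 3, 7]
history_goals = [1, 2, 0, 3]

k = fit_constant(history_shots, history_goals)
print(f"constant = {k:.2f}, baseline ExpG for 5 shots = {baseline_expected_goals(5, k):.2f}")
```

Any proposed ExpG model, including one that uses crowd noise, would need to beat this kind of baseline on held-out matches to justify its added complexity.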
With the proliferation of data-capturing and data-analyzing technologies being developed today, there emerges an opportunity to utilize more creative ways of developing sports metrics. We believe that by studying the crowd noise during the course of an EPL match and matching it to certain events within the game, we could substantially improve upon existing ExpG models.
There are many subtleties in positions and reactions that statistical models may not be able to capture, even with expensive player tracking technology. In soccer, no two tackles are the same, no two passes are the same, no two shots are the same, etc. By nature, the sport is overwhelmingly dynamic and this has hindered the ability of analytics to penetrate the sport’s confines; yet, the aggregate of the reactions of each crowd member tends to signal important events in a match fairly well. A vast majority of fans in the stadium during a soccer match between two top-league sides have some instinctual knowledge of how the game is flowing and are likely to react to changes in the flow of the game accordingly. For example, a top-level attacker such as Lionel Messi or Cristiano Ronaldo receiving the ball 10 yards from goal on his dominant foot with nobody between him and goal but the goalie will elicit a different reaction than a plodding central defender such as Ryan Shawcross driving the ball from 40 yards away.
We believe that supervised machine learning will be important during the early stages of this implementation. Once the stadium noise data is collected for each individual match, the data would be labeled in order to match certain events in the soccer match with the noise data. The hope is that with enough noise data collected and labeled for a given stadium, the machine learning algorithm should be able to identify certain features in the crowd noise data that are important in determining the odds of a given team scoring a goal. The current ExpG metric uses specific observed features in its logistic regression model; however, we expect there to be many important features in the crowd noise data that are nearly impossible to collect through pure observation, but can be captured using a machine learning model.
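As a rough illustration of the labeling step, the sketch below pairs each annotated match event with summary features of the crowd noise recorded around it, producing the kind of (features, label) examples a supervised model could train on. Everything in it is an assumption made for illustration: the function name `label_noise_windows`, the ten-second window, the two features (mean and peak loudness), and the sample readings are hypothetical, and a real pipeline would work from raw audio rather than a handful of loudness values.

```python
from statistics import mean


def label_noise_windows(samples, events, half_window=5.0):
    """Pair each labeled event with features of the crowd noise around it.

    samples: list of (time_in_seconds, loudness) tuples
    events:  list of (time_in_seconds, label) tuples
    """
    examples = []
    for event_time, label in events:
        # Keep the loudness readings within half_window seconds of the event.
        window = [level for t, level in samples if abs(t - event_time) <= half_window]
        if not window:
            continue  # no audio was recorded near this event, so skip it
        features = {"mean_level": mean(window), "peak_level": max(window)}
        examples.append((features, label))
    return examples


# Made-up loudness readings (one every five seconds) and annotated events.
samples = [(0.0, 60.0), (5.0, 62.0), (10.0, 80.0), (15.0, 95.0), (20.0, 70.0)]
events = [(12.0, "shot"), (100.0, "goal")]

training_examples = label_noise_windows(samples, events)
for features, label in training_examples:
    print(label, features)
```

Here the second event is dropped because no noise readings fall inside its window, which mirrors the practical point that only events with matching audio can be used for training.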
Our belief is that deep neural networks are likely to emerge as the best machine learning model to use in this analysis. This belief is not entirely unjustified, as many experts in the field of machine learning tout deep neural networks as the state-of-the-art technology when it comes to speech classification. Because speech classification and crowd noise analysis both rely on audio data, it is fair to assume that deep neural networks are likely to be the best machine learning model for crowd noise analysis as well.
One of the issues with incorporating crowd noise data in the ExpG metric is that it makes it more difficult to associate certain events with outcomes in a soccer match. The current ExpG gives weights to explicit indicators in a soccer match depending on how likely they are to impact the number of goals that a team is expected to score. This makes it easy for anyone to pick out which events are important when it comes to the ExpG calculation. However, once a vectorized machine learning output is incorporated into the model, it becomes much more difficult to understand what features the model considers to be important, as the deep neural network operates as a “black box” of sorts.
Another issue is that crowd noise data is susceptible to countless anomalies. Soccer, arguably more than any other major sport in the world, is deeply entrenched in massively complex concepts such as storylines, histories, and culture. These concepts would surely create lots of noise within the crowd data. It would be a monumental challenge to eliminate this noise from the crowd noise data, but measures can be taken to minimize its impact on the analysis. Eventually, with enough data fed to the machine learning model and with improved microphone technology that is able to capture more intricacies within the crowd reactions, these anomalies could possibly be detected, as the model may learn to pick up subtle differences between the noise and the signal.
The ExpG is meant to be the ideal proxy for a team’s “performance”, and if crowd noise data does in fact improve on the metric, the soccer ecosystem would benefit enormously. It is frequently the case that the final result of a soccer match does not tell the complete story of what occurred during the 90 minutes. The same is the case with every other performance metric that exists today in soccer, whether it be number of corners, shots on goal, or percentage of possession. Fans, managers, team owners, and others with stakes in a soccer match want to know how well their team performed in order to make more meaningful judgements about whether their team deserved a given result. By comparing the ExpG metric for a given match with that of previous matches, a proper judgement can be made as to whether the team is improving or not, regardless of match outcome. Currently, the ExpG metric does not do a satisfactory job of explaining this and is brushed under the carpet during most post-match analyses, leaving it in the hands of various pundits to explain their assessment of the match.
Whether through the analysis of crowd noise data or otherwise, one thing is for sure: the soccer community has some catching up to do when it comes to sports analytics. Data has the potential to be revolutionary and the soccer community should make it a goal to come up with better methods of leveraging data in order to analyze team performance.
– Haroon Choudery and Alex Yang