Uncertain expectations

In this previous post, I described a relatively simple version of an expected goals model that I’ve been developing recently. In this post, I want to examine the limitations and uncertainties relating to how well the model predicts goals.

Just to recap, I built the model using data from the Premier League from 2013/14 and 2014/15. For the analysis below, I’m just going to focus on non-penalty shots with the foot, so it includes both open-play and set piece shot situations. Mixing these will introduce some bias but we have to start somewhere. The data amounts to over 16,000 shots.

What follows is a long and technical post. You have been warned.

Putting the boot in

One thing to be aware of is how the model might differ if we used a different set of shots for input; ideally the answer we get shouldn’t change if we use only a subset of the data or if we resample the data. If the answer doesn’t change appreciably, then we can have more confidence that the results are robust.

Below, I’ve used a statistical technique known as ‘bootstrapping’ to assess how robust the regression is for expected goals. Bootstrapping belongs to a class of statistical methods known as resampling. The method works by randomly drawing shots from the dataset with replacement and rerunning the regression many times (1000 times in the plot below). Using this, I can estimate a confidence interval for my expected goal model, which should provide a reasonable estimate of goal expectation for a given shot.

For example, the base model suggests that a shot from the penalty spot has an xG value of 0.19. The bootstrapping suggests a 90% confidence interval of 0.17 to 0.22. In other words, given the data, we can be reasonably confident that Premier League footballers shooting from the penalty spot score somewhere between 17% and 22% of the time.
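A minimal sketch of the percentile bootstrap is below. It is a simplification: instead of rerunning the full regression on each resample, it recomputes a plain conversion rate, and the shot data are synthetic (2,000 made-up shots converted at roughly 19%). The function names and the seed values are assumptions for illustration only.

```python
import random

def percentile(sorted_values, q):
    """Return the q-th percentile (0-100) of an already sorted list (nearest rank)."""
    index = round(q / 100 * (len(sorted_values) - 1))
    return sorted_values[index]

def bootstrap_interval(outcomes, n_resamples=1000, level=90, seed=0):
    """Percentile-bootstrap interval for the mean of 0/1 shot outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    estimates = []
    for _ in range(n_resamples):
        # Draw n shots with replacement and recompute the statistic.
        resample = rng.choices(outcomes, k=n)
        estimates.append(sum(resample) / n)
    estimates.sort()
    tail = (100 - level) / 2
    return (percentile(estimates, tail),
            percentile(estimates, 50),
            percentile(estimates, 100 - tail))

# Synthetic stand-in for real shot data: 1 = goal, 0 = no goal.
data_rng = random.Random(42)
shots = [1 if data_rng.random() < 0.19 else 0 for _ in range(2000)]

low, median, high = bootstrap_interval(shots)
print(f"conversion rate: {median:.3f} (90% interval {low:.3f} to {high:.3f})")
```

Swapping the conversion rate for a refit of the regression on each resample would give an interval for the fitted curve rather than for a single rate.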

The plot below shows the goal expectation for a shot taken in the centre of the pitch at varying distances from the goal. Generally speaking, the confidence interval range is around ±1-2%. I also ran the regressions on subsets of the data and found that after around 5000 shots, the central estimate stabilised and the addition of further shots in the regression just narrows the confidence intervals. After about 10,000 shots, the results don’t change too much.


Expected goal curve for shots in the centre of the pitch at varying distances from the goal. Shots with the foot only. The red line is the median expectation, while the blue shaded region denotes the 90% confidence interval.

I can use the above information to construct a confidence interval for the expected goal totals for each team, which is what I have done below. Each point represents a team in each season and I’ve compared their expected goals vs their actual goals. The error bars show the range for the 90% confidence intervals.

Most teams line up with the one-to-one line within their respective confidence intervals when comparing with goals for and against. As I noted in the previous post, the overall tendency is for actual goals to exceed expected goals at the team level.

Expected goals vs actual goals for teams in the 2013/14 and 2014/15 Premier League. Dotted line is the 1:1 line, the solid line is the line of best fit and the error bars denote the 90% confidence intervals based on the xG curve above.

As an example of what the confidence intervals represent, in the 2013/14 season, Manchester City’s expected goal total was 59.8, with a confidence interval ranging from 52.2 to 67.7 expected goals. In reality, they scored 81 non-penalty goals with their feet, which falls outside of their confidence interval here. On the plot above, Manchester City are the red marker on the far right of the expected goals for vs actual goals for plot.

Embracing uncertainty

Another method of testing the model is to look at the model residuals, which are calculated by subtracting the expected goal value of a shot from its outcome (either zero or one). If you were an omniscient being who knew every aspect relating to the taking of a shot, you could theoretically predict the outcome of a shot (goal or no goal) perfectly (plus some allowance for random variation). The residuals of such a model would always be zero as the outcome minus the expectation of a goal would equal zero in all cases. In the real world though, we can’t know everything so this isn’t the case. However, we might expect that over a sufficiently large sample, the average residual will be close to zero.

In the figure below, I’ve again bootstrapped the data and looked at the model residuals as the number of shots increases. I’ve done this 10,000 times for each number of shots i.e. I extract a random sample from the data and then calculate the residual for that number of shots. The red line is the median residual (goals minus expected goals), while the blue shaded region corresponds to the standard error range (calculated as the 90% confidence interval). The residual is normalised to a per shot basis, so the overall uncertainty value is equal to this value multiplied by the number of shots taken.


Goals-Expected Goals versus number of shots calculated via bootstrapping. Inset focusses on the first 100 shots. The red line is the median, while the blue shaded region denotes the 90% confidence interval (standard error).

The inset shows how this evolves up to 100 shots and we see that over about 10 shots, the residual approaches zero but the standard errors are very large at this point. Consequently, our best estimate of expected goals is likely highly uncertain over such a small sample. For example, if we expected to score two goals from 20 shots, the standard error range would span 0.35 to 4.2 goals. To add a further complication, the residuals aren’t normally distributed at that point, which makes interpretations even more challenging.

Clearly there is a significant amount of variation over such small samples, which could be a consequence of both random variation and factors not included in the model. This is an important point when assessing xG estimates for single matches; while the central estimate will likely have a very small residual, the uncertainty range is huge.

As the sample size increases, the uncertainty decreases. After 100 shots, which would equate to a high shot volume for a forward, the uncertainty in goal expectation would amount to approximately ±4 goals. After 400 shots, which is close to the average number of shots a team would take over a single season, the uncertainty would equate to approximately ±9 goals. For a 10% conversion rate, our expected goal value after 100 shots would be 10±4, while after 400 shots, our estimate would be 40±9 (note the percentage uncertainty decreases as the number of shots increases).
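The way the uncertainty changes with sample size can be sketched with a small simulation. Everything here is an assumption for illustration: the pool of shots is synthetic, each xG value is drawn uniformly between 0.02 and 0.18 (so the average is about 10%), and outcomes are generated from those same probabilities, which makes this toy model perfectly calibrated by construction.

```python
import random

rng = random.Random(1)

# Synthetic pool of (xG, outcome) pairs; not the real shot data.
pool = []
for _ in range(16000):
    xg = rng.uniform(0.02, 0.18)
    goal = 1 if rng.random() < xg else 0
    pool.append((xg, goal))

def residual_interval(n_shots, n_resamples=1000, level=90):
    """Bootstrap the per-shot residual (goals minus xG) for samples of n_shots."""
    residuals = []
    for _ in range(n_resamples):
        sample = rng.choices(pool, k=n_shots)
        residuals.append(sum(goal - xg for xg, goal in sample) / n_shots)
    residuals.sort()
    tail = (100 - level) / 200  # fraction of resamples in each tail
    low = residuals[int(tail * (n_resamples - 1))]
    high = residuals[int((1 - tail) * (n_resamples - 1))]
    return low, high

for n in (10, 100, 400):
    low, high = residual_interval(n)
    # Per-shot residual multiplied by the number of shots gives goals.
    print(f"{n:>3} shots: per shot {low:+.3f} to {high:+.3f}, "
          f"total {low * n:+.1f} to {high * n:+.1f} goals")
```

In this toy setting the per-shot interval narrows as the number of shots grows, while the interval expressed in goals widens, which is the pattern described above.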


Same as above but with individual teams overlaid.

Above is the same plot but with the residuals shown for each team over the past two seasons (or one season if they only played for a single season). The majority of teams fall within the uncertainty envelope but there are some notable deviations. At the bottom of the plot are Burnley and Norwich, who significantly under-performed their expected goal estimate (they were also both relegated). On the flip side, Manchester City have seemingly consistently outperformed the expected goal estimate. Part of this is a result of the simplicity of the model; if I include additional factors such as how the chance is created, the residuals are smaller.

How well does an xG model predict goals?

Broadly speaking, the central estimates of expected goals appear to be reasonably good; the residuals tend to zero quickly and even though there is some bias, the correlations and errors are encouraging. When the uncertainties in the model are propagated through to the team level, the confidence intervals are on average around ±15% for expected goals for and against.

When we examine the model errors in more detail, they tend to be larger (around ±25% at the team level over a single season). The upshot of all this is that there appears to be a large degree of uncertainty in expected goal values when considering sample sizes relevant at the team and player level. While the simplicity of the model used here may mean that the uncertainty values shown represent a worst-case scenario, it is still something that should be considered when analysts make statements and projections. Having said this, based on some initial tests, adding extra complexity doesn’t appear to reduce the residuals to any great degree.

Uncertainty estimates and confidence intervals aren’t sexy and having spent the last 1500ish words writing about them, I’m well aware they aren’t that accessible either. However, I do think they are useful and important in the real world.

Quantifying these uncertainties can help to provide more honest assessments and recommendations. For example, I would say it is more useful to say that my projections estimate that player X will score 0.6-1.4 goals per 90 minutes next season along with some central value, rather than going with a single value of 1 goal per 90 minutes. Furthermore, it is better to state such caveats in advance – if you just provided the central estimate and the player posted say 0.65 goals per 90 and you then bring up your model’s uncertainty range, you will just sound like you’re making excuses.

This also has implications regarding over and under performance by players and teams relative to expected goals. I frequently see statements about regression to the mean without considering model errors. As George Box wisely noted:

Statisticians, like artists, have the bad habit of falling in love with their models.

This isn’t to say that expected goal models aren’t useful, just that if you want to wade into the world of probability and modelling, you should also illustrate the limitations and uncertainties associated with the analysis.

Perhaps those using expected goal models are well aware of these issues but I don’t see much discussion of it in public. Analytics is increasingly finding a wider public audience, along with being used within clubs. That will often mean that those consuming the results will not be aware of these uncertainties unless you explain them. Speaking as a researcher who is interested in the communication of science, I can give many examples of where not discussing uncertainty upfront can backfire in the long run.

Isn’t uncertainty fun!


Thanks to several people who were kind enough to read an initial draft of this article and the preceding method piece.

Great Expectations

One of the most popular metrics in football analytics is the concept of ‘expected goals’ or xG for short. There are various flavours of expected goal models but the fundamental objective is to assess the quality of chances created or conceded by a team. The models are also routinely applied to assessing players using various techniques.

Michael Caley wrote a nice explanation of the what and the why of expected goals last month. Alternatively, you could check out this video by Daniel Altman for a summary of some of the potential applications of the metric.

I’ve been building my own expected goals model recently and I’ve been testing out a fundamental question regarding the performance of the model, namely:

How well does it predict goals?

Do expected goal models actually do what they say on the tin? This is a really fundamental and dumb question that hasn’t ever been particularly clear to me in relation to the public expected goal models that are available.

This is a key aspect, particularly if we want to make statements about prior over or under-performance and any anticipated changes in the future. Further to this, I’m going to talk about uncertainty and how that influences the statements that we can make regarding expected goals.

In this post, I’m going to describe the model and make some comparisons with a ‘naive’ baseline. In a second post, I’m going to look at uncertainties relating to expected goal models and how they may impact our interpretations of them.

The model

Before I go further, I should note that the initial development closely resembles the work done by Michael Caley and Martin Eastwood, who detailed their own expected goal methods here and here respectively.

I built the model using data from the Premier League from 2013/14 and 2014/15. For the analysis below, I’m just going to focus on non-penalty shots with the foot, so it includes both open-play and set piece shot situations. Mixing these will introduce some bias but we have to start somewhere. The data amounts to over 16,000 shots.

I’m only including distance from the centre of the goal in the first instance, which I calculated in a similar manner to Michael Caley in the link above as the distance from the goal line divided by the relative angle. I didn’t raise the relative angle to any power though.

I then calculate the probability of a goal being scored with the adjusted distance of each shot as the input; shots are deemed either successful (goal) or unsuccessful (no goal). Similarly to Martin Eastwood, I found that an exponential decay formula represented the data well. However, I found that there was a tendency towards under-predicting goals on average, so I included an offset in the regression. The equation I used is below:

xG = exp(-Distance/α) + β

Based on the dataset, the fit coefficients were 6.65 for α and 0.017 for β. Below is what this looks like graphically when I colour each shot by the probability of a goal being scored; shots from close to the goal line in central positions are far more likely to be scored than long distance shots or shots from narrow angles, which isn’t a new finding.
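As a quick sketch, the equation and fitted coefficients above translate directly into a few lines of Python. The function name and the example distances are hypothetical, and the distances are in whatever units the model was fitted in, which the post does not state.

```python
import math

ALPHA = 6.65   # decay scale from the fit quoted above
BETA = 0.017   # constant offset from the fit quoted above

def expected_goals(adjusted_distance):
    """xG = exp(-Distance / alpha) + beta for a single shot."""
    return math.exp(-adjusted_distance / ALPHA) + BETA

# Hypothetical adjusted distances, purely for illustration.
for distance in (2, 6, 12, 20, 30):
    print(f"adjusted distance {distance:>2}: xG = {expected_goals(distance):.3f}")
```

One consequence of the offset is that the value never drops below β for long-range shots, and for an adjusted distance very close to zero the formula slightly exceeds one, so clipping to the range 0 to 1 may be sensible if the values are treated as probabilities.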


Expected goals based on shot location using data from the 2013/14 and 2014/15 Premier League seasons. Shots with the foot only.

So, now we have a pretty map and yet another expected goal model to add to the roughly 1,000,001 other models in existence.


In the figure below, I’ve compared the expected goal totals with the actual goals. Most teams are close to the one-to-one line when comparing with goals for and against, although the overall tendency is for actual goals to exceed expected goals at the team level. When looking at goal difference, there is some cancellation for teams, with the correlation being tighter and the line of best fit passing through zero.


Expected goals vs actual goals for teams in the 2013/14 and 2014/15 Premier League. Dotted line is the 1:1 line, the solid line is the line of best fit. Click on the graph for an enlarged version.

Inspecting the plot more closely, we can see some bias in the expected goal number at the extreme ends; high-scoring teams tend to out-perform their expected goal total, while the reverse is true for low-scoring teams. The same is also true for goals against, to some extent, although the general relationship is less strong than for goals for. Michael Caley noted a similar phenomenon here in relation to his xG model. Overall, it looks like just using location does a reasonable job.


The table above includes R² and mean absolute error (MAE) values for each metric and compares them to a ‘naïve’ baseline where just the average conversion rate is used to calculate the xG values i.e. the location of the shot is ignored. The R² value assesses the strength of the relationship between expected goals and goals, with values closer to one indicating a stronger link. Mean absolute error takes an average of the absolute difference between the goals and expected goals; the lower the value the better. In all cases, including location improves the comparison. ‘Naïve’ xG difference is effectively Total Shot Difference as it assumes that all shots are equal.
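For readers who want to reproduce this kind of comparison, here is a minimal sketch of the two measures and the naïve baseline. It assumes R² is the squared Pearson correlation between goals and expected goals, which may not be exactly how the table was calculated, and all of the team totals are made up for illustration.

```python
def r_squared(actual, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_a * var_p)

def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up season totals for five hypothetical teams.
goals = [81, 60, 45, 38, 30]
shots = [560, 520, 430, 410, 380]
location_xg = [60.0, 55.0, 44.0, 40.0, 35.0]

# Naive baseline: every shot is worth the overall average conversion rate.
average_rate = sum(goals) / sum(shots)
naive_xg = [s * average_rate for s in shots]

for label, xg in (("location xG", location_xg), ("naive xG", naive_xg)):
    print(f"{label}: R^2 = {r_squared(goals, xg):.2f}, "
          f"MAE = {mean_absolute_error(goals, xg):.1f}")
```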

What is interesting is that the correlations are stronger in both cases for goals for than goals against. This could be a fluke of the sample I’m using but the differences are quite large. There is more stratification in goals for than goals against, which likely helps improve the correlations. James Grayson noted here that there is more ‘luck’ or random variation in goals against than goals for.

How well does an xG model predict goals?

Broadly speaking, the central estimates of expected goals appear to be reasonably good. Even though there is some bias, the correlations and errors are encouraging. Adding location into an xG model clearly improves our ability to predict goals compared to a naïve baseline. This obviously isn’t a surprise but it is useful to quantify the improvements.

The model can certainly be improved though and I also want to quantify the uncertainties within the model, which will be the topic of my next post.

Premier League Pass Masters

In this previous post, I combined territory and possession to create a Territorial-Possession Dominance (TPD) metric. The central basis for this metric is that it is more difficult to pass the ball into dangerous areas. Essentially teams that have the ball in areas closer to their opponent’s goal, while stopping their opponent moving the ball close to their own, will score more highly on this metric.

In the graphic below, I’ve looked at how the teams in the Premier League have been shaping up this year (data correct up to 24/04/15). The plot splits this performance on the offensive side (with the ball) and the defensive side (without the ball). For a frame of reference, league average is defined as a score of 100.

Broadly, these two terms show that teams who dominate territory with the ball also limit the amount of possession they concede close to their own goal. This makes sense given there is only one ball on the pitch, so pinning your opponent back in their half makes it more difficult to maintain possession in dangerous areas in return. Alternatively, teams may choose to sit back, soak up pressure and then aim to counter attack; this would yield a low rating offensively and a higher rating defensively.

Territorial-possession for and against for the 2014/15 English Premier League. A score of 100 denotes league average. Marker colour refers to Territorial-Possession Dominance. Data via Opta.

The top seven (plus Everton) tend to dominate territory and possession, while the bottom thirteen (minus Everton) are typically pinned back. Stoke City are somewhat peculiar, as they are below average on both scores, so while they limit their opponents, they seemingly struggle to manoeuvre the ball into dangerous areas themselves. Michael Caley’s expected goals numbers suggest that Everton have seemingly struggled to convert their territorial and possession dominance into an abundance of good quality chances; essentially they look pretty in between both boxes.

Sunderland’s passivity is evident as they routinely saw their opponents pass the ball into dangerous areas; based on where their defensive actions occur and the league-leading number of shots from outside of the box they concede, the aim is to get men behind the ball and prevent good quality chances from being created. That is possibly a reasonable tactical system if you can combine that with swift counter-attacking and high quality chances but Poyet’s dismissal is indicative of how that worked out.

On the flip side, Manchester United rank lowest for territorial-possession against. Their system is designed to prevent their opponents from building pressure on their defence close to their own goal. Think of it as a system designed to prevent Phil Jones’ face from trending on Twitter. Of course, when the system breaks down and/or opposition skill breaks through, things look awful and high quality chances are conceded.

Finally, Manchester City clearly aren’t trying hard enough.

Passing maestros

The metric I’ve devised classifies each pass completed based on the destination of the pass, so it is relatively straightforward to break down the metric by the player passing the ball. Below are the top twenty players this season ranked according to the average ‘danger’ of their passes (non-headed passes only, minimum 900 minutes played). I can also do this for players receiving the ball but I’ll leave that for another time.

Players who routinely complete passes into dangerous areas will score highly here, so there is an obvious bias towards forwards and attacking midfielders/wingers. Bias will also be introduced by team systems, which would be a good thing to examine in the future. I’ve also noted on the right-hand-side the number of passes each player completes per 90 minutes to give a sense of their involvement.

Some players, like Diafra Sakho and Jamie Vardy, are rarely involved but their passes are often dangerous. Others manage to combine a high volume of passes with danger; PFA Player of the Year, Eden Hazard, is the standout here (very much a Sum 41 kind of footballer). The link-up skills of Sánchez and Agüero are also evident.

Pass Danger Rating for English Premier League players in the 2014/15 season. Numbers on right indicate number of completed passes played per 90 minutes by each player. Minimum of 900 minutes played. Data via Opta.

I quite like this as a metric, as the results aren’t always obvious; it is nice to have confirmatory metrics but informative metrics are potentially more valuable from an analytics point of view. For instance, the metric can quickly identify the dangerous passers for the opposition, who could then be targeted to reduce their influence. It can also be useful in identifying players who could possibly do more on your own team (*cough* Lallana *cough*). Finally, it’s a metric that could be used as a part of an analytics based scouting system. I’m hoping to develop this further, so watch this space.

Square pegs for square holes: OptaPro Forum Presentation

At the recent OptaPro Forum, I was delighted to be selected to present to an audience of analysts and representatives from the football industry. I presented a technique to identify different player types using their underlying statistical performance. My idea was that this would aid player scouting by helping to find the “right fit” and avoid the “square peg for a round hole” cliché.

In the presentation, I outlined the technique that I used, along with how Dani Alves made things difficult. My vision for this technique is that the output from the analysis can serve as an additional tool for identifying potential transfer signings. Signings can be categorised according to their team role and their performance can then be compared against their peers in that style category based on the important traits of those player types.

The video of my presentation is below, so rather than repeating myself, go ahead and watch it! The slides are available here.

Each of the player types is summarised below in the figures. My plan is to build on this initial analysis by including a greater number of leagues and use more in-depth data. This is something I will be pursuing over the coming months, so watch this space.

Some of my work was featured in this article by Ben Lyttleton.

Forward player types.

Midfielder player types.

Defender player types.

Help me rondo

In my previous post, I looked at the relationship between controlling the pitch (territory) and the ball (possession). When looking at the final plot in that post, you might infer that ‘good’ teams are able to control both territory and possession, while ‘bad’ teams are dominated on both counts. There are also teams that dominate only one metric, which likely relates to their specific tactical make-up.

When I calculated the territory metric, I didn’t account for the volume of passes in each area of the pitch as I just wanted to see how things stacked up in a relative sense. Territory on its own has a pretty woeful relationship with things we care about like points (r²=0.27 for the 2013/14 EPL) and goal difference (r²=0.23 for the 2013/14 EPL).

However, maybe we can do better if we combine territory and possession into one metric.

To start with, I’ve plotted some heat maps (sorry) showing pass completion percentage based on the end point of the pass. The completion percentage is calculated by adding up all of the passes to a particular area on the pitch and comparing that to the number of passes that are successfully received. I’ve done this for the 2013/14 season for the English Premier League, La Liga and the Bundesliga.

As you would expect, passes directed to areas closer to the goal are completed at lower rates, while passes within a team’s own half are completed routinely.


Heat map of pass completion percentage based on the target of all passes in the 2013/14 English Premier League, La Liga and Bundesliga. Data via Opta.

What is interesting in the plots is the contrast between England and Germany; in the attacking half of the pitch, pass completion is 5-10% lower in the Bundesliga than in the EPL. La Liga sits in between for the most part but is similar to the Bundesliga within the penalty area. My hunch is that this is a result of the contrasting styles in these leagues:

  1. Defences often sit deeper in the EPL, particularly when compared to the Bundesliga, which results in their opponents completing passes more easily as they knock the ball around in front of the defence.
  2. German and Spanish teams tend to press more than their English counterparts, which will make passing more difficult. In Germany, counter-pressing is particularly rife, which will make passing into the attacking midfield zone more challenging.

From the above information, I can construct a model* to judge the difficulty of a pass into each area of the pitch and given the differences between the leagues, I do this for each league separately.

I can then use this pass difficulty rating along with the frequency of passes into that location to put a value on how ‘dangerous’ a pass is e.g. a completed pass received on the penalty spot in your opponent’s penalty area would be rated more highly than one received by your own goalkeeper in his six-yard box.

Below is the resulting weighting system for each league. Passes that are received in front of the goal within the six-yard box would have a rating close to one, while passes within your own half are given very little weighting as they are relatively easy to complete and are frequent.

There are slight differences between each league, with the largest differences residing in the central zone within the penalty area.


Heat map of pass weighting model for the 2013/14 English Premier League, La Liga and Bundesliga. Data via Opta.

Using this pass weighting scheme, I can assign a score to each pass that a team completes, which ‘rewards’ them for completing more dangerous passes themselves and preventing their opponents from moving the ball into more dangerous areas. For example, a team that maintains possession in and around the opposition penalty area will increase their score. Similarly, if they also prevent their opponent from moving the ball into dangerous areas near their own penalty area, this will also be rewarded.

Below is how this Territorial-Possession Dominance (TPD) metric relates to goal difference. It is calculated by comparing the for and against figures as a ratio and I’ve expressed it as a percentage.
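One plausible reading of that ratio is sketched below. The exact formula isn’t spelled out here, so this assumes TPD is the team’s share of the combined for and against scores (in the same spirit as Total Shot Ratio); the function names and the numbers are hypothetical.

```python
def team_score(pass_weights):
    """Sum of the danger weights of every completed pass."""
    return sum(pass_weights)

def tpd(score_for, score_against):
    """Assumed form: the team's share of the combined score, as a percentage."""
    return 100 * score_for / (score_for + score_against)

# Hypothetical weighted-pass totals for one team and its opponents.
print(tpd(120.0, 80.0))  # 60.0
```

Under this assumption a value above 50% would mean the team completes more dangerous passes than it allows.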

Broadly speaking, teams with a higher TPD have a better goal difference (overall r²=0.59) but this varies across the leagues. Unsurprisingly, Barcelona and Bayern Munich are the stand-out teams on this metric as they pin teams in and also prevent them from possessing the ball close to their own goal. Manchester City (the blue dot next to Real Madrid) had the highest TPD in the Premier League.

In Germany, the relationship is much stronger (r²=0.87), which is actually better than both Total Shot Ratio (TSR, r²=0.74) and Michael Caley’s expected goals figures (xGR, r²=0.80). A major caveat here though is that this is just one season in a league with only 18 teams and Bayern Munich’s domination certainly helps to strengthen the relationship.

The relationship is much weaker in Spain (r²=0.35) and is worse than both TSR (r²=0.54) and xGR (r²=0.77). A lot of this is driven by the almost non-existent explanatory power of TPD when compared with goals conceded (r²=0.06). La Liga warrants further investigation.

England sits in between (r²=0.69), which is on a par with TSR (r²=0.72). I don’t have xGR numbers for last season but I believe xGR is usually a few points higher than TSR in the Premier League.


Relationship between goal difference per game and territorial-possession dominance for the 2013/14 English Premier League, La Liga and Bundesliga. Data via Opta.

The relationship between TPD and points (overall r²=0.56) is shown below and is broadly similar to goal difference. The main difference is that the strength of the relationship in Germany is weakened.


Relationship between points per game and territorial-possession dominance for the 2013/14 English Premier League, La Liga and Bundesliga. Data via Opta.

Over the summer, I’ll return to these correlations in more detail when I have more data and the relationships are more robust. For now, the metric appears to be useful and I plan to improve it further. Also, I’ll be investigating what it can tell us about a team’s style when combined with other metrics.

*For those who are interested in the method, I calculated the relative distance of each pass from the centre of the opposition goal using the distance along the x-axis (the length of the pitch) and the angle relative to a centre line along the length of the pitch.

I then used logistic regression to calculate the probability of a pass being completed; passes are deemed either successful or unsuccessful, so logistic regression is ideal and avoids putting the passes into location buckets on the pitch.

I then weighted the resulting probability according to the frequency of passes received relative to the distance from the opposition goal-line. This gave me a ‘score’ for each pass, which I used to calculate the territory weighted possession for each team.
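The logistic regression step in the method above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the actual implementation: the passes are synthetic, the single input is a relative distance scaled to run from 0 (at the opposition goal) to 1 (far end of the pitch), and the fit uses a hand-rolled Newton's method for one feature plus an intercept. The frequency weighting step is not included.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iterations=25):
    """Fit P(complete) = sigmoid(b0 + b1 * x) by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system for the parameter update.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Synthetic passes: completion gets easier further from the opposition goal.
rng = random.Random(7)
xs = [rng.random() for _ in range(5000)]
ys = [1 if rng.random() < sigmoid(-1.0 + 5.0 * x) else 0 for x in xs]

b0, b1 = fit_logistic(xs, ys)
print(f"completion probability next to goal: {sigmoid(b0):.2f}")
print(f"completion probability at far end:   {sigmoid(b0 + b1):.2f}")
```

Because each pass is treated individually as a success or failure, no location buckets are needed, which is the advantage noted in the method description.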

Territorial advantage?

One of the recurring themes regarding the playing style of football teams is the idea that teams attempt to strike a balance between controlling space and controlling possession. The following quote is from this Jonathan Wilson article during the European Championships in 2012, where he discusses the spectrum between proactive and reactive approaches:

Great teams all have the same characteristic of wanting to control the pitch and the ball – Arrigo Sacchi.

No doubt there are multiple ways of defining both sides of this idea.

Controlling the ball is usually represented by possession, that is the proportion of the passes that a team plays in a single match or series of matches. If a team has the ball, then by definition, they are controlling it.

One way of defining the control of space is to think about ball possession in relation to the location of the ball on the pitch. A team that routinely possesses the ball closer to their opponent’s goal potentially benefits from the increased attacking opportunities that this provides, while also benefiting from the ball being far away from their own goal should they lose it.

There are certainly issues with defining control of space in this way though e.g. a well-drilled defence may be happy to see a team playing the ball high up the pitch in front of them, especially if they are adept at counter-attacking when they win the ball back.

Below is a heat map of the location of received passes in the 2013/14 English Premier League. The play is from left-to-right i.e. the team in possession is attacking towards the right-hand goal. We can see that passes are most frequently received in midfield areas, with the number of passes received decreasing quickly as we head towards each penalty area.


Heat map of the location of received passes in the 2013/14 English Premier League. Data via Opta.

Below is another heat map showing pass completion percentage based on the end point of the pass. The completion percentage is calculated by adding up all of the passes to a particular area on the pitch and comparing that to the number of passes that are successfully received. One thing to note here is that the end point of uncompleted passes relates to where possession was lost, as the data doesn’t know the exact target of each pass (mind-reading isn’t part of the data collection process as far as I know). That does mean that the pass completion percentage is an approximation but this is based on over 300,000 passes, so the effect is likely small.
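The completion calculation described above can be sketched in a few lines. This is a minimal illustration rather than the code behind the heat map: the 0-100 pitch coordinates, the 10-unit grid and the sample passes are all assumptions made up for the example.

```python
from collections import defaultdict


def completion_by_zone(passes, zone_size=10.0):
    """Return the pass completion percentage for each grid zone.

    Each pass is (end_x, end_y, was_completed). For an incomplete pass the
    end point is where possession was lost, as described in the text.
    """
    attempted = defaultdict(int)
    completed = defaultdict(int)
    for end_x, end_y, was_completed in passes:
        zone = (int(end_x // zone_size), int(end_y // zone_size))
        attempted[zone] += 1
        if was_completed:
            completed[zone] += 1
    return {zone: 100.0 * completed[zone] / attempted[zone] for zone in attempted}


# Hypothetical passes on an assumed 0-100 pitch (attacking left to right).
sample_passes = [
    (45.0, 50.0, True), (48.0, 52.0, True), (41.0, 55.0, False),   # midfield zone
    (95.0, 50.0, False), (96.0, 52.0, False), (93.0, 51.0, True),  # zone close to goal
]

rates = completion_by_zone(sample_passes)
for zone, rate in sorted(rates.items()):
    print(zone, f"{rate:.1f}%")
```

With a real data set, the grid size is a trade-off: smaller zones give a sharper map but fewer passes per zone, so the percentages become noisier.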

What is very clear from the below graphic is that when within a team’s own half, passes are completed routinely. The only areas where this drops are near the corner flags; I assume this is due to players either clearing the ball or playing it against an opponent when boxed into the corner.


Heat map of pass completion percentage based on the target of all passes in the 2013/14 English Premier League. Data via Opta.

As teams move further into the attacking half, pass completion drops. In the central zone within the penalty area, less than half of all passes are completed and this drops to less than 20% within the six yard box. These passes within the “danger zone” are infrequent and completed far less frequently than other passes. This danger zone is frequently cited by analysts looking at shot location data as the prime zone for scoring opportunities; you would imagine that receiving passes in this zone would be beneficial.

None of the above is new. In fact, Gabe Desjardins wrote about these features using data from a previous Premier League season here and showed broadly similar results (thanks to James Grayson for highlighting his work at various points). The main thing that looks different is the number of passes played into the danger zone. I’m not sure why this is, but 2012/13 and 2014/15 so far look very similar to the above in my data.

Gabe used these results to calculate a territory statistic by weighting each pass by its likelihood of being completed. He found that this measure was strongly related to success and the performance of a team.
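To make the idea of a weighted territory share concrete, here is a rough sketch. It is not Gabe's calculation or mine: the three zones, their weights and the pass lists are invented for illustration, and the only thing taken from the text is that each pass contributes a weight tied to where it is received rather than counting equally.

```python
def territory_percentage(team_zones, opponent_zones, zone_weight):
    """Share of the total weighted pass score that belongs to the team."""
    team_score = sum(zone_weight[zone] for zone in team_zones)
    opponent_score = sum(zone_weight[zone] for zone in opponent_zones)
    return 100.0 * team_score / (team_score + opponent_score)


def possession_percentage(team_zones, opponent_zones):
    """Plain possession: the share of all passes played by the team."""
    return 100.0 * len(team_zones) / (len(team_zones) + len(opponent_zones))


# Hypothetical weights: passes received further up the pitch count for more.
zone_weight = {"own third": 0.1, "middle third": 0.3, "final third": 0.6}

# Hypothetical zones in which each side received its passes.
home = ["own third", "middle third", "final third", "final third"]
away = ["own third", "own third", "middle third", "middle third"]

print(f"possession: {possession_percentage(home, away):.1f}%")
print(f"territory:  {territory_percentage(home, away, zone_weight):.1f}%")
```

In this toy example both sides play the same number of passes, so possession is level, but the home side's passes are received in more advanced zones and so its territory share is higher.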

Below is my version of territory plotted against possession for the 2013/14 Premier League season. Broadly there are four regimes in the below plot:

  1. Teams like Manchester City, Chelsea and Arsenal who dominate territory and have plenty of possession. These teams tend to pin teams in close to their goal.
  2. Teams like Everton, Liverpool and Southampton who have plenty of possession but don’t dominate territory (all three are just under a 50% share). Swansea are an extreme case, as they have lots of possession but it is concentrated in their own half where passes are easier to complete.
  3. Teams like West Brom and Aston Villa who have limited possession but move the ball into attacking areas when they do have it. These are quite direct teams, who don’t waste much time in their build-up play. Crystal Palace are an extreme in terms of this approach.
  4. Teams that have limited possession and when they do have it, they don’t have much of it in dangerous areas at the attacking end of the pitch. These teams are going nowhere, slowly.

Territory percentage plotted against possession for English Premier League. Data via Opta.

Liverpool are an interesting example, as while their overall territory percentage ranks at fourteenth in the league, this didn’t prevent them moving the ball into the danger zone. For just passes received within the danger zone, they ranked third on 3.4 passes per game behind Chelsea (3.8) and Manchester City (4) and ahead of Arsenal on 2.9.

This ties in with Liverpool’s approach last season, where they would often either attack quickly when winning the ball or hold possession within their own half to try and draw teams out and open up space. Luis Suárez was crucial in this aspect, as he averaged 1.22 completed passes into the danger zone per 90 minutes. This was well ahead of Sergio Agüero in second place on 0.94 per 90 minutes.

The above is just a taster of what can be learnt from this type of data. I’ll be expanding on the above in more detail and for more leagues in the future.

Germany vs Portugal: passing network analysis

Germany faced Portugal in their opening Group G match, with Germany winning 4-0 and Pepe being an idiot (surprise, surprise). Faced with the decision on which diminutive gifted midfielder to leave out of the starting eleven, Jogi Löw just went ahead and picked all of them. Furthermore, Germany’s best fullback, Philipp Lahm, played centre midfield. Ronaldo was fit enough to start for Portugal.

Below are the passing networks for both Germany (left) and Portugal (right) based on data from Fifa.com. More information on how these are put together is available here in my previous posts on this subject. For Germany, I’ve not included the substitutes as they contributed little in this aspect. For Portugal, I included Eder who came on for the injured Hugo Almeida after 28 minutes.

Passing networks for the World Cup Group G match between Germany and Portugal at the Arena Fonte Nova, Salvador on the 16th June 2014. Only completed passes are shown. Darker and thicker arrows indicate more passes between each player. The player markers are sized according to their passing influence, the larger the marker, the greater their involvement. Click on the image for a larger view.

Bear in mind that the passing networks above are likely skewed by game state effects, with Germany leading and playing 11 vs 10 for a large proportion of the match.


Germany lined up with something like a 4-1-5-0 formation in the first half, with their full backs being relatively unadventurous, Philipp Lahm playing ahead of the centre backs with Sami Khedira running from deep and often beyond his attacking compatriots. Khedira was less aggressive in the second half with Germany three goals ahead and with a numerical advantage. In the graphic above, I’ve got them lined up in a 4-2-4ish formation based on a mixture of their average positions and making the plot look pretty. In reality, the side was very compact with the central defenders playing a high line and the attackers dropping off continually.

Lahm and Khedira provided a controlling influence for Germany, forming the link between the defence and attack. Höwedes and Boateng were also well involved in build-up play, although they had limited involvement in terms of direct creativity, with just one cross and no key passes between them.

The attacking quartet were all about fluid movement and passing links, as can be seen in the passing network above. Kroos was similarly influential to Lahm/Khedira but with a slightly higher position up the pitch. Özil and Götze were also heavily involved, while Müller was the least involved (unsurprisingly). The relative balance between the German play-makers meant that their attacks were not simply funnelled through one individual, which led to some lovely passing inter-changes and several high-quality shooting opportunities.


Portugal’s passing network was dominated by their central midfielders but they struggled to involve their attacking players in dangerous areas. Ronaldo in particular saw relatively little involvement and the passes he did receive were often well away from the danger-zone. The one Portuguese attacker who was well-involved was Nani; unfortunately for Portugal, he put in a fairly terrible performance. Despite his involvement, Nani created no shooting opportunities for his team mates and put in a total of six crosses with none finding a fellow Portuguese. He did have three shots, with one on target. Sometimes a relatively high passing influence is a bad thing if the recipient wastes their involvement.

Portugal did look dangerous on the counter-attack prior to Pepe’s sending off but failed to really create a clear chance from these opportunities. Overall, Portugal’s passing network was too heavily weighted away from their (potentially) dangerous attacking players and when they did get the ball, they didn’t do enough with it.

Moving forward

Germany were impressive, although this was likely facilitated by Pepe’s indiscretion and the game being essentially over at half-time. The game conditions were certainly in their favour but they capitalised fully. If they can keep their gifted band of play-makers weaving their magic, then they will do well. They’ll need Müller to keep finishing their passing moves, while Mario Götze found himself in several promising shooting situations which may well yield goals on future occasions.

Conversely, Portugal were hampered by the match situation although they looked worryingly dependent on Ronaldo in attack, as noted by the imperious Michael Cox in his recap of day five. Furthermore, the USA likely won’t give them as much space to attack as Germany did. They’ll need to improve the passing links to their dangerous attackers if they are to have much joy at this tournament.

Win, lose or draw

The dynamics of a football match are often dictated by the scoreline and teams will often try to influence this via their approach; a fast start in search of an early goal, keeping it tight with an eye on counter-attacking or digging a moat around the penalty area.

With this in mind, I’m going to examine the repeatability of the amount of time a team spends winning, losing and drawing from year to year. I’m basically copying the approach of James Grayson here who has looked at the repeatability of several statistical metrics. This is meant to be a broad first look; there are lots of potential avenues for further study here.

I’ve collected data from football-lineups.com (tip of the hat to Andrew Beasley for alerting me to the data via his blog) for the past 15 English Premier League seasons and then compared each team’s performance from one season (year zero) to the next (year one). Promoted or relegated teams are excluded as they don’t spend two consecutive seasons in the top flight.
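The year zero versus year one pairing can be sketched as below. The team names and minutes are made up purely to show the logic, and the Pearson correlation is written out by hand so the example has no dependencies; none of this is the actual data or code used for the plots.

```python
import math

# Hypothetical minutes spent winning per game, keyed by (team, season start year).
minutes_winning = {
    ("Team A", 2011): 35.0, ("Team A", 2012): 38.0, ("Team A", 2013): 41.0,
    ("Team B", 2011): 20.0, ("Team B", 2012): 18.0,
    ("Team C", 2012): 15.0, ("Team C", 2013): 22.0,
    ("Team D", 2013): 12.0,  # no adjacent season in the data, so never paired
}


def consecutive_pairs(values):
    """Return (year zero, year one) pairs for teams present in both seasons."""
    pairs = []
    for (team, season), year_zero in sorted(values.items()):
        year_one = values.get((team, season + 1))
        if year_one is not None:
            pairs.append((year_zero, year_one))
    return pairs


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


pairs = consecutive_pairs(minutes_winning)
year_zero_values, year_one_values = zip(*pairs)
r = pearson_r(year_zero_values, year_one_values)
print(f"{len(pairs)} pairs, R = {r:.2f}, R^2 = {r * r:.2f}")
```

A team that stays up for three seasons contributes two pairs, one for each consecutive pair of seasons, while a team with only a single season in the data contributes none.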


Below is a plot showing how the time spent losing varies in consecutive seasons. Broadly speaking, there is a reasonable correlation from one season to the next but with a degree of variation also (R^2=0.41). The data suggests that 64% of time spent losing is repeatable, leaving 36% in terms of variation from one season to the next. This variation could result from many factors such as pure randomness/luck, systemic or tactical influences, injury, managerial and/or player changes etc.


Relationship between time spent losing per game from one season to the next.

As might be expected, title winning teams and relegated sides tend towards the extreme ends in terms of time spent losing. Generally, teams at these extreme ends in terms of success over and under perform respectively compared to the previous season.


Below is the equivalent plot for time spent winning. Again there is a reasonable correlation from one season to the next, with the relationship for time spent winning (R^2=0.47) being stronger than for time spent losing. The data suggests that 67% of time spent winning is repeatable, leaving 33% in terms of variation from one season to the next.


Relationship between time spent winning per game from one season to the next.

As might be expected, title winning teams spend a lot of time winning. The opposite is true for relegated teams. Title winners generally improve their time spent winning compared to the previous season. Interestingly, they often then see a drop off in the following season.

Manchester City and Liverpool really stick out here in terms of their improvement relative to 2012/13. Liverpool spent 19 minutes more per game in a winning position in 2013/14 than they did the previous season; I have this as the second biggest improvement in the past 15 seasons. They were narrowly pipped into second place (sounds familiar) by Manchester City this season, who improved by close to 22 minutes. They spent 51 and 48 minutes in a winning position per game respectively. They occupy the top two slots for time spent winning in the past 15 seasons.

According to football-lineups.com, Manchester City and Liverpool scored their first goals of the match in the 26th and 27th minutes respectively. Chelsea were the next closest in the 38th minute. They were also in the top four for how late they conceded their first goal on average, with Liverpool conceding in the 55th minute and City in the 57th. Add in their ability to rack up the goals when leading and you have a recipe for spending a lot of time winning.


The final plot below is for time spent drawing. Football-lineups doesn’t report the figures for drawing directly so I just estimated it by subtracting the winning and losing figures from 90. There will be some error here as this doesn’t account for injury time but I doubt it would hugely alter the general picture. The relationship here from season to season is almost non-existent (R^2=0.013), which implies that time spent drawing regresses to the mean by 89% from season to season.
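The repeatability percentages quoted for the losing and drawing plots appear to come from taking the square root of R^2, with the remainder treated as regression to the mean. The sketch below shows that arithmetic along with the drawing estimate; the function names are my own and the example minutes passed to `minutes_drawing` are hypothetical.

```python
import math


def repeatability(r_squared):
    """Correlation coefficient R, assuming the relationship is positive."""
    return math.sqrt(r_squared)


def regression_to_mean(r_squared):
    """Fraction of a team's deviation from average expected to vanish next season."""
    return 1.0 - repeatability(r_squared)


def minutes_drawing(minutes_winning, minutes_losing, match_length=90.0):
    """Estimate time spent drawing per game, ignoring injury time."""
    return match_length - minutes_winning - minutes_losing


# R^2 values quoted in the text for time spent losing and drawing.
print(f"losing:  {repeatability(0.41):.0%} repeatable")
print(f"drawing: {regression_to_mean(0.013):.0%} regression to the mean")

# Hypothetical team: 40 minutes winning and 20 minutes losing per game.
print(f"drawing estimate: {minutes_drawing(40.0, 20.0):.1f} minutes")
```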


Relationship between time spent drawing per game from one season to the next.

Teams seemingly have limited control on the amount of time they spend drawing. I suspect this is a combination of team quality and incentives. Good teams have a reasonable control on the amount of time they spend winning and losing (as seen above) and it is in their interests to push for a win. Bad teams will face a (literally) losing battle against better teams in general, leading to them spending a lot of time losing (and not winning). It should be noted that teams do spend a large proportion of their time drawing though (obviously this is the default setting for a football match given the scoreline starts at 0-0), so it is an important period.

We can also see the shift in Liverpool and Manchester City’s numbers; they replaced fairly average numbers for time spent drawing in 2012/13 with much lower numbers in 2013/14. Liverpool’s time spent drawing figure of 29.8 minutes this season was the lowest value in the past 15 seasons according to this data!


There we have it then. In broad terms, time spent winning and losing exhibit a reasonable degree of repeatability but with significant variation superimposed. In particular, it seems that title winners require a boost in their time spent winning and a drop in their time spent losing to claim their prize. Perhaps unsurprisingly, things have to go right for you to win the title.

As far as this season goes, Manchester City and Liverpool both improved their time spent winning dramatically. If history is anything to go by, both will likely regress next season and not have the scoreboard so heavily stacked in their favour. It will be interesting to see how they adapt to such potential challenges next year.

Luis Suárez: Home & away

Everyone’s favourite riddle wrapped in an enigma was a topic of Twitter conversation between various analysts yesterday. The matter at hand was Luis Suárez’s improved goal conversion this season compared to his previous endeavours. Suárez has previously been labelled as inefficient by members of the analytics community (not the worst thing he has been called mind), so explaining his upturn is an important puzzle.

In the 2012/13 season, Suárez scored 23 goals from 187 shots, giving him a 12.3% conversion rate. So far this season he has scored 25 goals from 132 shots, which works out at 18.9%.

What has driven this increased conversion?

Red Alert

Below I’ve broken down Suárez’s goal conversion exploits into matches played home and away over the past two seasons. In terms of sample sizes, in 2012/13 he took 98 shots at home and 89 shots away, while he has taken 69 and 63 respectively this season.

Season    Home    Away    Overall
2012/13   11.2%   13.5%   12.3%
2013/14   23.2%   14.3%   18.9%

The obvious conclusion is that Suárez’s improved goal scoring rate has largely been driven by an increased conversion percentage at home. His improvement away is minor, coming in at 0.8 percentage points, but his home improvement is a huge 12 percentage points.
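For anyone wanting to check the arithmetic, the overall conversion rates and the home and away changes can be reproduced from the figures quoted above. This is just a small sketch; the function and variable names are my own.

```python
def conversion_rate(goals, shots):
    """Goals as a percentage of shots taken."""
    return 100.0 * goals / shots


# Overall goals and shots quoted in the text.
overall_2012_13 = conversion_rate(23, 187)
overall_2013_14 = conversion_rate(25, 132)

# Changes in the quoted home and away conversion rates, in percentage points.
home_change = round(23.2 - 11.2, 1)
away_change = round(14.3 - 13.5, 1)

print(f"2012/13 overall: {overall_2012_13:.1f}%")
print(f"2013/14 overall: {overall_2013_14:.1f}%")
print(f"home change: {home_change} points, away change: {away_change} points")
```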

What could be driving this upturn?

Total Annihilation

Liverpool’s home goal scoring record this season has seen them average 3 goals per game compared to 1.7 last season. Liverpool have handed out several thrashings at home this season, scoring 3 or more goals in nine of their fourteen matches. Their away goal scoring has improved from 2 goals per game to 2.27 per game for comparison.

Liverpool have been annihilating their opponents at home this season and I suspect Suárez is reaping some of the benefit of this with his improved goal scoring rate. Liverpool have typically gone ahead early in their matches at home this season but aside from their initial Suárez-less matches, that hasn’t generally seen them ease off in terms of their attacking play (they lead the league in shots per game at home with 20.7).

My working theory is that Suárez has benefited from such situations by taking his shots under less pressure and/or better locations when Liverpool have been leading at home. I would love to hear from those who collect more detailed shot data on this.

Drilling down into some more shooting metrics at home adds some support to this. Suárez has seen a greater percentage of his shots hit the target at home this season compared with last (46.4% vs 35.7%). He has also seen a smaller percentage being blocked this season (13% vs 24.5%). Half of Suárez’s shots on target at home this season have resulted in a goal compared to 31.4% last season. Away from home, the comparison between this season and last is much closer.

These numbers are consistent with Suárez taking his shots at home this season in better circumstances. I should stress that there is a degree of circularity here as Suárez’s goal scoring is not independent of Liverpool’s. Further analysis is required.


The above is an attempt to explain Suárez’s improved goal scoring form. I doubt it is the whole story but it hopefully provides some clues ahead of more detailed analysis. Suárez may well have also benefited from a hot-streak this season and the big question will be whether he can sustain his goal scoring form over the remainder of this season and into next.

As I’ve shown previously, there is a large amount of variability in player shot conversion from season to season. Some of this will be due to ‘luck’ or randomness but some of this could be due to specific circumstances such as those potentially aiding Suárez this season. Explaining the various factors involved in goal scoring is a tricky puzzle indeed.


All data in this post are from Squawka and WhoScored.

You’ll never win anything with crosses

You’ll probably have heard about Manchester United’s penchant for crossing in their match against Fulham yesterday. If you haven’t, all 81 of them are illustrated below in their full chalkboard glory.

Manchester United’s crosses in the Premier League match against Fulham on the 9th February 2014. All 81 of them. Image via Squawka.

Rather than focus on the tactical predictability of such a strategy, I’m going to take a look at whether it can be a successful one over the long term.

In the public work on attacking strategies, the analytics community isn’t quite at the stage where the merits of individual strategies have been quantified. The work so far suggests that crossing is probably on the lower end though in terms of effectiveness. Ted Knutson did a nice summary of the work in this area here.

Can crossing bring success?

Given this, I’m going to assess crosses from a different angle. Over the past five seasons, the Premier League Champions have averaged 2.3 goals per game. The fourth placed team has averaged 1.8 goals per game. This suggests that a top team needs an attacking strategy that can yield around two goals per game. Let’s see if crossing can get you there.

I’m going to focus on open play crosses as I feel that is more relevant from a tactical perspective; set piece crosses are a different (more effective) matter. Based on data from the 2011/12 Premier League season, I found that on average it took 79 crosses in open play for a single goal to be scored. On average, teams had 22 open play crosses per game. So an average Premier League team would expect to score a goal from an open play cross every three-to-four games. I only have data for one season, so let’s be generous and round that down to a goal every three matches. That is a long way off two goals per game.

Let’s consider an example of a team that both crosses more than average and converts those crosses into goals more efficiently e.g. Manchester United in 2011/12. They averaged 22 open-play crosses per game and scored 19 goals, which works out at 43.5 crosses per goal. So even a really good crossing team in terms of their goal return could only manage a goal from an open play cross every two games. The caveat to this last point is that I don’t have the data to look at whether that is a sustainable level of goal production from crosses.

Based on the above, I would say it is basically impossible to be an elite team and use crossing as your main strategy. If you were good at set pieces, you could probably add another 20 or so goals over a season but that still only puts you at a goal per game average.
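The back-of-the-envelope numbers above can be laid out as follows. This is only a sketch of the arithmetic using the figures quoted in the text; the 38-game season is the standard Premier League length and the function name is my own.

```python
GAMES_PER_SEASON = 38  # standard Premier League season length


def games_per_goal(crosses_per_goal, crosses_per_game):
    """Number of matches needed, on average, to score once from an open-play cross."""
    return crosses_per_goal / crosses_per_game


# League average in 2011/12: 79 crosses per goal at 22 crosses per game.
league_games_per_goal = games_per_goal(79, 22)

# Manchester United in 2011/12: 43.5 crosses per goal at 22 crosses per game.
united_games_per_goal = games_per_goal(43.5, 22)

# 19 goals from open-play crosses plus roughly 20 more from set pieces.
goals_per_game_with_set_pieces = (19 + 20) / GAMES_PER_SEASON

print(f"league average: a goal every {league_games_per_goal:.1f} games")
print(f"Manchester United: a goal every {united_games_per_goal:.1f} games")
print(f"with set pieces: {goals_per_game_with_set_pieces:.2f} goals per game")
```

Even the generous version of the sum lands at roughly a goal per game, around half of what a title challenger has needed.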

That isn’t to say that crossing is pointless – as a part of a varied attacking approach and against an opponent who isn’t dug in and ready for them, they can be an effective source of goals (see the video of Dani Alves assists below and Luis Suárez’s sublime assists in the past two games).

This is where the problem occurs for Moyes. According to WhoScored, in the last three seasons under Ferguson, Manchester United averaged 27, 25 and 27 crosses per game while posting 6, 3 and 3 through-balls per game. The crossing figure is up to 29 per game now with through-balls down to a paltry one per game. Crossing is not a new thing at Manchester United but more of their play under Moyes is focussed down the flanks; around 30% of their attacks under Ferguson in his last three years came down the middle of the pitch. Under Moyes, that has dropped to 24%, which is the lowest proportion in the league. This was wonderfully illustrated in this piece by Mike Goodman for Grantland earlier this season.

Moyes’ tactics have seemingly reduced Manchester United’s attacking effectiveness from its previous elite level, which matches up with the successful lowering of expectations of the current champions’ prospects.