Sunday, November 23, 2014

Maximising Team Points Hauls

With the final race of the 2014 season run, the Mercedes drivers' battle for the Drivers' Championship over, and the future of McLaren's drivers still uncertain, now may be a good time to ask how well the drivers supported each other in terms of maximising their team's points haul.

Let's start with Mercedes. The following charts show how the drivers fared in terms of ranked position in each race, and points taken. The coloured drop line identifies which driver had the upper hand and also clearly indicates how far apart the drivers were.





In terms of points, the team's points haul across the rounds of the 2014 championship can be summarised using the following chart (final race points have been halved for the purposes of this chart):


The horizontal x-axis shows the number of points taken in a particular race by the highest placed driver in the team. The vertical y-axis shows the number of points taken in the corresponding race by the lower placed team-mate. The red line is the points maximisation line - points on the line show that the team maximised points in a race given the position of the highest placed driver in the team.

The numbers represent a count of races where a particular points combination occurred. The size of each circle is proportional to this value.
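
One way to read the points maximisation line: for a given finish by the higher placed driver, the best the team-mate can do is take the next position down. The following is a minimal sketch of that idea in base R, not the code used to draw the chart. It assumes the standard 2014 points scheme with the double-points finale ignored; maxSupportPoints() and the races data frame are hypothetical names and values introduced for illustration.

```r
# Points awarded for finishing positions 1-10 under the 2014 scoring scheme
# (the double-points final round is ignored here)
pointsScheme = c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1)

# Most points the lower placed team-mate could have taken, given the points
# taken by the higher placed driver: the points for the next position down
maxSupportPoints = function(topPoints) {
  pos = match(topPoints, pointsScheme)
  # No points for the top driver, or top driver in 10th: team-mate scores 0
  ifelse(is.na(pos) | pos >= length(pointsScheme), 0, pointsScheme[pos + 1])
}

# Hypothetical race results for a driver pairing (not real data)
races = data.frame(top = c(25, 25, 18, 15), support = c(18, 12, 15, 0))
# Shortfall of zero means the race sits on the points maximisation line
races$shortfall = maxSupportPoints(races$top) - races$support
races
```

A shortfall of zero corresponds to a mark on the line; larger values correspond to marks further below it.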

If we split the drivers out and generate co-ordinate points based on the points taken across the driver pairing for each race, we get the following style of chart.


This time, we have two guides representing the points support each driver offers the other. Marks away from the dotted lines show how far a driver was from maximising the team points haul, based on the points taken by the higher placed driver in the team. If there are lots of marks in the lower right half of the chart, the driver on the vertical y-axis is the underperformer. If the marks appear in the top left half, the driver identified on the horizontal x-axis is the underperformer. Marks on the red dotted line show the x-axis driver was better placed, but team points were maximised. Conversely, marks on the blue dotted line show the y-axis driver was higher placed but, again, given that position, team points were maximised. If the team always maximised points, the magenta best fit line would lie between the two dotted lines.

Here are the corresponding charts for McLaren.









These charts are working sketches and are likely to appear in some form in the Wrangling F1 Data With R book. Data used to generate the charts was obtained from the ergast API.


Friday, November 21, 2014

Lap Position Count Charts

Whilst putting together a simple routine to calculate the number of laps led by each driver from the ergast data, it struck me that we could count - and chart - the number of racing laps held in a particular position by each driver in a particular race.

The following chart summarises the 2012 Australian Grand Prix in this way.

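library(plyr)
library(ggplot2)

#lapTimes is assumed to hold one row per driver per lap, with driverRef, lap and position columns
#Count the number of laps each driver held each position for
posCounts=ddply(lapTimes,.(driverRef,position),summarise,poscount=length(lap))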

#Set the transparency relative to the proportion of the race in each position
alpha=function(x) 100*x/max(lapTimes$lap)
#Rotate the x-tick labels
xRotn=function(s=7) theme(axis.text.x=element_text(angle=-90,size=s))

g=ggplot(posCounts)
#For each driver, plot the number of laps in each race position
g=g+geom_text(aes(x=driverRef,y=position,label=poscount,alpha=alpha(poscount)),size=4)
g+theme_bw()+xRotn()+xlab(NULL)+ylab(NULL)

Drivers are aligned along the bottom according to rank position at the end of the race. (Drivers who were unclassified are ranked according to how far into the race they got and, where two or more unclassified drivers went out on the same lap, by the position they were in when they retired.)

The number shows the number of laps completed in each race position; the transparency level is also indicative of this value. The pink circle shows the position the driver was in on their last lap, its size proportional to the total number of laps the driver completed in the race. The empty grey circles show the drivers' grid positions.

Where the pink circle is off the diagonal, it shows that a driver was in a higher position at the point they exited the race than the one they were finally classified in. The larger the pink circle, the closer they were to the end of the race at the point they left it. So, for example, in this case we see that Maldonado appears to have been in 6th position quite deep into the race, despite being ranked 13th in the end.

The large lap counts shared by Vettel and Hamilton for second and third position don't tell us how these were distributed - was Hamilton in second during a large part of the race, for example, then ceding to Vettel for the latter half of the race, or were they continually changing positions in a hard fought battle? To distinguish that, we would need to look to the actual lapchart, or another metric.
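
One candidate for such a metric is the set of unbroken stints a driver spent in each position. The following is a rough sketch using base R's rle() on made-up data; the driverLaps data frame is hypothetical, and the laps are assumed to be in ascending order.

```r
# Hypothetical lap-by-lap positions for one driver (not real race data)
driverLaps = data.frame(lap = 1:10,
                        position = c(2, 2, 2, 3, 3, 2, 2, 2, 2, 3))

# rle() collapses consecutive repeats of a value into (value, run length) pairs
runs = rle(driverLaps$position)
stints = data.frame(position = runs$values,
                    laps = runs$lengths,
                    endLap = cumsum(runs$lengths))
# Work back from the end lap and stint length to the lap the stint started on
stints$startLap = stints$endLap - stints$laps + 1
stints
```

Lots of short stints would suggest positions being swapped back and forth, whereas a few long ones would suggest a single change of position.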

A couple of other summary details that are missing from the chart include a total lap count for each driver, and an indication of the actual final classification of each driver (e.g. to distinguish those drivers that were unclassified).
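
Both the laps led count that prompted this chart and the missing total lap count can be read off a simple cross-tabulation. This is a sketch in base R rather than the recipe from the book, and the lapTimes values below are invented for illustration.

```r
# Hypothetical lap data with driverRef, lap and position columns (not real data)
lapTimes = data.frame(
  driverRef = rep(c("button", "hamilton", "vettel"), each = 4),
  lap = rep(1:4, times = 3),
  position = c(1, 1, 1, 1,  2, 2, 3, 3,  3, 3, 2, 2)
)

# Cross-tabulate drivers against positions: each cell is a lap count
posTable = table(lapTimes$driverRef, lapTimes$position)
posTable

# Laps led is the count of laps spent in first position
lapsLed = posTable[, "1"]
lapsLed

# Total laps completed by each driver is the sum across positions
totalLaps = rowSums(posTable)
totalLaps
```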

The full recipe for creating this chart from data obtained from the ergast database can be found in the Wrangling F1 Data With R book.

Monday, November 17, 2014

F1 Drivers' Championship Showdown, 2014

I tweaked the code for my F1 Drivers' Championship winning combinations explorer to show how many points each driver could win - or lose - by in Abu Dhabi, assuming I've got my sums right.

So if Rosberg wins, he loses the championship by 3 points if Hamilton comes second, but takes it by 3 if Hamilton comes in third.

If Hamilton fails to finish, and Rosberg comes in 6th, Rosberg loses it by a single point. If Hamilton is 10th and Rosberg 5th, or if Hamilton is 7th and Rosberg is third, Rosberg wins by a single point. And so on.
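
The sums behind these scenarios can be sketched in a few lines of R. This is not the code behind the explorer; it assumes Hamilton went into the race on 334 points and Rosberg on 317 (values not given in this post), that the finale carries double points, and it ignores tie-breaks. The function names are made up for illustration.

```r
# Points for positions 1-10, doubled for the 2014 season finale
finalePoints = 2 * c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1)

# Points standings going into Abu Dhabi (assumed values, not from the post)
hamiltonPts = 334
rosbergPts = 317

# Points scored for a finishing position; NA (did not finish) or outside
# the top ten scores nothing
racePoints = function(pos) {
  if (is.na(pos) || pos > length(finalePoints)) 0 else finalePoints[pos]
}

# Rosberg's final margin over Hamilton: positive means Rosberg takes the title
rosbergMargin = function(rosbergPos, hamiltonPos) {
  (rosbergPts + racePoints(rosbergPos)) - (hamiltonPts + racePoints(hamiltonPos))
}

rosbergMargin(1, 2)   # Rosberg wins the race, Hamilton second
rosbergMargin(1, 3)   # Rosberg wins the race, Hamilton third
rosbergMargin(6, NA)  # Rosberg sixth, Hamilton fails to finish
```

A negative margin means Hamilton takes the title by that many points.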

See the interactive version here.

Elements of this recipe may form part of a forthcoming chapter in the Wrangling F1 Data With R book.

Saturday, November 8, 2014

F1 2014 Championship Race - Round 17 Results

In the previous post I demonstrated a couple of charts that showed the evolution of the drivers' championship race up to round 17. In this post, I increase the information density of the lapchart styled display in two different ways through the use of text annotations.

As before, the data comes from the ergast API:

#Load in the core utility functions to access ergast API
source('ergastR-core.R')

#Get the standings after each of the first 17 rounds of the 2014 season
df=data.frame()
for (j in seq(1,17)){
  dft=seasonStandings(2014,j)
  dft$round=j
  df=rbind(df,dft)
}
#Data is now in: df

The returned data contains the championship standing, and points to date, for each driver at the end of each round. We can derive further data elements from it:

library(plyr)

#Sort the data by ascending round and position
df=arrange(df,round,pos)
#Find the gap to the driver ahead (zero or negative: points minus those of the driver one place higher)
df=ddply(df,.(round),transform,diffbehind=diff(c(points[[1]],points)))

#Sort by ascending round and descending position
df=arrange(df,round,desc(pos))
#Find how many points ahead of the driver behind each driver is
df=ddply(df,.(round),transform,diff=diff(c(points[[1]],points)))
#Derive how many points each driver scored in each race
df=ddply(df,.(driverId,year),transform,racepoints=diff(c(0,points)))
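
To see how the diff(c(0,points)) idiom recovers per-race scores from a running total, here is a toy example in base R. The standing data frame is made up for illustration and the rounds are assumed to be in ascending order.

```r
# Hypothetical cumulative points for one driver after each of five rounds
standing = data.frame(round = 1:5, points = c(25, 43, 43, 58, 83))

# Prepending 0 means the first difference is the round 1 score itself
standing$racepoints = diff(c(0, standing$points))
standing
```

The derived race points should always sum back to the final points total, which makes for a quick sanity check.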

As before, we can generate a base chart:

library(ggplot2)
library(directlabels)

#The base chart
g=ggplot(df,aes(x=round,y=pos,group=driverId))

charter=function(g) {
  g=g+geom_line()
  #Remove axis labels and colour legend
  g=g+ylab(NULL)+xlab(NULL)+guides(color=FALSE)
  #Add a title
  g=g+ggtitle("F1 Drivers' Championship Race, 2014")
  #Add the line labels, resized (cex), and with an x-value offset
  g=g+geom_dl(aes(label=driverId),method=list("last.points",cex=0.7,dl.trans(x=x+0.2)))
  #Add right hand side padding to the chart so the labels don't overflow
  g=g+scale_x_continuous(limits=c(1,20))
  g
}

g=charter(g)

Let's annotate the chart - firstly with data showing the number of points gained at each race. As previously, crossed lines show changes in championship standing between consecutive rounds:

g+geom_text(data=df,aes(label=racepoints),vjust=-0.4,size=3)


That's okay, as far as it goes, but we could perhaps add in colour relative to the number of points scored in each race to highlight the higher values a little more clearly.

g+geom_text(data=df,aes(label=racepoints,col=racepoints),vjust=-0.4,size=3)


The default colour scheme scales from black to light blue. The higher values look a little washed out to me, making me think it might be worth exploring other colour mappings to highlight the higher values more clearly.

Annotating the chart with points scored per race helps us see how well each driver fared in a particular race, but the chart does not give us a sense of how many points separate drivers in the championship standings at the end of each round. We can address this by using the total number of championship points scored to date as the text label, preserving an indication of the number of points awarded for each race by using the colour dimension.

g+geom_text(data=df,aes(label=points,col=racepoints),vjust=-0.4,size=3)+scale_color_continuous(high='red')


Looking down a column, we can compare the number of points separating drivers in the drivers' championship at the end of each round. From the colour field we can see how drivers placed next to each other compared in terms of points awarded in each round. Looking along a line, we can (if necessary) calculate the number of points obtained in a particular round as a simple subtraction.

Elements of this recipe may form part of a forthcoming chapter in the Wrangling F1 Data With R book.

Sunday, November 2, 2014

The 2014 Drivers' Championship Race Going in to Round 17

Some quick doodles for a new chapter of Wrangling F1 Data With R, looking at the state of the drivers' championship race as we go in to round 17.

#Load in the core utility functions to access ergast API
source('ergastR-core.R')

#Get the standings after each of the first 16 rounds of the 2014 season
df=data.frame()
for (j in seq(1,16)){
  dft=seasonStandings(2014,j)
  dft$round=j
  df=rbind(df,dft)
}
#Data is now in: df

Now we can have a look at the data. First, the race in the style of a lapchart, plotting the position standings after each round.

library(ggplot2)
library(directlabels)

#The base chart
g=ggplot(df,aes(x=round,y=pos,col=driverId))
g=g+geom_line()
#Remove axis labels and colour legend
g=g+ylab(NULL)+xlab(NULL)+guides(color=FALSE)
#Add a title
g=g+ggtitle("F1 Drivers' Championship Race, 2014")
#Add the line labels, resized (cex), and with an x-value offset
g=g+geom_dl(aes(label=driverId),method=list("last.points",cex=0.7,dl.trans(x=x+0.1)))
#Add right hand side padding to the chart so the labels don't overflow
g=g+scale_x_continuous(limits=c(1,18))

This chart shows competition throughout the season, particularly between the first two places (Rosberg and Hamilton), fourth to sixth (Bottas, Vettel and Alonso), and tenth, eleventh and twelfth (Magnussen, Perez and Raikkonen).

We can get a better feel for the competition in terms of the number of points separating the drivers.

#The only difference is to the base chart
g=ggplot(df,aes(x=round,y=points,col=driverId))
#All the other elements of the chart definition are the same


(Note that some of the labels occlude each other; we would need to manage this by hand, using the directlabels dl.move() function to apply the necessary vjust offset to each affected driverId group, e.g. alonso.)

Here we see how close fought the fourth to sixth battle has become, as has the points battle for tenth place. We also see a late season charge from Massa, who could still challenge Hulkenberg for eighth.

Let's annotate the chart a little more by placing a guideline showing the boundary between 10th and 11th positions.

library(plyr)
#Generate a guide that is the mean points value of 10th and 11th positions
dfx=ddply(df[df['pos']==10 | df['pos']==11,],
  .(round), summarize, points=mean(points))
dfx$driverId=''


#Get the drivers fighting around 10th at the end of round 16
#Note that other drivers may have contended this position earlier in the season
df.battle=df[df$driverId %in% as.character(df[df$round==16 & df$pos>=10 & df$pos<=12,'driverId']),]
#Base chart
g=ggplot(df.battle,aes(x=round,y=points,col=driverId))
g=g+geom_line()
g=g+ylab(NULL)+xlab(NULL)+ggtitle("F1 Drivers' Championship Race, 2014")+guides(color=FALSE)
g=g+geom_dl(aes(label=driverId),method=list("last.points",cex=0.7,dl.trans(x=x+0.1)))
g=g+scale_x_continuous(limits=c(1,18))

#Add in the guideline
g+geom_line(data=dfx,aes(x=round,y=points),col='black',linetype="dashed")


Once again, we really need to tweak the label positions manually so that they are not overlapping if we want to use this chart as a presentation graphic.

Elements of this recipe may form part of a forthcoming chapter in the Wrangling F1 Data With R book.

Saturday, November 1, 2014

Hamilton Chases Record Streak of Consecutive F1 Wins in the Same Season By a British Driver

So it seems that in the run up to the United States Grand Prix at the Circuit of the Americas, Lewis Hamilton "is attempting to become the first British driver since Nigel Mansell in 1992 to win five races in a row" (Guardian, "Futures of Force India and Sauber become subject of speculation").

One of the new chapters I pushed yesterday to Wrangling F1 Data With R covers "streakiness", so I thought I could try to use the routines described there to review in-season streaks of length five or more from previous seasons using data from the ergast database.

As a first (optimisation) pass, I thought I'd identify British drivers who have won 5 or more races in a season; this could then be followed by looking for streaks of 5 or more wins by those drivers within their multiple-win seasons.

Firstly, we can get the drivers of a particular nationality with multiple wins within a season by querying the ergast database using a query along the lines of:

multiwinners.gb = dbGetQuery(ergastdb,
 'SELECT driverRef, d.driverId, nationality, MAX(wins), year 
 FROM driverStandings ds JOIN races r JOIN drivers d 
 WHERE ds.raceId=r.raceId AND ds.driverId=d.driverId 
 AND ds.driverId IN (SELECT DISTINCT driverId FROM drivers WHERE nationality="British") 
 GROUP by year,d.driverId 
 HAVING MAX(wins)>=5')

This gives us a set of results of the form:

    driverRef driverId nationality MAX(wins) year
1       clark      373     British         7 1963
2       clark      373     British         6 1965
3     stewart      328     British         6 1969
4     stewart      328     British         6 1971
5     stewart      328     British         5 1973
6        hunt      231     British         6 1976
7     mansell       95     British         5 1986
8     mansell       95     British         6 1987
9     mansell       95     British         5 1991
10    mansell       95     British         9 1992
11 damon_hill       71     British         6 1994
12 damon_hill       71     British         8 1996
13   hamilton        1     British         5 2008
14     button       18     British         6 2009

We can then generate streak reports for each of those drivers in each of those years, identifying the following streaks of 5 wins or more within a season by a British driver using the streakReview() function:

ddply(multiwinners.gb,.(driverRef,year),function(x) streakReview(x$driverRef,length=5,topN=1,years=x$year,typ=1))

  driverRef year start end l                startc                          endc starty
1     clark 1965     1   6 6 Prince George Circuit                   Nürburgring   1965
2   mansell 1992     1   5 5               Kyalami Autodromo Enzo e Dino Ferrari   1992
  endy brokenbyy                    brokenbyc
1 1965      1965 Autodromo Nazionale di Monza
2 1992      1992            Circuit de Monaco

  • In 1965, Jim Clark won each of the first 6 races he started that season (he did not contest the Monaco round), starting with a win at Prince George Circuit, with the last win in the streak at the Nürburgring.
  • In 1992, Nigel Mansell won the first five rounds of the season, starting at Kyalami with the final win of the streak at Autodromo Enzo e Dino Ferrari.
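
The streakReview() function itself is described in the book rather than here. As a rough sketch of the underlying idea only, and not the book's implementation, runs of consecutive wins can be found with base R's rle(). The winStreaks() function and the finishes vector below are hypothetical; the output columns simply borrow the start, end and l names from the report above.

```r
# Hypothetical finishing positions for one driver over a season, in round order
# (NA marks a retirement); not real data
finishes = c(1, 1, 1, 1, 1, 2, 1, 1, NA, 1)

# Find all unbroken runs of wins that are at least minLength races long
winStreaks = function(positions, minLength = 5) {
  # Treat a missing result as "not a win"
  won = !is.na(positions) & positions == 1
  # Collapse the win/no-win sequence into runs
  runs = rle(won)
  ends = cumsum(runs$lengths)
  starts = ends - runs$lengths + 1
  # Keep only runs of wins that are long enough
  keep = runs$values & runs$lengths >= minLength
  data.frame(start = starts[keep], end = ends[keep], l = runs$lengths[keep])
}

winStreaks(finishes)
winStreaks(finishes, minLength = 2)
```

Note that start and end here index the driver's own sequence of results, so a round the driver did not enter would not appear in the sequence at all.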


For more detailed code examples on wrangling Formula One data with R, see the Wrangling F1 Data With R book.

Thursday, October 23, 2014

Wrangling F1 Data With R - Living Book Release

Earlier this year, I started drafting a book on "Wrangling F1 Data With R". In part this was to explore self-publishing production workflows, in part to try to pull together the various notes and doodlings I've done over the last few years - and to act as a home for further tinkerings.

As projects such as this tend to do, it stalled. But in an attempt to restart it, I've published what I've done to date over on Leanpub in the hope that it'll provoke me to do more...

The book is available in a couple of forms:
  • as a paid for item: if you actually buy the book, the Leanpub model means you'll get access to any and all updates and revisions to it. There is a minimum price point and a pay-what-you-like-over-the-minimum price point.
  • as a preview item: I'm going to be randomly changing the free preview chapters (though I don't know how frequently) so over time every chapter will appear there. On occasion, the whole book to date will appear as a free preview item. If I do blogposts around particular topics (and I'm hoping to start blogging here again, though perhaps not significantly till next year) and those topics are book topics, the corresponding chapters will probably appear in the preview around the time of the post and for a week or two after.
You can find the free preview - and a place to buy access to the living book - here: Wrangling F1 Data With R.

Note that the chapters in the preview and the actual book may still be a bit ragged and in draft or incomplete form. That's just the way it is... (If nothing else, it'll give you some hints about how a particular chapter might develop...)

At the moment the price is set at the minimum amount to enrol it in the affiliate marketing program. Affiliates get paid half the minimum price of the book. If an affiliate is responsible for the sale of a book at the minimum price, they get paid more than me.

Leanpub also allows coupon based marketing. So here are a couple of offers...
If you think you're deserving of a coupon, let me know...

This is all something of an experiment - in fact, several experiments - so any and all comments and feedback welcome... And purchases, of course;-)