Formula 1: A brief look through history

Introduction

Formula 1 is the highest level of international auto racing sanctioned by the FIA (Fédération Internationale de l'Automobile). F1 cars are open-cockpit, open-wheeled racing cars that can reach speeds of up to 230 miles per hour! Most drivers start out at a very young age racing go-karts and slowly progress to F3 and F2 racing. We will look at various F1 statistics and how the sport has changed over time. Through this project, we will examine whether the sport has become more popular and safer over time.

Lewis Hamilton's Car - 2020 World Champion and 7-Time World Champion

Data Scraping

To visualize how the sport has progressed over time, we'll start by scraping race data dating back to F1's first year, 1950. We will be using http://ergast.com/mrd/ for our race data. This database hosts a free API that can return data in XML, JSON, or JSONP formats.

In [62]:
import pandas as pd
import json
import requests
import numpy as np
import folium
from folium import plugins
from folium.plugins import MarkerCluster
from statsmodels.tsa.holtwinters import ExponentialSmoothing, SimpleExpSmoothing
In [2]:
url = "http://ergast.com/api/f1.json" # website api url
response = requests.get(url) 
data = json.loads(response.text) # parsing the json data

df = pd.DataFrame.from_dict(data["MRData"]["RaceTable"]["Races"]) 
# after finding the relevant json path, we specify what data to convert into a pandas dataframe

This is a perfectly fine API call, but the API does not return all of the data in a single response. To get everything we need, we request the data in pages of 100 races, looping until we have collected them all.

In [3]:
total = int(data["MRData"]["total"])
limit = 100
offset = 0
df = pd.DataFrame() #blank dataframe object
while (offset < total):
    dataset = url + "?limit=" + str(limit) + "&offset=" + str(offset)
    offset = offset + limit  # advance by the page size so limit and offset stay in sync
    subset = pd.DataFrame.from_dict(json.loads(requests.get(dataset).text)["MRData"]["RaceTable"]["Races"])
    frames = [df, subset]
    df = pd.concat(frames)

df.head()
Out[3]:
season round url raceName Circuit date time
0 1950 1 http://en.wikipedia.org/wiki/1950_British_Gran... British Grand Prix {'circuitId': 'silverstone', 'url': 'http://en... 1950-05-13 NaN
1 1950 2 http://en.wikipedia.org/wiki/1950_Monaco_Grand... Monaco Grand Prix {'circuitId': 'monaco', 'url': 'http://en.wiki... 1950-05-21 NaN
2 1950 3 http://en.wikipedia.org/wiki/1950_Indianapolis... Indianapolis 500 {'circuitId': 'indianapolis', 'url': 'http://e... 1950-05-30 NaN
3 1950 4 http://en.wikipedia.org/wiki/1950_Swiss_Grand_... Swiss Grand Prix {'circuitId': 'bremgarten', 'url': 'http://en.... 1950-06-04 NaN
4 1950 5 http://en.wikipedia.org/wiki/1950_Belgian_Gran... Belgian Grand Prix {'circuitId': 'spa', 'url': 'http://en.wikiped... 1950-06-18 NaN

We can observe what the first 5 rows of the dataframe look like by calling .head() of the dataframe object.
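
The paging loop above can also be written as a small reusable helper. The sketch below is one possible shape, not the notebook's own method: `fetch_all_pages`, `fetch_page`, and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for the real API so the example runs offline. It assumes each response carries the total record count at `["MRData"]["total"]`, as the code above does.

```python
from typing import Callable, Dict, List


def fetch_all_pages(fetch_page: Callable[[int, int], Dict], limit: int = 100) -> List[dict]:
    """Collect every race record by requesting `limit` records at a time.

    `fetch_page(limit, offset)` is a hypothetical callable that returns the
    parsed JSON of one API response (the same structure as `data` above).
    """
    races: List[dict] = []
    offset = 0
    total = None
    while total is None or offset < total:
        page = fetch_page(limit, offset)["MRData"]
        total = int(page["total"])      # full record count reported by the API
        races.extend(page["RaceTable"]["Races"])
        offset += limit                 # advance by the page size
    return races


# Stand-in for the real API so the sketch runs offline: 250 fake races.
def fake_fetch_page(limit: int, offset: int) -> Dict:
    records = [{"season": "1950", "round": str(i)} for i in range(250)]
    return {"MRData": {"total": "250",
                       "RaceTable": {"Races": records[offset:offset + limit]}}}


all_races = fetch_all_pages(fake_fetch_page, limit=100)
print(len(all_races))
```

With the real API, `fetch_page` could wrap something like `requests.get(url, params={"limit": limit, "offset": offset}).json()`, and the resulting list can be passed to `pd.DataFrame` in one step instead of concatenating a frame per page.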

We then clean up the data by resetting the index values and dropping columns we don't need.

In [4]:
df = df.reset_index()    
df = df.drop(columns=["index","url", "Circuit", "time"])
df.head()
Out[4]:
season round raceName date
0 1950 1 British Grand Prix 1950-05-13
1 1950 2 Monaco Grand Prix 1950-05-21
2 1950 3 Indianapolis 500 1950-05-30
3 1950 4 Swiss Grand Prix 1950-06-04
4 1950 5 Belgian Grand Prix 1950-06-18
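
An optional further cleanup step is to convert the text columns to proper types. The sketch below assumes the API returns `season`, `round`, and `date` as strings (the later cells call `pd.to_numeric` on `count` for the same reason); the frame name `races` is made up for this example and holds the first two rows shown above.

```python
import pandas as pd

# First two rows of the cleaned dataframe shown above (assumed to be all strings).
races = pd.DataFrame({
    "season": ["1950", "1950"],
    "round": ["1", "2"],
    "raceName": ["British Grand Prix", "Monaco Grand Prix"],
    "date": ["1950-05-13", "1950-05-21"],
})

races["season"] = pd.to_numeric(races["season"])
races["round"] = pd.to_numeric(races["round"])
races["date"] = pd.to_datetime(races["date"])

# Datetime columns support arithmetic, e.g. days between the first two races.
days_between = (races.loc[1, "date"] - races.loc[0, "date"]).days
print(days_between)  # 8
```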

Number of Races per Year

Let's graph the number of races per year throughout F1 history. We have a dataframe with individual race data, but we do not have a count of how many races happened each year.

To do this, we can simply call .value_counts() on a particular column to count how many times each value occurs. This is returned as a pandas.Series object, so we can easily convert it to a dataframe by calling .to_frame().

In [5]:
year = df['season'].value_counts().sort_index().to_frame()
year.columns = ["number of races"]
year.head(10)
Out[5]:
number of races
1950 7
1951 8
1952 8
1953 9
1954 9
1955 7
1956 8
1957 8
1958 11
1959 9
In [6]:
f1_color = (255/255,24/255,1/255) # official formula 1 color
ax = year.plot.line(y='number of races', use_index=True, color = f1_color, figsize=(20,5))
ax.set_xlabel("Year");
ax.set_ylabel("Number of Races");
ax.set_title("Number of Races over Time");

After graphing this data, we can see that the number of races per year has been steadily increasing. In the first decade of the sport, there was an average of about 8 races per year, but recently that number has increased to 23 for 2021.
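
The first-decade average quoted above can be checked directly. This is a minimal sketch that rebuilds the 1950-1959 rows of `year` by hand from the table shown earlier; the `decade` and `per_decade` names are introduced only for this example.

```python
import pandas as pd

# Race counts for 1950-1959, copied from the table above.
year = pd.DataFrame(
    {"number of races": [7, 8, 8, 9, 9, 7, 8, 8, 11, 9]},
    index=[str(y) for y in range(1950, 1960)],  # seasons assumed to be strings
)

# Map each season to its decade (1950, 1960, ...) and average the race counts.
decade = (year.index.astype(int) // 10) * 10
per_decade = year.groupby(decade)["number of races"].mean()
print(per_decade)
```

Run against the full `year` dataframe, the same two lines would give one average per decade, which makes the long-term growth easier to quote than reading it off the line chart.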

Race Finishes Through the Years

With the cars racing at over 200 miles per hour, even the smallest mistake can lead to a fatal crash. Additionally, the cars have become more complex to engineer over time, so the number of parts that have to work together seamlessly has skyrocketed. Let's take a look at how cars have performed over time. We start by scraping data on the number of cars that 'Did not qualify' for the race, 'Did not finish' the race, had an 'Accident' in the race, and 'Finished' the race.

In [7]:
url = "http://ergast.com/api/f1/"
DNQ = [77, 81, 97]
df_dnq = pd.DataFrame() #blank dataframe object
for code in DNQ:
    for yr in range(1950,2021):
        dataset = url + str(yr) + "/status/" + str(code) + ".json?limit=200"
        subset = pd.DataFrame.from_dict(json.loads(requests.get(dataset).text)["MRData"]["StatusTable"]["Status"])
        subset['year'] = yr
        frames = [df_dnq, subset]
        df_dnq = pd.concat(frames)
        
df_dnq["count"] = pd.to_numeric(df_dnq["count"])        
df_dnq = df_dnq.groupby(["year"])[["count"]].sum().sort_index()  # sum only the numeric count column
In [8]:
fin = [1]
df_fin = pd.DataFrame() #blank dataframe object
for code in fin:
    for yr in range(1950,2021):
        dataset = url + str(yr) + "/status/" + str(code) + ".json?limit=200"
        subset = pd.DataFrame.from_dict(json.loads(requests.get(dataset).text)["MRData"]["StatusTable"]["Status"])
        subset['year'] = yr
        frames = [df_fin, subset]
        df_fin = pd.concat(frames)
        
df_fin["count"] = pd.to_numeric(df_fin["count"])        
df_fin = df_fin.groupby(["year"])[["count"]].sum().sort_index()  # sum only the numeric count column
In [9]:
acc = [2,3,104]
df_acc = pd.DataFrame() #blank dataframe object
for code in acc:
    for yr in range(1950,2021):
        dataset = url + str(yr) + "/status/" + str(code) + ".json?limit=200"
        subset = pd.DataFrame.from_dict(json.loads(requests.get(dataset).text)["MRData"]["StatusTable"]["Status"])
        subset['year'] = yr
        frames = [df_acc, subset]
        df_acc = pd.concat(frames)
        
df_acc["count"] = pd.to_numeric(df_acc["count"])   
df_acc = df_acc.groupby(["year"])[["count"]].sum().sort_index()  # sum only the numeric count column
In [26]:
dnf = [31,54]
df_dnf = pd.DataFrame() #blank dataframe object
for code in dnf:
    for yr in range(1950,2021):
        dataset = url + str(yr) + "/status/" + str(code) + ".json?limit=200"
        subset = pd.DataFrame.from_dict(json.loads(requests.get(dataset).text)["MRData"]["StatusTable"]["Status"])
        subset['year'] = yr
        frames = [df_dnf, subset]
        df_dnf = pd.concat(frames)
        
df_dnf["count"] = pd.to_numeric(df_dnf["count"])   
df_dnf = df_dnf.groupby(["year"])[["count"]].sum().sort_index()  # sum only the numeric count column
In [27]:
df_dnq = df_dnq.rename(columns={'count': 'DNQ'})
df_fin = df_fin.rename(columns={'count': 'Finishes'})
df_acc = df_acc.rename(columns={'count': 'Accidents'})
df_dnf = df_dnf.rename(columns={'count': 'DNF'})

results = pd.concat([df_fin, df_dnf, df_acc, df_dnq], axis=1, sort=False)
results = results.fillna(0)
results.head()
Out[27]:
Finishes DNF Accidents DNQ
year
1950 17 3.0 12 0.0
1951 29 3.0 9 0.0
1952 31 9.0 6 11.0
1953 33 9.0 13 0.0
1954 37 12.0 15 1.0

After scraping all this data, we put all of the different dataframes into one so we can easily graph the data. We now have data for the number of finishes, accidents, DNFs, and DNQs for every year. Let's start by plotting all 4 datapoints as line graphs.
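
The four scraping cells above differ only in their status codes and the final column name, so they could be folded into one helper. The following is a sketch under stated assumptions rather than the code that produced the results: `status_counts`, `fetch_status`, and `fake_fetch_status` are hypothetical names, and the fake fetcher replaces the HTTP call so the example runs offline.

```python
import pandas as pd


def status_counts(codes, years, fetch_status, name):
    """Total cars per year whose finishing status is one of `codes`.

    `fetch_status(year, code)` is a hypothetical callable returning the list
    found at ["MRData"]["StatusTable"]["Status"] in the API response.
    """
    rows = []
    for code in codes:
        for yr in years:
            for entry in fetch_status(yr, code):
                rows.append({"year": yr, name: int(entry["count"])})
    if not rows:
        return pd.DataFrame(columns=[name])
    return pd.DataFrame(rows).groupby("year")[[name]].sum().sort_index()


# Offline stand-in for the API: one car per status code in 1950, two in 1951.
def fake_fetch_status(year, code):
    if year == 1950:
        return [{"statusId": str(code), "count": "1", "status": "example"}]
    if year == 1951:
        return [{"statusId": str(code), "count": "2", "status": "example"}]
    return []  # assumed: an empty list when no car had that status that year


df_dnq = status_counts([77, 81, 97], range(1950, 1953), fake_fetch_status, "DNQ")
print(df_dnq)
```

Calling the helper once per category with the code lists used above would yield the four frames that are concatenated into `results`.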

In [28]:
lx = results.plot.line(use_index=True, figsize=(20,10))
lx.set_xlabel("Year");
lx.set_ylabel("Number of Cars");
lx.set_title("Race Results");

We can easily notice that the number of race finishes has steadily increased while the number of accidents is slowly decreasing. To examine whether the sport has actually become safer or whether this is just an artifact of having more races, let's make a layered area plot.

In [13]:
arx = results.plot.area(use_index=True, figsize=(20,10))
arx.set_xlabel("Year");
arx.set_ylabel("Number of Cars");
arx.set_title("Race Results");

From this we can easily observe that in the early years of F1, a huge portion of cars ended their races in accidents. More recently, even though the number of accidents has not gone down significantly, accidents as a percentage of race finishes have dropped dramatically. We also observe two large spikes in DNQs that correlate with major rule changes in Formula 1.
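
To put numbers on the percentage comparison, each year's counts can be divided by that year's total. The sketch below rebuilds the first five rows of `results` by hand from the table shown earlier; `shares` is a name introduced only for this example.

```python
import pandas as pd

# First five rows of `results`, copied from the table above.
results = pd.DataFrame(
    {
        "Finishes": [17, 29, 31, 33, 37],
        "DNF": [3, 3, 9, 9, 12],
        "Accidents": [12, 9, 6, 13, 15],
        "DNQ": [0, 0, 11, 0, 1],
    },
    index=pd.Index([1950, 1951, 1952, 1953, 1954], name="year"),
)

# Divide each row by its total so every year sums to 1.
shares = results.div(results.sum(axis=1), axis=0)
print(shares["Accidents"].round(3))
```

In 1950, for example, accidents account for 12 of the 32 recorded outcomes. Plotting `shares` with `plot.area` instead of the raw counts would show the proportions directly, independent of how many races were held that year.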

Extra Resource (can be used to generate more powerful graphs): https://seaborn.pydata.org/generated/seaborn.lineplot.html

Distribution of Races around the World

With Formula 1 becoming more and more popular over the years, let's take a look at where Formula 1 Grands Prix have taken place around the world. We start by scraping circuit data from our data source and extracting the latitude and longitude of each race location.

In [29]:
countries = pd.DataFrame() #blank dataframe object

for yr in range(1950,2021):
    dataset = url + str(yr) + "/circuits/" + ".json?limit=200"
    subset = pd.DataFrame.from_dict(json.loads(requests.get(dataset).text)["MRData"]["CircuitTable"]["Circuits"])
    for i,row in subset.iterrows():
        subset.loc[i,'lat'] = row.Location["lat"]
        subset.loc[i,'long'] = row.Location["long"]
        subset.loc[i,'country'] = row.Location["country"]
    subset['year'] = yr
    # "url" and "Location" are kept here; drop() returns a new frame and would need to be assigned to remove them
                         
    frames = [countries, subset]
    countries = pd.concat(frames)
    
In [30]:
countries = countries.reset_index()
countries = countries.drop(columns=["index"])
countries.head()
Out[30]:
circuitId url circuitName Location lat long country year
0 bremgarten http://en.wikipedia.org/wiki/Circuit_Bremgarten Circuit Bremgarten {'lat': '46.9589', 'long': '7.40194', 'localit... 46.9589 7.40194 Switzerland 1950
1 indianapolis http://en.wikipedia.org/wiki/Indianapolis_Moto... Indianapolis Motor Speedway {'lat': '39.795', 'long': '-86.2347', 'localit... 39.795 -86.2347 USA 1950
2 monaco http://en.wikipedia.org/wiki/Circuit_de_Monaco Circuit de Monaco {'lat': '43.7347', 'long': '7.42056', 'localit... 43.7347 7.42056 Monaco 1950
3 monza http://en.wikipedia.org/wiki/Autodromo_Naziona... Autodromo Nazionale di Monza {'lat': '45.6156', 'long': '9.28111', 'localit... 45.6156 9.28111 Italy 1950
4 reims http://en.wikipedia.org/wiki/Reims-Gueux Reims-Gueux {'lat': '49.2542', 'long': '3.93083', 'localit... 49.2542 3.93083 France 1950
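
The row-by-row extraction in the cell above can also be done in one step by expanding the dictionaries into columns. This is a sketch on two hand-built rows, with the `Location` dicts trimmed to the three keys the notebook uses; `location` is a helper name introduced here.

```python
import pandas as pd

# Two circuits in the shape returned by the API (values copied from the output above).
subset = pd.DataFrame([
    {"circuitId": "monaco",
     "Location": {"lat": "43.7347", "long": "7.42056", "country": "Monaco"}},
    {"circuitId": "monza",
     "Location": {"lat": "45.6156", "long": "9.28111", "country": "Italy"}},
])

# Expand each dict into its own columns, then convert coordinates to floats.
location = pd.DataFrame(subset["Location"].tolist(), index=subset.index)
subset["lat"] = location["lat"].astype(float)
subset["long"] = location["long"].astype(float)
subset["country"] = location["country"]
subset = subset.drop(columns=["Location"])  # drop() returns a new frame, so reassign
print(subset)
```

Converting the coordinates to floats at this point also means the mapping code later on does not have to handle strings.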

We now have a dataframe with location information for every circuit used in each F1 season. The latitude and longitude arrive inside the Location column, stored as a dict in each entry, and the loop above copies them into their own lat, long, and country columns. Next, we create a new dataframe that counts how many seasons each circuit has appeared on the F1 calendar. We then loop through that list and the list of all our races to attach the appropriate latitude and longitude to every circuit.

In [34]:
circuit = countries['circuitId'].value_counts().to_frame()
circuit = circuit.reset_index()
circuit.columns = ["circuit", "count"]
circuit.head()
Out[34]:
circuit count
0 monza 70
1 monaco 66
2 silverstone 54
3 spa 53
4 nurburgring 41
In [36]:
for i, cir in circuit.iterrows():
    for j, races in countries.iterrows():
        if (cir.circuit == races.circuitId):
            circuit.loc[i,'lat'] = races["lat"]
            circuit.loc[i,'long'] = races["long"]
            circuit.loc[i,'country'] = races["country"]

circuit.head()
Out[36]:
circuit count lat long country
0 monza 70 45.6156 9.28111 Italy
1 monaco 66 43.7347 7.42056 Monaco
2 silverstone 54 52.0786 -1.01694 UK
3 spa 53 50.4372 5.97139 Belgium
4 nurburgring 41 50.3356 6.9475 Germany
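
The nested loop above compares every circuit against every row of `countries`, which gets slow as the data grows. A join is one alternative; the sketch below uses a few hand-built rows in the shape of `countries`, and `counts` and `locations` are names introduced only for this example.

```python
import pandas as pd

# A few rows in the shape of `countries` (one row per circuit per season).
countries = pd.DataFrame({
    "circuitId": ["monza", "monaco", "monza", "spa"],
    "lat": ["45.6156", "43.7347", "45.6156", "50.4372"],
    "long": ["9.28111", "7.42056", "9.28111", "5.97139"],
    "country": ["Italy", "Monaco", "Italy", "Belgium"],
    "year": [1950, 1950, 1951, 1951],
})

# Count appearances per circuit, most frequent first.
counts = (countries.groupby("circuitId").size()
          .sort_values(ascending=False)
          .rename("count").reset_index()
          .rename(columns={"circuitId": "circuit"}))

# One location row per circuit, then a single join instead of a nested loop.
locations = countries.drop_duplicates("circuitId")[["circuitId", "lat", "long", "country"]]
circuit = (counts.merge(locations, left_on="circuit", right_on="circuitId")
           .drop(columns="circuitId"))
print(circuit)
```

This assumes a circuit's coordinates are the same in every season, so keeping the first row per `circuitId` loses nothing.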

Mapping

We can then use folium to map this data so we can observe the spread of races around the world.

Extra Resource: https://python-visualization.github.io/folium/quickstart.html

In [58]:
world_map= folium.Map(location=[50, 0], zoom_start=4)
marker_cluster = MarkerCluster().add_to(world_map)
In [59]:
# for each circuit, draw a circle whose radius scales with its number of races
# (the lat, long, count, and country columns live in `circuit`, not in `df`)
for i in range(len(circuit)):
    lat = float(circuit.iloc[i]['lat'])    # coordinates are stored as strings
    long = float(circuit.iloc[i]['long'])
    radius = int(circuit.iloc[i]['count'])
    popup_text = """Country : {}<br> Number of Races : {}<br>"""
    popup_text = popup_text.format(circuit.iloc[i]['country'], circuit.iloc[i]['count'])
    folium.CircleMarker(location=[lat, long], radius=radius, popup=popup_text, fill=True).add_to(marker_cluster)
# show the map
world_map
In [60]:
def heat_map(m, df):
    arr = df[['lat', 'long']].to_numpy()
    m.add_child(plugins.HeatMap(arr, radius=15))
    return m


def mark_points(m, df):
    for index, row in df.iterrows():
        folium.CircleMarker([row['lat'], row['long']], radius=1).add_to(m)
    return m



m = heat_map(mark_points(folium.Map(location=[50, 0], zoom_start=2), countries), countries)
m

We mapped the races in two different ways. The first is a simple plot of the latitude and longitude points, with the radius of each circle scaled to the number of races held at that location. Larger circles represent circuits that appear most often on the Grand Prix calendar. The largest circles are all in Europe, with Monza the largest at 70 appearances across the 71 seasons from 1950 to 2020. The second map is a heat map of races around the world, which shows a strong concentration of races in Europe. Both observations make sense because most drivers are European and the sport started, and remains extremely popular, in Europe.
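
The European concentration can also be expressed as a number. The sketch below uses only the five most frequent circuits from the `circuit` table shown earlier, and the bounding box for Europe is an assumed, deliberately rough approximation rather than an official boundary; `in_europe` is a hypothetical helper.

```python
# Top five circuits with race counts and coordinates, copied from the table above.
circuits = [
    ("monza", 70, 45.6156, 9.28111),
    ("monaco", 66, 43.7347, 7.42056),
    ("silverstone", 54, 52.0786, -1.01694),
    ("spa", 53, 50.4372, 5.97139),
    ("nurburgring", 41, 50.3356, 6.9475),
]

# Assumed, deliberately rough bounding box for Europe (degrees).
LAT_MIN, LAT_MAX = 35.0, 71.0
LON_MIN, LON_MAX = -11.0, 40.0


def in_europe(lat: float, lon: float) -> bool:
    """Return True when the point falls inside the rough bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


total_races = sum(count for _, count, _, _ in circuits)
europe_races = sum(count for _, count, lat, lon in circuits if in_europe(lat, lon))
print(europe_races / total_races)
```

Applied to every row of `circuit` rather than just the top five, the same ratio would give the share of all races held inside the box.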

Conclusion

In conclusion, Formula 1 has become more popular, safer, and far more international over its 70-year history. Even though the number of races per year has been rising steadily, I would predict that the upward trend will not last much longer and that the number of races will soon plateau. Moving tons of equipment from one track to another in the few days between races is logistically difficult, so about 23 races is probably the most the calendar will reach. The sport has also become much safer, and I expect that to continue as safety standards improve. Lastly, I think the globalization of races will continue as F1 becomes more popular around the world.
