Formula 1 is the highest level of international auto racing sanctioned by the FIA (Fédération Internationale de l'Automobile). F1 cars are open-cockpit, open-wheeled racing machines that can reach speeds of up to 230 miles per hour! Most drivers start racing go-karts at a very young age and slowly progress through F3 and F2. We will take a look at various F1 statistics and how the sport has changed over time. Through this project, we will examine whether the sport has become more popular and safer over time.
To visualize how the sport has progressed over time, we'll start by scraping race data dating back to F1's first season, 1950. We will be using http://ergast.com/mrd/ for our race data. This database hosts a free API that can return data in XML or JSON format.
import pandas as pd
import json
import requests
import numpy as np
import folium
from folium import plugins
from folium.plugins import MarkerCluster
from statsmodels.tsa.holtwinters import ExponentialSmoothing, SimpleExpSmoothing
url = "http://ergast.com/api/f1.json" # website api url
response = requests.get(url)
data = json.loads(response.text) # parsing the json data
df = pd.DataFrame.from_dict(data["MRData"]["RaceTable"]["Races"])
# after finding the relevant JSON path, we specify what data to convert into a pandas dataframe
This would be a perfectly fine API call, but the API caps how many records a single request returns, so to retrieve the full dataset we page through it using the `limit` and `offset` parameters.
total = int(data["MRData"]["total"])
limit = 100
offset = 0
df = pd.DataFrame()  # blank dataframe object
while offset < total:
    dataset = url + "?limit=" + str(limit) + "&offset=" + str(offset)
    offset = offset + limit
    subset = pd.DataFrame.from_dict(json.loads(requests.get(dataset).text)["MRData"]["RaceTable"]["Races"])
    df = pd.concat([df, subset])
df.head()
We can observe the first 5 rows of the dataframe by calling .head() on the dataframe object.
We then clean up the data by resetting the index values and dropping columns we don't need.
df = df.reset_index()
df = df.drop(columns=["index","url", "Circuit", "time"])
df.head()
Let's graph the number of races per year throughout F1 history. We have a dataframe with individual race data, but we do not have a count of how many races happened each year.
To do this, we can simply call .value_counts() on a particular column to count how many times each value repeats. This is returned as a pandas.Series object, which we can easily convert to a dataframe by calling .to_frame().
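As a toy illustration of this conversion (synthetic season labels, not the scraped data):

```python
import pandas as pd

# hypothetical season labels, standing in for df['season']
seasons = pd.Series(["1950", "1950", "1951", "1951", "1951"])

# value_counts() returns a Series; to_frame() turns it into a DataFrame
counts = seasons.value_counts().sort_index().to_frame()
counts.columns = ["number of races"]
# counts now has one row per season: 1950 -> 2, 1951 -> 3
```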
year = df['season'].value_counts().sort_index().to_frame()
year.columns = ["number of races"]
year.head(10)
f1_color = (255/255,24/255,1/255) # official formula 1 color
ax = year.plot.line(y='number of races', use_index=True, color = f1_color, figsize=(20,5))
ax.set_xlabel("Year");
ax.set_ylabel("Number of Races");
ax.set_title("Number of Races over Time");
After graphing this data, we can see that the number of races per year has been steadily increasing. In the first decade of the sport, there was an average of about 8 races per year, but recently that number has increased to 23 for 2021.
With cars racing at over 200 miles per hour, even the smallest mistake can lead to a fatal crash. Additionally, the cars have only grown more mechanically complex over time, so the number of parts that have to work together seamlessly has skyrocketed. Let's take a look at how cars have performed over time. We start by scraping data on the number of cars that 'Did not qualify' for a race, 'Did not finish' a race, had an 'Accident' in a race, and 'Finished' a race.
url = "http://ergast.com/api/f1/"

def count_status(codes):
    # sum the per-race result counts for the given status codes, grouped by year
    frames = []
    for code in codes:
        for yr in range(1950, 2021):
            dataset = url + str(yr) + "/status/" + str(code) + ".json?limit=200"
            subset = pd.DataFrame.from_dict(json.loads(requests.get(dataset).text)["MRData"]["StatusTable"]["Status"])
            subset['year'] = yr
            frames.append(subset)
    result = pd.concat(frames)
    result["count"] = pd.to_numeric(result["count"])
    return result.groupby(["year"]).sum().sort_index()

df_dnq = count_status([77, 81, 97])  # 'Did not qualify' status codes
df_fin = count_status([1])           # 'Finished'
df_acc = count_status([2, 3, 104])   # 'Accident'
df_dnf = count_status([31, 54])      # 'Did not finish'
df_dnq = df_dnq.rename(columns={'count': 'DNQ'})
df_fin = df_fin.rename(columns={'count': 'Finishes'})
df_acc = df_acc.rename(columns={'count': 'Accidents'})
df_dnf = df_dnf.rename(columns={'count': 'DNF'})
results = pd.concat([df_fin, df_dnf, df_acc, df_dnq], axis=1, sort=False)
results = results.fillna(0)
results.head()
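As a toy illustration of this index-aligned combine (synthetic numbers): concat with axis=1 lines rows up by index, leaving NaN where a year is missing from one frame, which fillna(0) then zeroes out.

```python
import pandas as pd

a = pd.DataFrame({"Finishes": [10, 12]}, index=[1950, 1951])
b = pd.DataFrame({"Accidents": [5]}, index=[1951])  # no 1950 entry

# rows align on the index union; the missing 1950 accident count becomes NaN,
# which fillna(0) replaces with 0
merged = pd.concat([a, b], axis=1, sort=False).fillna(0)
```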
After scraping all this data, we put all of the different dataframes into one so we can easily graph the data. We now have data for the number of finishes, accidents, DNFs, and DNQs for every year. Let's start by plotting all 4 datapoints as line graphs.
lx = results.plot.line(use_index=True, figsize=(20,10))
lx.set_xlabel("Year");
lx.set_ylabel("Number of Cars");
lx.set_title("Race Results");
We can easily see that the number of race finishes has steadily increased while the number of accidents is slowly decreasing. To further examine whether the sport has actually become safer, or whether this is just an artifact of having more races, let's make a layered area plot.
arx = results.plot.area(use_index=True, figsize=(20,10))
arx.set_xlabel("Year");
arx.set_ylabel("Number of Cars");
arx.set_title("Race Results");
From this we can observe that in F1's early years, a huge portion of race entries ended in accidents. More recently, even though the absolute number of accidents has not gone down significantly, its share relative to race finishes has been dramatically reduced. We also observe two large spikes in DNQs that correlate with major rule changes in Formula 1.
Extra Resource (can be used to generate more powerful graphs): https://seaborn.pydata.org/generated/seaborn.lineplot.html
With Formula 1 becoming more and more popular over the years, let's take a look at where Formula 1 Grands Prix have occurred around the world. We first start by scraping race data from our data source and extracting the latitude and longitude of each race location.
countries = pd.DataFrame()  # blank dataframe object
for yr in range(1950, 2021):
    dataset = url + str(yr) + "/circuits/" + ".json?limit=200"
    subset = pd.DataFrame.from_dict(json.loads(requests.get(dataset).text)["MRData"]["CircuitTable"]["Circuits"])
    for i, row in subset.iterrows():
        subset.loc[i, 'lat'] = row.Location["lat"]
        subset.loc[i, 'long'] = row.Location["long"]
        subset.loc[i, 'country'] = row.Location["country"]
    subset['year'] = yr
    subset = subset.drop(columns=["url", "Location"])  # drop returns a copy, so reassign
    countries = pd.concat([countries, subset])
countries = countries.reset_index()
countries = countries.drop(columns=["index"])
countries.head()
We now have a dataframe with location information for every F1 race; the latitude, longitude, and country were pulled out of the Location dicts inside the loop. Next we build a new dataframe counting how many times each circuit has appeared on the F1 calendar, then loop through both tables to attach the appropriate latitude and longitude values to every circuit.
circuit = countries['circuitId'].value_counts().to_frame()
circuit = circuit.reset_index()
circuit.columns = ["circuit", "count"]
circuit.head()
for i, cir in circuit.iterrows():
    for j, races in countries.iterrows():
        if cir.circuit == races.circuitId:
            circuit.loc[i, 'lat'] = races["lat"]
            circuit.loc[i, 'long'] = races["long"]
            circuit.loc[i, 'country'] = races["country"]
circuit.head()
We can then use folium to map this data so we can observe the spread of races around the world.
Extra Resource: https://python-visualization.github.io/folium/quickstart.html
world_map = folium.Map(location=[50, 0], zoom_start=4)
marker_cluster = MarkerCluster().add_to(world_map)
# for each circuit, create a CircleMarker sized by its race count
for i in range(len(circuit)):
    lat = circuit.iloc[i]['lat']
    long = circuit.iloc[i]['long']
    radius = circuit["count"][i].item()
    popup_text = """Country : {}<br> Number of Races : {}<br>"""
    popup_text = popup_text.format(circuit.iloc[i]['country'], circuit.iloc[i]['count'])
    folium.CircleMarker(location=[lat, long], radius=radius, popup=popup_text, fill=True).add_to(marker_cluster)
# show the map
world_map
def heat_map(m, df):
    # the API returns coordinates as strings, so cast to float for HeatMap
    arr = df[['lat', 'long']].astype(float).to_numpy()
    m.add_child(plugins.HeatMap(arr, radius=15))
    return m

def mark_points(m, df):
    for index, row in df.iterrows():
        folium.CircleMarker([float(row['lat']), float(row['long'])], radius=1).add_to(m)
    return m

m = heat_map(mark_points(folium.Map(location=[50, 0], zoom_start=2), countries), countries)
m
In conclusion, Formula 1 has gotten more popular, safer, and a lot more international over its 70-year history. Even though the number of races per year is steadily increasing, I would predict that the upward trend will not last much longer and the number of races will soon plateau: moving tons of equipment from one track to another in the couple of days between races is logistically difficult, so roughly 23 races is probably the maximum the calendar will reach. The sport has also become much safer, and I expect that to continue as safety standards improve. Lastly, I think the globalization of races will continue as F1 becomes more popular around the world.