Formula 1 is the highest level of international auto racing sanctioned by the FIA (Fédération Internationale de l'Automobile). F1 cars are open-cockpit, open-wheeled racing machines that can reach speeds of up to 230 miles per hour! Most drivers start racing go-karts at a very young age and slowly progress through F3 and F2. We will take a look at various F1 statistics and how the sport has changed over time. Through this project, we will examine whether the sport has become more popular and safer over time.
To visualize how the sport has progressed over time, we'll start by scraping race data dating back to F1's first season, 1950. We will be using http://ergast.com/mrd/ for our race data. This database hosts a free API that can return data in XML or JSON format.
import pandas as pd
import json
import requests
import numpy as np
import folium
from folium import plugins
from folium.plugins import MarkerCluster
from statsmodels.tsa.holtwinters import ExponentialSmoothing, SimpleExpSmoothing
url = "http://ergast.com/api/f1.json" # website api url
response = requests.get(url)
data = json.loads(response.text) # parsing the json data
df = pd.DataFrame.from_dict(data["MRData"]["RaceTable"]["Races"])
# after finding the relevant JSON path, we specify what data to convert into a pandas dataframe
This would be a perfectly fine API call, but the API caps how many records a single request returns, so to retrieve the full dataset we page through it using the `limit` and `offset` parameters.
total = int(data["MRData"]["total"])
limit = 100
offset = 0
df = pd.DataFrame()  # blank dataframe object
while offset < total:
    dataset = url + "?limit=" + str(limit) + "&offset=" + str(offset)
    offset = offset + limit
    subset = pd.DataFrame.from_dict(json.loads(requests.get(dataset).text)["MRData"]["RaceTable"]["Races"])
    df = pd.concat([df, subset])
df.head()
We can observe the first 5 rows of the dataframe by calling .head() on the dataframe object.
We then clean up the data by resetting the index values and dropping columns we don't need.
df = df.reset_index()
df = df.drop(columns=["index","url", "Circuit", "time"])
df.head()
Let's graph the number of races per year throughout F1 history. We have a dataframe with individual race data, but we do not have a count of how many races happened each year.
To do this, we can simply call .value_counts() on a particular column to count how many times each value repeats. This is returned as a pandas.Series object, which we can easily convert to a dataframe by calling .to_frame().
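As a toy illustration of this conversion (synthetic season labels, not the scraped data):

```python
import pandas as pd

# hypothetical season labels, standing in for df['season']
seasons = pd.Series(["1950", "1950", "1951", "1951", "1951"])

# value_counts() returns a Series; to_frame() turns it into a DataFrame
counts = seasons.value_counts().sort_index().to_frame()
counts.columns = ["number of races"]
# counts now has one row per season: 1950 -> 2, 1951 -> 3
```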
year = df['season'].value_counts().sort_index().to_frame()
year.columns = ["number of races"]
year.head(10)
f1_color = (255/255,24/255,1/255) # official formula 1 color
ax = year.plot.line(y='number of races', use_index=True, color = f1_color, figsize=(20,5))
ax.set_xlabel("Year");
ax.set_ylabel("Number of Races");
ax.set_title("Number of Races over Time");
After graphing this data, we can see that the number of races per year has been steadily increasing. In the first decade of the sport, there was an average of about 8 races per year, but recently that number has increased to 23 for 2021.
With cars racing at over 200 miles per hour, even the smallest mistake can lead to a fatal crash. Additionally, the cars have only grown more mechanically complex over time, so the number of parts that have to work together seamlessly has skyrocketed. Let's take a look at how cars have performed over time. We start by scraping data on the number of cars that 'Did not qualify' for a race, 'Did not finish' a race, had an 'Accident' in a race, and 'Finished' a race.
url = "http://ergast.com/api/f1/"

def count_status(codes):
    # sum the per-race result counts for the given status codes, grouped by year
    frames = []
    for code in codes:
        for yr in range(1950, 2021):
            dataset = url + str(yr) + "/status/" + str(code) + ".json?limit=200"
            subset = pd.DataFrame.from_dict(json.loads(requests.get(dataset).text)["MRData"]["StatusTable"]["Status"])
            subset['year'] = yr
            frames.append(subset)
    result = pd.concat(frames)
    result["count"] = pd.to_numeric(result["count"])
    return result.groupby(["year"]).sum().sort_index()

df_dnq = count_status([77, 81, 97])  # 'Did not qualify' status codes
df_fin = count_status([1])           # 'Finished'
df_acc = count_status([2, 3, 104])   # 'Accident'
df_dnf = count_status([31, 54])      # 'Did not finish'
df_dnq = df_dnq.rename(columns={'count': 'DNQ'})
df_fin = df_fin.rename(columns={'count': 'Finishes'})
df_acc = df_acc.rename(columns={'count': 'Accidents'})
df_dnf = df_dnf.rename(columns={'count': 'DNF'})
results = pd.concat([df_fin, df_dnf, df_acc, df_dnq], axis=1, sort=False)
results = results.fillna(0)
results.head()
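As a toy illustration of this index-aligned combine (synthetic numbers): concat with axis=1 lines rows up by index, leaving NaN where a year is missing from one frame, which fillna(0) then zeroes out.

```python
import pandas as pd

a = pd.DataFrame({"Finishes": [10, 12]}, index=[1950, 1951])
b = pd.DataFrame({"Accidents": [5]}, index=[1951])  # no 1950 entry

# rows align on the index union; the missing 1950 accident count becomes NaN,
# which fillna(0) replaces with 0
merged = pd.concat([a, b], axis=1, sort=False).fillna(0)
```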
After scraping all this data, we put all of the different dataframes into one so we can easily graph the data. We now have data for the number of finishes, accidents, DNFs, and DNQs for every year. Let's start by plotting all 4 datapoints as line graphs.
lx = results.plot.line(use_index=True, figsize=(20,10))
lx.set_xlabel("Year");
lx.set_ylabel("Number of Cars");
lx.set_title("Race Results");
We can easily see that the number of race finishes has steadily increased while the number of accidents is slowly decreasing. To further examine whether the sport has actually become safer, or whether this is just an artifact of having more races, let's make a layered area plot.
arx = results.plot.area(use_index=True, figsize=(20,10))
arx.set_xlabel("Year");
arx.set_ylabel("Number of Cars");
arx.set_title("Race Results");
From this we can observe that in F1's early years, a huge portion of race entries ended in accidents. More recently, even though the absolute number of accidents has not gone down significantly, its share relative to race finishes has been dramatically reduced. We also observe two large spikes in DNQs that correlate with major rule changes in Formula 1.
Extra Resource (can be used to generate more powerful graphs): https://seaborn.pydata.org/generated/seaborn.lineplot.html
With Formula 1 becoming more and more popular over the years, let's take a look at where Formula 1 Grands Prix have occurred around the world. We first start by scraping race data from our data source and extracting the latitude and longitude of each race location.
countries = pd.DataFrame()  # blank dataframe object
for yr in range(1950, 2021):
    dataset = url + str(yr) + "/circuits/" + ".json?limit=200"
    subset = pd.DataFrame.from_dict(json.loads(requests.get(dataset).text)["MRData"]["CircuitTable"]["Circuits"])
    for i, row in subset.iterrows():
        subset.loc[i, 'lat'] = row.Location["lat"]
        subset.loc[i, 'long'] = row.Location["long"]
        subset.loc[i, 'country'] = row.Location["country"]
    subset['year'] = yr
    subset = subset.drop(columns=["url", "Location"])  # drop returns a copy, so reassign
    countries = pd.concat([countries, subset])
countries = countries.reset_index()
countries = countries.drop(columns=["index"])
countries.head()
We now have a dataframe with location information for every F1 race; the latitude, longitude, and country were pulled out of the Location dicts inside the loop. Next we build a new dataframe counting how many times each circuit has appeared on the F1 calendar, then loop through both tables to attach the appropriate latitude and longitude values to every circuit.
circuit = countries['circuitId'].value_counts().to_frame()
circuit = circuit.reset_index()
circuit.columns = ["circuit", "count"]
circuit.head()
for i, cir in circuit.iterrows():
    for j, races in countries.iterrows():
        if cir.circuit == races.circuitId:
            circuit.loc[i, 'lat'] = races["lat"]
            circuit.loc[i, 'long'] = races["long"]
            circuit.loc[i, 'country'] = races["country"]
circuit.head()
We can then use folium to map this data so we can observe the spread of races around the world.
Extra Resource: https://python-visualization.github.io/folium/quickstart.html
world_map = folium.Map(location=[50, 0], zoom_start=4)
marker_cluster = MarkerCluster().add_to(world_map)
# for each circuit, create a CircleMarker sized by its race count
for i in range(len(circuit)):
    lat = circuit.iloc[i]['lat']
    long = circuit.iloc[i]['long']
    radius = circuit["count"][i].item()
    popup_text = """Country : {}<br> Number of Races : {}<br>"""
    popup_text = popup_text.format(circuit.iloc[i]['country'], circuit.iloc[i]['count'])
    folium.CircleMarker(location=[lat, long], radius=radius, popup=popup_text, fill=True).add_to(marker_cluster)
# show the map
world_map
def heat_map(m, df):
    # the API returns coordinates as strings, so cast to float for HeatMap
    arr = df[['lat', 'long']].astype(float).to_numpy()
    m.add_child(plugins.HeatMap(arr, radius=15))
    return m

def mark_points(m, df):
    for index, row in df.iterrows():
        folium.CircleMarker([float(row['lat']), float(row['long'])], radius=1).add_to(m)
    return m

m = heat_map(mark_points(folium.Map(location=[50, 0], zoom_start=2), countries), countries)
m
In conclusion, Formula 1 has gotten more popular, safer, and a lot more international over its 70-year history. Even though the number of races per year is steadily increasing, I would predict that the upward trend will not last much longer and the number of races will soon plateau: moving tons of equipment from one track to another in the couple of days between races is logistically difficult, so roughly 23 races is probably the maximum the calendar will reach. The sport has also become much safer, and I expect that to continue as safety standards improve. Lastly, I think the globalization of races will continue as F1 becomes more popular around the world.