import pandas as pd
import matplotlib.pyplot as plt
from pandasql import sqldf
import copy
import numpy as np
import scipy.stats
Olympics data with SQL and pandas- home games
What is the effect of a home Olympic games
Overview
This is part of a project that looks at an Olympics dataset.
In this part the hypothesis considered is:
At a home Olympic games a nation will on average obtain more medals than at other games
- But can we quantify this effect?
- Are there any residual effects before and after the games?
- What about a home continent games?
Methodology
To be able to answer the question, the steps taken are as follows. Note the data and tables created in a previous page.
- Create table of medal athletes as
athlete
Medals
Year
NOC
- Others?
Sex
,number_of_athletes
- Join
athlete
table tocountry
table to get the country added toathlete
Now the next step is trickier. There are several ways we could look at the effect of a home game and the nearness to it. The way I like to do this is to visualise the plot(s) I would like and work back from that.
The plot I’d like is: - x-axis = year from games (0 = at the games, +ve after and -ve before) - y-axis = number of medals
If we normalise the y-axis we can put all the events together to get an effect of games, before and after. We can also pick out individual games or do the averageing on different time-periods.
So to achieve this we need a vector for each games of: - Year from games - Number of medals
There are a number of potential problems with this: 1. If a nation holds two games close to each other 1. Normally games are every 4 years but how do we deal with exceptions to this? 1. What do we do when there is a lack of data before or after the games? e.g. games at the start of Olympics or current games, or if a nation stops partipation 1. When looking at the effect of continent, the approach would need to be adjusted based on there being fewer continents and most games being in Europe
can be solved by reducing the times to +- 20 years, and because we are averageing any exceptions should be covered
a way to solve this is to fit the data across the years/medals data we have with a function then use the function to give us values on a set scale (e.g. -20 to 20 in steps of 4 yrs)
a bit trickier, may we fit values that have positive and negative values up to 20 yrs first. Then if an event has a gap in years fill with the average values below and above 0 yrs
for this maybe just look at a partcular game based on recent games that are not in the same continent
Some Starting Code
=pd.read_csv('data/athlete_F_S')
df_F_S =pd.read_csv('data/athlete_F_W')
df_F_W=pd.read_csv('data/athlete_M_S')
df_M_S=pd.read_csv('data/athlete_M_W')
df_M_W
= pd.read_csv('data/all_athletes')
df_all_athletes= pd.read_csv('data/country')
df_country# df_event= pd.read_csv('data/event')
= pd.read_csv('data/games')
df_games# df_population= pd.read_csv('data/population')
= df_country.groupby('NOC').max()
df_country # df_country.head(10)
df_games.head()
Unnamed: 0 | games_id | City | Country | Year | Region | Summer | Winter | Latitude | Longitude | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | Athens | Greece | 1896 | Europe | 1 | 0 | 37.9838 | 23.7275 |
1 | 1 | 1 | Paris | France | 1900 | Europe | 1 | 0 | 48.8566 | 2.3522 |
2 | 2 | 2 | St. Louis | USA | 1904 | North America | 1 | 0 | 38.6270 | 90.1994 |
3 | 3 | 3 | Athens | Greece | 1906 | Europe | 1 | 0 | 37.9838 | 23.7275 |
4 | 4 | 4 | London | UK | 1908 | Europe | 1 | 0 | 51.5072 | 0.1276 |
Create a table of Country & Medals
This first step is to create a table of athletes data that includes both women and men athletes from the summer games, we want the following columns: - The Olympic Year - The Nation - The Continent of the Nation - The number of medals the nation got that year
Basically just UNION
to join the male and female athletes- with some additions so we get the total number of medals per country
Followed by a JOIN
on the country
table to get more information about the countries.
def number_of_medals(df_F_S,df_M_S):
=sqldf('\
testa2 SELECT \
COUNT(*) AS number_of_medals, \
Year, NOC \
FROM \
(SELECT NOC,Year,COUNT(*) AS counta \
FROM \
(SELECT \
athlete_ID, \
event_ID, \
Medal_Gold,Medal_Silver,Medal_Bronze, \
Year, \
NOC \
from df_F_S \
WHERE Medal_Gold=1 OR Medal_Silver=1 OR Medal_Bronze=1 \
UNION ALL \
SELECT \
athlete_ID, \
event_ID, \
Medal_Gold,Medal_Silver,Medal_Bronze, \
Year, \
NOC \
from df_M_S \
WHERE Medal_Gold=1 OR Medal_Silver=1 OR Medal_Bronze=1 \
order by Year asc) A \
GROUP BY \
Year, NOC,event_id,Medal_Gold,Medal_Silver,Medal_Bronze) AS B\
GROUP BY Year, NOC \
;',locals() )
return testa2
def join_country(df_,df_country):
=sqldf('\
testa2 SELECT \
c.NOC,c.Nation,c.Continent, \
a.number_of_medals, \
a.Year \
FROM \
df_ AS a \
INNER JOIN \
df_country AS c \
ON \
c.NOC=a.NOC \
GROUP BY \
c.NOC,a.Year \
;',locals() )
return testa2
=number_of_medals(df_F_S,df_M_S)
df_medals=join_country(df_medals,df_country)
df_medals2 df_medals2.head()
NOC | Nation | Continent | number_of_medals | Year | |
---|---|---|---|---|---|
0 | AFG | Afghanistan | Asia | 1 | 2008 |
1 | AFG | Afghanistan | Asia | 1 | 2012 |
2 | ALG | Algeria | Africa | 2 | 1984 |
3 | ALG | Algeria | Africa | 2 | 1992 |
4 | ALG | Algeria | Africa | 3 | 1996 |
Find the change in medals around a home games
The easy solution here would be to take an average of the medals a country obtains obtains at a home games versus non-home games. However, this wouldn’t work mainly due to the changes in medals with time (they increase).
We could correct for overall changes in the number of medals by normalising based on the total number of medals in a games. This would be better, but we may need an extra step to account for the lack of diversity of nations in earlier years too.
But also we wouldn’t be able to see what effect a home games had before and after the games were held.
To observe how medal count is effected before and after a games, and to somewhat reduce the increase in medals with time, the following steps are taken.
- Scoll through the
games
table for each summer games - For each game add data to the following variables (of length of the number of games) for each particular games
nation
the nation hosting the particular gamesmedals
the number of medals for each games for the nation hosting the particular games, in all the other gamesyears
the difference in years between the particular games and all other games
For example:
USA 1984, Los Angeles
nation[20]
is ‘USA’years[20]
is[-88, -84, -80, -78, -76, -72, -64, -60, -56, -52, -48, -36, -32,-28, -24, -20, -16, -12, -8, 0, 4, 8, 12, 16, 20, 24, 28, 32]
the 0 being the 1984 games, and the 32 being the 2016 games
medals[20]
is[ 19, 54, 230, 23, 46, 63, 95, 99, 56, 110, 57, 84, 76, 74, 71, 90, 107, 94, 94, 173, 94, 108, 101, 91, 101, 110, 103, 121]
the 173 being the number of medals obtained in 1984 and 121 in 2016
- Create a vector for each games of medals against years around the home game.
- Since not all games are seperated by 4 years and some games can be missed the data needs to be interpolated onto a range in steps of 4 years. Outer bounds of +-28 years ae chosen
- When data is missing because the full range of data can’t be obtained (i.e. for recent games or ones near the start of Olympics) replace any missing values with 0
- Add up all the vectors for different games
- ignore games where the range isn’t full (-32->32)
- this gives the average effect of a home games
# Step 1 and 2
# Since we just want the summer games
=df_games[df_games.Summer==1].reset_index()
df_games
=[]
medals=[]
years=[]
nationfor i in range(len(df_games)):
=df_games.loc[i,'Country']
country=df_games.loc[i,'Year']
year
nation.append(country)==country].number_of_medals.values)
medals.append(df_medals2[df_medals2.Nation==country].Year.values - year)
years.append(df_medals2[df_medals2.Nation
"""
Step 3
Function to interpolate number of medals in years around a games
Takes as input x=years and y=number of medals and outputs a new x and y values
that have been interpolated between -28 to +28 years in steps of 4 years
For games where the full range can't be obtained (e.g. an event close to 2016
will have missing data for years after it) the data is interpolated to the nearest
4 years and missing data replaced with zeroes.
"""
def do_interp(x,y):
from scipy.interpolate import interp1d
=x,y
xx,yy=28,-28
xhi,xlo
=np.arange(xlo,xhi+4,4)
xnew=np.zeros(np.shape(xnew))
ynew# here make adjustments if the whole range doesn't exist to go to nearest 4 yrs
# or if inbetween go outside to next one i.e. 11 years->12 years, 8->8,-5->-8
# lower years are dealt with separately to later years then combined
# and normalised
=np.max(x),np.min(x)
maxx,minxif maxx<xhi:
= 4*(np.ceil(maxx/4))
xhi=xx[1:]
xx=yy[1:]
yyif minx>xlo:
= -4*(np.ceil(abs(minx)/4))
xlo=xx[:-1]
xx=yy[:-1]
yy
=((xnew<=xhi) & (xnew>=xlo))
cond try:
= interp1d(x, y, kind='cubic')
f2 =f2(xnew[cond])
ynew[cond]except:
= interp1d(xx, yy, kind='cubic')
f2 =((xnew<=xhi-4) & (xnew>=xlo+4))
cond =f2(xnew[cond])
ynew[cond]
=ynew/max(ynew)
ynewreturn xnew,ynew
# Step 3
=[]
XYfor i,year_range in enumerate(years):
=year_range
x=medals[i]
y=do_interp(x,y)
xnew,ynew
XY.append(np.array(ynew))
np.shape(XY)
(29, 15)
# If values are less than 0 put as just above zero- can occur when
# make XY a numpy array
= np.array(XY)
XY
# if interpolation is not good
<0]=0
XY[XY
# create vectors for y-data either side of home-event
= np.zeros((8))
yallL = np.zeros((8))
yallR
# scroll through each games
# find if the low years or high years have zeros in them
# if they don't include them in the sum
for i,ygames in enumerate(XY):
if i>11:
if 0 not in ygames[-8:]:
-8:]=yallL[-8:]+ygames[-8:]
yallL[
if 0 not in ygames[0:8]:
0:8]=yallR[0:8]+ygames[0:8]
yallR[
# normalise
=yallL/max(yallL)
yallL=yallR/max(yallR)
yallR
# combine low and high years and only use home game year once
=np.concatenate([yallR[0:-1],yallL])
yall
# plot the results
=(6,4))
plt.subplots(figsize/max(yall),'ok-')
plt.plot(xnew,yall
0 ,1.05])
plt.ylim([True)
plt.grid('Normalise medals won',fontsize=14)
plt.ylabel('Years either side of a home games',fontsize=14)
plt.xlabel(# plt.plot(xnew,yall2/max(yall2),'m+--')
Text(0.5, 0, 'Years either side of a home games')
Plot the effect of home games for different games
Look at the individual games plots
=11
startnum
=['mo--','gs:','bv-.']
colafor i2 in range(6):
=plt.subplots()
fig,ax= np.arange(startnum,startnum+3,1,dtype=int)
vals =0
i1for i in vals:
plt.plot(xnew,XY[i,:],cola[i1])-.2 ,1])
plt.ylim([True)
plt.grid(+ ' ' + str(df_games.loc[ii,'Year']) for jj,ii in enumerate(vals)]);
plt.legend([ nation[ii] 'Normalise medals won')
plt.ylabel('Years either side of a home games')
plt.xlabel(=i1+1
i1=startnum+3 startnum
Continent home games
The Olympics have mainly been held in Europe and North America. Because of this it is not possible to do the same analysis as for home country games.
Instead we can look at the difference if a games is a home continent or not.
=sqldf('\
continentYorN SELECT \
home_continent, \
Year, \
COUNT(*) AS num_countries, \
AVG(avg_medals) AS avg_medals \
FROM \
(SELECT \
Nation,home_continent, Year, \
AVG(number_of_medals) AS avg_medals\
FROM \
(SELECT \
a.*, \
CASE WHEN \
c.Region =a.Continent THEN 1 \
ELSE 0 END AS home_continent \
FROM \
df_medals2 AS a \
INNER JOIN \
df_games AS c \
ON \
c.Year=a.Year \
WHERE c.Year> 1951) as inner \
GROUP BY \
Nation, Year ) as midder \
GROUP BY home_continent,Year \
;',locals() )
continentYorN.head()
home_continent | Year | num_countries | avg_medals | |
---|---|---|---|---|
0 | 0 | 1952 | 20 | 7.450000 |
1 | 0 | 1956 | 36 | 11.944444 |
2 | 0 | 1960 | 21 | 7.095238 |
3 | 0 | 1964 | 34 | 13.470588 |
4 | 0 | 1968 | 35 | 11.285714 |
=plt.subplots(figsize=(8,5))
fig,ax
==1].Year,continentYorN[continentYorN.home_continent==1].avg_medals,'rs--')
plt.plot(continentYorN[continentYorN.home_continent==0].Year,continentYorN[continentYorN.home_continent==0].avg_medals,'ob--')
plt.plot(continentYorN[continentYorN.home_continent
True)
plt.grid('Average medals')
plt.ylabel('Games year')
plt.xlabel(
# plt.plot([df_games[df_games.Region=='Asia'].Year.values,df_games[df_games.Region=='Asia'].Year.values],[0,35],color=[1,.8,.8]);
0,0],'m--');
plt.plot([0,0],'g-.');
plt.plot([
=='Americas'].Year.values,df_games[df_games.Region=='Americas'].Year.values],[0,35],'m--');
plt.plot([df_games[df_games.Region
=='Europe'].Year.values,df_games[df_games.Region=='Europe'].Year.values],[0,35],'g-.');
plt.plot([df_games[df_games.Region
# plt.plot([df_games[df_games.Region=='Oceania'].Year.values,df_games[df_games.Region=='Oceania'].Year.values],[0,35],color=[0.7,1,.7]);
1950,2020])
plt.xlim([
'Home continent','Another continent','Americas','Europe']) plt.legend([
<matplotlib.legend.Legend at 0x1f4339ed580>
- Not a good measure
- Instead need following columns
- total or average medals per continent
- Home Continent
- Year
"""
A check on the inner part of the SQL statement
"""
'\
sqldf( SELECT * FROM \
(SELECT \
a.*, \
CASE WHEN \
c.Region =a.Continent THEN 1 \
ELSE 0 END AS home_continent \
FROM \
df_medals2 AS a \
INNER JOIN \
df_games AS c \
ON \
c.Year=a.Year \
WHERE c.Year> 1951) as inner \
;',locals() )
NOC | Nation | Continent | number_of_medals | Year | home_continent | |
---|---|---|---|---|---|---|
0 | ANZ | Australia | Oceania | 11 | 1952 | 0 |
1 | ARG | Argentina | Americas | 5 | 1952 | 0 |
2 | AUT | Austria | Europe | 2 | 1952 | 1 |
3 | BEL | Belgium | Europe | 4 | 1952 | 1 |
4 | BOH | Czech Republic | Europe | 13 | 1952 | 1 |
... | ... | ... | ... | ... | ... | ... |
960 | UKR | Ukraine | Europe | 11 | 2016 | 0 |
961 | USA | USA | Americas | 121 | 2016 | 0 |
962 | UZB | Uzbekistan | Asia | 13 | 2016 | 0 |
963 | VEN | Venezuela | Americas | 3 | 2016 | 0 |
964 | VIE | Vietnam | Asia | 2 | 2016 | 0 |
965 rows × 6 columns
"""
The complete SQL statement
"""
=sqldf('\
continentYorN2 SELECT \
Year, sum(number_of_medals) AS num_medals, Continent,home_continent\
FROM \
(SELECT \
a.*, \
CASE WHEN \
c.Region =a.Continent THEN 1 \
ELSE 0 END AS home_continent \
FROM \
df_medals2 AS a \
INNER JOIN \
df_games AS c \
ON \
c.Year=a.Year \
WHERE c.Year> 1951) as inner \
GROUP BY Continent,Year, home_continent\
ORDER BY Year\
;',locals() )
continentYorN2.head()
Year | num_medals | Continent | home_continent | |
---|---|---|---|---|
0 | 1952 | 11 | Africa | 0 |
1 | 1952 | 100 | Americas | 0 |
2 | 1952 | 24 | Asia | 0 |
3 | 1952 | 308 | Europe | 1 |
4 | 1952 | 14 | Oceania | 0 |
=(8,5))
plt.subplots(figsize='Asia'
region=((continentYorN2.Continent==region)&(continentYorN2.home_continent==1))
cond'or',markersize=10)
plt.plot(continentYorN2[cond].Year,continentYorN2[cond].num_medals,=((continentYorN2.Continent==region)&(continentYorN2.home_continent==0))
cond':or')
plt.plot(continentYorN2[cond].Year,continentYorN2[cond].num_medals,
='Europe'
region=((continentYorN2.Continent==region)&(continentYorN2.home_continent==1))
cond'bo',markersize=10)
plt.plot(continentYorN2[cond].Year,continentYorN2[cond].num_medals,=((continentYorN2.Continent==region)&(continentYorN2.home_continent==0))
cond'b--o')
plt.plot(continentYorN2[cond].Year,continentYorN2[cond].num_medals,
='Americas'
region=((continentYorN2.Continent==region)&(continentYorN2.home_continent==1))
cond'm<',markersize=10)
plt.plot(continentYorN2[cond].Year,continentYorN2[cond].num_medals,=((continentYorN2.Continent==region)&(continentYorN2.home_continent==0))
cond'm-o')
plt.plot(continentYorN2[cond].Year,continentYorN2[cond].num_medals,
='Oceania'
region=((continentYorN2.Continent==region)&(continentYorN2.home_continent==1))
cond'cv',markersize=10)
plt.plot(continentYorN2[cond].Year,continentYorN2[cond].num_medals,=((continentYorN2.Continent==region)&(continentYorN2.home_continent==0))
cond'c-o')
plt.plot(continentYorN2[cond].Year,continentYorN2[cond].num_medals,
'Asia','','Europe','','Americas','','Oceania',''])
plt.legend([True) plt.grid(
='Europe'
region=((continentYorN2.Continent==region)&(continentYorN2.home_continent==1))
cond=np.average(continentYorN2[cond].num_medals)
asum=((continentYorN2.Continent==region)&(continentYorN2.home_continent==0))
cond=np.average(continentYorN2[cond].num_medals)
bsum
print('For {} the average medals at home continent = {:.0f} and for away continent = {:.0f}\n \
With a ratio of {:.2f}'.format(region,asum,bsum,asum/bsum))
='Asia'
region=((continentYorN2.Continent==region)&(continentYorN2.home_continent==1))
cond=np.average(continentYorN2[cond].num_medals)
asum=((continentYorN2.Continent==region)&(continentYorN2.home_continent==0))
cond=np.average(continentYorN2[cond].num_medals)
bsum
print('For {} the average medals at home continent = {:.0f} and for away continent = {:.0f}\n \
With a ratio of {:.2f}'.format(region,asum,bsum,asum/bsum))
='Americas'
region=((continentYorN2.Continent==region)&(continentYorN2.home_continent==1))
cond=np.average(continentYorN2[cond].num_medals)
asum=((continentYorN2.Continent==region)&(continentYorN2.home_continent==0))
cond=np.average(continentYorN2[cond].num_medals)
bsum
print('For {} the average medals at home continent = {:.0f} and for away continent = {:.0f}\n \
With a ratio of {:.2f}'.format(region,asum,bsum,asum/bsum))
='Oceania'
region=((continentYorN2.Continent==region)&(continentYorN2.home_continent==1))
cond=np.average(continentYorN2[cond].num_medals)
asum=((continentYorN2.Continent==region)&(continentYorN2.home_continent==0))
cond=np.average(continentYorN2[cond].num_medals)
bsum
print('For {} the average medals at home continent = {:.0f} and for away continent = {:.0f}\n \
With a ratio of {:.2f}'.format(region,asum,bsum,asum/bsum))
For Europe the average medals at home continent = 420 and for away continent = 396
With a ratio of 1.06
For Asia the average medals at home continent = 121 and for away continent = 103
With a ratio of 1.18
For Americas the average medals at home continent = 177 and for away continent = 130
With a ratio of 1.36
For Oceania the average medals at home continent = 50 and for away continent = 32
With a ratio of 1.57
Again the above stats are not really great as
- don’t take account of changes in the average number of medals with time for a nation
- games in Oceania and Americas are often also home games
The best metric uses Europe and suggests a less than 6% increase for a home continent