import pandas as pd
import matplotlib.pyplot as plt
from pandasql import sqldf
import copy
import numpy as np
import scipy.stats
Olympics data with SQL and pandas- GDP and population
What is the effect of GDP and population on medals
Introduction
Due to the global importance of the Olympics, in 2020 there was a broadcast audience of more than 3 billion, I was interested to explore whether countries with the most medals will reflect global politics. And to see if the countries with most influence get more medals.
There are two relatively easy to obtain metrics that can be used to define the importance of a nation internationally, GDP and population.
Gross domestic product (GDP) is a monetary measure of the market value of all the final goods and services produced in a specific time period by countries. GDP is often used as a metric for international comparisons as well as a broad measure of economic progress. It is often considered to be the “world’s most powerful statistical indicator of national development and progress”. GDP Wikipedia.
The population of a country is an important parameter in assessing the global importance of a nation. Furthermore, the more people in a country the greater the pool of potential athletes.
In the final part of this page I will briefly look at how the Cold War was reflected in the Olympics data.
Pre-analysis
Get medals won by country and year
What we want to end up with is a new table with- - Nation - Number of medals - Event or year of the event - Then the details of nations we’re studying: - GDP - Population - Continent
To get a unique medal count for this, we can do a count by grouping on - Nation and Year But also on: - Medal type (Gold, Silver, Bronze) - Event id As we only know if a medal is unique for an athlete if the same medal type does not exist for another athlete in the same event at the same games (year). This is to avoid duplication for team sports. If it is a mixed event then we would need to do the grouping on data of male and female athletes.
From inside out
- Use
UNION
to join the males and female summer athletes- Get NOC, medal types, year, event_id
- Use
GROUP BY
to get unique medals andCOUNT
- Use
GROUP BY
to get medals for countries in each event/year,SUM
so we are adding up events
=pd.read_csv('data/athlete_F_S')
df_F_S # df_F_W=pd.read_csv('data/athlete_F_W')
=pd.read_csv('data/athlete_M_S')
df_M_S# df_M_W=pd.read_csv('data/athlete_M_W')
= pd.read_csv('data/all_athletes')
df_all_athletes= pd.read_csv('data/country')
df_country= pd.read_csv('data/event')
df_event# df_games= pd.read_csv('data/games')
# df_population= pd.read_csv('data/population')
= df_country.groupby('NOC').max()
df_country 10) df_country.head(
Unnamed: 0 | Nation | Continent | Population | GDP | |
---|---|---|---|---|---|
NOC | |||||
AFG | 0 | Afghanistan | Asia | 32890171 | 19807 |
ALB | 1 | Albania | Europe | 2829741 | 14800 |
ALG | 2 | Algeria | Africa | 45400000 | 145164 |
AND | 3 | Andorra | Europe | 79535 | 3155 |
ANG | 4 | Angola | Africa | 33086278 | 62307 |
ANT | 5 | Antigua | Americas | 99337 | 1415 |
ANZ | 7 | Australia | Oceania | 25921518 | 1330901 |
ARG | 8 | Argentina | Americas | 47327407 | 383067 |
ARM | 9 | Armenia | Asia | 2963900 | 12645 |
AUT | 10 | Austria | Europe | 9027999 | 428965 |
=\
df_by_year'SELECT \
sqldf( NOC, \
Year, \
SUM(counta) AS num_medals \
FROM(SELECT \
NOC, \
Year, \
event_id, \
Medal_Bronze, Medal_Gold, Medal_Silver, \
COUNT(*) as counta \
FROM(SELECT \
NOC, \
Year, \
event_id, \
Medal_Bronze, Medal_Gold, Medal_Silver \
FROM \
df_F_S \
WHERE Medal_Bronze=1 OR Medal_Gold=1 OR Medal_Silver \
UNION \
SELECT \
NOC, \
Year, \
event_id, \
Medal_Bronze, Medal_Gold, Medal_Silver \
FROM \
df_M_S \
WHERE Medal_Bronze=1 OR Medal_Gold=1 OR Medal_Silver) as MF \
GROUP BY \
NOC, Year, event_id, Medal_Bronze, Medal_Gold, Medal_Silver\
ORDER BY \
NOC) as inner \
GROUP BY \
NOC, Year;',locals())
==2016].sort_values('num_medals',ascending=False).head(10) df_by_year[df_by_year.Year
NOC | Year | num_medals | |
---|---|---|---|
1246 | USA | 2016 | 121 |
246 | CHN | 2016 | 70 |
527 | GBR | 2016 | 67 |
415 | EUN | 2016 | 56 |
471 | FRA | 2016 | 42 |
497 | FRG | 2016 | 42 |
706 | JPN | 2016 | 41 |
38 | ANZ | 2016 | 29 |
669 | ITA | 2016 | 28 |
230 | CAN | 2016 | 22 |
Join with the country table
Now just join to the country table
=\
df_'SELECT \
sqldf( NOC, \
Nation, \
Continent, \
SUM(num_medals) AS number_of_medals, \
Population, \
GDP \
FROM( \
SELECT * \
FROM \
df_country AS c \
INNER JOIN \
df_by_year AS d \
ON \
c.NOC=d.NOC \
WHERE \
d.Year>=2008) \
GROUP BY \
NOC,Continent \
ORDER BY \
num_medals desc;',locals())
df_
NOC | Nation | Continent | number_of_medals | Population | GDP | |
---|---|---|---|---|---|---|
0 | USA | USA | Americas | 334 | 332906919 | 20936600 |
1 | CHN | China | Asia | 259 | 1412600000 | 14722731 |
2 | EUN | Russia | Europe | 210 | 147190000 | 1483498 |
3 | GBR | UK | Europe | 180 | 67081234 | 2707744 |
4 | ANZ | Australia | Oceania | 110 | 25921518 | 1330901 |
... | ... | ... | ... | ... | ... | ... |
99 | TUN | Tunisia | Africa | 7 | 11746695 | 39236 |
100 | UAE | United Arab Emirates | Asia | 1 | 9282410 | 421142 |
101 | UGA | Uganda | Africa | 1 | 42885900 | 37372 |
102 | VEN | Venezuela | Americas | 5 | 28705000 | 482359 |
103 | VIE | Vietnam | Asia | 3 | 98505400 | 271158 |
104 rows × 6 columns
Correlation in Population and GDP
First create a function to display correlations: - produce scatter plots - find which countries are poor correlation - find some correlation stats
def modname(string):
=''.join([string[0].upper(),string[1:].lower()])
string=string.replace('_',' ')
stringreturn string
def doFigAddOns(nomX,nomY,fontsize):
=fontsize)
plt.ylabel(modname(nomY), fontsize=fontsize)
plt.xlabel(nomX, fontsizeTrue)
plt.grid(
def scatter_combo(df_temp,xval,yval):
=df_temp.set_index('Nation')
df_temp=14
fontsize
=copy.copy(df_temp)
df_temp# remove 0 values
=df_temp[df_temp.loc[:,xval]!=0]
df_temp
# drop NaNs
=df_temp.dropna()
df_temp
# values in millions
=df_temp.loc[:,xval]/1e6
df_temp.loc[:,xval]
# the names of the columns
=xval#(df_temp.columns[xval])
nomX=yval#modname(df_temp.columns[yval])
nomY
# define the x and y values
=df_temp.loc[:,xval]
X=df_temp.loc[:,yval]
Y
# do some plots
=plt.subplots(figsize=(7,15))
fig,ax'font.size'] = '13'
plt.rcParams[
# Fig A
=plt.subplot(3,1,1)
ax1'+b',markeredgewidth=2,ms=10)
ax1.plot(X ,Y,=np.polyfit(X,Y,1)
m,b=X*m+b
Ypred'--k');
plt.plot(X,Ypred,
doFigAddOns(nomX,nomY,fontsize)
# Fig B
=plt.subplot(3,1,2)
ax3'+b',markeredgewidth=2,ms=10)
ax3.plot(X ,Y,=X.sort_values()
X=X*m+b
Ypred'--k');
plt.plot(X,Ypred, 'log')
plt.yscale('log')
plt.xscale(
doFigAddOns(nomX,nomY,fontsize)
# Fig C
= df_temp.loc[:,yval]-Ypred
diffa =plt.subplot(3,1,3)
ax2'xk',markeredgewidth=2,ms=10);
ax2.plot(X,diffa, 'log')
plt.xscale('Difference between \nlinear fit',fontsize)
doFigAddOns(nomX,
=0.35)
plt.subplots_adjust(hspace
# because sorted x, need to get original x back
=df_temp.loc[:,xval]
X
# Get some correlation values, and what are worst countries
=scipy.stats.pearsonr(X, Y)
corrprint('Correlation of = {:.2f} ({:.0e}) \n \
And m= {:.3f} b={:.1f} '.format(corr[0],corr[1],m,b))
=(diffa).sort_values(ascending=False).index[0:10]
country_hi_corrprint('Countries worst correlation where overachive',diffa[country_hi_corr])
=(diffa).sort_values().index[0:10]
country_low_corrprint('Countries worst correlation where unachieve',diffa[country_low_corr])
={'Underachieve':country_low_corr,'Overachieve':country_hi_corr}
dreturn pd.DataFrame(data=d)
def getCorr(df_temp, xval,yval):
=copy.copy(df_temp)
df_temp# remove 0 values
=df_temp[df_temp.loc[:,xval]!=0]
df_temp
# drop NaNs
=df_temp.dropna()
df_temp
# define the x and y values
=df_temp.loc[:,xval]
X=df_temp.loc[:,yval]
Y
# Get some correlation values, and what are worst countries
=scipy.stats.pearsonr(X, Y)
corrprint('Pearson Correlation of = {:.2f} ({:.0e})'.format(corr[0],corr[1]))
# Get some correlation values, and what are worst countries
=scipy.stats.spearmanr(X, Y)
corrprint('Spearman Correlation of = {:.2f} ({:.0e})'.format(corr[0],corr[1]))
=scatter_combo(df_,'GDP','number_of_medals')
popCorr popCorr
Correlation of = 0.84 (4e-29)
And m= 17.195 b=14.2
Countries worst correlation where overachive Nation
Russia 170.270535
UK 119.219670
Australia 72.894435
France 59.020671
Germany 47.334168
Ukraine 41.104002
South Korea 37.742411
Italy 36.341876
Cuba 33.005895
Kazakhstan 26.858922
dtype: float64
Countries worst correlation where unachieve Nation
India -48.322884
USA -40.224834
Saudi Arabia -25.259275
Indonesia -22.420333
United Arab Emirates -20.462293
Philippines -19.436562
Israel -18.132356
Mexico -17.725355
Austria -17.596809
Chile -17.570066
dtype: float64
Underachieve | Overachieve | |
---|---|---|
0 | India | Russia |
1 | USA | UK |
2 | Saudi Arabia | Australia |
3 | Indonesia | France |
4 | United Arab Emirates | Germany |
5 | Philippines | Ukraine |
6 | Israel | South Korea |
7 | Mexico | Italy |
8 | Austria | Cuba |
9 | Chile | Kazakhstan |
=scatter_combo(df_,'Population','number_of_medals')
GDPCorr GDPCorr
Correlation of = 0.42 (7e-06)
And m= 0.113 b=20.6
Countries worst correlation where overachive Nation
USA 275.692534
Russia 172.705692
UK 151.769692
Germany 96.889859
France 89.679993
Australia 86.426756
China 78.529401
Japan 69.159606
Italy 55.694604
South Korea 53.504928
dtype: float64
Countries worst correlation where unachieve Nation
India -165.076115
Indonesia -41.444198
Nigeria -40.164409
Philippines -32.332446
Vietnam -28.785832
Egypt -26.362309
Sudan -24.687085
Uganda -24.492699
Saudi Arabia -23.601957
Cameroon -22.395235
dtype: float64
Underachieve | Overachieve | |
---|---|---|
0 | India | USA |
1 | Indonesia | Russia |
2 | Nigeria | UK |
3 | Philippines | Germany |
4 | Vietnam | France |
5 | Egypt | Australia |
6 | Sudan | China |
7 | Uganda | Japan |
8 | Saudi Arabia | Italy |
9 | Cameroon | South Korea |
'Continent']=='Europe'],'Population','number_of_medals') scatter_combo(df_[df_[
Correlation of = 0.93 (2e-16)
And m= 1.498 b=5.2
Countries worst correlation where overachive Nation
UK 74.302826
Hungary 23.300286
Netherlands 23.269039
Belarus 19.808781
Denmark 17.002417
France 11.114933
Croatia 9.991807
Czech Republic 5.060035
Slovenia 4.658317
Lithuania 4.630427
dtype: float64
Countries worst correlation where unachieve Nation
Poland -31.142356
Spain -24.255647
Portugal -16.682378
Russia -15.733456
Austria -14.709259
Romania -12.930475
Belgium -11.649525
Italy -10.448382
Greece -8.813758
Ukraine -8.812069
dtype: float64
Underachieve | Overachieve | |
---|---|---|
0 | Poland | UK |
1 | Spain | Hungary |
2 | Portugal | Netherlands |
3 | Russia | Belarus |
4 | Austria | Denmark |
5 | Romania | France |
6 | Belgium | Croatia |
7 | Italy | Czech Republic |
8 | Greece | Slovenia |
9 | Ukraine | Lithuania |
'Continent']!='Europe'],'Population','number_of_medals') scatter_combo(df_[df_[
Correlation of = 0.49 (2e-05)
And m= 0.109 b=14.0
Countries worst correlation where overachive Nation
USA 283.666410
Australia 93.134297
China 90.957076
Japan 76.277923
South Korea 60.318993
Canada 40.727568
Cuba 33.748715
Kazakhstan 27.859642
New Zealand 25.400672
Azerbaijan 19.852149
dtype: float64
Countries worst correlation where unachieve Nation
India -152.808654
Indonesia -33.720543
Nigeria -32.669700
Philippines -25.269105
Vietnam -21.778878
Egypt -19.334337
Sudan -17.902477
Uganda -17.715178
Saudi Arabia -16.856911
Cameroon -15.694183
dtype: float64
Underachieve | Overachieve | |
---|---|---|
0 | India | USA |
1 | Indonesia | Australia |
2 | Nigeria | China |
3 | Philippines | Japan |
4 | Vietnam | South Korea |
5 | Egypt | Canada |
6 | Sudan | Cuba |
7 | Uganda | Kazakhstan |
8 | Saudi Arabia | New Zealand |
9 | Cameroon | Azerbaijan |
'Continent']=='Europe'],'GDP','number_of_medals') scatter_combo(df_[df_[
Correlation of = 0.80 (6e-09)
And m= 45.027 b=10.1
Countries worst correlation where overachive Nation
Russia 133.114656
UK 47.990590
Ukraine 40.906665
Belarus 26.198814
Hungary 25.932286
Croatia 8.392025
Czech Republic 4.946635
Denmark 4.919196
Netherlands 3.836572
Serbia 2.527421
dtype: float64
Countries worst correlation where unachieve Nation
Germany -54.463232
Switzerland -25.766716
Austria -25.402938
Belgium -22.291826
Ireland -18.937224
Portugal -16.500704
Spain -15.776437
Finland -14.300792
Italy -12.028820
Sweden -10.294891
dtype: float64
Underachieve | Overachieve | |
---|---|---|
0 | Germany | Russia |
1 | Switzerland | UK |
2 | Austria | Ukraine |
3 | Belgium | Belarus |
4 | Ireland | Hungary |
5 | Portugal | Croatia |
6 | Spain | Czech Republic |
7 | Finland | Denmark |
8 | Italy | Netherlands |
9 | Sweden | Serbia |
'Continent']!='Europe'],'GDP','number_of_medals') scatter_combo(df_[df_[
Correlation of = 0.94 (2e-33)
And m= 16.234 b=8.7
Countries worst correlation where overachive Nation
Australia 79.660038
South Korea 44.796081
Cuba 38.591086
Kazakhstan 32.508245
Kenya 28.660696
New Zealand 27.815933
Azerbaijan 25.573605
Canada 23.586944
Jamaica 23.041050
Brazil 19.812142
dtype: float64
Countries worst correlation where unachieve Nation
India -40.315038
Saudi Arabia -19.100124
Indonesia -15.916697
USA -14.609781
United Arab Emirates -14.571357
Philippines -13.602977
Israel -12.259868
Chile -11.840843
Venezuela -11.565125
Mexico -11.204664
dtype: float64
Underachieve | Overachieve | |
---|---|---|
0 | India | Australia |
1 | Saudi Arabia | South Korea |
2 | Indonesia | Cuba |
3 | USA | Kazakhstan |
4 | United Arab Emirates | Kenya |
5 | Philippines | New Zealand |
6 | Israel | Azerbaijan |
7 | Chile | Canada |
8 | Venezuela | Jamaica |
9 | Mexico | Brazil |
=copy.copy(df_[['Nation','Continent','GDP']])
df_GDP
= df_GDP.sort_values('GDP',ascending=False)
df_GDP
df_GDP= df_GDP.head(52).index
rich_list # # index_not_rich=[i for i,country in enumerate(df_.index) if (country not in rich_list) and (df_.iloc[i,2]!='Europe') and (df_.index[i]!='India')]
# # index_rich=[i for i,country in enumerate(df_.index) if ( (df_.iloc[i,2]!='Europe') )]
=[i for i,country in enumerate(df_.index) if (country in rich_list) ]
index_rich=[i for i,country in enumerate(df_.index) if (country not in rich_list) ]
index_not_rich
print(' Index Rich- Population')
'Population','number_of_medals')
getCorr(df_.iloc[index_rich,:],print(' Index Not Rich- Population')
'Population','number_of_medals')
getCorr(df_.iloc[index_not_rich,:],print(' Index Rich- GDP')
'GDP','number_of_medals')
getCorr(df_.iloc[index_rich,:],print(' Index Not Rich- GDP')
'GDP','number_of_medals') getCorr(df_.iloc[index_not_rich,:],
Index Rich- Population
Pearson Correlation of = 0.37 (6e-03)
Spearman Correlation of = 0.30 (3e-02)
Index Not Rich- Population
Pearson Correlation of = 0.17 (2e-01)
Spearman Correlation of = 0.15 (3e-01)
Index Rich- GDP
Pearson Correlation of = 0.84 (7e-15)
Spearman Correlation of = 0.50 (2e-04)
Index Not Rich- GDP
Pearson Correlation of = 0.32 (2e-02)
Spearman Correlation of = 0.38 (6e-03)
print(' Index Europe- GDP')
'Continent']=='Europe'],'GDP','number_of_medals')
getCorr(df_[df_[print(' Index Not Europe- GDP')
'Continent']!='Europe'],'GDP','number_of_medals')
getCorr(df_[df_[
print(' Index Europe- Population')
'Continent']=='Europe'],'Population','number_of_medals')
getCorr(df_[df_[print(' Index Not Europe- Population')
'Continent']!='Europe'],'Population','number_of_medals') getCorr(df_[df_[
Index Europe- GDP
Pearson Correlation of = 0.80 (6e-09)
Spearman Correlation of = 0.71 (1e-06)
Index Not Europe- GDP
Pearson Correlation of = 0.94 (2e-33)
Spearman Correlation of = 0.51 (1e-05)
Index Europe- Population
Pearson Correlation of = 0.93 (2e-16)
Spearman Correlation of = 0.81 (3e-09)
Index Not Europe- Population
Pearson Correlation of = 0.49 (2e-05)
Spearman Correlation of = 0.42 (3e-04)
print('GDP'),getCorr(df_,'GDP','number_of_medals')
print('Population'),getCorr(df_,'Population','number_of_medals')
GDP
Pearson Correlation of = 0.84 (4e-29)
Spearman Correlation of = 0.60 (2e-11)
Population
Pearson Correlation of = 0.42 (7e-06)
Spearman Correlation of = 0.41 (1e-05)
(None, None)
Overview of GDP and population
Methods Used
Looking at number of medals against GDP and population, for a period from 2008-2016 inclusive. Using population and GDP data from ~2020. - scatter plots - to visualize the relationships between medals and GDP, and medals and population. Both show a correlation - Pearson & Spearman correlation - To quatify the correlation seen. Pearson looks for a linear relationship whereas Spearman considers the ordering of the variables. The Pearson results showed a moderate to strong relationship for GDP and a weak to moderate relationship for population. The Spearman results were similar (more correlation for GDP) but the correlation values were lower. - I considered looking at the changes with time, but there were a lot of factors affecting country participation and the data was harder to obtain easily
Overview
Results were mainly as expected and hence PROVED.
- GDP
- ALL = strong AND significant
- Europe = strong AND significant
- NOT Europe = very strong AND significant
- Population
- ALL = moderate AND significant
- Europe = very strong AND significant
- NOT Europe = moderate AND significant
Using a line fit of the data (GDP or Population to Medals) gives an indication of which countries are reducing the correlation. If the country has more medals than the fit then it is overachieve and if it has less it is underachieveing. The table below shows which countries are under and overachieveing for GDP and population.
={'Overachieve':'GDP - Overachieve','Underachieve':'GDP - Underachieve'},inplace=True)
GDPCorr.rename(columns={'Overachieve':'Population - Overachieve','Underachieve':'Population - Underachieve'},inplace=True)
popCorr.rename(columns
=pd.concat([GDPCorr,popCorr],axis=1)
corrComb corrComb
GDP - Underachieve | GDP - Overachieve | Population - Underachieve | Population - Overachieve | |
---|---|---|---|---|
0 | India | USA | India | Russia |
1 | Indonesia | Russia | USA | UK |
2 | Nigeria | UK | Saudi Arabia | Australia |
3 | Philippines | Germany | Indonesia | France |
4 | Vietnam | France | United Arab Emirates | Germany |
5 | Egypt | Australia | Philippines | Ukraine |
6 | Sudan | China | Israel | South Korea |
7 | Uganda | Japan | Mexico | Italy |
8 | Saudi Arabia | Italy | Austria | Cuba |
9 | Cameroon | South Korea | Chile | Kazakhstan |
So based on the above it may make sense to group countries based on their GDP, into a rich and poor list. But as shown above this doesn’t increase the correlation for either group.
The only useful metric to increase the correlation was to group the countries into those from Europe. This is probably due to the similarity of countries within Europe (and outside): size, GDP but also culturally; and it may be because the European countries have much greater participation at the Olympics (both over currently and historically).
Guide on Correlation
'Correlation':['1','0.8 - 1.0','0.6 - 0.8','0.4 - 0.6','0.2 - 0.4','0 - 0.2'],\
pd.DataFrame({'Interpretation':['Pefect','Strong to Perfect','Moderate to Very Strong','Moderate to strong','Weak to moderate','Zero to weak']})
Correlation | Interpretation | |
---|---|---|
0 | 1 | Pefect |
1 | 0.8 - 1.0 | Strong to Perfect |
2 | 0.6 - 0.8 | Moderate to Very Strong |
3 | 0.4 - 0.6 | Moderate to strong |
4 | 0.2 - 0.4 | Weak to moderate |
5 | 0 - 0.2 | Zero to weak |
To determine whether the correlation between variables is significant, compare the p-value to your significance level. Usually, a significance level (denoted as α or alpha) of 0.05 works well. An α of 0.05 (5e-2) indicates that the risk of concluding that a correlation exists—when, actually, no correlation exists—is 5%. The p-value tells you whether the correlation coefficient is significantly different from 0. (A coefficient of 0 indicates that there is no linear relationship.)
- P-value ≤ α (5e-2): The correlation is statistically significant
If the p-value is less than or equal to the significance level, then you can conclude that the correlation is different from 0.
- P-value > α (5e-2): The correlation is not statistically significant
If the p-value is greater than the significance level, then you cannot conclude that the correlation is different from 0.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6107969/
https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/how-to/correlation/interpret-the-results/
Number of athletes
A metric that would make sense to correlate with medals would be the number of athletes each nation sends to an Olympics. This is because, for the most part, participation is done on merit. That is athletes have to qualify against athletes from other nations.
So the metrics we want to look at are: - the number of athletes per nation attending a particular games - the number of medals per nation per games
NB I will just use male athletes for simplicity
Change in average athletes per continent
First, let us look if the greater number of athletes from Europe is also reflected in the average number of athletes each nation sends within each continent.
As shown below European countries send around twice as many athletes per nation as other continents.
= sqldf('SELECT \
tempa Year, \
Continent, \
AVG(numbers) as ath_per_nation \
FROM(SELECT \
c.Continent, \
a.Year,\
COUNT(*) AS numbers \
FROM \
df_country as c \
LEFT JOIN \
df_M_S as a \
ON \
a.NOC = c.NOC \
GROUP BY \
a.Year,c.Continent,c.Nation \
ORDER BY \
Year asc) A \
GROUP BY \
Year,Continent;',locals())
=(8,5))
plt.subplots(figsize=['g>--','<r-.','bo-','mv-','kp:']
colafor i,continent in enumerate(tempa.Continent.unique()):
=tempa[tempa.Continent==continent]['Year']
x=tempa[tempa.Continent==continent]['ath_per_nation']
y;
plt.plot(x,y,cola[i]);
plt.legend(tempa.Continent.unique())-5,200])
plt.ylim([True)
plt.grid('Year')
plt.xlabel('Number of male athletes\n per nation'); plt.ylabel(
Athletes per nation VS medals per nation
To look at this correlation we need to do the following steps: - Join athletes with country tables - Group by Nation and whether they have a medal - Count this, this will give number of events with a medal - Then need to remove nations that didn’t get any medals - Get scatter and correlation of these - Group by continent and do the same
= sqldf('SELECT \
tempa Continent, \
Nation, \
Medal, \
SUM(numbers) AS numbers \
FROM(SELECT \
c.Continent, \
c.Nation, \
a.Medal, \
COUNT(*) AS numbers \
FROM \
df_country as c \
LEFT JOIN \
df_M_S as a \
ON \
a.NOC = c.NOC \
WHERE Year>2003 \
GROUP BY \
c.Continent,c.Nation,a.Medal_Gold,a.Medal_Silver,a.Medal_Bronze) A \
GROUP BY \
Continent, Nation,Medal;',locals())
= sqldf('SELECT \
tempa2 Nation, \
COUNT(numbers) AS numbers \
FROM(SELECT \
a.athlete_ID, \
c.Nation, \
COUNT(*) AS numbers \
FROM \
df_country as c \
LEFT JOIN \
df_F_S as a \
ON \
a.NOC = c.NOC \
WHERE Year=2016 \
GROUP BY \
c.NOC,a.athlete_ID,a.Year) A \
GROUP BY \
Nation;',locals())
= tempa[tempa.Medal==1]
tempa3=sqldf('\
tempa4 SELECT A.Nation,A.numbers as num_medals,B.numbers as num_ath \
FROM tempa3 as A \
LEFT JOIN tempa2 as B \
ON A.Nation=B.Nation \
;',locals())
tempa4
Nation | num_medals | num_ath | |
---|---|---|---|
0 | Algeria | 4 | 10.0 |
1 | Botswana | 1 | 3.0 |
2 | Egypt | 9 | 37.0 |
3 | Eritrea | 1 | 1.0 |
4 | Ethiopia | 12 | 20.0 |
... | ... | ... | ... |
95 | UK | 245 | 159.0 |
96 | Ukraine | 51 | 117.0 |
97 | Australia | 251 | 212.0 |
98 | Fiji | 13 | 17.0 |
99 | New Zealand | 44 | 97.0 |
100 rows × 3 columns
=tempa4.dropna()
tempa4=tempa4.num_ath
x=tempa4.num_medals
y
=(8,5))
plt.subplots(figsize
plt.scatter(x,y)=scipy.stats.spearmanr(x,y)
a=scipy.stats.pearsonr(x,y)
bprint('Spearman correlation = {:.2f} ({:.0e}) \nand Pearson correlation= {:.2f} ({:.0e})'\
format(a[0],a[1],b[0],b[1]))
.
True)
plt.grid('Athletes per nation')
plt.xlabel('Medals per nation');
plt.ylabel(
=np.poly1d( np.polyfit(x,y,2) )
p=np.arange(0,300,2)
xx=p(np.arange(0,300,2))
yy
; plt.plot(xx,yy)
Spearman correlation = 0.83 (3e-26)
and Pearson correlation= 0.87 (5e-32)
Overview Medals VS Athletes per nation
As was expected there is a good correlation between the number of athletes a nation sends and the number of medals they get
What may also have been expected and shown in the data, is that the relationship is not linear. Instead the more athletes a nation sends the greater the medals/athlete ratio.
i.e. If a nation sends more athletes it is more likely that a higher proportion of them will win medals
Cold War
The Cold War was a period of geopolitical tension between the United States and the Soviet Union and their respective allies, the Western Bloc and the Eastern Bloc, which began following World War II and ended in the early 1990s.
The conflict was based around the ideological and geopolitical struggle for global influence by these two superpowers.
The Soviet Union competed at the Olympics from 1952-1988. The Russian Empire had previously competed at the 1900, 1908 and 1912 Olympics games. In these games the best they ranked was 12th. In contrast the USA competed from the start of the Olympics and all subsequent games. In 1952 they were the most succesful nation, coming 1st in the medals table in 8 out of the previous 11 games (and second in the other 3).
The figures below show how during the Cold War period, The Soviet Union was able to compete with USA and in some cases beat them in the medals table. After this Cold War period the medals obtained by both the USA and the Soviet Union fell with respect to the totals from the European nations. Furthermore, USA reasserted it’s dominance after the Cold War period.
def number_of_athletes_USA_USSR(df_F_S,df_M_S):
=sqldf('\
testa2 SELECT \
Year, \
NOCSMALL, \
count(*) AS number_of_athletes, \
"F" AS Sex \
FROM \
(SELECT \
athlete_ID, \
Year, \
CASE \
WHEN NOC IN ("FRA","ESP","ITA","POR","GBR","IRL","NED","BEL","DEN","SUI") THEN "WES"\
WHEN NOC IN ("POL","ROU","UKR","LAT","BUL","HUN","LTU","LAT","BLR","ALB","SVK","AUT","EST","BIH","BOH") THEN "EST"\
WHEN NOC="USA" THEN "USA" \
WHEN NOC="EUN" THEN "EUN" \
ELSE "ROW"\
END AS NOCSMALL \
from df_F_S \
group by athlete_ID,Year \
order by Year asc) A \
GROUP BY Year, NOCSMALL \
UNION ALL \
SELECT \
Year, \
NOCSMALL, \
count(*) AS number_of_athletes, \
"M" AS Sex \
FROM \
(SELECT \
athlete_ID, \
Year, \
CASE \
WHEN NOC IN ("FRA","ESP","ITA","POR","GBR","IRL","NED","BEL","DEN","SUI") THEN "WES"\
WHEN NOC IN ("POL","ROU","UKR","LAT","BUL","HUN","LTU","LAT","BLR","ALB","SVK","AUT","EST","BIH","BOH") THEN "EST"\
WHEN NOC="USA" THEN "USA" \
WHEN NOC="EUN" THEN "EUN" \
ELSE "ROW"\
END AS NOCSMALL \
from df_M_S \
group by athlete_ID,Year \
order by Year asc) A \
GROUP BY Year, NOCSMALL;',locals() )
return testa2
def number_of_medals_USA_USSR(df_F_S,df_M_S):
=sqldf('\
testa2 SELECT \
COUNT(*) AS number_of_medals,\
Year, Sex, NOCSMALL\
FROM \
(SELECT NOCSMALL,Year,Sex,COUNT(*) AS counta\
FROM \
(SELECT \
athlete_ID, \
event_ID, \
"F" AS Sex, \
Medal_Gold,Medal_Silver,Medal_Bronze,\
Year, \
CASE \
WHEN NOC IN ("FRA","ESP","ITA","POR","GBR","IRL","NED","BEL","DEN","SUI") THEN "WES"\
WHEN NOC IN ("POL","ROU","UKR","LAT","BUL","HUN","LTU","LAT","BLR","ALB","SVK","AUT","EST","BIH","BOH") THEN "EST"\
WHEN NOC="USA" THEN "USA" \
WHEN NOC="EUN" THEN "EUN" \
ELSE "ROW"\
END AS NOCSMALL \
from df_F_S \
WHERE Medal_Gold=1 OR Medal_Silver=1 OR Medal_Bronze=1\
UNION ALL \
SELECT \
athlete_ID, \
event_ID, \
"M" AS Sex, \
Medal_Gold,Medal_Silver,Medal_Bronze,\
Year, \
CASE \
WHEN NOC IN ("FRA","ESP","ITA","POR","GBR","IRL","NED","BEL","DEN","SUI") THEN "WES"\
WHEN NOC IN ("POL","ROU","UKR","LAT","BUL","HUN","LTU","LAT","BLR","ALB","SVK","AUT","EST","BIH","BOH") THEN "EST"\
WHEN NOC="USA" THEN "USA" \
WHEN NOC="EUN" THEN "EUN" \
ELSE "ROW"\
END AS NOCSMALL \
from df_M_S \
WHERE Medal_Gold=1 OR Medal_Silver=1 OR Medal_Bronze=1\
order by Year asc) A\
GROUP BY \
Year, NOCSMALL,event_id,Medal_Gold,Medal_Silver,Medal_Bronze) AS B\
GROUP BY Year, NOCSMALL, Sex\
;',locals() )
return testa2
=number_of_medals_USA_USSR(df_F_S,df_M_S)
USA_USSR_medals=number_of_athletes_USA_USSR(df_F_S,df_M_S) USA_USSR_athletes
def yrplot(df__,whatplot= 'avg_weight'):
=['EST', 'EUN' ,'ROW', 'USA' ,'WES']
countries
# df__.NOCSMALL.unique()
# countries=np.sort(countries)
print(countries)
=['>','o','+','*','<']
cola=[[1,0.6,.6],[1,0,0],[.5,.5,.5],[0,0,1],[.6,.6,1]]
colur# ['EST' 'EUN' 'ROW' 'USA' 'WES']
# 'EST','USA','WES','ROW','EUN'
=plt.subplots(figsize=(8,5))
fig,ax1
for i,country in enumerate(countries):
if country!='ROW':
==country].Year,\
ax1.plot(df__[df__.NOCSMALL==country][whatplot],\
df__[df__.NOCSMALL=cola[i],linestyle='None',color=colur[i]\
marker=10)
,markersize
def doPlot(df_F,avgNo,whatplot,country,col,lw,ax1):
= df_F[df__.NOCSMALL==country].Year.rolling(avgNo).mean()
bb = df_F[df__.NOCSMALL==country][whatplot]
cc = cc.rolling(avgNo).mean()
cc =lw,color=col)
ax1.plot(bb,cc,linewidthreturn ax1
for i,country in enumerate(countries):
if country!='ROW':
=doPlot(df__,avgNo=3,whatplot=whatplot,country=country,col=colur[i],lw=3,ax1=ax1)
ax1
= ['East Europe','Russia','USA','West Europe']
lega
plt.legend(lega)True)
plt.grid(
plt.ylabel(modname(whatplot))
return ax1
=='F'],whatplot= 'number_of_medals') yrplot(USA_USSR_medals[USA_USSR_medals.Sex
=='M'],whatplot= 'number_of_medals') yrplot(USA_USSR_medals[USA_USSR_medals.Sex