import pandas as pd
from pandasql import sqldf
import matplotlib.pyplot as plt
import re
Olympics data with SQL and pandas- creating a country table
Creating a country table of the Olympics
Overview
The country table needed extra analysis so I seperated it from the rest of the analysis.
It also requires importing some new data, which I will add here too
= pd.read_csv("athlete_events.csv")
df=pd.read_csv("noc_regions.csv") df2
The next line is just to add a unique ID for athletes when the data is split up later
= df.reset_index()
df={'index':'event_athlete_ID','ID':'athlete_ID'},inplace=True) df.rename(columns
df.head()
event_athlete_ID | athlete_ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
1 | 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
2 | 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
3 | 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
4 | 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |
df2.head()
NOC | region | notes | |
---|---|---|---|
0 | AFG | Afghanistan | NaN |
1 | AHO | Curacao | Netherlands Antilles |
2 | ALB | Albania | NaN |
3 | ALG | Algeria | NaN |
4 | AND | Andorra | NaN |
What do we use to value to identify a nation
print('There are {} unique teams and {} unique NOCs in df.\n \
And {} unique NOC values, {} unique regions and {} unique notes in df2.'.format( \
len(pd.unique(df.Team)),len(pd.unique(df.NOC)),
len(pd.unique(df2.region)),len(pd.unique(df2.region)),len(pd.unique(df2.notes)) ))
There are 1184 unique teams and 230 unique NOCs in df.
And 207 unique NOC values, 207 unique regions and 22 unique notes in df2.
1184 Seems a lot of teams to consider, it may be best to stick with using NOC
as a unique identifier for a country. We can then probably use regions
as the name of the country.
Let’s have a look at the values of teams
,NOC
and regions
"SELECT \
sqldf( NOC, \
Team, \
count(*) \
FROM \
df \
GROUP BY \
Team, NOC \
ORDER BY team DESC \
LIMIT 30;",locals())
NOC | Team | count(*) | |
---|---|---|---|
0 | FIN | rn-2 | 5 |
1 | BEL | Zut | 3 |
2 | ZIM | Zimbabwe | 309 |
3 | GRE | Zefyros | 2 |
4 | ZAM | Zambia | 183 |
5 | YUG | Yugoslavia-2 | 10 |
6 | YUG | Yugoslavia-1 | 10 |
7 | YUG | Yugoslavia | 2558 |
8 | SUI | Ylliam VIII | 5 |
9 | SUI | Ylliam VII | 6 |
10 | SUI | Ylliam II | 5 |
11 | GBR | Yeoman XII | 3 |
12 | GBR | Yeoman VII | 3 |
13 | RSA | Yeoman V | 3 |
14 | BAH | Yeoman | 4 |
15 | YEM | Yemen | 32 |
16 | MYA | Yangon | 2 |
17 | MEX | Xolotl | 3 |
18 | FIN | Xantippa | 3 |
19 | GBR | Wolseley-Siddeley-1 | 4 |
20 | CAN | Winnipeg Shamrocks-1 | 12 |
21 | CAN | Windor | 2 |
22 | NED | Willem-Six | 3 |
23 | ARG | Wiking | 5 |
24 | USA | Widgeon | 2 |
25 | FRA | Whitini Star | 1 |
26 | DEN | White Lady | 3 |
27 | JPN | Whisper | 1 |
28 | CAN | Whirlaway | 2 |
29 | USA | Western Rowing Club-3 | 6 |
Team in df seems to not reflect the country very well. e.g. Whisper is not a country but JPN probably represents Japan.
So the use of NOC seems to make sense
Now let us consider the NOC
, region
and notes
variables
"SELECT \
sqldf( NOC, \
Region, \
Notes, \
count(*) \
FROM \
df2 \
GROUP BY \
NOC, Region, Notes \
ORDER BY Region DESC \
LIMIT 30;",locals())
NOC | region | notes | count(*) | |
---|---|---|---|---|
0 | RHO | Zimbabwe | None | 1 |
1 | ZIM | Zimbabwe | None | 1 |
2 | ZAM | Zambia | None | 1 |
3 | YAR | Yemen | North Yemen | 1 |
4 | YEM | Yemen | None | 1 |
5 | YMD | Yemen | South Yemen | 1 |
6 | ISV | Virgin Islands, US | Virgin Islands | 1 |
7 | IVB | Virgin Islands, British | None | 1 |
8 | VIE | Vietnam | None | 1 |
9 | VNM | Vietnam | None | 1 |
10 | VEN | Venezuela | None | 1 |
11 | VAN | Vanuatu | None | 1 |
12 | UZB | Uzbekistan | None | 1 |
13 | URU | Uruguay | None | 1 |
14 | UAE | United Arab Emirates | None | 1 |
15 | UKR | Ukraine | None | 1 |
16 | UGA | Uganda | None | 1 |
17 | USA | USA | None | 1 |
18 | GBR | UK | None | 1 |
19 | TKM | Turkmenistan | None | 1 |
20 | TUR | Turkey | None | 1 |
21 | TUN | Tunisia | None | 1 |
22 | TTO | Trinidad | Trinidad and Tobago | 1 |
23 | WIF | Trinidad | West Indies Federation | 1 |
24 | TGA | Tonga | None | 1 |
25 | TOG | Togo | None | 1 |
26 | TLS | Timor-Leste | None | 1 |
27 | THA | Thailand | None | 1 |
28 | TAN | Tanzania | None | 1 |
29 | TJK | Tajikistan | None | 1 |
NOC doesn’t look unique enough for us. For example, - Zimbabwe is RHO and ZIM and this probably just reflects a name change in the country - Yemen has 3 NOC values, presumably reflecting the unification that took place in 1990. Most of these differences are not that important to what I am looking at, where it would be better to have a broader description of a nation
So what I will do is - use the region tag as a unique identifier of a country - replace multiple NOC values of a single country with a single NOC value
The first step is to identify which region
s have multiple NOC values. We can do this again in SQL by creating a new table
=sqldf("SELECT \
d1 NOC, \
region, \
notes, \
count(*) \
FROM \
df2 \
GROUP BY \
region \
HAVING COUNT(*)>1 \
ORDER BY count(*) DESC;",locals())
d1
NOC | region | notes | count(*) | |
---|---|---|---|---|
0 | FRG | Germany | None | 4 |
1 | YAR | Yemen | North Yemen | 3 |
2 | SCG | Serbia | Serbia and Montenegro | 3 |
3 | EUN | Russia | None | 3 |
4 | MAL | Malaysia | None | 3 |
5 | BOH | Czech Republic | Bohemia | 3 |
6 | ROT | None | Refugee Olympic Team | 3 |
7 | RHO | Zimbabwe | None | 2 |
8 | VIE | Vietnam | None | 2 |
9 | TTO | Trinidad | Trinidad and Tobago | 2 |
10 | SYR | Syria | None | 2 |
11 | CRT | Greece | Crete | 2 |
12 | CHN | China | None | 2 |
13 | CAN | Canada | None | 2 |
14 | ANZ | Australia | Australasia | 2 |
We can then create a table to get - the country, - the new NOC value for each country - all the NOC values that correspond to that country
=sqldf("SELECT \
d2 d1.NOC AS new_NOC, \
df2.NOC AS orig_NOC, \
df2.region, \
df2.notes \
FROM \
d1 \
LEFT JOIN \
df2 \
ON \
d1.region=df2.region \
ORDER BY \
df2.region DESC;",locals())
d2
new_NOC | orig_NOC | region | notes | |
---|---|---|---|---|
0 | RHO | RHO | Zimbabwe | None |
1 | RHO | ZIM | Zimbabwe | None |
2 | YAR | YAR | Yemen | North Yemen |
3 | YAR | YEM | Yemen | None |
4 | YAR | YMD | Yemen | South Yemen |
5 | VIE | VIE | Vietnam | None |
6 | VIE | VNM | Vietnam | None |
7 | TTO | TTO | Trinidad | Trinidad and Tobago |
8 | TTO | WIF | Trinidad | West Indies Federation |
9 | SYR | SYR | Syria | None |
10 | SYR | UAR | Syria | United Arab Republic |
11 | SCG | SCG | Serbia | Serbia and Montenegro |
12 | SCG | SRB | Serbia | None |
13 | SCG | YUG | Serbia | Yugoslavia |
14 | EUN | EUN | Russia | None |
15 | EUN | RUS | Russia | None |
16 | EUN | URS | Russia | None |
17 | MAL | MAL | Malaysia | None |
18 | MAL | MAS | Malaysia | None |
19 | MAL | NBO | Malaysia | North Borneo |
20 | CRT | CRT | Greece | Crete |
21 | CRT | GRE | Greece | None |
22 | FRG | FRG | Germany | None |
23 | FRG | GDR | Germany | None |
24 | FRG | GER | Germany | None |
25 | FRG | SAA | Germany | None |
26 | BOH | BOH | Czech Republic | Bohemia |
27 | BOH | CZE | Czech Republic | None |
28 | BOH | TCH | Czech Republic | None |
29 | CHN | CHN | China | None |
30 | CHN | HKG | China | Hong Kong |
31 | CAN | CAN | Canada | None |
32 | CAN | NFL | Canada | Newfoundland |
33 | ANZ | ANZ | Australia | Australasia |
34 | ANZ | AUS | Australia | None |
35 | ROT | None | None | None |
And finally replace the values of NOC in df
and df2
with the new values for countries with duplicate values
Easier to do this with Python
for i,old_NOC in enumerate(d2.orig_NOC):
==old_NOC,'NOC']=d2.loc[i,'new_NOC']
df.loc[df.NOC==old_NOC,'NOC']=d2.loc[i,'new_NOC']
df2.loc[df2.NOC
GDP data
Wikipedia was used to get data on population and GDP of different countries. The data was imported using Excel’s capability to give it a url to obtain the table, and saved as different tabs in the file CountryData.xlsx
. For GDP I selected the World Bank Estimate.
Website urls correct as of 23/7/22
= pd.read_excel('CountryData.xlsx',sheet_name=2)
df_GDP
=df_GDP.drop(columns=['IMF[1][12] Estimate','IMF[1][12] Year','United Nations[13] Estimate','United Nations[13] Year','World Bank[14][15] Year'])
df_GDP=df_GDP.rename(columns={'World Bank[14][15] Estimate':'GDP','Country/Territory':'Country','UN Region':'Continent'},errors='raise')
df_GDP
10) df_GDP.head(
Country | Continent | GDP | |
---|---|---|---|
0 | World | - | 84,705,567 |
1 | United States | Americas | 20,936,600 |
2 | China | Asia | 14,722,731 |
3 | Japan | Asia | 4,975,415 |
4 | Germany | Europe | 3,806,060 |
5 | India | Asia | 2,622,984 |
6 | United Kingdom | Europe | 2,707,744 |
7 | France | Europe | 2,603,004 |
8 | Canada | Americas | 1,643,408 |
9 | Italy | Europe | 1,886,445 |
make column GDP an integer, and remove the comma
'GDP']=[x.replace(',','') for x in df_GDP['GDP']]
df_GDP[
'GDP']=pd.to_numeric(df_GDP['GDP'],errors='coerce').fillna(0).astype('int') df_GDP[
We now need to match the names of countries from the data we have found on the internet with the Olympics data.
A good example of this is United Kingdom, which can be named several ways (with slightly different meanings) including: UK, Great Britain, Great Britain, Great Britain and Northern Ireland, GB.
The function changeDF_country
takes the imported dataframes and df2
as inputs and outputs the imported dataframes with corrected NOC
and country
values
- Checks whether country in imported df matches region in
df
- If so country doesn’t need to be changed and can provide matching NOC value
- If not above then see if
df2.notes
matches the country
- If so can return country and NOC matching the notes that match
- If not aboves, try some name changes, like United States to USA
- Then return the matching NOC and country and NOC matching values
- None of the above
- Then return country given and give NOC a name to show there was no match.
- Print out the name of the country where no match was found. To see if missing anything big or obvious
def getNation(region_to_check):
if region_to_check=='United States':
= 'USA'
region_out elif bool(re.search(r'Germany', region_to_check)):
='Germany'
region_outelif region_to_check=='United Kingdom':
= 'UK'
region_out elif region_to_check=='Soviet Union':
='Russia'
region_outelse:
=region_to_check
region_outprint('nothing found for {}'.format(region_out))
return region_out
def changeDF_country(df__,df2):
=[],[]
xALL,nocALLfor i in range(len(df__)):
= df__.loc[i,'Country']
country_check # print(i,country_check)
=df2.region==country_check
boolCountry=df2[boolCountry].region
x
try:
=str(x.iloc[0])
x'region']=x
df2.loc[boolCountry,except:
if len(x)<1:
=df2.notes==df__.loc[i,'Country']
boolCountry=df2[boolCountry].region
xtry:
=str(x.iloc[0])
x'region']=x
df2.loc[boolCountry,except:
if len(x)<1:
=getNation(df__.loc[i,'Country'])
country_check=df2.region==country_check
boolCountry=df2[boolCountry].region
xtry:
=str(x.iloc[0])
x'region']=x
df2.loc[boolCountry,except:
=country_check
x
xALL.append(x)try:
0])
nocALL.append(df2[boolCountry].NOC.iloc[except:
'---')
nocALL.append(
# print('x= ',x)
# print('-------------------')
# if i==42:
# break
# print(nocALL)
try:
1,'NOC',nocALL)
df__.insert(1,'Nation',xALL)#,'Nation',xALL})
df__.insert(except:
print('done allready')
return df__
=changeDF_country(df_GDP,df2) df_GDP
nothing found for World
nothing found for DR Congo
nothing found for Bolivia
nothing found for Macau
nothing found for Congo
nothing found for North Macedonia
nothing found for New Caledonia
nothing found for French Polynesia
nothing found for Eswatini
nothing found for Greenland
nothing found for Curaçao
nothing found for East Timor
nothing found for Zanzibar
nothing found for British Virgin Islands
nothing found for Northern Mariana Islands
nothing found for Saint Kitts and Nevis
nothing found for Saint Vincent and the Grenadines
nothing found for Sint Maarten
nothing found for São Tomé and Príncipe
nothing found for Anguilla
nothing found for Montserrat
Population table
The same thing as above but for the population data
= pd.read_excel('CountryData.xlsx',sheet_name=0)
df_population 5) df_population.head(
Rank | Country / Dependency | UN Region | Population | Percentage of the world | Date | Source (official or from the United Nations) | Notes | Column9 | |
---|---|---|---|---|---|---|---|---|---|
0 | – | World | NaN | 7965207000 | 1.0000 | 2022-07-20 | UN projection[2] | NaN | NaN |
1 | 1 | China | Asia | 1412600000 | 0.1770 | 2021-12-31 | National annual estimate | The population figure refers to mainland China... | NaN |
2 | 2 | India | Asia | 1373761000 | 0.1720 | 2022-03-01 | Annual national estimate | The figure includes the population of Indian-a... | NaN |
3 | 3 | United States | Americas | 332906919 | 0.0418 | 2022-07-20 | National population clock | The figure includes the 50 states and the Dist... | NaN |
4 | 4 | Indonesia | Asia | 272248500 | 0.0342 | 2021-07-01 | National annual estimate | NaN | NaN |
=['Percentage of the world','Date','Source (official or from the United Nations)','Column9','Notes'],inplace=True)
df_population.drop(columns=df_population.rename(columns={'Country / Dependency':'Country','UN Region':'Continent'},errors='raise')
df_population
df_population
Rank | Country | Continent | Population | |
---|---|---|---|---|
0 | – | World | NaN | 7965207000 |
1 | 1 | China | Asia | 1412600000 |
2 | 2 | India | Asia | 1373761000 |
3 | 3 | United States | Americas | 332906919 |
4 | 4 | Indonesia | Asia | 272248500 |
... | ... | ... | ... | ... |
237 | – | Niue | Oceania | 1549 |
238 | – | Tokelau (New Zealand) | Oceania | 1501 |
239 | 195 | Vatican City | Europe | 825 |
240 | – | Cocos (Keeling) Islands (Australia) | Oceania | 573 |
241 | – | Pitcairn Islands (United Kingdom) | Oceania | 40 |
242 rows × 4 columns
=changeDF_country(df_population,df2) df_population
nothing found for World
nothing found for DR Congo
nothing found for Bolivia
nothing found for Hong Kong (China)
nothing found for Congo
nothing found for Puerto Rico (United States)
nothing found for North Macedonia
nothing found for East Timor
nothing found for Eswatini
nothing found for Macau (China)
nothing found for Western Sahara
nothing found for Northern Cyprus
nothing found for Transnistria
nothing found for French Polynesia (France)
nothing found for New Caledonia (France)
nothing found for Abkhazia
nothing found for São Tomé and Príncipe
nothing found for Guam (United States)
nothing found for Curaçao (Netherlands)
nothing found for Artsakh
nothing found for Aruba (Netherlands)
nothing found for Saint Vincent and the Grenadines
nothing found for Jersey (British Crown Dependency)
nothing found for U.S. Virgin Islands (United States)
nothing found for Isle of Man (British Crown Dependency)
nothing found for Cayman Islands (United Kingdom)
nothing found for Bermuda (United Kingdom)
nothing found for Guernsey (British Crown Dependency)
nothing found for Greenland (Denmark)
nothing found for Saint Kitts and Nevis
nothing found for Faroe Islands (Denmark)
nothing found for South Ossetia
nothing found for American Samoa (United States)
nothing found for Northern Mariana Islands (United States)
nothing found for Turks and Caicos Islands (United Kingdom)
nothing found for Sint Maarten (Netherlands)
nothing found for Gibraltar (United Kingdom)
nothing found for Saint Martin (France)
nothing found for Åland (Finland)
nothing found for British Virgin Islands (United Kingdom)
nothing found for Anguilla (United Kingdom)
nothing found for Wallis and Futuna (France)
nothing found for Saint Barthélemy (France)
nothing found for Saint Helena, Ascension and Tristan da Cunha (United Kingdom)
nothing found for Saint Pierre and Miquelon (France)
nothing found for Montserrat (United Kingdom)
nothing found for Falkland Islands (United Kingdom)
nothing found for Christmas Island (Australia)
nothing found for Norfolk Island (Australia)
nothing found for Niue
nothing found for Tokelau (New Zealand)
nothing found for Vatican City
nothing found for Cocos (Keeling) Islands (Australia)
nothing found for Pitcairn Islands (United Kingdom)
Combine everything for one country table
Now we merge all data sets together. I’ll use pandas but this is the same as a SQL left outer join. Because we want to keep all the NOC values in df2
we start with that then join each one after
=[]
df_country=df2.merge(df_population,left_on='NOC',right_on='NOC')
df_country
=df_country.merge(df_GDP,left_on='Nation',right_on='Nation')
df_country=['Continent_y','Country_x','Country_y','NOC_y','Rank','notes','region'],inplace=True)
df_country.drop(columns=df_country.rename(columns={'NOC_x':'NOC','Continent_x':'Continent'},errors='raise')
df_country
# We have multiple versions of Nation so we just take the max one
= df_country.groupby('NOC').max() df_country
'country')
df_country.to_csv(
"athlete_events.csv")
df.to_csv("noc_regions.csv") df2.to_csv(