Olympics data with SQL and pandas- creating a country table

Creating a country table of the Olympics

Author

Thomas H. Simm

Overview

The country table needed extra analysis so I seperated it from the rest of the analysis.

It also requires importing some new data, which I will add here too

import pandas as pd
from pandasql import sqldf
import matplotlib.pyplot as plt
import re

df= pd.read_csv("athlete_events.csv")
df2=pd.read_csv("noc_regions.csv")

The next line is just to add a unique ID for athletes when the data is split up later

df= df.reset_index()
df.rename(columns={'index':'event_athlete_ID','ID':'athlete_ID'},inplace=True)

df.head()

	event_athlete_ID	athlete_ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event	Medal
0	0	1	A Dijiang	M	24.0	180.0	80.0	China	CHN	1992 Summer	1992	Summer	Barcelona	Basketball	Basketball Men's Basketball	NaN
1	1	2	A Lamusi	M	23.0	170.0	60.0	China	CHN	2012 Summer	2012	Summer	London	Judo	Judo Men's Extra-Lightweight	NaN
2	2	3	Gunnar Nielsen Aaby	M	24.0	NaN	NaN	Denmark	DEN	1920 Summer	1920	Summer	Antwerpen	Football	Football Men's Football	NaN
3	3	4	Edgar Lindenau Aabye	M	34.0	NaN	NaN	Denmark/Sweden	DEN	1900 Summer	1900	Summer	Paris	Tug-Of-War	Tug-Of-War Men's Tug-Of-War	Gold
4	4	5	Christine Jacoba Aaftink	F	21.0	185.0	82.0	Netherlands	NED	1988 Winter	1988	Winter	Calgary	Speed Skating	Speed Skating Women's 500 metres	NaN

df2.head()

	NOC	region	notes
0	AFG	Afghanistan	NaN
1	AHO	Curacao	Netherlands Antilles
2	ALB	Albania	NaN
3	ALG	Algeria	NaN
4	AND	Andorra	NaN

What do we use to value to identify a nation

print('There are {} unique teams and {} unique NOCs in df.\n     \
And {} unique NOC values, {} unique regions and {} unique notes in df2.'.format( \
    len(pd.unique(df.Team)),len(pd.unique(df.NOC)), 
    len(pd.unique(df2.region)),len(pd.unique(df2.region)),len(pd.unique(df2.notes)) ))

There are 1184 unique teams and 230 unique NOCs in df.
     And 207 unique NOC values, 207 unique regions and 22 unique notes in df2.

1184 Seems a lot of teams to consider, it may be best to stick with using NOC as a unique identifier for a country. We can then probably use regions as the name of the country.

Let’s have a look at the values of teams,NOC and regions

sqldf("SELECT                                \
         NOC,                                \
         Team,                               \
         count(*)                            \
       FROM                                  \
         df                                  \
       GROUP BY                              \
         Team, NOC                           \
       ORDER BY team DESC                    \
       LIMIT 30;",locals())

	NOC	Team	count(*)
0	FIN	rn-2	5
1	BEL	Zut	3
2	ZIM	Zimbabwe	309
3	GRE	Zefyros	2
4	ZAM	Zambia	183
5	YUG	Yugoslavia-2	10
6	YUG	Yugoslavia-1	10
7	YUG	Yugoslavia	2558
8	SUI	Ylliam VIII	5
9	SUI	Ylliam VII	6
10	SUI	Ylliam II	5
11	GBR	Yeoman XII	3
12	GBR	Yeoman VII	3
13	RSA	Yeoman V	3
14	BAH	Yeoman	4
15	YEM	Yemen	32
16	MYA	Yangon	2
17	MEX	Xolotl	3
18	FIN	Xantippa	3
19	GBR	Wolseley-Siddeley-1	4
20	CAN	Winnipeg Shamrocks-1	12
21	CAN	Windor	2
22	NED	Willem-Six	3
23	ARG	Wiking	5
24	USA	Widgeon	2
25	FRA	Whitini Star	1
26	DEN	White Lady	3
27	JPN	Whisper	1
28	CAN	Whirlaway	2
29	USA	Western Rowing Club-3	6

Team in df seems to not reflect the country very well. e.g. Whisper is not a country but JPN probably represents Japan.

So the use of NOC seems to make sense

Now let us consider the NOC, region and notes variables

sqldf("SELECT                                \
         NOC,                                \
         Region,                             \
         Notes,                              \
         count(*)                            \
       FROM                                  \
         df2                                 \
       GROUP BY                              \
         NOC, Region, Notes                  \
       ORDER BY Region DESC                  \
       LIMIT 30;",locals())

	NOC	region	notes	count(*)
0	RHO	Zimbabwe	None	1
1	ZIM	Zimbabwe	None	1
2	ZAM	Zambia	None	1
3	YAR	Yemen	North Yemen	1
4	YEM	Yemen	None	1
5	YMD	Yemen	South Yemen	1
6	ISV	Virgin Islands, US	Virgin Islands	1
7	IVB	Virgin Islands, British	None	1
8	VIE	Vietnam	None	1
9	VNM	Vietnam	None	1
10	VEN	Venezuela	None	1
11	VAN	Vanuatu	None	1
12	UZB	Uzbekistan	None	1
13	URU	Uruguay	None	1
14	UAE	United Arab Emirates	None	1
15	UKR	Ukraine	None	1
16	UGA	Uganda	None	1
17	USA	USA	None	1
18	GBR	UK	None	1
19	TKM	Turkmenistan	None	1
20	TUR	Turkey	None	1
21	TUN	Tunisia	None	1
22	TTO	Trinidad	Trinidad and Tobago	1
23	WIF	Trinidad	West Indies Federation	1
24	TGA	Tonga	None	1
25	TOG	Togo	None	1
26	TLS	Timor-Leste	None	1
27	THA	Thailand	None	1
28	TAN	Tanzania	None	1
29	TJK	Tajikistan	None	1

NOC doesn’t look unique enough for us. For example, - Zimbabwe is RHO and ZIM and this probably just reflects a name change in the country - Yemen has 3 NOC values, presumably reflecting the unification that took place in 1990. Most of these differences are not that important to what I am looking at, where it would be better to have a broader description of a nation

So what I will do is - use the region tag as a unique identifier of a country - replace multiple NOC values of a single country with a single NOC value

The first step is to identify which regions have multiple NOC values. We can do this again in SQL by creating a new table

d1=sqldf("SELECT                               \
            NOC,                               \
            region,                            \
            notes,                             \
            count(*)                           \
         FROM                                  \
            df2                                \
         GROUP BY                              \
            region                             \
         HAVING COUNT(*)>1                     \
         ORDER BY count(*) DESC;",locals())
d1

	NOC	region	notes	count(*)
0	FRG	Germany	None	4
1	YAR	Yemen	North Yemen	3
2	SCG	Serbia	Serbia and Montenegro	3
3	EUN	Russia	None	3
4	MAL	Malaysia	None	3
5	BOH	Czech Republic	Bohemia	3
6	ROT	None	Refugee Olympic Team	3
7	RHO	Zimbabwe	None	2
8	VIE	Vietnam	None	2
9	TTO	Trinidad	Trinidad and Tobago	2
10	SYR	Syria	None	2
11	CRT	Greece	Crete	2
12	CHN	China	None	2
13	CAN	Canada	None	2
14	ANZ	Australia	Australasia	2

We can then create a table to get - the country, - the new NOC value for each country - all the NOC values that correspond to that country

d2=sqldf("SELECT                            \
            d1.NOC      AS new_NOC,         \
            df2.NOC     AS orig_NOC,        \
            df2.region,                     \
            df2.notes                       \
          FROM                              \
            d1                              \
          LEFT JOIN                         \
            df2                             \
          ON                                \
            d1.region=df2.region            \
          ORDER BY                          \
            df2.region DESC;",locals())
d2

	new_NOC	orig_NOC	region	notes
0	RHO	RHO	Zimbabwe	None
1	RHO	ZIM	Zimbabwe	None
2	YAR	YAR	Yemen	North Yemen
3	YAR	YEM	Yemen	None
4	YAR	YMD	Yemen	South Yemen
5	VIE	VIE	Vietnam	None
6	VIE	VNM	Vietnam	None
7	TTO	TTO	Trinidad	Trinidad and Tobago
8	TTO	WIF	Trinidad	West Indies Federation
9	SYR	SYR	Syria	None
10	SYR	UAR	Syria	United Arab Republic
11	SCG	SCG	Serbia	Serbia and Montenegro
12	SCG	SRB	Serbia	None
13	SCG	YUG	Serbia	Yugoslavia
14	EUN	EUN	Russia	None
15	EUN	RUS	Russia	None
16	EUN	URS	Russia	None
17	MAL	MAL	Malaysia	None
18	MAL	MAS	Malaysia	None
19	MAL	NBO	Malaysia	North Borneo
20	CRT	CRT	Greece	Crete
21	CRT	GRE	Greece	None
22	FRG	FRG	Germany	None
23	FRG	GDR	Germany	None
24	FRG	GER	Germany	None
25	FRG	SAA	Germany	None
26	BOH	BOH	Czech Republic	Bohemia
27	BOH	CZE	Czech Republic	None
28	BOH	TCH	Czech Republic	None
29	CHN	CHN	China	None
30	CHN	HKG	China	Hong Kong
31	CAN	CAN	Canada	None
32	CAN	NFL	Canada	Newfoundland
33	ANZ	ANZ	Australia	Australasia
34	ANZ	AUS	Australia	None
35	ROT	None	None	None

And finally replace the values of NOC in df and df2 with the new values for countries with duplicate values

Easier to do this with Python

for i,old_NOC in enumerate(d2.orig_NOC):
    df.loc[df.NOC==old_NOC,'NOC']=d2.loc[i,'new_NOC']
    df2.loc[df2.NOC==old_NOC,'NOC']=d2.loc[i,'new_NOC']

GDP data

Wikipedia was used to get data on population and GDP of different countries. The data was imported using Excel’s capability to give it a url to obtain the table, and saved as different tabs in the file CountryData.xlsx. For GDP I selected the World Bank Estimate.

Website urls correct as of 23/7/22

df_GDP = pd.read_excel('CountryData.xlsx',sheet_name=2)

df_GDP=df_GDP.drop(columns=['IMF[1][12] Estimate','IMF[1][12] Year','United Nations[13] Estimate','United Nations[13] Year','World Bank[14][15] Year'])
df_GDP=df_GDP.rename(columns={'World Bank[14][15] Estimate':'GDP','Country/Territory':'Country','UN Region':'Continent'},errors='raise')

df_GDP.head(10)

	Country	Continent	GDP
0	World	-	84,705,567
1	United States	Americas	20,936,600
2	China	Asia	14,722,731
3	Japan	Asia	4,975,415
4	Germany	Europe	3,806,060
5	India	Asia	2,622,984
6	United Kingdom	Europe	2,707,744
7	France	Europe	2,603,004
8	Canada	Americas	1,643,408
9	Italy	Europe	1,886,445

make column GDP an integer, and remove the comma

df_GDP['GDP']=[x.replace(',','') for x in df_GDP['GDP']]

df_GDP['GDP']=pd.to_numeric(df_GDP['GDP'],errors='coerce').fillna(0).astype('int')

We now need to match the names of countries from the data we have found on the internet with the Olympics data.

A good example of this is United Kingdom, which can be named several ways (with slightly different meanings) including: UK, Great Britain, Great Britain, Great Britain and Northern Ireland, GB.

The function changeDF_country takes the imported dataframes and df2 as inputs and outputs the imported dataframes with corrected NOC and country values

Checks whether country in imported df matches region in df

If so country doesn’t need to be changed and can provide matching NOC value

If not above then see if df2.notes matches the country

If so can return country and NOC matching the notes that match

If not aboves, try some name changes, like United States to USA

Then return the matching NOC and country and NOC matching values

None of the above

Then return country given and give NOC a name to show there was no match.
Print out the name of the country where no match was found. To see if missing anything big or obvious

def getNation(region_to_check):
       
    if region_to_check=='United States':
        region_out = 'USA'
    elif bool(re.search(r'Germany', region_to_check)):
        region_out='Germany'
    elif region_to_check=='United Kingdom':
        region_out = 'UK'
    elif region_to_check=='Soviet Union':
        region_out='Russia'
    else:
        region_out=region_to_check
        print('nothing found for {}'.format(region_out))
    
    return region_out

def changeDF_country(df__,df2):
    xALL,nocALL=[],[]
    for i in range(len(df__)):
        country_check = df__.loc[i,'Country']
#         print(i,country_check)
        boolCountry=df2.region==country_check
        x=df2[boolCountry].region
        
        try:
            x=str(x.iloc[0])
            df2.loc[boolCountry,'region']=x
        except:
            if len(x)<1:
                boolCountry=df2.notes==df__.loc[i,'Country']
                x=df2[boolCountry].region
                try:
                    x=str(x.iloc[0])
                    df2.loc[boolCountry,'region']=x
                except:
                    if len(x)<1:
                        country_check=getNation(df__.loc[i,'Country'])
                        boolCountry=df2.region==country_check
                        x=df2[boolCountry].region
                        try:
                            x=str(x.iloc[0])
                            df2.loc[boolCountry,'region']=x
                        except:
                            x=country_check
        xALL.append(x)
        try:
            nocALL.append(df2[boolCountry].NOC.iloc[0])
        except:
            nocALL.append('---')
            
#         print('x= ',x)
#         print('-------------------')
#         if i==42:
#             break
#     print(nocALL)
    try:
        df__.insert(1,'NOC',nocALL)
        df__.insert(1,'Nation',xALL)#,'Nation',xALL})
    except:
        print('done allready')
    return df__

df_GDP=changeDF_country(df_GDP,df2)

nothing found for World
nothing found for DR Congo
nothing found for Bolivia
nothing found for Macau
nothing found for Congo
nothing found for North Macedonia
nothing found for New Caledonia
nothing found for French Polynesia
nothing found for Eswatini
nothing found for Greenland
nothing found for Curaçao
nothing found for East Timor
nothing found for Zanzibar
nothing found for British Virgin Islands
nothing found for Northern Mariana Islands
nothing found for Saint Kitts and Nevis
nothing found for Saint Vincent and the Grenadines
nothing found for Sint Maarten
nothing found for São Tomé and Príncipe
nothing found for Anguilla
nothing found for Montserrat

Population table

The same thing as above but for the population data

df_population = pd.read_excel('CountryData.xlsx',sheet_name=0)
df_population.head(5)

	Rank	Country / Dependency	UN Region	Population	Percentage of the world	Date	Source (official or from the United Nations)	Notes	Column9
0	–	World	NaN	7965207000	1.0000	2022-07-20	UN projection[2]	NaN	NaN
1	1	China	Asia	1412600000	0.1770	2021-12-31	National annual estimate	The population figure refers to mainland China...	NaN
2	2	India	Asia	1373761000	0.1720	2022-03-01	Annual national estimate	The figure includes the population of Indian-a...	NaN
3	3	United States	Americas	332906919	0.0418	2022-07-20	National population clock	The figure includes the 50 states and the Dist...	NaN
4	4	Indonesia	Asia	272248500	0.0342	2021-07-01	National annual estimate	NaN	NaN


df_population.drop(columns=['Percentage of the world','Date','Source (official or from the United Nations)','Column9','Notes'],inplace=True)
df_population=df_population.rename(columns={'Country / Dependency':'Country','UN Region':'Continent'},errors='raise')

df_population

	Rank	Country	Continent	Population
0	–	World	NaN	7965207000
1	1	China	Asia	1412600000
2	2	India	Asia	1373761000
3	3	United States	Americas	332906919
4	4	Indonesia	Asia	272248500
...	...	...	...	...
237	–	Niue	Oceania	1549
238	–	Tokelau (New Zealand)	Oceania	1501
239	195	Vatican City	Europe	825
240	–	Cocos (Keeling) Islands (Australia)	Oceania	573
241	–	Pitcairn Islands (United Kingdom)	Oceania	40

242 rows × 4 columns

df_population=changeDF_country(df_population,df2)

nothing found for World
nothing found for DR Congo
nothing found for Bolivia
nothing found for Hong Kong (China)
nothing found for Congo
nothing found for Puerto Rico (United States)
nothing found for North Macedonia
nothing found for East Timor
nothing found for Eswatini
nothing found for Macau (China)
nothing found for Western Sahara
nothing found for Northern Cyprus
nothing found for Transnistria
nothing found for French Polynesia (France)
nothing found for New Caledonia (France)
nothing found for Abkhazia
nothing found for São Tomé and Príncipe
nothing found for Guam (United States)
nothing found for Curaçao (Netherlands)
nothing found for Artsakh
nothing found for Aruba (Netherlands)
nothing found for Saint Vincent and the Grenadines
nothing found for Jersey (British Crown Dependency)
nothing found for U.S. Virgin Islands (United States)
nothing found for Isle of Man (British Crown Dependency)
nothing found for Cayman Islands (United Kingdom)
nothing found for Bermuda (United Kingdom)
nothing found for Guernsey (British Crown Dependency)
nothing found for Greenland (Denmark)
nothing found for Saint Kitts and Nevis
nothing found for Faroe Islands (Denmark)
nothing found for South Ossetia
nothing found for American Samoa (United States)
nothing found for Northern Mariana Islands (United States)
nothing found for Turks and Caicos Islands (United Kingdom)
nothing found for Sint Maarten (Netherlands)
nothing found for Gibraltar (United Kingdom)
nothing found for Saint Martin (France)
nothing found for Åland (Finland)
nothing found for British Virgin Islands (United Kingdom)
nothing found for Anguilla (United Kingdom)
nothing found for Wallis and Futuna (France)
nothing found for Saint Barthélemy (France)
nothing found for Saint Helena, Ascension and Tristan da Cunha (United Kingdom)
nothing found for Saint Pierre and Miquelon (France)
nothing found for Montserrat (United Kingdom)
nothing found for Falkland Islands (United Kingdom)
nothing found for Christmas Island (Australia)
nothing found for Norfolk Island (Australia)
nothing found for Niue
nothing found for Tokelau (New Zealand)
nothing found for Vatican City
nothing found for Cocos (Keeling) Islands (Australia)
nothing found for Pitcairn Islands (United Kingdom)

Combine everything for one country table

Now we merge all data sets together. I’ll use pandas but this is the same as a SQL left outer join. Because we want to keep all the NOC values in df2 we start with that then join each one after

df_country=[]
df_country=df2.merge(df_population,left_on='NOC',right_on='NOC')

df_country=df_country.merge(df_GDP,left_on='Nation',right_on='Nation')
df_country.drop(columns=['Continent_y','Country_x','Country_y','NOC_y','Rank','notes','region'],inplace=True)
df_country=df_country.rename(columns={'NOC_x':'NOC','Continent_x':'Continent'},errors='raise')

# We have multiple versions of Nation so we just take the max one
df_country = df_country.groupby('NOC').max()

df_country.to_csv('country')

df.to_csv("athlete_events.csv")
df2.to_csv("noc_regions.csv")