Olympics data with SQL and pandas- creating a country table

Creating a country table of the Olympics

Author

Thomas H. Simm

Overview

The country table needed extra analysis so I seperated it from the rest of the analysis.

It also requires importing some new data, which I will add here too

import pandas as pd
from pandasql import sqldf
import matplotlib.pyplot as plt
import re 
df= pd.read_csv("athlete_events.csv")
df2=pd.read_csv("noc_regions.csv")

The next line is just to add a unique ID for athletes when the data is split up later

df= df.reset_index()
df.rename(columns={'index':'event_athlete_ID','ID':'athlete_ID'},inplace=True)
df.head()
event_athlete_ID athlete_ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 0 1 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball NaN
1 1 2 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight NaN
2 2 3 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football NaN
3 3 4 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres NaN
df2.head()
NOC region notes
0 AFG Afghanistan NaN
1 AHO Curacao Netherlands Antilles
2 ALB Albania NaN
3 ALG Algeria NaN
4 AND Andorra NaN

What do we use to value to identify a nation

print('There are {} unique teams and {} unique NOCs in df.\n     \
And {} unique NOC values, {} unique regions and {} unique notes in df2.'.format( \
    len(pd.unique(df.Team)),len(pd.unique(df.NOC)), 
    len(pd.unique(df2.region)),len(pd.unique(df2.region)),len(pd.unique(df2.notes)) ))
There are 1184 unique teams and 230 unique NOCs in df.
     And 207 unique NOC values, 207 unique regions and 22 unique notes in df2.

1184 Seems a lot of teams to consider, it may be best to stick with using NOC as a unique identifier for a country. We can then probably use regions as the name of the country.

Let’s have a look at the values of teams,NOC and regions

sqldf("SELECT                                \
         NOC,                                \
         Team,                               \
         count(*)                            \
       FROM                                  \
         df                                  \
       GROUP BY                              \
         Team, NOC                           \
       ORDER BY team DESC                    \
       LIMIT 30;",locals())
NOC Team count(*)
0 FIN rn-2 5
1 BEL Zut 3
2 ZIM Zimbabwe 309
3 GRE Zefyros 2
4 ZAM Zambia 183
5 YUG Yugoslavia-2 10
6 YUG Yugoslavia-1 10
7 YUG Yugoslavia 2558
8 SUI Ylliam VIII 5
9 SUI Ylliam VII 6
10 SUI Ylliam II 5
11 GBR Yeoman XII 3
12 GBR Yeoman VII 3
13 RSA Yeoman V 3
14 BAH Yeoman 4
15 YEM Yemen 32
16 MYA Yangon 2
17 MEX Xolotl 3
18 FIN Xantippa 3
19 GBR Wolseley-Siddeley-1 4
20 CAN Winnipeg Shamrocks-1 12
21 CAN Windor 2
22 NED Willem-Six 3
23 ARG Wiking 5
24 USA Widgeon 2
25 FRA Whitini Star 1
26 DEN White Lady 3
27 JPN Whisper 1
28 CAN Whirlaway 2
29 USA Western Rowing Club-3 6

Team in df seems to not reflect the country very well. e.g. Whisper is not a country but JPN probably represents Japan.

So the use of NOC seems to make sense

Now let us consider the NOC, region and notes variables

sqldf("SELECT                                \
         NOC,                                \
         Region,                             \
         Notes,                              \
         count(*)                            \
       FROM                                  \
         df2                                 \
       GROUP BY                              \
         NOC, Region, Notes                  \
       ORDER BY Region DESC                  \
       LIMIT 30;",locals())
NOC region notes count(*)
0 RHO Zimbabwe None 1
1 ZIM Zimbabwe None 1
2 ZAM Zambia None 1
3 YAR Yemen North Yemen 1
4 YEM Yemen None 1
5 YMD Yemen South Yemen 1
6 ISV Virgin Islands, US Virgin Islands 1
7 IVB Virgin Islands, British None 1
8 VIE Vietnam None 1
9 VNM Vietnam None 1
10 VEN Venezuela None 1
11 VAN Vanuatu None 1
12 UZB Uzbekistan None 1
13 URU Uruguay None 1
14 UAE United Arab Emirates None 1
15 UKR Ukraine None 1
16 UGA Uganda None 1
17 USA USA None 1
18 GBR UK None 1
19 TKM Turkmenistan None 1
20 TUR Turkey None 1
21 TUN Tunisia None 1
22 TTO Trinidad Trinidad and Tobago 1
23 WIF Trinidad West Indies Federation 1
24 TGA Tonga None 1
25 TOG Togo None 1
26 TLS Timor-Leste None 1
27 THA Thailand None 1
28 TAN Tanzania None 1
29 TJK Tajikistan None 1

NOC doesn’t look unique enough for us. For example, - Zimbabwe is RHO and ZIM and this probably just reflects a name change in the country - Yemen has 3 NOC values, presumably reflecting the unification that took place in 1990. Most of these differences are not that important to what I am looking at, where it would be better to have a broader description of a nation

So what I will do is - use the region tag as a unique identifier of a country - replace multiple NOC values of a single country with a single NOC value

The first step is to identify which regions have multiple NOC values. We can do this again in SQL by creating a new table

d1=sqldf("SELECT                               \
            NOC,                               \
            region,                            \
            notes,                             \
            count(*)                           \
         FROM                                  \
            df2                                \
         GROUP BY                              \
            region                             \
         HAVING COUNT(*)>1                     \
         ORDER BY count(*) DESC;",locals())
d1
NOC region notes count(*)
0 FRG Germany None 4
1 YAR Yemen North Yemen 3
2 SCG Serbia Serbia and Montenegro 3
3 EUN Russia None 3
4 MAL Malaysia None 3
5 BOH Czech Republic Bohemia 3
6 ROT None Refugee Olympic Team 3
7 RHO Zimbabwe None 2
8 VIE Vietnam None 2
9 TTO Trinidad Trinidad and Tobago 2
10 SYR Syria None 2
11 CRT Greece Crete 2
12 CHN China None 2
13 CAN Canada None 2
14 ANZ Australia Australasia 2

We can then create a table to get - the country, - the new NOC value for each country - all the NOC values that correspond to that country

d2=sqldf("SELECT                            \
            d1.NOC      AS new_NOC,         \
            df2.NOC     AS orig_NOC,        \
            df2.region,                     \
            df2.notes                       \
          FROM                              \
            d1                              \
          LEFT JOIN                         \
            df2                             \
          ON                                \
            d1.region=df2.region            \
          ORDER BY                          \
            df2.region DESC;",locals())
d2
new_NOC orig_NOC region notes
0 RHO RHO Zimbabwe None
1 RHO ZIM Zimbabwe None
2 YAR YAR Yemen North Yemen
3 YAR YEM Yemen None
4 YAR YMD Yemen South Yemen
5 VIE VIE Vietnam None
6 VIE VNM Vietnam None
7 TTO TTO Trinidad Trinidad and Tobago
8 TTO WIF Trinidad West Indies Federation
9 SYR SYR Syria None
10 SYR UAR Syria United Arab Republic
11 SCG SCG Serbia Serbia and Montenegro
12 SCG SRB Serbia None
13 SCG YUG Serbia Yugoslavia
14 EUN EUN Russia None
15 EUN RUS Russia None
16 EUN URS Russia None
17 MAL MAL Malaysia None
18 MAL MAS Malaysia None
19 MAL NBO Malaysia North Borneo
20 CRT CRT Greece Crete
21 CRT GRE Greece None
22 FRG FRG Germany None
23 FRG GDR Germany None
24 FRG GER Germany None
25 FRG SAA Germany None
26 BOH BOH Czech Republic Bohemia
27 BOH CZE Czech Republic None
28 BOH TCH Czech Republic None
29 CHN CHN China None
30 CHN HKG China Hong Kong
31 CAN CAN Canada None
32 CAN NFL Canada Newfoundland
33 ANZ ANZ Australia Australasia
34 ANZ AUS Australia None
35 ROT None None None

And finally replace the values of NOC in df and df2 with the new values for countries with duplicate values

Easier to do this with Python

for i,old_NOC in enumerate(d2.orig_NOC):
    df.loc[df.NOC==old_NOC,'NOC']=d2.loc[i,'new_NOC']
    df2.loc[df2.NOC==old_NOC,'NOC']=d2.loc[i,'new_NOC']
    

GDP data

Wikipedia was used to get data on population and GDP of different countries. The data was imported using Excel’s capability to give it a url to obtain the table, and saved as different tabs in the file CountryData.xlsx. For GDP I selected the World Bank Estimate.

Website urls correct as of 23/7/22

df_GDP = pd.read_excel('CountryData.xlsx',sheet_name=2)

df_GDP=df_GDP.drop(columns=['IMF[1][12] Estimate','IMF[1][12] Year','United Nations[13] Estimate','United Nations[13] Year','World Bank[14][15] Year'])
df_GDP=df_GDP.rename(columns={'World Bank[14][15] Estimate':'GDP','Country/Territory':'Country','UN Region':'Continent'},errors='raise')

df_GDP.head(10)
Country Continent GDP
0 World - 84,705,567
1 United States Americas 20,936,600
2 China Asia 14,722,731
3 Japan Asia 4,975,415
4 Germany Europe 3,806,060
5 India Asia 2,622,984
6 United Kingdom Europe 2,707,744
7 France Europe 2,603,004
8 Canada Americas 1,643,408
9 Italy Europe 1,886,445

make column GDP an integer, and remove the comma

df_GDP['GDP']=[x.replace(',','') for x in df_GDP['GDP']]

df_GDP['GDP']=pd.to_numeric(df_GDP['GDP'],errors='coerce').fillna(0).astype('int')

We now need to match the names of countries from the data we have found on the internet with the Olympics data.

A good example of this is United Kingdom, which can be named several ways (with slightly different meanings) including: UK, Great Britain, Great Britain, Great Britain and Northern Ireland, GB.

The function changeDF_country takes the imported dataframes and df2 as inputs and outputs the imported dataframes with corrected NOC and country values

  1. Checks whether country in imported df matches region in df
  • If so country doesn’t need to be changed and can provide matching NOC value
  1. If not above then see if df2.notes matches the country
  • If so can return country and NOC matching the notes that match
  1. If not aboves, try some name changes, like United States to USA
  • Then return the matching NOC and country and NOC matching values
  1. None of the above
  • Then return country given and give NOC a name to show there was no match.
  • Print out the name of the country where no match was found. To see if missing anything big or obvious
def getNation(region_to_check):
       
    if region_to_check=='United States':
        region_out = 'USA'
    elif bool(re.search(r'Germany', region_to_check)):
        region_out='Germany'
    elif region_to_check=='United Kingdom':
        region_out = 'UK'
    elif region_to_check=='Soviet Union':
        region_out='Russia'
    else:
        region_out=region_to_check
        print('nothing found for {}'.format(region_out))
    
    return region_out

def changeDF_country(df__,df2):
    xALL,nocALL=[],[]
    for i in range(len(df__)):
        country_check = df__.loc[i,'Country']
#         print(i,country_check)
        boolCountry=df2.region==country_check
        x=df2[boolCountry].region
        
        try:
            x=str(x.iloc[0])
            df2.loc[boolCountry,'region']=x
        except:
            if len(x)<1:
                boolCountry=df2.notes==df__.loc[i,'Country']
                x=df2[boolCountry].region
                try:
                    x=str(x.iloc[0])
                    df2.loc[boolCountry,'region']=x
                except:
                    if len(x)<1:
                        country_check=getNation(df__.loc[i,'Country'])
                        boolCountry=df2.region==country_check
                        x=df2[boolCountry].region
                        try:
                            x=str(x.iloc[0])
                            df2.loc[boolCountry,'region']=x
                        except:
                            x=country_check
        xALL.append(x)
        try:
            nocALL.append(df2[boolCountry].NOC.iloc[0])
        except:
            nocALL.append('---')
            
#         print('x= ',x)
#         print('-------------------')
#         if i==42:
#             break
#     print(nocALL)
    try:
        df__.insert(1,'NOC',nocALL)
        df__.insert(1,'Nation',xALL)#,'Nation',xALL})
    except:
        print('done allready')
    return df__
        
df_GDP=changeDF_country(df_GDP,df2)
nothing found for World
nothing found for DR Congo
nothing found for Bolivia
nothing found for Macau
nothing found for Congo
nothing found for North Macedonia
nothing found for New Caledonia
nothing found for French Polynesia
nothing found for Eswatini
nothing found for Greenland
nothing found for Curaçao
nothing found for East Timor
nothing found for Zanzibar
nothing found for British Virgin Islands
nothing found for Northern Mariana Islands
nothing found for Saint Kitts and Nevis
nothing found for Saint Vincent and the Grenadines
nothing found for Sint Maarten
nothing found for São Tomé and Príncipe
nothing found for Anguilla
nothing found for Montserrat

Population table

The same thing as above but for the population data

df_population = pd.read_excel('CountryData.xlsx',sheet_name=0)
df_population.head(5)
Rank Country / Dependency UN Region Population Percentage of the world Date Source (official or from the United Nations) Notes Column9
0 World NaN 7965207000 1.0000 2022-07-20 UN projection[2] NaN NaN
1 1 China Asia 1412600000 0.1770 2021-12-31 National annual estimate The population figure refers to mainland China... NaN
2 2 India Asia 1373761000 0.1720 2022-03-01 Annual national estimate The figure includes the population of Indian-a... NaN
3 3 United States Americas 332906919 0.0418 2022-07-20 National population clock The figure includes the 50 states and the Dist... NaN
4 4 Indonesia Asia 272248500 0.0342 2021-07-01 National annual estimate NaN NaN

df_population.drop(columns=['Percentage of the world','Date','Source (official or from the United Nations)','Column9','Notes'],inplace=True)
df_population=df_population.rename(columns={'Country / Dependency':'Country','UN Region':'Continent'},errors='raise')

df_population
Rank Country Continent Population
0 World NaN 7965207000
1 1 China Asia 1412600000
2 2 India Asia 1373761000
3 3 United States Americas 332906919
4 4 Indonesia Asia 272248500
... ... ... ... ...
237 Niue Oceania 1549
238 Tokelau (New Zealand) Oceania 1501
239 195 Vatican City Europe 825
240 Cocos (Keeling) Islands (Australia) Oceania 573
241 Pitcairn Islands (United Kingdom) Oceania 40

242 rows × 4 columns

df_population=changeDF_country(df_population,df2)
nothing found for World
nothing found for DR Congo
nothing found for Bolivia
nothing found for Hong Kong (China)
nothing found for Congo
nothing found for Puerto Rico (United States)
nothing found for North Macedonia
nothing found for East Timor
nothing found for Eswatini
nothing found for Macau (China)
nothing found for Western Sahara
nothing found for Northern Cyprus
nothing found for Transnistria
nothing found for French Polynesia (France)
nothing found for New Caledonia (France)
nothing found for Abkhazia
nothing found for São Tomé and Príncipe
nothing found for Guam (United States)
nothing found for Curaçao (Netherlands)
nothing found for Artsakh
nothing found for Aruba (Netherlands)
nothing found for Saint Vincent and the Grenadines
nothing found for Jersey (British Crown Dependency)
nothing found for U.S. Virgin Islands (United States)
nothing found for Isle of Man (British Crown Dependency)
nothing found for Cayman Islands (United Kingdom)
nothing found for Bermuda (United Kingdom)
nothing found for Guernsey (British Crown Dependency)
nothing found for Greenland (Denmark)
nothing found for Saint Kitts and Nevis
nothing found for Faroe Islands (Denmark)
nothing found for South Ossetia
nothing found for American Samoa (United States)
nothing found for Northern Mariana Islands (United States)
nothing found for Turks and Caicos Islands (United Kingdom)
nothing found for Sint Maarten (Netherlands)
nothing found for Gibraltar (United Kingdom)
nothing found for Saint Martin (France)
nothing found for Åland (Finland)
nothing found for British Virgin Islands (United Kingdom)
nothing found for Anguilla (United Kingdom)
nothing found for Wallis and Futuna (France)
nothing found for Saint Barthélemy (France)
nothing found for Saint Helena, Ascension and Tristan da Cunha (United Kingdom)
nothing found for Saint Pierre and Miquelon (France)
nothing found for Montserrat (United Kingdom)
nothing found for Falkland Islands (United Kingdom)
nothing found for Christmas Island (Australia)
nothing found for Norfolk Island (Australia)
nothing found for Niue
nothing found for Tokelau (New Zealand)
nothing found for Vatican City
nothing found for Cocos (Keeling) Islands (Australia)
nothing found for Pitcairn Islands (United Kingdom)

Combine everything for one country table

Now we merge all data sets together. I’ll use pandas but this is the same as a SQL left outer join. Because we want to keep all the NOC values in df2 we start with that then join each one after

df_country=[]
df_country=df2.merge(df_population,left_on='NOC',right_on='NOC')

df_country=df_country.merge(df_GDP,left_on='Nation',right_on='Nation')
df_country.drop(columns=['Continent_y','Country_x','Country_y','NOC_y','Rank','notes','region'],inplace=True)
df_country=df_country.rename(columns={'NOC_x':'NOC','Continent_x':'Continent'},errors='raise')

# We have multiple versions of Nation so we just take the max one
df_country = df_country.groupby('NOC').max()
df_country.to_csv('country')

df.to_csv("athlete_events.csv")
df2.to_csv("noc_regions.csv")