[Pandas] Pandas06 - US Baby Names Solutions

US - Baby Names

Introduction:

We will use a subset of the US Baby Names dataset from Kaggle.
The file contains names from 2004 through 2014.

Step 1. Import the necessary libraries

In [29]:
import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called baby_names.

In [30]:
url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv'
baby_names = pd.read_csv(url,sep=',')

Step 4. See the first 10 entries

In [31]:
baby_names.head(10)
Out[31]:
   Unnamed: 0     Id      Name  Year  Gender  State  Count
0       11349  11350      Emma  2004       F     AK     62
1       11350  11351   Madison  2004       F     AK     48
2       11351  11352    Hannah  2004       F     AK     46
3       11352  11353     Grace  2004       F     AK     44
4       11353  11354     Emily  2004       F     AK     41
5       11354  11355   Abigail  2004       F     AK     37
6       11355  11356    Olivia  2004       F     AK     33
7       11356  11357  Isabella  2004       F     AK     30
8       11357  11358    Alyssa  2004       F     AK     29
9       11358  11359    Sophia  2004       F     AK     28

Step 5. Delete the column 'Unnamed: 0' and 'Id'

In [32]:
del baby_names['Unnamed: 0']
del baby_names['Id']
baby_names.head()
Out[32]:
      Name  Year  Gender  State  Count
0     Emma  2004       F     AK     62
1  Madison  2004       F     AK     48
2   Hannah  2004       F     AK     46
3    Grace  2004       F     AK     44
4    Emily  2004       F     AK     41
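
As an aside, `DataFrame.drop` removes both columns in one call and returns a new DataFrame instead of modifying the original. The sketch below is only an illustration: it reads a two-row CSV string (built from the first two rows shown above) rather than the real file, and the names `csv_text`, `toy`, and `trimmed` are made up for this example.

```python
import io
import pandas as pd

# Two rows in the same layout as the real file; the header starts with an
# empty field, which pandas reads in as the column 'Unnamed: 0'.
csv_text = """,Id,Name,Year,Gender,State,Count
11349,11350,Emma,2004,F,AK,62
11350,11351,Madison,2004,F,AK,48
"""

toy = pd.read_csv(io.StringIO(csv_text))

# drop() returns a new DataFrame; `toy` itself keeps all of its columns.
trimmed = toy.drop(columns=['Unnamed: 0', 'Id'])

print(list(trimmed.columns))  # ['Name', 'Year', 'Gender', 'State', 'Count']
```

`del` changes the DataFrame in place, so re-running the cell raises a `KeyError`; the `drop` version can be re-run safely as long as the result is assigned to a new name.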

Step 6. Are there more male or female names in the dataset?

In [33]:
baby_names['Gender'].value_counts()
Out[33]:
F    558846
M    457549
Name: Gender, dtype: int64
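
Note that `value_counts()` counts rows (one row per name, year, and state), not babies. If the question were about the number of births, the `Count` column would have to be summed per gender instead. A minimal sketch of the difference, using a made-up three-row table (`toy` and its values are not from the dataset):

```python
import pandas as pd

# Hypothetical toy data: two female rows with small counts, one male row
# with a large count.
toy = pd.DataFrame({
    'Name': ['Emma', 'Olivia', 'Jacob'],
    'Gender': ['F', 'F', 'M'],
    'Count': [10, 5, 40],
})

# Number of rows per gender: F has more rows.
rows_per_gender = toy['Gender'].value_counts()

# Number of births per gender: M has the larger total.
births_per_gender = toy.groupby('Gender')['Count'].sum()

print(rows_per_gender.idxmax())    # F
print(births_per_gender.idxmax())  # M
```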

Step 7. Group the dataset by name and assign to names

In [59]:
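# Goal: the total count per name.
# Year is numeric, so a plain sum() would add it up as well.
# Selecting only Count avoids having to run `del baby_names['Year']` first.
names = baby_names.groupby('Name')[['Count']].sum()
names.sort_values(by='Count', ascending=False).head()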
Out[59]:
           Count
Name
Jacob     242874
Emma      214852
Michael   214405
Ethan     209277
Isabella  204798
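
To see why the `Year` column gets in the way, here is a small sketch on a made-up table (`toy` and its values are hypothetical). Summing every numeric column also adds the years together, which is meaningless; restricting the aggregation to `Count` gives the intended result and leaves the original table untouched.

```python
import pandas as pd

# Hypothetical toy data with only the numeric columns that matter here.
toy = pd.DataFrame({
    'Name': ['Emma', 'Emma', 'Jacob'],
    'Year': [2004, 2005, 2004],
    'Count': [62, 50, 70],
})

# Summing everything also sums Year (2004 + 2005 = 4009 for Emma).
summed_all = toy.groupby('Name').sum()

# Summing only Count keeps the result meaningful.
names_toy = toy.groupby('Name')[['Count']].sum()

print(summed_all.loc['Emma', 'Year'])   # 4009
print(names_toy.loc['Emma', 'Count'])   # 112
```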

Step 8. How many different names exist in the dataset?

In [60]:
len(names)
Out[60]:
17632

Step 9. What is the name with most occurrences?

In [61]:
names.idxmax()
Out[61]:
Count    Jacob
dtype: object
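
`idxmax()` on a DataFrame returns one label per column, which is why the output above is a Series. Calling it on the `Count` column returns the name itself. A small sketch, assuming a two-row stand-in for `names` (the two counts are taken from the Step 7 output; `toy_names` is a made-up variable):

```python
import pandas as pd

# Stand-in for `names`: Count indexed by Name.
toy_names = pd.DataFrame(
    {'Count': [242874, 214852]},
    index=pd.Index(['Jacob', 'Emma'], name='Name'),
)

per_column = toy_names.idxmax()          # Series: one label per column
top_name = toy_names['Count'].idxmax()   # plain label
top_count = toy_names.loc[top_name, 'Count']

print(top_name, top_count)  # Jacob 242874
```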

Step 10. How many different names have the least occurrences?

In [62]:
len(names[names['Count']==names['Count'].min()])
Out[62]:
2578
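
The same count can also be obtained without building the filtered DataFrame, because summing a boolean mask counts its `True` values. A sketch on made-up data (`toy_names` and its index labels are hypothetical):

```python
import pandas as pd

# Hypothetical counts: two names share the minimum value 5.
toy_names = pd.DataFrame({'Count': [5, 5, 7, 49]}, index=['A', 'B', 'C', 'D'])

# True where a name has the smallest count.
is_rarest = toy_names['Count'] == toy_names['Count'].min()

# Summing booleans counts the True entries.
n_rarest = int(is_rarest.sum())

print(n_rarest)  # 2
```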

Step 11. What is the median name occurrence?

In [64]:
names[names['Count'] == names['Count'].median()]
Out[64]:
            Count
Name
Aishani        49
Alara          49
Alysse         49
Ameir          49
Anely          49
Antonina       49
Aveline        49
Aziah          49
Baily          49
Caleah         49
Carlota        49
Cristine       49
Dahlila        49
Darvin         49
Deante         49
Deserae        49
Devean         49
Elizah         49
Emmaly         49
Emmanuela      49
Envy           49
Esli           49
Fay            49
Gurshaan       49
Hareem         49
Iven           49
Jaice          49
Jaiyana        49
Jamiracle      49
Jelissa        49
...           ...
Kyndle         49
Kynsley        49
Leylanie       49
Maisha         49
Malillany      49
Mariann        49
Marquell       49
Maurilio       49
Mckynzie       49
Mehdi          49
Nabeel         49
Nalleli        49
Nassir         49
Nazier         49
Nishant        49
Rebecka        49
Reghan         49
Ridwan         49
Riot           49
Rubin          49
Ryatt          49
Sameera        49
Sanjuanita     49
Shalyn         49
Skylie         49
Sriram         49
Trinton        49
Vita           49
Yoni           49
Zuleima        49

66 rows × 1 columns
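
The median itself is just `names['Count'].median()`; the filter above lists the names whose count equals it. That filter only finds rows when the median is a value that actually occurs in the data. With an even number of rows the median is the average of the two middle values and may match nothing. A sketch on made-up data (`odd`, `even`, and their labels are hypothetical):

```python
import pandas as pd

# Odd number of rows: the median is the middle value and appears in the data.
odd = pd.DataFrame({'Count': [5, 49, 337]}, index=['A', 'B', 'C'])
median_odd = odd['Count'].median()                    # 49.0
matches_odd = odd[odd['Count'] == median_odd]         # one row

# Even number of rows: the median is (11 + 49) / 2 = 30.0, which no row has.
even = pd.DataFrame({'Count': [5, 11, 49, 337]}, index=['A', 'B', 'C', 'D'])
median_even = even['Count'].median()                  # 30.0
matches_even = even[even['Count'] == median_even]     # empty

print(median_odd, len(matches_odd))    # 49.0 1
print(median_even, len(matches_even))  # 30.0 0
```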

Step 12. What is the standard deviation of the name counts?

In [69]:
names.Count.std()
Out[69]:
11006.06946789057
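
Keep in mind that pandas computes the sample standard deviation (`ddof=1`) by default, while `numpy.std` defaults to the population version (`ddof=0`), so the two give slightly different numbers on the same data. A sketch on made-up values (`counts` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical counts, only to compare the two conventions.
counts = pd.Series([5, 11, 49, 337])

sample_std = counts.std()                # pandas default: ddof=1
population_std = counts.std(ddof=0)      # divide by n instead of n - 1
numpy_std = np.std(counts.to_numpy())    # NumPy default: ddof=0

# The sample version is always the larger of the two for n > 1.
print(sample_std > population_std)             # True
print(np.isclose(population_std, numpy_std))   # True
```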

Step 13. Get a summary with the mean, min, max, std and quartiles.

In [70]:
names.describe()
Out[70]:
               Count
count   17632.000000
mean     2008.932169
std     11006.069468
min         5.000000
25%        11.000000
50%        49.000000
75%       337.000000
max    242874.000000

