US - Baby Names

Introduction:

We are going to use a subset of the US Baby Names dataset from Kaggle.
The file contains names from 2004 until 2014.

Step 1. Import the necessary libraries

In [29]:

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called baby_names.

In [30]:

url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv'
baby_names = pd.read_csv(url,sep=',')

Step 4. See the first 10 entries

In [31]:

baby_names.head(10)

Out[31]:

	Unnamed: 0	Id	Name	Year	Gender	State	Count
0	11349	11350	Emma	2004	F	AK	62
1	11350	11351	Madison	2004	F	AK	48
2	11351	11352	Hannah	2004	F	AK	46
3	11352	11353	Grace	2004	F	AK	44
4	11353	11354	Emily	2004	F	AK	41
5	11354	11355	Abigail	2004	F	AK	37
6	11355	11356	Olivia	2004	F	AK	33
7	11356	11357	Isabella	2004	F	AK	30
8	11357	11358	Alyssa	2004	F	AK	29
9	11358	11359	Sophia	2004	F	AK	28
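
The 'Unnamed: 0' column in the preview is what pandas typically produces when a CSV was saved with its row index and then read back without declaring it. The sketch below reproduces this on a small in-memory CSV, so nothing is downloaded; `csv_text`, `default_read`, and `indexed_read` are illustrative names, and the two rows are copied from the preview above.

```python
import io
import pandas as pd

# A tiny CSV whose first header cell is empty, as happens when an index is saved.
csv_text = ",Id,Name,Count\n11349,11350,Emma,62\n11350,11351,Madison,48\n"

# Read as-is: the unnamed first column is labelled 'Unnamed: 0'.
default_read = pd.read_csv(io.StringIO(csv_text))

# Declare the first column as the index: no 'Unnamed: 0' column appears.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

Whether to read the column as an index or delete it afterwards is a matter of taste; the exercise deletes it in the next step.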

Step 5. Delete the column 'Unnamed: 0' and 'Id'

In [32]:

del baby_names['Unnamed: 0']
del baby_names['Id']
baby_names.head()

Out[32]:

	Name	Year	Gender	State	Count
0	Emma	2004	F	AK	62
1	Madison	2004	F	AK	48
2	Hannah	2004	F	AK	46
3	Grace	2004	F	AK	44
4	Emily	2004	F	AK	41
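
As an alternative to two `del` statements, `DataFrame.drop` can remove both columns in one call and returns a new frame instead of modifying the original. A minimal sketch on a hand-made frame (`sample` and `trimmed` are illustrative names; the rows are copied from the preview above):

```python
import pandas as pd

# Toy frame mirroring the first two rows of the dataset.
sample = pd.DataFrame({
    'Unnamed: 0': [11349, 11350],
    'Id': [11350, 11351],
    'Name': ['Emma', 'Madison'],
    'Year': [2004, 2004],
    'Gender': ['F', 'F'],
    'State': ['AK', 'AK'],
    'Count': [62, 48],
})

# drop returns a new frame; sample itself keeps all seven columns.
trimmed = sample.drop(columns=['Unnamed: 0', 'Id'])
print(list(trimmed.columns))
```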

Step 6. Are there more male or female names in the dataset?

In [33]:

baby_names['Gender'].value_counts()

Out[33]:

F    558846
M    457549
Name: Gender, dtype: int64
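
Note that `value_counts()` counts rows (one row per name, year, and state), not babies or distinct names. If the question is read as "which gender has more recorded births" or "more distinct names", other aggregations may fit better. The sketch below contrasts the three readings on made-up rows (the values in `toy` are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Invented rows: Emma appears in two states, Jacob in one.
toy = pd.DataFrame({
    'Name': ['Emma', 'Emma', 'Jacob'],
    'Gender': ['F', 'F', 'M'],
    'Count': [10, 5, 40],
})

rows_per_gender = toy['Gender'].value_counts()              # number of rows
births_per_gender = toy.groupby('Gender')['Count'].sum()    # total of Count
names_per_gender = toy.groupby('Gender')['Name'].nunique()  # distinct names

print(rows_per_gender)
print(births_per_gender)
print(names_per_gender)
```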

Step 7. Group the dataset by name and assign to names

In [59]:

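# Goal: total number of occurrences per name.
# Year is numeric, so a plain sum() would add the years together as well;
# aggregate only the Count column instead of deleting Year.
names = baby_names.groupby('Name')[['Count']].sum()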
names.sort_values(by='Count',ascending=False).head()

Out[59]:

	Count
Name
Jacob	242874
Emma	214852
Michael	214405
Ethan	209277
Isabella	204798
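
The grouping step can be checked on a small example. Depending on the pandas version, a bare `groupby(...).sum()` may also sum Year or try to aggregate the text columns, so selecting Count first keeps the result to a single meaningful column. A sketch with invented rows (`toy` and `totals` are illustrative names):

```python
import pandas as pd

# Invented rows: Emma appears in two years.
toy = pd.DataFrame({
    'Name': ['Emma', 'Emma', 'Jacob'],
    'Year': [2004, 2005, 2004],
    'Count': [62, 70, 90],
})

# Double brackets keep the result a DataFrame with one column, Count.
totals = toy.groupby('Name')[['Count']].sum()
print(totals.sort_values(by='Count', ascending=False))
```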

Step 8. How many different names exist in the dataset?

In [60]:

len(names)

Out[60]:

17632

Step 9. What is the name with most occurrences?

In [61]:

names.idxmax()

Out[61]:

Count    Jacob
dtype: object
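
Called on a DataFrame, `idxmax()` returns one label per column, which is why the output above is a Series. Selecting the Count column first returns the bare label, and `nlargest` gives the top rows together with their counts. A sketch on an invented frame (`toy_names` stands in for `names`):

```python
import pandas as pd

# Invented totals indexed by name, shaped like the grouped frame.
toy_names = pd.DataFrame(
    {'Count': [132, 90, 7]},
    index=pd.Index(['Emma', 'Jacob', 'Riot'], name='Name'),
)

top_name = toy_names['Count'].idxmax()    # a single label
top_rows = toy_names.nlargest(2, 'Count')  # the two largest rows

print(top_name)
print(top_rows)
```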

Step 10. How many different names have the least occurrences?

In [62]:

len(names[names['Count']==names['Count'].min()])

Out[62]:
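
The same count can be obtained by summing the boolean mask, since each True counts as 1. A sketch on invented values (`counts` is an illustrative Series, not the real data):

```python
import pandas as pd

# Invented totals: two names share the minimum.
counts = pd.Series([5, 5, 12, 49], index=['A', 'B', 'C', 'D'])

is_min = counts == counts.min()  # True where the value equals the minimum
n_min = int(is_min.sum())        # number of True entries

print(n_min)
```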

Step 11. What is the median name occurrence?

In [64]:

names[names['Count'] == names['Count'].median()]

Out[64]:

	Count
Name
Aishani	49
Alara	49
Alysse	49
Ameir	49
Anely	49
Antonina	49
Aveline	49
Aziah	49
Baily	49
Caleah	49
Carlota	49
Cristine	49
Dahlila	49
Darvin	49
Deante	49
Deserae	49
Devean	49
Elizah	49
Emmaly	49
Emmanuela	49
Envy	49
Esli	49
Fay	49
Gurshaan	49
Hareem	49
Iven	49
Jaice	49
Jaiyana	49
Jamiracle	49
Jelissa	49
...	...
Kyndle	49
Kynsley	49
Leylanie	49
Maisha	49
Malillany	49
Mariann	49
Marquell	49
Maurilio	49
Mckynzie	49
Mehdi	49
Nabeel	49
Nalleli	49
Nassir	49
Nazier	49
Nishant	49
Rebecka	49
Reghan	49
Ridwan	49
Riot	49
Rubin	49
Ryatt	49
Sameera	49
Sanjuanita	49
Shalyn	49
Skylie	49
Sriram	49
Trinton	49
Vita	49
Yoni	49
Zuleima	49

66 rows × 1 columns
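
The filter above lists the names whose total equals the median; the median itself is just `names['Count'].median()`. One caveat worth illustrating: with an even number of rows the median is the mean of the two middle values, so it may not match any row and the equality filter can come back empty. A sketch with invented values (`even` and `odd` are illustrative names):

```python
import pandas as pd

# Even number of values: the median falls between 11 and 49.
even = pd.Series([5, 11, 49, 337])
even_median = even.median()
even_matches = even[even == even_median]

# Odd number of values: the median is one of the values.
odd = pd.Series([5, 49, 337])
odd_median = odd.median()
odd_matches = odd[odd == odd_median]

print(even_median, len(even_matches))
print(odd_median, len(odd_matches))
```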

Step 12. What is the standard deviation of name occurrences?

In [69]:

names.Count.std()

Out[69]:

11006.06946789057

Step 13. Get a summary with the mean, min, max, std and quartiles.

In [70]:

names.describe()

Out[70]:

	Count
count	17632.000000
mean	2008.932169
std	11006.069468
min	5.000000
25%	11.000000
50%	49.000000
75%	337.000000
max	242874.000000
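
The rows of `describe()` correspond to the individual statistics computed in the earlier steps: the `50%` row is the median and the `std` row is the sample standard deviation. A sketch confirming this on invented values (`toy` and `summary` are illustrative names):

```python
import pandas as pd

# Invented totals, shaped like the grouped frame.
toy = pd.DataFrame({'Count': [5, 11, 49, 337]})

summary = toy.describe()

# Individual statistics for comparison with the summary rows.
median_value = toy['Count'].median()
std_value = toy['Count'].std()

print(summary)
print(median_value, std_value)
```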
