[Pandas] Pandas03 - Occupation Solution


Occupation

Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Step 1. Import the necessary libraries

In [1]:
import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called users.

In [13]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'
users = pd.read_csv(url,sep='|',index_col='user_id')
users.head()
Out[13]:
         age gender  occupation zip_code
user_id
1         24      M  technician    85711
2         53      F       other    94043
3         23      M      writer    32067
4         24      M  technician    43537
5         33      F       other    15213
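
The effect of `sep='|'` and `index_col='user_id'` can be checked without network access. The following is a minimal sketch that feeds `pd.read_csv` an in-memory string instead of the URL; the names `sample` and `sample_users` are illustrative only, and the five rows are copied from the `head()` output above.

```python
import io
import pandas as pd

# Five pipe-delimited rows copied from the head() output above
# (illustrative sample, not the full dataset)
sample = """user_id|age|gender|occupation|zip_code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
"""

# sep='|' splits each line on the pipe character;
# index_col='user_id' turns that column into the row index
sample_users = pd.read_csv(io.StringIO(sample), sep='|', index_col='user_id')
print(sample_users.shape)
print(sample_users.index.name)
```

With `user_id` as the index, only the four remaining fields are left as columns.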

Step 4. Discover what is the mean age per occupation

In [14]:
# Group by occupation,
# then take the mean of the age column within each group.
users.groupby('occupation')['age'].mean()
Out[14]:
occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64

Step 5. Discover the Male ratio per occupation and sort it from the most to the least

In [24]:
def gender2num(x):
    # Encode gender as a number so it can be summed: M -> 1, F -> 0
    if x == 'M':
        return 1
    if x == 'F':
        return 0


users['gender_n'] = users['gender'].apply(gender2num)

# Number of men per occupation divided by total users per occupation, as a percentage
a = users.groupby('occupation')['gender_n'].sum() / users.occupation.value_counts() * 100
a.sort_values(ascending=False)
Out[24]:
doctor           100.000000
engineer          97.014925
technician        96.296296
retired           92.857143
programmer        90.909091
executive         90.625000
scientist         90.322581
entertainment     88.888889
lawyer            83.333333
salesman          75.000000
educator          72.631579
student           69.387755
other             65.714286
marketing         61.538462
writer            57.777778
none              55.555556
administrator     54.430380
artist            53.571429
librarian         43.137255
healthcare        31.250000
homemaker         14.285714
dtype: float64
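
The same ratio can probably be obtained without a helper column, because the mean of a boolean Series is the fraction of `True` values. The sketch below is one possible alternative, not the approach used above; it runs on a small made-up `toy` frame rather than the real dataset.

```python
import pandas as pd

# Hypothetical toy data, not the real u.user dataset
toy = pd.DataFrame({
    'occupation': ['doctor', 'doctor', 'artist', 'artist', 'artist', 'artist'],
    'gender':     ['M', 'M', 'F', 'M', 'F', 'F'],
})

# eq('M') gives True/False; the group mean of booleans is the share of True
male_ratio = (
    toy['gender'].eq('M')
    .groupby(toy['occupation'])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(male_ratio)
```

In this toy frame both doctors are male and one of four artists is male, so `doctor` sorts first.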

Step 6. For each occupation, calculate the minimum and maximum ages

In [32]:
users.groupby('occupation')['age'].agg(['min','max'])
Out[32]:
               min  max
occupation
administrator   21   70
artist          19   48
doctor          28   64
educator        23   63
engineer        22   70
entertainment   15   50
executive       22   69
healthcare      22   62
homemaker       20   50
lawyer          21   53
librarian       23   69
marketing       24   55
none            11   55
other           13   64
programmer      20   63
retired         51   73
salesman        18   66
scientist       23   55
student          7   42
technician      21   55
writer          18   60
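
If more descriptive column names than `min` and `max` are wanted, pandas' named aggregation (keyword arguments passed to `agg`) is one option. This is a sketch on made-up data; the names `toy`, `min_age`, and `max_age` are illustrative and do not come from the original notebook.

```python
import pandas as pd

# Hypothetical toy data, not the real u.user dataset
toy = pd.DataFrame({
    'occupation': ['student', 'student', 'retired', 'retired'],
    'age':        [7, 42, 51, 73],
})

# Each keyword becomes an output column; its value names the aggregation
age_range = toy.groupby('occupation')['age'].agg(min_age='min', max_age='max')
print(age_range)
```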

Step 7. For each combination of occupation and gender, calculate the mean age

In [34]:
users.groupby(['occupation','gender'])['age'].mean()
Out[34]:
occupation     gender
administrator  F         40.638889
               M         37.162791
artist         F         30.307692
               M         32.333333
doctor         M         43.571429
educator       F         39.115385
               M         43.101449
engineer       F         29.500000
               M         36.600000
entertainment  F         31.000000
               M         29.000000
executive      F         44.000000
               M         38.172414
healthcare     F         39.818182
               M         45.400000
homemaker      F         34.166667
               M         23.000000
lawyer         F         39.500000
               M         36.200000
librarian      F         40.000000
               M         40.000000
marketing      F         37.200000
               M         37.875000
none           F         36.500000
               M         18.600000
other          F         35.472222
               M         34.028986
programmer     F         32.166667
               M         33.216667
retired        F         70.000000
               M         62.538462
salesman       F         27.000000
               M         38.555556
scientist      F         28.333333
               M         36.321429
student        F         20.750000
               M         22.669118
technician     F         38.000000
               M         32.961538
writer         F         37.631579
               M         35.346154
Name: age, dtype: float64
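
The result above is a Series with a two-level index (occupation, gender). When a side-by-side comparison is easier to read, `unstack` can pivot the inner level into columns. The sketch below uses made-up data; `toy`, `mean_age`, and `wide` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data, not the real u.user dataset
toy = pd.DataFrame({
    'occupation': ['doctor', 'doctor', 'artist', 'artist', 'artist'],
    'gender':     ['M', 'M', 'F', 'F', 'M'],
    'age':        [40, 50, 30, 32, 28],
})

# Series indexed by (occupation, gender)
mean_age = toy.groupby(['occupation', 'gender'])['age'].mean()

# Move the gender level into columns: one row per occupation, one column per gender
wide = mean_age.unstack('gender')
print(wide)
```

A combination that does not occur in the data, such as female doctors here, shows up as `NaN` in the wide table.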

Step 8. For each occupation present the percentage of women and men

In [47]:
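# First group by occupation and gender and count the men and women in each occupation.
a = users.groupby(['occupation','gender']).agg({'gender':'count'})
# Then count the rows of every column per occupation, i.e. the total number of users in each occupation.
b = users.groupby(['occupation']).agg('count')
# Divide a by b, aligning on the occupation level. The result is a fraction, so multiply by 100.
c = a.div(b,level='occupation') * 100
# Look at the gender column only:
# with loc, keep every row and select the gender column.
c.loc[:,'gender']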
Out[47]:
occupation     gender
administrator  F          45.569620
               M          54.430380
artist         F          46.428571
               M          53.571429
doctor         M         100.000000
educator       F          27.368421
               M          72.631579
engineer       F           2.985075
               M          97.014925
entertainment  F          11.111111
               M          88.888889
executive      F           9.375000
               M          90.625000
healthcare     F          68.750000
               M          31.250000
homemaker      F          85.714286
               M          14.285714
lawyer         F          16.666667
               M          83.333333
librarian      F          56.862745
               M          43.137255
marketing      F          38.461538
               M          61.538462
none           F          44.444444
               M          55.555556
other          F          34.285714
               M          65.714286
programmer     F           9.090909
               M          90.909091
retired        F           7.142857
               M          92.857143
salesman       F          25.000000
               M          75.000000
scientist      F           9.677419
               M          90.322581
student        F          30.612245
               M          69.387755
technician     F           3.703704
               M          96.296296
writer         F          42.222222
               M          57.777778
Name: gender, dtype: float64
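
A shorter route to the same kind of percentages may be `pd.crosstab` with `normalize='index'`, which divides each row by its row total. This is a sketch of an alternative on made-up data, not the method used above; `toy` and `pct` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data, not the real u.user dataset
toy = pd.DataFrame({
    'occupation': ['doctor', 'doctor', 'artist', 'artist', 'artist', 'artist'],
    'gender':     ['M', 'M', 'F', 'M', 'F', 'F'],
})

# Count occupation x gender pairs, then normalize each row to sum to 1
pct = pd.crosstab(toy['occupation'], toy['gender'], normalize='index') * 100
print(pct)
```

One difference from the groupby version: a missing combination appears as 0 in the crosstab instead of being left out of the result.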

