Group-wise operations with the Pandas DataFrame groupby() function: Split, Apply, Combine

This post explains how to use groupby, one of the most frequently used features of the Pandas DataFrame. If you have experience with SQL, you have probably used GROUP BY very often.

How groupby works is described in detail in the official Pandas documentation. groupby() ① splits the whole dataset into groups, ② applies an aggregate function such as mean(), sum(), or count() to each group, and ③ combines the results back into a single data structure.

Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.
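
The three steps can also be written out by hand to see what groupby() does for you. The following is a minimal sketch, assuming a small toy DataFrame whose column names (`key`, `value`) are made up for illustration and do not come from the article's data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the article's CSV file.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 5],
})

# 1) Split: one sub-DataFrame per distinct key.
pieces = {key: group for key, group in df.groupby("key")}

# 2) Apply: compute a summary statistic on each piece independently.
means = {key: group["value"].mean() for key, group in pieces.items()}

# 3) Combine: assemble the per-group results into one Series.
manual = pd.Series(means, name="value").rename_axis("key")

# The one-line groupby call performs the same three steps internally.
builtin = df.groupby("key")["value"].mean()

print(manual)
print(builtin)
```

In practice the one-line form is preferable; the manual version is only meant to make the split, apply, and combine stages visible.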

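The apply step comes in three kinds: Aggregation, Transformation, and Filtration. Aggregation computes a summary statistic for each split group, such as its mean, variance, or size. Transformation performs a group-specific computation and returns an object indexed like the original data, so the result lines up with the original rows. Filtration evaluates a group-wise condition to True or False and discards the groups that do not satisfy it.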

Aggregation: compute a summary statistic (or statistics) for each group. Some examples:
- Compute group sums or means
- Compute group sizes / counts.
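
As a rough illustration of aggregation, the sketch below computes sums, means, and sizes per group. The `team` and `score` columns are hypothetical names chosen for this example only.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "team": ["x", "x", "y", "y", "y"],
    "score": [10, 20, 30, 40, 50],
})

grouped = df.groupby("team")["score"]

sums = grouped.sum()                # total score per team
means = grouped.mean()              # average score per team
sizes = df.groupby("team").size()   # number of rows per team

# Several statistics at once via agg().
summary = grouped.agg(["sum", "mean", "count"])
print(summary)
```

Each result has one row per group, which is what distinguishes aggregation from the other two kinds of apply step.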

Transformation: perform some group-specific computations and return a like-indexed object. Some examples:
- Standardize data (zscore) within a group.
- Filling NAs within groups with a value derived from each group.
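
Both transformation examples above can be sketched with transform(). This is an illustrative sketch on made-up data (`team`, `score` are hypothetical column names); the z-score here uses the sample standard deviation that `Series.std()` computes by default.

```python
import pandas as pd

# Hypothetical toy data with one missing value in team "y".
df = pd.DataFrame({
    "team": ["x", "x", "x", "y", "y", "y"],
    "score": [1.0, 2.0, 3.0, 10.0, None, 30.0],
})

grouped = df.groupby("team")["score"]

# Fill missing values with the mean of the group they belong to.
filled = grouped.transform(lambda s: s.fillna(s.mean()))

# Standardize within each group (z-score with the sample standard deviation).
zscore = grouped.transform(lambda s: (s - s.mean()) / s.std())

# Both results keep the original row index, so they can be added as columns.
df["score_filled"] = filled
df["score_z"] = zscore
print(df)
```

Unlike aggregation, the output has as many rows as the input, which is what "like-indexed" means here.
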

Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:
- Discard data that belongs to groups with only a few members.
- Filter out data based on the group sum or mean.
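
The two filtration examples might look like the following sketch. The data and the thresholds (at least two members, mean above 4) are arbitrary assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: team "y" has a single member.
df = pd.DataFrame({
    "team": ["x", "x", "x", "y", "z", "z"],
    "score": [1, 2, 3, 100, 5, 6],
})

# Keep only rows whose group has at least two members.
big_groups = df.groupby("team").filter(lambda g: len(g) >= 2)

# Keep only rows whose group mean exceeds a threshold (4 is arbitrary).
high_mean = df.groupby("team").filter(lambda g: g["score"].mean() > 4)

print(big_groups)
print(high_mean)
```

filter() returns the original rows of the groups that pass the test, not one row per group.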

The figure below illustrates Split-Apply-Combine, the way groupby() works. The columns are Grade (school year), Name, Test1, Test2, Test3, Test4, Final (scores), and Level (evaluation result), and each row holds the actual values for one student. groupby('Grade') splits the DataFrame by Grade, applies the aggregate function to each of the resulting DataFrames, and combines the results into a new DataFrame.

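import pandas as pd

# Read from a CSV file
# df = pd.read_csv('./Grade_sample.csv')
df = pd.read_csv('https://raw.githubusercontent.com/kibua20/devDocs/master/pandas_example/Grade_Sample.csv')

# groupby(): mean of each numeric column per Grade
# (numeric_only=True skips text columns such as Name, which pandas 2.0+ would otherwise reject)
df_groupby = df.groupby(['Grade']).mean(numeric_only=True)
print(df_groupby)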

Example of using groupby: Split, Apply, Combine
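
Because the dataset mixes text columns (Name, Level) with numeric scores, it can help to see how the mean is restricted to numeric columns. The sketch below uses a small stand-in table with invented rows and only a subset of the columns; it does not reproduce the values in Grade_Sample.csv.

```python
import pandas as pd

# Hypothetical stand-in rows; the real Grade_Sample.csv values are not reproduced here.
df = pd.DataFrame({
    "Grade": [1, 1, 2, 2],
    "Name": ["Kim", "Lee", "Park", "Choi"],
    "Test1": [80, 90, 70, 60],
    "Final": [85, 95, 75, 65],
    "Level": ["B", "A", "C", "D"],
})

# Option 1: select the numeric columns explicitly before aggregating.
by_columns = df.groupby("Grade")[["Test1", "Final"]].mean()

# Option 2: let pandas skip the non-numeric columns.
by_flag = df.groupby("Grade").mean(numeric_only=True)

print(by_columns)
print(by_flag)
```

Selecting columns explicitly makes the intent visible in the code, while numeric_only=True is convenient when the set of numeric columns is large or not known in advance.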

A simple example shows how groupby computes its result. The DataFrame has columns A, B, C, and D; A and B hold string values, and C and D hold arbitrary numbers. groupby('A') splits the data into two groups, 'bar' and 'foo', by the value of A, and applying sum() to each group produces a new DataFrame. Similarly, groupby(['A', 'B']) further splits each A group, such as 'bar', by the values of column B and runs sum() on each of the resulting DataFrames. The results are shown in the figure below.

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],