교차검증 수행하기 - K 폴드, Stratified K 폴드, cross_val_score

4 minute read

일반적으로 기계학습을 진행할 때는 Hold Out 과정에서 Train 데이터와 Test 데이터를 나누게 된다. 사이킷런 Model Selection 모듈의 train_test_split()을 이용해 Train / Test 데이터로 나눌 수 있는데, 이 과정만 수행한다면 여전히 과적합 문제가 존재할 수 있다. 이는 고정된 Train 데이터와 Test 데이터로 평가를 하다보면 Test 데이터에만 최적의 성능을 발휘하도록 편향되게 모델을 유도할 수 있기 때문이다. 이러한 문제점을 해결하기 위해 교차검증이 필요하다.

1. K 폴드 교차검증

K 폴드 교차검증은 K개의 데이터 폴드 세트를 만들어 K번만큼 각 폴드 세트에 학습과 검증 평가를 반복적으로 수행하는 방법이다. 사이킷런에서는 Model Selection 모듈의 KFold 클래스를 이용해 K 폴드 교차검증을 수행할 수 있다.

다음 예시에서 데이터는 사이킷런에서 기본적으로 제공되는 iris 데이터를 사용했으며, Estimator로는 DecisionTreeClassifier를 사용했다.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import numpy as np

iris = load_iris()
features = iris.data
label = iris.target
dt_clf = DecisionTreeClassifier()

# 5개의 폴드 세트로 분리하는 KFold 객체
kfold = KFold(n_splits=5)
cv_accuracy = []
n_iter = 0

# KFold객체의 split() 호출하면 폴드별 학습용, 검증용 각각의 로우 인덱스를 array로 반환한다.  
for train_index, test_index  in kfold.split(features):
    X_train, X_test = features[train_index], features[test_index]
    y_train, y_test = label[train_index], label[test_index]
    
    #학습 및 예측 
    dt_clf.fit(X_train , y_train)    
    pred = dt_clf.predict(X_test)
    n_iter += 1
    
    # 반복 시 마다 정확도 측정 
    accuracy = np.round(accuracy_score(y_test,pred), 4)
    train_size = X_train.shape[0]
    test_size = X_test.shape[0]
    print('\n#{0} 교차 검증 정확도 :{1}, 학습 데이터 크기: {2}, 검증 데이터 크기: {3}'
          .format(n_iter, accuracy, train_size, test_size))
    print('#{0} 검증 세트 인덱스:{1}'.format(n_iter,test_index))
    
    cv_accuracy.append(accuracy)
    
# 개별 iteration별 정확도를 합하여 평균 정확도 계산 
print('\n## 평균 검증 정확도:', np.mean(cv_accuracy)) 

                             .
                             .
                             .

#5 교차 검증 정확도 :0.8, 학습 데이터 크기: 120, 검증 데이터 크기: 30
#5 검증 세트 인덱스:[120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
 138 139 140 141 142 143 144 145 146 147 148 149]

## 평균 검증 정확도: 0.9199999999999999

2. Stratified K 폴드

데이터 레이블의 분포가 불균형할 경우, 원본 데이터와 유사한 레이블 값의 분포를 학습/테스트 세트에도 유지할 필요가 있다. 이렇게 분포를 유지하게 만드는 것을 stratify로 수행할 수 있으며, 특히 데이터의 개수가 적을 시에 이 과정이 필요하다.

예를 들어, Train / Test 데이터를 분리할 때 사용하는 train_test_split()에서도 다음과 같이 stratify를 적용해볼 수 있다.

from sklearn.datasets import load_iris
import pandas as pd
from sklearn.model_selection import train_test_split

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['label'] = iris.target

train, test = train_test_split(iris_df, stratify=iris_df.label)

K 폴드 교차검증에서도 stratify를 사용할 수 있는데, 이를 Stratified K 폴드라고 한다. 일반적으로 분류(Classification)에서의 교차검증은 Stratified K 폴드로 수행한다.

회귀(Regression)에서는 Stratified K 폴드가 지원되지 않는다.

iris 데이터의 경우 레이블 값은 0(Setosa), 1(Versicolor), 2(Virginica) 모두 50개로 동일한 개수를 가지고 있다. 전체 데이터 개수가 충분하지 않은(150개) 이 데이터에 일반적인 K 폴드를 적용할 경우 학습용 레이블이나 검증용 레이블 데이터 분포가 원본 데이터의 분포를 제대로 반영하지 못하는 문제가 발생한다. (KFold(n_split=3)을 적용하여 value_counts()로 확인해보면 이러한 문제가 극명하게 드러난다.)

이 문제를 해결하기 위해 Stratified K 폴드를 사용해보자.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3)
n_iter=0

for train_index, test_index in skf.split(iris_df, iris_df['label']):
    n_iter += 1
    label_train= iris_df['label'].iloc[train_index]
    label_test= iris_df['label'].iloc[test_index]
    print('## 교차 검증: {0}'.format(n_iter))
    print('학습 레이블 데이터 분포:\n', label_train.value_counts())
    print('검증 레이블 데이터 분포:\n', label_test.value_counts())

                             .
                             .
                             .

## 교차 검증: 3
학습 레이블 데이터 분포:
 0    34
2    33
1    33
Name: label, dtype: int64
검증 레이블 데이터 분포:
 2    17
1    17
0    16
Name: label, dtype: int64

레이블 데이터 세트의 분포가 고르게 분리된 것을 확인할 수 있다. 이를 활용하여 교차검증하는 코드를 다시 작성해보자.

dt_clf = DecisionTreeClassifier()

skfold = StratifiedKFold(n_splits=5)
cv_accuracy=[]
n_iter=0

# StratifiedKFold의 split() 호출시 반드시 레이블 데이터 셋도 추가 입력 필요  
for train_index, test_index  in skfold.split(features, label):
    X_train, X_test = features[train_index], features[test_index]
    y_train, y_test = label[train_index], label[test_index]
    
    #학습 및 예측 
    dt_clf.fit(X_train , y_train)    
    pred = dt_clf.predict(X_test)

    # 반복 시 마다 정확도 측정 
    n_iter += 1
    accuracy = np.round(accuracy_score(y_test,pred), 4)
    train_size = X_train.shape[0]
    test_size = X_test.shape[0]
    
    print('\n#{0} 교차 검증 정확도 :{1}, 학습 데이터 크기: {2}, 검증 데이터 크기: {3}'
          .format(n_iter, accuracy, train_size, test_size))
    print('#{0} 검증 세트 인덱스:{1}'.format(n_iter,test_index))
    cv_accuracy.append(accuracy)
    
# 교차 검증별 정확도 및 평균 정확도 계산 
print('\n## 교차 검증별 정확도:', np.round(cv_accuracy, 4))
print('## 평균 검증 정확도:', np.mean(cv_accuracy)) 

                             .
                             .
                             .

#5 교차 검증 정확도 :1.0, 학습 데이터 크기: 120, 검증 데이터 크기: 30
#5 검증 세트 인덱스:[ 40  41  42  43  44  45  46  47  48  49  90  91  92  93  94  95  96  97
  98  99 140 141 142 143 144 145 146 147 148 149]

## 교차 검증별 정확도: [0.9667 0.9667 0.9    0.9667 1.    ]
## 평균 검증 정확도: 0.9600200000000001

3. cross_val_score()

사이킷런에선는 cross_val_score()로 보다 간편하게 교차검증을 수행할 수 있다. cross_val_score()는 분류에서는 classifier가 입력되면 Stratified K 폴드 방식으로 레이블값의 분포에 따라 학습/테스트 세트를 분할하며, 회귀에서는 K 폴드 방식으로 분할한다.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score , cross_validate
from sklearn.datasets import load_iris
import numpy as np

iris_data = load_iris()
dt_clf = DecisionTreeClassifier()

data = iris_data.data
label = iris_data.target

# 성능 지표는 정확도(accuracy) , 교차 검증 세트는 5개 
scores = cross_val_score(dt_clf , data , label , scoring='accuracy',cv=5)
print('교차 검증별 정확도:',np.round(scores, 4))
print('평균 검증 정확도:', np.mean(scores))

교차 검증별 정확도: [0.9667 0.9667 0.9    0.9667 1.    ]
평균 검증 정확도: 0.9600000000000002

Share on

Twitter Facebook LinkedIn

Eunkyung Koh

교차검증 수행하기 - K 폴드, Stratified K 폴드, cross_val_score

1. K 폴드 교차검증

2. Stratified K 폴드

3. cross_val_score()

Share on

Leave a comment

You may also enjoy

스태킹(Stacking) 앙상블

불균형 데이터로 머신러닝 수행하기 - 언더 샘플링(Undersampling), 오버 샘플링(Oversampling)

R로 데이터 분석하기 - 모델링 (3) : 랜덤포레스트(Random Forest)

R로 데이터 분석하기 - 모델링 (2) : 로지스틱 회귀분석