[Python/ML] Ada Boosting, Gradient Boosting

Data Park 2022. 12. 11. 05:26

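Hello! This is Data Park!

Following the previous post on Random Forest, this time I'm covering Boosting algorithms.

Random Forest, which is built on bagging, is powerful, but among recent machine learning trends it is boosting algorithms that have been showing excellent performance.

If you understood Random Forest, you should be able to follow boosting without much trouble!

Shall we get started?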

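□ What is Boosting?

Bagging is parallel, boosting is sequential

Boosting is an ensemble method that chains several weak learners together to build a strong learner.

The idea behind boosting is to train a series of predictors sequentially, each one compensating for the shortcomings of the model before it.

There are many boosting methods, but the most popular are AdaBoost (short for adaptive boosting) and gradient boosting.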

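□ AdaBoost concept

Let's start with AdaBoost.

One way for a new predictor to correct its predecessor is to give more weight to the training samples that the previous model underfitted.

As a result, each new predictor focuses more and more on the samples that are hard to learn. This is the approach AdaBoost uses.

1st When building an AdaBoost classifier, the algorithm first trains a base classifier (e.g., a decision tree) on the training set and uses it to make predictions.

2nd The algorithm then increases the relative weight of the misclassified training samples.

3rd The second classifier is trained on the training set using the updated weights and again makes predictions.

Nth The weights are then updated again, and this repeats until the number of predictors set by n_estimators (the number of trees) is reached, at which point training ends.

(Training can also be stopped early to avoid overfitting!)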
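The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration and not scikit-learn's implementation: it assumes labels of +1/-1, one-dimensional inputs, threshold "stumps" as the weak learners, and the classic discrete AdaBoost weight formula. The function names (fit_stump, adaboost_fit, and so on) and the toy data are made up for this example.

```python
import math

def stump_predict(x, threshold, polarity):
    """A stump predicts `polarity` on the left of the threshold and the opposite on the right."""
    return polarity if x <= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost_fit(xs, ys, n_estimators=3):
    n = len(xs)
    weights = [1.0 / n] * n          # start with equal sample weights
    ensemble = []
    for _ in range(n_estimators):
        # 1st / 3rd step: train a weak learner on the currently weighted samples
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)   # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # say of this learner in the final vote
        ensemble.append((alpha, threshold, polarity))
        # 2nd step: misclassified samples (y * prediction == -1) get a larger weight
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]   # renormalize so the weights sum to 1
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1

# Toy data (made up): no single threshold separates these labels
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost_fit(xs, ys, n_estimators=3)
print([adaboost_predict(ensemble, x) for x in xs])
```

On this toy data a single stump always gets at least three samples wrong, while the weighted vote of three stumps classifies every training sample correctly, which is the "weak learners become a strong learner" idea in miniature.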

□ AdaBoost hands-on

Let's generate the moons dataset again for this exercise!

# Import the required libraries
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

# Define a function that visualizes the decision boundary
def plot_decision_boundary(clf, X, y, axes=(-1.5, 2.45, -1, 1.5), alpha=0.5, contour=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#FFD8D8','#9898ff','#B2EBF4'])
    plt.contourf(x1, x2, y_pred, #alpha=0.3, 
                 cmap=custom_cmap)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, #alpha=0.8,
                    cmap=custom_cmap2)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "ro", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

# Generate the data
X_m, y_m = make_moons(n_samples=500, noise=0.30, random_state=42)
plt.style.use('default')
# Keyword arguments keep this call working on seaborn 0.12 and later
sns.scatterplot(x=X_m[:,0], y=X_m[:,1], hue=y_m)

# The mglearn library can also be used for the visualization
#!pip install mglearn
import mglearn
mglearn.discrete_scatter(X_m[:,0],X_m[:,1], y_m)

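Scikit-learn uses SAMME, a multiclass version of AdaBoost. When there are only two classes, SAMME is equivalent to AdaBoost. If the predictors can estimate class probabilities (i.e., they have a predict_proba() method), scikit-learn can use a variant of SAMME called SAMME.R (the R stands for "Real"). This variant relies on class probabilities rather than predictions and generally performs better!

The following code uses scikit-learn's AdaBoostClassifier to train an AdaBoost classifier based on 200 very shallow decision trees. The decision trees used here have max_depth=1.

In other words, each tree consists of a single decision node plus two leaf nodes. This kind of tree is the default base estimator of AdaBoostClassifier.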

SAMME : https://www.intlpress.com/site/pub/pages/journals/items/sii/content/vols/0002/0003/a008/

Multi-class AdaBoost, Statistics and Its Interface, Volume 2 (2009), Number 3, Pages 349-360. DOI: https://dx.doi.org/10.4310/SII.2009.v2.n3.a8. Authors include Trevor Hastie (Department of Statistics, Stanford University, Stanford, Calif., U.S.A.) and Saharon Rosset.

from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(X_m, y_m, random_state=42)

# AdaBoost
# Note: newer scikit-learn releases deprecate algorithm="SAMME.R";
# if this argument triggers a warning or an error, omit it
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)

# Random Forest
rnd_clf = RandomForestClassifier(n_estimators=200)
rnd_clf.fit(X_train, y_train)

fig, axes = plt.subplots(ncols=2, figsize=(12,5), sharey=True)
plt.sca(axes[0])
plot_decision_boundary(ada_clf, X_m, y_m)
plt.title("Adaptive Boosting", fontsize=15)

plt.sca(axes[1])
plot_decision_boundary(rnd_clf, X_m, y_m)
plt.ylabel("")
plt.title("Random Forest", fontsize=15)

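Looking at the decision boundaries, AdaBoost looks like it might perform a bit worse than Random Forest, doesn't it?

AdaBoost's decision boundary does look like it might not generalize well, but in the model evaluation its performance is similar to Random Forest's.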

from sklearn.metrics import classification_report, ConfusionMatrixDisplay
# Evaluate the AdaBoost model
#plt.style.use('dark_background')
ada_pred = ada_clf.predict( X_test )
ada_acc = ada_clf.score( X_test, y_test )
print("AdaBoost testing accuracy :", ada_acc * 100, "%")
print("--------------------------------------------------------------------------")
print(classification_report( y_test, ada_pred ))
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
ConfusionMatrixDisplay.from_estimator(ada_clf , X_test , y_test , cmap='Blues')
plt.title("AdaBoost Confusion Matrix")

AdaBoost testing accuracy : 94.66666666666667 %
--------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      0.91      0.95        45
           1       0.88      1.00      0.94        30

    accuracy                           0.95        75
   macro avg       0.94      0.96      0.95        75
weighted avg       0.95      0.95      0.95        75

# Evaluate the Random Forest model
rnd_pred = rnd_clf.predict( X_test )
rnd_acc = rnd_clf.score( X_test, y_test )
print("Random Forest testing accuracy :", rnd_acc * 100, "%")
print("--------------------------------------------------------------------------")
print(classification_report( y_test, rnd_pred ))
ConfusionMatrixDisplay.from_estimator(rnd_clf , X_test , y_test , cmap='Reds')
plt.title("Random Forest Confusion Matrix")

Random Forest testing accuracy : 94.66666666666667 %
--------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      0.91      0.95        45
           1       0.88      1.00      0.94        30

    accuracy                           0.95        75
   macro avg       0.94      0.96      0.95        75
weighted avg       0.95      0.95      0.95        75

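On this moons dataset, the boosting algorithm performs about the same as the bagging-based Random Forest,

but as you will see in the later mini-project on improving accuracy, boosting algorithms can also run quickly and perform well!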

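□ What is Gradient Boosting?

Another very popular boosting algorithm is gradient boosting. Like AdaBoost, gradient boosting adds predictors to the ensemble sequentially, each one correcting the errors made so far. However, instead of adjusting the sample weights at every iteration as AdaBoost does, it fits each new predictor to the residual errors made by the previous predictor.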
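The "fit the next predictor to the residuals" idea is easiest to see on a tiny regression problem, so here is a rough plain-Python sketch. It is an illustration under assumptions, not how GradientBoostingClassifier is implemented: it uses squared error, one-dimensional inputs, and one-split regression stumps, and all names (fit_regression_stump, gradient_boost_fit, training_mse) and the data are made up for the example.

```python
def fit_regression_stump(xs, targets):
    """Find the single split that minimizes squared error; return (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the largest x so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def gradient_boost_fit(xs, ys, n_estimators=3, learning_rate=1.0):
    base = sum(ys) / len(ys)                 # initial prediction: the mean target
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        # residual errors left over by the ensemble built so far
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_regression_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # add the new stump's (scaled) correction to the running prediction
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}

def gradient_boost_predict(model, x):
    pred = model["base"]
    for threshold, left_mean, right_mean in model["stumps"]:
        pred += model["learning_rate"] * (left_mean if x <= threshold else right_mean)
    return pred

def training_mse(model, xs, ys):
    return sum((y - gradient_boost_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data (made up): a noisy staircase
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

for n in (0, 1, 5):
    model = gradient_boost_fit(xs, ys, n_estimators=n)
    print(n, "stumps -> training MSE", round(training_mse(model, xs, ys), 4))
```

Each added stump only has to explain what the previous ones left unexplained, so the training error keeps shrinking as stumps are added. That is also why too many trees can end up overfitting the training set.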

from sklearn.ensemble import GradientBoostingClassifier
fig, axes = plt.subplots(ncols=2, figsize=(12,5), sharey=True)

gbrt = GradientBoostingClassifier(max_depth=2, n_estimators=2, learning_rate=1.0)
gbrt.fit(X_train, y_train)
plt.sca(axes[0])
plot_decision_boundary(gbrt, X_m, y_m)
plt.title("learning_rate=1.0, n_estimators=2", fontsize=15)

gbrt1 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=500)
gbrt1.fit(X_train, y_train)
plt.sca(axes[1])
plot_decision_boundary(gbrt1, X_m, y_m)
plt.ylabel("")
plt.title("learning_rate=0.1, n_estimators=500", fontsize=15)

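With scikit-learn's GradientBoostingClassifier(), training the ensemble is straightforward.

Much like RandomForestClassifier(), it has parameters that control the growth of the decision trees (max_depth, min_samples_leaf) as well as parameters that control the ensemble training, such as the number of trees (n_estimators).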

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html?highlight=gradientboostingclassifier#sklearn.ensemble.GradientBoostingClassifier

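The learning_rate parameter scales the contribution of each tree. If you set it to a low value such as 0.1, more trees are needed to fit the ensemble to the training set, but the predictions usually generalize better. This is a regularization technique called shrinkage. The figure produced by the code above shows two GBRT ensembles: the one on the left (learning_rate=1.0, n_estimators=2) does not have enough trees to fit the training set, while the one on the right (learning_rate=0.1, n_estimators=500) has so many trees that it overfits the training set.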
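To get a feel for why a lower learning_rate needs more trees, here is an idealized back-of-the-envelope sketch. It assumes every tree predicts the current residual perfectly and only the learning_rate fraction of that prediction is added, which real trees do not do, so treat the numbers as intuition only. trees_needed and tolerance are names invented for this example.

```python
def trees_needed(learning_rate, tolerance=0.01):
    """Count idealized boosting steps until the remaining residual drops below `tolerance`."""
    residual = 1.0   # the full (normalized) error before any tree is added
    steps = 0
    while residual >= tolerance:
        # the idealized tree predicts the whole residual,
        # but only a learning_rate share of it is added to the ensemble
        residual -= learning_rate * residual
        steps += 1
    return steps

for lr in (1.0, 0.5, 0.1):
    print("learning_rate =", lr, "->", trees_needed(lr), "trees")
```

Under these assumptions the residual shrinks by a factor of (1 - learning_rate) per tree, so a small learning rate takes many more steps to reach the same training error.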

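□ Early stopping for Gradient Boosting

To find the optimal number of trees, you can use early stopping. A simple way to implement it is the staged_predict() method. It returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, and so on). The following code trains a GBRT ensemble with 500 trees, measures the validation error at each training stage (here on the test split) to find the optimal number of trees, and finally trains a new GBRT ensemble using that number of trees.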

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Train the full ensemble once
gbrt = GradientBoostingClassifier(n_estimators=500)
gbrt.fit(X_train, y_train)

# Error after each stage; with 0/1 labels the MSE equals the misclassification rate
errors = [mean_squared_error(y_test, y_pred)
          for y_pred in gbrt.staged_predict(X_test)]

# Stage with the lowest error (argmin is 0-based, so add 1)
end_n_estimators = np.argmin(errors) + 1
gbrt_end = GradientBoostingClassifier(n_estimators=end_n_estimators)
gbrt_end.fit(X_train, y_train)

plt.style.use('dark_background')
plot_decision_boundary(gbrt_end, X_m, y_m)
# Show the number of trees that was actually selected
plt.title(f"GradientBoostingClassifier(n_estimators={end_n_estimators})")
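The code above trains all the trees first and then looks back for the best stage. A common variation is to stop as soon as the validation error has not improved for a few rounds in a row. The sketch below shows only that stopping rule in plain Python. The function name best_round_with_patience, the patience parameter, and the error values are all made up for illustration and are not taken from the run above.

```python
def best_round_with_patience(errors, patience=5):
    """Return the 1-based round with the lowest error seen before patience runs out."""
    best_err = float("inf")
    best_round = 0
    rounds_without_improvement = 0
    for round_no, err in enumerate(errors, start=1):
        if err < best_err:
            # new best validation error: remember it and reset the counter
            best_err = err
            best_round = round_no
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break   # no improvement for `patience` rounds: stop early
    return best_round

# Hypothetical validation errors, one per added tree
val_errors = [0.30, 0.22, 0.18, 0.15, 0.16, 0.15, 0.17, 0.18, 0.19, 0.14]
print(best_round_with_patience(val_errors, patience=3))
```

With a small patience the search stops at the first plateau and can miss a later, slightly better round, so the patience value is a trade-off between training time and how thoroughly you search.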

That wraps up [Python/ML] Ada Boosting, Gradient Boosting. Thank you for reading!