[Python/ML] Random Forest

Hello, this is 데 박!

Following the previous Decision Tree write-up, here is the Random Forest write-up.

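□ Ensemble learning and random forests

If a decision tree is "one giant tree",

a random forest can be thought of as "a forest made up of smaller trees"!

Suppose you ask a complex question to thousands of randomly chosen people and aggregate their answers.

In many cases the aggregated answer is better than an expert's answer. This is called the wisdom of the crowd.

Similarly, if you aggregate the predictions of a group of predictors (classification or regression models), you will often get better predictions than with the best individual model.

Such a group of predictors is called an 'ensemble', and the technique is called ensemble learning!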
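To get a feel for why aggregation helps, here is a small simulation sketch. It assumes 1,000 hypothetical voters who are each right only 51% of the time and who answer completely independently. That independence is an idealization; real models trained on the same data make correlated errors, so the gain in practice is smaller. The names n_voters, p_correct, and n_trials are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n_voters = 1000      # hypothetical number of independent weak predictors
p_correct = 0.51     # each one is right only slightly more often than chance
n_trials = 10_000    # number of simulated yes/no questions

# votes[i, j] is True when voter j answers question i correctly.
votes = rng.random((n_trials, n_voters)) < p_correct

# The majority answer is correct when more than half of the voters are correct.
majority_correct = votes.sum(axis=1) > n_voters / 2

majority_accuracy = majority_correct.mean()
print("single voter accuracy  :", p_correct)
print("majority vote accuracy :", majority_accuracy)
```

Even though every voter is barely better than a coin flip, the majority vote is right noticeably more often than any single voter.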

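As an example of an ensemble method,

you can train a group of Decision Tree classifiers, each on a different random subset of the training set. The class that gets the most votes then becomes the prediction. Such an ensemble of decision trees is called a 'random forest'. Despite its simplicity, it is one of the most powerful machine learning algorithms available today!

In fact, the winning solutions in machine learning competitions often combine several ensemble methods. The Netflix Prize competition is a particularly well-known example.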

https://www.shalomeir.com/2014/11/netflix-prize-1/

NETFLIX PRIZE - 다이나믹 했던 알고리즘 대회 (1) - shalomeir's blog


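□ The random forest process

A random forest uses the ensemble technique called bagging (short for bootstrap aggregating).

1st: sample training instances (rows) from the training set with replacement, then train a decision tree on each sample.

2nd: aggregating. The trained trees are finally 'aggregated' to build the random forest model!

3rd: once all predictors are trained, the ensemble collects the predictions of all predictors and votes on the result (the statistical mode for classification, the average for regression).

4th: each individual decision tree is more biased than one trained on the original training set, but passing the predictions through the aggregation function reduces both bias and variance!

In general, compared with a single predictor trained on the original dataset, the ensemble has a similar bias but a lower variance.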
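The steps above can be sketched by hand in a few lines. This is only a minimal illustration of bagging under my own assumptions (101 trees, 100 rows per bootstrap sample, binary 0/1 labels), not how scikit-learn implements it internally. The names n_trees, sample_size, and y_pred_manual are made up for this sketch.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
n_trees = 101        # odd number so a binary vote can never tie (illustrative value)
sample_size = 100    # rows drawn for each tree (illustrative value)

trees = []
for _ in range(n_trees):
    # 1st: bootstrap, i.e. draw row indices WITH replacement.
    idx = rng.integers(0, len(X_train), size=sample_size)
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# 2nd/3rd: aggregate by collecting every tree's prediction, shape (n_trees, n_test).
all_preds = np.array([tree.predict(X_test) for tree in trees])

# Majority vote for 0/1 labels: class 1 wins when more than half of the trees say 1.
y_pred_manual = (all_preds.mean(axis=0) > 0.5).astype(int)

manual_accuracy = (y_pred_manual == y_test).mean()
print("manual bagging accuracy :", manual_accuracy)
```

For regression, the voting line would be replaced by a plain average of the trees' predictions.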

□ Bagging and Random Forest hands-on

from sklearn.datasets import make_moons
import seaborn as sns

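# Generate the data
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
# Scatter plot (keyword arguments, since recent seaborn versions reject positional x, y, hue)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y)

Let's classify this moons dataset!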

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Build the models: a single tree as a baseline, then a bagging ensemble
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

print("Bagging accuracy :", accuracy_score(y_test, y_pred))

>>> Bagging accuracy : 0.904

tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print("Decision tree accuracy :", accuracy_score(y_test, y_pred_tree))

>>> Decision tree accuracy : 0.856

from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np

plt.style.use('dark_background')
def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.45, -1, 1.5], alpha=0.5, contour=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#FFD8D8','#9898ff','#B2EBF4'])
    plt.contourf(x1, x2, y_pred, #alpha=0.3, 
                 cmap=custom_cmap)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, #alpha=0.8,
                    cmap=custom_cmap2)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "ro", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
    
fig, axes = plt.subplots(ncols=2, figsize=(12,6), sharey=True)
plt.sca(axes[0])
plot_decision_boundary(tree_clf, X, y)
plt.title("Decision Tree", fontsize=18)
plt.sca(axes[1])
plot_decision_boundary(bag_clf, X, y)
plt.title("Decision Trees with Bagging", fontsize=18)
plt.ylabel("")
plt.show()

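As mentioned above, a random forest is an ensemble of decision trees, generally trained with the bagging method, typically with max_samples set to the size of the training set. Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use RandomForestClassifier, which is more convenient and optimized for decision trees. (There is also RandomForestRegressor for regression tasks.) The following code trains a random forest classifier of 500 trees (each limited to at most 16 leaf nodes) using all available CPU cores (n_jobs=-1).

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)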
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

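A random forest is similar to bagging decision trees. With a few exceptions, RandomForestClassifier has all the hyperparameters of DecisionTreeClassifier (to control how the trees are grown), plus all the hyperparameters of BaggingClassifier needed to control the ensemble itself!

The random forest algorithm injects extra randomness when growing trees: when splitting a node, instead of searching for the best feature among all features, it searches for the best feature among a randomly selected subset of candidate features. This makes the trees more diverse and (once again) trades a somewhat higher bias for a lower variance, which generally yields a better model overall.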
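As a quick check of how many candidate features each split looks at, the sketch below trains a small forest on a synthetic dataset with 16 features (a made-up dataset just for this illustration) and reads the fitted max_features_ attribute of one of its trees. With max_features="sqrt", each split should consider int(sqrt(16)) = 4 randomly chosen features rather than all 16. The variable names X_demo, y_demo, forest, and first_tree are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 16 features, used only to make the square root easy to see.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=16, n_informative=5, random_state=42)

forest = RandomForestClassifier(
    n_estimators=10, max_features="sqrt", random_state=42)
forest.fit(X_demo, y_demo)

# Each fitted tree stores the resolved number of candidate features per split.
first_tree = forest.estimators_[0]
print("total features             :", forest.n_features_in_)
print("candidate features / split :", first_tree.max_features_)
```

Setting max_features=None instead would make every split search all features, which brings the model back toward plain bagging of trees.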

The following BaggingClassifier is roughly equivalent to the random forest classifier above!

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, random_state=42)
    
bag_clf.fit(X_train, y_train)

y_pred = bag_clf.predict(X_test)

np.sum(y_pred == y_pred_rf) / len(y_pred)  # fraction of test predictions on which the two models agree

>>> 1.0

fig, axes = plt.subplots(ncols=2, figsize=(12,6), sharey=True)
plt.sca(axes[0])
plot_decision_boundary(rnd_clf, X, y)
plt.title("Random Forest", fontsize=18)
plt.sca(axes[1])
plot_decision_boundary(bag_clf, X, y)
plt.title("Decision Trees with Bagging", fontsize=18)
plt.ylabel("")
plt.show()

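□ Feature importance

Another advantage of random forests is that they make it easy to measure the relative importance of each feature. Scikit-learn measures a feature's importance by looking at how much the nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node's weight is equal to the number of training samples associated with it.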
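To make the weighted impurity decrease concrete, here is a sketch that recomputes the importances of a single decision tree by hand from its fitted tree_ structure and compares them with feature_importances_. It uses one tree rather than a whole forest to keep it short; the forest value is obtained by averaging over its trees. The variable names clf, t, and manual are made up for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(iris.data, iris.target)

t = clf.tree_                      # low-level arrays describing the fitted tree
n = t.weighted_n_node_samples      # number of training samples reaching each node

manual = np.zeros(iris.data.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:                 # -1 marks a leaf: no split, no impurity decrease
        continue
    # Sample-weighted impurity of the node minus that of its two children.
    decrease = (n[node] * t.impurity[node]
                - n[left] * t.impurity[left]
                - n[right] * t.impurity[right])
    manual[t.feature[node]] += decrease

manual /= manual.sum()             # normalize so the importances sum to 1

for name, score in zip(iris.feature_names, manual):
    print(name, round(score, 4))
print("matches scikit-learn :", np.allclose(manual, clf.feature_importances_))
```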

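Scikit-learn computes this score automatically for each feature after training and normalizes the results so that the importances sum to 1. The values are stored in the feature_importances_ attribute. For example, the following code trains a random forest classifier on the iris dataset, then prints and plots each feature's importance. The most important features are the petal length (44%) and petal width (41%), while the sepal length and width look much less important.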

from sklearn.datasets import load_iris
iris = load_iris()

# Assign X, y
X = iris["data"]
y = iris['target']

# Train / test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

from sklearn.ensemble import RandomForestClassifier
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
import time

start = time.time()
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(X_train, y_train)
rnd_clf_pred = rnd_clf.predict(X_test)  # predictions on X_test
rnd_clf_acc = rnd_clf.score(X_test, y_test)  # test accuracy
#----------------------------------------------
print("Random forest training accuracy  :", rnd_clf.score(X_train, y_train) * 100, "%")
print("Random forest test accuracy :", rnd_clf_acc * 100, "%")
#----------------------------------------------
print("--------------------------------------------------------------------------")
print(classification_report(y_test, rnd_clf_pred))
print("--------------------------------------------------------------------------")
print(confusion_matrix(y_test, rnd_clf_pred))
print("--------------------------------------------------------------------------")
#----------------------------------------------
end = time.time()
print('Execution time is:')
print(end - start)
ConfusionMatrixDisplay.from_estimator(rnd_clf, X_test, y_test, cmap='YlGn')
plt.title("<< Random Forest >>")

>>> Random forest training accuracy  : 100.0 %
>>> Random forest test accuracy : 89.47368421052632 %
--------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       0.80      0.92      0.86        13
           2       0.91      0.77      0.83        13

    accuracy                           0.89        38
   macro avg       0.90      0.90      0.90        38
weighted avg       0.90      0.89      0.89        38

--------------------------------------------------------------------------
[[12  0  0]
 [ 0 12  1]
 [ 0  3 10]]
--------------------------------------------------------------------------
Execution time is:
0.8784017562866211

# Print the feature importances
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print([name, score])

# Plot the feature importances
sns.barplot(x=rnd_clf.feature_importances_, y=iris.feature_names)
plt.title("Random forest_feature importance")

>>> ['sepal length (cm)', 0.1219238899824993]
>>> ['sepal width (cm)', 0.028212457341966667]
>>> ['petal length (cm)', 0.44200866611709616]
>>> ['petal width (cm)', 0.40785498655843794]

That wraps up [Python/ML_Classification] Random Forest.

If anything is unclear or you have questions, please leave a comment. Thank you very much for reading such a long post today!
