漸入佳境

🪪 HOME CREDIT #3: Modeling + Result

Feature Engineering

I will remove multicollinearity among the 99 independent variables.

The reason is that when variables are highly correlated, the regression coefficient estimates become unstable and the model becomes less reliable.

This step is essential for logistic regression in particular, because it uses the regression coefficients as its indicators.

For ML models such as XGBoost, it is not a required part of modeling.
Still, when variables are highly correlated, a later change in one variable can strongly affect the others, so I consider it a necessary step for securing the model's reliability.

I removed multicollinearity in two stages: 1. correlation-coefficient based, then 2. VIF based.

IV Calculation

Before that, I will compute each variable's IV (Information Value), because I later plan to use the IV score to decide which variable to drop from each group of highly correlated variables.

Below is the code that computes WoE (Weight of Evidence) and IV with optbinning.

from optbinning import BinningProcess
from sklearn.model_selection import train_test_split

# --- [Step 1] Prepare and split the data ---
# df, cat_cols, and num_cols are assumed to come from the earlier preprocessing steps.
X, y = df.drop(columns='TARGET'), df['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
feature_list = cat_cols + num_cols

# --- [Step 2] Configure and fit BinningProcess ---
selection_criteria = {
    "iv": {"min": 0.025, "max": 0.7, "strategy": "highest", "top": 30},
    "quality_score": {"min": 0.01}
}

binning_process = BinningProcess(
    variable_names=feature_list,
    categorical_variables=cat_cols,
    selection_criteria=selection_criteria
)

# Learn the binning rules on the training data only
binning_process.fit(X_train, y_train)

selection_criteria sets the rules applied once WoE and IV have been computed. Following the usual IV guideline, I want to use only variables with an IV of roughly 0.03 or more (the code sets the minimum to 0.025, caps IV at 0.7, and keeps the top 30).
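To make the two quantities concrete, here is a minimal sketch of WoE and IV computed by hand. It assumes the convention WoE = ln(share of good customers / share of bad customers) per bin, under which a larger WoE means a larger share of good customers. The function name `woe_iv` and the bin counts are hypothetical and only for illustration; this is not the optbinning implementation.

```python
import math

def woe_iv(bins):
    """bins: list of (n_good, n_bad) counts per bin.

    Returns (woe_per_bin, iv) using WoE = ln(%good / %bad).
    Assumes every bin has at least one good and one bad customer.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        dist_good = good / total_good   # share of all good customers in this bin
        dist_bad = bad / total_bad      # share of all bad customers in this bin
        woe = math.log(dist_good / dist_bad)
        woes.append(woe)
        iv += (dist_good - dist_bad) * woe
    return woes, iv

# Hypothetical counts for a score-like variable split into three bins
toy_bins = [(200, 60), (300, 30), (500, 10)]
woes, iv = woe_iv(toy_bins)
for (good, bad), woe in zip(toy_bins, woes):
    print(f"good={good:4d} bad={bad:3d} woe={woe:+.3f}")
print(f"IV = {iv:.3f}")
```

A bin where bad customers are over-represented gets a negative WoE, a bin that mirrors the overall mix gets a WoE of zero, and IV adds up how far the bins deviate from that overall mix.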

Here are the final 30 selected variables, sorted by IV in descending order.

[Figure: the 30 selected variables sorted by IV in descending order]

The EXT variables clearly have larger IV values than the others.
Since they are credit scores from external institutions, it is perhaps only natural that they matter more than any other variable when assessing a customer's creditworthiness.

However, I expect a few of these EXT variables to be dropped at the multicollinearity stage.
After them, length of employment, age, and so on follow in order of significance.

Let's visualize the default probability per bin for the main variables.

First, the EXT_3 variable.

[Figure: default rate per bin for EXT_3]

The graph matches the common-sense view that ‘the higher the credit score, the lower the default rate’.
It is probably the variable that shows monotonicity most clearly among the independent variables.

Next is the YEARS_EMPLOYED variable.

[Figure: default rate per bin for YEARS_EMPLOYED]

Monotonic variables are generally said to be better for model performance, but there are exceptions.
Years of employment is one: an inverted-U shape actually fits the business logic better.
Here too, it is more reasonable to read the graph as the default rate being higher for customers with short-to-medium tenure.

Correlation Coefficients and Variable Elimination

Now that the IV values are all in place, I will compute the correlation coefficients and apply a first round of filtering.

import numpy as np

def remove_high_correlation(df_woe, iv_summary, threshold=0.8):
    # Absolute pairwise correlations; keep only the upper triangle so each pair appears once
    corr_matrix = df_woe.corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    
    to_drop = set()  # a set avoids duplicate entries
    for column in upper.columns:
        high_corr_vars = upper.index[upper[column] > threshold].tolist()
        for var in high_corr_vars:
            # Look up both IVs (raises IndexError if a name is missing from iv_summary)
            iv_col = iv_summary.loc[iv_summary['name'] == column, 'iv'].values[0]
            iv_var = iv_summary.loc[iv_summary['name'] == var, 'iv'].values[0]
            
            # Drop whichever variable of the pair has the lower IV
            drop_target = column if iv_col < iv_var else var
            to_drop.add(drop_target)
    return list(to_drop)

# X_train_woe (WoE-transformed training data) and iv_summary (columns 'name' and 'iv')
# are assumed to have been built from the fitted binning_process.
low_iv_corr_vars = remove_high_correlation(X_train_woe, iv_summary, threshold=0.8)
X_train_reduced = X_train_woe.drop(columns=low_iv_corr_vars)

These are the highly correlated variable pairs and the variables dropped based on each variable's IV.

For the days-employed and age variables, the correlation came out as 1 because I created derived variables without removing the original ones.

[Figure: highly correlated variable pairs and the variables dropped by IV]

After this first, correlation-based filter, I will compute VIF and filter again.

VIF Calculation and Variable Elimination

Whereas a correlation coefficient only judges the one-to-one relationship between two variables, VIF judges the one-to-many (1:N) relationship between one variable and all the others, so it can remove the multicollinearity that remains.
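For reference, the standard definition of VIF makes the 1:N idea explicit. Here $R_j^2$ is the coefficient of determination obtained when variable $j$ is regressed on all the other variables; the symbols are notation introduced here, not taken from the original post.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j > 10 \iff R_j^2 > 0.9
```

So a cutoff of 10 flags any variable for which the other variables together explain more than 90% of its variance.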

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(df):
    vif_data = pd.DataFrame()
    vif_data["variable"] = df.columns
    # To keep the intercept out of it, the WoE data is used as is (no constant column is added;
    # statsmodels' variance_inflation_factor does not add one itself)
    vif_data["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif_data

def iterative_vif_reduction(df, threshold=10):
    # Repeatedly drop the variable with the highest VIF until none exceeds the threshold
    while True:
        vif_df = calculate_vif(df)
        max_vif = vif_df['VIF'].max()
        if max_vif > threshold:
            drop_var = vif_df.sort_values('VIF', ascending=False)['variable'].iloc[0]
            print(f"Dropping '{drop_var}' with VIF: {max_vif:.2f}")
            df = df.drop(columns=[drop_var])
        else:
            break
    return df

X_train_final_woe = iterative_vif_reduction(X_train_reduced, threshold=10)
final_selection_list = X_train_final_woe.columns.tolist()

As a result, EXT_SOURCES_MEAN was dropped.
I had gone to the trouble of building EXT-related derived variables, so I am a little sad to see it go.

Final Variables

The final selection contains 21 variables:

'CODE_GENDER', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'ORGANIZATION_TYPE', 'REGION_RATING_CLIENT_W_CITY', 'REG_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_3', 'DAYS_EMPLOYED_ANOM', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'LIVINGAREA_AVG', 'DAYS_LAST_PHONE_CHANGE', 'AGE', 'YEARS_EMPLOYED', 'PAYMENT_RATE'

Grouping the variables:

| Group | Variables | Practical meaning and interpretation |
| --- | --- | --- |
| Core indicators | EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3 | External credit information. The strongest discriminatory power in the model |
| Repayment capacity | AMT_ANNUITY, PAYMENT_RATE, AMT_CREDIT | Measures the share of principal and interest relative to income and the level of repayment capacity |
| Stability / trust | AGE, YEARS_EMPLOYED, DAYS_ID_PUBLISH | Age and years of employment tend to rise together with income stability |
| Social status | OCCUPATION_TYPE, NAME_EDUCATION_TYPE | Reflects differences in default risk by occupation group and education level |
| Housing / environment | REGION_RATING_CLIENT_W_CITY, LIVINGAREA_AVG | Estimates the economic level of the residential area and the customer's asset status |

That seems like a reasonable way to read them.

Logistic Regression

I trained a logistic regression model, the traditional model most widely used in credit scoring, and measured its performance.

Instead of arbitrarily fixing the threshold at 0.5, I applied Youden’s J statistic to find the optimal threshold.
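As a rough illustration of that threshold rule, the sketch below scans every candidate cutoff and keeps the one that maximizes J = TPR - FPR. The function name `youden_threshold` and the toy labels and scores are hypothetical; the original notebook may well compute this differently.

```python
def youden_threshold(y_true, scores):
    """Return (best_threshold, best_j), where J = TPR - FPR.

    A sample is predicted positive (default) when score >= threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        j = tp / n_pos - fp / n_neg   # true positive rate minus false positive rate
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j

# Hypothetical labels (1 = default) and predicted default probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.55, 0.60, 0.80]
threshold, j = youden_threshold(y_true, scores)
print(threshold, round(j, 3))
```

With scikit-learn the same idea is usually written with `roc_curve`, taking the threshold at the index where `tpr - fpr` is largest. The maximum of TPR - FPR is also what the K-S statistic measures.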

The results to check are AUC, Gini, the K-S statistic, the confusion matrix, the optimal threshold, and the default rate per grade.

Test AUC: 0.7491
Test Gini: 0.4983
Test K-S Statistic: 0.3726

Optimal Threshold: 0.4918

[Optimized Classification Report]
              precision    recall  f1-score   support

           0       0.96      0.67      0.79     56538
           1       0.16      0.70      0.26      4965

    accuracy                           0.67     61503
   macro avg       0.56      0.69      0.52     61503
weighted avg       0.90      0.67      0.75     61503


[Decile Analysis]
        count  bad_count  bad_rate
decile                            
0        6151       1634  0.265648
1        6150        860  0.139837
2        6150        675  0.109756
3        6150        473  0.076911
4        6151        411  0.066818
5        6150        267  0.043415
6        6150        226  0.036748
7        6150        200  0.032520
8        6150        137  0.022276
9        6151         82  0.013331

And here is the ROC curve.

[Figure: ROC curve of the logistic regression model]

When I asked Gemini, it said an AUC of 0.7 or higher counts as decent performance.
Still, I think performance may have come out on the low side because of the class imbalance and because I did not analyze the WoE step more carefully.

Before this run I also tried it with the threshold arbitrarily fixed at 0.5, and the recall for the default group was a dismal 0.1.
To raise recall, I found the optimal threshold and classified with it (at the expense of precision, which is why the F1 score itself did not improve much).

For a financial institution, identifying customers who are likely to default is very important, so I think raising recall is the right call. In other words, it amounts to saying: “To catch more defaulters, I accept that some of the people whose loans are rejected will in fact be able to repay (good customers).”

Since recall and precision trade off against each other, an institution would adjust the classification cutoff depending on which one it wants to prioritize.

Finally, the decile results show that the higher the credit-score band (the higher the score), the lower the default rate. In other words, the model answers the question “Does the default rate rise as the credit score falls?” fairly convincingly.
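A decile table like the one above can be produced by ranking customers by score, cutting the ranking into ten equal groups, and computing the bad rate in each. The sketch below is one way to do that under those assumptions; `decile_table` and the toy data are hypothetical and not the notebook's code.

```python
def decile_table(y_true, scores, n_groups=10):
    """Rank customers by score and report the bad rate per group.

    Group 0 holds the lowest scores. Returns a list of
    (group, count, bad_count, bad_rate) tuples.
    """
    ranked = sorted(zip(scores, y_true))  # ascending by score
    n = len(ranked)
    rows = []
    for group in range(n_groups):
        start = group * n // n_groups
        end = (group + 1) * n // n_groups
        labels = [y for _, y in ranked[start:end]]
        bad = sum(labels)
        rows.append((group, len(labels), bad, bad / len(labels)))
    return rows

# Hypothetical data: 100 customers, credit scores 0..99,
# with fewer defaults (label 1) placed in the higher score bands.
bad_per_band = [5, 3, 3, 2, 2, 1, 1, 1, 1, 1]
scores, y_true = [], []
for band, n_bad in enumerate(bad_per_band):
    for k in range(10):
        scores.append(band * 10 + k)
        y_true.append(1 if k < n_bad else 0)

rows = decile_table(y_true, scores)
for group, count, bad, rate in rows:
    print(f"decile {group}: count={count} bad_count={bad} bad_rate={rate:.2f}")
```

What matters in the output is the ordering: a model that ranks risk well should show bad rates that fall (or rise) steadily from one decile to the next, with no reversals.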

XGBoost

Institutions with plenty of traditional financial data, such as first-tier banks, are said to use logistic regression heavily.
However, for low-to-mid-credit and thin-file customers, who lack credit and financial transaction records, lenders combine alternative data with traditional financial data and run credit scoring with AI models.

This dataset contains no alternative data, so I did not expect a large performance gap between logistic regression and XGBoost.

I set the XGBoost parameters as follows.

params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'learning_rate': 0.05,
    'max_depth': 4,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'random_state': 42,
    'scale_pos_weight': (y_train == 0).sum() / (y_train == 1).sum()  # offset the class imbalance
}
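To get a feel for the size of `scale_pos_weight`, here is the same ratio worked out from the test-set class counts that appear in the classification reports in this post. The training counts themselves are not shown, so this only approximates the value actually used; because the split is stratified, the training ratio should be close.

```python
# Class counts taken from the "support" column of the reported test-set results.
n_negative = 56538   # TARGET == 0 (repaid)
n_positive = 4965    # TARGET == 1 (default)

# Ratio of negatives to positives, the quantity passed as scale_pos_weight
scale_pos_weight = n_negative / n_positive
default_rate = n_positive / (n_negative + n_positive)

print(f"scale_pos_weight ~ {scale_pos_weight:.2f}")
print(f"default rate     ~ {default_rate:.2%}")
```

In other words, each default case is weighted roughly eleven times as heavily as a repaid one, which compensates for defaults making up only about 8% of the data.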

The model results are as follows.

[XGBoost Final Performance]
Test AUC: 0.7570
Test Gini: 0.5140
Test K-S Statistic: 0.3806

Optimal Threshold (Youden's J): 0.4888

[Optimized Classification Report]
              precision    recall  f1-score   support

           0       0.96      0.69      0.80     56538
           1       0.16      0.69      0.26      4965

    accuracy                           0.69     61503
   macro avg       0.56      0.69      0.53     61503
weighted avg       0.90      0.69      0.76     61503


[Decile Analysis - Risk Ranking]
        count  bad_count  bad_rate
decile                            
9        6151       1680  0.273126
8        6150        899  0.146179
7        6150        638  0.103740
6        6150        477  0.077561
5        6150        376  0.061138
4        6151        306  0.049748
3        6150        204  0.033171
2        6150        196  0.031870
1        6150        113  0.018374
0        6151         76  0.012356

Each metric came out slightly better than with logistic regression.

Result Visualization

[Figure: XGBoost feature importance]

I expected the EXT variables to rank high in importance, but the education variable also ranks fairly high.

The education variable is currently binned into two groups.

  • Academic degree, Higher education (EventRate: 0.053866)
  • Incomplete higher, Secondary / secondary special, Lower secondary (EventRate: 0.089380)

I wonder whether education also has a large effect on credit scores in real credit scoring in Korea.
Perhaps it is not simply high school, junior college, or university graduate, but broken down further, for example by the ranking of universities in Seoul or of flagship regional national universities. (+ possibly the major as well)

Score Card

Finally, I visualized the scorecard.

[Figure: scorecard score distributions of good and default customers]

Ideally, the distributions of the two groups should be far apart.

Customers are mostly concentrated in the mid score range, while default customers are concentrated in the low-to-mid range.

As the K-S value already indicated, the result does not show very strong discriminatory power.

SHAP

This time I looked at SHAP.
Unlike logistic regression, XGBoost learns complex interactions between variables, so it is hard to explain ‘why this result came out’ (a black-box model).

feature_importance told me how important each variable is, but not in which direction it pushes the prediction.
In the graph, points to the right of 0 push the default probability up, and points to the left push it down.

[Figure: SHAP summary plot]

The categorical and numerical variables show visibly different shapes.

Let's look for insights in the results.

Consistent direction (economic validity):

EXT_SOURCE 2 & 3: High values (red) cluster on the left (lower default probability). That is, the higher the external credit score, the more the customer is judged to be a good one -> the model's logic is sound.

AGE: Low values (blue, younger customers) sit on the right (higher default probability) -> this reflects the general tendency for people early in their careers to be measured as relatively higher risk.

LTV: As a derived variable, LTV contributes more to performance than feeding the raw AMT_CREDIT and AMT_GOODS_PRICE into the model separately.

There is one caveat when interpreting this.
Looking only at the SHAP result, one could read it as ‘the higher the LTV, the lower the default risk’ (in general, LTV and default risk move in the same direction).

That reading comes from interpreting the plot in terms of the raw LTV value.
The data used to derive the SHAP values is the WoE of LTV.
So the correct reading is: the larger the WoE of LTV (the larger the share of good customers), the lower the default risk.

PDP

Now let's draw Partial Dependence Plots for the main variables.
The X axis is the variable's WoE and the Y axis is the default rate.

[Figure: partial dependence plots of the main variables]

All of them show a negative relationship with the default rate, which makes sense.

The slope of each curve shows how sensitively the variable affects the default rate,
and in detail it is worth looking at the ranges where each curve becomes steeper.
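The idea behind a partial dependence curve can be sketched in a few lines: force one feature to a grid value for every row, average the model's predictions, and repeat across the grid. Everything below is hypothetical, including `toy_model`, which merely stands in for the fitted XGBoost model, and the WoE values; it is meant only to show the mechanics.

```python
import math

def toy_model(row):
    """Hypothetical scoring function standing in for a fitted model.

    row = (woe_ext_source, woe_age); returns a default probability.
    """
    logit = -2.0 - 1.2 * row[0] - 0.4 * row[1]
    return 1.0 / (1.0 + math.exp(-logit))

def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value   # overwrite only the feature of interest
            total += predict(tuple(modified))
        curve.append(total / len(rows))
    return curve

# Hypothetical WoE-encoded rows and a grid of WoE values for feature 0
rows = [(-1.0, 0.5), (-0.2, -0.3), (0.4, 0.1), (1.1, -0.6)]
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
pd_curve = partial_dependence(toy_model, rows, feature_index=0, grid=grid)
for value, avg in zip(grid, pd_curve):
    print(f"WoE={value:+.1f} -> average predicted default rate {avg:.3f}")
```

Because the toy model gives the first feature a negative coefficient, the curve falls as WoE rises, which is the same negative relationship described above. For a real fitted estimator, scikit-learn's `PartialDependenceDisplay.from_estimator` draws such curves directly.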

Practice code
모델링.ipynb