1. Credit dataset
In [169]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [170]:
credit_df = pd.read_csv('/content/drive/MyDrive/KDT/6. 머신러닝과 딥러닝/Data/credit.csv')
credit_df
Out[170]:
| | ID | Customer_ID | Name | Age | SSN | Occupation | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | ... | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0x1602 | CUS_0xd40 | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 3 | 4 | 3 | ... | 4.0 | 809.98 | 26.822620 | 22 Years and 1 Months | No | 49.574949 | 80.41529544 | High_spent_Small_value_payments | 312.494089 | Good |
1 | 0x160e | CUS_0x21b1 | Rick Rothackerj | 28_ | 004-07-5839 | _______ | 34847.84 | 2 | 4 | 6 | ... | 2.0 | 605.03 | 24.464031 | 26 Years and 7 Months | No | 18.816215 | 104.2918252 | Low_spent_Small_value_payments | 470.690627 | Standard |
2 | 0x161a | CUS_0x2dbc | Langep | 34 | 486-85-3974 | _______ | 143162.64 | 1 | 5 | 8 | ... | 3.0 | 1303.01 | 28.616735 | 17 Years and 9 Months | No | 246.992320 | 168.4137027 | !@9#%8 | 1043.315978 | Good |
3 | 0x1626 | CUS_0xb891 | Jasond | 54 | 072-31-6145 | Entrepreneur | 30689.89 | 2 | 5 | 4 | ... | 4.0 | 632.46 | 26.544229 | 17 Years and 3 Months | No | 16.415452 | 81.22885871 | Low_spent_Large_value_payments | 433.604773 | Standard |
4 | 0x1632 | CUS_0x1cdb | Deepaa | 21 | 615-06-7821 | Developer | 35547.71_ | 7 | 5 | 5 | ... | 4.0 | 943.86 | 39.797764 | 30 Years and 8 Months | Yes | 0.000000 | 276.7253943 | !@9#%8 | 288.605522 | Standard |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12495 | 0x25fb6 | CUS_0x372c | Lucia Mutikanik | 18 | 340-85-7301 | Lawyer | 42903.79 | 0 | 4 | 6 | ... | 1.0 | 1079.48 | 27.289440 | 28 Years and 1 Months | No | 50894.000000 | 78.51494451 | High_spent_Small_value_payments | 493.341182 | Good |
12496 | 0x25fc2 | CUS_0xf16 | Maria Sheahanb | 44 | #F%$D@*&8 | Media_Manager | 16680.35 | 1 | 1 | 5 | ... | 4.0 | 897.16 | 39.868572 | NaN | NM | 41.113561 | 52.95197782 | High_spent_Small_value_payments | 318.737378 | Good |
12497 | 0x25fce | CUS_0xaf61 | Chris Wickhamm | 49 | 133-16-7738 | Writer | 37188.1 | 1 | 4 | 5 | ... | 3.0 | 620.64 | 39.080823 | 29 Years and 9 Months | No | 84.205949 | 223.8750182 | Low_spent_Small_value_payments | 291.619866 | Good |
12498 | 0x25fda | CUS_0x8600 | Sarah McBridec | 28 | 031-35-0942 | Architect | 20002.88 | 10 | 8 | 29 | ... | 9.0 | 3571.7_ | 22.895966 | 5 Years and 8 Months | Yes | 60.964772 | 43.37067007 | High_spent_Large_value_payments | 328.655224 | Poor |
12499 | 0x25fe6 | CUS_0x942c | Nicks | 24 | 078-73-5990 | Mechanic | 39628.99 | 4 | 6 | 7 | ... | 3.0 | 502.38 | 32.991333 | 31 Years and 3 Months | No | 35.104023 | 401.1964806 | Low_spent_Small_value_payments | 189.641080 | Poor |
12500 rows × 24 columns
In [171]:
credit_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12500 entries, 0 to 12499
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 12500 non-null object
1 Customer_ID 12500 non-null object
2 Name 11273 non-null object
3 Age 12500 non-null object
4 SSN 12500 non-null object
5 Occupation 12500 non-null object
6 Annual_Income 12500 non-null object
7 Num_Bank_Accounts 12500 non-null int64
8 Num_Credit_Card 12500 non-null int64
9 Interest_Rate 12500 non-null int64
10 Num_of_Loan 12500 non-null object
11 Type_of_Loan 11074 non-null object
12 Delay_from_due_date 12500 non-null int64
13 Num_of_Delayed_Payment 11657 non-null object
14 Num_Credit_Inquiries 12264 non-null float64
15 Outstanding_Debt 12500 non-null object
16 Credit_Utilization_Ratio 12500 non-null float64
17 Credit_History_Age 11387 non-null object
18 Payment_of_Min_Amount 12500 non-null object
19 Total_EMI_per_month 12500 non-null float64
20 Amount_invested_monthly 11935 non-null object
21 Payment_Behaviour 12500 non-null object
22 Monthly_Balance 12366 non-null float64
23 Credit_Score 12500 non-null object
dtypes: float64(4), int64(4), object(16)
memory usage: 2.3+ MB
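- ID: unique record identifier
- Customer_ID: customer ID
- Name: customer name
- Age: age
- SSN: Social Security number
- Occupation: occupation
- Annual_Income: annual income
- Num_Bank_Accounts: number of bank accounts
- Num_Credit_Card: number of credit cards
- Interest_Rate: interest rate
- Num_of_Loan: number of loans
- Type_of_Loan: loan types
- Delay_from_due_date: delay past the payment due date
- Num_of_Delayed_Payment: number of delayed payments
- Num_Credit_Inquiries: number of credit inquiries
- Outstanding_Debt: outstanding debt
- Credit_Utilization_Ratio: credit card utilization ratio
- Credit_History_Age: length of credit history
- Payment_of_Min_Amount: whether only the minimum amount was paid (revolving balance)
- Total_EMI_per_month: total EMI (equated monthly installment) payments per month
- Amount_invested_monthly: amount invested each month
- Payment_Behaviour: payment behaviour
- Monthly_Balance: monthly balance
- Credit_Score: credit score (Poor, Standard, Good)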
In [172]:
credit_df.drop(['ID', 'Customer_ID', 'Name', 'SSN'], axis=1, inplace=True)
In [173]:
credit_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12500 entries, 0 to 12499
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 12500 non-null object
1 Occupation 12500 non-null object
2 Annual_Income 12500 non-null object
3 Num_Bank_Accounts 12500 non-null int64
4 Num_Credit_Card 12500 non-null int64
5 Interest_Rate 12500 non-null int64
6 Num_of_Loan 12500 non-null object
7 Type_of_Loan 11074 non-null object
8 Delay_from_due_date 12500 non-null int64
9 Num_of_Delayed_Payment 11657 non-null object
10 Num_Credit_Inquiries 12264 non-null float64
11 Outstanding_Debt 12500 non-null object
12 Credit_Utilization_Ratio 12500 non-null float64
13 Credit_History_Age 11387 non-null object
14 Payment_of_Min_Amount 12500 non-null object
15 Total_EMI_per_month 12500 non-null float64
16 Amount_invested_monthly 11935 non-null object
17 Payment_Behaviour 12500 non-null object
18 Monthly_Balance 12366 non-null float64
19 Credit_Score 12500 non-null object
dtypes: float64(4), int64(4), object(12)
memory usage: 1.9+ MB
In [174]:
credit_df['Credit_Score'].value_counts()
Out[174]:
Credit_Score
Standard 6943
Poor 3582
Good 1975
Name: count, dtype: int64
In [175]:
credit_df['Credit_Score'] = credit_df['Credit_Score'].replace({'Poor':0, 'Standard':1, 'Good':2})
credit_df.head()
Out[175]:
| | Age | Occupation | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Type_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 23 | Scientist | 19114.12 | 3 | 4 | 3 | 4 | Auto Loan, Credit-Builder Loan, Personal Loan,... | 3 | 7 | 4.0 | 809.98 | 26.822620 | 22 Years and 1 Months | No | 49.574949 | 80.41529544 | High_spent_Small_value_payments | 312.494089 | 2 |
1 | 28_ | _______ | 34847.84 | 2 | 4 | 6 | 1 | Credit-Builder Loan | 3 | 4 | 2.0 | 605.03 | 24.464031 | 26 Years and 7 Months | No | 18.816215 | 104.2918252 | Low_spent_Small_value_payments | 470.690627 | 1 |
2 | 34 | _______ | 143162.64 | 1 | 5 | 8 | 3 | Auto Loan, Auto Loan, and Not Specified | 5 | 8 | 3.0 | 1303.01 | 28.616735 | 17 Years and 9 Months | No | 246.992320 | 168.4137027 | !@9#%8 | 1043.315978 | 2 |
3 | 54 | Entrepreneur | 30689.89 | 2 | 5 | 4 | 1 | Not Specified | 0 | 6 | 4.0 | 632.46 | 26.544229 | 17 Years and 3 Months | No | 16.415452 | 81.22885871 | Low_spent_Large_value_payments | 433.604773 | 1 |
4 | 21 | Developer | 35547.71_ | 7 | 5 | 5 | 0 | NaN | 5 | NaN | 4.0 | 943.86 | 39.797764 | 30 Years and 8 Months | Yes | 0.000000 | 276.7253943 | !@9#%8 | 288.605522 | 1 |
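The encoding above can also be written with `Series.map`, which makes unmapped labels easy to spot because they come back as NaN. The sketch below uses a small made-up Series rather than `credit_df`; the name `score_order` is only illustrative.

```python
import pandas as pd

# Ordinal encoding used in the notebook: Poor < Standard < Good.
score_order = {'Poor': 0, 'Standard': 1, 'Good': 2}

# Toy stand-in for credit_df['Credit_Score']; values are made up.
toy = pd.Series(['Good', 'Standard', 'Poor', 'Standard'], dtype=object)

# map() returns NaN for any label that is missing from the dictionary,
# so a quick notna() check confirms every label was covered.
encoded = toy.map(score_order)
all_mapped = bool(encoded.notna().all())
print(encoded.tolist(), all_mapped)
```

Whether this is preferable to `replace` depends on the pandas version in use; treat it as one possible alternative.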
In [176]:
credit_df.describe()
Out[176]:
| | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Delay_from_due_date | Num_Credit_Inquiries | Credit_Utilization_Ratio | Total_EMI_per_month | Monthly_Balance | Credit_Score |
---|---|---|---|---|---|---|---|---|---|
count | 12500.000000 | 12500.000000 | 12500.00000 | 12500.000000 | 12264.000000 | 12500.000000 | 12500.000000 | 12366.000000 | 12500.000000 |
mean | 17.275120 | 21.647680 | 69.46520 | 21.051440 | 24.591650 | 32.291949 | 1303.781040 | 405.815391 | 0.871440 |
std | 118.518214 | 123.789969 | 455.95698 | 14.859994 | 183.422458 | 5.084327 | 8118.261086 | 218.136964 | 0.654268 |
min | 0.000000 | 1.000000 | 1.00000 | -5.000000 | 0.000000 | 20.992914 | 0.000000 | 0.088628 | 0.000000 |
25% | 3.000000 | 4.000000 | 8.00000 | 10.000000 | 2.000000 | 28.110034 | 29.128806 | 271.785749 | 0.000000 |
50% | 6.000000 | 5.000000 | 13.00000 | 18.000000 | 4.000000 | 32.297912 | 66.372879 | 337.169588 | 1.000000 |
75% | 7.000000 | 7.000000 | 20.00000 | 28.000000 | 8.000000 | 36.458660 | 149.904496 | 475.222487 | 1.000000 |
max | 1779.000000 | 1479.000000 | 5788.00000 | 67.000000 | 2592.000000 | 49.564519 | 82122.000000 | 1602.040519 | 2.000000 |
In [177]:
sns.barplot(x='Payment_of_Min_Amount', y='Credit_Score', data=credit_df)
Out[177]:
<Axes: xlabel='Payment_of_Min_Amount', ylabel='Credit_Score'>
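By default `sns.barplot` draws the mean of `y` for each `x` category, so the bar heights can be checked numerically with a group-by. A minimal sketch on a made-up frame (the values are not taken from credit.csv):

```python
import pandas as pd

# Toy stand-in for credit_df; the values are made up for illustration.
toy = pd.DataFrame({
    'Payment_of_Min_Amount': ['No', 'No', 'Yes', 'Yes', 'NM'],
    'Credit_Score': [2, 1, 0, 1, 1],
})

# Mean encoded score per category: the same quantity the bars represent.
mean_score = toy.groupby('Payment_of_Min_Amount')['Credit_Score'].mean()
print(mean_score)
```

Because `Credit_Score` is encoded as 0, 1, 2, a higher mean indicates a group with better scores on average.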
In [178]:
plt.figure(figsize=(20, 5))
sns.barplot(x='Occupation', y='Credit_Score', data=credit_df)
Out[178]:
<Axes: xlabel='Occupation', ylabel='Credit_Score'>
In [179]:
plt.figure(figsize=(12, 12))
sns.heatmap(credit_df.corr(numeric_only=True), cmap='coolwarm', vmin=-1, vmax=1, annot=True)
Out[179]:
<Axes: >
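The heatmap shows every pairwise correlation; when only the relationship with the target matters, the `Credit_Score` column of the correlation matrix can be sorted on its own. The sketch below assumes a small made-up frame, not the real data.

```python
import pandas as pd

# Toy stand-in for credit_df; the values are made up for illustration.
toy = pd.DataFrame({
    'Interest_Rate': [3, 6, 8, 20, 29],
    'Delay_from_due_date': [3, 3, 5, 28, 40],
    'Occupation': pd.Series(
        ['Scientist', 'Teacher', 'Doctor', 'Lawyer', 'Writer'], dtype=object),
    'Credit_Score': [2, 2, 1, 0, 0],
})

# numeric_only=True skips the string column instead of raising an error.
corr = toy.corr(numeric_only=True)

# Correlation of each numeric feature with the target, strongest first.
target_corr = (
    corr['Credit_Score']
    .drop('Credit_Score')
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```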
In [180]:
credit_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12500 entries, 0 to 12499
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 12500 non-null object
1 Occupation 12500 non-null object
2 Annual_Income 12500 non-null object
3 Num_Bank_Accounts 12500 non-null int64
4 Num_Credit_Card 12500 non-null int64
5 Interest_Rate 12500 non-null int64
6 Num_of_Loan 12500 non-null object
7 Type_of_Loan 11074 non-null object
8 Delay_from_due_date 12500 non-null int64
9 Num_of_Delayed_Payment 11657 non-null object
10 Num_Credit_Inquiries 12264 non-null float64
11 Outstanding_Debt 12500 non-null object
12 Credit_Utilization_Ratio 12500 non-null float64
13 Credit_History_Age 11387 non-null object
14 Payment_of_Min_Amount 12500 non-null object
15 Total_EMI_per_month 12500 non-null float64
16 Amount_invested_monthly 11935 non-null object
17 Payment_Behaviour 12500 non-null object
18 Monthly_Balance 12366 non-null float64
19 Credit_Score 12500 non-null int64
dtypes: float64(4), int64(5), object(11)
memory usage: 1.9+ MB
In [181]:
credit_df['Payment_Behaviour'].dtype
Out[181]:
dtype('O')
In [182]:
for i in credit_df.columns:
    if credit_df[i].dtype == 'O':
        print(i)
Age
Occupation
Annual_Income
Num_of_Loan
Type_of_Loan
Num_of_Delayed_Payment
Outstanding_Debt
Credit_History_Age
Payment_of_Min_Amount
Amount_invested_monthly
Payment_Behaviour
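The same list of object-typed columns can be obtained without an explicit loop via `DataFrame.select_dtypes`. A minimal sketch on a made-up frame (columns are forced to `object` dtype to mirror what `read_csv` produced above):

```python
import pandas as pd

# Toy stand-in for credit_df; the values are made up for illustration.
toy = pd.DataFrame({
    'Age': pd.Series(['23', '28_', '34'], dtype=object),   # digits stored as strings
    'Num_Bank_Accounts': [3, 2, 1],
    'Occupation': pd.Series(['Scientist', '_______', 'Entrepreneur'], dtype=object),
})

# Keep only the columns whose dtype is object, then list their names.
object_cols = toy.select_dtypes(include='object').columns.tolist()
print(object_cols)  # ['Age', 'Occupation']
```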
In [183]:
for i in ['Age', 'Annual_Income', 'Num_of_Loan', 'Num_of_Delayed_Payment', 'Outstanding_Debt', 'Amount_invested_monthly']:
    credit_df[i] = pd.to_numeric(credit_df[i].str.replace('_', ''))
In [184]:
credit_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12500 entries, 0 to 12499
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 12500 non-null int64
1 Occupation 12500 non-null object
2 Annual_Income 12500 non-null float64
3 Num_Bank_Accounts 12500 non-null int64
4 Num_Credit_Card 12500 non-null int64
5 Interest_Rate 12500 non-null int64
6 Num_of_Loan 12500 non-null int64
7 Type_of_Loan 11074 non-null object
8 Delay_from_due_date 12500 non-null int64
9 Num_of_Delayed_Payment 11657 non-null float64
10 Num_Credit_Inquiries 12264 non-null float64
11 Outstanding_Debt 12500 non-null float64
12 Credit_Utilization_Ratio 12500 non-null float64
13 Credit_History_Age 11387 non-null object
14 Payment_of_Min_Amount 12500 non-null object
15 Total_EMI_per_month 12500 non-null float64
16 Amount_invested_monthly 11935 non-null float64
17 Payment_Behaviour 12500 non-null object
18 Monthly_Balance 12366 non-null float64
19 Credit_Score 12500 non-null int64
dtypes: float64(8), int64(7), object(5)
memory usage: 1.9+ MB
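The conversion above works because stray underscores were the only non-numeric characters in those six columns. If a column might also contain other junk, `pd.to_numeric(..., errors='coerce')` turns unparseable entries into NaN instead of raising. In the sketch below the Series is made up; `'!@9#%8'` appears in `Payment_Behaviour` in the data, and placing it in a numeric column here is purely hypothetical.

```python
import pandas as pd

# Toy values: clean, underscore-suffixed, junk, and missing.
raw = pd.Series(['23', '28_', '35547.71_', '!@9#%8', None], dtype=object)

# Strip underscores literally, then coerce anything still unparseable to NaN.
cleaned = pd.to_numeric(raw.str.replace('_', '', regex=False), errors='coerce')
print(cleaned)
```

Coercion hides bad values rather than reporting them, so it is worth counting the new NaN entries afterwards.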
In [185]:
credit_df['Credit_History_Age']
Out[185]:
0 22 Years and 1 Months
1 26 Years and 7 Months
2 17 Years and 9 Months
3 17 Years and 3 Months
4 30 Years and 8 Months
...
12495 28 Years and 1 Months
12496 NaN
12497 29 Years and 9 Months
12498 5 Years and 8 Months
12499 31 Years and 3 Months
Name: Credit_History_Age, Length: 12500, dtype: object
In [186]:
# Convert Credit_History_Age to a total number of months
# e.g. "22 Years and 1 Months" -> 22 * 12 + 1 = 265
credit_df['Credit_History_Age'] = credit_df['Credit_History_Age'].str.replace(' Months', '')
# The string is now "22 Years and 1"; split it once into years and months
parts = credit_df['Credit_History_Age'].str.split(' Years and ', expand=True)
credit_df['Credit_History_Age'] = pd.to_numeric(parts[0]) * 12 + pd.to_numeric(parts[1])
credit_df.head()
Out[186]:
| | Age | Occupation | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Type_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 23 | Scientist | 19114.12 | 3 | 4 | 3 | 4 | Auto Loan, Credit-Builder Loan, Personal Loan,... | 3 | 7.0 | 4.0 | 809.98 | 26.822620 | 265.0 | No | 49.574949 | 80.415295 | High_spent_Small_value_payments | 312.494089 | 2 |
1 | 28 | _______ | 34847.84 | 2 | 4 | 6 | 1 | Credit-Builder Loan | 3 | 4.0 | 2.0 | 605.03 | 24.464031 | 319.0 | No | 18.816215 | 104.291825 | Low_spent_Small_value_payments | 470.690627 | 1 |
2 | 34 | _______ | 143162.64 | 1 | 5 | 8 | 3 | Auto Loan, Auto Loan, and Not Specified | 5 | 8.0 | 3.0 | 1303.01 | 28.616735 | 213.0 | No | 246.992320 | 168.413703 | !@9#%8 | 1043.315978 | 2 |
3 | 54 | Entrepreneur | 30689.89 | 2 | 5 | 4 | 1 | Not Specified | 0 | 6.0 | 4.0 | 632.46 | 26.544229 | 207.0 | No | 16.415452 | 81.228859 | Low_spent_Large_value_payments | 433.604773 | 1 |
4 | 21 | Developer | 35547.71 | 7 | 5 | 5 | 0 | NaN | 5 | NaN | 4.0 | 943.86 | 39.797764 | 368.0 | Yes | 0.000000 | 276.725394 | !@9#%8 | 288.605522 | 1 |
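The years-to-months arithmetic can also be expressed as a small standalone function, which makes the rule easy to test on its own. This is a sketch, not the notebook's method; `history_to_months` is a hypothetical helper name, and it assumes every valid value has exactly the form "N Years and M Months".

```python
import re

# Matches strings such as "22 Years and 1 Months".
_PATTERN = re.compile(r'(\d+) Years and (\d+) Months')

def history_to_months(text):
    """Return total months for a credit-history string, or None if unparseable."""
    if not isinstance(text, str):
        return None  # e.g. NaN for a missing value
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    years, months = map(int, match.groups())
    return years * 12 + months

print(history_to_months('22 Years and 1 Months'))  # 265
```

It could then be applied column-wise with something like `credit_df['Credit_History_Age'].map(history_to_months)`, where missing entries stay missing.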
In [187]:
credit_df.describe()
Out[187]:
| | Age | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Total_EMI_per_month | Amount_invested_monthly | Monthly_Balance | Credit_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 12500.000000 | 1.250000e+04 | 12500.000000 | 12500.000000 | 12500.00000 | 12500.00000 | 12500.000000 | 11657.000000 | 12264.000000 | 12500.000000 | 12500.000000 | 11387.000000 | 12500.000000 | 11935.000000 | 12366.000000 | 12500.000000 |
mean | 113.371280 | 1.888617e+05 | 17.275120 | 21.647680 | 69.46520 | 3.43656 | 21.051440 | 31.280089 | 24.591650 | 1426.220376 | 32.291949 | 217.588127 | 1303.781040 | 643.291976 | 405.815391 | 0.871440 |
std | 691.223297 | 1.482707e+06 | 118.518214 | 123.789969 | 455.95698 | 65.35565 | 14.859994 | 229.911798 | 183.422458 | 1155.169458 | 5.084327 | 99.638681 | 8118.261086 | 2063.324328 | 218.136964 | 0.654268 |
min | -500.000000 | 7.005930e+03 | 0.000000 | 1.000000 | 1.00000 | -100.00000 | -5.000000 | -3.000000 | 0.000000 | 0.230000 | 20.992914 | 1.000000 | 0.000000 | 0.000000 | 0.088628 | 0.000000 |
25% | 24.000000 | 1.948777e+04 | 3.000000 | 4.000000 | 8.00000 | 1.00000 | 10.000000 | 9.000000 | 2.000000 | 566.072500 | 28.110034 | 141.000000 | 29.128806 | 73.810753 | 271.785749 | 0.000000 |
50% | 33.000000 | 3.765508e+04 | 6.000000 | 5.000000 | 13.00000 | 3.00000 | 18.000000 | 14.000000 | 4.000000 | 1166.155000 | 32.297912 | 215.000000 | 66.372879 | 134.201478 | 337.169588 | 1.000000 |
75% | 42.000000 | 7.289813e+04 | 7.000000 | 7.000000 | 20.00000 | 5.00000 | 28.000000 | 18.000000 | 8.000000 | 1945.962500 | 36.458660 | 298.000000 | 149.904496 | 264.555831 | 475.222487 | 1.000000 |
max | 8592.000000 | 2.365819e+07 | 1779.000000 | 1479.000000 | 5788.00000 | 1496.00000 | 67.000000 | 4388.000000 | 2592.000000 | 4998.070000 | 49.564519 | 397.000000 | 82122.000000 | 10000.000000 | 1602.040519 | 2.000000 |
In [188]:
credit_df[credit_df['Age'] < 0]
Out[188]:
| | Age | Occupation | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Type_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
70 | -500 | Scientist | 144546.720 | 0 | 3 | 7 | 2 | Payday Loan, and Not Specified | 15 | 7.0 | 1.0 | 1045.11 | 40.840687 | 275.0 | No | 136.988557 | 573.411590 | High_spent_Small_value_payments | 730.555853 | 1 |
81 | -500 | Teacher | 103353.060 | 3 | 3 | 6 | 1 | Student Loan | 4 | 9.0 | 2.0 | 1374.56 | 32.685522 | 307.0 | No | 64.001686 | NaN | High_spent_Medium_value_payments | 857.826203 | 2 |
122 | -500 | Journalist | 19163.220 | 3 | 7 | 7 | 2 | Personal Loan, and Payday Loan | 23 | 18.0 | 5.0 | 2226.37 | 38.080964 | 194.0 | Yes | 24.175648 | 15.206734 | High_spent_Medium_value_payments | 345.311119 | 1 |
165 | -500 | Doctor | 70112.780 | 1 | 7 | 3 | 2 | Mortgage Loan, and Student Loan | 5 | 3.0 | 2.0 | 877.06 | 36.976516 | 298.0 | No | 69.717948 | 433.185268 | Low_spent_Medium_value_payments | 376.069951 | 0 |
207 | -500 | Teacher | 94454.100 | 1 | 4 | 5 | -100 | Credit-Builder Loan, Home Equity Loan, and Hom... | 4 | 6.0 | 4.0 | 1342.61 | 26.063001 | 357.0 | NM | 152.506895 | 10000.000000 | High_spent_Medium_value_payments | 719.685063 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11722 | -500 | Developer | 130682.940 | 2 | 331 | 1 | 4 | Payday Loan, Payday Loan, Mortgage Loan, and D... | 14 | 11.0 | 3.0 | 752.98 | 27.912234 | 344.0 | No | 263.381445 | 463.128666 | Low_spent_Large_value_payments | 604.014389 | 2 |
11817 | -500 | Developer | 35195.970 | 5 | 3 | 4 | 4 | Mortgage Loan, Auto Loan, Not Specified, and H... | 3 | 7.0 | 4.0 | 843.36 | 24.565802 | 338.0 | No | 93.199692 | 191.504338 | !@9#%8 | 279.095720 | 2 |
11839 | -500 | Architect | 8845.965 | 7 | 10 | 26 | 5 | Credit-Builder Loan, Home Equity Loan, Mortgag... | 40 | 23.0 | 6.0 | 1422.97 | 25.221897 | NaN | Yes | 33.948744 | NaN | Low_spent_Medium_value_payments | 278.220305 | 0 |
11917 | -500 | Teacher | 29969.660 | 8 | 6 | 18 | 1 | Auto Loan | 24 | 18.0 | 5.0 | 157.07 | 32.408860 | 266.0 | Yes | 14.380462 | 39.495791 | High_spent_Medium_value_payments | 437.970914 | 1 |
12053 | -500 | Doctor | 13672.200 | 7 | 5 | 13 | 1 | Auto Loan | 28 | 19.0 | 6.0 | 1419.68 | 38.426894 | 393.0 | Yes | 10.260200 | 97.019680 | Low_spent_Small_value_payments | 301.755120 | 1 |
115 rows × 20 columns
In [189]:
credit_df = credit_df[credit_df['Age'] >= 0]
In [190]:
credit_df.sort_values('Age').head(5)
Out[190]:
| | Age | Occupation | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Type_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3556 | 14 | Media_Manager | 7946.090 | 7 | 10 | 17 | 5 | Personal Loan, Mortgage Loan, Not Specified, P... | 54 | 27.0 | 12.0 | 1572.64 | 40.750451 | 184.0 | Yes | 30.019943 | 81.412882 | Low_spent_Small_value_payments | 266.884591 | 0 |
10388 | 14 | Engineer | 82941.600 | 6 | 6 | 20 | 7 | Payday Loan, Auto Loan, Auto Loan, Personal Lo... | 53 | 22.0 | 6.0 | 2598.17 | 30.264092 | 124.0 | NM | 347.228223 | 530.025784 | Low_spent_Small_value_payments | 75.325993 | 0 |
4215 | 14 | Engineer | 9511.795 | 6 | 7 | 20 | 7 | Student Loan, Auto Loan, Student Loan, Credit-... | 59 | 17.0 | 12.0 | 2344.43 | 34.636520 | 153.0 | NM | 28.825270 | 78.769438 | Low_spent_Small_value_payments | 250.470250 | 0 |
3231 | 14 | Teacher | 14015.320 | 8 | 3 | 11 | 6 | Home Equity Loan, Not Specified, Credit-Builde... | 20 | 20.0 | 7.0 | 1852.75 | 33.104663 | 118.0 | Yes | 54.001905 | 128.429618 | Low_spent_Medium_value_payments | 221.462810 | 1 |
10494 | 14 | Manager | 61922.240 | 6 | 9 | 18 | 9 | Payday Loan, Auto Loan, Student Loan, Credit-B... | 25 | 19.0 | 12.0 | 4678.77 | 25.661164 | 40.0 | Yes | 336.014862 | 131.642176 | High_spent_Large_value_payments | 314.161629 | 0 |
In [191]:
credit_df.sort_values('Age').tail(20)
Out[191]:
| | Age | Occupation | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Type_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1258 | 7980 | Scientist | 44941.950 | 4 | 6 | 3 | 1 | Payday Loan | 0 | 9.0 | 2.0 | 1150.28 | 27.894834 | 367.0 | No | 22.434725 | 133.727501 | !@9#%8 | 508.454024 | 2 |
7445 | 8035 | Doctor | 32753.050 | 2 | 7 | 6 | 4 | Credit-Builder Loan, Debt Consolidation Loan, ... | 2 | 5.0 | 3.0 | 1300.30 | 36.662384 | 359.0 | No | 79.182771 | 63.646267 | !@9#%8 | 400.513046 | 2 |
10824 | 8052 | Manager | 35820.620 | 0 | 6 | 8 | 4 | Student Loan, Not Specified, Student Loan, and... | 10 | 10.0 | 3.0 | 731.26 | 26.763490 | 237.0 | No | 87.981091 | 169.322300 | Low_spent_Medium_value_payments | 298.501775 | 2 |
857 | 8080 | Entrepreneur | 62835.080 | 0 | 4 | 1 | 2 | Home Equity Loan, and Debt Consolidation Loan | 10 | 14.0 | 0.0 | 306.85 | 40.859818 | 392.0 | No | 67.384074 | 181.104372 | Low_spent_Medium_value_payments | 577.937220 | 2 |
4233 | 8081 | Accountant | 19609.380 | 9 | 7 | 23 | 8 | Payday Loan, Mortgage Loan, Home Equity Loan, ... | 16 | 23.0 | 12.0 | 4186.04 | 25.656993 | 62.0 | Yes | 87.281543 | 93.293367 | Low_spent_Small_value_payments | 276.936589 | 1 |
7108 | 8172 | Doctor | 23838.090 | 5 | 6 | 13 | 773 | Debt Consolidation Loan, Payday Loan, Home Equ... | 24 | 10.0 | 7.0 | 802.30 | 31.348708 | 358.0 | Yes | 53.308099 | 10000.000000 | Low_spent_Small_value_payments | 232.429877 | 1 |
4046 | 8179 | Scientist | 49847.550 | 9 | 1384 | 18 | 3 | Home Equity Loan, Personal Loan, and Student Loan | 21 | 18.0 | 8.0 | 2173.47 | 39.903631 | 233.0 | Yes | 71.675700 | 318.284663 | Low_spent_Medium_value_payments | 278.535887 | 0 |
10197 | 8198 | Journalist | 20488.290 | 8 | 5 | 33 | 9 | Auto Loan, Home Equity Loan, Home Equity Loan,... | 56 | 18.0 | 7.0 | 3218.07 | 35.075428 | 92.0 | Yes | 131.465294 | 122.316046 | Low_spent_Medium_value_payments | 207.354411 | 0 |
5788 | 8246 | Accountant | 42900.210 | 7 | 4 | 17 | 4 | Personal Loan, Payday Loan, Mortgage Loan, and... | 18 | 15.0 | 4.0 | 2696.09 | 31.455247 | 117.0 | Yes | 111.209311 | 150.713217 | Low_spent_Small_value_payments | 376.079222 | 0 |
2725 | 8249 | Developer | 35372.360 | 8 | 5 | 25 | 9 | Home Equity Loan, Auto Loan, Mortgage Loan, Au... | 15 | 24.0 | 6.0 | 3850.97 | 31.673174 | 85.0 | Yes | 173.241966 | 80.806393 | High_spent_Small_value_payments | 292.521308 | 1 |
11176 | 8279 | _______ | 37443.130 | 2 | 4 | 7 | 0 | NaN | 13 | 0.0 | 2.0 | 843.53 | 34.062221 | 302.0 | No | 0.000000 | NaN | High_spent_Small_value_payments | 490.572275 | 0 |
4738 | 8306 | Developer | 27647.080 | 7 | 6 | 13 | 5 | Not Specified, Credit-Builder Loan, Mortgage L... | 17 | 8.0 | 7.0 | 159.63 | 26.757773 | 62.0 | Yes | 100.740203 | 10000.000000 | High_spent_Small_value_payments | 294.462237 | 1 |
8284 | 8348 | Accountant | 19323.400 | 10 | 5 | 18 | 8 | Debt Consolidation Loan, Debt Consolidation Lo... | 48 | 15.0 | 7.0 | 3033.25 | 30.197723 | 100.0 | Yes | 99.101347 | 110.085452 | Low_spent_Medium_value_payments | 245.741535 | 0 |
9384 | 8403 | Developer | 7756.465 | 10 | 7 | 19 | 2 | Payday Loan, and Student Loan | 45 | 21.0 | 11.0 | 2338.09 | 25.301853 | 67.0 | Yes | 9.682065 | NaN | High_spent_Medium_value_payments | 270.333736 | 0 |
7630 | 8409 | Lawyer | 34413.760 | 6 | 5 | 8 | 5 | Home Equity Loan, Home Equity Loan, Home Equit... | 23 | 20.0 | 6.0 | 113.06 | 39.139526 | 69.0 | Yes | 74.328709 | 37.339760 | High_spent_Large_value_payments | 420.012864 | 1 |
12472 | 8425 | Writer | 18512.970 | 7 | 5 | 18 | 3 | Student Loan, Student Loan, and Mortgage Loan | 15 | 11.0 | 6.0 | 1366.56 | 29.253092 | 131.0 | Yes | 24.621173 | 63.282651 | High_spent_Medium_value_payments | 311.570927 | 1 |
11218 | 8481 | Journalist | 15874.010 | 1 | 3 | 5 | 0 | NaN | 12 | 6.0 | 1.0 | 1173.38 | 24.443887 | 380.0 | No | 0.000000 | 53.610174 | High_spent_Small_value_payments | 311.373243 | 0 |
12439 | 8490 | Lawyer | 150131.680 | 5 | 1 | 4 | 0 | NaN | 8 | -2.0 | 0.0 | 1138.36 | 30.013470 | 376.0 | No | 0.000000 | 949.847265 | Low_spent_Small_value_payments | 599.850069 | 1 |
9255 | 8587 | Journalist | 28286.240 | 4 | 7 | 7 | 1 | Mortgage Loan | 18 | 16.0 | 4.0 | 1406.94 | 33.599043 | 350.0 | No | 17.431672 | 41.114532 | High_spent_Medium_value_payments | 418.772462 | 1 |
2963 | 8592 | _______ | 81815.020 | 0 | 5 | 4 | 1 | Mortgage Loan | 4 | NaN | 3.0 | 1434.75 | 28.645438 | 394.0 | No | 53.927785 | 277.263979 | Low_spent_Medium_value_payments | 637.300069 | 2 |
In [192]:
sns.boxplot(y=credit_df['Age'])
Out[192]:
<Axes: ylabel='Age'>
In [193]:
credit_df[credit_df['Age'] >= 100].sort_values('Age')
Out[193]:
| | Age | Occupation | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Type_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3911 | 102 | Musician | 38460.280 | 6 | 9 | 16 | 7 | Personal Loan, Personal Loan, Home Equity Loan... | 56 | 16.0 | 11.0 | 4106.50 | 26.537577 | 73.0 | Yes | 179.135821 | 267.336505 | Low_spent_Medium_value_payments | 157.630007 | 1 |
2416 | 126 | Teacher | 22050.560 | 5 | 4 | 12 | 1 | Home Equity Loan | 5 | 13.0 | 4.0 | 37.42 | 28.956967 | 387.0 | NM | 13.485884 | NaN | High_spent_Medium_value_payments | 359.927004 | 1 |
7418 | 169 | Doctor | 50109.760 | 6 | 3 | 4 | 1 | Personal Loan | 16 | 17.0 | 0.0 | 893.62 | 27.560776 | 191.0 | NM | 22.847438 | 351.547678 | Low_spent_Large_value_payments | 331.086217 | 2 |
952 | 181 | _______ | 87957.020 | 2 | 5 | 9 | 4 | Home Equity Loan, Auto Loan, Credit-Builder Lo... | 9 | 1.0 | 2.0 | 811.01 | 41.470014 | 231.0 | No | 195.913703 | 265.660815 | High_spent_Medium_value_payments | 533.800649 | 0 |
3197 | 216 | Media_Manager | 15829.875 | 4 | 2 | 4 | 1 | Student Loan | 10 | NaN | 3.0 | 968.61 | 34.361779 | 236.0 | No | 6.719535 | NaN | Low_spent_Small_value_payments | 352.096096 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12472 | 8425 | Writer | 18512.970 | 7 | 5 | 18 | 3 | Student Loan, Student Loan, and Mortgage Loan | 15 | 11.0 | 6.0 | 1366.56 | 29.253092 | 131.0 | Yes | 24.621173 | 63.282651 | High_spent_Medium_value_payments | 311.570927 | 1 |
11218 | 8481 | Journalist | 15874.010 | 1 | 3 | 5 | 0 | NaN | 12 | 6.0 | 1.0 | 1173.38 | 24.443887 | 380.0 | No | 0.000000 | 53.610174 | High_spent_Small_value_payments | 311.373243 | 0 |
12439 | 8490 | Lawyer | 150131.680 | 5 | 1 | 4 | 0 | NaN | 8 | -2.0 | 0.0 | 1138.36 | 30.013470 | 376.0 | No | 0.000000 | 949.847265 | Low_spent_Small_value_payments | 599.850069 | 1 |
9255 | 8587 | Journalist | 28286.240 | 4 | 7 | 7 | 1 | Mortgage Loan | 18 | 16.0 | 4.0 | 1406.94 | 33.599043 | 350.0 | No | 17.431672 | 41.114532 | High_spent_Medium_value_payments | 418.772462 | 1 |
2963 | 8592 | _______ | 81815.020 | 0 | 5 | 4 | 1 | Mortgage Loan | 4 | NaN | 3.0 | 1434.75 | 28.645438 | 394.0 | No | 53.927785 | 277.263979 | Low_spent_Medium_value_payments | 637.300069 | 2 |
260 rows × 20 columns
In [194]:
credit_df = credit_df[credit_df['Age'] < 110]
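The two age filters (drop negative ages, then drop ages of 110 and above) can be combined into a single range check with `Series.between`. The sketch below uses a made-up frame and assumes a pandas version where `inclusive='left'` is supported.

```python
import pandas as pd

# Toy stand-in for credit_df; -500 and 8592 mimic the invalid ages seen above.
toy = pd.DataFrame({'Age': [-500, 14, 33, 102, 8592]})

# Keep rows with 0 <= Age < 110, matching the two separate filters.
valid = toy[toy['Age'].between(0, 110, inclusive='left')]
print(valid['Age'].tolist())  # [14, 33, 102]
```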
In [195]:
credit_df.describe()
Out[195]:
| | Age | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Total_EMI_per_month | Amount_invested_monthly | Monthly_Balance | Credit_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 12126.000000 | 1.212600e+04 | 12126.000000 | 12126.000000 | 12126.000000 | 12126.000000 | 12126.000000 | 11304.000000 | 11892.000000 | 12126.000000 | 12126.000000 | 11045.000000 | 12126.000000 | 11584.000000 | 11995.000000 | 12126.000000 |
mean | 33.049398 | 1.897604e+05 | 17.275606 | 21.458601 | 68.507917 | 3.179861 | 21.053604 | 31.464084 | 24.994871 | 1426.321231 | 32.293466 | 217.614396 | 1308.972585 | 641.812701 | 406.077932 | 0.870856 |
std | 10.810043 | 1.486221e+06 | 118.546216 | 122.910157 | 453.592030 | 64.030581 | 14.857112 | 230.020040 | 185.478101 | 1155.255348 | 5.084805 | 99.405014 | 8131.974016 | 2059.907639 | 218.484165 | 0.653729 |
min | 14.000000 | 7.005930e+03 | 0.000000 | 1.000000 | 1.000000 | -100.000000 | -5.000000 | -3.000000 | 0.000000 | 0.230000 | 20.992914 | 1.000000 | 0.000000 | 0.000000 | 0.088628 | 0.000000 |
25% | 24.000000 | 1.945750e+04 | 4.000000 | 4.000000 | 8.000000 | 1.000000 | 10.000000 | 9.000000 | 2.000000 | 565.967500 | 28.118929 | 141.000000 | 29.275040 | 73.698246 | 271.752715 | 0.000000 |
50% | 33.000000 | 3.765508e+04 | 6.000000 | 5.000000 | 13.000000 | 3.000000 | 18.000000 | 14.000000 | 4.000000 | 1166.555000 | 32.285050 | 215.000000 | 66.196875 | 134.363758 | 337.123394 | 1.000000 |
75% | 41.000000 | 7.305204e+04 | 7.000000 | 7.000000 | 20.000000 | 5.000000 | 28.000000 | 18.000000 | 8.000000 | 1945.677500 | 36.452073 | 297.000000 | 149.873130 | 265.384383 | 474.903731 | 1.000000 |
max | 102.000000 | 2.365819e+07 | 1779.000000 | 1479.000000 | 5788.000000 | 1496.000000 | 67.000000 | 4388.000000 | 2592.000000 | 4998.070000 | 49.564519 | 397.000000 | 82122.000000 | 10000.000000 | 1602.040519 | 2.000000 |
In [196]:
len(credit_df[credit_df['Num_Bank_Accounts'] > 10 ]) / len(credit_df)
Out[196]:
0.013029853207982847
In [197]:
credit_df = credit_df[credit_df['Num_Bank_Accounts'] <= 10]
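The ratio computed before this filter is the share of rows above a threshold. Since the mean of a boolean mask is exactly that share, it can be wrapped in a short helper. This is a sketch on made-up numbers; `share_above` is a hypothetical name.

```python
import pandas as pd

def share_above(series, threshold):
    """Fraction of values strictly greater than threshold."""
    return (series > threshold).mean()

# Toy stand-in for credit_df['Num_Bank_Accounts'].
toy = pd.Series([3, 6, 7, 10, 1779])
print(share_above(toy, 10))  # 0.2
```

A small share (about 1.3% in the output above) suggests the rows can be dropped without losing much data.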
In [198]:
credit_df.describe()
Out[198]:
| | Age | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Total_EMI_per_month | Amount_invested_monthly | Monthly_Balance | Credit_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 11968.000000 | 1.196800e+04 | 11968.000000 | 11968.000000 | 11968.000000 | 11968.000000 | 11968.000000 | 11155.000000 | 11739.000000 | 11968.000000 | 11968.000000 | 10902.000000 | 11968.000000 | 11435.000000 | 11840.000000 | 11968.000000 |
mean | 33.059241 | 1.906275e+05 | 5.374749 | 21.289606 | 68.378509 | 3.202958 | 21.070438 | 31.480592 | 25.259392 | 1426.112040 | 32.290274 | 217.519171 | 1316.881794 | 637.877772 | 406.065193 | 0.870572 |
std | 10.805276 | 1.493985e+06 | 2.589361 | 121.976512 | 454.393273 | 64.430573 | 14.868583 | 231.016220 | 186.667973 | 1155.106304 | 5.084843 | 99.394703 | 8162.702164 | 2052.261335 | 218.516737 | 0.653619 |
min | 14.000000 | 7.005930e+03 | 0.000000 | 1.000000 | 1.000000 | -100.000000 | -5.000000 | -3.000000 | 0.000000 | 0.230000 | 20.992914 | 1.000000 | 0.000000 | 0.000000 | 0.088628 | 0.000000 |
25% | 24.000000 | 1.943611e+04 | 3.000000 | 4.000000 | 8.000000 | 1.000000 | 10.000000 | 9.000000 | 2.000000 | 565.175000 | 28.110034 | 141.000000 | 29.292854 | 73.446749 | 271.845069 | 0.000000 |
50% | 33.000000 | 3.757821e+04 | 6.000000 | 6.000000 | 13.000000 | 3.000000 | 18.000000 | 14.000000 | 4.000000 | 1167.045000 | 32.286924 | 215.000000 | 66.182932 | 133.519013 | 337.124205 | 1.000000 |
75% | 41.000000 | 7.305026e+04 | 7.000000 | 7.000000 | 20.000000 | 5.000000 | 28.000000 | 18.000000 | 8.000000 | 1947.157500 | 36.441411 | 297.000000 | 149.824990 | 263.272841 | 475.278504 | 1.000000 |
max | 102.000000 | 2.365819e+07 | 10.000000 | 1479.000000 | 5788.000000 | 1496.000000 | 67.000000 | 4388.000000 | 2592.000000 | 4998.070000 | 49.564519 | 397.000000 | 82122.000000 | 10000.000000 | 1602.040519 | 2.000000 |
In [199]:
len(credit_df[credit_df['Num_Credit_Card'] > 20 ]) / len(credit_df)
Out[199]:
0.021975267379679145
In [200]:
credit_df = credit_df[credit_df['Num_Credit_Card'] <= 20]
In [201]:
credit_df.describe()
Out[201]:
Age | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Total_EMI_per_month | Amount_invested_monthly | Monthly_Balance | Credit_Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 11705.000000 | 1.170500e+04 | 11705.000000 | 11705.000000 | 11705.000000 | 11705.000000 | 11705.000000 | 10905.000000 | 11482.000000 | 11705.000000 | 11705.000000 | 10654.000000 | 11705.000000 | 11182.000000 | 11580.000000 | 11705.000000 |
mean | 33.051431 | 1.884028e+05 | 5.376506 | 5.539000 | 68.847074 | 3.185903 | 21.086715 | 31.781110 | 25.397579 | 1428.121067 | 32.285006 | 217.185752 | 1309.298126 | 641.162451 | 405.791718 | 0.869970 |
std | 10.808661 | 1.483877e+06 | 2.593134 | 2.073587 | 457.588574 | 64.374527 | 14.898377 | 233.314628 | 187.626702 | 1156.429189 | 5.079989 | 99.529739 | 8140.727825 | 2058.913765 | 218.284978 | 0.653615 |
min | 14.000000 | 7.005930e+03 | 0.000000 | 1.000000 | 1.000000 | -100.000000 | -5.000000 | -3.000000 | 0.000000 | 0.230000 | 20.992914 | 1.000000 | 0.000000 | 0.000000 | 0.088628 | 0.000000 |
25% | 24.000000 | 1.942803e+04 | 3.000000 | 4.000000 | 8.000000 | 1.000000 | 10.000000 | 9.000000 | 2.000000 | 565.260000 | 28.111034 | 140.000000 | 29.541637 | 73.248674 | 271.691464 | 0.000000 |
50% | 33.000000 | 3.764810e+04 | 6.000000 | 5.000000 | 13.000000 | 3.000000 | 18.000000 | 14.000000 | 4.000000 | 1166.910000 | 32.282544 | 215.000000 | 66.388434 | 133.865834 | 336.844798 | 1.000000 |
75% | 41.000000 | 7.304796e+04 | 7.000000 | 7.000000 | 20.000000 | 5.000000 | 28.000000 | 18.000000 | 8.000000 | 1952.970000 | 36.424496 | 297.000000 | 150.026842 | 263.669812 | 475.357593 | 1.000000 |
max | 102.000000 | 2.365819e+07 | 10.000000 | 17.000000 | 5788.000000 | 1496.000000 | 67.000000 | 4388.000000 | 2592.000000 | 4998.070000 | 49.564519 | 397.000000 | 82122.000000 | 10000.000000 | 1602.040519 | 2.000000 |
In [202]:
credit_df = credit_df[credit_df['Interest_Rate'] <= 40]
In [203]:
credit_df.describe()
Out[203]:
Age | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Total_EMI_per_month | Amount_invested_monthly | Monthly_Balance | Credit_Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 11489.000000 | 1.148900e+04 | 11489.000000 | 11489.000000 | 11489.000000 | 11489.000000 | 11489.000000 | 10697.000000 | 11272.000000 | 11489.000000 | 11489.000000 | 10456.000000 | 11489.000000 | 10976.000000 | 11365.000000 | 11489.000000 |
mean | 33.051005 | 1.866428e+05 | 5.379058 | 5.539473 | 14.581774 | 3.085299 | 21.071111 | 32.144433 | 25.610628 | 1428.023787 | 32.274542 | 217.205337 | 1309.664252 | 643.204912 | 405.469860 | 0.869005 |
std | 10.806912 | 1.473204e+06 | 2.592378 | 2.073895 | 8.769424 | 63.606989 | 14.867793 | 235.555903 | 188.942465 | 1156.841447 | 5.081960 | 99.622916 | 8147.400974 | 2063.638237 | 217.870485 | 0.652951 |
min | 14.000000 | 7.005930e+03 | 0.000000 | 1.000000 | 1.000000 | -100.000000 | -5.000000 | -3.000000 | 0.000000 | 0.230000 | 20.992914 | 1.000000 | 0.000000 | 0.000000 | 0.088628 | 0.000000 |
25% | 24.000000 | 1.939932e+04 | 3.000000 | 4.000000 | 7.000000 | 1.000000 | 10.000000 | 9.000000 | 2.000000 | 565.280000 | 28.109667 | 140.000000 | 29.579944 | 73.167755 | 271.696442 | 0.000000 |
50% | 33.000000 | 3.747152e+04 | 6.000000 | 5.000000 | 13.000000 | 3.000000 | 18.000000 | 14.000000 | 4.000000 | 1166.470000 | 32.257865 | 215.000000 | 66.140607 | 133.820509 | 336.826381 | 1.000000 |
75% | 41.000000 | 7.288608e+04 | 7.000000 | 7.000000 | 20.000000 | 5.000000 | 28.000000 | 18.000000 | 8.000000 | 1950.620000 | 36.403191 | 297.000000 | 149.469529 | 263.650717 | 474.571468 | 1.000000 |
max | 102.000000 | 2.365819e+07 | 10.000000 | 17.000000 | 34.000000 | 1496.000000 | 67.000000 | 4388.000000 | 2592.000000 | 4998.070000 | 49.564519 | 397.000000 | 82122.000000 | 10000.000000 | 1602.040519 | 2.000000 |
In [204]:
len(credit_df[credit_df['Num_of_Loan'] > 20 ])
Out[204]:
60
In [205]:
credit_df = credit_df[(credit_df['Num_of_Loan'] <= 20) & (credit_df['Num_of_Loan'] >= 0)]
In [206]:
credit_df.describe()
Out[206]:
Age | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Total_EMI_per_month | Amount_invested_monthly | Monthly_Balance | Credit_Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 10962.000000 | 1.096200e+04 | 10962.000000 | 10962.000000 | 10962.000000 | 10962.000000 | 10962.000000 | 10211.000000 | 10756.000000 | 10962.000000 | 10962.000000 | 9967.000000 | 10962.000000 | 10468.000000 | 10840.000000 | 10962.000000 |
mean | 33.044426 | 1.883743e+05 | 5.386882 | 5.538405 | 14.592501 | 3.552728 | 21.079091 | 32.153168 | 25.391595 | 1430.842134 | 32.280912 | 216.911006 | 1322.954566 | 646.217707 | 405.524273 | 0.867907 |
std | 10.814471 | 1.479179e+06 | 2.592619 | 2.073381 | 8.771167 | 2.449676 | 14.865527 | 235.514645 | 187.830090 | 1157.869902 | 5.083798 | 99.455191 | 8210.229115 | 2068.395275 | 218.131576 | 0.651802 |
min | 14.000000 | 7.005930e+03 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | -5.000000 | -3.000000 | 0.000000 | 0.230000 | 20.992914 | 1.000000 | 0.000000 | 0.000000 | 0.088628 | 0.000000 |
25% | 24.000000 | 1.940034e+04 | 3.000000 | 4.000000 | 7.000000 | 2.000000 | 10.000000 | 9.000000 | 2.000000 | 566.242500 | 28.134649 | 140.000000 | 29.617499 | 73.265262 | 271.841052 | 0.000000 |
50% | 33.000000 | 3.768010e+04 | 6.000000 | 5.000000 | 13.000000 | 3.000000 | 18.000000 | 14.000000 | 4.000000 | 1166.420000 | 32.242901 | 214.000000 | 66.333116 | 133.543758 | 336.830727 | 1.000000 |
75% | 41.000000 | 7.305204e+04 | 7.000000 | 7.000000 | 20.000000 | 5.000000 | 28.000000 | 18.000000 | 8.000000 | 1961.112500 | 36.409833 | 296.000000 | 150.112234 | 265.694698 | 474.525972 | 1.000000 |
max | 102.000000 | 2.365819e+07 | 10.000000 | 17.000000 | 34.000000 | 19.000000 | 67.000000 | 4388.000000 | 2592.000000 | 4998.070000 | 49.564519 | 397.000000 | 82122.000000 | 10000.000000 | 1602.040519 | 2.000000 |
In [207]:
credit_df = credit_df[credit_df['Delay_from_due_date'] >= 0]
In [208]:
len(credit_df[credit_df['Num_of_Delayed_Payment'] > 30])
Out[208]:
80
In [209]:
credit_df = credit_df[(credit_df['Num_of_Delayed_Payment'] <= 30) & (credit_df['Num_of_Delayed_Payment'] >= 0)]
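The range filters above all follow the same pattern, so they can be wrapped in one small helper. The sketch below is only an illustration on toy data: `keep_in_range` and the `toy` frame are hypothetical names, not part of the notebook. Note that a comparison against NaN evaluates to False, so a filter such as the one on `Num_of_Delayed_Payment` also drops rows where that column is missing; the helper makes this explicit.

```python
import pandas as pd

def keep_in_range(df, column, lower=None, upper=None):
    """Return a copy of df keeping rows whose `column` lies in [lower, upper].

    Rows where `column` is NaN are dropped, which mirrors what a plain
    boolean comparison does.
    """
    mask = df[column].notna()
    if lower is not None:
        mask &= (df[column] >= lower)
    if upper is not None:
        mask &= (df[column] <= upper)
    # Report how much data the filter removes before applying it
    print(f"{column}: removing {1 - mask.mean():.2%} of rows")
    return df[mask].copy()

# Hypothetical toy data containing the kinds of outliers seen in describe()
toy = pd.DataFrame({"Num_of_Loan": [2, -100, 5, 1496, 0, 19]})
cleaned = keep_in_range(toy, "Num_of_Loan", lower=0, upper=20)
```

Returning an explicit copy also avoids chained-assignment warnings when the filtered frame is modified later.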
In [210]:
credit_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10005 entries, 0 to 12498
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 10005 non-null int64
1 Occupation 10005 non-null object
2 Annual_Income 10005 non-null float64
3 Num_Bank_Accounts 10005 non-null int64
4 Num_Credit_Card 10005 non-null int64
5 Interest_Rate 10005 non-null int64
6 Num_of_Loan 10005 non-null int64
7 Type_of_Loan 8894 non-null object
8 Delay_from_due_date 10005 non-null int64
9 Num_of_Delayed_Payment 10005 non-null float64
10 Num_Credit_Inquiries 9817 non-null float64
11 Outstanding_Debt 10005 non-null float64
12 Credit_Utilization_Ratio 10005 non-null float64
13 Credit_History_Age 9107 non-null float64
14 Payment_of_Min_Amount 10005 non-null object
15 Total_EMI_per_month 10005 non-null float64
16 Amount_invested_monthly 9550 non-null float64
17 Payment_Behaviour 10005 non-null object
18 Monthly_Balance 9896 non-null float64
19 Credit_Score 10005 non-null int64
dtypes: float64(9), int64(7), object(4)
memory usage: 1.6+ MB
In [211]:
# credit_df is a filtered slice at this point; take an explicit copy first
# so that the assignment below does not raise SettingWithCopyWarning
credit_df = credit_df.copy()
credit_df['Num_Credit_Inquiries'] = credit_df['Num_Credit_Inquiries'].fillna(0)
In [212]:
credit_df.isna().mean()
Out[212]:
Age 0.000000
Occupation 0.000000
Annual_Income 0.000000
Num_Bank_Accounts 0.000000
Num_Credit_Card 0.000000
Interest_Rate 0.000000
Num_of_Loan 0.000000
Type_of_Loan 0.111044
Delay_from_due_date 0.000000
Num_of_Delayed_Payment 0.000000
Num_Credit_Inquiries 0.000000
Outstanding_Debt 0.000000
Credit_Utilization_Ratio 0.000000
Credit_History_Age 0.089755
Payment_of_Min_Amount 0.000000
Total_EMI_per_month 0.000000
Amount_invested_monthly 0.045477
Payment_Behaviour 0.000000
Monthly_Balance 0.010895
Credit_Score 0.000000
dtype: float64
In [213]:
sns.displot(credit_df['Credit_History_Age'])
Out[213]:
<seaborn.axisgrid.FacetGrid at 0x79ae6435ac80>
In [214]:
sns.displot(credit_df['Amount_invested_monthly'])
Out[214]:
<seaborn.axisgrid.FacetGrid at 0x79ae643591b0>
In [215]:
sns.displot(credit_df['Monthly_Balance'])
Out[215]:
<seaborn.axisgrid.FacetGrid at 0x79ae64024340>
In [216]:
credit_df = credit_df.fillna(credit_df.median(numeric_only=True))
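The median is a reasonable fill value here because columns such as `Amount_invested_monthly` have extreme maxima that pull the mean upward. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) to show the difference, and also shows that `median(numeric_only=True)` leaves object columns untouched, which is why `Type_of_Loan` still reports missing values afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one skewed numeric column and one object column
toy = pd.DataFrame({
    "Amount_invested_monthly": [80.0, 104.0, np.nan, 168.0, 10000.0],
    "Payment_Behaviour": ["a", None, "b", "c", "d"],
})

medians = toy.median(numeric_only=True)   # Series indexed by numeric column name
filled = toy.fillna(medians)              # fills each column with its own median

print(toy["Amount_invested_monthly"].mean())    # dominated by the outlier
print(medians["Amount_invested_monthly"])       # robust to the outlier
```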
In [217]:
credit_df.isna().mean()
Out[217]:
Age 0.000000
Occupation 0.000000
Annual_Income 0.000000
Num_Bank_Accounts 0.000000
Num_Credit_Card 0.000000
Interest_Rate 0.000000
Num_of_Loan 0.000000
Type_of_Loan 0.111044
Delay_from_due_date 0.000000
Num_of_Delayed_Payment 0.000000
Num_Credit_Inquiries 0.000000
Outstanding_Debt 0.000000
Credit_Utilization_Ratio 0.000000
Credit_History_Age 0.000000
Payment_of_Min_Amount 0.000000
Total_EMI_per_month 0.000000
Amount_invested_monthly 0.000000
Payment_Behaviour 0.000000
Monthly_Balance 0.000000
Credit_Score 0.000000
dtype: float64
In [218]:
credit_df.head()
Out[218]:
Age | Occupation | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Type_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 23 | Scientist | 19114.12 | 3 | 4 | 3 | 4 | Auto Loan, Credit-Builder Loan, Personal Loan,... | 3 | 7.0 | 4.0 | 809.98 | 26.822620 | 265.0 | No | 49.574949 | 80.415295 | High_spent_Small_value_payments | 312.494089 | 2 |
1 | 28 | _______ | 34847.84 | 2 | 4 | 6 | 1 | Credit-Builder Loan | 3 | 4.0 | 2.0 | 605.03 | 24.464031 | 319.0 | No | 18.816215 | 104.291825 | Low_spent_Small_value_payments | 470.690627 | 1 |
2 | 34 | _______ | 143162.64 | 1 | 5 | 8 | 3 | Auto Loan, Auto Loan, and Not Specified | 5 | 8.0 | 3.0 | 1303.01 | 28.616735 | 213.0 | No | 246.992320 | 168.413703 | !@9#%8 | 1043.315978 | 2 |
3 | 54 | Entrepreneur | 30689.89 | 2 | 5 | 4 | 1 | Not Specified | 0 | 6.0 | 4.0 | 632.46 | 26.544229 | 207.0 | No | 16.415452 | 81.228859 | Low_spent_Large_value_payments | 433.604773 | 1 |
6 | 33 | Lawyer | 131313.40 | 0 | 1 | 8 | 2 | Credit-Builder Loan, and Mortgage Loan | 0 | 3.0 | 2.0 | 352.16 | 32.200509 | 367.0 | NM | 137.644605 | 378.171253 | High_spent_Medium_value_payments | 858.462474 | 2 |
In [219]:
# Exercise
# Collect every loan product that appears in Type_of_Loan into a variable
# Replace NaN values with 'No Loan'
# Create one column per loan product: 1 if the customer took that product, otherwise 0
credit_df['Type_of_Loan'] = credit_df['Type_of_Loan'].str.replace('and ', '')
In [222]:
credit_df.isna().mean()
Out[222]:
Age 0.000000
Occupation 0.000000
Annual_Income 0.000000
Num_Bank_Accounts 0.000000
Num_Credit_Card 0.000000
Interest_Rate 0.000000
Num_of_Loan 0.000000
Type_of_Loan 0.111044
Delay_from_due_date 0.000000
Num_of_Delayed_Payment 0.000000
Num_Credit_Inquiries 0.000000
Outstanding_Debt 0.000000
Credit_Utilization_Ratio 0.000000
Credit_History_Age 0.000000
Payment_of_Min_Amount 0.000000
Total_EMI_per_month 0.000000
Amount_invested_monthly 0.000000
Payment_Behaviour 0.000000
Monthly_Balance 0.000000
Credit_Score 0.000000
dtype: float64
In [223]:
credit_df['Type_of_Loan'] = credit_df['Type_of_Loan'].fillna('No Loan')
In [224]:
credit_df.isna().mean()
Out[224]:
Age 0.0
Occupation 0.0
Annual_Income 0.0
Num_Bank_Accounts 0.0
Num_Credit_Card 0.0
Interest_Rate 0.0
Num_of_Loan 0.0
Type_of_Loan 0.0
Delay_from_due_date 0.0
Num_of_Delayed_Payment 0.0
Num_Credit_Inquiries 0.0
Outstanding_Debt 0.0
Credit_Utilization_Ratio 0.0
Credit_History_Age 0.0
Payment_of_Min_Amount 0.0
Total_EMI_per_month 0.0
Amount_invested_monthly 0.0
Payment_Behaviour 0.0
Monthly_Balance 0.0
Credit_Score 0.0
dtype: float64
In [226]:
type_list = set(credit_df['Type_of_Loan'].str.split(', ').sum())
type_list
Out[226]:
{'Auto Loan',
'Credit-Builder Loan',
'Debt Consolidation Loan',
'Home Equity Loan',
'Mortgage Loan',
'No Loan',
'Not Specified',
'Payday Loan',
'Personal Loan',
'Student Loan'}
In [227]:
for i in type_list:
    credit_df[i] = credit_df['Type_of_Loan'].apply(lambda x: 1 if i in x else 0)
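The same indicator columns can also be built with pandas string methods instead of an explicit loop. The following is a minimal sketch on a hypothetical `loans` Series, assuming the separator is exactly `', '` after the `'and '` cleanup: `explode()` collects the unique products without concatenating Python lists through `sum()`, and `Series.str.get_dummies` creates one 0/1 column per product.

```python
import pandas as pd

# Hypothetical sample in the same format as Type_of_Loan after cleanup
loans = pd.Series([
    "Auto Loan, Credit-Builder Loan, Personal Loan",
    "Credit-Builder Loan",
    "Auto Loan, Auto Loan, Not Specified",
    "No Loan",
])

# Unique loan products, one row per product after explode()
loan_types = sorted(loans.str.split(", ").explode().unique())

# One indicator column per product; a product listed twice still yields 1
indicators = loans.str.get_dummies(sep=", ")
print(indicators)
```

Unlike the substring test `i in x`, this approach matches whole products, which matters only if one product name is contained in another.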
In [228]:
credit_df.head()
Out[228]:
Age | Occupation | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Type_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | ... | Home Equity Loan | No Loan | Auto Loan | Personal Loan | Not Specified | Credit-Builder Loan | Debt Consolidation Loan | Payday Loan | Mortgage Loan | Student Loan | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 23 | Scientist | 19114.12 | 3 | 4 | 3 | 4 | Auto Loan, Credit-Builder Loan, Personal Loan,... | 3 | 7.0 | ... | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 28 | _______ | 34847.84 | 2 | 4 | 6 | 1 | Credit-Builder Loan | 3 | 4.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 34 | _______ | 143162.64 | 1 | 5 | 8 | 3 | Auto Loan, Auto Loan, Not Specified | 5 | 8.0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 54 | Entrepreneur | 30689.89 | 2 | 5 | 4 | 1 | Not Specified | 0 | 6.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
6 | 33 | Lawyer | 131313.40 | 0 | 1 | 8 | 2 | Credit-Builder Loan, Mortgage Loan | 0 | 3.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
5 rows × 30 columns
In [229]:
credit_df.drop('Type_of_Loan', axis=1, inplace=True)
In [230]:
credit_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10005 entries, 0 to 12498
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 10005 non-null int64
1 Occupation 10005 non-null object
2 Annual_Income 10005 non-null float64
3 Num_Bank_Accounts 10005 non-null int64
4 Num_Credit_Card 10005 non-null int64
5 Interest_Rate 10005 non-null int64
6 Num_of_Loan 10005 non-null int64
7 Delay_from_due_date 10005 non-null int64
8 Num_of_Delayed_Payment 10005 non-null float64
9 Num_Credit_Inquiries 10005 non-null float64
10 Outstanding_Debt 10005 non-null float64
11 Credit_Utilization_Ratio 10005 non-null float64
12 Credit_History_Age 10005 non-null float64
13 Payment_of_Min_Amount 10005 non-null object
14 Total_EMI_per_month 10005 non-null float64
15 Amount_invested_monthly 10005 non-null float64
16 Payment_Behaviour 10005 non-null object
17 Monthly_Balance 10005 non-null float64
18 Credit_Score 10005 non-null int64
19 Home Equity Loan 10005 non-null int64
20 No Loan 10005 non-null int64
21 Auto Loan 10005 non-null int64
22 Personal Loan 10005 non-null int64
23 Not Specified 10005 non-null int64
24 Credit-Builder Loan 10005 non-null int64
25 Debt Consolidation Loan 10005 non-null int64
26 Payday Loan 10005 non-null int64
27 Mortgage Loan 10005 non-null int64
28 Student Loan 10005 non-null int64
dtypes: float64(9), int64(17), object(3)
memory usage: 2.3+ MB
In [241]:
# Occupation
# Replace '_______' with 'Unknown'
credit_df['Occupation'] = credit_df['Occupation'].replace('_______', 'Unknown')
credit_df['Occupation'].unique()
Out[241]:
array(['Scientist', 'Unknown', 'Entrepreneur', 'Lawyer', 'Journalist',
'Teacher', 'Manager', 'Accountant', 'Musician', 'Mechanic',
'Writer', 'Architect', 'Engineer', 'Developer', 'Media_Manager',
'Doctor'], dtype=object)
In [244]:
# Payment_of_Min_Amount
# Replace '!@9#%8' with 'Unknown' (a no-op here: the column only contains 'No', 'NM', 'Yes')
credit_df['Payment_of_Min_Amount'] = credit_df['Payment_of_Min_Amount'].replace('!@9#%8', 'Unknown')
credit_df['Payment_of_Min_Amount'].unique()
Out[244]:
array(['No', 'NM', 'Yes'], dtype=object)
In [246]:
# Payment_Behaviour
credit_df['Payment_Behaviour'] = credit_df['Payment_Behaviour'].replace('!@9#%8', 'Unknown')
credit_df['Payment_Behaviour'].unique()
Out[246]:
array(['High_spent_Small_value_payments',
'Low_spent_Small_value_payments', 'Unknown',
'Low_spent_Large_value_payments',
'High_spent_Medium_value_payments',
'Low_spent_Medium_value_payments',
'High_spent_Large_value_payments'], dtype=object)
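Before replacing placeholder tokens column by column, it can help to count where they actually occur. The sketch below is an assumption-laden illustration: the `PLACEHOLDERS` list contains only the two tokens visible in the outputs above, and `toy` is a made-up frame with the same column names.

```python
import pandas as pd

# Tokens seen in the outputs above; the real data may contain others
PLACEHOLDERS = ["_______", "!@9#%8"]

toy = pd.DataFrame({
    "Occupation": ["Scientist", "_______", "Lawyer"],
    "Payment_of_Min_Amount": ["No", "NM", "Yes"],
    "Payment_Behaviour": ["!@9#%8", "Low_spent_Small_value_payments", "!@9#%8"],
})

# Number of placeholder values per column
placeholder_counts = toy.isin(PLACEHOLDERS).sum()
print(placeholder_counts)

# Replace every placeholder in one pass
cleaned = toy.replace(PLACEHOLDERS, "Unknown")
```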
In [247]:
# One-hot encode the object columns above
credit_df = pd.get_dummies(credit_df, columns=['Occupation', 'Payment_of_Min_Amount', 'Payment_Behaviour'])
credit_df.head()
Out[247]:
Age | Annual_Income | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Num_of_Loan | Delay_from_due_date | Num_of_Delayed_Payment | Num_Credit_Inquiries | Outstanding_Debt | ... | Payment_of_Min_Amount_NM | Payment_of_Min_Amount_No | Payment_of_Min_Amount_Yes | Payment_Behaviour_High_spent_Large_value_payments | Payment_Behaviour_High_spent_Medium_value_payments | Payment_Behaviour_High_spent_Small_value_payments | Payment_Behaviour_Low_spent_Large_value_payments | Payment_Behaviour_Low_spent_Medium_value_payments | Payment_Behaviour_Low_spent_Small_value_payments | Payment_Behaviour_Unknown | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 23 | 19114.12 | 3 | 4 | 3 | 4 | 3 | 7.0 | 4.0 | 809.98 | ... | False | True | False | False | False | True | False | False | False | False |
1 | 28 | 34847.84 | 2 | 4 | 6 | 1 | 3 | 4.0 | 2.0 | 605.03 | ... | False | True | False | False | False | False | False | False | True | False |
2 | 34 | 143162.64 | 1 | 5 | 8 | 3 | 5 | 8.0 | 3.0 | 1303.01 | ... | False | True | False | False | False | False | False | False | False | True |
3 | 54 | 30689.89 | 2 | 5 | 4 | 1 | 0 | 6.0 | 4.0 | 632.46 | ... | False | True | False | False | False | False | True | False | False | False |
6 | 33 | 131313.40 | 0 | 1 | 8 | 2 | 0 | 3.0 | 2.0 | 352.16 | ... | True | False | False | False | True | False | False | False | False | False |
5 rows × 52 columns
In [248]:
credit_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10005 entries, 0 to 12498
Data columns (total 52 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 10005 non-null int64
1 Annual_Income 10005 non-null float64
2 Num_Bank_Accounts 10005 non-null int64
3 Num_Credit_Card 10005 non-null int64
4 Interest_Rate 10005 non-null int64
5 Num_of_Loan 10005 non-null int64
6 Delay_from_due_date 10005 non-null int64
7 Num_of_Delayed_Payment 10005 non-null float64
8 Num_Credit_Inquiries 10005 non-null float64
9 Outstanding_Debt 10005 non-null float64
10 Credit_Utilization_Ratio 10005 non-null float64
11 Credit_History_Age 10005 non-null float64
12 Total_EMI_per_month 10005 non-null float64
13 Amount_invested_monthly 10005 non-null float64
14 Monthly_Balance 10005 non-null float64
15 Credit_Score 10005 non-null int64
16 Home Equity Loan 10005 non-null int64
17 No Loan 10005 non-null int64
18 Auto Loan 10005 non-null int64
19 Personal Loan 10005 non-null int64
20 Not Specified 10005 non-null int64
21 Credit-Builder Loan 10005 non-null int64
22 Debt Consolidation Loan 10005 non-null int64
23 Payday Loan 10005 non-null int64
24 Mortgage Loan 10005 non-null int64
25 Student Loan 10005 non-null int64
26 Occupation_Accountant 10005 non-null bool
27 Occupation_Architect 10005 non-null bool
28 Occupation_Developer 10005 non-null bool
29 Occupation_Doctor 10005 non-null bool
30 Occupation_Engineer 10005 non-null bool
31 Occupation_Entrepreneur 10005 non-null bool
32 Occupation_Journalist 10005 non-null bool
33 Occupation_Lawyer 10005 non-null bool
34 Occupation_Manager 10005 non-null bool
35 Occupation_Mechanic 10005 non-null bool
36 Occupation_Media_Manager 10005 non-null bool
37 Occupation_Musician 10005 non-null bool
38 Occupation_Scientist 10005 non-null bool
39 Occupation_Teacher 10005 non-null bool
40 Occupation_Unknown 10005 non-null bool
41 Occupation_Writer 10005 non-null bool
42 Payment_of_Min_Amount_NM 10005 non-null bool
43 Payment_of_Min_Amount_No 10005 non-null bool
44 Payment_of_Min_Amount_Yes 10005 non-null bool
45 Payment_Behaviour_High_spent_Large_value_payments 10005 non-null bool
46 Payment_Behaviour_High_spent_Medium_value_payments 10005 non-null bool
47 Payment_Behaviour_High_spent_Small_value_payments 10005 non-null bool
48 Payment_Behaviour_Low_spent_Large_value_payments 10005 non-null bool
49 Payment_Behaviour_Low_spent_Medium_value_payments 10005 non-null bool
50 Payment_Behaviour_Low_spent_Small_value_payments 10005 non-null bool
51 Payment_Behaviour_Unknown 10005 non-null bool
dtypes: bool(26), float64(9), int64(17)
memory usage: 2.3 MB
In [254]:
from sklearn.model_selection import train_test_split
In [255]:
X_train, X_test, y_train, y_test = train_test_split(credit_df.drop('Credit_Score', axis=1), credit_df['Credit_Score'], test_size=0.2, random_state=2024)
In [256]:
X_train.shape, X_test.shape
Out[256]:
((8004, 51), (2001, 51))
In [257]:
y_train.shape, y_test.shape
Out[257]:
((8004,), (2001,))
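Because the three credit score classes are not equally frequent, one option worth considering is a stratified split, which keeps the class proportions nearly identical in the training and test sets. This is a sketch on synthetic data (the arrays and class probabilities are made up), not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class data standing in for the credit features
rng = np.random.default_rng(2024)
X = rng.normal(size=(1000, 3))
y = rng.choice([0, 1, 2], size=1000, p=[0.30, 0.55, 0.15])

# stratify=y preserves the class ratios in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=2024, stratify=y
)

train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share, test_share)
```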
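2. LightGBM (LGBM)¶
- A gradient boosting framework developed by Microsoft
- Uses a leaf-wise, histogram-based algorithm
- Delivers strong predictive performance and, on large datasets in particular, trains faster than many other algorithms
- Uses comparatively little memory
- Prone to overfitting on small datasets (a common rule of thumb is to use it with at least 10,000 rows)
- Supports early stopping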
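2-1. Leaf-wise, histogram-based algorithm¶
- Rather than growing the tree level by level to keep it balanced, it repeatedly splits the leaf that yields the largest loss reduction, so the resulting tree is often deep and unbalanced
- Each feature's values are bucketed into a histogram, and the bin boundaries serve as candidate split points that can be evaluated quickly
- To choose the best split among the candidates, every data point is assigned to its bin, per-bin statistics are accumulated, and the gain at each bin boundary is computed from those statistics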
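The histogram idea can be sketched for a single feature in a few lines of NumPy. This is a simplified illustration, not LightGBM's actual implementation: it assumes a squared-error objective, quantile-based bin edges, and a hypothetical helper named `best_histogram_split`.

```python
import numpy as np

def best_histogram_split(feature, residual, n_bins=16):
    """Scan bin boundaries (not every distinct value) for the best threshold."""
    # Quantile edges so that each bin holds roughly the same number of rows
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, feature, side="right")  # bin index per row

    # Histogram: residual sum and row count per bin
    sum_per_bin = np.bincount(bins, weights=residual, minlength=n_bins)
    cnt_per_bin = np.bincount(bins, minlength=n_bins)

    # Prefix sums give the left-side statistics for every boundary at once
    left_sum = np.cumsum(sum_per_bin)[:-1]
    left_cnt = np.cumsum(cnt_per_bin)[:-1]
    right_sum = residual.sum() - left_sum
    right_cnt = len(feature) - left_cnt

    # Squared-error gain (up to a constant) for splitting at each boundary
    valid = (left_cnt > 0) & (right_cnt > 0)
    gain = np.full(n_bins - 1, -np.inf)
    gain[valid] = (left_sum[valid] ** 2 / left_cnt[valid]
                   + right_sum[valid] ** 2 / right_cnt[valid])
    best = int(np.argmax(gain))
    return float(edges[best]), float(gain[best])

# Toy data whose target jumps at feature == 4
rng = np.random.default_rng(2024)
feature = rng.uniform(0, 10, size=2000)
residual = np.where(feature < 4.0, -1.0, 1.0) + rng.normal(scale=0.1, size=2000)
threshold, gain = best_histogram_split(feature, residual)
print(threshold)
```

With 16 bins only 15 thresholds are evaluated regardless of how many distinct values the feature has, which is where the speed-up comes from.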
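2-2. GBM (Gradient Boosting Machine)¶
- Models are trained sequentially
- The first model is trained, then the second model learns the errors of the first, and so on, so that each model compensates for the errors of its predecessors
- Some boosting methods, such as AdaBoost, assign a weight to every data point and increase the weights of misclassified points so that the next model pays more attention to them; gradient boosting instead fits each new tree to the gradient of the loss (the residuals, in the squared-error case) left by the current ensemble
- Once all trees are trained, their outputs are combined into the final prediction by summing them, each scaled by the learning rate, rather than by majority vote or averaging as in bagging; for classification the summed scores are converted to class probabilities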
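The sequential procedure can be written out by hand for a regression problem. The following is a minimal sketch under stated assumptions (squared-error loss, depth-1 trees as weak learners, a fixed learning rate, synthetic data); production libraries add regularization, subsampling, and much faster split finding.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem
rng = np.random.default_rng(2024)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
trees, errors = [], []

for _ in range(100):
    residual = y - prediction                       # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)  # weak learner (stump)
    prediction += learning_rate * tree.predict(X)   # add the correction to the ensemble
    trees.append(tree)
    errors.append(np.mean((y - prediction) ** 2))

print(errors[0], errors[-1])  # training error after the first and the last round
```

Each stump alone is a poor model, but the running sum of their scaled outputs fits the curve progressively better.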
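2-3. Key concepts of boosting models¶
- Weak learner: a simple model that performs poorly on its own (typically a shallow decision tree, in the extreme case a depth-1 stump)
- Weak learners are trained sequentially: after the first learner is fit, a second learner is trained to correct the first learner's errors, and so on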
In [258]:
from lightgbm import LGBMClassifier
In [260]:
base_model = LGBMClassifier(random_state=2024)
In [261]:
base_model.fit(X_train, y_train)
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004698 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2103
[LightGBM] [Info] Number of data points in the train set: 8004, number of used features: 51
[LightGBM] [Info] Start training from score -1.246598
[LightGBM] [Info] Start training from score -0.576753
[LightGBM] [Info] Start training from score -1.891803
Out[261]:
LGBMClassifier(random_state=2024)
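The warning about whitespace in feature names comes from the loan indicator columns such as `Home Equity Loan`. One way to avoid it, sketched here on a made-up frame, is to rename the columns before fitting; the replacement character is a free choice.

```python
import pandas as pd

# Hypothetical frame with the same kind of column names as the loan indicators
toy = pd.DataFrame({
    "Home Equity Loan": [1, 0],
    "Credit-Builder Loan": [0, 1],
    "Age": [23, 28],
})

# Replace spaces with underscores in every column name
renamed = toy.rename(columns=lambda name: name.replace(" ", "_"))
print(list(renamed.columns))
```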
In [262]:
pred = base_model.predict(X_test)
In [263]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score
In [264]:
accuracy_score(y_test, pred)
Out[264]:
0.7386306846576711
In [265]:
confusion_matrix(y_test, pred)
Out[265]:
array([[442, 140, 35],
[131, 858, 99],
[ 10, 108, 178]])
In [268]:
print(classification_report(y_test, pred))
precision recall f1-score support
0 0.76 0.72 0.74 617
1 0.78 0.79 0.78 1088
2 0.57 0.60 0.59 296
accuracy 0.74 2001
macro avg 0.70 0.70 0.70 2001
weighted avg 0.74 0.74 0.74 2001
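The per-class figures in the report can be recomputed directly from the confusion matrix above, which makes the definitions concrete: recall divides the diagonal by the row sums (true counts), and precision divides it by the column sums (predicted counts). The matrix values are copied from the output above.

```python
import numpy as np

# Rows: true class, columns: predicted class (values from the output above)
cm = np.array([[442, 140,  35],
               [131, 858,  99],
               [ 10, 108, 178]])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / all samples of that true class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all samples predicted as that class
accuracy = np.trace(cm) / cm.sum()

print(recall.round(2), precision.round(2), accuracy)
```

Class 2 has the lowest precision and recall, and it is also the class with the fewest test samples (296).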
In [266]:
proba = base_model.predict_proba(X_test)
proba
Out[266]:
array([[5.31741083e-02, 5.47962223e-01, 3.98863668e-01],
[3.10313153e-03, 9.96804047e-01, 9.28219692e-05],
[9.62549410e-01, 3.73701069e-02, 8.04835753e-05],
...,
[8.47182625e-01, 1.52551839e-01, 2.65535858e-04],
[1.93275922e-01, 8.06637389e-01, 8.66884417e-05],
[9.69702722e-03, 1.60090661e-01, 8.30212312e-01]])
In [269]:
5.31741083e-02, 5.47962223e-01, 3.98863668e-01
Out[269]:
(0.0531741083, 0.547962223, 0.398863668)
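Each row of `predict_proba` holds one probability per class, and the predicted label is the class with the largest probability. A small check using the first row printed above:

```python
import numpy as np

# First row of the predict_proba output shown above
first_row = np.array([5.31741083e-02, 5.47962223e-01, 3.98863668e-01])

predicted_class = int(np.argmax(first_row))  # index of the largest probability
row_total = first_row.sum()                  # probabilities sum to about 1

print(predicted_class, row_total)
```

Here the model favours class 1 with about 55% probability, while class 2 remains a fairly close second at about 40%.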
In [270]:
roc_auc_score(y_test, proba, multi_class='ovr')
Out[270]:
0.8943814823196526
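With `multi_class='ovr'` and the default macro averaging, the score is the unweighted mean of one binary AUC per class, where each class is treated in turn as the positive class against all others. The sketch below checks this on synthetic labels and probabilities; none of these arrays come from the notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic three-class labels and probability rows that sum to 1
rng = np.random.default_rng(2024)
y_true = rng.integers(0, 3, size=300)
logits = rng.normal(size=(300, 3))
logits[np.arange(300), y_true] += 1.5   # make the true class more likely
proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# One-vs-rest AUC as computed in the notebook
ovr_auc = roc_auc_score(y_true, proba, multi_class="ovr")

# The same value by hand: binary AUC per class, then the plain average
per_class = [roc_auc_score(y_true == k, proba[:, k]) for k in range(3)]
print(ovr_auc, np.mean(per_class))
```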
In [ ]: