1. Rent Dataset
import numpy as np
import pandas as pd
import seaborn as sns
rent_df = pd.read_csv('/content/drive/MyDrive/KDT/6. 머신러닝과 딥러닝/Data/rent.csv')
rent_df
rent_df.info()
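- Posted On: date the listing was posted
- BHK: number of bedrooms, halls, and kitchens
- Rent: rent amount
- Size: size of the home
- Floor: which floor the unit is on, out of the total number of floors
- Area Type: whether the stated area includes common space or only the unit itself
- Area Locality: locality
- City: city
- Furnishing Status: whether the unit is furnished
- Tenant Preferred: preferred tenant (household) type
- Bathroom: number of bathrooms
- Point of Contact: whom to contact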
rent_df.describe()
round(rent_df.describe(), 2)
rent_df['BHK']
sns.displot(rent_df['BHK'])
sns.displot(rent_df['Rent'])
rent_df['Rent'].sort_values()
Output:
4076 1200
285 1500
471 1800
2475 2000
146 2200
...
1459 700000
1329 850000
827 1000000
1001 1200000
1837 3500000
Name: Rent, Length: 4746, dtype: int64
sns.boxplot(y=rent_df['Rent'])
sns.boxplot(y=rent_df['BHK'])
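The box plots above mark points beyond the whiskers as outliers. By default the whiskers follow the 1.5 × IQR rule, which can also be checked numerically. The sketch below uses a small hypothetical Series (`rent` and its values are made up for illustration, not taken from rent.csv).

```python
import pandas as pd

# Hypothetical rent values; the last one is deliberately extreme
rent = pd.Series([1200, 1500, 9000, 12000, 15000, 16000, 20000, 25000, 3500000])

q1 = rent.quantile(0.25)   # first quartile
q3 = rent.quantile(0.75)   # third quartile
iqr = q3 - q1              # interquartile range

# Points outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are drawn as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = rent[(rent < lower) | (rent > upper)]

print(lower, upper)
print(outliers)
```

With these made-up values only the 3500000 entry falls outside the fences.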
rent_df.isna().sum()
Output:
Posted On 0
BHK 3
Rent 0
Size 5
Floor 0
Area Type 0
Area Locality 0
City 0
Furnishing Status 0
Tenant Preferred 0
Bathroom 0
Point of Contact 0
dtype: int64
rent_df.isna().mean()
Output:
Posted On 0.000000
BHK 0.000632
Rent 0.000000
Size 0.001054
Floor 0.000000
Area Type 0.000000
Area Locality 0.000000
City 0.000000
Furnishing Status 0.000000
Tenant Preferred 0.000000
Bathroom 0.000000
Point of Contact 0.000000
dtype: float64
rent_df.dropna(subset=['BHK'])
na_index = rent_df[rent_df['Size'].isna()].index
na_index
Output:
Index([425, 430, 4703, 4731, 4732], dtype='int64')
# rent_df[rent_df['Size'].isna()] = rent_df['Size'].median()
rent_df['Size'].fillna(rent_df['Size'].median(numeric_only=True)).loc[na_index]
Output:
425 850.0
430 850.0
4703 850.0
4731 850.0
4732 850.0
Name: Size, dtype: float64
rent_df = rent_df.fillna(rent_df.median(numeric_only=True))
rent_df.isna().mean()
Output:
Posted On 0.0
BHK 0.0
Rent 0.0
Size 0.0
Floor 0.0
Area Type 0.0
Area Locality 0.0
City 0.0
Furnishing Status 0.0
Tenant Preferred 0.0
Bathroom 0.0
Point of Contact 0.0
dtype: float64
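To see the median fill in isolation, here is a minimal sketch on a small hypothetical DataFrame (`toy_df` and its values are invented for illustration). It shows that `fillna` with a Series of medians fills each numeric column with its own median and leaves non-numeric columns alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from rent.csv
toy_df = pd.DataFrame({
    'BHK': [1, 2, np.nan, 3, 2],
    'Size': [500.0, np.nan, 900.0, 1200.0, 700.0],
    'City': ['Mumbai', 'Delhi', 'Delhi', 'Chennai', 'Mumbai'],
})

# Median of each numeric column; NaN values are skipped by default
medians = toy_df.median(numeric_only=True)

# Each column is filled with the value stored under its own label
filled = toy_df.fillna(medians)

print(medians)
print(filled)
print(filled.isna().sum())
```

Because the medians are computed per column, a missing `Size` never receives the `BHK` median or vice versa.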
rent_df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4746 entries, 0 to 4745
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Posted On          4746 non-null   object
 1   BHK                4746 non-null   float64
 2   Rent               4746 non-null   int64
 3   Size               4746 non-null   float64
 4   Floor              4746 non-null   object
 5   Area Type          4746 non-null   object
 6   Area Locality      4746 non-null   object
 7   City               4746 non-null   object
 8   Furnishing Status  4746 non-null   object
 9   Tenant Preferred   4746 non-null   object
 10  Bathroom           4746 non-null   int64
 11  Point of Contact   4746 non-null   object
dtypes: float64(2), int64(2), object(8)
memory usage: 445.1+ KB
rent_df['Floor'].value_counts()
Output:
Floor
1 out of 2 379
Ground out of 2 350
2 out of 3 312
2 out of 4 308
1 out of 3 293
...
11 out of 31 1
50 out of 75 1
18 out of 26 1
12 out of 27 1
23 out of 34 1
Name: count, Length: 480, dtype: int64
rent_df['Area Type'].value_counts()
Output:
Area Type
Super Area 2446
Carpet Area 2298
Built Area 2
Name: count, dtype: int64
rent_df['Area Type'].unique()
Output:
array(['Super Area', 'Carpet Area', 'Built Area'], dtype=object)
rent_df['Area Type'].nunique()
Output:
3
for i in ['Floor', 'Area Type', 'Area Locality', 'City', 'Furnishing Status', 'Tenant Preferred', 'Point of Contact']:
    print(i, rent_df[i].nunique())
Output:
Floor 480
Area Type 3
Area Locality 2235
City 6
Furnishing Status 3
Tenant Preferred 3
Point of Contact 3
rent_df.drop(['Floor', 'Area Locality', 'Tenant Preferred', 'Point of Contact', 'Posted On'], axis=1, inplace=True)
rent_df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4746 entries, 0 to 4745
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   BHK                4746 non-null   float64
 1   Rent               4746 non-null   int64
 2   Size               4746 non-null   float64
 3   Area Type          4746 non-null   object
 4   City               4746 non-null   object
 5   Furnishing Status  4746 non-null   object
 6   Bathroom           4746 non-null   int64
dtypes: float64(2), int64(2), object(3)
memory usage: 259.7+ KB
rent_df = pd.get_dummies(rent_df, columns=['Area Type', 'City', 'Furnishing Status'])
rent_df.head()
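As a quick illustration of what `pd.get_dummies` does to a categorical column, here is a minimal sketch on a hypothetical three-row frame (`toy_df` is invented for this example). Each distinct category becomes its own indicator column named `<column>_<category>`.

```python
import pandas as pd

# Hypothetical toy data, not taken from rent.csv
toy_df = pd.DataFrame({
    'Size': [500, 900, 1200],
    'City': ['Mumbai', 'Delhi', 'Mumbai'],
})

# One indicator column per distinct City value; Size is left untouched
encoded = pd.get_dummies(toy_df, columns=['City'])

print(encoded.columns.tolist())
print(encoded)
```

Applied to the three columns encoded above, the category counts listed earlier (3 for Area Type, 6 for City, 3 for Furnishing Status) give 12 indicator columns, which together with BHK, Size, and Bathroom make up the 15 feature columns.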
X = rent_df.drop('Rent', axis=1) # independent variables (features)
y = rent_df['Rent'] # dependent variable (target)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2024)
X_train.shape, X_test.shape
Output:
((3796, 15), (950, 15))
y_train.shape, y_test.shape
Output:
((3796,), (950,))
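The split sizes can be reproduced on a tiny example. The sketch below uses hypothetical arrays (`X_toy`, `y_toy`) with 10 rows; with `test_size=0.2` the test set gets 2 rows and the training set the remaining 8. The test count is rounded up, which is consistent with 4746 rows producing 950 test rows (4746 × 0.2 = 949.2).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same arguments as in the section; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2024
)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

Every sample ends up in exactly one of the two sets, so the targets of both parts together recover the original 10 values.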
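2. Linear Regression
- A method that analyzes data by fitting the straight line that best describes it
- Simple linear regression (uses a single independent variable)
- Multiple linear regression (uses multiple independent variables)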
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
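To make the fit/predict steps concrete, here is a minimal sketch on synthetic data whose true relationship is known. The data, the coefficients 3 and -2, the intercept 5, and the names `X_toy`, `y_toy`, and `model` are all assumptions chosen for illustration. Because the target is an exact linear function of two features, multiple linear regression should recover those values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 3*x1 - 2*x2 + 5 (assumed for illustration)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0

model = LinearRegression()
model.fit(X_toy, y_toy)

# Learned slope for each feature and the intercept
print(model.coef_, model.intercept_)

# Prediction for a new point (x1=1, x2=1): 3 - 2 + 5 = 6
print(model.predict(np.array([[1.0, 1.0]])))
```

On real data such as the rent table, the fit is only approximate, which is why the next section measures the prediction error.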
3. Building Evaluation Metrics
3-1. MSE (Mean Squared Error)
- The mean of the squared differences between the predicted and actual values
- $\frac{1}{n}\sum_{i=1}^{n}(y_i - x_i)^2$
p = np.array([3, 4, 5]) # predicted values
act = np.array([1, 2, 3]) # actual values
def my_mse(pred, actual):
    return ((pred - actual) ** 2).mean()
my_mse(p, act)
Output: 4.0
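As a sanity check, the hand-written function can be compared with scikit-learn's `mean_squared_error`, which is the function applied to the real data later in this section. This is a minimal sketch that repeats the definitions so it runs on its own; `sk_mse` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def my_mse(pred, actual):
    # Mean of the squared differences
    return ((pred - actual) ** 2).mean()

p = np.array([3, 4, 5])    # predicted values
act = np.array([1, 2, 3])  # actual values

# scikit-learn takes the actual values first, then the predictions
sk_mse = mean_squared_error(act, p)

print(my_mse(p, act), sk_mse)
```

Every error here is 2, so both versions give the mean of 4, 4, and 4.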
3-2. MAE (Mean Absolute Error)
- The mean of the absolute differences between the predicted and actual values
- $\frac{1}{n}\sum_{i=1}^{n}|y_i - x_i|$
def my_mae(pred, actual):
    return np.abs(pred - actual).mean()
my_mae(p, act)
Output: 2.0
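The two metrics react differently to a single large error. The sketch below uses hypothetical arrays (`actual`, `pred_small`, `pred_outlier`) in which the total absolute error is the same, spread evenly in one case and concentrated in one point in the other. MAE cannot tell the two apart, while MSE penalizes the concentrated error more heavily.

```python
import numpy as np

def my_mse(pred, actual):
    return ((pred - actual) ** 2).mean()

def my_mae(pred, actual):
    return np.abs(pred - actual).mean()

# Hypothetical values chosen for illustration
actual = np.array([10.0, 20.0, 30.0, 40.0])
pred_small = np.array([12.0, 22.0, 32.0, 42.0])    # four errors of 2
pred_outlier = np.array([10.0, 20.0, 30.0, 48.0])  # one error of 8

print(my_mae(pred_small, actual), my_mae(pred_outlier, actual))
print(my_mse(pred_small, actual), my_mse(pred_outlier, actual))
```

Both MAE values are 2.0, but the MSE rises from 4.0 to 16.0 when the error sits in a single prediction.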
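3-3. RMSE (Root Mean Squared Error)
- The square root of the mean of the squared differences between the predicted and actual values
- $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - x_i)^2}$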
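No hand-written function is given for RMSE. A minimal sketch in the same style as `my_mse` and `my_mae` might look like the following; `my_rmse` is a hypothetical name introduced here, not one from the original notebook.

```python
import numpy as np

def my_rmse(pred, actual):
    # Square root of the mean squared difference
    return np.sqrt(((pred - actual) ** 2).mean())

p = np.array([3, 4, 5])    # predicted values
act = np.array([1, 2, 3])  # actual values

print(my_rmse(p, act))
```

Taking the square root brings the metric back to the same unit as the target, so an RMSE on the rent data can be read directly as a rent amount.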
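3-4. Applying the Evaluation Metrics to the Data
from sklearn.metrics import mean_squared_error
pred = lr.predict(X_test)
# squared=False returns the RMSE; scikit-learn 1.4+ also provides root_mean_squared_error for this
mean_squared_error(y_test, pred, squared=False)
Output: 37731.275512059074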
# Before removing row 1837: 37765.125980605386
# After removing row 1837: 37731.275512059074
37765.125980605386 - 37731.275512059074
# The error decreased by 33.850468546312186
Output: 33.850468546312186
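The comments above refer to removing row 1837, which is the listing with the largest Rent (3500000) in the sorted output earlier, but the removal step itself is not shown. One way it might be done is `DataFrame.drop` with the index label, sketched here on a hypothetical toy frame (`toy_df` and its values are invented; only the label 1837 and the rent 3500000 come from the text).

```python
import pandas as pd

# Hypothetical toy frame that keeps the original index label of the outlier
toy_df = pd.DataFrame(
    {'Rent': [10000, 15000, 3500000, 20000]},
    index=[0, 1, 1837, 3],
)

# Drop the row by its index label, not by its position
cleaned = toy_df.drop(index=1837)

print(len(toy_df), len(cleaned))
print(cleaned['Rent'].max())
```

In the real workflow the row would need to be dropped before the train/test split, after which the model is refit and the RMSE recomputed to obtain the "after" figure.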