
Building a Neural Network Prediction Model for House Prices: Decisions and Rationale

  • Writer: Felipe Leite
  • Sep 23, 2024
  • 3 min read

Updated: Nov 6, 2024

  1. Introduction

We developed a neural network (NN) model to predict house sale prices from a limited set of features: Garage Cars, Garage Area, Overall Quality, and Ground Living Area. This post explains our choices of model structure, training strategy, data processing, and evaluation. We also analyze the model's results and discuss potential improvements.

  2. Data Preparation and Initial Setup

We began by loading the dataset with Python's pandas library, a common choice in data science for reading and handling structured data.

import pandas as pd

# Reading a legacy .xls file requires the xlrd package to be installed
df = pd.read_excel("/content/Assignment 2_BUSI 651_House Prices (1).xls")


2.1 Feature Selection

From the dataset provided, we took Sale Price as the target variable (y) and selected four predictive features (x): Garage Cars, Garage Area, Overall Quality, and Ground Living Area. We chose these features because we presumed they correlate with house price. The set is limited, since other impactful variables, such as location, were not available.


# Target variable
y = df["SalePrice"]
# Predictor features
x = df[["GarageCars", "GarageArea", "OverallQual", "GrLivArea"]]

3. Train-Test Split

A crucial part of model training is separating the data into training and testing sets. We used an 80/20 split: 80% of the data trains the model and 20% is held out for testing. Holding out a test set does not prevent overfitting by itself, but it lets us detect it and estimate how well the model generalizes to unseen data. Note that the split below takes rows in file order rather than shuffling them first.


# First 1168 rows (80%) for training, remaining 292 rows (20%) for testing
y_train = y[:1168]
y_test = y[1168:]
x_train = x[:1168]
x_test = x[1168:]
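The hard-coded index 1168 only works for a dataset of exactly 1,460 rows (1168 training rows plus the 292 test rows used later). As a minimal sketch, the cut point could instead be derived from the data length, with an optional shuffle. The function name `split_train_test` and the toy arrays are hypothetical and not part of the original notebook, which slices the rows in file order.

```python
import numpy as np

def split_train_test(x, y, train_fraction=0.8, shuffle=False, seed=0):
    """Split x and y into train and test parts at the given fraction."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = len(y)
    cut = int(round(n * train_fraction))  # 1460 rows -> 1168 training rows
    idx = np.arange(n)
    if shuffle:
        # Shuffling guards against bias if the file happens to be sorted
        idx = np.random.default_rng(seed).permutation(n)
    train_idx, test_idx = idx[:cut], idx[cut:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Toy stand-in for the real data: 1460 rows, 4 features (values are made up)
x_demo = np.arange(1460 * 4).reshape(1460, 4)
y_demo = np.arange(1460)
x_tr, x_te, y_tr, y_te = split_train_test(x_demo, y_demo)
print(x_tr.shape, x_te.shape)  # (1168, 4) (292, 4)
```

With `shuffle=False` this reproduces the same row ranges as the slices above.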

  4. Model Architecture and Training Strategy

4.1 Neural Network Structure

The neural network was built with Keras. It has two hidden layers of 32 nodes each, both using the ReLU (Rectified Linear Unit) activation function, followed by a single output node. We chose this size as a balance between computational efficiency and model performance, since deeper or wider networks could overfit our relatively small dataset.


from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(32, activation="relu", input_dim=4))  # hidden layer 1, fed by the 4 features
model.add(Dense(32, activation="relu"))  # hidden layer 2
model.add(Dense(1))  # linear output: the predicted sale price
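As a quick sanity check on how small this network is, the number of trainable parameters can be worked out by hand. The sketch below is plain Python arithmetic rather than Keras code, and `dense_param_count` is a helper name introduced here for illustration.

```python
# Layer sizes: 4 input features -> 32 -> 32 -> 1 output
layer_sizes = [4, 32, 32, 1]

def dense_param_count(n_in, n_out):
    """A dense layer has n_in * n_out weights plus one bias per output unit."""
    return n_in * n_out + n_out

per_layer = [dense_param_count(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)  # [160, 1056, 33] 1249
```

The total should agree with what `model.summary()` reports for the same three layers.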

4.2 Choice of Activation Function

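We selected the ReLU activation function for the hidden layers because it is a widely used and computationally cheap default in deep learning. It introduces the non-linearity the network needs to model relationships that are not purely linear. ReLU is applied only in the hidden layers: the output layer is linear, so the network is not forced to produce positive predictions, even though house prices are always positive.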
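To make the behavior of ReLU concrete, here is a minimal NumPy sketch, independent of Keras. The weights and inputs are made-up numbers chosen only to show that a linear output layer placed after ReLU units can still produce a negative value.

```python
import numpy as np

def relu(z):
    """ReLU keeps positive values and clips negative values to zero."""
    return np.maximum(0.0, z)

hidden_pre = np.array([-2.0, 0.0, 3.0])  # hypothetical pre-activation values
hidden = relu(hidden_pre)                # array([0., 0., 3.])

# Hypothetical weights and bias of a linear output node
w_out = np.array([1.0, 1.0, -0.5])
b_out = 1.0
output = float(hidden @ w_out + b_out)   # 3.0 * -0.5 + 1.0 = -0.5
print(hidden, output)
```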

4.3 Compilation and Loss Function

We compiled the model with Mean Squared Error (MSE) as the loss function and Adam as the optimizer. MSE is the standard loss for regression: it averages the squared differences between predicted and actual values, so large errors are penalized more heavily than small ones. Adam adapts the learning rate during training and typically converges quickly with little tuning.


model.compile(loss="mean_squared_error", optimizer="adam")
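For reference, this is roughly what the MSE loss computes, written as a small NumPy sketch rather than the Keras implementation. The two prices are made-up values for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

actual = [200_000, 150_000]     # hypothetical sale prices
predicted = [210_000, 120_000]  # hypothetical predictions
mse = mean_squared_error(actual, predicted)
# (10_000**2 + 30_000**2) / 2 = 500_000_000
print(mse)
```

Because the errors are squared, the loss is expressed in squared dollars, and the single 30,000 miss contributes nine times as much as the 10,000 miss.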


4.4 Model Training

The model was trained for 100,000 epochs, a high number chosen after tests with fewer epochs produced lower accuracy. Setting verbose=0 suppresses the per-epoch log output, which keeps the notebook output manageable over so many epochs.


model.fit(x_train, y_train, epochs=100000, verbose=0)
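The run above uses a fixed epoch count and no validation data. A common alternative, which we did not use here, is to monitor a validation loss and stop once it has stopped improving; Keras offers this through its `EarlyStopping` callback. The sketch below shows only the underlying stopping rule in plain Python. The function name and the loss values are hypothetical.

```python
def stop_with_patience(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch where the loss has gone `patience`
    epochs without beating its best value, or at the last epoch otherwise.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: best value at epoch 2, no improvement afterwards
losses = [9.0, 7.0, 6.0, 6.5, 6.2, 6.1, 6.3]
stop_epoch, best_epoch = stop_with_patience(losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```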


  5. Model Evaluation and Performance

After training, we ran the model on the test set to generate predictions. To compare predictions with actual values element by element, we converted the actual values to a NumPy array and reshaped the predictions, which Keras returns as a two-dimensional array with one column.


import numpy as np

y_pred = model.predict(x_test)  # shape (292, 1)
y_test_1 = y_test.to_numpy()    # shape (292,)
y_pred_1 = y_pred.reshape(-1)   # flatten to shape (292,) so both arrays align
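The reshape matters because of NumPy broadcasting. A small sketch with stand-in arrays (made-up values, same shapes as the test set) shows what happens if the predictions are left as a column:

```python
import numpy as np

n = 292
y_true = np.arange(1.0, n + 1)  # shape (292,), like y_test.to_numpy()
y_col = np.ones((n, 1))         # shape (292, 1), like the output of model.predict

wrong = y_true - y_col              # broadcasts to a 292 x 292 matrix
right = y_true - y_col.reshape(-1)  # element-wise difference, shape (292,)
print(wrong.shape, right.shape)  # (292, 292) (292,)
```

Without the reshape, every actual value would be compared against every prediction, and any averaged error metric would be meaningless.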

5.1 Mean Absolute Percentage Error (MAPE)

To assess the model's accuracy, we calculated the Mean Absolute Percentage Error (MAPE): the absolute difference between each predicted and actual price, expressed as a percentage of the actual price and averaged over the test set. After multiple runs and adjustments, the model achieved a MAPE of 15.6%, which we deemed acceptable given the limited feature set.


# Absolute percentage error for each house, then the mean over the test set
APE = np.abs(np.subtract(y_test_1, y_pred_1)) / y_test_1 * 100
MAPE = np.mean(APE)
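The same calculation can be wrapped in a small reusable function. This is a sketch under our own assumptions: the name `mape` and the input checks are additions for illustration, and the two example prices are made up.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same number of values")
    if np.any(y_true == 0):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100)

# A 10% miss and a 20% miss average out to roughly 15%
example = mape([100_000, 200_000], [[110_000], [160_000]])
print(round(example, 2))
```

Flattening both inputs inside the function means a column of predictions can be passed in directly.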

  6. Insights on Feature Behavior

To understand how each feature impacted the sale price, we evaluated each one individually, observing both linear and non-linear behavior:

  1. Garage Cars: Minimal effect on sale price, with variations within 1%.

  2. Garage Area: Significant impact, with noticeable non-linear fluctuations.

  3. Overall Quality: Strong linear correlation; increasing quality raises sale price.

  4. Ground Living Area: High sensitivity to reduction in area but diminishing returns when increased.

This individual analysis allowed us to observe the strengths and weaknesses of the model, especially the notable linear correlations and non-linear influences that certain features exerted on the target variable.
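One common way to run this kind of per-feature check is to hold a baseline house fixed, vary a single feature by a percentage, and record how the prediction changes. The sketch below illustrates that idea only; it is an assumption about the procedure, not the exact code behind the observations above. `toy_predict` is a made-up linear stand-in for `model.predict`, and the baseline values are hypothetical.

```python
import numpy as np

def one_at_a_time_sensitivity(predict, baseline, feature_index, deltas):
    """Percent change in the prediction when one feature is scaled by (1 + delta)."""
    baseline = np.asarray(baseline, dtype=float)
    base_pred = float(np.asarray(predict(baseline.reshape(1, -1))).reshape(-1)[0])
    changes = []
    for delta in deltas:
        row = baseline.copy()
        row[feature_index] *= 1.0 + delta  # vary one feature, keep the others fixed
        new_pred = float(np.asarray(predict(row.reshape(1, -1))).reshape(-1)[0])
        changes.append((new_pred - base_pred) / base_pred * 100.0)
    return changes

def toy_predict(rows):
    """Hypothetical stand-in for model.predict: a made-up linear pricing rule."""
    weights = np.array([1000.0, 10.0, 20000.0, 50.0])
    return rows @ weights

# Order: GarageCars, GarageArea, OverallQual, GrLivArea (values are made up)
baseline_house = [2, 500, 6, 1500]
changes = one_at_a_time_sensitivity(toy_predict, baseline_house, 3, [-0.1, 0.1])
print(changes)
```

With a trained Keras model, `model.predict` could be passed in place of `toy_predict`. A linear stand-in responds symmetrically to increases and decreases, so any asymmetry seen with the real model reflects non-linear behavior it has learned.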


  7. Limitations and Considerations

While effective to an extent, the model has notable limitations:

  • Limited Feature Set: Key variables like location were unavailable, impacting prediction accuracy.

  • Insufficient Data: A larger dataset could improve learning and reduce overfitting.

  • High Computational Cost: Extensive training time was required, which may be impractical for larger datasets.

  • Complex Non-linear Patterns: Our model showed weaknesses in capturing certain non-linear patterns, potentially due to unobserved correlations.


  8. Conclusion

The neural network prediction model developed in this project achieved moderate success in forecasting house sale prices. While some features showed a strong correlation with sale price, limitations in dataset scope and size constrained accuracy. Future iterations could benefit from additional features and data, allowing for a more nuanced understanding of house price determinants and improved prediction accuracy.

 
 
 
