Predicting Apartment Rental Prices Using XGBoost: A Comprehensive Guide ~ BI-FI Blogs

In today’s data-driven world, accurate predictions of apartment rental prices are crucial for both landlords and tenants. The ability to effectively analyze and interpret rental market data can lead to better decision-making, streamlined transactions, and ultimately, improved satisfaction for both parties. In this blog post, we’ll take you through the process of building a predictive model for apartment rental prices using a dataset obtained from the UCI Machine Learning Repository. We will cover the various preprocessing steps, feature engineering techniques, model setup using XGBoost, and the creation of interactive dashboards for stakeholders.

Obtaining the Data

The dataset used in this project, titled “Apartment for Rent Classified”, is accessible from the UCI Machine Learning Repository. It contains classified advertisements for apartments available for rent in the United States, providing a wealth of information across various features such as rental price, square footage, amenities, and location details. The dataset comprises 10,000 rows and 22 columns, making it a robust resource for understanding the factors influencing rental prices.

The first step in our analysis was to download and load the data into our data processing environment. Once the dataset was ready, we began the journey of data cleaning and preprocessing to prepare it for modeling.
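
The project itself was built in KNIME, so the following Python snippet is only an illustrative equivalent of the loading step, not the original workflow. It assumes the downloaded file is a delimited text file; the semicolon separator, the optional `cp1252` encoding mentioned in the comment, and the column names in the in-memory sample are assumptions to check against your own download.

```python
import io

import pandas as pd


def load_listings(source, sep=";", encoding=None):
    """Read the classified-ads file into a DataFrame.

    Assumption: the file is semicolon-delimited. If reading a file from disk
    fails on special characters, passing encoding="cp1252" is worth trying.
    """
    return pd.read_csv(source, sep=sep, encoding=encoding)


# Tiny in-memory stand-in so the sketch runs without the real download.
sample = io.StringIO(
    "id;title;price;square_feet;cityname;state\n"
    "1;Sunny loft;1200;750;Austin;TX\n"
    "2;Cozy studio;950;400;Denver;CO\n"
)
listings = load_listings(sample)
print(listings.shape)
print(listings.dtypes)
```

With the real file, `load_listings("path/to/your_download.csv")` would be the call, where the path is a placeholder.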

Preprocessing Steps

Data Cleaning

Data cleaning is an essential part of any data analysis process. We started by removing any irrelevant columns that wouldn’t contribute to our model's predictive power. Features such as IDs, titles, and timestamps were dropped to focus on variables with a direct impact on rental prices. We also checked for and handled any missing values, ensuring that our dataset was complete and ready for analysis.
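
As a rough Python sketch of this cleaning step (the original was done in KNIME), the snippet below drops identifier-style columns and handles missing values. The column names and the specific strategy, dropping rows without a price and filling other numeric gaps with the median, are assumptions chosen for illustration; the post does not say which strategy was used.

```python
import pandas as pd

# Hypothetical column names; adjust them to match the actual file.
IRRELEVANT_COLUMNS = ["id", "title", "time"]


def clean_listings(df, target="price"):
    """Drop non-predictive columns and fill or remove missing values."""
    out = df.drop(columns=[c for c in IRRELEVANT_COLUMNS if c in df.columns])
    # A row without a price cannot be used to train or evaluate the model.
    out = out.dropna(subset=[target])
    numeric_cols = [c for c in out.select_dtypes("number").columns if c != target]
    if numeric_cols:
        # Median imputation is one simple choice among many.
        out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out.reset_index(drop=True)


raw = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "title": ["a", "b", "c"],
        "time": [100, 200, 300],
        "price": [1200.0, None, 900.0],
        "square_feet": [700.0, 650.0, None],
    }
)
clean = clean_listings(raw)
print(clean)
```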

Feature Engineering

Feature engineering is the process of creating new variables that can enhance the predictive power of a model. In our case, we specifically focused on the amenities offered in each apartment listing. To achieve this, we created binary features indicating the presence or absence of key amenities, such as:

Has Dishwasher
Has Parking
Has Gym
Has Internet Access
Has Pool
Has Fireplace

This transformation allowed us to quantify the impact of each amenity on the rental price, providing a clearer understanding of their significance.
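
The amenity flags above can be sketched in Python as follows. This assumes the listing stores amenities as a single comma-separated text column named `amenities`; the column name, the flag names, and the keywords are illustrative choices rather than the exact ones used in the KNIME workflow.

```python
import pandas as pd

# Flag name -> keyword searched for in the amenities text (case-insensitive).
AMENITY_KEYWORDS = {
    "has_dishwasher": "dishwasher",
    "has_parking": "parking",
    "has_gym": "gym",
    "has_internet": "internet access",
    "has_pool": "pool",
    "has_fireplace": "fireplace",
}


def add_amenity_flags(df, column="amenities"):
    """Add one 0/1 column per amenity, based on a plain substring match."""
    out = df.copy()
    text = out[column].fillna("").str.lower()
    for flag, keyword in AMENITY_KEYWORDS.items():
        out[flag] = text.str.contains(keyword, regex=False).astype(int)
    return out


ads = pd.DataFrame(
    {"amenities": ["Parking,Pool,Gym", None, "Dishwasher,Internet Access"]}
)
flagged = add_amenity_flags(ads)
print(flagged.drop(columns="amenities"))
```

A listing with no amenities text simply gets zeros for every flag, which keeps the rows usable for training.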

Geospatial Features

To enhance location-based analysis, we used K-means clustering on latitude and longitude to divide the US into 6 distinct regions. These clusters were one-hot encoded and used as features. We also retained latitude and longitude as direct features to capture more granular location-based variation in prices.
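
A minimal Python sketch of this clustering step, using scikit-learn in place of the KNIME nodes, might look like the following. The coordinates are synthetic points scattered around six made-up centers, and the column names `latitude` and `longitude` are assumptions. Note that K-means on raw degrees treats latitude and longitude as flat Euclidean coordinates, which is only an approximation of real distances.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def add_region_clusters(df, n_regions=6, seed=0):
    """Cluster listings by coordinates and one-hot encode the cluster label."""
    out = df.copy()
    kmeans = KMeans(n_clusters=n_regions, n_init=10, random_state=seed)
    out["region"] = kmeans.fit_predict(out[["latitude", "longitude"]])
    dummies = pd.get_dummies(out["region"], prefix="region", dtype=int)
    # Latitude and longitude stay in the frame as direct features.
    return pd.concat([out.drop(columns="region"), dummies], axis=1)


# Synthetic coordinates around six illustrative centers (not real data).
rng = np.random.default_rng(0)
centers = np.array(
    [[47.6, -122.3], [34.1, -118.2], [39.7, -105.0],
     [29.8, -95.4], [41.9, -87.6], [40.7, -74.0]]
)
coords = np.repeat(centers, 20, axis=0) + rng.normal(scale=0.3, size=(120, 2))
listings = pd.DataFrame(coords, columns=["latitude", "longitude"])

clustered = add_region_clusters(listings)
print(clustered.head())
```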

Feature Validation

We introduced a component called "Feature Check," where we plot correlation matrices, variance inflation factor (VIF) scores, and other metrics to ensure that our engineered features add predictive value without introducing multicollinearity.
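
One common multicollinearity check is the variance inflation factor, which can be read off the diagonal of the inverse correlation matrix. The sketch below is a generic NumPy illustration of that idea on synthetic data, not a reproduction of the "Feature Check" component; the feature names are made up.

```python
import numpy as np
import pandas as pd


def vif_scores(features):
    """Variance inflation factor per column.

    The VIF of feature j equals 1 / (1 - R^2_j), where R^2_j comes from
    regressing feature j on all the others. This matches the j-th diagonal
    entry of the inverse of the correlation matrix.
    """
    corr = np.corrcoef(features.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=features.columns)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of "a"
features = pd.DataFrame({"a": a, "b": b, "c": c})

print(features.corr().round(2))  # pairwise correlation matrix
vif = vif_scores(features)
print(vif.round(1))
```

Here `a` and `c` receive large scores because each is almost fully explained by the other, while `b` stays close to 1. A common rule of thumb treats values above roughly 5 to 10 as a warning sign.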

Normalization

To keep our numerical features on comparable scales, we applied normalization, including scaling the square footage. Scaling prevents features with large ranges from dominating scale-sensitive methods such as distance-based clustering or linear models. Tree-based learners like XGBoost split on feature thresholds and are largely unaffected by monotonic rescaling, so this step is not strictly required for the final model, but it keeps the features comparable for any scale-sensitive analysis.
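
As an illustration of the scaling step, the sketch below applies z-score standardization to a handful of made-up square footage values. The post does not say which normalization method was used in KNIME, so z-scoring here is an assumption; min-max scaling would be an equally plausible reading.

```python
import pandas as pd


def fit_standardizer(values):
    """Return a function that applies z-score scaling learned from `values`."""
    mean = values.mean()
    std = values.std(ddof=0)
    return lambda x: (x - mean) / std


# Made-up square footage values for illustration.
square_feet = pd.Series([400.0, 600.0, 800.0, 1000.0, 1200.0])
scale = fit_standardizer(square_feet)
scaled = scale(square_feet)
print(scaled.round(3).tolist())
```

In a real pipeline the scaler would be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.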

Setting Up XGBoost

After preprocessing and feature engineering, we set up our predictive model using XGBoost, a powerful machine learning algorithm known for its speed and performance. XGBoost is an implementation of gradient boosting designed to be highly efficient and effective for structured data.

Model Training

We split our dataset into training (80%) and testing (20%) sets to evaluate the model's performance accurately. The model was trained on the training set, using the rental price as the target variable and the engineered features as inputs. We configured several hyperparameters, such as the learning rate (eta) and the number of boosting rounds, to optimize model performance.

Results

After training the model, we achieved the following performance metrics:

R²: 0.7719
Mean Absolute Error (MAE): 215.54
Mean Squared Error (MSE): 175,374.24
Root Mean Squared Error (RMSE): 418.78
Mean Absolute Percentage Error (MAPE): 0.1453 (about 14.5%)
Adjusted R²: 0.7719
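
For readers who want to see how the metrics listed above are defined, here is a small NumPy sketch that computes them from arrays of actual and predicted prices. The four prices are made up for illustration, and `n_features` is the number of model inputs used in the adjusted R² formula.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_features):
    """Compute the regression metrics listed above from actuals and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    n = len(y_true)
    mse = np.mean(residuals ** 2)
    r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {
        "r2": r2,
        "mae": np.mean(np.abs(residuals)),
        "mse": mse,
        "rmse": np.sqrt(mse),  # RMSE is the square root of MSE
        "mape": np.mean(np.abs(residuals / y_true)),  # as a fraction, not percent
        "adjusted_r2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
    }


# Made-up prices for illustration only.
metrics = regression_metrics(
    y_true=[1000, 1500, 2000, 2500],
    y_pred=[1100, 1400, 2100, 2400],
    n_features=1,
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because adjusted R² penalizes R² by the ratio (n - 1) / (n - p - 1), the two values are nearly identical whenever the number of rows n is much larger than the number of features p.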

While these results indicate a strong predictive capability, especially for a real-world dataset, there’s always room for refinement. The features we engineered, particularly the geospatial clusters and amenity flags, played a significant role in improving the model’s performance.

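In practical terms, the model explains about 77% of the variance in rental prices and misses by roughly $216 per listing on average, or about 14.5% of the actual price. The noticeably larger RMSE indicates that a minority of listings are mispredicted by a much wider margin, so the predictions are best treated as well-informed estimates rather than exact figures.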

Creating Diverse Dashboards for Stakeholders

With our model trained and evaluated, the next step was to present the results in a way that stakeholders could easily interpret and utilize. We developed interactive dashboards using KNIME that allowed users to filter data by various dimensions, such as city and state.

Dynamic Filtering

The dashboards were designed with user-friendly widgets that enabled stakeholders to select specific cities or states, dynamically calculating average rental prices and prediction errors based on the filters applied. This feature provided a real-time understanding of the rental market, empowering users to identify trends and anomalies in pricing.
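
The filtering logic behind such widgets can be approximated in a few lines of pandas. The sketch below is an illustration only, since the actual dashboards were built in KNIME; the column names `state`, `cityname`, `price`, and `predicted_price`, as well as the sample rows, are assumptions.

```python
import pandas as pd


def filter_summary(df, state=None, city=None):
    """Apply optional state/city filters and recompute the aggregates."""
    subset = df
    if state is not None:
        subset = subset[subset["state"] == state]
    if city is not None:
        subset = subset[subset["cityname"] == city]
    residuals = subset["price"] - subset["predicted_price"]
    return {
        "listings": len(subset),
        "avg_actual": subset["price"].mean(),
        "avg_predicted": subset["predicted_price"].mean(),
        "avg_residual": residuals.mean(),
        "avg_abs_error": residuals.abs().mean(),
    }


# Made-up scored listings for illustration.
scored = pd.DataFrame(
    {
        "state": ["TX", "TX", "CO"],
        "cityname": ["Austin", "Austin", "Denver"],
        "price": [1200.0, 1400.0, 1000.0],
        "predicted_price": [1150.0, 1500.0, 1050.0],
    }
)
austin = filter_summary(scored, state="TX", city="Austin")
print(austin)
```

Calling `filter_summary(scored)` with no filters returns the same aggregates for the whole table, which mirrors a dashboard with nothing selected.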

Summary Metrics

In addition to dynamic filtering, the dashboards displayed summary metrics, including average actual prices, average predictions, and average residuals. This information was presented in a concise, human-readable format, allowing stakeholders to quickly grasp the model's performance and pricing insights. For example, the summary might state:

“In [City], the average rental price is $X, with our model predicting prices with an average error of $Y.”
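
A template like the one above is straightforward to fill in programmatically. The following Python sketch shows one way to do it; the function name and the sample values are hypothetical and not taken from the KNIME workflow.

```python
def summary_sentence(city, avg_price, avg_error):
    """Fill the dashboard's summary template with rounded dollar amounts."""
    return (
        f"In {city}, the average rental price is ${avg_price:,.0f}, "
        f"with our model predicting prices with an average error of ${avg_error:,.0f}."
    )


# Hypothetical values for illustration.
sentence = summary_sentence("Austin", 1300.0, 75.0)
print(sentence)
```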

By presenting the data in a straightforward manner, we ensured that stakeholders could easily understand the implications of the model's predictions.

Conclusion

In conclusion, this project demonstrated how to leverage data science techniques to predict apartment rental prices. By obtaining a well-structured dataset, applying rigorous preprocessing steps, engineering relevant features, and setting up an XGBoost model, we were able to generate reasonably accurate predictions, with an R² of about 0.77 and a mean absolute percentage error of about 14.5%.

The development of dynamic dashboards further enhanced stakeholder engagement, providing them with real-time insights into the rental market. This holistic approach not only facilitates better decision-making but also sets the stage for future enhancements in predictive analytics within the real estate sector. As data continues to evolve, our methods and models will adapt, ensuring that stakeholders remain informed and empowered in their real estate endeavors.

You can download the workflow from the BI-FI Blogs KNIME Hub page.