Predicting Used Car Prices

Project Overview and Introduction

In the highly competitive used car market, accurate pricing is crucial for maximizing profits, attracting customers, and managing inventory effectively. This project uses data analytics and machine learning to identify the key drivers of used car prices and to develop a model that predicts those prices. The results are intended to help a used car business make informed pricing decisions, build customer trust, and improve overall business performance.

CRISP-DM Framework

This project follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework. This widely adopted methodology provides a structured approach to data mining and keeps the analysis systematic. The CRISP-DM process consists of six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

Business Understanding

The used car market is highly dynamic, influenced by a variety of factors ranging from economic conditions to consumer preferences. For this business, accurately predicting used car prices is crucial for several reasons:

Given these aspects, the primary goal is to identify the key drivers that significantly influence the prices of used cars. Understanding these drivers will enable the business to develop a robust pricing strategy, improve decision-making, and enhance overall business performance.

Objectives

By achieving these objectives, the business aims to gain a competitive edge in the used car market, improve operational efficiency, and drive sustainable growth.

Data Understanding

In the Data Understanding phase, we explored and familiarized ourselves with the dataset of 426,880 used car listings. This involved identifying key variables such as price, year, and odometer reading, along with categorical attributes such as manufacturer, model, and condition. Initial exploration allowed us to assess data quality, distributions, and relationships between variables, laying the groundwork for the data preparation and modeling phases of the CRISP-DM framework.

Data Collection

The dataset comprising 426,880 used car listings was sourced from Kaggle, a platform known for hosting diverse datasets contributed by the community.

Data Description

The dataset contains information on 426,880 used cars, with 18 attributes detailing various aspects of each vehicle. Below is a detailed description of each column:

  1. id: A unique identifier for each car listing.
  2. region: The geographic region where the car is listed.
  3. price: The listed price of the car in dollars.
  4. year: The manufacturing year of the car.
  5. manufacturer: The manufacturer or brand of the car (e.g., Ford, Toyota).
  6. model: The model name of the car.
  7. condition: The condition of the car (e.g., new, like new, excellent, good, fair, salvage).
  8. cylinders: The number of cylinders in the car’s engine.
  9. fuel: The type of fuel the car uses (e.g., gas, diesel, electric, hybrid).
  10. odometer: The mileage of the car (distance traveled in miles).
  11. title_status: The status of the car’s title (e.g., clean, salvage, rebuilt).
  12. transmission: The type of transmission (e.g., automatic, manual).
  13. VIN: The Vehicle Identification Number, a unique code used to identify individual motor vehicles.
  14. drive: The type of drivetrain (e.g., 4wd, fwd, rwd).
  15. size: The size category of the car (e.g., compact, mid-size, full-size).
  16. type: The type or category of the car (e.g., sedan, SUV, truck).
  17. paint_color: The exterior color of the car’s paint.
  18. state: The state where the car is listed.

Data Exploration

In the Data Exploration phase, we conducted a thorough analysis to uncover patterns, relationships, and insights within the dataset. Key activities included:

These exploratory analyses laid the foundation for subsequent data preparation and modeling steps, ensuring a comprehensive understanding of the dataset and guiding informed decisions throughout the project.
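One common data quality check at this stage is the share of missing values per column. The sketch below is illustrative only: it assumes a CSV export with some of the columns described above, the sample rows are made up, and `missing_share` is a hypothetical helper rather than code from the project notebook.

```python
import csv
import io

# Tiny made-up sample using a few of the documented columns (not real listings).
SAMPLE_CSV = """id,region,price,year,manufacturer,condition,odometer
1,auburn,15000,2013,ford,good,128000
2,auburn,0,,toyota,,68696
3,birmingham,27990,2019,,excellent,
"""

def missing_share(rows, columns):
    """Return the fraction of empty values for each column."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            # Treat empty or whitespace-only cells as missing.
            if not (row.get(col) or "").strip():
                counts[col] += 1
    total = len(rows)
    return {col: counts[col] / total for col in columns}

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
shares = missing_share(rows, reader.fieldnames)

# Print columns from most to least incomplete.
for col, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{col:<14}{share:.0%}")
```

Note that a price of 0 is not counted as missing in this sketch; placeholder values like that need a separate plausibility check.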

Data Preparation

In the Data Preparation phase, we focused on transforming the raw dataset into a clean, structured format suitable for modeling. This phase is crucial for ensuring the accuracy and reliability of our predictive model. Here’s a detailed overview of the steps taken:

By preparing the dataset in this manner, we established a solid foundation for building and evaluating models that forecast used car prices. This phase improved data quality and also streamlined the subsequent modeling, evaluation, and deployment phases of the CRISP-DM framework.
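As a rough illustration of this kind of cleaning, the sketch below drops incomplete or implausibly priced rows and derives a vehicle age field (vehicle age appears among the features in the findings). The price bounds, the reference year, and the `prepare` helper are all assumptions made for the example, not values or code taken from the project.

```python
REFERENCE_YEAR = 2024  # assumption: the source does not say which year age is measured from

def prepare(listings, min_price=500, max_price=150_000):
    """Drop incomplete or implausibly priced rows and add a vehicle_age field.

    The price bounds are illustrative placeholders, not values from the project.
    """
    cleaned = []
    for row in listings:
        # Skip rows missing any of the numeric fields used below.
        if any(row.get(key) is None for key in ("price", "year", "odometer")):
            continue
        # Skip placeholder or extreme prices (for example $0 listings).
        if not min_price <= row["price"] <= max_price:
            continue
        cleaned.append({**row, "vehicle_age": REFERENCE_YEAR - row["year"]})
    return cleaned

# Made-up rows for demonstration only.
sample = [
    {"price": 15000, "year": 2013, "odometer": 128000, "fuel": "gas"},
    {"price": 0, "year": 2015, "odometer": 90000, "fuel": "gas"},         # placeholder price
    {"price": 27990, "year": None, "odometer": 20000, "fuel": "diesel"},  # missing year
    {"price": 9500, "year": 2008, "odometer": 160000, "fuel": "diesel"},
]

prepared = prepare(sample)
for row in prepared:
    print(row)
```

Dropping rows is the simplest option; imputing missing values would keep more data at the cost of extra assumptions.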

Modeling

In the Modeling phase, we aimed to develop a robust model to forecast used car prices from the cleaned and transformed dataset. Key steps and considerations include:

Modeling Output

Linear Regression performed best of the four models, with the lowest RMSE and the highest R² value. Based on these metrics, it is the most suitable model for this dataset. Ridge and Lasso produced similar results and could be evaluated further with different alpha values.
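To give a feel for what evaluating different alphas involves, here is a minimal, dependency-free sketch of ridge regression with a single feature, where the closed-form slope is Sxy / (Sxx + alpha) and alpha = 0 reduces to ordinary least squares. This is not the project's pipeline: the data points are made up, `fit_ridge_1d` is a hypothetical helper, and Lasso is left out because it has no comparable closed form.

```python
import math

def fit_ridge_1d(xs, ys, alpha):
    """Fit y = intercept + slope * x with an L2 penalty of alpha on the slope only."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)  # alpha = 0 gives the ordinary least squares slope
    intercept = y_mean - slope * x_mean
    return intercept, slope

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up (vehicle age in years, price in dollars) pairs, not project data.
train_age = [1, 2, 4, 5, 7, 9, 12]
train_price = [31000, 28500, 25000, 22500, 19000, 15500, 9000]
test_age = [3, 6, 10]
test_price = [26500, 20500, 13000]

results = {}
for alpha in (0.0, 1.0, 10.0, 100.0):
    intercept, slope = fit_ridge_1d(train_age, train_price, alpha)
    preds = [intercept + slope * x for x in test_age]
    results[alpha] = (slope, rmse(test_price, preds))
    print(f"alpha={alpha:>6}: slope={slope:9.2f}  test RMSE={results[alpha][1]:9.2f}")
```

On this toy data a larger alpha shrinks the slope toward zero and increases the held-out error; on real data with many correlated features the best alpha is usually chosen by cross-validation.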

Evaluation

In the Evaluation phase, we assessed the performance of the trained models to ensure they met the project objectives and business requirements:
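For reference, the two metrics used to compare the models, RMSE and R², have the following standard definitions, where y_i is the listed price of car i, ŷ_i is the model's prediction, ȳ is the mean listed price, and n is the number of cars in the evaluation set. The symbols are introduced here for illustration and do not come from the project notebook.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

RMSE is expressed in dollars, so lower is better; R² is the share of price variance the model explains, so higher is better.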

Model Predictions

image

image


Conclusion

In conclusion, this project focused on predicting used car prices using a structured approach based on the CRISP-DM framework. We started with understanding business objectives and data collection, followed by thorough data preparation, modeling, evaluation, and deployment phases. Here’s a summary of our findings and actionable insights:

Interesting Findings

| Feature | Recommendation | Coefficient Value | Impact | Interpretation |
| --- | --- | --- | --- | --- |
| Fuel | Collect more data on electric and hybrid models, and consider offering incentives for fuel-efficient options | Diesel: 4892.09 | High | Fuel type Diesel is associated with higher prices, with electric and hybrid models also contributing to higher prices |
| Type | Focus on marketing convertibles, offroad, pickup, truck, and coupe models | Convertible: 2707.22 | High | Convertibles, offroad, pickup trucks, trucks, and coupes are associated with higher prices |
| Make, Model | Highlight popular makes and models in listings | Make_model: 2539.73, Make_model^2: -83.8 | High | Certain makes and models can drastically increase the price |
| Cylinders | Emphasize performance aspects in high-cylinder vehicles | 1398.6 | Medium | Vehicles with more cylinders are often high-performance and can command higher prices |
| Condition | Ensure accurate and detailed condition reports | 759.02 | Medium | Better condition typically results in higher prices |
| Odometer | Highlight vehicles with lower mileage prominently | -2013 | High | Higher mileage significantly reduces the price |
| Title Status | Consider investigating the reasons for price differences and provide clear title status information | 1007.08 | Medium | Clear title status increases the vehicle’s price |
| Transmission | Promote manual transmissions and educate buyers on their benefits | Manual: 734, Auto: -643 | Medium | Manual transmissions increase the price, while automatic transmissions slightly reduce it |
| Drive | Highlight the benefits of specific drive types, such as 4WD | 575.37 | Low | Certain drive types (e.g., 4WD) can increase the price |
| Size | Emphasize the advantages of larger vehicle sizes in listings | 428.59 | Low | Larger vehicle sizes can lead to a price increase |
| Paint Color | Consider studying the popularity of colors in the market and highlight desirable colors | Yellow: 658, Custom: 599 | Low | Popular or unique colors can increase the price |
| Vehicle Age | Provide maintenance records and emphasize longevity for older vehicles | -4832.72 | Very High | Older vehicles significantly decrease in price |
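The coefficients in the findings table can be ranked by absolute size to see which features move the predicted price most. The sketch below does that with the values from the table. It assumes the features were on comparable scales when the model was fit, which the source does not state, and the dictionary keys and helper names are illustrative. It also shows how the linear and squared Make_model terms combine; the scale of the encoded Make_model value is unknown, so the turning point is only a property of the two coefficients.

```python
# Coefficients copied from the findings table; key names are illustrative.
coefficients = {
    "fuel_diesel": 4892.09,
    "type_convertible": 2707.22,
    "make_model": 2539.73,
    "cylinders": 1398.6,
    "condition": 759.02,
    "odometer": -2013,
    "title_status": 1007.08,
    "transmission_manual": 734,
    "transmission_auto": -643,
    "drive": 575.37,
    "size": 428.59,
    "paint_yellow": 658,
    "paint_custom": 599,
    "vehicle_age": -4832.72,
}

def rank_by_magnitude(coefs):
    """Sort (feature, coefficient) pairs from largest to smallest absolute value."""
    return sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)

def make_model_contribution(x, linear=2539.73, quadratic=-83.8):
    """Combined contribution of the Make_model and Make_model^2 terms for encoded value x."""
    return linear * x + quadratic * x ** 2

ranked = rank_by_magnitude(coefficients)
for name, value in ranked:
    direction = "raises" if value > 0 else "lowers"
    print(f"{name:<22}{value:>10.2f}  {direction} predicted price")

# The negative squared term means the contribution peaks and then declines.
peak_x = -2539.73 / (2 * -83.8)
print(f"Make_model contribution peaks at encoded value {peak_x:.2f}")
```

Under these assumptions, fuel type (diesel) and vehicle age stand out as the two largest effects, in opposite directions.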

Actionable Insights

By leveraging these insights, our business can improve its competitive edge, drive revenue growth, and enhance customer satisfaction.

Future Work

Repository Structure

Notebook

The detailed analysis and code can be found in the Jupyter notebook here.