This is my first Machine Learning project.
Predict house prices from various property features using a supervised learning model.
- Rooms: Number of rooms in the property
- Distance: Distance from the central business district (in kilometers)
- Postcode: Postal code of the property
- Bedroom2: Number of bedrooms (as reported by the real estate agent)
- Bathroom: Number of bathrooms
- Car: Number of car spots
- Landsize: Land size in square meters
- BuildingArea: Building size in square meters
- YearBuilt: Year the house was built
- Lattitude: Geographic latitude
- Longtitude: Geographic longitude
- Propertycount: Number of properties in the same suburb
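As a minimal sketch of how the features listed above could be separated from the target with Pandas: the column names are copied exactly as listed (including the spellings `Lattitude` and `Longtitude`), while the helper name `split_features_target` and the tiny demo frame are assumptions made for illustration, not part of the project code.

```python
import pandas as pd

# Feature columns exactly as listed above; "Lattitude" and "Longtitude"
# are kept with the spelling used in the feature list.
FEATURES = [
    "Rooms", "Distance", "Postcode", "Bedroom2", "Bathroom", "Car",
    "Landsize", "BuildingArea", "YearBuilt", "Lattitude", "Longtitude",
    "Propertycount",
]
TARGET = "Price"


def split_features_target(df):
    """Hypothetical helper: return (X, y), dropping rows with no target value."""
    df = df.dropna(subset=[TARGET])
    return df[FEATURES], df[TARGET]


# Tiny made-up frame standing in for the real dataset (values are illustrative only).
demo = pd.DataFrame({name: [1.0, 2.0, 3.0] for name in FEATURES})
demo[TARGET] = [500000.0, None, 750000.0]

X, y = split_features_target(demo)
print(X.shape, y.tolist())
```

With the real data, the demo frame would be replaced by the DataFrame loaded from the dataset file.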
Random Forest Regressor was used to train the model and predict house prices.
- Algorithm Type: Ensemble method (combines multiple decision trees).
- Metric: Mean Absolute Error (MAE).
- Average Accuracy: Approximately 85.08%.
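The training and evaluation step described above could look roughly like the sketch below. It uses synthetic data so it runs on its own; the feature values, the 75/25 split, `n_estimators=50`, and the random seeds are all assumptions for illustration and are not taken from melb_model.py.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 "properties" with 3 numeric features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 100_000 * X[:, 0] + 20_000 * X[:, 1] + rng.normal(0, 10_000, size=200)

# Hold out a validation set so the error is measured on unseen rows.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Random forest: an ensemble that averages the predictions of many decision trees.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_val)

# MAE is the mean of |actual - predicted|, in the same units as the target.
mae = mean_absolute_error(y_val, preds)
manual_mae = float(np.mean(np.abs(y_val - preds)))
print(f"Validation MAE: {mae:,.0f}")
```

Because MAE is expressed in the target's own units, it reads directly as the typical size of a price error.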
- Name: Melbourne Housing Dataset.
- Source: Kaggle — Melbourne Housing Market dataset.
- Size: ~13,580 rows and 21 columns.
- Target Variable: Price.
- Actual vs Predicted Prices: Scatter plot showing model predictions against real values.
- Error Distribution: Histogram showing how prediction errors are spread.
- Top 20 Features: Histogram displaying the top 20 features after encoding.
- Residual Graph: Scatter plot showing the residuals (the differences between actual and predicted prices).
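Three of the plots listed above (actual vs predicted, error distribution, residuals) can be sketched with Matplotlib as follows. The actual and predicted values here are randomly generated placeholders, and the figure layout and labels are assumptions; the project's own plots may be built differently.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder values (assumption): in the project these would come from the model.
rng = np.random.default_rng(1)
actual = rng.uniform(300_000, 1_500_000, size=100)
predicted = actual + rng.normal(0, 80_000, size=100)
residuals = actual - predicted

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Actual vs predicted: points near the dashed diagonal are good predictions.
axes[0].scatter(actual, predicted, alpha=0.6)
limits = [actual.min(), actual.max()]
axes[0].plot(limits, limits, linestyle="--", color="gray")
axes[0].set_xlabel("Actual price")
axes[0].set_ylabel("Predicted price")

# Error distribution: histogram of the prediction errors.
axes[1].hist(residuals, bins=20)
axes[1].set_xlabel("Prediction error")

# Residual plot: residuals against predictions, with a zero reference line.
axes[2].scatter(predicted, residuals, alpha=0.6)
axes[2].axhline(0, linestyle="--", color="gray")
axes[2].set_xlabel("Predicted price")
axes[2].set_ylabel("Residual")

fig.tight_layout()
```

In an interactive session, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` would display the figure.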
- The file app.py contains the UI for the app.
- The UI is built with Streamlit.
- The steps to launch it are as follows:
- Have the PKL code ready within your model's code; that section is clearly marked with comments in melb_model.py.
- Run that file; this creates the pkl file.
- Launch the app using the command
streamlit run app.py
- Follow the link that gets displayed to open the app in your browser of choice.
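The pkl step above amounts to saving the trained model to disk and loading it back, which can be sketched with Python's `pickle` module. The training data, the file name `melb_model.pkl`, and the temporary directory are assumptions for illustration; the actual export code is the commented section in melb_model.py.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model (assumption) standing in for the trained project model.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = X.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Hypothetical file name and location; the project may use a different path.
model_path = os.path.join(tempfile.gettempdir(), "melb_model.pkl")

# Save the fitted model (the step performed when the model script is run).
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back (what a UI script would do before making predictions).
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

same = bool(np.allclose(model.predict(X), loaded.predict(X)))
print("Loaded model reproduces predictions:", same)
```

A pickled model should generally be loaded with the same scikit-learn version that saved it, and pickle files from untrusted sources should not be loaded.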
- Python: The programming language used for the model and the app.
- Pandas: Used for loading, cleaning, and manipulating the dataset.
- NumPy: Provides efficient numerical operations and array handling, which Pandas and Scikit-learn both depend on internally.
- Scikit-learn: Used for machine learning tasks — splitting data, training models (RandomForestRegressor), and evaluating performance (Mean Absolute Error).
- Matplotlib: Handles data visualization, used to create plots and charts to see how the model performs (e.g., actual vs predicted).
- Seaborn: Built on top of Matplotlib, used for statistical visualizations and improving the appearance of plots (e.g., the error distribution histogram).
The dataset is for educational and non-commercial use.