NYC Taxi Fare and Trip Duration Prediction
🎯 Objective
Estimate the fare and duration of a trip based on variables such as pick-up and drop-off locations, date and time of the ride, and other factors.

Description
The project began with a full problem definition and determination of the learning approach. Since the predictions relied on labeled historical data, it was classified as a supervised learning problem — specifically, regression.
The dataset was subjected to extensive data cleaning:
- Removal of invalid or inconsistent values (negative durations, fares ≤ 0, duplicates)
- Handling missing values and outliers
- Splitting into train/test/validation sets
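The cleaning steps above could look roughly like the following pandas sketch. The column names (pickup_datetime, dropoff_datetime, fare_amount), the function names, and the split fractions are assumptions for illustration, not the project's actual schema or settings.

```python
import pandas as pd


def clean_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates, rows with missing key fields, and invalid values.

    Column names are hypothetical; adapt them to the real dataset.
    """
    df = df.drop_duplicates().copy()
    # Unparseable timestamps become NaT and are dropped with other missing values.
    df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"], errors="coerce")
    df["dropoff_datetime"] = pd.to_datetime(df["dropoff_datetime"], errors="coerce")
    df = df.dropna(subset=["pickup_datetime", "dropoff_datetime", "fare_amount"])
    duration = df["dropoff_datetime"] - df["pickup_datetime"]
    # Keep only non-negative durations and strictly positive fares.
    valid = (duration >= pd.Timedelta(0)) & (df["fare_amount"] > 0)
    return df[valid].reset_index(drop=True)


def split_three_way(df: pd.DataFrame, val_frac: float = 0.15,
                    test_frac: float = 0.15, seed: int = 42):
    """Shuffle once, then slice into train, validation, and test frames."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled.iloc[:n_test]
    val = shuffled.iloc[n_test:n_test + n_val]
    train = shuffled.iloc[n_test + n_val:]
    return train, val, test
```

Outlier handling is left out of this sketch because the source does not say which rule was applied.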
During EDA, a new target variable, trip_duration, was engineered from the pickup and drop-off timestamps. Correlation analysis revealed trip_distance as the most relevant predictor (correlation of 0.8). Visual analyses included scatter plots, boxplots, and heatmaps.
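As a minimal sketch of that step, the snippet below derives trip_duration in minutes from two timestamp columns and ranks numeric columns by absolute correlation with it. The timestamp column names and both function names are assumptions; only trip_duration and trip_distance come from the project description.

```python
import pandas as pd


def add_trip_duration(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with trip_duration (minutes) computed from timestamps."""
    out = df.copy()
    # Hypothetical column names; both must already be datetime64 columns.
    delta = out["dropoff_datetime"] - out["pickup_datetime"]
    out["trip_duration"] = delta.dt.total_seconds() / 60.0
    return out


def rank_predictors(df: pd.DataFrame, target: str = "trip_duration") -> pd.Series:
    """Rank numeric columns by absolute Pearson correlation with the target."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)
```

Pearson correlation only captures linear relationships, so a ranking like this is a starting point for EDA rather than a final feature selection.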
A data preprocessing pipeline was implemented to standardize cleaning and transformation processes.
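One way such a pipeline might be assembled with scikit-learn is sketched below. The feature lists, imputation strategies, and the choice of scaling and one-hot encoding are assumptions for illustration; the source does not state which transformations the project used.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature lists; replace with the real columns.
NUMERIC_FEATURES = ["trip_distance", "passenger_count"]
CATEGORICAL_FEATURES = ["pickup_weekday"]


def build_preprocessor() -> ColumnTransformer:
    """Impute and scale numeric columns; impute and one-hot encode categoricals."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, NUMERIC_FEATURES),
        ("cat", categorical, CATEGORICAL_FEATURES),
    ])
```

Fitting the preprocessor on the training split only, then reusing it to transform the validation and test splits, keeps the same transformations applied everywhere without leaking test statistics.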
Two-Stage Modeling Approach
Two separate models were trained:
- Trip Duration Model — trained and optimized first
- Fare Prediction Model — trained using the predicted trip duration as a key feature
This cascaded architecture was used because trip duration is the strongest predictor of fare.
After the baseline models (Linear Regression & Random Forest), the modeling phase transitioned to XGBoost, which significantly improved performance.
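The cascaded setup can be illustrated with the sketch below. It uses scikit-learn's GradientBoostingRegressor as a stand-in so that the example depends only on scikit-learn and NumPy; the project itself reports using XGBoost, and the class name, synthetic data generator, and feature layout here are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


class CascadedTaxiModel:
    """Stage 1 predicts duration; stage 2 predicts fare from features + predicted duration."""

    def __init__(self, seed: int = 0):
        # Stand-in estimators; the project describes XGBoost models instead.
        self.duration_model = GradientBoostingRegressor(random_state=seed)
        self.fare_model = GradientBoostingRegressor(random_state=seed)

    def fit(self, X, duration, fare):
        self.duration_model.fit(X, duration)
        predicted_duration = self.duration_model.predict(X)
        # The fare model sees the predicted duration, as it would at inference time.
        self.fare_model.fit(np.column_stack([X, predicted_duration]), fare)
        return self

    def predict(self, X):
        predicted_duration = self.duration_model.predict(X)
        fare = self.fare_model.predict(np.column_stack([X, predicted_duration]))
        return predicted_duration, fare


def make_synthetic_trips(n: int = 300, seed: int = 0):
    """Generate toy data (not real NYC trips) just to exercise the cascade."""
    rng = np.random.default_rng(seed)
    distance = rng.uniform(0.5, 15.0, n)
    hour = rng.integers(0, 24, n)
    rush = ((hour >= 7) & (hour <= 9)) | ((hour >= 16) & (hour <= 19))
    duration = 3.0 * distance + 5.0 * rush + rng.normal(0.0, 1.0, n)
    fare = 3.0 + 2.0 * distance + 0.5 * duration + rng.normal(0.0, 0.5, n)
    return np.column_stack([distance, hour]), duration, fare


X, duration, fare = make_synthetic_trips()
model = CascadedTaxiModel().fit(X, duration, fare)
predicted_duration, predicted_fare = model.predict(X[:5])
```

Fitting the fare model on in-sample duration predictions is the simplest option. Out-of-fold predictions would be a reasonable alternative if overfitting in the first stage is a concern.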
⏱ Trip Duration Model - XGBoost
On average, predicted trip durations fall within ~3 minutes of the actual value (MAE), and the model explains ~80% of the variance in trip duration (R²).
| Metric | Result |
|---|---|
| MAE | 3.10 minutes |
| RMSE | 4.60 minutes |
| R² | 0.8003 |
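For reference, the three metrics in the table can be computed as in the sketch below. The function name and the toy numbers are illustrative assumptions and are not taken from the project's data.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, RMSE, and R² for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    mae = float(np.mean(np.abs(errors)))          # average absolute error
    rmse = float(np.sqrt(np.mean(errors ** 2)))   # penalizes large errors more
    ss_res = float(np.sum(errors ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot                    # share of variance explained
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Toy durations in minutes, for illustration only.
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 39])
```

An RMSE noticeably larger than the MAE, as in the table above, usually indicates that a minority of trips have much larger errors than the typical one.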
💵 Fare Prediction Model - XGBoost
| Metric | Result |
|---|---|
| MAE | 1.87 USD |
| R² | 0.9326 |
🧩 Key Visualizations
- Fare and Duration distributions
- Correlation heatmaps
- Predictions vs. actual values
- Trip duration vs Trip distance
💡 Technologies used
- Python
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
- PostgreSQL
- FastAPI
- Docker
- Redis
- React
- OpenWeather
- Google Maps
- TailwindCSS
- JavaScript
🌐 Results Achieved
- The models capture strong relationships between trip duration, distance, and fare, which can support decision-making for mobility services.
- On average, the first model can predict trip duration within ~3 minutes of the actual travel time.
- On average, the fare model can estimate costs within ~1.87 USD of the real fare, which is below the standard taxi pricing error tolerance in NYC.