October 15, 2025

NYC Taxi Fare and Trip Duration Prediction

Thumbnail of NYC Taxi fare and trip duration prediction

🎯 Objective

Estimate the fare and duration of a trip based on variables such as pick-up and drop-off locations, date and time of the ride, and other factors.

Architecture of NYC fare and trip duration prediction

Description

The project began with a full problem definition and the choice of a learning approach. Because both targets (fare and duration) are continuous values learned from labeled historical data, the task was framed as a supervised learning problem, specifically regression.

The dataset was subjected to extensive data cleaning:

  • Removal of invalid or inconsistent values (negative durations, fares ≤ 0, duplicates)

  • Handling missing values and outliers

  • Splitting into train/test/validation sets
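The cleaning steps above can be sketched in pandas. This is a minimal illustration, not the project's actual code: the column names (pickup_datetime, dropoff_datetime, fare_amount), the split fractions, and the helper names clean_trips and split_frame are all assumptions, and outlier handling is left out for brevity.

```python
import pandas as pd


def clean_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates, rows with missing key fields, and invalid values."""
    df = df.drop_duplicates()
    # Column names are hypothetical; adapt them to the real schema.
    df = df.dropna(subset=["pickup_datetime", "dropoff_datetime", "fare_amount"])
    duration = df["dropoff_datetime"] - df["pickup_datetime"]
    # Keep only non-negative durations and strictly positive fares.
    valid = (duration >= pd.Timedelta(0)) & (df["fare_amount"] > 0)
    return df[valid].reset_index(drop=True)


def split_frame(df: pd.DataFrame, test_frac: float = 0.2,
                val_frac: float = 0.1, seed: int = 42):
    """Shuffle once, then slice into train / validation / test frames."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled.iloc[:n_test]
    val = shuffled.iloc[n_test:n_test + n_val]
    train = shuffled.iloc[n_test + n_val:]
    return train, val, test


# Tiny made-up sample: one duplicate, one negative duration,
# one zero fare, and one missing fare.
raw = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2025-01-01 08:00", "2025-01-01 08:00", "2025-01-01 09:00",
        "2025-01-01 10:00", "2025-01-01 11:00", "2025-01-01 12:00",
    ]),
    "dropoff_datetime": pd.to_datetime([
        "2025-01-01 08:10", "2025-01-01 08:10", "2025-01-01 08:50",
        "2025-01-01 10:20", "2025-01-01 11:15", "2025-01-01 12:30",
    ]),
    "fare_amount": [12.5, 12.5, 9.0, 0.0, None, 25.0],
})

cleaned = clean_trips(raw)
train, val, test = split_frame(pd.DataFrame({"x": range(10)}))
```

Only the two valid, unique trips survive in this sample. Splitting after cleaning keeps invalid rows out of every subset.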

During EDA, a new target variable, trip_duration, was engineered from the pickup and drop-off timestamps. Correlation analysis identified trip_distance as the most relevant predictor (correlation of 0.8). Visual analyses included scatter plots, boxplots, and heatmaps.
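As a rough sketch of that step, trip_duration can be computed as the difference between the two timestamps, then correlated with trip_distance. The timestamp column names and the four sample rows are made up for illustration; only trip_duration and trip_distance come from the write-up.

```python
import pandas as pd

# Made-up rows; pickup_datetime / dropoff_datetime are assumed column names.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2025-01-01 08:00", "2025-01-01 09:00",
        "2025-01-01 10:00", "2025-01-01 11:00",
    ]),
    "dropoff_datetime": pd.to_datetime([
        "2025-01-01 08:06", "2025-01-01 09:14",
        "2025-01-01 10:21", "2025-01-01 11:35",
    ]),
    "trip_distance": [1.0, 3.0, 5.0, 9.0],
})

# Duration in minutes, derived from the timestamp difference.
elapsed = trips["dropoff_datetime"] - trips["pickup_datetime"]
trips["trip_duration"] = elapsed.dt.total_seconds() / 60.0

# Pearson correlation between distance and the engineered target.
corr = trips["trip_distance"].corr(trips["trip_duration"])
```

On real data the same `corr` call, or `DataFrame.corr()` over all numeric columns, would feed the correlation heatmap.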

A data preprocessing pipeline was implemented to standardize cleaning and transformation processes.

Two-Stage Modeling Approach

Two separate models were trained:

  1. Trip Duration Model — trained and optimized first
  2. Fare Prediction Model — trained using the predicted trip duration as a key feature

This cascaded architecture was used because trip duration is the strongest predictor of fare.
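The cascade can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the data is synthetic, the features (distance and pickup hour) are placeholders, and scikit-learn's GradientBoostingRegressor stands in for XGBoost. Training the fare model on out-of-fold duration predictions is a design choice of this sketch, so that the second stage sees durations of the same quality it will receive at inference time.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
n = 400
distance = rng.uniform(0.5, 15.0, n)   # miles
hour = rng.integers(0, 24, n)          # pickup hour of day
rush = ((hour >= 7) & (hour <= 9)) | ((hour >= 16) & (hour <= 19))
duration = 3.0 * distance * np.where(rush, 1.5, 1.0) + rng.normal(0, 1.0, n)
fare = 3.0 + 2.5 * distance + 0.5 * duration + rng.normal(0, 0.5, n)

X = np.column_stack([distance, hour])
X_train, X_test = X[:300], X[300:]
dur_train = duration[:300]
fare_train, fare_test = fare[:300], fare[300:]

# Stage 1: trip duration model.
duration_model = GradientBoostingRegressor(random_state=0)
# Out-of-fold predictions: each training row is predicted by a model
# that never saw it, which avoids leaking the true duration into stage 2.
dur_oof = cross_val_predict(duration_model, X_train, dur_train, cv=5)
duration_model.fit(X_train, dur_train)

# Stage 2: fare model, with predicted duration as an extra feature.
fare_model = GradientBoostingRegressor(random_state=0)
fare_model.fit(np.column_stack([X_train, dur_oof]), fare_train)


def predict_trip(features: np.ndarray):
    """Run the cascade: predict duration, then feed it to the fare model."""
    dur_pred = duration_model.predict(features)
    fare_pred = fare_model.predict(np.column_stack([features, dur_pred]))
    return dur_pred, fare_pred


dur_pred, fare_pred = predict_trip(X_test)
```

At inference time only the raw trip features are needed; the duration estimate is produced internally and passed on to the fare model.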

After the baseline models (Linear Regression & Random Forest), the modeling phase transitioned to XGBoost, which significantly improved performance.

⏱ Trip Duration Model - XGBoost

On average, predicted trip duration is within ~3 minutes of the actual value (MAE), and the model explains ~80% of the variance in trip duration (R² ≈ 0.80).

Metric   Result
MAE      3.10 minutes
RMSE     4.60 minutes
R²       0.8003
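For reference, the three metrics used here are MAE, RMSE, and the coefficient of determination (R², the share of variance explained). They can be computed from scratch as below; the sample values are toy numbers chosen for easy arithmetic, not project data.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in the target's units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: like MAE, but penalizes large misses more."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual variance / total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy durations in minutes.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

mae_value = mae(actual, predicted)    # 2.5
rmse_value = rmse(actual, predicted)  # sqrt(6.5), about 2.55
r2_value = r2(actual, predicted)      # 1 - 26/500 = 0.948
```

RMSE is never smaller than MAE; a large gap between the two points to a few large errors rather than many small ones.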

💵 Fare Prediction Model - XGBoost

Metric   Result
MAE      1.87 USD
R²       0.9326

🧩 Key Visualizations

  • Fare and Duration distributions
  • Correlation heatmaps
  • Predicted vs. actual values
  • Trip duration vs Trip distance

💡 Technologies used

  • Python
  • Pandas
  • Matplotlib
  • Seaborn
  • Scikit-learn
  • PostgreSQL
  • FastAPI
  • Docker
  • Redis
  • React
  • OpenWeather
  • Google Maps
  • TailwindCSS
  • JavaScript

🌐 Results Achieved

  • The model learns strong relationships between trip duration, distance, and fare pricing, improving decision-making for mobility services.
  • On average, the first model can predict trip duration within ~3 minutes of the actual travel time.
  • On average, the fare model estimates costs within ~1.87 USD of the real fare, which is below the standard taxi pricing error tolerance in NYC.
