๐ฆ Data Sources and Cleaning
๐ง This project is currently a work in progress.
Iโm actively building out my case study as part of the Google Data Analytics Capstone.
Full write-up, code, and visualizations will be available soon!
๐๏ธ Overview
This project integrates and analyzes Divvy bike-share trip data, station data, and Chicago weather data. The cleaning steps below describe the transformations, filters, and fixes applied to ensure high-quality, analysis-ready data.
๐ฆ Data Sources
- Divvy Trip Data: Divvy Trip Archive and City of Chicago Data Portal
- Weather Data: Meteostat Hourly Bulk Data
- Tourist Stations: Custom dataset created from Google Maps lat/Long queries
See the ๐ฆ Data Sources page for more details and attribution.
๐๏ธ Ride Data Cleaning
Removed pandemic-period rides:
- Dropped all rides with
start_time > Jan 1, 2020
to avoid skew due to COVID-19 lockdown. - Also removed one late outlier from April 2020.
- Dropped all rides with
Removed zero-duration loop rides:
- Filtered out rides where
start_time == end_time
andstart_station_id == end_station_id
.
- Filtered out rides where
Removed โteleportationโ rides:
- Rides with
start_time == end_time
but different start and end stations were eliminated.
- Rides with
Eliminated negative-duration rides (โtime travelersโ):
- Rides where
start_time > end_time
were removed.
- Rides where
Dropped long-duration rides:
- Excluded rides longer than 24 hours (
duration > 86400
seconds). - These were excluded from analysis but retained in the database.
- Excluded rides longer than 24 hours (
Filtered rides missing user type:
- Removed 194 rides with
user_type IS NULL
.
- Removed 194 rides with
Deduplicated rides:
- Removed ~2,767 duplicate records where
bike_id
,start_time
, andend_time
were identical.
- Removed ~2,767 duplicate records where
Filtered overlapping rides per bike:
- For bikes with multiple overlapping rides, kept the ride with the lowest
ride_id
.
- For bikes with multiple overlapping rides, kept the ride with the lowest
๐ Station Data Cleaning
Allowed stations with same name but different coordinates:
Replaced unique constraint on
name
with composite index:CREATE UNIQUE INDEX uniq_vector ON stations(name, lat, long); CREATE INDEX idx_name ON stations(name);
Renamed ambiguous duplicates:
- Manually added suffixes like โ IIโ to disambiguate repeated station names at different coordinates.
Normalized source formats:
- Some station data was manually extracted from Excel, then converted to CSV for uniform processing.
โ๏ธ Weather Data Cleaning
Selected Midway Airport station (72534):
- Chosen for completeness and consistency across years.
Dropped unused or missing columns:
snow
,wpgt
,tsun
,pres
, anddatetime
were excluded from final dataset.
Created
epoch
column for joins:- Combined
date
andhour
fields into datetime, then converted to UNIX epoch to align with ride timestamps.
- Combined
๐ก Notes
- All transformations were tracked in versioned scripts and logged in
logs/workLog.md
. - Weather data cleaning steps were scripted in
src/load_weather.R
. - Station and ride data were validated and transformed via shell scripts and SQLite.
๐ License
Divvy and City of Chicago data is subject to the City of Chicago Terms of Use.
Source: Meteostat (opens new window)
This case study is ยฉ 2025 Scott Sesher and provided under the MIT License. See LICENSE for details.