Chicago Bikeshare Data Exploration and Analysis

Introduction

The dataset comes from Divvy, a bikeshare program of the Chicago Department of Transportation that is operated by Motivate, a global bikeshare service operator. Jifu Zhao, who is credited below, combined it with Chicago weather data and formatted the result into a CSV file.

I found this data on Kaggle while searching for a new project idea after my previous dataset fell through. It interested me because it reminded me of the bikesharing and transportation rental services, like B-cycle and Bird/Limebike scooters, that have been popping up around the city of Austin. Divvy bikes come from fixed stations, much like B-cycle in Austin, so I’m interested in seeing whether any of the findings also apply to Austin.

The repository with my code can be viewed here.

Tools Used

Here are some of the tools I used during this project along with links for more information:

  • matplotlib: Used for quick visualizations of the data
  • Seaborn: What I primarily used to create the final visualizations for the report
  • pandas: “pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.”
  • numpy: Used for fast numerical calculations in Python
  • Jupyter Notebooks/Anaconda: Workspace used to create exploratory notebooks

Resources and Credit

Here I list the resources I used and credit the dataset creator.

Getting Started

First, I downloaded the dataset directly from Kaggle and created a Jupyter notebook to begin my work. I imported the CSV into a pandas DataFrame so that it was easier to create the graphs I needed for analysis. Initially the data was ‘dirty’, with missing values and unusable entries, so I cleaned it in several steps:

  • Reformat trip duration from seconds to minutes
  • Remove trips shorter than a minute or longer than an hour
  • Remove entries with no humidity values
  • Split time and date into multiple separate fields
  • Format weather events into broader categories
  • Remove entries with missing latitude/longitude values
  • Resave cleaned data as a new .csv file

Note that this cleaning will inherently affect the conclusions we draw at the end.
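As a rough illustration, the cleaning steps above could be sketched in pandas like this. The column names (tripduration, humidity, starttime, events, latitude_start, longitude_start) and the weather mapping are my assumptions for the example, not a confirmed copy of the dataset's schema or of the notebook's exact code.

```python
import pandas as pd


def clean_trips(raw: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleaning steps; all column names are assumed."""
    df = raw.copy()

    # Seconds -> minutes, then keep trips between one minute and one hour.
    df["duration_min"] = df["tripduration"] / 60.0
    df = df[df["duration_min"].between(1, 60)]

    # Drop rows that are missing humidity or start coordinates.
    df = df.dropna(
        subset=["humidity", "latitude_start", "longitude_start"]
    ).copy()

    # Split the start timestamp into separate date and time fields.
    start = pd.to_datetime(df["starttime"])
    df["year"] = start.dt.year
    df["month"] = start.dt.month
    df["week"] = start.dt.isocalendar().week.astype(int)
    df["day"] = start.dt.dayofweek  # Monday = 0
    df["hour"] = start.dt.hour

    # Collapse detailed weather events into broader categories.
    # This mapping is hypothetical; adjust it to the labels in the data.
    broad = {
        "snow": "snow/ice",
        "ice": "snow/ice",
        "fog": "not clear",
        "haze": "not clear",
    }
    df["events"] = df["events"].str.lower().replace(broad)

    return df.reset_index(drop=True)
```

The result can then be written back out with DataFrame.to_csv under whatever file name suits the project, which covers the final "resave" step.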

Data Exploration

Date

First, let’s take a look at how the number of trips and their duration changed over the years.

Year Count

As we can see, from 2014 to 2017 the number of trips rose steadily from just below 250,000 to above 350,000, while the average trip duration decreased only slightly. There isn’t much here beyond confirmation that ridership has indeed been growing.

Next, I looked at the same measures month by month.

Month Count

Now we’re starting to get some real information. Not too surprisingly, the summer months are busier and the winter months much less so, with a smooth transition during spring and fall. The average trip duration and its spread also increase during the summer months, likely because people take advantage of fair weather for day excursions.
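For anyone who wants to reproduce the numbers behind these count and duration plots, here is a minimal sketch of the aggregation. It assumes the cleaned table has a duration_min column plus a period column such as year or month; both names are assumptions carried over for illustration.

```python
import pandas as pd


def trips_by_period(trips: pd.DataFrame, period: str) -> pd.DataFrame:
    """Count trips and average their duration for each value of `period`.

    `period` is the name of an assumed column such as "year" or "month".
    """
    return (
        trips.groupby(period)["duration_min"]
        .agg(trip_count="count", mean_duration="mean")
        .reset_index()
    )
```

Calling trips_by_period(trips, "year") and trips_by_period(trips, "month") gives one row per year or month, which can be passed straight to a Seaborn bar plot.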

7-Day Count

Overall, the distribution is fairly uniform, with a slight decline on the weekends. One thing to note is that weekend trips are often longer than weekday trips.

Hour Count

In the count plot I noticed two big spikes coinciding with morning and evening commute times, with a very large spike at 5 PM when many people get off work. Morning trips are also noticeably shorter, and midday trips are usually longer.

Hour Heatmap

This heatmap shows the traffic intensity for each hour of every day of the week. As we saw before, the biggest weekday peaks are the morning and evening commutes, while the weekends show a more gradual rise and decay centered around mid-afternoon.
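The matrix behind a heatmap like this one can be built with a cross-tabulation. The sketch below assumes day (0 = Monday) and hour (0 to 23) columns, which are names I am assuming rather than ones confirmed by the dataset.

```python
import pandas as pd


def hourly_weekday_counts(trips: pd.DataFrame) -> pd.DataFrame:
    """Return a 7 x 24 table of trip counts: rows are weekdays, columns are hours.

    Assumes `day` (0 = Monday) and `hour` (0-23) columns exist.
    """
    counts = pd.crosstab(trips["day"], trips["hour"])
    # Fill in any weekday/hour combination that never occurs with zero trips
    # so the grid always has the same shape.
    return counts.reindex(index=range(7), columns=range(24), fill_value=0)
```

The returned table can be handed to seaborn.heatmap to draw the figure. Swapping hour for a week-of-year column would give the grid for the weekly heatmap in the same way.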

Week Heatmap

Similar to the previous heatmap, this one shows the traffic intensity for every day of every week of the year. This is particularly useful for spotting recurring highs and lows. For example, the weekend of the 17th week (late April) has an unusually low number of riders.

Weather

Chicago is located in the humid continental climate zone and experiences four distinct seasons throughout the year. Nearby Lake Michigan regulates the weather year round as well.

Temp year and month

The violin plots above show the distribution of temperatures at which people rode. The wider the shape, the more trips are clustered at that temperature. Every year so far has had a similar temperature profile, and the months get warmer closer to summer. Another thing to note is that the winter months have a larger range of temperatures.

Temp Week and Hour

The week violin plot basically mirrors what we saw in the monthly plot, but the hourly plot has much more of a uniform distribution.

Weather Events

On most days that people rode bikes in Chicago, it was cloudy, almost overwhelmingly so. The category ‘not clear’ refers to hazy or foggy weather, which was just about as rare as snow/ice. The more interesting part is the plot of trip durations, which tells us that people tended to ride for longer on days with clear or cloudy weather. Snow/ice shortens rides severely compared to most other weather events.

User Types

  • Customers: 24-Hour Pass holders
  • Subscribers: Annual Pass holders
  • Dependents: Single Ride Purchase

User Type

The vast majority of rides come from annual pass holders, and they typically take the shortest rides of anyone. 24-hour pass holders are the next most common riders, and they usually take the longest rides overall. An important thing to note is that rides over 30 minutes start to incur a usage fee for both annual and 24-hour pass holders, so it’s important to identify that cutoff. Most annual pass holders do not go over this threshold. There were too few single-ride users (fewer than 100), so I will not include them in my analysis.
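To put a number on the 30-minute cutoff, one option is to compute, per user type, the share of trips that run past it. This is a sketch under assumed column names (usertype, duration_min), not the exact code from my notebook.

```python
import pandas as pd


def usertype_summary(
    trips: pd.DataFrame, fee_cutoff_min: float = 30.0
) -> pd.DataFrame:
    """Summarize trips per user type, including the share over the fee cutoff.

    Assumes `usertype` and `duration_min` columns exist.
    """
    # Boolean flag: True when a trip is longer than the usage-fee cutoff.
    flagged = trips.assign(over_cutoff=trips["duration_min"] > fee_cutoff_min)
    return flagged.groupby("usertype").agg(
        trip_count=("duration_min", "count"),
        median_duration=("duration_min", "median"),
        # The mean of a boolean column is the fraction of True values.
        share_over_cutoff=("over_cutoff", "mean"),
    )
```

The share_over_cutoff column makes it easy to compare how often each group of pass holders crosses into usage-fee territory.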

Stations

A look at the busiest stations:

Busiest Stations

It turns out that the busiest stations for departures aren’t ranked exactly the same as those for arrivals. The top ten for both seem to be the same stations, just arranged in a different order. Below the top ten, the changes in ranking are much more drastic.
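A quick way to check how departure and arrival rankings differ is to rank stations both ways and put the ranks side by side. The column names from_station_name and to_station_name are assumptions for this sketch.

```python
import pandas as pd


def station_rank_comparison(trips: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Compare departure and arrival ranks for the busiest departure stations.

    Assumes `from_station_name` and `to_station_name` columns exist.
    """
    departures = trips["from_station_name"].value_counts()
    arrivals = trips["to_station_name"].value_counts()

    # Rank 1 = busiest; tied stations share the lowest rank of their group.
    ranks = pd.DataFrame({
        "departure_rank": departures.rank(ascending=False, method="min"),
        "arrival_rank": arrivals.rank(ascending=False, method="min"),
    })
    # Positive shift: the station ranks lower for arrivals than for departures.
    ranks["rank_shift"] = ranks["arrival_rank"] - ranks["departure_rank"]
    return ranks.sort_values("departure_rank").head(top_n)
```

Stations that appear only as a start or only as an end point show up with a missing rank on the other side, which is worth keeping an eye on when reading the table.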

Traffic Map

This map of starting station density gives us a good visual image of where most traffic occurs.

Correlation

If I wanted to build a model that could predict the number of riders on a given day or in a given temperature range, I would first look at the natural correlation between these variables.

Corr

Unfortunately there is only weak correlation at best for most of the categories. There is still work here to be done to discover a stronger correlation.
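One possible way to look for a stronger signal is to aggregate trips by calendar day before correlating, since the question is about riders per day. The sketch below is only one approach, and it assumes starttime, temperature, and duration_min columns; the correlation plot above may have been computed on different columns.

```python
import pandas as pd


def daily_correlations(trips: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlations between daily ride count, temperature, and duration.

    Assumes `starttime`, `temperature`, and `duration_min` columns exist.
    """
    # Truncate each start timestamp to its calendar day.
    date = pd.to_datetime(trips["starttime"]).dt.floor("D")
    daily = trips.groupby(date).agg(
        rides=("duration_min", "count"),
        mean_temperature=("temperature", "mean"),
        mean_duration=("duration_min", "mean"),
    )
    # DataFrame.corr uses Pearson correlation by default.
    return daily.corr()
```

The result is a small square table that can be drawn with a Seaborn heatmap, and more daily weather columns can be added to the aggregation in the same pattern.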

Conclusions

Major takeaways:

  • Divvy has shown steady growth over the four years covered by the data (2014 to 2017)
  • Most users use Divvy for regular weekly commutes to work and back
  • Divvy should target 24-Hour pass customers if they want to increase their revenue from overtime usage fees
  • Snow/ice and cold temperatures seem to reduce the frequency and duration of trips
  • Divvy should increase marketing leading into the summer to increase seasonal sales
  • The busiest stations may need extra attention to keep them stocked and clean for operation

Further Exploration

There are a couple of directions I could head with this if I wanted to:

  • Interactive map of high-traffic locations and their user types
  • Trip duration and rider count prediction
  • Deeper dive into weather data and unexamined features
  • Try to find more detailed correlations

Thanks for reading!
