Using Joins to Connect Disparate Datasets

Published on October 12, 2016

In New York City, where population density is higher and average daily commute is longer than other metropolitan cities in the US, someone is killed in a car crash every 33 hours. In order for city planners to implement measures to decrease such mishaps, the occurrence and causes of these crashes need to be analyzed. The New York Police Department publishes the NYC Motor Vehicle Collisions data. This data was analyzed using Arcadia Instant to find answers to the following questions:

  •     What are the major causes of motor vehicle accidents that result in fatalities?
  •     Is there a particular day of week when more motor vehicle accidents occur?
  •     Does weather contribute to the occurrence of motor vehicle accidents?

Data Preparation and Loading

The NYPD dataset contains details of accidents in 2015. To perform the desired analysis, certain changes were made to the data by importing a table into the dataset and creating new data fields. This is easily done in Arcadia Instant.

Importing a Table to Visualize Days:  The NYPD dataset has the Date column. However to visualize the number of accidents by day of the week, a CSV file with dates and respective days was created and imported into Arcadia Instant. Joining it with the Date column of the dataset in the Data Model section adds Days of Week into the dataset.

1-join

Categorizing Causes:  The raw dataset contains details of the cause of each accident. To see what caused more accidents, grouping all these reasons into broader categories helps. Categorization can be done using an expression in the Column View window.

2-causes

The case statement used for creating categories is quite long.  The following screenshot shows a portion of the case statement used.

4-3-part-of-formula

Visualizing and Analyzing

In Arcadia Instant, applications can be created to visualize all the relevant information in a single view. The application created to find answers to the above questions is comprised of the following visuals:

5-all-visuals

Each visual shows specific information in a way that’s easy to quickly understand.

 

  1.       Longitudinal analysis: The line chart below shows a number of troughs and crests depicting the variance in the number of accidents during 2015.

Hovering over the chart allows the user to see the highest number of accidents (960) occurred on January 18. A quick internet search showed that weather was extremely harsh on January 18, a Sunday. The freezing rains and icy roads likely contributed to the abnormally high number of accidents that day. We also see an unusual dip on January 27. On January 26, there was a blizzard which dumped up to two feet of snow on the roads of NYC. That could have caused folks to stay indoors rather than hit the roads. Based on the data listed above, it is plausible that extreme weather, especially heavy rain and snow, adversely affects the safety of commuters on the roads.

6-date-on-timeline

  1.       Week Day: The bar chart below shows patterns in the occurrence of these accidents by displaying which specific day more accidents occur when compared to the others. As it turns out, Fridays are more accident prone as compared to the rest of the week. Perhaps people are rushing to get to their destinations in time to start the weekend, are tired from a long workweek or there is an increase in inebriated driving leading to negligence on the road. These are all just guesses, of course. Also, the number of accidents is comparatively lower on Saturday and Sunday, which can be attributed to less commuting and a slower, more relaxed pace on weekends.

7-bar-chart-weekday

  1.       Causes: Accident causes have been grouped into broader categories like Driving Error or Instrument Failure, to better draw meaningful conclusions. The table below shows the list of causes in a descending order. With Variable Measures, the level of granularity can be adjusted on the table. It can be refreshed to see different measures, i.e. Number of Accidents, Fatalities, Injuries. These measures can further be seen for different variables, i.e. for Pedestrians, Cyclists, and Motorists only, or for all. The table chart makes it clear that Driving Error and Aggressive Driving are the two major causes of car accidents.

8-table

To see what caused the maximum number of fatalities, Fatalities can be selected from the drop down menu.

9-filter

10-table-refresh

Again, Driving Error and Aggressive Driving are the leading causes of Fatalities. What’s alarming is the magnitude of scale. Out of a total of 92,000 accidents, Aggressive Driving caused approximately 22%. However, on comparing fatalities, out of a total of 147, 44% i.e. almost half of them were caused by Aggressive Driving. This shows that the share of Aggressive Driving in causing total accidents is lower, but caused a lot more fatalities as compared to Driving Error.

The analysis of NYPD data and its visualization on Arcadia Instant helped in recognizing patterns and causes of car accidents, including the role of weather and aggressive driving. These insights can really help city planners a great deal with planning and implementing measures to prevent these collisions. Although it is a topic of discussion for a public policy forum, if people are calmer and more careful while driving, especially on Fridays and during bad weather, many fatalities could potentially be avoided.

Want to try your hands at it? Go ahead and download Arcadia Instant here.