Code Walkthrough Video

Project Demo Video

Data Journey

Cleaning, converting, and manipulating the data was no easy task! In this section, we’ll discuss the steps we took to convert the original data source into a format suitable for manipulation and visualization in Python.

Original Data Source

The original data source used in this project can be found in the data section of the LA City organization website, and it is titled “Crime Data from 2020 to Present”. The data source is a CSV file with over 840,000 rows and 28 columns, and new data is added to it weekly. The analysis in this project uses the data source as it stood on November 17th, 2023. Click here to access the original data source!

Journey and Caveats

Partitioning the Data into Different CSVs

With over 840,000 rows, the dataset as downloaded from LA City was impractical to work with as a single file. We therefore partitioned the large CSV into multiple smaller files to streamline the file upload process within the Colab Notebook environment. This was done largely in Microsoft Excel: we used column filters to create separate CSV files by year (2020, 2021, 2022, and 2023) and by type of crime (e.g. sexual assaults).
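For readers who would rather stay in Python, the same kind of year-based partitioning could be sketched in pandas. This is an alternative to the Excel workflow described above, not the method used in this project. The column names ("DR_NO", "DATE OCC", "Crm Cd Desc"), the date format, the sample rows, and the output folder name are all assumptions made for illustration.

```python
import io
from pathlib import Path

import pandas as pd

# Tiny stand-in for the downloaded CSV. The column names, date format,
# and rows are illustrative assumptions, not taken from the real file.
raw_csv = io.StringIO(
    "DR_NO,DATE OCC,Crm Cd Desc\n"
    "1,01/15/2020,BURGLARY\n"
    "2,06/03/2021,VEHICLE - STOLEN\n"
    "3,11/20/2022,BURGLARY\n"
    "4,02/08/2022,BATTERY - SIMPLE ASSAULT\n"
)

crimes = pd.read_csv(raw_csv)
crimes["DATE OCC"] = pd.to_datetime(crimes["DATE OCC"], format="%m/%d/%Y")

# Group the rows by the year of the occurrence date.
by_year = {
    int(year): part
    for year, part in crimes.groupby(crimes["DATE OCC"].dt.year)
}

# Write one smaller CSV per year into a hypothetical output folder.
out_dir = Path("crime_partitions")
out_dir.mkdir(exist_ok=True)
for year, part in by_year.items():
    part.to_csv(out_dir / f"crimes_{year}.csv", index=False)
```

Each resulting file holds a single year of records, which keeps individual uploads small.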

Uploading the Data into Colab Notebooks

We experimented with two different file upload methods in the Google Colab notebooks we created. The first method, used in the notebook containing the visualizations for the first five questions, involved importing the "files" module from the google.colab package and calling "files.upload()" to select a CSV file from our computers' download folders. The second method, used in the notebook containing Question 6's visualization, relied on the file upload button (left icon) in the file menu (bottom left icon). We found the second method to be significantly faster.
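A minimal sketch of the first upload method is shown here. In Colab, files.upload() opens a file picker and returns a dictionary that maps each uploaded filename to its contents as bytes. The helper names frames_from_upload and upload_csvs are hypothetical, introduced only for this example.

```python
import io

import pandas as pd


def frames_from_upload(uploaded):
    """Turn a {filename: bytes} mapping into a {filename: DataFrame} mapping."""
    return {
        name: pd.read_csv(io.BytesIO(data))
        for name, data in uploaded.items()
    }


def upload_csvs():
    """Open Colab's file picker and parse every selected CSV.

    This only works inside a Colab runtime, because the google.colab
    package is not available elsewhere.
    """
    from google.colab import files

    return frames_from_upload(files.upload())


# In a Colab cell:
#     frames = upload_csvs()
```

With the second method, the file is placed in the runtime's file browser rather than returned to the notebook. Assuming it lands in the working directory, it can be read by name, for example pd.read_csv("crimes_2022.csv"), where the filename is again only a placeholder.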

Creating Dataframes within Colab Notebooks

After importing the CSVs we created in Excel from the original dataset into our Colab Notebooks, we created Pandas dataframes to further narrow down our queries. This involved a variety of techniques, such as boolean masking, combining lists of the same length, for loops, and if-else statements.
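The techniques listed above might look roughly like the following. The column names, the sample rows, and the specific question being asked (burglaries per area) are assumptions chosen for illustration, not the notebooks' actual code.

```python
import pandas as pd

# Small illustrative dataframe; columns and values are assumptions.
crimes = pd.DataFrame({
    "AREA NAME": ["Central", "Hollywood", "Central", "Pacific"],
    "Crm Cd Desc": [
        "BURGLARY",
        "VEHICLE - STOLEN",
        "BURGLARY",
        "BATTERY - SIMPLE ASSAULT",
    ],
})

# Boolean masking: keep only the rows where both conditions hold.
is_burglary = crimes["Crm Cd Desc"] == "BURGLARY"
in_central = crimes["AREA NAME"] == "Central"
central_burglaries = crimes[is_burglary & in_central]

# A for loop with if-else that builds three lists of the same length.
areas, burglary_counts, labels = [], [], []
for area in crimes["AREA NAME"].unique():
    count = int(((crimes["AREA NAME"] == area) & is_burglary).sum())
    areas.append(area)
    burglary_counts.append(count)
    if count > 0:
        labels.append("has burglaries")
    else:
        labels.append("none")

# Combine the equal-length lists into a new dataframe.
summary = pd.DataFrame({
    "area": areas,
    "burglaries": burglary_counts,
    "label": labels,
})
```

Because each pass through the loop appends exactly one value to every list, the lists stay the same length and can be combined into a dataframe without alignment errors.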

Selecting Relevant Data to Focus On

Considering the size and scope of our original dataset, we chose to focus our analyses for the first five exploratory questions on 2022, the most recent full calendar year. We decided to use the data for all full years in the original dataset to answer the sixth question because we wanted to see if monthly trends differed by year.
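Both selections can be sketched in pandas as follows. The "DATE OCC" column name and the sample dates are assumptions. The list of full years (2020, 2021, 2022) follows from the November 2023 snapshot described earlier.

```python
import pandas as pd

# Illustrative occurrence dates; the column name is an assumption.
crimes = pd.DataFrame({
    "DATE OCC": pd.to_datetime([
        "2020-01-15",
        "2021-01-03",
        "2022-01-20",
        "2022-02-08",
        "2022-02-19",
        "2023-03-01",
    ])
})
crimes["year"] = crimes["DATE OCC"].dt.year
crimes["month"] = crimes["DATE OCC"].dt.month

# First five questions: only 2022, the most recent full calendar year.
crimes_2022 = crimes[crimes["year"] == 2022]

# Sixth question: every full year, counted per month so that
# monthly trends can be compared across years.
full_years = crimes[crimes["year"].isin([2020, 2021, 2022])]
monthly_counts = (
    full_years.groupby(["year", "month"]).size().unstack(fill_value=0)
)
```

In monthly_counts, each row is a year and each column is a month, so differences in monthly trends between years can be read across rows.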