R: Write Up

PART A: PROCESS

i) Completing the Script

1. Acquire(1): Obtain the Raw Data

For this assignment, we explored the coding language R to create an interactive web map based on the Vancouver Crime Data. In particular, I was concerned with the crime data from 2012. Normally this dataset is available from the City of Vancouver's Open Data Catalogue, but for the purpose of this assignment the data was provided to us with some alterations already made.

To begin working with the dataset in RStudio, I installed and loaded the packages that were not pre-installed in RStudio using the functions install.packages() and library(). The packages installed and loaded were the following (a minimal sketch of these calls appears after the list):
GISTools: Allows shapefiles to be read and provides mapping utilities
RJSONIO: Serializes R objects to and from JSON
rgdal: Bindings for the Geospatial Data Abstraction Library (GDAL)
RCurl: General network (HTTP/FTP/...) client interface for R
curl: A modern and flexible web client for R
sp: Classes and methods for spatial data
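
As a minimal sketch (assuming all six packages install cleanly from CRAN), the installation and loading steps look like this:

  # Install once, then load each session
  install.packages(c("GISTools", "RJSONIO", "rgdal", "RCurl", "curl", "sp"))
  library(GISTools)
  library(RJSONIO)
  library(rgdal)
  library(RCurl)
  library(curl)
  library(sp)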

Once the packages were loaded, the data downloaded from the course website was read into RStudio as 'data' using the function read.csv(). As the original data included all available years, it was filtered in Excel to just the data from 2012 and then read into RStudio.
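
A hedged sketch of this step, where the file name crime_2012.csv is a placeholder for whatever the filtered file was saved as:

  # Read the pre-filtered 2012 csv into a data frame called 'data'
  data <- read.csv("crime_2012.csv", stringsAsFactors = FALSE)
  head(data)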

2. Parse: Provide some structure for the data's meaning, and order it into categories

In order to geospatially represent the data, it needs to be geocoded: given attributes that tie each record to a place on the ground. To geocode, it was therefore first necessary to make the data recognizable to the geocode function so that longitudes and latitudes could be attributed to it.

Because the data is confidential in some cases, addresses had been offset by using “XX” to represent the hundred block. In order to make this configuration readable by the geocode function, we need to replace all of the “XX”s with “00”s. Assigning to data$h_block creates a new column in the data frame to store the values returned by gsub(). The call print(head(data)) lets us check that the “XX”s have been successfully substituted with “00”s. The address is still incomplete, however, as it does not indicate the city. We therefore create another column by combining data$h_block with “Vancouver, BC” to form data$full_address. Finally, we remove “OFFSET TO PROTECT PRIVACY” from the full addresses using the gsub() function once again.
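
A sketch of the address cleaning, assuming the original hundred-block column is called HUNDRED_BLOCK (the column name is an assumption based on the published dataset):

  # Replace the privacy placeholders and build a geocodable address string
  data$h_block <- gsub("XX", "00", data$HUNDRED_BLOCK)   # column name assumed
  print(head(data))                                      # check the substitution
  data$full_address <- paste(data$h_block, "Vancouver, BC", sep = ", ")
  data$full_address <- gsub("OFFSET TO PROTECT PRIVACY", "", data$full_address)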

Now that the full addresses have been given some structure, they can be read by the geocode function. We set up a function, bc_geocode, that calls the BC government's geocoding API. As the geocoding process loops through all the addresses, it generates longitude and latitude coordinates for each address in the vectors we set up as lat = c() and lon = c(), which are then stored in the RStudio data frame as data$lat and data$lon.
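
A simplified sketch of the loop, assuming the BC Physical Address Geocoder endpoint and the structure of its GeoJSON response (both the URL and the field names are assumptions, and error handling is omitted):

  # Query the BC geocoder for one address and return (lon, lat)
  bc_geocode <- function(address) {
    base <- "https://geocoder.api.gov.bc.ca/addresses.json?addressString="  # endpoint assumed
    json <- fromJSON(getURL(paste0(base, URLencode(address))))
    coords <- json$features[[1]]$geometry$coordinates   # longitude, latitude
    c(lon = coords[1], lat = coords[2])
  }

  lat <- c(); lon <- c()
  for (i in seq_len(nrow(data))) {
    xy <- bc_geocode(data$full_address[i])
    lon[i] <- xy["lon"]
    lat[i] <- xy["lat"]
  }
  data$lat <- lat
  data$lon <- lon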

3. Mine: Apply methods from statistics or data mining as a way to discern patterns or place the data in mathematical context

With the data now geocoded, it is necessary to format it in ways that are useful for our visual representation as an interactive web map.

Using the function unique() on the TYPE of crime in the dataset, we are able to assess the overall categories present. This generated 8 levels:
[1] Offence Against a Person
[2] Theft from Vehicle
[3] Other Theft
[4] Theft of Vehicle
[5] Break and Enter Residential/Other
[6] Break and Enter Commercial
[7] Homicide
[8] Mischief

Examining the 8 levels, we can see that some of the categories have overlapping attributes. Since for this assignment we are more concerned with the overall level of crime in Vancouver than with the detailed particulars of the different kinds of crime, rather than keeping all 8 levels it is better to combine some of the groups and display the information as simply as possible. The groupings below were decided upon and combined using the function c() (a sketch of the regrouping appears after the list):

#———Break-ins———#
Break and Enter Residential/Other
Break and Enter Commercial

  •  I decided that for the purpose of this assignment, it was probably unnecessary to differentiate between residential break-ins and commercial break-ins.

#———Theft———#
Theft from Vehicle
Other Theft
Theft of Vehicle

  • Similar to the break-ins, I am more concerned about the frequency of theft and the area where theft happens rather than what is being stolen.
#———Mischief———#
Mischief

  • Mischief was kept on its own, as there was no category to combine it with.

#———Offences and Homicides———#
Offence Against a Person
Homicide

  • These two categories were grouped because 1) the crimes are similar in that both involve harm to a person, and 2) the nature of the data for these two categories is such that in both cases the locations were “OFFSET DUE TO PRIVACY REASONS”, so there is no address that can be linked to this data.
  • Because the data gives no locational information for Offences Against a Person and Homicides, I was particularly hesitant to even add this category to the interactive map. I wondered whether it was logical to add non-geographically assigned information to a map. In the end, however, I added it for the purpose of this assignment to demonstrate this issue.

Once the categories were regrouped using the c() function, I assigned each category a unique class ID (CID) number from 1 to 4 for easier identification and representation later.
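
A sketch of the regrouping and CID assignment (the TYPE column name and the numeric lookup approach are assumptions; the category strings match the 8 levels above):

  # Group the 8 crime types into 4 categories and assign a class ID (cid)
  break_ins <- c("Break and Enter Residential/Other", "Break and Enter Commercial")
  theft     <- c("Theft from Vehicle", "Other Theft", "Theft of Vehicle")
  mischief  <- c("Mischief")
  offences  <- c("Offence Against a Person", "Homicide")

  data$cid <- NA
  data$cid[data$TYPE %in% break_ins] <- 1
  data$cid[data$TYPE %in% theft]     <- 2
  data$cid[data$TYPE %in% mischief]  <- 3
  data$cid[data$TYPE %in% offences]  <- 4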

After these groups were combined and CIDs assigned, we “jittered” the data. Because the data is based on the hundred block, there are likely to be points that overlap with each other. To better represent an average spread of the data points over a general area, we “jitter” the data by offsetting the longitudes and latitudes slightly by a random number. The outcome of this function is then saved in the data frame as data$lat_offset and data$lon_offset.
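
A sketch of the jittering step; the size of the random offset (roughly half a hundred-block in degrees) is an assumption:

  # Offset each coordinate by a small random amount so overlapping points spread out
  set.seed(42)
  data$lat_offset <- data$lat + runif(nrow(data), -0.0005, 0.0005)
  data$lon_offset <- data$lon + runif(nrow(data), -0.0005, 0.0005)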

4. Filter: Remove all but the data of interest
For this assignment, I am only concerned with crime data within the administrative boundaries of the municipality of the City of Vancouver in 2012. The timeframe had already been filtered to 2012 in Excel before importing into RStudio. Filtering to the administrative boundaries of the City of Vancouver was done by subsetting the data into a vector, data_filter, dropping any records that fell outside the geographic coordinates of the boundaries. The data was then visualized using the function plot().
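
A sketch of the spatial filter; the bounding-box coordinates used here are rough approximations of the City of Vancouver, not the values from my script:

  # Keep only points inside an approximate Vancouver bounding box, then plot
  data_filter <- subset(data,
                        lon_offset > -123.30 & lon_offset < -123.00 &
                        lat_offset >  49.19  & lat_offset <  49.32)
  plot(data_filter$lon_offset, data_filter$lat_offset)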

5. Represent/Acquire (2): Obtain/Convert Datafiles for Leaflet
For this particular assignment, the JavaScript library Leaflet was chosen as the platform for representing the geodata we refined in RStudio. Leaflet allows us to create an HTML file that can be opened in any web browser instead of specialized geographic visualization software, which makes it accessible to all users.

In order to read the data from RStudio into the Leaflet JavaScript library in a geospatial format, it is necessary to convert the data into shapefiles and GeoJSONs. This is done through the functions SpatialPointsDataFrame() and CRS(), organizing the outputs in the same geo folder, and finally writing the file as a shapefile using the function writeOGR() to be read into Leaflet later.
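
A sketch of this conversion; the folder name “geo” and the layer names are placeholders:

  # Convert the filtered data frame to a spatial object in WGS84
  coords <- cbind(data_filter$lon_offset, data_filter$lat_offset)
  crime_spdf <- SpatialPointsDataFrame(coords, data = data_filter,
                                       proj4string = CRS("+proj=longlat +datum=WGS84"))

  # Write it out as a shapefile and as a GeoJSON for Leaflet
  writeOGR(crime_spdf, dsn = "geo", layer = "crime_2012",
           driver = "ESRI Shapefile", overwrite_layer = TRUE)
  writeOGR(crime_spdf, dsn = "geo/crime_2012.geojson", layer = "crime_2012",
           driver = "GeoJSON")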

In Leaflet, we chose to use a hexagonal grid to spatially divide our data. This grid was already prepared for us by Joey Lee and was available from the URL ‘https://raw.githubusercontent.com/joeyklee/aloha-r/master/data/calls_2014/geo/hgrid_250m.geojson’; we read the saved GeoJSON and shapefiles in as the hexgrid. We also apply the function spTransform() to put the OGR files into the correct WGS84 projection.
The hexagonal grid is useful for representing the frequency of crime in each area as a choropleth. To derive this, we use the function poly.counts() to count the number of occurrences of crime per grid cell. Finally, this is saved as shapefiles and GeoJSONs to be used with the leaflet package in RStudio.
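
A sketch of the hexgrid steps; the layer name passed to readOGR() depends on the GDAL version, and the crime_count column name is my own placeholder:

  # Read the hex grid, reproject it, and count crime points per cell
  hexgrid <- readOGR("geo/hgrid_250m.geojson", "OGRGeoJSON")   # layer name may differ
  hexgrid <- spTransform(hexgrid, CRS("+proj=longlat +datum=WGS84"))
  hexgrid$crime_count <- poly.counts(crime_spdf, hexgrid)      # from GISTools
  writeOGR(hexgrid, dsn = "geo/hexgrid_crime.geojson", layer = "hexgrid_crime",
           driver = "GeoJSON")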

After installing and loading the leaflet package in RStudio, we read in the data_filter .csv file that was saved previously. This is the data we gave structure to by eliminating and reformatting unnecessary data from the original raw Vancouver Crime Data. To link the data filter with the cid vector that we established in RStudio, we use the function subset() again.
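
A sketch of this step; the file name is a placeholder:

  library(leaflet)

  # Re-read the saved filtered data and split it by crime category (cid)
  data_filter <- read.csv("data_filter.csv", stringsAsFactors = FALSE)
  breakins <- subset(data_filter, cid == 1)
  theft    <- subset(data_filter, cid == 2)
  mischief <- subset(data_filter, cid == 3)
  offences <- subset(data_filter, cid == 4)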

6. Represent: Leaflet
Finally, after formatting our data, we can visually represent it using the Leaflet library in an HTML file. After initiating leaflet(), we add the default OpenStreetMap tiles with addTiles(). This gives us the base map seen on OpenStreetMap. On top of this layer we need to add the 2012 Vancouver Crime Data that we refined previously and saved as data_filter in RStudio. To visually represent this data on the map, we add markers and differentiate between the different types of crime using different colours. To do this, we define the colour domains using the function colorFactor(), and then add the markers for each category/CID that we defined previously.
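
A hedged sketch of the map assembly; the marker colours, radius, popup column, and group names are illustrative choices, not necessarily the ones in my final script:

  # Colour scale keyed to the four CIDs
  pal <- colorFactor(c("red", "blue", "orange", "purple"), domain = data_filter$cid)

  m <- leaflet() %>%
    addTiles() %>%                                   # default OpenStreetMap basemap
    addCircleMarkers(data = breakins, lng = ~lon_offset, lat = ~lat_offset,
                     color = ~pal(cid), radius = 3, popup = ~TYPE,
                     group = "Break-ins") %>%
    addCircleMarkers(data = theft, lng = ~lon_offset, lat = ~lat_offset,
                     color = ~pal(cid), radius = 3, popup = ~TYPE,
                     group = "Theft")
  # ...repeat for the Mischief and Offences/Homicides layers
  m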

7. Interact(1): Add methods for manipulating the data or controlling what features are visible.

As this map is using various layers of data, it is useful to have a “toggle” feature that allows us to compare and analyze the various layers easily. This is made possible with the addLayersControl() function.
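
A sketch of the toggle control; the group names must match the ones given when each marker layer and the hex grid were added:

  # Add checkboxes that let the user switch layers on and off
  m <- m %>%
    addLayersControl(
      overlayGroups = c("Break-ins", "Theft", "Mischief",
                        "Offences and Homicides", "Crime Density"),
      options = layersControlOptions(collapsed = FALSE))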

8. Refine: Improve the basic representation to make it clearer and more visually engaging

Now that we have the basic basemap, we need to loop through the entirety of the data, and plot all data points and markers to our map. Once that is complete, we can refine the colours chosen for the markers, as well as the names of the markers, and the overall colour palette for the hexagonal grid.

Originally, because the datafile for the hexagonal grid had been used for Call 3-1-1 data, the headers and names were catered to “number of calls” or “3-1-1 data”. These were all modified to represent the Vancouver Crime Data instead; I decided to use “Crime Density” for the hexagonal layer.

Colour-wise, I kept the “Greens” palette, as I thought it was a pleasant colour to use for the basemap. I played around with the colour choices for the markers, trying to choose the best scheme for what the data represented. In the end, I chose colours that were vivid rather than light or calm, such as the basic primary colours red and blue, because I believed they are more visually engaging and eye-catching given the context of the data. In addition, taking inspiration from Sydney, I adjusted the opacity of the hexagonal grid to 0.75. Once those changes were made, I re-ran the code to reflect the modifications and generate the final form of the map.
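
A sketch of the choropleth refinement; the crime_count column name and the number of bins are assumptions:

  # "Greens" palette binned on the per-cell crime counts, drawn at 0.75 opacity
  hex_pal <- colorBin("Greens", domain = hexgrid$crime_count, bins = 5)
  m <- m %>%
    addPolygons(data = hexgrid,
                fillColor = ~hex_pal(crime_count),
                fillOpacity = 0.75, weight = 1, color = "#444444",
                group = "Crime Density")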

9. Interact(2)
The final deliverable of this assignment was an HTML file that can be used in a browser. This was done simply through the ‘Export’ function in RStudio, and the result was then posted on my Undergrad Portfolio (blogs.ubc.ca/manahashimoto). Given the limitations of WordPress, it was uploaded as a zip file alongside screenshots of the map. The contents of the zip file allow the user to download it to their local drive and access the interactive web map in their browser, while the screenshots give a quick overview of the functions of the webmap.

ii) Understanding the Data

a) Lost Records of Data

At the very start, we began with a massive data file containing crime data from 2003 to 2015, with 576,877 cases of crime in total. Before importing the dataset into RStudio, I filtered it to just my focus year, 2012, reducing it to 34,065 cases; after filtering and parsing in RStudio I had 33,820 cases in total. The discrepancy between before and after processing is due to the fact that some overlapping data points had been offset for the purpose of mapping them on the interactive map.

b) Error

For the purpose of the assignment in learning R and Leaflet, I think we did the best we could; however, if someone were to approach this map without knowing the intentions behind it, I feel that this interactive map is quite misrepresentative of the trends of crime in Vancouver, and thus carries an arguably significant amount of error. In particular, the placement of the markers based on the random offset is questionable. Although this is the best we can do given the nature of the data provided, I feel it is a significant limitation in representing this data, because the computed offsets cause the data points to be organized into neat rows, which I find unnatural and misrepresentative. We must consider that the original dataset was generalized to the hundred block to protect privacy, that we then attributed loosely related geographic coordinates to it, and that we then jittered it again wherever coordinates overlapped, based on a random computer algorithm. While the computer may generate markers that make it seem crime is occurring along an entire block, it may in fact be concentrated at a single intersection. Overall, it is hard to make accurate judgements.

c) File Formats

  • shp: shapefiles are files used by GIS software, most commonly ArcMap.
  • csv: Comma Separated Values files are a way to store data as plain text.
  • excel: Excel files are data files with formatting and structure (unlike csv files, which are plain text).
  • geojson: GeoJSON files are a lightweight, text-based geospatial format, used here specifically with Leaflet.

Throughout the entire coding process, I created and interacted with all 4 types of files. I began with a csv file containing all years of Vancouver Crime Data, which I opened in Microsoft Excel, filtered to just the year 2012, and saved as a csv file that was imported and read in RStudio. After filtering, mining, and parsing the data to fit the requirements of the assignment, I overwrote it again as a csv file. As the final purpose of this assignment was to create an interactive map using Leaflet, we needed a way to convert the csv to a GeoJSON. To achieve this, the intermediary step of converting the csv to a shp was necessary. Once the csv was saved as a shp, I converted that into a GeoJSON, which could be read by the leaflet library. All of these files had to be saved on the local drive with careful naming so that we could access and rewrite them as we revised our code and interactive map.

PART B: VISUALIZATION 

a) Cartographic Design: 

The purpose of this interactive map was to gain understanding of the overall trends of the frequency of crime, the patterns in the location of the crime, and the type of crime occurring in Vancouver. The interactive map fulfilled this purpose by linking data with locational coordinates, and allowing us to easily view and compare the different types of crime that occurred in the city. However, to add more value to this visual representation, I wished I had more flexibility or skills to represent a few more things.

First, I wish I had the skill and knowledge in R to visually present more than one layer of crime on the base map at one time. Currently I am only able to show either Break-ins, Theft, Mischief, OR Offences and Homicides, but it may have been useful to be able to compare two or more crime types in one given location.

Secondly, I also wish I could control the transparency of the hexagonal layer instead of only turning it completely off. If we were able to see the underlying layer and the crime density layer at the same time, we would gain a better understanding of where the crime is located, which is more valuable information than simply the extent of crime density.

Finally, I believe it could have been much more informative if we had the ability to add a graph or chart in, say, a corner of the map. For instance, through our coding we were able to derive information on the number of crimes and which crimes were most frequent. I think it would have been informative to the map viewer if we had added that to our map as a bar chart or pie chart to give an overall understanding of the map.

b) Interactivity 

Overall, although I feel accomplished with what I was able to generate using coding and open-source software, I am not entirely satisfied with the limitations I experienced in representing the Vancouver Crime Dataset. As I mentioned in the ‘Error’ section above, relying on an algorithm to map information that we only partially comprehend is, I believe, a source of misrepresentation between the virtual and the actual world that we should be careful of.

Based on the data we were given, we were limited to 8 types of crime data, and for easier visualization I reduced the data down to 4 categories. While I think it was reasonable to regroup the categories, it is also informative to know the lower-level specifics of each crime. Although we can identify the specific type of crime by clicking on a marker, I wish I could create a secondary layer that would allow us to toggle through those subcategories.

Perhaps having a layer representing trends in per capita income would have been useful for better understanding the patterns in the visual representation of the data. In addition, being able to locate public institutions like schools or rec centres, which may be related to public investment in a community, could have been useful. Overall, with the limited skills we have in R and the Vancouver Crime Data, I feel the interactivity is useful and innovative.

 

PART C: REFLECTION

You have experience with expensive proprietary software packages ArcGIS and Adobe Illustrator that are supported by the UBC geography labs. Reflect on these few weeks of learning enough of an open source tool, R and leaflet, to create this interactive map. In your reflection discuss pros and cons of proprietary software and FOSS.

Through this course I have gained a better understanding of how I can utilize open-source software, while also understanding the limitations and restrictions of these tools. Having this knowledge has become an empowering tool for communicating data in flexible ways; it has allowed me to express myself more freely and creatively in visual representation, especially approaching the final project. Through my experience learning how to use R and trying to troubleshoot, I discovered the huge online community of people who share the same issues. Reading through these threads, I was also able to see other skills I could apply to my code and to see the bigger picture of the software. Combining this knowledge of new software with familiar file formats like csv and Excel files also allowed me to build on what I knew rather than start from scratch.

Personally, I struggled using the proprietary software, ArcGIS and Adobe Illustrator. I do not learn well when I have limited access to software, i.e. not having it on hand on my own computer. With Adobe Illustrator, the expense that limited my access to the software also hindered my understanding of it. Access to the data and the software is limited to those who have the means, and thus it is harder to reach a greater audience to interact with it. But I am also aware that the expensive software allows the creator to express the data in more comprehensive ways through the many tools and techniques that the programs offer. While open-source tools allow for better data collection and interaction, the proprietary software allows for a better expression of the data.

Overall, as an amateur learner in cartography, proprietary software is not personally the best option for me. However, I can see what can be achieved through its use if it is fully understood and utilized.