Methods of quantitative data classification and their implications

As part of this lab, we were required to create four separate maps of the Vancouver census tracts, each using a different data classification method. Of the many methods of data classification that exist, the four most commonly used (and the ones we used in the lab) are:

  • Natural breaks
  • Equal interval
  • Standard deviation
  • Manual breaks

The natural breaks, equal interval, and standard deviation methods are automated within ArcMap. Natural breaks is the default method in ArcGIS; it groups values according to natural groupings in the dataset, with a default of five classes. Equal interval divides the full range of values into classes of identical width. Standard deviation is based on statistical principles, grouping values according to how far they fall from the mean of the dataset. With manual breaks, the GIS user sets the class breaks by hand rather than having the software calculate them.
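To make the two fully automated methods concrete, here is a minimal sketch (not ArcMap's actual implementation) of how equal-interval and standard-deviation breaks can be computed, using hypothetical housing-cost values:

```python
import statistics

# Hypothetical census-tract housing costs (illustrative only)
values = [250_000, 400_000, 520_000, 610_000, 800_000, 1_200_000, 3_500_000]

def equal_interval_breaks(data, n_classes):
    """Divide the full range of the data into classes of identical width."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_classes
    return [lo + width * i for i in range(1, n_classes)]

def std_dev_breaks(data, multiples=(-1, 0, 1)):
    """Place breaks at fixed multiples of the standard deviation
    around the mean of the dataset."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [mean + m * sd for m in multiples]

print(equal_interval_breaks(values, 5))
print(std_dev_breaks(values))
```

Note how the equal-interval breaks depend only on the minimum and maximum of the data, while the standard-deviation breaks depend on the overall shape of the distribution.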

While each of these methods has pros and cons, no single method of data classification can be deemed superior to the others. The method chosen to classify a dataset depends on the purpose of the map. When comparing datasets (for example, housing affordability between Vancouver and Ottawa, as is the case here), it is important to use the same range of values so that the data are represented consistently.

This map shows how these four different methods of data classification produce very different visualizations of housing affordability in Vancouver, even though the underlying dataset is exactly the same. These differing visualizations of the same dataset could lead to different interpretations of housing costs across Vancouver's census tracts, and could prove detrimental to the intention of the map.


Ethical implications associated with the choice of data classification method: 

The method of data classification used to portray data on a map influences how the map looks. As such, data classification can be subjective: a particular method may be chosen precisely because it will produce a particular result, effectively manipulating the outcome of the map to suit the mapmaker's goals.

In this lab concerning housing affordability, I was asked which classification methods I would use in two different scenarios: (1) if I were a journalist putting together maps of housing cost in Vancouver, and (2) if I were a real estate agent preparing a presentation for prospective home buyers in the area near the University of British Columbia.

First of all, there are two main ways of summarizing housing cost in an area: the median and the average (mean). The median cost of housing gives a good idea of the price of a typical property in an area, as well as a picture of how that area has performed over time. Looking at median prices over time can also indicate market trends, help estimate property values, and show whether a location is within a buyer's price range. To get a complete picture of the market in an area, however, median prices must be considered alongside other factors. Using the average cost of housing, by contrast, can yield skewed results: a significant outlier in price, on either the high or the low end, pulls the average toward it and can misrepresent what a typical home costs.
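The effect of an outlier on the average versus the median can be shown with a quick sketch, using made-up prices (one hypothetical mansion among otherwise similar homes):

```python
import statistics

# Hypothetical sale prices: four comparable homes and one extreme outlier
prices = [450_000, 500_000, 550_000, 600_000, 9_000_000]

# The median ignores the magnitude of the outlier...
print(statistics.median(prices))  # 550000

# ...while the mean is dragged far above every typical home's price.
print(statistics.mean(prices))    # 2220000
```

Here the single outlier quadruples the "average" price relative to the median, which is exactly why the choice between the two summary statistics matters ethically.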

If I were a journalist, I might choose the equal interval method of data classification to divide housing cost into classes of equal width. This would effectively allocate the small fraction of exceptionally expensive houses to a class of their own, and on the map it would appear as though only a small fraction of homes in Vancouver are exceptionally expensive, when in reality nearly all of Vancouver is severely unaffordable. Representing housing cost with equal intervals could thus misrepresent the distribution of housing cost and lead readers to believe that a greater area of Vancouver has lower housing costs than is actually the case. Furthermore, choosing average housing cost over median housing cost could also have ethical implications, as the average could skew the results and yield a value that does not fairly represent homes at the lower end of the price spectrum.
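The lumping effect described above can be sketched with hypothetical tract costs: when equal-interval classes are applied to a right-skewed dataset, nearly every tract lands in the lowest class and only the outliers stand apart.

```python
# Hypothetical census-tract costs: six similar tracts and one outlier
costs = [700_000, 750_000, 800_000, 850_000, 900_000, 950_000, 5_000_000]

lo, hi = min(costs), max(costs)
width = (hi - lo) / 5  # five equal-interval classes

# Assign each tract its class index (0 = lowest class, 4 = highest)
classes = [min(int((c - lo) // width), 4) for c in costs]
print(classes)  # [0, 0, 0, 0, 0, 0, 4]
```

Six of the seven tracts fall into the lowest class even though all of them are expensive in absolute terms, which is the visual distortion the paragraph above describes.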

If I were a real estate agent, I might choose to display the housing data on the map using median housing cost, because the average housing cost near UBC is among the highest in the city due to the extremely expensive homes in the area. By using the median, the value I could show prospective buyers would be lower, since it takes into account the less expensive homes in the city. Ethical implications certainly arise here, as I would be purposely choosing the median in an attempt to report a lower cost of housing to prospective buyers in the area. As for the data classification method, I might choose manual breaks; however, the fact that I would be deciding where to set the breaks produces its own ethical implications, since the values I choose could lead to very different representations of the data.
