Cleaning Data

Cleaning Data

Data are seldom presented in a way conducive to data visualization and must be “cleaned” before they can be utilized.

But because data sets are often enormous and sometimes contain thousands of points, it is not an efficient use of time to manually clean data. 

The preloaded functions of spreadsheet software like Excel are very helpful on this front.

Download the data

  1. Download and open the following data set: Horn of Africa Refugee Camp Statistics
  2. The file contains data on 11 different Refugee camps in Ethiopia and Kenya over multiple dates.
    1. We want to map this data using Google Fusion tables (in the next exercise). But to do that we need to decide what to keep and what to eliminate.
  3. Copy the entire spreadsheet and paste it into a new Sheet.
    1.  Make sure you capture the headings.
    2. Click the small ‘+’ sign next to the name of the Sheet you are currently working in. This creates a new blank sheet.
  4. Save your work.

Delete the empty and unwanted rows and columns

  1. Delete the blank rows below the column names by clicking the Row numbers at the left.
    1. Shift + select to highlight both
    2. Right click and select Delete (using the delete key will only delete the contents of the first cell)
  2. Delete the column called Source, it is redundant in this case.

Sort the data

To get a better sense of what you’re looking at, you’ll want to arrange the data.

  1. First, you can freeze the Header row in place so you’ll always know which data you’re looking at.
    1. To do this, select the SECOND row. From the Window menu, select Freeze Panes. The header column should stay put when you scroll up and down.
  2. Select all the data (Command + a) and, from the Data menu, select Sort. Under Column, select Camp to first arrange the data by Camp name. Select the “+” sign below to add another condition. Under Column, select DatePublished and under Order, select Newest to Oldest.

**Note: if you don’t select all the data before sorting, you will mix up the rows – an error that is hard to catch in a spreadsheet full of numbers

 

Now that data is sorted you can see each camp has multiple data entries.

Remove Duplicate Data

You only want the most recent data to show up on your map, so you need to remove the older stuff. To do this, you will use the “Remove Duplicates” function.

  1. Select all (Command + a).
  2. Under the Data tab, click Remove Duplicates
    1. In the pop-up window, un-check the Select All box and then check Column A.
    2. Click Remove Duplicates

Only the most recent data points should be left.

(For users of older versions of Excel)

  1. Select column A
  2. Under the data menu, click filter, then advanced filter
  3. Check the Unique records only box
  4. Click OK

Now copy the data and paste it into a new Excel file. This ensures that Google Fusions tables will know which sheet you are trying to upload (see Exercise 3).

  • Call the new file V2.0 UNHCR Horn of Africa.

 

Other helpful data cleaning tips

See the link for more ways of cleaning up your data:

http://multimedia.journalism.berkeley.edu/tutorials/cleaning-data/cleaning-data/

Next Exercise: Mapping with Google Fusion —–>

 

Secret Clean Data File