Using Palladio to Analyze Historical Migration Patterns: The Caveats
Why am I writing this tutorial?
The tutorial "From Hermeneutics to Data to Networks: Data Extraction and Network Visualization of Historical Sources" by Marten Düring on Programming Historian was a great inspiration for me in exploring Palladio, a network analysis tool, and in using it to analyze a historical dataset for studying migration patterns. Nevertheless, when I drilled into the visualization Palladio produced from my data, I found myself perplexed by some deceptively obvious characteristics of the graph, such as the nodes' sizes, the connections, and the relationships between them. Once I worked through these puzzles, I gained an entirely new understanding of the graph, which in turn led to more research questions that can be asked of it.
At the core, the data I used is by nature very different from the data extracted from textual sources in Marten Düring's tutorial, so the underlying concepts and potential caveats that I found important for interpreting the visualization were not covered there. What's more, since Palladio is still in its infancy, there aren't many tutorials out there at all. I therefore think that uncovering these hidden concepts will ease researchers into experimenting with Palladio, which is very easy to use and therefore well suited to newcomers to network analysis.
Background and the dataset
The Canadian government imposed a head tax on Chinese immigrants entering Canada between 1885 and 1923 in order to restrict immigration. A print register was created to keep track of the influx of migrants, and its detailed records yielded years of demographic information about the immigrants, which has become a rich source of data for researchers. For more background, please refer to the Research page of this site.
The visualization produced by Palladio
This graph was born out of my interest in examining migration patterns at the granular level of the village, which I believe served as a closely knit clan or social unit. Each number represents an origin village in China, while the immigrants' destinations are spelled out (in a darker shade). Wherever a village node is connected to a destination node, it means that immigrants originating from that village in China chose that destination in Canada.
The trap of node size
The first thing to notice on this graph is the size of the nodes. Vancouver and Victoria loom as the most prominent destinations. Hmm, let's pause for a second here: do you think that's because Vancouver and Victoria are connected to the largest number of villages, as the graph seemingly suggests? In other words, do you think the node size corresponds only to how many lines it connects to?
To find the answer, let's see how we can simplify our data. Below is an extremely simplified spreadsheet with only 4 rows of hypothetical records. Note that each row represents an immigrant. In the original dataset, there are many other variables about the immigrants' demographic information, such as age, arrival year, and height, but for the sake of clarity only two variables, the origin (village code) and the destination, are retained.
After feeding it to Palladio (with village code as Source and destination as Target), here is the visualization Palladio produced:
Now let's put a spotlight on Destination 3: it is connected to only one village (237), so how come its size is larger than that of Destination 1 and Destination 2?
Well, it turns out that a node's size is related to how many lines it connects to, but this is not the whole picture. In fact, a node's size is also tied to another factor: how many immigrants a village sent. We can write an equation like this:

$$\text{node size} = \sum_{i=1}^{n} A_i$$
Let’s first take a destination node as an example to see how we can apply the equation:
n = the number of villages the destination is connected to (that is, how many lines connect to the destination)
A_i = the number of immigrants that the i-th connected village sent to this destination
So this equation sums the immigrants sent to this destination by every village connected to it.
Now it's clear that a node's size equals the total number of immigrants that the corresponding destination received. In statistical terms, the node size corresponds to the frequency of this destination! I find it fascinating to build a mental link between the network visualization and the statistics computed from the same data.
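To make the difference between "number of lines" and "frequency" concrete, here is a minimal Python sketch. It is not Palladio's own code, only an illustration of the interpretation described above. The four rows are hypothetical: village 237 and the three destination labels follow the example, while village codes 101 and 102 are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical records: one (village code, destination) pair per immigrant.
# Village 237 and the destination labels follow the example above;
# codes 101 and 102 are invented for illustration.
records = [
    ("101", "Destination 1"),
    ("102", "Destination 2"),
    ("237", "Destination 3"),
    ("237", "Destination 3"),
]

# Frequency: how many immigrants (rows) each destination received.
frequency = Counter(destination for _, destination in records)

# Degree: how many distinct villages (lines) each destination connects to.
villages_by_destination = defaultdict(set)
for village, destination in records:
    villages_by_destination[destination].add(village)
degree = {dest: len(villages) for dest, villages in villages_by_destination.items()}

for dest in sorted(frequency):
    print(f"{dest}: lines={degree[dest]}, immigrants={frequency[dest]}")
```

Under these assumed rows, Destination 3 has a single line but two immigrants, which is why its node would be drawn larger than those of Destination 1 and Destination 2.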
At this point, if we take a step back, we can take a bird's-eye view of the graph. It gives us very rich information at a single glance: which villages are linked to which destinations, and the frequencies of each village and each destination.
A final thought
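The same summary can be sketched outside of Palladio. The snippet below is a rough illustration, again using hypothetical rows (village codes 101 and 102 are invented), that tabulates the village-to-destination links together with the frequencies of villages and destinations, and checks that each destination's frequency equals the sum of immigrants over the villages linked to it.

```python
from collections import Counter

# Hypothetical records: one (village code, destination) pair per immigrant.
# Codes 101 and 102 are invented for illustration.
records = [
    ("101", "Destination 1"),
    ("102", "Destination 2"),
    ("237", "Destination 3"),
    ("237", "Destination 3"),
]

# Each distinct (village, destination) pair is one line in the graph;
# its count is the number of immigrants who travelled along that line.
edge_weights = Counter(records)

# Frequencies of the two kinds of nodes.
village_frequency = Counter(village for village, _ in records)
destination_frequency = Counter(destination for _, destination in records)

# Check: a destination's frequency is the sum of immigrants sent by
# every village connected to it.
for destination, size in destination_frequency.items():
    total = sum(
        count for (_, dest), count in edge_weights.items() if dest == destination
    )
    assert total == size

for (village, destination), count in sorted(edge_weights.items()):
    print(f"village {village} -> {destination}: {count} immigrant(s)")
```

Keeping the link counts separate from the node frequencies makes it easier to see which part of the picture comes from connections and which comes from the sheer number of people.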
Given its very low learning curve, Palladio is a perfect tool to try out as a way into network analysis. However, there are risks in rashly diving in and harvesting a fantastic-looking visualization without thinking critically about the rationale behind the tool. As Emmanuelle Chaze explained on her blog, she saw many researchers motivated by a recurring impulse: “I have my data, but I’m no IT specialist, how can I quickly visualize my networks?” Yet many humanists willing to use digital tools remain reluctant to learn about their proper use first.
During my journey of exploring Palladio, I had a revelation: it is dangerous to feed my data into the machine and wait for it to spit out a beautiful-looking graph. The tool is like a "black box", and it is risky to blindly trust a black box. If we don't understand what algorithm it uses, we won't be able to interpret the results correctly. Hence in this post I focused on a seemingly no-brainer question, namely what a node's size represents, and from there dug into it to uncover the underlying computation. I didn't give any instructions on how to use Palladio, which, again, is very easy to learn. Should you be interested, this blog by Miriam Posner would be a great start.