Task 9: Network Assignment Using Golden Record Curation Quiz Data

Looking at the Palladio visualization of our Golden Record Curation Quiz data, I was first reminded of a similar data visualization in another MET course, ETEC 543: Understanding Data Analytics. In it, we were given access to Threadz, “a learning analytics tool that allows you to visualize and better quantify the student discussions happening in Canvas discussion boards” (University of British Columbia Learning Technology Hub, 2023).

Threadz works better than the visualization of the Golden Record quiz data. Threadz’s nodes are all of one kind, participants, rather than the two kinds in the Golden Record visualization (participants and tracks), and the connections in Threadz visually demonstrate who’s responding to whom, giving a real sense of tangible networks in both meanings of the word (edges between nodes, as well as connections between classmates). The Golden Record visualization, on the other hand, is visually overwhelming due to the degree of connectivity each node may have. Each participant node has edges to the ten track nodes that participant selected, and each track node could have anywhere from zero to 23 edges, depending on how many participants selected it (though from later analysis, the track nodes with the highest degree of connectivity were Johnny B. Goode and Melancholy Blues, each with 16 connections, and every track was picked at least once).

Because of this, and due to my familiarity with Excel, I decided to represent the data in another way to better analyze it. Unknown to me at the time (I looked at and analyzed the data before watching the videos for this module), I was building an adjacency matrix, though I used the word “Yes” instead of “1.”
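
To give a sense of the layout, here is a simplified, hypothetical fragment (the specific “Yes” entries shown are illustrative only, not the actual data):

                  Track 6    Track 11    Track 23
    Matt          Yes        Yes         Yes
    Stephanie     Yes                    Yes
    Carol         Yes        Yes

Each row is a participant, each column a track, and a “Yes” marks a selection, sitting exactly where a conventional adjacency matrix would place a 1.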

While building this adjacency matrix, I relied on Google searches to find Excel formulas that would automate the process rather than doing it manually. While my first, broader search did not return helpful results, my refined search, “search within a cell,” provided the information I needed in its top result. This second search also supports the point made in Code.org’s (2017) video on how search works: rather than interpreting “cell” biologically, for example by returning results about organelles such as mitochondria, the results all related to Excel cells, perhaps because of my previous search.
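
For anyone curious, the formula pattern that search surfaced looks something like the following (a sketch under assumptions: each participant’s raw list of selections sits as text in column B, and the exact cell references are hypothetical):

    =IF(ISNUMBER(SEARCH("Track 6", B2)), "Yes", "")

SEARCH returns the position of “Track 6” within the text in B2, or an error if it is absent; wrapping it in ISNUMBER and IF converts that result into the “Yes”/blank entries of the matrix. SEARCH is case-insensitive; FIND is its case-sensitive counterpart.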

With the adjacency matrix completed, I first analyzed it by using Excel’s COUNTIF formula to see how many times each track had been selected, and Excel’s built-in sort function to place the tracks in order (this version of the adjacency matrix is not shown, as my later analysis rearranged the tracks). For tracks with low selection counts, I could have gleaned the same information from the Palladio visualization as from the adjacency matrix: for example, Track 22 was selected by only one participant, while Track 27 was selected by two. Yet the adjacency matrix was far better at giving the numerical degree of connectivity of the more popular track nodes, whereas on the Palladio visualization I would have had to count each node’s edges manually.
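
As a sketch of that counting step (assuming a given track’s column of “Yes” entries runs from row 2 to row 24 for the 23 participants; the references are hypothetical):

    =COUNTIF(B2:B24, "Yes")

Dragging this across the track columns produces each track’s selection count, that is, the degree of connectivity of each track node, which Excel’s sort can then place in order.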

Comparing the facet grouping to an adjacency matrix also yielded interesting results. I was placed in group two with Stephanie, Jonathan, Carol, and Carlo (highlighted light green on my adjacency matrix). Looking at the Palladio visualization for this group first, I noticed that there was only one piece all of us selected, Track 6: El Cascabel, and there were several tracks that most of us selected, such as Track 11: The Queen of the Night Aria (three of us selected this one) and Track 23: Wedding Song (four of us selected this one).

Curious as to how these groupings were made, I rearranged the tracks down to only the ones I had selected and, using the aforementioned “search within a cell” formula results, quickly built another column showing the number of tracks I had in common with each participant. Of the 22 other participants and 27 tracks, I had between two and five tracks in common with each participant, and the following people had five tracks in common with me: Stephen, Stephanie, Lachelle, Kristjana, and Jonathan. I immediately noticed that only two of these (Stephanie and Jonathan) were members of the group Palladio created for me. With more time, I could have done this for each participant and built a new table showing how many tracks every participant had in common with every other participant, to see if the algorithm grouped us on that basis.
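
A sketch of that “tracks in common” column, assuming my own row of Yes/blank entries is fixed in row 2 and spans columns B through AB for the 27 tracks (this layout is hypothetical):

    =SUMPRODUCT(--($B$2:$AB$2="Yes"), --(B3:AB3="Yes"))

SUMPRODUCT multiplies the two arrays element by element and sums the result, so it counts only the tracks where both my row and the comparison participant’s row contain “Yes”; the double negation converts the TRUE/FALSE comparisons into 1s and 0s.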

Yet I went with another approach to dig deeper without building a new table: looking at the specific tracks we had in common. This analysis was far easier with the adjacency matrix, thanks to the ability to rearrange and sort nodes; it would have been quite difficult on the Palladio visualization, trying to follow specific edges when there are so many of them. I moved the five of us to the top of my adjacency matrix, and it turned out that the only track all five of us selected was El Cascabel. Yet what about the other seven participants who also selected El Cascabel? Why weren’t they in our group? The only track selected by four of us was Wedding Song, but there were also others, such as Stephen, Lachelle, and Kristjana, who selected both El Cascabel and Wedding Song but weren’t part of our group. Perhaps it was our non-selections? The only three tracks not selected by any of the five of us were the Brandenburg Concerto 2, Sacrificial Dance, and Flowing Streams. Stephen and Lachelle both selected Flowing Streams, while Kristjana selected Sacrificial Dance. Was that the requirement set up by the grouping algorithm: that one needed to select El Cascabel without selecting Brandenburg Concerto 2, Sacrificial Dance, or Flowing Streams? This is only a hypothesis, and I lack both the time and the additional data to see if this was how the algorithm grouped us, though a flag column like the one sketched below could have tested it.
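
Had I tested this hypothesis, a single formula could have flagged who satisfies the rule (a sketch; the assumption that the El Cascabel, Brandenburg Concerto 2, Sacrificial Dance, and Flowing Streams columns are F, G, H, and I is hypothetical):

    =IF(AND(F2="Yes", G2<>"Yes", H2<>"Yes", I2<>"Yes"), "Matches rule", "")

Any participant flagged “Matches rule” who is not in group two would falsify the hypothesis immediately.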

That said, if this is indeed how the algorithm grouped us, then it is quite an arbitrary grouping system, one that highlights a key point about machine learning and algorithms I’ve encountered in other MET courses as well as in my personal life: algorithms are programmed only to link data and find correlations, without much consideration of the reasons behind those correlations. This lack of consideration of causation shows in these groupings: I selected my tracks with the rationale of wanting to demonstrate human vocal cord capabilities to any potential alien species that discovers the Golden Record. Other members of my group focused on diversity of some sort, whether geographical, cultural, or in the instruments used. Their primary reasons were quite different from mine, but the algorithm placed me in the same group based solely on the results rather than the rationale.

My various analyses and hypotheses about how we were grouped also highlight another key point about algorithms: while programmers designed algorithms and machine learning methods to find correlations in data, the processes that determine the output are becoming incomprehensible even to the programmers themselves. This is outlined by Rudin and Radin (2019), who highlight numerous negative consequences of the “black box model” of machine learning, such as not knowing about deficiencies in the data that trained the algorithm (here, the reasons behind our track selections), or how people may incorrectly hypothesize the processes the algorithm used to produce the groupings (my hypotheses may very well be rejected with more data and analysis). In addition, relying on machine groupings without understanding the reasons behind them may lead to negative consequences such as perpetuating biases already present in the training data, or enshrining grouping criteria that turn out to be inconsequential, such as whether participants selected El Cascabel without selecting Brandenburg Concerto 2, Sacrificial Dance, or Flowing Streams.

References

Code.org. (2017, June 13). The Internet: How search works [Video]. YouTube.

Rudin, C., & Radin, J. (2019). Why are we using black box models in AI when we don’t need to? A lesson from an explainable AI competition. Harvard Data Science Review, 1(2). https://doi.org/10.1162/99608f92.5a8a3a3d

University of British Columbia Learning Technology Hub. (2023, October). Threadz instructor guide. https://lthub.ubc.ca/guides/threadz-instructor-guide/

3 comments

  1. Matt,
    I enjoyed the thoroughness of your data evaluation and the execution of your spreadsheet. I love a good spreadsheet, so kudos for what you’ve built. I found the data in the Palladio visualization to be a bit vague and uninteresting so I’m glad you went to the trouble of organizing an adjacency matrix, whether you knew that’s what it was termed or not.

    During the completion of this task, I was curious about which songs had low levels of selection, and it seems that you were also interested in this, enough so to find the answer for us. It would be interesting to run a separate data analysis on the non-selections and see how the groupings changed. Much like myself, it seems that you were left unsatisfied with the level of analysis possible given the Palladio visualization, and were left with questions about the reasoning behind the groupings. It would be beneficial to have more information about the participants (rationales, age, socioeconomic status, gender, geographical location, musical tastes, time spent on the task, etc.) and see if there are some commonalities between the groupings that are currently unclear to us. I’d be curious to hear your thoughts on which program you think would be best for running this sort of data analysis.

    Well done on the task post, it’s always a pleasure to read your work.
    Katy

  2. Hi Matt, I really appreciate your detailed analysis! I reached a conclusion parallel to yours: quantitative data is not as useful as it may appear. It reminds me of when I was doing my BEd back in 2018/2019 and one of the students in my class had a PhD in microbiology. In one of our courses we were talking about research, and he was adamant that the humanities, as a whole, did not do real research because of the difficulty of quantifying their findings. Real research, to him, started and ended with quantitative analysis. It got to the point where the TA teaching the class had to be very direct with him and say that the humanities do, in fact, use real research. This exchange has a very interesting spot in my brain, as it marked a clear shift in how I thought about learning.

    As you very succinctly pointed out, the “why” is far more valuable than the “what”. Five people selected Pygmy Girl’s Initiation Song… so what? The real information is why they selected it: what were the criteria behind it, what level of education do they have that informs their choice, is there any demographic correlation to the selection… there are probably a dozen questions that “should” be asked before placing value on the quantitative result.

    Thanks for your insightful post!

  3. Hi Matt,

    Your analysis of the golden curation data was an interesting read! I really appreciated learning about your approach, as it was quite different from my own, yet I think we ended up with similar conclusions.

    I agree that the visualization of the data on Palladio is overwhelming, and I struggled with analysis of the information. Your adjacency matrix was a great workaround for this and I wish I thought of it myself. My approach was much more low-tech. I began by listing the information I knew to be true from the network data and then tried to draw conclusions based on that.

    Overall, I found the groupings somewhat arbitrary and challenging to interpret. As you pointed out, while we in group 2 shared some song choices, these were also common among others in the course (and not in our group). For this reason, it was difficult to determine how the groups were formed based on selections. Your consideration of tracks we didn’t choose (or some combination thereof) was a good suggestion and something I hadn’t thought of myself.

    Regardless of how the groups were formed, I also had trouble with the idea of “community” in this context. I think we both noticed that people in our group chose songs for reasons different from our own (and to be honest, many of my choices were quite arbitrary). This makes me wonder about the basis upon which this “community” was built and connects to your observation that algorithms focus on finding correlations rather than understanding causation. Here it feels like patterns were identified without fully understanding the reasoning behind them.

    I have to say, your analysis of algorithms is awesome because it highlights issues like potential bias in data and grouping criteria, which we’re currently reading about in Module 11. For someone like me, who initially saw algorithms as almost like magical formulas based solely on objective numbers and hard data, it’s important to be reminded that they do have limitations. Your mention of the “black box” nature of algorithms is a great example.

    Thanks for bringing these concepts together. I appreciate it!

    Steph
