After reviewing the dataset with Palladio, I don’t believe an analysis of it can provide any meaningful information about the reasons behind the participants’ song choices.
Filtering the Palladio graph by community shows a difference in mutual connections between larger and smaller communities. For instance, in the largest community of 5, members had an average of 8.4 song choices in common, while in the smallest community of 2, the members had only 3 songs in common. This seems to suggest that the largest communities select the most popular songs on average. Indeed, 5/6 of the members in the largest community selected Johnny B Goode, the most popular song overall. However, only one member of this community selected Tsuru No Sugomori (Crane’s Nest), the second most popular song. The difference between communities may instead be due to the fact that the more members a community has, the higher the chance of having a selection in common with any other member.
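To make the comparison concrete, the “average songs in common” figure can be computed as the mean pairwise overlap of members’ selection sets. This is a minimal sketch with made-up members and selections, not the actual dataset or Palladio’s internal calculation:

```python
from itertools import combinations

# Hypothetical communities: each member's song selections as a set.
# Member names and selections are illustrative only.
communities = {
    "large": {
        "p1": {"Johnny B Goode", "El Cascabel", "Dark Was the Night"},
        "p2": {"Johnny B Goode", "El Cascabel", "Melancholy Blues"},
        "p3": {"Johnny B Goode", "Dark Was the Night", "Melancholy Blues"},
    },
    "small": {
        "p4": {"Johnny B Goode", "Flowing Streams"},
        "p5": {"El Cascabel", "Flowing Streams"},
    },
}

def avg_shared(members):
    """Average number of songs shared by each pair of members."""
    pairs = list(combinations(members.values(), 2))
    return sum(len(a & b) for a, b in pairs) / len(pairs)

for name, members in communities.items():
    print(name, avg_shared(members))
```

Because a larger community contains more pairs, it has more opportunities for any two members to overlap, which is the confound noted above.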
The two least selected songs were Pygmy Girls’ Initiation Song and String Quartet, with two edges each. From the data alone, this suggests they are the least appropriate songs for the Golden Record Curation. However, these two songs were missing from the YouTube playlist provided for the task, so they were likely overlooked by those who didn’t realize the playlist was incomplete. Had they been on the playlist, more people might have selected them. An unaware observer would draw incorrect conclusions about the value of those songs from their lack of connections. This is a clear example of an unknown external influence undermining the reliability of a dataset.
Additionally, not all participants made 10 song selections, and the reason is not clear. Did they forget some songs when submitting their list? Was it an error when creating the JSON file? Was it intentional? Without the context of the task, an observer might assume there is a reason not all participants have 10 connections, when in fact it may be an accident.
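A simple validation pass over the JSON file would at least flag the incomplete submissions before analysis. This is a sketch assuming a hypothetical schema with `curator` and `songs` fields, which may not match the actual file:

```python
import json

# Illustrative input only; the real file's structure and contents may differ.
raw = """
[
  {"curator": "A", "songs": ["s1", "s2", "s3"]},
  {"curator": "B", "songs": ["s1", "s2", "s3", "s4"]}
]
"""

def incomplete_curators(records, expected=10):
    """Return curators whose selection count differs from the expected total."""
    return [r["curator"] for r in records if len(r["songs"]) != expected]

print(incomplete_curators(json.loads(raw)))  # -> ['A', 'B']
```

Flagging these rows doesn’t explain *why* the selections are missing, but it would tell an observer that the gap exists before they interpret the graph.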
None of these dataset characteristics ultimately tell us why these songs were selected. One way to capture that in the data would be to ask participants to assign labels to each song indicating why they chose it. For example, a “percussion” label could mark a song chosen for its drum content. Of course, that would require anticipating and defining labels for every possible reason someone could select a song. The lesson I take from this is that the purpose of data needs to be considered before it is collected, to ensure the right data is gathered. Even when gathering data for a specific purpose, external variables like user error can influence the output in non-obvious ways. The connections themselves give no indication of the meaning behind them; even understanding the connections between nodes requires external knowledge of the task that isn’t evident in the data itself.
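The label idea could be sketched as selections that carry reason tags, which can then be aggregated to show *why* songs were chosen rather than just *that* they were. The records and label names here are hypothetical, assuming labels were defined in advance as described:

```python
from collections import Counter

# Hypothetical labelled selections: each choice carries reason tags.
selections = [
    {"song": "Johnny B Goode", "labels": ["percussion", "energy"]},
    {"song": "Melancholy Blues", "labels": ["emotion"]},
    {"song": "El Cascabel", "labels": ["percussion"]},
]

# Tally how often each reason was given across all selections.
reason_counts = Counter(label for s in selections for label in s["labels"])
print(reason_counts.most_common())  # -> [('percussion', 2), ('energy', 1), ('emotion', 1)]
```

The weakness noted above still applies: the tally is only as informative as the label set someone defined before collection.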