When Dr. Van Neutegem set us to work on the PRT/WSP assignment, I was excited. Of all the assignments we undertake in the graduate certificate, this was probably the one to which I’d been looking forward the most. Generally speaking, hockey is still in the dark ages when it comes to using data and evidence to make decisions; there are still NHL teams that don’t employ a single staff member to collect and analyze data. While this might seem like a commitment to mediocrity, the truth is that hockey is so conservative and slow to change that most people simply don’t believe data can help you perform. There are still respected coaches and executives in hockey whose idea of “data” is tracking goals, assists, and shots. That’s the general landscape, but my passion and my area of true expertise is refereeing, which is even further behind.
The reason I am here, in this program, is to bring refereeing out of the dark ages. Hockey is one of the few sports where refereeing is an athletic pursuit; other examples would be soccer, rugby, and (to a lesser extent) basketball. Rugby and soccer lead the way in terms of research and data-driven analysis of their referees. Over the last twelve months, I have been devouring their research on physical attributes, decision-making, and psychology to inform my own research. Ultimately, my goal is to progress to the Master’s program and conduct research to validate the models that I have started to build as part of this course.
The Winning Style of Refereeing
I won’t recap my entire WSP presentation or the process that led me there because it is clearly explained, with corresponding visuals, in my presentation. Instead, I will simply summarize my findings and move on, but you’re welcome to skip this section to get to the meat of the post.
My primary objective was to measure decision-making ability. Decision-making isn’t the only factor. Hockey officiating is physically demanding: not only do you have to be a good, technical skater but you also have to be extremely fit because you’ll be skating three twenty-minute periods without the same breaks that players have. Having said that, there are plenty of ways to measure physical fitness and, realistically, nobody is reaching this level with deficiencies in their fitness. I’ve actually discontinued on-ice fitness testing in my program because, at the elite level, everyone’s scores are so close to one another that it’s not an effective way of evaluating officials. So, while decision-making is not the only factor, it is the most important and the one on which I needed to focus my attention.
I created four statistics that I hypothesized would measure decision-making: duels, penalty points per decision, non-penalty points per decision, and positioning errors per 60 minutes of play (to account for games that went to overtime). The original idea (and term) of “duels” came from soccer. Opta defines a soccer duel as a “50-50 contest between two players of opposing sides in the match”. That still wasn’t specific enough for me, particularly because there’s more physical contact in hockey (even women’s hockey) than there is in soccer. I settled on defining a hockey duel as “any time a player uses their body or stick to apply opposite-directional force to an opponent”. The difference is small but crucial. There is so much body contact in a game that without the “opposite-directional” qualifier, my model would award referees so many points for unpenalized duels that it would completely invalidate the statistic.
Points Per Penalty Decision (PPD) validated my hypothesis; the three referees I identified scored extremely well. It would appear that this category is a workable way of measuring officiating performance.
Points Per Non-Penalty Decision (NPPD) were all over the place and I couldn’t identify any rhyme or reason for why that was. One possibility is that my categorization and weighting isn’t quite right; the other is that the sample is too small to draw conclusions. It’s also possible that this is just an exclusion criterion: i.e. if you can’t consistently hit a certain number of NPPD, you shouldn’t be at this level but over and above that benchmark won’t buy you any extra credit. Either way, an area for further research.
Positioning Errors Per 60 Minutes (PE/60) also validated my hypothesis, although it didn’t appear that way at first. As you can see on the right side of the graph, one of the best-performing referees (the lowest number of PE/60) was not one of the referees that I expected. So, I thought about why that might be and considered the intensity of the game. It seems likely that the more intense the game, the more likely a referee is to make positioning errors; a more intense game is more difficult to predict, which means a higher probability of errors. So I compared the number of duels per game to the PE/60 and that showed a clear trend that matched my hypothesis.
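To make the PE/60 normalization and the duels comparison concrete, here is a minimal Python sketch. The game figures are invented purely for illustration (they are not my WSP data), and the function names `per_60` and `pearson` are my own labels rather than part of any established tool. A Pearson correlation is just one possible way to put a number on the "clear trend"; with a sample this small it should be read as descriptive only.

```python
from math import sqrt


def per_60(errors: int, minutes_played: float) -> float:
    """Scale an error count to a 60-minute game so overtime games compare fairly."""
    return errors * 60.0 / minutes_played


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical games: duel count, positioning errors, minutes played.
games = [
    {"duels": 70, "errors": 3, "minutes": 60.0},
    {"duels": 95, "errors": 5, "minutes": 60.0},
    {"duels": 120, "errors": 7, "minutes": 70.0},  # overtime game
]

pe60 = [per_60(g["errors"], g["minutes"]) for g in games]
duels = [float(g["duels"]) for g in games]
trend = pearson(duels, pe60)  # positive when busier games carry more errors
```

The overtime game shows why the per-60 scaling matters: seven errors over 70 minutes counts the same as six errors in a regulation game.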
Dr. Van Neutegem’s Feedback
I was pretty pleased with the outcome of my WSP project. It was a smaller sample size than I had originally hoped for but that was out of my control and ultimately, I felt like I’d created something credible that could be the basis for further research. Having said that, Dr. Van Neutegem is literally the ultimate arbiter for a WSP in Canada, so I was a little nervous about submitting it and awaiting his feedback. Especially since he set his deadline for Christmas and while I’m not a huge celebrator of Christmas, a low mark on this assignment would put a bit of a damper on the holiday. Fortunately, his feedback was pretty positive and I’m going to address it point-by-point here as a way of leading into the next steps.
On LTAD and transfer of WSP concept to refereeing…
- AVN: Arguably, your Pathway is based more on their qualifications and decision-making levels, and possibly the type (intensity?) and number of matches officiated. I would resist any notion of ascribing a referee pathway to align with the athlete pathway. A referee who officiates a T2W athlete match (e.g Olympic finals) is not a T2W referee unless the benchmarks associated with that athlete stage have been defined and probably more than just the fact that they were selected to an Olympic final.
- DH: This is interesting because I had been trying to conform an idea of refereeing LTAD to the athlete pathway template for ease of transfer (i.e. making it easy to explain what I’m doing and what level of athlete I’m working with). But Andy is suggesting that I build out a completely separate pathway that is more focused on benchmarks. Either way, we don’t have an NSO-defined pathway in refereeing (because we’ve never taken coaching seriously at all levels) and that is an area for growth.
On scoring duels and decision-making…
- AVN: I am assuming that the evaluation matrix [for duels] would be based on consensus of peers assessing the decision? How do we achieve validity regarding the evaluation?
- DH: This is something that I would undertake as part of a thesis-level project but the short answer is yes, to have a panel of experts presented with clips and individually judge the decisions. For the WSP, I was the only one judging the decisions as correct or incorrect but I work with our leadership at the provincial and national level and feel confident that my assessments would match with consensus. I would realistically have to pay people for their time and that’s something that would need funding of some kind.
- AVN: You did a great job acknowledge the limitations and determining a plausible model or proof of concept. Perhaps you can grade the intensity of each duel (e.g. involving 2 or more players) and ascribing an evaluation to the situation in an algorithmic assessment.
- DH: This is an interesting idea. Obviously, any part of the assessment that could be delegated to an algorithm would be great. However, I’m not sure that having more players involved (particularly in women’s hockey) actually increases the intensity or the difficulty. But it would be something worth exploring as part of a larger project.
On inclusion and exclusion criteria
- AVN: You assessed several referees based on their medal performances. Arguably you made the assumption that their consistent appearance at medal matches in major competitions defined them as top referees. You retro-fitted the decision-making assessment to validate your assumption. In the future, the decision-making scores should be the definition of ‘top’ referee.
- DH: Absolutely agreed with Andy here and this is a major limitation of my project from a validity perspective. My WSP is not “valid” from a scientific perspective. I had to work with the video that was available to me and I hypothesized that a select few referees who are generally accepted to be the top in the world would come out ahead of the other referees. In a thesis-level project, an entire tournament would be watched and evaluated using the model and then the “top” referees would be determined based upon the objective evaluation.
In applying the model across all levels of competition as a true WSP
- AVN: Context is important and as mentioned, a ‘junior’ referee could achieve the same benchmark for their performance context as a ‘senior’ referee. Is this defendable? If you approach your benchmarks as being universal, it will require a defined set of criteria to demarcate the levels of referee performance/context. Not all Olympic finals could be difficult to officiate if the number of duels and severity is low. Perhaps a junior official could manage that game given the fact that it is only the speed of the game that probably differentiates the highest standard of play from other levels of play. Conversely, a junior game (e.g. Canada Games final) could be extremely difficult to referee (more so than the Olympics perhaps) if the duels are high in number and very competitive. Again, perhaps duels need to be more defined, and different gradations given.
- DH: Andy makes a good point. One of the acknowledged limitations in my presentation was that I do not have a concrete way to compare the level of play between the U18 national, U18 international, and senior international levels. My thought is that these would have to be built on player attributes: weight, skating speed, shot power, etc. Everything else is relative; Andy mentioned grading duels but ultimately U18 players going against U18 players won’t be as intense as Senior players going against Senior players. I don’t really know how else to do that, although someone who works in analytics in another sport might have some useful advice. Ultimately, I envision the benchmarks as being universal: i.e. a referee at the U18 level should meet the benchmark in order to move to the Senior level, at which point, they probably won’t hit that benchmark right away.
Where to Now? Next Steps
So now I’m left thinking about the next steps. I believe this introduction of statistical performance analysis can actually change the game. At present, there is no objective way to analyze the performance of a referee and that leads to all kinds of problems, both internally and externally. Internally, there’s no way for a referee to judge their own progress and, oftentimes, the final decision on whether a referee is given the opportunity to progress to the international level or turn professional (men only) is based on the personal preference of decision-makers. Externally, it’s extremely difficult to justify decisions to teams, league officials, or the public.
So what are the next steps between where I am now and the conclusion of Year 3 of the HPCTL program, where I have a validated model that I can credibly “sell” to Hockey Canada and the International Ice Hockey Federation?
- Establish a way of measuring game intensity. As I found, and Andy reinforced, referee performances cannot be compared to one another without having a way to quantify the intensity of the game. To that end, I will need to…
- Conduct a second “test” with a small, non-valid sample size to explore the possibility of grading duels and see if I can come up with something that makes sense. Andy suggested grading duels based on the number of players involved. I’m not sure if that would work but it’s worth a try. Perhaps it would also make sense to separate and weight stick-duels vs. body-duels. Again, I have no idea if that would actually reflect the intensity of the game, but it’s probably worth a try.
- Obtain physical testing data from Hockey Canada regarding their national team players. As I said previously, I don’t think there’s any way to compare the U18 National, U18 International, and Senior International levels without comparing the players’ physical attributes. I’m thinking bodyweight, skating speed, and shot power would be good measurements to average out and then use as comparison points. Hockey Canada does extensive testing on their athletes and so, if I agree to non-disclosure of confidential information, they might agree to allow me access to the data for the purpose of creating a model.
- Identification of secondary benchmarks through the Gold Medal Profile. Decision-making is obviously the most important and physical benchmarks are not particularly useful at this level. But are there secondary benchmarks that could be included or used to inform my analysis and are there other sports from which I could borrow?
- I’m particularly interested in other sports because, as Andy said, there is no reason to use statistics from hockey as my point of reference. If I borrowed and adapted duels from soccer, there’s no reason to assume I can’t borrow other ideas from other sports.
- The world of scholarship on refereeing is not particularly helpful here… one focus of rugby/soccer has been evaluating the distance between the referee and the play that they have to judge as a foul or a legal play. There are two problems with that: 1) the research says that distance doesn’t matter once you’re within a certain range; and 2) those studies were conducted using GPS technology and hockey is an indoor sport. So, while I could theoretically do distance-evaluation via photogrammetry, this isn’t something that could be delegated to an algorithm and I don’t believe it would be worth the incredible amount of time it would take to execute.
- Continue to build out the LTAD pathway. As I discussed in my presentation, the pathway for referees to progress is very clear; it’s an A to B to C pathway. The challenge is how performances, which would allow a referee to progress, are judged. I’m not going to fix that by drawing out an LTAD pathway but having that be made very clear is useful.
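The duel-grading idea from the first step above could be prototyped along these lines. This is only a sketch under loud assumptions: the weights for stick-duels and body-duels are placeholders I made up, the rule that each additional player adds one unit of difficulty is untested, and `Duel`, `duel_grade`, and `game_intensity` are hypothetical names. The point of the second "test" would be to find out whether any weighting like this tracks intensity at all.

```python
from dataclasses import dataclass

# Placeholder weights, NOT validated: a body-duel is assumed to count
# twice as much as a stick-duel.
TYPE_WEIGHT = {"stick": 1.0, "body": 2.0}


@dataclass
class Duel:
    kind: str     # "stick" or "body"
    players: int  # number of players involved, at least 2


def duel_grade(duel: Duel) -> float:
    """Grade one duel: type weight times the number of opponents engaged."""
    return TYPE_WEIGHT[duel.kind] * (duel.players - 1)


def game_intensity(duels: list[Duel], minutes_played: float) -> float:
    """Sum of duel grades, scaled to 60 minutes so overtime games compare."""
    return sum(duel_grade(d) for d in duels) * 60.0 / minutes_played


# Hypothetical example: one two-player stick-duel, one three-player body-duel.
sample = [Duel("stick", 2), Duel("body", 3)]
intensity = game_intensity(sample, 60.0)
```

If an index like this were computed for every game, referee statistics such as PE/60 could then be compared within bands of similar intensity rather than across all games at once.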
Setting Up the Project
- Identify a competition(s) with an appropriate sample-size that would allow correlations to be more clearly drawn. I’m thinking the Senior and U18 Women’s World Championship would be ideal. That would give me a total of 51 games (22 U18, 29 Senior) across two competitions through which to draw conclusions. These would also work because they would be professionally broadcast and video could be analyzed from multiple angles. By comparison, a domestic competition like Canada Winter Games or a U18 National Championship is often webcast with a single camera, which would not allow for accurate judgements.
- Recruit a panel of experts to judge duels. Again, I’m not sure what the appropriate size for a panel would be; 3 people independently scoring each duel? My initial observations identified an average of 92 duels per game. If we’re talking about 51 games, that’s somewhere in the neighbourhood of 4000-5000 duels to judge and each one needs to be judged by multiple people. I would definitely need some kind of funding to pay people with expertise to participate. So that’s something to consider.
- Identify appropriate software for viewing and cataloguing these duels. I have no idea what similar studies have used but I would need something that is capable of storing this data, as well as the panel’s judgements, without allowing panel members to see one another’s judgements. I suppose that could also be done manually but that transfers the risk of inaccurate record-keeping to the members of the panel (and myself) and would require manual analysis of the data once compiled, which may not be the best use of time. As with recruiting the panel, this one may come down to a question of time and money.
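To get a feel for the scale of the judging task and how independent judgements might be combined, here is a rough Python sketch. The 51 games and 92 duels per game come from the figures above; the panel size of three and the simple majority rule are assumptions, not decisions I have made, and `consensus` is a hypothetical helper name.

```python
from collections import Counter

GAMES = 51           # 22 U18 + 29 Senior
DUELS_PER_GAME = 92  # average from my initial observations
PANEL_SIZE = 3       # assumption: three independent judges per duel

total_duels = GAMES * DUELS_PER_GAME         # duels to catalogue
total_judgements = total_duels * PANEL_SIZE  # individual judgements to collect


def consensus(votes: list[str]) -> tuple[str, float]:
    """Return the most common judgement and the share of the panel agreeing.

    With an odd panel and two possible labels there is always a majority.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Hypothetical duel judged by three panel members working independently.
label, agreement = consensus(["correct", "correct", "incorrect"])
```

At these numbers the project needs roughly 4,700 duels catalogued and about 14,000 individual judgements, which is why the software and funding questions matter. The agreement share also gives a first, crude handle on validity: duels where the panel splits could be flagged for review.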
What do you think? Are there any implications or big questions I’m missing as I look to move forward with this project?