Update 7b: Reflection on Design and Evaluation Process

After completing Milestones I to IV, we reflected on our design and evaluation process and noted several points:

  • Our design and evaluation focused on one aspect of a collaborative video platform rather than the entire system. The original vision of our design was to create a collaborative video platform that helps users watch educational videos and seek useful guidance when a concept is unclear. At times, it was difficult to see the relationship between the low-level details of our design and the high-level design goals. For example, timing users as they completed the video task could show us whether interface type affects how users complete tasks in a video. Similarly, comment search time could be considered a metric of how easy it is for users to seek help. In hindsight, we could have focused on testing different aspects of the system, which would have helped reinforce the relationship between the low-level details and the high-level design goals.
  • When developing the medium-fidelity prototype, we felt restricted by our limited ability to implement it in JavaScript and HTML. Specifically, we wanted the segments to be labelled on the video itself, which we could not implement due to our limited experience with the YouTube player API and time constraints. The visual appearance and graphic design of the website also suffered due to our lack of experience designing websites.
  • One thing that surprised us when conducting the study was how greatly users’ abilities varied. Although piloting indicated that the difficulty of the tasks was reasonable, during the actual experiment some participants struggled greatly with the task while others found it easy. As a result, variation in user performance should be considered more carefully when designing a study.
  • Lastly, we could have performed another field study to see how people would interact with our segments and annotation system. We found that when designing our experiment, we had to make many assumptions about how annotations would be used and the type of content that they would hold. Having a better idea of this would have allowed us to develop a stronger and more valid experiment.

Update 7a: Final Conclusions and Recommendations

We were unable to find a statistically significant difference in overall task completion time or comment/annotation search time between System Blue (our developed system) and System Red (YouTube). However, 6 out of 8 participants agreed that it was easier to complete tasks on System Blue than on System Red, and that they would be more likely to use System Blue than System Red for watching educational videos. 6 out of 8 participants also indicated that they liked having the video divided into smaller segments for navigation. In short, although no statistically significant timing differences were found, a majority of participants preferred System Blue’s annotation and video segmentation system.

The results of our experiment suggest that the overall approach of our system is valid. However, several recommendations can be made to further improve both the design of our system and the experiment. Firstly, we could integrate segments into the video playback bar by adding markers, instead of only having hyperlinks under the video. This would make our segment design more visually salient and possibly affect the way that users interact with it. Secondly, we could focus more on learning about the types of interactions users would have with annotations to inform our interface design. It could also be useful to determine which types of annotations people post, as well as what they find useful for their video completion goals. Thirdly, since our experiment only examined one type of video (tutorials), it could be useful to include a wider range of video types, such as informational videos. Lastly, we would recommend that the experiment be conducted again with some minor changes. In particular, the videos used in the experiment should be changed: the current videos proved to be too challenging for some users and resulted in many users not completing the prescribed tasks in time.
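As a rough illustration of the playback-bar marker recommendation (not something we implemented), segment markers could be overlaid on a custom progress bar when the segment start times and total video duration are known. The element ID, class name, and example times below are our own assumptions:

```javascript
// Hypothetical sketch: overlay segment markers on a custom progress bar.
// Assumes a <div id="progress-bar"> styled with position: relative and a
// .segment-marker CSS class styled with position: absolute.
const segmentStarts = [0, 95, 210, 340]; // example segment start times (seconds)
const videoDuration = 480;               // example total video length (seconds)

function renderSegmentMarkers(barElement) {
  segmentStarts.forEach((startTime) => {
    const marker = document.createElement('div');
    marker.className = 'segment-marker';
    // Position each marker proportionally along the bar.
    marker.style.left = `${(startTime / videoDuration) * 100}%`;
    barElement.appendChild(marker);
  });
}

renderSegmentMarkers(document.getElementById('progress-bar'));
```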

Update 6c: Revised Supplementary Experiment Materials

We added a script so that our studies could be conducted in a consistent manner.

Script

Minor changes were made to the consent form to fit the context of our study, such as adjusting the time the experiment should take and what the participant should expect to do.

MedFiconsentform

We made separate questionnaires for each of our interfaces, as well as a post-study questionnaire. Questions were also reworded for clarity.

GulfOEQuestionnaire

We compiled a list of all the comments on both videos for the participants to use during the comment-search tasks.

ListofComments

Update 6b: Experiment Abstract

For our study, we compared two interfaces, System Red (YouTube) and System Blue (our interface), to observe how users completed tutorial videos and searched for comments or annotations in each setting. We recruited 8 participants and had each person complete a tutorial video on each interface, followed by questionnaire questions. We wanted to determine which interface helped users finish an entire task and find comments or annotations faster.

After running our study, we found that over half of the participants failed to complete at least one of the two tutorials. Our results cannot confirm that finding annotations is faster on System Blue than on System Red, although there is a trend in the data toward System Blue being faster. Despite this, our qualitative results showed that the majority of participants (6/8) preferred System Blue for watching educational videos.


Update 6a: Pilot Test

As a result of our pilot test, we noted the following issues that needed to be addressed in our experiment protocol:

  • We should counterbalance which type of annotation the participant is asked to find first, having some participants find the low visibility annotation first and others find the high visibility annotation first. We found that there was a significant learning effect depending on the order in which the participant was asked to do these tasks. This doubles the number of conditions from four to eight.
  • We decided to prepare a printout of the annotations that the participants need to find, to be shown to each participant. This clarifies the task for the participant.
  • We wanted to have a copy of each interface loaded with a dummy tutorial video that the participant can use to familiarize themselves with both interfaces before starting the experiment.
  • We decided to have a time limit of 15 minutes to prevent the study from running too long if the participant gets stuck on a tutorial.
  • Some of the wording on the questionnaire needed to be changed. Specifically, we were asking users for their “comments/opinions”, which was confusing since we often used the term “comments” in our experiment with a different meaning.

Update 5a: Rationale of Medium Fidelity Prototyping Approach

Our prototype was built using HTML and JavaScript. Originally, our group decided to use Axure as our main prototyping tool, but changed our decision when we realized there would be too many limitations using Axure alone, especially regarding video functionality. When interacting with the prototype, users can add and scroll through both comments and timestamped annotations, watch an embedded video with standard YouTube player controls, and select hyperlinked video segments to traverse different portions of the tutorial.

We learned from our field study that these functions were often used by participants while watching an educational video. Therefore, they would be important to include to thoroughly test Task 1 (completing an entire task in a video). The prototype also contains all the functionality we plan to test for Tasks 2 and 3 (finding specific annotations).

There are both vertical and horizontal aspects of our design. Comments and timestamped annotations can be added with the prototype. However, adding a timestamped annotation does not actually capture the current playback time of the video, making this functionality horizontal. We decided that this is acceptable for our design, since adding comments and annotations is no longer one of the tasks in our experiment (although we initially planned to test the process of adding comments/annotations, we decided not to pursue this due to time constraints and experiment complexity). Furthermore, the video segments have been decided in advance by our team. They are not generated based on users’ access patterns and video cues (e.g. screen transitions, long pauses), as they ideally would be in a fully functioning design. We also decided that it was important for video annotations to update automatically when a new segment is reached. We wanted users to see that annotations are specific to each segment and to observe whether their interaction with these segments affected their ability to complete the tasks in the experiment.
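For reference, the sketch below shows one way the hyperlinked segments and per-segment annotation updates can be wired up with the YouTube IFrame Player API. The segment data, element IDs, and the showAnnotationsForSegment helper are illustrative assumptions rather than our exact implementation:

```javascript
// Illustrative sketch: hyperlinked segments seek the embedded YouTube player,
// and the annotation panel refreshes whenever the current segment changes.
// Assumes the IFrame API is loaded via
// <script src="https://www.youtube.com/iframe_api"></script>
// and that a <div id="player"> exists on the page.
const segments = [                       // predetermined by the team
  { label: 'Intro',       start: 0 },
  { label: 'First loops', start: 90 },
  { label: 'Tightening',  start: 250 },
];

let player;
let currentSegment = -1;

function onYouTubeIframeAPIReady() {     // called automatically by the IFrame API
  player = new YT.Player('player', {
    videoId: 'scU4wbNrDHg',              // Video 1 from the experiment
    events: { onReady: startSegmentWatcher },
  });
}

function seekToSegment(index) {          // bound to the segment hyperlinks
  player.seekTo(segments[index].start, true);
}

function startSegmentWatcher() {
  // Poll the playback time and refresh annotations on segment changes.
  setInterval(() => {
    const time = player.getCurrentTime();
    let index = 0;
    segments.forEach((seg, i) => { if (time >= seg.start) index = i; });
    if (index !== currentSegment) {
      currentSegment = index;
      showAnnotationsForSegment(index);  // hypothetical rendering helper
    }
  }, 500);
}
```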

It was also important that our prototype had a somewhat professional appearance. Since users will be tested using both our interface and YouTube, we did not want them to feel that our design was less serious or professional, and judge it based on this fact alone.

Update 4b: Experiment Design

Participants:

We are planning to draw our participants from the general population. The inclusion criteria are as follows:

  • must be able to use a computer and follow along with an educational video.
  • must not be an expert in the video tutorial subjects used in the study.

We plan to recruit participants through word of mouth (convenience sampling) and a call for participants at UBC. We expect to recruit and run the experiment with 8 participants, to match our 4-combination counterbalancing.

Conditions:

In this experiment, we will compare a user’s performance on our prototype versus the user’s performance using YouTube. We will examine how quickly users can perform tasks on both interfaces. This includes how quickly users can find annotations (a timestamped user remark) as well as complete an entire task described in a video tutorial. In addition, we will be looking at a user’s willingness to use each system and their preference.

Tasks:

On each video and interface, participants will be asked to perform the following tasks in the given order:

  1. Complete entire task described in video
  2. Find a specific high visibility annotation
  3. Find a specific low visibility annotation
  4. Complete questionnaire including Likert scales indicating their preferences

Design:

To test the speed of finding annotations, we will use a 2×2 (annotation visibility × interface type) within-subjects factorial design. We will have 2 levels of annotation visibility: high and low. We will also have 2 levels of interface type: YouTube (System Red to participants) and our interface (System Blue). Regarding annotation visibility, high means that the annotation is immediately visible in the annotation list on our system without scrolling, while low means that the annotation can only be found after scrolling through the list. To test the time taken to complete an entire task described in a video, we will use a t-test to compare the two interface types.

We will use a counterbalancing method to eliminate order effects. Participants will interact with both interfaces using two different videos. For example, a user might be assigned to the first video on our system followed by the second video on YouTube. The four possible combinations are displayed in the table below:

Table 1: Counterbalancing method for our experiment

Combination | First Scenario      | Second Scenario
1           | YouTube, Video 1    | Our System, Video 2
2           | YouTube, Video 2    | Our System, Video 1
3           | Our System, Video 1 | YouTube, Video 2
4           | Our System, Video 2 | YouTube, Video 1
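For bookkeeping during recruitment, assigning participants to these combinations could be scripted along the following lines (a sketch only; the participant numbering and rotation scheme are our own assumptions):

```javascript
// Illustrative sketch: assign participants to the four counterbalanced
// combinations in rotation, so each combination is used by 2 of 8 participants.
const combinations = [
  { first: 'YouTube, Video 1',    second: 'Our System, Video 2' },
  { first: 'YouTube, Video 2',    second: 'Our System, Video 1' },
  { first: 'Our System, Video 1', second: 'YouTube, Video 2' },
  { first: 'Our System, Video 2', second: 'YouTube, Video 1' },
];

function assignCombination(participantNumber) {  // participants numbered 1..8
  return combinations[(participantNumber - 1) % combinations.length];
}
```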

We plan to counterbalance this way because a user cannot watch the same video twice due to learning effects: after completing a tutorial, a user would become familiar with the steps and anticipate what should be done next, biasing our results. Thus, we are using two different videos about knot tying, chosen based on:

  • Similarity of content. We wanted the videos to be similar enough in content and length to be comparable without introducing video as a factor, but different enough to eliminate significant learning effects.
  • Number of comments. The videos should have a similar number of comments.
  • How easily segmentable they are. We want to use videos that have logically segmentable sections, so that the segments can be potentially useful for users.
  • How likely it is for participants to already know how to complete the prescribed task. We want to use tutorial videos that users will probably be unfamiliar with, to address the potential confounding factor of participant expertise.
  • How complicated the video is. We want the users to be able to complete the tutorial without struggling greatly, but also make sure the task to be completed is non-trivial.
  • How lengthy the video is. Since participants will be watching two videos, we do not want to bore them with long videos.

The videos chosen are as follows:

(Video 1) How to Tie the Celtic Tree of Life Knot by TIAT: https://www.youtube.com/watch?v=scU4wbNrDHg

(Video 2) How to Tie a Big Celtic Heart Knot by TIAT: https://www.youtube.com/watch?v=tfPTJdCKzVw

For the YouTube interface, the participant will be directed to the corresponding video hosted on YouTube. For our developed interface, the participant will interact with the interface on a local machine. The comments (non-timestamped remarks) and annotations for our developed system will be imported from the same video hosted on YouTube. Each imported remark will be randomly assigned in our system to be either a comment or an annotation, with a 50% chance of each. We decided on 50% since there is no precedent for a system like this that could provide more accurate data on how comments and annotations would be distributed. Similarly, we assume that annotation timestamps are uniformly distributed across the length of the video. To make a fair comparison between the two interfaces, all comments will be sorted from most recent to least recent. For our developed interface, the video will be segmented manually beforehand, based on places where the video pauses visually or audibly for more than one second or where the video transitions in some way (e.g. a screen transition).
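The sketch below illustrates how the random comment/annotation split and the uniform-timestamp assumption could be applied to the imported remarks (the data shape and function name are hypothetical; the actual import is prepared ahead of the sessions):

```javascript
// Sketch of splitting imported YouTube remarks into comments and timestamped
// annotations for our system. Assumes each remark has text, author, and a
// numeric postedAt value (e.g. a Unix timestamp) used for newest-first sorting.
function importRemarks(youtubeRemarks, videoDuration) {
  return youtubeRemarks
    .map((remark) => {
      const isAnnotation = Math.random() < 0.5;  // 50% chance of each type
      return {
        text: remark.text,
        author: remark.author,
        postedAt: remark.postedAt,
        type: isAnnotation ? 'annotation' : 'comment',
        // Annotation timestamps assumed uniformly distributed across the video.
        timestamp: isAnnotation ? Math.random() * videoDuration : null,
      };
    })
    .sort((a, b) => b.postedAt - a.postedAt);    // most recent to least recent
}
```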

Procedure:

  1. Participants will be informed of the goals of the research experiment and consent will be reviewed.
  2. Participants will be asked about their familiarity with the tutorial subject to exclude participants who are experts in the subject.
  3. On the first interface and video combination as determined by counterbalancing, the participant will be asked to perform the experimental tasks in the order given. Tasks 1 to 3 will be timed and recorded by the researcher using a stopwatch. The researcher will also take any relevant notes as the participant is doing each task, especially when the participant interacts with the segments feature in our developed interface. To determine when a task has been completed for timing purposes, the participants will be asked to indicate to the researcher when they have completed the task. The task will only be accepted once it has been completed correctly.
  4. Repeat step 3 using the other interface and video combination.
  5. Have the participants fill in a questionnaire regarding their demographic information and their preferences regarding the two interfaces that they used.
  6. Ask if the participants have any remaining questions before concluding the study.

Apparatus:

  • We plan on conducting this experiment in a quiet environment, such as ICICS x360.
  • The participant will use one of the computers available in the lab room to perform the tasks required while the researcher observes from the side.
  • Cell phone timers will be used to record the time it takes a participant to complete all tasks.
  • During the experiment, notes will be taken by team members on either a laptop or on paper.

Hypotheses:

Speed:
H1. Finding a specified annotation is faster using our system compared to YouTube for high visibility annotations.
H2. Finding a specified annotation is no slower using our system compared to YouTube for low visibility annotations.
H3. Completing an entire task prescribed in a video is no slower on our system compared to YouTube.

User Preference:
H4. Users will prefer our system’s comment and annotation system over YouTube’s.
H5. Users will not have a preference towards either system overall.

Priority of Hypotheses:

  • H4 and H3 are most important since they have more direct and tangible implications for design; H3 is concerned with overall usefulness of the system and H4 is concerned with users’ willingness to use the system.
  • H1 is important since it tests one of the big potential advantages of our system; however, it is in a more limited scope and applicability than H3.
  • H2 is reasonably important since it examines the potential tradeoff of having annotations and comments in separate sections.
  • H5 is least important since it is dependent on a comparison of a fully functional interface and a still-in-development interface. At this stage, it would be beneficial to get a sense of users’ overall opinion, but it is important to recognize that this may change as our interface is developed.

Planned Analysis:

For our statistical analysis, we will use a 2-factor ANOVA (2 interface types × 2 annotation visibilities) on the time it takes participants to find specific annotations in our system compared to YouTube’s interface. A two-tailed paired t-test will also be used to compare the completion time of an entire task between the two interfaces. To measure users’ preference of interface type, descriptive statistics of the Likert scale data will be collected from each participant’s questionnaire.
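As a concrete sketch of the paired t-test portion of this analysis (the function is illustrative only; the 2-factor ANOVA would be run in a statistics package):

```javascript
// Minimal sketch of a two-tailed paired t-test on task completion times,
// given one [ourSystem, youtube] pair of times (in seconds) per participant.
function pairedTTest(pairs) {
  const diffs = pairs.map(([a, b]) => a - b);
  const n = diffs.length;
  const mean = diffs.reduce((sum, d) => sum + d, 0) / n;
  const variance = diffs.reduce((sum, d) => sum + (d - mean) ** 2, 0) / (n - 1);
  const t = mean / Math.sqrt(variance / n);
  return { t, degreesOfFreedom: n - 1 };
}

// With 8 participants (df = 7), |t| would need to exceed roughly 2.36 to be
// significant at the two-tailed 0.05 level.
```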

Expected Limitations:

There are various issues that we expect to be limitations in our experiment, including:

  • The type of video we are testing. Since we are only using one specific type of educational video in our experiment, we may miss out on some interactions users have with different types of video.
  • Breadth of Comparison. Our experiment will only test our system against YouTube. We are not accounting for differences that may exist between our system and other popular video platforms, such as Khan Academy.
  • Comment/Annotation placement. The way that “existing” comments/annotations are placed in our system is predetermined. We are assuming an equal chance for a user to post an annotation or comment, and that annotations will be uniformly distributed by time. Since there is no precedent for a similar system, we cannot determine the validity of this assumption.
  • Video Segmentation. We are segmenting the videos based on our own judgement whereas the fully functional system would automatically segment based on user input. This may limit the validity of the video segments that have been chosen.