Track 1

For each video clip, the dataset annotations include the start and end timestamps and the action performed on a target, described as a verb-noun pair. Medical professionals provided the timestamps and actions for each procedural step. To minimize errors in timestamping and video segmentation, the annotations were partially reviewed as part of a peer review process. Three medical professionals work on the actions in each procedure: one annotates, and the other two review the resulting annotations.
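As a rough illustration of the annotation structure described above, the sketch below models one annotated clip as a record with a time interval and a verb-noun pair, and runs the kind of basic consistency check a reviewer might apply. The class name, field names, units (seconds), and the example values are assumptions made for illustration; they are not the dataset's actual schema or file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionSegment:
    """One annotated clip (hypothetical field names, times assumed in seconds)."""
    start_s: float
    end_s: float
    verb: str
    noun: str


def find_annotation_errors(segments: List[ActionSegment]) -> List[str]:
    """Return human-readable problems found in a list of segments."""
    errors = []
    for i, seg in enumerate(segments):
        # A clip must start at a non-negative time and end after it starts.
        if seg.start_s < 0 or seg.end_s <= seg.start_s:
            errors.append(f"segment {i}: invalid interval [{seg.start_s}, {seg.end_s}]")
        # The action must be a complete verb-noun pair.
        if not seg.verb.strip() or not seg.noun.strip():
            errors.append(f"segment {i}: missing verb or noun")
    return errors


# Made-up example values, not taken from the dataset.
good = ActionSegment(start_s=12.0, end_s=18.5, verb="apply", noun="bandage")
bad = ActionSegment(start_s=20.0, end_s=19.0, verb="apply", noun="")
print(find_annotation_errors([good, bad]))
```

A check like this only catches mechanical mistakes; whether a timestamp or action is medically correct still depends on the two reviewers.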

Track 2

The dataset annotations include, at the per-frame level, the questions, all possible answers, and the most relevant answers. Our researchers provided the VQA annotations. To minimize errors, the questions and answers were partially reviewed as part of a peer review process: one person annotates, and two other people review the resulting annotations.
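The per-frame VQA annotation described above can be pictured as a small record, sketched below. The class name, field names, and example values are hypothetical, and the check assumes that the most relevant answers are drawn from the list of possible answers, which the description suggests but does not state outright.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameQA:
    """One question attached to one frame (hypothetical field names)."""
    frame_index: int
    question: str
    possible_answers: List[str]
    relevant_answers: List[str]


def unlisted_relevant_answers(item: FrameQA) -> List[str]:
    """Return relevant answers that do not appear among the possible answers.

    Assumption: relevant answers should be a subset of the possible answers.
    """
    allowed = set(item.possible_answers)
    return [a for a in item.relevant_answers if a not in allowed]


# Made-up example values, not taken from the dataset.
item = FrameQA(
    frame_index=42,
    question="What is the hand holding?",
    possible_answers=["gauze", "scissors", "nothing"],
    relevant_answers=["gauze", "tape"],
)
print(unlisted_relevant_answers(item))
```

An empty result means the record is internally consistent under that assumption; any returned answer would be flagged for the reviewers.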