Track 1

o Task 1: Action Recognition: Given a video segment, the task is to classify actions. 

o Task 2: Action Anticipation: Given a video segment, the task is to predict the class of the next action. 

o Task 3: Procedure Recognition: Given a video segment, the task is to classify medical procedures. 

Track 2

o Task 1: Visual Question Answering: Given an image and a natural language question about the image, the task is to provide a natural language answer.


Evaluation of Tasks in Track 1

For Track 1, we evaluate the clips in the validation and test sets using class-agnostic accuracy: the number of correctly predicted action clips divided by the total number of action clips in the set. For Tasks 1 and 2 of Track 1, we evaluate top-1 and top-5 accuracy for the verb, the noun, and the action (verb + noun).
• For Task 1 of Track 1, we report top-1/5 action, verb, and noun accuracy on the test sets.
• For Task 2 of Track 1, we report top-1/5 action, verb, and noun accuracy on the test sets.
• For Task 3 of Track 1, we report top-1 accuracy on the test sets.
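
The top-k accuracy described above can be sketched in a few lines of Python. This is an illustrative sketch, not the official evaluation script: the function name `top_k_accuracy`, the input layout (one list of per-class scores per clip, with integer class labels), and the toy numbers are all assumptions made for the example. The same function would apply to verb, noun, or action labels, provided each is encoded as a class index.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of clips whose true class is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one clip is required")

    correct = 0
    for clip_scores, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score for this clip.
        ranked = sorted(
            range(len(clip_scores)), key=lambda c: clip_scores[c], reverse=True
        )
        if label in ranked[:k]:
            correct += 1
    # Correctly predicted clips divided by the total number of clips.
    return correct / len(labels)


# Toy example (made-up numbers): 3 clips, 4 hypothetical verb classes.
verb_scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.3, 0.1, 0.1, 0.5],
]
verb_labels = [1, 2, 0]

top1 = top_k_accuracy(verb_scores, verb_labels, k=1)
top2 = top_k_accuracy(verb_scores, verb_labels, k=2)
print(f"top-1: {top1:.3f}, top-2: {top2:.3f}")
```

In the toy data only the first clip has its true class ranked first, while all three have it within the top two. The challenge itself reports top-1 and top-5; k=2 is used here only because the example has four classes.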


Evaluation of the Task in Track 2

For Track 2, algorithm performance is measured by the average accuracy and the BLEU score, which together determine the winner.
• For Task 1 of Track 2, we report the average accuracy and BLEU score.
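
As a rough illustration of the two Track 2 metrics, the sketch below computes an exact-match accuracy and a simplified sentence-level BLEU averaged over answers. The text above does not state how answers are matched or which BLEU variant is used, so everything here is an assumption: the case-insensitive exact match, the whitespace tokenization, the default unigram order (`max_n=1`), the averaging of per-answer scores, and the names `exact_match_accuracy` and `sentence_bleu`. The example answers are invented.

```python
import math
from collections import Counter
from typing import List


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Share of answers equal to the reference after trimming and lowercasing."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equally sized, non-empty answer lists")
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def sentence_bleu(prediction: str, reference: str, max_n: int = 1) -> float:
    """Simplified BLEU: clipped n-gram precision (orders 1..max_n) with brevity penalty."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        total = sum(pred_ngrams.values())
        # Each predicted n-gram counts at most as often as it occurs in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in pred_ngrams.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(clipped / total)

    # Penalize predictions that are shorter than the reference.
    if len(pred) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(pred))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


# Invented example answers.
predictions = ["forceps", "the liver is retracted"]
references = ["forceps", "the liver is being retracted"]

accuracy = exact_match_accuracy(predictions, references)
bleu_scores = [sentence_bleu(p, r) for p, r in zip(predictions, references)]
mean_bleu = sum(bleu_scores) / len(bleu_scores)
print(f"accuracy: {accuracy:.3f}, mean BLEU-1: {mean_bleu:.3f}")
```

The second answer fails the exact match but still scores well on BLEU, since every predicted word appears in the reference and only the brevity penalty applies. That is one reason to report both metrics for free-form answers.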