2024.10.14

ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models