Video demo?

Love the project, and looks super interesting. Works almost too well (very surprisingly) given how noisy dense labeling data is. I'm surprised the model can still infer!

I was looking through the paper, code and demo, but it wasn't clear to me how the video summarization worked. Can you elaborate? Do you break down the video frame by frame and treat it as an image problem?