misalign to be penalized in the loss
regression for the temporal stamps of the event that corresponds to a sentence description.
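A rough sketch of how a loss with these two parts could look: one term penalizes misaligned text/event pairs, the other regresses the timestamps. The notes do not give the exact form, so the logistic alignment term, the L1 regression term, and the names `alignment_loss`, `regression_loss`, `total_loss` and `reg_weight` are all assumptions for illustration.

```python
import math


def alignment_loss(score, is_match):
    """Logistic loss on an alignment score (assumed form).

    Matched pairs are pushed toward positive scores,
    misaligned pairs toward negative scores.
    """
    sign = 1.0 if is_match else -1.0
    z = -sign * score
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def regression_loss(pred, target):
    """L1 distance between predicted and true (start, end) stamps (assumed form)."""
    return abs(pred[0] - target[0]) + abs(pred[1] - target[1])


def total_loss(score, is_match, pred, target, reg_weight=1.0):
    """Alignment term for every pair, regression term only for matched pairs."""
    loss = alignment_loss(score, is_match)
    if is_match:
        loss += reg_weight * regression_loss(pred, target)
    return loss
```

Here the regression term is applied only to matched pairs, since a mismatched sentence has no meaningful target timestamps; that is a design choice of this sketch, not something stated in the notes.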
I think this task can still be categorized as event detection/retrieval.
A video embedding is learned using a 3D CNN and pooling.
A text embedding is learned using word2vec or the ***graph technique.
The fusion is done with FC layers, with the loss designed as mentioned above - it tries to align the text with the event.
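To make the data flow concrete, here is a minimal pure-Python sketch of pooling per-clip features and fusing them with a sentence vector through FC layers. It is only an illustration under assumptions: the random vectors stand in for real 3D CNN and word2vec outputs, mean pooling and concatenation are assumed choices, and the helper names (`mean_pool`, `fc`, `init_layer`, `fuse`) and layer sizes are hypothetical.

```python
import random


def mean_pool(clip_features):
    """Average a list of per-clip feature vectors into one video vector."""
    n = len(clip_features)
    return [sum(col) / n for col in zip(*clip_features)]


def fc(x, weights, bias, relu=True):
    """Fully connected layer; `weights` holds one row per output unit."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, bias)]
    return [max(v, 0.0) for v in out] if relu else out


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse(video_vec, text_vec, hidden, head):
    """Concatenate both embeddings and map them to (score, start, end)."""
    h = fc(video_vec + text_vec, *hidden)
    return fc(h, *head, relu=False)


rng = random.Random(0)
clips = [[rng.random() for _ in range(4)] for _ in range(3)]  # stand-in for 3D CNN outputs
sentence = [rng.random() for _ in range(3)]                   # stand-in for a sentence vector
hidden = init_layer(4 + 3, 8, rng)
head = init_layer(8, 3, rng)
score, start, end = fuse(mean_pool(clips), sentence, hidden, head)
```

The three outputs correspond to an alignment score plus the two regressed timestamps, which is where an alignment-plus-regression loss would be attached during training.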