What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision
Procedural knowledge, extracted from multiple modalities
alignment
HMM (instructional step <-> speech signal)
Data collection and preprocessing
Search on YouTube, and add the content from the expanded links
Sentence classification: naive Bayes (recipe step, recipe ingredient, background)
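A minimal sketch of the sentence classification step, assuming a multinomial naive Bayes with add-one smoothing over whitespace tokens. The training sentences in `TRAIN` and the helper names `train_nb` and `classify` are made up for illustration; the notes do not say which features or smoothing the paper uses.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data (not from the paper).
TRAIN = [
    ("2 cups flour", "ingredient"),
    ("1 tbsp olive oil", "ingredient"),
    ("3 eggs", "ingredient"),
    ("mix the flour and eggs", "step"),
    ("heat the oil in a pan", "step"),
    ("bake for 20 minutes", "step"),
    ("thanks for watching my channel", "background"),
    ("please subscribe and leave a comment", "background"),
]


def train_nb(examples):
    """Count sentences per class and tokens per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for tok in text.lower().split():
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # tokens never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb(TRAIN)
print(classify("heat the pan", model))  # step
```

With real data the three classes would be trained on labeled text from the recipe pages and video descriptions; the toy set here only shows the mechanics.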
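Parse: POS tagging, entity chunking, constituency parsing; the classification tree node must be a verb (V)
(Euclidean distance -> distance between words) stem; if no explicit entity is found, heuristically look in the preceding sentence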
speech transcript
ASR system
factored HMM (recipe step <-> ASR words), keyword confidence
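A rough sketch of the alignment idea, assuming a plain left-to-right HMM decoded with Viterbi rather than the factored HMM noted above. Hidden states are recipe steps, observations are ASR words, and the emission score only checks whether the word occurs in the step text. The function name `align_steps_to_words` and the parameters `p_stay` and `p_match` are hypothetical, and keyword confidence is not modeled.

```python
import math


def align_steps_to_words(steps, asr_words, p_stay=0.7, p_match=0.8):
    """Return, for each ASR word, the index of the recipe step it aligns to."""
    if not steps or not asr_words:
        return []
    step_vocab = [set(s.lower().split()) for s in steps]
    n_steps, n_words = len(steps), len(asr_words)
    log_stay, log_next = math.log(p_stay), math.log(1.0 - p_stay)
    log_hit, log_miss = math.log(p_match), math.log(1.0 - p_match)

    def emit(step, word):
        # Simplified emission score: word appears in the step text or not.
        return log_hit if word.lower() in step_vocab[step] else log_miss

    neg_inf = float("-inf")
    score = [[neg_inf] * n_steps for _ in range(n_words)]
    back = [[0] * n_steps for _ in range(n_words)]
    score[0][0] = emit(0, asr_words[0])  # the video starts at the first step

    for t in range(1, n_words):
        for s in range(n_steps):
            stay = score[t - 1][s] + log_stay
            move = score[t - 1][s - 1] + log_next if s > 0 else neg_inf
            # Steps are visited in order; ties are broken in favor of staying.
            if move > stay:
                best, prev = move, s - 1
            else:
                best, prev = stay, s
            score[t][s] = best + emit(s, asr_words[t])
            back[t][s] = prev

    # Backtrace from the best-scoring final state.
    s = max(range(n_steps), key=lambda k: score[-1][k])
    path = [s]
    for t in range(n_words - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()
    return path


steps = ["chop the onions", "fry them in oil", "add salt"]
words = "chop onions fry in oil add salt".split()
print(align_steps_to_words(steps, words))  # [0, 0, 1, 1, 1, 2, 2]
```

The monotonic transition structure (stay or advance by one) encodes the assumption that the speaker follows the recipe order; a real transcript has many filler words, which is where a background state or keyword confidence would matter.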
visual detectors, CNN classification, find the direct object