This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).

Films are divided into individual scenes, each a self-contained series of shots that may contain dialogue, visual action, and more. Being able to programmatically identify specific scenes is key to turning a film into structured data. We attempt to identify the start and end frames of individual scenes by using Keras’ VGG16 image model to group similar frames (images) into clusters known as shots. Then an original algorithm, rooted in film editing expertise, is applied to partition the frames into individual scenes.

Our goal is, given a set of input frames, to identify the start frame and end frame of individual scenes. (The process is completely unsupervised, but for the purposes of explanation, I’ll comment on our progress and provide visualizations.) In this example, 400 frames, one taken every second from The Hustle (2019), are fed into the algorithm. Keras’ VGG16 image model is used to vectorize these images, and then unsupervised hierarchical agglomerative clustering (HAC) is applied to group similar frames into clusters. Frames with equal cluster values are similar, so a set of three consecutive frames with the same cluster value could represent a three-second shot of a character.
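To make the idea of "consecutive frames with the same cluster value" concrete, here is a minimal sketch that collapses a list of per-frame cluster labels into shots. The function name `frames_to_shots` and the sample labels are made up for illustration; the real labels would come from the VGG16 and clustering step, which is not reproduced here.

```python
from itertools import groupby


def frames_to_shots(labels):
    """Collapse per-frame cluster labels into (cluster, first_frame, last_frame) runs."""
    shots = []
    index = 0
    for cluster, run in groupby(labels):
        length = len(list(run))
        shots.append((cluster, index, index + length - 1))
        index += length
    return shots


# Made-up labels: one cluster value per frame, one frame per second.
labels = [7, 7, 7, 2, 2, 7, 7, 2, 5]
shots = frames_to_shots(labels)
for cluster, first, last in shots:
    print(f"cluster {cluster}: frames {first}-{last}")
```

With one frame per second, the first run above would correspond to a three-second shot.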

Here is the clustering of our sample 400 frames from The Hustle.

Clustering of 400 frames from “The Hustle”

Target visualization


In this example, we have two partial scenes and two complete scenes. Our goal is to identify the boundaries of each scene; here, we’ll try to identify the boundaries of the blue scene. I’ve colored in this visualization manually to illustrate our “target”.

Manual annotation of the 400 frames, divided into scenes

Five-step algorithm

Step 1: Finding the A/B/A/B Shot Pattern

Among all 400 frames, we look for any pairs of shots that form an A/B/A/B pattern.
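One way this search could look in code is sketched below. It assumes that an A/B/A/B pattern means four consecutive shots whose clusters alternate between two values; the function name `find_abab_patterns` and the sample labels are hypothetical, and the actual matching rules may be more permissive.

```python
from itertools import groupby


def find_abab_patterns(labels):
    """Return the (a, b) cluster pairs whose shots alternate A/B/A/B."""
    # Collapse consecutive equal labels so each entry stands for one shot.
    shot_clusters = [cluster for cluster, _ in groupby(labels)]
    pairs = []
    for i in range(len(shot_clusters) - 3):
        a, b, a2, b2 = shot_clusters[i:i + 4]
        already_seen = (a, b) in pairs or (b, a) in pairs
        if a == a2 and b == b2 and a != b and not already_seen:
            pairs.append((a, b))
    return pairs


# Made-up per-frame cluster labels.
labels = [4, 4, 1, 1, 1, 2, 2, 1, 2, 2, 2, 3, 1, 9, 9]
patterns = find_abab_patterns(labels)
print(patterns)
```

Working on shots rather than raw frames means a long take and a quick reaction shot count equally toward the pattern.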

A/B/A/B pattern

Step 2: Checking for MCUs


Having found four A/B/A/B patterns, we run each shot through the MCU (Medium Close-Up) image classifier. Two of the patterns are rejected because they contain a shot that doesn’t pass the MCU check. In the image below, the top shot-pair represents our example scene.
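The rejection rule can be sketched as a simple filter. The classifier itself is out of scope here, so `is_mcu` is a hypothetical stand-in backed by a made-up lookup table, and the cluster numbers are illustrative only.

```python
def filter_mcu_patterns(patterns, is_mcu):
    """Keep only the shot pairs in which every shot passes the MCU check."""
    return [pair for pair in patterns if all(is_mcu(cluster) for cluster in pair)]


# Made-up verdicts standing in for the real MCU image classifier.
mcu_verdicts = {1: True, 2: True, 3: True, 4: False, 5: True, 6: True, 7: False, 8: True}
patterns = [(1, 2), (3, 4), (5, 6), (7, 8)]
accepted = filter_mcu_patterns(patterns, lambda cluster: mcu_verdicts[cluster])
print(accepted)
```

A single failing shot is enough to reject the whole pair, which mirrors the rule described above.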

MCU Check

Step 3: Designating a Preliminary Scene Boundary: Anchor Start/End


Once we’ve confirmed that we’re looking at Medium Close-Up shots, we can reasonably believe that we’re looking at a two-character dialogue scene. We look for the first and last appearances of either shot (regardless of A or B). These frames define the Anchor Start and Anchor End Frames, a preliminary scene boundary.
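As a rough sketch, the anchor frames are the first and last frame indices whose cluster belongs to the A/B pair. The function name `find_anchor_frames` and the sample labels are assumptions made for illustration.

```python
def find_anchor_frames(labels, pair):
    """Return (anchor_start, anchor_end): first and last frames showing either shot."""
    hits = [i for i, cluster in enumerate(labels) if cluster in pair]
    if not hits:
        raise ValueError("neither shot of the pair appears in the frames")
    return hits[0], hits[-1]


# Made-up per-frame cluster labels; clusters 1 and 2 are speakers A and B.
labels = [9, 5, 1, 1, 2, 5, 1, 2, 2, 5, 5, 5, 8, 8]
anchor_start, anchor_end = find_anchor_frames(labels, (1, 2))
print(anchor_start, anchor_end)
```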

Anchor Frames: Preliminary Scene Boundaries

Step 4: Identify Cutaways


In between the Anchor Start and Anchor End are many other shots known as cutaways. These may represent any of the following:

  • POV shots, showing what characters are looking at offscreen

  • Inserts, different shots of Speaker A or B, such as a one-off close-up

  • Other characters, both silent and speaking


After we identify these cutaways, we may be able to expand the scene’s start frame backward and its end frame forward. If we see these cutaways again before the Anchor Start or after the Anchor End, we treat those frames as part of the scene as well.
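A minimal sketch of cutaway identification, assuming a cutaway is any cluster that appears between the anchor frames and is not one of the two anchor shots. The name `find_cutaways` and the sample values are hypothetical.

```python
def find_cutaways(labels, pair, anchor_start, anchor_end):
    """Return the set of clusters seen between the anchors that are not shot A or B."""
    inside = labels[anchor_start:anchor_end + 1]
    return {cluster for cluster in inside if cluster not in pair}


# Made-up per-frame cluster labels; clusters 1 and 2 are speakers A and B.
labels = [9, 5, 1, 1, 2, 5, 1, 6, 2, 2, 5, 5, 5, 8, 8]
cutaways = find_cutaways(labels, (1, 2), anchor_start=2, anchor_end=9)
print(sorted(cutaways))
```

Storing the cutaways as a set makes the later "have we seen this cluster before?" checks cheap.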

Cutaways Which Appear Between the Anchor Frames

Step 5a: Extending the Scene End


After the Anchor End are three frames with a familiar shot (cluster). Since we saw this cluster earlier, as a Cutaway, we incorporate these three frames into our scene. The following frames are unfamiliar, and are indeed not part of this scene.
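The forward extension can be sketched as a walk from the Anchor End that continues while each next frame belongs to a known cutaway cluster. Stopping at the very first unfamiliar frame is an assumption of this sketch, and `extend_scene_end` is a hypothetical name.

```python
def extend_scene_end(labels, anchor_end, cutaways):
    """Move the scene end forward while the next frame is a known cutaway."""
    end = anchor_end
    while end + 1 < len(labels) and labels[end + 1] in cutaways:
        end += 1
    return end


# Made-up labels: the anchor ends at frame 4, followed by three cutaway frames.
labels = [1, 2, 5, 1, 2, 5, 5, 5, 8, 8]
scene_end = extend_scene_end(labels, anchor_end=4, cutaways={5})
print(scene_end)
```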

Extending the Scene End, Forward

Step 5b: Extending the Scene Start


We apply this same technique to the scene’s beginning, in the opposite direction. We find many Cutaways, so we keep progressing earlier and earlier until no more Cutaways are found.
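The backward extension might look like the sketch below, which steps earlier from the Anchor Start for as long as the preceding frame is a known cutaway. As before, halting at the first unfamiliar frame is an assumption, and the name `extend_scene_start` and the sample labels are made up.

```python
def extend_scene_start(labels, anchor_start, cutaways):
    """Move the scene start backward while the previous frame is a known cutaway."""
    start = anchor_start
    while start - 1 >= 0 and labels[start - 1] in cutaways:
        start -= 1
    return start


# Made-up labels: the anchor starts at frame 6, preceded by four cutaway frames.
labels = [9, 9, 6, 5, 5, 6, 1, 2, 5, 1, 6, 2]
scene_start = extend_scene_start(labels, anchor_start=6, cutaways={5, 6})
print(scene_start)
```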

Extending the Scene Start, Backward



Below is a visualization of all frames in the scene, with the frames included in our prediction highlighted in blue and the frames not included highlighted in orange. The algorithm managed to label most frames of the scene. Although some frames were missed at the scene’s beginning, these are non-speaking introductory frames. The scene takes some time to get started, and we’ve indeed captured all frames containing dialogue, the most important criterion.

Blue Frames Were Captured by the Algorithm

