Video Shot Boundary Detection Using Gist

Gist has been shown to characterize the structure of images well while remaining robust to luminance changes and small translations. We use the gist representation to model the global appearance of the scene. Gist treats the scene as a single object that can be characterized by consistent global and local structure. This differs from prior work that characterizes a scene by the identity of the objects present in the image. Gist has been shown to perform well on scene category recognition, and it also provides a good contextual prior for object recognition. Within one shot, the color histogram may not remain consistent because objects move, appear, or disappear. Gist captures the overall texture of the background while ignoring these small changes caused by foreground objects.

To compute gist features, we resize the image to 128×128 pixels. The filters used to compute gist span 3 scales and 8 orientations. After the image is convolved with each of the 24 filters, each filter response is divided into 16 equal blocks, and the average is taken over each block. This results in a 384-dimensional (16 × 24) vector. We reduce the dimensionality of these features with PCA and retain the top 40 components as our gist representation. To represent the color content of the image, we compute global histograms with 10 uniformly spaced bins per channel in both the RGB and HSV color spaces. This adds 60 dimensions encoding color (10 bins × 3 channels × 2 color spaces). We observe that abrupt shot boundaries are marked by sharp changes in the gist and color features. However, the absolute change in the features is not consistent across videos, or even across different shot boundaries within the same video. Thus, a simple pairwise frame difference is not sufficient to detect shot boundaries.
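The feature pipeline described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the Gabor-like transfer function and its center frequencies, the 4 × 4 layout of the 16 blocks, the per-channel histogram normalization, and all function names (`gabor_bank`, `gist_descriptor`, `color_descriptor`, `pca_reduce`). The input frame is assumed to be already resized to 128×128 with values in [0, 1].

```python
import colorsys
import numpy as np


def gabor_bank(size=128, n_scales=3, n_orient=8):
    """Build 3 x 8 = 24 Gabor-like transfer functions in the Fourier domain.

    The filter shape and center frequencies are assumptions for illustration.
    """
    fy, fx = np.meshgrid(np.fft.fftfreq(size), np.fft.fftfreq(size), indexing="ij")
    radius = np.hypot(fx, fy)
    angle = np.arctan2(fy, fx)
    filters = []
    for s in range(n_scales):
        f0 = 0.3 / (2 ** s)  # assumed center frequency, halved at each scale
        for o in range(n_orient):
            theta0 = np.pi * o / n_orient
            # Angular distance to the filter orientation, wrapped to [-pi, pi].
            dtheta = np.angle(np.exp(1j * (angle - theta0)))
            filters.append(np.exp(-10.0 * (radius / f0 - 1.0) ** 2 - 2.0 * dtheta ** 2))
    return np.stack(filters)  # shape (24, size, size)


def gist_descriptor(gray, filters, grid=4):
    """Filter a grayscale image, then average each response over grid x grid blocks."""
    spectrum = np.fft.fft2(gray)
    responses = np.abs(np.fft.ifft2(spectrum[None] * filters))  # (24, H, W)
    n, h, w = responses.shape
    blocks = responses.reshape(n, grid, h // grid, grid, w // grid)
    return blocks.mean(axis=(2, 4)).ravel()  # 24 filters x 16 blocks = 384 values


def channel_histograms(img, bins=10):
    """Concatenate per-channel histograms of an (H, W, 3) image with values in [0, 1]."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0.0, 1.0))[0] for c in range(3)]
    # Normalize so that each channel's histogram sums to 1 (an assumed convention).
    return np.concatenate(hists).astype(float) / img[..., 0].size


# colorsys works on scalars; np.vectorize applies it per pixel (slow but simple).
_rgb_to_hsv = np.vectorize(colorsys.rgb_to_hsv)


def color_descriptor(rgb, bins=10):
    """RGB and HSV histograms: 10 bins x 3 channels x 2 color spaces = 60 values."""
    hsv = np.stack(_rgb_to_hsv(rgb[..., 0], rgb[..., 1], rgb[..., 2]), axis=-1)
    return np.concatenate([channel_histograms(rgb, bins), channel_histograms(hsv, bins)])


def pca_reduce(features, k=40):
    """Project row-wise feature vectors onto their top-k principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T
```

In use, one would compute `gist_descriptor` for every frame of a video, stack the 384-dimensional vectors into a matrix, and pass it to `pca_reduce` to obtain the 40-dimensional gist representation. The 60-dimensional output of `color_descriptor` would then be kept alongside it for each frame. Whether the PCA basis is fitted per video or on a separate training set is not stated in the text, so the sketch leaves that choice to the caller.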

