Update: Nov 28 2011 – The OpenCV framework has been rebuilt using opencv svn revision 7017
Hot on the heels of our last article, in which we showed you how to build an OpenCV framework for iOS, we are turning our
attention to capturing live video and processing video frames with OpenCV. This is the foundation for augmented reality, the
latest buzz topic in computer vision. The article is accompanied by a demo app that detects faces in a real-time video feed from your iOS device’s camera. You can check out the source code for the app at GitHub or
follow the direct download link at the end of the article.
As shown in our last article, OpenCV supports video capture on iOS devices using the
highgui module and its cv::VideoCapture class. Calling the read method of this class allows you to capture a single video frame and return it as a cv::Mat object for processing. However, the class is not optimized for processing live video:
- Each video frame is copied several times before being made available to your app for processing.
- You are required to ‘pull’ frames from cv::VideoCapture at a rate that you decide rather than being ‘pushed’ frames in real time as they become available.
- No video preview is supported. You are required to display frames manually in your UI.
In designing image processing apps for iOS devices we recommend that you use OpenCV for what it excels at – image processing – but use standard iOS support for accessing hardware and implementing UI. It may be a philosophical standpoint, but we find that cross-platform
layers such as OpenCV’s highgui always incur performance and design restrictions in trying to support the lowest common denominator. With that in mind, we have implemented a re-useable view controller subclass (VideoCaptureViewController)
that enables high performance processing of live video using video capture support provided by the AVFoundation framework. The controller automatically manages a video preview layer and throttles the rate at which video frames are supplied to your processing
implementation to accommodate processing load. The components of the underlying AVFoundation video capture stack are also made available to you so that you can tweak behaviour to match your exact requirements.
The Video Capture View Controller
The AVFoundation video capture stack and video preview layer are conveniently wrapped up in the VideoCaptureViewController class provided with the demo source code. This class handles creation of the video capture stack, insertion of the video preview layer into the controller’s view hierarchy and conversion of video frames to cv::Mat instances for processing with OpenCV. It also provides
convenience methods for turning the iPhone 4’s torch on and off, switching between the front and back cameras while capturing video, and displaying the current frames per second.
The details of how to set up the AVFoundation video capture stack are beyond the scope of this article and we refer you to the documentation from
Apple and the canonical application sample AVCam. If you are interested in how the stack is
created, however, then take a look at the capture stack setup code in the VideoCaptureViewController implementation. There are a number of interesting aspects of the implementation, which we will go into next.
Hardware-acceleration of grayscale capture
For many image processing applications the first processing step is to reduce the full-color BGRA data received from the video hardware to a grayscale image to maximize processing speed when color information is not required. With OpenCV, this is usually achieved
with the cv::cvtColor function, which produces a single-channel image by calculating a weighted average of the R, G and B components of the original image. VideoCaptureViewController can perform this conversion in hardware using a little trick, saving processor cycles for the more interesting parts of your image processing pipeline.
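For reference, the weighted average that cv::cvtColor computes uses the standard Rec. 601 luma coefficients (Y = 0.299R + 0.587G + 0.114B). Here is a minimal, self-contained sketch in plain C++ (no OpenCV required) of that software conversion — the per-pixel work the hardware trick below lets you skip:

```cpp
#include <cstdint>
#include <cmath>
#include <vector>

// Convert a BGRA pixel buffer (OpenCV's channel order on iOS) to a
// single-channel grayscale buffer using the Rec. 601 luma weights that
// cv::cvtColor applies: Y = 0.299*R + 0.587*G + 0.114*B.
std::vector<uint8_t> bgraToGray(const std::vector<uint8_t>& bgra, size_t pixelCount) {
    std::vector<uint8_t> gray(pixelCount);
    for (size_t i = 0; i < pixelCount; ++i) {
        const uint8_t b = bgra[4 * i + 0];
        const uint8_t g = bgra[4 * i + 1];
        const uint8_t r = bgra[4 * i + 2];
        // Alpha (bgra[4*i + 3]) is ignored.
        gray[i] = static_cast<uint8_t>(std::lround(0.299 * r + 0.587 * g + 0.114 * b));
    }
    return gray;
}
```

Doing this for every pixel of every frame is exactly the kind of per-frame cost that adds up at 30 fps, which is what motivates the hardware approach described next.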
If grayscale mode is enabled then the capture session’s video format is set to a biplanar YUV pixel format.
The video hardware will then supply YUV formatted video frames in which the Y channel contains luminance data and the color information is
encoded in the U and V chrominance channels. The luminance channel is used by the controller to create a single-channel grayscale image and the chrominance channels are ignored. Note that the video preview layer will still display the full-color video feed
whether grayscale mode is enabled or not.
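To make the trick concrete, here is a sketch of the buffer layout involved, assuming a contiguous NV12-style biplanar frame (the layout here is illustrative; in a real app you would access the planes through CoreVideo’s CVPixelBufferGetBaseAddressOfPlane rather than by offset arithmetic):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Layout of a contiguous biplanar 4:2:0 YUV (NV12-style) frame:
//   [ Y plane: width*height bytes ][ interleaved UV plane: width*height/2 bytes ]
// The full-resolution Y plane IS a grayscale image, so "converting" to
// grayscale is just a matter of reading the first width*height bytes —
// no per-pixel arithmetic at all, hence "hardware-accelerated" grayscale.
std::vector<uint8_t> grayFromBiplanarYUV(const std::vector<uint8_t>& frame,
                                         size_t width, size_t height) {
    return std::vector<uint8_t>(frame.begin(), frame.begin() + width * height);
}

size_t biplanarFrameSize(size_t width, size_t height) {
    return width * height + width * height / 2;  // Y plane + UV plane
}
```

In the controller itself no copy is even needed: a single-channel cv::Mat can be wrapped directly around the Y plane’s base address.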
Processing video frames
VideoCaptureViewController implements the captureOutput:didOutputSampleBuffer:fromConnection: method of the AVCaptureVideoDataOutputSampleBufferDelegate protocol and is set as the delegate for receiving video frames from AVFoundation via the capture session’s AVCaptureVideoDataOutput.
This method takes the supplied sample buffer containing the video frame and creates a cv::Mat object. If grayscale mode is enabled then a single-channel cv::Mat is created; for full-color mode a BGRA-format cv::Mat is created. This cv::Mat object is then passed to the frame-processing method that your subclass overrides, where the OpenCV heavy lifting is implemented. Note that no video data is copied here: the cv::Mat that is created points right into the hardware video buffer and must be processed before the delegate method returns. If you need to keep references to video frames then use the cv::Mat::clone method to create a deep copy of the video data.
The delegate method is called on a private GCD queue created by the view controller, so your overridden frame-processing method is also called on this queue. If you need to update UI based on your frame processing then you will need to use dispatch_sync or dispatch_async to dispatch those updates on the main application queue.
VideoCaptureViewController also monitors video frame timing information and uses it to calculate a running average of performance measured in frames per second. Set the controller’s debug-info property to YES to display this information in an overlay on top of the video preview layer.
Video orientation and the video coordinate system
Video frames are supplied by the iOS device hardware in landscape orientation, irrespective of the physical orientation of the device, for both the front and back cameras (as if you were holding the device in landscape with the Home button on the left). The video preview layer automatically rotates the video feed to the upright orientation and also mirrors the feed from the front camera to give the reflected image that we are used
to seeing when we look in a mirror. The preview layer also scales the video according to its current videoGravity mode: either stretching the video to fill its full bounds or fitting the video while maintaining the aspect ratio.
All these transformations create a problem when we need to map from a coordinate in the original video frame to the corresponding coordinate in the view as seen by the user and vice versa. For instance, you may have the location of a feature detected in the video frame and need to draw a marker at the corresponding position in the view. Or a user may have tapped on the view and you need to convert that view coordinate into the corresponding coordinate in the video frame.
All this complexity is handled in -[VideoCaptureViewController affineTransformForVideoRect:orientation:], which creates an affine transform that you can use to convert CGPoints and CGRects between the video coordinate system and the view coordinate system. If you
need to convert in the opposite direction then create the inverse transform using the CGAffineTransformInvert function. If you are not sure what an affine transform is then just look at the following code snippet for how to use one to convert points and rectangles between different coordinate systems.
Using VideoCaptureViewController in your own projects
VideoCaptureViewController is designed to
be re-useable in your own projects by subclassing it just as you would subclass Apple-provided controllers like UIViewController and UITableViewController. Add the header and implementation files (VideoCaptureViewController.h and VideoCaptureViewController.m)
to your project and modify your application-specific view controller(s) to derive from VideoCaptureViewController instead of UIViewController. If you want to add additional controls over the top of the video preview you can use Interface Builder and connect
up IBOutlets as usual. See the demo app for how this is done to overlay the video preview with UIButtons. You implement your application-specific video processing by overriding the frame-processing method in your controller. Which leads us to face tracking…
Face tracking seems to be the ‘Hello World’ of computer vision and judging by the number of questions about it on StackOverflow many developers are looking for an iOS implementation. We couldn’t resist choosing it as the subject for our demo app either. The
implementation can be found in the demo app’s view controller, which is a subclass of VideoCaptureViewController. As described above, we’ve added our app-specific processing code by overriding the frame-processing method of the base class. We have also added three UIButton controls in Interface Builder to demonstrate how to extend the user interface. These buttons allow you to turn the iPhone 4 torch on and off, switch between the front and back cameras, and toggle the frames-per-second display.
Processing the video frames
The VideoCaptureViewController base class handles capturing frames and wrapping them up as cv::Mat objects. Each frame is supplied to our app-specific subclass via the frame-processing method, which is overridden to implement the detection.
The face detection is performed using OpenCV’s CascadeClassifier and the ‘haarcascade_frontalface_alt2’ cascade provided with the OpenCV distribution. The details of the detection are beyond the scope of this article but you can find lots of information about the Viola-Jones method and Haar-like features on Wikipedia.
The first task is to rotate the video frame from the hardware-supplied landscape orientation to portrait orientation. We do this to match the orientation of the video preview layer and also to allow OpenCV’s CascadeClassifier to operate as it will only detect upright features in an image. Using this technique, the app can only detect faces when the device is held in the portrait orientation. Alternatively, we could have rotated the video frame based on the current physical orientation of the device to allow faces to be detected when the device is held in any orientation.
The rotation is performed quickly by combining a cv::transpose, which swaps the x axis and y axis of a matrix, and a cv::flip, which mirrors a matrix about a specified axis. Video frames from the front camera need to be mirrored to match the video preview display so we can perform the rotation with just a transpose and no flip.
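The transpose-plus-flip rotation can be illustrated with a self-contained sketch (plain C++ on a nested vector standing in for a cv::Mat):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Row-major image of rows x cols pixels, standing in for a cv::Mat.
using Image = std::vector<std::vector<int>>;

// Swap the x and y axes (what cv::transpose does).
Image transpose(const Image& src) {
    size_t rows = src.size(), cols = src[0].size();
    Image dst(cols, std::vector<int>(rows));
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            dst[c][r] = src[r][c];
    return dst;
}

// Mirror about the vertical axis (cv::flip with flipCode == 1).
Image flipHorizontal(const Image& src) {
    Image dst = src;
    for (auto& row : dst)
        std::reverse(row.begin(), row.end());
    return dst;
}

// A 90-degree clockwise rotation is a transpose followed by a horizontal
// flip. For the front camera the transpose alone already yields the rotated
// AND mirrored image that matches the mirrored preview, so the flip is skipped.
Image rotate90CW(const Image& src) {
    return flipHorizontal(transpose(src));
}
```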
Once the video frame is in the correct orientation, it is passed to the CascadeClassifier for detection. Detected faces are returned as an STL vector of rectangles. The classification is run using the CV_HAAR_FIND_BIGGEST_OBJECT flag, which instructs the classifier
to look for faces at decreasing size and stop when it finds the first face. If you remove this flag the classifier will instead start small, look for faces at increasing size and return all the faces it detects in the frame.
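If you do remove the flag but still want the single most prominent face, you can select the largest detection yourself. This is a hypothetical helper, not part of the demo app, using a plain struct in place of cv::Rect:

```cpp
#include <algorithm>
#include <vector>

// Plain stand-in for cv::Rect.
struct Rect { int x, y, width, height; };

// Pick the detection with the largest area — roughly what asking the
// classifier for the biggest object gives you, done on the caller's side.
Rect biggestFace(const std::vector<Rect>& faces) {
    return *std::max_element(faces.begin(), faces.end(),
        [](const Rect& a, const Rect& b) {
            return a.width * a.height < b.width * b.height;
        });
}
```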
The STL vector of face rectangles (if any) is passed to the displayFaces:forVideoRect:videoOrientation: method for display. We use GCD’s dispatch_sync to dispatch the call on the main application thread. Remember that the frame-processing method is
called on our private video processing thread but UI updates must be performed on the main application thread. We use dispatch_sync rather than dispatch_async so that the video processing thread is blocked while the UI updates are being performed on the main
thread. This will cause AVFoundation to discard video frames automatically while our UI updates are taking place and ensures that we are not processing video frames faster than we can display the results. In practice, processing the frame will take longer
than any UI update associated with the frame, but it’s worth bearing in mind if your app is doing simple processing accompanied by lengthy UI updates.
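The frame-dropping behaviour described above can be modelled as a single-slot ‘latest frame’ mailbox — an illustrative sketch of the pattern, not how AVFoundation is actually implemented:

```cpp
#include <optional>

// Single-slot "latest frame" mailbox: if the consumer hasn't taken the
// previous frame yet, a newly published frame overwrites it, i.e. the older
// frame is dropped. This mirrors AVFoundation discarding frames while the
// blocked delivery queue is still busy with processing and UI updates.
template <typename Frame>
class LatestFrameSlot {
public:
    void publish(const Frame& f) {
        if (slot_) ++dropped_;  // previous frame was never consumed
        slot_ = f;
    }
    std::optional<Frame> take() {
        std::optional<Frame> out = slot_;
        slot_.reset();
        return out;
    }
    int droppedCount() const { return dropped_; }
private:
    std::optional<Frame> slot_;
    int dropped_ = 0;
};
```

The consumer always sees the most recent frame and stale frames are silently discarded, which is exactly the behaviour you want when processing cannot keep up with capture.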
Displaying the face markers
For each detected face, the method creates an empty CALayer of the appropriate size with a 10 pixel red border and adds it into the layer hierarchy above the video preview layer. These ‘FaceLayers’ are re-used from frame to frame and repositioned within a CATransaction block to disable the default layer animation. This technique gives us a high-performance method for adding markers without having to do any drawing.
The face rectangles passed to this method are in the video frame coordinate space. For them to line up correctly with the video preview they need to be transformed into the view’s coordinate space. To do this we create a CGAffineTransform using the affineTransformForVideoRect:orientation: method of the VideoCaptureViewController class and use it to transform each rectangle in turn.
The displayFaces:forVideoRect:videoOrientation: method supports display of multiple face markers even though, with the current settings, OpenCV’s CascadeClassifier will return only the single largest face that it detects. Remove the CV_HAAR_FIND_BIGGEST_OBJECT flag to enable detection of multiple faces in a frame.
On an iPhone 4 with the CV_HAAR_FIND_BIGGEST_OBJECT flag set, the demo app achieves up to 4 fps when a face is in the frame. This drops to around 1.5 fps when no face is present. Without the flag, multiple faces can be detected in a frame at around 1.8 fps. Note that the live video preview always runs at the full 30 fps irrespective of the processing frame rate, and the frame-processing method can be called at 30 fps if you only perform minimal processing.
The face detection could obviously be optimized to achieve a faster effective frame rate and this has been discussed at length elsewhere. However, the purpose of this article is to demonstrate how to efficiently capture live video on iOS devices. What you do with those frames and how you process them is really up to you. We look forward to seeing all your augmented reality apps in the App Store!