Computer vision with iOS Part 2: Face tracking in live video

Original post, 2 December 2013, 20:49:21



Update: Nov 28 2011 – The OpenCV framework has been rebuilt using opencv svn revision 7017


Hot on the heels of our last article, in which we showed you how to build an OpenCV framework for iOS, we are turning our attention to capturing live video and processing video frames with OpenCV. This is the foundation for augmented reality, the latest buzz topic in computer vision. The article is accompanied by a demo app that detects faces in a real-time video feed from your iOS device’s camera. You can check out the source code for the app at GitHub or follow the direct download link at the end of the article.
The FaceTracker demo app
As shown in our last article, OpenCV supports video capture on iOS devices using the cv::VideoCapture class from the highgui module. Calling the grab and retrieve methods of this class (or read, which combines the two) allows you to capture a single video frame and obtain it as a cv::Mat object for processing. However, the class is not optimized for processing live video:

  • Each video frame is copied several times before being made available to your app for processing.
  • You are required to ‘pull’ frames from cv::VideoCapture at a rate that you decide rather than being ‘pushed’ frames in real time as they become available.
  • No video preview is supported. You are required to display frames manually in your UI.

In designing image processing apps for iOS devices we recommend that you use OpenCV for what it excels at – image processing – but use standard iOS support for accessing hardware and implementing UI. It may be a philosophical standpoint, but we find that cross-platform layers such as OpenCV’s highgui module always incur performance and design restrictions in trying to support the lowest common denominator. With that in mind, we have implemented a re-useable view controller subclass (VideoCaptureViewController) that enables high performance processing of live video using video capture support provided by the AVFoundation framework. The controller automatically manages a video preview layer and throttles the rate at which video frames are supplied to your processing implementation to accommodate processing load. The components of the underlying AVFoundation video capture stack are also made available to you so that you can tweak behaviour to match your exact requirements.

The Video Capture View Controller

The AVFoundation video capture stack and video preview layer are conveniently wrapped up in the VideoCaptureViewController class provided with the demo source code. This class handles creation of the video capture stack, insertion of the video preview layer into the controller’s view hierarchy and conversion of video frames to cv::Mat instances for processing with OpenCV. It also provides convenience methods for turning the iPhone 4’s torch on and off, switching between the front and back cameras while capturing video and displaying the current frames per second.

The details of how to set up the AVFoundation video capture stack are beyond the scope of this article and we refer you to the documentation from Apple and the canonical application sample AVCam. If you are interested in how the stack is created, however, then take a look at the implementation of the createCaptureSessionForCamera:qualityPreset:grayscale: method, which is called from viewDidLoad. There are a number of interesting aspects of the implementation, which we will go into next.

Hardware-acceleration of grayscale capture

For many image processing applications the first processing step is to reduce the full-color BGRA data received from the video hardware to a grayscale image to maximize processing speed when color information is not required. With OpenCV, this is usually achieved using the cv::cvtColor function, which produces a single channel image by calculating the weighted average of the R, G and B components of the original image. In VideoCaptureViewController we perform this conversion in hardware using a little trick and save processor cycles for the more interesting parts of your image processing pipeline.

If grayscale mode is enabled then the video format is set to kCVPixelFormatType_420YpCbCr8BiPlanarFullRange. The video hardware will then supply YUV formatted video frames in which the Y channel contains luminance data and the color information is encoded in the U and V chrominance channels. The luminance channel is used by the controller to create a single-channel grayscale image and the chrominance channels are ignored. Note that the video preview layer will still display the full-color video feed whether grayscale mode is enabled or not.
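To make the saving concrete, here is a minimal sketch in plain C++ with no OpenCV or AVFoundation dependency. The names bgraToGray, LumaPlane and lumaAt are hypothetical and the buffer layouts are simplified assumptions, not the controller's actual code. The first function does the per-pixel weighted average that a software conversion requires (using the standard 0.299/0.587/0.114 luma weights), while the second simply reads a byte from the Y plane, honouring the row stride.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Software path: weighted average of R, G and B for every pixel of a
// tightly packed BGRA image (4 bytes per pixel, B first).
std::vector<std::uint8_t> bgraToGray(const std::vector<std::uint8_t>& bgra,
                                     std::size_t width, std::size_t height) {
    assert(bgra.size() == width * height * 4);
    std::vector<std::uint8_t> gray(width * height);
    for (std::size_t i = 0; i < width * height; ++i) {
        const double b = bgra[4 * i + 0];
        const double g = bgra[4 * i + 1];
        const double r = bgra[4 * i + 2];
        // Standard luma weights; adding 0.5 rounds to the nearest integer.
        gray[i] = static_cast<std::uint8_t>(0.299 * r + 0.587 * g + 0.114 * b + 0.5);
    }
    return gray;
}

// Hardware path: the Y plane already holds one luminance byte per pixel.
// Rows may be padded, so index with bytesPerRow rather than width.
struct LumaPlane {
    const std::uint8_t* base;  // first byte of the Y plane (assumed layout)
    std::size_t width;
    std::size_t height;
    std::size_t bytesPerRow;   // at least width
};

std::uint8_t lumaAt(const LumaPlane& plane, std::size_t x, std::size_t y) {
    assert(x < plane.width && y < plane.height);
    return plane.base[y * plane.bytesPerRow + x];
}
```

In the software path every output pixel costs three multiplications; in the hardware path the grayscale image is already sitting in memory and only needs to be addressed correctly.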

Processing video frames

VideoCaptureViewController implements the AVCaptureVideoDataOutputSampleBufferDelegate protocol and is set as the delegate for receiving video frames from AVFoundation via the captureOutput:didOutputSampleBuffer:fromConnection: method. This method takes the supplied sample buffer containing the video frame and creates a cv::Mat object. If grayscale mode is enabled then a single-channel cv::Mat is created; for full-color mode a BGRA format cv::Mat is created. This cv::Mat object is then passed on to processFrame:videoRect:videoOrientation: where the OpenCV heavy-lifting is implemented. Note that no video data is copied here: the cv::Mat that is created points right into the hardware video buffer and must be processed before captureOutput:didOutputSampleBuffer:fromConnection: returns. If you need to keep references to video frames then use the cv::Mat::clone method to create a deep copy of the video data.
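The difference between a matrix that points into the hardware buffer and a deep copy can be illustrated without OpenCV. The sketch below is plain C++; FrameView, OwnedFrame and cloneFrame are hypothetical names that only mimic, in spirit, a cv::Mat wrapped around external data and the result of cv::Mat::clone.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Non-owning view of a single-channel frame: valid only while the
// underlying capture buffer is valid and unchanged.
struct FrameView {
    const std::uint8_t* data;
    std::size_t rows;
    std::size_t cols;
    std::size_t step;  // bytes per row, may include padding
};

// Owning deep copy that can safely outlive the capture callback.
struct OwnedFrame {
    std::vector<std::uint8_t> pixels;  // tightly packed, rows * cols bytes
    std::size_t rows;
    std::size_t cols;
};

OwnedFrame cloneFrame(const FrameView& view) {
    assert(view.step >= view.cols);
    OwnedFrame copy{std::vector<std::uint8_t>(view.rows * view.cols), view.rows, view.cols};
    for (std::size_t r = 0; r < view.rows; ++r) {
        const std::uint8_t* src = view.data + r * view.step;
        // Copy only the visible pixels of each row, skipping any padding.
        std::copy(src, src + view.cols,
                  copy.pixels.begin() + static_cast<std::ptrdiff_t>(r * view.cols));
    }
    return copy;
}
```

If the capture pipeline reuses the buffer after the callback returns, the view sees the new contents while the clone keeps the pixels of the frame it was made from.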

Note that captureOutput:didOutputSampleBuffer:fromConnection: is called on a private GCD queue created by the view controller. Your overridden processFrame:videoRect:videoOrientation: method is also called on this queue. If you need to update UI based on your frame processing then you will need to use dispatch_sync or dispatch_async to dispatch those updates on the main application queue.

VideoCaptureViewController also monitors video frame timing information and uses it to calculate a running average of performance measured in frames per second. Set the showDebugInfo property of the controller to YES to display this information in an overlay on top of the video preview layer.
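The article does not spell out how the running average is computed, so the following plain C++ sketch shows one common approach as an assumption: average the last few frame intervals in a sliding window. FpsMeter and its window size are hypothetical and not taken from the demo source.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Running average of frames per second over the last `window` frame intervals.
class FpsMeter {
public:
    explicit FpsMeter(std::size_t window) : window_(window) { assert(window > 0); }

    // Call once per frame with the frame's timestamp in seconds.
    void addFrame(double timestampSeconds) {
        if (hasLast_) {
            intervals_.push_back(timestampSeconds - last_);
            sum_ += intervals_.back();
            if (intervals_.size() > window_) {
                // Drop the oldest interval so only the window is averaged.
                sum_ -= intervals_.front();
                intervals_.pop_front();
            }
        }
        last_ = timestampSeconds;
        hasLast_ = true;
    }

    // Average FPS over the window, or 0 until two frames have been seen.
    double fps() const {
        if (intervals_.empty() || sum_ <= 0.0) return 0.0;
        return static_cast<double>(intervals_.size()) / sum_;
    }

private:
    std::size_t window_;
    std::deque<double> intervals_;
    double sum_ = 0.0;
    double last_ = 0.0;
    bool hasLast_ = false;
};
```

A short window reacts quickly when the processing load changes; a longer one gives a steadier number in the overlay.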

The FaceTracker App screenshot

Video orientation and the video coordinate system

Video frames are supplied by the iOS device hardware in landscape orientation irrespective of the physical orientation of the device. Specifically, the front camera orientation is AVCaptureVideoOrientationLandscapeLeft (as if you were holding the device in landscape with the Home button on the left) and the back camera orientation is AVCaptureVideoOrientationLandscapeRight (as if you were holding the device in landscape with the Home button on the right). The video preview layer automatically rotates the video feed to the upright orientation and also mirrors the feed from the front camera to give the reflected image that we are used to seeing when we look in a mirror. The preview layer also scales the video according to its current videoGravity mode: either stretching the video to fill its full bounds or fitting the video while maintaining the aspect ratio.

All these transformations create a problem when we need to map from a coordinate in the original video frame to the corresponding coordinate in the view as seen by the user and vice versa. For instance, you may have the location of a feature detected in the video frame and need to draw a marker at the corresponding position in the view. Or a user may have tapped on the view and you need to convert that view coordinate into the corresponding coordinate in the video frame.

All this complexity is handled in -[VideoCaptureViewController affineTransformForVideoRect:orientation:], which creates an affine transform that you can use to convert CGPoints and CGRects from the video coordinate system to the view coordinate system. If you need to convert in the opposite direction then create the inverse transform using the CGAffineTransformInvert function. If you are not sure what an affine transform is then just look at the following code snippet for how to use them to convert CGPoints and CGRects between different coordinate systems.

// Create the affine transform for converting from the video coordinate system to the view coordinate system
CGAffineTransform t = [self affineTransformForVideoRect:videoRect orientation:videoOrientation];
// Convert CGPoint from video coordinate system to view coordinate system
viewPoint = CGPointApplyAffineTransform(videoPoint, t);
// Convert CGRect from video coordinate system to view coordinate system
viewRect = CGRectApplyAffineTransform(videoRect, t);
// Create inverse transform for converting from view coordinate system to video coordinate system
CGAffineTransform invT = CGAffineTransformInvert(t);
// Apply the inverse transform (invT, not t) to map view coordinates back to video coordinates
videoPoint = CGPointApplyAffineTransform(viewPoint, invT);
videoRect = CGRectApplyAffineTransform(viewRect, invT);
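For readers who want to see what happens underneath those calls, here is a plain C++ sketch of the arithmetic. The struct uses the same six coefficients as CGAffineTransform, but Point, Affine, applyTransform and invertTransform are hypothetical names, and the sketch assumes the transform is invertible.

```cpp
#include <cassert>

struct Point { double x, y; };

// Same coefficient layout as CGAffineTransform:
//   x' = a*x + c*y + tx
//   y' = b*x + d*y + ty
struct Affine { double a, b, c, d, tx, ty; };

Point applyTransform(const Affine& t, Point p) {
    return { t.a * p.x + t.c * p.y + t.tx,
             t.b * p.x + t.d * p.y + t.ty };
}

// Inverse of an affine transform; requires a non-zero determinant.
Affine invertTransform(const Affine& t) {
    const double det = t.a * t.d - t.b * t.c;
    assert(det != 0.0);
    return {  t.d / det, -t.b / det,
             -t.c / det,  t.a / det,
             (t.c * t.ty - t.d * t.tx) / det,
             (t.b * t.tx - t.a * t.ty) / det };
}
```

As an example, a transform that halves both axes and then offsets by (10, 20) maps the video point (100, 200) to the view point (60, 120), and its inverse maps that view point back to (100, 200). A 90 degree rotation of a 640 x 480 landscape frame into portrait can be written with a = 0, b = 1, c = -1, d = 0, tx = 480, ty = 0.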

Using VideoCaptureViewController in your own projects

VideoCaptureViewController is designed to be re-useable in your own projects by subclassing it just as you would subclass Apple-provided controllers like UIViewController and UITableViewController. Add the VideoCaptureViewController header and implementation files to your project and modify your application-specific view controller(s) to derive from VideoCaptureViewController instead of UIViewController. If you want to add additional controls over the top of the video preview you can use Interface Builder and connect up IBOutlets as usual. See the demo app for how this is done to overlay the video preview with UIButtons. You implement your application-specific video processing by overriding the processFrame:videoRect:videoOrientation: method in your controller. Which leads us to face tracking…

Face tracking

Face tracking seems to be the ‘Hello World’ of computer vision and judging by the number of questions about it on StackOverflow many developers are looking for an iOS implementation. We couldn’t resist choosing it as the subject for our demo app either. The implementation can be found in the DemoVideoCaptureViewController class. This is a subclass of VideoCaptureViewController and, as described above, we’ve added our app-specific processing code by overriding the processFrame:videoRect:videoOrientation: method of the base class. We have also added three UIButton controls in Interface Builder to demonstrate how to extend the user interface. These buttons allow you to turn the iPhone 4 torch on and off, switch between the front and back cameras and toggle the frames-per-second display.

Processing the video frames

The VideoCaptureViewController base class handles capturing frames and wrapping them up as cv::Mat instances. Each frame is supplied to our app-specific subclass via the processFrame:videoRect:videoOrientation: method, which is overridden to implement the detection.

The face detection is performed using OpenCV’s CascadeClassifier and the ‘haarcascade_frontalface_alt2’ cascade provided with the OpenCV distribution. The details of the detection are beyond the scope of this article but you can find lots of information about the Viola-Jones method and Haar-like features on Wikipedia.

The first task is to rotate the video frame from the hardware-supplied landscape orientation to portrait orientation. We do this to match the orientation of the video preview layer and also to allow OpenCV’s CascadeClassifier to operate as it will only detect upright features in an image. Using this technique, the app can only detect faces when the device is held in the portrait orientation. Alternatively, we could have rotated the video frame based on the current physical orientation of the device to allow faces to be detected when the device is held in any orientation.

The rotation is performed quickly by combining a cv::transpose, which swaps the x axis and y axis of a matrix, and a cv::flip, which mirrors a matrix about a specified axis. Video frames from the front camera need to be mirrored to match the video preview display so we can perform the rotation with just a transpose and no flip.
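The transpose-and-flip trick is easy to check on a tiny matrix. The sketch below uses plain C++ vectors rather than cv::Mat, and transpose, flipHorizontal and rotate90Clockwise are hypothetical helper names. It shows one possible combination (transpose followed by a mirror about the vertical axis, which is what cv::flip does with a positive flip code) and is not necessarily the exact sequence the demo uses for each camera.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<int>>;  // image[row][col]

// Swap the x and y axes: element (r, c) moves to (c, r).
Image transpose(const Image& src) {
    assert(!src.empty());
    Image dst(src[0].size(), std::vector<int>(src.size()));
    for (std::size_t r = 0; r < src.size(); ++r) {
        for (std::size_t c = 0; c < src[r].size(); ++c) {
            dst[c][r] = src[r][c];
        }
    }
    return dst;
}

// Mirror about the vertical axis by reversing every row.
Image flipHorizontal(Image img) {
    for (auto& row : img) {
        std::reverse(row.begin(), row.end());
    }
    return img;
}

// Transpose followed by a horizontal flip is a 90 degree clockwise rotation.
Image rotate90Clockwise(const Image& src) {
    return flipHorizontal(transpose(src));
}
```

For the 2 x 3 input {{1, 2, 3}, {4, 5, 6}}, the transpose alone gives {{1, 4}, {2, 5}, {3, 6}}, which is the rotated image already mirrored, and adding the flip gives the plain rotation {{4, 1}, {5, 2}, {6, 3}}. This is why a frame that has to end up mirrored anyway needs only the transpose.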

Once the video frame is in the correct orientation, it is passed to the CascadeClassifier for detection. Detected faces are returned as an STL vector of rectangles. The classification is run using the CV_HAAR_FIND_BIGGEST_OBJECT flag, which instructs the classifier to look for faces at decreasing size and stop when it finds the first face. If you remove this flag, the classifier starts small, looks for faces at increasing size and returns all the faces it detects in the frame.

The STL vector of face rectangles (if any) is passed to the displayFaces:forVideoRect:videoOrientation: method for display. We use GCD’s dispatch_sync here to dispatch the call on the main application thread. Remember that processFrame:videoRect:videoOrientation: is called on our private video processing thread but UI updates must be performed on the main application thread. We use dispatch_sync rather than dispatch_async so that the video processing thread is blocked while the UI updates are being performed on the main thread. This will cause AVFoundation to discard video frames automatically while our UI updates are taking place and ensures that we are not processing video frames faster than we can display the results. In practice, processing the frame will take longer than any UI update associated with the frame but it’s worth bearing in mind if your app is doing simple processing accompanied by lengthy UI updates.

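// Dispatch updating of face markers to main queue
// (rect and orientation are assumed names for the arguments passed in to processFrame:videoRect:videoOrientation:)
    dispatch_sync(dispatch_get_main_queue(), ^{
        [self displayFaces:faces forVideoRect:rect videoOrientation:orientation];
    });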
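The effect of blocking the processing thread can be modelled with a few lines of plain C++. This is a simplified simulation under stated assumptions, not AVFoundation code: frames arrive at a fixed interval, a frame is delivered only if the previous one has finished processing (including any synchronous UI update), and late frames are dropped. ThrottleResult and simulateCapture are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct ThrottleResult {
    int processed;
    int dropped;
};

// Simulate frameCount frames arriving every frameIntervalUs microseconds,
// where each delivered frame keeps the processing thread busy for processingUs.
ThrottleResult simulateCapture(int frameCount,
                               std::int64_t frameIntervalUs,
                               std::int64_t processingUs) {
    assert(frameCount >= 0 && frameIntervalUs > 0 && processingUs >= 0);
    ThrottleResult result{0, 0};
    std::int64_t busyUntilUs = 0;
    for (int k = 0; k < frameCount; ++k) {
        const std::int64_t arrivalUs = k * frameIntervalUs;
        if (arrivalUs >= busyUntilUs) {
            // The thread is idle, so this frame is delivered and processed.
            ++result.processed;
            busyUntilUs = arrivalUs + processingUs;
        } else {
            // The thread is still blocked, so the frame is discarded.
            ++result.dropped;
        }
    }
    return result;
}
```

With 30 frames at roughly 30 fps (a 33333 microsecond interval) and 250 milliseconds of work per frame, this model processes 4 frames and drops 26; with 1 millisecond of work per frame, all 30 are processed.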

Displaying the face markers

For each detected face, the method creates an empty CALayer of the appropriate size with a 10 pixel red border and adds it into the layer hierarchy above the video preview layer. These ‘FaceLayers’ are re-used from frame to frame and repositioned within a CATransaction block to disable the default layer animation. This technique gives us a high-performance method for adding markers without having to do any drawing.

// Create a new feature marker layer
    featureLayer = [[CALayer alloc] init];
    featureLayer.name = @"FaceLayer";
    featureLayer.borderColor = [[UIColor redColor] CGColor];
    featureLayer.borderWidth = 10.0f;
    [self.view.layer addSublayer:featureLayer];
    [featureLayer release];

The face rectangles passed to this method are in the video frame coordinate space. For them to line up correctly with the video preview they need to be transformed into the view’s coordinate space. To do this we create a CGAffineTransform using the affineTransformForVideoRect:orientation: method of the VideoCaptureViewController class and use this to transform each rectangle in turn.

The displayFaces:forVideoRect:videoOrientation: method supports display of multiple face markers even though, with the current settings, OpenCV’s CascadeClassifier will return the single largest face that it detects. Remove the CV_HAAR_FIND_BIGGEST_OBJECT flag to enable detection of multiple faces in a frame.


On an iPhone 4 using the CV_HAAR_FIND_BIGGEST_OBJECT option the demo app achieves up to 4 fps when a face is in the frame. This drops to around 1.5 fps when no face is present. Without the CV_HAAR_FIND_BIGGEST_OBJECT option multiple faces can be detected in a frame at around 1.8 fps. Note that the live video preview always runs at the full 30 fps irrespective of the processing frame rate and processFrame:videoRect:videoOrientation: is called at 30 fps if you only perform minimal processing.

The face detection could obviously be optimized to achieve a faster effective frame rate and this has been discussed at length elsewhere. However, the purpose of this article is to demonstrate how to efficiently capture live video on iOS devices. What you do with those frames and how you process them is really up to you. We look forward to seeing all your augmented reality apps in the App Store!

Links to demo project source code

Git –
Download zip –

