1. http://www.cvchina.info/2012/05/25/kinect-face-tracking/: provides links to AAM revisited, AAM based face tracking with temporal matching and face segmentation, and ICP.
After a long journey, my team at Microsoft shipped Face Tracking SDK as part of Kinect For Windows 1.5! I worked on the face tracking technology (starting from the times when it was part of Avatar Kinect) and so I’d like to describe its capabilities and limitations here. First of all, here is the demo:
You can use the Face Tracking SDK in your program if you install Kinect for Windows Developer Toolkit 1.5. After you install it, go to the provided samples and build and run the "Face Tracking Visualization" C++ sample or the "Face Tracking Basics-WPF" C# sample. Of course, you need to have a Kinect camera attached to your PC. The face tracking engine tracks at a speed of 4-8 ms per frame, depending on how powerful your PC is. It does its work on the CPU only (it does not use the GPU on purpose, since you may need it for graphics).
If you look at the 2 mentioned code samples, you can see that it is relatively easy to add face tracking capabilities to your application. You need to link with the provided lib, place 2 dlls in the global path or in the working directory of your executable (so they can be found), and add something like this to your code (this is in C++; you can also do it in C#, see the code samples):
// Include main Kinect SDK .h file
#include "NuiAPI.h"

// Include the face tracking SDK .h file
#include "FaceTrackLib.h"

// Create an instance of a face tracker
IFTFaceTracker* pFT = FTCreateFaceTracker();
if(!pFT)
{
    // Handle errors
}

// Initialize cameras configuration structures.
// IMPORTANT NOTE: resolutions and focal lengths must be accurate, since it affects tracking precision!
// It is better to use enums defined in NuiAPI.h

// Video camera config with width, height, focal length in pixels
// NUI_CAMERA_COLOR_NOMINAL_FOCAL_LENGTH_IN_PIXELS focal length is computed for 640x480 resolution
// If you use different resolutions, multiply this focal length by the scaling factor
FT_CAMERA_CONFIG videoCameraConfig = {640, 480, NUI_CAMERA_COLOR_NOMINAL_FOCAL_LENGTH_IN_PIXELS};

// Depth camera config with width, height, focal length in pixels
// NUI_CAMERA_DEPTH_NOMINAL_FOCAL_LENGTH_IN_PIXELS focal length is computed for 320x240 resolution
// If you use different resolutions, multiply this focal length by the scaling factor
FT_CAMERA_CONFIG depthCameraConfig = {320, 240, NUI_CAMERA_DEPTH_NOMINAL_FOCAL_LENGTH_IN_PIXELS};

// Initialize the face tracker
HRESULT hr = pFT->Initialize(&videoCameraConfig, &depthCameraConfig, NULL, NULL);
if(FAILED(hr))
{
    // Handle errors
}

// Create a face tracking result interface
IFTResult* pFTResult = NULL;
hr = pFT->CreateFTResult(&pFTResult);
if(FAILED(hr))
{
    // Handle errors
}

// Prepare image interfaces that hold RGB and depth data
IFTImage* pColorFrame = FTCreateImage();
IFTImage* pDepthFrame = FTCreateImage();
if(!pColorFrame || !pDepthFrame)
{
    // Handle errors
}

// Attach created interfaces to the RGB and depth buffers that are filled with
// corresponding RGB and depth frame data from Kinect cameras
pColorFrame->Attach(640, 480, colorCameraFrameBuffer, FTIMAGEFORMAT_UINT8_R8G8B8, 640*3);
pDepthFrame->Attach(320, 240, depthCameraFrameBuffer, FTIMAGEFORMAT_UINT16_D13P3, 320*2);
// You can also use Allocate() method in which case IFTImage interfaces own their memory.
// In this case use CopyTo() method to copy buffers

FT_SENSOR_DATA sensorData;
sensorData.pVideoFrame = pColorFrame;
sensorData.pDepthFrame = pDepthFrame;
sensorData.ZoomFactor = 1.0f;   // Not used must be 1.0
sensorData.ViewOffset.x = 0;    // Not used must be (0,0)
sensorData.ViewOffset.y = 0;

bool isFaceTracked = false;

// Track a face
while ( true )
{
    // Call Kinect API to fill colorCameraFrameBuffer and depthCameraFrameBuffer with RGB and depth data
    ProcessKinectIO();

    // Check if we are already tracking a face
    if(!isFaceTracked)
    {
        // Initiate face tracking.
        // This call is more expensive and searches the input frame for a face.
        hr = pFT->StartTracking(&sensorData, NULL, NULL, pFTResult);
        if(SUCCEEDED(hr) && SUCCEEDED(pFTResult->GetStatus()))
        {
            isFaceTracked = true;
        }
        else
        {
            // No faces found
            isFaceTracked = false;
        }
    }
    else
    {
        // Continue tracking. It uses a previously known face position.
        // This call is less expensive than StartTracking()
        hr = pFT->ContinueTracking(&sensorData, NULL, pFTResult);
        if(FAILED(hr) || FAILED(pFTResult->GetStatus()))
        {
            // Lost the face
            isFaceTracked = false;
        }
    }

    // Do something with pFTResult like visualize the mask, drive your 3D avatar,
    // recognize facial expressions
}

// Clean up
pFTResult->Release();
pColorFrame->Release();
pDepthFrame->Release();
pFT->Release();
Note 1, about the camera configuration structure: it is very important to pass correct parameters in it, namely the frame width, the frame height and the corresponding camera focal length in pixels. We don't read these automatically from the Kinect camera, to give more advanced users more flexibility. If you don't initialize them to the correct values (which can be read from the Kinect APIs), the tracking accuracy will suffer or the tracking will fail entirely.
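The focal-length scaling mentioned in the code comments can be sketched as below. This is only an illustration, not SDK code: `CameraConfig` is a hypothetical stand-in for `FT_CAMERA_CONFIG`, `makeScaledConfig` is a made-up helper, and the two nominal focal lengths are assumed values; in a real program take them from the constants in NuiAPI.h.

```cpp
#include <cassert>

// Hypothetical stand-in for the SDK's FT_CAMERA_CONFIG:
// frame width, frame height, focal length in pixels.
struct CameraConfig {
    unsigned int width;
    unsigned int height;
    float focalLengthPx;
};

// Assumed nominal focal lengths; prefer the NuiAPI.h constants in real code.
const float kColorNominalFocalPx = 531.15f; // assumed value, for 640x480
const float kDepthNominalFocalPx = 285.63f; // assumed value, for 320x240

// Scale a nominal focal length to another resolution with the same aspect
// ratio: the focal length in pixels grows linearly with the frame width.
CameraConfig makeScaledConfig(unsigned int width, unsigned int height,
                              unsigned int nominalWidth, float nominalFocalPx) {
    assert(width > 0 && height > 0 && nominalWidth > 0);
    const float scale =
        static_cast<float>(width) / static_cast<float>(nominalWidth);
    CameraConfig config;
    config.width = width;
    config.height = height;
    config.focalLengthPx = nominalFocalPx * scale;
    return config;
}
```

For example, `makeScaledConfig(1280, 960, 640, kColorNominalFocalPx)` would give a color configuration whose focal length is twice the nominal 640x480 value.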
Note 2, about the frame of reference for 3D results: the face tracking SDK uses both depth and color data, so we had to pick which camera space (video or depth) to compute the 3D tracking results in. Due to some technical advantages we decided to do it in the color camera space. So the resulting frame of reference for 3D face tracking results is the video camera space. It is a right-handed system with the Z axis pointing towards the tracked person and the Y axis pointing up. The measurement units are meters. So it is very similar to Kinect's skeleton coordinate frame, with the exception of the origin and the optical axis orientation (the skeleton frame of reference is in the depth camera space). The online documentation has a sample that describes how to convert from color camera space to depth camera space.
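To make the frame of reference more concrete, here is a minimal pinhole-projection sketch that maps a 3D point in such a camera space (meters, Y up, Z towards the person) to pixel coordinates. It is not part of the SDK: the types and the function `projectToImage` are hypothetical, and it assumes the principal point is at the image center, there is no lens distortion, and image rows grow downward.

```cpp
#include <cassert>

struct Point3 { float x, y, z; }; // meters, camera space: Y up, Z towards the person
struct Pixel  { float u, v; };    // pixels, origin at top-left, v grows downward

// Pinhole projection under the assumptions stated above.
// focalPx is the camera focal length in pixels for the given resolution.
Pixel projectToImage(const Point3& p, float focalPx, float width, float height) {
    assert(p.z > 0.0f); // the point must be in front of the camera
    Pixel px;
    px.u = 0.5f * width  + focalPx * p.x / p.z;
    px.v = 0.5f * height - focalPx * p.y / p.z; // Y up in 3D, v down in the image
    return px;
}
```

A point on the optical axis lands at the image center, and raising the point (larger Y) moves it towards the top of the image (smaller v).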
Also, here are several things that will affect tracking accuracy:
1) Light – a face should be well lit without too many harsh shadows on it. Bright backlight or sidelight may make tracking worse.
2) Distance to the Kinect camera – the closer you are to the camera the better it will track. The tracking quality is best when you are closer than 1.5 meters (4.9 feet) to the camera. At closer range Kinect’s depth data is more precise and so the face tracking engine can compute face 3D points more accurately.
3) Occlusions – if you have thick glasses or a Lincoln-like beard, you may have issues with the face tracking. This is still an open area for improvement. Face color is NOT an issue, as can be seen in this video.
Here are some technical details for more technologically/math minded people: We used the Active Appearance Model as the foundation for our 2D feature tracker. Then we extended our computation engine to use Kinect's depth data, so it can track faces/heads in 3D. This made it much more robust and reliable. On their own, Active Appearance Models are not robust enough to handle real open-world scenarios. Of course, we also used lots of secret sauce to make things work well together. You can read about some of the algorithms here, here and here.
Have fun with the face tracking SDK!
2. http://nsmoly.wordpress.com/2012/05/21/face-tracking-sdk-in-kinect-for-windows-1-5/: same content as above.
3. http://nsmoly.wordpress.com/2012/05/19/avatar-kinect/: gives links to L-K (Lucas-Kanade) image alignment, AAM revisited, and Real-Time Combined 2D+3D Active Appearance Models.
We used a combination of Active Appearance Models on "steroids" plus a few other things like a neural network, a face detector and various classifiers to make it stable and robust. You can read more about Active Appearance Models here, here and here. Of course, the usage of the Kinect camera improved precision and robustness a lot (due to its depth camera).
1) References for L-K (Lucas-Kanade) alignment:
• Lucas and Kanade, An iterative image registration technique with an application to stereo vision. IJCAI, 1981.
• Lucas, Generalized Image Matching by the Method of Differences, doctoral dissertation, 1984.
• Simon Baker and Iain Matthews. Lucas-Kanade 20 Years On: A Unifying Framework. IJCV, 2004.
• Source code
– OpenCV
– An Implementation of the Kanade–Lucas–Tomasi Feature Tracker, http://www.ces.clemson.edu/~stb/klt/
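Of these, the 1981 and 2004 papers cover L-K alignment, while the 1984 dissertation covers the L-K optical flow equation.
Both solve an energy minimization problem: find an affine warp of an image (or of a local patch of an image) that minimizes its distance to the target image.
The difference is that the optical flow equation introduces the notion of motion (velocity), which changes the form of the equation and therefore the way it is solved.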
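For orientation, the two formulations can be written roughly as follows. The notation is the one commonly used in the L-K literature and is an assumption of this note, not taken from the posts above: $T$ is the template, $I$ the input image, $W(\mathbf{x};\mathbf{p})$ the warp with parameters $\mathbf{p}$, and $(u, v)$ the flow velocity.

$$
\text{Alignment:}\quad \min_{\mathbf{p}} \sum_{\mathbf{x}} \big[ I(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \big]^2
$$

$$
\text{Optical flow:}\quad I_x u + I_y v + I_t = 0, \qquad \min_{u,v} \sum_{\mathbf{x} \in \Omega} \big( I_x u + I_y v + I_t \big)^2
$$

In the first case the unknowns are warp parameters and the problem is solved iteratively by linearizing around the current $\mathbf{p}$; in the second the unknowns are velocities and, over a small window $\Omega$, the problem reduces to a linear least-squares system.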
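2) The main contribution of AAM revisited is to use the image alignment algorithm from the L-K unifying framework (the inverse compositional image alignment algorithm) to solve the nonlinear optimization problem of AAM fitting.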
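The idea behind the inverse compositional algorithm can be illustrated with a toy 1D example. This is a sketch under strong simplifying assumptions, not the AAM fitting code from the paper: the warp is a pure translation $W(x;p) = x + p$, the template is an analytic Gaussian bump, and all function names are made up for illustration. The point it shows is that the gradient and the Gauss-Newton Hessian are computed on the template once, outside the loop, and the warp is updated by composing with the inverse of the increment.

```cpp
#include <cassert>
#include <cmath>

// Template signal T(x): a Gaussian bump with sigma = 2 (illustrative choice).
double templateAt(double x) { return std::exp(-x * x / 8.0); }
// Analytic derivative dT/dx.
double templateGradAt(double x) { return -x / 4.0 * templateAt(x); }
// Input signal I(x): the template shifted by trueShift.
double imageAt(double x, double trueShift) { return templateAt(x - trueShift); }

// Estimate p such that I(x + p) matches T(x), i.e. p should approach trueShift.
double alignInverseCompositional(double trueShift, int maxIters) {
    assert(maxIters > 0);
    const int kSamples = 65;     // sample positions x = -8, -7.75, ..., 8
    const double kStep = 0.25;

    // Precomputation: the Hessian depends on the template only.
    double hessian = 0.0;
    for (int i = 0; i < kSamples; ++i) {
        const double g = templateGradAt(-8.0 + kStep * i);
        hessian += g * g;
    }

    double p = 0.0;
    for (int it = 0; it < maxIters; ++it) {
        // Steepest-descent update from the error image I(W(x;p)) - T(x).
        double b = 0.0;
        for (int i = 0; i < kSamples; ++i) {
            const double x = -8.0 + kStep * i;
            b += templateGradAt(x) * (imageAt(x + p, trueShift) - templateAt(x));
        }
        const double dp = b / hessian;
        // Inverse compositional update: W(x;p) <- W(x;p) o W(x;dp)^-1.
        // For a translation this is simply p <- p - dp.
        p -= dp;
        if (std::fabs(dp) < 1e-12) break;
    }
    return p;
}
```

Calling `alignInverseCompositional(0.7, 20)` should return a value very close to 0.7. In the forward additive L-K algorithm the gradient and Hessian would instead be recomputed on the warped input image at every iteration, which is what makes the inverse compositional variant attractive for real-time fitting.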
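3) Real-Time Combined 2D+3D Active Appearance Models: the paper first compares 2D AAMs with 3DMMs and shows that, although an AAM is 2D, it can represent any 3D object, but requires 6 times as many parameters as the 3DMM. For the same reason the AAM has greater representational power, to the point that it can generate shapes that do not exist (are not physically plausible).
3D modes can be computed from a 2D AAM; the paper uses a non-rigid structure-from-motion algorithm for this. The resulting 3D modes can in turn be used to constrain the 2D AAM, so that the AAM can only represent plausible 3D models.
Finally, the paper uses the 2D+3D AAM to track faces in both 2D and 3D.
In this paper, a "constraint" means adding an extra term to the energy function.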
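Schematically, such a constrained energy has the following shape. This is my own paraphrase with assumed notation, not a formula copied from the paper: $E_{\text{2D}}$ is the usual AAM fitting error, $\mathbf{s}_{\text{2D}}(\mathbf{p})$ is the 2D AAM shape, $\mathbf{s}_{\text{3D}}(\bar{\mathbf{p}})$ is the shape generated by the 3D modes, $\mathbf{P}$ is the camera projection, and $K$ is a large weight.

$$
E(\mathbf{p}, \bar{\mathbf{p}}, \mathbf{P}) = E_{\text{2D}}(\mathbf{p}) + K \cdot \big\| \mathbf{P}\,\mathbf{s}_{\text{3D}}(\bar{\mathbf{p}}) - \mathbf{s}_{\text{2D}}(\mathbf{p}) \big\|^2
$$

With a large $K$ the second term acts as a soft constraint: a 2D shape that no projected 3D shape can explain becomes expensive, so the minimizer stays among the plausible shapes.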
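Key techniques: projection, non-rigid structure from motion (NRSFM), and energy function optimization.
In this paper the 3D modes are obtained with NRSFM. For an introduction to it I referred to the ICCV 2011 Tutorial on Non-rigid registration and reconstruction. My understanding of NRSFM is: write down the projection equations, track feature points across the 2D images to obtain constraints, and then solve the projection equations to recover the 3D reconstruction.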
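The projection equations mentioned here can be sketched as a forward model: a 3D shape is a base shape plus a weighted sum of 3D modes, and it is projected to 2D with a scaled orthographic (weak perspective) camera. NRSFM works in the opposite direction, recovering the modes, weights and camera from tracked 2D points. The code below shows only the forward direction; all names are hypothetical and the camera rotation is left out (treated as identity) to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };
struct Vec2 { double u, v; };

// 3D shape = base + sum_i coeffs[i] * modes[i] (linear shape model).
std::vector<Vec3> synthesizeShape(const std::vector<Vec3>& base,
                                  const std::vector<std::vector<Vec3> >& modes,
                                  const std::vector<double>& coeffs) {
    assert(modes.size() == coeffs.size());
    std::vector<Vec3> shape = base;
    for (std::size_t i = 0; i < modes.size(); ++i) {
        assert(modes[i].size() == base.size());
        for (std::size_t k = 0; k < base.size(); ++k) {
            shape[k].x += coeffs[i] * modes[i][k].x;
            shape[k].y += coeffs[i] * modes[i][k].y;
            shape[k].z += coeffs[i] * modes[i][k].z;
        }
    }
    return shape;
}

// Scaled orthographic projection: drop Z, scale, then translate.
// A full model would first rotate the shape; rotation is omitted here.
std::vector<Vec2> projectWeakPerspective(const std::vector<Vec3>& shape,
                                         double scale, double tx, double ty) {
    std::vector<Vec2> points(shape.size());
    for (std::size_t k = 0; k < shape.size(); ++k) {
        points[k].u = scale * shape[k].x + tx;
        points[k].v = scale * shape[k].y + ty;
    }
    return points;
}
```

Given 2D feature tracks over many frames, NRSFM looks for the modes, per-frame coefficients and per-frame camera parameters for which this forward model reproduces the observations as closely as possible.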