C64x+ iUniversal Codec Creation - from memcpy to Canny Edge Detector

Latest recommended article published on 2022-02-23 15:02:25

yuyin86



Introduction

The purpose of this application note is to explain the principles of developing and optimizing a DSP Video Analytics application in an IUNIVERSAL Codec Engine framework using the TI Video Analytics Library (VLib, http://www.ti.com/vlibrequest). The algorithm used as the example is the Canny Edge Detector.

In addition it will show how to develop the algorithm as an XDM Compliant codec using IUNIVERSAL interface to Codec Engine. Developing the algorithm in this way allows us to write a single C64x+ core algorithm that can be used on either a DSP only device such as the DM6437 or on an ARM+DSP device such as the DM644x, DM6467 or OMAP35xx. The following article aims to provide a worked example. It should be read in conjunction with this IUNIVERSAL article.

The Canny Edge Detection algorithm will be used to illustrate how to architect the DSP software to break down an image processing algorithm based on large frames of data into slices where possible. Using slices allows optimal use of the small but fast internal memory available on these devices to improve performance.

The algorithm development is done on the DM6437 EVM using the DVSDK1.11 as the software environment with both CCS v3.3 and 4.1 and will run on a PAL or NTSC video stream at full frame rate. The algorithm is then packaged as a codec server and integrated into the DM6467 DVSDK 2.00.00.22 GA release and the OMAP3 DVSDK 3.01.00.10 release. This process is then formalized into a recommended development flow for creating a DSP codec for a multi-processor platform.

The available demonstrations are summarized in this table, which gives the name of the file from the package to be used for that configuration. The software package can be downloaded here. In order to run the Linux applications you must copy the module .ko files that match your uImage into the application directory. The boot args must also match those for the DVSDK used (OMAP3530 and DM6467).

Description	Performance	Hardware Requirements	Demo Binaries	Demo Sources
DM6437 Frame Based Canny	17fps PAL @ 100% DSP	DM6437 EVM + D1 camera + composite Display	dm6437_binaries/video_canny_universal_dman3_frame.out	dvsdk_1_11_00_00_universal_codecs_ccs_v1_1.zip
DM6437 Slice Based Canny	25fps PAL @100% DSP	DM6437 EVM + D1 camera + composite Display	dm6437_binaries/video_canny_universal_dman3_slice.out	dvsdk_1_11_00_00_universal_codecs_ccs_v1_1.zip
DM6467 Frame Based Canny	16fps PAL @ 100% DSP	DM6467 EVM + D1 camera + composite Display	pre-compiled executable uses slice mode	dvsdk_2_00_00_22_Canny_iUniversal_v1_0.tar.gz
DM6467 Slice Based Canny	25fps PAL @ 60% DSP	DM6467 EVM + D1 camera + composite Display	dvsdk_2_00_00_22_6467_canny_executable_v1_1.tar.gz	dvsdk_2_00_00_22_Canny_iUniversal_v1_0.tar.gz
OMAP3530 Frame Based Canny + video decode	8fps NTSC @ 95% DSP	OMAP3 EVM	pre-compiled executable uses slice mode	dvsdk_3_01_00_10_Canny_iUniversal_v1_1.tar.gz
OMAP3530 Slice Based Canny + video decode	13fps NTSC @ 98% DSP	OMAP3 EVM	dvsdk_3_01_00_10_canny_executable_omap3530_v1_0.tar.gz	dvsdk_3_01_00_10_Canny_iUniversal_v1_1.tar.gz

Note that the VLib package is not included in the source packages. It must be requested from the link http://www.ti.com/vlibrequest.

Software Requirements

Software Required to Build DSP codec

This algorithm was developed on the DM6437 as it is easier to write and debug C64x+ code in a Code Composer Studio environment on a single core DSP with an Emulator. It was developed in DVSDK v1.11.00.00 for the DM6437 which is available from https://www-a.ti.com/downloads/sds_support/targetcontent/dvsdk/bios_dvsdk/index.html.

The IUNIVERSAL XDM interface was introduced in XDM1.1. The latest official DVSDK version for the DM6437 includes Codec Engine v1.20.02, which does not support this interface. Therefore, it is necessary to upgrade the Codec Engine and some other tools within the DVSDK to the latest version from https://www-a.ti.com/downloads/sds_support/targetcontent/index.html. Whilst some of the DVSDK needs upgrading, the original DVSDK is still used for the drivers (Peripheral Support Package) and examples. The following tool version upgrades were downloaded and installed in the DVSDK 1.11.00.00 directory (see the Codec Engine FAQ).

The DSP/BIOS update is installed in \CCStudio_3.3

DVSDK 1.11 was developed for CCS v3.3 and so these instructions are specific to CCS 3.3. This is not required for CCS v4. In order to use these new tools the following changes need to be made to paths in the file C:\dvsdk_1_11_00_00\xdcpaths_evmDM6437.dat

xdcpaths =  
 
//Codec Engine 
dvsdkInstallDir + "codec_engine_2_21/packages;" +  
dvsdkInstallDir + "codec_engine_2_21/examples;" +  
 
// Framework Components
dvsdkInstallDir + "codec_engine_2_21/cetools/packages;" + 
 
// XDAIS
dvsdkInstallDir + "codec_engine_2_21/cetools/packages;" + 
 
// Codecs
dvsdkInstallDir + "codecs_1_10/packages;" + 
 
// NDK package
dvsdkInstallDir + "ndk_1_92_00_22_eval/packages;" + 
 
// BIOS utilities
dvsdkInstallDir + "codec_engine_2_21/cetools/packages;" + 
 
// PSP Package
dvsdkInstallDir + "pspdrivers_1_10_00/packages;" + 
 
// EDMA3 Package
dvsdkInstallDir + "edma3_lld_1_05_00/packages;" + 
 
// TCONF import path to .tci files imported by projects' .tcf files
dvsdkInstallDir + "examples/common/evmDM6437;" +
 
"";

and the following environment variables which ensure that the packaging wizard picks up expected XDCTOOLS from DVSDK 1.11 rather than the later tools installed by CCSv4.

XDC_INSTALL_DIR=C:\dvsdk_1_11_00_00\xdctools_3_10_03;
XDCPATH=%XDC_INSTALL_DIR%
XDCROOT=%XDC_INSTALL_DIR%

To validate that the new tools are correctly installed, rebuild an existing example application such as \dvsdk_1_11_00_00\examples\video_encdec\evmDM6437\video_encdec.pjt.

Software Requirements to Package Algorithm as Codec and Server

In order to create a DSP algorithm package to be consumed by the application, the RTSC Codec and Server Package Wizard needs to be downloaded as part of CE UTILS v1.07 from:

https://www-a.ti.com/downloads/sds_support/applications_packages/ceutils/index.htm

Install CE UTILS to C:\dvsdk_1_11_00_00\ceutils_1_07 and edit XDCPATH environment variable to add:

C:\dvsdk_1_11_00_00\codec_engine_2_21\cetools\packages;C:\dvsdk_1_11_00_00\codec_engine_2_21\packages;C:\dvsdk_1_11_00_00\ceutils_1_07\packages

The packaging wizard also requires a utility called CG_XML. CG_XML v2.10 can be found at:

https://www-a.ti.com/downloads/sds_support/applications_packages/cg_xml/index.htm

Creating a new Application and DSP Codec Package

The best way to create a new application and algorithm codec package is to rework some of the supplied examples in the DVSDK. The purpose of this section is to explain how to create a new blank application and codec which can be used as a basis for a new project. The overall architecture of the build process is shown in Figure 1.

This document will use the following terms to describe the different blocks of code.

Application

This refers to the code that is managing the I/O drivers (video capture and display), creating the Codec Engine instance and passing blocks of data to the DSP algorithm (or codec) for processing using the Codec Engine XDM APIs.

DSP Codec

This is the DSP algorithm that implements the XDM interface and the actual processing algorithm. In this case the algorithm is a simple loopback from input to output. For more details on the TI eXpressDSP architecture refer to the book "OMAP and DaVinci Software for Dummies" (www.ti.com/dummiesbook) or the Wiki documented here.

In the case of the single core DM6437 used in this application note these functional blocks of code exist on the same core but are logically separate.

Figure 1: Application and DSP Codec Build Architecture

The source package is expected to be extracted to the C: drive and is provided with project files to build with both CCS v3.3 and 4.1. There is a common code directory which will be extracted to C:\dvsdk_1_11_00_00_universal_codecs_source. The CCS 3.3 project files will be in C:\dvsdk_1_11_00_00_universal_codecs_ccsv3_3 and the v4.1 files in C:\dvsdk_1_11_00_00_universal_codecs_ccs_v4. The project build folders both follow the same structure containing a packaged codecs directory (\codecs_1_10_new), a DSP codec source and build directory (\dsp_alg) and an application directory (\examples).

The source package is thus independent of the main DVSDK install but refers to it for all the component packages. In this example C:/dvsdk_1_11_00_00_universal_codecs_XXX will be used. The CCS v3.3 directory has a new copy of xdcpaths_evmDM6437.dat that adds the following line to the xdcpaths variable:

//Add path for the new universal codecs
"C:/dvsdk_1_11_00_00_universal_codecs_ccsv3_3/codecs_1_10_new/packages;" +

This allows the compiler to pick up all the component definitions from the main DVSDK and the new codecs and applications from the new directory.

The instructions in the rest of the document are for use of CCS v3.3 but can be easily translated to CCS 4.1. CCS v4.1 is very similar except that the source file build is in a different project to the RTSC package + DSP/BIOS configuration. So in CCS v4.1 each of the examples is actually composed of two linked projects. This can be illustrated using the CCS v3.3 project file C:\dvsdk_1_11_00_00_universal_codecs\examples\video_universal_app_empty\evmDM6437\video_universal_app_empty.cfg. In CCS v4.1 there is the main project in the directory C:\dvsdk_1_11_00_00_universal_codecs_ccs_v4\examples\video_universal_app_empty. This controls the build of the source files and also links to the RTSC configuration project C:\dvsdk_1_11_00_00_universal_codecs_ccs_v4\examples\video_universal_app_empty_configuration, which has the project's .tcf and .cfg files.

Create Application

The easiest way to create a new application is to copy and adapt an existing one from the DVSDK. This section will create a new application project called video_universal_copy_app_empty. It is based on the video_encdec project, which loops video through a video encode algorithm and a subsequent video decode algorithm for final display. [Note: Codec Engine supplies a file I/O based test application in \codec_engine_2_21\examples\ti\sdo\ce\examples\apps\universal_copy.]

Copy the video_encdec project and rename the following files:

video_encdec.cfg to video_universal_copy_app_empty.cfg

video_encdec.tcf to video_universal_copy_app_empty.tcf

video_encdec.c to video_universal_copy_app_empty.c

[Note: RTSC requires that the project has a .cfg and .tcf file with the same name as the project.]

This will create a new application called video_universal_copy_app_empty that at this stage still implements the video encode and decode operation, as it still calls the VIDENC and VIDDEC codec packages. The application will automatically pull the required codecs from the .\dvsdk_1_11_00_00\codecs_1_10 directory due to the video_universal_copy_app_empty.cfg file.

In order to use a new UNIVERSAL codec package the application must be rewritten to use the UNIVERSAL VISA API rather than VIDENC and VIDDEC.

As a summary this involves changing the .cfg file:

var UNIVERSAL_COPY = xdc.useModule('ti.sdo.codecs.universal_copy.ce.UNIVERSAL_COPY');

var Engine = xdc.useModule('ti.sdo.ce.Engine');

/* This creates an engine called "analytics" with the codec UNIVERSAL_COPY. */
/* On a DM6437 this codec is local to Codec Engine as it is on the same DSP. */
var vcr = Engine.create("analytics", [
    {name: "universal", mod: UNIVERSAL_COPY, groupId: 0, local: true},
]);

A reference copy of this video_universal_copy_app_empty application, which loops video through the algorithm UNIVERSAL_COPY, is supplied in the directory \dvsdk_1_11_00_00\examples\video_universal_copy_app_empty. The UNIVERSAL_COPY codec is supplied by Codec Engine as an XDAIS example in \codec_engine_2_21\examples\ti\xdais\dm\examples\universal_copy.

For the purposes of building this video_universal_copy_app_empty application, a copy of the codec has already been placed in codecs_1_10\packages\ti\sdo\codecs\universal_copy. The instructions to build this codec from source are described in the next section.

Note:

The example application provides a different project configuration to speed algorithm development and debugging. There is a project file dvsdk_1_11_00_00_universal_codecs_ccsv3_3\examples\video_universal_app_empty\evmDM6437s\video_universal_app_empty_simulator_no_codec_engine.pjt which adds the defines SIMULATOR and DO_NOT_USE_CODEC_ENGINE as well as changing the tcf file to change RTDX to simulator mode.

This configuration changes the application behaviour to:

- Use small input test patterns (e.g. a 32 x 32 byte matrix) rather than live video. This allows debugging with the entire known input image in a CCS memory window.

- The define INPUT422 allows the passing of 16 bit pixels to test the implementation for a device such as the DM6437 or OMAP3530, whose capture driver provides YUYV interleaved images. Without this define it defaults to 8 bit pixels, such as in a 420SP image as captured by the DM6467. This is designed for analytics images where only the Luma plane is of interest, as in the Canny algorithm.

- Allows the application to directly call the algorithm function and bypass Codec Engine infrastructure. This has the main benefits of avoiding the codec packaging process and separating the debug of the algorithm itself from that of the integration into iUniversal and codec engine.

- Build the algorithm file as part of the application build.

The DO_NOT_USE_CODEC_ENGINE option is only supplied on the example universal_copy application.

Create Codec

Again the best way to create a new codec project is to copy and modify an existing example. In this section a new copy of the UNIVERSAL_COPY codec from \codec_engine_2_21\examples\ti\xdais\dm\examples\universal_copy will be built and packaged to create a DSP codec called universal_copy. The purpose of this exercise is to understand the build and configuration steps in creating a new DSP codec.

Start the wizard by opening a DOS command shell in C:\dvsdk_1_11_00_00\xdctools_3_10_03:

xs ti.sdo.codecutils.genpackage

Step 1

Package Name: ti.fae.codecs.universal_copy - This defines final directory path in codecs directory

Module: UNIVERSAL_COPY - This defines the Module name

Version: 0.0.1

Codec Class: ti.sdo.ce.universal.IUNIVERSAL

Instruction Architecture Set: C64P

Create ce content:

Set Output Repository: C:/dvsdk_1_11_00_00_universal_codecs_ccsv3_3/dsp_alg/codec_packaging - Directory where codec package is placed.

Select "Next" .

Step 2

Leave watermark as the only boolean in use.

Select the False radio button and add the path to the DSP algorithm library C:/dvsdk_1_11_00_00_universal_codecs_ccsv3_3/dsp_alg/universal_copy/lib/debug/universal_copy.a64P

Leave True radio button with a NULL library

This page allows different versions of the library to be packaged in the codec and then chosen via .cfg file by the application.

Select "Next".

Step 3

Use "Browse for files" to add the file C:/dvsdk_1_11_00_00_universal_codecs_ccsv3_3/dsp_alg/universal_copy/universal_copy_ti.h

to the package

Select "Next"

Step 4

Browse for the appropriate executables:

cg_xml: C:/dvsdk_1_11_00_00/cg_xml_2_10

ofd6x: C:/CCStudio_v3.3/C6000/cgtools/bin/ofd6x.exe

nm6x: C:/CCStudio_v3.3/C6000/cgtools/bin/nm6x.exe

Tick the "Check to guess GetStackSize()" tick box.

Stack Size Pad %: 20

Leave ticked the "Check to update section info:" tick box.

Select "Next"

Step 5

This will automatically populate the ialgFxns box with the structure name for the algorithm. In this case UNIVERSALCOPY_TI_ALG.

Note: In the more general case, if an algorithm has been built with DMAN3 support then the idma3Fxns entry will also be populated with the algorithm IDMA3 table. This will be of the form ALGNAME_TI_IDMA3. Similarly, if IRES support has been used then there is an iresFxns entry.

Select "Next"

Step 6

This step allows alignment to be added to the algorithm's memory sections. In this case leave them blank.

Note: If the table in the wizard shows a section ".const:.string" then untick the Use box and press the "Guess Link.xdt" button to update the Link.xdt file. The section ".const:.string" contains any DSP/BIOS LOG_printf() strings, and these need to be placed by the Server into a .const section to be picked up by CCS. If this line is left in the xdt file the strings are explicitly placed separately, which breaks DSP/BIOS logging.

Select "Finish"

At this stage select yes to save the values that have been added into a package XML file. Save as C:\dvsdk_1_11_00_00_universal_codecs_ccsv3_3\dsp_alg\ti_fae_codecs_universal_copy_wizard.xml. This allows subsequent runs of the wizard to use the same settings. The next time the wizard is run, open the predefined XML configuration with File->Open

C:\dvsdk_1_11_00_00_universal_codecs_ccsv3_3\dsp_alg\ti_fae_codecs_universal_copy_wizard.xml

and simply click "Finish" and this will create an output package in

C:\dvsdk_1_11_00_00_universal_codecs_ccsv3_3\dsp_alg\codec_packaging

This directory will be the output directory of the codec packaging process.

The first time a codec is packaged, create a file containing details of the compiler: \codec_packaging\ti\fae\codecs\universal_copy\config.bld.

/*
 * ==== config.bld ====
 * User note: YOU MUST MODIFY THIS FILE TO SPECIFY THE COMPILER TOOL PATHS.
 *
 * Edit this file to specify compiler toolchain paths, and any custom
 * compiler/linker options.
 */ 
 
/* location of your C6000 codegen tools */
var C64P = xdc.useModule('ti.targets.C64P');
/* use forward slashes; rootDir is the directory that contains bin */
C64P.rootDir = "C:/CCStudio_v3.3/C6000/cgtools";
 
/* add support for building .sa files */
C64P.extensions[".sa"] = {suf: ".sa", typ: "asm:-fl"};
/*
 * ==== Build.targets ====
 * list of targets (ISAs + compilers) to build for
 */
Build.targets = [
C64P,];

Each time the codec is packaged, edit the file codec_packaging\ti\fae\codecs\universal_copy\UNIVERSAL_COPY.xdc to add the memory allocation of "DDR2".

metaonly module UNIVERSAL_COPY
{
/*!
 * ======== watermark ========
 * This config param allows the user to indicate whether to include
 * a watermark or not. 
 */
config Bool watermark = false;
/*!
 * ======== Code Section ========
 */
config String codeSection = "DDR2";
/*!
 * ======== Uninitialized Data Section ========
 */
config String udataSection = "DDR2";
/*!
 * ======== Initialized Data Section ========
 */
config String dataSection = "DDR2"; 
}

Open a DOS command shell in codec_packaging\ti\fae\codecs\universal_copy and execute

xdc release -PR .

This will create 2 tar files that form the release of the packaged codec. A codec should only ever be released as a pair of tar files. They are:

codec_packaging\ti\fae\codecs\universal_copy\ti_fae_codecs_universal_copy.tar

codec_packaging\ti\fae\codecs\universal_copy\ce\ti_fae_codecs_universal_copy_ce.tar

The two files need to be extracted to C:\dvsdk_1_11_00_00_universal_codecs_ccsv3_3\codecs_1_10_new\packages to allow them to be consumed by the application code.

How an Algorithm Can Request Hardware Resources from the System

One of the features of an XDM compliant algorithm is that it is independent of the actual silicon it is running on. This means that the algorithm does not know the details or have any dependencies on hardware resources such as EDMA channels or memory resources. The algorithm must always request resources from a resource manager in Codec Engine that is configured as part of the Application build via its cfg file.

The universal_copy example described previously requests memory from the system via the UNIVERSALCOPY_TI_alloc() function. This section will describe how an algorithm can request EDMA channels. This is complicated by the fact that different DVSDKs use different resource managers for different product families. For EDMA channels they use either DMAN or the newer and more flexible RMAN, as shown in the table below. All these DVSDKs contain support for both DMAN and RMAN, so if only a single iUniversal algorithm is being written either can be used. The main restriction arises if the algorithm is going to be integrated into an existing codec server, where it must reuse the resource manager already in use by the server.

Silicon Family	DVSDK	Resource Manager for EDMA by codecs
DM6437	1.11	DMAN
DM6446	2.00.00.22	DMAN
DM6467	2.00.00.22	RMAN
OMAP3530	3.01.00.10	DMAN

In order to provide an example of both implementations there are two additional versions of the universal_copy algorithm that use EDMA instead of memcpy(). These are universal_copy_dman and universal_copy_rman.

The implementations of these versions of the UNIVERSALCOPY algorithm can be found in the directories dvsdk_1_11_00_00_universal_codecs_ccsv3_3\dsp_alg\universal_copy_dman3 and dvsdk_1_11_00_00_universal_codecs_ccsv3_3\dsp_alg\universal_copy_ires.

Associated with these codecs are test applications in dvsdk_1_11_00_00_universal_codecs_ccsv3_3\examples\video_universal_app_empty_dman3 and dvsdk_1_11_00_00_universal_codecs_ccsv3_3\examples\video_universal_app_empty_ires.

These projects are provided as examples to use as the basis for new algorithms that require EDMA.

Canny Edge Detection Codec using VLib on DM6437

The Canny Edge Detection Algorithm is used to illustrate the creation of a "real life" analytics algorithm. It conveniently illustrates the different memory management models possible, the architecture choices to make and also creates an easily visualized demo application for display. This section describes the codec architecture, build process and integration into an application on the DM6437.

The source code to build this algorithm is available from the link in section 1.

Pre-built out files are available for loading with CCS. The code uses JP1 to determine if it is running PAL or NTSC.

Memory Management

The architecture of C64x+ core devices provides a small block of L1 Internal RAM which can be accessed at full speed by the CPU and a two layer L1/L2 cache architecture to minimize the performance impact of having to access the relatively slow external DDR2 memory. The exact size of these blocks varies between devices. On the DM6437 there is 64k of mapped L1 RAM available for data and on the DM6467 there is only 32k of mapped L1 RAM. One of the keys to successful DSP optimization is to make the best use of this available memory. In order to generate an algorithm that is portable between these two processors it is necessary to architect the codec to fit in the smaller 32kbyte L1 RAM. In most cases the optimal solution is to process the data (in this case an image) in slices. This involves using DMA to quickly copy a slice of data into internal memory, process it with the data in internal memory and then use DMA to output the data back to external memory.

Canny Algorithm

The VLib (v2.1) provides 5 functions that are used to implement the steps of a Canny Edge Detector. The input to the Canny Filter should be an image with 8 bit Luminance data only.

Gaussian Filtering, referred to below as VLIB_gaussianFilter7X7(): This smooths the raw input 8 bits per pixel (bpp) Luma image with a 7x7 Gaussian filter. It actually comes from the IMG v2.01 library rather than VLib. It operates on rows of the image and so is a candidate for slicing.

Gradient Calculation, VLIB_xyGradientsAndMagnitude(): This takes the smoothed 8bpp image and calculates 3 gradient images. These are the horizontal gradient (GradX), vertical gradient (GradY) and gradient magnitude (GradMag). It operates on rows of the image and so is a candidate for slicing.

Non Maximal Edge Suppression, VLIB_nonMaximumSuppressionCanny(): This takes the 3 gradient images and calculates which pixels are possible edges. It operates on rows of the image and so is a candidate for slicing.

Double Thresholding, VLIB_doubleThresholding(): This takes the possible edges frame and uses hysteresis thresholding to identify possible edges. It outputs an array containing the indices of possible edges. It operates on rows of the image and so is a candidate for slicing.

Edge Relaxation, VLIB_edgeRelaxation(): This function takes the possible edges and tries to link them to form continuous "definite" edges. This function does not operate on discrete lines of data as it must be able to link together any number of pixels in any direction. Therefore, it is not possible to implement this function in slices. It must use the cache to operate on the full frame in external DDR2.

Image Format Conversion

The VLib functions only operate on Luma data, ignoring the Chrominance bytes from the camera. On the DM6437 (and OMAP3530) the video data is captured in an Interleaved YUYV 422 format and so the Luminance values must be extracted from the interleaved data. On the DM6467 however the video is captured in a semiplanar 420 format with a single Luma plane. Therefore, in order for the codec to be portable it must be able to handle both of these formats. When required the codec uses the following function to extract the Luma from the 422 Interleaved (YUYV) frame:

This is a VLib function to extract only the 8bpp Luma data from the input YUYV 422 Frame.

The recreation of the 16bpp YUYV frame is done within the loop through the edge map, which clears all possible edges (value 127) to 0, leaving only the definite edges (value 255). This loop inserts a neutral chroma value of 0x80 between each luma edge value.

The codec is passed the parameter VidAnalyticsParams.inputChromaFormat which specifies the format to be XDM_YUV_422ILE (DM6437 or OMAP3530) or 8 bit planar XDM_YUV_420SP (DM6467).

The overall flow of data in the example application is shown in Figure 2.

Figure 2: Algorithm Flow

Building the example code

The Canny code is found in the directory C:\dvsdk_1_11_00_00_universal_codecs_source\dsp_alg\universal_canny_dman3. This is the source code for the Canny algorithm and is built in CCS using the project C:\dvsdk_1_11_00_00_universal_codecs_ccsv3_3\dsp_alg\canny\pjt\universal_canny_dman3.pjt. The codec is built either to use external memory buffers and the DSP's cache or to use DMA channels to bring small slices into DSP internal memory for processing. These will be referred to as "cache" and "slice" modes of operation. The mode of operation is selected by the application using the algorithm's parameter frameNotSlice. This version of the algorithm uses DMAN3 to request EDMA channels.

Once the code is built the package wizard must be used to build the appropriate codec package into the package release dir as described in section 3.2. The package wizard configuration in dsp_alg\ti_fae_codecs_universal_canny_dman3_wizard.xml can be used for this.

The wizard places the packaged codec in dsp_alg\codec_packaging. After packaging, the UNIVERSAL_CANNY_DMAN3.xdc file must be edited to add the "DDR2" placement. Finally the package can be released by opening a command window in dsp_alg\codec_packaging\ti\fae\codecs\universal_canny_dman3 and running

xdc release -PR .

which will generate the codecs release .tar files in

These two release tar files need to be extracted to the codecs_1_10_new\packages directory so they can be consumed by the application side source code in \examples\video_canny_universal_dman3\evmDM6437 . The application is built in CCS using video_canny_universal.pjt which pulls in the requested "CANNY" DSP codec via the file \examples\video_analytic\evmDM6437\video_canny_universal.cfg with the line:

var CANNY = xdc.useModule('ti.fae.codecs.universal_canny_dman3.ce.UNIVERSAL_CANNY_DMAN3');

Using the Cache to process data from External Memory

The benchmarking of the Canny Edge Detection function using the default external frame + cache based codec is shown in Table 1. This was on a DM6437 EVM at 594MHz on a D1 PAL input stream. The benchmarking is found using CCS's log tracing capability in the LOG buffer CANNY_TI_trace.

Function	Time(ms)
Extract Luma	3.9
Fill Luma	10.6
Total pre+post processing	14.5
Gaussian Filtering	7.0
Gradient Calculation	16.9
Non Maximal Edge Suppression	12.4
Double Thresholding	8.0
Edge Relaxation	2.2
Total Canny	46.5

Table 1: Frame + Cache Benchmarking

This gives a total time of 61.0ms per frame (14.5ms of pre and post processing plus 46.5ms of Canny), which corresponds to a frame rate of about 16fps.

The configuration file \video_canny_universal.tcf takes the default DM6437 EVM cache settings from examples\common\evmDM6437\common.tci

This tci file also sets up a 64k internal heap in L1 mapped memory. The DSP codec requests memory to be allocated from the heaps in CANNY_TI_alloc(). However, as all the arrays (except pZeroLine) that are requested for the processing buffers are of size >64k they cannot be allocated from this heap and so they default to the external heap.

Using Slices to process data in Internal L1 Memory

As noted before, all the functions except VLIB_edgeRelaxation() operate on lines in the image. This means that they are suitable for processing in slices. The principle of using slices can be summarized as follows:

The benefit of this technique scales with the number of data operands that the function has to process. The cache is able to bring in lines of 128 bytes to on chip memory but there is still a significant penalty for the first cache miss. The slicing technique takes advantage of a priori knowledge of the algorithm which allows more efficient use of the DMA to bring in Kbytes of required data as efficiently as possible.

Applying Slicing to the Pre and Post Processing Algorithms

Applying slicing to the pre/post processing functions would be simple as there is a direct pixel to pixel mapping between the input and output arrays. If the image is broken up into slices each with OUTPUT_SLICE_SIZE (=7) lines, then each chunk of 720*7 pixels can be copied into a buffer in internal L1 memory by DMA, processed and then copied out again by DMA. If the algorithm is only to be ported to the DM6437 then it would make sense to move the pre and post processing routines to the codec side where they can share the internal RAM. However, the aim of this document is to describe how to make a portable codec so this optimization is not included in the source code.

The improvement that could be gained is shown below.

Function	Frame/Cache (ms)	Slicing (ms)
Luma Extraction	3.9	1.7
Luma Filling	10.6	8.0
Total Pre/Post Processing	14.5	9.7

The algorithm uses the ACPY3 interface in the codec engine framework to control access to the EDMA channel via DMAN3. The IDMA3 interface is used to add the function vTable CANNY_TI_IDMA3. For more details on the use of IDMA3 and ACPY3 see the DMAN3/ACPY3 User's Guide.

Applying Slicing to the Canny Algorithm

The implementation of the Canny Edge Detection algorithm in CannyEdgeDetectorInSlices() involves more complex management of the slices. This is a result of the signal processing in the functions producing output slices that are smaller than the input slice as shown in Figure 3.

Moving backwards through the algorithm, the functions VLIB_doubleThresholding() and VLIB_nonMaximumSuppressionCanny() require N+2 lines of gradient images to generate N lines of Canny edges. Both these functions use the same buffer. See the VLib API documentation for more details.

Similarly, VLIB_xyGradientsAndMagnitude() requires N+4 lines of smoothed image to generate N+2 lines of gradient images, which then produces N lines of Canny edges.

Finally, VLIB_gaussianFilter7X7() requires N+10 lines of raw image to generate N+4 lines of smoothed image.

However, the last function of the Canny Edge detection, VLIB_edgeRelaxation(), must be calculated on a complete frame as it does not operate on discrete lines in the image. This means that its inputs must be in external DDR2. The impact of this on the overall algorithm will be discussed after the slicing.

Figure 3: Slice Management in Canny Edge Detector

In order to be able to fit all the slice buffers in the 64k of L1 Heap available on the DM6437, the number of output lines per slice N was set to 7. This means that to generate each 7 lines of Canny Edge data, 17 lines of raw image must be read in and processed. Each subsequent slice of N Canny Edge lines can take advantage of the fact that, due to this overlap, some of the intermediate lines are already in the internal memory buffers.

As an example for the smoothed image slice, the last 4 lines calculated will be reused in calculating the next slice of N+2 gradient image lines. Therefore, each loop in the code will copy the last 4 lines of the smoothed image slice from the previous loop to the top of the slice and only the last N lines of gradient image will be calculated. This optimization means that each line in the raw image is only read in from external memory once and each intermediate line calculated once.
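The line counts above can be sketched as a small helper. This is only an illustration of the arithmetic in the text, assuming the N+2 / N+4 / N+10 relationship between the buffers; the type and function names are hypothetical and are not part of VLib or the codec source.

```c
#include <stdio.h>

/* Hypothetical helper type: lines held in each buffer for one slice. */
typedef struct {
    int edge_lines;      /* N lines of Canny edges produced           */
    int gradient_lines;  /* N+2 lines needed by NMS / thresholding    */
    int smoothed_lines;  /* N+4 lines needed by the gradient stage    */
    int raw_lines;       /* N+10 lines needed by the 7x7 Gaussian     */
} SliceGeometry;

/* Work backwards from N output lines to the size of each input window. */
SliceGeometry slice_geometry(int n)
{
    SliceGeometry g;
    g.edge_lines     = n;
    g.gradient_lines = n + 2;
    g.smoothed_lines = n + 4;
    g.raw_lines      = n + 10;
    return g;
}

/* Raw lines fetched from DDR2 for a given slice: the first slice fills
 * the whole window, later slices reuse the overlap and fetch N new lines. */
int raw_lines_to_fetch(int n, int slice_index)
{
    return (slice_index == 0) ? slice_geometry(n).raw_lines : n;
}

/* Total raw lines fetched over a run of consecutive slices. */
int total_raw_lines_fetched(int n, int slices)
{
    int total = 0;
    for (int i = 0; i < slices; i++) {
        total += raw_lines_to_fetch(n, i);
    }
    return total;
}

/* Print the window sizes for one choice of N. */
void print_slice_geometry(int n)
{
    SliceGeometry g = slice_geometry(n);
    printf("N=%d: gradient=%d smoothed=%d raw=%d\n",
           g.edge_lines, g.gradient_lines, g.smoothed_lines, g.raw_lines);
}
```

Calling slice_geometry(7) gives a raw window of 17 lines, matching the figure quoted above, and slice_geometry(3) gives 13 lines.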

The final function VLIB_edgeRelaxation() must read the possible edge image from external DDR2 and write its output definite edge image back to DDR2. For optimal performance this means that it must use caching. This introduces the issue of cache coherency, as the cache has not been aware of the slice processing in internally mapped RAM that calculated the possible edges. Therefore, the cache may contain stale ("dirty") values for this image. In order to force the cache to read the correct image from external memory, the cache must be invalidated before calling the function. Similarly, after the function has been called the cache must undergo a WriteBack operation to ensure that all the output definite edge data is written into DDR2. This ensures that the drivers output a clean edge image to the display.
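The ordering of the cache operations around the frame-based stage can be sketched as follows. This is a host-side illustration only: the CacheOps hooks and the stand-in stage are hypothetical names. On the DSP the hooks would presumably wrap the platform cache calls (for example the DSP/BIOS BCACHE invalidate and write-back functions), but that mapping is an assumption to check against the BIOS version in use.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hooks standing in for the platform cache operations. */
typedef void (*CacheOp)(void *addr, size_t bytes);
typedef void (*FrameStage)(const unsigned char *in, unsigned char *out,
                           size_t bytes);

typedef struct {
    CacheOp invalidate;  /* drop cached lines so reads come from DDR2 */
    CacheOp writeback;   /* push cached output lines out to DDR2      */
} CacheOps;

/* Invalidate the input, run the cached frame stage, write back the output. */
void run_cached_stage(const CacheOps *ops, FrameStage stage,
                      unsigned char *in, unsigned char *out, size_t bytes)
{
    ops->invalidate(in, bytes);
    stage(in, out, bytes);
    ops->writeback(out, bytes);
}

/* Host stand-ins that only record the order in which they are called. */
static char call_trace[4];
static size_t call_count;

static void record(char tag)
{
    if (call_count < sizeof call_trace - 1) {
        call_trace[call_count++] = tag;
    }
}

static void host_invalidate(void *addr, size_t bytes)
{
    (void)addr; (void)bytes;
    record('I');
}

static void host_writeback(void *addr, size_t bytes)
{
    (void)addr; (void)bytes;
    record('W');
}

static void host_stage(const unsigned char *in, unsigned char *out,
                       size_t bytes)
{
    memcpy(out, in, bytes);  /* placeholder for the real frame function */
    record('S');
}

/* Run the sequence once and return the recorded order of calls. */
const char *demo_call_order(void)
{
    static unsigned char in[16], out[16];
    CacheOps ops = { host_invalidate, host_writeback };

    call_count = 0;
    memset(call_trace, 0, sizeof call_trace);
    run_cached_stage(&ops, host_stage, in, out, sizeof in);
    return call_trace;
}
```

Keeping the cache calls behind a small struct of function pointers is just one way to let the same wrapper build on the host and on the DSP; the codec itself may call the cache API directly.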

However, the aim is to make a portable codec that will run in either the 64k on DM6437 or 32k in DM6467. This means that the algorithm must be sized to use the 32k. There are two optimisations used here as shown in Figure 4.

- Reducing the number of lines N per slice to 3. This is controlled via #define OUTPUT_SLICE_SIZE 3

- Letting the Gaussian smoothing function read its input from external memory through the cache while still writing its output to internal RAM. Each input pixel is reread 49 times and the pixels are sequential, so they are fetched as lines by the cache, which gives a very high cache hit rate. This mode is enabled via #define SMOOTH_EXTERNAL

These reduce L1 RAM memory usage to less than 32k.

Figure 4: Using cache for Gaussian Filter
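To see why both optimisations are needed, the internal memory budget can be estimated with a rough model. The sketch below is not taken from the codec source: it assumes a 720 pixel line, 8-bit raw, smoothed and edge buffers, and three 16-bit gradient planes (X, Y and magnitude). The real codec may hold additional state, so the totals are indicative only.

```c
#include <stddef.h>

#define LINE_WIDTH 720u  /* assumed D1 line width in pixels */

/* Estimate the internal RAM (in bytes) used by the slice buffers for
 * n output lines. When smooth_external is non-zero the raw image is read
 * through the cache, so no internal raw buffer is counted.
 * The buffer layout here is an assumption, not the codec's actual layout. */
size_t slice_buffer_bytes(unsigned n, int smooth_external)
{
    size_t raw      = smooth_external ? 0u : (size_t)(n + 10u) * LINE_WIDTH;
    size_t smoothed = (size_t)(n + 4u) * LINE_WIDTH;
    size_t gradient = (size_t)(n + 2u) * LINE_WIDTH * 3u * sizeof(short);
    size_t edges    = (size_t)n * LINE_WIDTH;

    return raw + smoothed + gradient + edges;
}
```

Under these assumptions slice_buffer_bytes(7, 0) comes to 64080 bytes, which just fits in 64k, while slice_buffer_bytes(3, 0) comes to 38160 bytes and slice_buffer_bytes(3, 1) to 28800 bytes, so in this model only the combination of N=3 and SMOOTH_EXTERNAL drops below 32k.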

The benchmarking of the slicing implementation is shown in Table 2 which also shows for comparison the caching data.

	Slice N=3 (ms)	Slice N=7 (ms)	Gaussian Cache, Slice N=3 (ms)	Cache (ms)
Gaussian Filtering	4.8	4.8	5.4	7.0
Gradient Calculation	0.7	0.8	0.7	16.9
Non-maximal Suppression	5.7	5.8	5.7	12.4
Double Thresholding	2.4	2.9	2.4	8.0
Edge Relaxation	2.1	3.0	2.0	2.2
Slice DMA/cache management	4.8	2.5	3.1
Canny Total	20.5	19.8	19.3	46.5
Pre-processing - Luma Extraction	1.7	1.5	1.7	3.9
Post-processing - Chroma Insertion	8.3	8.0	8.3	10.6
Pre/Post Processing Total	10.0	9.5	10.0	14.5
Total	31.2	28.8	30.3	61.3

Table 2: Benchmarking of Slicing vs Caching on DM6437

These results allow the following conclusions to be drawn:

- Use of internal memory via slicing brings a major performance gain, with a 50% reduction in time for this algorithm compared to use of the cache.

- The overhead for managing the slicing with EDMA channels, while noticeable at about 3ms, is negligible compared to the overall system gain.

- The use of even small slices such as N=3 shows only a slight performance hit of 5% compared to N=7 while using a significantly smaller amount of internal RAM.

- Analysis of the relative performance of individual functions, such as the Gaussian in cache and slice modes, allowed a significant internal memory saving to be made on the largest buffers with minimal impact on performance.

- As a general rule, functions such as the Gaussian (IMG_conv_7x7_i8_c8s()), which do 49 data reads for each write, are not impacted much by reading via the cache. On the other hand, a function such as VLIB_xyGradientsAndMagnitude(), which does 3 writes per 6 reads, is heavily impacted by the cache.

Taking Codec Package and integrating in a Codec Server for a DM6467

In order for the DSP codec algorithm to be run on a dual core ARM+DSP device such as a DM6467, the codec must be integrated into a codec server package that will run in Codec Engine on the DSP. This section explains how to migrate the codec package generated for the DM6437 to a codec server and then create a Linux application on the ARM9 to run it. The DM6467 code is built with the GA release of DVSDK2.00, which is DVSDK2.00.00.22.

The DM6467 codec server defaults to using RMAN to allocate its resources, and so a separate version of the Canny algorithm which uses RMAN is found in \dsp_alg\universal_canny_ires. The algorithm is the same; only the iUniversal wrapper and the DMA functions are different.

Building the Codec Server

The Codec Server is built in two steps. It is firstly packaged by the wizard in \CE_UTILS and then built by make in the DVSDK.

Packaging the Codec Server using the Wizard

The Codec Server must be packaged into the Unit Server using the DVSDK 1.11 version (i.e. on the PC for the DM6437). This is because this DVSDK built the codec package with version 1.07 of the Ceutils. The GA release of the DM6467 DVSDK only includes Ceutils 1.06 and so is unable to build the unit server from the later codec package.

Open a command window in the directory C:\dvsdk_1_11_00_00\xdctools_3_10_03 and run the wizard

xs ti.sdo.codecutils.genserver

When the wizard opens, populate the Codec Package, Platform and Output Repository names as follows. Note that the output repository is the path to the DVSDK 2.00 codec combos directory on the Linux machine.

Codec Package name: ti.fae.codecs.universal_canny_ires
Module name: automatically filled from the codec name
Platform: ti.platforms.evmDM6467
Server package name: automatically filled from the module and platform names, BUT reduce the length of the name to ti.fae.servers.unicanny_dm6467 to avoid a DSP/Link internal limitation. The server name (unicanny_dm6467) needs to be < 24 chars to avoid a limit of 32 chars (DSP_MAX_STRLEN), including the terminating NULL, on the names of MSGQs.

If ever in doubt, the concatenated name that is subject to the 32 char limit can be found by running the application with CE_DEBUG=2 and looking for the following trace

OC - Comm_create('universalcanny_dm6467_1382_1', ...) failed: status 0x8000800b

Set Output Server Repository: X:/dvsdk_2_00_00_22/trunk/dm6467_dvsdk_combos_2_05/packages

Click Finish

If the wizard reports that it cannot find the package then in the DOS shell add the codec package path to the XDCPATH

set XDCPATH=C:\dvsdk_1_11_00_00_universal_codecs_working\codecs_1_10_new/packages;%XDCPATH%

Now export the codec tar files for universalcanny (eg ti.fae.codecs.universalcanny.tar & ti.fae.codecs.universalcanny.ce.tar) to the Linux DVSDK by extracting them to ~/dvsdk_2_00_00_22_Canny_iUniversal/dm6467_dvsdk_combos_2_05/packages

Note: This packaging step only needs to be done once. If subsequent changes are only to the codec the only step required is that of exporting the new codec files to the DVSDK.

Building Codec Unit Server in the DVSDK

Before making the codec unit server in the Linux DVSDK a couple of configuration steps are required

1. Edit /dvsdk_2_00_00_22_Canny_iUniversal/dm6467_dvsdk_combos_2_05/packages/ti/fae/servers/unicanny_dm6467/codec.cfg to set the three algorithm memory sections (code, uninitialised data and data) to "DDR2".

UNIVERSAL_CANNY_IRES.alg.codeSection = "DDR2"; 
UNIVERSAL_CANNY_IRES.alg.udataSection = "DDR2"; 
UNIVERSAL_CANNY_IRES.alg.dataSection = "DDR2";

2. Tell the codec server to link in the VLib and IMGLib libraries used by the codec by adding the following lines to unicanny_dm6467/link.cmd. Note that the imglib2.l64P library has a capital P in the extension.

-l vlib.l64p
-l imglib2.l64P

3. Copy these libs from their install dirs under CCS (C:\CCStudio_v3.3\c6400\VLIB_V_2_1\library\c64plus) to \dvsdk_2_00_00_22_Canny_iUniversal\dm6467_dvsdk_combos_2_05\packages\ti\fae\servers\unicanny_dm6467.

4. Open the file server.tcf in a text editor to add to BIOS the LOG buffer CANNY_TI_trace that is used by the codec. Add the two lines with the comments at the end of the file as shown below:

if (config.hasReportedError == false) {
 
bios.MEM.instance("IRAM").createHeap = 1;
bios.MEM.instance("IRAM").heapSize = 0x00018000;
bios.MEM.instance("IRAM").enableHeapLabel = 1;
bios.MEM.instance("IRAM").heapLabel = prog.extern("IRAM_HEAP", "asm");
 
/* Add these two lines to create a LOG buffer of length 1024 words */
bios.LOG.create("CANNY_TI_trace");
bios.LOG.instance("CANNY_TI_trace").bufLen = 1024;
/* end of add CANNY_TI_trace */
 
// !GRAPHICAL_CONFIG_TOOL_SCRIPT_INSERT_POINT!
 
 prog.gen();
}

5. Build the server as follows.

#cd dvsdk_2_00_00_22_Canny_iUniversal 
#make codecs

This will build the executable file, which is sufficient for further development on the local machine: \dvsdk_2_00_00_22_Canny_iUniversal\dm6467_dvsdk_combos_2_05\packages\ti\fae\servers\unicanny_dm6467\unicanny_dm6467.x64P

In order to generate a distributable tar package for the server it will be necessary to run the xdc packaging tool:

#cd ~/dvsdk_2_00_00_22_Canny_iUniversal/dm6467_dvsdk_combos_2_05/packages/ti/fae/servers/unicanny_dm6467
#/opt/xdctools_3_10_03/xdc release -PR .

Building the Linux Application

The ARM9 Linux applications are packaged in a standalone directory dvsdk_2_00_00_22_Canny_iUniversal which contains only the new codecs/servers, an updated DMAI to support composite capture and the applications (in file dvsdk_2_00_00_22_Canny_iUniversal_v1_0.tar.gz). This standalone directory has a Rules.mak file that must be edited to point to the main DVSDK directory to pick up all the components. The configuration of each application to select the server is done as follows:

The configuration file canny.cfg includes the specific unit server as follows:

var demoEngine = Engine.createFromServer( "unicanny_dm6467",
"./unicanny_dm6467.x64P",
"ti.fae.servers.unicanny_dm6467");

The application hardcodes the name of the server to use in the file codecs.c. This name is the one given in the first parameter of the createFromServer() API

static Engine encodeDecodeEngine = {
"unicanny_dm6467", /* Engine string name used by CE to find the engine */
NULL, /* Speech decoders in engine (not supported) */
NULL, /* Audio decoders in engine (not supported) */
NULL, /* NULL terminated list of video decoders in engine */
NULL, /* Speech encoders in engine (not supported) */
NULL, /* Audio encoders in engine (Not supported) */
NULL /* NULL terminated list of video encoders in engine */};

and the name of the codec in the server in video.c

/* Create video analytics instance */
 hVaEnv.hAnalytics = UNIVERSAL_create(hEngine, "universal_canny_ires", (UNIVERSAL_Params *)&VanalyticsParams);
 
 if (hVaEnv.hAnalytics == NULL) {
 ERR("Failed to create video analytics codec: %s\n", "universal_canny_ires");
 cleanup(THREAD_FAILURE);
 }

The DM6467 captures video frames on the Linux side in a 420SemiPlanar format, which has a single plane of luma data followed by a separate plane of interleaved chroma data. This means that there is no pre/post-processing to be done on the application side to pass the data to the DSP codec.
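As an illustration of why no luma extraction is needed here, the sketch below computes where the planes sit in a 4:2:0 semi-planar frame. It assumes the common layout in which the chroma plane immediately follows the luma plane and holds interleaved Cb/Cr samples at half vertical resolution; the struct and function names are hypothetical, and the actual driver buffer may add padding.

```c
#include <stddef.h>

/* Hypothetical description of a 4:2:0 semi-planar frame buffer. */
typedef struct {
    size_t luma_offset;    /* start of the Y plane                    */
    size_t luma_bytes;     /* pitch * height                          */
    size_t chroma_offset;  /* start of the interleaved CbCr plane     */
    size_t chroma_bytes;   /* pitch * height / 2                      */
} SemiPlanar420Layout;

/* Compute plane offsets and sizes, assuming the chroma plane directly
 * follows the luma plane with the same line pitch. */
SemiPlanar420Layout semi_planar_420_layout(size_t pitch, size_t height)
{
    SemiPlanar420Layout l;
    l.luma_offset   = 0;
    l.luma_bytes    = pitch * height;
    l.chroma_offset = l.luma_bytes;
    l.chroma_bytes  = pitch * (height / 2);
    return l;
}
```

Because the luma samples are already contiguous, a pointer to luma_offset can be handed to the codec as the 8-bit input image without any copying.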

Build and install the application from a terminal window

#cd dvsdk_2_00_00_22_Canny_iUniversal
#make demos
#make install

The install will copy the application canny to the EXEC_DIR.

Running the Linux Application

A pre-built executable directory is provided in dvsdk_2_00_00_22_6467_canny_executable_v1_0.tar.gz and should be extracted to the /root directory of the MV 5.0 file system (remember to copy your .ko files into /6467_canny). Connect the D1 camera to the composite input J13 and the D1 composite display to J7 on the DM6467 EVM. The particular EVM used for benchmarking is configured to run the DSP at 729MHz and ARM at 364MHz. Then boot it and cd to the EXEC_DIR in the NFS.

#./loadmodules.sh
#./canny

It will log the performance of the frame and slice based versions as it runs. The performance achieved is shown below:

	Slice (ms)	Cache (ms)
Gaussian Filtering	5.9	7.0
Gradient Calculation	0.7	18.2
Non-maximal Suppression	5.0	16.3
Double Thresholding	1.9	14.1
Edge Relaxation	0.8	1.3
Chroma Insertion(Edge cleanup on 6467)	1.2	5.8
Slice Overhead	3.0
Total	18.6	62.8

Table 3: Benchmarking of Slicing vs Caching on DM6467

This means that with the slicing on the DM6467 it can achieve 25fps PAL with only 50% DSP CPU load but when the cache is used the maximum frame rate is 16fps with a 100% DSP load.
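The load and frame-rate figures follow from the per-frame totals in Table 3. A minimal sketch of the arithmetic, with hypothetical function names:

```c
/* DSP load in percent when each frame costs frame_ms of DSP time
 * and frames arrive at fps frames per second. */
double dsp_load_percent(double frame_ms, double fps)
{
    return frame_ms * fps / 10.0;  /* (ms * fps / 1000) * 100 */
}

/* Highest frame rate sustainable when the DSP is fully loaded. */
double max_frame_rate(double frame_ms)
{
    return 1000.0 / frame_ms;
}
```

With the slice total of 18.6 ms, dsp_load_percent(18.6, 25.0) is 46.5%, which is the "about 50%" quoted above; with the cache total of 62.8 ms, max_frame_rate(62.8) is just under 16 fps.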

Taking Codec Package and integrating in a Codec Server for an OMAP3530

This section explains how to migrate the codec package generated for the DM6437 to a codec server for an OMAP3530 and then modify a DVSDK Linux application on the Cortex A8 to run it. The application is based upon the DVSDK decode demo from DVSDK 3.01.00.10. This version of the DVSDK does not have an encode application which will capture a video input, and so the demo with the Canny edge detector uses the decode demo to decode a D1 720x480 stream which can then be edge filtered before display on the LCD. As the DSP is already 70% utilised for the decode at 30fps, the decode + Canny edge filter run at a reduced frame rate as they must share the DSP. This project can be found in dvsdk_3_01_00_10_Canny_iUniversal_v1_0.tar.gz. It is built in a separate directory (dvsdk_3_01_00_10_Canny_iUniversal) and refers to the main DVSDK for all the components.

The OMAP3530 codec server defaults to using DMAN3 to allocate its resources and so it uses the universal_canny_dman3 codec. A prebuilt version of the demo is in the file dvsdk_3_01_00_10_canny_executable_omap3530_v1_0.tar.gz and should be extracted to /home/root in the Arago filesystem (remember to copy your Linux kernel's .ko files into /canny). In order to pick up this application the /etc/init.d/omap-demo file will need to be modified to:

#!/bin/sh

echo 65355 > /sys/devices/platform/omapfb/sleep_timeout
if [[ -x /home/root/canny/loadmodules.sh ]]; then
   cd /home/root/canny
        ./loadmodules.sh
fi

The decode_canny application can now be run from the command line as follows (or the shell script run.sh can be used)

./decode_canny -i -o -v data/videos/davincieffect_ntsc_1_50s.m4v -a data/sounds/davincieffect_HEv2.aac  -l

Building the Codec Server

Codec Engine can only have one server package opened at once and so the Canny codec must be added to the existing server package that includes the video decoders. This is a different process to that carried out on DM6467 when there was a new server created. The following instructions describe how to add the codec to a server. Either of these techniques can be used on either a DM6467 or OMAP35xx project as appropriate and so they are not specific to the device.

Install the Codec server package (cs1omap3530_setupLinux_1_01_00-prebuilt-dvsdk3.01.00.10.bin for OMAP3530 from http://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/dvsdk/DVSDK_3_00/latest/index_FDS.html) to ~/dvsdk_3_01_00_10_Canny_iUniversal

Export the codec tar files for universal_canny_dman3 (eg ti.fae.codecs.universal_canny_dman3.tar & ti.fae.codecs.universal_canny_dman3.ce.tar) to the Linux DVSDK 3.01.00.10 by extracting them to ~/dvsdk_3_01_00_10_Canny_iUniversal/cs1omap3530_1_01_00/packages

Before making the codec server in the Linux DVSDK the following configuration steps are required to extend the existing server to include the new canny filter codec:

1. Edit /dvsdk_3_01_00_10_Canny_iUniversal/cs1omap3530_1_01_00/packages/ti/sdo/servers/cs/codec.cfg to add the following lines which describe the codec itself:

var UNIVERSAL_CANNY = xdc.useModule('ti.fae.codecs.universal_canny_dman3.ce.UNIVERSAL_CANNY_DMAN3');
// Package Config
UNIVERSAL_CANNY.alg.watermark = false; 
UNIVERSAL_CANNY.alg.codeSection = "DDR2"; 
UNIVERSAL_CANNY.alg.udataSection = "DDR2"; 
UNIVERSAL_CANNY.alg.dataSection = "DDR2";

and then the following lines to the Server.algs structure to add that codec to the server

{name: "universalcanny", mod: UNIVERSAL_CANNY, threadAttrs: {
 stackMemId: 0, priority: Server.MINPRI + 1},
 groupId: 0,
},

This adds the codec UNIVERSAL_CANNY_DMAN3 to the server. The groupId is used to identify which group of codecs this codec should be assigned to in the server. Each group of codecs in the server will share the same scratch memory and EDMA resources. These resources are allocated in the cfg files.

2. Tell the codec server to link in the VLib and IMGLib libraries used by the codec by adding to ti/sdo/servers/cs/link.cmd

-l vlib.l64p
-l imglib2.l64P

3. Open the file server.tcf in a text editor to add to BIOS the LOG buffer CANNY_TI_trace that is used by the codec. Add the two lines with the comments at the end of the file as shown below:

/* Add these two lines to create a LOG buffer of length 1024 words */
bios.LOG.create("CANNY_TI_trace");
bios.LOG.instance("CANNY_TI_trace").bufLen = 1024;
/* end of add CANNY_TI_trace */
 
 
/* ===========================================================================
 * Generate configuration files...
 * ===========================================================================
 */
if (config.hasReportedError == false) {
 prog.gen();
}

4. Copy the libs vlib.l64p and imglib2.l64P from their install dirs under CCS (C:\CCStudio_v3.3\c6400\VLIB_V_2_1\library\c64plus) to /dvsdk_3_01_00_10_Canny_iUniversal/cs1omap3530_1_01_00/packages/ti/sdo/servers/cs.

With this configuration the cs engine will be rebuilt next time the DVSDK is built and then installed correctly to the NFS.

Building the Linux Application

The Cortex A8 Linux application ./decode_canny is in the directory \dvsdk_3_01_00_10_Canny_iUniversal\dvsdk_demos_3_01_00_13\omap3530\decode_canny. This application simply uses the cs.x64p server with whichever Canny algorithm it was built with. It adds a call to the Canny codec after the decode codec is called.

Build and install the application from a terminal window

#cd dvsdk_3_01_00_10_Canny_iUniversal
#make
#make install

The install will copy the application decode_canny to the EXEC_DIR.

Running the Linux Application

The standard Arago filesystem on the DVSDK will autorun the demo through the /etc/init.d/omap-demo file, which by default configures the memory and modules by running ./loadmodules.sh and then runs the interface application. This file can be edited as follows to only do the configuration stage:

#!/bin/sh

echo 65355 > /sys/devices/platform/omapfb/sleep_timeout
if [[ -x /home/root/canny/loadmodules.sh ]]; then
   cd /home/root/canny
        ./loadmodules.sh
fi

The decode_canny application can now be run from the command line as follows

./decode_canny -i -o -v data/videos/davincieffect_ntsc_1_50s.m4v -a data/sounds/davincieffect_HEv2.aac  -l

This will decode the davincieffect video and display the edge map with a performance overlay on the LCD. As the DSP has to do both the decode and canny filtering in this demo the frame rates are reduced. Using the cache mode a frame rate of 8fps (video decode + canny) is achieved and with the slice mode 13fps. The performance breakdown of the Canny algorithm is shown in Table 4.

	Slice (ms)	Cache (ms)
Luma extraction	3.0	10.9
Gaussian Filtering	6.1	7.7
Gradient Calculation	0.9	18.5
Non-maximal Suppression	6.3	20.3
Double Thresholding	2.3	6.8
Edge Relaxation	1.0	1.2
Chroma Insertion + Edge cleanup	9.7	11.7
Slice Overhead	4.8
Total	41.4	79.1

Table 4: Benchmarking of Slicing vs Caching on OMAP3530

These numbers again show the benefit of using internal memory. The OMAP3530 EVM is running with a DSP clock speed of 450MHz as compared to the DM6467 which was running the DSP at 729MHz. In addition the OMAP codec has to run the luma extraction algorithm and then reinsert neutral chroma, as the Linux driver uses an interleaved YUV format.
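The extra pre/post-processing on OMAP3530 can be pictured with the sketch below. It assumes a packed 4:2:2 buffer in UYVY byte order (U0 Y0 V0 Y1); the byte order used by the actual driver is not stated here, and the function names are hypothetical. The codec's own routines are likely optimised differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy the luma samples out of a packed UYVY buffer into a planar
 * 8-bit image. In UYVY every second byte, starting at index 1, is luma. */
void extract_luma_uyvy(const uint8_t *uyvy, uint8_t *luma, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        luma[i] = uyvy[2 * i + 1];
    }
}

/* Write an 8-bit edge image back into a packed UYVY buffer, setting
 * every chroma byte to 0x80 (neutral) so the result displays as grey scale. */
void insert_luma_neutral_chroma_uyvy(const uint8_t *luma, uint8_t *uyvy,
                                     size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        uyvy[2 * i]     = 0x80;     /* U or V, neutral */
        uyvy[2 * i + 1] = luma[i];  /* Y               */
    }
}
```

Each call touches every pixel of the frame, which is consistent with the luma extraction and chroma insertion rows being a visible part of the OMAP3530 totals in Table 4.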

Creating a Codec on DM644x

There are no demo packages here for creating a codec server on DM644x. In principle the process is exactly the same. The only known issue is that if your codec is using ACPY3 to access the DMA channels then there is a small change to make in the server.cfg file. ACPY3 uses the IDMA controller (Internal DMA controller, see SPRU871K TMS320C64x+ DSP Megamodule Reference Guide for details) which can only access internal memory. This means that DMAN3 memory needs to be allocated from internal memory rather than the default server configuration of external. To do this, make the following change to server.cfg to allocate DMAN3.heapInternal from an internal memory pool "L1DHEAP".

/*
 *  This setting would affect performance very lightly.
 *
 *  By setting DMAN3.heapInternal = <external-heap>  DMAN3 *may not* supply
 *  ACPY3_PROTOCOL IDMA3 channels the protocol required internal memory for
 *  IDMA3 channel 'env' memory. To deal with this catch-22 situation we
 *  configure DMAN3 with hook-functions to obtain internal-scratch memory
 *  from the shared scratch pool for the associated algorithm's
 *  scratch-group (i.e. it first tries to get the internal scratch memory
 *  from DSKT2 shared allocation pool, hoping there is enough extra memory
 *  in the shared pool, if that doesn't work it will try persistent
 *  allocation from DMAN3.internalHeap).
 */
DMAN3.heapInternal    = "L1DHEAP";       /* L1DHEAP is an internal segment */
//DMAN3.heapInternal = "DDRALGHEAP";    /* DDRALGHEAP is an external segment */

Migrating the demo applications to another DVSDK

These demo applications were developed with the production DVSDKs for these devices at the time of writing. In order to migrate to later DVSDKs the following information must be borne in mind.

The DM6467 Canny application was based on the encodedecode example found at dvsdk_2_00_00_22/dvsdk_demos_2_00_00_07/dm6467/encodedecode, so in order to identify the changes required by the Canny application it is necessary to do a file compare. The identified Canny specific changes can then be applied to a new application folder in the DVSDK of choice.

Similarly the OMAP3530 Canny application was based on the decode demo found at dvsdk_3_00_02_44/dvsdk_demos_3_00_01_13/omap3530/decode.

This is important as there are often changes in underlying components that the demo applications need to adapt to in their API usage.

Recommended Development Flow for DSP Codec Development

This work can be used to develop the following process flow for efficient development of a DSP codec.

1. If the application uses a library such as VLib that comes with PC platform libraries (C++ dlls and Matlab m models) then develop the algorithm on a PC to take advantage of the visualisation tools these platforms provide.

2. The first development with actual DSP code should take place on a single core DSP platform such as a DM6437 EVM or a DM6437 simulator. This should be the actual algorithm development as suggested in the universal_copy examples with the DO_NOT_USE_CODEC_ENGINE flag. This avoids the cycle time for the codec build and integration stage and allows the focus to be on algorithm only.

3. Now add the Codec Engine support to the algorithm in DM6437/simulator. This step is to debug the codec engine/resource management infrastructure only with the assumption that in step 2 the algorithm is correct.

4. Integrate the DSP codec into the target multi-core build environment. This step is to concentrate on the multi-core issues such as memory coherency and application API usage. The DSP debugging for this step can be done in CCS using CE_DSPDEBUG=1 on the command line as described here.

Conclusions

This application note has explained how to create an XDM compliant iUniversal algorithm that can be integrated into Codec Engine on either single or multicore devices. The example application used TI's VLib to get the optimum performance for video analytic applications, using the Canny Edge Detection algorithm as an example. This example illustrates the benefits of using slicing versus the basic cache model where appropriate, and the trade-offs required between the use of precious internal memory and the cache. Finally, it pulled together all the examples to propose a development flow for codec development for a multicore platform.

http://processors.wiki.ti.com/index.php/C64x%2B_iUniversal_Codec_Creation_-_from_memcpy_to_Canny_Edge_Detector