How to use DSP to speed up OMAP code?

I am working on a video codec for the OMAP3430. I already have code written in C++, and I'm trying to modify/port some parts of it to run on the DSP (my SDK, the OMAP ZOOM3430 SDK, has an additional DSP).

I tried offloading a small loop that works on a very small amount of data (~250 bytes), but runs about 2M times on different data. However, the communication overhead between the CPU and the DSP is much larger than the gain (if there is any gain at all).

I assume this task is very similar to optimizing code for the GPU on ordinary computers. My question is: which parts are worth porting? How do GPU programmers handle tasks like this?

Edit: here is the sequence of operations; a rough code sketch of the same flow follows the list.

  • The GPP application allocates a buffer of size 0x1000 bytes.
  • The GPP application calls DSPProcessor_ReserveMemory to reserve a virtual DSP address space for each allocated buffer, using a size 4K larger than the allocated buffer to allow for automatic page alignment. The total reservation size must also be aligned to a 4K page boundary.
  • The GPP application calls DSPProcessor_Map to map each allocated buffer to the DSP virtual address spaces reserved in the previous step.
  • The GPP application prepares a message to notify the DSP execution phase of the base address of the virtual address space that was mapped to the buffer allocated on the GPP. The GPP application uses DSPNode_PutMessage to send the message to the DSP.
  • GPP calls memcpy to copy the data to be processed into shared memory.
  • The GPP application calls DSPProcessor_FlushMemory to flush the data cache.
  • The GPP application prepares a message to notify the DSP execution phase that it has finished writing to the buffer and that the DSP may now access it. The message also contains the amount of data written to the buffer, so the DSP knows how much data to copy. The GPP application uses DSPNode_PutMessage to send the message to the DSP, and then calls DSPNode_GetMessage to wait for a reply message from the DSP.
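
Put together, the GPP-side sequence looks roughly like the sketch below. This is a minimal outline only: the DSPProcessor_*/DSPNode_* functions are the ones named above from the TI DSP Bridge API, but the exact argument lists, the DSP_HPROCESSOR/DSP_HNODE/DSP_MSG types, the DSP_FOREVER timeout, and the CMD_* command codes are assumptions to be checked against the headers in your SDK; error checking is omitted.

    /* Sketch of the GPP-side flow described above. Signatures, types and
     * the CMD_* values are assumptions -- check the bridge headers.      */
    #include <string.h>

    #define BUF_SIZE  0x1000     /* allocated buffer size                  */
    #define PAGE_SIZE 0x1000     /* extra 4K reserved for alignment        */

    void send_buffer_to_dsp(DSP_HPROCESSOR proc, DSP_HNODE node,
                            void *buf, const void *src, unsigned long nbytes)
    {
        void *dsp_reserved = NULL, *dsp_mapped = NULL;
        struct DSP_MSG msg;

        /* Reserve DSP virtual address space, 4K larger and 4K aligned.    */
        DSPProcessor_ReserveMemory(proc, BUF_SIZE + PAGE_SIZE, &dsp_reserved);

        /* Map the GPP buffer into the reserved DSP address range.         */
        DSPProcessor_Map(proc, buf, BUF_SIZE, dsp_reserved, &dsp_mapped, 0);

        /* Tell the DSP node the DSP-side base address of the buffer.      */
        msg.dwCmd  = CMD_SET_BUFFER;            /* hypothetical command id  */
        msg.dwArg1 = (unsigned long)dsp_mapped;
        msg.dwArg2 = BUF_SIZE;
        DSPNode_PutMessage(node, &msg, DSP_FOREVER);

        /* Copy the data to be processed into the shared buffer and flush
         * the ARM data cache so the DSP sees the new contents.            */
        memcpy(buf, src, nbytes);
        DSPProcessor_FlushMemory(proc, buf, nbytes, 0);

        /* Tell the DSP how much data is ready, then wait for its reply.   */
        msg.dwCmd  = CMD_DATA_READY;            /* hypothetical command id  */
        msg.dwArg1 = nbytes;
        DSPNode_PutMessage(node, &msg, DSP_FOREVER);
        DSPNode_GetMessage(node, &msg, DSP_FOREVER);
    }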

After that, the DSP execution phase starts, and the DSP notifies the GPP with a message when processing is complete. Just as a test, I don't do any processing inside the DSP program; I only send the "processing complete" message back to the GPP. And it still takes a lot of time. Could this be due to the use of internal/external memory, or is it just due to the communication overhead?
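
As an aside, the round trip on its own can be timed on the GPP side with an ordinary POSIX clock around the two message calls. A minimal sketch (again, the DSP_MSG/DSP_HNODE types and the DSP_FOREVER timeout are assumptions from the bridge headers):

    /* Timing one empty GPP <-> DSP message round trip with a POSIX clock.
     * DSP_MSG, DSP_HNODE and DSP_FOREVER are assumed from the bridge API.  */
    #include <stdio.h>
    #include <time.h>

    void time_round_trip(DSP_HNODE node)
    {
        struct DSP_MSG msg = { 0 };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        DSPNode_PutMessage(node, &msg, DSP_FOREVER);   /* "ping" the DSP    */
        DSPNode_GetMessage(node, &msg, DSP_FOREVER);   /* wait for "done"   */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("round trip: %.1f us\n",
               (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3);
    }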

c++ c embedded signal-processing omap

3 answers

From the measurements I made, one message round trip between the CPU and the DSP takes about 160 µs. I do not know whether this is due to the kernel I use or to the bridge driver, but it is a very long time for a simple back-and-forth transfer.

It only seems reasonable to move an algorithm to the DSP if its total computational load is large compared to the time required for messaging, and if the algorithm lends itself to concurrent computation on the CPU and DSP.
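
For a rough sense of scale, applying that 160 µs figure to the workload in the question: 2,000,000 round trips × 160 µs ≈ 320 s of messaging overhead alone, before any processing. Batching, say, 1,000 of the ~250-byte items into one ~250 KB buffer would cut that to 2,000 × 160 µs ≈ 0.32 s (illustrative arithmetic only; the batch size is arbitrary).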


The OMAP3430 does not have an on-board DSP; it has an IVA2+ video/audio decode engine attached to the system bus, and the Cortex-A8 core has DSP-like SIMD instructions. The GPU on the OMAP3430 is a PowerVR SGX-based unit. It does have programmable shaders, but I do not believe there is any support for general-purpose programming à la CUDA or OpenCL. I could be wrong, but I have never heard of such support.

If you are using the on-board IVA2+ encode/decode engine, you need to use the proper libraries for that device, and it only supports certain codecs as far as I know. Are you trying to run your own library on that module?

If you are using the DSP-like features built into the Cortex (the SIMD instructions), post the code.

If your dev board has an additional DSP on it, which DSP is it and how is it connected to the OMAP?

As for the desktop GPU question: in the case of video decoding, you use the vendor-provided function library to make calls to the hardware. There is VDPAU for Nvidia on Linux, and a similar library on Windows (PureVideo HD, I think it's called). ATI also has Linux and Windows libraries for their built-in decode engines, but I don't know their names.


I do not know what the basis for the transfer times is, but I do know that the TMS320C64x listed on the spec sheet for the SDK has a very powerful DMA engine. (I assume this is the original ZOOM OMAP34X MDK; it says it has a 64xx.) I would hope the OMAP has something similar; use it to its fullest advantage. I would recommend setting up ping-pong buffers in the internal RAM of the 64xx and using the SDRAM as shared memory, with the transfers handled by DMA. External RAM is going to be a bottleneck on any of the 6xxx series parts, so keep whatever you can locked into internal memory to improve performance. Typically these parts can bus 8 32-bit words to the processor core per cycle once the data is in internal memory, but that varies from part to part based on what level of cache it allows you to map as direct-access RAM. Cost-sensitive parts from TI move the "mappable memory" farther away than some of the other chips. Also, all the manuals for these parts are available from TI as free PDF downloads. They even gave me hard copies of the TMS320C6000 CPU and Instruction Set manual and many other books.
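
To illustrate the ping-pong idea, here is a rough pseudo-C sketch; dma_start(), dma_wait() and process_block() are hypothetical placeholders for whatever EDMA driver calls and processing kernel you actually have, and the two buffers are assumed to be placed in internal RAM:

    /* Ping-pong buffering sketch: while the core works on one internal-RAM
     * buffer, the DMA engine fills the other from external SDRAM.
     * dma_start(), dma_wait() and process_block() are hypothetical
     * placeholders for the real EDMA driver calls and your kernel.         */
    #define CHUNK 256                       /* bytes per block               */

    static char ping[CHUNK], pong[CHUNK];   /* place these in internal RAM   */

    void process_stream(const char *sdram_src, int nblocks)
    {
        char *work = ping, *fill = pong;
        int i;

        dma_start(fill, sdram_src, CHUNK);            /* prefetch block 0    */
        for (i = 0; i < nblocks; i++) {
            char *tmp;

            dma_wait();                     /* block i is now in internal RAM */

            tmp = work; work = fill; fill = tmp;       /* swap buffer roles   */
            if (i + 1 < nblocks)                       /* refill other buffer */
                dma_start(fill, sdram_src + (i + 1) * CHUNK, CHUNK);

            process_block(work, CHUNK);                /* compute on block i  */
        }
    }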

As far as programming goes, you may need to use some of the processor intrinsics or inline assembly to optimize any math you are doing. For the 64xx, favor integer operations, since it does not have a built-in floating-point core (that belongs to the 67xx series). If you look at the execution units and can map your calculations so that different operations target different units in ways that can happen in a single cycle, you will be able to get the best performance out of these parts. The instruction set manual lists the types of operations performed by each execution unit. If you can break your calculation into two independent data flows and unroll the loops a bit, the compiler will be "friendlier" to you when full optimization is turned on. This is because the processor is split into a left and a right side with nearly identical execution units on either side.
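
As a concrete illustration of that dual data-flow idea, here is a plain-C sketch of a loop unrolled by two with independent accumulators, so the optimizer can spread the work across the A-side and B-side units. The restrict qualifier and the MUST_ITERATE pragma are the usual hints for TI's C6000 compiler, but treat the exact pragma syntax as something to verify against your compiler manual:

    /* Dual data-flow sketch: two independent accumulators let the compiler
     * schedule the multiplies/adds on the A and B sides in parallel.
     * MUST_ITERATE tells TI's C6000 compiler the trip-count properties;
     * verify the pragma syntax against your compiler documentation.        */
    int dot_product(const short * restrict a, const short * restrict b, int n)
    {
        int sum0 = 0, sum1 = 0;
        int i;

        #pragma MUST_ITERATE(8, , 2)      /* n >= 8 and a multiple of 2      */
        for (i = 0; i < n; i += 2) {
            sum0 += a[i]     * b[i];      /* stream 0                        */
            sum1 += a[i + 1] * b[i + 1];  /* stream 1, independent of sum0   */
        }
        return sum0 + sum1;
    }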

Hope this helps.

