
Optimized Viterbi decoder program design based on the TMS320C6000 DSP

By: iSee, Date: Sat, 22 Aug 2009, Time: 4:01 PM

Convolutional codes are widely used in communication systems because the encoder is simple, the coding gain is high, and the code corrects random errors well. The Viterbi algorithm (VA), which is based on the maximum-likelihood criterion, is the best-performing decoding algorithm for convolutional codes on the additive white Gaussian noise (AWGN) channel, and it is the algorithm in common use.

In general, soft-decision Viterbi decoding can be implemented in one of three ways: an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a digital signal processor (DSP). Reference [3] compares the merits of the three options in detail. Decoding on a DSP is the most flexible option but also the slowest, because the whole decoding process is implemented in software.

In recent years, the emergence of software radio technology has called for programmable devices (DSP, CPU, etc.) in place of dedicated digital circuits. For channel coding and decoding, the advantage is that only a small number of changes to the program are needed to adapt to the different code rates and coding methods required by various communication systems. However, decoding speed is the bottleneck in real-time DSP systems, so increasing the DSP decoding speed matters a great deal for software radio. The purpose of this article is to optimize the structure of the decoding program in order to improve the speed at which the VA runs on a DSP chip.


1 Viterbi decoder

Two terms used in this article need to be defined first:

Input frame: the bits fed to the decoder for each decoding step;

Output frame: the bits the decoder outputs for one input frame.

Figure 1 shows a typical structure of a convolutional decoder (VA algorithm).

The (2,1,7) convolutional code (with a 2-bit input frame and a 1-bit output frame) is used as an example to illustrate the three main parts of the decoder.

1.1 Branch metric generation unit (BMG)

This unit calculates the 128 branch metrics that correspond to the current input frame and stores them in the branch metric memory (BMM).
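The branch metric step can be sketched in Python as follows. This is a minimal illustration, not the article's DSP code: the generator polynomials 171 and 133 (octal), the mapping of coded bit 0 to +1.0 and coded bit 1 to -1.0, the squared Euclidean distance, and the names `expected_output` and `branch_metrics` are all assumptions made for the example. The state convention (newest bit in the top state bit) is chosen to match the butterfly described later, where states 2i and 2i+1 lead to states i and i+32.

```python
# Assumed generator polynomials for a (2,1,7) code; the article does not give them.
G1, G2 = 0o171, 0o133

def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1

def expected_output(state, bit):
    """Coded bit pair produced when `bit` enters the encoder in `state` (0..63)."""
    reg = (bit << 6) | state  # 7-bit register: new bit on top of the 6 state bits
    return parity(reg & G1), parity(reg & G2)

def branch_metrics(r0, r1):
    """Compute all 128 branch metrics (64 states x 2 input bits) for one input frame.

    r0, r1 are soft received values; ideal values are +1.0 for a coded 0
    and -1.0 for a coded 1 (assumed mapping).
    """
    bmm = {}
    for state in range(64):
        for bit in (0, 1):
            c0, c1 = expected_output(state, bit)
            s0, s1 = 1 - 2 * c0, 1 - 2 * c1  # map 0/1 to +1/-1
            bmm[(state, bit)] = (r0 - s0) ** 2 + (r1 - s1) ** 2
    return bmm
```

The dictionary plays the role of the BMM here. On a real DSP the table would be far smaller, since only a few distinct metric values occur per input frame.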

1.2 Add-compare-select unit (ACS)

This unit adds each branch metric to the metric of the path the branch extends, giving the metric of the new, extended path. It then compares the metrics of the two new paths that arrive at the same state and selects the path with the smaller metric (the survivor path). The survivor's metric is stored in the path metric memory (SM) as the new path metric, and the survivor path value (which corresponds to the input bit of the encoder for that state) is stored in the path memory (PM).
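A single add-compare-select step can be sketched as below. The function name `acs` and the convention that decision 0 means "first candidate survived" are illustrative assumptions; ties are broken in favor of the first candidate.

```python
def acs(pm_a, bm_a, pm_b, bm_b):
    """Add-compare-select for one state.

    pm_a, pm_b: path metrics of the two predecessor states.
    bm_a, bm_b: branch metrics of the two incoming branches.
    Returns (survivor metric, decision), where decision is 0 if the first
    path survived and 1 if the second did.
    """
    cand_a = pm_a + bm_a  # add
    cand_b = pm_b + bm_b
    if cand_a <= cand_b:  # compare
        return cand_a, 0  # select the smaller metric
    return cand_b, 1
```

For example, `acs(3, 1, 2, 5)` keeps the first path with metric 4, because the second path would have metric 7.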

1.3 Survivor path computation unit

This unit finds the survivor with the minimum metric among the 64 survivor paths (the maximum-likelihood path), uses the traceback operation in the PM to find all the input bits that correspond to that path, and outputs them in order as the decoding result.

For each output bit, all the branch metrics must be computed and 64 ACS operations must be performed. ACS operations account for a large share of the total computation time, so the main task of program optimization is to reduce the number of clock cycles each ACS operation needs.
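The traceback can be sketched as follows, under assumptions that are not stated in the article: a state's top bit (bit 5) is the input bit that led into it, and the stored decision for a state is the lowest bit of the predecessor state that survived. The function name `traceback` and the list-of-lists layout of `decisions` are hypothetical.

```python
def traceback(final_metrics, decisions):
    """Recover the input bits along the minimum-metric survivor path.

    final_metrics: 64 path metrics after the last trellis stage.
    decisions: one list of 64 decision bits per stage, where
               decisions[t][s] is the low bit of the surviving
               predecessor of state s at stage t (assumed convention).
    """
    # Start from the state with the smallest metric (maximum-likelihood path).
    state = min(range(64), key=lambda s: final_metrics[s])
    bits = []
    for stage in reversed(decisions):
        bits.append(state >> 5)                    # input bit that entered this state
        state = ((state << 1) & 63) | stage[state]  # step back to the predecessor
    bits.reverse()                                  # bits were collected newest first
    return bits
```

Each step back needs one read from the path memory, which is why the layout of that memory matters for speed.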

2 TMS320C6000 DSP chip characteristics

The TMS320C6000 is a family of 32-bit DSP processors. It contains two sub-series: the TMS320C62x for fixed-point computation and the TMS320C67x for floating-point computation. Figure 2 shows the CPU structure of the TMS320C6000 series. The clock frequency reaches up to 250 MHz. The CPU has two general-purpose register groups, A and B, each with 16 32-bit registers, and 8 functional units: two multipliers (.M1 and .M2) and six arithmetic logic units (.L1, .L2, .S1, .S2, .D1, .D2). All units can operate independently and in parallel. Taking the TMS320C6701 as an example, with an operating frequency of up to 167 MHz the peak speed is 8 × 167 = 1336 MIPS.

In practice, several bottlenecks stand in the way of this peak speed. The main types of limitation are the following:

(1) Functional unit limitation. The eight functional units execute different sets of instructions. In a real program, the program flow restricts where instructions can be placed, so it is not possible to keep all eight units busy in every clock cycle. The main means of optimization is to raise the instruction-level parallelism, that is, the average number of instructions executed at the same time in each cycle.

(2) Cross path limitation. Each functional unit can directly operate only on the registers of its own register group. For example, .L1 can write its result directly only into group A. Reading or writing a register in the other group requires a "cross path", and the entire CPU has only two cross paths. In other words, one cycle can accommodate at most two cross-group accesses, one in each direction.


(3) Multi-cycle instruction limitation. The LD instruction reads data from memory into a register and is executed by a .D unit. However, after an LD instruction is issued, the data becomes available only 4 cycles later. Instructions that need several cycles to complete (the branch instruction B is another example) are an obstacle to raising instruction-level parallelism.

(4) Data length limitation. The C6000 instruction set can operate only on 8-bit, 16-bit, 32-bit or 40-bit data units.

3 Optimizing the VA implementation on the DSP

ACS operations are the most computation-intensive part of the VA. Programs usually implement ACS with a symmetric butterfly operation, where each butterfly completes two ACS operations. The core task of optimization is therefore to reduce the number of clock cycles each butterfly consumes.

Figure 3 shows the principle of the butterfly computation. The two adjacent states 2i and 2i+1 of the previous stage have four branches in total. The Euclidean distances between these four branches and the received signal are calculated and added to the path metrics previously stored for states 2i and 2i+1, giving the metrics of the four paths A1, A2, B1 and B2. Then, for each of the two current states i and i+32, the two incoming paths are compared and the one with the smaller metric is kept (the survivor path). The metric of each current state and the input bit that corresponds to its survivor path are stored in the corresponding positions, ready for the computation of the next stage.

Each butterfly operation includes: three load operations (it can be shown that the four branch metrics in one butterfly have the same absolute value, so only one precomputed BMU result needs to be loaded); four addition operations; two comparison operations; and four store operations for the comparison results. Of these, the four additions can be completed at the same time in one cycle, and the computation and storage of the survivor paths for states i and i+32 are independent of each other.
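One butterfly can be sketched in Python as below. The sketch assumes a signed branch metric, so that the four branches of a butterfly take the values +m and -m, and it assumes a particular assignment of signs to the branches (the one that results when both generator polynomials have taps at both ends of the register). The article does not say which of the four paths is A1, A2, B1 or B2, so the labels in the comments are a guess, and the name `butterfly` is hypothetical.

```python
def butterfly(pm, i, m):
    """One ACS butterfly: previous states 2i and 2i+1 feed current states i and i+32.

    pm: list of 64 previous-stage path metrics.
    m:  the single precomputed branch metric loaded for this butterfly;
        the other branches use +m or -m (assumed sign pattern).
    Returns (metric_i, decision_i, metric_i32, decision_i32), where a decision
    of 0 means the path from state 2i survived and 1 means the one from 2i+1.
    """
    # Three loads: pm[2i], pm[2i+1] and m. Four additions:
    a1 = pm[2 * i] + m      # 2i   -> i      (labelled A1 here by assumption)
    a2 = pm[2 * i + 1] - m  # 2i+1 -> i      (A2)
    b1 = pm[2 * i] - m      # 2i   -> i+32   (B1)
    b2 = pm[2 * i + 1] + m  # 2i+1 -> i+32   (B2)
    # Two comparisons, one per current state; they do not depend on each other.
    d_lo = 0 if a1 <= a2 else 1
    d_hi = 0 if b1 <= b2 else 1
    # Four stores would follow: two new metrics and two decision bits.
    return min(a1, a2), d_lo, min(b1, b2), d_hi
```

Because the two halves share their inputs but not their results, a parallel machine can run both compare-select chains side by side, which is what the optimization in this section relies on.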

The obstacles to parallelism listed above can be addressed as follows:

(1) Functional unit limitation. Different instructions can substitute for one another. For example, the MV (move) instruction can execute only on the .L, .S and .D units; if those units are already occupied by other parallel instructions, the move can be done as a multiplication by one with the MPY instruction, which runs on an .M unit. Similarly, adding zero or subtracting zero can replace the MV instruction.

(2) Cross path limitation. This is handled through register allocation and swapping: the registers used by one instruction are kept in the same register group as far as possible, which reduces the number of times a cross path is needed.

(3) Multi-cycle instruction limitation. The result of a load is available only after four cycles. To make effective use of the waiting period, the program issues the load instructions ahead of the butterfly operation that uses them, so that the newly loaded data can be used as soon as that butterfly begins. Likewise, the load instructions for the next butterfly are executed during the current butterfly. The B (branch) instruction can be handled in a similar way.

(4) Data length limitation. In the VA decoder for the (2,1,7) convolutional code, the survivor paths are stored in the PM. Each input frame corresponds to 64 possible states and produces 64 bits of survivor comparison results. However, the TMS320C6701 has no instructions that read or write 64-bit data directly, so the PM is divided into two identical 32-bit arrays, PM0 and PM1. The former stores the survivor paths for states 0-31; the latter stores the survivor paths for states 32-63. PM0[i] and PM1[i] together represent all 64 survivor paths at trellis stage i. When the constraint length of the code is greater, the same split-storage approach can be used. For example, for the (2,1,9) convolutional code the PM can be divided into eight 32-bit arrays to store the information for the 256 states. During traceback, the state the path passes through is identified first, and the path value can then be read from the corresponding array with a single LD (load) operation.

Figure 4 gives the flow chart of the optimized butterfly computation. Each loop iteration takes 4 clock cycles, labelled E0-E3 in the figure, and computes one butterfly. Besides the key add-compare-select operations, some auxiliary operations are needed for loop control and for copying registers. On average, 6 instructions are executed in parallel in each clock cycle.
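The split path memory can be sketched as below. The names PM0 and PM1 come from the text; the functions `pack_decisions` and `read_decision`, and the choice of putting state s in bit `s mod 32` of its word, are assumptions made for illustration.

```python
def pack_decisions(decisions):
    """Pack 64 one-bit survivor decisions into two 32-bit words.

    Returns (pm0, pm1): pm0 holds states 0-31, pm1 holds states 32-63,
    with state s stored in bit (s mod 32) of its word (assumed layout).
    """
    pm0 = 0
    pm1 = 0
    for s in range(32):
        pm0 |= (decisions[s] & 1) << s
        pm1 |= (decisions[s + 32] & 1) << s
    return pm0, pm1

def read_decision(pm0, pm1, state):
    """Read the decision bit of `state` using a single word access."""
    word = pm0 if state < 32 else pm1  # pick the array that holds this state
    return (word >> (state & 31)) & 1
```

During traceback only one of the two words is touched per stage, which corresponds to the single load mentioned above. For a 256-state code the same idea would use eight words selected by the top three state bits.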

4 Optimization results and generalization

The number of clock cycles the decoder needs for each output frame is

TBMC + n × TButter + Ttb

where TBMC, TButter and Ttb are the numbers of cycles needed for the branch metric computation, one butterfly computation and the traceback operation respectively, and n is the number of butterfly computations per output frame.

For the (2,1,7) convolutional code decoder, each output frame needs 32 butterfly computations, so n = 32. There are two schemes for outputting the decoding result during survivor path traceback: one outputs one decoded frame for each input frame; the other inputs a sequence of N frames and then outputs the N decoded frames. With the latter approach, the number of traceback cycles needed per output frame falls to Ttb / N, but the delay also increases by (N-1) TButter / TCPS, where TCPS is the number of DSP clock cycles per second, which equals the operating frequency of the DSP.

If the (2,1,7) decoder is implemented in TI's linear assembly language with the structure shown in Figure 1, and the CCS2 software compiles and optimizes it automatically at the -o1 level, each decoded bit needs about 1000 clock cycles (TButter = 22, n = 32). At a clock speed of 167 MHz the decoding rate does not exceed 160 kbps.


After the program is optimized with the methods described in this article, again for the (2,1,7) convolutional code, TBMC = 20, TButter = 4, n = 32 and Ttb = 700. Choosing N = 16, the average time to decode one bit is 128 + 20 + (700/16) ≈ 192 clock cycles. Taking the TMS320C6701 running at 167 MHz as an example, the decoding program reaches a rate of about 870 kbps, while the delay is only 18 μs. Clearly, the optimization described in this article performs much better than automatic optimization.

For convolutional codes with a different constraint length, such as the (2,1,9) code used in WCDMA, the butterfly computation procedure is identical to that of the (2,1,7) code. The difference is that each stage has 256 states. Therefore only the path storage and traceback parts of the program need some changes before it can be used.

For other DSP systems, differences in instruction set, buses, registers and many other aspects mean that assembly code optimized for the C6000 series cannot be applied directly. However, the problems met when optimizing a decoding program are more or less the same, and the focus of optimization is always to reduce the computation spent on ACS, so the basic ideas and some of the techniques proposed in this article remain applicable.
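The arithmetic behind these figures can be checked with a short script. The function name `cycles_per_bit` is hypothetical; the numbers are the ones quoted in the text, and the formula is TBMC + n × TButter + Ttb / N.

```python
def cycles_per_bit(t_bmc, t_butter, n, t_tb, n_frames=1):
    """Average clock cycles per decoded bit when traceback is shared by n_frames frames."""
    return t_bmc + n * t_butter + t_tb / n_frames

CLOCK_HZ = 167e6  # TMS320C6701 operating frequency quoted in the text

# Optimized program: TBMC = 20, TButter = 4, n = 32, Ttb = 700, N = 16.
optimized_cycles = cycles_per_bit(20, 4, 32, 700, n_frames=16)
optimized_rate = CLOCK_HZ / optimized_cycles  # decoded bits per second

# Compiler-optimized baseline: about 1000 cycles per bit, as quoted in the text.
baseline_rate = CLOCK_HZ / 1000

print(f"optimized: {optimized_cycles:.2f} cycles/bit, {optimized_rate / 1e3:.0f} kbps")
print(f"speed-up over baseline: {optimized_rate / baseline_rate:.1f}x")
```

The butterfly term n × TButter = 128 cycles dominates the total, which is why reducing TButter from 22 to 4 accounts for most of the gain.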

