2. ALU Design
Olle Seger ([email protected]), Dake Liu ([email protected])
• ALU, an overview
• AU, a case study
• Exercises
• About Lab-2
ALU
• Key component in the datapath of a DSP processor
• Usually all operands come from the register file (RF), except immediates
• Execution cost: 1 clock cycle
• Use one guard bit

Key Components of ALU
• Arithmetic Unit
• Logic Unit (AND, OR, XOR etc.)
• Shifter (LRS, LLS, ASR, ASL)
• Special Functions (e.g. bit manipulation)
• Multiplexers
ALU Overview
[Figure: ALU block diagram with the blocks Pre-Processing, AU, Logic, Shift, Special, Post-Processing, and Saturation; the outputs are Result and Flags.]
Let’s design a small AU
Functional Specification
 0. A + B with saturation                 OP=0000
 1. A + B without saturation              OP=0001
 2. A + B + Cin with saturation           OP=0010
 3. A + B + Cin without saturation        OP=0011
 4. A - B with saturation                 OP=0100
 5. A - B without saturation              OP=0101
 6. A compare to B with saturation        OP=0110
 7. ABS(A) Absolute operation on A        OP=0111
 8. NEG(A) Negate operation on A          OP=1000
 9. (A+B)/2 Average operation             OP=1001
10. NOP                                   OP=1010
The C, Z, V, and N flags should be updated for OP 0-9.
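The functional specification can be read as a small behavioural model. The sketch below is one possible C reference for the result values only (no flags); the names `au_op`, `sat16`, `wrap16`, and `au_exec` are hypothetical. It assumes that ABS and NEG do not saturate, that the average rounds toward minus infinity, and that compare and NOP leave the destination unchanged; none of these details are stated in the specification itself.

```c
#include <stdint.h>

/* Opcodes taken from the functional specification (4-bit OP field). */
enum au_op {
    AU_ADD_SAT  = 0x0, AU_ADD  = 0x1,
    AU_ADDC_SAT = 0x2, AU_ADDC = 0x3,
    AU_SUB_SAT  = 0x4, AU_SUB  = 0x5,
    AU_CMP      = 0x6, AU_ABS  = 0x7,
    AU_NEG      = 0x8, AU_AVG  = 0x9,
    AU_NOP      = 0xA
};

/* Clamp a wide intermediate value to the signed 16-bit range. */
static int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Keep only the low 16 bits (two's complement wrap-around). */
static int16_t wrap16(int32_t x)
{
    uint16_t u = (uint16_t)x;
    return (u & 0x8000u) ? (int16_t)((int32_t)u - 65536) : (int16_t)u;
}

/* Value written to the destination register.
 * cin is the incoming carry (0 or 1); dst is the old destination value,
 * returned unchanged for compare (flags only) and NOP. */
int16_t au_exec(enum au_op op, int16_t a, int16_t b, int cin, int16_t dst)
{
    int32_t wa = a, wb = b, s;

    switch (op) {
    case AU_ADD_SAT:  return sat16(wa + wb);
    case AU_ADD:      return wrap16(wa + wb);
    case AU_ADDC_SAT: return sat16(wa + wb + cin);
    case AU_ADDC:     return wrap16(wa + wb + cin);
    case AU_SUB_SAT:  return sat16(wa - wb);
    case AU_SUB:      return wrap16(wa - wb);
    case AU_ABS:      return wrap16(wa < 0 ? -wa : wa); /* assumption: no saturation */
    case AU_NEG:      return wrap16(-wa);               /* assumption: no saturation */
    case AU_AVG:      s = wa + wb;
                      return (int16_t)((s - (s & 1)) / 2); /* assumption: floor */
    case AU_CMP:      /* flags only, nothing written */
    case AU_NOP:
    default:          return dst;
    }
}
```

For instance, `au_exec(AU_ADD_SAT, 32767, 1, 0, 0)` stays at 32767, while `au_exec(AU_ADD, 32767, 1, 0, 0)` wraps to -32768.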
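AU functions
[Figure: adder configuration for each AU function: SAT(A + B), A + B, SAT(A + B + C), A + B + C, SAT(A - B), A - B, compare (flag-only), ABS(A), NEG(A), and Average (A+B). The configurations with carry feed Cin into the adder, several configurations feed a constant '1' into the adder, ABS(A) and NEG(A) use B=0, ABS(A) selects on the MSB of A, and the average passes the sum through ASR.]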
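HW with multiplexing
[Figure: AU datapath with multiplexing. A decoder (DEC) derives the control signals C1-C5 from OP. Multiplexers select the adder inputs from A[15:0], B[15:0], and the carry-in; a 17-bit adder produces S with Cout = S[16]; the result R is selected from the SAT, trunc, and ASR outputs; the flags are updated from the result.]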
HW with multiplexing

Flags
    // Flags are registered and only updated when C5 is set
    always @(posedge clk)
      if (c5) begin
        C <= Cout;
        Z <= !|R;
        N <= R[15];
        V <= (S[16] != S[15]);
      end

ASR ½
    assign R = S[16:1];

Sat
    // Combinational logic, so blocking assignments are used
    always @(*)
      if (S[16]==S[15])
        R = S[15:0];
      else if (S[16]==0)
        R = 16'h7fff;
      else
        R = 16'h8000;

Trunc
    assign R = S[15:0];

DEC
    OP  Function     C1  C2  C3  C4  C5
     0  Sat(A+B)     00  00  01  00  1
     1  A+B          00  00  01  01  1
     2  Sat(A+B+C)   00  00  10  00  1
     3  A+B+C        00  00  10  01  1
     4  A-B          00  01  00  01  1
     5  Sat(A-B)     00  01  00  00  1
     6  Cmp(A,B)     00  01  00  -   1
     7  Abs(A)       10  10  01  01  1
     8  Neg(A)       01  10  01  01  1
     9  (A+B)/2      00  00  01  10  1
    10  NOP          -   -   -       0
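To check the post-processing and flag logic by hand, it can help to mirror the Verilog fragments in software. The following is a minimal sketch in C, assuming the 17-bit adder works on sign-extended 16-bit operands and that Cout equals S[16] as drawn in the datapath figure. The names `add17`, `post_process`, `update_flags`, and `au_flags` are hypothetical and not part of the lab code.

```c
#include <stdint.h>

typedef struct { unsigned c, z, n, v; } au_flags;

enum post { POST_SAT, POST_TRUNC, POST_ASR };

/* 17-bit sum S[16:0] of two sign-extended 16-bit operands plus carry-in. */
uint32_t add17(int16_t x, int16_t y, unsigned cin)
{
    int32_t s = (int32_t)x + (int32_t)y + (int32_t)cin;
    return (uint32_t)s & 0x1FFFFu;
}

/* Result R[15:0] for the three post-processing alternatives. */
uint16_t post_process(uint32_t s, enum post mode)
{
    unsigned s16 = (s >> 16) & 1u;
    unsigned s15 = (s >> 15) & 1u;

    switch (mode) {
    case POST_SAT:
        if (s16 == s15)
            return (uint16_t)(s & 0xFFFFu);          /* no overflow */
        return s16 == 0 ? 0x7FFFu : 0x8000u;         /* clamp to max or min */
    case POST_ASR:
        return (uint16_t)((s >> 1) & 0xFFFFu);       /* R = S[16:1] */
    default:
        return (uint16_t)(s & 0xFFFFu);              /* R = S[15:0] */
    }
}

/* Flag values as written by the clocked block when C5 is set. */
au_flags update_flags(uint32_t s, uint16_t r)
{
    au_flags f;
    f.c = (s >> 16) & 1u;                            /* C <= Cout = S[16] */
    f.z = (r == 0);                                  /* Z <= !|R */
    f.n = (r >> 15) & 1u;                            /* N <= R[15] */
    f.v = ((s >> 16) & 1u) != ((s >> 15) & 1u);      /* V <= S[16] != S[15] */
    return f;
}
```

For example, 0x7FFF + 1 gives S = 0x08000: the saturated result is 0x7FFF, the truncated result is 0x8000, and V is set in both cases because S[16] and S[15] differ.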
Exercise 2.1
Exercise 2.2
Exercise 2.3
We have a processor with a pipeline where we can:
* Read out two operands from the register file and write one operand to the register file, all at the same time
* Instead of reading out one of the operands, you can choose to take a 16-bit immediate from the instruction word
* We have 32 16-bit registers
* A conditional branch takes 3 clock cycles
* We have a repeat instruction
* We have only one load instruction of interest: load Rd, DM0[AR0++]. AR0 is set with the instruction set AR0, Rs
* The store instruction works the same way: store DM0[AR0++], Rs
* After a load instruction we must wait a clock cycle before we can use the result
Function 1 (execution time max 105 clock cycles, excluding the RET instruction)

    #include <stdint.h>
    #include <stdlib.h>

    int16_t dct_indata[32];

    // Return value in r0
    uint16_t find_maxabsval(void)
    {
        uint16_t biggest = 0, b;
        int16_t a;

        for (int i = 0; i < 32; i++) {
            a = dct_indata[i];
            b = abs(a);
            if (b > biggest)
                biggest = b;
        }
        return biggest;
    }
    int64_t packet_ctr;

    /* Length is in register r0 when this function is called */
    void update_statistics(int16_t length)
    {
        packet_ctr += length;
    }

Execution time: max 25 clock cycles (excluding the RET instruction)
Straightforward loop:

    SET ar0,dct_indata
    SET r0,0            ; max value
    REPEAT loop,32
    LD r1,(ar0++)
    NOP
    ABS r2,r1
    MAX r0,r2,r0
loop
    RET

4*32 + 3 = 131

Loop unrolled twice:

    SET ar0,dct_indata
    SET r0,0            ; max value
    REPEAT loop,16
    LD r1,(ar0++)
    LD r3,(ar0++)
    ABS r2,r1
    MAX r0,r2,r0
    ABS r4,r3
    MAX r0,r4,r0
loop
    RET

6*16 + 3 = 99

A gold star if you can do it faster!
    SET ar0,dct_indata
    LD r1,(ar0++)       ; prolog
    SET r0,0            ; prolog, max value
    ABS r2,r1           ; prolog
    REPEAT loop,31
    LD r1,(ar0++)       ; loop
    MAX r0,r2,r0        ; loop
    ABS r2,r1           ; loop
loop:
    MAX r0,r2,r0        ; epilog
    RET

3*31 + 6 = 99
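The prolog/loop/epilog structure of the software-pipelined assembly can be mirrored in C to see which element each instruction works on. This is only an illustrative sketch: the function name `find_maxabsval_pipelined` and its parameters are hypothetical, it assumes `n >= 1`, and it assumes `int` is wider than 16 bits so that `abs(-32768)` is representable.

```c
#include <stdint.h>
#include <stdlib.h>

/* Software-pipelined maximum of absolute values.
 * In each iteration the load for element i overlaps with the MAX for
 * element i-1, just as LD and MAX share the loop body in the assembly. */
uint16_t find_maxabsval_pipelined(const int16_t *data, int n)
{
    uint16_t biggest = 0;                     /* SET r0,0 */
    int16_t loaded = data[0];                 /* prolog: first LD */
    uint16_t mag = (uint16_t)abs(loaded);     /* prolog: first ABS */

    for (int i = 1; i < n; i++) {
        loaded = data[i];                     /* LD for element i */
        if (mag > biggest)                    /* MAX for element i-1 */
            biggest = mag;
        mag = (uint16_t)abs(loaded);          /* ABS for element i */
    }

    if (mag > biggest)                        /* epilog: last MAX */
        biggest = mag;
    return biggest;
}
```

With n = 32 the loop body runs 31 times, matching the repeat count of 31 in the assembly.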
    set ar0,packet_ctr
    set r4,0
    add r1,r0,0x8000    ; carry = (length<0)
    addc r4,r4,r4       ; r4 = (length<0)
    ld r1,(ar0)
    sub r4,0,r4         ; r4 = (length<0)?-1:0
    add r1,r0
    st (ar0++),r1
    repeat endloop,3
    ld r1,(ar0)
    nop                 ; Silver star if you remove this
                        ; without unrolling loop completely!
    addc r1,r4
    st (ar0++),r1
endloop
    ret

[Figure: packet_ctr stored as four 16-bit words P_c[0] to P_c[3], addressed through ar0. The length in r0 is added to P_c[0], and its sign extension (ext) is added to P_c[1], P_c[2], and P_c[3].]

3*4 + 9 = 21
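The assembly adds a sign-extended 16-bit value to a 64-bit counter one 16-bit word at a time, propagating the carry. Adding 0x8000 to the length produces a carry exactly when the length is negative, which is how the code builds the extension word (0 or -1). A minimal C sketch of the same word-wise addition follows; the function name `update_statistics_words` and the least-significant-word-first layout are assumptions made for illustration.

```c
#include <stdint.h>

/* Add a signed 16-bit length to a 64-bit counter held as four 16-bit
 * words, least significant word first (assumed memory layout). */
void update_statistics_words(uint16_t ctr[4], int16_t length)
{
    /* Sign extension word: all ones for a negative length, else zero. */
    uint16_t ext = (length < 0) ? 0xFFFFu : 0x0000u;

    /* Lowest word: plain add, remember the carry out. */
    uint32_t sum = (uint32_t)ctr[0] + (uint16_t)length;
    ctr[0] = (uint16_t)sum;
    unsigned carry = sum >> 16;

    /* Upper three words: add the extension word plus carry (addc). */
    for (int i = 1; i < 4; i++) {
        sum = (uint32_t)ctr[i] + ext + carry;
        ctr[i] = (uint16_t)sum;
        carry = sum >> 16;
    }
}
```

Starting from the words {5, 0, 0, 0} and adding a length of -3 leaves {2, 0, 0, 0}: the carry out of each word cancels the all-ones extension word in the next one.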
Exercise 2.3 software pipelining

    set ar0,packet_ctr
    set r4,0
    add r1,r0,0x8000    ; carry = (length<0)
    addc r4,r4,r4       ; 1 in r4 if length<0
    ld r1,(ar0)
    sub r4,0,r4         ; -1 in r4 if neg
    add r2,r1,r0
    repeat endloop,3
    ld r1,(ar0+1)       ; loop
    st (ar0++),r2       ; loop
    addc r2,r1,r4       ; loop
endloop
    st (ar0++),r2
    ret

3*3 + 9 = 18
ALU

                C1  C2  C3  C4  C5
    ABS(A)       1  10  11   0   0
    MAX(A,B)     0  01  00   1   0
    A+B          0  00  01   0   1
    A-B          0  01  00   0   1
    A+B+C        0  00  10   0   1

[Figure: ALU datapath for Exercise 2.3 with a 17-bit adder on the sign-extended operands {A[15],A[15:0]} and {B[15],B[15:0]}, multiplexers controlled by C1-C4, an XOR gate (=1) driven by A[15], the carry flag C as an input, and the outputs S[15:0], S[16], and Cout.]

    always @(posedge clk)
      if (C5) begin
        C <= Cout;
      end
Exercise 2.4
Exercise 2.4 Software pipelining

    SET ar0,dct_indata
    SET r0,0            ; max value
    LD r1,(ar0++)       ; prolog
    REPEAT loop,31
    LD r1,(ar0++)       ; loop
    MAXABS r0,r1,r0     ; loop
loop:
    MAXABS r0,r1,r0     ; epilog
    RET

2*31+5=67
This code utilizes the pipeline delay: the MAXABS directly after an LD still sees the previously loaded value of r1.
Exercise 2.4 Loop unrolling

    SET ar0,dct_indata
    SET r0,0            ; max value
    REPEAT loop,16
    LD r1,(ar0++)
    LD r2,(ar0++)
    MAXABS r0,r1,r0
    MAXABS r0,r2,r0
loop
    RET

4*16+3=67
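The slides use MAXABS without spelling out its semantics. A plausible reading, and only an assumption here, is that `MAXABS rd,ra,rb` computes the maximum of |ra| and rb with an unsigned comparison. Under that assumption, the unrolled loop corresponds roughly to the following C sketch; the names `maxabs` and `find_maxabsval_unrolled` are hypothetical.

```c
#include <stdint.h>

/* Assumed semantics of MAXABS rd,ra,rb: rd = max(|ra|, rb), unsigned. */
static uint16_t maxabs(int16_t a, uint16_t b)
{
    /* Widen before negating so that -32768 maps to 32768. */
    uint16_t mag = (a < 0) ? (uint16_t)(-(int32_t)a) : (uint16_t)a;
    return (mag > b) ? mag : b;
}

/* Loop unrolled twice: 16 iterations, two elements per iteration. */
uint16_t find_maxabsval_unrolled(const int16_t data[32])
{
    uint16_t biggest = 0;                       /* SET r0,0 */

    for (int i = 0; i < 32; i += 2) {
        biggest = maxabs(data[i], biggest);     /* MAXABS r0,r1,r0 */
        biggest = maxabs(data[i + 1], biggest); /* MAXABS r0,r2,r0 */
    }
    return biggest;
}
```

Fusing ABS and MAX into one instruction removes two instructions per pair of elements compared with the ABS/MAX version, which is where the drop from 6 to 4 instructions per iteration comes from.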
About Lab 2 (Datapath)
• Manual for Lab 2 (Ch-2)
• Source code for LAB-2
• You can use Verilog or VHDL.
• Go through Ch-0 and Ch-2 for all details
Read the manuals carefully before starting the labs!
About Lab 2
Write this HW: saturation.vhd, mac_dp.vhd, adder_ctrl.vhd, min_max_ctrl.vhd
Write this SW: saturation.asm, rounding_vector.asm, alu_test.asm
1) Run SW on srsim for reference
2) Run SW and HW using vsim
3) Compare output
4) Check coverage. Was all your HW tested?
SW should test all corner cases
About Lab 2 Verification
– Write an assembly program to test your modules
– Some templates are provided
– Fill in your choice of registers and operands
– Perform the operation
– Write the results to a file using “out 0x11, r?”
– Use coverage metrics to find obvious missing corner cases
– Run the Modelsim simulator using the commands mentioned in Section 0.5
– Simulate and debug