Upload
gupta-aakash
View
216
Download
0
Embed Size (px)
Citation preview
7/29/2019 Dijsktra Thesis
1/65
Universitat BielefeldTechnische Fakultat
AG Technische Informatik
Diplomarbeit
A parallel implementation of path planning on
reconfigurable hardware
Marcus Grieger
November 11, 2004
BetreuerProf. Dr. Ralf Moller
Dipl.-Inform. Tim Kohler
7/29/2019 Dijsktra Thesis
2/65
Diese Diplomarbeit wurde von Prof. Dr. Ralf Moller und Dipl.-Inform. Tim Kohler
betreut.
Gutachter waren Prof. Dr. Ralf Moller und Dipl.-Inform. Tim Kohler.
2 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
3/65
Contents
Contents
1 Introduction 5
1.1 Dijkstras shortest path algorithm . . . . . . . . . . . . . . . . . . 51.2 Path planning on binary maps . . . . . . . . . . . . . . . . . . . . 61.3 Mapping of binary map nodes on reconfigurable hardware . . . . 6
1.4 Applications of grid based shortest path algorithms . . . . . . . . 61.5 Related projects . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 The Dijkstra Algorithm 9
2.1 The Dijkstra Shortest Path algorithm . . . . . . . . . . . . . . . . 92.1.1 Initialisation . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Simplification of the Dijkstra algorithm for a mapping of graph ver-tices on digital hardware in grid-based applications . . . . . . . . 122.2.1 Tracing of the shortest path in the potential vector . . . . 13
3 Field Programmable Gate Arrays (FPGAs) 17
3.1 History of FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.2 Internal structure of FPGAs . . . . . . . . . . . . . . . . . . . . . 183.3 Suitability of FPGAs for the specified application . . . . . . . . . 19
4 Implementation of a Prototype 21
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Design of graph vertices in digital hardware . . . . . . . . . . . . 234.3 Generation of the node matrix by a Verilog code generator . . . . 254.4 Interface to a host system . . . . . . . . . . . . . . . . . . . . . . 29
4.5 Software environment on the host system . . . . . . . . . . . . . 324.6 Result of a test run . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5 Analysation and Discussion 37
5.1 Substitution of DFFs by LUT based delay elements . . . . . . . . 375.2 Low-level instantiation of logic elements by VQM source files . . 40
Universitat Bielefeld, AG Technische Informatik 3
7/29/2019 Dijsktra Thesis
4/65
Contents
5.2.1 Structure of VQM files . . . . . . . . . . . . . . . . . . . . 40
5.2.2 Implementation of the LUT delay element in VQM . . . . 455.3 Reduction of the number of needed DFFs by using an alternate
path tracing algorithm . . . . . . . . . . . . . . . . . . . . . . . . 465.4 Implementation of the address logic and host interface . . . . . . 48
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A Large format schematics 53A.1 Schematic of the top level entity . . . . . . . . . . . . . . . . . . . 53A.2 Schematic of a graph vertex implemented in digital hardware . . 54
B Source code 55B.1 main.m - Main module of the Verilog code generator . . . . . . . 55B.2 serial.c - Support functions for the host utilities . . . . . . . . . . 58
B.3 map transfer.c - Host utility to transfer a map to the FPGA . . . . 61
C Bibliography 63
4 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
5/65
1Introduction
One of the problems in mobile robotics is navigation and path planning on maps. An
approach to finding a shortest path is to encode the map in a weighted graph and use
Dijkstras shortest path algorithm [Min78] to find the shortest path between two nodes
in this graph. The downside of this algorithm is a quadratic growth of complexity de-
pendent on the number of nodes. This work describes a parallel implementation of a
simplified version of Dijkstras algorithm to achieve a linear growth.
1.1 Dijkstras shortest path algorithm
The common version of Dijkstras algorithm solves an all-shortest-path problem 1 in a
weighted graph. Each vertex represents a location on a map and is connected to one
or more other vertices via weighted edges, the weight representing a distance or cost
incurred by following this edge. After defining a destination or reference vertex, the
algorithm iteratively calculates a distance potential in every vertex which represents the
distance to the destination node. The shortest path from any starting vertex is found by
subsequently following the edge to the neighbouring vertex with the lowest potential. A
detailed formal description of this algorithm is found in the first part of chapter 2.
1A solution to an all-shortest-path problem provides the shortest path from a reference vertex to any
other vertex of the graph, not only the shortest path between two specific vertices.
Universitat Bielefeld, AG Technische Informatik 5
7/29/2019 Dijsktra Thesis
6/65
1 Introduction
1.2 Path planning on binary maps
A possible representation of a map generated by a mobile robots sensors is a simple two
dimensional binary map, where a logical zero represents free space and a logical one
represents obstacles. This map is converted to a graph by placing a vertex in every pixel
of the map and connecting it with eight edges to its eight neighbours. The weights result
from the simple Euclidian distance of 1 to orthogonal neighbours and
2 to diagonalneighbours. Obstacle vertices get an infinite weight or are not connected at all.
1.3 Mapping of binary map nodes on reconfig-
urable hardwareIn a classic implementation, each iteration requires a sequential calculation of each ver-
tex potential to be updated, therefore a parallel calculation of all map nodes on suitable
hardware should result in a significant reduction of complexity and thus in great accel-
eration of this algorithm. To achieve this goal, more simplifications are made to allow
for a creation of an electronic equivalent of a graph vertex. An interconnected matrix of
these vertices can be implemented on a field programmable gate array (FPGA) and the
calculation of one iteration is possible in a single clock cycle.
1.4 Applications of grid based shortest path algo-
rithms
In addition to the acceleration of offline and real time path planning in mobile robots,
variants of this algorithm could be used in other applications:
A quite similar problem of finding a shortest path can be found in so-called maze
routing, where conductive tracks on a printed circuit board have to be routed between
pins of components without overlapping or crossing other tracks. This variant starts
with a clear map and every routed track becomes an obstacle for the following tracks.
This method of routing normally uses only orthogonally connected vertices, but in thecase of a multilayer board, the map may become three dimensional.
Another application could be segmentation in the field of image processing. A compar-
ison of region growing algorithms to the wave propagation in this algorithm suggests
using a variant that uses more than one reference vertex as initial region seeds and some
binary tag being propagated with each wave to distinguish between different regions
later.
6 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
7/65
1 Introduction
1.5 Related projects
[Nes03] describes an implementation of a maze-routing accelerator, based on Xilinx
XC4000 and Xilinx VirtexII FPGAs.
An implementation of an accelerator for the calculation of a shortest path in a common
graph is described in [TS01]. This approach aims at the acceleration of the Dijkstra
algorithm used in dynamic network routing.
Universitat Bielefeld, AG Technische Informatik 7
7/29/2019 Dijsktra Thesis
8/65
1 Introduction
8 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
9/65
2The Dijkstra Algorithm
2.1 The Dijkstra Shortest Path algorithm
The Dijkstra Shortest Path Algorithm is described in [Min78]. To solve an all-shortest-
path problem, each location is assigned to a vertex v V in an undirected graph G =(V, E). All locations reachable from this vertex are then connected by a set of weighted
edges E. A weight represents a distance or cost incurred by following the path alongthis edge. The algorithm involves an initialisation step and a number of iterations:
2.1.1 Initialisation
A reference or target vertex t V is defined and a distance potential p(x) is assigned toevery x V. This potential is initialized with p(x) = 0 for x = t and p(x) = for allother vertices:
p(t) = 0 (2.1)
p(x) = x = t, x V. (2.2)Initially, all vertices x = t and all edges are uncoloured. Vertex t is coloured andbecomes the most recently coloured vertex named i.
2.1.2 Iteration
For each uncoloured vertex x that receives an edge from i, p(x) is updated as the min-imum of its own potential and the potential of vertex i in addition to the cost of the
connecting edge:
Universitat Bielefeld, AG Technische Informatik 9
7/29/2019 Dijsktra Thesis
10/65
2 The Dijkstra Algorithm
p(x) = min(p(x), p(i) + w(i, x)) (2.3)
If uncoloured vertices with a finite potential exist, the vertex j with the lowest potential
is coloured and i is set to j. Additionally, the edge that determined the value of p(x) iscoloured. Else, no further vertices are reachable and the algorithm stops.
If the vertex s representing the starting location has been coloured, the shortest path from
this location to the reference node t has been found and can be obtained by following
the coloured edges.
10 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
11/65
2 The Dijkstra Algorithm
Figure 2.1: Example of a shortest path calculation in a weighted graph from vertex s to t using the Dijkstra method. The
reference vertex t is initialized with p(t) = 0, all other vertices are initialized with p(x) = , x = t, denotedby a blank node. In f), the starting node s is coloured and the shortest path from s to t is represented bybold (coloured) edges.
Universitat Bielefeld, AG Technische Informatik 11
7/29/2019 Dijsktra Thesis
12/65
2 The Dijkstra Algorithm
2.2 Simplification of the Dijkstra algorithm for a
mapping of graph vertices on digital hardware
in grid-based applications
To solve a path planning problem on a binary map, the vertices of the graph G = (V, E)are placed in a regular grid and each vertex v V is connected with its eight neighboursvia edges with constant weights w(y, x) = 1 for orthogonal edges and w(y, x) =
2
for diagonal edges.
For this special case, [Mol99] suggests a simplification of the Dijkstra algorithm that
allows an implementation of a graph vertex in simple digital hardware.
The first simplification suggested by the accelerated Dijkstra method as described in
[Mol99] is to use integral weights. Incorporating a factor of 2, weights are w(y, x) = 2for orthogonal edges and w(y, x) = 3 for diagonal edges. This eliminates the needto implement complex combinatorial circuits for real number arithmetics and allows
an encoding of integral weights through an equivalent number of delay elements in
hardware.
In the original Dijkstra algorithm, the vertex with the lowest potential is coloured
in each iteration. To achieve the same goal, the implementation proposed by [Mol99]
introduces a global counter that is incremented in constant periods of time. At thestart of a calculation, a binary signal is propagated from the reference node t to the
neighbouring nodes. This signal then spreads through the entire graph, delayed by a
number of clock cycles equivalent to the weight of the edges. When the signal reaches
a node, the value of the global counter is latched in a register inside the node. The
time needed by the signal to reach a node is equal to the number of delay elements
traversed and thus equal to the sum of the weights of the edges involved. Since the
global counter increments equally to the time needed, the latched value represents the
same sum of the weights. All incoming signals of the neighbouring edges of a node are
OR combined, so the earliest signal locks this node. Therefore, the OR combination is
the corresponding equivalent to the minimum function from the original algorithm.
Due to the integral weights and the regularity of the grid, multiple nodes may
obtain the same minimal potential at a time. Because the distance between two
neighbouring nodes is always shorter than any other path between these nodes, all
nodes with the same potential can be coloured at once. This is the main reason why a
parallel implementation of this algorithm is possible.
By using this method, the maximum number of iterations that are needed to cal-
culate all potentials of reachable nodes in a given n by m grid can be approximated: In
12 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
13/65
2 The Dijkstra Algorithm
a worst case scenario, every other row of the map has to be traversed almost completely.
The rows in between are completely marked as obstacle nodes, except for a one nodewide passage alternating at the left and right border of the map. So the only path
through this map is serpentine shaped (see example in figure 2.2). If the diagonal edges
at the turns are neglected, an upper approximation of the length of the longest path is
(m 1) (n 1)2
(2.4)
and the maximum potential pmax is twice that number:
pmax = (m 1) (n 1) (2.5)
Figure 2.2: Serpentine shaped map for a worst-case scenario
After the number of clock cycles exceeds pmax, all reachable nodes still have a potential
value lower than the final value cmax of the global counter. Thus, these can be considered
coloured and all nodes with a potential equal to cmax are unreachable. This elimiatesthe need to encode the colouration of nodes.
2.2.1 Tracing of the shortest path in the potential vector
The colouration of edges can also be omitted if an iterative tracing algorithm on the
resulting potential vector is used.
Universitat Bielefeld, AG Technische Informatik 13
7/29/2019 Dijsktra Thesis
14/65
2 The Dijkstra Algorithm
Initialization
v0 = s s V (2.6)
v denotes the resulting path, v0 = s is the starting location.
Iteration
In each iteration step, the following vertex vn+1 is set as the neighbouring vertex with
the minimal potential. When the reference vertex t is reached at T, the algorithm stops
and v0, . . . , vT describes the shortest path from s to t.
vn+1 = arg minxconnected tovn
(p(x)) (2.7)
Figure 2.3 shows an example of the resulting algorithm at different values of the global
counter.
14 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
15/65
2 The Dijkstra Algorithm
Figure 2.3: Calculation of node potentials in the simplified algorithm. The values for the global counter are a) 0, b) 2,c) 4, d) 5, e) 6, f) 7. Red edges have a weight of 2, blue edges have a weight of 3, and green edges areunconnected.
Universitat Bielefeld, AG Technische Informatik 15
7/29/2019 Dijsktra Thesis
16/65
2 The Dijkstra Algorithm
16 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
17/65
3Field Programmable Gate Arrays
(FPGAs)
3.1 History of FPGAs
Today, FPGAs are used mainly for rapid prototyping of application specific integratedcircuits (ASIC).
In the past, the demand for integrated circuits for all kinds of applications steadily
increased, but the development of a full-custom chip was an expensive process.
Originally, the creation of a chip required the design of custom wafers. Special clean
rooms and equipment for photolithography and etching of these wavers were needed
for the creation of a prototype, and the whole process was very uneconomical if only a
limited-lot production was needed.
One of the early ancestors of FPGAs was the Uncommited Logic Array (ULA)
or Gate Array (GA). The die of an ULA consists of an array of unconnected standard
logic gates and other elements needed for the creation of an application specific
integrated circuit (ASIC). These elements were connected later by applying metal
interconnections to an additional layer of the die. The possibility to mass produce ULAs
for different applications reduced the overall design cost compared to the creation of
full-custom chips. The creation of an ASIC prototype using an ULA was still a complex
process, since the application of the metal layer could not be done by the designer. In
those days, the design process started to change - the designer no longer had to create
the wafer mask of the die, but the new task was to connect already existing parts of the
chip to implement the desired functionality.
Universitat Bielefeld, AG Technische Informatik 17
7/29/2019 Dijsktra Thesis
18/65
3 Field Programmable Gate Arrays (FPGAs)
The desire to have user programmable chips arose and led to the invention of pro-
grammable array logic (PAL) and gate array logic (GAL) chips. A PAL consists ofa matrix of AND and OR gates and the possibility to create interconnections at spe-
cific points in this matrix by using fuse or antifuse technologies similar to those used
in PROMs and EPROMs. A GAL is a more complex version, which basically includes
several PALs on one die. These can be connected with each other to create more com-
plex logical functions.
At this point, the developer no longer needed to send his metal connection mask to
the ULA manufacturer, but he could obtain readily available chips and program the
needed interconnections using programming hardware on site, which was much cheaper.
Another step to speed up the development of projects involving ASICs was theinvention of in-circuit-programming methods. The first user programmable chips had
to be removed from and reinserted into a socket for each test of a new version. This not
only took time, but uncareful handling and mechanical stress from a repeated removal
and insertion could also lead to a destruction of the component. As a result, the Joint
Test Action Group (JTAG) developed a standard interface which allowed programming
of a device by using a simple serial port protocol without the need for a removal from
the circuit board.
The growing complexity of ASIC designs eventually led to the invention of the
field-programmable gate array (FPGA) in the early 1980s. An FPGA consists of a lot oflogic cells arranged in a two dimensional array, which can be interconnected in almost
any way1. With computers becoming faster, the design process evolved simultaneously.
More and more tasks of the design process were implemented in software tools to
aid the designer. This led to todays complex development environments that allow a
prototype to be created and simulated on almost any personal computer.
Two of the main FPGA manufacturers nowadays are Altera and Xilinx.
3.2 Internal structure of FPGAs
FPGAs consist of basic elements called logic blocks. These are arranged in a twodimensional array and can be connected by a programmable connection matrix. To
connect internal logic to external circuits, I/O blocks at the border of the array can
also be connected to the logic blocks. Newer devices may also contain a number of
additional features like clock generators, larger blocks of on-chip memory, digital
signal processor (DSP) components and even complete CPU cores.
1only limited by connection ressources, but there are plenty for most practical applications
18 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
19/65
3 Field Programmable Gate Arrays (FPGAs)
The logic blocks consist mainly of one or more lookup tables (LUTs), in most
cases 4-input LUTs, to implement logical gates. Most FPGAs implement the LUTsas well as the configuration registers for the connection matrix in SRAM technology
to achieve high overall chip clock rates. This requires a reconfiguration of the device
after each power-up. Some alternative implementations use non-volatile concepts like
EEPROM or Flash technology. Those are not as fast as an SRAM based device, but no
external components for the configuration of the logic device is needed after an initial
programming process.
In addition to the LUTs, flip-flops are provided to implement storage registers in
a logic element. Depending on the manufacturer and the particular device, these logic
blocks are called logic array blocks (LABs, Altera) or configurable logic blocks (CLBs,Xilinx).
Most newer devices partition these basic blocks into smaller elements. While a Xilinx
XC4000 2 CLB contains two LUTs and two registers only, in the Virtex II3 FPGA,
one CLB is divided into four slices which contain two LUTs and two registers each.
An Altera Stratix 4 LAB includes ten logic elements (LEs), which provide one LUT
and one register each. The more advanced Stratix II 5 provides eight adaptive logic
modules (ALMs) per LAB. An ALM contains a comparable larger LUT that can be
used in different configurations regarding the number of input signals. It also includes
two registers.
The programmable interconnections differ dependent on manufacturer and de-
vice, too. All devices provide a varying number of horizontal and vertical signal lines of
variable length across the chip, though. Additionally, most devices support global clock
signals to distribute a low skew clock to every part of the chip. One main difference
between Altera and Xilinx devices is that Altera uses a column oriented approach while
Xilinx maintains a true grid structure. In an Altera FPGA, the LABs are placed in
columns thus providing very fast interconnections to the neighbouring LABs in a row.
3.3 Suitability of FPGAs for the specified applica-
tion
The simplified algorithm described in chapter 2.2 can be realised with a very small
number of gates and registers per node. This fact and the arrangement of the nodes
2XC4000 data sheet: [Cor99]3Virtex II data sheet: [Cor03b]4Stratix data sheet: [Cor04b]5Stratix II data sheet: [Cor04c]
Universitat Bielefeld, AG Technische Informatik 19
7/29/2019 Dijsktra Thesis
20/65
3 Field Programmable Gate Arrays (FPGAs)
in a grid suggest the use of a similar structured silicon device. Besides the expensive
creation of an ASIC or a full-custom chip, the implementation on an FPGA seems to bethe only reasonable option. An efficient parallel implementation of the algorithm can
be achieved if it is possible to map one graph vertex to a single logic block of an FPGA.
The selection of an appropriate FPGA for this algorithm is difficult, because these
days, a broad range of devices optimised for different applications are commercially
available. There are low-budget as well as high performance solutions with a lot of
different additional features from both Xilinx and Altera, and there are other devices
from other companies like Actel, Quicklogic, and Lattice, too.
For this project, the Altera Stratix device family was chosen in favor of a Xilinxproduct for one main reason: Compared to XC4000 and Virtex II from Xilinx, they
offer more flip-flops per basic logic block. Since the maximum counter value depends
on the size of the map, a larger number of registers to store that value is desirable to
maximize the possible map size.
20 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
21/65
4Implementation of a Prototype
A running implementation of the algorithm described in 2.2 was created using an Altera
Nios Statix Edition prototyping board[Cor03a] on an Altera Stratix 1S10 FPGA running
at a clock rate of 50 MHz.
This version allows the calculation of potentials in a map as big as 16 by 16 pixels.
A graph node is implemented by using standard elements from the Quartus II design
software, possible optimisations and problems associated with this implementation arediscussed in the next chapter.
4.1 Overview
The prototype consists of the FPGA development board which is connected to a host
system by a RS232 serial connection. On the host, a set of utilities described later is
used to transfer map data to the FPGA, execute the calculation and retrieve the resulting
potential vector.
Figure 4.1: System overview of the prototype
Universitat Bielefeld, AG Technische Informatik 21
7/29/2019 Dijsktra Thesis
22/65
4 Implementation of a Prototype
Figure 4.2: Top level entity of the FPGA design. A larger version of this figure can be found in appendix A.1
The FPGA top-level design (Figure 4.2) uses the following components:
node matrix
The node matrix component contains a matrix of 16 by 16 graph nodes for the cal-
culation of the potentials. It is created automatically by an Objective C program
described in 4.3 that generates a Verilog source file which contains the instantia-
tion and interconnection source code for single graph nodes.
command register, decoder, msr
These components implement a state machine to execute instructions receivedvia the universal asynchronous receiver/transmitter (UART). The output of the
command register is wired to eight on-board LEDs to display the contents of this
register for debugging purposes.
address register, address decoders
The address register stores the address of a node selected by the ADDRESS in-
struction. The binary address is then decoded to an array of unary selector lines
by the row decoder and col decoder. Also, the node address is decoded by two
22 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
23/65
4 Implementation of a Prototype
seven-segment-decoders to display the current address on the seven segment ele-
ments on the Nios board.
UARTS
The UART core was obtained in VHDL source from Bertrand Cuzeau of
ALSE[ALS03] with permission to use in this project. It is connected by two
data busses for incoming and outgoing data and a few signal lines indicating the
receive and transmit states.
A variant of this design supporting larger addresses was made to test an implementation
with maps larger than 16 by 16 nodes. This included a second status register and an
extension of the instruction decoder to transfer these addresses. An implementationwhich could handle a map sized 24 by 24 nodes could be compiled successfully for the
Stratix 1S10 FPGA.
4.2 Design of graph vertices in digital hardware
The first implementation of a graph node is a modification of the design suggested in
[Mol99] It uses three D-Flip-Flops (DFF) to implement the delay elements and a fourth
DFF to store a configuration bit to mark obstacles.
More DFFs are used in the lpm countermegafunction of the Quartus II software which
can implement an arbitrary width binary counter. The prototype used a five bit widecounter, this results in a total of nine DFFs per vertex used in this design.
Figure 4.3: Implementation of a hardware graph vertex. A larger version of this figure can be found in appendix A.2
Universitat Bielefeld, AG Technische Informatik 23
7/29/2019 Dijsktra Thesis
24/65
4 Implementation of a Prototype
in0, . . . , in7 are the input signals for incoming edges of the neighbour vertices. in0,
. . . , in3 are connected to the orthogonal and in4, ..., in7 are connected to the diag-onal neighbours of the current vertex. Each of these four-input sets is OR combined
to generate the orthogonal edge and diagonal edge signals. The diagonal edge signal
is connected to the diagonal delay DFF resulting in a delay of one node clock cycle.
This DFFs output is then OR combined to the orthogonal edge signal and the result
is fed to the next delay DFF, delay1. Finally, the output of delay1 is delayed again by
DFF delay2. This implements the fact, that a diagonal edge has a weight of 3 while an
orthogonal edges weight is 2. Another OR gate feeds the output of delay2 back to its
input to keep a high state, regardless of the output of delay1. This feedback allows to
implement a preset logic for the reference node.
The output ofdelay2 is masked with the output of the configuration bit by an AND gate.If the output of the DFF which contains the configuration bit is low, this node is marked
as an obstacle and the output will stay low, regardless of any incoming signals.
The masked signal out is then distributed to the neighbouring nodes inputs and
simultaneously used as a low active clock enable signal for the counter present in each
node. In contrast to the suggested implementation in [Mol99], no global counter is used
in this variant. As long as the output of a node is low, the counter is incremented with
each cycle ofnode clk.
To set the configuration bit of a particular node and to set a reference node, a
simple address logic is used. Two binary to unary decoders are connected to the upperand lower half of the address register respectively to generate a number of row select
and column select signals according to the number of rows and columns of the map.
These decoders assure that only one row and column select signal is high active at any
given time. Each node of a given row n is connected to the corresponding row select[n]
signal, each node of a column m is connected to col select[m].
A node select signal is generated by performing a logical and on a nodes row select
and col selectinputs. This allows the transfer of a configuration bit to the node currently
selected by the address register using the conf data and conf clk inputs common to all
nodes. Also, the node selectsignal and the common conf setsignal are fed into a NAND
gate. This gates output is connected to the asynchronous preset input of the delay2 DFF,
which allows a preset of the currently selected node as a reference node.
Additionally, the node select signal is connected to the enable input of
counter tristate bufferto output the current nodes potential counteron the counter out
bus.
All configuration bits can be cleared asynchronously by assigning a low signal to the
common conf reset input. The counter and delay DFFs can be reset asynchronously by
assigning a low signal to node reset.
24 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
25/65
4 Implementation of a Prototype
Figure 4.4: Block schematic of the address logic
4.3 Generation of the node matrix by a Verilog
code generator
The current design uses a matrix of 16 by 16 nodes. Each nodes output is connected
to the inputs of each of its eight neighbours. This results in a total number of 65536connections for the output signals only. Thus, a manual placement of nodes and inter-
connections is impractical, but the regularity of the connections allows an automatic
generation of the node matrix in a hardware description language.
The input and output signals of each node can be divided into three different
categories:
The inputs node reset n, node clk, conf reset n, conf clk, conf set, conf data and
Universitat Bielefeld, AG Technische Informatik 25
7/29/2019 Dijsktra Thesis
26/65
4 Implementation of a Prototype
the output bus counter out are common signals to each node and passed through
from or to the node matrix.
All row selectand col selectselector input lines of the node matrix are connectedto each node in the corresponding row and column. The definition and inter-
connection of these signals depends on both the size of the node matrix and the
position of an individual node in the matrix.
The in0, .. . , in7inputs of each node are connected to the out signals of the eightneighbouring nodes. At the border of the matrix, a low signal is assigned to these
inputs if the appropriate neighbour does not exist. The definition of these signals
depends on the position and size of the matrix, too.
The Verilog source code of the node matrix consists of three parts:
A header containing the definitions of the node matrix module, inputs, outputs, and
internal signal lines, followed by the repeated instantiation and interconnection of nodes,
and a footer ending the module definition.
Three template files representing the needed parts are used to generate the resulting
file. Placeholder variables, marked by the #-character at the beginning and the end,
represent values dependent on size of the grid and position of individual nodes. They
are calculated and filled in upon execution of the code generator. The template class
supports integer, float and string variables, though only integer and string variables are
used in this application.
26 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
27/65
4 Implementation of a Prototype
module node_matrix(node_row,node_col,
node_reset_n,conf_reset_n,conf_clk,conf_set,conf_data,node_clk,
counter_out);
input [#maxrow#:0] node_row;
input [#maxcol#:0] node_col;
input node_reset_n;
input conf_reset_n;
input conf_clk;
input conf_data;
input conf_set;input node_clk;
output [4:0] counter_out;
wire node_clk;
wire [#maxrow#:0] node_row;
wire [#maxcol#:0] node_col;
wire node_reset_n;
wire conf_clk;
wire conf_data;
wire conf_set;
wire conf_reset_n;
tri [4:0] counter_out;
wire node_out[#maxnode#:0];
Figure 4.5: header.v - Template for the node matrix.v header
Figure 4.5 shows the template for the module header. For a height rows by width
columns matrix, the values for the used variables are:
#maxrow# = height-1
#maxcol# = width-1
#maxnode# = (height * width) -1
Those variables are used to define the number of selector signals and output signals of
all nodes.
Universitat Bielefeld, AG Technische Informatik 27
7/29/2019 Dijsktra Thesis
28/65
4 Implementation of a Prototype
The template for the instantiation of one node is shown in figure 4.6.
node node_x#x#_y#y# (
.row_select(node_row[#x#]),
.col_select(node_col[#y#]),
.conf_clk(conf_clk),
.conf_data(conf_data),
.conf_set(conf_set),
.conf_reset_n(conf_reset_n),
.node_clk(node_clk),
.node_reset_n(node_reset_n),
.in0(#in0#),
.in1(#in1#),
.in2(#in2#),
.in3(#in3#),
.in4(#in4#),
.in5(#in5#),
.in6(#in6#),
.in7(#in7#),
.out(node_out[#n#]),
.counter_out(counter_out)
);
Figure 4.6: node.v - Template for the node instantiation in node matrix.v
The nodes are generated row by row and column by column in each row. In a nested
loop, this template is used once for each node with #y# running from 0 to #maxrow#
and #x# running from 0 to #maxcol#. Therefore, a node at row 5, column 3
gets the name node\_x3\_y5, its row select and col select inputs are connected to
row select[5] and col select[3] respectively.
The number #n# of a node is calculated as
#n# = y*width + x
and this nodes output is connected to node out[#n#]. For the above example, in a
16 by 16 matrix the output line is node out[83].
As described in section 4.2, in0, . . . , in7 are the input signals from the neigh-
bouring nodes. Figure 4.7 shows the definition of a lookup table containing the
corresponding #x# and #y# offsets for each incoming signal.
28 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
29/65
4 Implementation of a Prototype
struct offset {
int dx,dy;};
struct offset edge_offsets[8] = {
{ -1, 0 }, // left
{ 1, 0 }, // right
{ 0, -1 }, // bottom
{ 0, 1 }, // top
{ -1, -1 }, // top left
{ 1, -1 }, // top right
{ -1, 1 }, // bottom left{ 1, 1 } // bottom right
};
Figure 4.7: Offset table describing the position of adjacent nodes
For each input inN, the offset edge_offsets[N] is added to the current #x# and
#y# to obtain the position of the neighbouring node.
In case of the resulting position being out of bounds, inN is substituted by the string
b0 which represents a logic zero in Verilog. Otherwise, the replacement string is
node_out[x+edge_offsets[N].dx+(y+edge_offsets[N].dy)*width]
After the Verilog code for all nodes is generated, the footer template is appended. This
template only contains the line
endmodule
to end the Verilog module definition.
4.4 Interface to a host system
An interface to the host system is needed to transfer map data to the FPGA, set ref-
erence nodes, and retrieve the resulting potential vector. For the prototype, an RS232
connection was chosen for the ease of creating the needed host software.
The interface mainly consists of a ready-to-use UART connected to the address and
command registers, a command decoder, and a state register, which allow to execute a
set of commands received via the RS232.
Universitat Bielefeld, AG Technische Informatik 29
7/29/2019 Dijsktra Thesis
30/65
4 Implementation of a Prototype
The RXD and TXD data lines of the UART are connected directly to FPGA pins, which
are then connected to the TTL to RS232 voltage level converters. After a byte is receivedby the UART, the RxRDY signal gets high for one clock cycle and data can be read from
the Dout bus.
The current data at the Din bus is sent to the RS232 by assigning a high signal to the LD
input for one clock cycle. While a byte is transmitted, the TxBusy signal is high.
Another status signal is RxErr, which indicates an error during the receipt one byte.
This signal is unconnected, since no error handling for the interface is implemented and
transmission errors are unlikely at a rate of 9600 bits per second.
The UART supports a choice between two transmission rates, defined as parame-
ters, by assigning a high or low signal to the Baud input. Since only a rate of 9600 bitsper seconds is used, this signal is tied to GND. RST is a low-active reset signal and
connected to the system reset signal.
Dout is fed into the data inputs of both the command and address registers, allowing to
load received data by assigning a high signal to one of the registers enable inputs for
one clock cycle.
counter outof the node matrix is connected to Din of the UART. This tri-state bus holds
the current counter value of the node selected by the address register.
The state register is used to distinguish between the wait for command state and the
actual execution of an instruction. As long as it is low, the system waits to receive a
byte via the UART. After uart rxrdy switches to a high value, the decoder performs afetch command cycle by assigning a high to command enable which results in the
command register to take over and store the value of the UARTs Dout bus. The msr is
set to high, too, and the execution of the particular instruction starts in the next system
clock cycle.
Table 4.1 shows the decoder table for the instruction decoder. The actual encoding
of each command can be taken from the column command register. The following
commands are implemented:
ADDRESS
Two states are needed for the ADDRESS instruction to load a node address to the
address register. First, the system stays in the load address wait state until a
second byte is received via the UART. After another transition of uart rxrdy to a
high value, the decoder assigns a high signal to address enable and the value of
Dout is taken over by the address register in the load address load cycle.
READ
The READ instruction is implemented in two states, too. As long as the
uart txbusy signal is in a high state, the system will stay in this state to wait for
an ongoing transmission of a byte to finish. When the UART is ready to transmit
30 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
31/65
4 Implementation of a Prototype
state msrin
commandregister
uartrxrdy
uarttxbusy
addressenable
commandenable
uartld
nodeclk
noderesetn
confclk
confresetn
confset
confdata
msrout
wait for command 0 x 0 x 0 0 0 0 1 0 1 0 0 0
fetch command 0 x 1 x 0 1 0 0 1 0 1 0 0 1
load address wait 1 b00000001 0 x 0 0 0 0 1 0 1 0 0 1
load address load 1 b00000001 1 x 1 0 0 0 1 0 1 0 0 0
read wait 1 b00000010 x 1 0 0 0 0 1 0 1 0 0 1
read send 1 b00000010 x 0 0 0 1 0 1 0 1 0 0 0
reset nodes 1 b00000100 x x 0 0 0 0 0 0 1 0 0 0
reset map 1 b00001000 x x 0 0 0 0 1 0 0 0 0 0
iterate 1 b00010000 x x 0 0 0 1 1 0 1 0 0 0
load config 1 bab100000 x x 0 0 0 0 1 1 1 a b 0
Table 4.1: Decoder table of the instruction decoder
new data, a high signal is assigned to the LD input in the read send state. This
triggers a transmission of the currently assigned value of the Din port to the host
system via the RS232.
RESET NODES
The RESET NODES instruction causes the node reset n output of the decoder to
become low for one clock cycle. This resets the counter and delay registers in the
node matrix.
RESET MAP
This instruction works similar to the RESET NODES command. By assigning
a low value to conf reset n for one cycle, the map configuration registers in the
node matrix are reset.
ITERATE
The ITERATE instruction causes a high value on the node clk output of the de-
coder, which results in a clock cycle for both the counter and delay registers of
the node matrix. This is used to perform iterations of the implemented algorithm.
LOAD CONFIG
LOAD CONFIG is used to set the currently selected node as a reference or an
obstacle node. Unlike the other instructions, this is the only one which encodes
Universitat Bielefeld, AG Technische Informatik 31
7/29/2019 Dijsktra Thesis
32/65
4 Implementation of a Prototype
additional parameters in the instruction word. These parameters, a and b in table
4.1, are passed through to the node matrix. If parameter a, the conf set signal, isset, the currently selected node is set as a reference node by asynchronously set-
ting the last delay element to a high value. Parameter b is passed to the conf data
signal and a simultaneous conf clkcycle causes the configuration DFF of the node
to take over this signal. Since the output of the configuration bit is AND combined
with the output of the last delay DFF, a low value has to be set for an obstacle node
and a high signal for a clear node.
The chosen encoding uses one bit of the instruction word to distinguish between dif-
ferent instructions. This keeps an implementation of the decoder in gate logic simple.
Figure 4.8 shows the implementation of the decoder.
Figure 4.8: Schematic of the instruction decoder
4.5 Software environment on the host system
On the host system, a set of command line utilities written in C is used for test runs.
The source code for a module containing common functions used in these utilities can
32 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
33/65
4 Implementation of a Prototype
be found in appendix B.2
The utilities include:
send
This program just sends the value supplied as an argument to the serial port. It is
used mainly for testing purposes, e.g. for sending a single command byte to the
FPGA.
Example:
$ ./send 4 # issue "reset_nodes" command to the FPGA
address
address sends the ADDRESS command, followed by the node address parame-
ter. It is used for the debugging of the address logic and seven segment decoders,
and for setting one or more reference nodes.
Example:
$ ./address 50 # 0x31 - set address to node (3,1)
$ ./send 14 # 0xE0 - s et node (3,1) as
# clear reference node
map transfer
This utility reads in an ASCII file , resets the map configura-
tion in the FPGA, and transmits the neccessary commands to configure the map
according to the contents of the file. The first two lines of the file contain the num-
meric values of the width and height of the map. In the following lines, a space
represents a clear node and any non-space character marks an obstacle. Figure 4.9
shows an example file.
iterate
This commands sends ITERATE instructions to the FPGA to perform
iterations of the algorithm.
read
To retrieve the potential vector from the FPGA, the read utility is used. It uses AD-
DRESS and READ NODE instructions to read every nodes potential and prints
it to the console.
Universitat Bielefeld, AG Technische Informatik 33
7/29/2019 Dijsktra Thesis
34/65
4 Implementation of a Prototype
16
16****************
* *
* * *
* * *
* * *
********* ****
* *
* * *
* * *
* * ************* ***
* *
**** * ********
* * *
* * *
****************
Figure 4.9: Example ASCII map for the transfer map utility
4.6 Result of a test run
In a test run, the map shown in figure 4.10 encoded as an ASCII map as shown in figure
4.9 is transfered to the FPGA. An ADDRESS instruction is then sent to the FPGA to set
the address register to node (1,1) and a following LOAD CONFIG instruction sets this
node as the reference node:
$ ./address 17 # 0x11 - set current node address to (1,1)
$ ./send 14 # 0xE0 - set current node as
# reference node
Figure 4.11 shows the output of the read utility after 71 iterations. These values are
loaded into Matlab to generate a visualisation of the resulting potential vector in a
greyscale image as shown in figure 4.12.
34 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
35/65
4 Implementation of a Prototype
Figure 4.10: Map for the test run
71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71
71 0 2 4 6 8 10 12 14 16 18 20 22 24 26 71
71 2 3 5 7 9 11 13 71 17 19 21 23 25 27 71
71 4 5 6 8 10 12 14 71 19 20 22 24 26 28 71
71 6 7 8 9 11 13 15 71 21 22 23 25 27 29 71
71 71 71 71 71 71 71 71 71 23 24 25 71 71 71 71
71 40 38 36 34 32 30 28 26 25 26 27 28 30 32 71
71 41 39 37 35 33 31 29 71 27 28 29 30 31 33 71
71 42 40 38 36 34 32 31 71 29 30 31 32 33 34 71
71 43 41 39 37 35 34 33 71 31 32 33 34 35 36 7171 71 71 71 71 71 71 71 71 71 71 71 36 71 71 71
71 59 57 55 53 51 49 47 45 43 41 39 38 39 41 71
71 71 71 71 54 52 71 48 71 71 71 71 71 71 71 71
71 61 59 57 55 54 71 50 51 53 55 57 59 61 63 71
71 62 60 58 57 56 71 52 53 54 56 58 60 62 64 71
71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71
Figure 4.11: Output of the read utility after 71 iterations
Universitat Bielefeld, AG Technische Informatik 35
7/29/2019 Dijsktra Thesis
36/65
4 Implementation of a Prototype
Figure 4.12: Greyscale image of the resulting potential vector.
X X X X X X X X X X X X X X X X
X t XX X XX X XX X XX X X X X X X X X X X X XX XX X XX
X
X
X X XX X X X X X X X X X X X X X XX XX X X X X X X X X X X X XX X XX X XX X X X X X X X X X X X X X X X
Figure 4.13: Direction vectors generated by a path tracing algorithm. By following these vectors from any point in themap, the reference node t is reached via the shortest path possible.
36 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
37/65
5Analysation and Discussion
The FPGA used in this application supplies a matrix of 30 by 40 LABs, containing 10
LEs each. To achieve an optimal mapping of the graph structure onto the FPGAs LAB
grid, the implementation of one graph node in one LAB is desirable. However, on the
chosen Stratix 1S10 FPGA two main problems occur:
With an increase of the map size, distanced may become longer. This leads toa higher maximal counter value, and the number of DFFs needed to store node
potentials grows.
No real tri-state signals are supported, so additional logic is required to retrievethe potentials from the FPGA.
The following sections discuss these problems.
5.1 Substitution of DFFs by LUT based delay ele-
ments
Figure 5.1, taken from the Stratix Device Handbook [Cor04b], shows the internal struc-
ture of a LE within a LAB.
Universitat Bielefeld, AG Technische Informatik 37
7/29/2019 Dijsktra Thesis
38/65
5 Analysation and Discussion
Figure 5.1: Stratix LE block schematic taken from [Cor04b]
An approach to conserve registers of LEs for storing larger counter values is the replace-ment of delay DFFs with a LUT based solution. The following requirements have to be
met by the replacement:
To put the system in a defined state prior to a computation, a reset signal is needed. An input signal to the element must be propagated on a clock edge. An additional asynchronous preset logic is needed in the last delay element to set
a reference node.
The first step in the creation of a LUT based register is the implementation of an asyn-
chronous RS flip-flop as shown in figure 5.2.
Figure 5.2: Equivalent block schematic for a LUT based RS flip-flop
38 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
39/65
5 Analysation and Discussion
The set input is connected to an OR gate. The output of this gate is then masked with
the low-active reset n signal and fed back to the second input of itself. As long as thereset n signal stays in a high state, this feedback loop causes the output to stay in a high
state, too, once a high signal is assigned to the set input. A low reset n signal causes
the output to become low. As long as the input signal also stays low, the flip-flop will
remain in this state.
To use this LUT based flip-flop in graph nodes, an extension to create an edge
triggered behaviour is needed. In theory, this can be achieved by masking the input
signal with the negated clock signal in a first LUT and then masking the output of the
first LUT with the same clock signal. When the input of the first LUT switches to high,
its output also switches to high during a low clock signal. At the edge of the clocksignal, the output of the first LUT stays high for the propagation delay of the LUT. In
this period of time, the second LUT already received the high clock signal without the
delay and the LUT output is expected to switch to and stay in a high state until a low
reset n signal is assigned.
Figure 5.3: Equivalent block schematic for a LUT based delay element
Figure 5.3 shows the complete schematic for the LUT based delay element. The
leftmost AND gate and the inverter are implemented in the first LUT, the rest of
this circuit is implemented in the second LUT. Tests of this circuit in the Quartus II
simulator produce contradicting results as shown in figures 5.4 and 5.5. While the
functional netlist simulation reports the predicted results, the output signal oscillates in
an unstable state in the timing simulation. A test on the FPGA hardware showed the
predicted behaviour to be true, so an implementation of a delay element in this way ispossible.
Figure 5.4: Results of a functional netlist simulation of the delay element.
Universitat Bielefeld, AG Technische Informatik 39
7/29/2019 Dijsktra Thesis
40/65
5 Analysation and Discussion
Figure 5.5: Results of a timing simulation of the delay element
Since the development environment usually optimises a users design to save on-chip
ressources, another problem arises during the implementation of this circuit. Every
network of logic gates is mapped onto 4-input LUTs, so suitable parts of a circuit are
reduced to 4-input functions to fit into those LUTs. For example, the RS flip-flop in
figure 5.2 is synthesized into a single LUT. In the proposed schematic of the clock
edge triggered delay element, the optimiser identifies both leftmost AND gates asredundant. This part of the schematic can also be written as the equation (a b) bwhich evaluates to false for all values ofa and b, if no propagation delays are taken into
account.
To solve this problem, the above circuit was created by using a method to directly
instantiate logic elements on the FPGA as described in the next section.
5.2 Low-level instantiation of logic elements by
VQM source filesTo allow design and synthesis of FPGA designs in 3rd party tools without the publica-
tion of the bitstream format, Quartus II offers the possibility to include Verilog Quartus
Mapping (VQM) files in a project and assemble them. These files make it possible to
instantiate and configure FPGA ressources like logic cells (LEs), I/O blocks, on-chip
memory, etc. on a low level. As the name suggests, the syntax of VQM files is the same
as that of Verilog files.
Unfortunately, no detailed public information about VQM files is available, but the
Quartus II software can create a VQM file of a compiled project. Therefore, reverse
engineering the file format is possible by compiling simple projects and analysing the
resulting VQM files.
5.2.1 Structure of VQM files
As an example, figure 5.6 shows the resulting VQM file of the RS flip-flop design from
the last section. The most interesting part of this file is the instantiation of a logic cell to
implement the given logical function. Since the design of the flip-flop uses only three
inputs overall, it was reduced to a single LE using only one LUT.
40 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
41/65
5 Analysation and Discussion
stratix_lcell (); creates a logic element named
with connections to other logic listed in . Configuration parameters of the
logic element are defined by defparam.
Two operation modes are possible for a Stratix LE: normal mode and dynamic
arithmetic mode. The Stratix Device Handbook[Cor04b] describes both modes in
detail.
The following input and output ports can be defined in a VQM file:
dataa, datab, datac, datad
These are the four data inputs of the LUT. datac can be configured as the cin
carry input signal or as an input for a feedback from the register of the same LE
alternately.
cin
This is the carry chain input signal of a LE. It can only be connected to the cout
carry out output of a previous LE in the same LAB.
regcascinThis is the register cascade input of the LE. To connect the input of the LEs
register directly to the output of the register in the previous LE inside a LAB, the
regcascin input is used.
aclr
The aclr input of a LE can be connected to a LAB wide asynchronous high-
active reset signal. Only two different signals of this type are available to all LEs
in one LAB.
aload
This is a LAB wide asynchronous preset signal. One signal of thise type common
to all LEs inside a LAB can be used. When a high signal is assigned to this input,
the register loads the current value of the datac port.
clk, ena
These are the clock and clock enable inputs of the register in a LE. Both are LAB
wide signals, too. There are two clock and two clock enable signals available for
each LAB.
Universitat Bielefeld, AG Technische Informatik 41
7/29/2019 Dijsktra Thesis
42/65
5 Analysation and Discussion
module rs_ff (reset_n,set, q);
input reset_n;
input set;
output q;
wire \reset_ncombout ;
wire \setcombout ;
wire \inst10 ;
wire gnd;
wire vcc;
assign gnd = 1b0;
assign vcc = 1b1;
stratix_io \reset_nI (.combout(\reset_ncombout ),.padio(reset_n));
defparam \reset_nI .operation_mode = "input";
defparam \reset_nI .ddio_mode = "none";
defparam \reset_nI .input_register_mode = "none";
defparam \reset_nI .output_register_mode = "none";
defparam \reset_nI .oe_register_mode = "none";
defparam \reset_nI .input_async_reset = "none";
defparam \reset_nI .output_async_reset = "none";
defparam \reset_nI .oe_async_reset = "none";
defparam \reset_nI .input_sync_reset = "none";
defparam \reset_nI .output_sync_reset = "none";
defparam \reset_nI .oe_sync_reset = "none";
defparam \reset_nI .input_power_up = "low";defparam \reset_nI .output_power_up = "low";
defparam \reset_nI .oe_power_up = "low";
stratix_io \setI (.combout(\setcombout ),.padio(set));
// parameters removed, see above
stratix_lcell \inst120 (.datab(\reset_ncombout ),.datac(\setcombout ),
.datad(\inst10 ),.combout(\inst10 ));
defparam \inst120 .operation_mode = "normal";
defparam \inst120 .synch_mode = "off";
defparam \inst120 .register_cascade_mode = "off";
defparam \inst120 .sum_lutc_input = "datac";
defparam \inst120 .lut_mask = "CCC0";
defparam \inst120 .output_mode = "comb_only";
stratix_io \qI (.datain(\inst10 ),.padio(q));defparam \qI .operation_mode = "output";
// rest of parameters removed, see above
endmodule
Figure 5.6: VQM file generated by the Quartus II software for the LUT RS flip-flop shown in Figure 5.2
42 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
43/65
5 Analysation and Discussion
sclr
This signal is a synchronous LAB wide reset input for the register. One sclr
signal can be used per LAB.
sload
sload is the synchronous counterpart of the aload input. When this LAB wide
signal is set, the register stores the value of the datac input on the next clock
edge.
combout
The combout output is the output of the LEs LUT.
regout
This is the output of the LEs register.
cout
The cout output is the carry chain output of a LE. It is available in arithmetic
mode and can only be connected to the cin port of the next LE in the same LAB.
To configure the behaviour of a LE, defparam is used to set required parameters. The
availability of some of the above ports also depends on these parameters.
operation_mode
The value of this parameter can be set to normal or arithmetic. In
arithmetic mode, the LUT is split into four 2-input LUTs. Also, the cout output
becomes available in this mode. However, the implementation of delay elements
only uses LEs in normal mode - a detailed description of the dynamic arithmetic
mode is found in the Stratix Device Handbook [Cor04b].
lut_mask
The lut_mask is a 16 bit hexadecimal value that represents the truth table of the
LUT. The Quartus II Handbook, Volume 3, Section 10 [Cor04a] describes how to
calculate the LUT mask for a given equation by the creating of a truth table. The
LUT can be compared to a simple 16x1 RAM: with dataa, datab, datac,
and datad used as address lines 0 to 3, the LUT mask resembles the stored data
word with the least significant bit first.
register_cascade_mode
Possible values for this parameter are on and off. When this mode is en-
abled, the output of the previous LE in the LAB can be connected directly to the
input of the register of the LE that is to be configured by using the regcascin
input.
Universitat Bielefeld, AG Technische Informatik 43
7/29/2019 Dijsktra Thesis
44/65
5 Analysation and Discussion
sum_lutc_input
This parameter determines the signal which is fed into the LUTs datac input.
A value of datac connects the LUTs input to the datac input of the LE. The
value cin connects the input to the carry in input, and the value qfbk can
be used to connect the output of the register in this LE to the LUTs datac input.
synch_mode
Like register_cascade_mode, a boolean value of on or off can be
assigned to this parameter. It has to be set to use the synchronous input ports of
the LE.
output_mode
The LE has three output modes: The LUT or the register output or
both can be made available for interconnections. The possible values are
comb_only for the LUT output only, reg_only for the register output
and reg_and_comb for both.
Figure 5.6 also shows the definition of stratix_io input and output pins. Since
it is possible to include VQM files in a Quartus II project, the input and output pin
assignments can be created by standard tools. Therefore, an implementation of the
input and output blocks in VQM is not neccessary.
As a side note, while trying to review the VQM file from a more complex project which
contained the Nios system-on-a-chip CPU of the Nios Development Kit, the resulting
file was no ASCII file. Instead, it contained binary data. It seems that VQM files
that contain elements from the Quartus II SOPC Builder are encrypted somehow. A
possible reason for this may be the prevention of compiling this CPU for other FPGA
architectures by reimplementing the basic Stratix components in Verilog.
44 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
45/65
5 Analysation and Discussion
5.2.2 Implementation of the LUT delay element in VQM
As previously mentioned, the LUT delay element utilises two LUTs.
Figure 5.7: Schematic for a LUT based delay element
Figure 5.7 shows the neccessary connections. The logical function for the LUT labeled
stage1 is a b. The equivalent truth table for this function is shown in table 5.1.
datab dataa out
0 0 0
0 1 1
1 0 0
1 1 0
Table 5.1: Truth table for the stage1 LUT
Since the output does not depend on inputs c and d, the output pattern is repeated four
times, resulting in a binary LUT mask of 0010001000100010 equal to a hexadecimal
value of 2222.
According to figure 5.3, the function for the LUT labeled stage2 is ((a b) d) c).
Table 5.2 shows the resulting truth table.The resulting LUT mask for this function is a hexadecimal value of F080. Both LUTs
can be implemented in VQM files as shown in figure 5.8 and added to a Quartus II
project. As soon as the source files are added, the implemented modules can be used in
the project like any other component defined in Verilog. In the test environment for the
delay element, a symbol was generated for both stage1 and stage2 and they were
instantiated in a block schematic file.
Universitat Bielefeld, AG Technische Informatik 45
7/29/2019 Dijsktra Thesis
46/65
5 Analysation and Discussion
datad datac datab dataa out
0 0 0 0 00 0 0 1 0
0 0 1 0 0
0 0 1 1 0
0 1 0 0 0
0 1 0 1 0
0 1 1 0 0
0 1 1 1 1
1 0 0 0 0
1 0 0 1 0
1 0 1 0 01 0 1 1 0
1 1 0 0 1
1 1 0 1 1
1 1 1 0 1
1 1 1 1 1
Table 5.2: Truth table for the stage2 LUT
5.3 Reduction of the number of needed DFFs byusing an alternate path tracing algorithm
In a regular grid, the largest potential difference between two adjacent nodes is the
weight of a diagonal edge. If the application needs only one reference node to be
set, the shortest path in the map can still be found if a smaller counter is used, as
long as the counter can hold a value of at least twice the maximum distance between
two adjacent nodes. In longer paths, a counter overrun occurs, but a modification
of the path tracing algorithm still allows to find the shortest path. However, the in-
formation about the absolute distance to the reference node in the potential vector is lost.
The algorithm described in section 2.2 uses a maximum weight of three. If three
bit counters are used, an overrun occurs if the value changes to a number larger than
seven. This results in a substraction of eight from the previous value of a counter. When
this happens, the minimal difference between two nodes potentials five in this case
is the maximum counter value increased by one minus the largest possible weight of
any edge. Thus, in the tracing algorithm, a counter overrun can be assumed to have
occurred whenever the difference between the counters of two nodes is equal or larger
than five.
46 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
47/65
5 Analysation and Discussion
module stage1 (in, clk, out);
input in;
input clk;
output out;
wire in;
wire clk;
wire out;
stratix_lcell stage1_lut (.dataa(in),.datab(clk),.combout(out));
defparam stage1_lut .operation_mode = "normal";
defparam stage1_lut .synch_mode = "off";
defparam stage1_lut .register_cascade_mode = "off";
defparam stage1_lut .sum_lutc_input = "datac";
defparam stage1_lut .lut_mask = "2222";
defparam stage1_lut .output_mode = "comb_only";
endmodule
module stage2 (in, clk, reset_n, out);
input in;
input clk;
input reset_n;
output out;
wire in;
wire clk;
wire reset_n;
wire out;
stratix_lcell stage2_lut (.dataa(in),.datab(clk),.datac(reset_n),
.datad(out),.combout(out));
defparam stage2_lut .operation_mode = "normal";
defparam stage2_lut .synch_mode = "off";
defparam stage2_lut .register_cascade_mode = "off";
defparam stage2_lut .sum_lutc_input = "datac";
defparam stage2_lut .lut_mask = "F080";
defparam stage2_lut .output_mode = "comb_only";
endmodule
Figure 5.8: Implementation of two LUTs for a delay element
Universitat Bielefeld, AG Technische Informatik 47
7/29/2019 Dijsktra Thesis
48/65
5 Analysation and Discussion
The pseudo C code in figure 5.9 implements an iteration step of the modified tracing
algorithm.
// offsets to neighbours
int dx[]={-1, 0, 1,-1, 1,-1, 0, 1};
int dy[]={-1,-1,-1, 0, 0, 1, 1, 1};
int num_neighbours = 8
int min_overrun_diff = 5;
int overrun_value = 8;
// p[][] contains a two dimensional array
// of counter values
int max_dist=0;
int next_index=-1;
// for each neighbour...
for(i=0;i max_dist) {
max_dist = dist; // save new maximum distance
next_index = i; // save index of offset
}
}
if(next_index >= 0) {
// neighbour with lower potential found
x+= dx[next_index];
y+= dy[next_index];
} else {
// reference node reached
}
Figure 5.9: Pseudo C code for the modified tracing algorithm
5.4 Implementation of the address logic and host
interface
The second problem of an efficient implementation of this algorithm involves the trans-
fer of calculated potential data to the host. In the original implementation of the pro-
totype, tri-state buffers supplied by the Quartus II development environment have been
used to access the contents of each counter register. Unfortunately, the analysis of these
48 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
49/65
5 Analysation and Discussion
buffers reveals, that they are actually implemented in gate logic, since the used FPGA
does not support real tri-state signals.
Figure 5.10: AND-OR replacement logic for tri-state buffers
Figure 5.10 shows an implementation for a substitution of those tri-state buffers usinggate logic. chain_in is connected to the output ofchain_out of the previous node.
The data input is connected to the output of the data signal to be tri-stated. This
signal is masked with the select signal by an AND gate, the result is OR combined
with the result of the previous nodes and passed on to the next node. Since only one
select signal of the whole chain can be active at a time, the masked output of all
inactive parts is low. Only the data input of the selected node determines the result of
the whole chain.// // On an FPGA architecture with 4-input LUTs, an implementation of
this logic fits into one LUT. Even an extension for the combination of row and column
select signals by an AND gate can be implemented in a single LUT, but this method is
not desired in an implementation with a large node matrix, since a long chain of LUTs
would result in a large propagation delay. Instead, the chain can be broken down in
one chain for each row. The values of the selected column are then combined within
each row, and an additional row select logic provides the final result. This structure still
leaves one input port of a LUT vacant. If this port is configured as an additional input
to the OR gate, further reduction of the propagation delay is possible by arranging the
elements of the chain in a tree-like structure. Figure 5.11 shows the resulting selection
logic for three data bits.
Universitat Bielefeld, AG Technische Informatik 49
7/29/2019 Dijsktra Thesis
50/65
5 Analysation and Discussion
Figure 5.11: Efficient selection logic for three data bits
Another desired improvement regarding the host interface is a faster transfer of the map
and the result of the calculation. The prototype uses a RS232 connection at 9600 bits
per second, and three bytes have to be transfered to configure one node of the map.
This results in a duration of 2.5 milliseconds per node. In a map of 16 by 16 nodes,
an approximation of the maximum potential value for a worst case scenario is 225 (see
section 2.2). With the FPGA running at 50 MHz, the calculation finishes after 4.5
microseconds. Therefore, in a real-world application, a PCI interface could result in a
great acceleration. By operating the FPGA synchronously to the PCI bus at 33 MHz or
66 MHz and a using a parallel transfer of 32 or 64 configuration bits in a burst transfer1, the overhead of configuration can be reduced significantly.
5.5 Conclusion
A parallel implementation of the algorithm described in 2.2 on an FPGA is possible.
However, the mapping of a graph node onto a single LAB could not be realised, because
the additional logic to implement the counter retrieval mechanism increased the num-
ber of logic elements needed. Thus, a topology preserving mapping of the graph onto
the FPGA would only be possible if a second adjacent LAB is used to implement the
complete graph node. This would reduce the maximum map size by 50%. Some testswith different map and counter sizes have been made, and the results demonstrated that
the Quartus II development environment is already able to generate a ressource efficient
version of this project, though the automatic placement of nodes could not be predicted.
Figure 5.12 shows the connections of the incoming edge signals to node (13,3) in a map
sized 16 by 16. While a two LAB version of the algorithm uses 20 LEs per node, the
1In a PCI burst transfer, one data word of a contiguous memory block can be transfered per clock
cycle.
50 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
51/65
5 Analysation and Discussion
Size of the node matrix Nodes total Counter width LEs used LEs/node used
16x16 256 5 bits 4118 16.124x24 576 5 bits 9703 16.8
24x24 576 3 bits 7779 13.5
Table 5.3: Map size vs. LEs/node used
synthesiser of the Quartus II software needed less than that number. Table 5.3 shows
the needed LEs per node for three different versions of the prototype.
Figure 5.12: Connections of the edge signals to map node (13,3) in a 16 by 16 map
To overcome the problems regarding the retrieval logic, the choice of a different newer
generation FPGA appears feasible. For example, the Xilinx Virtex II FPGA supports
tri-state buses, however the available number of these signals as described in [Cor03b]
is still too limited for this application. Also, the newer Altera Stratix II FPGAs use
improved versions of the LAB as basic logic blocks, described in the Stratix II Device
Universitat Bielefeld, AG Technische Informatik 51
7/29/2019 Dijsktra Thesis
52/65
5 Analysation and Discussion
Handbook [Cor04c]. These LABs contain eight adaptive logic modules (ALMs) which
not only are larger, but a more flexible configuration is possible. There are two registersper ALM, and with eight data inputs per ALM altogether, the configuration of two 4-
input LUTs in one ALM is not the only option. It is possible to configure one 3-input
LUT and one 5-input LUT or to use one large 6-input LUT and connecting the remaining
two inputs directly to the registers. Some other configurations with different constraints
are possible, too. The design using the Stratix LEs utilises a full LUT for combining the
already OR combined and delayed diagonal edge signal with the output signal of the OR
gate which combines the orthogonal edge signals. Furthermore, the remaining registers
in those LEs are hardly accessible. In an ALM, the combination of all four orthogonal
edge signal and the delayed diagonal edge signal would be possible in a single 5-input
LUT configured as an OR gate while keeping both registers of the ALM available.
52 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
53/65
ALarge format schematics
A.1 Schematic of the top level entity
On the following page, a fold-out of the top level entity of the FPGA design can be
found.
Unfortunately, the picture for this schematic was taken from an early version of
the project and the counter out output of the node matrix is shown with a wrong bus
width. While the picture shows an eight bit wide bus, the running prototype uses only a
five bit wide bus. The correct label for this output is counter_out[4..0].
Universitat Bielefeld, AG Technische Informatik 53
7/29/2019 Dijsktra Thesis
54/65
A Large format schematics
A.2 Schematic of a graph vertex implemented in
digital hardware
54 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
55/65
BSource code
B.1 main.m - Main module of the Verilog code gen-
erator
/* main.m of the verilog code generator */
#import "template.h"
struct offset {
int dx,dy;
};
struct offset edge_offsets[8] = {
{ -1, 0 }, // left
{ 1, 0 }, // right
{ 0, 1 }, // bottom
{ 0, -1 }, // top
{ -1, -1 }, // top left
{ 1, -1 }, // top right
{ -1, 1 }, // bottom left
{ 1, 1 } // bottom right
};
int main(int argc, char **argv) {
int width=16;
Universitat Bielefeld, AG Technische Informatik 55
7/29/2019 Dijsktra Thesis
56/65
B Source code
int height=16;
unsigned char *line;
int x,y,n,i,j;
int numnodes=width*height;
unsigned char chain_in_buffer[128];
unsigned char edge_buffer[8][128];
unsigned char node_edge_buffer[128];
id header,node,footer;
header=[Template new];
footer=[Template new];
[header load: "templates/header.v"];
[header set: "maxnode" toInt: numnodes-1];
[header set: "maxcol" toInt: width-1];
[header set: "maxrow" toInt: height-1];
[header show];
for(y=0;y
7/29/2019 Dijsktra Thesis
57/65
B Source code
toString: edge_buffer[i]];
}
[node show];
}
}
[footer load: "templates/footer.v"];
[footer show];
}
Universitat Bielefeld, AG Technische Informatik 57
7/29/2019 Dijsktra Thesis
58/65
B Source code
B.2 serial.c - Support functions for the host utili-
ties
/* serial.c - module containing serial port
* initialisation and helper functions.
*/
#include
#include
#include
#include
#include
#include
#include
#include "serial.h"
#define SERIAL "/dev/tty.USA19QW1b1P1.1"
int serial_fd;
struct termios options;
fd_set read_fds;struct timeval timeout;
int init_serial() {
serial_fd=open(SERIAL,O_RDWR | O_NOCTTY | O_NDELAY);
if(serial_fd==-1) {
perror("cannot open serial port!\n");
exit(0);
}
fcntl(serial_fd,F_SETFL,0);
tcgetattr(serial_fd,&options);
cfsetispeed(&options, B9600);
cfsetospeed(&options, B9600);
options.c_cflag &= PARENB;
58 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
59/65
B Source code
options.c_cflag &= CSTOPB;
options.c_cflag &= CSIZE;
options.c_cflag &= CRTSCTS;
options.c_cflag |= ( CS8 | CREAD | CLOCAL );
options.c_lflag &= (ICANON | ECHO | ECHOE | ISIG);
options.c_iflag &= (IXON | IXOFF | IXANY);
options.c_oflag &= OPOST;
tcsetattr(serial_fd, TCSAFLUSH, &options);
}
void close_serial() {
close(serial_fd);
}
void send_char(unsigned char c) {
write(serial_fd,&c,1);
}
void send_string(unsigned char *s, int length) {write(serial_fd,s,length);
}
unsigned char read_char() {
unsigned char buf;
read(serial_fd,&buf,1);
return buf;
}
// fpga commands
void reset_map() {
send_char(0x08);
}
void reset_nodes() {
send_char(0x04);
}
Universitat Bielefeld, AG Technische Informatik 59
7/29/2019 Dijsktra Thesis
60/65
B Source code
unsigned char read_node(unsigned char addr) {
send_char(0x01);
send_char(addr);
send_char(0x02);
return read_char();
}
void write_node(unsigned char addr, int flag) {
send_char(0x01);
send_char(addr);
if(flag) {
send_char(0x60);} else {
send_char(0x40);
}
}
void activate_node(unsigned char addr) {
send_char(0x01);
send_char(addr);
send_char(0xE0);
}
void iterate() {
send_char(0x10);
}
60 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
61/65
B Source code
B.3 map transfer.c - Host utility to transfer a map
to the FPGA
/* map_transfer.c - Transfers an ASCII map file
* to the FPGA
*/
#include
#include "serial.h"
int main(int argc, char **argv) {
FILE *map;
char line_buffer[256];
unsigned char addr;
int width,height,x,y;
if(argc!=2) {
printf("usage: %s \n",argv[0]);
exit(10);
}
if(!(map=fopen(argv[1],"r"))) {printf("cant open mapfile!\n");
exit(10);
}
fgets(line_buffer,256,map);
width=atoi(line_buffer);
fgets(line_buffer,256,map);
height=atoi(line_buffer);
init_serial();reset_nodes();
reset_map();
for(y=0;y
7/29/2019 Dijsktra Thesis
62/65
B Source code
printf("**");
} else {
write_node(addr,1);
printf(" ");
}
}
printf("\n");
}
close_serial();
}
62 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
63/65
Bibliography
Bibliography
[ALS03] Bertrand Cuzeau ALSE. Simple UART project - VHDL. http://www.alse-
fr.com, 2003.
[Cor99] Xilinx Corporation. XC4000 databook. http://www.xilinx.com, 1999.
[Cor03a] Altera Corporation. Nios development board. http://www.altera.com, 2003.
[Cor03b] Xilinx Corporation. Virtex-II platform FPGAs: Complete data sheet.
http://www.xilinx.com, 2003.
[Cor04a] Altera Corporation. Quartus II device handbook. http://www.altera.com,
2004.
[Cor04b] Altera Corporation. Stratix device handbook. http://www.altera.com, 2004.
[Cor04c] Altera Corporation. Stratix II device handbook. http://www.altera.com, 2004.
[Min78] E. Minieka. Optimization Algorithms for Networks and Graphs. Dekker, New
York, Basel, 1978.
[Mol99] Ralf Moller. Path planning using hardware time delays. IEEE Transactions
on Robotics and Automation, Volume 15 No. 3:pp. 588591, 1999.
[Nes03] John A. Nestor. FPGA Implementation of a Mulitlayer Maze Router. De-
partment of Electrical and Computer Engineering, Lafayette College, Easton,
2003.
[TS01] Matti Tommiska and Jorma Skytta. Dijkstras shortest path routing algorithm
in reconfigurable hardware. LNCS 2147, pages pp. 653657, 2001.
Universitat Bielefeld, AG Technische Informatik 63
7/29/2019 Dijsktra Thesis
64/65
Bibliography
64 Universitat Bielefeld, AG Technische Informatik
7/29/2019 Dijsktra Thesis
65/65
Bibliography
Hiermit versichere ich, da ich diese Diplomarbeit selbstandig bearbeitet habe. Ich habe
keine anderen als die angegebenen Quellen und Hilfsmittel benutzt und entsprechende
Zitate kenntlich gemacht.
Bielefeld, den November 11, 2004
Marcus Grieger