Dijsktra Thesis

Embed Size (px)

Citation preview

  • 7/29/2019 Dijsktra Thesis

    1/65

    Universitat BielefeldTechnische Fakultat

    AG Technische Informatik

    Diplomarbeit

    A parallel implementation of path planning on

    reconfigurable hardware

    Marcus Grieger

    November 11, 2004

    BetreuerProf. Dr. Ralf Moller

    Dipl.-Inform. Tim Kohler

  • 7/29/2019 Dijsktra Thesis

    2/65

    Diese Diplomarbeit wurde von Prof. Dr. Ralf Moller und Dipl.-Inform. Tim Kohler

    betreut.

    Gutachter waren Prof. Dr. Ralf Moller und Dipl.-Inform. Tim Kohler.

    2 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    3/65

    Contents

    Contents

    1 Introduction 5

    1.1 Dijkstras shortest path algorithm . . . . . . . . . . . . . . . . . . 51.2 Path planning on binary maps . . . . . . . . . . . . . . . . . . . . 61.3 Mapping of binary map nodes on reconfigurable hardware . . . . 6

    1.4 Applications of grid based shortest path algorithms . . . . . . . . 61.5 Related projects . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    2 The Dijkstra Algorithm 9

    2.1 The Dijkstra Shortest Path algorithm . . . . . . . . . . . . . . . . 92.1.1 Initialisation . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    2.1.2 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    2.2 Simplification of the Dijkstra algorithm for a mapping of graph ver-tices on digital hardware in grid-based applications . . . . . . . . 122.2.1 Tracing of the shortest path in the potential vector . . . . 13

    3 Field Programmable Gate Arrays (FPGAs) 17

    3.1 History of FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.2 Internal structure of FPGAs . . . . . . . . . . . . . . . . . . . . . 183.3 Suitability of FPGAs for the specified application . . . . . . . . . 19

    4 Implementation of a Prototype 21

    4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    4.2 Design of graph vertices in digital hardware . . . . . . . . . . . . 234.3 Generation of the node matrix by a Verilog code generator . . . . 254.4 Interface to a host system . . . . . . . . . . . . . . . . . . . . . . 29

    4.5 Software environment on the host system . . . . . . . . . . . . . 324.6 Result of a test run . . . . . . . . . . . . . . . . . . . . . . . . . . 34

    5 Analysation and Discussion 37

    5.1 Substitution of DFFs by LUT based delay elements . . . . . . . . 375.2 Low-level instantiation of logic elements by VQM source files . . 40

    Universitat Bielefeld, AG Technische Informatik 3

  • 7/29/2019 Dijsktra Thesis

    4/65

    Contents

    5.2.1 Structure of VQM files . . . . . . . . . . . . . . . . . . . . 40

    5.2.2 Implementation of the LUT delay element in VQM . . . . 455.3 Reduction of the number of needed DFFs by using an alternate

    path tracing algorithm . . . . . . . . . . . . . . . . . . . . . . . . 465.4 Implementation of the address logic and host interface . . . . . . 48

    5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

    A Large format schematics 53A.1 Schematic of the top level entity . . . . . . . . . . . . . . . . . . . 53A.2 Schematic of a graph vertex implemented in digital hardware . . 54

    B Source code 55B.1 main.m - Main module of the Verilog code generator . . . . . . . 55B.2 serial.c - Support functions for the host utilities . . . . . . . . . . 58

    B.3 map transfer.c - Host utility to transfer a map to the FPGA . . . . 61

    C Bibliography 63

    4 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    5/65

    1Introduction

    One of the problems in mobile robotics is navigation and path planning on maps. An

    approach to finding a shortest path is to encode the map in a weighted graph and use

    Dijkstras shortest path algorithm [Min78] to find the shortest path between two nodes

    in this graph. The downside of this algorithm is a quadratic growth of complexity de-

    pendent on the number of nodes. This work describes a parallel implementation of a

    simplified version of Dijkstras algorithm to achieve a linear growth.

    1.1 Dijkstras shortest path algorithm

    The common version of Dijkstras algorithm solves an all-shortest-path problem 1 in a

    weighted graph. Each vertex represents a location on a map and is connected to one

    or more other vertices via weighted edges, the weight representing a distance or cost

    incurred by following this edge. After defining a destination or reference vertex, the

    algorithm iteratively calculates a distance potential in every vertex which represents the

    distance to the destination node. The shortest path from any starting vertex is found by

    subsequently following the edge to the neighbouring vertex with the lowest potential. A

    detailed formal description of this algorithm is found in the first part of chapter 2.

    1A solution to an all-shortest-path problem provides the shortest path from a reference vertex to any

    other vertex of the graph, not only the shortest path between two specific vertices.

    Universitat Bielefeld, AG Technische Informatik 5

  • 7/29/2019 Dijsktra Thesis

    6/65

    1 Introduction

    1.2 Path planning on binary maps

    A possible representation of a map generated by a mobile robots sensors is a simple two

    dimensional binary map, where a logical zero represents free space and a logical one

    represents obstacles. This map is converted to a graph by placing a vertex in every pixel

    of the map and connecting it with eight edges to its eight neighbours. The weights result

    from the simple Euclidian distance of 1 to orthogonal neighbours and

    2 to diagonalneighbours. Obstacle vertices get an infinite weight or are not connected at all.

    1.3 Mapping of binary map nodes on reconfig-

    urable hardwareIn a classic implementation, each iteration requires a sequential calculation of each ver-

    tex potential to be updated, therefore a parallel calculation of all map nodes on suitable

    hardware should result in a significant reduction of complexity and thus in great accel-

    eration of this algorithm. To achieve this goal, more simplifications are made to allow

    for a creation of an electronic equivalent of a graph vertex. An interconnected matrix of

    these vertices can be implemented on a field programmable gate array (FPGA) and the

    calculation of one iteration is possible in a single clock cycle.

    1.4 Applications of grid based shortest path algo-

    rithms

    In addition to the acceleration of offline and real time path planning in mobile robots,

    variants of this algorithm could be used in other applications:

    A quite similar problem of finding a shortest path can be found in so-called maze

    routing, where conductive tracks on a printed circuit board have to be routed between

    pins of components without overlapping or crossing other tracks. This variant starts

    with a clear map and every routed track becomes an obstacle for the following tracks.

    This method of routing normally uses only orthogonally connected vertices, but in thecase of a multilayer board, the map may become three dimensional.

    Another application could be segmentation in the field of image processing. A compar-

    ison of region growing algorithms to the wave propagation in this algorithm suggests

    using a variant that uses more than one reference vertex as initial region seeds and some

    binary tag being propagated with each wave to distinguish between different regions

    later.

    6 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    7/65

    1 Introduction

    1.5 Related projects

    [Nes03] describes an implementation of a maze-routing accelerator, based on Xilinx

    XC4000 and Xilinx VirtexII FPGAs.

    An implementation of an accelerator for the calculation of a shortest path in a common

    graph is described in [TS01]. This approach aims at the acceleration of the Dijkstra

    algorithm used in dynamic network routing.

    Universitat Bielefeld, AG Technische Informatik 7

  • 7/29/2019 Dijsktra Thesis

    8/65

    1 Introduction

    8 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    9/65

    2The Dijkstra Algorithm

    2.1 The Dijkstra Shortest Path algorithm

    The Dijkstra Shortest Path Algorithm is described in [Min78]. To solve an all-shortest-

    path problem, each location is assigned to a vertex v V in an undirected graph G =(V, E). All locations reachable from this vertex are then connected by a set of weighted

    edges E. A weight represents a distance or cost incurred by following the path alongthis edge. The algorithm involves an initialisation step and a number of iterations:

    2.1.1 Initialisation

    A reference or target vertex t V is defined and a distance potential p(x) is assigned toevery x V. This potential is initialized with p(x) = 0 for x = t and p(x) = for allother vertices:

    p(t) = 0 (2.1)

    p(x) = x = t, x V. (2.2)Initially, all vertices x = t and all edges are uncoloured. Vertex t is coloured andbecomes the most recently coloured vertex named i.

    2.1.2 Iteration

    For each uncoloured vertex x that receives an edge from i, p(x) is updated as the min-imum of its own potential and the potential of vertex i in addition to the cost of the

    connecting edge:

    Universitat Bielefeld, AG Technische Informatik 9

  • 7/29/2019 Dijsktra Thesis

    10/65

    2 The Dijkstra Algorithm

    p(x) = min(p(x), p(i) + w(i, x)) (2.3)

    If uncoloured vertices with a finite potential exist, the vertex j with the lowest potential

    is coloured and i is set to j. Additionally, the edge that determined the value of p(x) iscoloured. Else, no further vertices are reachable and the algorithm stops.

    If the vertex s representing the starting location has been coloured, the shortest path from

    this location to the reference node t has been found and can be obtained by following

    the coloured edges.

    10 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    11/65

    2 The Dijkstra Algorithm

    Figure 2.1: Example of a shortest path calculation in a weighted graph from vertex s to t using the Dijkstra method. The

    reference vertex t is initialized with p(t) = 0, all other vertices are initialized with p(x) = , x = t, denotedby a blank node. In f), the starting node s is coloured and the shortest path from s to t is represented bybold (coloured) edges.

    Universitat Bielefeld, AG Technische Informatik 11

  • 7/29/2019 Dijsktra Thesis

    12/65

    2 The Dijkstra Algorithm

    2.2 Simplification of the Dijkstra algorithm for a

    mapping of graph vertices on digital hardware

    in grid-based applications

    To solve a path planning problem on a binary map, the vertices of the graph G = (V, E)are placed in a regular grid and each vertex v V is connected with its eight neighboursvia edges with constant weights w(y, x) = 1 for orthogonal edges and w(y, x) =

    2

    for diagonal edges.

    For this special case, [Mol99] suggests a simplification of the Dijkstra algorithm that

    allows an implementation of a graph vertex in simple digital hardware.

    The first simplification suggested by the accelerated Dijkstra method as described in

    [Mol99] is to use integral weights. Incorporating a factor of 2, weights are w(y, x) = 2for orthogonal edges and w(y, x) = 3 for diagonal edges. This eliminates the needto implement complex combinatorial circuits for real number arithmetics and allows

    an encoding of integral weights through an equivalent number of delay elements in

    hardware.

    In the original Dijkstra algorithm, the vertex with the lowest potential is coloured

    in each iteration. To achieve the same goal, the implementation proposed by [Mol99]

    introduces a global counter that is incremented in constant periods of time. At thestart of a calculation, a binary signal is propagated from the reference node t to the

    neighbouring nodes. This signal then spreads through the entire graph, delayed by a

    number of clock cycles equivalent to the weight of the edges. When the signal reaches

    a node, the value of the global counter is latched in a register inside the node. The

    time needed by the signal to reach a node is equal to the number of delay elements

    traversed and thus equal to the sum of the weights of the edges involved. Since the

    global counter increments equally to the time needed, the latched value represents the

    same sum of the weights. All incoming signals of the neighbouring edges of a node are

    OR combined, so the earliest signal locks this node. Therefore, the OR combination is

    the corresponding equivalent to the minimum function from the original algorithm.

    Due to the integral weights and the regularity of the grid, multiple nodes may

    obtain the same minimal potential at a time. Because the distance between two

    neighbouring nodes is always shorter than any other path between these nodes, all

    nodes with the same potential can be coloured at once. This is the main reason why a

    parallel implementation of this algorithm is possible.

    By using this method, the maximum number of iterations that are needed to cal-

    culate all potentials of reachable nodes in a given n by m grid can be approximated: In

    12 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    13/65

    2 The Dijkstra Algorithm

    a worst case scenario, every other row of the map has to be traversed almost completely.

    The rows in between are completely marked as obstacle nodes, except for a one nodewide passage alternating at the left and right border of the map. So the only path

    through this map is serpentine shaped (see example in figure 2.2). If the diagonal edges

    at the turns are neglected, an upper approximation of the length of the longest path is

    (m 1) (n 1)2

    (2.4)

    and the maximum potential pmax is twice that number:

    pmax = (m 1) (n 1) (2.5)

    Figure 2.2: Serpentine shaped map for a worst-case scenario

    After the number of clock cycles exceeds pmax, all reachable nodes still have a potential

    value lower than the final value cmax of the global counter. Thus, these can be considered

    coloured and all nodes with a potential equal to cmax are unreachable. This elimiatesthe need to encode the colouration of nodes.

    2.2.1 Tracing of the shortest path in the potential vector

    The colouration of edges can also be omitted if an iterative tracing algorithm on the

    resulting potential vector is used.

    Universitat Bielefeld, AG Technische Informatik 13

  • 7/29/2019 Dijsktra Thesis

    14/65

    2 The Dijkstra Algorithm

    Initialization

    v0 = s s V (2.6)

    v denotes the resulting path, v0 = s is the starting location.

    Iteration

    In each iteration step, the following vertex vn+1 is set as the neighbouring vertex with

    the minimal potential. When the reference vertex t is reached at T, the algorithm stops

    and v0, . . . , vT describes the shortest path from s to t.

    vn+1 = arg minxconnected tovn

    (p(x)) (2.7)

    Figure 2.3 shows an example of the resulting algorithm at different values of the global

    counter.

    14 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    15/65

    2 The Dijkstra Algorithm

    Figure 2.3: Calculation of node potentials in the simplified algorithm. The values for the global counter are a) 0, b) 2,c) 4, d) 5, e) 6, f) 7. Red edges have a weight of 2, blue edges have a weight of 3, and green edges areunconnected.

    Universitat Bielefeld, AG Technische Informatik 15

  • 7/29/2019 Dijsktra Thesis

    16/65

    2 The Dijkstra Algorithm

    16 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    17/65

    3Field Programmable Gate Arrays

    (FPGAs)

    3.1 History of FPGAs

    Today, FPGAs are used mainly for rapid prototyping of application specific integratedcircuits (ASIC).

    In the past, the demand for integrated circuits for all kinds of applications steadily

    increased, but the development of a full-custom chip was an expensive process.

    Originally, the creation of a chip required the design of custom wafers. Special clean

    rooms and equipment for photolithography and etching of these wavers were needed

    for the creation of a prototype, and the whole process was very uneconomical if only a

    limited-lot production was needed.

    One of the early ancestors of FPGAs was the Uncommited Logic Array (ULA)

    or Gate Array (GA). The die of an ULA consists of an array of unconnected standard

    logic gates and other elements needed for the creation of an application specific

    integrated circuit (ASIC). These elements were connected later by applying metal

    interconnections to an additional layer of the die. The possibility to mass produce ULAs

    for different applications reduced the overall design cost compared to the creation of

    full-custom chips. The creation of an ASIC prototype using an ULA was still a complex

    process, since the application of the metal layer could not be done by the designer. In

    those days, the design process started to change - the designer no longer had to create

    the wafer mask of the die, but the new task was to connect already existing parts of the

    chip to implement the desired functionality.

    Universitat Bielefeld, AG Technische Informatik 17

  • 7/29/2019 Dijsktra Thesis

    18/65

    3 Field Programmable Gate Arrays (FPGAs)

    The desire to have user programmable chips arose and led to the invention of pro-

    grammable array logic (PAL) and gate array logic (GAL) chips. A PAL consists ofa matrix of AND and OR gates and the possibility to create interconnections at spe-

    cific points in this matrix by using fuse or antifuse technologies similar to those used

    in PROMs and EPROMs. A GAL is a more complex version, which basically includes

    several PALs on one die. These can be connected with each other to create more com-

    plex logical functions.

    At this point, the developer no longer needed to send his metal connection mask to

    the ULA manufacturer, but he could obtain readily available chips and program the

    needed interconnections using programming hardware on site, which was much cheaper.

    Another step to speed up the development of projects involving ASICs was theinvention of in-circuit-programming methods. The first user programmable chips had

    to be removed from and reinserted into a socket for each test of a new version. This not

    only took time, but uncareful handling and mechanical stress from a repeated removal

    and insertion could also lead to a destruction of the component. As a result, the Joint

    Test Action Group (JTAG) developed a standard interface which allowed programming

    of a device by using a simple serial port protocol without the need for a removal from

    the circuit board.

    The growing complexity of ASIC designs eventually led to the invention of the

    field-programmable gate array (FPGA) in the early 1980s. An FPGA consists of a lot oflogic cells arranged in a two dimensional array, which can be interconnected in almost

    any way1. With computers becoming faster, the design process evolved simultaneously.

    More and more tasks of the design process were implemented in software tools to

    aid the designer. This led to todays complex development environments that allow a

    prototype to be created and simulated on almost any personal computer.

    Two of the main FPGA manufacturers nowadays are Altera and Xilinx.

    3.2 Internal structure of FPGAs

    FPGAs consist of basic elements called logic blocks. These are arranged in a twodimensional array and can be connected by a programmable connection matrix. To

    connect internal logic to external circuits, I/O blocks at the border of the array can

    also be connected to the logic blocks. Newer devices may also contain a number of

    additional features like clock generators, larger blocks of on-chip memory, digital

    signal processor (DSP) components and even complete CPU cores.

    1only limited by connection ressources, but there are plenty for most practical applications

    18 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    19/65

    3 Field Programmable Gate Arrays (FPGAs)

    The logic blocks consist mainly of one or more lookup tables (LUTs), in most

    cases 4-input LUTs, to implement logical gates. Most FPGAs implement the LUTsas well as the configuration registers for the connection matrix in SRAM technology

    to achieve high overall chip clock rates. This requires a reconfiguration of the device

    after each power-up. Some alternative implementations use non-volatile concepts like

    EEPROM or Flash technology. Those are not as fast as an SRAM based device, but no

    external components for the configuration of the logic device is needed after an initial

    programming process.

    In addition to the LUTs, flip-flops are provided to implement storage registers in

    a logic element. Depending on the manufacturer and the particular device, these logic

    blocks are called logic array blocks (LABs, Altera) or configurable logic blocks (CLBs,Xilinx).

    Most newer devices partition these basic blocks into smaller elements. While a Xilinx

    XC4000 2 CLB contains two LUTs and two registers only, in the Virtex II3 FPGA,

    one CLB is divided into four slices which contain two LUTs and two registers each.

    An Altera Stratix 4 LAB includes ten logic elements (LEs), which provide one LUT

    and one register each. The more advanced Stratix II 5 provides eight adaptive logic

    modules (ALMs) per LAB. An ALM contains a comparable larger LUT that can be

    used in different configurations regarding the number of input signals. It also includes

    two registers.

    The programmable interconnections differ dependent on manufacturer and de-

    vice, too. All devices provide a varying number of horizontal and vertical signal lines of

    variable length across the chip, though. Additionally, most devices support global clock

    signals to distribute a low skew clock to every part of the chip. One main difference

    between Altera and Xilinx devices is that Altera uses a column oriented approach while

    Xilinx maintains a true grid structure. In an Altera FPGA, the LABs are placed in

    columns thus providing very fast interconnections to the neighbouring LABs in a row.

    3.3 Suitability of FPGAs for the specified applica-

    tion

    The simplified algorithm described in chapter 2.2 can be realised with a very small

    number of gates and registers per node. This fact and the arrangement of the nodes

    2XC4000 data sheet: [Cor99]3Virtex II data sheet: [Cor03b]4Stratix data sheet: [Cor04b]5Stratix II data sheet: [Cor04c]

    Universitat Bielefeld, AG Technische Informatik 19

  • 7/29/2019 Dijsktra Thesis

    20/65

    3 Field Programmable Gate Arrays (FPGAs)

    in a grid suggest the use of a similar structured silicon device. Besides the expensive

    creation of an ASIC or a full-custom chip, the implementation on an FPGA seems to bethe only reasonable option. An efficient parallel implementation of the algorithm can

    be achieved if it is possible to map one graph vertex to a single logic block of an FPGA.

    The selection of an appropriate FPGA for this algorithm is difficult, because these

    days, a broad range of devices optimised for different applications are commercially

    available. There are low-budget as well as high performance solutions with a lot of

    different additional features from both Xilinx and Altera, and there are other devices

    from other companies like Actel, Quicklogic, and Lattice, too.

    For this project, the Altera Stratix device family was chosen in favor of a Xilinxproduct for one main reason: Compared to XC4000 and Virtex II from Xilinx, they

    offer more flip-flops per basic logic block. Since the maximum counter value depends

    on the size of the map, a larger number of registers to store that value is desirable to

    maximize the possible map size.

    20 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    21/65

    4Implementation of a Prototype

    A running implementation of the algorithm described in 2.2 was created using an Altera

    Nios Statix Edition prototyping board[Cor03a] on an Altera Stratix 1S10 FPGA running

    at a clock rate of 50 MHz.

    This version allows the calculation of potentials in a map as big as 16 by 16 pixels.

    A graph node is implemented by using standard elements from the Quartus II design

    software, possible optimisations and problems associated with this implementation arediscussed in the next chapter.

    4.1 Overview

    The prototype consists of the FPGA development board which is connected to a host

    system by a RS232 serial connection. On the host, a set of utilities described later is

    used to transfer map data to the FPGA, execute the calculation and retrieve the resulting

    potential vector.

    Figure 4.1: System overview of the prototype

    Universitat Bielefeld, AG Technische Informatik 21

  • 7/29/2019 Dijsktra Thesis

    22/65

    4 Implementation of a Prototype

    Figure 4.2: Top level entity of the FPGA design. A larger version of this figure can be found in appendix A.1

    The FPGA top-level design (Figure 4.2) uses the following components:

    node matrix

    The node matrix component contains a matrix of 16 by 16 graph nodes for the cal-

    culation of the potentials. It is created automatically by an Objective C program

    described in 4.3 that generates a Verilog source file which contains the instantia-

    tion and interconnection source code for single graph nodes.

    command register, decoder, msr

    These components implement a state machine to execute instructions receivedvia the universal asynchronous receiver/transmitter (UART). The output of the

    command register is wired to eight on-board LEDs to display the contents of this

    register for debugging purposes.

    address register, address decoders

    The address register stores the address of a node selected by the ADDRESS in-

    struction. The binary address is then decoded to an array of unary selector lines

    by the row decoder and col decoder. Also, the node address is decoded by two

    22 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    23/65

    4 Implementation of a Prototype

    seven-segment-decoders to display the current address on the seven segment ele-

    ments on the Nios board.

    UARTS

    The UART core was obtained in VHDL source from Bertrand Cuzeau of

    ALSE[ALS03] with permission to use in this project. It is connected by two

    data busses for incoming and outgoing data and a few signal lines indicating the

    receive and transmit states.

    A variant of this design supporting larger addresses was made to test an implementation

    with maps larger than 16 by 16 nodes. This included a second status register and an

    extension of the instruction decoder to transfer these addresses. An implementationwhich could handle a map sized 24 by 24 nodes could be compiled successfully for the

    Stratix 1S10 FPGA.

    4.2 Design of graph vertices in digital hardware

    The first implementation of a graph node is a modification of the design suggested in

    [Mol99] It uses three D-Flip-Flops (DFF) to implement the delay elements and a fourth

    DFF to store a configuration bit to mark obstacles.

    More DFFs are used in the lpm countermegafunction of the Quartus II software which

    can implement an arbitrary width binary counter. The prototype used a five bit widecounter, this results in a total of nine DFFs per vertex used in this design.

    Figure 4.3: Implementation of a hardware graph vertex. A larger version of this figure can be found in appendix A.2

    Universitat Bielefeld, AG Technische Informatik 23

  • 7/29/2019 Dijsktra Thesis

    24/65

    4 Implementation of a Prototype

    in0, . . . , in7 are the input signals for incoming edges of the neighbour vertices. in0,

    . . . , in3 are connected to the orthogonal and in4, ..., in7 are connected to the diag-onal neighbours of the current vertex. Each of these four-input sets is OR combined

    to generate the orthogonal edge and diagonal edge signals. The diagonal edge signal

    is connected to the diagonal delay DFF resulting in a delay of one node clock cycle.

    This DFFs output is then OR combined to the orthogonal edge signal and the result

    is fed to the next delay DFF, delay1. Finally, the output of delay1 is delayed again by

    DFF delay2. This implements the fact, that a diagonal edge has a weight of 3 while an

    orthogonal edges weight is 2. Another OR gate feeds the output of delay2 back to its

    input to keep a high state, regardless of the output of delay1. This feedback allows to

    implement a preset logic for the reference node.

    The output ofdelay2 is masked with the output of the configuration bit by an AND gate.If the output of the DFF which contains the configuration bit is low, this node is marked

    as an obstacle and the output will stay low, regardless of any incoming signals.

    The masked signal out is then distributed to the neighbouring nodes inputs and

    simultaneously used as a low active clock enable signal for the counter present in each

    node. In contrast to the suggested implementation in [Mol99], no global counter is used

    in this variant. As long as the output of a node is low, the counter is incremented with

    each cycle ofnode clk.

    To set the configuration bit of a particular node and to set a reference node, a

    simple address logic is used. Two binary to unary decoders are connected to the upperand lower half of the address register respectively to generate a number of row select

    and column select signals according to the number of rows and columns of the map.

    These decoders assure that only one row and column select signal is high active at any

    given time. Each node of a given row n is connected to the corresponding row select[n]

    signal, each node of a column m is connected to col select[m].

    A node select signal is generated by performing a logical and on a nodes row select

    and col selectinputs. This allows the transfer of a configuration bit to the node currently

    selected by the address register using the conf data and conf clk inputs common to all

    nodes. Also, the node selectsignal and the common conf setsignal are fed into a NAND

    gate. This gates output is connected to the asynchronous preset input of the delay2 DFF,

    which allows a preset of the currently selected node as a reference node.

    Additionally, the node select signal is connected to the enable input of

    counter tristate bufferto output the current nodes potential counteron the counter out

    bus.

    All configuration bits can be cleared asynchronously by assigning a low signal to the

    common conf reset input. The counter and delay DFFs can be reset asynchronously by

    assigning a low signal to node reset.

    24 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    25/65

    4 Implementation of a Prototype

    Figure 4.4: Block schematic of the address logic

    4.3 Generation of the node matrix by a Verilog

    code generator

    The current design uses a matrix of 16 by 16 nodes. Each nodes output is connected

    to the inputs of each of its eight neighbours. This results in a total number of 65536connections for the output signals only. Thus, a manual placement of nodes and inter-

    connections is impractical, but the regularity of the connections allows an automatic

    generation of the node matrix in a hardware description language.

    The input and output signals of each node can be divided into three different

    categories:

    The inputs node reset n, node clk, conf reset n, conf clk, conf set, conf data and

    Universitat Bielefeld, AG Technische Informatik 25

  • 7/29/2019 Dijsktra Thesis

    26/65

    4 Implementation of a Prototype

    the output bus counter out are common signals to each node and passed through

    from or to the node matrix.

    All row selectand col selectselector input lines of the node matrix are connectedto each node in the corresponding row and column. The definition and inter-

    connection of these signals depends on both the size of the node matrix and the

    position of an individual node in the matrix.

    The in0, .. . , in7inputs of each node are connected to the out signals of the eightneighbouring nodes. At the border of the matrix, a low signal is assigned to these

    inputs if the appropriate neighbour does not exist. The definition of these signals

    depends on the position and size of the matrix, too.

    The Verilog source code of the node matrix consists of three parts:

    A header containing the definitions of the node matrix module, inputs, outputs, and

    internal signal lines, followed by the repeated instantiation and interconnection of nodes,

    and a footer ending the module definition.

    Three template files representing the needed parts are used to generate the resulting

    file. Placeholder variables, marked by the #-character at the beginning and the end,

    represent values dependent on size of the grid and position of individual nodes. They

    are calculated and filled in upon execution of the code generator. The template class

    supports integer, float and string variables, though only integer and string variables are

    used in this application.

    26 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    27/65

    4 Implementation of a Prototype

    module node_matrix(node_row,node_col,

    node_reset_n,conf_reset_n,conf_clk,conf_set,conf_data,node_clk,

    counter_out);

    input [#maxrow#:0] node_row;

    input [#maxcol#:0] node_col;

    input node_reset_n;

    input conf_reset_n;

    input conf_clk;

    input conf_data;

    input conf_set;input node_clk;

    output [4:0] counter_out;

    wire node_clk;

    wire [#maxrow#:0] node_row;

    wire [#maxcol#:0] node_col;

    wire node_reset_n;

    wire conf_clk;

    wire conf_data;

    wire conf_set;

    wire conf_reset_n;

    tri [4:0] counter_out;

    wire node_out[#maxnode#:0];

    Figure 4.5: header.v - Template for the node matrix.v header

    Figure 4.5 shows the template for the module header. For a height rows by width

    columns matrix, the values for the used variables are:

    #maxrow# = height-1

    #maxcol# = width-1

    #maxnode# = (height * width) -1

    Those variables are used to define the number of selector signals and output signals of

    all nodes.

    Universitat Bielefeld, AG Technische Informatik 27

  • 7/29/2019 Dijsktra Thesis

    28/65

    4 Implementation of a Prototype

    The template for the instantiation of one node is shown in figure 4.6.

    node node_x#x#_y#y# (

    .row_select(node_row[#x#]),

    .col_select(node_col[#y#]),

    .conf_clk(conf_clk),

    .conf_data(conf_data),

    .conf_set(conf_set),

    .conf_reset_n(conf_reset_n),

    .node_clk(node_clk),

    .node_reset_n(node_reset_n),

    .in0(#in0#),

    .in1(#in1#),

    .in2(#in2#),

    .in3(#in3#),

    .in4(#in4#),

    .in5(#in5#),

    .in6(#in6#),

    .in7(#in7#),

    .out(node_out[#n#]),

    .counter_out(counter_out)

    );

    Figure 4.6: node.v - Template for the node instantiation in node matrix.v

    The nodes are generated row by row and column by column in each row. In a nested

    loop, this template is used once for each node with #y# running from 0 to #maxrow#

    and #x# running from 0 to #maxcol#. Therefore, a node at row 5, column 3

    gets the name node\_x3\_y5, its row select and col select inputs are connected to

    row select[5] and col select[3] respectively.

    The number #n# of a node is calculated as

    #n# = y*width + x

    and this nodes output is connected to node out[#n#]. For the above example, in a

    16 by 16 matrix the output line is node out[83].

    As described in section 4.2, in0, . . . , in7 are the input signals from the neigh-

    bouring nodes. Figure 4.7 shows the definition of a lookup table containing the

    corresponding #x# and #y# offsets for each incoming signal.

    28 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    29/65

    4 Implementation of a Prototype

    struct offset {

    int dx,dy;};

    struct offset edge_offsets[8] = {

    { -1, 0 }, // left

    { 1, 0 }, // right

    { 0, -1 }, // bottom

    { 0, 1 }, // top

    { -1, -1 }, // top left

    { 1, -1 }, // top right

    { -1, 1 }, // bottom left{ 1, 1 } // bottom right

    };

    Figure 4.7: Offset table describing the position of adjacent nodes

    For each input inN, the offset edge_offsets[N] is added to the current #x# and

    #y# to obtain the position of the neighbouring node.

    In case of the resulting position being out of bounds, inN is substituted by the string

    b0 which represents a logic zero in Verilog. Otherwise, the replacement string is

    node_out[x+edge_offsets[N].dx+(y+edge_offsets[N].dy)*width]

    After the Verilog code for all nodes is generated, the footer template is appended. This

    template only contains the line

    endmodule

    to end the Verilog module definition.

    4.4 Interface to a host system

    An interface to the host system is needed to transfer map data to the FPGA, set ref-

    erence nodes, and retrieve the resulting potential vector. For the prototype, an RS232

    connection was chosen for the ease of creating the needed host software.

    The interface mainly consists of a ready-to-use UART connected to the address and

    command registers, a command decoder, and a state register, which allow to execute a

    set of commands received via the RS232.

    Universitat Bielefeld, AG Technische Informatik 29

  • 7/29/2019 Dijsktra Thesis

    30/65

    4 Implementation of a Prototype

    The RXD and TXD data lines of the UART are connected directly to FPGA pins, which

    are then connected to the TTL to RS232 voltage level converters. After a byte is receivedby the UART, the RxRDY signal gets high for one clock cycle and data can be read from

    the Dout bus.

    The current data at the Din bus is sent to the RS232 by assigning a high signal to the LD

    input for one clock cycle. While a byte is transmitted, the TxBusy signal is high.

    Another status signal is RxErr, which indicates an error during the receipt one byte.

    This signal is unconnected, since no error handling for the interface is implemented and

    transmission errors are unlikely at a rate of 9600 bits per second.

    The UART supports a choice between two transmission rates, defined as parame-

    ters, by assigning a high or low signal to the Baud input. Since only a rate of 9600 bitsper seconds is used, this signal is tied to GND. RST is a low-active reset signal and

    connected to the system reset signal.

    Dout is fed into the data inputs of both the command and address registers, allowing to

    load received data by assigning a high signal to one of the registers enable inputs for

    one clock cycle.

    counter outof the node matrix is connected to Din of the UART. This tri-state bus holds

    the current counter value of the node selected by the address register.

    The state register is used to distinguish between the wait for command state and the

    actual execution of an instruction. As long as it is low, the system waits to receive a

    byte via the UART. After uart rxrdy switches to a high value, the decoder performs afetch command cycle by assigning a high to command enable which results in the

    command register to take over and store the value of the UARTs Dout bus. The msr is

    set to high, too, and the execution of the particular instruction starts in the next system

    clock cycle.

    Table 4.1 shows the decoder table for the instruction decoder. The actual encoding

    of each command can be taken from the column command register. The following

    commands are implemented:

    ADDRESS

    Two states are needed for the ADDRESS instruction to load a node address to the

    address register. First, the system stays in the load address wait state until a

    second byte is received via the UART. After another transition of uart rxrdy to a

    high value, the decoder assigns a high signal to address enable and the value of

    Dout is taken over by the address register in the load address load cycle.

    READ

    The READ instruction is implemented in two states, too. As long as the

    uart txbusy signal is in a high state, the system will stay in this state to wait for

    an ongoing transmission of a byte to finish. When the UART is ready to transmit

    30 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    31/65

    4 Implementation of a Prototype

    state msrin

    commandregister

    uartrxrdy

    uarttxbusy

    addressenable

    commandenable

    uartld

    nodeclk

    noderesetn

    confclk

    confresetn

    confset

    confdata

    msrout

    wait for command 0 x 0 x 0 0 0 0 1 0 1 0 0 0

    fetch command 0 x 1 x 0 1 0 0 1 0 1 0 0 1

    load address wait 1 b00000001 0 x 0 0 0 0 1 0 1 0 0 1

    load address load 1 b00000001 1 x 1 0 0 0 1 0 1 0 0 0

    read wait 1 b00000010 x 1 0 0 0 0 1 0 1 0 0 1

    read send 1 b00000010 x 0 0 0 1 0 1 0 1 0 0 0

    reset nodes 1 b00000100 x x 0 0 0 0 0 0 1 0 0 0

    reset map 1 b00001000 x x 0 0 0 0 1 0 0 0 0 0

    iterate 1 b00010000 x x 0 0 0 1 1 0 1 0 0 0

    load config 1 bab100000 x x 0 0 0 0 1 1 1 a b 0

    Table 4.1: Decoder table of the instruction decoder

    new data, a high signal is assigned to the LD input in the read send state. This

    triggers a transmission of the currently assigned value of the Din port to the host

    system via the RS232.

    RESET NODES

    The RESET NODES instruction causes the node reset n output of the decoder to

    become low for one clock cycle. This resets the counter and delay registers in the

    node matrix.

    RESET MAP

    This instruction works similar to the RESET NODES command. By assigning

    a low value to conf reset n for one cycle, the map configuration registers in the

    node matrix are reset.

    ITERATE

    The ITERATE instruction causes a high value on the node clk output of the de-

    coder, which results in a clock cycle for both the counter and delay registers of

    the node matrix. This is used to perform iterations of the implemented algorithm.

    LOAD CONFIG

    LOAD CONFIG is used to set the currently selected node as a reference or an

    obstacle node. Unlike the other instructions, this is the only one which encodes

    Universitat Bielefeld, AG Technische Informatik 31

  • 7/29/2019 Dijsktra Thesis

    32/65

    4 Implementation of a Prototype

    additional parameters in the instruction word. These parameters, a and b in table

    4.1, are passed through to the node matrix. If parameter a, the conf set signal, isset, the currently selected node is set as a reference node by asynchronously set-

    ting the last delay element to a high value. Parameter b is passed to the conf data

    signal and a simultaneous conf clkcycle causes the configuration DFF of the node

    to take over this signal. Since the output of the configuration bit is AND combined

    with the output of the last delay DFF, a low value has to be set for an obstacle node

    and a high signal for a clear node.

    The chosen encoding uses one bit of the instruction word to distinguish between dif-

    ferent instructions. This keeps an implementation of the decoder in gate logic simple.

    Figure 4.8 shows the implementation of the decoder.

    Figure 4.8: Schematic of the instruction decoder

    4.5 Software environment on the host system

    On the host system, a set of command line utilities written in C is used for test runs.

    The source code for a module containing common functions used in these utilities can

    32 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    33/65

    4 Implementation of a Prototype

    be found in appendix B.2

    The utilities include:

    send

    This program just sends the value supplied as an argument to the serial port. It is

    used mainly for testing purposes, e.g. for sending a single command byte to the

    FPGA.

    Example:

    $ ./send 4 # issue "reset_nodes" command to the FPGA

    address

    address sends the ADDRESS command, followed by the node address parame-

    ter. It is used for the debugging of the address logic and seven segment decoders,

    and for setting one or more reference nodes.

    Example:

    $ ./address 50 # 0x31 - set address to node (3,1)

    $ ./send 14 # 0xE0 - s et node (3,1) as

    # clear reference node

    map transfer

    This utility reads in an ASCII file , resets the map configura-

    tion in the FPGA, and transmits the neccessary commands to configure the map

    according to the contents of the file. The first two lines of the file contain the num-

    meric values of the width and height of the map. In the following lines, a space

    represents a clear node and any non-space character marks an obstacle. Figure 4.9

    shows an example file.

    iterate

    This commands sends ITERATE instructions to the FPGA to perform

    iterations of the algorithm.

    read

    To retrieve the potential vector from the FPGA, the read utility is used. It uses AD-

    DRESS and READ NODE instructions to read every nodes potential and prints

    it to the console.

    Universitat Bielefeld, AG Technische Informatik 33

  • 7/29/2019 Dijsktra Thesis

    34/65

    4 Implementation of a Prototype

    16

    16****************

    * *

    * * *

    * * *

    * * *

    ********* ****

    * *

    * * *

    * * *

    * * ************* ***

    * *

    **** * ********

    * * *

    * * *

    ****************

    Figure 4.9: Example ASCII map for the transfer map utility

    4.6 Result of a test run

    In a test run, the map shown in figure 4.10 encoded as an ASCII map as shown in figure

    4.9 is transfered to the FPGA. An ADDRESS instruction is then sent to the FPGA to set

    the address register to node (1,1) and a following LOAD CONFIG instruction sets this

    node as the reference node:

    $ ./address 17 # 0x11 - set current node address to (1,1)

    $ ./send 14 # 0xE0 - set current node as

    # reference node

    Figure 4.11 shows the output of the read utility after 71 iterations. These values are

    loaded into Matlab to generate a visualisation of the resulting potential vector in a

    greyscale image as shown in figure 4.12.

    34 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    35/65

    4 Implementation of a Prototype

    Figure 4.10: Map for the test run

    71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71

    71 0 2 4 6 8 10 12 14 16 18 20 22 24 26 71

    71 2 3 5 7 9 11 13 71 17 19 21 23 25 27 71

    71 4 5 6 8 10 12 14 71 19 20 22 24 26 28 71

    71 6 7 8 9 11 13 15 71 21 22 23 25 27 29 71

    71 71 71 71 71 71 71 71 71 23 24 25 71 71 71 71

    71 40 38 36 34 32 30 28 26 25 26 27 28 30 32 71

    71 41 39 37 35 33 31 29 71 27 28 29 30 31 33 71

    71 42 40 38 36 34 32 31 71 29 30 31 32 33 34 71

    71 43 41 39 37 35 34 33 71 31 32 33 34 35 36 7171 71 71 71 71 71 71 71 71 71 71 71 36 71 71 71

    71 59 57 55 53 51 49 47 45 43 41 39 38 39 41 71

    71 71 71 71 54 52 71 48 71 71 71 71 71 71 71 71

    71 61 59 57 55 54 71 50 51 53 55 57 59 61 63 71

    71 62 60 58 57 56 71 52 53 54 56 58 60 62 64 71

    71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71

    Figure 4.11: Output of the read utility after 71 iterations

    Universitat Bielefeld, AG Technische Informatik 35

  • 7/29/2019 Dijsktra Thesis

    36/65

    4 Implementation of a Prototype

    Figure 4.12: Greyscale image of the resulting potential vector.

    X X X X X X X X X X X X X X X X

    X t XX X XX X XX X XX X X X X X X X X X X X XX XX X XX

    X

    X

    X X XX X X X X X X X X X X X X X XX XX X X X X X X X X X X X XX X XX X XX X X X X X X X X X X X X X X X

    Figure 4.13: Direction vectors generated by a path tracing algorithm. By following these vectors from any point in themap, the reference node t is reached via the shortest path possible.

    36 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    37/65

    5Analysation and Discussion

    The FPGA used in this application supplies a matrix of 30 by 40 LABs, containing 10

    LEs each. To achieve an optimal mapping of the graph structure onto the FPGAs LAB

    grid, the implementation of one graph node in one LAB is desirable. However, on the

    chosen Stratix 1S10 FPGA two main problems occur:

    With an increase of the map size, distanced may become longer. This leads toa higher maximal counter value, and the number of DFFs needed to store node

    potentials grows.

    No real tri-state signals are supported, so additional logic is required to retrievethe potentials from the FPGA.

    The following sections discuss these problems.

    5.1 Substitution of DFFs by LUT based delay ele-

    ments

    Figure 5.1, taken from the Stratix Device Handbook [Cor04b], shows the internal struc-

    ture of a LE within a LAB.

    Universitat Bielefeld, AG Technische Informatik 37

  • 7/29/2019 Dijsktra Thesis

    38/65

    5 Analysation and Discussion

    Figure 5.1: Stratix LE block schematic taken from [Cor04b]

    An approach to conserve registers of LEs for storing larger counter values is the replace-ment of delay DFFs with a LUT based solution. The following requirements have to be

    met by the replacement:

    To put the system in a defined state prior to a computation, a reset signal is needed. An input signal to the element must be propagated on a clock edge. An additional asynchronous preset logic is needed in the last delay element to set

    a reference node.

    The first step in the creation of a LUT based register is the implementation of an asyn-

    chronous RS flip-flop as shown in figure 5.2.

    Figure 5.2: Equivalent block schematic for a LUT based RS flip-flop

    38 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    39/65

    5 Analysation and Discussion

    The set input is connected to an OR gate. The output of this gate is then masked with

    the low-active reset n signal and fed back to the second input of itself. As long as thereset n signal stays in a high state, this feedback loop causes the output to stay in a high

    state, too, once a high signal is assigned to the set input. A low reset n signal causes

    the output to become low. As long as the input signal also stays low, the flip-flop will

    remain in this state.

    To use this LUT based flip-flop in graph nodes, an extension to create an edge

    triggered behaviour is needed. In theory, this can be achieved by masking the input

    signal with the negated clock signal in a first LUT and then masking the output of the

    first LUT with the same clock signal. When the input of the first LUT switches to high,

    its output also switches to high during a low clock signal. At the edge of the clocksignal, the output of the first LUT stays high for the propagation delay of the LUT. In

    this period of time, the second LUT already received the high clock signal without the

    delay and the LUT output is expected to switch to and stay in a high state until a low

    reset n signal is assigned.

    Figure 5.3: Equivalent block schematic for a LUT based delay element

    Figure 5.3 shows the complete schematic for the LUT based delay element. The

    leftmost AND gate and the inverter are implemented in the first LUT, the rest of

    this circuit is implemented in the second LUT. Tests of this circuit in the Quartus II

    simulator produce contradicting results as shown in figures 5.4 and 5.5. While the

    functional netlist simulation reports the predicted results, the output signal oscillates in

    an unstable state in the timing simulation. A test on the FPGA hardware showed the

    predicted behaviour to be true, so an implementation of a delay element in this way ispossible.

    Figure 5.4: Results of a functional netlist simulation of the delay element.

    Universitat Bielefeld, AG Technische Informatik 39

  • 7/29/2019 Dijsktra Thesis

    40/65

    5 Analysation and Discussion

    Figure 5.5: Results of a timing simulation of the delay element

    Since the development environment usually optimises a users design to save on-chip

    ressources, another problem arises during the implementation of this circuit. Every

    network of logic gates is mapped onto 4-input LUTs, so suitable parts of a circuit are

    reduced to 4-input functions to fit into those LUTs. For example, the RS flip-flop in

    figure 5.2 is synthesized into a single LUT. In the proposed schematic of the clock

    edge triggered delay element, the optimiser identifies both leftmost AND gates asredundant. This part of the schematic can also be written as the equation (a b) bwhich evaluates to false for all values ofa and b, if no propagation delays are taken into

    account.

    To solve this problem, the above circuit was created by using a method to directly

    instantiate logic elements on the FPGA as described in the next section.

    5.2 Low-level instantiation of logic elements by

    VQM source filesTo allow design and synthesis of FPGA designs in 3rd party tools without the publica-

    tion of the bitstream format, Quartus II offers the possibility to include Verilog Quartus

    Mapping (VQM) files in a project and assemble them. These files make it possible to

    instantiate and configure FPGA ressources like logic cells (LEs), I/O blocks, on-chip

    memory, etc. on a low level. As the name suggests, the syntax of VQM files is the same

    as that of Verilog files.

    Unfortunately, no detailed public information about VQM files is available, but the

    Quartus II software can create a VQM file of a compiled project. Therefore, reverse

    engineering the file format is possible by compiling simple projects and analysing the

    resulting VQM files.

    5.2.1 Structure of VQM files

    As an example, figure 5.6 shows the resulting VQM file of the RS flip-flop design from

    the last section. The most interesting part of this file is the instantiation of a logic cell to

    implement the given logical function. Since the design of the flip-flop uses only three

    inputs overall, it was reduced to a single LE using only one LUT.

    40 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    41/65

    5 Analysation and Discussion

    stratix_lcell (); creates a logic element named

    with connections to other logic listed in . Configuration parameters of the

    logic element are defined by defparam.

    Two operation modes are possible for a Stratix LE: normal mode and dynamic

    arithmetic mode. The Stratix Device Handbook[Cor04b] describes both modes in

    detail.

    The following input and output ports can be defined in a VQM file:

    dataa, datab, datac, datad

    These are the four data inputs of the LUT. datac can be configured as the cin

    carry input signal or as an input for a feedback from the register of the same LE

    alternately.

    cin

    This is the carry chain input signal of a LE. It can only be connected to the cout

    carry out output of a previous LE in the same LAB.

    regcascinThis is the register cascade input of the LE. To connect the input of the LEs

    register directly to the output of the register in the previous LE inside a LAB, the

    regcascin input is used.

    aclr

    The aclr input of a LE can be connected to a LAB wide asynchronous high-

    active reset signal. Only two different signals of this type are available to all LEs

    in one LAB.

    aload

    This is a LAB wide asynchronous preset signal. One signal of thise type common

    to all LEs inside a LAB can be used. When a high signal is assigned to this input,

    the register loads the current value of the datac port.

    clk, ena

    These are the clock and clock enable inputs of the register in a LE. Both are LAB

    wide signals, too. There are two clock and two clock enable signals available for

    each LAB.

    Universitat Bielefeld, AG Technische Informatik 41

  • 7/29/2019 Dijsktra Thesis

    42/65

    5 Analysation and Discussion

    module rs_ff (reset_n,set, q);

    input reset_n;

    input set;

    output q;

    wire \reset_ncombout ;

    wire \setcombout ;

    wire \inst10 ;

    wire gnd;

    wire vcc;

    assign gnd = 1b0;

    assign vcc = 1b1;

    stratix_io \reset_nI (.combout(\reset_ncombout ),.padio(reset_n));

    defparam \reset_nI .operation_mode = "input";

    defparam \reset_nI .ddio_mode = "none";

    defparam \reset_nI .input_register_mode = "none";

    defparam \reset_nI .output_register_mode = "none";

    defparam \reset_nI .oe_register_mode = "none";

    defparam \reset_nI .input_async_reset = "none";

    defparam \reset_nI .output_async_reset = "none";

    defparam \reset_nI .oe_async_reset = "none";

    defparam \reset_nI .input_sync_reset = "none";

    defparam \reset_nI .output_sync_reset = "none";

    defparam \reset_nI .oe_sync_reset = "none";

    defparam \reset_nI .input_power_up = "low";defparam \reset_nI .output_power_up = "low";

    defparam \reset_nI .oe_power_up = "low";

    stratix_io \setI (.combout(\setcombout ),.padio(set));

    // parameters removed, see above

    stratix_lcell \inst120 (.datab(\reset_ncombout ),.datac(\setcombout ),

    .datad(\inst10 ),.combout(\inst10 ));

    defparam \inst120 .operation_mode = "normal";

    defparam \inst120 .synch_mode = "off";

    defparam \inst120 .register_cascade_mode = "off";

    defparam \inst120 .sum_lutc_input = "datac";

    defparam \inst120 .lut_mask = "CCC0";

    defparam \inst120 .output_mode = "comb_only";

    stratix_io \qI (.datain(\inst10 ),.padio(q));defparam \qI .operation_mode = "output";

    // rest of parameters removed, see above

    endmodule

    Figure 5.6: VQM file generated by the Quartus II software for the LUT RS flip-flop shown in Figure 5.2

    42 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    43/65

    5 Analysation and Discussion

    sclr

    This signal is a synchronous LAB wide reset input for the register. One sclr

    signal can be used per LAB.

    sload

    sload is the synchronous counterpart of the aload input. When this LAB wide

    signal is set, the register stores the value of the datac input on the next clock

    edge.

    combout

    The combout output is the output of the LEs LUT.

    regout

    This is the output of the LEs register.

    cout

    The cout output is the carry chain output of a LE. It is available in arithmetic

    mode and can only be connected to the cin port of the next LE in the same LAB.

    To configure the behaviour of a LE, defparam is used to set required parameters. The

    availability of some of the above ports also depends on these parameters.

    operation_mode

    The value of this parameter can be set to normal or arithmetic. In

    arithmetic mode, the LUT is split into four 2-input LUTs. Also, the cout output

    becomes available in this mode. However, the implementation of delay elements

    only uses LEs in normal mode - a detailed description of the dynamic arithmetic

    mode is found in the Stratix Device Handbook [Cor04b].

    lut_mask

    The lut_mask is a 16 bit hexadecimal value that represents the truth table of the

    LUT. The Quartus II Handbook, Volume 3, Section 10 [Cor04a] describes how to

    calculate the LUT mask for a given equation by the creating of a truth table. The

    LUT can be compared to a simple 16x1 RAM: with dataa, datab, datac,

    and datad used as address lines 0 to 3, the LUT mask resembles the stored data

    word with the least significant bit first.

    register_cascade_mode

    Possible values for this parameter are on and off. When this mode is en-

    abled, the output of the previous LE in the LAB can be connected directly to the

    input of the register of the LE that is to be configured by using the regcascin

    input.

    Universitat Bielefeld, AG Technische Informatik 43

  • 7/29/2019 Dijsktra Thesis

    44/65

    5 Analysation and Discussion

    sum_lutc_input

    This parameter determines the signal which is fed into the LUTs datac input.

    A value of datac connects the LUTs input to the datac input of the LE. The

    value cin connects the input to the carry in input, and the value qfbk can

    be used to connect the output of the register in this LE to the LUTs datac input.

    synch_mode

    Like register_cascade_mode, a boolean value of on or off can be

    assigned to this parameter. It has to be set to use the synchronous input ports of

    the LE.

    output_mode

    The LE has three output modes: The LUT or the register output or

    both can be made available for interconnections. The possible values are

    comb_only for the LUT output only, reg_only for the register output

    and reg_and_comb for both.

    Figure 5.6 also shows the definition of stratix_io input and output pins. Since

    it is possible to include VQM files in a Quartus II project, the input and output pin

    assignments can be created by standard tools. Therefore, an implementation of the

    input and output blocks in VQM is not neccessary.

    As a side note, while trying to review the VQM file from a more complex project which

    contained the Nios system-on-a-chip CPU of the Nios Development Kit, the resulting

    file was no ASCII file. Instead, it contained binary data. It seems that VQM files

    that contain elements from the Quartus II SOPC Builder are encrypted somehow. A

    possible reason for this may be the prevention of compiling this CPU for other FPGA

    architectures by reimplementing the basic Stratix components in Verilog.

    44 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    45/65

    5 Analysation and Discussion

    5.2.2 Implementation of the LUT delay element in VQM

    As previously mentioned, the LUT delay element utilises two LUTs.

    Figure 5.7: Schematic for a LUT based delay element

    Figure 5.7 shows the neccessary connections. The logical function for the LUT labeled

    stage1 is a b. The equivalent truth table for this function is shown in table 5.1.

    datab dataa out

    0 0 0

    0 1 1

    1 0 0

    1 1 0

    Table 5.1: Truth table for the stage1 LUT

    Since the output does not depend on inputs c and d, the output pattern is repeated four

    times, resulting in a binary LUT mask of 0010001000100010 equal to a hexadecimal

    value of 2222.

    According to figure 5.3, the function for the LUT labeled stage2 is ((a b) d) c).

    Table 5.2 shows the resulting truth table.The resulting LUT mask for this function is a hexadecimal value of F080. Both LUTs

    can be implemented in VQM files as shown in figure 5.8 and added to a Quartus II

    project. As soon as the source files are added, the implemented modules can be used in

    the project like any other component defined in Verilog. In the test environment for the

    delay element, a symbol was generated for both stage1 and stage2 and they were

    instantiated in a block schematic file.

    Universitat Bielefeld, AG Technische Informatik 45

  • 7/29/2019 Dijsktra Thesis

    46/65

    5 Analysation and Discussion

    datad datac datab dataa out

    0 0 0 0 00 0 0 1 0

    0 0 1 0 0

    0 0 1 1 0

    0 1 0 0 0

    0 1 0 1 0

    0 1 1 0 0

    0 1 1 1 1

    1 0 0 0 0

    1 0 0 1 0

    1 0 1 0 01 0 1 1 0

    1 1 0 0 1

    1 1 0 1 1

    1 1 1 0 1

    1 1 1 1 1

    Table 5.2: Truth table for the stage2 LUT

    5.3 Reduction of the number of needed DFFs byusing an alternate path tracing algorithm

    In a regular grid, the largest potential difference between two adjacent nodes is the

    weight of a diagonal edge. If the application needs only one reference node to be

    set, the shortest path in the map can still be found if a smaller counter is used, as

    long as the counter can hold a value of at least twice the maximum distance between

    two adjacent nodes. In longer paths, a counter overrun occurs, but a modification

    of the path tracing algorithm still allows to find the shortest path. However, the in-

    formation about the absolute distance to the reference node in the potential vector is lost.

    The algorithm described in section 2.2 uses a maximum weight of three. If three

    bit counters are used, an overrun occurs if the value changes to a number larger than

    seven. This results in a substraction of eight from the previous value of a counter. When

    this happens, the minimal difference between two nodes potentials five in this case

    is the maximum counter value increased by one minus the largest possible weight of

    any edge. Thus, in the tracing algorithm, a counter overrun can be assumed to have

    occurred whenever the difference between the counters of two nodes is equal or larger

    than five.

    46 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    47/65

    5 Analysation and Discussion

    module stage1 (in, clk, out);

    input in;

    input clk;

    output out;

    wire in;

    wire clk;

    wire out;

    stratix_lcell stage1_lut (.dataa(in),.datab(clk),.combout(out));

    defparam stage1_lut .operation_mode = "normal";

    defparam stage1_lut .synch_mode = "off";

    defparam stage1_lut .register_cascade_mode = "off";

    defparam stage1_lut .sum_lutc_input = "datac";

    defparam stage1_lut .lut_mask = "2222";

    defparam stage1_lut .output_mode = "comb_only";

    endmodule

    module stage2 (in, clk, reset_n, out);

    input in;

    input clk;

    input reset_n;

    output out;

    wire in;

    wire clk;

    wire reset_n;

    wire out;

    stratix_lcell stage2_lut (.dataa(in),.datab(clk),.datac(reset_n),

    .datad(out),.combout(out));

    defparam stage2_lut .operation_mode = "normal";

    defparam stage2_lut .synch_mode = "off";

    defparam stage2_lut .register_cascade_mode = "off";

    defparam stage2_lut .sum_lutc_input = "datac";

    defparam stage2_lut .lut_mask = "F080";

    defparam stage2_lut .output_mode = "comb_only";

    endmodule

    Figure 5.8: Implementation of two LUTs for a delay element

    Universitat Bielefeld, AG Technische Informatik 47

  • 7/29/2019 Dijsktra Thesis

    48/65

    5 Analysation and Discussion

    The pseudo C code in figure 5.9 implements an iteration step of the modified tracing

    algorithm.

    // offsets to neighbours

    int dx[]={-1, 0, 1,-1, 1,-1, 0, 1};

    int dy[]={-1,-1,-1, 0, 0, 1, 1, 1};

    int num_neighbours = 8

    int min_overrun_diff = 5;

    int overrun_value = 8;

    // p[][] contains a two dimensional array

    // of counter values

    int max_dist=0;

    int next_index=-1;

    // for each neighbour...

    for(i=0;i max_dist) {

    max_dist = dist; // save new maximum distance

    next_index = i; // save index of offset

    }

    }

    if(next_index >= 0) {

    // neighbour with lower potential found

    x+= dx[next_index];

    y+= dy[next_index];

    } else {

    // reference node reached

    }

    Figure 5.9: Pseudo C code for the modified tracing algorithm

    5.4 Implementation of the address logic and host

    interface

    The second problem of an efficient implementation of this algorithm involves the trans-

    fer of calculated potential data to the host. In the original implementation of the pro-

    totype, tri-state buffers supplied by the Quartus II development environment have been

    used to access the contents of each counter register. Unfortunately, the analysis of these

    48 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    49/65

    5 Analysation and Discussion

    buffers reveals, that they are actually implemented in gate logic, since the used FPGA

    does not support real tri-state signals.

    Figure 5.10: AND-OR replacement logic for tri-state buffers

    Figure 5.10 shows an implementation for a substitution of those tri-state buffers usinggate logic. chain_in is connected to the output ofchain_out of the previous node.

    The data input is connected to the output of the data signal to be tri-stated. This

    signal is masked with the select signal by an AND gate, the result is OR combined

    with the result of the previous nodes and passed on to the next node. Since only one

    select signal of the whole chain can be active at a time, the masked output of all

    inactive parts is low. Only the data input of the selected node determines the result of

    the whole chain.// // On an FPGA architecture with 4-input LUTs, an implementation of

    this logic fits into one LUT. Even an extension for the combination of row and column

    select signals by an AND gate can be implemented in a single LUT, but this method is

    not desired in an implementation with a large node matrix, since a long chain of LUTs

    would result in a large propagation delay. Instead, the chain can be broken down in

    one chain for each row. The values of the selected column are then combined within

    each row, and an additional row select logic provides the final result. This structure still

    leaves one input port of a LUT vacant. If this port is configured as an additional input

    to the OR gate, further reduction of the propagation delay is possible by arranging the

    elements of the chain in a tree-like structure. Figure 5.11 shows the resulting selection

    logic for three data bits.

    Universitat Bielefeld, AG Technische Informatik 49

  • 7/29/2019 Dijsktra Thesis

    50/65

    5 Analysation and Discussion

    Figure 5.11: Efficient selection logic for three data bits

    Another desired improvement regarding the host interface is a faster transfer of the map

    and the result of the calculation. The prototype uses a RS232 connection at 9600 bits

    per second, and three bytes have to be transfered to configure one node of the map.

    This results in a duration of 2.5 milliseconds per node. In a map of 16 by 16 nodes,

    an approximation of the maximum potential value for a worst case scenario is 225 (see

    section 2.2). With the FPGA running at 50 MHz, the calculation finishes after 4.5

    microseconds. Therefore, in a real-world application, a PCI interface could result in a

    great acceleration. By operating the FPGA synchronously to the PCI bus at 33 MHz or

    66 MHz and a using a parallel transfer of 32 or 64 configuration bits in a burst transfer1, the overhead of configuration can be reduced significantly.

    5.5 Conclusion

    A parallel implementation of the algorithm described in 2.2 on an FPGA is possible.

    However, the mapping of a graph node onto a single LAB could not be realised, because

    the additional logic to implement the counter retrieval mechanism increased the num-

    ber of logic elements needed. Thus, a topology preserving mapping of the graph onto

    the FPGA would only be possible if a second adjacent LAB is used to implement the

    complete graph node. This would reduce the maximum map size by 50%. Some testswith different map and counter sizes have been made, and the results demonstrated that

    the Quartus II development environment is already able to generate a ressource efficient

    version of this project, though the automatic placement of nodes could not be predicted.

    Figure 5.12 shows the connections of the incoming edge signals to node (13,3) in a map

    sized 16 by 16. While a two LAB version of the algorithm uses 20 LEs per node, the

    1In a PCI burst transfer, one data word of a contiguous memory block can be transfered per clock

    cycle.

    50 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    51/65

    5 Analysation and Discussion

    Size of the node matrix Nodes total Counter width LEs used LEs/node used

    16x16 256 5 bits 4118 16.124x24 576 5 bits 9703 16.8

    24x24 576 3 bits 7779 13.5

    Table 5.3: Map size vs. LEs/node used

    synthesiser of the Quartus II software needed less than that number. Table 5.3 shows

    the needed LEs per node for three different versions of the prototype.

    Figure 5.12: Connections of the edge signals to map node (13,3) in a 16 by 16 map

    To overcome the problems regarding the retrieval logic, the choice of a different newer

    generation FPGA appears feasible. For example, the Xilinx Virtex II FPGA supports

    tri-state buses, however the available number of these signals as described in [Cor03b]

    is still too limited for this application. Also, the newer Altera Stratix II FPGAs use

    improved versions of the LAB as basic logic blocks, described in the Stratix II Device

    Universitat Bielefeld, AG Technische Informatik 51

  • 7/29/2019 Dijsktra Thesis

    52/65

    5 Analysation and Discussion

    Handbook [Cor04c]. These LABs contain eight adaptive logic modules (ALMs) which

    not only are larger, but a more flexible configuration is possible. There are two registersper ALM, and with eight data inputs per ALM altogether, the configuration of two 4-

    input LUTs in one ALM is not the only option. It is possible to configure one 3-input

    LUT and one 5-input LUT or to use one large 6-input LUT and connecting the remaining

    two inputs directly to the registers. Some other configurations with different constraints

    are possible, too. The design using the Stratix LEs utilises a full LUT for combining the

    already OR combined and delayed diagonal edge signal with the output signal of the OR

    gate which combines the orthogonal edge signals. Furthermore, the remaining registers

    in those LEs are hardly accessible. In an ALM, the combination of all four orthogonal

    edge signal and the delayed diagonal edge signal would be possible in a single 5-input

    LUT configured as an OR gate while keeping both registers of the ALM available.

    52 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    53/65

    ALarge format schematics

    A.1 Schematic of the top level entity

    On the following page, a fold-out of the top level entity of the FPGA design can be

    found.

    Unfortunately, the picture for this schematic was taken from an early version of

    the project and the counter out output of the node matrix is shown with a wrong bus

    width. While the picture shows an eight bit wide bus, the running prototype uses only a

    five bit wide bus. The correct label for this output is counter_out[4..0].

    Universitat Bielefeld, AG Technische Informatik 53

  • 7/29/2019 Dijsktra Thesis

    54/65

    A Large format schematics

    A.2 Schematic of a graph vertex implemented in

    digital hardware

    54 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    55/65

    BSource code

    B.1 main.m - Main module of the Verilog code gen-

    erator

    /* main.m of the verilog code generator */

    #import "template.h"

    struct offset {

    int dx,dy;

    };

    struct offset edge_offsets[8] = {

    { -1, 0 }, // left

    { 1, 0 }, // right

    { 0, 1 }, // bottom

    { 0, -1 }, // top

    { -1, -1 }, // top left

    { 1, -1 }, // top right

    { -1, 1 }, // bottom left

    { 1, 1 } // bottom right

    };

    int main(int argc, char **argv) {

    int width=16;

    Universitat Bielefeld, AG Technische Informatik 55

  • 7/29/2019 Dijsktra Thesis

    56/65

    B Source code

    int height=16;

    unsigned char *line;

    int x,y,n,i,j;

    int numnodes=width*height;

    unsigned char chain_in_buffer[128];

    unsigned char edge_buffer[8][128];

    unsigned char node_edge_buffer[128];

    id header,node,footer;

    header=[Template new];

    footer=[Template new];

    [header load: "templates/header.v"];

    [header set: "maxnode" toInt: numnodes-1];

    [header set: "maxcol" toInt: width-1];

    [header set: "maxrow" toInt: height-1];

    [header show];

    for(y=0;y

  • 7/29/2019 Dijsktra Thesis

    57/65

    B Source code

    toString: edge_buffer[i]];

    }

    [node show];

    }

    }

    [footer load: "templates/footer.v"];

    [footer show];

    }

    Universitat Bielefeld, AG Technische Informatik 57

  • 7/29/2019 Dijsktra Thesis

    58/65

    B Source code

    B.2 serial.c - Support functions for the host utili-

    ties

    /* serial.c - module containing serial port

    * initialisation and helper functions.

    */

    #include

    #include

    #include

    #include

    #include

    #include

    #include

    #include "serial.h"

    #define SERIAL "/dev/tty.USA19QW1b1P1.1"

    int serial_fd;

    struct termios options;

    fd_set read_fds;struct timeval timeout;

    int init_serial() {

    serial_fd=open(SERIAL,O_RDWR | O_NOCTTY | O_NDELAY);

    if(serial_fd==-1) {

    perror("cannot open serial port!\n");

    exit(0);

    }

    fcntl(serial_fd,F_SETFL,0);

    tcgetattr(serial_fd,&options);

    cfsetispeed(&options, B9600);

    cfsetospeed(&options, B9600);

    options.c_cflag &= PARENB;

    58 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    59/65

    B Source code

    options.c_cflag &= CSTOPB;

    options.c_cflag &= CSIZE;

    options.c_cflag &= CRTSCTS;

    options.c_cflag |= ( CS8 | CREAD | CLOCAL );

    options.c_lflag &= (ICANON | ECHO | ECHOE | ISIG);

    options.c_iflag &= (IXON | IXOFF | IXANY);

    options.c_oflag &= OPOST;

    tcsetattr(serial_fd, TCSAFLUSH, &options);

    }

    void close_serial() {

    close(serial_fd);

    }

    void send_char(unsigned char c) {

    write(serial_fd,&c,1);

    }

    void send_string(unsigned char *s, int length) {write(serial_fd,s,length);

    }

    unsigned char read_char() {

    unsigned char buf;

    read(serial_fd,&buf,1);

    return buf;

    }

    // fpga commands

    void reset_map() {

    send_char(0x08);

    }

    void reset_nodes() {

    send_char(0x04);

    }

    Universitat Bielefeld, AG Technische Informatik 59

  • 7/29/2019 Dijsktra Thesis

    60/65

    B Source code

    unsigned char read_node(unsigned char addr) {

    send_char(0x01);

    send_char(addr);

    send_char(0x02);

    return read_char();

    }

    void write_node(unsigned char addr, int flag) {

    send_char(0x01);

    send_char(addr);

    if(flag) {

    send_char(0x60);} else {

    send_char(0x40);

    }

    }

    void activate_node(unsigned char addr) {

    send_char(0x01);

    send_char(addr);

    send_char(0xE0);

    }

    void iterate() {

    send_char(0x10);

    }

    60 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    61/65

    B Source code

    B.3 map transfer.c - Host utility to transfer a map

    to the FPGA

    /* map_transfer.c - Transfers an ASCII map file

    * to the FPGA

    */

    #include

    #include "serial.h"

    int main(int argc, char **argv) {

    FILE *map;

    char line_buffer[256];

    unsigned char addr;

    int width,height,x,y;

    if(argc!=2) {

    printf("usage: %s \n",argv[0]);

    exit(10);

    }

    if(!(map=fopen(argv[1],"r"))) {printf("cant open mapfile!\n");

    exit(10);

    }

    fgets(line_buffer,256,map);

    width=atoi(line_buffer);

    fgets(line_buffer,256,map);

    height=atoi(line_buffer);

    init_serial();reset_nodes();

    reset_map();

    for(y=0;y

  • 7/29/2019 Dijsktra Thesis

    62/65

    B Source code

    printf("**");

    } else {

    write_node(addr,1);

    printf(" ");

    }

    }

    printf("\n");

    }

    close_serial();

    }

    62 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    63/65

    Bibliography

    Bibliography

    [ALS03] Bertrand Cuzeau ALSE. Simple UART project - VHDL. http://www.alse-

    fr.com, 2003.

    [Cor99] Xilinx Corporation. XC4000 databook. http://www.xilinx.com, 1999.

    [Cor03a] Altera Corporation. Nios development board. http://www.altera.com, 2003.

    [Cor03b] Xilinx Corporation. Virtex-II platform FPGAs: Complete data sheet.

    http://www.xilinx.com, 2003.

    [Cor04a] Altera Corporation. Quartus II device handbook. http://www.altera.com,

    2004.

    [Cor04b] Altera Corporation. Stratix device handbook. http://www.altera.com, 2004.

    [Cor04c] Altera Corporation. Stratix II device handbook. http://www.altera.com, 2004.

    [Min78] E. Minieka. Optimization Algorithms for Networks and Graphs. Dekker, New

    York, Basel, 1978.

    [Mol99] Ralf Moller. Path planning using hardware time delays. IEEE Transactions

    on Robotics and Automation, Volume 15 No. 3:pp. 588591, 1999.

    [Nes03] John A. Nestor. FPGA Implementation of a Mulitlayer Maze Router. De-

    partment of Electrical and Computer Engineering, Lafayette College, Easton,

    2003.

    [TS01] Matti Tommiska and Jorma Skytta. Dijkstras shortest path routing algorithm

    in reconfigurable hardware. LNCS 2147, pages pp. 653657, 2001.

    Universitat Bielefeld, AG Technische Informatik 63

  • 7/29/2019 Dijsktra Thesis

    64/65

    Bibliography

    64 Universitat Bielefeld, AG Technische Informatik

  • 7/29/2019 Dijsktra Thesis

    65/65

    Bibliography

    Hiermit versichere ich, da ich diese Diplomarbeit selbstandig bearbeitet habe. Ich habe

    keine anderen als die angegebenen Quellen und Hilfsmittel benutzt und entsprechende

    Zitate kenntlich gemacht.

    Bielefeld, den November 11, 2004

    Marcus Grieger