MAFA07 19 Heiner

Embed Size (px)

Citation preview

  • 8/13/2019 MAFA07 19 Heiner

    1/31

    Fault Tolerant ICAP Controller for High-ReliableInternal Scrubbing

    Jonathan Heiner, Nathan Collins, & Michael Wirthlin

    This work was supported by Lockheed Martin under a grant from the University Projects program withcollaboration by Tim Gallagher and Jon Wilson.

  • 8/13/2019 MAFA07 19 Heiner

    2/31

    Brigham Young University

    Outline

    FPGA Scrubbing Overview

    Internal Configuration Access Port (ICAP) Internal ICAP Architecture

    High Reliability Scrubber

    Radiation Test & Results Future Work & Summary

  • 8/13/2019 MAFA07 19 Heiner

    3/31

    Brigham Young University

    FPGA Fault Tolerant Strategy FPGAs provide SEU mitigation

    through redundancy andscrubbing

    Triple Modular Redundancy(TMR)

    Triplicate module to introduceredundancy

    Vote on outputs of triplicatedmodule

    Use greatest common result

    Configuration Scrubbing

    Readback frame data Compare frame to original

    Correct erroneous bits in frame

    Writeback frame to FPGA

    A A

    A

    A

    V

    FPGA

    Memory Comparitor

    Modifier

    ...010101010...

    ...010101010...

    ...010111010...

    Original FrameData

    ErroneousFrame Data

    CorrectedFrame Data

  • 8/13/2019 MAFA07 19 Heiner

    4/31

    Brigham Young University

    Continuous Time Reliability

    TMR w/Repair

    Non-redundantTMR

    (no repair)

    t

    t1

    101

    t2t3

    t 31

    1 20 1

    t 211

    t

    t3

    t 31

    1 20)2(1 tt +

    1

    Non Redundant

    TMR w/o Repair

    TMR w/Repair

  • 8/13/2019 MAFA07 19 Heiner

    5/31

    Brigham Young University

    Configuration Scrubbing Example

    ScrubColumn

    x

    ConfigurationUpset

  • 8/13/2019 MAFA07 19 Heiner

    6/31

    Brigham Young University

    Configuration Scrubbing Example

    x Configuration

    UpsetRepaired

  • 8/13/2019 MAFA07 19 Heiner

    7/31

    Brigham Young University

    External Components

    RadHard Memory

    Configuration Controller

    Dedicated IO

    Traditional Scrubbing

    External Scrubbers

    Blind Scrubbing

    Read-back Scrubbing

    Rad HardMemory

    DataAddress

    OE

    FPGA

    Data

    WriteCSBusyDoneProgInitCCLK

    SelectMap

    ConfigurationController

  • 8/13/2019 MAFA07 19 Heiner

    8/31

  • 8/13/2019 MAFA07 19 Heiner

    9/31

    Brigham Young University

    Internal Scrubbing

    Described in XAPP714 Architecture

    Based on ICAP configuration port

    Internal Scrubbing Architecture

    Control LogicPicoBlazeProcessor

    PicoBlazeBRAM

    Frame ECC ICAP

    DMA BRAM

    ICAP DMAFPGA Primitives

    Configuration Memory

  • 8/13/2019 MAFA07 19 Heiner

    10/31

    Brigham Young University

    Internal Scrubbing Strategy

    Internal Scrubbing Process

    Perform readback of each frame via ICAP interface

    Use FrameECC to detect errors

    Correct errors based on FrameECC syndrome value

    Write corrected frame back via ICAP interface

    Advantages

    No external memory, external controller, or external IO pins

    Disadvantages

    Additional circuit area required for scrubbing circuit

    Reliability of scrubber

  • 8/13/2019 MAFA07 19 Heiner

    11/31

    Brigham Young University

    Internal Configuration Access Port (ICAP)

    Internal interface to configuration port

    Active readback and re-configuration

    Similar to SelectMap (separate I/Out data bus)

    Hard-wired Logic

    Current application usage Dynamic Partial Reconfiguration

    Encryption

    Fault Tolerance/Injection

    CONFIG_MODE

    CONFIG_START

    RESET_FPGA

    DATA

    R/W

    CE

    CLK

    DONE

    BUSY

    ICAP

    SELECTMAP

    DATA

  • 8/13/2019 MAFA07 19 Heiner

    12/31

    Brigham Young University

    Detailed Architecture Overview

    Frame ECC

    Hard-wired internal

    component

    Performs SECDEDalgorithm on frame

    Provides syndromeword and error bitvalues

    Directly connectedto read-port of ICAP

    Control LogicPicoBlazeProcessor

    PicoBlazeBRAM

    Frame ECC ICAP

    DMA BRAM

    ICAP DMAFPGA Primitives

    Configuration Memory

  • 8/13/2019 MAFA07 19 Heiner

    13/31

    Brigham Young University

    Detailed Architecture Overview

    Control LogicPicoBlazeProcessor

    PicoBlazeBRAM

    Frame ECC ICAP

    DMA BRAM

    ICAP DMAFPGA Primitives

    Configuration Memory

    ICAP DMA

    Provides ICAP with

    data every clockcycle

    Stores ICAP output

    to DMA BRAM Transmits BRAM

    content to controllogic

  • 8/13/2019 MAFA07 19 Heiner

    14/31

    Brigham Young University

    PicoBlaze Processor

    8-bit programmable

    Controller

    Performs scrubbinglogic

    BRAM contains pre-compiled scrubbingprogram

    Software used forease of modifyinglogic

    Detailed Architecture Overview

    Control LogicPicoBlazeProcessor

    PicoBlazeBRAM

    Frame ECC ICAP

    DMA BRAM

    ICAP DMAFPGA Primitives

    Configuration Memory

  • 8/13/2019 MAFA07 19 Heiner

    15/31

    Brigham Young University

    Control Logic

    Synchronize data

    transfer betweenpicoblaze and ICAPDMA

    Maintains timing anddata requirements

    Detailed Architecture Overview

    Control LogicPicoBlazeProcessor

    PicoBlazeBRAM

    Frame ECC ICAP

    DMA BRAM

    ICAP DMAFPGA Primitives

    Configuration Memory

  • 8/13/2019 MAFA07 19 Heiner

    16/31

    Brigham Young University

    Scrubber Program

    Initializes devices

    Walk slow scan

    Approx. 24ms to 278ms @ 100Mhz

    Actual Detection

    Run fast scan

    Approx. 1.2ms to 14.6ms @100Mhz

    Quick Detection (is there an errorsomewhere)

    Patch Ignore SEU bymodifying parity bits

    Correction Correct SEU

    Initialization andSynchronization

    Initial Walk

    Initial RunScan

    Perform RunScan

    Perform WalkError

    Detected?

    PerformCorrection

    ErrorDetected?

    Perform Patch

  • 8/13/2019 MAFA07 19 Heiner

    17/31

    Brigham Young University

    High Reliability Scrubber

    Internal Scrubber is susceptible to configuration upsets

    Logic used to implement scrubber may be affected bySEUs

    Upsets within the scrubber logic may limit the ability

    of the scrubber to repair the fault

    SEU mitigation technique needed to insure reliablescrubbing

  • 8/13/2019 MAFA07 19 Heiner

    18/31

    Brigham Young University

    High Reliable ICAP Scrubber

    Control LogicPicoBlazeProcessor

    PicoBlazeBRAM

    Frame ECC ICAP

    DMA BRAM

    ICAP DMAFPGA Primitives

    TMR

    MemoryScrubbing

    No Mitigation

  • 8/13/2019 MAFA07 19 Heiner

    19/31

  • 8/13/2019 MAFA07 19 Heiner

    20/31

    Brigham Young University

    BRAM Scrubber

    Specialized BRAMscrubber for Picoblaze

    memory

    Continuously read and

    repair upsets within thememory

    PICOBLAZE

    PROCESSOR

    (KCPSM 3 )

    P OR T B

    P OR T A

    PO RT B

    PO RT A

    PO RT B

    PO RT A

    WRITE -

    BACK

    LOGIC

    PICOBLAZE

    PROGRAM CODE

    (TEST 0)

    BRAM SCRUBBER

    AddressCounter

    VOTER

    VOTER

    VOTER

  • 8/13/2019 MAFA07 19 Heiner

    21/31

    Brigham Young University

    Scrubber Design Utilization

    62BRAM 1308 (12%)736 (6%)Slices

    680 (3%)Non TMR

    1082 (5%)Flip FlopsTMRResource

    Virtex-4 LX-25

  • 8/13/2019 MAFA07 19 Heiner

    22/31

    Brigham Young University

    Radiation Test

    Determine the reliability of ICAP scrubber

    Measure reliability of non-TMR scrubber

    Measure reliability of TMR scrubber

    Test limitations

    Operated behind another test

    Did not have control over beam flux

    Had to reconfigure with beam onPicture of test

  • 8/13/2019 MAFA07 19 Heiner

    23/31

  • 8/13/2019 MAFA07 19 Heiner

    24/31

    Brigham Young University

    Radiation Test Design

    ICAP controller

    TMR design

    Non TMR design

    No other FPGA circuitry

    FPGA mostly empty

    Detect and repairupsets in all areas ofFPGA

    Unused logic

    ICAP controller logic

    Empty Logic Fabric

    ICAP Controller(both TMR and non TMR)

  • 8/13/2019 MAFA07 19 Heiner

    25/31

    Brigham Young University

    Data Collection and Monitoring

    UART

    Transmit SEU data to PC

    Provide status messages

    Watchdog timers

    Identify picoblaze failure

    External PC Log data

    Control LogicPicoBlazeProcessor

    PicoBlazeBRAM

    Frame ECC ICAP

    DMA BRAM

    ICAP DMAFPGA Primitives

    UART

    WatchdogTimers

    ExternalPC

  • 8/13/2019 MAFA07 19 Heiner

    26/31

    Brigham Young University

    Configuration Upsets between Failure

    0

    2

    4

    6

    8

    10

    12

    14

    0-200 201-400 401-600 601-800 801-1000 1000+

    TMR

    Non-TMR

    TMR: 1682 SEUs between failureNon-TMR: 309 SEUs between failure

  • 8/13/2019 MAFA07 19 Heiner

    27/31

    Brigham Young University

    Multiple Bit Upsets

    FRAME ECC does not identify location of failure withmultiple upsets within frame

    Single Error Correction, Double Error Detection Syndrome can not locate failures

    MBUs were detected but could not be corrected

    MBUs accumulated during the test

    Failures often occurred due to MBU accumulation

    Presence of MBU significantly slowed downscrubbing

    Performed configuration walk with MBU

    1.7% of upsets were intra-frame MBUs

  • 8/13/2019 MAFA07 19 Heiner

    28/31

  • 8/13/2019 MAFA07 19 Heiner

    29/31

    Brigham Young University

    Failure Modes

    Single Point Failures (were not isolated during test) UART I/O

    ICAP Frame ECC

    Failure Modes (isolated during test) Program crash

    Invalid response from UART Repeat FAR & syndrome values

    Repeat FAR but different syndrome values

    Repeat sets of FAR & syndrome values

    FAR increments till end of FPGA row Errors detected after test finished

    Failed during reconfiguration

  • 8/13/2019 MAFA07 19 Heiner

    30/31

    Brigham Young University

    Conclusions

    ICAP scrubber worked correctly as expected

    Detected upsets within FPGA fabric during operation

    Repaired SEUs within the device

    Hi-Rel scrubber provided improved reliability

    5.4x higher SEU to failure than non-TMR

    1.4x higher MBU to failure than non-TMR

    ICAP hi-rel scrubber reliability limited by MBUs

    Cannot remove MBUs

    Failure due to accumulation of MBUs

  • 8/13/2019 MAFA07 19 Heiner

    31/31

    Brigham Young University

    Future Work

    MBU Detection & Correction

    Investigate techniques for MBU correction

    VHDL Scrubber Increased speed & possibly smaller circuit

    Dynamic Partial Reconfiguration

    Future uses of ICAP

    Dynamic Partial Reconfiguration (bitstream compression)

    Low cost Fault Injection