FRAM Evaluation as Unified Memory for Convex Optimization …cse.lab.imtlucca.it/~bemporad/publications/papers/ederc... · 2015. 3. 25. · FRAM EVALUATION AS UNIFIED MEMORY FOR CONVEX

FRAM EVALUATION AS UNIFIED MEMORY FOR CONVEX OPTIMIZATIONALGORITHMS

Cimini Gionata1, Bemporad Alberto2, Ippoliti Gianluca1 and Longhi Sauro1

1 Universita Politecnica delle Marche, Via Brecce Bianche 12, 60131, Ancona, Italy2 IMT - Institute for Advanced Studies, Piazza San Ponziano 6, 55100 Lucca, Italy

email: [email protected], [email protected], [email protected], [email protected]

ABSTRACTFerroelectric Random Access Memory (FRAM) by Texas In-struments (TI) is a non-volatile memory which allows lowerpower and faster data throughput compared to other non-volatile solutions. These features have accelerated the in-terest in this technology as the future of embedded unifiedmemory, in particular in data logging, remote sensing andWireless Sensor Network (WSN). The application of ModelPredictive Control (MPC) in WSN has gained lot of atten-tion in the last years and it requires solving convex optimiza-tion problems in real-time. In this paper several convex op-timization algorithms have been implemented and comparedon a FRAM-based MSP-EXP430FR5739 node by TI, to eval-uate its suitability in extending the potentialities of onboardvolatile Static Random Access Memory (SRAM) for embed-ded optimization-based control.

1. INTRODUCTION

Thanks to the recent advances in a variety of engineeringdisciplines, such as system miniaturization, wireless com-munication and signal processing, WSNs have gained alot of attention, in particular in the field of smart envi-ronments. A sensor node is usually battery-powered andcombines sensing, computation and communication capa-bilities. Nowadays energy consumption remains as a ma-jor obstacle for full deployment and exploitation of WSNsin very different application fields [1, 2]. For this reasonlot of researchers are focused on the low-power circuit de-sign techniques or the power management approaches to im-prove node’s power consumption, minimizing for examplethe communication’s requirements [3] or focusing on addingexternal power sources to the node [4], [5].

Among these approaches, efforts have been made to de-velop memories for this kind of applications, as the FRAMby TI, which is competitive with emerging new memorytechnologies, but is already commercially available and con-sidered as the memory of the future due to its ideal prop-erties [6]. FRAM is a non-volatile memory which com-bines the speed, ultra-low-power, endurance and flexibilityof SRAM with the reliability and stability of flash memory;furthermore it allows random access to each individual bit forboth read and write. FRAM technology is particularly suitedin WSNs because, compared to standard flash memory, it hasmore than 100x faster write speed while having less than halfthe memory access energy per bit resulting in much higherthroughput and less energy per stored bit [7].

In this paper the performance of the FRAM has beenevaluated, focusing on its flexibility in terms of program-ming and on its write speed. The benchmark consists in the

implementation of convex optimization algorithms on MSP-EXP430FR5739 Experimenter Board by TI, a developmentplatform for data logging applications, energy harvesting,wireless sensing and automatic metering infrastructure. Theaim of the work is to verify the possibility to extend the ca-pabilities of on-board volatile memory with a non-volatilesupport, when the problems to be solved are larger than thestorage space in the volatile support. The choice for the algo-rithms to implement comes from their application in MPC,where the control problem requires solving Quadratic Pro-gramming (QP) in real time. The investigation of MPC forthe FRAM sensor node is motivated by the recent studiesin the applicability of MPC in the context of WSNs, wherethe technique has been used for power control to guarantee aspecified Quality of Service (QoS) adapting the transmissionpower level which has to be minimized [8]. Furthermore,MPC has been adopted in WSN for other purposes such asresources’ allocation and power management in energy har-vesting [9]. More generally MPC can be thought as a wayof expanding WSN nodes with real-time decision capabili-ties. The need for a QP solver to be executed at each sam-pling period has increased the interest in algorithms suitedfor embedded applications where the simplicity to code is akey feature, together with proved guarantees on worst-caseexecution time and robustness to low precision arithmetic.For these reasons, different algorithms have been developedin the last years and some of them, suitable for embeddedapplication, have been implemented and evaluated on the de-vice: Dual Gradient Projection (DGP), the Accelerated DualGradient-Projection (GPAD) [10], Parallel Quadratic Pro-gramming (PQP) [11] and Alternating Direction Method ofMultipliers (ADMM) [12]. The four algorithms have beencompared in terms of applicability on the FRAM support,evaluating the speed of a complete SRAM execution and acomplete FRAM one which, in addition to the above men-tioned benefits, permits the storage of a larger problem onthe selected device (1Kb of SRAM and 16KB of FRAM).Furthermore the performance of the algorithms have beendetailed in terms of data stored, Time Per Iteration (TPI)and average total execution time with worst case evaluation.The paper is organized as follows: in Section 2 the FRAMtechnology is detailed; Section 3 defines the formulation ofthe problem to be solved together with the algorithms testedwhich are briefly summarized. Finally in Section 4 the per-formance of the algorithms are compared.

2. FRAM TECHNOLOGY

While the benefits of FRAM have been known for manyyears, [6], the mass-production of FRAM-based devices has

Proceedings of the 6th European Embedded Design in Education and Research, 2014

978-1-4799-6843-5/14/$31.00 ©2014 IEEE 187

begun recently due to the improvements in encapsulation, ro-bustness to fatigue and access schemes for high density de-vices, [13–15]. FRAM stores information using the polariza-tion of ferroelectric thin film placed between two electrodes.The FRAM cell structure is very similar to that used in ofDRAM, consisting of One Transistor One Capacitor (1T1C)cell. The dipole orientation, which is the stored information,can be set and reversed by applying voltage across either line.The most important feature of the ferroelectric material isits non-volatility and immunity to magnetic fields. Unlikewidely used non-volatile memory technologies such as EEP-ROM or Flash memory, FRAM does not require a specialsequence to write data as random access to each individualbit for both read and write is permitted. The main features ofthe ferroelectric memory are briefly detailed in the followingsection.

2.1 Speed and UniversalityNormally tens of ms to several ms are required to write flashmemory in order to program one data word. In addition, thistime does not include the pre-erasing of the segment to be re-programmed, which takes several ms. In contrast, FRAM re-quires only 100ns to program one data word and pre-erasingis not required. Thanks to this fast write, there is virtually nointerruption of the program execution and a speed close to thevolatile memories such as SRAM allows FRAM to be usedas a unified “memory”, flexible to be partitioned to the needsof the application, providing more bytes for data (i.e., in datalogging application) or for code (i.e., in complex algorithmimplementation).

2.2 Low Power and ReliabilityReads and writes in FRAM-based devices occur at low volt-age, just 1.5V , and at very little current; thus a charge pumpin not needed to operate, drastically reducing the power con-sumption as well as the footprint of the device. For this de-vices TI guarantee 10 years of operation (virtually unlim-ited write endurance) and data retention at 85�. FurthermoreFRAM does not suffer from Soft Error, which are flips of bitstates due to external sources.

3. PROBLEM FORMULATION ANDIMPLEMENTED ALGORITHMS

For the convenience of the reader, the convex optimization al-gorithms considered in this paper for solving the linear MPCproblems are briefly described. In particular, such algorithmscan handle polyhedral constraints and convex quadratic ob-jective function, resulting in a constrained Quadratic Pro-gramming (QP) problem of the form:

minimizez

V (z) =12

zT Qz+qT z

subject to g(z) = Az�b 0(1)

with z 2 Rn, Q 2 Rn⇥n, q 2 Rn, A 2 Rm⇥n and b 2 Rn. Thedual problem of (1) is given by the following formulation:

minimizez

F(z) =12

yT Hy+hT y+12

W

subject to y � 0(2)

with

H = AQ�1AT , h = AQ�1q+b, W = qT Q�1q (3)

where H 2 Rm⇥n, h 2 Rm, y 2 Rm.

Definition 3.1. Let z⇤ and V ⇤ be the optimal solution and theoptimal value for the primal problem (1) and consider twonon-negative constants ev,eg and Z = {z 2 Rn|Az�b 0};we say that z 2 Rn is an (ev,eg)-optimal solution for (1) if

(4a)V (z)�V ⇤ ev

(4b)maxi2N[1,mx ]

[gi(z)]+ eg

where [gi(z)]+ = max{gi(z),0}.

In the following a summary for each QP solver consid-ered in this paper is provided. The implementations adoptedare optimized for speed execution, storing as much data aspossible to reduce the Time Per Iteration (TPI). The im-portance of coding details for energy-aware applications hasbeen addressed in [16].

3.1 Dual Gradient Projection (DGP)Considering the optimization problem in (1) and its La-grangian L (z,y) =V (z)+yT g(z), the gradient projection al-gorithm applied to the dual problem consists in solving thefollowing equations at every iterations.

zk = argminz

L (z,yk) (5)

yk+1 =

yk +

1L

g(zk)

�

+

(6)

where

L =

sm

Âi, j=1

|Hi, j|2 (7)

is the Frobenius norm of H. In this paper a fixed-point imple-mentation of DGP has been used, presented in [17] where aninexact DGP method specifically suitable for a fixed-pointimplementation has been developed together with detailedconvergence rate analysis (based on [18]).

3.2 Accelerated Dual Gradient-Projection Algorithm(GPAD)In [10] an algorithm based on the fast gradient projectionmethod of [19] has been proposed and applied to the QP(1). It is particularly tailored for embedded linear MPC withpolyhedral constraints on both inputs and states. An iterationof the GPAD algorithm is:

wk = yk +b k(yk � yk�1) (8)

zk = argminz

L (z,wk) (9)

yk+1 =

w⇤ k+

1L

g(zk)

�

+

(10)

b k+1 =k�1k+2

(11)

This algorithm guarantee a convergence rate of O(1/k2) forboth dual and primal optimality as well as for primal feasi-bility.

188

3.3 Parallel Quadratic Programming (PQP)The Parallel Quadratic Programming (PQP) method has beendeveloped in [20] for image processing problems in which aquadratic minimization program in the non-negative cone isinvolved, which is the form of the dual problem in (2). Re-cently the PQP structure has been exploited for constrainedMPC [11]. Given a generic matrix A, let us define the twomatrices:

[A�]i, j = max{0,�[A]i, j} (12)[A+]i, j = max{0, [A]i, j} (13)

Furthermore, construct a diagonal matrix W 2 Rm⇥m, com-puted to guarantee convergence; the choice adopted in [11]is [W]ii � [H�1]. By taking F =�Q�1AT a PQP iteration is:

hyk+1

i

i=

⇥(H�+W)zk +F�⇤

i[(H++W)zk +F+]i

[zk]i (14)

3.4 Alternating Direction Method of Multipliers(ADMM)ADMM is a simple method for solving large-scale convexoptimization problems and is becoming very popular [21]. Infact, ADMM offers a solution comparable to other efficientalgorithms in serial regime, but is very popular in distributedoptimization, merging the benefits of augmented Lagrangianand dual decomposition: the solutions of local sub-problemsare coordinated to find the solution to a larger problem. Con-sider problem (1) in a version equal to the ADMM standardform [12], introducing a slack variable x:

minimizez

V (z) =12

zT Qz+qT z+F (x)

subject to g(z,x) = Az�b+ x = 0(15)

where F (x) is the indicator function in the non-negative or-thant. Considering the augmented Lagrangian function in ascaled form,

(16)Lr(z,x,u) =

12

zT Qz + qT z + F (x)

+r2||Az � b + x + u||22 �

r2||u||22

with u = y/r , the scaled ADMM iterations consists on sep-arating minimization of (16) over z and x into two differentsteps and then perform an ascent step on the Lagrange mul-tipliers y. Thus an ADMM iteration consists in:

(17a)zk+1 = argminz

Lr(z,xk,uk)

(17b)xk+1 = argminz

Lr(zk+1,x,uk)

(17c)uk+1 = uk + r g(z,x)

4. RESULTSIn this section the results of the implementation of the al-gorithms in Section 3 are collected. The experimental setupconsists of a MSP-EXP430FR5739 programmed with USBinterface via Code Composer Studio 5. The MSP430FR5739

Figure 1: The experimental setup, with MSP-EXP430FR5739 device and oscilloscope to capture thetime needed by the different solvers tested.

integrated microcontroller has a 16-Bit fixed-point RISC ar-chitecture with a system clock up to 24-MHz and a memorycomposed of 16KB FRAM and 1KB SRAM. The algorithmshave been compared in terms of memory allocation and ex-ecution time over the two different memories. Random gen-erated QP problems have been used to test the algorithms;different sizes of QP have been tested considering n 2 [2,10]and m 2 [4,20]. The experimental setup for the evaluation ofperformance is shown in Figure (1).

4.1 Memory AllocationConsidering n, m defined as in equation (1) and the algo-rithms’ implementations as described in Section 3, the mem-ory occupancy relationships in terms of stored data can becompared as in Figure 2. For embedded application, a staticmemory allocation is preferable and all the algorithms havebeen coded with this shrewdness. Figure 2 gives a qualitativecomparison in terms of bytes allocated by the algorithms inrelation with the size of the problem: the horizontal plane ineach graph represents the 1Kb limit of the SRAM memoryof the device; as has been prove in the next section, the sizeof the problem can be extended involving the FRAM withoutcompromising the performances of the algorithms. Using theproposed implementation Figure 2 shows that the PQP algo-rithm stores more data than the other implementations which,are seen to be approximately comparable.

4.2 Execution TimeThe four algorithms have been tested comparing a completeexecution on the SRAM support and a complete executionon the FRAM support. To evaluate the execution time, as weare in the order of microseconds, software-based methodshave been avoided and a pin toggle has been used insteadwith an oscilloscope. By default the CCS compiler storesthe variables defined as constants in the FRAM support, andthe others in the SRAM; to force a complete execution onSRAM is sufficient to avoid the definition of constant vari-ables, whereas a complete FRAM execution needs differentsteps. A new area of FRAM must be defined by modifying

189

Figure 2: Memory Allocation for the Optimizazion Algo-rithms

the system memory map and a new memory section has tobe created inside the previously defined memory area. Fi-nally the definition of the variables have to be included inan environment as in Code 1. A part from this basic steps,the variables can be used with the simplicity of a SRAM asthe writing and reading phase from FRAM support is totallytransparent.

Figure 3 shows the Time Per Iteration (TPI) benchmarkof the four algorithms. The TPI is a deterministic evalu-ation of the embedding capabilities of an algorithm. Asthe selected platform provides a 32-bit Hardware Multiplier(MPY), sums and multiplications are easily performed by thedevice. GPD, GPAD and ADMM are substantially compara-ble; GPD and ADMM show equal computation time, GPADexhibits a slight increase in time and finally PQP shows theworst performances respect to other algorithms (this is due tothe necessity of several divisions). Figure 3 shows the com-parison between the execution time on SRAM and on FRAMfor each algorithm. The results show that the time needed bya complete FRAM execution is comparable to that neededby the volatile memory, thus it is observed that FRAM canbe used to deal with larger problems without noticeable de-crease in performance. For each graph in Figure 3 the exe-cution on SRAM stops when no more space is available onvolatile memory; as expected from memory allocation re-sults, PQP can solve only smaller problems with respect toother algorithms. Figure 4(a) shows the mean time needed(over 1000 testing problems per dimension) to solve QPs ofdifferent sizes, considering the FRAM execution; Figure 4(b)presents the worst case with the same testing problems.

The results for both memory allocation and executiontime can be summarized as follows:

• A complete execution on the FRAM is comparable withthat on SRAM, so the non-volatile memory can be usedto store and solve larger problems;

• The PQP algorithm stores a greater amount of data com-pared to the other three algorithms which are substan-tially comparable;

0 50 100 150 200

200

400

600

800

Problem dimension (n · m)

TPI

(µs)

GPD RAM/FRAM comparison

0

RAM ExecutionFRAM Execution

0 50 100 150 200

200

400

600

800


TPI

(µs)

GPAD RAM/FRAM comparison

0


0 50 100 150 200

500

1000

1500

2000

2500

Problem dimension (n · m)TP

I (µ

s)

PQP RAM/FRAM comparison

0


0 50 100 150 200

200

400

600

800


TPI

(µs)

ADMM RAM/FRAM comparison

0


Figure 3: Time Per Iteration Benchmark between SRAM andFRAM Execution: the SRAM plots stop when there is nomore space available.

• ADMM, GPAD and GPD result to be particularly suitedto be executed on the embedded platform, thanks to thesimplicity of the operations involved;

• ADMM outperforms the other algorithms (also for theworst case) with a combination of low TPI and the lowestnumber of iterations needed; only the GPAD remains agood alternative;

It is important to note that the good performance of theADMM algorithm is deeply related on the selection of theaugmenting term r for the Lagrangian. The results presentedin this paper are obtained with a good choice of this factor,selected empirically.

Code 1: Example of FRAM variable definition#pragma SET_DATA_SECTION(".fram_vars")[...variables definition...]#pragma SET_DATA_SECTION()

5. CONCLUSION

In this paper an MSP430 platform by TI with FRAM memorytechnology has been used for a benchmark of iterative solversfor quadratic programming. The novel non-volatile memoryhas been tested in terms of execution speed and compared tothe standard integrated SRAM, showing competitive resultsand thus allowing a combined use of volatile and non-volatilememory to increase the stored problem’s dimensions. Thesuitability of the platform for solving optimization problems,for example in MPC applications for WSNs, has been con-firmed. The GPD, GPAD, PQP and ADMM algorithms havebeen implemented in a speed-aware form on the platform andcompared from both a memory allocation and an executiontime point of view.

190

0 50 100 150 200

200

400

600

800

1000


T (m

s)

Total Computation Time (Mean Value)

0

GPDGPADPQP ADMM

(a) Total Optimization Time Com-parison (Mean Value)

0 50 100 150 200

1000

2000

3000

4000

5000

6000

7000

Problem dimension (n · m)T

(ms)

Total Computation Time (Worst Case)

0

GPDGPADPQP ADMM

(b) Total Optimization Time Com-parison (Worst case)

Figure 4: Execution time tested on 1000 random QP per di-mension with n = x, m = 2x and x 2 [2,10].

REFERENCES

[1] A. Seema and M. Reisslein, “Towards efficient wirelessvideo sensor networks: A survey of existing node ar-chitectures and proposal for a Flexi-WVSNP design,”Communications Surveys Tutorials, IEEE, vol. 13,pp. 462–486, Third 2011.

[2] G. Owojaiye and Y. Sun, “Focal design issues affectingthe deployment of wireless sensor networks for intelli-gent transport systems,” Intelligent Transport Systems,IET, vol. 6, pp. 432–432, December 2012.

[3] R. Yan, H. Sun, and Y. Qian, “Energy-aware sen-sor node design with its application in wireless sensornetworks,” Instrumentation and Measurement, IEEETransactions on, vol. 62, pp. 1183–1191, May 2013.

[4] A. S. Weddell, N. J. Grabham, N. R. Harris, andN. M. White, “Modular plug-and-play power resourcesfor energy-aware wireless sensor nodes.,” in SECON,pp. 1–9, IEEE, 2009.

[5] D. Riley and M. Younis, “A modular and power-intelligent architecture for wireless sensor nodes,” inLocal Computer Networks (LCN), 2012 IEEE 37thConference on, pp. 304–307, Oct 2012.

[6] R. Bailey, G. Fox, J. Eliason, M. Depner, D. Kim, E. Ja-billo, J. Groat, J. Walbert, T. Moise, S. Summerfelt,K. R. Udayakumar, J. Rodriquez, K. Remack, K. Boku,and J. Gertas, “FRAM memory technology - advan-tages for low power, fast write, high endurance ap-plications,” in Computer Design: VLSI in Computersand Processors, 2005. ICCD 2005. Proceedings. 2005IEEE International Conference on, 2005.

[7] A. Baumann, M. Jung, K. Huber, M. Arnold, C. Sichert,S. Schauer, and R. Brederlow, “A MCU platform withembedded FRAM achieving 350na current consump-tion in real-time clock mode with full state retentionand system wakeup time,” in VLSI Circuits (VLSIC),2013 Symposium on, pp. C202–C203, 2013.

[8] K. Witheephanich and M. J. Hayes, “On the appli-cability of model predictive power control to an ieee802.15.4 wireless sensor network,” in Signals and Sys-tems Conference (ISSC 2009), IET Irish, pp. 1–8, June2009.

[9] X. J. Li, X. Shao, K.-V. Ling, and B.-H. Soong, “Ap-

plication of model predictive control in wireless sensornetworks,” in Information, Communications and SignalProcessing (ICICS) 2011 8th International Conferenceon, pp. 1–5, Dec 2011.

[10] P. Patrinos and A. Bemporad, “An accelerated dualgradient-projection algorithm for embedded linearmodel predictive control,” Automatic Control, IEEETransactions on, vol. 59, pp. 18–33, Jan 2014.

[11] S. D. Cairano, M. Brand, and S. A. Bortoff,“Projection-free parallel quadratic programming forlinear model predictive control,” International Journalof Control, vol. 86, no. 8, pp. 1367–1385, 2013.

[12] S. Boyd, N. Parikh, and E. Chu, Distributed Optimiza-tion and Statistical Learning Via the Alternating Direc-tion Method of Multipliers. Now Publishers, 2011.

[13] Y. Kang and S. Lee, “The challenges and directionsfor the mass-production of highly-reliable, high-density1T1C FRAM,” in Applications of Ferroelectrics, 2008.ISAF 2008. 17th IEEE International Symposium on the,vol. 1, pp. 1–2, 2008.

[14] H. K. Ko, D. Jung, Y. K. Hong, J. Park, Y. Kang,H. Kim, Y. Kang, H. Kim, W. Jung, D. Choi, S. Kim,W. S. Ahn, J.-H. Kim, W. W. Jung, E. Lee, Y. Kang,S. Lee, H. Jeong, and K. Kim, “A novel encapsula-tion technology for mass-productive 150 nm, 64-mb,1T1C FRAM,” in Applications of Ferroelectrics, 2007.ISAF 2007. Sixteenth IEEE International Symposiumon, pp. 25–27, 2007.

[15] Z. Jia, G. Zhang, M. ming Zhang, T.-L. Ren, andH. yi Chen, “A novel fatigue-insensitive self-referencedscheme for 1T1C FRAM,” in Memory Workshop(IMW), 2010 IEEE International, pp. 1–2, 2010.

[16] T. Krueger, “Simple and cheap measurement of FRAMcurrent consumption and performance data,” in Educa-tion and Research Conference (EDERC), 2012 5th Eu-ropean DSP, pp. 53–57, 2012.

[17] P. Patrinos, A. Guiggiani, and A. Bemporad, “Fixed-point dual gradient projection for embedded model pre-dictive control,” in Control Conference (ECC), 2013European, pp. 3602–3607, 2013.

[18] O. Devolder, F. Glineur, and Y. Nesterov, “First-ordermethods of smooth convex optimization with inexactoracle,” No. 2011002, 2011.

[19] Y. Nesterov, “A method of solving a convex program-ming problem with convergence rate o (1/k2),” SovietMathematics Doklady, vol. 27, no. 2, pp. 372–376,1983.

[20] M. Brand and D. Chen, “Parallel quadratic program-ming for image processing,” in Image Processing(ICIP), 2011 18th IEEE International Conference on,pp. 2261–2264, 2011.

[21] D. Han and X. Yuan, “Local linear convergence ofthe alternating direction method of multipliers forquadratic programs,” SIAM Journal on NumericalAnalysis, vol. 51, no. 6, pp. 3446–3457, 2013.

191

Documents

FRAM Evaluation as Unified Memory for Convex Optimization …cse.lab.imtlucca.it/~bemporad/publications/papers/ederc... · 2015. 3. 25. · FRAM EVALUATION AS UNIFIED MEMORY FOR CONVEX