General Linear Methods for Integrated Circuit Design

DISSERTATION
submitted for the academic degree of doctor rerum naturalium (Dr. rer. nat.) in Mathematics at the Mathematisch-Naturwissenschaftliche Fakultät II of the Humboldt-Universität zu Berlin

by Dipl.-Math. Steffen Voigtmann, born 12 July 1976 in Berlin

President of the Humboldt-Universität zu Berlin: Prof. Dr. Christoph Markschies
Dean of the Mathematisch-Naturwissenschaftliche Fakultät II: Prof. Dr. Uwe Küchler

Reviewers:
(a) Prof. Dr. John Butcher
(b) Prof. Dr. Roswitha März
(c) Prof. Dr. Caren Tischendorf

Submitted: 30 January 2006
Oral examination: 26 June 2006



Preface

Today electronic devices play an important part in everybody's life. In particular, there is an ongoing trend towards mobile devices such as cell phones, laptops or PDAs. Integrated circuits for these kinds of applications are mainly produced in CMOS technology (complementary metal-oxide semiconductor). CMOS circuits combine negatively and positively charged transistors and draw power only when switching polarity, so they use almost no power while inactive. Furthermore, advanced CMOS technology is expected to dominate in the future since it makes it possible to manufacture transistors in the nanoscale regime.

Circuit simulation is one of the key technologies enabling a further increase in performance and memory density. A mathematical model is used to assess the circuit's behaviour before actually producing it. Thus production starts from an already optimised layout, and both production costs and time-to-market are significantly reduced.

One important analysis type in circuit simulation is the transient analysis of a layout's response to varying input signals. Based on schematics or netlist descriptions of electrical circuits, the corresponding model equations are generated automatically. This network approach preserves the topological structure of the circuit but does not lead to a minimal set of unknowns. Hence the resulting model consists of differential algebraic equations (DAEs). Typically these equations suffer from poor smoothness properties due to the model equations of modern transistors but also due to e.g. piecewise linear input functions. Similarly, time constants spanning several orders of magnitude give rise to stiff equations, so low order A-stable methods need to be used.

The further miniaturisation of electrical devices drives simulation methods for circuit DAEs to their limits. Due to the reduced signal/noise ratio, stability questions become more and more important for modern circuits. Thus there is a strong need to improve on the stability properties of existing methods such as the combination of BDF and trapezoidal rule. There are fully implicit Runge-Kutta methods that exhibit much better stability properties. However, these


methods are currently not attractive for industrial circuit simulators due to their high computational costs.

General linear methods (GLMs) provide a framework covering, among others, both linear multistep and Runge-Kutta methods. They enable the construction of new methods with improved convergence and stability properties. Up to now little is known about solving DAEs using general linear methods. In particular, the application of general linear methods in electrical circuit simulation has not yet been addressed. Hence the object of this thesis is to study general linear methods for integrated circuit design.

The work is organised as follows:

Part I: Using the charge oriented modified nodal analysis, the differential algebraic equations describing electrical circuits are derived. Classical methods for solving these equations are briefly addressed and their limitations are investigated. As a means to overcome these shortcomings, general linear methods are introduced.

Part II: Linear and nonlinear DAEs of increasing complexity are investigated in detail. Using the concept of the tractability index, a decoupling procedure for nonlinear DAEs is derived. This decoupling procedure is the key tool for giving conditions for the existence and uniqueness of solutions, but also for studying numerical integration schemes.

Part III: General linear methods are applied to differential algebraic equations. In order to prove convergence for index-2 DAEs it is essential to first investigate GLMs for implicit index-1 equations. Order conditions and further requirements on the method's coefficients are derived such that convergence is ensured. Using the decoupling procedure from Part II, these results are transferred to the case of index-2 equations.

Part IV: Methods of order p are constructed for 1 ≤ p ≤ 3. As different design decisions are possible, the emphasis is on comparing two families of methods: the first has p+1 internal stages while the other employs just p stages. While the former type of method allows better stability properties and highly accurate error estimators, the latter family reduces the work per step and is capable of reacting more rapidly to changes of the numerical solution. Implementation issues such as Newton iteration, error estimation and order control are addressed for both families of methods. Extensive numerical tests indicate high potential for general linear methods in integrated circuit design.


Acknowledgement

This work is one result of the close friendship between the numerical analysis group of Prof. Roswitha März and the 'Runge-Kutta Club' headed by Prof. John Butcher.

Roswitha März not only teaches numerical analysis at the Humboldt University in Berlin but also fills students with enthusiasm for the numerical analysis of differential algebraic equations. I am one of these students and I want to thank her for the motivating, encouraging and supportive atmosphere that I enjoyed at Humboldt University.

After finishing my Master's degree I was fortunate to get the chance to visit Prof. John Butcher at The University of Auckland. This stay in New Zealand was most influential for my future work. I thank John Butcher for letting me become part of the Runge-Kutta Club and teaching me so many things (not only about mathematics and general linear methods). I am honoured that he consented to review this thesis.

Towards the end of my stay in New Zealand a project developed that aimed at combining the two mathematical worlds I had lived in so far: the application of general linear methods to differential algebraic equations. My supervisor Prof. Caren Tischendorf (Technical University Berlin) was enthusiastic about this idea from the very beginning. I thank her for realising a project within the Research Center Matheon. Throughout my work on this project I was free to explore my own ideas, but Caren offered most valuable help whenever needed. I always trusted her guidance, and she never forced me into a certain direction.

While working on this thesis I was fortunate to meet many colleagues and friends who influenced my work. I thank Claus Führer (Lund University, Sweden) for many fruitful discussions on DAEs. He not only invited me to Lund but also arranged a visit with Anne Kværnø (NTNU Trondheim, Norway). I thank her for helping me with the convergence proof for general linear methods.


I learned a lot from Helmut Podhaisky (Martin-Luther University Halle-Wittenberg), in particular about the construction of methods and implementation issues. His Matlab codes formed the basis for developing my own DAE solver. Stepsize prediction and order control were discussed with Gustaf Söderlind (Lund University, Sweden); I thank him for taking an interest in my work. René Lamour (Humboldt University Berlin) was always available for discussion, and I thank Andreas Bartel (University of Wuppertal) for sending me a copy of his PhD thesis.

I am pleased to acknowledge the financial support of the Matheon Research Center and the German Research Foundation (Deutsche Forschungsgemeinschaft). I thank my colleagues at Infineon Technologies AG / Qimonda AG for supporting me in many ways. Special thanks go to Sabine Bergmann and to Sieglinde Janicke from the Humboldt University for extraordinary support when submitting the thesis.

After all, writing and finishing this work would not have been possible without the loving support of my wife, Sabine. I am lucky to have such a wonderful woman at my side.

Kei mai koe ki au
He aha te mea nui o te ao?
Maku e ki atu -
He tangata, he tangata, he tangata.

Maori proverb

Steffen Voigtmann

Ottobrunn, 20.08.2006.


Contents

Part I  Introduction

1  Circuit Simulation and DAEs
   1.1  Basic Circuit Modelling
   1.2  Differential Algebraic Equations

2  Numerical Integration Schemes
   2.1  Linear Multistep Methods
   2.2  Runge-Kutta Methods
   2.3  General Linear Methods

Part II  Differential Algebraic Equations

3  Linear Differential Algebraic Equations
   3.1  Index Concepts
   3.2  Linear DAEs with Properly Stated Leading Terms
   3.3  Examples

4  Nonlinear Differential Algebraic Equations
   4.1  The Index of Nonlinear DAEs
   4.2  Index-1 Equations

5  Exploiting the Structure of DAEs in Circuit Simulation
   5.1  Decoupling Charge-Oriented MNA Equations

6  Properly Stated Index-2 DAEs
   6.1  A Decoupling Procedure for Index-2 DAEs
   6.2  Application to Hessenberg DAEs
   6.3  Proofs of the Results

Part III  General Linear Methods for DAEs

7  General Linear Methods
   7.1  Consistency, Order and Convergence
   7.2  Stability Analysis

8  Implicit Index-1 DAEs
   8.1  Order Conditions for the Local Error
   8.2  Convergence for Implicit Index-1 DAEs
   8.3  The Accuracy of Stages and Stage Derivatives

9  Properly Stated Index-2 DAEs
   9.1  Discretisation and Decoupling
   9.2  Convergence

Part IV  Practical General Linear Methods

10  Construction of Methods
   10.1  Methods with s = p+1 Stages
   10.2  Methods with s = p Stages
   10.3  Summary

11  Implementation Issues
   11.1  Newton Iteration
   11.2  Error Estimation and Stepsize Prediction
   11.3  Changing the Order

12  Numerical Experiments
   12.1  Glimda++: Methods with s = p+1 Stages
   12.2  Glimda vs. Glimda++
   12.3  Glimda vs. Standard Solvers

Part V  Summary

Bibliography


List of Figures

1.1  The Miller Integrator circuit
1.2  Two simple VRC circuits
1.3  Analytical solution of the circuit from Figure 1.2 (b)
1.4  Impact of topology and technical parameters on the index
2.1  Damping behaviour of the BDF2 method for an RLC circuit
2.2  Artificial oscillations of the trapezoidal rule
2.3  Stability properties of BDF methods and the trapezoidal rule
2.4  Stability properties of some Runge-Kutta methods
2.5  Circuit diagram for the Ring Modulator
2.6  The Ring Modulator's output signal
2.7  Comparison of Dassl and Radau for the Ring Modulator
2.8  Order reduction for Gauss and SDIRK methods
2.9  Stability behaviour of the GLM (2.12) and order of convergence
3.1  Numerical solution of Example 3.9 (standard formulation)
3.2  Numerical solution of Example 3.9 (properly stated)
4.1  The NAND gate model
4.2  A nonlinear circuit with a current-controlled current source
5.1  The NAND gate with the equivalent circuit from Figure 4.1 (b)
6.1  The relation between u, z, w and the solution x∗
6.2  The components of the solution x for Example 6.8
6.3  The roadmap to the proofs
7.1  The local and global error for general linear methods
8.1  A simple RCL circuit
8.2  One step taken by a GLM M for the problem (8.1)
8.3  Simplification of trees
8.4  The global error for general linear methods
8.5  Example 8.28 – Accuracy of stages and stage derivatives
9.1  The relationship between discretisation and decoupling
10.1  Choosing λ for the method (10.12)
10.2  Choosing λ for the method (10.13)
10.3  Choosing λ for the order 3 method
10.4  The parameter a22 controls the damping behaviour of M∗2
10.5  Controlling the damping behaviour for Example 10.1
10.6  Choosing λ for the method M3 from Table 10.3
10.7  Stability regions for the three families of methods
11.1  Accuracy of the estimators from Table 11.1
11.2  Accuracy of the estimators from Table 11.2
11.3  Order selection strategy for the methods from Table 10.1
11.4  Order selection strategy for the methods from Table 10.4
12.1  Comparing the methods of Table 10.1 and 10.2 (Glimda++)
12.2  The transistor amplifier circuit
12.3  Comparison of Glimda and Glimda++
12.4  Comparison of Glimda and Glimda++, 2nd set of problems
12.5  Numerical results for the Two Bit Adding Unit
12.6  Comparison of Glimda and Dassl, Radau, Mebdfi


List of Tables

1.1  Characteristic equations for basic elements
2.1  Coefficients of the BDF methods
2.2  A(α)-stability of the BDF methods
2.3  Some examples of Runge-Kutta methods
2.4  The RadauIIA methods of order p = 1, 3, 5
2.5  Technical parameters for the Ring Modulator
4.1  The functions q and b for the NAND gate model
8.1  Special trees of order up to 4
8.2  Trees of order up to 3
8.3  The number of trees for a given order
8.4  The number of order conditions
8.5  Simplified trees and corresponding order conditions
10.1  Methods of Butcher and Podhaisky with inherent RK stability
10.2  Methods with Runge-Kutta stability and M∞,p = 0
10.3  M3(λ) with unconditional stability at zero and at infinity
10.4  A family of Dimsims with s = r = q = p
10.5  A Dimsim with s = r = p = q = 3 and A(α)-stability for α ≈ 88.1
12.1  Summary of failed runs for Glimda++ (IRKS and M∞,p = 0)
12.2  Summary of failed runs for Glimda and Glimda++ (IRKS)
12.3  Summary of failed runs (Glimda, Dassl, Radau, Mebdfi)
12.4  Comparison of different solvers for ODEs
12.5  Comparison of different solvers for DAEs


Part I

Introduction


1  Circuit Simulation and DAEs

The numerical simulation of electrical circuits in the time domain requires the derivation of a mathematical model. In computer-aided electronics design a network approach is adopted that combines physical laws such as energy or charge conservation with the characteristic equations of the network elements. Most frequently the modified nodal analysis (MNA) is used [47, 53, 66]. The resulting model consists of a system of differential algebraic equations (DAEs).

In the following section the MNA is introduced. We show how to derive the model equations that are used to describe electrical circuits. The next section is then devoted to studying basic properties of the resulting differential algebraic equations.

1.1 Basic Circuit Modelling

Before discussing the modified nodal analysis in the general setup, an introductory example shall be studied first.

Example 1.1. The Miller Integrator depicted in Figure 1.1 is an electrical realisation of the mathematical concept of integration. Given a time-varying input signal v at node u1, the circuit delivers the node potential

    u3 = −(G1/C1) ∫ v dt

as output. This relation will be confirmed in Section 3.2, Example 3.8.

The particular input-output relation of the Miller Integrator can be interpreted as an inductor in admittance form. Thus the Miller Integrator is heavily used in integrated filter circuits in order to substitute inductors, which are expensive to realise in integrated technologies. This example is taken from [66], where the Miller Integrator is discussed in detail.

In order to apply nodal analysis, we consider the network as a directed graph, pick the node potentials u1, u2, u3 as variables and write down the elements' characteristic equations in admittance form. In particular, we seek relations expressing each branch current i explicitly in terms of the branch voltage u,

iR = G1 (u1 − u2), iC = C1(u3 − u2)′. (1.1a)

Page 16: General Linear Methods for Integrated Circuit Design - edoc

16 1 Circuit Simulation and DAEs

The equations (1.1a) are Ohm's law for a linear resistor with conductance G1 and Faraday's law for a linear capacitor with capacitance C1, respectively. The prime in the expression for iC denotes the time derivative.

As in [66] an ideal operational amplifier with limited amplification a is used for simplicity, so that the operational amplifier can be modelled as a voltage-controlled voltage source,

u3 = a · u2. (1.1b)

Similarly, the node potential u1 is given explicitly by the independent source,

u1(t) = v(t). (1.1c)

The electrical interaction of the elements described by (1.1) is accomplished by Kirchhoff's laws, which take into account the circuit's topology:

(a) Kirchhoff's Voltage Law (KVL): At every instant of time the algebraic sum of voltages along each loop of the network is equal to zero.

(b) Kirchhoff's Current Law (KCL): At every instant of time the algebraic sum of currents traversing each cutset of the network is equal to zero.

A cutset is a minimal subgraph whose removal decomposes the network into two separate connected units. In particular, for every node the set of branches connected to this node forms a cutset. Hence, applying KCL to every node leads to

0 = iV1 + iR = iV1 +G1 (u1 − u2), (1.2a)

0 = −iR − iC = −G1 (u1 − u2)− C1(u3 − u2)′, (1.2b)

0 = iC + iV2 = C1(u3 − u2)′ + iV2 , (1.2c)

where the element relations (1.1a) have already been inserted.

In contrast to the classical nodal analysis, branch currents through voltage sources and inductors, if present, are considered as additional unknowns. Thus

Figure 1.1: The Miller Integrator circuit. The operational amplifier is modelled as a voltage-controlled voltage source (u = a·Uc, Uc = uc2 − uc1, ic1 = ic2 = 0) with amplification factor a.

Page 17: General Linear Methods for Integrated Circuit Design - edoc

1.1 Basic Circuit Modelling 17

the equations of the modified nodal analysis are obtained by combining (1.1), (1.2) and u1 = v(t). More structure is revealed, however, by using incidence matrices.

Notice that the topology of the circuit can be conveniently described using the element-related matrices

          [ 0 ]          [ 1 ]          [ 1  0 ]
    AC =  [-1 ],   AR =  [-1 ],   AV =  [ 0  0 ]
          [ 1 ]          [ 0 ]          [ 0  1 ].

Each column represents a branch and the entries ±1 indicate which nodes this branch is connected to and in which direction the current is flowing (the sign being a matter of convention). The MNA equations (1.1), (1.2) for the Miller Integrator can thus be written as

    AC C1 AC⊤ u′ + AR G1 AR⊤ u + AV iV = 0,    AV⊤ u = [ v(t), a·u2 ]⊤.    (1.3)
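The incidence-matrix assembly can be reproduced numerically. The following sketch (with illustrative element values G1 = C1 = 1, not taken from the text) builds the coefficient matrices of (1.3) and confirms that the leading matrix AC C1 AC⊤ is singular, which is why (1.3) is a DAE rather than an ODE:

```python
import numpy as np

# Element-related incidence matrices of the Miller Integrator
# (rows = nodes u1, u2, u3; one column per branch).
AC = np.array([[0.0], [-1.0], [1.0]])     # capacitor between u2 and u3
AR = np.array([[1.0], [-1.0], [0.0]])     # resistor between u1 and u2
AV = np.array([[1.0, 0.0],
               [0.0, 0.0],
               [0.0, 1.0]])               # the two voltage-source branches

G1, C1 = 1.0, 1.0                         # illustrative element values

# Coefficient matrices of (1.3): AC C1 AC^T u' + AR G1 AR^T u + AV iV = 0.
M = C1 * (AC @ AC.T)                      # multiplies u'
K = G1 * (AR @ AR.T)                      # multiplies u

print(M)                                  # only couples u2 and u3
print(np.linalg.matrix_rank(M))           # rank 1 < 3, hence singular
```

Because M is singular, no row operations can bring the system into explicit ODE form; its third row is the negative of the second, reflecting that the capacitor current enters the two adjacent node equations with opposite signs.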

However, in the case of nonlinear capacitors and/or resistors (1.3) has to be replaced by

    AC q̇(AC⊤ u(t), t) + AR g(AR⊤ u(t), t) + AV iV(t) = 0,
    AV⊤ u(t) − [ v(t), a·u2(t) ]⊤ = 0.    (1.3')

As before u and iV are the vectors of node potentials and branch currents through voltage sources, respectively. Additionally we need to take the vector q(AC⊤ u, ·) of charges into account. More details on the treatment of nonlinear capacitors and inductors can be found in [66, 78, 80].

With the Miller Integrator in mind we now turn to the investigation of the charge oriented modified nodal analysis for arbitrary circuits.

The first step in deriving the MNA equations is to interpret the circuit as a directed graph. The vertices are referred to as nodes and each branch represents a basic element such as a resistor, capacitor, inductor, voltage source or current source. The characteristic device equations are given in Table 1.1. Basic

    device          linear            nonlinear
    resistor        iR = G uR         iR = g(uR, t)
    capacitor       iC = C uC′        iC = [qC(uC, t)]′
    inductor        uL = L iL′        uL = [φL(iL, t)]′

    device          independent       controlled
    voltage source  uV = v(t)         uV = v(uctrl, ictrl, t)
    current source  iI = i(t)         iI = i(uctrl, ictrl, t)

Table 1.1: Characteristic equations for basic elements

Page 18: General Linear Methods for Integrated Circuit Design - edoc

18 1 Circuit Simulation and DAEs

elements are idealised descriptions of the corresponding physical devices. Real devices, and in particular more complex elements such as diodes, transistors etc., are replaced by so-called companion models. A companion model is an equivalent circuit consisting of basic elements that approximates the behaviour of a real, physical device.

The topology of the directed graph is then described using element-related incidence matrices AR, AC, AL, AV, AI, just as we did in the case of the Miller Integrator. The three essential steps in setting up the equations of the charge oriented modified nodal analysis are to

(a) apply KCL to every node except ground,

(b) insert the admittance form representation for the branch currents of resistors, capacitors and current sources,

(c) add the impedance form representation for inductors and voltage sources explicitly to the system.

Hence we arrive at the following system of equations

AC q(A>Cu, t) + AR g(A

>R u, t) + AL iL+AV iV + AI iI(A

>u, q, iL, iV , t) = 0, (1.4a)

φ(iL, t)− A>L u = 0, (1.4b)

vV (A>u, q, iL, iV , t)− A>V u = 0. (1.4c)

The vector q(AC⊤ u, t) represents charges while φ(iL, t) is the vector of fluxes. Thus, (1.4) is referred to as the charge/flux oriented formulation of the network equations.

A dot indicates time derivatives, such that q̇(AC⊤ u, t) = d/dt [q(AC⊤ u(t), t)] is the vector of branch currents through the circuit's capacitors. Similarly, the vector φ̇(iL, t) = d/dt [φ(iL(t), t)] contains the branch voltages across the inductors.

For many analog circuits such as switched-capacitor filters or charge pumps, charge conservation is a crucial property. In fact, the original intent of the charge oriented formulation (1.4) was to ensure charge conservation. By construction (1.4) guarantees that for each charge storing element one terminal charge is given as the negative sum of all other terminal charges. Expanding the charge vector around the exact solution u∗,

    q(AC⊤ u(t), t) = q(AC⊤ u∗(t), t) + ∂q(AC⊤ u∗(t), t)/∂u · Δu + O(Δu²),

shows that with increasing numerical accuracy, Δu → 0, the exact charges are approximated.

It is shown in [66] that a similar situation cannot be guaranteed for the conventional formulation, where the generalised conductance and inductance matrices ∂q(AC⊤ u, t)/∂u and ∂φ(iL, t)/∂iL are used. See also [78, 80] for more details.

Page 19: General Linear Methods for Integrated Circuit Design - edoc

1.2 Differential Algebraic Equations 19

The charge/flux oriented formulation (1.4) exhibits the form

    A [d(x, t)]′ + b(x, t) = 0,    (1.5)

with

         [ u  ]         [ AC 0 ]
    x =  [ iL ],   A =  [ 0  I ],   d(x, t) = [ q(AC⊤ u, t) ]
         [ iV ]         [ 0  0 ]              [ φ(iL, t)    ]

and the obvious definition of b. Typically the (constant) matrix A is rectangular, having nC + nL + nV rows but only nC + nL columns. The prime [d(x, t)]′ = d/dt [d(x(t), t)] denotes differentiation with respect to time.

The time derivative x′ of the dependent variable is not given explicitly as a function of x and t. Since the matrix A ∂d(x, t)/∂x may be singular, (1.5) is not an ordinary differential equation but of differential algebraic type.
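The singularity of the leading matrix is in fact forced by the dimensions alone: A has nC + nL + nV rows but only nC + nL columns, so the square product A·∂d/∂x can never have full rank. A small sketch with random data (the dimensions are chosen for illustration only):

```python
import numpy as np

# With n = nC + nL + nV equations but only m = nC + nL columns in A,
# the square leading matrix A * d_x of (1.5) has rank <= m < n.
rng = np.random.default_rng(0)
n, m = 5, 3                         # illustrative dimensions
A = rng.standard_normal((n, m))
d_x = rng.standard_normal((m, n))   # Jacobian of d with respect to x
lead = A @ d_x                      # n x n leading matrix

print(lead.shape)                   # (5, 5)
print(np.linalg.matrix_rank(lead))  # at most 3, so the matrix is singular
```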

This thesis is devoted to studying numerical methods for differential algebraic equations (DAEs) of the form (1.5). We will start by investigating some basic properties of these equations.

1.2 Differential Algebraic Equations

In the previous section we saw that the charge oriented modified nodal analysis leads to differential algebraic equations of the form

    A [d(x, t)]′ + b(x, t) = 0.    (1.5)

These equations differ from ordinary differential equations in many ways. Some of the more prominent differences will be introduced by considering two simple examples.

Example 1.2. The circuit depicted in Figure 1.2 (a) represents a simple VRC loop. Using the modified nodal analysis the corresponding circuit DAE reads

    [ 0 ]              [  G1  −G1  1 ] [ u1 ]   [ 0    ]
    [ 1 ] (C1·u2)′  +  [ −G1   G1  0 ] [ u2 ] = [ 0    ],    (1.6)
    [ 0 ]              [  1    0   0 ] [ iV ]   [ v(t) ]

where u1, u2 are again the node potentials and iV represents the current through the voltage source.

This DAE can be solved analytically by noting that the third equation yields u1(t) = v(t), such that the second equation turns out to be the ordinary differential equation

    u2′(t) = C1⁻¹ G1 (v(t) − u2(t)).

Page 20: General Linear Methods for Integrated Circuit Design - edoc

20 1 Circuit Simulation and DAEs

Once a solution for this ODE is found, the current iV through the voltage source is found from the first equation,

    iV(t) = G1 (u2(t) − u1(t)).

Example 1.2 clearly shows that not all components of a DAE solution need to be differentiable: for u1 and iV it suffices to require continuity. The special formulation of the leading term in (1.6) indicates precisely which derivatives are involved.

Similarly, not all components can be assigned initial values. Here we may choose an arbitrary initial condition for u2, but as u1 and iV are fixed in terms of u2 and the problem data, initial conditions for these components need to be consistent, i.e. u1(t0) = v(t0) and iV(t0) = G1 (u2(t0) − u1(t0)).

Example 1.3. In Figure 1.2 (b) the position of the resistor is changed and the MNA equations read

    [ 1 ] (C1·u1)′  +  [ G1  1 ] [ u1 ]  =  [ 0    ]
    [ 0 ]              [ 1   0 ] [ iV ]     [ v(t) ].

In contrast to the previous example all components are fixed by the problem data and there is no freedom to pose any initial condition:

    u1(t) = v(t),    iV(t) = −C1 u1′(t) − G1 u1(t).

Even though the Examples 1.2 and 1.3 seem quite similar, Example 1.3 nevertheless introduces a new level of complexity. The crucial point is that in order to compute iV one needs to differentiate u1, so that the input function v needs to be smooth.

Since differentiation is an unstable operation, this situation may lead to severe problems. As in [155], consider e.g. an input signal v(t) = 5 − δ(t) with a small perturbation δ(t) = ε sin(ωt). Independently of ω the perturbation is bounded in magnitude by |ε|. However, since the current through the voltage source is given by

    iV(t) = C1 ε ω cos(ωt) − 5 G1 + G1 ε sin(ωt),

Figure 1.2 (a): VRC loop (source v(t), conductance G1 and capacitance C1 in series; node potentials u1, u2; current iV).
Figure 1.2 (b): VRC in parallel (v(t), G1 and C1 in parallel; node potential u1; current iV).

Figure 1.2: Two circuits consisting of a voltage source, resistor and capacitor.

Page 21: General Linear Methods for Integrated Circuit Design - edoc

1.2 Differential Algebraic Equations 21

the error resulting from the perturbation δ is found to be C1 ε ω cos(ωt) + G1 ε sin(ωt). For large values of ω this quantity can become arbitrarily large, as is illustrated in Figure 1.3.
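The amplification factor ε ω C1 can be checked with a few lines. The sketch below uses the parameter values of Figure 1.3 (C1 = 1 µF, G1 = 1 S, ε = 10⁻²):

```python
import math

# Amplitude of the output error e(t) = C1*eps*w*cos(w*t) + G1*eps*sin(w*t)
# caused by the input perturbation delta(t) = eps*sin(w*t) in Example 1.3.
C1, G1, eps = 1e-6, 1.0, 1e-2

def max_error(w):
    # the amplitude of a*cos(wt) + b*sin(wt) is sqrt(a^2 + b^2)
    return math.hypot(C1 * eps * w, G1 * eps)

for w in (1e2, 1e5, 1e8):
    print(w, max_error(w))
# The input perturbation is always bounded by eps = 0.01, yet for
# w = 1e8 the output error grows to about 1.0 -- a hundredfold blow-up.
```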

The solution of differential algebraic equations may require differentiation of certain parts of the input functions. Hence it is not enough, as for ODEs, to require continuity of the right-hand side; some components may require additional smoothness.

The question whether differentiation is involved in solving a given DAE and, if so, up to what degree, distinguishes different levels of complexity both in the analytical and the numerical treatment of DAEs. As a means to measure these difficulties, each DAE is assigned an integer number – its index.

Many different index concepts are available in the literature. Some of them will be briefly reviewed in Section 3.1. Roughly speaking, for an index-µ equation inherent differentiations up to order µ − 1 are involved. Consequently, the circuit in Figure 1.2 (a) leads to an index-1 DAE while Figure 1.2 (b) comprises an index-2 equation. Ordinary differential equations are assigned index 0.

For a large number of circuits the index can be read directly from the circuit's topology. This is possible given that

(a) the capacitance, inductance and conductance matrices

    C(w, t) = ∂qC(w, t)/∂w,    L(w, t) = ∂φL(w, t)/∂w,    G(w, t) = ∂g(w, t)/∂w

are positive definite¹ and

(b) the controlled sources satisfy certain topological conditions (specified in [63]).

Figure 1.3: Analytical solution of the circuit from Figure 1.2 (b) with C1 = 1 µF, R1 = 1/G1 = 1 Ω for a perturbed input signal (ε = 10^(−2), ω = 10^8). (a) Input signal v(t) with perturbation δ(t) = ε sin(ωt); (b) the current iV (t) with error of size ε ω C1.

¹A matrix M is said to be positive definite if x⊤M x > 0 for every x ≠ 0. This definition does not require symmetry.


The condition (a) corresponds to nonlinear capacitors, inductors and resistors that are strictly locally passive. On the other hand, (b) requires a careful analysis of controlled voltage and current sources since special circuit configurations may impede a topological index test.

Topological index tests were first investigated in [136]. The precise set of allowed controlled sources is given in [61, 63, 154]. There the proof of the next theorem can be found as well.

Theorem 1.4. Consider a lumped electrical circuit consisting of capacitors, inductors, resistors, voltage sources and current sources. Let the conditions (a) and (b) (in the sense of [63]) be satisfied. If the circuit contains neither a loop of voltage sources nor a cutset of current sources, then the charge oriented modified nodal analysis leads to a DAE with index µ ≤ 2.

The MNA equations have

• index 0, if the circuit contains no voltage source and the network graph is spanned by a tree consisting exclusively of capacitive branches,

• index 1, if the circuit contains a voltage source or there is no tree of capacitive branches, and, in addition, the following two topological conditions are satisfied

T1: there is no CV loop containing at least one voltage source,

T2: there is no LI cutset,

• index 2, otherwise.

The numerically unstable index-2 components are given by the branch currents through voltage sources of CV loops but also by the branch voltages of the inductors and current sources of LI cutsets [62]. In particular these branch voltages do not necessarily coincide with the Cartesian components of the vector x but are linear combinations of these.

In view of Theorem 1.4 the loop consisting of capacitor and voltage source in Figure 1.2 (b) is the reason for the index being 2. In Figure 1.2 (a) this loop is broken by the resistor such that the index is only 1.

The inclusion of controlled sources in Theorem 1.4 is crucial as controlled sources are implicitly included in every companion model for semiconductor devices. Fortunately the conditions stated in [63] guarantee that these sources can be treated within the framework of Theorem 1.4. Hence topological index tests proved to be a very powerful tool for modern circuit diagnosis.

Nevertheless, both assumptions (a) and (b) are vital for Theorem 1.4 to be applicable. The positive definiteness may be violated by independent charge and flux sources used e.g. to model α radiation or external magnetic fields [66]. On the other hand, controlled sources such as the operational amplifier in the context of the Miller Integrator from Example 1.1 are not covered by (b). In fact, the index of the Miller Integrator depends not only on topological properties but also on device parameters. A summary is given in Figure 1.4.


In particular the inclusion of a further capacitor C2 between node 2 and ground gives rise to higher index cases as this capacitor closes a CV loop together with C1 and the controlled voltage source. See [66] for more details.

Although the index may depend on device parameters, most circuits are covered by Theorem 1.4 and topological methods are a well established and powerful tool for dealing with the differential algebraic equations arising in modern circuit simulation.

DAEs are the subject of current research [100, 117, 118, 138]. They do present challenges both for mathematical analysis and scientific computing. We will return to investigating linear and nonlinear DAEs in much more detail in Chapters 3 and 4, respectively.

Technical parameters          Index
C2 = 0,  a ≠ 1                  1
C2 = 0,  a = 1                  2
C2 ≠ 0,  a ≠ 1 + C1/C2          2
C2 ≠ 0,  a = 1 + C1/C2          3

Figure 1.4: Impact of topology and technical parameters on the index of the Miller Integrator circuit.


2

Numerical integration schemes

In the previous chapter we derived a mathematical model for electrical circuits. The charge oriented modified nodal analysis leads to differential algebraic equations

A [d(x(t), t)]′ + b(x(t), t) = 0.    (2.1)

Each DAE is assigned an index in order to measure its complexity concerning analytical and numerical treatment. Already for index-1 equations initial values cannot be chosen arbitrarily but have to be consistent with the problem data¹. DAEs with index ≥ 2 involve inherent differentiations such that these equations are ill posed and certain parts of the input functions need to be differentiated.

Nevertheless, as Theorem 1.4 shows, index-1 and index-2 equations appear most frequently for electrical circuits. In fact, the index µ typically satisfies 1 ≤ µ ≤ 2. Even though certain controlled sources may lead to an index µ > 2, these configurations can be detected using topological index tests, such that a regularisation² becomes possible. Thus, in electrical circuit simulation one is confronted with solving DAEs (2.1) having index µ = 1 or µ = 2.

It was mentioned earlier that modern semiconductor devices are treated using companion models. Regarding the switching tasks of these devices it is clear that the model equations will suffer from poor smoothness properties. Consequently, the coefficients of (2.1) are continuous but usually non-smooth.

Modern circuits comprise timescales of several orders of magnitude, such that the resulting circuit DAEs are usually stiff. In addition to this, the large number of devices leads to MNA models consisting of up to 10^5 or 10^6 equations, and computational time is a major issue.

In summary, DAEs (2.1) in integrated circuit design present challenging problems for numerical computation. Low order methods need to be used due to

¹An initial value x0 is consistent if there is a solution passing through x0.
²Additional capacitors to ground may often regularise a given circuit.


the model's poor smoothness properties. Since the equations are mainly stiff, A-stable methods are preferable.

Gear [71] proposed BDF³ methods already in 1971. Their robustness and reliability, in particular in combination with the trapezoidal rule, have made these methods the standard in circuit simulation.

Unfortunately, linear multistep methods such as the BDF or the trapezoidal rule often suffer from undesired stability properties. For instance there is no A-stable method of order higher than 2. One-step methods such as Runge-Kutta methods with improved stability properties may prove an alternative. However, the computational costs for these (fully implicit) methods are prohibitive in circuit simulation. Diagonally implicit methods, on the other hand, will suffer from order reduction.

In the next two sections we will briefly review both linear multistep methods and Runge-Kutta methods. Each class of integration schemes has its own advantages but also suffers from certain shortcomings, and potential problems will be indicated. In Section 2.3 general linear methods will then be introduced as a means to overcome these difficulties.

2.1 Linear Multistep Methods

Given the DAE (2.1) together with a consistent initial condition x(t0) = x0, we want to determine the solution x on some time interval [t0, T] ⊂ R, i.e. we seek approximations xk to the exact solution on a (usually non-uniform) grid of discrete time points { tk | k = 0, 1, . . . , N }.

In order to proceed from tn to tn+1 using a stepsize h = tn+1 − tn, linear multistep methods make use of existing approximations xn−i ≈ x(tn−i) and d′n−i ≈ [d(x(tn−i), tn−i)]′ at previous time points, i = 0, . . . , k − 1. These quantities are used in order to compute xn+1 and d′n+1 such that A d′n+1 + b(xn+1, tn+1) = 0 and

∑i=0..k αi d(xn+1−i, tn+1−i) = h ∑i=0..k βi d′n+1−i.

For an ordinary differential equation y′ = f(y, t) this scheme simplifies to

∑i=0..k αi yn+1−i = h ∑i=0..k βi f(yn+1−i, tn+1−i).    (2.2)

Since (2.2) has to be solved for yn+1, we need to require α0 ≠ 0.

Linear multistep methods were used by Adams and Bashforth as early as 1883 [6]. The so-called Adams-Bashforth methods are characterised by the

3Backward Difference Formulae


parameters α0 = −α1 = 1 and α2 = · · · = αk = β0 = 0. For β0 ≠ 0 one obtains the Adams-Moulton methods [126].

Different choices of the coefficients α, β lead to a large variety of methods. Among the most prominent are the BDF methods first proposed by Curtiss and Hirschfelder [48]. In order to derive these methods from (2.2), set β1 = · · · = βk = 0 and choose the coefficients αi such that

α(z) = ∑i=0..k αi zⁱ = β0 ∑i=1..k (1/i) (1 − z)ⁱ.

The remaining parameter β0 is fixed by the condition α(0) = α0 = 1. The special choice of α(z) ensures that the polynomial p which interpolates the values { (tn+1−i, yn+1−i) | i = 0, . . . , k } satisfies p′(tn+1) = f(yn+1, tn+1).

The coefficients of the methods BDFk for k = 1, . . . , 6 are given in Table 2.1. For k ≥ 7 the BDF formulae are not stable and only the methods with 1 ≤ k ≤ 6 are useful for numerical computation [25].
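The generating function construction above is easy to check numerically. The following sketch (our own helper, using Python's exact rational arithmetic) expands β0 ∑i=1..k (1/i)(1 − z)ⁱ and reproduces the rows of Table 2.1:

```python
from fractions import Fraction
from math import comb

def bdf_coefficients(k):
    # beta0 = 1 / (1 + 1/2 + ... + 1/k), fixed by the condition alpha(0) = alpha_0 = 1
    beta0 = 1 / sum(Fraction(1, i) for i in range(1, k + 1))
    # expand alpha(z) = beta0 * sum_{i=1..k} (1/i) * (1 - z)^i  via the binomial theorem
    alpha = [Fraction(0)] * (k + 1)
    for i in range(1, k + 1):
        for j in range(i + 1):
            alpha[j] += beta0 * Fraction(1, i) * comb(i, j) * (-1) ** j
    return beta0, alpha

beta0, alpha = bdf_coefficients(2)
print(beta0, alpha)   # 2/3 and [1, -4/3, 1/3], the BDF2 row of Table 2.1
```

Running the same function for k = 3, . . . , 6 recovers the remaining rows of the table exactly.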

Since the work of Gear [70, 71], BDF methods have been most successfully used for solving stiff equations. The application to DAEs has been investigated in [10, 11] and the code Dassl, written by Petzold [128], is still one of the most competitive solvers for low-index DAEs.

One of the main reasons for the BDFs being so popular also in circuit simulation is the fact that only one function evaluation is required per step. Hence BDF methods can be implemented with relatively low costs even for very large systems. However, the stability properties are not always desirable.

Example 2.1. Consider the circuit in Figure 2.1 (a), where a voltage source, capacitor, inductor and resistor are connected in series. As this circuit contains two energy storing elements, the circuit's complete response can be calculated by solving the second order differential equation [56]

0 = U′′ + (1/(LG)) U′ + (1/(LC)) U − (1/(LC)) v(t)    (2.3)

k   β0        α1         α2         α3         α4        α5         α6
1   1         −1
2   2/3       −4/3       1/3
3   6/11      −18/11     9/11       −2/11
4   12/25     −48/25     36/25      −16/25     3/25
5   60/137    −300/137   300/137    −200/137   75/137    −12/137
6   20/49     −120/49    150/49     −400/147   75/49     −24/49     10/147

Table 2.1: Coefficients of the BDF methods (α0 = 1)


for the capacitor voltage U = u1 − u2. For simplicity assume that C = 1 F, L = 1 H and that v ≡ 1 V is a DC source delivering a constant voltage drop between node 1 and ground. The initial conditions

U(0) = 2 V and iL(0) = −C/(2LG)

fix the capacitor voltage to be

U(t) = e^(−α t) cos(ω t) + 1, with α = 1/(2LG), ω² = 1/(LC) − α²,    (2.4)

provided that G > 1/(2 Ω). Since u1 ≡ 1 V, the potential at the second node is given by u2(t) = −e^(−α t) cos(ω t). For large values of G this is practically a pure cosine function, but for small values of G, say R = 1/G = 100 mΩ, the node potential u2 decays quickly (see Figure 2.1 (b) and Figure 2.1 (c)).

Stability problems of the BDF formulae become apparent in Figure 2.1 (d). There the BDF2 was used to solve the undamped circuit using a constant stepsize h = π/4, which corresponds to a sampling rate of about 8 timesteps per period. Although the undamped circuit was solved, the solution for u2 shows a striking similarity to the exact solution of the damped circuit.

Figure 2.1: An RLC series circuit (C = 1 F, L = 1 H, v ≡ 1 V). (a) Circuit diagram; (b) potential u2 for R = 1/G = 1 mΩ (exact values); (c) potential u2 for R = 1/G = 100 mΩ (exact values); (d) potential u2 for R = 1/G = 1 mΩ (BDF solution). For small values of R = 1/G the node potential u2 is nearly undamped but it decays quickly for a large resistance. The BDF2 method, when applied to the undamped circuit, produces results as if the circuit was damped.


As indicated in the previous example, the BDF methods, in particular for order 1 and 2, show strong numerical damping. This is a most disadvantageous situation for the simulation of electrical circuits⁴ as it is not clear from the simulation results whether the damping is inherent in the circuit or just a numerical artefact. Even though a stepsize control algorithm may avoid these kinds of problems, the steps taken will be excessively small since the stepsize sequence will be governed by stability rather than accuracy.

There are methods that do preserve both amplitude and phase of any given oscillation. One example is the trapezoidal rule, which is obtained from (2.2) by setting k = 1, α0 = −α1 = 1, β0 = β1 = 1/2, such that

yn+1 = yn + (h/2) ( f(yn, tn) + f(yn+1, tn+1) ).

Example 2.2. As an example the VRC circuit from Figure 1.2 (b) on page 20 will be solved using the trapezoidal rule. For simplicity choose R1 = 1/G1 = 1 Ω, C1 = 1 F and the input signal

v(t) = 1 for t ≤ π/2,   v(t) = cos(t − π/2) for t > π/2.

This function models an independent voltage source that delivers a DC signal of 1 V until the breakpoint t∗ = π/2 is reached. There the input switches to a sinusoidal signal.

The trapezoidal rule is applied with constant stepsize h = 1/4 on the interval [0, 3π] using consistent initial values. Figure 2.2 shows that the numerical result for the node potential u(t) = v(t) is in good agreement with the exact value. Nevertheless, at the breakpoint errors are introduced. These perturbations lead to a highly oscillatory behaviour of the numerical approximation to iV (t) = −C1 u′(t) − G1 u(t).

Figure 2.2: The trapezoidal rule applied to the VRC circuit from Figure 1.2 (b). For the node potential u the numerical result agrees with the exact solution but iV oscillates around the exact solution (plotted as a dashed line in both pictures).

⁴For other applications, such as the simulation of multibody systems with friction, the BDF method's numerical damping may well be a desired feature.


The previous two examples indicate that, depending on the type of oscillation, different behaviour of an integrator is required:

(a) Oscillations of physical significance, as in Example 2.1, should be preserved.

(b) High-frequency oscillations caused by numerical noise, perturbations, inconsistent initial values or errors from the Newton iteration should be damped out quickly.

In order to analyse the damping properties of a numerical method, (2.3) can be studied in more detail. Rewriting (2.3) as a first order ordinary differential equation,

[U, iL]′ = [ 0, 1/C ; −1/L, −1/(LG) ] [U, iL]⊤ + [ 0, v(t)/L ]⊤,

the eigenvalues of the coefficient matrix are found to be λ1/2 = −α ± ω i, where α = 1/(2LG), ω² = 1/(LC) − α² and i is the imaginary unit. ω is a real number for G² > C/(4L) and the corresponding homogeneous system has oscillatory solutions with frequency ω. These oscillations are stable for α ≥ 0, i.e. when λi has negative real part. See (2.4) for an example. If |λi| is large compared to the time scale of the problem, then the differential equation is called stiff.

In order to assess a method's linear stability properties we need to study the linear scalar test equation

y′(t) = λ y(t),    (2.5)

where λ = −α + ω i is a complex number having negative real part. Thus the exact solution is of the form e^(−α t) ( A1 cos(ω t) + A2 sin(ω t) ), such that it decays for t ≥ 0.

The trapezoidal rule, when applied to (2.5) with constant stepsize, yields the recursion

yn+1 = yn + (λh/2) (yn + yn+1)   ⇔   yn+1 = ((2 + λh)/(2 − λh)) yn.

The mapping R : C → C, z ↦ (2 + z)/(2 − z) is known as the method's stability function. Stable solutions are obtained provided that the product z = λh of eigenvalue and stepsize satisfies |R(λh)| ≤ 1. Thus the set { z ∈ C | |R(z)| ≤ 1 } is called the stability region. Observe that |R(ω i)| = |(2 + ω i)/(2 − ω i)| = 1 along the imaginary axis. Thus the stability region for the trapezoidal rule coincides exactly with the left half of the complex plane.

On the other hand, the BDF1 method, often referred to as the implicit Euler scheme, reflects (2.5) by means of the recursion

yn+1 = yn + λh yn+1   ⇔   yn+1 = (1/(1 − λh)) yn,


such that the stability region is given by the set { z ∈ C | |1 − z| ≥ 1 } (Figure 2.3 (a)). For higher order BDF methods, write Yn = [yn, yn−1, . . . , yn+1−k]⊤ such that the k-step BDF applied to (2.5) reads

Yn+1 = Mk(z) Yn,

where Mk(z) is the companion matrix with first row [ −α1/(1 − zβ0), −α2/(1 − zβ0), . . . , −αk/(1 − zβ0) ], ones on the subdiagonal and zeros elsewhere; λh was again replaced by the complex variable z. In contrast to one-step methods (e.g. the trapezoidal rule or the implicit Euler scheme) and their stability function R(z), linear stability is governed by a stability matrix M(z). The stability region is characterised by those values z ∈ C where M(z) is power bounded. For the BDFk formulae with 1 ≤ k ≤ 5 the stability regions are plotted in Figure 2.3 (a).

For k ≤ 2 the left half of the complex plane is included in the stability region. Thus, whenever the exact solution is stable, the numerical approximation will be stable as well. This statement is not true for k ≥ 3 as the boundary of the stability region crosses the imaginary axis.

Methods showing this behaviour are in general not suited for solving stiff equations. In the stiff case, where eigenvalues λ = −α + ω i with large α are present, it may be necessary to reduce the stepsize drastically until z = λh again belongs to the stability region. Hence, for solving stiff equations a property called A-stability is desirable.

Figure 2.3: Stability properties of the BDFk methods and the trapezoidal rule. (a) Stability regions of the BDFk methods: for each 1 ≤ k ≤ 5 the stable area is given by the outside of the kidney shaped region. (b) Stability behaviour along the imaginary axis: the modulus ρk(y i) of the largest eigenvalue of Mk(y i) is plotted. For a sampling rate of about 8 − 20 steps per period, ρ(y i) is required to stay close to 1 for y ∈ [0.1π, 0.25π].


Definition 2.3. A numerical method is called A-stable if the left half of the complex plane is included in the method's stability region.

This definition is due to Dahlquist [49]. Dahlquist was also able to show that there is no A-stable linear multistep method with order p > 2. This fact is known as the second Dahlquist barrier, which is confirmed in Figure 2.3 (a).

Even though the BDFk methods are not A-stable for k ≥ 3, the sector

{ r e^((π−θ)i) ∈ C | r ≥ 0, θ ∈ [−α, α] }

is contained in the stability region for some suitable value of α (see Table 2.2).

Definition 2.4. A numerical method is called A(α)-stable if the stability region includes the sector { r e^((π−θ)i) ∈ C | r ∈ R, r ≥ 0, θ ∈ [−α, α] }.

Apart from looking at A- and A(α)-stability it is interesting to study the stability matrix along the imaginary axis, since purely imaginary eigenvalues λ = ω i give rise to undamped oscillations. Let ρ(z) denote the modulus of the largest eigenvalue of M(z). Assuming a sampling rate of about 8 − 20 steps for oscillations of physical significance, we require that ρ(ω i) stays close to 1 for ω ∈ [0.1π, 0.25π]. This range is marked in Figure 2.3 (b) using vertical lines.

It becomes clear from Figure 2.3 (b) that both the implicit Euler scheme and the BDF2 method will damp out oscillations of physical significance as ρ(ω i) < 1 holds in the desired range. The BDF methods with k ≥ 3 will amplify these oscillations due to ρ(ω i) > 1. The behaviour in both cases is far from optimal.

For the trapezoidal rule, on the other hand, we already found |R(ω i)| = ρ(ω i) = 1, such that the amplitude of any oscillation will be preserved. This is true not only for oscillations of physical significance but also for high-frequency numerical noise, a fact that explains the behaviour from Example 2.2.

More details on this type of analysis can be found in [66, 81].

Considering the damping behaviour as discussed above, a commonly used strategy in order to simulate electrical circuits both efficiently and reliably is to start the integration with the BDF method but to continue with the trapezoidal rule after a few successful steps. If convergence problems arise or a breakpoint (switching point) is reached, the BDF method is used for a few steps followed by the trapezoidal rule, and so on. Obviously, the frequent change of the method causes problems for an efficient error estimation and stepsize selection. This cyclic approach cannot be generalised to order 3 computations, where the BDF3 would have to be used exclusively. This method is not A-stable and, due to Dahlquist's second barrier, there is no A-stable linear multistep method of this order at all.

     BDF1   BDF2   BDF3    BDF4    BDF5    BDF6
α    90°    90°    86.03°  73.35°  51.84°  17.84°

Table 2.2: A(α)-stability of the BDF methods (taken from [85])


Apart from stability constraints, further complications are associated with changing the stepsize and/or the order. Both aspects are essential for a competitive code, but require considerable overhead for linear multistep methods.

It is far simpler to change the stepsize and order for a Runge-Kutta method. More importantly, there are Runge-Kutta methods with much improved stability properties. Hence this class of methods shall be investigated next.

2.2 Runge-Kutta Methods

In contrast to linear multistep methods, Runge-Kutta methods do not use past information but confine themselves to using only one initial approximation yn ≈ y(tn) at the beginning of a step. In order to obtain a highly accurate approximation yn+1 ≈ y(tn + h) at the end of this step, intermediate stages Yi ≈ y(tn + ci h) are calculated. These ideas were first proposed by Runge [142] and Heun [87], but it was Kutta [103] who completely characterised the set of order 4 methods.

Since the fundamental work of Butcher [15, 18, 16], Runge-Kutta methods have received much attention in the literature. Particular methods are constructed in [1, 17, 57] while implementation issues are dealt with in [14, 86] and many other references. The monograph [22] provides in-depth information.

The theory of Runge-Kutta methods for the solution of differential algebraic equations is developed in [84, 85, 104, 140]. Unfortunately, these results do not cover all aspects of scientific computing in electrical circuit design. [84, 85] focus on systems in Hessenberg form such that the charge oriented formulation is not covered. On the other hand, [104] restricts attention to index-1 equations.

Given an ordinary differential equation y′ = f(y, t) and an initial approximation yn, a Runge-Kutta method is of the form

Yi = yn + h ∑j=1..s aij f(Yj, tn + cj h),    yn+1 = yn + h ∑i=1..s bi f(Yi, tn + ci h),    (2.6)

where s intermediate stages Yi are calculated, i = 1, . . . , s. The position of these internal stages is indicated by the vector c = [c1 · · · cs]⊤, i.e. we intend to have Yi ≈ y(tn + ci h). The method's parameters can conveniently be represented in a Butcher tableau

c1 | a11 · · · a1s
 : |  :        :
cs | as1 · · · ass
---+------------
   | b1  · · · bs

or, in short,

c | A
--+---
  | b⊤.

Some well known examples are given in Table 2.3. Two of these methods are stiffly accurate, i.e. the vector b⊤ coincides with the last row of A. In case of


stiffly accurate methods the numerical result yn+1 = Ys is given by the last stage. This property is of particular importance when studying stiff equations and DAEs [84, 86, 90].

The coefficients of a Runge-Kutta method need to satisfy certain order conditions that guarantee that the error committed in one step can be estimated in terms of O(h^(p+1)) for an order p method. In Chapter 8 order conditions will be discussed in considerable detail within the framework of general linear methods for DAEs. Here, it suffices to note that Runge-Kutta methods are often constructed using the simplifying assumptions

B(η):  ∑i=1..s bi ci^(k−1) = 1/k,   k = 1, . . . , η,

C(ξ):  ∑j=1..s aij cj^(k−1) = ci^k / k,   k = 1, . . . , ξ,  i = 1, . . . , s,

D(ζ):  ∑i=1..s bi ci^(k−1) aij = (bj/k) (1 − cj^k),   k = 1, . . . , ζ,  j = 1, . . . , s.

For a method of order p it is necessary that B(p) holds. If the condition C(q) is satisfied, then the method has stage order q, i.e. the intermediate stages are calculated with accuracy Yi = y(tn + ci h) + O(h^(q+1)). More details are given in [22, 25]. There the following fundamental result is proved.

Theorem 2.5. If the coefficients of a Runge-Kutta method satisfy B(p), C(ξ), D(ζ) with ξ + ζ + 1 ≥ p and 2ξ + 2 ≥ p, then the method is of order p, i.e. yn+1 = y(tn+1) + O(h^(p+1)) if yn+1 is the numerical result calculated from the exact initial value y(tn) using stepsize h.

The simplifying assumptions can also be used for the construction of Runge-Kutta methods. For example the Gauss methods are obtained by choosing

Trapezoidal rule (order 2):

0 | 0    0
1 | 1/2  1/2
--+---------
  | 1/2  1/2

SDIRK method⁵ (order 2):

1−√2/2 | 1−√2/2  0
1      | √2/2    1−√2/2
-------+---------------
       | √2/2    1−√2/2

'Classical' RK method (order 4):

0   | 0    0    0    0
1/2 | 1/2  0    0    0
1/2 | 0    1/2  0    0
1   | 0    0    1    0
----+-----------------
    | 1/6  1/3  1/3  1/6

Gauss method with 2 stages (order 4):

1/2−√3/6 | 1/4       1/4−√3/6
1/2+√3/6 | 1/4+√3/6  1/4
---------+-------------------
         | 1/2       1/2

Table 2.3: Some examples of Runge-Kutta methods

⁵SDIRK is an abbreviation for "Singly Diagonally Implicit Runge-Kutta method". More details will be given later in this section.


c1, . . . , cs to be the zeros of the shifted Legendre polynomial of degree s,

Ps(t) = (1/s!) (d^s/dt^s) ( t^s (t − 1)^s ),

and then satisfying B(s) and C(s). These methods have stage order s and order 2s, which is the highest possible order for an s-stage Runge-Kutta method.

Another important family of Runge-Kutta methods is based on the Radau quadrature formulae, i.e. c1, . . . , cs are the zeros of Ps − Ps−1 with cs = 1. Choosing b and A such that B(s) and C(s) are satisfied yields the RadauIIA methods [25, 58, 85], the first few being listed in Table 2.4. The order of a RadauIIA method is p = 2s − 1 and the stage order is q = s.

The code Radau written by Hairer and Wanner [85, 86] is based on the RadauIIA methods. Radau became one of the standard solvers for ODEs and DAEs up to index 3. One of the main reasons for Radau's popularity are the excellent stability properties of the RadauIIA methods.

A Runge-Kutta method, when applied to (2.5), reads

Yi = yn + λh ∑j=1..s aij Yj,    yn+1 = yn + λh ∑i=1..s bi Yi,    (2.7)

such that the numerical approximations yn satisfy the recurrence

yn+1 = ( 1 + z b⊤ (I − zA)^(−1) e ) yn,    z = λh,    e = [1 · · · 1]⊤ ∈ R^s.

The stability function R(z) = 1 + z b⊤ (I − zA)^(−1) e determines the stability region { z ∈ C | |R(z)| ≤ 1 } for a given Runge-Kutta method.

Figure 2.4 shows that all RadauIIA methods are A-stable – a fact that is proved e.g. in [85]. The SDIRK method from Table 2.3 is A-stable as well. Even though the stability regions of the Gauss methods are not plotted here, it can be shown that these methods are all A-stable and their stability region coincides exactly with the left half of the complex plane. Hence they are high order generalisations of the trapezoidal rule. This statement is confirmed in Figure 2.4 (b), where |R(z)| is plotted along the imaginary axis.

s = 1 (order 1):

1 | 1
--+--
  | 1

s = 2 (order 3):

1/3 | 5/12  −1/12
1   | 3/4    1/4
----+------------
    | 3/4    1/4

s = 3 (order 5):

2/5−√6/10 | 11/45−7√6/360      37/225−169√6/1800   −2/225+3√6/225
2/5+√6/10 | 37/225+169√6/1800  11/45+7√6/360       −2/225−3√6/225
1         | 4/9−√6/36          4/9+√6/36            1/9
----------+----------------------------------------------------
          | 4/9−√6/36          4/9+√6/36            1/9

Table 2.4: The RadauIIA methods of order p = 1, 3, 5


This plot indeed shows the superior stability properties of the RadauIIA methods: high-frequency oscillations are damped out quickly while oscillations of physical significance are preserved. In particular, |R(y i)| stays close to 1 for y ∈ [0.1π, 0.25π].

For the SDIRK method Figure 2.4 (b) suggests a good preservation of low-frequency oscillations as well.

Nevertheless, Runge-Kutta methods are not popular in circuit simulation. The main reason are their high computational costs. Recall from (2.6) that the stages Yi satisfy

Y = h (A ⊗ Im) F(Y) + (e ⊗ Im) yn,    Y = [Y1, . . . , Ys]⊤,    F(Y) = [f(Y1, tn + c1 h), . . . , f(Ys, tn + cs h)]⊤,

where A ⊗ B denotes the Kronecker product for matrices [91] and m is the problem size. In general, this is a nonlinear system of dimension s·m and an iterative procedure such as Newton's method has to be used [86, 110].

Given an initial guess Y⁰ the costs for the stage evaluation comprise, for each iteration,

(a) the evaluation of the residual δk = Yk − h (A ⊗ Im) F(Yk) − (e ⊗ Im) yn,

(b) evaluating the Jacobian J = (∂fi/∂Yj),

(c) computing the LU factors of the iteration matrix LU = Is ⊗ Im − h (A ⊗ J),

(d) solving the linear system LU ΔYk+1 = −δk.

In particular the costs for (c) and (d) depend heavily on the structure of the Runge-Kutta matrix A. If A is a full matrix, then (c) requires O(s³m³) operations, and for (d) another O(s²m²) operations are necessary [93, 130].

Figure 2.4: Stability properties of some Runge-Kutta methods. (a) Stability regions of the RadauIIA methods (order 1, 3, 5) and the SDIRK method from Table 2.3. (b) Stability behaviour along the imaginary axis; the BDF2 is included for comparison. For a sampling rate of 8 − 20 steps per period Radau5 shows almost no damping of physical oscillations as |R5(y i)| stays close to 1 for y ∈ [0.1π, 0.25π].


Thus, for large m the linear algebra routines for Runge-Kutta methods (s ≥ 2) require much more work as compared to linear multistep methods (s = 1). Since Runge-Kutta methods use more function evaluations per step, the computational costs can be prohibitive for the application in electrical circuit design as function calls are most expensive due to the evaluation of complex transistor models.

Obviously this is only a rough estimate and it is not fair to judge methods based only on these crude computations. Runge-Kutta methods such as the RadauIIA formulae can often be implemented much more efficiently [85]. On the other hand, keeping track of the backward information requires some overhead in linear multistep methods. Thus, in order to compare the computational costs of Runge-Kutta and linear multistep methods, we consider a real world application.

Example 2.6. The Ring Modulator circuit from Figure 2.5 is an electrical device that mixes a low-frequency input signal Uin1 with a high-frequency input signal Uin2. The corresponding mathematical model was introduced by Horneber [92]. It is a widely used benchmark circuit described in detail in the Test Set for Initial Value Problem Solvers [124] of Bari University (formerly maintained by CWI Amsterdam). There artificial parasitic capacitances Cs are used in order to regularise the circuit and an ODE model is obtained [65].

This regularisation comes at the price of high-frequency parasitic oscillations being introduced. Omitting the artificial capacitances removes these parasitic effects, but the circuit turns into an index-2 system. As in [84, 132] this index-2 formulation shall be investigated here.

For the MNA, the diodes are replaced by nonlinear resistors described by the voltage-current relation I = g(U) = γ (e^{δU} − 1). The various constants can be

Figure 2.5: Circuit diagram for the Ring Modulator


2 Numerical integration schemes

found in Table 2.5. Writing down the equations of the charge oriented modified nodal analysis as described in Section 1.1 yields the DAE⁶

A [d(x(t))]′ + b(x(t), t) = 0,   where   x = [ u1 · · · u7  I1 · · · I8 ]⊤

comprises the potentials of nodes 1–7 and the currents I1, . . . , I8 through the eight inductors. The coefficients A, d and b are given by

    [ I2  0  ]
A = [ 0   0  ] ,   d(x) = [ C u1, C u2, Cp u7, Lh I1, Lh I2, Ls2 I3, Ls3 I4, Ls2 I5, Ls3 I6, Ls1 I7, Ls1 I8 ]⊤,
    [ 0   I9 ]

           [ (1/R) u1 − I1 + (1/2) I3 − (1/2) I4 − I7 ]
           [ (1/R) u2 − I2 + (1/2) I5 − (1/2) I6 − I8 ]
           [ I3 + g(u6 − u3) − g(u3 − u5)             ]
           [ I4 + g(u5 − u4) − g(u4 − u6)             ]
           [ I5 + g(u3 − u5) − g(u5 − u4)             ]
           [ I6 + g(u4 − u6) − g(u6 − u3)             ]
b(x, t) =  [ (1/Rp) u7 − I3 − I4                      ] ,
           [ u1                                       ]
           [ u2                                       ]
           [ −u1/2 + u3 − u7 + Rg2 I3                 ]
           [ u1/2 + u4 − u7 + Rg3 I4                  ]
           [ −u2/2 + u5 + Rg2 I5 − Uin2(t)            ]
           [ u2/2 + u6 + Rg3 I6 − Uin2(t)             ]
           [ u1 + (Ri + Rg1) I7 − Uin1(t)             ]
           [ u2 + (Rc + Rg1) I8                       ]

where the row blocks of A have sizes 2, 4 and 9, respectively.

As an example consider node 3. Kirchhoff’s current law shows that

0 = iRg2 + iD4 − iD1 = I3 + g(u6 − u3)− g(u3 − u5).

On the other hand, each inductor yields a further differential equation, e.g.

Ls2 I3′ = uLs2 = u1/2 + u7 − uRg2 − u3 = u1/2 + u7 − Rg2 I3 − u3.

The input signals Uin1(t) = (1/2) sin(2π t · 10³) and Uin2(t) = 2 sin(2π t · 10⁴) produce an output signal u2 at node 2 (see Figure 2.6).
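The diode nonlinearity and the node-3 current balance can be sketched in a few lines of code. This is only an illustration: the helper names (`g`, `node3_residual`) are hypothetical, while the constants γ and δ are the values from Table 2.5.

```python
import math

# Diode characteristic used in the MNA model: I = g(U) = gamma*(exp(delta*U) - 1).
GAMMA = 40.67286402e-9   # gamma from Table 2.5
DELTA = 17.7493332       # delta from Table 2.5

def g(u):
    """Diode current for the branch voltage u."""
    return GAMMA * (math.exp(DELTA * u) - 1.0)

def node3_residual(i3, u3, u5, u6):
    """Kirchhoff's current law at node 3: 0 = I3 + g(u6 - u3) - g(u3 - u5)."""
    return i3 + g(u6 - u3) - g(u3 - u5)

# If u3 = u5 = u6, both diode branches carry no current, so I3 = 0 is consistent.
print(node3_residual(0.0, 0.1, 0.1, 0.1))
```

Evaluating such exponential device models for every transistor or diode is precisely the function-call cost referred to above.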

As the capacitors and inductors are linear, it is straightforward to reformulate the equations in the form M y′ = f(y, t) that is given in the Test Set [124]. This formulation can be dealt with by Dassl and Radau, two of the most successful linear multistep and Runge-Kutta solvers, respectively. Although

Figure 2.6: The Ring Modulator's output signal u2 at node 2.

C   = 1.6 · 10⁻⁸            R   = 25000
Cp  = 10⁻⁸                  Rp  = 50
Lh  = 4.45                  Rg1 = 36.3
Ls1 = 0.002                 Rg2 = 17.3
Ls2 = 5 · 10⁻⁴              Rg3 = 17.3
Ls3 = 5 · 10⁻⁴              Ri  = 50
γ   = 40.67286402 · 10⁻⁹    Rc  = 600
δ   = 17.7493332

Table 2.5: Technical parameters for the Ring Modulator

⁶ The formulation given in [124] uses the voltages U1 = u1, U2 = u2, U3 = u3 − u7, U4 = u7 − u4, U5 = u5 − Uin2, U6 = Uin2 − u6, U7 = u7 instead of the node potentials.


Dassl is intended to solve DAEs of index µ ≤ 1, it is still a tough competitorfor Radau on this index-2 test problem.

In order to generate problems of arbitrary size, N instances of the Ring Modulator were solved simultaneously. Hence, systems of dimension N · 15 are treated here.

From the work-precision diagram in Figure 2.7 (a) we see that for N = 1 the Runge-Kutta code Radau is clearly superior, as higher accuracy is achieved in shorter time. However, already for N = 50, i.e. when solving a system of dimension 750, this situation is turned upside down. Now the BDF solver Dassl outperforms Radau, even though Dassl was not designed for solving index-2 problems (see Figure 2.7 (b)).

We saw that Runge-Kutta methods with a full matrix A in general require O(s³m³) + O(s²m²) operations for evaluating the stages. When the problem is large, Runge-Kutta methods are therefore expected to be computationally more expensive than linear multistep methods. This effect became clearly visible in the previous example.
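The crude cost estimate can be reproduced directly. The sketch below implements the rough model from the text — (sm)³ + (sm)² operations for factorising the iteration matrix of an s-stage method on an m-dimensional problem and back-substituting; `lu_cost` is a hypothetical name, and the model deliberately ignores sparsity and the cost of function evaluations.

```python
def lu_cost(s, m):
    """Crude operation count for an implicit method on an m-dimensional problem:
    factorise the (s*m) x (s*m) iteration matrix, then back-substitute."""
    return (s * m) ** 3 + (s * m) ** 2

# Fully implicit three-stage method (s = 3) vs. a linear multistep method (s = 1):
for m in (15, 750):
    print(m, lu_cost(3, m) / lu_cost(1, m))   # ratio approaches s**3 = 27
```

In this model the fully implicit method pays a roughly constant factor of s³ = 27 in linear algebra, a factor that dominates once the problem is large enough for the linear algebra to outweigh the function calls.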

However, the costs for solving the nonlinear equations in a Runge-Kutta scheme depend heavily on the structure of A. For an explicit method, i.e. when aij = 0 for j ≥ i, there is no nonlinear system to be solved at all.

In case of differential algebraic equations A[d(x, t)]′ + b(x, t) = 0 and the corresponding numerical scheme

A Y′i + b(Xi, tn + ci h) = 0,   i = 1, . . . , s,   (2.8a)

d(Xi, tn + ci h) = yn + h Σ_{j=1}^{s} aij Y′j,   yn+1 = yn + h Σ_{i=1}^{s} bi Y′i,   (2.8b)

Figure 2.7 (a): N = 1 (15 eqns).

Figure 2.7 (b): N = 50 (750 eqns).

Figure 2.7: Computing time vs. the achieved accuracy for different problem sizes. For the smaller problem Radau is superior, while Dassl is more efficient for the larger problem. The computations were performed on an Intel® Pentium® M processor with 1.4 GHz and 512 MB RAM, running the software from [124].


the coefficient matrix A in (2.8a) is singular. Thus certain parts of the stage derivatives Y′i need to be calculated from (2.8b) rather than from (2.8a). As a consequence implicit methods with a nonsingular matrix A have to be used. Different levels of implicitness can be considered:

(a) singly diagonally implicit methods: A has lower triangular structure and the diagonal elements aii = λ ≠ 0 are all equal.

(b) diagonally implicit methods: The matrix A still has lower triangular structure, but the diagonal elements may be different (some, not all, may even be zero).

(c) fully implicit methods: A is a full matrix.

(d) singly implicit methods: The matrix A is a full matrix but it has a one-point spectrum σ(A) = {λ}.

In case of diagonally implicit methods (b), the stages can be evaluated sequentially. Thus the computational work is considerably lowered to about s m³ + s m² operations. Recall that the iteration matrix for stage number i is given by I − h aii ∂f/∂y. If all diagonal elements are equal, as is the case for singly diagonally implicit methods, there is a good chance that the same Jacobian can be used for every stage such that the costs are further reduced. Finally, singly implicit methods can be implemented as cheaply as singly diagonally implicit methods (at least for large problems) [14], but they require an additional transformation to be carried out.
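Sequential stage evaluation with a single factorised iteration matrix can be sketched on the scalar test equation y′ = Ly. The method below is the classical two-stage, order-2, L-stable SDIRK with λ = 1 − √2/2; whether this coincides exactly with the method of Table 2.3 is an assumption of this sketch, and the helper names are hypothetical.

```python
import math

LAM = 1.0 - math.sqrt(2.0) / 2.0          # assumed common diagonal element
A = [[LAM, 0.0], [1.0 - LAM, LAM]]        # lower triangular, equal diagonal
B = [1.0 - LAM, LAM]                      # weights (stiffly accurate choice)

def sdirk_step(L, y, h):
    """One step for y' = L*y.  Every stage reuses the same factorised
    iteration 'matrix' 1 - h*LAM*L (I - h*lambda*J in the general case)."""
    fac = 1.0 - h * LAM * L               # factorise once ...
    f = []                                # stage derivatives f(Y_i) = L*Y_i
    for i in range(2):
        rhs = y + h * sum(A[i][j] * f[j] for j in range(i))
        Yi = rhs / fac                    # ... and reuse it for every stage
        f.append(L * Yi)
    return y + h * sum(B[i] * f[i] for i in range(2))

def err(n):
    """Global error at t = 1 for y' = -y, y(0) = 1, using n steps."""
    y, h = 1.0, 1.0 / n
    for _ in range(n):
        y = sdirk_step(-1.0, y, h)
    return abs(y - math.exp(-1.0))

print(err(40) / err(80))                  # close to 4, i.e. order 2
```

Only one "factorisation" (here a scalar reciprocal, in general an LU decomposition) is needed per step, which is exactly the saving described above.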

As a consequence, singly diagonally implicit methods such as the SDIRK method given in Table 2.3 may be the way to go. SDIRK methods were introduced by Butcher [16]. Further contributions were made by Nørsett [141] and Alexander [1]. Unfortunately, SDIRK methods suffer from the order reduction phenomenon, which is closely related to the method's stage order.

Example 2.7. Prothero and Robinson [131] introduced the test problem

y′(t) = L (y(t) − g(t)) + g′(t),   y(t0) = g(t0),   (2.9)

in order to study the behaviour of numerical methods for stiff problems. The exact solution is given by the smooth function y(t) = g(t), which is assumed to be slowly varying. For this example, assume g(t) = cos(t). If L is negative and large in magnitude, then the problem can be arbitrarily stiff.

For stable methods of order p the order conditions ensure that the global error satisfies yn − g(tn) = O(h^p) as h tends to zero, but this relation may not hold for larger stepsizes. The order can be estimated numerically by solving (2.9) using different stepsizes. If the global error is plotted against the stepsize on a doubly logarithmic scale, then the slope indicates the order of convergence.
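This estimation procedure can be sketched as follows. To keep the coefficients unambiguous, the sketch uses the implicit Euler method (p = q = 1, not one of the methods from Table 2.3) on (2.9) with g(t) = cos t; the function names are hypothetical.

```python
import math

def pr_error(L, h, t_end=1.0):
    """Global error of the implicit Euler method on the Prothero-Robinson
    problem y' = L*(y - g(t)) + g'(t), g(t) = cos(t), y(0) = g(0)."""
    t, y = 0.0, 1.0
    for _ in range(round(t_end / h)):
        t += h
        # implicit Euler: y_new = y + h*(L*(y_new - cos(t)) - sin(t))
        y = (y + h * (-L * math.cos(t) - math.sin(t))) / (1.0 - h * L)
    return abs(y - math.cos(t))

def observed_order(L, h):
    """Slope of the error curve on a doubly logarithmic scale."""
    return math.log(pr_error(L, h) / pr_error(L, h / 2.0), 2.0)

# Implicit Euler has p = q = 1, so the observed slope stays near 1 in both
# the non-stiff and the stiff regime.
print(observed_order(-10.0, 1e-2), observed_order(-1e7, 1e-2))
```

For a method with q < p, running the same experiment with a large |L| is exactly what exposes the drop from order p to the stage order q seen in Figure 2.8.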


These calculations were carried out for the two-stage Gauss and SDIRK methods from Table 2.3 with order p = 4 and p = 2, respectively. Figure 2.8 shows that for moderate values of L the methods do enjoy order p behaviour. If L is chosen to be large in magnitude, say L = −10⁷, then the numerically observed order degenerates to the stage order q.

This effect can be understood by noting that the global error is given by

g(tn) − yn = R(hL) (g(tn−1) − yn−1) + ε0 + hL b⊤ (I − hLA)⁻¹ [ε1, . . . , εs]⊤,

where

ε0 = (h^{p+1}/p!) (1/(p+1) − Σ_{i=1}^{s} bi ci^p) + O(h^{p+2}),

εi = (h^{qi+1}/qi!) (1/(qi+1) − Σ_{j=1}^{s} aij cj^{qi}) + O(h^{qi+2}),   i = 1, . . . , s,

are quadrature errors for the numerical result and the stages, respectively, and R is the stability function. Due to the expansion

hL b⊤ (I − hLA)⁻¹ ε = −b⊤A⁻¹ε − (hL)⁻¹ b⊤A⁻¹A⁻¹ε − · · · ,

which is relevant for large hL, we find

g(tn) − yn = R(hL) (g(tn−1) − yn−1) + (h³/36) g⁽³⁾(tn−1) + O(h⁴) = O(h²)

for the Gauss method with q = min(q1, q2) = min(2, 2) = 2 and

g(tn) − yn = R(hL) (g(tn−1) − yn−1) − (√2 h)/(4L) g⁽²⁾(tn−1) + O(h²) = O(h)

for the SDIRK method where q = min(q1, q2) = min(1, 2) = 1; the final equality relies on L-stability. Owing to its L-stability, i.e. lim_{z→∞} R(z) = 0, the SDIRK method exhibits order 1 convergence since for large hL the global error is essentially given by the local one. See also [25] for an in-depth explanation of this order reduction phenomenon.

Figure 2.8 (a): Gauss method, s = 2.

Figure 2.8 (b): SDIRK from Table 2.3.

Figure 2.8: Order of convergence for a Gauss and an SDIRK method. In the non-stiff case L = −10 the methods show order p behaviour. For L = −10⁷, i.e. in the stiff regime, the numerically observed order is governed by the stage order q.


For stiff ODEs and even more so for DAEs, the order of convergence is governed not only by the method's order but also by its stage order. In the stiff case, as in the previous example, the numerically observed order of convergence is determined by the stage order q. In particular, when the stage order is significantly lower than the order, this means an unacceptable degeneration in performance. Unfortunately, SDIRK methods are restricted to stage order 1 [52]. If an explicit first stage is used [105], then the stage order may be 2, but if a higher stage order is desired, some components of the vector c will lie outside the unit interval [0, 1].

While studying Runge-Kutta methods in this section, it turned out that some of these methods have excellent stability properties, but there are severe drawbacks as well. Implementations of (fully implicit) Runge-Kutta methods are too expensive for large problems due to the nonlinear system that needs to be solved. In order to reduce the costs, singly diagonally implicit methods have been considered. Unfortunately this class of methods is restricted to a low stage order and order reduction will occur for stiff ODEs and DAEs.

2.3 General Linear Methods

Traditionally integration schemes are classified as being either a linear multistep or a one-step method. Both classes have been considered in the previous two sections. It turned out that linear multistep methods, and in particular the BDF schemes, can be implemented efficiently for large problems, but the stability properties are not always satisfactory for integrated circuit design. One-step methods, with Runge-Kutta methods being the most prominent, have superior stability properties but they do suffer from high computational costs. Singly diagonally implicit methods (SDIRK) are affected by the order reduction phenomenon.

The implicit Euler method and the trapezoidal rule do belong to both classes. Hence linear multistep methods as well as Runge-Kutta methods can be regarded as generalisations of these integration schemes. While linear multistep methods make use of information from the past, Runge-Kutta methods invest more work per step in order to calculate highly accurate approximations to the exact solution. As a consequence, mathematical analysis such as order conditions, stability concepts or the derivation of methods is often different for these two classes of methods.

Several attempts have been made in order to overcome difficulties associated with each class of methods while keeping its advantages. Hybrid methods allow more than one function evaluation in a linear multistep scheme [69]. Using cyclic compositions of multistep methods it became possible to break Dahlquist's barriers [7]. On the other hand, Rosenbrock methods aim at reducing the costs for a Runge-Kutta scheme by linearising the nonlinear system


and incorporating the Jacobian into the numerical scheme [85, 96]. Since only linear systems need to be solved, these methods are of particular interest for simulating electrical circuits of huge dimension. The investigations in [79, 81] led to the development of Choral, a charge-oriented ROW method. In [129] two-step-W-methods are considered, where a general matrix W may be used instead of the Jacobian.

In order to cover both linear multistep and Runge-Kutta methods in one unifying framework, Butcher [18] introduced general linear methods (GLMs) for the solution of ordinary differential equations

y′(t) = f(y(t), t),   y(t0) = y0.

A general linear method uses r pieces of input information y[n]_1, . . . , y[n]_r when proceeding from tn to tn+1 = tn + h with a stepsize h. Similar to Runge-Kutta methods, s internal stages

Yi = h Σ_{j=1}^{s} aij f(Yj, tn + cj h) + Σ_{j=1}^{r} uij y[n]_j,   i = 1, . . . , s,   (2.10a)

are calculated and another r quantities

y[n+1]_i = h Σ_{j=1}^{s} bij f(Yj, tn + cj h) + Σ_{j=1}^{r} vij y[n]_j,   i = 1, . . . , r,   (2.10b)

are passed on to the next step. Occasionally Yi and y[n]_i are referred to as internal and external stages, respectively. Using the more compact notation

Y = [ Y1, . . . , Ys ]⊤,   F(Y) = [ f(Y1, tn + c1 h), . . . , f(Ys, tn + cs h) ]⊤,   y[n] = [ y[n]_1, . . . , y[n]_r ]⊤,

(2.10) can be written as

Y = (A ⊗ Im) hF(Y) + (U ⊗ Im) y[n],
y[n+1] = (B ⊗ Im) hF(Y) + (V ⊗ Im) y[n].

The integer m denotes the problem size and ⊗ represents the Kronecker product for matrices [91]. It is only a slight abuse of notation when the Kronecker product is omitted, i.e.

[ Y      ]   [ A  U ] [ hF(Y) ]
[ y[n+1] ] = [ B  V ] [ y[n]  ] .

This formulation of a GLM is due to Burrage and Butcher [13]. The matrices A, U, B and V represent the method's coefficients. Additionally, each method is characterised by four integers:

s: the number of internal stages,

r: the number of external stages,

p: the order of the method,

q: the stage order of the method.

The internal stages are approximations Yi ≈ y(tn + ci h) to the exact solution, but the external stages are fairly general. Commonly adopted choices are

y[n] ≈ [ y(tn), y(tn − h), . . . , y(tn − (r − 1)h) ]⊤   or   y[n] ≈ [ y(tn), h y′(tn), . . . , h^{r−1} y^{(r−1)}(tn) ]⊤,   (2.11)

the former representing a method of multistep type while the latter is a Nordsieck vector. Notice that compared to Nordsieck's original formulation [127] factorials have been omitted for convenience. Methods in Nordsieck form became popular through the work of Gear [70]. General linear methods of this type are considered, among other references, in [29, 40, 158].

Different choices of the vector y[n] are often related by linear transformations. In this sense the representation of a method using the matrices A, U, B and V is not unique, as two different methods may be equivalent owing to such a transformation [25].

Many classical integration schemes can be cast into general linear form. For simplicity we will restrict attention to Runge-Kutta and BDF methods, but more exotic schemes such as the cyclic composition methods mentioned earlier belong to this class as well. Many examples are given in [26] but also in the extensive monograph [25].

Example 2.8. Consider a Runge-Kutta scheme given by the Butcher tableau

c | A        c1 | a11 · · · a1s
--+---   =    ⋮ |  ⋮    ⋱   ⋮
  | b⊤       cs | as1 · · · ass
             ---+---------------
                | b1   · · ·  bs

A Runge-Kutta method uses only the initial value yn as input, such that r = 1. Thus the scheme (2.7), written in general linear form, reads

[ Y      ]   [ A   e ] [ hF(Y) ]
[ y[n+1] ] = [ b⊤  1 ] [ y[n]  ] .

Example 2.9. The BDF scheme

Σ_{i=0}^{k} αi yn+1−i = h β0 f(yn+1, tn+1)

from Section 2.1 with α0 = 1 can be written as

yn+1 = h β0 f(yn+1, tn+1) − Σ_{i=1}^{k} αi yn+1−i.


The quantity Y = yn+1 can be interpreted as the single internal stage of step number n. The approximations yn+1−i to the exact solution at previous timepoints are collected in the vector y[n]. Hence a general linear formulation of the BDF scheme is given by

[ yn+1   ]   [ β0  −α1  · · ·  −αk−1  −αk ]   [ h f(yn+1) ]
[ yn+1   ]   [ β0  −α1  · · ·  −αk−1  −αk ]   [ yn        ]
[ yn     ] = [ 0    1            0      0 ] · [ yn−1      ] .
[  ⋮     ]   [  ⋮        ⋱             ⋮ ]   [  ⋮        ]
[ yn+2−k ]   [ 0          1      0        ]   [ yn+1−k    ]

General linear methods were introduced already in 1966. Even though these methods are investigated in [13, 20, 22], it was the derivation of Dimsims⁷ in 1993 [23] that started a renewed interest in using GLMs for practical computations. The subclass of Dimsim methods is characterised by s = r = p = q. Often more general methods are included by only requiring that the quantities s, r, p, and q are all approximately equal [25].

Dimsims for non-stiff and stiff equations were constructed mainly by Butcher and Jackiewicz [32, 33, 34]. In [35, 36, 94] implementation issues are addressed. The potential of these methods in a parallel computing environment is investigated in [28, 60]. Dimsims are also closely related to singly implicit Runge-Kutta methods with an explicit first stage [105] as well as to the two-step-W-methods of [129].

Recently it turned out that allowing one more stage, i.e. s = r = p + 1 = q + 1, has advantages both for the construction and analysis of GLMs [25, 158]. Methods with inherent Runge-Kutta stability can be constructed using only linear operations. These methods allow a very accurate estimation of the local truncation error [40].

The reason for general linear methods becoming so popular is the fact thatwithin this class diagonally implicit methods exist that have high stage orderand a stability behaviour similar to Runge-Kutta methods.

Example 2.10. Consider the general linear method

    [ 1 − √2/2      0     |  1   1 − √2/2 ]
M = [   √2/4    1 − √2/2  |  1     √2/4   ]   with   c = [ 2 − √2 ] .   (2.12)
    [   √2/4    1 − √2/2  |  1     √2/4   ]             [    1   ]
    [    0          1     |  0      0     ]

It is assumed that the two quantities passed from step to step are approximations to the exact solution and the first scaled derivative. Since y[n] ≈ [ y(tn), h y′(tn) ]⊤, this method is therefore in Nordsieck form.

⁷ Dimsim is short for Diagonally Implicit Multistage Integration Method.


Note that the matrix A has diagonally implicit structure such that the stages can be calculated sequentially. Thus the implementation costs are similar to the SDIRK method from Table 2.3. In fact, these two methods even share the same linear stability behaviour. This can be seen by applying M to the linear scalar test equation (2.5),

Y = zAY + U y[n],
y[n+1] = zBY + V y[n]
   ⇒   y[n+1] = (V + zB(I − zA)⁻¹U) y[n].

The stability matrix M(z) = V + zB(I − zA)⁻¹U has only one nonzero eigenvalue

R(z) = (2 + (2 − 4λ)z + (1 − 4λ + 2λ²)z²) / (2(λz − 1)²),

and R(z) agrees with the stability function of the SDIRK method. Hence their stability regions coincide (see Figure 2.9 (a) and Figure 2.4 (a)).

However, the general linear method is superior when solving stiff equations. The general linear structure allows the stage order to be 2, such that there is no order reduction. In Figure 2.9 (b) this statement is confirmed numerically for the Prothero-Robinson problem (2.9).

For any order p the framework of general linear methods allows the construction of diagonally implicit schemes having high stage order q = p (see [158] but also Section 10.1). This makes GLMs very attractive for solving not only stiff ODEs but also DAEs. Nevertheless, there are only a few references available in the literature addressing general linear methods in the context of DAEs.

In a technical report Chartier [46] considered GLMs for DAEs having index 1 or 2. This work is similar to the Runge-Kutta approach [84], in particular in its restriction to DAEs in Hessenberg form. It was mentioned earlier that this class does not cover the equations of the charge oriented modified nodal analysis. GLMs for index-3 Hessenberg systems are studied by Schneider in [143].

Figure 2.9 (a): Stability region (outside of the encircled area).

Figure 2.9 (b): Order of convergence (L = −10 and L = −10⁷).

Figure 2.9: Stability behaviour of the GLM (2.12) and the numerically observed order for the Prothero-Robinson example (2.9). M has the same stability region as the SDIRK method with order 2. Due to the stage order being 2 there is no order reduction for stiff problems.


There is also a technical report by Butcher and Chartier [27] where methods are constructed for stiff ODEs and DAEs.

In this thesis Dimsims as well as methods with s = r = p + 1 = q + 1 are studied for nonlinear differential algebraic equations having index 1 or 2. The DAEs studied here originate mainly from the charge oriented modified nodal analysis (MNA) and the emphasis is on solving DAEs of the form

A [d(x(t))]′ + b(x(t), t) = 0,   (2.13)

where the leading term is properly stated. Other applications such as chemical reaction dynamics or the simulation of mechanical multibody systems (after a suitable regularisation) often fall into this class as well [59].

Studying numerical methods for DAEs of this type requires a refined analysis of structural properties. Part II is therefore devoted to studying (2.13) in considerable detail. In particular the concept of the tractability index will lead to a decoupling procedure for nonlinear DAEs. This new decoupling procedure makes only mild assumptions on the smoothness of the coefficients and is thus suited for dealing with the MNA equations.

In Part III the decoupling procedure is used to study general linear methods for DAEs of the form (2.13). After briefly reviewing ODE theory for GLMs, these methods are applied to implicit index-1 equations. The order conditions and convergence results obtained in Chapter 8 are then transferred to properly stated DAEs (2.13) with index 2. Here, the decoupling procedure will be the central tool. The details of this construction are presented in Chapter 9.

After deriving all requirements a GLM has to fulfil in order to be applicable for solving DAEs, practical aspects are addressed in Part IV. Here, in Chapter 10, general linear methods with s = r = p + 1 = q + 1 but also Dimsims with s = r = p = q are constructed. Implementation issues such as error estimation and order control are addressed for each class of methods, such that a variable-stepsize variable-order implementation is possible in each case.

Numerical experiments in Chapter 12 are used to compare these two implementations. The application to realistic circuits in integrated circuit design gives strong evidence that general linear methods, in particular the Dimsim implementation, are competitive if not superior to classical codes.


Part II

Differential Algebraic Equations


3 Linear Differential Algebraic Equations

In Chapter 1 differences between differential and differential algebraic equations have been addressed. In particular it may happen that parts of the original data need to be differentiated in order to obtain a solution. As we are aiming at solving DAEs numerically, this differentiation process has to be performed numerically as well. This, in turn, is far from trivial since small perturbations in the original data may lead to errors of arbitrary size – an effect already demonstrated in Example 1.3. The question whether there are inherent differentiations involved and, if so, up to what degree, distinguishes different classes of numerical complexity for DAEs. These different levels are usually measured by some kind of index concept.

In the next section some of the most frequently used index notions will be reviewed. Then, in Section 3.2, the tractability index will be treated in more detail. DAEs with a properly stated leading term are introduced and analysed using the framework provided by this index concept. In particular the decoupling of general index-µ equations is briefly addressed for linear DAEs. The main focus, however, is on DAEs with index 1 or 2.

3.1 Index Concepts

There is a wide variety of index concepts available in the literature. For linear DAEs with constant coefficients,

E x′(t) + F x(t) = q(t),   (3.1)

the Kronecker index is most natural and will therefore be addressed first. Given that the pair (E, F) forms a regular matrix pencil, i.e. there is λ ∈ C such that det(λE + F) ≠ 0, there are nonsingular matrices U, V that transform E and F simultaneously into Kronecker-Weierstraß normal form given by

UEV = [ I  0 ]        UFV = [ C  0 ]
      [ 0  N ] ,            [ 0  I ] .


The matrix N = diag[N1, . . . , Nk] has block-diagonal form, where Ni is a Jordan block to the eigenvalue 0. Proofs of this result are given in [68, 85, 97]. Using the transformed variables [u; v] = V⁻¹x and [a; b] = U q, equation (3.1) can be written as the decoupled system

u′(t) + C u(t) = a(t),   v(t) = Σ_{i=0}^{µ−1} (−N)^i b^{(i)}(t).   (3.2)

The number µ is the index of nilpotency of the matrix N, i.e. N^{µ−1} ≠ 0 but N^µ = 0. From (3.2) it is clear that the inherent dynamics of the DAE (3.1) is described by the ordinary differential equation u′(t) = −C u(t) + a(t). Only for this component may initial conditions be prescribed. The remaining part of the solution, v(t), is already fixed by the right-hand side. However, in order to calculate v(t), parts of the right-hand side need to be differentiated up to µ − 1 times. If µ = 1, and therefore N = 0, no differentiation is necessary. But if µ = 2, then b′ appears in the representation of v. Therefore the DAE (3.1) is said to have index µ.
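The algebraic part of (3.2) can be verified on a small example: for a single 2-nilpotent Jordan block the series terminates after two terms, v = b − N b′. Both N and the right-hand side b below are illustrative choices.

```python
# Algebraic part of the Kronecker normal form: N v'(t) + v(t) = b(t).
# With N = [[0, 1], [0, 0]] (single Jordan block, N^2 = 0) the series
# v = sum_i (-N)^i b^(i) terminates after two terms: v = b - N*b'.
def b(t):   return [t ** 2, t ** 3]      # sample polynomial right-hand side
def db(t):  return [2 * t, 3 * t ** 2]   # b'
def ddb(t): return [2.0, 6 * t]          # b''

def v(t):
    """v = b - N*b', where N*b' = [b2', 0]."""
    bt, dbt = b(t), db(t)
    return [bt[0] - dbt[1], bt[1]]

def dv(t):
    dbt, ddbt = db(t), ddb(t)
    return [dbt[0] - ddbt[1], dbt[1]]

def residual(t):
    """N v'(t) + v(t) - b(t); vanishes identically (up to rounding)."""
    vt, dvt = v(t), dv(t)
    return [dvt[1] + vt[0] - b(t)[0], vt[1] - b(t)[1]]

print(residual(0.7))
```

Note that v involves b′: any noise in b is differentiated, which is the numerical difficulty attached to µ ≥ 2.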

In the case where E is nonsingular, i.e. (3.1) is in fact an ordinary differential equation, the block N does not appear at all in the Kronecker-Weierstraß normal form. Hence this special case is assigned the index µ = 0.

Unfortunately there is no immediate generalisation of the Kronecker index to DAEs with time-dependent coefficients. The case of nonlinear DAEs is even further beyond the scope of the Kronecker index. The reason is that important invariants of constant coefficient systems may be changed under time-dependent transformations. In the ongoing effort to find suitable index concepts for large classes of DAEs many different index concepts were developed, each of them emphasising different aspects of the Kronecker index.

It was remarked earlier that for equations with higher index small perturbations in the data may lead to large errors in the solution. The perturbation index introduced in [84] is defined as a measure for this sensitivity of solutions with respect to perturbations of the problem data. The perturbation index is most relevant for numerical computations. Unfortunately, it can be quite difficult to calculate the index analytically. Another drawback is that for some problems the perturbation index may differ significantly from other indices attributed to the same problem based on the index concepts described below [43, 72].

It is conceptually much easier to count, as for the Kronecker index, the number of times that the equations need to be differentiated in order to derive an ODE representation. This idea was applied to general nonlinear DAEs by Campbell, Gear and Petzold. They introduced the concept of the differentiation index [10, 43, 73]. The following definition is taken from [85].

Definition 3.1. Let the nonlinear DAE

f(x′(t), x(t), t) = 0


be sufficiently smooth. The DAE is said to have (differentiation) index µ if µ is the minimal number of differentiations

f(x′(t), x(t), t) = 0,   (d/dt) f(x′(t), x(t), t) = 0,   . . . ,   (dµ/dtµ) f(x′(t), x(t), t) = 0   (3.3)

such that the equations (3.3) allow to extract an explicit ordinary differential system x′(t) = ϕ(x(t), t) using only algebraic manipulations.

The differentiation index received much attention and it is widely used. But for practical computations the setup of the derivative array (3.3) may be prohibitive due to high computational costs. Also, Definition 3.1 tacitly assumes that f ∈ Cµ is sufficiently smooth to calculate (3.3), which often does not hold for applications such as circuit simulation or multibody dynamics. We will see later that weaker smoothness requirements are often sufficient. One particular disadvantage of the differentiation index is the fact that it cannot be applied to over- and underdetermined systems, as the solvability concept of the differentiation index requires unique solvability.

The strangeness index introduced by Kunkel and Mehrmann [98, 99] generalises the differentiation index. Its main advantage is not only the possibility to include over- and underdetermined systems, but also to provide normal forms for time-dependent differential algebraic equations. We will briefly review some of the key points related to the strangeness index. More details are given in the papers by Kunkel and Mehrmann and in particular in the monograph [101].

Within the context of the Kronecker index two matrix pairs (E1, F1) and (E2, F2) are considered to be equivalent if they are related by a similarity transformation, i.e. E2 = UE1V and F2 = UF1V with nonsingular matrices U, V. The equation is then written in terms of the transformed variable y = V⁻¹x. If V depends on time, then x = V y needs to be differentiated in order to obtain the transformed equation. Thus, due to the product rule, x′ = V y′ + V′y holds and the additional term V′y has to be taken into consideration. Consequently, Kunkel and Mehrmann consider two matrix pairs (Ei, Fi), Ei, Fi ∈ C(I, Cm×n), i = 1, 2, equivalent if there are pointwise nonsingular matrix functions U ∈ C(I, Cm×m) and V ∈ C(I, Cn×n) such that

E2 = UE1V, F2 = UF1V + UE1V′.

This indeed defines an equivalence relation (E1, F1) ∼ (E2, F2). Under additional constant rank assumptions it is possible to derive the (global) normal form

                       ( [ Is  0   0  0 ]   [ 0   A12  0   A14 ] )
                       ( [ 0   Id  0  0 ]   [ 0   0    0   A24 ] )
(E1, F1) ∼ (E2, F2) =  ( [ 0   0   0  0 ] , [ 0   0    Ia  0   ] )
                       ( [ 0   0   0  0 ]   [ Is  0    0   0   ] )
                       ( [ 0   0   0  0 ]   [ 0   0    0   0   ] )

where the five row blocks have s, d, a, s and m − 2s − d − a rows, respectively,


and the blocks Aij are again matrix functions on I while the numbers s, d and a are invariants of the equivalence relation [101]. Writing down the corresponding linear differential algebraic equation, (3.1) is found to be equivalent to the system

x′1 + A12x2 + A14x4 = q1 (3.4a)

x′2 + A24x4 = q2 (dynamic part) (3.4b)

x3 = q3 (algebraic part) (3.4c)

x1 = q4 (3.4d)

0 = q5 (consistency condition) (3.4e)

The interpretation of (3.4b), (3.4c) and (3.4e) is straightforward. However, there is a "strange" coupling between (3.4a) and (3.4d) such that the number s is called "strangeness". Differentiating (3.4d) and inserting the result into (3.4a), the latter equation becomes purely algebraic. The resulting modified matrix pair is again denoted as (E2, F2) for simplicity. The process of finding the global normal form and eliminating the strangeness part can be repeated iteratively to obtain a sequence of characteristic values (si, di, ai) for (Ei, Fi). A pair (Ei, Fi) is called strangeness-free if si = 0. If this is the case, then the sequence becomes stationary. The smallest number i ∈ N0 such that (Ei, Fi) is strangeness-free is defined to be the strangeness index.

If the strangeness index µ is well-defined, i.e. the constant rank assumptions hold for every step of the construction, then the normal form of (Eµ, Fµ) reads

[ Idµ 0 0 ] [ x1 ]′   [ 0  0   A13 ] [ x1 ]   [ f1 ]
[ 0   0 0 ] [ x2 ]  + [ 0  Iaµ 0   ] [ x2 ] = [ f2 ]
[ 0   0 0 ] [ x3 ]    [ 0  0   0   ] [ x3 ]   [ f3 ].

For the original DAE (3.1) the following properties hold:

• The problem (3.1) is solvable if and only if f3 = 0.

• An initial condition x(t0) = x0 is consistent if and only if x(t0) = x0 implies x2(t0) = f2(t0).

• The problem (3.1) is uniquely solvable if and only if m− dµ − aµ = 0.

Obviously the strangeness index is a powerful tool for analysing differential algebraic equations. Over- and underdetermined systems are included most naturally. In particular the resulting normal forms provide much insight into the structure of a given DAE.

On the other hand, even for simple examples it may be difficult to calculate the normal forms due to the complexity of the linear algebra computations involved [146]. This is true in particular for nonlinear problems. Thus numerical software developed in conjunction with the strangeness index requires the user to provide the derivative array (3.3) as input [102].


In this thesis we are mainly concerned with the analytical and numerical investigation of differential algebraic equations arising from modelling electrical circuits. Input signals, but also the equations modelling electrical devices such as MOSFETs or BJTs1, will typically have only low smoothness properties. Thus for our investigations we have to avoid using the derivative array, not only due to these smoothness considerations but also because of the problem size. If a circuit contains hundreds of thousands of devices, the calculation of the derivative array is not recommended in practice.

Hence we will focus on the tractability index originally introduced by März [113]. It does not require the calculation of the derivative array – neither for analytical nor for computational purposes. Many of the smoothness requirements on the data will be replaced by assumptions on certain subspaces being spanned by continuously differentiable functions. These subspaces can be conveniently analysed in the context of electrical circuit simulation.

This index concept aims at determining the exact smoothness requirements necessary for solving a given DAE [117, 121]. The equation's structure is analysed in detail by determining an inherent ordinary differential equation that describes the dynamics of the system [115, 116]. Recently over- and underdetermined systems have been included into the general theory as well [55, 119]. Normal forms are considered in [118].

As the tractability index is seminal for the work presented in this thesis, we will investigate this concept in considerable detail. The next section is devoted to defining the index and investigating its consequences for linear differential algebraic equations. We will determine the so-called inherent regular ordinary differential equation for linear DAEs. A decoupling procedure will be the main tool leading not only to the inherent regular ODE but also to a refined characterisation of solvability, consistent initialisation and smoothness requirements.

Other index concepts such as the geometric index [133] will not be investigated further. All index concepts exist in their own right. Each has its own advantages and disadvantages, but for integrated circuit design the tractability index seems to be most appropriate.

3.2 Linear DAEs with Properly Stated Leading Terms

The tractability index is defined for differential algebraic equations of the form

A(t)[D(t)x(t)]′ + B(t)x(t) = q(t), t ∈ I. (3.5)

Two matrices A(t) ∈ Rm×n and D(t) ∈ Rn×m are used in order to formulate the leading term. In general both matrices will be rectangular. The leading term in (3.5) makes precise which derivatives are actually involved.

1MOSFET: Metal Oxide Semiconductor Field Effect Transistor and BJT: Bipolar Junction Transistor


Notice that DAEs in standard form (3.1) can be cast easily into the form (3.5), using e.g. a projector PE along the kernel of E [2, 88]. Also, recall from Section 1.1 that the equations of the modified nodal analysis are exactly of this type, provided that the circuit is linear. In the MNA equations (1.5) the vector d(x(t), t) represented charges and fluxes of the network. The leading term of (3.5) represents the linear analogy to this situation.

However, A and D may not be chosen completely arbitrarily. We have to make sure that there is no overlap between A and D but also no gap in between them. More precisely, A and D have to be well-matched in the following sense.

Definition 3.2. The leading term of (3.5) is properly stated if

kerA(t) ⊕ imD(t) = Rn, t ∈ I,

and there is a continuously differentiable projector function R ∈ C1(I, L(Rn)) such that

imR(t) = imD(t), kerR(t) = kerA(t), t ∈ I.

We require that the matrix functions A, D and B are continuous. If the leading term is properly stated then, by definition, A and D have a common constant rank [116]. A continuous function x : I → Rm is a solution of (3.5) if it satisfies the equation pointwise. In order for this to hold, the D-part of x, i.e. t ↦ D(t)x(t), needs to be differentiable. Consequently the appropriate solution space is given by

C1D(I, Rm) = { x ∈ C(I, Rm) | Dx ∈ C1(I, Rn) }.

The linear DAE (3.5) will now be investigated in more detail.

Let Q0 be a continuous projector function onto the kernel of AD. If P0 = I − Q0 denotes the complementary projector, we have x = P0x + Q0x and (3.5) can be written as

A(Dx)′ + Bx = q ⇔ A(Dx)′ + BP0x + BQ0x = q. (3.6)

The t arguments have been omitted for simplicity. For a further refined reformulation a generalised reflexive inverse D− is introduced for D such that

DD−D = D, D−DD− = D−, DD− = R, D−D = P0. (3.7)

Generalised matrix inverses are studied in [160]. Due to the first two properties, the products DD− and D−D are projector functions. By requiring DD− = R and D−D = P0 the inverse D− becomes uniquely defined. However, it still depends on P0 and thus on the choice of Q0.


Obviously we have DP0 = D and DQ0 = Q0D− = 0. Using Definition 3.2 it turns out that A = AR = ADD−. Hence (3.6) can be rewritten as

(AD + BQ0)[D−(Dx)′ + Q0x] + BP0x = q. (3.6′)

What have we gained by introducing the reformulation (3.6′)? To see the benefit of the above calculations let us assume, for the moment, that the matrix function G1 = AD + BQ0 remains nonsingular on I. Then (3.5) is equivalent to

D−(Dx)′ + Q0x + G1⁻¹BD−Dx = G1⁻¹q, (3.6″)

which is found by scaling with G1⁻¹. Multiplication with the projector functions P0 and Q0, respectively, transforms this equation into the decoupled system

P0D−u′ + P0G1⁻¹BD−u = P0G1⁻¹q (3.8a)
z0 + Q0G1⁻¹BD−u = Q0G1⁻¹q. (3.8b)

The abbreviations u = Dx and z0 = Q0x were introduced in order to make the decoupled structure clearer.

Similar to the approach taken when defining the Kronecker index, the reformulation (3.8) leads to a decoupled system of two equations. (3.8a) is completely independent of (3.8b). But once a solution of (3.8a) is found, a solution of the original DAE can be constructed as

x = P0x + Q0x = D−u + z0 = (I − Q0G1⁻¹B)D−u + Q0G1⁻¹q.

Observe that in order to calculate Q0x no differentiation is necessary. Therefore we will later refer to this case, where G1 is nonsingular, as index 1.

Even though (3.8) provides a decoupling of the original DAE, this structure is not quite satisfying. Matters would be much clearer if (3.8a) was an ordinary differential equation. Using the product rule this goal is easily achieved.

Notice that P0 = D−D and kerP0 = kerD imply that (3.8a) is equivalent to

Ru′ + DG1⁻¹BD−u = DG1⁻¹q

and since R is a smooth function, we find

u′ = (Ru)′ = R′u − DG1⁻¹BD−u + DG1⁻¹q. (3.8a′)

Thus the component u = Dx is determined by an inherent regular ordinary differential equation. For u initial conditions may be freely chosen. The remaining part z0 = Q0x of the solution is explicitly given in terms of u.

It is obvious that this decoupling hinges on the fact that G1 = AD + BQ0 is nonsingular. However, the procedure can be generalised to higher index cases. This is done by considering G0 = AD, B0 = B and the following sequence of matrices and subspaces for i ≥ 0. All expressions in this sequence are meant pointwise for t ∈ I.

Ni = kerGi, (3.9a)
Si = { z ∈ Rm | Biz ∈ imGi } = { z ∈ Rm | Bz ∈ imGi }, (3.9b)
Qi = Qi², imQi = Ni, Pi = I − Qi, (3.9c)
Gi+1 = Gi + BiQi, (3.9d)
Bi+1 = BiPi − Gi+1D−(DP0 · · · Pi+1D−)′DP0 · · · Pi. (3.9e)

The expression (3.9c) means that a projector function Qi onto Ni is introduced and that Pi is its complementary projector. When defining the sequence (3.9) we have to make sure that the involved derivatives exist.

Definition 3.3. The sequence (3.9) is said to be admissible up to k ∈ N, if the following properties hold for i = 0, 1, . . . , k:

(a) rankGi(t) = ri is constant on I,

(b) N0 ⊕ · · · ⊕Ni−1 ⊂ kerQi for i ≥ 1,

(c) Qi ∈ C(I,Rm×m) and DP0 · · ·PiD− ∈ C1(I,Rn×n).

The condition (b) ensures that certain products of projector functions are again projectors. Property (c) states the required smoothness properties.

We saw in the above example that G0 = AD was singular, but the nonsingular matrix function G1 enabled a decoupling of the original DAE. This situation is generalised to higher indices in the following definition. See [3, 75, 116, 120] for more details.

Definition 3.4. Equation (3.5) is called a regular DAE with tractability index µ if there is a sequence (3.9) that is admissible up to µ with

0 ≤ r0 ≤ · · · ≤ rµ−1 < m and rµ = m.

It was pointed out earlier that D− depends on the choice of Q0. Similarly the matrix functions Gi depend on the particular choice of the projector functions Qj for j = 0, . . . , i − 1. In spite of this, the tractability index itself is well-defined. In [116] it is proved that the index is independent of the special choice of the admissible sequence (3.9). There it is also shown that the index remains invariant under regular transformations and refactorisations.
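For constant-coefficient problems the matrix sequence (3.9) can be explored numerically. The following NumPy sketch is not taken from the thesis: it uses orthogonal projectors onto kerGi computed via the SVD (which need not satisfy the admissibility condition (b)) and drops the derivative term in (3.9e), which vanishes for constant data, so that Bi+1 = BiPi.

```python
import numpy as np

def nullspace_projector(G, tol=1e-12):
    # Orthogonal projector onto ker(G), built from the right singular vectors.
    _, s, Vt = np.linalg.svd(G)
    rank = int(np.sum(s > tol * max(1.0, float(s[0]))))
    N = Vt[rank:].T                    # columns: orthonormal basis of ker(G)
    return N @ N.T

def tractability_index(A, D, B, max_index=5):
    # Constant-coefficient sketch of the sequence (3.9):
    # G_{i+1} = G_i + B_i Q_i,  B_{i+1} = B_i P_i  (derivative term dropped).
    m = B.shape[0]
    G, Bi = A @ D, np.asarray(B, dtype=float)
    for mu in range(max_index + 1):
        if np.linalg.matrix_rank(G, tol=1e-9) == m:
            return mu                  # G_mu nonsingular: tractability index mu
        Q = nullspace_projector(G)     # projector onto N_i = ker G_i
        G = G + Bi @ Q
        Bi = Bi @ (np.eye(m) - Q)
    raise ValueError("index exceeds max_index")
```

For example, with A = (1, 0)⊤ and D = (1 0), the choice B = ((0, 0), (1, 1)) terminates with a nonsingular G1 (index 1), while B = ((0, 1), (1, 0)) requires one more step (index 2).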

Remark 3.5. In [3] the index is defined in terms of the subspaces Ni and Si appearing in the sequence (3.9). In fact, both approaches are equivalent (see [116]) and the index can be defined as follows:

• A regular index-1 DAE can be equivalently characterised by the condition N0(t) ∩ S0(t) = {0} for t ∈ I.

• A regular index-2 DAE can be equivalently characterised by requiring dim(N0(t) ∩ S0(t)) = const > 0 and N1(t) ∩ S1(t) = {0} for t ∈ I.


In particular the space N0(t) ∩ S0(t) will later become of vital importance for studying nonlinear index-2 equations.

Finally, recall from [75, Theorem A.13] that Ni(t) ∩ Si(t) = {0} is equivalent to Ni(t) ⊕ Si(t) = Rm. Thus for index-µ equations the canonical projector Qµ−1 onto Nµ−1 along Sµ−1 can always be defined.

Similarly to the case of index-1 equations discussed above, the sequence (3.9) allows us to decouple index-µ equations into their characteristic parts. A careful rearrangement of (3.5) and rescaling with the matrix Gµ⁻¹ shows that (3.5) can be written as

Pµ−1 · · · P0D−(Dx)′ + Gµ⁻¹BP0 · · · Pµ−1x + ∑_{j=0}^{µ−1} Qjx

+ ∑_{i=1}^{µ−1} ∑_{j=1}^{i} Pµ−1 · · · PjD−(DP0 · · · PjD−)′DP0 · · · Pi−1Qix = Gµ⁻¹q.

For µ = 1 this equation exactly coincides with (3.6″) since P0D− = D−. The derivatives (DP0 · · · PjD−)′ start to appear for µ ≥ 2. In any case the decomposition

I = P0 · · ·Pµ−1 +Q0P1 · · ·Pµ−1 + · · ·+Qµ−2Pµ−1 +Qµ−1 (3.10)

is used to split this equation into µ + 1 separate parts. Notice that each term of (3.10) is a projector function due to the requirement (b) in Definition 3.3. Thus, after quite long and technical calculations as performed e.g. in [116, 120] it turns out that a regular index-µ DAE (3.5) is equivalent to the decoupled system

u′ − (DP0 · · · Pµ−1D−)′u + DP0 · · · Pµ−1Gµ⁻¹BD−u = DP0 · · · Pµ−1Gµ⁻¹q, (3.11a)

zk = Lkq − KkD−u + ∑_{j=k+1}^{µ−1} Nkj(Dzj)′ + ∑_{j=k+2}^{µ−1} Mkjzj (3.11b)

where k = 0, . . . , µ − 1. The components u and zk are given by

u = DP0 · · · Pµ−1x, z0 = Q0x, zk = P0 · · · Pk−1Qkx, k = 1, . . . , µ − 1. (3.12)

The continuous coefficients Lk, Kk, Nkj and Mkj are defined in terms of D,D− and the projector functions Qi, Pi. Detailed expressions are given in [120].

For our investigations it is important to note that (3.11a) is an ordinary differential equation for the component u which does not depend on any of the other variables zk. This equation is called the inherent regular ordinary differential equation of the index-µ DAE (3.5).


Once a solution u of the inherent regular ODE is found, zµ−1 is calculated as zµ−1 = Lµ−1q − Kµ−1D−u. In order to calculate zµ−2 the D-part of this component needs to be differentiated according to

zµ−2 = Lµ−2q − Kµ−2D−u + Nµ−2,µ−1(Dzµ−1)′.

Successively the remaining parts zµ−3, . . . , z0 are calculated and additional derivatives may be involved. Finally the solution x of the original DAE is obtained as

x = D−u + z0 + · · · + zµ−1.

This situation is summarised in the following theorem.

Theorem 3.6. Let (3.5) be a regular DAE with tractability index µ. Then

(a) imDP0 · · · Pµ−1 is a (time-varying) invariant subspace of the inherent regular ODE (3.11a).

(b) If x ∈ C1D(I, Rm) is a solution of the DAE (3.5), then the components u ∈ C1(I, Rn), z0 ∈ C(I, Rm) and z1, . . . , zµ−1 ∈ C1(I, Rm) defined in (3.12) form a solution of the decoupled system (3.11).

(c) If u ∈ C1(I, Rn), z0 ∈ C(I, Rm) and z1, . . . , zµ−1 ∈ C1(I, Rm) satisfy (3.11) with u(t0) ∈ im(DP0 · · · Pµ−1)(t0), then x = D−u + z0 + · · · + zµ−1 is a solution of the original DAE (3.5).

Proofs of the above result can be found in [116, 120].

The following statement about the existence and uniqueness of solutions for initial value problems in regular index-µ DAEs is a direct consequence of the preceding theorem.

Corollary 3.7. Let (3.5) be a regular DAE with tractability index µ. Assume that the coefficients and the right-hand side q are sufficiently smooth. Then for each x0 ∈ Rm the initial value problem

A(Dx)′ + Bx = q, x(t0) − x0 ∈ (N0 ⊕ · · · ⊕ Nµ−1)(t0)

is uniquely solvable in C1D(I, Rm).

Proof. Define u0 = (DP0 · · · Pµ−1)(t0)x0 and let u be the solution of the inherent regular ODE (3.11a) satisfying u(t0) = u0. As imDP0 · · · Pµ−1 is an invariant subspace of this equation, u = DP0 · · · Pµ−1u ∈ imDP0 · · · Pµ−1 follows for t ∈ I.

Let zk be the vector from (3.11b) and consider x = D−u + z0 + · · · + zµ−1. By Theorem 3.6 this is a solution satisfying (DP0 · · · Pµ−1)(t0)x(t0) = u(t0) = u0 = (DP0 · · · Pµ−1)(t0)x0. Therefore x(t0) − x0 ∈ (N0 ⊕ · · · ⊕ Nµ−1)(t0) is satisfied as well.

If there was another solution of the IVP, say x̃, then Theorem 3.6 shows that the corresponding parts ũ and z̃k satisfy (3.11b) as well. Thus u = ũ, zk = z̃k and the two solutions coincide.


Observe that in Corollary 3.7 special care was taken when formulating the initial condition. Given an arbitrary vector x0 ∈ Rm, the exact solution has to satisfy

x(t0) − x0 ∈ (N0 ⊕ · · · ⊕ Nµ−1)(t0).

Then x(t0) is the consistent initial value corresponding to x0. Nevertheless, computing x(t0) from x0 often poses serious difficulties for realistic applications [61].

The results of Theorem 3.6 and Corollary 3.7 were obtained using arbitrary admissible sequences (3.9). If we restrict ourselves to choosing special canonical projector functions Qi, the structure of (3.11b) can be simplified considerably. In fact, in [118] fine and complete decouplings are studied. Choosing canonical projector functions it is possible to obtain Kk = 0, such that the coupling coefficients disappear. Thus u and the components zk can be calculated independently of each other.

Consider again the case of index-1 equations. It was shown earlier that the inherent regular ODE and the algebraic part are given by (3.8a′), (3.8b),

u′ = (DD−)′u − DG1⁻¹BD−u + DG1⁻¹q, (3.11a′)
z0 = Q0x = Q0G1⁻¹q − Q0G1⁻¹BD−u. (3.11b′)

Thus we are led to the solution representation

x = D−u + z0 = (I − Q0G1⁻¹B)D−u + Q0G1⁻¹q. (3.13)

Observe that I − Q0G1⁻¹B is precisely the canonical projector onto S0 along the subspace N0. This is shown e.g. in [75]. If the choice Q̄0 = Q0G1⁻¹B leads to an admissible sequence, then (3.11) turns out to be the completely decoupled system

u′ = R′u − DG1⁻¹BD−u + DG1⁻¹q,
z0 = Q̄0G1⁻¹q, x = D−u + Q̄0G1⁻¹q.

This can be seen by noting that the term Q̄0G1⁻¹BD− = 0 vanishes. Thus the decoupled system is in standard canonical form (SCF) in the sense of [42].

The index-2 case is more complicated. However, working carefully through the expressions given in [120] we find

u′ = (DP1D−)′u − DP1G2⁻¹BD−u + DP1G2⁻¹q, (3.11a″)
z1 = P0Q1x = P0Q1G2⁻¹q − K1D−u, (3.11b″)
z0 = Q0x = Q0P1G2⁻¹q − K0D−u + Q0Q1D−(Dz1)′.


The coupling coefficients are given by (cf. [118])

K1 = P0Q1G2⁻¹BP0P1, (3.14a)
K0 = Q0P1G2⁻¹(B + G0D−(DP1D−)′D)P0P1. (3.14b)

Again, Q̄1 = Q1G2⁻¹BP0 is the canonical projector onto N1 along S1. If the sequence based on Q̄0 = Q0 and Q̄1 is admissible, then Q̄1 = Q1G2⁻¹BP0 implies K1 = P0Q̄1P1 = 0 such that (3.11) reads

u′ = (DP1D−)′u − DP1G2⁻¹BD−u + DP1G2⁻¹q, (3.15a)
z1 = P0Q̄1G2⁻¹q, (3.15b)
z0 = Q0P1G2⁻¹q − K̄0D−u + Q0Q̄1D−(Dz1)′. (3.15c)

In the case of index-2 equations the exact solution can therefore be written as

x = D−u + z0 + z1
  = (I − K̄0)D−u + (Q0P1 + P0Q̄1)G2⁻¹q + Q0Q̄1D−(DQ̄1G2⁻¹q)′. (3.16)

This is called a fine decoupling. In contrast to a complete decoupling the coefficient K̄0 is still present. Fine decouplings such as the above system were studied in considerable detail in [90, 115, 120, 145]. However, in [3] it is shown that

Q̄0 = Q0P1G2⁻¹B + Q0Q1D−(DQ1D−)′, Q̄1 = (I − Q̄0P0)Q1 = Q1

yield a complete decoupling with K̄0 = 0, provided that Q̄0, Q̄1 give rise to an admissible sequence (3.9).

Example 3.8. The Miller Integrator already introduced in Example 1.1 is described by the linear DAE A[Dx(t)]′ + Bx(t) = q(t) with x = [u1, u2, u3, iV1, iV2]⊤ and

A = [0, −1, 1, 0, 0]⊤, D = [0, −C, C, 0, 0],

B = [ G  −G  0  1  0 ]       [ 0    ]
    [ −G  G  0  0  0 ]       [ 0    ]
    [ 0   0  0  0  1 ] , q = [ 0    ]
    [ 0  −a  1  0  0 ]       [ 0    ]
    [ 1   0  0  0  0 ]       [ v(t) ].

The amplification factor of the operational amplifier is given by the real number a. Using the formulation above, the leading term is properly stated with R = 1. Calculating the matrix sequence (3.9) we find

G0 = [ 0  0   0  0  0 ]        [ 1  0  0  0  0 ]        [ G   −G   0  1  0 ]
     [ 0  C  −C  0  0 ]        [ 0  1  0  0  0 ]        [ −G  C+G −C  0  0 ]
     [ 0 −C   C  0  0 ] , Q0 = [ 0  1  0  0  0 ] , G1 = [ 0   −C   C  0  1 ]
     [ 0  0   0  0  0 ]        [ 0  0  0  1  0 ]        [ 0   1−a  0  0  0 ]
     [ 0  0   0  0  0 ]        [ 0  0  0  0  1 ]        [ 1    0   0  0  0 ].

Due to detG1 = (a − 1)C the Miller Integrator represents an index-1 problem for a ≠ 1. The particular choice of the projector function Q0 fixes the generalised inverse D− = [0, 0, 1/C, 0, 0]⊤ and the decoupled system (3.11) reads

u′ = (G/((a−1)C)) u − Gv(t), z0 = v(t)·[1, 0, 0, −G, G]⊤ + (u/((a−1)C))·[0, 1, 1, G, −G]⊤.
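(As a side check, the matrix chain of Example 3.8 can be verified numerically. The following sketch is not from the thesis; the parameter values G = 2, C = 3, a = 5 are arbitrary sample choices.)

```python
import numpy as np

G_, C_, a_ = 2.0, 3.0, 5.0                    # arbitrary sample parameters
A = np.array([[0.0], [-1.0], [1.0], [0.0], [0.0]])
D = np.array([[0.0, -C_, C_, 0.0, 0.0]])
B = np.array([[G_, -G_, 0, 1, 0],
              [-G_, G_, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, -a_, 1, 0, 0],
              [1, 0, 0, 0, 0]], dtype=float)
Q0 = np.array([[1, 0, 0, 0, 0],               # projector onto N0 = ker(AD)
               [0, 1, 0, 0, 0],
               [0, 1, 0, 0, 0],
               [0, 0, 0, 1, 0],
               [0, 0, 0, 0, 1]], dtype=float)

G0 = A @ D                                    # G0 = AD (rank-1 outer product)
G1 = G0 + B @ Q0                              # G1 = G0 + B Q0
# det(G1) equals (a - 1)*C, so G1 is nonsingular for a != 1 (index 1).
```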


If the amplification factor a tends towards infinity, the inherent regular ODE reads u′ = −Gv(t), so that the potential at node 3 is u3 = −(G/C) ∫ v(t) dt and the circuit indeed performs an integration.

For a = 1 the matrix G1 is singular and the matrix sequence needs to be extended in order to determine the index:

Q1 = (1/G) [ 0   0    0   0  0 ]        [ G   −G   0  1  0 ]
           [ 0  −C    C   0  0 ]        [ −G  C+G −C  0  0 ]
           [ 0 −C−G  C+G  0  0 ] , G2 = [ 0   −C   C  0  1 ]
           [ 0 −CG    CG  0  0 ]        [ 0   −1   1  0  0 ]
           [ 0  CG   −CG  0  0 ]        [ 1    0   0  0  0 ].

Notice that DP1D− = 0 such that G2 = G1 + BP0Q1. The projector Q1 chosen here is already the canonical one, i.e. imQ1 = N1 and kerQ1 = S1. This can be seen by checking the relation Q1 = Q1G2⁻¹BP0. Calculating detG2 = −G shows that the index is 2 and we obtain the decoupled system

u′ = 0, z1 = 0, z0 = v(t) · [1, 1, 1, 0, 0]⊤.

Since our main focus is on dealing with equations having index 1 or 2, we will not dwell on higher index DAEs any further. We content ourselves with remarking that studying fine and complete decouplings naturally leads to the investigation of normal forms for linear DAEs with tractability index µ. Under suitable regularity assumptions it can be shown that each regular index-µ equation is equivalent to a DAE in standard canonical form (SCF),

y′ + My = Ly q,
Nw′ + w = Lw q,

where N is strictly upper triangular. See [42, 118] for more details.

Currently there is much interest in studying the relationship of the strangeness and the tractability index using the corresponding normal forms. It has been shown that both index concepts essentially agree for µ = 1, 2, . . . , 5 (using the counting of the tractability index). It is conjectured that the relationship “µ_tractability = µ_strangeness + 1” holds for the general case as well. More details can be found in [108].

3.3 Examples

The framework of the tractability index might not be as easily accessible as the differentiation or the perturbation index but it is most convenient for studying DAEs in electrical circuit simulation. The matrix sequence and the decoupling procedure introduced above provide a detailed insight into the structure of a given DAE. For electrical networks the index µ = 0, 1, 2 can be checked using fast and reliable algorithms based on graph theory [61, 63].


However, the tractability index does not only offer a refined structural analysis, but it often leads to significant improvements for numerical computations. In particular the properly stated leading term plays a crucial role here.

Example 3.9. Consider the well-known problem of Gear and Petzold [73],

[ 0  0  ]          [ 1  ηt  ]        [ e−t ]
[ 1  ηt ] x′(t) +  [ 0  1+η ] x(t) = [ 0   ],   (3.17a)

which constitutes a linear index-2 DAE. The parameter η is a real number. It is shown in [73] that BDF methods fail completely for η = −1. For other parameter values −1 < η < −0.5 the numerical solution is exponentially unstable. In [84] the DAE (3.17a) is said to pose difficulties to every numerical method.

These statements are confirmed by the numerical results given in Figure 3.1. There (3.17a) was solved on the interval [0, 3] using the BDF2 formula and the RadauIIA method with two stages. A constant stepsize h = 0.05 was used. The exact solution is given by x1(t) = (1 − ηt)e−t and x2(t) = e−t, such that x0 = [1, 1]⊤ is a consistent initial value. Obviously both numerical methods fail even for moderate values of η due to the exponential instability.

[Plots omitted. Figure 3.1 (a): (3.17a), η = −0.275; Figure 3.1 (b): (3.17a), η = −0.3. Each panel shows BDF2, RADAU3 and the exact solution over time.]

Figure 3.1: Numerical solution (2nd component). Using the standard formulation (3.17a) the solution explodes even for moderate values of η.

[Plots omitted. Figure 3.2 (a): (3.17b), η = −0.275; Figure 3.2 (b): (3.17b), η = −0.3. Each panel shows BDF2, RADAU3 and the exact solution over time.]

Figure 3.2: Numerical solution (2nd component). The properly stated DAE (3.17b) yields correct results and no difficulties arise.


For comparison consider the following equivalent reformulation using a properly stated leading term,

[ 0 ]                        [ 1  ηt ]        [ e−t ]
[ 1 ] ( [ 1  ηt ] x(t) )′ +  [ 0  1  ] x(t) = [ 0   ].   (3.17b)

Solving the reformulated problem yields correct numerical results as can be seen in Figure 3.2. In particular the accuracy of the numerical result does not depend on the parameter η. We will see later, in Section 9.1, that this behaviour is ensured by the two subspaces DN1 = R, DS1 = {0} being constant, a fact that was first observed in [90].
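The experiment behind Figures 3.1 and 3.2 can be reproduced with a few lines of NumPy. The sketch below is my reconstruction, not the thesis code: only BDF2 is implemented, with a single implicit Euler starting step, using h = 0.05 on [0, 3].

```python
import numpy as np

def bdf2_standard(eta, h=0.05, T=3.0):
    # BDF2 on the standard form (3.17a): E(t) x' + F(t) x = q(t),
    # with E and F frozen at t_{n+1} in each step.
    E = lambda t: np.array([[0.0, 0.0], [1.0, eta * t]])
    F = lambda t: np.array([[1.0, eta * t], [0.0, 1.0 + eta]])
    q = lambda t: np.array([np.exp(-t), 0.0])
    x = [np.array([1.0, 1.0])]                # consistent value at t0 = 0
    n = round(T / h)
    t = h                                     # implicit Euler starting step
    x.append(np.linalg.solve(E(t) / h + F(t), q(t) + E(t) @ x[0] / h))
    for k in range(2, n + 1):
        t = k * h
        M = 1.5 * E(t) / h + F(t)
        rhs = q(t) + E(t) @ (2.0 * x[-1] - 0.5 * x[-2]) / h
        x.append(np.linalg.solve(M, rhs))
    return x[-1]

def bdf2_proper(eta, h=0.05, T=3.0):
    # BDF2 on the properly stated form (3.17b): A (D(t) x)' + B(t) x = q(t);
    # the difference quotient now acts on u_k = D(t_k) x_k.
    A = np.array([[0.0], [1.0]])
    D = lambda t: np.array([[1.0, eta * t]])
    B = lambda t: np.array([[1.0, eta * t], [0.0, 1.0]])
    q = lambda t: np.array([np.exp(-t), 0.0])
    x = [np.array([1.0, 1.0])]
    u = [D(0.0) @ x[0]]
    n = round(T / h)
    t = h
    xn = np.linalg.solve(A @ D(t) / h + B(t), q(t) + (A @ u[0]) / h)
    x.append(xn); u.append(D(t) @ xn)
    for k in range(2, n + 1):
        t = k * h
        M = 1.5 * A @ D(t) / h + B(t)
        rhs = q(t) + A @ (2.0 * u[-1] - 0.5 * u[-2]) / h
        xn = np.linalg.solve(M, rhs)
        x.append(xn); u.append(D(t) @ xn)
    return x[-1]
```

For η = −0.3 the standard form exhibits the exponential instability of Figure 3.1, while the properly stated form stays close to the exact solution x2(t) = e−t.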

The numerical results for Example 3.9 indicate that an appropriate formulation of the problem ensures a correct behaviour of the numerical solution. The standard formulation Ex′ + Fx = q, by contrast, inevitably leads to serious difficulties. This effect can be studied in detail for the following example.

Example 3.10. Consider the simple linear DAE

λt u′ + (λ − 1)z′ = 0, (λt − 1)u + (λ − 1)z = 0 (3.18a)

depending on the parameter λ ∈ R. J.C. Butcher received this example from Caren Tischendorf. It is documented in [24].

Differentiating the second equation of (3.18a) and inserting the result into the first one shows that

u′ = λu, z = ((λt − 1)/(1 − λ)) u. (3.18b)

Hence the dynamics is given by the inherent ordinary differential equation u′ = λu. The remaining component z is fixed in terms of u. A numerical method, when applied to (3.18a), should reflect this situation properly, i.e. we expect the decoupled system (3.18b) to be discretised correctly.

For simplicity consider the implicit Euler scheme. Discretising (3.18a) yields

(λtn+1/h)(un+1 − un) + ((λ − 1)/h)(zn+1 − zn) = 0,
(λtn − 1)un + (λ − 1)zn = 0.

Solving the second equation for zn, the first equation simplifies to

un+1 = un + hλun.

Thus the inherent ODE is in fact discretised using the explicit Euler scheme. This will have severe consequences such as strong stepsize restrictions due to stability requirements.

Similar to the previous example, the DAE

[ 1 ] (            [ z(t) ] )′   [  0    −λ   ] [ z(t) ]
[ 0 ] ( [ λ−1 λt ] [ u(t) ] )  + [ λ−1  λt−1 ] [ u(t) ] = 0


is an equivalent reformulation of (3.18a). Due to the properly stated leading term it is possible to ensure that the implicit Euler scheme leads to

un+1 = un + hλun+1, (1 − λ)zn = (λtn − 1)un,

such that the inherent ODE is indeed discretised by the implicit scheme. In order to guarantee this behaviour, one has to ensure that imD remains constant [89].
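The two discretisations of Example 3.10 can be compared directly. In the following sketch (hypothetical parameter choice λ = −50 and h = 0.1, not taken from the thesis) the implicit Euler scheme is applied to both formulations by solving the 2×2 linear system in each step; the standard form realises the explicit recursion un+1 = (1 + hλ)un and blows up for stiff λ, while the properly stated form realises un+1 = un/(1 − hλ).

```python
import numpy as np

def implicit_euler_standard(lam, h, n_steps):
    # Implicit Euler on the standard form (3.18a), unknowns (u, z):
    #   lam*t*u' + (lam-1)*z' = 0,  (lam*t-1)*u + (lam-1)*z = 0.
    u, z = 1.0, 1.0 / (lam - 1.0)        # consistent values at t0 = 0
    for n in range(n_steps):
        t1 = (n + 1) * h
        M = np.array([[lam * t1 / h, (lam - 1.0) / h],
                      [lam * t1 - 1.0, lam - 1.0]])
        rhs = np.array([(lam * t1 * u + (lam - 1.0) * z) / h, 0.0])
        u, z = np.linalg.solve(M, rhs)
    return u

def implicit_euler_proper(lam, h, n_steps):
    # Implicit Euler on the properly stated reformulation, unknowns (z, u);
    # w_n = D(t_n) x_n = (lam-1) z_n + lam*t_n*u_n carries the leading term.
    z, u = 1.0 / (lam - 1.0), 1.0
    w = (lam - 1.0) * z                  # t0 = 0
    for n in range(n_steps):
        t1 = (n + 1) * h
        M = np.array([[(lam - 1.0) / h, lam * t1 / h - lam],
                      [lam - 1.0, lam * t1 - 1.0]])
        rhs = np.array([w / h, 0.0])
        z, u = np.linalg.solve(M, rhs)
        w = (lam - 1.0) * z + lam * t1 * u
    return u
```

With λ = −50 and h = 0.1 the exact dynamics u′ = λu decays rapidly; the standard discretisation amplifies u by the factor 1 + hλ = −4 per step, whereas the properly stated one damps it by 1/(1 − hλ) = 1/6.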

4 Nonlinear Differential Algebraic Equations

In the previous chapter linear differential algebraic equations

A(t)[D(t)x(t)]′ + B(t)x(t) = q(t) (4.1)

were studied. This class is of key relevance for analysing DAEs since important features such as the properly stated leading term and the decoupling procedure can be introduced. However, when modelling complex processes in real applications one is most often confronted with nonlinear equations.

Example 4.1. Logical elements such as the NAND gate in Figure 4.1 are the central building blocks for complex adding circuits and registers. The NAND gate consists of two n-channel enhancement MOSFETs1 (ME), one n-channel depletion MOSFET (MD) and a load capacitance C. The transistors can be produced in CMOS technology leading to low power consumption, high density and a fast switching behaviour [78].

The gate voltages of the two MEs are controlled by input signals V1 and V2. The response at node 1 is low (false) only in the case of both input signals being high (true). The NAND gate thus represents the logical expression ¬(V1 ∧ V2).

For the numerical simulation of the NAND gate the MOSFETs are replaced by suitable equivalent circuits. Using the companion model from Figure 4.1 (b) the modified nodal analysis leads to a DAE of the form

A[q(x(t))]′ + b(x(t), t) = 0 (4.2)

where x = [u1 · · · u12 I1 I2 IBB IDD]⊤ comprises all node potentials and the branch currents of the four voltage sources. A ∈ R16×13 is a constant matrix and the functions q and b are given in Table 4.1.

The nonlinear functions qgd, qgs, qdb, qsb and ibd, ibs, ids as well as all parameter values are given in [151]. Notice that both the leading term and the function b contain nonlinearities.

1MOSFET: Metal Oxide Semiconductor Field Effect Transistor.


[Diagrams omitted. Figure 4.1 (a): Circuit diagram for the NAND gate. Figure 4.1 (b): Equivalent circuit (drain, gate, source, bulk) replacing the MOSFETs.]

Figure 4.1: The NAND gate model taken from [151].

q(x) = [ C u1,
         qgd(u1−u3), qgs(u1−u2), qdb(u3−u12), qsb(u2−u12),
         qgd(u5−u7), qgs(u5−u6), qdb(u7−u12), qsb(u6−u12),
         qgd(u8−u10), qgs(u8−u9), qdb(u10−u12), qsb(u9−u12) ]⊤

b(x, t) = [ (1/Rs)(u1−u2) − (1/Rd)(u7−u1),
            (1/Rs)(u2−u1) + (1/Rsd)(u2−u3) + ibs(u12−u2) + ids(u3−u2, u1−u2, u12−u2),
            (1/Rd)(u3−u4) − (1/Rsd)(u2−u3) + ibd(u12−u3) − ids(u3−u2, u1−u2, u12−u2),
            (1/Rd)(u4−u3) + IDD,
            I1,
            (1/Rs)(u6−u11) + (1/Rsd)(u6−u7) + ibs(u12−u6) + ids(u7−u6, u5−u6, u12−u6),
            (1/Rd)(u7−u1) − (1/Rsd)(u6−u7) + ibd(u12−u7) − ids(u7−u6, u5−u6, u12−u6),
            I2,
            (1/Rsd)(u9−u10) + u9/Rs + ibs(u12−u9) + ids(u10−u9, u8−u9, u12−u9),
            (1/Rd)(u10−u11) − (1/Rsd)(u9−u10) + ibd(u12−u10) − ids(u10−u9, u8−u9, u12−u9),
            (1/Rs)(u11−u6) − (1/Rd)(u10−u11),
            −ibs(u12−u2) − ibd(u12−u3) − ibs(u12−u6) − ibd(u12−u7) − ibs(u12−u9) − ibd(u12−u10) + IBB,
            u4 − VDD,
            u12 − VBB,
            u5 − V1(t),
            u8 − V2(t) ]⊤

Table 4.1: The functions q and b for the NAND gate model


In this chapter nonlinear equations such as (4.2) are addressed. In Section 4.1 the concept of regular DAEs with tractability index µ is extended to the nonlinear case. For index-1 equations a decoupling procedure similar to the results of Section 3.2 is used to prove existence and uniqueness of solutions in Section 4.2. The case of nonlinear index-2 equations is considerably harder to analyse. Hence the analysis is split into two parts:

Many electrical circuits, when modelled using the modified nodal analysis, lead to a special structure where the so-called index-2 components enter the model equations linearly. Equations of this type are treated in Chapter 5. The case of general index-2 equations with a properly stated leading term is addressed in Chapter 6. There it is shown how a careful analysis of index-2 equations within the framework of the tractability index leads to criteria for the existence and uniqueness of solutions that are easily accessible in practical applications.

4.1 The Index of Nonlinear DAEs

As a nonlinear version of (4.1) we consider differential algebraic equations

A(t)[d(x(t), t)]′ + b(x(t), t) = 0, t ∈ I. (4.3)

A : I → Rm×n is assumed to be a continuous matrix function defined on the interval I ⊂ R. As seen in Example 4.1 the mappings d : D×I → Rn and b : D×I → Rm represent nonlinearities of the model. d and b are defined on D×I ⊂ Rm+1 where D represents some domain in Rm. We assume both functions to be continuous. Furthermore it is assumed that continuous partial derivatives dx = ∂d/∂x and bx = ∂b/∂x exist.

Similar to Definition 3.2 the notion of a properly stated leading term is needed. Observe that in case of linear DAEs with d(x, t) = D(t)x(t) the matrices A and D = dx were used to define properly stated DAEs. Thus we arrive at this straightforward generalisation:

Definition 4.2. The leading term of (4.3) is properly stated if

kerA(t) ⊕ im dx(x, t) = Rn ∀ (x, t) ∈ D×I,

and if there is a smooth projector function R ∈ C1(I, Rn×n) such that

kerR(t) = kerA(t), imR(t) = im dx(x, t) and d(x, t) = R(t)d(x, t)

for every (x, t) ∈ D×I.

This definition was given in [114]. In particular, im dx does not depend on x for properly stated DAEs.

If x satisfies (4.3) pointwise, then it is sufficient for x to be continuous, but the mapping t ↦ d(x(t), t) needs to be differentiable. Hence a function x ∈ C(Ix, Rm), Ix ⊂ I, is said to be a solution of (4.3) provided that


• for every t ∈ Ix the mapping x takes values in D,
• the mapping t ↦ d(x(t), t) is continuously differentiable,
• and x satisfies the DAE (4.3) pointwise for t ∈ Ix.

Unfortunately this concept of a solution does not lead to a linear function space. Thus we consider DAEs with a linear leading term,

A(t)[D(t)x(t)]′ + b(x(t), t) = 0.   (4.4)

DAEs of this type are sometimes referred to as quasi-linear. Obviously solutions lie in the linear space

C1_D(I, Rm) := { z ∈ C(I, Rm) | Dz ∈ C1(I, Rn) }.

From the point of view of mathematical analysis (4.4) is much easier to handle than (4.3), but the quasi-linear form (4.4) does not seem to be general enough to include circuits such as the NAND gate from Example 4.1. Luckily this is not the case, as the general formulation (4.3) can be transformed into (4.4) by considering the enlarged system

A(t)[R(t)y(t)]′ + b(x(t), t) = 0,   (4.5a)
y(t) − d(x(t), t) = 0.   (4.5b)

A new variable y(t) = d(x(t), t) is introduced that allows the equation to be split artificially in two. R is the projector function from Definition 4.2 characterising the properly stated leading term.

With x̃ = (x; y), Ã = (A; 0), D̃ = (0  R) and b̃(x̃, t) = (b(x, t); y − d(x, t)), the enlarged system (4.5) is seen to be of type (4.4). Even though (4.5) may have twice the dimension of (4.3), this does not pose a problem for practical computations. Whenever the variable y is referenced, an explicit evaluation of (4.5b) is possible. Hence the computational effort for solving (4.3) and (4.5) will be the same. In fact, implementations do rely on (4.3), but the mathematical analysis is carried out using the enlarged system (4.5) in the formulation (4.4).

This approach is justified in [88, 114]. There it is shown that (4.3) and (4.5) are equivalent in the sense that there is a one-to-one correspondence of solutions, subspaces, the index and so on. Thus it is no restriction to concentrate on DAEs of the form (4.4).

Remark 4.3. In the literature [46, 84, 85] nonlinear DAEs are most often assumed to be given in Hessenberg form

y′ = f(y, z),   0 = g(y).

For index 2 one has to ensure that gy·fz has a bounded inverse in the neighbourhood of a solution.


In general the MNA equations and in particular the split system (4.5) are not of Hessenberg type. Conversely, the class of Hessenberg DAEs is included in the formulation (4.4) by considering

x = (y; z),   A = (I; 0),   D = (I  0),   b(x, t) = (−f(y, z); g(y)).

As in the case of linear DAEs one introduces admissible sequences of matrix functions and subspaces in order to analyse nonlinear DAEs with a properly stated leading term. The details of this construction can be found in [116, 117, 146]. Here we will content ourselves with summarising the key results necessary for our later investigations.

Pointwise for t ∈ I, x ∈ D we introduce

G0(t) = A(t)D(t),   B(x, t) = bx(x, t) = ∂b/∂x (x, t),
N0(t) = ker G0(t),  S0(x, t) = { z ∈ Rm | B(x, t)z ∈ im G0(t) }.   (4.6a)

Observe that due to the nonlinearity b the matrix B(x, t) as well as the subspace S0(x, t) are state-dependent in general. If Q0 ∈ C(I, Rm×m) is again a projector function onto N0, then G1, N1 and S1 depend on x as well,

G1(x, t) = G0(t) + B(x, t)Q0(t),
N1(x, t) = ker G1(x, t),
S1(x, t) = { z ∈ Rm | B(x, t)z ∈ im G1(x, t) }.   (4.6b)

Let Q1(x, t) be a projector onto N1(x, t) and define P0 = I − Q0, P1 = I − Q1. It is possible to extend this sequence by introducing

C(x1, x, t) = (DP1D−)x(x, t) x1 + (DP1D−)t(x, t),
B1(x1, x, t) = B(x, t)P0(t) − G1(x, t)D−(t)C(x1, x, t)D(t),
G2(x1, x, t) = G1(x, t) + B1(x1, x, t)Q1(x, t)   (4.6c)

for t ∈ I, x ∈ D. As in the previous chapter D−(t) is the generalised reflexive inverse of D(t) defined by

DD−D = D,   D−DD− = D−,   D−D = P0,   DD− = R.

The newly introduced variable x1 ∈ Rm is necessary for a correct definition of the matrix C. To better understand this construction recall that for linear DAEs the matrix B1 is defined as

B1(t) = B(t)P0(t) − G1(t)D−(t)(DP1D−)′(t)D(t).


For nonlinear DAEs, P1 depends on x and so does DP1D− in general. If x is a smooth mapping that takes values in D, then

d/dt (DP1D−)(x(t), t) = (DP1D−)x(x(t), t) x′(t) + (DP1D−)t(x(t), t) = C(x′(t), x(t), t).

For a moment let us consider the linearisation of (4.4) along x. Denote the coefficients of the corresponding linear DAE by

Ā(t) = A(t),   D̄(t) = ∂d/∂x (x(t), t),   B̄(t) = ∂b/∂x (x(t), t).

Then the above construction ensures that Ḡ1(t) = G1(x(t), t) as well as Ḡ2(t) = G2(x′(t), x(t), t), revealing a close relationship between the matrix sequence and the corresponding sequence for the linearised equation.

Definition 4.4. The sequence (4.6) is said to be admissible if

(a) rank G0 = r0, rank G1 = r1 and rank G2 = r2 are constant on I,
(b) N0 ⊂ ker Q1,
(c) Q0 ∈ C(I, Rm×m), Q1 ∈ C(D×I, Rm×m), DP1D− ∈ C1(D×I, Rn×n).

Comparing with Definition 3.3 we have defined admissibility up to index 2. In [117, 121] these sequences are extended to higher index cases as well. Doing so the technical effort increases considerably as further additional variables xi need to be introduced. For our investigations the case of index 1 and 2 is most relevant. Hence attention will be restricted to this situation.

Definition 4.5. The DAE (4.4) with a properly stated leading term is regular with tractability index µ ∈ {1, 2} on D×I if there is an admissible sequence (4.6) such that r_{µ−1} < r_µ = m.

For µ ∈ {1, 2} this definition generalises the corresponding Definition 3.4 for linear DAEs.

Remark 4.6. A regular index-1 DAE can be equivalently characterised by the condition N0(t) ∩ S0(x, t) = {0} for (x, t) ∈ D×I. On the other hand, a regular index-2 DAE can be equivalently characterised by requiring dim(N0(t) ∩ S0(x, t)) = const > 0 and N1(x, t) ∩ S1(x, t) = {0} for (x, t) ∈ D×I (see also Remark 3.5).

An important consequence of Definition 4.5 concerns linearisations of regular index-µ equations.

Theorem 4.7. Let (4.4) be a regular index-µ DAE in the sense of Definition 4.5. Then for every function x ∈ C1(I, Rm) with x(t) ∈ D the linearisation along x is regular with tractability index µ.


Proof. This result follows at once from the above construction. More details are given in [117].

Example 4.8. We want to finish this section by determining the index of the nonlinear circuit from Figure 4.2. This example is taken from [61], but a slightly modified version will be used here. In particular the resistor is modelled using a nonlinear conductance such that iR = (2+t)u1². The current source is controlled by the current iV through the voltage source. These are rather artificial parameters and it is not claimed that the circuit is of any practical use. Nevertheless, the circuit from Figure 4.2 allows a better understanding of how to put the above construction into practice.

The charge oriented modified nodal analysis leads to the DAE

[1 0; 0 1; 0 0] ( [1 0 0; 0 1 0] (u1; u2; iV) )′ + ( (2+t)u1² + iV + i(iV, t); −iV; u1 − u2 − v(t) ) = 0,

where linear capacitors with capacity C1 = C2 = 1 F are used. Constructing the matrix sequence (4.6) yields

G0 = [1 0 0; 0 1 0; 0 0 0],   Q0 = [0 0 0; 0 0 0; 0 0 1],   D− = [1 0; 0 1; 0 0],

B = [2(2+t)u1  0  ψ−1;  0  0  −1;  1  −1  0].

The function ψ(iV, t) = ∂i(iV, t)/∂iV + 2 was introduced in order to shorten notation. It is assumed that ψ remains nonzero for all arguments (iV, t). Notice that B depends not only on (u1, t) but also on iV via ψ. Since the matrix

G1 = AD + BQ0 = [1 0 ψ−1; 0 1 −1; 0 0 0]

is singular, the index of this circuit is µ ≥ 2. The matrix sequence can be extended by considering the projector

Q1 = (1/ψ) · [ψ−1  1−ψ  0;  −1  1  0;  −1  1  0]

onto N1 = ker G1. This particular choice of Q1 leads to

C = (DP1D−)x x1 + (DP1D−)t = φ · [−1 1; −1 1],   B1 = [2(2+t)u1+φ  −φ  0;  φ  −φ  0;  1  −1  0]

[Figure 4.2: A nonlinear circuit with a current-controlled current source]


with φ(iV, t) = (∂ψ(iV, t)/∂t)/ψ(iV, t)². Finally we arrive at

G2 = G1 + B1Q1 = [1+φ+2(2+t)u1(ψ−1)/ψ   −φ−2(2+t)u1(ψ−1)/ψ   ψ−1;  φ  1−φ  −1;  1  −1  0]

and det G2 = −ψ. As we assumed ψ ≠ 0, the index is 2 due to G2 being nonsingular. Observe that Q1 is indeed the canonical projector onto N1 along S1, as can be seen by computing Q1 = Q1G2⁻¹BP0.
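The computations of this example can be verified symbolically. The following sketch treats ψ, φ and u1 as independent symbols (standing for ψ(iV, t), φ(iV, t) and the node potential) and checks that Q1 is a projector annihilated by G1 and that det G2 = −ψ, independently of u1 and φ:

```python
# Symbolic check of the matrix sequence of Example 4.8, with psi and phi
# treated as free symbols (they abbreviate psi(iV,t) and phi(iV,t)).
import sympy as sp

t, u1, psi, phi = sp.symbols('t u1 psi phi')

G1 = sp.Matrix([[1, 0, psi - 1],
                [0, 1, -1],
                [0, 0, 0]])
Q1 = sp.Matrix([[psi - 1, 1 - psi, 0],
                [-1, 1, 0],
                [-1, 1, 0]]) / psi
B1 = sp.Matrix([[2*(2 + t)*u1 + phi, -phi, 0],
                [phi, -phi, 0],
                [1, -1, 0]])

# Q1 is a projector onto N1 = ker G1 ...
assert (Q1*Q1 - Q1).applyfunc(sp.simplify) == sp.zeros(3, 3)
assert (G1*Q1).applyfunc(sp.simplify) == sp.zeros(3, 3)

# ... and G2 = G1 + B1*Q1 has determinant -psi, hence is nonsingular
G2 = G1 + B1*Q1
print(sp.simplify(G2.det()))   # → -psi
```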

4.2 Index-1 Equations

Nonlinear differential algebraic equations with index 1 are studied in [88]. This paper is an important step towards investigating nonlinear DAEs within the framework of the tractability index. Unfortunately this approach cannot be easily generalised to index-2 equations.

Nevertheless similar ideas will mark the starting point for studying properly stated index-2 equations later on. As motivation and for later reference we will roughly sketch the main results of [88] in this section.

The central tool will be an appropriate decoupling procedure. In contrast to Section 3.2 the implicit function theorem [67] has to be used in order to treat nonlinear equations

A(t)[D(t)x(t)]′ + b(x(t), t) = 0.   (4.4)

Recall that the index-1 case is precisely characterised by the fact that the matrix G1(x, t) = A(t)D(t) + B(x, t)Q0(t) remains nonsingular on D×I. In terms of the subspaces N0 and S0 this is equivalent to requiring N0(t) ∩ S0(x, t) = {0} for every point (x, t) ∈ D×I [3, 75]. Each solution x∗ of (4.4) satisfies the obvious constraint

x∗(t) ∈ M0(t) = { z ∈ D | b(z, t) ∈ im A(t) }.

In order to split x∗ into its characteristic parts, it is advantageous to introduce

u∗(t) = D(t)x∗(t),   w∗(t) = D−(t)u′∗(t) + Q0(t)x∗(t).   (4.7)

The definition of w∗ and DQ0 = 0 imply Dw∗ = DD−u′∗ = Ru′∗. Thus the component u∗ satisfies the ordinary differential equation

u′∗(t) = (Ru∗)′(t) = R′(t)u∗(t) + R(t)u′∗(t) = R′(t)u∗(t) + D(t)w∗(t).

Using the projector functions Q0, P0 and the decomposition

x∗ = P0x∗ + Q0x∗ = D−u∗ + Q0w∗,


the index-1 DAE (4.4) can be equivalently written as²

0 = A(t)D(t)w∗(t) + b(D−(t)u∗(t) + Q0(t)w∗(t), t) =: F(w∗(t), u∗(t), t).

The mapping F(w, u, ·) = ADw + b(D−u + Q0w, ·) can be studied without assuming that x∗ is a solution of (4.4). It turns out that due to the index-1 condition the equation F(w, u, t) = 0 can locally be solved for w.

Lemma 4.9. Given (x0, t0) ∈ D×I and y0 ∈ im D(t0) such that

x0 ∈ M0(t0),   N0(t0) ∩ S0(x0, t0) = {0},   A(t0)y0 + b(x0, t0) = 0,

denote u0 = D(t0)x0 and w0 = D−(t0)y0 + Q0(t0)x0. Then there exist δ > 0 and a continuous mapping w : Bδ(u0) × I → Rm satisfying

w(u0, t0) = w0,   F(w(u, t), u, t) = 0   ∀ (u, t) ∈ Bδ(u0) × I.

Proof. It suffices to compute

F(w0, u0, t0) = A(t0)y0 + b(x0, t0) = 0,
Fw(w0, u0, t0) = A(t0)D(t0) + bx(x0, t0)Q0(t0) = G1(x0, t0).

Hence the assertion follows from the implicit function theorem as G1 is nonsingular for index-1 equations.

Using this result we know that every solution x can be written as

x(t) = D−(t)u(t) + Q0(t)w(u(t), t)   (4.8a)

where u(t) = D(t)x(t) satisfies the ordinary differential equation

u′(t) = R′(t)u(t) + D(t)w(u(t), t).   (4.8b)

The representation (4.8) provides deep insight into the structure of index-1 equations. As in Section 3.2 the equation (4.8b) is called the inherent regular ODE.

Theorem 4.10. Let the assumptions of Lemma 4.9 be satisfied.

(a) The inherent regular ODE (4.8b) is uniquely determined by the DAE, i.e. it does not depend on the particular choice of Q0.
(b) im D is a (time-varying) invariant subspace of (4.8b), i.e. the condition u(t0) ∈ im D(t0) implies u(t) ∈ im D(t) for all t ∈ I.
(c) If im D is constant, then (4.8b) simplifies to u′(t) = D(t)w(u(t), t).
(d) Through each x0 ∈ M0(t0) passes exactly one solution of (4.4).

Proof. The proof can be found in [88].

² To see this observe that A = AR = ADD−.
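The decoupling (4.8) can also be exercised computationally. The following sketch uses a hypothetical index-1 toy DAE, x1′ + x1 + x2 = 0, x2 − x1² = 0 (so A = (1; 0), D = (1 0), R = 1, u = x1), which is not from the text; F(w, u, t) = 0 is solved numerically inside the inherent regular ODE, and the result is compared against the closed-form solution u(t) = 1/(2eᵗ − 1) of u′ = −u − u², u(0) = 1.

```python
# Sketch of the index-1 decoupling (4.8) for the toy DAE
#   x1' + x1 + x2 = 0,  x2 - x1^2 = 0   (hypothetical example).
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import fsolve

def F(w, u):
    # F(w,u,t) = A*D*w + b(D^- u + Q0 w, t), names as in Section 4.2
    x1, x2 = u, w[1]
    return [w[0] + x1 + x2, x2 - x1**2]

def inherent_ode(t, u):
    w = fsolve(F, x0=[0.0, 0.0], args=(u[0],))   # local solve, cf. Lemma 4.9
    return [w[0]]                                 # u' = D w(u,t), R' = 0

sol = solve_ivp(inherent_ode, (0.0, 1.0), [1.0], rtol=1e-10, atol=1e-12)
u1 = sol.y[0, -1]
exact = 1.0 / (2.0*np.exp(1.0) - 1.0)             # closed form of u' = -u - u^2
print(abs(u1 - exact) < 1e-6)                     # → True
```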


5 Exploiting the Structure of DAEs in Circuit Simulation

The previous chapter gave a brief introduction to the tractability index for nonlinear DAEs

A(t)[d(x(t), t)]′ + b(x(t), t) = 0,   t ∈ I.   (5.1)

The index-1 case was seen not to pose serious difficulties as the decoupling procedure from Section 4.2 is applicable. In particular, if the subspace im dx(x, t) is constant, then numerical methods, when applied to (5.1), are known to discretise the inherent regular ODE correctly (compare Theorem 4.10 (c) and see [88] for more details).

In Chapter 1 the matrix A turned out to be constant for MNA equations. Given that the capacitance matrix C = ∂qC(x, t)/∂x and the inductance matrix L = ∂φL(x, t)/∂x are positive definite, the subspace im dx(x, t) varies with neither x nor t. Index-1 equations can therefore be treated efficiently, and qualitative properties of the inherent ODE are preserved by stiffly accurate Runge-Kutta schemes but also when using BDF methods [88, 89].

Given the structural assumptions from [61, 63] the index of (5.1) typically does not exceed 2. However, Theorem 1.4 shows that the index-2 case appears frequently in applications. In particular loops of capacitors and voltage sources (with at least one voltage source) lead to index-2 configurations. Similarly, cutsets of inductors and current sources give rise to index-2 equations as well.

It is unfortunate that the approach taken for index-1 equations cannot be generalised easily to index-2 DAEs of the general structure (5.1). It is not known how to define suitable variables such that just one application of the implicit function theorem yields a decoupled system such as (4.8). It is a mere supposition that such an approach is not possible for general index-2 equations (5.1) with a properly stated leading term.

However, if we restrict attention to DAEs of the special form

A(t)[D(t)x(t)]′ + b(U(t)x(t), t) + B(t)T(t)x(t) = 0,   (5.2)


where

N0(t) ∩ S0(x, t) does not depend on x,

then it is possible to extend the decoupling procedure of Section 4.2 to (5.2). Roughly speaking, (5.2) ensures that the so-called index-2 components Tx enter the equations linearly. This situation and the additional projector functions U and T will be discussed in detail later on.

Equations of the form (5.2) were already studied in [61]. There it is shown that, given some additional assumptions on the controlled sources of an electrical network, the charge oriented modified nodal analysis leads to DAEs having the special structure (5.2) (see e.g. [61, Corollary 3.2.8]).

[Figure 5.1: The NAND gate with the equivalent circuit from Figure 4.1 (b)]

Example 5.1. Consider once again the NAND gate from Example 4.1. If the companion model from Figure 4.1 (b) is inserted into the circuit diagram Figure 4.1 (a), it becomes obvious that the capacitors of the MOSFET model lead to several CV-loops. In Figure 5.1 these loops are indicated using dashed lines. Hence the NAND gate comprises an index-2 system (see Theorem 1.4).

The numerically unstable index-2 components are given by the branch currents of voltage sources belonging to these loops, i.e. I1, I2 and IBB for the NAND gate. Looking at the MNA equations presented in Example 4.1 on page 67 these components are found to enter the equations linearly.

We will explore equations of the type (5.2) in more detail. Recall that for index-2 DAEs N1(x, t) ⊕ S1(x, t) = Rm holds for (x, t) ∈ D×I. Thus Q1 will always be chosen to be the canonical projector onto N1 along S1.

5.1 Decoupling Charge-Oriented MNA Equations

For regular DAEs (5.1) the characteristic subspaces

N0(t) = ker A(t)D(t),   S0(x, t) = { z ∈ Rm | B(x, t)z ∈ im A(t) }

were introduced in Section 4.2. Strictly speaking the enlarged system (4.5) was used instead of (5.1). Recall from (4.6) that B(x, t) = bx(x, t) = ∂b/∂x (x, t).


The index is µ = 1 if and only if these two subspaces intersect transversely, i.e. N0 ∩ S0 = {0}. For µ = 2 the intersection N1 ∩ S1 = {0} is trivial, but the subspace N0 ∩ S0 has constant nonzero dimension (see Remark 4.6). The latter subspace will be of vital importance for the subsequent analysis.

The following result is given in [61].

Lemma 5.2. Let Q0 and Q1 be projector functions defined in the sequence¹ (4.6). Independently of the particular choice of Q0 and Q1 the relation

im Q0(t)Q1(x, t) = N0(t) ∩ S0(x, t)

is satisfied for (x, t) ∈ D×I.

Proof. Let z = Q0Q1w ∈ im Q0Q1. Obviously, z ∈ im Q0 = N0. From

Bz = BQ0Q1w = (G1 − AD)Q1w = −ADQ1w ∈ im A

it is found that z ∈ S0, and therefore z ∈ N0 ∩ S0.

On the other hand, if z ∈ N0 ∩ S0, then z ∈ N0 = im Q0 implies z = Q0z. Since z ∈ S0 there is a vector w such that Bz = BQ0z = Aw. Due to the properly stated leading term the matrix A satisfies A = AR = ADD− (see page 57) such that Bz = ADD−w follows.

Defining u = D−w − z yields G1u = (AD + BQ0)(D−w − z) = 0 and therefore u ∈ ker G1 = im Q1. Since Q0D− = Q0D−DD− = Q0P0D− = 0, we can conclude that z = Q0z = Q0(D−w − u) = −Q0u = −Q0Q1u ∈ im Q0Q1.

Recall from (3.11b′′) that in the case of linear index-2 equations the component

z0 = Q0x = Q0P1G2⁻¹q − K0D−u + Q0Q1D−(Dz1)′

is calculated by differentiating Dz1 (see page 61). It becomes clear that the derivative (Dz1)′ does not affect z0 as a whole, but only the component that belongs to im Q0Q1 = N0 ∩ S0.

Following [153] we introduce a projector function T onto N0 ∩ S0 and define U = I − T. Calculating Uz0 requires no differentiation at all. But in order to determine Tz0 the component Dz1 needs to be differentiated. The vector Tx = Tz0 is therefore sometimes referred to as the index-2 variables.

Observe that im T = N0 ∩ S0 ⊂ N0 = ker P0 implies TP0 = 0, such that we have indeed

Tx = T(D−u + z1 + z0) = T(P0P1x + P0Q1x + z0) = Tz0.

¹ Recall that the admissible sequences of Definition 4.4 leave some freedom for choosing the projector functions Qi.


On the other hand im T ∩ im P0 = {0} shows that without loss of generality the projector T can always be chosen such that P0T = 0. Taking into account that Q1Q0 = 0 (cf. Definition 4.4) the following relations hold:

Q0T = T = TQ0,   P0U = P0 = UP0,   Q1T = 0,   UQ0Q1 = 0.   (5.3)

The benefit of introducing the additional projector functions U and T was realised in [61, Corollary 3.2.8]. Given the assumptions from [61, 63] on controlled sources, the structural condition

N0(t) ∩ S0(x, t) does not depend on x   (5.4)

is found to be satisfied. Therefore T can be chosen to be independent of the state x and the equations of the charge oriented modified nodal analysis exhibit the particular structure

A(t)[D(t)x(t)]′ + b(U(t)x(t), t) + B(t)T(t)x(t) = 0,   (5.5a)

or, dropping t arguments,

A[Dx]′ + b(Ux, ·) + BTx = 0.   (5.5b)

In fact, in [61] the subspace N0(t) ∩ S0(x, t) is shown to be constant and choosing a constant projector T becomes possible. Nevertheless we will use the slightly more general time-dependent version.

As in Section 4.2 let x∗ be a solution of (5.5) and denote x0 = x∗(t0). We introduce new variables

u∗ = DP̄1x∗,   w∗ = P̄1D−(Dx∗)′ + (Q0 + Q̄1)x∗,   (5.6)

where P̄1(t) = P1(x0, t) and Q̄1(t) = Q1(x0, t). Here and in the sequel t arguments are generally omitted for better readability. The bar-notation for P̄1, Q̄1 is used to indicate that the x argument is replaced by the fixed initial value x0.

By simple manipulations it turns out that

Q̄1w∗ = Q̄1x∗,   Q0w∗ = −Q0Q̄1D−(Dx∗)′ + Q0x∗ + Q0Q̄1x∗,   DP̄1w∗ = DP̄1D−(Dx∗)′.

Similar to the case of linear equations the solution x∗ can be split as

x∗ = P0P̄1x∗ + P0Q̄1x∗ + Q0x∗ = D−u∗ + (P0Q̄1 + Q0P̄1)w∗ + Q0Q̄1D−(Dx∗)′.   (5.7)

Keeping (5.3) in mind, the component Ux∗ = D−u∗ + (P0Q̄1 + UQ0)w∗ is found to be given in terms of u∗ and w∗ only. This is of particular importance, as


we are aiming at rewriting (5.5) in terms of these new variables. In order to achieve this goal it remains to rewrite A(Dx∗)′ + BTx∗ using u∗ and w∗.

As a first step recall that im Q̄1 = ker G1(x0, ·) and therefore

0 = G1(x0, ·)Q̄1 = (AD + BT)Q̄1 + bx(Ux0, ·)UQ0Q̄1 = (AD + BT)Q̄1.

Using this relation we can calculate

A(Dx∗)′ + BTx∗ = (AD + BT)D−(Dx∗)′ + BTx∗
             = (AD + BT)P̄1D−(Dx∗)′ + BTx∗ + (AD + BT)Q̄1x∗
             = (AD + BT)[P̄1D−(Dx∗)′ + TQ0x∗ + UQ0x∗ + Q̄1x∗]
             = (AD + BT)w∗,

such that the DAE (5.5) can be written equivalently as

F(u∗, w∗, ·) := (AD + BT)w∗ + b(D−u∗ + (P0Q̄1 + UQ0)w∗, ·) = 0.   (5.8)

Lemma 5.3. Let (5.5) be a regular DAE with index µ ∈ {1, 2}. Assume that (x0, t0) ∈ D×I and y0 ∈ im D(t0) satisfy

A(t0)y0 + b(U(t0)x0, t0) + B(t0)T(t0)x0 = 0.

Denote

u0 = D(t0)P̄1(t0)x0,   w0 = P̄1(t0)D−(t0)y0 + (Q0 + Q̄1)(t0)x0

and consider F from (5.8) being defined on a neighbourhood N0 ⊂ Rn × Rm × R of (u0, w0, t0). Then there is a neighbourhood N1 ⊂ Rn × R of (u0, t0) and a continuous mapping w : N1 → Rm such that

w(u0, t0) = w0,   F(u, w(u, t), t) = 0   ∀ (u, t) ∈ N1.

Proof. Due to (5.8) we have F(u0, w0, t0) = 0 and

Fw(u, w, ·) = AD + BT + bx(D−u + (P0Q̄1 + UQ0)w, ·)(P0Q̄1 + UQ0).

We will show that Fw(u0, w0, t0) and G2(x1, x0, t0) have common rank for every x1 ∈ Rm. Since G2(x1, x0, t0) is nonsingular due to the index-2 condition, Fw(u0, w0, t0) will have the same property. Then the assertion follows from the implicit function theorem.

Recall that if (5.5) is an index-1 equation, the matrix sequence (4.6) implies that G2 = G1 is nonsingular. Hence index-1 DAEs are covered as well.

For the structure (5.5) the matrix sequence (4.6) yields

B(x, ·) = bx(Ux, ·)U + BT,
B1(x1, x, ·) = B(x, ·)P0 − G1(x, ·)D−C(x1, x, ·)D.


The matrix B1 is used to define

G2(x1, x, ·) = G1(x, ·) + B1(x1, x, ·)Q1(x, ·).

Simple calculations show that this matrix can be factorised as

[AD + B(x, ·)(Q0 + P0Q1(x, ·))] [I − P1(x, ·)D−C(x1, x, ·)DQ1(x, ·)].

The second factor is nonsingular with I + P1(x, ·)D−C(x1, x, ·)DQ1(x, ·) being its inverse. The first factor, on the other hand, is independent of x1. Evaluated at the point (x0, t0) this factor happens to agree with Fw(u0, w0, t0). Therefore Fw(u0, w0, t0) and G2(x1, x0, t0) have indeed common rank.

Notice that the mapping w from the previous lemma is defined only locally around (u0, t0). For simplicity we assume that the definition domain is sufficiently large to cover the whole interval I. If this is not the case, an appropriately smaller interval has to be considered.

Finally, from (5.7) the solution representation

x∗ = D−u∗ + (Q0P̄1 + P0Q̄1)w(u∗, ·) + Q0Q̄1D−( u∗ + DQ̄1w(u∗, ·) )′ = D−u∗ + z0∗ + z1∗   (5.9a)

can be derived. Similar to the case of linear DAEs the abbreviations

u∗ = DP̄1x∗,
z1∗ = P0Q̄1x∗ = P0Q̄1w(u∗, ·),
z0∗ = Q0x∗ = Q0P̄1w(u∗, ·) + Q0Q̄1D−(DP̄1D−)′u∗ + Q0Q̄1D−(Dz1∗)′

have been used. Observe that u∗ = DP̄1x∗ satisfies the ordinary differential equation

u′ = (DP̄1D−)′u + DP̄1w(u, ·) + (DP̄1D−)′DQ̄1w(u, ·).   (5.9b)

This equation will be called the inherent regular ODE. Once u∗ is determined, the component z1∗ is given in terms of u∗ and the mapping w from Lemma 5.3. Finally, Dz1∗ needs to be differentiated in order to calculate z0∗ = Q0x∗.

Similar to Section 4.2 and [88] the inherent regular ODE (5.9b) can be studied without assuming the existence of a solution x∗ of the DAE (5.5).

Theorem 5.4. Let the assumptions of Lemma 5.3 be satisfied. Then

(a) im DP̄1D− is a (time-varying) invariant subspace of the inherent ODE (5.9b), i.e. u(t0) ∈ im (DP̄1D−)(t0) implies u(t) ∈ im (DP̄1D−)(t) for every t ∈ I.
(b) If the subspaces im DP̄1D− and im DQ̄1D− are constant, then (5.9b) simplifies to u′ = DP̄1w(u, ·).


Proof. The proof uses the same techniques as [88, Theorem 2.2], but R needs to be replaced by DP̄1D−. The above result is also a special case of Lemma 6.6 from Section 6.1. Thus a proof can be found there as well.

In contrast to index-1 equations, the space M0(t) introduced on page 74 is no longer filled with solutions. It is clear from the decoupling (5.9) that for every x0 ∈ Rm the component u0 = D(t0)P1(x0, t0)x0 uniquely determines a solution. This is made more precise in the following result.

Theorem 5.5. Let the assumptions of Lemma 5.3 be satisfied and let u be the solution of the inherent regular ODE (5.9b) with initial value u(t0) = DP1(x0, t0)x0. If the mapping t ↦ D(t)Q1(x0, t)w(u(t), t) belongs to the class of C1-functions, then there is a unique solution x ∈ C1_D(I, Rm) of the initial value problem

A[Dx]′ + b(Ux, ·) + BTx = 0,   DP1(x0, t0)(x0 − x(t0)) = 0.   (5.10)

Proof. The mapping w from Lemma 5.3 defines the inherent regular ODE (5.9b). Let u be the solution satisfying u(t0) = DP1(x0, t0)x0. Then Theorem 5.4 ensures that u(t) belongs to im D(t)P̄1(t)D−(t) for every t. Define

x = D−u + z0 + z1 = D−u + (Q0P̄1 + P0Q̄1)w(u, ·) + Q0Q̄1D−( u + DQ̄1w(u, ·) )′.

This mapping is indeed a solution of (5.10) since

A(Dx)′ + b(Ux, ·) + BTx = A(Dx)′ + b(Ux, ·) + BTx − F(u, w(u, ·), ·)
  = (AD + BT)Q̄1D−(Dx)′ + ADP̄1D−(Dx)′ − ADP̄1w(u, ·) = 0

and DP1(x0, t0)x(t0) = u(t0) = DP1(x0, t0)x0. The decoupling procedure described above shows that this solution is unique.

Example 5.6. Consider again the circuit from Figure 4.2 on page 73. In contrast to Example 4.8 the node potentials will now be denoted by ei in order to avoid confusion with the variable u of the inherent regular ODE (5.9b).

For simplicity choose

i(iV, t) = t · iV   and   v(t) ≡ −1   (5.11)

for t ≥ 0 such that ψ = 2 + ∂i(iV, t)/∂iV = 2 + t and φ = (∂ψ(iV, t)/∂t)/ψ(iV, t)² = 1/(2+t)². The matrix sequence was already calculated in Example 4.8. The particular choice of i(iV, t), v(t) adopted here leads to

B = [2(2+t)e1  0  1+t;  0  0  −1;  1  −1  0],   Q0 = [0 0 0; 0 0 0; 0 0 1],   Q1 = 1/(2+t) · [1+t  −1−t  0;  −1  1  0;  −1  1  0]


and

G2 = 1/(2+t)² · [ (2+t)²+1+2e1(1+t)(2+t)²   −1−2e1(1+t)(2+t)²   (1+t)(2+t)² ;
                  1                         (2+t)²−1            −(2+t)² ;
                  (2+t)²                    −(2+t)²             0 ].

Observe that G2 is nonsingular for t ≥ 0 with det G2 = −(2+t), such that the index is µ = 2. Having Q0 and Q1 at our disposal we can determine

N0 ∩ S0 = im Q0Q1 = im [ 0 0 0;  0 0 0;  −1/(2+t)  1/(2+t)  0 ]

and thus T = Q0 with Tx = [0  0  iV]⊤. Looking at Figure 4.2 it turns out that the current source is controlled by the index-2 variable iV – a situation that will hardly occur in real applications. However, choosing different functions for i(iV, t) it is possible to study the precise consequences of how the index-2 components enter the equations. The nonlinear case will be addressed later in Example 6.2. Here the choice (5.11) ensures that Tx appears linearly such that this section's results apply.

The MNA equations can be rewritten in the form (5.5),

[1 0; 0 1; 0 0] ( [1 0 0; 0 1 0] (e1; e2; iV) )′ + ( (2+t)e1²; 0; e1−e2+1 ) + [0 0 1+t; 0 0 −1; 0 0 0] (0; 0; iV) = 0,

with a time-dependent matrix B. An initialisation is given by the triple

(y0, x0, t0) = ( [−1/2  0]⊤, [1/2  3/2  0]⊤, 0 )

since Ay0 + b(Ux0, t0) + B(t0)Tx0 = 0.

The mapping F from (5.8) reads

F(u, w, t) = ( w1 + (1+t)w3 + ((2+t)u1 + (1+t)(w1−w2))²/(2+t) ;
               w2 − w3 ;
               u1 − u2 + w1 − w2 + 1 ).

Due to the index-2 condition it is possible to solve the system F(u, w, t) = 0 for

w = w(u, t) = [ (1+u1−u2)(t+2) − (1+u1−u2+(u2−1)(t+2))² ] / (2+t)² · (1; 1; 1) − (1+u1−u2) · (1; 0; 0)

and the inherent regular ODE (5.9b) reads

(u1; u2)′ = 1/(2+t)² · (u1 + u2(1+t) − t)(2 − u1 − u2(1+t) + t) · (1; 1),   u(0) = DP̄1(t0)x0 = (1; 1).

The two components u1 and u2 necessarily coincide, and

u1′ = (2u1(1+t) − t)/(2+t) − u1²

yields the unique solution u1 = u2 = 1. Using (5.9) we derive the solution

z1 = ( −(1+t)/(2+t); 1/(2+t); 0 ),   z0 = ( 0; 0; −1/(2+t)² ),

x = D−u + z0 + z1 = 1/(2+t) · ( 1; 3+t; −1/(2+t) ).

Notice that neither u = DP̄1x = (e1 + e2(1+t))/(2+t) · (1; 1) nor z1 = P0Q̄1x = (e1 − e2)/(2+t) · (1+t; −1; 0) depends on the index-2 component Tx = [0  0  iV]⊤ – a fact stressing the decoupled structure of (5.9).


It is stressed that the decoupling procedure is not meant to be used for solving DAEs as we did in the previous example. The decoupling procedure is a mathematical tool that will be used in Chapter 9 to study numerical methods for differential algebraic equations.

Obviously there is a close relationship between the decoupling procedure studied in this section and the one considered in Section 4.2 for index-1 equations. In particular, if (5.5) is an index-1 DAE, then Q1 = 0 and P1 = I provide a re-interpretation of this section's results also for index-1 DAEs. In this case we find T = 0 and U = I such that considering DAEs with the special structure (5.5) is no restriction at all for index-1 equations. Finally, since (5.6) reduces to u = Dx and w = D−(Dx)′ + Q0x, we find that the decoupling procedure for index-2 equations (5.5), when applied to index-1 DAEs, coincides with the formalism of Section 4.2.

Unfortunately, (5.6) is not suited for decoupling index-2 DAEs of the more general structure (5.1). Observe that due to x∗ = D−u∗ + (P0Q̄1 + Q0P̄1)w∗ + Q0Q̄1D−(Dx∗)′ the component Ux∗ = D−u∗ + (P0Q̄1 + UQ0)w∗ can be written in terms of u∗ and w∗. However, this is not possible for x∗ itself. Hence writing the original DAE (5.1) in terms of the new variables u∗ and w∗ turns out to be rather difficult if not impossible.

But still, a more refined splitting using three instead of two components makes it possible to obtain similar results also for (5.1). In the next chapter we will investigate this situation in more detail.


6 Properly Stated Index-2 DAEs

The object of this chapter is to study nonlinear index-2 DAEs with a properly stated leading term exhibiting the general structure

A(t)[d(x(t), t)]′ + b(x(t), t) = 0.   (6.1)

Up to now different simplified versions of (6.1) were addressed in this thesis. Linear DAEs

Up to now different simplified versions of (6.1) were addressed in this thesis.Linear DAEs

A(t)[D(t)x(t)

]′+B(t)x(t) = q(t)

were treated in Chapter 3. There basic ideas such as the properly stated leadingterm or the decoupling procedure were introduced. In Chapter 4 the conceptof the tractability index was introduced for nonlinear DAEs. We saw thatinstead of treating DAEs of the form (6.1) it is often advantageous to studythe equivalent enlarged system

A(t)[R(t)y(t)]′ + b(x(t), t) = 0,
y(t) − d(x(t), t) = 0,

such that it suffices to treat DAEs of the form

A(t)[D(t)x(t)]′ + b(x(t), t) = 0.   (6.2)

Index-1 equations of this type were analysed in Section 4.2, but in Chapter 5 we needed to restrict attention to DAEs with the special structure

A(t)[D(t)x(t)]′ + b(U(t)x(t), t) + B(t)T(t)x(t) = 0   (6.3)

in order to address index-2 equations. Here the index-2 variables Tx enter the equations linearly. DAEs having this structure are of particular importance for scientific computing in circuit simulation as the equations of the charge-oriented modified nodal analysis are of this form [61]. Having this application area in mind we found the important structural condition

N0(t) ∩ S0(x, t) does not depend on x   (6.4)


to be satisfied as well. Hence a wide range of differential algebraic equations and in particular many electrical circuits are already covered by the results presented so far. Nevertheless, the scope of our investigations is not yet wide enough.

Example 6.1. Consider a DAE in Hessenberg form

y′ = f(y, z),  0 = g(y),    (6.5)

having index 2, i.e. gyfz is assumed to be nonsingular. This system can be cast into the form (6.2) by introducing

x = [ y ; z ],  A = [ I ; 0 ],  D = [ I 0 ],  b(x, t) = [ −f(y, z) ; g(y) ].

The matrix sequence (4.6) from Section 4.1 yields

G0 = AD = [ I 0 ; 0 0 ],  Q0 = [ 0 0 ; 0 I ],  B = [ −fy −fz ; gy 0 ],  G1 = [ I −fz ; 0 0 ].

It turns out that N0 = ker G0 = {(y, z) | y = 0} and S0 = {(y, z) | B[ y ; z ] ∈ im G0} = {(y, z) | gy y = 0} such that N0 ∩ S0 = N0 is always constant. T = Q0 is a projector onto N0 ∩ S0 and the index-2 variables Tx = [ 0 ; z ] may enter the equation (6.5) in a nonlinear way.

The above example shows that DAEs in Hessenberg form may not be covered by the structure (6.3) even though they can be cast into the form (6.2). Hence, choosing the model problem (6.2) for our investigations, both Hessenberg DAEs and MNA equations will be covered.
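The computations in Example 6.1 are easy to reproduce numerically. The following sketch uses a hypothetical concrete instance f(y, z) = (z, y2)ᵀ, g(y) = y1 (so that gyfz = 1 is nonsingular, i.e. the index is 2) and checks that G1 is singular while N0 ∩ S0 = N0 holds independently of x:

```python
import numpy as np

# Hypothetical Hessenberg instance: y in R^2, z in R,
# f(y, z) = (z, y2), g(y) = y1, hence gy*fz = 1 is nonsingular.
A  = np.array([[1., 0.], [0., 1.], [0., 0.]])
D  = np.array([[1., 0., 0.], [0., 1., 0.]])
fy = np.array([[0., 0.], [0., 1.]])
fz = np.array([[1.], [0.]])
gy = np.array([[1., 0.]])

G0 = A @ D                                  # = diag(1, 1, 0)
Q0 = np.diag([0., 0., 1.])                  # projector onto N0 = ker G0
B  = np.block([[-fy, -fz], [gy, np.zeros((1, 1))]])
G1 = G0 + B @ Q0

# G1 is singular, so the index is at least 2
assert np.linalg.matrix_rank(G1) == 2

# N0 is spanned by e3; B e3 lies in im G0 = span{e1, e2},
# hence N0 is contained in S0 and N0 ∩ S0 = N0 for every x
n0 = np.array([0., 0., 1.])
assert abs((B @ n0)[2]) < 1e-14
```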

In the sequel properly stated DAEs (6.2) will be addressed where the index-2 components may enter nonlinearly. The special case of DAEs in Hessenberg form will be discussed later in Section 6.2.

Allowing the index-2 components to enter the equations in a nonlinear way complicates matters considerably. New qualitative features that have not been present before will have a major impact on the analysis.

Example 6.2. As a motivation consider once more the circuit from Figure 4.2. In Example 5.6 on page 83 we chose i(iV, t) = t · iV such that the index-2 variables Tx = [ 0 0 iV ]ᵀ enter the equations linearly.

For i(iV, t) = iV³ this is no longer true and the matrix sequence changes to

D = [ 1 0 0 ; 0 1 0 ],  Q0 = T = [ 0 0 0 ; 0 0 0 ; 0 0 1 ],  Q1 = 1/(2+3iV²) [ 1+3iV² −1−3iV² 0 ; −1 1 0 ; −1 1 0 ].

In contrast to the situation from Example 5.6 the index-2 variable iV has a clear influence on

u = DP1x = (e1 + e2(1+3iV²))/(2+3iV²) [ 1 ; 1 ],  z1 = P0Q1x = (e1 − e2)/(2+3iV²) [ 1+3iV² ; −1 ; 0 ].


Consequently, a decoupled system such as (5.9) and in particular an inherent ordinary differential equation (5.9b) cannot be expected for general nonlinear DAEs.

The above example shows that errors in the components z0, z1 may influence the dynamical part u. This was already observed numerically in [152]. For nonlinear DAEs (6.1) a coupling between u and Dz1 will become visible. This forms a stark contrast to the results obtained in the previous section for DAEs of the form (6.3).

The coupled structure makes an analysis more complicated. However, a careful investigation of the equation's properties and a more refined decoupling procedure will make a result quite similar to (5.9) possible. The price we have to pay is to change from the inherent regular ODE to an inherent implicit index-1 DAE.

6.1 A Decoupling Procedure for Index-2 DAEs

We consider differential algebraic equations (DAEs)

A(t)[D(t)x(t)]′ + b(x(t), t) = 0    (6.6)

with continuous coefficients. As in the previous sections let D ⊂ Rm be a domain and I ⊂ R an interval. A(t) ∈ Rm×n and D(t) ∈ Rn×m are rectangular matrices in general. We assume that A and D are continuous matrix functions and that the leading term is properly stated in the sense of Definition 3.2.

The nonlinear function b : D×I → Rm is defined for (x, t) ∈ D×I. It is assumed that b and the partial derivative bx are continuous.

The index of differential algebraic equations (6.6) with a properly stated leading term is defined using admissible sequences (4.6) from Section 4.1. For convenience and for later reference these definitions are repeated here. Recall that pointwise for t ∈ I, x ∈ D we introduced

G0(t) = A(t)D(t),  B(x, t) = bx(x, t),
N0(t) = ker G0(t),  S0(x, t) = { z ∈ Rm | B(x, t)z ∈ im G0(t) },
G1(x, t) = G0(t) + B(x, t)Q0(t),
N1(x, t) = ker G1(x, t),  S1(x, t) = { z ∈ Rm | B(x, t)z ∈ im G1(x, t) }.    (6.7a)

The continuous projector function Q0 ∈ C(I, Rm×m) is chosen in such a way that im Q0 = N0. Similarly let Q1 ∈ C(D×I, Rm×m) be a projector function onto N1. The complementary projectors

P0(t) = I − Q0(t),  P1(x, t) = I − Q1(x, t)


will be used frequently. The index of (6.6) is defined to be µ = 1 if N0 and S0 intersect transversally, i.e. N0 ∩ S0 = {0} or, equivalently, if G1 remains nonsingular on D×I. For index-2 equations the intersection N0 ∩ S0 has a constant positive dimension, but N1 ∩ S1 = {0}.

In Definition 4.5 the index was defined in terms of the matrices Gi. Therefore consider

C(x¹, x, t) = (DP1D⁻)x(x, t)x¹ + (DP1D⁻)t(x, t),
B1(x¹, x, t) = B(x, t)P0(t) − G1(x, t)D⁻(t)C(x¹, x, t)D(t),
G2(x¹, x, t) = G1(x, t) + B1(x¹, x, t)Q1(x, t)    (6.7b)

for t ∈ I, x ∈ D and x¹ ∈ Rm. The generalised reflexive inverse D⁻(t) is again uniquely defined by requiring

DD⁻D = D,  D⁻DD⁻ = D⁻,  D⁻D = P0,  DD⁻ = R,

where the projector function R originates from the definition of the properly stated leading term.
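For a concrete rectangular D these four conditions are easy to check numerically; the sketch below uses, purely as an illustration, the matrices D and D⁻ that occur in Example 6.8 later in this chapter:

```python
import numpy as np

# D and its reflexive generalised inverse from Example 6.8 (illustrative)
D      = np.array([[1., 0., 0.], [0., 1., 0.]])   # maps R^3 -> R^2
Dminus = np.array([[1., 0.], [0., 1.], [0., 0.]])
P0     = Dminus @ D    # projector along ker D, = diag(1, 1, 0)
R      = D @ Dminus    # projector onto im D,   = I_2

assert np.allclose(D @ Dminus @ D, D)
assert np.allclose(Dminus @ D @ Dminus, Dminus)
assert np.allclose(Dminus @ D, P0)
assert np.allclose(D @ Dminus, R)
```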

According to Definition 4.4 the sequence (6.7) is admissible if rank Gi = ri is constant on the interval I for i = 0, 1, 2, the subspace N0 = ker G0 satisfies N0 ⊂ ker Q1 and

Q0 ∈ C(I, Rm×m),  Q1 ∈ C(D×I, Rm×m),  DP1D⁻ ∈ C¹(D×I, Rn×n).

The DAE (6.6) is regular with index µ = 2 if there is an admissible sequence (6.7) such that r0 ≤ r1 < r2 = m. Thus G2(x¹, x, t) remains nonsingular on Rm×D×I and we have

N1(x, t) ⊕ S1(x, t) = Rm

as is shown e.g. in [75]. Hence, without restriction we will always assume that Q1 is the canonical projector onto N1 along S1. Of course we have to require that this particular choice of Q1 leads to an admissible sequence.

As in Chapter 5 let T ∈ C(D×I, Rm×m) be a projector function such that

im T(x, t) = N0(t) ∩ S0(x, t) = im Q0(t)Q1(x, t).

It was remarked earlier that T can be chosen such that

Q0T = T = TQ0,  P0U = P0 = UP0,  Q1T = 0    (6.8)

(see Lemma 5.2 and equation (5.3) on page 80). As usual, U = I − T denotes the complementary projector function corresponding to T.

It is well known that in order to prove the existence of solutions for differential algebraic equations, the DAE has to satisfy certain structural conditions. In [99] the DAE is assumed to satisfy a hypothesis based on the derivative array. In contrast to that, [122] requires Q1G2⁻¹(b(x, t) − b(P0x, t)) = 0. Unfortunately, for the application in circuit simulation the former approach is not feasible as it uses the derivative array. On the other hand the latter requirement is too restrictive in the sense that there are DAEs arising from the modified nodal analysis that do not satisfy this condition [153]. Hence in [61, 153] the generalised structural condition

(A1) N0(t) ∩ S0(x, t) does not depend on x

was introduced. This condition already played a crucial role in Chapter 5.

Notice that for equations arising from the modified nodal analysis the space N0(t) ∩ S0(x, t) is found to be constant [61]. Thus for applications in circuit simulation (A1) always holds automatically. For DAEs from other application backgrounds, (A1) can be checked using linear algebra tools in practice [107].

Even though properly stated index-2 DAEs of the general structure (6.6) are considered in this chapter, we still need to require the structural condition (A1). It will turn out that (A1) is sufficient for proving the local existence and uniqueness of solutions. Thus in circuit simulation there is no need to check complicated conditions that guarantee the existence of solutions.

It will be pointed out later where this condition becomes necessary. One immediate consequence is that the projector T can be chosen to be independent of x.

6.1.1 Splitting of DAE Solutions

For the moment let us assume that there is a solution x∗ ∈ C¹D(I, Rm) of the regular index-2 DAE (6.6). Then, by definition,

A(t0)y0 + b(x0, t0) = 0    (6.9)

provides a consistent initialisation with (y0, x0, t0) = ((Dx∗)′(t0), x∗(t0), t0).

Some of the matrix functions defined in the sequence (6.7) depend not only on the time t but also on the arguments x and x¹. In order to obtain suitable evaluation points we introduce an arbitrary mapping x̄ ∈ C¹(I, Rm) satisfying

x̄(t0) = x0,  (x̄(t), t) ∈ D×I  ∀ t ∈ I,    (6.10)

and consider

and consider

Q1(t) = Q1

(x(t), t

), P1(t) = P1

(x(t), t

), (6.11)

G1(t) = G1

(x(t), t

), B(t) = B

(x(t), t

), G2(t) = G2

(x′(t), x(t), t

)defined for t ∈ I. Observe that the above construction ensures

C(x′, x, ·) =(DP1D

−)x(x, ·) x′ + (DP1D

−)t(x, ·) = ddt

(DP1D

−)(x, ·)


for the matrix C from (6.7b). In order for this relation to hold, x̄ ∈ C¹(I, Rm) was chosen to be sufficiently smooth.

If the solution itself satisfied x∗ ∈ C¹(I, Rm), then x̄ = x∗ would be an obvious choice. The constant function x̄ ≡ x0 may be considered as well. This particular choice was adopted in Chapter 5.

We are now in a position to define functions

u(t) = D(t)P̄1(t)x∗(t),  w(t) = T(t)x∗(t),  z(t) = Z(t)x∗(t)    (6.12)

for t ∈ I. The matrix Z(t) is given by the mapping

Z : I → Rm×m,  t ↦ Z(t) = P0(t)Q̄1(t) + U(t)Q0(t).

Using the properties from (6.8) it becomes clear that Z is again a projector function, i.e. Z² = Z. Figure 6.1 shows how u, w and z are obtained from x∗ by successive splitting.

In contrast to similar splittings previously considered, z0∗ = Q0x∗ is divided into its U and T part, respectively. Uz0∗ is then merged with the component z1∗ = P0Q̄1x∗ in order to obtain z = z1∗ + Uz0∗ = Zx∗. On the other hand the component w = TQ0x∗ = Tz0∗ is considered as a new variable. The solution x∗ can be written as

x∗(t) = D⁻(t)u(t) + z(t) + w(t).    (6.13)

This refined splitting of z0∗ provides a clear separation of algebraic and differential components. Looking at (3.15) for linear equations, z = z1∗ + Uz0∗ can be calculated by an algebraic relation but in order to obtain w = Tz0∗ the component Dz1∗ = Dz needs to be differentiated¹. A similar situation is valid for nonlinear DAEs (6.6) as will become clear in due course.

As the solution x∗ belongs to C¹D(I, Rm) = { x ∈ C(I, Rm) | Dx ∈ C¹(I, Rn) }, the algebraic part z ∈ C¹D(I, Rm) has the same property. Hence we consider v(t) = D(t)z(t) = D(t)Q̄1(t)x∗(t), such that it is possible to write

(Dx∗)′ = (DP̄1x∗ + DQ̄1x∗)′ = u′ + v′.

Figure 6.1: The relation between u, z, w and the solution x∗. The solution is split successively: x∗ into P0x∗ and z0∗ = Q0x∗; then P0x∗ into D⁻DP̄1x∗ and z1∗ = P0Q̄1x∗, and z0∗ into UQ0x∗ and TQ0x∗. This yields D⁻u, z = z1∗ + Uz0∗ and w = Tz0∗.

¹Recall that DUQ0 = −DTQ0 = DP0T = 0 and thus DZ = DQ̄1. Additionally im T = im Q0Q1 is satisfied such that TQ0Q1 = Q0Q1.
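Both the projector property Z² = Z and the footnote's identity DZ = DQ̄1 can be confirmed numerically; the sketch below uses, as an illustration, the concrete matrices D, Q0 and Q̄1 that are obtained for Example 6.8 later in this chapter:

```python
import numpy as np

D  = np.array([[1., 0., 0.], [0., 1., 0.]])
Q0 = np.diag([0., 0., 1.]); P0 = np.eye(3) - Q0
T  = Q0                    # here im T = N0 ∩ S0 = N0, so T = Q0 and U = P0
U  = np.eye(3) - T
Q1 = np.array([[0., 0., 0.], [-1., 1., 0.], [-1., 1., 0.]])

Z = P0 @ Q1 + U @ Q0
assert np.allclose(Z @ Z, Z)         # Z is a projector function
assert np.allclose(D @ Z, D @ Q1)    # DZ = DQ1 as claimed in the footnote
```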


In order to decouple the DAE into its characteristic parts, we will rewrite (6.6) in terms of the new variables introduced above. Using the notation

f(y, x, t) = A(t)y + b(x, t)

the DAE (6.6) can be written as

0 = f((Dx∗)′(t), x∗(t), t)
  = f(u′(t) + v′(t), D⁻(t)u(t) + z(t) + w(t), t)
  = F(u(t), w(t), z(t), u′(t), v′(t), t),  t ∈ I.    (6.14a)

The function F is defined by

F(u, w, z, η, ζ, t) = f(η + ζ, D⁻(t)u + Z(t)z + T(t)w, t)    (6.14b)

where u, w, z, η and ζ are considered to be parameters. Z and T were introduced for convenience when defining F. Because of (6.12) they do not change (6.14a) at all but will be quite useful when calculating derivatives of F.

Lemma 6.3. Let (6.6) be a regular DAE with index 2 on D×I. Assume that (6.9) holds for a solution x∗ ∈ C¹(I, Rm). Choose x̄ ∈ C¹(I, Rm) satisfying the condition (6.10) and define

u0 = u(t0),  w0 = w(t0),  z0 = z(t0),  η0 = u′(t0),  ζ0 = v′(t0)

for the functions u, w and z from (6.12). If the structural condition (A1) holds, then locally around (u0, w0, z0, η0, ζ0, t0) equation (6.14) is equivalent to

z(t) = z(u(t), t),  u′(t) = f(u(t), w(t), t),  w(t) = w(u(t), v′(t), t)    (6.15)

with continuous functions z, f and w being defined on neighbourhoods of (u0, t0), (u0, w0, t0) and (u0, ζ0, t0), respectively.

In order to prove this result one splits the function F from (6.14b) using an approach similar to the one depicted in Figure 6.1. Two applications of the implicit function theorem yield the mappings z and w. The function

f(u, w, t) = (DP̄1D⁻)′(t)(u + v(u, t)) − (DP̄1Ḡ2⁻¹)(t) b(D⁻(t)u + z(u, t) + T(t)w, t)    (6.16)

can be written in terms of the original data. The mapping v(u, t) = D(t)z(u, t) is given once z is known. The detailed proof of Lemma 6.3 will be carried out in Section 6.3.


6.1.2 Local Existence and Uniqueness of DAE Solutions

From now on we drop the assumption that there is a solution of (6.6). However, we assume that the triple (y0, x0, t0) ∈ im D(t0)×D×I satisfies

(A2) A(t0)y0 + b(x0, t0) = 0.

This initialisation (y0, x0, t0) does not need to be consistent, i.e. a priori we do not require that there is a solution passing through x0. However, we will use (y0, x0, t0) for the construction of a consistent initialisation. This process can be compared to the step-by-step construction of consistent initial values in [61].

Additionally assume that there is a mapping

(A3) x̄ ∈ C¹(I, Rm),  x̄(t0) = x0,  (x̄(t), t) ∈ D×I  ∀ t ∈ I.

Then the matrix functions defined in (6.11) are used to introduce

u0 = D(t0)P̄1(t0)x0,  w0 = T(t0)x0,  z0 = Z(t0)x0.    (6.17)

Notice that x0 = D⁻(t0)u0 + z0 + w0 holds by construction. We will study the equation

F(u, w, z, η, ζ, t) = 0    (6.18)

without assuming that there is a solution of the original DAE. Observe that the parameters η and ζ replace the derivatives u′ and v′, respectively.

The results obtained in this section are based on the proof of Lemma 6.3. Hence the results will only be quoted here. Full proofs are given in Section 6.3. The first statement provides a function z similar to the one obtained in Lemma 6.3.

Lemma 6.4. Let (6.6) be a regular DAE with index 2 on D×I. Assume that the structural condition (A1) as well as (A2) and (A3) hold. Then the function

Z(t)Ḡ2⁻¹(t)F(u, w, z, η, ζ, t) + (I − Z(t))z =: F1(u, z, t)

is independent of w, η, ζ and there is rz > 0 and a continuous function

z : Brz(u0, t0) → Rm,  with z(u0, t0) = z0,

such that F1(u, z(u, t), t) = 0 for every (u, t) ∈ Brz(u0, t0).

When proving this lemma, the assumption (A1) is crucial as it guarantees that F1 is indeed independent of w (see Section 6.3 for more details).

Using the function z from Lemma 6.4 we introduce v(u, t) = D(t)z(u, t) and define f(u, w, t) as in (6.16). Recall that we had u′(t) = f(u(t), w(t), t) in Lemma 6.3. The function f obtained here will have the same significance. Thus in (6.18) we replace η by f and z by z, respectively. The resulting equation is studied in the next lemma.


Lemma 6.5. Let (6.6) be a regular DAE with index 2 on D×I. Assume that the structural condition (A1) as well as (A2) and (A3) hold. Consider

F2(u, w, ζ, t) = T(t)Ḡ2⁻¹(t)F(u, w, z(u, t), f(u, w, t), ζ, t) + (I − T(t))w

where z is the mapping from Lemma 6.4 and f is given by (6.16). If we define ζ0 = y0 − f(u0, w0, t0), then there is rw > 0 and a continuous function

w : Brw(u0, ζ0, t0) → Rm,  with w(u0, ζ0, t0) = w0,

such that F2(u, w(u, ζ, t), ζ, t) = 0 for every (u, ζ, t) ∈ Brw(u0, ζ0, t0).

The mappings z, v, f and w introduced above allow the construction of a solution. To this end we need to consider the following system of differential algebraic equations

z = z(u, t),  v = v(u, t) = D(t)z(u, t),
u′ = f(u, w, t),  w = w(u, v′, t).

Inserting w into f, it turns out that we have to deal with the implicit DAE

u′ = f(u, w(u, v′, t), t) =: f(u, v′, t)    (6.19a)
v = D(t)z(u, t) =: g(u, t).    (6.19b)

Once u and v are known, the remaining components of the solution x = D⁻u + z + w are given by z = z(u, t), w = w(u, v′, t).

Lemma 6.6. Let (6.6) be a regular DAE with index 2 on D×I. Assume that the structural condition (A1) as well as (A2) and (A3) hold. Let z, f and w be the functions obtained from Lemma 6.4 and 6.5.

(a) The implicit DAE (6.19) has at most (differentiation) index 1.

(b) For every consistent initial condition (u(t0), v(t0)) = (u0, g(u0, t0)) there is a unique solution of (6.19).

(c) If u0 ∈ im D(t0)P̄1(t0), then u(t) ∈ im D(t)P̄1(t) for every t where the solution exists.

Proof. To see (a) it suffices to note that I − (∂f/∂v′)(∂g/∂u) = I − fv′gu is nonsingular in a neighbourhood of (u0, ζ0, t0) (see Remark 6.10 later on page 108). Of course, if fv′ vanishes, then (6.19a) represents an ordinary differential equation.

In order to prove (b) differentiate (6.19b) and insert the result into (6.19a). This yields

u′ = f(u, gu(u, t)u′ + gt(u, t), t) =: f̃(u, u′, t)

and due to (a) it is possible to solve for u′. Thus (6.19) is equivalent to an ordinary differential equation u′ = F(u, t). It remains to solve the initial value problem u′ = F(u, t), u(t0) = u0 to see that (u(t), g(u(t), t)) is the unique solution.


When proving (c), ideas can be employed that were originally used to study the case of linear DAEs [116]. Let (u, v) be a solution of the index-1 system (6.19) satisfying u(t0) ∈ im D(t0)P̄1(t0). Multiplication of (6.19a) by the projector function I − DP̄1D⁻ yields

(I − DP̄1D⁻)(t)u′(t) = −(I − DP̄1D⁻)′(t)(DP̄1D⁻)(t)u(t)

since DP̄1D⁻v(u, ·) = DP̄1P0z(u, ·) = DP̄1P0Zz(u, ·) = 0. This means that ũ = (I − DP̄1D⁻)u satisfies the linear ODE

ũ′ = (I − DP̄1D⁻)′ũ.    (6.20)

The initial condition u(t0) = u0 ∈ im D(t0)P̄1(t0) implies ũ(t0) = 0 and the solution of (6.20) is identically zero, i.e. u(t) ∈ im D(t)P̄1(t) for every t.
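The reduction used in part (b) — inserting v′ = gu(u, t)u′ + gt(u, t) into (6.19a) and solving for u′ — can be sketched for a hypothetical linear model with f(u, ζ, t) = Au + Cζ and g(u, t) = Ku; the matrices A, C, K below are illustrative and not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
C = 0.1 * rng.standard_normal((n, n))   # f_{v'} = C, scaled so I - CK is nonsingular
K = rng.standard_normal((n, n))         # g_u = K (g_t = 0 for simplicity)

u = rng.standard_normal(n)
# u' = A u + C K u'  <=>  (I - CK) u' = A u, solvable since I - CK is nonsingular
udot = np.linalg.solve(np.eye(n) - C @ K, A @ u)
assert np.allclose((np.eye(n) - C @ K) @ udot, A @ u)
```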

Starting from Lemma 6.6 it is straightforward to construct a solution of the original index-2 system (6.6). We collect the result in the following theorem.

Theorem 6.7. Let (6.6) be a regular index-2 DAE on D×I with a properly stated leading term. Assume that the following conditions hold:

(A1) N0(t) ∩ S0(x, t) does not depend on x,

(A2) ∃ (y0, x0, t0) ∈ im D(t0)×D×I such that A(t0)y0 + b(x0, t0) = 0,

(A3) ∃ x̄ ∈ C¹(I, Rm) such that x̄(t0) = x0 and (x̄(t), t) ∈ D×I ∀ t ∈ I,

(A4) the derivatives ∂b/∂x, dD/dt and ∂/∂t (ZḠ2⁻¹b) exist and are continuous.

Then there is a unique local solution of the initial value problem

A(t)[D(t)x(t)]′ + b(x(t), t) = 0,  D(t0)P1(x0, t0)(x(t0) − x0) = 0.    (6.21)

Proof. The mappings z and v(u, t) = D(t)z(u, t) are obtained from Lemma 6.4. Similarly, let w be the mapping defined in Lemma 6.5. Then due to Lemma 6.6 there is a local solution of the implicit index-1 system

u′ = f(u, w(u, v′, t), t),  u(t0) = u0 := D(t0)P1(x0, t0)x0,
v = D(t)z(u, t),  v(t0) = g(u0, t0)

existing for t ∈ Iε = (t0 − ε, t0 + ε) ∩ I with some ε > 0. Using this solution (u(t), v(t)) we define

x∗(t) = D⁻(t)u(t) + z(u(t), t) + w(u(t), v′(t), t).    (6.22)

It remains to check that (6.22) is indeed a solution.

Due to u0 ∈ im(DP̄1)(t0) and Lemma 6.6 we have u(t) ∈ im(DP̄1)(t) for every t ∈ Iε. Recall that R(t) = (DD⁻)(t) is the projector function related to the properly stated leading term and Ru = u shows that

Dx∗ = Ru + DZz(u, ·) + DTw(u, ·) = u + v(u, ·).


Therefore (A4) ensures that Dx∗ is a C¹ mapping². In particular we have DP̄1x∗ = u, Zx∗ = z(u, ·) and Tx∗ = w(u, v′, ·) and, finally,

ZḠ2⁻¹[A(Dx∗)′ + b(x∗, ·)] = F1(u, z(u, ·), ·) = 0,
TḠ2⁻¹[A(Dx∗)′ + b(x∗, ·)] = F2(u, w(u, v′, ·), v′, ·) = 0,
P0P̄1Ḡ2⁻¹[A(Dx∗)′ + b(x∗, ·)] = D⁻(u′ − f(u, w(u, v′, ·), ·)) = 0

(see Remark 6.11 later on). Due to the splitting I = Z + T + P0P̄1 of the identity, we can conclude that A(Dx∗)′ + b(x∗, ·) = 0 such that x∗ is indeed a solution of (6.6). Since D(t0)P1(x0, t0)x∗(t0) = u(t0) = u0 = D(t0)P1(x0, t0)x0, this solution satisfies the initial value problem (6.21).

If there was another solution, say x̃∗, then one could decouple x∗ and x̃∗ as described in Section 6.1.1. Therefore the corresponding DP̄1 parts u and ũ solve the same inherent index-1 system (6.19) and are therefore equal. Because of Lemma 6.3, z and z̃ as well as w and w̃ are also equal, respectively, such that x∗ and x̃∗ coincide.

The smoothness required in order to be able to construct the solution is given when the function v is differentiable with respect to t. The condition (A4) on D and ZḠ2⁻¹b in Theorem 6.7 guarantees this fact but it may be unnecessarily strong in general.

The theorem is given for index-2 DAEs. Using Q1 = 0 and P1 = I it turns out that for index-1 equations the component v vanishes because of DZ = 0. Thus (6.19a) reduces to the inherent regular ordinary differential equation (4.8b) and compared to Theorem 4.10 no additional smoothness is required.

Example 6.8. For illustration we want to employ the decoupling procedure described above for explicitly constructing a solution of the DAE

x1′ + x2² − x3(1 + x3) = gε,t∗(t),
x2′ − x3 = gε,t∗(t),
x1 − x2 = 0,

or equivalently

[ 1 0 ; 0 1 ; 0 0 ] ( [ 1 0 0 ; 0 1 0 ] [ x1 ; x2 ; x3 ] )′ + [ x2² − x3(1+x3) − gε,t∗(t) ; −x3 − gε,t∗(t) ; x1 − x2 ] = 0.

The mapping gε,t∗ appearing on the right-hand side is continuous but not differentiable,

gε,t∗(t) = 0 for 0 ≤ t < t∗ − ε,  (t − t∗)/(2ε) + 1/2 for t∗ − ε ≤ t < t∗ + ε,  1 for t∗ + ε ≤ t;

it ramps linearly from 0 to 1 on the interval [t∗ − ε, t∗ + ε].

When simulating electrical circuits, piecewise linear functions such as gε,t∗ with small ε > 0 are often used to model independent sources that are switched on at t = t∗. In general a treatment on the subintervals [0, t∗) and [t∗, ∞) is not feasible, as the switching point t∗ may not be known in advance.

²Due to the construction of z, the partial derivative vu always exists. Since ZḠ2⁻¹b is smooth, φu(t) = z(u, t) is a C¹ mapping for every fixed u. As D ∈ C¹ as well, we have φu ∈ C¹D. Thus the partial derivative vt exists and is continuous.

According to (A2) we need an initialisation to start with. Here we may use

(y0, x0, t0) = ( [ −5/4 ; −1/2 ], [ 1 ; 1 ; −1/2 ], 0 )  since  Ay0 + b(x0, t0) = 0.

It is straightforward to calculate the matrix sequence (6.7) as

G0 = [ 1 0 0 ; 0 1 0 ; 0 0 0 ],  B = [ 0 2x2 −2x3−1 ; 0 0 −1 ; 1 −1 0 ],  N0 = span{ [ 0 ; 0 ; 1 ] },  Q0 = [ 0 0 0 ; 0 0 0 ; 0 0 1 ],

G1 = G0 + BQ0 = [ 1 0 −2x3−1 ; 0 1 −1 ; 0 0 0 ],  N1 = span{ [ 2x3+1 ; 1 ; 1 ] },  S0 = span{ [ 1 ; 1 ; 0 ], [ 0 ; 0 ; 1 ] } = S1.

As N0 ∩ S0 = N0 is independent of x, the structural condition (A1) holds. A projector onto this subspace is given by T = [ 0 0 0 ; 0 0 0 ; 0 0 1 ].

Also, due to N1 ∩ S1 = {0} for x3 ≠ 0, the index is 2. Calculating the canonical projector

Q1 = 1/(2x3) [ 2x3+1 −2x3−1 0 ; 1 −1 0 ; 1 −1 0 ]

onto N1 along S1 we find that

G2(x¹, x, t) = 1/(2x3²) [ 2x3(x2+x3)−x¹3  −2x2x3+x¹3  −2x3²(2x3+1) ; −x¹3  2x3²+x¹3  −2x3² ; 2x3²  −2x3²  0 ].

This matrix depends on x, t and the auxiliary variable x¹. Choosing x̄(t) ≡ x0, such that (A3) is satisfied, we find

Ḡ2(t) = [ −1 2 0 ; 0 1 −1 ; 1 −1 0 ],  Q̄1 = [ 0 0 0 ; −1 1 0 ; −1 1 0 ],  Z = [ 0 0 0 ; −1 1 0 ; 0 0 0 ].
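These matrices can be reproduced numerically: since x̄ is constant, the matrix C from (6.7b) vanishes, so that Ḡ2 = Ḡ1 + B̄P0Q̄1. A sketch:

```python
import numpy as np

x0 = np.array([1., 1., -0.5])            # initialisation from Example 6.8
x2, x3 = x0[1], x0[2]

B  = np.array([[0., 2*x2, -2*x3 - 1], [0., 0., -1.], [1., -1., 0.]])
G1 = np.array([[1., 0., -2*x3 - 1], [0., 1., -1.], [0., 0., 0.]])
Q1 = (1/(2*x3)) * np.array([[2*x3 + 1, -2*x3 - 1, 0.],
                            [1., -1., 0.],
                            [1., -1., 0.]])
P0 = np.diag([1., 1., 0.])

G2 = G1 + (B @ P0) @ Q1                  # C = 0 because the chosen path is constant
assert np.allclose(G2, [[-1., 2., 0.], [0., 1., -1.], [1., -1., 0.]])
assert abs(np.linalg.det(G2)) > 1e-12    # index 2: G2 is nonsingular
```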

Instead of the original DAE we turn to investigate (6.18) written in terms of the new variables u, w, z, η, ζ and t,

F(u, w, z, η, ζ, t) = [ η1 + ζ1 − w3² − w3 + (u2−z1+z2)² − gε,t∗(t) ; η2 + ζ2 − w3 − gε,t∗(t) ; −u2 + z1 − z2 + u1 ] = 0.

Using the matrix functions defined above we split this mapping into different parts in order to determine z, f and w. Observe that the function

F1(u, z, t) = [ z1 ; u2 − u1 + z2 ; z3 ] = 0

as defined in Lemma 6.4 takes a quite simple form here. In particular it allows the determination of z = z(u, t) = [0, u1 − u2, 0]ᵀ and therefore we get v(u, t) = D(t)z(u, t) = [0, u1 − u2]ᵀ. Notice that v(u, t) is continuously differentiable with respect to both arguments. The mapping

f(u, w, t) = [ w3² + w3 − u1² + gε,t∗(t) ; w3² + w3 − u1² + gε,t∗(t) ]

is defined according to (6.16) and

]is defined according to (6.16) and

F2(u,w, ζ, t) =[ w1

w2

ζ1−ζ2−w23+u2

1

]= 0


from Lemma 6.5 fixes the component w = w(u, ζ, t) = [0, 0, −√(ζ1 − ζ2 + u1²)]ᵀ. Observe that u0 = [1, 1]ᵀ, w0 = [0, 0, −1/2]ᵀ and thus ζ0 = y0 − f(u0, w0, t0) = [0, 3/4]ᵀ. Hence we needed to choose the negative sign for the root in order to guarantee w0 = w(u0, ζ0, t0).

Finally we arrive at the implicit index-1 DAE

u′ = f(u, w(u, v′, ·), ·) = [ 1 ; 1 ] ( gε,t∗ + v1′ − v2′ − √(v1′ − v2′ + u1²) ),  u(t0) = u0 = [ 1 ; 1 ],
v = Dz(u, ·) = [ 0 ; u1 − u2 ],  v(t0) = [ 0 ; 0 ].

The t arguments were omitted for better readability. Obviously v(t) ≡ [ 0 ; 0 ] is uniquely determined by the initial data and we have to consider the ordinary differential equation u′(t) = [ 1 ; 1 ] ( gε,t∗(t) − u1(t) ), u(t0) = u0. The unique solution is given by

u1(t) = u2(t) = α(t) for 0 ≤ t < t∗ − ε,  α(t) + β(t) for t∗ − ε ≤ t < t∗ + ε,  α(t) + γ(t) for t∗ + ε ≤ t,

where the functions α, β and γ are defined by

α(t) = e^{−t},  β(t) = 1/2 + (1/(2ε)) [ t − t∗ − 1 + e^{t∗−ε−t} ],  γ(t) = 1 + (1/(2ε)) [ e^{t∗−ε−t} − e^{t∗+ε−t} ].
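The pieces α and β of this closed-form solution can be cross-checked against a direct numerical integration of the scalar equation u1′ = gε,t∗(t) − u1; the RK4 stepper and step size below are illustrative choices:

```python
import math

t_star, eps = 1.0, 0.25

def g(t):
    # the continuous ramp g_{eps,t*} from the example
    if t < t_star - eps: return 0.0
    if t < t_star + eps: return (t - t_star) / (2 * eps) + 0.5
    return 1.0

def rhs(t, u): return g(t) - u

def rk4(u, t0, t1, h=1e-4):
    t = t0
    while t < t1 - 1e-12:
        h_ = min(h, t1 - t)
        k1 = rhs(t, u)
        k2 = rhs(t + h_ / 2, u + h_ * k1 / 2)
        k3 = rhs(t + h_ / 2, u + h_ * k2 / 2)
        k4 = rhs(t + h_, u + h_ * k3)
        u += h_ * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h_
    return u

alpha = lambda t: math.exp(-t)
beta  = lambda t: 0.5 + (t - t_star - 1 + math.exp(t_star - eps - t)) / (2 * eps)

assert abs(rk4(1.0, 0.0, 0.5) - alpha(0.5)) < 1e-6               # before the ramp
assert abs(rk4(1.0, 0.0, 1.0) - (alpha(1.0) + beta(1.0))) < 1e-6 # on the ramp
```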

Following (6.22) we find

x∗(t) = D⁻(t)u(t) + z(u(t), t) + w(u(t), v′(t), t) = u1(t) [ 1 ; 1 ; −1 ].

Direct computation shows that x∗ is indeed a solution. Notice that there was no need for the initialisation to be consistent. The (inconsistent) initialisation (y0, x0, t0) = ( [−5/4, −1/2]ᵀ, [1, 1, −1/2]ᵀ, 0 ) was used to start with, but during the course of the above computations a consistent initialisation ( [−1, −1]ᵀ, [1, 1, −1]ᵀ, 0 ) was obtained. However, x∗ satisfies the initial condition D(t0)P̄1(t0)(x∗(t0) − x0) = 0 as stated in Theorem 6.7.

The solution obtained for t∗ = 1 and ε = 1/4 is depicted in Figure 6.2.

Even though the decoupling procedure can be used to actually solve a DAE, it is not meant to be used for this purpose. The above example illustrates how the different functions interrelate and how the implicit function theorem is used

Figure 6.2: The components x1(t), x2(t) and x3(t) of the solution x for t∗ = 1 and ε = 1/4. Observe that x is only continuous, but Dx = [x1, x2]ᵀ is smooth.


to solve for different components of the solution. For practical computations these manipulations are not recommended.

Nevertheless the decoupling procedure is an invaluable tool for proving the existence and uniqueness of solutions for nonlinear DAEs (6.6) making only mild smoothness assumptions. In Part III of this thesis we will see how the decoupling procedure can be successfully used to study numerical methods for index-2 DAEs as well. Of course there will be no explicit decoupling when doing serious computations. However, one has to ensure that a given method, when applied to (6.6), behaves as if it was integrating the index-1 system (6.19). This behaviour guarantees that qualitative properties of the solution are correctly reflected by the numerical approximation. In the context of linear DAEs an example of this situation was given in Section 3.3.

Before proving the obtained results in Section 6.3 and then moving on to studying numerical methods for nonlinear DAEs, some brief remarks about this new decoupling procedure are due.

6.1.3 Remarks about the Decoupling Procedure

In contrast to the case of linear DAEs [116, 120] or nonlinear index-1 equations [88] we did not obtain an inherent ordinary differential equation, but we derived the implicit DAE system (6.19) that governs the dynamical behaviour. Using the concept of the differentiation index it turned out that (6.19) has index 1. Of course, the transformation of higher index DAEs to equivalent ones having index 1 is a well-known theme in the theory of differential algebraic equations. But in contrast to classical approaches [10, 44, 99], we did not use the derivative array at all. The concept of the tractability index made a more refined analysis of index-2 DAEs possible, leading to lower smoothness requirements. This is of vital importance for practical applications.

The implicit system (6.19) is neither in Hessenberg form nor formulated with a properly stated leading term. It is easily seen that reformulations that fit into these classes of equations will have index 2 again. Thus one has to be careful when reformulating the implicit system and it turns out to be advantageous to consider (6.19) directly. A detailed study of general linear methods for (6.19) will be presented in Chapter 8.

The general approach taken here provides a unifying framework for the different classes of DAEs previously considered. Indeed, it turns out that the results given for

• linear DAEs having index 1 or 2,

• nonlinear index-1 equations and

• index-2 DAEs having the special structure (6.3)

can all be recovered as special cases.


Linear DAEs: If the DAE (6.6) is linear, then the sequence (6.7) depends neither on x nor on x¹. Hence the bar notation introduced in (6.11) can be neglected. For linear DAEs the mapping f defined in (6.16) therefore takes the form

f(u, w, ·) = (DP1D⁻)′u + (DP1D⁻)′v(u, ·) − DP1G2⁻¹BD⁻u + DP1G2⁻¹q − DP1G2⁻¹BZz(u, ·) − DP1G2⁻¹BTw.

Using the properties (6.8) of U, T as well as the definition of B1 from (6.7b) it turns out that

DP1G2⁻¹BT = DP1G2⁻¹BQ0T = DP1G2⁻¹G1Q0T = DP1G2⁻¹G2P1Q0T
= DP1T = DT − DQ1T = 0,

DP1G2⁻¹BZ = DP1G2⁻¹B(P0Q1 + UQ0)
= DP1G2⁻¹(BP0)Q1 + DP1G2⁻¹B(Q0 − T)
= DP1G2⁻¹(B1 + G2P1D⁻(DP1D⁻)′D)Q1 + DP1Q0
= DP1G2⁻¹B1Q1 + DP1D⁻(DP1D⁻)′DQ1
= (DP1D⁻)′DQ1.

Hence the first equation (6.19a) of the implicit index-1 system reduces to

u′ = (DP1D⁻)′u − DP1G2⁻¹BD⁻u + DP1G2⁻¹q.

This is precisely the inherent regular ODE (3.11a′′) for µ = 2 from Section 3.2. It is important to notice that DP1G2⁻¹BT = 0 ensures that the right-hand side is independent of w such that there is no coupling between the components u and v anymore.

Nonlinear index-1 equations: It was already remarked earlier that in case of index-1 equations Q1 = 0 and P1 = I imply that the component v vanishes due to DZ = 0. Therefore (6.19a) reduces to the inherent regular ODE (4.8b) as constructed in Section 4.2.

Nonlinear index-2 DAEs with the special structure (6.3): Given that the index-2 components Tx = w enter the equations linearly, f(u, w, ·) is again independent of w. This can be seen by computing

∂f/∂w (u, w, ·) = −DP̄1Ḡ2⁻¹( bx(D⁻u + z(u, ·), ·)U + BT )T = −DP̄1Ḡ2⁻¹( bx(Ux0, ·)U + BT )T = −DP̄1Ḡ2⁻¹B̄T = 0.

Consequently, (6.19a) reduces to the ordinary differential equation

u′ = (DP̄1D⁻)′(u + v(u, t)) − DP̄1Ḡ2⁻¹ b(D⁻(t)u + z(u, t), ·).    (6.23)


In order to see that this equation coincides with the inherent regular ODE (5.9b), take x̄ ≡ x0 and consider the solution x∗ constructed in (6.22). Defining

w(u, ·) = P̄1D⁻(Dx∗)′ + (Q0 + Q̄1)x∗ = P̄1D⁻(u + v(u, ·))′ + (Q0 + Q̄1)z(u, ·) + Tw(u, v′, ·)

we see that w is the unique solution of (5.8) (see Lemma 5.3 for more details). Using (6.23) one calculates

DP̄1w(u, ·) = DP̄1D⁻(u + v(u, ·))′ + DP̄1Q0 z(u, ·) + DP̄1T w(u, v′, ·)  (where DP̄1Q0 = 0 and DP̄1T = 0)
= u′ − (DP̄1D⁻)′(u + v(u, ·)) = −DP̄1Ḡ2⁻¹ b(D⁻(t)u + z(u, t), ·),
DQ̄1w(u, ·) = DQ̄1z(u, ·) = v(u, ·),

and comparing (5.9b) with (6.23) we see that the inherent regular ODE (5.9b) constructed in Section 5.1 is indeed just a special case of (6.19).

6.2 Application to Hessenberg DAEs

The important class of DAEs in Hessenberg form

y′ = f(y, z),  0 = g(y)   ⇔   [ I ; 0 ] ( [ I 0 ] [ x1 ; x2 ] )′ + [ −f(x1, x2) ; g(x1) ] = 0    (6.24)

was already considered in Example 6.1. In this section we want to explore how the decoupling procedure introduced above applies to DAEs having this special Hessenberg structure. It is assumed that (6.24) has index 2 such that g_y f_z remains nonsingular in a neighbourhood of a solution. Recall from Example 6.1 that the matrix sequence reads

\[ G_0 = \begin{bmatrix} I & 0\\ 0 & 0 \end{bmatrix}, \quad Q_0 = \begin{bmatrix} 0 & 0\\ 0 & I \end{bmatrix}, \quad D^- = \begin{bmatrix} I\\ 0 \end{bmatrix}, \quad G_1 = \begin{bmatrix} I & -f_z\\ 0 & 0 \end{bmatrix}, \quad T = \begin{bmatrix} 0 & 0\\ 0 & I \end{bmatrix} = Q_0. \]

A projector onto the kernel of G_1 is given by
\[ Q_1 = \begin{bmatrix} f_z(g_yf_z)^{-1}g_y & 0\\ (g_yf_z)^{-1}g_y & 0 \end{bmatrix} \]
and it turns out that³
\[ G_2 = \begin{bmatrix} I - f_yf_z(g_yf_z)^{-1}g_y & -f_z\\ g_y & 0 \end{bmatrix}, \qquad Z = P_0Q_1 + UQ_0 = \begin{bmatrix} f_z(g_yf_z)^{-1}g_y & 0\\ 0 & 0 \end{bmatrix}. \]
Obviously the projector \mathcal{Z} = f_z(g_yf_z)^{-1}g_y, i.e. the nonzero block of Z, appears frequently, e.g.
\[ G_2^{-1} = \begin{bmatrix} I-\mathcal{Z} & (I-\mathcal{Z})f_y\mathcal{Z}+f_z(g_yf_z)^{-1}\\ -(g_yf_z)^{-1}g_y & (g_yf_z)^{-1}\bigl(I-g_yf_y\mathcal{Z}\bigr) \end{bmatrix}. \]

³Instead of G_2 = G_1 + B_1Q_1 with B_1 = BP_0 - G_1D^-CD (cf. (6.7)), the modified version G̃_2 = G_1 + B̃_1Q_1 with B̃_1 = BP_0 is used here in order to simplify expressions. From the proof of Lemma 5.3 we know that G_2 and G̃_2 have common rank.
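Since the text gives no numerical illustration here, the block formulas above can be checked for a linear instance of (6.24), say f(y,z) = My + Nz and g(y) = Gy; the matrix names M, N, G and all function names in the sketch below are our own, chosen so that f_y = M, f_z = N, g_y = G with g_yf_z = GN nonsingular:

```python
import numpy as np

# Linear Hessenberg DAE: f(y, z) = M y + N z, g(y) = G y.
rng = np.random.default_rng(42)
n, m = 4, 2
M = rng.standard_normal((n, n))     # plays the role of f_y
N = rng.standard_normal((n, m))     # plays the role of f_z
G = rng.standard_normal((m, n))     # plays the role of g_y

Zc = N @ np.linalg.solve(G @ N, G)  # the projector f_z (g_y f_z)^{-1} g_y

G1 = np.block([[np.eye(n), -N],
               [np.zeros((m, n + m))]])
Q1 = np.block([[Zc, np.zeros((n, m))],
               [np.linalg.solve(G @ N, G), np.zeros((m, m))]])
G2 = np.block([[np.eye(n) - M @ Zc, -N],
               [G, np.zeros((m, m))]])

assert np.allclose(Zc @ Zc, Zc)              # Z is a projector
assert np.allclose(Q1 @ Q1, Q1)              # Q1 is a projector ...
assert np.allclose(G1 @ Q1, 0)               # ... onto the kernel of G1
assert np.linalg.matrix_rank(G2) == n + m    # G2 nonsingular: index 2
```

The last assertion reflects the index-2 property: G_2 is nonsingular whenever g_yf_z is.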


It is straightforward to apply the decoupling procedure described in the previous section. For simplicity choose x ≡ x_0 = (y_0, z_0) and recall from Lemma 6.4 that
\[ F_1(u,z) = ZG_2^{-1}\,b\bigl(D^-u+Zz\bigr) + (I-Z)z = \begin{bmatrix} \bar f_z(\bar g_y\bar f_z)^{-1}\,g(u+\bar{\mathcal Z}z_1) + (I-\bar{\mathcal Z})z_1\\ z_2 \end{bmatrix}. \]

The bar notation indicates evaluation at x = x_0; u and z = (z_1, z_2) are newly introduced parameters. Solving F_1(u,z) = 0 for z yields the map z such that F_1(u, z(u)) = 0. The function f from (6.16) is given by

\[ f(u,w) = (I-\bar{\mathcal Z})\Bigl[\bar f_y\bar{\mathcal Z}\,g\bigl(u+z_1(u)\bigr) - f\bigl(u+z_1(u),\,w_2\bigr)\Bigr]. \]

Finally the mapping F_2 from Lemma 6.5 is obtained by straightforward computation,
\[ F_2(u,w,\zeta) = \begin{bmatrix} I & 0\\ 0 & (\bar g_y\bar f_z)^{-1} \end{bmatrix} \begin{bmatrix} w_1\\ \bar g_y\bigl(f(u+z(u),w_2)-\zeta_1\bigr) + (I-\bar g_y\bar f_y\bar{\mathcal Z})\,g\bigl(u+z(u)\bigr) \end{bmatrix}. \]

Solving F_2(u,w,\zeta) = 0 for w = w(u,\zeta) we arrive at the inherent index-1 system
\[ u' = (I-\bar{\mathcal Z})\Bigl[\bar f_y\bar{\mathcal Z}\,g\bigl(u+z_1(u)\bigr) - f\bigl(u+z_1(u),\,w_2(u,v')\bigr)\Bigr], \tag{6.25a} \]
\[ v = Dz(u). \tag{6.25b} \]

It is clear from the above computations that the decoupling procedure can be applied to index-2 DAEs in Hessenberg form. Computing the partial derivatives ∂F_1/∂z(u_0,z_0) = ∂F_2/∂w(u_0,w_0,ζ_0) = I shows that the existence of the functions z and w is guaranteed by the implicit function theorem. Hence the index-2 DAE (6.24) can be decoupled into the index-1 problem (6.25) and explicit representations for z = z(u,·) and w = w(u,v'). This decoupling is realised without differentiating the equations. Of course, in order to finally compute w = w(u,v') a differentiation is still necessary.

There is no doubt that (6.25) is far more complicated than the original formulation (6.24). When dealing with problems of Hessenberg type it is therefore advisable to stick to the original formulation (6.24) and to exploit this structure when studying numerical methods. This is done e.g. in [84, 85, 140].

On the other hand, there are applications that do not give rise to equations in Hessenberg form. One prominent example is given by the equations of the charge-oriented modified nodal analysis.

In order to address these more general problems that are not of Hessenberg type, one has to invest considerable work such that the inherent structure can be revealed and then exploited for analysing numerical methods. The decoupling procedure is one attempt to go in this direction. We saw that this approach works for the simple DAE in Example 6.8 as well as for DAEs in Hessenberg form. The next section will give proofs for the general case.


6.3 Proofs of the Results

In this section proofs for Lemmas 6.3, 6.4 and 6.5 will be given. The proofs presented here are rather technical, but the structure of our approach is clearly visible from Figure 6.3. The DAE (6.6) is rewritten in terms of new variables. The resulting equation (6.18), i.e. F(u,w,z,η,ζ,t) = 0, is studied in detail. Notice that the derivatives u' and v' are replaced by parameters η and ζ, respectively. The mapping F is defined in (6.14b) on page 93.

Using the identity I = Z + P_0P_1 + T, this equation is split into three parts that can be dealt with one after the other.

① F_1(u,z,t) = 0 together with Lemma 6.4 yields the mapping z. Here the structural condition (A1) is crucial as it allows the application of the implicit function theorem.

② Inserting z = z(u,t) into the second part provides an explicit representation u' = f(u,w,t).

③ Using the information about z and η = u', the third part can be written as F_2(u,w,ζ,t) = 0 such that a further application of the implicit function theorem yields w = w(u,ζ,t). The details of this construction are covered by Lemma 6.5.

In order to derive the implicit DAE (6.19), we need to plug w back into f, keeping in mind that ζ was representing v'.

The procedure sketched above will be carried out in detail. We start by givinga preliminary result.

Lemma 6.9. Let (6.6) be a regular DAE with a properly stated leading term that has index 2 on D×I. Let (A1), (A2) and (A3) hold. Then
\[ G_2^{-1}(t)A(t)D(t) = P_1(t)P_0(t), \qquad G_1(\xi,t)P_0(t) = \bigl(G_2P_1P_0\bigr)(t) \]
hold for every (ξ, t) ∈ D×I. Furthermore we have
\[ \bigl(TG_2^{-1}\bigr)(t_0)B(x_0,t_0)T(t_0) = T(t_0), \qquad \bigl(ZG_2^{-1}\bigr)(t_0)B(x_0,t_0)Z(t_0) = Z(t_0). \]
If, in addition, the structural condition (A1) is valid, then
\[ \operatorname{im} B(\xi,t)T(t) \subset \operatorname{im}\bigl(G_2P_1P_0\bigr)(t) \qquad \forall\,(\xi,t) \in D×I. \]

Figure 6.3: The roadmap to the proofs. [The figure shows the equation F(u,w,z,η,ζ,t) = 0 being split by means of ZG_2^{-1}, DP_1G_2^{-1} and TG_2^{-1} into F_1(u,z,t) = 0, u' = f(u,w,t) and F_2(u,w,ζ,t) = 0, which yield z = z(u,t), v = v(u,t) and w = w(u,ζ,t), and finally the implicit index-1 system u' = f(u,v',t), v = g(u,t).]

Proof. Since G_2P_1P_0 = AD = G_1(ξ,·)P_0 for every ξ, the first two equations hold. The second two equations follow from
\[ BT = BQ_0T = G_1Q_0T = G_2P_1T, \]
\[ ZG_2^{-1}BZ = P_0Q_1G_2^{-1}BP_0Q_1 + UQ_0G_2^{-1}BP_0Q_1 + ZG_2^{-1}BQ_0UQ_0 = P_0Q_1 + UQ_0 = Z, \]
respectively⁴. The argument t_0 was dropped for better readability. Notice that for t ≠ t_0 we get G_2^{-1}(t)G_2(x_1,ξ,t) ≠ I in general and the second two equations need not hold for t ≠ t_0.

Given (A1), we have im T(t) ⊂ S_0(ξ,t) since N_0(t) ∩ S_0(x,t) is independent of x. We find im B(ξ,t)T(t) ⊂ im G_0(t) which proves the last assertion of the lemma.

Proof of Lemma 6.3. We assume that x_* is a solution of (6.6) and define the functions u, w, z according to (6.12),
\[ u(t) = D(t)P_1(t)x_*(t), \quad w(t) = T(t)x_*(t), \quad z(t) = Z(t)x_*(t). \tag{6.12} \]
We know already that the DAE (6.6) can be equivalently written as
\[ 0 = f\bigl((Dx_*)'(t),\, x_*(t),\, t\bigr) = F\bigl(u(t), w(t), z(t), u'(t), v'(t), t\bigr), \tag{6.14a} \]
where the mapping F is defined by
\[ F(u,w,z,\eta,\zeta,t) = f\bigl(\eta+\zeta,\ D^-(t)u + Z(t)z + T(t)w,\ t\bigr). \tag{6.14b} \]
Using the identity I = P_0P_1 + Z + T = D^-DP_1 + Z + T as motivation we split (6.14b) into three parts
\[ F_1(u,w,z,\eta,\zeta,t) = Z(t)G_2^{-1}(t)F(u,w,z,\eta,\zeta,t) + \bigl(I - Z(t)\bigr)z, \]
\[ F_2(u,w,z,\eta,\zeta,t) = T(t)G_2^{-1}(t)F(u,w,z,\eta,\zeta,t) + \bigl(I - T(t)\bigr)w, \]
\[ F_3(u,w,z,\eta,\zeta,t) = D(t)P_1(t)G_2^{-1}(t)F(u,w,z,\eta,\zeta,t). \]

⁴Recall Q_1 = Q_1G_2^{-1}BP_0, BP_0 = B_1Q_1 + G_2P_1D^-(DP_1D^-)'DQ_1, BQ_0 = G_2P_1Q_0.


Observe that (6.12) and (6.14a) imply
\[ F_i\bigl(u(t), w(t), z(t), u'(t), v'(t), t\bigr) = 0, \qquad i = 1, 2, 3, \tag{6.27} \]
for t ∈ I. We study these functions around (u_0, w_0, z_0, η_0, ζ_0, t_0) where
\[ u_0 = u(t_0), \quad w_0 = w(t_0), \quad z_0 = z(t_0), \quad \eta_0 = u'(t_0), \quad \zeta_0 = v'(t_0). \]
As in (6.13) we have x_0 = D^-(t_0)u_0 + z_0 + w_0.

① Lemma 6.9 shows that (dropping the t argument)
\[ F_1(u,w,z,\eta,\zeta,\cdot) = ZG_2^{-1}\,b\bigl(D^-u+Zz+Tw,\ \cdot\bigr) + \bigl(I-Z\bigr)z \]
does not depend on η nor ζ. Due to the structural condition (A1), F_1 is even independent of w as
\[ F_{1,w}(u,w,z,\eta,\zeta,t) = \bigl(ZG_2^{-1}\bigr)(t)\,B\bigl(\xi(t),t\bigr)\,T(t) = 0 \tag{6.28} \]
with ξ(t) = D^-(t)u + Z(t)z + T(t)w (see Lemma 6.9). We may redefine F_1 using the proper argument list:
\[ F_1(u,z,\cdot) = ZG_2^{-1}F(u,0,z,0,0,\cdot) + \bigl(I-Z\bigr)z = ZG_2^{-1}\,b\bigl(D^-u+Zz,\ \cdot\bigr) + \bigl(I-Z\bigr)z. \tag{6.29} \]
Keep in mind that due to (6.28)
\[ F_1(u,z,\cdot) = ZG_2^{-1}\,b\bigl(D^-u+Zz+w_0,\ \cdot\bigr) + \bigl(I-Z\bigr)z \tag{6.30} \]
is also valid. Using Lemma 6.9 again, we calculate
\[ F_{1,z}(u_0,z_0,t_0) = Z(t_0)G_2^{-1}(t_0)B(x_0,t_0)Z(t_0) + \bigl(I-Z(t_0)\bigr) = I. \tag{6.31} \]

(6.27) for i = 1 and (6.31) allow the application of the implicit function theorem and z(t) = z(u(t), t) is given as a function of u(t) and t. The mapping z is defined locally around (u_0, t_0) and satisfies F_1(u, z(u,t), t) = 0 on a neighbourhood of (u_0, t_0). Thus
\[ z(u,t) = Z(t)z(u,t) \tag{6.32} \]
is also valid since 0 = (I - Z(t))F_1(u, z(u,t), t) = (I - Z(t))z(u,t).

Due to D(t)z(u,t) = D(t)Z(t)z(u,t) = D(t)Q_1(t)z(u,t) we arrive at
\[ v(t) = D(t)Q_1(t)z(t) = D(t)Q_1(t)Z(t)\,z\bigl(u(t),t\bigr) = D(t)z\bigl(u(t),t\bigr) = v\bigl(u(t),t\bigr) \]
with the function v(u,t) = D(t)z(u,t).


② Noting that
\[ DP_1G_2^{-1}A(u+v)' = DP_1G_2^{-1}ADD^-(u+v)' = DP_1D^-(u+v)' = u' - (DP_1D^-)'(u+v), \]
it turns out that F_3 provides an explicit representation of u'(·) in terms of u(·) and w(·). In particular, for i = 3 (6.27) is equivalent to
\[ u'(t) = f\bigl(u(t), w(t), t\bigr) \tag{6.33a} \]
with
\[ f(u,w,t) = \bigl(DP_1D^-\bigr)'(t)\bigl(u+v(u,t)\bigr) - \bigl(DP_1G_2^{-1}\bigr)(t)\,b\bigl(D^-(t)u + z(u,t) + T(t)w,\ t\bigr). \tag{6.33b} \]

 Combining F2 with the results obtained so far we get

F2

(u(t), w(t), v′(t), t

)= 0 (6.34a)

where

F2

(u,w, ζ, t

)= F2

(u,w, z(u, t), f(u,w, t), ζ, t

)=(TG−1

2 A)(t)(f(u,w, t) + ζ

)+(I − T (t)

)w (6.34b)

+(TG−1

2

)(t) b(D−(t)u+ z

(u, t)

+ T (t)w , t)

is defined on a neighbourhood of (u_0, w_0, ζ_0, t_0). In order to apply the implicit function theorem once again, we need to show that F_{2,w}(u_0, w_0, ζ_0, t_0) is nonsingular. After calculating
\[ f_w\bigl(u_0, w_0, t_0\bigr) = -\bigl(DP_1G_2^{-1}\bigr)(t_0)B(x_0,t_0)T(t_0) \]
we obtain from Lemma 6.9
\[ F_{2,w}\bigl(u_0,w_0,\zeta_0,t_0\bigr) = \bigl(TG_2^{-1}A\bigr)(t_0)f_w(u_0,w_0,t_0) + I - T(t_0) + \bigl(TG_2^{-1}\bigr)(t_0)B(x_0,t_0)T(t_0) = -\bigl(TP_1P_0P_1G_2^{-1}\bigr)(t_0)B(x_0,t_0)T(t_0) + I - T(t_0) + T(t_0). \]
This proves F_{2,w}(u_0,w_0,ζ_0,t_0) = I, since TP_1P_0P_1 = 0. Thus the implicit function theorem shows that locally around (u_0,w_0,ζ_0,t_0) (6.34a) is equivalent to
\[ w(t) = w\bigl(u(t), v'(t), t\bigr) \]
with a continuous mapping w. Similar to (6.32), w(u,ζ,t) = T(t)w(u,ζ,t) holds.


The results of ① – ③ show that (6.14) implies (6.15) from Lemma 6.3, i.e.
\[ z(t) = z\bigl(u(t), t\bigr), \qquad u'(t) = f\bigl(u(t), w(t), t\bigr), \qquad w(t) = w\bigl(u(t), v'(t), t\bigr). \]
To finish the proof we note that these mappings imply
\[ G_2^{-1}F(u,w,z,u',v',\cdot) = F_1\bigl(u, z(u,\cdot), \cdot\bigr) + F_2\bigl(u, w(u,v',\cdot), v', \cdot\bigr) + D^-\bigl(u' - f(u, w(u,v',\cdot), \cdot)\bigr) = 0, \]
where t-arguments have once again been omitted.

Proof of Lemma 6.4. Again we drop the assumption that there is a solution. We require that (A1), (A2) and (A3) hold. Exactly as in (6.29) we define the mapping F_1 = F_1(u,z,t) where u and z are considered to be parameters chosen in a neighbourhood of (u_0, z_0). Recall that u_0, w_0, z_0 are defined in (6.17) as
\[ u_0 = D(t_0)P_1(t_0)x_0, \qquad w_0 = T(t_0)x_0, \qquad z_0 = Z(t_0)x_0. \]
We have
\[ F_1(u_0,z_0,t_0) = \bigl(ZG_2^{-1}\bigr)(t_0)\bigl[A(t_0)y_0 + b(x_0,t_0)\bigr] = 0 \]
due to (6.30) and (A2). As in (6.31) we find F_{1,z}(u_0,z_0,t_0) = I and the implicit function theorem provides the function z satisfying F_1(u, z(u,t), t) = 0 on a neighbourhood of (u_0, t_0). Notice that z satisfies (6.32).

Proof of Lemma 6.5. Having z at our disposal we introduce v(u,t) = D(t)z(u,t) and consider the mapping f = f(u,w,t) defined in (6.33b). The function F_2 = F_2(u,w,ζ,t) can be defined as in (6.34b). Let ζ_0 = y_0 - f(u_0,w_0,t_0) such that
\[ F_2(u_0,w_0,\zeta_0,t_0) = \bigl(TG_2^{-1}\bigr)(t_0)\bigl[A(t_0)y_0 + b(x_0,t_0)\bigr] = 0. \]
We already calculated F_{2,w}(u_0,w_0,ζ_0,t_0) = I and therefore the implicit function theorem yields the mapping w with F_2(u, w(u,ζ,t), ζ, t) = 0 in a neighbourhood of (u_0, ζ_0, t_0). Again w(u,ζ,t) = T(t)w(u,ζ,t) is satisfied.

Remark 6.10. The decoupling procedure discussed above leads to the implicit differential algebraic system
\[ u' = f\bigl(u, w(u,v',t), t\bigr) =: f(u,v',t), \tag{6.35a} \]
\[ v = D(t)z(u,t) =: g(u,t). \tag{6.35b} \]
Differentiating (6.35b) and inserting into (6.35a) shows that
\[ u' = f\bigl(u,\ g_u(u,t)u' + g_t(u,t),\ t\bigr). \tag{6.36} \]


In order to guarantee that (6.35) has index 1, as stated in Lemma 6.6, we have to ensure that (6.36) can be solved for u'. To this end consider the matrix M(u,ζ,t) = I - f_{v'}(u,ζ,t)g_u(u,t) satisfying
\[ M(u_0,\zeta_0,t_0) = I - \bigl(f_w w_\zeta v_u\bigr)(u_0,\zeta_0,t_0) = I + \bigl(DP_1G_2^{-1}BT\bigr)(t_0)\bigl(TG_2^{-1}A\bigr)(t_0)\bigl(DQ_1D^-\bigr)(t_0) = I. \]
Thus M remains nonsingular also on a neighbourhood of (u_0, ζ_0, t_0) such that (6.35) has indeed index 1.

Remark 6.11. In Theorem 6.7 a solution x_* was constructed by first solving (6.35) and then defining x_*(t) = D^-(t)u(t) + z(u(t),t) + w(u(t),v'(t),t). This function is a solution of (6.21) since
\[ ZG_2^{-1}\bigl[A(Dx_*)' + b(x_*,\cdot)\bigr] = F_1\bigl(u, z(u,\cdot), \cdot\bigr) = 0, \tag{6.37a} \]
\[ TG_2^{-1}\bigl[A(Dx_*)' + b(x_*,\cdot)\bigr] = F_2\bigl(u, w(u,v',\cdot), v', \cdot\bigr) = 0, \tag{6.37b} \]
\[ P_0P_1G_2^{-1}\bigl[A(Dx_*)' + b(x_*,\cdot)\bigr] = D^-\bigl(u' - f(u, w(u,v',\cdot), \cdot)\bigr) = 0. \tag{6.37c} \]
Here we want to remark that in order to see (6.37a), the relations ZG_2^{-1}AD = 0 and ZG_2^{-1}b(ξ,·) = ZG_2^{-1}b(Uξ,·) have to be taken into account. The latter relation follows from the structural condition (A1) as was seen in (6.28). The property (6.32) was used as well. Similarly, w(u,ζ,t) = T(t)w(u,ζ,t) and (6.34b) imply (6.37b). Finally, (6.37c) follows from (6.35a), v(t) = v(u(t),t) and the definition of f in (6.33b).


Part III

General Linear Methods for DAEs


7 General Linear Methods

The numerical solution of differential algebraic equations often poses difficultiesfor standard schemes. Hence there are many references studying numericalmethods for DAEs. Linear multistep methods are addressed in [10, 71, 111]while [84, 104] are references for Runge-Kutta methods in the context of DAEs.In [75, 85, 101] both families of methods are investigated.

We saw in Section 3.3 that even for simple linear DAEs numerical methodsmight not work as expected. Recall from Example 3.10 on page 65 that theinherent dynamics of the system was integrated using the explicit Euler schemeeven though the implicit method was applied to the original DAE.

In [89, 90] BDF methods and Runge-Kutta schemes were studied for linearDAEs. It was shown that properly stated leading terms allow a numericallyqualified formulation ensuring that stability properties of these integrationschemes carry over to the inherent regular ODE.

Even though these results are of key relevance for numerical computation, theapproach of [89, 90] is confined to linear DAEs. Thus their scope is not wideenough to include the nonlinear DAEs arising in electrical circuit simulationusing the modified nodal analysis.

In the introductory Chapter 2 we saw that both linear multistep and Runge-Kutta methods suffer from certain disadvantages when being used for inte-grated circuit design. In particular the damping behaviour of the BDF meth-ods and artificial oscillations of numerical solutions obtained by the trapezoidalrule gave rise to serious difficulties (see Example 2.1 and 2.2). On the otherhand, computational costs for fully-implicit Runge-Kutta methods are oftenprohibitive for the large scale DAE systems arising in circuit simulation (cf.Example 2.6). Although Runge-Kutta methods with diagonally implicit struc-ture may be used, these methods suffer from a low stage order and are thusprone to the order reduction phenomenon (cf. Example 2.7).

The object of this work is to extend the results of [89, 90] in two directions. On the one hand general nonlinear DAEs of the form
\[ A(t)\bigl[d\bigl(x(t),t\bigr)\bigr]' + b\bigl(x(t),t\bigr) = 0 \tag{7.1} \]


will be addressed such that DAEs in Hessenberg form and in particular themore general MNA equations are covered by the results. Additionally the wideclass of general linear methods will be investigated in order to overcome thedisadvantages associated with both linear multistep and Runge-Kutta methods.

In contrast to linear DAEs the nonlinear version (7.1) requires considerable effort in order to reveal the inherent structure necessary for an appropriate generalisation of the results from [89, 90]. The previous section showed that a decoupling procedure can be successfully employed (although it is technically quite complex). This approach led to an inherent implicit index-1 system of the form
\[ y' = f(y, z', t), \qquad z = g(y, t). \tag{7.2} \]

As a starting point, general linear methods are introduced for ordinary differ-ential equations in this chapter. Implicit index-1 DAEs (7.2) are treated inChapter 8. There order conditions and convergence results will be derived.Together with the decoupling procedure from the previous section, these in-vestigations will lead to similar statements for properly stated nonlinear DAEs(7.1) in Chapter 9.

7.1 Consistency, Order and Convergence

General linear methods (GLMs) were introduced by Butcher [18] already in1966. The original intent was to provide a unifying framework for linear multi-step and Runge-Kutta methods. It turned out that many other schemes such ascyclic composition methods or predictor-corrector pairs can be cast into generallinear form as well.

In Section 2.3 we saw that a general linear method M = [A,U,B,V] is characterised by four matrices A ∈ R^{s×s}, U ∈ R^{s×r}, B ∈ R^{r×s} and V ∈ R^{r×r}. Given an ordinary differential equation
\[ y'(t) = f\bigl(y(t), t\bigr), \qquad y(t_0) = y_0, \tag{7.3} \]
and r pieces of input information y_1^{[n]}, …, y_r^{[n]} at t_n, s internal stages
\[ Y_i = h\sum_{j=1}^s a_{ij}\, f(Y_j, t_n + c_j h) + \sum_{j=1}^r u_{ij}\, y_j^{[n]}, \qquad i = 1,\dots,s, \tag{7.4a} \]

are calculated. In order to proceed from t_n to t_{n+1} = t_n + h using a stepsize h the input quantities are updated and another r quantities
\[ y_i^{[n+1]} = h\sum_{j=1}^s b_{ij}\, f(Y_j, t_n + c_j h) + \sum_{j=1}^r v_{ij}\, y_j^{[n]}, \qquad i = 1,\dots,r, \tag{7.4b} \]


are passed on to the next step. The numerical scheme (7.4) is often written in the more compact form
\[ Y = hA\,F(Y) + U y^{[n]}, \qquad y^{[n+1]} = hB\,F(Y) + V y^{[n]} \tag{7.5} \]
where the abbreviations

\[ Y = \begin{bmatrix} Y_1\\ \vdots\\ Y_s \end{bmatrix}, \qquad F(Y) = \begin{bmatrix} f(Y_1, t_n + c_1 h)\\ \vdots\\ f(Y_s, t_n + c_s h) \end{bmatrix}, \qquad y^{[n]} = \begin{bmatrix} y_1^{[n]}\\ \vdots\\ y_r^{[n]} \end{bmatrix} \]
are used. This formulation of a GLM is due to Burrage and Butcher [13]. The internal stages Y_i ≈ y(t_n + c_i h) represent approximations to the exact solution at intermediate timepoints. Many different choices are possible for the external stages. We will assume that the components of y^{[n]} are related to the exact solution by a weighted Taylor series, i.e.
\[ y_i^{[n]} = \sum_{k=0}^p \alpha_{ik} h^k y^{(k)}(t_n) + O(h^{p+1}). \tag{7.6} \]
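To make the scheme (7.4)/(7.5) concrete in code, the following sketch performs one step of a general linear method; it assumes an explicit method, i.e. a strictly lower triangular A, so the stages can be evaluated one after another. All names (glm_step, etc.) are our own, and the explicit Euler data below is simply the smallest GLM with s = r = 1:

```python
import numpy as np

def glm_step(A, U, B, V, f, y_in, t, h, c):
    """One GLM step: Y = h A F(Y) + U y[n], y[n+1] = h B F(Y) + V y[n].

    A must be strictly lower triangular (explicit method); y_in has
    shape (r, m) and holds the r pieces of input information.
    """
    s, m = A.shape[0], y_in.shape[1]
    F = np.zeros((s, m))
    for i in range(s):
        Y_i = h * (A[i, :i] @ F[:i]) + U[i] @ y_in   # internal stage (7.4a)
        F[i] = f(Y_i, t + c[i] * h)
    return h * B @ F + V @ y_in                       # update (7.4b)

# Explicit Euler as a GLM with s = r = 1: A = 0, U = B = V = 1.
A = np.array([[0.0]]); U = np.array([[1.0]])
B = np.array([[1.0]]); V = np.array([[1.0]]); c = np.array([0.0])
y = np.array([[1.0]])                                 # y^{[0]} = y(0)
for n in range(10):
    y = glm_step(A, U, B, V, lambda y, t: -y, y, n * 0.1, 0.1, c)
# y now holds (1 - 0.1)**10, the forward Euler result for y' = -y
```

For an implicit A the stage equations would instead have to be solved by a Newton-type iteration; the loop structure stays the same.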

Recall that s and r denote the number of internal and external stages, respec-tively. In the introduction it was emphasised that the method’s order p and itsstage order q are further important parameters.

Definition 7.1. A general linear method M = [A,U,B,V] has order p and stage order q if there are real parameters α_{ik} and c_j such that (7.6) implies
\[ Y_j = \sum_{k=0}^p \frac{(c_j h)^k}{k!}\, y^{(k)}(t_n) + O(h^{q+1}), \qquad j = 1,\dots,s, \tag{7.7a} \]
\[ y_i^{[n+1]} = \sum_{k=0}^p \alpha_{ik} h^k y^{(k)}(t_{n+1}) + O(h^{p+1}), \qquad i = 1,\dots,r. \tag{7.7b} \]

In Definition 7.1 all components of the vector y^{[n]} are required to be of order p. Although we will adopt this convention in the sequel, this restriction is not necessary. A good example is given by so-called Almost Runge-Kutta methods where r = 3, but y_3^{[n]} is only a crude approximation to h²y''(t_n). More details are given in [135].

Methods in Nordsieck form are of particular interest, as the formulation
\[ y^{[n]} \approx \begin{bmatrix} y(t_n)\\ h\,y'(t_n)\\ \vdots\\ h^{r-1}y^{(r-1)}(t_n) \end{bmatrix} \]


ensures that changing the stepsize for a GLM is nearly as easy as in the case of Runge-Kutta methods. A simple rescaling of the Nordsieck vector ensures that the correct powers of the stepsize h are used at the beginning of a step. Observe that, in contrast to Nordsieck's original formulation [127], factorials have been omitted for convenience.
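The rescaling just mentioned multiplies the k-th component of the Nordsieck vector by (h_new/h_old)^k. A small sketch (the helper name is our own):

```python
import numpy as np

def rescale_nordsieck(y_nordsieck, h_old, h_new):
    """Rescale [y, h y', ..., h^{r-1} y^{(r-1)}] (shape (r, m)) when the
    stepsize changes from h_old to h_new."""
    r = y_nordsieck.shape[0]
    sigma = (h_new / h_old) ** np.arange(r)
    return sigma[:, None] * y_nordsieck

# For y(t) = exp(t) at t = 0 the exact Nordsieck vector is [1, h, h^2, ...]^T.
h_old, h_new, r = 0.1, 0.05, 4
y = (h_old ** np.arange(r))[:, None]      # scalar problem, m = 1
y2 = rescale_nordsieck(y, h_old, h_new)
# y2[k] equals h_new**k, the Nordsieck vector for the new stepsize
```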

For methods in Nordsieck form the parameters α_{ik} are given by
\[ \alpha_{ik} = \begin{cases} 1, & i = k+1,\\ 0, & \text{otherwise}, \end{cases} \]
such that, in particular, α_0 = e_1 = [1\ 0\ \cdots\ 0]^\top and α_1 = e_2 = [0\ 1\ \cdots\ 0]^\top.

Example 7.2. Consider the 3 step BDF method. The coefficients are given in Table 2.1 on page 27 such that
\[ y_{n+1} - \tfrac{18}{11}y_n + \tfrac{9}{11}y_{n-1} - \tfrac{2}{11}y_{n-2} = \tfrac{6}{11}\,h f(y_{n+1}). \]
Written in general linear form this scheme reads
\[ \begin{bmatrix} Y\\ y_{n+1}\\ y_n\\ y_{n-1}\\ y_{n-2} \end{bmatrix} = \left[\begin{array}{c|cccc} \tfrac{6}{11} & \tfrac{18}{11} & -\tfrac{9}{11} & \tfrac{2}{11} & 0\\ \hline \tfrac{6}{11} & \tfrac{18}{11} & -\tfrac{9}{11} & \tfrac{2}{11} & 0\\ 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 \end{array}\right] \begin{bmatrix} hf(Y)\\ y_n\\ y_{n-1}\\ y_{n-2}\\ y_{n-3} \end{bmatrix}, \]
where the additional input quantity y_{n-3} is considered for convenience. Notice that y_{n-3} is used neither for computing Y nor y_{n+1}. Since y_n ≈ y(t_n) + O(h^4) and
\[ y_{n-i} \approx y(t_n - ih) = \sum_{k=0}^3 \frac{(-ih)^k}{k!}\, y^{(k)}(t_n) + O(h^4), \qquad i = 1,\dots,3, \]
it turns out that the coefficients α_{ik} are given by
\[ W = \bigl[\alpha_0\ \alpha_1\ \alpha_2\ \alpha_3\bigr] = \begin{bmatrix} 1 & 0 & 0 & 0\\ 1 & -1 & \tfrac{1}{2} & -\tfrac{1}{6}\\ 1 & -2 & 2 & -\tfrac{4}{3}\\ 1 & -3 & \tfrac{9}{2} & -\tfrac{9}{2} \end{bmatrix}. \]
Using this matrix W it is possible to rewrite the BDF scheme in Nordsieck form. To this end the method M = [A,U,B,V] needs to be replaced by
\[ \widetilde M = \begin{bmatrix} A & UW\\ W^{-1}B & W^{-1}VW \end{bmatrix} = \left[\begin{array}{c|cccc} \tfrac{6}{11} & 1 & \tfrac{5}{11} & -\tfrac{1}{22} & -\tfrac{7}{66}\\ \hline \tfrac{6}{11} & 1 & \tfrac{5}{11} & -\tfrac{1}{22} & -\tfrac{7}{66}\\ 1 & 0 & 0 & 0 & 0\\ \tfrac{12}{11} & 0 & -\tfrac{12}{11} & -\tfrac{1}{11} & \tfrac{5}{11}\\ \tfrac{6}{11} & 0 & -\tfrac{6}{11} & -\tfrac{6}{11} & \tfrac{8}{11} \end{array}\right]. \]
The method \widetilde M is in Nordsieck form with \widetilde W = [\tilde\alpha_0\ \tilde\alpha_1\ \tilde\alpha_2\ \tilde\alpha_3] = I.


The definition of order is visualised in Figure 7.1. In general, if r > 1, a starting procedure¹ has to be used in order to compute y^{[0]} from the initial value y_0 = y(t_0). The method M takes y^{[0]} as input and calculates y^{[1]} at the end of the first step. The local error is given by the difference y^{[1]} − ŷ^{[1]}, where ŷ^{[1]} results from applying the starting procedure to the exact solution y(t_1). Similarly a finishing procedure is used to recover the solution y_n from y^{[n]} at the end of the integration. For methods in Nordsieck form the finishing procedure is trivial as the numerical result appears in the first component of y^{[n]}.

After several steps taken by the method the local errors accumulate to give the global error y^{[n]} − ŷ^{[n]}. If the method has order p then the global error is n · O(h^{p+1}) = O(h^p), provided that the method is stable.

Definition 7.3. A general linear method M = [A,U,B,V] is stable if the matrix V is power bounded, i.e. there is a constant C such that ‖V^n‖ ≤ C for every n.

The definition of stability is closely related to solving the trivial ODE y' = 0, where Y = Uy^{[n]} and y^{[n+1]} = Vy^{[n]}. The requirements
\[ y^{[n]} = \alpha_0\, y(t_n) + O(h), \qquad y^{[n+1]} = \alpha_0\, y(t_{n+1}) + O(h), \qquad Y = e\, y(t_n) + O(h) \]
show that the method has to satisfy certain pre-consistency conditions.

Definition 7.4. A method M = [A,U,B,V] is pre-consistent if there is a vector α_0, called the pre-consistency vector, such that
\[ V\alpha_0 = \alpha_0, \qquad U\alpha_0 = e = [1\ \cdots\ 1]^\top. \]

Consider the one-dimensional differential equation y' = 1. The application of M yields Y = hAe + Uy^{[n]} and y^{[n+1]} = hBe + Vy^{[n]}. The method has at least

Figure 7.1: The local and global error for general linear methods. S and F denote the starting and finishing procedure, respectively. E is the exact solution operator that evaluates the exact solution at the next timepoint [26].

¹Generalised Runge-Kutta methods are often used as starting methods [93]. In practice the vectors y^{[n]} with r > 1 are built up gradually by using a variable order implementation (see Chapter 11).


order 1 for this problem, i.e.
\[ y^{[n]} = \alpha_0 y(t_n) + \alpha_1 h y'(t_n) + O(h^2), \qquad y^{[n+1]} = \alpha_0 y(t_{n+1}) + \alpha_1 h y'(t_{n+1}) + O(h^2), \]
provided that it satisfies the following consistency condition.

Definition 7.5. A method M = [A,U,B,V] is called consistent if it is pre-consistent with pre-consistency vector α_0 and there is a vector α_1 such that
\[ \alpha_0 + \alpha_1 = Be + V\alpha_1. \]

The requirement of having stage order 1, i.e. Y_i = y(t_n + c_i h) + O(h^2) = y(t_n) + c_i h y'(t_n) + O(h^2), leads to the additional condition
\[ c = Ae + U\alpha_1 \tag{7.8} \]
which fixes the vector c in terms of the method's coefficients and the consistency vector α_1.
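These conditions are easy to check in code. As an illustration, the trapezoidal rule can be written as a GLM with s = 2, r = 1 (the matrices below are one standard way of doing so, chosen by us, not taken from the text):

```python
import numpy as np

# Trapezoidal rule as a GLM: stages Y1 = y_n, Y2 = y_{n+1};
# the single passed-on quantity is y_{n+1}.
A = np.array([[0.0, 0.0],
              [0.5, 0.5]])
U = np.array([[1.0], [1.0]])
B = np.array([[0.5, 0.5]])
V = np.array([[1.0]])
alpha0, alpha1 = np.array([1.0]), np.array([0.0])
e = np.ones(2)

assert np.allclose(V @ alpha0, alpha0)                   # pre-consistency
assert np.allclose(U @ alpha0, e)
assert np.allclose(alpha0 + alpha1, B @ e + V @ alpha1)  # consistency
c = A @ e + U @ alpha1                                   # condition (7.8)
assert np.allclose(c, [0.0, 1.0])                        # stages at t_n, t_{n+1}
```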

A precise definition of convergence is rather complicated in this general setup. Let (7.3) be an initial value problem with f : R^m × [t_0, t_end] → R^m being continuous and satisfying a Lipschitz condition for the first argument. A starting procedure S : R^m × R → R^{rm} is required and we assume that S(y_0, h) → α_0 ⊗ y_0 as the stepsize h tends towards zero. Let η(n) be the result computed by the method after n steps with stepsize h = (t_end − t_0)/n starting from S(y_0, h). The method is convergent if, for any such initial value problem and starting procedure, lim_{n→∞} η(n) = α_0 ⊗ y(t_end).

It is proved in [25] that for ordinary differential equations convergence is equiv-alent to having stability and consistency.

The pre-consistency and consistency conditions as well as (7.8),
\[ V\alpha_0 = \alpha_0, \qquad U\alpha_0 = e, \qquad \alpha_0 + \alpha_1 = Be + V\alpha_1, \qquad c = Ae + U\alpha_1, \]
ensure that the general linear method M = [A,U,B,V] has order and stage order at least 1 for ordinary differential equations. In order to guarantee higher order p and stage order q the method's coefficients have to satisfy complicated expressions, the so-called order conditions. If the stage order q is required to be equal to the order p, these conditions simplify considerably [23, 26, 29].

Theorem 7.6. A general linear method M = [A,U,B,V] has order p and stage order q = p if and only if
\[ UW = C - ACK, \qquad VW = WE - BCK, \tag{7.9} \]
where
\[ K = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 & 0\\ 0 & 0 & 1 & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & \cdots & 0 & 1\\ 0 & 0 & 0 & \cdots & 0 & 0 \end{bmatrix}, \qquad E = \begin{bmatrix} 1 & \tfrac{1}{1!} & \tfrac{1}{2!} & \cdots & \tfrac{1}{(p-1)!} & \tfrac{1}{p!}\\ 0 & 1 & \tfrac{1}{1!} & \cdots & \tfrac{1}{(p-2)!} & \tfrac{1}{(p-1)!}\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & \cdots & 1 & \tfrac{1}{1!}\\ 0 & 0 & 0 & \cdots & 0 & 1 \end{bmatrix} = \exp(K)
\]


and W = [\alpha_0\ \cdots\ \alpha_p], C = \bigl[e\ \ c\ \ \tfrac{c^2}{2!}\ \cdots\ \tfrac{c^p}{p!}\bigr]. The exponentiation c^k is meant componentwise.

Proof. Assume that the order and stage order is p, i.e. (7.6) implies (7.7) for q = p. Due to (7.7a), Y_j = y(t_n + c_j h) + O(h^{p+1}), it follows that h f(Y_j, t_n + c_j h) = h y'(t_n + c_j h) + O(h^{p+2}) such that a Taylor series expansion of (7.4a) yields
\[ \sum_{k=0}^p \frac{c^k}{k!} h^k y^{(k)}(t_n) = \sum_{k=1}^p \frac{Ac^{k-1}}{(k-1)!} h^k y^{(k)}(t_n) + \sum_{k=0}^p U\alpha_k h^k y^{(k)}(t_n) + O(h^{p+1}). \]
Equating coefficients for like powers shows that
\[ e = U\alpha_0, \qquad \frac{1}{k!}c^k = \frac{1}{(k-1)!}Ac^{k-1} + U\alpha_k, \qquad k = 1,\dots,p. \tag{7.10a} \]
Similarly, expanding (7.4b) leads to
\[ \sum_{k=0}^p \alpha_k h^k y^{(k)}(t_{n+1}) = \sum_{k=0}^p \sum_{l=0}^k \frac{\alpha_l}{(k-l)!} h^k y^{(k)}(t_n) + O(h^{p+1}) = \sum_{k=1}^p \frac{Bc^{k-1}}{(k-1)!} h^k y^{(k)}(t_n) + \sum_{k=0}^p V\alpha_k h^k y^{(k)}(t_n) + O(h^{p+1}) \]
such that
\[ \alpha_0 = V\alpha_0, \qquad \sum_{l=0}^k \frac{\alpha_l}{(k-l)!} = \frac{1}{(k-1)!}Bc^{k-1} + V\alpha_k, \qquad k = 1,\dots,p. \tag{7.10b} \]
The expressions (7.10) are equivalent to (7.9). On the other hand, if (7.9) holds, then the same Taylor series expansions show that the method has order and stage order p in the sense of Definition 7.1.

For methods in Nordsieck form with s = r = p + 1 Theorem 7.6 takes a particularly simple form. Due to W = I it turns out that such a method has order p and stage order q = p if the matrices U and V are chosen as
\[ U = C - ACK, \qquad V = E - BCK. \tag{7.11} \]
In particular, A may have a diagonally implicit structure. Thus, for any given order, there are diagonally implicit methods having stage order equal to the order. Of course, for such a method to be of practical use, V needs to be stable. The construction of methods in Nordsieck form with s = r = p + 1 = q + 1 having excellent stability properties is carried out in [158].
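Relation (7.11) can be turned directly into a small construction routine: given A, B and the abscissae c of a method in Nordsieck form with s = r = p + 1, the matrices U and V are fixed by the order conditions. The sketch below uses made-up sample coefficients (the routine name and the values of A, B, c are ours):

```python
import numpy as np
from math import factorial

def nordsieck_UV(A, B, c):
    """U = C - A C K and V = E - B C K from (7.11); s = r = p + 1."""
    p = len(c) - 1
    K = np.diag(np.ones(p), 1)                      # nilpotent shift matrix
    E = np.array([[1 / factorial(j - i) if j >= i else 0.0
                   for j in range(p + 1)] for i in range(p + 1)])  # exp(K)
    C = np.column_stack([c ** k / factorial(k) for k in range(p + 1)])
    return C - A @ C @ K, E - B @ C @ K

# Sample singly diagonally implicit A and some B (made up, p = 2).
A = np.array([[0.25, 0.0, 0.0], [0.5, 0.25, 0.0], [0.3, 0.3, 0.25]])
B = np.array([[0.2, 0.3, 0.5], [0.0, 0.0, 1.0], [0.1, 0.1, 0.1]])
c = np.array([0.0, 0.5, 1.0])
U, V = nordsieck_UV(A, B, c)
# Pre-consistency holds by construction (alpha_0 = e_1 in Nordsieck form):
assert np.allclose(U[:, 0], 1.0)           # U alpha_0 = e
assert np.allclose(V[:, 0], [1.0, 0, 0])   # V alpha_0 = alpha_0
```

Whether the resulting V is power bounded (Definition 7.3) must still be checked separately; (7.11) only enforces order and stage order.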


7.2 Stability Analysis

Similar to the approach of Chapter 2 the linear stability behaviour of a general linear method M = [A,U,B,V] can be assessed by studying the linear scalar test equation
\[ y'(t) = \lambda\, y(t) \tag{7.12} \]
where λ ∈ C is a complex number having negative real part. For (7.12) the numerical scheme (7.4) reads
\[ Y = zAY + Uy^{[n]}, \qquad y^{[n+1]} = zBY + Vy^{[n]} \qquad\Rightarrow\qquad y^{[n+1]} = \bigl(V + zB(I - zA)^{-1}U\bigr)y^{[n]}. \]
As usual the variable z = λh has been introduced.

Definition 7.7. The matrix M(z) = V + zB(I − zA)^{-1}U is called the stability matrix of the general linear method M = [A,U,B,V]. The stability region is given by the set G = \{z ∈ C \mid \sup_{n \ge 1} ‖M(z)^n‖ < ∞\} and the stability function is defined as
\[ \Phi(w,z) = \det\bigl(wI - M(z)\bigr). \]

The close relationship between stability function and stability region becomes obvious by considering the sets
\[ H = \{z \in \mathbb C \mid \exists\, w \in \mathbb C,\ |w| \ge 1,\ \Phi(w,z) = 0\}, \qquad H_0 = \{z \in \mathbb C \mid \exists\, w \in \mathbb C,\ |w| > 1,\ \Phi(w,z) = 0\}. \]
It is easy to see that the complement \bar G = \mathbb C \setminus G of the stability region satisfies \bar G \subset H and \bar G \supset H_0. Hence the boundary of H reveals the shape of the stability region G. In order to trace out the boundary of H one has to solve Φ(exp(θi), z) = 0 for z where θ varies in the interval [0, 2π].

Definition 7.8. The general linear method M = [A,U,B,V] is called A-stable if the left half of the complex plane C^- = \{z ∈ C \mid Re(z) < 0\} is included in the stability region, i.e. C^- ⊂ G.

If the sector \{r e^{(π−θ)i} ∈ C \mid r ∈ R,\ r ≥ 0,\ θ ∈ [−α, α]\} is included in the stability region G, then M is called A(α)-stable.

The method is called L-stable if it is A-stable and the stability matrix evaluated at infinity, M_∞ = lim_{z→∞} M(z), has zero spectral radius, i.e. the spectrum of M_∞ is given by σ(M_∞) = \{0\}.

Since M(z) is a matrix valued function, L-stability could equally well be defined using the stronger requirement M_∞ = 0. However, this leaves only little freedom for the method and we will confine ourselves to Definition 7.8.


Runge-Kutta methods are known to often have excellent stability properties. This was already discussed in Chapter 2. It is thus a desirable property that a general linear method has the same stability region as a Runge-Kutta method. This situation occurs when
\[ \Phi(w,z) = \det\bigl(wI - M(z)\bigr) = w^{r-1}\bigl(w - R(z)\bigr) \tag{7.13} \]
such that M(z) has only one nonzero eigenvalue. R(z) has the same significance as the stability function of a Runge-Kutta method. Consequently, a general linear method is said to have Runge-Kutta stability if its stability function satisfies the relation (7.13).

It is a complicated task to determine conditions on the method that guaranteeRunge-Kutta stability. Nevertheless Wright [158] succeeded in constructingpractical methods by ensuring that the sufficient conditions of inherent Runge-Kutta stability are met.

Definition 7.9. A general linear method M = [A,U,B,V] with Ve_1 = e_1 has inherent Runge-Kutta stability if
\[ BA = XB, \qquad BU \equiv XV - VX \]
for some matrix X and
\[ \Phi(w,0) = \det(wI - V) = w^{r-1}(w-1). \]
The notation A ≡ B indicates that the two matrices are equal with the possible exception of the first row.

The following theorem shows that inherent Runge-Kutta stability is indeedsufficient for Runge-Kutta stability. This result can be found in [25] or [158]as well.

Theorem 7.10. Inherent Runge-Kutta stability implies Runge-Kutta stability.

Proof. In order to check for Runge-Kutta stability we need to compute the characteristic polynomial of the stability matrix M(z) or, equivalently, of a matrix that is related to M(z) by a similarity transformation. It turns out that
\[ (I - zX)M(z)(I - zX)^{-1} = (I - zX)\bigl(V + zB(I - zA)^{-1}U\bigr)(I - zX)^{-1} \]
\[ = \bigl(V - zXV + z[B - zXB](I - zA)^{-1}U\bigr)(I - zX)^{-1} \]
\[ \equiv \bigl(V - z(BU + VX) + z[B - zBA](I - zA)^{-1}U\bigr)(I - zX)^{-1} \]
\[ = \bigl(V - z(BU + VX) + zBU\bigr)(I - zX)^{-1} = (V - zVX)(I - zX)^{-1} = V. \]
Due to inherent Runge-Kutta stability, V has only one nonzero eigenvalue and so does M(z), since (I − zX)M(z)(I − zX)^{-1} is identical to V except for the first row.


Methods with inherent Runge-Kutta stability seem very likely to be good candidates for practical computations. Methods with s = r = p + 1 = q + 1 can be constructed using only linear operations [158]. It is possible for A to have singly diagonally implicit structure such that these methods can be implemented efficiently. The stability for variable stepsize implementations is investigated in [38] and highly accurate error estimators are provided in [40]. Many implementation issues such as stage predictors and details of the Newton iteration are addressed by Huang in [93]. This work led to a fixed order implementation of general linear methods having inherent Runge-Kutta stability. The code was successfully used to solve stiff and nonstiff ordinary differential equations.

Instead of the linear scalar test problem (7.12), nonlinear ODEs

    y′(t) = f(y(t), t),    y(t0) = y0,    (7.14a)

can be considered as well. Assume that f satisfies the one-sided Lipschitz condition

    ⟨u − v, f(u, t) − f(v, t)⟩ ≤ ν ‖u − v‖²    ∀ u, v ∈ R^m, t ∈ [t0, t_end],    (7.14b)

where ‖·‖ is a norm in R^m and ⟨·,·⟩ the corresponding inner product. Provided that (7.14b) holds, any two solutions y(t), z(t) satisfy

    ‖y(t) − z(t)‖ ≤ ‖y(t0) − z(t0)‖ exp( ν (t − t0) )

(see e.g. [85]). For every ν ≤ 0 the distance between these two solutions is thus a non-increasing function of t.
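This contractive behaviour can also be observed for numerical solutions. The sketch below (an illustration added here, not from the original text) applies the implicit Euler method to y′ = −y³, which satisfies the one-sided Lipschitz condition with ν = 0 because ⟨u − v, −u³ + v³⟩ ≤ 0, and checks that the distance between two numerical solutions never grows.

```python
def implicit_euler_step(y, h):
    """One implicit Euler step for y' = -y**3: solve x + h*x**3 = y by Newton."""
    x = y
    for _ in range(50):
        x -= (x + h * x**3 - y) / (1.0 + 3.0 * h * x * x)
    return x

h = 0.1
y, z = 1.0, -0.5                      # two different initial values
dists = [abs(y - z)]
for _ in range(50):
    y, z = implicit_euler_step(y, h), implicit_euler_step(z, h)
    dists.append(abs(y - z))
print(dists[0], dists[-1])            # the distance is non-increasing
```

For this scalar problem the contraction can even be shown exactly: subtracting the defining equations of two steps gives (x₁ − x₂) + h(x₁³ − x₂³) = y₁ − y₂ with (x₁ − x₂)(x₁³ − x₂³) ≥ 0, so |x₁ − x₂| ≤ |y₁ − y₂|.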

A Runge-Kutta method is called B-stable if it mimics this behaviour in the sense that

    ‖y_{n+1} − z_{n+1}‖ ≤ ‖y_n − z_n‖    (7.15)

for every problem (7.14a) satisfying (7.14b) with ν = 0. Studying nonlinear stability for Runge-Kutta methods [12, 19] was motivated by the fundamental works of Dahlquist on nonlinear stability for one-leg methods in 1976 [50]. For any linear multistep method

    ∑_{i=0}^{k} α_i y_{n+1−i} = h ∑_{i=0}^{k} β_i f(y_{n+1−i}, t_{n+1−i})    (7.16a)

its one-leg counterpart is given by

    ∑_{i=0}^{k} α_i y_{n+1−i} = h f( ∑_{i=0}^{k} β_i y_{n+1−i}, t_n ).    (7.16b)
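The difference between a linear multistep method (7.16a) and its one-leg twin (7.16b) only shows up for nonlinear f. The sketch below (a hypothetical test setup chosen here for illustration) compares the explicit two-step Adams-Bashforth method with its one-leg counterpart on y′ = −y², whose exact solution for y(0) = 1 is y(t) = 1/(1 + t).

```python
def f(y):
    return -y * y              # y' = -y^2, exact solution y(t) = 1/(1+t)

h, n_steps = 0.01, 100         # integrate from t = 0 to t = 1

def lmm(h, n):
    """AB2: y_{n+1} = y_n + h*(3/2 f(y_n) - 1/2 f(y_{n-1}))."""
    y_prev, y_cur = 1.0, 1.0 / (1.0 + h)    # exact value used to start
    for _ in range(n - 1):
        y_prev, y_cur = y_cur, y_cur + h * (1.5 * f(y_cur) - 0.5 * f(y_prev))
    return y_cur

def one_leg(h, n):
    """One-leg twin: y_{n+1} = y_n + h*f(3/2 y_n - 1/2 y_{n-1})."""
    y_prev, y_cur = 1.0, 1.0 / (1.0 + h)
    for _ in range(n - 1):
        y_prev, y_cur = y_cur, y_cur + h * f(1.5 * y_cur - 0.5 * y_prev)
    return y_cur

exact = 0.5                    # y(1)
ya, yb = lmm(h, n_steps), one_leg(h, n_steps)
print(ya - exact, yb - exact, ya - yb)
```

Both variants converge to the exact solution with order 2, but the numerical solutions differ slightly, which is why the stability of one does not trivially carry over to the other.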

In [50] Dahlquist introduced the concept of G-stability, i.e. he derived conditions guaranteeing that numerical solutions obtained by a one-leg method


behave in the sense of (7.15) for a special norm ‖·‖_G constructed in the paper. Later Dahlquist was able to prove that a linear multistep method (7.16a) is A-stable if and only if the corresponding one-leg method (7.16b) is G-stable [51, 85].

A series of papers [13, 20, 21] led to a criterion which generalises both G-stability for one-leg methods and B-stability for Runge-Kutta methods. This criterion is based on the matrix

    H = [ DA + AᵀD − BᵀGB    DU − BᵀGV ]
        [ UᵀD − VᵀGB         G − VᵀGV  ].

Stable behaviour for a general linear method in the sense of (7.15), for problems (7.14a) satisfying (7.14b) with ν = 0, is guaranteed if there is a matrix G and a diagonal matrix D, both positive definite, such that H is positive semi-definite.
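As a minimal illustration (added here), the criterion can be evaluated for the implicit Euler method viewed as a GLM with s = r = 1 and A = U = B = V = [1]: choosing G = D = [1] gives a positive semi-definite H, consistent with the B-stability of the implicit Euler method.

```python
def H_matrix(A, U, B, V, G, D):
    """The 2x2 matrix H for a GLM with s = r = 1 (all coefficients scalar)."""
    return [
        [D * A + A * D - B * G * B, D * U - B * G * V],
        [U * D - V * G * B,         G - V * G * V],
    ]

def psd_2x2(H):
    """A symmetric 2x2 matrix is positive semi-definite iff both diagonal
    entries and the determinant are nonnegative."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return H[0][0] >= 0 and H[1][1] >= 0 and det >= -1e-14

# implicit Euler as a general linear method: A = U = B = V = 1
H = H_matrix(1.0, 1.0, 1.0, 1.0, G=1.0, D=1.0)
print(H, psd_2x2(H))
```

Here H = [[1, 0], [0, 0]], which is positive semi-definite; for larger methods G and D have to be searched for, e.g. by semidefinite programming.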

More information is given in the extensive reference [25].


8 Implicit Index-1 DAEs

In this chapter implicit differential algebraic equations (DAEs)

    y′ = f(y, z′),    z = g(y),    (8.1)

are considered. It is assumed that the matrix M = I − f_z′ g_y remains nonsingular in a neighbourhood of the solution. Thus the DAE (8.1) has index 1.

Equations of this type were obtained when decoupling properly stated index-2 DAEs in Chapter 6. Given a nonlinear DAE of the form

    A(t) [ D(t) x(t) ]′ + b( x(t), t ) = 0    (8.2)

we saw that the components u = DP1 x and v = DQ1 x of the solution need to satisfy the inherent index-1 equation

    u′ = f( u, w(u, v′, t), t ),    v = D(t) z(u, t).    (8.3)

z and w are obtained by the implicit function theorem. Details are given in Section 6.1. (8.3) can be cast into the form (8.1) by considering

    y = (u, t)ᵀ,    z = v,    f(y, z′) = ( f(u, w(u, v′, t), t), 1 )ᵀ,    g(y) = D(t) z(u, t).

Investigating implicit DAEs (8.1) can be regarded as the key step in studying numerical methods for properly stated index-2 DAEs (8.2). Once order conditions and convergence results are available for (8.1), the corresponding results for the more general equation (8.2) can be obtained using the decoupling procedure from Chapter 6. The details of this approach will be carried out in Chapter 9.

Nevertheless, implicit DAEs (8.1) constitute an interesting problem class in their own right, as DAEs of this type arise frequently in applications.


[Figure 8.1: A simple RCL circuit — node potentials e1, e2 with resistor R, inductor L and capacitor C]

Example 8.1. Consider the simple linear RCL circuit depicted in Figure 8.1. The modified nodal analysis leads to

    C e1′ − C e2′ + (1/R) e1 = 0,
    −C e1′ + C e2′ + iL = 0,
    L iL′ − e2 = 0.

This is an index-1 system, where the ei are node potentials and iL represents the current through the inductive branch. Adding the first two equations gives (1/R) e1 + iL = 0, i.e. e1 = −R iL. Choosing new variables y = (e2, iL)ᵀ and z = e1, these equations can therefore be written equivalently as

    y′ = [ 0     −1/C ] y + [ 1 ] z′ = f(y, z′),    z = [ 0  −R ] y = g(y).
         [ 1/L    0   ]     [ 0 ]

For more general circuits f and g may of course be nonlinear. Equations of the type (8.1) were also used as a model problem in [104] to study the local order conditions for Runge-Kutta methods. Notice that when reformulating (8.1) in Hessenberg form the index typically increases to 2.
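The index-1 condition of (8.1) is easy to verify for Example 8.1. With the linear f and g above (using the sign convention e1 = −R iL derived there), f_z′ = (1, 0)ᵀ and g_y = (0, −R), so M = I − f_z′ g_y is unit upper triangular and hence nonsingular for all parameter values. A short check (illustrative only, element values are arbitrary):

```python
R, C, L = 100.0, 1e-6, 1e-3   # arbitrary element values (illustrative)

f_zp = [1.0, 0.0]             # f_{z'} for the f of Example 8.1
g_y = [0.0, -R]               # g_y for g(y) = -R * i_L

# M = I - f_{z'} g_y  (identity minus the outer product of the two vectors)
M = [[(1.0 if i == j else 0.0) - f_zp[i] * g_y[j] for j in range(2)]
     for i in range(2)]
det_M = M[0][0] * M[1][1] - M[0][1] * M[1][0]
print(M, det_M)               # det(M) = 1 independently of R, C, L
```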

Runge-Kutta methods for implicit DAEs and equations in Hessenberg form are studied thoroughly in the literature [84, 104, 140], but general linear methods have not yet been studied extensively in the context of DAEs. In this chapter classical results obtained for Runge-Kutta methods will be generalised to this wider class of integration schemes.

Section 8.1 is devoted to studying order conditions for general linear methods when applied to implicit index-1 equations (8.1). These order conditions are completely new, but results for Runge-Kutta methods are still contained as special cases. Combining the order conditions with a stability analysis leads to a convergence result in Section 8.2. Here the stability matrix – in contrast to a stability function in case of Runge-Kutta methods – plays a decisive role. Finally the accuracy of the stages and stage derivatives is investigated in Section 8.3. The results obtained there will be seminal when addressing DAEs of the form (8.2) in Chapter 9.

8.1 Order Conditions for the Local Error

General linear methods have not yet been studied for implicit index-1 systems (8.1). Given a method M = [A, U, B, V], the numerical scheme (7.5) can be modified for (8.1) such that

    Yi′ = f(Yi, Zi′),  i = 1, …, s,    Zi = g(Yi),  i = 1, …, s,
    Y = hA Y′ + U y^[n−1],    Z = hA Z′ + U z^[n−1],
    y^[n] = hB Y′ + V y^[n−1],    z^[n] = hB Z′ + V z^[n−1].


In addition to the stages Y, Z the stage derivatives Y′, Z′ are used to formulate the numerical scheme. Observe that f : R^{m1} × R^{m2} → R^{m1} and g : R^{m1} → R^{m2} such that

    Y = (Y1, …, Ys)ᵀ,  Y′ = (Y1′, …, Ys′)ᵀ ∈ R^{s·m1},    Z = (Z1, …, Zs)ᵀ,  Z′ = (Z1′, …, Zs′)ᵀ ∈ R^{s·m2}.

s denotes the number of internal stages. The numerical scheme can be written equivalently in the simplified form

    Y = hA f(Y, Z′) + U y^[n−1],    g(Y) = hA Z′ + U z^[n−1],    (8.4a)
    y^[n] = hB f(Y, Z′) + V y^[n−1],    z^[n] = hB Z′ + V z^[n−1],    (8.4b)

where the abbreviations

    f(Y, Z′) = ( f(Y1, Z1′), …, f(Ys, Zs′) )ᵀ,    g(Y) = ( g(Y1), …, g(Ys) )ᵀ

have been used.
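For a linear DAE the scheme (8.4a) reduces to a linear system for the stages and stage derivatives. The sketch below (an illustration added here) performs one step with the simplest GLM, the implicit Euler method (s = r = 1, A = U = B = V = [1]), applied to a linear problem f(y, z′) = F_y y + F_z z′, g(y) = G y; the matrices F_y, F_z, G are placeholder data, not taken from the text.

```python
def solve3(M, b):
    """Tiny Gauss-Jordan elimination with partial pivoting, 3x3 system."""
    M = [row[:] + [bi] for row, bi in zip(M, b)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(3):
            if r != c:
                fac = M[r][c] / M[c][c]
                M[r] = [a - fac * d for a, d in zip(M[r], M[c])]
    return [M[i][3] / M[i][i] for i in range(3)]

# linear index-1 DAE: y' = Fy*y + Fz*z', z = G*y  (illustrative data)
Fy = [[0.0, -2.0], [3.0, 0.0]]
Fz = [1.0, 0.0]
G = [0.0, -5.0]
h = 0.01
y0 = [1.0, 0.5]
z0 = G[0] * y0[0] + G[1] * y0[1]   # consistent initial value z0 = g(y0)

# implicit Euler as GLM: one step of (8.4a) is the linear system
#   (I - h*Fy) Y - h*Fz*Z' = y0,   G Y - h Z' = z0
Mat = [
    [1.0 - h * Fy[0][0], -h * Fy[0][1], -h * Fz[0]],
    [-h * Fy[1][0], 1.0 - h * Fy[1][1], -h * Fz[1]],
    [G[0], G[1], -h],
]
Y1, Y2, Zp = solve3(Mat, [y0[0], y0[1], z0])
y1 = [Y1, Y2]                       # with B = V = 1 the output is the stage
z1 = h * Zp + z0
print(y1, z1)
```

For nonlinear f and g the same coupled system has to be solved by a Newton-type iteration instead of a single linear solve.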

In this section we derive necessary and sufficient conditions for the local error to be of a given order p. Compared to ordinary differential equations, additional order conditions have to be satisfied since (8.1) is a DAE with index 1. The order conditions are obtained by expanding the exact and the numerical solution in a Taylor series, respectively, and comparing terms.

8.1.1 Taylor Expansion of the Exact Solution

The exact solution's Taylor expansion for a problem quite similar to (8.1) was derived in [104] by Kværnø. Hence this subsection rests, to a large extent, on the results already presented there. The Taylor series expansion will be written as a (generalised) B-series based on rooted trees. Due to the structure of the implicit index-1 equation (8.1), derivatives of y and z will be involved. Thus trees with two types of vertices are required. The different types of vertices will be referred to as black nodes (•) and white ones (◦). Rooted trees consisting of these nodes form the basis for analysing the numerical solution and for deriving order conditions later on. Thus it is not possible to just quote the results of [104], but we need to summarise the key points necessary for the subsequent sections. Compared to [104] we will use a slightly different approach when introducing the rooted trees. It is aimed at making the theory more easily accessible.

For convenience we start by defining sets of tall trees. Tall trees consist of only one type of vertex and have no ramifications,

    TT_y = { •, ••, •••, ••••, … },    TT_z = { ◦, ◦◦, ◦◦◦, ◦◦◦◦, … },

where •• denotes the chain of two black nodes, ◦◦ the chain of two white nodes, and so on.


The order |τ| of a tall tree τ is defined as the number of its nodes. Derivatives of z can be associated with trees in the following way:

    z′ = g_y y′,
    z″ = g_yy(y′, y′) + g_y y″,
    z‴ = g_yyy(y′, y′, y′) + 3 g_yy(y′, y″) + g_y y‴,  …

All trees appearing here share a white root representing derivatives of g. The black tall trees connected to the root represent derivatives of y. These trees are collected in the following set of special trees,

    ST_z = {∅} ∪ { [τ1, …, τk]_z | k ≥ 1, τi ∈ TT_y }.

The trees are 'special' in the sense that higher derivatives of y are still involved. The empty tree was included for convenience. Notice that we write [τ1, …, τk]_root to indicate that the trees τi are connected to a white root (root = z) or a black root (root = y). The particular sequence of subtrees does not matter: each permutation of the subtrees yields the same tree. Often pictorial representations of [τ1, …, τk]_z will be used; there, the children of the white root are the black roots of the tall trees τi.

For a given tree σ ∈ ST_z the order |σ| is defined as the number of its black nodes. Thus, z^(p) can be written in terms of trees of order p. This will be made more precise in the following result.

Lemma 8.2. For each σ ∈ ST_z define GS(σ) by

    GS(∅) = g,    GS( [τ1, …, τk]_z ) = g_y^k ( y^(|τ1|), …, y^(|τk|) )

with g_y^k = ∂^k g / ∂y^k. Then the p-th derivative of the exact solution is given by

    z^(p) = ∑_{σ ∈ ST_z, |σ| = p} β(σ) GS(σ),

where β(σ) is the number of times GS(σ) appears in the Taylor series expansion of the exact solution.

GS(σ) is called an elementary differential. Strictly speaking, for every σ ∈ ST_z the elementary differential GS(σ) is a mapping GS(σ) : R^{m1} × R^{m2} → R^{m2}. Consistent initial values (y0, z0) determine the unique solution (y, z) such that, for instance, GS([•]_z)(y0, z0) = g_y(y0) y′(t0) = z′(t0). For better readability the y- and z-arguments are generally omitted.
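The chain-rule expansions above, e.g. z″ = g_yy(y′, y′) + g_y y″, can be spot-checked numerically in the scalar case. The concrete g and y(t) below are arbitrary smooth test functions chosen here for illustration (not taken from the text).

```python
import math

def g(y):
    return math.sin(y)

def y(t):
    return t**3 + 2.0 * t                 # arbitrary smooth test input

t0, h = 0.7, 1e-4
yp = 3.0 * t0**2 + 2.0                    # y'(t0)
ypp = 6.0 * t0                            # y''(t0)

# z'' from the chain-rule formula: g_yy*(y')^2 + g_y*y''
formula = -math.sin(y(t0)) * yp * yp + math.cos(y(t0)) * ypp

# z'' from a central second difference of z(t) = g(y(t))
z = lambda t: g(y(t))
fd = (z(t0 + h) - 2.0 * z(t0) + z(t0 - h)) / (h * h)

print(formula, fd)
```

The two values agree up to the O(h²) truncation error of the difference quotient.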


As a next step, derivatives of the exact solution y are considered. Using the chain rule it turns out that

    y′ = f,
    y″ = f_y y′ + f_z′ z″,
    y‴ = f_yy(y′, y′) + 2 f_yz′(z″, y′) + f_y y″ + f_z′z′(z″, z″) + f_z′ z‴.

It is straightforward to associate the different terms with trees as indicated above. The tall trees τ ∈ TT_y represent y-derivatives y^(|τ|). Similarly, σ ∈ TT_z represents z^(|σ|+1). Be careful not to mix up the different counting for y- and z-derivatives.

The highest derivative of z can be replaced by the expressions already calculated. This leads to

    y″ = f_y y′ + [ f_z′ g_yy(y′, y′) + f_z′ g_y y″ ],
    y‴ = f_yy(y′, y′) + 2 f_yz′(z″, y′) + f_y y″ + f_z′z′(z″, z″)
         + [ f_z′ g_yyy(y′, y′, y′) + 3 f_z′ g_yy(y′, y″) + f_z′ g_y y‴ ].

When replacing z-derivatives, the corresponding subtrees are treated accordingly. Take e.g. σ = [◦◦]_y, the tree representing f_z′ z‴. Since z‴ can be written in terms of the trees [•, •, •]_z, [•, ••]_z and [•••]_z, we needed to replace σ's single subtree with these three trees one after the other. Thus we arrived at [[•, •, •]_z]_y, [[•, ••]_z]_y and [[•••]_z]_y, but the tree σ itself is no longer present in the expansion of y‴.

Due to the index-1 condition the matrix M = I − f_z′ g_y is nonsingular. This allows, as the next step, to solve for the highest y-derivative:

    y″ = M⁻¹ f_y y′ + [ M⁻¹ f_z′ g_yy(y′, y′) ],
    y‴ = M⁻¹ f_yy(y′, y′) + 2 M⁻¹ f_yz′(z″, y′) + M⁻¹ f_y y″ + M⁻¹ f_z′z′(z″, z″)
         + [ M⁻¹ f_z′ g_yyy(y′, y′, y′) + 3 M⁻¹ f_z′ g_yy(y′, y″) ].

Observe that, due to the multiplication with M⁻¹, trees of the form [[τ]_z]_y with a tall tree τ ∈ TT_y (e.g. [[•••]_z]_y) are no longer present. Also notice that the trees representing y^(p) fall into two distinct classes.


There are trees that were present from the very beginning when calculating y^(p) from y′ = f(y, z′). But there are also those trees that originate from representing the highest z-derivative by trees from ST_z. The former trees will be collected in the set ST_yy while the latter ones belong to ST_yz:

    ST_yy = {∅} ∪ { [τ1, …, τk, σ1, …, σl]_y | k ≥ 0, l ≥ 0, (k, l) ≠ (0, 1), τi ∈ TT_y, σi ∈ TT_z },
    ST_yz = { [σ]_y | σ ∈ ST_z, σ ≠ [τ]_z for τ ∈ TT_y }.

The tree with only one vertex, • ∈ ST_yy, is represented by [ ]_y. The trees in ST_yy have a black root, and all their subtrees are tall trees. Trees of the form [σ]_y with σ ∈ TT_z had to be excluded since they were replaced by the corresponding trees from ST_yz. The trees in ST_yz, in turn, are nothing but the ST_z trees attached to a new black root. However, trees of the form [[τ]_z]_y with a tall tree τ ∈ TT_y are not included, as we solved for the highest y-derivative represented by τ. Table 8.1 gives pictorial representations for the special trees of order up to 4.

For any given tree τ let |τ|_• be the number of its black nodes. Similarly denote the number of white nodes by |τ|_◦. Using this notation we define

    ST_y = ST_yy ∪ ST_yz,    |τ| = |τ|_• − |τ|_◦ for τ ∈ ST_y,    |τ| = |τ|_• − |τ|_◦ + 1 for τ ∈ ST_z,

and arrive at the following straightforward result.

Lemma 8.3. Given the elementary differentials

    FS(τ) = y    for τ = ∅,
    FS(τ) = M⁻¹ f_{y^k z′^l} ( y^(|τ1|), …, y^(|τk|), z^(|σ1|+1), …, z^(|σl|+1) )    for τ = [τ1, …, τk, σ1, …, σl]_y ∈ ST_yy,
    FS(τ) = M⁻¹ f_z′ g_y^k ( y^(|τ1|), …, y^(|τk|) )    for τ = [[τ1, …, τk]_z]_y ∈ ST_yz,

we have

    y^(p) = ∑_{τ ∈ ST_y, |τ| = p} β(τ) FS(τ).

Again, β(τ) is the number of times FS(τ) appears in the exact solution's Taylor series expansion.

[Table 8.1: Special trees of order up to 4 — the pictorial tree representations for ST_z and ST_yy are not reproduced here]


Lemma 8.2 and 8.3 enable a representation of the derivatives y^(p), z^(p) in terms of elementary differentials FS and GS, respectively. As these representations are based on special trees, the elementary differentials themselves are defined using y^(k), z^(k) for k up to p − 1. A recursive insertion of the expressions for y^(k), z^(k) into those of y^(p), z^(p) will make it possible to write derivatives of y and z in terms of partial derivatives of f and g only. In order to make this process more transparent we introduce the following notation: for any set T of trees, T^k is the subset of T which consists of all trees of order k.

Definition 8.4. Introduce the following sets of rooted trees,

    T_z  = {∅} ∪ { [τ1, …, τk]_z | ∃ [τ̄1, …, τ̄k]_z ∈ ST_z : τi ∈ T_y^{|τ̄i|} },
    T_yy = {∅} ∪ { [τ1, …, τk, σ1, …, σl]_y | ∃ [τ̄1, …, τ̄k, σ̄1, …, σ̄l]_y ∈ ST_yy : τi ∈ T_y^{|τ̄i|}, σj ∈ T_z^{|σ̄j|+1} },
    T_yz = { [[τ1, …, τk]_z]_y | ∃ [[τ̄1, …, τ̄k]_z]_y ∈ ST_yz : τi ∈ T_y^{|τ̄i|} },
    T_y  = T_yy ∪ T_yz.

The order of a given tree τ is defined as |τ| = |τ|_• − |τ|_◦ for τ ∈ T_y and |τ| = |τ|_• − |τ|_◦ + 1 for τ ∈ T_z.

The trees of order up to 3 are depicted in Table 8.2. Even though Definition 8.4 seems quite technical, the idea is simple: each tree τ ∈ T_y, σ ∈ T_z is obtained from a corresponding special tree sp(τ) ∈ ST_y, sp(σ) ∈ ST_z, respectively, by recursively substituting subtrees. As an example consider the special tree [•, ◦]_y ∈ ST_y representing f_yz′(z″, y′). Its subtree • represents y′ and is therefore replaced by each element of T_y^1; as it happens, this doesn't change the tree since T_y^1 = {•}. In contrast to that, the subtree ◦ represents z″ and needs to be replaced by every tree from T_z^2. Thus this one special tree gives rise to the three trees [•, σ]_y with σ ∈ T_z^2 = { [•, •]_z, [[•]_y]_z, [[[•, •]_z]_y]_z }, each of which satisfies sp([•, σ]_y) = [•, ◦]_y.

[Table 8.2: Trees of order up to 3 — pictorial representations for T_y and T_z not reproduced here]


As indicated above, the unique special tree giving rise to τ ∈ T_y ∪ T_z will be denoted as sp(τ).

It is obvious that the number of trees for a given order grows exponentially, as can be seen in Table 8.3. However, the trees can be constructed efficiently using software tools such as Maple [9].

Not all of the trees in T_y ∪ T_z will lead to independent order conditions. To this end suitable simplifications will be considered in Section 8.1.2.

The same procedure of recursively substituting subtrees by lower order expressions can be applied to the elementary differentials as well.

Lemma 8.5. Introduce the following elementary differentials

    G(σ) = z    for σ = ∅,
    G(σ) = g_y^k ( F(τ1), …, F(τk) )    for σ = [τ1, …, τk]_z ∈ T_z,

    F(τ) = y    for τ = ∅,
    F(τ) = f    for τ = •,
    F(τ) = M⁻¹ f_{y^k z′^l} ( F(τ1), …, F(τk), G(σ1), …, G(σl) )    for τ = [τ1, …, τk, σ1, …, σl]_y ∈ T_y,

with M = I − f_z′ g_y. Then

    y^(p) = ∑_{τ ∈ T_y, |τ| = p} α(τ) F(τ)(y0, z0),    z^(p) = ∑_{σ ∈ T_z, |σ| = p} α(σ) G(σ)(y0, z0).    (8.5)

α is the number of times that a given elementary differential appears in the exact solution's Taylor expansion. α can be calculated recursively as

    α(τ) = 1    for τ ∈ { ∅, •, [•]_z },
    α(τ) = β(sp(τ)) α(τ1) ⋯ α(τk)    for τ = [τ1, …, τk]_z ∈ T_z and for τ = [[τ1, …, τk]_z]_y ∈ T_yz,
    α(τ) = β(sp(τ)) ∏_{i=1}^{k} α(τi) ∏_{j=1}^{l} α(σj)    for τ = [τ1, …, τk, σ1, …, σl]_y ∈ T_yy.

    order p   1    2    3     4     5      6    |  ∑ (p = 1, …, 6)
    ST_z      1    2    3     5     7     11    |     29
    ST_yy     1    1    4     9    19     35    |     69
    ST_yz     0    1    2     4     6     10    |     23
    T_y       1    2   15   136  1458  17089    |  18701
    T_z       1    3   18   157  1645  19132    |  20956

Table 8.3: The number of trees for a given order
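The first three rows of Table 8.3 can be reproduced with a short counting script. Trees in ST_z (and ST_yz) correspond to multisets of black tall trees, i.e. to integer partitions, while trees in ST_yy correspond to partitions into parts of two colours (a part of size n stands for a black tall tree of order n or a white tall tree with n nodes), with the single-white-subtree case (k, l) = (0, 1) excluded. This counting argument is our reading of the definitions above, not spelled out in the text.

```python
def colored_partitions(n_max, colors):
    """ways[n] = number of multisets of parts (sizes >= 1, each part carrying
    one of `colors` colours) summing to n; unbounded-knapsack style DP."""
    ways = [1] + [0] * n_max
    for size in range(1, n_max + 1):
        for _ in range(colors):
            for n in range(size, n_max + 1):
                ways[n] += ways[n - size]
    return ways

P1 = colored_partitions(6, 1)   # ordinary integer partitions
P2 = colored_partitions(6, 2)   # partitions with 2-coloured parts

STz = [P1[p] for p in range(1, 7)]            # multisets of tall trees, order p
STyz = [P1[p] - 1 for p in range(1, 7)]       # as STz, but at least 2 subtrees
STyy = [P2[p - 1] - (1 if p >= 2 else 0)      # parts sum to p-1; (k,l) != (0,1)
        for p in range(1, 7)]
print(STz, STyz, STyy)
```

The output matches the ST_z, ST_yz and ST_yy rows of Table 8.3; the T_y and T_z rows require enumerating the recursively substituted trees themselves.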


Proof. From y = F(∅), z = G(∅), y′ = F(•) and z′ = g_y y′ = G([•]_z) we immediately get α(τ) = 1 for τ ∈ { ∅, •, [•]_z }. Let p ≥ 2 be fixed and assume that (8.5) holds for all orders k < p. From Lemma 8.3 it is known that

    y^(p) = ∑_{τs ∈ ST_y^p} β(τs) FS(τs) = ∑_{τs ∈ ST_yy^p} β(τs) FS(τs) + ∑_{τs ∈ ST_yz^p} β(τs) FS(τs)
          = ∑_{τs = [τ̄1, …, τ̄k, σ̄1, …, σ̄l]_y ∈ ST_yy^p} β(τs) M⁻¹ f_{y^k z′^l} ( y^(|τ̄1|), …, y^(|τ̄k|), z^(|σ̄1|+1), …, z^(|σ̄l|+1) )
          + ∑_{τs = [[τ̄1, …, τ̄k]_z]_y ∈ ST_yz^p} β(τs) M⁻¹ f_z′ g_y^k ( y^(|τ̄1|), …, y^(|τ̄k|) ).    (8.6)

By construction we have |τ̄i| < p and |σ̄j| < p − 1. Thus, by the induction hypothesis, we get

    y^(|τ̄i|) = ∑_{τi ∈ T_y^{|τ̄i|}} α(τi) F(τi),    z^(|σ̄j|+1) = ∑_{σj ∈ T_z^{|σ̄j|+1}} α(σj) G(σj).    (8.7)

Observe that τ̄i and σ̄j are tall trees. In order to distinguish them from τi ∈ T_y and σj ∈ T_z we use the bar symbol. Inserting (8.7) into (8.6) and using the multi-linearity of the derivatives, we obtain

    y^(p) = ∑_{τs = [τ̄1, …, τ̄k, σ̄1, …, σ̄l]_y ∈ ST_yy^p} ∑_{τ1 ∈ T_y^{|τ̄1|}} ⋯ ∑_{σl ∈ T_z^{|σ̄l|+1}} β(τs) α(τ1) ⋯ α(τk) α(σ1) ⋯ α(σl) · M⁻¹ f_{y^k z′^l} ( F(τ1), …, F(τk), G(σ1), …, G(σl) )
          + ∑_{τs = [[τ̄1, …, τ̄k]_z]_y ∈ ST_yz^p} ∑_{τ1 ∈ T_y^{|τ̄1|}} ⋯ ∑_{τk ∈ T_y^{|τ̄k|}} β(τs) α(τ1) ⋯ α(τk) · M⁻¹ f_z′ g_y^k ( F(τ1), …, F(τk) )
          = ∑_{τ = [τ1, …, τk, σ1, …, σl]_y ∈ T_yy^p} β(sp(τ)) α(τ1) ⋯ α(σl) F(τ) + ∑_{τ = [[τ1, …, τk]_z]_y ∈ T_yz^p} β(sp(τ)) α(τ1) ⋯ α(τk) F(τ).

The formula for α(τ), τ ∈ T_y, follows by comparing terms with (8.5). For σ ∈ T_z the corresponding expression is proved similarly.

Remark 8.6. In [104] it is claimed that if τ = [τ1, …, τk, σ1, …, σl]_y ∈ T_y, then α(τ) = β(sp(τ)) α(τ1) ⋯ α(τk) α(σ1) ⋯ α(σl). The proof above shows that this is not correct for τ = [σ]_y ∈ T_yz.

As an example consider τ = [[•, ••]_z]_y with sp(τ) = τ. Recall from the computations on page 128 f. that β(sp(τ)) = 3. According to [104] one calculates α(τ) = 3 α([•, ••]_z) = 3 · 3 · α(•) · α(••) = 9, which is wrong. Following Lemma 8.5, by contrast, we correctly find α(τ) = 3 · α(•) · α(••) = 3.


Lemma 8.5 shows how to write the Taylor series expansion of the exact solution in terms of elementary differentials. In order to simplify notation we introduce (generalised) B-series in the following definition. These B-series will be the central tool in the subsequent analysis.

Definition 8.7. Let T be a set of rooted trees and denote by F(τ) the corresponding elementary differentials. If a : T → R^n is a vector valued function, then the formal series

    B_T^F (a; y, z) = ∑_{τ ∈ T} ( a(τ) ⊗ F(τ)(y, z) ) α(τ) h^{|τ|} / |τ|!

is called a B-series for the elementary weight function a at (y, z) with stepsize h.

Again, ⊗ denotes the Kronecker product, which will often be omitted for convenience. If the type of the elementary differentials is clear from the context, say by specifying the set T, the superscript F will be dropped as well.

Using the result of Lemma 8.5 we arrive at the following B-series representations.

Corollary 8.8. For r ≥ 1 define the elementary weight functions

    S : T_y ∪ T_z → R^r,    S_{i+1}(τ) = |τ|! if |τ| = i, and 0 otherwise,    i = 0, …, r − 1,
    E : T_y ∪ T_z → R^r,    E_{i+1}(τ) = |τ|! / (|τ| − i)! if |τ| ≥ i, and 0 otherwise,    i = 0, …, r − 1.

Let (y, z) be the exact solution of (8.1). Then, for i = 0, …, r − 1, the scaled derivatives of (y, z) are given by

    h^i y^(i)(t0) = B_{T_y}^F ( S_{i+1}; y(t0), z(t0) ),    h^i y^(i)(t0 + h) = B_{T_y}^F ( E_{i+1}; y(t0), z(t0) ),
    h^i z^(i)(t0) = B_{T_z}^G ( S_{i+1}; y(t0), z(t0) ),    h^i z^(i)(t0 + h) = B_{T_z}^G ( E_{i+1}; y(t0), z(t0) ).

Proof. With Lemma 8.5 and Definition 8.7 the representations of h^i y^(i)(t0) and h^i z^(i)(t0) are obvious. Recall that the exact solution at t0 + h is given by y(t0 + h) = ∑_{p=0}^{∞} y^(p)(t0) h^p / p!. Thus we calculate

    h^i y^(i)(t0 + h) = ∑_{p=0}^{∞} y^(p+i)(t0) h^{p+i} / p!
                      = ∑_{p=0}^{∞} ∑_{τ ∈ T_y, |τ| = p+i} α(τ) F(τ) h^{p+i} / p!
                      = ∑_{τ ∈ T_y, |τ| ≥ i} ( |τ|! / (|τ| − i)! ) α(τ) F(τ) h^{|τ|} / |τ|!
                      = B_{T_y}^F ( E_{i+1}; y(t0), z(t0) ).


With Corollary 8.8 a large step was taken towards deriving order conditions for general linear methods in the context of implicit index-1 DAEs – at least in case of methods in Nordsieck form. In this case, S represents the exact starting procedure with

    y^[0] = B_{T_y}( S; y(t0), z(t0) ),    z^[0] = B_{T_z}( S; y(t0), z(t0) ).

The result of applying S to the exact solution at t0 + h is given by

    ŷ^[1] = B_{T_y}( E; y(t0), z(t0) ),    ẑ^[1] = B_{T_z}( E; y(t0), z(t0) ).

It is clear from Figure 8.2 that in order to derive order conditions we need to find similar expressions for the numerical result y^[1], z^[1] after taking one step with the general linear method. The derivation of corresponding B-series representations is the aim of the next section.

8.1.2 Taylor Expansion of the Numerical Solution

Let M = [A, U, B, V] be a general linear method having a nonsingular matrix A. We assume that M is given in Nordsieck form and consider one step of the numerical scheme (8.4). Let the initial values y(t0) = y0, z(t0) = z0 be consistent. The input quantities are assumed to be exact, i.e. y^[0] = B_{T_y}( S; y(t0), z(t0) ) as well as z^[0] = B_{T_z}( S; y(t0), z(t0) ) are exact Nordsieck vectors at t0. Our aim is to derive B-series representations for the quantities

    Y = B_{T_y}( v; y0, z0 ),    hZ′ = B_{T_z}( k; y0, z0 ),    (8.8a)
    y^[1] = B_{T_y}( y; y0, z0 ),    z^[1] = B_{T_z}( z; y0, z0 )    (8.8b)

involved in computing the numerical result. The definition of the elementary weight functions v, k, y and z will be given in due course. Before doing so, a preliminary definition will prove useful.

[Figure 8.2: One step taken by a GLM M for the problem (8.1). The diagram shows the exact solution (y(t0), z(t0)) propagated to (y(t1), z(t1)); the starting procedure S maps these to the Nordsieck vectors y^[0], z^[0] and ŷ^[1], ẑ^[1], respectively, while the method M maps y^[0], z^[0] to the numerical output y^[1], z^[1].]


Definition 8.9. For any given elementary weight function a : T_y → R^s with a(∅) = e = (1, …, 1)ᵀ define

    aD : T_z → R^s,    (aD)(∅) = e,    (aD)( [τ1, …, τk]_z ) = a(τ1) ⋯ a(τk).

Observe that even though a is defined on T_y, aD operates on the set T_z. As aD produces results in R^s, the products a(τ1) ⋯ a(τk) have to be understood component-wise. In particular, for the tree [•]_z representing z′, with • = [ ]_y, one obtains (aD)([•]_z) = a(•).

Obviously, D is closely related to the derivative operator used in [25]. However, when dealing with implicit index-1 DAEs, D has a slightly different meaning. In the next lemma we show that D expresses the relationship z = g(y) on the level of elementary weight functions.

Lemma 8.10. Let v : T_y → R^s be an elementary weight function satisfying v(∅) = e = (1, …, 1)ᵀ. If the initial values (y0, z0) are consistent, then the following relation holds:

    g( B_{T_y}(v; y0, z0) ) = B_{T_z}( vD; y0, z0 ).

Proof. Since v(∅) = (vD)(∅) = e we have

    ( g( B_{T_y}(v; y0, z0) ) )_{h=0} = e ⊗ g(y0) = e ⊗ z0 = ( B_{T_z}(vD; y0, z0) )_{h=0}.

In order to prove the assertion we need to show that

    ( d^p/dh^p g( B_{T_y}(v; y0, z0) ) )_{h=0} = ( d^p/dh^p B_{T_z}(vD; y0, z0) )_{h=0}

holds for every p ≥ 1. To this end it is convenient to introduce

    Y^(k) := ( d^k/dh^k B_{T_y}(v; y0, z0) )_{h=0} = ∑_{τ ∈ T_y^k} α(τ) v(τ) F(τ)(y0, z0)

for k ≥ 1. With this notation, Lemma 8.2, 8.5 and the multi-linearity of the derivative g_y^k one obtains for p ≥ 1

    ( d^p/dh^p g( B_{T_y}(v; y0, z0) ) )_{h=0}
        = ∑_{σ = [τ̄1, …, τ̄k]_z ∈ ST_z^p} β(σ) g_y^k ( Y^(|τ̄1|), …, Y^(|τ̄k|) )
        = ∑_{σ = [τ̄1, …, τ̄k]_z ∈ ST_z^p} ∑_{τ1 ∈ T_y^{|τ̄1|}} ⋯ ∑_{τk ∈ T_y^{|τ̄k|}} β(σ) α(τ1) ⋯ α(τk) · v(τ1) ⋯ v(τk) · g_y^k ( F(τ1), …, F(τk) )
        = ∑_{σ ∈ T_z^p} α(σ) (vD)(σ) G(σ)(y0, z0),

where β(σ) α(τ1) ⋯ α(τk) = α([τ1, …, τk]_z), v(τ1) ⋯ v(τk) = (vD)([τ1, …, τk]_z) and g_y^k(F(τ1), …, F(τk)) = G([τ1, …, τk]_z) were used. This already proves the assertion.


Lemma 8.10 establishes the relationship z = g(y) on the level of elementary weight functions. As (8.1) is a coupled system of two equations, the equation y′ = f(y, z′) has to be addressed as well. In order to do so we give the following definition.

Definition 8.11. Given the general linear method M = [A, U, B, V] with nonsingular A, define the following elementary weight functions:

    v : T_y → R^s,    v(τ) = A l(τ) + U S(τ),

    l : T_y → R^s,    l(τ) = 0    for τ = ∅,
                      l(τ) = e    for τ = •,
                      l(τ) = |τ| v(τ1) ⋯ v(τk) · ( k(σ1)/|σ1| ) ⋯ ( k(σl)/|σl| )    for τ = [τ1, …, τk, σ1, …, σl]_y,

    k : T_z → R^s,    (vD)(σ) = A k(σ) + U S(σ).

S denotes the elementary weight function representing exact Nordsieck vectors y^[0] = B_{T_y}(S; y0, z0) and z^[0] = B_{T_z}(S; y0, z0). This map was already introduced in Corollary 8.8. Observe that v(∅) = U S(∅) = U e1 = e due to pre-consistency (cf. Definition 7.4). Thus vD can be constructed as in Definition 8.9. Indeed, k(σ) is well defined in terms of (vD)(σ) and S(σ) since A is nonsingular by assumption.

Example 8.12. In order to get a feeling for what these elementary weight functions look like, we want to calculate v(τ) for the order-3 tree τ = [•, σ]_y with σ = [•, •]_z. For simplicity, assume that r ≥ 4 and let u1, u2, u3, u4, … denote the columns of U.

Descending into the recursion:

    v(τ) = A l(τ) + 6 u4,    l(τ) = (3/2) v(•) k(σ),    k(σ) = A⁻¹( v(•)² − 2 u3 ),    v(•) = A l(•) + u2,    l(•) = e.

Ascending from the recursion with c := v(•) = A e + u2:

    k(σ) = A⁻¹( c² − 2 u3 ),    l(τ) = (3/2) c A⁻¹( c² − 2 u3 ),    v(τ) = A [ (3/2) c A⁻¹( c² − 2 u3 ) ] + 6 u4.
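The recursion of Definition 8.11 is easy to implement generically. The sketch below (added here as an illustration) encodes trees as nested tuples ('y' or 'z' root plus a list of subtrees), evaluates v, l and k for a Nordsieck-form method with arbitrarily chosen coefficients (s = 2, r = 4, placeholder numbers), and confirms the closed form obtained in Example 8.12.

```python
from math import factorial

def counts(tree):
    b, w = (1, 0) if tree[0] == 'y' else (0, 1)
    for sub in tree[1]:
        sb, sw = counts(sub)
        b, w = b + sb, w + sw
    return b, w

def order(tree):
    b, w = counts(tree)
    return b - w + (1 if tree[0] == 'z' else 0)

def mat_vec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def inv2(M):
    d = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

def had(*vs):                     # componentwise product of s-vectors
    out = [1.0] * len(vs[0])
    for v in vs:
        out = [o * c for o, c in zip(out, v)]
    return out

# method data: s = 2 stages, r = 4 (Nordsieck form), arbitrary numbers
A = [[0.5, 0.0], [0.3, 0.5]]
U = [[1.0, 0.2, 0.1, 0.05], [1.0, 0.7, 0.3, 0.1]]
e = [1.0, 1.0]

def S(tree):                      # S_{i+1}(tau) = |tau|! if |tau| = i
    q = order(tree)
    return [factorial(q) if i == q else 0.0 for i in range(4)]

def v(tree):
    return [a + b for a, b in zip(mat_vec(A, l(tree)), mat_vec(U, S(tree)))]

def l(tree):
    subs = tree[1]
    if not subs:
        return e                  # single black node
    parts = [v(t) if t[0] == 'y' else [ki / order(t) for ki in k(t)]
             for t in subs]
    return had([float(order(tree))] * 2, *parts)

def k(tree):                      # A k(sigma) = (vD)(sigma) - U S(sigma)
    vD = had(*[v(t) for t in tree[1]])
    rhs = [a - b for a, b in zip(vD, mat_vec(U, S(tree)))]
    return mat_vec(inv2(A), rhs)

bullet = ('y', [])
tau = ('y', [bullet, ('z', [bullet, bullet])])   # the tree from Example 8.12

c = [a + u for a, u in zip(mat_vec(A, e), [row[1] for row in U])]  # c = Ae + u2
u3 = [row[2] for row in U]
u4 = [row[3] for row in U]
closed = [ai + 6.0 * u4i for ai, u4i in
          zip(mat_vec(A, had([1.5, 1.5], c,
                             mat_vec(inv2(A), [ci * ci - 2.0 * u3i
                                               for ci, u3i in zip(c, u3)]))),
              u4)]
print(v(tau), closed)
```

The generic recursion reproduces the hand-derived closed form v(τ) = A[(3/2) c A⁻¹(c² − 2u3)] + 6u4 exactly.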


In the above calculations the vector c = v(•) = A e + u2 = A e + U e2 was used, as suggested by (7.8) in the previous section. Recall that α1 = e2 = (0, 1, 0, …, 0)ᵀ for methods in Nordsieck form. Vector-vector products such as c² have to be understood in a component-by-component sense.

Again, let us stress that these computations can be carried out efficiently usingcomputer algebra tools such as Maple [9, 112].

Equipped with the elementary weight functions from Definition 8.11, we subject the equation y′ = f(y, z′) to closer scrutiny.

Lemma 8.13. Consider v, l and k from Definition 8.11. Then the following relation holds:

    h f( B_{T_y}(v; y0, z0), (1/h) B_{T_z}(k; y0, z0) ) = B_{T_y}(l; y0, z0).    (8.9)

Before proving Lemma 8.13 some brief remarks are due. Lemma 8.13 together with Definition 8.11 reveals how B-series relate to the equation y′ = f(y, z′). It comes as no surprise that the interaction of y, y′ and z′ complicates matters considerably. It is for this reason that Definition 8.11 is not that straightforward. However, Lemma 8.10 and 8.13 are the key results for proving that the quantities involved in the numerical scheme (8.4) are all B-series. As a guide through the following computations keep in mind that finally we want to prove (8.8), i.e. v represents the stages Y, while l and k represent the scaled stage derivatives hY′ and hZ′, respectively.

Proof. The idea of the proof is similar to the one of Lemma 8.10. Notice that (1/h) B_{T_z}(k; y0, z0) represents the formal series

    (1/h) B_{T_z}(k; y0, z0) = ∑_{σ ∈ T_z, |σ| ≥ 1} k(σ) G(σ)(y0, z0) α(σ) h^{|σ|−1} / |σ|!.

The zero-order term k(∅) z0 (1/h) is not present since k(∅) = 0. This follows immediately from the pre-consistency conditions in Definition 7.4. A similar remark applies to (1/h) B_{T_y}(l; y0, z0) as well.

For h = 0 nothing is to show for (8.9) as l(∅) = 0. Calculating the first derivative we find

    ( d/dh ( h f( B_{T_y}(v; y0, z0), (1/h) B_{T_z}(k; y0, z0) ) ) )_{h=0}
        = f( e ⊗ y0, k([•]_z) ⊗ g_y(y0) y′(t0) )
        = e ⊗ f( y0, z′(t0) )
        = e ⊗ y′(t0) = ( d/dh B_{T_y}(l; y0, z0) )_{h=0},

where k([•]_z) = e was used. This relation is a consequence of Definition 8.11. In order to prove a similar result for higher derivatives, we consider Y^(k) defined in the proof of Lemma 8.10 and introduce

    Z^(k) := ( d^{k−1}/dh^{k−1} (1/h) B_{T_z}(k; y0, z0) )_{h=0} = (1/k) ∑_{σ ∈ T_z^k} α(σ) k(σ) G(σ)(y0, z0)

for k ≥ 1. We will calculate derivatives of the left hand side of (8.9). Using Lemma 8.5 and the multi-linearity of f_{y^k z′^l} we find

    ( d^p/dh^p ( h f( B_{T_y}(v; y0, z0), (1/h) B_{T_z}(k; y0, z0) ) ) )_{h=0}
        = p [ ∑_{τ = [τ̄1, …, τ̄k, σ̄1, …, σ̄l]_y ∈ ST_yy^p} β(τ) f_{y^k z′^l} ( Y^(|τ̄1|), …, Y^(|τ̄k|), Z^(|σ̄1|+1), …, Z^(|σ̄l|+1) ) ] + p f_z′ Z^(p)
        = p ∑_{τ = [τ1, …, τk, σ1, …, σl]_y ∈ T_yy^p} α(τ) v(τ1) ⋯ v(τk) ( k(σ1)/|σ1| ) ⋯ ( k(σl)/|σl| ) M F(τ) + ∑_{σ ∈ T_z^p} α(σ) k(σ) f_z′ G(σ)
        = ∑_{τ ∈ T_yy^p} α(τ) l(τ) M F(τ) + ∑_{σ ∈ T_z^p} α(σ) k(σ) f_z′ G(σ).    (8.10)

The definition of l was already inserted here to simplify expressions. With (8.10) we are half way to the result, but we have to study the second term

    ∑_{σ ∈ T_z^p} α(σ) k(σ) f_z′ G(σ) = ∑_{σ = [τ1]_z ∈ T_z^p} α(σ) k(σ) f_z′ G(σ) + ∑_{σ = [τ1, …, τk]_z ∈ T_z^p, k ≥ 2} α(σ) k(σ) f_z′ G(σ)    (8.11)

in more detail. Recall that for σ ∈ T_z we have τ = [σ]_y ∈ T_yz if and only if σ has at least two subtrees, i.e. σ = [τ1, …, τk]_z with k ≥ 2. In this situation we find |τ| = |σ|, α(τ) = α(σ) and l(τ) = k(σ), such that

    ∑_{σ = [τ1, …, τk]_z ∈ T_z^p, k ≥ 2} α(σ) k(σ) f_z′ G(σ) = ∑_{τ ∈ T_yz^p} α(τ) l(τ) M F(τ).    (8.12)

On the other hand, if σ = [τ]_z ∈ T_z, then α(σ) = α(τ) (since β(sp(σ)) = 1), S(σ) = S(τ), and the definition of v yields k(σ) = A⁻¹( v(τ) − U S(τ) ) = l(τ). Finally, f_z′ G(σ) = f_z′ g_y F(τ) = (I − M) F(τ). Putting all this together yields

    ∑_{σ = [τ]_z ∈ T_z^p} α(σ) k(σ) f_z′ G(σ) = ∑_{τ ∈ T_y^p} α(τ) l(τ) (I − M) F(τ).    (8.13)

Inserting (8.11), (8.12), (8.13) into (8.10) concludes the proof since

    ( d^p/dh^p ( h f( B_{T_y}(v; y0, z0), (1/h) B_{T_z}(k; y0, z0) ) ) )_{h=0} = ∑_{τ ∈ T_y^p} α(τ) l(τ) F(τ).


Taking a closer look at Lemma 8.10 and 8.13, the Taylor expansion of the numerical solution is now obvious. It remains to collect the result in the following theorem.

Theorem 8.14. Let y^[0] = B_{T_y}(S; y0, z0) and z^[0] = B_{T_z}(S; y0, z0) be exact Nordsieck vectors. Consider one step taken by the general linear method M,

    Y = hA f(Y, Z′) + U y^[0],    g(Y) = hA Z′ + U z^[0],    (8.14a)
    y^[1] = hB f(Y, Z′) + V y^[0],    z^[1] = hB Z′ + V z^[0].    (8.14b)

Then the stages Y, the stage derivatives hZ′ as well as the output quantities y^[1], z^[1] are B-series,

    Y = B_{T_y}(v; y0, z0),    hZ′ = B_{T_z}(k; y0, z0),    (8.14c)
    y^[1] = B_{T_y}(y; y0, z0),    z^[1] = B_{T_z}(z; y0, z0),    (8.14d)

where

    y(τ) = B l(τ) + V S(τ),  τ ∈ T_y,    z(σ) = B k(σ) + V S(σ),  σ ∈ T_z.

The elementary weight functions v, l and k are given in Definition 8.11.

Proof. Consider the elementary weight function v = A l + U S from Definition 8.11. Using B-series notation and applying Lemma 8.13 shows that

    B_{Ty}(v; y0, z0) = A B_{Ty}(l; y0, z0) + U B_{Ty}(S; y0, z0)    (8.15a)
                      = h f( B_{Ty}(v; y0, z0), (1/h) B_{Tz}(k; y0, z0) ) + U B_{Ty}(S; y0, z0).

On the other hand, Lemma 8.10 and the definition of k yield

    g( B_{Ty}(v; y0, z0) ) = B_{Tz}(vD; y0, z0) = A B_{Tz}(k; y0, z0) + U B_{Tz}(S; y0, z0).    (8.15b)

Thus, comparing (8.15) with (8.14a) reveals that the stages are B-series given by (8.14c). Due to linearity, (8.14d) is an immediate consequence of (8.14b) and (8.14c).

8.1.3 Order Conditions

The previous two sections were devoted to deriving B-series representations of both the analytical and the numerical solution. By comparing these Taylor series expansions it is possible to derive order conditions for general linear methods applied to implicit index-1 DAEs.


We assume that exact initial values y0 = y(t0), z0 = z(t0) are given. For general linear methods this information is not enough to start the integration, but we need to calculate initial input vectors y[0] and z[0]. We assume that the process of calculating y[0], z[0] from y0 = y(t0), z0 = z(t0) can be described as B-series, i.e.

    y[0] = B_{Ty}(S; y0, z0),    z[0] = B_{Tz}(S; y0, z0),

where S is defined in Corollary 8.8. Using y[0] and z[0] as input quantities we employ a given general linear method to calculate output vectors y[1] and z[1]. Recall from Definition 7.1 that the method has order p if

    y[1] = ŷ[1] + O(h^{p+1}),    z[1] = ẑ[1] + O(h^{p+1}).

Here, ŷ[1] = B_{Ty}(S; y(t0 + h), z(t0 + h)) and ẑ[1] = B_{Tz}(S; y(t0 + h), z(t0 + h)) are obtained by applying the starting procedure S to the exact solution at the next timepoint (see Figure 8.2 on page 135).

If we restrict attention to methods in Nordsieck form, Corollary 8.8 and Theorem 8.14 show that y[1], z[1] as well as ŷ[1], ẑ[1] can be written as B-series,

    y[1] = B_{Ty}(y; y0, z0),    ŷ[1] = B_{Ty}(E; y0, z0),
    z[1] = B_{Tz}(z; y0, z0),    ẑ[1] = B_{Tz}(E; y0, z0).

Thus order conditions can be derived by comparing elementary weight functions.

Theorem 8.15. Let M = [A,U,B,V] be a general linear method in Nordsieck form with a nonsingular matrix A. M has order p for implicit index-1 DAEs (8.1) if and only if

    y(τ) = E(τ)    ∀ τ ∈ Ty with |τ| ≤ p,
    z(σ) = E(σ)    ∀ σ ∈ Tz with |σ| ≤ p.

If a general linear method is applied to implicit index-1 DAEs (8.1), it produces results with a local error of order O(h^{p+1}) if and only if the conditions of Theorem 8.15 are satisfied. However, recall from Table 8.3 that the number of trees for a given order, and thus the number of order conditions, grows exponentially.

In the remainder of this section we focus on simplifying the result of Theorem 8.15. As a first step the close relationship between l, k and y, z, respectively, is investigated. It will turn out that it suffices to consider trees in Ty only. The trees from Tz yield no additional order conditions.

Lemma 8.16. Let M be a general linear method in Nordsieck form with a nonsingular matrix A. Then for every tree σ ∈ Tz there is a tree τ ∈ Ty such that |σ| = |τ| and k(σ) = l(τ), z(σ) = y(τ).


Proof. Recall that S(∅) = e1 = [1 0 ··· 0]^T is the first canonical unit vector for methods in Nordsieck form. The first column of U is given by U e1 = e = [1 ··· 1]^T due to pre-consistency (Definition 7.4).

Let σ be an arbitrary tree in Tz. For σ = ∅ nothing is to show since

    k(∅) = A^{-1}( (vD)(∅) − U S(∅) ) = A^{-1}( e − U e1 ) = 0 = l(∅).

For |σ| ≥ 1 we distinguish two cases:

(a) σ = [τ]_z for some τ ∈ Ty. Observe that |σ| = |τ| and therefore S(σ) = S(τ). Calculating (vD)(σ) = v(τ) = A l(τ) + U S(τ) we find k(σ) = A^{-1}( (vD)(σ) − U S(σ) ) = l(τ). This immediately implies z(σ) = y(τ) as well.

(b) σ = [τ1, ..., τk]_z with τi ∈ Ty and k ≥ 2. Let τ = [σ]_y, i.e. we attach σ to a new black root. Then τ ∈ T_yz and, again, |τ| = |σ|. Definition 8.11 yields l(τ) = |τ| k(σ)/|σ| = k(σ) and thus z(σ) = y(τ).

If τ ∈ Ty is a tree corresponding to σ ∈ Tz in the sense of Lemma 8.16, then τ and σ have the same order. Therefore E(τ) = E(σ) (cf. Corollary 8.8) and, due to y(τ) = z(σ), the elementary weight function z can be omitted from Theorem 8.15.

Another means of reducing the number of order conditions considerably is to study simplified trees. This was already done in [104].

Definition 8.17. Every tree τ ∈ Ty can be associated with a simplified tree τ̄ as follows: If a white [black] node (except the root) has no ramifications and is followed by a black [white] node, then the tree can be simplified by removing these two nodes (see Figure 8.3). The simplified tree τ̄ corresponding to τ is the tree which is simplified as much as possible.

T̄y = { τ̄ | τ ∈ Ty } is the set of simplified trees.

Figure 8.3: Simplification of trees.

Pictorial representations of the simplified trees up to order 4 are given in Table 8.5 on page 144. Observe that simplifying a tree does not change its order.

The significance of the simplified trees is straightforward: The order conditions generated by T̄y coincide with those generated by Ty.

Lemma 8.18. Let τ ∈ Ty. Then it holds that y(τ) = y(τ̄).


Proof. The proof is similar to the one carried out in [104]. Let τ ∈ Ty be a tree which contains η = [[τ1, ..., τk, σ1, ..., σl]_y]_z as a (proper) subtree.

The white root of η has no ramifications but is followed by a black node. Thus the first simplification rule applies. Notice that η is attached to one of τ's black nodes, i.e. ξ = [ξ1, ..., ξm, η, η2, ..., ηn]_y is again a subtree of τ (where τ = ξ is possible). Since y(τ) is calculated by evaluating l and k for τ's subtrees (depending on the subtrees' roots), in order to prove y(τ) = y(τ̄) it suffices to show that l(ξ) = l(ξ̄), where ξ̄ = [ξ1, ..., ξm, τ1, ..., τk, σ1, ..., σl, η2, ..., ηn]_y.

From Lemma 8.16 we know that

    k(η) = l([τ1, ..., τk, σ1, ..., σl]_y) = |η| ∏_{i=1}^k v(τi) ∏_{j=1}^l k(σj)/|σj|.

Therefore

    l(ξ) = |ξ| ∏_{i=1}^m v(ξi) · (k(η)/|η|) · ∏_{j=2}^n k(ηj)/|ηj|
         = |ξ| ∏_{i=1}^m v(ξi) ∏_{i=1}^k v(τi) ∏_{j=1}^l k(σj)/|σj| ∏_{j=2}^n k(ηj)/|ηj|
         = l([ξ1, ..., ξm, τ1, ..., τk, σ1, ..., σl, η2, ..., ηn]_y) = l(ξ̄).

For the second simplification rule, assume that η = [[τ1, ..., τk]_z]_y ≠ τ is a subtree of τ. Two cases are possible: Either η is attached to a black node, such that ξ = [η, η2, ..., ηk, σ1, ..., σl]_y is a subtree of τ, or η is connected to a white node and ξ = [η, η2, ..., ηk]_z is one of τ's subtrees.

In both cases l(ξ) = l(ξ̄) and k(ξ) = k(ξ̄) is shown along the lines of the reasoning above. The crucial point to notice is that v(η) = v(τ1)···v(τk).

Up to now only one simplification step was considered. However, it is clear that the arguments can be repeated recursively.

With Lemma 8.18 we can state the final version of the order conditions.

Theorem 8.19. Let M be a general linear method in Nordsieck form with a nonsingular matrix A. M has order p for implicit index-1 DAEs (8.1) if and only if

    y(τ) = E(τ)    ∀ τ ∈ T̄y with |τ| ≤ p.

Proof. We start from the order conditions of Theorem 8.15. It is, however, sufficient to require y(τ) = E(τ) for all trees τ ∈ Ty with |τ| ≤ p, as Lemma 8.16 showed that the trees from Tz yield no additional order conditions. Because of Lemma 8.18 we can restrict attention to trees from the smaller set T̄y instead of Ty.

The number of order conditions up to p = 6 is given in Table 8.4. Even though these numbers are still increasing exponentially, compared to Table 8.3 the number of order conditions is significantly reduced by considering simplified trees.


In order to write down the order conditions for a given general linear method M = [A,U,B,V] with s internal and r external stages, it is convenient to split

    U = [u1 u2 ··· us]    and    V = [v1 v2 ··· vr]

columnwise. Definition 8.11 and Theorem 8.14 show how to calculate y for every tree τ ∈ Ty. The order condition corresponding to τ ∈ T̄y is obtained by equating y(τ) and E(τ). Since E(τ) depends only on the order of τ, E needs to be calculated only once for every order.

    order    1   2   3   4    5    6  |   Σ
    #T̄y      1   2   6   21   81  336 |  447

Table 8.4: The number of order conditions.

    no.  order  order condition y(τ) = E(τ)
     1     1    B e + v2 = [1 1 0 0 0]^T
     2     2    2 B c + 2 v3 = [1 2 2 0 0]^T
     3     2    B A^{-1}(c² − 2u3) + 2 v3 = [1 2 2 0 0]^T
     4     3    3 B (2Ac + 2u3) + 6 v4 = [1 3 6 6 0]^T
     5     3    3 B c² + 6 v4 = [1 3 6 6 0]^T
     6     3    B A^{-1}( c (2Ac + 2u3) − 6u4 ) + 6 v4 = [1 3 6 6 0]^T
     7     3    B A^{-1}(c³ − 6u4) + 6 v4 = [1 3 6 6 0]^T
     8     3    (3/2) B c A^{-1}(c² − 2u3) + 6 v4 = [1 3 6 6 0]^T
     9     3    (3/4) B (A^{-1}(c² − 2u3))² + 6 v4 = [1 3 6 6 0]^T
    10     4    24 B ( A (Ac + u3) + u4 ) + 24 v5 = [1 4 12 24 24]^T
    11     4    12 B (Ac² + 2u4) + 24 v5 = [1 4 12 24 24]^T
    12     4    8 B c (Ac + u3) + 24 v5 = [1 4 12 24 24]^T
    13     4    4 B c³ + 24 v5 = [1 4 12 24 24]^T
    14     4    6 B ( Ac A^{-1}(c² − 2u3) + 4u4 ) + 24 v5 = [1 4 12 24 24]^T
    15     4    6 B A^{-1}( c (A (Ac + u3) + u4) − 4u5 ) + 24 v5 = [1 4 12 24 24]^T
    16     4    3 B A^{-1}( c (Ac² + 2u4) − 8u5 ) + 24 v5 = [1 4 12 24 24]^T
    17     4    4 B A^{-1}( (Ac + u3)² − 6u5 ) + 24 v5 = [1 4 12 24 24]^T
    18     4    2 B A^{-1}( c² (Ac + u3) − 12u5 ) + 24 v5 = [1 4 12 24 24]^T
    19     4    B A^{-1}(c⁴ − 24u5) + 24 v5 = [1 4 12 24 24]^T
    20     4    (4/3) B c A^{-1}(c³ − 6u4) + 24 v5 = [1 4 12 24 24]^T
    21     4    (8/3) B c A^{-1}( c (Ac + u3) − 3u4 ) + 24 v5 = [1 4 12 24 24]^T
    22     4    4 B (Ac + u3) A^{-1}(c² − 2u3) + 24 v5 = [1 4 12 24 24]^T
    23     4    2 B c² A^{-1}(c² − 2u3) + 24 v5 = [1 4 12 24 24]^T
    24     4    3 B ( A (A^{-1}(c² − 2u3))² + 8u4 ) + 24 v5 = [1 4 12 24 24]^T
    25     4    (3/2) B A^{-1}( c (Ac A^{-1}(c² − 2u3) + 4u4) − 24u5 ) + 24 v5 = [1 4 12 24 24]^T
    26     4    (2/3) B A^{-1}(c² − 2u3) A^{-1}(c³ − 6u4) + 24 v5 = [1 4 12 24 24]^T
    27     4    (4/3) B A^{-1}(c² − 2u3) A^{-1}( c (Ac + u3) − 3u4 ) + 24 v5 = [1 4 12 24 24]^T
    28     4    B c (A^{-1}(c² − 2u3))² + 24 v5 = [1 4 12 24 24]^T
    29     4    (3/4) B A^{-1}( c (A (A^{-1}(c² − 2u3))² + 8u4) − 24u5 ) + 24 v5 = [1 4 12 24 24]^T
    30     4    (1/2) B (A^{-1}(c² − 2u3))³ + 24 v5 = [1 4 12 24 24]^T

Table 8.5: Simplified trees of order up to 4 and corresponding order conditions (the pictorial representations of the trees are not reproduced here).

Example 8.20. Let us investigate the order condition associated with the third-order tree τ listed as no. 8 in Table 8.5. On page 137 we already calculated l(τ) = (3/2) c A^{-1}(c² − 2u3). Using the definition of y from Theorem 8.14 yields

    y(τ) = B l(τ) + V S(τ) = (3/2) B c A^{-1}(c² − 2u3) + 6 v4.

Since |τ| = 3, we find E(τ) = e1 + 3 e2 + 6 (e3 + e4), where ei denotes the i-th canonical unit vector. Thus the order condition corresponding to τ is given by

    (3/2) B c A^{-1}(c² − 2u3) + 6 v4 = e1 + 3 e2 + 6 (e3 + e4).

The simplified trees of order up to 4 and their corresponding order conditions are given in Table 8.5. The conditions stated there were calculated for s = r = 5. However, they are valid for arbitrary s, r as well. In general one has to use the convention uk = 0 for k > s and vk = 0 for k > r.
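Each right-hand side in Table 8.5 depends only on the order of the tree: its i-th component equals |τ|!/(|τ|+1−i)! for i − 1 ≤ |τ| and vanishes otherwise (this formula is established in the proof of Theorem 8.22 below). The following sketch is merely an illustration, not part of the derivation; it reproduces the vectors for s = r = 5:

```python
from math import factorial

def rhs(order, n=5):
    """i-th component (1-based) of E(tau): |tau|!/(|tau|+1-i)! if i-1 <= |tau|, else 0."""
    return [factorial(order) // factorial(order + 1 - i) if i - 1 <= order else 0
            for i in range(1, n + 1)]

# Right-hand sides of the order conditions in Table 8.5 (s = r = 5):
assert rhs(1) == [1, 1, 0, 0, 0]
assert rhs(2) == [1, 2, 2, 0, 0]
assert rhs(3) == [1, 3, 6, 6, 0]
assert rhs(4) == [1, 4, 12, 24, 24]
```

For |τ| = 3 this is exactly e1 + 3 e2 + 6 (e3 + e4) from Example 8.20.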

Unfortunately there is no "simple" procedure to read the order conditions directly from a given tree as is the case for Runge-Kutta methods. Due to the multivalue structure of the method it seems unavoidable to use a recursive approach for general linear methods.

Observe that each order condition is an equation in R^r. Taking a closer look, it turns out that the order conditions derived for Runge-Kutta methods by Kværnø in [104] reappear in the vectors' first components. This is no surprise as Runge-Kutta methods are a special case of general linear methods with r = 1.

It is well known for Runge-Kutta methods that sufficiently high stage order guarantees that the order conditions for DAEs are satisfied [84]. A similar remark holds when general linear methods are applied to implicit index-1 DAEs. We will investigate this situation in more detail.

8.1.4 Order Conditions and Stage Order

We know from Theorem 7.6 that a general linear method M = [A,U,B,V] has order p and stage order q = p for ordinary differential equations if and only if

    U W = C − A C K,    V W = W E − B C K.

The matrix W = [α0 ··· αp] is closely related to Definition 7.1 where order and stage order are defined. Recall that for methods in Nordsieck form with s = r = p + 1 this matrix is given by W = I. Thus general linear methods in Nordsieck form with s = r have order and stage order p = q = s − 1 if and only if

    U = C − A C K,    V = E − B C K.    (8.16)

The matrices C, E and K are defined in Theorem 7.1. In this section we will show that, given these definitions for U and V, the method has order p for implicit index-1 DAEs (8.1) as well. We start by studying the elementary weight function l.

If U and V are chosen in the special way (8.16), then l can be calculated simply as l(τ) = C K S(τ). We prove this statement in the following result.

Lemma 8.21. Let M = [A,U,B,V] be a general linear method in Nordsieck form with s = r = p + 1 and U = C − A C K, V = E − B C K. Then

    l(τ) = C K S(τ)    ∀ τ ∈ Ty with |τ| ≤ p.    (8.17)

Proof. Recall that C = [e, c, c²/2, ..., c^p/p!] and C K = [0, e, c, ..., c^{p−1}/(p−1)!]. Using the definition of the exact starting procedure S from Corollary 8.8 we find

    C S(τ) = c^{|τ|}    and    C K S(τ) = |τ| c^{|τ|−1}    ∀ |τ| ≤ p.

For τ = ∅ and τ = • (the single black node) it is straightforward to calculate l(∅) = 0 = C K S(∅) and l(•) = e = C K S(•), respectively. Assume that (8.17) holds for all trees τ with order |τ| ≤ k, where 1 ≤ k < p. Let τ be an arbitrary tree with order |τ| = k + 1. Again we distinguish two cases:

(a) τ = [τ1, ..., τk, σ1, ..., σl]_y ∈ T_yy. The tree τ has order k + 1 by assumption and therefore |τi| ≤ k and |σj| ≤ k − 1. Evaluating v for the subtrees τi yields, due to the induction hypothesis,

    v(τi) = A l(τi) + U S(τi) = A C K S(τi) + (C − A C K) S(τi) = C S(τi).

Since |σj| = |[σj]_y| we have k(σj) = l([σj]_y) = C K S([σj]_y) = |σj| c^{|σj|−1}. Combining these two results it turns out that

    l(τ) = |τ| ∏_{i=1}^k v(τi) ∏_{j=1}^l k(σj)/|σj| = |τ| ∏_{i=1}^k c^{|τi|} ∏_{j=1}^l c^{|σj|−1} = |τ| c^{|τ1|+···+|τk|+(|σ1|−1)+···+(|σl|−1)}.

Recall from Definition 8.4 that

    |τ| = 1 + Σ_{i=1}^k |τi| + Σ_{j=1}^l (|σj| − 1),

such that l(τ) = |τ| c^{|τ|−1} = C K S(τ).


(b) τ = [σ]_y ∈ T_yz with σ = [τ1, ..., τk]_z. In this case we have |τ| = |σ|. Thus we cannot apply the induction hypothesis directly but have to take one step further into the recursion. Since |τi| ≤ k we obtain

    (vD)(σ) = ∏_{i=1}^k v(τi) = ∏_{i=1}^k C S(τi) = c^{|τ|} = C S(τ).

Observe that l(τ) = k(σ) and S(τ) = S(σ) such that

    l(τ) = k(σ) = A^{-1}( (vD)(σ) − U S(σ) ) = A^{-1}( C S(τ) − (C − A C K) S(τ) ) = C K S(τ).
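The identities C S(τ) = c^{|τ|} and C K S(τ) = |τ| c^{|τ|−1} at the heart of this proof can be checked numerically. The sketch below assumes S(τ) = |τ|! e_{|τ|+1} for the exact Nordsieck starting procedure and models K as the nilpotent matrix with ones on the superdiagonal — an assumption chosen so that C K = [0, e, c, ..., c^{p−1}/(p−1)!] holds, not a quotation of Theorem 7.1; the abscissae c are hypothetical:

```python
from math import factorial

p = 4                                 # s = r = p + 1 = 5
c = [0.0, 0.25, 0.5, 0.75, 1.0]       # hypothetical abscissae

# C = [e  c  c^2/2! ... c^p/p!]: column j holds c^j/j!
C = [[ci**j / factorial(j) for j in range(p + 1)] for ci in c]
# K: ones on the superdiagonal, so that C K = [0  e  c ... c^{p-1}/(p-1)!]
K = [[1.0 if j == i + 1 else 0.0 for j in range(p + 1)] for i in range(p + 1)]

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

for order in range(1, p + 1):
    S_tau = [float(factorial(order)) if i == order else 0.0 for i in range(p + 1)]
    CS = matvec(C, S_tau)
    CKS = matvec(C, matvec(K, S_tau))
    for i, ci in enumerate(c):
        assert abs(CS[i] - ci**order) < 1e-12                 # C S(tau) = c^{|tau|}
        assert abs(CKS[i] - order * ci**(order - 1)) < 1e-12  # CK S(tau) = |tau| c^{|tau|-1}
```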

With Lemma 8.21 all the preparatory work is done and we can proceed to formulating the desired result that links the order conditions for implicit index-1 DAEs to the property of a general linear method to have high order and stage order for ordinary differential equations.

Theorem 8.22. Let M be a general linear method in Nordsieck form satisfying s = r > 1. Assume that M has order and stage order equal to p = q = s − 1 for ordinary differential equations. Then M has order p for implicit index-1 DAEs (8.1).

Proof. Since M has order and stage order p = q = s − 1 for ordinary differential equations, we know from Theorem 7.1 that U = C − A C K and V = E − B C K. Theorem 8.14 and Lemma 8.21 then yield

    y(τ) = B l(τ) + V S(τ) = B C K S(τ) + (E − B C K) S(τ) = E S(τ)

for every tree with |τ| ≤ p. In order to show that the order conditions for implicit index-1 DAEs are satisfied, it remains to show that E S(τ) = E(τ). However, this is clear from the definition of E in Corollary 8.8 as the i-th component of the vector E S(τ) reads

    (E S(τ))_i = Σ_{j=i}^{p+1} (1/(j−i)!) δ_{|τ|+1,j} |τ|! = |τ|! / (|τ| + 1 − i)!

for i − 1 ≤ |τ| ≤ p and (E S(τ))_i = 0 otherwise.

8.2 Convergence for Implicit Index-1 DAEs

The order conditions from Theorems 8.19 and 8.22 make it possible to derive general linear methods in Nordsieck form with a given local error when applied to implicit index-1 DAEs (8.1). However, when calculating the numerical solution the local error contributions accumulate over a series of steps taken by the method. Thus we have to study the global error

    ε_n^y = y[n] − ŷ[n]    and    ε_n^z = z[n] − ẑ[n].

Recall that ŷ[n] and ẑ[n] denote Nordsieck vectors obtained by applying the starting procedure S to the exact solution y(tn) and z(tn) at timepoint tn, respectively (see Figure 8.4).

Only if the global error can be bounded in terms of the local error is the scheme said to be stable; then the numerical solution converges to the exact result as the stepsize h tends to zero.

The convergence results presented in this section follow ideas from [84, 104]. The existence of GLM solutions and the influence of perturbations are studied first. Then the main convergence result is proved in Theorem 8.27.

Hairer, Lubich and Roche [84] studied the application of Runge-Kutta methods to DAEs in Hessenberg form having index µ ∈ {1, 2, 3}. Here, we investigate general linear methods for implicit index-1 systems (8.1) instead. Observe that a reformulation of (8.1) in Hessenberg form typically increases the index to 2.

Fully implicit DAEs were studied by Kværnø in [104]. She considered Runge-Kutta methods for the numerical solution of index-1 DAEs f(v, v′) = 0. The aim of this section is to generalise her results to the case of general linear methods.

Naturally, the mathematical techniques used in this section, in particular homotopy arguments and careful 'big-O' calculations, are very similar to the ones used when analysing Runge-Kutta methods for fully implicit index-1 problems. Unfortunately, even for Runge-Kutta methods the threefold result of existence of solutions, influence of perturbations and convergence is obtained only with considerable technical effort. Nevertheless, as these results are seminal for numerical computations, it is essential to verify them for the class of general linear methods as well. It is a rather nice result that almost all ideas carry over from the Runge-Kutta to the general linear case.

Figure 8.4: The global error for general linear methods. Only the y component is drawn for simplicity.

However, there are also new aspects coming into play. Namely, the concept of consistency from Definitions 7.4 and 7.5 is needed in Theorem 8.23 when proving the existence of solutions for the numerical scheme. Also, the requirement |1 − b^T A^{-1} e| < 1 for Runge-Kutta methods [10, 84, 104] has to be replaced by the power boundedness of the matrix M∞ = V − B A^{-1} U. This matrix is the general linear method's stability matrix M(z) evaluated at infinity (see Section 7.2 or [25]).

We consider the application of a general linear method M = [A,U,B,V] to the implicit index-1 equation (8.1). If y[0] and z[0] are some input vectors for the first step, we need to study the system

    Y′ = f(Y, Z′),    (8.18a)
    Y = hA Y′ + U y[0],    g(Y) = hA Z′ + U z[0],    (8.18b)
    y[1] = hB f(Y, Z′) + V y[0],    z[1] = hB Z′ + V z[0].    (8.18c)

Here f(Y, Z′) = [f(Y1, Z′1)^T, ..., f(Ys, Z′s)^T]^T and g(Y) = [g(Y1)^T, ..., g(Ys)^T]^T again denote super vectors in R^{s m1} and R^{s m2}, respectively. Similarly, recall that

    Y = [Y1^T, ..., Ys^T]^T ∈ R^{s m1},    Z′ = [Z′1^T, ..., Z′s^T]^T ∈ R^{s m2}.

As usual, Kronecker products in (8.18) were left out to simplify notation. In contrast to (8.4), explicit representations of the stage derivatives Y′ = f(Y, Z′) are used and Y′ is not inserted into (8.18b). This will turn out to be useful when proving existence of solutions and convergence for the numerical scheme.

Theorem 8.23 (Existence of solutions). Let M be a consistent general linear method with nonsingular matrix A. Let the input vectors be given by

    y[0]_i = α_{0i} ν + h α_{1i} µ + O(h²),    z[0]_i = α_{0i} η + h α_{1i} ζ + O(h²)

for i = 1, ..., r with ν, µ, η, ζ ∈ R^n. As in Definitions 7.4 and 7.5, α0, α1 ∈ R^r denote the pre-consistency and consistency vector, respectively. Assume that

    f(ν, ζ) = µ + O(h),    g(ν) = η + O(h²),    g_y(ν) µ = ζ + O(h²)    (8.19)

and that the matrices (I − f_{z'} g_y)^{-1}, f_y and f_{z'} remain bounded in an h-independent neighbourhood of (ν, ζ). Then there exists a unique solution of (8.18a), (8.18b) satisfying

    Y_i = ν + O(h),    Z′_i = ζ + O(h).

Page 150: General Linear Methods for Integrated Circuit Design - edoc

150 8 Implicit Index-1 DAEs

Proof. First define¹ ȳ = h(A e) µ + U y[0] and z̄ = h(A e) ζ + U z[0] using the vector e = [1, ..., 1]^T ∈ R^s. Consider the homotopy

    Y′ = f(Y, Z′) + (τ − 1)( f(ȳ, e⊗ζ) − e⊗µ ),    Y = hA Y′ + U y[0],
    g(Y) = hA Z′ + U z[0] + (τ − 1)( z̄ − g(ȳ) ).

For τ = 0 this system has the obvious solution Y = ȳ, Y′ = e⊗µ and Z′ = e⊗ζ, but for τ = 1 it is equivalent to (8.18a), (8.18b).

Considering Y, Y′ and Z′ as functions of τ we differentiate with respect to this parameter and obtain

    Ẏ′ = f_y Ẏ + f_{z'} Ż′ + f(ȳ, e⊗ζ) − e⊗µ,    Ẏ = hA Ẏ′,
    g_y Ẏ = hA Ż′ + z̄ − g(ȳ),

where f_y = blockdiag( f_y(Y1, Z′1), ..., f_y(Ys, Z′s) ). The matrices f_{z'} and g_y are defined similarly. Due to Ẏ = hA Ẏ′ we find that Ẏ′ and Ż′ satisfy

    Ẏ′ − h f_y A Ẏ′ = f_{z'} Ż′ + f(ȳ, e⊗ζ) − e⊗µ,
    A^{-1} g_y A Ẏ′ = Ż′ + (1/h) A^{-1}( z̄ − g(ȳ) ).

Inserting the second equation into the first one, this system can be rewritten in matrix form

    [ I − h f_y A − f_{z'} A^{-1} g_y A, 0 ; −A^{-1} g_y A, I ] [ Ẏ′ ; Ż′ ] = [ f(ȳ, e⊗ζ) − e⊗µ − (1/h) f_{z'} A^{-1}( z̄ − g(ȳ) ) ; −(1/h) A^{-1}( z̄ − g(ȳ) ) ].

Since g_y(Yi) a_{ij} = a_{ij} g_y(Yj) + O(d) provided that ‖Yi − Yj‖ ≤ d with d independent of h, the block in the upper left corner can be written as

    I − h f_y A − f_{z'} A^{-1} g_y A = I − f_{z'} g_y − h f_y A + O(d).    (8.20)

By assumption M = I − f_{z'} g_y is nonsingular in a neighbourhood of (ν, ζ). Thus, (8.20) is also nonsingular for h and d sufficiently small.

Therefore the coefficient matrix of the above system has a bounded inverse. As (8.19) ensures that the right-hand side is O(h), standard arguments as in [84, Theorem 4.1] show that Y′_i = µ + O(h), Z′_i = ζ + O(h). It remains to note that due to pre-consistency we have U α0 = e and therefore U y[0] = e⊗ν + O(h), which shows Y = hA Y′ + U y[0] = e⊗ν + O(h).

The uniqueness of the solution is obtained in the same way as in [84].

¹ Using Kronecker products, ȳ reads ȳ = h(A e) ⊗ µ + (U ⊗ I_{m1}) y[0] in more detail. A similar expression holds for z̄.
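The homotopy argument used above is a standard continuation technique: the known solution at τ = 0 is tracked towards the desired solution at τ = 1. A minimal scalar sketch (with a hypothetical function f, unrelated to the DAE right-hand side) illustrates the idea:

```python
from math import cos, sin

# Hypothetical scalar equation f(x) = 0, solved by tracking the homotopy
# H(x, tau) = f(x) + (tau - 1) * f(x0) from the known solution x0 at tau = 0.
def f(x):
    return x - 0.5 * cos(x) - 0.2

def fprime(x):
    return 1.0 + 0.5 * sin(x)

x, x0 = 0.0, 0.0            # H(x0, 0) = 0 holds by construction
steps = 10
for i in range(1, steps + 1):
    tau = i / steps
    for _ in range(5):       # Newton iterations for H(., tau) = 0
        H = f(x) + (tau - 1.0) * f(x0)
        x -= H / fprime(x)

assert abs(f(x)) < 1e-8      # at tau = 1 the original equation is solved
```

The well-conditioned derivative fprime plays the role of the bounded inverse of the coefficient matrix in the proof.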


When using the scheme (8.18) to solve (8.1) numerically, all the computations are performed on some computer architecture. Thus the computed results will not only suffer from inaccuracies due to the method (local and global error) but also from the fact that all calculations are performed in finite precision. The influence of these perturbations is studied in the next result.

Theorem 8.24 (Influence of perturbations). Let the assumptions of Theorem 8.23 be satisfied. Denote the solution of (8.18a), (8.18b) by Y′, Y and Z′. Assume that Ỹ′, Ỹ, Z̃′ satisfy the perturbed system

    Ỹ′ = f(Ỹ, Z̃′) + δ,    (8.21a)
    Ỹ = hA Ỹ′ + U ỹ[0],    g(Ỹ) = hA Z̃′ + U z̃[0] + θ,    (8.21b)

where the input vectors satisfy

    ỹ[0] − y[0] = O(h²),    z̃[0] − z[0] = O(h²).    (8.22)

If the perturbations are of order δ = O(h) and θ = O(h²), then there is a constant C > 0 such that

    ‖[ Ỹ′ − Y′ ; Z̃′ − Z′ ]‖ ≤ C ( ‖ỹ[0] − y[0]‖ + ‖δ‖ + (1/h) ‖z̃[0] − z[0]‖ + (1/h) ‖g_y(η, ζ)(ỹ[0] − y[0])‖ + (1/h) ‖θ‖ )    (8.23)

provided that h is sufficiently small.

The norm used in (8.23) for super vectors is given by

    ‖[ Ỹ′ − Y′ ; Z̃′ − Z′ ]‖ = max( ‖Ỹ′ − Y′‖, ‖Z̃′ − Z′‖ ) = max_{i=1,...,s} ( ‖Ỹ′_i − Y′_i‖, ‖Z̃′_i − Z′_i‖ ),

where ‖·‖ denotes some vector norm in R^{m1} or R^{m2}, respectively.

Proof. The result of Theorem 8.24 is obtained very similarly to the proof of Theorem 8.23, but a different homotopy is used:

    Y′ = f(Y, Z′) + (1 − τ) δ,
    g(Y) = hA Z′ + U z[0] + (1 − τ)( θ + U (z̃[0] − z[0]) ),
    Y = hA Y′ + U y[0] + (1 − τ) U (ỹ[0] − y[0]).

For τ = 1 this system is (8.18a), (8.18b), but for τ = 0 it coincides with (8.21). See [84, 104] for more details.

Before we continue to study the propagation of errors over several consecutive steps we remark that (8.22) and (8.23) imply

    ‖[ Ỹ′ − Y′ ; Z̃′ − Z′ ]‖ ≤ C1 h    (8.24a)

for δ = 0, θ = 0. But if ỹ[0] − y[0] = O(h) and z̃[0] − z[0] = O(h), we only have

    ‖[ Ỹ′ − Y′ ; Z̃′ − Z′ ]‖ ≤ C2.    (8.24b)


Lemma 8.25 (Error propagation). Let the assumptions of Theorem 8.23 be satisfied. Denote exact input vectors² by ŷ[n], ẑ[n] and assume that

    y[n] = ŷ[n] + ε_n^y,    z[n] = ẑ[n] + ε_n^z    with ‖ε_n^y‖ = O(h^p), ‖ε_n^z‖ = O(h^p).

Let (y[n+1], z[n+1]) be the numerical solution obtained after n steps. Then the global error satisfies the recursion

    ε_{n+1} = M∞ S_n ε_n + V P_n ε_n − d_{n+1} + η,    (8.25)

where M∞ = V − B A^{-1} U is the method's stability matrix evaluated at infinity, d_{n+1} is the local discretisation error and

    η = O(h^{p+1})    for p ≥ 2, or p = 1 and f linear in z′,
    η = O(h)          for p = 1 and f nonlinear in z′.

S_n and P_n are projector functions given by

    S_n = [ I − M^{-1}, M^{-1} f_{z'} ; −g_y M^{-1}, I + g_y M^{-1} f_{z'} ],    P_n = I − S_n,

where M^{-1} = (I − f_{z'} g_y)^{-1}, and f_{z'}, g_y are evaluated at (y(tn), z′(tn)).

Remark 8.26. In order to simplify matters, (8.25) is a slight abuse of notation. We have ε_n = [ε_n^y ; ε_n^z], d_{n+1} = [d^y_{n+1} ; d^z_{n+1}] and similarly η = [η_f ; η_g], such that ε_n, d_{n+1}, η ∈ R^{r(m1+m2)}. On the other hand,

    S_n = [ S¹_n, S²_n ; S³_n, S⁴_n ] ∈ R^{(m1+m2)×(m1+m2)},    P_n = [ P¹_n, P²_n ; P³_n, P⁴_n ] ∈ R^{(m1+m2)×(m1+m2)}

and M∞, V ∈ R^{r×r}. Thus (8.25) has to be understood as the system

    [ ε^y_{n+1} ; ε^z_{n+1} ] = ( [ M∞⊗S¹_n, M∞⊗S²_n ; M∞⊗S³_n, M∞⊗S⁴_n ] + [ V⊗P¹_n, V⊗P²_n ; V⊗P³_n, V⊗P⁴_n ] ) [ ε_n^y ; ε_n^z ] + [ η_f − d^y_{n+1} ; η_g − d^z_{n+1} ].

Proof. Using exact input vectors we find

    Ŷ′ = f(Ŷ, Ẑ′),    (8.26a)
    Ŷ = hA Ŷ′ + U ŷ[n],    g(Ŷ) = hA Ẑ′ + U ẑ[n],    (8.26b)
    ŷ[n+1] = hB Ŷ′ + V ŷ[n] + d^y_{n+1},    ẑ[n+1] = hB Ẑ′ + V ẑ[n] + d^z_{n+1},    (8.26c)

with local error d^y_{n+1}, d^z_{n+1}. On the other hand,

    Y′ = f(Y, Z′),    (8.27a)
    Y = hA Y′ + U y[n],    g(Y) = hA Z′ + U z[n],    (8.27b)
    y[n+1] = hB Y′ + V y[n] = ŷ[n+1] + ε^y_{n+1},    z[n+1] = hB Z′ + V z[n] = ẑ[n+1] + ε^z_{n+1},    (8.27c)

with global error ε^y_{n+1}, ε^z_{n+1}. Introduce ∆Y′ = Y′ − Ŷ′, ∆Z′ = Z′ − Ẑ′ and assume that

    ‖ε_n^y‖ ≤ ε,    ‖ε_n^z‖ ≤ ε,    ‖∆Y′‖ ≤ δ,    ‖∆Z′‖ ≤ δ.

(8.24) implies δ = O(h) for p ≥ 2 and δ = O(1) for p = 1. By assumption we have ε = O(h^p). Writing Z′ = Ẑ′ + ∆Z′ and Y = Ŷ + hA ∆Y′ + U ε_n^y, we may linearise f in (Ŷ, Ẑ′) such that

    f(Y, Z′) = f(Ŷ, Ẑ′) + f_{z'} ∆Z′ + η_f.

² Recall that ŷ[n]_{i+1} = h^i y^{(i)}(tn) and ẑ[n]_{i+1} = h^i z^{(i)}(tn), i = 0, ..., r − 1, for methods in Nordsieck form.

Here f_{z'} = blockdiag( f_{z'}(Ŷ1, Ẑ′1), ..., f_{z'}(Ŷs, Ẑ′s) ) and so on are notations similar to the ones already introduced in the proof of Theorem 8.23. The term η_f contains higher order terms of the form

    f_y( hA ∆Y′ + U ε_n^y ),    f_yy( hA ∆Y′ + U ε_n^y, hA ∆Y′ + U ε_n^y ),
    f_yz′( hA ∆Y′ + U ε_n^y, ∆Z′ ),    f_z′z′( ∆Z′, ∆Z′ ),    ...

Due to Theorem 8.23 the relations Ŷi = y(tn) + O(h) and Ẑ′i = z′(tn) + O(h) imply f_y(Ŷi, Ẑ′i) = f_y( y(tn), z′(tn) ) + O(h). Similar estimates hold for the other derivatives. It turns out that

    f(Y, Z′) = f(Ŷ, Ẑ′) + f_{z'} ∆Z′ + O(hδ + ε + δ² + hδ²).    (8.28)

Writing f_y, f_{z'}, g_y and so on has to be understood as evaluating the derivatives at (y(tn), z′(tn)). Similar calculations show that

    g(Y) = g( Ŷ + hA ∆Y′ + U ε_n^y ) = g(Ŷ) + g_y( hA ∆Y′ + U ε_n^y ) + η_g,

where η_g contains higher order terms of the form g_yy( hA ∆Y′ + U ε_n^y )² and so on. With Ŷi = y(tn) + O(h) we get η_g = O(h²δ² + hδε + ε²) and finally

    g(Y) = g(Ŷ) + g_y · ( hA ∆Y′ + U ε_n^y ) + O(h²δ² + h²δ + hε).    (8.29)

Using (8.28), (8.29) together with (8.26), (8.27) we find

    ∆Y′ = f(Y, Z′) − f(Ŷ, Ẑ′) = f_{z'} ∆Z′ + O(hδ + ε + δ² + hδ²),

    ∆Z′ = (1/h) A^{-1}( g(Y) − g(Ŷ) − U (z[n] − ẑ[n]) )
        = (1/h) A^{-1}( g_y · ( hA ∆Y′ + U ε_n^y ) − U ε_n^z ) + O(hδ² + hδ + ε)
        = g_y ∆Y′ + (1/h) g_y A^{-1}U ε_n^y − (1/h) A^{-1}U ε_n^z + O(hδ² + hδ + ε).    (8.30)


The calculations performed here involve several Kronecker products [91]. For more clarity observe that

    A^{-1} · g_y · A ∆Y′ = (A^{-1} ⊗ I_{m2})(I_s ⊗ g_y)(A ⊗ I_{m1}) ∆Y′ = (A^{-1}A ⊗ g_y) ∆Y′ = (I_s ⊗ g_y) ∆Y′ ∈ R^{s m2}

and similarly

    A^{-1} · g_y · U ε_n^y = (A^{-1} ⊗ I_{m2})(I_s ⊗ g_y)(U ⊗ I_{m1}) ε_n^y = (A^{-1}U ⊗ g_y) ε_n^y = (I_s ⊗ g_y)(A^{-1}U ⊗ I_{m1}) ε_n^y ∈ R^{s m2}.

In (8.30) we wrote g_y ∆Y′ = (I_s ⊗ g_y) ∆Y′ and g_y A^{-1}U ε_n^y = (I_s ⊗ g_y)(A^{-1}U ⊗ I_{m1}) ε_n^y in order to shorten notations.

Writing the system (8.30) in matrix form yields

    [ I, −f_{z'} ; −g_y, I ] [ ∆Y′ ; ∆Z′ ] = [ 0, 0 ; g_y, −I ] [ (1/h) A^{-1}U, 0 ; 0, (1/h) A^{-1}U ] [ ε_n^y ; ε_n^z ] + O(hδ + ε + δ² + hδ²).

Using the notation M = I − f_{z'} g_y, the inverse of the coefficient matrix reads

    [ I, −f_{z'} ; −g_y, I ]^{-1} = [ M^{-1}, M^{-1} f_{z'} ; g_y M^{-1}, I + g_y M^{-1} f_{z'} ]

and we get an explicit representation for ∆Y′, ∆Z′. Namely,

    [ ∆Y′ ; ∆Z′ ] = −S_n [ (1/h) A^{-1}U, 0 ; 0, (1/h) A^{-1}U ] [ ε_n^y ; ε_n^z ] + O(hδ + ε + δ² + hδ²).    (8.31)

Notice that

    S_n = S( y(tn), z′(tn) ) = [ I − M^{-1}, M^{-1} f_{z'} ; −g_y M^{-1}, I + g_y M^{-1} f_{z'} ]    (8.32)

is a projector function. All matrix functions occurring in the definition of S_n have to be evaluated at (y(tn), z′(tn)).

Since ‖ε_n^y‖ ≤ ε, ‖ε_n^z‖ ≤ ε, we infer from (8.31) the inequality

    ‖[ ∆Y′ ; ∆Z′ ]‖ ≤ K ( hδ + ε + δ² + hδ² + ε/h )

with some constant K. As in [104, 111] it is possible to obtain an estimate for δ by considering the functional iteration δ = G(δ) = K( hδ + ε + δ² + hδ² + ε/h ) using the initial value δ^(0) = C h^{p−1}. It turns out that δ = O(h^{p−1}) for p ≥ 1. Inserting this result into (8.31) yields

    [ ∆Y′ ; ∆Z′ ] = −S_n [ (1/h) A^{-1}U, 0 ; 0, (1/h) A^{-1}U ] [ ε_n^y ; ε_n^z ] + { O(h^p) for p ≥ 2,  O(1) for p = 1 }.


Observe that the term O(1) originates from O(δ²) in (8.31). Recall that the term O(δ²) comes from higher order terms of the type f_z′z′( ∆Z′, ∆Z′ ). If f is linear in z′, these terms do not appear and we can conclude

    [ ∆Y′ ; ∆Z′ ] = −S_n [ (1/h) A^{-1}U, 0 ; 0, (1/h) A^{-1}U ] [ ε_n^y ; ε_n^z ] + η    (8.33)

with

    η = O(h^p)    for p ≥ 2, or p = 1 and f linear in z′,
    η = O(1)      for p = 1 and f nonlinear in z′.

It remains to insert (8.33) into (8.26c), (8.27c) to obtain

    [ ε^y_{n+1} ; ε^z_{n+1} ] = [ hB, 0 ; 0, hB ] [ ∆Y′ ; ∆Z′ ] + [ V, 0 ; 0, V ] [ ε_n^y ; ε_n^z ] − [ d^y_{n+1} ; d^z_{n+1} ]
                              = S_n [ −B A^{-1}U, 0 ; 0, −B A^{-1}U ] [ ε_n^y ; ε_n^z ] + [ V, 0 ; 0, V ] [ ε_n^y ; ε_n^z ] − d_{n+1} + η

with

    η = O(h^{p+1})    for p ≥ 2, or p = 1 and f linear in z′,
    η = O(h)          for p = 1 and f nonlinear in z′.

Introducing P_n = I − S_n we get the desired result:

    ε_{n+1} = −B A^{-1}U S_n ε_n + V ( S_n ε_n + P_n ε_n ) − d_{n+1} + η
            = ( V − B A^{-1}U ) S_n ε_n + V P_n ε_n − d_{n+1} + η.

Our aim is to prove convergence of general linear methods for implicit index-1 DAEs (8.1). With Lemma 8.25 most of the work is already done. All that is left to do is to use Lemma 8.25 recursively to find a representation of the global error in terms of the local one and the initial data.

Theorem 8.27 (Convergence). Let M = [A,U,B,V] be a general linear method with nonsingular A. Assume that

(a) M∞ = V − B A^{-1} U and V are power bounded.

(b) f, g are sufficiently differentiable.

(c) The initial input vectors y[0], z[0] satisfy

    y[0] = ŷ[0] + O(h^p),    z[0] = ẑ[0] + O(h^p)    (8.34)

with p ≥ 2; ŷ[n], ẑ[n] denote exact input vectors.

(d) The local truncation error satisfies d_{n+1} = O(h^{p+1}) with p ≥ 2.

Then the method is convergent with order p.


Proof. We start from the error recursion
$$\varepsilon_{n+1} = M_\infty S_n \varepsilon_n + \mathcal{V} P_n \varepsilon_n + \bar{d}_{n+1} \qquad (8.35)$$
derived in Lemma 8.25. For convenience $\bar{d}_{n+1} = -d_{n+1} + \eta_{n+1}$ is used as an abbreviation. Recall that $\eta_{n+1} = \mathcal{O}(h^{p+1})$ and thus also $\bar{d}_{n+1} = \mathcal{O}(h^{p+1})$. The components of the projectors $P$ and $S$ defined in (8.32) are smooth functions. Therefore we have
$$S_{n+1}S_n = S_n + \mathcal{O}(h), \quad P_{n+1}P_n = P_n + \mathcal{O}(h), \quad S_{n+1}P_n = \mathcal{O}(h), \quad P_{n+1}S_n = \mathcal{O}(h).$$
Multiplication of (8.35) by $P_{n+1}$ and $S_{n+1}$, respectively, yields
$$\begin{bmatrix} P_{n+1}\varepsilon_{n+1} \\ S_{n+1}\varepsilon_{n+1} \end{bmatrix}
= \begin{bmatrix} \mathcal{V}\bigl(1 + \mathcal{O}(h)\bigr) & \mathcal{O}(h) \\ \mathcal{O}(h) & M_\infty\bigl(1 + \mathcal{O}(h)\bigr) \end{bmatrix}
  \begin{bmatrix} P_n\varepsilon_n \\ S_n\varepsilon_n \end{bmatrix}
+ \begin{bmatrix} P_{n+1}\bar{d}_{n+1} \\ S_{n+1}\bar{d}_{n+1} \end{bmatrix}
= A(h)^{n+1} \begin{bmatrix} P_0\varepsilon_0 \\ S_0\varepsilon_0 \end{bmatrix}
+ \sum_{i=0}^{n} A(h)^i \begin{bmatrix} P_{n+1-i}\bar{d}_{n+1-i} \\ S_{n+1-i}\bar{d}_{n+1-i} \end{bmatrix}.$$
For sufficiently small $h$ the matrix $A(h) = \left[\begin{smallmatrix} \mathcal{V}(1+\mathcal{O}(h)) & \mathcal{O}(h) \\ \mathcal{O}(h) & M_\infty(1+\mathcal{O}(h)) \end{smallmatrix}\right]$ is power bounded, given that $\mathcal{V}$ and $M_\infty = \mathcal{V} - \mathcal{B}\mathcal{A}^{-1}\mathcal{U}$ are power bounded themselves. In particular, standard arguments [104] show that
$$\|\varepsilon_n\| \le C \Bigl( \|P_0\varepsilon_0\| + \|S_0\varepsilon_0\| + \frac{1}{h}\max_{i=1,\dots,n}\|P_i\bar{d}_i\| + \frac{1}{h}\max_{i=1,\dots,n}\|S_i\bar{d}_i\| \Bigr).$$
The constant $C$ is independent of $h$. Due to the assumptions (c) and (d) this stability inequality proves convergence of order (at least) $p$.
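The mechanism behind the stability inequality can be illustrated with a toy scalar recursion (an illustration only, not part of the proof): with a power-bounded amplification factor $m$, the $n = T/h$ local errors of size $C h^{p+1}$ accumulate to a global error of size roughly $CT\,h^p$.

```python
# Toy scalar model of the stability estimate: the power-bounded recursion
# e_{n+1} = m*e_n + C*h**(p+1), taken over n = T/h steps, turns local
# O(h^{p+1}) errors into a global O(h^p) error.  m = 1 is the extreme
# power-bounded case; any |m| <= 1 gives the same bound.
p, C, m, T = 2, 1.0, 1.0, 1.0
ratios = []
for h in (1e-2, 5e-3, 2.5e-3):
    e, n = 0.0, round(T / h)
    for _ in range(n):
        e = m * e + C * h ** (p + 1)
    ratios.append(e / h ** p)   # global error divided by h^p
print(ratios)  # all close to C*T = 1.0
```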

After proving this convergence result, let us briefly comment on the assumptions made when formulating Theorem 8.27.

Assumption (c) guarantees the existence of a solution for the first step of the numerical scheme (8.18) as (8.34) implies
$$y^{[0]} = \alpha_0\, y(t_0) + h\,\alpha_1\, y'(t_0) + \mathcal{O}(h^2), \qquad z^{[0]} = \alpha_0\, z(t_0) + h\,\alpha_1\, z'(t_0) + \mathcal{O}(h^2)$$
with pre-consistency vector $\alpha_0$ and consistency vector $\alpha_1$ (see Theorem 8.23). The error recursion of Lemma 8.25 and the power boundedness of $\mathcal{V}$ and $M_\infty$ ensure that $y^{[n]}$, $z^{[n]}$ are calculated with the same accuracy for $n \ge 1$. Thus the numerical scheme (8.18) is solvable for $n \ge 1$ as well. In order to guarantee (d) the order conditions from Theorem 8.19 have to be satisfied.

Comparing with the corresponding Theorem 3.3 in [104], which deals exclusively with Runge-Kutta methods, the assumption that $\mathcal{V}$ and $M_\infty$ be power bounded is the most obvious deviation. However, if $M$ were itself a Runge-Kutta


method, then $\mathcal{V} = 1$ and $M_\infty = 1 - b^\top A^{-1} e = R(\infty)$. Thus for Runge-Kutta methods assumption (a) means that $|R(\infty)| \le 1$.

Also note that we restricted attention to the case $p \ge 2$. This restriction is not necessary: similar results as in [104] can be derived for $p = 1$ as well, using merely technical considerations.

8.3 The Accuracy of Stages and Stage Derivatives

Theorem 8.27 guarantees that the global error
$$\varepsilon_n^y = y^{[n]} - \hat{y}^{[n]} = \mathcal{O}(h^p), \qquad \varepsilon_n^z = z^{[n]} - \hat{z}^{[n]} = \mathcal{O}(h^p)$$
is of order $p$ both for the $y$ and the $z$ component. It is interesting to investigate whether similar estimates hold for the global error
$$\Delta Y_{ni} = Y_{ni} - y(t_{n-1} + c_i h), \qquad \Delta Z'_{ni} = Z'_{ni} - z'(t_{n-1} + c_i h)$$
of the stages and stage derivatives as well. $Y_n$ and $Z'_n$ are calculated via
$$Y_n = h\mathcal{A} f(Y_n, Z'_n) + \mathcal{U} y^{[n-1]}, \qquad g(Y_n) = h\mathcal{A} Z'_n + \mathcal{U} z^{[n-1]},$$
where the input vectors $y^{[n-1]}$, $z^{[n-1]}$ are obtained by taking $n-1$ consecutive steps with the general linear method $M$ starting from $y^{[0]}$, $z^{[0]}$ at $t = t_0$. In order to keep track of the current step number, $Y_n$ and $Z'_n$ are equipped with the subscript $n$.

From [84] it is well known that for Runge-Kutta methods applied to index-2 DAEs in Hessenberg form the global error of the $z$ component is essentially given by the local error, i.e. if the local error is $\mathcal{O}(h^{p+1})$ then the global error is $\mathcal{O}(h^{p+1})$ as well. In this situation one immediately finds $\Delta Z'_n = \mathcal{O}(h^{\min(p,q)})$ from a perturbation result similar to Theorem 8.24.

Unfortunately, this standard approach for estimating the order of the derivative approximations is not feasible for implicit index-1 systems. In general the $z$ component is calculated with a global error of order $\mathcal{O}(h^p)$ only, and convergence of order $\mathcal{O}(h^{p+1})$ cannot be expected. This is illustrated in the following example.

Example 8.28. For $L \in \mathbb{R}$ consider the implicit DAE
$$y' = f(y, z') = L y^2 - z', \qquad t \in [0, 0.5], \qquad y(0) = z(0) = 1,$$
$$z = g(y) = y^3.$$
Due to $I - f_{z'} g_y = 1 + 3y^2 > 0$ the index is indeed 1 and the exact solution is given by
$$y(t) = \frac{1}{6}\Bigl( tL + 2 + \sqrt{(tL + 2)^2 + 12} \Bigr), \qquad z(t) = y(t)^3.$$
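Since $z = y^3$ implies $z' = 3y^2 y'$, the DAE is equivalent to $y'(1 + 3y^2) = Ly^2$, and the stated exact solution can be checked numerically (a quick sanity check, using a central-difference approximation of $y'$):

```python
import numpy as np

L = 1.0
def y_exact(t):
    u = t * L + 2.0
    return (u + np.sqrt(u * u + 12.0)) / 6.0

# z = y^3 gives z' = 3 y^2 y', so y' = L y^2 - z' is equivalent to
# y' (1 + 3 y^2) = L y^2.  Evaluate the residual along the exact solution.
t = np.linspace(0.01, 0.49, 50)
d = 1e-6
yp = (y_exact(t + d) - y_exact(t - d)) / (2 * d)   # central difference
residual = yp * (1.0 + 3.0 * y_exact(t) ** 2) - L * y_exact(t) ** 2
print(float(np.max(np.abs(residual))))  # zero up to rounding / O(d^2)
```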


This DAE can be solved numerically using the general linear method

$$M = \left[\begin{array}{ccc|ccc}
\frac14 & 0 & 0 & 1 & 0 & -\frac{1}{32} \\
\frac16 & \frac14 & 0 & 1 & \frac{1}{12} & -\frac{1}{24} \\
\frac16 & \frac12 & \frac14 & 1 & \frac{1}{12} & -\frac{1}{24} \\ \hline
\frac16 & \frac12 & \frac14 & 1 & \frac{1}{12} & -\frac{1}{24} \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & -2 & 2 & 0 & 0 & 0
\end{array}\right], \qquad
c = \begin{bmatrix} \frac14 \\ \frac12 \\ 1 \end{bmatrix},$$

constructed in [158]. The method has order $p = 2$ and stage order $q = 2$ for ordinary differential equations. To verify this, calculate

$$C = \begin{bmatrix} e & c & \tfrac{c^2}{2} \end{bmatrix}
= \begin{bmatrix} 1 & \frac14 & \frac{1}{32} \\ 1 & \frac12 & \frac18 \\ 1 & 1 & \frac12 \end{bmatrix}, \qquad
E = \begin{bmatrix} 1 & 1 & \frac12 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}$$

and use Theorem 7.6 to find that $\mathcal{U} = C - \mathcal{A}CK$ and $\mathcal{V} = E - \mathcal{B}CK$. Due to Theorem 8.22 the DAE order conditions are satisfied up to order 2 as well.
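These relations can be verified numerically; the sketch below (my reconstruction of the matrices, not the thesis' code) reproduces $\mathcal{U}$ and $\mathcal{V}$ from $C$, $E$, $K$ and the reconstructed $\mathcal{A}$, $\mathcal{B}$:

```python
import numpy as np

c = np.array([1/4, 1/2, 1.0])
C = np.column_stack([np.ones(3), c, c ** 2 / 2])   # C = [e, c, c^2/2]
E = np.array([[1.0, 1.0, 0.5], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
K = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])  # shift matrix

A = np.array([[1/4, 0, 0], [1/6, 1/4, 0], [1/6, 1/2, 1/4]])
B = np.array([[1/6, 1/2, 1/4], [0.0, 0.0, 1.0], [0.0, -2.0, 2.0]])

U = C - A @ C @ K      # stage order conditions (Theorem 7.6)
V = E - B @ C @ K
print(np.round(U, 6))  # rows [1, 0, -1/32], [1, 1/12, -1/24], [1, 1/12, -1/24]
print(np.round(V, 6))  # rows [1, 1/12, -1/24], [0, 0, 0], [0, 0, 0]
```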

[Figure 8.5 (a): Norm of the local error $d_1^y = y^{[1]} - \hat{y}^{[1]}$, $d_1^z = z^{[1]} - \hat{z}^{[1]}$.]

[Figure 8.5 (b): Norm of the global error $\varepsilon_n^y = y^{[n]} - \hat{y}^{[n]}$, $\varepsilon_n^z = z^{[n]} - \hat{z}^{[n]}$ at $t_n = 0.5$.]

[Figure 8.5 (c): Norm of the global error $\Delta Y_n$, $\Delta Z'_n$ at $t_n = 0.5$.]

[Figure 8.5 (d): Norm of the global error $\Delta Z'_n$ for $n = 1, 2, 3$.]

Figure 8.5: Numerical solution of Example 8.28 for $L = 1$ on $[0, 0.5]$ using fixed stepsizes. For Figure 8.5 (a) exact input vectors were used. All the other results were calculated from input vectors $y^{[0]} = \hat{y}^{[0]} + \mathcal{O}(h^p)$, $z^{[0]} = \hat{z}^{[0]} + \mathcal{O}(h^p)$ where $p = 2$ is the method's order.


Consequently, both the $y$ and the $z$ component are calculated with local error $d_n = \mathcal{O}(h^3)$. This is confirmed in Figure 8.5 (a). Since the matrices
$$\mathcal{V} = \begin{bmatrix} 1 & \frac{1}{12} & -\frac{1}{24} \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad
M_\infty = \mathcal{V} - \mathcal{B}\mathcal{A}^{-1}\mathcal{U} = \begin{bmatrix} 0 & 0 & 0 \\ \frac43 & \frac13 & -\frac{1}{12} \\ \frac{16}{3} & \frac43 & -\frac13 \end{bmatrix}$$
have eigenvalues $\sigma(\mathcal{V}) = \{1, 0, 0\}$, $\sigma(M_\infty) = \{0, 0, 0\}$, and are therefore power bounded, we expect convergence of order 2. Again, the numerical results from Figure 8.5 (b) confirm this result. Observe that the global error of the $z$ component is indeed $\mathcal{O}(h^2)$ even though the local error is $\mathcal{O}(h^3)$.
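The stability matrix can be checked directly from the reconstructed tableau (a verification sketch; the matrices are the ones given above):

```python
import numpy as np

A = np.array([[1/4, 0, 0], [1/6, 1/4, 0], [1/6, 1/2, 1/4]])
U = np.array([[1.0, 0.0, -1/32], [1.0, 1/12, -1/24], [1.0, 1/12, -1/24]])
B = np.array([[1/6, 1/2, 1/4], [0.0, 0.0, 1.0], [0.0, -2.0, 2.0]])
V = np.array([[1.0, 1/12, -1/24], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])

M_inf = V - B @ np.linalg.inv(A) @ U
print(np.allclose(M_inf, [[0, 0, 0], [4/3, 1/3, -1/12], [16/3, 4/3, -1/3]]))  # True
print(np.allclose(M_inf @ M_inf, 0))  # True: M_inf^2 = 0, hence sigma(M_inf) = {0}
```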

However, this does not mean that the stage derivatives are calculated with a lower order. In fact, Figure 8.5 (c) shows that the stage derivatives $Z'_n$ are calculated with order 2 as well. As seen from Figure 8.5 (d), this order of convergence is not exhibited after the first and second step, but from the third step onwards. We will see that the reason for this behaviour is the fact that $M_\infty$ is nilpotent with $M_\infty^2 = 0$.

From Figure 8.5 (d) it is clear that in general, and in particular for the first few steps, one can only expect $\Delta Z'_n = \mathcal{O}(h^{\min(p-1,q)})$. This will be proved in the next lemma.

Given the additional assumption $M_\infty^{k_0} = 0$, it is nevertheless possible to show $\Delta Z'_n = \mathcal{O}(h^{\min(p,q)})$. This situation will be dealt with later in Lemma 8.30.
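The order-2 convergence reported in Figure 8.5 (b) can be reproduced by implementing the method directly. The sketch below uses the tableau reconstructed above, builds exact Nordsieck starting vectors $[g,\ hg',\ h^2 g'']$ by finite differences, and solves each step's stage equations with a hand-rolled Newton iteration after eliminating $Z'$ through $g(Y) = h\mathcal{A}Z' + \mathcal{U}z^{[n]}$; none of this is the thesis' own code.

```python
import numpy as np

L = 1.0
def y_exact(t):
    u = t * L + 2.0
    return (u + np.sqrt(u * u + 12.0)) / 6.0

# Reconstructed tableau of the method in Example 8.28 (p = q = 2).
A = np.array([[1/4, 0, 0], [1/6, 1/4, 0], [1/6, 1/2, 1/4]])
U = np.array([[1.0, 0.0, -1/32], [1.0, 1/12, -1/24], [1.0, 1/12, -1/24]])
B = np.array([[1/6, 1/2, 1/4], [0.0, 0.0, 1.0], [0.0, -2.0, 2.0]])
V = np.array([[1.0, 1/12, -1/24], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])

def nordsieck(g, t, h, d=1e-4):
    # Nordsieck vector [g, h g', h^2 g''] via central differences.
    gp = (g(t + d) - g(t - d)) / (2 * d)
    gpp = (g(t + d) - 2 * g(t) + g(t - d)) / d ** 2
    return np.array([g(t), h * gp, h * h * gpp])

def integrate(N):
    h = 0.5 / N
    yv = nordsieck(y_exact, 0.0, h)                    # y^[0]
    zv = nordsieck(lambda t: y_exact(t) ** 3, 0.0, h)  # z^[0]
    for _ in range(N):
        # Eliminating Z' leaves Y + Y^3 - h L A Y^2 = U y^[n] + U z^[n].
        rhs = U @ yv + U @ zv
        Y = U @ yv                                     # predictor
        for _ in range(8):                             # Newton iteration
            R = Y + Y ** 3 - h * L * (A @ Y ** 2) - rhs
            J = np.eye(3) + 3 * np.diag(Y ** 2) - 2 * h * L * A * Y
            Y = Y - np.linalg.solve(J, R)
        Zp = np.linalg.solve(h * A, Y ** 3 - U @ zv)   # g(Y) = h A Z' + U z^[n]
        F = L * Y ** 2 - Zp                            # f(y, z') = L y^2 - z'
        yv = h * (B @ F) + V @ yv
        zv = h * (B @ Zp) + V @ zv
    return abs(yv[0] - y_exact(0.5))

e1, e2 = integrate(16), integrate(32)
print(e1 > e2)  # halving the stepsize reduces the global error
```

The elimination of $Z'$ is possible here because $g$ is explicit and $\mathcal{A}$ is nonsingular; in general both stage vectors are solved for simultaneously.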

For simplicity, this section deals exclusively with methods in Nordsieck form.

Lemma 8.29. Let the assumptions of Theorem 8.27 be satisfied. Given that the general linear method has stage order $q$ for ordinary differential equations, the global error satisfies $\Delta Y_n = \mathcal{O}(h^{\min(p,q+1)})$ and $\Delta Z'_n = \mathcal{O}(h^{\min(p-1,q)})$ for the stages and stage derivatives, respectively.

Even though Lemma 8.29 is a straightforward result, it is quite technical to prove. Again, B-series and elementary weight functions can be used. Recall that the method has $s$ internal and $r$ external stages. Thus matters become more complicated due to the fact that either $r < q+1$ or $r \ge q+1$ can hold. In order to treat both cases consistently, we introduce $\bar{r} = \min(r, q+1)$. As in Lemma 8.21 the matrices
$$C = \begin{bmatrix} e & c & \tfrac{c^2}{2} & \cdots & \tfrac{c^q}{q!} \end{bmatrix} \qquad \text{and} \qquad K = \begin{bmatrix} 0 & I_q \\ 0 & 0 \end{bmatrix}$$
will prove to be useful. In particular it is possible to write the stage order conditions in matrix form.

By assumption the method has stage order $q$ for ordinary differential equations. Thus we know from Theorem 7.6 that the first $\bar{r}$ columns of $\mathcal{U}$ are fixed by the stage order conditions. Using Matlab notation [123] to extract submatrices we obtain
$$\mathcal{U}(\,:\,, 1:\bar{r}) = \bigl(C - \mathcal{A}CK\bigr)(\,:\,, 1:\bar{r}). \qquad (8.36a)$$


If $r \ge q+1$ then (8.36a) reads $\mathcal{U}(\,:\,, 1:q{+}1) = C - \mathcal{A}CK$ as in Theorem 7.6. However, in case of $r < q+1$ additional conditions $\mathcal{A}c^{k-1} = \frac{1}{k}c^k$ are satisfied for $r \le k \le q$ due to high stage order. These are exactly the $C(k)$-conditions used for analysing Runge-Kutta methods (see page 34 or [25]). In Matlab notation we can write
$$\bigl(\mathcal{A}CK\bigr)(\,:\,, r+1 : q+1) = C(\,:\,, r+1 : q+1). \qquad (8.36b)$$

Using the relations (8.36) we proceed to proving Lemma 8.29.

Proof. Let $\hat{Y}_n$ and $\hat{Z}'_n$ be the stages and stage derivatives calculated from exact input vectors, i.e.
$$\hat{Y}_n = h\mathcal{A} f(\hat{Y}_n, \hat{Z}'_n) + \mathcal{U}\hat{y}^{[n-1]}, \qquad g(\hat{Y}_n) = h\mathcal{A}\hat{Z}'_n + \mathcal{U}\hat{z}^{[n-1]}.$$
The local error $\hat{Y}_{ni} - y(t_{n-1} + c_i h)$, $\hat{Z}'_{ni} - z'(t_{n-1} + c_i h)$ is studied first. The global error can then be bounded by finding estimates for $Y_{ni} - \hat{Y}_{ni}$ and $Z'_{ni} - \hat{Z}'_{ni}$.

(a) Exact input vectors are given as B-series $\hat{y}^{[n-1]} = B_{T_y}\bigl(S; y(t_{n-1}), z(t_{n-1})\bigr)$ and $\hat{z}^{[n-1]} = B_{T_z}\bigl(S; y(t_{n-1}), z(t_{n-1})\bigr)$ with the elementary weight function $S$ from Corollary 8.8. Theorem 8.14 implies that
$$\hat{Y}_n = B_{T_y}\bigl(\hat{v}; y(t_{n-1}), z(t_{n-1})\bigr), \qquad h\hat{Z}'_n = B_{T_z}\bigl(\hat{k}; y(t_{n-1}), z(t_{n-1})\bigr),$$
where $\hat{v}$ and $\hat{k}$ are the weight functions given in Definition 8.11. If $\tau \in T_y$ with $|\tau| < \bar{r}$, then $\hat{v}(\tau) = C\,S(\tau)$ is obtained similarly to Lemma 8.21 using (8.36a). On the other hand, if $\bar{r} \le |\tau| \le q$, then the term $\mathcal{U}S(\tau)$ vanishes such that
$$\hat{v}(\tau) = \mathcal{A}\hat{l}(\tau) + \mathcal{U}S(\tau) = \mathcal{A}\hat{l}(\tau) = \mathcal{A}CK\,S(\tau) = C\,S(\tau)$$
due to (8.36b). We have shown that $\hat{v}(\tau) = C\,S(\tau)$ for all trees with $|\tau| \le q$ such that $\hat{Y}_i = y(t_{n-1} + c_i h) + \mathcal{O}(h^{q+1})$ follows. The corresponding result for $h\hat{Z}'_i$ is obtained similarly by keeping in mind that $l([\sigma]_y) = k(\sigma)$ for $[\sigma]_y \in T_{yz}$ and $l(\tau) = k([\tau]_z)$ for $[\tau]_z \in T_z$ (Lemma 8.16).

(b) In order to derive bounds for $Y_{ni} - \hat{Y}_{ni}$ and $Z'_{ni} - \hat{Z}'_{ni}$, the stages $Y_n$ as well as the stage derivatives $hZ'_n$ computed after $n$ steps are written as
$$Y_n = B_{T_y}\bigl(v; y(t_0), z(t_0)\bigr), \qquad hZ'_n = B_{T_z}\bigl(k; y(t_0), z(t_0)\bigr).$$
Recall that $v$ and $k$ were introduced in Definition 8.11,
$$v : T_y \to \mathbb{R}^s, \qquad v(\tau) = \mathcal{A}\,l(\tau) + \mathcal{U}\,S_{n-1}(\tau),$$
$$l : T_y \to \mathbb{R}^s, \qquad l(\tau) = \begin{cases} 0, & \tau = \emptyset, \\ e, & \tau = \bullet, \\ |\tau|\, v(\tau_1)\cdots v(\tau_k)\,\dfrac{k(\sigma_1)}{|\sigma_1|}\cdots\dfrac{k(\sigma_l)}{|\sigma_l|}, & \tau = [\tau_1, \dots, \tau_k, \sigma_1, \dots, \sigma_l]_y, \end{cases}$$
$$k : T_z \to \mathbb{R}^s, \qquad (vD)(\sigma) = \mathcal{A}\,k(\sigma) + \mathcal{U}\,S_{n-1}(\sigma),$$


where the starting procedure $S_{n-1}$ represents both $y^{[n-1]}$ and $z^{[n-1]}$. Using $\bigl(y(t_0), z(t_0)\bigr)$ as the new reference point for the B-series expansion we rewrite the exact input quantities as $\hat{y}^{[n-1]} = B_{T_y}\bigl(\hat{S}_{n-1}; y(t_0), z(t_0)\bigr)$ and $\hat{z}^{[n-1]} = B_{T_z}\bigl(\hat{S}_{n-1}; y(t_0), z(t_0)\bigr)$ such that
$$\hat{Y}_n = B_{T_y}\bigl(\hat{v}; y(t_0), z(t_0)\bigr), \qquad h\hat{Z}'_n = B_{T_z}\bigl(\hat{k}; y(t_0), z(t_0)\bigr).$$
$\hat{v}$ and $\hat{k}$ are defined similarly to $v$, $k$ but using $\hat{S}_{n-1}$ instead of $S_{n-1}$.

Since the global error has order $p$ for the $y$ and $z$ component, the two starting procedures agree for trees of order less than $p$, i.e. $\hat{S}_{n-1}(\tau) = S_{n-1}(\tau)$ for all trees $\tau \in T_y \cup T_z$ with $|\tau| < p$. This immediately implies
$$\hat{v}(\tau) = v(\tau) \ \text{ for } |\tau| < p \qquad \text{and} \qquad \hat{k}(\sigma) = k(\sigma) \ \text{ for } |\sigma| < p.$$

(c) Putting the results of (a) and (b) together we find
$$Y_i = \hat{Y}_i + \mathcal{O}(h^p) = y(t_{n-1} + c_i h) + \mathcal{O}(h^{q+1}) + \mathcal{O}(h^p),$$
$$hZ'_i = h\hat{Z}'_i + \mathcal{O}(h^p) = h\,z'(t_{n-1} + c_i h) + \mathcal{O}(h^{q+1}) + \mathcal{O}(h^p).$$

In general the stage derivatives $Z'_n$ are calculated with a global error of magnitude $\Delta Z'_n = \mathcal{O}(h^{\min(p-1,q)})$, as was shown in the previous lemma. In spite of this, Figure 8.5 (d) shows that from the third step onwards the higher accuracy $\Delta Z'_n = \mathcal{O}(h^{\min(p,q)})$ is realised. It was pointed out earlier that the stability matrix at infinity, $M_\infty$, plays a crucial role in explaining this behaviour.

Lemma 8.30. Let the assumptions of Lemma 8.29 be satisfied. In particular the initial input vectors are assumed to satisfy $y^{[0]} = \hat{y}^{[0]} + \mathcal{O}(h^p)$ and $z^{[0]} = \hat{z}^{[0]} + \mathcal{O}(h^p)$. If the matrix $M_\infty = \mathcal{V} - \mathcal{B}\mathcal{A}^{-1}\mathcal{U}$ is nilpotent with nilpotency index $k_0$, then $\Delta Z'_n = \mathcal{O}(h^{\min(p,q)})$ after $k_0 + 1$ steps.

Proof. Define $y_0 = S_0$ and $z_0 = S_0$, where the elementary weight function $y_0 : T_y \to \mathbb{R}^r$ is defined on $T_y$ but $z_0 : T_z \to \mathbb{R}^r$ operates on $T_z$. For $n \ge 1$ consider
$$v_n = \mathcal{A}\,l_n + \mathcal{U}\,y_{n-1}, \qquad (v_n D) = \mathcal{A}\,k_n + \mathcal{U}\,z_{n-1},$$
$$y_n = \mathcal{B}\,l_n + \mathcal{V}\,y_{n-1}, \qquad z_n = \mathcal{B}\,k_n + \mathcal{V}\,z_{n-1}.$$
Similarly introduce
$$\hat{v}_n = \mathcal{A}\,\hat{l}_n + \mathcal{U}\,S_{n-1}, \qquad (\hat{v}_n D) = \mathcal{A}\,\hat{k}_n + \mathcal{U}\,S_{n-1},$$
$$\hat{y}_n = \mathcal{B}\,\hat{l}_n + \mathcal{V}\,S_{n-1}, \qquad \hat{z}_n = \mathcal{B}\,\hat{k}_n + \mathcal{V}\,S_{n-1},$$
as was already done in the previous proof. From Lemma 8.29 it is clear that $\hat{k}_n(\sigma) = k_n(\sigma)$ for all trees $\sigma \in T_z$ with $|\sigma| < p$. In order to prove the relation


$\Delta Z'_n = \mathcal{O}(h^{\min(p,q)})$ we need to show that $\hat{k}_n(\sigma) = k_n(\sigma)$ holds for trees of order $p$ as well.

Using the above notation it turns out that $z_n = \mathcal{B}\mathcal{A}^{-1}(v_n D) + M_\infty z_{n-1}$ and therefore
$$\mathcal{A}k_n = v_n D - \mathcal{U}z_{n-1} = v_n D - \mathcal{U}M_\infty z_{n-2} - \mathcal{U}\mathcal{B}\mathcal{A}^{-1}(v_{n-1}D)
= \cdots = v_n D - \mathcal{U}M_\infty^{n-1} z_0 - \sum_{i=1}^{n-1} \mathcal{U}M_\infty^{i-1}\mathcal{B}\mathcal{A}^{-1}(v_{n-i}D).$$
Given that $M_\infty^{k_0} = 0$ this simplifies to
$$\mathcal{A}k_n = v_n D - \sum_{i=1}^{k_0} \mathcal{U}M_\infty^{i-1}\mathcal{B}\mathcal{A}^{-1}(v_{n-i}D) \qquad (8.37a)$$
for $n \ge k_0 + 1$. As the local error is of order $\mathcal{O}(h^{p+1})$, we have $\hat{z}_n(\sigma) = S_n(\sigma)$ for every tree $\sigma$ with $|\sigma| \le p$. Thus a similar argument as above shows that
$$\mathcal{A}\hat{k}_n =_p \hat{v}_n D - \sum_{i=1}^{k_0} \mathcal{U}M_\infty^{i-1}\mathcal{B}\mathcal{A}^{-1}(\hat{v}_{n-i}D). \qquad (8.37b)$$

In contrast to (8.37a), equality holds only for trees with $|\sigma| \le p$. This is indicated using the symbol $=_p$.

Let $\sigma \in T_z$ be an arbitrary tree with order $|\sigma| = p$. If the root of $\sigma$ has ramifications, i.e. $\sigma = [\tau_1, \dots, \tau_k]_z$ with $k \ge 2$, then each subtree $\tau_i$ has order strictly lower than $p$. Therefore
$$(v_k D)(\sigma) = v_k(\tau_1)\cdots v_k(\tau_k) = \hat{v}_k(\tau_1)\cdots\hat{v}_k(\tau_k) = (\hat{v}_k D)(\sigma)$$
for every $k \ge 1$ and the recursions (8.37) imply $k_n(\sigma) = \hat{k}_n(\sigma)$ for $n \ge k_0 + 1$.

It remains to prove that this relationship holds for trees of the form $\sigma = [\tau]_z$, too. For such a tree $k_n(\sigma) = l_n(\tau)$ holds (Lemma 8.16). Recall that $\tau$ has the form $\tau = [\tau_1, \dots, \tau_k, \sigma_1, \dots, \sigma_l]_y$ with $(k, l) \ne (0, 1)$. Since $|\tau_i| < |\tau|$ and $|\sigma_j| < |\tau|$ for $i = 1, \dots, k$ and $j = 1, \dots, l$, the calculation of $l_n$ requires the evaluation of $v_n$ and $k_n$ only for trees of order less than $p$. This means $l_n(\tau) = \hat{l}_n(\tau)$ and therefore also $k_n(\sigma) = \hat{k}_n(\sigma)$.

Consider a stiffly accurate Runge-Kutta method, i.e. M = [A, e, b>, 1] withe>s A = b> such that the last row of A coincides with the vector b>. Thestability function reads R(z) = 1 + z b>(I − zA)−1 e (cf. Section 2.2) suchthat R∞ = limz→∞R(z) = 1 − b>A−1 e = 0. Therefore the error in the stagederivatives can be estimated by ∆Z ′n = O(hmin(p,q)) already for n ≥ 2.

Page 163: General Linear Methods for Integrated Circuit Design - edoc

8.3 The Accuracy of Stages and Stage Derivatives 163

For a general linear method with r > 1 stiff accuracy no longer implies M∞ = 0but only the first row is zero. Hence, nilpotency of the stability matrix M∞becomes an additional requirement.

A statement similar to Lemma 8.30 was already proved in [10] for BDF meth-ods. Since the BDF schemes are included within the framework of generallinear methods a result such as Lemma 8.30 had to be expected.

Example 8.31. Recall from Example 7.2 that the BDF3 method written inNordsieck form reads

M =

[A UB V

]=

611

1 511− 1

22− 7

66

611

1 511− 1

22− 7

66

1 0 0 0 01211

0 −1211− 1

11511

611

0 − 611− 6

11811

such that the stability matrix at infinity is given by

M∞ = V − BA−1U =

0 0 0 0

−116

−56

112

736

−2 −2 0 23

−1 −1 −12

56

.It can be verified easily that

M2∞ =

0 0 0 076

13−1

6118

3 1 −12

16

2 1 −12

16

, M3∞ =

0 0 0 0

−13

0 0 0−1 0 0 0−1 0 0 0

, M4∞ = 0.

Page 164: General Linear Methods for Integrated Circuit Design - edoc
Page 165: General Linear Methods for Integrated Circuit Design - edoc

9

Properly Stated Index-2 DAEs

In this chapter we return to studying nonlinear index-2 DAEs

A(t)[D(t)x(t)

]′+ b(x(t), t

)= 0 (9.1)

with a properly stated leading term. In Chapter 6 a decoupling procedure wasderived showing that the components u = DP1x and v = DQ1x satisfy theinherent index-1 equation

u′ = f(u,w(u, v′, t), t

), v = D(t)z(u, t). (9.2)

When studying numerical methods for (9.1) it is seminal that the inherentindex-1 DAE (9.2) is treated correctly. This work was carried out in the pre-vious section. General linear methods were studied for (9.2). Order conditionsand a convergence result have been derived.

We will investigate how these results can be transferred to the general index-2formulation (9.1) using the decoupling procedure from Chapter 6.

By ensuring that a general linear method, when applied to (9.1), correctlydiscretises the inherent system (9.2) the results of [90] are extended in twodirections: Not only nonlinear DAEs are treated but also the large class ofgeneral linear methods is considered.

9.1 Discretisation and Decoupling

Assume that we wanted to employ a general linear method M = [A,U ,B,V ]in order to solve the initial value problem

A[Dx]′ + b(x, ·) = 0, D(t0)P1(x0, t0)

(x(t0)− x0

)= 0. (9.3)

The vector x0 originates from an initialisation A(t0)y0 + b(x0, t0) = 0. Observe

that x0 will be inconsistent in general, but the formulation (9.3) ensures thatthere is a unique local solution as was shown in Theorem 6.7.

Page 166: General Linear Methods for Integrated Circuit Design - edoc

166 9 Properly Stated Index-2 DAEs

From Chapter 6 the exact solution x∗ is known to satisfy the decoupled system

u′ = f(u,w(u, v′, ·), ·

)= f(u, v′, ·), v = Dz(u, ·) = g(u, ·), (9.4a)

x∗ = D−u+ z(u, ·) + w(u, v′, ·). (9.4b)

Recall that u = DP1x and v = DQ1x are components of the solution deter-mined by the projector functions P1(t) = P1

(x(t), t

)and Q1(t) = Q1

(x(t), t

).

The function x ∈ C1(I,Rm) is assumed to satisfy

x(t0) = x0,(x(t), t

)∈ D×I ∀ t ∈ I, (9.5)

where D×I ⊂ Rm ×R is the domain where the DAE (9.3) is defined.

As the decoupled system (9.4) is known, it is straightforward to apply thegeneral linear method M directly to the inherent index-1 system (9.4a),

U = hAF (U, V ′) + U u[n], G(U) = hAV ′ + U v[n], (9.6a)

u[n+1] = hB F (U, V ′) + V u[n], v[n+1] = hB V ′ + V v[n]. (9.6b)

The numerical solution is then given by

xn = D−(tn)un + z(un, tn) + w(un, v′n, tn) (9.6c)

where un and v′n are numerical approximations to u(tn) and v′(tn) respectively.For stiffly accurate methods these quantities are given by the last stages, i.e.un = Us and v′n = V ′

s . As in Chapter 7 the shorthand notation

F (U, V ′) =

f(U1, V′1 , tn + c1 h)

...f(Us, V

′s , tn + cs h)

, G(U) =

g(U1, tn + c1 h)...

g(Us, tn + cs h)

has been used.

Unfortunately the system (9.4) is not available for direct computation. Eventhough the implicit function theorem guarantees the existence of the functions

A[Dx]′ + b(x, ·) = 0

index-2 DAE

u′ = f(u, v

′, ·)

v = g(u, ·)

implicit index-1 DAE

+ constraints

discretised

index-2 DAE

discretised index-1 DAE

+ discr. constraints

decoupling

discretisation discretisation

decoupling

Figure 9.1: The relationship between discretisation and decoupling

Page 167: General Linear Methods for Integrated Circuit Design - edoc

9.1 Discretisation and Decoupling 167

f, w and z, these mappings are not accessible for practical applications. Thedecoupling procedure is only a theoretical tool to study the behaviour of theexact/numerical solution. The decoupling is not intended to be carried out inpractice.

However, we will derive conditions on the DAE (9.3) which guarantee that ageneral linear method, when applied to the original formulation (9.3), behavesas if it was applied to the inherent system (9.4a). Thus the aim of this sectionis to derive conditions on the DAE (9.3) such that the diagram in Figure 9.1commutes.

We have to compare the following two approaches:

(a) The DAE (9.3) is decoupled first and the method M is applied to thedecoupled system (9.4a).

(b) The general linear method is applied directly to the original formulation(9.3) and the decoupling procedure is applied to the resulting discretisedscheme afterwards.

The decoupling procedure is explained in detail in Chapter 6. General linearmethods for implicit index-1 DAEs have been studied in Chapter 8. Thusapproach (a) has received considerable attention so far in this thesis.

For practical computations only the approach (b) is feasible. Additional re-quirements on the DAE (9.3) will ensure that the numerical quantities obtainedusing approach (b) coincide with those originating from approach (a). Hencethe numerical result will correctly reflect qualitative properties of the inherentindex-1 system.

9.1.1 Decoupling the Discretised Equations

The general linear method’s discretisation of (9.3) is given by

A(tn+ci h)[DX]′i + b(Xi, tn+cih) = 0, i = 1, . . . , s, (9.7a)

[DX] = hA [DX]′ + U [Dx][n], (9.7b)

[Dx][n+1] = hB [DX]′ + V [Dx][n]. (9.7c)

The stages Xi ≈ x(tn + cih) are approximations to the exact solution at inter-mediate timepoints. The super vectors [DX] and [DX]′ are given by

[DX] =

D(tn + c1 h)X1...

D(tn + cs h)Xs

, [DX]′ =

[DX]′1...[DX]′s

,

Page 168: General Linear Methods for Integrated Circuit Design - edoc

168 9 Properly Stated Index-2 DAEs

where [DX]′i ≈ ddt

(Dx)(tn+cih) represents stage derivatives for the component

Dx. The input vector

[Dx][n] ≈

(Dx) (tn)

h (Dx)′(tn)...

hr−1(Dx)(r−1)(tn)

is assumed to be a Nordsieck vector. Observe that only information about thesolution’s D-component is passed on from step to step. Hence errors in thiscomponent are the only ones that are possibly propagated.

As a consequence the numerical result xn+1 at tn+1 = tn+h has to be obtainedas a linear combination of the stages Xi. Every stage Xi satisfies the obviousconstraints, such that

Xi ∈M0(tn + cih), M0(t) = z ∈ D | b(z, t) ∈ imA(t) . (9.8)

The setM0(t) was already introduced in Section 4.2 on page 74. It is a desirablefeature for the numerical solution to satisfy xn+1 ∈ M0(tn+1) as well. Thusstiffly accurate methods, where xn+1 coincides with the last stage Xs, guaranteethis situation. We will therefore restrict attention to stiffly accurate generallinear methods, i.e.

M =

[A UB V

]with e>s A = e>1 B, e>s U = e>1 V , cs = 1,

such that the last row of [A,U ] coincides with the first row of [B,V ].

In order to decouple the discretised system (9.7) according to the approach(b) described above, the vector [Dx][n] of incoming approximations needs to besplit according to its DP1 and DQ1 parts. For the moment assume that

[Dx][n]i = u

[n−1]i + v

[n−1]i ∈ im(DP1)(tn−1)⊕ im(DQ1)(tn−1) (9.9)

for i = 1, . . . , r. We will justify this assumption later on.

The stages Xi are split according to

Ui = D(tn + cih)P1(tn + cih)Xi, Zi = Z(tn + cih)Xi, (9.10a)

Vi = D(tn + cih)Q1(tn + cih)Xi, Wi = T (tn + cih)Xi, (9.10b)

such that

Xi = D−(tn + cih)Ui + Zi + Wi, Vi = D(tn + cih)Zi. (9.10c)

Recall from Section 5.1 that T is a projector function with imT = N0 ∩S0. Aswe are always assuming that the structural condition

Page 169: General Linear Methods for Integrated Circuit Design - edoc

9.1 Discretisation and Decoupling 169

(A1) N0(t) ∩ S0(x, t) does not depend on x

holds, T can be chosen to depend on time only. Using U = I −T the projectorfunction

Z(t) = P0(t)Q1(t) + U(t)Q0(t)

was introduced in Section 5.1 as well. Finally, for a nonsingular matrix A stagederivatives U′

i and V′i can be defined using the relations

U = hAU′ + U u[n], V = hAV′ + U v[n].

A quick comparison with (9.7b), (9.9) and (9.10) shows that [DX]′ = U′ + V′

holds by construction.

Rewriting (9.7a) in terms of the new variables introduced so far yields

F(Ui,Wi,Zi,U

′i,V

′i, tn + cih)

= Ani(U′i + V′

i) + b(D−niUi + Zi + Wi, tn+cih) = 0,

(9.7a’)

where Ani = A(tn + cih) was used to shorten notations. Dni has a similarinterpretation. The mapping F (u,w, z, η, ζ, t) was first defined in (6.14b) onpage 93. Later, in Section 6.3, F was split into three different parts, F1, F2 andF3 used to determine z, w and f, respectively. This procedure will be repeatedfor the discretised system (9.7a’).

Lemma 9.1. Let (9.3) be a regular index-2 DAE with a properly stated leadingterm. Assume that the structural condition (A1) is satisfied and that (9.5) holdsfor some function x ∈ C1(I,Rm). If a general linear method M = [A,U ,B,V ]with nonsingular A and sufficiently small h is used for the numerical scheme(9.7), then the following relations hold:

Zi = z(Ui, tn + cih), (9.11a)

Vi = Dni z(Ui, tn + cih), (9.11b)

U′i = Ti

1 −Ti2 + f(Ui,Wi, ti), (9.11c)

0 = F2

(Ui,Wi,V

′i, tn + cih

)+ (Q0Q1D

−)niTi2 + Ti

3. (9.11d)

z is the mapping from Lemma 6.4. The functions f and F2 are defined in (6.16)and Lemma 6.5, respectively. The terms Ti

j are given by

Ti1 =

(I − (DP1D

−)ni)U′i − (DP1D

−)′niUi,

Ti2 = (DP1D

−)niV′i + (DP1D

−)′niVi,

Ti3 = (Q0Q1D

−)ni(DP1D−)′niUi − (Q0Q1D

−)niU′i.

Page 170: General Linear Methods for Integrated Circuit Design - edoc

170 9 Properly Stated Index-2 DAEs

Proof. Lemma 5.2 and 6.9 imply ZG−12 AD = 0 such that the representation

(9.7a’) leads to F1

(Ui,Zi, tn + cih

)= 0 for the function

F1(u, z, ·) = ZG−12 b(D−u+Zz , ·

)+(I−Z

)z

(see Section 6.3 for more details). As the mapping z is implicitly defined byF1(u, z, t) = 0, the stages Zi are given by (9.11a). The relation (9.11b) followsfrom (9.11a) due to Vi = DniQ1Xi = DniZi.

To be precise, it is assumed that the stepsize h is small enough to guaranteethat (Ui, tn + cih) remains in the domain where z is defined. This will not bestated every time, but all arguments are meant locally in that sense.

As a next step, (9.7a’) is multiplied by DP1G−12 . The leading term can be

written as

(DP1G−12 A)ni[DX]′i = (DP1D

−)ni(U′i + V′

i

)= U′

i − (I − (DP1D−)ni)U′

i + (DP1D−)niV

′i

such that

U′i =

(I − (DP1D

−)ni)U′i − (DP1D

−)niV′i −DP1G

−12 b(Xi, tn + cih)

and therefore (9.11c) follows. Finally, (9.7a) shows that

(TG−12 A)ni(U

′i + V′

i) + (TG−12 )i b

(D−niUi + Zi + Wi, tn + cih

)= 0.

Using (9.11a) and (9.11c) this equation can be written as

0 = F2

(Ui,Wi,V

′i, tn + cih

)+(TG−1

2 A)ni

(Ti1 −Ti

2). (9.12)

It remains to take TG−12 A = −Q0Q1D

− into account to see that (9.12) isidentical to (9.11d).

The terms Tij appearing in the previous lemma are similar to correspond-

ing quantities obtained in [90] when studying Runge-Kutta methods for linearDAEs. There we also find practical criteria for Ti

j = 0.

Lemma 9.2. (a) If the space imDP1D− is constant, then Ti

1 = 0, Ti3 = 0.

(b) If the space imDQ1D− is constant, then Ti

2 = 0.

Proof. The proof is essentially the same as in [90, Lemma 7].

9.1.2 Comparing Approach (a) and (b)

We are now in a position to prove commutativity for the diagram in Figure 9.1.Recall that we wanted to compare the following two approaches:

Page 171: General Linear Methods for Integrated Circuit Design - edoc

9.1 Discretisation and Decoupling 171

(a) The DAE (9.3) is decoupled first and the method M is applied to thedecoupled system (9.4a).

This approach leads to the numerical scheme

U = hAF (U, V ′) + U u[n], G(U) = hAV ′ + U v[n], (9.6a)

u[n+1] = hB F (U, V ′) + V u[n], v[n+1] = hB V ′ + V v[n], (9.6b)

with

F (U, V ′)i = f(Ui,w(Ui, V

′i , tn + cih), tn + cih

),

G(U)i = Dni z(Ui, tn + cih).

For stiffly accurate methods the numerical solution is given by

xn+1 = D−(tn+1)Us + z(Us, tn+1) + w(Us, V′s , tn+1). (9.6c)

(b) The general linear method is applied directly to the original formulation(9.3) and the decoupling procedure is applied to the resulting discretisedscheme afterwards.

We saw above that this approach leads to

U = hAU′ + U u[n], V = hAV′ + U v[n], (9.13a)

u[n+1] = hBU + V u[n], v[n+1] = hBV′ + V v[n]. (9.13b)

Lemma 9.1 ensures that

Vi = Dni z(Ui, tn + cih), U′i = Ti

1 −Ti2 + f(Ui,Wi, ti),

0 = F2

(Ui,Wi,V

′i, tn + cih

)+ (Q0Q1D

−)niTi2 + Ti

3..

Finally the numerical result can be represented in the form

xn+1 = Xs = DnsUs + z(Us, tn+1) + Ws. (9.13c)

Obviously, the two schemes (9.6) and (9.13) coincide provided that

• the same initial vectors u[0] = u[0], v[0] = v[0] are used

• the terms Tij = 0 vanish,

• Wi = w(Ui,V′i, tn + cih) holds.

Theorem 9.3. Let the assumptions of Lemma 9.1 be satisfied and assumethat the subspaces imDP1D

− and imDQ1D− are constant. If the initial input

vector satisfies

[Dx][0]i = u

[0]i + v

[0]i ∈ im(DP1)(t0)⊕ im(DQ1)(t0) (9.14)

for i = 1, . . . , r, then the diagram in Figure 9.1 commutes.

Page 172: General Linear Methods for Integrated Circuit Design - edoc

172 9 Properly Stated Index-2 DAEs

Proof. Due to Lemma 9.2 the terms Tij vanish such that Ti

1 = 0, Ti3 = 0 and

Ti2 = 0. Therefore (9.11d) reduces to

F2(Ui,Wi,V′i, tn + cih) = 0.

This equation determines Wi as a function of Ui, V′i and the timepoint tn+cih

(cf. Lemma 6.5). Thus Wi = w(Ui,V′i, tn + cih), since the implicitly defined

function w is uniquely determined. From (9.11c) we find

U′i = f

(Ui,w(Ui,V

′i, tn + cih), tn + cih

)= F (U,V′)i (9.15a)

and (9.11b) implies

Vi = DniZi = Dniz(Ui, tn + cih) = G(U)i. (9.15b)

As a consequence, (9.6) and (9.13) yield the same numerical scheme.

Note that (9.7b) and (9.7c) split into

U = hAU′ + U u[n], V = hAV′ + U v[n]

u[n+1] = hBU′ + V u[n], v[n+1] = hBV′ + V v[n].

This justifies (9.9) whenever we have a similar splitting (9.14) for the initialinput vector.

The case of index-1 equations is again included in the results presented here.If (9.3) was an index-1 DAE, G2 = G1 would be non-singular. Hence Q1 = 0,P1 = I is an obvious choice for the projectors in the sequence (6.7). We haveDQ1D

− = 0 and DP1D− = DD− = R, where R is the projector function re-

lated to the properly stated leading term (see Definition 3.2). For Theorem 9.3the only remaining requirement is a constant subspace imR = imD.

As for linear DAEs [89, 90] we will say that a given equation is numericallyqualified whenever Theorem 9.3 is applicable. More precisely we consider

Definition 9.4. • A regular index-1 DAE (9.3) with a properly stated lead-ing term is said to be numerically qualified if imD is constant.

• A regular index-2 DAE (9.3) with a properly stated leading term is saidto be numerically qualified if the two subspaces DS1 = imDP1D

− andDN1 = imDQ1D

− are constant.

For a general linear method to be suited for integrating nonlinear index-2 DAEswith properly stated leading terms it has to be designed for dealing with theimplicit index-1 systems but at the same time the method has to discretise theconstraints correctly. The commutativity of diagram 9.1 guarantees that bothproperties are combined correctly.

When doing practical computations the quantities U, V, u[n], v[n] and so onare never used explicitely. All calculations are based on X, [DX]′ and [Dx][n],

Page 173: General Linear Methods for Integrated Circuit Design - edoc

9.2 Convergence 173

since the numerical scheme has to be applied directly to the index-2 formula-tion. Nevertheless Theorem 9.3 ensures that these quantities are interrelatedby (9.13). Thus for numerically qualified DAEs the method behaves as if itwas integrating the inherent index-1 equation.

It remains to prove convergence for general linear methods applied to properlystated index-2 DAEs. To this point we now turn.

9.2 Convergence

In this section we again consider properly stated differential algebraic equations

A(t)[D(t)x(t)

]′+ b(x(t), t

)= 0 (9.16)

with index 1 or 2 and assume that (9.16) is numerically qualified in the senseof Definition 9.4. Recall from Chapter 6 that the exact solution x(t) can bewritten as

x(t) = D−(t)u(t) + z(t) + w(t). (9.17a)

The functions u and v = Dz satisfy the inherent index-1 system equation

u′(t) = f(u(t), v′(t), t

), v(t) = g

(u(t), t

), (9.17b)

while the components w and z are given by

w(t) = w(u(t), v′(t), t), z(t) = z

(u(t), t). (9.17c)

In the previous section a general linear method M = [A, U, B, V] was applied to (9.16), giving rise to the numerical scheme

A_{ni}[DX]'_i + b(X_i, t_n + c_i h) = 0, i = 1, ..., s, (9.18a)
[DX] = hA [DX]' + U [Dx]^[n], (9.18b)
[Dx]^[n+1] = hB [DX]' + V [Dx]^[n]. (9.18c)

As the DAE (9.16) is assumed to be numerically qualified, Theorem 9.3 ensures that the stages X_i exhibit a structure quite similar to (9.17a). In particular

X_i = D^-_{ni} U_i + Z_i + W_i, (9.19a)

U = hA F(U, V') + U u^[n],   G(U) = hA V' + U v^[n],
u^[n+1] = hB U' + V u^[n],   v^[n+1] = hB V' + V v^[n], (9.19b)

W_i = w(U_i, V'_i, t_n + c_i h), Z_i = z(U_i, t_n + c_i h). (9.19c)

Observe that (9.19b) is precisely the general linear method's discretisation of the inherent index-1 system (9.17b). As usual the shorthand notations F(U, V')_i = f(U_i, V'_i, t_n + c_i h) and G(U)_i = D_{ni} z(U_i, t_n + c_i h) have been used.


For stiffly accurate methods the numerical result x_{n+1} at time t_{n+1} is given by the last stage such that x_{n+1} = X_s. In order to prove convergence we therefore have to compare (9.17a) and (9.19a).

Theorem 9.5. Let the regular index-µ DAE

A(t)[D(t)x(t)]' + b(x(t), t) = 0, D(t_0)P_1(x^0, t_0)(x(t_0) − x^0) = 0, (9.20)

with a properly stated leading term, µ ∈ {1, 2}, satisfy the following conditions:

(A1) N_0(t) ∩ S_0(x, t) does not depend on x,

(A2) ∃ (y^0, x^0, t_0) ∈ im D(t_0) × D × I such that A(t_0)y^0 + b(x^0, t_0) = 0,

(A3) ∃ x ∈ C^1(I, R^m) such that x(t_0) = x^0 and (x(t), t) ∈ D × I ∀ t ∈ I,

(A4) the derivatives ∂b/∂x, dD/dt and ∂/∂t (Z G_2^{-1} b) exist and are continuous,

(A5) (9.16) is numerically qualified in the sense of Definition 9.4.

Then there exists a unique local solution x ∈ C^1_D([t_0, t_end], R^m) of the initial value problem (9.20). Let M = [A, U, B, V] be a general linear method with nonsingular A. Assume that

(a) V is power bounded and M_∞ is nilpotent with M_∞^{k_0} = 0.

(b) f, v, w are sufficiently differentiable.

(c) The initial input vector [Dx]^[0] agrees with the exact input vector up to O(h^p) terms.

(d) The method has order p for implicit index-1 DAEs (9.17b) and stage order q = p for ordinary differential equations.

(e) M is stiffly accurate.

Then, after k_0 + 1 steps, the numerical result computed using a constant stepsize h converges to the exact solution of the DAE (9.16) with order (at least) p.

Before actually proving this result, some brief remarks are due.

The structural condition (A1) ensures that the decoupled system (9.17) can be constructed. As in Theorem 6.7 the assumptions (A2)-(A4) guarantee the existence of the solution x.

(A5) ensures that Theorem 9.3 is applicable. Since discretisation and decoupling are therefore known to commute, the discrete system (9.18) decouples into (9.19) and we can study (9.19a) and (9.19b) separately.

The assumptions (a)-(d) guarantee convergence of order p for the inherent index-1 system (9.17b). This result was established in Theorem 8.27. The initial input vector [Dx]^[0] can be computed using generalised Runge-Kutta methods taking only the initial value x_0 as input but producing r output quantities. In this case, but also when building up [Dx]^[0] gradually using a variable order implementation, the splitting (9.14) is evident.


Finally, assumption (e) ensures that the numerical result x_{n+1} = X_s coincides with the last stage. Thus the numerical result at t_{n+1} satisfies the obvious constraint (9.8) and x_{n+1} can be expressed using (9.19a).

Observe that the assumption on M_∞ to be nilpotent is a stronger requirement than in Theorem 8.27. There M_∞ was assumed to be power bounded, but the nilpotency of M_∞ allows the application of Lemma 8.29 such that V' is calculated with order min(p, q) = p.

Proof of Theorem 9.5. Due to stiff accuracy, comparing (9.17a) and (9.19a) yields an estimate for the global error,

‖x_{n+1} − x(t_{n+1})‖ ≤ ‖D^-_{ns}(U_s − u(t_{n+1}))‖ + ‖z(U_s, t_{n+1}) − z(u(t_{n+1}), t_{n+1})‖
    + ‖w(U_s, V'_s, t_{n+1}) − w(u(t_{n+1}), v'(t_{n+1}), t_{n+1})‖
≤ ‖D^-_{ns}(U_s − u(t_{n+1}))‖ + ∫_0^1 ‖z_u(ϕ(τ), t_n)‖ dτ · ‖U_s − u(t_{n+1})‖
    + ∫_0^1 ‖w_u(ϕ(τ), ψ(τ), t_n)‖ dτ · ‖U_s − u(t_{n+1})‖
    + ∫_0^1 ‖w_{v'}(ϕ(τ), ψ(τ), t_n)‖ dτ · ‖V'_s − v'(t_{n+1})‖.

The functions ϕ and ψ are given by

ϕ(τ) = τ U_s + (1 − τ) u(t_{n+1}),  ψ(τ) = τ V'_s + (1 − τ) v'(t_{n+1}).

Since the partial derivatives z_u, w_u and w_{v'} are bounded, there are h-independent constants C_1, C_2 such that

‖x_{n+1} − x(t_{n+1})‖ ≤ C_1 ‖U_s − u(t_{n+1})‖ + C_2 ‖V'_s − v'(t_{n+1})‖.

Recall that the general linear method was assumed to be stiffly accurate. Thus Theorem 8.27 implies U_s = u(t_{n+1}) + O(h^p). On the other hand, Lemma 8.29 and M_∞^{k_0} = 0 show that after k_0 + 1 steps the stage derivatives satisfy V'_s = v'(t_{n+1}) + O(h^p) due to the stage order being q = p. In summary, the global error can be estimated as ‖x_{n+1} − x(t_{n+1})‖ = O(h^p), proving the assertion.

The proof presented here hinges on three essential ingredients:

• The DAE (9.20) is properly stated and the structural condition (A1) ensures the existence of the decoupled system (9.17).

• Theorem 8.27 guarantees convergence of order p for the inherent index-1 system (9.17b).

• Lemma 8.29 yields the same accuracy for the stage derivatives V'_i.

In view of Theorem 8.22 the second requirement can be expressed in terms of the method's order and stage order for ordinary differential equations.

Corollary 9.6. Let M be a general linear method in Nordsieck form with s = r > 1. Then the assumption (d) in Theorem 9.5 can be replaced by

(d') M has order and stage order equal to p = q = s − 1 for ordinary differential equations.


Part IV

Practical General Linear Methods


10

Construction of Methods

The derivation of integration schemes is a well-established topic in numerical analysis. Methods later called Runge-Kutta schemes were used as early as 1901 by Kutta [103]. The construction of Runge-Kutta methods is a complex task, as complicated nonlinear expressions, the order conditions, have to be solved. The use of rooted trees by Butcher [15, 16, 18] made a systematic study of order conditions possible.

This work led to well-known order barriers. Explicit methods with order p, for example, require at least s ≥ p stages. In case of p ≥ 5 the number of stages necessarily satisfies s > p [25]. On the other hand, the maximum attainable order of an s-stage implicit Runge-Kutta scheme is p = 2s.

The RadauIIA methods suitable for stiff equations were constructed by John Butcher in [17]. These methods form the basis of the popular code Radau written by Hairer and Wanner [85, 86]. An extensive overview of Runge-Kutta formulae is given in [22].

Alexander and Nørsett [1, 141] focused on diagonally implicit methods. This work continues in Trondheim to this day. A comprehensive overview is available at [106]. The code Simple written by Nørsett and Thomsen uses an embedded pair of SDIRK methods with order 3(2). Recent research in Trondheim focuses on SDIRK methods with an explicit first stage [105] as well as on multirate methods [5].

Linear multistep schemes were used even earlier than Runge-Kutta methods. Adams and Bashforth [6] used linear multistep methods in 1883 to study capillary action by investigating drop formation. BDF methods were introduced by Curtiss and Hirschfelder [48] and became popular through the work of Gear [70, 71]. Dassl, one of the most successful BDF solvers, was written by Petzold [128]. It was also Gear who promoted the use of Nordsieck vectors for linear multistep methods. The codes Difsub of Gear and later Lsode of Hindmarsh [134] use this technique.

With [71, 73, 75, 84, 111] and other references, differential algebraic equations came into focus in the 1980s. The numerical methods used for DAEs were


the ones that had been successfully applied to stiff ordinary differential equations. The BDF code Dassl and the Runge-Kutta code Radau are the most prominent examples.

Methods of Rosenbrock type were studied for DAEs as well [137, 140]. Günther took up this approach and adapted Rosenbrock methods to the charge oriented MNA equations [79] (see also Section 1.1). Details about his code Choral can be found in [81].

Richardson extrapolation is a further technique that can be used efficiently provided that the global error has a known asymptotic expansion into powers of the stepsize h. For index-1 equations Deuflhard, Hairer and Zugck [54] were able to derive a perturbed asymptotic expansion such that extrapolation methods can be used successfully. The code Limex developed at the Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB) [159] is based on the extrapolated implicit Euler scheme. Limex is capable of solving linear implicit index-1 DAEs B(y, t)y'(t) = f(y, t).

Extrapolation schemes were modified by Gerstberger and Günther [74] to be applicable to integrated circuit design. Their work showed that index-1 circuits are feasible, but it is not known how to derive the required asymptotic expansion for index-2 equations.

General linear methods (GLMs) were introduced by Butcher [18] in 1966 in order to provide a unifying framework for both linear multistep and Runge-Kutta methods. General linear methods are addressed in detail in the monograph [22]. There three matrices are used to characterise a method, but the equivalent formulation

M = [A U; B V],   Y = hA F(Y) + U y^[n],   y^[n+1] = hB F(Y) + V y^[n], (10.1)

using four matrices later became a standard. This formulation due to Burrage and Butcher [13] was already used in Section 2.3 and Chapter 7. Observe that (10.1) covers ordinary differential equations

y' = f(y, t), y(t_0) = y_0,

where F(Y)_i = f(Y_i, t_n + c_i h) for i = 1, ..., s. Recall that a general linear method M is characterised by four integers:

s: the number of internal stages Y_i,

r: the number of external stages y^[n]_i,

p: the order of the method,

q: the stage order of the method.

Among the huge class of general linear methods, Diagonally Implicit Multistage Integration Methods (Dimsims) seem to be most likely to be used for practical applications. A Dimsim is loosely defined by the quantities s, r, p, and q

all being approximately equal [25]. Dimsims were introduced by Butcher in 1993 [23]. It is common practice to require the matrix A to have a lower triangular structure,

A = [a_11 0 ··· 0; a_21 a_22 ··· 0; ⋮ ⋱ ⋮; a_s1 a_s2 ··· a_ss]. (10.2)

Thus the stages Y_i can be evaluated sequentially such that an efficient implementation is possible (see Section 2.2). We will adopt the diagonally implicit structure (10.2) as a standard as well. Whenever possible the diagonal elements a_ii will be chosen to be equal. This allows a further reduction of implementation costs.

As stability of a general linear method is directly connected with the matrix V being power bounded (see Definition 7.3), this matrix is often assumed to be of low rank. The example methods constructed in [23] focus on the structure U = I and V = e·v where e = [1 ··· 1]^T and v = [v_1 ··· v_r]. In [33, 34] Butcher and Jackiewicz require the same structure for U and V. Computer algebra tools are used in [33] for the construction of Dimsims with order 1, 2, 3. Order 4 methods are obtained with the aid of homotopy techniques. Higher order methods are constructed in [34] using least squares minimisation algorithms from Minpack [125]. The main reason for using numerical searches is the fact that computer algebra tools such as Maple or Mathematica are no longer powerful enough to solve the order conditions associated with high p and q. Methods of even higher order, p = 7, p = 8, are derived in [39] using a nonlinear optimisation approach.

General linear methods for parallel computing have been developed by Butcher and Chartier [28]. The matrix A = λI is assumed to be diagonal. Methods of this kind are sometimes referred to as type-4 Dimsims. Type-4 methods with the additional property that the stability matrix M(z) has only one nonzero eigenvalue are called Dimsems by Enenkel and Jackson¹ [60]. Using the somewhat unusual choice of all parameters c_i being zero, they succeeded in constructing methods with p = r − 1 for any r ≥ 2.

The concept of Runge-Kutta stability was introduced in Chapter 7. Obviously, Dimsems belong to this class of methods. A systematic study of general linear methods having inherent Runge-Kutta stability by Wright [158] led to the surprising result that Dimsims with s = r = p + 1 = q + 1 can be constructed using only linear operations. The important concept of inherent Runge-Kutta stability was already introduced in Definition 7.9.

Methods of this type mark the starting point for constructing practical methods for index-2 DAEs in the next two sections. As described above, most examples of general linear methods were derived in the context of (stiff) ODEs. In

¹Dimsem: Diagonally IMplicit Single-Eigenvalue Method.


view of Theorem 9.5 and Corollary 9.6, we know that DAEs require additional properties such as stiff accuracy or improved stability at infinity. More precisely, we focus on methods having the following properties:

(P1) M has order p and stage order q = p for ordinary differential equations.

(P2) A has a diagonally implicit structure (10.2) with a_ii = λ for i = 1, ..., s.

(P3) V is power bounded and M_∞ is nilpotent. M_∞^{k_0} = 0 for some k_0 ≥ 1 is a necessary requirement for L-stability, ensuring that the stage derivatives are calculated with the desired accuracy (see Lemma 8.30).

(P4) The abscissae c_i satisfy c_i ∈ [0, 1] and c_i ≠ c_j for i ≠ j.

(P5) M is stiffly accurate, i.e. c_s = 1 and the last row of [A U] coincides with the first row of [B V].

(P6) M is given in Nordsieck form such that the stepsize can be varied easily.

Apart from these requirements a small error constant is desirable. The error constant C_p signifies that the local truncation error for an order p method applied to the ordinary differential equation y' = f(y) is given by the term C_p h^{p+1} y^{(p+1)}(t_n) + O(h^{p+2}). Thus a small |C_p| allows larger steps to be taken for a given tolerance. More details on computing error constants will be given in Chapter 11.

Finally, we will pay close attention to stability properties when constructing methods. Recall from the introduction that for electrical circuit simulation the stability behaviour along the imaginary axis is an indicator for the method's damping properties. We saw in Examples 2.1 and 2.2 that two kinds of oscillations need to be addressed:

• Oscillations of physical significance should be preserved. Hence for a sampling rate of about 8-20 steps per period, the largest eigenvalue of M(yi) is required to stay close to 1 provided that y ∈ [0.1π, 0.25π].

• High-frequency oscillations caused by numerical noise, perturbations, inconsistent initial values or errors from the Newton iteration should be damped out quickly. This requires the eigenvalues of M(yi) to decay as y tends towards infinity.

We will first look at methods with inherent Runge-Kutta stability satisfying s = r = p + 1 = q + 1. In order to improve stability at infinity, stiffly accurate methods with M_∞ = 0 will be constructed as well.

These methods use s = p + 1 internal stages. Thus the computational costs per step can be reduced by considering Dimsims with s = r = p = q. Methods of this type will be considered in Section 10.2.

Recall from Chapter 2 that the MNA equations modelling electrical circuits typically suffer from poor smoothness properties of the semiconductor models being used. Thus we restrict attention to low order methods with 1 ≤ p ≤ 3.


10.1 Methods with s = p + 1 Stages

General linear methods M = [A, U, B, V] with inherent Runge-Kutta stability and s = r = p + 1 = q + 1 have been constructed in [38, 40, 158]. Other examples can be found in [93], where implementation details are addressed as well. The construction is based on an algorithm developed by Wright [158].

Recall from Section 7.2 that Runge-Kutta stability,

Φ(w, z) = det(w I − M(z)) = w^{r−1}(w − R(z)), (10.3)

is ensured by requiring that V has a single nonzero eigenvalue given by V e_1 = e_1 and imposing the conditions²

BA = XB, BU ≡ XV − VX. (10.4)

As seen in Lemma 7.10, the conditions (10.4) guarantee that the stability matrix M(z) and V are related by a similarity transformation

(I − zX) M(z) (I − zX)^{−1} ≡ V. (10.5)

Wright showed that the most general matrix X satisfying (10.4), (10.5) has doubly companion form [158]

X = [−α_1 −α_2 ··· −α_p −α_{p+1}; 1 0 ··· 0 −β_p; 0 1 ··· 0 −β_{p−1}; ⋮ ⋮ ⋱ ⋮ ⋮; 0 0 ··· 1 −β_1]. (10.6)

The vector β = [β_1 ... β_p] is considered as a free parameter. The entries of α = [α_1 ... α_{p+1}] are determined by det(wI − X) = (w − λ)^s since the matrix X = BAB^{−1} has a single s-fold eigenvalue λ due to (10.4) and the requirement (P2) from the previous page. It will always be assumed that B is nonsingular (see [158] for more details).

Observe that in (10.4) the matrix A is determined by B. Similarly, for methods in Nordsieck form, the conditions for order p = q = s − 1 fix the matrices U and V in terms of A and B,

U = C − ACK, V = E − BCK. (10.7)

²As in Section 7.2, A ≡ B indicates equality for the matrices A and B with the possible exception of the first row.


The matrices

K = [0 1 0 ··· 0 0; 0 0 1 ··· 0 0; ⋮ ⋮ ⋮ ⋱ ⋮ ⋮; 0 0 0 ··· 0 1; 0 0 0 ··· 0 0],

E = [1 1/1! 1/2! ··· 1/(p−1)! 1/p!; 0 1 1/1! ··· 1/(p−2)! 1/(p−1)!; ⋮ ⋮ ⋮ ⋱ ⋮ ⋮; 0 0 0 ··· 1 1/1!; 0 0 0 ··· 0 1] = exp(K)

and C = [e c c²/2! ··· c^p/p!] have been introduced in Theorem 7.6 on page 118.

Thus, the construction of a general linear method M = [A, U, B, V] with inherent Runge-Kutta stability boils down to choosing the coefficients of B.

Instead of the matrix B itself, Wright considers B̄ = Ψ^{−1}B. The nonsingular matrix Ψ is known from the transformation of X to Jordan canonical form, i.e. Ψ^{−1}XΨ = λI + J where J = K^T. In order to guarantee a lower triangular structure for A, the matrix B̄ has to be lower triangular, too.

Stiff accuracy as required by the condition (P5) can be ensured by setting

c_s = 1, β_p = 0, e_2^T B = e_s^T. (10.8)

It is easily seen that (10.8) indeed implies stiff accuracy since

e_s^T A = e_s^T B^{−1}XB = e_2^T XB = e_1^T B,
e_s^T U = e_s^T (C − ACK) = e_1^T (E − BCK) = e_1^T V.

For the second equation, observe that c_s = 1 ensures e_s^T C = e_1^T E.

Methods satisfying (10.8) will not only be stiffly accurate, but they also satisfy the FSAL property known from the study of Runge-Kutta methods.

The relations

e_2^T B = [0 ··· 0 1],
e_2^T V = e_2^T (E − BCK) = e_2^T E − e_s^T CK = e_2^T E − e_1^T EK = [0 ··· 0]

show that for ordinary differential equations y' = f(y, t) the exact derivative Y'_s = f(Y_s, t_n + c_s h) of the last stage is passed on to the next step as the second component of the Nordsieck vector. More precisely we have y^[n+1]_2 = hY'_s such that the FSAL property (first same as last) is satisfied: hY'_s could equally well be computed using an explicit zeroth stage in step number n + 1. We will see in Chapter 11 that this property simplifies error estimation considerably (see also [40]).

The construction of general linear methods M = [A, U, B, V] with inherent Runge-Kutta stability can now proceed along the following lines (full details are given in [158]):

(1) Choose s = r = p + 1 = q + 1 in advance and fix parameters λ, c_1, ..., c_{s−1}, β_1, ..., β_{p−1} ∈ R. Set c_s = 1, β_p = 0 and choose a nonsingular matrix T ∈ R^{(s−2)×(s−2)}.


(2) Compute α, X and the transformation matrix Ψ.

(3) Compute B̄ from the following set of linear conditions:

• inherent Runge-Kutta stability (10.4),

• stiff accuracy and FSAL property (10.8),

• zero-stability, i.e. T^{−1}V̄T is strictly upper triangular for V = [1 v_12 v_1^T; 0 0 0; 0 v_2 V̄] with v_1, v_2 ∈ R^{s−2}, V̄ ∈ R^{(s−2)×(s−2)},

• tr(M_∞) = 0, as a vanishing trace implies that the single nonzero eigenvalue is zero at infinity. The error constant can be prescribed as well [158].

(4) Compute B = ΨB̄, A = λI + B̄^{−1}JB̄ and U = C − ACK, V = E − BCK.

This algorithm was used in [40] to derive the family of methods presented in Table 10.1. Observe that the order-1 method does not use the second component of the Nordsieck vector. Thus, in a variable order implementation, no starting procedure is required.

[A U; B V] =
[ 3/10    0   | 1  0
  7/10  3/10  | 1  0
  7/10  3/10  | 1  0
    0     1   | 0  0 ]

(a) Order p = q = 1, c = [3/10 1]^T

[A U; B V] =
[    4/9     0     0   | 1    −1/9    −5/54
  124/27   4/9     0   | 1  −118/27  −130/81
  −40/81  −4/27   4/9  | 1    97/81  155/486
  −40/81  −4/27   4/9  | 1    97/81  155/486
      0      0     1   | 0      0       0
  −21/4    3/4    9/4  | 0     9/4      0   ]

(b) Order p = q = 2, c = [1/3 2/3 1]^T

(c) Order p = q = 3, c = [1/4 1/2 3/4 1]^T, with diagonal element λ = 9/40 (the rational coefficients, with denominators up to 1651200, are not reproduced here).

Table 10.1: A family of methods constructed by Butcher and Podhaisky in [40] having inherent Runge-Kutta stability.


The methods from Table 10.1 have Runge-Kutta stability by construction. It can be verified that for the method of order p the single nonzero eigenvalue of the stability matrix

M(z) = V + zB(I − zA)^{−1}U

is given by R_p(z) where

R_1(z) = (1 + (1 − 2λ)z) / (1 − λz)²,

R_2(z) = (1 + (1 − 3λ)z + (3λ² − 3λ + 1/2)z²) / (1 − λz)³,

R_3(z) = (1 + (1 − 4λ)z + (1/2)(1 − 6λ)(1 − 2λ)z² + (−4λ³ + 6λ² − 2λ + 1/6)z³) / (1 − λz)⁴.

The parameter λ denotes the diagonal element. Observe that R_p(z) coincides with the stability function of stiffly accurate SDIRK methods having s = p + 1 stages and order p [85]. Thus, the methods from Table 10.1 have the same linear stability behaviour as the corresponding SDIRK methods. A-stability (and hence L-stability) is ensured by an appropriate choice of λ.

The stability matrix at infinity plays a decisive role for the methods' applicability to index-2 differential algebraic equations. For the methods of Table 10.1,

M_{∞,1} = [0 0; 40/9 0],

M_{∞,2} = [0 0 0; 9/4 0 0; 26/18 5/18 0],

M_{∞,3} = [0 0 0 0; 16/5 0 −137/720 0; −99/281 0 0 0; −355712/243 −69088/243 −56/3 0].

Although M_{∞,p} is nilpotent for 1 ≤ p ≤ 3 with M_{∞,p}^s = 0, the entries of M_{∞,p} tend to become large in magnitude (up to ≈ 1463.84 for p = 3).

Recall from Theorem 7.6 that a method in Nordsieck form with s = r = p+1 =q + 1 needs to satisfy U = C−ACK, V = E−BCK. Therefore

M∞ = V − BA−1U = (E − BCK)− BA−1(C −ACK) = E − BA−1C

vanishes if and only if

B = E C−1A. (10.9)

For (10.9) to make sense we have to restrict attention to non-confluent methods,i.e. ci 6= cj for i 6= j. In this case C is a Vandermonde matrix and hencenonsingular. The matrix V can be written as

V = E − BCK = E − E C−1ACK (10.10)

such that A and c remain the only free parameters of the method.


Given A and c we consider general linear methods in Nordsieck form with

M_p = [A, C − ACK; E C^{−1}A, E − E C^{−1}ACK],
A = [λ 0 ··· 0 0; a_21 λ ··· 0 0; ⋮ ⋮ ⋱ ⋮ ⋮; a_{s−1,1} a_{s−1,2} ··· λ 0; a_{s1} a_{s2} ··· a_{s,s−1} λ] (10.11)

and c = [c_1 ··· c_{s−1} 1]^T. The structure (10.9) for the matrix B ensures M_{∞,p} = 0. The particular choice of U and V shows that M_p has order p and stage order q = p.

Order p = 1

For order-1 methods with 2 stages (10.11) takes the form

M_1 = [λ 0 | 1 c_1 − λ;
       a_21 λ | 1 1 − a_21 − λ;
       a_21 λ | 1 1 − a_21 − λ;
       (λ − a_21)/(c_1 − 1) −λ/(c_1 − 1) | 0 (c_1 + a_21 − 1)/(c_1 − 1)],
c = [c_1 1]^T.

Observe that e_s^T A = e_1^T B and c_s = 1 guarantee stiff accuracy, but we do not require the FSAL property explicitly. As M_1 will be used to start the integration, picking a zero second column for U and V ensures that only the initial value is required for startup. We arrive at

M_1 = [λ 0 | 1 0;
       1 − λ λ | 1 0;
       1 − λ λ | 1 0;
       (1 − 2λ)/(1 − λ) λ/(1 − λ) | 0 0],
c = [λ 1]^T. (10.12)

Figure 10.1: Choosing λ for the method (10.12). Figure 10.1 (a) shows the modulus |C_1(λ)| of the error constant C_1(λ) = λ² − 2λ + 1/2 together with the region of A-stability. In Figure 10.1 (b) the stability function R_1(z) is plotted for z = yi with λ = 3/10, 7/20 and 1/2. Increasing λ leads to poorer preservation of physical oscillations.


This method has the stability function

Φ(w, z) = det(w I − M(z)) = w² − [(1 + (1 − 2λ)z)/(1 − λz)²] w = w(w − R_1(z)).

It turns out that M_1 has Runge-Kutta stability and hence the same linear stability behaviour as the SDIRK method with p = 1 and two stages.

The diagonal element λ can be chosen in the interval [1 − √2/2, 1 + √2/2] for A-stability [85]. Figure 10.1 suggests λ = 3/10 such that a small error constant C_1(3/10) = −1/100 and good damping behaviour along the imaginary axis are realised.

Order p = 2

For p = s − 1 = 2 the formulation (10.11) leaves six free parameters: λ, c_1, c_2 and a_21, a_31, a_32. Recall from (10.4) that a necessary condition for inherent Runge-Kutta stability is

BA = XB ⇔ E C^{−1}A = X E C^{−1} ⇔ X = (E C^{−1}) A (E C^{−1})^{−1}.

Since X has doubly companion form (10.6), the submatrix X_{2:3,1:2} = I yields four conditions on A and c. These are linear in a = [a_21 a_31 a_32]^T. In particular, the linear system

X_{2,1} = 1, X_{3,1} = 0, X_{3,2} = 1

can be solved for

a_21 = c_2 − c_1, a_31 = (1 − c_1)(2c_1 − 3c_2 + 1)/(2(c_1 − c_2)), a_32 = (1 − c_1)(c_2 − 1)/(2(c_1 − c_2)).

The remaining condition X_{2,2} = (1/2)(c_1 − c_2) + λ = 0 can easily be satisfied by choosing c_1 = c_2 − 2λ. For simplicity, c_2 is chosen such that the abscissae c_i are spaced equidistantly in [0, 1], i.e. c_1 = 1/2 − λ, c_2 = 1/2 + λ. This results in the method

the method

λ 0 0 1 12− 2λ 1

8− λ+ 3

2λ2

2λ λ 0 1 12− 2λ 1

8− λ+ 3

2λ2

(2λ+1)(10λ−1)16λ

1−4λ2

16λλ 1 1

2− 2λ 1

8− λ+ 3

2λ2

(2λ+1)(10λ−1)16λ

1−4λ2

16λλ 1 1

2− 2λ 1

8− λ+ 3

2λ2

4λ2+4λ−14λ(2λ+1)

4λ2+4λ−14λ(2λ−1)

−4λ4λ2−1

0 0 02λ−1

2λ(2λ+1)6λ−1

2λ(2λ−1)−8λ

4λ2−10 0 0

. (10.13)


The diagonal element λ is again a free parameter. The stability function

Φ(w, z) = w³ − [(1 + (1 − 3λ)z + (3λ² − 3λ + 1/2)z²)/(1 − λz)³] w² = w²(w − R_2(z))

shows that for every λ the method (10.13) has Runge-Kutta stability. R_2(z) is the stability function of the SDIRK method with s = p + 1 = 3. Therefore, A-stability is achieved for λ ∈ [0.18042531, 2.18560010] (see again [85]). Figure 10.2 shows that again small values of λ lead to smaller error constants and to the desired stability behaviour along the imaginary axis. The diagonal element λ = 2/11 ≈ 0.1818 might be a good compromise since the corresponding method is A-stable with C_2(2/11) = −103/7986 ≈ −0.0129. Methods with even smaller error constants are possible for λ ≈ 0.436 = 109/250, but Figure 10.2 (b) indicates stronger numerical damping for these methods.

Figure 10.2: Choosing λ for the method (10.13). Figure 10.2 (a) shows the modulus |C_2(λ)| of the error constant C_2(λ) = −λ³ + 3λ² − (3/2)λ + 1/6 together with the region of A-stability. In Figure 10.2 (b) the stability function R_2(yi) is plotted for λ = 2/11, 3/11 and 109/250. The choice λ = 2/11 leads to a small error constant C_2(2/11) = −103/7986 and nice stability properties.

Order p = 3

Unfortunately the approach taken for order 2 methods cannot be generalised top = 3. In fact, for p = s− 1 = 3 and λ 6= 0 there is no inherently Runge-Kuttastable method of the form (10.11) satisfying all requirements (P1) – (P6) frompage 182. This can be seen as follows:

Let X be the doubly companion matrix from (10.6) and assume that

β1 = 1− δ1, β2 = 12− δ1 − δ2, β3 = 1

6− 1

2δ1 − δ2 − δ3.

Then

X = E−1XE =

−α1 −α2 · · · −αp −αp+1

1 0 · · · 0 −δp0 1 · · · 0 −δp−1...

.... . .

......

0 0 · · · 1 −δ1

Page 190: General Linear Methods for Integrated Circuit Design - edoc

190 10 Construction of Methods

is again a doubly companion matrix. For a method (10.11) to have inherentRunge-Kutta stability the relation

BA = XB ⇔ E C−1A = XEC−1 ⇔ AC = CX

needs to be satisfied. This system of nonlinear equations3 can be solved forthe unknowns λ, ci, δi, aij. Simple but quite lengthy computations not beingreproduced here show that either λ = 0 or ci = cj for some i 6= j.

Nevertheless Runge-Kutta stability is still possible. For simplicity assume that c = [1/4 1/2 3/4 1]^T such that the stability function

Φ(w, z) = (p_0 + p_1 w + p_2 w² + p_3 w³ + (1 − λz)⁴ w⁴) · 1/(1 − λz)⁴

can be written in terms of polynomials

p_k = Σ_{l=0}^k p_{kl} z^l, k = 0, ..., 3,

where the p_{kl} are polynomials depending on the method's remaining parameters λ and a_ij. Details on this particular structure are given in [33, 34].

For Runge-Kutta stability we have to ensure p_0 = 0, p_1 = 0, p_2 = 0 such that Φ(w, z) = w³(w − R_3(z)). This yields 1 + 2 + 3 = 6 conditions on the remaining seven parameters. Thus λ is fixed in advance and the system

p_00 = 0, p_10 = 0, p_11 = 0, p_20 = 0, p_21 = 0, p_22 = 0 (10.14)

is solved for a_21, a_31, a_32, a_41, a_42, a_43.

The equations (10.14) can easily be generated using computer algebra tools. Consider Maple [112] as an example. Appropriate commands read

s := 4:
A := evalm(matrix(s, s, (i, j) -> `if`(i <= j, 0, cat('a', i, j))) + lambda*&*()):
C := matrix(s, s, (i, j) -> (i/s)^(j-1) / (j-1)!):
K := matrix(s, s, (i, j) -> `if`(j = i+1, 1, 0)):
E := exponential(K):
B := evalm(E &* inverse(C) &* A):
U := evalm(C - A &* C &* K):
V := evalm(E - B &* C &* K):
Mz := map(simplify, evalm(V + z*B &* inverse(&*() - z*A) &* U)):
Phi := series(det(w*&*() - Mz), w, s+1):
eqns := seq(seq(
    coeff(series(numer(coeff(Phi, w, k)), z, k+1), z, l),
    l = 0..k), k = 0..2);

The set eqns contains all equations from (10.14). However, solving this system symbolically,

solve({eqns}, indets(eqns) minus {lambda});

was not possible on an Intel® Pentium® M processor with 1.4 GHz and 512 MB RAM, using Maple 9.5. Hence it was decided to solve (10.14) numerically using routines from Minpack.

³Recall that c_i appears nonlinearly due to C = [e c c²/2! ··· c^p/p!].


The Fortran subroutine hybrd from Minpack [125] solves a system of n nonlinear equations in n variables using a modification of the Powell hybrid method. At every iteration the correction is chosen as a convex combination of the Newton and scaled gradient directions. This ensures (under reasonable conditions) global convergence for starting points far from the solution and a fast rate of convergence. The Jacobian is computed using forward differences and rank-1 Broyden updates.
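Minpack itself is Fortran, but one of its basic ingredients, a Newton iteration with a forward-difference Jacobian, is easy to sketch. The following is not hybrd; the 2×2 system `Fsys` and all names are hypothetical illustrations with root (x, y) = (1, 2), not the order conditions (10.14):

```python
def Fsys(v):
    # hypothetical test system, root at (1, 2)
    x, y = v
    return [x*x + y - 3.0, x + y*y - 5.0]

def newton_fd(v, tol=1e-12, eps=1e-7, itmax=50):
    for _ in range(itmax):
        f = Fsys(v)
        if max(abs(fi) for fi in f) < tol:
            break
        # forward-difference approximation of the Jacobian, column by column
        J = [[0.0, 0.0], [0.0, 0.0]]
        for j in range(2):
            vp = list(v)
            vp[j] += eps
            fp = Fsys(vp)
            for i in range(2):
                J[i][j] = (fp[i] - f[i]) / eps
        # solve J d = f by Cramer's rule and take a full Newton step
        det = J[0][0]*J[1][1] - J[0][1]*J[1][0]
        d0 = (f[0]*J[1][1] - f[1]*J[0][1]) / det
        d1 = (J[0][0]*f[1] - J[1][0]*f[0]) / det
        v = [v[0] - d0, v[1] - d1]
    return v

root = newton_fd([1.5, 1.5])
print(root)   # close to [1.0, 2.0]
```

The real hybrd additionally blends this Newton direction with a scaled gradient direction (dogleg strategy) and reuses the Jacobian via Broyden updates, which is what makes it robust far from the solution.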

Input for the simplified driver hybrd1 can be generated using Maple,

fvec := convert(eqns, list):
codegen[fortran](fvec):

Figure 10.3: Choosing λ for the order-3 methods. Figure 10.3 (a) shows the magnitude of M's largest element over the region of A-stability; different initial values a⁰ lead to a variety of methods for each λ. Figure 10.3 (b) shows the error constant C_3(λ); two methods M(λ, a⁰) ≠ M(λ, ā⁰) have the same error constant. Figure 10.3 (c) shows the stability behaviour on the imaginary axis: for λ = 1/3 oscillations of physical significance will be damped more rapidly than for λ = 1/4. Figure 10.3 (d) compares stability regions; the stable region is given by the outside of the kidney-shaped area. Overall, λ = 1/4 leads to a small error constant C_3(1/4) ≈ 0.00391. The largest coefficient of the corresponding method has magnitude ≈ 38.7 (see Table 10.2). Although λ = 1/3 leads to smaller coefficients, the stability behaviour is inferior to that of the 1/4-method. The error constant C_3(1/3) ≈ 0.017 is larger as well.


For fixed λ ∈ [0.22364780, 0.57281606] this approach yields A-stable methods [85] as the single nonzero eigenvalue is given by the stability function R3(z) of the SDIRK method with s = p + 1 = 4. Different starting points a⁰ = [a⁰21 a⁰31 a⁰32 a⁰41 a⁰42 a⁰43]ᵀ lead to methods M(λ, a⁰). Figure 10.3 (a) shows that the coefficients of M(λ, a⁰) vary enormously in absolute size. Smaller coefficients are preferable over larger ones, as exceedingly large entries will lead to cancellation of significant digits due to round-off effects. The order 3 method from Table 10.2 with λ = 1/4 offers a good compromise between reasonably sized coefficients, a small error constant and good damping properties. For λ = 1/3 smaller coefficients are possible, but the error constant is considerably larger and the stability behaviour is no longer acceptable (see Figure 10.3).

Even though the order 3 method in Table 10.2 is reproduced using (truncated) decimal numbers, the matrices

B = E C⁻¹ A,   U = C − A C K,   V = E − B C K

will be computed using (10.7) and (10.9). In an actual implementation the order conditions will thus be satisfied exactly. The errors resulting from truncating A's coefficients will affect stability only. As reported in [34], these changes in the method's stability behaviour are generally negligible.

[A U; B V] =

(a) Order p = q = 1, c = [3/10, 1]ᵀ:

    3/10   0     |  1   0
    7/10   3/10  |  1   0
    7/10   3/10  |  1   0
    4/7    3/7   |  0   0

(b) Order p = q = 2, c = [7/22, 15/22, 1]ᵀ:

    2/11      0        0      |  1   3/22   −7/968
    4/11      2/11     0      |  1   3/22   −7/968
    135/352   105/352  2/11   |  1   3/22   −7/968
    135/352   105/352  2/11   |  1   3/22   −7/968
    −17/120   17/56    88/105 |  0   0      0
    −77/60    −11/28  176/105 |  0   0      0

(c) Order p = q = 3, c = [1/4, 1/2, 3/4, 1]ᵀ:

    0.25000000   0            0            0           | 1   0           −0.03125000  −0.005208333
    3.96644133   0.25000000   0            0           | 1  −3.71644133  −0.99161033  −0.134367958
    7.10874919   0.19011989   0.25000000   0           | 1  −6.79886908  −1.77849724  −0.245913398
    9.22821238   0.30032578   0.14504057   0.25000000  | 1  −8.92357873  −2.31599641  −0.325048352
    9.22821238   0.30032578   0.14504057   0.25000000  | 1  −8.92357873  −2.31599641  −0.325048352
    5.83388174   1.42095043  −1.93636919   1.83333333  | 0  −6.15179631  −1.55000209  −0.231990442
   −23.5448944  10.4008341  −15.3587019    8.00000000  | 0  20.5027622    5.20483297   0.755308606
   −28.7175183  30.7178318  −38.7174038   16.0000000   | 0  20.7170903    4.85851648   0.946963295

Table 10.2: A family of methods having Runge-Kutta stability and M∞,p = 0.


10.2 Methods with s = p Stages

The methods from Tables 10.1 and 10.2 use s = p + 1 internal stages to perform an order p computation. Hence, each step requires at least p + 1 function evaluations – one for each stage. Due to the diagonally implicit structure the stages can be evaluated sequentially. Newton's method is used in order to determine the stages Yi for i = 1, . . . , s. Evaluating the Jacobian and performing the Newton iteration requires additional function calls.

In electrical circuit simulation a function call means the evaluation of all device models. This process is often referred to as 'loading'. For standard applications with up to 10³ equations the loading requires approximately 85% of the work while the linear solver causes only about 10% [66]. Most time is spent for loading since even standard transistor models are very complex. The overhead required for stepsize and convergence control is usually below 5%.

Consequently, a large number of internal stages may slow down the simulation significantly. Spending more time per step can be justified only if larger steps are taken. This, in turn, requires small error constants and a relatively smooth behaviour of the solution. Discontinuities and breakpoints, as they appear frequently in circuit simulation, will render an efficient stepsize control more difficult. It is not clear for realistic applications whether the steps can be taken large enough to benefit from the small error constants and the good stability properties of the methods constructed above.

In order to judge how much work should be spent per step, it was decided not only to look at methods with s = p + 1 stages but also to construct methods with only s = p stages. In this section general linear methods M = [A, U, B, V] with s = r = p = q are derived. In other words, we will construct Dimsims of type 2, where the matrix A has the diagonally implicit structure (10.2).

The method M has to satisfy the properties (P1) – (P6) from page 182 such that it can be used efficiently for solving differential algebraic equations. However, for s = r = p = q fewer coefficients are available for satisfying these requirements. Compared to the previous section little freedom remains. It will turn out that M∞,p = 0 is not possible for p = 2 and the property of Runge-Kutta stability has to be dropped for p = 3.

Order p = 1

For s = r = 1 the method's tableau simply reads M1 = [a u; b v] with real numbers a, u, b, v. Stiff accuracy and the conditions for p = q = 1 imply that

M1 = [1 1; 1 1]

is uniquely defined. This method, however, is nothing but the implicit Euler scheme already introduced in Section 2.1.
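To make the tableau notation concrete, the following minimal sketch (an illustration, not code from this thesis) performs one step of a diagonally implicit general linear method [A, U, B, V] for the scalar linear test equation y′ = λy; since A is lower triangular each stage can be solved in closed form. With the tableau M1 = [1 1; 1 1] the step reproduces implicit Euler, y_{n+1} = y_n/(1 − hλ).

```python
# One step of a diagonally implicit GLM [A,U,B,V] for y' = lam*y.
# The stages are solved sequentially in closed form (no Newton iteration).
def glm_step(A, U, B, V, lam, h, yin):
    s, r = len(A), len(V)
    F = []                                   # stage derivatives f(Y_i) = lam*Y_i
    for i in range(s):
        rhs = sum(h * A[i][j] * F[j] for j in range(i)) \
            + sum(U[i][j] * yin[j] for j in range(r))
        Yi = rhs / (1.0 - h * A[i][i] * lam)  # solve the implicit stage
        F.append(lam * Yi)
    return [sum(h * B[i][j] * F[j] for j in range(s))
            + sum(V[i][j] * yin[j] for j in range(r)) for i in range(r)]

# M1 = [1 1; 1 1] applied with lam = -2, h = 0.1, y0 = 1:
y = glm_step([[1.0]], [[1.0]], [[1.0]], [[1.0]], -2.0, 0.1, [1.0])
```

The output y[0] agrees with the implicit Euler value 1/(1 − hλ) = 1/1.2.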


Order p = 2

The construction of general linear methods with s = r = p = q = 2 is more interesting. The generic singly diagonally implicit scheme being stiffly accurate reads

M2 = [A U; B V] =
    λ    0    |  u11  u12
    a21  λ    |  u21  u22
    a21  λ    |  u21  u22
    b21  b22  |  v21  v22
,   c = [c1, 1]ᵀ

such that eleven parameters can be chosen to satisfy the requirements (P1) – (P6) from page 182. In order to guarantee p = q = 2 the method's coefficients have to satisfy

UW = C − ACK,   VW = WE − BCK.   (10.15)

These relations were derived in Theorem 7.6 for C = [e  c  c²/2!  · · ·  cᵖ/p!], K = [0 I; 0 0] and E = exp(K). More details are given in Section 7.1. Recall that methods in Nordsieck form are characterised by Wij = 1 for i = j and Wij = 0 for i ≠ j, such that W = [1 0 0; 0 1 0] for s = p = 2. The order and stage order conditions (10.15) can therefore be written as

U = [ 1   c1 − λ ;  1   1 − a21 − λ ],   Ac = (1/2)c²,   V = [ 1   1 − a21 − λ ;  0   1 − b21 − b22 ],   Bc = [ 1/2 ;  1 ].

As usual, the exponentiation cᵏ has to be understood in a component by component sense. Solving these equations for c1, a21 and b22 the method M2 takes the form⁴

M2 =
    λ               0          |  1   λ
    1/(4λ) − 1/2    λ          |  1   3/2 − λ − 1/(4λ)
    1/(4λ) − 1/2    λ          |  1   3/2 − λ − 1/(4λ)
    b21             1 − 2b21λ  |  0   b21(2λ − 1)
,   c = [2λ, 1]ᵀ.

Computing the stability matrix M(z) and determining the limit for z → ∞ yields the matrix

M∞,2 = [ 0   0 ;  (2b21λ(4λ² − 1) + 1 − 2λ − 4λ²)/(2λ³)   (2b21λ(2λ − 1) + 1 − 4λ + 2λ²)/(2λ³) ] = [ 0 0 ; m21 m22 ].

It can be checked easily that the system m21 = 0, m22 = 0 has no solution. Hence, M∞,2 = 0 is not possible. Nevertheless, it is easy to achieve Runge-Kutta stability by requiring b21 = 0. This can be seen by computing the stability function

Φ(w, z) = det(wI − M(z)) = (1/(1 − λz)²) (p0 + p1w + (1 − λz)²w²).

⁴ For simplicity it is assumed that c1 ≠ 0.


The polynomial p0 is given by

p0 = b21(2λ − 1)(2 − 2λz + z).

Thus p0 vanishes for b21 = 0. Assuming that b21 = 0 holds, the stability function reads

Φ(w, z) = w ( w − (2z²λ² − 4z²λ + z² − 4zλ + 2z + 2) / (2(1 − λz)²) ) = w (w − R2(λ, z)).

The single nonzero eigenvalue R2(λ, z) is given by the stability function of the corresponding SDIRK method with two stages and order 2. It is well known that in this case A-stability can be achieved if and only if λ = 1 ± √2/2 (see again [85]). Observe that λ = 1 − √2/2 leads to c1 = 2λ ≈ 0.586 ∈ [0, 1] such that the final method reads

M2 =
    1 − √2/2          0          |  1   1 − √2/2
    (√2−1)(2+√2)/4    1 − √2/2   |  1   (√2−1)(2+√2)/4
    (√2−1)(2+√2)/4    1 − √2/2   |  1   (√2−1)(2+√2)/4
    0                 1          |  0   0
,   c = [2 − √2, 1]ᵀ.

This scheme is A- and L-stable. The stability matrix at infinity is given by

M∞,2 = [ 0   0 ;  (−4 + 3√2)(3 + 2√2)/2   0 ] ≈ [ 0 0 ; 0.70710678 0 ].   (10.16)

Notice that M2 exhibits the FSAL property. This method could therefore be interpreted equally well as a diagonally implicit Runge-Kutta scheme with an explicit first stage. Using this slightly different point of view, the method M2 was also constructed by Kværnø in [105].
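As a quick numerical illustration (not part of the thesis), the A- and L-stability of this choice of λ can be checked by evaluating the nonzero eigenvalue R2(λ, z) from the stability function above along the imaginary axis and for z → ∞:

```python
# Stability function R2(lam, z) of the two-stage order-2 method, read off
# from Phi(w, z) = w*(w - R2(lam, z)) with b21 = 0.
def R2(lam, z):
    num = 2*z*z*lam*lam - 4*z*z*lam + z*z - 4*z*lam + 2*z + 2
    return num / (2 * (1 - lam*z)**2)

lam = 1 - 2**0.5 / 2                       # A-stable choice, c1 = 2*lam ~ 0.586
ys = [0.01, 0.1, 1.0, 10.0, 1e3, 1e6]
damping = [abs(R2(lam, 1j*y)) for y in ys]  # |R2(iy)| along the imaginary axis
```

All values in `damping` are bounded by 1 (A-stability), and |R2(λ, z)| tends to 0 for large |z| since 2λ² − 4λ + 1 = 0 for this λ (L-stability).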

It is interesting to note that a strictly lower triangular structure (10.16) for M∞,2 can be obtained by fixing A's first diagonal element only. Repeating the above computations for A = [a11 0; a21 a22] leads to the method

    a11                 0     |  1   a11
    (1 − 2a22)/(4a11)   a22   |  1   (4a11(1 − a22) + 2a22 − 1)/(4a11)
    (1 − 2a22)/(4a11)   a22   |  1   (4a11(1 − a22) + 2a22 − 1)/(4a11)
    0                   1     |  0   0
,   c∗ = [2a11, 1]ᵀ.

Straightforward computations show that the stability matrix at infinity has the form

    [ 0                                    0
      (1/4 − a11² − a22/2)/(a11² a22)      (1/2 − a22 − a11(1 − a22))/(a11 a22) ]


such that a strictly lower triangular structure can be ensured by choosing a11 = (2a22 − 1)/(2(a22 − 1)). The resulting method

M∗2(a22) =
    (2a22 − 1)/(2a22 − 2)   0     |  1   (2a22 − 1)/(2a22 − 2)
    (1 − a22)/2             a22   |  1   (1 − a22)/2
    (1 − a22)/2             a22   |  1   (1 − a22)/2
    0                       1     |  0   0
,   c∗(a22) = [ (2a22 − 1)/(a22 − 1),  1 ]ᵀ,

has Runge-Kutta stability due to

Φ∗(w, z) = w ( w − (2a22(1 + z(1 − a22)) − z − 2) / ((2a22(z − 1) − z + 2)(z a22 − 1)) ) = w (w − R∗2(z)),

but a22 is left as a free parameter. In order to check for A-stability we write R∗2(z) = P(z)/Q(z) and compute the poles by solving

Q(z) = 0   ⇔   z ∈ { 1/a22,  (2a22 − 2)/(2a22 − 1) }.

For 0 < a22 < 1/2 all poles are positive numbers such that A-stability is equivalent to requiring

E(y) = Q(iy)Q(−iy) − P(iy)P(−iy) ≥ 0   for all y ∈ R.

This mapping, the E-polynomial, is widely used e.g. in [25, 85]. The property E(y) ≥ 0 ensures that the nonzero eigenvalue satisfies |R∗2(iy)| ≤ 1 on the imaginary axis. Since all poles belong to the right half of the complex plane, the maximum principle shows that |R∗2(z)| ≤ 1 holds whenever z has negative real part. This, in turn, is nothing but A-stability.

Figure 10.4 (a): Stability behaviour |R∗2(iy)| on the imaginary axis for different a22 (a22 = 1 − √2/2, 1/10, 1/100, 1/1000).

Figure 10.4 (b): Error constant C2(a22) for the method M∗2.

Figure 10.4: The parameter a22 controls the damping behaviour of the method M∗2. For a22 → 0 the damping behaviour of the trapezoidal rule is approximated. The error constant is minimal for a22 = 1 − √2/2 where M∗2(a22) = M2.


For the special case of M∗2(a22) the E-polynomial is easy to compute,

E(y) = y⁴ a22² (2a22 − 1)².

Thus M∗2(a22) is A-stable (and hence also L-stable) for every 0 < a22 < 1/2.
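This closed form can be confirmed numerically. The sketch below (an illustration, not from the thesis) builds P and Q from the stability function Φ∗ given above and compares E(y) = Q(iy)Q(−iy) − P(iy)P(−iy) with y⁴a22²(2a22 − 1)² for a few sample values:

```python
# P and Q from R2*(z) = P(z)/Q(z) for the method M2*(a22).
def P(z, a):
    return 2*a*(1 + z*(1 - a)) - z - 2

def Q(z, a):
    return (2*a*(z - 1) - z + 2) * (z*a - 1)

def E(y, a):
    # E-polynomial; the imaginary parts cancel, so take the real part
    return (Q(1j*y, a) * Q(-1j*y, a) - P(1j*y, a) * P(-1j*y, a)).real

# residuals against the closed form y^4 * a^2 * (2a - 1)^2
checks = [E(y, a) - y**4 * a**2 * (2*a - 1)**2
          for y in (0.5, 2.0, 7.0) for a in (0.1, 0.3, 0.45)]
```

All residuals vanish up to round-off, and E(y) ≥ 0 for 0 < a22 < 1/2 is exactly the A-stability criterion stated above.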

The significance of the parameter a22 is visualised in Figure 10.4 (a). Varying a22 allows to control the damping behaviour of the method. For a22 → 0 the damping behaviour of the trapezoidal rule is approximated. In fact, for a22 = 0 the method

M∗2(0) =
    1/2   0  |  1   1/2
    1/2   0  |  1   1/2
    1/2   0  |  1   1/2
    0     1  |  0   0
,   c∗(0) = [1, 1]ᵀ,

is nothing but a complicated representation of the trapezoidal rule itself.

For a22 = 1 − √2/2 the methods M∗2(1 − √2/2) = M2 coincide.
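The damping control can also be seen directly from the nonzero eigenvalue. The sketch below (illustrative; the rational expression is the one from Φ∗ above) evaluates |R∗2(iy)| far out on the imaginary axis: the L-stable choice a22 = 1 − √2/2 damps high frequencies away, while a22 = 0 reproduces the undamped trapezoidal rule.

```python
# Nonzero eigenvalue R2*(z) of M2*(a22), from Phi*(w,z) = w*(w - R2*(z)).
def R2star(z, a):
    num = 2*a*(1 + z*(1 - a)) - z - 2
    den = (2*a*(z - 1) - z + 2) * (z*a - 1)
    return num / den

z = 1e6j                                   # far out on the imaginary axis
strong = abs(R2star(z, 1 - 2**0.5 / 2))    # L-stable choice: ~ 0
weak = abs(R2star(z, 0.0))                 # trapezoidal rule: ~ 1
```

Intermediate values of a22 interpolate between these two extremes, which is exactly the behaviour shown in Figure 10.4 (a).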

Example 10.1. In Chapter 2 the damping behaviour of BDF methods was investigated. Recall from Example 2.1 on page 27 that the BDF2 scheme showed strong artificial damping for the circuit in Figure 10.5 (a) below. Plotting the node potential u3 and the current through the voltage source iV in phase space this becomes clearly visible as the numerical solution spirals inwards. The general linear method M∗2, by contrast, shows little numerical damping for a22 = 1 − √2/2. For a22 = 1/100 the solution is almost a perfect circle in phase space.

Figure 10.5 (a): The RLC series circuit from Example 2.1 (node potentials u1, u2, u3, elements C, L, G, source v(t) and source current iV). Panels (b) exact, (c) BDF2, (d) a22 = 1 − √2/2 and (e) a22 = 1/100 show iV over u3 in phase space.

Figure 10.5: For a small resistance R = 1/G = 1 mΩ the exact solution is nearly undamped such that u3 and iV trace out a circle in phase space (b). The BDF2 (c) shows strong numerical damping while the general linear method, (d) and (e), allows a control of the damping behaviour via the parameter a22.


The possible control of the damping behaviour comes at the price of the diagonal elements a11, a22 not being equal. This is far from optimal as different iteration matrices will be necessary for each of the two stages. The additional evaluations of the Jacobian may slow down the simulation significantly.

For electrical circuit simulation this is not a severe restriction. Although function evaluations are extremely expensive, information about the Jacobian can be obtained at little additional cost. ROW methods have been adapted to a cheap Jacobian by Günther [82]. For general linear methods exploiting cheap Jacobians may lead to an adaptive control of the damping behaviour. However, these investigations are beyond the scope of this thesis and, for the time being, it is recommended to use a22 = 1 − √2/2 such that M∗2(a22) = M2 becomes singly diagonally implicit and the error constant is minimised (see Figure 10.4 (b)).

Order p = 3

The generic stiffly accurate method with s = r = 3 is given by the tableau

M3 = [A U; B V] =
    λ    0    0   |  u11  u12  u13
    a21  λ    0   |  u21  u22  u23
    a31  a32  λ   |  u31  u32  u33
    a31  a32  λ   |  u31  u32  u33
    b21  b22  b23 |  v21  v22  v23
    b31  b32  b33 |  v31  v32  v33
,   c = [c1, c2, 1]ᵀ.   (10.17)

Solving the order and stage order conditions

UW = C − ACK,   VW = WE − BCK

from Theorem 7.6 shows that for s = r = p = q = 3 and c1 ≠ 0 the method M3 has the form

A = [ λ                               0     0
      c2²(c2 − 3λ)/(27λ²)             λ     0
      (1 − 3a32c2² − 3λ)/(27λ²)       a32   λ ],   c = [3λ, c2, 1]ᵀ,

U = [ 1   2λ                                                    (3/2)λ²
      1   (27λ²c2 − c2³ + 3λc2² − 27λ³)/(27λ²)                  c2(15λc2 − 2c2² − 18λ²)/(18λ)
      1   (27λ² − 1 + 3a32c2² + 3λ − 27a32λ² − 27λ³)/(27λ²)     (15λ − 2 + 6a32c2² − 18a32c2λ − 18λ²)/(18λ) ],

B = [ (1 − 3a32c2² − 3λ)/(27λ²)   a32   λ
      b21                         b22   1 − 9b21λ² − b22c2²
      b31                         b32   2 − 9b31λ² − b32c2² ],

V = [ 1   (27λ² − 1 + 3a32c2² + 3λ − 27a32λ² − 27λ³)/(27λ²)     (15λ − 2 + 6a32c2² − 18a32c2λ − 18λ²)/(18λ)
      0   b22c2² − b21 − b22 + 9b21λ²                           b22c2² − 3b21λ − b22c2 + 9b21λ²
      0   b32c2² − b31 − b32 − 2 + 9b31λ²                       b32c2² − 1 − 3b31λ − b32c2 + 9b31λ² ].


Observe that the fourth column of the stage order conditions UW = C − ACK reads 0 = (1/6)c³ − (1/2)Ac². Thus stage order q = 3 requires

0 = (1/3)c³ − Ac² = [ (1/3)c1³ − λc1²
                      (1/3)c2³ − a21c1² − λc2²
                      1/3 − a31c1² − a32c2² − λ ]

to hold. For c1 ≠ 0 the first component implies c1 = 3λ as is already indicated by the above formulae.

On the other hand, if M3 had Runge-Kutta stability,

Φ(w, z) = w² (w − R3(λ, z)),

the single nonzero eigenvalue R3(λ, z) would be given by the stability function of the SDIRK method with s = p = 3. Thus, A-stability would be possible only for λ = 0.43586652, which is a root of λ³ − 3λ² + (3/2)λ − 1/6 = 0 (see [85]).

As a consequence, there is no A-stable singly diagonally implicit general linear method in Nordsieck form having stiff accuracy, s = r = p = q = 3 and satisfying 0 < c1 < 1.

Sacrificing A-stability in favour of a value c1 ∈ [0, 1] will inevitably lead to a stability matrix M∞ having a nonzero eigenvalue. Hence, nilpotency of M∞ would be impossible for 0 < c1 < 1, but this was one of the requirements (P1) – (P6) on page 182.

The difficulty of finding L-stable methods with s = r + 1 = p = q = 3 was already observed in [33]. Although Butcher and Jackiewicz succeed in constructing A- and L-stable Dimsims with s = r = p = q = 3 in [33], these methods are neither in Nordsieck form nor stiffly accurate. In [27] it is conjectured that no A-stable type-4 Dimsim⁵ with M∞ = 0 exists for s = r = p = q = 3.

In order to cope with these difficulties, we will drop the requirement of Runge-Kutta stability from now on.

In view of Theorem 9.5 good stability at zero and at infinity will be ensured first. The remaining free parameters can be used to achieve A-stability, or at least large regions of A(α)-stability.

Recall from Chapter 7 that a general linear method M = [A, U, B, V] applied to the ordinary differential equation y′ = f(y, t) reads

Y = hn A F(Y) + U y[n],   y[n+1] = hn B F(Y) + V y[n],   (10.18)

where F(Y) = [f(Y1, tn + c1hn)ᵀ · · · f(Ys, tn + cshn)ᵀ]ᵀ. Using the scheme (10.18) the method proceeds from tn to tn+1 = tn + hn using a stepsize hn. If the next step is taken with a different stepsize hn+1 ≠ hn, the output vector y[n+1] has to be modified. For methods in Nordsieck form the appropriate modification is given by simply multiplying the Nordsieck vector with the diagonal matrix D(σn) = diag(1, σn, . . . , σn^(r−1)),

y[n+1] ≈ [ y(tn+1),  hn y′(tn+1),  . . . ,  hn^(r−1) y^(r−1)(tn+1) ]ᵀ
   ⇒   D(σn) y[n+1] ≈ [ y(tn+1),  hn+1 y′(tn+1),  . . . ,  hn+1^(r−1) y^(r−1)(tn+1) ]ᵀ.

Here, σn = hn+1/hn denotes the stepsize ratio⁶. In case of the linear scalar test equation y′ = λy, (10.18) turns into

D(σn) y[n+1] = D(σn) M(λhn) y[n] = ∏_{i=0}^{n} D(σi) M(λhi) y[0].

⁵ A Dimsim is said to be of type 4 if A = λI.
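As a small illustration (not from the thesis' codes), the rescaling by D(σn) can be checked on the exact Nordsieck vector of y(t) = eᵗ, for which hᵏ y^(k)(t) scales exactly with σᵏ:

```python
import math

def nordsieck_exp(t, h, r):
    # Nordsieck vector [y, h*y', ..., h^(r-1)*y^(r-1)] of y(t) = exp(t)
    return [h**k * math.exp(t) for k in range(r)]

def rescale(v, sigma):
    # multiply by D(sigma) = diag(1, sigma, ..., sigma^(r-1))
    return [sigma**k * x for k, x in enumerate(v)]

old = nordsieck_exp(0.3, 0.1, 4)   # Nordsieck vector scaled with h_n = 0.1
new = rescale(old, 2.0)            # switch to h_{n+1} = 0.2, sigma = 2
```

The rescaled vector `new` agrees componentwise with the Nordsieck vector built directly for the new stepsize 0.2.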

In a variable stepsize implementation, stability is thus related to products of matrices M(σi, zi) = D(σi)M(zi). To ensure that y[n+1] remains bounded as n → ∞, it is not sufficient to require a spectral radius ρ(M(σi, zi)) < 1 for each factor, but we have to guarantee a spectral radius less than 1 for the product.

This goal can be achieved by limiting the stepsize ratio σ. Corresponding results for low order BDF schemes and a two-stage order 1 general linear method were derived by Butcher and Heard [31] using appropriate norms. Guglielmi and Zennaro use the theory of a joint spectral radius for a family of matrices and the concept of polytope norms [77, 76]. Results for two-step W-methods are given in [95].

Butcher and Jackiewicz use a different approach in [38]. They show that general linear methods being unconditionally zero-stable can be constructed using special error estimators and appropriate modifications of the Nordsieck vector.

For the order-3 method (10.17), by contrast, we will try to guarantee stability at zero and at infinity by construction. If the matrices V and M∞ were of the form

V = [ 1 v12 v13 ; 0 0 0 ; 0 v32 0 ],   M∞ = [ 0 0 0 ; m21 0 0 ; m31 m32 0 ],   (10.19)

then M(σ, 0) = D(σ)V and M(σ, ∞) = D(σ)M∞ would have the same structure. Thus, multiplication by D(σ) does not affect stability since the special zero-pattern ensures that the spectra σ(V) = {1, 0} and σ(M∞) = {0} do not depend on the particular values of vij, mij.

⁶ A more refined adjustment of the Nordsieck vector that preserves the first order error terms will be discussed in Chapter 11.


In order for the method M3 to be of type (10.19), the system

v22 = b22c2² − b21 − b22 + 9b21λ² = 0,
v23 = b22c2² − 3b21λ − b22c2 + 9b21λ² = 0,
v33 = b32c2² − 1 − 3b31λ − b32c2 + 9b31λ² = 0

is solved first. One particularly simple solution is given by

b21 = 0,   b22 = 0,   b32 = (−1 − 3b31λ + 9b31λ²)/(c2(1 − c2)),   (10.20)

where it is assumed that c2 ≠ 0 and c2 ≠ 1. Observe that (10.20) ensures that M3 has the FSAL property since e2ᵀB = [0 0 1] and e2ᵀV = [0 0 0].

For the structure (10.19) it remains to compute M∞ = (mij) and to solve the system m22 = 0, m23 = 0, m33 = 0. The first row [m11 m12 m13] = 0 vanishes due to stiff accuracy. Using e.g. Maple the solution is found to be

c2 = (1 − 6λ + 6λ²)/(3λ² − 5λ + 1),
a32 = λ(3λ² − 5λ + 1)³ / ((1 − 9λ + 21λ² − 9λ³)(1 − 6λ + 6λ²)),
b31 = (1 − 18λ + 120λ² − 366λ³ + 510λ⁴ − 288λ⁵ + 54λ⁶) / (3λ(6λ − 1)(3λ − 1)(3λ² − 6λ + 1)(3λ³ − 9λ² + 6λ − 1))

and the diagonal element λ is left as a free parameter. Of course, we have to require λ ∉ {0, 1/6, 1/3, 1 ± √6/3, 1/2 ± √3/6, 5/6 ± √13/6} and that λ is not a root of 3λ³ − 9λ² + 6λ − 1 = 0. This avoids a zero denominator. The coefficients of the method M3(λ) are reproduced in Table 10.3.

Due to the special structure (10.19) for the matrices V and M∞ this method is stable at zero and at infinity for every stepsize pattern (σn)n∈N. Hence we have ensured unconditional stability at zero and at infinity by construction.

Figure 10.6 (a) suggests to choose 0 < λ < 1 − √6/3 such that 0 < c1 < c2 < 1. Unfortunately, the corresponding methods are not A-stable.

Information on possible angles for A(α)-stability is given in Figure 10.6 (b). For each λ ∈ [0, 1] the method M3(λ) has been computed. The angle of A(α)-stability was computed from the corresponding stability region. This angle α(λ) is plotted in Figure 10.6 (b) as a function of the diagonal element λ. M3(λ) is A-stable provided that α(λ) = 90°, i.e. for λ ≥ λ∗ ≈ 0.3558.

For λ ≥ λ∗ the abscissa c1 lies outside the unit interval [0, 1]. As we chose the condition c1, c2 ∈ [0, 1] as a design criterion, the optimal angle of A(α)-stability is achieved for λ ≈ 0.158 with α ≈ 75.8053°. For λ = 4/25 = 0.16 and α(4/25) ≈ 73.7535° much simpler rational coefficients can be obtained (see Table 10.4). The entry of largest magnitude is b32 ≈ −13.4165 and the error constant is given by C3(4/25) = −35404169/6805687500 ≈ −0.0052.


A = [ λ                                                                              0                                                                     0
      −(1/27)(6λ² − 6λ + 1)²(3λ − 1)(3λ² − 6λ + 1) / (λ²(3λ² − 5λ + 1)³)             λ                                                                     0
      (1/27)(6λ − 1)(9λ⁴ − 27λ³ + 27λ² − 9λ + 1) / (λ²(3λ − 1)(3λ² − 6λ + 1))        −λ(3λ² − 5λ + 1)³ / ((3λ − 1)(3λ² − 6λ + 1)(6λ² − 6λ + 1))            λ ],

U = [ 1   2λ                                                                                                                          (3/2)λ²
      1   −(1/27)(1 − 21λ + 150λ² − 306λ³ − 1224λ⁴ + 7452λ⁵ − 14526λ⁶ + 12798λ⁷ − 5103λ⁸ + 729λ⁹) / (λ²(3λ² − 5λ + 1)³)               −(1/18)(6λ² − 6λ + 1)(9λ³ − 27λ² + 15λ − 2)(18λ³ − 36λ² + 12λ − 1) / (λ(3λ² − 5λ + 1)³)
      1   (1/27)(81λ⁶ − 378λ⁵ + 459λ⁴ − 144λ³ − 21λ² + 12λ − 1) / (λ²(6λ² − 6λ + 1))                                                  (1/18)(18λ³ − 48λ² + 21λ − 2) / λ ],

B = [ (1/27)(6λ − 1)(9λ⁴ − 27λ³ + 27λ² − 9λ + 1) / (λ²(3λ − 1)(3λ² − 6λ + 1))        −λ(3λ² − 5λ + 1)³ / ((3λ − 1)(3λ² − 6λ + 1)(6λ² − 6λ + 1))            λ
      0                                                                              0                                                                     1
      (1/3)(1 − 18λ + 120λ² − 366λ³ + 510λ⁴ − 288λ⁵ + 54λ⁶) / (λ(6λ − 1)(3λ − 1)(3λ² − 6λ + 1)(3λ³ − 9λ² + 6λ − 1))
                                                                                     3λ²(3λ² − 5λ + 1)³ / ((6λ² − 6λ + 1)(6λ − 1)(3λ − 1)(3λ² − 6λ + 1)(3λ³ − 9λ² + 6λ − 1))
                                                                                                                                                           (18λ⁴ − 66λ³ + 66λ² − 21λ + 2) / ((3λ³ − 9λ² + 6λ − 1)(6λ − 1)) ],

V = [ 1   (1/27)(81λ⁶ − 378λ⁵ + 459λ⁴ − 144λ³ − 21λ² + 12λ − 1) / (λ²(6λ² − 6λ + 1))                                                  (1/18)(18λ³ − 48λ² + 21λ − 2) / λ
      0   0                                                                                                                           0
      0   −(1/3)(36λ⁵ − 141λ⁴ + 168λ³ − 78λ² + 15λ − 1)(3λ − 1)² / (λ(6λ − 1)(6λ² − 6λ + 1)(3λ³ − 9λ² + 6λ − 1))                      0 ],

c = [ 3λ,  (6λ² − 6λ + 1)/(3λ² − 5λ + 1),  1 ]ᵀ.

Table 10.3: The method M3(λ) satisfies s = r = p = q = 3 and is unconditionally stable at zero and at infinity.


Although the region of A(α)-stability is considerably smaller than for the BDF scheme with the same order (see Table 2.2 on page 32), we will see in Chapter 12 that the family from Table 10.4 performs surprisingly well for most test examples.

Figure 10.6 (a): Values for the abscissae ci at different λ; the vertical mark is at λ = 1 − √6/3.

Figure 10.6 (b): Possible angles for A(α)-stability; the optimum lies at λ ≈ 0.158.

Figure 10.6: Choosing λ for the method M3. Figure 10.6 (a) shows that 0 < λ < 1 − √6/3 guarantees 0 < c1 < c2 < 1. For λ > 1/2 − √3/6 ≈ 0.211 at least one ci lies outside the unit interval. Figure 10.6 (b) contains an enlarged copy of the left hand picture. Additionally the angle α is plotted (rightmost axis) indicating M3's A(α)-stability.

[A U; B V] =

(a) Order p = q = 1, c = [1]:

    1 | 1
    1 | 1

(b) Order p = q = 2, c = [2 − √2, 1]ᵀ:

    1 − √2/2          0          |  1   1 − √2/2
    (√2−1)(2+√2)/4    1 − √2/2   |  1   (√2−1)(2+√2)/4
    (√2−1)(2+√2)/4    1 − √2/2   |  1   (√2−1)(2+√2)/4
    0                 1          |  0   0

(c) Order p = q = 3, λ = 4/25, c = [12/25, 121/173, 1]ᵀ:

    4/25                  0                       0           |  1   8/25                     24/625
    347357725/2236773744  4/25                    0           |  1   21480179099/55919343600  270961229/4659945300
    57229/409968          20710868/71768125       4/25        |  1   13454351/32670000        1601/22500
    57229/409968          20710868/71768125       4/25        |  1   13454351/32670000        1601/22500
    0                     0                       1           |  0   0                        0
    334994375/45927804    −6213260400/463105357   27758/4033  |  0   −4451291/5855916         0

Table 10.4: A family of Dimsims with s = r = q = p.
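The structural claim (10.19) can be checked directly for the order 3 method: M∞ = V − BA⁻¹U must have zeros on and above the diagonal. The sketch below (an illustration, not part of the thesis' codes) performs this check in exact rational arithmetic using Python's fractions module with the tableau (c) above:

```python
from fractions import Fraction as F

# coefficients of the order 3 method (lambda = 4/25) from Table 10.4 (c)
A = [[F(4, 25), 0, 0],
     [F(347357725, 2236773744), F(4, 25), 0],
     [F(57229, 409968), F(20710868, 71768125), F(4, 25)]]
U = [[1, F(8, 25), F(24, 625)],
     [1, F(21480179099, 55919343600), F(270961229, 4659945300)],
     [1, F(13454351, 32670000), F(1601, 22500)]]
B = [A[2],                        # stiff accuracy: first row of B = last row of A
     [0, 0, 1],                   # FSAL stage
     [F(334994375, 45927804), F(-6213260400, 463105357), F(27758, 4033)]]
V = [U[2],                        # stiff accuracy: first row of V = last row of U
     [0, 0, 0],
     [0, F(-4451291, 5855916), 0]]

# X = A^{-1} U by forward substitution (A is lower triangular)
X = [[0] * 3 for _ in range(3)]
for k in range(3):
    for i in range(3):
        X[i][k] = (U[i][k] - sum(A[i][j] * X[j][k] for j in range(i))) / A[i][i]

# stability matrix at infinity: M_inf = V - B * X
Minf = [[V[i][k] - sum(B[i][j] * X[j][k] for j in range(3)) for k in range(3)]
        for i in range(3)]
```

All entries of Minf on and above the diagonal vanish, so M∞ is nilpotent with M∞³ = 0; the strictly lower triangular entries (e.g. m21, m31) are nonzero.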


Larger regions of A(α)-stability are possible by weakening the requirement (10.19). As an example assume that V and M∞ satisfy

V = [ 1 v12 v13 ; 0 0 0 ; 0 v32 0 ],   M∞² = [ 0 0 0 ; m21 0 0 ; m31 m32 0 ],   (10.21)

where M∞², in contrast to M∞, is required to have a strictly lower triangular form. Solving m22 = 0, m23 = 0, m33 = 0 yields many different solutions. Unfortunately it was not possible to find A-stable methods within these families. One example of a method with a large region of A(α)-stability is given in Table 10.5. There the angle α satisfies α ≈ 88.1° such that the method is more stable than the BDF3 scheme. Unconditional stability at infinity is, however, lost.

A = [ 7/40                                      0                                         0
      343/1458                                  7/40                                      0
      5225432851/18627840000 + β/126720000      2346585921/19317760000 − 9β/2759680000    7/40 ],

U = [ 1   7/20                                          147/3200
      1   11851/29160                                   6517/97200
      1   44126632861/104315904000 − 23β/4967424000     159211901/2027520000 − β/675840000 ],

B = [ 5225432851/18627840000 + β/126720000      2346585921/19317760000 − 9β/2759680000    7/40
      0                                         0                                         1
      3022207/670320 + β/670320                 −17130621/1207360 − 3β/1207360            8202367/802560 + β/802560 ],

V = [ 1   44126632861/104315904000 − 23β/4967424000     159211901/2027520000 − β/675840000
      0   0                                             0
      0   −2135167/3951360 − β/3951360                  0 ].

Table 10.5: A Dimsim with s = r = p = q = 3 and A(α)-stability for α ≈ 88.1°. In the tableau β represents √11831751737089 and the vector c is given by c = [21/40, 49/60, 1]ᵀ.

10.3 Summary

In the literature [23, 28, 33, 34, 93, 158] general linear methods have been constructed mainly for (stiff) ordinary differential equations. The investigations of Theorem 9.5 and Corollary 9.6 showed that differential algebraic equations can be treated as well, but additional requirements need to be satisfied.


In particular additional order conditions and stronger stability requirements have to be met. It was thus decided to find methods M having the following properties:

(P1) M uses s = r stages to achieve order p and stage order q = p for ordinary differential equations.

(P2) A has a diagonally implicit structure (10.2) with aii = λ for i = 1, . . . , s.

(P3) V is power bounded and M∞ is nilpotent⁷.

(P4) The abscissae ci satisfy ci ∈ [0, 1] and ci ≠ cj for i ≠ j.

(P5) M is stiffly accurate, i.e. cs = 1 and the last row of [A U] coincides with the first row of [B V].

(P6) M is given in Nordsieck form such that the stepsize can be varied easily.

Three families of methods have been considered so far. The methods from Table 10.1 constructed by Butcher and Podhaisky and those in Table 10.2 with M∞,p = 0 use s = p + 1 stages while the Dimsims constructed in the previous section have only s = p stages.

The methods in Table 10.1 were constructed using the algorithm of Wright [158] assuming BA = XB and BU ≡ XV − VX. Thus these methods have inherent Runge-Kutta stability (IRKS). One of the many advantages of this approach is the fact that only linear operations are necessary to derive a method.

In an attempt to improve stability at infinity, the methods from Table 10.2 were constructed such that M∞,p = 0. For p = 1 and p = 2 IRKS was still possible. For p = 3 inherent Runge-Kutta stability could not be realised any more, but Runge-Kutta stability was still possible. Starting from the stability function

Φ(w, z) = (p0 + p1w + p2w² + p3w³ + (1 − λz)⁴w⁴) · 1/(1 − λz)⁴

the Fortran routine hybrd from Minpack [125] was used to solve the system p0 = 0, p1 = 0, p2 = 0 numerically. Thus the resulting method has, once again, only one nonzero eigenvalue.

Solving the complicated nonlinear system leads only to isolated methods. The IRKS approach of Wright, by contrast, yields families of methods.

The requirement of having M∞,p = 0 is a strong restriction. Some of the method's coefficients seem to get a bit too large. Also, the methods of Butcher and Podhaisky have smaller error constants. We have to check numerically whether there is any benefit due to the improved stability at infinity. The corresponding numerical experiments will be performed in Chapter 12.

In Section 10.2 Dimsims with s = r = p = q have been considered. These methods seem of particular interest as only s = p stages are used per step. The reduced computational work may lead to a clear increase in performance.

⁷ Recall from Lemma 8.30 that M∞^k0 = 0 for some k0 ≥ 1 ensures that the stage derivatives are calculated with the desired accuracy.


Figure 10.7 (a): Stability regions of the methods from Table 10.1 (IRKS), orders p = 1, 2, 3.

Figure 10.7 (b): Stability regions of the methods from Table 10.2 (M∞,p = 0), orders p = 1, 2, 3.

Figure 10.7 (c): Stability regions of the methods from Table 10.4 (s = p), orders p = 1, 2, 3. The method from Table 10.5 is plotted as a dashed line.

Figure 10.7: Stability regions of the three families of methods constructed in the previous sections. The stable region is always given by the outside of the encircled areas.

However, compared to s = p + 1 fewer parameters are available for satisfying the requirements (P1) – (P6).

For p = 1 the corresponding method is uniquely determined. Forcing Runge-Kutta stability for p = 2 by setting b21 = 0 leads to an order 2 method that can be interpreted as a singly diagonally implicit Runge-Kutta scheme with an explicit first stage.

If the entries a11 and a22 on the diagonal are allowed to take different values, a family M∗2(a22) of A- and L-stable diagonally implicit general linear methods was constructed. Varying a22 ∈ [0, 1/2] this family allows a control of the method's damping behaviour.

Unfortunately the search for A-stable methods with p = 3 was not successful. If unconditional stability at zero and at infinity is guaranteed by construction, the maximal angle for A(α)-stability is α ≈ 75.8053° (see Figure 10.7 (c)). There are, however, methods that are more stable than their BDF-counterparts. The method from Table 10.5 with α ≈ 88.1° served as an example.

The methods with s = r = p = q from Table 10.4 will be the basis of the variable-order variable-stepsize implementation Glimda. The acronym 'Glimda' is used to abbreviate General LInear Methods for Differential Algebraic equations.

Although the order 3 method is not A-stable, there is some hope for a competitive code as only s = p stages are used per step. This code will also benefit from the unconditional stability of the order 3 method.


Since the methods from Tables 10.1 and 10.2 use an incremented number of stages, the corresponding code will be denoted as Glimda++. The family of methods being used for a particular computation can be chosen easily by changing the method's coefficients.

In Chapter 12 a large number of numerical experiments will be performed in order to assess the potential of these different families of methods. The codes based on general linear methods will not only be compared with each other but also with Dassl and Radau. It will turn out that general linear methods are competitive and often superior to classical linear multistep or Runge-Kutta codes. In particular Glimda will prove a tough competitor.

Before proceeding to the numerical experiments, implementation issues such as error estimation, stepsize selection and strategies for changing the order have to be addressed. These considerations will be the topic of the next chapter.


11 Implementation Issues

Implementation specifics for general linear methods (GLMs) have been discussed in several papers [26, 37, 40, 94], mainly by Butcher and Jackiewicz. The focus is always on (stiff) ordinary differential equations

y′ = f(y, t), t ∈ [t0, tend], y(t0) = y0. (11.1)

For stiff equations singly diagonally implicit methods with high stage order are most appropriate. Recall that in the stiff regime implicit methods have to be used, but due to the diagonally implicit structure the stages can be evaluated sequentially. Thus the computational costs are considerably reduced.

The resulting nonlinear systems are solved using Newton's method. General linear schemes allow the construction of highly accurate predictors for the initial guess. The predictors are based on incoming approximations and on stage derivatives already evaluated within the current step. Thus the initial guess can be obtained at no additional cost. A case study for possible predictors is given in [93].

A distinct feature of general linear methods is the possibility to have diagonally implicit schemes with high stage order. High stage order is not only exploited in deriving predictors for Newton's method but also, as we will see later, for constructing error estimates. In the context of stiff equations and DAEs it is of particular importance to recall that high stage order avoids the order reduction phenomenon discussed in Section 2.2. Finally, dense output comes practically for free (see again [93]).

Solving (11.1) numerically requires discretisation. Starting from the initial condition y(t0) = y0, approximations yn ≈ y(tn) to the exact solution are computed on the nonuniform grid

t0 < t1 < t2 < · · · < tN = tend.

After computing yn at timepoint tn, a stepsize hn has to be predicted for the next step from tn to tn+1 = tn + hn. The suggestion for the stepsize hn is based


on an estimate of the local truncation error in step number n. In [29] the error estimator is obtained from information computed in the previous two steps. This is not always convenient in a variable order implementation. Thus, we will use ideas from [37, 40] where the error estimate is based on information from the current step only. As indicated above, error estimation benefits from high stage order. Methods with s = p + 1 stages will have clear advantages over those with s = p stages.

Finally, an efficient implementation has to be capable of changing the order. Low order methods are preferred for loose tolerances. Tight tolerances and very smooth solutions are most efficiently dealt with by higher order schemes. Adaptivity, i.e. monitoring how the solution evolves and changing the order accordingly, is the key to an efficient simulation.

Variable-order implementations for general linear methods are discussed in [30, 40, 94]. While [30, 94] use ratios of appropriate error estimates to decide on the new order, the approach taken in [40] is more elegant. There it is shown how to obtain an estimate for h^{p+2} y^{(p+2)}(t_n). As the method currently used is assumed to have order p, this allows one to estimate the local truncation error for the method of order p + 1.

For methods with s = p stages this approach is not feasible, but techniques of Hairer and Wanner [86] can be adapted to general linear methods. The decision on the new order will be based on the convergence rate of Newton's method.

These topics – Newton's method, error estimation and order control – will now be addressed in more detail. We will restrict attention to methods in Nordsieck form where the input vector

y^{[n]} ≈ ( y(t_n), h y′(t_n), . . . , h^{r−1} y^{(r−1)}(t_n) )ᵀ

contains approximations to scaled derivatives of the exact solution.

Recall that we are concerned with the numerical solution of differential algebraic equations

A(t) [ d(x(t), t) ]′ + b(x(t), t) = 0.

The material presented here will be based on the more general structure

f( [ q(x(t), t) ]′, x(t), t ) = 0 (11.2)

such that the codes Glimda and Glimda++ will eventually be capable of solving DAEs of the form (11.2).


11.1 Newton Iteration

Let M = [A, U, B, V] be a general linear method with the matrix A being nonsingular. Similar to the approach taken in Chapter 9 the method M can be applied to the DAE (11.2) by introducing internal stages X_i and stage derivatives Q′_i representing X_i ≈ x(t_n + c_i h) and Q′_i ≈ d/dt q(x(t), t) |_{t = t_n + c_i h}, respectively, for i = 1, . . . , s.

X_i and Q′_i are related by the differential equation,

f( Q′_i, X_i, t_n + c_i h ) = 0,   i = 1, . . . , s, (11.3a)
Q(X) = h A Q′ + U q^{[n]}, (11.3b)
q^{[n+1]} = h B Q′ + V q^{[n]}. (11.3c)

In (11.3) the following abbreviations have been used:

X = ( X_1, . . . , X_s )ᵀ,   Q′ = ( Q′_1, . . . , Q′_s )ᵀ,   Q(X) = ( q(X_1, t_n + c_1 h), . . . , q(X_s, t_n + c_s h) )ᵀ.

The input vectors q^{[n]}_i ≈ h^{i−1} d^{i−1}/dt^{i−1} q(x(t_n), t_n), i = 1, . . . , r, approximate scaled derivatives of the function q appearing in (11.2). Observe that only information about q(x, t) is propagated from step to step. As a consequence, the numerical solution x_n might not be recoverable from q^{[n]}. For stiffly accurate methods this is no restriction at all since the numerical result in step number n coincides with the last stage, i.e. x_n = X_s.

Observe that q^{[n]} ∈ R^{r·l} for

f : R^l × R^m × R → R^m.

In general the relation l ≤ m holds, such that for l < m the memory required for storing the Nordsieck vector is lower as compared to the case of ordinary differential equations, where q^{[n]} satisfies q^{[n]} ∈ R^{r·m}.

Given that the matrix A has singly diagonally implicit structure,

A = [ λ 0 ··· 0 0 ; a_{21} λ ··· 0 0 ; ⋮ ⋮ ⋱ ⋮ ⋮ ; a_{s−1,1} a_{s−1,2} ··· λ 0 ; a_{s1} a_{s2} ··· a_{s,s−1} λ ],

the relation (11.3b) can be written as

q(X_i, t_n + c_i h) = h λ Q′_i + ω_i (11.4)


for i = 1, . . . , s, where, at stage number i, the vector

ω_i = h Σ_{j=1}^{i−1} a_{ij} Q′_j + Σ_{j=1}^{r} u_{ij} q^{[n]}_j

is already known from previous results. Thus, in order to determine the value of X_i we need to solve the nonlinear system

F_i(X_i) = h λ f( (1/(hλ)) [ q(X_i, t_n + c_i h) − ω_i ], X_i, t_n + c_i h ) = 0.

Starting from an initial guess X^0_i, Newton's method can be applied such that

X^{k+1}_i = X^k_i + ∆X^k_i   with   J_i(X^k_i) ∆X^k_i = −F_i(X^k_i). (11.5)

The Jacobian matrix

J_i(X^k_i) = d/dX_i F_i(X^k_i) = (∂f/∂q)(∂q/∂x) + h λ (∂f/∂x)

is computed using difference approximations. Of course, J_i(X^k_i) can be assembled analytically, given that the user provides the functions ∂f/∂q, ∂q/∂x and ∂f/∂x.

Notice that the stages X_i can indeed be evaluated sequentially. Hence, in order to determine the stages from (11.3), s nonlinear systems of dimension m have to be solved. The integer m denotes the problem size.

The diagonal element λ stays the same for every stage X_i. Hence there is some hope that the same Jacobian J_1(X^k_1) can be used for all stages. The Jacobian information may even be kept constant over several steps. In case of convergence problems the Jacobian is updated as needed. Full details on the iteration scheme are given in [93].

For solving the linear system (11.5), the Jacobian J_i(X^k_i) = LU is decomposed into its LU factors using dgetrf from Lapack [109]. The Newton process is stopped provided that the residuum F_i(X^k_i) and the correction ∆X^k_i satisfy the mixed criterion

‖F_i(X^k_i)‖_sc ≤ tolf,   ‖∆X^k_i‖_sc ≤ tolx,

where the norm ‖·‖_sc is defined by

‖η‖_sc = max_{j=1,...,m} ( |η_j| / (atol_j + rtol_j · |x_{n,j}|) ).

The vectors atol and rtol are given by the user and contain the absolute and relative accuracy requirements, respectively. The numerical result of the previous step from t_{n−1} to t_n is denoted by x_n ∈ R^m. This vector is used as a reference when assessing the relative accuracy. The tolerances tolf and tolx can be chosen by the user. Default values are tolf = tolx = 1/10.
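The stage iteration and its stopping test can be sketched in a few lines. The names `scaled_norm` and `newton_stage` are illustrative and not taken from the Glimda sources; for brevity the sketch calls numpy's dense solver at every iteration instead of factoring the Jacobian once with dgetrf and reusing the LU factors:

```python
import numpy as np

def scaled_norm(eta, x_ref, atol, rtol):
    # mixed norm: max_j |eta_j| / (atol_j + rtol_j * |x_ref_j|)
    return float(np.max(np.abs(eta) / (atol + rtol * np.abs(x_ref))))

def newton_stage(F, J, x0, x_ref, atol, rtol, tolf=0.1, tolx=0.1, kmax=10):
    # Newton iteration (11.5) for one stage: solve J(x) dx = -F(x).
    # A production code would factor J once (dgetrf) and reuse it.
    x = np.array(x0, dtype=float)
    for _ in range(kmax):
        r = F(x)
        dx = np.linalg.solve(J(x), -r)
        x = x + dx
        if (scaled_norm(r, x_ref, atol, rtol) <= tolf
                and scaled_norm(dx, x_ref, atol, rtol) <= tolx):
            return x
    raise RuntimeError("Newton iteration did not converge")
```

With a frozen Jacobian, the solve line would reuse the LU factors of J_1 and update them only on convergence problems, as described above.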


It remains to choose an appropriate initial guess X^0_i for the Newton iteration. Many different approaches are discussed in [93]. These ideas use quantities based on the Nordsieck vector in order to predict a stage value. In case of the DAE (11.2) this leads to an approximation Q^0_i ≈ q( x(t_n + c_i h), t_n + c_i h ), but it was mentioned earlier that X^0_i cannot be computed from Q^0_i in general. A similar remark applies when using explicit methods of order p to compute a prediction. Hence a different approach is required.

Extensive numerical tests have shown that extrapolation based on previous stage values gives satisfactory results. Let X be a queue of κ past (accepted) stages and corresponding timepoints, i.e.

X = [ (X_1, t_1), (X_2, t_2), . . . , (X_κ, t_κ) ],

and let P denote the unique polynomial satisfying P(t_j) = X_j for j = 1, . . . , κ and P(t_n + c_j h) = X_j for j = 1, . . . , i − 1. A prediction X^0_i is found by evaluating X^0_i = P(t_n + c_i h). After completing step number n, the κ most recent values from the list

[ (X_1, t_1), (X_2, t_2), . . . , (X_κ, t_κ), (X_1, t_n + c_1 h), . . . , (X_s, t_n + c_s h) ]

are saved in X for use in step number n + 1. For extrapolation Neville's algorithm from [130] is used. Obviously, the development of more sophisticated stage predictors for DAEs should be a topic for further research.
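This extrapolation predictor amounts to evaluating the Neville tableau at the new stage abscissa. The sketch below assumes the queue entries are (stage value, timepoint) pairs; `predict_stage` is a hypothetical name, not taken from the code:

```python
def predict_stage(history, t_new):
    # Neville's algorithm: evaluate at t_new the polynomial interpolating
    # the queued pairs; history = [(X_1, t_1), ..., (X_k, t_k)].
    ts   = [t for (_, t) in history]
    vals = [x for (x, _) in history]
    n = len(vals)
    for m in range(1, n):
        for j in range(n - m):
            vals[j] = ((t_new - ts[j]) * vals[j + 1]
                       - (t_new - ts[j + m]) * vals[j]) / (ts[j + m] - ts[j])
    return vals[0]
```

For vector-valued stages the same recursion applies componentwise.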

From the stage values X_i the corresponding stage derivatives Q′_i can be computed using (11.4), i.e.

Q′_i = (1/(hλ)) ( q(X_i, t_n + c_i h) − ω_i ).

Finally the output vector q^{[n+1]} of step number n is obtained from (11.3c).

11.2 Error Estimation and Stepsize Prediction

After computing the output vector at the end of the step we have to decide whether to accept or reject the result. This decision is based on an estimate of the local truncation error. Appropriate estimators have been developed in [26, 37, 40] in the context of stiff ODEs (11.1). For completeness the key points will be briefly reviewed here.

We assume that M = [A, U, B, V] is a general linear method in Nordsieck form with s internal and r external stages. The stage order q = p agrees with the order. Recall that M is applied to the ODE (11.1) according to

Y = h A Y′ + U y^{[n]},
y^{[n+1]} = h B Y′ + V y^{[n]}

with Y′_i = f(Y_i, t_n + c_i h) being the stage derivatives.


The exact Nordsieck vector will be denoted by

ŷ^{[n]} = ( y(t_n), h y′(t_n), . . . , h^{r−1} y^{(r−1)}(t_n) )ᵀ.

It is assumed that the input vector y^{[n]} used for numerical computations comes with a first order error term β h^{p+1} y^{(p+1)}(t_n), i.e.

y^{[n]}_i = ŷ^{[n]}_i − β_i h^{p+1} y^{(p+1)}(t_n) + O(h^{p+2}) (11.6)

for i = 1, . . . , r. As we are interested in the local error, the first component is assumed to be exact, i.e. β_1 = 0. As in [26, 37], using Taylor series expansion yields

Y_i = y(t_n + c_i h) − ε_i h^{p+1} y^{(p+1)}(t_n) + O(h^{p+2}), (11.7a)
h Y′_i = h y′(t_n + c_i h) − ε_i (∂f/∂y)(y(t_n)) h^{p+2} y^{(p+1)}(t_n) + O(h^{p+3}), (11.7b)
y^{[n+1]}_i = ŷ^{[n+1]}_i − γ_i h^{p+1} y^{(p+1)}(t_{n+1}) + O(h^{p+2}). (11.7c)

The coefficients ε and γ are given by

ε = (1/(p+1)!) c^{p+1} − (1/p!) A c^p + U β,
γ = α(p, r) − (1/p!) B c^p + V β,

where

c^k = ( c_1^k, c_2^k, . . . , c_s^k )ᵀ,   α(k, l) = ( 1/(k+1)!, 1/k!, . . . , 1/(k+2−l)! )ᵀ.

The matrix V plays a crucial role in propagating the error. For methods with Runge-Kutta stability, but also for those from Table 10.4, this matrix is given by

V = [ 1 vᵀ ; 0 Ṽ ],

where the submatrix Ṽ has zero spectral radius. Hence the errors in the first component of the Nordsieck vector are propagated quite differently from those in the remaining r − 1 components. Indeed, writing

β = ( 0, β̃ᵀ )ᵀ,   B = [ bᵀ ; B̃ ],   α(p, r) = ( 1/(p+1)!, α̃ᵀ )ᵀ,

it turns out that

β̃ ↦ α̃ − (1/p!) B̃ c^p + Ṽ β̃

is a fixed-point mapping. Thus, after a series of steps using a constant stepsize, the values β will settle down to a fixed value that is not changed from step to step. This fixed point can be computed by solving

γ = β + C_p e_1   ⇔   [ e_1 e_1ᵀ + (I − V) ] ( C_p, β_2, . . . , β_r )ᵀ = α(p, r) − (1/p!) B c^p. (11.8)


If C_p and β form a solution of (11.8), then input values

y^{[n]}_1 = ŷ^{[n]}_1,
y^{[n]}_i = ŷ^{[n]}_i − β_i h^{p+1} y^{(p+1)}(t_n) + O(h^{p+2}),   i = 2, . . . , r,

with ŷ^{[n]} the exact Nordsieck vector, imply that the output at the end of the step is given by

y^{[n+1]}_1 = ŷ^{[n+1]}_1 − C_p h^{p+1} y^{(p+1)}(t_{n+1}) + O(h^{p+2}),
y^{[n+1]}_i = ŷ^{[n+1]}_i − β_i h^{p+1} y^{(p+1)}(t_{n+1}) + O(h^{p+2}),   i = 2, . . . , r.

Thus, the number C_p is called the method's error constant and

ŷ^{[n+1]}_1 − y^{[n+1]}_1 = C_p h^{p+1} y^{(p+1)}(t_{n+1}) + O(h^{p+2}) (11.9)

is the local truncation error.

Example 11.1. Consider the order 2 method from Table 10.2 on page 192,

M = [ A U ; B V ]

with

A = [ 2/11 0 0 ; 4/11 2/11 0 ; 135/352 105/352 2/11 ],
U = [ 1 3/22 −7/968 ; 1 3/22 −7/968 ; 1 3/22 −7/968 ],
B = [ 135/352 105/352 2/11 ; −17/120 17/56 88/105 ; −77/60 −11/28 176/105 ],
V = [ 1 3/22 −7/968 ; 0 0 0 ; 0 0 0 ],

and c = ( 7/22, 15/22, 1 )ᵀ.

For this method p = 2, s = r = 3 implies α(2, 3) = ( 1/6, 1/2, 1 )ᵀ such that the linear system (11.8) reads

[ 1 −3/22 7/968 ; 0 1 0 ; 0 0 1 ] ( C_2, β_2, β_3 )ᵀ = ( −415/31944, 17/968, 7/22 )ᵀ   ⇒   ( C_2, β_2, β_3 )ᵀ = ( −103/7986, 17/968, 7/22 )ᵀ.

Hence the error constant is given by C_2 = −103/7986.
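This computation is easy to reproduce in exact rational arithmetic. The following check (an illustrative verification, not part of the thesis) assembles the right hand side of (11.8) and exploits the fact that the second and third rows of the system are identity rows:

```python
from fractions import Fraction as F

# order 2 method from Example 11.1 (p = 2, s = r = 3)
B = [[F(135, 352), F(105, 352), F(2, 11)],
     [F(-17, 120), F(17, 56),   F(88, 105)],
     [F(-77, 60),  F(-11, 28),  F(176, 105)]]
v = [F(1), F(3, 22), F(-7, 968)]      # first row of V
c = [F(7, 22), F(15, 22), F(1)]
alpha = [F(1, 6), F(1, 2), F(1)]      # alpha(2, 3)

p = 2
# right hand side of (11.8): alpha(p, r) - (1/p!) B c^p
rhs = [alpha[i] - F(1, 2) * sum(B[i][j] * c[j] ** p for j in range(3))
       for i in range(3)]
beta2, beta3 = rhs[1], rhs[2]
# first row of the system: C2 - (3/22) beta2 + (7/968) beta3 = rhs[0]
C2 = rhs[0] + v[1] * beta2 + v[2] * beta3
print(C2)  # -103/7986
```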

If a method M is used with a constant stepsize h, the above computations are valid for all n as the coefficients β won't change from step to step. In general, however, an efficient numerical solution of any given problem requires an adaptive change of the stepsize in order to react to how the solution evolves.

If the current step is computed with stepsize h_n and another stepsize h_{n+1} is to be used for the next step, the Nordsieck vector needs modification. This was discussed briefly in Section 10.2.


Let σ_n = h_{n+1}/h_n be the stepsize ratio. Then the Nordsieck vector y^{[n+1]} needs to be multiplied by the diagonal matrix D(σ_n) = diag(1, σ_n, . . . , σ_n^{r−1}) in order to provide the correct scaled derivatives for the next step.

This approach, however, distorts the coefficients β as can be seen from the following computation:

y^{[n+1]} = ( y(t_{n+1}) − C_p h_n^{p+1} y^{(p+1)}(t_{n+1}) + O(h_n^{p+2}),
  h_n y′(t_{n+1}) − β_2 h_n^{p+1} y^{(p+1)}(t_{n+1}) + O(h_n^{p+2}),
  . . . ,
  h_n^{r−1} y^{(r−1)}(t_{n+1}) − β_r h_n^{p+1} y^{(p+1)}(t_{n+1}) + O(h_n^{p+2}) )ᵀ,

D(σ_n) y^{[n+1]} = ( y(t_{n+1}) − C_p h_n^{p+1} y^{(p+1)}(t_{n+1}) + O(h_{n+1}^{p+2}),
  h_{n+1} y′(t_{n+1}) − σ_n β_2 h_n^{p+1} y^{(p+1)}(t_{n+1}) + O(h_{n+1}^{p+2}),
  . . . ,
  h_{n+1}^{r−1} y^{(r−1)}(t_{n+1}) − σ_n^{r−1} β_r h_n^{p+1} y^{(p+1)}(t_{n+1}) + O(h_{n+1}^{p+2}) )ᵀ.

The desired output, by contrast, reads

y^{[n+1]}_i = h_{n+1}^{i−1} y^{(i−1)}(t_{n+1}) − β_i h_{n+1}^{p+1} y^{(p+1)}(t_{n+1}) + O(h_{n+1}^{p+2})
  = h_{n+1}^{i−1} y^{(i−1)}(t_{n+1}) − σ_n^{p+1} β_i h_n^{p+1} y^{(p+1)}(t_{n+1}) + O(h_{n+1}^{p+2}).

Hence the appropriate modification of the Nordsieck vector has to be twofold. First y^{[n+1]} needs to be scaled with the stepsize ratio and then the correction (σ_n^{i−1} − σ_n^{p+1}) β_i h_n^{p+1} y^{(p+1)}(t_{n+1}) has to be added. In other words, y^{[n+1]}_i can be computed as

y^{[n+1]}_i = σ_n^{i−1} y^{[n+1]}_i + (σ_n^{i−1} − σ_n^{p+1}) β_i h_n^{p+1} y^{(p+1)}(t_{n+1}). (11.10)

This approach is called 'scale and modify' [26, 40]. It ensures that the coefficients β are maintained correctly also in the case of a variable-stepsize implementation.
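The combined update can be sketched as follows, with illustrative names: `y` holds the r components of the Nordsieck vector, `beta` the error coefficients with beta[0] = 0, and `scdrv` an approximation to h_n^{p+1} y^{(p+1)}(t_{n+1}):

```python
import numpy as np

def scale_and_modify(y, beta, scdrv, sigma, p):
    # (11.10): scale component i (0-based) by sigma^i, then add the
    # correction (sigma^i - sigma^(p+1)) * beta_i * scdrv
    return np.array([sigma**i * y[i]
                     + (sigma**i - sigma**(p + 1)) * beta[i] * scdrv
                     for i in range(len(y))])
```

For sigma = 1 the correction vanishes and the vector is left unchanged, as it should be.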

In order to apply the scale and modify technique, an estimate

scdrv_{p+1}(t_{n+1}) ≈ h^{p+1} y^{(p+1)}(t_{n+1})

for the scaled derivative is required. This value will not only be used for (11.10), but also in order to estimate the local truncation error (11.9) via

est_{p+1}(t_{n+1}) = C_p scdrv_{p+1}(t_{n+1}).

The error constant C_p can be computed from the method's coefficients as a solution of the linear system (11.8).

Recall from (11.7) that due to the method's high stage order the (scaled) stage derivatives h Y′_i are exact to within O(h^{p+2}). Thus, scdrv_{p+1}(t_{n+1}) will be computed as a linear combination of stage derivatives.


In the previous chapters we restricted attention to stiffly accurate methods. There the first component of the input vector, y^{[n]}_1 = Y_s, is computed from the last stage Y_s of the previous step. The corresponding scaled derivative h Y′_s is thus available for further exploitation as well. Since in case of local error estimation y^{[n]}_1 is assumed to be exact, h Y′_s constitutes the exact scaled derivative.

For methods having the FSAL property, or Property F as it is called in [26], h Y′_s is passed on to the next step as the second component of the Nordsieck vector (see page 184). For methods with y^{[n]}_2 ≠ h Y′_s the quantity h Y′_s can be made available using an appropriate implementation.

As in [40] we consider the estimation of h^{p+1} y^{(p+1)} using a linear combination

scdrv_{p+1}(δ_0, δ, t_{n+1}) = δ_0 h Y′_s + Σ_{i=1}^{s} δ_i h Y′_i, (11.11)

where h Y′_s in the first term is the last scaled stage derivative of the previous step. Expanding the right hand side of (11.11) into a Taylor series leads to

scdrv_{p+1}(δ_0, δ, t_{n+1}) = δ_0 h y′(t_n) + Σ_{i=1}^{s} δ_i h y′(t_n + c_i h) − δᵀε (∂f/∂y) h^{p+2} y^{(p+1)}(t_n) + O(h^{p+3})
  = δ_0 h y′(t_n) + δᵀ C̄ Z_n − δᵀε (∂f/∂y) h^{p+2} y^{(p+1)}(t_n) + O(h^{p+3}) (11.12)

with

δ = ( δ_1, . . . , δ_s )ᵀ,
C̄ = [ 1 c_i/1! ··· c_i^{p+1}/(p+1)! ]_{i=1,...,s} = [ C c^{p+1}/(p+1)! ],
Z_n = ( h y′(t_n), . . . , h^{p+2} y^{(p+2)}(t_n) )ᵀ.

The matrix C was introduced in Theorem 7.6 and extensively used in the previous chapter. Compared to C, the matrix C̄ uses an additional column. Observe that the multiplication δᵀ C̄ Z_n = [ (δᵀ C̄) ⊗ I_m ] Z_n involves Kronecker products.

For (11.11) to be an estimator satisfying

scdrv_{p+1}(δ_0, δ, t_{n+1}) = h^{p+1} y^{(p+1)}(t_{n+1}) + O(h^{p+2}) = h^{p+1} y^{(p+1)}(t_n) + O(h^{p+2})

we need to solve the linear system

[ δ_0 δ_1 ··· δ_s ] [ 1 0 ··· 0 ; 1 c_1/1! ··· c_1^p/p! ; ⋮ ⋮ ⋱ ⋮ ; 1 c_s/1! ··· c_s^p/p! ] = [ 0 ··· 0 1 ]   ⇔   ( δ_0, δᵀ ) [ e_1ᵀ ; C ] = e_{p+1}ᵀ. (11.13)


This system consists of p + 1 equations in s + 1 unknowns. Observe that the first p equations guarantee that h^k y^{(k)}(t_n) does not appear in the Taylor series expansion (11.12) for k = 1, . . . , p. The last equation ensures coefficient 1 for h^{p+1} y^{(p+1)}(t_n) such that the correct scaled derivative is approximated.

For non-confluent methods with s = p the solution is uniquely defined. In this case the coefficient matrix is a Vandermonde matrix and hence nonsingular. The estimators computed for the methods from Table 10.4 are presented in Table 11.1.
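For s = p the system (11.13) is square and can be solved directly. The helper below is a hypothetical sketch working in floating point rather than exact arithmetic:

```python
import numpy as np
from math import factorial

def estimator_weights(c, p):
    # Solve (delta_0, delta)^T . [e_1^T; C] = e_{p+1}^T for s = p stages.
    # Row 0 is (1, 0, ..., 0); row i is (1, c_i/1!, ..., c_i^p/p!).
    s = len(c)
    M = np.zeros((s + 1, p + 1))
    M[0, 0] = 1.0
    for i, ci in enumerate(c):
        M[i + 1] = [ci ** k / factorial(k) for k in range(p + 1)]
    rhs = np.zeros(p + 1)
    rhs[-1] = 1.0                      # pick out h^{p+1} y^{(p+1)}
    return np.linalg.solve(M.T, rhs)   # requires s = p (square system)
```

For example, a method with p = 1 and abscissa c = (1) yields δ_0 = −1, δ_1 = 1, matching the first row of Table 11.1 (the abscissa value here is an assumption for illustration).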

Example 11.2. In order to verify that the estimators derived for methods with s = r = p = q are highly accurate, we want to solve the problem

y′ = −(1/10) ( y − g(t) ) + g′(t),   t ∈ [0, 20]. (11.14)

This equation introduced by Prothero and Robinson [131] was already discussed in Example 2.7. Here, the complex function g(t) = exp(i t) with i² = −1 is used. The exact solution is given by y(t) = g(t). Similar to [38, 40] the periodic stepsize pattern

h_{n+1} = ρ^{(−1)^{k(n)} sin(8π t_n/20) cos(2π t_n/20)} h_n (11.15)

is used. The choice k(n) = 1 − ⌊(n mod 4)/2⌋ ensures that the stepsize is successively increased twice and then decreased twice such that h_n oscillates around the initial stepsize h_0 = 20/N, N = 800. When changing the stepsize, the scale and modify technique is used.

The integration is started with order p = 1. After that a random order change is allowed at every fourth step. More precisely, the following random-variable order strategy is employed,

p_{n+1} = p_n if n mod 4 ≠ 0,
p_{n+1} = p_n + 1 if n mod 4 = 0, p_n = 2 and z ∈ [2/3, 1],
p_{n+1} = p_n + 1 if n mod 4 = 0, p_n = 1 and z ∈ [1/2, 1],
p_{n+1} = p_n − 1 if n mod 4 = 0, p_n = 2 and z ∈ [0, 1/3],
p_{n+1} = p_n − 1 if n mod 4 = 0, p_n = 3 and z ∈ [0, 1/2],

order   δ_0          δ_1              δ_2                    δ_3
p = 1   −1           1
p = 2   2 + √2       −4 − 3√2         2 + 2√2
p = 3   −4325/242    2703125/24674    −388328775/2985554    12975/338

Table 11.1: Error estimators for the methods from Table 10.4 (s = r = p = q) constructed in Chapter 10.


[Figure 11.1: Accuracy of the estimators from Table 11.1 for the problem (11.14) using fixed-variable steps (11.15) and random order. The relative error is plotted over time together with the order history p = 1, 2, 3.]

where z is a random number uniformly distributed in [0, 1]. Thus, each possibility – changing the order up- or downwards and keeping the same order – has the same probability.
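The random order strategy can be sketched as follows (illustrative helper, not part of the thesis; orders restricted to {1, 2, 3} as in the experiment):

```python
import random

def next_order(p_n, n, z=None):
    # order changes are only allowed every fourth step; z is uniform on [0, 1]
    if n % 4 != 0:
        return p_n
    z = random.random() if z is None else z
    if p_n == 1:
        return 2 if z >= 1/2 else 1
    if p_n == 2:
        return 3 if z >= 2/3 else (1 if z <= 1/3 else 2)
    return 2 if z <= 1/2 else 3       # p_n == 3
```

At p_n = 2 all three outcomes have probability 1/3; at the boundary orders the forbidden change is replaced by keeping the order.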

At every timepoint t_n the accuracy of the estimate scdrv_{p+1}(t_n) is measured by computing the relative error

| ( scdrv_{p+1}(t_n) − h_n^{p+1} y^{(p+1)}(t_n) ) / ( h_n^{p+1} y^{(p+1)}(t_n) ) |.

This relative error together with the order history is plotted in Figure 11.1.

In spite of the demanding stepsize and order variations used here, the estimators perform quite satisfactorily. The estimate yields approximately two correct digits.

Recall that the schemes discussed in the previous example use only s = r = p stages and the estimators in Table 11.1 are uniquely defined by the linear system (11.13). Methods with s = r = p + 1 = q + 1 leave more freedom to derive optimised error estimators.

Example 11.3. For illustration consider again the order 2 method from Example 11.1. In order to determine an error estimator (δ_0, δ), the linear system (11.13) needs to be solved, i.e.

[ δ_0 δ_1 δ_2 δ_3 ] [ 1 0 0 ; 1 7/22 49/968 ; 1 15/22 225/968 ; 1 1 1/2 ] = [ 0 0 1 ].

Obviously the solution is not uniquely defined. One solution is given by the vector

( δ_0, δ_1, δ_2, δ_3 ) = ( 121/15, −847/60, 2057/420, 121/105 ).


This particular solution has advantages over other choices in the sense that δᵀε = 0 holds for

ε = (1/(p+1)!) c^{p+1} − (1/p!) A c^p + U β = ( −239/63888, −15/1936, −103/7986 )ᵀ.

Recall that ε is related to the first order error term of the stages as indicated in (11.7). Satisfying the additional requirement δᵀε = 0 allows the construction of estimators of higher accuracy.
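Both defining properties of this particular solution, the consistency condition coming from the first column of (11.13) and the annihilation of ε, can be verified in exact arithmetic (an illustrative check):

```python
from fractions import Fraction as F

delta0 = F(121, 15)
delta  = [F(-847, 60), F(2057, 420), F(121, 105)]        # delta_1..delta_3
eps    = [F(-239, 63888), F(-15, 1936), F(-103, 7986)]   # stage error vector

assert delta0 + sum(delta) == 0                     # first column of (11.13)
assert sum(d * e for d, e in zip(delta, eps)) == 0  # delta^T eps = 0, exactly
```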

For general linear methods with s = r = p + 1 = q + 1 assume that (δ_0, δ) is obtained from solving the extended system

( δ_0, δᵀ ) [ e_1ᵀ 0 ; C ε ] = ( e_{p+1}ᵀ, 0 ). (11.16)

The estimator (11.12) is thus seen to satisfy

scdrv_{p+1}(δ_0, δ, t_{n+1}) = h_n^{p+1} y^{(p+1)}(t_n) + θ h_n^{p+2} y^{(p+2)}(t_n) + O(h_n^{p+3})
  = h_n^{p+1} y^{(p+1)}(t_n + θ h_n) + O(h_n^{p+3})

for θ = (1/(p+1)!) δᵀ c^{p+1}. The scaled derivative h_n^{p+1} y^{(p+1)}(t_n + θ h_n) can therefore be estimated with much higher accuracy. The particular values of δ_0, δ and θ corresponding to the methods constructed in the previous chapter are given in Tables 11.2 and 11.3.

The increased accuracy of scdrv_{p+1}(δ_0, δ, t_{n+1}) offers the possibility to estimate the (p + 2)-nd derivative as well. The details of this construction are described in [40]. Since estimating the local truncation error of the method with order

order   δ_0                        δ_1                        δ_2                        δ_3                        δ_4                       θ
p = 1   −5/6                       −5/21                      15/14                                                                           21/40
p = 2   3906/629                   −6057/629                  396/629                    1755/629                                             824/1887
p = 3   −53262347456/1439714429    120907666368/1439714429    −43148914368/1439714429    −63375780544/1439714429    38879376000/1439714429    55341237871/1517715432

Table 11.2: Error estimators for the methods with inherent Runge-Kutta stability from Table 10.1.

order   δ_0        δ_1        δ_2        δ_3        δ_4       θ
p = 1   −5/6       −5/21      15/14                           21/40
p = 2   121/15     −847/60    2057/420   121/105              3/8
p = 3   −5.19608   −43.2157   160.823    −171.216   58.8039   0.604703

Table 11.3: Error estimators for the methods with M∞,p = 0 from Table 10.2.


p + 1 is seminal for a variable-order implementation, the main ideas will be briefly reviewed.

Denote η_n = scdrv_{p+1}(δ_0, δ, t_{n+1}) and σ_{n−1} = h_n/h_{n−1}, such that

η_n = h_n^{p+1} y^{(p+1)}(t_n + θ h_n) + O(h_n^{p+3})
  = h_n^{p+1} y^{(p+1)}(t_n) + θ h_n^{p+2} y^{(p+2)}(t_n) + O(h_n^{p+3}),

η_{n−1} = h_{n−1}^{p+1} y^{(p+1)}(t_{n−1} + θ h_{n−1}) + O(h_{n−1}^{p+3})
  = (1/σ_{n−1}^{p+1}) h_n^{p+1} y^{(p+1)}( t_n + ((θ − 1)/σ_{n−1}) h_n ) + O(h_n^{p+3})
  = (1/σ_{n−1}^{p+1}) h_n^{p+1} y^{(p+1)}(t_n) + ((θ − 1)/σ_{n−1}^{p+2}) h_n^{p+2} y^{(p+2)}(t_n) + O(h_n^{p+3}).

The aim is to find coefficients µ and λ such that

µ η_n + λ η_{n−1} = h_n^{p+2} y^{(p+2)}(t_{n+1}) + O(h_n^{p+3}) = h_n^{p+2} y^{(p+2)}(t_n) + O(h_n^{p+3}).

This goal can be achieved by solving the linear system

[ 1 1/σ_{n−1}^{p+1} ; θ (θ − 1)/σ_{n−1}^{p+2} ] ( µ, λ )ᵀ = ( 0, 1 )ᵀ   ⇔   [ σ_{n−1}^{p+1} 1 ; θ σ_{n−1}^{p+2} θ − 1 ] ( µ, λ )ᵀ = ( 0, σ_{n−1}^{p+2} )ᵀ.

Hence an estimator for the (p + 2)-nd scaled derivative is given by

scdrv_{p+2}(t_{n+1}) = ( σ_{n−1} / (θ σ_{n−1} − θ + 1) ) ( η_n − σ_{n−1}^{p+1} η_{n−1} )
  = h_n^{p+2} y^{(p+2)}(t_{n+1}) + O(h_n^{p+3}).

Observe that this estimator uses information from the current step and from the step before.

Example 11.4. The estimators for s = r = p + 1 = q + 1 were tested using the same numerical experiment as in Example 11.2. Figure 11.2 displays the relative errors

| ( scdrv_{p+1}(t_{n+1}) − h_n^{p+1} y^{(p+1)}(t_n + θ h_n) ) / ( h_n^{p+1} y^{(p+1)}(t_n + θ h_n) ) |

and

| ( scdrv_{p+2}(t_{n+1}) − h_n^{p+2} y^{(p+2)}(t_{n+1}) ) / ( h_n^{p+2} y^{(p+2)}(t_{n+1}) ) |

for estimating the (p + 1)-st and (p + 2)-nd scaled derivative, respectively.


[Figure 11.2: Accuracy of the estimators from Table 11.2 (methods with inherent Runge-Kutta stability, Table 10.1) for the problem (11.14) using fixed-variable steps (11.15) and random order. The relative error in scdrv_{p+1}(t_n) (+) and scdrv_{p+2}(t_n) (o) is displayed.]

The relative error in the estimator scdrv_{p+2}(t_n) is displayed only after three consecutive steps with the same order. This allows a settling of the leading error term after changing the order.

Figure 11.2 indeed indicates very high accuracy for these estimators. In case of the scaled (p + 1)-st derivative, scdrv_{p+1}(t_n) yields approximately 5 correct digits, while the estimator scdrv_{p+2}(t_n) for the scaled (p + 2)-nd derivative offers 2–3 significant digits.

Using the estimators for scaled derivatives as described above, the local truncation error (11.9) can be estimated using

est_{p+1}(t_{n+1}) = C_p scdrv_{p+1}(t_{n+1}) = q̂^{[n+1]}_1 − q^{[n+1]}_1 + O(h_n^{p+2}),

where q̂^{[n+1]} denotes the exact Nordsieck vector.

The error constant C_p is computed from (11.8). Recall that for properly stated DAEs of the form (11.2) the Nordsieck vector q^{[n]} approximates scaled derivatives of q( x(t_n), t_n ) ∈ R^l such that the estimate satisfies est_{p+1}(t_{n+1}) ∈ R^l as well. The current step is accepted provided that the inequality

‖est_{p+1}(t_{n+1})‖_scq ≤ 1

holds. The scaled norm

‖q‖_scq = max_{j=1,...,l} ( |q_j| / (atolq_j + rtolq_j · |q_{n,j}|) )

depends on absolute and relative tolerances atolq, rtolq for q. These tolerances can be computed from the values atol, rtol that are prescribed by the user (see also Chapter 12). The numerical approximation for q(x_n, t_n) obtained in the previous step is denoted by q_n.


The number ‖est_{p+1}(t_{n+1})‖_scq is used to predict a stepsize h_{n+1} for the next step from t_{n+1} to t_{n+2} as well. Recall that

est_{p+1}(t_{n+1}) = C_p h_n^{p+1} q^{(p+1)}( x(t_{n+1}), t_{n+1} ) + O(h_n^{p+2}),

and similarly we will have

est_{p+1}(t_{n+2}) = C_p h_{n+1}^{p+1} q^{(p+1)}( x(t_{n+2}), t_{n+2} ) + O(h_{n+1}^{p+2}).

Since optimal performance is realised if the local error coincides with the required tolerance, we will try to choose h_{n+1} such that ‖est_{p+1}(t_{n+2})‖_scq ≈ 1. Assuming that q^{(p+1)} does not vary too much over the next step, this can be achieved by requiring

( h_n / h_{n+1} )^{p+1} = ‖est_{p+1}(t_{n+1})‖_scq,

or, equivalently,

h_{n+1} = h_n ( 1 / ‖est_{p+1}(t_{n+1})‖_scq )^{1/(p+1)}. (11.17)

In spite of its simplicity, this standard stepsize controller (11.17) is quite reliable for many problems. The GLM codes described in [93, 94] use this technique. A slightly modified version is used in [40]. The controller (11.17) will also be adopted as standard for the codes Glimda and Glimda++. As usual, safety factors will be employed to avoid excessively large variations in the stepsize which might destroy stability properties of the methods [85, 93]. Observe that in case of rejected steps, ‖est_{p+1}(t_{n+1})‖_scq > 1, the controller (11.17) decreases the stepsize.
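A sketch of the standard controller (11.17); the safety factor and the bounds on the stepsize ratio are illustrative assumptions, not the values used in Glimda:

```python
def predict_stepsize(h_n, est_norm, p, fac=0.9, fac_min=0.2, fac_max=2.0):
    # (11.17) with a safety factor and limits on the stepsize ratio
    ratio = fac * (1.0 / est_norm) ** (1.0 / (p + 1))
    return h_n * min(fac_max, max(fac_min, ratio))
```

Since est_norm > 1 gives a ratio below the safety factor, rejected steps are automatically retried with a smaller stepsize.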

More sophisticated controllers can be constructed assuming e.g. K_{n+1}/K_n = K_n/K_{n−1}, where K_n = q^{(p+1)}( x(t_{n+1}), t_{n+1} ). This leads to the formula

h_{n+1} = h_n ( 1 / ‖est_{p+1}(t_{n+1})‖_scq )^{1/(p+1)} · (h_n/h_{n−1}) · ( ‖est_{p+1}(t_n)‖_scq / ‖est_{p+1}(t_{n+1})‖_scq )^{1/(p+1)}.

A control theoretic study of similar estimators can be found in [148, 149].

11.3 Changing the Order

With the error estimation and stepsize control techniques described in the previous section, a variable-stepsize implementation of general linear methods is already possible. It remains to develop a strategy for choosing the appropriate order in an adaptive fashion.


Assume that we are currently using a method with s = r = p + 1 = q + 1. In this case, scdrv_{p+2}(t_{n+1}) can be used to estimate the local truncation error

est_{p+2}(t_{n+1}) = C_{p+1} scdrv_{p+2}(t_{n+1})

for the method with order p + 1. The corresponding estimate

est_p(t_{n+1}) = C_{p−1} q^{[n+1]}_r

for the method with order p − 1 is trivial to obtain, as the last component of the Nordsieck vector approximates the scaled derivative h^p q^{(p)}( x(t_{n+1}), t_{n+1} ). Thus a stepsize prediction can be computed not only for the method of order p but also for its adjacent methods with order p + 1 and p − 1, respectively.

The decision on what order should be used for the next step is based on these stepsize predictions h_{n+1,p−1}, h_{n+1,p} and h_{n+1,p+1}. As methods of different order use a different number of stages, the computational costs per step have to be taken into account as well.

In [40] it is suggested to choose the new order by selecting the maximum of

( (s−2)/(s−1)² ) h_{n+1,p−1},   α ( (s−1)/s² ) h_{n+1,p},   ( s/(s+1)² ) h_{n+1,p+1}.

The safety factor α = 1.2 is used to prevent frequent order changes.
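The resulting rule can be sketched as a hypothetical helper; the weights implement the work-scaled comparison above for a method with s = p + 1 stages:

```python
def select_order(p, s, h_pm1, h_p, h_pp1, alpha=1.2):
    # choose the order whose work-scaled stepsize prediction is largest;
    # alpha > 1 biases the choice towards keeping the current order p
    scores = {p - 1: (s - 2) / (s - 1)**2 * h_pm1,
              p:     alpha * (s - 1) / s**2 * h_p,
              p + 1: s / (s + 1)**2 * h_pp1}
    return max(scores, key=scores.get)
```

With equal stepsize predictions the current order wins because of the bias factor, which is exactly the intended damping of order oscillations.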

This strategy is employed for Glimda++ as well. Similar to [40] the order is allowed to change only in case of a settled stepsize, i.e. | h_{n+1,p} − h_{n,p} | / h_{n,p} < 0.2.

Example 11.5. This order selection strategy will be tested using the Miller Integrator from Example 1.1. Recall from Chapter 1 that the Miller Integrator circuit leads to an index-1 equation with m = 5, l = 1. The input signal v(t) = sin(ω t²) is used with ω = 10^14. The DAE is solved for t ∈ [0, 7 · 10^−7] using x(0) = 0 as initial value.

For many different tolerances rtol = 10^−j, j = 0, . . . , 11, atol = 10^−3 · rtol, Figure 11.3 shows the number of function evaluations required for solving the problem together with the achieved accuracy, i.e. the relative error at the end

[Figure 11.3: Testing the order selection strategy for the Miller Integrator using the methods from Table 10.1. Number of function evaluations versus relative error for Glimda++ and for fixed orders p = 1, 2, 3.]


of the integration interval. The figure contains four curves. Three correspond to solving the Miller Integrator using constant order p ∈ {1, 2, 3}, while the fourth curve shows the results of the variable-order code Glimda++. For the computations the methods of Butcher and Podhaisky from Table 10.1 were used.

The order control described above shows almost optimal performance: for high accuracy demands high order methods are used, while low order methods are chosen for loose tolerances.

This order selection strategy is very elegant and seems to work reliably. However, for methods with s = r = p = q we cannot proceed in a similar fashion since no estimator scdrv_{p+2}(t_{n+2}) is available for methods of this type. In order to overcome these difficulties there are at least three possible approaches.

(a) Information from previous steps could be used in order to compute the required estimate scdrv_{p+2}(t_{n+2}). Keeping track of the backward information is a complex task in a variable-stepsize variable-order setting and a related theory establishing the error expansions quickly becomes difficult to manage (see e.g. [29]).

(b) In [30, 94] the decision on a new order is based on monitoring the ratio ρ = est_{p+1}(t_{n+1}) / est_p(t_{n+1}). If ρ < ρ_min = 0.9 and p < p_max, the order is increased. On the other hand, the order is decreased provided that ρ > ρ_max = 1.1 and p > p_min.

(c) Hairer and Wanner [86] discovered that high order methods perform poorly for loose tolerances due to a slowly converging Newton iteration. Hence, order control for their code Radau is based on the convergence rate of Newton's method.

The approaches (b) and (c) have been tested. It was found that in case of (b) the order is built up quickly, but often it is not decreased sufficiently fast. Hence, most computations are performed using the method of highest order.

The approach (c), by contrast, resulted in more efficient and more reliablecomputations such that (c) was implemented for Glimda.

The technique of Hairer and Wanner is well documented in [85, 86]. It can beused for general linear methods without any modifications being necessary.

For k ≥ 1 let θ_k = ‖ΔX_i^k‖ / ‖ΔX_i^{k−1}‖ denote the quotient of two consecutive Newton corrections in the iterative procedure (11.5). Based on θ_k the convergence rate is measured using

    ψ_1 = θ_1,   ψ_k = √(θ_k · θ_{k−1}),   k ≥ 2.

For any given problem the computation is started using the lowest order p = 1. The order is allowed to change only after computing samep consecutive steps with the current order and if the stepsize has settled, i.e. |h_{n+1} − h_n| / h_n < 0.2.


In case of a possible order change, the order is increased provided that ψ < upth and p < pmax. Here, ψ denotes the final value ψ_k of the Newton iteration for the current step. Similarly, the order is decreased if ψ > downth and p > pmin. Convergence failures give rise to an order decrease as well.

The parameters samep, upth and downth can be set by the user. A value samep = 5 is used by default. The threshold values upth = 10^−3 and downth = 0.8 have been found to be appropriate.
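The decision logic just described can be sketched as below. The function names and the calling convention are illustrative, but the formula for ψ_k and the threshold tests follow the text:

```python
import math

def convergence_measure(thetas):
    """psi_1 = theta_1 and psi_k = sqrt(theta_k * theta_{k-1}) for k >= 2,
    where theta_k is the ratio of two consecutive Newton correction
    norms; returns the final value psi of the iteration."""
    psi = thetas[0]
    for k in range(1, len(thetas)):
        psi = math.sqrt(thetas[k] * thetas[k - 1])
    return psi

def new_order(p, psi, pmin=1, pmax=3, upth=1e-3, downth=0.8):
    """Increase the order on fast Newton convergence, decrease it on
    slow convergence, otherwise keep the current order p."""
    if psi < upth and p < pmax:
        return p + 1
    if psi > downth and p > pmin:
        return p - 1
    return p
```

In the actual solver this decision is additionally gated by the samep counter and the settled-stepsize condition quoted above.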

Example 11.6. The order selection strategy is tested using the same numerical experiment as in Example 11.5.

For the Miller Integrator Figure 11.4 (a) shows the desired behaviour. For high accuracy demands the solution is computed using order p = 3 but for loose tolerances order p = 1 is used.

One of the test problems used most frequently for comparing integrators for stiff ODEs is the Robertson Reaction. This problem consists of a system of stiff ordinary differential equations modelling an autocatalytic chemical reaction. The reaction is described in [139]. The equations became popular in numerical analysis through the work of Hairer and Wanner. In [86] the problem is used for deriving and testing their order selection strategy.

Usually the equations are posed for t ∈ [0, 10^11]. Here, the tolerances rtol = 10^−j, j = 2, 3, . . . , 9, atol = 10^−3 · rtol have been used.

The results obtained for the code Glimda based on the general linear methods from Table 10.4 are presented in Figure 11.4 (b). Obviously the order 3 method is most efficient for almost all tolerances. This method is chosen correctly by the order control mechanism.


Figure 11.4 (a): Miller Integrator.


Figure 11.4 (b): Robertson Reaction.

Figure 11.4: Testing the order selection strategy for the methods from Ta-ble 10.4.


12 Numerical Experiments

Using techniques from the previous chapter the general linear methods from Tables 10.1, 10.2 and 10.4 have been implemented in Fortran77. This programming language offers fast performance and high accuracy. Linear algebra can be handled conveniently using routines from Blas and Lapack [8, 109].

The resulting codes Glimda and Glimda++ are capable of solving index-2 differential algebraic equations

    f( [q(x(t), t)]′, x(t), t ) = 0.   (12.1)

Notice that this formulation is more general than the structure

    A(t) [d(x(t), t)]′ + b(x(t), t) = 0   (12.2)

covered by the results of this thesis.

Both Glimda and Glimda++ are variable-stepsize variable-order codes based on singly diagonally implicit general linear methods. The order p satisfies 1 ≤ p ≤ 3 and the stage order q = p coincides with the order. The quantities passed on from step to step are given by the Nordsieck vector

    q[n] ≈ [ q(x(t_n), t_n),  h q′(x(t_n), t_n),  . . . ,  h^{r−1} q^{(r−1)}(x(t_n), t_n) ]^T.

Our aim is to compare the following methods:

• Glimda: s = r = p stages are used. For p = 1, 2 the methods from Table 10.4 are A- and L-stable. The order 3 method is A(α)-stable with α ≈ 73.7535. Unfortunately there was no success in finding an A-stable order 3 method satisfying all requirements (P1) – (P6) from page 182. The method from Table 10.5 with α ≈ 88.1 might be an alternative.

As described in the previous chapter, order control is based on the conver-gence rate of Newton’s method. Fast convergence signals a possible orderincrease. If the iteration process converges slowly or does not convergeat all, the order is switched downwards.


• Glimda++: this solver uses an additional flag to distinguish between the two families of methods from Tables 10.1 and 10.2. The former methods have been constructed with inherent Runge-Kutta stability while the latter family satisfies M∞,p = 0.

In all cases s = r = p + 1 stages are used and the stage order is q = p. The methods are A- and L-stable. Order control is based on estimates of the quantities h^{p+1} d^{p+1}/dt^{p+1} q(x(t), t) and h^{p+2} d^{p+2}/dt^{p+2} q(x(t), t).

The codes Glimda and Glimda++ are available from the author.

Comparing different codes is a notoriously difficult task. The results of the comparison will clearly depend on the problem, on the prescribed accuracy and on the machine that runs the tests.

Since we are studying general linear methods for integrated circuit design, this chapter focuses on applications from electrical circuit engineering. Thus most problems are realistic circuits. The Test Set for Initial Value Problem Solvers [124] of Bari University (formerly maintained by CWI Amsterdam) offers a wealth of suitable test problems.

In case of electrical circuits, the vector x of the DAE (12.1) contains node potentials as well as the currents through inductors and voltage sources. The vector q(x, t) comprises charges and fluxes. Observe that x ∈ R^m and q(x, t) ∈ R^n might be of different size. We have seen in Chapter 11 that error estimation is most naturally based on q rather than on x. The user, however, will prescribe a desired tolerance for x, such that this tolerance needs modification in order to be applicable to charges and fluxes. The computation of an appropriate tolerance is performed automatically based on the matrix ∂q/∂x (x, t). BDF schemes used for electrical circuit simulation give rise to similar difficulties. More details can be found in [147].

In order to achieve a comparison as fair as possible, each code has to solve every test problem for many different tolerances. The results are visualised in work-precision diagrams showing the relation between the required work and the achieved accuracy. A code performs optimally if high accuracy is achieved investing little work.

It is difficult to find a reasonable measure for the work necessary to solve a given problem. The easiest option is to measure the time required by the solver. This approach, however, is likely to give misleading results as the processor of the test machine might be busy with other tasks at the same time. Hence the results will often be irreproducible.

For electrical circuit simulation most time is spent evaluating the complex transistor models and setting up the equations. This process, called 'loading', requires approximately 85% of the work for standard applications with up to 10^3 devices [66]. Thus a fair comparison is possible by counting the number of function evaluations.


As one code may update the Jacobian much more frequently than another, the number of Jacobian evaluations has to be taken into account as well. Extensive tests using the simulator Titan of Infineon Technologies AG have shown that for electrical circuit simulation the Jacobian can be obtained rather cheaply. Evaluating the Jacobian together with a function call amounts to approximately 1.4 to 1.5 times the work of a single function evaluation [64, 82]. To be on the safe side, each Jacobian evaluation will be counted as an additional function call.

The accuracy is measured using scd numbers. The scd value denotes the minimum number of significant correct digits in the numerical solution at the end of the integration interval, i.e.

    scd = −log10( ‖ relative error at the end of the integration interval ‖_∞ ).

We will first use Glimda++ to compare the two families from Tables 10.1 and 10.2. Here we want to find out whether there is any benefit in using methods with M∞,p = 0. Afterwards Glimda++ will be judged against Glimda. It will turn out that the methods with only s = p stages indeed show a better performance.

Two of the most successful linear multistep and Runge-Kutta codes are Dassl and Radau. Hence, to assess the potential of general linear methods for realistic applications we have to compare Glimda to these highly tuned codes. Additionally the code Mebdfi by Cash [45] will be used for comparison in Section 12.3. Mebdfi is based on modified BDF formulae with improved stability properties. We will see that Glimda is indeed competitive. For many electrical circuits it is even superior to the classical codes.

All computations were performed on an Intel® Pentium® M processor with 1.4 GHz and 512 MB RAM running SuSE Linux 9.1 [150]. The GNU Fortran compiler was used for compilation and linking. Each code was used with its standard options. Except for adjusting the tolerances and the initial stepsize there was no tweaking of parameters to improve performance.

12.1 Glimda++: Methods with s = p + 1 Stages

The techniques for implementing Dimsims with s = r = p + 1 = q + 1 presented in Chapter 11 do not depend on the particular coefficients of the method. Hence the two families from Tables 10.1 and 10.2 can be used within the same code Glimda++.

Four test problems will be used to compare the different methods.

• miller – the Miller Integrator (index 1, m = 5, n = 1)

This example was used in Chapter 1 to introduce the modified nodal analysis. It is described in detail in [66]. Using a properly stated leading term it consists of one differential and four algebraic equations.


For the miller problem relative tolerances rtol = 10^−j, j = 1, . . . , 9, were chosen. The absolute tolerance satisfied atol = 10^−3 · rtol. The initial stepsize was chosen as h0 = 10^−10. As in Example 11.5 miller was solved on the interval [0, 7 · 10^−7] using x(0) = 0 and the input signal v(t) = sin(ω t²) with ω = 10^14.

Figure 12.1 (a) shows that both families perform similarly for loose tolerances. For tight tolerances the IRKS methods are clearly superior.

A      B      ¬(A ∧ B)
true   true   false
true   false  true
false  true   true
false  false  true

• nand – NAND gate model (index 1, m = n = 14)

The NAND gate was discussed in Chapters 4 and 5. It served as an example where both functions q and b are nonlinear in the formulation (12.2). Here the NAND gate model from the Bari Test Set for Initial Value Problem Solvers is used [124].

The problem has been contributed by Günther and Rentrop [83]. The equations model a basic logical element expressing the NAND relation as indicated by the truth table above. Due to a simplified transistor model the equations have index 1.

The test set uses the implicit formulation f(x′, x, t) = 0 and the derivative [q(x, t)]′ = C(x, t) x′ is calculated explicitly. Since the new solver Glimda++ is capable of handling properly stated leading terms, the equations were modified such that the formulation f([q(x, t)]′, x, t) = 0 is used directly.

Figure 12.1 (a): miller
Figure 12.1 (b): nand
Figure 12.1 (c): ringmod dae
Figure 12.1 (d): transamp

Figure 12.1: Comparison of the methods from Table 10.1 with inherent Runge-Kutta stability (IRKS) and those from Table 10.2 with M∞,p = 0. Although the latter family is superior for the index-2 DAE (c), the overall performance is far from optimal. The IRKS methods are generally superior.


The following tolerances have been used: rtol = 10^−j/2, j = 2, 3, . . . , 10 and atol = 10^−3 · rtol. As for miller the initial stepsize was h0 = 10^−10.

For the NAND gate the methods with M∞,p = 0 show a poor performance. Spending more work does not lead to a significant increase in accuracy. Almost half of the runs failed with stepsize too small (see Table 12.1). The methods with inherent Runge-Kutta stability, by contrast, show the desired behaviour.

• ringmod dae – the Ring Modulator as a DAE (index 2, m = 15, n = 11)

The Ring Modulator circuit was introduced in Chapter 2. The circuit diagram from Example 2.6 corresponds to the index-2 formulation used here. Recall that the Ring Modulator mixes a low-frequency input signal Uin1 with a high-frequency input signal Uin2.

For this problem, rtol = 10^−j/2, j = 4, 6, . . . , 10 and atol = 10^−2 · rtol have been used. Again, the initial stepsize was prescribed as h0 = 10^−10.

For this index-2 problem the schemes with M∞,p = 0 are clearly superior to the IRKS methods. For all tolerance requirements higher accuracy is achieved with fewer function evaluations.

• transamp – the transistor amplifier circuit (index 1, m = 8, n = 5)

The transistor amplifier circuit from Figure 12.2 is one of the most frequently used benchmark circuits. Given an input signal Uin, the amplified output is delivered at node 8. Amplification is realised using two transistors. Each transistor is modelled as

    IG = (1 − α) g(uG − uS),   ID = α g(uG − uS),   IS = g(uG − uS),

where IG, ID and IS denote the current through the gate, drain and source contact, respectively. The node potentials at gate and source are denoted by uG, uS. The function g is given by g(u) = β (exp(u/uF) − 1) with β = 10^−6,


Figure 12.2: The transistor amplifier circuit. The input signal Uin is convertedinto an amplified signal at node 8.


uF = 0.026 and α = 0.99. The transistor amplifier problem was published by Rentrop in [137]. It can be found in [84, 124] as well.
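As a sketch, the transistor model reads as follows in code (parameter defaults as given in the text; the function name is illustrative). Note that IG + ID = IS, i.e. the source collects the gate and drain currents:

```python
import math

def transistor_currents(uG, uS, alpha=0.99, beta=1e-6, uF=0.026):
    """Simplified transistor model of the amplifier circuit: the source
    current g(uG - uS) splits into a gate part (1 - alpha) and a drain
    part alpha, with g(u) = beta * (exp(u / uF) - 1)."""
    g = beta * (math.exp((uG - uS) / uF) - 1.0)
    IG = (1.0 - alpha) * g   # gate current
    ID = alpha * g           # drain current
    IS = g                   # source current
    return IG, ID, IS
```

With α close to 1 almost the entire current flows from drain to source, which is what makes the amplification work.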

For the numerical simulation rtol = 10^−j/2, j = 0, 1, . . . , 8, atol = 10^−6 · rtol and h0 = 10^−10 have been used.

The results in Figure 12.1 (d) once again show a clear superiority of the IRKS methods. The methods with M∞,p = 0 are competitive only for rtol = 1, atol = 10^−6. Tighter tolerances do not lead to an increase in accuracy, although higher scd values seem possible for rtol = 10^−4, atol = 10^−10. The IRKS methods, on the other hand, show a nice increase of accuracy when tightening the tolerances. There is, however, a need for improvement, in particular for loose tolerances: the same accuracy scd ≈ 3.5 is reached with a completely different amount of work. Behaviour of this kind has to be avoided.

The above numerical experiments show a clear difference in performance for the two families of methods considered here. A summary of failed runs is given in Table 12.1. As expected, the methods with M∞,p = 0 perform best for nonlinear index-2 problems. Here the improved stability at infinity seems to be a benefit. However, in all other cases the results are far from satisfactory, in particular for nand and transamp.

The methods with inherent Runge-Kutta stability constructed by Butcher and Podhaisky [40] show a good behaviour for all test examples. These methods will therefore be used as the standard option for the solver Glimda++.

It becomes clear that the construction of good methods is not a trivial task. Although both families of methods have Runge-Kutta stability and satisfy the same order and stage order conditions, their performance is quite different. Thus the construction of methods should always be accompanied by a corresponding test implementation in order to allow a realistic assessment of a method's potential.

problem        solver                  rtol                           reason for failing
nand           Glimda++ (M∞,p = 0)     10^−1, 10^−4, 10^−5/2, 10^−5   stepsize too small
ringmod dae    Glimda++ (IRKS)         10^−2, 10^−5/2                 stepsize too small
ringmod dae    Glimda++ (M∞,p = 0)     10^−2                          stepsize too small

Table 12.1: Summary of failed runs for Glimda++ applied to miller, nand, ringmod dae and transamp.

12.2 Glimda vs. Glimda++

In the previous section it was decided to adopt the methods from Table 10.1 as a standard for Glimda++. We are now going to compare this code with


the implementation Glimda based on the methods from Table 10.4. Recall that Glimda not only uses different methods with only s = p stages, but also employs different implementation specifics. In particular, order control is based on the convergence rate of Newton's method (see Chapter 11 for more details).

The same test problems miller, nand, ringmod dae and transamp as in Section 12.1 have been used.

Figure 12.3 shows a surprisingly good performance for Glimda. The code Glimda++ is outperformed on all four test examples. Notice in particular that for Glimda the relation between the achieved accuracy and the required work is almost a straight line. This behaviour ensures that Glimda is clearly superior not only for high accuracy demands but in particular for loose tolerances (see e.g. Figure 12.3 (d)).

In order to allow a more complete comparison, four additional test examples will be considered next.

• ringmod – the Ring Modulator circuit formulated as an ordinary differential equation (index 0, m = 15, n = 15)

As described in Example 2.6 the Ring Modulator can be regularised by introducing artificial capacitances Cs. The resulting model is an ordinary differential equation of dimension m = 15. For Cs = 10^−12 simulation results can be found in [124]. The same values are used here. Due to the capacitances Cs, artificial oscillations of high frequency are introduced. Thus the numerical simulation requires an excessively large number of timesteps.

Figure 12.3 (a): miller
Figure 12.3 (b): nand
Figure 12.3 (c): ringmod dae
Figure 12.3 (d): transamp

Figure 12.3: Comparison of Glimda and Glimda++


For the comparison of Glimda and Glimda++ the tolerances rtol = 10^−j/2, j = 4, 5, . . . , 8, atol = 10^−3 · rtol have been used. The initial stepsize was chosen as h0 = 10^−6.

Glimda produces accurate solutions but shows little reaction to different tolerances. Glimda++, by contrast, is able to efficiently compute solutions that satisfy low accuracy demands. As this is the required behaviour, Glimda++ is superior for this example.

Nevertheless the ODE formulation of the Ring Modulator circuit should not be used for practical applications. A comparison with the results for the index-2 formulation in Figure 12.3 (c) shows that there Glimda achieves four times the accuracy with only half the effort.

• rober – the Robertson reaction (index 0, m = n = 3)

This stiff ordinary differential equation models an autocatalytic chemical reaction described in [139]. The problem is used frequently for numerical studies [85], in particular for comparing integrators for stiff ODEs.

For the rober problem stiffness is caused by reaction rates of different magnitude. The exact solution shows a quick initial transient. Afterwards the components vary slowly such that a large stepsize is appropriate. As reported in [85, 124] many codes fail if t becomes very large. Problems arise if components of the solution accidentally become negative, which eventually causes overflow. Thus the integration interval is chosen sufficiently large as [0, 10^11].
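The Robertson equations themselves are not restated here; with the classical rate constants k1 = 0.04, k2 = 3·10^7, k3 = 10^4 the right-hand side can be sketched as (illustrative function name):

```python
def robertson_rhs(t, y):
    """Right-hand side of the Robertson reaction ODE with the classical
    rate constants.  The three components sum to zero, reflecting
    conservation of the total concentration."""
    y1, y2, y3 = y
    r1 = 0.04 * y1        # slow reaction
    r2 = 3e7 * y2 * y2    # very fast reaction -> source of stiffness
    r3 = 1e4 * y2 * y3
    return (-r1 + r3, r1 - r3 - r2, r2)
```

The huge spread of the rate constants is precisely the "reaction rates of different magnitude" causing the stiffness mentioned above.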

Figure 12.4 (a): ringmod
Figure 12.4 (b): rober
Figure 12.4 (c): fekete
Figure 12.4 (d): tba

Figure 12.4: Comparison of Glimda and Glimda++. Second set of testproblems.


The tolerances rtol = 10^−j, j = 2, 3, . . . , 9, atol = 10^−3 · rtol and an initial stepsize h0 = 10^−10 have been used for the numerical simulations.

Recall that the order 3 method used for Glimda is not A-stable. In spite of this, the code performs quite satisfactorily. Apart from rtol = 10^−5, atol = 10^−8 the relation between accuracy and work is almost a straight line. In particular for loose tolerances Glimda is superior to Glimda++.

Let us stress that the order 3 method is indeed used for the computations. This was seen in Example 11.6. Hence the lack of A-stability does not seem to be a severe restriction.

• fekete – the Fekete problem (index 2, m = 160, n = 120)

This problem taken from [124] computes elliptic Fekete points. The task is to distribute N points x = [x1 · · · xN] on the unit sphere such that the product of mutual distances V(x) = ∏_{i<j} ‖xi − xj‖₂ is maximised. The global optimum is difficult to compute as an optimisation problem. Using Lagrange multipliers an equivalent DAE formulation can be found. In order to arrive at the index-2 formulation used here the constraints have to be regularised [10]. The fekete problem was solved for N = 20.

The relative tolerances were swept over rtol = 10^−j, j = 0, 1, . . . , 5 with atol = 10^−2 · rtol. An initial stepsize h0 = 10^−8 has been used.

Figure 12.4 (c) shows that neither Glimda nor Glimda++ performs satisfactorily. In particular Glimda requires the same amount of work for almost all accuracy requirements. Indeed, the best performance is achieved when restricting the order to 1 ≤ p ≤ 2 for Glimda and thus using A-stable methods only. As a consequence, the problems for p = 3 might be caused by the lack of A-stability. Unfortunately, [124] contains no information about the exact position of the eigenvalues.

• tba – the Two Bit Adding Unit (index 1, m = n = 175)

Given two numbers in base-2 representation the adding unit computes the sum of these numbers. In order to achieve this, voltages are associated with boolean values. By convention, a voltage exceeding 2 V is interpreted as true while values lower than 0.8 V correspond to false. In between, the boolean value is undefined.

Let A1A0 and B1B0 be two 2-bit numbers. These numbers and a carry bit Cin are fed into the circuit as input voltages. The circuit performs the addition

    A1A0 + B1B0 + Cin = C S1S0.

S1S0 is the 2-bit representation of the output and C is the updated carry bit. An example computation is indicated in Figure 12.5. The circuit diagram of the Two Bit Adding Unit is available from [124].
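The voltage convention and the ideal addition can be sketched as follows (illustrative helper names; the circuit of course realises this with transistors rather than integer arithmetic):

```python
def to_bool(voltage):
    """Decode a node voltage: above 2 V means true, below 0.8 V means
    false, anything in between is logically undefined (None)."""
    if voltage > 2.0:
        return True
    if voltage < 0.8:
        return False
    return None

def two_bit_add(a1, a0, b1, b0, cin):
    """Ideal behaviour of the adding unit: A1A0 + B1B0 + Cin = C S1S0.
    Returns the carry bit C and the sum bits S1, S0."""
    total = 2 * a1 + a0 + 2 * b1 + b0 + cin
    return (total >> 2) & 1, (total >> 1) & 1, total & 1
```

For the example addition 01 + 11 + 1 = 101 quoted with Figure 12.5, `two_bit_add(0, 1, 1, 1, 1)` returns `(1, 0, 1)`.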

It is emphasised in [124] that the equations have the form

    A [q(x)]′ = f(x, t)   (12.3)


with x ∈ R^175, but standard solvers require a different formulation. Usually DAEs of the form M y′ = f(y, t) can be handled. Thus the software part of [124] uses the reformulation

    Q′ = f(x, t),   0 = Q − A q(x).

Introducing the additional variable Q = A q(x) doubles the dimension such that in [124] the tba problem is referred to as a DAE of dimension m = 350. Since Glimda and Glimda++ are capable of handling the structure (12.3), this formulation is used directly.

For the tba problem the tolerances rtol = atol = 10^−j/2, j = 2, 3, . . . , 10 and an initial stepsize h0 = 10^−10 have been used.

Again, Figure 12.4 (d) shows a clear superiority of Glimda over Glimda++. For the latter solver most runs fail, as can be seen in Table 12.2.

The summary of failed runs in Table 12.2 shows that Glimda is a very robust solver. Only one computation out of 61 runs failed. The computed results are most satisfactory in almost all cases and the lack of A-stability for the order 3 method does not seem to play a crucial role. In particular the stiff rober problem does not pose serious difficulties for the solver.


Figure 12.5: The input signals A1A0, B1B0, Cin for the Two Bit Adding Unit tba lead to the output S1S0 and the carry bit C. Notice that the bar denotes logical inversion. For t = 200 the input reads A1A0 = 10, B1B0 = 00, Cin = 1, such that the addition 01 + 11 + 1 = 101 = C S1S0 is performed.


Figure 12.3 shows that Glimda is clearly superior to Glimda++ for the test problems miller, nand, ringmod dae and transamp. Higher accuracy is achieved while investing considerably less work.

The second four test examples ringmod, rober, fekete and tba do not change thisassessment, even though the difference between the two solvers is no longer asclear as for the first four problems.

In particular ringmod and fekete seem to indicate a better performance for Glimda++ as compared to Glimda. However, this is not a severe restriction: the first problem is academic in the sense that the index-2 formulation can be solved much more efficiently, while the situation for fekete can be improved considerably by limiting the maximum order for Glimda. It is not clear whether Glimda's problems with fekete result from the missing A-stability. In any case, an improved A-stable order 3 method is most desirable.

The numerical experiments of this section indicate that a practical implementation should be based on methods with s = p stages. Although the linear stability properties of methods with s = p + 1 stages are more desirable, and error estimation as well as order control can be realised very elegantly, these methods do not seem to benefit from their excellent properties as much as expected. The robust implementation and the reduced work per step for methods with s = p stages turn out to be a clear advantage for the many test problems considered here.

In fact, Glimda seems powerful enough to be a tough competitor for classical codes such as Dassl or Radau. The corresponding comparisons will be performed in the next section.

problem        solver      rtol                                                     reason for failing
ringmod dae    Glimda++    10^−2, 10^−5/2                                           stepsize too small
ringmod        Glimda      10^−2                                                    stepsize too small
fekete         Glimda++    10^−5                                                    stepsize too small
tba            Glimda++    10^−3/2, 10^−5/2, 10^−3, 10^−7/2, 10^−4, 10^−9/2, 10^−5  stepsize too small

Table 12.2: Summary of failed runs for Glimda and Glimda++ (IRKS) applied to miller, nand, ringmod dae, transamp, ringmod, rober, fekete and tba.

12.3 Glimda vs. Standard Solvers

In Chapter 2 classical numerical methods such as linear multistep schemes and Runge-Kutta methods have been considered. It was pointed out that both families of methods have disadvantages for integrated circuit design. Hence, general linear methods have been studied in the context of differential algebraic equations modelling electrical circuits. These studies eventually led to the test implementation Glimda.

The numerical experiments of this section are performed in order to test thecode Glimda against three of the most successful and highly tuned codesavailable at [124].

• Dassl solves general implicit DAEs f(x′, x, t) = 0 having index µ ≤ 1. This linear multistep code is based on BDF formulae of order 1 ≤ p ≤ 5. Dassl was written by Petzold [10, 128] and can be downloaded from www.netlib.org/ode/ddassl.f.

• Radau is a standard Runge-Kutta code written by Hairer and Wanner [85, 86]. It is capable of solving differential algebraic equations of the form M y′ = f(y, t) with index µ ≤ 3. The code uses RadauIIA methods of order p = 5, 9, 13 and can be obtained from www.unige.ch/~hairer/prog/stiff/radau.f.

• Mebdfi uses Modified Extended Backward Differentiation Formulae that increase the region of absolute stability as compared to classical BDF methods. The resulting schemes are A-stable up to order 4, but the implementation by Abdulla and Cash uses methods of order 1 ≤ p ≤ 7. Similar to Dassl, implicit DAEs f(x′, x, t) = 0 can be solved. The code is applicable if the index satisfies µ ≤ 3. Mebdfi is available at www.ma.ic.ac.uk/~jcash/IVP software/itest/mebdfi.f.

For the numerical experiments performed here, the codes Dassl, Radau, Mebdfi and the Fortran77 problem descriptions have been downloaded from [124]. All simulation results are displayed as work-precision diagrams in Figure 12.6.

It is no surprise that Glimda is not competitive with the standard solvers for ringmod and fekete (Figures 12.6 (e) and 12.6 (g)). These issues have already been discussed in the previous section. Observe that Radau as available from [124] was not able to solve ringmod. Dassl, on the other hand, fails for fekete. A complete summary of failed runs is given in Table 12.3.

For miller and tba Glimda shows a performance quite similar to Dassl. Again, Radau fails completely for tba, and Figure 12.6 (h) shows that Mebdfi is not at all competitive with Glimda and Dassl for the Two Bit Adding Unit. For the miller problem Figure 12.6 (a) shows a similar situation for Glimda, Dassl and Mebdfi, although the three solvers are now much closer together. Radau, by contrast, is most efficient. For miller this solver seems to benefit from the high order methods being used.

The benefit of high order can be seen clearly in Figure 12.6 (f) for the rober problem. Although Glimda is competitive for low accuracy demands, a maximum order of p = 3 is not sufficient to keep up with Dassl or Radau, where orders up to 5 and 13 are used, respectively.

In contrast to these results for ordinary differential equations, computations

[Figure 12.6: work-precision diagrams (work vs. scd) comparing Glimda with the standard solvers Dassl, Radau and Mebdfi on the problems (a) miller, (b) nand, (c) ringmod dae, (d) transamp, (e) ringmod, (f) rober, (g) fekete and (h) tba.]


with a maximum order p = 3 seem to be most appropriate for problems from electrical circuit simulation. Indeed, Glimda is the most efficient solver for both ringmod dae and transamp. Figure 12.6 (c) and Figure 12.6 (d) show that Glimda is superior for all tolerances. Notice that the gap between Glimda and the other solvers increases for low accuracy demands. This is a particularly nice feature as for realistic applications an error of about 2% is sufficient [66].

problem       solver   rtol                   reason for failing (as reported by the solver)

miller        Mebdfi   10^-7                  hmin reduced by a factor of 1.0e10
nand          Radau    10^-1, ..., 10^-5      RADAU can not solve IDEs
ringmod       Glimda   10^-2                  stepsize too small
              Dassl    10^-2                  failed to converge repeatedly or with abs(h)=hmin
              Radau    10^-2, ..., 10^-4      RADAU can not handle FEVAL IERR
tba           Dassl    10^-1                  could not converge because ires was equal to minus one
              Radau    10^-1, ..., 10^-5      RADAU can not handle FEVAL IERR
              Mebdfi   10^-5                  the requested error is smaller than can be handled
transamp      Dassl    10^0, 10^-1/2, 10^-1   failed to converge repeatedly or with abs(h)=hmin
              Radau    10^0, 10^-1/2          step size too small
              Mebdfi   10^0, 10^-1/2          the requested error is smaller than can be handled
ringmod dae   Dassl    10^-2, 10^-5/2         failed to converge repeatedly or with abs(h)=hmin
              Mebdfi   10^-2, 10^-5/2         corrector convergence could not be achieved
rober         Dassl    10^-6, 10^-7, 10^-9    failed to converge repeatedly or with abs(h)=hmin
              Radau    10^-2, 10^-3, 10^-7    step size too small
fekete        Dassl    10^0, ..., 10^-5       DASSL can not solve higher index problems

Table 12.3: Summary of failed runs for Glimda, Dassl, Radau and Mebdfi.


For the nand problem in Figure 12.6 (b) Glimda lies in between Mebdfi and Dassl. Radau was not able to solve this example (see again Table 12.3).

The results of this section show that Glimda is competitive with the standard solvers Dassl, Mebdfi and Radau. Recall that Glimda is designed to solve differential algebraic equations

    A(t) [d(x(t), t)]′ + b(x(t), t) = 0

with a properly stated leading term and index µ ≤ 2. Ordinary differential equations y′ = f(y, t) can be cast into this form, such that Glimda is capable of handling ODEs as well. Nevertheless, the comparison from Table 12.4 shows that specialised codes such as Radau should be used. Although Glimda shows a reasonable performance for ODEs, other codes are often more efficient.
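The recast of an ODE into the properly stated form is straightforward: choose A = I, d(x, t) = x and b(x, t) = −f(x, t). A small sketch of this recast (illustrative interface only, not Glimda's actual calling convention):

```python
import numpy as np

def recast_ode(f, n):
    """Cast y' = f(y, t) into A(t)[d(x,t)]' + b(x,t) = 0 with A = I,
    d(x, t) = x and b(x, t) = -f(x, t)."""
    A = lambda t: np.eye(n)
    d = lambda x, t: x
    b = lambda x, t: -f(x, t)
    return A, d, b

# Example: the scalar ODE y' = -y.
A, d, b = recast_ode(lambda y, t: -y, 1)
x = np.array([2.0])
# along a solution, (d(x,t))' = x' = -y, so the DAE residual vanishes:
residual = A(0.0) @ (-x) + b(x, 0.0)
print(residual)
```

For a genuine DAE the matrix A(t) and the mapping d select exactly the components that are actually differentiated, which is what the phrase "properly stated leading term" refers to.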

Table 12.4 attempts a pictorial assessment of the results for ODEs. ⊕ indicates superior performance while © and ⊖ represent reasonable and poor performance, respectively. A hyphen – is used in case of a solver failing for a given problem.

The high potential of Glimda is revealed when applying the solver to differential algebraic problems. Glimda and Mebdfi are the only codes successful in solving all examples. Dassl and Radau fail for at least one test problem. Table 12.5 confirms that Glimda is indeed competitive if not superior to standard solvers.

          miller   ringmod   rober

Glimda    ©        ⊖         ©
Dassl     ©        ⊕         ⊕
Radau     ⊕        –         ⊕
Mebdfi    ©        ⊕         ⊕

Table 12.4: Comparison of different solvers for ordinary differential equations.

          nand   ringmod dae   fekete   tba   transamp

Glimda    ©      ⊕             ⊖        ⊕     ⊕
Dassl     ⊕      ©             –        ⊕     ⊖
Radau     –      ©             ⊕        –     ©
Mebdfi    ⊖      ©             ⊕        ⊖     ©

Table 12.5: Comparison of different solvers for differential algebraic equations.


Part V

Summary


Summary

The design and production of today's electronic devices require extensive numerical testing. Using CMOS technology there is an ongoing trend towards decreasing size and increasing power density for future chip generations. Due to high quality demands and short product cycles, circuit simulation is a key technology in every modern design flow as it allows an efficient layout and parameter optimisation.

The modified nodal analysis (MNA) applied to electrical circuits leads to differential algebraic equations (DAEs) that need to be solved in the time domain. Many classical methods give rise to severe difficulties. Problems due to high computational costs for Runge-Kutta methods and undesired stability behaviour for the trapezoidal rule or BDF schemes have been discussed. General linear methods (GLMs) are suggested as a means to overcome these difficulties.
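As a small illustration of why MNA produces DAEs, consider a voltage source driving a series RC branch. Assembling the standard MNA matrices (the element values and the node/branch numbering here are made up for illustration) gives a linear system E x′ + A x = s(t) whose leading matrix E is singular, i.e. a differential algebraic equation:

```python
import numpy as np

# MNA for voltage source -> resistor -> capacitor to ground.
# Unknowns x = (v1, v2, i_V): two node potentials plus the source current.
R, C = 1e3, 1e-6
E = np.array([[0.0, 0.0, 0.0],
              [0.0, C,   0.0],     # only v2 appears differentiated
              [0.0, 0.0, 0.0]])
A = np.array([[ 1/R, -1/R, 1.0],   # KCL at node 1 (includes source current)
              [-1/R,  1/R, 0.0],   # KCL at node 2 (capacitor current in E)
              [ 1.0,  0.0, 0.0]])  # source branch equation: v1 = V(t)
rank_E = np.linalg.matrix_rank(E)
print(rank_E)                      # rank 1 < 3: E is singular, the system is a DAE

# DC operating point: set x' = 0 and solve A x = s with V(t) = 1.
s = np.array([0.0, 0.0, 1.0])
x_dc = np.linalg.solve(A, s)
print(x_dc)                        # v1 = v2 = 1, no source current in steady state
```

Only the capacitor voltage carries a time derivative; the remaining equations are algebraic constraints, which is precisely the structure the properly stated leading term formulation makes explicit.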

General linear methods are studied for nonlinear differential algebraic equations having a properly stated leading term. A refined analysis for DAEs of increasing complexity is presented. Restricting attention to DAEs satisfying an additional structural condition, it is shown how to derive a decoupling procedure for nonlinear equations. Using this procedure statements on the existence and uniqueness of solutions are given. Compared to previous results the sufficient conditions derived here are much easier to check for practical applications. The derivative array is not used at all. MNA equations are covered by the results as well as DAEs in Hessenberg form. It is shown how the special structure of DAEs in circuit simulation can be exploited in order to simplify the results considerably.

In order to use the decoupling procedure to study numerical methods for index-2 DAEs, implicit index-1 equations have to be addressed first. A thorough study of general linear methods for implicit index-1 equations is presented. Using rooted tree theory order conditions are derived and it is shown that sufficiently high order p and stage order q guarantee that these order conditions are satisfied. Convergence is proved as well. Special care is taken when assessing the accuracy of stages and stage derivatives. It is shown that methods with a nilpotent stability matrix at infinity and p = q provide the desired accuracy.


Using the decoupling procedure these results are transferred to properly stated index-2 equations. Provided that the DAE is numerically qualified it is shown that a GLM, when applied to the original DAE, behaves as if it was integrating the inherent index-1 system. Hence convergence can be proved for GLMs applied to properly stated index-2 DAEs.

In order to verify the theoretical results, practical GLMs with s = p + 1 and s = p stages are constructed. Addressing implementation issues it turns out that error estimation and order control can be realised easily for methods of the former type. Although it is not as straightforward to implement methods with s = p stages, these methods offer reduced costs per step such that they might have higher potential for practical computations.

Using extensive numerical studies it is indeed confirmed that general linear methods can be efficiently used in electrical circuit simulation. In particular the code Glimda based on methods with s = p stages is competitive with classical codes such as Dassl or Radau.

These results motivate a further study of general linear methods for DAEs. Glimda is still in an experimental stage and there is room for considerable improvement. In particular, the error estimators and order selection strategies were derived in the context of ODEs. Another highly desired improvement is the derivation of an A-stable order-3 method, if it exists. As indicated by the method of order 2 an adaptive control of the damping behaviour might even be possible.


Bibliography

[1] R. Alexander. Diagonally implicit Runge-Kutta methods for stiff ODEs. SIAM J. Numer. Anal., 14:1006–1021, 1977.

[2] K. Balla and R. März. A unified approach to linear differential algebraic equations and their adjoint equations. Technical Report 00-18, Humboldt-Universität zu Berlin, 2000. Superseded by [3].

[3] K. Balla and R. März. A unified approach to linear differential algebraic equations and their adjoints. Z. Anal. Anwendungen, 21(3):783–802, 2002. ISSN 0232-2064.

[4] A. Bartel. Partial Differential-Algebraic Models in Chip Design – Thermal and Semiconductor Problems. PhD thesis, Bergische Universität Wuppertal, 2004. Fortschritt-Berichte VDI, Reihe 20, Nr. 391.

[5] A. Bartel, M. Günther, and A. Kværnø. Multirate methods in electrical circuit simulation. In Progress in Industrial Mathematics at ECMI 2000, Mathematics in Industry 1, pages 258–265. Springer, 2002.

[6] F. Bashforth and J.C. Adams. An attempt to test the theories of capillary action by comparing the theoretical and measured forms of drops of fluid. With an explanation of the method of integration employed in constructing the tables which give the theoretical form of such drops. Cambridge University Press, 1883.

[7] T.A. Bickart and Z. Picel. High order stiffly stable composite multistep methods for numerical integration of stiff differential equations. BIT, 13:272–286, 1973.

[8] Blas. Blas – Basic Linear Algebra Subprograms, www.netlib.org/blas.

[9] F. Bornemann. Runge-Kutta methods, trees, and Maple. Selçuk J. Appl. Math., 2:3–15, 2001.


[10] K.E. Brenan, S.L. Campbell, and L.R. Petzold. Numerical solution of initial-value problems in differential-algebraic equations, volume 14 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), 1996. ISBN 0-89871-353-6.

[11] K.E. Brenan and B.E. Engquist. Backward differentiation approximations of nonlinear differential/algebraic equations. Math. Comp., 51:659–676, 1988.

[12] K. Burrage and J.C. Butcher. Stability criteria for implicit Runge-Kutta methods. SIAM J. Numer. Anal., 16:46–57, 1979.

[13] K. Burrage and J.C. Butcher. Non-linear stability for a general class of differential equation methods. BIT, 20:326–340, 1980.

[14] K. Burrage, J.C. Butcher, and F.H. Chipman. An implementation of singly-implicit Runge-Kutta methods. BIT, 20:326–340, 1980.

[15] J.C. Butcher. Coefficients for the study of Runge-Kutta integration processes. J. Austral. Math. Soc., 3:185–201, 1963.

[16] J.C. Butcher. Implicit Runge-Kutta processes. Math. Comp., 18:50–64, 1964.

[17] J.C. Butcher. Integration processes based on Radau quadrature formulas. Math. Comp., 18:233–244, 1964.

[18] J.C. Butcher. On the convergence of numerical solutions of ordinary differential equations. Math. Comp., 20:1–10, 1966.

[19] J.C. Butcher. A stability property of implicit Runge-Kutta methods. BIT, 15:358–361, 1975.

[20] J.C. Butcher. Stability properties for a general class of methods for ordinary differential equations. SIAM J. Numer. Anal., 18(1):34–44, 1981.

[21] J.C. Butcher. Linear and non-linear stability for general linear methods. BIT, 27:182–189, 1987.

[22] J.C. Butcher. The numerical analysis of ordinary differential equations, Runge-Kutta and general linear methods. Wiley, Chichester and New York, 1987.

[23] J.C. Butcher. Diagonally-implicit multi-stage integration methods. Appl. Numer. Math., 11:347–363, 1993.

[24] J.C. Butcher. Miniature 21: R gave me a DAE underneath the Linden tree, 2003. www.math.auckland.ac.nz/∼butcher/miniature/miniature21.pdf.

[25] J.C. Butcher. Numerical methods for ordinary differential equations. John Wiley & Sons Ltd., Chichester, 2003. ISBN 0-471-96758-0.


[26] J.C. Butcher. General linear methods. To appear in Acta Numerica, 2006.

[27] J.C. Butcher and P. Chartier. The construction of Dimsims for stiff ODEs and DAEs. Technical Report 308, The University of Auckland, 1994.

[28] J.C. Butcher and P. Chartier. Parallel general linear methods for stiff ordinary differential and differential algebraic equations. Appl. Numer. Math., 17:213–222, 1995.

[29] J.C. Butcher, P. Chartier, and Z. Jackiewicz. Nordsieck representation of Dimsims. Numer. Algorithms, 16:209–230, 1997.

[30] J.C. Butcher, P. Chartier, and Z. Jackiewicz. Experiments with a variable-order type 1 Dimsim code. Numer. Algorithms, 22:237–261, 1999.

[31] J.C. Butcher and A.D. Heard. Stability of numerical methods for ordinary differential equations. Numer. Algorithms, pages 47–58, 2002.

[32] J.C. Butcher and Z. Jackiewicz. Construction of diagonally implicit general linear methods for ordinary differential equations. BIT, 33:452–472, 1996.

[33] J.C. Butcher and Z. Jackiewicz. Construction of diagonally implicit general linear methods of type 1 and 2 for ordinary differential equations. Appl. Numer. Math., 21:385–415, 1996.

[34] J.C. Butcher and Z. Jackiewicz. Construction of high order diagonally implicit multistage integration methods for ordinary differential equations. BIT, 33:452–472, 1996.

[35] J.C. Butcher and Z. Jackiewicz. Implementation of diagonally implicit multistage integration methods for ordinary differential equations. SIAM J. Numer. Anal., 34(6):2110–2141, 1997.

[36] J.C. Butcher and Z. Jackiewicz. A reliable error estimation for diagonally implicit multistage integration methods. BIT, 41(4):656–665, 2001.

[37] J.C. Butcher and Z. Jackiewicz. A new approach to error estimation for general linear methods. Numer. Math., 95:487–502, 2003.

[38] J.C. Butcher and Z. Jackiewicz. Unconditionally stable general linear methods for ordinary differential equations. BIT, 44:557–570, 2004.

[39] J.C. Butcher, Z. Jackiewicz, and H.D. Mittelmann. A nonlinear optimization approach to the construction of general linear methods of high order. J. Comput. Appl. Math., 81:181–196, 1997.

[40] J.C. Butcher and H. Podhaisky. On error estimation in general linear methods for stiff ODEs. Appl. Numer. Math., 2005. Accepted for publication.


[41] S.L. Campbell. A procedure for analyzing a class of nonlinear semistate equations that arise in circuit and control problems. IEEE Trans. Circuits and Systems, pages 256–261, 1981.

[42] S.L. Campbell. A general form for solvable linear time varying singular systems of differential equations. SIAM J. Math. Anal., 18(4):1101–1115, 1987.

[43] S.L. Campbell and C.W. Gear. The index of general nonlinear DAEs. Numer. Math., 72:173–196, 1995.

[44] S.L. Campbell and E. Griepentrog. Solvability of general differential algebraic equations. SIAM J. Sci. Comput., 16(2):257–270, 1995.

[45] J. Cash. The integration of stiff initial value problems in o.d.e.s using modified extended backward differentiation formulae. Comput. Math. Appl., 9:645–657, 1983.

[46] P. Chartier. General linear methods for differential-algebraic equations of index one and two. Technical Report 1968, Institut national de recherche en informatique et en automatique, 1993.

[47] L.O. Chua, C.A. Desoer, and E.S. Kuh. Linear and nonlinear circuits. McGraw Hill Book Company, Singapore, 1987.

[48] C.F. Curtiss and J.O. Hirschfelder. Integration of stiff equations. Proc. Nat. Acad. Sci., 38:235–243, 1952.

[49] G. Dahlquist. A special stability problem for linear multistep methods. BIT, pages 27–43, 1963.

[50] G. Dahlquist. Error analysis of a class of methods for stiff nonlinear initial value problems. In Numerical Analysis, Dundee, Lecture Notes in Mathematics, 506, pages 60–74, 1976.

[51] G. Dahlquist. G-stability is equivalent to A-stability. BIT, 18:384–401, 1978.

[52] K. Dekker and J.G. Verwer. Stability of Runge-Kutta methods for stiff nonlinear differential equations. CWI Monographs. North Holland, 1984.

[53] C.A. Desoer and E.S. Kuh. Basic Circuit Theory. McGraw Hill Book Company, 1969.

[54] P. Deuflhard, E. Hairer, and J. Zugck. One-step and extrapolation methods for differential-algebraic equations. Numer. Math., 51:501–516, 1987.

[55] H. Döhring. Traktabilitätsindex und Eigenschaften von matrixwertigen Riccati-Typ Algebrodifferentialgleichungen. Master's thesis, Humboldt-Universität zu Berlin, 2004.


[56] R.C. Dorf. Introduction to electric circuits. John Wiley & Sons, 2nd edition, 1989.

[57] J.R. Dormand and P.J. Prince. A family of embedded Runge-Kutta formulae. J. Comput. Appl. Math., 6:19–26, 1980.

[58] B.L. Ehle. On Padé approximations to the exponential function and A-stable methods for the numerical solution of initial value problems. Technical report, Dept. AACS, University of Waterloo, Ontario, Canada, 1969. Research Report CSRR 2010.

[59] E. Eich-Soellner and C. Führer. Numerical Methods in Multibody Dynamics. Teubner, Stuttgart, 1998. (Since 2002 available as a reprint directly from the authors).

[60] R.F. Enenkel and K.R. Jackson. Dimsems – diagonally implicit single-eigenvalue methods for the numerical solution of ODEs on parallel computers. Adv. Comput. Math., 7(1-2):97–133, 1997.

[61] D. Estévez Schwarz. Consistent initialization for index-2 differential algebraic equations and its application to circuit simulation. PhD thesis, Humboldt-Universität zu Berlin, 2000. edoc.hu-berlin.de/dissertationen/estevez-schwarz-diana-2000-07-13/PDF/Estevez-Schwarz.pdf.

[62] D. Estévez Schwarz, U. Feldmann, R. März, S. Sturzel, and C. Tischendorf. Finding beneficial DAE structures in circuit simulation. In W. Jäger and H.J. Krebs, editors, Mathematics – Key technology for the future, pages 413–428. Springer, Berlin, 2003.

[63] D. Estévez Schwarz and C. Tischendorf. Structural analysis of electric circuits and consequences for MNA. Internat. J. Circuit Theory Appl., 28:131–162, 2000.

[64] U. Feldmann. Private communication, 2005.

[65] U. Feldmann and M. Günther. Some remarks on regularization of circuit equations. In W. Mathis, editor, X. International Symposium on Theoretical Electrical Engineering, pages 343–348, 1999.

[66] U. Feldmann, M. Günther, and J. ter Maten. Handbook of Numerical Analysis, Volume XIII: Numerical Methods in Electromagnetics, chapter: Modelling and Discretization of Circuit Problems. North-Holland, 2005.

[67] O. Forster. Analysis 2. Differentialrechnung im Rn, gewöhnliche Differentialgleichungen. Vieweg, 1984.

[68] F.R. Gantmacher. The theory of matrices. Chelsea Pub. Co., New York, 1959.


[69] C.W. Gear. Hybrid methods for initial value problems in ordinary differential equations. SIAM J. Numer. Anal., 2:69–86, 1965.

[70] C.W. Gear. Numerical initial value problems in ordinary differential equations. Prentice Hall, 1971.

[71] C.W. Gear. Simultaneous numerical solution of differential-algebraic equations. IEEE Trans. Circuit Theory, CT-18(1):89–95, 1971.

[72] C.W. Gear. Differential-algebraic equation index transformations. SIAM J. Sci. Statist. Comput., 9:39–47, 1988.

[73] C.W. Gear and L.R. Petzold. Differential/algebraic systems and matrix pencils. In B. Kågström and A. Ruhe, editors, Matrix Pencils, Lecture Notes in Mathematics 973, pages 75–89. Springer Verlag, 1983.

[74] R. Gerstberger and M. Günther. Charge-oriented extrapolation methods in digital circuit simulation. Appl. Numer. Math., 18:115–125, 1995.

[75] E. Griepentrog and R. März. Differential-algebraic equations and their numerical treatment. Teubner, Leipzig, 1986.

[76] N. Guglielmi and M. Zennaro. On the asymptotic properties of a family of matrices. Linear Algebra Appl., 322:169–192, 2001.

[77] N. Guglielmi and M. Zennaro. On the zero-stability of variable stepsize multistep methods: the spectral radius approach. Numer. Math., pages 445–458, 2001.

[78] M. Günther. Ladungsorientierte Rosenbrock-Wanner-Methoden zur numerischen Simulation digitaler Schaltungen. PhD thesis, TU München, 1995. (ISBN 3-18-316820-0, VDI-Verlag, Düsseldorf).

[79] M. Günther. Simulating digital circuits numerically – a charge oriented ROW approach. Numer. Math., 79:203–212, 1998.

[80] M. Günther and U. Feldmann. CAD based electrical circuit modelling in industry I: Mathematical structure and index of network equations. Surveys Math. Indust., 8:97–129, 1999.

[81] M. Günther, U. Feldmann, and P. Rentrop. Choral – a one-step method as numerical low pass filter in electrical network analysis. In U. van Rienen et al., editors, SCEE 2000, number 18 in Lecture Notes Comp. Sci. Eng., pages 199–215. Springer, Berlin, 2001.

[82] M. Günther, M. Hoschek, and R. Weiner. ROW methods adapted to a cheap Jacobian. Appl. Numer. Math., 37:231–240, 2001.

[83] M. Günther and P. Rentrop. The NAND-gate – a benchmark for the numerical simulation of digital circuits. In W. Mathis and P. Noll, editors, 2. ITG-Diskussionssitzung "Neue Anwendungen Theoretischer Konzepte in der Elektrotechnik" - mit Gedenksitzung zum 50. Todestag von Wilhelm Cauer, pages 27–33. VDE-Verlag, Berlin, 1996.

[84] E. Hairer, C. Lubich, and R. Roche. The numerical solution of differential-algebraic systems by Runge-Kutta methods, volume 1409 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 1989. ISBN 3-540-51860-6.

[85] E. Hairer and G. Wanner. Solving ordinary differential equations II: stiff and differential algebraic problems. Springer, Berlin Heidelberg New York Tokyo, 2nd edition, 1996.

[86] E. Hairer and G. Wanner. Stiff differential equations solved by Radau methods. J. Comput. Appl. Math., 111:93–111, 1999.

[87] K. Heun. Neue Methoden zur approximativen Integration der Differentialgleichungen einer unabhängigen Veränderlichen. Zeit. Math. Phys., 45:23–38, 1900.

[88] I. Higueras and R. März. Differential algebraic equations with properly stated leading terms. Comput. Math. Appl., 48(1-2):215–235, 2004.

[89] I. Higueras, R. März, and C. Tischendorf. Stability preserving integration of index-1 DAEs. Appl. Numer. Math., 45(2-3):175–200, 2003. ISSN 0168-9274.

[90] I. Higueras, R. März, and C. Tischendorf. Stability preserving integration of index-2 DAEs. Appl. Numer. Math., 45(2-3):201–229, 2003. ISSN 0168-9274.

[91] R. Horn and C. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.

[92] E.H. Horneber. Analyse nichtlinearer RLCÜ-Netzwerke mit Hilfe der gemischten Potentialfunktion mit einer systematischen Darstellung der Analyse nichtlinearer dynamischer Netzwerke. PhD thesis, Universität Kaiserslautern, 1976.

[93] S.J.Y. Huang. Implementation of general linear methods for stiff ordinary differential equations. PhD thesis, The University of Auckland, 2005. www.math.auckland.ac.nz/∼butcher/theses/shirleyhuang.pdf.

[94] Z. Jackiewicz. Implementation of Dimsims for stiff differential systems. Appl. Numer. Math., 42:251–267, 2002.

[95] Z. Jackiewicz, H. Podhaisky, and R. Weiner. Construction of highly stable two-step W-methods for ordinary differential equations. Technical report, FB Mathematik und Informatik, Martin-Luther-Universität Halle-Wittenberg, 2002. www.mathematik.uni-halle.de/reports/sources/2002/02-23report.ps.


[96] P. Kaps and G. Wanner. A study of Rosenbrock-type methods of high order. Numer. Math., 38:279–298, 1981.

[97] L. Kronecker. Algebraische Reduktion der Scharen bilinearer Formen. Akademie der Wissenschaften Berlin, Werke vol. III, pages 141–155, 1890.

[98] P. Kunkel and V. Mehrmann. Canonical forms for linear differential algebraic equations with variable coefficients. J. Comput. Appl. Math., 56:225–251, 1994.

[99] P. Kunkel and V. Mehrmann. Regular solutions of nonlinear differential-algebraic equations and their numerical determination. Numer. Math., 79(4):581–600, 1998. ISSN 0029-599X.

[100] P. Kunkel and V. Mehrmann. Index reduction for differential-algebraic equations by minimal extension. Z. Angew. Math. Mech., 84:579–597, 2004.

[101] P. Kunkel and V. Mehrmann. Differential-Algebraic Equations. Analysis and Numerical Solution. EMS Publishing House, Zürich, Switzerland, 2006. To appear.

[102] P. Kunkel, V. Mehrmann, and I. Seufer. Genda: A software package for the numerical solution of general nonlinear differential-algebraic equations. Technical Report 730, Technische Universität Berlin, 2002. www.math.tu-berlin.de/preprints/files/dgenda.ps.gz.

[103] W. Kutta. Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Zeit. Math. Phys., 46:435–453, 1901.

[104] A. Kværnø. Runge-Kutta methods applied to fully implicit differential-algebraic equations of index 1. Math. Comp., 54(190):583–625, 1990. ISSN 0025-5718.

[105] A. Kværnø. Singly diagonally implicit Runge-Kutta methods with an explicit first stage. BIT, 44:489–502, 2004.

[106] A. Kværnø, S.P. Nørsett, and B. Owren. Runge-Kutta research in Trondheim. Appl. Numer. Math., 22(1-3):263–277, 1996.

[107] R. Lamour. Index determination and calculation of consistent initial values for DAEs. Comput. Math. Appl., 50(7):1125–1140, 2005.

[108] R. Lamour. Strangenessindex + 1 = Traktabilitätsindex? 7. Workshop über Deskriptorsysteme, 2005. www.math.tu-berlin.de/numerik/mt/NumMat/Meetings/0503 Descriptor.

[109] Lapack. Lapack – Linear Algebra PACKage, www.netlib.org/lapack.


[110] W. Liniger and R.A. Willoughby. Efficient integration methods for stiff systems of ordinary differential equations. SIAM J. Numer. Anal., 7:47–66, 1970.

[111] P. Lötstedt and L.R. Petzold. Numerical solution of nonlinear differential equations with algebraic constraints. I. Convergence results for backward differentiation formulas. Math. Comp., 46(174):491–516, 1986. ISSN 0025-5718.

[112] Maple. Maple, Waterloo Maple Inc., www.maplesoft.com.

[113] R. März. Numerical methods for differential-algebraic equations. Acta Numerica, pages 141–198, 1992.

[114] R. März. Nonlinear differential-algebraic equations with properly formulated leading terms. Technical Report 01-3, Humboldt-Universität zu Berlin, 2001. www.math.hu-berlin.de/publ/pre/2001/P-01-3.ps.

[115] R. März. Differential algebraic systems anew. Appl. Numer. Math., 42(1-3):315–335, 2002. ISSN 0168-9274.

[116] R. März. The index of linear differential algebraic equations with properly stated leading terms. Results Math., 42:308–338, 2002.

[117] R. März. Differential algebraic systems with properly stated leading term and MNA equations. In Modeling, simulation, and optimization of integrated circuits (Oberwolfach, 2001), volume 146 of Internat. Ser. Numer. Math., pages 135–151. Birkhäuser, Basel, 2003.

[118] R. März. Fine decouplings of regular differential-algebraic equations. Results Math., 46:57–72, 2004.

[119] R. März. Projectors for matrix pencils. Technical report, Humboldt-Universität zu Berlin, 2004. www.math.hu-berlin.de/publ/pre/2004/P-04-24.ps.

[120] R. März. Solvability of linear differential algebraic equations with properly stated leading terms. Results Math., 45(1–2):88–105, 2004.

[121] R. März. Characterizing differential algebraic equations without the use of derivative arrays. Comput. Math. Appl., 50(7):1141–1156, 2005.

[122] R. März and C. Tischendorf. Solving more general index-2 differential algebraic equations. Comput. Math. Appl., 28(10–12):77–105, 1994.

[123] Matlab. Matlab, The MathWorks, www.mathworks.com.

[124] F. Mazzia and F. Iavernaro. Test Set for Initial Value Problem Solvers. Department of Mathematics, University of Bari, 2003. www.dm.uniba.it/∼testset.


[125] Minpack. Minpack, www.netlib.org/minpack.

[126] F.R. Moulton. New methods in exterior ballistics. University of Chicago Press, 1926.

[127] A. Nordsieck. On numerical integration of ordinary differential equations. Math. Comp., 16:22–49, 1962.

[128] L.R. Petzold. A description of DASSL: A differential/algebraic system solver. In Proceedings of IMACS World Congress, 1982.

[129] H. Podhaisky. Parallele Zweischritt-W-Methoden. PhD thesis, Martin-Luther-Universität Halle-Wittenberg, 2002. deposit.ddb.de/cgi-bin/dokserv?idn=964158841.

[130] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical recipes in Fortran. Cambridge University Press, 2nd edition, 1992.

[131] A. Prothero and A. Robinson. On the stability and accuracy of one-step methods for solving stiff systems of ordinary differential equations. Math. Comp., 28:145–162, 1974.

[132] R. Pulch. Finite difference methods for multi time scale differential algebraic equations. Z. Angew. Math. Mech., 83:571–583, 2003.

[133] P. Rabier and W. Rheinboldt. Theoretical and Numerical Analysis of Differential-Algebraic Equations, volume VIII. Elsevier Science B.V., 2002. Edited by P.G. Ciarlet and J.L. Lions.

[134] K. Radhakrishnan and A.C. Hindmarsh. Description and use of Lsode, the Livermore Solver for Ordinary Differential Equations. Lawrence Livermore National Laboratory Report Series, 1993. URL www.llnl.gov/tid/lof/documents/pdf/240148.pdf.

[135] N. Rattenbury. Almost Runge-Kutta methods for stiff and non-stiff problems. PhD thesis, The University of Auckland, 2005.

[136] G. Reissig. Beiträge zur Theorie und Anwendung impliziter Differentialgleichungen. PhD thesis, TU Dresden, 1998.

[137] P. Rentrop, M. Roche, and G. Steinebach. The application of Rosenbrock-Wanner type methods with stepsize control in differential-algebraic equations. Numer. Math., 55:545–563, 1989.

[138] R. Riaza and C. Tischendorf. Topological analysis of qualitative features in electrical circuit theory. In Scientific Computing in Electrical Engineering, 2004. URL www.matheon.de. To appear.

[139] H.H. Robertson. The solution of a set of reaction rate equations. Academic Press, pages 178–182, 1966.


[140] M. Roche. Rosenbrock methods for differential algebraic equations. Nu-mer. Math., 52:45–63, 1988.

[141] S.P. Nørsett. Semi explicit Runge-Kutta methods. Technical report,Dept. Math. Univ. Trondheim, 1974. Report No. 6/74, ISBN 82-7151-009-6.

[142] C. Runge. Uber die numerische Auflosung von Differentialgleichungen.Math. Ann., 46:167–178, 1895.

[143] S. Schneider. Convergence of general linear methods on differential-algebraic systems of index 3. BIT, 37(2):424–441, 1997.

[144] S. Schulz. General linear methods for linear DAEs. Technical Report03-10, Humboldt Universitat zu Berlin, 2003. www.math.hu-berlin.de/∼steffen/publications.html.

[145] S. Schulz. Four lectures on differential algebraic equations. TechnicalReport 497, The University of Auckland, New Zealand, 2003. www.math.hu-berlin.de/∼steffen/publications.html.

[146] I. Schumilina. Charakterisierung der Algebro-Differentialgleichungen mitTraktabilitatsindex 3. PhD thesis, Humboldt Universitat zu Berlin, 2004.

[147] E.R. Sieber, U. Feldmann, R. Schultz, and H. Wriedt. Timestep con-trol for charge conserving integration in circuit simulation. In R. E.Bank, R. Bulirsch, H. Gajewski, and K. Merten, editors, Mathematicalmodelling and simulation of electrical circuits and semiconductor devices,volume 117 of Int. Series of Num. Mathematics. Birkhauser, Basel, 1994.

[148] G. Söderlind. Automatic control and adaptive time-stepping. Numer. Algorithms, 31(1–4):281–310, 2002.

[149] G. Söderlind. Digital filters in adaptive time-stepping. ACM Trans. Math. Software, 29(1):1–26, 2003.

[150] Suse. The openSUSE project, www.opensuse.org.

[151] C. Tischendorf. Benchmark: NAND gate. www.math.hu-berlin.de/~caren/nandgatter.ps.

[152] C. Tischendorf. Feasibility and stability behaviour of the BDF applied to index-2 differential algebraic equations. Z. Angew. Math. Mech., 75(12):927–946, 1995. ISSN 0044-2267.

[153] C. Tischendorf. Solution of index-2 differential algebraic equations and its application in circuit simulation. PhD thesis, Humboldt-Universität zu Berlin, 1996.

[154] C. Tischendorf. Topological index calculation of differential algebraic equations in circuit simulation. Surveys Math. Indust., 8:187–199, 1999.


[155] C. Tischendorf. Numerische Simulation elektrischer Netzwerke. Lecture Notes, www.math.hu-berlin.de/~caren/schaltungen.pdf, 2000.

[156] S. Voigtmann. General linear methods for nonlinear DAEs in circuit simulation. In Scientific Computing in Electrical Engineering, 2004. URL www.matheon.de. To appear.

[157] S. Voigtmann. Accessible criteria for the local existence and uniqueness of DAE solutions. Technical Report 214, Matheon, 2005. URL www.matheon.de. www.math.hu-berlin.de/~steffen/publications.html.

[158] W. Wright. General linear methods with inherent Runge-Kutta stability. PhD thesis, The University of Auckland, New Zealand, 2003. www.math.auckland.ac.nz/~butcher/theses/willwright.pdf.

[159] ZIB: Konrad-Zuse-Zentrum für Informationstechnik Berlin, www.zib.de/index.en.html; for Limex: www.zib.de/Numerik/numsoft/CodeLib/codes/limex.tar.gz.

[160] G. Zielke. Motivation und Darstellung von verallgemeinerten Matrixinversen. Beiträge Numer. Math., 7:177–218, 1979.


Selbständigkeitserklärung (Declaration of Authorship)

I hereby declare that I have written this dissertation independently and that no sources or aids other than those indicated have been used.

Ottobrunn, 20.08.2006, Steffen Voigtmann