
Hybrid powertrain online control using reinforcement learning

Master's Thesis to obtain the academic degree of

Diplom-Ingenieur in the Master's Program

Mechatronics

Submitted by Thomas Leidinger, BSc

Submitted at Institute for Design and Control of Mechatronical Systems

Supervisor Univ.-Prof. DI Dr. Luigi del Re

September 2021

JOHANNES KEPLER UNIVERSITY LINZ, Altenbergerstraße 69, 4040 Linz, Austria, www.jku.at, DVR 0093696


Statutory Declaration (Eidesstattliche Erklärung)

I hereby declare under oath that I have written this thesis independently and without outside help, that I have not used any sources or aids other than those indicated, and that I have marked all passages taken literally or in substance from other sources as such.

This thesis is identical to the electronically submitted text document.

Linz, September 1, 2021

Thomas Leidinger, BSc


Abstract

Nowadays, the general environmental problems of energy waste and greenhouse gas emissions represent an important topic worldwide. Hence, reducing the air pollution and fuel consumption of vehicles plays a major role in improving the current situation. In the field of transportation, the Hybrid Electric Vehicle (HEV) is one approach to reduce greenhouse gas emissions by drawing power from two sources. The main task of the energy management of an HEV is therefore the power split between the electrical machine/battery and the engine. Recharging the battery in appropriate parts of the driving cycle further supports the HEV in reaching these goals. The best results are obtained if full information is known in advance, so that a global solution can be calculated, e.g. with Dynamic Programming (DP). However, the computational burden prevents this method from being used in changing environments (driving cycles), where a new solution has to be recalculated.

The machine learning method Reinforcement Learning (RL) can obtain a comparable solution while reducing the computational burden to a large extent. This approach learns a policy by visiting different states through executing different actions. In this thesis, a Deep Reinforcement Learning (DRL) method is developed to handle the case of 3 actions/input values for the HEV model. A Neural Network is used as a function approximator in this complex problem, which approximates a policy even for unvisited states. Further, an action shield is used to prevent the agent from selecting actions that would lead to infeasible solutions.

In the beginning, the agents (controllers) are trained in 4 different environments (4 different city driving cycles), respectively. The DRL method is shown to deliver results very comparable to those of DP in all 4 cases. In the next step, a trained agent is applied to the other 3 driving cycles to test its performance in a changing/uncertain environment. It shows that the agents, although trained in other environments, can adapt flexibly to new environments and deliver good results within a few seconds. This application simulates the online approach by computing solutions in unknown environments.


Kurzfassung

Energy waste and greenhouse gas emissions are among the major problems of our time, which makes the air pollution and fuel consumption of vehicles with a pure combustion engine an important concern. The use of hybrid vehicles reduces the fuel consumption compared to a vehicle with a pure combustion engine, and thereby also factors such as pollutant emissions and crude-oil consumption. The main task of the control of a hybrid vehicle is to perform the power split between the electrical machine/battery and the combustion engine. Recharging the battery at suitable points of the driving cycle further helps the hybrid vehicle to avoid consuming additional fuel. The best results can be achieved when all information is known in advance and a global solution can be calculated, e.g. with Dynamic Programming (DP). The enormous computation time of this method, however, makes it impossible to quickly adapt the solution to a changed environment, i.e. a changed driving cycle.

With the help of the machine learning method Reinforcement Learning it is possible to generate comparable solutions while reducing the computational effort considerably. In this approach, the solution is computed by repeated learning over different states, which are reached by executing different actions. In this thesis, a Deep Reinforcement Learning (DRL) method is implemented that uses 3 actions/inputs of the HEV model. This method contains a Neural Network which approximates the functions required in the computation, so that states that did not occur during the learning process can also be handled. As a result, the method can be applied in changed environments to compute a new solution. Furthermore, an action shield is used which prevents the controller (agent) from generating infeasible solutions.

At the beginning, 4 different agents (controllers) are trained in 4 different environments (driving cycles) and the results are compared with those of Dynamic Programming. The DRL method delivers comparable results in all 4 cases. In the next step, a trained agent is applied to the other environments. The results deviate only slightly from the previous ones, but this approach has the advantage of being computable within a few seconds. This procedure simulates the online application, since the agent computes solutions in unknown environments.


Contents

1 Introduction
  1.1 Motivation
  1.2 Related work
  1.3 Purpose
  1.4 Thesis outline

2 Theory
  2.1 Reinforcement Learning (RL)
    2.1.1 Markov Decision Process (MDP)
    2.1.2 Discounted Expected Reward
    2.1.3 Policy and Value Functions
    2.1.4 Q-Learning
  2.2 Basic grid world example
    2.2.1 Reinforcement Learning in a basic grid world
    2.2.2 Dynamic programming in a basic grid world
    2.2.3 RL vs. DP in a basic grid world
    2.2.4 Reinforcement learning in a changing grid world
    2.2.5 Dynamic programming in a changing grid world
    2.2.6 RL vs. DP in a changing grid world
    2.2.7 Conclusion of the grid world examples
  2.3 Deep Neural Networks
    2.3.1 Input Layer
    2.3.2 Hidden Layer
    2.3.3 Output Layer
    2.3.4 Neural Network Algorithm
  2.4 Deep Reinforcement Learning (DRL)
    2.4.1 Deep Q-Network (DQN)
    2.4.2 Experience Replay
    2.4.3 Target Network
    2.4.4 Double DQN
    2.4.5 Deep Q-Learning algorithm

3 HEV Problem Formulation
  3.1 Problem Description
    3.1.1 Optimization Problem Formulation
  3.2 Hybrid Electric Vehicle Model
    3.2.1 Vehicle Model
    3.2.2 Powertrain Model
  3.3 Test Case - Driving Cycle

4 Method and Implementation
  4.1 Environment
  4.2 HEV Agent
    4.2.1 Shield
  4.3 Observation Space
  4.4 Action Space
  4.5 Reward Function
  4.6 Neural Network
  4.7 Hyperparameter and Implementation Algorithm

5 Results
  5.1 Learning
  5.2 Results - DQN RL
    5.2.1 1 action/input
    5.2.2 3 actions/inputs
  5.3 Comparison with Dynamic Programming
    5.3.1 1000s driving cycles
    5.3.2 Whole driving cycles
  5.4 Online Application

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work

A HEV Model Parameter

Bibliography


Chapter 1

Introduction

This chapter provides a general introduction to the topic and gives an overview of the thesis. In 1.1, the motivation for using reinforcement learning to optimize the energy management of a hybrid electric vehicle is given. The second subchapter covers the related work done so far, including methods that are important for calculating an acceptable solution in this thesis. In 1.3, the goal of the work and the basic structure used to solve the problem and compare the results are stated. The last subchapter gives the outline of this study.

1.1 Motivation

Nowadays, engine emissions and fuel consumption play a central role in the discussion about climate change. To reduce these emissions and to act in a resource-saving way, alternative transport methods become more and more interesting. One such type of vehicle is the HEV (hybrid electric vehicle), which has the advantage of deciding between two sources of power, the chemical source (engine) and the electrical source (battery). The main task to be optimized is the power split between these two sources such that the fuel consumption is minimal while simultaneously fulfilling the charge-sustaining task of the battery. The secondary goals are to minimize the gear shifts and the clutch shifts of the HEV.

The best results can be achieved if the full future information, such as the whole driving cycle, the changing traffic, etc., is known in advance, but in reality it is impossible to have this knowledge beforehand. Dynamic programming (DP) [1] would deliver the best results for optimization problems with full prior information. However, it also has to be mentioned that DP needs a long computation time to obtain the optimal control strategy because of the enormous computational burden of solving such a problem with a large space of possible states. It is clear that DP cannot be used in online control problems where the solution must be re-calculated quickly when the environment changes.

Several online control strategies for HEVs have been proposed, such as Approximate dynamic programming (ADP) [1], Model predictive control (MPC) [2], Pontryagin's minimum principle (PMP) [3], etc. Reinforcement learning (RL), as a branch of machine learning, becomes more and more important, and the number of applications increases constantly, for example, video games [4], medical applications like cancer detection [5], robots [6], autonomous driving/flying [7], etc., and recently the HEV field too. Several works of literature state, as seen in the next subchapter, that by using RL comparable results to the DP solutions can be achieved. Besides the advantage of reducing the computational burden to a large extent, it is also possible to derive good control policies without prior knowledge of the cycle.

1.2 Related work

As mentioned above, there are already many control strategies in the field of hybrid electric vehicles. A good explanation of approximate dynamic programming can be found in the research paper of Johannesson, "Approximate Dynamic Programming Applied to Parallel Hybrid Powertrain" [8]. It explains ADP and the resulting improvement of the computational efficiency of DP obtained by local linear approximation of the value function.

One of the most important RL references is the book by Sutton and Barto, "Reinforcement Learning: An Introduction" [9]. It explains the basics of Reinforcement learning as well as the different RL methods in detail. The following papers are close to this thesis in that they consider HEVs with their different hybrid architectures and different control strategies.

The work of Qi, "Data-Driven Reinforcement Learning-Based Real-Time Energy Management System for Plug-In Hybrid Electric Vehicles" [10], gives a good overview of the trade-off that RL can make between real-time performance and optimality in the energy management of plug-in HEVs. It is mentioned that RL can achieve a good balance in between because the model can be implemented in real-time without any effort for prediction. This data-driven approach does not need any HEV model information once it is well trained.

In "Deep Reinforcement Learning for Advanced Energy Management of Hybrid Electric Vehicles" [11] by Liessner, Deep Reinforcement learning is applied to HEVs with a controller (agent) that has one action to execute. The choice of gear, for example, is implemented as a heuristic and is not influenced by the agent. This is a difference from the objective of this work, where the agent has to find the best policy for more actions, which results in a huge space of possible solutions. Therefore, the approach in this thesis uses a more complex and realistic model which is expected to give more realistic results than a simplified model. An implementation without a Neural Network (to approximate values) would not be able to handle a value that was not processed during training. Further, the used action shield is designed for 3 actions, which is different from the simplified one-action approach [12].

1.3 Purpose

The objective of this thesis is to develop an online control strategy for a parallel hybrid powertrain using Reinforcement learning. More specifically, the thesis investigates a Deep Q-learning algorithm (where a Neural Network is applied to approximate values) applied to a hybrid electric vehicle. The actions (inputs) of the system are the gear shift command, the coupler between the battery and the engine, and the clutch shift command. The states of the HEV model are the gear position, the clutch position and the state of charge of the battery. The results are compared with the benchmark results of DP. The validation is performed on driving cycles that are different from the driving cycle on which the strategy has been trained. This approach simulates the online application because of the unknown information which has to be handled.

1.4 Thesis outline

The remainder of this thesis is organized as follows. In the second chapter, Reinforcement learning is explained, starting with the basics and the different methods used. Afterwards, a grid world example helps to understand the advantages of RL. Chapter 3 covers the hybrid electric vehicle problem formulation, which includes an explanation of the used HEV model. In chapter 4, the implementation of the Deep RL methods applied to the problem is investigated and the differences are mentioned. In the next chapter (5), the results of Deep RL applied to the HEV are shown and compared to DP and other optimization strategies. Finally, the thesis is summarized with a conclusion and the future work.


Chapter 2

Theory

This chapter provides the basic information about Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) and is based on the book by Sutton and Barto [9]. The next subchapter presents a simple RL example to show the benefits compared to Dynamic Programming (DP). To complete the concepts of DRL, it is also necessary to focus on Neural Networks, which take over the task of function approximation in this method.

2.1 Reinforcement Learning (RL)

In Reinforcement Learning, the two main parts of the method are the agent (controller, learner, decision-maker) and the environment (system/plant). Anything outside the agent belongs to the environment. The agent interacts with the environment using states, actions and rewards to find an optimal control policy. The rewards are a tool to evaluate the performed actions. Figure 2.1 illustrates this behavior.

Figure 2.1: Reinforcement Learning: agent-environment interaction [9]

At time t, the environment is in state S_t. The agent observes S_t and takes an action A_t based on this state. The environment then makes a transition to S_{t+1} and emits a reward R_{t+1}. Then the next step t+1 is simulated. Therefore, the sequential process has the following trajectory:

S0, A0, R1, S1, A1, R2, S2, A2, R3, ... (2.1)


2.1.1 Markov Decision Process (MDP)

The probabilities of the possible states and reward values are described by an MDP if the state S_t ∈ S and the reward R_t ∈ R only depend on the previous state S_{t−1} and action A_{t−1} ∈ A(s). The past information of the interaction between the agent and the environment is included in the state. The stochastic process then has the Markov property. If the MDP is finite, the set of states S, the set of rewards R and the set of actions A(s) have a finite number of elements. A(s) is the feasible set of actions in state s. The following example helps to understand this.

Example: Suppose s′ ∈ S and r′ ∈ R. Then there is a probability that S_t = s′ and R_t = r′. This probability depends on the preceding state s ∈ S and action a ∈ A(s):

p(s′, r′|s, a) = Pr{St = s′, Rt = r′|St−1 = s,At−1 = a} (2.2)

The function p defines the dynamics of the Markov Decision Process. On the basis of this function, the state-transition probabilities are derived:

p(s′|s, a) = Pr{St = s′|St−1 = s,At−1 = a} (2.3)

The expected reward of a state-action pair gives the value of the reward for the considered state and action:

r(s, a) = E[Rt|St−1 = s,At−1 = a] (2.4)

With these definitions, an MDP is a 4-tuple consisting of the state space, the action space, the state-transition probabilities and the expected rewards.
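To make these definitions concrete, the following sketch (an illustrative Python example, not part of the thesis implementation) represents the dynamics p(s′, r | s, a) of a small, entirely hypothetical finite MDP as a lookup table and derives the state-transition probabilities (2.3) and the expected rewards (2.4) from it.

```python
from collections import defaultdict

# Dynamics p(s', r | s, a) of a tiny, made-up MDP:
# keys are (s, a), values map (s', r) to a probability.
p = {
    ("s0", "go"):   {("s1", -1.0): 0.8, ("s0", -1.0): 0.2},
    ("s0", "stay"): {("s0",  0.0): 1.0},
    ("s1", "go"):   {("s2", 10.0): 1.0},  # s2 is a terminal state
}

def state_transition_prob(s, a):
    """p(s' | s, a) = sum over r of p(s', r | s, a), cf. Eq. (2.3)."""
    probs = defaultdict(float)
    for (s_next, r), prob in p[(s, a)].items():
        probs[s_next] += prob
    return dict(probs)

def expected_reward(s, a):
    """r(s, a) = E[R_t | S_{t-1} = s, A_{t-1} = a], cf. Eq. (2.4)."""
    return sum(prob * r for (s_next, r), prob in p[(s, a)].items())

print(state_transition_prob("s0", "go"))  # {'s1': 0.8, 's0': 0.2}
print(expected_reward("s0", "go"))        # -1.0
```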

2.1.2 Discounted Expected Reward

In Reinforcement Learning, the goal of the agent is to maximize the expected return of rewards. This means that the total amount of rewards should be maximized over a training cycle (episode). The following equation shows the return of rewards for one episode with final time T:

G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... + R_T    (2.5)

If the episode has no final time step T, as in continuing tasks, it would not be possible to compute a return of rewards with this equation: the time T would be infinite and so would the return of rewards. For this reason, the discount rate γ ∈ [0, 1] is introduced. The goal of the agent is now to maximize the discounted expected return of rewards:

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}    (2.6)

In the case of γ < 1, the sum in the previous equation always takes a finite value if the reward sequence R_k is bounded. Generally, the smaller γ is, the more future rewards are discounted and the more important immediate rewards become. Therefore, the discount rate γ is also a tuning parameter for weighting the rewards.
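As a short illustrative numerical example (not taken from the thesis): for a constant reward R_{t+k+1} = 1 and γ = 0.9, the discounted return stays finite,

G_t = Σ_{k=0}^{∞} 0.9^k · 1 = 1 / (1 − 0.9) = 10,

whereas the undiscounted sum of equation (2.5) would diverge over an infinite horizon.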

2.1.3 Policy and Value Functions

Policies in Reinforcement Learning are functions that map a given state to probabilities of selecting each possible action from that state. To evaluate such a policy, value functions are used, which are functions of states or state-action pairs. These mathematical instruments evaluate the given state, or the performance of an action in a given state, by the future rewards that are expected (expected return of rewards), i.e. value functions depend on policies.

Policy π(a|s): If an agent follows policy π at time t, then π(a|s) is the probability that A_t = a given S_t = s.

The state-value function v_π for policy π evaluates any given state s for an agent following policy π. The value of this function is the expected return from starting in state s at time t and following policy π thereafter:

v_π(s) = E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]    (2.7)

The action-value function q_π for policy π evaluates any given action a in a given state s for an agent following policy π. The value of this function is the expected return from starting in state s at time t, taking action a and following policy π thereafter:

q_π(s, a) = E_π[G_t | S_t = s, A_t = a] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]    (2.8)

Reinforcement Learning algorithms learn optimal policies, which are evaluated with state-value and action-value functions, i.e. the optimal policy has an associated optimal state-value function:

v_*(s) = max_π v_π(s)    (2.9)

The optimal policy has an associated optimal action-value function (or Q-function):

q_*(s, a) = max_π q_π(s, a)    (2.10)


Generally: A policy π is considered to be better than or the same as a policy π′ if the expected return of π is greater than or equal to the expected return of π′ for all states s.

To obtain the optimal action-value function, it is necessary to introduce the Bellman equation. This equation provides a recursive relationship between states and actions and their successor states and actions. The optimal action-value function is computed with the optimal Bellman equation as follows:

q_*(s, a) = E[ R_{t+1} + γ max_{a′} q_*(s′, a′) | S_t = s, A_t = a ]    (2.11)

For any state-action pair (s, a) at time t, the expected return is the expected reward obtained from taking action a in state s, which is R_{t+1}, plus the maximum expected discounted return that can be achieved from any possible next state-action pair (s′, a′). The goal is now to compute q_*(s, a) with the help of an RL algorithm in order to find the action a′ for the state s′.

2.1.4 Q-Learning

The Q-Learning algorithm, an off-policy Temporal Difference algorithm, is one of the most important algorithms in Reinforcement Learning. Off-policy means that the learned action-value function Q directly approximates the optimal action-value function q_*(s, a) independently of the policy π being followed (an advantage in convergence proofs). The policy only influences which state-action pairs are visited and updated. Temporal Difference (TD) methods have the advantage that they only wait one time step to make an update and do not have to wait until an episode ends (as in Monte-Carlo methods). This makes TD methods applicable to online learning, and they do not need a model of the environment. The update rule is defined by:

Q(s, a) ← Q(s, a) + α [ R + γ max_{a′} Q(s′, a′) − Q(s, a) ]    (2.12)

where R + γ max_{a′} Q(s′, a′) is the TD target and the whole bracketed term is the TD error.

The parameter α ∈ [0, 1] represents the step size (learning rate) and weights the previous and the newly estimated value. The objective is to update the Q-value repeatedly in order to reduce the TD error. The updated Q-values for every state-action pair are stored in a so-called Q-table. Once the Q-function converges to the optimal Q-function, the optimal policy can be computed. The pseudo-code of the Q-Learning algorithm [9] is shown below:


Q-Learning algorithm:

Initialize Q(s, a), for all s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal state, ·) = 0

Loop for each episode:
    Initialize s
    Loop for each step of episode:
        Choose an action a from state s using the policy derived from Q (e.g., ε-greedy)
        Take action a, observe reward R and next state s′
        Q(s, a) ← Q(s, a) + α [ R + γ max_{a′} Q(s′, a′) − Q(s, a) ]
        s ← s′
    until s is terminal

The procedure of choosing an action from a state using a policy derived from Q is called the ε-greedy strategy. The task of this method is to make a trade-off between exploration and exploitation of the environment. If ε = 1 the agent only explores, if ε = 0 it only exploits. Therefore, at the beginning ε should be close to 1, so that the agent explores many of the actions it could take and obtains an evaluation of them. From episode to episode, ε should decrease, whereby the probability of exploiting increases. In the case of exploitation, the greedy action is selected to get the most reward according to the agent's current action-value estimates.
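The following sketch shows how the update rule (2.12) and the ε-greedy strategy fit together in a minimal tabular implementation. It is an illustrative Python example, not the MATLAB implementation used in this thesis, and it assumes a generic, hypothetical `env` object with `reset()` and `step(a)` methods.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99,
               eps_start=1.0, eps_end=0.05, eps_decay=0.995):
    """Tabular Q-learning with an epsilon-greedy behavior policy (assumed env interface)."""
    Q = defaultdict(float)              # Q[(s, a)], initialized to 0
    eps = eps_start
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # TD target and update rule (2.12)
            td_target = r + (0 if done else gamma * max(Q[(s_next, a_)] for a_ in actions))
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])
            s = s_next
        eps = max(eps_end, eps * eps_decay)   # shift from exploration to exploitation
    return Q
```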

2.2 Basic grid world example

2.2.1 Reinforcement Learning in a basic grid world

The first example of reinforcement learning in the Matlab toolbox [13] is a game in which the agent (red circle) wants to get from cell [2,1] (state 2) to the terminal state in cell [5,5] (state 25) with the maximal reward. The possible actions it can take are to go north, south, east and west, for each of which it gets a reward of -1. When it reaches the terminal state it gets a reward of +10. Moreover, there is an obstacle (see figure 2.2) that blocks the agent, but there is a special jump over it from state 17 to 19 with a reward of +5.

The training progress of the agent is shown in figure 2.3. At the beginning of the learning procedure, the agent explores the environment a lot to search for the maximum reward, so it often gets a total reward that is not near the optimum. After some episodes, it finds the optimal policy (reward 11) and then exploits the environment. This means it always takes the path with the maximum reward (see the Q-Learning and ε-greedy strategy in section 2.1.4).


Figure 2.2: Basic Grid World - RL

Figure 2.3: Training progress of the agent

2.2.2 Dynamic programming in a basic grid world

The same example is now solved with dynamic programming [14] in Matlab and, naturally, it yields the same optimal path as reinforcement learning.

Generally, DP provides a globally optimal solution for an optimization problem, depending on the discretization limits. Since this algorithm contains a backward calculation, the entire problem formulation must be known in advance. Therefore, and because of the high computational burden of this method, DP is not suitable for online problems which have to be solved in real-time.


Figure 2.4: Basic Grid World - DP

2.2.3 RL vs. DP in a basic grid world

The main difference between these two results is the fact that dynamic programming is faster than reinforcement learning in this case. In a small state space, dynamic programming can easily calculate the optimal policy (reward 11) in a short time. Reinforcement learning has to explore and afterwards exploit the environment, which of course takes more time. In table 2.1, the computation times for solving this example with reinforcement learning and dynamic programming are listed. The calculations are done on a computer with an Intel Core i7-8565U processor and 16 GB RAM.

Table 2.1: Computation times of RL and DP

Method                    Computation time
Reinforcement learning    0.56 s
Dynamic programming       0.35 s

The two times show a relatively large difference considering the small state and action space. It also has to be mentioned that the solving time of reinforcement learning is already optimized by setting the Q-learning parameters (see section 2.1.4) such that the solving time is minimized. This example shows the advantage of dynamic programming over reinforcement learning in a fixed environment or system: DP is optimal and fast if the whole problem is known in advance and there are no changes in it. RL, in contrast, has to explore the environment and train the agent in it to arrive at the optimal path.

2.2.4 Reinforcement learning in a changing grid world

The next example is similar to the previous one, but several properties of the environment change. In the beginning, the agent is in cell [1,1] (state 1) and its goal is to reach the terminal state in cell [5,5] (state 25) with maximum reward. The actions the agent can take are the same as before: north, south, east and west. The agent gets a reward of -1 for each of them, except when it reaches the terminal state, where it gets a reward of +10. Again, there is also an obstacle that blocks the agent. In the beginning, it lies in the position shown in figure 2.5.


Figure 2.5: Grid World - RL

After the agent reaches the terminal state and solves the problem, the environment changes randomly, i.e. the obstacle moves to another position, and the new problem is solved as well. After that, there is another change in the environment, which is also solved, etc. (see figure 2.6).

As seen in figure 2.5, there are several optimal policies with a maximum reward of +3. Therefore, the agent learns several paths with the same maximum reward (see section 2.1.4). In the first environment, the agent takes more time to explore the whole environment and to find the other optimal paths. In the next steps, when the environment changes, the agent only has to adapt its Q-table and does not have to learn everything anew. Therefore, reinforcement learning is much faster in the subsequent changed environments.

Figure 2.6: Changing Grid World - RL

2.2.5 Dynamic programming in a changing grid world

Solving this example with DP delivers the same results as RL. Dynamic programming solves every problem as a new problem and cannot take advantage of the previously solved problems.

Again, the environment changes randomly after dynamic programming has solved each problem.

2.2.6 RL vs. DP in a changing grid world

The time that both methods need for solving each of the changing environments is listed in the following table.


Figure 2.7: Grid World - DP

Figure 2.8: Changing Grid World - DP

Table 2.2: Computation times for the successive environments with RL and DP

Environment    RL        DP
1              0.86 s    0.38 s
2              0.20 s    0.25 s
3              0.15 s    0.23 s
4              0.20 s    0.22 s
5              0.18 s    0.27 s
6              0.27 s    0.22 s
7              0.10 s    0.22 s
8              0.10 s    0.21 s
9              0.10 s    0.22 s
10             0.26 s    0.23 s

The first time in the reinforcement learning column is high compared to the other times. As mentioned before, the agent learns most of the essentials about this problem in the first environment and thereafter reduces the solving times. Hence, when the environment changes at least 9 times, RL is faster than DP in this example:

Table 2.3: Total times of RL and DP

Method                    Total time
Reinforcement learning    2.42 s
Dynamic programming       2.45 s


2.2.7 Conclusion of the grid world examples

These examples show the advantages and disadvantages of Reinforcement Learning (RL) and Dynamic Programming (DP). RL is not as fast and, in general, does not reach the global optimum like DP [10] (in the grid world examples both methods reach the global optimum because of the simplicity of the problem). In problems where full information is known in advance, DP computes the optimal solution. In problems where the environment is changing, RL has the advantage that it does not start from zero because of the agent's knowledge from previous learning episodes. One such challenge is the hybrid electric vehicle control problem, where the whole driving cycle is not known in advance. The controller has to be adapted to obtain an acceptable power split between the battery and the engine that minimizes the losses/maximizes the energy efficiency. It also has to be mentioned that DP is, in some cases, not a reasonable choice, for example, when a problem with a high-dimensional state or action space should be solved in a short time. DP could not deliver good results there because of the enormous computational burden.


2.3 Deep Neural Networks

The ability of artificial neural networks (ANN) to approximate non-linear functions makes them an interesting tool in computational mathematics [15]. This technique emerged from the biological neural networks which occur in the nervous systems of living organisms, where nerve cells/neurons communicate via electric signals. The basic architecture of an artificial neural network is shown in figure 2.9.

Figure 2.9: Feedforward artificial neural network architecture [16]

The three main parts of an ANN are the input layer, the hidden layer and the output layer [16]. To approximate more complex functions, the number of hidden layers as well as the number of artificial neurons, also called hidden units, nodes or basis functions, is increased. The input and output layers each consist of one layer. As shown in figure 2.9, every artificial neuron of one layer is connected to the neurons of the previous and the next layer. The information is forwarded from the input layer over the hidden layer to the output layer; unlike in a recurrent neural network, there are no backward connections. An artificial neuron computes its output value by adding up the weighted inputs θ_i x_i from the previous layer and a bias value b. Thereafter, the determined value passes through the activation function and the result represents the output of the artificial neuron. This calculation is shown in figure 2.10:

Figure 2.10: Artificial neuron, figure edited based on [16]


Every neuron of the ANN hidden layer calculates its output in the same manner, and it is common to use the identical activation function for all neurons of a layer [17]. The interaction between the units gives the ANN the ability to learn the proper weights to represent the approximated function. This learning algorithm is explained in the subchapter Neural Network Algorithm.

2.3.1 Input Layer

The task of the first layer in an artificial neural network is to handle the raw input data which feed the ANN with information. These data are, for example, the raw pixels of an image or informative observations/states of a system. The number of units of the input layer depends on the complexity of the considered problem.

2.3.2 Hidden Layer

This layer is responsible for approximating the main part of the desired function. It is common to use more hidden layers depending on the complexity of the problem, and the number of artificial neurons of a hidden layer is chosen in the same way. The following layers are examples of possible hidden layers, defined by their activation functions.

• Sigmoid layer

Sigmoidal functions form a basis of functions that is able to approximate non-linear problems, as shown in the work of Cybenko [18]. A sigmoidal function is defined as follows [17]:

Definition: Let σ(x) be continuous. Then σ(x) is called a sigmoidal function if it has the following properties:

σ(x) → a as x → +∞,    σ(x) → b as x → −∞,

where a and b with b < a are any real values.

Another property of a sigmoid function is its differentiability, with a positive derivative over the entire range of possible values. The following function is one common example of a sigmoid function with an output range of [0,1]:

σ(x) = 1 / (1 + e^{−x}),    σ′(x) = σ(x)(1 − σ(x))

The goal of the units of an ANN is to approximate the target function in an acceptable way. To compare the results of the used activation function, an error function is defined which depends on a target output value of a neuron and the value calculated with the selected activation function. In [19], the main properties for minimizing this error are stated. These properties also depend on the derivative of the selected activation function. Overall, the best results can be achieved with functions of the form of tanh, which are stated below.

• Tanh layer

The main difference to the previous function is the output range, which is [-1,1]. Therefore, zero is located in the center of the range and the necessary properties can be achieved [17].

σ(x) = (e^x − e^{−x}) / (e^x + e^{−x}) = tanh(x),    σ′(x) = 1 − σ^2(x)

Nevertheless, a disadvantage of both functions is saturation due to their bounded output range. In the subchapter Neural Network Algorithm, the gradient descent method is introduced, which causes problems if the activation function is in saturation: as explained later, the gradient approaches zero and learning becomes inefficient. However, these activation functions can be used if the outputs of the artificial neurons are scaled such that they lie in the required output range [17].

• ReLU layer

One of the most common activation functions is a threshold operation that sets all values which are smaller than zero to zero, called the rectified linear unit (ReLU):

σ(x) = x for x ≥ 0,    σ(x) = 0 for x < 0

This function has a nonlinear and a linear part depending on the range of values and is therefore called a piecewise linear function. The close-to-linear behavior of the ReLU function makes the ANN easier to optimize; the vanishing-gradient issue of the gradient descent method described above does not appear. Another advantage of this activation function is its computational simplicity: compared to the sigmoid and tanh functions, which use an exponential calculation, the implementation of the ReLU function is trivial [20]. A short sketch of these three activation functions follows below.
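The following minimal Python/NumPy sketch (illustrative only, not taken from the thesis implementation) collects the three activation functions and their derivatives as used in the formulas above.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid with output range [0, 1]."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1.0 - sigmoid(x))

def tanh(x):
    """Hyperbolic tangent with output range [-1, 1]."""
    return np.tanh(x)

def tanh_prime(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

def relu_prime(x):
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```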

2.3.3 Output Layer

The last layer of an artificial neural network provides the learned result of the approximated function. Again, the number of neurons depends on the complexity of the considered problem. The output values of this layer are, for example, the action values of a system, such as the torque signals for the joints of a robot that learns to walk.


2.3.4 Neural Network Algorithm

The description is based on the book Neural Networks - A Systematic Introduction [21]. The neural network considered to explain the algorithm is a 3-layer NN with 1 hidden layer. This network has n input sites, k hidden units and m output units. The weights between input site i and hidden unit j are denoted with the superscript (1), θ_{i,j}^{(1)}, the ones between hidden unit j and output unit l with the superscript (2), θ_{j,l}^{(2)}. One extra unit with a value of 1 is inserted for the input and hidden layer. This additional artificial neuron represents the bias of each unit of a layer by multiplication with a weight θ_{n+1,j}^{(1)} for the input layer or θ_{k+1,l}^{(2)} for the hidden layer.

Figure 2.11: 3-layered network [21]

The previous figure 2.11 shows the 3-layered artificial neural network. The (n+1)-dimensional input vector is x = (x_1, x_2, x_3, ..., 1) and therefore the output of the hidden layer results in:

o_j^{(1)} = σ( Σ_{i=1}^{n+1} θ_{ij}^{(1)} x_i )

The output of the hidden layer is a combination of a weighted sum and an applied activation function σ, as explained in figure 2.10. The output of the whole network is calculated below with the additional weights between the hidden layer and the output layer and the use of an activation function at the output layer:

o_l^{(2)} = σ( Σ_{j=1}^{k+1} θ_{jl}^{(2)} σ( Σ_{i=1}^{n+1} θ_{ij}^{(1)} x_i ) )

Backpropagation algorithm:

For simplicity, only a single input-output pair (x, y) of the training examples is considered. The goal of this algorithm is to minimize the error cost function E = (1/2) Σ_{l=1}^{m} ‖o_l^{(2)} − y_l‖^2 in order to find the weights θ which approximate the desired function.

1. Forward propagation:
The input x is fed into the artificial neural network and the outputs o_j^{(1)} and o_l^{(2)} are computed and stored, as well as the evaluated derivatives of the activation functions.

2. Backpropagation to the output layer:
The first step in the backpropagation process is the calculation of the backpropagation error δ_l^{(2)} at the output layer. In general, this parameter is an indicator of the deviation between each node of the output layer and the desired target value of the network.

δ_l^{(2)} = ∂E / ∂(o_j^{(1)} θ_{jl}^{(2)}) = (o_l^{(2)} − y_l) σ′(o_j^{(1)} θ_{jl}^{(2)})

As stated in the previous subchapter Hidden Layer, the advantage of the derivative of a sigmoid or tanh function is its simple calculation, which simplifies the computation of the backpropagation error.

3. Backpropagation to the hidden layer:
In this part of the algorithm, the backpropagation error δ_j^{(1)} of the hidden layer is calculated. Therefore, every connection of the hidden layer to the output layer has to be taken into account. Their associated weights are θ_{jl}^{(2)} and the considered partial derivative is:

∂E / ∂(x_i θ_{ij}^{(1)}) = δ_j^{(1)}

The backpropagation error of the hidden layer results in:

δ_j^{(1)} = ( Σ_{l=1}^{m} θ_{jl}^{(2)} δ_l^{(2)} ) σ′(x_i θ_{ij}^{(1)})

In the same manner, the backpropagation error of every additional hidden layer is calculated.

4. Weight updates:
The previously calculated backpropagation errors of the output and hidden layer are used to compute the corrections of the weights:

Δθ_{jl}^{(2)} = −γ o_j^{(1)} δ_l^{(2)}    and    Δθ_{ij}^{(1)} = −γ x_i δ_j^{(1)}

The learning rate γ represents the step size of the following algorithm.
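A compact sketch of the forward pass and the four backpropagation steps for such a 3-layer network is given below (illustrative Python/NumPy with sigmoid activations and a single training example; the variable `lr` plays the role of the learning rate γ above, and all dimensions and data are made up; this is not the thesis implementation).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, y, theta1, theta2, lr=0.1):
    """One forward/backward pass for a 3-layer network with sigmoid units.

    x:      input vector of length n
    y:      target vector of length m
    theta1: (n+1, k) weights input -> hidden (last row is the bias weight)
    theta2: (k+1, m) weights hidden -> output (last row is the bias weight)
    """
    # 1. Forward propagation (append 1 for the bias unit)
    x_b = np.append(x, 1.0)                      # (n+1,)
    o1 = sigmoid(x_b @ theta1)                   # hidden output o^(1), (k,)
    o1_b = np.append(o1, 1.0)                    # (k+1,)
    o2 = sigmoid(o1_b @ theta2)                  # network output o^(2), (m,)

    # 2. Backpropagation to the output layer: delta^(2) = (o^(2) - y) * sigma'
    delta2 = (o2 - y) * o2 * (1.0 - o2)          # (m,)

    # 3. Backpropagation to the hidden layer: delta^(1) = (theta^(2) delta^(2)) * sigma'
    delta1 = (theta2[:-1, :] @ delta2) * o1 * (1.0 - o1)   # (k,)

    # 4. Weight updates: Delta theta = -lr * activation * delta
    theta2 -= lr * np.outer(o1_b, delta2)
    theta1 -= lr * np.outer(x_b, delta1)

    error = 0.5 * np.sum((o2 - y) ** 2)          # cost E for this example
    return theta1, theta2, error

# hypothetical dimensions: n = 3 inputs, k = 5 hidden units, m = 2 outputs
rng = np.random.default_rng(0)
t1 = rng.normal(scale=0.1, size=(4, 5))
t2 = rng.normal(scale=0.1, size=(6, 2))
t1, t2, e = backprop_step(np.array([0.2, -0.4, 0.7]), np.array([1.0, 0.0]), t1, t2)
print(e)
```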

Furthermore, the possibility of an overly complex approximated function has to be mentioned: in this case, the considered problem leads to overfitting. To avoid such behavior, an additional term is introduced, called the regularization term. This mathematical tool keeps the approximated function as simple as possible and occurs in different versions in the literature [22].

Gradient descent algorithm:

The following pseudo-code computes the weights which reduce the error cost-function E:

At the beginning, the weights θ are set to random values near zero.

Repeat:
    1. Set Δθ^{(2)} = 0 and Δθ^{(1)} = 0.
    2. For i = 1 to z training examples:
           use backpropagation to compute the contributions to Δθ^{(2)} and Δθ^{(1)}.
    3. Update the parameters:
           θ^{(2)} = θ^{(2)} + Δθ^{(2)}
           θ^{(1)} = θ^{(1)} + Δθ^{(1)}
Until the value of the error function E is sufficiently small.

This algorithm is called the batch gradient descent algorithm, which calculates the gradients for all training examples to make one update of the parameters. The stochastic gradient descent algorithm is another option to calculate the required parameters; it performs the parameter update for each training example. The mini-batch gradient descent algorithm represents a mix of both previously mentioned algorithms. The choice of the algorithm depends on the amount of data that should be used to compute a parameter update. On the one hand, batch gradient descent needs more time to calculate the update than stochastic gradient descent, but on the other hand, it is guaranteed to converge to the exact minimum. Stochastic gradient descent tends to overshoot and therefore convergence to the exact minimum is not guaranteed [23].

Another method that has to be mentioned and often occurs in artificial neural network problems is the Adam algorithm (derived from adaptive moment estimation). This method for gradient-based optimization of stochastic objective functions is applied to high-dimensional parameter spaces. Furthermore, the algorithm can handle non-stationary objectives and problems with sparse gradients, which makes it practically important in the field of neural networks. For more information about the algorithm see [24].
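As a rough sketch of the difference between these update schemes (illustrative Python/NumPy; `grad_fn` is a hypothetical function returning the gradient of E with respect to the parameters for a batch of examples, and the Adam constants are the commonly used defaults, not values from the thesis):

```python
import numpy as np

def minibatch_gd(theta, data, grad_fn, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch gradient descent: one parameter update per mini-batch."""
    for _ in range(epochs):
        np.random.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            theta -= lr * grad_fn(theta, batch)
    return theta

def adam_update(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Single Adam step: first/second moment estimates with bias correction."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```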


2.4 Deep Reinforcement Learning (DRL)

In Reinforcement Learning algorithms like Q-Learning, the whole information is saved in a table of Q-values. In many applied examples, the state and action space has a high number of elements. This leads to a large table of values, which makes it impossible to compute every entry of it by exploring all these state-action pairs. Another disadvantage of this method is that a Q-table with a high number of elements quickly exceeds the memory of the hardware.

To avoid this, neural networks are used to approximate the Q-value function. The non-linear approximation can handle similar states in such a way that not every entry has to be visited. This type of approximation generalizes over unvisited states, which reduces the computational time and the required amount of memory. Figure 2.12 shows the main difference between Q-Learning and Deep Q-Learning:

Figure 2.12: Q-Learning vs. Deep Q-Learning: The Q-table is replaced by a neural network as a non-linear function approximation. [25]

2.4.1 Deep Q-Network (DQN)

The combination of Reinforcement Learning and a Neural Network was successfully introduced by DeepMind Technologies and applied to the Atari games [4]. This approach can handle the unstable learning issues which appear when non-linear function approximators are used. The main reasons for achieving this are the usage of an experience replay mechanism and a target network, which are explained in the next two subchapters. The use of deep neural networks is justified by the fact that they can learn better representations than handcrafted features and can handle large state and action spaces. The input data are the states (or observations) of the considered environment. Inside the neural network, the relevant properties are extracted automatically, see chapter Deep Neural Networks. The outputs of the network are the Q-values of the discretized actions, on the basis of which the action taken at a given state is selected.

As stated in chapter Reinforcement Learning, the goal for the action-value function Q(s, a) is to converge via value iteration to the optimal action-value function Q*(s, a), which fulfills the optimal Bellman equation:

Q*(s, a) = E[ R_{t+1} + γ max_{a′} Q*(s′, a′) | S_t = s, A_t = a ]    (2.13)

In this case, the action-value function would have to be estimated separately for each sequence, without any generalization. To avoid this, function approximators are used in RL to estimate the action-value function [4]:

Q(s, a; θ) ≈ Q∗(s, a) (2.14)

The parameter θ denotes the weights of the Neural Network. To update this DQN, a loss function is used which changes in every iteration i of the learning process, based on [4]:

L_i(θ_i) = E[ ( R_{t+1} + γ max_{a′} Q(s′, a′; θ_i^target) − Q(s, a; θ_i) )^2 ]    (2.15)

The goal is to minimize this loss function, which includes a network with weights θ_i and a target network with weights θ_i^target. The interaction between these two Neural Networks is explained in the subchapter Target Network.

2.4.2 Experience Replay

The basic concept of this technique [26] is to store the transition experience (S_t, A_t, R_t, S_{t+1}) at each time step in an experience buffer of fixed length. During each training step, a mini-batch of experiences is sampled randomly from the stored buffer and is applied to the network to perform the weight updates. Therefore, successive updates are decorrelated and the variance is reduced. Another advantage is the smoothing effect over changes in the data distribution, since previous experiences are repeated.
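A minimal replay buffer sketch (illustrative Python, not the thesis implementation): fixed capacity and uniform random sampling of mini-batches, as described above.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-length buffer of transitions (s, a, r, s_next, done)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old experiences are dropped automatically

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform random mini-batch -> successive updates are decorrelated
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```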

2.4.3 Target Network

As stated above, the goal of the DQN is to minimize the loss function, which includes two networks, the prediction network and the target network [27]:

L_i(θ_i) = E[ ( R_{t+1} + γ max_{a′} Q(s′, a′; θ_i^target) − Q(s, a; θ_i) )^2 ]    (2.16)

where the first term inside the square is the target and Q(s, a; θ_i) is the prediction. The prediction network is modified in every iteration step by updating the weights θ_i. The target network is structurally identical to the prediction network; however, the weights θ_i^target are only set to the values of θ_i after a fixed number of steps. If the DQN were used without a target network, the target value would change whenever the prediction network is updated, and so the training process would tend to diverge. Therefore, the target network smooths oscillations in the policy and avoids divergence, which improves the stability of the training.


2.4.4 Double DQN

The reason to implement the Double DQN method [28] is the overestimation of Q-values in the standard DQN caused by selecting and evaluating an action with the same values. To avoid this, Double DQN divides the selection and the evaluation of an action into two parts. This split is done with the help of the target network, even though it is not fully uncorrelated. Considering that this network already exists, its use simplifies the problem and reduces the computational burden. As stated below, the target network evaluates the current greedy policy while the selection of the action is determined by the prediction network:

L_i(θ_i) = E[ ( R_{t+1} + γ Q(s′, argmax_{a′} Q(s′, a′; θ_i); θ_i^target) − Q(s, a; θ_i) )^2 ]    (2.17)

2.4.5 Deep Q-Learning algorithm

The whole Deep Q-Learning algorithm with the previously explained DQN techniques is stated below [13]. The random selection of an action is executed with the ε-greedy strategy as explained in the chapter Reinforcement Learning. In step 5 of the Deep Q-Learning algorithm, the target calculations of both DQN methods, Double DQN and DQN, are specified.


Deep Q-Learning algorithm:

Initialize Q(s, a; θ) with random parameter values θ, and initialize the target network Q(s, a; θ^target) with the same values: θ^target = θ.

For each training time step:

1. For the current observation s, select a random action a with probability ε. Otherwise, select the action for which the action-value function is greatest:

   a = argmax_a Q(s, a; θ_i)

2. Execute action a. Observe the reward r and the next observation s′.

3. Store the experience (s, a, r, s′) in the experience buffer.

4. Sample a random mini-batch of M experiences (s_i, a_i, r_i, s′_i) from the experience buffer.

5. If s′_i is a terminal state, set the value function target y_i to r_i. Otherwise set it to:

   a_max = argmax_{a′} Q(s′_i, a′; θ_i)
   y_i = r_i + γ Q(s′_i, a_max; θ_i^target)            (Double DQN)
   y_i = r_i + γ max_{a′} Q(s′_i, a′; θ_i^target)      (DQN)

6. Update the action-value parameters by a one-step minimization of the loss L across all sampled experiences:

   L = (1/M) Σ_{i=1}^{M} ( y_i − Q(s_i, a_i; θ) )^2

7. Update the target parameters depending on the target update method. If τ = 1, the periodic method is used; if τ ≠ 1, a smoothing method is used:

   θ^target = τ θ + (1 − τ) θ^target

8. Update the probability threshold ε for selecting a random action based on the decay rate.
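The following sketch condenses steps 4-7 above into a single training step (illustrative Python/NumPy, not the thesis implementation). Here `q_net` and `target_net` are hypothetical function approximators with a `predict(states)` method returning Q-values, a parameter array `theta`, and a placeholder `fit_step` for one gradient step on the squared loss; `buffer` is a replay buffer like the one sketched earlier.

```python
import numpy as np

def dqn_training_step(q_net, target_net, buffer, batch_size=32,
                      gamma=0.99, tau=1.0, double_dqn=True):
    """One DQN / Double DQN update using a replay buffer and a target network."""
    if len(buffer) < batch_size:
        return

    batch = buffer.sample(batch_size)
    s, a, r, s_next, done = map(np.array, zip(*batch))

    q_next_online = q_net.predict(s_next)        # Q(s', . ; theta)
    q_next_target = target_net.predict(s_next)   # Q(s', . ; theta_target)

    if double_dqn:
        # Double DQN: online network selects the action, target network evaluates it
        a_max = np.argmax(q_next_online, axis=1)
        q_next = q_next_target[np.arange(batch_size), a_max]
    else:
        # standard DQN: target network both selects and evaluates
        q_next = np.max(q_next_target, axis=1)

    # value function target y_i (terminal states receive only the reward)
    y = r + gamma * q_next * (1.0 - done.astype(float))

    # one gradient step on the loss L = 1/M sum (y_i - Q(s_i, a_i; theta))^2
    q_net.fit_step(s, a, y)

    # target update: tau = 1 -> periodic copy, tau < 1 -> smoothing
    target_net.theta = tau * q_net.theta + (1.0 - tau) * target_net.theta
```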


Chapter 3

HEV Problem Formulation

The aim of this chapter is to describe the hybrid electric vehicle energy management problem and the models used in this work. The first subchapter states the general hybrid powertrain control problem. In 3.2, the HEV model, including the vehicle model and the powertrain model, is explained. In the last subchapter, the driving cycles used to train and simulate the agent are presented.

3.1 Problem Description

The main goal of the DRL method applied to the HEV is to minimize the fuel consumption over the whole test case while sustaining the charge of the battery. The secondary goals of this method are to minimize the number of gear shifts and clutch changes. In this formulation, the driving cycle is given in advance, so the required power demand is calculated and supplied to the HEV model. This approach is called "backward modeling" [3]; it works without a driver model because the desired speed is a direct input. The following figure shows this structure:

Figure 3.1: Modeling approach of the HEV control problem [29]

The block "Powertrain Control" represents the DRL agent which computes the inputs

25

Page 36: Submitted by Thomas Leidinger, BSc Institute for Design

26 3 HEV Problem Formulation

uPT of the powertrain model to fulfill the previously mentioned goals. The output ofthis model represents the state xPT and the performance quantities of the powertrainzPT . The variable w states the power demand and is supplied to the control unit.The "Vehicle dynamics on route" block provides the route characteristics obtainedfrom the known driving cycle [29].

3.1.1 Optimization Problem Formulation

As mentioned in previous chapters, RL does not guarantee finding the optimum of the considered problem, whereas Dynamic Programming finds the global optimum. Therefore, the RL implementation must be designed to get as close as possible to the minimum of the following general optimization problem formulation [29]:

min_{u_PT(t), ∀t∈[t_0, t_f]}  J(x_PT, z_PT, u_PT, w)

s.t.  h(x_PT(t), z_PT(t), u_PT(t), w(t)) = 0,  ∀t ∈ [t_0, t_f]
      l(x_PT(t), u_PT(t), z_PT(t), w(t)) ≤ 0,  ∀t ∈ [t_0, t_f]    (3.1)

where w(t), ∀t ∈ [t_0, t_f], is the given power demand profile and t_0 and t_f are the starting time and the final time of the trip. In this case:

u_PT = [u_j, u_c, u_ξ]^T,  x_PT = [x_j, x_c, x_ξ]^T,  w = [v, a, h]^T,  z_PT = [q_f, z_c, z_j]^T    (3.2)

where u_j is the gear action, u_c the clutch action, u_ξ the power split between engine and battery, x_j the gear position, x_c the clutch position, x_ξ the state of charge (SOC), v the vehicle speed, a the vehicle acceleration, h the altitude of the route, q_f the fuel consumption, z_c the clutch shift and z_j the gear shift.

3.2 Hybrid Electric Vehicle Model

The HEV model is composed of a vehicle model and a powertrain model. The vehicle model also includes the parameters of the driving cycle in addition to the vehicle parameters. The powertrain model, which consists of an engine, an electric machine, a battery, etc., needs to provide the power demand required by the driving cycle and vehicle model. The HEV model is based on [30] and is explained in the following subchapters.


3.2.1 Vehicle Model

This model computes the power demand based on the vehicle and the driving conditions. The assumption is made that the powertrain is stiff and there is no wheel slip. The traction force F_p and the resistance force F_r can be calculated by:

F_p = F_r(φ, v) + λ(j) m a

F_r(φ, v) = m g (c_r + sin(φ(s))) + (c_d ρ_air A / 2) v^2    (3.3)

In these formulas, λ(j) is the factor for the rotational inertia, m the vehicle mass, a the vehicle acceleration, g the gravitational constant, cr the rolling resistance coefficient, φ(s) the road grade angle, cd the drag coefficient, ρair the air density and A the frontal area of the vehicle. The engine speed ωe can be computed from the vehicle speed v, the gear-dependent transmission ratio γ(j) and the wheel radius r:

$$
\omega_e = \frac{\gamma(j)}{r}\, v
\tag{3.4}
$$

With the previous formulas, the power demand can be computed by [29]:

P = Fp · v (3.5)

The specific parameters are listed in Appendix A - HEV Model Parameters.
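To make the vehicle model concrete, the following Matlab sketch evaluates equations (3.3)-(3.5) for one sample of the driving cycle. The parameter names follow Table A.1, while the function name powerDemand is only an illustrative assumption, not part of the thesis implementation.

    function [P, w_e] = powerDemand(v, a, phi, j, par)
    % Sketch of the vehicle model, Eqs. (3.3)-(3.5): power demand and engine
    % speed for one sample of the driving cycle. 'par' holds the parameters of
    % Table A.1; the function name is illustrative.
    Fr  = par.m*par.g*(par.cr + sin(phi)) ...
          + 0.5*par.cd*par.rho_air*par.A*v^2;    % resistance force, Eq. (3.3)
    Fp  = Fr + par.lambda(j)*par.m*a;            % traction force incl. rotational inertia
    P   = Fp*v;                                  % requested power, Eq. (3.5)
    w_e = par.gamma(j)/par.r*v;                  % engine speed, Eq. (3.4)
    end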

3.2.2 Powertrain Model

This model realizes the energy flow through the components of the powertrain. The individual parts of the powertrain calculate their specific quantities and pass them on to the next component. The structure is shown in figure 3.2:

Figure 3.2: Structure of the P2 parallel hybrid, figure edited based on [29]


This model works with a brake heuristic and does not use an extra coupler between the brake and the gearbox to control the mechanical energy flow provided by the wheel. As shown in figure 3.2, the commands for the gearbox, coupler and clutch are the three inputs of the previously mentioned optimization problem formulation. The gear shift command has three different operation modes: shift down, remain and shift up, i.e. uj ∈ {−1, 0, 1}. The coupler operates in a range between 0 and 1, i.e. uξ ∈ [0, 1], where 0 stands for pure battery operation and 1 for pure engine operation. Hence, the required power demand is covered by a combination of the battery power Pb, the engine power Pe and the loss power of the powertrain components PL:

P + PL = Pb + Pe (3.6)

The clutch shift command operates in three different modes: closed, slip and open, i.e. uc ∈ {0, 1, 2}. The engine is turned on when the clutch is closed; when the clutch slips, the engine runs at a low constant speed; the engine is turned off when the clutch is open (decoupled) [31].

Engine model: The engine is described by the following equations.

$$
\tau_{eff} = \tau + J \cdot \dot{\omega}_e \tag{3.7}
$$
$$
q_f = q_f(\omega_e, \tau_{eff}) \tag{3.8}
$$

In these equations, τ is the engine torque, J the inertia, ωe the engine speed and qf the fuel flow rate. The fuel flow rate is determined by the following fuel consumption map and depends on the effective engine torque and the engine speed.

Figure 3.3: Fuel consumption map and limits [30]
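As a sketch of how equations (3.7) and (3.8) can be evaluated against a gridded fuel consumption map such as the one in figure 3.3; the grid vectors, the dummy map and the operating point below are placeholders, not the values used in the thesis.

    % Sketch of Eqs. (3.7)-(3.8): effective engine torque and fuel-rate lookup
    % in a gridded consumption map. The data below are placeholders only.
    w_grid   = linspace(100, 600, 6);            % engine speed grid [rad/s]
    tau_grid = linspace(0, 200, 5)';             % torque grid [Nm]
    qf_map   = tau_grid * w_grid * 1e-4;         % dummy fuel consumption map [g/s]

    J = 0.2;  tau = 80;  w_e = 250;  dwe_dt = 5; % example operating point (placeholders)
    tau_eff = tau + J*dwe_dt;                    % effective torque, Eq. (3.7)
    qf = interp2(w_grid, tau_grid, qf_map, w_e, tau_eff)   % fuel rate, Eq. (3.8)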

Electric machine: To calculate the power demand, an electrical power map, as shown in figure 3.4, is used. The lookup is performed with the rotational speed and the effective torque of the electric machine. The effective torque is calculated in the same way as in the engine model, with the specified inertia and acceleration of the EM.


Figure 3.4: Efficiency map of the electric machine [30]

Battery model: The description of this model is made with the following formulas.

$$
\begin{aligned}
\dot{\xi} &= \frac{1}{Q_n}\, I_b(\xi, P_b) \\
I_b(\xi, P_b) &= \frac{-U_{oc}(\xi) + \sqrt{U_{oc}(\xi)^2 - 4 P_b R_i}}{2 R_i} \\
U_b &= U_{oc} + R_i \cdot I_b \\
U_{oc}(\xi) &= E_0 - \frac{K}{\xi} + A\, e^{B(\xi-1) Q_n}
\end{aligned}
\tag{3.9}
$$

The parameter ξ represents the state of charge, Qn the nominal charge, Ib the battery current, Uoc the open circuit voltage, E0 the battery constant voltage, K the polarisation voltage, A and B are model constants, Ub is the battery clamp voltage and Ri the internal resistance.
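A minimal Matlab sketch of the battery equations (3.9) is given below. The placement of Qn in the exponent of the open-circuit-voltage term is an assumption based on the reconstruction of equation (3.9), and the parameter struct 'par' is illustrative; the actual parameter values are not reproduced here.

    function [dxi, Ib, Ub] = batteryModel(xi, Pb, par)
    % Minimal sketch of the battery equations (3.9); 'par' holds Qn, Ri, E0,
    % K, A and B. The reading of the open-circuit-voltage term (Qn in the
    % exponent) is an assumption.
    Uoc = par.E0 - par.K/xi + par.A*exp(par.B*(xi - 1)*par.Qn);  % open circuit voltage
    Ib  = (-Uoc + sqrt(Uoc^2 - 4*Pb*par.Ri)) / (2*par.Ri);       % battery current
    Ub  = Uoc + par.Ri*Ib;                                       % battery clamp voltage
    dxi = Ib/par.Qn;                                             % SOC derivative
    end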

3.3 Test Case - Driving Cycle

Besides the states of the HEV, the signals of the driving cycles are used by the agent of the DRL algorithm to select the appropriate actions at the current time step. The test case provides the trip time, the route length and the altitude profile of the route. From these parameters, the vehicle speed, the vehicle acceleration and the required force are calculated. These route and dynamic characteristics of the driving cycle are fed to the DRL method, where the agent explores and exploits the environment. Hence, DRL computes the actions for these driving cycle conditions but could also be used for other driving conditions because of the exploration behavior of the DRL method. The conditions of the driving cycles represent a traffic scenario in Linz, Upper Austria.


Therefore, the speed is low or even zero when the car stands still, which regularly occurs in the considered driving cycles. The applied velocity profiles are shown in the following chapter.


Chapter 4

Method and Implementation

The Deep Reinforcement Learning method applied to the HEV problem is implemented with a Deep Q-Learning algorithm, which is described in subchapter 2.4. In the first two subchapters, the main parts of the Reinforcement Learning approach (environment, agent) applied to the HEV are explained. The specified observation and action spaces as well as the reward function for this problem are stated in the following subchapters. Then the used neural network and the training parameters to achieve the goal of the HEV problem are explained. The main part of the implementation of the Deep Reinforcement Learning method in Matlab consists of a step function and a reset function. The step function is called every time step of an episode and calculates the next observations, the relevant signals and the reward based on the states of the model and the applied actions. For this purpose, the model supplies the specified signals for the considered time step. When the episode ends, the reset function is called and the observations and relevant signals are set to their initial values. After the reset, the agent continues learning and starts the new episode.
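As an illustration of how such step and reset functions are typically wired into an environment object, a hedged sketch using the Reinforcement Learning Toolbox is given below. The function names myStepFcn/myResetFcn and the index encoding of the 21 action pairs are placeholders and not the exact implementation of this thesis.

    % Hedged sketch: wrapping the step and reset functions into a Matlab RL
    % environment object (Reinforcement Learning Toolbox). myStepFcn/myResetFcn
    % and the index encoding of the action pairs are placeholders.
    obsInfo = rlNumericSpec([5 1]);     % SoC, gear, clutch, required torque, required speed
    actInfo = rlFiniteSetSpec(1:21);    % indices into the 21 action pairs of Table 4.1
    env     = rlFunctionEnv(obsInfo, actInfo, @myStepFcn, @myResetFcn);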

4.1 Environment

The environment consists of all parts of the system that do not belong to the agent. In figure 3.1 in chapter 3, the environment therefore consists of all blocks of the system except the Powertrain Control block. In figure 4.2 the parts of the environment are indicated in the associated block. The environment of the HEV problem also includes the driving cycle. Since the agent is required to control the vehicle under changing driving conditions, different driving cycles were implemented to simulate the online behavior. To better illustrate the various velocity ranges, only the first 1000 seconds of the four used City-Linz-Traffic driving cycles are shown in the following figure 4.1.

4.2 HEV Agent

The agent represents the controller of the Reinforcement Learning system and has the task of selecting the optimal actions for the HEV problem.

Figure 4.1: HEV Reinforcement Learning approach

The used Deep Q-Network agent executes the Deep Q-Learning algorithm described in chapter 2 (Theory). The used powertrain model requires discrete inputs to calculate the discrete states and signals of the HEV. One property of a DQN agent is the ability to operate with discrete actions and discrete or continuous observation/state spaces; the DQN agent is therefore suitable for this approach. The DQN agent operates in a trial-and-error manner to explore the environment with the aim of maximizing the total cumulative reward. In some phases of the training process, when the action selection leads to operating regions of the engine or the electrical machine that are infeasible, the agent produces unacceptable solutions. Besides the fact that such input values normally lead to results that do not fulfill the goals of the HEV problem, they could damage or destroy the engine or the electrical machine. Therefore, these actions have to be avoided. One approach for not visiting bad operating regions is to assign negative rewards whenever the agent executes one of these bad actions. After a number of episodes, the agent learns to avoid these unacceptable regions, and the simulation results of the agent in the environment it was trained in are safe. The disadvantage of this method shows in online applications: there, the agent has to handle states that did not occur during the learning process, so a selected action may still lead to an infeasible state. To avoid this, as seen in figure 4.2, the additional shield block helps the agent to execute safe actions. The details of the action shield implementation are given in the next subchapter.


Figure 4.2: HEV Reinforcement Learning approach, figure edited based on [12]

4.2.1 Shield

The task of the action shield [12] is to safely avoid actions that would violate the constraints of the HEV model. To do so, the shield receives the current state/observation and the selected action and calculates a safe action based on this information. It is important to emphasize that the shield intervenes only if the action violates a constraint in the considered case; otherwise, the agent's action selection remains unchanged. It should further be mentioned that the altered action should deviate as little as possible from the original action. For example, if the gear shift is -1 and the solution is infeasible, the gear shift is changed to 0. If this action also leads to an infeasible solution, the gear shift command is changed to 1. If changing the first action is not sufficient, the second action (coupler) gets changed, and so on. Hence, the action shield corresponds to a temporal logic that is computable in advance thanks to the knowledge of the driving cycle. The approach of a shield combined with RL is called Safe RL [32]. A detailed explanation of the considered actions is given in the subchapter Action Space. A simplified sketch of this fallback logic is given below.
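The following Matlab sketch illustrates the fallback order under stated assumptions: isFeasible is a placeholder for the model-based feasibility check of the powertrain model, and the handling of the slipping clutch as the last resort is only indicated in a comment.

    function action = actionShield(k, state, action, env, couplerSet)
    % Minimal sketch of the action shield: keep the agent's action pair
    % [uj, ucoupler] if it leads to a feasible state, otherwise deviate from
    % it as little as possible (gear command first, then coupler value).
    % isFeasible is a placeholder for the model-based feasibility check; if
    % no pair is feasible, the slipping clutch mode would be commanded (not shown).
    if isFeasible(k, state, action, env)
        return                                       % agent's selection is kept
    end
    gearSet = [-1 0 1];
    for uj = neighboursFirst(action(1), gearSet)     % closest gear commands first
        for ucpl = neighboursFirst(action(2), couplerSet)
            if isFeasible(k, state, [uj, ucpl], env)
                action = [uj, ucpl];                 % smallest necessary deviation
                return
            end
        end
    end
    end

    function ordered = neighboursFirst(x, set)
    % Order candidate values by their distance to the originally selected value.
    [~, idx] = sort(abs(set - x));
    ordered  = set(idx);
    end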


4.3 Observation Space

The observations of an RL problem have the task of delivering information to the agent. These variables have to be meaningful, which means that the observations represent the RL problem sufficiently. In the case of the HEV, the following observations are used:

• SoC (xξ) - State of charge of the battery: To reach the goal of charge sustaining, the SoC has to be considered. The agent selects an action based on the current value of the SoC. For example, if the SoC is below the desired target value, the agent executes an action that increases the SoC, while always considering the main goal of minimizing the fuel consumption and the secondary goals of minimizing gear and clutch shifts.

• jk (xj) - Gear position: This observation variable is relevant for obtaining feasible states from the model. The required rotational speed and torque values are unreachable in many gear positions of the specific situation. Hence, if the action shield is inactive, the agent learns the optimal gear position for the considered time step of the driving cycle, depending on the goals of the HEV problem.

• ck (xc) - Clutch position: As mentioned before, one secondary goal is to minimize the clutch shifts over a driving cycle. Therefore, the agent gets the information in which clutch position the HEV currently is and then decides which clutch action is optimal for the considered time step.

• T - Required torque: In each time step, the driving cycle delivers the required torque, which has to be generated either by the engine or by the electrical machine. In the case of negative torque, the energy can be used to recuperate electrical energy for the battery and thus increase the SoC. Therefore, this observation variable is essential to minimize the fuel consumption of the HEV.

• w - Required rotational speed: Besides the required torque, the required rotational speed is necessary as an observation variable. Both of them influence the operating points of the engine and the electrical machine.

observations = {SoC, jk, ck, T, w} (4.1)

The agent can change the values of the first three observations SoC, jk and ck through the actions described in the following subchapter. The required torque T and the required rotational speed w are given by the applied driving cycle.

4.4 Action Space

In previous studies [11], only one action variable, for example the coupler or the power supply, is used to influence the HEV system. In this thesis, three action variables are the inputs of the powertrain model: the gear shift command, the clutch shift command and the coupler value between the engine and the electrical machine.


More actions lead to a bigger action space and therefore to a more complex HEV problem. A finer discretization of the action space also increases the complexity and thus the computational time.

• uj - gear shift: This action value is defined by shifting down, remaining and shifting up {−1, 0, 1}; therefore, the maximum gear position change within one time step is plus/minus one. The agent is not able to get from gear 2 to gear 5 directly, for example, which makes the gear shift input indirect. Hence, planning ahead is more difficult for the agent.

• uc - clutch shift: The clutch action is either closed, slip or open {0, 1, 2}. The slipping mode is only applied by the action shield if all possible actions of uj and ucoupler in the open or closed clutch state are infeasible, i.e. the actions the agent can select directly are open or closed {0, 2}. If the clutch is open, there is no connection to the engine and T = 0 Nm. Due to the discretization of the coupler, only a small coupler value is possible for a torque value equal to zero.

• ucoupler - coupler between engine and electrical machine: The range of this action variable is between 0 and 1. Depending on the discretization, the number of selectable action values varies.

actions = {uj , uc, ucoupler} (4.2)

As mentioned before, if the clutch is open, only a small coupler value is possible. To reduce the action space and thus the computational time, the number of selectable actions for the agent can be reduced. Hence, if the selected coupler value is small enough that T = 0 Nm downstream of the clutch, the clutch is open. If the agent selects a coupler value where T ≠ 0, the clutch is closed. Therefore, either the coupler value or the action shield (if an infeasible solution would appear) selects the clutch value.

actionsnew = {uj , ucoupler} (4.3)

Apart from small coupler values that select the open clutch state, the action shield changes the clutch from closed to open if low velocities or small torque values generate infeasible solutions. The following table shows the used action space:

Table 4.1: Action space [uj , ucoupler]

21 action pairs:

uj = -1:   (-1, 0.066)  (-1, 0.15)  (-1, 0.3)  (-1, 0.45)  (-1, 0.6)  (-1, 0.75)  (-1, 0.9)
uj =  0:   ( 0, 0.066)  ( 0, 0.15)  ( 0, 0.3)  ( 0, 0.45)  ( 0, 0.6)  ( 0, 0.75)  ( 0, 0.9)
uj =  1:   ( 1, 0.066)  ( 1, 0.15)  ( 1, 0.3)  ( 1, 0.45)  ( 1, 0.6)  ( 1, 0.75)  ( 1, 0.9)

The discretization of the coupler action leads to 7 values, in which the lowest value is assigned to the open clutch case. The 3 possible values of the gear shift combined with the coupler values result in 21 action pairs. This action space is appropriate to calculate a solution in an acceptable time. For example, if the discretization of the coupler action led to 30 values, 90 action pairs would arise, which leads to an enormous increase of computational time, while the benefit of better results for the HEV problem is small. Another reason for fewer coupler values is the probability of selecting the open clutch action. If the number of coupler values were changed to 30, 87 values would represent 0 - closed clutch and only 3 values 2 - open clutch. The probability of selecting an open clutch value would be very low and therefore the results would not be successful. Introducing more open-clutch actions again goes along with increasing the action space and the resulting computational burden. The sketch below builds this action set and illustrates how the clutch state follows from the selected coupler value.
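A short Matlab sketch of the reduced action set of Eq. (4.3) and the implicit clutch rule; the variable names are illustrative.

    % Sketch of the reduced action set of Eq. (4.3) and of the implicit clutch
    % rule: the smallest coupler value stands for the open clutch, all other
    % coupler values for the closed clutch.
    uj_set      = [-1 0 1];                                   % gear shift commands
    coupler_set = [0.066 0.15 0.3 0.45 0.6 0.75 0.9];         % 7 coupler values
    [UJ, UC]    = ndgrid(uj_set, coupler_set);
    actionPairs = [UJ(:), UC(:)];                             % 21 x 2 matrix of [uj, ucoupler]

    u_coupler = 0.066;                                        % example selection of the agent
    if u_coupler == min(coupler_set)
        u_c = 2;                                              % open clutch (engine decoupled)
    else
        u_c = 0;                                              % closed clutch
    end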

4.5 Reward Function

To evaluate an executed action of the agent, this function provides a reward value. The right choice of the reward function is important to reach the goals of the considered HEV problem. In this case, charge sustaining as well as minimizing the fuel consumption, the gear shifts and the clutch shifts have to be considered. After investigating many reward functions for the HEV DRL problem, the following reward function delivered the best results:

$$
\text{reward} =
\begin{cases}
-q_f & SoC_{min} \leq SoC \leq SoC_{max} \\
-q_{f,max} & SoC < SoC_{min} \ \text{or} \ SoC > SoC_{max}
\end{cases}
\tag{4.4}
$$

The total reward function is calculated by:

$$
\text{reward}_{total} = \text{reward} - z_j \cdot fac_j - z_c \cdot fac_c
\tag{4.5}
$$

The parameter qf represents the fuel consumption in every time step and is scaled to bring its value into the same range as the other terms of the reward function. The goal of charge sustaining is to keep the SoC within a range of SoCmin = 0.49 to SoCmax = 0.51. Hence, the agent receives the maximum negative reward outside of this range, so that it learns to avoid these regions [33]. The remaining terms describe gear and clutch shifts, i.e. the variables zj and zc are equal to 1 if the gear or the clutch changes with respect to the previous time step, otherwise they remain 0. The factors facj and facc are used to weight the gear and clutch shifts.
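The per-step reward computation of Eqs. (4.4)-(4.5) can be sketched in a few Matlab lines; all numerical values below are placeholders for illustration, not the values used in training.

    % Sketch of the reward of Eqs. (4.4)-(4.5) for one time step; the values
    % below are placeholders for illustration only.
    SoC_min = 0.49;  SoC_max = 0.51;              % charge-sustaining band
    fac_j   = 0.1;   fac_c   = 0.3;               % shift weights (example values)
    SoC = 0.495;  qf_scaled = 0.02;  qf_max = 1;  % example step values (placeholders)
    z_j = 1;  z_c = 0;                            % gear changed, clutch did not

    if SoC >= SoC_min && SoC <= SoC_max
        reward = -qf_scaled;                      % scaled fuel consumption penalty
    else
        reward = -qf_max;                         % maximum penalty outside the SoC band
    end
    reward_total = reward - z_j*fac_j - z_c*fac_c % Eq. (4.5)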

4.6 Neural Network

As mentioned in the Theory chapter 2, the task of the Neural Network is to approximate the Q-value function. The setting parameters of this approach are the number of layers, the type of each layer, the number of units of each layer and the activation functions. The selection of these parameters is based on the paper [33]. They were adapted to the size of the observation and action spaces of the HEV problem. The input layer includes 5 units, corresponding to the 5 observations of the system. The output layer includes 21 units, corresponding to the number of actions. The following table shows the parameters of the Neural Network:

Table 4.2: Neural Network parameter

Layer   Type                    Activation function   Number of neurons
1       Input layer             Linear                5
2       Fully connected layer   ReLU                  50
3       Fully connected layer   ReLU                  100
4       Fully connected layer   ReLU                  200
5       Output layer            Linear                21

The observations represent variables whose value ranges differ strongly. This makes the function approximation difficult and time-consuming for the Neural Network. To simplify the problem, the observations are scaled into the same range of values before they are fed to the Neural Network.
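A hedged sketch of how the network of Table 4.2 can be assembled with Deep Learning Toolbox layers; the layer names and the choice of the input layer type are illustrative assumptions, not the exact code of the thesis.

    % Sketch of the Q-network of Table 4.2 (5 observations in, 21 Q-values out).
    layers = [
        imageInputLayer([5 1 1], 'Normalization', 'none', 'Name', 'observation')
        fullyConnectedLayer(50,  'Name', 'fc1')
        reluLayer('Name', 'relu1')
        fullyConnectedLayer(100, 'Name', 'fc2')
        reluLayer('Name', 'relu2')
        fullyConnectedLayer(200, 'Name', 'fc3')
        reluLayer('Name', 'relu3')
        fullyConnectedLayer(21,  'Name', 'qValues')];  % one Q-value per action pair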

4.7 Hyperparameter and Implementation Algorithm

Training is essential for the agent to improve its policy and reach the goal of the considered problem. As mentioned in the Theory chapter, this process is repeated every new episode, where the agent adapts its policy according to the DQN algorithm. One of the most influential mechanisms of this algorithm is the epsilon-greedy strategy. The choice of epsilon, the epsilon decay rate and the minimum value of epsilon is important for the agent to be able to generate a solution. The epsilon decay rate is selected such that the agent explores and exploits enough, depending on the problem size of the environment. The starting value of epsilon is ε = 1, the epsilon decay rate εdecayrate = 4·10^−6 and the minimum epsilon εmin = 0.01. If the length of each episode is set to 1000 time steps and epsilon is decayed after each time step with ε = ε·(1 − εdecayrate) [34], εmin = 0.01 is reached after 1152 episodes. The evaluation of this selection is made with the learning progress of the agent, which is presented in the next chapter. The discount factor is set to γ = 0.5. As mentioned in the Theory chapter, this value determines the weighting of the future rewards; in this case, future rewards are discounted more strongly than with higher discount factors. The trade-off between weighting local and future rewards with γ = 0.5 delivers the best results in the case of the HEV. The following table shows the parameters of the DQN-agent used in Matlab:


Table 4.3: Summary of the used Hyperparameters

Parameter                    Value
Mini batch size              64
Experience buffer length     10^6
Discount factor γ            0.5
Learning rate                0.0001
ε_start                      1
ε_decayrate                  4·10^−6
ε_min                        0.01
Target smooth factor         0.001
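The stated number of episodes until the minimum epsilon is reached can be checked with a few lines of plain Matlab (no toolbox calls), using only the values of Table 4.3:

    % Check after how many 1000-step episodes epsilon = eps0*(1 - decay)^k
    % drops below eps_min for the values of Table 4.3.
    eps0 = 1;  decay = 4e-6;  eps_min = 0.01;  stepsPerEpisode = 1000;
    kSteps   = log(eps_min/eps0) / log(1 - decay);   % number of decay steps
    episodes = ceil(kSteps / stepsPerEpisode)        % approximately 1152 episodes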

As stated in the introduction of this chapter, the main part of the calculation is made in the step function. The following listing shows the important calculation parts of this function:

Algorithm 1: Step-function of the DRL-algorithm

[NextObs, Reward, IsDone, Signals] = StepFunction(Action, Signals, Envvariables)

    /* Unpack the Signals vector */
    State  = Signals.states
    k      = Signals.disc
    IsDone = 0

    /* Constraints - Action Shield */
    [Action] = actionconstraint(k, Envvariables, State, Action)

    /* HEV - Block Model Function */
    newstates = discModelFunction(k, State, Action, Envvariables)
    SoC = newstates(3)
    ck  = newstates(2)
    jk  = newstates(1)

    /* Transform State to Observation */
    Signals.states       = [SoC; jk; ck; T; w]
    Signals.observations = [SoC; jkdem; ckdem; Tdem; wdem]
    NextObs = Signals.observations

    /* Set flag if episode terminates */
    if k == episodelength then
        IsDone = 1
    end

    /* Calculate Reward */
    if SoC >= SoCmin and SoC <= SoCmax then
        Rewardqf = -qfreward
    else if SoC < SoCmin or SoC > SoCmax then
        Rewardqf = -qfmax
    end
    Reward = Rewardqf - zj * facj - zc * facc

    /* Next Time-step */
    Signals.disc = k + 1


The step function of the DRL algorithm is shown in a simplified form. The input parameters contain the action signal selected by the agent, the parameter Signals, which persists between executions of the step function, and the Envvariables parameter, which contains information about the environment. The return parameters include the NextObs parameter, which is the calculated next observation of the agent, the reward signal, the IsDone flag, which is set to 1 if the episode terminates and otherwise remains 0, and the previously mentioned Signals parameter. In the first part, the state values of the previous time step are transferred to the present one. The variable k represents the present time step, and the flag variable is set to 0. As mentioned earlier in this chapter, the action shield is used to avoid infeasible states caused by the selected actions. Therefore, it is applied to the present state, and the Envvariables parameter delivers the information about the environment needed to calculate an action that leads to a feasible state. After that, the relevant data is passed to the model function and the new states and signals are calculated. The obtained values are assigned to the global Signals parameter, whereby the observation values are scaled (dem values) before they are fed to the Neural Network. After the termination condition is evaluated, the reward value is calculated. At the end of this function, the global time-step parameter k is incremented.
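For completeness, a hedged sketch of the corresponding reset function is given below. The field names mirror the step function above; the initial values (SoC of 0.5, first gear, closed clutch) and the helper scaleObservations are assumptions for illustration only.

    function [InitialObservation, Signals] = ResetFunction(Envvariables)
    % Sketch of the reset function called at the start of every episode.
    % The initial state values and the helper scaleObservations are
    % assumptions for illustration.
    SoC0 = 0.5;  jk0 = 1;  ck0 = 0;                   % assumed initial state
    T0 = Envvariables.T(1);                           % demand of the first time step
    w0 = Envvariables.w(1);

    Signals.states       = [SoC0; jk0; ck0; T0; w0];
    Signals.observations = scaleObservations(Signals.states);  % scaling as in chapter 4.6
    Signals.disc         = 1;                         % reset the time-step counter
    InitialObservation   = Signals.observations;
    end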


Chapter 5

Results

This chapter presents the results of the DRL method applied to the HEV problem. The first subchapter covers the learning process of the agent. In the next subchapter, the results of the DRL method applied to 4 different driving cycles and their relevant signals are discussed. In 5.3, the DRL results are compared to the benchmark results of Dynamic Programming. In the last subchapter, a trained agent is applied to a different driving cycle to simulate the behavior of a real-world online application.

5.1 Learning

In the training process, the agent learns and adapts the policy to maximize the total reward. As mentioned in the previous chapter, learning depends on various factors: the hyperparameters of the DQN algorithm, the reward function, the considered model in the environment, etc. Therefore, the selection of these factors is significant to obtain results that converge to an acceptable solution. In figure 5.1, the learning progress of the DQN-agent in the HEV problem with the City Linz light traffic driving cycle is shown. The number of episodes is plotted on the horizontal axis and the episode reward on the vertical axis. The blue dotted line represents the total cumulative reward of the episode and the red line the average reward over 5 episodes. As seen in the figure, the learning progress is stopped after 1500 episodes. In the beginning, there is a steady increase of the curve, and after about 400 episodes the reward and the average reward increase rapidly. This step is caused by the action shield, which helps the agent to get on the right track to reach the goal of the problem. The action shield prevents the agent from reaching infeasible states and generating bad rewards. Without these constraints, the agent would need more computational time or would not converge to a solution. Afterwards, both graphs increase slowly until the end of training, which can be explained by small improvements of the calculated fuel consumption, the number of clutch shifts and the number of gear shifts. Shortly before the end of the learning process, the episode reward shows unsteady behavior. Hence, continuing the training is not useful, since in this region the agent exploits and no longer explores the environment; changes in the curve then depend on the neural network algorithm.

Figure 5.1: Learning progress of the agent in the HEV problem

The computational times of training and simulation are stated in the following table. Simulation means applying the agent with its learned policy in an environment. The properties of the training environment and of the simulation environment, for example the driving cycle, do not have to match.

Table 5.1: Computational times

Method       Time
Training     27043.1 s ≈ 7.5 h
Simulation   2.3 s

The training time and the simulation time differ tremendously. Therefore, the use of a DRL agent in online applications is only possible once the training process is finished. Then the simulation can be carried out in other environments with new driving cycles (online). As stated in the following subchapters, the simulation time of DRL is short compared to the benchmark method Dynamic Programming, which is a big advantage of the DRL method.


5.2 Results - DQN RL

5.2.1 1 action/input

The first presented results show the application of DRL with 1 input, the coupler between the engine and the electric machine, i.e. 1 action selection for the agent. These results should show that the implementation of this thesis reaches statistics similar to the 1-action approaches already implemented in the literature. As stated in the paper of Wu, "Continuous reinforcement learning of energy management with deep Q network for a power split hybrid electric bus" [35], the DRL method achieves 89.6 percent of the optimal fuel consumption of the Dynamic Programming solution applied to the WVUSUB driving cycle. In that paper, the engine power is set as an action variable via an increment value. The other inputs are set such that the engine works at optimal fuel economy for the specific power. Therefore, the training difficulty is decreased significantly. The results in this thesis reach a percentage of 92.2 of the optimal fuel consumption of DP (corresponding to 8.4% more fuel consumption than DP). Here, the City Linz light traffic driving cycle was investigated. The subfigures of figure 5.2 show the velocity, the torque split between the engine and the electric machine, the 3 states of the HEV model (SoC, gear and clutch) and the fuel rate. It has to be mentioned that the WVUSUB driving cycle considered in the paper has a duration of about 1600 s, whereas the considered part of the City Linz light traffic driving cycle has a duration of 1000 s. The reason for the selected duration of the driving cycle of the DRL method is the reduction of the computational time. Furthermore, as shown later, training over the whole driving cycle does not lead to significant improvements in the results. The 2 other inputs of the HEV model, the gear shift commands and clutch shift commands, are set by the optimal solution of DP. This specification does not represent the optimal fuel economy for the DRL approach, because DRL does not calculate a global solution like DP. The similar courses of the two SoC lines of DRL and DP arise from the use of the same gear and clutch values to calculate the solutions; these selections have a massive impact on the course of the SoC. The selection of the coupler, which represents the used action, of course changes the SoC values, but is based on the two given inputs. Therefore, the previously mentioned reward function also has to be adapted. The following table shows the statistics of DP and DRL with 1 action.

Table 5.2: Statistics - 1 Input - City Linz light Traffic

Method                        DRL      DP
Total fuel consumption [kg]   0.1542   0.1422
Number of gear shifts         57       57
Number of clutch shifts       30       30
Total cost                    0.1716   0.1596
Final SoC                     0.4899   0.4940


Figure 5.2: Results - 1 action RL - City Linz light traffic driving cycle 1000s

5.2.2 3 actions/inputs

The results with 3 inputs, i.e. 3 actions for the agent, are stated in the following figures. Here, the 4 driving cycles mentioned in the previous chapter are investigated. The first figure 5.3 shows the first 1000 s of the City Linz light traffic driving cycle. The shift weights of the reward function are set to facj = 0.3 and facc = 0.5, which leads to a low number of gear shifts; the number of clutch shifts, however, is considerably higher. The selection of these two parameters has a high impact on the solution of the considered problem, and the parameters influence each other. They represent a trade-off between gear and clutch shift commands and also affect the main goal of minimizing the fuel consumption.


Figure 5.3: Results RL - City Linz light traffic driving cycle 1000s

Around 400 s of the driving cycle, the process of recharging the battery is visible. At first the velocity decreases, the torque of the electrical machine is negative and the SoC of the battery increases. Then the velocity profile increases, whereby the electrical machine delivers a positive torque value and the SoC becomes smaller again. This process happens when the clutch is open and only the electrical machine works. Otherwise, if the clutch is closed, the engine could also contribute torque, but with the disadvantage of consuming fuel. The next figure 5.4 shows the application to the first 1000 s of the City Linz mid traffic driving cycle. The shift weights of the reward function are set to facj = 0.1 and facc = 0.3. The smaller gear-shift weight leads to a higher number of gear shifts. However, with this selection the main goal of minimizing the fuel consumption is improved in this case.


Figure 5.4: Results RL - City Linz mid traffic driving cycle 1000s

The following figure 5.5 shows the results of the first 1000 s of the City Linz heavy traffic driving cycle. The shift weights of the reward function are set to facj = 0.1 and facc = 0.3. As before, the SoC of the battery stays above 0.49 and therefore the desired target SoC value is achieved.


Figure 5.5: Results RL - City Linz heavy traffic driving cycle 1000s

Figure 5.6 shows the results of the first 1000 s of the City Linz bigVvar traffic driving cycle. The shift weights of the reward function are set to facj = 0.05 and facc = 0.2.


Figure 5.6: Results RL - City Linz bigVvar traffic driving cycle 1000s

The 4 trends of the SoC of the battery are similar, since the same reward function structure is used in all 4 cases and the velocities of the driving cycles are in the same range. As seen in the figures, every driving cycle approach keeps the SoC above the target bound of 0.49. The approach of a funnel-shaped reward function to guide the agent to the target SoC value does not deliver acceptable results. The reason is that the agent would get different rewards for the same state-action pair. For example, at the beginning of an episode, the agent computes the expected reward (Q-value) for a state-action pair with an SoC value of 0.48. The reward is positive because in the early phase of the episode this SoC value is acceptable. At the end of the cycle, when the agent should reach the target SoC range, this state-action pair would be treated differently with a changed reward. Therefore, the agent's estimate of the same Q-value would change over one episode.


5.3 Comparison with Dynamic Programming

The Dynamic Programming approach delivers the benchmark solution of the HEV problem. The computational time of the optimal solution depends heavily on the discretization of the considered HEV model parameters. Here, the value of the coupler between the engine and the electric machine is divided into 31 parts and the SoC of the battery into 111 parts. As previously mentioned, the DP solution is a globally calculated result, in contrast to the DRL method. Therefore, and because the discretization is different, the DRL method is not able to reach the optimal results of DP. As mentioned before, the reason for the 1000 s duration of the driving cycle is the reduction of the computational time. Furthermore, as shown later, training over the whole driving cycle does not lead to significant improvements in the results.

5.3.1 1000s driving cycles

The following figures show the solutions of the Deep Reinforcement Learning approach and of the Dynamic Programming approach. The 2 methods are applied to the first 1000 s of each of the 4 driving cycles.


Figure 5.7: Results RL and DP - City Linz light traffic driving cycle 1000s

The first figure 5.7 shows the solutions of DP and DRL for the City Linz light traffic 1000 s driving cycle. The course of the SoC of the battery of the DP solution shows that this method computes a global solution. At the end of the driving cycle, the highest amount of engine torque is needed to fulfill the goal of reaching the target SoC value; therefore, a high amount of fuel is consumed there. The course of the engine torque of the calculated DRL solution contains more peaks where small amounts of fuel are consumed. Hence, this method tries to hold the SoC value within the target range.

Table 5.3 shows the important numbers of the 2 considered methods. In this case, the optimal fuel consumption of DP corresponds to 84 percent of the DRL value, i.e. DRL consumes about 19% more fuel. The low number of gear shifts leads to a higher number of clutch shifts.


Table 5.3: Statistics - City Linz light Traffic - 1000s

Method                        DRL      DP
Total fuel consumption [kg]   0.1692   0.1422
Number of gear shifts         7        57
Number of clutch shifts       92       30
Total cost                    0.1890   0.1596
Final SoC                     0.4908   0.4940

As mentioned before, the parameters to be minimized depend on each other. The total cost of the considered problem is calculated with the following equation:

J_total = m_f + n_g · 10^−4 + n_c · 10^−4 (5.1)

The parameter mf represents the total fuel consumption, ng the number of gear shifts and nc the number of clutch shifts. The final SoC values of both methods are within the terminal constraints and therefore the goal of charge sustaining is fulfilled.


The following figure 5.8 shows the results of DRL and DP applied to the City Linz mid traffic 1000 s driving cycle. The total fuel consumption computed by the DRL method is close to the optimal fuel consumption of the DP method. However, the number of gear shifts and the number of clutch shifts are thereby increased.

Figure 5.8: Results RL and DP - City Linz mid traffic driving cycle 1000s

Table 5.4: Statistics - City Linz mid Traffic - 1000s

Method                        DRL      DP
Total fuel consumption [kg]   0.1774   0.1749
Number of gear shifts         175      81
Number of clutch shifts       92       34
Total cost                    0.2308   0.1979
Final SoC                     0.4978   0.4979


Figure 5.9 shows the results of DRL and DP for the City Linz heavy traffic 1000 s driving cycle. In this case, the optimal fuel consumption of DP corresponds to 85 percent of the total fuel consumption of the DRL method.

Figure 5.9: Results RL and DP - City Linz heavy traffic driving cycle 1000s

Table 5.5: Statistics - City Linz heavy Traffic - 1000s

Method                        DRL      DP
Total fuel consumption [kg]   0.1944   0.1668
Number of gear shifts         127      57
Number of clutch shifts       96       27
Total cost                    0.2390   0.1836
Final SoC                     0.4908   0.4932

In the following figure 5.10 the results of both methods for the City Linz bigVvar traffic 1000 s driving cycle are investigated. The optimal fuel consumption of DP corresponds to 87 percent of the value calculated by the DRL method.


Figure 5.10: Results RL and DP - City Linz bigVvar traffic driving cycle 1000s

Table 5.6: Statistics - City Linz bigVvar Traffic - 1000s

Method                        DRL      DP
Total fuel consumption [kg]   0.1605   0.1407
Number of gear shifts         441      62
Number of clutch shifts       114      21
Total cost                    0.2715   0.1573
Final SoC                     0.4978   0.4955

As seen in the previous figures and tables, the 4 cases show differences in the computed quantities. The used reward parameters are selected such that the main goal of minimizing the fuel consumption is satisfied. In some cases, this leads to a high number of gear shifts and clutch shifts. Further, the interventions of the action shield also increase these two numbers.


5.3.2 Whole driving cycles

The next 4 figures show the applications to the whole driving cycles. As before, the accompanying tables show the fuel consumption, the clutch shifts, the gear shifts, the total cost and the final SoC value of both methods.

Figure 5.11: Results RL and DP - City Linz light traffic driving cycle

Table 5.7: Statistics - City Linz light Traffic - whole driving cycle

Method                        DRL      DP
Total fuel consumption [kg]   0.5369   0.4089
Number of gear shifts         16       124
Number of clutch shifts       309      75
Total cost                    0.6019   0.4487
Final SoC                     0.4943   0.4942


Figure 5.12: Results RL and DP - City Linz mid traffic driving cycle

Table 5.8: Statistics - City Linz mid Traffic - whole driving cycle

Method                        DRL      DP
Total fuel consumption [kg]   0.5159   0.4040
Number of gear shifts         323      151
Number of clutch shifts       241      70
Total cost                    0.6287   0.4482
Final SoC                     0.4919   0.4943


Figure 5.13: Results RL and DP - City Linz heavy traffic driving cycle

Table 5.9: Statistics - City Linz heavy Traffic - whole driving cycle

Method                        DRL      DP
Total fuel consumption [kg]   0.5686   0.4261
Number of gear shifts         105      150
Number of clutch shifts       188      65
Total cost                    0.6272   0.4691
Final SoC                     0.4939   0.4942


Figure 5.14: Results RL and DP - City Linz bigVvar traffic driving cycle

Table 5.10: Statistics - City Linz bigVvar Traffic - whole driving cycle

Method                        DRL      DP
Total fuel consumption [kg]   0.4849   0.3844
Number of gear shifts         467      166
Number of clutch shifts       262      73
Total cost                    0.6307   0.4322
Final SoC                     0.4948   0.4943

In all 4 results, the SoC value of the DP method shows the behavior of a global solution. The DRL method tries to keep the SoC value above the implemented boundary of 0.49 over the whole cycle. Therefore, the gap between the calculated total fuel consumptions of the 2 methods gets bigger. It also has to be mentioned that the 4 driving cycles have different lengths. The ratios of the optimal DP fuel consumptions to the values calculated by DRL are 76, 78, 74 and 79 percent.


5.4 Online Application

In this chapter, an agent is applied to driving cycles other than the one it was trained on to test the online capability of DRL. The agent previously trained on the City Linz light traffic 1000 s cycle calculates solutions in the environments of the three other 1000 s City Linz driving cycles and of the whole City Linz light traffic driving cycle. Further, the agent is applied to a driving cycle with a different route, called country road Neuhofen. The DP solution, applied to a different driving cycle, does not deliver feasible solutions. Therefore, with the used model, only the DRL method computes feasible results in this setting. The first figure shows the results for the City Linz mid traffic 1000 s cycle.

Figure 5.15: Results RL - City Linz mid traffic driving cycle

The first column of the following table shows the results of the DRL method calculated in chapter 5.3. The second column indicates the solution of the agent which was trained in the City Linz light traffic environment and then applied to the City Linz mid traffic environment; this is called online DRL. The advantage of the online DRL method is the short computational time, which makes it applicable to fast-changing driving cycles. Since the used agent was trained on a driving cycle with similar conditions, the obtained solutions of the City Linz mid traffic approach are usable. The upcoming observations are in this case equal or similar to the observations that occurred in the training process, and so the agent knows how to calculate a useful solution.

Table 5.11: Computational times, online application - City Linz mid traffic

Method                        DRL      online DRL
Total fuel consumption [kg]   0.1774   0.1946
Number of gear shifts         175      13
Number of clutch shifts       92       91
Total cost                    0.2308   0.2154
Final SoC                     0.4978   0.4980
Computational time [s]        -        4.5

The following figure 5.16 shows the results for the City Linz heavy traffic case, calculated by the agent trained in the City Linz light traffic environment. As seen in chapter 5.3, where the agent is applied to the same environment in which it was trained, a low number of gear shifts is made due to the selected parameters of the DRL method. This behavior also shows in the results of this chapter, where the online DRL method performs few gear shifts.


Figure 5.16: Results RL - City Linz heavy traffic driving cycle

Table 5.12: Computational times, online application - City Linz heavy traffic

Method                        DRL      online DRL
Total fuel consumption [kg]   0.1944   0.2029
Number of gear shifts         127      11
Number of clutch shifts       96       104
Total cost                    0.2390   0.2259
Final SoC                     0.4908   0.4899
Computational time [s]        -        4.1

The following figure 5.17 and table 5.13 show the results of the agent trained in the City Linz light traffic 1000 s driving cycle environment, applied to the City Linz bigVvar 1000 s driving cycle environment. The total fuel consumption of the online DRL method reaches a percentage of 95.8 of the total fuel consumption calculated by the conventional DRL method. A big advantage of the trained agent is the low number of gear shifts of its strategy; therefore, the total cost is even better than the total cost of the DRL method.

Figure 5.17: Results RL - City Linz big Vvar traffic driving cycle

Table 5.13: Computational times, online application - City Linz big Vvar

Method                        DRL      online DRL
Total fuel consumption [kg]   0.1605   0.1858
Number of gear shifts         441      15
Number of clutch shifts       114      94
Total cost                    0.2715   0.2076
Final SoC                     0.4978   0.4972
Computational time [s]        -        4.2

Figure 5.18 and table 5.14 show the results of the agent trained in the City Linz light traffic 1000 s driving cycle environment, applied to the whole City Linz light traffic driving cycle environment. Therefore, the first 1000 s are the same in both cases.


Figure 5.18: Results RL - City Linz light traffic driving cycle

Table 5.14: Computational times, online application - City Linz light traffic

Method                        DRL      online DRL
Total fuel consumption [kg]   0.5369   0.5379
Number of gear shifts         16       30
Number of clutch shifts       309      265
Total cost                    0.6019   0.5969
Final SoC                     0.4943   0.4947
Computational time [s]        -        6.7


The total fuel consumption of both DRL methods is nearly the same over the whole driving cycle. Hence, training the agent in the environment of the whole driving cycle increases the training time but improves the obtained results only insignificantly. The reasons are that the first 1000 s are the same in both cases and that the rest of the whole driving cycle does not differ enormously from the first part, i.e. the agent operates in a familiar environment of observations. Figure 5.19 and table 5.15 show the results of the previously mentioned agent applied to the country road Neuhofen 1000 s driving cycle environment.

Figure 5.19: Results RL - Country road Neuhofen driving cycle

The velocity profile of this driving cycle differs from the previously considered driving cycles. The agent has to calculate solutions for observations that it never visited in its training process. Therefore, the total fuel consumption of this application is high and the solution is worse. Further, the SoC fluctuates more in the unknown areas of the driving cycle, for example around 650 s.


Table 5.15: Computational times, online application - Country road Neuhofen

Method                        online DRL
Total fuel consumption [kg]   0.7401
Number of gear shifts         92
Number of clutch shifts       127
Total cost                    0.7839
Final SoC                     0.4920
Computational time [s]        4.3

The above results show the important role of the training and of the selection of the environment in a DRL approach. Intuitively, an agent trained in an environment with similar conditions leads to better results.

As mentioned before, one advantage of the online DRL method is that a feasible solution is generated even if the driving cycle includes velocity sections that are completely different from those of the driving cycle used to train the agent. The DP method is often not able to calculate a feasible solution for a new driving cycle even if the velocity profile is nearly the same as the one for which DP calculated a feasible solution. This is because DP does not learn other possible policies but only calculates the global optimum for the investigated driving cycle. When the driving cycle is new in the simulation case, DP is not able to handle the problem. Therefore, DRL is more universal and robust under disturbances. For this reason, the online DRL method is suitable for commute scenarios where small modifications occur but the main behavior of the driving cycle remains the same. Because the online DRL method requires only a short computational time, an online application with fast-changing driving conditions is possible.

DRL was also compared with another popular online approach, the equivalent consumption minimization strategy (ECMS). This method gives a total fuel consumption of 0.430 kg for the whole City Linz light traffic driving cycle. The computation takes a few minutes, but the performance largely depends on the equivalence factor, which is hard to know in advance [3], [36]. This equivalence factor between electrical and chemical energy has to be defined to obtain an equivalent total fuel consumption, which is then minimized. To reach the goal of charge sustaining, the selection of the equivalence factor is crucial. Tuning this factor is therefore important, which leads to a problem that needs predictive information to reach the SoC target value. The computational time of the used ECMS method is also significantly longer than that of the online DRL method. Hence, although ECMS achieves a slightly better total fuel consumption than DRL, it requires tuning work and predictive information and has a longer computational time, so DRL is the more suitable choice; the range of applications of ECMS is limited compared to the online DRL method.

In summary, using a more realistic model with 3 actions instead of a one-action approach [11], [33], [35] leads to a more complex problem, which has been solved here. Further, by generating feasible solutions for all applied driving cycles and calculating them in a short time, the online DRL method has several clear advantages compared to the common applications and methods.


Chapter 6

Conclusion and Future Work

6.1 Conclusion

In order to maximize the energy efficiency of HEVs under uncertain driving environments, this thesis developed an online-capable control using Deep Reinforcement Learning. This energy management problem has been solved under charge-sustaining conditions. A DQN-agent with its discrete action space was applied to the HEV model to reach the mentioned goals. At first, a 1-action approach was investigated to compare the results with the existing literature; the other 2 actions were given by the optimal solution of Dynamic Programming. Then the full-action approach was developed. Solving all 3 actions directly with DRL gives an unsatisfactory result because of the size of the action space, which leads to a heavy computational burden. To reduce this effort, the 3 actions were transformed into 2 actions by making 1 action dependent on another. This improvement makes the method faster and the solution converges to an appropriate value. During testing and improvement, infeasible values of the HEV model occurred often, which made the obtained results of the whole method unacceptable. To address this, an action shield was implemented which helps the agent generate feasible solutions, especially in online applications when new driving cycles occur. First, the agents were applied to the same driving cycles used for training; after that, an agent was selected and applied to different driving cycles to simulate the online behavior.

In conclusion, the Deep Reinforcement Learning approach can, of course, not deliver the optimal solution of the benchmark method Dynamic Programming. The results show that in offline cases (where the agent is trained in an environment and then simulated in the same environment), DRL performs comparably to DP but takes a longer time, due to the computational effort of a machine learning method like DRL. However, in online cases (where an agent trained in one environment is simulated in a new environment), DRL shows its advantage of delivering feasible solutions in a short computational time, which is crucial in online applications. Therefore, the online DRL approach can be seen as a robust online approach under disturbances. These advantages could not be reached by the common methods that were tested in this work. Furthermore, it is observed that training the agent over the whole cycle does not improve the results to a large extent, i.e. training the agent on the first 1000 s of the driving cycle is sufficient.

6.2 Future Work

The implementation of the 3-action DRL method for the HEV problem could be extended in the same way as the existing 1-action approaches in the literature. This means, for example, that the agent learns in an environment where the driving cycle is changed every episode or after a fixed number of episodes. The agent then learns different styles of routes and can handle a wider range of driving cycles in online applications. Of course, the training effort and therefore the computational burden would increase significantly to obtain acceptable results.

Another field of investigation would be the use of different starting and target values, for example setting the target SoC to 0.75 with a starting SoC of 0.25. The energy management of this problem would then change compared to the problem investigated in this work.

In real-world driving, driving conditions often change, for example between a city, a highway or a country road. To handle these variations, different agents could be applied, depending on which trained environment fits the current circumstances. The advantage of the low computational burden would enable RL to find an acceptable solution for this approach.

Further work could also be done by choosing different Deep Reinforcement Learning agents, like a Deep Deterministic Policy Gradient (DDPG) agent, which is used in previous works [11] for the 1-action approach. This agent operates with a continuous action space; therefore, the gear shift action and the clutch shift action would have to be discretized before they are applied to the HEV model.

Finally, it must be mentioned that the obtained results could possibly be further improved by changing the hyperparameters of the Deep Reinforcement Learning algorithm or other important parameters of the whole method. However, training the agent requires an enormous amount of calculation time (depending on the problem size), and therefore testing and improving the selected parameters has its limits. Hence, in machine learning applications, the use of a large amount of memory (RAM) is useful to obtain and improve results more quickly.


Appendix A

HEV Model Parameter

Table A.1: Vehicle model parameters

variable   g      ρair    m      r      A     cb     cr
value      9.81   1.2     1585   0.315  2.2   4500   0.0108
units      m/s²   kg/m³   kg     m      m²    Nm     -

variable   cd       η
value      0.2578   0.92
units      -        -

γ(j) = [12.06, 8.05, 5.39, 4.27, 3.29, 2.56, 2.15, 1.71]
λ(j) = [1.3, 1.24, 1.16, 1.1, 1.06, 1.04, 1.02, 1]


Bibliography

[1] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. 4th ed. Vol. 1. Athena Scientific, 2017. isbn: 1886529434.

[2] Hoseinali Borhan, Ardalan Vahidi, Anthony M Phillips, Ming L Kuang, Ilya V Kolmanovsky, and Stefano Di Cairano. “MPC-based energy management of a power-split hybrid electric vehicle”. In: IEEE Transactions on Control Systems Technology 20.3 (2011), pp. 593–603.

[3] Simona Onori, Lorenzo Serrao, and Giorgio Rizzoni. Hybrid Electric Vehicles Energy Management Strategies. SpringerBriefs in Control, Automation and Robotics. Springer, 2016. isbn: 978-1-4471-6779-2.

[4] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. “Playing Atari with deep reinforcement learning”. In: arXiv preprint arXiv:1312.5602 (2013).

[5] Zhuo Liu, Chenhui Yao, Hang Yu, and Taihua Wu. “Deep reinforcement learning with its application for lung cancer detection in medical Internet of Things”. In: Future Generation Computer Systems 97 (2019), pp. 1–9.

[6] Jens Kober, J Andrew Bagnell, and Jan Peters. “Reinforcement learning in robotics: A survey”. In: The International Journal of Robotics Research 32.11 (2013), pp. 1238–1274.

[7] Charles Desjardins and Brahim Chaib-Draa. “Cooperative adaptive cruise control: A reinforcement learning approach”. In: IEEE Transactions on Intelligent Transportation Systems 12.4 (2011), pp. 1248–1260.

[8] Lars Johannesson and Bo Egardt. “Approximate dynamic programming applied to parallel hybrid powertrains”. In: IFAC Proceedings Volumes 41.2 (2008), pp. 3374–3379.

[9] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. 2nd ed. MIT Press, 2018. isbn: 9780262039246.

[10] Xuewei Qi, Guoyuan Wu, Kanok Boriboonsomsin, Matthew J Barth, and Jeffrey Gonder. “Data-driven reinforcement learning–based real-time energy management system for plug-in hybrid electric vehicles”. In: Transportation Research Record 2572.1 (2016), pp. 1–8.

[11] Roman Liessner, Christian Schroer, Ansgar Malte Dietermann, and Bernard Bäker. “Deep Reinforcement Learning for Advanced Energy Management of Hybrid Electric Vehicles”. In: ICAART (2). 2018, pp. 61–72.

[12] Roman Liessner, Ansgar Malte Dietermann, and Bernard Bäker. “Safe Deep Reinforcement Learning Hybrid Electric Vehicle Energy Management”. In: Agents and Artificial Intelligence. Ed. by Jaap van den Herik and Ana Paula Rocha. Cham: Springer International Publishing, 2019, pp. 161–181. isbn: 978-3-030-05453-3.

[13] MATLAB. version 9.8 (R2020a). Natick, Massachusetts: The MathWorks Inc., 2020. url: https://de.mathworks.com/help/reinforcement-learning/ug/dqn-agents.html.

[14] Olle Sundstrom and Lino Guzzella. “A generic dynamic programming Matlab function”. In: 2009 IEEE Control Applications (CCA) & Intelligent Control (ISIC). IEEE. 2009, pp. 1625–1630.

[15] Kevin Gurney. An introduction to neural networks. CRC press, 1997.

[16] Navdeep Singh Gill. “Artificial neural networks applications and algorithms”. In: Available at: https://www.xenonstack.com/blog/artificial-neural-network-applications/ [Accessed: 1 September 2019] (2019).

[17] Luigi del Re and Pavlo Tkachenko. Regelsysteme 2. Lecture notes. Institute for Design and Control of Mechatronical Systems, 2019.

[18] George Cybenko. “Approximation by superpositions of a sigmoidal function”. In: Mathematics of Control, Signals and Systems 2.4 (1989), pp. 303–314.

[19] Barry L Kalman and Stan C Kwasny. “Why tanh: choosing a sigmoidal function”. In: [Proceedings 1992] IJCNN International Joint Conference on Neural Networks. Vol. 4. IEEE. 1992, pp. 578–581.

[20] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. “Deep sparse rectifier neural networks”. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011, pp. 315–323.

[21] Raúl Rojas. Neural networks: a systematic introduction. Springer Science & Business Media, 2013.

[22] Reza Moradi, Reza Berangi, and Behrouz Minaei. “A survey of regularization strategies for deep models”. In: Artificial Intelligence Review 53.6 (2020), pp. 3947–3986.

[23] Sebastian Ruder. “An overview of gradient descent optimization algorithms”. In: arXiv preprint arXiv:1609.04747 (2016).

[24] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014).

[25] Ankit Choudhary. “A hands-on introduction to deep Q-learning using OpenAI Gym in Python”. In: Retrieved from https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learningpython (2019).

[26] Long Ji Lin. “Programming Robots Using Reinforcement Learning and Teaching”. In: AAAI. 1991, pp. 781–786.

[27] Shota Ohnishi, Eiji Uchibe, Kosuke Nakanishi, and Shin Ishii. “Constrained Deep Q-learning gradually approaching ordinary Q-learning”. In: Frontiers in Neurorobotics 13 (2019), p. 103.

[28] Hado Van Hasselt, Arthur Guez, and David Silver. “Deep reinforcement learning with double q-learning”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 30. 1. 2016.

[29] Junpeng Deng, Daniel Adlberger, and Luigi Del Re. “Communication-based predictive energy management strategy for a hybrid powertrain [submitted]”. In: Optimal Control, Applications and Methods (2021).

[30] Junpeng Deng, Luigi Del Re, and Stephen Jones. “Predictive hybrid powertrain energy management with asynchronous cloud update”. In: IFAC Proceedings Volumes. 2020.

[31] Junpeng Deng. “V2X-Based HEV Energy Management Using Cloud Communication”. PhD thesis. Universität Linz, 2021.

[32] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. “Safe reinforcement learning via shielding”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. 1. 2018.

[33] Yue Hu, Weimin Li, Kun Xu, Taimoor Zahid, Feiyan Qin, and Chenming Li. “Energy management strategy for a hybrid electric vehicle based on deep reinforcement learning”. In: Applied Sciences 8.2 (2018), p. 187.

[34] MATLAB. version 9.8 (R2020a). Natick, Massachusetts: The MathWorks Inc., 2020. url: https://de.mathworks.com/help/reinforcement-learning/ref/rldqnagentoptions.html.

[35] Jingda Wu, Hongwen He, Jiankun Peng, Yuecheng Li, and Zhanjiang Li. “Continuous reinforcement learning of energy management with deep Q network for a power split hybrid electric bus”. In: Applied Energy 222 (2018), pp. 799–811.

[36] Lorenzo Serrao, Simona Onori, and Giorgio Rizzoni. “ECMS as a realization of Pontryagin’s minimum principle for HEV control”. In: 2009 American Control Conference. 2009, pp. 3964–3969.
