Python experts only
Implement the attached paper using either Python or MATLAB.
My conditions are:
1- Do not use a ready-made DQN framework or function call from the AI baselines or the MATLAB toolbox (do not do that); I want to build the DQN agent from scratch so I can understand it better.
2- Explain every line of code that you write so I can understand it clearly; the simpler the code, the better.
Please let me know if you need any article that you do not have access to so I can provide it. Also, please let me know if you have any questions or inquiries as you go along with the work.
Citation: Amer, A.; Shaban, K.; Massoud, A. Demand Response in HEMSs Using DRL and the Impact of Its Various Configurations and Environmental Changes. Energies 2022, 15, 8235. https://doi.org/10.3390/en15218235
Academic Editor: Surender Reddy Salkuti
Received: 26 September 2022; Accepted: 28 October 2022; Published: 4 November 2022
Demand Response in HEMSs Using DRL and the Impact of Its
Various Configurations and Environmental Changes
Aya Amer 1,*, Khaled Shaban 2 and Ahmed Massoud 1
1 Electrical Engineering Department, Qatar University, Doha 2713, Qatar
2 Computer Science and Engineering Department, Qatar University, Doha 2713, Qatar
* Correspondence: aa1303397l@qu.edu.qa
Abstract: With smart grid advances, enormous amounts of data are made available, enabling the
training of machine learning algorithms such as deep reinforcement learning (DRL). Recent research
has utilized DRL to obtain optimal solutions for complex real-time optimization problems, including
demand response (DR), where traditional methods fail to meet time and complexity requirements.
Although DRL has shown good performance for particular use cases, most studies do not report the
impacts of various DRL settings. This paper studies the DRL performance when addressing DR in
home energy management systems (HEMSs). The trade-offs of various DRL configurations and how
they influence the performance of the HEMS are investigated. The main elements that affect the DRL
model training are identified, including state-action pairs, reward function, and hyperparameters.
Various representations of these elements are analyzed to characterize their impact. In addition,
different environmental changes and scenarios are considered to analyze the model’s scalability and
adaptability. The findings elucidate the adequacy of DRL to address HEMS challenges since, when
appropriately configured, it successfully schedules from 73% to 98% of the appliances in different
simulation scenarios and minimizes the electricity cost by 19% to 47%.
Keywords: deep learning; reinforcement learning; deep Q-networks; home energy management
system; demand response
1. Introduction
To address complex challenges in power systems associated with the presence of dis-
tributed energy resources (DERs), wide application of power electronic devices, increasing
number of price-responsive demand participants, and increasing connection of flexible load,
e.g., electric vehicles (EV) and energy storage systems (ESS), recent studies have adopted
artificial intelligence (AI) and machine learning (ML) methods as problem solvers [1]. AI
can help overcome the aforementioned challenges by directly learning from data. With
the spread of advanced smart meters and sensors, power system operators are producing
massive amounts of data that can be employed to optimize the operation and planning of
the power system. There has been increasing interest in autonomous AI-based solutions.
The AI methods require little human interaction while improving themselves and becoming
more resilient to risks that have not been seen before.
Recently, reinforcement learning (RL) and deep reinforcement learning (DRL) have
become popular approaches to optimize and control the power system operation, including
demand-side management [2], the electricity market [3], and operational control [4], among
others. RL learns the optimal actions from data through continuous interactions with
the environment, while the global optimum is unknown. It eliminates the dependency
on accurate physical models by learning a surrogate model. It identifies what works
better with a particular environment by assigning a numeric reward or penalty to the
action taken after receiving feedback from the environment. In contrast to the perfor-
mance of RL, the conventional and model-based DR approaches, such as mixed integer linear
programming [5,6], mixed integer non-linear programming (MINLP) [7], particle swarm
optimization (PSO) [8], and Stackelberg PSO [9], require accurate mathematical models
and parameters, the construction of which is challenging because of the increasing system
complexities and uncertainties.
In demand response (DR), RL has shown effectiveness by optimizing the energy
consumption for households via home energy management systems (HEMSs) [10]. The
motivation behind applying DRL for DR arises mainly from the need to optimize a large
number of variables in real time. The deployment of smart appliances in households is
rapidly growing, increasing the number of variables that need to be optimized by the HEMS.
In addition, the demand is highly fluctuant due to the penetration of EVs and RESs in the
residential sector [11]. Thus, new load scheduling plans must be processed in real-time to
satisfy the users’ needs and adapt to their lifestyles by utilizing their past experiences. RL
has been proposed for HEMSs, demonstrating the potential to outperform other existing
models. Initial studies focused on proof of concept, with research such as [12,13] advocating
for its ability to achieve better performance than traditional optimization methods such as
MILP [14], genetic algorithms [15], and PSO [16].
More recent studies focused on utilizing different learning algorithms for HEMS
problems, including deep Q-networks (DQN) [17], double DQN [18], deep deterministic
policy gradients [19], and a mixed DRL [20]. In [21], the authors proposed a multi-agent
RL methodology to guarantee optimal and decentralized decision-making. To optimize
their energy consumption, each agent corresponded to a household appliance type, such as
fixed, time-shiftable, and controllable appliances. Additionally, RL was utilized to control
heating, ventilation, and air conditioning (HVAC) loads in the absence of thermal modeling
to reduce electricity cost [22,23]. Work in [24] introduces an RL-based HEMS model that
optimizes energy usage considering DERs such as ESS and a rooftop PV system. Lastly,
many studies have focused on obtaining an energy consumption plan for EVs [25,26].
However, most of these studies only look at improving their performance compared to
other approaches without providing precise details, such as the different configurations
of the HEMS agent based on an RL concept or hyperparameter tuning for a more efficient
training process. Such design, execution, and implementation details can significantly
influence HEMS performance. DRL algorithms are quite sensitive to their design choices,
such as action and state spaces, and their hyperparameters, such as neural network size,
learning and exploration rates, and others [27].
The adoption of DRL in real-world tasks is limited by challenges in reward design and safe
learning. There is a lack in the literature of in-depth technical and quantitative descriptions
and implementation details of DRL in HEMSs. Despite the expert knowledge required in
DR and HEMS, DRL-based HEMSs pose extra challenges. Hence, a performance analysis
of these systems needs to be conducted to avoid bias and gain insight into the challenges
and the trade-offs. The compromise between the best performance metrics and the limiting
characteristics of interfacing with different types of household appliances, EV, and ESS
models will facilitate the successful implementation of DRL in DR and HEMS. Further,
there is a gap in the literature regarding the choice of reward function configuration in
HEMSs, which is crucial for their successful deployment.
In this paper, we compare different reward functions for DRL-HEMS and test them
using real-world data. In addition, we examine various configuration settings of DRL
and their contributions to the interpretability of these algorithms in HEMS for robust
performance. Further, we discuss the fundamental elements of DRL and the methods used
to fine-tune the DRL-HEMS agents. We focus on DRL sensitivity to specific parameters to
better understand their empirical performance in HEMS. The main contributions of this
work are summarized as follows:
• A study of the relationship between training and deployment of DRL is presented.
The implementation of the DRL algorithm is described in detail with different con-
figurations regarding four aspects: environment, reward function, action space,
and hyperparameters.
• We have considered a comprehensive view of how the agent performance depends on
the scenario considered to facilitate real-world implementation. Various environments
account for several scenarios in which the state-action pair dimensions are varied or
the environment is made non-stationary by changing the user’s behavior.
• Extensive simulations are conducted to analyze the performance when the model
hyperparameters are changed (e.g., learning rates and discount factor). This verifies
the validity of having the model representation as an additional hyperparameter in
applying DRL. To this end, we choose the DR problem in the context of HEMS as a
use case and propose a DQN model to address it.
The remainder of this paper is structured as follows. The DR problem formulation
with various household appliances is presented in Section 2. Section 3 presents the DRL
framework, different configurations to solve the DR problem, and the DRL implementation
process. Evaluation and analysis of the DRL performance results are discussed in Section 4.
Concluding remarks, along with future work and limitations, are presented in Section 5.
2. Demand Response Problem Formulation
The advances in smart grid technologies enable power usage optimization for cus-
tomers by scheduling their different loads to minimize the electricity cost considering
various appliances and assets, as shown in Figure 1. The DR problem has been tackled in
different studies; however, the flexible nature of the new smart appliances and the high-
dimensionality issue add a layer of complexity to it. Thus, new algorithms and techniques,
such as DRL, are proposed to address the problem. The total electricity cost is minimized
by managing the operation of different categories of home appliances. The appliances’
technical constraints and user comfort limit the scheduling choices. Thus, the DR problem
is defined according to the home appliances’ configuration and their effect on user comfort.
The home appliances can be divided into three groups as follows:
Figure 1. HEMS interfacing with different types of household appliances and assets.
2.1. Shiftable Appliances
The working schedule of this appliance group can be changed, e.g., from a high-price
time slot to another lower-price time slot, to minimize the total electricity cost. Examples of
this type are washing machines (WMs) and dishwashers (DWs). The customer may endure some discomfort while waiting for the appliance to begin working. Assume a time-shiftable
appliance requires an interval of d_n to achieve one operation cycle. The time constraints of the n-th shiftable appliance are defined as:

t_{int,n} \leq t_{start,n} \leq t_{end,n} - d_n    (1)
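To make constraint (1) concrete, the following minimal Python sketch enumerates the feasible start hours for a time-shiftable appliance; the function and variable names are illustrative and not taken from the paper.

```python
def feasible_start_times(t_int, t_end, duration):
    """Return all start hours satisfying constraint (1):
    t_int <= t_start <= t_end - duration."""
    return list(range(t_int, t_end - duration + 1))

# Example: a dishwasher allowed between hours 7 and 12 with a 2-hour cycle
print(feasible_start_times(7, 12, 2))  # [7, 8, 9, 10]
```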
2.2. Controllable, Also Known as Thermostatically Controlled, Appliances
This group of appliances includes air conditioners (ACs) and water heaters (WHs),
among others, in which the temperature can be adjusted by the amount of electrical energy
consumed. Their consumption can be adjusted between maximum and minimum values
in response to the electricity price signal, as presented in (2). Regulating the consumption
of these appliances reduces charges on the electricity bill. However, reduced consumption
can affect the customer’s thermal comfort. The discomfort is defined based on the variation (E^{max}_n - E_{n,t}). When this deviation decreases, customer discomfort decreases and vice versa.

E^{min}_n \leq E_{n,t} \leq E^{max}_n    (2)
2.3. Baseloads
These are appliances’ loads that cannot be reduced or shifted, and thus are regarded
as a fixed demand for electricity. Examples of this type are cookers and laptops.
2.4. Other Assets
The HEMS controls the EVs and ESSs charging and discharging to optimize energy
usage while sustaining certain operational constraints. The EV battery dynamics are
modeled by:
SOE^{EV}_{n,t+1} = \begin{cases} SOE_t + \eta^{EV}_{ch} \cdot E^{EV}_{n,t}, & E^{EV}_{n,t} > 0 \\ SOE_t + \eta^{EV}_{dis} \cdot E^{EV}_{n,t}, & E^{EV}_{n,t} < 0 \end{cases}    (3)

-E^{EV/max}_{n,t} \leq E^{EV}_{n,t} \leq E^{EV/max}_{n,t}, \quad t \in [t_{a,n}, t_{b,n}]    (4)

E^{EV}_{n,t} = 0, \quad \text{otherwise}    (5)

SOE^{min} \leq SOE_t \leq SOE^{max}    (6)
The ESS charging/discharging actions are modeled the same as EV battery dynamics,
as presented by Equations (3)–(6). However, the ESS is available at any time during the
scheduling horizon t ∈ T.
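A minimal Python sketch of the battery dynamics in Equations (3)–(6) is given below. Treating the SOE bounds as a hard clip and keeping the availability window of Equations (4)–(5) outside the function are simplifying assumptions; the efficiency terms follow Equation (3) exactly as printed.

```python
def soe_next(soe, e, eta_ch, eta_dis, soe_min, soe_max, e_max):
    """One-step state-of-energy update following Equations (3)-(6): charging
    (e > 0) is scaled by eta_ch, discharging (e < 0) by eta_dis, the power is
    limited as in Equation (4), and the resulting SOE is clipped to the bounds
    of Equation (6) (a hard clip is an implementation assumption)."""
    e = max(-e_max, min(e_max, e))            # Equation (4): power limits
    if e > 0:
        soe = soe + eta_ch * e                # Equation (3), charging branch
    elif e < 0:
        soe = soe + eta_dis * e               # Equation (3), discharging branch
    return max(soe_min, min(soe_max, soe))    # Equation (6): SOE limits

# Example: charge by 1.5 kWh with 95% charging efficiency
print(soe_next(soe=4.0, e=1.5, eta_ch=0.95, eta_dis=0.95,
               soe_min=1.0, soe_max=9.5, e_max=3.3))  # ≈ 5.425
```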
3. DRL for Optimal Demand Response
The merit of a DR solution depends, in part, on its capability to adapt to the environment and the user preferences and to integrate the user feedback into the control loop. This section
illustrates the Markov decision process (MDP), followed by the DRL setup for the DR
problem with different element representations.
3.1. Deep Reinforcement Learning (DRL)
DRL combines RL with deep learning to address environments with a considerable
number of states. DRL algorithms such as deep Q-learning (DQN) are effective in decision-
making by utilizing deep neural networks as policy approximators. DRL shares the same
basic concepts as RL, where agents determine the optimal possible actions to achieve their
goals. Specifically, the agent and environment interact in a sequence of decision episodes
divided into a series of time steps. In each episode, the agent chooses an action based on
the environment’s state representation. Based on the selected action, the agent receives a
reward from the environment and moves to the next state, as visualized in Figure 2.
Figure 2. Agent and environment interaction in RL.
Compared to traditional methods, RL algorithms can provide appropriate techniques
for decision-making in terms of computational efficiency. The RL problem can be modeled
with an MDP as a 5-tuple (S, A, T, R, γ), where S is a state space, A is an action space, T ∈ [0, 1] is a transition function, R is a reward function, and γ ∈ [0, 1) is a discount factor.
The main aim of the RL agent is to learn the optimal policy that maximizes the expected
average reward. In simple problems, the policy can be presented by a lookup table, a.k.a.,
Q-table, that maps all the environment states to actions. However, this type of policy is
impractical in complex problems with large or continuous state and/or action spaces. DRL
overcomes these challenges by replacing the Q-table with a deep neural network model
that approximates the states to actions mapping. A general architecture of the DRL agent
interacting with its environment is illustrated in Figure 3.
Figure 3. The general architecture of DRL.
The optimal value Q^*(s, a) represents the maximum cumulative reward that can be achieved during the training. The looping relation between the action–value function in two successive states s_t and s_{t+1} is designated as the Bellman equation.

Q^*(s, a) = r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})    (7)
The Bellman equation is employed in numerous RL approaches to direct the estimates
of the Q-values near the true values. At each iteration of the algorithm, the estimated
Q-value Q_t is updated by:

Q_{t+1}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]    (8)
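To make the update rule concrete, here is a minimal tabular sketch of Equations (7) and (8) in Python; the learning rate and discount factor values are placeholders, and in the DQN case the table is replaced by the neural network described above.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.01, gamma=0.95):
    """One application of Equation (8) on a tabular Q-function:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    td_target = r + gamma * np.max(Q[s_next])   # Bellman target, Equation (7)
    Q[s, a] += alpha * (td_target - Q[s, a])    # move the estimate toward the target
    return Q

# Toy example with 5 states and 3 actions
Q = np.zeros((5, 3))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])  # 0.01
```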
3.2. Different Configurations for DRL-Based DR
To solve the DR by DRL, an iterative decision-making method with a time step of 1 h is
considered. An episode is defined as one complete day (T = 24 time steps). As presented in
Figure 4, the DR problem is formulated based on forecasted data, appliances’ data, and user
preferences. The DRL configuration for the electrical home appliances is defined below.
Figure 4. DRL-HEMS structure.
3.2.1. State Space Configuration
The state st at time t comprises the essential knowledge to assist the DRL agent in
optimizing the loads. The state space includes the appliances’ operational data, ESS’s state
of energy (SoE), PV generation, and the electricity price received from the utility. The time
resolution to update the data is 1 h. In this paper, the state representation is kept the same
throughout the presented work. The state s_t is given as:

s_t = \left( s_{1,t}, \ldots, s_{N,t}, \lambda_t, P^{PV}_t \right), \quad \forall t    (9)
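A minimal Python sketch of assembling the state vector of Equation (9) follows; the exact per-appliance operational features are not fully specified in the paper, so a flat numeric vector is an assumption.

```python
import numpy as np

def build_state(appliance_states, price, pv_power):
    """Assemble the state vector of Equation (9): per-appliance operational
    data s_{1,t}..s_{N,t}, the electricity price lambda_t, and the PV
    generation P^PV_t for the current hour."""
    return np.array(list(appliance_states) + [price, pv_power], dtype=np.float32)

# Example: three appliances, a price of 0.12 EUR/kWh, and 1.8 kW of PV
s_t = build_state(appliance_states=[1.0, 0.0, 0.6], price=0.12, pv_power=1.8)
print(s_t.shape)  # (5,)
```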
3.2.2. Action Space Configuration
The action selection for each appliance depends on the environment states. The HEMS agent performs the binary actions {1—‘On’, 0—‘Off’} to turn on or off the shiftable appliances. The controllable appliances’ actions are discretized into five different energy levels. Similarly, the ESS and EV actions are discretized with two charging and two discharging levels. The action set a_t, determined by a neural network in each time step t, consists of the ‘ON’ or ‘OFF’ actions of the time-shiftable appliances, the power levels of the controllable appliances, and the charging/discharging levels of the ESS and EV. The action set a_t is given as:

a_t = \left( u_{t,n}, E_{t,n}, E^{EV}_{t,n}, E^{ESS}_{t,n} \right), \quad \forall t    (10)

where u_{t,n} is a binary variable to control the shiftable appliance, E_{t,n} is the energy consumption of the controllable appliance, E^{EV}_{t,n} is the energy consumption of the EV, and E^{ESS}_{t,n} is the energy consumption of the ESS.
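The discrete joint action space can be sketched in Python as the Cartesian product of the per-device action sets listed in Table 2; treating the joint action this way, and the evenly spaced controllable levels, are assumptions about how the single DQN output is organized.

```python
from itertools import product

# Per-device discrete action sets, following Table 2 (values in kWh, negative
# values = discharging). The five controllable levels are assumed to be evenly
# spaced between the endpoints shown in Table 2, and forming the joint action
# as a Cartesian product is an assumption about how the single DQN output is
# organized for one appliance of each type.
shiftable    = [0, 1]                          # OFF / ON
controllable = [0.8, 0.95, 1.1, 1.25, 1.4]     # five assumed energy levels
ev           = [-3, -1.5, 0, 1.5, 3]           # charging/discharging levels
ess          = [-0.6, 0, 0.6]

joint_actions = list(product(shiftable, controllable, ev, ess))
print(len(joint_actions))  # 2 * 5 * 5 * 3 = 150 discrete joint actions
```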
3.2.3. Reward Configuration
The HEMS agent learns how to schedule and control the appliances’ operation through
trial experiences with the environment, i.e., household appliances. It is essential to decide
the rewards/penalties and their magnitude accordingly. The reward function encapsu-
lates the problem objectives, minimizing electricity and customer discomfort costs. The
comprehensive reward for the HEMS is defined as:
R = r^1_t + r^2_t    (11)
where r^1_t is the operation electricity cost measured in €, and r^2_t measures the dissatisfaction caused to the user by the HEMS and is measured in €. The electricity cost r^1_t is determined by:

r^1_t = \lambda_t \left( E^{g}_t - E^{PV}_t - E^{EV/dis}_t - E^{ESS/dis}_t \right), \quad \forall t    (12)
The discomfort index r^2_t considers different components, which are shown in Equations (13)–(16). It reflects the discomfort cost of the shiftable appliances, controllable appliances, EV, and ESS in the scheduling program, respectively. The importance factor ζ_n converts the discomfort caused to the user into a cost, measured in €/kWh. The operation limits and user comfort constraints for each appliance type are detailed in [17].

r^2_t = \zeta_n \left( t_{start,n} - t_{a,n} \right)    (13)

r^2_t = \zeta_n \left( E^{max}_n - E_{n,t} \right)    (14)

r^2_t = \zeta_n \left( SOE_t - SOE^{max} \right)^2, \quad t = t_{EV,end}    (15)

r^2_t = \begin{cases} \zeta_n \left( SOE_t - SOE^{max} \right)^2 & \text{if } SOE_t > SOE^{max} \\ \zeta_n \left( SOE_t - SOE^{min} \right)^2 & \text{if } SOE_t < SOE^{min} \end{cases}    (16)
The total reward function R aims to assess the HEMS performance in terms of cost
minimization and customer satisfaction. Accordingly, the reward function is defined with
different representations as follows.
(1) Total cost:
In this representation, similar to [15], the reward is the negative sum of the electricity
cost (λt) and dissatisfaction cost (ξt) at time t for n appliances. This reward representation
is widely used in HEMS and has a simple mathematical form and input requirements.
R_1 = -\sum_{n \in N} \left( r^1_t + r^2_t \right)    (17)
(2) Total cost—squared:
In this representation, the first function is modified by squaring the result of adding all
negative costs of each episode. This increasingly penalizes actions that lead to higher costs.
R_2 = -\left( \sum_{n \in N} \left( r^1_t + r^2_t \right) \right)^2    (18)
(3) Total cost—inversed:
The reward function is presented as the inverse of electricity and dissatisfaction costs,
as presented in [17]. Actions that decrease the cost lead to an increase in the total rewards.
R_3 = \sum_{n \in N} \frac{1}{r^1_t + r^2_t}    (19)
(4) Total cost—delta:
Here, the reward is the difference between the previous and current costs, turning
positive for the action that decreases the total cost and negative when the cost increases.
R_4 = \sum_{n \in N} \left[ \left( r^1_{t-1} + r^2_{t-1} \right) - \left( r^1_t + r^2_t \right) \right]    (20)
(5) Total cost with a penalty:
This reward representation is more information-dense than the previously mentioned
reward functions. It provides both punishment and rewards. The electricity price (λt) is
divided into low and high price periods using levels. The penalty is scaled according to the
electricity price level. The agent receives a positive reward if it successfully schedules the
appliances in the user’s desired scheduling interval within a low-price period (λt < λavg).
The agent is penalized by receiving a negative reward if it schedules the appliance in a
high-price period ( λt > λavg) time slot or outside the desired scheduling time defined by
the user. Table 1 summarizes the aforementioned reward function representations, and
Table 2 summarizes the state set, action set, and reward functions for each appliance type.
R_5 = \begin{cases} \sum_{n \in N} \left( r^1_t + r^2_t \right) & \text{if } \lambda_t < \lambda_{avg} \\ -\sum_{n \in N} \left( r^1_t + r^2_t \right) & \text{if } \lambda_t > \lambda_{avg} \end{cases}    (21)
Table 1. Different reward representations.

Variations    Equations
Reward-1      R = -\sum_{n \in N} (r^1_t + r^2_t)
Reward-2      R = -\left( \sum_{n \in N} (r^1_t + r^2_t) \right)^2
Reward-3      R = \sum_{n \in N} \frac{1}{r^1_t + r^2_t}
Reward-4      R = \sum_{n \in N} \left[ (r^1_{t-1} + r^2_{t-1}) - (r^1_t + r^2_t) \right]
Reward-5      R = \sum_{n \in N} (r^1_t + r^2_t) if \lambda_t < \lambda_{avg};  R = -\sum_{n \in N} (r^1_t + r^2_t) if \lambda_t > \lambda_{avg}
Table 2. State, action, and reward for each appliance type.
Appliances Type Action Set Reward
Fixed {1—ON} Equation (12)
Shiftable {0—OFF, 1—ON} Equations (12) and (13)
Controllable {0.8, 1, . . . , 1.4} Equations (12) and (14)
EV {−3, −1.5, 0, 1.5, 3} Equations (12) and (15)
ESS {−0.6, 0, 0.6} Equations (12) and (16)
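The five reward representations of Table 1 can be sketched in Python as below; the costs argument, a list of per-appliance (r^1_t, r^2_t) pairs, and the small epsilon guard in Reward-3 are implementation assumptions rather than details from the paper.

```python
def r1_total_cost(costs):
    """Reward-1: negative sum of electricity and discomfort costs (Equation (17))."""
    return -sum(c1 + c2 for c1, c2 in costs)

def r2_squared(costs):
    """Reward-2: negative square of the summed costs (Equation (18))."""
    return -(sum(c1 + c2 for c1, c2 in costs) ** 2)

def r3_inverse(costs, eps=1e-6):
    """Reward-3: sum of inverse per-appliance costs (Equation (19));
    eps guards against division by zero and is an implementation assumption."""
    return sum(1.0 / (c1 + c2 + eps) for c1, c2 in costs)

def r4_delta(costs_prev, costs):
    """Reward-4: previous-minus-current total cost (Equation (20))."""
    return sum((p1 + p2) - (c1 + c2)
               for (p1, p2), (c1, c2) in zip(costs_prev, costs))

def r5_penalty(costs, price, price_avg):
    """Reward-5: positive at low price, negative at high price (Equation (21))."""
    total = sum(c1 + c2 for c1, c2 in costs)
    return total if price < price_avg else -total

# costs: one (electricity cost r1_t, discomfort cost r2_t) pair per appliance
costs = [(0.30, 0.05), (0.10, 0.00)]
print(r1_total_cost(costs), r5_penalty(costs, price=0.10, price_avg=0.15))
```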
3.3. Implementation Process
Figure 5 presents the implementation process of the DQN agent. At the beginning of the training, the Q values, states, actions, and network parameters are initialized. The outer loop limits the number of training episodes, while the inner loop limits the time steps in each training episode. The DQN agent uses the simple ε-greedy exploration policy, which randomly selects an action with probability ε ∈ [0, 1] or selects the action with the maximum
Q-value. After each time step, the environment states are randomly initialized. However,
the DQN network parameters are saved after one complete episode. At the end of each
episode, a sequence of states, actions, and rewards is obtained, and new environment states
are observed. Then the DQN network parameters are updated using Equation (9), and the
network is utilized to decide the following action.
Figure 5. DQN agent implementation flowchart.
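The flowchart in Figure 5 can be condensed into a short from-scratch training loop. The sketch below is a hedged Python/PyTorch interpretation: the toy environment, the per-sample update without a replay buffer or target network (neither is described here), and all numeric choices are illustrative assumptions rather than the authors' implementation, which was written in MATLAB (Section 4.1).

```python
import random
import numpy as np
import torch
import torch.nn as nn

class ToyHemsEnv:
    """Stand-in environment with a fixed-size state (as in Equation (9)) and a
    small discrete joint action space. The real HEMS environment (appliance
    models, prices, PV, EV/ESS) would replace this class; everything here is
    illustrative only."""
    def __init__(self, n_state=26, n_actions=10, horizon=24):
        self.n_state, self.n_actions, self.horizon = n_state, n_actions, horizon
    def reset(self):
        self.t = 0
        return np.random.rand(self.n_state).astype(np.float32)
    def step(self, action):
        self.t += 1
        next_state = np.random.rand(self.n_state).astype(np.float32)
        reward = -float(np.random.rand())     # placeholder for Equation (11)
        done = self.t >= self.horizon         # one episode = one day (T = 24)
        return next_state, reward, done

def make_q_network(n_state, n_actions, hidden=36):
    # Three hidden layers of 36 ReLU units, one output per discrete action.
    return nn.Sequential(
        nn.Linear(n_state, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_actions))

env = ToyHemsEnv()
q_net = make_q_network(env.n_state, env.n_actions)
optimizer = torch.optim.Adam(q_net.parameters(), lr=0.01)
gamma, eps, eps_min, eps_decay = 0.95, 1.0, 0.01, 0.005

for episode in range(200):                    # outer loop: training episodes
    state = env.reset()
    done = False
    while not done:                           # inner loop: 24 hourly time steps
        # epsilon-greedy action selection
        if random.random() < eps:
            action = random.randrange(env.n_actions)
        else:
            with torch.no_grad():
                action = int(q_net(torch.tensor(state)).argmax())
        next_state, reward, done = env.step(action)

        # one-sample temporal-difference target following Equation (8)
        with torch.no_grad():
            target = reward + gamma * q_net(torch.tensor(next_state)).max() * (not done)
        pred = q_net(torch.tensor(state))[action]
        loss = (pred - target) ** 2           # squared TD error as the loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        state = next_state
        if eps > eps_min:                     # decayed epsilon-greedy
            eps *= 1 - eps_decay
```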
4. Performance Analysis
4.1. Case Study Setup
The HEMS utilizes forecasted data such as electricity price and PV generation to make
load schedules to satisfy the demand response objectives. It is assumed that the received data are accurate and that the prediction model accounts for their uncertainties. Electricity prices and appliance power data are obtained to train the agent. The agent is trained, validated, and
tested on real-world datasets [17]. Table 3 presents a sample of the considered household
appliances. The home can also be equipped with an EV and ESS; their parameters are
shown in Table 4. When the PV-power is less than the appliances’ consumption, the
appliances consume power from the grid.
Table 3. Parameters of household appliances.
ID      ζn     Power Rating (kWh)    [T_{int,n}, T_{end,n}]    d_n
DWs     0.2    1.5                   7–12                      2
WMs     0.2    2                     7–12                      2
AC-1    2      0.6–2                 0–24                      –
AC-2    2.5    0.6–2                 0–24                      –
AC-3    3      0.6–2                 0–24                      –
Table 4. Parameters of EV and ESS.
Parameters ESS EV
Charging/discharging levels [0.3, 0.6] [1.5, 3.3]
Charging/discharging limits (kWh) 3.3 0.6
Minimum discharging (%) 20 20
Maximum charging (%) 95 95
ζn 2.5 2.5
The dataset is split into three parts: training and validation, where hyperparameters
are optimized, and testing, as presented in Figure 6. The hyperparameter tuning process is
divided into three phases. The initial phase starts with baseline parameters. The middle
phase is manual tuning or grid search of a few essential parameters, and the intense
phase is searching and optimizing more parameters and final features of the model. Based
on the selected parameters, the performance is validated. Then, a part of the hyperparameters is changed, the model is trained again, and the difference in performance is checked. In
the test phase, one day (24 h) from the testing dataset is randomly selected to test the
DRL-HEMS performance. Since the nature of the problem is defined as discrete action
space, DQN is selected as the policy optimization algorithm.
Figure 6. The differences between the training dataset, validation dataset, and testing dataset.
The utilized network consists of two parallel input layers, three hidden layers with
ReLU activation for all layers and 36 neurons in each layer, and one output layer. The
training of the DQN for 1000 episodes is almost 140 min to 180 min on a computer with
an Intel Core i9-9980XE CPU @ 3.00 GHz. The code is implemented in MATLAB R2020b.
Before discussing the results, in Table 5 we summarize the settings for each DRL element,
including the training process, reward, action set, and environment.
Table 5. Settings for each of the DRL elements considered in this work.
Elements Settings
Training Process
Learning Rate {0.00001, 0.0001, 0.001, 0.01, 0.1}
Discount Factor {0.1, 0.5, 0.9}
Epsilon Decay {0, 0.005, 0.1}
Dataset 6-month and 1-year datasets.
Reward Total cost (R1), squared (R2), inversed (R3), delta (R4), and with a
penalty (R5)
Actions sets {500, 2000, 4500, 8000, 36,000}
Environment With and without PV and different Initial SoE {30%, 50%, 80%}
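The described network can be sketched in Python/PyTorch as follows; how the two parallel input layers split the state and how their outputs are merged are not specified, so the split into appliance and price/PV features and the concatenation are assumptions.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of the described network: two parallel input layers, three
    36-unit ReLU hidden layers, and one output layer with one Q-value per
    discrete action. Splitting the state into appliance features and
    exogenous features (price, PV) and merging the two branches by
    concatenation are assumptions; the paper does not specify this."""
    def __init__(self, n_appliance_feats, n_exo_feats, n_actions, hidden=36):
        super().__init__()
        self.in_appliance = nn.Linear(n_appliance_feats, hidden)  # parallel input 1
        self.in_exo = nn.Linear(n_exo_feats, hidden)              # parallel input 2
        self.hidden_layers = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.out = nn.Linear(hidden, n_actions)                   # one Q-value per action

    def forward(self, x_appliance, x_exo):
        h = torch.cat([self.in_appliance(x_appliance), self.in_exo(x_exo)], dim=-1)
        return self.out(self.hidden_layers(h))

# Example forward pass with illustrative sizes
net = QNetwork(n_appliance_feats=10, n_exo_feats=2, n_actions=150)
print(net(torch.rand(1, 10), torch.rand(1, 2)).shape)  # torch.Size([1, 150])
```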
4.2. Training Parameters Trade-Offs
Different simulation runs are performed to tune the critical parameters for the studied
use case since the performance of DRL often does not hold when the parameters are
varied. This helps to identify the influential elements for the success or failure of the
training process. The DQN agent hyperparameters are tuned based on the parameter settings in Table 5, with training over 1000 episodes to improve the obtained policy. The optimal policy is the one with which the agent achieves the most cost reduction by maximizing the reward function. The findings elucidate the adequacy of DRL to address HEMS challenges since it successfully schedules from 73% to 98% of the appliances in
different scenarios. To further analyze the training parameter trade-offs and their effect on
the model’s performance, the following points are observed:
• In general, training the DRL agent with 1-year data takes more time steps to converge
but leads to more cost savings in most simulation runs. Therefore, training with more
prior data results in a more robust policy for this use case. However, the dataset size
has less impact when γ = 0.1 since the agent’s performance is greedy regardless of the
training dataset size.
• The discount factor impacts the agent’s performance; it shows how the DQN agent
can enhance the learning performance based on future rewards. For the values around
0.9 discount factors, the rewards converge to the same values with a slight difference.
In the 0.5 case, the agent converges faster but with fewer rewards. The performance
deteriorates when assigning a small value to γ. For example, when γ = 0.1, the
instabilities and variations in the reward function increase severely. This is because
the learned policy relies more on greedy behavior and pushes the agent towards
short-term gains over long-term gains to be successful.
• Selecting the learning rate in the DRL training is fundamental. A too-small learning
rate value may require an extensive training time that might get stuck, while large
values could lead to learning a sub-optimal or unstable training. Different learning
rates are tested to observe their effect on training stability and failure. Figure 7 is
plotted to illustrate the required time steps for convergence at different learning rates,
along with the percentage of correct action taken by the trained agent for each one of
the appliances over one day. It can be observed that the percentage of correct actions
has an ascending trend from 1 × 10−5 up to 0.01. However, this value decreases at the learning rate values of 0.1 and 1, where the agent fails to learn anything because the learning rates are too large. For small learning rates, i.e.,
1 × 10−5 and 1 × 10−4, more training time is required to converge. However, it is
observed that the agent can learn the optimization problem fairly with a learning rate
of 0.01, which takes around 10,000 time steps based on our implementation.
Figure 7. Learning steps and % of scheduled appliances using different learning rates.
4.3. Exploration–Exploitation Trade-Offs
The exploration rate is an essential factor in further improving DRL performance.
However, reaching the solutions too quickly without enough exploration (ε = 0) could
lead to local minima or total failure of the learning process. In an improved epsilon-greedy
method, also known as the decayed-epsilon-greedy method, learning a policy is done with
total N episodes. The algorithm initially sets ε = εmax (i.e., εinit = 0.95), then gradually decreases it to end at ε = εmin (i.e., εmin = 0.1) over n training episodes. Specifically, during the initial training process, more freedom is given to explore with a high probability (i.e., εinit = 0.95). As the learning advances, ε is decayed by the decay rate εdecay over time to exploit the learned policy. If ε is greater than εmin, it is updated using ε = ε · (1 − εdecay) at the end of each time step.
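A compact Python sketch of this decayed-epsilon-greedy schedule, using the values quoted above and an arbitrary step count, is shown below.

```python
def epsilon_schedule(eps_init=0.95, eps_min=0.1, eps_decay=0.005, n_steps=1000):
    """Decayed-epsilon-greedy schedule: eps <- eps * (1 - eps_decay) at every
    time step while eps is still above eps_min."""
    eps, trace = eps_init, []
    for _ in range(n_steps):
        trace.append(eps)
        if eps > eps_min:
            eps *= 1 - eps_decay
    return trace

trace = epsilon_schedule()
print(round(trace[0], 3), round(trace[-1], 3))  # starts at 0.95, ends near eps_min
```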
In order to study the effect of the exploration rate, it is varied during training until
the agent can get out of the local optimum. The maximum value for epsilon εmax is set
to 1 and 0.01 as a minimum value εmin. Simulations are conducted to observe how the
agent explores the action space under three different ε values. All hyperparameters are
kept constant, and the learning rate is set to 0.01 and the discount factor at 0.95. In the first
case, no exploration is done (ε = 0), where the agent always chooses the action with the
maximum Q-value. The second case presents low exploration where ε is decayed with a
decay rate = 0.1. Lastly, ε is slowly decayed with a lower decay rate = 0.005 overtime to
give more exploration to the agent. Figure 8 compares the rewards obtained by the agent
at every given episode in the three cases. The greedy policy, i.e., no exploration, makes
the agent settle on a local optimum and quickly choose the same actions. The graph also
depicts that the exploration makes it more likely for the agent to cross the local optimum,
where increasing the exploration rate allows the agent to reach more rewards as it continues
to explore.
Figure 8. Reward function for different exploration rates.
4.4. Different Reward Configuration Trade-Offs
The DRL-learned knowledge is based on the collected rewards. The reward function
can limit the learning capabilities of the agent. We focus on improving the reward func-
tion design by testing different reward formulations and studying the results. Different
reward formulations used during the DRL training often lead to quantitative changes in
performance. To analyze this, the previously mentioned reward functions in Table 1 are
used. The performance is compared to a reference reward representation (Reward-1). For
the five scenarios, the learning rate is set to 0.01 and the discount factor to 0.95. Figure 9
presents the average rewards for the five scenarios. The training is repeated five times for
each reward function, and the best result of the five runs is selected.
The penalty reward representation (Reward-5) is found to be the best performing
across all five representations. It is observed that adding a negative reward (penalty)
helps the agent learn the user’s comfort and converges faster. The results show that after
including a penalty term in the reward, the convergence speed of the network is improved
by 25%, and the results of the cost minimization are improved by 10–16%. Although
Reward-3 achieves low electricity cost, it leads to high discomfort cost. The negative cost
(Reward-1), widely used in the literature, is found to be one of the worst-performing
reward functions tested. However, squaring the reward function (Reward-2) shows an
improvement in the agent’s convergence speed. This confirms that DRL is sensitive to the
reward function scale. To further analyze the five reward functions trade-offs, the following
points are observed:
• Reward-1 has the simplest mathematical form and input requirements. It is a widely
used function in previous research work. However, it takes a long time to converge. It
reaches convergence at about 500 episodes. It has a high possibility of converging to
suboptimal solutions.
• Reward-2 has the same input as Reward-1 with a different suitable rewards scale,
converging faster. It reaches convergence at about 150 episodes. Although it converges
faster than Reward-1, it provides a suboptimal energy cost in the testing phase. This
indicates that this reward representation can easily lead to local optima.
• Reward-4 increases the agent’s utilization of previous experiences but requires more
input variables. It is less stable during the training by having abrupt changes in the
reward values during training.
• Reward-5 positively affects the network convergence speed by achieving convergence
at less than 100 episodes. Additionally, it helps the agent learn robust policy by
utilizing the implicit domain knowledge. It provides the most comfort level and
energy cost reduction in the testing phase. However, this reward representation has
a complex mathematical form and input requirements and requires a long time to
complete one training episode.
Figure 9. Reward functions for different scenarios.
Table 6 presents the electricity and discomfort costs for the reward configurations. The
discomfort cost presents the cost of the power deviation from the power set-point for all the
controllable appliances. Figure 10 highlights the trade-offs of each reward configuration
across the training and testing.
Table 6. Comparison of electricity cost for different reward configurations.
Reward ID Electricity Cost (€/Day) Discomfort Cost (€/Day)
Reward-1 5.12 1.72
Reward-2 4.85 1.90
Reward-3 4.71 2.48
Reward-4 4.92 2.38
Reward-5 4.62 1.34
Figure 10. Bar chart of the trade-offs for each reward configuration.
4.5. Comfort and Cost-Saving Trade-Offs
The main objective of the DRL-HEMS is to balance users’ comfort and cost savings.
Reward settings influence optimization results. Therefore, we analyze the DRL-HEMS
performance for various user behavior scenarios by changing the reward settings. This
process is done by assigning different weights to r^1_t and r^2_t. The process is controlled by the ρ value, as presented in (22). ρ takes a value between 0 and 1.

\min R = \rho \, r^1_t + (1 - \rho) \, r^2_t    (22)
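A minimal Python sketch of the weighted objective in Equation (22) follows; the cost values in the sweep are illustrative only.

```python
def weighted_reward(r1, r2, rho):
    """Weighted objective of Equation (22): rho weights the electricity cost
    r1_t and (1 - rho) weights the discomfort cost r2_t."""
    return rho * r1 + (1 - rho) * r2

# Sweep rho to see how the emphasis shifts between cost saving and comfort
for rho in (0.0, 0.3, 0.6, 0.8, 1.0):
    print(rho, weighted_reward(r1=4.6, r2=1.3, rho=rho))
```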
The simulations are conducted using the reward representation in (21). The value of
ρ is varied to achieve the trade-off between cost saving r^1_t and customer comfort r^2_t. A
value of ρ close to zero makes the HEMS focus strictly on comfort and thus will not try to
minimize energy consumption corresponding to the electricity price. Figure 11 illustrates
the user’s comfort and demand with respect to the variation of ρ. The red line indicates
the out-of-comfort level, the percentage of power deviation from the power set-point for
all the controllable appliances. The blue line denotes the total energy consumption for the
same appliances. The actions taken by the agent differ in each case. According to Equation (22), increasing the ρ value decreases the energy consumption, but beyond a certain point user comfort starts to be sacrificed. A ρ greater than 0.6 achieves a larger reduction in energy consumption, but the out-of-comfort percentage rises rapidly, reaching a discomfort rate of 63% at ρ = 0.8. A user could instead select a ρ that targets more cost reduction while keeping an acceptable comfort level, such as ρ = 0.6, where the energy consumption is optimized and the out-of-comfort percentage is around 18%.
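The selection of ρ described above can be summarized as a simple threshold rule; the sketch below assumes hypothetical per-ρ evaluation results (only the 18% and 63% points echo values reported in the text) and picks the most cost-focused weight whose discomfort stays acceptable.

```python
# Hedged sketch of the rho selection rule: among evaluated weights, keep the
# largest rho whose out-of-comfort percentage stays below a user threshold.
# The table of results is illustrative, not the paper's measured curve.
results = {          # rho: (energy consumption in kWh/day, out-of-comfort in %)
    0.2: (32.0, 2.0),
    0.4: (28.5, 7.0),
    0.6: (24.0, 18.0),
    0.8: (21.0, 63.0),
}
threshold = 20.0     # acceptable out-of-comfort percentage
feasible = [rho for rho, (_, pct) in results.items() if pct <= threshold]
best_rho = max(feasible)           # most cost-focused weight still judged comfortable
print(best_rho, results[best_rho]) # -> 0.6 (24.0, 18.0)
```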
Figure 11. Comfort and energy-saving trade-offs.
4.6. Environmental Changes and Scalability
Uncertainties and environmental changes are significant problems that complicate HEMS algorithms. However, the DRL agent can be trained to overcome these problems. To show this, two scenarios are designed to examine how the trained agent generalizes and adapts to different changes in the environment. First, the PV is eliminated by setting its output generation to zero. Table 7 shows that the DRL-HEMS can optimize home consumption in both cases, with and without PV, while achieving appropriate electricity cost savings. The PV source reduces the reliance on grid energy and creates more customer profit. Then, various initial SoE values for the EV and ESS are analyzed, and the findings are presented in Table 8. As can be seen, the electricity cost decreases as the initial SoE of the EV and ESS rises. Thus, the agent is able to select the correct actions to utilize the EV and ESS.
Table 7. The total operational electricity cost for different environments for one day.
Case         Algorithm          Electricity Cost (€/day)
With PV      Normal operation   7.42
With PV      DRL-HEMS           4.62
Without PV   Normal operation   9.82
Without PV   DRL-HEMS           7.98
Table 8. The total operational electricity cost for different SoE values for one day.
Initial SoE for the EV and ESS   Algorithm          Electricity Cost (€/day)
30%                              Normal operation   7.42
30%                              DRL-HEMS           5.02
50%                              Normal operation   7.42
50%                              DRL-HEMS           4.62
80%                              Normal operation   7.42
80%                              DRL-HEMS           3.29
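A possible way to encode these two robustness scenarios is as environment configurations, as in the hedged sketch below; the HemsEnvConfig class and its field names are hypothetical and do not correspond to the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical environment configuration: zeroing the PV profile reproduces the
# "without PV" case, and the SoE fields reproduce the 30/50/80% initial states.
@dataclass
class HemsEnvConfig:
    pv_profile: List[float] = field(default_factory=lambda: [0.5] * 24)  # kW per hour
    ev_initial_soe: float = 0.3    # fraction of EV battery capacity at t = 0
    ess_initial_soe: float = 0.3   # fraction of ESS capacity at t = 0

no_pv = HemsEnvConfig(pv_profile=[0.0] * 24)                       # without-PV scenario
high_soe = HemsEnvConfig(ev_initial_soe=0.8, ess_initial_soe=0.8)  # 80% SoE scenario
print(sum(no_pv.pv_profile), high_soe.ev_initial_soe)              # -> 0.0 0.8
```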
The impact of scalability on DRL performance is analyzed. For all studied cases, the learning rate is set to 0.01 and the discount factor to 0.95. The simulations are conducted using the reward configuration in (21). In this case, a total of 1000 episodes are used to train the model. Table 9 shows the performance in terms of training and testing time, considering various state–action pairs obtained by expanding the action space while maintaining a consistent state space. The Episode Reward Manager in MATLAB is used to calculate the precise training time, and the testing time is noted. The action space size is determined by the power consumption space, i.e., the power ratings, of the home appliances; different consumption levels correspond to different actions. As indicated in Table 9, the agent needs more time to complete 1000 episodes as the number of state–action pairs increases. Because of the very large action spaces, the outcomes in the last two cases are worse than random. Increasing the number of state–action pairs therefore limits the results. Increasing the number of consumption levels (actions) gives finer-grained control of energy consumption; however, this is not achievable for most home appliances. Consequently, it makes more sense to discretize the energy consumption into certain power levels. This results in more practical control of the appliances and reduces the agent's search time.
Table 9. The agent performance with different state–action configurations.
No. of State–Action Pairs   Training Time (s)   Testing Time (s)
500       8746      1.0
2000      23,814    2.1
4500      87,978    5.3
8000      122,822   8.1
36,000    377,358   413.2
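The growth of the state–action space with finer power discretization can be illustrated with the short sketch below; the appliance set and the number of levels per appliance are assumptions chosen only for illustration.

```python
from math import prod

# Illustrative count of how discretizing appliance power ratings into a few
# levels sets the joint action-space size and hence the number of state-action
# pairs the agent must cover during training.
levels = {     # appliance -> number of discrete power levels (possible actions)
    "AC": 4,
    "EWH": 3,
    "EV": 5,
    "ESS": 5,
}
joint_actions = prod(levels.values())           # 4 * 3 * 5 * 5 = 300 joint actions
n_states = 24                                   # e.g., one state per hourly time slot
print(joint_actions, n_states * joint_actions)  # -> 300 7200 state-action pairs
```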
4.7. Comparison with Traditional Algorithms
Figure 12 demonstrates the overall costs of operating the home appliances using the MILP solver and the DRL agent. The DRL agent initially does not perform well. However, as it gains more experience through training, it adjusts its policy by exploring and exploiting the environment and eventually outperforms the MILP solver, as shown in Figure 12. This is because the DRL agent can change its behavior during the training process, while the MILP is incapable of learning.
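For reference, a minimal MILP baseline for a single shiftable appliance can be sketched with the PuLP library, as below; PuLP, the price vector, the power rating, and the scheduling window are all assumptions for illustration and do not reproduce the paper's solver setup.

```python
import pulp  # third-party MILP front end; the bundled CBC solver is used here

# Generic MILP baseline sketch: choose the cheapest hours inside a scheduling
# window for one shiftable appliance. All data below are invented.
prices = [0.10] * 7 + [0.25] * 10 + [0.40] * 4 + [0.15] * 3   # EUR/kWh for 24 hours
power, duration = 1.5, 3                                      # kW rating, required run hours
window = range(17, 24)                                        # allowed operating hours

prob = pulp.LpProblem("shiftable_appliance", pulp.LpMinimize)
on = [pulp.LpVariable(f"on_{t}", cat="Binary") for t in range(24)]  # 1 if running at hour t
prob += pulp.lpSum(prices[t] * power * on[t] for t in range(24))    # minimize electricity cost
prob += pulp.lpSum(on) == duration                                  # run exactly 'duration' hours
for t in range(24):
    if t not in window:
        prob += on[t] == 0                                          # stay inside the window

prob.solve(pulp.PULP_CBC_CMD(msg=False))
schedule = [t for t in range(24) if on[t].value() == 1]
print(schedule, round(pulp.value(prob.objective), 2))               # -> [21, 22, 23] 0.68
```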
Figure 12. Total electricity costs over one day by DRL agent and MILP solver.
5. Conclusions
This paper addresses DRL’s technical design and implementation trade-offs for DR
problems in a residential household. Recently, DRL-HEMS models have been attracting
attention because of their efficient decision-making and ability to adapt to the residen-
tial user lifestyle. The performance of DRL-based HEMS models is driven by different
elements, including the action–state pairs, the reward representation, and the training
hyperparameters. We analyzed the impacts of these elements when using the DQN agent
as the policy approximator. We developed different scenarios for each of these elements
and compared the performance variations in each case. The results show that DRL can
address the DR problem since it successfully minimizes electricity and discomfort costs
for a single household user. It schedules from 73% to 98% of the appliances in different
cases and minimizes the electricity cost by 19% to 47%. We have also reflected on considerations usually left out of DRL-HEMS studies, for example, different reward configurations. The reward functions considered are characterized by their complexity, required input variables, convergence speed, and policy robustness. The performance of the reward
functions is evaluated and compared in terms of minimizing the electricity cost and the
user discomfort cost. In addition, the main HEMS challenges related to environmental
changes and high dimensionality are investigated. The proposed DRL-HEMS performs differently in each scenario, taking advantage of different informative states and rewards. These results highlight the importance of treating the model configuration as an additional design variable in DRL-HEMS applications. Understanding the gaps and the limitations
of different DRL-HEMS configurations is necessary to make them deployable and reliable
in real environments.
This work can be extended by considering the uncertainty of appliance energy usage
patterns and working time. Additionally, network fluctuations and congestion impacts
should be considered when implementing the presented DRL-HEMS in real-world envi-
ronments. One limitation of this work is that the discussed results only relate to the DQN
agent training and do not necessarily represent the performance for other types of network,
such as policy gradient methods.
Author Contributions: Conceptualization, A.A., K.S. and A.M.; methodology, A.A.; software, A.A.;
validation, K.S. and A.M.; formal analysis, A.A.; investigation, A.A.; resources, K.S.; writing—original
draft preparation, A.A.; writing—review and editing, K.S. and A.M. All authors have read and agreed
to the published version of the manuscript.
Funding: This research was funded by the NPRP11S-1202-170052 grant from Qatar National Research
Fund (a member of the Qatar Foundation). The statements made herein are solely the responsibility
of the authors.
Data Availability Statement: Data are contained within the article.
Conflicts of Interest: The authors declare no conflict of interest.
Nomenclature
Indices
t              Index of time steps.
n              Index of appliances.
Variables
λ_t            Electricity price at time t (€/kWh).
t_int,n        Start of scheduling window (t).
t_end,n        End of scheduling window (t).
t_start,n      Start of operating time for appliance n (t).
E_{n,t}        Consumption for appliance n at time t (kWh).
E_t^{PV}       Solar energy surplus at time t (kWh).
E_t^{EV/dis}   EV discharging energy (kWh).
E_t^{ESS/dis}  ESS discharging energy (kWh).
SOE_t          State of energy for EV/ESS at time t.
E_t^{pv}       PV generation at time t (kW).
E_{t,n}^{EV}   Energy consumption level of EV at time t (kWh).
E_{t,n}^{ESS}  Energy consumption level of ESS at time t (kWh).
E_t^{total}    Total energy consumed by the home at time t (kWh).
E_t^{g}        Total energy purchased by the home at time t (kWh).
Constants
d_n            Operation period of appliance n.
ζ_n            Appliance discomfort coefficient.
E_n^{max}      Maximum energy consumption for appliance n (kWh).
E_n^{min}      Minimum energy consumption for appliance n (kWh).
E_{n,t}^{EV/max}  Maximum charging/discharging level of EV (kWh).
η_{ch}^{EV}/η_{dis}^{EV}  EV charging/discharging efficiency factor.
SOE^{min}      Minimum state of energy of the EV/ESS batteries at time t.
SOE^{max}      Maximum state of energy of the EV/ESS batteries at time t.