Article

DRL-Based Improved UAV Swarm Control for Simultaneous Coverage and Tracking with Prior Experience Utilization

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Drones 2024, 8(12), 784; https://doi.org/10.3390/drones8120784
Submission received: 9 November 2024 / Revised: 20 December 2024 / Accepted: 21 December 2024 / Published: 23 December 2024

Abstract

Area coverage and target tracking are important applications of UAV swarms. However, performing both tasks simultaneously is challenging, particularly under resource constraints. In such scenarios, UAV swarms must collaborate to cover extensive areas while simultaneously tracking multiple targets. This paper proposes a deep reinforcement learning (DRL)-based, scalable UAV swarm control method for the simultaneous coverage and tracking (SCT) task, called the SCT-DRL algorithm. SCT-DRL simplifies the interactions within a UAV swarm into a series of pairwise interactions and aggregates the information of perceived targets in advance, on which basis it forms a control framework that accommodates a variable number of neighboring UAVs and targets. Another highlight of SCT-DRL is the use of trajectories from a traditional one-step optimization method to initialize the value network, which encourages the UAVs to select actions leading to states with less remaining time to task completion and thus avoids extensive random exploration at the beginning of training. SCT-DRL can be seen as a special improvement of the traditional one-step optimization method: it is shaped by the samples derived from the latter and gradually overcomes the inherent myopic issue through the far-sighted value estimation learned during RL training. Finally, the effectiveness of the proposed method is demonstrated through numerical experiments.

1. Introduction

Unmanned aerial vehicle (UAV) swarms can perform numerous tasks a single UAV cannot accomplish [1,2,3], such as communication relay, logistics transportation, search and rescue, etc. [4,5,6,7]. Area coverage and target tracking are important applications of UAV swarms [8,9,10]. Specifically, area coverage refers to expanding the coverage of specific areas whereby the likelihood of detecting potential targets can be increased [11,12]; whereas target tracking is achieved by continuously keeping one or more static or dynamic key targets within the perception range [13,14]. Area coverage primarily focuses on maximizing the overall covered area, while target tracking emphasizes maintaining smooth and continuous tracking of targets, albeit potentially at the expense of searching for more potential targets. Furthermore, UAV swarms also need to consider collision avoidance between UAVs and between UAVs and obstacles.
Simultaneous coverage and tracking (SCT) by UAV swarms has emerged as a new approach that addresses both the area coverage and target tracking sub-tasks at once, attracting increasing attention from researchers [15,16,17]. This approach enables the tracking of detected targets while exploring as much area as possible to find more potential targets. Roughly, SCT methods fall into two categories: path-based methods and reactive methods. Path-based methods follow a rolling optimization strategy in which UAVs make action decisions by predicting the future states of neighboring UAVs. For instance, the studies in [18,19] propose the Voronoi control (VC) method to optimize coverage problems and attempt to extend it to target tracking problems. One study [17] designs a distributed control (DC) method based on time-varying density functions to address SCT problems in robot swarms. Unfortunately, the predicted path sets often contain many infeasible solutions in congested environments, easily leading to planning deadlocks. One way to address this issue is to introduce an interaction mechanism that enables each UAV to deduce the intention of its peers and anticipate their viable flight paths, thereby allowing the movement of one UAV to inform and direct the actions of the others. However, predicting paths for all UAVs requires significant time and computational resources [20,21]. Coupled with the uncertainties in modeling and measurement, the actual paths of other UAVs are difficult to align with the predicted ones, especially over long control horizons, so path-based methods require high-frequency updates of perceptual information, which further exacerbates the computational burden. In contrast, although reactive methods may be slightly inferior to path-based methods in task completion time and effectiveness, their advantage in computational speed cannot be overlooked. For example, the work in [22] proposes a self-organized reciprocal control (RC) approach for sensing coverage with UAV swarms, and [16] extends the reciprocal mechanism to the SCT task. Reactive methods make rule-based decisions through single-step optimization, adjusting UAV velocities according to the targets and neighbor information at the current timestep while ignoring the future states of the system. Though reactive methods offer advantages such as fast computation, short coverage times, and high coverage rates, they have an inherent short-sighted drawback, which may lead to unnatural behaviors such as motion oscillation in certain task scenarios.
Although numerous popular deep reinforcement learning (DRL) algorithms, such as MADDPG and DQN, have been employed for UAV swarm control [17,23,24,25], they predominantly initiate their iterative convergence process from scratch, largely disregarding data from established classic algorithms. However, this does not imply that existing classic algorithms are devoid of merit; rather, the knowledge and experience accumulated from prior algorithmic practices can also serve as valuable assets for optimizing learning control strategies. In response, we propose a DRL-based improved UAV Swarm SCT control method, SCT-DRL, which can be seen as a special improvement of the traditional one-step optimization method, shaped by the samples derived from the latter, and gradually overcomes the inherent myopic issue with the far-sighted value estimation through reinforcement learning training.
This paper primarily presents the following innovative contributions:
(1)
The SCT-DRL algorithm addresses the challenge of managing interactions within UAV swarms by simplifying the problem into a series of pairwise interactions, which allows for a more manageable aggregation of target information. This methodological reduction enables the construction of a dynamic control framework that adapts to the varying number of proximate UAVs and targets, enhancing the swarm’s operational flexibility and efficiency.
(2)
The initialization of the value network using trajectory data from one-step optimization techniques (RC). This initialization strategy primes the UAVs to favor actions that expedite task completion, thereby reducing the need for extensive random exploration during the initial training phase. By leveraging this initialization, SCT-DRL builds upon the strengths of RC while mitigating its limitations through the incorporation of DRL. This synergy allows SCT-DRL to evolve beyond the myopic constraints of RC, offering a more foresighted value estimation that is honed through the iterative process of RL.
(3)
The efficacy of the SCT-DRL algorithm is empirically validated through a series of numerical experiments, which substantiate its potential to significantly enhance UAV swarm performance in simultaneous coverage and tracking tasks. These findings underscore the algorithm’s capacity to navigate the complexities of multi-UAV operations, offering a promising avenue for advancing autonomous swarm technologies.
The remainder of this paper is organized as follows: Section 2 formulates the SCT problem starting from the two-UAV case. Section 3 proposes a learning control algorithm based on DRL and the reciprocal decision method [16,22]. Experiments and analyses are presented in Section 4 to demonstrate the effectiveness of the proposed method. Finally, Section 5 concludes this paper.

2. Problem Formulation

The swarm SCT problem is essentially an optimization decision problem. Its core is as follows: based on its own state and the externally observable states, any UAV in the swarm realizes area coverage and target tracking by interacting with its neighboring UAVs. Here, we simplify local neighbor interactions into a series of pairwise (two-UAV) interactions and thereby realize distributed UAV control. In this paper, we start from the two-UAV SCT task and embed the scalability concept into the modeling, which is based on the Dec-POMDP framework.

2.1. Decentralized Partially Observable Markov Decision Process (Dec-POMDP)

To be addressed by reinforcement learning, the two-UAV SCT problem is formulated as a Dec-POMDP with the tuple $\langle S, A, O, P, R, \gamma \rangle$. The two UAVs are denoted as $i$ and $j$, $S$ denotes the state space, $A = A^i \times A^j$ denotes the joint action space, $O = O^i \times O^j$ denotes the observation space, $P$ is the state transition model, $R = \{R^i, R^j\}$ is the collection of reward functions, and $\gamma$ is the discount factor for accumulating rewards.
(1) State space $S$: the total state of the system contains the states of UAVs, targets, and obstacles. The altitude of each UAV is fixed in this paper, so it moves in a two-dimensional environment, and its position and velocity are represented by $p = [p_x, p_y]$ and $v = [v_x, v_y]$, respectively. The state of UAV $i$ is represented as $s^i$, including the position $p$, velocity $v$, fuselage radius $R_f$, coverage radius $R_c$, optimal speed $v_{pref}$, and motion direction $\theta$. The optimal speed of a UAV varies with aircraft performance and constrains the UAV's actions.
$$s^i = [p_x^i, p_y^i, v_x^i, v_y^i, R_f^i, R_c^i, v_{pref}^i, \theta^i]$$
Similarly, the state of target $k$ is $s_{tar,k} = [p_x^{tar,k}, p_y^{tar,k}, v_x^{tar,k}, v_y^{tar,k}, R_{tar,k}]$, where $R_{tar,k}$ is the radius of target $k$. Additionally, the state of obstacle $k$ is $s_{obs,k} = [p_x^{obs,k}, p_y^{obs,k}, v_x^{obs,k}, v_y^{obs,k}, R_{obs,k}]$, where $R_{obs,k}$ is the radius of obstacle $k$. The total two-UAV system state is
$$s = [s^i, s^j, s_{tar,1}, \ldots, s_{tar,n_{tar}}, s_{obs,1}, \ldots, s_{obs,n_{obs}}],$$
where $n_{tar}$ and $n_{obs}$ are the numbers of targets and obstacles in the environment.
(2) Observation space $O^i$: due to sensor limits, the UAV cannot obtain the complete system state defined above. The observation of UAV $i$ includes four parts: its own state $s^i$, the observable state of the neighboring UAV $o_{neigh,i}$, the observable states of the targets $o_{tar,i}$, and the sensory information about obstacles and boundaries $o_{sen,i}$.
$$o^i = [s^i, o_{neigh,i}, o_{tar,i}, o_{sen,i}]$$
In the two-UAV scenario, only UAV $j$ is in the neighborhood of UAV $i$,
$$o_{neigh,i} = [p_x^j, p_y^j, v_x^j, v_y^j, R_f^j, R_c^j].$$
In simultaneous coverage and tracking tasks, the number of targets detected by each UAV is not fixed: no target may be detected for a period of time, or $m$ targets ($m \in \mathbb{N}$) may be detected simultaneously. If a UAV were to append the observable states of an unknown number of detected targets to its observation, networks with multiple input sizes would have to be constructed and trained. To address this issue, we introduce the concept of an optimal tracking vector when defining the observation input of the targets. The clustering center of the positions of all targets within the sensing range of the UAV is taken as the best tracking position $p_{opt,tar}$, and the mean of the velocity vectors of all targets within the sensing range is taken as the best tracking position offset velocity $v_{opt,tar}$.
$$o_{tar,i} = [p_{opt,tar,i}, v_{opt,tar,i}] = \frac{1}{m} \sum_{k \in cover(i)} [p_x^{tar,k}, p_y^{tar,k}, v_x^{tar,k}, v_y^{tar,k}]$$
In this way, the unknown number of target information within the sensing range of the UAV can be converted into optimal tracking position and velocity information.
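As an illustration, this aggregation can be computed in a few lines of Python (a minimal sketch; the array layout and the function name are our own choices for illustration, not the paper's implementation):

```python
import numpy as np

def aggregate_targets(target_states):
    """Aggregate a variable number of perceived targets into a fixed-size
    tracking observation [p_opt_x, p_opt_y, v_opt_x, v_opt_y].

    target_states: sequence of [p_x, p_y, v_x, v_y] for each of the m targets
    inside the UAV's coverage range. Returns None when no target is perceived
    (m = 0), which the discriminator later uses to switch to the pure-coverage
    network.
    """
    if len(target_states) == 0:
        return None
    targets = np.asarray(target_states, dtype=float)
    p_opt = targets[:, 0:2].mean(axis=0)   # clustering (mean) position of targets
    v_opt = targets[:, 2:4].mean(axis=0)   # mean velocity of targets
    return np.concatenate([p_opt, v_opt])

# Example: two targets perceived by the UAV
print(aggregate_targets([[10.0, 5.0, 0.1, 0.0], [14.0, 9.0, -0.1, 0.2]]))
```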
The observation input for obstacles and boundaries faces the same variable-dimension issue as the targets. To effectively integrate the surrounding environment information, we take distance readings along $N_{sen}$ sensing directions, bounded by the detection radius of the UAV sensor, as the sensing observation.
$$o_{sen,i} = \big[\, d_{0 \cdot \frac{2\pi}{N_{sen}}}, \ d_{1 \cdot \frac{2\pi}{N_{sen}}}, \ \ldots, \ d_{(N_{sen}-1) \cdot \frac{2\pi}{N_{sen}}} \,\big]$$
The orientation of the UAV is taken as the $0$ direction, and clockwise is the positive direction. If a boundary or obstacle lies within the sensing range in the $k \cdot \frac{2\pi}{N_{sen}}$ direction, $d_{k \cdot \frac{2\pi}{N_{sen}}}$ is the distance between the UAV and the nearest such object; otherwise, $d_{k \cdot \frac{2\pi}{N_{sen}}}$ is set to the maximum sensing distance. Although a larger $N_{sen}$ yields a higher-resolution sensory image, it also increases the input dimension and the size of the networks. Under the premise of satisfying the task performance, we select $N_{sen} = 8$ in this paper.
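The construction of this sensory vector can be sketched as follows. This is a simplified illustration that ray-casts against circular obstacles only; handling of the rectangular boundary is omitted, and the function name and signature are assumptions for the example:

```python
import numpy as np

def sensing_observation(p_uav, heading, obstacles, d_max, n_sen=8):
    """Build a fixed-size sensory vector o_sen of n_sen ray distances.

    p_uav:     UAV position, shape (2,)
    heading:   UAV orientation in radians (taken as the 0 direction)
    obstacles: list of (center_xy, radius) circular obstacles
    d_max:     maximum sensing distance
    Returns one distance per direction, clockwise from the heading;
    directions with no obstacle return d_max.
    """
    o_sen = np.full(n_sen, float(d_max))
    for k in range(n_sen):
        ang = heading - k * 2 * np.pi / n_sen      # clockwise is the positive direction
        ray = np.array([np.cos(ang), np.sin(ang)])
        for center, radius in obstacles:
            rel = np.asarray(center, dtype=float) - np.asarray(p_uav, dtype=float)
            along = rel @ ray                       # projection of the center onto the ray
            if along <= 0:
                continue                            # obstacle lies behind this direction
            perp = np.linalg.norm(rel - along * ray)
            if perp < radius:                       # the ray intersects the obstacle disk
                hit = along - np.sqrt(radius**2 - perp**2)
                o_sen[k] = min(o_sen[k], max(hit, 0.0))
    return o_sen
```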
(3) Action space $A^i$: the action of UAV $i$ is denoted as $a^i \in A^i$. $A^i$ is a collection of viable velocity vectors, constrained by the UAV's optimal movement speed.
(4) State transition model $P$: the state transition model $P(s_{t+1} \mid s_t, a^i, a^j)$ describes the dynamics of the system. The position transition of the UAVs is clear, but the transition of the entire system remains uncertain because the strategies of targets and obstacles are unknown. The positions of UAV $i$ and UAV $j$ update as follows,
$$p_{t+1}^i = p_t^i + \Delta t \cdot a_t^i, \qquad p_{t+1}^j = p_t^j + \Delta t \cdot a_t^j.$$
Given the extremely short time interval $\Delta t$, it is reasonable to assume, within the context of algorithm research, that the velocity commanded by the UAV remains constant during the interval $\Delta t$.
(5) Reward function $R^i$: $R^i(s, a)$ determines the reward UAV $i$ receives after each action is executed.
$$R^i(s, a^i) = \begin{cases} -1, & \text{if } d_{i,j} < R_f^i + R_f^j \ \text{or} \ d_{min}^{sen} < R_f^i, \\ -0.5 + d_{min}^{sen} / (2 R_c^i), & \text{else if } d_{min}^{sen} < R_c^i \quad (\text{coverage}), \\ 0.25 - d_{tar,i} / (2 R_c^i), & \text{else if } d_{tar,i} > 0.5 R_c^i \quad (\text{tracking}), \\ 1, & \text{else if } \big| d_{i,j} - R_c^i - R_c^j \big| < \varepsilon \ \text{and} \ d_{min}^{sen} \geq R_c^i, \\ 0, & \text{otherwise}, \end{cases}$$
where $d_{i,j}$ is the distance between the two UAVs, $d_{min}^{sen} = \min o_{sen,i}$ is the minimum perceived distance over the eight sensory directions of the UAV, and $d_{tar,i}$ denotes the distance between the current position of the UAV and the estimated best tracking position $p_{opt,tar,i}$.
The shaping of the reward function is driven by three motivations: first, preventing the UAV from colliding with other UAVs, boundaries, or obstacles; second, preventing the UAV's perception range from being occupied by obstacles or out-of-boundary zones, which saves perception resources for covering more targets; and third, keeping the UAV close to the best tracking position of the neighboring targets.
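A direct transcription of the reward cases into code may make the branch ordering easier to follow (a sketch based on the reconstructed piecewise reward above; variable names are our own):

```python
def sct_reward(d_ij, d_min_sen, d_tar, R_f_i, R_f_j, R_c_i, R_c_j, eps):
    """Reward for UAV i, matching the three stated motivations:
    collision avoidance, keeping the perception range free, and tracking."""
    if d_ij < R_f_i + R_f_j or d_min_sen < R_f_i:
        return -1.0                                   # collision with UAV/obstacle/boundary
    if d_min_sen < R_c_i:                             # coverage range partially wasted
        return -0.5 + d_min_sen / (2 * R_c_i)
    if d_tar > 0.5 * R_c_i:                           # far from the best tracking position
        return 0.25 - d_tar / (2 * R_c_i)
    if abs(d_ij - R_c_i - R_c_j) < eps and d_min_sen >= R_c_i:
        return 1.0                                    # desired coverage configuration reached
    return 0.0
```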

2.2. Optimization Objective

In the SCT problem, the optimization objective is to minimize the time $t_c$ required to realize maximum coverage and stable tracking. The optimization objective and constraints of two-UAV SCT can be formulated as:
$$\arg\min_{\pi} \ \mathbb{E}\left[ t_c \mid s, \pi \right]$$
subject to:
$$\| p_t^i - p_t^j \|_2 \geq R_f^i + R_f^j$$
$$p_{t+1}^i = p_t^i + \Delta t \cdot \pi^i(o_t^i)$$
$$p_{t+1}^j = p_t^j + \Delta t \cdot \pi^j(o_t^j)$$
$$\big| \, \| p_{t_c}^i - p_{t_c}^j \|_2 - R_c^i - R_c^j \, \big| \leq \varepsilon$$
$$\| p_{t_c}^i - p_{opt,tar,i} \|_2 \leq \varepsilon_{tar}$$
where $\pi = \{\pi^i, \pi^j\}$ denotes the joint policy. We adopt a deterministic policy in this paper, i.e., the policy $\pi$ maps an observation to the best action rather than to a probability distribution over actions. The policies of UAV $i$ and UAV $j$ are $\pi^i: o_t^i \mapsto a_t^i$ and $\pi^j: o_t^j \mapsto a_t^j$, respectively, where $o^i \in O^i$, $o^j \in O^j$, $a^i \in A^i$, $a^j \in A^j$.
Formula (9) is the optimization objective, optimized over the system state and the action policies of UAV $i$ and UAV $j$; Formula (10) is the collision avoidance constraint between the two UAVs, i.e., the Euclidean distance between them must be no less than the sum of their fuselage radii; Formulas (11) and (12) are the position updates of UAV $i$ and UAV $j$, respectively; Formula (13) is the coverage-task constraint, i.e., when the coverage is finally completed, the Euclidean distance between the two UAVs should be kept near the sum of their coverage radii, with an error not exceeding the allowable range $\varepsilon$; Formula (14) states that at task completion each UAV should keep tracking its perceived targets.
In the reinforcement learning method, by designing the reward function according to the goal of SCT, the above optimization problem is transformed into maximizing the discounted cumulative reward. The value function $Q_{\pi}^i(s, a)$ represents the expectation of the discounted cumulative reward under the joint policy $\pi$ after executing joint action $a$ at state $s$.
$$Q_{\pi}^i(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R^i(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s, a_t = a \right]$$
$$V_{\pi}^i(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R^i(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s \right]$$
However, as the state contains an unobservable part, UAV $i$ updates its policy according to $V_{\pi}^i(o^i)$ and $Q_{\pi}^i(o^i, a^i)$ instead. Additionally, since UAVs with a slow optimal speed would obtain lower values purely because of the discount factor, the optimal speed $v_{pref}$ is incorporated into the discount exponent for numerical normalization, which adjusts the value margin.
$$V_{\pi}^i(o^i) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k \cdot v_{pref}} R^i(s_{t+k}, a_{t+k}) \,\middle|\, o_t^i = o^i \right]$$
$$Q_{\pi}^i(o^i, a^i) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k \cdot v_{pref}} R^i(s_{t+k}, a_{t+k}) \,\middle|\, o_t^i = o^i, a_t^i = a^i \right]$$
The deterministic policy of UAV i selects the action corresponding to the highest value function Q π i ( o i , a i ) .
$$\pi^i(o^i) = \arg\max_{a^i} Q_{\pi}^i(o^i, a^i) = \arg\max_{a^i} \left[ R^i(o^i, a^i) + \gamma^{v_{pref}} \int_{o^{i\prime}} p(o^{i\prime} \mid o^i, a^i) \, V_{\pi}^i(o^{i\prime}) \, d o^{i\prime} \right]$$
In RL, it can be shown via the Bellman optimality equation that if the UAVs keep selecting actions according to their policies and keep updating the value functions, the value functions and policies will converge to the optimal ones.

3. Approach

This section proposes a learning-based UAV swarm SCT algorithm, which adopts the reinforcement learning (RL) method to develop a foresighted reactive control strategy and incorporates the valuable experience samples derived from the single-step optimization algorithm (RC) [16,22]. In detail, we improve the RL policy by pretraining it on reciprocal control samples, which can also be viewed as a DRL-based improvement of the traditional method.
In this section, we first introduce the general control framework of the SCT-DRL (Simultaneous Coverage and Tracking Deep Reinforcement Learning). Then, we introduce the network structure and training process of the SCT-DRL.

3.1. Control Framework of SCT-DRL

In the previous section, based on the idea that the UAV swarm SCT problem can be decomposed into multiple two-UAV interactions, the two-UAV SCT problem was formulated. Here, after the policy has been well trained, we adapt the optimized policy from the two-UAV case to UAV swarm control. As all UAVs are homogeneous, they share a common policy with identical value network parameters.
In two-UAV cases, at each decision point, the UAV opts for the action that maximizes its value:
$$\arg\max_{a_t^i \in A^i} \left[ R^i(o_t^i, a_t^i) + \gamma^{v_{pref}} \int_{o_{t+1}^i} p(o_{t+1}^i \mid o_t^i, a_t^i) \, V_{\pi}^i(o_{t+1}^i) \, d o_{t+1}^i \right]$$
However, evaluating the integral in Formula (22) is challenging, as UAVs cannot discern the intentions of other UAVs, which renders the next state uncertain. Specifically, the neighbor-related part of $o_{t+1}^i$, namely $o_{t+1}^{neigh,i} = [p_{x,t+1}^j, p_{y,t+1}^j, v_{x,t+1}^j, v_{y,t+1}^j, R_f^j, R_c^j]$, is uncertain. To address this, we assume the other UAV maintains its velocity from the last timestep; thus the predicted observation of the neighboring UAV is denoted as
$$\hat{o}_{t+1}^{neigh,i} \leftarrow [\, p_{x,t}^j + \Delta t \cdot v_{x,t}^j, \ p_{y,t}^j + \Delta t \cdot v_{y,t}^j, \ v_{x,t}^j, \ v_{y,t}^j, \ R_f^j, \ R_c^j \,].$$
For the convenience of extending to the UAV-swarm case, $o_t^i$ is replaced by $o_t^{i,j}$, which denotes the variables of the interaction between UAV $i$ and UAV $j$. Under the above assumption, the action selection rule is updated as:
$$\arg\max_{a_t^i \in A^i} \left[ R^i(o_t^{i,j}, a_t^i) + \gamma^{v_{pref}} V_{\pi}^i(\hat{o}_{t+1}^{i,j}) \right]$$
Figure 1 visualizes the action strategy described by Formula (22), where Figure 1a depicts the state of the red UAV (the circle indicates the coverage of the UAV, and the arrow indicates its velocity), and Figure 1b displays the state values associated with different velocity vectors.
Turning to the UAV swarm SCT problem, the final action selection of UAV $i$ takes the status of all neighboring UAVs into account and makes decisions according to:
$$\arg\max_{a_t^i \in A^i} \ \min_{k \in neigh(i)} \left[ R^i(o_t^{i,k}, a_t^i) + \gamma^{v_{pref}} \cdot V^i(\hat{o}_{t+1}^{i,k}) \right]$$
where $o_t^{i,k}$ denotes the observation of UAV $i$ when only the $k$-th UAV is considered as a neighbor, i.e., only the interaction of the pair UAV $i$ and UAV $k$ is considered. Consequently, UAV $i$ selects from the action set the action whose worst-case (lowest) value over all UAV $i$-UAV $k$ pairs is the highest.
Since the value network parameters are shared, we denote the value network for the SCT task as $V_{sct}(\cdot, w) = V^1 = V^2 = \cdots = V^N$. In addition, we train a separate value network for the pure coverage task, $V_c(\cdot, w)$, and switch between $V_c$ and $V_{sct}$ via a discriminator $D$ according to the current status. This mechanism handles the special scenarios in which some UAVs perceive no targets and thus have no tracking input, and it prevents the swarm from prematurely focusing on tracking the currently perceived targets when the target distribution is relatively sparse.
We summarize the detailed procedure in Algorithm 1.
Algorithm 1 SCT-DRL (Simultaneous Coverage and Tracking with Deep RL)
1: Input: discriminator $D$, value networks $V_c(\cdot, w)$ and $V_{sct}(\cdot, w)$
2: Output: trajectory $s_{0:t_f}$
3: Initialize timestep $t = 0$ and observation $o_0$
4: while task unfinished do
5:    UAVs update their observations $o_t = [o_t^1, o_t^2, \ldots, o_t^N]$
6:    for $i = 1, 2, \ldots, N$ do
7:       discriminator $D$ selects the network $V^i(\cdot, w) \leftarrow D(o_t^i)$, according to Formula (24)
8:       predict the observations of UAV $i$ about all neighboring UAVs at the next timestep $\{\hat{o}_{t+1}^{i,k}\}_{k \in neigh(i)}$, according to Formula (21)
9:       UAV $i$ selects the action $a_t^i$, according to Formula (23)
10:   UAVs execute the joint action $a_t = [a_t^1, a_t^2, \ldots, a_t^N]$
11:   the system state transitions to $s_{t+1}$
12:   $t \leftarrow t + 1$
13: return $s_{0:t_f}$
The discriminator $D$ employs the pure coverage network $V_c(\cdot, w)$ when the average distance to neighboring UAVs is too small or the UAV has not yet covered any targets; the value of $\varepsilon_r$ can be adjusted for specific task scenarios.
$$D(o^i) = \begin{cases} V_c(\cdot, w), & \text{if } \ \frac{1}{|neigh(i)|} \sum_{k \in neigh(i)} d_{i,k} < \varepsilon_r \ \text{or} \ |cover(i)| = 0, \\ V_{sct}(\cdot, w), & \text{otherwise}. \end{cases}$$
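Putting Formulas (21)-(24) together, the per-UAV decision step of Algorithm 1 can be sketched as follows (the caller-supplied helpers `reward_fn` and `build_pair_obs`, and the object fields `p`, `v`, and `v_pref`, are assumptions for illustration rather than the paper's code):

```python
import numpy as np

def select_action(uav_i, neighbors, actions, V_sct, V_c, reward_fn,
                  build_pair_obs, gamma, dt, eps_r, num_covered):
    """One decision step of UAV i, sketching Formulas (21)-(24).

    uav_i:          object with fields p (position) and v_pref (optimal speed)
    neighbors:      list of neighbor objects with fields p and v
    actions:        discretized candidate velocity vectors forming A^i
    V_sct, V_c:     value networks mapping a pairwise observation to a scalar
    reward_fn:      one-step reward R^i(o^{i,k}, a^i), supplied by the caller
    build_pair_obs: caller-supplied helper assembling the first-person pairwise
                    observation for the predicted next step
    """
    # Discriminator D (Formula (24)): use the pure-coverage network when the
    # neighbors are too close on average or no target is currently covered.
    mean_dist = np.mean([np.linalg.norm(n.p - uav_i.p) for n in neighbors])
    V = V_c if (mean_dist < eps_r or num_covered == 0) else V_sct

    best_action, best_value = None, -np.inf
    for a in actions:
        pair_values = []
        for nb in neighbors:
            nb_next_p = nb.p + dt * nb.v          # constant-velocity prediction (Formula (21))
            o_next = build_pair_obs(uav_i, a, nb, nb_next_p)
            value = reward_fn(uav_i, a, nb) + gamma ** uav_i.v_pref * V(o_next)
            pair_values.append(value)
        worst = min(pair_values)                   # worst case over all neighbor pairs
        if worst > best_value:                     # maximize that worst case (Formula (23))
            best_action, best_value = a, worst
    return best_action
```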

3.2. Training Mechanism of SCT-DRL

3.2.1. Network Structure

From a geometric point of view, the optimal simultaneous coverage and tracking strategy should be invariant to rotation, translation, and other coordinate transformations. Hence, the parameterized description of the state of a single-UAV system is still somewhat redundant. In order to remove this redundancy and enable a single UAV to learn effective strategies that are independent of coordinate transformations, a UAV-centered coordinate system is defined in this section: it takes the UAV itself as the origin and the fuselage orientation as the positive direction of the X-axis. The observation of UAV $i$ about UAV $k$ after this coordinate transformation is denoted as:
$$f_{fpp}(o^{i,k}) = f_{fpp}([s^i, o_{neigh,i}, o_{tar,i}, o_{sen,i}]) = [\, v_{pref}, \tilde{v}_x^i, \tilde{v}_y^i, R_f^i, R_c^i, \tilde{\theta}^i, \tilde{v}_x^k, \tilde{v}_y^k, \tilde{p}_x^k, \tilde{p}_y^k, R_f^k, R_c^k, \cos(\tilde{\theta}^k), \sin(\tilde{\theta}^k), d_{i,k}, \underbrace{(\tilde{p}_x^{opt,tar,i}, \tilde{p}_y^{opt,tar,i}, \tilde{v}_x^{opt,tar,i}, \tilde{v}_y^{opt,tar,i}, d_{opt,tar,i})}_{tracking}, d_0, d_{45}, d_{90}, d_{135}, d_{180}, d_{225}, d_{270}, d_{315} \,]$$
where $\tilde{z}$ denotes the variable $z$ after transformation into the first-person perspective of UAV $i$, and $d_{opt,tar,i} = \| p^i - p_{opt,tar,i} \|_2$ denotes the distance between the position of UAV $i$ and its corresponding best target tracking position.
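The first-person transformation itself amounts to a translation followed by a rotation into the UAV's heading frame; a minimal sketch (function name assumed) is:

```python
import numpy as np

def to_first_person(p_i, theta_i, p_other, v_other):
    """Transform another agent's position and velocity into the first-person
    frame of UAV i: origin at p_i, X-axis along heading theta_i."""
    c, s = np.cos(theta_i), np.sin(theta_i)
    # Rotation that expresses world-frame vectors in the UAV-centered frame.
    R = np.array([[ c, s],
                  [-s, c]])
    p_tilde = R @ (np.asarray(p_other, dtype=float) - np.asarray(p_i, dtype=float))
    v_tilde = R @ np.asarray(v_other, dtype=float)   # velocities are only rotated
    return p_tilde, v_tilde
```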
The SCT value network $V_{sct}$ takes $f_{fpp}(o^{i,k})$ as input; for simplicity, we write $V_{sct}(o^{i,k})$ in place of $V_{sct}(f_{fpp}(o^{i,k}))$. The value network consists of three fully-connected (FC) layers of size 100. As displayed in Figure 2, the structure of the network is FC1, ReLU, FC2, ReLU, FC3, and the output dimension is 2, denoting the magnitude and direction of the velocity.
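A possible realization of this network, here in PyTorch (the framework choice and the input sizes of 28 for the SCT input and 23 for the pure-coverage input are our inferences from the first-person observation vectors above, not stated implementation details), is:

```python
import torch
import torch.nn as nn

class SCTValueNetwork(nn.Module):
    """Three fully-connected layers of width 100 (FC1-ReLU-FC2-ReLU-FC3),
    following the structure described for Figure 2. The default input size of
    28 corresponds to the SCT observation f_fpp(o^{i,k}); a pure-coverage
    network would use 23 inputs (no tracking terms)."""
    def __init__(self, input_dim: int = 28, hidden: int = 100, output_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, output_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Usage example: evaluate a batch of two first-person observations.
net = SCTValueNetwork()
print(net(torch.zeros(2, 28)).shape)   # -> torch.Size([2, 2])
```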
In addition, the value network for pure coverage $V_c$ differs from $V_{sct}$ only in its input dimension, since the observation for pure coverage $o_c^{i,k}$ does not contain tracking information,
$$f_{fpp}(o_c^{i,k}) = f_{fpp}([s^i, o_{neigh,i}, o_{sen,i}]) = [\, v_{pref}, \tilde{v}_x^i, \tilde{v}_y^i, R_f^i, R_c^i, \tilde{\theta}^i, \tilde{v}_x^k, \tilde{v}_y^k, \tilde{p}_x^k, \tilde{p}_y^k, R_f^k, R_c^k, \cos(\tilde{\theta}^k), \sin(\tilde{\theta}^k), d_{i,k}, d_0, d_{45}, d_{90}, d_{135}, d_{180}, d_{225}, d_{270}, d_{315} \,].$$

3.2.2. Value Network Training

Before presenting the value network training method, we introduce the reciprocal control (RC) method [16,22] for self-organized multi-robot coverage and tracking, in which the SCT problem is modeled in velocity space. The RC method determines the optimal reciprocal coverage velocity by adjusting the velocity relative to neighboring robots toward the optimal coverage velocity while guaranteeing collision avoidance. Although RC significantly outperforms other representative methods, such as the Voronoi control method based on region partitioning and the density control method grounded in local region analysis, it remains a traditional single-step optimization method. Single-step optimization has a great advantage in computing time, but it only considers the current state without accounting for future possibilities, which makes it prone to unnatural trajectories caused by locally short-sighted optimization.
Therefore, we propose a DRL-based UAV swarm control method that accounts for the long-term cost of the current action selection through an iteratively updated value function. However, the DRL method has a nonnegligible drawback: a large number of trial-and-error interactions are required to learn a well-performing policy, and the extensive random exploration at the beginning of training is quite time-consuming. From another perspective, combining a traditional single-step optimization method with a learning-based method for SCT is therefore a promising direction for developing an effective UAV swarm control method.
In this section, we incorporate the valuable trajectory samples derived by the RC method into the training of value networks. The training of the SCT-DRL value network comprises two primary stages: the initialization based on prior RC samples (step 1) and the subsequent training (step 2).
In the initialization phase, the two-UAV SCT problem is first solved by the RC method, producing task trajectories. Each successful trajectory can be translated into pseudo 'state-value' samples of the form $\{(o_t^{i,j}, \gamma^{(t_c - t) \cdot v_{pref}})\}_{t=0}^{t_c}$, where $t_c$ denotes the task completion time and $t_c - t$ measures the time interval from $t$ to the terminal time of the task. Thus, $\gamma^{(t_c - t) \cdot v_{pref}}$ with $\gamma \in (0, 1)$ indicates how close the moment at which UAV $i$ perceives observation $o_t^{i,j}$ is to the task-finished status, and it has the same variation trend as the value function $V^i(o^{i,j})$, i.e., the closer $o_t^{i,j}$ is to task completion, the higher $\gamma^{(t_c - t) \cdot v_{pref}}$ and $V^i(o_t^{i,j})$ are.
To incorporate the experience of RC-derived successful trajectory samples into the SCT-DRL, we first employ supervised learning to update the value network of SCT-DRL as the initialization:
$$\arg\min_{w} \sum_{k=1}^{N_{rc}} \left( \gamma^{(t_c - t) \cdot v_{pref}} - V(o_t^{i,j}; w) \right)^2$$
where $N_{rc}$ denotes the number of successful samples derived by the RC method. After optimization based on Formula (27), $V(o^{i,j}; w)$ can estimate the approximate remaining time required to finish the task. As a result, when UAV $i$ selects its action according to the value function, the preferred action with higher $V(o^{i,j}; w)$ corresponds to a smaller estimated remaining time for task completion, which is beneficial for generating positive samples in the initial training phase of the value network and facilitates the optimization of effective SCT-DRL policies.
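A sketch of this initialization, converting RC trajectories into pseudo 'state-value' pairs and regressing the value network on them per Formula (27), is given below (PyTorch is assumed, and reading the scalar value from the first network output is an assumption of this example):

```python
import torch

def make_pseudo_samples(trajectory_obs, t_c, gamma, v_pref):
    """Turn one successful RC trajectory {o_t}_{t=0..t_c} into pseudo
    'state-value' pairs (o_t, gamma^{(t_c - t) * v_pref})."""
    return [(obs, gamma ** ((t_c - t) * v_pref))
            for t, obs in enumerate(trajectory_obs)]

def pretrain_value_network(value_net, samples, lr=1e-4, epochs=10, batch_size=200):
    """Supervised initialization of the value network on RC-derived samples."""
    opt = torch.optim.SGD(value_net.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            obs = torch.stack([torch.as_tensor(o, dtype=torch.float32) for o, _ in batch])
            target = torch.tensor([[v] for _, v in batch], dtype=torch.float32)
            pred = value_net(obs)[:, :1]     # first output taken as the scalar value
            loss = loss_fn(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return value_net
```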
In the subsequent training phase, the experience samples generated by the RL action selection, i.e., $(o_t^{i,j}, a_t^i, o_{t+1}^{i,j}, R^i(s_t^{i,j}, a_t^i))$, are employed to update the value network. To introduce more randomness and avoid local optima, each UAV uses the $\varepsilon$-greedy mechanism to choose actions based on Formula (22). The value network of SCT-DRL is updated as:
$$\arg\min_{w} \sum_{k=1}^{N_b} \left( R^i(s_t^{i,j}, a_t^i) + \gamma^{v_{pref}} \cdot V(o_{t+1}^{i,j}; w) - V(o_t^{i,j}; w) \right)^2$$
where N b is the size of the batch sampled from experience replay.
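The corresponding TD-style update on a replay batch can be sketched as follows (again assuming PyTorch and reading the scalar value from the first network output; the explicit terminal-state handling is our addition):

```python
import torch

def td_update(value_net, optimizer, batch, gamma, v_pref):
    """One value-network update on a replay batch.

    batch: list of transitions (obs_t, reward, obs_next, done).
    The bootstrap target discounts the value of the next observation with
    gamma raised to v_pref, following the normalization described in the text."""
    obs = torch.stack([torch.as_tensor(b[0], dtype=torch.float32) for b in batch])
    rew = torch.tensor([b[1] for b in batch], dtype=torch.float32)
    nxt = torch.stack([torch.as_tensor(b[2], dtype=torch.float32) for b in batch])
    done = torch.tensor([float(b[3]) for b in batch])

    with torch.no_grad():                    # bootstrap target, no gradient
        target = rew + (1.0 - done) * (gamma ** v_pref) * value_net(nxt)[:, 0]
    pred = value_net(obs)[:, 0]
    loss = torch.nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```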
The training of $V_c(\cdot, w)$ differs from that of $V_{sct}(\cdot, w)$ in three aspects: the initialization of $V_c(\cdot, w)$ uses trajectory samples generated by RC for the pure coverage task, the observation input of $V_c(\cdot, w)$ does not contain tracking information, and the corresponding reward function $R_c^i$ differs slightly from that of the SCT task.
$$R_c^i(s, a^i) = \begin{cases} -1, & \text{if } d_{i,j} < R_f^i + R_f^j \ \text{or} \ d_{min}^{sen} < R_f^i, \\ -0.5 + d_{min}^{sen} / (2 R_c^i), & \text{else if } d_{min}^{sen} < R_c^i \quad (\text{coverage}), \\ 1, & \text{else if } \big| d_{i,j} - R_c^i - R_c^j \big| < \varepsilon \ \text{and} \ d_{min}^{sen} \geq R_c^i, \\ 0, & \text{otherwise}. \end{cases}$$

4. Experiment

In order to verify the effectiveness of the proposed method, the following experiments were conducted, and the main experimental parameters were set as shown in Table 1.

4.1. Computational Complexity Analysis

In the experiments, the number of pseudo 'state-value' samples generated by RC is 20,000, the learning rate is 0.0001, the discount factor is 0.95, the batch size is 150, and the stochastic gradient descent (SGD) algorithm is adopted to optimize the value network.
The computing platform is a laptop with an i7-6700HQ CPU, and the UAV swarm SCT-DRL is implemented in Python. In the two-UAV coverage problem, each iteration takes an average of 7.3 ms. From Formula (23), it can be seen that when each UAV independently runs this distributed simultaneous coverage and tracking algorithm based on deep learning, the computational complexity increases linearly with the number of nearby UAVs. In 10-UAV distributed control, SCT-DRL takes an average of 72 ms per iteration. In addition, SCT-DRL lends itself to parallel computation, because the value evaluations in Formula (23) consist of a large number of independent computations.
In addition, the off-line training of the value network essentially converges within 5 h of training. Specifically, the initialization phase took about 10 min to complete 400,000 backpropagation iterations using mini-batches of 200 samples; the reinforcement learning step uses the ε-greedy algorithm, where ε decays linearly from 0.1 to 0.001 over the first 400 training sessions and remains at 0.001 thereafter. The reinforcement learning step takes about 2.5 h to complete 1000 training sessions.
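For reference, the linear ε decay described above can be expressed as follows (a trivial sketch; the session-indexed schedule is our reading of the text):

```python
def epsilon_schedule(session, start=0.1, end=0.001, decay_sessions=400):
    """Linearly decay the exploration rate from 0.1 to 0.001 over the first
    400 training sessions, then hold it at 0.001."""
    if session >= decay_sessions:
        return end
    return start + (end - start) * session / decay_sessions
```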

4.1.1. Two-UAV Area Coverage Test

Figure 3 shows the training effect of the value network, where different colors represent different UAVs, the dot is the initial position of the UAV, the asterisk is its final position, the dotted circle is its coverage area, and the solid line is its moving track. Figure 3a is the initial coverage test scenario of the two UAVs: both UAVs are static, the distance r is 4 m, and the angle is 0. Figure 3b,c are the two-UAV trajectory diagrams formed by the reciprocal control (RC) method and by deep reinforcement learning (DRL), respectively. The DRL trajectory is formed after initial learning on a large amount of data generated by the reciprocal decision method. It can be seen that the trajectory of the proposed DRL method is shorter and smoother than that of RC.
Figure 4 shows the configuration of two-UAV coverage and the statistics of coverage time. Figure 4a shows the test scene of the two-UAV collaborative coverage area in a barrier-free environment, where the black box is the designated area to be covered, the red solid circle is one UAV, the red dotted circle is its coverage, the red arrow is its velocity vector, and the other UAV is blue, with the same line meanings as the red one. The distance between the UAVs is r, and the angle formed by the line connecting the two UAVs and the positive X-axis is α. In this test scenario, the side length of the square coverage area is 90 m, the coverage radius of both UAVs is 25 m, the fuselage radius is 0.5 m, the maximum speed is 1 m/s, the distance between the UAVs varies from 1 m to 44 m (in steps of 1 m), and the angle varies from 0 to 360 degrees (in steps of 3.6 degrees). Figure 4b shows the relationship between the time required for two-UAV coverage and the distance/angle between the two UAVs. The time for the reciprocal decision method to complete the two-UAV coverage task was measured while varying the distance and angle between the UAVs. It can be seen that the task completion time is essentially unchanged with respect to the angle but varies with the distance; hence, the completion time is only related to the distance between the UAVs. Therefore, the following experiments are tested at different distances, and the reported time is the average over multiple angles.
In the two-UAV coverage scenario, the proposed SCT-DRL method is compared with the RC method in terms of coverage task completion time at different UAV distances, as shown in Figure 5. As shown in the figure, the proposed SCT-DRL method completes the task faster, and its completion time is more stable across all two-UAV spacings.

4.1.2. Swarm Area Coverage Test

Table 2 presents the statistics of task completion time under various scenarios, reported as the average, 75th-percentile, and 90th-percentile completion times. It can be seen that under different UAV numbers and area sizes, the proposed SCT-DRL method completes the coverage task in a shorter time than the RC method.
Figure 6 compares the task completion time in four-UAV and six-UAV coverage, where Figure 6a shows the performance comparison of the two methods at different distances r for four UAVs, and Figure 6b displays the comparison for six UAVs. It can be seen that the completion times of the proposed SCT-DRL method are more concentrated, and SCT-DRL completes the coverage task in a shorter time than the RC method.
Figure 7 shows the statistics of completion time under different numbers of UAVs. It can be concluded that in swarm coverage tasks, the SCT-DRL method takes a shorter average time to complete coverage tasks than the RC method, and the completion time is more concentrated with fewer outliers. Furthermore, by analyzing the trend curves of the average mission completion time as a function of the number of UAVs for both methods, it can be observed that the proposed method in this paper exhibits a lower growth rate compared to the traditional RC approach, indicating superior scalability in terms of swarm size.

4.2. Swarm Simultaneous Coverage and Tracking

In the conducted experiment, the targets were maintained in a static condition, and UAVs were tasked with performing SCT, with the objective of covering a greater area to identify more targets and maintaining track of the detected ones. Under identical circumstances, SCT-DRL was compared against three distinct, typical algorithms: the Voronoi control method based on region partitioning [18] (hereinafter referred to as the VC method), the density control method grounded in local region analysis [17] (DC method), and the Reciprocal control method [16] (RC method).
Figure 8 serves as a visualization of the SCT-DRL execution process, where the black dots represent targets and the colored circles are UAVs. As shown in Figure 8a, all targets are randomly dispersed within the square region and the UAVs are likewise randomly distributed; the distances between UAVs are relatively small and the number of detected targets is limited. The UAVs therefore attempt to maintain tracking of the detected targets while seeking out additional potential targets by expanding their coverage areas. The resultant trajectories of the UAV swarm are illustrated in Figure 8b. Ultimately, the UAV swarm attains a stable status, as depicted in Figure 8c. It is apparent that the UAV swarm employing the SCT-DRL algorithm covers a large area (indicated by the colored circles), detects a high number of targets (represented by the black points), and exhibits smooth trajectories.
To assess the efficacy of SCT-DRL, a statistical analysis was performed on the number of detected targets for the four selected methods at 100 s intervals, as depicted in Figure 9.
In comparison to the VC and DC methods, the RC method exhibits an enhanced capacity to detect a greater number of targets concurrently, attributable to its incorporation of the neighborhood reciprocity mechanism. Meanwhile, the SCT-DRL method demonstrates further improvements in both coverage and tracking ability, enabling it to detect an even higher number of targets simultaneously by learning from and improving upon the RC-generated datasets.
The coverage rate is defined as the proportion of the covered area of all UAVs relative to the total mission region. In this task, the peak coverage attained by ten UAVs over the total mission area was 31.4%, indicating that the non-overlapping coverage of the given UAV swarm had achieved its maximum potential. Figure 10 presents a comparative analysis of the swarm coverage achieved by the four methods. The results reveal that the proposed SCT-DRL method exhibits a higher coverage rate than the other baselines.
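The coverage rate itself can be estimated numerically, for instance by sampling the mission area on a grid and checking membership in any UAV's coverage disk (a sketch under our own assumptions; the paper does not state how the rate was computed):

```python
import numpy as np

def coverage_rate(uav_positions, cover_radius, area_size, resolution=1.0):
    """Estimate the coverage rate: fraction of a square mission area that lies
    within the coverage radius of at least one UAV (grid approximation)."""
    xs = np.arange(0.0, area_size, resolution)
    ys = np.arange(0.0, area_size, resolution)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)    # (n_points, 2)
    covered = np.zeros(len(grid), dtype=bool)
    for p in uav_positions:
        covered |= np.linalg.norm(grid - np.asarray(p, dtype=float), axis=1) <= cover_radius
    return covered.mean()
```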

5. Conclusions

This paper proposes a novel DRL-based improved SCT UAV swarm control method that leverages prior trajectory samples generated by traditional single-step optimization to facilitate policy learning. Specifically, it utilizes the self-organized reciprocal control (RC) algorithm to conduct two-UAV SCT simulations, generating trajectory datasets to initialize the value network. Furthermore, the proposed value network for the two-UAV SCT task not only excels in addressing two-UAV cooperation problems but also demonstrates scalability to swarm cooperative control through rigorous testing. With its robust real-time performance, this method is suitable for distributed swarm systems. The feasibility and efficacy of the proposed algorithm have been validated through numerous simulation experiments. In the future, we aim to implement the proposed method in real-world scenarios and further validate its effectiveness through semi-physical simulations and flight tests.

Author Contributions

Conceptualization, Y.C. and R.C.; methodology, Y.C. and R.C.; software, Y.C. and R.C.; validation, Y.H., Z.X. and J.L.; investigation, Y.H. and Z.X.; writing—original draft preparation, Y.C. and R.C.; writing—review and editing, Y.H., Z.X. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would like to thank the editors and the reviewers for their most constructive comments to improve the quality of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wubben, J.; Hernández, D.; Cecilia, J.M.; Imberón, B.; Calafate, C.T.; Cano, J.C.; Manzoni, P.; Toh, C.K. Assignment and Take-Off Approaches for Large-Scale Autonomous UAV Swarms. IEEE Trans. Intell. Transp. Syst. 2023, 24, 4836–4847.
2. Zhang, X.; Duan, L. Energy-saving deployment algorithms of UAV swarm for sustainable wireless coverage. IEEE Trans. Veh. Technol. 2020, 69, 10320–10335.
3. Wu, J.; Yu, Y.; Ma, J.; Wu, J.; Han, G.; Shi, J.; Gao, L. Autonomous cooperative flocking for heterogeneous unmanned aerial vehicle group. IEEE Trans. Veh. Technol. 2021, 70, 12477–12490.
4. Yang, B.; Shi, H.; Xia, X. Federated imitation learning for UAV swarm coordination in urban traffic monitoring. IEEE Trans. Ind. Inform. 2022, 19, 6037–6046.
5. Li, J.; Sun, G.; Duan, L.; Wu, Q. Multi-objective optimization for UAV swarm-assisted IoT with virtual antenna arrays. IEEE Trans. Mob. Comput. 2023, 23, 4890–4907.
6. Zhang, X.; Zheng, J.; Su, T.; Ding, M.; Liu, H. An Effective Dynamic Constrained Two-Archive Evolutionary Algorithm for Cooperative Search-Track Mission Planning by UAV Swarms in Air Intelligent Transportation. IEEE Trans. Intell. Transp. Syst. 2023, 25, 944–958.
7. Bostelmann-Arp, L.; Steup, C.; Mostaghim, S. Free-Form Coverage Path Planning of Quadcopter Swarms for Search and Rescue Missions Using Multi-Objective Optimization. In Proceedings of the 2024 IEEE Congress on Evolutionary Computation (CEC), Yokohama, Japan, 30 June–5 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–8.
8. Yang, Y.; Liang, Y.; Zhao, Y. An Analytical Solution for Obstacle Avoidance in Cooperative Area Coverage using UAV Swarms. In Proceedings of the 2024 36th Chinese Control and Decision Conference (CCDC), Xi’an, China, 25–27 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 2432–2437.
9. Yang, F.; Ji, X.; Yang, C.; Li, J.; Li, B. Cooperative search of UAV swarm based on improved ant colony algorithm in uncertain environment. In Proceedings of the 2017 IEEE International Conference on Unmanned Systems (ICUS), Beijing, China, 27–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 231–236.
10. Xiang, L.; Wang, F.; Xu, W.; Zhang, T.; Pan, M.; Han, Z. Dynamic UAV swarm collaboration for multi-targets tracking under malicious jamming: Joint power, path and target association optimization. IEEE Trans. Veh. Technol. 2023, 73, 5410–5425.
11. Liang, Y.; Yang, Y.; Zhao, Y. Multi-Area Complete Coverage with Fixed-Wing UAV Swarms Based on Modified Ant Colony Algorithm. In Proceedings of the 2022 IEEE International Conference on Unmanned Systems (ICUS), Guangzhou, China, 28–30 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 732–737.
12. Rekabi-Bana, F.; Hu, J.; Krajník, T.; Arvin, F. Unified robust path planning and optimal trajectory generation for efficient 3D area coverage of quadrotor UAVs. IEEE Trans. Intell. Transp. Syst. 2023, 25, 2492–2507.
13. Zhou, L.; Leng, S.; Liu, Q.; Wang, Q. Intelligent UAV swarm cooperation for multiple targets tracking. IEEE Internet Things J. 2021, 9, 743–754.
14. Zhou, L.; Leng, S.; Wang, Q.; Liu, Q. Integrated sensing and communication in UAV swarms for cooperative multiple targets tracking. IEEE Trans. Mob. Comput. 2022, 22, 6526–6542.
15. Khaledyan, M.; Vinod, A.P.; Oishi, M.; Richards, J.A. Optimal coverage control and stochastic multi-target tracking. In Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC), Nice, France, 11–13 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 2467–2472.
16. Chen, R.; Li, J.; Shen, L. A self-organized reciprocal control method for multi-robot simultaneous coverage and tracking. Assem. Autom. 2018, 38, 689–698.
17. Pimenta, L.C.; Schwager, M.; Lindsey, Q.; Kumar, V.; Rus, D.; Mesquita, R.C.; Pereira, G.A. Simultaneous coverage and tracking (SCAT) of moving targets with robot networks. In Proceedings of the Algorithmic Foundation of Robotics VIII: Selected Contributions of the Eighth International Workshop on the Algorithmic Foundations of Robotics; Springer: Berlin/Heidelberg, Germany, 2010; pp. 85–99.
18. Stergiopoulos, Y.; Tzes, A. Decentralized swarm coordination: A combined coverage/connectivity approach. J. Intell. Robot. Syst. 2011, 64, 603–623.
19. Moon, S.; Frew, E.W. Distributed cooperative control for joint optimization of sensor coverage and target tracking. In Proceedings of the 2017 International Conference on Unmanned Aircraft Systems (ICUAS), Miami, FL, USA, 13–16 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 759–766.
20. Li, H.; Long, T.; Xu, G.; Wang, Y. Coupling-degree-based heuristic prioritized planning method for UAV swarm path generation. In Proceedings of the 2019 Chinese Automation Congress (CAC), Hangzhou, China, 22–24 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 3636–3641.
21. Han, L.; Zhang, H. UAV path planning algorithm based on Global Optimal Solution Tracking Enhanced Particle Swarm Optimization. In Proceedings of the 2024 43rd Chinese Control Conference (CCC), Kunming, China, 28–31 July 2024; pp. 3125–3130.
22. Chen, R.; Xu, N.; Li, J. A self-organized reciprocal decision approach for sensing coverage with multi-UAV swarms. Sensors 2018, 18, 1864.
23. Zhang, B.; Jing, T.; Lin, X.; Cui, Y.; Zhu, Y.; Zhu, Z. Deep Reinforcement Learning-based Collaborative Multi-UAV Coverage Path Planning. J. Phys. Conf. Ser. 2024, 2833, 012017.
24. Aydemir, F.; Cetin, A. Multi-agent dynamic area coverage based on reinforcement learning with connected agents. Comput. Syst. Sci. Eng. 2023, 45, 215–230.
25. Dai, A.; Li, R.; Zhao, Z.; Zhang, H. Graph convolutional multi-agent reinforcement learning for UAV coverage control. In Proceedings of the 2020 International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, 21–23 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1106–1111.
Figure 1. Reinforcement learning action strategies.
Figure 2. Value network diagram.
Figure 3. Value network training effect diagram.
Figure 4. The configuration of two-UAV area coverage and the statistics of coverage time.
Figure 5. The comparison of coverage time under different two-UAV distances.
Figure 6. The statistics of task completion time in four-UAV coverage and six-UAV coverage.
Figure 7. The statistics of completion time under different numbers of UAVs.
Figure 8. A visualization of the SCT-DRL execution.
Figure 9. Comparison of target detection numbers of all algorithms.
Figure 10. Comparison of coverage rate among all algorithms.
Table 1. The setting of simulation.

Objects   | Properties           | Descriptions
----------|----------------------|-----------------------------------
Algorithm | learning rate        | 0.0001
          | discount factor      | 0.95
          | batch size           | 150
          | hidden layer size    | (100, 100, 100)
          | loss function        | Mean-Square Error (MSE)
          | optimization method  | Stochastic Gradient Descent (SGD)
UAV       | decision interval    | 100 ms
          | covering radius      | 25 m
          | body radius          | 0.5 m
          | maximum speed        | 1 m/s
Table 2. The comparison of swarm coverage under random test.

n (UAVs) | Area (m)  | RC: ave/75%/90% (s)   | SCT-DRL: ave/75%/90% (s)
---------|-----------|-----------------------|--------------------------
2        | 90 × 90   | 53.50/59.91/68.00     | 25.00/29.13/34.00
4        | 100 × 100 | 68.38/77.66/89.50     | 35.00/42.25/49.00
6        | 150 × 100 | 87.13/100.63/112.38   | 49.00/60.00/66.50
8        | 100 × 200 | 102.88/118.44/131.63  | 62.00/71.00/79.00