[ ARTICLE ]

The Journal of Korea Robotics Society - Vol. 20, No. 2, pp.285-292

ISSN: 1975-6291 (Print) 2287-3961 (Online)

Print publication date 30 May 2025

Received 24 Oct 2024 Revised 13 Nov 2024 Accepted 05 Dec 2024

DOI: https://doi.org/10.7746/jkros.2025.20.2.285

Perception and Deep Reinforcement Learning-Based Local Path Planner for Autonomous Vehicles

Thanh-Danh Phan¹; Gon-Woo Kim^†

¹Graduate Student, Intelligent Systems and Robotics, Chungbuk National University, Cheongju, Korea (danh1711@chungbuk.ac.kr)

자율 주행차를 위한 인식 및 심층 강화 학습 기반 국부 경로 계획

판탄단¹

; 김곤우^†

Correspondence to: ^†Professor, Intelligent Systems and Robotics, Chungbuk National University, Cheongju, Korea (gwkim@cbnu.ac.kr)

Abstract

Autonomous driving has become a revolutionary area in modern transportation studies. End-to-end approaches offer an integrated paradigm that unifies perception, prediction, and planning within a single framework. This holistic approach streamlines the overall decision-making process, potentially improving the efficiency and accuracy of autonomous vehicle operations. In this paper, we propose an end-to-end reinforcement learning-based framework built on refined observations collected from sensors. In addition, a ray casting algorithm is applied to top-view segmented images to compute the relative distances from the vehicle to its surroundings, which simplifies distance parsing for an autonomous car. These distances, latent information from the driver’s view, and other sensor information are then fed as the observation to a deep reinforcement learning framework that predicts a steering angle. The experimental results demonstrate the effectiveness of the method for self-driving vehicles in CARLA simulation environments.

Keywords:

Autonomous Driving, Ray Casting, Deep Reinforcement Learning

1. Introduction

In recent years, the pursuit of safe and reliable autonomous driving systems has evolved from a futuristic vision into a pressing technological challenge that intersects computer science, robotics, and transportation engineering. While current autonomous vehicles demonstrate impressive capabilities on highways and in controlled environments, they still struggle with the intricate dynamics of urban mobility, where split-second decisions can mean the difference between safety and catastrophe.

Traditional computer vision and rule-based systems, despite their initial success, have revealed fundamental limitations in handling the unpredictable nature of real-world driving scenarios. These systems often falter when confronted with edge cases, that is, situations that deviate from their training data or programmed rules. The challenge lies not merely in detecting obstacles or following lanes, but in developing systems that can understand context, anticipate human behavior, and make decisions that balance safety with efficiency. Sensor fusion techniques are a pivotal factor in the success of an autonomous car. The works in ^[1,2] combined a visual sensor and 2D LiDAR in their path planning tasks, which provides safety cues for decision making.

Recent advancements in deep reinforcement learning (DRL) have opened up new possibilities for learning-based approaches in autonomous driving decision-making. Several DRL techniques ^[3] have been successfully applied to self-driving vehicles. The study in ^[4] addresses the complex challenges involved in driverless car operations. The approach proposed in ^[5,6] utilizes a bird’s-eye view representation and analyzes relative spatial relationships between vehicles from the ego vehicle’s perspective. While this methodology offers valuable insights, it presents significant challenges in terms of data preprocessing requirements and practical implementation in real-world testing scenarios. However, using raw, unprocessed images as input can lead to substantial computational overhead and may limit the agent’s ability to extract meaningful semantic information during the learning process. Moreover, the inference time and practical applicability of the algorithm are not consistently guaranteed, which may impact its reliability and effectiveness in various real-world applications. From this, it can be inferred that a well-structured observation space and carefully designed reward mechanisms can significantly enhance the learning performance and inference-time characteristics of autonomous systems.

As the field of autonomous driving continues to expand, the need for platforms to test and enhance Sim-to-Real methods has become increasingly critical. A range of simulation platforms, such as CARLA ^[7] and AirSim ^[8], play a pivotal role in advancing and assessing DRL systems for autonomous vehicles. These platforms provide a variety of environments and capabilities that enable the realistic simulation of real-world driving conditions.

Motivated by the above standpoints, this work focuses on developing a real-time end-to-end framework for driverless cars based on DRL. The algorithm relies on a refined observation and reward design. More specifically, the refined observation is built from three primary factors: the semantic driver view, surrounding environment information, and processed inputs from other sensors. Moreover, a ray casting algorithm is designed to turn the semantic mask into a lightweight yet effective observational input for the DRL framework. Then, a highly explorative DRL framework is deployed as the aggregation module and decision-making model for the agent. Finally, the complete model is deployed in the CARLA simulator to demonstrate its effectiveness and robustness.

The remainder of this paper is structured as follows. Section 2, which comprises five parts, describes the end-to-end model starting from the preprocessing stage. Section 3 presents the results obtained in simulation. Finally, the conclusion is given in Section 4.

2. End-to-End Autonomous Driving

2.1 System Overview

The system is outlined in [Fig. 1], which illustrates the main components of the pipeline. The agent has three input blocks, each of which generates the corresponding semantic information. The observation is defined as follows:

$$S_t = e(V_t) \oplus \Psi(B_t) \oplus O_t \tag{1}$$

where $O_t \in \mathbb{R}$ represents the raw observation signal; $e(\cdot)$ is the latent parsing of the driver view $V_t$; $\Psi(\cdot)$ is the ray casting algorithm, which produces the distances to the surrounding environment from the BEV image $B_t$; and $\oplus$ is the concatenation operation.

[Fig. 1]

Overall system diagram
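As a minimal sketch of the concatenation in Eq. (1), the three blocks can be flattened and joined into one observation vector. The function name `build_observation` and the array shapes are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def build_observation(latent_driver_view, ray_distances, sensor_signal):
    """Concatenate e(V_t), Psi(B_t) and O_t into a single flat vector S_t."""
    parts = (latent_driver_view, ray_distances, sensor_signal)
    # Flatten every block so that differently shaped inputs can be joined.
    flat = [np.asarray(p, dtype=np.float32).ravel() for p in parts]
    return np.concatenate(flat)
```

The order of the blocks only has to stay fixed between training and inference; the sketch follows the order written in Eq. (1).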

2.2 Latent Encoding for Driver View

In this section, we describe the process of compressing images into a latent space for use as observations in an autonomous driving system. The driver’s view is critical, requiring analysis of semantic features for effective decision-making. We utilize the method from ^[9] to compress images into a latent state, reducing the computational load on the DRL model while providing it with more relevant, abstracted information. This approach improves efficiency by focusing the DRL model on higher-level tasks. The architecture is outlined in [Fig. 2], and two loss functions, a reconstruction loss and a feature loss, are used to ensure accurate latent space representation as:

$$L_{total} = \frac{1}{n} \sum_{k=1}^{n} \lVert Y_k - Y_l \rVert^2 - \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right) \tag{2}$$

where $Y_l \in \mathbb{R}$ represents the label, while $Y_k \in \mathbb{R}$ denotes its reconstructed output through the autoencoder network. The total number of samples is indicated by $n$. Within the latent space representation, $\mu_j \in \mathbb{R}$ corresponds to the mean of the $j$-th dimension, and $\sigma_j \in \mathbb{R}$ represents its associated standard deviation. The complete dimensionality of the latent space is denoted by $J$. This dual-objective optimization ensures both accurate reconstruction of the input state and a well-structured latent space. The encoder network employs convolutional layers to transform the input image into two vectors: a mean vector and a standard deviation vector, from which the latent vector is sampled.

[Fig. 2]

Variational Autoencoder for parsing driver view
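The two loss terms and the latent sampling step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the second term is written in the standard non-negative Kullback-Leibler form of ^[9], that is, $-\frac{1}{2}\sum_j (1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2)$, and the function names `vae_loss` and `sample_latent` are hypothetical.

```python
import numpy as np

def vae_loss(y_label, y_recon, mu, sigma):
    """Reconstruction loss plus KL regulariser for a diagonal Gaussian latent."""
    y_label = np.asarray(y_label, dtype=float)
    y_recon = np.asarray(y_recon, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    n = y_label.shape[0]  # number of samples
    # (1/n) * sum_k ||Y_k - Y_l||^2
    reconstruction = np.sum((y_recon - y_label) ** 2) / n
    # KL divergence summed over the J latent dimensions.
    kl = -0.5 * np.sum(1.0 + np.log(sigma ** 2) - mu ** 2 - sigma ** 2, axis=-1)
    # Assumption: when mu and sigma are batched, the KL term is averaged.
    return float(reconstruction + np.mean(kl))

def sample_latent(mu, sigma, rng):
    """Reparameterised sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(np.shape(mu))
    return np.asarray(mu) + np.asarray(sigma) * eps
```

With $\mu = 0$ and $\sigma = 1$ the regulariser vanishes, so a perfect reconstruction gives a loss of zero.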

2.3 Ray Casting Algorithm

While the front view or driver view observation supplies local information about the agent on the road, the bird’s-eye view (BEV) or top camera captures the surrounding information independently of the driving direction. The BEV segmented view can be obtained by using ^[10]. However, feeding it directly into the DRL model would make convergence difficult and increase the computational burden. Therefore, we propose an efficient ray casting algorithm to extract essential spatial information by parsing distances to environmental boundaries.

To begin with, the process is illustrated in [Fig. 3]: it takes a BEV segmented view as input and parses it with image processing techniques. Specifically, we convert the BEV segmented view into a binary image B_t in which the drivable area is 255 and the non-drivable area is 0. Subsequently, we extract the binary contour C(α), which represents the boundaries between drivable and non-drivable regions. The continuous contour function C(α) maps the curve parameter α to a two-dimensional spatial coordinate:

$$C: \alpha \mapsto \big(x(\alpha), y(\alpha)\big) \in \mathbb{R}^2 \tag{3}$$

[Fig. 3]

The diagram of preprocessing the BEV image
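The binarization and boundary extraction can be sketched with NumPy alone. This is a simplified stand-in for a contour-tracing routine: it marks drivable pixels that touch a non-drivable 4-neighbour instead of tracing an ordered contour C(α). The function names and the `drivable_ids` label set are assumptions made for illustration.

```python
import numpy as np

def binarize_bev(seg_labels, drivable_ids):
    """Drivable pixels become 255, everything else becomes 0."""
    mask = np.isin(seg_labels, list(drivable_ids))
    return np.where(mask, 255, 0).astype(np.uint8)

def boundary_pixels(binary):
    """Return (x, y) pixels of the drivable area that border a non-drivable pixel."""
    drivable = binary > 0
    # Pad with non-drivable pixels so that the image border counts as a boundary.
    padded = np.pad(drivable, 1, constant_values=False)
    all_neighbours_drivable = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    edge = drivable & ~all_neighbours_drivable
    rows, cols = np.nonzero(edge)
    return np.stack([cols, rows], axis=1)
```

The returned points are unordered, which is sufficient when only distances to the boundary are needed.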

The contour results are presented in [Fig. 4], with the output of the ray casting algorithm displayed in the leftmost image. With the environmental boundaries defined, we proceed to the ray casting mechanism. The algorithm employs an angular resolution Δρ, which determines the granularity of environmental sampling. Given this resolution, we distribute N rays uniformly around the vehicle’s position, where each ray R_i is cast at angle ρ_i = i·Δρ, for i ∈ {0, 1, ..., N-1}. This uniform sampling ensures comprehensive coverage of the surrounding environment. The core of the algorithm is computing the intersections between these rays and the environmental boundaries. For each ray R_i, we determine its intersection with the contour C(α) by solving the equation C(α) = R_i(λ), as shown in [Fig. 5], where $\lambda \in \mathbb{R}^{+}$ represents the ray parameter. This intersection yields the distance to obstacles or boundaries in each direction. To obtain the final distance measurements, we compute a distance vector D = [d_0, d_1, ..., d_{N-1}]. Each component d_i is determined by finding the point (x_i, y_i) = C(α_min) that minimizes the Euclidean distance to the corresponding ray within its angular sector [ρ_i, ρ_i + Δρ]. Mathematically, this is expressed as:

$$d_i = \min_{\alpha \in [\alpha_l, \alpha_u]} \lVert C(\alpha) - R_i(\lambda) \rVert \tag{4}$$

where [α_l, α_u] represents the contour parameter range corresponding to the angular sector [ρ_i, ρ_i + Δρ]. Distance vector D serves as a compact yet informative representation of the environment’s spatial structure. This representation significantly reduces input dimensionality while preserving the essential spatial information necessary for autonomous navigation. By transforming complex visual data into a more manageable form, our ray casting algorithm enables efficient processing within the DRL framework while maintaining the fidelity of environmental perception.

[Fig. 4]

The processed images

[Fig. 5]

The principle of the ray casting algorithm
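A per-sector distance vector of this kind can be sketched as follows. The sketch makes a simplifying assumption: instead of solving the ray-contour intersection exactly, it takes the range from the vehicle to the nearest contour point whose bearing falls inside each sector, and it returns a hypothetical `max_range` for sectors that contain no contour point. The function name and arguments are illustrative.

```python
import numpy as np

def ray_cast_distances(contour_xy, origin, num_rays, max_range):
    """Return one distance per angular sector [rho_i, rho_i + delta_rho)."""
    pts = np.asarray(contour_xy, dtype=float) - np.asarray(origin, dtype=float)
    ranges = np.hypot(pts[:, 0], pts[:, 1])
    # Bearing of every contour point, wrapped into [0, 2*pi).
    bearings = np.mod(np.arctan2(pts[:, 1], pts[:, 0]), 2.0 * np.pi)
    delta_rho = 2.0 * np.pi / num_rays
    # Sector index of each point; clip guards against rounding at 2*pi.
    sector = np.minimum((bearings / delta_rho).astype(int), num_rays - 1)
    distances = np.full(num_rays, float(max_range))
    # Unbuffered minimum: keep the closest point seen in each sector.
    np.minimum.at(distances, sector, ranges)
    return distances
```

For a BEV image, `contour_xy` would hold boundary pixel coordinates and `origin` the pixel position of the ego vehicle; converting pixel distances to metres depends on the BEV resolution, which is not specified here.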

2.4 Soft Actor Critic Network

Soft Actor-Critic (SAC) ^[11] is an off-policy maximum entropy deep reinforcement learning algorithm that provides a framework for sample-efficient and robust learning. The key principle of SAC is to maximize both the expected return and the entropy of the policy, which encourages exploration while maintaining stable learning. This is formulated as:

$$\pi^{*} = \arg\max_{\pi} \, \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T} \gamma^{t} \Big( R(s_t, a_t) + \beta \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right] \tag{5}$$

where $\beta$ is the temperature parameter that determines the relative importance of the entropy term with respect to the reward; $\pi^{*}$ is the optimal policy; $\tau$ is a trajectory of state-action pairs; $\gamma \in [0,1]$ is the discount factor; $s_t$ and $a_t$ are the state and action at time step $t$, respectively; $R(s_t, a_t)$ is the expected reward; and $\mathcal{H}(\pi(\cdot \mid s_t))$ represents the entropy of the policy at state $s_t$. SAC employs two Q-functions to mitigate positive bias in policy improvement, together with a state value function. These are defined in Eqs. (6) and (7), respectively.

$$Q_{\phi}(s_t, a_t) = r_t + \gamma \, \mathbb{E}_{s_{t+1}} \big[ V(s_{t+1}) \big] \tag{6}$$

$$V(s_t) = \mathbb{E}_{a_t \sim \pi} \big[ Q(s_t, a_t) - \beta \log \pi(a_t \mid s_t) \big] \tag{7}$$

The policy network in SAC is parameterized as a Gaussian distribution with state-dependent mean and standard deviation:

$$\pi_{\theta}(a_t \mid s_t) = \mathcal{N}\big(\mu_{\theta}(s_t), \sigma_{\theta}(s_t)\big) \tag{8}$$

Then, it is updated to minimize the Kullback-Leibler divergence as:

$$L_{\pi}(\theta) = \mathbb{E}_{s_t \sim D} \Big[ \mathbb{E}_{a_t \sim \pi_{\theta}} \big[ \beta \log \pi_{\theta}(a_t \mid s_t) - Q_{\phi}(s_t, a_t) \big] \Big] \tag{9}$$

where $\theta$ denotes the policy network parameters, $\sigma_{\theta}$ is the standard deviation of the policy distribution, and $\mu_{\theta}$ is its mean. To further encourage exploration, the temperature parameter $\beta$ is automatically adjusted during training to achieve a target entropy $\mathcal{H}_{target}$:

$$L(\beta) = \mathbb{E}_{a_t \sim \pi_t} \big[ -\beta \log \pi_t(a_t \mid s_t) - \beta \, \mathcal{H}_{target} \big] \tag{10}$$

The final loss comprises three different losses: those from Eqs. (9) and (10), and a Q-function loss expressed as:

$$L_{Q}(\phi) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D} \left[ \frac{1}{2} \Big( Q_{\phi}(s_t, a_t) - \big( r_t + \gamma \, \mathbb{E}_{a_{t+1}} \big[ Q_{\bar{\phi}}(s_{t+1}, a_{t+1}) - \beta \log \pi_{\theta}(a_{t+1} \mid s_{t+1}) \big] \big) \Big)^{2} \right] \tag{11}$$

The training process consists of four basic steps: sampling a batch from the replay buffer, updating the Q-networks using L_Q and the policy using L_π, adjusting the temperature parameter β, and updating the target network.
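The three losses and the target update can be written out numerically as a sketch. It evaluates Eqs. (9) to (11) on precomputed arrays rather than on networks, approximates each expectation with one sampled action per state, uses a single target Q-value as in Eq. (11), and ignores terminal-state masking. All function and argument names are assumptions. The target update is the usual Polyak averaging, which is assumed here to be what the soft update coefficient in [Table 1] refers to.

```python
import numpy as np

def sac_losses(q_pred, q_new, q_target_next, log_pi, log_pi_next,
               rewards, beta, gamma, target_entropy):
    """Batch estimates of the Q-function, policy and temperature losses."""
    # Soft Bellman target: r_t + gamma * (Q_target(s', a') - beta * log pi(a'|s')).
    soft_target = rewards + gamma * (q_target_next - beta * log_pi_next)
    q_loss = 0.5 * np.mean((q_pred - soft_target) ** 2)                  # Eq. (11)
    # q_new is Q(s_t, a_t) for actions freshly sampled from the policy.
    policy_loss = np.mean(beta * log_pi - q_new)                        # Eq. (9)
    temperature_loss = np.mean(-beta * log_pi - beta * target_entropy)  # Eq. (10)
    return q_loss, policy_loss, temperature_loss

def soft_update(target_params, online_params, tau):
    """Polyak averaging of the target network parameters."""
    return (1.0 - tau) * target_params + tau * online_params
```

In an actual training loop these quantities would be computed with an automatic differentiation framework so that each loss updates only its own parameters.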

2.5 Reward Function

Given that the optimal policy is highly influenced by reward signals, the agent’s behavior is significantly shaped by the design of the reward function. In this case, the reward function is structured around five key factors, with the first being the vehicle’s speed, which is regulated to ensure appropriate speed control. It is defined as:

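$$R_s(\epsilon) = \begin{cases} \dfrac{\epsilon}{\epsilon_l}, & \text{if } 0 \le \epsilon < \epsilon_l \\ 1, & \text{if } \epsilon_l \le \epsilon \le \epsilon_t \\ 1 - \dfrac{\epsilon - \epsilon_t}{\epsilon_u - \epsilon_t}, & \text{if } \epsilon_t < \epsilon \le \epsilon_u \end{cases} \tag{12}$$

where $\epsilon$ is the current speed of the car; $\epsilon_l$ and $\epsilon_u$ are the lower and upper speed limits, respectively; and $\epsilon_t$ is the target speed that the car needs to achieve. The centering factor is defined as: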

$$R_c(\varrho_c) = \max\left( 1 - \frac{\varrho_c}{\varrho_{max}}, 0 \right) \tag{13}$$

where $\varrho_c$ is the distance of the vehicle from the center of the lane and $\varrho_{max}$ is the maximum allowable distance for full reward. The standard deviation reward is calculated using the standard deviation $\sigma_{\varrho_c}$ of the history of distances from the agent to the center path. This is formulated as:

$$R_{std}(\sigma_{\varrho_c}) = \max\left( 1 - \frac{\sigma_{\varrho_c}}{\sigma_{max}}, 0 \right) \tag{14}$$

The angle factor measures the angular deviation $\phi$ from the road centerline. Its reward is defined as:

$$R_a(\phi) = \max\left( 1 - \frac{\phi_{rad}}{\phi_{max}}, 0 \right) \tag{15}$$

The ray casting algorithm is used to maintain a safe distance from obstacles and the roadside. Its reward is stated as:

$$R_{ray}(D) = \min_{i = 1, \ldots, N} \begin{cases} \exp\big( -\kappa \, (d_{safe} - d_i) \big), & \text{if } d_i < d_{safe} \\ 1, & \text{if } d_i \ge d_{safe} \end{cases} \tag{16}$$

where $\kappa$ is a scaling factor for the distance penalty and $d_{safe}$ is the safe distance threshold. Additionally, the reward $R_T$ for obeying the traffic light rule follows ^[4]. Finally, the total reward is a weighted product of all terms.
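The individual reward terms can be sketched directly from Eqs. (12) to (16). The function names are hypothetical, and returning zero for speeds outside the range covered by Eq. (12) is an assumption, since the equation does not define that case. Eqs. (13) to (15) share one clipped linear form, so a single helper covers all three. The weights of the final product are not specified in the text and are therefore left out.

```python
import math

def speed_reward(speed, lower, target, upper):
    """Piecewise speed reward of Eq. (12)."""
    if speed < 0.0 or speed > upper:
        return 0.0  # assumption: no reward outside the defined range
    if speed < lower:
        return speed / lower
    if speed <= target:
        return 1.0
    return 1.0 - (speed - target) / (upper - target)

def clipped_linear_reward(value, max_value):
    """Shared form of Eqs. (13)-(15): max(1 - value / max_value, 0)."""
    return max(1.0 - value / max_value, 0.0)

def ray_reward(distances, safe_distance, kappa):
    """Eq. (16): the worst (smallest) per-ray safety reward."""
    return min(
        1.0 if d >= safe_distance else math.exp(-kappa * (safe_distance - d))
        for d in distances
    )
```

Because the ray reward takes the minimum over all rays, a single ray closer than the safe distance is enough to lower the term, regardless of how clear the other directions are.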

3. Experimental Results

3.1 Environment and Setup

To validate our method, we conducted our experiments on the CARLA simulator, which allows us to collect comprehensive types of inputs. The simulation is reset whenever the robot experiences a collision within the virtual environment. We trained the DRL algorithm within the framework using a service robot model for approximately 1,200 episodes. The entire training process with SAC took only about 8 hours.

[Table 1] shows the training configuration of our experimental setup. In detail, the experiments were conducted on a system equipped with an NVIDIA GeForce RTX 2060 GPU and an Intel Core i7-8700 processor running at 3.20 GHz, supported by 16 GB of RAM. The software environment consisted of Ubuntu 20.04 as the operating system and CARLA version 0.9.13 for simulation. The proposed method is implemented in the PyTorch framework.

[Table 1]

Training parameters

3.2 Evaluation

First, to demonstrate the effectiveness of our method compared with other methods, we adopted the testing metric from ^[12]. Town 1 and Town 2 are set up with traffic light states. Our method achieves a robust success rate and runs smoothly in all cases. Moreover, we improved safety by using the ray casting algorithm, which reduces the rate of collisions with both obstacles and the footpath. Because SAC enhances exploration, the agent tends to find the shortest path. This behavior is not optimal during turns, since the vehicle may brush the roadside, which is an undesirable situation in autonomous driving. In addition, our work still optimizes for the traffic light state in the DRL network. The ray casting algorithm also supports this process by evaluating the measured surrounding distances. When the vehicle reaches an intersection, the front-view and side-view distances may become significantly large and change predictably. [Fig. 6] shows that the autonomous car is able to recognize a red light at an intersection and stop.

[Fig. 6]

Red-light scenario. The red box indicates the vehicle speed

The algorithm is compared with two state-of-the-art algorithms, DDPG ^[15] and PPO ^[16]. [Fig. 7] and [Fig. 8] illustrate the comparative performance of the different trajectory planning methods, including our implemented SAC approach, against the ground truth path. The plots show trajectories in a 2D coordinate system, with all methods starting from the marked initial point (approximately at x=0, y=0). SAC demonstrates remarkable alignment with both the ground truth and the alternative methods (DDPG and PPO), particularly in handling the sharp vertical transition near x=-80. This indicates that our SAC implementation successfully captures the optimal path planning behavior, matching the ground truth’s characteristics in both the steep ascent phase and the subsequent horizontal stabilization at approximately y=30. PPO, in contrast, fails at the sharp turn during testing. DDPG and SAC both stay closely aligned with the ground truth, but SAC travels the shortest path. As demonstrated in [Table 2], SAC achieves shorter path lengths due to its enhanced exploration capabilities, which are attributed to its entropy-based optimization. While this approach results in slightly larger center-line deviations, the algorithm maintains collision-free navigation and operational safety throughout all test scenarios. This shows that the whole pipeline of our framework works well in street environments. To validate our framework, we also compare it against state-of-the-art methods on CARLA benchmarks. The results illustrated in [Fig. 9] demonstrate that our proposed approach not only performs well in generalized tasks but also exhibits robust performance during sharp turns while maintaining acceptable velocities.

[Fig. 7]

The trajectories of the compared methods on Town 1

[Fig. 8]

The trajectories of the compared methods on Town 2

[Table 2]

Quantitative comparison of performance from different types of policies

[Fig. 9]

Success rate compared with other existing pipelines: MP, IL, RL, CIL, CIRL, CAL ^[7,13,14]

4. Conclusion

In conclusion, we present an end-to-end framework for autonomous driving within the CARLA simulation environment, specifically designed to enhance both robust decision-making and energy efficiency. The framework’s performance and effectiveness are validated through comparisons with state-of-the-art methods, demonstrating its potential for real-time applications. Additionally, by introducing refined stages that effectively leverage semantic information, the framework achieves both real-time processing capabilities and improved robustness in various driving tasks. For future work, we will consider incorporating temporal information to enhance consistency across frames, as well as analyzing the distribution of distances within each angular sector to further improve decision accuracy and reliability.

Acknowledgments

This research was supported in part by the National Research Foundation (NRF) funded by the Korean government (MSIT) (RS-2024-00421129) and in part by Innovative Human Resource Development for Local Intellectualization program through the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (IITP-2025-RS-2020-II201462, 50).

References

[1] V. M. Tran and G.-W. Kim, “Cooperative Deep Reinforcement Learning Policies for Autonomous Navigation in Complex Environments,” IEEE Access, vol. 12, pp. 101053-101065, Jul., 2024. [https://doi.org/10.1109/ACCESS.2024.3429230]
[2] T.-T.-N. Nguyen, T.-D. Phan, M.-T. Duong, C.-T. Nguyen, H.-P. Ly, and M.-H. Le, “Sensor Fusion of Camera and 2D LiDAR for Self-Driving Automobile in Obstacle Avoidance Scenarios,” 2022 International Workshop on Intelligent Systems (IWIS), Ulsan, Republic of Korea, pp. 1-7, 2022. [https://doi.org/10.1109/IWIS56333.2022.9920917]
[3] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez, “Deep Reinforcement Learning for Autonomous Driving: A Survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 4909-4926, Jun., 2022. [https://doi.org/10.1109/TITS.2021.3054625]
[4] L. Anzalone, S. Barra, and M. Nappi, “Reinforced Curriculum Learning For Autonomous Driving In Carla,” 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, pp. 3318-3322, 2021. [https://doi.org/10.1109/ICIP42928.2021.9506673]
[5] J. Sun, X. Fang, and Q. Zhang, “Reinforcement Learning Driving Strategy based on Auxiliary Task for Multi-Scenarios Autonomous Driving,” 2023 IEEE 12th Data Driven Control and Learning Systems Conference (DDCLS), Xiangtan, China, pp. 1337-1342, 2023. [https://doi.org/10.1109/DDCLS58216.2023.10166271]
[6] J. Wang, L. Chu, Y. Zhang, Y. Mao, and C. Guo, “Intelligent Vehicle Decision-Making and Trajectory Planning Method Based on Deep Reinforcement Learning in the Frenet Space,” Sensors, vol. 23, no. 24, Dec., 2023. [https://doi.org/10.3390/s23249819]
[7] A. Dosovitskiy, G. Ros, F. Codevilla, A. M. López, and V. Koltun, “CARLA: An Open Urban Driving Simulator,” arXiv:1711.03938, 2017. [https://doi.org/10.48550/arXiv.1711.03938]
[8] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles,” Field and Service Robotics, pp. 621-635, Nov., 2017. [https://doi.org/10.1007/978-3-319-67361-5_40]
[9] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” arXiv:1312.6114, 2013. [https://doi.org/10.48550/arXiv.1312.6114]
[10] L. Reiher, B. Lampe, and L. Eckstein, “A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird’s Eye View,” 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, pp. 1-7, 2020. [https://doi.org/10.1109/ITSC45102.2020.9294462]
[11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” arXiv:1801.01290, 2018. [https://doi.org/10.48550/arXiv.1801.01290]
[12] J. Wang, Y. Wang, D. Zhang, Y. Yang, and R. Xiong, “Learning hierarchical behavior and motion planning for autonomous driving,” 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, pp. 2235-2242, 2020. [https://doi.org/10.1109/IROS45743.2020.9341647]
[13] F. Codevilla, M. Müller, A. López, V. Koltun, and A. Dosovitskiy, “End-to-End Driving Via Conditional Imitation Learning,” 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, pp. 4693-4700, 2018. [https://doi.org/10.1109/ICRA.2018.8460487]
[14] A. Sauer, N. Savinov, and A. Geiger, “Conditional Affordance Learning for Driving in Urban Environments,” arXiv:1806.06498, 2018. [https://doi.org/10.48550/arXiv.1806.06498]
[15] E. H. Sumiea, S. J. Abdulkadir, H. S. Alhussian, S. M. Al-Selwi, A. Alqushaibi, M. G. Ragab, and S. M. Fati, “Deep Deterministic Policy Gradient Algorithm: A Systematic Review,” Heliyon, vol. 10, no. 9, May, 2024. [https://doi.org/10.1016/j.heliyon.2024.e30697]
[16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” arXiv:1707.06347, 2017. [https://doi.org/10.48550/arXiv.1707.06347]

Thanh-Danh Phan

2018 HCMC University of Technology and Education (Bachelor degree)

2023~Present Department of Intelligent Systems and Robotics, Chungbuk National University, Korea (Master degree)

Research interests: SLAM, Vision Robot, Autonomous Driving, Reinforcement Learning

Gon-Woo Kim

2006 Seoul National University, Electrical and Computer Engineering (PhD)

2006~2008 Senior Researcher, Robotics Technology Center, Korea Institute of Industrial Technology

2008~2012 Assistant Professor, Wonkwang University

2012~Present Professor, Chungbuk National University

Research interests: Navigation, Localization, SLAM.

Parameter	Value
Batch size	256
Discount factor	0.98
Soft update coefficient	0.02
Learning rate	lr_schedule (5e-4, 1e-6, 2)
Replay buffer size	500000

Method	Speed Mean (km/h)	Center Deviation Mean (m)	Total Distance (m)	Policy Type	Success Rate (%)
DDPG	4.78	0.100	1144.20	Off-Policy	100
PPO	4.81	0.231	248.35	On-Policy	0
Proposed	14.95	0.174	1142.82	Off-Policy	100