Control and Simulation of a 6-DOF Biped Robot based on Twin Delayed Deep Deterministic Policy Gradient Algorithm

Objectives: To study an algorithm that controls a bipedal robot so that it walks with a gait close to that of a human. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is a highly efficient algorithm that introduces a few changes to the widely used Deep Deterministic Policy Gradient (DDPG) algorithm for continuous action space problems in Reinforcement Learning. Methods: Unlike the commonly used sparse reward model, this study proposes a reward model that combines a sparse reward function with a dense reward function. The TD3 algorithm, together with the proposed reward model, is applied to control a bipedal robot model with 6 degrees of freedom. The training process is simulated in the Gazebo/Robot Operating System (ROS) environment. Findings: The results show that choosing a reward model that combines a sparse reward function and a dense reward function suited to the robot model helps it learn faster and achieve better results. The biped robot can walk straight with an almost human-like gait. The paper also compares the results of the TD3 algorithm combined with the proposed reward model with the results of other algorithms. Novelty: The TD3 algorithm combined with the proposed reward model is applied to the 6-DOF biped robot, and the robot's gait is simulated in the Gazebo/ROS environment; ROS is a middleware that can be used to control a robot in a real environment in the future.


Introduction
With the strong development of digital technology and artificial intelligence, robots are gradually being used in production and in everyday life, and they can replace humans in difficult and dangerous jobs. The work efficiency of robots keeps improving, gradually relieving people of tasks that are hazardous to their health. Robots can also be sociable and interact directly with people. With these benefits, robotics research is attracting the interest of many researchers, and many robot models are designed to resemble the human shape so that the robot can be controlled in the most convenient way based on the human gait. One of the important factors in controlling robots is their ability to move; many ideas have been put forward so that robots can move quickly and accurately, such as wheeled robots (1)(2)(3)(4). However, the disadvantage of wheeled robots is that, when facing an obstacle ahead, the robot must move in a different direction to avoid it, whereas a human can simply step over a small obstacle and continue. Therefore, studying, designing, and controlling a robot with a human shape, able to perform human-like movement, will help the robot move more flexibly on more complex terrain.
There are many methods used to control biped robots. With traditional methods (5,6), we need to set up the Denavit-Hartenberg table for the robot, perform kinematic and dynamic analysis, and design the motion trajectory of the joints. When the robot model changes, most parameters must be recalculated from the beginning. One of the most commonly used methods in biped robot control is the zero-moment point (ZMP) (7,8), which is used to set the motion trajectory of the robot's legs. However, this method is strongly affected by noise: when there is noise from the environment, the robot can easily fall, so it is difficult to apply in a real environment.
To keep the robot from falling when there is noise during its motion, several approaches have been proposed (9)(10)(11). The article (12) suggests a solution based on sensor signals to detect the current state of the robot, after which a Support Vector Machine (SVM) classifier determines whether the robot is about to fall and what position it is in, so that the necessary control signals can be issued to prevent the fall. However, after detecting the falling position of the robot, action must still be taken based on the robot's dynamics and kinematics (13) to prevent it from falling.
Recently, with the advancement of machines and fast computational processors, methods using neural networks have been getting more and more attention for their efficiency, as they help reduce the dynamic and kinematic computation of the robot. Reinforcement Learning (RL) is a widely known branch of artificial intelligence because its models are trained the way humans naturally learn something new, by trial and error (14,15). RL has achieved impressive results (16), such as DeepMind's AlphaGo (17) beating a professional Go player, or AlphaFold in the medical field (18). RL is also applied to robotics (19) and has applications in manufacturing and daily life (20), such as robotic arms (21)(22)(23) or the Turtlebot (24). For a biped robot, RL can be applied to help the robot learn and perform a human-like gait when walking.
In the article (25), an RL algorithm is applied to find a suitable gait for the robot. Their method is model-based: appropriate actions are chosen based on a Poincare map model. With model-free algorithms, the learning process of the robot does not need to rely on any other model; the robot learns to walk like a human by itself. One of the model-free RL algorithms commonly used for continuous action space problems is Deep Deterministic Policy Gradient (26); it is applied, and its learning process simulated, in the article (27) for a bipedal robot with only 4 DOF. The training time for that robot is very long because only a sparse reward is used. Although a sparse reward is simple to set up, it makes training challenging (28): the model is difficult to optimize because the sparse reward function returns only one reward value during a trajectory. So, in order for the model to reach the optimal value faster, in this study we combine the sparse reward function with a dense reward function. With a dense reward function, the model converges faster because each step of the robot receives a different reward value that evaluates the output of the model. The robot's data is therefore richer, and each input has a suitable reward value.
Recently, TD3 (29) and its variants have been applied to some types of robots, such as manipulator arms (30,31) and the 4-legged Ant robot (32), simulated on the simple PyBullet platform. However, it has not yet been applied to and simulated for bipedal robots in ROS/Gazebo, which can model complex environments and offers realistic operation with sensors (33). Gazebo is a convenient simulation software for testing these RL algorithms, and the ease of changing the parameters of the robot and the environment in this software helps close the gap between the simulated environment and the real environment, which is convenient for robot control in future studies. Moreover, ROS is the middleware that can be used to apply the TD3 algorithm to the biped robot in the real environment. Therefore, it is necessary to evaluate reinforcement learning algorithms in the ROS/Gazebo environment before putting the robot into the real environment.
This study presents a 6-DOF biped robot model built to evaluate reinforcement learning algorithms and shows how the robot's self-learning process is simulated in ROS/Gazebo so that the robot can walk like a human. The training method used is the TD3 algorithm (29), and this study demonstrates its effectiveness for a biped robot. With the TD3 algorithm and a combination of a dense reward (34) and a sparse reward for the robot's learning process, the training time for the robot is significantly reduced and the cumulative reward is also higher than when other RL algorithms are used for the biped robot.
The next sections of the article present the general theory of RL and the TD3 algorithm (section 2), the shape and size of the robot model and the training process used in the simulation (section 3), and section 4 discusses the results obtained in this research.

Reinforcement learning: TD3 algorithm
Reinforcement learning is an algorithm in which the agent interacts with the environment to decide which action should be taken at each timestep. At a timestep, the agent receives a state, which is the agent's observation of the environment E, and receives a reward value corresponding to that state. The reward function is shown in (1).

R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)   (1)
Here γ is the discount factor (0 < γ < 1) and r_i = r(s_i, a_i) is the reward the agent receives when it performs action a_i in state s_i. In (1), the reward function is calculated as the sum of the reward values r_i, each multiplied by the discount factor γ, from timestep t to T, where T is the terminal timestep of the agent's trajectory. The factor γ^{i-t} means that the further away a state is (the bigger i is), the less effect it has on the current return R_t. Through this reward function, the agent decides which action a_i ∈ A from the policy is most beneficial and performs it, after which the environment returns the next state. The goal of the algorithm is to find the set of parameters ϕ of the policy network π_ϕ that maximizes the expected return or cumulative reward; the policy network is a network that outputs the actions of the agent after receiving the state from the environment as input. The expected return is shown in (2).

J(\phi) = \mathbb{E}_{s_i \sim p_\pi, a_i \sim \pi_\phi}[R_t]   (2)

For continuous actions with a large action space, an actor-critic method is needed to find the optimal policy. The parameters ϕ of the policy network (known as the actor network in the actor-critic method) are updated by calculating the gradient of the expected return ∇_ϕ J(ϕ), as in the deterministic policy gradient algorithm of D. Silver (35). To evaluate the agent's action, a value function or critic function Q^π(s, a) evaluates the agent at state s and action a (with a taken according to the policy π). The gradient of the expected return is represented by (3).

\nabla_\phi J(\phi) = \mathbb{E}\left[ \nabla_a Q^{\pi}(s, a)\big|_{a = \pi(s)} \, \nabla_\phi \pi_\phi(s) \right]   (3)
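As a small illustration of (1), the discounted return of every timestep in a recorded trajectory can be computed with a single backward pass over the reward sequence. The sketch below is a minimal example, assuming only that the per-step rewards are stored in a Python list.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t of equation (1) for every timestep t of a trajectory.

    rewards: list of r_i collected from the first to the terminal timestep T.
    Returns a list whose element t equals sum_{i=t..T} gamma**(i - t) * r_i.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Iterate backwards so that each R_t reuses the already computed R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```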
Many algorithms have been used for continuous action problems, such as Proximal Policy Optimization (PPO) (36), Soft Actor-Critic (SAC) (37), Deep Deterministic Policy Gradient (DDPG) (26), and Twin Delayed DDPG (TD3) (29), with good results achieved on robotic arms (30,31) and the 4-legged Ant robot (32); TD3 is an appropriate algorithm for our robot. As mentioned in the article (29), TD3 is an actor-critic algorithm. It consists of one actor network (or policy network) π_ϕ, two critic networks Q_{θ_i} (i = 1, 2), and their target networks (one actor target network π_{ϕ'} and two critic target networks Q_{θ'_i} (i = 1, 2)), where θ_i is the set of parameters of the critic network used to approximate the critic function Q^π(s, a). The actor network outputs the action values from the state s, i.e. the information obtained from the environment, with an added exploration noise (37): a ∼ π_ϕ(s) + ε, with ε ∼ N(0, σ). The target action a' is also computed with a clipped noise:

a' = \pi_{\phi'}(s') + \varepsilon, \quad \varepsilon \sim \mathrm{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)

Adding the clipped noise to the target action keeps the target action close to the original action. The noise used follows the Gaussian distribution N(0, σ).
The critic networks use the Bellman equation (38) to update their parameters. The two critic networks give two Q-values, and the smaller of the two target values Q_{θ'_i} (i = 1, 2) is used to calculate the target value y(r, s') in (4).

y(r, s') = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', a')   (4)

After the target value y(r, s') is calculated, the parameters of the critic networks are learned by minimizing the mean squared error between Q_{θ_1}, Q_{θ_2} and y(r, s'), as in (5) and (6).

L(\theta_i, D) = \mathbb{E}_{(s, a, r, s') \sim D}\left[ \left( Q_{\theta_i}(s, a) - y(r, s') \right)^2 \right], \quad i = 1, 2   \quad (5), (6)
Here D is the replay buffer consisting of the transitions (s, a, r, s') received during the agent's exploration process. When the number of transitions (s, a, r, s') stored in the replay buffer D is large enough, a batch, i.e. a number of samples, is randomly selected from D to calculate the loss values L(θ_i, D) (i = 1, 2). The parameters ϕ of the actor network are updated by maximizing Q_{θ_1}, as in (7).

\nabla_\phi J(\phi) = \mathbb{E}\left[ \nabla_a Q_{\theta_1}(s, a)\big|_{a = \pi_\phi(s)} \, \nabla_\phi \pi_\phi(s) \right]   (7)

Moreover, TD3 updates the policy and target parameters less frequently (29), which lowers the variance so that the updated parameters (8), (9) are more accurate. To reduce the frequency of parameter updating, a delay factor d is used, and to avoid overestimation, a hyperparameter τ (0 < τ < 1) is used, which makes the parameters of the target networks update slowly.

\theta'_i \leftarrow \tau \theta_i + (1 - \tau)\theta'_i   (8)
\phi' \leftarrow \tau \phi + (1 - \tau)\phi'   (9)
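The sketch below summarizes one TD3 update step as described above: target policy smoothing, the clipped double-Q target (4), the critic losses (5)-(6), the delayed actor update (7), and the soft target updates (8)-(9). It is a simplified illustration in PyTorch; the function signature, the replay-buffer format, and the optimizer organization are our own assumptions, not the exact implementation used in this study.

```python
import torch
import torch.nn.functional as F

def td3_update(actor, actor_target, critic1, critic2, critic1_target, critic2_target,
               actor_opt, critic_opt, batch, step,
               gamma=0.99, tau=0.005, policy_noise=0.2, noise_clip=0.5, d=2):
    s, a, r, s_next, done = batch  # transitions sampled from the replay buffer D

    with torch.no_grad():
        # Target policy smoothing: add clipped Gaussian noise to the target action.
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a_next = (actor_target(s_next) + noise).clamp(-1.0, 1.0)
        # Clipped double-Q target, equation (4).
        q_next = torch.min(critic1_target(s_next, a_next), critic2_target(s_next, a_next))
        y = r + gamma * (1.0 - done) * q_next

    # Critic losses, equations (5) and (6): mean squared error to the target y.
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy and target updates, performed every d steps.
    if step % d == 0:
        # Actor update, equation (7): maximize Q_theta1 of the actor's own action.
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Soft target updates, equations (8) and (9).
        for net, target in ((critic1, critic1_target), (critic2, critic2_target),
                            (actor, actor_target)):
            for p, p_t in zip(net.parameters(), target.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```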
The flow diagram of the TD3 algorithm is shown in Figure 1. The two critic networks each have 2 hidden layers of 256 neurons, and the actor network also has 2 hidden layers of 256 neurons. The discount factor γ is 0.99, the maximum number of timesteps is 1600, the batch size N is 64, the delay factor d is 2, and the learning rate of all networks is 0.0001. ReLU is used for the activation functions.
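A minimal sketch of the actor and critic networks with the architecture and hyperparameters listed above (two hidden layers of 256 neurons, ReLU activations, learning rate 0.0001) is given below; the class and variable names are our own illustration.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: 16-dimensional state -> 6 normalized joint angles in [-1, 1]."""
    def __init__(self, state_dim=16, action_dim=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Q-network: (state, action) -> scalar Q-value."""
    def __init__(self, state_dim=16, action_dim=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Hyperparameters from this section: gamma = 0.99, batch size N = 64,
# delay factor d = 2, learning rate 1e-4 for all networks.
actor, critic1, critic2 = Actor(), Critic(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()), lr=1e-4)
```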
With these improvements over DDPG, which was applied to the 4-DOF bipedal robot in (27), we apply the TD3 algorithm to a 6-DOF biped robot with a more human-like appearance.

Biped robot model
The shape and size of the biped robot are shown in Figure 2. The biped robot consists of seven links: one waist section, two thigh sections, two shank sections, and two feet. The dimensions of the links are all in meters. The symbols, mass, and material of each link are shown in Table 1, and the types of joints connecting the links are shown in Table 2. Therein, the links "Ground", "Cyl_below", "Cyl_above" and "Horiz" only support the simulation and do not affect the results. The coordinate system of the joints is shown in Figure 3; the robot walks along the x-axis (in the absolute coordinate system). The rotation ranges of the six joints, i.e. the hip joints, knee joints, and ankle joints, during movement in the sagittal plane are tabulated in Table 3 and illustrated in Figure 4. Figure 4 shows the direction of rotation and the flexion and extension states of the joints (hip, knee, and ankle). The desired angles (the output of the algorithms) are normalized within these rotation ranges, as in the sketch below.
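Since the actor outputs are normalized, they must be mapped back into the rotation ranges of Table 3 before being sent to the joint controllers. The sketch below shows one simple way to do this; the limit values are hypothetical placeholders, not the values from Table 3.

```python
# Hypothetical joint limits (rad); the real values are those listed in Table 3.
JOINT_LIMITS = {
    "hip_left":   (-0.8, 0.8), "hip_right":   (-0.8, 0.8),
    "knee_left":  (-1.2, 0.0), "knee_right":  (-1.2, 0.0),
    "ankle_left": (-0.5, 0.5), "ankle_right": (-0.5, 0.5),
}

def denormalize_actions(actions):
    """Map actor outputs in [-1, 1] to desired joint angles within their limits."""
    desired = {}
    for value, (name, (low, high)) in zip(actions, JOINT_LIMITS.items()):
        desired[name] = low + (value + 1.0) * 0.5 * (high - low)
    return desired
```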

Training process
After the model is built in 3D design software, it is converted to .stl format and a .urdf file; the .urdf file contains information such as the mass, the moments of inertia, and the angle limits of the joints. With this format, the robot can be launched into Gazebo to simulate the training process, and ROS is used to operate the robot in this simulation environment at a frequency of 60 Hz. During the training process, the robot begins learning from random actions (a) (the actions are calculated from the actor network with added Gaussian noise), which constitutes the exploration of the agent (the biped robot). The sensors send signals that form the next state (s') of the agent. At each timestep, when the robot performs the action calculated from the neural network, a reward (r) is also calculated to create a transition (s, a, r, s'). These transitions are stored in the Replay Buffer; when the number of transitions in the Replay Buffer is large enough, the neural network randomly selects a number of transitions from the Replay Buffer as inputs to update its parameters according to the TD3 algorithm presented in section 2. This interaction between the neural network and the environment through action, state, and reward is shown in Figure 5. Information about the action, state, and reward used is given below.
Action: At each timestep, the trained model outputs 6 action values, which are the desired angles of the joints, so that the robot moves forward quickly without falling. A PID controller rotates the joints of the robot to those desired positions. With ROS, the PID coefficients are manually tuned so that the robot reaches those desired angles quickly, accurately, and consistently.
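To make the action pathway concrete, the sketch below shows how the six desired angles could be published to ROS position controllers that run the PID loops; the node and topic names are hypothetical placeholders, since the actual names depend on the robot's .urdf and controller configuration.

```python
import rospy
from std_msgs.msg import Float64

rospy.init_node("biped_action_publisher")

# Hypothetical controller command topics; the real names depend on how the
# joints are declared in the .urdf file and the ros_control configuration.
JOINT_TOPICS = [
    "/biped/left_hip_position_controller/command",
    "/biped/right_hip_position_controller/command",
    "/biped/left_knee_position_controller/command",
    "/biped/right_knee_position_controller/command",
    "/biped/left_ankle_position_controller/command",
    "/biped/right_ankle_position_controller/command",
]
publishers = [rospy.Publisher(t, Float64, queue_size=1) for t in JOINT_TOPICS]
rate = rospy.Rate(60)  # the 60 Hz control frequency mentioned above

def send_joint_targets(desired_angles):
    """Publish the 6 desired joint angles; the PID position controllers track them."""
    for pub, angle in zip(publishers, desired_angles):
        pub.publish(Float64(angle))
    rate.sleep()
```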
State: After performing these actions, the agent receives a state with 16 components from the environment, including the robot's speed along its direction of movement (x-axis), the robot's speed in the direction perpendicular to the ground (z-axis), the angle and angular velocity of the hip joints, the angle and angular velocity of the knee joints, the angle and angular velocity of the ankle joints, and the grounding state of the feet. We choose these components because they can be measured by sensors located at the links and joints of the robot and they characterize the movement of the robot.
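For clarity, the 16 state components listed above can be assembled into a single vector as in the sketch below; the ordering and the argument names are our own illustration.

```python
import numpy as np

def build_state(base_vel_x, base_vel_z, joint_angles, joint_velocities,
                left_foot_contact, right_foot_contact):
    """Assemble the 16-dimensional state vector described above.

    joint_angles / joint_velocities: 6 values each, e.g. ordered as
    (left hip, right hip, left knee, right knee, left ankle, right ankle).
    Foot contact flags are 1.0 when the foot touches the ground, else 0.0.
    """
    return np.concatenate((
        [base_vel_x, base_vel_z],                  # forward (x) and vertical (z) speed
        joint_angles,                              # 6 joint angles
        joint_velocities,                          # 6 joint angular velocities
        [left_foot_contact, right_foot_contact],   # grounding state of the feet
    )).astype(np.float32)
```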
Reward: Based on the dense reward function in the article (34) for a bipedal robot, our reward function is designed by combining a sparse reward and a dense reward function for our biped robot. With the sparse reward, we set a big bonus (+10) when the robot goes all the way, and the biped robot pays a heavy penalty (-10) if it falls, which addresses the robot's balance problem. In addition, with the dense reward, at each timestep the robot receives a different reward value depending on the state received from the environment.
Here v_x (m/s) is the robot's speed in its direction of movement (x-axis), a_i (rad) are the desired joint angles (the actions of the agent), z (m) is the waist position in the z-direction of the robot (with the origin at the robot's highest position in its upright state), and m and n are constants that scale the penalty terms for the robot. Based on our experiments, we choose m = 3.0 and n = 0.05. In the human gait, the waist oscillates up and down with a cycle during walking. So, in order for the robot to have a similar gait, h_0 in the reward function is calculated by averaging the robot's waist position between its vertical state and its one-foot-forward state. Let H and L be the waist positions when the robot is upright (H = 0) and when the robot steps one foot forward (L = -0.003 m). Then h_0 is calculated as in (11).

h_0 = \frac{H + L}{2}   (11)
So h_0 is equal to -0.0015 m, which means that the robot's waist should stay around the position z = -0.0015 m while the robot is in motion. The dense reward function is intended to guide the robot to move as quickly as possible while keeping the vertical coordinate of its waist from changing too much during the movement. Finally, combining the two functions above, we obtain the reward function for the robot at timestep i, given in (12), where l_e is the maximum distance used during training: +10 is the big bonus if the robot goes all the way, and -10 is the heavy penalty if the robot falls.
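The sketch below illustrates one way the combined reward of (12) could be implemented. The sparse terms (+10 on reaching l_e, -10 on falling) follow the text directly, while the dense term is only an approximation of equation (10), assuming a forward-velocity bonus and penalties weighted by m = 3.0 and n = 0.05 on the waist-height deviation from h_0 and on the action magnitudes.

```python
M, N_COEF = 3.0, 0.05      # penalty constants m and n from the text
H0 = -0.0015               # desired average waist height (m), equation (11)
L_E = 15.0                 # maximum walking distance (m) used during training

def combined_reward(v_x, z_waist, actions, distance, fallen):
    """Sparse + dense reward, an approximation of equations (10) and (12)."""
    if fallen:
        return -10.0                       # heavy penalty for falling
    if distance >= L_E:
        return 10.0                        # big bonus for walking all the way
    # Approximate dense shaping term: reward forward speed, penalize waist
    # deviation from h_0 and large joint commands.
    return v_x - M * abs(z_waist - H0) - N_COEF * sum(a * a for a in actions)
```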

Results
To assess the effectiveness of the TD3 algorithm in the training simulation of the 6-DOF biped robot, we used two other RL algorithms, DDPG and SAC, for comparison. The training process is carried out over about 10000 episodes of 1600 timesteps each; an episode ends when the robot falls, walks 15 meters, or exceeds 1600 steps. TD3, with the advantages discussed in section 2, helps the robot achieve a higher reward than the other two algorithms, and the reward fluctuation range is also smaller (Figure 5). In Figure 5, the solid lines represent the reward values of the algorithms after every 100 episodes, and the dashed lines represent the average reward of each algorithm. With the same reward function as above (section 3.2), the same number of episodes (10000), the same robot parameters, and the same hyperparameters of the algorithms, such as learning rate and batch size, the average reward of the TD3 algorithm is higher than, but not much higher than, that of the other two algorithms, because each algorithm makes the robot walk fast forward with a different gait. However, Figure 7 shows that with the TD3 algorithm the robot has a more human-like gait than with the other two algorithms: the two joints corresponding to the two legs of the robot have similar curves in both amplitude and period, just as when humans walk forward. During human movement, when the left foot is the pivot foot, the right foot steps forward, the angle of the right hip joint tends to increase while the angle of the left hip joint tends to decrease, and vice versa. In the graph of the hip joint, we can see that the TD3 algorithm helps the robot reach that state, whereas the other two algorithms do not follow a regular rotation period. The rotation periods of the knee and ankle joints under the TD3 algorithm are also more stable than under DDPG or SAC.
The efficiency of the algorithm is shown even more clearly in Figure 8. There, we see that when using the TD3 algorithm the z coordinate of the waist oscillates around the position z = 0.0016 m (close to the desired value of -0.0015 m from our reward function), while the other two algorithms deviate further from the desired value. The up-and-down cycle of the waist is also more stable with TD3. The simulated gait of the biped robot over one step cycle, from right foot forward to the upright state to left foot forward and so on, is visualized in Figure 9.
During the research, we found that the training time needed for the robot to reach 10000 episodes with the DDPG and SAC algorithms is almost double that of TD3. When only the sparse reward function is used with the TD3 algorithm (+10 on reaching the destination, -10 on falling), the robot still cannot go all the way within the same 10000 episodes.

Conclusion
In this study, a biped robot model with the shape of human legs is built to evaluate the algorithm; each leg of the robot has 3 degrees of freedom that can rotate within the joint limits. The robot was trained with the TD3 algorithm and the training process was simulated in Gazebo/ROS; the algorithm produces the desired rotation angle of each joint at each timestep after receiving information (the state) from the environment, and a PID controller rotates the joints to those desired angles. In this research, a reward function is proposed that helps the robot learn faster and achieve higher efficiency. Unlike using only a sparse reward function, combining a dense reward function with a sparse reward function significantly shortens the robot's training time. This reward function also helps the robot follow a gait trajectory closer to that of humans. We have also compared TD3 with other RL algorithms (DDPG, SAC) when applying them with ROS to the biped robot in the simulation environment. With the same number of timesteps during training, the robot using the TD3 algorithm has a more human-like gait than with the DDPG and SAC algorithms: the two joints corresponding to the two legs of the robot have similar curves in both amplitude and period (Figure 6), and the robot's waist position oscillates closer to the desired value in our reward function (desired waist position = -0.0015 m; average waist position with TD3, DDPG and SAC: avg_z(TD3) = 0.0016 m, avg_z(DDPG) = 0.01 m, avg_z(SAC) = 0.0393 m). The training time of the TD3 algorithm is only half that of the other algorithms.
With these results, in the future we will try to apply RL algorithms and ROS to a real robot so that it can walk in a real environment, develop the approach further so that the biped robot can walk on more complex terrain, and improve the robot so that it can move in 3D space.