Policy generalization for a model-based reinforcement learning algorithm with large state and action spaces

I am using a simulation-based reinforcement learning approach for autonomous flight.

In this project, I used a simulator to collect training data (state, action, final state) so that a Locally Weighted Linear Regression (LWLR) algorithm could learn the MODEL.
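As an illustration only, here is a minimal sketch of how such a model could be learned with LWLR from logged (state, action, next state) triples; this is not the author's code, and the function name, the bandwidth tau and the data layout are assumptions.

import numpy as np

# Hypothetical sketch: predict the next state for one (state+action) query vector
# using Locally Weighted Linear Regression over simulator logs.
def lwlr_predict(query_sa, X_sa, Y_next, tau=0.5):
    # X_sa  : (n, d) array of concatenated [state, action] training inputs
    # Y_next: (n, k) array of resulting next states
    # tau   : kernel bandwidth controlling how local the regression is
    n = X_sa.shape[0]

    # Gaussian weights: nearby samples influence the local fit more
    d2 = np.sum((X_sa - query_sa) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * tau ** 2))

    # Weighted least squares with a bias column: theta = (X'WX)^-1 X'W Y
    Xb = np.hstack([X_sa, np.ones((n, 1))])
    W = np.diag(w)
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ (Xb.T @ W @ Y_next)

    qb = np.append(query_sa, 1.0)
    return qb @ theta  # predicted next state vector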

The STATE is determined by the vector [Pitch, Yaw, Roll, Acceleration], which describes the position of the drone in space. When passed to the POLICY, it has one additional component: [WantedTrajectory].

The ACTION is also determined by a vector: [PowerOfMotor1, PowerOfMotor2, PowerOfMotor3, PowerOfMotor4].

The REWARD is calculated from how accurately the desired trajectory is followed: given the initial spatial state, the desired trajectory and the final spatial state, the closer the actually followed trajectory is to the desired one, the less negative the reward.
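As a concrete illustration of these definitions (a sketch with assumed names and values, not the author's data structures), the state, action and reward could be represented as follows; the mean squared deviation is an assumption, since the article does not specify the exact distance measure.

import numpy as np

# Assumed concrete representations of the quantities described above
state  = np.array([0.1, 0.0, -0.05, 9.8])   # [Pitch, Yaw, Roll, Acceleration]
action = np.array([0.6, 0.6, 0.55, 0.65])   # [PowerOfMotor1 ... PowerOfMotor4]

def reward(actual_trajectory, wanted_trajectory):
    # Less negative the closer the followed trajectory is to the desired one;
    # 0 is the best achievable reward.
    actual = np.asarray(actual_trajectory)
    wanted = np.asarray(wanted_trajectory)
    return -float(np.mean(np.sum((actual - wanted) ** 2, axis=1)))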

The policy iteration algorithm is the following (a runnable sketch is given after the loop):

start from a state S0

loop
         1) select the best action according to the Policy
         2) use LWLR to find the ending state
         3) calculate reward
         4) update generalized V function
endloop;
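Below is a minimal sketch of this loop, assuming a discretized set of candidate motor settings, the lwlr_predict helper sketched earlier, one desired state per step of the wanted trajectory, and a linear approximation of the generalized V function updated TD(0)-style. All names and the learning rate / discount values are assumptions, not the author's implementation.

import numpy as np

def features(state):
    # simple linear features with a bias term for the generalized V function
    return np.append(np.asarray(state, dtype=float), 1.0)

def run_episode(s0, wanted_points, candidate_actions, X_sa, Y_next,
                theta_v, alpha=0.01, gamma=0.95):
    s = np.asarray(s0, dtype=float)
    for target in wanted_points:
        # 1) select the best action according to the policy: the candidate whose
        #    LWLR-predicted ending state has the highest estimated value
        def value_of(a):
            s_next = lwlr_predict(np.concatenate([s, a]), X_sa, Y_next)
            return features(s_next) @ theta_v

        a = max(candidate_actions, key=value_of)

        # 2) use LWLR to find the ending state
        s_next = lwlr_predict(np.concatenate([s, a]), X_sa, Y_next)

        # 3) calculate reward: less negative when closer to the desired point
        r = -float(np.sum((s_next - np.asarray(target)) ** 2))

        # 4) update the generalized V function (TD(0) update on the linear weights)
        td_error = r + gamma * (features(s_next) @ theta_v) - (features(s) @ theta_v)
        theta_v = theta_v + alpha * td_error * features(s)

        s = s_next
    return theta_v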

Thus, the action taken also depends on the desired trajectory (selected by the user); the agent autonomously selects the power of the 4 motors (trying to follow the desired trajectory and obtain a larger, i.e. less negative, reward), and the policy is dynamic, because it depends on the continually updated value function.

The only problem is choosing the POLICY, defined as follows (with S = [Pitch, Yaw, Roll, Acceleration, WantedTrajectory]):

π(S) = argmax_a ( V( LWLR(S,a) ) )
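Since the 4 motor powers form a continuous action space, this argmax needs a concrete search strategy; one hedged option, not stated in the original, is to score a coarse grid of candidate motor settings with the learned model and value function, reusing the lwlr_predict and features helpers sketched above.

import numpy as np

# Assumed candidate grid: 5 power levels per motor, scored by V(LWLR(S, a))
levels = np.linspace(0.0, 1.0, 5)
candidate_actions = [np.array([p1, p2, p3, p4])
                     for p1 in levels for p2 in levels
                     for p3 in levels for p4 in levels]

def policy(state, candidate_actions, X_sa, Y_next, theta_v):
    # pi(S) = argmax_a V(LWLR(S, a)), approximated over the candidate set
    def value_of(a):
        s_next = lwlr_predict(np.concatenate([state, a]), X_sa, Y_next)
        return features(s_next) @ theta_v
    return max(candidate_actions, key=value_of)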


POLICY VALUE?
