I am trying to develop an iterative Markov Decision Process (MDP) agent in Python with the following characteristics:
- observable state
- I handle a potential "unknown" state by reserving part of the state space for responses to request-type steps taken by the decision process (the state at t + 1 will contain the previous request id [or zero if the previous step was not a request] along with the resulting data vector); this space is zero-padded to a fixed length to keep the state aligned regardless of the request (the returned data length may vary) — see the encoding sketch after this list
- actions that may not always be available in all states
- The reward function may change over time.
- policy convergence must be incremental, with updates calculated only once per turn
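To make the state-padding and per-state action availability concrete, here is a minimal sketch of what I mean; `STATE_LEN`, `encode_state`, and `available_actions` are my own placeholder names, not taken from any library:

```python
import numpy as np

STATE_LEN = 16  # fixed state-vector length (placeholder value)

def encode_state(request_id, payload):
    """Encode the previous request id (0 if the last step was not a request)
    followed by its result vector, zero-padded to STATE_LEN so every state
    has the same shape regardless of the request's data length."""
    vec = np.zeros(STATE_LEN, dtype=float)
    vec[0] = request_id
    data = np.asarray(payload, dtype=float)[: STATE_LEN - 1]
    vec[1 : 1 + len(data)] = data
    return vec

def available_actions(state):
    """Actions can differ per state; this rule is only a stand-in."""
    return [0, 1] if state[0] == 0 else [0, 1, 2]
```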
Thus, the main idea is that at step T the MDP should act as well as it can using its current probabilistic model (and since the policy it derives is probabilistic, the chosen action may be stochastic), then receive the new input state at T + 1 together with the reward for the move made at T, and re-evaluate its model. Convergence does not have to be permanent, because rewards may be modulated or the set of available actions may change.
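A minimal tabular Q-learning sketch of the per-turn loop I am describing (Q-learning is just one standard way to get incremental, online updates; `state_key` is assumed to be something hashable such as `tuple(encode_state(...))`):

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # step size, discount, exploration rate
Q = defaultdict(float)                    # (state_key, action) -> value estimate

def choose_action(state_key, actions):
    """Stochastic (epsilon-greedy) choice from the current value estimates."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state_key, a)])

def update(state_key, action, reward, next_key, next_actions):
    """One incremental update per turn; nothing is precomputed offline,
    so a reward function that drifts over time is simply tracked."""
    best_next = max((Q[(next_key, a)] for a in next_actions), default=0.0)
    Q[(state_key, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state_key, action)])
```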
I would like to know whether there are any existing Python libraries (preferably cross-platform, since I will definitely be switching between Windoze and Linux) that can already do this, or can support it with a suitable setup, for example a base class I can derive from that lets me override the reward method with my own.
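The kind of override hook I have in mind looks roughly like this; the class names are hypothetical and do not refer to any existing library's API:

```python
class BaseMDPAgent:
    """Hypothetical base class; a real library would supply its own hook."""
    def reward(self, state, action, next_state):
        raise NotImplementedError  # subclasses plug in their own reward

class MyAgent(BaseMDPAgent):
    def reward(self, state, action, next_state):
        # domain-specific reward, free to change over time
        return 1.0 if next_state[0] == 0 else -0.1
```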
I find information about online learning in MDPs pretty scarce. Most uses of MDPs that I can find seem to focus on computing the entire policy as a preprocessing step.