We talked about the optimal policy.
A policy is a program that gives you a list of actions (a todo list). It uses information about your company and the world in which it operates (the database).
Mathematically speaking, a policy is a function from the current state to an action, which the agent then performs. If the agent needs to operate autonomously for a while (without querying the policy), the policy may instead return a list of actions, on the assumption that the agent will carry them out sequentially.
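As a minimal sketch of this idea, a policy can be a plain function from state to a short action list. The state fields and the rule below are purely illustrative, not taken from any real system:

```python
# Illustrative policy: maps the current state to a short sequence of actions.
# The state keys ("inventory", "reorder_point") and the actions are made up.

def policy(state: dict) -> list[str]:
    """Map the current state to a todo list of actions.

    Returning a list lets the agent act autonomously for a while,
    executing the actions in order before asking the policy again.
    """
    if state["inventory"] < state["reorder_point"]:
        return ["place_restock_order", "notify_warehouse"]
    return ["continue_operations"]

print(policy({"inventory": 5, "reorder_point": 10}))
# → ['place_restock_order', 'notify_warehouse']
```

The single-action case is just the special case where the returned list has length one.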
An optimal policy is one that maximizes the expected sum of rewards over many executions. Since the state changes stochastically, some executions are more likely than others, so it is permissible for the policy to trade lower rewards on unlikely executions for higher rewards on likely ones.
We also note that the model of the state might change. When it does, the policy must be regenerated, and the new policy must start from the state the old policy has already reached. This makes it suboptimal to plan too far ahead, because the model on which such a plan relies may change before the plan completes.
A simple remedy is to shorten the time horizon (the duration of a single episode). Finding the "right" horizon is a separate problem; in our specific business, we consider planning one year ahead acceptable.
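The replanning loop described above can be sketched as receding-horizon planning. The planner below is a hypothetical stand-in (a simple drift model); the point is the control flow: plan up to the horizon, execute part of the plan, and regenerate from the reached state when the model changes.

```python
# Receding-horizon sketch. The "model" and its "drift" parameter are made
# up; the pattern is: replan from the state the old policy reached whenever
# the model changes, and never plan past the horizon.

HORIZON = 12  # e.g. 12 monthly steps ≈ 1 year of planning

def plan(model: dict, state: int, horizon: int) -> list[int]:
    # Stand-in planner: project the state forward by the model's drift.
    return [state + model["drift"] * t for t in range(1, horizon + 1)]

state = 0
model = {"drift": 1}
todo = plan(model, state, HORIZON)

# Execute a few steps of the old plan, then the model changes...
state = todo[2]            # state actually reached by the old policy
model = {"drift": 2}       # the world model was updated
todo = plan(model, state, HORIZON)  # regenerate from the reached state
print(todo[:3])
# → [5, 7, 9]
```

A shorter `HORIZON` makes each plan cheaper to discard, which is the trade-off behind choosing the one-year horizon.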