Hello everyone,
I followed the TensorFlow Agents tutorial and the multi-armed bandit tutorial, and now I'm trying to make one of the agents already implemented in the examples work on my own environment. My environment consists of 5 actions and 5 observations; applying action i results in state i. Taking an action also sends that action number to a different program via a socket, and the program's answer is interpreted as the reward. My environment seems to be working: I used the little test script below to exercise the observe and action functions. I know that's not a real proof, but it shows the environment is at least working.
What I'm still missing is the part that maps observations to actions, i.e. the agent with its policy. I followed the structure of the examples, but every agent I tried on my environment failed with a different error. I seem to be applying them to my environment incorrectly, but I can't figure out what I'm doing wrong.
Shouldn't I be able to apply one of these end-to-end agents from the examples as described? I searched all the TensorFlow tutorials and documentation but couldn't find an answer. My environment should be simple enough, so I seem to be missing some essential step.
The Errors for each agent:
Greedy:

    Input 0 of layer "dense_3" is incompatible with the layer: expected min_ndim=2, found ndim=1.
    Full shape received: (50,)
    Call arguments received:
      • observation=tf.Tensor(shape=(), dtype=int32)
      • step_type=tf.Tensor(shape=(), dtype=int32)
      • network_state=()
      • training=False

LinUCB:

    ValueError: Global observation shape is expected to be [None, 1]. Got [].

LinThompson:

    lib/python3.8/site-packages/tf_agents/bandits/policies/linear_bandit_policy.py", line 242, in _distribution
        raise ValueError(
    ValueError: Global observation shape is expected to be [None, 1]. Got [].

Exp3:

    lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 7107, in raise_from_not_ok_status
        raise core._status_to_exception(e) from None  # pylint: disable=protected-access
    tensorflow.python.framework.errors_impl.InvalidArgumentError: cannot compute Mul as input #1(zero-based) was expected to be a int32 tensor but is a float tensor [Op:Mul]
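Reading the LinUCB/LinThompson messages, my guess is that those agents expect the (global) observation to be a length-1 vector per step, i.e. [None, 1] once the batch dimension is added, and the Exp3 error makes me suspect the int32 observation dtype is clashing with a float tensor somewhere. So maybe the observation spec would have to look more like the sketch below? That's only my reading of the error text, I haven't found it confirmed anywhere in the tutorials:

    import numpy as np
    from tf_agents.specs import tensor_spec

    # Pure guess based on "Global observation shape is expected to be [None, 1]":
    # a length-1 float vector per step instead of a scalar int.
    observation_spec = tensor_spec.BoundedTensorSpec(
        shape=(1,), dtype=np.float32, minimum=0, maximum=4, name='observation')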
The environment:
    import sys

    import numpy as np
    import tensorflow as tf

    from tf_agents.bandits.environments import bandit_py_environment
    from tf_agents.specs import tensor_spec

    nest = tf.nest

    # https://www.tensorflow.org/agents/tutorials/2_environments_tutorial
    # Statemachine environment
    #
    # Actions:
    #   n actions: every state of the statemachine represents one bandit with one action.
    #   For now it is 5 states.
    #
    # Observations:
    #   one of the 5 states
    class AFLEnvironment(bandit_py_environment.BanditPyEnvironment):

        def __init__(self):
            action_spec = tensor_spec.BoundedTensorSpec(
                shape=(), dtype=np.int32, minimum=0, maximum=4,
                name='action')  # actions: 0,1,2,3,4 for 5 states.
            observation_spec = tensor_spec.BoundedTensorSpec(
                shape=(), dtype=np.int32, minimum=0, maximum=4,
                name='observation')  # 5 possible states
            self._state = tf.constant(0)
            super(AFLEnvironment, self).__init__(observation_spec, action_spec)

        def _observe(self):
            self._observation = self._state
            return self._observation

        # Implementation of taking the action: send the action number over the
        # socket and map the fuzzer's answer to a reward.
        def _apply_action(self, action):
            sock = self.__connectToSocket()
            # answer: NO_FAULT = 0, FSRV_RUN_TMOUT = 1, FSRV_RUN_CRASH = 2, FSRV_RUN_ERROR = 3
            answer = self.__fuzz(action, sock)
            if answer == "0":
                reward = 0.0
            elif answer == "1":
                reward = 1.0
            elif answer == "2":
                reward = 1.0
            elif answer == "3":
                reward = 1.0
            else:
                print("Error in return value from fuzzing: %s" % answer)
                sys.exit(1)
            self._state = tf.constant(action)
            print("Step ended, reward is: %s" % reward)
            return reward
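One thing I'm unsure about: the environments and bandits tutorials wrap the Python environment in a TFPyEnvironment before handing it to drivers and agents, which is what adds the batch dimension to the tensors. I don't know whether skipping that wrapper is my missing step; this is roughly what the tutorial does:

    from tf_agents.environments import tf_py_environment

    # Wrap the pure-Python bandit environment so reset()/step() return
    # batched TF tensors (batch size 1), as in the TF-Agents tutorials.
    env = AFLEnvironment()
    tf_env = tf_py_environment.TFPyEnvironment(env)

    first_step = tf_env.reset()
    print(first_step.observation.shape)  # leading batch dimension of 1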
The different agents:
    import os

    import tensorflow as tf
    from absl import app, flags

    from tf_agents.bandits.agents import exp3_agent
    from tf_agents.bandits.agents import lin_ucb_agent
    from tf_agents.bandits.agents import linear_thompson_sampling_agent as lin_ts_agent
    from tf_agents.bandits.agents import neural_epsilon_greedy_agent as eps_greedy_agent
    from tf_agents.drivers import dynamic_step_driver
    from tf_agents.metrics import py_metrics
    from tf_agents.networks import q_network

    nest = tf.nest

    flags.DEFINE_string('root_dir', os.getenv('TEST_UNDECLARED_OUTPUTS_DIR'),
                        'Root directory for writing logs/summaries/checkpoints.')
    flags.DEFINE_enum(
        'agent', 'EXP3', ['GREEDY', 'LINUCB', 'LINTHOMPSON', 'EXP3'],
        'Which agent to use. Possible values are `GREEDY`, `LINUCB`, `LINTHOMPSON` '
        'and `EXP3`. Default is GREEDY.')
    FLAGS = flags.FLAGS

    # From example, change here for training parameters.
    BATCH_SIZE = 8
    TRAINING_LOOPS = 200
    STEPS_PER_LOOP = 2
    CONTEXT_DIM = 15

    # LinUCB agent constants.
    AGENT_ALPHA = 10.0

    # Epsilon-greedy constants.
    EPSILON = 0.05
    LAYERS = (50, 50, 50)
    LR = 0.005


    def main(unused_argv):
        tf.compat.v1.enable_v2_behavior()  # The trainer only runs with V2 enabled.

        with tf.device('/CPU:0'):  # due to b/128333994
            env = AFLEnvironment()

            # 'GREEDY', 'LINUCB', 'LINTHOMPSON', 'EXP3'
            if FLAGS.agent == 'GREEDY':
                network = q_network.QNetwork(
                    input_tensor_spec=env.time_step_spec().observation,
                    action_spec=env.action_spec(),
                    fc_layer_params=LAYERS)
                agent = eps_greedy_agent.NeuralEpsilonGreedyAgent(
                    time_step_spec=env.time_step_spec(),
                    action_spec=env.action_spec(),
                    reward_network=network,
                    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=LR),
                    epsilon=EPSILON)
            elif FLAGS.agent == 'LINUCB':
                agent = lin_ucb_agent.LinearUCBAgent(
                    time_step_spec=env.time_step_spec(),
                    action_spec=env.action_spec(),
                    alpha=AGENT_ALPHA,
                    gamma=0.95,  # sometimes omitted in the examples
                    emit_log_probability=False,
                    dtype=tf.float32)
            elif FLAGS.agent == 'LINTHOMPSON':
                agent = lin_ts_agent.LinearThompsonSamplingAgent(
                    time_step_spec=env.time_step_spec(),
                    action_spec=env.action_spec())
            elif FLAGS.agent == 'EXP3':
                agent = exp3_agent.Exp3Agent(
                    time_step_spec=env.time_step_spec(),
                    action_spec=env.action_spec(),
                    learning_rate=1)

            replay_buffer = []
            metric = py_metrics.AverageReturnMetric()
            observers = [replay_buffer.append, metric]

            driver = dynamic_step_driver.DynamicStepDriver(
                env=env,
                policy=agent.collect_policy,
                observers=observers,
                num_steps=200)

            initial_time_step = env.reset()
            print("initial_time_step")
            print(initial_time_step)
            final_time_step, _ = driver.run(initial_time_step)

            print('Replay Buffer:')
            for traj in replay_buffer:
                print(traj)


    if __name__ == '__main__':
        app.run(main)
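For what it's worth, this is the training loop I was planning to add once collection works. It follows the replay-buffer pattern from the bandits tutorial (driver run, gather_all, agent.train), assuming the environment is wrapped as tf_env like above and agent is one of the agents from the script; I haven't verified that this matches what the example trainer does internally:

    from tf_agents.drivers import dynamic_step_driver
    from tf_agents.replay_buffers import tf_uniform_replay_buffer

    # Assumes `agent`, `tf_env`, STEPS_PER_LOOP and TRAINING_LOOPS from above.
    replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
        data_spec=agent.policy.trajectory_spec,
        batch_size=tf_env.batch_size,
        max_length=STEPS_PER_LOOP)

    driver = dynamic_step_driver.DynamicStepDriver(
        env=tf_env,
        policy=agent.collect_policy,
        observers=[replay_buffer.add_batch],
        num_steps=STEPS_PER_LOOP * tf_env.batch_size)

    for _ in range(TRAINING_LOOPS):
        driver.run()                                         # collect a few steps
        loss_info = agent.train(replay_buffer.gather_all())  # train on them
        replay_buffer.clear()
        print('loss:', loss_info.loss.numpy())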
Test script:
    env = AFLEnvironment()

    observation = env.reset().observation
    print("observation: %d" % observation)

    action = 1  #@param
    print("action: %d" % action)

    reward = env.step(action).reward
    print("reward: %f" % reward)
    print("observation: %d" % env._observe())
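One more check I know of from the environments tutorial is utils.validate_py_environment, which steps the environment with random actions and verifies that observations and rewards match the declared specs. Maybe that would tell me whether the problem is in the environment itself or in how I wire up the agents (note it would actually drive the fuzzer over the socket):

    from tf_agents.environments import utils

    # From the environments tutorial: run a few episodes of random actions
    # and check the results against the environment's specs.
    env = AFLEnvironment()
    utils.validate_py_environment(env, episodes=5)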