A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It provides a formalism for modeling the environment in reinforcement learning. An MDP is defined by:

- **States ((S))**: A finite set of states in the environment.
- **Actions ((A))**: A finite set of actions available to the agent.
- **Transition Function ((P))**: The probability ( P(s', r | s, a) ) of moving from state ( s ) to state ( s' ) and receiving reward ( r ) after taking action ( a ).
- **Reward Function ((R))**: The immediate reward received after transitioning from state ( s ) to state ( s' ) due to action ( a ).
- **Policy ((\pi))**: A strategy that specifies the probability ( \pi(a|s) ) of taking action ( a ) in each state ( s ).

The goal in an MDP is to find a policy that maximizes the expected sum of rewards over time, often discounted by a factor ( \gamma ) (discount factor) to account for the preference for immediate rewards over future rewards.
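Written out in the notation above, with ( r_{t+1} ) denoting the reward received at step ( t+1 ), the objective is to maximize the expected discounted return:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \le \gamma < 1
```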

### Key Concepts:

- **Value Function**: Measures the expected return (sum of rewards) from a given state under a specific policy.
- **State Value Function ((V(s)))**: Expected return starting from state ( s ) and following policy ( \pi ).
- **Action Value Function ((Q(s, a)))**: Expected return starting from state ( s ), taking action ( a ), and then following policy ( \pi ).
- **Bellman Equations**: Fundamental recursive relationships that express the value of a state or state-action pair in terms of the values of successor states.
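For reference, the Bellman expectation equations for a policy ( \pi ), in the same notation, are:

```latex
V^{\pi}(s)   = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a)\,\big[ R(s,a,s') + \gamma V^{\pi}(s') \big]
Q^{\pi}(s,a) = \sum_{s'} P(s'|s,a)\,\big[ R(s,a,s') + \gamma \sum_{a'} \pi(a'|s')\, Q^{\pi}(s',a') \big]
```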

### Applications:

MDPs are widely used in various fields, including robotics, economics, and artificial intelligence, for solving problems where decision-making is essential under uncertainty and dynamic conditions.

This is a probability distribution over the outcomes we can get.

There is uncertainty all around us.

Overall, this is about uncertainty and how it shapes our decisions.

That was about decisions.

In a Markov Decision Process (MDP), decisions are indeed based on uncertainty. The decision-making process in an MDP involves dealing with randomness in both transitions between states and the rewards received.

This is the reward we get based on that randomness.

We have a set of actions available.

Here we want to maximize this reward.
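As a small illustration of this randomness (a sketch with made-up numbers, not part of the code below), sampling the next state from a transition distribution looks like this:

```python
import random

# Hypothetical transition distribution for one (state, action) pair:
# stay in state 0 with probability 0.8, move to state 1 with probability 0.2.
next_states = [0, 1]
probs = [0.8, 0.2]

random.seed(0)  # fixed seed so the draw is reproducible
s_prime = random.choices(next_states, weights=probs, k=1)[0]
print("sampled next state:", s_prime)
```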

I'll walk you through the steps of the provided MDP implementation, explaining each part in detail.

### 1. Initialization

First, the MDP class is initialized with the states, actions, transition probabilities, rewards, and a discount factor (`gamma`).

```
import numpy as np

class MDP:
    def __init__(self, states, actions, transition_prob, rewards, gamma=0.9):
        self.states = states
        self.actions = actions
        self.transition_prob = transition_prob
        self.rewards = rewards
        self.gamma = gamma
        self.V = np.zeros(len(states))
```

- `states`: A list of states.
- `actions`: A list of actions.
- `transition_prob`: A 3D list where `transition_prob[s][a][s_prime]` is the probability of transitioning from state `s` to state `s_prime` given action `a`.
- `rewards`: A 3D list where `rewards[s][a][s_prime]` is the reward for transitioning from state `s` to state `s_prime` given action `a`.
- `gamma`: The discount factor for future rewards.
- `V`: An array to store the value function for each state.

### 2. Value Iteration

The `value_iteration` method updates the value function `V` until it converges to within a specified threshold (`epsilon`).

```
    def value_iteration(self, epsilon=0.01):
        while True:
            delta = 0
            for s in range(len(self.states)):
                v = self.V[s]
                self.V[s] = max(
                    sum(self.transition_prob[s][a][s_prime] *
                        (self.rewards[s][a][s_prime] + self.gamma * self.V[s_prime])
                        for s_prime in range(len(self.states)))
                    for a in range(len(self.actions)))
                delta = max(delta, abs(v - self.V[s]))
            if delta < epsilon:
                break
```

- `delta`: Tracks the maximum change in the value function over all states in each iteration.
- For each state `s`, calculate the value for all possible actions and update `V[s]` with the maximum value.
- The inner loop computes the expected value of each action `a` by summing, over successor states, the transition probability times the immediate reward plus the discounted value of the next state.
- The process continues until the change in value (`delta`) is less than `epsilon`.
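In equation form, each update in the loop is the Bellman optimality backup:

```latex
V(s) \leftarrow \max_{a} \sum_{s'} P(s'|s,a)\,\big[ R(s,a,s') + \gamma V(s') \big]
```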

### 3. Extracting the Policy

The `get_policy` method derives the optimal policy from the computed value function `V`.

```
    def get_policy(self):
        policy = np.zeros(len(self.states), dtype=int)
        for s in range(len(self.states)):
            policy[s] = np.argmax([
                sum(self.transition_prob[s][a][s_prime] *
                    (self.rewards[s][a][s_prime] + self.gamma * self.V[s_prime])
                    for s_prime in range(len(self.states)))
                for a in range(len(self.actions))])
        return policy
```

- For each state `s`, find the action `a` that maximizes the expected value; `np.argmax` returns the index of the action with the highest value for each state.
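Equivalently, the extracted policy is the greedy policy with respect to `V`:

```latex
\pi(s) = \arg\max_{a} \sum_{s'} P(s'|s,a)\,\big[ R(s,a,s') + \gamma V(s') \big]
```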

### 4. Define the MDP Components

Define the states, actions, transition probabilities, and rewards.

```
states = [0, 1, 2]
actions = [0, 1]
transition_prob = [
    [[0.8, 0.2, 0.0], [0.0, 0.9, 0.1]],
    [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2]],
    [[0.7, 0.3, 0.0], [0.0, 0.6, 0.4]]
]
rewards = [
    [[10, 0, 0], [0, 5, 5]],
    [[0, 2, 0], [0, 5, 10]],
    [[1, 0, 0], [0, 4, 7]]
]
```

### 5. Initialize and Solve the MDP

Create an instance of the MDP class, perform value iteration, and extract the policy.

```
# Initialize MDP
mdp = MDP(states, actions, transition_prob, rewards)
# Perform value iteration
mdp.value_iteration()
# Extract policy
policy = mdp.get_policy()
print("Optimal Value Function:", mdp.V)
print("Optimal Policy:", policy)
```

### Execution

When you run the code, it performs value iteration to find the optimal value function and then derives the optimal policy. The `Optimal Value Function` and `Optimal Policy` are printed as output.

```
Optimal Value Function: [Expected values for each state]
Optimal Policy: [Optimal actions for each state]
```
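For convenience, the pieces above can be assembled into a single runnable script (the same code as in the sections above, joined together):

```python
import numpy as np

class MDP:
    def __init__(self, states, actions, transition_prob, rewards, gamma=0.9):
        self.states = states
        self.actions = actions
        self.transition_prob = transition_prob
        self.rewards = rewards
        self.gamma = gamma
        self.V = np.zeros(len(states))

    def value_iteration(self, epsilon=0.01):
        # Repeatedly apply the Bellman optimality backup until convergence.
        while True:
            delta = 0
            for s in range(len(self.states)):
                v = self.V[s]
                self.V[s] = max(
                    sum(self.transition_prob[s][a][s_prime] *
                        (self.rewards[s][a][s_prime] + self.gamma * self.V[s_prime])
                        for s_prime in range(len(self.states)))
                    for a in range(len(self.actions)))
                delta = max(delta, abs(v - self.V[s]))
            if delta < epsilon:
                break

    def get_policy(self):
        # Greedy policy with respect to the converged value function.
        policy = np.zeros(len(self.states), dtype=int)
        for s in range(len(self.states)):
            policy[s] = np.argmax([
                sum(self.transition_prob[s][a][s_prime] *
                    (self.rewards[s][a][s_prime] + self.gamma * self.V[s_prime])
                    for s_prime in range(len(self.states)))
                for a in range(len(self.actions))])
        return policy

states = [0, 1, 2]
actions = [0, 1]
transition_prob = [
    [[0.8, 0.2, 0.0], [0.0, 0.9, 0.1]],
    [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2]],
    [[0.7, 0.3, 0.0], [0.0, 0.6, 0.4]]
]
rewards = [
    [[10, 0, 0], [0, 5, 5]],
    [[0, 2, 0], [0, 5, 10]],
    [[1, 0, 0], [0, 4, 7]]
]

mdp = MDP(states, actions, transition_prob, rewards)
mdp.value_iteration()
policy = mdp.get_policy()
print("Optimal Value Function:", mdp.V)
print("Optimal Policy:", policy)
```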

This step-by-step approach allows you to understand how the MDP is solved.

This is how we get to that solution.

Through that policy we can get that solution.

We just need to define these policies.

It determines where to go and what decision to make at each moment.

Value is the expected utility.

The value stays fixed at each moment, but the utility varies, because the value is its expectation.

The second part is for when the present matters to us.

That is, it decides whether the future is better or the present.

For each policy we have some value.
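This connects to policy evaluation: given a fixed policy, we can compute its value function directly. A minimal sketch (not part of the original code) that evaluates the always-take-action-0 policy on the same toy MDP by iterating the Bellman expectation backup:

```python
import numpy as np

transition_prob = [
    [[0.8, 0.2, 0.0], [0.0, 0.9, 0.1]],
    [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2]],
    [[0.7, 0.3, 0.0], [0.0, 0.6, 0.4]]
]
rewards = [
    [[10, 0, 0], [0, 5, 5]],
    [[0, 2, 0], [0, 5, 10]],
    [[1, 0, 0], [0, 4, 7]]
]
gamma = 0.9
policy = [0, 0, 0]  # one fixed action per state

# Iterative policy evaluation: apply the Bellman expectation backup
# for the fixed policy until the values stop changing.
V = np.zeros(3)
for _ in range(10000):
    V_new = np.array([
        sum(transition_prob[s][policy[s]][sp] *
            (rewards[s][policy[s]][sp] + gamma * V[sp])
            for sp in range(3))
        for s in range(3)])
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new
print("V^pi:", V)
```

The resulting `V` is the expected discounted return of following this particular policy from each state, which is exactly the "value of a policy" the note refers to.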