Identifying Latent Intentions via Inverse Reinforcement Learning in Repeated Public Good Games
Carina I. Hausladen, Marcel H. Schubert, Christoph Engel
Max Planck Institute for Research on Collective Goods
Social Dilemma Games
Contributions start positive but gradually decline over time.
[Strategy-method table: for each average contribution of the others (0–20), the player states their own contribution (0–20).]
Meta-Analysis: Thöni et al. (2018), building on Fischbacher et al. (2001)
Conditional Cooperation: 61.3 %
Freeriding: 19.2 %
Hump-Shaped: 10.4 %
Theory Driven
Data Driven
Theory first: Use theory to find groups
Model first: Specify a model, then find groups
Data first: Let the data decide groups, then theorize
Theory Driven
Data Driven
Bardsley (2006) | Tremble terms: 18.5 %
Houser (2004) | Confused: 24 %
Fallucchi (2021) | Others: 22 % – 32 %
Fallucchi (2019) | Various: 16.5 %
→ random / unexplained behavior
Step 1
Clustering
→ uncover patterns
Step 2
Inverse Reinforcement Learning
→ interpret patterns
[Figure: two-dimensional time series per player across rounds, ordered from highest to lowest mean contribution.]
Data Driven
Identifies empirical regularities
Moves from assumed types to discovered response patterns.
Our approach:
Unsupervised modeling to uncover structure
Then apply theory to interpret the clusters.
Bottom line:
→ Clustering is not a substitute for theory — it is a way to structure behavioral complexity before theorizing.
"For empiricists, these theory- and data-driven modes of analysis have always coexisted. [...] Machine learning provides a powerful tool to hear, more clearly than ever, what the data have to say. These approaches need not be in conflict."
Minimum Wages and Employment
Card & Krueger (2000)
[Figure: contribution (0–1) on the y-axis over 20 rounds on the x-axis.]
DTW finds the best match between similar patterns.
It does not force strict alignment along the time (x) axis.
Local Similarity Measure
Global Similarity Measure
Results depend fundamentally on how similarity is defined
The way we measure differences between time series is not neutral — it encodes our theoretical assumptions about what counts as meaningful variation.
The criteria for similarity embed prior ideas about learning, stability, or change.
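To make this concrete, here is a minimal sketch (hypothetical contribution paths, plain NumPy rather than any specific DTW library) contrasting strict point-wise (Euclidean) distance with Dynamic Time Warping: the same qualitative pattern, shifted by a few rounds, stays close under DTW but not under Euclidean distance.

```python
import numpy as np

def euclidean(a, b):
    """Point-wise distance: round t in one series is compared to round t in the other."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def dtw(a, b):
    """Classic dynamic-programming DTW: allows local stretching of the time axis."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# Two hypothetical players with the same shape (cooperate, then drop to zero),
# but the drop happens two rounds apart.
p1 = np.array([20, 20, 20, 20, 5, 0, 0, 0, 0, 0], dtype=float)
p2 = np.array([20, 20, 20, 20, 20, 20, 5, 0, 0, 0], dtype=float)

print("Euclidean:", euclidean(p1, p2))  # large: penalises the shifted drop point-wise
print("DTW:      ", dtw(p1, p2))        # ~0: the warped alignment matches the shapes
```

In a clustering pipeline, such pairwise distances between all players would be the input to the partitioning step.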
[Figure: cluster solutions for each combination of algorithm (agglomerative, GMM, k-means) and distance measure (DTW, Euclidean); each panel shows players (rows, ordered from highest to lowest mean contribution) by round (columns).]
Results depend fundamentally on how similarity is defined.
A canonical finding in behavioral economics is the downward trend in PGGs.
We focus on patterns, not specific timepoints.
→ Not “people switch in round 5,”
→ But: a tipping point emerges after which contributions remain low.
This holds despite idiosyncratic group differences and varying game lengths.
Theory Driven
Finite Mixture Model
Bayesian Model
C-Lasso
Data Driven
Manhattan + Hierarchical Clustering
DTW + Spectral Clustering
Theory Driven
Finite Mixture Model
C-Lasso
Data Driven
DTW Distance
Manhattan Distance
DTW: local similarity measure
Clustering → Reinforcement Learning
Q-learning was recently applied to the study of algorithmic pricing: Artificial Intelligence, Algorithmic Pricing, and Collusion (Calvano et al. 2020).
The key challenge is to define a reward function.
Inverse RL recovers reward functions from data.
Hierarchical Inverse Q-Learning
Recap: 18.5 %, 24 %, 32 %, 16.5 % of behavior classified as random / unexplained in prior studies.
Hierarchical Inverse Q-Learning
action
state
\( Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right) \)
Expected best possible outcome from the next state
Compare to now
reward
Hierarchical Inverse Q-Learning
action
state
Agents learn to act optimally over time without solving the full Bellman equation upfront.
Very intuitive for modeling boundedly rational players in repeated games.
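As a concrete illustration, a minimal tabular sketch of the update above (the 21×21 state/action grid, parameter values, and the sample transition are assumptions, not the paper's exact encoding):

```python
import numpy as np

n_states, n_actions = 21, 21   # assumed grid: others' contribution level x own contribution
alpha, gamma = 0.1, 0.95       # learning rate and discount factor

Q = np.zeros((n_states, n_actions))

def q_update(Q, s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()   # reward plus best expected value of the next state
    Q[s, a] += alpha * (td_target - Q[s, a])  # move the current estimate toward the target
    return Q

# One hypothetical transition: in state 10 the player contributes 5, earns 1.2, and ends in state 8.
Q = q_update(Q, s=10, a=5, r=1.2, s_next=8)
```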
Hierarchical Inverse Q-Learning
\( Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right) \)
reward
action
state
Estimate the reward function by maximizing the likelihood of observed actions and states.
unknown
Hierarchical Inverse Q-Learning
action
state
Instead of learning Q-values from rewards, we observe actions and states and try to infer what reward function the agent must be optimizing.
What objective function (preferences, incentives) would rationalize the observed behavior as approximately optimal?
IRL is a structural estimation method dressed in AI terms.
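A deliberately simplified sketch of this idea: a myopic softmax-choice model with a linear reward \( r(s,a) = \theta^\top \phi(s,a) \), estimated by maximum likelihood. The features, sample size, and simulated data below are hypothetical; this is not the hierarchical inverse Q-learning estimator itself, only the core "infer the reward from observed actions and states" step.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_states, n_actions, n_obs = 21, 21, 500

def features(s, a):
    """Hypothetical reward features: cost of contributing and distance to others' level."""
    return np.array([-a, -abs(a - s)], dtype=float)

# Pre-compute the feature table phi(s, a) for every state-action pair.
PHI = np.array([[features(s, a) for a in range(n_actions)] for s in range(n_states)])

def choice_probs(theta):
    """Softmax (logit) choice probabilities implied by r(s,a) = theta . phi(s,a)."""
    r = PHI @ theta
    p = np.exp(r - r.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

# Simulate observed play from a known theta, then try to recover it.
theta_true = np.array([0.1, 0.5])
states = rng.integers(0, n_states, size=n_obs)
P = choice_probs(theta_true)
actions = np.array([rng.choice(n_actions, p=P[s]) for s in states])

def neg_log_likelihood(theta):
    return -np.log(choice_probs(theta)[states, actions]).sum()

theta_hat = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS").x
print("true:", theta_true, "estimated:", theta_hat)
```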
Hierarchical Inverse Q-Learning
[Diagram: the latent intention / reward \( r_{t-1} \rightarrow r_t \) evolves via a discrete transition matrix \( \Lambda \); in each round the current intention and state \( s \) generate the action \( a_t \) through the policy \( P \), and the game moves to state \( s_{t+1} \).]
Hierarchical Inverse Q-Learning
\( P(r_t \mid s_{0:t}, a_{0:t}) \)
action
state
\( r \)
Modeling the temporal dynamics of goal-directed behavior under bounded rationality and stochasticity.
\( r_{t-1} \xrightarrow{\;\Lambda\;} r_t \)
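The temporal structure can be made concrete with an HMM-style forward filter for \( P(r_t \mid s_{0:t}, a_{0:t}) \). Everything below (two intentions, the transition matrix, the per-intention softmax policies, the example trajectory) is a hypothetical toy meant to illustrate the filtering idea, not the paper's estimator.

```python
import numpy as np

n_intentions = 2
Lambda = np.array([[0.95, 0.05],   # assumed intention transition matrix (rows sum to 1)
                   [0.10, 0.90]])
pi0 = np.array([0.5, 0.5])         # prior over the initial intention

def policy(intention, s):
    """Hypothetical per-intention softmax policy over 21 contribution levels."""
    a = np.arange(21)
    score = -np.abs(a - s) if intention == 0 else -a.astype(float)  # 0: match others, 1: free-ride
    p = np.exp(score - score.max())
    return p / p.sum()

def forward_filter(states, actions):
    """Return P(r_t | s_{0:t}, a_{0:t}) for every round t."""
    belief = pi0.copy()
    posteriors = []
    for s, a in zip(states, actions):
        belief = belief @ Lambda                                   # predict: the intention may switch
        likelihood = np.array([policy(k, s)[a] for k in range(n_intentions)])
        belief = belief * likelihood                               # update with the observed action
        belief /= belief.sum()
        posteriors.append(belief.copy())
    return np.array(posteriors)

# A player who matches the others early on and then drops to zero:
states  = [12, 11, 10, 10, 9, 9, 8]
actions = [12, 11, 10,  0, 0, 0, 0]
print(forward_filter(states, actions))   # posterior mass shifts toward the free-riding intention
```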
Intentions | \( \Delta \) Test LL | \( \Delta \) BIC
---|---|---
1 → 2 | 0.6 | 75.2
2 → 3 | 0.4 | 88.6
3 → 4 | 0.2 | 101.5
4 → 5 | 0.2 | 114.4
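For reference, the standard definition underlying the \( \Delta \) BIC column (a general formula, not specific to this estimator): \( \mathrm{BIC} = k \ln n - 2 \ln \hat{L} \), where \( k \) is the number of free parameters, \( n \) the number of observations, and \( \hat{L} \) the maximized likelihood; a lower BIC indicates a better fit–complexity trade-off.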
Choice of two intentions aligns with the fundamental RL principle of exploration vs. exploitation.
Unconditional Cooperators
Consistent Cooperators
Threshold Switchers
Freeriders
Volatile Explorers
carinah@ethz.ch
slides.com/carinah
We find that partitioning the data using spectral clustering with DTW distance yields the cleanest and least noisy clusters.
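A minimal, self-contained sketch of that pipeline (random toy series, five clusters to match the types named above; not the paper's actual code or data): pairwise DTW distances, converted to a Gaussian affinity, then scikit-learn's spectral clustering on the precomputed affinity.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def dtw(a, b):
    """Classic dynamic-programming DTW distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

rng = np.random.default_rng(0)
series = rng.integers(0, 21, size=(60, 20)).astype(float)  # 60 toy players, 20 rounds each

# Pairwise DTW distance matrix.
n = len(series)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = dtw(series[i], series[j])

# A Gaussian kernel turns distances into an affinity matrix for spectral clustering.
sigma = np.median(D[D > 0])
A = np.exp(-(D ** 2) / (2 * sigma ** 2))
labels = SpectralClustering(n_clusters=5, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(np.bincount(labels))  # cluster sizes
```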
Individuals display higher inertia when others increase (↑) their contributions than when others decrease (↓) them.
Adjustments exhibit greater variability after others decrease (↓) than after others increase (↑).
[Figure (repeated): cluster solutions by algorithm (agglomerative, GMM, k-means) and distance measure (DTW, Euclidean); players ordered from highest to lowest mean contribution.]
Theory-Based Approaches
Strategy-method types often align with actual gameplay data (e.g., Muller 2008).
Strong assumptions on decision-making.
Data-Driven Approaches
Theory-driven classifications may not accurately capture the observed heterogeneity.
Detect behavioral patterns without such assumptions.
Use RL to derive a policy that maximizes agents' intrinsic motivation for sustainable behavior.
Simulate an economy to study emergent dynamics.
Existing simulations lack grounding in behavioral types (e.g., research on cooperation).
Incorporating behavioral types could improve simulations.
Tailored policies—potentially even type-specific—may lead to better outcomes.
Hierarchical Inverse Q-Learning
action
state
\( Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right) \)
Expected best possible outcome from the next state
Compare to now
reward
Action
State
Strategy Table Data
\( 21^{21} \approx 5.8 \times 10^{27} \)
\( (21 \times 21)^{10\ \text{rounds}} \approx 2.8 \times 10^{26} \)
Explanations
Social preferences
Confusion
Unexplained behavior due to strict theoretical assumptions and limited techniques for interpretation.
Reliance on perfect point-wise alignment
Two comparative observations
Intentions integrate actions and states.
Theory Driven
Data Driven
Bardsley (2006) | Tremble terms (random deviations that decline with experience): 18.5 %
Houser (2004) | Confused (high randomness and decision errors): 24 %
Fallucchi (2021) | Others (neither reciprocation nor strategy): 22 % – 32 %
Fallucchi (2019) | Various (unpredictable contributions): 16.5 %