Reinforcement learning with associative or discriminative generalization across states and actions: fMRI at 3 T and 7 T

Abstract The model‐free algorithms of “reinforcement learning” (RL) have gained clout across disciplines, but so too have model‐based alternatives. The present study emphasizes other dimensions of this model space in consideration of associative or discriminative generalization across states and actions. This “generalized reinforcement learning” (GRL) model, a frugal extension of RL, parsimoniously retains the single reward‐prediction error (RPE), but the scope of learning goes beyond the experienced state and action. Instead, the generalized RPE is efficiently relayed for bidirectional counterfactual updating of value estimates for other representations. Aided by structural information but as an implicit rather than explicit cognitive map, GRL provided the most precise account of human behavior and individual differences in a reversal‐learning task with hierarchical structure that encouraged inverse generalization across both states and actions. Reflecting inference that could be true, false (i.e., overgeneralization), or absent (i.e., undergeneralization), state generalization distinguished those who learned well more so than action generalization. With high‐resolution high‐field fMRI targeting the dopaminergic midbrain, the GRL model's RPE signals (alongside value and decision signals) were localized within not only the striatum but also the substantia nigra and the ventral tegmental area, including specific effects of generalization that also extend to the hippocampus. Factoring in generalization as a multidimensional process in value‐based learning, these findings shed light on complexities that, while challenging classic RL, can still be resolved within the bounds of its core computations.
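The abstract's account of a single reward-prediction error that is "relayed for bidirectional counterfactual updating" can be made concrete with a minimal sketch. Assuming a two-state, two-action reversal task and illustrative parameter names (alpha, g_state, g_action) that are not the paper's exact parameterization, inverse generalization across states and actions might look like:

```python
import numpy as np

def grl_update(Q, s, a, r, alpha=0.3, g_state=0.5, g_action=0.5):
    """One trial of a hypothetical 'generalized RL' (GRL) update.

    A single reward-prediction error (RPE) for the experienced
    state-action pair is also relayed, with opposite sign, to the
    unchosen action in the same state (action generalization) and to
    the same action in the alternative state (state generalization),
    implementing bidirectional counterfactual updating. The binary
    layout and parameter names are illustrative assumptions.
    """
    delta = r - Q[s, a]                        # the single RPE
    Q[s, a] += alpha * delta                   # experienced pair
    s_other, a_other = 1 - s, 1 - a            # binary task structure
    Q[s, a_other] -= alpha * g_action * delta  # inverse action generalization
    Q[s_other, a] -= alpha * g_state * delta   # inverse state generalization
    return delta

Q = np.zeros((2, 2))
delta = grl_update(Q, s=0, a=0, r=1.0)
# A rewarded choice raises the chosen pair's value while pushing the
# counterfactual alternatives in the opposite direction.
```

Setting g_state or g_action to zero recovers undergeneralization (no inference), while values that are too large would correspond to the overgeneralization (false inference) described in the abstract.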

Table S1. Model comparison: 3-T Face/House version (Good-learner group). Listed first for 3 nonlearning models and 17 learning models fitted to empirical data are absolute scores for deviance and the corrected Akaike information criterion (AICc) (where a lower score is better). These absolute scores were translated to residual goodness of fit relative to the hysteresis model (where a higher score is better). Winning results determined by the AICc are highlighted with boldface and italics. "df" stands for degrees of freedom. This table is related to Figure 3. The conventions for displaying this table also apply for Tables S2-S15.
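The model comparison rests on the corrected Akaike information criterion, which penalizes deviance by the number of free parameters with a small-sample correction. A minimal sketch of the standard formula (the variable names are illustrative):

```python
def aicc(deviance, k, n):
    """Corrected Akaike information criterion (lower is better).

    deviance = -2 * log-likelihood of the fitted model;
    k = number of free parameters; n = number of observations
    (e.g., trials). The correction term vanishes as n grows large.
    """
    aic = deviance + 2 * k
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# Given equal deviance, a 7-parameter model (such as the GRL model)
# scores worse than a 2-parameter model; it can only win the
# comparison by reducing deviance enough to offset its penalty.
```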

[Figure 4 appears here: 3-T Face/House version; legend entries include "Nonlearner".]

Figure S1. Discriminability of the GRL model: 3-T Face/House version. Compare to Figure 3. Each fitted instantiation of the 7-parameter "generalized reinforcement learning" (GRL) model ("AX | SY") was used to simulate a data set yoked to that of the respective subject. Replications of the results from the original model comparison were achieved with these simulations as a demonstration of the discriminability of this preferred model with its additional degrees of freedom. This figure is related to Tables S6-S8.

Figure S2. Compare to Figure S1. The basic RL model was recovered in lieu of the GRL model when substituting data simulated with basic RL. This converse model recovery again demonstrates an absence of overfitting. This figure is related to Tables S11-S13.

Figure S4.

Table S16. Neural substrates of the RL framework: 3-T Face/House version. Listed for every significant cluster (p < 0.005, k ≥ 10) are anatomical regions; hemispheres ("H") as left ("L"), right ("R"), or bilateral ("B"); stereotactic coordinates in MNI space in mm (x, y, z); test statistics (t(df)); probability values (p); cluster extents in voxels (k); and results of small-volume correction (SVC) at the cluster level ("C") or the peak level ("P") (pFWE < 0.05), where marginally significant ("c" or "p" in lower case) (0.05 < pFWE < 0.10) or uncorrected ("U") (p < 0.005) results are also listed if the most stringent threshold for SVC was not attained within the region of interest. All relevant groupings of participants are included. The conventions for displaying this table also apply for Tables S17, S19, S20, S23, and S24. This table is related to Figure 8 and Table S18.

Figure S6. Neural substrates of the RL framework: 7-T Color/Motion version (Dopaminergic midbrain). At 7 T, reward-prediction error (RPE) signals from the GRL model were further localized to the substantia nigra (SN) (p < 0.005). This figure is related to Figure 9 and Tables S17 and S18.

Table S18. Neural substrates of the RL framework: Summary. The first portion of the fMRI analyses across data sets and participant groups (i.e., "All", "Good", and "Poor" learners) is summarized for the RL framework that serves as the foundation of the GRL model. Regions of interest (ROIs) were informed by prior studies modeling the reward-prediction error, value, and reaction time. Initially, broader exploratory ROIs were defined anatomically and tested for uncorrected results ("U") (p < 0.005). For RPE and value signals, coordinate-based ROIs were first tested collectively via SVC at the set level ("S") (pFWE < 0.05). Post-hoc tests followed for individual ROIs via SVC at the cluster level ("C") or the peak level ("P") (pFWE < 0.05); marginally significant ("s", "c", or "p" in lower case) (0.05 < pFWE < 0.10) or uncorrected ("U") (p < 0.005) results are listed as well if the most stringent threshold for SVC was not attained. Left ("L"), right ("R"), and bilateral ("B") refer to hemispheres for each ROI. The conventions for displaying this table also apply for Tables S21, S22, and S25. This table is related to Figures 8, 9, and S6 and Tables S16 and S17.

Figure S7. Neural substrates of the GRL model: 7-T Color/Motion version (Dopaminergic midbrain). (a) At 7 T, interaction effects between RPE signals and state generalization were localized to both the SN and the ventral tegmental area (VTA) (p < 0.005). (b) Interaction effects between RPE signals and action generalization were likewise observed in both the SN and the VTA (p < 0.005). This figure is related to Figure 10 and Tables S20 and S21.

Table S21. Neural substrates of the GRL model: Summary (Basal ganglia). The second portion of the fMRI analyses is first summarized for the basal ganglia as further validation of the GRL model. As these effects lack precedent, the ROIs (as before) originated from a prior study that modeled the RPE without including any effects of generalization. This table is related to Figures 10 and S7 and Tables S19 and S20.

Table S22. Neural substrates of the GRL model: Summary (Hippocampus). This qualitative summary of the second portion of the fMRI analyses examines the hippocampus. This table is related to Figure 10 and Tables S19 and S20.

Table S25. Neural substrates of the learning rate: Summary. The absence of overlap between specific effects of generalization and effects of learning performance indicates that the former are not confounded with the latter. This table is related to Tables S23 and S24.