Some Remarks on CCP-based Estimators of Dynamic Models

This note provides several remarks relating to conditional choice probability (CCP) based estimation approaches for dynamic discrete-choice models. Specifically, the Arcidiacono and Miller [2011] estimation procedure relies on the "inverse-CCP" mapping from CCPs to choice-specific value functions. Exploiting the convex-analytic structure of discrete choice models, we discuss two approaches for computing this mapping, using either linear or convex programming, for models where the utility shocks can follow arbitrary parametric distributions. Furthermore, the inverse-CCP mapping is generally distinct from the "selection adjustment" term (i.e. the expectation of the utility shock for the chosen alternative), so that computational approaches for computing the latter may not be appropriate for computing the inverse-CCP mapping.


Introduction
Conditional choice probability (CCP) based estimation approaches for dynamic discrete-choice models have become well-established in the empirical literature on dynamic structural models. A crucial step in these procedures involves computing the ''inverse CCP'' mapping from choice probabilities to choice-specific value functions. This is exemplified by the Arcidiacono and Miller (2011) estimation procedure, which relies on knowing or computing the vector-valued function ψ(p) = (ψ_1(p), …, ψ_J(p))⊺, where p = (p_1, …, p_J)⊺ is a probability vector. For each alternative k the function ψ_k satisfies

ψ_k(p(z)) = V(z) − v_k(z),

where z denotes the model state, p(z) = (p_1(z), …, p_J(z))⊺ the (conditional) choice probabilities implied by the model, v_k, k = 1, …, J, are the choice-specific value functions, and V is the ex ante (or integrated) value function. For the multinomial logit model, ψ_k(p(z)) = −log p_k(z). That is, ψ_k(p(z)) equals the expected utility shock of the optimal action, which we can interpret as a ''selection adjustment'' term. However, as we will see, this equality is more the exception than the rule. Furthermore, it is not clear how to compute ψ for an arbitrary assumed distribution of the utility shocks ε, including, for instance, Gaussian errors, or errors which may depend on observed covariates or state variables.
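The logit benchmark is easy to verify numerically. The following sketch (our own illustration, not code from the note) uses mean-zero Gumbel shocks, so that W(v) = log Σ_j exp(v_j), and checks that ψ_k(p) = W(v) − v_k = −log p_k:

```python
import numpy as np

# Choice-specific values v and the logit surplus W(v) = log sum_j exp(v_j)
# (mean-zero Gumbel shocks, so the Euler-Mascheroni constant drops out).
v = np.array([1.0, 0.5, -0.3])
W = np.log(np.sum(np.exp(v)))      # ex ante value V
p = np.exp(v) / np.sum(np.exp(v))  # logit CCPs

psi = W - v                        # Lemma 1: psi_k(p) = V - v_k
# psi coincides with -log(p) componentwise in the logit case
```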
In this note, we interpret the quantity ψ based on the convex-analytic properties of additive random utility models (ARUMs).
We characterize a class of distributions for which ψ k (p (z)) coincides with the selection adjustment term.
We discuss two general approaches for computing ψ for ARUMs with arbitrary error distributions. The first approach exploits the Mass Transport Estimator (Chiong et al., 2016), which allows one to compute ψ(p) using linear programming techniques.
Our second approach relies upon the characterization of ψ(p) as the (unique) solution to an unconstrained concave programming problem. We find the convex optimization approach to be better suited for larger problems.1 Proofs are included in the Appendix of this note, which supplements the article with the same title that is forthcoming in Economics Letters.

1 Li (2018) considers a convex minimization algorithm to solve the similar problem of ''demand inversion'' and illustrates his method in the case of both the Berry et al. (1995) random coefficient logit demand model and the Berry and Pakes (2007) pure characteristics model.

Review: Dynamic discrete choice model
To set the scene, we review the dynamic discrete choice (DDC) model, up to the key Lemma 1 in Arcidiacono and Miller (2011).
In each period until T ≤ ∞, an individual chooses among J mutually exclusive actions. Let d_jt = 1 if action j ∈ {1, …, J} is taken at time t and d_jt = 0 otherwise. The current period payoff for action j at time t depends on the state z_t ∈ {1, …, Z}. If action j is taken at time t, the probability of z_{t+1} occurring in period t + 1 is f_jt(z_{t+1} | z_t).
The individual's current period payoff for action j at time t is u_jt(z_t) + ε_jt. The choice-specific shocks ε_jt are revealed to the individual at the beginning of period t. The vector ε_t = (ε_1t, …, ε_Jt)⊺ is independent of the state and i.i.d. over time with density function g, full support, and finite means.
The individual chooses a decision vector d_t = (d_1t, …, d_Jt)⊺ to sequentially maximize the expected discounted sum of payoffs

E[ ∑_{τ=t}^{T} β^{τ−t} ∑_{j=1}^{J} d_jτ ( u_jτ(z_τ) + ε_jτ ) ],   (1)

where the expectation at each period t is taken over the future values of z_{t+1}, …, z_T and ε_{t+1}, …, ε_T, and β ∈ (0, 1) is the discount factor.

Expression (1) is maximized by a Markov decision rule d°_t(z_t, ε_t), which gives the optimal action conditional on t, z_t, and ε_t. Integrating over ε_t gives the CCPs

p_jt(z_t) = ∫ d°_jt(z_t, ε_t) g(ε_t) dε_t,  j = 1, …, J.

The ex ante value function in period t, V_t(z_t), is the expected discounted sum of payoffs just before ε_t is revealed, conditional on behaving according to the optimal decision rule. By Bellman's principle, the V_t(z_t)'s can be recursively expressed as

V_t(z_t) = E[ ∑_{j=1}^{J} d°_jt(z_t, ε_t) ( v_jt(z_t) + ε_jt ) ] = E[ max_j { v_jt(z_t) + ε_jt } ],

where the second expression integrates out the disturbance vector ε_t. The choice-specific conditional value function, v_jt(z_t), is the flow payoff of action j without ε_jt plus the discounted expected future utility conditional on following the optimal decision rule from period t + 1 onward:

v_jt(z_t) = u_jt(z_t) + β ∑_{z_{t+1}=1}^{Z} V_{t+1}(z_{t+1}) f_jt(z_{t+1} | z_t).

Observe that the ex ante value function V_t(z_t) coincides with the social surplus function W, defined by

W(v) = E[ max_j { v_j + ε_j } ],   (3)

evaluated at v_t(z_t) = (v_1t(z_t), …, v_Jt(z_t))⊺.
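The recursion above can be sketched in a few lines. The example below is our own toy illustration, with hypothetical primitives (randomly generated u_jt(z) and f_jt(z′|z)) and mean-zero Gumbel shocks, so that the integrated value takes the closed logit form:

```python
import numpy as np

rng = np.random.default_rng(0)
J, Z, T, beta = 3, 4, 5, 0.9
u = rng.standard_normal((T, J, Z))   # flow payoffs u_jt(z) (hypothetical)
f = rng.random((T, J, Z, Z))         # transition probabilities f_jt(z'|z)
f /= f.sum(axis=3, keepdims=True)    # normalize so each row sums to one

V = np.zeros((T + 1, Z))             # terminal continuation value = 0
p = np.zeros((T, J, Z))
for t in range(T - 1, -1, -1):
    # v_jt(z) = u_jt(z) + beta * sum_z' V_{t+1}(z') f_jt(z'|z)
    v = u[t] + beta * f[t] @ V[t + 1]
    # logit (mean-zero Gumbel): V_t(z) = log sum_j exp(v_jt(z))
    V[t] = np.log(np.exp(v).sum(axis=0))
    p[t] = np.exp(v - V[t])          # CCPs: softmax over alternatives
```

After the loop, `v` holds the period-0 choice-specific values, and Lemma 1 can be verified directly: V_0(z) = v_j0(z) − log p_j0(z) for every j.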

The ψ (inverse-CCP) mapping from Arcidiacono-Miller
Arcidiacono and Miller (2011, Lemma 1) show that the value function V_t(z_t) can be expressed as a function of any conditional value function v_jt(z_t) plus a function of the conditional choice probabilities p_t(z_t). Let Δ° be the set of probability vectors p ∈ R^J with strictly positive entries.
Lemma 1 (Arcidiacono and Miller, 2011). There exists a function ψ = (ψ_1, …, ψ_J)⊺ : Δ° → R^J such that, for each j = 1, …, J,

V_t(z_t) = v_jt(z_t) + ψ_j(p_t(z_t)).

Arcidiacono and Miller's estimation procedure relies on ψ. They show how to compute it for Generalized Extreme Value (GEV) distributed ε (leading to, e.g., logit or nested logit), but it is not clear how to compute it for general assumed distributions of ε.
This note addresses this issue.

Interpretation of ψ k
The ex-ante value function V t (z t ), as defined in (3) above, arises from evaluating the convex function W at the vector v t (z t ). Following Chiong et al. (2016), the ψ function can therefore be interpreted from a convex-analytic perspective.

Random utility and convex analysis
Consider a decision maker (DM) making a utility maximizing discrete choice among alternatives j ∈ {1, …, J}. The utility of alternative j is v_j + ε_j, where v = (v_1, …, v_J)⊺ is deterministic and ε = (ε_1, …, ε_J)⊺ is a vector of random utility shocks. This is the classic additive random utility model (ARUM) of McFadden (1978). Our presentation of the ARUM framework here will emphasize convex-analytic properties which will be important in drawing connections with Arcidiacono and Miller's (2011) approach.
Assumption 1. The random vector ε is absolutely continuous with finite means, independent of v, and fully supported on R^J. Assumption 1 leaves the distribution of ε unspecified, thus allowing for a wide range of choice probability systems far beyond the often-used logit model. The assumption allows arbitrary correlation between the ε_j's, which may be important in applications.
The DM then has choice probabilities

p_j(v) = Pr( v_j + ε_j ≥ v_k + ε_k for all k ),  j = 1, …, J.

An important object in this paper is the surplus function of the discrete choice model (so named by McFadden, 1981). As defined at the end of Section 2, it is given by

W(v) = E[ max_j { v_j + ε_j } ].

Under Assumption 1, W is convex3 and differentiable, and the choice probabilities p coincide with the derivatives of W:

p(v) = ∇W(v).

This is the Williams-Daly-Zachary theorem, famous in the discrete choice literature (McFadden, 1978, 1981). Next we introduce the ''selection adjustment'' terms, which are the expected values of the utility shocks for each choice given that the choice is optimally selected, i.e.

e_j(v) = E[ ε_j | v_j + ε_j ≥ v_k + ε_k for all k ],  j = 1, …, J.
Then the social surplus function W can be expressed as a weighted average, with weights given by the choice probabilities:

W(v) = ∑_{j=1}^{J} p_j(v) ( v_j + e_j(v) ).   (9)

Given a choice probability vector p, the conjugate surplus W*(p) is defined as

W*(p) = sup_{v ∈ R^J} { v⊺p − W(v) }.

Combining (9) with the fact that W(v) + W*(p) = v⊺p if and only if p = ∇W(v), we obtain an alternative expression for W*(p(v)) as a choice probability-weighted sum of expectations of the utility shocks ε:

W*(p(v)) = − ∑_{j=1}^{J} p_j(v) e_j(v).   (10)

When do ψ_k(p_t(z_t)) and e_k(v_t(z_t)) coincide?
Returning to the DDC setting, we know that for the multinomial logit model both ψ k (p t (z t )) and e k (v t (z t )) equal − log p kt (z t ), k = 1, . . . , J. As we explain below, the same relation holds for all ARUMs arising from the GEV family. However, it does not hold for non-GEV distributions in general. (See Dearing, 2019 for a counterexample.) We next explore the relationship between ψ k (p t (z t )) and e k (v t (z t )) in more detail.
For choice probabilities p t (z t ), the rationalizing utilities v t (z t ) are identified up to location (cf. Chiong et al., 2016, Section 2.3).
With the utility normalization W(v⁰_t(z_t)) = 0, Lemma 1 gives ψ(p_t(z_t)) = −v⁰_t(z_t). Using (10), it follows that

p_t(z_t)⊺ ψ(p_t(z_t)) = p_t(z_t)⊺ e_t(v⁰_t(z_t)),

as both sides equal −W*(p_t(z_t)). As noted by Dearing (2019), this means that e_t(v⁰_t(z_t)) and ψ(p_t(z_t)) lie in the same hyperplane. However, coinciding inner products do not imply that e_t(v⁰_t(z_t)) = ψ(p_t(z_t)).
It turns out that there is a simple condition that allows one to know when e(v_t(z_t)) = ψ(p_t(z_t)). The result is related to "invariance" as defined in Fosgerau et al. (2018). Let ṽ_j = v_j + ε_j, v̄ = max_j ṽ_j, and ξ = argmax_j ṽ_j. The random vector ṽ has the invariance property when v̄, the utility of the chosen alternative, and ξ, the index of the chosen alternative, are statistically independent.
3 The convexity of W follows from the convexity of the max function.

Under invariance, we can compute ψ from the expected utility shocks. Invariance implies that the distribution of the utility of a specific alternative, conditional on that alternative being chosen, is the same regardless of which alternative is considered.

Proposition 2. If the ARUM satisfies invariance for all ṽ, then e(v_t(z_t)) = ψ(p_t(z_t)).
All (regular) GEV distributions have the invariance property (Fosgerau et al., 2018). The finding that ψ(p_t(z_t)) = e(v_t(z_t)) in the GEV case is therefore a special case of the result for ARUMs with the invariance property. It is the invariance property that drives the result.
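Invariance is easy to check by simulation. The sketch below (our own illustration, with made-up utilities) verifies that, for i.i.d. type-I extreme value shocks, the mean of v̄ conditional on ξ = j is approximately the same for every j:

```python
import numpy as np

rng = np.random.default_rng(1)
v = np.array([1.0, 0.0, -0.5])          # deterministic utilities (made up)
S = 400_000
eps = rng.gumbel(size=(S, 3))           # i.i.d. type-I extreme value shocks

util = v + eps
vbar = util.max(axis=1)                 # utility of the chosen alternative
xi = util.argmax(axis=1)                # index of the chosen alternative

# Under invariance, E[vbar | xi = j] does not depend on j:
cond_means = np.array([vbar[xi == j].mean() for j in range(3)])
```

All three conditional means come out approximately equal to the unconditional mean E[v̄], as the independence of v̄ and ξ requires.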

Characterization and computation of ψ
The main issue in applying Lemma 1 is to compute the function ψ (p t (z t )). In this section we discuss two alternative approaches. Both work with an arbitrary distribution of the utility shocks ε and do not require the invariance property. This is in contrast to most of the existing literature, which has focused on the multinomial logit model. Following Chiong et al. (2016) we first show how to compute W * (p t (z t )) and recover the vector v t . Second, following Li (2018), we compute ψ as the solution to a convex optimization problem.

Two convex-analytic characterizations
By Fenchel's equality and the observation that V_t(z_t) = W(v_t(z_t)), Lemma 1 yields

ψ_j(p_t(z_t)) = W(v_t(z_t)) − v_jt(z_t) = p_t(z_t)⊺ v_t(z_t) − W*(p_t(z_t)) − v_jt(z_t).   (11)

For choice probabilities p_t(z_t), the rationalizing utilities v_t(z_t) differ by a common constant. The constant is differenced out in W(v_t(z_t)) − v_jt(z_t), and therefore ψ_j(p_t(z_t)) is uniquely determined.
Alternatively, Chiong et al. (2016) characterize the ψ function as the solution to

ψ(p) = − argmax_{v ∈ R^J} p⊺v  subject to  W(v) = 0.   (12)

In this way, −ψ(p) can be interpreted as the vector of choice-specific value functions that rationalizes the observed choice probabilities p, under the normalization W(v) = 0.

Computation using linear programming
For a given choice probability vector p_t(z_t), we can use the LP procedure in Chiong et al. (2016) to compute W*(p_t(z_t)) and then combine with (11) to determine ψ(p_t(z_t)). Specifically, upon replacing F, the distribution of the utility shocks ε, by a discrete distribution, the program in (12) becomes a linear program (an assignment problem) with dimension equal to the product of the number of alternatives (J) and the number of support points (S). Let F̂ be an S-point discrete approximation to the shock distribution which is uniform on its support supp(F̂) = { ε¹, …, ε^S }. We can approximate W*(p) by solving the linear program (LP)

Ŵ*(p) = − max_{π ≥ 0} ∑_{s=1}^{S} ∑_{j=1}^{J} π_sj ε_j^s   (13)

subject to

∑_{s=1}^{S} π_sj = p_j,  j = 1, …, J,   (14)

∑_{j=1}^{J} π_sj = 1/S,  s = 1, …, S.   (15)

For this discretized problem, let Ŵ(v) and Ŵ*(p) denote the approximate social surplus and conjugate surplus, respectively. The set ∂Ŵ*(p) ⊆ R^J then corresponds to the Lagrange multipliers associated with the constraints (14).
In short, for a given choice probability vector p_t(z_t), the LP procedure of Chiong et al. (2016) yields Ŵ*(p_t(z_t)), and any one of the vectors v_t(z_t) ∈ ∂Ŵ*(p_t(z_t)) which rationalize p_t(z_t) can be recovered as the Lagrange multipliers of the LP problem. Subsequently, we can compute ψ_j using Eq. (11).
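As an illustration, the following sketch implements the discretized LP with SciPy's `linprog` (a minimal implementation of the approach described above; the variable names and the Gumbel test case are ours). It computes Ŵ*(p), reads a rationalizing vector v off the dual variables of the constraints (14), and forms ψ via Eq. (11):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
J, S = 3, 300
p = np.array([0.5, 0.3, 0.2])                    # target CCPs
eps = rng.gumbel(size=(S, J)) - np.euler_gamma   # mean-zero shock draws

# Discretized conjugate: What*(p) = min over couplings pi of sum_sj pi_sj (-eps_sj)
# subject to column sums p_j (constraints (14)) and row sums 1/S (constraints (15)).
c = -eps.ravel()                                 # pi flattened as pi[s, j]
A_cols = np.tile(np.eye(J), S)                   # sum_s pi_sj = p_j
A_rows = np.kron(np.eye(S), np.ones((1, J)))     # sum_j pi_sj = 1/S
A_eq = np.vstack([A_cols, A_rows])
b_eq = np.concatenate([p, np.full(S, 1.0 / S)])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
Wstar = res.fun                                  # approximate W*(p)
v = res.eqlin.marginals[:J]                      # duals of (14): v in dW*(p)

psi = p @ v - Wstar - v                          # Eq. (11)
What = np.max(v + eps, axis=1).mean()            # simulated surplus at v
freq = np.bincount(np.argmax(v + eps, axis=1), minlength=J) / S  # approx. p
```

At the LP optimum, strong duality implies the discretized Fenchel equality Ŵ(v) + Ŵ*(p) = p⊺v, which provides a useful internal consistency check, and the recovered v reproduces p as argmax frequencies up to the 1/S granularity of the discretization.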

Computation using convex programming
Alternatively, ψ can be computed as the solution to (12).
We suggest a convex optimization program that automatically incorporates the constraint (normalization) W (v) = 0 by using the exponentiated surplus e W rather than the surplus itself.
Proposition 3. The function ψ in Lemma 1 is given by

ψ(p) = argmax_{v ∈ R^J} { −p⊺v − e^{W(−v)} }.   (16)

The solution satisfies W(−ψ(p)) = 0 for any p ∈ Δ°.
The exponentiated surplus is strictly convex (Lemma 4 in the Appendix), so the first-order conditions are necessary and sufficient for finding ψ(p). To gain some intuition, write w = −v; the first-order conditions then require that p = ∇W(w) e^{W(w)}. By the Williams-Daly-Zachary theorem, the components of ∇W(w) are choice probabilities and hence sum to one, so summing the first-order conditions yields 1 = e^{W(w)}, which is the desired normalization, and then ∇W(w) = p. Very efficient convex optimization algorithms are readily available for the problem (16). Li (2018) uses a trust region algorithm to solve the equivalent problem (12) for a static discrete-choice model and shows that it computationally outperforms the Berry et al. (1995) contraction mapping in the case of a random coefficient logit model.
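A minimal sketch of the convex programming approach (our own illustration, using simulated mean-zero Gumbel draws): we minimize exp(Ŵ(v)) − p⊺v over v, so that the solution v̂ satisfies Ŵ(v̂) ≈ 0 and ∇Ŵ(v̂) ≈ p, and then set ψ_j = Ŵ(v̂) − v̂_j as in Lemma 1:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
J, S = 3, 50_000
p = np.array([0.5, 0.3, 0.2])                    # target CCPs
eps = rng.gumbel(size=(S, J)) - np.euler_gamma   # mean-zero shock draws

def What(v):
    # simulated surplus: average of max_j (v_j + eps_j) over the draws
    return np.max(v + eps, axis=1).mean()

def gradWhat(v):
    # Eq.-(18)-style (sub)gradient: frequency with which each j attains the max
    return np.bincount(np.argmax(v + eps, axis=1), minlength=J) / S

def objective(v):
    # unconstrained program: minimize exp(What(v)) - p'v
    return np.exp(What(v)) - p @ v

def grad(v):
    return np.exp(What(v)) * gradWhat(v) - p

res = minimize(objective, np.zeros(J), jac=grad, method="BFGS")
vhat = res.x
psi = What(vhat) - vhat                          # Lemma 1: psi_j = W(v) - v_j
```

At the solution, exp(Ŵ(v̂)) is close to one (the normalization) and the argmax frequencies reproduce p; with Gumbel draws, ψ is close to −log p, up to simulation error.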

Comparing linear and convex programming
The LP problem (13)-(15) typically becomes very high-dimensional, as a very large number of draws for the utility shocks ε may be required to approximate the consumer heterogeneity distribution F sufficiently well. While solvers exist for large-scale LP problems (say, S < 10^6), memory or time constraints may be prohibitive in practice.5

5 Such experience has been noted before (Galichon, 2016, pp. 31-32).

In contrast, problem (16) recasts the constrained optimization problem (12) as an unconstrained optimization problem. Since the approximation sample only enters through the average

Ŵ(v) = (1/S) ∑_{s=1}^{S} max_j { v_j + ε_j^s },   (17)

one may employ a very large S at essentially no additional computational burden. Moreover, since Ŵ is differentiable almost everywhere with gradient components

∂Ŵ(v)/∂v_j = (1/S) ∑_{s=1}^{S} 1{ v_j + ε_j^s ≥ v_k + ε_k^s for all k },   (18)

one may solve the discretized version of (16) using one of the many gradient-based optimizers for unconstrained convex programming. For example, experimenting with Matlab's default unconstrained minimization algorithm (fminunc), we arrive at a highly precise answer within a fraction of a second, even when using simulation draws in the hundreds of thousands. This numerical finding is especially encouraging when thinking about CCP inversion as part of an inner loop in a larger estimation routine. Our experience with the two computational methods indicates that the convex programming approach is better suited for problems involving more than a few alternatives. If the number of simulation draws S is small, then the approximation (18) to a small partial derivative of the surplus may be exactly zero, and the resulting approximation to ψ may be poor.
This affects both approaches as they solve the same problem. With the convex optimization approach the issue is minor since S can be increased at essentially no cost. If it is difficult to sample from F , then one may consider an alternative approximation to the surplus by means of importance sampling.
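The importance-sampling idea can be sketched as follows (an illustration under assumed distributions: a correlated Gaussian target F and a wider independent Gaussian proposal, both our own choices). The surplus is approximated by a weighted average of maxima, with weights equal to the density ratio:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
v = np.array([0.3, -0.2])
S = 200_000

# Target shock distribution F: correlated bivariate normal (illustrative).
target = multivariate_normal(mean=[0, 0], cov=[[1.0, 0.5], [0.5, 1.0]])

# Proposal g: wider independent normal that is easy to sample from.
proposal = multivariate_normal(mean=[0, 0], cov=2.0 * np.eye(2))
draws = proposal.rvs(size=S, random_state=rng)
w = target.pdf(draws) / proposal.pdf(draws)      # importance weights

# Importance-sampling approximation of W(v) = E_F[max_j (v_j + eps_j)]
W_is = np.mean(w * np.max(v + draws, axis=1))

# Direct Monte Carlo benchmark using draws from F itself
W_mc = np.mean(np.max(v + target.rvs(size=S, random_state=rng), axis=1))
```

Because the proposal has heavier tails than the target, the weights are bounded and the two approximations agree up to simulation noise.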

Conclusion
This note has interpreted the ψ function from Arcidiacono and Miller (2011) in terms of the convex-analytic properties of dynamic discrete-choice (DDC) models. This leads naturally to computational methods which enable researchers to estimate DDC models in which the error terms can be drawn from distributions far beyond the usual logit families assumed in the empirical literature. More generally, the results here highlight the deep connections between the CCP approach to estimating DDC models and the convex-analytic properties of additive random utility models. We believe further exploration of this connection may be fruitful.

Appendix. Proofs
Proof of Proposition 2. Under invariance,

e_j(v) = E[ v̄ − v_j | ξ = j ] = E[ v̄ ] − v_j = W(v) − v_j,  j = 1, …, J,

and then, by Lemma 1, e(v_t(z_t)) = ψ(p_t(z_t)). ■

Lemma 4. The function

Ω(v) = e^{W(v)}   (19)

is strictly convex.

Proof. It is well known that W(v + cι) = W(v) + c for any c ∈ R, where ι = (1, …, 1)⊺, and that, under Assumption 1, W is strictly convex in any direction not parallel to ι. Given v ≠ v′, write v′ − v = cι + o with o orthogonal to ι. (If v′ − v is not parallel to ι, we must have o ≠ 0.) Let λ ∈ (0, 1). Then, by the homogeneity property, W(λv + (1 − λ)v′) = W(v + (1 − λ)o) + (1 − λ)c. If o = 0, then W is affine and non-constant along the segment from v to v′, so e^W is strictly convex along it, since the exponential function is strictly convex. If o ≠ 0, then by strict convexity of W in the direction of the vector o, W(λv + (1 − λ)v′) < λW(v) + (1 − λ)W(v′), and since the exponential function is increasing and convex, e^{W(λv + (1 − λ)v′)} < λ e^{W(v)} + (1 − λ) e^{W(v′)}. Hence e^W is strictly convex. ■

Proof of Proposition 3. By Lemma 4, Ω defined in (19) is strictly convex. Moreover, Ω is finite and everywhere differentiable. Then Rockafellar (1970, Theorem 26.5) applies, showing that the convex conjugate Ω* of Ω is proper, closed, essentially smooth and essentially strictly convex. Moreover, the gradient mapping ∇Ω : R^J → int(dom Ω*), x ↦ ∇Ω(x), is a topological isomorphism with inverse mapping (∇Ω)^{−1} = ∇Ω*.
The convex conjugate Ω* of Ω is defined by

Ω*(x) = sup_{v ∈ R^J} { x⊺v − Ω(v) },  x ∈ R^J.

Up to the change of variables v ↦ −v, we recognize this as the maximization problem in Proposition 3. The first-order condition for this problem is x = ∇Ω(v), and a solution exists and is unique for any x ∈ R^J_{++}, since range ∇Ω = R^J_{++}.
6 Applying standard results from convex analysis, Sørensen and Fosgerau (2020) obtain a more general result.