$$\notag \newcommand{\ba}{\mathbf{a}} \newcommand{\be}{\mathbf{e}} \newcommand{\bff}{\mathbf{f}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\g}{\,|\,} \newcommand{\nth}{^{(n)}} \newcommand{\te}{\!=\!} \newcommand{\ttimes}{\!\times\!} $$

Softmax regression

Softmax1 regression is a generalization of logistic regression to “multiclass classification”: each label can take on one of \(K\!\ge\!2\) discrete settings. Some textbooks will simply call this generalization “logistic regression” as well.

As previously discussed, one way to handle multiclass classification is to represent each label with a length-\(K\) one-hot encoding: \[ \by = \begin{bmatrix}0 & 0 & \ldots & 0 & 1 & 0 & \ldots & 0\end{bmatrix}^\top. \] If data-point \(n\) has label \(c\), then \(y_k^{(n)} = \delta_{kc}\), where \(\delta_{kc}\) is a Kronecker delta.
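
To make the encoding concrete, here is a minimal NumPy sketch (the array names `labels` and `Y` are just for illustration):

```python
import numpy as np

# Hypothetical integer labels for N=4 data-points, each in {0, ..., K-1} with K=3.
# (Classes are indexed from 0 here, rather than 1 to K as in the text.)
labels = np.array([2, 0, 1, 2])
K = 3

# Y[n, k] = 1 if data-point n has label k, else 0 (the Kronecker delta above).
Y = np.zeros((labels.size, K))
Y[np.arange(labels.size), labels] = 1.0
print(Y)
```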

If we were to consider each element of \(\by\) to be independent of the rest, we could attempt to predict the elements of \(\by\) one at a time, with \(K\) separate one-vs-rest classifiers. Then we would need a method to combine the outputs of these classifiers. Instead, we’ll fit a probabilistic model that is explicit about how to predict the whole vector.

Our model has parameters \(W\) and outputs a vector \(\bff\). We interpret the vector output \(\bff\) as a probability distribution over which bit in the target vector \(\by\) is turned on, giving the model: \[ P(y_k\te1\g \bx, W) = f_k(\bx; W). \] We can start by using a separate regression vector \(\bw^{(k)}\) for each class, creating a positive score by exponentiating a linear combination of the features: \[ s_k = e^{(\bw^{(k)})^\top\bx}. \] Moving the features \(\bx\) in the direction \(\bw^{(k)}\) increases the score for class \(k\). In particular, if \(w^{(k)}_d\) is positive, then large values of \(x_d\) give class \(k\) a large score.

However, probabilities have to add up to one, so our output vector is normalized: \[ f_k = \frac{s_k}{\sum_{k'} s_{k'}} = \frac{e^{(\bw^{(k)})^\top\bx}}{\sum_{k'}e^{(\bw^{(k')})^\top\bx}}. \] Here \(k'\) is a dummy index used to sum over all of the classes. Our vector-valued function \(\bff\) is parameterized by \(W=\{\bw^{(k)}\}_{k=1}^K\). The classes now compete with each other. If \(w^{(k)}_d\) is large, then large \(x_d\) doesn’t necessarily favour class \(k\). The normalization means that the probability of class \(k\) could be small if \(w^{(j)}_d\) is larger for another class \(j\).

If \(W\) is a \(K\ttimes D\) matrix, we might write the above function as \(\bff(\mathbf{x}; W) = \mathrm{softmax}(W\mathbf{x})\).
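
As a rough sketch of that computation in NumPy (the function name `softmax_prob` is made up for this example; subtracting the maximum score before exponentiating doesn’t change the result, because the shift cancels in the normalization, but it avoids numerical overflow):

```python
import numpy as np

def softmax_prob(W, x):
    """Class probabilities f_k = exp(w_k.x) / sum_k' exp(w_k'.x).

    W: (K, D) matrix with weight vector w^(k) in row k.
    x: (D,) feature vector.
    """
    a = W @ x                  # scores: a_k = w^(k).x, shape (K,)
    a = a - a.max()            # shift for numerical stability; cancels in the ratio
    s = np.exp(a)              # positive scores s_k
    return s / s.sum()         # normalized probabilities, sum to one

# Tiny example with K=3 classes and D=2 features (arbitrary numbers):
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.3],
              [-1.0,  0.8]])
x = np.array([0.5, 2.0])
print(softmax_prob(W, x))      # three probabilities summing to 1
```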

1 Fitting the model

We want our function evaluated at a training input, \(\bff(\bx^{(n)})\), to predict the corresponding target \(\by^{(n)}\). At best, they are equal, in which case we confidently predict the correct label. However, if the labels are inherently noisy, so that the same input could be labelled differently, making perfect predictions every time is impossible.

Matching targets \(\by\) and function values \(\bff\) by least squares would be consistent: if a setting of the parameters can make \(\bff\) match the probabilities of the labels under the real process that is generating the data, then we will fit those parameters given infinite data. However, “maximum likelihood” estimation is also consistent, and for well-specified models has faster asymptotic convergence. Compared to least squares, maximum likelihood heavily penalizes confident but wrong function values, which may or may not be seen as desirable. (Consider the maximum loss that can be obtained on a single example under the two cost functions.)

We first find the gradients of the model’s log-probability of a single observation: a label \(c\) when the input features are \(\bx\): \[\begin{align} \log P(y_c\te 1\g \bx, W) = \log f_c &= (\bw^{(c)})^\top\bx - \log \sum_{k'} e^{(\bw^{(k')})^\top\bx}\\ \nabla_{\bw^{(k)}} \log f_c &= \delta_{kc}\bx - \frac{1}{\sum_{k'} e^{(\bw^{(k')})^\top\bx}} \,\bx\, e^{(\bw^{(k)})^\top\bx}\\ &= (y_k - f_k)\,\bx. \end{align}\] In the last line we have substituted \(y_k\te\delta_{kc}\) and the definition of \(f_k\).2
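
A quick way to check a gradient like this is to compare it to finite differences. The sketch below assumes nothing beyond NumPy; the function names are made up for illustration:

```python
import numpy as np

def log_prob_label(W, x, c):
    """log P(y_c = 1 | x, W) = w^(c).x - log sum_k' exp(w^(k').x)."""
    a = W @ x
    return a[c] - a.max() - np.log(np.sum(np.exp(a - a.max())))

def grad_log_prob(W, x, c):
    """Gradient w.r.t. each weight vector: row k is (y_k - f_k) x."""
    a = W @ x
    f = np.exp(a - a.max())
    f /= f.sum()
    y = np.zeros(W.shape[0])
    y[c] = 1.0
    return (y - f)[:, None] * x[None, :]      # shape (K, D)

# Compare one element of the gradient to a central finite difference:
rng = np.random.default_rng(0)
K, D = 3, 4
W = rng.standard_normal((K, D)); x = rng.standard_normal(D); c = 1
eps = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[2, 0] += eps; W_minus[2, 0] -= eps
numeric = (log_prob_label(W_plus, x, c) - log_prob_label(W_minus, x, c)) / (2 * eps)
print(numeric, grad_log_prob(W, x, c)[2, 0])  # should closely agree
```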

To apply stochastic gradient ascent3 to the log-likelihood we would take a training example \((\bx^{(n)},\by^{(n)})\) and move the weight vector for each class parallel to the input \(\bx^{(n)}\). The size and sign of the move depend on the disparity between the binary target \(y_k^{(n)}\) and the prediction \(f_k(\bx^{(n)})\).

Alternatively, we could use an optimizer that works with a batch of examples, which requires the gradient \[\nabla_{\bw^{(k)}} \sum_n \log f_{c\nth}\nth = \sum_n (y_k^{(n)} - f_k(\bx^{(n)}))\,\bx^{(n)}, \] or minus that value for the gradient of the batch’s negative log probability.
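
As a sketch, the batch cost (the negative log probability) and its gradient could be computed as follows, assuming the inputs are stacked in an \(N\ttimes D\) matrix `X` and the one-hot targets in an \(N\ttimes K\) matrix `Y` (both names are just for this example):

```python
import numpy as np

def softmax_cost_grad(W, X, Y):
    """Negative log probability of one-hot targets Y under softmax regression,
    and its gradient with respect to the (K, D) weight matrix W.

    X: (N, D) inputs, one row per data-point.
    Y: (N, K) one-hot targets.
    """
    A = X @ W.T                                   # (N, K) scores
    A = A - A.max(axis=1, keepdims=True)          # stabilize each row
    logZ = np.log(np.exp(A).sum(axis=1, keepdims=True))
    logF = A - logZ                               # (N, K) log f_k(x^(n))
    cost = -np.sum(Y * logF)                      # -sum_n log f_{c^(n)}
    grad = -(Y - np.exp(logF)).T @ X              # (K, D): -sum_n (y^(n) - f^(n)) x^(n)^T
    return cost, grad
```

The same function applied to a batch of one example gives the stochastic update described above; a gradient-based optimizer would repeatedly step \(W\) in the direction of minus `grad`.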

2 Redundant parameters and binary logistic regression

For the special case of two classes, the softmax classifier gives: \[\begin{align} P(y\te1\g \bx, W) &= \frac{e^{(\bw^{(1)})^\top\bx}}{e^{(\bw^{(1)})^\top\bx} + e^{(\bw^{(2)})^\top\bx}}\\ &= \frac{1}{1 + e^{(\bw^{(2)} - \bw^{(1)})^\top\bx}} = \sigma\big((\bw^{(1)} - \bw^{(2)})^\top\bx\big). \end{align}\] The prediction only depends on the vector \((\bw^{(1)} - \bw^{(2)})\). In logistic regression we normally fit that single vector directly as “\(\bw\)”.

We can fit two vectors, as in the softmax classifier formulation. The parameters are then not well specified: adding a constant vector \(\ba\) to both classes’ weight vectors gives the same predictions. However, if we add a regularization term to our cost function, the cost for this model will be minimized by a unique setting of the weights.
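
Both facts are easy to verify numerically. A small sketch with arbitrary numbers: the two-class softmax matches \(\sigma\big((\bw^{(1)}-\bw^{(2)})^\top\bx\big)\), and adding the same vector \(\ba\) to both weight vectors leaves the predictions unchanged:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Arbitrary two-class weights and features, just for illustration.
w1 = np.array([ 0.5, -1.0, 2.0])
w2 = np.array([-0.3,  0.7, 0.1])
x  = np.array([ 1.0,  2.0, -0.5])

scores = np.array([w1 @ x, w2 @ x])
f = np.exp(scores) / np.exp(scores).sum()
print(f[0], sigmoid((w1 - w2) @ x))    # the same probability P(y=1 | x)

# Shifting both weight vectors by the same vector a doesn't change the predictions.
a = np.array([10.0, -3.0, 4.0])
scores_shifted = scores + a @ x        # (w_k + a).x = w_k.x + a.x for both classes
f_shifted = np.exp(scores_shifted) / np.exp(scores_shifted).sum()
print(np.allclose(f, f_shifted))       # True
```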

Alternatively, we could set the weight vector for one of the classes to zero and fit the rest of the parameters. For example, setting \(\bw^{(2)}\te\mathbf{0}\) recovers logistic regression.

 



  1. I believe the term softmax was coined by John S. Bridle (1990). Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. In: F. Fogelman Soulié and J. Hérault (eds.), Neurocomputing: Algorithms, Architectures and Applications, Berlin: Springer-Verlag, pp. 227–236.↩︎

  2. It may take some time to digest all of the notation here. We index into the vector of function outputs using the observed label \(c\). We didn’t use \(k\) for this index so we can use \(k\) to index the weight vector we are computing gradients for. We need the gradients for \(k\!\ne\!c\) as well as \(k\te c\). The dummy index \(k'\) is used because we need to consider all of the weight vectors \(\{\bw^{(k')}\}_{k'=1}^K\), when computing the gradient with respect to a particular weight vector \(\bw^{(k)}\).↩︎

  3. Optimization texts normally talk about gradient descent and minimizing functions. Gradient descent on the negative log-likelihood amounts to the same thing. We take a negative step in the direction of the gradient of the negative likelihood, and the two negatives cancel.↩︎