CS7150 Homework 2
Sampled Softmax
We talked about the softmax classifier in class. Suppose there are $C$ classes. A softmax classifier takes a feature vector $x \in \mathbb{R}^d$, computes logits $z_i = w_i^\top x + b_i$, and predicts the probability of the $i$-th class as
$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{C} \exp(z_j)},$$
where $w_i \in \mathbb{R}^d$, $b_i \in \mathbb{R}$, $i = 1, \dots, C$ are trainable parameters. They are trained by minimizing a cross-entropy loss. Specifically, a datum $x$ with label $c$ incurs the training loss
$$L = -\log p_c.$$
We then update the trainable parameters via (stochastic) gradient descent.
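For concreteness, here is a minimal NumPy sketch of the forward pass and loss defined above; the names `softmax_xent`, `W`, and `b` are illustrative choices, not part of the assignment.

```python
import numpy as np

def softmax_xent(x, c, W, b):
    """Cross-entropy loss L = -log p_c for one datum.

    x : (d,) feature vector
    c : integer class label
    W : (C, d) matrix whose i-th row is w_i
    b : (C,) bias vector
    """
    z = W @ x + b                    # logits z_i = w_i^T x + b_i
    z = z - z.max()                  # shift for numerical stability (p_i unchanged)
    p = np.exp(z) / np.exp(z).sum()  # softmax probabilities
    return -np.log(p[c])

# Toy usage: C = 5 classes, d = 3 features.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(5, 3)), np.zeros(5)
x, c = rng.normal(size=3), 2
print(softmax_xent(x, c, W, b))
```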
- Revisit the class notes and derive the gradients $\partial L / \partial w_i$ and $\partial L / \partial b_i$ for $i = 1, \dots, C$. Express them as functions of $p_i$ and $x$. (A finite-difference check for verifying your derivation is sketched after this list.)
- The denominator of $p_i$ requires computing $C$ terms, so when $C$ is large this incurs a tremendous computational cost. Sampled softmax alleviates this by randomly sampling $K$ ($K \ll C$) of these terms to approximate $p_i$. Specifically, we choose a distribution with probability mass function $q$ over the $C$ classes and draw $K$ class IDs from $q$. Denote this set of sampled class IDs as $S$, and assume class $i$ itself is excluded from $S$. We can then approximate the denominator by
$$\sum_{j=1}^{C} \exp(z_j) \approx \exp(z_i) + \frac{1}{K} \sum_{j \in S} \frac{\exp(z_j)}{q_j}.$$
(A sketch of this estimator appears below.)
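To check the gradients from the first exercise, one standard trick is to compare the analytic expressions against centered finite differences. The sketch below, with the hypothetical helper `num_grad_W` and step size `eps`, assumes the `softmax_xent` function defined earlier.

```python
import numpy as np

def num_grad_W(x, c, W, b, eps=1e-6):
    """Numerical gradient of the loss w.r.t. W via centered differences."""
    g = np.zeros_like(W)
    for i in range(W.shape[0]):
        for k in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, k] += eps   # perturb one entry up
            Wm[i, k] -= eps   # and down
            g[i, k] = (softmax_xent(x, c, Wp, b) - softmax_xent(x, c, Wm, b)) / (2 * eps)
    return g

# Your analytic dL/dW evaluated at (x, c, W, b) should match num_grad_W
# entrywise to several decimal places.
```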
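As a rough illustration of the approximation in the second exercise, here is a minimal NumPy sketch, assuming a uniform proposal $q$ that is renormalized over the classes other than $i$ so that the estimate is unbiased; the name `sampled_denominator` is hypothetical.

```python
import numpy as np

def sampled_denominator(z, i, q, K, rng):
    """Approximate sum_j exp(z_j) using K classes sampled from q.

    z : (C,) logits
    i : true class index (kept exactly and excluded from the sample)
    q : (C,) proposal probabilities summing to 1
    K : number of sampled classes (K << C)
    """
    C = len(z)
    # Restrict the proposal to classes other than i and renormalize.
    q_rest = q.copy()
    q_rest[i] = 0.0
    q_rest /= q_rest.sum()
    S = rng.choice(C, size=K, p=q_rest)
    # exp(z_i) is kept exactly; the remaining terms are importance-weighted:
    # (1/K) * sum_{j in S} exp(z_j) / q_j.
    return np.exp(z[i]) + np.mean(np.exp(z[S]) / q_rest[S])

rng = np.random.default_rng(0)
C, K = 10_000, 50
z, q = rng.normal(size=C), np.full(C, 1.0 / C)
print(sampled_denominator(z, i=3, q=q, K=K, rng=rng))  # sampled estimate, O(K)
print(np.exp(z).sum())                                 # exact denominator, O(C)
```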