CS7150 Homework 2
Sampled Softmax
We talked about the softmax classifier in class. Suppose there are $C$ classes. A softmax classifier takes a feature vector $x \in \mathbb{R}^d$, computes logits $z_i = w_i^\top x + b_i$, and predicts the probability of the $i$-th class as
$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{C} \exp(z_j)},$$
where $w_i \in \mathbb{R}^d$, $b_i \in \mathbb{R}$, $i = 1, \dots, C$ are trainable parameters. They are trained by minimizing a cross-entropy loss. Specifically, a datum $x$ with label $c$ incurs the training loss
$$L = -\log p_c.$$
We then update the trainable parameters via (stochastic) gradient descent.
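For concreteness, here is a minimal NumPy sketch of the forward pass and loss defined above; the names `softmax_xent`, `W`, and `b` are illustrative choices, not part of the assignment.

```python
import numpy as np

def softmax_xent(x, c, W, b):
    """Cross-entropy loss L = -log p_c for one datum.

    x : (d,) feature vector
    c : integer class label
    W : (C, d) matrix whose i-th row is w_i
    b : (C,) bias vector
    """
    z = W @ x + b                    # logits z_i = w_i^T x + b_i
    z = z - z.max()                  # shift for numerical stability (p_i unchanged)
    p = np.exp(z) / np.exp(z).sum()  # softmax probabilities
    return -np.log(p[c])

# Toy usage: C = 5 classes, d = 3 features.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(5, 3)), np.zeros(5)
x, c = rng.normal(size=3), 2
print(softmax_xent(x, c, W, b))
```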
- Revisit the class notes and derive the gradients $\partial L / \partial w_i$ and $\partial L / \partial b_i$ for $i = 1, \dots, C$. Express them as functions of $p_i$ and $x$. (A finite-difference check for verifying your derivation is sketched after this list.)
- The denominator of $p_i$ requires computing $C$ terms, so when $C$ is large this incurs a tremendous computational cost. Sampled softmax alleviates this by randomly sampling $K$ ($K \ll C$) of these terms to approximate $p_i$. Specifically, we choose a distribution with probability mass function $q$ over the $C$ classes and draw $K$ class IDs from $q$. Denote this set of sampled class IDs as $S$, and assume class $i$ itself is excluded from $S$. We can then approximate the denominator by
$$\sum_{j=1}^{C} \exp(z_j) \approx \exp(z_i) + \frac{1}{K} \sum_{j \in S} \frac{\exp(z_j)}{q_j}.$$
(A sketch of this estimator appears below.)
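To check the gradients from the first exercise, one standard trick is to compare the analytic expressions against centered finite differences. The sketch below, with the hypothetical helper `num_grad_W` and step size `eps`, assumes the `softmax_xent` function defined earlier.

```python
import numpy as np

def num_grad_W(x, c, W, b, eps=1e-6):
    """Numerical gradient of the loss w.r.t. W via centered differences."""
    g = np.zeros_like(W)
    for i in range(W.shape[0]):
        for k in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, k] += eps   # perturb one entry up
            Wm[i, k] -= eps   # and down
            g[i, k] = (softmax_xent(x, c, Wp, b) - softmax_xent(x, c, Wm, b)) / (2 * eps)
    return g

# Your analytic dL/dW evaluated at (x, c, W, b) should match num_grad_W
# entrywise to several decimal places.
```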
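As a rough illustration of the approximation in the second exercise, here is a minimal NumPy sketch, assuming a uniform proposal $q$ that is renormalized over the classes other than $i$ so that the estimate is unbiased; the name `sampled_denominator` is hypothetical.

```python
import numpy as np

def sampled_denominator(z, i, q, K, rng):
    """Approximate sum_j exp(z_j) using K classes sampled from q.

    z : (C,) logits
    i : true class index (kept exactly and excluded from the sample)
    q : (C,) proposal probabilities summing to 1
    K : number of sampled classes (K << C)
    """
    C = len(z)
    # Restrict the proposal to classes other than i and renormalize.
    q_rest = q.copy()
    q_rest[i] = 0.0
    q_rest /= q_rest.sum()
    S = rng.choice(C, size=K, p=q_rest)
    # exp(z_i) is kept exactly; the remaining terms are importance-weighted:
    # (1/K) * sum_{j in S} exp(z_j) / q_j.
    return np.exp(z[i]) + np.mean(np.exp(z[S]) / q_rest[S])

rng = np.random.default_rng(0)
C, K = 10_000, 50
z, q = rng.normal(size=C), np.full(C, 1.0 / C)
print(sampled_denominator(z, i=3, q=q, K=K, rng=rng))  # sampled estimate, O(K)
print(np.exp(z).sum())                                 # exact denominator, O(C)
```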