Ranking and recommendation systems, at their core, rely on a scoring function \(f(x)\) that assigns a score to an item based on its feature vector \(x\). When the scoring function is a parametrized neural network, we write it as \(f^{W}(x)\), where \(W\) is the set of learned parameters. Once the items are scored, they are ranked by sorting them on their scores. To construct a scoring function, one needs to determine the underlying objective. Some example objectives are:
If the right modeling strategy is used and the right collection of features is taken into account, optimizing for a single objective is a relatively simple task. Once the optimization on the target objective is done, we have a scoring function that scores items so that they can be re-ranked to meet a certain goal (for example, increasing advertisement profits, or getting users to buy more items).
While single-objective optimization works well in the majority of use cases, there are times when several objectives must be combined at once. For instance, we may want to maximize the number of ad clicks while also increasing the probability that a product is purchased. When items are ranked on multiple objective signals, there is a trade-off whenever the objectives are at odds with one another; when they are not, that is the best outcome for a multi-objective scoring system. In this post, I will discuss multi-objective optimization, and leave the discussion on multi-objective scoring (Pareto optimality) for later.
Let’s say we have \(N=3\) independent tasks that we need our model to optimize. Consider the following three tasks:
The simplest way to learn these three tasks is to optimize them individually. By the end, we will have three separate models, each with its own parameters. To construct the scoring function, we can take a weighted linear combination of the prediction outputs of these models (this is out of the scope of this discussion, as we only deal with optimization, not scoring).
Although the above strategy is straightforward, it has some drawbacks. Multi-task learning addresses them: it is a class of techniques where a single model shares its parameters across multiple tasks, and optimization is done on a weighted linear combination of the individual task losses. Let’s say the shared model (i.e., a neural network) is parametrized by \(W\); then the final loss function \(\mathcal{L}(W)\) is given by
\[\begin{equation} \mathcal{L}(W) = w_{1} * \mathcal{L_1}(W) + w_{2} * \mathcal{L_2}(W) + w_{3} * \mathcal{L_3}(W) \\ \mathcal{L_1}(W) = - \log Softmax(y_1, f^{W}(x)) \\ \mathcal{L_2}(W) = \parallel y_2 - f^{W}(x) \parallel^{2} \\ \mathcal{L_3}(W) = \parallel y_3 - f^{W}(x) \parallel^{2} \\ \end{equation}\]where \(\mathcal{L_1}(W)\) is the logistic loss with \(y_1\) the expected class output of the binary classification task, and \(\mathcal{L_2}(W)\) and \(\mathcal{L_3}(W)\) are the squared-error losses with \(y_2\) and \(y_3\) the expected outputs of the first regression task (number of clicks) and the second regression task (view seconds), respectively.
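To make the combination concrete, here is a minimal NumPy sketch of the three per-task losses and their weighted sum. The logits, predictions, targets, and weights below are hypothetical placeholder values, not outputs of any real model:

```python
import numpy as np

def task_losses(logits, y1, pred2, y2, pred3, y3):
    """Per-example losses: L1 is the logistic (negative log-softmax) loss
    for the binary classification task; L2 and L3 are squared-error
    losses for the two regression tasks."""
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    l1 = -log_probs[y1]
    l2 = (y2 - pred2) ** 2
    l3 = (y3 - pred3) ** 2
    return l1, l2, l3

def combined_loss(losses, weights):
    """L(W) = w1 * L1(W) + w2 * L2(W) + w3 * L3(W)."""
    return sum(w * l for w, l in zip(weights, losses))

# hypothetical single-example outputs of a shared model f^W(x)
losses = task_losses(np.array([0.0, 0.0]), 1, 3.0, 5.0, 1.0, 1.0)
total = combined_loss(losses, (1.0, 1.0, 1.0))
```

Note how the regression loss (here 4.0) dwarfs the classification loss (about 0.69) in the total: this scale mismatch is exactly what motivates tuning the weights, which we turn to next.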
The following are some justifications for why multi-task optimization is preferable to individual task optimization:
In the following section, we go over the disadvantages of manually tuning the per-task loss weights (\(w_1\), \(w_2\), \(w_3\)) and how to avoid doing so.
There are two downsides to manually choosing the weights \(w_1\), \(w_2\), and \(w_3\).
Both the task output variance and the sensitivity of the weights to the task’s output scale (and therefore its loss scale) indicate that we should express the task’s output uncertainty in the optimization objective such that
To reflect the uncertainty in a task’s output, we consider two cases, as detailed in the following sub-sections.
To model the uncertainty in the classification task, we can define the likelihood of the task’s output with a Gibbs distribution \(Gibbs(f^{W}(x), \sigma)\).
\[\begin{equation} p(y=c | f^{W}(x), \sigma) = \frac{exp(\frac{1}{\sigma^2}f^{W}_c(x))}{\sum_{c'}^{C}exp(\frac{1}{\sigma^2}f^{W}_{c'}(x))} \end{equation}\]where \(c\) is the output class, \(C\) is the total number of classes, and \(\sigma^2\) is the temperature parameter, which measures uncertainty in terms of entropy.
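As a quick numerical illustration (with made-up class scores), the temperature \(\sigma^2\) controls how flat the distribution is: a larger \(\sigma\) spreads probability mass more evenly across classes, i.e. higher entropy and higher uncertainty:

```python
import numpy as np

def gibbs_probs(f, sigma):
    """p(y=c | f, sigma) for every class c: a softmax with
    temperature sigma^2 applied to the scores f."""
    z = f / sigma**2
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

f = np.array([2.0, 0.0, -1.0])   # hypothetical class scores
p_sharp = gibbs_probs(f, sigma=1.0)
p_flat = gibbs_probs(f, sigma=10.0)  # higher sigma -> flatter distribution
```

With these scores, `p_sharp` concentrates most of its mass on the first class, while `p_flat` is close to uniform over the three classes.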
The log likelihood is written as:
\[\begin{equation} \log p(y=c | f^{W}(x), \sigma) = \frac{1}{\sigma^2}f^{W}_{c}(x) - \log \sum_{c'}^{C} exp(\frac{1}{\sigma^2}f^{W}_{c'}(x)) \end{equation}\]To model the uncertainty in the regression tasks, we can define the likelihood of the task’s output with a Normal distribution \(\mathcal{N}(f^{W}(x), \sigma^2)\):
\[\begin{equation} p(y | f^{W}(x), \sigma) = \frac{1}{\sigma\sqrt{2\pi}}exp\left(-\frac{(y-f^{W}(x))^2}{2\sigma^2}\right) \end{equation}\]where \(y\) is the regression output and \(\sigma\) is the standard deviation of the normal distribution.
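As a quick sanity check with arbitrary numbers, the negative log of this density splits into a squared-error term, a \(\log \sigma\) term, and a constant \(\frac{1}{2}\log(2\pi)\); it also illustrates the trade-off the uncertainty creates, since for a fixed residual the negative log-likelihood is minimized when \(\sigma\) matches the residual magnitude:

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def nll(y, mu, sigma):
    """-log p(y | mu, sigma), dropping the constant 0.5*log(2*pi):
    the squared-error term plus the log(sigma) penalty."""
    return 0.5 * (y - mu) ** 2 / sigma**2 + np.log(sigma)

# the two agree up to the constant, for arbitrary values
y, mu, sigma = 3.0, 1.0, 2.0
exact = -np.log(normal_pdf(y, mu, sigma))
approx = nll(y, mu, sigma) + 0.5 * np.log(2 * np.pi)
# here the residual is |3 - 1| = 2, so sigma = 2 gives the smallest nll
```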
The log likelihood can be written as:
\[\begin{equation} \log p(y | f^{W}(x), \sigma) = -\frac{1}{2\sigma^2}\parallel y - f^{W}(x) \parallel^2 - \log \sigma \end{equation}\]where we have dropped the constant \(\frac{1}{2}\log(2\pi)\) term. Let’s assume that we are optimizing for the three previously mentioned tasks (one classification and two regression tasks). Under the independence assumption, the likelihood of the combined model output \(p(y_{1}=c, y_{2}, y_{3} \vert f^{W}(x), \sigma_{1}, \sigma_{2}, \sigma_{3})\) can be factorized further.
\[\begin{equation} p(y_1=c, y_2, y_3 | f^{W}(x), \sigma_{1}, \sigma_{2}, \sigma_{3}) = p(y_1=c | f^{W}(x), \sigma_{1}) * p(y_2 | f^{W}(x), \sigma_{2}) * p(y_3 | f^{W}(x), \sigma_{3}) \end{equation}\]where \(y_1\) is the expected output of the classification task, \(y_2\) is the expected output of the first regression task, and \(y_3\) is the expected output of the second regression task. \(\sigma_{1}\), \(\sigma_{2}\), and \(\sigma_{3}\) are the uncertainty parameters associated with the three tasks, respectively.
The final optimization objective (the loss function) minimizes the negative log likelihood, as given by:
\[\begin{equation} \mathcal{L}(W, \sigma_1, \sigma_2, \sigma_3) = -\log \left( p(y_1=c | f^{W}(x), \sigma_{1}) * p(y_2 | f^{W}(x), \sigma_{2}) * p(y_3 | f^{W}(x), \sigma_{3}) \right) \\ = -\log p(y_1=c | f^{W}(x), \sigma_{1}) - \log p(y_2 | f^{W}(x), \sigma_{2}) - \log p(y_3 | f^{W}(x), \sigma_{3}) \\ \end{equation}\]The negative log-likelihood of the classification task is:
\[\begin{equation} -\log p(y_1=c | f^{W}(x), \sigma_{1}) = -\frac{1}{\sigma_{1}^2}f^{W}_{c}(x) + \log \sum_{c'}^{C} exp \left(\frac{1}{\sigma_{1}^2} f^{W}_{c'}(x) \right) \\ = -\frac{1}{\sigma_{1}^2}f^{W}_{c}(x) + \frac{1}{\sigma_{1}^2} \log \sum_{c'}^{C}exp \left( f^{W}_{c'}(x) \right) - \frac{1}{\sigma_{1}^2} \log \sum_{c'}^{C}exp \left( f^{W}_{c'}(x) \right) + \log \sum_{c'}^{C} exp \left(\frac{1}{\sigma_{1}^2} f^{W}_{c'}(x) \right) \\ = -\frac{1}{\sigma_{1}^2}\left( f^{W}_{c}(x) - \log \sum_{c'}^{C} exp(f^{W}_{c'}(x)) \right) + \log \frac{\sum_{c'}^{C} exp\left ( \frac{1}{\sigma_{1}^2} f^{W}_{c'}(x) \right)}{\left( \sum_{c'}^{C} exp(f^{W}_{c'}(x)) \right)^{\frac{1}{\sigma_{1}^2}}} \\ \end{equation}\]where the second line adds and subtracts the term \(\frac{1}{\sigma_{1}^2} \log \sum_{c'}^{C} exp(f^{W}_{c'}(x))\). After substituting \(\left( \sum_{c'}^{C} exp(f^{W}_{c'}(x)) \right)^{\frac{1}{\sigma_{1}^2}}\) with the approximation \(\frac{1}{\sigma_{1}^2} \sum_{c'}^{C} exp \left( \frac{1}{\sigma_{1}^2} f^{W}_{c'}(x) \right)\), which becomes exact as \(\sigma_{1} \to 1\), we get
\[\begin{equation} -\log p(y_1=c | f^{W}(x), \sigma_{1}) = \frac{1}{\sigma_{1}^2} \mathcal{L}_{1}(W) + \log{\sigma_{1}} \end{equation}\]The negative log-likelihood of the regression tasks is given by (shown for the first regression task; the second is analogous):
\[\begin{equation} -\log p(y_2 | f^{W}(x), \sigma_{2}) = \frac{1}{2\sigma_{2}^2}\parallel y_2 - f^{W}(x) \parallel^2 + \log{\sigma_{2}} \\ -\log p(y_2 | f^{W}(x), \sigma_{2}) = \frac{1}{2\sigma_{2}^2} \mathcal{L_{2}}(W) + \log{\sigma_{2}} \end{equation}\]Putting it all together, the final multi-task optimization loss for all three tasks is given by:
\[\begin{equation} \mathcal{L}(W, \sigma_1, \sigma_2, \sigma_3) = \frac{1}{\sigma_{1}^2} \mathcal{L}_{1}(W) + \frac{1}{2\sigma_{2}^2} \mathcal{L_{2}}(W) + \frac{1}{2\sigma_{3}^2} \mathcal{L_{3}}(W) + \log{\sigma_{1}} + \log{\sigma_{2}} + \log{\sigma_{3}} \end{equation}\]We notice that the objective also learns the task output uncertainties: losses with high uncertainty are assigned smaller weights, while the \(\log \sigma\) terms act as regularizers that keep the uncertainties from growing without bound.
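In practice the \(\sigma_i\) are learned jointly with \(W\) by gradient descent; a common implementation trick is to parametrize \(s_i = \log \sigma_i^2\) so that \(\sigma_i^2 > 0\) by construction and the loss stays numerically stable. Below is a minimal NumPy sketch that holds the task losses fixed at hypothetical values and optimizes only the \(s_i\); everything here is illustrative, since in reality the task losses depend on \(W\) and all parameters are trained together:

```python
import numpy as np

def total_loss(l, s):
    """Uncertainty-weighted loss with s[i] = log(sigma_i^2).
    Classification term is scaled by 1/sigma^2, the regression terms
    by 1/(2 sigma^2), and log(sigma) = 0.5 * s."""
    return (np.exp(-s[0]) * l[0]
            + 0.5 * np.exp(-s[1]) * l[1]
            + 0.5 * np.exp(-s[2]) * l[2]
            + 0.5 * np.sum(s))

def grad_s(l, s):
    """Analytic gradient of total_loss with respect to s."""
    return np.array([
        -np.exp(-s[0]) * l[0] + 0.5,
        -0.5 * np.exp(-s[1]) * l[1] + 0.5,
        -0.5 * np.exp(-s[2]) * l[2] + 0.5,
    ])

# hypothetical fixed per-task losses (classification, regression, regression)
l = np.array([0.7, 4.0, 0.25])
s = np.zeros(3)
for _ in range(2000):
    s -= 0.1 * grad_s(l, s)
sigma2 = np.exp(s)  # learned variances: larger loss -> larger sigma^2
```

With the task losses frozen, the variances settle at \(\sigma_1^2 = 2\mathcal{L}_1\) for the classification term and \(\sigma_i^2 = \mathcal{L}_i\) for the regression terms, so the task with the largest loss is automatically down-weighted while the \(\log \sigma\) penalty stops every weight from collapsing to zero.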
We showed that optimizing a multi-objective function can be challenging, especially when the scales of the individual task objectives vary. Furthermore, it takes a lot of trial and error to decide how much weight each task should receive. In this article, we discussed a procedure that folds the tuning of the loss weights into the optimization objective itself. The learned uncertainty parameters reduce each task’s output uncertainty, which improves the performance of multi-objective models.
Please report errata and/or leave any comments/suggestions here.
Disclaimer: Information presented here does not represent views of my current or past employers. Opinions are my own.