Written by Vishal Keshav


Rate-based features - Using statistical tools for ML feature engineering


Introduction

Machine learning systems (e.g. ranking systems) work best when the feature set used to model the predictions is sensible. Toward this, there is a plethora of material on the internet about constructing good features, aka feature engineering. In this post, my primary goal is to touch on a less discussed yet very important class of features called “rate-based features”. In the first half of this post, I will highlight what rate-based features are, how to construct them, and some best practices. In the second half, I will go through an interesting mathematical derivation to show how statistical tools can help with better feature construction. The ideas developed here will be carried forward in the next blog post to show more interesting use cases of statistical tools for Machine Learning.

Rate-based features and how to construct them

As noted earlier, rate-based features are some of the most important types of features on which ML systems rely to make good predictions. For example, let’s say we build a system to predict whether a user is going to click on a certain item on an e-commerce website. A straightforward feature to include is the total number of clicks on the item. This, however, is a bad feature because it ignores how long the item has been on the website (the item’s impressions). A better feature is the click rate, i.e. the number of clicks per impression. This feature is also problematic because the observed click rate of a newer item has much higher variance, which in turn affects how that content is shown and further biases the impressions of that item. For example, say a newer item gets 10 impressions and, because it was only shown to a specific user base, receives just one click. The resulting rate-based feature of 1/10 biases the prediction system towards de-ranking that item further, which prevents the item from getting more impressions and clicks.

One simple strategy is to artificially augment the data by increasing both the number of impressions and the number of clicks under the assumption that a newer item is equally likely to be good as it is bad. Specifically, let’s say \(N_{I}\) is the total number of impressions on the item, and \(N_{c}\) is the total number of clicks, then our original feature (representing proportion) \(p = N_{c}/N_{I}\) changes to \(p' = \frac{N_{c} + C/2}{N_{I} + C}\). Here \(C\) is an arbitrary constant.
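As a quick sketch of this smoothing in Python (the counts and the choice of \(C\) here are purely illustrative):

```python
def smoothed_rate(n_clicks, n_impressions, c=20):
    """Additive smoothing: pretend the item already received c extra
    impressions, half of which were clicked (the 50/50 prior assumption)."""
    return (n_clicks + c / 2) / (n_impressions + c)


# A new item with 1 click out of 10 impressions is pulled towards 0.5
print(smoothed_rate(1, 10))  # (1 + 10) / (10 + 20) ~= 0.37
```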

This smoothing solves the problem to some extent, but:

  1. We don’t know which values of \(C\) work best
  2. We are making an assumption about item relevancy.

For a good choice of \(C\), the solution is fine, but there is a better strategy. We can add two rate-based features (instead of one) representing the two extremes of our confidence about what the correct feature value might be. This brings us to confidence intervals, and how to use them to construct a better rate-based feature.

Before delving into maths, let us discuss some fundamental statistics.

Confidence intervals, hypothesis testing, and the test inversion

Confidence intervals for proportions

A confidence interval is a range of values that, with a certain pre-decided confidence level, we expect to contain the quantity being estimated. The confidence interval for the mean of normally distributed data (remember the central limit theorem?) is given by

\[\begin{equation} CI = \bar{X} \pm Z_{\alpha} \frac{\sigma}{\sqrt{N}} \end{equation}\]

where \(\bar{X}\) is the sample mean, \(\sigma\) is the population standard deviation, \(N\) is the sample size, and \(Z_{\alpha}\) is the standard normal critical value corresponding to the chosen confidence level.

Since we rarely know the population standard deviation, we substitute the sample standard deviation \(s\) (for large \(N\) the normal approximation still holds), and the confidence interval becomes

\[\begin{equation} CI = \bar{X} \pm Z_{\alpha} \frac{s}{\sqrt{N}} \end{equation}\]

In our case, the estimate is the rate-based feature (the proportion of clicks). A rate-based feature is in fact the mean of random variables following a Bernoulli distribution. Let \(X_{i} \sim Bernoulli(p)\) and let \(\hat{p}\) be their sample mean (which is also our observed rate-based feature value); then replacing the standard deviation \(s\) with the plug-in Bernoulli standard deviation \(\sqrt{\hat{p}(1-\hat{p})}\) gives a confidence interval that looks like:

\[\begin{equation} CI = \hat{p} \pm Z_{\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{N}} \end{equation}\]

where \(\hat{p}\) is the sample mean, or observed rate value.
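As a rough sketch in Python (standard library only; the 1-click-out-of-10-impressions example is the hypothetical item from earlier):

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.
    z=1.96 corresponds to a 95% confidence level."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# 1 click out of 10 impressions
print(wald_interval(0.1, 10))  # roughly (-0.086, 0.286)
```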

Hypothesis testing

A great reference can be found here

Inverting the test

Test inversion is one of the fundamental concepts in statistics, which says: a confidence interval is the range of values of the parameter under test for which we fail to reject the test. To show this in action, let’s again consider \(X_{1}, X_{2}, \ldots, X_{N} \sim Bernoulli(p)\), where \(p\) is the mean and \(p(1-p)\) is the variance. \(\hat{p} = \frac{1}{N}\Sigma_{i=1}^{N}X_{i}\) then approximately follows a normal distribution for large \(N\) (central limit theorem), and \(T_{N} = \frac{\hat{p} - p}{\sqrt{\frac{\hat{p}(1-\hat{p})}{N}}}\) approximately follows a standard normal distribution \(\mathcal{N}(0,1)\).

Let’s consider a two-sided test with the null hypothesis \(H_{0}\): the true proportion equals \(p\), against the alternative \(H_{1}\): it does not, at a significance level \(\alpha\).

We reject \(H_{0}\) if \(\mid T_{N} \mid > Z_{\alpha}\). In other words, we fail to reject the null hypothesis when

\[\begin{equation} \mid T_{N} \mid \le Z_{\alpha} \\ -Z_{\alpha} \le \frac{\hat{p} - p}{\sqrt{\frac{\hat{p}(1-\hat{p})}{N}}} \le Z_{\alpha} \\ \hat{p} - Z_{\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{N}} \leq p \leq \hat{p} + Z_{\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{N}} \end{equation}\]

We fail to reject the null hypothesis when \(p\) lies in the interval \(\left[\hat{p} - Z_{\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{N}}, \; \hat{p} + Z_{\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{N}} \right]\), which is exactly the confidence interval.
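To see the inversion concretely, here is a small sketch (assuming NumPy is available; the numbers are arbitrary) that scans candidate values of \(p\), keeps those for which the test fails to reject, and recovers essentially the same interval:

```python
import numpy as np


def interval_by_test_inversion(p_hat, n, z=1.96, grid_size=100_001):
    """Collect every hypothesized p for which |T_N| <= z (test not rejected)."""
    p_grid = np.linspace(0.0, 1.0, grid_size)
    t_stat = (p_hat - p_grid) / np.sqrt(p_hat * (1 - p_hat) / n)
    accepted = p_grid[np.abs(t_stat) <= z]
    return accepted.min(), accepted.max()


# Matches p_hat -/+ 1.96 * sqrt(p_hat * (1 - p_hat) / n) up to grid resolution
print(interval_by_test_inversion(0.3, 50))  # roughly (0.173, 0.427)
```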

In the next section, we finally derive a more accurate interval and construct better rate-based features using the interval’s endpoints.

Accurate intervals with score test inversion

We could use the interval endpoints derived above and call it a day. But there are some problems if we want to use them as our rate-based features:

  1. The lower endpoint can become negative, which does not make sense for a rate. For example, with \(N = 10\) and \(\hat{p} = 0.1\), the lower endpoint is \(0.1 - 1.96\sqrt{0.1 \cdot 0.9 / 10} \approx -0.086\).
  2. For a very small or large value of \(\hat{p}\), the interval collapses: when \(\hat{p} = 0\) or \(1\), both endpoints equal \(\hat{p}\) regardless of \(N\).

A more accurate test for a proportion, also called the score test, resolves both issues. In the score test, we standardize using the standard deviation implied by the hypothesized value \(p\), namely \(\sqrt{p(1-p)}\), instead of the plug-in sample value \(\sqrt{\hat{p}(1-\hat{p})}\).

After swapping in this standard deviation, we get

\[\begin{equation} -Z_{\alpha} \le \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{N}}} \le Z_{\alpha} \\ -Z_{\alpha}\sqrt{\frac{p(1-p)}{N}} \le \hat{p} - p \le Z_{\alpha}\sqrt{\frac{p(1-p)}{N}} \\ (\hat{p} - p)^{2} \le Z_{\alpha}^{2}\frac{p(1-p)}{N} \\ \frac{N}{Z_{\alpha}^{2}}(\hat{p} - p)^{2} - p(1-p) \le 0 \\ \frac{N\hat{p}^{2}}{Z_{\alpha}^{2}} + \frac{Np^{2}}{Z_{\alpha}^{2}} - \frac{2N}{Z_{\alpha}^{2}}\hat{p}p - p + p^{2} \le 0 \\ p^{2}\left(\frac{N}{Z_{\alpha}^{2}} + 1\right) - p\left(\frac{2N\hat{p}}{Z_{\alpha}^{2}} + 1\right) + \frac{N\hat{p}^{2}}{Z_{\alpha}^{2}} \le 0 \end{equation}\]

Solving the quadratic equation in \(p\) and simplifying, the roots are

\[\begin{equation} p = \left(\hat{p} + \frac{Z_{\alpha}^2}{2N} \pm Z_{\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{N} + \frac{Z_{\alpha}^2}{4N^2}}\right) \bigg/ \left(1+\frac{Z_{\alpha}^2}{N}\right) \end{equation}\]

These roots describe a more accurate confidence interval (also called the Wilson interval). Coming back to the original question: how do we define a good feature for a proportion-valued quantity? Taking \(\alpha = 0.05\) (so \(Z_{\alpha} = 1.96\)), the two features to add are:

\[\begin{equation} f_{1} = \left(f + \frac{3.84}{2N} - 1.96\sqrt{\frac{f(1-f)}{N} + \frac{3.84}{4N^2}}\right) / \left(1+\frac{3.84}{N}\right) \end{equation}\]

and

\[\begin{equation} f_{2} = \left(f + \frac{3.84}{2N} + 1.96\sqrt{\frac{f(1-f)}{N} + \frac{3.84}{4N^2}}\right) / \left(1+\frac{3.84}{N}\right) \end{equation}\]

where \(f\) is the originally observed rate-based feature and \(N\) is the number of impressions.
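A minimal sketch in Python of how these two features could be computed from raw counts (the zero-impression fallback is an arbitrary choice of mine, not part of the derivation):

```python
import math


def wilson_bound_features(clicks, impressions, z=1.96):
    """Lower and upper Wilson interval bounds (f1, f2) for an observed click rate."""
    if impressions == 0:
        return 0.0, 1.0  # assumption: no evidence yet, so maximal uncertainty
    f = clicks / impressions
    n = impressions
    center = f + z * z / (2 * n)
    margin = z * math.sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - margin) / denom, (center + margin) / denom


# A new item: 1 click out of 10 impressions
print(wilson_bound_features(1, 10))      # roughly (0.018, 0.404)
# A mature item with the same observed rate of 0.1
print(wilson_bound_features(100, 1000))  # roughly (0.083, 0.120)
```

Note how both features stay within \([0, 1]\) and tighten around the observed rate as the item accumulates impressions, which is exactly the behaviour we wanted.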

Conclusion

We saw how rate-based features used in ML systems such as ranking can be tricky to get right. We showed the power of statistical tools for such use cases by delving deeper into the derivations. Along the way, we built some foundational understanding of confidence intervals, hypothesis testing, and test inversion. In a future post (part 2), we will continue exploring more uses of statistical tools for applied ML.

Notes

Please report errata and/or leave any comments/suggestions here.

Disclaimer: Information presented here does not represent views of my current or past employers. Opinions are my own.