Probability Basics
Three important probability inequalities¶
- Markov’s inequality
- If $X$ is a nonnegative random variable, then for all $t>0$, $$ \mathbb{P}[X \geq t] \leq \frac{\mathbb{E}[X]}{t} $$
- For a nonnegative random variable, the probability that it takes a value at least $t$ is bounded above by the ratio of its expectation to $t$.
- Intuition: the distribution of the random variable is constrained by its expectation. For example, if we take $t$ to be twice the expectation, the probability that the variable exceeds $t$ is at most one half.
- Proof idea: define the indicator variable $$ I= \begin{cases}1 & \text{if } X \geq t \\ 0 & \text{if } X<t\end{cases} $$ whose expectation is exactly $\mathbb{P}[X \geq t]$. Since $t I \leq X$ pointwise, taking expectations of both sides gives $t \cdot \mathbb{P}[X \geq t] \leq \mathbb{E}[X]$.
- Chebyshev’s inequality
- For all $t>0$, $$ \mathbb{P}[|X-\mathbb{E}[X]| \geq t] \leq \frac{\operatorname{Var}[X]}{t^{2}} $$
- Intuition: the probability that a random variable lands far from its expectation is bounded by its variance.
- Proof idea: since squaring is increasing on the nonnegative reals, $\mathbb{P}[|X-\mathbb{E}[X]| \geq t]=\mathbb{P}\left[(X-\mathbb{E}[X])^{2} \geq t^{2}\right]$; applying Markov's inequality to the nonnegative variable $(X-\mathbb{E}[X])^{2}$ finishes the proof.
- Chernoff-Hoeffding
- Lemma 3. Let $X_{1}, X_{2}, \cdots, X_{n}$ be $n$ independent random variables such that $a_{i} \leq X_{i} \leq b_{i}$ for all $i$, and let $X=\sum_{i=1}^{n} X_{i}$; then $$ \mathbb{P}[|X-\mathbb{E}[X]| \geq t] \leq 2 \cdot e^{-\frac{2 t^{2}}{\sum_{i=1}^{n}\left(b_{i}-a_{i}\right)^{2}}} $$
- Intuition: whereas Chebyshev concerns a single random variable, this bounds the deviation of a sum of independent random variables.
- Note: when $t$ is large, Chernoff-Hoeffding gives a much better (smaller) tail bound than Chebyshev. Take the sum of $n$ independent fair 0-1 Bernoulli variables as an example: its expectation and variance are $\frac{n}{2}$ and $\frac{n}{4}$, respectively. Plugging these in, for $t = \Theta(\sqrt{n})$ the two bounds are comparable, while for $t = \Theta(n)$ they are of order $\Theta(1/n)$ and $e^{-\Theta(n)}$, respectively; see the numeric comparison below.
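The gap is easy to check numerically. Below is a minimal sketch (the values of $n$ and $t$ are arbitrary illustrative choices) that evaluates both tail bounds for this Bernoulli example:

```python
import math

# Tail bounds for the sum of n independent fair 0-1 Bernoulli variables:
# Chebyshev uses Var[X] = n/4; Chernoff-Hoeffding uses b_i - a_i = 1.
n = 10_000

for t in [int(math.sqrt(n)), n // 10]:
    chebyshev = (n / 4) / t**2
    hoeffding = 2 * math.exp(-2 * t**2 / n)
    print(f"t={t}: Chebyshev <= {chebyshev:.3e}, Chernoff-Hoeffding <= {hoeffding:.3e}")
```

At $t=\sqrt{n}=100$ both bounds are constants, while at $t=n/10=1000$ the Chernoff-Hoeffding bound is astronomically smaller.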
Morris’s algorithm¶
Morris's algorithm is defined as follows:
- initialize: $s \leftarrow 0$
- update: if the bit is 1, increment $s$ with probability $1/2^{s}$
- estimate: output $\tilde{m}=2^{s}-1$
Remarks
- The problem is simple to state: given a stream of bits, count the number $m$ of 1s. Storing the exact count takes $O(\log m)$ bits of space, whereas this algorithm stores only $s$, using $O(\log s)$ bits, which can be shown to be $O(\log \log m)$ with high probability.
- Intuition: we only record the count at the milestones $1, 2, 4, 8, \ldots$ How do we account for the skipped events? Through randomization, so that the estimate is close to the true count in expectation; see the sketch below.
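A minimal Python sketch of the algorithm as defined above (the all-ones stream at the end is a hypothetical input chosen for illustration):

```python
import random

class Morris:
    """Approximate counter that stores only s, roughly log2 of the true count."""

    def __init__(self):
        self.s = 0  # initialize: s <- 0

    def update(self, bit: int) -> None:
        # On a 1-bit, increment s with probability 1 / 2^s.
        if bit == 1 and random.random() < 1.0 / (1 << self.s):
            self.s += 1

    def estimate(self) -> int:
        # estimate: output m~ = 2^s - 1.
        return (1 << self.s) - 1

random.seed(42)
counter = Morris()
for bit in [1] * 1000:  # hypothetical stream with m = 1000 ones
    counter.update(bit)
print(counter.estimate())  # an unbiased (but high-variance) estimate of 1000
```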
Proof of correctness¶
- Expectation of $\tilde{m}$
Claim 4. $\mathbb{E}[\tilde{m}]=\mathbb{E}\left[2^{s}-1\right]=m$.

Proof. It suffices to show $\mathbb{E}\left[2^{s}\right]=m+1$. Let $s_{t}$ denote the value of $s$ after $t$ 1's have arrived and before the $(t+1)$-th 1 arrives. We next show that $\mathbb{E}\left[2^{s_{t}}\right]=t+1$:

$$ \begin{aligned} \mathbb{E}\left[2^{s_{t}}\right] &=\sum_{i=0}^{\infty} \mathbb{E}\left[2^{s_{t}} \mid s_{t-1}=i\right] \cdot \mathbb{P}\left[s_{t-1}=i\right] \\ &=\sum_{i=0}^{\infty}\left(2^{i+1} \cdot \frac{1}{2^{i}}+2^{i} \cdot\left(1-\frac{1}{2^{i}}\right)\right) \cdot \mathbb{P}\left[s_{t-1}=i\right] \\ &=\sum_{i=0}^{\infty}\left(2^{i}+1\right) \cdot \mathbb{P}\left[s_{t-1}=i\right] \\ &=\sum_{i=0}^{\infty} 2^{i} \cdot \mathbb{P}\left[s_{t-1}=i\right]+\sum_{i=0}^{\infty} \mathbb{P}\left[s_{t-1}=i\right] \\ &=\mathbb{E}\left[2^{s_{t-1}}\right]+1 . \end{aligned} $$

Since $\mathbb{E}\left[2^{s_{0}}\right]=1$, the recurrence above gives $\mathbb{E}\left[2^{s_{t}}\right]=t+1$ by induction.
- Variance of $\tilde{m}$
Claim 5. $\operatorname{Var}\left[2^{s}\right]=\frac{m^{2}}{2}-\frac{m}{2} \leq \frac{m^{2}}{2}$.

Proof. We analyze the variance of $2^{s_{t}}$. By definition, $\operatorname{Var}\left[2^{s_{t}}\right]=\mathbb{E}\left[2^{2 s_{t}}\right]-\mathbb{E}\left[2^{s_{t}}\right]^{2}$. The value of $\mathbb{E}\left[2^{s_{t}}\right]$ has already been calculated above, so we only need $\mathbb{E}\left[2^{2 s_{t}}\right]$:

$$ \begin{aligned} \mathbb{E}\left[2^{2 s_{t}}\right] &=\sum_{i=0}^{\infty} \mathbb{E}\left[2^{2 s_{t}} \mid s_{t-1}=i\right] \cdot \mathbb{P}\left[s_{t-1}=i\right] \\ &=\sum_{i=0}^{\infty}\left(2^{2 i+2} \cdot \frac{1}{2^{i}}+2^{2 i} \cdot\left(1-\frac{1}{2^{i}}\right)\right) \cdot \mathbb{P}\left[s_{t-1}=i\right] \\ &=\sum_{i=0}^{\infty}\left(2^{i+2}+2^{2 i}-2^{i}\right) \cdot \mathbb{P}\left[s_{t-1}=i\right] \\ &=3 \cdot \sum_{i=0}^{\infty} 2^{i} \cdot \mathbb{P}\left[s_{t-1}=i\right]+\sum_{i=0}^{\infty} 2^{2 i} \cdot \mathbb{P}\left[s_{t-1}=i\right] \\ &=3 \cdot \mathbb{E}\left[2^{s_{t-1}}\right]+\mathbb{E}\left[2^{2 s_{t-1}}\right] \\ &=\mathbb{E}\left[2^{2 s_{t-1}}\right]+3 t , \end{aligned} $$

where the last step uses $\mathbb{E}\left[2^{s_{t-1}}\right]=t$. Since $2^{2 s_{0}}=1$, unrolling the recurrence gives

$$ \mathbb{E}\left[2^{2 s_{t}}\right]=1+3 \cdot 1+3 \cdot 2+\cdots+3 \cdot t=1+3 \cdot \frac{t(t+1)}{2}=\frac{3}{2} t^{2}+\frac{3}{2} t+1 . $$

Hence, $\operatorname{Var}\left[2^{s_{t}}\right]=\frac{3}{2} t^{2}+\frac{3}{2} t+1-(t+1)^{2}=\frac{t^{2}}{2}-\frac{t}{2} \leq \frac{t^{2}}{2}$.
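Both claims are easy to check by simulation. A minimal Monte Carlo sketch ($m$ and the number of trials are arbitrary choices):

```python
import random
import statistics

# Run many independent Morris counters on a stream of m ones; the sample
# mean of 2^s should be close to m + 1 (Claim 4) and the sample variance
# close to m^2/2 - m/2 (Claim 5).
random.seed(0)
m, trials = 200, 20_000
values = []
for _ in range(trials):
    s = 0
    for _ in range(m):
        if random.random() < 1.0 / (1 << s):
            s += 1
    values.append(float(1 << s))

print(statistics.mean(values))       # expect roughly m + 1 = 201
print(statistics.pvariance(values))  # expect roughly m^2/2 - m/2 = 19900
```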
Morris+: averaging multiple runs¶
Running the above strategy directly gives estimates with large error (mainly because the variance is large). By Chebyshev's inequality, $$ \mathbb{P}[|\tilde{m}-m| \geq \varepsilon m] \leq \frac{m^{2} / 2}{\varepsilon^{2} m^{2}}=\frac{1}{2 \varepsilon^{2}} $$
- This bound is clearly useless on its own (for small $\varepsilon$ it exceeds $1$), so we run multiple independent copies and take the average of their outputs.
- Specifically, we want $\mathbb{P}[|\tilde{m}-m| \geq \varepsilon m] \leq \delta$ for some small $\delta<1$.
- Averaging $\ell$ independent copies divides the variance by $\ell$, so the Chebyshev bound becomes $\frac{1}{2 \ell \varepsilon^{2}}$; taking $\ell=\frac{1}{\varepsilon^{2} \delta}$ repetitions therefore suffices.
- The space complexity of the resulting algorithm is $O\left(\frac{1}{\varepsilon^{2} \delta} \log \log m\right)$.
- We call this algorithm Morris+$(\varepsilon, \delta)$: here $\varepsilon$ is the relative error and $\delta$ is the failure probability (the probability that the relative error exceeds the bound); see the sketch below.
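A minimal sketch of Morris+ under this analysis (the function names are illustrative; `morris` repeats the single-counter logic from the sketch above):

```python
import math
import random

def morris(m: int) -> int:
    """One Morris counter run on a stream of m ones; returns 2^s - 1."""
    s = 0
    for _ in range(m):
        if random.random() < 1.0 / (1 << s):
            s += 1
    return (1 << s) - 1

def morris_plus(m: int, eps: float, delta: float) -> float:
    # Average ell = ceil(1 / (eps^2 * delta)) independent estimates;
    # averaging divides the variance by ell.
    ell = math.ceil(1.0 / (eps**2 * delta))
    return sum(morris(m) for _ in range(ell)) / ell
```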
The median trick: a further improvement¶
- The median trick runs a collection of weak Morris+$(\varepsilon, \delta)$ instances (with a large failure probability $\delta$, e.g., $1/3$) and outputs the median of their results.
- The intuition is that even though each individual run is unreliable (correct in expectation, but with large variance), most of the runs will produce an answer within the error bound.
- By taking the median rather than the mean, whether each Morris+ run is correct can be treated as an independent 0-1 event, which allows us to apply the much tighter Chernoff-Hoeffding bound.
- As a result, the space complexity drops from $O\left(\frac{1}{\varepsilon^{2} \delta} \log \log m\right)$ to $O\left(\frac{1}{\varepsilon^{2}} \log \log m \log \frac{1}{\delta}\right)$: the dependence on the failure probability is reduced from $\frac{1}{\delta}$ to a logarithmic factor $\log \frac{1}{\delta}$, thanks to the exponential tail of the Chernoff-Hoeffding inequality.
The median trick: the dependence on $\delta$ can be further improved by the median trick, which works as follows. We run $h$ (to be determined later) independent copies of Morris+$(\varepsilon, 1/3)$, and let $\tilde{m}_{1}, \cdots, \tilde{m}_{h}$ be their outputs; the final output is $\tilde{m}=\operatorname{Median}\left\{\tilde{m}_{1}, \cdots, \tilde{m}_{h}\right\}$. Note that the total space is $O\left(\frac{h}{\varepsilon^{2}} \log \log m\right)$. How large should $h$ be?
By the guarantee of Morris+$(\varepsilon, 1/3)$, we have $\mathbb{P}\left[\left|m-\tilde{m}_{i}\right| \geq \varepsilon m\right] \leq 1/3$ for all $i \in[1, h]$. For each $i \in[1, h]$, define $$ I_{i}= \begin{cases}1 & \text{if }\left|m-\tilde{m}_{i}\right| \geq \varepsilon m \\ 0 & \text{otherwise}\end{cases} $$ Let $I=\sum_{i=1}^{h} I_{i}$, the total number of Morris+$(\varepsilon, 1/3)$ copies that are "wrong". Since $\mathbb{E}\left[I_{i}\right] \leq 1/3$, $$ \mathbb{E}[I] \leq h/3 . $$ Each $I_{i}$ lies in $[0,1]$, so by Chernoff-Hoeffding, $$ \mathbb{P}[|I-\mathbb{E}[I]| \geq h/6] \leq 2 \cdot e^{-\frac{2(h/6)^{2}}{h}}=2 \cdot e^{-h/18} . $$ By setting $h=O\left(\log \frac{1}{\delta}\right)$, the above probability is smaller than $\delta$. Let $\mathcal{A}$ denote the event $|I-\mathbb{E}[I]|<h/6$. Conditioned on $\mathcal{A}$, we have $$ I<h/6+\mathbb{E}[I] \leq h/6+h/3=h/2 . $$ This means that at least half of the $h$ copies of Morris+ are correct. Consequently, the median of them must be correct, i.e., $|\tilde{m}-m| \leq \varepsilon m$ (why?). This occurs whenever $\mathcal{A}$ happens, which has probability at least $1-\delta$ for $h=O\left(\log \frac{1}{\delta}\right)$. The total space is thus $O\left(\frac{1}{\varepsilon^{2}} \log \log m \log \frac{1}{\delta}\right)$.
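Putting everything together, a sketch of the final estimator (assuming `morris_plus` from the sketch above; the constant 18 is the illustrative choice obtained by solving $2 e^{-h/18} \leq \delta$ for $h$):

```python
import math
import statistics

def morris_median(m: int, eps: float, delta: float) -> float:
    # Run h = O(log(1/delta)) copies of Morris+(eps, 1/3) and output
    # the median of their estimates.
    h = max(1, math.ceil(18 * math.log(2.0 / delta)))
    estimates = [morris_plus(m, eps, 1.0 / 3.0) for _ in range(h)]
    return statistics.median(estimates)
```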