繁星客栈 (Fanxing Inn) - The Probability That the Sun Will Rise Again Tomorrow

gauge

The Probability That the Sun Will Rise Again Tomorrow [Post type: Original]

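What is the probability that the sun will rise again tomorrow? This is a probability problem posed and solved by Laplace. Understanding it is a great help in understanding Bayesian statistical inference.

The sun either rises or it does not; the two outcomes are mutually exclusive. Whether the sun rises tomorrow is a determinate event, so it is meaningless to talk about an objective probability of the sun rising. The question can be restated as: what do you believe is the probability that the sun will rise again tomorrow? This turns it into a subjective judgment about an objective event by a person, indeed by one particular person. For determinate events an objective probability is always meaningless, so it seems the only option is to treat such seemingly random events by means of subjective probability.

We discuss the problem in the manner of Bayesian inference, with the following notation. SunRise = ``the sun rises''. NotRise = ``the sun does not rise''. Suppose $P(SunRise)=p$; then $p\in[0,1]$, but we do not know the exact value of $p$. In fact, the exact value of $p$ is precisely what we want to infer. Since we do not know what $p$ equals, we need to assess, for every number in $[0,1]$, how plausible it is that this number is the true value of $p$. A Bayesian prior distribution is exactly such an assessment. That is, suppose the prior distribution has density $w(p)$ with respect to Lebesgue measure. This means that our assessment assigns plausibility $\int_a^bw(p)dp$ to the statement ``$p\in[a,b]\subseteq[0,1]$''.

Now follow the Bayesian method of inference. Suppose we observe the fact that the sun rose $k$ times in the past $N$ days. We then need to revise our assessment of the sunrise probability $p$ to the posterior density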
$$
w(p|N,k)=\frac{w(p)p^k(1-p)^{N-k}}
{\int_0^1w(p)p^k(1-p)^{N-k}dp}.
$$
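This is still an assessment of $p$, namely an assessment of the true value of $p$ after observing the fact $(N,k)$. We need to give one concrete value; in other words, we need a point estimate of $p$. The Bayesian approach is to compute the mean, that is, to take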
$$
\hat{p}=\int_0^1p w(p|N,k)dp,
$$
as the subjective probability of ``the sun rises''. To compute a definite value we need to know the prior distribution $w(p)$.

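Next, we need to choose a prior distribution so that $\hat{p}$ can be computed. Laplace held that an even-handed assumption is reasonable, namely that every value of $p$ in $[0,1]$ is equally likely. That is, the prior density is $w(p)=1$. This gives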
$$
\hat{p}=\frac{k+1}{N+2}.
$$
This value is known as Laplace's rule.
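As a rough numerical check of the rule, the sketch below approximates the two integrals in the posterior mean by a midpoint sum and compares the result with $(k+1)/(N+2)$. The function name `posterior_mean`, the step count, and the sample values $N=10$, $k=7$ are illustrative assumptions, not part of Laplace's argument.

```python
def posterior_mean(w, N, k, steps=100_000):
    """Approximate the posterior mean of p for a prior density w on [0, 1].

    Uses a plain midpoint rule for both the numerator and the
    normalizing integral; the step width cancels in the ratio.
    """
    h = 1.0 / steps
    num = 0.0
    den = 0.0
    for i in range(steps):
        p = (i + 0.5) * h                          # midpoint of the i-th cell
        weight = w(p) * p**k * (1 - p)**(N - k)    # prior times likelihood
        num += p * weight
        den += weight
    return num / den


def uniform_prior(p):
    """Laplace's choice w(p) = 1."""
    return 1.0


if __name__ == "__main__":
    N, k = 10, 7                                   # illustrative data
    print(posterior_mean(uniform_prior, N, k))     # numerical estimate
    print((k + 1) / (N + 2))                       # Laplace's rule
```

Passing a different function for `w` gives the corresponding estimate for any other prior density that is finite on $(0,1)$.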

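Suppose the sun has always risen throughout human history. This is reasonable. Suppose further that human civilization began with Adam, or after the Great Flood, or in 6000 BC, which is roughly
$$
6000\times365\approx2\times10^6\ \text{days}
$$
ago. We may therefore take the probability that the sun does not rise tomorrow to be five in a million. Of course there is room for debate here, for example about the day from which the Bayesian updating should start. Reasonable, acceptable choices include at least the following:

$(1)$ Start from the day a person is born.

$(2)$ Start from some trustworthy date at which sunrises began to be recorded.

Both choices give a shorter record than the one above, so the resulting estimate of the sunrise probability is smaller, and the probability of no sunrise correspondingly larger, than the figure above.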
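The arithmetic can be checked with a few lines of Python. This is only a sketch: it assumes $k=N$ (a sunrise on every recorded day), uses the uniform-prior rule $1-\hat{p}=1/(N+2)$, and the 30-year-old observer is a made-up illustration of choice $(1)$. With $N=6000\times365$ the value comes out near $4.6\times10^{-7}$, roughly half a millionth, which is the figure the first reply in this thread arrives at.

```python
from fractions import Fraction


def prob_no_sunrise(N):
    """1 - (k+1)/(N+2) with k = N, i.e. a sunrise on every recorded day."""
    return 1 - Fraction(N + 1, N + 2)


# Record lengths in days; the second one is a hypothetical example.
records = {
    "6000 years of history": 6000 * 365,
    "a 30-year-old observer": 30 * 365,
}

for label, N in records.items():
    p = prob_no_sunrise(N)
    print(f"{label}: N = {N}, P(no sunrise) = {p} ~ {float(p):.3g}")
```

The shorter the record, the larger the estimated probability of no sunrise.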

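Choosing the uniform distribution $w(p)=1$ here looks reasonable, but it is not. We already know that this uniformity depends heavily on the background Lebesgue measure, that is, on taking $p$ itself as the parameter. How to choose the prior distribution is the main theoretical ingredient of Bayesian statistical inference, and choosing one is not easy. The disagreements within the Bayesian school arise here too, splitting it into subjective Bayesians and objective Bayesians. Whoever proposes a prior always has to give some argument for why it is better in certain respects than other distributions. Different people are fully entitled to propose completely different priors. For a blind man, whether the sun rises has little effect; he may even stubbornly believe that the sun has never risen, and he may of course deny that the sun exists at all, only he would be wrong. The prior chosen by this blind man is then a $\delta$-type distribution whose probability density is concentrated at the single point $p=0$. Of course this distribution actually carries no randomness. Under Bayesian statistical inference, this blind man will then forever believe that the probability of "the sun rises tomorrow" equals 0, whether the past consists of 1 day or 1000 days, with the sun rising on every one of them. As long as this hugely prejudiced blind man reasons in the Bayesian way, he will keep denying that there is any chance of the sun rising tomorrow. The same argument applies to someone who believes in advance that the probability of "the sun rises tomorrow" equals 1: he will forever believe that the sun will rise tomorrow beyond any doubt. Since Bayesian inference is concerned with an individual's view of the objective world, choosing the prior seems to be that person's own business and nobody else's. Seeing things this way is exactly the subjective Bayesian viewpoint. However, subjective judgment must in the end face the objective world, and the gap between subjective prediction and objective world will make a person who chose an inappropriate prior suffer the loss he deserves.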
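The persistence of a dogmatic prior can be illustrated on a finite grid of candidate values of $p$. The sketch below is an assumption-laden toy, not a model of anyone's actual beliefs: the grid, the three priors, and the names `open_minded`, `dogmatist`, and `sceptic` are all invented for illustration. It shows that a value of $p$ given zero prior weight never receives posterior weight, however many sunrises are observed.

```python
def grid_posterior(prior, grid, N, k):
    """Bayes update on a finite grid of candidate values of p."""
    unnorm = [w * p**k * (1 - p)**(N - k) for w, p in zip(prior, grid)]
    total = sum(unnorm)
    if total == 0:
        # Happens when the prior gives the observed data probability zero,
        # e.g. all prior mass at p = 0 after an observed sunrise.
        raise ValueError("the data have zero probability under this prior")
    return [u / total for u in unnorm]


def posterior_mean(prior, grid, N, k):
    """Posterior mean of p on the grid."""
    post = grid_posterior(prior, grid, N, k)
    return sum(w * p for w, p in zip(post, grid))


grid = [i / 10 for i in range(11)]          # p = 0.0, 0.1, ..., 1.0
open_minded = [1 / 11] * 11                 # equal weight everywhere
dogmatist = [0.0] * 10 + [1.0]              # all weight on p = 1
sceptic = [1 / 6] * 6 + [0.0] * 5           # rules out every p > 0.5

for N in (1, 1000):                         # the sun rose on all N days
    print(N,
          posterior_mean(open_minded, grid, N, N),
          posterior_mean(dogmatist, grid, N, N),
          posterior_mean(sceptic, grid, N, N))
```

The open-minded estimate moves toward 1 as the record grows, the dogmatist stays at 1 from the start, and the sceptic can never get past 0.5.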

In fact, for a Bernoulli model, the theoretically best distribution is the Jeffreys distribution, which has a geometric meaning, namely
$$
w(p)=\frac{1}{\pi\sqrt{p(1-p)}}.
$$
Let $p=\sin^2\theta,\theta\in[0,\pi/2]$; then $w(p)$ becomes a distribution on $\theta\in[0,\pi/2]$. It is easy to compute
$$
u(\theta)d\theta=w(p)dp=\frac{dp}{\pi\sqrt{p(1-p)}}=\frac{2}{\pi}d\theta.
$$
which is exactly the uniform distribution on $[0,\pi/2]$. Taking $\theta$ as the parameter, the corresponding posterior distribution is
\begin{eqnarray*}
u(\theta|N,k)
&=&\frac{u(\theta)\sin^{2k}\theta\cos^{2(N-k)}\theta}
{\int_0^{\pi/2}u(\theta)\sin^{2k}\theta\cos^{2(N-k)}\theta d\theta}\\
&=&\frac{\sin^{2k}\theta\cos^{2(N-k)}\theta}
{\int_0^{\pi/2}\sin^{2k}\theta\cos^{2(N-k)}\theta d\theta}.
\end{eqnarray*}

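Note that computing the Bayesian point estimate for this distribution is rather troublesome. This is exactly one of the difficulties in choosing a Bayesian prior: a reasonable distribution is not necessarily easy to compute with, all the more so in the days before computers.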
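With a computer the point estimate is a short loop. The sketch below integrates the $\theta$-posterior with a midpoint rule and takes the mean of $p=\sin^2\theta$. The function name, step count, and sample data are illustrative assumptions. For comparison it also prints $(k+\tfrac12)/(N+1)$, the value the beta integral gives for the prior $1/(\pi\sqrt{p(1-p)})$.

```python
import math


def jeffreys_posterior_mean(N, k, steps=20_000):
    """Posterior mean of p = sin^2(theta), theta uniform on [0, pi/2]."""
    h = (math.pi / 2) / steps
    num = 0.0
    den = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h                 # midpoint of the i-th cell
        s2 = math.sin(theta) ** 2             # p
        c2 = math.cos(theta) ** 2             # 1 - p
        weight = s2**k * c2**(N - k)          # likelihood; constant prior cancels
        num += s2 * weight
        den += weight
    return num / den


if __name__ == "__main__":
    N, k = 10, 7                              # illustrative data
    print(jeffreys_posterior_mean(N, k))      # numerical estimate
    print((k + 0.5) / (N + 1))                # closed form from the beta integral
```

Working in $\theta$ rather than $p$ keeps the integrand bounded, since the endpoint singularities of the density in $p$ are absorbed by the change of variable.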

Posted: 2007-02-04, 04:23:17

laworder

Re: The Probability That the Sun Will Rise Again Tomorrow [Post type: Original]

A nice piece.

>>We may therefore take the probability that the sun does not rise tomorrow to be five in a million

The probability of that horrible event should be $5 \times 10^{-7}$, i.e., half a millionth, which should let us sleep better.

That humans can understand the world is the greatest mystery of science.

Posted: 2007-02-04, 08:23:14

大漠孤狼

Re: The Probability That the Sun Will Rise Again Tomorrow [Post type: Original]

gauge, could you recommend a few textbooks on probability theory and Bayesian statistics, preferably in Chinese?

Posted: 2007-02-04, 23:17:59

Omni

Comments (Part 1) [Post type: Original]

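Your choice of topic for this post is excellent; I wanted to write about it last year but never managed to free up a large block of time. Choosing the binomial distribution model to explain Bayesian inference is undoubtedly the best choice, and it is the very probability model that the Reverend Thomas Bayes used in his paper. This seems to be one of the few models in which the integrals in the Bayes' theorem formula can be handled in closed form.

Tonight, still in good spirits after watching the Super Bowl, I will comment on your fine post and add some personal understanding for discussion. To save time and be precise, the comments below are mainly in English:

>>We need to give one concrete value; in other words, we need a point estimate of $p$. The Bayesian approach is to compute the mean...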

It's not an ideal way to introduce Bayesian inference by using an example of point estimates. A better choice is to calculate a more complex probability than $\hat{p}$ to distinguish the Bayesian methodology from Fisher's maximum likelihood estimate (MLE). For example, we can ask what's the probability of "the sun not rising for the next 3 mornings in a row". In this case, classical statistics has to calculate $(1 - \hat{p})^3$ by using the MLE point estimate of $p$. In stark contrast, Bayesian statistics will calculate this estimated probability as

$$
\mathrm{Prob}(\text{NotRise 3 Days}) = \int_0^1 (1-p)^3 w(p|N,k)dp \qquad (1)
$$

A casual reader of your post may misunderstand the Bayesian methodology: they might use your calculation to get the Bayesian point estimate of the parameter $p$ (i.e., your $\hat{p}$) and then follow it with a naive "semi-classical" calculation of $(1 - \hat{p})^3$, which gives a wrong answer. It is easy to see that this number will be quite different from the correct answer calculated by Equation (1) above.
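To make the gap concrete, here is a small sketch that evaluates Equation (1) exactly, assuming Laplace's uniform prior from the original post, so that the posterior is proportional to $p^k(1-p)^{N-k}$ and the beta integral gives $E[(1-p)^m]=\prod_{j=0}^{m-1}\frac{N-k+1+j}{N+2+j}$. The function names and the sample data $N=k=10$ are illustrative assumptions.

```python
from fractions import Fraction


def prob_no_rise_days(N, k, m):
    """E[(1-p)^m] under the posterior from a uniform prior (Equation (1) for m = 3)."""
    result = Fraction(1)
    for j in range(m):
        result *= Fraction(N - k + 1 + j, N + 2 + j)
    return result


def plug_in(N, k, m):
    """Naive 'semi-classical' value (1 - p_hat)^m with p_hat = (k+1)/(N+2)."""
    return (1 - Fraction(k + 1, N + 2)) ** m


def mle_plug_in(N, k, m):
    """Classical value (1 - k/N)^m using the maximum likelihood estimate."""
    return (1 - Fraction(k, N)) ** m


if __name__ == "__main__":
    N, k, m = 10, 10, 3                       # ten sunrises in ten days
    print("Equation (1):  ", prob_no_rise_days(N, k, m))
    print("Bayes plug-in: ", plug_in(N, k, m))
    print("MLE plug-in:   ", mle_plug_in(N, k, m))
```

For $m=1$ the first two functions agree, which is why the single-day question in the original post does not expose the difference.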

The spirit of Bayesian statistical inference is to write down exactly the probability we want to infer, in terms only of the data we know (in your example, the data is "the Sun has risen k times in the past N days"), and directly calculate the resulting complicated integrals. One distinctive feature of a Bayesian approach is that when we need to invoke uncertain parameters in the problem, we do NOT attempt to make point estimates of these parameters; instead, we treat them as random variables and deal with uncertainty more rigorously, by integrating over all possible values that a parameter might assume ("integrating over" has the mathematical effect of "averaging over").

Of course, in the simple problem posed by the title of your article: "Will the sun rise tomorrow", only a simple point estimate of $p$ is needed. But this simple example seems to obscure the awesome power of Bayesian inference.

>>We then need to revise our assessment of the sunrise probability $p$ to the posterior density
$$
w(p|N,k)=\frac{w(p)p^k(1-p)^{N-k}}
{\int_0^1w(p)p^k(1-p)^{N-k}dp}.
$$

I think it's better to write out the regular form of Bayes Theorem before the derivation of your formula for w(p|N,k):

$$
P(\text{model}\mid\text{data}) = \frac{P(\text{model})\,P(\text{data}\mid\text{model})}{P(\text{data})} \qquad (2)
$$

where P(model) is the prior probability density of the unknown parameters, P(data|model) is the likelihood function, P(model|data) is the posterior probability density of the unknown parameters to be estimated, and

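$$
P(\text{data}) = \sum_{\text{model}} P(\text{model})\,P(\text{data}\mid\text{model}) \quad \text{for discrete unknown model parameters}
$$

or

$$
P(\text{data}) = \int P(\text{model})\,P(\text{data}\mid\text{model})\,d(\text{model}) \quad \text{for continuous unknown model parameters}
$$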

Since the unknown model parameters are "summed out" or "integrated out" in the calculation of P(data), P(data) is independent of the model parameters in Equation (2). Then P(data) can be viewed as a normalization constant in (2).

The deceptively simple-looking Equation (2) contains the quintessence of Bayesian statistics --- the "inverse probability problem". The amazing thing is that this deep Bayes Theorem can be derived from a trivial algebraic truism from the definition of conditional probabilities. In my opinion, this really exemplifies the beauty of mathematics!

Now we can use (2) to derive Gauge's formula for w(p|N,k) which is simply the explicit form of the posterior probability density "P(model|data)". Note that Gauge also implicitly canceled out the binomial coefficient N!/(k!(N-k)!) which shows up both in the numerator and in the denominator.

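>>Next, we need to choose a prior distribution so that $\hat{p}$ can be computed. Laplace held that an even-handed assumption is reasonable, namely that every value of $p$ in $[0,1]$ is equally likely. That is, the prior density is $w(p)=1$. This gives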
$$
\hat{p}=\frac{k+1}{N+2}.
$$
This value is known as Laplace's rule.

It would be better to state that the prior density for $p$ chosen by Laplace, $w(p)=1$, is known as the "uniform distribution". Because it is uniform, it is a constant, and it cancels out of the Bayes equation just like the binomial coefficient mentioned above. It may also help to write out the full calculation steps skipped by Gauge:

$$
\hat{p} = \int_0^1 p\,w(p|N,k)dp = \frac{\int_0^1 p^{k+1}(1-p)^{N-k}dp}{\int_0^1 p^k(1-p)^{N-k}dp} \qquad (3)
$$

[Note]: Here we simply plug in Gauge's formula for $w(p|N,k)$.

Fortunately, these complicated integrals happen to have analytical solutions thanks to the great Euler's Beta and Gamma functions. The so-called "beta integral" is given by the formula:

$$
\int_0^1 p^{m-1}(1-p)^{n-1}dp = \frac{\Gamma(m)\Gamma(n)}{\Gamma(m+n)} \qquad (4)
$$

where $\Gamma(n+1) = n!$ for a nonnegative integer $n$. Using (4) for both the numerator and the denominator of (3), we obtain

$$
\hat{p} = \frac{\Gamma(k+2)\Gamma(N-k+1)/\Gamma(N+3)}{\Gamma(k+1)\Gamma(N-k+1)/\Gamma(N+2)} = \frac{(k+1)!\,(N+1)!}{k!\,(N+2)!} = \frac{k+1}{N+2} \qquad (5)
$$

which is the same as the Laplace rule given by Gauge.
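For readers who want to check (4) and (5) mechanically, here is a minimal sketch using exact rational arithmetic. It assumes integer arguments, so the Gamma function reduces to factorials; the names `beta_integral` and `laplace_estimate` are invented for illustration.

```python
from fractions import Fraction
from math import factorial


def beta_integral(m, n):
    """Equation (4) for positive integers: Gamma(m) * Gamma(n) / Gamma(m + n)."""
    return Fraction(factorial(m - 1) * factorial(n - 1), factorial(m + n - 1))


def laplace_estimate(N, k):
    """Equation (5): ratio of the two beta integrals in Equation (3)."""
    numerator = beta_integral(k + 2, N - k + 1)      # exponents k+1 and N-k
    denominator = beta_integral(k + 1, N - k + 1)    # exponents k and N-k
    return numerator / denominator


if __name__ == "__main__":
    # Compare against (k+1)/(N+2) for every 0 <= k <= N <= 20.
    ok = all(
        laplace_estimate(N, k) == Fraction(k + 1, N + 2)
        for N in range(21)
        for k in range(N + 1)
    )
    print("rule of succession reproduced:", ok)
```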

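Posted: 2007-02-05, 00:19:22

Omni

Comments (Part 2) [Post type: Original]

>>Note that computing the Bayesian point estimate for this distribution is rather troublesome. This is exactly one of the difficulties in choosing a Bayesian prior: a reasonable distribution is not necessarily easy to compute with, all the more so in the days before computers.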

Very good point! That's exactly the reason why Thomas Bayes was lucky to choose the binomial probability model for his calculations and got the major credit for Bayesian statistics. Many statisticians would argue that Laplace was really the first "Bayesian", hehe.

There are three major difficulties with Bayesian statistics:

A. Computational difficulty. Most Bayesian integrals don't have analytical solutions in closed form and instead require computationally intensive numerical methods (such as Markov Chain Monte Carlo (MCMC) methods).
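To give a feel for what an MCMC method looks like, here is a toy random-walk Metropolis sampler for the posterior of $p$ in the sunrise model. It is only a sketch under stated assumptions: the prior is uniform, the data $N=10$, $k=7$ are made up, and the step size, sample count, and seed are arbitrary choices. MCMC is not needed for this model, since the answer $(k+1)/(N+2)$ is available in closed form, which is what makes it a convenient test case.

```python
import math
import random


def log_posterior(p, N, k):
    """Unnormalized log posterior for a uniform prior on (0, 1)."""
    if not 0.0 < p < 1.0:
        return -math.inf                      # outside the support
    return k * math.log(p) + (N - k) * math.log(1.0 - p)


def metropolis_mean(N, k, n_samples=50_000, step=0.2, seed=0):
    """Estimate the posterior mean of p with random-walk Metropolis."""
    rng = random.Random(seed)
    p = 0.5                                   # arbitrary starting point
    logp = log_posterior(p, N, k)
    total = 0.0
    for _ in range(n_samples):
        proposal = p + rng.uniform(-step, step)
        logq = log_posterior(proposal, N, k)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, logq - logp)):
            p, logp = proposal, logq
        total += p
    return total / n_samples


if __name__ == "__main__":
    N, k = 10, 7                              # illustrative data
    print("MCMC estimate:", metropolis_mean(N, k))
    print("closed form:  ", (k + 1) / (N + 2))
```

The sampler only ever uses ratios of the unnormalized posterior, so the normalizing integral never has to be computed.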

B. The choice of prior probability distributions. Gauge already had nice coverage of this topic. Bayesian inferences generally assume "uninformative" priors in many cases. Many frequentist statisticians like to use this difficulty to attack the Bayesian methodology, but the truth is that many non-Bayesian methods make assumptions implicitly. A case in point is Fisher's celebrated ANOVA method, invented in 1925: it actually makes three major assumptions!

C. The treatment of unknown parameters as random variables. This is against many scientists' intuition and philosophical belief. For example, how can a physicist treat the mass of the Earth as a random variable? This is too controversial a topic for me to dig into here. I simply want to point out the fact that classical statisticians face a similar difficulty --- they have to approach the estimate of the mass of the Earth by imagining that we can measure it many, many times to come up with a confidence interval. Their interpretation of the confidence interval of this "unknown constant" is really awkward and unintuitive.

Finally, we can summarize three distinctive features of Bayesian statistics as the take-home message:

I. When dealing with unknown parameters, we don't attempt to make a point estimate as in the case of classical MLE. Rather, we integrate over all possible values of the parameter by treating it as a random variable.

II. The use of inverse probability calculations and Bayes Theorem in a systematic fashion.

III. The use of probability to represent a degree of belief.

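Posted: 2007-02-05, 00:20:32

gauge

Re: The Probability That the Sun Will Rise Again Tomorrow [Post type: Original]

To Omni: many thanks for the comments, which are very good. It is hard to cover everything when writing.
To 大漠: actually I have been studying probability and statistics for only half a year, and what I wrote down are some personal impressions. To be honest I have not finished a single statistics book; I find reading books uncomfortable. Generally, whatever I need I search for online. Wiki is a very good place, but it cannot be reached from within China, so I found a similar site, answer.com, many of whose discussions are better than the great majority of textbooks. When I read books these days I do not insist on thorough understanding: I look at the conclusion first, then consider whether it is reasonable, and leaf through the proof if necessary. If I cannot understand it, I hunt for material everywhere, again online of course. In some people's words, I think about problems while learning. A theory can only be truly understood by using it oneself, and by using I do not mean doing exercises.

Laplace's rule is usually called the Laplace success rule, which it seems should be rendered in Chinese as "Laplace成功法则" (Laplace's rule of success).

I think the other issue Omni raises is the confidence interval problem of frequentist statistics. This problem seems to have troubled a great many people. The Encyclopaedia Britannica defines statistics as the art of collecting, organizing, and analyzing data. In a sense, statistics is a somewhat mystical "science" that is very particular about verbal expression, and statisticians often play word games. The intuitive meaning of a confidence interval gave me a headache for at least several weeks. For this I went through fuzzy mathematics, interval probability theory, and similar junk, hoping for help; as it turned out, these theories were of no use. In the end I did arrive at a formulation that at least satisfies me, but my English is too poor, so I have not been able to write it up. Statistics demands far more of one's English than ordinary mathematics does. If Omni is interested, we can discuss it. Also, my interest in Jaynes's logical probability theory likewise comes from hoping to finally sort out the intuitive meaning of statistics. So far I have not figured out what the essential difference between logical probability and subjective probability is; it looks as if logical probability supplies a "solid" foundation for subjective probability rather than a wholly new view of probability. By solid foundation I only mean that this is the logical probabilists' own view; I do not share it. Subjective probability cannot, in any case, explain the law of large numbers, and this is a fatal flaw beyond repair. I will discuss this in a separate post when I get the chance.

Confidence interval theory is highly controversial; for a while a very good medical journal even announced that it would refuse papers that interpreted statistical data using confidence intervals. Another well-known example is the efficacy of a drug for heart disease, which was exaggerated to twice its actual effect. This happened between 1990 and 1995, probably in Britain. Many clinicians published efficacy reports showing that the new drug was twice as effective as the drugs used before, and it was hailed as good news for heart patients. Some of these papers appeared in The Lancet, one of the best journals in medicine. There are several versions of how to explain the exaggerated effect. One explanation is that the attending doctors, intentionally or not, selected the trial subjects; this is just like traditional Chinese medicine treating SARS: if only the milder cases are treated, the results naturally look better. Another explanation is plain fraud by some people. Yet another attributes it to confidence interval theory; this is the view of some Bayesians. So where exactly did things go wrong? With the data in hand, perhaps yet another explanation could be given.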

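Posted: 2007-02-05, 01:44:35

星空浩淼

Re: The Probability That the Sun Will Rise Again Tomorrow [Post type: Original]

Whether it is probability and statistics in mathematics, statistical physics in physics, or random signal analysis in engineering, this feels like a relatively independent yet rich field. There is even a "stochastic calculus" in which limits are defined by means of probability. How wonderful!

It is clear that gauge is an expert in this area.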

One may view the world with the p-eye and one may view it with the q-eye but if one opens both eyes simultaneously then one gets crazy

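Posted: 2007-02-05, 05:35:11

卢昌海

Re: The Probability That the Sun Will Rise Again Tomorrow [Post type: Original]

:: Laplace's rule is usually called the Laplace success rule, which it seems should be rendered in Chinese as "Laplace成功法则" (Laplace's rule of success)

This rule is usually called Laplace's Rule of Succession. Although succession and success look alike, their meanings are quite different. If the full name is to be translated, renderings such as "Laplace 逐次法则" or "Laplace 累次法则" (both roughly "Laplace's rule of successive trials") would be better.

Unmoved by favor or disgrace, I watch the flowers bloom and fall before the courtyard;
indifferent to staying or leaving, I gaze at the clouds gathering and unfurling in the sky.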

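Posted: 2007-02-05, 07:22:05

gauge

Re: The Probability That the Sun Will Rise Again Tomorrow [Post type: Original]

昌海's point is exactly why I did not write out the full name of Laplace's rule in the first post. Actually I did think that the word meant "successive", but felt that this did not adequately reflect its meaning. I also did not remember it accurately and took succession for success. In fact I was unsure from the start how succession should be translated, and in my heart I wanted to translate it as "success", meaning the probability of success on the next trial. In the end I worked backwards from that Chinese word and changed the English too. This shows that preconceptions are everywhere.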

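Posted: 2007-02-05, 10:37:36

Omni

Not Seeking Thorough Understanding (不求甚解) [Post type: Original]

>>When I read books these days I do not insist on thorough understanding: I look at the conclusion first, then consider whether it is reasonable, and leaf through the proof if necessary. If I cannot understand it, I hunt for material everywhere, again online of course. In some people's words, I think about problems while learning. A theory can only be truly understood by using it oneself, and by using I do not mean doing exercises.

What gauge says here is very much to my liking. The key to mastering statistical theory lies in real hands-on Exploratory Data Analysis (EDA), what many people on this forum call "fighting as you march" (边走边打). As for "not seeking thorough understanding", I think it is a necessary strategy in the age of the knowledge explosion: seek thorough understanding and dig deep only when necessary. In most cases we can only pursue breadth first and depth afterwards.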

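The views expressed by Mr. Yi Zhongtian (易中天) in his commentary on Zhuge Liang are worth sharing here:

"...But on the other hand, Zhuge Liang read books rather carelessly. The Records of the Three Kingdoms (《三国志》) puts it as 'surveying the general outline' (观其大略): Zhuge Liang's friends all read very conscientiously, weighing every word, whereas Zhuge Liang would pick up a book, take in ten lines at a glance, and survey its general outline, which corresponds to what Tao Yuanming later described as 'fond of reading, but not seeking thorough understanding' (好读书，不求甚解). This is in fact knowing how to read: surveying the general outline means being able to grasp the essence, and not seeking thorough understanding means being good at seizing the key points. That is what knowing how to read means. Moreover, as I see it, anyone who is not a professional scholar should read the way Zhuge Liang and Tao Yuanming describe, surveying the general outline and not seeking thorough understanding, without chewing over every word or picking at tiny, trivial questions. It is just as someone who means to win the whole realm does not fret over the gain or loss of a single city or moat, and Zhuge Liang was a man of just such breadth."

The full transcript of that episode is well worth savoring: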

http://book.kanunu.cn/html/2006/0511/3851_15.html

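The following passage on the "strategy of choosing a boss" is also a very incisive analogy:

"That was Liu Bei and Zhuge Liang at the time. What, then, was Liu Bei like? He was like a private company in the middle of its growth: it has a little capital, a little experience, and even a few products, but it cannot find a flagship product or a marketing strategy, and it lacks a CEO. Zhuge Liang was like a capable professional manager: his job is to serve as someone's general manager; he has no business of his own and does not start companies, so he needs to find a good company. So when these two later met, they truly hit it off at once, like fish taking to water, and this is what produced their celebrated encounter between ruler and minister, a tale told through the ages. But there is a question here: which of them sought out the other? According to the Records of the Three Kingdoms (《三国志》) and the Romance of the Three Kingdoms (《三国演义》), Liu Bei paid three visits to the thatched cottage; but according to the Weilüe (《魏略》) and the Jiuzhou Chunqiu (《九州春秋》), Zhuge Liang came to the door on his own initiative. So we have to ask: did Liu Bei pay three visits to the thatched cottage, or did Zhuge Liang come and recommend himself? And if Liu Bei did pay three visits, did he only get to see him on the third, or did he go three times and see him three times? See the next episode: Three Visits to the Thatched Cottage."

Posted: 2007-02-07, 00:53:19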