
Point Estimation

Example. Let $X_1, \ldots, X_n \sim f(\cdot ; \theta)$ be a random sample where

$f(x ; \theta)= \begin{cases}e^{-(x-\theta)}, & \text { if } x \geq \theta \\ 0, & \text { elsewhere }\end{cases}$

Show that $\bar{X}$ is a biased estimator of $\theta$.

$\mathbb{E}[\bar{X}] = \frac{1}{n}\sum_{i=1}^n\mathbb{E}[X_i] = \mathbb{E}[X_1] = \int_{\theta}^{\infty} x e^{-(x-\theta)}\,dx = (-x e^{-(x-\theta)})\bigg|_{\theta}^{\infty} +\int_{\theta}^{\infty} e^{-(x-\theta)}\,dx = \theta+1 \not = \theta$.

How can we transform such an estimator to make it unbiased?
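As a quick numerical sanity check, here is a minimal NumPy simulation sketch; the values $\theta=2$, $n=50$, and the number of replications are illustrative choices. Since $\mathbb{E}[\bar{X}]=\theta+1$, it also shows that subtracting the bias, that is, using $\bar{X}-1$, gives an unbiased estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 50, 20_000           # illustrative choices

# X ~ theta + Exp(1) has density e^{-(x - theta)} for x >= theta
samples = theta + rng.exponential(scale=1.0, size=(reps, n))
xbar = samples.mean(axis=1)

print("E[X bar]     ~", xbar.mean())        # close to theta + 1 = 3, so biased
print("E[X bar - 1] ~", (xbar - 1).mean())  # close to theta = 2, so unbiased
```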

Example. Let $X_1, \ldots, X_n \sim \operatorname{Uniform}(0, \beta)$ be a random sample where $\beta$ is a parameter. Show that $Y_n$ (the $n$-th order statistic) is a biased estimator but an asymptotically unbiased estimator of $\beta$.

$$f_{Y_n}(y) = \frac{n!}{(n-1)!} \Pr[X < y]^{n-1} \frac{1}{\beta} = n \left(\frac{y}{\beta}\right)^{n-1} \frac{1}{\beta} = \frac{n}{\beta^n}y^{n-1}, \qquad 0 \le y \le \beta$$

$$\mathbb{E}[Y_n] = \int_0^{\beta} y\cdot\frac{n}{\beta^n}y^{n-1}\, dy = \int_0^{\beta}\frac{n}{\beta^n}y^n\, dy =\frac{n}{n+1}\frac{y^{n+1}}{\beta^n} \bigg|_{y=0}^{\beta} = \frac{n}{n+1} \beta \not= \beta$$

Sending $n$ to infinity, $\frac{n}{n+1}\beta \to \beta$, so $Y_n$ is asymptotically unbiased even though it is biased for every finite $n$.
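A similar simulation sketch (with an illustrative $\beta = 5$) shows $\mathbb{E}[Y_n]$ matching $\frac{n}{n+1}\beta$ and approaching $\beta$ as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
beta, reps = 5.0, 20_000                    # illustrative choices

for n in (2, 10, 100):
    y_n = rng.uniform(0.0, beta, size=(reps, n)).max(axis=1)
    print(n, y_n.mean(), n / (n + 1) * beta)  # simulated E[Y_n] vs n/(n+1)*beta
```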

Let's consider the following inequality from "Course Notes for Math 162: Mathematical Statistics, The Cramér-Rao Inequality" by Adam Merberg and Steven J. Miller.

Cramér-Rao Inequality

Suppose we have a probability density function $f(x ; \theta)$ with a continuous parameter $\theta$. Let $X_1, \ldots, X_n$ be independent random variables with density $f(x ; \theta)$, and $\phi(\vec{x}) = \hat{\Theta}\left(x_1, \ldots, x_n\right) $ be an unbiased estimator of $\theta$.

Let $F(\vec{x}, \theta) = \prod_{i=1}^n f\left(x_i ; \theta\right)$ denote the joint density of the sample, and write $M$ for the support of $F(\vec{x}, \theta)$ and $m$ for the support of $f(x;\theta)$.

We assume that $f(x ; \theta)$ satisfies the following two conditions:

1. We may differentiate under the integral sign:

$$\frac{\partial}{\partial \theta}\left[ \int_M \phi(\vec{x}) F(\vec{x}, \theta)\, d \vec{x}\right] = \int_M \phi(\vec{x}) \frac{\partial F(\vec{x}, \theta)}{\partial \theta}\, d \vec{x}$$

2. For each $\theta$, the variance of $\phi(\vec{x})$ is finite.

Under these conditions, the variance of $\hat{\Theta} = \phi$ satisfies:

$$\operatorname{var}(\phi) \times \left(n \mathbb{E}\left[\left(\frac{\partial }{\partial \theta}\ln f(x ; \theta)\right)^2\right]\right) \geq 1$$

Proof.

Since $\phi(\vec x)$ is an unbiased estimator of $\theta$, we will have

$$0=\mathbb{E}[\phi(\vec x) - \theta] = \int_M (\phi(\vec x) - \theta)F(\vec x, \theta) d \vec{x} = \int_M \phi(\vec x)F(\vec x, \theta) d \vec{x} - \int_M\theta F(\vec x, \theta) d \vec{x}$$

If we take the partial derivative of both sides with respect to $\theta$ (using condition 1 to differentiate under the integral sign, and the product rule on the last integral), then we will have

$$0 = \int_M \phi(\vec{x}) \frac{\partial}{\partial \theta}F(\vec x, \theta) d \vec{x} - \int_M F(\vec x, \theta) d \vec{x} - \int_M \theta\frac{\partial }{\partial \theta} F(\vec x, \theta) d \vec{x}$$

Since $\int_M F(\vec x, \theta)\, d \vec{x} = 1$, the middle term equals $1$, and rearranging gives

$$1 = \int_M \left(\phi(\vec{x})-\theta\right) \frac{\partial}{\partial \theta}F(\vec x, \theta) d \vec{x}$$

We also notice that

$$\frac{\partial}{\partial \theta}F(\vec x, \theta) = \sum_{i=1}^n \frac{\partial}{\partial \theta} f(x_i; \theta) \prod_{j\not =i} f(x_j;\theta) = \sum_{i=1}^n \frac{1}{f(x_i;\theta)} \frac{\partial}{\partial \theta} f(x_i; \theta) \prod_{j=1}^n f(x_j;\theta) \\= \sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(x_i; \theta) \prod_{j=1}^n f(x_j;\theta) =F(\vec x;\theta)\sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(x_i; \theta)$$

So we will have:

$$1 = \int_{M}(\phi(\vec x) -\theta)(F(\vec x;\theta))^{1/2}(F(\vec x;\theta))^{1/2}\sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(x_i; \theta) \, d\vec x$$

By the Cauchy-Schwarz inequality, we will get the following.

$$1 \le \left[\int_{M}(\phi(\vec x) -\theta)^2(F(\vec x;\theta))\, d\vec x\right]^{1/2}\left[\int_{M}\left(\sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2 F(\vec x;\theta) \, d\vec x\right]^{1/2}$$

For the first factor, unbiasedness gives $\theta = \mathbb{E}[\hat{\Theta}]$, so $\int_{M}(\phi(\vec x) -\theta)^2 F(\vec x;\theta)\, d\vec x = \operatorname{var}(\hat\Theta)$. The second factor needs further analysis. Squaring both sides, we currently have only the inequality:

$$1 \le \operatorname{var}(\phi)\left[\int_{M}\left(\sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2 F(\vec x;\theta) \, d\vec x\right]$$.

We want to show $ \int_{M}\left(\sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2 F(\vec x;\theta) \, d\vec x= n\mathbb{E}\left[\left(\frac{\partial }{\partial \theta}\ln f(x ; \theta)\right)^2\right]$

$$\int_{M}\left(\sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2 F(\vec x;\theta) \, d\vec x = \int_{M}\left[\sum_{i=1}^n\left(\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2+\sum_{i \neq j} \frac{\partial}{\partial \theta} \ln f(x_i; \theta)\frac{\partial}{\partial \theta} \ln f(x_j; \theta)\right] F(\vec x;\theta) \, d\vec x$$

So we only need to show that $\int_{M}\left[\left(\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2\right] F(\vec x;\theta) \, d\vec x = \mathbb{E}\left[\left(\frac{\partial }{\partial \theta}\ln f(x ; \theta)\right)^2\right]$ and

$ \int_{M}\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\frac{\partial}{\partial \theta} \ln f(x_j; \theta)F(\vec x;\theta) \, d\vec x=0$

One can easily see that

$$\int_{M}\left[\left(\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2\right] F(\vec x;\theta) \, d\vec x = \int_{m}\left[\left(\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2\right] f(x_i;\theta)\, dx_i \times \prod_{j\not=i} \int f(x_j;\theta)\, dx_j \\= \int_{m}\left[\left(\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2\right] f(x_i;\theta)\, dx_i\\=\int_{m}\left[\left(\frac{\partial}{\partial \theta} \ln f(x; \theta)\right)^2\right] f(x;\theta)\, dx= \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \ln f(x; \theta)\right)^2\right] $$

And for

$$ \int_{M}\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\frac{\partial}{\partial \theta} \ln f(x_j; \theta)F(\vec x;\theta) \, d\vec x = \int_{m}\frac{\partial}{\partial \theta} \ln f(x_i; \theta) f(x_i;\theta)dx_i \int_{m}\frac{\partial}{\partial \theta}\ln f(x_j; \theta)f(x_j;\theta)dx_j $$

We notice that

$$\int_{m}\frac{\partial}{\partial \theta} \ln f(x_i; \theta) f(x_i;\theta)dx_i = \int_{m}\frac{\partial}{\partial \theta} f(x_i;\theta) dx_i = 0$$,

since $1=\int_{m}f(x;\theta)\, dx \implies 0 = \int_{m}\frac{\partial}{\partial \theta} f(x;\theta)\, dx$ (differentiating under the integral sign again). This completes the proof.

Example. Let $X_1, \ldots, X_n$ be a random sample of $N\left(\mu, \sigma^2\right)$ where $\mu$ is to be estimated. Then $\bar{X}$ is an MVUE of $\mu$.

Solution: $\mathbb{E}[\bar X]=\frac{n\mathbb{E}[X]}{n} = \mu$ and $\operatorname{V}[\bar X] = \operatorname{V}\left[\frac{\sum_{i=1}^n X_i}{n}\right] = \frac{n}{n^2}\operatorname{V}[X] = \frac{\sigma^2}{n}.$

We know the normal distribution has pdf $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, so $\ln f(x) = \ln\left(\frac{1}{\sqrt{2\pi}\sigma}\right) - \frac{(x-\mu)^2}{2\sigma^2}$, hence $\frac{\partial}{\partial \mu} \ln f(x) = \frac{x-\mu}{\sigma^2}$ and $\left(\frac{\partial}{\partial \mu} \ln f(x)\right)^2 = \frac{(x-\mu)^2}{\sigma^4}$.

Then $I(\mu) = \int_{-\infty}^{\infty} \frac{(x-\mu)^2}{\sigma^4}\, f(x)\, dx = \frac{\mathbb{E}[(X-\mu)^2]}{\sigma^4} = \frac{1}{\sigma^2}$ and so we will have $$\frac{1}{n I(\mu)} = \frac{1}{n \frac{1}{\sigma^2}} = \frac{\sigma^2}{n} = \operatorname{V}[\bar X].$$

So $\bar{X}$ is an MVUE of $\mu$.
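As a sanity check of this example, a small Monte Carlo sketch (with illustrative values $\mu = 1$, $\sigma = 2$, $n = 30$) estimates both $\operatorname{V}[\bar{X}]$ and the Cramér-Rao bound $\frac{1}{nI(\mu)}$; the two should nearly coincide:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 1.0, 2.0, 30, 20_000     # illustrative choices

x = rng.normal(mu, sigma, size=(reps, n))
var_xbar = x.mean(axis=1).var()               # Monte Carlo Var[X bar]

# Fisher information I(mu) = E[((x - mu)/sigma^2)^2], estimated by simulation
score_sq = ((x[:, 0] - mu) / sigma**2) ** 2
cr_bound = 1.0 / (n * score_sq.mean())

print(var_xbar, sigma**2 / n, cr_bound)       # all three should nearly agree
```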

We will review the Cramér-Rao inequality:

$$\operatorname{var}(\phi) \times \left(n \mathbb{E}\left[\left(\frac{\partial }{\partial \theta}\ln f(x ; \theta)\right)^2\right]\right) \geq 1$$

Suppose we are given two point estimators $\Theta_1$ and $\Theta_2$ of our parameter $\theta$. How can we decide which one to use? Let's pick the case where $\Theta_1$ is the MVUE of $\theta$ and attains the Cramér-Rao bound, so we will have

$$\operatorname{V}(\Theta_1) \times \left(n I(\theta)\right) = 1$$

We also know that

$$ \operatorname{V}(\Theta_2) \times \left(n I(\theta)\right) \ge 1$$

and so we will have essentially:

$$\operatorname{V}(\Theta_2) \times \left(n I(\theta)\right) \ge \operatorname{V}(\Theta_1) \times \left(n I(\theta)\right) $$

If we take the reciprocal of the inequality, we will have $$\frac{1}{\operatorname{V}(\Theta_2) \times \left(n I(\theta)\right)} \le \frac{1}{\operatorname{V}(\Theta_1) \times \left(n I(\theta)\right)} $$.

Here comes the definition of efficiency. Of two unbiased estimators, the more efficient one plays the role of $\Theta_1$ in the inequality above. So the efficiency of an unbiased estimator $\Theta$ is defined as $$e(\Theta) =\frac{1}{\operatorname{V}(\Theta) \times \left(n I(\theta)\right)}.$$

and $$\frac{e(\Theta_1)}{e(\Theta_2)} = \frac{\operatorname{V}[\Theta_2]}{\operatorname{V}[\Theta_1]}.$$

Example. Let $X_1, \ldots, X_n \sim \operatorname{Uniform}(0, \beta)$ be a random sample where $\beta$ is a parameter. Show that both $2 \bar{X}$ and $\frac{n+1}{n} Y_n$ are unbiased estimators of $\beta$. Compare their efficiency.

First, let's consider the variance of the uniform distribution. We know that the mean of the above uniform distribution is $\frac{\beta}{2}$, and the variance of the distribution is $\frac{1}{\beta}\int_0^{\beta} (x-\beta/2)^2 \, dx = \frac{1}{\beta}\cdot\frac{1}{3}u^3 \bigg|_{u=-\beta/2}^{\beta/2} = \frac{1}{12}\beta^2$, where $u = x - \beta/2$.

And so $\mathbb{E}[2\bar{X}] = 2\cdot\frac{\beta}{2}=\beta.$ and $\operatorname{V}[2\bar{X}] = \frac{4}{n^2} \cdot \frac{n\beta^2}{12} = \frac{\beta^2}{3n}$

Now let's consider the density of $Y_n$: $f_{Y_n}(y) = n \left(\int_0^{y} \frac{1}{\beta}\, dt\right)^{n-1} \frac{1}{\beta} =n \left( \frac{y}{\beta}\right)^{n-1} \frac{1}{\beta} = n\cdot \frac{y^{n-1}}{\beta^n}$ for $0 \le y \le \beta$.

And so $\mathbb{E}[Y_n] = \int_0^\beta n\cdot \frac{y^n}{\beta^n} \,dy = \frac{n}{n+1} \frac{y^{n+1}}{\beta^n} \bigg|_{y=0}^{y=\beta} = \frac{n}{n+1} \beta$

And so $$\mathbb{E}\left[\frac{n+1}{n}Y_n\right] = \frac{n+1}{n}\frac{n}{n+1} \beta = \beta.$$

Now let's consider:

$$\mathbb{E}[Y_n^2] = \int_0^{\beta} y^2 \cdot n \cdot \frac{y^{n-1}}{\beta^n} \, dy \\=\frac{n}{\beta^n}\int_0^{\beta} y^{n+1} \, dy \\= \frac{n}{\beta^n} \frac{\beta^{n+2}}{n+2}\\= \frac{n}{n+2}\beta^2.$$

And so $$\operatorname{V}[Y_n] = \mathbb{E}[Y_n^2]- (\mathbb{E}[Y_n])^2 = \frac{n}{n+2}\beta^2 - \left(\frac{n}{n+1}\beta\right)^2=\frac{n(n+1)^2 -n^2(n+2)}{(n+2)(n+1)^2}\beta^2 = \frac{n(n^2+2n+1) -n^3 -2n^2}{(n+1)^2(n+2)}\beta^2 = \frac{n}{(n+1)^2(n+2)}\beta^2$$

And so $\operatorname{V}\left[\frac{n+1}{n}Y_n\right] = \left(\frac{n+1}{n}\right)^2\operatorname{V}[Y_n] = \left(\frac{n+1}{n}\right)^2 \frac{n}{(n+1)^2(n+2)}\beta^2 = \frac{1}{n(n+2)}\beta^2$

So the ratio of $$\frac{e(\frac{n+1}{n}Y_n)}{e(2\bar{X})} = \frac{\frac{1}{3n}\beta^2}{\frac{1}{n(n+2)}\beta^2} = \frac{n+2}{3}$$

Since this ratio exceeds $1$ for $n > 1$, $\frac{n+1}{n}Y_n$ is more efficient than $2\bar{X}$ when $n>1$ (for $n=1$ the two estimators coincide).
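The comparison can be checked by simulation; the following sketch uses illustrative values $\beta = 4$ and $n = 10$ and compares the Monte Carlo variances against the formulas above:

```python
import numpy as np

rng = np.random.default_rng(3)
beta, n, reps = 4.0, 10, 50_000               # illustrative choices

x = rng.uniform(0.0, beta, size=(reps, n))
t1 = 2 * x.mean(axis=1)                       # 2 * X bar
t2 = (n + 1) / n * x.max(axis=1)              # (n+1)/n * Y_n

print(t1.mean(), t2.mean())                   # both approximately beta (unbiased)
print(t1.var(), beta**2 / (3 * n))            # Var[2 X bar]       = beta^2/(3n)
print(t2.var(), beta**2 / (n * (n + 2)))      # Var[(n+1)/n * Y_n] = beta^2/(n(n+2))
```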

Definition. Let $\hat{\Theta}$ be any point estimator of $\theta$. Then the mean square error (MSE) of $\hat{\Theta}$ is defined as

$$\operatorname{MSE}(\hat{\Theta})=\mathbb{E}\left[(\hat{\Theta}-\theta)^2\right]$$

Then we will have $$\mathbb{E}\left[(\hat{\Theta}-\theta)^2\right] = \mathbb{E}\left[(\hat{\Theta}-\Theta +\Theta-\theta)^2\right] \\= \mathbb{E}\left[(\hat{\Theta}-\Theta)^2\right] + 2(\Theta-\theta)\,\mathbb{E}[\hat{\Theta}-\Theta] +(\Theta-\theta)^2\\=\mathbb{E}[(\hat{\Theta}-\Theta)^2] +(\Theta-\theta)^2 \\= \operatorname{V}[\hat{\Theta}]+\left(b(\hat{\Theta})\right)^2$$

where $\operatorname{V}[\hat{\Theta}]$ is the variance of $\hat{\Theta}$, $\Theta = \mathbb{E}[\hat{\Theta}]$, and $b(\hat{\Theta}) = \Theta - \theta$ is the bias of $\hat{\Theta}$; the cross term vanishes because $\mathbb{E}[\hat{\Theta}-\Theta]=0$.

Example. Let's compare the Mean Squared Error (MSE) of two estimators for $\sigma^2$: $S^2$ and $\frac{n-1}{n} S^2$, for a normal population $N(\mu, \sigma^2)$.

Solution: Let's compute the variance and the bias of both $S^2$ and $\frac{n-1}{n}S^2$.

For $S^2$, $\mathbb{E}[S^2] = \sigma^2$ and since $\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2$ so we will have

$$\operatorname{V}\left[\frac{(n-1)S^2}{\sigma^2}\right] =2(n-1)$$

and we also know that $\operatorname{V}\left[\frac{(n-1)S^2}{\sigma^2}\right] = \frac{(n-1)^2}{\sigma^4}\operatorname{V}[S^2]$

so we will have $\operatorname{V}[S^2] = \frac{2\sigma^4}{n-1}.$

And also $\mathbb{E}[\frac{n-1}{n}S^2] = \frac{n-1}{n}\sigma^2$ and $\operatorname{V}[\frac{n-1}{n}S^2] = \left(\frac{n-1}{n}\right)^2 \frac{2\sigma^4}{n-1} = 2\left(\frac{n-1}{n^2}\right)\sigma^4$

So the MSE of $S^2$ is $\frac{2\sigma^4}{n-1}$ and MSE of $\frac{n-1}{n}S^2$ is $2\left(\frac{n-1}{n^2}\right)\sigma^4 + \frac{1}{n^2}\sigma^4 =\frac{2n-1}{n^2}\sigma^4$

so the ratio will be $$\frac{\operatorname{MSE}[\frac{n-1}{n}S^2]}{\operatorname{MSE}[S^2]} = \frac{\frac{2n-1}{n^2}\sigma^4}{\frac{2}{n-1}\sigma^4} =\frac{(2n-1)(n-1)}{2n^2}= \frac{2n^2 -3n +1}{2n^2}= 1+ \frac{1-3n}{2n^2}<1$$

So $\frac{n-1}{n}S^2$ has less MSE than $S^2$.
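A quick Monte Carlo sketch (illustrative $\sigma = 3$, $n = 10$) compares the two MSEs against the closed-form expressions above:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 0.0, 3.0, 10, 100_000        # illustrative choices

x = rng.normal(mu, sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)                        # sample variance S^2

mse_s2 = np.mean((s2 - sigma**2) ** 2)
mse_shrunk = np.mean(((n - 1) / n * s2 - sigma**2) ** 2)

print(mse_s2, 2 * sigma**4 / (n - 1))             # MSE[S^2]           = 2 sigma^4/(n-1)
print(mse_shrunk, (2 * n - 1) / n**2 * sigma**4)  # MSE[(n-1)/n * S^2] = (2n-1) sigma^4/n^2
```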

Consistency is defined as follows: An estimator $\hat{\Theta}$ of $\theta$, based on a random sample of size $n$, is considered consistent if, for any positive value $c$, the probability approaches zero as the sample size $n$ tends to infinity that the absolute difference between the estimator $\hat{\Theta}$ and the true parameter $\theta$ is greater than $c$.

$$\lim_{n\to\infty} \Pr\left[\left|\hat\Theta-\theta\right|>c\right] = 0$$

Theorem. If $\hat{\Theta}$ is an unbiased estimator of $\theta$ and $\operatorname V[\hat{\Theta}] \rightarrow 0$ as $n \rightarrow \infty$, then $\hat{\Theta}$ is a consistent estimator of $\theta$.

Proof:

By Chebyshev's inequality (and using that $\hat{\Theta}$ is unbiased, so $\mathbb{E}[\hat{\Theta}]=\theta$), for any real number $k>0$,

$$\Pr\left(|\hat{\Theta}-\theta| \geq k \sqrt{\operatorname V[\hat{\Theta}]}\right) \leq \frac{1}{k^2}$$

Setting $k$ to satisfy $k\sqrt{\operatorname V[\hat{\Theta}]} = c$, so that $\frac{1}{k^2} = \frac{\operatorname V[\hat{\Theta}]}{c^2}$, and then sending $n$ to infinity, we will have

$$\lim_{n\to\infty}\Pr(|\hat{\Theta}-\theta| \geq c) \leq \lim_{n\to\infty}\frac{\operatorname V[\hat{\Theta}]}{c^2} =0$$

Example. Suppose $S^2$ is the sample variance of the random sample from a normal population $N\left(\mu, \sigma^2\right)$, then $S^2$ is a consistent estimator of $\sigma^2$.

We know from the previous example that $\frac{(n-1)S^2}{\sigma^2}\sim \chi_{n-1}^2$ so

$$\frac{(n-1)^2}{\sigma^4}\operatorname V[S^2] = 2(n-1) \implies \operatorname V[S^2] =\frac{2\sigma^4}{n-1}$$

then we will have $$\lim_{n\to \infty} V[S^2] =\lim_{n\to \infty}\frac{2\sigma^4}{n-1} =0$$

So $S^2$ is consistent.
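To see the consistency numerically, the following sketch (illustrative $\sigma = 2$, $c = 0.5$) estimates $\Pr[|S^2-\sigma^2|>c]$ for increasing $n$; the probabilities shrink toward $0$:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, c, reps = 0.0, 2.0, 0.5, 20_000        # illustrative choices

for n in (10, 100, 1000):
    s2 = rng.normal(mu, sigma, size=(reps, n)).var(axis=1, ddof=1)
    print(n, np.mean(np.abs(s2 - sigma**2) > c))  # P(|S^2 - sigma^2| > c) -> 0
```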

Example: Let $f$ be a probability density function (pdf) with a finite mean $\mu$ and variance $\sigma^2$.

Consider a natural number $n$, and let $X_1, \ldots, X_n$ be a random sample drawn from $f$.

Additionally, let $Y_n$ follow a Bernoulli distribution with parameter $\frac{1}{n}$ and be independent of the random sample.

We define the estimator $\hat{\Theta}_n$ as $\hat{\Theta}_n = n^2 Y_n + \left(1 - Y_n\right) \bar{X}_n$, where $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$ represents the sample mean.

Show that $\hat{\Theta}_n$ is neither unbiased nor asymptotically unbiased, but it is a consistent estimator of $\mu$.

$$\mathbb{E}[\hat{\Theta}_n] = n^2 \mathbb{E}[Y_n] + \mathbb{E}[1-Y_n]\, \mathbb{E}[\bar{X}_n] = n +\frac{n-1}{n}\mu,$$ using the independence of $Y_n$ and $\bar{X}_n$.

So $\hat{\Theta}_n$ is neither unbiased nor asymptotically unbiased, since its bias $\mathbb{E}[\hat{\Theta}_n]-\mu = n - \frac{\mu}{n}$ diverges as $n\to\infty$.

$$\Pr\left[\left|\hat\Theta_n-\mu\right| \ge c\right] = \Pr\left[\left|\hat\Theta_n-\mu\right| \ge c \,\middle|\, Y_n=0\right] \Pr[Y_n=0]+\Pr\left[\left|\hat\Theta_n-\mu\right| \ge c\,\middle|\, Y_n=1\right] \Pr[Y_n=1] \\=\Pr\left[\left|\bar{X}_n-\mu\right| \ge c \right] \left(1-\frac{1}{n}\right)+\Pr\left[\left|n^2-\mu\right| \ge c\right] \frac{1}{n}\\ \le \Pr\left[\left|\bar{X}_n-\mu\right| \ge c \right] + \frac{1}{n} \le \frac{\sigma^2}{c^2 n} + \frac{1}{n} \longrightarrow 0 \text{ as } n\to\infty,$$

where the last line uses Chebyshev's inequality on $\bar{X}_n$. So $\hat\Theta_n$ is a consistent estimator of $\mu$.
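The following sketch simulates $\hat{\Theta}_n$ (with an illustrative population $N(1,1)$ and $c = 0.25$); the sample average of $\hat{\Theta}_n$ blows up with $n$, reflecting the diverging bias, while $\Pr[|\hat\Theta_n-\mu|>c]$ still tends to $0$:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, c, reps = 1.0, 1.0, 0.25, 20_000   # illustrative f = N(1, 1)

for n in (10, 100, 1000):
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    y = rng.binomial(1, 1.0 / n, size=reps)   # Y_n ~ Bernoulli(1/n), independent
    theta_hat = n**2 * y + (1 - y) * xbar
    print(n, theta_hat.mean(), np.mean(np.abs(theta_hat - mu) > c))
```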

Definition: An estimator $\hat{\Theta}$ is considered sufficient for the parameter $\theta$ if the conditional probability of the random sample $X_1, \ldots, X_n$ given $\hat{\Theta}=\widehat{\theta}$ is independent of $\theta$. In other words, the probability distribution $f_{X_1, \ldots, X_n \mid \widehat{\Theta}}\left(x_1, \ldots, x_n \mid \widehat{\theta}\right)$ does not depend on the specific value of $\theta$.

Example. If $X_1, \ldots, X_n$ is a random sample of $\text{Bernoulli}(p)$, then $\hat{\Theta}=\bar{X}$ is a sufficient estimator of $p$.

Proof. $\Pr[\vec{X} =\vec{x}] = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^nx_i}$, and, writing $\hat\theta$ for the observed value of $\hat{\Theta}$ (so that $\sum_{i=1}^n x_i = n\hat\theta$), $$\Pr[\hat{\Theta} =\hat\theta]=\Pr\left[\sum_{i=1}^n X_i=n\hat\theta\right]= {n \choose n \hat\theta}p^{n\hat\theta} (1-p)^{n(1-\hat\theta)} = {n \choose \sum_{i=1}^n x_i}p^{\sum_{i=1}^n x_i} (1-p)^{n-\sum_{i=1}^n x_i}$$

Then $$\frac{\Pr[\vec{X} =\vec{x}]}{\Pr[\hat{\Theta} =\hat\theta]} = \frac{1}{{n\choose \sum_{i=1}^n x_i}},$$ which does not depend on $p$.

So $\hat \Theta$ is a sufficient estimator.

Example. Let $X_1, X_2, X_3$ be a random sample of Bernoulli $(\theta)$, then $\hat{\Theta}=$ $\frac{1}{6}\left(X_1+2 X_2+3 X_3\right)$ is not a sufficient estimator of $\theta$.

$$\Pr[X_1 =x_1, X_2 = x_2, X_3 = x_3] = \theta^{x_1+x_2+x_3} (1-\theta)^{3-x_1-x_2-x_3}.$$

$$\Pr[\hat \Theta =\frac{1}{2}] =\Pr[X_1+2X_2+3X_3 =3] \\= \Pr[X_1 =0, X_2 = 0, X_3 = 1] + \Pr[X_1 =1, X_2 = 1, X_3 = 0] \\= \theta(1-\theta)^2 + (1-\theta)\theta^2 $$

And $$\frac{\Pr[X_1 =0, X_2 = 0, X_3 = 1]}{\Pr[\hat \Theta =\frac{1}{2}]} = \frac{\theta(1-\theta)^2 }{\theta(1-\theta)^2 +(1-\theta)\theta^2} \\= \frac{1-\theta}{1-\theta+\theta} = 1-\theta,$$ which still depends on $\theta$, so $\hat{\Theta}$ is not a sufficient estimator of $\theta$.

Theorem: $\hat{\Theta}$ is a sufficient estimator of $\theta$ if and only if the joint probability mass function (pmf) or probability density function (pdf) of $X_1, \ldots, X_n$ can be expressed as a product of two functions, $\phi(\hat{\theta}; \theta)$ and $h(x_1, \ldots, x_n)$, where the function $h$ does not depend on $\theta$.

Example. Consider a normal population $N\left(\mu, \sigma^2\right)$ for known $\sigma^2$. Show that $\bar{X}$ is a sufficient estimator of $\mu$.

Solution:

$$f_{\vec X}(\vec x) = \frac{1}{\sigma^n(2\pi)^{n/2}} \exp\left(-\sum_{i=1}^n\frac{(x_i-\mu)^2}{2\sigma^2}\right) \\= \frac{1}{\sigma^n(2\pi)^{n/2}} \exp\left(-\sum_{i=1}^n\frac{(x_i-\bar{X}+\bar{X}-\mu)^2}{2\sigma^2}\right) \\= \frac{1}{\sigma^n(2\pi)^{n/2}} \exp\left(-\frac{\sum_{i=1}^n(x_i-\bar{X})^2+n(\bar{X}-\mu)^2}{2\sigma^2}\right)\\ = \exp\left(-\frac{n(\bar{X}-\mu)^2}{2\sigma^2}\right)\cdot\prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{(x_i -\bar{X})^2}{2\sigma^2}\right)$$

So, by the factorization theorem, $\bar{X}$ is a sufficient estimator of $\mu$: the first factor depends on the data only through $\bar{X}$, and the remaining product does not involve $\mu$.

Now, let's discuss approaches for constructing point estimators. There are three common methods used for this purpose:

Method of Moments

Maximum Likelihood Estimation

Bayesian Estimation

The Method of Moments is a technique used to estimate the parameters $\theta_1, \ldots, \theta_r$ of a distribution $f$. The approach involves setting up a system of $r$ equations:

$$ \mu_k^{\prime}\left(\theta_1, \ldots, \theta_r\right) = m_k^{\prime}, \quad \text{for } k = 1, \ldots, r, $$

where $\mu_k^{\prime}$ represents the $k$-th population (raw) moment of the distribution $f$ (e.g., the mean for $k=1$), and $m_k^{\prime} = \frac{1}{n}\sum_{i=1}^n x_i^k$ represents the corresponding $k$-th sample moment (calculated from the given data $x_1, \ldots, x_n$). Solving this system of equations yields the estimates for the parameters:

$$ \widehat{\theta}_k = \widehat{\theta}_k\left(x_1, \ldots, x_n\right), \quad \text{for } k = 1, \ldots, r. $$

Thus, the resulting estimators obtained through the Method of Moments are denoted as $\hat{\Theta}_k = \hat{\Theta}_k\left(X_1, \ldots, X_n\right)$ for $k = 1, \ldots, r$.

Example. Given a random sample of size $n$ from $\operatorname{Uniform}(a, b)$, use the method of moments to obtain an estimator of $a, b$.

We know that the first moment of the uniform distribution is $\frac{a+b}{2}$ and the second moment of the uniform distribution is $\frac{a^2+ab+b^2}{3}.$ Then we will have

$$\begin{cases} \frac{a+b}{2} = x \\ \frac{a^2+ab+b^2}{3} = y,\end{cases}$$

where $x = \bar{X}$ is the sample mean and $y = \overline{X^2} = \frac{1}{n}\sum_{i=1}^n X_i^2$ is the sample second moment.

Then we will have $a = 2x - b$, and substituting into the second equation gives $(2x-b)^2+(2x-b)b+b^2 = 3y$, that is, $b^2 - 2 b x + 4 x^2 - 3y = 0$. The discriminant of this quadratic in $b$ is $\Delta = 4x^2 - 4(4x^2 - 3y) = 4(3y - 3x^2)$, and so $$b = \frac{2x +2\sqrt{3y-3x^2}}{2} = x + \sqrt{3(y-x^2)}$$ (taking the larger root, since $b \ge a$) can be an estimator of $b$.

Then $a = 2x - b = x - \sqrt{3(y-x^2)}$.
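A minimal sketch of these method-of-moments estimators (with illustrative true values $a = 1$, $b = 6$):

```python
import numpy as np

rng = np.random.default_rng(7)
a_true, b_true, n = 1.0, 6.0, 500             # illustrative choices

data = rng.uniform(a_true, b_true, size=n)
x, y = data.mean(), np.mean(data**2)          # first and second sample moments

b_hat = x + np.sqrt(3 * (y - x**2))
a_hat = x - np.sqrt(3 * (y - x**2))
print(a_hat, b_hat)                           # roughly (1, 6)
```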

Example. Given a random sample of size $n$ from $\operatorname{Gamma}(\alpha, \beta)$, use the method of moments to obtain estimators of $\alpha$ and $\beta$.

Solution: we will use the same definitions of $x$ and $y$ as in the above problem, but now for the gamma distribution. Then we will have

$$\begin{cases}\alpha \beta = x\\ \alpha(\alpha+1)\beta^2 = y\end{cases}$$

And so $$\begin{cases}\alpha \beta =x \\ (\alpha+1)\beta = \frac{y}{x}\end{cases}$$

And so $$\begin{cases} \alpha= \frac{x^2}{y-x^2} \\ \beta = \frac{y}{x}-x = \frac{y-x^2}{x}\end{cases}$$
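And the analogous sketch for the gamma case (illustrative $\alpha = 2$, $\beta = 3$; NumPy's gamma sampler uses the same shape/scale convention, so the mean is $\alpha\beta$):

```python
import numpy as np

rng = np.random.default_rng(8)
alpha_true, beta_true, n = 2.0, 3.0, 5_000    # illustrative shape and scale

data = rng.gamma(alpha_true, beta_true, size=n)
x, y = data.mean(), np.mean(data**2)          # first and second sample moments

alpha_hat = x**2 / (y - x**2)
beta_hat = (y - x**2) / x
print(alpha_hat, beta_hat)                    # roughly (2, 3)
```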

Maximum likelihood estimation

Maximum Likelihood Estimation (MLE) is a method used to estimate the parameter $\theta$ of a statistical distribution, given the observed values $x_1, \ldots, x_n$. To perform MLE, we define the likelihood function $L(\theta)$, which is the joint probability density function (pdf) or probability mass function (pmf) of the observed data $x_1, \ldots, x_n$ when treated as a function of the parameter $\theta$. The likelihood function is given by:

$$ L(\theta) = f_{X_1, \ldots, X_n}\left(x_1, \ldots, x_n; \theta\right) = \prod_{i=1}^n f\left(x_i; \theta\right). $$

In this equation, $f\left(x_i; \theta\right)$ represents the pdf or pmf of each individual data point $x_i$, and the product of these probabilities is taken for all $n$ observed data points.

The MLE of $\theta$, denoted as $\hat{\theta}$, is the value that maximizes the likelihood function $L(\theta)$. In other words, it is the parameter value that makes the observed data most probable, given the assumed statistical model characterized by $f\left(x; \theta\right)$. Mathematically, we find $\hat{\theta}$ by solving:

$$ \hat{\theta} = \arg\max_{\theta} L(\theta). $$

The value $\hat{\theta}$ obtained in this way is referred to as the Maximum Likelihood Estimate (MLE) of $\theta$, which provides an estimate of the unknown parameter based on the observed data. The MLE is a widely used and valuable method for parameter estimation in various statistical models and machine learning algorithms.

Example. Given $x$ successes in $n$ trials, find the MLE of $\theta$ in the corresponding $\operatorname{Binomial}(n, \theta)$.

We see that the probability we want to maximize is $$p(X = x; \theta) = {n \choose x} \theta^x (1-\theta)^{n-x}$$

If we take derivative with respect to $\theta$ and setting it to zero, we will have the following:

$$\frac{\partial}{\partial \theta}p(X = x; \theta) = {n \choose x} \left(x\theta^{x-1} (1-\theta)^{n-x} -(n-x)\theta^x (1-\theta)^{n-x-1}\right)=0$$

and so we will have

$$x(1-\theta)-(n-x) \theta = 0 \implies x -x\theta -n\theta+x\theta =0\implies \theta = \frac{x}{n}.$$

Example. Let $X_1, \ldots, X_n$ be a random sample from the $\operatorname{Exponential}(\theta)$ distribution, parametrized so that $\theta$ is the mean. Find the MLE of $\theta$.

Solution:

So the exponential distribution has the following pdf:

$$f_{X_1}(x) = \frac{1}{\theta} \exp(-\frac{x}{\theta})$$

and so the joint pdf will have the following:

$$f_{\vec X}(\vec x) = \frac{1}{\theta^n} \exp\left(-\frac{\sum_{i=1}^n x_i}{\theta}\right)$$

If we take $\ln$ of both sides and maximize the resulting log-likelihood, we will have

$$\frac{\partial}{\partial\theta} \ln f_{\vec X}(\vec x) = \frac{\partial}{\partial\theta}\left( -n\ln(\theta)-\frac{\sum_{i=1}^n x_i}{\theta}\right)\\ = -\frac{n}{\theta}+\theta^{-2}\sum_{i=1}^n x_i = 0$$

It means that $$\theta \left(\sum_{i=1}^n x_i\right)^{-1} = n^{-1} \implies \theta = \frac{\sum_{i=1}^nx_i}{n}.$$
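As a check, the following sketch compares a crude grid maximization of the log-likelihood with the closed-form MLE $\bar{x}$ (the true mean $\theta = 2.5$, the sample size, and the grid are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)
theta_true, n = 2.5, 200                      # illustrative mean parameter

data = rng.exponential(scale=theta_true, size=n)

def log_lik(theta):
    # log L(theta) = -n ln(theta) - sum(x)/theta
    return -n * np.log(theta) - data.sum() / theta

grid = np.linspace(0.5, 6.0, 2000)
theta_grid = grid[np.argmax(log_lik(grid))]   # numerical maximizer
print(theta_grid, data.mean())                # both close to the MLE sum(x)/n
```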

Example. Let $X_1, \ldots, X_n$ be a random sample of Uniform(0, $\left.\beta\right)$. Find the MLE of $\beta$.

We will have the following pdf:

$$f_{\vec X}(\vec x) = \prod_{i=1}^n f(x_i;\beta) = \prod_{i=1}^n \frac{1}{\beta} I_{x_i \le \beta}(x_i) = \begin{cases}\beta^{-n} & \text{ if }\beta \ge \max_{1\le i \le n}x_i \\ 0 &\text{ otherwise}\end{cases}$$

and since $\beta^{-n}$ is decreasing in $\beta$, the likelihood is maximized at the smallest admissible value of $\beta$; therefore, $\hat\beta = \max_{1\le i\le n} x_i$ is the maximum likelihood estimator.

Example. Let $X_1, \ldots, X_n$ be a random sample of $N\left(\mu, \sigma^2\right)$ where both $\mu$ and $\sigma^2$ are unknown. Find the MLE of $\mu$ and $\sigma^2$.

We have the following pdf:

$$f_{\vec X}(\vec x) = (2\pi \sigma^2)^{-n/2}\exp(-\frac{\sum(x_i-\mu)^2}{2\sigma^2}) \\= (2\pi u)^{-n/2}\exp(-\frac{\sum(x_i-\mu)^2}{2u})$$

where $u = \sigma^2$ and if we take natural log, we will have $$v:= \ln f_{\vec X}(\vec x) = -\frac{n}{2}\ln(2\pi u)-\frac{\sum(x_i-\mu)^2}{2u}$$

Then $$\frac{\partial}{\partial u} v = -\frac{n}{2u} +\frac{\sum(x_i-\mu)^2}{2u^2} =0$$

and $$\frac{\partial}{\partial \mu} v = \frac{\sum (x_i-\mu)}{u} =0 $$

and so $\mu = \frac{\sum x_i}{n} = \bar{x}$, and substituting into the first equation gives

$$nu = \sum (x_i - \mu)^2 \implies u = \frac{\sum (x_i - \mu)^2}{n}$$
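A short sketch computing both MLEs from simulated data (illustrative $\mu = 1$, $\sigma = 2$); note the MLE of $\sigma^2$ divides by $n$ rather than $n-1$:

```python
import numpy as np

rng = np.random.default_rng(11)
mu_true, sigma_true, n = 1.0, 2.0, 1_000      # illustrative choices

data = rng.normal(mu_true, sigma_true, size=n)

mu_hat = data.mean()                          # MLE of mu
u_hat = np.mean((data - mu_hat) ** 2)         # MLE of sigma^2 (divides by n)

print(mu_hat, u_hat)                          # roughly (1, 4)
print(data.var(ddof=0))                       # same value as u_hat
```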

Bayesian estimation (maximum a posteriori estimation)

Writing $\phi\left(\theta \mid x_1, \ldots, x_n\right)$ for the posterior density of $\theta$, $p(\theta)$ for the prior density, and $g\left(x_1, \ldots, x_n\right)$ for the marginal density of the data, Bayes' rule gives

$\ln \phi\left(\theta \mid x_1, \ldots, x_n\right)=\ln p(\theta)+{\sum_{i=1}^n \ln f\left(x_i \mid \theta\right)}-{\ln g\left(x_1, \ldots, x_n\right)}$

We can see that the summation is the log-likelihood, and $\ln p(\theta)$ serves as a regularization term added to the log-likelihood; the term $\ln g\left(x_1, \ldots, x_n\right)$ does not depend on $\theta$, so it does not affect the maximization.

Example. Let $X$ follow $\operatorname{Binomial}(n, \theta)$ for unknown $\theta \in(0,1)$. Suppose the prior distribution of $\theta$ is $\operatorname{Beta}(\alpha, \beta)$ for some given $\alpha, \beta>0$. Find the posterior distribution and Bayesian estimate of $\theta$.

The posterior distribution will be proportional to the following:

$${n \choose x}\theta^{x}(1-\theta)^{n-x}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}{n \choose x}\theta^{x+\alpha-1}(1-\theta)^{n+\beta-x-1}$$

This is, up to a normalizing constant, the kernel of a Beta density, so the posterior distribution of $\theta$ is $\operatorname{Beta}(\alpha+x,\ \beta+n-x)$.

Then if we take $\ln$ and focus only on $\theta$, then we will have

$$(x+\alpha-1)\ln \theta+(n+\beta-x-1)\ln(1-\theta)$$.

If we take partial derivative with respect to $\theta$, then we will have

$$\frac{x+\alpha-1}{\theta}+\frac{n+\beta-x-1}{\theta-1} =0 \implies (x+\alpha-1 +n +\beta -x -1)\theta -(x+\alpha-1) =0 \implies \theta = \frac{x+\alpha-1}{n+\alpha+\beta-2}.$$
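A small sketch comparing the closed-form MAP estimate $\frac{x+\alpha-1}{n+\alpha+\beta-2}$ with a brute-force grid maximization of the log posterior (the data $x = 7$, $n = 20$ and the prior $\operatorname{Beta}(2,3)$ are illustrative choices):

```python
import numpy as np

n, x = 20, 7                                  # illustrative data: 7 successes in 20 trials
alpha, beta = 2.0, 3.0                        # illustrative Beta prior

map_closed = (x + alpha - 1) / (n + alpha + beta - 2)

# grid search over the (unnormalized) log posterior
theta = np.linspace(1e-6, 1 - 1e-6, 100_000)
log_post = (x + alpha - 1) * np.log(theta) + (n + beta - x - 1) * np.log(1 - theta)
print(map_closed, theta[np.argmax(log_post)])  # the two values should agree
```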

Example. Suppose $X_1, \ldots, X_n$ is a random sample of $N\left(\mu, \sigma^2\right)$ where $\sigma^2$ is known. Assume the prior distribution of $\mu$ is $N\left(\mu_0, \sigma_0^2\right)$ for some known $\mu_0$ and $\sigma_0^2$. Find the posterior distribution of $\mu$ and the Bayesian estimate.

Solution:

The posterior distribution of $\mu$ follows the following:

$$(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{\sum_{i}(X_i-\mu)^2}{2\sigma^2}\right) (2\pi\sigma_0^2)^{-1/2}\exp\left(-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right)$$

We can see that $$\frac{\sum_{i}(X_i-\mu)^2}{\sigma^2} + \frac{(\mu-\mu_0)^2}{\sigma_0^2} = \frac{\sum_{i}\left(X_i^2+\mu^2-2X_i\mu\right)}{\sigma^2} + \frac{\mu^2+\mu_0^2-2\mu \mu_0}{\sigma_0^2} \\= \frac{(\sigma^2_0 n+ \sigma^2)\mu^2-2(n\bar{X} \sigma^2_0 +\mu_0\sigma^2)\mu + \sigma^2 \mu_0^2+\sigma_0^2\sum_i X_i^2}{(\sigma\sigma_0)^2}$$

And so the posterior distribution of $\mu$ is normal with mean $\frac{n \bar{X}\sigma_0^2 + \mu_0\sigma^2}{\sigma_0^2n +\sigma^2}$ and variance $\frac{\sigma^2 \sigma_0^2}{\sigma_0^2 n +\sigma^2}$.


And the Bayesian estimator will be

$$\frac{\partial}{\partial \mu}\left[\frac{\sum_{i}(X_i-\mu)^2}{2\sigma^2}+\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right] = \frac{\sum_i (\mu-X_i)}{\sigma^2} + \frac{\mu -\mu_0}{\sigma_0^2}= 0$$

and it means that $$\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu=\frac{\sum X_i}{\sigma^2}+\frac{\mu_0}{\sigma_0^2} \implies \mu = \left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)^{-1}\left(\frac{\sum X_i}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right) = \frac{n\bar{X}\sigma_0^2+\mu_0\sigma^2}{n\sigma_0^2+\sigma^2},$$ which is exactly the posterior mean found above.
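Finally, a sketch computing the posterior mean and variance in both equivalent forms from simulated data (the true mean, the known $\sigma$, and the prior $N(\mu_0,\sigma_0^2)$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(10)
mu_true, sigma, n = 1.5, 2.0, 40              # known sigma; illustrative choices
mu0, sigma0 = 0.0, 1.0                        # illustrative prior N(mu0, sigma0^2)

x = rng.normal(mu_true, sigma, size=n)

# posterior mean and variance from the completed square
post_var = (n / sigma**2 + 1 / sigma0**2) ** -1
post_mean = post_var * (x.sum() / sigma**2 + mu0 / sigma0**2)

# equivalently, (n*xbar*sigma0^2 + mu0*sigma^2) / (n*sigma0^2 + sigma^2)
alt_mean = (n * x.mean() * sigma0**2 + mu0 * sigma**2) / (n * sigma0**2 + sigma**2)
print(post_mean, alt_mean, post_var)
```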