A Brief Introduction to Point Estimates
Created on July 29, 2023
Written by Some author
Read time: 21 minutes
Summary: In this blog, we introduce various point estimation methods and their definitions. The blog follows the lecture notes on Point Estimation by Professor Xiaojing Ye at Georgia State University.
Point Estimation
Example. Let $X_1, \ldots, X_n \sim f(\cdot ; \theta)$ be a random sample where
$f(x ; \theta)= \begin{cases}e^{-(x-\theta)}, & \text { if } x \geq \theta \\ 0, & \text { elsewhere }\end{cases}$
Show that $\bar{X}$ is a biased estimator of $\theta$.
$\mathbb{E}[\bar{X}] = \frac{1}{n}\sum_{i=1}^n\mathbb{E}[X_i] = \mathbb{E}[X_1] = \int_{\theta}^{\infty} x e^{-(x-\theta)}\,dx = \left(-x e^{-(x-\theta)}\right)\bigg|_{\theta}^{\infty} +\int_{\theta}^{\infty} e^{-(x-\theta)}\,dx = \theta+1 \not = \theta$.
How can we adjust the estimator so that the sample average becomes unbiased? Since $\mathbb{E}[\bar{X}] = \theta + 1$, subtracting the bias gives $\bar{X} - 1$, which is an unbiased estimator of $\theta$.
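Here is a quick Monte Carlo check of this computation (a minimal sketch in Python with numpy; the values of $\theta$, $n$, and the number of replications are arbitrary choices):

```python
# Monte Carlo check of the bias of Xbar for the shifted exponential density
# f(x; theta) = exp(-(x - theta)) for x >= theta.
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 50, 20_000

# X_i = theta + Exp(1) has exactly this density.
samples = theta + rng.exponential(scale=1.0, size=(reps, n))
xbar = samples.mean(axis=1)

print(np.mean(xbar))        # close to theta + 1 = 3, confirming the bias of 1
print(np.mean(xbar - 1.0))  # close to theta = 2, so Xbar - 1 is unbiased here
```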
Example. Let $X_1, \ldots, X_n \sim \operatorname{Uniform}(0, \beta)$ be a random sample where $\beta$ is a parameter. Show that $Y_n$ (the $n$-th order statistic) is a biased estimator but an asymptotically unbiased estimator of $\beta$.
$$f_{Y_n}(y) = \frac{n!}{(n-1)!} \Pr[X < y]^{n-1} f(y;\beta) = n \left(\frac{y}{\beta}\right)^{n-1} \frac{1}{\beta} = \frac{n}{\beta^n}\,y^{n-1}, \qquad 0 \le y \le \beta$$
$$\mathbb{E}[Y_n] = \int_0^{\beta}\frac{n}{\beta^n}\,y^n \,dy =\frac{n}{n+1}\frac{y^{n+1}}{\beta^n} \bigg|_{y=0}^{\beta} = \frac{n}{n+1} \beta $$
So $Y_n$ is biased for every finite $n$, but sending $n$ to infinity gives $\mathbb{E}[Y_n] \to \beta$, so $Y_n$ is asymptotically unbiased.
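A small simulation sketch (arbitrary $\beta$, numpy only) shows that the sample maximum underestimates $\beta$ but that the bias vanishes as $n$ grows:

```python
# Empirical mean of Y_n (the sample maximum) for Uniform(0, beta) vs. n/(n+1)*beta.
import numpy as np

rng = np.random.default_rng(1)
beta, reps = 5.0, 20_000

for n in (5, 50, 500):
    y_n = rng.uniform(0.0, beta, size=(reps, n)).max(axis=1)
    print(n, y_n.mean(), n / (n + 1) * beta)  # empirical mean vs. theory
```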
Let's consider the following inequality from "Course Notes for Math 162: Mathematical Statistics: The Cramér-Rao Inequality" by Adam Merberg and Steven J. Miller.
Cramér-Rao Inequality
Suppose we have a probability density function $f(x ; \theta)$ with a continuous parameter $\theta$. Let $X_1, \ldots, X_n$ be independent random variables with density $f(x ; \theta)$, and $\phi(\vec{x}) = \hat{\Theta}\left(x_1, \ldots, x_n\right) $ be an unbiased estimator of $\theta$.
Let $F(\vec{x}, \theta) = \prod_{i=1}^n f\left(x_i ; \theta\right)$ denote the joint density. We assume that $f(x ; \theta)$ satisfies the following two conditions:
1. Differentiation under the integral sign is valid:
$$\frac{\partial}{\partial \theta}\left[ \int_M \phi(\vec{x}) F(\vec{x}, \theta)\, d \vec{x}\right] = \int_M \phi(\vec{x}) \frac{\partial F(\vec{x}, \theta)}{\partial \theta}\, d \vec{x},$$ where $M$ denotes the domain of the joint density $F(\vec{x}, \theta)$ and $m$ denotes the domain of $f(x;\theta)$.
2. For each $\theta$, the variance of $\phi(\vec{x})$ is finite.
Under these conditions, the variance of $\phi = \hat{\Theta}$ satisfies:
$$\operatorname{var}(\phi) \times \left(n \mathbb{E}\left[\left(\frac{\partial }{\partial \theta}\ln f(x ; \theta)\right)^2\right]\right) \geq 1$$
Proof.
Since $\phi(\vec x)$ is an unbiased estimator of $\theta$, we will have
$$0=\mathbb{E}[\phi(\vec x) - \theta] = \int_M (\phi(\vec x) - \theta)F(\vec x, \theta) d \vec{x} = \int_M \phi(\vec x)F(\vec x, \theta) d \vec{x} - \int_M\theta F(\vec x, \theta) d \vec{x}$$
If we take the partial derivative of both sides with respect to $\theta$, then we will have
$$0 = \int_M \phi(\vec{x}) \frac{\partial}{\partial \theta}F(\vec x, \theta) d \vec{x} - \int_M F(\vec x, \theta) d \vec{x} - \int_M \theta\frac{\partial }{\partial \theta} F(\vec x, \theta) d \vec{x}$$
Since $\int_M F(\vec x, \theta)\, d \vec{x} = 1$, this simplifies to
$$1 = \int_M \left(\phi(\vec{x})-\theta\right) \frac{\partial}{\partial \theta}F(\vec x, \theta)\, d \vec{x}$$
We also notice that
$$\frac{\partial}{\partial \theta}F(\vec x, \theta) = \sum_{i=1}^n \frac{\partial f(x_i; \theta)}{\partial \theta} \prod_{\substack{j=1 \\ j\not =i}}^n f(x_j;\theta) = \sum_{i=1}^n \frac{1}{f(x_i;\theta)} \frac{\partial f(x_i; \theta)}{\partial \theta} \prod_{j=1}^n f(x_j;\theta) = F(\vec x,\theta)\sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(x_i; \theta).$$
So we will have:
$$1 = \int_{M}(\phi(\vec x) -\theta)(F(\vec x;\theta))^{1/2}(F(\vec x;\theta))^{1/2}\sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(x_i; \theta) \, d\vec x$$
By the Cauchy-Schwarz inequality, we get the following.
$$1 \le \left[\int_{M}(\phi(\vec x) -\theta)^2(F(\vec x;\theta))\, d\vec x\right]^{1/2}\left[\int_{M}\left(\sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2 F(\vec x;\theta) \, d\vec x\right]^{1/2}$$
For the first factor, since $\phi$ is unbiased we have $\theta = \mathbb{E}[\phi(\vec x)]$, so $\int_M (\phi(\vec x)-\theta)^2 F(\vec x,\theta)\, d\vec x = \operatorname{var}(\phi)$. The second factor needs further analysis. Squaring both sides, for now we only have the inequality:
$$1 \le \operatorname{var}(\phi)\left[\int_{M}\left(\sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2 F(\vec x;\theta) \, d\vec x\right]$$.
We want to show $ \int_{M}\left(\sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2 F(\vec x;\theta) \, d\vec x= n\mathbb{E}\left[\left(\frac{\partial }{\partial \theta}\ln f(x ; \theta)\right)^2\right]$
$$\int_{M}\left(\sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2 F(\vec x;\theta) \, d\vec x = \int_{M}\left[\sum_{i=1}^n\left(\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2+\sum_{i\not=j} \frac{\partial}{\partial \theta} \ln f(x_i; \theta)\frac{\partial}{\partial \theta} \ln f(x_j; \theta)\right] F(\vec x;\theta) \, d\vec x$$
So we only need to show that $\int_{M}\left(\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2 F(\vec x;\theta) \, d\vec x = \mathbb{E}\left[\left(\frac{\partial }{\partial \theta}\ln f(x ; \theta)\right)^2\right]$ and, for $i \not= j$,
$ \int_{M}\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\frac{\partial}{\partial \theta} \ln f(x_j; \theta)F(\vec x;\theta) \, d\vec x=0.$
One can easily see that
$$\int_{M}\left[\left(\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2\right] F(\vec x;\theta) \, d\vec x = \int_{m}\left[\left(\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2\right] f(x_i;\theta)\, dx_i \times \prod_{j\not=i} \int f(x_j;\theta)\, dx_j \\= \int_{m}\left[\left(\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\right)^2\right] f(x_i;\theta)\, dx_i\\=\int_{m}\left[\left(\frac{\partial}{\partial \theta} \ln f(x; \theta)\right)^2\right] f(x;\theta)\, dx= \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \ln f(x; \theta)\right)^2\right] $$
And for
$$ \int_{M}\frac{\partial}{\partial \theta} \ln f(x_i; \theta)\frac{\partial}{\partial \theta} \ln f(x_j; \theta)F(\vec x;\theta) \, d\vec x = \int_{m}\frac{\partial}{\partial \theta} \ln f(x_i; \theta) f(x_i;\theta)dx_i \int_{m}\frac{\partial}{\partial \theta}\ln f(x_j; \theta)f(x_j;\theta)dx_j $$
We notice that
$$\int_{m}\frac{\partial}{\partial \theta} \ln f(x_i; \theta) f(x_i;\theta)\,dx_i = \int_{m}\frac{\partial}{\partial \theta} f(x_i;\theta)\, dx_i = 0,$$
since $1=\int_{m}f(x_i;\theta)\, dx_i \implies 0 = \int_{m}\frac{\partial}{\partial \theta} f(x_i;\theta)\, dx_i.$ This completes the proof. The quantity $I(\theta) := \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\ln f(x;\theta)\right)^2\right]$ is called the Fisher information, and an unbiased estimator whose variance attains the lower bound $\frac{1}{nI(\theta)}$ has the smallest possible variance among unbiased estimators; it is called a minimum variance unbiased estimator (MVUE).
Example. Let $X_1, \ldots, X_n$ be a random sample of $N\left(\mu, \sigma^2\right)$ where $\mu$ is to be estimated. Then $\bar{X}$ is an MVUE of $\mu$.
Solution: $\mathbb{E}[\bar X]=\frac{n\mathbb{E}[X]}{n} = \mu$ and $\operatorname{V}[\bar X] = \operatorname{V}\left[\frac{\sum_{i=1}^n X_i}{n}\right] = \frac{n}{n^2}\operatorname{V}[X] = \frac{\sigma^2}{n}.$
We know the normal distribution has pdf $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, so $\ln f(x) = \ln\left(\frac{1}{\sqrt{2\pi}\sigma}\right) - \frac{(x-\mu)^2}{2\sigma^2}$, then $\frac{\partial}{\partial \mu} \ln f(x) = \frac{x-\mu}{\sigma^2}$ and $\left(\frac{\partial}{\partial \mu} \ln f(x)\right)^2 = \frac{(x-\mu)^2}{\sigma^4}$.
Then $I(\mu) = \int_{-\infty}^{\infty} \frac{(x-\mu)^2}{\sigma^4} f(x)\,dx = \frac{\mathbb{E}[(X-\mu)^2]}{\sigma^4} = \frac{1}{\sigma^2}$ and so we will have $$\frac{1}{n I(\mu)} = \frac{1}{n \frac{1}{\sigma^2}} = \frac{\sigma^2}{n} = \operatorname{V}[\bar X].$$
So $\bar{X}$ is an MVUE of $\mu$.
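As a numerical sanity check (a sketch with arbitrarily chosen $\mu$, $\sigma$, and $n$), the empirical variance of $\bar{X}$ matches the Cramér-Rao bound $\frac{1}{nI(\mu)} = \frac{\sigma^2}{n}$:

```python
# Compare the simulated Var[Xbar] with the Cramer-Rao lower bound sigma^2 / n.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 1.0, 2.0, 25, 50_000

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
print(xbar.var())     # empirical Var[Xbar]
print(sigma**2 / n)   # Cramer-Rao bound; both are approximately 0.16
```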
We will review the Cramer-Rao inequality:
$$\operatorname{var}(\phi) \times \left(n \mathbb{E}\left[\left(\frac{\partial }{\partial \theta}\ln f(x ; \theta)\right)^2\right]\right) \geq 1$$
Suppose we are given two point estimators $\Theta_1$ and $\Theta_2$ of our parameter $\theta$. How can we decide which one to use? Let's consider the case where $\Theta_1$ is the MVUE of $\theta$, so that
$$\operatorname{V}(\Theta_1) \times \left(n I(\theta)\right) = 1$$
We also know that
$$ \operatorname{V}(\Theta_2) \times \left(n I(\theta)\right) \ge 1$$
and so we will have essentially:
$$\operatorname{V}(\Theta_2) \times \left(n I(\theta)\right) \ge \operatorname{V}(\Theta_1) \times \left(n I(\theta)\right) $$
If we take the reciprocal of the inequality, we will have $$\frac{1}{\operatorname{V}(\Theta_2) \times \left(n I(\theta)\right)} \le \frac{1}{\operatorname{V}(\Theta_1) \times \left(n I(\theta)\right)} $$.
Here comes the definition of efficiency. If one unbiased estimator is more efficient than another, it plays the role of $\Theta_1$ in the above inequality. So the efficiency of an unbiased estimator $\Theta$ is defined as $$e(\Theta) =\frac{1}{\operatorname{V}(\Theta) \times \left(n I(\theta)\right)},$$
and $$\frac{e(\Theta_1)}{e(\Theta_2)} = \frac{\operatorname{V}[\Theta_2]}{\operatorname{V}[\Theta_1]}.$$
Example. Let $X_1, \ldots, X_n \sim \operatorname{Uniform}(0, \beta)$ be a random sample where $\beta$ is a parameter. Show that both $2 \bar{X}$ and $\frac{n+1}{n} Y_n$ are unbiased estimators of $\beta$. Compare their efficiency.
First, let's consider the variance of the uniform distribution. We know that the mean of the above uniform distribution is $\frac{\beta}{2}$, and substituting $u = x - \beta/2$, the variance of the distribution is $\frac{1}{\beta}\int_0^{\beta} (x-\beta/2)^2 \, dx = \frac{1}{\beta}\cdot\frac{u^3}{3} \bigg|_{u=-\beta/2}^{\beta/2} = \frac{1}{12}\beta^2.$
And so $\mathbb{E}[2\bar{X}] = 2\cdot\frac{\beta}{2}=\beta.$ and $\operatorname{V}[2\bar{X}] = \frac{4}{n^2} \cdot \frac{n\beta^2}{12} = \frac{\beta^2}{3n}$
Now let's consider the density of $Y_n$: $f_{Y_n}(y) = n \left(\int_0^{y} \frac{1}{\beta}\, dt\right)^{n-1} \frac{1}{\beta} =n \left( \frac{y}{\beta}\right)^{n-1} \frac{1}{\beta} = n\cdot \frac{y^{n-1}}{\beta^n} $
And so $\mathbb{E}[Y_n] = \int_0^\beta n\cdot \frac{y^n}{\beta^n} \,dy = \frac{n}{n+1} \frac{y^{n+1}}{\beta^n} \bigg|_{y=0}^{y=\beta} = \frac{n}{n+1} \beta$
And so $$\mathbb{E}\left[\frac{n+1}{n}Y_n\right] = \frac{n+1}{n}\frac{n}{n+1} \beta = \beta.$$
Now let's consider:
$$\mathbb{E}[Y_n^2] = \int_0^{\beta} y^2 \cdot n \cdot \frac{y^{n-1}}{\beta^n} \, dy \\=\frac{n}{\beta^n}\int_0^{\beta} y^{n+1} \, dy \\= \frac{n}{\beta^n} \frac{\beta^{n+2}}{n+2}\\= \frac{n}{n+2}\beta^2.$$
And so $$\operatorname{V}[Y_n] = \mathbb{E}[Y_n^2]- (\mathbb{E}[Y_n])^2 = \frac{n}{n+2}\beta^2 - \left(\frac{n}{n+1}\beta\right)^2=\frac{n(n+1)^2 -n^2(n+2)}{(n+2)(n+1)^2}\beta^2 = \frac{n(n^2+2n+1) -n^3 -2n^2}{(n+1)^2(n+2)}\beta^2 = \frac{n}{(n+1)^2(n+2)}\beta^2$$
And so $\operatorname{V}\left[\frac{n+1}{n}Y_n\right] = \left(\frac{n+1}{n}\right)^2\operatorname{V}[Y_n] = \left(\frac{n+1}{n}\right)^2 \frac{n}{(n+1)^2(n+2)}\beta^2 = \frac{1}{n(n+2)}\beta^2$
So the efficiency ratio is $$\frac{e\left(\frac{n+1}{n}Y_n\right)}{e(2\bar{X})} = \frac{\operatorname{V}[2\bar{X}]}{\operatorname{V}\left[\frac{n+1}{n}Y_n\right]} = \frac{\frac{1}{3n}\beta^2}{\frac{1}{n(n+2)}\beta^2} = \frac{n+2}{3}$$
If $n >1$, this ratio exceeds $1$, so $\frac{n+1}{n}Y_n$ is more efficient than $2\bar{X}$.
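A Monte Carlo comparison of the two unbiased estimators (a sketch; $\beta$ and $n$ are arbitrary choices) matches the variances derived above:

```python
# Compare 2*Xbar with (n+1)/n * Y_n as estimators of beta for Uniform(0, beta).
import numpy as np

rng = np.random.default_rng(3)
beta, n, reps = 4.0, 10, 100_000

x = rng.uniform(0.0, beta, size=(reps, n))
est1 = 2.0 * x.mean(axis=1)                 # 2 * Xbar
est2 = (n + 1) / n * x.max(axis=1)          # (n+1)/n * Y_n

print(est1.mean(), est2.mean())             # both close to beta = 4
print(est1.var(), beta**2 / (3 * n))        # ~ beta^2 / (3n)
print(est2.var(), beta**2 / (n * (n + 2)))  # ~ beta^2 / (n(n+2)), much smaller
```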
Definition. Let $\hat{\Theta}$ be any point estimator of $\theta$. Then the mean square error (MSE) of $\hat{\Theta}$ is defined as
$$\operatorname{MSE}(\hat{\Theta})=\mathbb{E}\left[(\hat{\Theta}-\theta)^2\right]$$
Then we will have $$\mathbb{E}\left[(\hat{\Theta}-\theta)^2\right] = \mathbb{E}\left[(\hat{\Theta}-\Theta +\Theta-\theta)^2\right] = \mathbb{E}\left[(\hat{\Theta}-\Theta)^2\right] +(\Theta-\theta)^2 = \operatorname{V}[\hat{\Theta}]+b(\hat{\Theta})^2,$$
where $\Theta = \mathbb{E}[\hat{\Theta}]$, $\operatorname{V}[\hat{\Theta}]$ is the variance of $\hat{\Theta}$, and $b(\hat{\Theta}) = \Theta - \theta$ is the bias of $\hat{\Theta}$. The cross term vanishes because $\mathbb{E}[\hat{\Theta}-\Theta] = 0$ and $\Theta-\theta$ is a constant.
Example. Let's compare the Mean Squared Error (MSE) of two estimators for $\sigma^2$: $S^2$ and $\frac{n-1}{n} S^2$, for a normal population $N(\mu, \sigma^2)$.
Solution: Let's consider the variances of $S^2$ and $\frac{n-1}{n}S^2$, and also the biases of $S^2$ and $\frac{n-1}{n}S^2$.
For $S^2$, $\mathbb{E}[S^2] = \sigma^2$ and since $\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2$ so we will have
$$\operatorname{V}\left[\frac{(n-1)S^2}{\sigma^2}\right] =2(n-1)$$
and we also know that $\operatorname{V}\left[\frac{(n-1)S^2}{\sigma^2}\right] = \frac{(n-1)^2}{\sigma^4}\operatorname{V}[S^2]$
so we will have $\operatorname{V}[S^2] = \frac{2\sigma^4}{n-1}.$
And also $\mathbb{E}[\frac{n-1}{n}S^2] = \frac{n-1}{n}\sigma^2$ and $\operatorname{V}[\frac{n-1}{n}S^2] = \left(\frac{n-1}{n}\right)^2 \frac{2\sigma^4}{n-1} = 2\left(\frac{n-1}{n^2}\right)\sigma^4$
So the MSE of $S^2$ is $\frac{2\sigma^4}{n-1}$, and the MSE of $\frac{n-1}{n}S^2$ is its variance plus squared bias: $2\left(\frac{n-1}{n^2}\right)\sigma^4 + \frac{1}{n^2}\sigma^4 =\frac{2n-1}{n^2}\sigma^4$
so the ratio will be $$\frac{\operatorname{MSE}[\frac{n-1}{n}S^2]}{\operatorname{MSE}[S^2]} = \frac{\frac{2n-1}{n^2}\sigma^4}{\frac{2}{n-1}\sigma^4} =\frac{(2n-1)(n-1)}{2n^2}= \frac{2n^2 -3n +1}{2n^2}= 1+ \frac{1-3n}{2n^2}<1$$
So $\frac{n-1}{n}S^2$ has a smaller MSE than $S^2$.
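A simulation sketch (assuming a standard normal population and $n = 10$, both arbitrary) comparing the two MSE formulas:

```python
# MSE of the unbiased S^2 vs. the biased (n-1)/n * S^2 for a normal population.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 0.0, 1.0, 10, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)          # unbiased sample variance S^2
s2_shrunk = (n - 1) / n * s2        # biased estimator (n-1)/n * S^2

print(np.mean((s2 - sigma**2) ** 2), 2 * sigma**4 / (n - 1))               # MSE of S^2
print(np.mean((s2_shrunk - sigma**2) ** 2), (2 * n - 1) / n**2 * sigma**4)  # smaller MSE
```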
Consistency is defined as follows: an estimator $\hat{\Theta}$ of $\theta$, based on a random sample of size $n$, is considered consistent if, for any positive value $c$, the probability that the absolute difference between the estimator $\hat{\Theta}$ and the true parameter $\theta$ is greater than $c$ approaches zero as the sample size $n$ tends to infinity:
$$\lim_{n\to\infty} \Pr\left[\left|\hat\Theta-\theta\right|>c\right] = 0$$
Theorem. If $\hat{\Theta}$ is an unbiased estimator of $\theta$ and $\operatorname V[\hat{\Theta}] \rightarrow 0$ as $n \rightarrow \infty$, then $\hat{\Theta}$ is a consistent estimator of $\theta$.
Proof:
By Chebyshev's inequality (using $\mathbb{E}[\hat{\Theta}] = \theta$ since $\hat{\Theta}$ is unbiased), for any real number $k>0$,
$$\Pr\left(|\hat{\Theta}-\theta| \geq k \sqrt{\operatorname V[\hat{\Theta}]}\right) \leq \frac{1}{k^2}$$
Setting $k$ to satisfy $k\sqrt{\operatorname V[\hat{\Theta}]} = c$, i.e. $k = c/\sqrt{\operatorname V[\hat{\Theta}]}$, and sending $n$ to infinity we will have
$$\lim_{n\to\infty}\Pr(|\hat{\Theta}-\theta| \geq c) \leq \lim_{n\to\infty}\frac{\operatorname V[\hat{\Theta}]}{c^2} =0$$
Example. Suppose $S^2$ is the sample variance of the random sample from a normal population $N\left(\mu, \sigma^2\right)$, then $S^2$ is a consistent estimator of $\sigma^2$.
We know from the previous that $\frac{(n-1)S^2}{\sigma^2}\sim \chi_{n-1}^2$ so
$$\frac{(n-1)^2}{\sigma^4}\operatorname V[S^2] = 2(n-1) \implies \operatorname V[S^2] =\frac{2\sigma^4}{n-1}$$
then we will have $$\lim_{n\to \infty} V[S^2] =\lim_{n\to \infty}\frac{2\sigma^4}{n-1} =0$$
So $S^2$ is consistent.
Example: Let $f$ be a probability density function (pdf) with a finite mean $\mu$ and variance $\sigma^2$.
Consider a natural number $n$, and let $X_1, \ldots, X_n$ be a random sample drawn from $f$.
Additionally, let $Y_n$ follow a Bernoulli distribution with parameter $\frac{1}{n}$ and be independent of the random sample.
We define the estimator $\hat{\Theta}_n$ as $\hat{\Theta}_n = n^2 Y_n + \left(1 - Y_n\right) \bar{X}_n$, where $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$ represents the sample mean.
Show that $\hat{\Theta}_n$ is neither unbiased nor asymptotically unbiased, but it is a consistent estimator of $\mu$.
$$\mathbb{E}[\hat{\Theta}_n] = n^2 \mathbb{E}[Y_n] + \mathbb{E}[1-Y_n] \mathbb{E}[\bar{X_n}] = n +\frac{n-1}{n}\mu$$
Since $\mathbb{E}[\hat{\Theta}_n] - \mu = n - \frac{\mu}{n} \to \infty$, $\hat{\Theta}_n$ is neither unbiased nor asymptotically unbiased.
$$\lim_{n\to\infty}\Pr\left[\left|\hat\Theta_n-\mu\right| \ge c\right] = \lim_{n\to\infty}\left(\Pr\left[\left|\hat\Theta_n-\mu\right| \ge c \mid Y_n=0\right] \Pr[Y_n=0]+\Pr\left[\left|\hat\Theta_n-\mu\right| \ge c\mid Y_n=1\right] \Pr[Y_n=1]\right) \\=\lim_{n\to\infty}\left(\Pr\left[\left|\bar{X}_n-\mu\right| \ge c \right] \left(1-\tfrac{1}{n}\right)+\Pr\left[\left|n^2-\mu\right| \ge c\right] \tfrac{1}{n}\right)\\ \le \lim_{n\to\infty}\left(\frac{\sigma^2}{c^2 n} + \frac{1}{n}\right) = 0,$$
where the first term is bounded by Chebyshev's inequality and the second term is at most $\frac{1}{n}$. So $\hat{\Theta}_n$ is a consistent estimator of $\mu$.
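A sketch of this example (the population $f$, the values of $\mu$, $\sigma$, and the threshold $c$ below are arbitrary choices): the estimator's mean blows up with $n$, yet the probability of deviating from $\mu$ by more than $c$ shrinks to zero.

```python
# Theta_hat_n = n^2 * Y_n + (1 - Y_n) * Xbar_n: badly biased, yet consistent.
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, c, reps = 3.0, 2.0, 0.5, 50_000

for n in (10, 100, 1000):
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    y = rng.binomial(1, 1.0 / n, size=reps)
    theta_hat = n**2 * y + (1 - y) * xbar
    # the mean stays far from mu (roughly n + mu), but the exceedance probability -> 0
    print(n, theta_hat.mean(), np.mean(np.abs(theta_hat - mu) > c))
```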
Definition: An estimator $\hat{\Theta}$ is considered sufficient for the parameter $\theta$ if the conditional probability of the random sample $X_1, \ldots, X_n$ given $\hat{\Theta}=\widehat{\theta}$ is independent of $\theta$. In other words, the probability distribution $f_{X_1, \ldots, X_n \mid \widehat{\Theta}}\left(x_1, \ldots, x_n \mid \widehat{\theta}\right)$ does not depend on the specific value of $\theta$.
Example. If $X_1, \ldots, X_n$ is a random sample of $\text{Bernoulli}(p)$, then $\hat{\Theta}=\bar{X}$ is a sufficient estimator of $p$.
Proof. $\Pr[\vec{X} =\vec{x}] = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^nx_i}$ and $$\Pr[\hat{\Theta} =\hat\theta]=\Pr\left[\sum_{i=1}^n X_i=n\hat\theta\right]= {n \choose n \hat\theta}p^{n\hat\theta} (1-p)^{n(1-\hat\theta)} = {n \choose \sum_{i=1}^n x_i}p^{\sum_{i=1}^n x_i} (1-p)^{n-\sum_{i=1}^n x_i},$$ where $n\hat\theta = \sum_{i=1}^n x_i$.
Then $$\Pr\left[\vec{X} =\vec{x} \mid \hat{\Theta} =\hat\theta\right] = \frac{\Pr[\vec{X} =\vec{x}]}{\Pr[\hat{\Theta} =\hat\theta]} = \frac{1}{{n\choose \sum_{i=1}^n x_i}},$$ which does not depend on $p$.
So $\hat \Theta$ is a sufficient estimator.
Example. Let $X_1, X_2, X_3$ be a random sample of Bernoulli $(\theta)$, then $\hat{\Theta}=$ $\frac{1}{6}\left(X_1+2 X_2+3 X_3\right)$ is not a sufficient estimator of $\theta$.
$$\Pr[X_1 =x_1, X_2 = x_2, X_3 = x_3] = \theta^{x_1+x_2+x_3} (1-\theta)^{3-x_1-x_2-x_3}.$$
$$\Pr[\hat \Theta =\frac{1}{2}] =\Pr[X_1+2X_2+3X_3 =3] \\= \Pr[X_1 =0, X_2 = 0, X_3 = 1] + \Pr[X_1 =1, X_2 = 1, X_3 = 0] \\= \theta(1-\theta)^2 + (1-\theta)\theta^2 $$
And $$\frac{\Pr[X_1 =0, X_2 = 0, X_3 = 1]}{\Pr[\hat \Theta =\frac{1}{2}]} = \frac{\theta(1-\theta)^2 }{\theta(1-\theta)^2 +(1-\theta)\theta^2} = \frac{1-\theta}{1-\theta+\theta} = 1-\theta,$$
which still depends on $\theta$, so $\hat{\Theta}$ is not a sufficient estimator of $\theta$.
Theorem: $\hat{\Theta}$ is a sufficient estimator of $\theta$ if and only if the joint probability mass function (pmf) or probability density function (pdf) of $X_1, \ldots, X_n$ can be expressed as a product of two functions, $\phi(\hat{\theta}; \theta)$ and $h(x_1, \ldots, x_n)$, where the function $h$ does not depend on $\theta$.
Example. Consider a normal population $N\left(\mu, \sigma^2\right)$ for known $\sigma^2$. Show that $\bar{X}$ is a sufficient estimator of $\mu$.
Solution:
$$f_{\vec{X}}(\vec x) = \frac{1}{\sigma^n(2\pi)^{n/2}} \exp\left(-\sum_{i=1}^n\frac{(x_i-\mu)^2}{2\sigma^2}\right) \\= \frac{1}{\sigma^n(2\pi)^{n/2}} \exp\left(-\sum_{i=1}^n\frac{(x_i-\bar{x}+\bar{x}-\mu)^2}{2\sigma^2}\right) \\= \frac{1}{\sigma^n(2\pi)^{n/2}} \exp\left(-\frac{\sum_{i=1}^n(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}\right)\\ = \exp\left(-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right)\cdot\prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{(x_i -\bar{x})^2}{2\sigma^2}\right),$$
where the cross term vanishes because $\sum_{i=1}^n (x_i - \bar{x}) = 0$. The first factor depends on the data only through $\bar{x}$ and the second factor does not depend on $\mu$, so by the factorization theorem $\bar{X}$ is a sufficient estimator of $\mu$.
Now, let's discuss approaches for constructing point estimators. There are three common methods used for this purpose:
Method of Moments
Maximum Likelihood Estimation
Bayesian Estimation
The Method of Moments is a technique used to estimate the parameters $\theta_1, \ldots, \theta_r$ of a distribution $f$. The approach involves setting up a system of $r$ equations:
$$ \mu_k^{\prime}\left(\theta_1, \ldots, \theta_r\right) = m_k^{\prime}, \quad \text{for } k = 1, \ldots, r, $$
where $\mu_k^{\prime}$ represents the $k$-th population moment $\mathbb{E}[X^k]$ of the distribution $f$, and $m_k^{\prime} = \frac{1}{n}\sum_{i=1}^n x_i^k$ represents the corresponding $k$-th sample moment (calculated from the given data $x_1, \ldots, x_n$). Solving this system of equations yields the estimates for the parameters:
$$ \widehat{\theta}_k = \widehat{\theta}_k\left(x_1, \ldots, x_n\right), \quad \text{for } k = 1, \ldots, r. $$
Thus, the resulting estimators obtained through the Method of Moments are denoted as $\hat{\Theta}_k = \hat{\Theta}_k\left(X_1, \ldots, X_n\right)$ for $k = 1, \ldots, r$.
Example. Given a random sample of size $n$ from $\operatorname{Uniform}(a, b)$, use the method of moments to obtain an estimator of $a, b$.
We know that the first moment of the uniform distribution is $\frac{a+b}{2}$ and the second moment of the uniform distribution is $\frac{a^2+ab+b^2}{3}.$ Then we will have
$$\begin{cases} \frac{a+b}{2} = x \\ \frac{a^2+ab+b^2}{3} = y\end{cases}$$
where $x = \bar{X}$ is the sample mean and $y = \overline{X^2} = \frac{1}{n}\sum_{i=1}^n X_i^2$ is the sample second moment.
Then we will have $a = 2x - b$, and substituting into the second equation gives $a^2+ab+b^2 = b^2 - 2 b x + 4 x^2=3y$, i.e. $b^2-2bx + 4x^2 - 3y = 0$. This quadratic in $b$ has discriminant $\Delta = 4x^2 - 4(4x^2 - 3y) = 12(y-x^2)$, and taking the larger root (since $b \ge a$), $$b = \frac{2x +\sqrt{12(y-x^2)}}{2} = x + \sqrt{3(y-x^2)}$$ can be an estimator of $b$.
Then $a = 2x - b = x -\sqrt{3(y-x^2)}$ is the corresponding estimator of $a$.
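A quick illustration of these formulas on simulated data (a sketch; the true $a$, $b$, and sample size are arbitrary):

```python
# Method-of-moments estimates for Uniform(a, b) using x + / - sqrt(3 * (y - x^2)).
import numpy as np

rng = np.random.default_rng(6)
a_true, b_true, n = 1.0, 7.0, 5_000
data = rng.uniform(a_true, b_true, size=n)

x = data.mean()         # first sample moment
y = np.mean(data**2)    # second sample moment

half_width = np.sqrt(3.0 * (y - x**2))
a_hat, b_hat = x - half_width, x + half_width
print(a_hat, b_hat)     # close to (1, 7)
```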
Example. Given a random sample of size $n$ from $\operatorname{Gamma}(\alpha, \beta)$, use the method of moments to obtain estimators of $\alpha$ and $\beta$.
Solution: We use the same definitions of $x$ and $y$ as in the problem above, but now for the Gamma distribution. Then we will have
$$\begin{cases}\alpha \beta = x\\ \alpha(\alpha+1)\beta^2 = y\end{cases}$$
And so $$\begin{cases}\alpha \beta =x \\ (\alpha+1)\beta = \frac{y}{x}\end{cases}$$
And so $$\begin{cases} \alpha= \frac{x^2}{y-x^2} \\ \beta = \frac{y}{x}-x = \frac{y-x^2}{x}\end{cases}$$
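The same idea in code (a sketch; note that numpy's gamma sampler is assumed here to use the same shape/scale parametrization as above):

```python
# Method-of-moments estimates for Gamma(alpha, beta) from the closed forms above.
import numpy as np

rng = np.random.default_rng(7)
alpha_true, beta_true, n = 2.5, 1.5, 20_000
data = rng.gamma(shape=alpha_true, scale=beta_true, size=n)

x = data.mean()
y = np.mean(data**2)

alpha_hat = x**2 / (y - x**2)
beta_hat = (y - x**2) / x
print(alpha_hat, beta_hat)   # close to (2.5, 1.5)
```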
Maximum likelihood estimation
The Maximum Likelihood Estimation (MLE) is a method used to estimate the parameter $\theta$ of a statistical distribution, given the observed values $x_1, \ldots, x_n$. To perform MLE, we define the likelihood function $L(\theta)$, which is the joint probability density function (pdf) or probability mass function (pmf) of the observed data $x_1, \ldots, x_n$ when treated as a function of the parameter $\theta$. The likelihood function is given by:
$$ L(\theta) = f_{X_1, \ldots, X_n}\left(x_1, \ldots, x_n; \theta\right) = \prod_{i=1}^n f\left(x_i; \theta\right). $$
In this equation, $f\left(x_i; \theta\right)$ represents the pdf or pmf of each individual data point $x_i$, and the product of these probabilities is taken for all $n$ observed data points.
The MLE of $\theta$, denoted as $\hat{\theta}$, is the value that maximizes the likelihood function $L(\theta)$. In other words, it is the parameter value that makes the observed data most probable, given the assumed statistical model characterized by $f\left(x; \theta\right)$. Mathematically, we find $\hat{\theta}$ by solving:
$$ \hat{\theta} = \arg\max_{\theta} L(\theta). $$
The value $\hat{\theta}$ obtained in this way is referred to as the Maximum Likelihood Estimate (MLE) of $\theta$, which provides an estimate of the unknown parameter based on the observed data. The MLE is a widely used and valuable method for parameter estimation in various statistical models and machine learning algorithms.
Example. Given $x$ successes in $n$ trials, find the MLE of $\theta$ in the corresponding $\operatorname{Binomial}(n, \theta)$.
We see that the probability we want to maximize is $$p(X = x; \theta) = {n \choose x} \theta^x (1-\theta)^{n-x}$$
If we take the derivative with respect to $\theta$ and set it to zero, we will have the following:
$$\frac{\partial}{\partial \theta}p(X = x; \theta) = {n \choose x} \left(x\theta^{x-1} (1-\theta)^{n-x} -(n-x)\theta^x (1-\theta)^{n-x-1}\right)=0$$
and so we will have
$$x(1-\theta)-(n-x) \theta = 0 \implies x -x\theta -n\theta+x\theta =0\implies \theta = \frac{x}{n}.$$
Example. Let $X_1, \ldots, X_n$ be a random sample from the Exponential$(\theta)$ distribution. Find the MLE of $\theta$.
Solution:
So the exponential distribution has the following pdf:
$$f_{X_1}(x) = \frac{1}{\theta} \exp(-\frac{x}{\theta})$$
and so the joint pdf will have the following:
$$f_{\vec X}(\vec x) = \frac{1}{\theta^n} \exp\left(-\frac{\sum_{i=1}^n x_i}{\theta}\right)$$
If we take $\ln$ of both sides and maximize the resulting log-likelihood, we will have
$$\frac{\partial}{\partial\theta} \ln f_{\vec X}(\vec x) = \frac{\partial}{\partial\theta}\left( -n\ln(\theta)-\frac{\sum_{i=1}^n x_i}{\theta}\right)\\ = -\frac{n}{\theta}+\theta^{-2}\sum_{i=1}^n x_i = 0$$
It means that $$\frac{n}{\theta} = \frac{\sum_{i=1}^n x_i}{\theta^2} \implies \theta = \frac{\sum_{i=1}^nx_i}{n} = \bar{x}.$$
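A sketch checking the closed-form MLE against a brute-force grid maximization of the log-likelihood (the grid range and simulated sample are arbitrary choices):

```python
# The MLE of the exponential mean theta is the sample mean; verify numerically.
import numpy as np

rng = np.random.default_rng(8)
theta_true, n = 3.0, 2_000
data = rng.exponential(scale=theta_true, size=n)

theta_mle = data.mean()                    # closed-form MLE

grid = np.linspace(0.5, 10.0, 10_000)
loglik = -n * np.log(grid) - data.sum() / grid
print(theta_mle, grid[np.argmax(loglik)])  # both close to theta_true = 3
```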
Example. Let $X_1, \ldots, X_n$ be a random sample of Uniform(0, $\left.\beta\right)$. Find the MLE of $\beta$.
We will have the following pdf:
$$f_{\vec X}(\vec x) = \prod_{i=1}^n f(x_i;\beta) = \prod_{i=1}^n \frac{1}{\beta} I_{[0,\beta]}(x_i) = \begin{cases}\beta^{-n} & \text{ if }\beta \ge \max_{1\le i \le n}x_i \\ 0 &\text{ otherwise}\end{cases}$$
Since $\beta^{-n}$ is decreasing in $\beta$, the likelihood is maximized by taking $\beta$ as small as the constraint allows, and therefore $\hat\beta = \max_{1\le i \le n} x_i = Y_n$ is the maximum likelihood estimator.
Example. Let $X_1, \ldots, X_n$ be a random sample of $N\left(\mu, \sigma^2\right)$ where both $\mu$ and $\sigma^2$ are unknown. Find the MLE of $\mu$ and $\sigma^2$.
We have the following pdf:
$$f_{\vec X}(\vec x) = (2\pi \sigma^2)^{-n/2}\exp(-\frac{\sum(x_i-\mu)^2}{2\sigma^2}) \\= (2\pi u)^{-n/2}\exp(-\frac{\sum(x_i-\mu)^2}{2u})$$
where $u = \sigma^2$ and if we take natural log, we will have $$v:= \ln f_{\vec X}(\vec x) = -\frac{n}{2}\ln(2\pi u)-\frac{\sum(x_i-\mu)^2}{2u}$$
Then $$\frac{\partial}{\partial u} v = -\frac{n}{2u} +\frac{\sum(x_i-\mu)^2}{2u^2} =0$$
and $$\frac{\partial}{\partial \mu} v = \frac{\sum (x_i-\mu)}{u} =0 $$
and so $\hat\mu = \frac{\sum x_i}{n} = \bar{x}$, and then
$$n u = \sum (x_i - \hat\mu)^2 \implies \hat\sigma^2 = \hat u = \frac{\sum (x_i - \bar{x})^2}{n}$$
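A quick numerical check of the two closed-form MLEs (a sketch with arbitrary true parameters); note the divisor $n$ rather than $n-1$ for the variance estimate:

```python
# Normal MLEs: mu_hat is the sample mean, sigma2_hat is the mean squared deviation.
import numpy as np

rng = np.random.default_rng(9)
mu_true, sigma_true, n = 5.0, 2.0, 10_000
data = rng.normal(mu_true, sigma_true, size=n)

mu_hat = data.mean()
sigma2_hat = np.mean((data - mu_hat) ** 2)   # equivalently data.var(ddof=0)
print(mu_hat, sigma2_hat)                    # close to (5, 4)
```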
Bayesian estimation (maximum a posteriori estimation)
By Bayes' theorem, the posterior density of $\theta$ is $\phi\left(\theta \mid x_1, \ldots, x_n\right)=\frac{p(\theta) \prod_{i=1}^n f\left(x_i \mid \theta\right)}{g\left(x_1, \ldots, x_n\right)}$, where $p(\theta)$ is the prior and $g\left(x_1, \ldots, x_n\right)$ is the marginal density of the data. Taking logarithms,
$\ln \phi\left(\theta \mid x_1, \ldots, x_n\right)=\ln p(\theta)+{\sum_{i=1}^n \ln f\left(x_i \mid \theta\right)}-{\ln g\left(x_1, \ldots, x_n\right)}$
We can see that the summation is the log-likelihood, $\ln p(\theta)$ serves as a regularization term added to it, and the last term does not depend on $\theta$.
Example. Let $X$ follow $\operatorname{Binomial}(n, \theta)$ for unknown $\theta \in(0,1)$. Suppose the prior distribution of $\theta$ is $\operatorname{Beta}(\alpha, \beta)$ for some given $\alpha, \beta>0$. Find the posterior distribution and Bayesian estimate of $\theta$.
The posterior distribution will be proportional to the following:
$${n \choose x}\theta^{x}(1-\theta)^{n-x}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}{n \choose x}\theta^{x+\alpha-1}(1-\theta)^{n+\beta-x-1}$$
Then if we take $\ln$ and focus only on $\theta$, then we will have
$$(x+\alpha-1)\ln \theta+(n+\beta-x-1)\ln(1-\theta)$$.
If we take partial derivative with respect to $\theta$, then we will have
$$\frac{x+\alpha-1}{\theta}+\frac{n+\beta-x-1}{\theta-1} =0 \implies (x+\alpha-1 +n +\beta -x -1)\theta -(x+\alpha-1) =0 \implies \theta = \frac{x+\alpha-1}{n+\alpha+\beta-2}.$$
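A sketch verifying the MAP formula against a grid maximization of the log posterior (the values of $\alpha$, $\beta$, $n$, and $x$ are arbitrary):

```python
# MAP estimate (x + alpha - 1) / (n + alpha + beta - 2) vs. a grid search.
import numpy as np

alpha, beta, n, x = 2.0, 5.0, 20, 7

theta_map = (x + alpha - 1) / (n + alpha + beta - 2)

grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
log_post = (x + alpha - 1) * np.log(grid) + (n + beta - x - 1) * np.log(1 - grid)
print(theta_map, grid[np.argmax(log_post)])   # both approximately 0.32
```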
Example. Suppose $X_1, \ldots, X_n$ is a random sample of $N\left(\mu, \sigma^2\right)$ where $\sigma^2$ is known. Assume the prior distribution of $\mu$ is $N\left(\mu_0, \sigma_0^2\right)$ for some known $\mu_0$ and $\sigma_0^2$. Find the posterior distribution of $\mu$ and the Bayesian estimate.
Solution:
The posterior distribution of $\mu$ follows the following:
$$(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{\sum_{i}(X_i-\mu)^2}{2\sigma^2}\right) (2\pi\sigma_0^2)^{-1/2}\exp\left(-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right)$$
We can see that $$\frac{\sum_{i}(X_i-\mu)^2}{\sigma^2} + \frac{(\mu-\mu_0)^2}{\sigma_0^2} = \frac{\sum_{i}\left(X_i^2+\mu^2-2X_i\mu\right)}{\sigma^2} + \frac{\mu^2+\mu_0^2-2\mu \mu_0}{\sigma_0^2} \\= \frac{(\sigma^2_0 n+ \sigma^2)\mu^2-2(n\bar{X} \sigma^2_0 +\mu_0\sigma^2)\mu + \sigma^2 \mu_0^2+\sigma_0^2\sum_i X_i^2}{(\sigma\sigma_0)^2}$$
And so the posterior of $\mu$ is a normal distribution with mean $\frac{n \bar{X}\sigma_0^2 + \mu_0\sigma^2}{\sigma_0^2n +\sigma^2}$ and variance $\frac{\sigma^2 \sigma_0^2}{\sigma_0^2 n +\sigma^2}$.
And the Bayesian estimate maximizes the posterior, i.e. it minimizes the exponent:
$$\frac{\partial}{\partial \mu}\left[\frac{\sum_{i}(X_i-\mu)^2}{2\sigma^2}+\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right] = \frac{\sum_i (\mu-X_i)}{\sigma^2} + \frac{\mu -\mu_0}{\sigma_0^2}= 0$$
and it means that $$\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu=\frac{\sum X_i}{\sigma^2}+\frac{\mu_0}{\sigma_0^2} \implies \mu = \left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)^{-1}\left(\frac{\sum X_i}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right) = \frac{n \bar{X}\sigma_0^2 + \mu_0\sigma^2}{\sigma_0^2 n +\sigma^2},$$ which coincides with the posterior mean found above.
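A sketch of the resulting posterior update on simulated data (the data-generating $\mu$, the known $\sigma^2$, and the prior hyperparameters are arbitrary choices):

```python
# Posterior of the normal mean with a normal prior, using the formulas above.
import numpy as np

rng = np.random.default_rng(10)
mu_true, sigma, n = 2.0, 1.5, 30          # sigma^2 is assumed known
mu0, sigma0 = 0.0, 3.0                    # prior N(mu0, sigma0^2)

data = rng.normal(mu_true, sigma, size=n)
xbar = data.mean()

post_var = (sigma**2 * sigma0**2) / (n * sigma0**2 + sigma**2)
post_mean = (n * xbar * sigma0**2 + mu0 * sigma**2) / (n * sigma0**2 + sigma**2)
print(post_mean, post_var)   # the Bayesian estimate of mu is the posterior mean
```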