Theory Of Statistics
THIS IS A SIMPLIFIED LECTURE REVIEW BASED ON CUHKSZ’S THEORY OF STATISTICS (Statistical Inference George Casella)
Common Family of Distributions
Exponential Family
$f(x|\theta)=h(x)c(\theta)\exp\left(\sum_{i=1}^{k}w_{i}(\theta)t_{i}(x)\right)$
- Binomial Distribution
- Poisson Distribution
- Exponential Distribution
Full exponential family (dimension k = number of parameters d) versus curved exponential family (dimension k > number of parameters d)
Why do we need exponential family?
- Simplifies calculating the expectation and variance of the statistics $t_{i}(x)$ (and related moments).
- We can use $k$ statistics instead of the full original sample to make inferences about the parameter $\theta$ without losing any information, i.e., good properties of sufficiency and completeness.
Note: for some distributions, we need to rewrite the pdf or pmf with an indicator function to capture the support.
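As a quick check of the form above, the Poisson($\lambda$) pmf can be written in exponential family form (the indicator handles the support):

$$
f(x\mid\lambda)=\frac{e^{-\lambda}\lambda^{x}}{x!}\,\mathbb{1}_{\{0,1,2,\dots\}}(x)
=\underbrace{\frac{1}{x!}\mathbb{1}_{\{0,1,2,\dots\}}(x)}_{h(x)}\;\underbrace{e^{-\lambda}}_{c(\lambda)}\;\exp\big(\underbrace{\log\lambda}_{w_{1}(\lambda)}\cdot\underbrace{x}_{t_{1}(x)}\big).
$$

Here the dimension $k=1$ equals the number of parameters $d=1$, so the Poisson family is a full exponential family.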
Location and Scale Family
- Let $f(x)$ be any pdf. Then the family of pdfs $f(x-\mu)$, indexed by the parameter $\mu$ ($-\infty<\mu<\infty$), is called the location family with standard pdf $f(x)$.
- If Z has a pdf $f(z)$, then $X=Z+\mu$ has density $f(x-\mu)$
- Let $f(x)$ be any pdf. Then the family of pdfs $\frac{1}{\sigma}f(\frac{x}{\sigma})$, indexed by $\sigma>0$, is called the scale family with standard pdf $f(x)$.
- How to prove these statements? Event equivalence! If $X=\sigma Z+\mu$ with $Z$ having pdf $f(z)$, then
- $\{X \le x \} \Leftrightarrow \{Z \le \frac{x-\mu}{\sigma}\}$
- $P(X\le x)=P(Z\le\frac{x-\mu}{\sigma})=F_{Z}(\frac{x-\mu}{\sigma})$
- Differentiating with respect to $x$: $f_{X}(x)=\frac{1}{\sigma}f(\frac{x-\mu}{\sigma})$
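A minimal numerical sketch of this fact (the variable names and the choice of a standard normal $Z$ are my own illustration): draw $Z$ from the standard pdf, set $X=\sigma Z+\mu$, and compare empirical summaries of $X$ with the location-scale prediction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0                 # location and scale parameters

z = rng.standard_normal(100_000)     # Z ~ standard pdf f (here: standard normal)
x = sigma * z + mu                   # X = sigma * Z + mu

print(x.mean(), x.std())             # approximately mu and sigma

# Empirical P(X <= 4) vs the location-scale cdf F_Z((4 - mu) / sigma)
point = 4.0
print((x <= point).mean())
print(stats.norm.cdf((point - mu) / sigma))
```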
Data Reduction
Why reduction? Summarize the information in a sample by determining a few key features of the sample values
How to do reduction? Use methods that do not discard important information about the unknown parameter $\theta$, and that do discard information that is irrelevant as far as gaining knowledge about $\theta$ is concerned.
Sufficiency
Definition: $P(X=x | T(X)=T(x))$ is independent of $\theta$. The conditional distribution of a sample $X$ given the value $T(X)$ does not depend on $\theta$.
It turns out that, outside the exponential family, it is rare to have a sufficient statistic of smaller dimension than the size of the sample.
How to find a sufficient statistic?
- $\frac{p(x|\theta)}{q(T(x)|\theta)}$ is independent of $\theta$
- Factorization theorem $f(x|\theta)=h(x)g(T(x)|\theta)$
- Using the exponential family: $T(X)=\left(\sum_{i=1}^{n}t_{1}(X_{i}),\dots,\sum_{i=1}^{n}t_{k}(X_{i})\right)$ is sufficient (a Bernoulli example follows this list).
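For instance (a standard textbook example), for an i.i.d. Bernoulli($\theta$) sample the joint pmf factors as

$$
f(x\mid\theta)=\prod_{i=1}^{n}\theta^{x_i}(1-\theta)^{1-x_i}
=\underbrace{\theta^{\sum x_i}(1-\theta)^{n-\sum x_i}}_{g(T(x)\mid\theta)}\cdot\underbrace{1}_{h(x)},\qquad T(x)=\sum_{i=1}^{n}x_i ,
$$

so $T(X)=\sum_{i=1}^{n} X_i$ is sufficient by the Factorization Theorem; it is also $\sum_{i=1}^{n}t_{1}(X_{i})$ with $t_{1}(x)=x$ from the exponential family form.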
If $T'(x) = r(T(x))$ and $T'$ is sufficient, then $T$ is also sufficient. The converse is false in general (it holds only when $r$ is one-to-one).
Minimal Sufficiency Statistic
Why do we need a minimal sufficient statistic (M.S.S.)? We already have many sufficient statistics, and any one-to-one function of a sufficient statistic is again a sufficient statistic. Among these sufficient statistics, we want to find the one that achieves the greatest data reduction, i.e., a sufficient statistic that is a function of every other sufficient statistic.
Definition: A sufficient statistic $T$ is called a minimal sufficient statistic if, for any other sufficient statistic $T'$, $T$ is a function of $T'$.
How to find M.S.S?
- $T(X)$ is minimal sufficient if $\frac{f(x|\theta)}{f(y|\theta)}$ is constant as a function of $\theta$ if and only if $T(x) = T(y)$
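As a standard illustration of this criterion, take an i.i.d. $N(\mu,1)$ sample:

$$
\frac{f(x\mid\mu)}{f(y\mid\mu)}
=\exp\!\Big(-\tfrac{1}{2}\big(\textstyle\sum_{i} x_i^{2}-\sum_{i} y_i^{2}\big)+n\mu(\bar{x}-\bar{y})\Big),
$$

which is constant in $\mu$ if and only if $\bar{x}=\bar{y}$, so $T(X)=\bar{X}$ is minimal sufficient.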
Ancillary
Definition: A statistic $S(X)$ whose distribution does not depend on the parameter $\theta$ is called an ancillary statistic (i.e., $P_{\theta}(S(X)\le s)$ does not depend on $\theta$).
Completeness
Definition: If, for every function $g$, $E_{\theta}[g(T)] = 0$ for all $\theta \in \Theta$ implies $P_{\theta}(g(T)=0)=1$ for all $\theta \in \Theta$, then $T(X)$ is called a complete statistic.
Interpretation:
- No non-constant ancillary statistic can be constructed as a function of a complete statistic.
- Unbiased estimators based on $T$ are unique: if $T$ is a complete sufficient statistic, there is at most one function $\phi(T)$ with $E_{\theta}[\phi(T)] = \tau(\theta)$.
Disprove Completeness:
- Construct a function with zero mean from an ancillary statistic: $g(T)=S(T)-E[S(T)]$ where $E[S(T)]$ does not depend on $\theta$; then $E_{\theta}[g(T)]=0$ for all $\theta$ but $P_{\theta}(g(T)=0) \ne 1$ (a classic example follows this list).
- Or just match first moments: find $h_{1}, h_{2}$ with $E[h_{1}(T)]=E[h_{2}(T)]$ for all $\theta$ and take $g(T)=h_{1}(T)-h_{2}(T)$, which is not identically zero.
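A classic example of the first approach: for $X_{1},\dots,X_{n}$ i.i.d. Uniform$(\theta,\theta+1)$, the minimal sufficient statistic $T=(X_{(1)},X_{(n)})$ is not complete, because the range is ancillary:

$$
g(T)=X_{(n)}-X_{(1)}-\frac{n-1}{n+1},\qquad
E_{\theta}[g(T)]=0\ \text{for all }\theta,\quad\text{but}\quad P_{\theta}(g(T)=0)\ne 1 .
$$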
Prove
- Directly show that $E_{\theta}[g(T)]=0$ for all $\theta$ forces $g=0$ almost surely (e.g., by differentiating with respect to $\theta$).
- Use the nice property of the full exponential family (i.e., $d=k$): its natural sufficient statistic is complete.
Property of Complete Statistic:
- If $T' = r(T)$ and $T$ is complete, then $T'$ is also complete.
- A complete statistic need not be sufficient.
Basu’s Theorem
If $T(X)$ is a complete and minimal sufficient statistic, then $T(X)$ is independent of every ancillary statistic.
Bahadur’s Theorem
A complete sufficient statistic (C.S.S.) is also minimal sufficient (M.S.S.).
There are only two possible situations:
- If there exists a $T(X)$ that is M.S.S. but not complete, then no sufficient statistic is complete.
- If there exists a $T(X)$ that is both C.S.S. and M.S.S., then every M.S.S. is also a C.S.S.
Point Estimation
Methods of Finding Point Estimators
Method of Moments
Drawback: sometimes the estimate falls outside the parameter space, and the estimator lacks numerical stability (i.e., a small change in the sample can lead to a huge change in the estimate).
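A minimal sketch of the method of moments (the Gamma($\alpha,\beta$) example and variable names are my own illustration): match the first two sample moments to $E[X]=\alpha\beta$ and $\mathrm{Var}(X)=\alpha\beta^{2}$ and solve for the parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_true, beta_true = 3.0, 2.0          # shape and scale of a Gamma distribution
x = rng.gamma(shape=alpha_true, scale=beta_true, size=5_000)

m1 = x.mean()                 # first sample moment
m2 = (x ** 2).mean()          # second sample moment
var_hat = m2 - m1 ** 2        # implied sample variance

# Solve E[X] = alpha * beta and Var(X) = alpha * beta^2 for (alpha, beta)
beta_mom = var_hat / m1
alpha_mom = m1 ** 2 / var_hat

# Roughly (3, 2); in small samples such estimates can even leave the parameter
# space, which is the drawback noted above.
print(alpha_mom, beta_mom)
```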
Maximum Likelihood Estimation
| | Scalar | Multidimensional |
|---|---|---|
| Continuous | Yes (second-order sufficient condition, SOSC) | Yes |
| Discrete | Yes | No (discrete optimization) |
Advantage: the range of the MLE coincides with the range of the parameter (the parameter space).
Drawbacks:
- It can be difficult to find a global maximum.
- Numerical sensitivity, i.e., how sensitive the estimate is to small changes in the data.
Invariance property of the MLE: the MLE of $\tau(\theta)$ is $\tau(\hat{\theta})$, where $\hat{\theta}$ is the MLE of $\theta$.
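A minimal sketch of numerical maximum likelihood (the Exponential($\lambda$) rate example is my own illustration; here the MLE has the closed form $\hat{\lambda}=1/\bar{x}$, which makes the numeric answer easy to check):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
lam_true = 1.5
x = rng.exponential(scale=1 / lam_true, size=2_000)   # Exponential data with rate lam_true

def neg_log_lik(lam):
    # log L(lam | x) = n*log(lam) - lam * sum(x); return its negative for minimization
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, 1 / x.mean())    # numeric MLE vs closed-form MLE 1/x̄

# Invariance: the MLE of tau(lambda) = 1/lambda (the mean) is 1/lambda_hat = x̄
print(1 / res.x, x.mean())
```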
Bayes Estimator
Data: $(x_{1},x_{2},\dots,x_{n})$ from the sampling model $f(x|\theta)$, plus expert knowledge encoded in a prior $\pi(\theta)$.
Cons: controversial because it inherently embraces a subjective notion of probability, and it carries no guarantee of long-run performance.
Posterior distribution: $\pi(\theta|x) = f(x|\theta)\pi(\theta)/ m(x)$, where $m(x)=\int f(x|\theta)\pi(\theta)\,d\theta$ is the marginal distribution of the data.
$\pi(\theta)$ is said to be conjugate to $f(x|\theta)$ if the posterior $\pi(\theta|x)$ is in the same distribution family as the prior $\pi(\theta)$.
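A minimal sketch of conjugacy (the Beta-Binomial pair is the standard example; the prior hyperparameters and counts are my own choices): with a Beta($a,b$) prior and a Binomial($n,\theta$) likelihood, the posterior is Beta($a+x,\ b+n-x$).

```python
from scipy import stats

a, b = 2.0, 2.0        # Beta prior hyperparameters (assumed for illustration)
n, x = 20, 14          # n trials, x observed successes

# Conjugate update: Beta(a, b) prior + Binomial likelihood -> Beta(a + x, b + n - x)
post = stats.beta(a + x, b + n - x)

print(post.mean())             # posterior mean (a Bayes estimator under squared error loss)
print(post.interval(0.95))     # central 95% credible interval
```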
Methods of Evaluating Point Estimators
Mean Square Error (MSE)
MSE = $E_{\theta}(W-\theta)^{2}=Var_{\theta}(W)+(Bias_{\theta}(W))^{2}$
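The decomposition follows by adding and subtracting $E_{\theta}W$:

$$
E_{\theta}(W-\theta)^{2}
=E_{\theta}\big[(W-E_{\theta}W)+(E_{\theta}W-\theta)\big]^{2}
=\mathrm{Var}_{\theta}(W)+(E_{\theta}W-\theta)^{2},
$$

since the cross term $2(E_{\theta}W-\theta)\,E_{\theta}(W-E_{\theta}W)$ vanishes.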
MSE is not a good criterion for a scale parameter (e.g., $\sigma$ in the normal distribution), since MSE penalizes overestimation and underestimation equally; in the scale case, 0 is a natural lower bound, so the estimation problem is not symmetric.
Best Unbiased Estimator (Minimize variance and control bias)
Motivation: sometimes we can find an estimator that is uniformly better than another estimator (e.g., MSE($\hat{\sigma}^{2}$) < MSE($S^{2}$) for every $\sigma^{2} > 0$), but there is rarely a single estimator that beats all others in MSE. If we restrict attention to unbiased estimators and can find one with uniformly smallest variance, a best unbiased estimator, then our task is done.
UMVUE (uniform minimum variance unbiased estimator)
Cramér-Rao Inequality: is the Cramér-Rao Lower Bound valid or not (do the regularity conditions hold)?
We can specify a lower bound, say $B(\theta)$, on the variance of any unbiased estimator of $\tau(\theta)$. If an estimator attains this lower bound, we can say we have found a best unbiased estimator.
Attainment
Since there is no guarantee that the bound is sharp, we need to find out whether the lower bound can actually be attained.
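For reference (under the usual regularity conditions that allow differentiation under the integral sign), the bound and its attainment condition are:

$$
\mathrm{Var}_{\theta}(W(X))\;\ge\;\frac{[\tau'(\theta)]^{2}}{E_{\theta}\!\left[\left(\frac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^{2}\right]},
$$

with equality if and only if $a(\theta)\,[W(x)-\tau(\theta)]=\frac{\partial}{\partial\theta}\log L(\theta\mid x)$ for some function $a(\theta)$.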
Use of sufficiency and unbiasedness
Rao-Blackwell: conditioning on a sufficient statistic never increases variance.
Condition an unbiased estimator $W$ of $\tau(\theta)$ on a sufficient statistic $T$, i.e., $\phi(T)=E(W|T)$; then $E_{\theta}[\phi(T)]=\tau(\theta)$ and $Var_{\theta}(\phi(T)) \leq Var_{\theta}(W)$ for all $\theta$.
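A minimal simulation sketch of this variance reduction (the Bernoulli example and sample sizes are my own illustration): start from the crude unbiased estimator $W=X_{1}$ of $p$ and condition on the sufficient statistic $T=\sum_{i} X_{i}$, which gives $\phi(T)=E(X_{1}\mid T)=T/n=\bar{X}$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 10, 100_000

samples = rng.binomial(1, p, size=(reps, n))   # reps independent Bernoulli(p) samples of size n

w = samples[:, 0]              # crude unbiased estimator W = X_1
phi = samples.mean(axis=1)     # Rao-Blackwellized estimator phi(T) = E(X_1 | T) = X̄

print(w.mean(), phi.mean())    # both close to p = 0.3 (unbiased)
print(w.var(), phi.var())      # Var(phi) = p(1-p)/n is much smaller than Var(W) = p(1-p)
```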
Uniqueness of best estimator
DISPROVE A BEST ESTIMATOR (WHEN THE CRLB IS NOT ATTAINED): if $E_{\theta}W = \tau(\theta)$, then $W$ is the best unbiased estimator of $\tau(\theta)$ if and only if $W$ is uncorrelated with all unbiased estimators of 0.
If $T$ is a complete sufficient statistic and $E_{\theta}[W]=\tau(\theta)$, then $\phi(T)=E(W|T)$ is the (unique) best unbiased estimator of $\tau(\theta)$, since $\phi(T)$ is uncorrelated with every unbiased estimator of 0 (Lehmann-Scheffé).
Minimize Risk Function
Hypothesis test
Definition: A hypothesis is a statement about a population parameter. A hypothesis test is a rule that specifies for which sample values $H_{0}$ is accepted as true and for which sample values $H_{0}$ is rejected.
Find Hypothesis test
Frequentist viewpoint: compare the likelihoods under $H_{0}$ and $H_{1}$. The Likelihood Ratio Test (LRT), together with the union-intersection and intersection-union constructions, gives a rejection region $R = \{x : \lambda(x)\le c\}$, where $\lambda(x)=\frac{\sup_{\theta\in\Theta_{0}}L(\theta|x)}{\sup_{\theta\in\Theta}L(\theta|x)}$ (a worked example follows below).
Bayesian viewpoint: compare the posterior probabilities of $H_{0}$ and $H_{1}$.
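As a standard worked example of the LRT, test $H_{0}:\mu=\mu_{0}$ vs $H_{1}:\mu\ne\mu_{0}$ for an i.i.d. $N(\mu,\sigma^{2})$ sample with $\sigma^{2}$ known; the unrestricted MLE is $\hat{\mu}=\bar{x}$, so

$$
\lambda(x)=\frac{L(\mu_{0}\mid x)}{L(\bar{x}\mid x)}
=\exp\!\Big(-\frac{n(\bar{x}-\mu_{0})^{2}}{2\sigma^{2}}\Big),
\qquad
\{\lambda(x)\le c\}\iff\{|\bar{x}-\mu_{0}|\ge c'\},
$$

so the LRT rejects when $\bar{x}$ is far from $\mu_{0}$.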
Evaluate Hypothesis test
Generally, we have to control the Type 1 error (incorrectly rejecting $H_{0}$) and the Type 2 error (incorrectly accepting $H_{0}$).
Power function
- The power function $\beta(\theta) = P_{\theta}(X \in R)$ quantifies both the Type 1 and Type 2 errors: for $\theta\in\Theta_{0}$, $\beta(\theta)$ is the probability of a Type 1 error, and for $\theta\in\Theta_{0}^{c}$, $1-\beta(\theta)$ is the probability of a Type 2 error.
- Specifically, we want to control the Type 1 error while minimizing the Type 2 error: a size $\alpha$ test satisfies $\sup_{\theta \in \Theta_{0}} \beta(\theta) = \alpha$ and a level $\alpha$ test satisfies $\sup_{\theta \in \Theta_{0}} \beta(\theta) \le \alpha$; both restrict the cutoff $c$ so as to control the Type 1 error (a numerical sketch of a power function follows this list).
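A minimal numerical sketch of a power function (the one-sided z-test setting, $H_{0}:\mu\le 0$ vs $H_{1}:\mu>0$ with $\sigma=1$ known, is my own illustration): the test rejects when $\sqrt{n}\,\bar{X} > z_{\alpha}$, so $\beta(\mu)=P_{\mu}(\sqrt{n}\bar{X} > z_{\alpha})=1-\Phi(z_{\alpha}-\sqrt{n}\mu)$.

```python
import numpy as np
from scipy import stats

alpha, n = 0.05, 25
z_alpha = stats.norm.ppf(1 - alpha)          # rejection cutoff for sqrt(n) * X̄

def power(mu):
    # beta(mu) = P_mu(sqrt(n) * X̄ > z_alpha), since sqrt(n) * (X̄ - mu) ~ N(0, 1)
    return stats.norm.sf(z_alpha - np.sqrt(n) * mu)

for mu in [0.0, 0.1, 0.2, 0.5]:
    print(mu, power(mu))
# power(0.0) equals the size alpha = 0.05; the power increases toward 1 as mu moves into H1
```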
Most Powerful Test
Size $\alpha$ and level $\alpha$ tests may not be unique; we need to minimize the Type 2 error while controlling the Type 1 error simultaneously.
A UMP (uniformly most powerful) test imposes a further restriction on level $\alpha$ tests: its power is at least as large as that of every other level $\alpha$ test, for every $\theta$ in the alternative.
- Neyman-Pearson lemma for simple hypotheses (UMP).
- Karlin-Rubin theorem for one-sided hypotheses (UMP; requires a monotone likelihood ratio).
- Non-existence for two-sided hypotheses.
- When a UMP test does not exist (e.g., two-sided hypotheses), what should we do?
P-value
Definition: the p-value is the probability of obtaining a value of the test statistic at least as extreme as the one observed in the sample data, $P(W(X)\ge W(x)\mid H_{0}\ \text{is true})$.
The p-value reports the test result on a continuous scale, rather than as a dichotomous 'accept'/'reject' decision.
The p-value cannot quantify the Type 2 error, and the definition of extremeness is vague.
We can construct a valid p-value as $p(x) = \sup_{\theta \in \Theta_{0}}P_{\theta}(W(X) \ge W(x))$, or as $p(x)=P(W(X)\ge W(x) \mid S(X)=S(x))$ conditioned on a sufficient statistic $S$ under $H_{0}$.
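A minimal sketch (the same z-test setting as the power-function example above, which is my own illustrative choice): the observed test statistic is $W(x)=\sqrt{n}\,\bar{x}/\sigma$ and the p-value is $P(Z\ge W(x))$ under $H_{0}:\mu=0$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, sigma = 25, 1.0
x = rng.normal(loc=0.3, scale=sigma, size=n)   # data actually drawn from mu = 0.3

w_obs = np.sqrt(n) * x.mean() / sigma          # observed value of the test statistic
p_value = stats.norm.sf(w_obs)                 # P(W(X) >= W(x)) under H0: mu = 0
print(w_obs, p_value)                          # a small p-value is evidence against H0
```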
Interval Estimation
Every confidence set corresponds to a hypothesis test and vice versa.
Methods of finding Interval Estimation
Inverted LRT. Step 1: find the LRT acceptance region $A(\theta_{0})$ for each $\theta_{0}$; Step 2: invert it into the explicit set $C(x)=\{\theta : x \in A(\theta)\}$; Step 3: show that the coverage holds for all $\theta$.
Pivotal quantities: a pivot $Q(X,\theta)$ has a distribution that does not depend on $\theta$; inverting $P(a \le Q(X,\theta) \le b)=1-\alpha$ gives a confidence set (see the sketch below).
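A minimal sketch of the pivotal-quantity method (the normal-mean example is my own illustration): $Q=\sqrt{n}(\bar{X}-\mu)/S \sim t_{n-1}$ does not depend on $\mu$, and inverting $P(-t_{n-1,\alpha/2}\le Q\le t_{n-1,\alpha/2})=1-\alpha$ gives the usual t interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=2.0, size=30)

n = len(x)
xbar, s = x.mean(), x.std(ddof=1)              # sample mean and sample standard deviation
t_crit = stats.t.ppf(0.975, df=n - 1)          # t_{n-1, alpha/2} for a 95% interval

# Invert the pivot: mu in [xbar - t * s / sqrt(n), xbar + t * s / sqrt(n)]
half_width = t_crit * s / np.sqrt(n)
print(xbar - half_width, xbar + half_width)
```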
Methods of Evaluating Interval Estimation
Minimize the size (length) of the interval while controlling the coverage probability (solved by Lagrange multipliers with equality constraints).
Decision theory: $R(\theta,C)= E_{\theta}[L(\theta, C(X))]=b\,E_{\theta}[\mathrm{Length}(C(X))]-P_{\theta}(\theta \in C(X))$
Author Zitao
LastMod 2021-12-26