Statistics - Complete Guide with Data Analysis
Introduction and Real-World Applications
Statistics is the science of collecting, analyzing, interpreting, and presenting data. It provides tools to make sense of uncertainty and variability in the world around us.
Why Learn Statistics?
- Business & Economics: Market analysis, risk assessment, forecasting
- Medicine & Healthcare: Clinical trials, epidemiology, treatment efficacy
- Social Sciences: Survey analysis, behavioral studies, policy evaluation
- Technology: A/B testing, machine learning, quality control
- Sports Analytics: Performance metrics, strategy optimization
- Daily Decision Making: Understanding polls, news, and research
Two Branches of Statistics
Descriptive Statistics: Summarize and describe the data at hand
Inferential Statistics: Use a sample to draw conclusions about a population, through estimation and hypothesis testing
Descriptive Statistics
Measures of Central Tendency
Mean (Average)
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Sum of all values divided by the number of values
Median
Middle value when data is ordered
- Odd n: Middle value
- Even n: Average of two middle values
Mode
Most frequently occurring value(s)
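As a quick illustration, the three measures can be computed with Python's standard `statistics` module. This is a minimal sketch; the data values are just a small example set.

```python
import statistics

data = [2, 4, 4, 7, 9, 10, 12]  # small example data set

mean = statistics.mean(data)        # sum of values divided by their count
median = statistics.median(data)    # middle value of the sorted data
modes = statistics.multimode(data)  # list of every most-frequent value

print(f"mean={mean:.3f}, median={median}, modes={modes}")
```

`multimode` returns a list because a data set can have more than one mode.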
Measures of Variability
Range
Range = Maximum - Minimum
Variance
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Standard Deviation
Population: $\sigma = \sqrt{\sigma^2}$
Sample: $s = \sqrt{s^2}$
Coefficient of Variation
$CV = \frac{\sigma}{\mu} \times 100\%$ (relative variability; for a sample, use $s/\bar{x}$)
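The difference between the population and sample formulas is easiest to see on a small data set. The sketch below uses the standard `statistics` module; the data values are illustrative, and the CV is computed with the population formula given above.

```python
import statistics

data = [10, 12, 15, 18, 20]  # illustrative values

value_range = max(data) - min(data)
pop_var = statistics.pvariance(data)   # divides by N
samp_var = statistics.variance(data)   # divides by n - 1
pop_sd = statistics.pstdev(data)       # square root of pop_var
samp_sd = statistics.stdev(data)       # square root of samp_var
cv_percent = pop_sd / statistics.mean(data) * 100

print(value_range, pop_var, samp_var)
print(f"pop_sd={pop_sd:.4f}, samp_sd={samp_sd:.4f}, CV={cv_percent:.2f}%")
```

The sample variance is always the larger of the two because its divisor is smaller.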
Measures of Position
Percentiles and Quartiles
- Q1 (25th percentile): First quartile
- Q2 (50th percentile): Median
- Q3 (75th percentile): Third quartile
- IQR: Interquartile Range = Q3 - Q1
Z-Score (Standard Score)
$$z = \frac{x - \mu}{\sigma}$$
Number of standard deviations from the mean
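A short sketch of quartiles, IQR, and a z-score follows. Note that several quartile conventions exist; `statistics.quantiles` uses its default "exclusive" method here, so other software may report slightly different Q1 and Q3 for the same data. The data values and the helper name `z_score` are illustrative.

```python
import statistics

data = [2, 4, 4, 7, 9, 10, 12]  # illustrative values

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

z = z_score(85, mu=75, sigma=8)  # (85 - 75) / 8 = 1.25

print(q1, q2, q3, iqr, z)
```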
Data Visualization
Common Charts
- Histogram: Distribution of continuous data
- Box Plot: Five-number summary visualization
- Scatter Plot: Relationship between two variables
- Bar Chart: Categorical data comparison
- Pie Chart: Parts of a whole
Probability Fundamentals
Basic Probability
Probability of an Event
When all outcomes are equally likely:
$$P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of possible outcomes}}$$
Properties: $0 \leq P(A) \leq 1$
Complement Rule
$P(A^c) = 1 - P(A)$
Probability Rules
Addition Rule
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
For mutually exclusive events: $P(A \cup B) = P(A) + P(B)$
Multiplication Rule
$P(A \cap B) = P(A) \cdot P(B|A)$
For independent events: $P(A \cap B) = P(A) \cdot P(B)$
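The addition and multiplication rules can be checked by direct counting on a small sample space. The sketch below assumes one roll of a fair die; the events A and B are chosen only for illustration.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # one roll of a fair die
A = {2, 4, 6}                  # event: the roll is even
B = {4, 5, 6}                  # event: the roll is greater than 3

def prob(event):
    """Classical probability: favorable outcomes over total outcomes."""
    return Fraction(len(event), len(outcomes))

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_union = prob(A) + prob(B) - prob(A & B)

# Multiplication rule: P(A and B) = P(A) * P(B|A)
p_b_given_a = Fraction(len(A & B), len(A))  # count B's outcomes inside A
p_intersection = prob(A) * p_b_given_a

print(p_union, p_intersection)
```

Both results agree with counting the union and intersection directly, which is a useful sanity check when applying the rules by hand.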
Conditional Probability
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
Bayes' Theorem
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
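Bayes' theorem is often applied to diagnostic testing. The sketch below uses made-up numbers (1% prevalence, 95% sensitivity, 5% false positive rate) and expands $P(B)$ with the law of total probability; the function and parameter names are hypothetical.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(A|B) for event A (condition present) given evidence B (positive test).

    prior               = P(A)
    likelihood          = P(B|A)
    false_positive_rate = P(B|not A)
    """
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers, for illustration only.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(condition | positive test) = {posterior:.3f}")
```

With these assumed inputs the posterior is only about 16%, because the condition is rare and false positives outnumber true positives.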
Counting Principles
Permutations (Order matters)
$P(n,r) = \frac{n!}{(n-r)!}$
Combinations (Order doesn't matter)
$C(n,r) = \binom{n}{r} = \frac{n!}{r!(n-r)!}$
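Python's `math` module (3.8 and later) provides both counts directly. A minimal example, with n = 5 and r = 2 chosen for illustration:

```python
import math

n, r = 5, 2

permutations = math.perm(n, r)  # ordered selections: n! / (n - r)!
combinations = math.comb(n, r)  # unordered selections: n! / (r! (n - r)!)

print(permutations, combinations)  # 20 10
```

Each combination of 2 items can be ordered in 2! ways, which is why the permutation count is twice the combination count here.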
Probability Distributions
Discrete Distributions
Binomial Distribution
$X \sim B(n,p)$
$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$
- Mean: $\mu = np$
- Variance: $\sigma^2 = np(1-p)$
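The probability mass function translates directly into code. This sketch computes the chance of exactly 6 heads in 10 fair coin flips; `binomial_pmf` is a hypothetical helper name.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5
p_six_heads = binomial_pmf(6, n, p)

mean = n * p                # np
variance = n * p * (1 - p)  # np(1 - p)

print(f"P(X = 6) = {p_six_heads:.4f}, mean = {mean}, variance = {variance}")
```

A cumulative probability such as "at most 2" is the sum of `binomial_pmf(k, n, p)` for k = 0, 1, 2.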
Poisson Distribution
$X \sim Poisson(\lambda)$
$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}$
- Mean: $\mu = \lambda$
- Variance: $\sigma^2 = \lambda$
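The following sketch evaluates the Poisson formula and checks numerically that the mean and variance both equal λ. The rate λ = 3 is an arbitrary illustrative value, and the sums are truncated at k = 59, where the remaining tail is negligible.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.0        # hypothetical average rate
ks = range(60)   # truncation point; the tail beyond it is negligible

p_two = poisson_pmf(2, lam)
total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print(f"P(X = 2) = {p_two:.4f}, mean = {mean:.4f}, variance = {variance:.4f}")
```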
Continuous Distributions
Normal Distribution
$X \sim N(\mu, \sigma^2)$
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
Standard Normal Distribution
$Z \sim N(0, 1)$
68-95-99.7 Rule (approximate, for any normal distribution):
- About 68% of values lie within 1 standard deviation of the mean
- About 95% lie within 2 standard deviations
- About 99.7% lie within 3 standard deviations
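The rule can be verified from the standard normal cumulative distribution function. A minimal check using `statistics.NormalDist` from the standard library:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"within {k} sd: {prob:.4f}")
```

The exact values are slightly different from the rounded 68, 95, and 99.7 percent figures, which is why the rule is treated as an approximation.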
t-Distribution
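Used for inference about a mean when the population standard deviation is unknown and is estimated by the sample standard deviation $s$
Has heavier tails than the normal distribution and approaches it as the degrees of freedom increase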
Central Limit Theorem
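For sufficiently large samples (n ≥ 30 is a common rule of thumb), the sampling distribution of the sample mean is approximately normal:
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
This holds for any population distribution with finite variance, although strongly skewed or heavy-tailed populations may need a larger n.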
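A small simulation makes the theorem concrete. The sketch below draws repeated samples from a strongly skewed exponential population (mean 1, standard deviation 1) and compares the spread of the sample means with the predicted $\sigma/\sqrt{n}$. The sample size, number of repetitions, and seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 40              # size of each sample
num_samples = 5000  # number of samples drawn

# Exponential population with rate 1: mean 1, standard deviation 1, right-skewed.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_sd = 1 / n**0.5  # sigma / sqrt(n) with sigma = 1

print(f"mean of sample means: {mean_of_means:.3f} (population mean 1)")
print(f"sd of sample means:   {sd_of_means:.3f} (predicted {predicted_sd:.3f})")
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the underlying population is far from normal.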
Sampling and Estimation
Sampling Methods
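- Simple Random Sampling: Every member has an equal chance of selection
- Stratified Sampling: Divide the population into strata, then sample randomly within each stratum
- Cluster Sampling: Divide the population into clusters, then sample entire clusters at random
- Systematic Sampling: Select every kth member after a random start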
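The four methods can be sketched with the standard `random` module. The population of 100 numbered members, the two strata, and the clusters of 10 are all hypothetical choices made for illustration.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable
population = list(range(1, 101))  # hypothetical member IDs 1..100

# Simple random sampling: every member is equally likely to be chosen.
simple = random.sample(population, k=10)

# Systematic sampling: random start, then every 10th member.
step = 10
start = random.randrange(step)
systematic = population[start::step]

# Stratified sampling: split into strata, then sample within each stratum.
strata = {"low": population[:50], "high": population[50:]}
stratified = [m for group in strata.values() for m in random.sample(group, k=5)]

# Cluster sampling: split into clusters, then keep whole clusters chosen at random.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [m for c in random.sample(clusters, k=2) for m in c]

print(simple, systematic, stratified, cluster_sample, sep="\n")
```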
Point Estimation
Properties of Good Estimators
- Unbiased: $E(\hat{\theta}) = \theta$
- Consistent: Converges to true value as n increases
- Efficient: Minimum variance among unbiased estimators
Confidence Intervals
CI for Mean (σ known)
$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
CI for Mean (σ unknown)
$$\bar{x} \pm t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}$$
CI for Proportion
$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
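The three interval formulas can be sketched as small functions. The standard library provides normal quantiles through `NormalDist.inv_cdf` but has no t quantile, so the t critical value is passed in by the caller; the value 2.030 below is the usual table value for 95% confidence with 35 degrees of freedom. Function names and the example inputs are illustrative.

```python
import math
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value z_{alpha/2} for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def mean_interval_sigma_known(xbar, sigma, n, confidence=0.95):
    margin = z_critical(confidence) * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

def mean_interval_sigma_unknown(xbar, s, n, t_crit):
    # t_crit must come from a t-table or a statistics package (df = n - 1).
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin

def proportion_interval(p_hat, n, confidence=0.95):
    margin = z_critical(confidence) * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

z_ci = mean_interval_sigma_known(xbar=42, sigma=6, n=36)
t_ci = mean_interval_sigma_unknown(xbar=42, s=6, n=36, t_crit=2.030)
p_ci = proportion_interval(p_hat=0.4, n=200)

print(z_ci, t_ci, p_ci)
```

The t interval is slightly wider than the z interval for the same inputs, reflecting the extra uncertainty from estimating σ with s.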
Sample Size Determination
For mean: $n = \left(\frac{z_{\alpha/2} \sigma}{E}\right)^2$
For proportion: $n = \left(\frac{z_{\alpha/2}}{E}\right)^2 p(1-p)$
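Because a sample size must be a whole number, the result of either formula is rounded up. A minimal sketch, with hypothetical inputs (σ = 6 with margin of error E = 1 for the mean; p = 0.5 with E = 0.03 for the proportion):

```python
import math
from statistics import NormalDist

def z_critical(confidence):
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= margin."""
    return math.ceil((z_critical(confidence) * sigma / margin) ** 2)

def sample_size_for_proportion(p, margin, confidence=0.95):
    """Smallest n with z * sqrt(p(1 - p) / n) <= margin."""
    return math.ceil((z_critical(confidence) / margin) ** 2 * p * (1 - p))

n_mean = sample_size_for_mean(sigma=6, margin=1)
n_prop = sample_size_for_proportion(p=0.5, margin=0.03)

print(n_mean, n_prop)
```

Using p = 0.5 gives the largest possible value of p(1 - p), so it is the conservative choice when no prior estimate of the proportion is available.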
Hypothesis Testing
Steps in Hypothesis Testing
1. State null and alternative hypotheses
   - $H_0$: Status quo, $H_1$: Research claim
2. Choose significance level (α)
   - Common values: 0.01, 0.05, 0.10
3. Calculate test statistic
4. Find p-value or critical value
5. Make decision: Reject or fail to reject $H_0$
Common Tests
One-Sample t-test
Test statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
df = n - 1
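The sketch below applies the one-sample t statistic to summary values (n = 25, x̄ = 52, s = 5, μ₀ = 50) for a two-sided test at α = 0.05. The critical value 2.064 is the standard table value for 24 degrees of freedom; it is hard-coded here because the standard library has no t distribution.

```python
import math

def one_sample_t(xbar, mu0, s, n):
    """Return the t statistic and its degrees of freedom."""
    t = (xbar - mu0) / (s / math.sqrt(n))
    return t, n - 1

t_stat, df = one_sample_t(xbar=52, mu0=50, s=5, n=25)

t_crit = 2.064  # two-sided, alpha = 0.05, df = 24 (from a t-table)
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```

Here |t| = 2.0 falls just short of the critical value, so the decision is to fail to reject $H_0$.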
Two-Sample t-test
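Test statistic (pooled, assuming equal population variances): $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Pooled standard deviation: $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$
df = $n_1 + n_2 - 2$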
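A sketch of the pooled two-sample statistic from summary data follows. It assumes equal population variances, with the pooled variance computed as the degrees-of-freedom-weighted average of the two sample variances. The group sizes, means, and standard deviations are made up for illustration.

```python
import math

def pooled_two_sample_t(mean1, s1, n1, mean2, s2, n2):
    """Return (t statistic, pooled variance, degrees of freedom)."""
    df = n1 + n2 - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    standard_error = math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / standard_error, sp2, df

# Hypothetical summary statistics for two groups.
t_stat, sp2, df = pooled_two_sample_t(mean1=20, s1=3, n1=10, mean2=17, s2=4, n2=12)

print(f"t = {t_stat:.4f}, pooled variance = {sp2:.2f}, df = {df}")
```

When the two sample variances differ substantially, Welch's unpooled version of the test is the more common choice.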
Chi-Square Test
Test statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$
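As one instance of the formula, the sketch below runs a goodness-of-fit check on hypothetical counts from 120 rolls of a die, where a fair die would give 20 per face. The critical value 11.070 is the standard table value for α = 0.05 with 5 degrees of freedom.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20, 25, 15, 20]  # hypothetical counts for faces 1..6
expected = [sum(observed) / 6] * 6   # fair die: equal expected counts

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1

chi2_crit = 11.070  # alpha = 0.05, df = 5 (from a chi-square table)
reject_h0 = chi2 > chi2_crit

print(f"chi-square = {chi2:.2f}, df = {df}, reject H0: {reject_h0}")
```

For a test of independence the same statistic is used, with each expected count computed as (row total × column total) / grand total.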
ANOVA (F-test)
Test statistic: $F = \frac{MS_{between}}{MS_{within}}$
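The F statistic for one-way ANOVA can be computed by hand from the between-group and within-group sums of squares. This is a minimal sketch on three tiny made-up groups; `one_way_anova_f` is a hypothetical helper name.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_values)

    # Between-group sum of squares: group means versus the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: values versus their own group mean.
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # made-up data
f_stat, df_between, df_within = one_way_anova_f(groups)

print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

A large F means the group means differ by more than the within-group scatter would suggest.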
Types of Errors
| | $H_0$ True | $H_0$ False |
|---|---|---|
| Reject $H_0$ | Type I Error (α) | Correct Decision |
| Fail to Reject $H_0$ | Correct Decision | Type II Error (β) |
Correlation and Regression
Correlation
Pearson Correlation Coefficient
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
- Range: -1 ≤ r ≤ 1
- r = 1: Perfect positive correlation
- r = -1: Perfect negative correlation
- r = 0: No linear correlation
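The Pearson formula can be implemented directly from its definition. The five (x, y) pairs below are illustrative data, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from the definition."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 8, 10]

r = pearson_r(xs, ys)
print(f"r = {r:.4f}")
```

The result is close to 1, indicating a strong positive linear relationship in this data.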
Simple Linear Regression
Regression Line
$\hat{y} = a + bx$
Slope
$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Intercept
$a = \bar{y} - b\bar{x}$
Coefficient of Determination
$R^2 = r^2$ in simple linear regression (the proportion of the variance in $y$ explained by $x$)
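The slope, intercept, and $R^2$ formulas can be applied step by step. The sketch below fits a least-squares line to five illustrative points; the function name is hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Return (intercept a, slope b, R^2) for the least-squares line y = a + bx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)

    b = s_xy / s_xx                        # slope
    a = y_bar - b * x_bar                  # intercept
    r_squared = s_xy**2 / (s_xx * s_yy)    # equals r^2 for a single predictor
    return a, b, r_squared

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 8, 10]

a, b, r_squared = fit_simple_regression(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x, R^2 = {r_squared:.4f}")
```

The fitted line always passes through the point $(\bar{x}, \bar{y})$, which follows from the intercept formula.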
Multiple Regression
$\hat{y} = b_0 + b_1x_1 + b_2x_2 + ... + b_kx_k$
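Coefficients are found by least squares; in matrix form, $\mathbf{b} = (X^TX)^{-1}X^T\mathbf{y}$
Adjusted $R^2$ penalizes additional predictors, so it rises only when a new predictor improves the fit by more than chance would suggest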
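To show where the matrix algebra comes in, the sketch below builds the normal equations $(X^TX)\mathbf{b} = X^T\mathbf{y}$ and solves them with Gaussian elimination, using only plain Python lists. The data are constructed so that y = 1 + 2x₁ + 3x₂ exactly, which makes the expected coefficients obvious. In practice a numerical library would be used instead; all names here are hypothetical.

```python
def solve_linear_system(A, y):
    """Solve A b = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    b = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * b[j] for j in range(i + 1, n))
        b[i] = (M[i][n] - tail) / M[i][i]
    return b

def fit_multiple_regression(rows, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared residuals."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept b0
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)

rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]  # (x1, x2) pairs
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]     # exact linear relationship

coefficients = fit_multiple_regression(rows, y)
print([round(c, 6) for c in coefficients])
```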
Practice Problems
Beginner Level
- Find the mean, median, and mode of: 2, 4, 4, 7, 9, 10, 12
- Calculate the standard deviation of: 10, 12, 15, 18, 20
- If P(A) = 0.3 and P(B) = 0.4, find P(A∪B) if A and B are mutually exclusive
- Find the z-score for x = 85 if μ = 75 and σ = 8
Intermediate Level
- A coin is flipped 10 times. Find P(exactly 6 heads)
- Construct a 95% confidence interval for μ if n = 36, x̄ = 42, s = 6
- Test H₀: μ = 50 vs H₁: μ ≠ 50 with α = 0.05, n = 25, x̄ = 52, s = 5
- Find the regression line for: (1,2), (2,4), (3,5), (4,8), (5,10)
Advanced Level
- A factory produces items with 5% defect rate. Find P(at most 2 defects in 20 items)
- Test if two population means are equal using a two-sample t-test
- Perform a chi-square test for independence on a 3×2 contingency table
- Calculate the power of a hypothesis test given α, effect size, and sample size