Statistics - Complete Guide with Data Analysis

Introduction and Real-World Applications

Statistics is the science of collecting, analyzing, interpreting, and presenting data. It provides tools to make sense of uncertainty and variability in the world around us.

Why Learn Statistics?

  • Business & Economics: Market analysis, risk assessment, forecasting
  • Medicine & Healthcare: Clinical trials, epidemiology, treatment efficacy
  • Social Sciences: Survey analysis, behavioral studies, policy evaluation
  • Technology: A/B testing, machine learning, quality control
  • Sports Analytics: Performance metrics, strategy optimization
  • Daily Decision Making: Understanding polls, news, and research

Two Branches of Statistics

Descriptive Statistics: Summarize and describe the data that were collected

Inferential Statistics: Use sample data to draw conclusions about a population through estimation, prediction, and hypothesis testing

Descriptive Statistics

Measures of Central Tendency

Mean (Average)

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Sum of all values divided by the number of values

Median

Middle value when data is ordered

  • Odd n: Middle value
  • Even n: Average of two middle values

Mode

Most frequently occurring value(s)
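As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The data set below is a made-up illustration, not taken from this guide.

```python
import statistics

data = [3, 5, 5, 8, 9]  # hypothetical sample, chosen for easy arithmetic

mean = statistics.mean(data)        # sum of values divided by their count
median = statistics.median(data)    # middle value of the sorted data
modes = statistics.multimode(data)  # list of every most-frequent value

print(mean, median, modes)

# With an even number of values, the median averages the two middle values.
even_median = statistics.median([1, 2, 3, 4])
print(even_median)
```

`multimode` is used here rather than `mode` because a data set can have more than one most-frequent value.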

Measures of Variability

Range

Range = Maximum - Minimum

Variance

Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

Standard Deviation

Population: $\sigma = \sqrt{\sigma^2}$

Sample: $s = \sqrt{s^2}$

Coefficient of Variation

$CV = \frac{\sigma}{\mu} \times 100\%$ (variability relative to the mean; for a sample, use $s/\bar{x}$; only meaningful when the mean is not zero)
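The measures of variability above could be computed as in the following sketch, which uses the standard `statistics` module and a hypothetical data set. Note the separate functions for the population (divide by N) and sample (divide by n - 1) versions.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

data_range = max(data) - min(data)      # Maximum - Minimum

pop_var = statistics.pvariance(data)    # population variance (divides by N)
sample_var = statistics.variance(data)  # sample variance (divides by n - 1)

pop_sd = statistics.pstdev(data)        # square root of the population variance
sample_sd = statistics.stdev(data)      # square root of the sample variance

# Coefficient of variation as a percentage, using the population formula
cv_percent = pop_sd / statistics.mean(data) * 100

print(data_range, pop_var, sample_var, pop_sd, sample_sd, cv_percent)
```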

Measures of Position

Percentiles and Quartiles

  • Q1 (25th percentile): First quartile
  • Q2 (50th percentile): Median
  • Q3 (75th percentile): Third quartile
  • IQR: Interquartile Range = Q3 - Q1
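A short sketch of the quartiles and IQR using `statistics.quantiles`. Several conventions exist for interpolating quartiles, so other tools may give slightly different values; the `"inclusive"` method and the data below are choices made for this illustration.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical, already sorted

# n=4 splits the data into four equal-probability groups,
# returning the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1  # Interquartile Range = Q3 - Q1

print(q1, q2, q3, iqr)
```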

Z-Score (Standard Score)

$$z = \frac{x - \mu}{\sigma}$$

Number of standard deviations from the mean
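The z-score formula translates directly into a small function. The name `z_score` and the example values are illustrative only.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Hypothetical example: a value of 90 when the mean is 70 and sigma is 10
print(z_score(90, 70, 10))   # two standard deviations above the mean
print(z_score(55, 70, 10))   # negative: below the mean
```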

Data Visualization

Common Charts

  • Histogram: Distribution of continuous data
  • Box Plot: Five-number summary visualization
  • Scatter Plot: Relationship between two variables
  • Bar Chart: Categorical data comparison
  • Pie Chart: Parts of a whole

Probability Fundamentals

Basic Probability

Probability of an Event

$$P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of possible outcomes}}$$

Properties: $0 \leq P(A) \leq 1$. The counting formula above assumes that all outcomes are equally likely.

Complement Rule

$P(A^c) = 1 - P(A)$

Probability Rules

Addition Rule

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

For mutually exclusive events: $P(A \cup B) = P(A) + P(B)$

Multiplication Rule

$P(A \cap B) = P(A) \cdot P(B|A)$

For independent events: $P(A \cap B) = P(A) \cdot P(B)$

Conditional Probability

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
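The addition rule, multiplication rule, and conditional probability can be checked on a small example. The sketch below assumes one roll of a fair six-sided die, with events A ("even") and B ("greater than 3") chosen for illustration; exact fractions avoid rounding.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (assumed)
A = {2, 4, 6}                   # event A: the roll is even
B = {4, 5, 6}                   # event B: the roll is greater than 3


def prob(event):
    """Favorable outcomes divided by total outcomes (equally likely)."""
    return Fraction(len(event), len(outcomes))


# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_union = prob(A) + prob(B) - prob(A & B)

# Conditional probability: P(B|A) = P(A and B) / P(A)
p_b_given_a = prob(A & B) / prob(A)

# Multiplication rule: P(A and B) = P(A) * P(B|A)
p_intersection = prob(A) * p_b_given_a

# A and B are independent only if P(A and B) equals P(A) * P(B)
independent = prob(A & B) == prob(A) * prob(B)

print(p_union, p_b_given_a, p_intersection, independent)
```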

Bayes' Theorem

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
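A common use of Bayes' theorem is updating the probability of a condition after a positive test. The sketch below is a hypothetical illustration: the function name, parameter names, and numbers are assumptions, and $P(B)$ is expanded with the law of total probability, $P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)$.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) given P(A), P(B|A), and P(B|not A)."""
    # Law of total probability gives the denominator P(B).
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false positives
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # about 0.161: most positives are false positives
```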

Counting Principles

Permutations (Order matters)

$P(n,r) = \frac{n!}{(n-r)!}$

Combinations (Order doesn't matter)

$C(n,r) = \binom{n}{r} = \frac{n!}{r!(n-r)!}$
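Both counts are available in Python's `math` module (version 3.8 or later). The sketch below also checks them against the factorial formulas, using n = 5 and r = 2 as an arbitrary example.

```python
import math

n, r = 5, 2  # hypothetical example: choose 2 items from 5

permutations = math.perm(n, r)   # order matters: n! / (n - r)!
combinations = math.comb(n, r)   # order ignored: n! / (r! (n - r)!)

# The same values from the factorial formulas
perm_formula = math.factorial(n) // math.factorial(n - r)
comb_formula = math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

print(permutations, combinations)
```

Each combination of r items can be ordered in r! ways, which is why `permutations == combinations * math.factorial(r)`.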

Probability Distributions

Discrete Distributions

Binomial Distribution

$X \sim B(n,p)$: the number of successes in n independent trials, each with success probability p

$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$

  • Mean: $\mu = np$
  • Variance: $\sigma^2 = np(1-p)$
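The binomial formula can be sketched as a short function. The name `binomial_pmf` and the parameters n = 4, p = 0.5 are illustrative; the code also recovers the mean and variance from the probabilities.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 4, 0.5  # hypothetical: four flips of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed from the distribution itself
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(pmf[2], mean, variance)
```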

Poisson Distribution

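$X \sim Poisson(\lambda)$: the number of events in a fixed interval when events occur independently at an average rate of $\lambda$ per interval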

$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}$

  • Mean: $\mu = \lambda$
  • Variance: $\sigma^2 = \lambda$
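A minimal sketch of the Poisson probability formula, with λ = 2 as an assumed rate. The sums are truncated at k = 50, where the remaining probability is negligible for this λ.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


lam = 2.0  # hypothetical average rate
pmf = [poisson_pmf(k, lam) for k in range(51)]  # truncated support

mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(pmf[0], mean, variance)  # mean and variance are both close to lam
```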

Continuous Distributions

Normal Distribution

$X \sim N(\mu, \sigma^2)$

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

Standard Normal Distribution

$Z \sim N(0, 1)$

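68-95-99.7 Rule (approximate share of values in any normal distribution):

  • About 68% within 1 standard deviation of the mean
  • About 95% within 2 standard deviations of the mean
  • About 99.7% within 3 standard deviations of the mean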
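The rule can be verified numerically with `statistics.NormalDist` from the standard library. The helper name `share_within` is an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Probability that Z falls within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The exact values are close to, but not exactly, 68%, 95%, and 99.7%, which is why the rule is described as approximate.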

t-Distribution

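Used in place of the normal distribution when the population standard deviation is unknown and is estimated by the sample standard deviation s, especially for small samples

Has heavier tails than the normal distribution and approaches it as the degrees of freedom increase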

Central Limit Theorem

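For sufficiently large sample sizes (n ≥ 30 is a common rule of thumb), the sampling distribution of the sample mean is approximately normal:

$$\bar{X} \overset{\text{approx.}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right)$$

This holds regardless of the shape of the population distribution, provided the observations are independent and the population has a finite variance. Heavily skewed populations may need a larger n.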
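A small simulation can illustrate the theorem. The sketch below assumes a strongly skewed population (exponential with mean 1 and standard deviation 1), a sample size of 40, and 5,000 repetitions; all of these are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is reproducible


def sample_mean(n):
    """Mean of n draws from an exponential population (mu = 1, sigma = 1)."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))


n = 40
means = [sample_mean(n) for _ in range(5000)]

# The theorem predicts mean mu = 1 and standard deviation sigma / sqrt(n).
print(statistics.fmean(means))   # close to 1
print(statistics.stdev(means))   # close to 1 / sqrt(40), about 0.158
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the population itself is far from normal.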

Sampling and Estimation

Sampling Methods

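  • Simple Random Sampling: Every member has an equal chance of being selected
  • Stratified Sampling: Divide the population into strata, then sample from each stratum
  • Cluster Sampling: Randomly select whole clusters and include every member of the selected clusters
  • Systematic Sampling: Select every kth member after a random starting point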
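The four methods could be sketched with the standard `random` module as follows. The population of 100 numbered members, the two strata, and the ten clusters are all hypothetical.

```python
import random

random.seed(0)  # fixed seed for reproducibility
population = list(range(1, 101))  # hypothetical member IDs 1..100

# Simple random sampling: every member equally likely
simple = random.sample(population, 10)

# Systematic sampling: every kth member after a random start
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: sample separately from each stratum
strata = {"low": population[:50], "high": population[50:]}
stratified = [m for group in strata.values() for m in random.sample(group, 5)]

# Cluster sampling: pick whole clusters and keep every member in them
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [m for c in random.sample(clusters, 2) for m in c]

print(simple, systematic, stratified, cluster_sample, sep="\n")
```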

Point Estimation

Properties of Good Estimators

  • Unbiased: $E(\hat{\theta}) = \theta$
  • Consistent: Converges to true value as n increases
  • Efficient: Minimum variance among unbiased estimators

Confidence Intervals

CI for Mean (σ known)

$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

CI for Mean (σ unknown)

$$\bar{x} \pm t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}$$

CI for Proportion

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
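The z-based intervals above can be sketched with the standard library alone. The function names and example values are assumptions for illustration. The t-based interval is left out because the standard library has no t-distribution; it has the same shape with $t_{\alpha/2, n-1}$ and s in place of $z_{\alpha/2}$ and σ.

```python
from math import sqrt
from statistics import NormalDist


def z_critical(confidence):
    """Critical value z_{alpha/2} for a two-sided interval."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def mean_ci_known_sigma(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known."""
    margin = z_critical(confidence) * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


def proportion_ci(p_hat, n, confidence=0.95):
    """Large-sample confidence interval for a proportion."""
    margin = z_critical(confidence) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical inputs
print(mean_ci_known_sigma(x_bar=50, sigma=10, n=100))  # roughly (48.04, 51.96)
print(proportion_ci(p_hat=0.4, n=600))                 # roughly (0.361, 0.439)
```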

Sample Size Determination

For mean: $n = \left(\frac{z_{\alpha/2} \sigma}{E}\right)^2$

For proportion: $n = \left(\frac{z_{\alpha/2}}{E}\right)^2 p(1-p)$
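A sketch of both sample-size formulas, where E is the desired margin of error. The result is rounded up because a sample size must be a whole number that meets the margin. Function names and inputs are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def z_critical(confidence):
    """Critical value z_{alpha/2} for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= margin."""
    return ceil((z_critical(confidence) * sigma / margin) ** 2)


def sample_size_for_proportion(margin, p=0.5, confidence=0.95):
    """Smallest n for a proportion; p = 0.5 is the most conservative guess."""
    return ceil((z_critical(confidence) / margin) ** 2 * p * (1 - p))


print(sample_size_for_mean(sigma=15, margin=3))   # hypothetical inputs
print(sample_size_for_proportion(margin=0.05))
```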

Hypothesis Testing

Steps in Hypothesis Testing

Step 1: State the null and alternative hypotheses

$H_0$: Status quo, $H_1$: Research claim

Step 2: Choose a significance level (α)

Common values: 0.01, 0.05, 0.10

Step 3: Calculate the test statistic

Step 4: Find the p-value or critical value

Step 5: Make a decision: Reject or fail to reject $H_0$
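The five steps can be traced in code. As a simplifying assumption, the sketch below uses a two-sided z-test with a known population standard deviation, because the normal distribution is available in the standard library; the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(x_bar, mu_0, sigma, n, alpha=0.05):
    """Test H0: mu = mu_0 against H1: mu != mu_0 with sigma known."""
    # Step 3: test statistic
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    # Step 4: two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 5: reject H0 when the p-value is below alpha
    return z, p_value, p_value < alpha


# Steps 1 and 2: H0: mu = 100, H1: mu != 100, alpha = 0.05 (hypothetical)
z, p_value, reject = two_sided_z_test(x_bar=103, mu_0=100, sigma=15, n=100)
print(z, round(p_value, 4), reject)
```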

Common Tests

One-Sample t-test

Test statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$

df = n - 1
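The one-sample t statistic can be computed from raw data as sketched below. Only the statistic and degrees of freedom are returned; the p-value or critical value would come from a t-table or a statistics library. The function name and data are illustrative.

```python
import statistics
from math import sqrt


def one_sample_t(sample, mu_0):
    """Return (t, df) for testing H0: mu = mu_0."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1)
    t = (x_bar - mu_0) / (s / sqrt(n))
    return t, n - 1


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5
t, df = one_sample_t(sample, mu_0=4)
print(round(t, 4), df)
```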

Two-Sample t-test

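Test statistic: $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

where $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$ is the pooled standard deviation. This form assumes the two populations have equal variances.

df = $n_1 + n_2 - 2$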
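A sketch of the pooled two-sample t statistic, assuming equal population variances. Here the pooled standard deviation $s_p$ combines the two sample variances weighted by their degrees of freedom. The function name and data are hypothetical.

```python
import statistics
from math import sqrt


def pooled_two_sample_t(sample_1, sample_2):
    """Return (t, df) for the equal-variance two-sample t-test."""
    n1, n2 = len(sample_1), len(sample_2)
    var_1 = statistics.variance(sample_1)
    var_2 = statistics.variance(sample_2)
    # Pooled standard deviation: variances weighted by degrees of freedom
    s_p = sqrt(((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2))
    diff = statistics.mean(sample_1) - statistics.mean(sample_2)
    t = diff / (s_p * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


t, df = pooled_two_sample_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(round(t, 4), df)
```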

Chi-Square Test

Test statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where O is the observed frequency and E is the expected frequency in each category
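The chi-square statistic is a sum over categories, as in this sketch. The counts are hypothetical: 50 coin flips with 30 heads, tested against a fair coin.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [30, 20]  # hypothetical counts: heads, tails
expected = [25, 25]  # counts expected under a fair coin

chi_sq = chi_square_statistic(observed, expected)
print(chi_sq)
```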

ANOVA (F-test)

Test statistic: $F = \frac{MS_{between}}{MS_{within}}$
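A sketch of the F statistic for one-way ANOVA. It uses the usual definitions, which this guide does not spell out: $MS_{between}$ is the between-group sum of squares divided by k - 1, and $MS_{within}$ is the within-group sum of squares divided by N - k, for k groups and N total observations. The data are made up.

```python
def one_way_anova_f(groups):
    """Return F = MS_between / MS_within for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of observations around their own group mean
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]  # hypothetical data
f_stat = one_way_anova_f(groups)
print(f_stat)
```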

Types of Errors

| Decision | $H_0$ True | $H_0$ False |
|---|---|---|
| Reject $H_0$ | Type I Error (α) | Correct Decision |
| Fail to Reject $H_0$ | Correct Decision | Type II Error (β) |

Correlation and Regression

Correlation

Pearson Correlation Coefficient

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

  • Range: -1 ≤ r ≤ 1
  • r = 1: Perfect positive correlation
  • r = -1: Perfect negative correlation
  • r = 0: No linear correlation
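The Pearson formula can be written out term by term, as in this sketch with a hypothetical data set. Python 3.10 and later also provide `statistics.correlation`, which should give the same value.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (sqrt(s_xx) * sqrt(s_yy))


xs = [1, 2, 3, 4, 5]   # hypothetical paired observations
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(round(r, 4))  # about 0.7746: a fairly strong positive correlation
```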

Simple Linear Regression

Regression Line

$\hat{y} = a + bx$

Slope

$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Intercept

$a = \bar{y} - b\bar{x}$

Coefficient of Determination

$R^2 = r^2$ in simple linear regression (the proportion of the variance in y explained by x)
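The slope, intercept, and coefficient of determination can be computed directly from the formulas above. In this sketch, $R^2$ is calculated from the residuals as 1 - SS_res / SS_tot, which for simple linear regression equals $r^2$. The function names and data are illustrative.

```python
def fit_line(xs, ys):
    """Return intercept a and slope b of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def r_squared(xs, ys, a, b):
    """Proportion of variance in ys explained by the line a + b*x."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = [1, 2, 3, 4, 5]   # hypothetical paired observations
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
r2 = r_squared(xs, ys, a, b)
print(round(a, 4), round(b, 4), round(r2, 4))
```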

Multiple Regression

$\hat{y} = b_0 + b_1x_1 + b_2x_2 + ... + b_kx_k$

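Coefficients are found by least squares. In matrix form, $\mathbf{b} = (X^TX)^{-1}X^T\mathbf{y}$, where X holds a column of ones followed by the predictor columns

Adjusted $R^2$ penalizes additional predictors: $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$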
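As a rough sketch of the matrix approach, the code below solves the normal equations $(X^TX)\mathbf{b} = X^T\mathbf{y}$ in pure Python with Gaussian elimination. In practice a numerical library would normally be used; the helper names and the data (generated from y = 1 + 2x₁ + 3x₂) are assumptions for illustration. The adjusted $R^2$ uses the standard formula $1 - (1 - R^2)\frac{n-1}{n-k-1}$ for n observations and k predictors.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_multiple_regression(rows, y):
    """Return [b0, b1, ..., bk] from the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


def adjusted_r_squared(r2, n, k):
    """Adjust R^2 for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]  # hypothetical (x1, x2) pairs
y = [1, 3, 4, 6, 8]                              # exactly 1 + 2*x1 + 3*x2

coefficients = fit_multiple_regression(rows, y)
print([round(c, 6) for c in coefficients])
print(adjusted_r_squared(0.9, n=21, k=2))
```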

Practice Problems

Beginner Level

  1. Find the mean, median, and mode of: 2, 4, 4, 7, 9, 10, 12
  2. Calculate the standard deviation of: 10, 12, 15, 18, 20
  3. If P(A) = 0.3 and P(B) = 0.4, find P(A∪B) if A and B are mutually exclusive
  4. Find the z-score for x = 85 if μ = 75 and σ = 8

Intermediate Level

  1. A coin is flipped 10 times. Find P(exactly 6 heads)
  2. Construct a 95% confidence interval for μ if n = 36, x̄ = 42, s = 6
  3. Test H₀: μ = 50 vs H₁: μ ≠ 50 with α = 0.05, n = 25, x̄ = 52, s = 5
  4. Find the regression line for: (1,2), (2,4), (3,5), (4,8), (5,10)

Advanced Level

  1. A factory produces items with 5% defect rate. Find P(at most 2 defects in 20 items)
  2. Test if two population means are equal using a two-sample t-test
  3. Perform a chi-square test for independence on a 3×2 contingency table
  4. Calculate the power of a hypothesis test given α, effect size, and sample size
