Hard 12 min · May 28, 2026

Hypothesis Testing for Data Science: A Production-Focused Guide

Master hypothesis testing for data science with this production-focused guide.

Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.

Last updated: June 02, 2026
Quick Answer
  • Hypothesis testing is a statistical method to decide if data provides enough evidence to reject a default assumption (null hypothesis).
  • Key components: null hypothesis (H0), alternative hypothesis (H1), test statistic, p-value, significance level (alpha).
  • Common tests: t-test (means), chi-square test (categorical), ANOVA (multiple groups), z-test (proportions).
  • Type I error: rejecting a true null (false positive). Type II error: failing to reject a false null (false negative).
  • P-value < alpha means reject H0; p-value >= alpha means fail to reject H0.
  • Power = 1 - beta, the probability of correctly rejecting a false null.
✦ Definition · ~90s read
What is Hypothesis Testing for Data Science?

Hypothesis testing is a method of statistical inference used to decide whether sample data provides sufficient evidence to reject a specific claim (the null hypothesis) about a population parameter. It involves calculating a test statistic and comparing it to a critical value or evaluating its associated p-value.

Plain-English First

Think of hypothesis testing like a courtroom trial. The null hypothesis is 'innocent until proven guilty.' You collect evidence (data) and calculate a p-value, which is like the probability of seeing that evidence if the person were innocent. If that probability is very low (below your significance level, say 5%), you reject innocence and conclude guilt. You never prove innocence; you just fail to find enough evidence to convict.

Production systems demand rigorous validation of every assumption, from A/B test results to feature importance. Hypothesis testing is the statistical method that separates genuine insights from noise, yet many developers treat it as a black box, leading to costly errors in deployment.

This guide skips the academic fluff. You'll learn not just the mechanics of t-tests and chi-square tests, but how to apply them in real-world pipelines—where sample sizes are finite, distributions are messy, and business decisions hang on the outcome.

We start with core definitions: null and alternative hypotheses, test statistics, p-values, and significance levels. Then we dive into the most common tests, their assumptions, and when to use each. Finally, we cover production pitfalls, debugging strategies, and a war story from a real incident.

By the end, you'll be able to design hypothesis tests that are statistically sound and operationally robust, avoiding the traps that trip up even experienced engineers.

What is Hypothesis Testing? Core Definitions and Intuition

Hypothesis testing is a formal framework for making decisions under uncertainty using sample data. At its core, it answers a binary question: do the observed data provide enough evidence to reject a default assumption about a population parameter? The default assumption is called the null hypothesis (H₀), and the alternative (H₁ or Hₐ) is what you suspect might be true. For example, in an A/B test, H₀ might be 'the new feature does not change conversion rate' and H₁ 'the new feature increases conversion rate'.

The process involves computing a test statistic from the sample, then calculating the probability of observing a value at least as extreme as that statistic if H₀ were true. That probability is the p-value. If the p-value is below a pre-specified significance level α (commonly 0.05), you reject H₀. Crucially, you never 'accept' H₀; you either reject it or fail to reject it.

The entire logic is built on falsification: you try to disprove the null, not prove the alternative. This mirrors the scientific method, where you can only accumulate evidence against a hypothesis, never definitively confirm it. In practice, the choice of test statistic depends on the data type and question: t-tests for means, chi-square for categorical associations, F-tests for variance comparisons. The intuition is straightforward: if data this extreme would be very unlikely under H₀, the data count as evidence against H₀.

io/thecodeforge/hypothesis_testing_intro.pyPYTHON
import numpy as np
from scipy import stats

# Simulate A/B test: control (n=1000, mean=0.10, std=0.05) vs treatment (n=1000, mean=0.12, std=0.05)
np.random.seed(42)
control = np.random.normal(0.10, 0.05, 1000)
treatment = np.random.normal(0.12, 0.05, 1000)

# Two-sample t-test (assuming equal variance)
t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=True)
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

# Decision at alpha=0.05
alpha = 0.05
if p_value < alpha:
    print("Reject H0: significant difference in means")
else:
    print("Fail to reject H0: no significant difference")
Output
t-statistic: -8.997
p-value: 0.0000
Reject H0: significant difference in means
Falsification, not confirmation
Hypothesis testing is built on the logic of falsification: you can only reject the null, never prove the alternative. A low p-value does not mean H₁ is true; it means the data are inconsistent with H₀.
Production Insight
Always set your significance level α before collecting data. In production A/B tests, use α=0.05 for most cases, but for high-stakes decisions (e.g., medical trials) use α=0.01. Never adjust α after seeing the data — that's p-hacking.
Key Takeaway
Hypothesis testing is a decision framework: compute a test statistic, get a p-value, compare to α. Reject H₀ if p < α. Never accept H₀. The logic is falsification, not confirmation.
[Figure: Hypothesis Testing Workflow for Data Science (thecodeforge.io). From null hypothesis to production decision with error control. Steps: formulate hypotheses (define H0 and H1 before any data collection); choose test and compute statistic (t-test, chi-square, ANOVA based on data type); calculate p-value (probability under H0; compare to alpha); interpret result (reject H0 if p < alpha, else fail to reject); check assumptions (normality, independence, variance homogeneity). Warning: multiple comparisons inflate Type I error; apply Bonferroni or FDR correction when testing many hypotheses.]

The Null and Alternative Hypotheses: How to Formulate Them Correctly

Formulating hypotheses correctly is the most critical step in hypothesis testing. The null hypothesis (H₀) represents the status quo: no effect or no difference. It must be a statement about a population parameter, not a sample statistic. For a one-sample mean test, H₀: μ = μ₀ (e.g., μ = 0 for a drug effect). The alternative hypothesis (H₁ or Hₐ) states what you want to detect. It can be one-sided (μ > μ₀ or μ < μ₀) or two-sided (μ ≠ μ₀), and the choice depends on the research question. For example, if you only care whether a new drug lowers blood pressure, use a one-sided alternative (μ < μ₀). If you care about any change, use a two-sided alternative (μ ≠ μ₀).

A common mistake is to formulate H₀ as 'the sample mean equals 0'. That is wrong, because hypotheses are about populations, not samples. Another pitfall is using the data to decide the direction of the alternative, which inflates Type I error. Always pre-specify H₁ before seeing the data.

In practice, for A/B tests, H₀: conversion_rate_treatment = conversion_rate_control, and H₁: conversion_rate_treatment ≠ conversion_rate_control (two-sided) or conversion_rate_treatment > conversion_rate_control (one-sided). For regression, H₀: β = 0 (no effect), H₁: β ≠ 0. The key is precision: H₀ should pin the parameter to a specific value so that the distribution of the test statistic under H₀ is known. With a one-sided alternative, the null is sometimes written as a range (μ ≤ μ₀), but the test is still calibrated at the boundary value μ = μ₀. Always write H₀ and H₁ in mathematical notation before coding.

io/thecodeforge/formulate_hypotheses.pyPYTHON
import numpy as np
from scipy import stats

# Example: test if mean IQ of a sample (n=30) differs from population mean of 100
np.random.seed(123)
sample = np.random.normal(105, 15, 30)  # sample mean ~105

# H0: mu = 100, H1: mu != 100 (two-sided)
mu0 = 100
t_stat, p_value = stats.ttest_1samp(sample, mu0)
print(f"Sample mean: {sample.mean():.2f}")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value (two-sided): {p_value:.4f}")

# One-sided test: H1: mu > 100
# p-value for one-sided = p/2 if t>0, else 1-p/2
if t_stat > 0:
    p_one_sided = p_value / 2
else:
    p_one_sided = 1 - p_value / 2
print(f"p-value (one-sided, mu > 100): {p_one_sided:.4f}")
Output
Sample mean: 105.58
t-statistic: 2.041
p-value (two-sided): 0.0503
p-value (one-sided, mu > 100): 0.0252
Never use data to choose H₁ direction
If you look at the sample mean and then decide to test one-sided, you are inflating Type I error. Pre-specify H₁ before collecting data. Otherwise, use a two-sided test.
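The inflation described here can be checked with a small simulation. The sketch below is illustrative only: the seed, sample size, and number of simulations are arbitrary assumptions, and every sample is drawn with H₀ true so that any rejection is a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)        # arbitrary seed for reproducibility
n_sim, n, alpha = 20_000, 30, 0.05    # illustrative settings

# Every simulated sample is drawn with H0 true (population mean = 0).
samples = rng.normal(loc=0.0, scale=1.0, size=(n_sim, n))
t_stat, p_two = stats.ttest_1samp(samples, popmean=0.0, axis=1)

# (a) Two-sided test, specified in advance.
rate_two_sided = np.mean(p_two < alpha)

# (b) One-sided test with the direction fixed in advance (H1: mu > 0).
p_pre = np.where(t_stat > 0, p_two / 2, 1 - p_two / 2)
rate_prespecified = np.mean(p_pre < alpha)

# (c) One-sided test whose direction is chosen after seeing the sign of the
#     sample mean: the reported p-value is then always p_two / 2.
rate_post_hoc = np.mean(p_two / 2 < alpha)

print(f"Two-sided, pre-specified:  {rate_two_sided:.3f}")
print(f"One-sided, pre-specified:  {rate_prespecified:.3f}")
print(f"One-sided, chosen post hoc: {rate_post_hoc:.3f}")
```

The two pre-specified tests should land near the nominal 5%, while the post-hoc choice of direction should land near 10%, because it is a two-sided test run at an effective α of 0.10.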
Production Insight
In production, always write hypotheses in a design document before running experiments. For A/B tests, use two-sided tests by default unless there is a strong prior that the effect can only go one way (e.g., a safety metric). One-sided tests halve the p-value when the observed effect is in the hypothesized direction, which makes borderline results look stronger than a two-sided test would.
Key Takeaway
H₀ is the status quo (no effect). H₁ is the alternative (effect). Both must be about population parameters. Pre-specify direction (one- or two-sided) before data collection. Never formulate H₀ from the sample.

Test Statistics and P-Values: Calculation and Interpretation

A test statistic is a single number computed from sample data that measures the discrepancy between the observed data and what is expected under H₀. The choice of test statistic depends on the hypothesis and data type:

  • Means: the t-statistic, t = (x̄ - μ₀) / (s / √n), where x̄ is the sample mean, μ₀ is the null mean, s is the sample standard deviation, and n is the sample size. Under H₀, t follows a t-distribution with n-1 degrees of freedom.
  • Proportions: the z-statistic, z = (p̂ - p₀) / √(p₀(1-p₀)/n).
  • Categorical data: the chi-square statistic, χ² = Σ (Oᵢ - Eᵢ)² / Eᵢ.
  • Multiple groups: the F-statistic from ANOVA.

The p-value is the probability, under H₀, of observing a test statistic as extreme or more extreme than the one computed from the sample. A small p-value (typically < 0.05) indicates that the observed data are unlikely under H₀, leading to rejection. But the p-value is not the probability that H₀ is true. It is P(data at least this extreme | H₀), not P(H₀ | data). This is a common misinterpretation. For example, if p = 0.03, it means that if H₀ were true, you would see data this extreme only 3% of the time. It does not mean there is a 97% chance H₀ is false.

The p-value also depends on sample size: with large n, even tiny effects become statistically significant. Always report effect size (e.g., Cohen's d) alongside the p-value. In practice, compute the test statistic and p-value using libraries like scipy.stats. For custom tests, you can simulate the null distribution via permutation or bootstrap.
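To connect the t-statistic formula to the library call, the following sketch computes t, the two-sided p-value, and Cohen's d by hand and compares them with scipy. The measurements and the null value of 50 are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements; H0: mu = 50, two-sided alternative.
sample = np.array([51.2, 49.8, 52.5, 50.9, 53.1, 48.7, 52.0, 51.4])
mu0 = 50.0

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)                      # sample standard deviation

# t = (x_bar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom
t_manual = (x_bar - mu0) / (s / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# One-sample Cohen's d: the same discrepancy, but not scaled by sqrt(n)
cohens_d = (x_bar - mu0) / s

t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)
print(f"manual: t={t_manual:.3f}, p={p_manual:.4f}, d={cohens_d:.3f}")
print(f"scipy:  t={t_scipy:.3f}, p={p_scipy:.4f}")
```

The manual and library values should agree to floating-point precision, which is a quick way to confirm which formula a library call actually implements.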

io/thecodeforge/test_statistics_pvalues.pyPYTHON
import numpy as np
from scipy import stats

# Example: test if proportion of clicks differs from 0.2 (H0: p=0.2)
n = 500
clicks = 120  # observed successes
p_hat = clicks / n
p0 = 0.2

# z-test for proportion
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-sided
print(f"Observed proportion: {p_hat:.3f}")
print(f"z-statistic: {z:.3f}")
print(f"p-value (two-sided): {p_value:.4f}")

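# Cross-check with statsmodels' proportions_ztest.
# prop_var=p0 makes it use the null proportion in the standard error, matching the manual formula;
# the default uses the sample proportion instead and gives a slightly different z.
from statsmodels.stats.proportion import proportions_ztest
z_stat, p_val = proportions_ztest(clicks, n, value=p0, alternative='two-sided', prop_var=p0)
print(f"\nUsing statsmodels: z={z_stat:.3f}, p={p_val:.4f}")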
Output
Observed proportion: 0.240
z-statistic: 2.236
p-value (two-sided): 0.0253
Using statsmodels: z=2.236, p=0.0253
P-value is P(data | H₀), not P(H₀ | data)
A common fallacy: 'p=0.03 means there's a 97% chance the alternative is true.' Wrong. The p-value is the probability of the data given the null, not the probability of the null given the data. Bayesian methods are needed for the latter.
Production Insight
Always report effect size and confidence intervals alongside p-values. In large-scale A/B testing, a p-value of 0.001 might come from a 0.1% lift that is practically irrelevant. Use minimum detectable effect (MDE) during experiment design to ensure sample size is adequate.
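A minimal sketch of why p-values alone mislead at scale: the same tiny effect (Cohen's d = 0.02) is tested at two sample sizes using summary statistics. The means, standard deviation, and sample sizes are made-up values for illustration.

```python
from scipy import stats

# Hypothetical summary statistics: a 0.3-point difference on a metric
# with standard deviation 15, i.e. Cohen's d = 0.02 (a tiny effect).
mean_control, mean_treatment, sd = 100.0, 100.3, 15.0
cohens_d = (mean_treatment - mean_control) / sd

p_by_n = {}
for n in (100, 1_000_000):
    # Two-sample t-test computed directly from summary statistics.
    t_stat, p_value = stats.ttest_ind_from_stats(
        mean_treatment, sd, n, mean_control, sd, n, equal_var=True
    )
    p_by_n[n] = p_value
    print(f"n per group = {n:>9,}: d = {cohens_d:.2f}, t = {t_stat:.2f}, p = {p_value:.3g}")
```

The effect size is identical in both runs; only the sample size changes, yet the result moves from clearly non-significant to overwhelmingly significant.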
Key Takeaway
Test statistic measures discrepancy from H₀. P-value = P(observed or more extreme | H₀). Small p-value → reject H₀. But p-value is not P(H₀ false). Always pair with effect size. Use scipy or statsmodels for standard tests.

Type I and Type II Errors: Balancing False Positives and False Negatives

Hypothesis testing decisions are never certain. Two types of errors can occur. Type I error (false positive): rejecting H₀ when it is actually true. The probability of a Type I error is α, the significance level. By convention, α is set to 0.05, meaning you are willing to accept a 5% chance of falsely rejecting a true H₀. Type II error (false negative): failing to reject H₀ when H₁ is true. Its probability is β. The power of a test is 1 - β, the probability of correctly rejecting H₀ when H₁ is true.

There is a fundamental trade-off: decreasing α (e.g., to 0.01) reduces Type I error but increases β (reduces power), all else equal. To maintain power while lowering α, you need a larger sample size. In practice, the consequences of each error dictate the choice of α. In medical trials, a Type I error (approving an ineffective drug) is very costly, so α is often set lower (e.g., 0.01). In exploratory data analysis, a higher α (0.10) might be acceptable to avoid missing potential signals.

The power of a test depends on effect size, sample size, α, and variability. You can compute the required sample size for a desired power (e.g., 80%) using power analysis. For example, to detect a 0.5 standard deviation effect with 80% power at α=0.05, you need about 64 samples per group in a two-sample t-test. In production, always pre-register your α and desired power, and compute sample size before running experiments. Multiple testing inflates Type I error: if you run 20 tests at α=0.05 and every null is true, you expect one false positive on average. Use corrections like Bonferroni or FDR.

io/thecodeforge/type_errors_power.pyPYTHON
import numpy as np
from scipy import stats

# Simulate Type I error: H0 true, but we reject
np.random.seed(42)
n_sim = 10000
n = 50
mu0 = 0
sigma = 1
alpha = 0.05
type1_count = 0
for _ in range(n_sim):
    sample = np.random.normal(mu0, sigma, n)
    t_stat, p_val = stats.ttest_1samp(sample, mu0)
    if p_val < alpha:
        type1_count += 1
print(f"Empirical Type I error rate: {type1_count/n_sim:.3f} (expected {alpha})")

# Power analysis: detect effect size d=0.5 with 80% power in a two-sample t-test.
# TTestIndPower is the two-independent-samples class; TTestPower covers one-sample/paired designs.
from statsmodels.stats.power import TTestIndPower
power_analysis = TTestIndPower()
sample_size = power_analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05, ratio=1.0, alternative='two-sided')
print(f"Required sample size per group for 80% power: {np.ceil(sample_size):.0f}")
Output
Empirical Type I error rate: 0.049 (expected 0.05)
Required sample size per group for 80% power: 64
Multiple testing inflates Type I error
If you run 20 independent tests at α=0.05, the family-wise error rate is 1 - (0.95)^20 ≈ 0.64. Always correct for multiple comparisons using Bonferroni (α/m) or FDR control.
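The family-wise error rate formula is simple enough to tabulate directly. The sketch below assumes independent tests with every null true, which is the setting the formula describes; the helper name `fwer` is introduced here for illustration.

```python
def fwer(per_test_alpha: float, m: int) -> float:
    """Probability of at least one false positive across m independent tests
    when every null hypothesis is true."""
    return 1 - (1 - per_test_alpha) ** m

alpha = 0.05
for m in (1, 5, 20, 100):
    uncorrected = fwer(alpha, m)
    bonferroni = fwer(alpha / m, m)   # Bonferroni: test each hypothesis at alpha / m
    print(f"m={m:>3}: uncorrected FWER={uncorrected:.3f}, with Bonferroni={bonferroni:.3f}")
```

With Bonferroni the family-wise rate stays at or just below α for every m, which is the guarantee the correction is designed to give.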
Production Insight
In production A/B testing, always pre-compute required sample size using power analysis. Unplanned peeking at interim results inflates Type I error; if you need early looks, use sequential testing (e.g., always-valid p-values) so the error rate stays controlled. Set α based on the business cost of false positives vs false negatives.
Key Takeaway
Type I error (α): false positive. Type II error (β): false negative. Power = 1 - β. Trade-off: lower α → higher β unless sample size increases. Always pre-register α, desired power, and compute sample size. Correct for multiple tests.

Common Hypothesis Tests: t-Test, Chi-Square, ANOVA, and z-Test

The t-test is your go-to for comparing means when the population standard deviation is unknown, which is almost always the case. The distinction from the z-test matters most for small samples (n < 30), where the t-distribution's heavier tails differ noticeably from the normal. The test statistic t = (x̄ - μ₀) / (s / √n) follows a t-distribution with n-1 degrees of freedom. Use the independent two-sample t-test to compare means between two groups (e.g., control vs. treatment in an A/B test), and the paired t-test for before-after measurements on the same subjects. For large samples (n ≥ 30), the t-distribution is close to the normal, so the t-test and z-test give nearly identical results. The z-test statistic is z = (x̄ - μ₀) / (σ / √n), requiring known σ or a large enough sample to estimate it reliably. In practice, z-tests for means are rare because σ is almost never known; use t-tests unless you're dealing with proportions (e.g., conversion rates), where the z-test for proportions applies: z = (p̂ - p₀) / √(p₀(1-p₀)/n).
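The difference between the paired and independent designs is easiest to see on the same numbers. In this sketch the before/after values are hypothetical; it shows that a paired t-test is a one-sample t-test on the within-subject differences, and what happens if the pairing is ignored.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after measurements on the same eight subjects.
before = np.array([210.0, 195.0, 250.0, 230.0, 205.0, 240.0, 220.0, 215.0])
after = np.array([198.0, 190.0, 238.0, 229.0, 196.0, 231.0, 214.0, 209.0])

# Paired t-test: uses within-subject differences.
t_paired, p_paired = stats.ttest_rel(before, after)

# Equivalent formulation: one-sample t-test of the differences against 0.
t_diff, p_diff = stats.ttest_1samp(before - after, 0.0)

# Treating the same data as independent groups discards the pairing.
t_indep, p_indep = stats.ttest_ind(before, after, equal_var=False)

print(f"paired:      t={t_paired:.3f}, p={p_paired:.4f}")
print(f"differences: t={t_diff:.3f}, p={p_diff:.4f}")
print(f"independent: t={t_indep:.3f}, p={p_indep:.4f}")
```

Because subjects differ far more from each other than they change over time, the independent-samples test buries the effect in between-subject noise, while the paired test isolates it.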

Chi-square tests handle categorical data. The chi-square goodness-of-fit test checks if observed frequencies match an expected distribution: χ² = Σ (Oᵢ - Eᵢ)² / Eᵢ, with degrees of freedom = k - 1 - (number of estimated parameters). The chi-square test of independence evaluates whether two categorical variables are associated in a contingency table. For each cell, the expected frequency under independence is (row total × column total) / grand total, and the test has (rows - 1) × (columns - 1) degrees of freedom. A significant χ² (p < 0.05) suggests the variables are not independent. Chi-square tests require expected frequencies ≥ 5 in at least 80% of cells; otherwise, use Fisher's exact test.
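The expected-frequency rule can be worked through on a small table. This sketch uses a hypothetical 2×2 table and compares the hand calculation with scipy; note that `chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, so `correction=False` is passed to match the plain formula.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table (rows: variant, columns: outcome).
observed = np.array([[30, 10],
                     [20, 40]])

# Expected count per cell = row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

# Library version without the continuity correction, to match the formula above.
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(
    observed, correction=False
)

# Fisher's exact test: the fallback when expected counts are too small.
odds_ratio, p_fisher = stats.fisher_exact(observed)

print("expected counts:\n", expected_manual)
print(f"chi2 (manual)={chi2_manual:.3f}, chi2 (scipy)={chi2_scipy:.3f}, dof={dof}")
print(f"p (chi-square)={p_manual:.5f}, p (Fisher)={p_fisher:.5f}")
```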

ANOVA (Analysis of Variance) extends the t-test to three or more groups. The one-way ANOVA partitions total variance into between-group variance (treatment effect) and within-group variance (error). The F-statistic is F = MS_between / MS_within, where MS = SS / df. If F exceeds the critical value from the F-distribution with (k-1, N-k) degrees of freedom, reject the null that all group means are equal. Post-hoc tests (Tukey HSD, Bonferroni) are mandatory after a significant ANOVA to identify which pairs differ. Two-way ANOVA adds a second factor and can test for interaction effects. In production, always check ANOVA assumptions: normality of residuals, homogeneity of variances (Levene's test), and independence of observations.
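The variance partition behind the F-statistic can be reproduced in a few lines. The three groups below are hypothetical scores; the sketch computes SS, MS, and F by hand and checks them against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three groups.
groups = [
    np.array([82.0, 85.0, 88.0, 90.0, 79.0]),
    np.array([91.0, 87.0, 94.0, 89.0, 92.0]),
    np.array([78.0, 75.0, 80.0, 83.0, 77.0]),
]

k = len(groups)                               # number of groups
n_total = sum(g.size for g in groups)         # N
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of group means around the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F={f_manual:.3f}, p={p_manual:.4f}")
print(f"scipy:  F={f_scipy:.3f}, p={p_scipy:.4f}")
```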

io/thecodeforge/hypothesis_tests.pyPYTHON
import numpy as np
from scipy import stats

# Independent two-sample t-test
np.random.seed(42)
control = np.random.normal(loc=100, scale=15, size=50)
treatment = np.random.normal(loc=110, scale=15, size=50)

t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")

# Chi-square test of independence
observed = np.array([[30, 10], [20, 40]])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square: {chi2:.3f}, p-value: {p:.4f}")

# One-way ANOVA
group1 = np.random.normal(100, 15, 30)
group2 = np.random.normal(105, 15, 30)
group3 = np.random.normal(110, 15, 30)
f_stat, p_anova = stats.f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat:.3f}, p-value: {p_anova:.4f}")
Output
t-statistic: -3.456, p-value: 0.0008
Chi-square: 15.042, p-value: 0.0001
F-statistic: 4.567, p-value: 0.0123
T-test vs z-test in Practice
Always default to the t-test for means. The z-test is only justified when you know the population standard deviation (almost never) or when testing proportions with large samples.
Production Insight
In production A/B testing, use Welch's t-test (unequal variance assumption) by default. It is robust to heteroscedasticity and avoids the pooled-variance assumption, which can inflate false positives when variances and group sizes both differ.
Key Takeaway
T-test for means (small sample, unknown σ), z-test for proportions (large sample), chi-square for categorical associations, ANOVA for multiple group comparisons. Always verify assumptions before interpreting p-values.

Assumptions and When to Use Non-Parametric Alternatives

Parametric tests (t-test, ANOVA, z-test) rely on three core assumptions: independence of observations, normality of the sampling distribution, and homogeneity of variances. Independence is the most critical—violations (e.g., repeated measures on the same subject without accounting for correlation) can massively inflate Type I error rates. Normality matters most for small samples (n < 30 per group); with larger samples, the Central Limit Theorem makes parametric tests robust to moderate deviations. Homogeneity of variances (homoscedasticity) is required for the standard t-test and ANOVA; use Levene's test or Bartlett's test to check, and if violated, switch to Welch's t-test or Welch's ANOVA.
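A sketch of how these checks might look in code, using synthetic groups that share a mean but differ sharply in spread and size. The seed, group sizes, and scales are arbitrary assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed
# Hypothetical groups: same mean, very different variances and sizes.
control = rng.normal(loc=100.0, scale=5.0, size=200)
treatment = rng.normal(loc=100.0, scale=25.0, size=50)

# Normality check per group (most relevant for small samples).
_, p_shapiro_control = stats.shapiro(control)
_, p_shapiro_treatment = stats.shapiro(treatment)

# Homogeneity of variances: a small p-value suggests unequal variances.
_, p_levene = stats.levene(control, treatment)

# Pooled (Student) t-test versus Welch's t-test on the same data.
_, p_pooled = stats.ttest_ind(control, treatment, equal_var=True)
_, p_welch = stats.ttest_ind(control, treatment, equal_var=False)

print(f"Shapiro-Wilk p: control={p_shapiro_control:.3f}, treatment={p_shapiro_treatment:.3f}")
print(f"Levene p: {p_levene:.2e}")
print(f"Pooled t-test p: {p_pooled:.4f}   Welch t-test p: {p_welch:.4f}")
```

When Levene's test flags unequal variances, the pooled and Welch p-values can differ noticeably, and the Welch result is the one to trust.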

When assumptions are seriously violated, non-parametric tests are safer. The Mann-Whitney U test (also called Wilcoxon rank-sum test) is the non-parametric alternative to the independent t-test. It tests whether one group tends to have larger values than the other, based on ranks rather than means. The Wilcoxon signed-rank test replaces the paired t-test. For multiple groups, the Kruskal-Wallis test replaces one-way ANOVA, and the Friedman test replaces repeated measures ANOVA. When the parametric assumptions do hold, non-parametric tests typically have somewhat lower statistical power (they need larger samples to detect the same effect), but they do not require normality. They still assume independent observations.

In practice, the decision isn't binary. For moderate sample sizes (n=30-100 per group), parametric tests are robust to non-normality unless there are extreme outliers or heavy tails. Transformations (log, Box-Cox) can often normalize skewed data. Bootstrap or permutation tests offer a middle ground: they avoid parametric distributional assumptions and can be applied to any test statistic. A permutation test for the difference in means randomly shuffles group labels thousands of times to build the null distribution. With random shuffles this gives a Monte Carlo approximation to the exact permutation p-value, and it relies on the observations being exchangeable under H₀ rather than on normality. For production systems, I recommend using permutation tests as a validation step alongside parametric tests to confirm conclusions aren't artifacts of violated assumptions.

io/thecodeforge/nonparametric.pyPYTHON
import numpy as np
from scipy import stats

# Non-normal data (exponential distribution)
np.random.seed(42)
control = np.random.exponential(scale=1.0, size=30)
treatment = np.random.exponential(scale=1.5, size=30)

# Mann-Whitney U test (non-parametric alternative to t-test)
u_stat, p_value = stats.mannwhitneyu(control, treatment, alternative='two-sided')
print(f"Mann-Whitney U: {u_stat}, p-value: {p_value:.4f}")

# Permutation test for difference in means
def permutation_test(a, b, n_permutations=10000):
    observed_diff = np.mean(a) - np.mean(b)
    combined = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        np.random.shuffle(combined)
        perm_a = combined[:len(a)]
        perm_b = combined[len(a):]
        perm_diff = np.mean(perm_a) - np.mean(perm_b)
        if abs(perm_diff) >= abs(observed_diff):
            count += 1
    return count / n_permutations

p_perm = permutation_test(control, treatment)
print(f"Permutation test p-value: {p_perm:.4f}")
Output
Mann-Whitney U: 312.0, p-value: 0.0231
Permutation test p-value: 0.0218
Don't Automatically Default to Non-Parametric
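Non-parametric tests usually have lower power when the parametric assumptions hold, but the gap is small. Under normality, the Mann-Whitney U test has an asymptotic relative efficiency of about 0.955 (3/π) versus the t-test, so it needs roughly 5% more observations to match its power. Use them when assumptions are clearly violated, not as a lazy default.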
Production Insight
In production, run both parametric and non-parametric tests on the same data, and decide in advance which one is the primary analysis so the comparison does not become a way to pick the more favorable p-value. If they disagree, investigate outliers, skewness, or small sample issues. The disagreement itself is a signal that your data violates assumptions in a meaningful way.
Key Takeaway
Check independence, normality, and homogeneity of variances before choosing a test. Non-parametric alternatives (Mann-Whitney, Kruskal-Wallis) are safer but somewhat less powerful when parametric assumptions hold. Permutation tests are a strong option for inference without parametric assumptions, provided observations are exchangeable under H₀.

Power Analysis and Sample Size Determination for Production Experiments

Statistical power is the probability of correctly rejecting a false null hypothesis (1 - β). A well-designed experiment targets power of 0.80 or higher, meaning an 80% chance of detecting a true effect of a given size. Power depends on four factors: effect size (the magnitude of the difference you want to detect), sample size (n), significance level (α, typically 0.05), and variability (σ²). For a two-sample t-test, the required sample size per group is approximately n = 2 (Z_α/2 + Z_β)² σ² / δ², where δ is the minimum detectable effect. Z_α/2 = 1.96 for α=0.05 (two-tailed), Z_β = 0.84 for 80% power.
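The sample size formula translates directly into code. This is a minimal sketch of the normal-approximation formula above; the helper name `n_per_group` and the example values (δ = 7.5, σ = 15, so d = 0.5) are illustrative assumptions.

```python
import math
from scipy import stats

def n_per_group(delta: float, sigma: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Normal-approximation sample size per group for a two-sided, two-sample
    comparison of means: n = 2 * (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = stats.norm.ppf(power)            # about 0.84 for 80% power
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

n_raw = n_per_group(delta=7.5, sigma=15.0)    # standardized effect d = 0.5
n_required = math.ceil(n_raw)
n_double_effect = n_per_group(delta=15.0, sigma=15.0)

print(f"n per group (d = 0.5): {n_raw:.2f} -> {n_required}")
print(f"n per group (d = 1.0): {n_double_effect:.2f}")
```

The normal approximation gives a value one lower than a t-based power calculation for the same inputs, because it ignores the extra uncertainty from estimating σ. For small n, prefer the t-based solver.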

In production experiments (e.g., A/B testing a new recommendation algorithm), you must decide the minimum effect size of business interest (MDE). A common mistake is powering for unrealistically small effects—detecting a 0.1% conversion lift might require millions of users, while a 1% lift is achievable with thousands. Use historical data to estimate variance. For binary outcomes (conversion rate), variance is p(1-p), so the sample size formula becomes n = (Z_α/2 + Z_β)² * (p₁(1-p₁) + p₂(1-p₂)) / (p₂ - p₁)². Always inflate the calculated sample size by 10-20% to account for users who churn, fail to trigger the experiment, or are excluded due to data quality issues.
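For binary outcomes, the formula above can be sketched the same way. The helper name `n_per_group_binary`, the 5% to 6% conversion scenario, and the 15% attrition buffer are assumptions chosen for illustration.

```python
import math
from scipy import stats

def n_per_group_binary(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Sample size per group for comparing two proportions (two-sided):
    n = (z_{alpha/2} + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2

# Hypothetical scenario: baseline 5% conversion, minimum detectable effect +1 point.
n_required = math.ceil(n_per_group_binary(0.05, 0.06))

# Inflate for attrition and exclusions (15% here, within the 10-20% range).
n_with_buffer = math.ceil(n_required * 1.15)

# A 10x smaller absolute lift needs roughly 100x the sample.
n_small_lift = math.ceil(n_per_group_binary(0.05, 0.051))

print(f"5% -> 6%:   {n_required:,} per group ({n_with_buffer:,} with buffer)")
print(f"5% -> 5.1%: {n_small_lift:,} per group")
```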

Power analysis isn't a one-time calculation. In production, you often run sequential tests where you monitor results over time. This invalidates fixed-sample power calculations—peeking at data multiple times inflates the false positive rate. Use sequential testing frameworks (e.g., always-valid p-values, group sequential designs) that preserve power while allowing early stopping. Tools like the statsmodels library in Python provide TTestPower and NormalIndPower for sample size calculations. For more complex designs (e.g., cluster-randomized experiments, multi-armed bandits), use simulation-based power analysis: generate synthetic data under the alternative hypothesis, run your planned test, and compute the proportion of simulations that reject the null. This approach handles any experimental design and data distribution.
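The simulation-based approach described above can be sketched as follows. The design here is deliberately simple (two normal groups, Welch's t-test, d = 0.5, 64 per group) so the result can be compared with the analytical value of roughly 80%; the seed and simulation count are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)            # arbitrary seed
n_sim, n_per_group, alpha = 5_000, 64, 0.05
effect, sigma = 0.5, 1.0                     # true effect under H1, in SD units

# Step 1: generate synthetic data under the alternative hypothesis.
control = rng.normal(0.0, sigma, size=(n_sim, n_per_group))
treatment = rng.normal(effect, sigma, size=(n_sim, n_per_group))

# Step 2: run the planned test on every simulated experiment.
_, p_values = stats.ttest_ind(control, treatment, axis=1, equal_var=False)

# Step 3: power = proportion of simulations that reject H0.
empirical_power = np.mean(p_values < alpha)
print(f"Empirical power at n={n_per_group} per group: {empirical_power:.3f}")
```

For a real pipeline, the data-generating step would be replaced with something that mimics the actual metric (skewed revenue, clustered users, and so on), and the test step with the exact analysis planned for the experiment.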

io/thecodeforge/power_analysis.pyPYTHON
import numpy as np
from statsmodels.stats.power import TTestIndPower, NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Power analysis for two-sample t-test (continuous outcome)
power_analysis = TTestIndPower()
sample_size = power_analysis.solve_power(
    effect_size=0.5,  # Cohen's d = (mean_diff / pooled_std)
    power=0.80,
    alpha=0.05,
    ratio=1.0,  # equal group sizes
    alternative='two-sided'
)
print(f"Required sample size per group (t-test): {np.ceil(sample_size)}")

# Power analysis for two-proportion z-test (binary outcome)
# Minimum detectable effect: 5% vs 6% conversion
p1 = 0.05
p2 = 0.06
effect_size = proportion_effectsize(p1, p2)
z_power = NormalIndPower()
n_binary = z_power.solve_power(
    effect_size=effect_size,
    power=0.80,
    alpha=0.05,
    ratio=1.0,
    alternative='two-sided'
)
print(f"Required sample size per group (z-test): {np.ceil(n_binary)}")
Output
Required sample size per group (t-test): 64.0
Required sample size per group (z-test): 8143.0
Effect Size Matters More Than You Think
Doubling the effect size reduces required sample size by a factor of 4 (since n ∝ 1/δ²). Always align with stakeholders on the minimum effect worth detecting before running the experiment.
Production Insight
Never run an experiment without a pre-registered power analysis. In production, use sequential testing with alpha-spending functions (e.g., O'Brien-Fleming boundaries) to monitor experiments without inflating false positives. Simulate your exact pipeline to validate sample size calculations.
Key Takeaway
Power analysis determines sample size needed to detect a meaningful effect with high probability. Use historical variance estimates, account for multiple testing, and inflate sample sizes for real-world attrition. Sequential testing frameworks allow early stopping without sacrificing validity.

Production Pitfalls: Multiple Comparisons, Peeking, and Effect Size Reporting

Multiple comparisons arise when you test several hypotheses simultaneously—e.g., comparing conversion rates across 10 different landing page variants, or measuring 20 metrics per experiment. Without correction, the family-wise error rate (FWER) balloons: with 10 independent tests at α=0.05, the chance of at least one false positive is 1 - (0.95)¹⁰ ≈ 0.40. The Bonferroni correction (divide α by the number of tests) is simple but overly conservative, especially with correlated metrics. The Holm-Bonferroni method (sequentially reject the smallest p-value if p < α/k, then p < α/(k-1), etc.) is more powerful. For large-scale testing (hundreds of metrics), control the false discovery rate (FDR) using the Benjamini-Hochberg procedure, which limits the expected proportion of false positives among rejected hypotheses.
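The Holm-Bonferroni step-down rule is short enough to write out. This is a hand-rolled sketch for illustration (the function name `holm_reject` and the p-values are made up); in practice a library routine such as statsmodels' `multipletests` is the safer choice.

```python
import numpy as np

def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: compare the smallest p-value with alpha/m,
    the next with alpha/(m-1), and so on; stop at the first failure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                  # indices from smallest to largest p
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                          # all larger p-values are retained too
    return reject

# Hypothetical p-values from five metrics in one experiment.
p_values = [0.001, 0.011, 0.012, 0.040, 0.300]
reject_holm = holm_reject(p_values)
reject_bonferroni = np.asarray(p_values) <= 0.05 / len(p_values)

print("Holm rejects:      ", reject_holm.tolist())
print("Bonferroni rejects:", reject_bonferroni.tolist())
```

On these values Holm rejects three hypotheses where plain Bonferroni rejects one, while both control the family-wise error rate at the same level.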

Peeking (checking results before the planned sample size is reached) is the most common production pitfall. If you peek at p-values every day and stop as soon as p < 0.05, the actual Type I error rate can exceed 20-30%. Each look gives random noise another chance to cross the threshold, so the probability that at least one look shows p < 0.05 is far higher than the 5% that applies to a single, pre-planned analysis. The fix is to use sequential testing methods: group sequential designs (e.g., Pocock, O'Brien-Fleming boundaries) that adjust critical values for interim analyses, or always-valid p-values (e.g., the mixture sequential probability ratio test, mSPRT). In practice, implement a fixed-horizon experiment with a pre-specified end date, or use a vetted sequential testing implementation. Never let stakeholders peek at results without a proper sequential framework.
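A small simulation makes the cost of peeking concrete. The sketch below is a simplification: it uses a one-sample z-test with known σ = 1 and 20 evenly spaced looks, with H₀ true in every simulated experiment. The seed, number of looks, and sample sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)                   # arbitrary seed
n_sim, n_max, look_every = 5_000, 1_000, 50
z_crit = 1.96                                     # two-sided 5% critical value

# H0 is true: observations are N(0, 1), so z = sum / sqrt(n) at any n.
data = rng.normal(0.0, 1.0, size=(n_sim, n_max))
looks = np.arange(look_every, n_max + 1, look_every)   # n = 50, 100, ..., 1000
cumulative_sums = np.cumsum(data, axis=1)[:, looks - 1]
z_at_looks = cumulative_sums / np.sqrt(looks)

# Fixed horizon: test once, at the final sample size only.
rate_fixed_horizon = np.mean(np.abs(z_at_looks[:, -1]) > z_crit)

# Peeking: stop and declare a win the first time any look crosses the threshold.
rate_with_peeking = np.mean((np.abs(z_at_looks) > z_crit).any(axis=1))

print(f"False positive rate, single final analysis: {rate_fixed_horizon:.3f}")
print(f"False positive rate, stop at first p<0.05:  {rate_with_peeking:.3f}")
```

The single final analysis should sit near 5%, while stopping at the first significant look should produce a false positive rate several times higher.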

Effect size reporting is often neglected in favor of p-values, but it's far more informative. A statistically significant result (p < 0.05) with a tiny effect size (e.g., 0.1% conversion lift) may be practically meaningless. Always report Cohen's d for continuous outcomes (d = (mean₁ - mean₂) / pooled_sd) or the absolute/relative lift for binary outcomes. Include confidence intervals around the effect size—they communicate precision and practical significance. For example, "the treatment increased conversion by 2.3% (95% CI: 0.8% to 3.8%)" is far more useful than "p = 0.003". In production dashboards, display effect sizes with CIs prominently, and use p-values only as a secondary filter. Also report the minimum detectable effect from your power analysis to contextualize non-significant results—a non-significant result with a wide CI doesn't prove the null is true; it may just mean the study was underpowered.
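For binary outcomes, the lift and its confidence interval can be reported with a few lines of arithmetic. The counts below are hypothetical, and the interval is a simple Wald (normal-approximation) interval, which is reasonable at these sample sizes but not the only choice.

```python
import math
from scipy import stats

# Hypothetical experiment counts.
conv_control, n_control = 500, 10_000
conv_treatment, n_treatment = 580, 10_000

p_c = conv_control / n_control
p_t = conv_treatment / n_treatment

abs_lift = p_t - p_c                 # absolute lift, in proportion units
rel_lift = abs_lift / p_c            # relative lift versus control

# Wald standard error for the difference of two independent proportions.
se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
z = stats.norm.ppf(0.975)            # 95% two-sided interval
ci_low, ci_high = abs_lift - z * se, abs_lift + z * se

print(f"Absolute lift: {abs_lift:.2%} (95% CI: {ci_low:.2%} to {ci_high:.2%})")
print(f"Relative lift: {rel_lift:.1%}")
```

Reporting both numbers side by side avoids the common confusion between a lift of 0.8 percentage points and a relative lift of 16%.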

io/thecodeforge/production_pitfalls.pyPYTHON
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Simulate 20 metrics, 1 truly different
np.random.seed(42)
p_values = []
for i in range(20):
    control = np.random.normal(100, 15, 100)
    treatment = np.random.normal(100 + (5 if i == 0 else 0), 15, 100)
    _, p = stats.ttest_ind(control, treatment)
    p_values.append(p)

# Bonferroni correction
reject_bonf, p_corrected_bonf, _, _ = multipletests(p_values, method='bonferroni')
# Benjamini-Hochberg FDR control
reject_fdr, p_corrected_fdr, _, _ = multipletests(p_values, method='fdr_bh')

print("Raw p-values (first 5):", [f"{p:.4f}" for p in p_values[:5]])
print("Bonferroni corrected (first 5):", [f"{p:.4f}" for p in p_corrected_bonf[:5]])
print("FDR corrected (first 5):", [f"{p:.4f}" for p in p_corrected_fdr[:5]])
print(f"Significant (Bonferroni): {sum(reject_bonf)}, Significant (FDR): {sum(reject_fdr)}")

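# Effect size reporting (Cohen's d with a pooled standard deviation)
control = np.random.normal(100, 15, 100)
treatment = np.random.normal(105, 15, 100)
n1, n2 = len(control), len(treatment)
pooled_sd = np.sqrt(((n1 - 1) * np.var(control, ddof=1) +
                     (n2 - 1) * np.var(treatment, ddof=1)) / (n1 + n2 - 2))
d = (np.mean(treatment) - np.mean(control)) / pooled_sd
# Large-sample approximation to the standard error of d
se_d = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
ci = stats.t.interval(0.95, df=n1 + n2 - 2, loc=d, scale=se_d)
print(f"Cohen's d: {d:.3f}, 95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
Output (illustrative; exact values depend on the random draw and library versions)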
Raw p-values (first 5): ['0.0012', '0.4521', '0.7834', '0.2345', '0.8912']
Bonferroni corrected (first 5): ['0.0240', '1.0000', '1.0000', '1.0000', '1.0000']
FDR corrected (first 5): ['0.0240', '0.6789', '0.8912', '0.6789', '0.8912']
Significant (Bonferroni): 1, Significant (FDR): 1
Cohen's d: 0.321, 95% CI: [0.045, 0.597]
P-values Are Not Effect Sizes
A p-value tells you how surprising the data would be if there were no effect, not how large the effect is. Always report effect sizes with confidence intervals. Deploying a change that has a significant p-value but a trivial effect size wastes engineering resources.
Production Insight
In production, enforce a no-peeking policy by using automated experiment pipelines that only reveal results at the pre-scheduled end date. If stakeholders demand early looks, implement sequential testing with alpha-spending boundaries. Always log effect sizes and CIs, not just p-values, in your experiment dashboards.
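As a rough illustration of what "alpha spending" means, the sketch below evaluates the two Lan-DeMets spending functions that approximate O'Brien-Fleming and Pocock designs. Each returns the cumulative Type I error budget that may be used by the time a fraction `t` of the planned sample has been collected. The function names and the four-look schedule are assumptions made for this example.

```python
from math import e, log, sqrt
from statistics import NormalDist


def obrien_fleming_spent(t, alpha=0.05):
    """O'Brien-Fleming-type spending: cumulative alpha at information fraction t."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(t)))


def pocock_spent(t, alpha=0.05):
    """Pocock-type spending: cumulative alpha at information fraction t."""
    return alpha * log(1 + (e - 1) * t)


# Four equally spaced looks: 25%, 50%, 75%, and 100% of the planned sample.
for t in (0.25, 0.50, 0.75, 1.00):
    print(
        f"t={t:.2f}  OBF-type: {obrien_fleming_spent(t):.5f}  "
        f"Pocock-type: {pocock_spent(t):.5f}"
    )
```

Both functions spend the full alpha at `t = 1`, but the O'Brien-Fleming-type function spends almost nothing early, which makes early stopping hard and keeps the final test close to a fixed-horizon test. Note that these are budgets, not critical values: turning them into per-look boundaries requires numerical integration over the joint distribution of the interim statistics, which is best left to a dedicated group sequential package.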
Key Takeaway
Correct for multiple comparisons (Bonferroni or FDR). Never peek at results without sequential testing methods. Always report effect sizes with confidence intervals—p-values alone are insufficient for decision-making. Pre-register your analysis plan to avoid p-hacking.
● Production incident · POST-MORTEM · severity: high

The A/B Test That Cost $50K: A P-Hacking Horror Story

Symptom
A/B test showed statistically significant improvement (p=0.03) in user engagement for a new feature, but after full rollout, engagement actually dropped.
Assumption
The team assumed that a p-value below 0.05 guaranteed the feature was an improvement.
Root cause
The data scientist had peeked at the results daily and stopped the test as soon as p<0.05 was reached (optional stopping). This inflated the Type I error rate. Additionally, multiple metrics were tested without correction, and the significant result was cherry-picked.
Fix
Implemented a fixed-horizon test with a pre-registered sample size and a single primary metric. Used sequential testing (e.g., always valid p-values) to allow monitoring without inflating error rates. Added a holdout group for validation.
Key lesson
  • Never peek at results and stop early based on p-values; use sequential testing if early stopping is needed.
  • Pre-register your hypothesis, sample size, and primary metric before running the test.
  • Always correct for multiple comparisons when testing multiple metrics or variants.
Production debug guide · A quick checklist when your test results don't match expectations · 4 entries
Symptom · 01
P-value is extremely small (e.g., < 0.0001) with a tiny sample size
Fix
Check for data leakage, duplicated records, or a bug in the test statistic calculation. Verify the effect size is plausible.
Symptom · 02
P-value is exactly 1.0 or very close to 1
Fix
Check if you accidentally swapped null and alternative hypotheses. Verify the test direction (one-tailed vs two-tailed).
Symptom · 03
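Confidence interval includes zero but p-value is < 0.05
Fix
For a two-sided test and a CI built from the same standard error at the matching level (a 95% CI with alpha = 0.05), this is contradictory. First check whether the CI and the p-value come from different methods, for example a one-tailed p-value or a pooled versus unpooled standard error. Then check for rounding errors or an incorrect formula, and recompute both from scratch.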
Symptom · 04
Results are significant but flip sign when you change the test (e.g., t-test vs Mann-Whitney)
Fix
Check assumptions. The data may violate normality or have outliers. Use a non-parametric test as a robustness check.
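For the CI-versus-p-value symptom in the checklist above, one way to rule out a formula mismatch is to derive both numbers from a single standard error. The sketch below does this for Welch's t-test on synthetic data. The helper name `welch_diff_ci_and_p` is hypothetical, and the only SciPy calls used are `stats.t.sf` and `stats.t.ppf`.

```python
import numpy as np
from scipy import stats


def welch_diff_ci_and_p(a, b, confidence=0.95):
    """Two-sided Welch t-test and matching CI, both built from one standard error."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    var_a = a.var(ddof=1) / a.size
    var_b = b.var(ddof=1) / b.size
    se = np.sqrt(var_a + var_b)
    # Welch-Satterthwaite degrees of freedom
    df = (var_a + var_b) ** 2 / (var_a**2 / (a.size - 1) + var_b**2 / (b.size - 1))
    diff = a.mean() - b.mean()
    p_value = 2 * stats.t.sf(abs(diff / se), df)
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df)
    return diff, (diff - t_crit * se, diff + t_crit * se), p_value


rng = np.random.default_rng(1)
group_a = rng.normal(100, 15, 200)  # synthetic data, for illustration only
group_b = rng.normal(104, 15, 200)
diff, ci, p_value = welch_diff_ci_and_p(group_a, group_b)
ci_excludes_zero = ci[0] > 0 or ci[1] < 0
print(f"diff={diff:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f}), p={p_value:.4f}")
print("CI and p-value agree:", ci_excludes_zero == (p_value < 0.05))
```

When both quantities share the same standard error and degrees of freedom, the 95% CI excludes zero exactly when the two-sided p-value is below 0.05, so any disagreement in a dashboard points to two different computations.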
★ Hypothesis Testing Quick Debug Cheat Sheet · Three common issues and immediate actions to diagnose them
P-value too low for sample size
Immediate action
Check for data duplication or incorrect test type
Commands
df.duplicated().sum()
scipy.stats.ttest_ind(a, b, equal_var=False)
Fix now
Remove duplicates and re-run with Welch's t-test if variances unequal
Confidence interval contradicts p-value
Immediate action
Verify both are computed from same data and formula
Commands
np.mean(a) - np.mean(b)
scipy.stats.t.interval(0.95, df, loc=diff, scale=se)
Fix now
Recompute both using a single library function (e.g., statsmodels)
Significant result not replicable
Immediate action
Check for multiple testing or p-hacking
Commands
len(p_values) # count of tests run
multipletests(p_values, method='bonferroni')
Fix now
Apply Bonferroni correction and re-evaluate significance
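The `multipletests` call hides the arithmetic behind each correction. As a small sketch of what the Bonferroni and Benjamini-Hochberg adjustments compute, the functions below reimplement both with NumPy. The function names and the four raw p-values are made up for illustration; in production the library routine remains the safer choice.

```python
import numpy as np


def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)


def benjamini_hochberg_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: each value is the minimum of itself and all larger ranks
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(monotone, 1.0)
    return adjusted


raw = [0.01, 0.04, 0.03, 0.5]  # hypothetical p-values from four metrics
bonf = bonferroni_adjust(raw)
bh = benjamini_hochberg_adjust(raw)
print("Bonferroni:", np.round(bonf, 4), "reject:", bonf <= 0.05)
print("BH (FDR):  ", np.round(bh, 4), "reject:", bh <= 0.05)
```

Bonferroni controls the chance of any false positive and is therefore the stricter of the two. Benjamini-Hochberg controls the expected share of false positives among the rejections, which usually retains more power when many metrics are tested.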
Common Hypothesis Tests in Data Science
| Test | Data Type | Null Hypothesis Example | When to Use | Assumptions |
| --- | --- | --- | --- | --- |
| t-test (independent) | Continuous (two groups) | Mean of group A = mean of group B | Compare means of two independent samples | Normality, independence, equal variance |
| Paired t-test | Continuous (paired) | Mean difference = 0 | Compare before/after measurements on same subjects | Normality of differences, independence of pairs |
| Chi-square test of independence | Categorical (two variables) | Variables are independent | Check association between two categorical variables | Expected frequency >= 5 per cell, independence |
| ANOVA (one-way) | Continuous (3+ groups) | All group means are equal | Compare means across multiple groups | Normality, independence, equal variance (homoscedasticity) |
| z-test for proportions | Binary (proportion) | Proportion = specified value | Compare a sample proportion to a known population proportion | Large sample (np >= 10, n(1-p) >= 10), independence |
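Each row of the table maps to a short call in SciPy. The sketch below runs all five on synthetic data so the calls can be compared side by side. The data, seed, and variable names are invented for illustration, and the proportion z-test is written out by hand rather than taken from a library.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # synthetic data, for illustration only

# Independent t-test (Welch's variant, which drops the equal-variance assumption)
group_a = rng.normal(100, 15, 80)
group_b = rng.normal(106, 15, 80)
p_ttest = stats.ttest_ind(group_a, group_b, equal_var=False).pvalue

# Paired t-test: before/after measurements on the same subjects
before = rng.normal(50, 10, 40)
after = before + rng.normal(2, 5, 40)
p_paired = stats.ttest_rel(before, after).pvalue

# Chi-square test of independence on a 2x2 contingency table of counts
table = np.array([[120, 380], [150, 350]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# One-way ANOVA across three groups
group_c = rng.normal(103, 15, 80)
p_anova = stats.f_oneway(group_a, group_b, group_c).pvalue

# One-sample z-test for a proportion against a reference value p0
successes, n, p0 = 230, 2000, 0.10
z = (successes / n - p0) / np.sqrt(p0 * (1 - p0) / n)
p_ztest = 2 * stats.norm.sf(abs(z))

results = [
    ("t-test (independent, Welch)", p_ttest),
    ("Paired t-test", p_paired),
    ("Chi-square independence", p_chi2),
    ("ANOVA (one-way)", p_anova),
    ("z-test for proportion", p_ztest),
]
for name, p in results:
    print(f"{name:<28} p = {p:.4f}")
```

The independent t-test here passes `equal_var=False`, so it relaxes the equal-variance assumption listed in the table. Use `equal_var=True` only when that assumption has been checked.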

Key takeaways

1. Hypothesis testing quantifies evidence against a default assumption, not proof of an alternative.
2. P-value is the probability of observing your data (or more extreme) if the null hypothesis is true.
3. Significance level (alpha) is the threshold for rejecting H0; common values are 0.05, 0.01, 0.10.
4. Always check test assumptions (normality, independence, equal variance) before interpreting results.
5. Power analysis is critical for designing experiments with adequate sample sizes to detect meaningful effects.
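To make the power-analysis takeaway concrete, the sketch below estimates the sample size per group for a two-sided comparison of two means. It assumes the standard normal-approximation formula n = 2(z₁₋α/₂ + z_power)² / d², where d is Cohen's d. The function name is hypothetical, and the approximation is slightly optimistic for small samples compared with an exact t-based calculation.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d**2)


# Conventional "small", "medium", and "large" values of Cohen's d
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {sample_size_per_group(d)} per group")
```

Because the required sample grows with 1/d², halving the minimum detectable effect roughly quadruples the number of users needed, which is why the target effect should be fixed before the experiment starts.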

Common mistakes to avoid

4 patterns
×

Misinterpreting p-value as the probability that the null hypothesis is true

Symptom
Claiming 'there's a 95% chance the alternative is true' after p=0.05
Fix
Remember: p-value is P(data | H0), not P(H0 | data). Use Bayesian methods if you need the latter.
×

Failing to correct for multiple comparisons

Symptom
Running 20 t-tests and reporting significant results without adjustment
Fix
Apply Bonferroni correction, FDR (Benjamini-Hochberg), or use omnibus tests like ANOVA first.
×

Ignoring test assumptions

Symptom
Using a t-test on highly skewed data with small sample size
Fix
Check normality (Shapiro-Wilk, Q-Q plot) and consider non-parametric alternatives (Mann-Whitney U).
×

Equating statistical significance with practical significance

Symptom
Reporting a p<0.001 for a tiny effect size that is irrelevant to the business
Fix
Always report effect size (Cohen's d, odds ratio) and confidence intervals alongside p-values.
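The first mistake in the list, confusing P(data | H0) with P(H0 | data), can be illustrated with Bayes' rule. The sketch below treats "the test came out significant at alpha" as the evidence and asks how likely the null still is, given an assumed prior share of experiments with a real effect. The function name and the prior values are hypothetical inputs, not estimates from any real experimentation program.

```python
def prob_null_given_significant(prior_true_effect, power=0.80, alpha=0.05):
    """P(H0 is true | test is significant), computed with Bayes' rule."""
    false_positives = alpha * (1 - prior_true_effect)  # H0 true, test rejects
    true_positives = power * prior_true_effect         # H0 false, test rejects
    return false_positives / (false_positives + true_positives)


for prior in (0.5, 0.1, 0.01):
    risk = prob_null_given_significant(prior)
    print(f"Prior P(real effect) = {prior:.2f} -> P(H0 | significant) = {risk:.2f}")
```

With the same alpha and power, the probability that a significant result is a false alarm rises sharply as real effects become rarer, so it cannot be read off the p-value alone.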
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
Explain the difference between Type I and Type II errors in hypothesis t...
Q02 · SENIOR
You run an A/B test with a p-value of 0.04. Your colleague says 'This me...
Q03 · SENIOR
Describe how you would design a hypothesis test to compare conversion ra...
Q01 of 03 · JUNIOR

Explain the difference between Type I and Type II errors in hypothesis testing. Give a real-world example.

ANSWER
Type I error (false positive): rejecting a true null hypothesis. Example: concluding a drug is effective when it's not. Type II error (false negative): failing to reject a false null hypothesis. Example: concluding a drug is ineffective when it actually works. The significance level (alpha) is the Type I error rate you are willing to accept; beta is the Type II error rate, and power (1 - beta) is the probability of avoiding a Type II error when a real effect exists.
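Both error rates can be estimated by simulation. The sketch below assumes normally distributed data with unit variance and a standard two-sample t-test, and it relies on the `axis` argument of `scipy.stats.ttest_ind` to test many simulated experiments at once. The function name, effect size, and sample size are illustrative choices.

```python
import numpy as np
from scipy import stats


def rejection_rate(true_effect, n_per_group=64, n_sims=5000, alpha=0.05, seed=0):
    """Fraction of simulated two-sample t-tests that reject H0 at level alpha."""
    rng = np.random.default_rng(seed)
    control = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    treatment = rng.normal(true_effect, 1.0, size=(n_sims, n_per_group))
    # One t-test per row (per simulated experiment)
    p_values = stats.ttest_ind(control, treatment, axis=1).pvalue
    return float(np.mean(p_values < alpha))


type_i_rate = rejection_rate(true_effect=0.0)  # H0 true: every rejection is a Type I error
power = rejection_rate(true_effect=0.5)        # H0 false: every rejection is correct
print(f"Estimated Type I error rate: {type_i_rate:.3f}")
print(f"Estimated power: {power:.3f}, estimated Type II error rate: {1 - power:.3f}")
```

With no true effect, the rejection rate should sit close to alpha. With a true effect, the rejection rate is the power, and its complement is the Type II error rate.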
FAQ · 4 QUESTIONS

Frequently Asked Questions

01. What is the difference between a one-tailed and two-tailed test?
02. Can a p-value be greater than 1?
03. What does a p-value of 0.03 mean?
04. When should I use a t-test vs. a z-test?