Dataset Size Calculator
How many samples does your machine learning model need?
Find out how many data samples you need to train a reliable machine learning model. Enter your desired precision level, confidence level, and estimated population size to see the minimum dataset size, margin of error, and sample adequacy. Assumes random sampling and normally distributed errors.
How It Works
The formula, explained simply
Your model's accuracy often depends more on data quality than on quantity. A medical imaging AI trained on 5,000 expertly labeled scans can outperform one trained on 50,000 crowd-sourced images with inconsistent annotations. The statistical formulas behind dataset sizing assume your data represents the real world - but many ML failures come from biased sampling, not insufficient volume.
This calculator uses Cochran's sample size formula to find the smallest sample that satisfies your confidence level and margin of error. It applies a finite population correction when your total possible data points are limited, then adjusts upward based on expected data quality losses. The confidence level determines your z-score (1.96 for 95% confidence), while the margin of error sets how far an estimate measured on the sample may deviate from the true population value.
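As a rough illustration of how a confidence level maps to a z-score, the sketch below uses Python's standard-library statistics.NormalDist. The function name z_score is a hypothetical helper chosen for this example, not the calculator's actual code, and it assumes a two-sided interval.

```python
from statistics import NormalDist


def z_score(confidence: float) -> float:
    """Two-sided z-score for a confidence level given as a fraction (e.g. 0.95)."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Two-sided interval: split the leftover probability across both tails.
    return NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> z = {z_score(level):.3f}")
```

Higher confidence levels push the z-score up, which in turn raises the required sample size quadratically.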
The tool assumes random sampling and normal distribution of errors - assumptions that rarely hold in real ML projects. Your actual dataset needs depend on class imbalance, feature complexity, and model architecture. Start with the calculated baseline, then monitor validation accuracy to determine if you need more data or better data cleaning.
When To Use This
Right tool, right situation
Use this calculator during project planning to estimate data collection costs and timelines. If your calculated requirement is 100,000 samples but your budget only covers 5,000, either relax your precision requirements or explore transfer learning approaches. Many teams skip this step and discover mid-project they need 10x more data than expected.
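When the budget fixes the sample count, Cochran's formula can be rearranged to show what margin of error that budget buys: E = Z × sqrt(p×(1-p) / n). The sketch below assumes 95% confidence (Z = 1.96), p = 0.5, and no finite population correction; achievable_margin is a hypothetical helper name.

```python
import math


def achievable_margin(n: int, z: float = 1.96, p: float = 0.5) -> float:
    """Margin of error (as a fraction) reachable with n random samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Cochran's formula n = z^2 * p * (1 - p) / E^2, solved for E.
    return z * math.sqrt(p * (1.0 - p) / n)


for budget in (5_000, 100_000):
    margin = achievable_margin(budget)
    print(f"{budget:,} samples -> margin of error about {margin:.2%}")
```

Because the margin shrinks only with the square root of n, a 20x larger dataset tightens the margin by a factor of roughly 4.5, which is worth weighing against the extra collection cost.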
This tool is most valuable for supervised learning problems with balanced classes and measurable accuracy metrics. It's less applicable to unsupervised learning, reinforcement learning, or problems where accuracy isn't the primary metric. For recommendation systems or generative models, focus on diversity and coverage rather than statistical sample size.
Run these calculations early in your ML pipeline design, not after you've already collected data. Use the results to justify data collection budgets, plan annotation workflows, and set realistic accuracy expectations with stakeholders who might expect perfect models from minimal data.
Common Mistakes
Why results sometimes look wrong
The biggest mistake is treating the calculated number as gospel. These formulas assume perfect random sampling, but ML datasets are rarely random - they're scraped from biased sources, collected under specific conditions, or labeled by particular annotators. A model trained on 10,000 images all taken with iPhone cameras may not generalize to Android photos, regardless of statistical significance.
Many teams collect the minimum calculated dataset, then wonder why their model fails in production. The formula gives you statistical confidence in an estimate measured on a random sample, not real-world robustness. Plan to collect 2-3x your calculated baseline, then use techniques like stratified sampling and cross-validation to check that your data covers the edge cases your production model will encounter.
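One simple way to combine stratified sampling with cross-validation is to deal each class's examples round-robin into k folds, so every fold keeps roughly the original class mix. The sketch below is a minimal pure-Python version under that assumption; stratified_folds and the toy labels are hypothetical, and a real project would more likely rely on an established library.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's samples round-robin so each fold gets a fair share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Toy data: 95 common examples and 5 rare ones (illustrative only).
labels = ["normal"] * 95 + ["fraud"] * 5
folds = stratified_folds(labels, k=5)
for i, fold in enumerate(folds):
    rare = sum(labels[j] == "fraud" for j in fold)
    print(f"fold {i}: {len(fold)} samples, {rare} fraud")
```

A purely random split of the same data could easily leave a fold with no rare examples at all, which would make that fold's validation score uninformative for the rare class.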
Another common error is ignoring class imbalance. If you need 1,000 samples but have 10 classes, you need roughly 100 samples per class - but some classes might be naturally rare in your population. Fraud detection models might need 50,000 normal transactions but only 500 fraud cases. Adjust your collection strategy to ensure adequate representation of minority classes that matter most.
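To see why rare classes dominate collection cost, the sketch below estimates how many random draws are needed, on average, to observe a minimum number of examples of each class. The class shares are hypothetical values chosen to resemble the fraud example, and expected_draws_for_class is an illustrative helper; the result is an expectation, so an actual random sample may contain fewer rare examples.

```python
import math


def expected_draws_for_class(min_count: int, prevalence: float) -> int:
    """Random draws needed, on average, to see min_count examples of a class."""
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in (0, 1]")
    return math.ceil(min_count / prevalence)


# Hypothetical class shares in the population (assumed, not measured).
prevalence = {"normal": 0.99, "fraud": 0.01}
min_per_class = 500

draws = {
    label: expected_draws_for_class(min_per_class, share)
    for label, share in prevalence.items()
}
print(draws)
print("random draws to plan for:", max(draws.values()))
```

The rarest class sets the total: at a 1% share, about 50,000 random draws are needed to expect 500 fraud cases, which is why targeted collection or oversampling of minority classes is often cheaper than sampling at random.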
The Math
Worked examples and deeper derivation
The core formula is Cochran's sample size equation: n = (Z²×p×(1-p)) / E², where Z is the z-score for your confidence level, p is the expected proportion (typically 0.5 for maximum variance), and E is your margin of error as a decimal. For 95% confidence with 5% margin of error, this gives n = (1.96²×0.5×0.5) / 0.05² = 0.9604 / 0.0025 ≈ 384.16. The figure is usually quoted as 384 samples; rounding up to 385 guarantees the stated margin is met.
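The worked example can be reproduced with a few lines of Python. This is a minimal sketch of the equation as stated in the text; cochran_sample_size is a hypothetical function name, and the inputs are the 95% confidence, 5% margin values used in the example.

```python
import math


def cochran_sample_size(z: float, p: float, margin: float) -> float:
    """Unrounded Cochran sample size: n = z^2 * p * (1 - p) / margin^2."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if margin <= 0.0:
        raise ValueError("margin must be positive")
    return (z ** 2) * p * (1.0 - p) / margin ** 2


n0 = cochran_sample_size(z=1.96, p=0.5, margin=0.05)
print(round(n0, 2))   # 384.16
print(math.ceil(n0))  # 385 when rounded up to a whole sample
```

Halving the margin of error to 2.5% quadruples the requirement, since the margin appears squared in the denominator.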
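When your population is finite, apply the finite population correction: n_adjusted = n / (1 + ((n-1) / N)), where n is the uncorrected sample size from the formula above and N is your total population size. The correction keeps the requirement from exceeding what a small population can justify. For a population of 1,000 with the above parameters, n_adjusted = 384.16 / (1 + 383.16 / 1000) ≈ 277.7, which rounds up to 278 rather than the roughly 384 needed for an unbounded population.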
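The correction is a one-line computation. The sketch below assumes the unrounded Cochran result for 95% confidence and a 5% margin (384.16) as its starting point; finite_population_correction is a hypothetical helper name.

```python
import math


def finite_population_correction(n0: float, population: int) -> float:
    """Adjust an uncorrected sample size n0 for a finite population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return n0 / (1.0 + (n0 - 1.0) / population)


n0 = 384.16  # uncorrected Cochran result for 95% confidence, 5% margin
n_adj = finite_population_correction(n0, population=1_000)
print(round(n_adj, 2))   # 277.74
print(math.ceil(n_adj))  # 278
```

For very large populations the correction barely changes the result, so it matters mainly when the population is within a couple of orders of magnitude of the uncorrected sample size.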
Finally, account for data quality by dividing by your expected usable data rate. If 20% of collected samples will be unusable, divide your statistical requirement by 0.8. Real ML projects often see 30-50% data loss from duplicates, corrupted files, labeling errors, and outliers that must be removed during preprocessing.
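The quality adjustment can be sketched the same way. The starting requirement of 385 samples is an assumed input for illustration, and samples_to_collect is a hypothetical helper name; the usable rates correspond to 20% and 50% data loss.

```python
import math


def samples_to_collect(required: int, usable_rate: float) -> int:
    """Raw samples to collect so that `required` remain after cleaning."""
    if not 0.0 < usable_rate <= 1.0:
        raise ValueError("usable_rate must be in (0, 1]")
    # Round up: a fractional sample cannot be collected.
    return math.ceil(required / usable_rate)


print(samples_to_collect(385, usable_rate=0.8))  # 482
print(samples_to_collect(385, usable_rate=0.5))  # 770
```

If the usable rate is uncertain, planning around the pessimistic end of the range avoids a second collection round later.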
Expert Unlock
The thing most explanations skip
Statistical sample size formulas break down in high-dimensional spaces due to the curse of dimensionality. With 1,000 features, your 'statistically adequate' dataset becomes sparse across the feature space, leading to poor generalization despite meeting classical significance thresholds. Practitioners use intrinsic dimensionality estimates and manifold learning to determine actual sample requirements for complex feature spaces.
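A crude way to picture the sparsity is to split every feature into a fixed number of bins and count the resulting cells. The sketch below assumes just two bins per feature and a dataset of 385 samples; it is only an illustration of exponential growth, not an intrinsic dimensionality estimate, and cell_count is a hypothetical helper.

```python
def cell_count(bins_per_feature: int, n_features: int) -> int:
    """Number of grid cells when each feature is split into equal bins."""
    return bins_per_feature ** n_features


n_samples = 385  # assumed dataset size for illustration
for n_features in (1, 2, 5, 10, 20):
    cells = cell_count(2, n_features)
    density = n_samples / cells
    print(f"{n_features:>2} features: {cells:>9,} cells, "
          f"{density:.4f} samples per cell")
```

Even at 20 features the grid has over a million cells, so most of them are empty for any dataset of a few hundred samples. Real data usually lies on a much lower-dimensional structure, which is why intrinsic dimensionality is the more useful quantity.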