Psychometrics · 14 min read

What Makes an Assessment Test Valid and Reliable?

Not all assessments are equal. Learn the difference between valid and invalid tests, and why it's crucial for your hiring.

By Ingmar van Maurik · Founder & CEO, Making Moves


Why it matters

An assessment is only valuable if it measures what it claims to measure and gives consistent results. Sounds logical, but the reality is that many companies deploy assessments without knowing whether they actually predict job performance.

The consequence: hiring decisions based on noise. You think you're hiring in a data-driven way, but in reality you're using an instrument that predicts no better than a coin flip — and sometimes even worse, because it creates a false sense of certainty.

In this article, we explain what validity and reliability actually mean, how to measure them, and why generic assessments often fall short. We also show how you can build assessments with your own system that actually predict who will succeed.

Validity: are you measuring what you want to measure?

Validity is the foundation of every assessment. It answers the question: does this test actually predict job performance? There are multiple forms of validity, each with a specific function.

Predictive validity

This is the gold standard in assessment psychometrics. You compare test scores with later real-world performance:

  • Do high-scoring candidates also score high on performance reviews at 6 and 12 months?
  • Are there correlations between specific test components and role success?
  • Do scores predict retention — do high-scoring candidates stay longer?

Predictive validity is expressed as a correlation coefficient (r). In psychometrics, these benchmarks apply:

| Correlation coefficient | Interpretation |
|-------------------------|----------------|
| r < 0.10 | Negligible: the test predicts nothing |
| r = 0.10-0.20 | Weak: limited value |
| r = 0.20-0.30 | Moderate: some predictive value |
| r = 0.30-0.50 | Strong: good predictor |
| r > 0.50 | Very strong: excellent predictor |

The best generic cognitive tests achieve an r of 0.30-0.50. But company-specific assessments can score significantly higher because they're calibrated to what success means in your specific context.
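
As a minimal sketch of the computation, assuming you have matched pairs of assessment scores and later review ratings (all numbers below are invented):

```python
# Minimal sketch: predictive validity as a Pearson correlation.
# Scores and review ratings are illustrative, not real data.
from scipy.stats import pearsonr

assessment_scores = [72, 85, 64, 90, 58, 77, 81, 69]        # score at application
performance_6mo = [3.4, 4.2, 2.9, 4.5, 3.1, 3.8, 4.0, 3.2]  # review after 6 months

r, p = pearsonr(assessment_scores, performance_6mo)
print(f"predictive validity: r = {r:.2f} (p = {p:.3f})")
# Read r against the benchmark table above: r >= 0.30 counts as a strong predictor.
```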

Construct validity

Does the test measure the right construct? This sounds simple but is complex in practice:

  • A test for "analytical ability" should actually measure analytical ability, not reading proficiency or working memory
  • A personality test measuring "leadership" should distinguish from dominance and assertiveness — related but different constructs
  • A test for "cultural fit" should measure what it claims, not simply formalize similarity bias

Construct validity is measured through three checks; a short code sketch of the first two follows the list:

  • Convergent validity — does the test correlate with other validated tests measuring the same construct?
  • Divergent validity — does the test not correlate with tests measuring a different construct?
  • Factor analysis — do test items load on expected factors?
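
To make the first two checks concrete, here is a minimal sketch with invented scores: a new analytical test should track an established analytical test closely, and an unrelated scale hardly at all.

```python
# Sketch: convergent vs. divergent validity. All scores are invented.
import numpy as np

new_analytical = np.array([61, 74, 55, 80, 68, 72, 59, 77])     # your new test
established = np.array([58, 71, 52, 83, 65, 70, 61, 79])        # validated analytical test
unrelated = np.array([3.1, 4.0, 3.8, 2.9, 3.5, 4.2, 3.0, 3.7])  # e.g. an agreeableness scale

convergent_r = np.corrcoef(new_analytical, established)[0, 1]  # should be high
divergent_r = np.corrcoef(new_analytical, unrelated)[0, 1]     # should be near zero
print(f"convergent r = {convergent_r:.2f}, divergent r = {divergent_r:.2f}")
```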

Criterion validity

How well does the test predict a specific criterion? This can be:

  • Productivity — output and quality of work
  • Retention — does the employee stay at least 12-18 months?
  • Customer satisfaction — scores from customers working with the employee
  • Growth velocity — how quickly does the employee develop to the next level?
  • Team effectiveness — does the employee contribute positively to the team?

It's important to recognize that different criteria require different predictors. A test that predicts productivity doesn't automatically predict retention.

Content validity

Does the test cover relevant content for the role? An assessment for a software developer should test for:

  • Problem-solving in a technical context
  • Code review skills
  • Collaboration in development teams
  • Dealing with ambiguity and changing requirements

Not for: general verbal intelligence or abstract pattern recognition that has no relation to daily work activities.

Reliability: is it consistent?

Reliability asks the question: does the test produce comparable results on repeated administration? A test cannot be valid without being reliable — but a reliable test is not automatically valid.

Test-retest reliability

Does the same person score similarly when taking the test at two different times? This is measured with the test-retest correlation:

  • r > 0.80 — good test-retest reliability
  • r = 0.60-0.80 — acceptable for some constructs
  • r < 0.60 — insufficient — the test measures too much noise

Important: some constructs are inherently less stable (e.g., mood vs. personality), which affects expected test-retest reliability.
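
It is computed the same way as any correlation, just between two administrations of the same test. A tiny sketch with invented scores:

```python
# Sketch: test-retest reliability, two administrations a few weeks apart.
from scipy.stats import pearsonr

first_administration = [72, 85, 64, 90, 58, 77]
second_administration = [70, 88, 61, 87, 62, 75]

r, _ = pearsonr(first_administration, second_administration)
print(f"test-retest r = {r:.2f}")  # aim for r > 0.80
```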

Internal consistency

Do all questions within a section measure the same construct? This is measured with Cronbach's alpha:

  • α > 0.80 — good
  • α = 0.70-0.80 — acceptable
  • α < 0.70 — the questions don't consistently measure the same thing

Low internal consistency means some questions measure something different from the rest, making the total score unreliable.
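
Cronbach's alpha can be computed directly from an item-score matrix. A minimal sketch with invented data, where rows are candidates and columns are the questions in one section:

```python
# Sketch: Cronbach's alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)).
import numpy as np

items = np.array([   # illustrative: 5 candidates x 4 items
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
])

k = items.shape[1]
item_variances = items.var(axis=0, ddof=1).sum()
total_variance = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")  # aim for alpha > 0.80
```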

Inter-rater reliability

For assessments requiring human judgment (e.g., simulations, presentations, interviews): do different evaluators reach the same conclusion? This is critical for:

  • Assessment centers
  • Structured interviews
  • Work samples
  • Video assessments with human scoring

The solution for low inter-rater reliability: structured scoring rubrics and evaluator training. Or better yet: deploy AI scoring where possible, which is inherently consistent.
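
One way to quantify agreement is Cohen's kappa (here quadratically weighted, via scikit-learn); the rubric scores below are invented:

```python
# Sketch: inter-rater reliability via weighted Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

rater_a = [4, 3, 5, 2, 4, 3, 5, 1]  # rubric scores from evaluator A
rater_b = [4, 3, 4, 2, 5, 3, 5, 2]  # same candidates, evaluator B

kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")  # near 0 means chance-level agreement
```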

Why generic tests often fall short

Most commercial assessments — from providers like SHL, Harver, TestGorilla, and Saville — are validated on generic populations. This means:

The norm group problem

Scores are compared with thousands of random people from diverse industries and roles. But:

  • What predicts success at a bank is fundamentally different from a tech startup
  • A norm group of 10,000 random professionals is not relevant for your specific senior developer role
  • The cultural context of a Dutch company differs from an American norm group

The static model problem

Generic tests are updated every 5-10 years. Your company changes continuously:

  • New technologies, processes, and culture
  • Changing team dynamics and leadership styles
  • Evolution of what "success" means in a role

A test validated in 2020 may no longer measure what's relevant in 2026.

The one-size-fits-all problem

The same personality test is used for developers, sales managers, finance analysts, and customer service representatives. But the competencies that predict success are fundamentally different per role.

Read more in our article on why generic assessments don't work.

The solution: company-specific validation

With your own assessment system, you can address the shortcomings of generic tests:

Building your own norm groups

Instead of comparing scores to a generic population, you build norm groups per role and department, as sketched in code after the list below:

  • Your senior developers are compared with your senior developers, not the market
  • Your sales team's scores are benchmarked against your top performers, not a generic sales norm group
  • New hires are compared with employees already succeeding in the same role
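
A minimal sketch of role-specific norming, assuming you keep past assessment scores per role (all numbers invented):

```python
# Sketch: scoring a candidate against your own role norm group.
import numpy as np
from scipy import stats

senior_dev_norm_group = np.array([68, 74, 71, 80, 77, 69, 83, 72, 75, 78])
candidate_score = 79

z = (candidate_score - senior_dev_norm_group.mean()) / senior_dev_norm_group.std(ddof=1)
pct = stats.percentileofscore(senior_dev_norm_group, candidate_score)
print(f"z = {z:.2f}, percentile within your senior developers = {pct:.0f}")
```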

Calculating predictive validity with your own data

This is the ultimate test: do your assessments actually predict success? With your own data, you can (see the code sketch after this list):

  • Correlate assessment scores with performance reviews (6, 12, 18 months)
  • Identify which test components are most predictive for which roles
  • Adjust weights so the most predictive components carry more importance
  • Set up a [continuous validation cycle](/artikelen/continuous-validation-hiring) that makes the model increasingly accurate
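
A sketch of the first three points, assuming a table with component scores and a 12-month review rating per hire; column names and numbers are illustrative:

```python
# Sketch: per-component predictive validity, then validity-based weights.
import pandas as pd

hires = pd.DataFrame({
    "problem_solving": [72, 85, 64, 90, 58, 77],
    "collaboration":   [60, 70, 75, 65, 55, 80],
    "code_review":     [65, 88, 60, 85, 62, 74],
    "review_12mo":     [3.4, 4.3, 2.8, 4.4, 3.0, 3.9],  # performance rating
})

component_r = hires.drop(columns="review_12mo").corrwith(hires["review_12mo"])
weights = component_r.clip(lower=0) / component_r.clip(lower=0).sum()  # more predictive = heavier
print(component_r.round(2))
print("suggested weights:", weights.round(2).to_dict())
```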

Continuous calibration after every hire

After every hire, the model is validated:

1. Candidate scores on the assessment
2. Candidate is hired (or rejected)
3. After 6 months: performance review
4. Calculate correlation: was the prediction correct?
5. Adjust the model based on results

This means your assessment system gets smarter over time — an advantage generic tests cannot provide by definition.
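
A minimal sketch of step 4 on your own data, with invented numbers and plain lists standing in for your data store (steps 1-3 produce the records; step 5 would feed the result back into component weights, as in the weighting sketch above):

```python
# Sketch: re-checking predictive validity after each cohort of hires.
from scipy.stats import pearsonr

# steps 1-3: each hire's assessment score plus the 6-month review, once available
hires = [
    {"assessment": 82, "review_6mo": 4.1},
    {"assessment": 65, "review_6mo": 3.0},
    {"assessment": 90, "review_6mo": 4.4},
    {"assessment": 71, "review_6mo": None},  # not yet 6 months in the role
]

pairs = [(h["assessment"], h["review_6mo"]) for h in hires if h["review_6mo"] is not None]
scores, reviews = zip(*pairs)

# step 4: was the prediction correct? recompute r on the growing dataset
r, _ = pearsonr(scores, reviews)
print(f"predictive validity on own hires so far: r = {r:.2f}")
```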

Bias analyses on your own population

With your own data, you can actively monitor the following (one common check is sketched after the list):

  • Are certain groups systematically scored higher or lower?
  • Are there components with adverse impact that need adjustment?
  • Is the test equally predictive for all subgroups?
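
One common check is the four-fifths rule: if any group's pass rate falls below 80% of the highest group's pass rate, the component deserves scrutiny. A sketch with invented pass rates:

```python
# Sketch: adverse impact check via the four-fifths (80%) rule.
pass_rates = {"group_a": 0.62, "group_b": 0.44}  # share of each group passing the test

reference = max(pass_rates.values())
for group, rate in pass_rates.items():
    ratio = rate / reference
    status = "review for adverse impact" if ratio < 0.80 else "ok"
    print(f"{group}: impact ratio = {ratio:.2f} -> {status}")
```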

The difference in practice

| Aspect | Generic assessment | Company-specific assessment |
|--------|--------------------|-----------------------------|
| Norm group | 10,000+ random | Your employees per role |
| Predictive validity | r = 0.20-0.40 | r = 0.40-0.60+ |
| Updates | Every 5-10 years | Continuous |
| Bias monitoring | Vendor (generic) | You (specific to your population) |
| Cost per candidate | €50-€200 | Included in system |
| Data ownership | Vendor | You |

Key takeaways

An assessment without validation is an expensive gamble. It has the appearance of objectivity, but in reality you're basing decisions on unproven assumptions. A validated custom assessment, on the other hand, is a strategic weapon in your hiring.

The core points:

  • Validity is about whether you measure what you want to measure — and whether it predicts job performance
  • Reliability is about consistency — do you get the same results on repeated administration?
  • Generic tests fall short due to generic norm groups, static models, and one-size-fits-all approach
  • Company-specific validation solves these problems with your own norm groups, continuous calibration, and predictive validity on your data

Want to know how valid your current assessments are? Or want a system that continuously learns and improves? Get in touch or see how our AI hiring system builds assessment validation into the process.


Book an intake call · View our AI Hiring System