Psychometrics · 14 min read

What Makes an Assessment Test Valid and Reliable?

Not all assessments are equal. Learn the difference between valid and invalid tests, and why it's crucial for your hiring.

By Ingmar van Maurik · Founder & CEO, Making Moves


Why it matters

An assessment is only valuable if it measures what it claims to measure and gives consistent results. Sounds logical, but the reality is that many companies deploy assessments without knowing whether they actually predict job performance.

The consequence: hiring decisions based on noise. You think you're hiring in a data-driven way, but in reality you're using an instrument that predicts no better than a coin flip — and sometimes even worse, because it creates a false sense of certainty.

In this article, we explain what validity and reliability actually mean, how to measure them, and why generic assessments often fall short. We also show how you can build assessments with your own system that actually predict who will succeed.

Validity: are you measuring what you want to measure?

Validity is the foundation of every assessment. It answers the question: does this test actually predict job performance? There are multiple forms of validity, each with a specific function.

Predictive validity

This is the gold standard in assessment psychometrics. You compare test scores with later real-world performance:

  • Do high-scoring candidates also score high on performance reviews at 6 and 12 months?
  • Are there correlations between specific test components and role success?
  • Do scores predict retention — do high-scoring candidates stay longer?

Predictive validity is expressed as a correlation coefficient (r). In psychometrics, these benchmarks apply:

| Correlation coefficient | Interpretation |
|-------------------------|----------------|
| r < 0.10 | Negligible: the test predicts nothing |
| r = 0.10-0.20 | Weak: limited value |
| r = 0.20-0.30 | Moderate: some predictive value |
| r = 0.30-0.50 | Strong: good predictor |
| r > 0.50 | Very strong: excellent predictor |

The best generic cognitive tests achieve an r of 0.30-0.50. But company-specific assessments can score significantly higher because they're calibrated to what success means in your specific context.
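
As a minimal sketch of the computation, assuming you have matched pairs of assessment scores and later review ratings (all numbers below are invented):

```python
# Minimal sketch: predictive validity as a Pearson correlation.
# Scores and review ratings are illustrative, not real data.
from scipy.stats import pearsonr

assessment_scores = [72, 85, 64, 90, 58, 77, 81, 69]        # score at application
performance_6mo = [3.4, 4.2, 2.9, 4.5, 3.1, 3.8, 4.0, 3.2]  # review after 6 months

r, p = pearsonr(assessment_scores, performance_6mo)
print(f"predictive validity: r = {r:.2f} (p = {p:.3f})")
# Read r against the benchmark table above: r >= 0.30 counts as a strong predictor.
```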

Construct validity

Does the test measure the right construct? This sounds simple but is complex in practice:

  • A test for "analytical ability" should actually measure analytical ability, not reading proficiency or working memory
  • A personality test measuring "leadership" should distinguish from dominance and assertiveness — related but different constructs
  • A test for "cultural fit" should measure what it claims, not simply formalize similarity bias

Construct validity is measured through three checks; a short code sketch of the first two follows the list:

  • Convergent validity — does the test correlate with other validated tests measuring the same construct?
  • Divergent validity — does the test not correlate with tests measuring a different construct?
  • Factor analysis — do test items load on expected factors?
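
To make the first two checks concrete, here is a minimal sketch with invented scores: a new analytical test should track an established analytical test closely, and an unrelated scale hardly at all.

```python
# Sketch: convergent vs. divergent validity. All scores are invented.
import numpy as np

new_analytical = np.array([61, 74, 55, 80, 68, 72, 59, 77])     # your new test
established = np.array([58, 71, 52, 83, 65, 70, 61, 79])        # validated analytical test
unrelated = np.array([3.1, 4.0, 3.8, 2.9, 3.5, 4.2, 3.0, 3.7])  # e.g. an agreeableness scale

convergent_r = np.corrcoef(new_analytical, established)[0, 1]  # should be high
divergent_r = np.corrcoef(new_analytical, unrelated)[0, 1]     # should be near zero
print(f"convergent r = {convergent_r:.2f}, divergent r = {divergent_r:.2f}")
```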

Criterion validity

How well does the test predict a specific criterion? This can be:

  • Productivity — output and quality of work
  • Retention — does the employee stay at least 12-18 months?
  • Customer satisfaction — scores from customers working with the employee
  • Growth velocity — how quickly does the employee develop to the next level?
  • Team effectiveness — does the employee contribute positively to the team?

It's important to recognize that different criteria require different predictors. A test that predicts productivity doesn't automatically predict retention.

Content validity

Does the test cover relevant content for the role? An assessment for a software developer should test for:

  • Problem-solving in a technical context
  • Code review skills
  • Collaboration in development teams
  • Dealing with ambiguity and changing requirements

Not for: general verbal intelligence or abstract pattern recognition that has no relation to daily work activities.

Reliability: is it consistent?

Reliability asks the question: does the test produce comparable results on repeated administration? A test cannot be valid without being reliable — but a reliable test is not automatically valid.

Test-retest reliability

Does the same person score similarly when taking the test at two different times? This is measured with the test-retest correlation:

  • r > 0.80 — good test-retest reliability
  • r = 0.60-0.80 — acceptable for some constructs
  • r < 0.60 — insufficient — the test measures too much noise

Important: some constructs are inherently less stable (e.g., mood vs. personality), which affects expected test-retest reliability.
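
It is computed the same way as any correlation, just between two administrations of the same test. A tiny sketch with invented scores:

```python
# Sketch: test-retest reliability, two administrations a few weeks apart.
from scipy.stats import pearsonr

first_administration = [72, 85, 64, 90, 58, 77]
second_administration = [70, 88, 61, 87, 62, 75]

r, _ = pearsonr(first_administration, second_administration)
print(f"test-retest r = {r:.2f}")  # aim for r > 0.80
```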

Internal consistency

Do all questions within a section measure the same construct? This is measured with Cronbach's alpha:

  • α > 0.80 — good
  • α = 0.70-0.80 — acceptable
  • α < 0.70 — the questions don't consistently measure the same thing

Low internal consistency means some questions measure something different from the rest, making the total score unreliable.
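
Cronbach's alpha can be computed directly from an item-score matrix. A minimal sketch with invented data, where rows are candidates and columns are the questions in one section:

```python
# Sketch: Cronbach's alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)).
import numpy as np

items = np.array([   # illustrative: 5 candidates x 4 items
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
])

k = items.shape[1]
item_variances = items.var(axis=0, ddof=1).sum()
total_variance = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")  # aim for alpha > 0.80
```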

Inter-rater reliability

For assessments requiring human judgment (e.g., simulations, presentations, interviews): do different evaluators reach the same conclusion? This is critical for:

  • Assessment centers
  • Structured interviews
  • Work samples
  • Video assessments with human scoring

The solution for low inter-rater reliability: structured scoring rubrics and evaluator training. Or better yet: deploy AI scoring where possible, which is inherently consistent.
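
One way to quantify agreement is Cohen's kappa (here quadratically weighted, via scikit-learn); the rubric scores below are invented:

```python
# Sketch: inter-rater reliability via weighted Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

rater_a = [4, 3, 5, 2, 4, 3, 5, 1]  # rubric scores from evaluator A
rater_b = [4, 3, 4, 2, 5, 3, 5, 2]  # same candidates, evaluator B

kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")  # near 0 means chance-level agreement
```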

Why generic tests often fall short

Most commercial assessments — from providers like SHL, Harver, TestGorilla, and Saville — are validated on generic populations. This means:

The norm group problem

Scores are compared with thousands of random people from diverse industries and roles. But:

  • What predicts success at a bank is fundamentally different from a tech startup
  • A norm group of 10,000 random professionals is not relevant for your specific senior developer role
  • The cultural context of a Dutch company differs from an American norm group

The static model problem

Generic tests are updated every 5-10 years. Your company changes continuously:

  • New technologies, processes, and culture
  • Changing team dynamics and leadership styles
  • Evolution of what "success" means in a role

A test validated in 2020 may no longer measure what's relevant in 2026.

The one-size-fits-all problem

The same personality test is used for developers, sales managers, finance analysts, and customer service representatives. But the competencies that predict success are fundamentally different per role.

Read more in our article on why generic assessments don't work.

The solution: company-specific validation

With your own assessment system, you can address the shortcomings of generic tests:

Building your own norm groups

Instead of comparing scores to a generic population, you build norm groups per role and department, as sketched in code after the list below:

  • Your senior developers are compared with your senior developers, not the market
  • Your sales team's scores are benchmarked against your top performers, not a generic sales norm group
  • New hires are compared with employees already succeeding in the same role
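
A minimal sketch of role-specific norming, assuming you keep past assessment scores per role (all numbers invented):

```python
# Sketch: scoring a candidate against your own role norm group.
import numpy as np
from scipy import stats

senior_dev_norm_group = np.array([68, 74, 71, 80, 77, 69, 83, 72, 75, 78])
candidate_score = 79

z = (candidate_score - senior_dev_norm_group.mean()) / senior_dev_norm_group.std(ddof=1)
pct = stats.percentileofscore(senior_dev_norm_group, candidate_score)
print(f"z = {z:.2f}, percentile within your senior developers = {pct:.0f}")
```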

Calculating predictive validity with your own data

This is the ultimate test: do your assessments actually predict success? With your own data, you can (see the code sketch after this list):

  • Correlate assessment scores with performance reviews (6, 12, 18 months)
  • Identify which test components are most predictive for which roles
  • Adjust weights so the most predictive components carry more importance
  • Set up a [continuous validation cycle](/artikelen/continuous-validation-hiring) that makes the model increasingly accurate
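
A sketch of the first three points, assuming a table with component scores and a 12-month review rating per hire; column names and numbers are illustrative:

```python
# Sketch: per-component predictive validity, then validity-based weights.
import pandas as pd

hires = pd.DataFrame({
    "problem_solving": [72, 85, 64, 90, 58, 77],
    "collaboration":   [60, 70, 75, 65, 55, 80],
    "code_review":     [65, 88, 60, 85, 62, 74],
    "review_12mo":     [3.4, 4.3, 2.8, 4.4, 3.0, 3.9],  # performance rating
})

component_r = hires.drop(columns="review_12mo").corrwith(hires["review_12mo"])
weights = component_r.clip(lower=0) / component_r.clip(lower=0).sum()  # more predictive = heavier
print(component_r.round(2))
print("suggested weights:", weights.round(2).to_dict())
```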

Continuous calibration after every hire

After every hire, the model is validated:

1. Candidate scores on the assessment
2. Candidate is hired (or rejected)
3. After 6 months: performance review
4. Calculate correlation: was the prediction correct?
5. Adjust the model based on results

This means your assessment system gets smarter over time — an advantage generic tests cannot provide by definition.
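
A minimal sketch of step 4 on your own data, with invented numbers and plain lists standing in for your data store (steps 1-3 produce the records; step 5 would feed the result back into component weights, as in the weighting sketch above):

```python
# Sketch: re-checking predictive validity after each cohort of hires.
from scipy.stats import pearsonr

# steps 1-3: each hire's assessment score plus the 6-month review, once available
hires = [
    {"assessment": 82, "review_6mo": 4.1},
    {"assessment": 65, "review_6mo": 3.0},
    {"assessment": 90, "review_6mo": 4.4},
    {"assessment": 71, "review_6mo": None},  # not yet 6 months in the role
]

pairs = [(h["assessment"], h["review_6mo"]) for h in hires if h["review_6mo"] is not None]
scores, reviews = zip(*pairs)

# step 4: was the prediction correct? recompute r on the growing dataset
r, _ = pearsonr(scores, reviews)
print(f"predictive validity on own hires so far: r = {r:.2f}")
```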

Bias analyses on your own population

With your own data, you can actively monitor the following (one common check is sketched after the list):

  • Are certain groups systematically scored higher or lower?
  • Are there components with adverse impact that need adjustment?
  • Is the test equally predictive for all subgroups?
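
One common check is the four-fifths rule: if any group's pass rate falls below 80% of the highest group's pass rate, the component deserves scrutiny. A sketch with invented pass rates:

```python
# Sketch: adverse impact check via the four-fifths (80%) rule.
pass_rates = {"group_a": 0.62, "group_b": 0.44}  # share of each group passing the test

reference = max(pass_rates.values())
for group, rate in pass_rates.items():
    ratio = rate / reference
    status = "review for adverse impact" if ratio < 0.80 else "ok"
    print(f"{group}: impact ratio = {ratio:.2f} -> {status}")
```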

The difference in practice

| Aspect | Generic assessment | Company-specific assessment |
|--------|--------------------|-----------------------------|
| Norm group | 10,000+ random | Your employees per role |
| Predictive validity | r = 0.20-0.40 | r = 0.40-0.60+ |
| Updates | Every 5-10 years | Continuous |
| Bias monitoring | Vendor (generic) | You (specific to your population) |
| Cost per candidate | €50-€200 | Included in system |
| Data ownership | Vendor | You |

Key takeaways

An assessment without validation is an expensive gamble. It has the appearance of objectivity, but in reality you're basing decisions on unproven assumptions. A validated custom assessment, on the other hand, is a strategic weapon in your hiring.

The core points:

  • Validity is about whether you measure what you want to measure — and whether it predicts job performance
  • Reliability is about consistency — do you get the same results on repeated administration?
  • Generic tests fall short due to generic norm groups, static models, and one-size-fits-all approach
  • Company-specific validation solves these problems with your own norm groups, continuous calibration, and predictive validity on your data

Want to know how valid your current assessments are? Or want a system that continuously learns and improves? Get in touch or see how our AI hiring system builds assessment validation into the process.


Book an intake call · View our AI Hiring System