Measurement as the ‘scientific’ basis of hiring is generally an unquestioned expectation when assessment tools are applied as part of the recruiting process. The assumption is that the tools in use have been validated and are generally accepted in practice as statistically sound. As a consequence, most of the ongoing effort of recruiters has focused on the influence and impact of assessment results on the current stock of strategic directions: ‘recruiting the best’, ‘building world-class talent’, ‘driving diversity through creative recruiting strategies’, ‘finding more innovative ways to hire the best’, ‘building pipelines for leadership talent’, and so on. In much the same way, the more transactional elements, such as preferred methods of delivery and the reporting efficacy of assessment tools (the online versus paper-and-pencil debate, etc.), have become key areas of focus. In many ways the perception is that the recruiting community has moved on to ‘bigger and better’ things, as though the assessment part of the process has been neatly packaged and dealt with by virtue of its historical contribution. In other words, it has become a ‘given’, with the focus shifting to ‘what is to be measured’. The question remains, however: just how convinced are we that the assessments we are using in fact provide the solid foundation we assume they do?
While these aspirational strategies provide recruiters with great material for planning sessions and personal performance development plans, one cannot help but reflect on where it all points: what is fundamentally driving all the fanciful models and, more importantly, what is everyone ultimately attempting to achieve? Once distilled, it appears relatively straightforward: measuring for the most suitable person for the job. Nothing more, and nothing less. It is the one ‘objective’ opportunity every organisation has to get it right. Everything else is a consequence of this choice: coaching, communication, creativity and innovation, diversity, empowerment, initiative and risk-taking, mentoring, personal integrity, planning and organising, problem solving and decision making, quality of results, teamwork, technical competency, vision, and so the list goes on. All said and done, the selection process is more often than not really a guesstimate based on interviewer experience, some ‘timeless’ tools (and the assumptions we inherit with them), the odd gut feel (often coated as the ‘fit factor’), and the internal politics of the day.
Considering the above, perhaps the recruiting community should pause to consider the implications of underestimating how the variables underlying our assessment instruments are operationalised and measured. These products are often left in the hands of the ‘experts’ at various outsourced companies to select and administer, because this part of the recruiting exercise is perceived as overly technical. This is understandable, given the effort it takes to work through any assessment’s detailed technical report in order to validate its claimed accuracy and applicability. Yet these very impressive-looking manuals, usually filled with equally impressive statistics, are often not worth the paper they are written on. More often than not, they apply traditional measurement concepts embedded in classical test theory to raw scores that have no meaningful comparability properties. The truth of the matter is that any interpretation can only be as good as the quality of its measure. Herein lies the fundamental dilemma: recruiters are often unaware of how these measures are constructed. Given the diligence exhibited throughout the recruiting process, in the choice of tools and instruments, and sometimes even in the choice of statistical analysis, it would make sense for equal consideration to be given to the primary question of how the measurement itself is constructed. It is impossible to make inferences about who is the better person for the job without rigorous measurement. This has to be the pivotal issue once we decide to use any form of psychological assessment as part of the selection battery.
So, to the point. If we are looking for the best, how do we measure for ‘the best’? And, even more importantly, how do we know we have a measure that is calibrated to measure ‘the best’ in a way that ‘…can be reproduced whenever necessary and shown to be invariant enough to approximate the continuity we count on to make our thoughts about amounts useful’?
For a bit more clarity, let us deconstruct this using a typical example of scale construction when developing an occupational or personality instrument. First, the researcher would assemble a group of items intended to measure a specific construct, for example leadership. After administering these items to a selected sample of individuals, the responses would be aggregated and presented as a total scale value. In our example, let us assume the researcher is developing a scale to measure leadership, with high total scores representing more of the quality being measured and low total scores indicating less of it. The items are scored on a five-point Likert scale: 1 (never), 2 (seldom), 3 (neither most of the time nor seldom), 4 (most of the time), 5 (always). Given this context, assume the scale includes three such items, the first of which reads ‘Exploits information from various constituents to formulate plans for competing in the marketplace’.
Let us now assume that an individual’s rating-scale responses on the three items are 1, 3 and 5 respectively. Traditionally, this person would be assigned a score of 9 on the leadership scale, and this 9 would then be used as the ‘measure’ in all further statistical analyses. Now consider another individual who responded 4, 4 and 1. Plainly, this person receives exactly the same score of 9.
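To see how much information the raw total throws away, here is a minimal sketch in Python using the two hypothetical response patterns above:

```python
# Two hypothetical candidates: different response patterns on the three
# leadership items, yet identical raw totals.
candidate_a = [1, 3, 5]  # never, neither most of the time nor seldom, always
candidate_b = [4, 4, 1]  # most of the time, most of the time, never

total_a = sum(candidate_a)
total_b = sum(candidate_b)

print(total_a, total_b)    # 9 9
print(total_a == total_b)  # True: the totals cannot tell the two apart
```

Once that 9 enters the downstream statistics, the difference between the two profiles is lost for good.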
Considering this situation, it is clear that we are making the following assumptions when we sum the ratings in this fashion:
i) that each item contributes equally to the measurement of the construct, and
ii) that each item is measured on the same interval scale.
Regarding the first assumption, we are concluding that each item has exactly the same qualitative value when measuring this imaginary leadership construct. It is glaringly obvious, however, that each of the items brings a distinctly different qualitative value to the overall Leadership construct we are attempting to measure. Consider items 1 and 3 in our example. A high score on item 1 should clearly hold more weight than an equally high score on item 3. One could equally argue that item 2 holds a relatively low standing on this Leadership hierarchy relative to item 1, and possibly a higher ranking than item 3. This brings us to the heart of the discussion: if our items differ in the level of endorsement they bring to the Leadership construct, then we cannot avoid analysing our data in a way that distinguishes the value each item contributes to the measurement of the construct. There can be very little doubt that a score of 4 on item 1 contributes distinctly more to the Leadership construct than a score of 4 on item 3.
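A crude way to see the consequence of dropping assumption (i) is to let the items contribute unequally. The weights below are invented purely for this sketch; proper item calibration, discussed later, is far more rigorous, but even arbitrary unequal weights separate the two candidates that raw summation conflated:

```python
# Hypothetical weights reflecting the idea that item 1 is a much stronger
# marker of leadership than items 2 and 3. Invented for illustration only.
weights = [0.6, 0.3, 0.1]

def weighted_total(responses):
    """Aggregate item ratings while letting each item contribute unequally."""
    return sum(r * w for r, w in zip(responses, weights))

print(weighted_total([1, 3, 5]))  # 2.0
print(weighted_total([4, 4, 1]))  # 3.7: no longer tied with the first profile
```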
The second assumption raises the issue of the interval scale. By using the five-point Likert scale, we assume that the distance between each pair of adjacent points is uniform, both within an item and across all the items. Consider the first item in our example. Under the assumption that each point on the scale is equidistant from its neighbours, (N) ‘never’ is as far from (S) ‘seldom’ as (M) ‘most of the time’ is from (A) ‘always’, or as (S) ‘seldom’ is from (NSM) ‘neither most of the time nor seldom’, and so on. However, considering the item statement ‘Exploits information from various constituents to formulate plans for competing in the marketplace’, it could very well be that, in the minds of respondents, (M) ‘most of the time’ and (A) ‘always’ sit psychologically much closer together than (N) ‘never’ and (S) ‘seldom’ do. Let us explore this graphically using item 1.
Responses based on the assumption of linearity and equidistance would look something like this, the assumption being that the contribution to the construct is equal between any pair of adjacent options, i.e. (N)–(S), (S)–(NSM), and so on:

    (N) —— (S) —— (NSM) —— (M) —— (A)
In reality, and as suggested above, there is a distinct difference in how respondents ‘psychologically’ interpret the distances between these options. Respondents to this and other items would find it much easier to ‘switch between’ (M) ‘most of the time’ and (A) ‘always’ than between (S) ‘seldom’ and (NSM) ‘neither most of the time nor seldom’. Similarly, (N) ‘never’ and (S) ‘seldom’ are much more difficult to switch between in terms of endorsement.
If we were to lay this out graphically, the relative distances in terms of strength of endorsement would look something like this:

    (N) ——————————— (S) ————————— (NSM) ————— (M) —— (A)

where (b) denotes the short distance between (M) and (A), (c) the longer distance between (NSM) and (M), and (d) the full span from (N) to (A).
Presented with this spatial representation of the psychological impact of these choices, it becomes evident that there is a large psychological difference between endorsing (A) on this fictitious leadership item and rejecting it outright with (N), the distance (d). That decision is relatively definitive: you either ‘exploit information from various constituents’ or you don’t. However, when one considers the psychological shift required to choose between (M) ‘most of the time’ and (A) ‘always’, the boundaries blur considerably, the distance (b). The choice becomes more reflexive and could rest on a host of psychological precursors: the respondent may just have completed an exercise gathering strategic market intelligence, or may be drawing on similar but more historical personal actions. These two scenarios could be all that separates the choice of ‘always’ from ‘most of the time’. It is very unlikely, however, given the context suggested above, that the second scenario would result in a toss-up between (NSM) ‘neither most of the time nor seldom’ and (M) ‘most of the time’, the distance (c).
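To put hypothetical numbers to this picture, suppose the response categories of item 1 actually sat at the latent positions below (the values are invented for illustration). Every one-step move in the raw coding would then correspond to a very different amount of the underlying trait:

```python
# Raw Likert codes versus hypothetical positions on the latent continuum.
raw_code   = {"N": 1, "S": 2, "NSM": 3, "M": 4, "A": 5}
latent_pos = {"N": -2.0, "S": 0.3, "NSM": 1.8, "M": 2.5, "A": 2.8}  # invented

for lo, hi in [("N", "S"), ("S", "NSM"), ("NSM", "M"), ("M", "A")]:
    raw_step = raw_code[hi] - raw_code[lo]         # always exactly 1
    latent_step = latent_pos[hi] - latent_pos[lo]  # anything but constant
    print(f"{lo} -> {hi}: raw step {raw_step}, latent step {latent_step:.1f}")
```

Summing the raw codes treats all four steps as identical, which is precisely the equidistance assumption the example above calls into question.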
In much the same way, this lack of linearity is usually evident across items as well. For example, the psychological distance (c) between (NSM) and (M) on item 1 may be very different from the corresponding distance on item 3. Two prospective applicants with exactly the same psychological style and associated skills could experience the psychological differences between (NSM), (M) and (A) on item 1 very differently: one might feel comfortable selecting (M) while the other would have no problem selecting (NSM). This may have minimal impact when hiring a frontline supervisor, but when we are dealing with high-stakes personnel it can easily translate into either an unqualified success or a devastating failure for the organisation. Given this, one cannot assume that the ‘value’ of a move from (NSM) to (M) is the same as that of a move from (M) to (A). Furthermore, simply tallying raw scores and using them as the indicator of the strongest candidate will most certainly bias the outcome against the individual who selected (M) ‘most of the time’ rather than (A) ‘always’. In essence, raw item ratings are unable to account for this lack of linearity either within or across the various items measuring a construct.

The majority of instruments currently in circulation perpetuate this fundamental weakness in their design: they confuse counts with measures. The quantitative observations they use to arrive at a ‘final score’ are grounded in counting observed events, or, as in this example, counting leadership properties, whereas for any measurement to be meaningful it has to be based on the arithmetical properties of an interval scale. So, before we even begin to consider whether one candidate is better suited to a particular job than another, we have to be assured of this fundamental prerequisite: ‘…a measure implies the previous construction and maintenance of a calibrated measuring system with a well-defined origin and unit which has been shown to work well enough to be useful.’ We have to ensure that we are measuring, and not simply counting observations.
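Measurement models in the Rasch family are one established way of meeting this prerequisite: they place persons and items on the same interval (logit) scale, with item calibrations that can be estimated independently of the particular sample of persons. The sketch below shows only the dichotomous core of the model; the five-point Likert items in our example would call for a rating-scale or partial-credit extension, and every parameter value here is invented for illustration:

```python
import math

def rasch_probability(theta: float, delta: float) -> float:
    """Dichotomous Rasch model: the probability that a person at trait
    level `theta` endorses an item of calibrated difficulty `delta`,
    both expressed in logits on the same interval scale."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# Invented calibrations: item 1 demands far more of the latent leadership
# trait to endorse than item 3, so identical raw ratings on the two items
# carry very different evidential weight.
item_difficulty = {"item 1": 1.5, "item 2": 0.2, "item 3": -1.0}

person_measure = 0.8  # a hypothetical candidate's location on the trait
for name, delta in sorted(item_difficulty.items()):
    print(f"{name}: P(endorse) = {rasch_probability(person_measure, delta):.2f}")
```

Because persons and items share one calibrated scale with a defined origin and unit, person measures remain comparable even when they are estimated from different subsets of items, which is exactly the invariance the quotation above demands.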
To date we have readily accepted the various reports and analyses stemming from a host of assessments without reservation. I propose that the next time we reach for the telephone to call our favourite assessment consultancy, or reach for the stock-standard instrument off the shelf, we should be able to answer one question: does this instrument allow me to estimate a person’s level on the latent trait, and the levels of the various items on that same trait, independently of one another, yet still compare them explicitly? After all, how accurate can any selection process be if the determining psychological construct has never been comprehensively measured in the first place?