
Episode 13

The Pitfalls of Using AI Systems for Hiring

Julia Stoyanovich, NYU

Transcript

Description

In this episode we are joined by Julia Stoyanovich from NYU to talk about her work on how AI is being used in the hiring process.

Whether you are responsible for hiring on behalf of a business or are a job seeker, you will find this podcast very interesting, but for very different reasons.

Show Notes
Resources

Episode Summary

  • Algorithmic decision making in the hiring process – what does that mean for businesses and job seekers?
  • The hiring process – the funnel effect.
  • Lack of public disclosure about the use of algorithmic tools as part of the talent acquisition pipeline.
  • Are job seekers being unfairly screened out of the hiring process?
  • How AI-based implementations of psychometric instruments are used today.
  • Is it possible to measure a person’s personality based on data alone?
  • Do these systems remove bias and discrimination from the hiring process?
  • Testing the stability and consistency of these algorithmic systems.
  • Vendors of systems and their lack of testing / recognising the issues.
  • Are new laws needed so the hiring process is fairer and more transparent?
  • What does the future of hiring look like – fewer AI systems and more human intervention?

https://www.linkedin.com/in/julia-stoyanovich-b184851/

https://engineering.nyu.edu/faculty/julia-stoyanovich

External Stability Auditing to Test the Validity of Personality Prediction in AI Hiring (article, published January 2022): https://arxiv.org/abs/2201.09151

Transcript

Speaker 1 (00:03):

This is the Data Science Conversations podcast with Damian Deighan and Dr. Philip Diesinger. We feature cutting edge data science and AI research from the world’s leading academic minds and industry practitioners, so you can expand your knowledge and grow your career. This podcast is sponsored by Data Science Talent, the data science recruitment experts.

Speaker 1 (00:33):

Welcome to the Data Science Conversations podcast. My name is Damian Deighan, and I’m here with my cohost Dr. Philip Diesinger. Today we feature another brilliant academic from NYU, Julia Stoyanovich, and Julia is here to talk about her fascinating work on how AI is being used in hiring. Julia, by way of background, is an Institute Associate Professor at NYU in both the Tandon School of Engineering and the Center for Data Science. In addition, she is Director of the Center for Responsible AI, also at NYU. Her research primarily focuses on responsible data management and analysis, fairness, diversity, transparency, and data protection at all stages of the data science lifecycle. She established the “Data, Responsibly” consortium and served on the New York City Automated Decision Systems Task Force, an appointment by New York Mayor de Blasio. Julia has also been teaching and developing courses on Responsible Data Science at NYU, and is the co-creator of an award-winning comic book series on the topic. In addition to data ethics, Julia works on the management and analysis of preference and voting data, and on querying large evolving graphs. She holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics & Statistics from the University of Massachusetts at Amherst. She is a recipient of an NSF CAREER award and a Senior Member of the ACM. We’re very excited to have you with us on the podcast. Thank you so much for joining us today Julia.

Speaker 2 (02:28):

Thank you Damian, it’s a pleasure. I should abbreviate that bio it’s quite a mouthful.

Speaker 1 (02:34):

It was very well written actually, you made my task very easy for this episode, so thank you. So if we just start with your personal journey, Julia. Could you tell us what motivated you to get involved in researching the area of responsible AI and AI ethics?

Speaker 2 (02:53):

There's any number of answers I could give to this, but I'm actually not quite sure myself how it is that I got interested in this topic. I feel like responsible data science and responsible AI and AI ethics, all of these names, are used more or less interchangeably. And to me, this is an area where my mind and my heart meet, and if that's a cliché, then so be it. I always had pretty strong political views, kind of on the left side of the spectrum. And I am also always interested in understanding to what extent models fit the world, to what extent data fits the world. How well can we actually ask the kinds of questions that we then want technical solutions to address? And this is kind of an engineering angle, and I'm really happy to be alive at a time when being both an engineer and somebody who is socially conscious is something that you can combine professionally.

Speaker 2 (03:51):

My interest in this topic started actually very, very long ago, although, of course, at that time these areas were not called AI ethics or responsible data science. But I was always interested in preferences and rankings and opinions of people, and really just understanding when there is a ground truth and when there isn't a ground truth, when there is actually a disagreement that you need to intrinsically model and where you cannot converge to a single kind of authoritative score or rank order. And the technical tools that I have been developing in that realm, I'm finding that they are applicable now to thinking about ethics, and to questions about to what extent things can be taken in the absolute and to what extent things are context specific, culturally grounded, et cetera.

Speaker 1 (04:38):

Obviously one of your core areas of recent interest is the use of AI in hiring. Perhaps you could talk about what prompted that and maybe give us the big picture view of this whole area.

Speaker 2 (04:53):

Yes. So algorithmic hiring, the use of data-driven algorithmic techniques, predictive analytics, machine learning models, and other types of AI in hiring, is really, really widespread. And this is a specific application domain, right? And I am a computer science researcher by training and by calling. So why am I looking at this specific domain? And the answer is that the kinds of benefits and the kinds of risks that this domain raises are just extreme. Especially the risks are something that we all feel as individuals, as members of specific societal groups, and society at large is getting, I feel, destabilized by the fact that we are using systems in a crucial domain like hiring without a good understanding of what they do, whether they work, whether they discriminate, or whether they help us address discrimination. So this domain is important to me, both because it's a very interesting, specific use case for the kinds of technical tools that I have been developing in the preferences and ranking and set selection area, but also because the legal and the social impacts of this domain are so important that it's worthwhile just to focus on it for the purpose of understanding what to do correctly, right?

Speaker 2 (06:11):

So, the way that I think engineers should act in society today is that we should be looking for ways to intervene on society, to make it better, that go beyond technical solutions. If the ways to intervene have to do with laws and regulation, or with educating people, regular people, members of the public, or technical students, then this is where we should go. And all of these opportunities, all of these potential interventions, are very clear when it comes to algorithmic hiring. So why am I talking about algorithmic hiring, and what is it really? There was a brilliant report in 2018 produced by Miranda Bogen and Aaron Rieke from Upturn, called Help Wanted, where they described the hiring process as a funnel. It's a sequence of data-driven, algorithm-assisted steps in which a series of decisions culminates in job offers to some candidates and rejections to others.

Speaker 2 (07:12):

And of course, all of these steps are ultimately conducted by people, but with the help of data and algorithmic decision-making. And I find it very useful to think about this funnel in terms of the concrete steps that are included in it. The first step is when employers source candidates with the help of ads or job postings, and this is more often than not carried out on social media platforms. Next there's a screening stage, and this is where employers assess candidates by analyzing various properties of these candidates: their experience, their hard skills, their soft skills, other characteristics. At this stage, AI tools are used increasingly, and they include a wide variety of tools, such as resume and social media profile screeners to identify specific skills that an employer may be looking for, but also tools like game-based psychometric assessment or resume-based psychometric assessment, where the goal is to identify personality traits, quantify them on a numerical scale, and then see how these traits might map to the requirements of a particular job and to predicting success on that particular job. Then after screening, there are still multiple stages left. There is interviewing candidates, and that is also very often done with the help of AI tools.

Speaker 2 (08:39):

So a candidate very often is not speaking to an actual recruiter; rather, they're sending in a recording of their interview video or just the script, right? And then there's going to be a bunch of tools used by these employers to analyze the video or the text. And then there's a selection step. This is when employers may run background checks, for example, or run predictive analytics to figure out, based on someone's employment history, what salary to offer them so as to make the offer attractive. And in all of these stages, we see very complex, very rich interactions between data and algorithms and people.
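
As a rough illustration of the funnel described here, the stages can be thought of as a pipeline of filters, each narrowing the candidate pool. The stage names and filter rules in this Python sketch are hypothetical placeholders, not any vendor's actual logic.

```python
from typing import Callable, Dict, List

Candidate = Dict[str, str]                       # e.g. {"name": ..., "resume_text": ...}
Stage = Callable[[List[Candidate]], List[Candidate]]

def run_funnel(candidates: List[Candidate], stages: List[Stage]) -> List[Candidate]:
    """Apply each stage in order and report how the pool shrinks."""
    pool = candidates
    for stage in stages:
        before = len(pool)
        pool = stage(pool)
        print(f"{stage.__name__}: {before} -> {len(pool)} candidates remain")
    return pool

def sourcing(pool):
    # only candidates who ever saw the job ad can apply
    return [c for c in pool if c.get("saw_ad") == "yes"]

def screening(pool):
    # keyword-style resume screen, a stand-in for vendor screeners
    return [c for c in pool if "python" in c.get("resume_text", "").lower()]

def interviewing(pool):
    # e.g. keep the candidates whose recorded interviews scored highest
    return pool[:10]

def selection(pool):
    # final offers
    return pool[:2]

# offers = run_funnel(all_applicants, [sourcing, screening, interviewing, selection])
```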

Speaker 3 (09:17):

What do you think, how many applications are basically touched by an automated process like this today?

Speaker 2 (09:25):

Really the answer is that we don't know, but we have a pretty good guess that the vast majority of applications are going to be touched by algorithmic processing at some stage in this funnel. There are currently no requirements for companies to disclose publicly that they are using algorithmic tools as part of the talent acquisition pipeline, as it's also sometimes called. So we don't have any hard numbers here. And the best way for us to get information about this is to go to specific vendors, look at their websites, and see what they claim about how popular their tools are. And based on that information, we think that between 75 and 100% of all submitted resumes are being read by software today, and this is only for the resume screening part. If you think about the earlier stages, for example, who sees what job ad, hardly any of that happens in newspapers anymore, or on TV, right? All of this is facilitated by online platforms. So I do believe close to 100% of the sourcing is happening online, perhaps with the exception of some very, very targeted searches for executive-level positions at companies.

Speaker 3 (10:45):

So for job seekers, that means for any application there is a substantial risk that when I submit a CV for an opening, there might be a first step in this funnel that you described, a first filter, which is completely automated and might basically exclude me entirely from the hiring process at this stage already.

Speaker 2 (11:08):

Indeed. And this happens at every stage, right? So you would very often be excluded even before you know that there is an opening, because an ad for the job will never be shown to you, right? So that's kind of the most impactful portion of the funnel, because this is the widest portion, right? But then for the resume screening, again, as far as we know, for most positions resume screening is automated. And it's very rare for an individual to hear back, to know that their resume was screened by an algorithm rather than looked at by a human. The applicants are not told why their resume was screened out.

Speaker 3 (11:43):

From what you’re describing, it sounds like at this stage, it’s pretty much safe to assume that basically any resume or any application will be just screened by some sort of system.

Speaker 2 (11:53):

Yes, I would guess that again, maybe except for some very, very targeted searches that are conducted by boutique recruitment agencies, or that are looking for a specific individual with a specific profile for an executive-level position. I think all of the entry-level and mid-level jobs are going to be subject to this automatic screening.

Speaker 3 (12:16):

Could you specify a little bit more how specific software would process a CV? You've already mentioned that there are different types of input data, like voice recordings at a later stage, but let's say in the early stage of an application process there would just be a resume, or maybe some external data, like LinkedIn, which you also mentioned. So the input data is basically written text. Is that correct? And then what is the output characteristic that decides whether to continue with an applicant or not?

Speaker 2 (12:49):

Yeah, this again is a question that seems simple, right, but we lack information about how specifically these tools operate, and even more so about the context of their use. And this is something that is very important to understand, right? Because ultimately it's not going to be the case, very often, that the only thing that happens to an application is that it is screened, and then whichever resumes are found not to match are just dropped on the floor and rejected, and the other ones continue. There's going to be a human in the loop at some point. And so what we don't have an understanding of is how the various types of predictions that this software makes are used by the actual human decision-makers, like hiring managers. Do they blindly trust these predictions and then just say, yes, I accept whatever matches or non-matches the software identified?

Speaker 2 (13:45):

Does it depend on the industry? Does it depend on the type of job? Ultimately, the kinds of predictions that an algorithm can make are going to be based either on some sort of first principles, or, more likely than not, on past data, right? So the way that these systems operate is that they, for example, have access to a bunch of resumes of people who worked in similar jobs in the past, and who did well or maybe did less well, right, but who were already employed in these particular positions. And then the system would be a classifier that takes a bunch of resumes as input, or a particular resume as input, and classifies them according to a model that was learned from historical data, from historical resumes describing past employees, to predict whether this new person would be a good fit for the particular job.

Speaker 2 (14:37):

So in a nutshell, they are predictive analytics, they're classifiers. And as we know, classifiers are trained on some training data set that represents the past, and they use it to predict the future, to predict future performance. So this is one very standard, staple kind of tool. And if you think about how these tools work, you would understand immediately that if you had an under-representation of individuals from particular demographic groups in your training set, then you wouldn't be able to make meaningful predictions for them. Your predictions would be random at best, and most likely you would classify them negatively, because you've never seen positive cases like this. And there are also tools that are not simply classifiers, but rather are a kind of decision support tool, more explicitly for a hiring manager. Some of these tools are of the psychometric testing variety.
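
To make the classifier framing concrete, here is a minimal sketch of the kind of model described here: a text classifier trained on historical resumes labeled with past outcomes, then used to score a new applicant. The resumes, labels, and model choice (TF-IDF plus logistic regression) are illustrative assumptions; real vendor systems are proprietary and far more elaborate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Historical resumes (free text) and whether the past employee "did well" (1) or not (0).
past_resumes = [
    "data analyst, SQL, dashboards, stakeholder reporting",
    "software engineer, python, distributed systems",
    "retail associate, customer service, scheduling",
    "machine learning engineer, python, model deployment",
]
past_outcomes = [1, 1, 0, 1]

# Train a classifier on the past, exactly the "predict the future from the past" setup.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(past_resumes, past_outcomes)

# Score a new applicant. If a group of applicants was under-represented (or absent)
# in the historical data, scores like this are unreliable for them, which is the
# failure mode discussed above.
new_resume = ["statistician, R, clinical trial analysis"]
print(model.predict_proba(new_resume))  # probability of being classified a "good fit"
```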

Speaker 2 (15:27):

So we've been using psychometric tests for hiring since the early 1900s, as far as we know, if not earlier. And the way these tests were used in the analog days was that there was a bunch of standard questions that people would respond to, and then, based on their answers, some personality traits of these people would be identified. And then there would be some theory, developed in an area that we now call industrial psychology, that would see to what extent these particular traits would be predictive of, let's say, on-the-job performance for particular candidates. And right now we're seeing a lot of implementations, algorithmic implementations, AI-based implementations of psychometric instruments, which take whatever assumptions the analog psychometric instruments make and translate them into this digital world of decision-making in the hiring context.

Speaker 3 (16:23):

And a psychometric test would be something that tries to measure some sort of personality traits.

Speaker 2 (16:30):

Right. And then the question of course becomes, you know, do you actually believe that personality is a valid construct? Do you believe that you can measure personality given the information that you have access to? Does a test allow you to measure personality in some meaningful way? And another question, which is again a higher-level one, I guess, is: even if personality were a meaningful construct and you were able to measure it, how important is it in the context of deciding who gets a job and who does not? Perhaps we shouldn't be assuming that a person stays the same no matter what company they're hired by, no matter what the environment is at that company. Right? Of course the environment matters a lot for people's success. People have agency; people can decide to do well or to not do well on the job, irrespective of how they came in. Right? So these are all questions that are very important for people to ponder when they decide whether or not to use psychometric testing. But the state of the world today is that psychometric tests are used, they have been used in the past in the hiring context, and now they are being used in this algorithmic hiring context as well, and increasingly so.

Speaker 3 (17:41):

And so obviously it's very important that these software solutions utilizing AI are consistent in their judgment, right, and in their output metrics. And this is something that you investigated a little bit in your recent research. Can you talk us through that?

Speaker 2 (18:01):

This is a recent project that I undertook with a wonderful and very interdisciplinary team of collaborators. Some of them are data scientists: Alene Rhea, Kelsey Markey, and Lauren [inaudible]. One of them is an investigative journalist, Hilke Schellmann, who has been studying this area of the use of algorithmic tools in hiring. One of them is Mona Sloane, who is a sociologist, and Paul Squires is an industrial psychologist. So this group of people came together to investigate a particular aspect that pertains to the use of algorithmic psychometric tests in the hiring context.

Speaker 2 (18:16):

And maybe before diving into this paper, I should say that a lot of attention today is being paid to bias and discrimination in hiring in general. And these algorithmic tools are seen as a way to address the fact that people, humans, are biased when they hire, and then there's this argument being made that because people are biased, we have no choice but to bring in machines to help hire on our behalf, so as to have better workforce diversity, for example. And this is all important. I personally don't really buy that argument of machines being better than humans at diversifying, because ultimately humans have to tell machines what the objectives are. But a lot of attention is being paid to bias and discrimination, and tools are being marketed as solutions to that. But perhaps an even more important question here is: do these tools even work? What do they do, right? If you have a tool that is going to, let's say, admit, select a sufficient number of people in each demographic group as required by law, but other than that its predictions are random, is that a fair tool? I don't think so.

Speaker 2 (19:57):

I think it's an arbitrary set of decisions that it's making. And then rather than using such a tool, you should just be taking a stratified sample, essentially flipping a bunch of coins. You don't need any AI for this. But then what happens very often with these tools is that not only are they arbitrary, but they can also be picking up signal, without telling us, that correlates with a person's disability status, for example, or a person's gender or a person's race, in ways that we can't really counteract. So this is kind of the background for this work. In this work we did not look at bias and discrimination. Instead, we looked at stability, which can be used as an indication of whether a tool is working or not working. At the very least, when you have a test, a psychometric instrument, right, I mean, think of an IQ test or an SAT test or a [inaudible] test, right?

Speaker 2 (20:47):

You would expect that when you take that test yourself, your scores won't change very much if nothing changes about you, right? So you take the test today, you don't study, you don't change anything about yourself, you take it the next day, and the score should be more or less as it was yesterday, right? And generally, if you have a measurement instrument like a thermometer that measures the temperature outside and you use it once, it gives you a reading of 20 degrees centigrade, and then you use it again in five minutes and nothing changed outside, there's no thunderstorm, there's no hurricane, and now it tells you 30 degrees centigrade, you're going to worry that the instrument, that the thermometer, is broken. Right? So what we're looking at in the study is whether we have any indication about these psychometric instruments actually working: not being broken, not giving different readings of a person's personality traits under the same conditions or within a short time span. Or, if you give them an input in PDF versus in, let's say, Word or raw text, you wouldn't expect there to be any differences in how your personality is read by these tools. Right?

Speaker 2 (21:54):

And the interesting point here is that by conducting the study we are not implicitly or explicitly endorsing personality or personality traits as a valid construct. And we're also not endorsing or questioning the use of personality traits in hiring. The methodology that we have here does not allow us to question these aspects. But what we are able to question is whether the assumptions that the vendors of the tools themselves make are met in the way that these tools operate, and these assumptions are stability assumptions. The reason that we use stability, to use the terminology that folks in industrial psychology use, is that stability can be used as an indication of the validity of a test. If a test is unstable, it cannot be considered valid.

Speaker 3 (22:54):

Yeah, it’s a requirement, yeah?

Speaker 2 (22:55):

Right? If it's stable, that doesn't mean it's valid, because you still need to show not only that it is consistently giving you the same result, but also that that result is in any way job relevant. A watch that doesn't work will reliably tell you that it's midnight or noon, right, but that doesn't mean that that's correct. At the very least, you know that it's giving you a stable prediction.

Speaker 3 (23:19):

Yeah. So how did you set out to test the stability or consistency of these solutions? Are there specific ones that you tested, and how exactly did you do it?

Speaker 2 (23:33):

Yeah, so what we did was we purchased licenses. We did an external stability audit of two tools, called Humantic AI and Crystal. We purchased subscriptions to these tools. We obviously did not work with the vendors, we didn't tell the vendors that we were looking at their systems, and we filed for and were approved for a study at New York University by our institutional review board (IRB) to collect resumes and LinkedIn profiles, and also Twitter handles, of graduate students in computer science, data science, and computer science and engineering, who are representative of a particular category of job seekers. So we collected a corpus of resumes and we ran that corpus, under different treatments, through these tools, Humantic AI and Crystal, and we looked at the outputs that these tools produced on various sets of resumes in the corpus. As for the types of output that these tools produce, I already mentioned that they are algorithmic psychometric tests, so they all produce estimates of personality traits for the individuals whose resumes or LinkedIn profiles or Twitter handles are supplied as input. And there are two personality trait scales that each system uses: one is called DISC and the other is called the Big Five. So these are the measures that we used. We looked at the scores, the DISC scores and the Big Five scores, that these systems produced, and then we looked at whether or not these scores change under particular treatments.

Speaker 3 (25:30):

Yeah. And with a treatment, you obviously didn't change the content of the CV, but what was it that you changed? Like, did you basically process a given resume multiple times through the same software but with different file formats, or how did you do that?

Speaker 2 (25:48):

The way that we thought about this is that we should frame our methodology around the underlying assumptions that are made by the algorithmic personality test vendors themselves within the hiring domain. An important thing to note here is that because algorithmic personality tests are a category of psychometric instrument, they are already subject to the assumptions made by the traditional instruments. And these assumptions are: personality is a valid construct, personality traits are measurable, and personality traits are indicative of performance on the job. So these are not the assumptions that we can test here. Instead, our validity experiments used the following assumptions. The first one was that the output of an algorithmic personality test should be stable across input types, meaning that a PDF of someone's resume, or a Microsoft Word version, or a raw text version of someone's resume should not change the personality score, the personality profile, that that resume receives.

Speaker 2 (27:00):

And the reason that we chose this as an assumption to test is because the vendors themselves don't discriminate between these input types. They don't direct their user to treat the outputs that were computed over PDF resumes differently than the outputs that were computed over a Microsoft Word version of the resume. They tell you: no matter what you have just fed in, we'll give you a score, and everybody is comparable according to that score, right? So this is clearly an assumption that the vendors themselves make, and this is one assumption that we tested. Another assumption that we tested was that the output of an algorithmic personality test should be stable across input sources, and these are resume, LinkedIn, or Twitter. And here again, the reason that we make this assumption is because the vendor tells the user: if you have a resume, feed it in, we'll give you a score; if you have a LinkedIn profile, give us that, we'll give you a score. And then all of these scores are comparable. There shouldn't be a difference between a resume-computed score and a LinkedIn-computed score. And the third assumption that we made, and this is the final one, is that the output of an algorithmic personality test on the same input should be stable over time. This again is based on the assumption made by the systems that personality is a stable construct. And furthermore, the context of use is such that if you have a search to fill a position that lasts a month, the vendor is going to allow you to treat interchangeably the scores of people that you computed on the 1st of the month and on the 31st of the month. So they are making that assumption of test-retest reliability, as it's called in the psychometric literature. So these are the assumptions that we test in various ways.
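
A minimal sketch of how these three treatments might be exercised in an audit harness is shown below. The score_personality function is a hypothetical stand-in for a call to a vendor tool (the study used licensed access to Humantic AI and Crystal); the file names and returned scores are placeholders, not real data.

```python
from typing import Dict

def score_personality(source: str) -> Dict[str, float]:
    # Stand-in for submitting an input to a vendor tool (e.g. via its UI or API).
    # Here it returns fixed dummy Big Five scores so the sketch runs end to end.
    return {"openness": 0.70, "conscientiousness": 0.55, "extraversion": 0.40,
            "agreeableness": 0.62, "neuroticism": 0.31}

candidate = "participant_017"

# Treatment 1: same resume, different file formats (scores should not change).
scores_pdf = score_personality(f"{candidate}_resume.pdf")
scores_txt = score_personality(f"{candidate}_resume.txt")

# Treatment 2: same person, different input sources (resume vs. LinkedIn vs. Twitter).
scores_resume = score_personality(f"{candidate}_resume.pdf")
scores_linkedin = score_personality(f"https://linkedin.com/in/{candidate}")

# Treatment 3: same input resubmitted later in the month (test-retest reliability).
scores_day1 = score_personality(f"{candidate}_resume.pdf")
scores_day31 = score_personality(f"{candidate}_resume.pdf")  # submitted 30 days later

print(scores_pdf, scores_txt, scores_linkedin, scores_day31, sep="\n")
```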

Speaker 3 (28:52):

Can I ask, are these systems also using some additional data besides the CV? Yeah, so we are talking about Crystal here; I think that's also used on LinkedIn, for instance. So when submitting your CV, obviously there's a name on it, right? You could look that up. So are these systems just purely processing a CV, or are they also using other extra information or external databases?

Speaker 2 (29:16):

So both Crystal and Humantic accept a resume or CV. They also both accept a LinkedIn URL. In fact, in Humantic AI, and this is a very interesting finding that we had very early on, if you have your LinkedIn URL appearing in your resume, the system is going to just go to LinkedIn and use the information from LinkedIn to compute the score. And it's going to create a persistent linkage between the LinkedIn URL that it encountered and some email address that it encounters in that same resume. It could be your email address at the top, or it could be the email address of one of the references that you gave at the bottom. And once it's created that association, there is no way for you to break it. So this was one of the most serious issues that we found: it has to do with data protection, and it enables malicious attacks. This is very problematic. But to get back to your question, Philip: Twitter is also another source that Humantic AI uses, and again they treat all of them interchangeably.

Speaker 3 (30:23):

So Humantic AI and Crystal, they basically combine what you submit with other data sources that they have available online – Twitter, LinkedIn, and so on.

Speaker 2 (30:33):

I don't know whether they would use other data sources in addition to the information that you submit, without your permission. We don't have any evidence of that, at least within this experiment, with the exception of this linkage that Humantic makes between the resume and the LinkedIn profile: once you supply that LinkedIn information, it stores it. But we don't know whether it goes and looks at other data sources.

Speaker 3 (30:57):

I see. Yeah. So you set up this stability test, equipped with the underlying assumptions. What came out of testing these assumptions? Are they valid or not?

Speaker 2 (31:11):

We found that both systems show substantial instability with respect to key facets of measurement. And this is an unfortunate finding for the vendors of these systems, and of course for the job seekers being screened. And our conclusion is that these systems therefore cannot be considered valid testing instruments. To give just two salient examples: Crystal frequently computes different personality scores if the same resume is given in PDF versus in raw text format, and this violates the assumption that the system itself makes, that the output of an algorithmic personality test should be stable across job-irrelevant variations in the input, such as the format of the resume. And among the other notable things that we found is evidence of persistent and often incorrect data linkage by Humantic AI, where a LinkedIn URL is connected with an email address that appears on the resume, and then that linkage cannot be removed from the system. And this is very problematic.

Speaker 3 (32:25):

How did you define whether it failed or not?

Speaker 2 (32:28):

So we computed several measures, in several ways, to estimate stability with respect to each key facet of measurement. Specifically, we looked at so-called rank-order stability, locational stability, and total change in scores. And in some cases we were also able to look at these measures with respect to subgroups. So let me talk about rank-order stability as a prominent example of the kind of tests that we did. Generally, in the psychometric literature, the reliability of psychometric instruments is measured with correlations. And so we also used correlation analysis to assess rank-order stability, and we estimate the correlation between the outputs across each facet of interest. So, in the first example that I gave, you're going to take a corpus of resumes that are in PDF and compute personality scores, and you're going to take the corpus of the same resumes, but now in raw text, get the outputs, and compute scores.

Speaker 2 (33:36):

And we used a specific threshold for reliability correlations, where the bare minimum, according to the literature that we cite in this paper, is considered to be a correlation of 0.9, but really the desirable standard for individual-level decisions, and these are individual-level decisions, right, whether to hire a person or not, should be 0.95. Still, in our experiments we saw that if a tool passed the 0.9 threshold, it also likely passed the 0.95, so there wasn't really much of a difference in our tests for this.
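
As an illustration of the rank-order stability check, the sketch below correlates the scores a tool assigns to the same people under two treatments (PDF versus raw text resumes) and compares the result against the 0.9 and 0.95 reliability thresholds mentioned here. The score values are invented for the example.

```python
from scipy.stats import spearmanr

# Conscientiousness scores for the same 8 participants under two treatments.
scores_pdf = [0.61, 0.72, 0.55, 0.80, 0.43, 0.67, 0.90, 0.58]
scores_txt = [0.60, 0.70, 0.57, 0.79, 0.45, 0.66, 0.88, 0.59]

# Rank-order stability: how well the two treatments agree on the ordering of people.
rho, p_value = spearmanr(scores_pdf, scores_txt)
print(f"rank-order stability (Spearman rho): {rho:.3f}")

MINIMUM, DESIRABLE = 0.90, 0.95  # thresholds from the psychometric literature cited above
if rho < MINIMUM:
    print("fails even the bare-minimum reliability threshold")
elif rho < DESIRABLE:
    print("meets the minimum but not the desirable standard for individual-level decisions")
else:
    print("meets the desirable standard")
```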

Speaker 3 (34:15):

So failing the stability test is obviously a disaster for solutions like this, right? What would you recommend to job seekers, the candidates submitting their resumes? Is there anything they can do to influence this process?

Speaker 2 (34:31):

Uh, I think that the more we know about the ways in which these systems operate, and importantly the ways in which they are broken, the more of an opportunity we give to job seekers and to the broader public to speak up and to demand accountability. So the main outcome of this work, in my mind, is being able to publicize it broadly, to tell people that they are subjected to these essentially arbitrary screens, right, on which their lives and livelihoods depend. And when job seekers have access to information, have an understanding of what tools are used to screen them and in what ways these tools are broken, they can demand, for example, that new laws should be passed that would compel both the vendors of these tools and, importantly, the employers who use these tools to disclose what tools they're using and where these tools are picking up signal from.

Speaker 2 (35:27):

And I've written a bit on this in the popular press. For example, just in September of 2021 I had an article in the Wall Street Journal in which I argue for disclosure to job seekers of when these systems are used. Specifically, I'm arguing for the need to attach a nutritional label to a job posting, to tell job seekers how they will be screened, what data will be collected about them, what tools will be used, and what the process is to allow them to opt out, by not applying, right, or to give their consent, by applying for the job. And also to attach a nutritional label to a decision, which would then explain to a job seeker which specific feature it was in their resume that made them look insufficiently "conscientious", and I'm using air quotes here, such that this algorithmic psychometric test didn't match them with the job.
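
As a sketch of what such a nutritional label might contain, here is a simple data structure an employer could attach to a job posting. The fields and example values are illustrative assumptions, not a standardized schema or anything proposed verbatim in the article.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class JobPostingLabel:
    position: str
    screening_tools: List[str]       # tools used to screen applicants
    data_collected: List[str]        # e.g. resume text, LinkedIn profile, video
    traits_predicted: List[str]      # e.g. Big Five or DISC traits, skill matches
    opt_out_process: str             # how a candidate can decline algorithmic screening
    contact_for_explanation: str     # where to ask why a decision was made

label = JobPostingLabel(
    position="Data Analyst",
    screening_tools=["resume screener (hypothetical vendor)", "algorithmic personality test"],
    data_collected=["resume text", "LinkedIn profile URL"],
    traits_predicted=["conscientiousness", "extraversion"],
    opt_out_process="email recruiting@example.com to request human-only review",
    contact_for_explanation="recruiting@example.com",
)
print(label)
```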

Speaker 3 (36:24):

Do you think the companies utilizing these solutions are aware of the limitations?

Speaker 2 (36:29):

I'm sure they are not, because it's not to the vendors' advantage to disclose these limitations, and they are not at the moment compelled to do so. The sort of test that we were able to do with Crystal and Humantic would be so much easier for the vendors themselves to do. And had they conducted such tests, they would have known to try and improve things in their processing. I mean, some of the issues we identified are just outright bugs. This association between the LinkedIn URL and a random email that appears in your resume, right? That can't be an intentional feature. Standards are just lacking, both in the software development and in the procurement of these tools.

Speaker 1 (37:14):

So for the companies who are using these tools, what should they be doing instead, in your opinion, Julia?

Speaker 2 (37:23):

I think that they should be questioning what they expect these tools to do for them, and they should be checking that these tools are in fact accomplishing the goals that they have set out for themselves. Being skeptical, not taking at face value that AI is this magic bullet that knows what you mean and is going to just fix all of the world's problems for you if only you pay the vendor enough, I think, is the first step. The other thing that companies should understand is that software vendors very often tell them that if there are any challenges to the legal compliance of their solutions, of the software that they're using, the vendors will essentially respond to these challenges in court. And that argument is not going to stand up. Ultimately, the employer is responsible for using the tools to screen the applicants, right? And so, while we don't currently have very strong regulation in place that would compel employers to provide disclosure to job applicants about what tools they use or how they use them, this regulation is coming. So employers should be afraid of using tools that they don't understand.

Speaker 1 (38:37):

I think it’s probably fairly safe to assume that reliance on these tools is only going to increase as the years go on. What are the implications for hiring companies of continuing to use them?

Speaker 2 (38:53):

I don't know if I agree with you that reliance on these tools will only increase. I think when we become more disciplined about our expectations of these tools, we are going to be able to decrease the number of resumes that respond to every posting, and generally just limit the traffic to a level that is more manageable for a human. I think that at the moment, the reason the volume of these applications is so high is because these tools are there. So they are in some way fueling the market by their mere existence. It's very hard to predict the future. I don't know what will happen. I don't know how quickly we will be able to rein in these arbitrary uses of technology, including AI. But again, I hope that once we're able to give the public more information about how they are screened by these tools, the market will calm down a little bit and we will come to a point where we don't continually throw AI at problems that AI cannot address.

Speaker 1 (39:59):

And I don't know, have you done much research into the other areas, such as the software that matches job descriptions to resumes, or the video interviewing and how they're analyzing that? Have you looked at those, Julia?

Speaker 2 (40:17):

So the matchmaking tools I don't know very much about. The video interviewing tools are under a lot of scrutiny because of the skepticism that we now generally have of facial recognition technology. So one of the larger vendors in that space is HireVue, and they have discontinued their analysis of video interviews. They now only analyze text and speech. We don't know very much about what HireVue does internally. We don't know very much about the performance of their models. We know that they are very broadly used. And again, this is one example of the kind of company fueling this market, right, so that there's more demand for their solutions. We do know, based in part on some reporting by my collaborator Hilke Schellmann, that the kinds of results that speech analysis tools, for example, are giving are also arbitrary. So Hilke did this experiment, and she is German: she read off some random Wikipedia article in German and submitted that in response to an interview request for an English-language position. And she was given a very high competency score. So this makes you wonder, you know, whether these tools are meaningful at all. Right? So as far as we know, it's as arbitrary as what we tested here.

Speaker 1 (41:40):

Then the other area that you mentioned earlier is automated background checking. One would hope that if a negative result comes back on a background check in particular, then at that point there is some human intervention, that there is at least a check and balance. Do you know anything about that? Is that the case, or is that a particularly worrying area as well?

Speaker 2 (42:04):

This is not something where I have any information, unfortunately, about how these background checks are done. The one example that I usually bring up is a kind of classical example that most people know, and this was surfaced by Latanya Sweeney back in 2012/2013. She is a computer scientist, she's done a lot of work on technology policy, and she's on the faculty at Harvard. And as you can probably guess by her name, Latanya Sweeney, she's African-American. She Googled her name to look up some information about herself and got an ad from this company called Instant Checkmate, asking whether she wants to see her own, Latanya Sweeney's, criminal record, which she doesn't have. And then she designed a study and showed that, even if you control for whether an individual has a criminal record, African-American-sounding names will trigger these ads. And so consider when a potential employer Googles someone's name, right; this is the easiest thing to start a background check with.

Speaker 2 (43:06):

They don't even have to click on that ad. Just by virtue of seeing that ad, there's a reputational harm that the candidate experiences, because the hiring manager would think, you know, Google wouldn't be showing me something if there was nothing there. So this is a very, very problematic practice. And again, very little information is available to us to really learn about how these systems are incorporated into hiring processes. All we know is that these systems exist. Some of them are platforms like Google; some of them are software solutions that a vendor provides and an employer buys. But how these fit into the pipeline, we don't know.

Speaker 1 (43:50):

Yeah. On background checking in particular, there are a lot of vendors out there now offering very fast, seemingly very comprehensive background checking across the world. So, like you say, it would be very interesting to see how robust that is.

Speaker 1 (44:09):

And that brings today's episode to a close, another really fantastic conversation. If you enjoyed this episode, then please do check out our previous podcast, which featured Maurizio Porfiri, also from NYU, and his amazing research into how mass shootings impact the sales of weapons in the US. But for now, Julia, thanks so much for joining us on the show today. It was a real pleasure having you on.

Speaker 2 (44:37):

Thank you very much for having me, and thank you, Philip.