By Samuel Alter
Apziva: 6bImatZVlK6DnbEo
This project uses NLP techniques like word embedding and Learning-to-Rank systems using neural networks (RankNet with PyTorch) and LambdaRank (LightGBM's LGBMRanker) to analyze a job candidate dataset. This project is split into two parts:
- Part 1: More straightforward NLP analysis to ultimately rank the candidates based on their job title's similarity to the search terms
- Part 2: Implementing machine learning models for Learning to Rank scoring systems, specifically with RankNet in PyTorch and LambdaRank in LightGBM.
- Project introduction
1.1. Summary
1.2. Table of Contents
1.3. Overview
1.3.1. Goals
1.3.2. The Dataset
- Part 1: EDA and Candidate Ranking from Text Embedding
2.1. Connections
2.2. Geographic Locations
2.3. Initial NLP
2.3.1. Text Embedding and Cosine Similarity
2.3.1.1. Notes on methods
2.3.1.1.1. GloVe
2.3.1.1.2. fastText
2.3.1.1.3. SBERT
- Part 2: Machine Learning Models Using Learning to Rank Systems
3.1. RankNet
3.2. LambdaRank
3.2.1. Prepare the data
3.2.2. Split the data
3.2.3. Train the LambdaRank model
3.2.4. Test, evaluate, and plot
- Conclusion
We are working with a talent sourcing and management company to help them surface candidates that are a best fit for their human resources job post. We are using a dataset of job candidates' job titles, their location, and their number of LinkedIn connections.
Produce a probability, between 0 and 1, of how closely the candidate fits the job description of "Aspiring human resources" or "Seeking human resources." After an initial recommendation surfaces one or more candidates to be starred for future consideration, the recommendation will be re-run and new "stars" will be awarded.
To help predict how the candidates fit, we are tracking the performance of two success metrics:
- Rank candidates based on a fitness score
- Re-rank candidates when a candidate is starred
We also need to do the following:
- Explain how the algorithm works and how the ranking improves after each starring iteration
- Explain how to filter out candidates who should not be considered at all
- Determine a cut-off point (if possible) that would work for other roles without losing high-potential candidates
- Explore ideas for automating this procedure to reduce or eliminate human bias
Column | Data Type | Comments |
---|---|---|
id | Numeric | Unique identifier for the candidate |
job_title | Text | Job title for the candidate |
location | Text | Geographic location of the candidate |
connections | Text | Number of LinkedIn connections for the candidate |
Connections over 500 are encoded as "500+". Some candidates did not list a specific location, only their country, so I substituted capital cities or geographic centers to represent those countries.
There are no nulls in the dataset. There are 104 total observations.
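To make the connection counts usable for plotting and modeling, the "500+" entries need a numeric form. Below is a minimal sketch of one way to handle this in pandas; the file name and the flag column are illustrative assumptions, and only the `connections` column name comes from the dataset.

```python
import pandas as pd

# Hypothetical file name; the actual path may differ.
df = pd.read_csv("potential_talents.csv")

# "500+" is not numeric, so strip the "+" and convert, keeping a flag so the
# information that the true count is censored at 500 is not lost.
df["has_500_plus"] = df["connections"].astype(str).str.strip().eq("500+")
df["connections_num"] = (
    df["connections"].astype(str).str.strip().str.rstrip("+").astype(int)
)
```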
Most applicants have more than 500 connections (n=44). But if we look at Figure 1 at those who have fewer than 500 connections, the majority of this group have around 50:
Figure 1: Histogram of Connections
Viewing the data as a boxplot in Figures 2 and 3 shows this pattern well:
Figure 2: Boxplot of all candidates' connections
Figure 3: Boxplot of candidates who have fewer than 500 connections
The dataset includes location data, so it is worth seeing where in the world the candidates are located. Figure 4 shows a choropleth of the candidates' locations.
Figure 4: Choropleth of candidates' locations. Three US-based candidates did not provide a city so they are not included in this map.
I defined a preprocessor that didn't include lemmatization so as to preserve the specific words included in the job titles. After running the preprocessor over the job titles, I plotted the 10 most-common words in the corpus:
Figure 5: Top 10 most-common words in the candidates' job titles
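For reference, here is a minimal sketch of the kind of preprocessor described above (lowercasing, punctuation removal, stopword removal, no lemmatization) and the top-10 count; the exact pipeline in the notebook may differ, and the NLTK stopword list is an assumption.

```python
import re
from collections import Counter

from nltk.corpus import stopwords  # assumes the NLTK stopword corpus is downloaded

STOPWORDS = set(stopwords.words("english"))

def preprocess(text: str) -> list[str]:
    """Lowercase, keep alphabetic tokens, drop stopwords -- no lemmatization,
    so the exact words in the job titles are preserved."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

# Count the most common words across all job titles.
corpus_tokens = [tok for title in df["job_title"] for tok in preprocess(title)]
print(Counter(corpus_tokens).most_common(10))
```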
It's fun to make a wordcloud, which shows all the words in the corpus sized by their prevalence. We can see that "human", "resources", and "aspiring" all appear frequently in the dataset.
Figure 6: Wordcloud of most common words in the candidates' job titles
As a reminder, we are currently helping the company understand which candidates are a best fit for their human resources position. As such, the company is focusing on two search terms:
- "Aspiring human resources", or
- "Seeking human resources"
We will use the same preprocessor as above, without lemmatization, and will use the cosine similarity to determine the similarity between the job titles and the search terms.
We used five methods for text embedding:
- Tfidf
- Word2Vec
- GloVe
- fastText
- SBERT
The steps for these methods are similar (a short sketch in code follows the list):
- Load word embeddings
- Process job titles and search terms
- Calculate cosine similarity
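As a sketch of this pattern, here is roughly how the first method (Tfidf) scores candidates against the search terms with cosine similarity; the other methods follow the same shape but swap in their own embeddings. Column and variable names are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

search_terms = ["aspiring human resources", "seeking human resources"]
titles = df["job_title"].str.lower().tolist()

# Fit TF-IDF on the job titles plus the search terms so they share one vocabulary.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(titles + search_terms)

# Cosine similarity of every job title against each search term; use the max
# across the two terms as the candidate's similarity score for this method.
sims = cosine_similarity(vectors[: len(titles)], vectors[len(titles):])
df["tfidf_similarity"] = sims.max(axis=1)
```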
GloVe

The GloVe model (Global Vectors for Word Representation) is a word embedding model that provides dense vector representations of words, similar to Word2Vec. GloVe is trained using matrix factorization of a global word co-occurrence matrix. We used the 6B model, which can be downloaded from here.
fastText

fastText is another word embedding model, with the advantage that it can handle out-of-vocabulary (OOV) words using subword embeddings. Said another way, it can generate embeddings for words that are not in the training vocabulary, which is helpful for uncommon words or typos. We'll be using the Wiki News 300d vectors with subword information, trained on 16 billion tokens from Wikipedia (2017) and news data.
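To illustrate the OOV behavior, here is a small sketch using the `fasttext` Python package; the model filename is illustrative, and subword (OOV) lookups require the binary model format.

```python
import fasttext
import numpy as np

# Illustrative filename; subword lookups require the .bin model.
ft = fasttext.load_model("wiki-news-300d-1M-subword.bin")

# Both in-vocabulary and out-of-vocabulary words get vectors, because fastText
# composes embeddings from character n-grams.
vec_known = ft.get_word_vector("resources")
vec_typo = ft.get_word_vector("resorces")  # a misspelling, still embeddable

cos = np.dot(vec_known, vec_typo) / (np.linalg.norm(vec_known) * np.linalg.norm(vec_typo))
print(f"Cosine similarity between 'resources' and the typo: {cos:.3f}")
```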
SBERT

Sentence-BERT, or SBERT, is designed to generate sentence embeddings rather than the individual word embeddings of the previous four methods we've employed.
With five methods, I wanted to plot the similarity scores side by side. Comparing how each method changes the positions of the candidates was interesting, as the methods disagreed on which candidates had the highest similarity scores. You can view the comparison chart below:
Figure 7: Comparing the cosine similarity scores based on methods of text embedding of the candidates' job title to the search terms
What about comparing the candidates overall, across all the methods? I decided to come up with the following scoring system, sketched in code after the list:
- If you are in first place for a particular method, you get no penalty
- Second place gets one point, third place gets two points, etc.
- Any candidate equal to or beyond 10th place gets 10 penalty points
- Save the points for each candidate and for each method
- Take the mean of the points across all the methods
- Then plot the scores using this penalty or "golf"-style scoring method
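Here is a minimal sketch of this penalty scoring, assuming a dataframe `sim_df` with one similarity column per method (the column names are illustrative):

```python
import pandas as pd

methods = ["tfidf", "word2vec", "glove", "fasttext", "sbert"]
penalties = pd.DataFrame(index=sim_df.index)

for m in methods:
    # Rank 1 (highest similarity) -> 0 penalty points, rank 2 -> 1 point, ...
    ranks = sim_df[m].rank(ascending=False, method="min")
    # Anyone at 10th place or beyond gets a flat 10 penalty points.
    penalties[m] = (ranks - 1).where(ranks < 10, 10)

# Lower mean penalty across all methods = better overall candidate.
sim_df["golf_score"] = penalties.mean(axis=1)
best_overall = sim_df["golf_score"].sort_values()
```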
Figure 8: Best scoring candidates overall
Using this compound score, I applied a rank to the candidates. Looking at the ranks, I saw that the models were not perfect, as some candidates whose job titles match the spirit, if not the letter, of the search terms were ranked much lower than others.
RankNet, the first neural network that we will use in the next section, requires that the job titles be vectorized. I will use the final method, SBERT, as it is a robust sentence embedding method. Thankfully, we can reuse the dataframe that we computed in the SBERT section.
I will also manually adjust some rankings and save those to a new column.
The final step before Part 2 is to prepare the data for the neural network training. This involves three steps, sketched below:
- Generate pairs to rank the candidates
- Use `train_test_split` to partition the dataset into training and testing splits
- Save the splits to parquet
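A sketch of these three steps, assuming the SBERT dataframe has an embedding column and a manually adjusted rank column (both names are assumptions), and that `seed` is the project-wide random seed:

```python
from itertools import combinations

import pandas as pd
from sklearn.model_selection import train_test_split

emb = df["sbert_embedding"]   # assumed column of fixed-length vectors (lists/arrays)
rank = df["adjusted_rank"]    # assumed column with the manually adjusted ranks

# Every unordered pair of candidates becomes one training example; the label is
# 1 if candidate i should rank above candidate j (smaller rank = better).
pairs = []
for i, j in combinations(df.index, 2):
    pairs.append({
        "emb_i": list(emb[i]),
        "emb_j": list(emb[j]),
        "label": int(rank[i] < rank[j]),
    })
pairs_df = pd.DataFrame(pairs)

train_df, test_df = train_test_split(pairs_df, test_size=0.2, random_state=seed)
train_df.to_parquet("ranknet_train.parquet")
test_df.to_parquet("ranknet_test.parquet")
```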
RankNet, a pairwise ranking algorithm, is defined by its ranking methodology rather than by the specifics of its architecture. Its defining features are:
- Pairwise comparison of inputs
- Sigmoid-based probability output
- A binary cross-entropy loss that computes the error between predicted probabilities and true pairwise labels
- Backpropagation and gradient descent, which train the network by updating weights using gradients calculated from the pairwise ranking loss
I was curious if there was something inherent in the architecture of the neural network that made the algorithm RankNet. I learned that the architecture in fact is not fixed and we can:
- Add or remove layers
- Change the number of neurons in the hidden layers
- Adjust activation functions
- Change dropout rates
See below for an image of the architecture of the network. I opted for a sequential stack of layers involving the following:
- Fully-connected layer with 512 inputs and 512 outputs, which projects the input features into a higher-dimensional space
- Batch normalization transforms the activations to have zero mean and unit variance within each batch
- Dropout layer to randomly set 50% of the activations to zero during training, which prevents overfitting by introducing randomness in training
- LeakyReLU Activation introduces non-linearity and helps improve the model's learning capability
- Fully-connected layer with 512 inputs, outputting 256 to compress the feature space to focus on the most important representations
- Normalization and Dropout layers to stabilize training and reduce overfitting
- LeakyReLU Activation to reintroduce non-linearity
- Fully-connected output layer with 256 inputs and 1 output
I also defined a forward pass for the network, which takes two inputs and computes the difference in their scores. This is a key step as it helps to define the algorithm as RankNet.
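Here is a sketch of a PyTorch module matching the stack and the pairwise forward pass described above; the input dimension and dropout rate follow the description, and anything else is an assumption:

```python
import torch
import torch.nn as nn

class RankNet(nn.Module):
    """Scores a single item; the pairwise behavior lives in forward()."""

    def __init__(self, input_dim: int = 512):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(input_dim, 512),  # project the input features
            nn.BatchNorm1d(512),        # zero mean, unit variance per batch
            nn.Dropout(0.5),            # randomly zero 50% of activations
            nn.LeakyReLU(),
            nn.Linear(512, 256),        # compress the feature space
            nn.BatchNorm1d(256),
            nn.Dropout(0.5),
            nn.LeakyReLU(),
            nn.Linear(256, 1),          # single relevance score
        )

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        # The score difference is a logit: passed through a sigmoid (e.g. via
        # BCEWithLogitsLoss) it becomes P(item 1 ranks above item 2).
        return self.scorer(x1) - self.scorer(x2)
```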
Finally, I defined training and testing functions:
Training function
Uses batching, a binary cross-entropy loss calculated on logits, an optimizer for the backward pass, a sigmoid conversion from logits to probabilities, a 0.5 threshold on those probabilities for binary classification, and outputs the loss and accuracy for each epoch
Testing function
Uses a similar process involving batching, loss calculation, and conversion of logits to probabilities, then computes the average test loss and accuracy, and finally compiles the probability results into a dataframe
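A sketch of what the training side might look like, assuming a DataLoader that yields `(emb_i, emb_j, label)` batches; the testing function follows the same pattern without the backward pass:

```python
import torch

def train_epoch(model, loader, optimizer):
    """One epoch over pairwise batches; returns (mean loss, accuracy)."""
    criterion = torch.nn.BCEWithLogitsLoss()
    model.train()
    total_loss, correct, n = 0.0, 0, 0
    for x1, x2, labels in loader:
        optimizer.zero_grad()
        logits = model(x1, x2).squeeze(1)       # score differences (logits)
        loss = criterion(logits, labels.float())
        loss.backward()                          # backpropagate the pairwise loss
        optimizer.step()

        probs = torch.sigmoid(logits)            # logits -> probabilities
        preds = (probs >= 0.5).long()            # threshold at 0.5
        correct += (preds == labels.long()).sum().item()
        n += labels.size(0)
        total_loss += loss.item() * labels.size(0)
    return total_loss / n, correct / n
```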
To log how many times each candidate "won" the probability determination, I created a for loop to count each time one candidate scored a higher probability than their pair. In a similar fashion to Figure 8, I plotted the results:
In an effort to explore other ranking algorithms, we will now turn to LambdaRank, an evolution of the RankNet algorithm we worked with above. While RankNet optimizes pairwise accuracy, LambdaRank optimizes ranking metrics such as NDCG (Normalized Discounted Cumulative Gain), where a score of 1 means a perfect ranking of the items, and it does not require pairwise comparisons as input. MRR (Mean Reciprocal Rank) is another metric we track: it averages the reciprocal ranks (1 for first place, 1/2 for second place, 1/3 for third place, and so on), and values closer to 1 mean the model did a good job of ordering candidates.
NDCG checks not only whether the first item should be ranked higher than the second, but also how much swapping their order would improve the final ranking. The gain can be thought of this way: if a relevant item is placed close to the top, it contributes a greater gain than if it were placed towards the bottom.
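A toy example of both metrics, using scikit-learn's `ndcg_score` and a hand-rolled reciprocal rank (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.metrics import ndcg_score

# One query with 5 candidates: ground-truth relevance and a model's scores.
y_true = np.asarray([[3, 2, 1, 0, 0]])
y_pred = np.asarray([[2.5, 0.1, 2.0, 0.3, 0.0]])

print("NDCG@5:", ndcg_score(y_true, y_pred, k=5))

# Reciprocal rank: 1 / (position of the first truly relevant item in the
# predicted ordering); MRR averages this over all queries.
order = np.argsort(-y_pred[0])
first_relevant = next(i + 1 for i, idx in enumerate(order) if y_true[0, idx] > 0)
print("Reciprocal rank:", 1 / first_relevant)
```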
LambdaRank uses lambdas, gradients weighted by how much swapping a pair of items would change the ranking metric, to focus the model on improving overall ranking quality, while RankNet optimizes a pairwise loss and only cares about the order within individual pairs.
You can read more about LambdaRank here; there is also a short snippet of information about it from Microsoft, whose researchers designed the algorithm.
This repository gives a good example of how to implement the algorithm.
Our LambdaRank implementation uses LightGBM's LGBMRanker and requires the following:
- Feature matrix (X) of each item
- Relevance scores (y) for each item
- "Groups," or the number of items per query group
We saved our SBERT embedding work from Part 1 to pick up where we left off, this time using LambdaRank.
The steps for this final part of the project were as follows:
- Feature matrix (X): Extracting each `job_title` tensor and converting it to a numpy array
- Relevance Scores (y): Inverting the ranks so that if the ranks are 1, 2, 3, the scores become 3, 2, 1
- Group: defined in the next step
I split the data into training, validation, and testing sets. I defined the groups based on the number of candidates within each set.
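In sketch form, with each split treated as a single query over its candidates (the split variable names are assumptions):

```python
# LGBMRanker needs to know how many rows belong to each query group.
# Here every split is one query containing all of its candidates.
group_train = [len(X_train)]
group_val = [len(X_val)]
group_test = [len(X_test)]
```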
I used the following parameters to train the model:
Parameter | Detail |
---|---|
objective | lambdarank |
boosting_type | gbdt |
metric | ndcg |
n_estimators | 100 |
learning_rate | 0.1 |
importance_type | gain |
random_state | seed |
Note: I defined a random seed at the beginning of this project and used it throughout.
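Training with those parameters looks roughly like the sketch below; the split and group variable names are assumptions carried over from the previous step:

```python
import lightgbm as lgb

ranker = lgb.LGBMRanker(
    objective="lambdarank",
    boosting_type="gbdt",
    metric="ndcg",
    n_estimators=100,
    learning_rate=0.1,
    importance_type="gain",
    random_state=seed,
)

ranker.fit(
    X_train,
    y_train,
    group=group_train,
    eval_set=[(X_val, y_val)],
    eval_group=[group_val],
    eval_at=[5],              # report NDCG@5 on the validation set
)

# Predicted relevance scores for the held-out candidates.
test_scores = ranker.predict(X_test)
```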
The model achieved an NDCG@5 score of 0.9223, which is quite good! The MRR score was 1. This is exceptionally high and probably suggests that the model is overfitting in some way.
I also ran a quick Optuna hyperparameter optimization search, but the resulting NDCG@5 score was lower, at 0.8969, and the MRR score stayed at 1.
Given that the base model achieved a higher NDCG@5 score, I used the resulting relevance to plot a figure that compares the predicted versus true relevance:
If we look at the predicted relevance first, we see that each bar represents a candidate ID with its predicted score on the y-axis. The model shows that candidates 4 and 6 have the most relevance to the target, while 16 and 19 have the least.
Comparing the predicted relevance to the true relevance shows that the model does not fully align with the true labels. Although it achieved high NDCG and MRR scores, those numbers rest on only 21 candidates in the test set and 104 candidates in the entire dataset.
Below we can see the candidates and their associated job titles, in the same style of figure that we used for the RankNet.
Finally, let's compare the results of LambdaRank against RankNet. Although I was unable to hide the green lines beyond the bounds of the box, we can see that the two methods produce different results. The point of this jumbled figure is to show that the two methods clearly focus on different aspects of the data when assigning their ranks. I will also add that the dataset was small; as the company adds more candidates to their search, the models may become more consistent with each other. Furthermore, when I manually adjusted some of the rankings, I may have been inconsistent, leaving the models without a strong signal to focus on.
This project covered a lot of ground. I used five word- and sentence-embedding techniques (Tfidf, Word2Vec, GloVe, fastText, and SBERT) and calculated an ensemble score to rank candidates. Next, I used PyTorch to implement the RankNet algorithm and rank the job candidates. Finally, I used the LambdaRank model with LGBMRanker from LightGBM to generate candidate rankings. Thanks to this work, the company has a suite of tools at their disposal to compare candidates.