
Recruiting Systems the AI way: Deep Neural Network for Extracting Information from Job Descriptions

A deep neural network to extract information from job descriptions.

Recruiting software is evolving rapidly, and these tools are becoming more complex and more intelligent thanks to the massive amount of information available on the internet.

Let’s take a deeper look from a marketing point of view.

By 2017, Indeed claimed to have over 100 million resumes, which made the platform one of the biggest sources of hires in the US that year. Other platforms like LinkedIn, Glassdoor, and CareerBuilder are becoming popular too, together forming the second biggest source of external hires in the US.

We have similar platforms in MENA like Wuzzuf, Bayet, Forasna and Tanqeeb.

Since AI & automation tools are set to become top recruiting software, in this post we discuss how a deep neural network can be used to build such a tool to extract information from massive data. Our scope will be job descriptions posted by employers or HR facilities.

Named Entity Recognition

Named Entity Recognition (NER) is the process of locating and classifying named entities in text into predefined categories such as names of persons, locations, organizations, etc. In some research literature it may be called sequence labeling.

So, in this post we are building a named entity recognition model to classify the skills required in a job description paragraph.

Skills:
==========================
Soft Skills, e.g. "customer service, verbal communication skills"
Technical Skills, e.g. "Java, HTTP, networking, Linux/UNIX, C++, C#, AngularJS, JavaScript, Bootstrap, .. etc"

Dataset

I had been searching for a while for an annotated dataset, since our task (named entity recognition) follows a supervised learning approach and therefore requires the entity classes to be predefined. I scraped over 20k job posts from Indeed and CareerBuilder.

Still having no annotated data, I built a simple command-line annotation tool along with a lookup table to semi-automate the process and save time.
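
A minimal sketch of what such lookup-based pre-annotation can look like (the table contents and function name here are illustrative, not the actual tool):

## sketch: lookup-based pre-annotation to speed up manual labeling
TECH_LOOKUP = {"java", "linux", "docker", "angularjs", "c++"}

def pre_annotate(tokens):
    """Give every token a first-pass tag from the lookup table,
    so the human annotator only has to correct the misses."""
    return [(tok, "B-TECH" if tok.lower() in TECH_LOOKUP else "O")
            for tok in tokens]

print(pre_annotate("experienced Java developer with Docker skills".split()))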

The IOB tagging scheme is followed:

I-TAG: Inside the chunk
B-TAG: Beginning of the chunk
O: Outside the chunk

Used Tags

B-TECH: Technical Skill (Beginning)
I-TECH: Technical Skill (Inside)
B-SOFT: Soft Skill (Beginning)
I-SOFT: Soft Skill (Inside)
B-CERT: Certification (Beginning)
I-CERT: Certification (Inside)
B-YEXP: Years of Experience (Beginning)
I-YEXP: Years of Experience (Inside)
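
For example, a short annotated fragment under this scheme might look like the following (a made-up illustration, not a row from the actual corpus):

verbal          B-SOFT
communication   I-SOFT
skills          I-SOFT
with            O
5               B-YEXP
years           I-YEXP
of              I-YEXP
Ruby            B-TECH
on              I-TECH
Rails           I-TECH
experience      O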

There is a public dataset available on Kaggle that is useful for anyone who wants to get started, and I was also impressed to find that Wuzzuf has published a corpus on Kaggle too.

Tool Belt


keras
pandas
matplotlib
sklearn

pip install keras
pip install pandas
pip install matplotlib
pip install scikit-learn
pip install git+https://www.github.com/keras-team/keras-contrib.git

# import dependencies
import os
import json
import random
import string
import pickle

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from keras_contrib.layers import CRF

Let's read some unstructured data in JSON format, then construct a dataframe accordingly.

## Explore ONET Organization Dataset

def read_from_multiple(files):
    """Read data from multiple JSON-lines files, parse each line,
    and merge into a single list of JSON objects."""
    json_objects = []
    for fname in files:
        data_path = os.path.join('../data', fname)
        with open(data_path) as fp:          # one JSON object per line
            for line in fp:
                json_objects.append(json.loads(line))
    return json_objects

files = [
    'openjobs-jobpostings.jan-2017.json',
    'openjobs-jobpostings.mar-2016.json',
    'openjobs-jobpostings.oct-2016.json'
]

## parsed json files
data = read_from_multiple(files)

## dataframe construction
df = pd.DataFrame(data)

## select columns
df = df[['title', 'normalizedTitle', 'employmentType', 'jobDescription']]

## discard rows with an empty employmentType
df = df[df['employmentType'].str.len()!=0]

## unwrap the single-element list on each row so values are hashable in pandas operations
df.employmentType = df.employmentType.apply(lambda x: x[0])
A random sample of the annotated token table (data_df, built with the annotation tool described above) looks like this:

        SENT   WORD       TAG
437649  1084   Web        O
896     1      business   O
682842  1672   managers   O
408038  1001   Creo       B-TECH
571208  1430   our        O
720411  1758   their      O
36027   79     company    O
390337  953    Ruby       B-TECH
548633  1370   mix        O
584272  1463   care       O
148564  335    roles      O
559345  1398   mentoring  O

Let’s write a class that we will use to reconstruct the full sentence from the pandas tabular format.

class Tokens2Sent:
    """
    Convert the tabular format of tokens to full sentences.
    """
    def __init__(self, data):
        self.n_sent = 0
        self.data = data
        self.data['SENT'] = data['SENT'].apply(lambda r: int(r))
        ## group the tokens of each sentence id into (word, tag) pairs
        self.groups = self.data.groupby("SENT").apply(
             lambda s: [(w, t) for w, t in zip(s["WORD"].values.tolist(),
                                               s["TAG"].values.tolist())])
        self.sentences = [s for s in self.groups]

    def get_paragraph(self, idx=None):
        """Return sentence idx (or the default self.n_sent) as plain text."""
        if idx is None:
            sent = self.data[self.data['SENT']==self.n_sent].copy().WORD
        else:
            sent = self.data[self.data['SENT']==idx].copy().WORD
        return " ".join(sent.values)

token2sent = Tokens2Sent(data_df)
print(token2sent.get_paragraph(59))

This position is open as of 8 4 2018 Sr Fullstack Engineer Ruby on Rails Conveniently located in Capitol Hill in Seattle we are a fast growing Seattle based company that offers a software solution that allows industries such as gyms studios and schools to grow their business and develop deeper client relationships We are currently looking for a Senior Software Design Engineer with 7 years software development experience and past success translating UI/UX design wireframes to actual code If you have produced excellent user interfaces for a great application we would love the chance to tell you more about this exciting opportunity …

words = list(set(data_df.WORD.values))
words.append('<EOS>')   # special padding token
n_words = len(words)
print(n_words)

tags = list(set(data_df.TAG.values))
n_tags = len(tags)
print(n_tags)

Let's build dictionaries that map each token to its corresponding id, and likewise for the tags; we will use these later to build the lookup table in the Embedding layer of our network.

## mapping between words/tags and their unique ids
word2idx = {w: i for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}

Our input will be a sequence of tokens and our output a sequence of tags, and both naturally vary in length from one paragraph to another. That is a problem, since we need a fixed-length vector to represent each job description sequence.

Consider the following sequences:

Software Engineer is required with the following skills html, css, c++. We are hiring experienced android developer

[Software, Engineer, is, required, with, the, following, skills, html, css, c++]
[We, are, hiring, experienced, android, developer, <EOS>, <EOS>, <EOS>, <EOS>, <EOS>]
## keras utils
from keras.preprocessing.sequence import pad_sequences
pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0)
# https://keras.io/preprocessing/sequence/#pad_sequences

## Pytorch utils
from torch.nn.utils.rnn import pad_sequence
pad_sequence(sequences, batch_first=False, padding_value=0)
# https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pad_sequence

We solve the variable-length problem by padding with a special token (<EOS>) that stands for blank space, as in the sequences above.

Another problem can show up in sequence-to-sequence modeling: high variance between very short and very long sentences. The padding token then dominates the short sequences, and its high frequency can cause loss of information.

So let's compute some statistics and plot a histogram to see how to handle this.
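
A minimal sketch of how these statistics and the histogram below can be produced, assuming the token2sent object built above:

## sequence length statistics and histogram
lengths = [len(s) for s in token2sent.sentences]
plt.style.use("ggplot")
plt.hist(lengths, bins=50)
plt.xlabel("sequence length (tokens)")
plt.ylabel("frequency")
plt.show()

## fraction of sequences fully covered by a 150-token cap
print(np.mean(np.array(lengths) <= 150))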

Sequence Length Frequency distribution

From the histogram, it is clear that most sequences are between 75 and 100 tokens long. To keep most of the information without excessive padding, we will cap sequences at a maximum of 120-150 tokens.

Now let’s write some Python code to prepare the data in the appropriate format. We also split our data into train and test chunks, and we will use the test set for evaluating our model later.

## input parameters
SEQ_MAX_LEN = 150
INP_MAX_LEN = n_words
OUT_MAX_LEN = n_tags

## variable definitions
X = [[word2idx[w[0]] for w in s] for s in token2sent.sentences]
Y = [[tag2idx[w[1]] for w in s] for s in token2sent.sentences]
print(X[59]); print(Y[59])

## data preprocessing and sequence padding
## inputs are padded with the <EOS> index (n_words-1), labels with the "O" tag
X = pad_sequences(maxlen=SEQ_MAX_LEN, sequences=X, padding="post", value=n_words-1)
Y = pad_sequences(maxlen=SEQ_MAX_LEN, sequences=Y, padding="post", value=tag2idx["O"])
Y = np.array([to_categorical(i, num_classes=n_tags) for i in Y])

## splitter for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20)

Now it is time for the actual network model. In this section you probably want to know what the model architecture looks like and how the RNN consumes sequential data to make its predictions.

Sequence Tagging Neural Network (BiLSTM-CRF)
## note: padding uses index n_words-1, so mask_zero=True does not actually mask the padded positions
inp   = Input(shape=(SEQ_MAX_LEN, ))
model = Embedding(input_dim=INP_MAX_LEN, output_dim=20, input_length=SEQ_MAX_LEN, mask_zero=True)(inp)
model = Dropout(0.1)(model)
model = Bidirectional(LSTM(units=64, return_sequences=True, recurrent_dropout=0.1))(model)
model = TimeDistributed(Dense(OUT_MAX_LEN, activation="relu"))(model)  # per-timestep scores feeding the CRF

crf = CRF(n_tags)  # CRF layer
out = crf(model)   # output

model = Model(inp, out)

model.compile(optimizer='rmsprop', loss=crf.loss_function, metrics=[crf.accuracy])
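
Before training, it helps to sanity-check the layer stack and tensor shapes:

## inspect the architecture: Embedding -> BiLSTM -> TimeDistributed Dense -> CRF
model.summary()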

## note: fitting on the full X here leaks the held-out split; use X_train, y_train to keep the test set untouched
model.fit(X, Y, batch_size=32, epochs=5, validation_split=0.1)
Train on 1800 samples, validate on 201 samples
Epoch 1/5
1800/1800 [==============================] - 63s 35ms/step - loss: 0.1959 - acc: 0.8950 - val_loss: 0.0934 - val_acc: 0.9766
Epoch 2/5
1800/1800 [==============================] - 63s 35ms/step - loss: 0.0751 - acc: 0.9789 - val_loss: 0.0761 - val_acc: 0.9777
Epoch 3/5
1800/1800 [==============================] - 60s 33ms/step - loss: 0.0544 - acc: 0.9808 - val_loss: 0.0516 - val_acc: 0.9804
Epoch 4/5
1800/1800 [==============================] - 57s 32ms/step - loss: 0.0364 - acc: 0.9856 - val_loss: 0.0355 - val_acc: 0.9869
Epoch 5/5
1800/1800 [==============================] - 62s 34ms/step - loss: 0.0275 - acc: 0.9893 - val_loss: 0.0378 - val_acc: 0.9872

Conditional Random Fields (CRF) Layer

CRFs are often used when there are dependencies between the outputs. It has been shown that combining an LSTM network with a CRF layer produces higher accuracy on sequence tagging (labeling) tasks, since the labels of neighboring inputs can inform the prediction for the current input. With a bidirectional LSTM (BiLSTM) we can efficiently use both past and future context to predict the current tag, giving better and more reasonable predictions.

Let’s elaborate with an example. If we have the following sequence:

we are hiring frontend developer with the following skills UI/UX Design, Angular, VueJS and Graphic Design.

then the phrase frontend developer can affect the predictions for the other tags like Angular and Graphic Design, since they are correlated and dependent on each other (they frequently appear in the same context).
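
Formally, in the BiLSTM-CRF formulation of the paper linked below, a linear-chain CRF scores a candidate tag sequence $y = (y_1, \dots, y_n)$ for a sentence $x$ as

$$s(x, y) = \sum_{i=1}^{n} \left( A_{y_{i-1},\, y_i} + P_{i,\, y_i} \right), \qquad p(y \mid x) = \frac{e^{s(x, y)}}{\sum_{y'} e^{s(x, y')}}$$

where $P_{i, y_i}$ is the emission score the BiLSTM assigns to tag $y_i$ at position $i$ and $A$ is a learned matrix of tag-transition scores. Training maximizes the log-likelihood of the gold sequence, and decoding picks the highest-scoring sequence with the Viterbi algorithm.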

Photo borrowed from https://arxiv.org/pdf/1508.01991v1.pdf

Now let’s look at the model’s learning history plot.

hist = pd.DataFrame(model.history.history)
plt.style.use("ggplot")
plt.figure(figsize=(12,12))
plt.plot(hist["acc"], label="train")
plt.plot(hist["val_acc"], label="validation")
plt.legend()
plt.show()

Time to make some predictions with what the network has learned to do, which is to predict a sequence of tags.

i = 59
pred = model.predict(np.array([X[i]]))
pred = np.argmax(pred, axis=-1)
true = np.argmax(Y[i], axis=-1)
print("|{:25} |{:15} |{:15}|".format("Word", "True", "Pred"))
print('=='*30)
for w, t, p in zip(X[i], true, pred[0]):
    if words[w] != '<EOS>':   # skip the padding positions
        print("|{:25} |{:15} |{:15}|".format(words[w], tags[t], tags[p]))
Prediction Result
Word True Pred
Software B-TECH O
Design I-TECH O
Engineer O O
with O O
7 O O
years O O
software B-TECH O
development I-TECH O
experience O O
and O O
past O O
success O O
translating O O
UI/UX B-TECH B-TECH
design I-TECH I-TECH
wireframes O O
to O O
actual O O
code O O
If O O
you O O
have O O
produced O O
excellent O O
user O O
interfaces O O
the O O
art O O
of O O
programming O O
2 O O
Responsible O O
for O O
the O O
translation O O
of O O
the O O
UI/UX B-TECH B-TECH
design I-TECH I-TECH
wireframes O O
to O O
actual O O
code O O
that O O
Ensure O O
the O O
technical O O
feasibility O O
of O O
UI/UX B-TECH B-TECH
designs I-TECH O
8 O O
Assure O O
that O O
all O O
user O O
input O O
is O O
validated O O
before O O
submitting O O
to O O
back O O
end O O
What O O
You O O
Need O O
for O O
this O O
Position O O
7 O O
years O O
of O O
software B-TECH O
development I-TECH O
experience O O
Proficient O O
in O O
Ruby B-TECH B-TECH
on I-TECH O
Rails I-TECH B-TECH
AngularJS B-TECH B-TECH
React B-TECH B-TECH
Proficient O O
in O O
HTML B-TECH B-TECH
CSS B-TECH B-TECH
Understanding O O
of O O
server O O
side O O
CSS B-TECH B-TECH
platforms O O
LESS O O
SASS B-TECH B-TECH
Good O O
understanding O O
of O O
AJAX B-TECH B-TECH
JSON B-TECH B-TECH
• O O
Nice O O
to O O
haves O O
Experience O O
with O O
AWS B-TECH B-TECH
Experience O O
with O O
Docker B-TECH B-TECH
or O O
other O O
container O O
based O O
platforms O O
apply O O
today O O
Required O O
Skills O O
Ruby B-TECH B-TECH
On I-TECH O
Rails I-TECH B-TECH
AngularJS B-TECH B-TECH
React B-TECH B-TECH
HTML B-TECH B-TECH
CSS B-TECH B-TECH
AJAX B-TECH B-TECH
JSON B-TECH B-TECH
LESS O B-TECH
SASS B-TECH B-TECH
AWS B-TECH B-TECH
Docker B-TECH B-TECH
UI/UX B-TECH B-TECH
JQuery B-TECH B-TECH
If O O
you O O
are O O
a O O
good O O
fit O O
for O O
the O O
Sr O O
Full O O
Stack O O
Engineer O O
Ruby B-TECH B-TECH
on I-TECH O
Rails I-TECH O
100 O O
WORK O O
FROM O O
HOME O O
position O O
and O O
have O O
a O O
background O O
that O O
includes O O
Ruby B-TECH B-TECH
On I-TECH O
Rails I-TECH B-TECH
AngularJS B-TECH B-TECH
React B-TECH B-TECH
HTML B-TECH B-TECH
CSS B-TECH B-TECH
AJAX B-TECH B-TECH
JSON B-TECH B-TECH
LESS O B-TECH
SASS B-TECH B-TECH
AWS B-TECH B-TECH
Docker B-TECH B-TECH
UI/UX B-TECH B-TECH
JQuery B-TECH B-TECH
and O O
you O O
are O O
interested O O
in O O
form O O
upon O O
hire O O
Ruby B-TECH B-TECH
On I-TECH B-TECH
Rails I-TECH B-TECH
AngularJS B-TECH B-TECH
React B-TECH B-TECH
HTML B-TECH B-TECH
CSS B-TECH B-TECH
AJAX B-TECH B-TECH
JSON B-TECH B-TECH
LESS O B-TECH
SASS B-TECH B-TECH
AWS B-TECH B-TECH
Docker B-TECH B-TECH
UI/UX B-TECH B-TECH
JQuery B-TECH B-TECH

Now it is time to serialize the model for later usage.

### save the computation results
model.save_weights('../models/lstm_tagger.h5')
model.save('../models/lstm_tagger_model.model')
with open('../models/tagger_vocab.pickle', 'wb') as f:
    pickle.dump(word2idx, f)
with open('../models/tagger_tags.pickle', 'wb') as f:
    pickle.dump(tag2idx, f)
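
To reload the serialized model later, the custom CRF layer has to be registered at load time; a minimal sketch, assuming keras_contrib's loss and metric helpers:

## reload the model with the CRF layer registered via custom_objects
from keras.models import load_model
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy

model = load_model('../models/lstm_tagger_model.model',
                   custom_objects={'CRF': CRF,
                                   'crf_loss': crf_loss,
                                   'crf_viterbi_accuracy': crf_viterbi_accuracy})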

I built a web service using Flask that is consumed via a web interface.
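
A minimal sketch of what such a service can look like (the route and payload names are illustrative, not the author's actual implementation; it assumes model, word2idx, tags, n_words, and SEQ_MAX_LEN from above):

## minimal Flask tagging endpoint
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/tag', methods=['POST'])
def tag():
    text = request.json['description']
    tokens = text.split()
    ## unknown words fall back to the padding token for simplicity
    seq = [[word2idx.get(w, word2idx['<EOS>']) for w in tokens]]
    seq = pad_sequences(maxlen=SEQ_MAX_LEN, sequences=seq, padding="post", value=n_words-1)
    pred = np.argmax(model.predict(seq), axis=-1)[0]
    return jsonify([(w, tags[p]) for w, p in zip(tokens, pred)])

if __name__ == '__main__':
    app.run()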

Flask Service Demo

Additional Resources



Ahmed Nabil

Agnostic Software Engineer, Independent thinker with a hunger for challenge and craft mastery
