Ernesto.Net | Data Science Training — Build an End-to-End Job Skills Ranker with Live Data


In this article, I will show you how to build an online Job Skill Ranker AI model by scraping job descriptions from indeed.com and then extracting job skills from the collection of descriptions. This is part of the Data Science Master Courses at Ernesto.Net. Here is what I am going to show you:

1. Part 1: Scrape Indeed.com to get the desired number of job descriptions, and wrap that logic in a function we can reuse later.

2. Part 2: Create a function that handles the processing of the list of job descriptions we get from Part 1. In short, this function will return a dictionary containing the extracted skill sets, which we will later display on the UI.

3. Part 3: Now that we have two functions that handle the data extraction from Indeed and the skill extraction from the job descriptions, we need a UI. We will use Flask (Python) to create a UI that lets the user enter a job title, the location of the job, and the number of jobs to extract. Once the user clicks the submit button, we first call the function from Part 1 to get the data from Indeed and then call the function from Part 2 to process that data. Finally, the results from Part 2 are displayed to the user: the predicted keywords with their associated relevance in the job market.

4. Part 4: Finally, I will show you how to put the Flask app in a Docker image so that it can be run in the cloud.

So, I hope you have understood the overall objective of each of the parts discussed above. Now, let us build the app by going through each part one by one.

Here is the consolidated code on my GitHub: Job-Skill-app. At any stage, if you feel you are getting overwhelmed, please refer to the code. This is a large app with a lot of nuances and key points, which is why I felt it necessary to put the code link at the beginning.

Before that, let me give you a little teaser by showing you what the app will look like:

Ernesto.Net Job Ranker

Here, everything is self-explanatory. Let me enter some inputs: Data Scientist as the job title, 20 as the number of jobs, and New Delhi as the location, and let us see what we get.

Here is what I get after hitting the submit button:

As you can see, each unigram and bigram has an associated “relevance score” which shows us how important that skill is for that job title. We will learn to build this app from scratch.

Some familiarity with HTML and CSS would help, as the look and feel of the app comes from these tools. I will provide all the code for this app; you just need to focus on the main app-building parts, STEP 1 and STEP 2, in this tutorial.

Alright, let us get started with STEP 1.

Step 1: Getting the Data (Scraping Indeed.com)

Before I explain anything, let me first show you the two functions that are part of STEP 1.

The first function processes the two text inputs that the user enters. For example, if the user enters the job title “Data Scientist”, the process_input function joins those two words with “+”. First, make sure you understand the function: it splits the text input on spaces and then, with a loop, adds “+” between the pieces. So if I enter the job title “Senior Data Scientist”, this function outputs “Senior+Data+Scientist”.
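The full implementation is in the GitHub repository; here is a minimal sketch of the same idea (the exact code in the repository may differ slightly):

def process_input(text):
    # Split the raw user input on spaces and rejoin the pieces with "+"
    # so the result can be dropped straight into the Indeed query string.
    words = text.split(" ")
    return "+".join(words)

# process_input("Senior Data Scientist") -> "Senior+Data+Scientist"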

Now, you might ask why we even need to do this. To answer that, let me take you to the Indeed website. This is the home page of Indeed.

Here, I want you to notice two input fields. They let us enter the Job Title and the Location. Now, let me try to enter the following:

Job Title: Data Scientist

Location: New Delhi

It will obviously look like this:

Once I hit submit, I will obviously be shown the list of jobs in New Delhi, but I instead want to draw your attention to the URL format, which is shown below:
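Judging by the scraping code later in this article, the search URL has roughly this shape (Indeed may append extra parameters of its own):

https://indeed.com/jobs?q=Data+Scientist&l=New+Delhi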

You might have guessed by now why I had to process the input this way: it lets me concatenate any job title and location into the URL itself and then make a request to that URL. I pass the results of the process_input function to the “q” and “l” parameters, and we get our URL. All the other parts of the URL stay the same.

I want you to note one more thing. If you search for any job, only about 15 jobs are listed per page. If you navigate to the second page, the URL changes, but do not worry, because it does not change drastically. For the same search, here are the second page’s and third page’s URLs respectively:
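Again going by the URL format used in the scraping code, they will look roughly like this for the same query:

https://indeed.com/jobs?q=Data+Scientist&l=New+Delhi&start=10
https://indeed.com/jobs?q=Data+Scientist&l=New+Delhi&start=20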

I want you to notice the value of the start parameter. On the second page it is 10, on the third page it is 20, and so on. To generalize, on the nth page the start parameter will be 10 * (n-1).

I am telling you this because these are precisely the things implemented in the second function. That function is called “get_the_list_of_text” and it takes the parameters how_many_jobs (the number of jobs we want to pull out of Indeed), job (the title), and location (self-explanatory). Here is the logic of the function:

from robobrowser import RoboBrowser

def get_the_list_of_text(how_many_jobs, job, location):
    multiple_of_fifteen = int(how_many_jobs / 15)
    required_list = []

    j = process_input(job)
    l = process_input(location)
    if multiple_of_fifteen != 0:
        # More than one page of results is needed, so walk through the pages.
        for i in range(multiple_of_fifteen):
            if i == 0:
                url = "https://indeed.com/jobs?q={0}&l={1}".format(j, l)
            else:
                url = "https://indeed.com/jobs?q={0}&l={1}".format(j, l) + "&start=" + str(i * 10)
            browser = RoboBrowser()
            browser.open(url)
            listOfJobs = browser.find_all("div", {"class": "jobsearch-SerpJobCard"})
            for p in range(len(listOfJobs)):
                # Open each job's detail page, grab the description, then go back.
                browser.follow_link(listOfJobs[p].find("h2").find("a"))
                required_text = browser.find("div", {"class": "jobsearch-JobComponent-description"}).text
                browser.back()
                required_list.append(required_text)
    else:
        # Everything we need fits on the first page of results.
        url = "https://indeed.com/jobs?q={0}&l={1}".format(j, l)
        browser = RoboBrowser()
        browser.open(url)
        listOfJobs = browser.find_all("div", {"class": "jobsearch-SerpJobCard"})
        for i in range(int(how_many_jobs)):
            browser.follow_link(listOfJobs[i].find("h2").find("a"))
            required_text = browser.find("div", {"class": "jobsearch-JobComponent-description"}).text
            browser.back()
            required_list.append(required_text)
    return required_list

First, we check the how_many_jobs parameter. If it is 15 or more, we need to walk through multiple result pages; if it is less than 15, we can stick to the first page. But how do we check this in Python?

In Python, the division operator (/) returns a float, and wrapping the result in int() truncates it toward zero. So if how_many_jobs divided by 15 is less than 1, int() turns it into zero; we check for that zero value and, in that case, scrape only the jobs on the first page.
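A quick illustration of that check:

int(10 / 15)   # 0 -> stay on the first page
int(20 / 15)   # 1 -> walk through the result pages
int(45 / 15)   # 3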

Now, if it is 15 or more, we loop through the result pages, one page per multiple of fifteen.

Now, we need to understand the web-scraping part.

Once we have the required URL, how do we scrape all the jobs on that particular page? To do that, we use a Python package called RoboBrowser. First, we create a RoboBrowser object and then pass the current URL to its open method. Once a connection has been established, we select all of the job div sections on the page with the find_all method. Here, we need to spend some time.

You first need to view the source HTML of the results page on Indeed. Hit “CTRL + SHIFT + I” and the inspector will open in Chrome on Windows. You can also open the inspector by right-clicking on a specific section. It will look something like this:

On the right, you can see the HTML source code of Indeed, while on the left you can see the actual website. The HTML content is huge, and working out which part of the HTML corresponds to which part of the page is a big task. Luckily for us, almost all modern browsers have a built-in feature that lets you hover over the HTML and highlights the corresponding section of the page on the left. This way, you can work out which part to focus on.

Now, we need to select the link of each of the jobs and get their respective content. Let me show you one of the job’s HTML pages.

Notice the highlighted yellow, which corresponds to the link we want to open. We need to get that link and open it using RoboBrowser’s follow_link method. But before that, we need a list of all such links on a given page; in our case, there should be about 15 per page. We get them with the code below:
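This is the same selection that appears inside get_the_list_of_text above:

browser = RoboBrowser()
browser.open(url)
# Each job card on the results page lives in a div with this class.
listOfJobs = browser.find_all("div", {"class": "jobsearch-SerpJobCard"})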

Here, we collect all the job cards into a variable called listOfJobs by searching for the divs with that specific class. You need to do a little bit of work yourself here by inspecting the HTML you want to retrieve.

Once we have that, we loop through each job, get the link we talked about earlier, open it using RoboBrowser’s follow_link method, and get the description from that new page. Don’t forget to come back to the page where you got the link by calling RoboBrowser’s back() method. Once you understand this part, we are done with this function, because the else branch does the same thing for a single page rather than multiple pages.

Finally, we return the list of job descriptions.

Step 2: Extracting the Skills from the List of Job Descriptions

Let us assume that we have a list of job descriptions and we are interested in extracting the top skills from those descriptions. In this part, I have created the following functions.

1. The read_skills_file function reads a corpus of skills that I pulled from LinkedIn. You can find this file at the GitHub link at the beginning of this tutorial. It reads the file, splits it on newline characters, and lowercases each skill.
def read_skills_file():
    # folder_for_skill is the path to the static folder that holds the skills
    # corpus (defined elsewhere in the app).
    with open(folder_for_skill + "/linkedin skill.txt", "r", encoding="utf-8") as file:
        skills = file.read()
    skills = skills.split("\n")
    skills = [i.lower() for i in skills]
    return skills

2. The clean_job_desc function cleans the list of job descriptions obtained from Step 1. It joins them into one text, lowercases it, removes all the stopwords (which cannot be skills), and returns a clean list of words.

def clean_job_desc(list_of_job_desc):
    # stopwords is a list of common English stop words defined elsewhere in the app.
    combined_job_description = " ".join(list_of_job_desc)
    combined_job_description = combined_job_description.replace("\n", " ")
    combined_job_description = combined_job_description.lower()
    combined_job_description = combined_job_description.split(" ")
    combined_job_description = [i for i in combined_job_description if i not in stopwords]
    return combined_job_description

3. The compute_single_word_skills function reads the skills file, cleans the list of job descriptions, and matches the single-word skills that appear both in the job descriptions and in our skills corpus. It returns a dictionary of all the single-word skills, with the words as keys and their counts in the job descriptions as values.

def compute_single_word_skills(req):
    skills = read_skills_file()
    c = clean_job_desc(req)

    # Keep only the skills from the corpus that consist of a single word.
    single_word_skills = []
    for skill in skills:
        s = skill.split(" ")
        if len(s) == 1:
            single_word_skills.append(skill)

    # Count how often each single-word skill appears in the job descriptions.
    single_word_skills_in_job = {}
    for w in c:
        if w in single_word_skills:
            if w in single_word_skills_in_job:
                single_word_skills_in_job[w] += 1
            else:
                single_word_skills_in_job[w] = 1
    return single_word_skills_in_job

4. The final_function_to_return_top_skills function does the same thing as compute_single_word_skills, but for two-word (bigram) skills. It then calls compute_single_word_skills and merges both dictionaries into one.

def final_function_to_return_top_skills(req):
    combined_job_description = clean_job_desc(req)
    skills = read_skills_file()

    # Build every consecutive word pair (bigram) from the cleaned text.
    bigrams = []
    for i in range(0, len(combined_job_description) - 1):
        bigrams.append((combined_job_description[i], combined_job_description[i + 1]))

    # Count the bigrams that appear in the skills corpus.
    skills_dict2 = {}
    for bi in bigrams:
        w = " ".join(bi)
        if w in skills:
            if w in skills_dict2:
                skills_dict2[w] += 1
            else:
                skills_dict2[w] = 1

    # Merge in the single-word skill counts and sort by count, highest first.
    single_word_skills = compute_single_word_skills(req)
    skills_dict2.update(single_word_skills)
    sorted_skills_dict = dict(sorted(skills_dict2.items(), key=lambda item: item[1], reverse=True))
    return sorted_skills_dict

5. Finally, the final_processing function takes the dictionary returned by the function above and converts the counts into percentages. It also limits the number of results the user sees; I have limited it to 10. You can change this, but remember it might clutter the HTML, so try to keep the number of skills shown on the webpage small.

def final_processing(required_dict):
    # Convert raw counts into percentages of the total.
    s = sum(required_dict.values())
    for k in required_dict.keys():
        required_dict[k] = round((required_dict[k] / s) * 100, 2)

    # Split the skills into unigrams and bigrams.
    unigram = {}
    bigram = {}
    for key in required_dict.keys():
        if len(key.split()) == 1:
            unigram[key] = required_dict[key]
        else:
            bigram[key] = required_dict[key]

    # Sort each group by relevance and keep the top 10 non-empty entries.
    unigram_sorted = dict(sorted(unigram.items(), key=lambda item: item[1], reverse=True))
    bigram_sorted = dict(sorted(bigram.items(), key=lambda item: item[1], reverse=True))
    u = {i: unigram_sorted[i] for i in list(unigram_sorted.keys())[:10] if i not in ["", " "]}
    b = {i: bigram_sorted[i] for i in list(bigram_sorted.keys())[:10] if i not in ["", " "]}
    return [u, b]

Sidebar: An analysis of the method I used to get and rank the skills.

Obviously, this method is not the best in absolute terms, but it is the best of the available alternatives given the resource constraints. There are other methods that could have achieved the same outcome with varying degrees of accuracy:

1. Based on frequency: This produced horrible results, so I dropped it.

2. Based on a word2vec model: This produced good results but also a lot of junk. Its predictions could be used to show users one or two of the most heavily weighted sentences out of all the jobs we scraped.

3. An LSTM-based neural network model: This would obviously produce good results, but there were two constraints:

a. It needs a large data set to train, which means we would have to scrape a large number of jobs from Indeed (which would obviously take more time).

b. Even with a large data set, it would still take a long time to train.

Users would not wait for 5–10 minutes just to get a prediction.

So, I had to drop this idea.

Therefore, I resorted to the old-school NLP method: brute-force string matching. Surprisingly, this method produced the best results in the fastest time on smaller data sets.

Step 3: Creating the UI

If we understand the basic structure of a Flask application, creating the UI is simple, since we have already implemented all of the backend functionality. We will just plug the functions above into the backend and get the app running. I would like to discuss a few things before that.

Let’s start with the basics before building the full application.

How does any web application work?

To learn about web application development, we need to know about HTML and CSS which are two of the most important languages of the web.

We are going to talk about HTML and CSS in a moment but right now I want to discuss how the flask app works.

To make a Flask app, there are the following steps:

1. First make an app variable which is just an instance of the Flask class.

2. Then define the routes for the app. For each route, we declare a function. In the code for this tutorial, you will notice a function called index() preceded by the route decorator @app.route(“/”). This is a Python function decorator. A decorator tells Python that the function that follows it is special; in this case, it is a route function that will be executed when a user accesses our application at that URL. We will discuss routes in a moment.

3. We have to decide what we want to show a user. The look and feel of the application is defined by HTML and CSS.

4. Once the route functions are set, we run the application using app.run().

5. A very simple, minimal Flask app that will give you the idea is shown below.
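This is a sketch of the canonical minimal Flask app (the version in the repository may differ slightly):

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # This runs whenever someone visits the root URL of the app.
    return "Hello, World"

if __name__ == "__main__":
    app.run()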

In this very simple and minimal application, once the server has started the user will be greeted with a simple “Hello, World”.

Now, we could also put `HTML` code directly in the return statement, but that would make the function very difficult to read, which is why we prepare a separate HTML file. To serve that HTML file, we create a folder called templates and place a file called index.html inside it. We then use a function from Flask called render_template and return it from the route above in place of the text “Hello, World”.
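A sketch of the same route using render_template (assuming templates/index.html already exists):

from flask import Flask, render_template

app = Flask(__name__)

@app.route("/")
def index():
    # Flask looks for index.html inside the templates/ folder by default.
    return render_template("index.html")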

I hope this small discussion of how a Flask web app works makes sense.

Now, let us understand a little bit about HTML.

HTML — The Language of the Web

HTML stands for `HyperText Markup Language` and it is the language of the web. With the help of HTML, we can define the structure of the web application that we want to build.

HTML decides the structure of a web page. Whatever you see on the web page is structured because of HTML. Likewise, everything on our job-skill-ranker application is also structured using HTML.

But note here: HTML only handles the structure of the application, while the look and feel of your application is handled by another technology called `CSS`. More on that later.

To understand HTML, understand that everything in HTML is made up of tags.

What are tags? You might wonder...

An HTML element is made up of an opening tag and a closing tag. Opening tags look like <tag_name> and closing tags look like </tag_name>.

Let us take the example of our application. It has an input box, a submit button, and some header text. The header of the application is handled by header tags, and there are several different types of header tags:

1. h1 header tag

2. h2 header tag

3. h3 header tag

4. h4 header tag

5. h5 header tag

These are all used to place a header in the HTML content. They differ from each other in size.

Taking h1 as an example, the opening tag looks like <h1> and the closing tag looks like </h1>, with the content placed in between. We also have many different types of input tags, such as text inputs, number inputs, and submit inputs. Each of these input types is used in our application.

There are some tags that are self-closing, which means they have no closing tag. The `input` tag is one example of a self-closing tag.

One special tag: <style></style>

If you look at the code of the HTML file, most of the content resides between <style> and </style>. This part defines the colors of the various sections and the way they are placed on the page.

Understanding HTML and CSS with example

HTML and CSS are very simple to learn and master. I will give you a very simple example to explain a few concepts.

HTML

HTML has two main tags: <head></head> and <body></body>.

So what is the use of this <head></head> tag? It turns out that the head tag holds things that the user does not see directly but that have a major impact on the page, while the body tag is what users actually see.

One of the important things that this head tag holds is the <style></style> tag.

This tag holds the colors, sizes, display rules, shadows, border radii of HTML elements, and so on.

The style tag holds what is known as CSS, `Cascading Style Sheets`; it is what gives the web application the look we want.

In CSS, we have the concept of selectors, and those selectors let us style individual parts of the HTML.

For example, let us say I want to style the <h1></h1> HTML tag. I can select that tag in CSS like this:

h1 {
    /* all the different style property values for the h1 tag go here */
}

In a similar manner, I can style all the other HTML tags.

Now, you might be wondering, do I have to memorize all these rules?

The answer is no. There are far too many to memorize; all you need to do is type the style you are looking for into the Google search bar, and the corresponding rule will show up on the first page of results.

You need to learn these HTML tags and CSS rules only if you are really interested in getting into web development.

On the other hand, if you are just looking to create machine learning web applications, you don’t have to learn these HTML and CSS rules in depth. But if you can, it will be really helpful, and you will be able to create wonderful web applications.

Two states of our application — Posted or not Posted

The final application will have two different states. In one, the user is shown the input boxes and the submit button. In the other, the user has submitted the details of the job and the prediction is displayed.

Regardless of whether the user is looking at the welcome screen or the results screen, the style should be the same.

Now, remember the screen which is displayed once the user opens the app?

This is the state in which the input form is shown. On that page there are several things: the header title of the application, some inputs, and a submit button.

Creating the ‘route’ for our web application

Now that we have defined the necessary HTML structure, we need to create the route of the web application.

What is the route?

A route is just a fancy word for a `URL` path. It is the address at which our application responds in the browser, not one of the inputs the user fills in on the form.

A web application is just like any other website you visit online. When you visit a website, you enter some `URL`, and that URL is what we are talking about here.

So, before the function `index()`, we have a statement which starts with `@`; such statements have a special name in Python: `decorators`.

Inside this route we say that the URL is `’/’` and that we will accept two types of methods, `GET` and `POST`.

Now, let us understand what we mean by methods.

Remember when we talked about the states of the application? There were two states:

1. The user is welcomed and shown the form to enter the job details == `GET` request

2. Once the user submits the form, the prediction is displayed == `POST` request

Here is the part of the code which handles which method we are invoking:
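A sketch of what that route looks like, assuming the form fields are named jobtype, noOfJobs, and location as described below (the exact code is in the GitHub repository):

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        # The user clicked submit, so read the three form fields.
        jobtype = request.form["jobtype"]
        noOfJobs = int(request.form["noOfJobs"])
        location = request.form["location"]

        # Step 1: scrape the job descriptions; Step 2: extract and rank the skills.
        list_of_jobs = get_the_list_of_text(noOfJobs, jobtype, location)
        skills = final_function_to_return_top_skills(list_of_jobs)
        prediction_Dictionary = final_processing(skills)

        return render_template("index.html", prediction_Dictionary=prediction_Dictionary)

    # Plain GET request: just show the empty form.
    return render_template("index.html")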

In this function, we first check what type of request the user is making; in other words, whether the user has clicked the submit button or not. If they have not, we just return render_template. If they have, in other words if the request type is “POST”, then we do the following:

1. We get the jobtype from the form.

2. We get the noOfJobs from the form.

3. We get the location from the form.

But, how do we know these names? Here is the HTML of the form:

Notice the name attribute of each input tag; that is where we find the required names.

After extracting these three pieces of information, we call the functions we created in the first two steps. First, we call the function to get the desired number of jobs from Indeed, and then we use that list to call the functions which give us the predictions.

Once we have the predictions, we render the template by passing the predictions in the render_template function. Now, how does HTML handle this?

Here, the template checks whether it was passed a prediction_Dictionary. If it was, we display the predictions; if it was not, we display the form instead.

STEP 4: Putting the App on Docker

To put the app on Docker, we first need to install Docker Desktop. After installing it, confirm it is there by running the command “docker --version” on the command line.

To create an image of any application, we need to create a Dockerfile with the instructions for building the image, and we then run the image in a container. We will create the Dockerfile in the same folder where our app is.

Your directory structure should look like this:

We have a folder called instance with a static folder inside it, which holds the corpus of skills. We have another folder, templates, that holds index.html. We have one index.py file which holds all of the server and backend code. And there is a file called Dockerfile, which contains the instructions to build the image of our app so that the image can be run inside a Docker container.
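A sketch of what that Dockerfile might contain, based on the steps described below (the base image tag and the requirements.txt file name are assumptions; check the repository for the real file):

FROM python:3.8

# Create a working directory and copy the app into it.
WORKDIR /app
COPY . /app

# Upgrade pip and install the required packages.
RUN pip install --upgrade pip
RUN pip install -r requirements.txt

# The Flask app listens on port 5000.
EXPOSE 5000

# Start the server.
CMD ["python", "index.py"]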

When we build the Docker image, it runs those instructions in sequence. First, it pulls a base Python installation, then makes a working directory and copies the contents of the current directory into it. After that, it upgrades pip and installs the required packages. Finally, it runs the index.py file, which is the server of our app.

To build the image, first go into the folder where your Dockerfile lives, then run the following command:

docker build -t <name of image> .

After you hit enter, it will start executing the commands in the Dockerfile. Something like this:

After it is done executing, you can run your image with:

docker run -p 5000:5000 <name of the image>

Here, the -p flag maps the container’s port to a port on your machine. If you open Docker Desktop and go to the Containers tab, you will see your image running inside a container, and from there you can choose the option on the right to view it in the browser. The status of the running app can be viewed from Docker Desktop and will look something like this:

Choose the “Open in browser” option to view the app in the browser.

I have uploaded this app’s image to Docker Hub (Job-Skill-app-docker) so that you can easily play with the app by pulling the image with this command:

docker pull fenago/job-skill

Once this is successful, you can run the app on your own system without having to install anything! Visit Ernesto.Net for more information!
