Edwin Fachri W

LinkedIn Web Scraping Using Python

Get yourself the data that you need

LinkedIn is one of the most reliable social media platforms for professionals out there. Once someone graduates from school or college, keeping their LinkedIn profile updated is practically a must. In other words, it is one of the best sources of reliable data for a scraper. In my spare time, and out of a bit of curiosity, I searched for an online tutorial on web scraping, specifically for LinkedIn. Fortunately, I found one. The blog was posted here by David Craven on October 3, 2018. As I followed his guide, I realised that LinkedIn's updates had made the tutorial obsolete, so I modified some of the code to scrape the data, and it worked. Here I want to share what I went through to get the data from LinkedIn.

First of all, let’s get into the page to see what can be gathered from it.

At the top there is basic information such as name, address, number of connections, last job, where you graduated from, and a description about yourself. Let's scroll down once or twice.

As we move further down, we can see the lists of jobs and education. If we move even further down,

we can see the list of skills and interests of the page owner.

After having a look at the profile page on LinkedIn, I decided to scrape the following items:
1. Basic information
2. List of jobs
3. List of education

Tools Required
1. A relational database (I use PostgreSQL).
2. Jupyter Notebook (highly recommended for development).
3. Python libraries: Selenium, BeautifulSoup, Parsel, and Psycopg2 (installable with pip, as shown below).
4. A WebDriver for Google Chrome; you need to choose the one that matches your Chrome version.
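
If you don't have the Python libraries yet, they can be installed with pip (the package names here are inferred from the imports used later in this post; psycopg2-binary is the easiest way to get psycopg2 without build tools):

pip install selenium beautifulsoup4 parsel psycopg2-binary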

Setting up the Database

The database setup is pretty straightforward, and you can modify it to suit yourself. Here is mine:

1. Account Table

create table account (
    id serial primary key,
    url varchar(150) unique not null,
    name varchar(150) not null,
    job varchar(100) null,
    location varchar(100) null,
    created_date timestamp not null,
    modified_date timestamp
);

2. Job Table

create table job (
    id serial primary key,
    account_id integer not null,
    position varchar(100),
    employer varchar(100),
    period varchar(100),
    duration varchar(100),
    location varchar(100),
    description text,
    created_date timestamp not null,
    modified_date timestamp,
    CONSTRAINT account_id_fkey FOREIGN KEY (account_id)
        REFERENCES account (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
);

3. Education Table

create table education (
    id serial primary key,
    account_id integer not null,
    school varchar(100),
    degree varchar(100),
    field_of_study varchar(100),
    date_attended varchar(100),
    created_date timestamp not null,
    modified_date timestamp,
    CONSTRAINT account_id_fkey FOREIGN KEY (account_id)
        REFERENCES account (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
);

From the script above, you can see that I have three main tables related to each other. The primary table is account, which holds the basic information from the profile page. The job table keeps the list of jobs an account has, while the education table holds the list of educations.

Let’s get into the code

First of all, import all the required libraries. The usage of the following libraries is well documented online, so please feed your curiosity yourself.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from parsel import Selector
import psycopg2
from config import config, config_prod
from bs4 import BeautifulSoup
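
The config module in that import list is my own local helper that returns database connection parameters; it is not part of any library. If you don't have one, a minimal sketch might look like the following (the database name, user, and password are placeholders you should replace with your own):

# config.py -- minimal sketch of the local helper imported above.
# Returns a dict of psycopg2 connection parameters.

def config_prod():
    return {
        'host': 'localhost',
        'port': 5432,
        'dbname': 'linkedin_scraper',  # placeholder database name
        'user': 'postgres',            # placeholder user
        'password': 'your_password',   # placeholder password
    }

def config():
    # In this sketch the development config simply mirrors production.
    return config_prod()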

The next step is to open the browser via Selenium and get yourself logged in to LinkedIn. Insert the path of the downloaded WebDriver in place of *webdriver path*, and fill in your LinkedIn username and password in place of *linkedin username* and *linkedin password*. I created a second LinkedIn profile so that people won't block my main one, because I will be "visiting" people's profiles quite often. Remember to put sleep commands in as often as you can to avoid Google's bot detection.

driver = webdriver.Chrome(*webdriver path*)
driver.get('https://www.linkedin.com/login')

username = driver.find_element_by_id('username')
username.send_keys(*linkedin username*)
time.sleep(0.5)

password = driver.find_element_by_id('password')
password.send_keys(*linkedin password*)
time.sleep(0.5)

log_in_button = driver.find_element_by_xpath('//*[@type="submit"]')
log_in_button.click()
time.sleep(2)

Next, get yourself a list of LinkedIn profile URLs based on your search. You need to store your search terms in the variable keyword before entering the loop. Personally, I keep a list of predetermined keywords in the database so that I can reuse them whenever I want without expanding my code. A keyword can be anything, for example: site:linkedin.com/in/ AND "Business Information" AND "Jakarta" AND "Male"

url_list = []
# keyword is a list of search strings (see the example above)
for x in keyword:
    driver.get('https://www.google.com')
    time.sleep(3)
    search_query = driver.find_element_by_name('q')
    search_query.send_keys(x)
    time.sleep(0.5)
    search_query.send_keys(Keys.ENTER)
    time.sleep(3)
    linkedin_urls = driver.find_elements_by_tag_name('a')
    linkedin_urls = [url.get_attribute('href') for url in linkedin_urls]
    for i in linkedin_urls:
        if i is not None and 'linkedin' in i and 'google' not in i:
            url_list.append(i)
    time.sleep(3)
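
One small caveat: the same profile link usually appears more than once on a results page, so url_list can contain duplicates. A one-liner (my addition, not in the original tutorial) deduplicates it while preserving order:

# Remove duplicate URLs while keeping their original order (Python 3.7+).
url_list = list(dict.fromkeys(url_list))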

After getting the list of LinkedIn profile URLs, the next step is to visit each gathered URL one by one and scrape the basic information.

counter = 0
for i in url_list:
    counter = counter + 1
    print(("Processing " + str(counter) + " out of " + str(len(url_list)) + "..."), end="")
    driver.get(i)
    time.sleep(15)
    sel = Selector(text=driver.page_source)
    name = sel.xpath('//div/ul/li[has-class("t-24")]/text()').extract_first()
    if name:
        name = name.strip()
    job = sel.xpath('//div/h2[has-class("t-18")]/text()').extract_first()
    if job:
        job = job.strip()
    address = sel.xpath('//div/ul/li[has-class("t-16")]/text()').extract_first()
    if address:
        address = address.strip()
    sql = '''insert into account (url, name, job, location, created_date) values (%s, %s, %s, %s, current_date)
             on conflict (url) do update set name = %s, job = %s, location = %s, modified_date = current_date returning id ;'''
    conn = None
    account_id = None
    try:
        params = config_prod()
        conn = psycopg2.connect(**params)
        cur = conn.cursor()
        cur.execute(sql, (i, name, job, address, name, job, address))
        account_id = cur.fetchone()[0]
        conn.commit()
        cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()
    time.sleep(0.5)

After grabbing the basic information at the top of the page, we scroll down a few times, pausing in between, so that the page loads the job and education sections. Once the page is fully loaded, we re-parse it and get the lists of jobs and education based on the section ids experience-section and education-section.

driver.execute_script("window.scrollTo(0, document.body.scrollHeight/3);")
time.sleep(1)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight/2);")
time.sleep(1)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight/1);")
time.sleep(1)

# Re-parse the page source after scrolling so the newly
# loaded sections are included in the selector.
sel = Selector(text=driver.page_source)

job = sel.xpath('//section[@id="experience-section"]/ul/li').getall()
education = sel.xpath('//section[@id="education-section"]/ul/li').getall()

time.sleep(1)

Now that we have the chunks of HTML, we can loop over them with BeautifulSoup and extract the information from each li tag in the job variable.

for i in job:
    soup = BeautifulSoup(i, "html.parser")
    position = soup.li.section.div.a.h3
    employer = soup.find(text="Company Name")
    date_employed = soup.find(text="Dates Employed")
    duration_employed = soup.find(text="Employment Duration")
    location = soup.find(text="Location")
    job_desc = soup.find("div", {"class": "pv-entity__extra-details"})

    if position:
        position = position.text.strip()
    if employer:
        employer = employer.find_next().text.strip()
    if date_employed:
        date_employed = date_employed.find_next().text.strip()
    if duration_employed:
        duration_employed = duration_employed.find_next().text.strip()
    if location:
        location = location.find_next().text.strip()
    if job_desc:
        job_desc = job_desc.text.strip()

Then, still inside the loop, we save the data to our database.

    sql = '''INSERT INTO job (account_id, position, employer, period, duration, location, description, created_date)
             VALUES (%s, %s, %s, %s, %s, %s, %s, current_date) returning id ;'''
    conn = None
    try:
        params = config_prod()
        conn = psycopg2.connect(**params)
        cur = conn.cursor()
        cur.execute(sql, (account_id, position, employer, date_employed, duration_employed, location, job_desc))
        conn.commit()
        cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()

We do the same for education.

for i in education:
    soup = BeautifulSoup(i, "html.parser")
    school = soup.find("h3", {"class": "pv-entity__school-name"})
    degree = soup.find(text="Degree Name")
    field_of_study = soup.find(text="Field Of Study")
    dates_attended = soup.find(text="Dates attended or expected graduation")

    if school:
        school = school.text.strip()
    if degree:
        degree = degree.find_next().text.strip()
    if field_of_study:
        field_of_study = field_of_study.find_next().text.strip()
    if dates_attended:
        dates_attended = dates_attended.find_next().text.strip()

    sql = '''INSERT INTO education (account_id, school, degree, field_of_study, date_attended, created_date)
             VALUES (%s, %s, %s, %s, %s, current_date) returning id ;'''
    conn = None
    try:
        params = config_prod()
        conn = psycopg2.connect(**params)
        cur = conn.cursor()
        cur.execute(sql, (account_id, school, degree, field_of_study, dates_attended))
        conn.commit()
        cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()
print("[Done]\n")

Finished.

That's it. We now have LinkedIn profiles in our database.
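
To sanity-check the results, a simple join over the tables defined earlier shows each stored profile with its jobs (this query is just an optional check of my own):

select a.name, a.location, j.position, j.employer, j.period
from account a
left join job j on j.account_id = a.id
order by a.name;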

However, there were problems I faced during this trial-and-error development. First, I kept getting an incomplete list of jobs and education until I decided to scroll down using driver.execute_script("window.scrollTo();"). Second, I have not found the exact limit before getting caught by Google's bot detection. When I stayed below 20 profiles it ran well, but by the time I hit 50 (I did not track exactly when Google caught me), I was challenged with a verification step. I hope there is a way to get past that security system (in a legal way, of course).
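
For that second problem I have no confirmed fix, but one thing worth trying (no guarantee it avoids the verification challenge) is replacing the fixed sleeps with randomized ones, so the request pattern looks less mechanical:

import random

# Sleep a random amount between requests instead of a fixed interval.
time.sleep(random.uniform(3, 8))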

You have reached the end of my first post. Thank you for being here. I wish you well.
