What is Web Scraping?
Web scraping is simply the act of gathering data from a single source or multiple sources on the internet. We scrape the web every day, and it doesn’t even have to be complex; gathering a top 10 list from Forbes counts as web scraping. In the world of programming, however, when you hear the term web scraping, it no longer has to do with any manual method. Automated web scraping is the act of gathering data automatically from a specific source or multiple sources across the internet.
Why do we scrape the web?
So why would anyone want to scrape the web? There are plenty of scenarios that lead to web scraping. First of all, it eliminates the stress of manually visiting multiple websites to get information; you could simply write a program to do this for you and send the data to your email every morning. Some jobs also require regular data collection, and web scraping can simplify a lot of that. Imagine building a job scraper that gathers available jobs from multiple sources and sends them to you every morning. Web scraping is a genuinely useful part of many people’s day-to-day lives.
Is Web Scraping Legal?
This is a very good question, and it all comes down to the kind of website you’re scraping. Some website owners, be they individuals or organizations, do not mind their websites being scraped, while others forbid it. Once you dive into web scraping, you’ll notice some owners actually take active measures to prevent people from scraping their sites. If you’re scraping a website privately for personal reasons, you will probably not run into any issues. If, however, you are planning something on a larger scale, I advise you to read the terms and conditions of any website you intend to scrape. For more information, I advise you to read this article on benbernardblog.
Web Scraping With Python
Python is one of the best and most widely used programming languages out there. It is an ideal choice for web scraping because of the powerful libraries that come with it. Tools like requests, lxml, and Beautiful Soup, among others, make it easy to read and extract data from websites. There are many ways to scrape the web using Python, but a handful of tools account for most of them.
In this article we’re discussing Beautiful Soup. Subscribe to our newsletter and allow notifications to stay updated, as we will be addressing the other techniques right here on Techranter.
Requirements for Web Scraping with Beautiful Soup
- lxml- This is a Python library used for parsing XML and HTML files. It is a very important tool for web scraping and works hand in hand with Beautiful Soup.
- Beautiful Soup- Beautiful Soup is a Python library used for extracting data from XML and HTML files. It works with a parser such as lxml.
- Python- Python is the programming language used. Beautiful Soup and the other libraries involved can only be used with Python.
- Requests- Requests is an HTTP library for Python. It helps in making HTTP requests and returning the appropriate response from the server.
- VS Code- Visual Studio Code is a lightweight code editor that many people use like an IDE. It has everything you’ll need to write your programs.
Scraping From a Live Website
To avoid a lawsuit for writing this article, I decided to use Techranter.com as the case study. If you’re new to this and want to learn and brush up your skills, I advise you to find a different website to practice on, something more practical and useful for you. That will make pulling it off feel a lot better.
Step 1- Installing Python and VS Code
Make sure you install Python properly and check the “add to PATH” box during installation.
Step 2 Install Required Libraries
Open the terminal in VS Code and install the following libraries using the pip command: Beautiful Soup, lxml, and requests.
pip install beautifulsoup4 lxml requests
By separating the library names with spaces, you can install multiple packages at once. However, you can choose to install them one by one if you run into an error.
Step 3 Inspect Target URL
In web scraping, this is the most important part and the one that requires the most attention. What exactly are you scraping? What data are you gathering? And how is the data structured?
To answer all these questions, you’ll need to visit the website in question and take a really good look. In this case, the target website is https://techranter.com and we are trying to automatically search and gather data from content with the keyword “phone”.
In simple terms, we want to see all the articles on mobile phones from techranter.com. Keep in mind that the whole purpose of web scraping is automating the stressful bit.
You should consider brushing up your knowledge of URLs, because you can’t really scrape the web if you don’t know your way around them.
Now consider this URL:
Merely looking at it, you should be able to tell that the base URL is techranter.com, while the category ‘smartphone’ is the exact destination the URL points to.
In web scraping you will most likely use query parameters to get data, since websites use these queries to retrieve data from their databases. By query, we simply mean the search system you can use to find keyword-related data on a website.
Searching for the keyword “phone 2022” on techranter.com displays all relevant results. If you look at the address bar to check the current URL, you will find the query parameter “?s=phone+2022” appended to it.
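To see exactly how a keyword becomes a query string like the one above, here is a small offline sketch using Python’s standard library (no network involved; techranter.com is the base URL from this article):

```python
from urllib.parse import urlencode

base_url = "https://techranter.com"
keyword = "phone 2022"

# urlencode turns the dict into "s=phone+2022" (spaces become "+")
query_string = urlencode({"s": keyword})
search_url = f"{base_url}/?{query_string}"

print(search_url)  # https://techranter.com/?s=phone+2022
```

This is the same “?s=” parameter you see in the address bar after searching the site.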
Again, you should brush up on your URL knowledge if you know nothing about it before you proceed with scraping.
Using the Inspect Tool on Your Browser
You can inspect the website you intend to scrape using your browser’s inspect tool. All major browsers have one: right-click anywhere on the page and click Inspect.
Inspecting a page in your browser gives you insight into how data is displayed and structured on that website. It’s like viewing the front end of a website and knowing what text sits in what tag.
Now take a moment to inspect any URL of your choice and see how the data on that site is structured. Note that most modern websites are dynamic while some are static, so the detail you see and the difficulty of scraping will vary. Some websites are almost impossible to scrape because of how their data is structured.
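As an illustration, a post listing on a WordPress-style blog often looks roughly like this in the inspector. This is a made-up sketch using the class names that appear later in this article, not a guaranteed match for any real site:

```html
<div class="item-details">
  <h3 class="entry-title td-module-title">
    <a href="https://techranter.com/some-phone-review">Some Phone Review</a>
  </h3>
  <time class="entry-date updated td-module-date">January 1, 2022</time>
</div>
```

Spotting a repeating wrapper like this div, and the tags and classes inside it, is exactly what makes targeted extraction possible.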
Step 4- Scraping entire HTML content from a page
We’re back in our code editor for this part; let’s scrape the HTML content from a page. Create a Python file in VS Code and name it scraper.py, then import requests and Beautiful Soup. We’re only going to use requests for this particular task; we’ll show you where Beautiful Soup comes in shortly.
from bs4 import BeautifulSoup
import requests
Now, state the URL you want, use the requests library to get a response, and print out the data from that response.
from bs4 import BeautifulSoup
import requests

url = "https://techranter.com"
result = requests.get(url)
print(result.text)
Congratulations, if you ran this code successfully without an error, you just scraped the web.
To break this code down: first we imported requests, then we stated the URL we wanted to scrape. We then used requests to get the URL and printed out the raw HTML of the website. The result will look very similar to what you see in the inspect tool.
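Keep in mind that result.text is just a string of HTML, and Beautiful Soup can parse any such string. A minimal offline sketch, using a made-up snippet in place of a real response and Python’s built-in html.parser so it runs even without lxml installed:

```python
from bs4 import BeautifulSoup

# A made-up HTML string standing in for result.text
html = "<html><body><h1>Hello, Techranter</h1></body></html>"

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)  # Hello, Techranter
```

The same parsing step works on the full page you just fetched; only the input string changes.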
Step 5- Scraping Specific Content
From the result of the code above, you successfully scraped the entire layout of a web page, but that’s not quite what we want. Now let’s scrape the specific data we actually need. In this case, we want to scrape techranter.com for all articles about phones and extract the following data:
- The title
- Date posted
- Article link
Let’s go right ahead and begin. Using the inspect tool, I know the tags that contain each piece of data I want to extract, along with their individual id or class names.
Extracting the Title
Here the Beautiful Soup library gets to work, with lxml used for parsing the HTML content.
from bs4 import BeautifulSoup
import requests

query = "phone"
url = "https://techranter.com?s=" + query
result = requests.get(url)
soup = BeautifulSoup(result.text, 'lxml')

section = soup.find_all('div', class_="item-details")
for article in section:
    title = article.find('h3', class_="entry-title td-module-title").text
    print(title)
While getting the article titles above, you will notice two main actions. First we found the div and class name for all blog posts: the articles are located in a div with the class “item-details”, and we assigned the result of find_all to a variable named section.
Extracting Date Posted
for article in section:
    title = article.find('h3', class_="entry-title td-module-title").text
    date = article.find('time', class_="entry-date updated td-module-date").text
    print(date)
Then we move forward to extract the article title and the date it was posted. Both pieces of data live inside the master div “item-details”, which we named section. We loop through section to find the exact details we need: inside each item there are h3 tags with the class name “entry-title td-module-title”, and these tags are where the article title is stored. We use .text to extract just the text inside the tag.
Getting Article Link
Getting the article link follows the same procedure. The link for each article is stored in an a tag inside the parent h3 tag with the class name "entry-title td-module-title". This time we did not use .text; instead we used .a.get('href'), which extracts the link itself and displays it in our terminal.
for article in section:
    title = article.find('h3', class_="entry-title td-module-title").text
    date = article.find('time', class_="entry-date updated td-module-date").text
    permalink = article.find('h3', class_="entry-title td-module-title").a.get('href')
    print(permalink)
Now that all the data has been extracted, you can arrange it in a list and do with it as you please.
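One way to arrange the scraped fields is a list of dictionaries. Sketched here offline against a made-up snippet that mimics the class names used above, so it runs without hitting any website:

```python
from bs4 import BeautifulSoup

# Stand-in for result.text; the markup mirrors the classes used above
html = """
<div class="item-details">
  <h3 class="entry-title td-module-title"><a href="https://techranter.com/post-1">Post One</a></h3>
  <time class="entry-date updated td-module-date">May 1, 2022</time>
</div>
<div class="item-details">
  <h3 class="entry-title td-module-title"><a href="https://techranter.com/post-2">Post Two</a></h3>
  <time class="entry-date updated td-module-date">May 2, 2022</time>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
articles = []
for article in soup.find_all("div", class_="item-details"):
    heading = article.find("h3", class_="entry-title td-module-title")
    articles.append({
        "title": heading.text,
        "date": article.find("time", class_="entry-date updated td-module-date").text,
        "link": heading.a.get("href"),
    })

print(articles)
```

From a structure like this, writing the results to a CSV file or an email body is a short step.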
In the above code the query keyword is static. You can make it dynamic by using the input() function so that whenever the script runs, you can type any keyword you choose and scrape for it.
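A sketch of that idea, with the URL-building factored into a small helper so input() can feed it. The function name here is ours, not from the original code:

```python
from urllib.parse import quote_plus

def build_search_url(keyword):
    """Build the techranter.com search URL for any keyword."""
    return "https://techranter.com/?s=" + quote_plus(keyword)

# In a real run you could do: keyword = input("Search for: ")
print(build_search_url("phone 2022"))  # https://techranter.com/?s=phone+2022
```

quote_plus also takes care of spaces and special characters in whatever the user types, which plain string concatenation does not.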
This is a simple outline of what web scraping is and how it works; now you can take on a more serious project and put your skills into practice.
Web scraping with Python is an easy and fun way to gather data. It is also one of the best ways to brush up on your Python skills, as you get introduced to many libraries and their uses.
Beautiful Soup, Requests, and lxml are the major libraries used in building a web scraper.
Scraping websites without first obtaining permission or checking a website’s terms and conditions is not advisable and can lead to a lawsuit.
Did you learn anything new? Having issues with your code? Feel free to ask your questions using the comment section below.