Web Scraping with Selenium: A Beginner's Guide
##### from "Web Scraping with Selenium: A Beginner's Guide, by Ashwin Pajankar" article

![image](https://cdn-images-1.medium.com/max/1600/1*JbSBygOMrf5YqL5F5Z0D4w.png)

## Example Usage

1. git clone https://github.com/aiwithab/blog_writer.git
2. cd to blog_writer
3. pip install -r requirements.txt
4. python main.py
5. copy the html div output and paste it into the post editor on Medium.com

## example output:

```html
What is Web Scraping?
Web scraping, or web data extraction, is the process of retrieving or “scraping” data from a website. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
When is it used?
Web scraping is used in a wide range of applications, from price comparison to collecting job postings, news articles, and much more. Web scraping software is available under licenses ranging from open source to commercial.
What is Selenium?
Selenium is an open source suite of tools for automating web browsers across many platforms. It is licensed under the Apache License 2.0. Some of the large browser vendors support Selenium and have taken (or are taking) steps to make it a native part of their browsers. It is also the core technology in countless other browser automation tools, APIs and frameworks.
When is it used?
Selenium is most useful for automating repetitive tasks within a web browser. It can drive automated tests that check the functionality of websites, fill out forms, and scrape data from web pages.
Selenium vs Scrapy vs BeautifulSoup
There are many other web scraping libraries and packages available in Python, such as urllib (urllib2 in Python 2), BeautifulSoup, and Scrapy (for more advanced users). Here we will discuss the basics of Selenium.
Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
Installation
The easiest way to install Selenium on Windows is through the pip utility.
pip install selenium
Firefox
To drive Firefox, install the latest version of the browser along with GeckoDriver; you can download GeckoDriver or use the copy provided in the project.
You can also use the latest version of Chrome. In that case, you need to download ChromeDriver and provide its path.
Library
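Add the following lines to your Python file. The By class supplies the locator strategies used with find_element().
from selenium import webdriver
from selenium.webdriver.common.by import By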
The Driver
Selenium requires a driver to interface with the chosen browser, so you need to create a driver object before you can do anything else. Firefox, for example, requires geckodriver, which needs to be installed before the examples below can be run. Make sure it’s in your PATH or specify its full path when you create the WebDriver object.
driver = webdriver.Firefox()
If you wish to use Chrome, simply replace Firefox() with Chrome() and ensure ChromeDriver is in your PATH.
Navigating
It’s like you’re using a browser, but it’s all done programmatically. To open a webpage, use get().
driver.get('https://www.google.com')
Sleeping
This simulates the time a human would wait while performing a task, such as waiting for an AJAX request to finish loading.
import time
time.sleep(5)
Clicking
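This is the process of simulating a human clicking on an element of a webpage. Here we select the element by its ID. The find_element_by_* helpers were removed in Selenium 4.3, so we use find_element() with a By locator (from selenium.webdriver.common.by import By).
element = driver.find_element(By.ID, 'gbqfq')
element.click()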
Typing
This is the process of simulating a human typing on the keyboard. Here we locate the element by its ID and send the keys to it.
element = driver.find_element(By.ID, 'gbqfq')
element.send_keys('Testing Selenium')
Selecting
This is the process of simulating a human selecting an element of a webpage. Here we select the element by its XPath. Note that find_element() returns a single element, while find_elements() returns a list.
element = driver.find_element(By.XPATH, '//*[@id="gbqfq"]')
element.send_keys('Testing Selenium')
Storing
Once you have a reference to an element, you can extract the text inside that element, or the value of an attribute of that element, and store it in a variable.
element = driver.find_element(By.XPATH, '//*[@id="gbqfq"]')
text = element.text  # visible text inside the element
value = element.get_attribute('value')  # value of the "value" attribute
Closing
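When you are done, you need to shut down the driver. quit() closes every window and ends the browser session, whereas close() only closes the current window.
driver.quit()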
Conclusion
In this tutorial we discussed the basics of web scraping and Selenium.
```
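
## putting the steps together (sketch)

The steps in the example output (navigate, sleep, select by XPath, type, store, close) can be combined into one small script. This is a minimal sketch, not part of the original article. The helper names `xpath_for_id`, `search_and_read`, and `main` are hypothetical. The id `gbqfq` is copied from the example output and may not match the current Google page. It assumes Selenium 4 or later, where `find_element()` accepts a locator strategy and a value.

```python
import time


def xpath_for_id(element_id):
    """Build an XPath that matches any element with the given id.

    Double quotes inside the expression avoid clashing with the
    single quotes that often delimit the surrounding Python string.
    """
    return f'//*[@id="{element_id}"]'


def search_and_read(driver, url, box_id, query, pause=0.0):
    """Open url, type query into the element with id box_id,
    and return the value now held by that element."""
    driver.get(url)                     # Navigating
    time.sleep(pause)                   # Sleeping (a crude fixed wait)
    # "xpath" is the string behind Selenium's By.XPATH constant.
    box = driver.find_element("xpath", xpath_for_id(box_id))  # Selecting
    box.send_keys(query)                # Typing
    return box.get_attribute("value")   # Storing


def main():
    # Imported here so the helpers above can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumes geckodriver is available
    try:
        value = search_and_read(
            driver, "https://www.google.com", "gbqfq",
            "Testing Selenium", pause=5,
        )
        print(value)
    finally:
        driver.quit()  # Closing: always end the session, even on errors
```

Calling `main()` opens a real Firefox window, so it needs Selenium and geckodriver installed. The `try`/`finally` is a design choice that ends the browser session even if the element lookup fails.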