scrapy_playwright_example

This repo contains a scraping script that crawls a JavaScript-rendered webpage using the Scrapy framework and the scrapy-playwright package in Python.

Objective of the Project

I created this script to test how well the scrapy-playwright Python package crawls a JavaScript-rendered webpage.

To scrape dynamic websites in Python, one of these three options can be used:

1. scrapy-playwright
2. scrapy-splash (requires Docker)
3. A proxy service with built-in JS rendering (e.g., Zyte Smart Proxy Manager or ScraperAPI)

I prefer option #1 for low-volume scraping and option #3 for high-volume scraping, because these proxy services also re-route your requests and help overcome the anti-bot mechanisms that e-commerce websites use. Option #2 also works well, but you need to be familiar with Docker and have it installed on your computer. scrapy-playwright does not need a Docker image to work and acts as a direct plugin to Scrapy, which makes it easy to use.

Usability and Reproducibility

Step 0: To find out whether a website is dynamically rendered, open it in your browser, press F12 to open the developer tools, then press Ctrl-Shift-P, type in Disable JavaScript, select that command, and reload the page. If the text/numbers you want to scrape disappear, you are dealing with a JS-rendered website.
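The same check can be approximated without a browser: download the HTML as a plain HTTP client sees it (no JavaScript is executed) and look for the text you expect. The sketch below is a rough heuristic, not a definitive test. The function names are made up for illustration, and the URL and text in the usage comment are placeholders.

```python
from urllib.request import Request, urlopen


def appears_in_static_html(html: str, expected_text: str) -> bool:
    """Return True if the text is present in the HTML as served (case-insensitive)."""
    return expected_text.casefold() in html.casefold()


def fetch_static_html(url: str, timeout: float = 10.0) -> str:
    """Download a page without running any JavaScript."""
    # Some sites reject urllib's default user agent, so send a browser-like one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (placeholder URL and text, replace with your own):
#     html = fetch_static_html("https://example.com/product-page")
#     if not appears_in_static_html(html, "19,99"):
#         print("Text missing from the static HTML; the page is probably JS-rendered")
```

If the text is missing from the static HTML but visible in the browser, the page most likely needs a JS-capable tool such as scrapy-playwright.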

Step 1: At the time this guide was written, scrapy-playwright did not work natively on Windows, only on Linux and Mac. If you use Windows, you'll need to use Windows Subsystem for Linux (WSL). Otherwise, the spider will always fail.
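If you are unsure which environment your terminal is running in, a small check like the one below can tell native Windows apart from WSL. It is a heuristic sketch: it assumes that WSL kernels include "microsoft" in their release string, and the function names are made up for illustration.

```python
import platform


def classify_platform(system: str, release: str) -> str:
    """Classify a host as 'windows', 'wsl', or 'unix' from its platform strings."""
    if system == "Windows":
        return "windows"
    # Assumption: WSL kernels report a release string containing "microsoft".
    if system == "Linux" and "microsoft" in release.lower():
        return "wsl"
    return "unix"


def current_platform() -> str:
    """Classify the machine this script is running on."""
    return classify_platform(platform.system(), platform.release())


if __name__ == "__main__":
    print(current_platform())
```

A result of windows means the terminal is running outside WSL, so the spider should be launched from a WSL window instead.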

If you are using Windows, follow the steps in this video from 4:30 to 14:00 to install WSL, VSCode, and Windows Terminal on your machine. The video is courtesy of YouTube user freakingud. It is not in English (probably Hindi), but the screen recordings are enough to follow the steps. I found it to be one of the most straightforward guides to installing WSL, even though I did not understand the language.

After installing WSL, you will need to do two additional steps:

Upgrade it from WSL1 to WSL2. To do this, follow the steps in this guide.
Install the VSCode extensions shown in the image below. The ones specifically needed for WSL to work are WSL, Pylance, and Python, but the others are useful for other use cases, and I recommend keeping them in your standard toolbox.

Note 1: You will need to install these extensions again in the WSL: Ubuntu environment once you connect to the WSL remote container (steps explained below).

Note 2: The name of the distro in the wsl --set-version <distro-name> 2 step is Ubuntu.

Step 2: In VSCode, click the green/purple icon in the bottom-left corner, then click New WSL Window using Distro, and finally Ubuntu.

You should land on a page that looks like this

Step 3: Open your terminal and type in git clone https://github.com/omar-elmaria/scrapy_playwright_example.git

Step 4: After the repo is cloned, type cd scrapy_playwright_example in your terminal, then python -m venv venv_scraping to create a virtual environment

Step 5: Activate the virtual environment by typing source venv_scraping/bin/activate

Step 6: Type pip3 install -r requirements.txt to install the dependencies

Step 7: If it is your first time using scrapy-playwright, you will also need to install the headless browsers by typing playwright install in your terminal
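To confirm that the virtual environment picked up the dependencies, you can check that the relevant modules are importable. This is a minimal sketch: the helper name is made up for illustration, and the module list assumes that requirements.txt installs scrapy, scrapy-playwright, and playwright. It checks the Python packages only, not the browser binaries that playwright install downloads.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names of the packages this project depends on.
    required = ["scrapy", "scrapy_playwright", "playwright"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable")
```

Run it with the virtual environment activated; a missing module usually means pip installed into a different interpreter.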

Step 8: Before running the crawler, add the following lines to your settings.py file:

# Playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

These settings come directly from the official scrapy-playwright documentation. I encourage you to go through it to get acquainted with more use cases of the plugin.
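With these settings in place, scrapy-playwright only renders requests that opt in through the playwright key in Request.meta; other requests go through Scrapy's regular download handler. The sketch below shows one way to build such a meta dictionary. The helper name with_playwright and the page_number key are hypothetical, and the spider in this repo may build its requests differently.

```python
def with_playwright(meta=None):
    """Return a copy of a Request.meta dict with Playwright rendering enabled."""
    merged = dict(meta or {})  # copy so the caller's dict is not modified
    merged["playwright"] = True  # key that scrapy-playwright looks for on each request
    return merged


# Inside a spider, the result would be passed to the request, for example:
#     yield scrapy.Request(url, meta=with_playwright({"page_number": 1}))
```

Forgetting this key is a common reason for getting the unrendered HTML back even though the download handlers are configured.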

Step 9: To run the crawler, change into the site_crawler folder (cd site_crawler from the repo root, or cd scrapy_playwright_example/site_crawler from its parent directory) and run scrapy crawl spanish_site_crawler. This launches the spider, which scrapes the name, discount tag, and price of each product. The argument after scrapy crawl is the name of the spider, which can be changed by setting the name variable under the SiteCrawlerSpider class to something else.
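If you would rather trigger the same command from another Python script and save the scraped items to a file, the sketch below wraps it with subprocess. The function names and the items.json file name are made up for illustration; -O is Scrapy's option for writing items to a file and overwriting it if it exists.

```python
import subprocess


def build_crawl_command(spider_name, output_file=None):
    """Build the argument list for `scrapy crawl`, optionally writing items to a file."""
    command = ["scrapy", "crawl", spider_name]
    if output_file:
        command += ["-O", output_file]  # -O overwrites the output file
    return command


def run_crawl(spider_name, project_dir, output_file=None):
    """Run the crawl inside the Scrapy project folder and return the exit code."""
    command = build_crawl_command(spider_name, output_file)
    return subprocess.run(command, cwd=project_dir, check=False).returncode


# Usage (hypothetical output file name; project_dir is the folder from Step 9):
#     run_crawl("spanish_site_crawler", "site_crawler", "items.json")
```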

The end result should look like this...

Step 10 (Optional): If you want to launch the spider by running the script itself through the play button in the top-right corner rather than through the terminal, add the import from scrapy.crawler import CrawlerProcess at the start of the script and insert the following lines at the end of the script, outside the class code block:

# Run the spider only when this file is executed directly, not when Scrapy imports it
if __name__ == "__main__":
    process = CrawlerProcess(settings={
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    })  # The same settings you put in settings.py
    process.crawl(SiteCrawlerSpider)  # Name of the spider class
    process.start()  # Blocks until the crawl finishes

Extra Resources

Here are two nice YouTube videos that walk you through how to install and use the package:

A tutorial by John Watson Rooney
A tutorial by Upendra
