Awesome Scrapy
A curated list of awesome packages, articles, and other cool resources from the Scrapy community. Scrapy is a fast high-level web crawling & scraping framework for Python.
Apps
Visual Web Scraping
- Portia Visual scraping for Scrapy
Distributed Spider
- scrapy-cluster Distributed on demand scraping cluster using Redis and Kafka.
- scrapy-redis Redis-based components for Scrapy.
Scrapy Service
- scrapyscript Run a Scrapy spider programmatically from a script or a Celery task, no project required.
- scrapyd A service daemon to run Scrapy spiders
- scrapyd-client Command-line client for the Scrapyd server
- python-scrapyd-api A Python wrapper for working with Scrapyd's API.
- SpiderKeeper A scalable admin UI for spider services
- scrapyrt HTTP server that provides an API for scheduling Scrapy spiders and making requests with spiders.
Front-End Scrapy Managers
- Gerapy Distributed crawler management framework based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js
- SpiderKeeper Admin UI for Scrapy; open-source Scrapinghub alternative.
- ScrapydWeb Scrapyd cluster management, Scrapy log analysis & visualization, basic auth, auto packaging, timer tasks, email notices, and mobile UI.
Monitor
- scrapy-sentry Logs Scrapy exceptions to Sentry
- scrapy-statsd-middleware StatsD integration middleware for Scrapy
- scrapy-jsonrpc An extension to control a running Scrapy web crawler via JSON-RPC
- scrapy-fieldstats A Scrapy extension to log item coverage when the spider shuts down
- spidermon Extension that provides useful tools for data validation, stats monitoring, and notification messages.
Avoid Ban
- HttpProxyMiddleware A middleware for Scrapy that changes the HTTP proxy from time to time.
- scrapy-proxies Processes Scrapy requests using a random proxy from a list to avoid IP bans and improve crawling speed.
- scrapy-rotating-proxies Use multiple proxies with Scrapy
- scrapy-random-useragent Scrapy middleware to set a random User-Agent for every request.
- scrapy-fake-useragent Random User-Agent middleware based on fake-useragent
- scrapy-crawlera Crawlera routes requests through a pool of IPs, throttling access by introducing delays and discarding IPs from the pool when they get banned from certain domains or have other problems.
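
As a rough illustration of what the User-Agent packages above do, here is a minimal sketch of a class shaped like a Scrapy downloader middleware. It is not the code of any listed package; the `USER_AGENTS` pool, the `ExampleBrowser` strings, and the `rng` parameter are assumptions made up for this example.

```python
import random

# Hypothetical pool for illustration; a real project would load a larger,
# up-to-date list (for example from a file or the fake-useragent package).
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/3.0",
]


class RandomUserAgentMiddleware:
    """Downloader-middleware-shaped class that sets a random User-Agent."""

    def __init__(self, user_agents=USER_AGENTS, rng=None):
        self.user_agents = list(user_agents)
        # An injectable random generator keeps the behaviour testable.
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # Scrapy request headers support dict-style assignment.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Returning None lets Scrapy continue handling the request normally.
        return None
```

In a Scrapy project, a class like this would typically be enabled through the `DOWNLOADER_MIDDLEWARES` setting, for example under a hypothetical path such as `myproject.middlewares.RandomUserAgentMiddleware`.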
Data Processing
- scrapy-elasticsearch A Scrapy pipeline that sends items to an Elasticsearch server
- scrapy-mongodb MongoDB pipeline for Scrapy.
- scrapy-mysql-pipeline MySQL pipeline to persist items in MySQL databases.
- scrapy-s3pipeline Scrapy pipeline to store chunked items in an AWS S3 bucket
- scrapy-sqs-exporter Scrapy extension for outputting scraped items to an Amazon SQS instance
- scrapy-kafka-export Scrapy extension that writes crawled items to Kafka
- scrapy-rss-exporter An RSS exporter for Scrapy
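
Most of the packages above follow the same item pipeline shape: receive each scraped item, hand it to a storage backend, and pass it along. The sketch below shows that shape with chunked buffering. It is not taken from any listed package; `ChunkedJsonLinesPipeline`, `chunk_size`, and the `sink` callable are hypothetical names standing in for a real storage client.

```python
import json


class ChunkedJsonLinesPipeline:
    """Item-pipeline-shaped class that buffers items and flushes in chunks."""

    def __init__(self, chunk_size=100, sink=None):
        self.chunk_size = chunk_size
        # `sink` is a hypothetical callable that receives one serialized
        # chunk (a JSON Lines string); it defaults to printing the chunk.
        self.sink = sink or print
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.chunk_size:
            self._flush()
        # Return the item so later pipelines still receive it.
        return item

    def close_spider(self, spider):
        # Flush whatever is left when the crawl ends.
        if self.buffer:
            self._flush()

    def _flush(self):
        payload = "\n".join(json.dumps(row, sort_keys=True) for row in self.buffer)
        self.sink(payload)
        self.buffer = []
```

A real backend would replace `sink` with an upload or insert call and would usually read its settings in a `from_crawler` class method.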
Process JavaScript
- scrapy-playwright Enable scraping dynamic pages using Playwright.
- scrapy-puppeteer Make Scrapy and Puppeteer work together to handle JavaScript-rendered pages.
- scrapy-splash Lets Scrapy handle JavaScript-rendered pages through Splash
Other Useful Extensions
- scrapy-djangoitem Scrapy extension to write scraped items using Django models
- scrapy-deltafetch Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls
- scrapy-crawl-once Scrapy middleware that avoids re-crawling pages already downloaded in previous crawls.
- scrapy-magicfields Scrapy middleware to add extra fields to items, such as timestamp, response fields, and spider attributes.
- scrapy-pagestorage A Scrapy extension to store request and response information in a storage service.
- itemloaders Library to populate items using XPath and CSS with a convenient API.
- itemadapter Adapter that provides a common interface to handle objects of different types in a uniform manner.
- scrapy-poet Page Object pattern implementation that enables writing reusable and portable extraction and crawling code.
Resources
Articles
- Web Scraping in Python using Scrapy (with multiple examples)
- Advanced Web Scraping: Bypassing "403 Forbidden," captchas, and more
Exercises
Video
- Scrapy: Powerful Web Scraping & Crawling with Python Online courses on Udemy