<h1 align="center">
  <br>
  <a href="https://github.com/jschnurr/scrapyscript"><img src="https://i.ibb.co/ww3bNZ3/scrapyscript.png" alt="Scrapyscript"></a>
  <br>
</h1>

<h4 align="center">Embed Scrapy jobs directly in your code</h4>

<p align="center">
  <a href="https://github.com/jschnurr/scrapyscript/releases">
    <img src="https://img.shields.io/github/release/jschnurr/scrapyscript.svg">
  </a>
  <a href="https://pypi.org/project/scrapyscript/">
    <img src="https://img.shields.io/pypi/v/scrapyscript.svg">
  </a>
  <img src="https://github.com/jschnurr/scrapyscript/workflows/Tests/badge.svg">
  <img src="https://img.shields.io/pypi/pyversions/scrapyscript.svg">
</p>

## What is Scrapyscript?
Scrapyscript is a Python library you can use to run Scrapy spiders directly from your code. Scrapy is a great framework to use for scraping projects, but sometimes you don't need the whole framework, and just want to run a small spider from a script or a Celery job. That's where Scrapyscript comes in.
With Scrapyscript, you can:

- wrap regular Scrapy Spiders in a `Job`
- load the `Job`(s) in a `Processor`
- call `processor.run()` to execute them

... returning all results when the last job completes.
Let's see an example.
```python
import scrapy
from scrapyscript import Job, Processor

processor = Processor(settings=None)

class PythonSpider(scrapy.spiders.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        data = response.xpath("//title/text()").extract_first()
        return {'title': data}

job = Job(PythonSpider, url="http://www.python.org")
results = processor.run(job)

print(results)
```

```json
[{ "title": "Welcome to Python.org" }]
```
See the examples directory for more, including a complete Celery
example.
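For a rough idea of how that fits together, here is a minimal sketch of a Celery task wrapping a job (the spider, task name, and broker URL are assumptions for illustration, not the shipped example):

```python
import scrapy
from celery import Celery
from scrapyscript import Job, Processor

# Broker URL is illustrative; any Celery-supported broker works
app = Celery("tasks", broker="redis://localhost:6379/0")

class TitleSpider(scrapy.Spider):
    name = "titlespider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        yield {"title": response.xpath("//title/text()").get()}

@app.task
def crawl(url):
    # Each task invocation creates its own Processor, so Scrapy's
    # single-use reactor never clashes with the long-lived Celery worker
    job = Job(TitleSpider, url=url)
    return Processor(settings=None).run(job)
```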
## Install

```bash
pip install scrapyscript
```
## Requirements

- Linux or MacOS
- Python 3.8+
- Scrapy 2.5+
## API

### Job(spider, *args, **kwargs)

A single request to call a spider, optionally passing in `*args` or `**kwargs`, which will be passed through to the spider constructor at runtime.
```python
# url will be available as self.url inside MySpider at runtime
myjob = Job(MySpider, url='http://www.github.com')
```
### Processor(settings=None)

Create a multiprocessing reactor for running spiders. Optionally provide a `scrapy.settings.Settings` object to configure the Scrapy runtime.
```python
settings = scrapy.settings.Settings(values={'LOG_LEVEL': 'WARNING'})
processor = Processor(settings=settings)
```
### Processor.run(jobs)

Start the Scrapy engine, and execute one or more jobs. Blocks and returns consolidated results in a single list. `jobs` can be a single instance of `Job`, or a list.
```python
results = processor.run(myjob)
```

or

```python
results = processor.run([myjob1, myjob2, ...])
```
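As a short sketch of the consolidated return value, reusing the `PythonSpider` defined in the example above (the second URL is only an illustration):

```python
processor = Processor(settings=None)

job1 = Job(PythonSpider, url="http://www.python.org")
job2 = Job(PythonSpider, url="https://www.djangoproject.com")

# run() blocks until both spiders have finished, then returns every
# yielded dict/Item from both jobs in a single flat list
results = processor.run([job1, job2])
print(len(results))  # 2 -- one {'title': ...} dict per job
```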
## A word about Spider outputs

As per the Scrapy docs, a `Spider` must return an iterable of `Request` and/or `dict` or `Item` objects.

Requests will be consumed by Scrapy inside the `Job`. `dict` or `scrapy.Item` objects will be queued and output together when all spiders are finished.

Due to the way billiard handles communication between processes, each `dict` or `Item` must be pickle-able using pickle protocol 0. It's generally best to output `dict` objects from your Spider.
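For example, a minimal sketch of a spider that sticks to plain `dict` outputs (the spider name and fields are illustrative):

```python
import scrapy

class LinkSpider(scrapy.Spider):
    name = "linkspider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        # Yield plain dicts of built-in types; these pickle cleanly under
        # protocol 0 and pass safely back from the billiard worker process
        for href in response.css("a::attr(href)").getall():
            yield {"page": response.url, "link": href}
```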
## Contributing

Updates, additional features or bug fixes are always welcome.

### Setup

- Install Poetry
- `git clone git@github.com:jschnurr/scrapyscript.git`
- `poetry install`

### Tests

`make test` or `make tox`
## Version History

See CHANGELOG.md
## License

The MIT License (MIT). See LICENCE file for details.