
PAR Scrape


PAR Scrape is a versatile web scraping tool with options for Selenium or Playwright, featuring AI-powered data extraction and formatting.

"Buy Me A Coffee"

Screenshots

PAR Scrape Screenshot

Features

  * Scrape pages with either Selenium or Playwright
  * AI-powered data extraction and formatting
  * Works with OpenAI, Anthropic, Groq, GitHub Models, Google, AWS Bedrock, and Ollama models
  * Output in formats such as Markdown, JSON, and CSV
  * Optional Anthropic prompt caching and per-run pricing breakdown

Known Issues

Prompt Cache

PAR Scrape supports Anthropic prompt caching, enabled with the --prompt-cache flag (see the last example under Examples below).

Prerequisites

To install PAR Scrape, make sure you have Python 3.11 installed.

uv is the recommended way to install and run PAR Scrape. To install uv:

Linux and Mac

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
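
You can verify uv is available afterwards:

uv --version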

Installation

Installation From Source

To install PAR Scrape from source, follow these steps:

  1. Clone the repository:

    git clone https://github.com/paulrobello/par_scrape.git
    cd par_scrape
    
  2. Install the package dependencies using uv:

    uv sync
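
To confirm the environment is set up, you can run the CLI through uv (the --help flag is assumed here as a quick smoke test):

uv run par_scrape --help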
    

Installation From PyPI

To install PAR Scrape from PyPI, run either of the following commands:

uv tool install par_scrape
pipx install par_scrape
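
Either command installs the par_scrape executable onto your PATH as an isolated tool; a quick check (again assuming the standard --help flag):

par_scrape --help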

Playwright Installation

To use Playwright as the scraper, you must install it and its browsers with the following commands:

uv tool install playwright
playwright install chromium
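
On Linux, Playwright may also need system-level browser dependencies; its CLI can install those alongside the browser:

playwright install --with-deps chromium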

Usage

To use PAR Scrape, run it from the command line with various options; basic examples are shown below. First, make sure the API key for your AI provider is set in your environment. You can also store your API keys in the file ~/.par_scrape.env as follows:

GROQ_API_KEY= # is required for Groq. Get a free key from https://console.groq.com/
ANTHROPIC_API_KEY= # is required for Anthropic. Get a key from https://console.anthropic.com/
OPENAI_API_KEY= # is required for OpenAI. Get a key from https://platform.openai.com/account/api-keys
GITHUB_TOKEN= # is required for GitHub Models. Get a free key from https://github.com/marketplace/models
GOOGLE_API_KEY= # is required for Google Models. Get a free key from https://console.cloud.google.com
LANGCHAIN_API_KEY= # is required for LangChain LangSmith tracing. Get a free key from https://smith.langchain.com/settings
AWS_PROFILE= # is used for Bedrock authentication. The environment must already be authenticated with AWS.
# No key required to use with Ollama models.
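
Alternatively, you can set a key for the current shell session only, rather than using the env file. A minimal sketch using OpenAI as the example provider (bash/zsh syntax; replace the placeholder with your real key):

export OPENAI_API_KEY="your-key-here"
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Price" --model gpt-4o-mini --display-output md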

Running from source

uv run par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" -f "Cache Price" --model gpt-4o-mini --display-output md

Running if installed from PyPI

par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" -f "Cache Price" --model gpt-4o-mini --display-output md

Options
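
Run par_scrape --help for the authoritative list of options. Judging from the examples below, the flags you will reach for most often are --url (the page to scrape), -f (a field to extract; repeat it once per field), --model, --scraper (selenium or playwright), -d / --display-output (output format such as md, json, or csv), -a (the AI provider), and --pricing (cost display).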

Examples

  1. Basic usage with default options:
par_scrape --url "https://openai.com/api/pricing/" -f "Model" -f "Pricing Input" -f "Pricing Output" --pricing -w text -i gpt-4o
  2. Using Playwright and displaying JSON output:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --scraper playwright -d json --pricing -w text -i gpt-4o
  3. Specifying a custom model and output folder:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --model gpt-4 --output-folder ./custom_output --pricing -w text -i gpt-4o
  4. Running in silent mode with a custom run name:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --silent --run-name my_custom_run --pricing -w text -i gpt-4o
  5. Using the cleanup option to remove the output folder after scraping:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --cleanup --pricing
  6. Using the pause option to wait for user input before scrolling:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --pause --pricing
  7. Using the Anthropic provider with prompt cache enabled and a detailed pricing breakdown:
par_scrape -a Anthropic --prompt-cache -d csv -p details -f "Title" -f "Description" -f "Price" -f "Cache Price"

What's New

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Paul Robello - probello@gmail.com