Multiscrape

Important note: troubleshooting

If you don't manage to scrape the value you are looking for, please enable debug logging and log_response. This will provide you with a lot of information for further investigation. log_response writes all responses to files. If the value you want to scrape is not in the files containing the BeautifulSoup output (*-soup.txt), Multiscrape will not be able to scrape it. Most likely the value is retrieved in the background by JavaScript. Your best chance in that case is to investigate the network traffic in the developer tools of your browser and try to find a JSON response containing the value you are looking for.
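
As a minimal sketch (reusing the example configuration shown further below), log_response can be switched on per multiscrape instance; the response files are then written to /config/multiscrape/:

multiscrape:
  - name: HA scraper
    resource: https://www.home-assistant.io
    log_response: true
    sensor:
      - unique_id: ha_latest_version
        name: Latest version
        select: ".release-date"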

If all of this doesn't help, use the Home Assistant forum. I cannot give everyone personal assistance, so please don't create GitHub issues unless you are sure there is a bug. Check the wiki for a scraping guide and other details on the functionality of this component.

Important note: be a good citizen and be aware of your responsibility

You, and you alone, are accountable for your scraping activities. Be a good (web) citizen: set reasonable scan_interval timings, seek explicit permission before scraping, and adhere to local and international laws. Respect website policies, handle data ethically, mind resource usage, and regularly monitor your actions. Uphold these principles to ensure ethical and sustainable scraping practices.

Introduction

This Home Assistant custom component can scrape multiple fields (using CSS selectors) from a single HTTP request (the existing scrape sensor can scrape a single field only). The scraped data becomes available in separate sensors.

It is based on both the existing Rest sensor and the Scrape sensor, and most properties of those sensors apply.

<a href="https://www.buymeacoffee.com/danieldotnl" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/default-blue.png" alt="Buy Me A Coffee" style="height: 51px !important;width: 217px !important;" ></a>

Multiscrape is sponsored by CapSolver!

<img width="960" alt="CapSolver Ads" src="https://github.com/user-attachments/assets/c4f17c1f-ed98-49a8-81d6-097d373dc53d">

CapSolver is an AI-powered service that automatically solves a range of CAPTCHAs, helping developers tackle CAPTCHA challenges encountered during web scraping. Whether you're extracting data from e-commerce sites, financial platforms, or social media, CapSolver supports CAPTCHAs such as reCAPTCHA V2, reCAPTCHA V3, hCaptcha, ImageToText, DataDome, AWS, Geetest, Cloudflare Turnstile, and more. With API integration, browser extension options, and flexible pricing packages, CapSolver adapts to diverse web scraping needs and scenarios.

Installation

Install quickly via a HACS link, install via HACS (default store), or install manually by copying the files into a new 'custom_components/multiscrape' directory.

Example configuration (YAML)

multiscrape:
  - name: HA scraper
    resource: https://www.home-assistant.io
    scan_interval: 3600
    sensor:
      - unique_id: ha_latest_version
        name: Latest version
        select: ".release-date"
        value_template: "{{ value | trim }}"
      - unique_id: ha_release_date
        icon: >-
          {% if is_state('binary_sensor.ha_version_check', 'on') %}
            mdi:alarm-light
          {% else %}
            mdi:bat
          {% endif %}
        name: Release date
        select: ".release-date"
        attribute: "title"
        value_template: "{{ (value.split('released')[1]) }}"
    binary_sensor:
      - unique_id: ha_version_check
        name: Latest version == 2021.7.0
        select: ".release-date"
        value_template: '{{ value | trim == "2021.7.0" }}'
        attributes:
          - name: Release notes link
            select: ".release-date"
            attribute: href

Options

Based on the latest (pre-)release.

| name | description | required | default | type |
| --- | --- | --- | --- | --- |
| name | The name for the integration. | False | | string |
| resource | The URL for retrieving the site, or a template that will output a URL. Not required when resource_template is provided. | True | | string |
| resource_template | A template that will output a URL after being rendered. Only required when resource is not provided. | True | | template |
| authentication | Configure HTTP authentication. basic or digest. Use this with username and password fields. | False | | string |
| username | The username for accessing the URL. | False | | string |
| password | The password for accessing the URL. | False | | string |
| headers | The headers for the requests. | False | | template - list |
| params | The query params for the requests. | False | | template - list |
| method | The method for the request. Either POST or GET. | False | GET | string |
| payload | Optional payload to send with a POST request. | False | | string |
| verify_ssl | Verify the SSL certificate of the endpoint. | False | True | boolean |
| log_response | Log the HTTP responses and the HTML parsed by BeautifulSoup in files. (Will be written to /config/multiscrape/name_of_config) | False | False | boolean |
| timeout | Defines the max time to wait for data from the endpoint. | False | 10 | int |
| scan_interval | Determines how often the URL will be requested. | False | 60 | int |
| parser | Determines the parser to be used with BeautifulSoup. Either lxml or html.parser. | False | lxml | string |
| list_separator | Separator to be used in combination with the select_list features. | False | , | string |
| form_submit | See Form-submit | False | | |
| sensor | See Sensor | False | | list |
| binary_sensor | See Binary sensor | False | | list |
| button | See Refresh button | False | | list |
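
For illustration, a hedged sketch combining several of these options; the URL, credentials, headers, params and selector below are hypothetical placeholders, not values required by the component:

multiscrape:
  - name: Example scraper
    resource: https://example.com/status
    method: GET
    authentication: basic
    username: !secret scraper_username
    password: !secret scraper_password
    headers:
      User-Agent: Home Assistant
    params:
      page: "1"
    verify_ssl: true
    timeout: 30
    scan_interval: 600
    parser: lxml
    sensor:
      - unique_id: example_status
        name: Example status
        select: ".status"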

Sensor/Binary Sensor

Configure the sensors that will scrape the data.

| name | description | required | default | type |
| --- | --- | --- | --- | --- |
| unique_id | Will be used as entity_id and enables editing the entity in the UI | False | | string |
| name | Friendly name for the sensor | False | | string |
| | See Selector fields | True | | |
| attributes | See Sensor attributes | False | | list |
| unit_of_measurement | Defines the units of measurement of the sensor | False | | string |
| device_class | Sets the device_class for sensors or binary sensors | False | | string |
| state_class | Defines the state class of the sensor, if any (measurement, total or total_increasing). Not for binary_sensor. | False | None | string |
| icon | Defines the icon or a template for the icon of the sensor. The value of the selector (or value_template, when given) is provided as input for the template. For binary sensors, the value is parsed into a boolean. | False | | string/template |
| picture | Contains a path to a local image and will set it as the entity picture | False | | string |
| force_update | Sends update events even if the value hasn't changed. Useful if you want to have meaningful value graphs in history. | False | False | boolean |
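
As an illustration of these sensor options, a hedged sketch of a single sensor inside a multiscrape entry; the site, selector and threshold are made up for the example:

multiscrape:
  - resource: https://example.com/weather
    sensor:
      - unique_id: outside_temperature
        name: Outside temperature
        select: ".temperature"
        value_template: "{{ value | float }}"
        unit_of_measurement: "°C"
        device_class: temperature
        state_class: measurement
        force_update: true
        icon: >-
          {% if value | float > 25 %}
            mdi:thermometer-alert
          {% else %}
            mdi:thermometer
          {% endif %}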

Refresh button

Configure a refresh button to manually trigger scraping.

| name | description | required | default | type |
| --- | --- | --- | --- | --- |
| unique_id | Will be used as entity_id and enables editing the entity in the UI | False | | string |
| name | Friendly name for the button | False | | string |
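
A hedged sketch of a button entry (the unique_id and name are placeholders); pressing the button triggers a new scrape of its multiscrape instance:

multiscrape:
  - name: HA scraper
    resource: https://www.home-assistant.io
    button:
      - unique_id: ha_scraper_refresh
        name: Refresh HA scraper
    sensor: ...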

Sensor attributes

Configure the attributes on the sensor that can be set with additional scraping values.

| name | description | required | default | type |
| --- | --- | --- | --- | --- |
| name | Name of the attribute (will be slugified) | True | | string |
| | See Selector fields | True | | |

Form-submit

Configure the form-submit functionality which enables you to submit a (login) form before scraping a site. More details on how this works can be found on the wiki.

| name | description | required | default | type |
| --- | --- | --- | --- | --- |
| resource | The URL for the site with the form | False | | string |
| select | CSS selector used for selecting the form in the HTML. When omitted, the input fields are posted directly. | False | | string |
| input | A dictionary with name/value pairs which will be merged with the input fields on the form | False | | string - dictionary |
| input_filter | A list of input fields that should not be submitted with the form | False | | string - list |
| submit_once | Submit the form only once on startup instead of each scan interval | False | False | boolean |
| resubmit_on_error | Resubmit the form after a scraping error is encountered | False | True | boolean |
| variables | See Form Variables | False | | list |

Form Variables

Configure the variables that will be scraped from the form_submit response. These variables can be used elsewhere in the configuration of the same multiscrape instance: in the value_template of a sensor/attribute selector or in a header. A common use case is to populate an X-Login-Token header with the token that results from the login.

| name | description | required | default | type |
| --- | --- | --- | --- | --- |
| name | Name of the variable | True | | string |
| | See Selector fields | True | | |

Example:

multiscrape:
  - resource: "https://somesiteyouwanttoscrape.com"
    form_submit:
      submit_once: True
      resource: "https://authforsomesiteyouwanttoscrape.com"
      input:
        email: "<email>"
        password: "<password>"
      variables:
        - name: token
          value_template: "{{ ... }}"
    headers:
      X-Login-Token: "{{ token }}"
    sensor: ...

Selector

Used to configure scraping options.

| name | description | required | default | type |
| --- | --- | --- | --- | --- |
| select | CSS selector used for retrieving the value of the attribute. Only required when select_list or value_template is not provided. | False | | string/template |
| select_list | CSS selector for multiple values of multiple elements, which will be returned as CSV. Only required when select or value_template is not provided. | False | | string/template |
| attribute | Attribute from the selected element to read as value. | False | | string |
| value_template | Defines a template applied to extract the value from the result of the selector (if provided) or the raw page (if no selector is provided). | False | | string/template |
| extract | Determines how the result of the CSS selector is extracted. Only applicable to HTML. text returns just the text, content returns the HTML content of the selected tag, and tag returns the HTML including the selected tag. | False | text | string |
| on_error | See On-error | False | | |
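
To illustrate the difference between select, select_list, attribute and extract, a hedged sketch with hypothetical selectors and URL:

multiscrape:
  - resource: https://example.com/news
    sensor:
      - unique_id: article_titles
        name: Article titles
        select_list: "article h2"      # all matching elements, joined with the list_separator
      - unique_id: first_article_link
        name: First article link
        select: "article a"
        attribute: href                # read the href attribute instead of the element text
      - unique_id: first_article_html
        name: First article HTML
        select: "article"
        extract: content               # return the inner HTML of the selected tag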

On-error

Configure what should happen in case of a scraping error (i.e. the CSS selector does not return a value).

| name | description | required | default | type |
| --- | --- | --- | --- | --- |
| log | Determines if and how something should be logged in case of a scraping error. Value can be either 'false', 'info', 'warning' or 'error'. | False | error | string |
| value | Determines what value the sensor/attribute should get in case of a scraping error. The value can be 'last', meaning that the value does not change, 'none', which results in HA showing 'Unknown' on the sensor, or 'default', which will show the specified default value. | False | none | string |
| default | The default value to be used when the on-error value is set to 'default'. | False | | string |
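
For example, an on_error block on a sensor could look like this (a sketch reusing the selector from the earlier example; the fallback value is arbitrary):

multiscrape:
  - resource: https://www.home-assistant.io
    sensor:
      - unique_id: ha_latest_version
        name: Latest version
        select: ".release-date"
        on_error:
          log: warning        # log scraping errors as warnings instead of errors
          value: default      # use the default below instead of becoming 'Unknown'
          default: unavailable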

Services

For each multiscrape instance, a service will be created to trigger a scrape run through an automation. (For manual triggering, the button entity can now be configured.) The services are named multiscrape.trigger_{name of integration}.
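
For example, assuming the instance named 'HA scraper' from the example above gets a service slugified to multiscrape.trigger_ha_scraper (the exact slug is an assumption), an automation could call it like this:

automation:
  - alias: Scrape HA site every morning
    trigger:
      - platform: time
        at: "07:00:00"
    action:
      - service: multiscrape.trigger_ha_scraper  # hypothetical slug derived from the 'HA scraper' name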

Multiscrape also offers a get_content and a scrape service. get_content retrieves the content of the website you want to scrape; it returns the same data that you would otherwise get by enabling log_response and opening the page_soup.txt file.
scrape does what its name says: it scrapes a website and provides the sensors and attributes.

Both services accept the same configuration as you would provide in your configuration YAML (as described above), with a small but important caveat: if the service input contains templates, those are automatically rendered by Home Assistant when the service is called. That is fine for templates like resource and select, but templates that need to be applied to the scraped data itself (like value_template) cannot be rendered when the service is called. Therefore you need to slightly alter the syntax and add a ! in the middle, e.g. {{ becomes {!{ and %} becomes %!}. Multiscrape will then understand that this string needs to be handled as a template after the service has been called.
If someone has a better solution, please let me know!

To call one of these services, go to 'Developer tools' in Home Assistant and then to 'Services'. Find the multiscrape.get_content or multiscrape.scrape service and switch to YAML mode. There you can enter your configuration. Example:

service: multiscrape.scrape
data:
  name: HA scraper
  resource: https://www.home-assistant.io
  sensor:
    - unique_id: ha_latest_version
      name: Latest version
      select: ".release-date"
      value_template: "{!{ value | trim }!}"
    - unique_id: ha_release_date
      name: Release date
      select: ".release-date"
      attribute: "title"
      value_template: "{!{ (value.split('released')[1]) }!}"

Debug logging

Debug logging can be enabled as follows:

logger:
  default: info
  logs:
    custom_components.multiscrape: debug

Depending on your issue, also consider enabling log_response.

Contributions are welcome!

If you want to contribute to this project, please read the Contribution guidelines.

Credits

This project was generated from @oncleben31's Home Assistant Custom Component Cookiecutter template.

Code template was mainly taken from @Ludeeus's integration_blueprint template