<!-- Please do not edit this file. Edit the `blah` field in the `package.json` instead. If in doubt, open an issue. -->

scrape-it

<a href="https://www.buymeacoffee.com/H96WwChMy" target="_blank"><img src="https://www.buymeacoffee.com/assets/img/custom_images/yellow_img.png" alt="Buy Me A Coffee"></a>

A Node.js scraper for humans.


<p align="center"> Sponsored with :heart: by: <br/><br/> <a href="https://serpapi.com">Serpapi.com</a> is a platform that allows you to scrape Google and other search engines from our fast, easy, and complete API.<br> <a href="https://serpapi.com"><img src="https://i.imgur.com/0Pk6Ysp.png" width="250" /></a> <br/><br/>

Capsolver.com is an AI-powered service that specializes in solving various types of captchas automatically. It supports captchas such as reCAPTCHA V2, reCAPTCHA V3, hCaptcha, FunCaptcha, DataDome, AWS Captcha, Geetest, and Cloudflare Captcha / Challenge 5s, Imperva / Incapsula, among others. For developers, Capsolver offers API integration options detailed in their documentation, facilitating the integration of captcha solving into applications. They also provide browser extensions for Chrome and Firefox, making it easy to use their service directly within a browser. Different pricing packages are available to accommodate varying needs, ensuring flexibility for users. <a href="https://capsolver.com/?utm_source=github&utm_medium=github_banner&utm_campaign=scrape-it"><img src="https://i.imgur.com/lCngxre.jpeg"/></a>

</p>

:cloud: Installation

# Using npm
npm install --save scrape-it

# Using yarn
yarn add scrape-it

:bulb: ProTip: You can install the CLI version of this module by running npm install --global scrape-it-cli (or yarn global add scrape-it-cli).

FAQ

Here are some frequent questions and their answers.

1. How to scrape AJAX pages?

scrape-it has only a simple request module for making requests. That means you cannot directly parse AJAX pages with it, but in general you will run into one of these scenarios:

  1. The AJAX response is in JSON format. In this case, you can make the request directly, without needing a scraping library.
  2. The AJAX response gives you HTML back. Instead of calling the main website (e.g. example.com), pass the AJAX URL (e.g. example.com/api/that-endpoint) to scrape-it and you will be able to parse the response.
  3. The AJAX request is so complicated that you don't want to reverse-engineer it. In this case, use a headless browser (e.g. Google Chrome, Electron, PhantomJS) to load the content and then use the .scrapeHTML method from scrape-it once the HTML is loaded on the page.
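The first scenario can be sketched roughly as follows. This is a minimal illustration, not part of scrape-it: it assumes Node.js 18 or newer (for the built-in fetch), and both the function name fetchJsonEndpoint and the fetchImpl parameter are hypothetical, added only so the function can be tried out with a stub instead of a real network call.

```javascript
// Scenario 1: the AJAX endpoint answers with JSON, so no scraping library is needed.
// `fetchImpl` defaults to the global fetch (Node.js 18+). It is a parameter only so
// that a stub can be passed in for offline testing (an assumption of this sketch).
async function fetchJsonEndpoint(url, fetchImpl = globalThis.fetch) {
    const res = await fetchImpl(url, { headers: { accept: "application/json" } })

    // Surface HTTP errors instead of trying to parse an error page as JSON
    if (!res.ok) {
        throw new Error(`Request failed with status ${res.status}`)
    }

    // Parse the response body as JSON and return the resulting value
    return res.json()
}
```

Calling fetchJsonEndpoint("https://example.com/api/that-endpoint") would then resolve to the parsed JSON, where the URL is the placeholder endpoint from the list above rather than a real API.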

2. Crawling

There is no fancy way to crawl pages with scrape-it. For simple scenarios, you can parse the list of URLs from the initial page and then, using Promises, scrape each page. Alternatively, you can use a different crawler to download the website and then use the .scrapeHTML method to scrape the local files.
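A simple two-step crawl along those lines might look like the sketch below. It is an illustration under assumptions: the crawl function is hypothetical, the selectors "a.article-title" and ".header h1" are placeholders borrowed from the example in this README, and scrapeIt is passed in as an argument (normally require("scrape-it")) so the flow can be exercised with a stub.

```javascript
// A sketch of a two-step crawl: read links from a start page, then scrape each one.
// `scrapeIt` is expected to be the function exported by scrape-it; the selectors
// below are placeholders and need to be adapted to the target site.
async function crawl(scrapeIt, startUrl) {
    // Step 1: collect the link targets from the initial page.
    // Omitting the selector reads the attribute from the list item itself.
    const { data: index } = await scrapeIt(startUrl, {
        links: {
            listItem: "a.article-title",
            data: { url: { attr: "href" } }
        }
    })

    // Resolve relative hrefs against the start URL
    const urls = index.links.map(link => new URL(link.url, startUrl).href)

    // Step 2: scrape every linked page in parallel with Promises
    return Promise.all(urls.map(async url => {
        const { data } = await scrapeIt(url, { title: ".header h1" })
        return { url, ...data }
    }))
}
```

Promise.all starts all requests at once, which is fine for a handful of pages. For longer lists it is probably better to process the URLs in small batches.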

3. Local files

Use the .scrapeHTML method to parse the HTML read from local files with fs.readFile.
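As a rough sketch of that flow: the helper below reads a file, loads it into Cheerio, and hands the result to scrapeIt.scrapeHTML($, opts). The name scrapeLocalFile is hypothetical, the promise-based fs.promises.readFile stands in for fs.readFile, and scrapeIt and cheerio are passed in as arguments (normally require("scrape-it") and require("cheerio")), which is a choice of this sketch rather than something the library requires.

```javascript
const fs = require("fs")

// Reads an HTML file from disk and scrapes it with scrapeIt.scrapeHTML.
// `scrapeIt` and `cheerio` are injected so the sketch can be tried with stubs.
async function scrapeLocalFile(scrapeIt, cheerio, filePath, opts) {
    // Read the file as a UTF-8 string (promise variant of fs.readFile)
    const html = await fs.promises.readFile(filePath, "utf8")

    // Load the markup into Cheerio and pass the `$` instance to scrape-it
    const $ = cheerio.load(html)
    return scrapeIt.scrapeHTML($, opts)
}
```

For example, scrapeLocalFile(require("scrape-it"), require("cheerio"), "./page.html", { title: ".header h1" }) would resolve to the scraped data, assuming both packages are installed and ./page.html exists.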

:clipboard: Example

const scrapeIt = require("scrape-it")

// Promise interface
scrapeIt("https://ionicabizau.net", {
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}).then(({ data, status }) => {
    console.log(`Status Code: ${status}`)
    console.log(data)
});


// Async-Await
(async () => {
    const { data } = await scrapeIt("https://ionicabizau.net", {
        // Fetch the articles
        articles: {
            listItem: ".article"
          , data: {

                // Get the article date and convert it into a Date object
                createdAt: {
                    selector: ".date"
                  , convert: x => new Date(x)
                }

                // Get the title
              , title: "a.article-title"

                // Nested list
              , tags: {
                    listItem: ".tags > span"
                }

                // Get the content
              , content: {
                    selector: ".article-content"
                  , how: "html"
                }

                // Get attribute value of root listItem by omitting the selector
              , classes: {
                    attr: "class"
                }
            }
        }

        // Fetch the blog pages
      , pages: {
            listItem: "li.page"
          , name: "pages"
          , data: {
                title: "a"
              , url: {
                    selector: "a"
                  , attr: "href"
                }
            }
        }

        // Fetch some other data from the page
      , title: ".header h1"
      , desc: ".header h2"
      , avatar: {
            selector: ".header img"
          , attr: "src"
        }
    })
    console.log(data)
    // { articles:
    //    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
    //        title: 'Pi Day, Raspberry Pi and Command Line',
    //        tags: [Object],
    //        content: '<p>Everyone knows (or should know)...a" alt=""></p>\n',
    //        classes: [Object] },
    //      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
    //        title: 'How I ported Memory Blocks to modern web',
    //        tags: [Object],
    //        content: '<p>Playing computer games is a lot of fun. ...',
    //        classes: [Object] },
    //      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
    //        title: 'How to convert JSON to Markdown using json2md',
    //        tags: [Object],
    //        content: '<p>I love and ...',
    //        classes: [Object] } ],
    //   pages:
    //    [ { title: 'Blog', url: '/' },
    //      { title: 'About', url: '/about' },
    //      { title: 'FAQ', url: '/faq' },
    //      { title: 'Training', url: '/training' },
    //      { title: 'Contact', url: '/contact' } ],
    //   title: 'Ionică Bizău',
    //   desc: 'Web Developer,  Linux geek and  Musician',
    //   avatar: '/images/logo.png' }
})()

:question: Get Help

There are a few ways to get help:

  1. Please post questions on Stack Overflow. You can open issues with questions, as long as you add a link to your Stack Overflow question.
  2. For bug reports and feature requests, open issues. :bug:
  3. For direct and quick help, you can use Codementor. :rocket:

:memo: Documentation

scrapeIt(url, opts, cb)

A scraping module for humans.

Params

Return

scrapeIt.scrapeHTML($, opts)

Scrapes the data in the provided element.

For the format of the selector, please refer to the Selectors section of the Cheerio library.

Params

Return

:yum: How to contribute

Have an idea? Found a bug? See how to contribute.

:sparkling_heart: Support my projects

I open-source almost everything I can, and I try to reply to everyone needing help using these projects. Obviously, this takes time. You can integrate and use these projects in your applications for free! You can even change the source code and redistribute (even resell it).

However, if you get some profit from this or just want to encourage me to continue creating stuff, there are a few ways you can do it:

Thanks! :heart:

:dizzy: Where is this library used?

If you are using this library in one of your projects, add it to this list. :sparkles:

:scroll: License

MIT © Ionică Bizău