PyWebCopy © 6

Created By : Raja Tomar License : MIT Email: rajatomar788@gmail.com

Clone websites and webpages with ease in Python: scrape or save complete webpages and websites.

A web scraping and archiving tool written in Python. Archive any online website and its assets (CSS, JS and images) for offline reading, storage or any other reason. It's easy with pywebcopy.

Why it's great? because it -

Email me at rajatomar788@gmail.com with any query :)

1.1 Installation

pywebcopy is available on PyPI and is easily installable using pip:


$ pip install pywebcopy

You are ready to go. Read the tutorials below to get started.

1.1.1 First steps

You should always check if the latest pywebcopy is installed successfully.

>>> import pywebcopy
>>> pywebcopy.__version__
'6.0.0'

Your version may be different. You can now continue with the tutorial.
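If you would rather check the installation from a script than from the console, the standard library can report the installed version without importing the package. This is only a minimal sketch: `installed_version` is a hypothetical helper name, and it assumes the distribution is published under the name `pywebcopy`, as in the pip command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


if __name__ == "__main__":
    version = installed_version("pywebcopy")
    if version is None:
        print("pywebcopy is not installed; run: pip install pywebcopy")
    else:
        print("pywebcopy", version)
```

The helper returns `None` instead of raising, so it can be used as a simple guard before the rest of a script runs.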

1.2 Basic Usages

To save a single page, run the following in a Python console:

from pywebcopy import save_webpage

kwargs = {'project_name': 'some-fancy-name'}

save_webpage(
    url='http://example-site.com/index.html',
    project_folder='path/to/downloads',
    **kwargs
)

To save a full website (this could overload the target server, so be careful):


from pywebcopy import save_website

kwargs = {'project_name': 'some-fancy-name'}

save_website(
    url='http://example-site.com/index.html',
    project_folder='path/to/downloads',
    **kwargs
)

1.3 Running Tests

Running the tests is simple and doesn't require any external library. Just run this command from the root directory of the pywebcopy package.

$ python -m pywebcopy run-tests

1.4 Command Line Interface

pywebcopy has an easy-to-use command-line interface that helps you get tasks done without having to worry about the internals.

1.5 Authentication and Cookies

Authentication is often needed to access a certain page. It's really easy to authenticate with pywebcopy because it uses a requests.Session object for its base HTTP activity, which can be accessed through the pywebcopy.SESSION attribute. There are plenty of tutorials on setting up authentication with requests.Session.

Here is a basic example of simple http auth -

import pywebcopy

# Attach the credentials to the shared session; requests expects
# HTTP basic auth as a (username, password) tuple on Session.auth

pywebcopy.SESSION.auth = ('username', 'password')

# Rest of the code is as usual
kwargs = {
    'url': 'http://localhost:5000',
    'project_folder': 'e://saved_pages//',
    'project_name': 'my_site'
}
pywebcopy.config.setup_config(**kwargs)
pywebcopy.save_webpage(**kwargs)
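When credentials are sent through the session headers, each header value has to be a plain string. The following is a minimal sketch of building an HTTP Basic `Authorization` value by hand. `basic_auth_header` is a hypothetical helper, and the commented-out lines assume that `pywebcopy.SESSION` behaves like a `requests.Session`, as the text above states.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic `Authorization` header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


headers = {"Authorization": basic_auth_header("user", "pass")}
print(headers)

# Assumption: pywebcopy.SESSION is a requests.Session (see the text above).
# import pywebcopy
# pywebcopy.SESSION.headers.update(headers)
```

Basic auth only encodes the credentials, it does not encrypt them, so this is best kept to HTTPS or local test servers.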

2.1 WebPage class

The WebPage class is the engine behind these saving actions. You can use this class to access many more methods for customising the process.

2.1.2 WebPage properties and methods

APIs that a WebPage object exposes once it has been created:

3.1 Scraping Support

Multiple scraping packages are wrapped up in one object, which you can use to get the best of all those libraries in one go without the hassle of instantiating each of them yourself.

To use the methods and properties documented below, just create an object once as described:

from pywebcopy import MultiParser

import requests

req = requests.get('http://google.com')

html = req.content

# You can skip the encoding declaration;
# it is smart enough to auto-detect :)
encoding = req.encoding

wp = MultiParser(html, encoding)

# done

All the code below builds on the code above.

You can also use any BeautifulSoup methods on it:
```python
>>> links = wp.bs4.find_all('a')

['//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/download/other/']

```
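A list of href strings like the one shown above comes from reading the `href` attribute of each `<a>` tag. As a rough illustration of that extraction step, here is a sketch that uses only the standard library. It is not how pywebcopy or BeautifulSoup implement it; `LinkCollector` and `extract_links` are hypothetical names, and the sample markup is made up for the demo.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return the hrefs of all anchors in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = (
    '<p><a href="/about/apps/">Apps</a> '
    '<a href="/download/other/">Other</a> <a>no href</a></p>'
)
print(extract_links(sample))  # ['/about/apps/', '/download/other/']
```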

Crawler object

This is a subclass of WebPage class and can be used to mirror any website.

>>> from pywebcopy import Crawler, config
>>> url = 'http://some-url.com/some-page.html'
>>> project_folder = '/home/desktop/'
>>> project_name = 'my_project'
>>> kwargs = {'bypass_robots': True}
# You should always start with setting up the config or use apis
>>> config.setup_config(url, project_folder, project_name, **kwargs)

# Create an instance of the Crawler object
>>> wp = Crawler()

# If you want to, you can use `requests` to fetch the pages
>>> wp.get(url, **{'auth': ('username', 'password')})

# Then you can access several methods like
>>> wp.crawl()

Common Settings and Errors

It is easy to make a beginner's mistake or get confused, so here are the common errors and how to correct them.

  1. pywebcopy.exceptions.AccessError

    If you are getting a pywebcopy.exceptions.AccessError exception, check whether the website allows scraping of its content. If you still want to proceed, set the bypass_robots config key.

    >>> import pywebcopy
    >>> pywebcopy.config['bypass_robots'] = True
    
    # rest of your code follows..
    
    
  2. Overwrite existing files when copying

    If you want to overwrite existing files in the directory then use the over_write config key.

    
    import pywebcopy
    pywebcopy.config['over_write'] = True
    
    # rest of your code follows..
    
    
  3. Changing your project name

    By default, pywebcopy creates a directory inside project_folder named after the URL you have provided, but you can change this using the code below.

    >>> import pywebcopy
    >>> pywebcopy.config['project_name'] = 'my_project'
    
    # rest of your code follows..
    
    

How to - Save Single Webpage

A particular webpage can be saved easily using the following methods.

Note: if you get pywebcopy.exceptions.AccessError when running any of this code, see the Common Settings and Errors section.

Method 1 : via api - save_webpage()

A webpage can easily be saved using the built-in function save_webpage(), which takes several arguments.

>>> from pywebcopy import save_webpage
>>> save_webpage(url='http://google.com', project_folder='c://Saved_Webpages/')

Method 2

This approach is slightly more powerful, as it gives access to the full functionality of the WebPage class.

>>> from pywebcopy import WebPage, config
>>> url = 'http://some-url.com/some-page.html'
>>> project_folder = '/home/desktop/'
>>> project_name = 'my_project'
>>> kwargs = {'bypass_robots': True}

# You should always start with setting up the config or use apis
>>> config.setup_config(url, project_folder, project_name, **kwargs)

# Create an instance of the webpage object
>>> wp = WebPage()

# If you want to use `requests` to fetch the page then
>>> wp.get(url)

# Else if you want to use plain html or urllib then use
>>> wp.set_source(object_which_have_a_read_method, encoding=encoding)
>>> wp.url = url   # you need to do this if you are using set_source()

# Then you can access several methods like
>>> wp.save_complete()
>>> wp.save_html()
>>> wp.save_assets()

# This WebPage object exposes every method of the WebPage class and thus
# can be reused later.

Method 2 using Plain HTML

As mentioned earlier, the WebPage object is powerful and can be manipulated in many ways.

One such feature is that raw HTML is also accepted.


>>> from pywebcopy import WebPage, config

>>> HTML = open('test.html').read()

>>> base_url = 'http://example.com' # used as a base for downloading imgs, css, js files.
>>> project_folder = '/saved_pages/'
>>> config.setup_config(base_url, project_folder)

>>> wp = WebPage()
>>> wp.set_source(HTML)
>>> wp.url = base_url
>>> wp.save_complete()
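The base URL matters because raw HTML usually refers to its images, stylesheets and scripts with relative paths. The sketch below shows, with the standard library only, how such references resolve against a base URL. It illustrates the general idea rather than pywebcopy's internal code; `resolve_assets` and the sample references are hypothetical.

```python
from urllib.parse import urljoin


def resolve_assets(base_url, references):
    """Turn asset references found in a page into absolute URLs."""
    return [urljoin(base_url, ref) for ref in references]


refs = ["img/logo.png", "/css/site.css", "https://cdn.example.org/app.js"]
for absolute in resolve_assets("http://example.com", refs):
    print(absolute)
# http://example.com/img/logo.png
# http://example.com/css/site.css
# https://cdn.example.org/app.js
```

References that are already absolute pass through unchanged, which is why only relative paths depend on the base URL being set correctly.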

How to - Clone Whole Websites

Use caution when copying websites, as this can overload or damage the site's servers and, in rare cases, may be illegal, so check everything before you proceed.
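One of the checks worth doing first is whether the site's robots.txt permits fetching the pages you are after. The sketch below does this offline with the standard library by parsing robots.txt text you have already obtained. `is_allowed` and `SAMPLE_ROBOTS` are hypothetical names, and the rules are invented for the demo.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against the rules in a robots.txt document."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "http://localhost:8000/index.html"))         # True
print(is_allowed(SAMPLE_ROBOTS, "http://localhost:8000/private/data.html"))  # False
```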

Method 1 : via api - save_website()

Use the built-in API save_website(), which takes several arguments.

>>> from pywebcopy import save_website

>>> save_website(url='http://localhost:8000', project_folder='e://tests/')

Method 2 -

Create a Crawler() object, which provides several other functions as well.

>>> from pywebcopy import Crawler, config

>>> config.setup_config(project_url='http://localhost:5000/',
...                     project_folder='e://tests/', project_name='LocalHost')

>>> crawler = Crawler()
>>> crawler.crawl()

1.3 Configuration

pywebcopy is highly configurable. You can set up the global configuration using the methods exposed by the pywebcopy.config object.

Ways to change the global configurations are below -
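Elsewhere in this document the configuration is changed either through config.setup_config(...) or by assigning keys such as pywebcopy.config['over_write'] = True. As a hedged sketch of the second style, the helper below applies several overrides at once and rejects misspelled keys. `apply_settings` and `KNOWN_KEYS` are hypothetical, the key set lists only the keys mentioned in this document, and a plain dict stands in for pywebcopy.config so the example runs without the package.

```python
# Only the config keys that appear in this document; the real object
# may well accept more.
KNOWN_KEYS = {"project_name", "over_write", "bypass_robots"}


def apply_settings(config, **overrides):
    """Copy validated overrides into a dict-like config object."""
    unknown = set(overrides) - KNOWN_KEYS
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    for key, value in overrides.items():
        config[key] = value
    return config


# A plain dict stands in for pywebcopy.config in this demo.
demo_config = {}
apply_settings(demo_config, project_name="my_project", over_write=True)
print(demo_config)  # {'project_name': 'my_project', 'over_write': True}
```

With the package installed, the same call would presumably look like apply_settings(pywebcopy.config, over_write=True), assuming the config object supports item assignment as the examples in this document suggest.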

4.1 Contribution

You can contribute in many ways

If you have any suggestions, fixes or reports, feel free to mail me :)

5.1 Undocumented Features

I built many utilities and classes in this project to ease the tasks I was trying to do.

These utilities are also suitable for general-purpose use.

So, if you want to help write documentation for the undocumented ones, you can always create a pull request or email me.

6.1 Changelog

[version 6.0.0]

[version 5.x]

[version 4.x]

[version 2.0.0]

[changed]

[added]

[fixed]

[version 2.0(beta)]

[version 1.10]

[version 1.9]