Home

Awesome

Notice

To better serve Wise business and customer needs, the PipelineWise codebase needs to shrink. We have made the difficult decision that, going forward, many components of PipelineWise will be removed or incorporated into the main repo. The last version before this decision is v0.64.1.

We thank everyone in the open-source community who, over the past 6 years, has helped to make PipelineWise a robust product for heterogeneous replication of many terabytes of data, daily.

pipelinewise-tap-s3-csv


This is a Singer tap that reads data from files located inside a given S3 bucket and produces JSON-formatted data following the Singer spec.

This is a PipelineWise compatible tap connector.

How to use it

The recommended method of running this tap is from within PipelineWise. When running it from PipelineWise you don't need to configure this tap with JSON files and most things are automated. Please check the related documentation at Tap S3 CSV.

If you want to run this Singer Tap independently please read further.

Install and Run

First, make sure Python 3 is installed on your system or follow these installation instructions for Mac or Ubuntu.

It's recommended to use a virtualenv:

  python3 -m venv venv
  . venv/bin/activate
  pip install pipelinewise-tap-s3-csv

or

  python3 -m venv venv
  . venv/bin/activate
  pip install --upgrade pip
  pip install .
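Once installed, the tap runs like any other Singer tap. A sketch of the two modes (the `config.json` and `catalog.json` file names are placeholders; the flags follow standard Singer conventions):

```shell
# Discover the available streams and write the catalog (assumed file names)
tap-s3-csv --config config.json --discover > catalog.json

# Run the tap in sync mode, emitting Singer-formatted records to stdout
tap-s3-csv --config config.json --properties catalog.json
```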

Configuration

Here is an example of a basic config that uses the default profile-based authentication:

```json
{
    "start_date": "2000-01-01T00:00:00Z",
    "bucket": "tradesignals-crawler",
    "tables": [{
        "search_prefix": "feeds",
        "search_pattern": ".csv",
        "table_name": "my_table",
        "key_properties": ["id"],
        "delimiter": ","
    }]
}
```

Profile-based authentication

Profile-based authentication is used by default, with the default AWS profile. To use another profile, set the aws_profile parameter in config.json or set the AWS_PROFILE environment variable.
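For example, a config that authenticates with a named profile might look like this (the profile and bucket names are illustrative):

```json
{
    "start_date": "2000-01-01T00:00:00Z",
    "bucket": "my-bucket",
    "aws_profile": "analytics",
    "tables": [{
        "search_prefix": "feeds",
        "search_pattern": ".csv",
        "table_name": "my_table",
        "key_properties": ["id"],
        "delimiter": ","
    }]
}
```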

Non-profile-based authentication

For non-profile-based authentication, set the aws_access_key_id, aws_secret_access_key and optionally the aws_session_token parameters in config.json. Alternatively, you can define them outside config.json by setting the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN environment variables.
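A sketch of the same config using explicit credentials (the key values below are placeholders, not real credentials):

```json
{
    "start_date": "2000-01-01T00:00:00Z",
    "bucket": "my-bucket",
    "aws_access_key_id": "<access-key-id>",
    "aws_secret_access_key": "<secret-access-key>",
    "aws_session_token": "<optional-session-token>",
    "tables": [{
        "search_prefix": "feeds",
        "search_pattern": ".csv",
        "table_name": "my_table",
        "key_properties": ["id"],
        "delimiter": ","
    }]
}
```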

A rundown of each of the properties:

The tables field consists of one or more objects that describe how to find files and emit records. A more detailed (and unescaped) example below:

```json
[
    {
        "search_prefix": "exports",
        "search_pattern": "my_table\\/.*\\.csv",
        "table_name": "my_table",
        "key_properties": ["id"],
        "date_overrides": ["created_at"],
        "delimiter": ","
    },
    ...
]
```

A sample configuration is available inside config.sample.json.
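The search_pattern is applied as a regular expression against S3 object keys, which is why the slashes and dots in the example above are escaped inside the JSON string. A quick sketch of the matching behaviour (an assumption based on the escaped example, using plain Python re rather than the tap's internals):

```python
import re

# The pattern from the example above, with the JSON string escaping removed
pattern = re.compile(r"my_table\/.*\.csv")

keys = [
    "exports/my_table/2024-01-01.csv",    # matches
    "exports/other_table/2024-01-01.csv"  # does not match
]

matched = [key for key in keys if pattern.search(key)]
print(matched)
```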

To run tests:

  1. Install python test dependencies in a virtual env:

     make venv

  2. To run unit tests:

     make unit_tests

  3. To run integration tests:

Integration tests require a valid S3 bucket, and credentials should be passed as environment variables; this project uses a Minio server.

First, start a Minio server docker container:

  mkdir -p ./minio/data/awesome_bucket
  UID=$(id -u) GID=$(id -g) docker-compose up -d

Run integration tests:

  make integration_tests

To run pylint:

  1. Install python dependencies and run the python linter:

     make venv pylint