RestBox


Description

Application server: https://syrup.keboola.com/restbox/run
Project source: https://bitbucket.org/keboola/restbox-bundle

RestBox is a tool designed to transfer files/tables between HTTP sources, Amazon S3 and Keboola SAPI (Storage API).

Configuration

JSON request

  • config: (mandatory) Name of the configuration table in sys.c-restbox to use
  • direct = [0/1]: (optional) If set to 1, the file is transferred directly to the destination without downloading it to the server. (Only fully supported by SAPI; may cause slowdowns on other interfaces.)
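
A run request therefore carries at most these two keys. As a sketch (the configuration name my-transfer is a hypothetical example):

```json
{
  "config": "my-transfer",
  "direct": 1
}
```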

Table sys.c-restbox.{configurationId}

Attributes

(All attributes are optional; HTTP is used on both endpoints unless another auth.type is set.)

  • [src|dest].auth.type: [Source|Destination] Authentication type (either src.auth.type or dest.auth.type)
  • [src|dest].prefix: Prefix the destination path/url
  • map.date.placeholder: String to find and replace with date in endpoints
  • map.date.format: Date format; defaults to Y-m-d
  • map.date.timezone: Timezone to use. Defaults to server timezone
  • map.date.value: Time offset or a specific time to use for the placeholder replacement. Any PHP strtotime() value can be used.
  • compress = [0/1]: Use GZIP to compress the data before upload (if not transferring directly)
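
As an illustration, the date-mapping attributes could be combined as follows (shown here as a JSON object for readability; in practice these are attributes of the sys.c-restbox.{configurationId} table, and the URL and placeholder are made-up examples):

```json
{
  "src.prefix": "http://example.com/exports/",
  "map.date.placeholder": "%%DATE%%",
  "map.date.format": "Y-m-d",
  "map.date.timezone": "Europe/Prague",
  "map.date.value": "-1 day"
}
```

With this in place, a source of report_%%DATE%%.csv would resolve to yesterday's date in Y-m-d format, e.g. report_2014-06-13.csv for a run on 2014-06-14.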

Available auth interfaces

  • HTTP (no authentication) (default; if no .auth is set, http is used)
    (HTTP destination not yet implemented)

    • [src|dest].auth.type = http
    • [src|dest].http.query: Query parameters in a JSON Object. I.e.: { "access_token": "y0ur0hs0s3cr3t4cc3st0k3nl0ly0m4m4" } is parsed into &access_token=y0ur0hs0s3cr3t4cc3st0k3nl0ly0m4m4
    • [src|dest].http.headers: HTTP Request headers in a JSON Object. I.e.: { "Accept-Encoding": "gzip" }
  • HTTP Basic Authentication

    • [src|dest].auth.type = http.basic
    • [src|dest].auth.user: Username
    • [src|dest].auth.password: Password
  • Amazon S3

    • [src|dest].auth.type = s3
    • [src|dest].auth.id: AWS Access Key ID
    • [src|dest].auth.secret: AWS Secret Access Key
    • [src|dest].s3.bucket: S3 Bucket to use (i.e. keboola-bi if the object URL is http://keboola-bi.s3.amazonaws.com/test/image.jpg)
    • dest.s3.acl = [private/public-read/public-read-write/...]: (optional) Security setting of the created object. See Amazon S3 Canned ACL for details.
    • src.s3.modifiedAfter: A date value to only download objects modified after that date (only works with */% wildcards in source mapping)
    • src.s3.autoFill = [0/1]: Automatically sets the src.s3.modifiedAfter value to the time of last successful run
  • Keboola Storage API

    • use [src|dest].prefix to set bucket (i.e.: in.c-test. - note the trailing ".")
    • [src|dest].auth.type = sapi
    • [src|dest].auth.token: (optional) Storage API token to use, if different from the one sent in the request headers (which is used to run Syrup/RestBox)
    • [src|dest].auth.url: (optional) Allows use of an alternative Storage API server; requires auth.token to be set
    • dest.sapi.{option}: Upload to SAPI Options (replace {option} with the following). See SAPI Docs for details.

      • delimiter
      • enclosure
      • escapedBy
      • transaction
      • incremental
      • partial
      • primaryKey
      • transactional
    • src.sapi.{option}: Download from SAPI options. See SAPI Docs for details.

      • limit
      • changedSince
      • changedUntil
      • escape
      • format
      • whereColumn
      • whereOperator
    • Add a header to the file

      • dest.csv.addHeaders = [0/1]: Fetch the destination table headers and prepend the file with them. Doesn't work with compress = 1.
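
Putting the authentication attributes together, a sketch of an S3-to-Storage-API transfer configuration could look like this (again a JSON rendering of the table attributes; all bucket names, keys and option values are placeholders):

```json
{
  "src.auth.type": "s3",
  "src.auth.id": "AKIAEXAMPLEKEYID",
  "src.auth.secret": "exampleSecretAccessKey",
  "src.s3.bucket": "my-bucket",
  "dest.auth.type": "sapi",
  "dest.prefix": "in.c-test.",
  "dest.sapi.incremental": 1,
  "dest.csv.addHeaders": 1
}
```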

File operations

  • Unzip file - extract a single file from a zip archive

    • file.extract = zip

Text file modifications

Text file modifications are executed in the following order (i.e. if you want to replace a word and then skip lines by it, use the new/replacement word in the skip configuration):

  • Skip lines in the file

    • text.skipLines.start = [0-%i]: Optionally keep a number of lines at the beginning of a CSV file before dropping lines (i.e. set to 1 to preserve the CSV header and then drop %i lines)
    • text.skipLines.length = [%i]: Drop %i lines at the beginning of a CSV or any text file (or offset by .start)
  • Convert character set of the file

    • text.charSet.[from|to] = [UTF-8]: Change the character set of a text file (default is UTF-8; so, for example, if only the "from" charset is set, the file is converted from that charset into UTF-8)

      • By default, the script skips unknown characters. To try to translate them to the closest match, append "//TRANSLIT" to the ".to" charset (i.e. "text.charSet.to" = "ISO-8859-2//TRANSLIT")
  • Fix End Of Line characters

    • text.eol = [0/1]: Replace non-Unix line endings with the standard \n
  • Replace a phrase

    • text.replace = { "find": "replace", "original": "edited", ... }: Replace any phrase within the file with another.
  • Skip lines by matching a phrase

    • text.skip = [%s]: Drop all lines containing a configured phrase
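
Applied in the order above, a cleanup chain for a downloaded CSV might be configured like this (all values illustrative):

```json
{
  "text.skipLines.start": 1,
  "text.skipLines.length": 3,
  "text.charSet.from": "ISO-8859-2",
  "text.eol": 1,
  "text.replace": { "N/A": "" },
  "text.skip": "DRAFT"
}
```

This keeps the header line, drops the next 3 lines, converts the file from ISO-8859-2 to UTF-8, normalizes line endings, blanks out "N/A" values and then drops every line containing "DRAFT".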

Plugins

Filename

  • Adds the "source" value into a "file_origin" column in the table

    • plugin.Filename.enabled = 1 to enable

DownloadTime

  • Adds the time when RestBox processes the file into "file_processed_time" column

    • plugin.DownloadTime.enabled = 1 to enable
    • plugin.DownloadTime.format: change the date format (default is 2014-06-14T00:58:44+00:00)

FileAudit

  • Combines the two plugins above AND adds another column with each row's number within the current file. Adds the following columns: restbox_audit_date, restbox_audit_filename, restbox_audit_filerow

    • plugin.FileAudit.enabled = 1 to enable
    • plugin.FileAudit.dateFormat: change the date format (default is 2014-06-14T00:58:44+00:00)
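
For instance, to enable the audit columns with a simpler date format, the plugin attributes might be set as follows (the format string is an example of a PHP date format):

```json
{
  "plugin.FileAudit.enabled": 1,
  "plugin.FileAudit.dateFormat": "Y-m-d H:i:s"
}
```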

Table Data

  • source: File Source

    • Amazon S3 bucket download:

      • If the source string ends with * or %, all objects from the bucket matching the source prefix before the */% will be downloaded

        • * Downloads all files matching the prefix to the destination object as is (if loading into SAPI, all subsequent matching files are added incrementally; in S3, each matching file simply overwrites the destination object!)
        • % Downloads all files matching the prefix and appends their relative path in S3 to the destination (i.e. creates a separate table for each matching object, or copies the structure to another S3 domain/bucket, or just another S3 folder)
      • The character immediately before */% cannot be a /
      • To download the /test/data/ folder, use /test/data*
      • To download all files starting with "prefix_2013" (i.e. prefix_201301.csv, prefix_201302.csv, ...) in the /test/data/ folder, use /test/data/prefix_2013*
  • destination: File Destination
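
Assuming an S3 source bucket containing /test/data/prefix_201301.csv and /test/data/prefix_201302.csv, the two wildcard styles would map like this (destination names are illustrative):

```json
[
  { "source": "/test/data/prefix_2013*", "destination": "monthly-data" },
  { "source": "/test/data/prefix_2013%", "destination": "monthly-data-" }
]
```

The * row loads both files into the single destination (incrementally when loading into SAPI), while the % row appends each file's relative S3 path to the destination, producing a separate target per file.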

Example configuration

Attributes

Data
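
The Attributes and Data tables for this example appear to have been shown as annotated screenshots in the original page. Based on the Result section below, they roughly correspond to the following (reconstructed, not authoritative):

```json
{
  "src.auth.type": "s3",
  "src.auth.id": "<AWS Access Key ID>",
  "src.auth.secret": "<AWS Secret Access Key>",
  "src.s3.bucket": "keboola-bi",
  "dest.auth.type": "sapi",
  "dest.prefix": "in.c-restbox.",
  "text.skipLines.start": 1,
  "text.skipLines.length": 10,
  "text.skip": "2012",
  "text.eol": 1,
  "text.replace": { "pa55w0rd": "[content removed]" },
  "compress": 1
}
```

with a data row mapping source /test/test.csv to destination imported-table.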

API Call

POST /restbox/run HTTP/1.1
Host: syrup.keboola.com
X-StorageApi-Token: 005-10005-59dafqwgrwemdjkgrger64gh65th4tr6h4ce65
Cache-Control: no-cache

{ "config": "text" }

Result

  1. The API downloads the file from S3 (keboola-bi.s3.amazonaws.com/test/test.csv) using the credentials from the src.auth.id and src.auth.secret attributes
  2. The file is altered according to the configuration:
    1. The downloaded file is stripped of 10 lines after the 1st line, i.e. the first line is kept and lines #2-#11 (inclusive) are deleted (text.skipLines)
    2. All lines containing 2012 are skipped
    3. Line endings are corrected to \n
    4. The text pa55w0rd is replaced by [content removed]
    5. The file is compressed using gzip
  3. The resulting file is then uploaded to the SAPI project determined by the token used to run RestBox. The table name is a combination of the dest.prefix attribute and the destination column in the table's data, in this case in.c-restbox.imported-table