Working with Tables
Kaskada stores data in tables. Tables consist of multiple rows, and each row is a value of the same type.
All methods on this page use the table module. Be sure to import it before running any method:
from kaskada import table
Table Methods
Creating a Table
When creating a table, you must provide information about how each row should be interpreted. You must describe:
- A field containing the time associated with each row. The time should refer to when the event occurred.
- An initial entity key associated with each row. The entity should identify a thing in the world related to each event.
Optionally:
- A subsort column associated with each row. This value is used to order rows associated with the same time value. If no subsort column is provided, Kaskada will generate one (see the second example below).
For more information about these fields, see: Expected File Format
Here is an example of creating a table:
table.create_table(
    table_name = "Purchase",
    time_column_name = "purchase_time",
    entity_key_column_name = "customer_id",
    subsort_column_name = "subsort_id",
)
This creates a table named Purchase. Any data loaded into this table must have a timestamp field named purchase_time, an entity key field named customer_id, and a subsort field named subsort_id.
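If your rows have no natural ordering within a timestamp, the subsort_column_name argument can simply be omitted and Kaskada will generate one. A minimal sketch, assuming a hypothetical PageView table:
# No subsort_column_name is given, so Kaskada generates a subsort value per row.
table.create_table(
    table_name = "PageView",
    time_column_name = "viewed_at",
    entity_key_column_name = "visitor_id",
)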
Idiomatic Kaskada
We like to use CamelCase to name tables because it helps distinguish data sources from transformed values and function names.
List Tables
The list table method returns all tables defined for your user. An optional search string can filter the response set.
Here is an example of listing tables:
table.list_tables(search = "chase")
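The search string is optional; omit it to list every table defined for your user:
table.list_tables()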
Get Table
You can get a table using its name:
table.get_table("Purchase")
Updating a Table
Tables are currently immutable. Updating a table requires deleting it and then re-creating it with the new configuration.
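For example, to change a table's configuration, delete it and create it again, a sketch using the delete method described below. Because deleting a table also deletes its files, any data must be re-uploaded afterward.
# Remove the old definition (this also deletes any uploaded files).
table.delete_table("Purchase")

# Re-create the table with the new configuration.
table.create_table(
    table_name = "Purchase",
    time_column_name = "purchase_time",
    entity_key_column_name = "customer_id",
    subsort_column_name = "subsort_id",
)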
Deleting a Table
You can delete a table using its name:
table.delete_table("Purchase")
Warning
Deleting a table also deletes any files uploaded to it.
A failed precondition error is returned if a view or materialization references the table. To continue with the deletion, either delete the dependent resources first or supply the force flag. Forcefully deleting a table without deleting its dependent resources may leave those resources functioning incorrectly.
table.delete_table("Purchase", force=True)
Uploading Data
Going Deeper
Under the hood, file uploads are a multi-part process. The first step is to create a staged file, which requests an upload URL from the API. The Python client library uploads the file to that URL with an HTTP PUT request. A final API call then loads the staged file into the table, where further validation is performed.
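As a rough illustration of that flow, here is a sketch of the three steps. The host, endpoints, and field names below are hypothetical placeholders, not the actual Kaskada API; the client library performs all of this for you.
import requests

API = "https://api.example.com"                 # hypothetical API host
HEADERS = {"Authorization": "Bearer <token>"}   # hypothetical auth header

# 1. Create a staged file; the response includes a signed upload URL (hypothetical endpoint).
staged = requests.post(f"{API}/v1/staged_files", headers=HEADERS).json()

# 2. Upload the file bytes to the signed URL with an HTTP PUT.
with open("purchases.parquet", "rb") as f:
    requests.put(staged["upload_url"], data=f)

# 3. Load the staged file into the table, where validation happens (hypothetical endpoint).
requests.post(
    f"{API}/v1/tables/Purchase/load",
    headers=HEADERS,
    json={"staged_file_id": staged["file_id"]},
)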
From a Remote File or Dataframe
Data can be loaded from a dataframe into a Kaskada table. Remote files can be read into a dataframe and then uploaded to Kaskada.
import pandas
# A sample Parquet file provided by Kaskada for testing
purchases_url = "https://drive.google.com/uc?export=download&id=1SLdIw9uc0RGHY-eKzS30UBhN0NJtslkk"
# Read the file into a Pandas Dataframe
purchases = pandas.read_parquet(purchases_url)
# Upload the dataframe's contents to the Purchase table
table.upload_dataframe("Purchase", purchases)
The contents of the dataframe are transferred to Kaskada and added to the Purchase table.
From Amazon S3
Data can be loaded directly from Amazon S3 into a Kaskada table. Loading from S3 requires the following:
- AWS Access Key (optional) - An access key with READ permissions on the bucket and path.
- AWS Secret Access Key (optional) - The secret key associated with the access key.
- Path - The path to the object. One of the following:
- Virtual Hosted Path - E.g. https://s3.Region.amazonaws.com/bucket-name/key-name.parquet
- Region, Bucket, Key
Security and Credentials
Kaskada does not store the provided credentials in any manner. The API has access to them only for the duration of the load call. If no credentials are provided, the object must have public read permissions.
from kaskada import table
TABLE_NAME = 'Purchase'
EXTERNAL_AWS_ACCESS_KEY = '<AWS_ACCESS_KEY>'
EXTERNAL_AWS_SECRET_KEY = '<AWS_SECRET_KEY>'
S3_PATH = 'events/2022/purchases.parquet'
BUCKET = 'production.company'
REGION = 'us-west-2'
table.upload_from_s3(
    TABLE_NAME,
    access_key=EXTERNAL_AWS_ACCESS_KEY,
    secret=EXTERNAL_AWS_SECRET_KEY,
    bucket=BUCKET,
    key=S3_PATH,
    region=REGION,
)
The contents of the Parquet object in S3 are transferred to Kaskada and added to the Purchase table.
From a Local File
Local files can be uploaded directly to Kaskada without converting them to a dataframe. However, the files must be in a specific format. See Expected File Format for details.
full_path_to_file = "/content/drive/place/thing/purchases.parquet"
table.upload_file("Purchase", full_path_to_file)
This uploads the contents of the file to the Purchase table.