Kaskada stores data in tables. Tables consist of multiple rows, and each row is a value of the same type.
All methods on this page use the table module. Be sure to import it before running any method:
from kaskada import table
When creating a table, you must provide information about how each row should be interpreted. You must describe:
- A field containing the time associated with each row. The time should refer to when the event occurred.
- An initial entity key associated with each row. The entity should identify a thing in the world related to each event.
- A subsort column associated with each row. This value is used to order rows associated with the same time value. If no subsort column is provided, Kaskada will generate one.
For more information about these fields, see: Expected File Format
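To make the role of these fields concrete, here is a minimal sketch in plain Python (not Kaskada code) of how a time column and a subsort column together give the rows of each entity a definite order. The field names match the Purchase example on this page; the sample rows and the `order_rows` helper are illustrative assumptions, and Kaskada's internal ordering may differ in detail.

```python
from collections import defaultdict

# Hypothetical sample rows; field names mirror the Purchase example on this page.
rows = [
    {"purchase_time": "2022-01-01T10:00:00", "customer_id": "karen", "subsort_id": 1, "amount": 9},
    {"purchase_time": "2022-01-01T09:00:00", "customer_id": "karen", "subsort_id": 0, "amount": 3},
    {"purchase_time": "2022-01-01T10:00:00", "customer_id": "karen", "subsort_id": 0, "amount": 5},
    {"purchase_time": "2022-01-01T10:00:00", "customer_id": "patrick", "subsort_id": 0, "amount": 7},
]


def order_rows(rows, time_column, entity_column, subsort_column):
    """Group rows by entity key, then order each group by (time, subsort)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[entity_column]].append(row)
    return {
        entity: sorted(group, key=lambda r: (r[time_column], r[subsort_column]))
        for entity, group in grouped.items()
    }


ordered = order_rows(rows, "purchase_time", "customer_id", "subsort_id")
karen_amounts = [row["amount"] for row in ordered["karen"]]
print(karen_amounts)  # [3, 5, 9]
```

The two rows for "karen" at 10:00 share a time value, so the subsort value alone decides which comes first.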
Here is an example of creating a table:
table.create_table(
    table_name = "Purchase",
    time_column_name = "purchase_time",
    entity_key_column_name = "customer_id",
    subsort_column_name = "subsort_id",
)
This creates a table named Purchase. Any data loaded into this table must have a timestamp field named purchase_time, a field named customer_id, and a field called subsort_id.
Idiomatic Kaskada
We like to use CamelCase to name tables because it helps distinguish data sources from transformed values and function names.
The list tables method returns all tables defined for your user. An optional search string can filter the response set.
Here is an example of listing tables:
table.list_tables(search = "chase")
You can get a table using its name:
Tables are currently immutable. Updating a table requires deleting it and then re-creating it with the new settings.
You can delete a table using its name:
Note that deleting a table also deletes any files uploaded to it.
A failed precondition error is returned if another view and/or materialization references the table. To continue with the deletion of the table, delete the dependent resources or supply the force flag to delete the table forcefully. Forcefully deleting a table without deleting the dependent resources may result in the dependent resources functioning incorrectly.
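Because tables are immutable, an update amounts to a delete followed by a create. The sketch below wraps that sequence in a hypothetical `recreate_table` helper. It assumes the table module exposes a `delete_table(name, force=...)` function; that name and the `force` keyword are assumptions based on the description above, not confirmed signatures, so check them against your client version. The module is passed in as an argument so the helper can be exercised without a live Kaskada connection.

```python
def recreate_table(table_api, table_name, force=False, **create_kwargs):
    """Delete a table and create it again with new settings.

    `table_api` is expected to be the `kaskada.table` module (or a stand-in).
    ASSUMPTION: it provides `delete_table(name, force=...)`; only
    `create_table` appears verbatim on this page.
    """
    # Deleting also removes any files uploaded to the table,
    # so data must be uploaded again afterwards.
    table_api.delete_table(table_name, force=force)
    return table_api.create_table(table_name=table_name, **create_kwargs)


class FakeTableApi:
    """In-memory stand-in used only to demonstrate the call order."""

    def __init__(self):
        self.calls = []

    def delete_table(self, name, force=False):
        self.calls.append(("delete", name, force))

    def create_table(self, **kwargs):
        self.calls.append(("create", kwargs["table_name"]))
        return kwargs


fake = FakeTableApi()
recreate_table(
    fake,
    "Purchase",
    time_column_name="purchase_time",
    entity_key_column_name="customer_id",
)
print(fake.calls)
```

With the real client, the first argument would be the imported table module, under the same assumption about `delete_table`.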
Going Deeper
Under the hood, file uploads are a multi-part process. The first step is to create a staged file that requests an upload URL from the API. The Python client library will upload files to the upload URL through an HTTP PUT request. Then an additional API call is performed to load the staged file to a table where additional validation is performed.
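The three steps described above can be sketched as follows. This is an illustration of the flow, not the client library's implementation: `create_staged_file` and `load_staged_file` are hypothetical callables standing in for the two API calls, and the shape of their arguments and return values is assumed. Only the HTTP PUT step uses a real call (`urllib.request.Request` from the Python standard library), and the request is built but never sent.

```python
import urllib.request


def build_put_request(upload_url, payload):
    """Step 2: prepare the HTTP PUT that sends the file bytes to the upload URL."""
    return urllib.request.Request(upload_url, data=payload, method="PUT")


def upload_file_sketch(table_name, payload, create_staged_file, load_staged_file, send):
    # Step 1 (hypothetical API call): stage a file and obtain an upload URL.
    staged = create_staged_file()
    # Step 2: upload the bytes to that URL with an HTTP PUT.
    send(build_put_request(staged["upload_url"], payload))
    # Step 3 (hypothetical API call): load the staged file into the table,
    # where additional validation happens.
    return load_staged_file(table_name, staged["file_id"])


# Demonstration with stand-ins, so nothing touches the network.
sent = []
result = upload_file_sketch(
    "Purchase",
    b"example-bytes",
    create_staged_file=lambda: {"file_id": "staged-1", "upload_url": "https://example.com/upload"},
    load_staged_file=lambda table_name, file_id: (table_name, file_id),
    send=sent.append,
)
print(sent[0].get_method(), sent[0].full_url, result)
```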
Data can be loaded from a dataframe into a Kaskada table. Remote files can be read into a dataframe and then uploaded to Kaskada.
import pandas
# A sample Parquet file provided by Kaskada for testing
purchases_url = "https://drive.google.com/uc?export=download&id=1SLdIw9uc0RGHY-eKzS30UBhN0NJtslkk"
# Read the file into a Pandas Dataframe
purchases = pandas.read_parquet(purchases_url)
# Upload the dataframe's contents to the Purchase table
table.upload_dataframe("Purchase", purchases)
The contents of the dataframe are transferred to Kaskada and added to the Purchase table.
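Since loaded data must contain the columns named when the table was created, checking for them before uploading can surface a mismatch early. The following is a minimal sketch: `missing_columns` is a hypothetical helper, not part of the Kaskada client, and the required names are taken from the Purchase example on this page. It accepts any iterable of column names, so with pandas it could be called on a dataframe's `columns` attribute.

```python
# Column names from the Purchase table example on this page.
REQUIRED_COLUMNS = ("purchase_time", "customer_id", "subsort_id")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in declaration order."""
    present = set(columns)
    return [name for name in required if name not in present]


# With pandas this would be missing_columns(purchases.columns).
missing = missing_columns(["purchase_time", "customer_id", "amount"])
print(missing)  # ['subsort_id']
```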
Data can be loaded directly from Amazon S3 into a Kaskada table. Loading from S3 requires the following:
- AWS Access Key - The access key with READ permissions to the bucket and path (optional).
- AWS Secret Access Key - The secret key associated with the access key (optional).
- Path - The path to the object. One of the following:
  - Virtual Hosted Path - E.g. https://s3.Region.amazonaws.com/bucket-name/key-name.parquet
  - Region, Bucket, Key
Security and Credentials
Kaskada does not store the provided credentials in any manner. The API only has access to the credentials throughout the load data call. If no access credentials are provided, the object must have public read permissions.
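The two ways of identifying an object carry the same information. As a small illustration, the sketch below assembles a URL in the form shown in the list above from a region, bucket, and key. The `s3_object_url` helper is hypothetical and not part of the Kaskada client; it relies only on `urllib.parse.quote` from the Python standard library.

```python
from urllib.parse import quote


def s3_object_url(region, bucket, key):
    """Build https://s3.<region>.amazonaws.com/<bucket>/<key> from its parts."""
    # quote() percent-encodes unsafe characters but leaves "/" in the key intact.
    return f"https://s3.{region}.amazonaws.com/{bucket}/{quote(key)}"


url = s3_object_url("us-west-2", "production.company", "events/2022/purchases.parquet")
print(url)
# https://s3.us-west-2.amazonaws.com/production.company/events/2022/purchases.parquet
```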
from kaskada import table
TABLE_NAME = 'Purchase'
EXTERNAL_AWS_ACCESS_KEY = '<AWS_ACCESS_KEY>'
EXTERNAL_AWS_SECRET_KEY = '<AWS_SECRET_KEY>'
S3_PATH = 'events/2022/purchases.parquet'
BUCKET = 'production.company'
REGION = 'us-west-2'
table.upload_from_s3(
    TABLE_NAME,
    access_key=EXTERNAL_AWS_ACCESS_KEY,
    secret=EXTERNAL_AWS_SECRET_KEY,
    bucket=BUCKET,
    key=S3_PATH,
    region=REGION
)
The contents of the Parquet object in S3 are transferred to Kaskada and added to the Purchase table.
Local files can be uploaded directly to Kaskada without converting them to a dataframe. However, the files must be in a specific format. See Expected File Format for details.
full_path_to_file = "/content/drive/place/thing/purchases.parquet"
table.upload_file("Purchase", full_path_to_file)
This uploads the contents of the file to the Purchase table.
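Before uploading a local file, a quick sanity check can catch an obviously wrong file early. The sketch below assumes the file is an unencrypted Parquet file, as in the example above, and checks for the 4-byte `PAR1` marker that such files carry at both the start and the end. The `looks_like_parquet` helper is hypothetical and does not replace the validation Kaskada performs when the file is loaded.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    # A Parquet file holds at least the two markers plus a 4-byte footer length.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as handle:
        head = handle.read(4)
        handle.seek(-4, os.SEEK_END)
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Demonstration on a throwaway file that is clearly not Parquet.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"customer_id,purchase_time\n")
not_parquet = looks_like_parquet(tmp.name)
os.remove(tmp.name)
print(not_parquet)  # False
```

A passing check only means the markers are present; it says nothing about whether the columns match the table.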