Loading Data
How to create a table and load some data into it.
Setup Required
The following examples assume you've already completed Client Setup.
Kaskada stores data in tables. Tables consist of multiple rows, and each row is a value of the same type.
Creating a Table
When creating a table, you must provide some information about how each row should be interpreted. You must describe:
- A field containing the time associated with each row (time_column_name). The time should refer to when the event occurred.
- An initial entity key associated with each row (entity_key_column_name). The entity should identify a thing in the world that each event is associated with. Don't worry too much about picking the "right" value here - it's easy to change the entity key in Fenl.
- A subsort column associated with each row (subsort_column_name). This value is used to order rows associated with the same time value.
For more information about these fields, see: Expected File Format
from kaskada import table

# Create the Purchase table and name the columns Kaskada should interpret
table.create_table(
    table_name="Purchase",
    time_column_name="purchase_time",
    entity_key_column_name="customer_id",
    subsort_column_name="subsort_id",
)
This creates a table named Purchase. Any data loaded into this table must have a timestamp field named purchase_time, a field named customer_id, and a field named subsort_id.
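Before uploading, it can help to confirm that your data has the three columns the table expects. The following is a minimal sketch and not part of the Kaskada client: missing_columns is a hypothetical helper, the required names are the ones passed to create_table above, and the amount column is an invented example field.

```python
# Hypothetical helper (not part of the kaskada client): report which of the
# columns required by the Purchase table are absent from a dataset.
REQUIRED_COLUMNS = ("purchase_time", "customer_id", "subsort_id")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that do not appear in `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Plain lists of column names are used here for illustration;
# "amount" is an invented extra column.
complete = ["purchase_time", "customer_id", "subsort_id", "amount"]
incomplete = ["purchase_time", "amount"]

print(missing_columns(complete))    # []
print(missing_columns(incomplete))  # ['customer_id', 'subsort_id']
```

Because the helper accepts any iterable of names, the columns attribute of a Pandas DataFrame could be passed in the same way.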
Idiomatic Kaskada
We like to use CamelCase to name tables because it helps distinguish data sources from transformed values and function names.
The response from create_table is a table object with contents similar to:
table {
  table_id: "76b***2e5"
  table_name: "Purchase"
  time_column_name: "purchase_time"
  entity_key_column_name: "customer_id"
  subsort_column_name: "subsort_id"
  create_time {
    seconds: 1634250064
    nanos: 422017488
  }
  update_time {
    seconds: 1634250064
    nanos: 422017488
  }
}
request_details {
  request_id: "fe6bed41fa29cea6ca85fe20bea6ef4a"
}
Note that the response also includes a request_id. A request_id is returned from all requests, whether they succeed or fail. When contacting support about an issue, include the request_id so that support can look up additional details about your request and get to the root cause faster.
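If you want to record the request_id alongside your own logs, a helper along these lines may be useful. This is only a sketch under an assumption: that the response object exposes the fields in the printed output as attributes (response.request_details.request_id), which this page does not confirm. get_request_id is a hypothetical name, and the stand-in response is built with SimpleNamespace purely for illustration.

```python
from types import SimpleNamespace


def get_request_id(response):
    """Return the request_id from a response, or None if it is absent.

    Assumes attribute access mirrors the printed structure
    (response.request_details.request_id); this is not confirmed API.
    """
    details = getattr(response, "request_details", None)
    return getattr(details, "request_id", None)


# Stand-in for a real response, shaped like the output shown above.
fake_response = SimpleNamespace(
    request_details=SimpleNamespace(
        request_id="fe6bed41fa29cea6ca85fe20bea6ef4a"
    )
)

print(get_request_id(fake_response))      # fe6bed41fa29cea6ca85fe20bea6ef4a
print(get_request_id(SimpleNamespace()))  # None
```

Returning None instead of raising keeps the helper safe to call from error-handling code, where the response may be incomplete.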
Loading Data
Now that we've created a table, we're ready to load some data into it.
Data can be loaded into a table in multiple ways. In this example we'll load the contents of a Pandas dataframe into the table. To learn about the different ways data can be loaded into a table, see the "Uploading Data" section of the "Working with Data" page.
import pandas

# A sample Parquet file provided by Kaskada for testing
purchases_url = "https://drive.google.com/uc?export=download&id=1SLdIw9uc0RGHY-eKzS30UBhN0NJtslkk"

# Read the file into a Pandas DataFrame
purchases = pandas.read_parquet(purchases_url)

# Upload the DataFrame's contents to the Purchase table
# (uses the `table` module imported when the table was created)
table.upload_dataframe("Purchase", purchases)
Running upload_dataframe returns a data_token_id. The data token ID is a unique reference to the data currently stored in the system.
data_token_id: "aa2***a6b9"
request_details {
  request_id: "fe6bed41fa29cea6ca85fe20bea6ef4b"
}
The file is transferred to Kaskada and added to the table.
Inspecting the Table's Contents
To verify the file was loaded as expected you can use the table list endpoint to see all the tables defined for your user and the files loaded into each:
table.list_tables()
list_tables returns a list of table objects, one for each table accessible by the user. The table created above is shown here:
tables {
  table_id: "76b***2e5"
  table_name: "Purchase"
  time_column_name: "purchase_time"
  entity_key_column_name: "customer_id"
  subsort_column_name: "subsort_id"
  create_time {
    seconds: 1634067588
    nanos: 312567086
  }
  update_time {
    seconds: 1634067603
    nanos: 70745776
  }
  version: 1
}
request_details {
  request_id: "fe6bed41fa29cea6ca85fe20bea6ef4c"
}
After executing this block, all tables that have been defined are returned.
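The create_time and update_time fields in the output are given as seconds and nanoseconds. Assuming they count from the Unix epoch in UTC (the usual convention for timestamps of this shape, though this page does not state it), they can be converted to readable datetimes with the standard library. to_datetime is a hypothetical helper, not part of the Kaskada client.

```python
from datetime import datetime, timedelta, timezone


def to_datetime(seconds, nanos=0):
    """Convert a seconds/nanos pair to a timezone-aware UTC datetime.

    Assumes the values count from the Unix epoch. datetime only stores
    microseconds, so sub-microsecond precision is dropped.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(seconds=seconds, microseconds=nanos // 1000)


# Values copied from the list_tables output above.
created = to_datetime(1634067588, 312567086)
updated = to_datetime(1634067603, 70745776)

print(created.isoformat())
print(updated.isoformat())
print(updated - created)  # time between table creation and last update
```

With the values shown, update_time falls roughly 15 seconds after create_time, which is consistent with data having been uploaded shortly after the table was created.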
For more help with tables and loading data, see Reference - Working with Tables