Expected File Format

Information about the expected file format

The Kaskada compute engine expects uploaded data files to be in Parquet version 2 format. Additionally, it expects the following:

  • The time_column (and every other date-time column in your dataset) should be of integer type and contain a nanosecond Unix timestamp.
  • The combination of the time_column and the subsort_column should guarantee that each row is unique.
    • An easy way to do this is to add an index column containing the row index to your dataset, and set it as the subsort_column.
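The two requirements above can be checked before uploading. The following is a minimal sketch, not part of Kaskada itself: the function name check_kaskada_format and the column names "event_time" and "idx" are hypothetical, chosen only for illustration.

```python
import pandas as pd


def check_kaskada_format(df, time_column, subsort_column):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []

    # The time column should hold integers (nanoseconds since the Unix epoch).
    if not pd.api.types.is_integer_dtype(df[time_column]):
        problems.append(
            f"{time_column} has dtype {df[time_column].dtype}, expected an integer type"
        )

    # Each (time, subsort) pair should identify exactly one row.
    duplicates = int(df.duplicated(subset=[time_column, subsort_column]).sum())
    if duplicates:
        problems.append(
            f"{duplicates} row(s) repeat an earlier ({time_column}, {subsort_column}) pair"
        )

    return problems


# Hypothetical data: two events share a timestamp, so the subsort values must differ.
events = pd.DataFrame({
    "event_time": [1609459200000000000, 1609459200000000000, 1609459201000000000],
    "idx": [0, 1, 2],
})
print(check_kaskada_format(events, "event_time", "idx"))
```

An empty list indicates that the time column is integer-typed and that no two rows share the same time and subsort values.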

Below are some code examples to help you get your data into the proper format:

Working with CSV

CSV files need to be converted to Parquet version 2 before being uploaded to Kaskada.

The helper code below uses a pandas DataFrame to perform the conversion.

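# Notebook syntax; drop the leading "!" when running in a shell.
!pip install pandas pyarrow
import pandas as pd
# Read the CSV file into a DataFrame.
df = pd.read_csv('<path_to_the_input_file.csv>')
# Write it out as Parquet version 2 (pandas passes "version" through to pyarrow).
df.to_parquet('<path_to_the_output_file.parquet>', version="2.0")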

Parsing date-time strings
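
Date-time columns stored as strings need to be converted to integer nanosecond Unix timestamps. The sketch below shows one way to do this with pandas. The column name "event_time" and the sample values are hypothetical, and strings without a time zone are assumed to be in UTC.

```python
import pandas as pd

# Hypothetical input: a column of date-time strings in a single, consistent format.
df = pd.DataFrame({"event_time": ["2021-01-01 00:00:00", "2021-01-01 00:00:01"]})

# Parse the strings into timestamps, treating them as UTC (an assumption).
parsed = pd.to_datetime(df["event_time"], utc=True)

# Count whole nanoseconds elapsed since the Unix epoch, giving an integer column.
epoch = pd.Timestamp("1970-01-01", tz="UTC")
df["event_time"] = (parsed - epoch) // pd.Timedelta("1ns")

print(df["event_time"].tolist())
```

If your strings mix several formats or carry explicit time zone offsets, the parsing step may need a format argument or extra cleanup first.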

Adding a subsort column

The combination of the time_column and the subsort_column must guarantee that each row in your dataset is unique. An easy way to do this is to add an index column containing the row index to your dataset, and set it as the subsort_column.

The helper code below uses a pandas DataFrame to add an index column named "idx".

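# Notebook syntax; drop the leading "!" when running in a shell.
!pip install pandas pyarrow
import pandas as pd
# Load the existing Parquet file into a DataFrame.
df = pd.read_parquet('<path_to_the_input_file.parquet>')
# Copy the row index into a new "idx" column to use as the subsort_column.
df["idx"] = df.index
# Write it out as Parquet version 2 (pandas passes "version" through to pyarrow).
df.to_parquet('<path_to_the_output_file.parquet>', version="2.0")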

© Copyright 2021 Kaskada, Inc. All rights reserved.