Building a Data Pipeline for Nested JSON Extraction in Snowflake

Forum|Forum|2 years ago
February 6, 2024
0 replies
223 views

Traycho Milev
Senior DataOps Engineer

In the realm of data management and analytics, handling complex data structures like nested JSON can often pose challenges. Snowflake, a cloud-based data platform, offers robust solutions for managing and querying such data efficiently. However, when dealing with nested JSON stored within a CSV file, extracting and flattening the JSON data into a tabular format can be tricky. In this article, we'll explore how to overcome this challenge using Snowflake's capabilities.

The Challenge

Consider a scenario where you have a CSV file containing a column with nested JSON data, like the example below:

TIMESTAMP,NESTED_JSON_COLUMN
2024-01-01T00:00:00.000Z,"{""NESTED_JSON_COLUMN"":{""KEY_1"":""VALUE_1""",KEY_2:0,KEY_3:0,"KEY_4:""VALUE_4""}}"

The goal is to extract each nested value as a separate column while maintaining data integrity. When attempting to view the CSV file directly in a text editor, you'll encounter broken data due to the nested JSON structure. Instead of displaying the data in a readable format, the editor may present it as a jumbled mess, making it difficult to interpret and manipulate the data effectively.

The Solution: Flattening Nested JSON

To tackle this challenge, we need to flatten the nested JSON structure into a tabular format. Snowflake provides the PARSE_JSON() function to parse JSON strings, allowing us to access nested values. However, when dealing with nested JSON within a CSV file, a straightforward approach like comma separation may yield incorrect results.

Constructing the Query

To address this issue, we can construct a query that flattens the JSON structure within the CSV file. Here's a sample query:

COPY INTO YOUR_TABLE_NAME
FROM (
    SELECT
        $1,
        PARSE_JSON($2 || ',' || $3 || ',' || $4 || ',' || $5)::VARIANT,
        CONVERT_TIMEZONE('UTC', CURRENT_TIMESTAMP()) AS "ingestion_time",
        METADATA$FILENAME
    FROM @STAGE_NAME
)
FILE_FORMAT = (
    FORMAT_NAME = 'YOUR_CUSTOM_FILE_FORMAT'
)
PATTERN = '.*[.]csv';

In this query:

We concatenate the JSON columns within the SELECT statement.
Using PARSE_JSON(), we convert the concatenated JSON string into a variant data type, effectively flattening the nested structure.
We include CONVERT_TIMEZONE() to capture the ingestion time for each record.
We specify the custom file format YOUR_CUSTOM_FILE_FORMAT, which is configured to handle CSV files with nested JSON.

Custom File Format Configuration

The custom file format YOUR_CUSTOM_FILE_FORMAT should be defined as follows:

CREATE OR REPLACE FILE FORMAT YOUR_CUSTOM_FILE_FORMAT
    TYPE = 'CSV'
    SKIP_HEADER = 1;

The SKIP_HEADER parameter ensures that Snowflake skips the header row when processing the CSV file.

Automating the Ingestion:

In DataOps environment, the goal would be to automate and streamline data processes from ingestion to consumption. The extraction of nested JSON from CSV files can be seamlessly integrated into a pipeline. The optimal approach would entail leveraging both SOLE and MATE components in the DataOps pipeline. Here's how each component can be utilised effectively:

Utilise SOLE to define staging areas where raw data, including CSV files with nested JSON, is ingested from various sources.
Define Snowflake tables using SOLE
Define a custom file format using SOLE, specifying the necessary parameters such as CSV format and header skipping.
Utilise MATE for executing SQL queries within Snowflake to transform the raw nested JSON data into a tabular format. Implement parsing and flattening logic using Snowflake's PARSE_JSON function and SQL operations to extract nested values.

Conclusion

In conclusion, handling nested JSON within CSV files in Snowflake requires careful consideration and proper querying techniques. By flattening the JSON structure and configuring the appropriate file format, you can extract nested values and ingest data seamlessly into Snowflake tables. Understanding these techniques empowers data engineers and analysts to efficiently work with complex data structures, unlocking valuable insights for their organisations.

The Challenge

The Solution: Flattening Nested JSON

Constructing the Query

Custom File Format Configuration

Conclusion

Sign up

Login to the community

Scanning file for viruses.

This file cannot be downloaded