Google BigQuery and Snowflake are popular cloud data warehousing platforms that are widely used today. Although each can be used independently as a data warehouse, there are scenarios where data needs to be transferred between BigQuery and Snowflake. On a recent customer engagement, some third-party data was stored in BigQuery and we needed to bring slices of it into our Snowflake account. This article discusses the issues we ran into on that project and how we solved them.
Access Permissions
The requirement was for a periodic data transfer from BigQuery to Snowflake. This may sound simple, but the BigQuery and Snowflake environments are owned by different organizations and live in different regions. We also did not have access rights to pull tables directly from BigQuery: for each use case, we needed to pull a specific slice of data and transfer it over to our internal Snowflake account. In addition, we could not create new buckets or other objects in the third-party-owned Google account.
To work around the lack of access permissions, we split the process into two steps (a sketch of both follows the list):
1) First, export the tables from BigQuery into a Google Cloud Storage (“GCS”) bucket provided by the third party that owns the data.
2) Second, copy the exported files from the third-party GCS bucket into our own GCS bucket. Because the BigQuery warehouse sits in a different region than our Snowflake environment, we created our bucket in a region close to where our Snowflake instance is located, which keeps egress charges and transfer time down; the region-to-region copy therefore happens at the file level between GCS buckets.
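Concretely, the two hops might look like the sketch below, which uses the google-cloud-bigquery and google-cloud-storage client libraries. All project, dataset, table, and bucket names are placeholders, and it assumes a single set of credentials that can both read the third-party bucket and write to ours; it is an illustration of the approach, not our production code.

```python
from google.cloud import bigquery, storage

# Step 1: export the BigQuery table slice into the third party's GCS bucket.
bq_client = bigquery.Client(project="third-party-project")          # hypothetical project
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.PARQUET,
    compression=bigquery.Compression.SNAPPY,
)
extract_job = bq_client.extract_table(
    "third-party-project.dataset.table_slice",                      # hypothetical table
    "gs://third-party-bucket/exports/table_slice-*.parquet",        # hypothetical bucket
    job_config=extract_config,
)
extract_job.result()  # wait for the export to finish

# Step 2: copy the exported files into our own bucket, created in a region
# close to the Snowflake account, to keep egress time and cost down.
gcs_client = storage.Client(project="our-project")                  # hypothetical project
source_bucket = gcs_client.bucket("third-party-bucket")
dest_bucket = gcs_client.bucket("our-snowflake-staging-bucket")     # hypothetical bucket
for blob in source_bucket.list_blobs(prefix="exports/"):
    source_bucket.copy_blob(blob, dest_bucket, new_name=blob.name)
```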
Data Transfer Formats
The tables from BigQuery can be transferred to Snowflake in a variety of formats. Snowflake supports the following formats:
JSON
Avro
ORC
Parquet
CSV
XML
The tables in BigQuery are in a semi-structured format, meaning some columns are nested and repeated. For example, a nested column contains multiple columns within it, and a repeated column holds multiple values for a single row. Since Snowflake stores our data in regular, structured tables, we had to run a flattening process to convert the nested columns into separate columns. Essentially, we needed to convert from a NoSQL-style format to a SQL format.
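To illustrate that flattening step, the sketch below (using the Snowflake Python connector) lands each Parquet record in a single VARIANT column and then pulls a hypothetical nested customer field and repeated items array out into ordinary columns with LATERAL FLATTEN. The connection parameters, stage, table, and field names are assumptions rather than the actual schema.

```python
import os
import snowflake.connector

# Hypothetical connection parameters; the password is read from the environment.
conn = snowflake.connector.connect(
    account="our_account",
    user="etl_user",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="etl_wh",
    database="analytics",
    schema="staging",
)
cur = conn.cursor()

# Land each Parquet record as a single VARIANT value. The external stage
# @bq_export_stage is assumed to already exist and to point at our GCS bucket.
cur.execute("CREATE OR REPLACE TABLE orders_raw (v VARIANT)")
cur.execute("""
    COPY INTO orders_raw
    FROM @bq_export_stage
    FILE_FORMAT = (TYPE = PARQUET)
""")

# Flatten a hypothetical nested customer record and repeated items array
# into ordinary columns.
cur.execute("""
    CREATE OR REPLACE TABLE orders_flat AS
    SELECT
        v:order_id::STRING      AS order_id,
        v:customer:name::STRING AS customer_name,   -- nested column
        item.value:sku::STRING  AS item_sku,        -- repeated column
        item.value:qty::NUMBER  AS item_qty
    FROM orders_raw,
         LATERAL FLATTEN(input => v:items) item
""")
```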
CSV Format
At first, we naively thought we could transfer the tables using the CSV format (before we realized the tables contained nested columns!), but as you can guess, this was never going to work: CSV is a flat list of fields, and there was no meaningful way to represent the nesting in it.
Parquet Format
The next option was the Parquet format. Parquet is an open-source data file format from Apache that stores data in a columnar layout for fast retrieval and compresses efficiently, and it offered the best compression ratio of the remaining file types, so we decided to go with it.
The Parquet format worked fine for some time, but one day we started facing an issue that affected our data pipeline: the structure of the Parquet files began changing unpredictably. Because of this, we had to re-adjust our code every time the file structure changed, which took extra time and effort and started to affect our delivery schedule. To overcome this, we decided to experiment with other formats. The issue may have been due to the fact that Parquet export from BigQuery was still in beta at the time and Google was experimenting with different ways of generating the files.
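One way to catch this kind of structural change before it breaks a load, sketched below with pyarrow, is to compare the schema of each new export against the schema the pipeline expects. The file paths are hypothetical, and this is an illustration of the general idea rather than the exact check we ran.

```python
import pyarrow.parquet as pq

# Hypothetical paths: a file from the last known-good export and a file
# from the latest export (downloaded locally from the GCS bucket).
expected_schema = pq.read_schema("known_good/table_slice.parquet")
new_schema = pq.read_schema("latest/table_slice.parquet")

if not new_schema.equals(expected_schema):
    # Surface the difference instead of letting the load fail downstream.
    added = set(new_schema.names) - set(expected_schema.names)
    removed = set(expected_schema.names) - set(new_schema.names)
    raise RuntimeError(
        f"Parquet structure changed: added columns {added}, removed columns {removed}"
    )
```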
We started from scratch, doing a methodical analysis comparing the different file formats. The objective was to replace Parquet with another file format if that format's output size came within 20% of the Parquet file sizes.
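A comparison along these lines can be scripted by re-running the export once per format and compression pair and recording the wall-clock time and total exported bytes, as in the rough sketch below (using the same hypothetical names as earlier); the 20% size threshold then becomes a simple check over the recorded sizes.

```python
import time
from google.cloud import bigquery, storage

bq_client = bigquery.Client(project="third-party-project")    # hypothetical project
gcs_client = storage.Client(project="third-party-project")
bucket = gcs_client.bucket("third-party-bucket")               # hypothetical bucket

# Format/compression pairs supported by BigQuery extract jobs.
candidates = [
    (bigquery.DestinationFormat.PARQUET, bigquery.Compression.SNAPPY, "parquet"),
    (bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON, bigquery.Compression.GZIP, "json.gz"),
    (bigquery.DestinationFormat.AVRO, bigquery.Compression.DEFLATE, "avro"),
]

for fmt, comp, ext in candidates:
    config = bigquery.ExtractJobConfig(destination_format=fmt, compression=comp)
    prefix = f"format-test/{ext}/"
    start = time.monotonic()
    bq_client.extract_table(
        "third-party-project.dataset.table_slice",             # hypothetical table
        f"gs://third-party-bucket/{prefix}part-*.{ext}",
        job_config=config,
    ).result()
    elapsed = time.monotonic() - start
    total_bytes = sum(b.size for b in bucket.list_blobs(prefix=prefix))
    print(f"{fmt}: {elapsed:.0f} sec, {total_bytes / 1e9:.1f} GB")
```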
Data Transfer Performance Results
After running our pipeline multiple times with different file formats, we got the following results for the data transfer from BigQuery to Snowflake using our process:
| Format  | Compression | Time         | Size   |
|---------|-------------|--------------|--------|
| PARQUET | SNAPPY      | 2 min 10 sec | 4.3 GB |
| JSON    | GZIP        | 4 min 3 sec  | 6.9 GB |
| AVRO    | DEFLATE     | 3 min 13 sec | 5.1 GB |
Looking at the above table, it was evident that Parquet had the best compression and run time compared to the JSON and Avro formats. We decided to continue using the Parquet format until we faced the issue again. Fortunately, the next time we ran our pipeline there were no changes in the file structure, and it executed without any issues. Even though we did not end up changing the data format, this exercise gave us good insight into the different file formats available for storing data.