Thursday, April 27, 2023

Azure Managed Instance to Parquet using CETAS and t-sql

And we're back! Not that anybody noticed.  'Scool. IP agreements and the like. 


I'll make it brief since I logged on at 11pm to get this working.

TL;DR - the below lines will allow you to query a table on your MI, creating Parquet files in Azure blob storage. And you can query it! Next up is partitioning over time, etc, etc. But this is freaking fantastic. I have a python script I wrote that does it, but it's nowhere as nice/easy as this.

Why do you care? Because it's a fantastically easy way to archive older data to blob storage, and I suspect (need to test) that if you do it right, you can then have it go to cool/archive storage via a lifecycle setup, so that if you need it much later, you can. Parquet is a columnar compressed format that can then feed other things like Snowflake, Hadoop, Redshift Spectrum, AWS Athena, etc. For me it's a platform-agnostic, portable, highly-compressed data format that I can use to better archive older data. 

CETAS: Create External Table As Select


1) Walking the dog and reading my RSS feeds on Feedly, I see this:
https://azure.microsoft.com/en-us/updates/azure-sql-general-availability-updates-for-lateapril-2023/

2) reading through, go to this:
SQL Managed Instance now supports CETAS (“Create External Table As Select”) functionality. This feature allows users to export data from local database tables into parquet and CSV files in Azure storage and creates external tables for easy querying of the exported data through data virtualization.

3) Run back home and log in. My poor dog was out of breath.

4) After dinking for 30 minutes trying to get my local copy of the az powershell module to work with the command, (still isn't working, seriously, wtaf), and being unable to use cloud shell since it requires cloud storage and permissions are locked down too much, I finally use -UseDeviceauthentication and run these commands on my instance:

# Enable ServerConfigurationOption with name "allowPolybaseExport"

Set-AzSqlServerConfigurationOption -ResourceGroupName "myresource_group_name" -InstanceName "myManagedInstanceName" `

-Name "allowPolybaseExport" -Value 1


5) Verify it works.

SELECT [name], [value] FROM sys.configurations WHERE name ='allow polybase export';

Returns 1!


6) now, I use the following t-sql, pretty much verbatim from their example and a SAS token. The only magic I've added is the 10 minutes on how to use a SAS token. 

USE my

GO

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'password would go here';

GO
CREATE DATABASE SCOPED CREDENTIAL [CETASCredential]
    WITH IDENTITY ='SHARED ACCESS SIGNATURE',
SECRET = 'sp=rwl&st=2023-04-27T04:37:23Z&se=2023-05-26T12:37:23Z&spr=https&sv=2021-12-02&sr=c&sig=stringofstuffremovedhahahahahthisisfromthesastokenA%3D' ; --Removing leading '?'
GO

 

CREATE EXTERNAL DATA SOURCE [CETASExternalDataSource]
WITH (
    LOCATION = 'abs://mycontainer@mystorageacct.blob.core.windows.net',
    CREDENTIAL = [CETASCredential] );
GO

 

CREATE EXTERNAL FILE FORMAT [CETASFileFormat]
WITH(
    FORMAT_TYPE=PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
    );
GO

 

-- Count how many rows we plan to offload
SELECT COUNT(*) FROM my.dbo.table

CREATE EXTERNAL TABLE mytable_polybase_example

WITH (
    LOCATION = 'I/get/multiple/paths/for/multiple/tables!/',
    DATA_SOURCE = [CETASExternalDataSource],
    FILE_FORMAT = [CETASFileFormat])
AS 
    SELECT 
        *
    FROM 
       my.dbo.table

 

-- you can query the newly created external table
SELECT COUNT (*) FROM mytable_polybase_example;