Azure Databricks empowers you to manage and analyze vast amounts of data. However, with great power comes great responsibility! As your data ecosystem grows, so does the need for robust data governance practices. This article explores best practices for leveraging two key functionalities in Azure Databricks: DBFS (Databricks File System) and Unity Catalog.
By understanding the strengths of each and implementing recommended practices, you can establish a secure and collaborative data management environment that fosters efficient data utilization and informed decision-making.
Best Practices for DBFS and Unity Catalog in Azure Databricks
Here are the best practices you should follow to leverage the strengths of DBFS and Unity Catalog for efficient and secure data management in Azure Databricks:
Best Practice 1: Manage Data in Azure Databricks Workspaces
Imagine a central hub for all your data assets within Azure Databricks. That's precisely what Unity Catalog provides. It functions as a unified metastore, offering a centralized platform to manage data access control, auditing, lineage, and data discovery across all workspaces in your account.
Benefits of Unity Catalog-enabled Workspaces:
Enhanced Security: Unity Catalog enforces a robust security model based on ANSI SQL standards. This allows you to grant granular access permissions using familiar SQL syntax (see the example after this list), ensuring only authorized users can access specific data sets.
Simplified Collaboration: Collaboration becomes effortless with a single source of truth for data definitions and permissions. Data analysts, data scientists, and engineers can work seamlessly across workspaces, eliminating the need to manage permissions in individual silos.
Improved Auditing and Lineage Tracking: Unity Catalog automatically captures detailed audit logs, recording data access and usage. It also tracks data lineage, allowing you to understand the origin and transformation steps of the data assets.
Streamlined Data Discovery: Finding the data you need becomes a breeze. Unity Catalog's intuitive interface enables users to search and discover relevant data sets across workspaces, fostering efficient data utilization.
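As an example of the SQL-based permission model mentioned above, here is a minimal sketch of Unity Catalog grants; the catalog, schema, table, and group names are hypothetical:
# Hypothetical names: catalog "sales", schema "reporting", table "orders", group "analysts"
spark.sql("GRANT USE CATALOG ON CATALOG sales TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA sales.reporting TO `analysts`")
spark.sql("GRANT SELECT ON TABLE sales.reporting.orders TO `analysts`")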
Transitioning to Unity Catalog:
Databricks offers various resources to guide the migration process to Unity Catalog-enabled workspaces. You can create a new Unity Catalog-enabled workspace or migrate your existing workspaces to leverage its advantages.
Unity Catalog vs. DBFS:
While DBFS (Databricks File System) remains an option for storing data files, Unity Catalog offers a more secure and centralized approach to data management. For optimal security, Databricks recommends leveraging Unity Catalog's features whenever possible.
Best Practice 2: DBFS in Single User Access Mode on Azure Databricks
Clusters configured with Single User access mode grant full access to DBFS. This encompasses all files within the DBFS root directory and mounted data sources. Because of this unrestricted access, Single User mode is the preferred choice for machine learning workloads that necessitate working with data residing in DBFS and Unity Catalog (through mounts).
| Aspect | Description |
| --- | --- |
| Access Level | Full access to DBFS |
| Accessible Data | All files in the DBFS root directory and data available through DBFS mounts |
| Suitability | Ideal for machine learning workloads that require working with data in both DBFS and Unity Catalog locations |
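For instance, on a cluster in Single User access mode you can combine data from a DBFS mount and a Unity Catalog table in the same notebook. The following is a minimal sketch; the mount path, table name, and join key are hypothetical:
# Read raw files from a DBFS mount (full DBFS access in Single User mode)
raw_df = spark.read.format("parquet").load("/mnt/training_data/events")
# Read curated features from a Unity Catalog table
features_df = spark.read.table("main.ml.features")
# Combine both sources for a machine learning workload
training_df = raw_df.join(features_df, on="user_id", how="inner")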
Databricks Best Practice for Production Workloads
While Single User mode offers ease of use with unrestricted access, Databricks recommends employing service principals with scheduled jobs for production workloads that necessitate access to data managed by both DBFS and Unity Catalog. Service principals are secure identities used by applications to access Azure Databricks resources. This approach enhances security by separating access for scheduled jobs from the interactive environment provided by Single User mode.
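A minimal sketch of this pattern, assuming the Databricks Jobs API 2.1; the workspace URL, access token, notebook path, cluster settings, and service principal application ID are all hypothetical placeholders:
import requests

job_spec = {
    "name": "nightly-feature-build",
    "tasks": [{
        "task_key": "build_features",
        "notebook_task": {"notebook_path": "/Jobs/build_features"},
        "new_cluster": {
            "spark_version": "14.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
    }],
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
    # Run the scheduled job as a service principal rather than an interactive user
    "run_as": {"service_principal_name": "<service-principal-application-id>"},
}

requests.post(
    "https://<your-workspace>.azuredatabricks.net/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <your-access-token>"},
    json=job_spec,
)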
Best Practice 3: DBFS in Shared Access Mode on Azure Databricks
Shared access mode on Azure Databricks presents a unique interplay between Unity Catalog's data governance and the legacy access control mechanisms of Databricks.
| Aspect | Best Practice |
| --- | --- |
| Access Control | Rely on Unity Catalog permissions as the primary access control mechanism; treat legacy Hive Metastore table ACLs as secondary. |
| Data Governance | Utilize Unity Catalog features for data lineage tracking, auditing, and data discovery. |
| Collaboration | Encourage collaboration across workspaces with a centralized data catalog. |
| Security | Grant the "ANY FILE" permission sparingly, because it bypasses Unity Catalog table ACLs. |
| DBFS Permissions | Use DBFS permissions cautiously, primarily for administrative purposes. |
| Monitoring | Implement monitoring practices to track data access and usage. |
| External Storage | Consider mounting external storage locations within Unity Catalog for additional security. |
Understanding:
Unity Catalog: The central component for data governance in Azure Databricks. It manages access control through permissions assigned to users and groups.
Hive Metastore: A legacy component that stores metadata about data tables.
Legacy ACLs: Pre-existing access control lists associated with tables within the Hive Metastore.
Shared Access Mode Behavior:
Unity Catalog Takes the Lead: Shared access mode prioritizes Unity Catalog's data governance framework. This means access to data stored within the Hive Metastore is granted solely to users with explicitly assigned permissions through Unity Catalog.
# Granting read access on table "my_table" to a user within Unity Catalog (the principal name is illustrative)
spark.sql("GRANT SELECT ON TABLE my_table TO `analyst@example.com`")
Legacy ACLs Still Play a Role (Partially): However, shared access mode doesn't completely disregard pre-existing table Access Control Lists (ACLs) established within the Hive Metastore. These legacy ACLs might still influence data access, but their power is diminished compared to Unity Catalog permissions.
DBFS Interaction and the "ANY FILE" Permission
Direct DBFS File Interaction: To directly interact with files using DBFS in shared access mode, users require "ANY FILE" permission. This allows them to read or write data directly from DBFS paths.
# Granting "ANY FILE" permission to user "admin_user" (Use cautiously!)
dbutils.fs.permission(path="/mnt/data", mask="rwx-wx-wx", user="admin_user")
Caution Advised: Granting "ANY FILE" carries significant weight. It essentially grants users the ability to bypass the security measures enforced by Unity Catalog's table ACLs and access all data managed by DBFS. Due to this broad access, Databricks strongly recommends caution when assigning this permission.
Best Practices for Secure Data Management
Minimize "ANY FILE" Usage: Limit the assignment of "ANY FILE" permissions to only those users who require unrestricted access to DBFS files. (Keywords: Shared access mode, ANY FILE best practices)
Leverage Unity Catalog Permissions: For granular control, leverage Unity Catalog's permission system to grant users access to specific data sets based on their roles and responsibilities. (Keywords: Shared access mode, Unity Catalog permissions, data access control)
Consider Alternative Access Methods: Explore alternative methods for interacting with DBFS files, such as mounting external storage locations within the Unity Catalog. This approach leverages Unity Catalog's security mechanisms instead of relying solely on DBFS permissions. (Keywords: Shared access mode, Unity Catalog external storage mounts, data security)
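As a sketch of the external-location alternative mentioned above, assuming a Unity Catalog storage credential named "uc_credential" already exists and using hypothetical container, account, and location names:
# Register external storage with Unity Catalog instead of relying on a DBFS mount
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS finance_raw
    URL 'abfss://<your_container_name>@<your_storage_account>.dfs.core.windows.net/raw'
    WITH (STORAGE CREDENTIAL uc_credential)
""")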
Best Practice 4: Avoid Mixed Access Models
While DBFS (Databricks File System) and Unity Catalog offer functionalities for data storage and access in Azure Databricks, they operate with distinct data access models.
Unity Catalog External Locations:
Mechanism: Unity Catalog utilizes full Cloud Storage Uniform Resource Identifiers (URIs) to manage access control for data in external storage locations.
Benefits: These URIs enable fine-grained access control. You can ensure that only authorized users can access specific datasets within those external locations.
DBFS Mounts:
Mechanism: DBFS mounts employ a separate data access model that bypasses Unity Catalog's security mechanisms.
Drawback: Access controls established within Unity Catalog wouldn't apply to data mounted through DBFS.
# This example showcases hypothetical URI and mount path for illustrative purposes
# Replace "<your_storage_account>" and "<your_container_name>" with your actual details
# Unity Catalog External Location (URI)
uc_uri = "abfss://<your_container_name>@<your_storage_account>.dfs.core.windows.net/data"
# DBFS Mount Path
dbfs_mount_path = "/mnt/data"
Security Concerns with Reuse:
Databricks advises against reusing cloud object storage volumes between DBFS mounts and Unity Catalog external locations. Here's why:
Inconsistent Security: Reusing volumes creates a situation where data security is governed by two different, potentially conflicting, access models. This inconsistency can lead to security vulnerabilities and unintended data access.
Bypassing Unity Catalog Controls: If data in a reused volume is accessed through a DBFS mount, Unity Catalog's access control mechanisms would be bypassed. This could potentially expose sensitive data to unauthorized users.
Best Practices for Secure Data Management:
Separate Storage Locations: Maintain separate cloud object storage volumes for DBFS mounts and Unity Catalog external locations. This ensures each data access model operates independently, within its intended security controls.
Leverage Unity Catalog Features: You can use Unity Catalog's robust permission system to manage access granularly for data stored in external locations. This allows you to grant users access based on their roles and responsibilities.
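For example, here is a minimal sketch of granting role-based access to an external location through Unity Catalog; the location and group names are hypothetical, and the external location is assumed to already exist:
# Allow data engineers to read files and create external tables at the location
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION finance_raw TO `data_engineers`")
spark.sql("GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION finance_raw TO `data_engineers`")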
Best Practice 5: Securing Unity Catalog-Managed Storage
Each Unity Catalog metastore relies on an object storage account, configured by an Azure Databricks account administrator. This designated location is the central repository for all data and metadata associated with Unity Catalog-managed tables. Here's what makes it secure:
Fresh Start with Dedicated Storage:
Databricks recommends creating a new object storage account for your Unity Catalog metastore. This minimizes the risk of lingering security vulnerabilities from previous data storage practices.
Example
# Replace "<your_cloud_provider>.com" with the appropriate cloud provider URL
storage_client = <Cloud Provider Storage Client Library>.StorageManagementClient("<your_subscription_id>",
"<your_resource_group>")
storage_client.storage_accounts.create("<unique_storage_account_name>", "<region>")
Custom Identity Policy
Function: A custom identity policy defines who or what can access the account, ensuring only authorized entities can interact with the data.
Configuration: Ensure the policy is configured to grant access solely to Unity Catalog. This restricts unauthorized access attempts.
# This example showcases enabling a managed identity on the storage account for illustrative purposes
# (In practice, Unity Catalog access is typically granted by assigning a storage role to the managed
# identity of an Azure Databricks access connector)
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
credential = DefaultAzureCredential()
client = StorageManagementClient(credential, "<your_subscription_id>")
# Enable a system-assigned managed identity on the storage account (illustrative example)
client.storage_accounts.update(
    "<resource_group>", "<storage_account_name>",
    {"identity": {"type": "SystemAssigned"}})
Restricted Access: Only for Unity Catalog
The object storage account configured for your Unity Catalog metastore should be accessible solely by Unity Catalog. This eliminates the possibility of unauthorized access from external sources.
Identity Access Policies
Function: Access to the object storage account should be granted exclusively through identity access policies created specifically for the Unity Catalog. These policies define granular permissions for authorized users and services within the context of the Unity Catalog.
Configuration: These policies should specify the level of access (read, write, etc.) granted to Unity Catalog and any authorized services for interacting with the data.
# Replace "<storage_account_name>" with your actual storage account name
# Replace "<your_cloud_provider>.com" with the appropriate cloud provider URL
# This example showcases setting storage blob container permissions for illustrative purposes
# (Actual steps may involve creating IAM roles or policies)
from azure.storage.blob import BlobServiceClient
connect_str = f"DefaultEndpointsProtocol=https;AccountName=<storage_account_name>;EndpointSuffix=<your_cloud_provider>.com"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
# Get container object (replace "your-container-name" with your actual container name)
container_client = blob_service_client.get_container_client("your-container-name")
# Set container access policy to grant read access to a specific user or service principal (illustrative example)
container_client.set_container_acl(public_access="None")
# Restrict public access first
container_client.set_container_acl("<user_or_service_principal>")
# Grant access to authorized entity
Best Practice 6: Add Existing Data to Unity Catalog
Unity Catalog provides a powerful mechanism for incorporating existing data stored in external accounts through the concept of external locations. This functionality lets you integrate data from diverse sources into your centralized data management framework.
Security First: Revoking Old Credentials
While adding existing storage accounts as external locations offers convenience, security remains a paramount concern. Databricks strongly recommends revoking all preexisting access patterns and storage credentials associated with those accounts before loading them into the Unity Catalog.
Code Example
# Replace "storage_account_name" with your actual storage account name
# Replace "<your_cloud_provider>.com" with the appropriate cloud provider URL
# This example showcases revoking access keys for illustrative purposes
# (Actual steps may involve removing roles or IAM policies)
storage_client = <Cloud Provider Storage Client Library>.StorageAccount(account_name="storage_account_name",
endpoint="<your_cloud_provider>.com")
for access_key in storage_client.list_account_sas_keys():
storage_client.delete_account_sas_key(access_key.name)
Why Revoking Credentials Matters:
Revoking old credentials is crucial for several reasons:
Minimized Risk: It eliminates potential security vulnerabilities that might arise from leftover access permissions associated with those credentials.
Enhanced Governance: By centralizing access control within Unity Catalog, you gain greater control over who can access and manage the data.
DBFS Root
Databricks advises against loading a storage account that serves as the DBFS root into Unity Catalog as an external location. This restriction stems from the inherent differences in their access control mechanisms.
Reasons to Avoid DBFS Root as an External Location:
Conflicting Security Models: DBFS and Unity Catalog employ distinct data access models. Loading the DBFS root as an external location could create security inconsistencies, potentially opening doors to unintended data access.
Compliance Concerns: This practice might not align with compliance requirements that demand clear segregation of data access controls.
Best Practice 7: Cluster Configuration
When working with Unity Catalog in Azure Databricks, it's crucial to understand how it interacts with cluster configurations, particularly for file system access.
Unlike traditional data management systems, Unity Catalog doesn't rely on cluster configurations for file system settings. In practice, this means custom Hadoop file system settings defined in a cluster configuration are not picked up by Unity Catalog when it accesses data: Unity Catalog bypasses cluster-level Hadoop file system settings entirely when working with Azure Databricks.
These Hadoop file system settings typically allow you to configure custom behavior related to how data is accessed from cloud object storage. This could include settings for things like:
Authentication: Specifying credentials or tokens used to access the storage.
Caching: Defining caching policies to improve performance for frequently accessed data.
Encryption: Configuring encryption at rest or in transit for data security.
Compression: Setting compression formats for data stored in the cloud object storage.
Since Unity Catalog takes a centralized approach to data access and security, it enforces its own mechanisms for these functions. This ensures consistent behavior across all clusters in your workspace and simplifies data management.
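To illustrate, here is a hedged sketch with hypothetical account names and table: cluster- or notebook-level Hadoop file system settings like these are not consulted when Unity Catalog resolves access to its tables:
# Cluster-level Hadoop file system settings (hypothetical storage account and key)
spark.conf.set("fs.azure.account.auth.type.<your_storage_account>.dfs.core.windows.net", "SharedKey")
spark.conf.set("fs.azure.account.key.<your_storage_account>.dfs.core.windows.net", "<account-key>")

# Reads that go through Unity Catalog use the metastore's storage credentials instead
df = spark.read.table("main.default.my_table")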
Key Points:
Unity Catalog manages data access and security independently of cluster configurations.
Custom file system settings configured in clusters won't be applied when using Unity Catalog.
Why Does Unity Catalog Do This?
There are several reasons behind this design choice:
Centralized Control: Unity Catalog aims to provide a centralized platform for data access and management. Relying on cluster-specific configurations could lead to inconsistencies and complicate data governance.
Security Focus: By bypassing cluster configurations, Unity Catalog enforces its robust security mechanisms for data access. This ensures a consistent and secure experience across all clusters within your workspace.
Simplified Management: Eliminating the need for cluster-specific configurations streamlines data management, reducing the risk of errors and inconsistencies.
Alternative Approaches for Data Access:
Since Unity Catalog disregards cluster configurations, here are some alternative approaches to manage data access:
1. Leverage Unity Catalog Permissions: Utilize Unity Catalog's permission system to grant granular access control to specific datasets based on user roles and responsibilities. This approach offers a centralized and secure way to manage data access across your workspace.
Example:
# Grant read access on table "my_table" to a user (the principal name is illustrative)
spark.sql("GRANT SELECT ON TABLE my_table TO `analyst@example.com`")
2. Configure Storage Directly: You can configure cloud object storage settings directly in your storage provider's console. However, this approach bypasses Unity Catalog's centralized management and security benefits.
Best Practice 8: Manage Multiple Path Access in Unity Catalog and DBFS
While Unity Catalog and DBFS can co-exist within your workspace, you cannot reference paths with a parent-child relationship or identical paths using different access methods within the same command or notebook cell. This means you cannot simultaneously access data from a Unity Catalog external location and a DBFS directory that share the same parent or identical paths.
Scenario:
Imagine you have an external table named "foo" registered in the Hive Metastore at location "dbfs:/mnt/data/a/b/c". You read it through the table API (a Unity Catalog-governed access path) and want to write the filtered data back to the same "dbfs:/mnt/data/a/b/c" location using DBFS write methods. Doing both in the same cell causes an error due to the conflicting access methods.
Solution: Separate Cells
Cell 1: Read Data from Unity Catalog (spark.read.table)
# This cell reads data from the Unity Catalog external table "foo"
df = spark.read.table("foo")
# Perform any necessary transformations on the DataFrame (optional)
filtered_df = df.filter("id IS NOT NULL")
Cell 2: Write Data to DBFS (df.write)
# This cell writes the filtered DataFrame to the desired DBFS location
filtered_df.write.mode("overwrite").save("dbfs:/mnt/data/a/b/c")
Explanation:
Cell 1:
Uses spark.read.table("foo") to read data from the Unity Catalog external table "foo".
Optionally performs any transformations on the DataFrame using operations like filter in this example.
Stores the resulting DataFrame as filtered_df.
Cell 2:
Uses the previously created filtered_df DataFrame.
Employs filtered_df.write.mode("overwrite").save("dbfs:/mnt/data/a/b/c") to write the DataFrame to the desired DBFS location "dbfs:/mnt/data/a/b/c" using overwrite mode.
Important Note:
This approach avoids referencing the same path with different access methods within a single cell.
Remember to replace "foo" and "dbfs:/mnt/data/a/b/c" with your table name and desired location.
Consider using descriptive variable names for your DataFrames to improve readability and maintainability.
Depending on your specific workflow, you might need to chain multiple transformations within Cell 1 before writing the data to DBFS in Cell 2.
Alternative Approaches:
While the separate-cell approach offers a solution, consider these alternatives for improved code clarity and maintainability:
Refactor Data Organization: If possible, restructure your data storage to avoid overlapping paths between Unity Catalog and DBFS. This reduces the likelihood of encountering this limitation.
Utilize Consistent Access Methods: Whenever possible, strive to use a single access method (either Unity Catalog or DBFS) throughout your code for reading and writing data. This eliminates potential conflicts arising from mixed access methods.
Conclusion
This article explored best practices for DBFS and Unity Catalog in Azure Databricks. By leveraging Unity Catalog's access control alongside DBFS's flexibility, you can create a secure and collaborative data environment. Use Unity Catalog's auditing, lineage, and discovery features to strengthen data governance and exploration. Remember, data security is key! Regularly review your practices for optimal data management in Azure Databricks.