In a data-driven world, managing and governing your data assets effectively is crucial. Extracting value from your data requires a robust data governance strategy that ensures data security, accessibility, and quality.
Unity Catalog is a centralized data governance solution that unifies the management of data, machine learning models, notebooks, and other data assets across your workspaces in Azure Databricks.
This guide will teach you all about Unity Catalog in Azure Databricks, like:
What it is: We'll explain what Unity Catalog is and how it helps you manage your data better.
Why it exists: We'll talk about the problems that Unity Catalog solves and why it was created.
How it works: We'll break down the different parts of Unity Catalog, like metastores, catalogs, and tables, and show you how they work together.
Let's Begin!
What is Unity Catalog?
In Azure Databricks, Unity Catalog is a centralized data governance solution. It unifies the management of data, machine learning models, notebooks, and other data assets across your workspaces. Unity Catalog was introduced to address the challenges of governing data across multiple Azure Databricks workspaces.
The Issue: Fragmented Data Governance
Fragmented data governance is a situation where the practices and tools used to manage data access, security, and quality are scattered and inconsistent across different parts of an organization. This typically happens when data is siloed, meaning it's isolated and stored in separate systems and applications.
This leads to:
Data Inconsistency and Errors: Fragmented data governance can lead to inconsistencies in data definitions, formats, and quality across different silos. This can make it difficult to trust the data for analysis and decision-making.
Security Risks: Inconsistent access control policies can increase the risk of data breaches and unauthorized access.
Compliance Challenges: Meeting data privacy regulations like GDPR or CCPA becomes more difficult with fragmented data governance, as it's challenging to track and manage data access across silos.
Inefficient Data Management: Data becomes difficult to find, access, and use effectively.
How Unity Catalog Addresses Fragmented Data Governance
Unity Catalog, introduced in Azure Databricks, aims to solve these challenges by providing a centralized platform for data governance. By acting as a single source of truth for data metadata, it offers features like:
Unified Access Control: Permissions for data access are defined centrally and applied across all connected workspaces, ensuring consistent data security.
Centralized User Management: User access is managed centrally, simplifying administration and reducing the risk of errors.
Improved Data Lineage: Data lineage becomes readily available, providing insights into data origin and flow across workspaces.
Scenario
The image below illustrates the difference between Databricks workspaces with and without Unity Catalog.
Databricks Workspace without Unity Catalog
In this scenario, each Databricks workspace relies on its own Hive Metastore. The Hive Metastore is a repository that stores information about data location, schema, and ownership within the workspace. Consequently, data governance functions such as access control, auditing, and data lineage are managed independently within each workspace.
This can lead to:
Inconsistent Data Security: Difficulty maintaining consistent security policies across workspaces.
Complex User Management: Managing user permissions becomes cumbersome with separate metastores.
Limited Data Lineage Visibility: Tracking data origin and flow across workspaces is challenging.
Databricks Workspace with Unity Catalog
When Unity Catalog is introduced, it acts as a centralized layer that manages metadata for multiple Databricks workspaces. This enables features like:
Unified Access Control: The permissions for data access are established and managed in the Unity Catalog, ensuring consistent enforcement across all connected workspaces.
Streamlined User and Metastore Management: A single point of administration simplifies user and metastore management.
Clear Data Lineage: Unity Catalog provides a centralized view of metadata, offering clear data lineage for improved data quality and troubleshooting.
Unity Catalog Object Model
Here are the core components of Unity Catalog in Azure Databricks:
Metastores
Catalogs
Schemas
1. Metastores
A metastore acts as the top-level container within Unity Catalog. It functions as a central registry, storing metadata about your data and AI assets along with the permissions that govern access to them.
Key Points:
An Azure Databricks account admin should create one metastore for each region in which the organization operates. That metastore is then assigned to workspaces in the same region.
A workspace requiring Unity Catalog functionality needs a connected Unity Catalog metastore.
Optionally, configure a managed storage location for the metastore within your cloud storage account (Azure Data Lake Storage Gen2 or Cloudflare R2 bucket).
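The one-metastore-per-region rule can be pictured as a simple consistency check. The following is a minimal Python sketch, not Databricks API code: the Metastore and Workspace classes, the can_assign function, and the region and workspace names are all hypothetical and exist only to illustrate the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metastore:
    name: str
    region: str


@dataclass(frozen=True)
class Workspace:
    name: str
    region: str


def can_assign(workspace: Workspace, metastore: Metastore) -> bool:
    """A workspace can only be attached to a metastore in its own region."""
    return workspace.region == metastore.region


# Hypothetical example data.
eastus = Metastore("metastore-eastus", "eastus")
analytics = Workspace("analytics-ws", "eastus")
emea = Workspace("emea-ws", "westeurope")

print(can_assign(analytics, eastus))  # True
print(can_assign(emea, eastus))       # False
```

In this picture, the emea-ws workspace would need a second metastore created in its own region.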
2. Catalogs (Organizing Data Assets)
A catalog represents the first layer in Unity Catalog's three-level hierarchical namespace. It serves as a way to organize your data assets.
Permissions and Visibility:
Users can see all catalogs on which they have been granted the USE CATALOG data permission.
Depending on how the workspace was created and enabled for Unity Catalog, users might have default permissions on automatically provisioned catalogs such as main or <workspace-name>.
3. Schemas (Grouping Tables and Views)
A schema (also called a database) forms the second layer in Unity Catalog's namespace. It provides a way to organize the tables and views within your data assets.
Access Control for Schemas:
Users can see every schema on which they have the USE SCHEMA permission, provided they also have the USE CATALOG permission on the schema's parent catalog.
To access or list individual tables or views within a schema, users need the SELECT permission on those specific tables or views.
Workspaces that were enabled for Unity Catalog manually include a default schema within the main catalog, accessible to all users. Workspaces that were enabled automatically and have a <workspace-name> catalog also contain a default schema accessible to all users.
Beyond these containers, Unity Catalog manages several kinds of data objects. Understanding their roles and relationships is crucial for organizing, governing, and accessing your data assets.
Tables
Views
Volumes
Models
Managed Storage
4. Tables (Data Storage and Access)
A table resides at the third layer of Unity Catalog's namespace hierarchy. It serves as the primary container for storing rows of structured data.
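Because a table sits at the third level, it is addressed by a three-part name of the form catalog.schema.table. As a rough illustration, the Python sketch below builds such a name. The helper names quote and full_name are hypothetical, the example identifiers are placeholders, and the backtick quoting follows the usual Databricks SQL convention for delimited identifiers.

```python
def quote(identifier: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backtick."""
    return "`" + identifier.replace("`", "``") + "`"


def full_name(catalog: str, schema: str, obj: str) -> str:
    """Build a three-level name: catalog.schema.object."""
    return ".".join(quote(part) for part in (catalog, schema, obj))


# Hypothetical catalog, schema, and table names.
print(full_name("main", "default", "sales"))
```

The same three-part shape applies to the other third-level objects described later, such as views, volumes, and models.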
Permissions and Access Control:
To create a table, users require:
CREATE TABLE and USE SCHEMA permissions on the target schema.
USE CATALOG permission on the schema's parent catalog.
To query a table, users need:
SELECT permission on the table itself.
USE SCHEMA permission on the parent schema.
USE CATALOG permission on the schema's parent catalog.
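The three query permissions map onto three GRANT statements, one per level of the hierarchy. Here is a minimal Python sketch that generates them. The function name grants_for_query, the principal analysts, and the object names are hypothetical; the statements follow the general GRANT ... ON ... TO ... form of Databricks SQL and would need to be run by someone allowed to grant on those objects.

```python
from typing import List


def grants_for_query(catalog: str, schema: str, table: str, principal: str) -> List[str]:
    """Return the GRANT statements a principal needs in order to query one table."""
    return [
        # Level 1: permission to use the parent catalog.
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        # Level 2: permission to use the parent schema.
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        # Level 3: permission to read the table itself.
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`",
    ]


# Hypothetical names, for illustration only.
statements = grants_for_query("main", "default", "sales", "analysts")
for statement in statements:
    print(statement)
```

Leaving out any one of the three statements would leave the principal unable to query the table, which is why the permissions are listed together.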
Types of Tables:
Managed Tables
External Tables
Managed Tables (Default): Unity Catalog manages the lifecycle and file layout for these tables. Data is stored in the Delta table format for optimal performance and functionality.
Storage Locations:
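Manually enabled workspaces: Managed tables are stored in the root storage location configured during metastore creation.
Automatically enabled workspaces: The metastore root storage location is optional, and managed tables are typically stored at the catalog or schema level.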
Data Deletion: When a managed table is dropped, the underlying data is deleted within 30 days.
External Tables: Reference existing data that resides outside of Unity Catalog's management. They are useful for large datasets or for data that external tools need to access directly.
Supported File Formats: DELTA, CSV, JSON, AVRO, PARQUET, ORC, TEXT.
Data Lifecycle: Unity Catalog doesn't manage the underlying data for external tables. You are responsible for its lifecycle and access control.
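To make the managed/external distinction concrete, here is a small Python sketch that builds the two kinds of CREATE TABLE statement. It assumes the common Databricks SQL convention that adding a LOCATION clause registers the table as external. The helper name create_table_sql, the table names, the columns, and the abfss:// path are hypothetical placeholders, and a real external table also needs an external location that your account is allowed to use.

```python
from typing import Optional


def create_table_sql(name: str, columns: str, location: Optional[str] = None) -> str:
    """Build a CREATE TABLE statement; a location makes the table external."""
    sql = f"CREATE TABLE {name} ({columns})"
    if location is not None:
        # With LOCATION, the data stays at the given path and you manage its lifecycle.
        sql += f" LOCATION '{location}'"
    return sql


# Managed table: Unity Catalog decides where the files live.
managed = create_table_sql("main.default.sales", "id INT, amount DOUBLE")

# External table: the path below is a made-up example.
external = create_table_sql(
    "main.default.sales_ext",
    "id INT, amount DOUBLE",
    location="abfss://container@account.dfs.core.windows.net/sales",
)

print(managed)
print(external)
```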
5. Views (Derived Data Access)
A view is a read-only object built from one or more tables and views within a metastore. It also resides in the third layer of the namespace.
Key Points:
Views can be created from tables and views across multiple schemas and catalogs.
Dynamic Views enable row- and column-level access control for granular data governance.
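A dynamic view typically combines a column mask and a row filter in one view definition. The sketch below assembles such a definition as a string in Python. It relies on the Databricks SQL function is_account_group_member; everything else (the function name redacted_view_sql, the view and table names, the columns order_id, email, and region, and the group auditors) is hypothetical.

```python
def redacted_view_sql(view: str, source: str, privileged_group: str) -> str:
    """Build a view that masks a column and filters rows for non-privileged users."""
    return (
        f"CREATE VIEW {view} AS\n"
        f"SELECT order_id,\n"
        # Column-level control: only the privileged group sees real email values.
        f"       CASE WHEN is_account_group_member('{privileged_group}')\n"
        f"            THEN email ELSE 'REDACTED' END AS email,\n"
        f"       region\n"
        f"FROM {source}\n"
        # Row-level control: everyone else only sees rows for the US region.
        f"WHERE is_account_group_member('{privileged_group}') OR region = 'US'"
    )


view_sql = redacted_view_sql(
    "main.default.orders_redacted", "main.default.orders", "auditors"
)
print(view_sql)
```

Users are then granted SELECT on the view rather than on the underlying table, so the filtering cannot be bypassed.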
6. Volumes (Non-Tabular Data Management)
Volumes reside at the third layer alongside tables and views, acting as siblings within a schema. Important: This feature is currently in Public Preview.
Purpose: Store and manage directories and files in various formats (non-tabular data) within Unity Catalog. Files in volumes cannot be registered as tables.
Permissions and Access Control:
To create a volume, users require:
CREATE VOLUME and USE SCHEMA permissions on the target schema.
USE CATALOG permission on the schema's parent catalog.
Reading files in a volume requires the READ VOLUME permission, and adding, removing, or modifying files requires the WRITE VOLUME permission, in both cases along with the USE SCHEMA and USE CATALOG permissions on the parent schema and catalog.
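Files in a volume are reached through a path that mirrors the three-level namespace. The Python sketch below builds such a path and looks up the volume privilege an action needs. The /Volumes/<catalog>/<schema>/<volume>/... convention comes from Databricks documentation rather than from this article, and the helper name volume_path, the mapping REQUIRED_VOLUME_PRIVILEGE, and the example names are hypothetical.

```python
import posixpath

# Which volume-level privilege each kind of file action needs.
REQUIRED_VOLUME_PRIVILEGE = {
    "read": "READ VOLUME",
    "write": "WRITE VOLUME",  # covers adding, removing, and modifying files
}


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build the path to a file or directory inside a Unity Catalog volume."""
    return posixpath.join("/Volumes", catalog, schema, volume, *parts)


# Hypothetical catalog, schema, volume, and file names.
path = volume_path("main", "default", "raw_files", "2024", "events.json")
print(path)
print(REQUIRED_VOLUME_PRIVILEGE["read"])
```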
Types of Volumes:
Managed Volumes
External Volumes
Managed Volumes: A convenient option for creating a governed location to work with non-tabular files.
Storage Locations: Similar to managed tables, storage typically occurs at the schema or catalog level, with managed storage location precedence rules applying.
Data Deletion: When a managed volume is deleted, the stored files are deleted within 30 days.
External Volumes: Provide access to existing non-tabular data in cloud storage without data migration. They are useful for integrating data produced by external systems or for data that requires direct file access from outside Azure Databricks.
Data Lifecycle: Unity Catalog doesn't manage the lifecycle or layout of files in external volumes.
7. Models (Machine Learning Integration)
A model is a registered machine learning model within the MLflow Model Registry. It also resides in the third layer of the namespace.
Permissions for Model Creation:
Users require CREATE MODEL privilege for the target catalog or schema.
Additionally, USE CATALOG privilege on the parent catalog and USE SCHEMA on the parent schema are necessary.
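The model-creation rule combines an "either/or" condition with two mandatory ones, which is easy to misread. The following Python sketch spells it out as a check over a set of (privilege, object) pairs. It is a deliberately simplified picture: the function can_create_model and the grant representation are hypothetical, and the check ignores ownership, admin roles, and privilege inheritance, all of which a real deployment would also consider.

```python
from typing import Set, Tuple

Grant = Tuple[str, str]  # (privilege, object name)


def can_create_model(grants: Set[Grant], catalog: str, schema: str) -> bool:
    """Check the privileges listed above for registering a model in a schema."""
    schema_name = f"{catalog}.{schema}"
    # CREATE MODEL may be granted on either the catalog or the schema.
    has_create = (
        ("CREATE MODEL", catalog) in grants
        or ("CREATE MODEL", schema_name) in grants
    )
    # USE CATALOG and USE SCHEMA are both required in addition.
    return (
        has_create
        and ("USE CATALOG", catalog) in grants
        and ("USE SCHEMA", schema_name) in grants
    )


# Hypothetical grants for one user.
grants = {
    ("USE CATALOG", "main"),
    ("USE SCHEMA", "main.ml"),
    ("CREATE MODEL", "main.ml"),
}
print(can_create_model(grants, "main", "ml"))
print(can_create_model(grants - {("USE CATALOG", "main")}, "main", "ml"))
```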
8. Managed Storage (Storage Options)
Managed tables and volumes can be stored at various levels in the Unity Catalog hierarchy: metastore, catalog, or schema. Storage defined at lower levels overrides higher levels.
Storage Locations and Recommendations:
Account admins can optionally assign a storage location for managed tables and volumes during metastore creation (metastore-level storage).
Databricks recommends assigning managed storage at the catalog level for better data isolation.
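The precedence rule ("lower levels override higher levels") amounts to picking the first configured location when walking from schema up to metastore. Here is a minimal Python sketch of that lookup. The function name resolve_managed_location and the abfss:// paths are hypothetical; only the ordering reflects the rule described above.

```python
from typing import Optional


def resolve_managed_location(
    schema_location: Optional[str],
    catalog_location: Optional[str],
    metastore_location: Optional[str],
) -> Optional[str]:
    """Return the most specific managed storage location that is configured."""
    # Check the lowest level first: schema, then catalog, then metastore.
    for location in (schema_location, catalog_location, metastore_location):
        if location is not None:
            return location
    return None  # No managed storage configured at any level.


# Made-up storage paths for illustration.
catalog_path = "abfss://catalog@account.dfs.core.windows.net/finance"
metastore_path = "abfss://root@account.dfs.core.windows.net/metastore"

# No schema-level location, so the catalog-level location is used.
print(resolve_managed_location(None, catalog_path, metastore_path))
```

Assigning storage at the catalog level, as recommended, means each catalog's managed data lands in its own location regardless of what the metastore defines.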
Conclusion
Unity Catalog simplifies managing your data on Databricks. It brings everything together in one place, so you can find what you need easily, track how data flows, and control who can access it. This makes data governance less of a headache and helps you get the most out of your data.