
Best Practices to Manage Apache Spark Libraries in Microsoft Fabric

Apache Spark is a powerful open-source framework for big data processing and analytics. When working with Spark in Microsoft Fabric, efficient library management becomes crucial. Properly handling libraries ensures smooth development workflows, minimizes duplication, and enhances overall code quality. This article will explore best practices for managing Spark libraries within the Fabric environment.



Library management in Microsoft Fabric covers how libraries are installed, versioned, and shared within Fabric environments, particularly when you work with Apache Spark. Libraries provide essential functionality beyond the core Spark components:

  1. They offer additional tools, algorithms, and utilities that enhance your data processing capabilities.

  2. Well-maintained libraries are optimized for execution speed, memory usage, and resource utilization.

  3. Libraries can be reused across projects, notebooks, and experiments, so you avoid redundant installations and ensure consistent usage.

  4. When team members work on shared projects, using the same libraries ensures reproducibility.

  5. Centralized library management prevents version conflicts that can arise when different workloads use different library versions.


Microsoft Fabric provides managed environments specifically designed for data engineering tasks.

  1. Role in Data Engineering: Microsoft Fabric offers Spark runtimes, Jupyter Notebooks, and other tools. These environments serve as isolated workspaces where data engineers can develop, experiment, and analyze data. Each Microsoft Fabric environment is isolated, allowing multiple users to work independently without interfering with each other’s libraries or configurations.

  2. Customization: Fabric environments can be customized with specific libraries. You can install the exact packages needed for your data engineering workflows. Whether it’s Python libraries, machine learning frameworks, or specialized tools, Microsoft Fabric lets you set up your environment exactly as needed.

  3. Scalability: Microsoft Fabric seamlessly scales to handle large datasets and complex computations. Whether processing terabytes of data or running intricate machine learning pipelines, Microsoft Fabric provides the necessary resources.


Types of Libraries

Below are three types of libraries you can manage in Microsoft Fabric:

  1. Built-in Libraries

  2. Public Libraries

  3. Custom Libraries

1. Built-in Libraries:

Built-in libraries come preinstalled with Microsoft Fabric Spark runtimes. These foundational libraries provide essential functionality out of the box. Here are some examples:

  • Python Packages:

      • NumPy: A powerful library for numerical computing, providing support for arrays, matrices, and mathematical functions.

      • Pandas: Used for data manipulation, analysis, and cleaning. It offers data structures like DataFrames.

      • Matplotlib: A popular plotting library for creating visualizations.

  • Java Libraries:

      • Apache Commons: A collection of reusable Java components for various tasks, including utilities, data structures, and algorithms.

      • Other Java libraries specific to Spark’s ecosystem.


These Microsoft Fabric built-in libraries are the building blocks for data engineering and analysis tasks. They streamline development by providing commonly used tools right from the start.
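As a quick illustration of what “preinstalled” means in practice, a new notebook cell can typically import these packages straight away, with no install step; the exact versions depend on the Fabric Spark runtime you are using:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Confirm which versions ship with the runtime
print(np.__version__, pd.__version__)

# Small end-to-end check: build a DataFrame and plot it with the built-in Matplotlib backend
df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5) ** 2})
df.plot(x="x", y="y")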


2. Public Libraries:

Public libraries are essential components that provide reusable code and functionality for your data engineering tasks. These libraries are sourced from external repositories, with two primary sources supported by Fabric:

  • PyPI (Python Package Index): A comprehensive repository of Python packages.

  • Conda: A package manager and environment management system for various programming languages.


Here are the steps to add public libraries to your Microsoft Fabric environment:


STEP 1: Select a Source:

Choose either PyPI or Conda as the source for your library. Specify the name of the library you want to install.


STEP 2: Specify the Version:

Provide the desired version of the library. Microsoft Fabric will fetch the specified version from the chosen repository.


STEP 3: Auto-Completion and Direct Search:

Fabric offers auto-completion for popular libraries during the addition process. If the library you need isn’t auto-completed, search for it directly by entering its full name. You’ll see available versions for valid library names.


STEP 4: Batch Upload Using YAML

Fabric supports batch uploads of public libraries using a YAML (.yml) file. Create a YAML file specifying multiple public libraries and their versions. Upload the file to your environment, and Microsoft Fabric will extract and add the libraries to the list.
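For illustration, a batch-upload file might look like the following conda-style environment file. The library names and versions here are placeholders, and the exact schema can vary, so the YAML exported in STEP 7 below is the safest template to start from:

dependencies:
  - pandas=2.0.3
  - pip:
      - altair==5.0.1
      - vega_datasets==0.9.0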


STEP 5: Filtering and Updating Public Libraries

Use the search box on the Public Libraries page to filter the list and find specific libraries. To update the version of an existing library:

  • Navigate to your environment.

  • Open the Public Libraries section.

  • Choose the library and update its version.


STEP 6: Deleting and Viewing Dependencies

Hover over a library row to reveal options:

  • Trash Icon: Delete the library.

  • View Dependency: Explore the library’s dependencies.


STEP 7: Exporting to YAML

Fabric allows you to export the full list of public libraries to a YAML file. Download this file to your local directory for reference.


3. Custom Libraries:

Custom libraries cater to specific needs and can be uploaded to your Microsoft Fabric environment. Common formats for custom libraries include:

  • Wheel (.whl): Python package format for distribution.

  • JAR (Java Archive): Used for Java libraries.

  • Tarballs (.tar.gz): Common for R language packages.


Custom libraries allow you to extend Fabric’s capabilities by adding domain-specific tools, proprietary algorithms, or project-specific code.
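If the custom code you want to upload is a standard Python package (for example, one with a pyproject.toml), a common way to produce the .whl file on your local machine is the generic build tool shown below; this is ordinary Python packaging, not a Fabric-specific command:

pip install build            # one-time: install the PEP 517 build front end
python -m build --wheel      # writes dist/<package>-<version>-py3-none-any.whl

The wheel that lands in the dist folder is what you then upload to your environment as a custom library.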



Best Practices to Manage Apache Spark Libraries in Microsoft Fabric


Best Practice 1: Setting Default Libraries

Setting default libraries in Microsoft Fabric is a best practice for efficient library management in Apache Spark environments.

  1. Consistency and Streamlined Development:

  • Default libraries ensure that all notebooks and Spark job definitions within the workspace start sessions with the same set of preinstalled libraries.

  • Developers don’t need to manually install required packages each time they create a new notebook or run a job.

  • This consistency streamlines development and reduces setup overhead.

  2. Avoiding Version Conflicts:

  • When multiple users collaborate in the same workspace, version conflicts can arise if everyone uses different library versions.

  • Setting default libraries ensures that everyone works with the same versions, minimizing compatibility issues.

  3. Workspace-Wide Dependencies:

  • Default libraries become part of the workspace settings.

  • All users benefit from these shared dependencies, making collaboration smoother.


As a workspace administrator, you have the authority to configure default libraries for your Fabric workspace. Here are the steps to create a new environment and set it as the default:


STEP 1: Access the Microsoft Fabric Portal

Navigate to the Microsoft Fabric portal. On the left panel, click on the “Workspace” option.


STEP 2: Creation of a New Environment

In the Microsoft Fabric Workspace section, locate and click the “+ New” button. A drop-down menu will appear. Select “Environment” from this list.


STEP 3: Name Your Environment

A pop-up box will appear. Type your desired name in the provided field and click “Create” to proceed.


STEP 4: Configure Your Environment

After naming your environment, you’ll be presented with the main options for configuring it, including Spark compute settings, public libraries, and custom libraries.



STEP 5: Install Python Libraries

Once the environment is created, add the Python packages you need in its Public Libraries section, choosing PyPI or Conda as the source as described earlier. For quick, session-scoped needs, you can also install packages inline from a notebook code cell:

  • For pip:

%pip install package_name

  • For conda:

%conda install package_name

Replace package_name with the actual name of the Python library you want to install.
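For example, to pin a specific version (the package name and version here are purely illustrative; note that pip uses == while conda uses a single =):

%pip install pandas==2.0.3
%conda install pandas=2.0.3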


STEP 6: Include Java Libraries

For Java libraries, upload the required JAR files to your environment as custom libraries, then configure your Spark jobs or notebooks to reference them as needed.


STEP 7: Attach the Environment as Workspace Default:

Click Workspace in the left panel. Click (...) and select the Workspace Settings option.


Under the Data Engineering/Science tab, select "Spark settings". Go to the Environment section, toggle on "Set default environment", and pick the environment you created.


This ensures that notebooks and Spark job definitions within the workspace start sessions with the libraries installed in the default environment.
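A simple way to confirm that the default environment is in effect is to open a fresh notebook in the workspace and check that one of its libraries resolves without any install step. In this sketch, altair stands in for whatever package you added to the default environment:

import importlib.metadata

# Raises PackageNotFoundError if the library is not available in the session
print(importlib.metadata.version("altair"))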


Best Practice 2: Persistent Library Specification

When you persist library specifications, you install the required libraries in a dedicated environment. This environment is then attached to specific code items (such as notebooks or Spark job definitions).


When you’re working within a notebook or defining a Spark job, you’ll find the Environment menu in the Home tab. This menu provides access to the environments you’ve set up in Fabric.


Within the Environment menu, you’ll see a list of available environments. Choose the specific Microsoft Fabric environment associated with your notebook or Spark job. Once you select an environment, its configuration becomes effective for your current session. This includes the Spark compute settings (cluster size, memory allocation, and cores) and the libraries configured within that environment. Libraries installed in the environment are now accessible for your Spark tasks.


Benefits:

  • Efficiency and Time-Saving: You can avoid repetitive installations by installing libraries in a dedicated environment and attaching them to code items. Once the libraries are set up, they remain effective across all Spark sessions where the environment is attached. Developers no longer need to run installation commands every time they create a new notebook or execute a Spark job. This streamlined approach saves valuable time and effort.

  • Granularity and Customization: Unlike workspace-level settings, this approach allows for finer granularity. You can attach the same environment to multiple code artifacts within a workspace. Attaching them to a common Microsoft Fabric environment ensures consistency for subsets of notebooks or Spark job definitions that require the same libraries. You can tailor environments to specific project needs.

  • Collaboration and Management: Administrators, members, or contributors of the workspace can create, edit, and manage these environments. This flexibility empowers collaboration. When teams work on shared projects, persistent library specifications ensure that everyone operates within the same library ecosystem, which simplifies collaboration and knowledge sharing.


Best Practice 3: Use Inline Installation Techniques

Inline installation refers to the practice of installing libraries directly within an interactive notebook (such as a Jupyter notebook) for one-time use. Unlike persistent installations, inline installations are effective only for the current notebook session and don’t persist across different sessions.


Inline installation lets you manage both Python and R libraries.


Inline Installation for Python Libraries

When you run an inline installation command, the Python interpreter restarts to apply the library change. Note that any variables defined before running the command cell are lost during this process. To maintain consistency, it’s recommended to place all library-related commands (additions, deletions, updates) at the beginning of your notebook.
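In practice, that means grouping every inline library command in the first code cell. The package names below are purely illustrative:

# First cell of the notebook: keep all library changes together,
# because the Python interpreter restarts after each of them
%pip install package_to_add             # addition
%pip install -U package_to_update       # update
%pip uninstall -y package_to_remove     # deletion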


By default, inline commands for managing Python libraries are disabled in notebook pipelines. However, if you want to enable %pip install for pipeline runs, you can add a parameter called _inlineInstallationEnabled with a value of True in the notebook activity parameters.


Suppose you want to use the powerful visualization library Altair for a one-time data exploration in your notebook. If Altair isn’t installed in your workspace, follow these steps:


Run the following commands in a notebook code cell:

%conda install altair          # Install the latest version of Altair through the conda command
%conda install vega_datasets   # Install vega_datasets (contains a semantic model for visualization)

The output indicates the result of the installation.


Import the package and semantic model in another notebook cell:

import altair as alt
from vega_datasets import data

Now you can explore Altair visualizations within your session.
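For instance, a quick chart over one of the bundled vega_datasets tables might look like this (the column names follow the cars semantic model):

cars = data.cars()   # load the sample cars semantic model as a DataFrame

alt.Chart(cars).mark_point().encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    color="Origin",
)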


Inline Installation for R Libraries

For R, inline installation simply means running install.packages() (or a similar command) in a notebook code cell. As an example, the inline package allows you to define R functions using inlined C, C++, or Fortran code; it supports the .C and .Call calling conventions. To install it, run the following command in an R notebook cell:

install.packages("inline")

Once installed, you can create custom R functions with inlined code.


Similarly, after installing the caesar library (for example, with install.packages("caesar") in a notebook cell), you can use it within your SparkR session. Here’s an example of defining a function called hello that applies the caesar cipher to a list of strings:

library(SparkR) 
sparkR.session() 

hello <- function(x) { 
	library(caesar)   # load the caesar package on the worker
	caesar(x)         # apply the Caesar cipher to the input string
} 

spark.lapply(c("hello world", "good morning", "good evening"), hello)

To include external JAR files in your Spark session, you can configure the spark.jars property with the %%configure magic command in a notebook cell (replace the placeholders with actual values):

%%configure -f 
{ 
	"conf": { 
		"spark.jars": "abfss://<<Lakehouse prefix>>.dfs.fabric.microsoft.com/<<path to JAR file>>/<<JAR file name>>.jar" 
	} 
}

Make sure to replace the placeholders with the correct ABFS path to your JAR file.


Benefits of Inline Installation

  1. Inline installation allows you to quickly add libraries when needed without affecting other notebooks or Spark sessions. It’s ideal for ad hoc tasks or exploratory data analysis where you require specific libraries temporarily.

  2. Libraries installed inline are scoped to the current session. They won’t interfere with other notebooks or workspaces. You can experiment freely without worrying about global effects.

  3. Since inline installations are session-specific, you avoid unnecessary overhead in environments where the library isn’t needed beyond the current task. This efficient resource usage benefits both development and runtime environments.


Conclusion

Efficiently managing libraries is essential for maintaining an organized and productive development environment. Properly handling libraries ensures access to necessary tools, functions, and dependencies, minimizing code duplication and improving overall quality. Microsoft Fabric supports this with workspace default environments, persistent library specifications attached to code items, inline installation for Python and R, and custom libraries including external JAR files.


Explore these capabilities to enhance your development workflow! 😊🚀

