How to use Azure OpenAI Embedding API for Document Search?

Finding the right document in a vast knowledge base matters. Whether you’re a researcher, a customer support agent, or an analyst, the ability to quickly retrieve relevant information can significantly impact productivity and decision-making.

Traditional keyword-based search approaches often fall short. They rely on exact matches or simple heuristics, which may miss contextually related documents. For example, searching for “climate change” might not retrieve documents mentioning “global warming,” though they share semantic similarities.

The Solution: Embedding in Azure OpenAI

Embeddings are a fundamental concept in natural language processing (NLP) and machine learning, and Azure OpenAI provides them through its embedding models. They represent words, phrases, or entire documents as dense vectors in a high-dimensional space.

Here’s how they work:

Word Embeddings:

Word embeddings map individual words to continuous vectors. Each word is represented by a fixed-length vector, capturing its semantic meaning.
These vectors are learned from large text corpora using techniques such as Word2Vec, GloVe, or FastText.
For example, consider the words “king” and “queen.” The vectors are close in a well-trained embedding space because they share a semantic relationship (royalty, gender, etc.).

Document Embeddings:

Document embeddings extend the idea of word embeddings to entire documents.
Instead of treating a document as a bag of words, we create a vector representation that captures the overall content.
Techniques like Doc2Vec (an implementation of Paragraph Vectors) or averaging word embeddings across a document achieve this.
The resulting document vectors encode the context, topics, and text semantics.
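
To make the averaging approach concrete, here is a minimal sketch in plain Python. The three-dimensional word vectors are invented for illustration (real embeddings are learned from data and are far longer), and average_embedding is a hypothetical helper name, not a library function.

```python
# Toy word vectors, invented for illustration only.
# Real word embeddings are learned from data and have many more dimensions.
WORD_VECTORS = {
    "tax": [0.9, 0.1, 0.0],
    "bill": [0.7, 0.3, 0.1],
    "forest": [0.0, 0.2, 0.9],
}


def average_embedding(text, word_vectors):
    """Average the vectors of the known words in `text`."""
    words = text.lower().split()
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        raise ValueError("no known words in text")
    dims = len(vectors[0])
    # Average each dimension across all word vectors found in the text.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


doc_vector = average_embedding("Tax bill", WORD_VECTORS)
print(doc_vector)
```

Averaging ignores word order, which is one reason to prefer a model that takes the whole text as input and returns a single vector for it.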

Why Use Embeddings for Document Search?

Traditional keyword-based search has limitations:

Exact Matches: Keyword-based search relies on exact matches. A relevant document may be missed if it doesn’t contain the query terms.
Context Ignored: Keywords don’t capture context or meaning. Synonyms, paraphrases, and related terms are overlooked.
Scalability: As the document collection grows, manual keyword-based indexing becomes impractical.

Embeddings address these challenges:

Semantic Similarity: By embedding documents, we transform them into vectors. Similar documents end up closer in this vector space, regardless of specific keywords.
Cosine Similarity: We measure the cosine similarity between query and document vectors. Closer vectors imply higher semantic relevance.
Contextual Understanding: Embeddings capture context, allowing us to find related documents even when the wording differs.
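
The ranking idea described above can be sketched in a few lines of plain Python. The vectors below are made up to stand in for real embeddings, and the document names and query are hypothetical; only the cosine similarity formula itself is standard.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real embeddings.
documents = {
    "climate change report": [0.9, 0.1, 0.2],
    "global warming study": [0.8, 0.2, 0.3],
    "tax reform bill": [0.1, 0.9, 0.1],
}
query_vector = [0.85, 0.15, 0.25]  # pretend embedding of a climate query

# Rank documents from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query_vector, documents[name]),
    reverse=True,
)
print(ranked)
```

With these toy values, the two climate documents score close to 1 and the tax document ranks last, even though none of the names share a keyword with the query.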

Here are some real-world scenarios where the document search approach is useful:

Legal Research: Lawyers and legal professionals can use it to search through vast legal databases, statutes, and case law to find relevant precedents, legal opinions, and related documents.
Academic Research: Researchers, scholars, and students can explore academic papers, journals, and conference proceedings to discover relevant studies, references, and scientific literature.
Healthcare and Medical Records: Healthcare providers can search patient records, medical literature, and clinical studies to identify relevant diagnoses, treatment options, and best practices.
Enterprise Knowledge Management: Organizations can build a knowledge base by indexing internal documents, manuals, and reports. Employees can then search for relevant information efficiently.
Customer Support and FAQs: Customer service teams can use document search to quickly find answers to common queries, reducing response time and improving customer satisfaction.
News and Media Analysis: Journalists, analysts, and media professionals can search news articles, blogs, and opinion pieces to track trends, verify facts, and gather insights.
E-Commerce and Product Recommendations: E-commerce platforms can enhance product recommendations by analyzing user queries and matching them with relevant product descriptions and reviews.
Semantic Search Engines: Building search engines that understand context and semantics allows users to find relevant content more accurately.

How to use Azure OpenAI Embedding API for Document Search

Prerequisites:

Azure subscription: Create one if needed.
Azure OpenAI resource with the text-embedding-ada-002 (Version 2) model deployed.
Python 3.8 or later installed.
Required Python libraries (openai, num2words, matplotlib, plotly, scipy, scikit-learn, pandas, tiktoken).
Jupyter Notebooks for interactive exploration.

STEP 1: Install the required Python libraries

Open the terminal or command prompt. Run the following command to install the required libraries.

pip install openai num2words matplotlib plotly scipy scikit-learn pandas tiktoken

STEP 2: Download the BillSum Dataset

BillSum contains United States Congressional and California state bills. You can download the bill_sum_data.csv from GitHub sample data.

You can also download the sample data by running the command below in the command prompt:

curl "https://raw.githubusercontent.com/Azure-Samples/Azure-OpenAI-Docs-Samples/main/Samples/Tutorials/Embeddings/data/bill_sum_data.csv" --output bill_sum_data.csv

STEP 3: Create a Resource

Go to the Azure Portal and click "+ Create a resource".

Search "Azure OpenAI" in the search bar.

Click "+ Create".

Now provide the basic details:

Subscription
Resource group
Region
Name
Pricing

Click "Next".

Click "Create" to create the resource.

STEP 4: Retrieve the API key and Endpoint

After creating the resource, click "Go to resource".

Click "Keys and Endpoint" under Resource Management in the left panel.

STEP 5: Set the Environment variable

On Windows, open the command prompt and set the environment variables for your API key and endpoint (setx is a Windows command).

setx AZURE_OPENAI_API_KEY "YOUR_API_KEY"
setx AZURE_OPENAI_ENDPOINT "YOUR_ENDPOINT"

Replace YOUR_API_KEY and YOUR_ENDPOINT in the commands above with the key and endpoint you copied in STEP 4.

Environment variables set with setx are stored for your user account and are visible to programs started afterwards, not to programs that are already running.

If Jupyter Notebooks or your preferred IDE was open when you ran the commands, close and reopen it so the environment variables take effect.
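
A quick way to confirm that Python can see the variables is to read them back and fail loudly if either is missing. The sketch below uses only the standard library; load_azure_openai_settings is a hypothetical helper name, and the variable names are the ones set in STEP 5.

```python
import os


def load_azure_openai_settings():
    """Read the API key and endpoint from the environment.

    Raises RuntimeError if either variable is missing or empty.
    """
    settings = {
        "AZURE_OPENAI_API_KEY": os.getenv("AZURE_OPENAI_API_KEY"),
        "AZURE_OPENAI_ENDPOINT": os.getenv("AZURE_OPENAI_ENDPOINT"),
    }
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
            + ". Set them, then restart your notebook or IDE."
        )
    return settings
```

Calling load_azure_openai_settings() at the top of a notebook surfaces a missing variable immediately instead of as an authentication error later.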

Jupyter Notebooks are a popular tool for interactive data analysis and coding. If you’re using Jupyter, you can directly execute code cells and see the output (including pandas DataFrames) without explicitly using print().

However, if you encounter issues (e.g., not seeing DataFrame output), use print(dataframe_name) instead of just calling dataframe_name.

STEP 6: Import the necessary libraries

import os
import re
import requests
import sys
from num2words import num2words
import pandas as pd
import numpy as np
import tiktoken
from openai import AzureOpenAI

os: Provides functions for interacting with the operating system, such as creating directories, accessing file paths, and environment variables.
re: Used for regular expression operations, allowing pattern matching and text manipulation.
requests: A library for making HTTP requests, commonly used for fetching data from web APIs.
sys: Provides access to system-specific parameters and functions.
pandas (pd): A powerful data analysis and manipulation library. It is used for creating and operating on DataFrames, which are tabular data structures similar to spreadsheets.
numpy (np): Provides support for large, multi-dimensional arrays and matrices with high-level mathematical functions to operate on these arrays efficiently.
tiktoken: A library designed to encode text into tokens, which are numerical representations used by language models.
num2words: Converts numbers to their word equivalent (e.g., 123 to "one hundred and twenty-three").
AzureOpenAI: The client class from the openai package used to call an Azure OpenAI resource.

STEP 7: Read the CSV File

This step reads the CSV file bill_sum_data.csv into a pandas DataFrame (df). It assumes that the CSV file is located in the directory where you’re running the code.

# This assumes that you have placed bill_sum_data.csv in the directory where you are running Jupyter Notebooks
df = pd.read_csv(os.path.join(os.getcwd(), 'bill_sum_data.csv'))

df

The above line reads the CSV file into a DataFrame called df. os.getcwd() gets the current working directory, and os.path.join() constructs the full path to the CSV file.

You can now explore the contents of the df DataFrame.
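
As a hedged sketch of where the tutorial's title leads next, the functions below show one way text could be sent to the embedding deployment. This assumes version 1.0 or later of the openai package, the environment variables from STEP 5, an API version of "2024-02-01", and a deployment named text-embedding-ada-002; your API version and deployment name may differ. make_client and generate_embedding are illustrative names, not part of the library.

```python
import os


def make_client(api_version="2024-02-01"):
    """Build an AzureOpenAI client from the STEP 5 environment variables.

    The default api_version is an assumption; use one your resource supports.
    """
    from openai import AzureOpenAI  # imported lazily; needs openai>=1.0

    return AzureOpenAI(
        api_key=os.getenv("AZURE_OPENAI_API_KEY"),
        api_version=api_version,
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    )


def generate_embedding(client, text, deployment="text-embedding-ada-002"):
    """Return the embedding vector for `text` as a list of floats.

    `deployment` is the name you gave your model deployment (assumed here
    to match the model name), not necessarily the model name itself.
    """
    response = client.embeddings.create(input=[text], model=deployment)
    return response.data[0].embedding
```

With a working resource, generate_embedding(make_client(), "some bill text") returns one vector per input text. Those vectors can then be compared with cosine similarity to rank documents against a query.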

Conclusion

Organizations can revolutionize their document search capabilities by effectively using the Azure OpenAI Embedding API. By transforming textual data into dense numerical representations, embeddings enable the discovery of semantically similar documents, going beyond traditional keyword-based search limitations.

Experimentation and fine-tuning are key to optimizing your document search system. Explore different embedding models, experiment with various hyperparameters, and continuously evaluate the performance of your solution. As the field of natural language processing continues to advance, so will the capabilities of Azure OpenAI Embedding API, offering even more sophisticated and effective document search solutions.
