The Unstructured open-source library does not offer built-in support for calling embedding providers to obtain embeddings for pieces of text.
Alternatively, the Unstructured Ingest CLI and the Unstructured Ingest Python library
offer built-in support for calling embedding providers as part of an ingest pipeline. Learn how.
Also, you can use common third-party tools and libraries to get embeddings for document elements’ text within JSON files that are
produced by calling the Unstructured open-source library. For example, the following sample Python script:
- Takes an Unstructured open-source library-generated JSON file as input.
- Reads in the JSON file’s contents as a JSON object.
- Uses the sentence-transformers/all-MiniLM-L6-v2
model on Hugging Face to generate embeddings for each
text
field of each document element in the JSON file.
- Adds the generated embeddings next to each corresponding
text
field in the original JSON.
- Saves the results back to the original JSON file.
# Filename: embeddings.py
# pip install langchain sentence_transformers
import sys
import json
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
if __name__ == "__main__":
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
# Get the JSON file's path.
if len(sys.argv) < 2:
print("Error: Specify the path to the input JSON file.")
print("For example, 'python embeddings.py myfile.json'")
sys.exit(1)
file_path = sys.argv[1]
try:
# Get the JSON file's contents.
with open(file_path, 'r') as file:
file_elements = json.load(file)
# Process each element in the JSON file.
for element in file_elements:
# Get the element's "text" field.
text = element["text"]
# Generate the embeddings for that "text" field.
query_result = embeddings.embed_query(text)
# Add the embeddings to that element as an "embeddings" field.
element["embeddings"] = query_result
# Save the updated JSON back into the original file.
with open(file_path, 'w') as file:
json.dump(file_elements, file, indent=2)
print(f"Done! Updated JSON saved to '{file_path}'.")
except FileNotFoundError:
print(f"Error: File '{file_path}' not found.")
except IOError:
print(f"Error: Unable to access file '{file_path}'.")
Additional resources
For information about how to use Python scripts to call various embedding providers, see for example: