Embedding

The process of transforming raw text (a sentence, paragraph, tweet, etc.) into a fixed-length vector of numbers that captures its semantic meaning
Embedding models are used for this purpose
Each model has a fixed embedding length (dimension), which determines the length of every vector it produces
Examples
- TF-IDF (Term Frequency-Inverse Document Frequency)
  - No semantic meaning encoded
- Word2Vec
  - Distance between vectors reflects semantic similarity
  - Static embeddings: each word gets one vector regardless of context
  - https://projector.tensorflow.org/ shows an interactive 3D visualization
- BERT (Bidirectional Encoder Representations from Transformers)
  - Transformer based, contextualized
  - Dynamic embeddings: the vector for a word depends on the surrounding text
- LLM based
  - Inherit world knowledge from the underlying LLM
  - Examples
    - embeddinggemma (based on Gemma)
    - qwen3-embedding (based on Qwen3)
Ref:
- https://huggingface.co/spaces/hesamation/primer-llm-embedding
- https://developers.openai.com/api/docs/guides/embeddings
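
To make the "No semantic meaning encoded" point about TF-IDF concrete, here is a minimal sketch in plain Python. The three-sentence corpus is made up, and the weighting uses one common variant (term frequency times log(N / df)); library implementations usually add smoothing and normalization, so treat this as an illustration rather than a reference implementation.

```python
import math
from collections import Counter

# Made-up corpus: documents 0 and 1 mean nearly the same thing but share no content words
docs = [
    "the car is fast",
    "the automobile is quick",
    "the car is red",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """One dimension per vocabulary term: tf * idf, with idf = log(N / df)."""
    counts = Counter(tokens)
    return [
        (counts[term] / len(tokens)) * math.log(len(docs) / df[term])
        for term in vocab
    ]

def dot(a, b):
    """Overlap between two vectors; zero when they share no weighted terms."""
    return sum(x * y for x, y in zip(a, b))

vectors = [tfidf_vector(tokens) for tokens in tokenized]

print(len(vocab), len(vectors[0]))   # vector length equals vocabulary size
print(dot(vectors[0], vectors[1]))   # car/fast vs automobile/quick
print(dot(vectors[0], vectors[2]))   # both documents mention "car"
```

Documents 0 and 1 are near-synonyms, yet their overlap is zero because TF-IDF only matches identical tokens. The vector length also grows with the vocabulary instead of being fixed by a model.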

from langchain_ollama import OllamaEmbeddings
 
embeddings = OllamaEmbeddings(
    model="embeddinggemma:latest",
)
 
# embed_query: embeds a single query string and returns one vector
result = embeddings.embed_query("Hello world")
 
print(len(result)) # equal to embedding length of the model
# print(result)
 
texts = [
    "Hello Sonam",
    "How are you?",
    "Where are you now"
]
 
# embed_documents: embeds a list of texts and returns one vector per text
embedded_docs = embeddings.embed_documents(texts)
 
print(len(embedded_docs)) # 3 (one vector per input text)

Semantic Relationship

Embeddings capture semantic relationships between words
In 2013, Google researchers observed that vector arithmetic on Word2Vec embeddings reflects these relationships: $\text{King} - \text{Man} + \text{Woman} \approx \text{Queen}$
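
The arithmetic can be tried out with a small sketch. The 3-dimensional vectors below are hand-made for illustration (roughly: royalty, gender, fruit) and are not real Word2Vec values; the extra words "prince" and "apple" are hypothetical distractors. The input words are excluded from the candidates, as analogy lookups commonly do.

```python
import math

# Hand-made toy vectors (assumption): [royalty, gender, fruit]
vectors = {
    "king":   [0.9, 0.8, 0.0],
    "queen":  [0.9, -0.8, 0.0],
    "man":    [0.1, 0.8, 0.0],
    "woman":  [0.1, -0.8, 0.0],
    "prince": [0.7, 0.8, 0.0],
    "apple":  [0.0, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# king - man + woman, computed component by component
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# Rank the remaining words by similarity to the target vector
inputs = {"king", "man", "woman"}
candidates = {word: vec for word, vec in vectors.items() if word not in inputs}
best = max(candidates, key=lambda word: cosine(target, candidates[word]))

print(best)
```

With these toy values the nearest remaining word is "queen". Real embeddings have hundreds of dimensions and the match is only approximate, hence the $\approx$.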

Document Loaders

https://docs.langchain.com/oss/python/integrations/document_loaders
https://reference.langchain.com/python/langchain-community/document_loaders
Document loaders provide a standard interface for reading data from different sources into LangChain’s Document format
Methods
- load() – Loads all documents at once.
- lazy_load() – Streams documents lazily, useful for large datasets.
Examples
- Unstructured
- GoogleDriveLoader
- YoutubeLoader

LangChain Document Format

Each Document contains
- metadata (dict): details about the content, e.g. its source
- page_content (str): the text itself
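
The loader interface and the Document format can be sketched together without any dependencies. Everything below is a stand-in written for these notes: `Document` is a plain dataclass with the two fields listed above, and `LineLoader` is a hypothetical loader, not a class that ships with LangChain. It only shows how `load()` and `lazy_load()` relate.

```python
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class LineLoader:
    """Hypothetical loader: each non-empty line of a string becomes one Document."""

    def __init__(self, text: str, source: str):
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[Document]:
        # Generator: documents are produced one at a time, on demand
        for line_no, line in enumerate(self.text.splitlines(), start=1):
            if line.strip():
                yield Document(
                    page_content=line,
                    metadata={"source": self.source, "line": line_no},
                )

    def load(self) -> list:
        # Eager variant: collect everything lazy_load() would yield
        return list(self.lazy_load())

loader = LineLoader("first line\n\nsecond line", source="example.txt")

docs = loader.load()                 # all documents in memory at once
first = next(loader.lazy_load())     # only the first document is built

print(len(docs))
print(first.page_content, first.metadata)
```

With a large source, iterating over `lazy_load()` keeps only one document in memory at a time, which is the reason to prefer it over `load()` for big datasets.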

Text Splitting

aka Chunking
Text splitters break large documents into smaller chunks that can be retrieved individually and fit within the model's context window
Types
- Text structure based
  - RecursiveCharacterTextSplitter
- Length based
  - TokenTextSplitter
  - CharacterTextSplitter
- Document structure based
  - Code Splitter
  - HTML Splitter
  - Markdown Splitter
  - JSON Splitter

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # maximum chunk size (characters)
    chunk_overlap=200,  # overlap between consecutive chunks (characters)
    length_function=len,  # function that measures the length of a chunk
    add_start_index=True,  # record each chunk's start index in the original document
    strip_whitespace=True,  # strip whitespace from the start/end of each chunk
)
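
To see what `chunk_size`, `chunk_overlap`, and a start index mean in practice, here is a simplified length-based sketch in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to cut at separators such as paragraph breaks); `split_with_overlap` is a hypothetical helper that just slides a fixed-size window over the text.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size character windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)

for chunk in chunks:
    print(chunk["start_index"], chunk["text"])
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. The cost is some duplicated text in the index.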

Full example

from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import CharacterTextSplitter
 
embeddings = OllamaEmbeddings(
    model="embeddinggemma:latest",
)
 
# Load Document
history_loader = TextLoader("./data/history.txt") # Indian History
history_doc = history_loader.load()
 
science_loader = TextLoader("./data/science.txt") # Science
science_doc = science_loader.load()
 
# Text Splitter (splits on blank lines by default; chunk_size is measured in characters)
text_splitter = CharacterTextSplitter(chunk_size=800, chunk_overlap=0)
history_documents = text_splitter.split_documents(history_doc)
science_documents = text_splitter.split_documents(science_doc)
 
all_documents = history_documents + science_documents
 
db = Chroma.from_documents(
    all_documents, embeddings,
    persist_directory='./data/chroma_db'
)
 
# Return the k chunks most similar to the query
# similar_docs = db.similarity_search("Modern", k=2)
# similar_docs = db.similarity_search("Sanskrit", k=2)
similar_docs = db.similarity_search("Artificial Intelligence", k=2)
 
print(similar_docs)
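
Conceptually, the search step compares the query vector against every stored chunk vector and returns the closest ones. The sketch below does this by brute force in plain Python. The sample sentences and 3-dimensional vectors are made up (they are not the contents of the data files above), `top_k` is a hypothetical helper, and cosine similarity is an assumption chosen for illustration, not a statement about the distance metric Chroma is configured with.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical store of (chunk text, vector) pairs; vectors are hand-made stand-ins
store = [
    ("The Mughal Empire ruled much of India.", [0.9, 0.1, 0.0]),
    ("Sanskrit is an ancient language of India.", [0.8, 0.0, 0.2]),
    ("Neural networks learn patterns from data.", [0.0, 0.9, 0.3]),
    ("Machine learning is a branch of AI.", [0.1, 0.95, 0.2]),
]

def top_k(query_vector, k):
    """Score every stored chunk against the query and keep the k best."""
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Stand-in for the vector an embedding model would return for the query text
query_vector = [0.05, 1.0, 0.25]

results = top_k(query_vector, k=2)
print(results)
```

Scanning every vector is fine for a handful of chunks; vector stores exist because this stops scaling once the collection gets large.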
