Embedding

  • The process of transforming raw text (a sentence, paragraph, tweet, etc.) into a fixed-length vector of numbers that captures its semantic meaning
  • Embedding models are used for this purpose
  • Each model has a fixed embedding dimension, which determines the length of the output vector
  • Examples
    • TF-IDF (Term Frequency-Inverse Document Frequency)
      • No semantic meaning encoded
    • Word2Vec
    • BERT (Bidirectional Encoder Representations from Transformers)
      • Transformer-based, contextualized
      • Dynamic embeddings: the same word gets a different vector depending on its surrounding context
    • LLM based
      • They inherit world knowledge from the underlying LLM
      • Examples
        • embeddinggemma (from Gemma)
        • qwen3-embedding (from Qwen3)
  • Ref:
from langchain_ollama import OllamaEmbeddings
 
embeddings = OllamaEmbeddings(
    model="embeddinggemma:latest",
)
 
# used for query text
result = embeddings.embed_query("Hello world")
 
print(len(result)) # equal to embedding length of the model
# print(result)
 
texts = [
    "Hello Sonam",
    "How are you?",
    "Where are you now"
]
 
# used for list of documents
embedded_docs = embeddings.embed_documents(texts)
 
print(len(embedded_docs)) # 3
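
Once text is embedded, vectors are usually compared with a similarity measure. The sketch below shows cosine similarity, one common choice, in plain Python. The three-dimensional vectors are made up for illustration and stand in for real model output; `cosine_similarity` is a helper written here, not part of LangChain.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors (assumption: real embeddings have hundreds of dimensions)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
# Index of the document vector most similar to the query
best_index = max(range(len(scores)), key=lambda i: scores[i])
print(scores, best_index)
```

With real embeddings, `query_vec` would come from `embed_query` and `doc_vecs` from `embed_documents`.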

Semantic Relationship

  • Embeddings hold semantic relationships
  • In 2013, Google researchers observed that Word2Vec embeddings encode relationships as vector offsets, e.g. king - man + woman ≈ queen
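
The idea of relationships as vector offsets can be sketched with hand-made vectors. Everything below is an assumption for illustration: the two-dimensional values are invented (axis 0 loosely "royalty", axis 1 loosely "gender") and are not real Word2Vec output, and `analogy` is a hypothetical helper.

```python
# Hypothetical 2-d vectors, NOT real Word2Vec values
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]

    def squared_distance(v):
        return sum((p - q) ** 2 for p, q in zip(v, target))

    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: squared_distance(vectors[w]))


answer = analogy("king", "man", "woman", toy_vectors)
print(answer)
```

Real embeddings only satisfy such analogies approximately, which is why the nearest remaining word is taken rather than an exact match.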

Document Loaders

LangChain Document Format

  • Contains
    • metadata (dict)
    • page_content (str)
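
To make the shape concrete, here is a stand-in with the same two fields. `SimpleDocument` is a hypothetical class written for this note, not the LangChain one (the real class is `Document` in `langchain_core.documents`); the text and the `source` key are example values.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two fields listed above."""

    page_content: str  # the text itself
    metadata: dict = field(default_factory=dict)  # arbitrary info about the text


doc = SimpleDocument(
    page_content="Example text taken from one loaded file.",
    metadata={"source": "./data/history.txt"},  # example value
)
print(doc.page_content)
print(doc.metadata["source"])
```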

Text Splitting

  • aka Chunking
  • Text splitters break large documents into smaller chunks that can be retrieved individually and that fit within the model's context window
  • Types
    • Text structure based
      • RecursiveCharacterTextSplitter
    • Length based
      • TokenTextSplitter
      • CharacterTextSplitter
    • Document structure based
      • Code Splitter
      • HTML Splitter
      • Markdown Splitter
      • JSON Splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # maximum chunk size (characters)
    chunk_overlap=200,  # overlap between consecutive chunks (characters)
    length_function=len,  # function that measures the length of a chunk
    add_start_index=True,  # record each chunk's start index in the original document
    strip_whitespace=True,  # strip whitespace from the start/end of each chunk
)
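
To see how chunk size and overlap interact, the following is a simplified fixed-window sketch in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and lines); `split_with_overlap` is a hypothetical helper and the sample string is made up.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; each window starts
    chunk_size - chunk_overlap characters after the previous one.
    Returns a list of (start_index, chunk) pairs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for start, chunk in chunks:
    print(start, chunk)
```

Each chunk repeats the last 4 characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.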

Full example

from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import CharacterTextSplitter
 
embeddings = OllamaEmbeddings(
    model="embeddinggemma:latest",
)
 
# Load Document
history_loader = TextLoader("./data/history.txt") # Indian History
history_doc = history_loader.load()
 
science_loader = TextLoader("./data/science.txt") # Science
science_doc = science_loader.load()
 
# Text Splitter
text_splitter = CharacterTextSplitter(chunk_size=800, chunk_overlap=0)
history_documents = text_splitter.split_documents(history_doc)
science_documents = text_splitter.split_documents(science_doc)
 
all_documents = history_documents + science_documents
 
db = Chroma.from_documents(
    all_documents, embeddings,
    persist_directory='./data/chroma_db'
)
 
# Return the k chunks most similar to the query text
# similar_docs = db.similarity_search("Modern", k=2)
# similar_docs = db.similarity_search("Sanskrit", k=2)
similar_docs = db.similarity_search("Artificial Intelligence", k=2)
 
print(similar_docs)