TL;DR
- Install LlamaIndex in 2 minutes with `pip install llama-index` (Python) or `npm install llamaindex` (TypeScript) (Full Guide).
- Load documents from 200+ sources (PDFs, SQL, Notion) and parse them into structured nodes (Data Loaders).
- Build production-ready RAG pipelines with hybrid search (vector + keyword) and reranking (RAG Guide).
- Deploy with observability (Arize, LangSmith) and scale with LlamaCloud (Sign-up).
- Gotchas: large indexes need 16GB+ RAM; multi-modal RAG requires paid APIs (Multi-Modal Costs).
1. Installation and Quickstart
Python (LlamaIndex.Py)
```bash
# Install core package (minimal dependencies)
pip install llama-index-core

# Install full suite (includes multi-modal, agents, etc.)
pip install llama-index

# Verify installation
python -c "from llama_index.core import VectorStoreIndex; print('LlamaIndex v0.10.28 ready')"
# Expected output: LlamaIndex v0.10.28 ready
```
Version verified from GitHub Releases.
Gotchas:
- If you see `ImportError: No module named 'llama_index'`, ensure you're using Python 3.9+ and a virtual environment.
- For GPU acceleration, install PyTorch first: `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`.
TypeScript (LlamaIndex.TS)
```bash
npm install llamaindex
# or
yarn add llamaindex

# Verify installation
node -e "const { VectorStoreIndex } = require('llamaindex'); console.log('LlamaIndex.TS v0.3.12 ready')"
# Expected output: LlamaIndex.TS v0.3.12 ready
```
TypeScript support documented in TS Guide.
2. Document Loading and Parsing
Load a PDF and Parse into Nodes
```python
from llama_index.core import SimpleDirectoryReader

# Load documents from a directory (supports PDF, DOCX, CSV, etc.)
documents = SimpleDirectoryReader("data/").load_data()
print(f"Loaded {len(documents)} documents")
# Expected output: Loaded 3 documents

# Parse into nodes (chunks with metadata)
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(documents)
print(f"Created {len(nodes)} nodes")
# Expected output: Created 42 nodes
```
Documentation for data loaders available in Data Loaders.
Key Features:
- 200+ data connectors: Load from Notion, Slack, SQL, and more Data Loaders.
- LlamaParse: Paid API for parsing complex PDFs (tables, multi-column layouts). Free tier available (LlamaParse).

```python
from llama_parse import LlamaParse

parser = LlamaParse(api_key="llx-...", result_type="markdown")
documents = parser.load_data("data/report.pdf")
```
Gotchas:
- Large PDFs (>100 pages) may time out with `SimpleDirectoryReader`. Use `LlamaParse` for better results (LlamaParse).
- For SQL databases, use `SQLTableRetrieverQueryEngine` to auto-generate queries from natural language.
3. Index Types
Vector Index (Default for RAG)
```python
from llama_index.core import VectorStoreIndex

# Create a vector index (uses OpenAI embeddings by default)
index = VectorStoreIndex.from_documents(documents)

# Persist to disk
index.storage_context.persist("storage/")
```
Expected Output:
```
INFO:llama_index.core.storage.storage_context:Saved VectorStoreIndex to storage/
```
Keyword Index (Lexical Search)
```python
from llama_index.core import KeywordTableIndex

keyword_index = KeywordTableIndex.from_documents(documents)
```
Tree Index (Hierarchical Summarization)
```python
from llama_index.core import TreeIndex

tree_index = TreeIndex.from_documents(documents)

# Query with a child-branch traversal
query_engine = tree_index.as_query_engine(child_branch_factor=2)
response = query_engine.query("Summarize the key points")
print(response)
```
When to Use Which:
| Index Type | Use Case | Pros | Cons |
|---|---|---|---|
| Vector | Semantic search, RAG. | High accuracy, supports hybrid search. | Slower for large datasets. |
| Keyword | Lexical search (e.g., exact matches). | Fast, no embeddings needed. | No semantic understanding. |
| Tree | Hierarchical data (e.g., legal docs). | Preserves structure. | Complex queries. |
Gotchas:
- Vector indexes require an embedding model (default: OpenAI's `text-embedding-3-small`). For local embeddings, use `HuggingFaceEmbedding`:

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
```
4. Query Engine Setup
Basic Query Engine
```python
query_engine = index.as_query_engine()
response = query_engine.query("What are the risks of AI in 2026?")
print(response)
```
Expected Output:
```
The risks of AI in 2026 include:
1. Job displacement in creative industries.
2. Increased misinformation via deepfakes.
3. Regulatory gaps in multi-modal models.
Source: data/report.pdf (page 42)
```
Hybrid Search (Vector + Keyword)
```python
from llama_index.core import QueryBundle
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import BaseRetriever

class HybridRetriever(BaseRetriever):
    def __init__(self, vector_index, keyword_index):
        self.vector_retriever = vector_index.as_retriever()
        self.keyword_retriever = keyword_index.as_retriever()
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle):
        # Naive union; deduplicate by node ID before production use
        vector_nodes = self.vector_retriever.retrieve(query_bundle)
        keyword_nodes = self.keyword_retriever.retrieve(query_bundle)
        return vector_nodes + keyword_nodes

retriever = HybridRetriever(index, keyword_index)
query_engine = RetrieverQueryEngine.from_args(retriever)
```
Advanced retrieval techniques documented in RAG Guide.
Gotchas:
- Hybrid search adds ~200-500ms latency. Use `similarity_top_k=2` to limit results.
- For production, add a reranker (e.g., `CohereRerank`):

```python
from llama_index.postprocessor.cohere_rerank import CohereRerank

reranker = CohereRerank(api_key="...", top_n=3)
query_engine = index.as_query_engine(node_postprocessors=[reranker])
```
5. Custom Retrievers
Build a Time-Based Retriever
```python
from llama_index.core import QueryBundle
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore
from typing import List

class TimeBasedRetriever(BaseRetriever):
    def __init__(self, index, time_field="date"):
        self.index = index
        self.time_field = time_field
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        # Filter nodes by time (e.g., "documents from 2025")
        nodes = list(self.index.docstore.docs.values())
        filtered_nodes = [
            node for node in nodes
            if node.metadata.get(self.time_field, "").startswith("2025")
        ]
        return [NodeWithScore(node=node, score=1.0) for node in filtered_nodes]

retriever = TimeBasedRetriever(index)
query_engine = RetrieverQueryEngine.from_args(retriever)
```
Use Cases:
- ASSESS (AI Security Posture Framework™): Retrieve logs from a specific time window to evaluate exposure.
- COMPLY: Filter documents by compliance tags (e.g., "GDPR", "HIPAA").
6. Evaluation and Metrics
Run a RAG Evaluation
```python
from llama_index.core.evaluation import (
    RetrieverEvaluator,
    generate_question_context_pairs,
)

# Generate synthetic Q&A pairs from the parsed nodes (requires an LLM)
qa_dataset = generate_question_context_pairs(nodes, num_questions_per_chunk=2)

# Evaluate retriever
retriever = index.as_retriever(similarity_top_k=2)
evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)
# Run inside an async context (e.g., asyncio.run(...) in a script)
eval_results = await evaluator.aevaluate_dataset(qa_dataset)
print(eval_results)
```
Expected Output:
```
{'mrr': 0.85, 'hit_rate': 0.92}
```
Key Metrics:
| Metric | Description | Target Value |
|---|---|---|
| MRR | Mean Reciprocal Rank. | >0.8 |
| Hit Rate | % of queries with relevant results. | >0.9 |
| Faithfulness | % of responses grounded in context. | >0.95 |
Gotchas:
- Synthetic Q&A generation requires an LLM (default: OpenAI). For local evaluation, use `LlamaCPP`:

```python
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(model_path="models/llama-2-7b.Q4_K_M.gguf")
qa_dataset = generate_question_context_pairs(nodes, llm=llm)
```
7. Production Deployment Tips
Deploy with FastAPI
```python
from fastapi import FastAPI
from llama_index.core import VectorStoreIndex
from pydantic import BaseModel

app = FastAPI()
index = VectorStoreIndex.from_documents(documents)  # `documents` loaded as in Section 2

class QueryRequest(BaseModel):
    query: str

@app.post("/query")
async def query_index(request: QueryRequest):
    query_engine = index.as_query_engine()
    response = query_engine.query(request.query)
    return {"response": str(response)}

# Run with: uvicorn app:app --reload
```
Expected Output (API):
```json
{
  "response": "The risks of AI in 2026 include job displacement and misinformation."
}
```
Observability with Arize
```python
from llama_index.core.callbacks import CallbackManager
from llama_index.callbacks.arize_ai import ArizeCallbackHandler

arize_callback = ArizeCallbackHandler(
    api_key="...",
    space_key="...",
)
callback_manager = CallbackManager([arize_callback])
index = VectorStoreIndex.from_documents(
    documents, callback_manager=callback_manager
)
```
Observability features documented in Observability.
Scale with LlamaCloud
```python
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

index = LlamaCloudIndex.from_documents(
    documents,
    name="my-production-index",
    project_name="my-project",
    api_key="llx-...",
)
```
LlamaCloud documentation available at Sign-up.
Gotchas:
- LlamaCloud indexes are eventually consistent (updates may take ~1 minute).
- For multi-modal RAG, use `LlamaCloudMultiModalIndex` (Multi-Modal).
Alternatives at a Glance
| Tool | Best For | Key Limitation |
|---|---|---|
| LlamaIndex | Enterprise RAG, multi-modal apps. | TS version less mature (TS Roadmap). |
