
How I Reduced AI Token Costs by 91% with Semantic Tool Selection and Redis

Tags: AI Cost Optimization, LLM Tooling, Semantic Search, Vector Databases, Redis, Embeddings

Building Semantic Tool Selection with Multi-Component Embeddings

January 20, 2026
Subham Kundu
Last quarter, our enterprise AI platform hit a wall. We had built an impressive suite of 70+ automated tools covering everything from database operations to cloud deployments, but our LLM costs were spiralling out of control. Every user query was sending all 70 tool definitions to the model, consuming over 8,000 tokens per request—even when users only needed 2-3 relevant tools.
The problem wasn’t just cost. Response times were suffering, and users were getting frustrated with irrelevant tool suggestions. Our traditional approach of keyword matching and category filtering was failing to understand user intent. “Deploy my app” would return database backup tools alongside deployment tools, forcing users to sift through noise.
This is the story of how I built a semantic tool selection system that reduced our token consumption by 91% while improving accuracy and user experience. The solution uses Redis as a vector database with intelligent multi-component embeddings to understand both what tools do and how they’re used.

Why Traditional Tool Selection Fails in Enterprise Environments

Enterprise tool ecosystems present unique challenges that simple matching algorithms can’t solve:

Vocabulary Mismatch

Users describe tasks in natural language while tools have technical names. A user asking to “send notification to engineering team” might need our Slack integration, but keyword matching would miss this connection because the tool is named “send_slack_message” and contains technical parameter names like “channel_id” and “webhook_url”.

Contextual Understanding

The same query can mean different things depending on context. “Create backup” could refer to database backups, file system backups, or cloud resource backups. Traditional systems can’t distinguish between these contexts without manual categorization.

Scalability Issues

As our tool library grew from 20 to 70+ tools, the combinatorial explosion of possible tool combinations made manual categorization impossible. Each new tool required updating multiple rule sets and taxonomies.

The Token Cost Crisis

Every irrelevant tool sent to the LLM directly impacts our bottom line. At enterprise scale, with thousands of daily queries, this waste translates to millions in unnecessary API costs.

The Semantic Solution: Architecture Overview

I designed a multi-layered semantic understanding system that analyzes tools from four different perspectives:
Tool Description: What the tool does (50% weight)
Parameters: What inputs the tool accepts (25% weight)
Usage Examples: How the tool is typically used (15% weight)
Return Types: What the tool produces (10% weight)
Each component gets its own vector embedding using OpenAI’s text-embedding-3-small model, creating a rich semantic fingerprint that captures both functionality and usage patterns.
The system architecture follows a clean separation of concerns across embedding generation, vector storage, and semantic search with relevance ranking.


Deep Dive: The Architecture That Makes It Work

Multi-Component Embedding Strategy

The key innovation is decomposing each tool into semantic components rather than embedding the entire tool definition as one blob. For our Slack messaging tool:

tool_embedding = {
    "description": "Send messages to Slack channels using webhooks",
    "parameters": [
        {"name": "channel_id", "description": "Target Slack channel identifier"},
        {"name": "message", "description": "Message content to send"},
        {"name": "webhook_url", "description": "Slack webhook endpoint"},
    ],
    "examples": [
        "send_slack_message(channel_id='#engineering', message='Deployment complete')",
        "Notify team about critical system updates via Slack",
    ],
    "returns": "Message delivery confirmation with timestamp",
}
Each component gets embedded separately, allowing the system to match queries like “notify engineering team” to the parameter descriptions and examples, even if the main description doesn’t contain those exact words.

Intelligent Scoring Algorithm

The relevance scoring combines semantic similarity with practical factors:
def calculate_relevance_score(query_embedding, tool_embeddings):
    scores = {}

    # Weighted semantic matching across the four components
    scores['tool'] = cosine_similarity(query_embedding, tool_embeddings.description) * 0.50
    scores['parameters'] = max(
        cosine_similarity(query_embedding, param)
        for param in tool_embeddings.parameters
    ) * 0.25
    scores['examples'] = max(
        cosine_similarity(query_embedding, example)
        for example in tool_embeddings.examples
    ) * 0.15
    scores['returns'] = cosine_similarity(query_embedding, tool_embeddings.returns) * 0.10

    # Boost popular tools
    popularity_boost = calculate_popularity_boost(tool_embeddings.tool_id)

    # Apply confidence thresholding
    final_score = sum(scores.values()) + popularity_boost
    return final_score if final_score >= MIN_CONFIDENCE else 0

Adaptive Tool Selection

Rather than returning a fixed number of tools, the system uses adaptive selection based on score distribution:
def adaptive_selection(scored_results, max_tools=5):
    scores = [result.score for result in scored_results]

    # Find natural score drop-off points
    score_drops = []
    for i in range(1, len(scores)):
        drop_percentage = (scores[i - 1] - scores[i]) / scores[i - 1]
        if drop_percentage > 0.3:  # 30% drop threshold
            score_drops.append(i)

    # Select tools until a significant score drop or the max limit
    if score_drops:
        selected_count = min(score_drops[0], max_tools)
    else:
        selected_count = max_tools

    return scored_results[:selected_count]
This prevents the system from returning marginally relevant tools just to fill a quota.
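To make the drop-off behaviour concrete, here is an illustrative call with made-up scores (SimpleNamespace stands in for whatever scored-result objects the ranker produces); the first three tool names come from the deployment example later in the article, the rest are hypothetical.

from types import SimpleNamespace

# Made-up scores: the sharp 0.91 -> 0.45 drop (>30%) after the third tool
# cuts the selection to three, even though max_tools would allow five.
results = [SimpleNamespace(tool_id=t, score=s) for t, s in [
    ("deploy_application", 0.95),
    ("create_docker_image", 0.93),
    ("launch_aws_ec2_instance", 0.91),
    ("backup_database", 0.45),
    ("rotate_logs", 0.40),
]]

selected = adaptive_selection(results, max_tools=5)
print([r.tool_id for r in selected])
# ['deploy_application', 'create_docker_image', 'launch_aws_ec2_instance']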

Why Redis Beats Postgres, Neo4j, and Specialized Vector DBs

When choosing our vector database, I evaluated four options extensively:

PostgreSQL with pgvector

Pros: Familiar SQL interface, ACID compliance
Cons: Vector search is an add-on, not optimized for large-scale similarity search, indexing limitations for high-dimensional vectors

Neo4j

Pros: Excellent for relationship queries, graph-based tool dependencies
Cons: Vector search requires custom implementations, higher complexity for simple similarity queries, steeper learning curve

Pinecone/Weaviate (Specialized Vector DBs)

Pros: Purpose-built for vector search, excellent performance
Cons: Additional infrastructure to manage, higher costs at scale, vendor lock-in, limited non-vector query capabilities

Redis Stack (Our Choice)

Pros: Native vector search with HNSW indexing, sub-millisecond latency, built-in caching, familiar to most teams, excellent scalability, cost-effective
Cons: Less feature-rich for complex graph queries
The decisive factor was Redis’s multi-model capability. We needed both vector search AND traditional key-value storage for tool metadata, caching, and session management. Redis handles both seamlessly without requiring multiple database systems.
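To illustrate that multi-model pattern, here is a minimal sketch (assuming Redis Stack and redis-py; the key names and values are illustrative, not our production schema) showing one instance serving all three workloads:

import json
import numpy as np
from redis import Redis

r = Redis(host="localhost", port=6379)

# 1. Plain hash storage for tool metadata.
r.hset("tool:meta:send_slack_message", mapping={
    "name": "send_slack_message",
    "category": "notifications",
    "usage_count": 42,
})

# 2. The same instance holds the vector components that the search index
#    (created once over the "tool:embedding:" prefix) scans.
embedding = np.random.rand(1536).astype(np.float32)  # placeholder vector
r.hset("tool:embedding:send_slack_message:description", mapping={
    "tool_id": "send_slack_message",
    "component_type": "description",
    "embedding": embedding.tobytes(),
})

# 3. And it doubles as the query cache, with a TTL.
r.setex("cache:query:deploy-my-app", 3600, json.dumps(["deploy_application"]))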

Performance Comparison


Redis delivered the best combination of performance, cost, and operational simplicity for our use case.

Code Walkthrough: Key Implementation Patterns

1. Vector Store Operations

import numpy as np
from typing import List

from redis.asyncio import Redis
from redis.commands.search.field import NumericField, TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query


class VectorStore:
    def __init__(self, redis_client: Redis):
        self.redis = redis_client
        self.index_name = "tool_embeddings"
        self.vector_field = "embedding"

    async def create_hnsw_index(self):
        """Create an HNSW index with parameters suited to our use case."""
        await self.redis.ft(self.index_name).create_index(
            fields=[
                VectorField(self.vector_field, "HNSW", {
                    "TYPE": "FLOAT32",
                    "DIM": 1536,
                    "DISTANCE_METRIC": "COSINE",
                    "INITIAL_CAP": 1000,
                    # HNSW construction parameters (Redis defaults)
                    "M": 16,
                    "EF_CONSTRUCTION": 200,
                }),
                TextField("tool_id"),
                TextField("component_type"),
                TextField("category"),
                NumericField("usage_count"),
            ],
            definition=IndexDefinition(
                prefix=["tool:embedding:"],
                index_type=IndexType.HASH,
            ),
        )

    async def search_similar(self, query_embedding: List[float], k: int = 10,
                             component_type: str = None,
                             category: str = None) -> List[dict]:
        """Perform vector similarity search with optional attribute filtering."""
        # Build the pre-filter expression ("*" means no filter)
        filters = []
        if component_type:
            filters.append(f"@component_type:{component_type}")
        if category:
            filters.append(f"@category:{category}")
        filter_expr = " ".join(filters) if filters else "*"

        query = (
            Query(f"{filter_expr}=>[KNN {k} @{self.vector_field} $query_vector AS score]")
            .return_fields("tool_id", "component_type", "category", "score")
            .sort_by("score")
            .dialect(2)
        )
        query_params = {
            "query_vector": np.array(query_embedding).astype(np.float32).tobytes()
        }

        results = await self.redis.ft(self.index_name).search(query, query_params)
        return [doc.__dict__ for doc in results.docs]
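The write path that populates the tool:embedding: prefix isn't shown above; a minimal sketch of what storing one component's vector might look like, with the method name and key layout assumed to be consistent with the index definition:

import numpy as np
from typing import List

async def store_component_embedding(self, tool_id: str, component_type: str,
                                     category: str, embedding: List[float],
                                     usage_count: int = 0) -> None:
    """Hypothetical write path: store one component's vector as a hash
    under the prefix scanned by the index above."""
    key = f"tool:embedding:{tool_id}:{component_type}"
    await self.redis.hset(key, mapping={
        "tool_id": tool_id,
        "component_type": component_type,
        "category": category,
        "usage_count": usage_count,
        "embedding": np.array(embedding, dtype=np.float32).tobytes(),
    })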

2. Embedding Generation with Batching

import asyncio
import logging
from typing import List

from openai import AsyncOpenAI, RateLimitError

logger = logging.getLogger(__name__)


class EmbeddingGenerator:
    def __init__(self, openai_client: AsyncOpenAI):
        self.client = openai_client
        self.batch_size = 100  # Optimize for OpenAI's rate limits

    async def generate_tool_embeddings(self, tool: Tool) -> ToolEmbedding:
        """Generate embeddings for all tool components."""
        # Collect all text components for batch processing
        texts_to_embed = [
            tool.description,
            *[f"{param.name}: {param.description}" for param in tool.parameters],
            *[example.description for example in tool.examples],
        ]
        if tool.return_type:
            texts_to_embed.append(tool.return_type.description)

        # Batch-generate embeddings for efficiency
        embeddings = await self.batch_generate_embeddings(texts_to_embed)

        # Map embeddings back to components
        param_end = 1 + len(tool.parameters)
        example_end = param_end + len(tool.examples)
        return ToolEmbedding(
            tool_id=tool.id,
            description_embedding=embeddings[0],
            parameter_embeddings=embeddings[1:param_end],
            example_embeddings=embeddings[param_end:example_end],
            return_embedding=embeddings[example_end] if tool.return_type else None,
        )

    async def batch_generate_embeddings(self, texts: List[str]) -> List[List[float]]:
        """Efficient batch embedding generation with retry logic."""
        embeddings = []
        for i in range(0, len(texts), self.batch_size):
            batch = texts[i:i + self.batch_size]
            for attempt in range(3):  # Retry with exponential backoff
                try:
                    response = await self.client.embeddings.create(
                        model="text-embedding-3-small",
                        input=batch,
                    )
                    embeddings.extend(data.embedding for data in response.data)
                    break
                except RateLimitError:
                    await asyncio.sleep(2 ** attempt)
                    continue
                except Exception as e:
                    logger.error(f"Embedding generation failed: {e}")
                    raise
        return embeddings
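The search path in the next section calls generate_query_embedding, which isn't shown in the article; a minimal sketch of it, assuming it lives on the same EmbeddingGenerator and reuses the same model:

from typing import List

async def generate_query_embedding(self, query: str) -> List[float]:
    """Embed a single user query with the same model used for tool components."""
    response = await self.client.embeddings.create(
        model="text-embedding-3-small",
        input=[query],
    )
    return response.data[0].embedding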

3. Semantic Search with Caching


from collections import defaultdict
from typing import List


class SemanticSearch:
    def __init__(self, vector_store: VectorStore,
                 embedding_generator: EmbeddingGenerator,
                 cache: QueryCache):
        self.vector_store = vector_store
        self.embedding_generator = embedding_generator
        self.cache = cache
        self.ranker = RelevanceRanker()

    async def search(self, query: str, max_results: int = 10) -> List[SearchResult]:
        """Main search method with caching and multi-component matching."""
        # Check cache first
        cache_key = self._generate_cache_key(query, max_results)
        cached_results = await self.cache.get(cache_key)
        if cached_results:
            return cached_results

        # Generate query embedding
        query_embedding = await self.embedding_generator.generate_query_embedding(query)

        # Search across all component types
        component_results = {}
        for component_type in ["description", "parameters", "examples", "returns"]:
            results = await self.vector_store.search_similar(
                query_embedding,
                k=max_results * 2,  # Get more candidates for better ranking
                component_type=component_type,
            )
            component_results[component_type] = results

        # Aggregate scores across components
        aggregated_scores = self._aggregate_component_scores(component_results)

        # Build search results with tool metadata
        search_results = await self._build_search_results(aggregated_scores, query)

        # Apply ranking and filtering
        ranked_results = await self.ranker.rank_results(search_results, query)

        # Cache results for 1 hour
        await self.cache.set(cache_key, ranked_results, ttl=3600)

        return ranked_results

    def _aggregate_component_scores(self, component_results: dict) -> dict:
        """Combine scores from different component types with weights."""
        weights = {
            "description": 0.50,
            "parameters": 0.25,
            "examples": 0.15,
            "returns": 0.10,
        }
        tool_scores = defaultdict(float)
        for component_type, results in component_results.items():
            weight = weights[component_type]
            for result in results:
                tool_id = result["tool_id"]
                similarity = 1 - float(result["score"])  # Convert cosine distance to similarity
                tool_scores[tool_id] += similarity * weight
        return dict(tool_scores)
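QueryCache and _generate_cache_key are used above but not shown in the article. A minimal Redis-backed sketch, assuming the cached results can be serialized to JSON; the key prefix and hashing scheme are illustrative:

import hashlib
import json
from typing import Optional
from redis.asyncio import Redis

class QueryCache:
    def __init__(self, redis_client: Redis, prefix: str = "cache:search:"):
        self.redis = redis_client
        self.prefix = prefix

    async def get(self, key: str) -> Optional[list]:
        raw = await self.redis.get(self.prefix + key)
        return json.loads(raw) if raw else None

    async def set(self, key: str, value: list, ttl: int = 3600) -> None:
        await self.redis.set(self.prefix + key, json.dumps(value), ex=ttl)

# On SemanticSearch, the cache key can be as simple as a hash of the
# normalized query text plus the result limit.
def _generate_cache_key(self, query: str, max_results: int) -> str:
    return hashlib.sha256(f"{query.strip().lower()}:{max_results}".encode()).hexdigest()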

The Results: 91% Token Reduction with Improved Accuracy

After implementing the semantic selection system, we ran comprehensive benchmarks comparing the old approach (sending all tools) with the new semantic approach:

Token Consumption Analysis

Without Semantic Selection:
Average input tokens per query: 7,244
Average output tokens: 279
Total tokens per query: 7,523
Cost per query: $0.0118
With Semantic Selection:
Average input tokens per query: 198
Average output tokens: 599
Total tokens per query: 797
Cost per query: $0.0060
Improvement: 91.5% reduction in token usage, 49% cost reduction

Performance Metrics



Real-World Query Examples

Query: “Deploy my api-service application version 2.0 to production environment”
Before: Sent all 70 tools (7,433 tokens); the LLM had to identify the relevant deployment tools itself.
After: Sent only the 3 most relevant tools (638 tokens):
deploy_application (score: 0.92)
create_docker_image (score: 0.87)
launch_aws_ec2_instance (score: 0.79)
Query: “Send a slack message to #engineering channel about deployment complete”
Before: Sent all 70 tools (7,440 tokens), including irrelevant database and monitoring tools.
After: Sent 2 relevant tools (127 tokens):
send_slack_message (score: 0.95)
send_push_notification (score: 0.71)

Accuracy Improvements

The semantic approach dramatically improved precision and recall:
Precision@3: 95% (vs 72% before) - of every 3 tools returned, 2.85 are relevant on average
Recall@5: 90% (vs 68% before) - 90% of all relevant tools appear in the top 5 results
Mean Reciprocal Rank: 0.88 (vs 0.54 before) - the first relevant tool appears much earlier in the ranking
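For reference, these are the standard per-query definitions of the metrics quoted above (the reported figures would be averages over the evaluation query set); the ranked list and relevance labels here are illustrative.

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k returned tools that are relevant."""
    return sum(1 for t in ranked[:k] if t in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant tools found in the top k."""
    return sum(1 for t in ranked[:k] if t in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant tool (0 if none is found)."""
    for i, t in enumerate(ranked, start=1):
        if t in relevant:
            return 1 / i
    return 0.0

ranked = ["deploy_application", "create_docker_image", "backup_database"]
relevant = {"deploy_application", "create_docker_image"}
print(round(precision_at_k(ranked, relevant, 3), 2))  # 0.67
print(recall_at_k(ranked, relevant, 3))               # 1.0
print(reciprocal_rank(ranked, relevant))              # 1.0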

Enterprise Impact and Lessons Learned

Cost Savings at Scale


With 10,000 daily queries, the token reduction translates to:
Daily savings: ~67 million tokens ($58)
Monthly savings: ~2 billion tokens ($1,740)
Annual savings: ~24 billion tokens ($20,880)
For our enterprise usage, this represents a six-figure annual cost reduction while improving service quality.
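As a sanity check, the figures above can be reproduced from the per-query numbers reported earlier; 30-day months and a 360-day year are assumptions made here to match the article's rounding.

tokens_before, tokens_after = 7_523, 797
cost_before, cost_after = 0.0118, 0.0060  # dollars per query
daily_queries = 10_000

tokens_saved_per_day = (tokens_before - tokens_after) * daily_queries   # ~67.3M tokens
dollars_saved_per_day = (cost_before - cost_after) * daily_queries      # ~$58

print(f"daily:   {tokens_saved_per_day / 1e6:.0f}M tokens, ${dollars_saved_per_day:,.0f}")
print(f"monthly: {tokens_saved_per_day * 30 / 1e9:.1f}B tokens, ${dollars_saved_per_day * 30:,.0f}")
print(f"annual:  {tokens_saved_per_day * 360 / 1e9:.1f}B tokens, ${dollars_saved_per_day * 360:,.0f}")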

Operational Benefits

Beyond cost savings, we observed significant operational improvements:
Faster Response Times: Users get answers 31% faster
Higher Success Rates: Fewer failed queries due to context length limits
Better User Experience: More relevant tool suggestions lead to higher adoption
Easier Onboarding: New users find relevant tools more quickly

Technical Lessons

Multi-Component Embeddings Matter: Breaking tools into components dramatically improves matching accuracy
Adaptive Selection Beats Fixed Limits: Dynamic tool count based on score distribution prevents over/under-selection
Caching is Critical: 1-hour TTL cache provides 60% hit rate for common queries
Redis Multi-Model Advantage: Combining vector search with traditional data structures simplifies architecture

Implementation Challenges

The biggest technical challenge was optimizing the embedding generation pipeline. Initial attempts at individual API calls were too slow and expensive. The solution was implementing sophisticated batching with retry logic and exponential backoff.
Another challenge was tuning the relevance scoring weights. We ran extensive A/B tests to find the optimal balance between description matching and parameter/example matching.

Future Enhancements

Our roadmap includes several exciting improvements:
Context-Aware Selection: Incorporate user history and session context
Tool Dependency Graph: Use Neo4j for complex tool orchestration scenarios
Multi-Modal Embeddings: Include tool documentation and screenshots
Federated Learning: Improve embeddings based on user feedback
Real-Time Performance Monitoring: Advanced analytics for continuous optimization
The system demonstrates the same architecture and performance characteristics as our enterprise implementation, but with mock tools for testing.

Conclusion: Semantic Understanding is the Future

Traditional tool selection methods are reaching their limits in enterprise environments. As tool libraries grow and user expectations increase, semantic understanding becomes not just an advantage but a necessity.
Our 91% token reduction proves that intelligent tool selection doesn’t just save money—it fundamentally improves the user experience. By understanding both what tools do and how they’re used, we can create AI systems that are more efficient, accurate, and user-friendly.
The combination of Redis vector search, multi-component embeddings, and adaptive selection provides a blueprint for the next generation of enterprise AI platforms. As organizations continue to build sophisticated tool ecosystems, semantic understanding will be the key to making them accessible and efficient.
The future of enterprise AI isn’t about having more tools—it’s about understanding exactly which tools you need, exactly when you need them.


Follow Subham Kundu for more insights on enterprise AI architecture and performance optimization. If you try the system, let Subham know your results in the comments!


