Why Local LLMs Struggle with Tool Selection, and 4 Strategies to Fix It
Local LLMs Collapse as Tools Multiply
The core ability of an AI agent is selecting the right tool for the job. Calling a file-reading tool when you need web search is useless. Large models like GPT-4 and Claude make relatively stable choices even with dozens of tools available. But ask an 8B-scale model running locally to do the same thing, and the story changes completely.
The root causes are twofold. First, context window limitations: a single tool's JSON schema consumes hundreds of tokens on average, so with 100 tools, tens of thousands of tokens are gone just for schemas. According to Speakeasy's measurements, statically loading 400 tools consumed around 405,000 tokens, exceeding the context window of most models[^1]. Second, choice overload: when small models face multiple tools with similar names, they struggle to decide which one to pick. LLaMA-3.1-8B-Instruct scored 0.428 on MCP-Bench, significantly trailing GPT-4o-mini's 0.557[^2].
Ultimately, for local LLMs to handle tool selection, we need external mechanisms that reduce the list of tools the model sees. Every strategy covered in this article stems from that single principle: filter the choices beforehand down to a range small models can handle.
Semantic Router: Millisecond Classification Without LLM Calls
The first layer we can apply is Semantic Router. This library from aurelio-labs converts user queries into embedding vectors, then compares their similarity against predefined routes to determine a category[^3]. Since it doesn't call an LLM, latency is measured in milliseconds.
The mechanism is simple. Example utterances like “read a file” and “show directory listing” get registered to a file_operations route, while “search the web” and “fetch URL content” go to web_operations. When new queries arrive, category determination happens through embedding similarity alone. Using lightweight embedding models like all-MiniLM-L6-v2 allows completely local execution without API keys.
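The routing step can be sketched in a few lines. The example below is a minimal stand-in: it uses bag-of-words cosine similarity instead of a real embedding model, and all function and route names are illustrative. A real setup would use the semantic-router library with an encoder such as all-MiniLM-L6-v2.

```python
from collections import Counter
import math

# Toy stand-in for an embedding model: bag-of-words vectors.
# In practice, swap in a real encoder like all-MiniLM-L6-v2.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each route is defined by example utterances, as in semantic-router.
ROUTES = {
    "file_operations": ["read a file", "show directory listing"],
    "web_operations": ["search the web", "fetch URL content"],
}

def route(query: str) -> str:
    q = embed(query)
    # Score each route by its best-matching example utterance.
    return max(
        ROUTES,
        key=lambda r: max(cosine(q, embed(u)) for u in ROUTES[r]),
    )

print(route("please read this file"))    # → file_operations
print(route("search the web for news"))  # → web_operations
```

No LLM call happens anywhere in `route`, which is why this layer runs in milliseconds: the cost is one query embedding plus a handful of similarity comparisons.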
Semantic Router is especially effective with 20 or fewer tools. Split into 4-5 categories, each category holds 3-5 tools that 8B models can handle without strain. However, it struggles with compound queries spanning multiple domains, like "read data from a file and store it in the database." Since it returns only one route, the file tools get selected while the DB tools get missed. That limitation is what Tool RAG addresses.
Tool RAG: Narrowing Candidates Through Vector Search
Tool RAG embeds tool descriptions, stores them in a vector database, and searches for relevant tools using semantic similarity with queries. Using lightweight vector DBs like ChromaDB or Qdrant allows indexing hundreds of tools in local environments.
The key is the `top_k` parameter. Retrieving only the top 5 of 100 tools and passing those to the LLM reduces context consumption by over 90%. While 8B models can't digest 100 complete tool schemas at once, they handle 5 stably.
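A minimal sketch of the `top_k` idea, using a toy in-memory index and bag-of-words similarity in place of ChromaDB/Qdrant and real embeddings. The tool names and descriptions are made up for illustration.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical tool registry: name -> description (the text we index).
TOOLS = {
    "read_file": "Read the contents of a file from the local filesystem",
    "web_search": "Search the web and return titles, URLs, and snippets",
    "db_insert": "Insert a row into a database table",
    "send_email": "Send an email message to a recipient",
}

def search_tools(query: str, top_k: int = 5) -> list[str]:
    q = embed(query)
    ranked = sorted(TOOLS, key=lambda n: cosine(q, embed(TOOLS[n])), reverse=True)
    return ranked[:top_k]  # only these tools' schemas are passed to the LLM

print(search_tools("search the web for recent news", top_k=2))
```

Only the schemas of the returned names are bound to the model; the other tools never enter the context at all, which is where the 90%+ savings comes from.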
Combining Semantic Router and Tool RAG creates a two-stage pipeline: stage 1, the semantic router determines the category; stage 2, vector search refines the tool list within that category. This combination is the most practical solution for environments with 30-100 tools.
But retrieval quality depends directly on tool description quality. A tool described as "Search for things" versus "Search the web using Brave Search API. Returns titles, URLs, and text snippets" shows a huge difference in retrieval precision. This leads to the MCP tool description optimization problem.
MCP Tool Description Optimization: Write for Models to Read
As the MCP (Model Context Protocol) ecosystem expanded, description quality became a key variable in tool selection accuracy. The Dynamic ReAct paper systematized this as Description Enrichment[^4].
Enrichment follows five principles:

1. What: concretely describe what the tool does.
2. When: specify situations where the tool applies, using "Use this when…" patterns.
3. Not: list cases the tool should not handle, as "Do NOT use this for…"
4. Related: mention related tools or operations to broaden the search net.
5. Category: tag with app or domain classifications.
The effect of these principles is most dramatic for smaller models. Large models can infer intent from poor descriptions through context, but 8B models only understand what is literally written. The same principles apply to parameters: adding an individual description to each parameter, limiting possible values explicitly with enum, and including concrete examples visibly reduces argument configuration errors.
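As an illustration, here is what a sparse versus enriched definition of a hypothetical `web_search` tool might look like. The field names loosely follow MCP conventions, but the exact shape is an assumption, not taken from the spec.

```python
# Hypothetical MCP-style tool definitions; field names are illustrative.
sparse = {
    "name": "web_search",
    "description": "Search for things",  # too vague to retrieve or select on
}

enriched = {
    "name": "web_search",
    # What / When / Not / Related, packed into one description:
    "description": (
        "Search the web via Brave Search API; returns titles, URLs, snippets. "
        "Use this when the answer needs current or external information. "
        "Do NOT use this for reading local files (see read_file)."
    ),
    "annotations": {"category": "web"},  # Category tag for routing/filtering
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search keywords, e.g. 'MCP tool selection'",
            },
            "freshness": {
                "type": "string",
                "enum": ["day", "week", "month"],  # constrain values explicitly
                "description": "How recent results must be",
            },
        },
        "required": ["query"],
    },
}
```

The enriched version gives both the vector index something to match on and the small model an explicit rule for when not to call the tool, which is exactly where 8B models otherwise guess.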
Token savings also matter. The MCP community has discussed the SEP-1576 proposal to eliminate redundancy by referencing common schemas with `$ref`[^5]. The balance to strike is compressing each tool description under 200 characters while still including all essential information.
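A sketch of the SEP-1576 idea using standard JSON Schema conventions: a parameter block repeated across many tools is factored into shared `$defs` and referenced with `$ref` instead of inlined per tool. The `Pagination` schema and tool names here are invented for illustration.

```python
# Shared definitions declared once, JSON Schema style.
shared = {
    "$defs": {
        "Pagination": {
            "type": "object",
            "properties": {
                "limit": {"type": "integer", "description": "Max results to return"},
                "offset": {"type": "integer", "description": "Start index"},
            },
        }
    }
}

# Each tool schema points at the common definition instead of
# duplicating the pagination properties inline.
list_issues = {
    "allOf": [{"$ref": "#/$defs/Pagination"}],
    "properties": {"repo": {"type": "string", "description": "Repo to list issues from"}},
}
list_pulls = {
    "allOf": [{"$ref": "#/$defs/Pagination"}],
    "properties": {"repo": {"type": "string", "description": "Repo to list PRs from"}},
}
```

With dozens of tools sharing the same pagination, auth, or filter blocks, the duplicated tokens add up quickly; one shared definition amortizes that cost across the whole toolset.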
Validation Feedback: Making Models Try Again After Failure
Even when the right tool is selected, execution fails if the arguments are misconfigured. This problem is especially frequent with small models. The Agentica framework from Korean AI company wrtn labs offers a practical solution[^6].
Agentica's core patterns are twofold. The first is the Selector Agent, a lightweight agent that narrows candidates by looking at only the names and one-line summaries from the complete function list. Since summaries are passed instead of full schemas, token consumption drops 80%+, and the task is simple enough for 4B models.
The second is Validation Feedback. When a tool call's arguments fail schema validation (missing required parameters, wrong types), the error messages are fed back to the LLM to induce a retry. Using typia-based runtime type validation, up to 3 retries are allowed. This feedback loop dramatically improves 8B models' argument accuracy: while they can't produce perfect arguments in one shot, small models can see errors and make corrections.
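A minimal sketch of the validation-feedback loop, assuming a hypothetical `call_llm` callable and a hand-written validator for a made-up `send_email` tool. Agentica itself does this in TypeScript with typia-generated validators; this Python version only mirrors the control flow.

```python
def validate(args: dict) -> list[str]:
    """Return schema errors for a hypothetical send_email tool."""
    errors = []
    if "to" not in args:
        errors.append("missing required parameter: to")
    if not isinstance(args.get("subject", ""), str):
        errors.append("subject must be a string")
    return errors

def call_with_retries(call_llm, max_retries: int = 3) -> dict:
    feedback = None
    for _ in range(max_retries):
        args = call_llm(feedback)      # on retry, the error text goes back in
        errors = validate(args)
        if not errors:
            return args                # arguments passed validation: done
        feedback = "; ".join(errors)   # induce a corrected retry
    raise ValueError(f"still invalid after {max_retries} tries: {feedback}")

# Fake LLM that fixes its arguments once it sees the error message.
def fake_llm(feedback):
    if feedback is None:
        return {"subject": "hi"}       # first attempt: forgot "to"
    return {"to": "a@b.com", "subject": "hi"}

print(call_with_retries(fake_llm))     # → {'to': 'a@b.com', 'subject': 'hi'}
```

The loop never asks the model to be right the first time; it only asks it to react to a concrete error string, which is a much easier task for an 8B model.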
When Tools Number in Hundreds: Progressive Search and Search-and-Load
Once the tool count exceeds 100, the strategies introduced so far aren't enough. At this scale, two approaches have gained attention.
Speakeasy's Gram introduced Progressive Search. It organizes all tools into a hierarchical structure and binds only 3 meta-tools to the LLM: list_tools, describe_tools, and execute_tool. The LLM first explores categories with list_tools, fetches detailed schemas with describe_tools once it finds the tools it needs, then runs them with execute_tool. According to Gram's benchmarks, Progressive Search used about 2,500 initial tokens in a 400-tool environment, a 99%+ reduction from static loading's 405,100[^1].
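The three meta-tools can be sketched as follows. The catalog contents and return shapes are invented for illustration; Gram's actual API will differ.

```python
# Hypothetical hierarchical catalog: category -> tool name -> full schema.
CATALOG = {
    "files": {"read_file": {"description": "Read a local file",
                            "schema": {"path": "string"}}},
    "web":   {"web_search": {"description": "Search the web",
                             "schema": {"query": "string"}}},
}

def list_tools(category=None):
    """Explore the hierarchy: top-level categories, or names within one."""
    if category is None:
        return list(CATALOG)
    return list(CATALOG[category])

def describe_tools(category, names):
    """Fetch full schemas only for the tools the LLM asked about."""
    return {n: CATALOG[category][n] for n in names}

def execute_tool(category, name, args):
    """Dispatch the actual call (stubbed here)."""
    return f"executed {name} with {args}"

# The LLM walks the hierarchy instead of seeing every schema up front:
print(list_tools())                                   # → ['files', 'web']
print(list_tools("web"))                              # → ['web_search']
print(describe_tools("web", ["web_search"]))
print(execute_tool("web", "web_search", {"query": "mcp"}))
```

Only these three function signatures ever occupy the context; the 400 real schemas live outside the model and are paged in on demand, which is where the 2,500-token figure comes from.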
The Search-and-Load architecture proposed by the Dynamic ReAct paper follows similar principles[^4]. It performs vector search with search_tools, then the LLM selects the needed tools from the results and binds them with load_tools. The key difference is that the LLM composes atomic search queries itself. When a compound request like "monitor Twitter mentions and organize them in Google Sheets" arrives, the LLM splits it into separate queries: "retrieve Twitter mentions" and "create Google Sheets spreadsheet". Precision is much higher than searching with the user's original query directly.
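A sketch of search-and-load with the decomposition step stubbed out. In practice the LLM produces the atomic queries and search_tools hits a real vector index; here, word-overlap retrieval and a hand-written query list stand in, and all tool names are invented.

```python
# Hypothetical tool index: name -> description text.
INDEX = {
    "twitter_get_mentions": "retrieve twitter mentions for an account",
    "sheets_create": "create a google sheets spreadsheet",
    "read_file": "read a local file",
}

def search_tools(query: str) -> list[str]:
    # Toy word-overlap retrieval standing in for real vector search.
    q = set(query.lower().split())
    scored = {n: len(q & set(d.split())) for n, d in INDEX.items()}
    return [n for n, s in sorted(scored.items(), key=lambda kv: -kv[1]) if s > 0]

def load_tools(names: list[str]) -> list[str]:
    return names  # in a real agent: bind these tools' schemas to the LLM

# Atomic queries the LLM would produce for the compound request
# "monitor Twitter mentions and organize them in Google Sheets":
atomic = ["retrieve twitter mentions", "create google sheets spreadsheet"]
hits = {n for q in atomic for n in search_tools(q)[:1]}
print(load_tools(sorted(hits)))  # → ['sheets_create', 'twitter_get_mentions']
```

Searching with the original compound sentence would blur both intents into one vector; one search per atomic query keeps each lookup sharp, which is the paper's core precision argument.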
What Benchmarks Tell Us About Reality
How effective these strategies actually are shows up in the benchmark numbers.
BFCL (Berkeley Function Calling Leaderboard) is the most authoritative benchmark for evaluating LLM function calling accuracy, presented at ICML 2025[^7]. The most notable result among small models is ToolACE-8B. This model, fine-tuned from LLaMA-3.1-8B-Instruct on tool usage data, reached GPT-4-level performance on BFCL-v1[^8]. Performance at this level at the 8B scale was unprecedented.
Among general-purpose models, Qwen3-8B has established itself as the de facto standard. In the r/LocalLLaMA community, evaluations like "insane performance on agentic tasks" appear repeatedly, and its 131,072-token long context window also favors tool selection[^9].
| Model | Size | Features |
|---|---|---|
| ToolACE-8B | 8B | GPT-4-class on BFCL-v1, tool-calling specialized fine-tuning |
| Qwen3-8B | 8B | Strongest tool-calling among general models, 131K context |
| Qwen3-30B-A3B | 30B (3B active) | MoE structure, only 3B parameters activated for lightness |
| Qwen3-4B | 4B | Stable tool calling even at 4B scale |
MCP-Bench is a benchmark that evaluates actual tool usage through MCP servers, adopted at a NeurIPS 2025 Workshop[^2]. On this benchmark, LLaMA-3.1-8B-Instruct's score of 0.428 significantly trailed GPT-5's 0.749. However, that number came from binding all tools directly, with no tool selection optimization. With the strategies explained earlier (reducing candidates with Tool RAG, enhancing descriptions with Description Enrichment, and correcting errors with Validation Feedback), the gap can be substantially narrowed.
Practical Combinations by Tool Count
You don’t need to apply all strategies at once. Appropriate combinations vary by tool count.
| Tool Count | Recommended Strategy | Reason |
|---|---|---|
| 1-10 | Load all statically | No overhead, 8B models handle under 10 stably |
| 10-30 | Semantic Router + category-based partial loading | Millisecond classification then pass only relevant tools |
| 30-100 | Tool RAG (top_k=5) + Validation Feedback | Reduce candidates with vector search, complement accuracy with retries |
| 100-500 | Gram Progressive Search or Dynamic ReAct Search-and-Load | Meta-tool based dynamic exploration, initial tokens under 2,500 |
| 500+ | Dedicated routing infrastructure | MoM-based platforms like vLLM Semantic Router Iris[^10] |
The most realistic scenario is the 30-100 tool environment; connecting 2-3 MCP servers quickly reaches this range. In this bracket, the most stable approach is a 4-layer pipeline: primary classification with Semantic Router, secondary filtering with Tool RAG, description quality assurance with Description Enrichment, and a safety net with Validation Feedback.
Ultimately, the tool selection problem of local LLMs isn't a limitation of the models themselves, but of the quantity and quality of information the models are given. With proper filtering and enhancement mechanisms, even 8B models can operate as practical agents in environments with hundreds of tools. Rather than waiting for models to get smarter, organizing the world the model sees is a much faster path.
Footnotes
[^1]: Speakeasy, "Comparing Progressive Discovery and Semantic Search for Powering Dynamic MCP", 2025. https://www.speakeasy.com/blog/100x-token-reduction-dynamic-toolsets
[^2]: Wang et al., "MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers", NeurIPS 2025 Workshop. https://github.com/Accenture/mcp-bench
[^3]: aurelio-labs, "Semantic Router", GitHub. https://github.com/aurelio-labs/semantic-router
[^4]: Gaurav et al., "Dynamic ReAct: Scalable Tool Selection for Large-Scale MCP Environments", arXiv:2509.20386, 2025. https://arxiv.org/abs/2509.20386
[^5]: MCP SEP-1576, "Token Bloat Mitigation", GitHub. https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1576
[^6]: wrtn labs, "Agentica: TypeScript AI Function Calling Framework", GitHub. https://github.com/wrtnlabs/agentica
[^7]: Patil et al., "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models", ICML 2025. https://gorilla.cs.berkeley.edu/leaderboard.html
[^8]: Team-ACE, "ToolACE-8B", Hugging Face. https://huggingface.co/Team-ACE/ToolACE-8B
[^9]: Qwen Team, "Qwen3 Technical Report", arXiv:2505.09388, 2025. https://arxiv.org/abs/2505.09388
[^10]: vLLM Project, "Semantic Router v0.1 Iris", 2026. https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html