Simple Tool Calling Agent
Building an AI Agent with Real Function Calling
This project demonstrates how to build an AI agent that makes actual structured function calls rather than just describing tool usage in text. Many tutorials show agents that fake tool calling or simply generate text like "I would call the weather tool..." - this implementation executes real function calls with validated parameters using LangChain, LangGraph, and a locally-hosted LLM.
The complete implementation is available in the project's GitHub repository.
The Problem: Most "Tool Calling" Isn't Real
Many AI agent tutorials create the illusion of tool calling without actual function execution. Common issues include:
- Using models that lack function calling capabilities
- Generating text descriptions instead of structured tool calls
- Skipping critical tool binding steps
- Missing proper input validation with Pydantic schemas
- Inadequate docstrings that fail to guide LLM behavior
- Using outdated LangGraph APIs that AI coding assistants generate
Technical Implementation
Hybrid Architecture: Single Model, Dual Roles
The system uses a single LLM instance (Hermes-3-Llama-3.1-8B) running locally via vLLM, serving two distinct purposes:
- Agent Mode (llm_with_tools): LLM with tools bound, generates structured tool calls
- Evaluator Mode (llm): raw LLM without tools, performs semantic assessment
This hybrid approach runs on a 24GB GPU by avoiding the memory overhead of loading two separate models. The same underlying model serves both functions - one configuration with tool schemas, one without.
Why Model Selection is Critical
Not all LLMs support structured function calling. General instruction models will generate text descriptions like "I would call the weather tool for Boston" instead of making actual function calls. This implementation uses Hermes-3-Llama-3.1-8B, which is specifically trained for function calling.
Function-calling capable models: Hermes-3-Llama-3.1-8B, GPT-4, Claude 3+, Mistral-Large
Lacks function calling: Mistral-7B-Instruct, Base Llama models, most general chat models
Pydantic Schemas for Input Validation
Each tool uses Pydantic models to define expected input schemas, ensuring the LLM generates valid, structured tool calls:
from pydantic import BaseModel, Field
from langchain_core.tools import tool

class WeatherInput(BaseModel):
    location: str = Field(
        description="The city name and optionally state/country (e.g., 'San Francisco, CA')",
        min_length=2,
        max_length=100
    )

@tool(args_schema=WeatherInput)
def get_weather(location: str) -> str:
    """
    Retrieves current weather information for a specified location.

    Args:
        location: The city name and optionally state/country

    Returns:
        A string containing the current temperature and weather conditions.
    """
    # Mock implementation - a real tool would call a weather API here
    return f"Current weather in {location}: Temperature is 72°F, conditions are sunny"
The Critical Tool Binding Step
Simply defining tools is insufficient - they must be explicitly bound to the LLM. This step is commonly skipped by AI coding assistants:
from langchain_openai import ChatOpenAI

# Base LLM without tools (for evaluation)
llm = ChatOpenAI(
    base_url="http://localhost:8082/v1",
    api_key="EMPTY",  # placeholder; vLLM's OpenAI-compatible server does not check the key
    model="NousResearch/Hermes-3-Llama-3.1-8B"
)

# Bind tools to create agent LLM
tools = [get_weather, calculator]
llm_with_tools = llm.bind_tools(tools)
Without binding, the LLM has no knowledge of available tools and cannot generate structured function calls.
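A quick way to see the effect of binding (a minimal sketch, not taken from the project code) is to send the same question to both configurations and inspect the tool_calls attribute on the responses:

question = "What's the weather in Boston?"

# Bound LLM: the response carries structured tool calls instead of prose
bound = llm_with_tools.invoke(question)
print(bound.tool_calls)   # e.g. [{'name': 'get_weather', 'args': {'location': 'Boston'}, ...}]

# Unbound LLM: no tool schemas were sent, so tool_calls is empty and the model answers in plain text
plain = llm.invoke(question)
print(plain.tool_calls)   # []
print(plain.content)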
LangGraph State Management
The agent uses LangGraph to manage conversation flow with conditional routing:
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode

graph_builder = StateGraph(MessagesState)
graph_builder.add_node("agent", call_model)
graph_builder.add_node("tools", ToolNode(tools))
graph_builder.add_conditional_edges(
    "agent",
    lambda x: "tools" if x["messages"][-1].tool_calls else END
)
graph_builder.add_edge("tools", "agent")
graph_builder.add_edge(START, "agent")
graph = graph_builder.compile()
Important: This uses the current LangGraph API with the START and END constants. Many AI coding assistants generate outdated code that relies on set_entry_point() and the "__end__" string.
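The call_model node referenced in the graph is not shown above; a minimal version, assuming the llm_with_tools instance created in the binding step, looks like this:

def call_model(state: MessagesState):
    """Agent node: pass the running message history to the tool-bound LLM."""
    response = llm_with_tools.invoke(state["messages"])
    # Returning a one-element list lets LangGraph append the new AIMessage to the state
    return {"messages": [response]}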
Semantic Evaluation Without Keyword Matching
Rather than brittle keyword matching, the system uses LLM-based evaluation to assess:
- Tool Selection: Did the agent choose appropriate tools for the query?
- Response Quality: Is the final answer clear, complete, and well-formatted?
- Overall Success: Does the response address the user's question?
This approach handles semantic variations ("multiply" vs "multiplied" vs "times") and provides reasoned assessment with explanations.
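The exact evaluation prompts live in the repository; a simplified sketch of the idea, using the raw llm instance as the judge (the prompt wording here is illustrative, not the project's), might look like this:

from langchain_core.messages import HumanMessage

def evaluate_response(question: str, tools_used: list[str], final_answer: str) -> str:
    """Ask the tool-free LLM to grade the agent's behaviour semantically."""
    prompt = (
        "You are grading an AI agent.\n"
        f"User question: {question}\n"
        f"Tools the agent called: {tools_used}\n"
        f"Agent's final answer: {final_answer}\n"
        "Did the agent choose appropriate tools and answer clearly and completely? "
        "Reply with PASS or FAIL followed by a one-sentence explanation."
    )
    return llm.invoke([HumanMessage(content=prompt)]).content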
Implemented Tools
Weather Tool
Returns mock weather data for any location, demonstrating a single-parameter tool with string validation and a Pydantic schema that enforces length constraints.
Calculator Tool
Performs basic arithmetic operations (add, subtract, multiply, divide) with multi-parameter validation and error handling for invalid operations and division by zero.
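The calculator tool follows the same Pydantic + docstring pattern; a condensed sketch (the repository version may differ in detail):

class CalculatorInput(BaseModel):
    operation: str = Field(description="One of: add, subtract, multiply, divide")
    a: float = Field(description="The first operand")
    b: float = Field(description="The second operand")

@tool(args_schema=CalculatorInput)
def calculator(operation: str, a: float, b: float) -> str:
    """
    Performs a basic arithmetic operation on two numbers.

    Args:
        operation: One of 'add', 'subtract', 'multiply', 'divide'
        a: The first operand
        b: The second operand

    Returns:
        A sentence stating the result, or an error message for an unknown
        operation or division by zero.
    """
    if operation == "add":
        result = a + b
    elif operation == "subtract":
        result = a - b
    elif operation == "multiply":
        result = a * b
    elif operation == "divide":
        if b == 0:
            return "Error: division by zero is not allowed."
        result = a / b
    else:
        return f"Error: unknown operation '{operation}'."
    return f"The result of {operation} applied to {a} and {b} is {result}"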
Both tools follow production patterns:
- Complete docstrings with parameter descriptions
- Pydantic validation schemas
- Natural language return values
- Proper error messages
Key Implementation Details
The Role of Docstrings
Docstrings are not optional - they are the primary mechanism by which the LLM learns what each tool does. The docstring content is sent to the LLM as part of the tool schema. Many coding agents skip or minimize docstrings, resulting in unreliable tool selection.
Effective tool docstrings include:
- Clear purpose statement
- Detailed parameter descriptions with examples
- Return value documentation
- Format and constraint specifications
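To confirm what the model actually receives, the rendered schema can be inspected on the decorated tool object (these are standard LangChain tool attributes, not project-specific code):

# The docstring becomes the tool's description in the schema sent to the model
print(get_weather.name)         # "get_weather"
print(get_weather.description)  # text taken from the docstring
print(get_weather.args)         # parameter schema derived from WeatherInput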
vLLM Configuration
The implementation requires specific vLLM flags to enable function calling:
vllm serve NousResearch/Hermes-3-Llama-3.1-8B \
    --port 8082 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --max-model-len 8192
- --enable-auto-tool-choice: Enables automatic tool calling capability
- --tool-call-parser hermes: Uses the Hermes-specific parser for structured calls
- --max-model-len 8192: Limits the context window to fit in 24GB of GPU memory
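Once the server is running, a request to vLLM's OpenAI-compatible models endpoint is a quick way to confirm it is reachable before wiring up the agent (a simple sanity check, not part of the project code):

import requests

# vLLM exposes an OpenAI-compatible API; /v1/models lists the loaded model
resp = requests.get("http://localhost:8082/v1/models", timeout=5)
print(resp.json())  # the "data" list should include NousResearch/Hermes-3-Llama-3.1-8B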
Test Results
The implementation includes three test cases demonstrating different agent behaviors:
- Weather query: the agent correctly calls the get_weather tool
- Math query: the agent correctly calls the calculator tool with the proper operation parameter
- General knowledge: the agent appropriately declines when no tool is available
Each test displays the complete message flow (HumanMessage → AIMessage with tool_calls → ToolMessage → final AIMessage) and LLM-based evaluation results.
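Running a test amounts to invoking the compiled graph with a single human message and printing the accumulated history; a minimal sketch using the names defined above:

from langchain_core.messages import HumanMessage

result = graph.invoke({"messages": [HumanMessage(content="What's the weather in Boston?")]})

# The returned state holds the full flow:
# HumanMessage -> AIMessage(tool_calls=[...]) -> ToolMessage -> final AIMessage
for message in result["messages"]:
    message.pretty_print()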
Educational Value and Extensions
This implementation provides patterns for building agents that interact with:
- REST APIs and web services
- Databases with SQL queries
- Analytics and data processing tools
- System operations and file management
- Business systems (CRM, ERP, ticketing)
Extension points:
- Add more tools following the Pydantic + docstring pattern (see the sketch after this list)
- Implement authentication for external APIs
- Add retry logic and rate limiting
- Extend evaluation criteria for domain-specific requirements
- Scale to multi-agent systems with specialized tool sets
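As a concrete example of the first extension point, a new tool wrapping a REST API slots into the same Pydantic + docstring pattern; the endpoint and tool below are hypothetical:

from pydantic import BaseModel, Field
from langchain_core.tools import tool
import requests

class StockPriceInput(BaseModel):
    symbol: str = Field(description="Ticker symbol, e.g. 'AAPL'", min_length=1, max_length=10)

@tool(args_schema=StockPriceInput)
def get_stock_price(symbol: str) -> str:
    """
    Retrieves the latest price for a stock ticker symbol.

    Args:
        symbol: The ticker symbol, e.g. 'AAPL'

    Returns:
        A sentence with the latest price, or an error message if the lookup fails.
    """
    # Hypothetical endpoint - substitute a real market-data API
    resp = requests.get(f"https://api.example.com/quote/{symbol}", timeout=10)
    if resp.status_code != 200:
        return f"Error: could not retrieve a price for {symbol}."
    return f"The latest price for {symbol} is {resp.json().get('price')}"

# Re-bind so the agent can see the new tool
tools = [get_weather, calculator, get_stock_price]
llm_with_tools = llm.bind_tools(tools)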
Common Pitfalls Addressed
AI Coding Assistants Generate Outdated Code
Many LLM-based coding assistants (Claude, ChatGPT, etc.) generate LangGraph code using the legacy API. This implementation uses the current API and highlights the differences.
Missing Tool Binding
AI assistants frequently skip the critical bind_tools() step, resulting in agents that cannot actually call functions.
Inadequate Docstrings
Many tutorials use minimal docstrings that don't provide enough information for the LLM to reliably select and use tools.
Wrong Model Selection
Using models that lack function calling capabilities results in text descriptions instead of structured calls.
Project Attribution
This is an educational project demonstrating production-ready patterns for AI agent development with real function calling capabilities.
- GitHub Repository
- License: MIT
Technologies Used
- Python 3.11+
- LangChain
- LangGraph
- vLLM
- Pydantic
- Hermes-3-Llama-3.1-8B