You know that feeling when you're staring at a mountain of repetitive code, or trying to automate a process that just feels… manual? That was me a while back. I'd been keeping an eye on AI language models, especially Claude, and the buzz around things like Copilot for a bit. Everyone was talking about how AI was going to change coding, and frankly, I was sceptical but also intrigued. My team at a fintech startup was constantly making tons of small services, and setting up new projects, or doing initial code reviews, was becoming a huge time suck.
The Early Days: Just Prompting and Praying
Around late 2023, I started just messing around with Claude 2.1, mostly by throwing questions at it. My initial idea was to get it to auto-make some basic code for our new Python microservices. We used FastAPI quite a lot, and the setup was always pretty much the same: main.py, models.py, schemas.py, services.py, tests/ etc. I'd copy-paste an old project and modify it. "Claude," I'd type, "generate a FastAPI service for user management, with CRUD operations for a User model with id, name, email fields." It'd give me some code, often quite good, but a bit all over the place. Sometimes it'd forget async/await, sometimes it'd miss Pydantic schemas, or the test files would be empty. It was faster than doing it by hand, sure, but I spent nearly as much time fixing its output as I would have just writing it from scratch. Honestly, it felt more like a chore than a real fix. This approach felt like I was babysitting a junior dev who needed me to hold their hand all the time.
Hitting a Wall: Context and Consistency
The biggest headache? Context. If I wanted to generate two related files, say models.py and schemas.py, I'd either have to put the entire content of the first file in the prompt for the second (which quickly made the prompts too long and cost a bomb), or Claude would forget how they linked up. It was super fragile. I knew I needed something tougher, something that could manage whole jobs, not just answer one-off questions. We were deploying about 100 new services a month, and the friction was real. I knew there had to be a better way than just prompt engineering.
One evening, after I spent 3 hours figuring out why a schema it made wasn't checking data right (turned out Claude had used Optional[str] instead of str for a field that had to have a value, leading to silent data issues), I decided this quick-fix approach just wasn't good enough. This kind of mistake, while small, could really mess up our data if it slipped through code review.
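The bug is easy to reproduce. Here's a minimal sketch of the pitfall with a hypothetical `User` model (the field names are illustrative, not our actual schema): with `Optional[str]` defaulted to `None`, Pydantic happily accepts missing data instead of rejecting it.

```python
# Minimal reproduction of the Optional[str] pitfall (hypothetical User model).
from typing import Optional

from pydantic import BaseModel, ValidationError


class User(BaseModel):
    name: str                    # required: a missing value raises ValidationError
    email: Optional[str] = None  # optional: a missing value silently becomes None


# The required field is enforced:
try:
    User(email="ada@example.com")
except ValidationError:
    print("name is required")  # this fires

# The optional field passes silently even when the data is missing:
user = User(name="Ada")
print(user.email)  # None - no error, which is how bad rows slip through
```

One wrong annotation and the validation layer you're counting on quietly waves the bad data through.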
The Breakthrough: Agents, Skills, and Hooks
That's when I stopped thinking about simple prompts and started building a real Claude setup with 'agents,' 'skills,' and 'hooks.' The goal was to give Claude a set of tools (skills) and a way to decide when to use them (auto-activation), plus ways for me to step in or watch what it's doing ('hooks'). I wanted to move beyond just asking Claude to write code, to asking Claude to think through and do coding jobs.
I started messing around with Python 3.10 to build a first version. My main idea was to chop big jobs into smaller, easier bits, each one handled by a special 'agent' or by turning on a specific 'skill.'
Here’s a simplified look at how I structured it:
Let's say I wanted to generate a new FastAPI project. Instead of one giant prompt, I'd have skills like:
* create_directory(path: str)
* write_file(path: str, content: str)
* install_dependencies(project_path: str, dependencies: list[str])
* run_linter(file_path: str)
My primary agent, let's call it the ProjectEngineerAgent, would get the high-level request. It would then use Claude 3 Opus (I upgraded to Opus as soon as it was available – the reasoning capability was a total game-changer) to think about the steps, activate the right skills, and manage the whole thing. Opus's ability to understand complex instructions and follow multi-step reasoning chains was super important here. It felt like moving from a junior dev to a mid-level who could actually plan their work.
Here’s a snippet of how a skill might look:
```python
# skills.py
import os


def create_directory(path: str) -> str:
    """Creates a directory at the given path."""
    try:
        os.makedirs(path, exist_ok=True)
        return f"Directory '{path}' created successfully."
    except Exception as e:
        return f"Error creating directory '{path}': {e}"


def write_file(path: str, content: str) -> str:
    """Writes content to a file at the given path."""
    try:
        with open(path, "w") as f:
            f.write(content)
        return f"File '{path}' written successfully."
    except Exception as e:
        return f"Error writing file '{path}': {e}"


# ... more skills like install_dependencies, run_linter, etc.
```
And the ProjectEngineerAgent would use Claude's tool-use capabilities to select and call these functions. This meant Claude wasn't just telling me what code to write; it was doing the steps to build the project.
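The dispatch side of that is simple. This is a sketch under my own naming (`SKILLS`, `dispatch_tool_call`, and the stand-in skill are illustrative, not the exact code): in the real system the skills' JSON schemas go up in the `tools` parameter of the Messages API, Claude responds with a `tool_use` block naming a skill and its arguments, and the agent routes that block to the matching Python function.

```python
# Sketch: routing a Claude tool_use block to the matching skill function.
# SKILLS and dispatch_tool_call are names invented for this example.
from typing import Any, Callable


def create_directory(path: str) -> str:
    """Stand-in for the real skill; the real one calls os.makedirs."""
    return f"Directory '{path}' created successfully."


SKILLS: dict[str, Callable[..., str]] = {
    "create_directory": create_directory,
}


def dispatch_tool_call(tool_use: dict[str, Any]) -> str:
    """Look up the skill named in a tool_use block and call it with its input."""
    skill = SKILLS.get(tool_use["name"])
    if skill is None:
        return f"Unknown skill: {tool_use['name']}"
    return skill(**tool_use["input"])


# A tool_use block, roughly as it comes back from the model:
result = dispatch_tool_call({"name": "create_directory", "input": {"path": "new_service"}})
print(result)  # Directory 'new_service' created successfully.
```

The result string then goes back to Claude as a `tool_result` message, and the loop repeats until the agent decides the task is done.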
Implementing Hooks for Safety and Control
This was where the hooks came in, and they were super important. When building an automated system that makes and potentially runs code, security is everything. Honestly, one of my first big screw-ups was not adding enough safety checks. I quickly realised an AI making files or running commands could be a massive problem. This directly relates to why I’m so keen on things like the patterns I discuss in My New Favourite Way to Keep Code Safe.
I set up a pre_execution_hook that, for any skill that could cause trouble (like write_file or install_dependencies), would stop and ask me for permission. For write_file, it'd show me the path and the full content Claude intended to write. Only after I clicked 'approve' would the skill run. This stopped any crazy agent actions. At 2 AM one night, our agent tried to write a massive, malformed requirements.txt file which would've broken our CI/CD container if I hadn't had this hook in place to catch it. The stack trace showed a subprocess.CalledProcessError during a pip install -r due to bad package names Claude invented. Total facepalm moment.
Another hook was post_code_generation_hook which would automatically format the generated code using Black and run a basic linter (Flake8) before returning it. This really made the code way better and more consistent.
```python
# hooks.py
def pre_execution_approval_hook(skill_name: str, args: dict) -> bool:
    """Asks for human approval before executing a sensitive skill."""
    print("\n--- Human Approval Required ---")
    print(f"Skill: {skill_name}")
    print(f"Arguments: {args}")
    response = input("Approve execution? (y/n): ")
    return response.lower() == "y"


def post_code_format_hook(code_content: str) -> str:
    """Formats Python code using Black."""
    try:
        import black  # assume Black is installed in the environment

        return black.format_str(code_content, mode=black.FileMode())
    except ImportError:
        print("Warning: Black not installed. Skipping formatting.")
        return code_content
    except Exception as e:
        print(f"Error during code formatting: {e}")
        return code_content
```
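Wiring the approval hook into skill execution looks roughly like this (a sketch: `run_skill` and `SENSITIVE_SKILLS` are illustrative names, and I'm passing the approval function in as an argument so the gate is easy to test):

```python
# Sketch of hook-gated skill execution. run_skill and SENSITIVE_SKILLS are
# illustrative names, not the exact ones from my codebase.
from typing import Any, Callable

SENSITIVE_SKILLS = {"write_file", "install_dependencies"}


def run_skill(
    skill: Callable[..., str],
    args: dict[str, Any],
    approve: Callable[[str, dict], bool],
) -> str:
    """Run a skill, gating sensitive ones behind the approval hook."""
    name = skill.__name__
    if name in SENSITIVE_SKILLS and not approve(name, args):
        return f"Skill '{name}' rejected by human reviewer."
    return skill(**args)


def write_file(path: str, content: str) -> str:
    """Stand-in skill; the real one actually writes to disk."""
    return f"File '{path}' written successfully."


# With an auto-reject approval function, the write never happens:
print(run_skill(write_file, {"path": "app.py", "content": ""}, lambda n, a: False))
# Skill 'write_file' rejected by human reviewer.
```

In the real system, `approve` is `pre_execution_approval_hook`, which is what caught that 2 AM requirements.txt write.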
Measurable Results and Lessons Learned
This whole Claude agent setup, running as an internal API with FastAPI, really changed how we did things. For generating a new microservice boilerplate, what used to take me 30-45 minutes of copying stuff and fixing it by hand now takes about 2 minutes, even with me approving the file writes. We cut the average setup time for new projects from 45 minutes to around 5 minutes, including the human review step. This isn't just about speed; it's about consistency and making things less brain-heavy.
I also used a similar agent for auto-making basic test files for new features. Our test coverage for new parts jumped from about 45% (when devs were swamped) to a solid 75-80% right from the start, saving me like an hour or two of writing tests by hand for each new bit.
When our internal API hit 10k requests/day for various automation tasks, the system worked great. The ProjectEngineerAgent could spin up a new service in seconds. Database queries for fetching skill definitions went from 2.5s to 180ms after I added proper indexing to our skill registry database. Classic mistake, not adding indexes early enough. That cost us a few hours of debugging performance issues.
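The indexing fix itself was one line. The schema below is hypothetical (our registry is not SQLite and has more columns), but it shows the shape of the change: an index on the lookup column turns a full table scan into an index search, which is where that 2.5s-to-180ms drop came from.

```python
# Hypothetical skill-registry schema, just to show the effect of the index.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE skills (id INTEGER PRIMARY KEY, name TEXT, definition TEXT)")

# Before: every lookup by name is a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM skills WHERE name = ?", ("write_file",)
).fetchall()
print(plan[0][-1])  # e.g. "SCAN skills"

# After: the same query walks the index instead.
conn.execute("CREATE INDEX idx_skills_name ON skills (name)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM skills WHERE name = ?", ("write_file",)
).fetchall()
print(plan[0][-1])  # e.g. "SEARCH skills USING INDEX idx_skills_name (name=?)"
```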
Oops! What I Messed Up and What I'd Do Next Time
* Dependency Bloat: Managing all the different libraries for each skill became a real headache. I first just threw everything into one big requirements.txt file, but that got super messy. I eventually switched to a cleaner way, using poetry to manage projects and making sure each 'skill module' only listed the few things it really needed. This made setting it up a bit trickier but kept the main code much tidier.
* Prompt Tuning Is Still Key: Even with agents, the first prompt you give the main agent *has* to be super clear and to the point. If Claude doesn't get the big picture, the whole thing falls apart.
* Observability: I didn't log enough stuff at the start. When an agent failed, it was a nightmare to figure out which skill broke and *why*. Adding really good logging – with details like the agent ID, task ID, and what arguments the skill got – was a total game-changer for debugging. Honestly, it saved me like 20 hours a week just trying to understand failures.
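That logging boiled down to something like the sketch below: a `LoggerAdapter` that stamps every message with the agent and task it belongs to (the field names and logger name here are my own, not the production ones).

```python
# Sketch of structured agent logging with a LoggerAdapter. Field names
# (agent_id, task_id) and the logger name are illustrative.
import logging

logging.basicConfig(format="%(levelname)s agent=%(agent_id)s task=%(task_id)s %(message)s")


def get_agent_logger(agent_id: str, task_id: str) -> logging.LoggerAdapter:
    """Return a logger that tags every record with the agent and task IDs."""
    base = logging.getLogger("claude_agents")
    return logging.LoggerAdapter(base, {"agent_id": agent_id, "task_id": task_id})


log = get_agent_logger("ProjectEngineerAgent", "task-042")
log.warning("skill failed: %s with args %s", "write_file", {"path": "app.py"})
# Emits something like:
# WARNING agent=ProjectEngineerAgent task=task-042 skill failed: write_file with args {'path': 'app.py'}
```

Because every skill call goes through one of these adapters, grepping the logs for a single task ID reconstructs the whole chain of decisions that led to a failure.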
If I were doing this again, I'd invest in those fundamentals much earlier: dependency management, prompt design, and above all observability.
It's been a wild ride, but seeing Claude change from just a fancy autocomplete for code to a real automation buddy has been awesome. It's not about replacing developers, but helping us, freeing us up for the more interesting, tough problems. This isn't just a toy; it's a solid tool for any full-stack developer who wants to seriously use AI in their daily work. Next, I'm exploring how models like Emu3.5 are changing what we build, and whether they can help with more visual-centric code generation, which could be another massive leap.