Case Study: AI-Driven IT Support Desk

Client: Multicloud Enterprise (AWS & Azure) | Role: Cloud Architect & Agile Product Manager

The Challenge

The client’s internal IT support desk processes approximately 400 inbound cases per week, handling critical alerts from Amazon CloudWatch and Azure Monitor as well as tickets created by employees worldwide. The volume spans typical DevOps concerns: memory exhaustion, load balancer latency, backup failures, certificate expirations, and budget anomalies.

The Bottleneck: Human operators faced a high "Time-to-First-Action" (approx. 10 minutes/ticket). Resolving a single alert required manually cross-referencing disparate data silos:

  • Project-specific Runbooks & Case History

  • Document Management Systems (DMS) & SharePoint Wikis

  • AWS/Azure Official Documentation

The Solution

I led the end-to-end delivery of a GenAI Support Assistant integrated directly into the ServiceNow support desk. The system automatically recommends a remediation plan based on a combination of internal and public knowledge.

Tech Stack: Amazon Bedrock (Claude Sonnet 4), Amazon SageMaker Studio, ServiceNow, RAG (Retrieval-Augmented Generation).

Core Execution: From "Unstructured" to "Actionable"

A critical success factor was transforming static, "non-queryable" internal knowledge into a dynamic vector database. We utilized standard best practices for this data transformation pipeline:

  • Ingestion & Connectors: We configured Amazon Bedrock Knowledge Bases with native connectors to sync directly with the client's SharePoint and internal Wikis. This replaced manual uploads with automated, periodic syncs (e.g., nightly crawls) to capture new runbook updates.

  • Chunking Strategy: Instead of ingesting whole documents, we implemented semantic chunking. For example, a 50-page "Disaster Recovery" PDF was split into distinct, topic-based segments (e.g., "Database Restore Steps" vs. "Frontend Failover"). This ensures the RAG retrieval fetches only the specific paragraph needed to solve the issue, not the entire manual (a configuration sketch follows this list).

  • Metadata Enrichment: We automated the tagging of chunks with project IDs and cloud provider tags (e.g., tag:AWS, tag:Production). This allowed the AI to filter context strictly to the relevant environment, preventing it from suggesting an Azure fix for an AWS problem (see the filtered-retrieval sketch below).
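
A minimal sketch of the ingestion side, assuming boto3 and Amazon Bedrock Knowledge Bases: it creates a data source with a semantic chunking strategy. The knowledge base ID and bucket ARN are placeholders, an S3 source stands in for the SharePoint/Wiki connectors (which follow the same call shape), and the chunking parameters are illustrative rather than the client's tuned values.

```python
import boto3

# Illustrative placeholders; not the client's real identifiers.
KB_ID = "KB_EXAMPLE_ID"
RUNBOOK_BUCKET_ARN = "arn:aws:s3:::example-runbook-bucket"

bedrock_agent = boto3.client("bedrock-agent")

# Create a data source whose documents are split into topic-based chunks,
# so retrieval later returns "Database Restore Steps" rather than the
# whole 50-page disaster recovery manual.
response = bedrock_agent.create_data_source(
    knowledgeBaseId=KB_ID,
    name="runbooks-semantic",
    dataSourceConfiguration={
        "type": "S3",  # SharePoint/Confluence connectors follow the same pattern
        "s3Configuration": {"bucketArn": RUNBOOK_BUCKET_ARN},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                "maxTokens": 300,                     # upper bound per chunk
                "bufferSize": 1,                      # sentences of surrounding context
                "breakpointPercentileThreshold": 95,  # how aggressively to split on topic shifts
            },
        }
    },
)
print(response["dataSource"]["dataSourceId"])
```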

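To show how the metadata tags constrain retrieval, here is a hedged sketch of a filtered query; the metadata keys (cloud, environment), the query text, and the knowledge base ID are assumed names for illustration, not the client's schema.

```python
import boto3

bedrock_rt = boto3.client("bedrock-agent-runtime")

# Restrict retrieval to chunks tagged for the AWS production environment,
# so an Azure runbook never surfaces for a CloudWatch alert.
result = bedrock_rt.retrieve(
    knowledgeBaseId="KB_EXAMPLE_ID",  # placeholder
    retrievalQuery={"text": "EC2 memory exhaustion on a backend service: remediation steps"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "filter": {
                "andAll": [
                    {"equals": {"key": "cloud", "value": "aws"}},
                    {"equals": {"key": "environment", "value": "production"}},
                ]
            },
        }
    },
)
for item in result["retrievalResults"]:
    print(item["content"]["text"][:120])
```
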
Project Timeline & Methodology

We employed a "Crawl, Walk, Run" phased delivery model to manage risk and ensure alignment.

Phase 1: Foundation & MVP (Crawl)

  • Requirements Discovery: Defined the "Golden Set" of Q&A pairs to benchmark performance (a scoring sketch follows this phase's task list).

  • Data Preparation: Arranged internal data for retrieval, setting up Bedrock Knowledge Base connectors.

  • Model Evaluation: Conducted comparative testing of foundation models to select the optimal balance of reasoning and cost (Selected: Claude Sonnet 4).

  • RAG Integration Testing: Tested the selected foundation model with RAG to verify it could accurately cite internal runbooks.

  • ServiceNow Integration: Built the API middleware to fetch AI recommendations and display them within the ServiceNow agent workspace (a middleware sketch follows this list).

  • MVP Release: Deployed the AI assistant to a pilot group of support agents.

    • Outcome: Reduced Time-to-First-Action by 4 minutes per ticket.
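
For the "Golden Set" benchmark, one simple way to score grounding is to check whether the expected runbook appears among the top-k retrieved sources for each curated question. The JSONL format, knowledge base ID, and hit-rate metric below are assumptions for illustration, not the client's actual evaluation harness.

```python
import json
import boto3

bedrock_rt = boto3.client("bedrock-agent-runtime")
KB_ID = "KB_EXAMPLE_ID"  # placeholder

def retrieval_hit_rate(golden_set_path: str, top_k: int = 5) -> float:
    """Fraction of golden questions whose expected runbook shows up in the top-k chunks."""
    hits, total = 0, 0
    with open(golden_set_path) as f:
        for line in f:
            example = json.loads(line)  # assumed: {"question": ..., "expected_source": ...}
            result = bedrock_rt.retrieve(
                knowledgeBaseId=KB_ID,
                retrievalQuery={"text": example["question"]},
                retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": top_k}},
            )
            sources = [
                r.get("location", {}).get("s3Location", {}).get("uri", "")
                for r in result["retrievalResults"]
            ]
            hits += any(example["expected_source"] in s for s in sources)
            total += 1
    return hits / total if total else 0.0

print(f"Golden-set retrieval hit rate: {retrieval_hit_rate('golden_set.jsonl'):.0%}")
```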

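The ServiceNow middleware can be approximated with the Knowledge Bases RetrieveAndGenerate API plus the ServiceNow Table API. The sketch below is illustrative, with a placeholder model ARN, instance URL, function name, and credentials; a production version would add proper secret management, retries, and citation formatting.

```python
import boto3
import requests

bedrock_rt = boto3.client("bedrock-agent-runtime")

# Illustrative placeholders; not the client's real identifiers.
KB_ID = "KB_EXAMPLE_ID"
MODEL_ARN = "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-sonnet-4"  # placeholder ARN
SNOW_INSTANCE = "https://example.service-now.com"
SNOW_AUTH = ("svc_ai_assistant", "use-a-secret-store")

def recommend_and_post(incident_sys_id: str, alert_summary: str) -> str:
    """Generate a RAG-grounded remediation plan and attach it to a ServiceNow incident."""
    rag = bedrock_rt.retrieve_and_generate(
        input={"text": f"Propose a remediation plan for this alert:\n{alert_summary}"},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {"knowledgeBaseId": KB_ID, "modelArn": MODEL_ARN},
        },
    )
    plan = rag["output"]["text"]

    # Post the recommendation as a work note via the ServiceNow Table API.
    resp = requests.patch(
        f"{SNOW_INSTANCE}/api/now/table/incident/{incident_sys_id}",
        auth=SNOW_AUTH,
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        json={"work_notes": f"[AI Assistant suggestion]\n{plan}"},
        timeout=30,
    )
    resp.raise_for_status()
    return plan
```
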
Phase 2: Trained Custom Model (Walk)

  • Fine-Tuning: Used Amazon SageMaker to fine-tune the model. By baking specific internal knowledge directly into the model weights via supervised fine-tuning (SFT), we reduced dependency on retrieval latency for common issues and improved tone consistency (a dataset-preparation sketch follows this list).

  • v2 Deployment: Integrated the fine-tuned model with ServiceNow.

    • Outcome: Increased accuracy for niche, client-specific edge cases.

    • Outcome: Reduced AI solution cost per support ticket.
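
As a rough illustration of the data preparation behind the fine-tuning (not the client's actual schema), resolved tickets can be flattened into prompt/completion JSONL records; the exact record format depends on the SageMaker training recipe or container used.

```python
import json

def build_sft_dataset(resolved_tickets: list[dict], out_path: str) -> int:
    """Convert resolved tickets into prompt/completion pairs for supervised fine-tuning.

    Each ticket dict is assumed (for illustration) to hold 'environment',
    'alert', and the remediation the human agent actually applied ('resolution').
    """
    count = 0
    with open(out_path, "w") as f:
        for t in resolved_tickets:
            record = {
                "prompt": (
                    f"Environment: {t['environment']}\n"
                    f"Alert: {t['alert']}\n"
                    "Recommend a remediation plan:"
                ),
                "completion": t["resolution"],
            }
            f.write(json.dumps(record) + "\n")
            count += 1
    return count

# Example usage with a single illustrative ticket.
n = build_sft_dataset(
    [{
        "environment": "AWS / Production",
        "alert": "CloudWatch: memory utilisation above 95% on a backend service",
        "resolution": "1. Confirm the leak via container metrics. 2. Recycle instances. 3. Raise the memory limit per the relevant runbook.",
    }],
    "sft_train.jsonl",
)
print(f"Wrote {n} training records")
```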

Phase 3: Feedback Loop & RLHF (Run)

  • Feedback Loop Implementation: Added "Approve/Reject" buttons for agents. Rejected recommendations prompted a required "Correct Answer" field to gather ground-truth data.

  • Dataset Curation: Aggregated the "Approved" responses and "Corrected" human feedback from the Phase 2 deployment into a training dataset (a curation sketch follows this list).

  • RLHF Fine-Tuning (v3): Used Amazon SageMaker to further fine-tune the model based on feedback loop responses (Reinforcement Learning from Human Feedback).
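
To make the feedback loop concrete, one way to curate the Approve/Reject data into a preference-style dataset (the kind consumed by RLHF or reward-model training jobs) is sketched below; the event schema and record format are assumptions, not the client's implementation.

```python
import json

def build_preference_dataset(feedback_events: list[dict], out_path: str) -> None:
    """Turn Approve/Reject feedback into preference records for RLHF-style tuning.

    Assumed event shape (illustrative): {'ticket': ..., 'ai_answer': ...,
    'verdict': 'approved' | 'rejected', 'correct_answer': ... (required on reject)}.
    """
    with open(out_path, "w") as f:
        for e in feedback_events:
            if e["verdict"] == "approved":
                # Approved answers are positive examples with no explicit negative.
                record = {"prompt": e["ticket"], "chosen": e["ai_answer"], "rejected": None}
            else:
                # Rejections pair the agent's correction (chosen) against the AI draft (rejected).
                record = {"prompt": e["ticket"], "chosen": e["correct_answer"], "rejected": e["ai_answer"]}
            f.write(json.dumps(record) + "\n")
```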

Business Impact

  • Operational Efficiency: 26+ man-hours saved per week (4 minutes saved × 400 cases ≈ 26.7 hours).

  • Knowledge Democratization: Junior agents can better resolve complex or project-specific alerts at the speed of senior engineers by leveraging the AI's "institutional memory."

  • Data-Driven Quality: The feedback loop permanently captures business knowledge, preventing information loss if senior staff turn over.