Chapter 21 — Prompting & Task Decomposition
Overview
System prompts, instruction design, and task graphs to improve reliability, cost, and speed.
Effective prompting is the foundation of reliable LLM systems. While the underlying model provides capabilities, prompt engineering determines how consistently and accurately those capabilities are applied. This chapter covers advanced prompting techniques, task decomposition strategies, and orchestration patterns that transform unreliable prototypes into production-ready systems.
Why Prompting Matters
Impact on Production Systems:
- Reliability: 60-80% improvement in task success through structured prompting
- Cost: 40-60% reduction via task decomposition and model routing
- Speed: 2-5x faster end-to-end results by running independent subtasks in parallel
- Maintainability: Centralized prompt management reduces technical debt
The Prompting Hierarchy
Understanding the layers of prompt engineering helps structure your approach:
    graph TB
        subgraph "Prompting Hierarchy"
            A[Task Definition] --> B[System Instructions]
            B --> C[Context & Examples]
            C --> D[Input Formatting]
            D --> E[Output Constraints]
            E --> F[Validation & Retry]
            B --> B1[Role Definition]
            B --> B2[Behavioral Guidelines]
            B --> B3[Constraints & Boundaries]
            C --> C1[Few-Shot Examples]
            C --> C2[Retrieved Context]
            C --> C3[Conversation History]
            D --> D1[Structured Input]
            D --> D2[Delimiters]
            D --> D3[Markdown Formatting]
            E --> E1[Output Format]
            E --> E2[JSON Schema]
            E --> E3[Length Limits]
            F --> F1[Schema Validation]
            F --> F2[Content Verification]
            F --> F3[Confidence Scoring]
        end
Core Prompting Techniques
1. Role + Constraints Pattern
Define the AI's role and operating boundaries explicitly.
Key Components:
| Component | Purpose | Example |
|---|---|---|
| Role Definition | Establish persona and expertise | "You are a customer support specialist for e-commerce" |
| Responsibilities | Define primary tasks | Answer questions, escalate issues, maintain tone |
| Constraints | Set boundaries | Never share PII, require approval for >$500 refunds |
| Fallback Behavior | Handle uncertainty | "Let me connect you with a specialist" |
Dynamic Prompt Construction Flow:
    graph TB
        A[User Request] --> B{User Tier?}
        B -->|Premium| C[Personalized Tone<br/>Proactive Assistance]
        B -->|Standard| D[Efficient Support<br/>Self-Service Focus]
        C --> E{System Load?}
        D --> E
        E -->|High Load| F[Quick Resolution Mode<br/>Self-service links<br/>Efficient triage]
        E -->|Normal Load| G[Thorough Mode<br/>Detailed explanations<br/>Relationship building]
        F --> H[Add Context]
        G --> H
        H --> I{Session History?}
        I -->|Yes| J[Include Summary<br/>Avoid Repetition]
        I -->|No| K[Fresh Context]
        J --> L[Final Prompt]
        K --> L
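A rough sketch of this construction flow is shown below; the role text, tier names, and `build_system_prompt` helper are illustrative, not a fixed API.

```python
BASE_ROLE = "You are a customer support specialist for an e-commerce company."
CONSTRAINTS = (
    "Never share PII. Refunds over $500 require human approval. "
    "If unsure, say: 'Let me connect you with a specialist.'"
)

def build_system_prompt(user_tier: str, high_load: bool, session_summary: str | None = None) -> str:
    """Assemble role, constraints, and situational guidance into one system prompt."""
    tone = (
        "Use a personalized tone and offer proactive assistance."
        if user_tier == "premium"
        else "Be efficient and point to self-service options where appropriate."
    )
    mode = (
        "Prioritize quick resolution and link to self-service articles."
        if high_load
        else "Give thorough, detailed explanations and build the relationship."
    )
    parts = [BASE_ROLE, tone, mode, f"Constraints: {CONSTRAINTS}"]
    if session_summary:
        parts.append(f"Earlier in this session: {session_summary}. Do not repeat yourself.")
    return "\n".join(parts)
```

Keeping the fragments as named constants makes each branch of the flow independently testable and versionable.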
2. Few-Shot Learning & Examples
Show the model what good looks like through carefully selected examples.
Few-Shot Strategy Comparison:
| Strategy | Example Count | When to Use | Accuracy Improvement | Cost Impact |
|---|---|---|---|---|
| Zero-Shot | 0 | Simple, well-defined tasks | Baseline | Lowest |
| Static Few-Shot | 3-5 fixed | Consistent task patterns | +15-25% | +10% tokens |
| Dynamic Few-Shot | 3-5 selected | Variable input patterns | +25-40% | +15% tokens |
| Multi-Domain | 8-12 diverse | Cross-domain tasks | +30-50% | +25% tokens |
Dynamic Example Selection Flow:
    graph LR
        A[User Query] --> B[Embed Query]
        B --> C[Example Pool<br/>Pre-embedded]
        C --> D[Calculate Similarity<br/>Cosine Distance]
        D --> E[Rank Examples]
        E --> F[Select Top-3<br/>Most Similar]
        F --> G[Build Prompt]
        G --> H[Task Description +<br/>Selected Examples +<br/>User Query]
        H --> I[LLM Processing]
        I --> J[Structured Output]
Example Selection Criteria:
- Similarity: Semantic closeness to user query (cosine similarity > 0.7)
- Diversity: Cover different aspects of the task (avoid redundant examples)
- Quality: Pre-validated gold standard examples only
- Recency: Prefer recent examples for evolving domains
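A minimal sketch of the selection flow above, assuming each pool entry is a dict with `input`, `output`, and a precomputed `embedding`, and that the query embedding comes from whatever embedding model you already use:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_examples(query_vec: np.ndarray, pool: list[dict], k: int = 3, min_sim: float = 0.7) -> list[dict]:
    """Rank pre-embedded examples by cosine similarity and keep the top-k above the threshold."""
    scored = sorted(((cosine(query_vec, ex["embedding"]), ex) for ex in pool),
                    key=lambda pair: pair[0], reverse=True)
    return [ex for sim, ex in scored[:k] if sim >= min_sim]

def build_few_shot_prompt(task: str, query: str, query_vec: np.ndarray, pool: list[dict]) -> str:
    """Assemble task description + selected examples + user query, as in the flow above."""
    shots = "\n\n".join(f"Input: {ex['input']}\nOutput: {ex['output']}"
                        for ex in select_examples(query_vec, pool))
    return f"{task}\n\n{shots}\n\nInput: {query}\nOutput:"
```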
3. Chain-of-Thought (CoT) Reasoning
Guide the model through step-by-step reasoning to improve accuracy on complex tasks.
CoT Technique Comparison:
| Technique | Complexity | Accuracy Gain | Latency Impact | Best For |
|---|---|---|---|---|
| Zero-Shot CoT | Low | +25-40% | +30% | General reasoning |
| Few-Shot CoT | Medium | +40-60% | +50% | Domain-specific logic |
| Self-Consistency | High | +50-75% | +200% | Critical decisions |
| Tree-of-Thoughts | Very High | +60-80% | +400% | Multi-path problems |
Chain-of-Thought Reasoning Flow:
    graph TB
        A[Complex Problem] --> B[CoT Prompt]
        B --> C[Step 1: Identify Facts]
        C --> D[Step 2: Determine Requirements]
        D --> E[Step 3: Evaluate Options]
        E --> F[Step 4: Apply Logic/Rules]
        F --> G[Step 5: Reach Conclusion]
        G --> H{Verify Logic?}
        H -->|Yes| I[Self-Check Steps]
        H -->|No| J[Final Answer]
        I --> K{Valid?}
        K -->|Yes| J
        K -->|No| C
When to Use CoT:
- Multi-step Problems: Requires breaking down into subtasks
- Logic Puzzles: Needs explicit reasoning chain
- Policy Application: Must evaluate rules and edge cases
- Numerical Reasoning: Benefits from shown calculations
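As a sketch, zero-shot CoT often amounts to little more than a reasoning instruction plus a delimited final answer; the template and the `ANSWER:` convention below are illustrative.

```python
def cot_prompt(problem: str) -> str:
    """Wrap a problem in a zero-shot chain-of-thought template."""
    return (
        f"{problem}\n\n"
        "Think step by step: identify the facts, determine the requirements, "
        "evaluate the options, apply the relevant rules, then state your conclusion.\n"
        "End with a line of the form 'ANSWER: <final answer>'."
    )

def parse_final_answer(response: str) -> str | None:
    """Pull the final answer out of the reasoning trace, ignoring the intermediate steps."""
    for line in reversed(response.splitlines()):
        if line.strip().upper().startswith("ANSWER:"):
            return line.split(":", 1)[1].strip()
    return None
```

Delimiting the answer keeps the reasoning trace available for audit while letting downstream code consume only the conclusion.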
4. Schema-Constrained Outputs
Enforce structured outputs with JSON schemas and validation for reliable data extraction.
Schema-Based Output Control:
    graph TB
        A[Define Schema<br/>Pydantic/JSON Schema] --> B[Generate Prompt<br/>with Schema]
        B --> C[LLM Generation]
        C --> D[Parse JSON]
        D --> E{Valid JSON?}
        E -->|No| F[Retry with Error]
        E -->|Yes| G{Schema Valid?}
        F --> C
        G -->|No| H[Retry with<br/>Validation Errors]
        G -->|Yes| I[Validated Output]
        H --> C
        I --> J[Use in Application]
Schema Strategy Comparison:
| Approach | Reliability | Latency | Cost | Best For |
|---|---|---|---|---|
| Text Instructions | 60-70% | Low | Low | Simple formats |
| JSON Mode | 85-95% | Low | Low | Structured data |
| Pydantic Validation | 95-99% | Medium | Medium | Critical extraction |
| Function Calling | 98-99% | Low | Medium | Tool integration |
| Guided Generation | 99%+ | High | High | Maximum reliability |
Essential Structured Extraction (Minimal Code):
# Single essential example - structured extraction with retry
import json

async def extract_with_schema(message: str, schema: dict) -> dict:
    """Extract structured data, retrying when the model returns invalid JSON."""
    prompt = f"Extract as JSON matching schema: {schema}\n\nInput: {message}"
    for attempt in range(3):
        response = await llm.complete(prompt, temperature=0.0)
        try:
            return json.loads(response)  # parse and return on success
        except json.JSONDecodeError as err:
            prompt += f"\n\nPrevious output was invalid JSON ({err}). Respond with valid JSON only."
    raise ValueError("Failed to produce valid JSON after 3 retries")
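For the Pydantic row in the comparison table, a sketch along the same lines (the `RefundRequest` model and the `llm.complete` client are illustrative assumptions, as above) feeds concrete validation errors back into the retry prompt:

```python
from pydantic import BaseModel, ValidationError

class RefundRequest(BaseModel):
    order_id: str
    amount: float
    reason: str

async def extract_refund(message: str, retries: int = 3) -> RefundRequest:
    """Parse the model's JSON and validate it against the Pydantic model, retrying with errors."""
    prompt = (
        f"Extract a refund request as JSON with fields "
        f"{list(RefundRequest.model_fields)}.\n\nInput: {message}"
    )
    for _ in range(retries):
        response = await llm.complete(prompt, temperature=0.0)  # hypothetical client, as above
        try:
            return RefundRequest.model_validate_json(response)
        except ValidationError as err:
            # Feed the concrete validation errors back so the next attempt can self-correct
            prompt += f"\n\nYour last output failed validation: {err}. Return corrected JSON only."
    raise ValueError("Extraction failed schema validation after retries")
```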
5. Tool-Use and Function Calling
Integrate LLMs with external tools for deterministic operations and real-time data access.
Tool-Use Architecture:
    graph TB
        A[User Query] --> B[LLM Reasoning]
        B --> C{Need Tool?}
        C -->|Yes| D[Select Tool]
        C -->|No| E[Direct Answer]
        D --> F[Extract Parameters]
        F --> G[Validate Args]
        G --> H{Valid?}
        H -->|No| I[Request Clarification]
        H -->|Yes| J[Execute Tool]
        I --> A
        J --> K[Tool Result]
        K --> L[Add to Context]
        L --> B
        B --> M{Complete?}
        M -->|No| C
        M -->|Yes| N[Final Response]
Tool Integration Comparison:
| Approach | Reliability | Setup Complexity | Best For |
|---|---|---|---|
| Native Function Calling | 95-98% | Low | Modern LLMs (GPT-4, Claude) |
| Prompt-Based Tool Use | 75-85% | Medium | Legacy models, custom tools |
| ReAct Pattern | 85-95% | Medium | Multi-step reasoning |
| LangChain Agents | 80-90% | Low | Rapid prototyping |
Tool Access Control:
| Control Layer | Purpose | Implementation |
|---|---|---|
| Whitelist | Allowed tools only | Registry of approved functions |
| Parameter Validation | Sanitize inputs | Schema validation before execution |
| Rate Limiting | Prevent abuse | Per-tool request limits |
| Audit Logging | Track usage | Log all tool invocations |
| Privilege Checks | User permissions | Role-based access control |
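A minimal sketch of prompt-based tool use with a whitelist registry, parameter validation, and audit logging; the `get_order_status` tool and the call format are illustrative, and native function calling would delegate the call format to the provider.

```python
import json

def get_order_status(order_id: str) -> dict:
    """Example tool; in production this would call the order service."""
    return {"order_id": order_id, "status": "shipped"}

TOOL_REGISTRY = {  # whitelist: only registered tools can be invoked
    "get_order_status": {"fn": get_order_status, "params": {"order_id": str}},
}

def execute_tool_call(call_json: str) -> dict:
    """Validate a model-proposed tool call against the registry, then execute it."""
    call = json.loads(call_json)  # e.g. {"tool": "get_order_status", "args": {"order_id": "A123"}}
    spec = TOOL_REGISTRY.get(call.get("tool"))
    if spec is None:
        return {"error": f"unknown tool {call.get('tool')!r}"}
    args = call.get("args", {})
    for name, expected_type in spec["params"].items():
        if name not in args or not isinstance(args[name], expected_type):
            return {"error": f"invalid or missing parameter {name!r}"}
    result = spec["fn"](**args)
    print(f"AUDIT tool={call['tool']} args={args}")  # audit log of every invocation
    return result
```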
Task Decomposition Patterns
Break complex tasks into manageable, verifiable steps for improved reliability and traceability.
Decomposition Strategy Comparison
| Pattern | Complexity | Accuracy | Latency | Cost | Best For |
|---|---|---|---|---|---|
| Planner-Solver | Medium | +40-60% | +100% | +80% | Multi-step tasks |
| ReAct (Reason+Act) | Medium | +30-50% | +80% | +70% | Tool-heavy workflows |
| Reflection | High | +50-70% | +150% | +120% | Quality-critical tasks |
| Multi-Agent Debate | High | +60-80% | +200% | +200% | Strategic decisions |
| Task Graph/DAG | Very High | +70-90% | +120% | +100% | Complex workflows |
1. Planner-Solver Pattern
Separate planning from execution for better error handling and verification.
    graph TB
        A[Complex Task] --> B[Planner Agent]
        B --> C[Generate Step-by-Step Plan]
        C --> D[Plan Validation]
        D -->|Valid| E[Solver Agent Loop]
        D -->|Invalid| B
        E --> F[Execute Step 1]
        F --> G{Step Success?}
        G -->|Yes| H[Execute Step 2]
        G -->|No| I[Retry or Replan]
        I --> E
        H --> J{More Steps?}
        J -->|Yes| E
        J -->|No| K[Final Output]
        K --> L[Result Verification]
        L -->|Pass| M[Return Result]
        L -->|Fail| I
Key Advantages:
- Separation of Concerns: Planning vs execution
- Transparency: Clear step-by-step breakdown
- Debuggability: Identify which step failed
- Adaptability: Replan when steps fail
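A compact planner-solver sketch, reusing the hypothetical async `llm.complete` client from the extraction example: the planner emits a JSON list of steps, and the solver executes and verifies them one at a time, replanning on failure.

```python
import json

async def plan_and_solve(task: str, max_replans: int = 2) -> list[str]:
    """Planner-solver loop: generate a plan, execute each step, replan on failure."""
    for _ in range(max_replans + 1):
        plan_json = await llm.complete(
            f"Break this task into a JSON list of short, concrete steps:\n{task}",
            temperature=0.0,
        )
        try:
            steps = json.loads(plan_json)
        except json.JSONDecodeError:
            continue  # planner produced malformed output; plan again
        results: list[str] = []
        for i, step in enumerate(steps, start=1):
            output = await llm.complete(
                f"Task: {task}\nCompleted so far: {results}\nExecute step {i}: {step}"
            )
            verdict = await llm.complete(
                f"Did this output accomplish '{step}'? Answer yes or no.\n\n{output}",
                temperature=0.0,
            )
            if not verdict.strip().lower().startswith("yes"):
                break  # step failed: fall through to replanning
            results.append(output)
        else:
            return results  # every step executed and verified
    raise RuntimeError("Plan could not be completed after replanning")
```

Keeping plan generation and step execution as separate calls is what makes failures attributable to a specific step rather than to the whole task.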
2. Reflection and Self-Critique
Add verification loops to improve output quality through iterative refinement.
Reflection Pattern Flow:
    graph TB
        A[Task Input] --> B[Generate Initial Output]
        B --> C[Self-Critique]
        C --> D{Quality Score}
        D -->|< Threshold| E[Generate Improvements]
        D -->|≥ Threshold| F[Accept Output]
        E --> G[Refine Based on Critique]
        G --> C
        F --> H{Max Iterations?}
        H -->|No| I[Continue if Needed]
        H -->|Yes| J[Return Best Output]
Reflection Use Cases:
| Use Case | Iterations | Quality Gain | When to Use |
|---|---|---|---|
| Draft Review | 2-3 | +25-40% | Content generation, reports |
| Code Review | 3-5 | +40-60% | Complex logic, algorithms |
| Fact Checking | 2-4 | +50-70% | Research, analysis |
| Creative Refinement | 3-6 | +30-50% | Marketing, storytelling |
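A minimal reflection-loop sketch under the same assumptions (hypothetical `llm.complete` client; the 1-10 scoring convention is illustrative):

```python
async def reflect_and_refine(task: str, max_iterations: int = 3, threshold: int = 8) -> str:
    """Generate, self-critique on a 1-10 scale, and refine until the score clears the threshold."""
    draft = await llm.complete(task)
    for _ in range(max_iterations):
        critique = await llm.complete(
            f"Critique this answer to '{task}'. List concrete problems, "
            f"then end with a line 'SCORE: <1-10>'.\n\n{draft}",
            temperature=0.0,
        )
        try:
            score = int(critique.strip().splitlines()[-1].split(":", 1)[1])
        except (IndexError, ValueError):
            score = 0  # unparseable critique: treat as below threshold
        if score >= threshold:
            return draft
        draft = await llm.complete(
            f"Task: {task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
            "Rewrite the draft, fixing every issue raised in the critique."
        )
    return draft  # best effort after the iteration cap
```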
3. Multi-Agent Debate
Use multiple perspectives to reach better, more robust answers.
Debate Consensus Process:
    graph TB
        A[Question] --> B[Agent 1<br/>Initial Position]
        A --> C[Agent 2<br/>Initial Position]
        A --> D[Agent 3<br/>Initial Position]
        B --> E[Round 1: Review<br/>Other Positions]
        C --> E
        D --> E
        E --> F[Refine Position<br/>Based on Arguments]
        F --> G{More Rounds?}
        G -->|Yes| E
        G -->|No| H[Moderator<br/>Synthesize Consensus]
        H --> I{Agreement?}
        I -->|High| J[Confident Answer]
        I -->|Medium| K[Qualified Answer]
        I -->|Low| L[Present Trade-offs]
Debate Configuration:
| Agents | Rounds | Improvement | Cost | Best For |
|---|---|---|---|---|
| 2 | 2 | +20-30% | +100% | Quick validation |
| 3 | 3 | +40-60% | +200% | Strategic decisions |
| 5 | 3 | +60-80% | +400% | Critical analysis |
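A compact debate sketch with independent positions, refinement rounds, and a moderator synthesis; agent count, round count, and prompt wording are all illustrative, and the `llm.complete` client is the same hypothetical one used earlier.

```python
async def debate(question: str, n_agents: int = 3, rounds: int = 2) -> str:
    """Independent positions, peer-review rounds, then a moderator synthesis."""
    positions = [
        await llm.complete(f"As expert #{i + 1}, answer with your reasoning:\n{question}")
        for i in range(n_agents)
    ]
    for _ in range(rounds):
        others = "\n---\n".join(positions)  # every agent sees all current positions
        positions = [
            await llm.complete(
                f"Question: {question}\nAll current positions:\n{others}\n"
                f"You wrote position #{i + 1}. Refine it, conceding strong counter-arguments."
            )
            for i in range(n_agents)
        ]
    final_positions = "\n---\n".join(positions)
    return await llm.complete(
        "As a neutral moderator, synthesize these positions into one answer, "
        f"noting any remaining disagreements:\n{final_positions}"
    )
```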
Task Orchestration with Graphs
Model complex workflows as directed acyclic graphs (DAGs) for production-grade orchestration.
Example: Customer Support Workflow DAG:
    graph TB
        A[User Request] --> B{Intent Classification}
        B -->|Refund| C[Verify Order]
        B -->|Status| D[Lookup Order]
        B -->|Complaint| E[Sentiment Analysis]
        C --> F{Order Eligible?}
        F -->|Yes| G[Calculate Refund]
        F -->|No| H[Explain Policy]
        G --> I{Amount > $500?}
        I -->|Yes| J[Human Approval]
        I -->|No| K[Process Refund]
        D --> L[Format Status]
        E --> M{Severity?}
        M -->|High| N[Priority Escalation]
        M -->|Low| O[Standard Response]
        J --> P{Approved?}
        P -->|Yes| K
        P -->|No| H
        K --> Q[Send Confirmation]
        L --> Q
        N --> Q
        O --> Q
        H --> Q
        Q --> R[Log Interaction]
        R --> S[Return to User]
Task Graph Node Types:
| Node Type | Purpose | Execution | Example |
|---|---|---|---|
| LLM | AI reasoning/generation | Async LLM call | Intent classification, summarization |
| Tool | External API/database | Function execution | Order lookup, payment processing |
| Decision | Routing logic | Sync evaluation | Route by intent, eligibility check |
| Human | Manual review | Pause for input | Approval for large refunds |
| Parallel | Concurrent execution | Multi-threaded | Fetch multiple data sources |
Graph Execution Strategies:
| Strategy | Complexity | Fault Tolerance | Best For |
|---|---|---|---|
| Sequential | Low | Basic retry | Simple pipelines |
| Parallel | Medium | Independent failure | Data aggregation |
| Conditional | Medium | Branch-specific | Business logic |
| Recursive | High | Nested retry | Hierarchical tasks |
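A minimal DAG-executor sketch: nodes are async callables keyed by name, edges are a dependency map, and nodes whose dependencies are satisfied run concurrently. The three example nodes are illustrative stand-ins for LLM and tool calls.

```python
import asyncio

async def run_dag(nodes: dict, deps: dict[str, list[str]]) -> dict:
    """Execute async node functions in dependency order, running independent nodes concurrently."""
    results: dict[str, object] = {}
    remaining = set(nodes)
    while remaining:
        # A node is ready once every dependency has produced a result
        ready = [n for n in remaining if all(d in results for d in deps.get(n, []))]
        if not ready:
            raise ValueError("Cycle or unsatisfiable dependency in task graph")
        outputs = await asyncio.gather(*(nodes[n](results) for n in ready))
        results.update(dict(zip(ready, outputs)))
        remaining -= set(ready)
    return results

# Illustrative wiring: classify intent, then look up the order and analyze sentiment in parallel
async def classify(results): return "refund"
async def lookup_order(results): return {"order": "A123", "intent": results["classify"]}
async def sentiment(results): return "neutral"

# asyncio.run(run_dag(
#     {"classify": classify, "lookup_order": lookup_order, "sentiment": sentiment},
#     {"lookup_order": ["classify"], "sentiment": ["classify"]},
# ))
```

Production orchestrators add per-node retries, timeouts, and checkpoints, but the scheduling core is this ready-set loop.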
Determinism and Control
Control randomness for consistent, reliable outputs based on task requirements.
Temperature Control by Task Type:
| Task Type | Temperature | Rationale | Example |
|---|---|---|---|
| Extraction | 0.0 | Deterministic output required | Data extraction, parsing |
| Classification | 0.0 | Consistent categorization | Intent detection, routing |
| Structured Output | 0.0 | Exact JSON/schema adherence | API responses, databases |
| Summarization | 0.3 | Light variation acceptable | Document summarization |
| Rewriting | 0.5 | Moderate creativity | Content adaptation |
| Brainstorming | 0.8 | High diversity desired | Idea generation |
| Creative Writing | 0.9 | Maximum variation | Stories, marketing copy |
Consensus Strategies for Quality:
| Strategy | Samples | Cost | Quality Gain | When to Use |
|---|---|---|---|---|
| Single (temp=0) | 1 | 1x | Baseline | Deterministic tasks |
| Best-of-N | 3-5 | 3-5x | +15-25% | Classification, scoring |
| Majority Vote | 5-7 | 5-7x | +25-40% | Binary decisions |
| LLM Synthesis | 3-5 | 6-10x | +30-50% | Creative tasks |
| Mixture of Models | 3-5 | Varies | +40-60% | Critical decisions |
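A minimal majority-vote sketch from the table above, again assuming the hypothetical `llm.complete` client; best-of-N would instead score each sample with a judge prompt and keep the highest-scoring one.

```python
from collections import Counter

async def majority_vote(prompt: str, n_samples: int = 5, temperature: float = 0.7) -> str:
    """Sample the classifier several times and return the most common label."""
    votes = [
        (await llm.complete(prompt, temperature=temperature)).strip().lower()
        for _ in range(n_samples)
    ]
    label, count = Counter(votes).most_common(1)[0]
    if count <= n_samples // 2:
        # No clear majority: fall back to a deterministic single call (or route to human review)
        return (await llm.complete(prompt, temperature=0.0)).strip().lower()
    return label
```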
Deliverables
Prompt Library Management
Prompt Lifecycle:
    graph LR
        A[Create Prompt v1.0] --> B[Test on Dev Data]
        B --> C{Performance OK?}
        C -->|No| D[Iterate & Create v1.1]
        C -->|Yes| E[Deploy to Production]
        D --> B
        E --> F[Monitor Metrics]
        F --> G{Degradation?}
        G -->|Yes| H[Create v2.0<br/>with Improvements]
        G -->|No| F
        H --> B
Prompt Versioning Strategy:
| Element | Purpose | Example |
|---|---|---|
| Name | Unique identifier | customer_intent_classifier |
| Version | Track iterations | v2.3 |
| Metadata | Context & ownership | Author, task type, model |
| Template | Parameterized prompt | "Classify: {message}" |
| Test Cases | Validation suite | 50+ examples per version |
| Performance | Success metrics | Accuracy, latency, cost |
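A minimal sketch of a versioned prompt registry covering the elements above; class names and fields are illustrative, and a production system would back this with a database or a prompt-management service.

```python
from dataclasses import dataclass, field

@dataclass
class PromptVersion:
    name: str
    version: str
    template: str                                   # parameterized prompt, e.g. "Classify: {message}"
    metadata: dict = field(default_factory=dict)    # author, task type, target model
    metrics: dict = field(default_factory=dict)     # accuracy, latency, cost from eval runs

class PromptLibrary:
    def __init__(self):
        self._prompts: dict[tuple[str, str], PromptVersion] = {}

    def register(self, prompt: PromptVersion) -> None:
        self._prompts[(prompt.name, prompt.version)] = prompt

    def render(self, name: str, version: str, **params) -> str:
        """Fetch a specific version and fill in its template parameters."""
        return self._prompts[(name, version)].template.format(**params)

library = PromptLibrary()
library.register(PromptVersion(
    name="customer_intent_classifier",
    version="v2.3",
    template="Classify the intent of this message as refund, status, or complaint:\n{message}",
    metadata={"author": "ml-platform", "model": "gpt-4o-mini"},
))
prompt = library.render("customer_intent_classifier", "v2.3", message="Where is my order?")
```

Pinning callers to an explicit (name, version) pair is what makes A/B tests and rollbacks of prompts possible without code changes.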
Task Graph Specification
Workflow Documentation Template:
| Section | Contents | Purpose |
|---|---|---|
| Metadata | Name, version, description | Identification |
| Nodes | ID, type, action, parameters | Execution logic |
| Edges | Connections, conditions | Flow control |
| Guardrails | Input/output validation, safety | Risk mitigation |
| Monitoring | Metrics, thresholds, alerts | Observability |
| SLAs | Latency, accuracy targets | Quality gates |
Example Workflow Spec (YAML):
name: customer_support_refund
version: 2.1.0
nodes:
- id: classify_intent
type: llm
model: gpt-4o-mini
temperature: 0.0
timeout: 5s
- id: route_decision
type: decision
logic: "intent == 'refund' ? verify_order : general_response"
- id: verify_order
type: tool
function: check_eligibility
- id: human_approval
type: human
required_if: "amount > $500"
sla: 30min
guardrails:
- pii_detection
- output_validation
monitoring:
latency_p95: 2s
error_rate: <1%
Why It Matters
Good prompting and decomposition increase reliability and reduce cost.
Impact Metrics:
| Technique | Reliability Improvement | Cost Reduction | Latency Impact |
|---|---|---|---|
| Role + Constraints | +15-25% | 0% | 0% |
| Few-Shot Examples | +20-35% | 0-5% increase | +10-20% |
| Chain-of-Thought | +30-50% | 5-15% increase | +20-40% |
| Schema Constraints | +40-60% | 0% | +5-10% |
| Tool Integration | +50-70% | 20-40% decrease | 10-30% decrease |
| Task Decomposition | +60-80% | 10-30% decrease | +15-35% |
| Reflection | +25-40% | 30-60% increase | +50-100% |
Common Pitfalls:
- Over-prompting: Overly verbose prompts that confuse rather than clarify
- Under-specification: Vague instructions leading to inconsistent outputs
- Missing error handling: No graceful degradation when steps fail
- Rigid workflows: Can't adapt to edge cases
- No verification: Trusting outputs without validation
Case Study: Legal Document Summarization with Verification
Challenge: A legal tech company with 500+ attorneys needed to summarize court documents with 99%+ factual accuracy. Initial GPT-4 implementation achieved only 87% accuracy with 23% hallucination rate, causing trust issues and manual review overhead.
Initial State (GPT-4 Zero-Shot):
- Accuracy: 87%
- Hallucination Rate: 23%
- Citations: 0%
- Cost: $25/document
- Latency: 1.2s
- Attorney Trust: Low (62% required full re-review)
Solution Architecture:
    graph TB
        A[Court Document] --> B[Stage 1:<br/>Extract Key Facts]
        B --> C[Stage 2:<br/>Generate Summary<br/>from Facts]
        C --> D[Stage 3:<br/>Multi-Method<br/>Verification]
        D --> E{Accuracy<br/>> 95%?}
        E -->|No| F[Stage 4:<br/>Refine with Issues]
        E -->|Yes| G[Stage 5:<br/>Add Citations]
        F --> D
        G --> H[Final Output]
        subgraph "Verification Methods"
            D1[Claim-by-Claim<br/>vs Facts]
            D2[LLM-as-Judge<br/>Verification]
            D3[Entailment<br/>Checking]
        end
        D --> D1
        D --> D2
        D --> D3
Implemented Techniques:
| Stage | Technique | Purpose | Impact |
|---|---|---|---|
| 1. Fact Extraction | Chain-of-Thought prompting | Identify verifiable claims | +15% accuracy |
| 2. Summary Generation | Few-shot with legal examples | Generate from facts only | +20% accuracy |
| 3. Verification | Multi-method checking | Detect hallucinations | +25% accuracy |
| 4. Refinement | Iterative improvement | Fix identified issues | +10% accuracy |
| 5. Citation | Source attribution | Build attorney trust | +100% adoption |
Results After Implementation:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Accuracy | 87% | 99.1% | +12.1 pts |
| Hallucination Rate | 23% | 0.9% | -96% |
| Citations | 0% | 100% | Perfect |
| Cost per Document | $25 | $36 | +44% |
| Latency | 1.2s | 3.8s | +217% |
| Attorney Trust | 62% | 94% | +52% |
| Re-review Rate | 100% | 8% | -92% |
Business Impact:
- Time Savings: 15 hours/week per attorney (85% less review time)
- Productivity: 3x more documents processed per day
- ROI: 340% in first 6 months
- Cost Savings: $180K/year in attorney time (despite +44% LLM cost)
- Quality: Zero legal errors in production use
Key Learnings:
- Multi-Stage Beats Single-Shot: 99.1% vs 87% accuracy through decomposition
- Verification is Critical: Multiple methods catch different error types
- Citations Build Trust: 100% citation rate enabled 94% adoption
- Cost vs Quality: +44% cost acceptable for high-stakes use cases
- Latency Tolerance: 3.8s acceptable for batch workflows, not real-time
Implementation Checklist
Prompt Engineering
- Define clear roles and constraints for each prompt
- Create few-shot example library organized by task type
- Implement dynamic example selection based on similarity
- Add output format specifications (JSON schemas)
- Include error handling instructions in prompts
- Test prompts across temperature settings (0.0, 0.3, 0.7)
- Document prompt versions and performance metrics
Task Decomposition
- Identify complex tasks that benefit from decomposition
- Design task graphs with clear inputs/outputs per node
- Implement planner-solver pattern for multi-step tasks
- Add verification nodes after critical steps
- Include human-in-the-loop checkpoints for high-stakes decisions
- Define retry policies and fallback strategies
- Set up parallel execution for independent sub-tasks
Tool Integration
- Define tool/function schemas with clear descriptions
- Implement tool registry with error handling
- Add tool result validation
- Create tool-use examples in prompts
- Monitor tool call accuracy and latency
- Implement tool-first routing for deterministic operations
Verification & Quality
- Implement output schema validation
- Add LLM-as-judge verification for quality
- Create fact-checking pipelines for factual tasks
- Set up reflection loops for iterative improvement
- Define confidence scoring mechanisms
- Establish quality thresholds for human escalation
Monitoring & Optimization
- Track prompt performance metrics (accuracy, latency, cost)
- A/B test prompt variations
- Monitor task graph execution paths
- Identify bottlenecks and optimization opportunities
- Set up alerts for quality degradation
- Create feedback loops from production to prompt library
Documentation
- Document prompt library with versions and metadata
- Create task graph specifications in YAML/JSON
- Define standard operating procedures for each workflow
- Maintain runbooks for common failure modes
- Document best practices and anti-patterns
- Create onboarding guide for prompt engineering