Part 4: Generative AI & LLM Consulting

Chapter 21: Prompting & Task Decomposition


Overview

System prompts, instruction design, and task graphs that improve reliability, reduce cost, and speed up execution.

Effective prompting is the foundation of reliable LLM systems. While the underlying model provides capabilities, prompt engineering determines how consistently and accurately those capabilities are applied. This chapter covers advanced prompting techniques, task decomposition strategies, and orchestration patterns that transform unreliable prototypes into production-ready systems.

Why Prompting Matters

Impact on Production Systems:

  • Reliability: 60-80% improvement in task success through structured prompting
  • Cost: 40-60% reduction via task decomposition and model routing
  • Speed: 2-5x faster completion through parallel execution of independent sub-tasks
  • Maintainability: Centralized prompt management reduces technical debt

The Prompting Hierarchy

Understanding the layers of prompt engineering helps structure your approach:

graph TB
    subgraph "Prompting Hierarchy"
        A[Task Definition] --> B[System Instructions]
        B --> C[Context & Examples]
        C --> D[Input Formatting]
        D --> E[Output Constraints]
        E --> F[Validation & Retry]
        B --> B1[Role Definition]
        B --> B2[Behavioral Guidelines]
        B --> B3[Constraints & Boundaries]
        C --> C1[Few-Shot Examples]
        C --> C2[Retrieved Context]
        C --> C3[Conversation History]
        D --> D1[Structured Input]
        D --> D2[Delimiters]
        D --> D3[Markdown Formatting]
        E --> E1[Output Format]
        E --> E2[JSON Schema]
        E --> E3[Length Limits]
        F --> F1[Schema Validation]
        F --> F2[Content Verification]
        F --> F3[Confidence Scoring]
    end

Core Prompting Techniques

1. Role + Constraints Pattern

Define the AI's role and operating boundaries explicitly.

Key Components:

| Component | Purpose | Example |
| --- | --- | --- |
| Role Definition | Establish persona and expertise | "You are a customer support specialist for e-commerce" |
| Responsibilities | Define primary tasks | Answer questions, escalate issues, maintain tone |
| Constraints | Set boundaries | Never share PII, require approval for >$500 refunds |
| Fallback Behavior | Handle uncertainty | "Let me connect you with a specialist" |
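
As a concrete illustration, here is a minimal sketch of assembling a system prompt from these components. The build_system_prompt helper and its inputs are illustrative, not a specific library API:

# Illustrative sketch: compose a system prompt from role, responsibilities, constraints, fallback
def build_system_prompt(role: str, responsibilities: list[str],
                        constraints: list[str], fallback: str) -> str:
    """Assemble a role + constraints system prompt from its components."""
    sections = [
        f"You are {role}.",
        "Your responsibilities:\n" + "\n".join(f"- {r}" for r in responsibilities),
        "Constraints you must follow:\n" + "\n".join(f"- {c}" for c in constraints),
        f'If you are unsure how to proceed, respond with: "{fallback}"',
    ]
    return "\n\n".join(sections)

system_prompt = build_system_prompt(
    role="a customer support specialist for an e-commerce company",
    responsibilities=["Answer order and shipping questions",
                      "Escalate unresolved issues",
                      "Maintain a professional, empathetic tone"],
    constraints=["Never share personally identifiable information",
                 "Require human approval for refunds over $500"],
    fallback="Let me connect you with a specialist",
)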

Dynamic Prompt Construction Flow:

graph TB
    A[User Request] --> B{User Tier?}
    B -->|Premium| C[Personalized Tone<br/>Proactive Assistance]
    B -->|Standard| D[Efficient Support<br/>Self-Service Focus]
    C --> E{System Load?}
    D --> E
    E -->|High Load| F[Quick Resolution Mode<br/>Self-service links<br/>Efficient triage]
    E -->|Normal Load| G[Thorough Mode<br/>Detailed explanations<br/>Relationship building]
    F --> H[Add Context]
    G --> H
    H --> I{Session History?}
    I -->|Yes| J[Include Summary<br/>Avoid Repetition]
    I -->|No| K[Fresh Context]
    J --> L[Final Prompt]
    K --> L

2. Few-Shot Learning & Examples

Show the model what good looks like through carefully selected examples.

Few-Shot Strategy Comparison:

| Strategy | Example Count | When to Use | Accuracy Improvement | Cost Impact |
| --- | --- | --- | --- | --- |
| Zero-Shot | 0 | Simple, well-defined tasks | Baseline | Lowest |
| Static Few-Shot | 3-5 fixed | Consistent task patterns | +15-25% | +10% tokens |
| Dynamic Few-Shot | 3-5 selected | Variable input patterns | +25-40% | +15% tokens |
| Multi-Domain | 8-12 diverse | Cross-domain tasks | +30-50% | +25% tokens |

Dynamic Example Selection Flow:

graph LR
    A[User Query] --> B[Embed Query]
    B --> C[Example Pool<br/>Pre-embedded]
    C --> D[Calculate Similarity<br/>Cosine Distance]
    D --> E[Rank Examples]
    E --> F[Select Top-3<br/>Most Similar]
    F --> G[Build Prompt]
    G --> H[Task Description +<br/>Selected Examples +<br/>User Query]
    H --> I[LLM Processing]
    I --> J[Structured Output]

Example Selection Criteria:

  • Similarity: Semantic closeness to user query (cosine similarity > 0.7)
  • Diversity: Cover different aspects of the task (avoid redundant examples)
  • Quality: Pre-validated gold standard examples only
  • Recency: Prefer recent examples for evolving domains
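
A minimal sketch of the selection step under these criteria, assuming an example pool where each entry already carries an "embedding" vector (from any embedding API) and a pre-computed query embedding:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_examples(query_vec: list[float], example_pool: list[dict],
                    k: int = 3, min_similarity: float = 0.7) -> list[dict]:
    """Rank pre-embedded examples by similarity and keep the top-k above the threshold."""
    scored = [(cosine_similarity(query_vec, ex["embedding"]), ex) for ex in example_pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for score, ex in scored[:k] if score >= min_similarity]

def build_few_shot_prompt(task: str, examples: list[dict], user_query: str) -> str:
    """Assemble task description + selected examples + user query."""
    shots = "\n\n".join(f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in examples)
    return f"{task}\n\n{shots}\n\nInput: {user_query}\nOutput:"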

3. Chain-of-Thought (CoT) Reasoning

Guide the model through step-by-step reasoning to improve accuracy on complex tasks.

CoT Technique Comparison:

| Technique | Complexity | Accuracy Gain | Latency Impact | Best For |
| --- | --- | --- | --- | --- |
| Zero-Shot CoT | Low | +25-40% | +30% | General reasoning |
| Few-Shot CoT | Medium | +40-60% | +50% | Domain-specific logic |
| Self-Consistency | High | +50-75% | +200% | Critical decisions |
| Tree-of-Thoughts | Very High | +60-80% | +400% | Multi-path problems |

Chain-of-Thought Reasoning Flow:

graph TB
    A[Complex Problem] --> B[CoT Prompt]
    B --> C[Step 1: Identify Facts]
    C --> D[Step 2: Determine Requirements]
    D --> E[Step 3: Evaluate Options]
    E --> F[Step 4: Apply Logic/Rules]
    F --> G[Step 5: Reach Conclusion]
    G --> H{Verify Logic?}
    H -->|Yes| I[Self-Check Steps]
    H -->|No| J[Final Answer]
    I --> K{Valid?}
    K -->|Yes| J
    K -->|No| C

When to Use CoT:

  • Multi-step Problems: Requires breaking down into subtasks
  • Logic Puzzles: Needs explicit reasoning chain
  • Policy Application: Must evaluate rules and edge cases
  • Numerical Reasoning: Benefits from shown calculations
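
For illustration, a minimal zero-shot CoT sketch plus a self-consistency variant that samples several reasoning chains and majority-votes the final answers. It assumes the same async llm.complete client used in the extraction example in the next section:

import asyncio
from collections import Counter

COT_SUFFIX = ("\n\nThink through this step by step, then give the final answer "
              "on a line starting with 'Answer:'.")

def parse_answer(response: str) -> str:
    """Pull the final answer line out of a reasoning chain."""
    for line in reversed(response.splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return response.strip()

async def self_consistent_answer(question: str, samples: int = 5) -> str:
    """Sample several CoT chains at higher temperature and take the majority answer."""
    prompt = question + COT_SUFFIX
    responses = await asyncio.gather(
        *(llm.complete(prompt, temperature=0.7) for _ in range(samples))  # assumed client
    )
    answers = [parse_answer(r) for r in responses]
    return Counter(answers).most_common(1)[0][0]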

4. Schema-Constrained Outputs

Enforce structured outputs with JSON schemas and validation for reliable data extraction.

Schema-Based Output Control:

graph TB
    A[Define Schema<br/>Pydantic/JSON Schema] --> B[Generate Prompt<br/>with Schema]
    B --> C[LLM Generation]
    C --> D[Parse JSON]
    D --> E{Valid JSON?}
    E -->|No| F[Retry with Error]
    E -->|Yes| G{Schema Valid?}
    F --> C
    G -->|No| H[Retry with<br/>Validation Errors]
    G -->|Yes| I[Validated Output]
    H --> C
    I --> J[Use in Application]

Schema Strategy Comparison:

| Approach | Reliability | Latency | Cost | Best For |
| --- | --- | --- | --- | --- |
| Text Instructions | 60-70% | Low | Low | Simple formats |
| JSON Mode | 85-95% | Low | Low | Structured data |
| Pydantic Validation | 95-99% | Medium | Medium | Critical extraction |
| Function Calling | 98-99% | Low | Medium | Tool integration |
| Guided Generation | 99%+ | High | High | Maximum reliability |

Essential Structured Extraction (Minimal Code):

import json

# Single essential example: structured extraction with retry
# (assumes an async llm.complete(prompt, temperature=...) client is available)
async def extract_with_schema(message: str, schema: dict) -> dict:
    """Extract structured data as JSON, retrying when the model returns invalid JSON."""
    prompt = f"Extract as JSON matching this schema: {schema}\n\nInput: {message}"

    for attempt in range(3):
        response = await llm.complete(prompt, temperature=0.0)
        try:
            return json.loads(response)  # parse and return on success
        except json.JSONDecodeError as err:
            # Feed the parse error back so the next attempt can self-correct
            prompt += f"\n\nPrevious output was invalid JSON ({err}). Respond with valid JSON only."

    raise ValueError("Failed to produce valid JSON after 3 attempts")
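
The table above also lists Pydantic validation; here is a minimal sketch of that stricter variant, assuming Pydantic v2 and the same llm.complete client (the RefundRequest model is illustrative):

import json
from pydantic import BaseModel, ValidationError

class RefundRequest(BaseModel):
    """Illustrative target schema for a refund-extraction task."""
    order_id: str
    amount: float
    reason: str

async def extract_refund(message: str, retries: int = 3) -> RefundRequest:
    """Extract and validate against a Pydantic model, feeding errors back on retry."""
    schema = json.dumps(RefundRequest.model_json_schema())
    prompt = f"Extract a refund request as JSON matching this schema: {schema}\n\nInput: {message}"

    for _ in range(retries):
        response = await llm.complete(prompt, temperature=0.0)  # assumed client
        try:
            return RefundRequest.model_validate_json(response)
        except ValidationError as err:
            prompt += f"\n\nValidation errors: {err}. Return corrected JSON only."

    raise ValueError("Extraction failed schema validation after retries")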

5. Tool-Use and Function Calling

Integrate LLMs with external tools for deterministic operations and real-time data access.

Tool-Use Architecture:

graph TB
    A[User Query] --> B[LLM Reasoning]
    B --> C{Need Tool?}
    C -->|Yes| D[Select Tool]
    C -->|No| E[Direct Answer]
    D --> F[Extract Parameters]
    F --> G[Validate Args]
    G --> H{Valid?}
    H -->|No| I[Request Clarification]
    H -->|Yes| J[Execute Tool]
    I --> A
    J --> K[Tool Result]
    K --> L[Add to Context]
    L --> B
    B --> M{Complete?}
    M -->|No| C
    M -->|Yes| N[Final Response]

Tool Integration Comparison:

| Approach | Reliability | Setup Complexity | Best For |
| --- | --- | --- | --- |
| Native Function Calling | 95-98% | Low | Modern LLMs (GPT-4, Claude) |
| Prompt-Based Tool Use | 75-85% | Medium | Legacy models, custom tools |
| ReAct Pattern | 85-95% | Medium | Multi-step reasoning |
| LangChain Agents | 80-90% | Low | Rapid prototyping |

Tool Access Control:

| Control Layer | Purpose | Implementation |
| --- | --- | --- |
| Whitelist | Allowed tools only | Registry of approved functions |
| Parameter Validation | Sanitize inputs | Schema validation before execution |
| Rate Limiting | Prevent abuse | Per-tool request limits |
| Audit Logging | Track usage | Log all tool invocations |
| Privilege Checks | User permissions | Role-based access control |
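
A minimal sketch of these control layers as a whitelisted tool registry with parameter validation, privilege checks, and audit logging (rate limiting omitted for brevity). The tool name check_order_status and the role names are illustrative:

import logging
from typing import Callable

logger = logging.getLogger("tool_calls")

class ToolRegistry:
    """Whitelist of approved tools with per-call validation and audit logging."""

    def __init__(self):
        self._tools: dict[str, dict] = {}

    def register(self, name: str, func: Callable, required_params: set[str]):
        self._tools[name] = {"func": func, "required": required_params}

    def call(self, name: str, user_role: str, allowed_roles: set[str], **kwargs):
        if name not in self._tools:                      # whitelist check
            raise PermissionError(f"Tool '{name}' is not approved")
        if user_role not in allowed_roles:               # privilege check
            raise PermissionError(f"Role '{user_role}' may not call '{name}'")
        missing = self._tools[name]["required"] - kwargs.keys()
        if missing:                                      # parameter validation
            raise ValueError(f"Missing parameters for '{name}': {missing}")
        logger.info("tool=%s role=%s args=%s", name, user_role, kwargs)  # audit log
        return self._tools[name]["func"](**kwargs)

registry = ToolRegistry()
registry.register("check_order_status",
                  func=lambda order_id: {"order_id": order_id, "status": "shipped"},
                  required_params={"order_id"})
result = registry.call("check_order_status", user_role="support_agent",
                       allowed_roles={"support_agent"}, order_id="A-1042")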

Task Decomposition Patterns

Break complex tasks into manageable, verifiable steps for improved reliability and traceability.

Decomposition Strategy Comparison

| Pattern | Complexity | Accuracy | Latency | Cost | Best For |
| --- | --- | --- | --- | --- | --- |
| Planner-Solver | Medium | +40-60% | +100% | +80% | Multi-step tasks |
| ReAct (Reason+Act) | Medium | +30-50% | +80% | +70% | Tool-heavy workflows |
| Reflection | High | +50-70% | +150% | +120% | Quality-critical tasks |
| Multi-Agent Debate | High | +60-80% | +200% | +200% | Strategic decisions |
| Task Graph/DAG | Very High | +70-90% | +120% | +100% | Complex workflows |

1. Planner-Solver Pattern

Separate planning from execution for better error handling and verification.

graph TB
    A[Complex Task] --> B[Planner Agent]
    B --> C[Generate Step-by-Step Plan]
    C --> D[Plan Validation]
    D -->|Valid| E[Solver Agent Loop]
    D -->|Invalid| B
    E --> F[Execute Step 1]
    F --> G{Step Success?}
    G -->|Yes| H[Execute Step 2]
    G -->|No| I[Retry or Replan]
    I --> E
    H --> J{More Steps?}
    J -->|Yes| E
    J -->|No| K[Final Output]
    K --> L[Result Verification]
    L -->|Pass| M[Return Result]
    L -->|Fail| I

Key Advantages:

  • Separation of Concerns: Planning vs execution
  • Transparency: Clear step-by-step breakdown
  • Debuggability: Identify which step failed
  • Adaptability: Replan when steps fail
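
A minimal sketch of this loop, again assuming the llm.complete client; the JSON plan format and the yes/no verification prompt are illustrative choices, not the only way to implement the pattern:

import json

async def plan_and_solve(task: str, max_step_retries: int = 2) -> list[str]:
    """Ask a planner for a plan, then execute each step with verification and retries."""
    plan_prompt = (f"Break this task into a short plan. "
                   f"Return a JSON list of step descriptions.\n\nTask: {task}")
    steps = json.loads(await llm.complete(plan_prompt, temperature=0.0))  # assumed client

    results: list[str] = []
    for i, step in enumerate(steps, start=1):
        context = "\n".join(f"Step {j} result: {r}" for j, r in enumerate(results, start=1))
        for _ in range(max_step_retries + 1):
            output = await llm.complete(
                f"Task: {task}\n{context}\nNow perform step {i}: {step}", temperature=0.0)
            verdict = await llm.complete(
                f"Does this output complete the step '{step}'? Answer yes or no.\n\n{output}",
                temperature=0.0)
            if verdict.strip().lower().startswith("yes"):
                results.append(output)
                break
        else:  # all retries failed; a fuller implementation would replan here
            raise RuntimeError(f"Step {i} failed after retries: {step}")
    return results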

2. Reflection and Self-Critique

Add verification loops to improve output quality through iterative refinement.

Reflection Pattern Flow:

graph TB
    A[Task Input] --> B[Generate Initial Output]
    B --> C[Self-Critique]
    C --> D{Quality Score}
    D -->|< Threshold| E[Generate Improvements]
    D -->|≥ Threshold| F[Accept Output]
    E --> G[Refine Based on Critique]
    G --> C
    F --> H{Max Iterations?}
    H -->|No| I[Continue if Needed]
    H -->|Yes| J[Return Best Output]

Reflection Use Cases:

| Use Case | Iterations | Quality Gain | When to Use |
| --- | --- | --- | --- |
| Draft Review | 2-3 | +25-40% | Content generation, reports |
| Code Review | 3-5 | +40-60% | Complex logic, algorithms |
| Fact Checking | 2-4 | +50-70% | Research, analysis |
| Creative Refinement | 3-6 | +30-50% | Marketing, storytelling |
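
A minimal sketch of a reflection loop with a quality threshold and an iteration cap, assuming the same llm.complete client; the 1-10 scoring rubric is illustrative:

async def reflect_and_refine(task: str, threshold: int = 8, max_iterations: int = 3) -> str:
    """Generate, self-critique on a 1-10 scale, and refine until good enough or out of budget."""
    draft = await llm.complete(f"Complete this task:\n{task}", temperature=0.3)  # assumed client

    for _ in range(max_iterations):
        critique = await llm.complete(
            f"Task: {task}\n\nDraft:\n{draft}\n\n"
            "Critique the draft and end with a line 'Score: N' where N is 1-10.",
            temperature=0.0)
        score_lines = [line for line in critique.splitlines()
                       if line.strip().lower().startswith("score:")]
        try:
            score = int(score_lines[-1].split(":", 1)[1].strip().split("/")[0]) if score_lines else 0
        except ValueError:
            score = 0
        if score >= threshold:
            break
        draft = await llm.complete(
            f"Task: {task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
            "Rewrite the draft to address every point in the critique.",
            temperature=0.3)
    return draft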

3. Multi-Agent Debate

Use multiple perspectives to reach better, more robust answers.

Debate Consensus Process:

graph TB
    A[Question] --> B[Agent 1<br/>Initial Position]
    A --> C[Agent 2<br/>Initial Position]
    A --> D[Agent 3<br/>Initial Position]
    B --> E[Round 1: Review<br/>Other Positions]
    C --> E
    D --> E
    E --> F[Refine Position<br/>Based on Arguments]
    F --> G{More Rounds?}
    G -->|Yes| E
    G -->|No| H[Moderator<br/>Synthesize Consensus]
    H --> I{Agreement?}
    I -->|High| J[Confident Answer]
    I -->|Medium| K[Qualified Answer]
    I -->|Low| L[Present Trade-offs]

Debate Configuration:

| Agents | Rounds | Improvement | Cost | Best For |
| --- | --- | --- | --- | --- |
| 2 | 2 | +20-30% | +100% | Quick validation |
| 3 | 3 | +40-60% | +200% | Strategic decisions |
| 5 | 3 | +60-80% | +400% | Critical analysis |
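
A minimal sketch of a small debate with a moderator synthesis step, assuming the same llm.complete client; the personas are illustrative:

import asyncio

async def debate(question: str, personas: list[str] | None = None, rounds: int = 2) -> str:
    """Run a small multi-agent debate and ask a moderator to synthesize a consensus."""
    personas = personas or ["a cautious analyst", "an optimistic strategist", "a devil's advocate"]

    # Initial positions, taken in parallel
    positions = await asyncio.gather(*(
        llm.complete(f"As {p}, answer: {question}", temperature=0.7) for p in personas
    ))  # assumed client

    for _ in range(rounds - 1):
        others = "\n\n".join(f"{p}: {pos}" for p, pos in zip(personas, positions))
        positions = await asyncio.gather(*(
            llm.complete(
                f"As {p}, here are all current positions:\n{others}\n\n"
                f"Refine your answer to: {question}", temperature=0.7)
            for p in personas
        ))

    summary = "\n\n".join(f"{p}: {pos}" for p, pos in zip(personas, positions))
    return await llm.complete(
        f"As a neutral moderator, synthesize a consensus answer to '{question}' "
        f"from these positions, noting any unresolved trade-offs:\n\n{summary}",
        temperature=0.0)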

Task Orchestration with Graphs

Model complex workflows as directed acyclic graphs (DAGs) for production-grade orchestration.

Example: Customer Support Workflow DAG:

graph TB
    A[User Request] --> B{Intent Classification}
    B -->|Refund| C[Verify Order]
    B -->|Status| D[Lookup Order]
    B -->|Complaint| E[Sentiment Analysis]
    C --> F{Order Eligible?}
    F -->|Yes| G[Calculate Refund]
    F -->|No| H[Explain Policy]
    G --> I{Amount > $500?}
    I -->|Yes| J[Human Approval]
    I -->|No| K[Process Refund]
    D --> L[Format Status]
    E --> M{Severity?}
    M -->|High| N[Priority Escalation]
    M -->|Low| O[Standard Response]
    J --> P{Approved?}
    P -->|Yes| K
    P -->|No| H
    K --> Q[Send Confirmation]
    L --> Q
    N --> Q
    O --> Q
    H --> Q
    Q --> R[Log Interaction]
    R --> S[Return to User]

Task Graph Node Types:

| Node Type | Purpose | Execution | Example |
| --- | --- | --- | --- |
| LLM | AI reasoning/generation | Async LLM call | Intent classification, summarization |
| Tool | External API/database | Function execution | Order lookup, payment processing |
| Decision | Routing logic | Sync evaluation | Route by intent, eligibility check |
| Human | Manual review | Pause for input | Approval for large refunds |
| Parallel | Concurrent execution | Multi-threaded | Fetch multiple data sources |

Graph Execution Strategies:

| Strategy | Complexity | Fault Tolerance | Best For |
| --- | --- | --- | --- |
| Sequential | Low | Basic retry | Simple pipelines |
| Parallel | Medium | Independent failure | Data aggregation |
| Conditional | Medium | Branch-specific | Business logic |
| Recursive | High | Nested retry | Hierarchical tasks |
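
A minimal sketch of a dependency-ordered executor for graphs like this, with independent nodes running concurrently. The Node structure is illustrative and not tied to any specific orchestration framework:

import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable

@dataclass
class Node:
    """One unit of work: an async action plus the upstream node ids it depends on."""
    id: str
    action: Callable[[dict], Awaitable[dict]]
    depends_on: list[str] = field(default_factory=list)

async def run_dag(nodes: list[Node], context: dict) -> dict:
    """Execute nodes in dependency order, running independent nodes concurrently."""
    done: set[str] = set()
    remaining = {n.id: n for n in nodes}

    while remaining:
        ready = [n for n in remaining.values() if set(n.depends_on) <= done]
        if not ready:
            raise ValueError("Cycle or unsatisfiable dependency in task graph")
        results = await asyncio.gather(*(n.action(context) for n in ready))
        for node, result in zip(ready, results):
            context[node.id] = result          # each node's output is keyed by its id
            done.add(node.id)
            del remaining[node.id]
    return context

Conditional branches can be layered on top by having a decision node write its chosen route into the context and letting downstream nodes no-op when their branch was not selected.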

Determinism and Control

Control randomness for consistent, reliable outputs based on task requirements.

Temperature Control by Task Type:

| Task Type | Temperature | Rationale | Example |
| --- | --- | --- | --- |
| Extraction | 0.0 | Deterministic output required | Data extraction, parsing |
| Classification | 0.0 | Consistent categorization | Intent detection, routing |
| Structured Output | 0.0 | Exact JSON/schema adherence | API responses, databases |
| Summarization | 0.3 | Light variation acceptable | Document summarization |
| Rewriting | 0.5 | Moderate creativity | Content adaptation |
| Brainstorming | 0.8 | High diversity desired | Idea generation |
| Creative Writing | 0.9 | Maximum variation | Stories, marketing copy |

Consensus Strategies for Quality:

| Strategy | Samples | Cost | Quality Gain | When to Use |
| --- | --- | --- | --- | --- |
| Single (temp=0) | 1 | 1x | Baseline | Deterministic tasks |
| Best-of-N | 3-5 | 3-5x | +15-25% | Classification, scoring |
| Majority Vote | 5-7 | 5-7x | +25-40% | Binary decisions |
| LLM Synthesis | 3-5 | 6-10x | +30-50% | Creative tasks |
| Mixture of Models | 3-5 | Varies | +40-60% | Critical decisions |
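
A minimal sketch of the Best-of-N strategy using an LLM-as-judge scorer (the self-consistency example earlier covers majority voting); it assumes the same llm.complete client:

import asyncio

async def best_of_n(prompt: str, n: int = 3) -> str:
    """Sample n candidates, score each with an LLM judge, and return the highest scorer."""
    candidates = await asyncio.gather(*(
        llm.complete(prompt, temperature=0.7) for _ in range(n)
    ))  # assumed client

    async def score(candidate: str) -> float:
        verdict = await llm.complete(
            f"Rate this response to the prompt on a 0-10 scale. Reply with the number only.\n\n"
            f"Prompt: {prompt}\n\nResponse: {candidate}", temperature=0.0)
        try:
            return float(verdict.strip().split()[0])
        except (ValueError, IndexError):
            return 0.0

    scores = await asyncio.gather(*(score(c) for c in candidates))
    return max(zip(scores, candidates), key=lambda pair: pair[0])[1]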

Deliverables

Prompt Library Management

Prompt Lifecycle:

graph LR
    A[Create Prompt v1.0] --> B[Test on Dev Data]
    B --> C{Performance OK?}
    C -->|No| D[Iterate & Create v1.1]
    C -->|Yes| E[Deploy to Production]
    D --> B
    E --> F[Monitor Metrics]
    F --> G{Degradation?}
    G -->|Yes| H[Create v2.0<br/>with Improvements]
    G -->|No| F
    H --> B

Prompt Versioning Strategy:

| Element | Purpose | Example |
| --- | --- | --- |
| Name | Unique identifier | customer_intent_classifier |
| Version | Track iterations | v2.3 |
| Metadata | Context & ownership | Author, task type, model |
| Template | Parameterized prompt | "Classify: {message}" |
| Test Cases | Validation suite | 50+ examples per version |
| Performance | Success metrics | Accuracy, latency, cost |
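
A minimal sketch of a versioned prompt record with these elements; the dataclass fields are illustrative, and production systems would typically persist this in a registry or database:

from dataclasses import dataclass, field

@dataclass
class PromptVersion:
    """One versioned entry in a prompt library."""
    name: str                      # unique identifier, e.g. "customer_intent_classifier"
    version: str                   # e.g. "v2.3"
    template: str                  # parameterized prompt
    model: str                     # target model
    author: str = "unknown"
    test_cases: list[dict] = field(default_factory=list)   # {"input": ..., "expected": ...}
    metrics: dict = field(default_factory=dict)            # accuracy, latency, cost per version

    def render(self, **params) -> str:
        return self.template.format(**params)

classifier = PromptVersion(
    name="customer_intent_classifier",
    version="v2.3",
    template="Classify the intent of this message as refund, status, or complaint: {message}",
    model="gpt-4o-mini",
)
prompt = classifier.render(message="Where is my order?")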

Task Graph Specification

Workflow Documentation Template:

| Section | Contents | Purpose |
| --- | --- | --- |
| Metadata | Name, version, description | Identification |
| Nodes | ID, type, action, parameters | Execution logic |
| Edges | Connections, conditions | Flow control |
| Guardrails | Input/output validation, safety | Risk mitigation |
| Monitoring | Metrics, thresholds, alerts | Observability |
| SLAs | Latency, accuracy targets | Quality gates |

Example Workflow Spec (YAML):

name: customer_support_refund
version: 2.1.0

nodes:
  - id: classify_intent
    type: llm
    model: gpt-4o-mini
    temperature: 0.0
    timeout: 5s

  - id: route_decision
    type: decision
    logic: "intent == 'refund' ? verify_order : general_response"

  - id: verify_order
    type: tool
    function: check_eligibility

  - id: human_approval
    type: human
    required_if: "amount > $500"
    sla: 30min

guardrails:
  - pii_detection
  - output_validation

monitoring:
  latency_p95: 2s
  error_rate: <1%

Why It Matters

Good prompting and decomposition increase reliability and reduce cost.

Impact Metrics:

| Technique | Reliability Improvement | Cost Impact | Latency Impact |
| --- | --- | --- | --- |
| Role + Constraints | +15-25% | 0% | 0% |
| Few-Shot Examples | +20-35% | 0-5% increase | +10-20% |
| Chain-of-Thought | +30-50% | 5-15% increase | +20-40% |
| Schema Constraints | +40-60% | 0% | +5-10% |
| Tool Integration | +50-70% | 20-40% decrease | -10 to -30% |
| Task Decomposition | +60-80% | 10-30% decrease | +15-35% |
| Reflection | +25-40% | 30-60% increase | +50-100% |

Common Pitfalls:

  1. Over-prompting: Overly verbose prompts that confuse rather than clarify
  2. Under-specification: Vague instructions leading to inconsistent outputs
  3. Missing error handling: No graceful degradation when steps fail
  4. Rigid workflows: Can't adapt to edge cases
  5. No verification: Trusting outputs without validation

Case Study: Court Document Summarization

Challenge: A legal tech company with 500+ attorneys needed to summarize court documents with 99%+ factual accuracy. The initial GPT-4 implementation achieved only 87% accuracy with a 23% hallucination rate, causing trust issues and manual review overhead.

Initial State (GPT-4 Zero-Shot):

  • Accuracy: 87%
  • Hallucination Rate: 23%
  • Citations: 0%
  • Cost: $25/document
  • Latency: 1.2s
  • Attorney Trust: Low (62% required full re-review)

Solution Architecture:

graph TB
    A[Court Document] --> B[Stage 1:<br/>Extract Key Facts]
    B --> C[Stage 2:<br/>Generate Summary<br/>from Facts]
    C --> D[Stage 3:<br/>Multi-Method<br/>Verification]
    D --> E{Accuracy<br/>> 95%?}
    E -->|No| F[Stage 4:<br/>Refine with Issues]
    E -->|Yes| G[Stage 5:<br/>Add Citations]
    F --> D
    G --> H[Final Output]
    subgraph "Verification Methods"
        D1[Claim-by-Claim<br/>vs Facts]
        D2[LLM-as-Judge<br/>Verification]
        D3[Entailment<br/>Checking]
    end
    D --> D1
    D --> D2
    D --> D3

Implemented Techniques:

| Stage | Technique | Purpose | Impact |
| --- | --- | --- | --- |
| 1. Fact Extraction | Chain-of-Thought prompting | Identify verifiable claims | +15% accuracy |
| 2. Summary Generation | Few-shot with legal examples | Generate from facts only | +20% accuracy |
| 3. Verification | Multi-method checking | Detect hallucinations | +25% accuracy |
| 4. Refinement | Iterative improvement | Fix identified issues | +10% accuracy |
| 5. Citation | Source attribution | Build attorney trust | +100% adoption |
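
For orientation, here is a heavily condensed sketch of how these stages could chain together, assuming the same llm.complete client; the prompts are illustrative of the architecture above, not the firm's actual implementation:

async def summarize_court_document(document: str, accuracy_target: float = 0.95,
                                   max_refinements: int = 2) -> str:
    """Extract facts, summarize from facts only, verify, refine, then add citations."""
    facts = await llm.complete(  # assumed client
        f"List the verifiable facts in this court document, one per line, "
        f"each tagged with the section it came from:\n\n{document}", temperature=0.0)

    summary = await llm.complete(
        f"Write a summary using only these extracted facts:\n\n{facts}", temperature=0.0)

    for _ in range(max_refinements):
        verdict = await llm.complete(
            f"Check each claim in the summary against the facts. "
            f"End with 'Supported fraction: X' where X is between 0 and 1.\n\n"
            f"Facts:\n{facts}\n\nSummary:\n{summary}", temperature=0.0)
        try:
            supported = float(verdict.rsplit(":", 1)[-1].strip())
        except ValueError:
            supported = 0.0
        if supported >= accuracy_target:
            break
        summary = await llm.complete(
            f"Revise the summary so every claim is supported by the facts.\n\n"
            f"Issues:\n{verdict}\n\nFacts:\n{facts}\n\nSummary:\n{summary}", temperature=0.0)

    return await llm.complete(
        f"Add a citation to the source section after each claim.\n\n"
        f"Facts:\n{facts}\n\nSummary:\n{summary}", temperature=0.0)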

Results After Implementation:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Accuracy | 87% | 99.1% | +12.1% |
| Hallucination Rate | 23% | 0.9% | -96% |
| Citations | 0% | 100% | Full coverage |
| Cost per Document | $25 | $36 | +44% |
| Latency | 1.2s | 3.8s | +217% |
| Attorney Trust | 62% | 94% | +52% |
| Re-review Rate | 100% | 8% | -92% |

Business Impact:

  • Time Savings: 15 hours/week per attorney (85% less review time)
  • Productivity: 3x more documents processed per day
  • ROI: 340% in first 6 months
  • Cost Savings: $180K/year in attorney time (despite +44% LLM cost)
  • Quality: Zero legal errors in production use

Key Learnings:

  1. Multi-Stage Beats Single-Shot: 99.1% vs 87% accuracy through decomposition
  2. Verification is Critical: Multiple methods catch different error types
  3. Citations Build Trust: 100% citation rate enabled 94% adoption
  4. Cost vs Quality: +44% cost acceptable for high-stakes use cases
  5. Latency Tolerance: 3.8s acceptable for batch workflows, not real-time

Implementation Checklist

Prompt Engineering

  • Define clear roles and constraints for each prompt
  • Create few-shot example library organized by task type
  • Implement dynamic example selection based on similarity
  • Add output format specifications (JSON schemas)
  • Include error handling instructions in prompts
  • Test prompts across temperature settings (0.0, 0.3, 0.7)
  • Document prompt versions and performance metrics

Task Decomposition

  • Identify complex tasks that benefit from decomposition
  • Design task graphs with clear inputs/outputs per node
  • Implement planner-solver pattern for multi-step tasks
  • Add verification nodes after critical steps
  • Include human-in-the-loop checkpoints for high-stakes decisions
  • Define retry policies and fallback strategies
  • Set up parallel execution for independent sub-tasks

Tool Integration

  • Define tool/function schemas with clear descriptions
  • Implement tool registry with error handling
  • Add tool result validation
  • Create tool-use examples in prompts
  • Monitor tool call accuracy and latency
  • Implement tool-first routing for deterministic operations

Verification & Quality

  • Implement output schema validation
  • Add LLM-as-judge verification for quality
  • Create fact-checking pipelines for factual tasks
  • Set up reflection loops for iterative improvement
  • Define confidence scoring mechanisms
  • Establish quality thresholds for human escalation

Monitoring & Optimization

  • Track prompt performance metrics (accuracy, latency, cost)
  • A/B test prompt variations
  • Monitor task graph execution paths
  • Identify bottlenecks and optimization opportunities
  • Set up alerts for quality degradation
  • Create feedback loops from production to prompt library

Documentation

  • Document prompt library with versions and metadata
  • Create task graph specifications in YAML/JSON
  • Define standard operating procedures for each workflow
  • Maintain runbooks for common failure modes
  • Document best practices and anti-patterns
  • Create onboarding guide for prompt engineering