Chapter 43 — Competitive Intelligence & Research Bots
Overview
Competitive intelligence (CI) and research bots represent a critical capability for modern enterprises. These autonomous systems monitor competitors, track market trends, extract insights from public data, and surface actionable intelligence—all while respecting legal boundaries and maintaining data provenance.
This chapter covers the full lifecycle of building compliant, effective CI and research systems:
- Intelligent Crawling: Politeness policies, deduplication, and change detection
- Structured Extraction: Entity recognition, price tracking, and claim identification
- Evidence Preservation: Snapshots, diffs, and chain-of-custody for audit trails
- Grounded Summarization: Citations, confidence scoring, and contradiction detection
- Legal Compliance: robots.txt, ToS respect, and PII filtering
Why It Matters
Competitive intelligence thrives on freshness and accuracy. Legal and ethical constraints require careful scoping, evidence capture, and respect for site policies.
Business Value & Risks
| Value Category | Impact | Metric |
|---|---|---|
| Research Time | 90% reduction | 20 hours/week → 2 hours/week |
| Competitor Coverage | 5x increase | 5 competitors → 25 competitors |
| Update Frequency | Real-time vs. weekly | 2 hours vs. 7 days to detect changes |
| Insights Generated | 24x increase | 5/month → 120/month actionable insights |
| Products Tracked | 20x increase | 500 products → 10,000 products |
Critical Risks:
| Risk Category | Impact | Mitigation | Priority |
|---|---|---|---|
| ToS Violations | Legal action, IP blocks | Strict robots.txt compliance, 10 req/min limits | Critical |
| Inaccurate Intelligence | Bad decisions | Citation linking, confidence scoring | High |
| Stale Data | Outdated insights | Hourly change detection, timestamps | High |
| PII Exposure | GDPR fines | PII filtering before storage | Critical |
| Misleading Summaries | Strategic errors | Contradiction checking, fact verification | High |
Architecture
graph TB
    subgraph "Source Discovery"
        A[Source Registry] --> B[Sitemap Parser]
        A --> C[RSS Feed Monitor]
        A --> D[API Connector]
        A --> E[Manual URL Queue]
    end
    subgraph "Crawl Orchestrator"
        B --> F[Crawl Scheduler]
        C --> F
        D --> F
        E --> F
        F --> G[Politeness Engine]
        G --> H[robots.txt Checker]
        G --> I[Rate Limiter]
        G --> J[Allowlist Filter]
    end
    subgraph "Content Processing"
        J --> K[Content Fetcher]
        K --> L[Deduplicator]
        L --> M[Canonicalizer]
        M --> N[Change Detector]
        N --> O{Changed?}
        O -->|Yes| P[Structured Extractors]
        O -->|No| Q[Skip Processing]
    end
    subgraph "Extraction Layer"
        P --> R[Entity Extractor]
        P --> S[Price Extractor]
        P --> T[Feature Extractor]
        P --> U[Claim Extractor]
        R --> V[Structured Data Store]
        S --> V
        T --> V
        U --> V
    end
    subgraph "Intelligence Generation"
        V --> W[Change Analyzer]
        V --> X[Summarizer]
        W --> Y[Alert Generator]
        X --> Z[Citation Linker]
        Z --> AA[Confidence Scorer]
        AA --> AB[Contradiction Checker]
    end
    subgraph "Evidence & Audit"
        K --> AC[Snapshot Store]
        M --> AD[URL Canonicalization Log]
        N --> AE[Diff Store]
        V --> AF[Provenance Tracker]
        AC --> AG[Evidence Archive]
        AD --> AG
        AE --> AG
        AF --> AG
    end
    subgraph "Analyst Interface"
        Y --> AH[Alert Dashboard]
        AB --> AI[Intelligence Report]
        AG --> AJ[Audit Viewer]
    end
    style G fill:#ffe1e1
    style V fill:#e1f5ff
    style AG fill:#e1ffe1
    style AB fill:#fff3e1
Core Components
1. Source Discovery & Management
Source Types & Crawl Frequencies:
| Source Type | Frequency | Priority | Use Case |
|---|---|---|---|
| Product Pages | Hourly | 10 | Price monitoring |
| Press Releases | Daily | 8 | New announcements |
| Blog Posts | Daily | 6 | Thought leadership |
| Sitemaps | Weekly | 5 | Comprehensive coverage |
| RSS Feeds | Hourly | 9 | Real-time updates |
| APIs | Hourly | 10 | Official data |
Implementation Pattern:
from datetime import datetime
from typing import List

class SourceRegistry:
    def register_source(self, url: str, competitor: str, source_type: str, priority: int = 5):
        """Register intelligence source with crawl frequency"""
        source = {
            'url': url,
            'competitor': competitor,
            'type': source_type,
            'priority': priority,
            'frequency': self._get_frequency(source_type),
            'last_crawled': None
        }
        self.sources[self._generate_id(source)] = source

    def get_sources_to_crawl(self) -> List[dict]:
        """Get sources due for crawl, sorted by priority"""
        now = datetime.utcnow()
        due_sources = [
            s for s in self.sources.values()
            if not s['last_crawled'] or (now - s['last_crawled']) >= s['frequency']
        ]
        return sorted(due_sources, key=lambda s: s['priority'], reverse=True)

class SitemapParser:
    async def discover_and_parse(self, domain: str) -> List[str]:
        """Discover sitemaps from robots.txt and parse URLs"""
        # Check robots.txt for Sitemap: directives
        sitemaps = await self._parse_robots_txt(f"https://{domain}/robots.txt")
        # Try common locations (e.g., /sitemap.xml)
        sitemaps += await self._try_common_locations(domain)
        # Parse all sitemaps and extract URLs
        urls = []
        for sitemap_url in sitemaps:
            urls.extend(await self._parse_sitemap_xml(sitemap_url))
        return urls
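The registry above calls two helpers that the pattern does not define. A minimal sketch, assuming the crawl intervals from the frequency table; the source-type keys and the helper class name are illustrative, not part of the original pattern:
from datetime import timedelta
import hashlib

class SourceRegistryHelpers:
    # Intervals mirror the "Source Types & Crawl Frequencies" table above
    FREQUENCIES = {
        'product_page': timedelta(hours=1),
        'press_release': timedelta(days=1),
        'blog_post': timedelta(days=1),
        'sitemap': timedelta(weeks=1),
        'rss_feed': timedelta(hours=1),
        'api': timedelta(hours=1),
    }

    def _get_frequency(self, source_type: str) -> timedelta:
        # Unknown types fall back to a daily crawl (assumption)
        return self.FREQUENCIES.get(source_type, timedelta(days=1))

    def _generate_id(self, source: dict) -> str:
        # Stable ID derived from the URL so re-registering the same source updates it in place
        return hashlib.sha256(source['url'].encode()).hexdigest()[:16]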
2. Politeness Engine
Politeness Checks (Sequential):
graph TD
    A[URL Request] --> B{In Allowlist?}
    B -->|No| C[Reject: Not Allowed]
    B -->|Yes| D{robots.txt OK?}
    D -->|No| E[Reject: Disallowed]
    D -->|Yes| F{Rate Limit OK?}
    F -->|No| G[Wait Until Window Opens]
    F -->|Yes| H[Allow Crawl]
    G --> F
    style C fill:#ffe1e1
    style E fill:#ffe1e1
    style H fill:#e1f5ff
Rate Limits & Rules:
| Check | Rule | Action |
|---|---|---|
| Allowlist | Domain must be in allowlist | Reject if not present |
| robots.txt | Must respect User-agent rules | Check cache (24h TTL) |
| Rate Limit | Max 10 requests/minute per domain | Wait if exceeded |
| User-Agent | Identify as "ResearchBot/1.0" | Include contact URL |
| Cache robots.txt | 24 hour TTL | Refresh daily |
Implementation Pattern:
import asyncio
from datetime import datetime, timedelta
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class PolitenessEngine:
    async def can_fetch(self, url: str) -> dict:
        """3-layer politeness check"""
        domain = urlparse(url).netloc
        # 1. Allowlist check
        if domain not in self.allowlist:
            return {'can_fetch': False, 'reason': 'not_in_allowlist'}
        # 2. robots.txt check (24h cache)
        if not await self._check_robots(url):
            return {'can_fetch': False, 'reason': 'disallowed_by_robots'}
        # 3. Rate limit (10 req/min per domain)
        if not await self._check_rate_limit(domain):
            return {'can_fetch': False, 'reason': 'rate_limit_exceeded'}
        return {'can_fetch': True}

    async def _check_robots(self, url: str) -> bool:
        """Check robots.txt with 24h cache"""
        domain = urlparse(url).netloc
        if domain in self.robots_cache:
            cache_entry = self.robots_cache[domain]
            if (datetime.utcnow() - cache_entry['fetched_at']) < timedelta(hours=24):
                return cache_entry['parser'].can_fetch(self.user_agent, url)
        # Fetch and cache robots.txt (RobotFileParser.read() is blocking; offload it in production)
        rp = RobotFileParser()
        rp.set_url(f"https://{domain}/robots.txt")
        rp.read()
        self.robots_cache[domain] = {'parser': rp, 'fetched_at': datetime.utcnow()}
        return rp.can_fetch(self.user_agent, url)

    async def _check_rate_limit(self, domain: str) -> bool:
        """10 requests per minute sliding window"""
        now = datetime.utcnow()
        # setdefault so the per-domain window persists across calls
        limiter = self.rate_limiters.setdefault(domain, {'rpm': 10, 'requests': []})
        # Drop requests outside the 1-minute window
        limiter['requests'] = [t for t in limiter['requests'] if (now - t).total_seconds() < 60]
        # Back off and re-check if the limit is exceeded
        if len(limiter['requests']) >= limiter['rpm']:
            await asyncio.sleep(6)
            return await self._check_rate_limit(domain)
        limiter['requests'].append(now)
        return True
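Putting this gate in front of every fetch keeps the crawl loop simple. A minimal usage sketch, assuming a hypothetical `fetch_html` HTTP helper that sends the ResearchBot/1.0 User-Agent:
from typing import Optional

async def polite_fetch(engine: PolitenessEngine, url: str) -> Optional[str]:
    """Fetch only when all three politeness checks pass."""
    verdict = await engine.can_fetch(url)
    if not verdict['can_fetch']:
        # reason is one of: not_in_allowlist, disallowed_by_robots, rate_limit_exceeded
        print(f"Skipping {url}: {verdict['reason']}")
        return None
    return await fetch_html(url)  # hypothetical HTTP helper (not defined in this chapter)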
3. Content Processing Pipeline
URL Canonicalization Rules:
| Transformation | Example | Reason |
|---|---|---|
| Remove tracking params | ?utm_source=twitter → (removed) | Deduplication |
| Sort query params | ?b=2&a=1 → ?a=1&b=2 | Consistent hashing |
| Remove trailing slash | /page/ → /page | Normalize paths |
| Lowercase domain | Example.COM → example.com | Case-insensitive |
Change Detection:
graph LR
    A[Fetch Content] --> B[Calculate Hash]
    B --> C{Previous Snapshot?}
    C -->|No| D[First Seen]
    C -->|Yes| E{Hash Match?}
    E -->|Yes| F[No Changes - Skip]
    E -->|No| G[Generate Diff]
    G --> H[Store New Snapshot]
    D --> H
    style F fill:#e1ffe1
    style G fill:#ffe1e1
    style H fill:#e1f5ff
Implementation Pattern:
import difflib
import hashlib
from urllib.parse import urlparse, parse_qs, urlencode

class ContentDeduplicator:
    def canonicalize_url(self, url: str) -> str:
        """Remove tracking params, sort query, normalize"""
        tracking_params = {'utm_source', 'utm_medium', 'fbclid', 'gclid', 'ref'}
        parsed = urlparse(url)
        params = {k: v for k, v in parse_qs(parsed.query).items() if k not in tracking_params}
        canonical = f"{parsed.scheme}://{parsed.netloc.lower()}{parsed.path.rstrip('/')}"
        if params:
            canonical += f"?{urlencode(sorted(params.items()), doseq=True)}"
        return canonical

    def is_duplicate(self, url: str, content: str) -> bool:
        """Check if the whitespace-normalized content hash has been seen before"""
        canonical = self.canonicalize_url(url)
        content_hash = hashlib.sha256(' '.join(content.split()).encode()).hexdigest()
        return content_hash in self.content_hashes

class ChangeDetector:
    async def detect_changes(self, url: str, content: str) -> dict:
        """Compare with previous snapshot"""
        previous = await self.snapshot_store.get_latest(url)
        if not previous:
            return {'is_new': True, 'has_changes': False}
        current_hash = hashlib.sha256(content.encode()).hexdigest()
        if current_hash == previous['hash']:
            return {'is_new': False, 'has_changes': False}
        # Generate a unified diff, capped at 100 lines
        diff = difflib.unified_diff(previous['content'].splitlines(), content.splitlines())
        return {'is_new': False, 'has_changes': True, 'diff': list(diff)[:100]}
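For example, applying the canonicalization rules above to an arbitrary URL (a quick sketch; the URL is illustrative):
dedup = ContentDeduplicator()
url = "https://Example.COM/pricing/?utm_source=twitter&b=2&a=1"
print(dedup.canonicalize_url(url))
# → https://example.com/pricing?a=1&b=2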
4. Structured Extraction
Extraction Targets:
| Extractor | Target Data | Selectors | Confidence |
|---|---|---|---|
| Price | Amount + currency | .price, [itemprop="price"] | High |
| Features | Product specs | .features li, .specs td | Medium |
| Claims | Performance metrics | Sentences with %, $, "faster" | Variable |
| Entities | Companies, products | Capitalized words, proper nouns | Medium |
Implementation Pattern:
import re
from typing import List
from bs4 import BeautifulSoup

class EntityExtractor:
    def extract(self, url: str, html: str, competitor: str) -> dict:
        """Extract using registered CSS selectors"""
        soup = BeautifulSoup(html, 'html.parser')
        rules = self.extraction_rules.get(competitor, {})
        data = {}
        for field, selector_config in rules.items():
            element = soup.select_one(selector_config['selector'])
            data[field] = element.get_text(strip=True) if element else None
        return {'url': url, 'competitor': competitor, 'data': data}

class PriceExtractor:
    def extract_prices(self, html: str) -> List[dict]:
        """Extract prices from common selectors"""
        selectors = ['.price', '.product-price', '[itemprop="price"]']
        prices = []
        soup = BeautifulSoup(html, 'html.parser')
        for selector in selectors:
            for elem in soup.select(selector):
                text = elem.get_text(strip=True)
                # Parse: $1,234.56 → {amount: 1234.56, currency: 'USD'}
                match = re.search(r'[$€£¥]?([\d,]+\.?\d*)', text)
                if match:
                    amount = float(match.group(1).replace(',', ''))
                    currency = ('USD' if '$' in text else 'EUR' if '€' in text
                                else 'GBP' if '£' in text else 'JPY' if '¥' in text else 'USD')
                    prices.append({'amount': amount, 'currency': currency, 'selector': selector})
        return prices

class ClaimExtractor:
    def extract_claims(self, text: str) -> List[dict]:
        """Extract sentences with metrics, comparisons, superlatives"""
        claim_patterns = [r'\d+%', r'\$[\d,]+', r'\d+x', r'(faster|better|leading|first)']
        claims = []
        for sentence in text.split('.'):
            if any(re.search(p, sentence, re.I) for p in claim_patterns):
                claims.append({
                    'text': sentence.strip(),
                    'type': self._classify(sentence),
                    'confidence': 0.7
                })
        return claims

    def _classify(self, claim: str) -> str:
        """Classify claim type"""
        if '$' in claim:
            return 'pricing'
        elif '%' in claim:
            return 'performance'
        elif re.search(r'faster|slower', claim, re.I):
            return 'speed'
        else:
            return 'general'
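The `extraction_rules` registry that `EntityExtractor` reads from is not shown above. A minimal sketch of one possible shape; the competitor key, field names, selectors, and sample HTML are all illustrative:
html = '<h1 class="product-title">Widget Pro</h1><span itemprop="price">$49</span>'
extraction_rules = {
    'acme_corp': {                                        # competitor key (illustrative)
        'product_name': {'selector': 'h1.product-title'},
        'price': {'selector': '[itemprop="price"]'},
    }
}

extractor = EntityExtractor()
extractor.extraction_rules = extraction_rules             # normally injected via the constructor
print(extractor.extract('https://acme.example/pricing', html, 'acme_corp'))
# → {'url': ..., 'competitor': 'acme_corp', 'data': {'product_name': 'Widget Pro', 'price': '$49'}}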
5. Intelligence Generation
Confidence Scoring:
| Factor | Weight | Calculation |
|---|---|---|
| Citation Density | 20% | Citations per 50 words |
| Source Freshness | 30% | 1 - (age_hours / 168) |
| Source Diversity | 20% | Unique domains / 5 |
| Data Completeness | 30% | Extracted fields / Total fields |
Implementation Pattern:
import re
from datetime import datetime
from typing import List
from urllib.parse import urlparse

class IntelligenceSummarizer:
    async def generate_summary(self, competitor: str, data: List[dict], period: str) -> dict:
        """Generate LLM summary with citations and confidence scoring"""
        # Prepare context with source numbers
        context = '\n'.join([
            f"[{i+1}] {d['url']}: {d['data']}"
            for i, d in enumerate(data)
        ])
        # LLM prompt with citation requirement
        prompt = f"""Summarize competitive intelligence for {competitor} ({period}).
Requirements: Cite sources as [n], include specific metrics, avoid speculation.
Data:
{context}
Summary (Overview → Key Findings → Changes):"""
        summary = await self.llm.generate(prompt)
        # Extract citations: [1], [2], etc.
        citations = [
            {'number': int(m), 'url': data[int(m)-1]['url']}
            for m in re.findall(r'\[(\d+)\]', summary)
            if int(m) <= len(data)
        ]
        # Calculate confidence score (4 factors)
        confidence = self._score_confidence(summary, data, citations)
        # Check contradictions
        contradictions = self._check_contradictions(summary, data)
        return {
            'summary': summary,
            'citations': citations,
            'confidence_score': confidence,
            'contradictions': contradictions
        }

    def _score_confidence(self, summary: str, data: List[dict], citations: List[dict]) -> float:
        """Weighted confidence: citation density (20%) + freshness (30%) + diversity (20%) + completeness (30%)"""
        # Citation density: citations per 50 words
        citation_density = min(len(citations) / (len(summary.split()) / 50), 1.0) * 0.2
        # Source freshness: 1 - (avg_age_hours / 168)
        ages = [(datetime.utcnow() - datetime.fromisoformat(d['extracted_at'].replace('Z', ''))).total_seconds() / 3600
                for d in data if 'extracted_at' in d]
        avg_age = sum(ages) / len(ages) if ages else 0
        freshness = max(0, 1 - (avg_age / 168)) * 0.3
        # Source diversity: unique domains / 5
        unique_domains = len(set(urlparse(d['url']).netloc for d in data))
        diversity = min(unique_domains / 5, 1.0) * 0.2
        # Data completeness: non-null fields / total fields
        total_fields = sum(len(d['data']) for d in data) or 1
        completeness = sum(1 for d in data for v in d['data'].values() if v) / total_fields * 0.3
        return citation_density + freshness + diversity + completeness

    def _check_contradictions(self, summary: str, data: List[dict]) -> List[dict]:
        """Verify numbers in summary appear in source data"""
        contradictions = []
        numbers = re.findall(r'\d+(?:\.\d+)?', summary)
        data_str = str(data)
        for num in numbers:
            if num not in data_str:
                contradictions.append({'issue': f"Number {num} not in sources", 'severity': 'warning'})
        return contradictions
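As a worked example of the weighting (numbers chosen for illustration): a 200-word summary with 3 citations scores min(3 / 4, 1) × 0.2 = 0.15 for density; sources averaging 24 hours old give (1 − 24/168) × 0.3 ≈ 0.26 for freshness; 4 distinct domains give 0.8 × 0.2 = 0.16 for diversity; 90% of fields populated gives 0.9 × 0.3 = 0.27 for completeness. The total, roughly 0.84, clears the > 0.8 summary-confidence target in the evaluation metrics below.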
Evaluation Metrics
| Metric Category | Metric | Target | Measurement Method |
|---|---|---|---|
| Freshness | Time to Index | < 1 hour | Time from page update to indexed |
| Freshness | Update Latency (p95) | < 4 hours | Time from change detection to alert |
| Coverage | Source Coverage | > 95% | Monitored sources / Target sources |
| Coverage | Entity Completeness | > 90% | Extracted fields / Total fields |
| Accuracy | Extraction Precision | > 95% | Correct extractions / Total extractions |
| Accuracy | Price Accuracy | > 99% | Correct prices / Total prices |
| Accuracy | Claim Verification Rate | > 90% | Verified claims / Total claims |
| Quality | Summary Confidence | > 0.8 | Average confidence score |
| Quality | Contradiction Rate | < 2% | Summaries with contradictions / Total |
| Utility | Analyst Time Saved | > 70% | (Old time - New time) / Old time |
| Utility | Actionable Insights Rate | > 60% | Insights acted upon / Total insights |
| Compliance | robots.txt Violations | 0 | Violations detected / Total crawls |
| Compliance | ToS Violations | 0 | Violations detected / Total operations |
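A minimal sketch of how a few of these metrics could be computed from crawl and review logs; the record fields (`is_correct`, `contradictions`) are assumptions for illustration:
from typing import List

def source_coverage(monitored: int, target: int) -> float:
    """Coverage: monitored sources / target sources (goal > 0.95)."""
    return monitored / target if target else 0.0

def extraction_precision(records: List[dict]) -> float:
    """Precision over manually reviewed extractions: correct / reviewed (goal > 0.95)."""
    reviewed = [r for r in records if 'is_correct' in r]   # assumed review flag
    return sum(r['is_correct'] for r in reviewed) / len(reviewed) if reviewed else 0.0

def contradiction_rate(summaries: List[dict]) -> float:
    """Share of summaries flagged with at least one contradiction (goal < 0.02)."""
    if not summaries:
        return 0.0
    return sum(1 for s in summaries if s.get('contradictions')) / len(summaries)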
Real-World Case Study: Retail Competitive Intelligence
Scenario: A major retailer needs to monitor pricing and product offerings across 25 competitors, tracking 10,000+ products.
System Flow:
graph TD
    A[Hourly Cron Job] --> B[Get Sources to Crawl]
    B --> C[Politeness Check]
    C -->|Pass| D[Fetch Content]
    C -->|Fail| E[Skip]
    D --> F[Deduplicate]
    F -->|Duplicate| E
    F -->|New| G[Change Detection]
    G -->|No Changes| E
    G -->|Changes| H[Extract Prices + Entities]
    H --> I[Store Snapshot]
    I --> J[Generate Alerts]
    J --> K[Create Summary]
    style C fill:#ffe1e1
    style H fill:#e1f5ff
    style K fill:#e1ffe1
Implementation Pattern:
from typing import List

class RetailCIBot:
    async def monitor_competitor_pricing(self, competitor: str) -> dict:
        """End-to-end monitoring with 7-step pipeline"""
        sources = self.source_registry.get_competitor_sources(competitor)
        changes = []
        pricing_data = []
        for source in sources:
            # 1. Politeness check
            if not (await self.politeness.can_fetch(source.url))['can_fetch']:
                continue
            # 2. Fetch content
            content = await self._fetch(source.url)
            # 3. Deduplication
            if self.deduplicator.is_duplicate(source.url, content):
                continue
            # 4. Change detection
            change = await self.change_detector.detect_changes(source.url, content)
            if not change['has_changes']:
                continue
            changes.append(change)
            # 5. Extract prices + entities
            prices = self.price_extractor.extract_prices(content)
            entities = self.entity_extractor.extract(source.url, content, competitor)
            pricing_data.extend(prices)
            # 6. Store snapshot
            await self._store_snapshot(source.url, content, prices, entities)
        # 7. Generate summary and alerts
        if changes:
            summary = await self.summarizer.generate_summary(competitor, pricing_data, "24h")
            alerts = self._generate_alerts(changes, pricing_data)
            return {'competitor': competitor, 'changes': len(changes), 'summary': summary, 'alerts': alerts}
        return {'competitor': competitor, 'changes': 0, 'message': 'No changes'}

    def _generate_alerts(self, changes: List[dict], prices: List[dict]) -> List[dict]:
        """Alert on price drops >10% and new products"""
        alerts = []
        # Price drops (prices are assumed to be enriched upstream with previous_price and product context)
        for p in prices:
            if 'previous_price' in p and ((p['previous_price'] - p['amount']) / p['previous_price']) > 0.1:
                alerts.append({'type': 'price_drop', 'severity': 'high', 'product': p['context']['product_name']})
        # New-product alerts would compare extracted entities against the known catalog (elided in this sketch)
        return alerts
Results After 6 Months:
| Metric | Before CI Bot | After CI Bot | Improvement |
|---|---|---|---|
| Competitors Monitored | 5 | 25 | 400% increase |
| Products Tracked | 500 | 10,000 | 1900% increase |
| Update Frequency | Weekly manual | Hourly automated | Real-time vs. weekly |
| Analyst Time on Data Collection | 20 hours/week | 2 hours/week | 90% reduction |
| Price Change Detection Time | 7 days average | 2 hours average | 98% faster |
| Pricing Errors Caught | 2 per month | 45 per month | 2150% increase |
| Actionable Insights Generated | 5 per month | 120 per month | 2300% increase |
| ToS/Legal Violations | 0 | 0 | Maintained compliance |
Key Success Factors:
- ✅ Strict politeness controls prevented any legal issues
- ✅ Change detection reduced noise, focusing only on updates
- ✅ Citation linking enabled rapid verification of insights
- ✅ Contradiction checking caught 15 potential errors before delivery
- ✅ Evidence snapshots resolved 8 disputes about competitor claims
Implementation Checklist
Phase 1: Source Discovery (Week 1-2)
- Define target competitors (minimum 10) and source types
- Build source registry with priority and frequency configuration
- Implement sitemap parser and RSS feed monitor
- Create manual URL queue for ad-hoc monitoring
Phase 2: Politeness & Compliance (Week 3-4)
- Implement robots.txt checker with 24h caching
- Build rate limiter (10 req/min per domain default)
- Configure domain allowlist for all competitors
- Review ToS for each competitor and document policies
Phase 3: Content Processing (Week 5-6)
- Implement URL canonicalization (tracking param removal, sorting)
- Build content hash deduplication
- Create snapshot storage and diff generation
- Configure change detection thresholds (>10 lines = significant)
Phase 4: Extraction (Week 7-9)
- Define extraction schemas per competitor (CSS selectors)
- Implement price extractor with currency detection
- Build claim extractor (patterns: %, $, comparatives, superlatives)
- Create extraction validation and error handling
Phase 5: Intelligence Generation (Week 10-12)
- Integrate LLM for summary generation with citation requirements
- Implement confidence scoring (4 factors: density, freshness, diversity, completeness)
- Build contradiction checker (verify numbers in sources)
- Create alert rules (price drops >10%, new products, feature changes)
Phase 6: Evidence & Deployment (Week 13-16)
- Implement snapshot storage with chain-of-custody logging
- Build compliance audit logging and reports
- Test all extraction types and change detection
- Gradual rollout: 5 competitors → 15 → full (25)
Phase 7: Ongoing Operations
- Weekly: Review extraction accuracy, update rules as sites change
- Monthly: Evaluate source coverage, summary quality, compliance
- Quarterly: Add competitors/sources, improve prompts, expand alerts
Common Pitfalls and Solutions
| Pitfall | Impact | Solution |
|---|---|---|
| Ignoring robots.txt | Legal action, IP bans | Always check robots.txt; honor all directives |
| Aggressive crawling | IP bans, reputation damage | Conservative rate limits (10 req/min or less) |
| Poor URL canonicalization | Duplicate processing, wasted resources | Normalize URLs; filter tracking parameters |
| No change detection | Processing unchanged content | Implement content hashing and diff generation |
| Extraction brittleness | Breaks when sites update | Multiple fallback selectors; schema versioning |
| Missing citations | Cannot verify claims | Link every fact to source URL and timestamp |
| Stale data | Outdated intelligence | Timestamp everything; show data age in reports |
| No contradiction checking | Misleading summaries | Validate summary claims against source data |
| PII in snapshots | Privacy violations | Filter PII before storage; redact in evidence |
| No source diversity | Biased intelligence | Monitor multiple sources per competitor |
Summary
Building effective competitive intelligence systems requires balancing three priorities:
- Legal Compliance: Strict adherence to robots.txt, ToS, and rate limits
- Data Quality: Accurate extraction, freshness tracking, and citation linking
- Analyst Utility: Actionable insights, not just raw data dumps
The retail CI bot case study demonstrates that with proper implementation, CI systems can:
- Monitor 5x more competitors with 90% less manual effort
- Detect competitive moves 98% faster than manual monitoring
- Generate 24x more actionable insights while maintaining zero legal violations
The key success factors are:
- Conservative politeness settings (err on the side of caution)
- Comprehensive evidence preservation (every fact must be verifiable)
- Change detection to reduce noise and focus on what matters
- Grounded summarization with citations and contradiction checking
- Continuous monitoring of extraction accuracy and legal compliance
By following the implementation checklist and avoiding common pitfalls, you can deploy CI systems that provide strategic advantage while respecting legal and ethical boundaries.