Chapter 13 — Data Discovery & Cataloging
Overview
Establish a complete, trusted inventory of data sources with ownership, lineage, and usage constraints. Data discovery and cataloging form the foundation of any successful AI program by providing visibility into what data exists, where it resides, who owns it, and how it can be used. Without this foundation, AI teams operate blindly, leading to duplicated efforts, compliance violations, and poor model performance.
Why It Matters
AI programs depend on trustworthy data. Discovery and cataloging create clarity on what exists, who owns it, and how it can be used. Consider these critical scenarios:
- Avoiding Redundancy: Teams unknowingly duplicate data pipelines because they can't find existing sources
- Ensuring Compliance: Without proper cataloging, teams may use data that violates privacy regulations
- Enabling Collaboration: Data scientists waste weeks searching for the right datasets instead of building models
- Managing Dependencies: Changes to upstream systems break AI pipelines without warning
- Tracking Lineage: When issues arise, teams can't trace data back to its source
Discovery Process Flow
```mermaid
graph TB
    A[Start Discovery] --> B{Discovery Method}
    B -->|Automated| C[Schema Scanners]
    B -->|Manual| D[Stakeholder Interviews]
    B -->|Hybrid| E[Combined Approach]
    C --> F[Extract Metadata]
    D --> F
    E --> F
    F --> G[Profile Data Sources]
    G --> H[Document Ownership]
    H --> I[Map Lineage]
    I --> J[Classify Sensitivity]
    J --> K{Complete?}
    K -->|No| L[Identify Gaps]
    L --> B
    K -->|Yes| M[Populate Catalog]
    M --> N[Ongoing Maintenance]
    style A fill:#e1f5ff,stroke:#333,stroke-width:2px
    style M fill:#d4edda,stroke:#333,stroke-width:2px
    style K fill:#fff3cd,stroke:#333,stroke-width:2px
```
Core Discovery Activities
1. Source Inventory Matrix
| Source Type | Examples | Discovery Method | Typical Metadata | Priority |
|---|---|---|---|---|
| Databases | PostgreSQL, MySQL, Oracle | JDBC scanners, schema dumps | Tables, columns, types, constraints | P0 |
| Data Warehouses | Snowflake, BigQuery, Redshift | Cloud APIs, information schema | Schemas, views, materialized views | P0 |
| Data Lakes | S3, ADLS, GCS | File system crawlers | Directories, file formats, partitions | P1 |
| APIs | REST, GraphQL, gRPC | Swagger/OpenAPI, manual docs | Endpoints, schemas, rate limits | P1 |
| File Systems | NFS, SMB, local storage | File scanners | File types, sizes, modification dates | P2 |
| SaaS Applications | Salesforce, HubSpot, Zendesk | Connector catalogs | Objects, fields, sync frequency | P2 |
| Streaming | Kafka, Kinesis, Pulsar | Topic inspection | Topics, schemas, retention | P1 |
| Legacy Systems | Mainframes, COBOL systems | Manual documentation | COBOL copybooks, flat files | P2 |
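For the relational sources in this matrix (the P0 databases and warehouses), automated schema extraction can usually be bootstrapped with a few lines of code. The sketch below uses SQLAlchemy's inspection API to pull table and column metadata into plain dictionaries that can later be loaded into a catalog; the connection string and output shape are illustrative assumptions, not any specific tool's format.

```python
# Minimal sketch: extract table/column metadata from a relational source
# with SQLAlchemy's inspection API. The connection string is a placeholder.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://catalog_reader:***@db-host:5432/sales")  # placeholder DSN
inspector = inspect(engine)

assets = []
for schema in inspector.get_schema_names():
    for table in inspector.get_table_names(schema=schema):
        columns = inspector.get_columns(table, schema=schema)
        assets.append({
            "source_type": "database",
            "schema": schema,
            "table": table,
            "columns": [
                {"name": c["name"], "type": str(c["type"]), "nullable": c["nullable"]}
                for c in columns
            ],
        })

# `assets` can now be profiled and pushed into the catalog's metadata repository.
print(f"Discovered {len(assets)} tables")
```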
2. Data Profiling Dimensions
```mermaid
graph LR
    subgraph "Structural Profiling"
        S1[Schema Analysis]
        S2[Data Types]
        S3[Constraints]
        S4[Relationships]
    end
    subgraph "Content Profiling"
        C1[Volume Metrics]
        C2[Null Rates]
        C3[Value Distributions]
        C4[Patterns/Formats]
    end
    subgraph "Operational Profiling"
        O1[Update Frequency]
        O2[Access Patterns]
        O3[Performance Metrics]
        O4[Dependencies]
    end
    subgraph "Business Profiling"
        B1[Ownership]
        B2[Domain]
        B3[Use Cases]
        B4[Business Rules]
    end
    S1 & S2 & S3 & S4 --> DP[Data Profile]
    C1 & C2 & C3 & C4 --> DP
    O1 & O2 & O3 & O4 --> DP
    B1 & B2 & B3 & B4 --> DP
    DP --> Catalog[Data Catalog]
    style Catalog fill:#f96,stroke:#333,stroke-width:3px
```
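The content-profiling dimensions (volume, null rates, value distributions) can be approximated on a sample of each dataset. The following is a minimal pandas sketch, assuming one sampled DataFrame per asset; production profilers typically push this work down to the source system instead of pulling data out.

```python
# Minimal sketch: content-profiling metrics for one sampled dataset (pandas).
import pandas as pd

def profile_sample(df: pd.DataFrame) -> dict:
    """Compute a few content-profiling dimensions for a sampled table."""
    profile = {
        "row_count": len(df),
        "column_count": df.shape[1],
        "null_rate": df.isna().mean().round(4).to_dict(),      # null rate per column
        "distinct_count": df.nunique(dropna=True).to_dict(),   # cardinality per column
    }
    # Value distributions for low-cardinality (likely categorical) columns
    profile["top_values"] = {
        col: df[col].value_counts(normalize=True).head(5).to_dict()
        for col in df.columns
        if df[col].nunique(dropna=True) <= 20
    }
    return profile

# Example usage on a small invented sample
sample = pd.DataFrame({"channel": ["web", "store", "web", None],
                       "amount": [10.0, 25.5, None, 7.25]})
print(profile_sample(sample))
```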
3. Ownership & Stewardship Model
```mermaid
graph TD
    subgraph "Governance Roles"
        DO[Data Owner<br/>Accountable]
        DS[Data Steward<br/>Day-to-day Management]
        DC[Data Custodian<br/>Technical Implementation]
        DU[Data Consumer<br/>Usage & Feedback]
    end
    DO -->|Approves| A1[Access Requests]
    DO -->|Sets| A2[Retention Policies]
    DO -->|Defines| A3[Quality Standards]
    DS -->|Monitors| B1[Quality Metrics]
    DS -->|Resolves| B2[Data Issues]
    DS -->|Updates| B3[Metadata]
    DC -->|Manages| C1[Database Systems]
    DC -->|Implements| C2[Security Controls]
    DC -->|Performs| C3[Backups]
    DU -->|Submits| D1[Access Requests]
    DU -->|Reports| D2[Quality Issues]
    DU -->|Provides| D3[Feedback]
    style DO fill:#ffcccc,stroke:#333,stroke-width:2px
    style DS fill:#cce5ff,stroke:#333,stroke-width:2px
    style DC fill:#ccffcc,stroke:#333,stroke-width:2px
```
Data Lineage Mapping
End-to-End Lineage Architecture
```mermaid
graph LR
    subgraph "Source Systems"
        S1[(CRM Database)]
        S2[(E-commerce API)]
        S3[Mobile App Events]
        S4[(ERP System)]
    end
    subgraph "Raw Layer"
        R1[Raw Customer]
        R2[Raw Orders]
        R3[Raw Events]
        R4[Raw Finance]
    end
    subgraph "Curated Layer"
        C1[Customer Master]
        C2[Order History]
        C3[User Behavior]
        C4[Financial Data]
    end
    subgraph "Analytics Layer"
        A1[Customer 360]
        A2[Revenue Analytics]
        A3[Behavioral Insights]
    end
    subgraph "ML Layer"
        M1[Customer Features]
        M2[Churn Model]
        M3[Recommendation Model]
    end
    S1 -->|CDC| R1
    S2 -->|API Pull| R2
    S3 -->|Stream| R3
    S4 -->|Batch| R4
    R1 -->|Dedupe| C1
    R2 -->|Join| C2
    R3 -->|Aggregate| C3
    R4 -->|Transform| C4
    C1 & C2 --> A1
    C2 & C4 --> A2
    C3 --> A3
    C1 & C2 & C3 --> M1
    M1 --> M2
    M1 & C2 --> M3
    style M2 fill:#f96,stroke:#333,stroke-width:3px
    style M3 fill:#f96,stroke:#333,stroke-width:3px
```
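Once flows like these are captured, representing them as a directed graph makes impact analysis cheap: given a changed source, walk the graph to find everything downstream. A minimal sketch with networkx, using a few invented node names based on the edges in the diagram above:

```python
# Minimal sketch: lineage edges as a directed graph for impact analysis (networkx).
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("crm_database", "raw_customer"),
    ("raw_customer", "customer_master"),
    ("customer_master", "customer_360"),
    ("customer_master", "customer_features"),
    ("customer_features", "churn_model"),
    ("customer_features", "recommendation_model"),
])

# Everything downstream of the CRM source, i.e. what a breaking change would affect
print(sorted(nx.descendants(lineage, "crm_database")))

# Upstream provenance of the churn model, i.e. everything it depends on
print(sorted(nx.ancestors(lineage, "churn_model")))
```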
Lineage Capture Methods Comparison
| Method | How It Works | Typical Accuracy | Coverage | Cost | Best For |
|---|---|---|---|---|---|
| SQL Parsing | Parse queries to extract dependencies | ~95% | SQL only | Low | Data warehouses, dbt |
| Runtime Instrumentation | Capture actual data flows during execution | ~99% | All operations | High | Complex pipelines |
| Manual Documentation | Engineers document dependencies | 60-80% | Selective | Medium | Legacy systems |
| Metadata APIs | Query system metadata | ~90% | Platform-specific | Low | Cloud platforms |
| Hybrid Approach | Combine automated + manual | ~95% | Comprehensive | Medium | Enterprise catalogs |
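As a concrete illustration of the SQL-parsing method, the sketch below uses the open-source sqlglot parser to list the upstream tables referenced by a transformation query; in a dbt-style warehouse, running this over every model's compiled SQL yields a table-level lineage graph. The query text and table names are invented for illustration.

```python
# Minimal sketch: table-level lineage via SQL parsing (open-source sqlglot library).
import sqlglot
from sqlglot import exp

# Invented transformation query for a "Customer 360" asset
sql = """
SELECT c.customer_id, c.segment, SUM(o.amount) AS lifetime_value
FROM curated.customer_master AS c
JOIN curated.order_history AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.segment
"""

def qualified_name(table: exp.Table) -> str:
    """Render catalog.schema.table, skipping parts that are absent."""
    return ".".join(p for p in (table.catalog, table.db, table.name) if p)

parsed = sqlglot.parse_one(sql)
upstream = sorted({qualified_name(t) for t in parsed.find_all(exp.Table)})

# Recording (upstream -> analytics.customer_360) edges for every model
# builds the table-level lineage graph.
print(upstream)  # ['curated.customer_master', 'curated.order_history']
```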
Data Classification Framework
Sensitivity Classification
```mermaid
graph TB
    A[Data Asset] --> B{Contains PII/PHI?}
    B -->|Yes| C{Regulated Data?}
    B -->|No| D{Business Critical?}
    C -->|Yes - GDPR/HIPAA| E[Highly Sensitive<br/>Encrypt, Strict Access]
    C -->|No - Internal PII| F[Sensitive<br/>Access Controls]
    D -->|Yes| G{Public if Leaked?}
    D -->|No| H[Internal<br/>Standard Controls]
    G -->|No| I[Confidential<br/>Business Impact]
    G -->|Yes| J[Public<br/>Minimal Controls]
    style E fill:#ff6b6b,stroke:#333,stroke-width:3px
    style F fill:#ffa94d,stroke:#333,stroke-width:2px
    style I fill:#ffd43b,stroke:#333,stroke-width:2px
    style H fill:#74c0fc,stroke:#333,stroke-width:2px
    style J fill:#b2f2bb,stroke:#333,stroke-width:2px
```
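The decision tree translates directly into a small rule function that can be applied to every asset during automated classification. This is a minimal sketch with invented flag names; a real pipeline would derive the flags from PII scanners and business-criticality metadata.

```python
# Minimal sketch: rule-based sensitivity classification mirroring the decision tree.
# Flag names (has_pii, is_regulated, ...) are illustrative, not a standard schema.
def classify_sensitivity(has_pii: bool, is_regulated: bool,
                         business_critical: bool, harmless_if_public: bool) -> str:
    if has_pii:
        return "Highly Sensitive" if is_regulated else "Sensitive"
    if business_critical:
        return "Public" if harmless_if_public else "Confidential"
    return "Internal"

# Example: a GDPR-scoped customer table vs. an already-published price list
print(classify_sensitivity(True, True, True, False))   # Highly Sensitive
print(classify_sensitivity(False, False, True, True))  # Public
```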
Criticality vs. Sensitivity Matrix
| Classification | Criticality | Sensitivity | Required Controls | Examples |
|---|---|---|---|---|
| Critical + Highly Sensitive | P0 - Business stops | PII/PHI/Financial | Encryption, MFA, Audit logs, DR, Legal review | Customer SSN, Payment data, Health records |
| Critical + Moderate Sensitivity | P0 - Business stops | Proprietary | Access controls, Backups, Monitoring | Product inventory, Pricing algorithms |
| Important + Highly Sensitive | P1 - Degraded service | PII/Confidential | Encryption, Access controls, Audit logs | Employee data, Contract terms |
| Important + Low Sensitivity | P1 - Degraded service | Internal | Standard access controls | Sales reports, Marketing analytics |
| Non-Critical + Any Sensitivity | P2 - Minimal impact | Varies | Basic security practices | Test data, Public datasets |
Catalog Implementation Architecture
```mermaid
graph TB
    subgraph "Data Sources"
        DS1[Databases]
        DS2[Data Lakes]
        DS3[APIs]
        DS4[Streams]
    end
    subgraph "Collection Layer"
        CL1[Scanners & Crawlers]
        CL2[Lineage Extractors]
        CL3[Profilers]
        CL4[Quality Checkers]
    end
    subgraph "Catalog Core"
        CC1[Metadata Repository]
        CC2[Search Index]
        CC3[Lineage Graph]
        CC4[Access Control]
    end
    subgraph "Presentation Layer"
        PL1[Search & Discovery UI]
        PL2[Lineage Visualization]
        PL3[Data Quality Dashboards]
        PL4[REST API]
    end
    subgraph "Integration Layer"
        IL1[IDE Plugins]
        IL2[BI Tool Connectors]
        IL3[Governance Workflows]
        IL4[Access Request System]
    end
    DS1 & DS2 & DS3 & DS4 --> CL1 & CL2 & CL3 & CL4
    CL1 & CL2 & CL3 & CL4 --> CC1 & CC2 & CC3 & CC4
    CC1 & CC2 & CC3 & CC4 --> PL1 & PL2 & PL3 & PL4
    PL1 & PL2 & PL3 & PL4 --> IL1 & IL2 & IL3 & IL4
    style CC1 fill:#f96,stroke:#333,stroke-width:3px
    style CC3 fill:#bbf,stroke:#333,stroke-width:2px
```
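The REST API in the presentation layer is also how the collection layer typically registers what it finds. The sketch below posts one asset's metadata to a hypothetical catalog endpoint (`/api/v1/datasets`); the URL, payload shape, and auth handling are assumptions for illustration, not the API of any specific catalog product.

```python
# Minimal sketch: registering a discovered asset with the catalog over REST.
# Endpoint, payload shape, and token handling are hypothetical.
import os
import requests

CATALOG_URL = "https://catalog.internal.example.com/api/v1/datasets"  # hypothetical endpoint

entry = {
    "name": "customer_transactions",
    "location": "s3://data-lake/curated/transactions/",
    "format": "parquet",
    "owner": "jane.smith@company.com",
    "sensitivity": "PII",
    "tags": ["pii", "revenue", "customer"],
}

resp = requests.post(
    CATALOG_URL,
    json=entry,
    headers={"Authorization": f"Bearer {os.environ['CATALOG_TOKEN']}"},
    timeout=30,
)
resp.raise_for_status()
print("registered:", resp.json().get("id"))
```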
Catalog Tool Selection Matrix
| Tool | Deployment | Best For | Strengths | Limitations | Cost Range | Decision Factor |
|---|---|---|---|---|---|---|
| AWS Glue Catalog | Cloud (AWS) | AWS-native workloads | Deep AWS integration, Serverless, Auto-discovery | AWS lock-in, Limited business metadata | $1/100K objects/mo | Choose if all-in on AWS |
| Azure Purview | Cloud (Azure) | Microsoft ecosystem | Cross-cloud scanning, Compliance features | Complex setup, Azure-centric | $500-5K/month | Choose for Azure + compliance |
| Alation | SaaS/On-prem | Enterprise-wide | AI-powered search, Collaboration, Rich UI | High cost, Admin overhead | $50K-250K/year | Choose for enterprise with budget |
| DataHub | Self-hosted | Open-source, Customizable | Free, Extensible, Active community, Modern | Self-managed, Setup complexity | Infrastructure only | Choose for control + no budget |
| Collibra | SaaS/On-prem | Governance-first orgs | Comprehensive governance, Workflows, Enterprise | Expensive, Heavy, Long setup | $100K-500K/year | Choose for regulated industries |
| Apache Atlas | Self-hosted | Hadoop ecosystem | Hadoop-native, Free, Metadata API | Limited UI, Hadoop-focused | Infrastructure only | Choose for Hadoop/Spark shops |
| Amundsen | Self-hosted | Data discovery | Simple, Fast search, Lightweight | Limited governance features | Infrastructure only | Choose for discovery-first |
Implementation Phasing
7-Week Implementation Plan
```mermaid
gantt
    title Data Cataloging Implementation Timeline
    dateFormat YYYY-MM-DD
    section Phase 1: Discovery
    Automated Scanning       :a1, 2025-10-03, 7d
    Stakeholder Interviews   :a2, 2025-10-03, 7d
    Schema Extraction        :a3, 2025-10-07, 7d
    section Phase 2: Ownership
    Ownership Workshops      :b1, 2025-10-14, 7d
    Steward Assignment       :b2, 2025-10-14, 7d
    section Phase 3: Lineage
    Lineage Mapping          :c1, 2025-10-21, 14d
    Dependency Documentation :c2, 2025-10-21, 14d
    section Phase 4: Classification
    Automated PII Detection  :d1, 2025-11-04, 7d
    Manual Classification    :d2, 2025-11-04, 7d
    section Phase 5: Gap Analysis
    Gap Identification       :e1, 2025-11-11, 7d
    Remediation Planning     :e2, 2025-11-11, 7d
```
Phase-by-Phase Checklist
Week 1-2: Initial Discovery
□ Identify all data sources (databases, APIs, files, streams)
□ Run automated scanning tools (Glue Crawler, Purview, etc.; see the crawler sketch after this checklist)
□ Interview stakeholders for undocumented sources
□ Document source systems, schemas, and volumes
□ Capture update cadence and access patterns
□ Estimate data volumes and growth rates
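For AWS-hosted sources, the automated-scanning item above can be as simple as registering and starting a Glue crawler. A minimal boto3 sketch, assuming an existing IAM role and an illustrative S3 path:

```python
# Minimal sketch: bootstrap discovery of an S3 prefix with an AWS Glue crawler (boto3).
# The role ARN, database name, and S3 path are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="discover-curated-transactions",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="curated",
    Targets={"S3Targets": [{"Path": "s3://data-lake/curated/transactions/"}]},
)
glue.start_crawler(Name="discover-curated-transactions")

# Once the crawler finishes, the discovered tables land in the Glue Data Catalog:
tables = glue.get_tables(DatabaseName="curated")["TableList"]
print([t["Name"] for t in tables])
```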
Week 3: Ownership Assignment
□ Conduct ownership workshops by domain
□ Assign data owners for each domain/dataset
□ Designate steward teams for day-to-day management
□ Document escalation paths for issues
□ Establish ownership review cadence (quarterly)
□ Create accountability metrics (response time, quality)
Week 4-5: Lineage Mapping
□ Map upstream sources for critical datasets
□ Document transformation logic and dependencies
□ Identify downstream consumers and use cases
□ Record data contracts and SLAs
□ Visualize end-to-end data flows
□ Test lineage accuracy with sample queries
Week 6: Classification
□ Run automated PII/PHI detection tools (see the detection sketch after this checklist)
□ Manually review and validate classifications
□ Apply sensitivity labels to all datasets
□ Document compliance requirements (GDPR, HIPAA, etc.)
□ Implement appropriate security controls
□ Create classification audit trail
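A first pass at automated PII detection can be a pattern scan over sampled column values; purpose-built detectors (cloud DLP services, open-source analyzers) go much further, but a simple sketch like the one below already catches obvious cases. The regexes and the 10% match threshold are illustrative choices, not a standard.

```python
# Minimal sketch: regex-based PII screening of sampled column values.
# Patterns and the 10% threshold are illustrative; real detectors are more robust.
import re

PII_PATTERNS = {
    "email":  re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone":  re.compile(r"^\+?[\d\s().-]{7,15}$"),
}

def detect_pii(column_samples: dict[str, list[str]], threshold: float = 0.1) -> dict[str, list[str]]:
    """Return, per column, the PII types matched by more than `threshold` of sampled values."""
    findings = {}
    for column, values in column_samples.items():
        values = [v for v in values if v]
        if not values:
            continue
        hits = [
            pii_type
            for pii_type, pattern in PII_PATTERNS.items()
            if sum(bool(pattern.match(str(v).strip())) for v in values) / len(values) > threshold
        ]
        if hits:
            findings[column] = hits
    return findings

samples = {"contact": ["a@example.com", "b@example.com"], "notes": ["called twice", "no answer"]}
print(detect_pii(samples))  # {'contact': ['email']}
```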
Week 7: Gap Analysis & Remediation
□ Compare discovered assets vs. expected inventory (see the diff sketch after this checklist)
□ Identify missing metadata, ownership, lineage
□ Prioritize gaps by business impact
□ Create remediation plan with timelines
□ Assign owners for gap closure
□ Schedule follow-up reviews
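The comparison in the first item above is essentially a set difference between the expected inventory (for example, a CMDB or application-registry export) and what the scanners actually found. A minimal sketch with invented asset identifiers:

```python
# Minimal sketch: gap identification as a set difference between expected and discovered assets.
# Asset identifiers are invented for illustration.
expected = {"sales.orders", "sales.customers", "finance.ledger", "hr.payroll"}
discovered = {"sales.orders", "sales.customers", "finance.ledger", "marketing.campaigns"}

missing_from_catalog = expected - discovered   # known systems the scanners never reached
undocumented_finds = discovered - expected     # shadow datasets nobody had inventoried

print("Not yet cataloged:", sorted(missing_from_catalog))  # ['hr.payroll']
print("Undocumented:", sorted(undocumented_finds))         # ['marketing.campaigns']
```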
Metadata Model Example
Essential Catalog Metadata (one minimal example):
```yaml
# Minimal example: Dataset catalog entry
dataset:
  name: customer_transactions
  location: s3://data-lake/curated/transactions/

  # Technical
  format: parquet
  partitions: [date]
  row_count: 45000000
  size_gb: 120

  # Business
  owner: jane.smith@company.com
  domain: Sales
  description: "Customer transaction records from all channels"
  tags: [pii, revenue, customer]

  # Governance
  sensitivity: PII
  compliance: [GDPR, PCI-DSS]
  retention_days: 2555

  # Quality
  freshness_sla: "< 4 hours"
  quality_score: 0.94
```
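Completeness metrics like the ones reported in the deliverables later in this chapter can be computed directly from entries in this shape. A minimal sketch, assuming the YAML layout above and an invented set of required fields per metadata facet:

```python
# Minimal sketch: per-facet metadata completeness for one catalog entry.
# The required-field lists are illustrative, not a standard.
import yaml

REQUIRED_FIELDS = {
    "technical":  ["name", "location", "format"],
    "business":   ["owner", "domain", "description", "tags"],
    "governance": ["sensitivity", "compliance", "retention_days"],
    "quality":    ["freshness_sla", "quality_score"],
}

def completeness(entry: dict) -> dict[str, float]:
    """Fraction of required fields present (non-empty) in each facet."""
    return {
        facet: sum(1 for f in fields if entry.get(f) not in (None, "", [])) / len(fields)
        for facet, fields in REQUIRED_FIELDS.items()
    }

with open("customer_transactions.yaml") as fh:  # the example entry above
    entry = yaml.safe_load(fh)["dataset"]

print(completeness(entry))  # e.g. {'technical': 1.0, 'business': 1.0, ...}
```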
Real-World Case Study: Financial Services Cataloging
Challenge
A large bank with 3,500+ databases across 200+ applications had no central catalog; AI teams spent roughly 40% of their time searching for data.
Implementation
- Timeline: 7 weeks
- Team: 3 data engineers, 1 governance lead
- Tools: Azure Purview + custom scripts
Discovery Findings
| Discovery Result | Count | Impact | Resolution |
|---|---|---|---|
| Tables discovered | 12,000 | Complete inventory | Cataloged all |
| "Customer" tables with conflicting schemas | 8 | Data inconsistency | Established canonical source |
| Duplicate ETL pipelines | 23 | Wasted $120K/year | Decommissioned 18 |
| PII tables without encryption | 156 | Compliance violation | Implemented encryption |
| Critical datasets with no owner | 89 | No accountability | Assigned via workshops |
Results After 6 Months
| Metric | Before | After | Improvement |
|---|---|---|---|
| Data discovery time | 3-5 days | 15 minutes | 99% reduction |
| Data quality incidents | 40/month | 15/month | 62% reduction |
| AI project delivery time | 16 weeks | 11 weeks | 31% faster |
| Compliance audit findings | 23 | 2 | 91% reduction |
| Infrastructure costs (duplicate pipelines) | $450K/year | $330K/year | 27% savings |
Key Success Factors
- Executive sponsorship and budget allocation
- Automated scanning covered 80% of discovery
- Domain-led ownership workshops
- Iterative approach (critical assets first)
- Clear metrics and regular progress reviews
Deliverables
1. Data Catalog with Completeness Metrics
```
Total Assets Cataloged: 847
├── Production Databases: 234 (100% metadata complete)
├── Data Lakes/Warehouses: 89 (100% metadata complete)
├── APIs: 156 (87% metadata complete)
├── File Systems: 248 (65% metadata complete)
└── External Sources: 120 (78% metadata complete)

Metadata Completeness:
├── Technical Metadata: 100% (schema, location, format)
├── Business Metadata: 87% (owner, description, tags)
├── Operational Metadata: 92% (freshness, quality scores)
└── Governance Metadata: 78% (sensitivity, compliance)

Coverage: 95% of critical datasets, 75% of all datasets
```
2. Ownership Matrix
| Domain | Owner | Steward Team | Datasets | Coverage | Avg Response Time |
|---|---|---|---|---|---|
| Sales | Jane Smith | Sales Analytics | 67 | 100% | 4 hours |
| Marketing | John Doe | Marketing Ops | 43 | 95% | 6 hours |
| Product | Alice Johnson | Product Analytics | 89 | 98% | 3 hours |
| Finance | Bob Wilson | Finance IT | 34 | 100% | 2 hours |
| Engineering | Carol Martinez | Platform Team | 156 | 85% | 8 hours |
3. Gap Remediation Roadmap
| Priority | Gap | Affected Assets | Remediation | Owner | Timeline | Status |
|---|---|---|---|---|---|---|
| P0 | PII without encryption | 23 tables | Implement column-level encryption | Security Team | 2 weeks | In Progress |
| P0 | No assigned owner | 89 datasets | Ownership workshops | Data Governance | 3 weeks | 60% Complete |
| P1 | Schema drift detected | 45 tables | Implement schema validation | Platform Team | 4 weeks | Planned |
| P1 | Missing lineage | 134 datasets | Deploy lineage tracking | Data Engineering | 6 weeks | Planned |
| P2 | Stale documentation | 267 datasets | Automate metadata refresh | Catalog Admin | 8 weeks | Planned |
| P2 | Duplicate datasets | 56 pairs | Consolidation project | Domain Owners | 12 weeks | Scoping |
Best Practices
- Start with High-Value Assets: Focus on critical datasets used by production models first
- Automate Discovery: Use tools to bootstrap catalog (80% automated, 20% manual)
- Engage Domain Experts: Business context requires human input, not just technical scanning
- Make It Self-Service: Enable users to discover and request access without tickets
- Keep Catalog Current: Automate metadata refresh to prevent staleness
- Measure Adoption: Track search queries, access requests, time-to-data metrics
- Integrate into Workflows: Embed catalog in data development tools (IDEs, notebooks)
- Lightweight Governance: Balance control with agility; don't over-bureaucratize
Common Pitfalls to Avoid
- Boiling the Ocean: Trying to catalog everything at once (start with critical 20%)
- Metadata Burden: Making metadata entry too complex (keep it simple, automate)
- Catalog Shelfware: Building catalog that nobody uses (prioritize user experience)
- Stale Information: Letting catalog become outdated (automate refresh, set SLAs)
- Lack of Enforcement: Not requiring catalog usage for data access
- Tool Over-Engineering: Choosing complex enterprise tools for simple needs
- Missing Business Context: Focusing only on technical metadata
- Ignoring Legacy Systems: Only cataloging modern cloud infrastructure
- No Ownership Model: Building catalog without clear data ownership
- Poor Search Experience: Complex interfaces that frustrate users
Decision Framework: When to Invest in Data Cataloging
```mermaid
graph TD
    A[Should we build a data catalog?] --> B{How many data sources?}
    B -->|< 10| C[Spreadsheet/Wiki OK]
    B -->|10-50| D{How many users?}
    B -->|> 50| E[Definitely need catalog]
    D -->|< 5| C
    D -->|> 5| F{Regulated industry?}
    F -->|Yes| E
    F -->|No| G{Data quality issues?}
    G -->|Yes| E
    G -->|No| H[Start simple, grow later]
    style E fill:#d4edda,stroke:#333,stroke-width:2px
    style C fill:#fff3cd,stroke:#333,stroke-width:2px
    style H fill:#cce5ff,stroke:#333,stroke-width:2px
```
Investment Decision Criteria:
- Definitely invest if: >50 data sources, >10 data users, regulated industry, frequent data quality issues, AI/ML at scale
- Start simple if: <10 sources, <5 users, simple use cases, small team
- Plan to scale if: Growing team, expanding data landscape, planning AI investments