Chapter 13 — Data Discovery & Cataloging
Overview
Establish a complete, trusted inventory of data sources with ownership, lineage, and usage constraints. Data discovery and cataloging form the foundation of any successful AI program by providing visibility into what data exists, where it resides, who owns it, and how it can be used. Without this foundation, AI teams operate blindly, leading to duplicated efforts, compliance violations, and poor model performance.
Why It Matters
AI programs depend on trustworthy data. Discovery and cataloging create clarity on what exists, who owns it, and how it can be used. Consider these critical scenarios:
- Avoiding Redundancy: Teams unknowingly duplicate data pipelines because they can't find existing sources
- Ensuring Compliance: Without proper cataloging, teams may use data that violates privacy regulations
- Enabling Collaboration: Data scientists waste weeks searching for the right datasets instead of building models
- Managing Dependencies: Changes to upstream systems break AI pipelines without warning
- Tracking Lineage: When issues arise, teams can't trace data back to its source
Discovery Process Flow
```mermaid
graph TB
    A[Start Discovery] --> B{Discovery Method}
    B -->|Automated| C[Schema Scanners]
    B -->|Manual| D[Stakeholder Interviews]
    B -->|Hybrid| E[Combined Approach]
    C --> F[Extract Metadata]
    D --> F
    E --> F
    F --> G[Profile Data Sources]
    G --> H[Document Ownership]
    H --> I[Map Lineage]
    I --> J[Classify Sensitivity]
    J --> K{Complete?}
    K -->|No| L[Identify Gaps]
    L --> B
    K -->|Yes| M[Populate Catalog]
    M --> N[Ongoing Maintenance]
    style A fill:#e1f5ff,stroke:#333,stroke-width:2px
    style M fill:#d4edda,stroke:#333,stroke-width:2px
    style K fill:#fff3cd,stroke:#333,stroke-width:2px
```
Core Discovery Activities
1. Source Inventory Matrix
| Source Type | Examples | Discovery Method | Typical Metadata | Priority |
|---|---|---|---|---|
| Databases | PostgreSQL, MySQL, Oracle | JDBC scanners, schema dumps | Tables, columns, types, constraints | P0 |
| Data Warehouses | Snowflake, BigQuery, Redshift | Cloud APIs, information schema | Schemas, views, materialized views | P0 |
| Data Lakes | S3, ADLS, GCS | File system crawlers | Directories, file formats, partitions | P1 |
| APIs | REST, GraphQL, gRPC | Swagger/OpenAPI, manual docs | Endpoints, schemas, rate limits | P1 |
| File Systems | NFS, SMB, local storage | File scanners | File types, sizes, modification dates | P2 |
| SaaS Applications | Salesforce, HubSpot, Zendesk | Connector catalogs | Objects, fields, sync frequency | P2 |
| Streaming | Kafka, Kinesis, Pulsar | Topic inspection | Topics, schemas, retention | P1 |
| Legacy Systems | Mainframes, COBOL systems | Manual documentation | COBOL copybooks, flat files | P2 |
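For the relational sources in this matrix (the P0 databases and warehouses), automated schema extraction can usually be bootstrapped with a few lines of code. The sketch below uses SQLAlchemy's inspection API to pull table and column metadata into plain dictionaries that can later be loaded into a catalog; the connection string and output shape are illustrative assumptions, not any specific tool's format.

```python
# Minimal sketch: extract table/column metadata from a relational source
# with SQLAlchemy's inspection API. The connection string is a placeholder.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://catalog_reader:***@db-host:5432/sales")  # placeholder DSN
inspector = inspect(engine)

assets = []
for schema in inspector.get_schema_names():
    for table in inspector.get_table_names(schema=schema):
        columns = inspector.get_columns(table, schema=schema)
        assets.append({
            "source_type": "database",
            "schema": schema,
            "table": table,
            "columns": [
                {"name": c["name"], "type": str(c["type"]), "nullable": c["nullable"]}
                for c in columns
            ],
        })

# `assets` can now be profiled and pushed into the catalog's metadata repository.
print(f"Discovered {len(assets)} tables")
```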
2. Data Profiling Dimensions
```mermaid
graph LR
    subgraph "Structural Profiling"
        S1[Schema Analysis]
        S2[Data Types]
        S3[Constraints]
        S4[Relationships]
    end
    subgraph "Content Profiling"
        C1[Volume Metrics]
        C2[Null Rates]
        C3[Value Distributions]
        C4[Patterns/Formats]
    end
    subgraph "Operational Profiling"
        O1[Update Frequency]
        O2[Access Patterns]
        O3[Performance Metrics]
        O4[Dependencies]
    end
    subgraph "Business Profiling"
        B1[Ownership]
        B2[Domain]
        B3[Use Cases]
        B4[Business Rules]
    end
    S1 & S2 & S3 & S4 --> DP[Data Profile]
    C1 & C2 & C3 & C4 --> DP
    O1 & O2 & O3 & O4 --> DP
    B1 & B2 & B3 & B4 --> DP
    DP --> Catalog[Data Catalog]
    style Catalog fill:#f96,stroke:#333,stroke-width:3px
```
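The content-profiling dimensions (volume, null rates, value distributions) can be approximated on a sample of each dataset. The following is a minimal pandas sketch, assuming one sampled DataFrame per asset; production profilers typically push this work down to the source system instead of pulling data out.

```python
# Minimal sketch: content-profiling metrics for one sampled dataset (pandas).
import pandas as pd

def profile_sample(df: pd.DataFrame) -> dict:
    """Compute a few content-profiling dimensions for a sampled table."""
    profile = {
        "row_count": len(df),
        "column_count": df.shape[1],
        "null_rate": df.isna().mean().round(4).to_dict(),      # null rate per column
        "distinct_count": df.nunique(dropna=True).to_dict(),   # cardinality per column
    }
    # Value distributions for low-cardinality (likely categorical) columns
    profile["top_values"] = {
        col: df[col].value_counts(normalize=True).head(5).to_dict()
        for col in df.columns
        if df[col].nunique(dropna=True) <= 20
    }
    return profile

# Example usage on a small invented sample
sample = pd.DataFrame({"channel": ["web", "store", "web", None],
                       "amount": [10.0, 25.5, None, 7.25]})
print(profile_sample(sample))
```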
3. Ownership & Stewardship Model
```mermaid
graph TD
    subgraph "Governance Roles"
        DO[Data Owner<br/>Accountable]
        DS[Data Steward<br/>Day-to-day Management]
        DC[Data Custodian<br/>Technical Implementation]
        DU[Data Consumer<br/>Usage & Feedback]
    end
    DO -->|Approves| A1[Access Requests]
    DO -->|Sets| A2[Retention Policies]
    DO -->|Defines| A3[Quality Standards]
    DS -->|Monitors| B1[Quality Metrics]
    DS -->|Resolves| B2[Data Issues]
    DS -->|Updates| B3[Metadata]
    DC -->|Manages| C1[Database Systems]
    DC -->|Implements| C2[Security Controls]
    DC -->|Performs| C3[Backups]
    DU -->|Submits| D1[Access Requests]
    DU -->|Reports| D2[Quality Issues]
    DU -->|Provides| D3[Feedback]
    style DO fill:#ffcccc,stroke:#333,stroke-width:2px
    style DS fill:#cce5ff,stroke:#333,stroke-width:2px
    style DC fill:#ccffcc,stroke:#333,stroke-width:2px
```
Data Lineage Mapping
End-to-End Lineage Architecture
```mermaid
graph LR
    subgraph "Source Systems"
        S1[(CRM Database)]
        S2[(E-commerce API)]
        S3[Mobile App Events]
        S4[(ERP System)]
    end
    subgraph "Raw Layer"
        R1[Raw Customer]
        R2[Raw Orders]
        R3[Raw Events]
        R4[Raw Finance]
    end
    subgraph "Curated Layer"
        C1[Customer Master]
        C2[Order History]
        C3[User Behavior]
        C4[Financial Data]
    end
    subgraph "Analytics Layer"
        A1[Customer 360]
        A2[Revenue Analytics]
        A3[Behavioral Insights]
    end
    subgraph "ML Layer"
        M1[Customer Features]
        M2[Churn Model]
        M3[Recommendation Model]
    end
    S1 -->|CDC| R1
    S2 -->|API Pull| R2
    S3 -->|Stream| R3
    S4 -->|Batch| R4
    R1 -->|Dedupe| C1
    R2 -->|Join| C2
    R3 -->|Aggregate| C3
    R4 -->|Transform| C4
    C1 & C2 --> A1
    C2 & C4 --> A2
    C3 --> A3
    C1 & C2 & C3 --> M1
    M1 --> M2
    M1 & C2 --> M3
    style M2 fill:#f96,stroke:#333,stroke-width:3px
    style M3 fill:#f96,stroke:#333,stroke-width:3px
```
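Once flows like these are captured, representing them as a directed graph makes impact analysis cheap: given a changed source, walk the graph to find everything downstream. A minimal sketch with networkx, using a few invented node names based on the edges in the diagram above:

```python
# Minimal sketch: lineage edges as a directed graph for impact analysis (networkx).
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("crm_database", "raw_customer"),
    ("raw_customer", "customer_master"),
    ("customer_master", "customer_360"),
    ("customer_master", "customer_features"),
    ("customer_features", "churn_model"),
    ("customer_features", "recommendation_model"),
])

# Everything downstream of the CRM source, i.e. what a breaking change would affect
print(sorted(nx.descendants(lineage, "crm_database")))

# Upstream provenance of the churn model, i.e. everything it depends on
print(sorted(nx.ancestors(lineage, "churn_model")))
```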
Lineage Capture Methods Comparison
| Method | How It Works | Typical Accuracy | Coverage | Cost | Best For |
|---|---|---|---|---|---|
| SQL Parsing | Parse queries to extract dependencies | ~95% | SQL only | Low | Data warehouses, dbt |
| Runtime Instrumentation | Capture actual data flows during execution | ~99% | All operations | High | Complex pipelines |
| Manual Documentation | Engineers document dependencies | 60-80% | Selective | Medium | Legacy systems |
| Metadata APIs | Query system metadata | ~90% | Platform-specific | Low | Cloud platforms |
| Hybrid Approach | Combine automated + manual | ~95% | Comprehensive | Medium | Enterprise catalogs |
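As a concrete illustration of the SQL-parsing method, the sketch below uses the open-source sqlglot parser to list the upstream tables referenced by a transformation query; in a dbt-style warehouse, running this over every model's compiled SQL yields a table-level lineage graph. The query text and table names are invented for illustration.

```python
# Minimal sketch: table-level lineage via SQL parsing (open-source sqlglot library).
import sqlglot
from sqlglot import exp

# Invented transformation query for a "Customer 360" asset
sql = """
SELECT c.customer_id, c.segment, SUM(o.amount) AS lifetime_value
FROM curated.customer_master AS c
JOIN curated.order_history AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.segment
"""

def qualified_name(table: exp.Table) -> str:
    """Render catalog.schema.table, skipping parts that are absent."""
    return ".".join(p for p in (table.catalog, table.db, table.name) if p)

parsed = sqlglot.parse_one(sql)
upstream = sorted({qualified_name(t) for t in parsed.find_all(exp.Table)})

# Recording (upstream -> analytics.customer_360) edges for every model
# builds the table-level lineage graph.
print(upstream)  # ['curated.customer_master', 'curated.order_history']
```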
Data Classification Framework
Sensitivity Classification
```mermaid
graph TB
    A[Data Asset] --> B{Contains PII/PHI?}
    B -->|Yes| C{Regulated Data?}
    B -->|No| D{Business Critical?}
    C -->|Yes - GDPR/HIPAA| E[Highly Sensitive<br/>Encrypt, Strict Access]
    C -->|No - Internal PII| F[Sensitive<br/>Access Controls]
    D -->|Yes| G{Public if Leaked?}
    D -->|No| H[Internal<br/>Standard Controls]
    G -->|No| I[Confidential<br/>Business Impact]
    G -->|Yes| J[Public<br/>Minimal Controls]
    style E fill:#ff6b6b,stroke:#333,stroke-width:3px
    style F fill:#ffa94d,stroke:#333,stroke-width:2px
    style I fill:#ffd43b,stroke:#333,stroke-width:2px
    style H fill:#74c0fc,stroke:#333,stroke-width:2px
    style J fill:#b2f2bb,stroke:#333,stroke-width:2px
```
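The decision tree translates directly into a small rule function that can be applied to every asset during automated classification. This is a minimal sketch with invented flag names; a real pipeline would derive the flags from PII scanners and business-criticality metadata.

```python
# Minimal sketch: rule-based sensitivity classification mirroring the decision tree.
# Flag names (has_pii, is_regulated, ...) are illustrative, not a standard schema.
def classify_sensitivity(has_pii: bool, is_regulated: bool,
                         business_critical: bool, harmless_if_public: bool) -> str:
    if has_pii:
        return "Highly Sensitive" if is_regulated else "Sensitive"
    if business_critical:
        return "Public" if harmless_if_public else "Confidential"
    return "Internal"

# Example: a GDPR-scoped customer table vs. an already-published price list
print(classify_sensitivity(True, True, True, False))   # Highly Sensitive
print(classify_sensitivity(False, False, True, True))  # Public
```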
Criticality vs. Sensitivity Matrix
| Classification | Criticality | Sensitivity | Required Controls | Examples |
|---|---|---|---|---|
| Critical + Highly Sensitive | P0 - Business stops | PII/PHI/Financial | Encryption, MFA, Audit logs, DR, Legal review | Customer SSN, Payment data, Health records |
| Critical + Moderate Sensitivity | P0 - Business stops | Proprietary | Access controls, Backups, Monitoring | Product inventory, Pricing algorithms |
| Important + Highly Sensitive | P1 - Degraded service | PII/Confidential | Encryption, Access controls, Audit logs | Employee data, Contract terms |
| Important + Low Sensitivity | P1 - Degraded service | Internal | Standard access controls | Sales reports, Marketing analytics |
| Non-Critical + Any Sensitivity | P2 - Minimal impact | Varies | Basic security practices | Test data, Public datasets |
Catalog Implementation Architecture
```mermaid
graph TB
    subgraph "Data Sources"
        DS1[Databases]
        DS2[Data Lakes]
        DS3[APIs]
        DS4[Streams]
    end
    subgraph "Collection Layer"
        CL1[Scanners & Crawlers]
        CL2[Lineage Extractors]
        CL3[Profilers]
        CL4[Quality Checkers]
    end
    subgraph "Catalog Core"
        CC1[Metadata Repository]
        CC2[Search Index]
        CC3[Lineage Graph]
        CC4[Access Control]
    end
    subgraph "Presentation Layer"
        PL1[Search & Discovery UI]
        PL2[Lineage Visualization]
        PL3[Data Quality Dashboards]
        PL4[REST API]
    end
    subgraph "Integration Layer"
        IL1[IDE Plugins]
        IL2[BI Tool Connectors]
        IL3[Governance Workflows]
        IL4[Access Request System]
    end
    DS1 & DS2 & DS3 & DS4 --> CL1 & CL2 & CL3 & CL4
    CL1 & CL2 & CL3 & CL4 --> CC1 & CC2 & CC3 & CC4
    CC1 & CC2 & CC3 & CC4 --> PL1 & PL2 & PL3 & PL4
    PL1 & PL2 & PL3 & PL4 --> IL1 & IL2 & IL3 & IL4
    style CC1 fill:#f96,stroke:#333,stroke-width:3px
    style CC3 fill:#bbf,stroke:#333,stroke-width:2px
```
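The REST API in the presentation layer is also how the collection layer typically registers what it finds. The sketch below posts one asset's metadata to a hypothetical catalog endpoint (`/api/v1/datasets`); the URL, payload shape, and auth handling are assumptions for illustration, not the API of any specific catalog product.

```python
# Minimal sketch: registering a discovered asset with the catalog over REST.
# Endpoint, payload shape, and token handling are hypothetical.
import os
import requests

CATALOG_URL = "https://catalog.internal.example.com/api/v1/datasets"  # hypothetical endpoint

entry = {
    "name": "customer_transactions",
    "location": "s3://data-lake/curated/transactions/",
    "format": "parquet",
    "owner": "jane.smith@company.com",
    "sensitivity": "PII",
    "tags": ["pii", "revenue", "customer"],
}

resp = requests.post(
    CATALOG_URL,
    json=entry,
    headers={"Authorization": f"Bearer {os.environ['CATALOG_TOKEN']}"},
    timeout=30,
)
resp.raise_for_status()
print("registered:", resp.json().get("id"))
```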
Catalog Tool Selection Matrix
| Tool | Deployment | Best For | Strengths | Limitations | Cost Range | Decision Factor |
|---|---|---|---|---|---|---|
| AWS Glue Catalog | Cloud (AWS) | AWS-native workloads | Deep AWS integration, Serverless, Auto-discovery | AWS lock-in, Limited business metadata | $1/100K objects/mo | Choose if all-in on AWS |
| Azure Purview | Cloud (Azure) | Microsoft ecosystem | Cross-cloud scanning, Compliance features | Complex setup, Azure-centric | $500-5K/month | Choose for Azure + compliance |
| Alation | SaaS/On-prem | Enterprise-wide | AI-powered search, Collaboration, Rich UI | High cost, Admin overhead | $50K-250K/year | Choose for enterprise with budget |
| DataHub | Self-hosted | Open-source, Customizable | Free, Extensible, Active community, Modern | Self-managed, Setup complexity | Infrastructure only | Choose for control + no budget |
| Collibra | SaaS/On-prem | Governance-first orgs | Comprehensive governance, Workflows, Enterprise | Expensive, Heavy, Long setup | $100K-500K/year | Choose for regulated industries |
| Apache Atlas | Self-hosted | Hadoop ecosystem | Hadoop-native, Free, Metadata API | Limited UI, Hadoop-focused | Infrastructure only | Choose for Hadoop/Spark shops |
| Amundsen | Self-hosted | Data discovery | Simple, Fast search, Lightweight | Limited governance features | Infrastructure only | Choose for discovery-first |
Implementation Phasing
7-Week Implementation Plan
```mermaid
gantt
    title Data Cataloging Implementation Timeline
    dateFormat YYYY-MM-DD
    section Phase 1: Discovery
    Automated Scanning       :a1, 2025-10-03, 7d
    Stakeholder Interviews   :a2, 2025-10-03, 7d
    Schema Extraction        :a3, 2025-10-07, 7d
    section Phase 2: Ownership
    Ownership Workshops      :b1, 2025-10-14, 7d
    Steward Assignment       :b2, 2025-10-14, 7d
    section Phase 3: Lineage
    Lineage Mapping          :c1, 2025-10-21, 14d
    Dependency Documentation :c2, 2025-10-21, 14d
    section Phase 4: Classification
    Automated PII Detection  :d1, 2025-11-04, 7d
    Manual Classification    :d2, 2025-11-04, 7d
    section Phase 5: Gap Analysis
    Gap Identification       :e1, 2025-11-11, 7d
    Remediation Planning     :e2, 2025-11-11, 7d
```
Phase-by-Phase Checklist
Week 1-2: Initial Discovery
□ Identify all data sources (databases, APIs, files, streams)
□ Run automated scanning tools (Glue Crawler, Purview, etc.; see the crawler sketch after this checklist)
□ Interview stakeholders for undocumented sources
□ Document source systems, schemas, and volumes
□ Capture update cadence and access patterns
□ Estimate data volumes and growth rates
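For AWS-hosted sources, the automated-scanning item above can be as simple as registering and starting a Glue crawler. A minimal boto3 sketch, assuming an existing IAM role and an illustrative S3 path:

```python
# Minimal sketch: bootstrap discovery of an S3 prefix with an AWS Glue crawler (boto3).
# The role ARN, database name, and S3 path are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="discover-curated-transactions",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="curated",
    Targets={"S3Targets": [{"Path": "s3://data-lake/curated/transactions/"}]},
)
glue.start_crawler(Name="discover-curated-transactions")

# Once the crawler finishes, the discovered tables land in the Glue Data Catalog:
tables = glue.get_tables(DatabaseName="curated")["TableList"]
print([t["Name"] for t in tables])
```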
Week 3: Ownership Assignment
□ Conduct ownership workshops by domain
□ Assign data owners for each domain/dataset
□ Designate steward teams for day-to-day management
□ Document escalation paths for issues
□ Establish ownership review cadence (quarterly)
□ Create accountability metrics (response time, quality)
Week 4-5: Lineage Mapping
□ Map upstream sources for critical datasets
□ Document transformation logic and dependencies
□ Identify downstream consumers and use cases
□ Record data contracts and SLAs
□ Visualize end-to-end data flows
□ Test lineage accuracy with sample queries
Week 6: Classification
□ Run automated PII/PHI detection tools (see the detection sketch after this checklist)
□ Manually review and validate classifications
□ Apply sensitivity labels to all datasets
□ Document compliance requirements (GDPR, HIPAA, etc.)
□ Implement appropriate security controls
□ Create classification audit trail
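A first pass at automated PII detection can be a pattern scan over sampled column values; purpose-built detectors (cloud DLP services, open-source analyzers) go much further, but a simple sketch like the one below already catches obvious cases. The regexes and the 10% match threshold are illustrative choices, not a standard.

```python
# Minimal sketch: regex-based PII screening of sampled column values.
# Patterns and the 10% threshold are illustrative; real detectors are more robust.
import re

PII_PATTERNS = {
    "email":  re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone":  re.compile(r"^\+?[\d\s().-]{7,15}$"),
}

def detect_pii(column_samples: dict[str, list[str]], threshold: float = 0.1) -> dict[str, list[str]]:
    """Return, per column, the PII types matched by more than `threshold` of sampled values."""
    findings = {}
    for column, values in column_samples.items():
        values = [v for v in values if v]
        if not values:
            continue
        hits = [
            pii_type
            for pii_type, pattern in PII_PATTERNS.items()
            if sum(bool(pattern.match(str(v).strip())) for v in values) / len(values) > threshold
        ]
        if hits:
            findings[column] = hits
    return findings

samples = {"contact": ["a@example.com", "b@example.com"], "notes": ["called twice", "no answer"]}
print(detect_pii(samples))  # {'contact': ['email']}
```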
Week 7: Gap Analysis & Remediation
□ Compare discovered assets vs. expected inventory (see the diff sketch after this checklist)
□ Identify missing metadata, ownership, lineage
□ Prioritize gaps by business impact
□ Create remediation plan with timelines
□ Assign owners for gap closure
□ Schedule follow-up reviews
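The comparison in the first item above is essentially a set difference between the expected inventory (for example, a CMDB or application-registry export) and what the scanners actually found. A minimal sketch with invented asset identifiers:

```python
# Minimal sketch: gap identification as a set difference between expected and discovered assets.
# Asset identifiers are invented for illustration.
expected = {"sales.orders", "sales.customers", "finance.ledger", "hr.payroll"}
discovered = {"sales.orders", "sales.customers", "finance.ledger", "marketing.campaigns"}

missing_from_catalog = expected - discovered   # known systems the scanners never reached
undocumented_finds = discovered - expected     # shadow datasets nobody had inventoried

print("Not yet cataloged:", sorted(missing_from_catalog))  # ['hr.payroll']
print("Undocumented:", sorted(undocumented_finds))         # ['marketing.campaigns']
```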
Metadata Model Example
Essential Catalog Metadata (one minimal example):
```yaml
# Minimal example: Dataset catalog entry
dataset:
  name: customer_transactions
  location: s3://data-lake/curated/transactions/

  # Technical
  format: parquet
  partitions: [date]
  row_count: 45000000
  size_gb: 120

  # Business
  owner: jane.smith@company.com
  domain: Sales
  description: "Customer transaction records from all channels"
  tags: [pii, revenue, customer]

  # Governance
  sensitivity: PII
  compliance: [GDPR, PCI-DSS]
  retention_days: 2555

  # Quality
  freshness_sla: "< 4 hours"
  quality_score: 0.94
```
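Completeness metrics like the ones reported in the deliverables later in this chapter can be computed directly from entries in this shape. A minimal sketch, assuming the YAML layout above and an invented set of required fields per metadata facet:

```python
# Minimal sketch: per-facet metadata completeness for one catalog entry.
# The required-field lists are illustrative, not a standard.
import yaml

REQUIRED_FIELDS = {
    "technical":  ["name", "location", "format"],
    "business":   ["owner", "domain", "description", "tags"],
    "governance": ["sensitivity", "compliance", "retention_days"],
    "quality":    ["freshness_sla", "quality_score"],
}

def completeness(entry: dict) -> dict[str, float]:
    """Fraction of required fields present (non-empty) in each facet."""
    return {
        facet: sum(1 for f in fields if entry.get(f) not in (None, "", [])) / len(fields)
        for facet, fields in REQUIRED_FIELDS.items()
    }

with open("customer_transactions.yaml") as fh:  # the example entry above
    entry = yaml.safe_load(fh)["dataset"]

print(completeness(entry))  # e.g. {'technical': 1.0, 'business': 1.0, ...}
```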
Real-World Case Study: Financial Services Cataloging
Challenge
A large bank with 3,500+ databases across 200+ applications had no central catalog; AI teams spent roughly 40% of their time searching for data.
Implementation
- Timeline: 7 weeks
- Team: 3 data engineers, 1 governance lead
- Tools: Azure Purview + custom scripts
Discovery Findings
| Discovery Result | Count | Impact | Resolution |
|---|---|---|---|
| Tables discovered | 12,000 | Complete inventory | Cataloged all |
| "Customer" tables with conflicting schemas | 8 | Data inconsistency | Established canonical source |
| Duplicate ETL pipelines | 23 | Wasted $120K/year | Decommissioned 18 |
| PII tables without encryption | 156 | Compliance violation | Implemented encryption |
| Critical datasets with no owner | 89 | No accountability | Assigned via workshops |
Results After 6 Months
| Metric | Before | After | Improvement |
|---|---|---|---|
| Data discovery time | 3-5 days | 15 minutes | 99% reduction |
| Data quality incidents | 40/month | 15/month | 62% reduction |
| AI project delivery time | 16 weeks | 11 weeks | 31% faster |
| Compliance audit findings | 23 | 2 | 91% reduction |
| Infrastructure costs (duplicate pipelines) | $450K/year | $330K/year | 27% savings |
Key Success Factors
- Executive sponsorship and budget allocation
- Automated scanning covered 80% of discovery
- Domain-led ownership workshops
- Iterative approach (critical assets first)
- Clear metrics and regular progress reviews
Deliverables
1. Data Catalog with Completeness Metrics
```
Total Assets Cataloged: 847
├── Production Databases: 234 (100% metadata complete)
├── Data Lakes/Warehouses: 89 (100% metadata complete)
├── APIs: 156 (87% metadata complete)
├── File Systems: 248 (65% metadata complete)
└── External Sources: 120 (78% metadata complete)

Metadata Completeness:
├── Technical Metadata: 100% (schema, location, format)
├── Business Metadata: 87% (owner, description, tags)
├── Operational Metadata: 92% (freshness, quality scores)
└── Governance Metadata: 78% (sensitivity, compliance)

Coverage: 95% of critical datasets, 75% of all datasets
```
2. Ownership Matrix
| Domain | Owner | Steward Team | Datasets | Coverage | Avg Response Time |
|---|---|---|---|---|---|
| Sales | Jane Smith | Sales Analytics | 67 | 100% | 4 hours |
| Marketing | John Doe | Marketing Ops | 43 | 95% | 6 hours |
| Product | Alice Johnson | Product Analytics | 89 | 98% | 3 hours |
| Finance | Bob Wilson | Finance IT | 34 | 100% | 2 hours |
| Engineering | Carol Martinez | Platform Team | 156 | 85% | 8 hours |
3. Gap Remediation Roadmap
| Priority | Gap | Affected Assets | Remediation | Owner | Timeline | Status |
|---|---|---|---|---|---|---|
| P0 | PII without encryption | 23 tables | Implement column-level encryption | Security Team | 2 weeks | In Progress |
| P0 | No assigned owner | 89 datasets | Ownership workshops | Data Governance | 3 weeks | 60% Complete |
| P1 | Schema drift detected | 45 tables | Implement schema validation | Platform Team | 4 weeks | Planned |
| P1 | Missing lineage | 134 datasets | Deploy lineage tracking | Data Engineering | 6 weeks | Planned |
| P2 | Stale documentation | 267 datasets | Automate metadata refresh | Catalog Admin | 8 weeks | Planned |
| P2 | Duplicate datasets | 56 pairs | Consolidation project | Domain Owners | 12 weeks | Scoping |
Best Practices
- Start with High-Value Assets: Focus on critical datasets used by production models first
- Automate Discovery: Use tools to bootstrap catalog (80% automated, 20% manual)
- Engage Domain Experts: Business context requires human input, not just technical scanning
- Make It Self-Service: Enable users to discover and request access without tickets
- Keep Catalog Current: Automate metadata refresh to prevent staleness
- Measure Adoption: Track search queries, access requests, time-to-data metrics
- Integrate into Workflows: Embed catalog in data development tools (IDEs, notebooks)
- Lightweight Governance: Balance control with agility; don't over-bureaucratize
Common Pitfalls to Avoid
- Boiling the Ocean: Trying to catalog everything at once (start with critical 20%)
- Metadata Burden: Making metadata entry too complex (keep it simple, automate)
- Catalog Shelfware: Building catalog that nobody uses (prioritize user experience)
- Stale Information: Letting catalog become outdated (automate refresh, set SLAs)
- Lack of Enforcement: Not requiring catalog usage for data access
- Tool Over-Engineering: Choosing complex enterprise tools for simple needs
- Missing Business Context: Focusing only on technical metadata
- Ignoring Legacy Systems: Only cataloging modern cloud infrastructure
- No Ownership Model: Building catalog without clear data ownership
- Poor Search Experience: Complex interfaces that frustrate users
Decision Framework: When to Invest in Data Cataloging
```mermaid
graph TD
    A[Should we build a data catalog?] --> B{How many data sources?}
    B -->|< 10| C[Spreadsheet/Wiki OK]
    B -->|10-50| D{How many users?}
    B -->|> 50| E[Definitely need catalog]
    D -->|< 5| C
    D -->|> 5| F{Regulated industry?}
    F -->|Yes| E
    F -->|No| G{Data quality issues?}
    G -->|Yes| E
    G -->|No| H[Start simple, grow later]
    style E fill:#d4edda,stroke:#333,stroke-width:2px
    style C fill:#fff3cd,stroke:#333,stroke-width:2px
    style H fill:#cce5ff,stroke:#333,stroke-width:2px
```
Investment Decision Criteria:
- Definitely invest if: >50 data sources, >10 data users, regulated industry, frequent data quality issues, AI/ML at scale
- Start simple if: <10 sources, <5 users, simple use cases, small team
- Plan to scale if: Growing team, expanding data landscape, planning AI investments