Part 3: Data Foundations

Chapter 13: Data Discovery & Cataloging


Overview

Establish a complete, trusted inventory of data sources with ownership, lineage, and usage constraints. Data discovery and cataloging form the foundation of any successful AI program by providing visibility into what data exists, where it resides, who owns it, and how it can be used. Without this foundation, AI teams operate blindly, leading to duplicated efforts, compliance violations, and poor model performance.

Why It Matters

AI programs depend on trustworthy data. Discovery and cataloging create clarity on what exists, who owns it, and how it can be used. Consider these critical scenarios:

  • Avoiding Redundancy: Teams unknowingly duplicate data pipelines because they can't find existing sources
  • Ensuring Compliance: Without proper cataloging, teams may use data that violates privacy regulations
  • Enabling Collaboration: Without discovery, data scientists waste weeks searching for the right datasets instead of building models
  • Managing Dependencies: Without documented dependencies, changes to upstream systems break AI pipelines without warning
  • Tracking Lineage: Without lineage, teams can't trace data back to its source when issues arise

Discovery Process Flow

```mermaid
graph TB
    A[Start Discovery] --> B{Discovery Method}
    B -->|Automated| C[Schema Scanners]
    B -->|Manual| D[Stakeholder Interviews]
    B -->|Hybrid| E[Combined Approach]
    C --> F[Extract Metadata]
    D --> F
    E --> F
    F --> G[Profile Data Sources]
    G --> H[Document Ownership]
    H --> I[Map Lineage]
    I --> J[Classify Sensitivity]
    J --> K{Complete?}
    K -->|No| L[Identify Gaps]
    L --> B
    K -->|Yes| M[Populate Catalog]
    M --> N[Ongoing Maintenance]
    style A fill:#e1f5ff,stroke:#333,stroke-width:2px
    style M fill:#d4edda,stroke:#333,stroke-width:2px
    style K fill:#fff3cd,stroke:#333,stroke-width:2px
```

Core Discovery Activities

1. Source Inventory Matrix

| Source Type | Examples | Discovery Method | Typical Metadata | Priority |
| --- | --- | --- | --- | --- |
| Databases | PostgreSQL, MySQL, Oracle | JDBC scanners, schema dumps | Tables, columns, types, constraints | P0 |
| Data Warehouses | Snowflake, BigQuery, Redshift | Cloud APIs, information schema | Schemas, views, materialized views | P0 |
| Data Lakes | S3, ADLS, GCS | File system crawlers | Directories, file formats, partitions | P1 |
| APIs | REST, GraphQL, gRPC | Swagger/OpenAPI, manual docs | Endpoints, schemas, rate limits | P1 |
| File Systems | NFS, SMB, local storage | File scanners | File types, sizes, modification dates | P2 |
| SaaS Applications | Salesforce, HubSpot, Zendesk | Connector catalogs | Objects, fields, sync frequency | P2 |
| Streaming | Kafka, Kinesis, Pulsar | Topic inspection | Topics, schemas, retention | P1 |
| Legacy Systems | Mainframes, COBOL systems | Manual documentation | COBOL copybooks, flat files | P2 |
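
The automated discovery methods above mostly boil down to querying a system's own metadata. Below is a minimal sketch of schema extraction using SQLite's built-in introspection as a stand-in for a production scanner; against PostgreSQL, Snowflake, or BigQuery you would run the equivalent queries on information_schema, and the table defined here is purely illustrative.

```python
import sqlite3

# Stand-in source: in a real scan this would be a JDBC/warehouse connection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_transactions ("
    "  txn_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL, txn_date TEXT)"
)

def scan_schema(conn: sqlite3.Connection) -> list[dict]:
    """Extract table and column metadata into catalog-ready records."""
    entries = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table_name,) in tables:
        # PRAGMA table_info returns (cid, name, type, notnull, default, pk) per column.
        columns = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
        entries.append({
            "table": table_name,
            "columns": [
                {"name": col[1], "type": col[2], "nullable": not col[3]}
                for col in columns
            ],
        })
    return entries

for entry in scan_schema(conn):
    print(entry["table"], [c["name"] for c in entry["columns"]])
```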

2. Data Profiling Dimensions

```mermaid
graph LR
    subgraph "Structural Profiling"
        S1[Schema Analysis]
        S2[Data Types]
        S3[Constraints]
        S4[Relationships]
    end
    subgraph "Content Profiling"
        C1[Volume Metrics]
        C2[Null Rates]
        C3[Value Distributions]
        C4[Patterns/Formats]
    end
    subgraph "Operational Profiling"
        O1[Update Frequency]
        O2[Access Patterns]
        O3[Performance Metrics]
        O4[Dependencies]
    end
    subgraph "Business Profiling"
        B1[Ownership]
        B2[Domain]
        B3[Use Cases]
        B4[Business Rules]
    end
    S1 & S2 & S3 & S4 --> DP[Data Profile]
    C1 & C2 & C3 & C4 --> DP
    O1 & O2 & O3 & O4 --> DP
    B1 & B2 & B3 & B4 --> DP
    DP --> Catalog[Data Catalog]
    style Catalog fill:#f96,stroke:#333,stroke-width:3px
```
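
Content profiling is the easiest dimension to automate. The sketch below computes a few of the content metrics from the diagram (row counts, null rates, cardinality, value ranges) with pandas; the DataFrame and its columns are hypothetical stand-ins for a sampled extract of a real source.

```python
import pandas as pd

# Hypothetical sample pulled from a source table.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, None, 5],
    "email": ["a@x.com", None, "b@x.com", "c@x.com", "c@x.com"],
    "amount": [10.0, 25.5, 25.5, 3.0, None],
})

def profile_content(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column content profile: volume, null rate, cardinality, numeric range."""
    rows = []
    for col in df.columns:
        series = df[col]
        numeric = series.dtype.kind in "if"  # int or float columns only
        rows.append({
            "column": col,
            "row_count": len(series),
            "null_rate": round(series.isna().mean(), 3),
            "distinct_values": series.nunique(dropna=True),
            "min": series.min(skipna=True) if numeric else None,
            "max": series.max(skipna=True) if numeric else None,
        })
    return pd.DataFrame(rows)

print(profile_content(df))
```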

3. Ownership & Stewardship Model

```mermaid
graph TD
    subgraph "Governance Roles"
        DO[Data Owner<br/>Accountable]
        DS[Data Steward<br/>Day-to-day Management]
        DC[Data Custodian<br/>Technical Implementation]
        DU[Data Consumer<br/>Usage & Feedback]
    end
    DO -->|Approves| A1[Access Requests]
    DO -->|Sets| A2[Retention Policies]
    DO -->|Defines| A3[Quality Standards]
    DS -->|Monitors| B1[Quality Metrics]
    DS -->|Resolves| B2[Data Issues]
    DS -->|Updates| B3[Metadata]
    DC -->|Manages| C1[Database Systems]
    DC -->|Implements| C2[Security Controls]
    DC -->|Performs| C3[Backups]
    DU -->|Submits| D1[Access Requests]
    DU -->|Reports| D2[Quality Issues]
    DU -->|Provides| D3[Feedback]
    style DO fill:#ffcccc,stroke:#333,stroke-width:2px
    style DS fill:#cce5ff,stroke:#333,stroke-width:2px
    style DC fill:#ccffcc,stroke:#333,stroke-width:2px
```

Data Lineage Mapping

End-to-End Lineage Architecture

```mermaid
graph LR
    subgraph "Source Systems"
        S1[(CRM Database)]
        S2[(E-commerce API)]
        S3[Mobile App Events]
        S4[(ERP System)]
    end
    subgraph "Raw Layer"
        R1[Raw Customer]
        R2[Raw Orders]
        R3[Raw Events]
        R4[Raw Finance]
    end
    subgraph "Curated Layer"
        C1[Customer Master]
        C2[Order History]
        C3[User Behavior]
        C4[Financial Data]
    end
    subgraph "Analytics Layer"
        A1[Customer 360]
        A2[Revenue Analytics]
        A3[Behavioral Insights]
    end
    subgraph "ML Layer"
        M1[Customer Features]
        M2[Churn Model]
        M3[Recommendation Model]
    end
    S1 -->|CDC| R1
    S2 -->|API Pull| R2
    S3 -->|Stream| R3
    S4 -->|Batch| R4
    R1 -->|Dedupe| C1
    R2 -->|Join| C2
    R3 -->|Aggregate| C3
    R4 -->|Transform| C4
    C1 & C2 --> A1
    C2 & C4 --> A2
    C3 --> A3
    C1 & C2 & C3 --> M1
    M1 --> M2
    M1 & C2 --> M3
    style M2 fill:#f96,stroke:#333,stroke-width:3px
    style M3 fill:#f96,stroke:#333,stroke-width:3px
```
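
Once edges like these are captured, lineage is just a directed graph that supports upstream ("where did this come from?") and downstream ("what breaks if this changes?") queries. A minimal sketch assuming the networkx library; the node names mirror a few assets from the diagram.

```python
import networkx as nx

# Directed lineage graph: edge (a, b) means "a feeds b".
lineage = nx.DiGraph()
lineage.add_edges_from([
    ("crm_database", "raw_customer"),
    ("raw_customer", "customer_master"),
    ("raw_orders", "order_history"),
    ("customer_master", "customer_features"),
    ("order_history", "customer_features"),
    ("customer_features", "churn_model"),
])

# Upstream lineage: everything the churn model ultimately depends on.
print(sorted(nx.ancestors(lineage, "churn_model")))

# Impact analysis: everything affected if raw_customer breaks.
print(sorted(nx.descendants(lineage, "raw_customer")))
```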

Lineage Capture Methods Comparison

| Method | How It Works | Accuracy | Coverage | Cost | Best For |
| --- | --- | --- | --- | --- | --- |
| SQL Parsing | Parse queries to extract dependencies | 95% | SQL only | Low | Data warehouses, dbt |
| Runtime Instrumentation | Capture actual data flows during execution | 99% | All operations | High | Complex pipelines |
| Manual Documentation | Engineers document dependencies | 60-80% | Selective | Medium | Legacy systems |
| Metadata APIs | Query system metadata | 90% | Platform-specific | Low | Cloud platforms |
| Hybrid Approach | Combine automated + manual | 95% | Comprehensive | Medium | Enterprise catalogs |
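
SQL parsing is usually the cheapest automated starting point for warehouse and dbt-style pipelines. A minimal sketch assuming the sqlglot library; the query is a made-up curated-layer transformation whose FROM/JOIN tables are its upstream dependencies.

```python
import sqlglot
from sqlglot import exp

# A made-up dbt-style model query; the tables it reads from are its upstream lineage.
sql = """
    SELECT c.customer_id, c.segment, SUM(o.amount) AS lifetime_value
    FROM curated.customer_master AS c
    JOIN curated.order_history AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
"""

parsed = sqlglot.parse_one(sql)

# Collect the qualified name of every table the query reads from.
upstream = sorted({
    ".".join(part for part in (table.db, table.name) if part)
    for table in parsed.find_all(exp.Table)
})

print(upstream)  # ['curated.customer_master', 'curated.order_history']
```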

Data Classification Framework

Sensitivity Classification

```mermaid
graph TB
    A[Data Asset] --> B{Contains PII/PHI?}
    B -->|Yes| C{Regulated Data?}
    B -->|No| D{Business Critical?}
    C -->|Yes - GDPR/HIPAA| E[Highly Sensitive<br/>Encrypt, Strict Access]
    C -->|No - Internal PII| F[Sensitive<br/>Access Controls]
    D -->|Yes| G{Public if Leaked?}
    D -->|No| H[Internal<br/>Standard Controls]
    G -->|No| I[Confidential<br/>Business Impact]
    G -->|Yes| J[Public<br/>Minimal Controls]
    style E fill:#ff6b6b,stroke:#333,stroke-width:3px
    style F fill:#ffa94d,stroke:#333,stroke-width:2px
    style I fill:#ffd43b,stroke:#333,stroke-width:2px
    style H fill:#74c0fc,stroke:#333,stroke-width:2px
    style J fill:#b2f2bb,stroke:#333,stroke-width:2px
```
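
This decision tree maps directly onto a small rule function that a scanner or steward workflow can apply per asset. A minimal sketch; the input flags are assumptions that would normally come from automated PII detection plus human review.

```python
from dataclasses import dataclass

@dataclass
class AssetFlags:
    contains_pii_phi: bool   # any personal or health data present
    regulated: bool          # falls under GDPR, HIPAA, PCI-DSS, etc.
    business_critical: bool  # business stops or degrades without it
    public_if_leaked: bool   # already public, or harmless if exposed

def classify(flags: AssetFlags) -> str:
    """Mirror the sensitivity decision tree above."""
    if flags.contains_pii_phi:
        return "Highly Sensitive" if flags.regulated else "Sensitive"
    if not flags.business_critical:
        return "Internal"
    return "Public" if flags.public_if_leaked else "Confidential"

print(classify(AssetFlags(True, True, True, False)))    # Highly Sensitive
print(classify(AssetFlags(False, False, True, False)))  # Confidential
```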

Criticality vs. Sensitivity Matrix

| Classification | Criticality | Sensitivity | Required Controls | Examples |
| --- | --- | --- | --- | --- |
| Critical + Highly Sensitive | P0 - Business stops | PII/PHI/Financial | Encryption, MFA, Audit logs, DR, Legal review | Customer SSN, Payment data, Health records |
| Critical + Moderate Sensitivity | P0 - Business stops | Proprietary | Access controls, Backups, Monitoring | Product inventory, Pricing algorithms |
| Important + Highly Sensitive | P1 - Degraded service | PII/Confidential | Encryption, Access controls, Audit logs | Employee data, Contract terms |
| Important + Low Sensitivity | P1 - Degraded service | Internal | Standard access controls | Sales reports, Marketing analytics |
| Non-Critical + Any Sensitivity | P2 - Minimal impact | Varies | Basic security practices | Test data, Public datasets |

Catalog Implementation Architecture

```mermaid
graph TB
    subgraph "Data Sources"
        DS1[Databases]
        DS2[Data Lakes]
        DS3[APIs]
        DS4[Streams]
    end
    subgraph "Collection Layer"
        CL1[Scanners & Crawlers]
        CL2[Lineage Extractors]
        CL3[Profilers]
        CL4[Quality Checkers]
    end
    subgraph "Catalog Core"
        CC1[Metadata Repository]
        CC2[Search Index]
        CC3[Lineage Graph]
        CC4[Access Control]
    end
    subgraph "Presentation Layer"
        PL1[Search & Discovery UI]
        PL2[Lineage Visualization]
        PL3[Data Quality Dashboards]
        PL4[REST API]
    end
    subgraph "Integration Layer"
        IL1[IDE Plugins]
        IL2[BI Tool Connectors]
        IL3[Governance Workflows]
        IL4[Access Request System]
    end
    DS1 & DS2 & DS3 & DS4 --> CL1 & CL2 & CL3 & CL4
    CL1 & CL2 & CL3 & CL4 --> CC1 & CC2 & CC3 & CC4
    CC1 & CC2 & CC3 & CC4 --> PL1 & PL2 & PL3 & PL4
    PL1 & PL2 & PL3 & PL4 --> IL1 & IL2 & IL3 & IL4
    style CC1 fill:#f96,stroke:#333,stroke-width:3px
    style CC3 fill:#bbf,stroke:#333,stroke-width:2px
```
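
At the core of this architecture sit the metadata repository and search index, which can start very small. A minimal in-memory sketch of a catalog entry and keyword search; the fields intentionally mirror the metadata model example later in this chapter, and a real deployment would back this with a database and a dedicated search engine.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    location: str
    owner: str
    domain: str
    description: str = ""
    tags: list[str] = field(default_factory=list)
    sensitivity: str = "Internal"

class Catalog:
    """Toy metadata repository with keyword search over names, descriptions, and tags."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[CatalogEntry]:
        keyword = keyword.lower()
        return [
            e for e in self._entries.values()
            if keyword in e.name.lower()
            or keyword in e.description.lower()
            or any(keyword in tag.lower() for tag in e.tags)
        ]

catalog = Catalog()
catalog.register(CatalogEntry(
    name="customer_transactions",
    location="s3://data-lake/curated/transactions/",
    owner="jane.smith@company.com",
    domain="Sales",
    description="Customer transaction records from all channels",
    tags=["pii", "revenue", "customer"],
    sensitivity="PII",
))
print([e.name for e in catalog.search("revenue")])
```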

Catalog Tool Selection Matrix

| Tool | Deployment | Best For | Strengths | Limitations | Cost Range | Decision Factor |
| --- | --- | --- | --- | --- | --- | --- |
| AWS Glue Catalog | Cloud (AWS) | AWS-native workloads | Deep AWS integration, Serverless, Auto-discovery | AWS lock-in, Limited business metadata | $1/100K objects/mo | Choose if all-in on AWS |
| Azure Purview | Cloud (Azure) | Microsoft ecosystem | Cross-cloud scanning, Compliance features | Complex setup, Azure-centric | $500-5K/month | Choose for Azure + compliance |
| Alation | SaaS/On-prem | Enterprise-wide | AI-powered search, Collaboration, Rich UI | High cost, Admin overhead | $50K-250K/year | Choose for enterprise with budget |
| DataHub | Self-hosted | Open-source, Customizable | Free, Extensible, Active community, Modern | Self-managed, Setup complexity | Infrastructure only | Choose for control + no budget |
| Collibra | SaaS/On-prem | Governance-first orgs | Comprehensive governance, Workflows, Enterprise | Expensive, Heavy, Long setup | $100K-500K/year | Choose for regulated industries |
| Apache Atlas | Self-hosted | Hadoop ecosystem | Hadoop-native, Free, Metadata API | Limited UI, Hadoop-focused | Infrastructure only | Choose for Hadoop/Spark shops |
| Amundsen | Self-hosted | Data discovery | Simple, Fast search, Lightweight | Limited governance features | Infrastructure only | Choose for discovery-first |

Implementation Phasing

7-Week Implementation Plan

```mermaid
gantt
    title Data Cataloging Implementation Timeline
    dateFormat YYYY-MM-DD
    section Phase 1: Discovery
    Automated Scanning        :a1, 2025-10-03, 7d
    Stakeholder Interviews    :a2, 2025-10-03, 7d
    Schema Extraction         :a3, 2025-10-07, 7d
    section Phase 2: Ownership
    Ownership Workshops       :b1, 2025-10-14, 7d
    Steward Assignment        :b2, 2025-10-14, 7d
    section Phase 3: Lineage
    Lineage Mapping           :c1, 2025-10-21, 14d
    Dependency Documentation  :c2, 2025-10-21, 14d
    section Phase 4: Classification
    Automated PII Detection   :d1, 2025-11-04, 7d
    Manual Classification     :d2, 2025-11-04, 7d
    section Phase 5: Gap Analysis
    Gap Identification        :e1, 2025-11-11, 7d
    Remediation Planning      :e2, 2025-11-11, 7d
```

Phase-by-Phase Checklist

Week 1-2: Initial Discovery

□ Identify all data sources (databases, APIs, files, streams)
□ Run automated scanning tools (Glue Crawler, Purview, etc.)
□ Interview stakeholders for undocumented sources
□ Document source systems, schemas, and volumes
□ Capture update cadence and access patterns
□ Estimate data volumes and growth rates

Week 3: Ownership Assignment

□ Conduct ownership workshops by domain
□ Assign data owners for each domain/dataset
□ Designate steward teams for day-to-day management
□ Document escalation paths for issues
□ Establish ownership review cadence (quarterly)
□ Create accountability metrics (response time, quality)

Week 4-5: Lineage Mapping

□ Map upstream sources for critical datasets
□ Document transformation logic and dependencies
□ Identify downstream consumers and use cases
□ Record data contracts and SLAs
□ Visualize end-to-end data flows
□ Test lineage accuracy with sample queries

Week 6: Classification

□ Run automated PII/PHI detection tools (see the sketch after this checklist)
□ Manually review and validate classifications
□ Apply sensitivity labels to all datasets
□ Document compliance requirements (GDPR, HIPAA, etc.)
□ Implement appropriate security controls
□ Create classification audit trail
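
A minimal sketch of the automated PII detection step referenced above: regex rules applied to sampled column values, flagging columns whose match rate crosses a threshold. Production scanners use many more signals; the patterns and sample data here are illustrative only.

```python
import re

# Illustrative detection rules; production scanners use far richer rule sets.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def detect_pii(column_samples: dict[str, list[str]], threshold: float = 0.5) -> dict[str, list[str]]:
    """Flag columns where the share of sampled values matching a PII pattern exceeds the threshold."""
    findings: dict[str, list[str]] = {}
    for column, values in column_samples.items():
        if not values:
            continue
        for label, pattern in PII_PATTERNS.items():
            hit_rate = sum(bool(pattern.search(v)) for v in values) / len(values)
            if hit_rate >= threshold:
                findings.setdefault(column, []).append(label)
    return findings

samples = {  # hypothetical sampled values from a scanned table
    "contact": ["jane@company.com", "bob@company.com", "n/a"],
    "notes": ["call back Tuesday", "prefers email"],
    "ssn": ["123-45-6789", "987-65-4321"],
}
print(detect_pii(samples))  # {'contact': ['email'], 'ssn': ['us_ssn']}
```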

Week 7: Gap Analysis & Remediation

□ Compare discovered assets vs. expected inventory
□ Identify missing metadata, ownership, lineage
□ Prioritize gaps by business impact
□ Create remediation plan with timelines
□ Assign owners for gap closure
□ Schedule follow-up reviews

Metadata Model Example

Essential catalog metadata (one minimal example):

# Minimal example: Dataset catalog entry
dataset:
  name: customer_transactions
  location: s3://data-lake/curated/transactions/

  # Technical
  format: parquet
  partitions: [date]
  row_count: 45000000
  size_gb: 120

  # Business
  owner: jane.smith@company.com
  domain: Sales
  description: "Customer transaction records from all channels"
  tags: [pii, revenue, customer]

  # Governance
  sensitivity: PII
  compliance: [GDPR, PCI-DSS]
  retention_days: 2555

  # Quality
  freshness_sla: "< 4 hours"
  quality_score: 0.94

Real-World Case Study: Financial Services Cataloging

Challenge

A large bank had 3,500+ databases across 200+ applications and no central catalog. AI teams spent 40% of their time searching for data.

Implementation

  • Timeline: 7 weeks
  • Team: 3 data engineers, 1 governance lead
  • Tools: Azure Purview + custom scripts

Discovery Findings

| Discovery Result | Count | Impact | Resolution |
| --- | --- | --- | --- |
| Tables discovered | 12,000 | Complete inventory | Cataloged all |
| "Customer" tables with conflicting schemas | 8 | Data inconsistency | Established canonical source |
| Duplicate ETL pipelines | 23 | Wasted $120K/year | Decommissioned 18 |
| PII tables without encryption | 156 | Compliance violation | Implemented encryption |
| Critical datasets with no owner | 89 | No accountability | Assigned via workshops |

Results After 6 Months

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Data discovery time | 3-5 days | 15 minutes | 99% reduction |
| Data quality incidents | 40/month | 15/month | 62% reduction |
| AI project delivery time | 16 weeks | 11 weeks | 31% faster |
| Compliance audit findings | 23 | 2 | 91% reduction |
| Infrastructure costs (duplicate pipelines) | $450K/year | $330K/year | 27% savings |

Key Success Factors

  1. Executive sponsorship and budget allocation
  2. Automated scanning covered 80% of discovery
  3. Domain-led ownership workshops
  4. Iterative approach (critical assets first)
  5. Clear metrics and regular progress reviews

Deliverables

1. Data Catalog with Completeness Metrics

Total Assets Cataloged: 847
├── Production Databases: 234 (100% metadata complete)
├── Data Lakes/Warehouses: 89 (100% metadata complete)
├── APIs: 156 (87% metadata complete)
├── File Systems: 248 (65% metadata complete)
└── External Sources: 120 (78% metadata complete)

Metadata Completeness:
├── Technical Metadata: 100% (schema, location, format)
├── Business Metadata: 87% (owner, description, tags)
├── Operational Metadata: 92% (freshness, quality scores)
└── Governance Metadata: 78% (sensitivity, compliance)

Coverage: 95% of critical datasets, 75% of all datasets
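
Completeness figures like these can be computed from the catalog itself rather than tracked by hand. A minimal sketch; the required-field groupings are assumptions chosen to mirror the four metadata categories above.

```python
# Metadata fields that must be populated for each category to count as complete.
REQUIRED_FIELDS = {
    "technical": ["schema", "location", "format"],
    "business": ["owner", "description", "tags"],
    "operational": ["freshness_sla", "quality_score"],
    "governance": ["sensitivity", "compliance"],
}

def completeness(entries: list[dict]) -> dict[str, float]:
    """Share of catalog entries with every required field populated, per category."""
    scores = {}
    for category, fields in REQUIRED_FIELDS.items():
        complete = sum(all(e.get(f) for f in fields) for e in entries)
        scores[category] = round(complete / len(entries), 2) if entries else 0.0
    return scores

entries = [  # hypothetical catalog export
    {"schema": "...", "location": "s3://...", "format": "parquet",
     "owner": "jane.smith@company.com", "description": "...", "tags": ["pii"],
     "freshness_sla": "< 4 hours", "quality_score": 0.94,
     "sensitivity": "PII", "compliance": ["GDPR"]},
    {"schema": "...", "location": "s3://...", "format": "csv",
     "owner": "", "description": "", "tags": [],
     "freshness_sla": "", "quality_score": None,
     "sensitivity": "Internal", "compliance": []},
]
print(completeness(entries))
# {'technical': 1.0, 'business': 0.5, 'operational': 0.5, 'governance': 0.5}
```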

2. Ownership Matrix

| Domain | Owner | Steward Team | Datasets | Coverage | Avg Response Time |
| --- | --- | --- | --- | --- | --- |
| Sales | Jane Smith | Sales Analytics | 67 | 100% | 4 hours |
| Marketing | John Doe | Marketing Ops | 43 | 95% | 6 hours |
| Product | Alice Johnson | Product Analytics | 89 | 98% | 3 hours |
| Finance | Bob Wilson | Finance IT | 34 | 100% | 2 hours |
| Engineering | Carol Martinez | Platform Team | 156 | 85% | 8 hours |

3. Gap Remediation Roadmap

| Priority | Gap | Affected Assets | Remediation | Owner | Timeline | Status |
| --- | --- | --- | --- | --- | --- | --- |
| P0 | PII without encryption | 23 tables | Implement column-level encryption | Security Team | 2 weeks | In Progress |
| P0 | No assigned owner | 89 datasets | Ownership workshops | Data Governance | 3 weeks | 60% Complete |
| P1 | Schema drift detected | 45 tables | Implement schema validation | Platform Team | 4 weeks | Planned |
| P1 | Missing lineage | 134 datasets | Deploy lineage tracking | Data Engineering | 6 weeks | Planned |
| P2 | Stale documentation | 267 datasets | Automate metadata refresh | Catalog Admin | 8 weeks | Planned |
| P2 | Duplicate datasets | 56 pairs | Consolidation project | Domain Owners | 12 weeks | Scoping |

Best Practices

  1. Start with High-Value Assets: Focus on critical datasets used by production models first
  2. Automate Discovery: Use tools to bootstrap catalog (80% automated, 20% manual)
  3. Engage Domain Experts: Business context requires human input, not just technical scanning
  4. Make It Self-Service: Enable users to discover and request access without tickets
  5. Keep Catalog Current: Automate metadata refresh to prevent staleness
  6. Measure Adoption: Track search queries, access requests, time-to-data metrics
  7. Integrate into Workflows: Embed catalog in data development tools (IDEs, notebooks)
  8. Lightweight Governance: Balance control with agility; don't over-bureaucratize

Common Pitfalls to Avoid

  1. Boiling the Ocean: Trying to catalog everything at once (start with critical 20%)
  2. Metadata Burden: Making metadata entry too complex (keep it simple, automate)
  3. Catalog Shelfware: Building catalog that nobody uses (prioritize user experience)
  4. Stale Information: Letting catalog become outdated (automate refresh, set SLAs)
  5. Lack of Enforcement: Not requiring catalog usage for data access
  6. Tool Over-Engineering: Choosing complex enterprise tools for simple needs
  7. Missing Business Context: Focusing only on technical metadata
  8. Ignoring Legacy Systems: Only cataloging modern cloud infrastructure
  9. No Ownership Model: Building catalog without clear data ownership
  10. Poor Search Experience: Complex interfaces that frustrate users

Decision Framework: When to Invest in Data Cataloging

```mermaid
graph TD
    A[Should we build a data catalog?] --> B{How many data sources?}
    B -->|< 10| C[Spreadsheet/Wiki OK]
    B -->|10-50| D{How many users?}
    B -->|> 50| E[Definitely need catalog]
    D -->|< 5| C
    D -->|> 5| F{Regulated industry?}
    F -->|Yes| E
    F -->|No| G{Data quality issues?}
    G -->|Yes| E
    G -->|No| H[Start simple, grow later]
    style E fill:#d4edda,stroke:#333,stroke-width:2px
    style C fill:#fff3cd,stroke:#333,stroke-width:2px
    style H fill:#cce5ff,stroke:#333,stroke-width:2px
```

Investment Decision Criteria:

  • Definitely invest if: >50 data sources, >10 data users, regulated industry, frequent data quality issues, AI/ML at scale
  • Start simple if: <10 sources, <5 users, simple use cases, small team
  • Plan to scale if: Growing team, expanding data landscape, planning AI investments