Catalog

The Catalog is a centralized repository for managing and organizing data assets across your organization. It provides metadata management, data discovery, and automated schema tracking.

Overview

The Catalog enables you to discover, browse, and manage your data assets across multiple vault connections. It automatically extracts metadata, tracks schema changes, and provides detailed information about your tables and datasets.

Catalog Interface

The catalog interface consists of three main sections:

Discover Catalog

  • Vault Selection: Choose from your configured vaults to discover available data sources
  • Dataset/Database Selection: Browse available databases or datasets within the selected vault
  • Table Discovery: View and select multiple tables for cataloging
  • Bulk Operations: Catalog multiple tables simultaneously

Browse Catalog

  • Search and Filter: Find cataloged tables using various filters
  • Metadata Viewing: Access comprehensive table and column information
  • Schema Details: View column names, data types, and nullability
  • Version History: Track schema changes and updates over time

Manage Catalog

  • Table Management: Add or remove tables from the catalog
  • Bulk Deletion: Remove multiple catalog entries at once
  • Maintenance Operations: Perform catalog cleanup and optimization

Using the Catalog

Discovering and Cataloging Data

  1. Navigate to Discover Catalog

    • Select the vault type (e.g., BigQuery, PostgreSQL, Snowflake)
    • Choose the specific vault connection from the dropdown
    • Select the target dataset or database
  2. Select Tables for Cataloging

    • View available tables in the selected dataset
    • Use multi-select to choose tables (e.g., "customers", "orders", "orders_partitioned")
    • Click "Catalog" to add selected tables to the catalog
  3. Automatic Metadata Extraction

    • DeepDQ automatically extracts the schema of each selected table
    • It captures column information, including names, types, and constraints
    • It creates an initial catalog entry for each table from this metadata
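
DeepDQ performs this extraction internally, and its implementation is not described here. As a rough analogy only, the sketch below reads column names, types, and nullability from a database's own metadata, using SQLite from the Python standard library as a stand-in for a real vault. The function name `extract_schema` and the sample table are hypothetical.

```python
import sqlite3


def extract_schema(conn: sqlite3.Connection, table: str) -> list[dict]:
    """Return one dict per column: name, declared type, and nullability."""
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [
        {"name": name, "type": col_type, "nullable": not notnull}
        for _cid, name, col_type, notnull, _default, _pk in rows
    ]


# Hypothetical sample table, created in memory so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER NOT NULL, email TEXT, signup_date DATE)"
)
schema = extract_schema(conn, "customers")
for column in schema:
    print(column)
```

Each vault type exposes its metadata differently, so a real extractor would need one such query per supported source.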

Browsing Cataloged Data

  1. Access Browse Catalog

    • View all cataloged tables across your organization
    • Filter by vault type, dataset, or table name
    • Search for specific tables or data assets
  2. View Table Details

    • Click on any table to view detailed information
    • Basic Information: Dataset, GCP project (for BigQuery), and source vault
    • Schema Details: Complete column listing with data types
    • Version Information: Creation and update timestamps
    • Metadata: Comprehensive table documentation
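
The filters in Browse Catalog combine as you would expect: a table is listed only if it matches every filter you set. The sketch below illustrates that behavior under assumed semantics (exact match on vault type and dataset, case-insensitive substring match on table name). `CatalogRow` and `filter_catalog` are hypothetical names and not part of any DeepDQ API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogRow:
    vault_type: str
    dataset: str
    table: str


def filter_catalog(rows, vault_type=None, dataset=None, name_contains=None):
    """Keep rows matching every filter that was supplied (None means 'any')."""
    result = []
    for row in rows:
        if vault_type is not None and row.vault_type != vault_type:
            continue
        if dataset is not None and row.dataset != dataset:
            continue
        if name_contains is not None and name_contains.lower() not in row.table.lower():
            continue
        result.append(row)
    return result


# Hypothetical catalog contents.
rows = [
    CatalogRow("BigQuery", "sales", "customers"),
    CatalogRow("BigQuery", "sales", "orders"),
    CatalogRow("BigQuery", "sales", "orders_partitioned"),
    CatalogRow("PostgreSQL", "public", "orders"),
]
matches = filter_catalog(rows, vault_type="BigQuery", name_contains="orders")
print([r.table for r in matches])  # ['orders', 'orders_partitioned']
```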

Table Information Display

Each cataloged table shows:

  • Table Name: Full table identifier
  • Source Information:
    • Dataset/Database name
    • GCP Project ID (for BigQuery)
    • Source vault connection
  • Schema Version: Current version with timestamp
  • Last Updated: Most recent schema update
  • Column Details:
    • Column names and data types (for example, INT64, STRING, NUMERIC, DATE)
    • Nullability (NULLABLE or NOT NULL)
    • Complete schema structure
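
If it helps to think of a catalog entry as a record, the fields listed above could be modeled roughly as follows. This is an illustrative sketch, not DeepDQ's actual data model: the class names, field names, and sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ColumnInfo:
    name: str
    data_type: str   # e.g. "INT64", "STRING", "NUMERIC", "DATE"
    nullable: bool   # True for NULLABLE, False for NOT NULL


@dataclass
class CatalogEntry:
    table_name: str
    dataset: str
    source_vault: str
    schema_version: int
    last_updated: datetime
    columns: list[ColumnInfo] = field(default_factory=list)
    gcp_project_id: Optional[str] = None  # only meaningful for BigQuery


# Hypothetical entry for a BigQuery table.
entry = CatalogEntry(
    table_name="orders",
    dataset="sales",
    source_vault="analytics-bigquery",
    schema_version=2,
    last_updated=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    columns=[
        ColumnInfo("order_id", "INT64", nullable=False),
        ColumnInfo("order_date", "DATE", nullable=True),
    ],
    gcp_project_id="example-project",
)
required = [c.name for c in entry.columns if not c.nullable]
print(required)  # ['order_id']
```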

Managing Catalog Entries

  1. Navigate to Manage Catalog

    • Select vault type and specific vault
    • View tables available for deletion or management
  2. Remove Tables from Catalog

    • Select the tables to remove from the catalog
    • Use bulk operations to delete multiple tables at once
    • Confirm removal with "Delete Selected Tables"
  3. Catalog Maintenance

    • Clean up outdated entries regularly
    • Apply management operations in bulk
    • Manage schema versions
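
What counts as an "outdated" entry depends on your organization. One possible criterion, assumed here purely for illustration, is that an entry has not been updated within a chosen number of days. The sketch below shows that check; `find_stale`, the 90-day threshold, and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stale(last_updated: dict[str, datetime], now: datetime,
               max_age_days: int = 90) -> list[str]:
    """Return table names whose last update is older than the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, ts in last_updated.items() if ts < cutoff)


# Hypothetical last-updated timestamps for three cataloged tables.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
last_updated = {
    "customers": datetime(2024, 5, 20, tzinfo=timezone.utc),
    "orders": datetime(2023, 12, 1, tzinfo=timezone.utc),
    "orders_partitioned": datetime(2024, 1, 15, tzinfo=timezone.utc),
}
stale = find_stale(last_updated, now)
print(stale)  # ['orders', 'orders_partitioned']
```

A list like this is only a set of candidates for review. A table with a stable schema is not necessarily obsolete.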

Key Features

Automated Schema Discovery

  • Multi-Vault Support: Discover tables across different database types
  • Real-time Schema Extraction: Automatic metadata collection
  • Schema Versioning: Track changes over time
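
Conceptually, tracking schema changes means comparing one version of a table's columns with the next. The sketch below shows one simple way to express such a comparison, with each schema version represented as a mapping from column name to data type. It is not DeepDQ's versioning logic; `diff_schemas` and the sample schemas are hypothetical.

```python
def diff_schemas(old: dict[str, str], new: dict[str, str]) -> dict[str, list]:
    """Compare two {column_name: data_type} mappings."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # Columns present in both versions whose type differs: (name, old, new).
    changed = sorted(
        (name, old[name], new[name])
        for name in old.keys() & new.keys()
        if old[name] != new[name]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical versions 1 and 2 of the same table.
v1 = {"id": "INT64", "email": "STRING", "total": "NUMERIC"}
v2 = {"id": "INT64", "email": "STRING", "total": "STRING", "created": "DATE"}
changes = diff_schemas(v1, v2)
print(changes)
```

Here the comparison reports one added column (`created`) and one type change (`total` from NUMERIC to STRING).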
  • Bulk Operations: Catalog multiple tables efficiently

Comprehensive Metadata Management

  • Column-Level Details: Data types, constraints, and nullability
  • Source Tracking: Vault and dataset origin information
  • Version Control: Schema change history and timestamps
  • Cross-Platform Support: Works with BigQuery, PostgreSQL, Snowflake, and other supported vaults

Integration with DeepDQ Features

  • Vault Integration: Seamless connection to configured vaults
  • Sentinel Compatibility: Cataloged tables are available for quality monitoring
  • Data Lineage: Schema information feeds into lineage tracking
  • Alert Integration: Schema changes can trigger notifications

Supported Vault Types

The catalog supports discovery and management across all DeepDQ vault types:

  • BigQuery: Datasets and tables with full schema extraction
  • PostgreSQL: Databases, schemas, and table structures
  • Snowflake: Warehouses, databases, and schema information
  • SQL Server: Databases and table metadata
  • MySQL: Database and table schema discovery
  • Databricks: Catalogs, schemas, and Delta Lake tables
  • Redshift: Schemas and table structures
  • Oracle: Schema and table metadata extraction

Best Practices

Efficient Cataloging

  • Start Small: Begin with critical datasets and expand gradually
  • Use Bulk Operations: Catalog related tables together for efficiency
  • Regular Updates: Refresh catalog entries to capture schema changes
  • Organize by Domain: Group related tables by business domain or project

Metadata Management

  • Consistent Naming: Follow naming conventions for easy discovery
  • Regular Maintenance: Remove obsolete or unused catalog entries
  • Version Tracking: Monitor schema changes and their impact
  • Documentation: Add business context and descriptions where possible

Integration Workflow

  1. Configure Vaults: Set up secure connections to your data sources
  2. Discover Tables: Use the catalog to find and inventory data assets
  3. Catalog Tables: Add important tables to the centralized catalog
  4. Enable Monitoring: Set up Sentinels for quality monitoring
  5. Maintain Regularly: Keep the catalog current with schema changes