Databricks Vault Configuration

Databricks vaults provide secure access to Databricks SQL warehouses and clusters, enabling data quality monitoring across your unified analytics platform.

Overview

Databricks is a unified analytics platform that combines data engineering, data science, and business analytics. DeepDQ's Databricks vault integration connects to Databricks SQL warehouses and compute clusters, enabling:

  • Sentinels to monitor data quality across your lakehouse architecture
  • Data Catalog to discover Delta Lake tables
  • Data Lineage to track data relationships
  • DAB Chatbot to provide conversational analytics

Configuration Parameters

Required Fields

  • Name: A unique identifier for your Databricks vault
  • Type: Databricks (automatically selected)
  • Server Host Name: The Databricks workspace hostname
  • HTTP Path: The HTTP path to your SQL warehouse or cluster
  • Access Token: Your Databricks personal access token (securely encrypted)

Example Configuration

Name: Production Databricks Vault
Type: Databricks
Server Host Name: your-workspace.cloud.databricks.com
HTTP Path: /sql/1.0/warehouses/your-warehouse-id
Access Token: [encrypted]
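
To sanity-check these values outside DeepDQ, you can open a connection with the open-source databricks-sql-connector package. The sketch below is illustrative only; the hostname, HTTP path, and token are placeholders taken from the example above.

from databricks import sql  # pip install databricks-sql-connector

SERVER_HOSTNAME = "your-workspace.cloud.databricks.com"   # placeholder
HTTP_PATH = "/sql/1.0/warehouses/your-warehouse-id"       # placeholder
ACCESS_TOKEN = "dapi..."                                   # personal access token (keep secret)

# Open a connection and run a trivial query to confirm the parameters are valid
with sql.connect(
    server_hostname=SERVER_HOSTNAME,
    http_path=HTTP_PATH,
    access_token=ACCESS_TOKEN,
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1 AS connectivity_check")
        print(cursor.fetchone())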

Server Host Name Format

Databricks workspace URLs follow these patterns:

  • AWS: your-workspace.cloud.databricks.com
  • Azure: adb-workspace-id.azuredatabricks.net (per-workspace URLs include a numeric suffix, for example adb-1234567890123456.7.azuredatabricks.net)
  • GCP: your-workspace.gcp.databricks.com

HTTP Path Configuration

SQL Warehouses

SQL warehouses provide optimized performance for analytics workloads:

  • Format: /sql/1.0/warehouses/{warehouse-id}
  • Serverless compute with auto-scaling
  • Optimized for concurrent users
  • Built-in caching and optimization

Interactive Clusters

For development and interactive analysis:

  • Format: /sql/protocolv1/o/{org-id}/{cluster-id}
  • Customizable compute configurations
  • Support for multiple runtime versions
  • Ideal for data exploration

Authentication

Personal Access Tokens

  • Generate tokens from Databricks workspace settings
  • Set appropriate token lifetime and permissions
  • Rotate tokens regularly for security
  • Use workspace or account-level tokens as needed
  • Personal access tokens are the authentication method currently supported by DeepDQ (see the token-creation sketch below)
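
Outside the workspace UI, personal access tokens can also be created and audited through the Databricks Token REST API (POST /api/2.0/token/create, GET /api/2.0/token/list). The sketch below assumes you already hold a valid token; the workspace URL, lifetime, and comment are placeholders.

import requests

WORKSPACE_URL = "https://your-workspace.cloud.databricks.com"  # placeholder
EXISTING_TOKEN = "dapi..."                                      # placeholder

headers = {"Authorization": f"Bearer {EXISTING_TOKEN}"}

# Create a new personal access token with a 90-day lifetime
resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/token/create",
    headers=headers,
    json={"lifetime_seconds": 90 * 24 * 3600, "comment": "DeepDQ vault token"},
)
resp.raise_for_status()
new_token = resp.json()["token_value"]  # shown only once; store it securely

# List existing tokens to plan rotation and revoke stale ones
tokens = requests.get(f"{WORKSPACE_URL}/api/2.0/token/list", headers=headers).json()
for info in tokens.get("token_infos", []):
    print(info.get("comment"), info.get("expiry_time"))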

Supported Databricks Features

Delta Lake Integration

  • Native support for Delta Lake tables
  • Time travel capabilities for historical analysis
  • Schema evolution tracking
  • ACID transaction monitoring
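
Time travel and table history are plain SQL against a Delta table and can be issued over the same kind of connection the vault uses. In the sketch below the catalog, schema, and table names are hypothetical.

from databricks import sql

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",  # placeholders
    http_path="/sql/1.0/warehouses/your-warehouse-id",
    access_token="dapi...",
) as conn, conn.cursor() as cur:
    # Read the table as it was at an earlier version (time travel)
    cur.execute("SELECT COUNT(*) FROM main.silver.orders VERSION AS OF 10")
    print("rows at version 10:", cur.fetchone()[0])

    # Inspect recent commits: schema changes, writes, OPTIMIZE runs, etc.
    cur.execute("DESCRIBE HISTORY main.silver.orders LIMIT 5")
    for row in cur.fetchall():
        print(row)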

Unity Catalog

  • Centralized metadata and governance
  • Cross-workspace data discovery
  • Fine-grained access control
  • Data lineage tracking

Databricks SQL

  • High-performance SQL analytics engine
  • Photon acceleration support
  • Serverless compute optimization
  • Advanced caching mechanisms

Common Use Cases

Lakehouse Data Quality Monitoring

  • Execute Sentinels across bronze, silver, and gold layers
  • Monitor Delta Lake table quality and consistency
  • Validate ETL/ELT pipeline outputs
  • Track data freshness and completeness in lakehouses
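
Sentinels encapsulate these checks inside DeepDQ, but conceptually a freshness-and-completeness probe against a gold-layer table reduces to a query like the one below; the table and column names are hypothetical.

from databricks import sql

QUALITY_SQL = """
    SELECT max(updated_at)                 AS last_update,
           count(*)                        AS row_count,
           count_if(customer_id IS NULL)   AS null_customer_ids
    FROM main.gold.daily_orders
"""

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",  # placeholders
    http_path="/sql/1.0/warehouses/your-warehouse-id",
    access_token="dapi...",
) as conn, conn.cursor() as cur:
    cur.execute(QUALITY_SQL)
    last_update, row_count, null_ids = cur.fetchone()
    print(f"last update: {last_update}, rows: {row_count}, null keys: {null_ids}")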

Delta Lake Schema Discovery

  • Automatic discovery of Unity Catalog objects
  • Extract metadata from Delta Lake tables
  • Track schema evolution and versioning
  • Document lakehouse data assets
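
Automated discovery ultimately reads Unity Catalog's information_schema views. A minimal sketch of listing tables across catalogs is shown below; on Unity Catalog-enabled workspaces, the system catalog's information_schema spans the whole metastore.

from databricks import sql

DISCOVERY_SQL = """
    SELECT table_catalog, table_schema, table_name, table_type
    FROM system.information_schema.tables
    WHERE table_schema <> 'information_schema'
    ORDER BY table_catalog, table_schema, table_name
"""

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",  # placeholders
    http_path="/sql/1.0/warehouses/your-warehouse-id",
    access_token="dapi...",
) as conn, conn.cursor() as cur:
    cur.execute(DISCOVERY_SQL)
    for catalog, schema, table, table_type in cur.fetchall():
        print(f"{catalog}.{schema}.{table} ({table_type})")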

Lakehouse Data Lineage

  • Map data flow across Delta Lake layers
  • Track transformations in data pipelines
  • Monitor cross-workspace data relationships
  • Visualize lakehouse architecture and dependencies
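
On workspaces where Unity Catalog system tables are enabled, table-level lineage can also be queried directly from system.access.table_lineage. The column names below follow the published system-table schema as we understand it, and the gold-layer filter is hypothetical; confirm against your workspace.

from databricks import sql

LINEAGE_SQL = """
    SELECT source_table_full_name, target_table_full_name, event_time
    FROM system.access.table_lineage
    WHERE target_table_full_name LIKE 'main.gold.%'
    ORDER BY event_time DESC
    LIMIT 20
"""

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",  # placeholders
    http_path="/sql/1.0/warehouses/your-warehouse-id",
    access_token="dapi...",
) as conn, conn.cursor() as cur:
    cur.execute(LINEAGE_SQL)
    for src, dst, ts in cur.fetchall():
        print(f"{src} -> {dst} at {ts}")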

ML and Analytics Data Exploration

  • DAB Chatbot integration for feature store queries
  • Natural language exploration of ML datasets
  • Conversational analytics for business insights
  • Interactive data discovery and analysis

Best Practices

Security

  • Use service principals for production workloads
  • Implement IP access lists for workspace security
  • Enable audit logging for compliance
  • Rotate access tokens regularly
  • Network security: Contact salesandsupport@deepanalyze.ai to obtain the DeepDQ static IP address to add to your IP access lists

Performance

  • Use SQL warehouses for production monitoring
  • Enable result caching for repeated queries
  • Optimize table clustering and partitioning
  • Monitor compute resource utilization

Cost Optimization

  • Configure auto-stop for SQL warehouses
  • Right-size compute resources for workload
  • Use spot instances where appropriate
  • Monitor DBU consumption and optimize usage
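
As one example, auto-stop can be verified and adjusted through the SQL Warehouses REST API. The endpoint and field names below follow the public API as we understand it, and the workspace URL, token, and warehouse ID are placeholders.

import requests

WORKSPACE_URL = "https://your-workspace.cloud.databricks.com"  # placeholder
TOKEN = "dapi..."                                               # placeholder
WAREHOUSE_ID = "your-warehouse-id"                              # placeholder
headers = {"Authorization": f"Bearer {TOKEN}"}

# Inspect the current warehouse settings
current = requests.get(
    f"{WORKSPACE_URL}/api/2.0/sql/warehouses/{WAREHOUSE_ID}", headers=headers
).json()
print("current auto_stop_mins:", current.get("auto_stop_mins"))

# Stop the warehouse after 15 idle minutes to cap DBU consumption
resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/sql/warehouses/{WAREHOUSE_ID}/edit",
    headers=headers,
    json={"auto_stop_mins": 15},
)
resp.raise_for_status()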

Troubleshooting

Connection Issues

Authentication Failures

  • Verify access token validity and permissions
  • Check token expiration dates
  • Validate workspace access and user permissions
  • Review IP access list restrictions

HTTP Path Errors

  • Confirm warehouse or cluster ID accuracy
  • Verify warehouse/cluster is running and accessible
  • Check HTTP path format for your deployment type
  • Ensure compute resource availability

Network Connectivity Issues

  • Validate workspace URL and hostname
  • Check firewall and proxy configurations
  • Verify DNS resolution for workspace
  • Test connectivity from DeepDQ environment
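
A quick way to isolate the failing layer from the machine that runs DeepDQ is to test DNS, then TLS on port 443, then an end-to-end query. The hostname, HTTP path, and token below are placeholders.

import socket
import ssl
from databricks import sql

HOSTNAME = "your-workspace.cloud.databricks.com"  # placeholder

# 1. DNS resolution
print("resolved to:", socket.gethostbyname(HOSTNAME))

# 2. TCP connection and TLS handshake on port 443
ctx = ssl.create_default_context()
with socket.create_connection((HOSTNAME, 443), timeout=10) as raw:
    with ctx.wrap_socket(raw, server_hostname=HOSTNAME) as tls:
        print("TLS OK, negotiated:", tls.version())

# 3. End-to-end query (requires a valid token and HTTP path)
with sql.connect(
    server_hostname=HOSTNAME,
    http_path="/sql/1.0/warehouses/your-warehouse-id",
    access_token="dapi...",
) as conn, conn.cursor() as cur:
    cur.execute("SELECT 1")
    print("query OK:", cur.fetchone())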

Performance Issues

Slow Query Execution

  • Review SQL warehouse size and scaling settings
  • Check for warehouse queuing and concurrency limits
  • Optimize query patterns and complexity
  • Monitor Photon acceleration usage

High Compute Costs

  • Review auto-stop and scaling configurations
  • Optimize query frequency and scheduling
  • Monitor DBU consumption patterns
  • Consider serverless SQL for variable workloads

Advanced Configuration

Cluster Policies

  • Implement cluster policies for governance
  • Standardize compute configurations
  • Control resource usage and costs
  • Ensure compliance with organizational standards

High Availability

  • Configure multi-region deployments
  • Implement disaster recovery procedures
  • Set up cross-workspace replication
  • Monitor service availability and uptime

Databricks-Specific Features

Photon Engine

  • Vectorized query execution for improved performance
  • Automatic optimization for analytical workloads
  • Enhanced data processing speed
  • Transparent acceleration for existing queries

Auto Loader

  • Incremental data ingestion monitoring
  • Schema inference and evolution
  • Exactly-once processing guarantees
  • Cloud storage integration
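
Auto Loader runs inside a Databricks notebook or job rather than over the vault's SQL connection. A minimal PySpark sketch of an incremental JSON ingest with schema inference follows; the storage paths and target table are purely hypothetical.

# Runs in a Databricks notebook or job, where `spark` is provided by the runtime
raw_path = "s3://your-bucket/raw/orders/"                # hypothetical source path
checkpoint = "s3://your-bucket/_checkpoints/orders/"     # hypothetical checkpoint path

stream = (
    spark.readStream.format("cloudFiles")                # Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint)     # schema inference and evolution
    .load(raw_path)
)

(
    stream.writeStream
    .option("checkpointLocation", checkpoint)            # checkpointing gives exactly-once semantics
    .trigger(availableNow=True)                          # process pending files, then stop
    .toTable("main.bronze.orders")                       # hypothetical bronze Delta table
)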

MLflow Integration

  • Model registry data quality tracking
  • Experiment tracking and reproducibility
  • Model serving data validation
  • Feature store monitoring