mirror of https://github.com/databricks/cli.git
8182 lines
281 KiB
Markdown
<!-- DO NOT EDIT. This file is autogenerated with https://github.com/databricks/cli -->
---
description: Learn about resources supported by Databricks Asset Bundles and how to configure them.
---

# <DABS> resources

<DABS> allows you to specify information about the <Databricks> resources used by the bundle in the `resources` mapping in the bundle configuration. See [resources mapping](/dev-tools/bundles/settings.md#resources) and [resources key reference](/dev-tools/bundles/reference.md#resources).

This article outlines supported resource types for bundles and provides details and an example for each supported type. For additional examples, see [_](/dev-tools/bundles/resource-examples.md).

## <a id="resource-types"></a> Supported resources

The following table lists supported resource types for bundles. Some resources can be created by defining them in a bundle and deploying the bundle, and some resources only support referencing an existing resource to include in the bundle.

Resources are defined using the corresponding [Databricks REST API](/api/workspace/introduction) object's create operation request payload, where the object's supported fields, expressed as YAML, are the resource's supported properties. Links to documentation for each resource's corresponding payloads are listed in the table.

.. tip:: The `databricks bundle validate` command returns warnings if unknown resource properties are found in bundle configuration files.

.. list-table::
   :header-rows: 1

   * - Resource
     - Create support
     - Corresponding REST API object

   * - [cluster](#cluster)
     - ✓
     - [Cluster object](/api/workspace/clusters/create)

   * - [dashboard](#dashboard)
     -
     - [Dashboard object](/api/workspace/lakeview/create)

   * - [experiment](#experiment)
     - ✓
     - [Experiment object](/api/workspace/experiments/createexperiment)

   * - [job](#job)
     - ✓
     - [Job object](/api/workspace/jobs/create)

   * - [model (legacy)](#model-legacy)
     - ✓
     - [Model (legacy) object](/api/workspace/modelregistry/createmodel)

   * - [model_serving_endpoint](#model-serving-endpoint)
     - ✓
     - [Model serving endpoint object](/api/workspace/servingendpoints/create)

   * - [pipeline](#pipeline)
     - ✓
     - [Pipeline object](/api/workspace/pipelines/create)

   * - [quality_monitor](#quality-monitor)
     - ✓
     - [Quality monitor object](/api/workspace/qualitymonitors/create)

   * - [registered_model](#registered-model) (<UC>)
     - ✓
     - [Registered model object](/api/workspace/registeredmodels/create)

   * - [schema](#schema) (<UC>)
     - ✓
     - [Schema object](/api/workspace/schemas/create)

   * - [volume](#volume) (<UC>)
     - ✓
     - [Volume object](/api/workspace/volumes/create)

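As a rough illustration of how a create request payload maps onto bundle YAML, the sketch below declares a job whose properties mirror two fields of the Jobs create payload (`name` and `max_concurrent_runs`). The resource key `my_job` and the values are hypothetical and chosen only for this example.

```yaml
# Hypothetical sketch: the JSON create payload
#   {"name": "nightly-report", "max_concurrent_runs": 1}
# expressed as a bundle resource. `my_job` is an arbitrary key for this example.
resources:
  jobs:
    my_job:
      name: nightly-report
      max_concurrent_runs: 1
```
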
## apps

**`Type: Map`**

```yaml
apps:
  <app-name>:
    <app-field-name>: <app-field-value>
```

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `active_deployment`
     - Map
     - See [_](#apps.<name>.active_deployment).

   * - `app_status`
     - Map
     - See [_](#apps.<name>.app_status).

   * - `compute_status`
     - Map
     - See [_](#apps.<name>.compute_status).

   * - `config`
     - Map
     -

   * - `create_time`
     - String
     -

   * - `creator`
     - String
     -

   * - `default_source_code_path`
     - String
     -

   * - `description`
     - String
     -

   * - `name`
     - String
     -

   * - `pending_deployment`
     - Map
     - See [_](#apps.<name>.pending_deployment).

   * - `permissions`
     - Sequence
     - See [_](#apps.<name>.permissions).

   * - `resources`
     - Sequence
     - See [_](#apps.<name>.resources).

   * - `service_principal_client_id`
     - String
     -

   * - `service_principal_id`
     - Integer
     -

   * - `service_principal_name`
     - String
     -

   * - `source_code_path`
     - String
     -

   * - `update_time`
     - String
     -

   * - `updater`
     - String
     -

   * - `url`
     - String
     -

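A minimal sketch of an app definition might look like the following. It sets only a few of the keys listed above; the resource key `my_app`, the app name, the description, and the source path are placeholders invented for illustration, not values required by the CLI.

```yaml
# Illustrative only: `my_app`, `my-app`, and `./app` are placeholder values.
resources:
  apps:
    my_app:
      name: my-app
      description: An example app deployed from a bundle
      source_code_path: ./app
```
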
### apps.<name>.active_deployment

**`Type: Map`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `create_time`
     - String
     -

   * - `creator`
     - String
     -

   * - `deployment_artifacts`
     - Map
     - See [_](#apps.<name>.active_deployment.deployment_artifacts).

   * - `deployment_id`
     - String
     -

   * - `mode`
     - String
     -

   * - `source_code_path`
     - String
     -

   * - `status`
     - Map
     - See [_](#apps.<name>.active_deployment.status).

   * - `update_time`
     - String
     -

### apps.<name>.active_deployment.deployment_artifacts

**`Type: Map`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `source_code_path`
     - String
     -

### apps.<name>.active_deployment.status

**`Type: Map`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `message`
     - String
     -

   * - `state`
     - String
     -

### apps.<name>.app_status

**`Type: Map`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `message`
     - String
     -

   * - `state`
     - String
     -

### apps.<name>.compute_status

**`Type: Map`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `message`
     - String
     -

   * - `state`
     - String
     -

### apps.<name>.pending_deployment

**`Type: Map`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `create_time`
     - String
     -

   * - `creator`
     - String
     -

   * - `deployment_artifacts`
     - Map
     - See [_](#apps.<name>.pending_deployment.deployment_artifacts).

   * - `deployment_id`
     - String
     -

   * - `mode`
     - String
     -

   * - `source_code_path`
     - String
     -

   * - `status`
     - Map
     - See [_](#apps.<name>.pending_deployment.status).

   * - `update_time`
     - String
     -

### apps.<name>.pending_deployment.deployment_artifacts

**`Type: Map`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `source_code_path`
     - String
     -

### apps.<name>.pending_deployment.status

**`Type: Map`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `message`
     - String
     -

   * - `state`
     - String
     -

### apps.<name>.permissions

**`Type: Sequence`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `group_name`
     - String
     - The name of the group that has the permission set in `level`.

   * - `level`
     - String
     - The permission level granted to the user, group, or service principal named in this entry.

   * - `service_principal_name`
     - String
     - The name of the service principal that has the permission set in `level`.

   * - `user_name`
     - String
     - The name of the user that has the permission set in `level`.

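As a hedged sketch, each entry in the `permissions` sequence pairs one principal key with a `level`. The principals below are placeholders, and the level names `CAN_USE` and `CAN_MANAGE` are assumed to be the values accepted for apps; check the Apps permissions API for the authoritative list.

```yaml
# Illustrative only: principals are placeholders; level names are assumed.
resources:
  apps:
    my_app:
      name: my-app
      source_code_path: ./app
      permissions:
        - group_name: users
          level: CAN_USE
        - user_name: someone@example.com
          level: CAN_MANAGE
```
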
### apps.<name>.resources

**`Type: Sequence`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `description`
     - String
     -

   * - `job`
     - Map
     - See [_](#apps.<name>.resources.job).

   * - `name`
     - String
     -

   * - `secret`
     - Map
     - See [_](#apps.<name>.resources.secret).

   * - `serving_endpoint`
     - Map
     - See [_](#apps.<name>.resources.serving_endpoint).

   * - `sql_warehouse`
     - Map
     - See [_](#apps.<name>.resources.sql_warehouse).

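The following sketch shows one way an app might declare a job and a SQL warehouse as resources. The job `my_job`, the warehouse ID placeholder, the resource names, and the permission values (`CAN_MANAGE_RUN`, `CAN_USE`) are assumptions made for illustration. The `${resources.jobs.my_job.id}` reference relies on bundle substitutions and assumes a job with that key is defined elsewhere in the same bundle.

```yaml
# Illustrative only: names, IDs, and permission values are assumptions.
resources:
  apps:
    my_app:
      name: my-app
      source_code_path: ./app
      resources:
        - name: app-job
          description: Job the app is allowed to trigger
          job:
            id: ${resources.jobs.my_job.id}  # assumes `my_job` is defined in this bundle
            permission: CAN_MANAGE_RUN
        - name: app-warehouse
          sql_warehouse:
            id: <warehouse-id>               # placeholder for an existing warehouse ID
            permission: CAN_USE
```
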
### apps.<name>.resources.job

**`Type: Map`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `id`
     - String
     -

   * - `permission`
     - String
     -

### apps.<name>.resources.secret

**`Type: Map`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `key`
     - String
     -

   * - `permission`
     - String
     -

   * - `scope`
     - String
     -

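A secret resource is identified by its `scope` and `key`. The sketch below assumes a secret scope named `my-scope` containing a key `api-token` already exists, and that `READ` is an accepted permission value; all three are illustrative assumptions.

```yaml
# Illustrative only: scope, key, and permission value are assumptions.
resources:
  apps:
    my_app:
      name: my-app
      source_code_path: ./app
      resources:
        - name: api-token
          secret:
            scope: my-scope
            key: api-token
            permission: READ
```
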
### apps.<name>.resources.serving_endpoint

**`Type: Map`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `name`
     - String
     -

   * - `permission`
     - String
     -

### apps.<name>.resources.sql_warehouse

**`Type: Map`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `id`
     - String
     -

   * - `permission`
     - String
     -

## clusters

**`Type: Map`**

The cluster resource defines an [all-purpose cluster](/api/workspace/clusters/create).

```yaml
clusters:
  <cluster-name>:
    <cluster-field-name>: <cluster-field-value>
```

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `apply_policy_default_values`
     - Boolean
     - When set to true, fixed and default values from the policy will be used for fields that are omitted. When set to false, only fixed values from the policy will be applied.

   * - `autoscale`
     - Map
     - Parameters needed in order to automatically scale clusters up and down based on load. Note: autoscaling works best with DB runtime versions 3.0 or later. See [_](#clusters.<name>.autoscale).

   * - `autotermination_minutes`
     - Integer
     - Automatically terminates the cluster after it is inactive for this time in minutes. If not set, this cluster will not be automatically terminated. If specified, the threshold must be between 10 and 10000 minutes. Users can also set this value to 0 to explicitly disable automatic termination.

   * - `aws_attributes`
     - Map
     - Attributes related to clusters running on Amazon Web Services. If not specified at cluster creation, a set of default values will be used. See [_](#clusters.<name>.aws_attributes).

   * - `azure_attributes`
     - Map
     - Attributes related to clusters running on Microsoft Azure. If not specified at cluster creation, a set of default values will be used. See [_](#clusters.<name>.azure_attributes).

   * - `cluster_log_conf`
     - Map
     - The configuration for delivering Spark logs to a long-term storage destination. Two kinds of destinations (DBFS and S3) are supported. Only one destination can be specified for one cluster. If the conf is given, the logs will be delivered to the destination every 5 minutes. The destination of driver logs is `$destination/$clusterId/driver`, while the destination of executor logs is `$destination/$clusterId/executor`. See [_](#clusters.<name>.cluster_log_conf).

   * - `cluster_name`
     - String
     - Cluster name requested by the user. This doesn't have to be unique. If not specified at creation, the cluster name will be an empty string.

   * - `custom_tags`
     - Map
     - Additional tags for cluster resources. Databricks will tag all cluster resources (e.g., AWS instances and EBS volumes) with these tags in addition to `default_tags`. Notes: Currently, Databricks allows at most 45 custom tags. Clusters can only reuse cloud resources if the resources' tags are a subset of the cluster tags.

   * - `data_security_mode`
     - String
     - Data security mode decides what data governance model to use when accessing data from a cluster. The following modes can only be used with `kind`. * `DATA_SECURITY_MODE_AUTO`: Databricks will choose the most appropriate access mode depending on your compute configuration. * `DATA_SECURITY_MODE_STANDARD`: Alias for `USER_ISOLATION`. * `DATA_SECURITY_MODE_DEDICATED`: Alias for `SINGLE_USER`. The following modes can be used regardless of `kind`. * `NONE`: No security isolation for multiple users sharing the cluster. Data governance features are not available in this mode. * `SINGLE_USER`: A secure cluster that can only be exclusively used by a single user specified in `single_user_name`. Most programming languages, cluster features and data governance features are available in this mode. * `USER_ISOLATION`: A secure cluster that can be shared by multiple users. Cluster users are fully isolated so that they cannot see each other's data and credentials. Most data governance features are supported in this mode, but programming languages and cluster features might be limited. The following modes are deprecated starting with Databricks Runtime 15.0 and will be removed in future Databricks Runtime versions: * `LEGACY_TABLE_ACL`: This mode is for users migrating from legacy Table ACL clusters. * `LEGACY_PASSTHROUGH`: This mode is for users migrating from legacy Passthrough on high concurrency clusters. * `LEGACY_SINGLE_USER`: This mode is for users migrating from legacy Passthrough on standard clusters. * `LEGACY_SINGLE_USER_STANDARD`: This mode provides a way to run clusters with neither UC nor passthrough enabled.

   * - `docker_image`
     - Map
     - See [_](#clusters.<name>.docker_image).

   * - `driver_instance_pool_id`
     - String
     - The optional ID of the instance pool to which the driver of the cluster belongs. If no driver pool is assigned, the cluster uses the instance pool with ID `instance_pool_id` for the driver.

   * - `driver_node_type_id`
     - String
     - The node type of the Spark driver. Note that this field is optional; if unset, the driver node type will be set to the same value as `node_type_id`.

   * - `enable_elastic_disk`
     - Boolean
     - Autoscaling Local Storage: when enabled, this cluster will dynamically acquire additional disk space when its Spark workers are running low on disk space. This feature requires specific AWS permissions to function correctly. Refer to the User Guide for more details.

   * - `enable_local_disk_encryption`
     - Boolean
     - Whether to enable LUKS on cluster VMs' local disks.

   * - `gcp_attributes`
     - Map
     - Attributes related to clusters running on Google Cloud Platform. If not specified at cluster creation, a set of default values will be used. See [_](#clusters.<name>.gcp_attributes).

   * - `init_scripts`
     - Sequence
     - The configuration for storing init scripts. Any number of destinations can be specified. The scripts are executed sequentially in the order provided. If `cluster_log_conf` is specified, init script logs are sent to `<destination>/<cluster-ID>/init_scripts`. See [_](#clusters.<name>.init_scripts).

   * - `instance_pool_id`
     - String
     - The optional ID of the instance pool to which the cluster belongs.

   * - `is_single_node`
     - Boolean
     - This field can only be used with `kind`. When set to true, Databricks will automatically set single node related `custom_tags`, `spark_conf`, and `num_workers`.

   * - `kind`
     - String
     -

   * - `node_type_id`
     - String
     - This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory or compute intensive workloads. A list of available node types can be retrieved by using the `clusters/listNodeTypes` API call.

   * - `num_workers`
     - Integer
     - Number of worker nodes that this cluster should have. A cluster has one Spark Driver and `num_workers` Executors for a total of `num_workers` + 1 Spark nodes. Note: When reading the properties of a cluster, this field reflects the desired number of workers rather than the actual current number of workers. For instance, if a cluster is resized from 5 to 10 workers, this field will immediately be updated to reflect the target size of 10 workers, whereas the workers listed in `spark_info` will gradually increase from 5 to 10 as the new nodes are provisioned.

   * - `permissions`
     - Sequence
     - See [_](#clusters.<name>.permissions).

   * - `policy_id`
     - String
     - The ID of the cluster policy used to create the cluster if applicable.

   * - `runtime_engine`
     - String
     - Determines the cluster's runtime engine, either standard or Photon. This field is not compatible with legacy `spark_version` values that contain `-photon-`. Remove `-photon-` from the `spark_version` and set `runtime_engine` to `PHOTON`. If left unspecified, the runtime engine defaults to standard unless the `spark_version` contains `-photon-`, in which case Photon will be used.

   * - `single_user_name`
     - String
     - Single user name if `data_security_mode` is `SINGLE_USER`.

   * - `spark_conf`
     - Map
     - An object containing a set of optional, user-specified Spark configuration key-value pairs. Users can also pass in a string of extra JVM options to the driver and the executors via `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` respectively.

   * - `spark_env_vars`
     - Map
     - An object containing a set of optional, user-specified environment variable key-value pairs. Note that a key-value pair of the form (X,Y) is exported as is (i.e., `export X='Y'`) while launching the driver and workers. To specify an additional set of `SPARK_DAEMON_JAVA_OPTS`, we recommend appending them to `$SPARK_DAEMON_JAVA_OPTS` as shown in the second example that follows. This ensures that all default Databricks-managed environment variables are included as well. Example Spark environment variables: `{"SPARK_WORKER_MEMORY": "28000m", "SPARK_LOCAL_DIRS": "/local_disk0"}` or `{"SPARK_DAEMON_JAVA_OPTS": "$SPARK_DAEMON_JAVA_OPTS -Dspark.shuffle.service.enabled=true"}`

   * - `spark_version`
     - String
     - The Spark version of the cluster, e.g. `3.3.x-scala2.11`. A list of available Spark versions can be retrieved by using the `clusters/sparkVersions` API call.

   * - `ssh_public_keys`
     - Sequence
     - SSH public key contents that will be added to each Spark node in this cluster. The corresponding private keys can be used to log in with the user name `ubuntu` on port `2200`. Up to 10 keys can be specified.

   * - `use_ml_runtime`
     - Boolean
     - This field can only be used with `kind`. `effective_spark_version` is determined by `spark_version` (DBR release), this field `use_ml_runtime`, and whether `node_type_id` is a GPU node or not.

   * - `workload_type`
     - Map
     - See [_](#clusters.<name>.workload_type).

**Example**

The following example creates a cluster named `my_cluster` and sets that as the cluster to use to run the notebook in `my_job`:

```yaml
bundle:
  name: clusters

resources:
  clusters:
    my_cluster:
      num_workers: 2
      node_type_id: "i3.xlarge"
      autoscale:
        min_workers: 2
        max_workers: 7
      spark_version: "13.3.x-scala2.12"
      spark_conf:
        "spark.executor.memory": "2g"

  jobs:
    my_job:
      tasks:
        - task_key: test_task
          # Run the task on the cluster defined above.
          existing_cluster_id: ${resources.clusters.my_cluster.id}
          notebook_task:
            notebook_path: "./src/my_notebook.py"
```

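As a further sketch of the top-level cluster keys described in the table, the configuration below sets auto-termination, a single-user security mode, and an extra daemon JVM option. The resource key, cluster name, and user name are placeholders invented for illustration; the node type and Spark version are reused from the example above.

```yaml
# Illustrative only: key names come from the table above; values are placeholders.
resources:
  clusters:
    single_user_cluster:
      cluster_name: single-user-cluster
      spark_version: "13.3.x-scala2.12"
      node_type_id: "i3.xlarge"
      num_workers: 1
      autotermination_minutes: 30        # between 10 and 10000, or 0 to disable
      data_security_mode: SINGLE_USER
      single_user_name: someone@example.com
      spark_env_vars:
        # Append to the existing value so Databricks-managed defaults are kept.
        SPARK_DAEMON_JAVA_OPTS: "$SPARK_DAEMON_JAVA_OPTS -Dspark.shuffle.service.enabled=true"
```
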
### clusters.<name>.autoscale

**`Type: Map`**

Parameters needed in order to automatically scale clusters up and down based on load. Note: autoscaling works best with DB runtime versions 3.0 or later.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `max_workers`
     - Integer
     - The maximum number of workers to which the cluster can scale up when overloaded. Note that `max_workers` must be strictly greater than `min_workers`.

   * - `min_workers`
     - Integer
     - The minimum number of workers to which the cluster can scale down when underutilized. It is also the initial number of workers the cluster will have after creation.

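A minimal sketch of an autoscaling configuration, assuming a hypothetical resource key `autoscaling_cluster` and reusing the node type and Spark version from the cluster example earlier in this article:

```yaml
# Illustrative only: the resource key and worker counts are placeholders.
resources:
  clusters:
    autoscaling_cluster:
      spark_version: "13.3.x-scala2.12"
      node_type_id: "i3.xlarge"
      autoscale:
        min_workers: 2   # also the initial number of workers after creation
        max_workers: 8   # must be strictly greater than min_workers
```
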
### clusters.<name>.aws_attributes

**`Type: Map`**

Attributes related to clusters running on Amazon Web Services. If not specified at cluster creation, a set of default values will be used.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `availability`
     - String
     - Availability type used for all subsequent nodes past the `first_on_demand` ones. Note: If `first_on_demand` is zero, this availability type will be used for the entire cluster.

   * - `ebs_volume_count`
     - Integer
     - The number of volumes launched for each instance. Users can choose up to 10 volumes. This feature is only enabled for supported node types. Legacy node types cannot specify custom EBS volumes. For node types with no instance store, at least one EBS volume needs to be specified; otherwise, cluster creation will fail. These EBS volumes will be mounted at `/ebs0`, `/ebs1`, and so on. Instance store volumes will be mounted at `/local_disk0`, `/local_disk1`, and so on. If EBS volumes are attached, Databricks will configure Spark to use only the EBS volumes for scratch storage because heterogeneously sized scratch devices can lead to inefficient disk utilization. If no EBS volumes are attached, Databricks will configure Spark to use instance store volumes. Note that if EBS volumes are specified, then the Spark configuration `spark.local.dir` will be overridden.

   * - `ebs_volume_iops`
     - Integer
     - The IOPS to use for the disk when using gp3 volumes. If this is not set, the maximum performance of a gp2 volume with the same volume size will be used.

   * - `ebs_volume_size`
     - Integer
     - The size of each EBS volume (in GiB) launched for each instance. For general purpose SSD, this value must be within the range 100 - 4096. For throughput optimized HDD, this value must be within the range 500 - 4096.

   * - `ebs_volume_throughput`
     - Integer
     - The throughput to use for the disk when using gp3 volumes. If this is not set, the maximum performance of a gp2 volume with the same volume size will be used.

   * - `ebs_volume_type`
     - String
     - The type of EBS volumes that will be launched with this cluster.

   * - `first_on_demand`
     - Integer
     - The first `first_on_demand` nodes of the cluster will be placed on on-demand instances. If this value is greater than 0, the cluster driver node in particular will be placed on an on-demand instance. If this value is greater than or equal to the current cluster size, all nodes will be placed on on-demand instances. If this value is less than the current cluster size, `first_on_demand` nodes will be placed on on-demand instances and the remainder will be placed on `availability` instances. Note that this value does not affect cluster size and cannot currently be mutated over the lifetime of a cluster.

   * - `instance_profile_arn`
     - String
     - Nodes for this cluster will only be placed on AWS instances with this instance profile. If omitted, nodes will be placed on instances without an IAM instance profile. The instance profile must have previously been added to the Databricks environment by an account administrator. This feature may only be available to certain customer plans. If this field is omitted, we will pull in the default from the conf if it exists.

   * - `spot_bid_price_percent`
     - Integer
     - The bid price for AWS spot instances, as a percentage of the corresponding instance type's on-demand price. For example, if this field is set to 50, and the cluster needs a new `r3.xlarge` spot instance, then the bid price is half of the price of on-demand `r3.xlarge` instances. Similarly, if this field is set to 200, the bid price is twice the price of on-demand `r3.xlarge` instances. If not specified, the default value is 100. When spot instances are requested for this cluster, only spot instances whose bid price percentage matches this field will be considered. Note that, for safety, we enforce this field to be no more than 10000. The default value and documentation here should be kept consistent with CommonConf.defaultSpotBidPricePercent and CommonConf.maxSpotBidPricePercent.

   * - `zone_id`
     - String
     - Identifier for the availability zone/datacenter in which the cluster resides. This string will be of a form like "us-west-2a". The provided availability zone must be in the same region as the Databricks deployment. For example, "us-west-2a" is not a valid zone id if the Databricks deployment resides in the "us-east-1" region. This is an optional field at cluster creation, and if not specified, a default zone will be used. If the zone specified is "auto", Databricks will try to place the cluster in a zone with high availability, and will retry placement in a different AZ if there is not enough capacity. The list of available zones as well as the default value can be found by using the `List Zones` method.

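The sketch below combines several of these attributes: the driver is kept on an on-demand instance while the remaining nodes use spot capacity. The resource key is hypothetical, and the enum-style values `SPOT_WITH_FALLBACK` and `GENERAL_PURPOSE_SSD` are assumed from the Clusters API rather than stated in this article, so verify them against the API reference.

```yaml
# Illustrative only: enum values are assumed from the Clusters API reference.
resources:
  clusters:
    spot_cluster:
      spark_version: "13.3.x-scala2.12"
      node_type_id: "i3.xlarge"
      num_workers: 4
      aws_attributes:
        first_on_demand: 1                # greater than 0, so the driver is on-demand
        availability: SPOT_WITH_FALLBACK  # applies to nodes past first_on_demand
        spot_bid_price_percent: 100       # the documented default
        zone_id: auto                     # let Databricks pick a zone with capacity
        ebs_volume_type: GENERAL_PURPOSE_SSD
        ebs_volume_count: 1
        ebs_volume_size: 100              # GiB; general purpose SSD range is 100 - 4096
```
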
### clusters.<name>.azure_attributes

**`Type: Map`**

Attributes related to clusters running on Microsoft Azure. If not specified at cluster creation, a set of default values will be used.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `availability`
     - String
     - Availability type used for all subsequent nodes past the `first_on_demand` ones. Note: If `first_on_demand` is zero (which only happens on pool clusters), this availability type will be used for the entire cluster.

   * - `first_on_demand`
     - Integer
     - The first `first_on_demand` nodes of the cluster will be placed on on-demand instances. This value should be greater than 0, to make sure the cluster driver node is placed on an on-demand instance. If this value is greater than or equal to the current cluster size, all nodes will be placed on on-demand instances. If this value is less than the current cluster size, `first_on_demand` nodes will be placed on on-demand instances and the remainder will be placed on `availability` instances. Note that this value does not affect cluster size and cannot currently be mutated over the lifetime of a cluster.

   * - `log_analytics_info`
     - Map
     - Defines values necessary to configure and run Azure Log Analytics agent. See [_](#clusters.<name>.azure_attributes.log_analytics_info).

   * - `spot_bid_max_price`
     - Any
     - The max bid price to be used for Azure spot instances. The max price for the bid cannot be higher than the on-demand price of the instance. If not specified, the default value is -1, which specifies that the instance cannot be evicted on the basis of price, and only on the basis of availability. The value must be either greater than 0 or -1.

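A comparable sketch for Azure follows. The resource key and the node type `Standard_DS3_v2` are placeholders, and `SPOT_WITH_FALLBACK_AZURE` is assumed from the Clusters API enum values, so treat this as an outline rather than a verified configuration.

```yaml
# Illustrative only: node type and availability value are assumptions.
resources:
  clusters:
    azure_spot_cluster:
      spark_version: "13.3.x-scala2.12"
      node_type_id: "Standard_DS3_v2"
      num_workers: 4
      azure_attributes:
        first_on_demand: 1                       # keeps the driver on an on-demand instance
        availability: SPOT_WITH_FALLBACK_AZURE
        spot_bid_max_price: -1                   # default: never evicted on the basis of price
```
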
### clusters.<name>.azure_attributes.log_analytics_info

**`Type: Map`**

Defines values necessary to configure and run Azure Log Analytics agent.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `log_analytics_primary_key`
     - String
     - <needs content added>

   * - `log_analytics_workspace_id`
     - String
     - <needs content added>

### clusters.<name>.cluster_log_conf

**`Type: Map`**

The configuration for delivering Spark logs to a long-term storage destination. Two kinds of destinations (DBFS and S3) are supported. Only one destination can be specified for one cluster. If the conf is given, the logs will be delivered to the destination every 5 minutes. The destination of driver logs is `$destination/$clusterId/driver`, while the destination of executor logs is `$destination/$clusterId/executor`.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `dbfs`
     - Map
     - Destination needs to be provided, e.g. `{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`. See [_](#clusters.<name>.cluster_log_conf.dbfs).

   * - `s3`
     - Map
     - Destination and either the region or endpoint need to be provided, e.g. `{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`. The cluster IAM role is used to access S3. Make sure the cluster IAM role in `instance_profile_arn` has permission to write data to the S3 destination. See [_](#clusters.<name>.cluster_log_conf.s3).

### clusters.<name>.cluster_log_conf.dbfs

**`Type: Map`**

Destination needs to be provided, e.g. `{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `destination`
     - String
     - DBFS destination, e.g. `dbfs:/my/path`

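Expressed in bundle YAML, the JSON example for a DBFS destination might look like the sketch below. The resource key `logging_cluster` is hypothetical; the destination path is the one used in the description above.

```yaml
# Illustrative only: `logging_cluster` is a placeholder resource key.
resources:
  clusters:
    logging_cluster:
      spark_version: "13.3.x-scala2.12"
      node_type_id: "i3.xlarge"
      num_workers: 1
      cluster_log_conf:
        dbfs:
          destination: dbfs:/home/cluster_log
```
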
### clusters.<name>.cluster_log_conf.s3
|
||
|
||
**`Type: Map`**
|
||
|
||
destination and either the region or endpoint need to be provided. e.g.
|
||
`{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`
|
||
Cluster iam role is used to access s3, please make sure the cluster iam role in
|
||
`instance_profile_arn` has permission to write data to the s3 destination.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `canned_acl`
|
||
- String
|
||
- (Optional) Set canned access control list for the logs, e.g. `bucket-owner-full-control`. If `canned_cal` is set, please make sure the cluster iam role has `s3:PutObjectAcl` permission on the destination bucket and prefix. The full list of possible canned acl can be found at http://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl. Please also note that by default only the object owner gets full controls. If you are using cross account role for writing data, you may want to set `bucket-owner-full-control` to make bucket owner able to read the logs.
|
||
|
||
* - `destination`
|
||
- String
|
||
- S3 destination, e.g. `s3://my-bucket/some-prefix` Note that logs will be delivered using cluster iam role, please make sure you set cluster iam role and the role has write access to the destination. Please also note that you cannot use AWS keys to deliver logs.
|
||
|
||
* - `enable_encryption`
|
||
- Boolean
|
||
- (Optional) Flag to enable server side encryption, `false` by default.
|
||
|
||
* - `encryption_type`
|
||
- String
|
||
- (Optional) The encryption type, it could be `sse-s3` or `sse-kms`. It will be used only when encryption is enabled and the default type is `sse-s3`.
|
||
|
||
* - `endpoint`
|
||
- String
|
||
- S3 endpoint, e.g. `https://s3-us-west-2.amazonaws.com`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
|
||
|
||
* - `kms_key`
|
||
- String
|
||
- (Optional) The KMS key to use if encryption is enabled and the encryption type is set to `sse-kms`.
|
||
|
||
* - `region`
|
||
- String
|
||
- S3 region, e.g. `us-west-2`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
|
||
|
||
|
||
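As a rough illustration of how these keys fit together, the following sketch delivers cluster logs to S3 with KMS encryption. The resource key `my_cluster` and the `<kms-key-id>` placeholder are hypothetical, the bucket and region are the sample values used above, and all other cluster settings are omitted.

```yaml
resources:
  clusters:
    my_cluster: # hypothetical resource key
      cluster_log_conf:
        s3:
          destination: s3://cluster_log_bucket/prefix
          region: us-west-2 # alternatively, set endpoint
          enable_encryption: true
          encryption_type: sse-kms
          kms_key: <kms-key-id> # placeholder, replace with your own key
          canned_acl: bucket-owner-full-control
```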
### clusters.<name>.docker_image
|
||
|
||
**`Type: Map`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `basic_auth`
|
||
- Map
|
||
- See [_](#clusters.<name>.docker_image.basic_auth).
|
||
|
||
* - `url`
|
||
- String
|
||
- URL of the docker image.
|
||
|
||
|
||
### clusters.<name>.docker_image.basic_auth
|
||
|
||
**`Type: Map`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `password`
|
||
- String
|
||
- Password of the user
|
||
|
||
* - `username`
|
||
- String
|
||
- Name of the user
|
||
|
||
|
||
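A minimal sketch of a Docker image configuration with basic authentication might look like the following. The resource key, the image URL placeholder, and the two bundle variables (`docker_username`, `docker_password`) are assumptions for illustration, and all other cluster settings are omitted.

```yaml
resources:
  clusters:
    my_cluster: # hypothetical resource key
      docker_image:
        url: <registry>/<image>:<tag> # placeholder image URL
        basic_auth:
          # Hypothetical bundle variables, defined elsewhere in the bundle.
          username: ${var.docker_username}
          password: ${var.docker_password}
```

Referencing bundle variables is one way to avoid hard-coding credentials in the configuration file.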
### clusters.<name>.gcp_attributes
|
||
|
||
**`Type: Map`**
|
||
|
||
Attributes related to clusters running on Google Cloud Platform.
|
||
If not specified at cluster creation, a set of default values will be used.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `availability`
|
||
- String
|
||
- This field determines whether the instance pool will contain preemptible VMs, on-demand VMs, or preemptible VMs with a fallback to on-demand VMs if the former is unavailable.
|
||
|
||
* - `boot_disk_size`
|
||
- Integer
|
||
- Boot disk size in GB.
|
||
|
||
* - `google_service_account`
|
||
- String
|
||
- If provided, the cluster impersonates the Google service account when accessing Google Cloud services (such as GCS). The Google service account must have previously been added to the Databricks environment by an account administrator.
|
||
|
||
* - `local_ssd_count`
|
||
- Integer
|
||
- If provided, each node (workers and driver) in the cluster will have this number of local SSDs attached. Each local SSD is 375GB in size. Refer to [GCP documentation](https://cloud.google.com/compute/docs/disks/local-ssd#choose_number_local_ssds) for the supported number of local SSDs for each instance type.
|
||
|
||
* - `use_preemptible_executors`
|
||
- Boolean
|
||
- This field determines whether the spark executors will be scheduled to run on preemptible VMs (when set to true) versus standard compute engine VMs (when set to false; default). Note: Soon to be deprecated, use the availability field instead.
|
||
|
||
* - `zone_id`
|
||
- String
|
||
- Identifier for the availability zone in which the cluster resides. This can be one of the following: `HA` (high availability: spread nodes across availability zones for a Databricks deployment region; this is the default), `AUTO` (Databricks picks an availability zone to schedule the cluster on), or a GCP availability zone (pick one of the available zones for the machine type and region from https://cloud.google.com/compute/docs/regions-zones).
|
||
|
||
|
||
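For orientation, a `gcp_attributes` mapping could be sketched as follows. The resource key and the numeric values are illustrative assumptions. The `availability` value shown is an enum value from the Clusters API and is not listed in the table above, so verify it against the API reference before relying on it.

```yaml
resources:
  clusters:
    my_cluster: # hypothetical resource key
      gcp_attributes:
        # Assumed enum value: preemptible VMs with fallback to on-demand VMs.
        availability: PREEMPTIBLE_WITH_FALLBACK_GCP
        zone_id: AUTO # let Databricks pick the availability zone
        boot_disk_size: 100 # GB, illustrative value
        local_ssd_count: 1 # each local SSD is 375GB
```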
### clusters.<name>.init_scripts
|
||
|
||
**`Type: Sequence`**
|
||
|
||
The configuration for storing init scripts. Any number of destinations can be specified. The scripts are executed sequentially in the order provided. If `cluster_log_conf` is specified, init script logs are sent to `<destination>/<cluster-ID>/init_scripts`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `abfss`
|
||
- Map
|
||
- `destination` must be provided, e.g. `{ "abfss" : { "destination" : "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>" } }`. See [_](#clusters.<name>.init_scripts.abfss).
|
||
|
||
* - `dbfs`
|
||
- Map
|
||
- destination needs to be provided. e.g. `{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`. See [_](#clusters.<name>.init_scripts.dbfs).
|
||
|
||
* - `file`
|
||
- Map
|
||
- destination needs to be provided. e.g. `{ "file" : { "destination" : "file:/my/local/file.sh" } }`. See [_](#clusters.<name>.init_scripts.file).
|
||
|
||
* - `gcs`
|
||
- Map
|
||
- destination needs to be provided. e.g. `{ "gcs": { "destination": "gs://my-bucket/file.sh" } }`. See [_](#clusters.<name>.init_scripts.gcs).
|
||
|
||
* - `s3`
|
||
- Map
|
||
- `destination` and either `region` or `endpoint` must be provided, e.g. `{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`. The cluster IAM role is used to access S3. Make sure that the cluster IAM role in `instance_profile_arn` has permission to write data to the S3 destination. See [_](#clusters.<name>.init_scripts.s3).
|
||
|
||
* - `volumes`
|
||
- Map
|
||
- destination needs to be provided. e.g. `{ "volumes" : { "destination" : "/Volumes/my-init.sh" } }`. See [_](#clusters.<name>.init_scripts.volumes).
|
||
|
||
* - `workspace`
|
||
- Map
|
||
- destination needs to be provided. e.g. `{ "workspace" : { "destination" : "/Users/user1@databricks.com/my-init.sh" } }`. See [_](#clusters.<name>.init_scripts.workspace).
|
||
|
||
|
||
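Because `init_scripts` is a sequence, each entry holds exactly one destination type. The sketch below lists two scripts that would run in the order given. The resource key is hypothetical and the paths are the sample values from the table above.

```yaml
resources:
  clusters:
    my_cluster: # hypothetical resource key
      init_scripts:
        # Runs first: a script stored as a workspace file.
        - workspace:
            destination: /Users/user1@databricks.com/my-init.sh
        # Runs second: a script stored in a Unity Catalog volume.
        - volumes:
            destination: /Volumes/my-init.sh
```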
### clusters.<name>.init_scripts.abfss
|
||
|
||
**`Type: Map`**
|
||
|
||
destination needs to be provided. e.g.
|
||
`{ "abfss" : { "destination" : "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- abfss destination, e.g. `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>`.
|
||
|
||
|
||
### clusters.<name>.init_scripts.dbfs
|
||
|
||
**`Type: Map`**
|
||
|
||
destination needs to be provided. e.g.
|
||
`{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- DBFS destination, e.g. `dbfs:/my/path`.
|
||
|
||
|
||
### clusters.<name>.init_scripts.file
|
||
|
||
**`Type: Map`**
|
||
|
||
destination needs to be provided. e.g.
|
||
`{ "file" : { "destination" : "file:/my/local/file.sh" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- local file destination, e.g. `file:/my/local/file.sh`
|
||
|
||
|
||
### clusters.<name>.init_scripts.gcs
|
||
|
||
**`Type: Map`**
|
||
|
||
destination needs to be provided. e.g.
|
||
`{ "gcs": { "destination": "gs://my-bucket/file.sh" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- GCS destination/URI, e.g. `gs://my-bucket/some-prefix`
|
||
|
||
|
||
### clusters.<name>.init_scripts.s3
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` and either `region` or `endpoint` must be provided. For example:
|
||
`{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`
|
||
The cluster IAM role is used to access S3. Make sure that the cluster IAM role in `instance_profile_arn` has permission to write data to the S3 destination.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `canned_acl`
|
||
- String
|
||
- (Optional) The canned access control list (ACL) to set on the logs, e.g. `bucket-owner-full-control`. If `canned_acl` is set, make sure that the cluster IAM role has the `s3:PutObjectAcl` permission on the destination bucket and prefix. For the full list of canned ACLs, see http://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl. By default, only the object owner gets full control. If you use a cross-account role to write data, you may want to set `bucket-owner-full-control` so that the bucket owner can read the logs.
|
||
|
||
* - `destination`
|
||
- String
|
||
- S3 destination, e.g. `s3://my-bucket/some-prefix`. Logs are delivered using the cluster IAM role, so make sure that the cluster IAM role is set and has write access to the destination. You cannot use AWS keys to deliver logs.
|
||
|
||
* - `enable_encryption`
|
||
- Boolean
|
||
- (Optional) Flag to enable server side encryption, `false` by default.
|
||
|
||
* - `encryption_type`
|
||
- String
|
||
- (Optional) The encryption type, either `sse-s3` or `sse-kms`. Used only when encryption is enabled. The default is `sse-s3`.
|
||
|
||
* - `endpoint`
|
||
- String
|
||
- S3 endpoint, e.g. `https://s3-us-west-2.amazonaws.com`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
|
||
|
||
* - `kms_key`
|
||
- String
|
||
- (Optional) The KMS key to use if encryption is enabled and the encryption type is set to `sse-kms`.
|
||
|
||
* - `region`
|
||
- String
|
||
- S3 region, e.g. `us-west-2`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
|
||
|
||
|
||
### clusters.<name>.init_scripts.volumes
|
||
|
||
**`Type: Map`**
|
||
|
||
destination needs to be provided. e.g.
|
||
`{ "volumes" : { "destination" : "/Volumes/my-init.sh" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- Unity Catalog Volumes file destination, e.g. `/Volumes/my-init.sh`
|
||
|
||
|
||
### clusters.<name>.init_scripts.workspace
|
||
|
||
**`Type: Map`**
|
||
|
||
destination needs to be provided. e.g.
|
||
`{ "workspace" : { "destination" : "/Users/user1@databricks.com/my-init.sh" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- workspace files destination, e.g. `/Users/user1@databricks.com/my-init.sh`
|
||
|
||
|
||
### clusters.<name>.permissions
|
||
|
||
**`Type: Sequence`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `group_name`
|
||
- String
|
||
- The name of the group that has the permission set in level.
|
||
|
||
* - `level`
|
||
- String
|
||
- The allowed permission for user, group, service principal defined for this permission.
|
||
|
||
* - `service_principal_name`
|
||
- String
|
||
- The name of the service principal that has the permission set in level.
|
||
|
||
* - `user_name`
|
||
- String
|
||
- The name of the user that has the permission set in level.
|
||
|
||
|
||
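Each entry in the `permissions` sequence pairs one `level` with one principal. As a hedged sketch, the following grants a group and a user different levels on a cluster. The resource key and the user name are hypothetical, and the level names are taken from the cluster permissions API rather than from the table above.

```yaml
resources:
  clusters:
    my_cluster: # hypothetical resource key
      permissions:
        # Assumed cluster permission levels; check the permissions API for the full list.
        - level: CAN_RESTART
          group_name: users
        - level: CAN_MANAGE
          user_name: someone@example.com # hypothetical user
```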
### clusters.<name>.workload_type
|
||
|
||
**`Type: Map`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `clients`
|
||
- Map
|
||
- Defines what type of clients can use the cluster, e.g. notebooks or jobs. See [_](#clusters.<name>.workload_type.clients).
|
||
|
||
|
||
### clusters.<name>.workload_type.clients
|
||
|
||
**`Type: Map`**
|
||
|
||
Defines what type of clients can use the cluster, e.g. notebooks or jobs.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `jobs`
|
||
- Boolean
|
||
- With `jobs` set, the cluster can be used for jobs.
|
||
|
||
* - `notebooks`
|
||
- Boolean
|
||
- With `notebooks` set, the cluster can be used for notebooks.
|
||
|
||
|
||
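To make the nesting concrete, the following sketch restricts a cluster to job workloads only. The resource key is hypothetical and all other cluster settings are omitted.

```yaml
resources:
  clusters:
    my_cluster: # hypothetical resource key
      workload_type:
        clients:
          jobs: true # the cluster can be used for jobs
          notebooks: false # the cluster cannot be used for notebooks
```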
## dashboards
|
||
|
||
**`Type: Map`**
|
||
|
||
The dashboard resource allows you to manage [AI/BI dashboards](/api/workspace/lakeview/create) in a bundle. For information about AI/BI dashboards, see [_](/dashboards/index.md).
|
||
|
||
```yaml
dashboards:
  <dashboard-name>:
    <dashboard-field-name>: <dashboard-field-value>
```
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `create_time`
|
||
- String
|
||
- The timestamp of when the dashboard was created.
|
||
|
||
* - `dashboard_id`
|
||
- String
|
||
- UUID identifying the dashboard.
|
||
|
||
* - `display_name`
|
||
- String
|
||
- The display name of the dashboard.
|
||
|
||
* - `embed_credentials`
|
||
- Boolean
|
||
-
|
||
|
||
* - `etag`
|
||
- String
|
||
- The etag for the dashboard. Can be optionally provided on updates to ensure that the dashboard has not been modified since the last read. This field is excluded in List Dashboards responses.
|
||
|
||
* - `file_path`
|
||
- String
|
||
-
|
||
|
||
* - `lifecycle_state`
|
||
- String
|
||
- The state of the dashboard resource. Used for tracking trashed status.
|
||
|
||
* - `parent_path`
|
||
- String
|
||
- The workspace path of the folder containing the dashboard. Includes leading slash and no trailing slash. This field is excluded in List Dashboards responses.
|
||
|
||
* - `path`
|
||
- String
|
||
- The workspace path of the dashboard asset, including the file name. Exported dashboards always have the file extension `.lvdash.json`. This field is excluded in List Dashboards responses.
|
||
|
||
* - `permissions`
|
||
- Sequence
|
||
- See [_](#dashboards.<name>.permissions).
|
||
|
||
* - `serialized_dashboard`
|
||
- Any
|
||
- The contents of the dashboard in serialized string form. This field is excluded in List Dashboards responses. Use the [get dashboard API](https://docs.databricks.com/api/workspace/lakeview/get) to retrieve an example response, which includes the `serialized_dashboard` field. This field provides the structure of the JSON string that represents the dashboard's layout and components.
|
||
|
||
* - `update_time`
|
||
- String
|
||
- The timestamp of when the dashboard was last updated by the user. This field is excluded in List Dashboards responses.
|
||
|
||
* - `warehouse_id`
|
||
- String
|
||
- The warehouse ID used to run the dashboard.
|
||
|
||
|
||
**Example**
|
||
|
||
The following example includes and deploys the sample __NYC Taxi Trip Analysis__ dashboard to the Databricks workspace.
|
||
|
||
```yaml
resources:
  dashboards:
    nyc_taxi_trip_analysis:
      display_name: "NYC Taxi Trip Analysis"
      file_path: ../src/nyc_taxi_trip_analysis.lvdash.json
      warehouse_id: ${var.warehouse_id}
```
|
||
If you use the UI to modify the dashboard, modifications made through the UI are not applied to the dashboard JSON file in the local bundle unless you explicitly update it using `bundle generate`. You can use the `--watch` option to continuously poll and retrieve changes to the dashboard. See [_](/dev-tools/cli/bundle-commands.md#generate).
|
||
|
||
In addition, if you attempt to deploy a bundle that contains a dashboard JSON file that is different than the one in the remote workspace, an error will occur. To force the deploy and overwrite the dashboard in the remote workspace with the local one, use the `--force` option. See [_](/dev-tools/cli/bundle-commands.md#deploy).
|
||
|
||
### dashboards.<name>.permissions
|
||
|
||
**`Type: Sequence`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `group_name`
|
||
- String
|
||
- The name of the group that has the permission set in level.
|
||
|
||
* - `level`
|
||
- String
|
||
- The allowed permission for user, group, service principal defined for this permission.
|
||
|
||
* - `service_principal_name`
|
||
- String
|
||
- The name of the service principal that has the permission set in level.
|
||
|
||
* - `user_name`
|
||
- String
|
||
- The name of the user that has the permission set in level.
|
||
|
||
|
||
## experiments
|
||
|
||
**`Type: Map`**
|
||
|
||
The experiment resource allows you to define [MLflow experiments](/api/workspace/experiments/createexperiment) in a bundle. For information about MLflow experiments, see [_](/mlflow/experiments.md).
|
||
|
||
```yaml
experiments:
  <experiment-name>:
    <experiment-field-name>: <experiment-field-value>
```
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `artifact_location`
|
||
- String
|
||
- Location where artifacts for the experiment are stored.
|
||
|
||
* - `creation_time`
|
||
- Integer
|
||
- Creation time
|
||
|
||
* - `experiment_id`
|
||
- String
|
||
- Unique identifier for the experiment.
|
||
|
||
* - `last_update_time`
|
||
- Integer
|
||
- Last update time
|
||
|
||
* - `lifecycle_stage`
|
||
- String
|
||
- Current life cycle stage of the experiment: "active" or "deleted". Deleted experiments are not returned by APIs.
|
||
|
||
* - `name`
|
||
- String
|
||
- Human readable name that identifies the experiment.
|
||
|
||
* - `permissions`
|
||
- Sequence
|
||
- See [_](#experiments.<name>.permissions).
|
||
|
||
* - `tags`
|
||
- Sequence
|
||
- Tags: Additional metadata key-value pairs. See [_](#experiments.<name>.tags).
|
||
|
||
|
||
**Example**
|
||
|
||
The following example defines an experiment that all users can view:
|
||
|
||
```yaml
resources:
  experiments:
    experiment:
      name: my_ml_experiment
      permissions:
        - level: CAN_READ
          group_name: users
      description: MLflow experiment used to track runs
```
|
||
|
||
### experiments.<name>.permissions
|
||
|
||
**`Type: Sequence`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `group_name`
|
||
- String
|
||
- The name of the group that has the permission set in level.
|
||
|
||
* - `level`
|
||
- String
|
||
- The allowed permission for user, group, service principal defined for this permission.
|
||
|
||
* - `service_principal_name`
|
||
- String
|
||
- The name of the service principal that has the permission set in level.
|
||
|
||
* - `user_name`
|
||
- String
|
||
- The name of the user that has the permission set in level.
|
||
|
||
|
||
### experiments.<name>.tags
|
||
|
||
**`Type: Sequence`**
|
||
|
||
Tags: Additional metadata key-value pairs.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `key`
|
||
- String
|
||
- The tag key.
|
||
|
||
* - `value`
|
||
- String
|
||
- The tag value.
|
||
|
||
|
||
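Unlike job tags, which are a map, experiment tags are a sequence of `key`/`value` entries. A minimal sketch, reusing the experiment from the example above with hypothetical tag names and values:

```yaml
resources:
  experiments:
    experiment:
      name: my_ml_experiment
      tags:
        # Hypothetical tags for illustration.
        - key: team
          value: data-science
        - key: stage
          value: dev
```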
## jobs
|
||
|
||
**`Type: Map`**
|
||
|
||
The job resource allows you to define [jobs and their corresponding tasks](/api/workspace/jobs/create) in your bundle. For information about jobs, see [_](/jobs/index.md). For a tutorial that uses a <DABS> template to create a job, see [_](/dev-tools/bundles/jobs-tutorial.md).
|
||
|
||
```yaml
jobs:
  <job-name>:
    <job-field-name>: <job-field-value>
```
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `budget_policy_id`
|
||
- String
|
||
- The id of the user specified budget policy to use for this job. If not specified, a default budget policy may be applied when creating or modifying the job. See `effective_budget_policy_id` for the budget policy used by this workload.
|
||
|
||
* - `continuous`
|
||
- Map
|
||
- An optional continuous property for this job. The continuous property will ensure that there is always one run executing. Only one of `schedule` and `continuous` can be used. See [_](#jobs.<name>.continuous).
|
||
|
||
* - `deployment`
|
||
- Map
|
||
- Deployment information for jobs managed by external sources. See [_](#jobs.<name>.deployment).
|
||
|
||
* - `description`
|
||
- String
|
||
- An optional description for the job. The maximum length is 27700 characters in UTF-8 encoding.
|
||
|
||
* - `edit_mode`
|
||
- String
|
||
- Edit mode of the job. `UI_LOCKED`: the job is in a locked UI state and cannot be modified. `EDITABLE`: the job is in an editable state and can be modified.
|
||
|
||
* - `email_notifications`
|
||
- Map
|
||
- An optional set of email addresses that is notified when runs of this job begin or complete as well as when this job is deleted. See [_](#jobs.<name>.email_notifications).
|
||
|
||
* - `environments`
|
||
- Sequence
|
||
- A list of task execution environment specifications that can be referenced by serverless tasks of this job. An environment is required to be present for serverless tasks. For serverless notebook tasks, the environment is accessible in the notebook environment panel. For other serverless tasks, the task environment is required to be specified using environment_key in the task settings. See [_](#jobs.<name>.environments).
|
||
|
||
* - `format`
|
||
- String
|
||
- Indicates the format of the job. This field is ignored in Create/Update/Reset calls. When using the Jobs API 2.1, this value is always set to `"MULTI_TASK"`.
|
||
|
||
* - `git_source`
|
||
- Map
|
||
- An optional specification for a remote Git repository containing the source code used by tasks. Version-controlled source code is supported by notebook, dbt, Python script, and SQL File tasks. If `git_source` is set, these tasks retrieve the file from the remote repository by default. However, this behavior can be overridden by setting `source` to `WORKSPACE` on the task. Note: dbt and SQL File tasks support only version-controlled sources. If dbt or SQL File tasks are used, `git_source` must be defined on the job. See [_](#jobs.<name>.git_source).
|
||
|
||
* - `health`
|
||
- Map
|
||
- An optional set of health rules that can be defined for this job. See [_](#jobs.<name>.health).
|
||
|
||
* - `job_clusters`
|
||
- Sequence
|
||
- A list of job cluster specifications that can be shared and reused by tasks of this job. Libraries cannot be declared in a shared job cluster. You must declare dependent libraries in task settings. See [_](#jobs.<name>.job_clusters).
|
||
|
||
* - `max_concurrent_runs`
|
||
- Integer
|
||
- An optional maximum allowed number of concurrent runs of the job. Set this value if you want to be able to execute multiple runs of the same job concurrently. This is useful for example if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap with each other, or if you want to trigger multiple runs which differ by their input parameters. This setting affects only new runs. For example, suppose the job’s concurrency is 4 and there are 4 concurrent active runs. Then setting the concurrency to 3 won’t kill any of the active runs. However, from then on, new runs are skipped unless there are fewer than 3 active runs. This value cannot exceed 1000. Setting this value to `0` causes all new runs to be skipped.
|
||
|
||
* - `name`
|
||
- String
|
||
- An optional name for the job. The maximum length is 4096 bytes in UTF-8 encoding.
|
||
|
||
* - `notification_settings`
|
||
- Map
|
||
- Optional notification settings that are used when sending notifications to each of the `email_notifications` and `webhook_notifications` for this job. See [_](#jobs.<name>.notification_settings).
|
||
|
||
* - `parameters`
|
||
- Sequence
|
||
- Job-level parameter definitions. See [_](#jobs.<name>.parameters).
|
||
|
||
* - `permissions`
|
||
- Sequence
|
||
- See [_](#jobs.<name>.permissions).
|
||
|
||
* - `queue`
|
||
- Map
|
||
- The queue settings of the job. See [_](#jobs.<name>.queue).
|
||
|
||
* - `run_as`
|
||
- Map
|
||
- Write-only setting. Specifies the user or service principal that the job runs as. If not specified, the job runs as the user who created the job. Either `user_name` or `service_principal_name` should be specified. If not, an error is thrown. See [_](#jobs.<name>.run_as).
|
||
|
||
* - `schedule`
|
||
- Map
|
||
- An optional periodic schedule for this job. The default behavior is that the job only runs when triggered by clicking “Run Now” in the Jobs UI or sending an API request to `runNow`. See [_](#jobs.<name>.schedule).
|
||
|
||
* - `tags`
|
||
- Map
|
||
- A map of tags associated with the job. These are forwarded to the cluster as cluster tags for jobs clusters, and are subject to the same limitations as cluster tags. A maximum of 25 tags can be added to the job.
|
||
|
||
* - `tasks`
|
||
- Sequence
|
||
- A list of task specifications to be executed by this job. See [_](#jobs.<name>.tasks).
|
||
|
||
* - `timeout_seconds`
|
||
- Integer
|
||
- An optional timeout applied to each run of this job. A value of `0` means no timeout.
|
||
|
||
* - `trigger`
|
||
- Map
|
||
- A configuration to trigger a run when certain conditions are met. The default behavior is that the job runs only when triggered by clicking “Run Now” in the Jobs UI or sending an API request to `runNow`. See [_](#jobs.<name>.trigger).
|
||
|
||
* - `webhook_notifications`
|
||
- Map
|
||
- A collection of system notification IDs to notify when runs of this job begin or complete. See [_](#jobs.<name>.webhook_notifications).
|
||
|
||
|
||
**Example**
|
||
|
||
The following example defines a job with the resource key `hello-job` with one notebook task:
|
||
|
||
```yaml
resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
        - task_key: hello-task
          notebook_task:
            notebook_path: ./hello.py
```
|
||
|
||
For information about defining job tasks and overriding job settings, see [_](/dev-tools/bundles/job-task-types.md), [_](/dev-tools/bundles/job-task-override.md), and [_](/dev-tools/bundles/cluster-override.md).
|
||
|
||
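Several of the scalar settings from the table above can be combined on the same job. The following sketch extends the `hello-job` example. The specific values and the tag name are illustrative assumptions only.

```yaml
resources:
  jobs:
    hello-job:
      name: hello-job
      description: Runs the hello notebook
      max_concurrent_runs: 1 # new runs are skipped while one run is active
      timeout_seconds: 3600 # 0 would mean no timeout
      tags:
        team: data-eng # hypothetical tag, forwarded to job clusters as a cluster tag
      tasks:
        - task_key: hello-task
          notebook_task:
            notebook_path: ./hello.py
```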
### jobs.<name>.continuous
|
||
|
||
**`Type: Map`**
|
||
|
||
An optional continuous property for this job. The continuous property will ensure that there is always one run executing. Only one of `schedule` and `continuous` can be used.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `pause_status`
|
||
- String
|
||
- Indicate whether the continuous execution of the job is paused or not. Defaults to UNPAUSED.
|
||
|
||
|
||
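As a small sketch, a continuous job that is deployed in a paused state could be declared as follows. The resource key is hypothetical, and `schedule` must not be set on the same job.

```yaml
resources:
  jobs:
    my_continuous_job: # hypothetical resource key
      name: my_continuous_job
      continuous:
        pause_status: PAUSED # omit or set to UNPAUSED to keep one run always executing
```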
### jobs.<name>.deployment
|
||
|
||
**`Type: Map`**
|
||
|
||
Deployment information for jobs managed by external sources.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `kind`
|
||
- String
|
||
- The kind of deployment that manages the job. `BUNDLE`: the job is managed by a Databricks Asset Bundle.
|
||
|
||
* - `metadata_file_path`
|
||
- String
|
||
- Path of the file that contains deployment metadata.
|
||
|
||
|
||
### jobs.<name>.email_notifications
|
||
|
||
**`Type: Map`**
|
||
|
||
An optional set of email addresses that is notified when runs of this job begin or complete as well as when this job is deleted.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `no_alert_for_skipped_runs`
|
||
- Boolean
|
||
- If true, do not send email to recipients specified in `on_failure` if the run is skipped. This field is deprecated. Use the `notification_settings.no_alert_for_skipped_runs` field instead.
|
||
|
||
* - `on_duration_warning_threshold_exceeded`
|
||
- Sequence
|
||
- A list of email addresses to be notified when the duration of a run exceeds the threshold specified for the `RUN_DURATION_SECONDS` metric in the `health` field. If no rule for the `RUN_DURATION_SECONDS` metric is specified in the `health` field for the job, notifications are not sent.
|
||
|
||
* - `on_failure`
|
||
- Sequence
|
||
- A list of email addresses to be notified when a run completes unsuccessfully. A run is considered to have completed unsuccessfully if it ends with an `INTERNAL_ERROR` `life_cycle_state` or a `FAILED` or `TIMED_OUT` `result_state`. If this is not specified on job creation, reset, or update, the list is empty, and notifications are not sent.
|
||
|
||
* - `on_start`
|
||
- Sequence
|
||
- A list of email addresses to be notified when a run begins. If not specified on job creation, reset, or update, the list is empty, and notifications are not sent.
|
||
|
||
* - `on_streaming_backlog_exceeded`
|
||
- Sequence
|
||
- A list of email addresses to notify when any streaming backlog thresholds are exceeded for any stream. Streaming backlog thresholds can be set in the `health` field using the following metrics: `STREAMING_BACKLOG_BYTES`, `STREAMING_BACKLOG_RECORDS`, `STREAMING_BACKLOG_SECONDS`, or `STREAMING_BACKLOG_FILES`. Alerting is based on the 10-minute average of these metrics. If the issue persists, notifications are resent every 30 minutes.
|
||
|
||
* - `on_success`
|
||
- Sequence
|
||
- A list of email addresses to be notified when a run completes successfully. A run is considered to have completed successfully if it ends with a `TERMINATED` `life_cycle_state` and a `SUCCESS` `result_state`. If not specified on job creation, reset, or update, the list is empty, and notifications are not sent.
|
||
|
||
|
||
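Each notification key takes a sequence of email addresses. A minimal sketch, with a hypothetical resource key and placeholder addresses:

```yaml
resources:
  jobs:
    my_job: # hypothetical resource key
      name: my_job
      email_notifications:
        on_start:
          - oncall@example.com # placeholder address
        on_success:
          - team@example.com
        on_failure:
          - oncall@example.com
          - team@example.com
```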
### jobs.<name>.environments
|
||
|
||
**`Type: Sequence`**
|
||
|
||
A list of task execution environment specifications that can be referenced by serverless tasks of this job.
|
||
An environment is required to be present for serverless tasks.
|
||
For serverless notebook tasks, the environment is accessible in the notebook environment panel.
|
||
For other serverless tasks, the task environment must be specified using `environment_key` in the task settings.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `environment_key`
|
||
- String
|
||
- The key of an environment. It has to be unique within a job.
|
||
|
||
* - `spec`
|
||
- Map
|
||
- The environment entity used to preserve the serverless environment side panel and the job environment for non-notebook tasks. In this minimal environment spec, only pip dependencies are supported. See [_](#jobs.<name>.environments.spec).
|
||
|
||
|
||
### jobs.<name>.environments.spec
|
||
|
||
**`Type: Map`**
|
||
|
||
The environment entity used to preserve the serverless environment side panel and the job environment for non-notebook tasks.
|
||
In this minimal environment spec, only pip dependencies are supported.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `client`
|
||
- String
|
||
- Client version used by the environment. The client is the user-facing environment of the runtime. Each client comes with a specific set of pre-installed libraries. The version is a string consisting of the major client version.
|
||
|
||
* - `dependencies`
|
||
- Sequence
|
||
- List of pip dependencies, as supported by the version of pip in this environment. Each dependency is a pip requirements file line; see https://pip.pypa.io/en/stable/reference/requirements-file-format/. An allowed dependency can be a `<requirement specifier>`, an `<archive url/path>`, a `<local project path>` (WSFS or Volumes in Databricks), or a `<vcs project url>`. For example: `dependencies: ["foo==0.0.1", "-r /Workspace/test/requirements.txt"]`.
|
||
|
||
|
||
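The relationship between `environments` and a serverless task can be sketched as follows. The resource key, the environment key `default`, the client version `"1"`, and the `spark_python_task` with its file path are assumptions for illustration. The dependency lines are the sample values from the table above.

```yaml
resources:
  jobs:
    my_serverless_job: # hypothetical resource key
      name: my_serverless_job
      environments:
        - environment_key: default # must be unique within the job
          spec:
            client: "1" # assumed major client version, given as a string
            dependencies:
              - foo==0.0.1
              - "-r /Workspace/test/requirements.txt"
      tasks:
        - task_key: main
          environment_key: default # refers to the environment defined above
          spark_python_task: # assumed task type for a non-notebook serverless task
            python_file: ./main.py # hypothetical file
```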
### jobs.<name>.git_source
|
||
|
||
**`Type: Map`**
|
||
|
||
An optional specification for a remote Git repository containing the source code used by tasks. Version-controlled source code is supported by notebook, dbt, Python script, and SQL File tasks.
|
||
|
||
If `git_source` is set, these tasks retrieve the file from the remote repository by default. However, this behavior can be overridden by setting `source` to `WORKSPACE` on the task.
|
||
|
||
Note: dbt and SQL File tasks support only version-controlled sources. If dbt or SQL File tasks are used, `git_source` must be defined on the job.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `git_branch`
|
||
- String
|
||
- Name of the branch to be checked out and used by this job. This field cannot be specified in conjunction with git_tag or git_commit.
|
||
|
||
* - `git_commit`
|
||
- String
|
||
- Commit to be checked out and used by this job. This field cannot be specified in conjunction with `git_branch` or `git_tag`.
|
||
|
||
* - `git_provider`
|
||
- String
|
||
- Unique identifier of the service used to host the Git repository. The value is case insensitive.
|
||
|
||
* - `git_snapshot`
|
||
- Map
|
||
- Read-only state of the remote repository at the time the job was run. This field is only included on job runs. See [_](#jobs.<name>.git_source.git_snapshot).
|
||
|
||
* - `git_tag`
|
||
- String
|
||
- Name of the tag to be checked out and used by this job. This field cannot be specified in conjunction with `git_branch` or `git_commit`.
|
||
|
||
* - `git_url`
|
||
- String
|
||
- URL of the repository to be cloned by this job.
|
||
|
||
* - `job_source`
|
||
- Map
|
||
- The source of the job specification in the remote repository when the job is source controlled. See [_](#jobs.<name>.git_source.job_source).
|
||
|
||
|
||
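The following fragment is a minimal sketch of a `git_source` mapping that checks out a branch. The repository URL and job name are hypothetical, and the provider value `gitHub` is an assumption about the accepted identifiers (the table above only states that the value is case insensitive). Only one of `git_branch`, `git_tag`, or `git_commit` is set, as the table requires.

```yaml
# Hypothetical bundle fragment; URL and names are placeholders.
resources:
  jobs:
    my_job:
      git_source:
        git_url: https://github.com/example-org/example-repo
        git_provider: gitHub   # assumed provider identifier
        git_branch: main       # do not combine with git_tag or git_commit
```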
### jobs.<name>.git_source.git_snapshot
|
||
|
||
**`Type: Map`**
|
||
|
||
Read-only state of the remote repository at the time the job was run. This field is only included on job runs.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `used_commit`
|
||
- String
|
||
- Commit that was used to execute the run. If `git_branch` was specified, this points to the HEAD of the branch at the time of the run; if `git_tag` was specified, this points to the commit the tag points to.
|
||
|
||
|
||
### jobs.<name>.git_source.job_source
|
||
|
||
**`Type: Map`**
|
||
|
||
The source of the job specification in the remote repository when the job is source controlled.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `dirty_state`
|
||
- String
|
||
- Dirty state indicates the job is not fully synced with the job specification in the remote repository. Possible values are: * `NOT_SYNCED`: The job is not yet synced with the remote job specification. Import the remote job specification from the UI to make the job fully synced. * `DISCONNECTED`: The job is temporarily disconnected from the remote job specification and can be edited live. Import the remote job specification again from the UI to make the job fully synced.
|
||
|
||
* - `import_from_git_branch`
|
||
- String
|
||
- Name of the branch which the job is imported from.
|
||
|
||
* - `job_config_path`
|
||
- String
|
||
- Path of the job YAML file that contains the job specification.
|
||
|
||
|
||
### jobs.<name>.health
|
||
|
||
**`Type: Map`**
|
||
|
||
An optional set of health rules that can be defined for this job.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `rules`
|
||
- Sequence
|
||
- See [_](#jobs.<name>.health.rules).
|
||
|
||
|
||
### jobs.<name>.health.rules
|
||
|
||
**`Type: Sequence`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `metric`
|
||
- String
|
||
- Specifies the health metric that is being evaluated for a particular health rule. * `RUN_DURATION_SECONDS`: Expected total time for a run in seconds. * `STREAMING_BACKLOG_BYTES`: An estimate of the maximum bytes of data waiting to be consumed across all streams. This metric is in Public Preview. * `STREAMING_BACKLOG_RECORDS`: An estimate of the maximum offset lag across all streams. This metric is in Public Preview. * `STREAMING_BACKLOG_SECONDS`: An estimate of the maximum consumer delay across all streams. This metric is in Public Preview. * `STREAMING_BACKLOG_FILES`: An estimate of the maximum number of outstanding files across all streams. This metric is in Public Preview.
|
||
|
||
* - `op`
|
||
- String
|
||
- Specifies the operator used to compare the health metric value with the specified threshold.
|
||
|
||
* - `value`
|
||
- Integer
|
||
- Specifies the threshold value that the health metric should obey to satisfy the health rule.
|
||
|
||
|
||
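A health rule combines the three keys above. The sketch below flags runs that take longer than one hour; the operator value `GREATER_THAN` is an assumption, since the table does not list the allowed operators, and the job name is a placeholder.

```yaml
# Hypothetical bundle fragment; the operator value is an assumption.
resources:
  jobs:
    my_job:
      health:
        rules:
          - metric: RUN_DURATION_SECONDS
            op: GREATER_THAN   # assumed operator name
            value: 3600        # threshold in seconds
```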
### jobs.<name>.job_clusters
|
||
|
||
**`Type: Sequence`**
|
||
|
||
A list of job cluster specifications that can be shared and reused by tasks of this job. Libraries cannot be declared in a shared job cluster. You must declare dependent libraries in task settings.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `job_cluster_key`
|
||
- String
|
||
- A unique name for the job cluster. This field is required and must be unique within the job. `JobTaskSettings` may refer to this field to determine which cluster to launch for the task execution.
|
||
|
||
* - `new_cluster`
|
||
- Map
|
||
- If `new_cluster` is specified, a description of a cluster that is created for each task. See [_](#jobs.<name>.job_clusters.new_cluster).
|
||
|
||
|
||
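To illustrate how a shared job cluster is reused, the sketch below defines one cluster under `job_clusters` and references it from two tasks through `job_cluster_key`. The task fields (`task_key`, `notebook_task`, `libraries`) are defined outside this section, and all names, paths, the node type, and the runtime version are illustrative assumptions. Libraries are declared on the task rather than on the shared cluster, as noted above.

```yaml
# Hypothetical bundle fragment; names, paths, and versions are placeholders.
resources:
  jobs:
    my_job:
      job_clusters:
        - job_cluster_key: shared_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 2
      tasks:
        - task_key: ingest
          job_cluster_key: shared_cluster   # reuses the cluster above
          notebook_task:
            notebook_path: ./src/ingest.ipynb
          libraries:                        # libraries go on the task
            - pypi:
                package: "foo==0.0.1"
        - task_key: transform
          job_cluster_key: shared_cluster
          notebook_task:
            notebook_path: ./src/transform.ipynb
```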
### jobs.<name>.job_clusters.new_cluster
|
||
|
||
**`Type: Map`**
|
||
|
||
If `new_cluster` is specified, a description of a cluster that is created for each task.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `apply_policy_default_values`
|
||
- Boolean
|
||
- When set to true, fixed and default values from the policy will be used for fields that are omitted. When set to false, only fixed values from the policy will be applied.
|
||
|
||
* - `autoscale`
|
||
- Map
|
||
- Parameters needed to automatically scale clusters up and down based on load. Note: autoscaling works best with Databricks Runtime versions 3.0 or later. See [_](#jobs.<name>.job_clusters.new_cluster.autoscale).
|
||
|
||
* - `autotermination_minutes`
|
||
- Integer
|
||
- Automatically terminates the cluster after it is inactive for this time in minutes. If not set, this cluster will not be automatically terminated. If specified, the threshold must be between 10 and 10000 minutes. Users can also set this value to 0 to explicitly disable automatic termination.
|
||
|
||
* - `aws_attributes`
|
||
- Map
|
||
- Attributes related to clusters running on Amazon Web Services. If not specified at cluster creation, a set of default values will be used. See [_](#jobs.<name>.job_clusters.new_cluster.aws_attributes).
|
||
|
||
* - `azure_attributes`
|
||
- Map
|
||
- Attributes related to clusters running on Microsoft Azure. If not specified at cluster creation, a set of default values will be used. See [_](#jobs.<name>.job_clusters.new_cluster.azure_attributes).
|
||
|
||
* - `cluster_log_conf`
|
||
- Map
|
||
- The configuration for delivering Spark logs to a long-term storage destination. Two kinds of destinations (DBFS and S3) are supported. Only one destination can be specified for one cluster. If the configuration is given, the logs are delivered to the destination every 5 minutes. The destination of driver logs is `$destination/$clusterId/driver`, while the destination of executor logs is `$destination/$clusterId/executor`. See [_](#jobs.<name>.job_clusters.new_cluster.cluster_log_conf).
|
||
|
||
* - `cluster_name`
|
||
- String
|
||
- Cluster name requested by the user. This doesn't have to be unique. If not specified at creation, the cluster name will be an empty string.
|
||
|
||
* - `custom_tags`
|
||
- Map
|
||
- Additional tags for cluster resources. Databricks will tag all cluster resources (e.g., AWS instances and EBS volumes) with these tags in addition to `default_tags`. Notes: (1) Currently, Databricks allows at most 45 custom tags. (2) Clusters can only reuse cloud resources if the resources' tags are a subset of the cluster tags.
|
||
|
||
* - `data_security_mode`
|
||
- String
|
||
- Data security mode decides what data governance model to use when accessing data from a cluster. The following modes can only be used with `kind`. * `DATA_SECURITY_MODE_AUTO`: Databricks will choose the most appropriate access mode depending on your compute configuration. * `DATA_SECURITY_MODE_STANDARD`: Alias for `USER_ISOLATION`. * `DATA_SECURITY_MODE_DEDICATED`: Alias for `SINGLE_USER`. The following modes can be used regardless of `kind`. * `NONE`: No security isolation for multiple users sharing the cluster. Data governance features are not available in this mode. * `SINGLE_USER`: A secure cluster that can only be exclusively used by a single user specified in `single_user_name`. Most programming languages, cluster features and data governance features are available in this mode. * `USER_ISOLATION`: A secure cluster that can be shared by multiple users. Cluster users are fully isolated so that they cannot see each other's data and credentials. Most data governance features are supported in this mode, but programming languages and cluster features might be limited. The following modes are deprecated starting with Databricks Runtime 15.0 and will be removed in future Databricks Runtime versions: * `LEGACY_TABLE_ACL`: This mode is for users migrating from legacy Table ACL clusters. * `LEGACY_PASSTHROUGH`: This mode is for users migrating from legacy Passthrough on high concurrency clusters. * `LEGACY_SINGLE_USER`: This mode is for users migrating from legacy Passthrough on standard clusters. * `LEGACY_SINGLE_USER_STANDARD`: This mode provides a way that has neither UC nor passthrough enabled.
|
||
|
||
* - `docker_image`
|
||
- Map
|
||
- See [_](#jobs.<name>.job_clusters.new_cluster.docker_image).
|
||
|
||
* - `driver_instance_pool_id`
|
||
- String
|
||
- The optional ID of the instance pool to which the driver of the cluster belongs. The pool cluster uses the instance pool with ID `instance_pool_id` if the driver pool is not assigned.
|
||
|
||
* - `driver_node_type_id`
|
||
- String
|
||
- The node type of the Spark driver. This field is optional; if unset, the driver node type is set to the same value as `node_type_id`.
|
||
|
||
* - `enable_elastic_disk`
|
||
- Boolean
|
||
- Autoscaling Local Storage: when enabled, this cluster will dynamically acquire additional disk space when its Spark workers are running low on disk space. This feature requires specific AWS permissions to function correctly - refer to the User Guide for more details.
|
||
|
||
* - `enable_local_disk_encryption`
|
||
- Boolean
|
||
- Whether to enable LUKS on cluster VMs' local disks.
|
||
|
||
* - `gcp_attributes`
|
||
- Map
|
||
- Attributes related to clusters running on Google Cloud Platform. If not specified at cluster creation, a set of default values will be used. See [_](#jobs.<name>.job_clusters.new_cluster.gcp_attributes).
|
||
|
||
* - `init_scripts`
|
||
- Sequence
|
||
- The configuration for storing init scripts. Any number of destinations can be specified. The scripts are executed sequentially in the order provided. If `cluster_log_conf` is specified, init script logs are sent to `<destination>/<cluster-ID>/init_scripts`. See [_](#jobs.<name>.job_clusters.new_cluster.init_scripts).
|
||
|
||
* - `instance_pool_id`
|
||
- String
|
||
- The optional ID of the instance pool to which the cluster belongs.
|
||
|
||
* - `is_single_node`
|
||
- Boolean
|
||
- This field can only be used with `kind`. When set to true, Databricks will automatically set single-node-related `custom_tags`, `spark_conf`, and `num_workers`.
|
||
|
||
* - `kind`
|
||
- String
|
||
-
|
||
|
||
* - `node_type_id`
|
||
- String
|
||
- This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory or compute intensive workloads. A list of available node types can be retrieved by using the `clusters/listNodeTypes` API call.
|
||
|
||
* - `num_workers`
|
||
- Integer
|
||
- Number of worker nodes that this cluster should have. A cluster has one Spark Driver and `num_workers` Executors for a total of `num_workers` + 1 Spark nodes. Note: When reading the properties of a cluster, this field reflects the desired number of workers rather than the actual current number of workers. For instance, if a cluster is resized from 5 to 10 workers, this field will immediately be updated to reflect the target size of 10 workers, whereas the workers listed in `spark_info` will gradually increase from 5 to 10 as the new nodes are provisioned.
|
||
|
||
* - `policy_id`
|
||
- String
|
||
- The ID of the cluster policy used to create the cluster if applicable.
|
||
|
||
* - `runtime_engine`
|
||
- String
|
||
- Determines the cluster's runtime engine, either standard or Photon. This field is not compatible with legacy `spark_version` values that contain `-photon-`. Remove `-photon-` from the `spark_version` and set `runtime_engine` to `PHOTON`. If left unspecified, the runtime engine defaults to standard unless the `spark_version` contains `-photon-`, in which case Photon will be used.
|
||
|
||
* - `single_user_name`
|
||
- String
|
||
- Single user name if `data_security_mode` is `SINGLE_USER`.
|
||
|
||
* - `spark_conf`
|
||
- Map
|
||
- An object containing a set of optional, user-specified Spark configuration key-value pairs. Users can also pass in a string of extra JVM options to the driver and the executors via `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` respectively.
|
||
|
||
* - `spark_env_vars`
|
||
- Map
|
||
- An object containing a set of optional, user-specified environment variable key-value pairs. Note that a key-value pair of the form (X,Y) is exported as is (i.e., `export X='Y'`) while launching the driver and workers. To specify an additional set of `SPARK_DAEMON_JAVA_OPTS`, we recommend appending them to `$SPARK_DAEMON_JAVA_OPTS` as shown in the example below. This ensures that all default Databricks-managed environment variables are included as well. Example Spark environment variables: `{"SPARK_WORKER_MEMORY": "28000m", "SPARK_LOCAL_DIRS": "/local_disk0"}` or `{"SPARK_DAEMON_JAVA_OPTS": "$SPARK_DAEMON_JAVA_OPTS -Dspark.shuffle.service.enabled=true"}`
|
||
|
||
* - `spark_version`
|
||
- String
|
||
- The Spark version of the cluster, e.g. `3.3.x-scala2.11`. A list of available Spark versions can be retrieved by using the `clusters/sparkVersions` API call.
|
||
|
||
* - `ssh_public_keys`
|
||
- Sequence
|
||
- SSH public key contents that will be added to each Spark node in this cluster. The corresponding private keys can be used to log in with the user name `ubuntu` on port `2200`. Up to 10 keys can be specified.
|
||
|
||
* - `use_ml_runtime`
|
||
- Boolean
|
||
- This field can only be used with `kind`. `effective_spark_version` is determined by `spark_version` (DBR release), this field `use_ml_runtime`, and whether `node_type_id` is a GPU node or not.
|
||
|
||
* - `workload_type`
|
||
- Map
|
||
- See [_](#jobs.<name>.job_clusters.new_cluster.workload_type).
|
||
|
||
|
||
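The fragment below sketches a `new_cluster` mapping that combines several of the keys above. The runtime version, node type, user name, and tag values are illustrative assumptions; the `spark_env_vars` entry reuses the example from the table, and `runtime_engine: PHOTON` is paired with a `spark_version` that does not contain `-photon-`, as the table requires.

```yaml
# Hypothetical bundle fragment; values are placeholders for illustration.
resources:
  jobs:
    my_job:
      job_clusters:
        - job_cluster_key: main_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            runtime_engine: PHOTON
            node_type_id: i3.xlarge
            num_workers: 4               # 4 executors plus 1 driver
            data_security_mode: SINGLE_USER
            single_user_name: someone@example.com
            custom_tags:
              team: data-eng
            spark_conf:
              spark.sql.shuffle.partitions: "200"
            spark_env_vars:
              SPARK_DAEMON_JAVA_OPTS: "$SPARK_DAEMON_JAVA_OPTS -Dspark.shuffle.service.enabled=true"
```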
### jobs.<name>.job_clusters.new_cluster.autoscale
|
||
|
||
**`Type: Map`**
|
||
|
||
Parameters needed in order to automatically scale clusters up and down based on load.
|
||
Note: autoscaling works best with Databricks Runtime versions 3.0 or later.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `max_workers`
|
||
- Integer
|
||
- The maximum number of workers to which the cluster can scale up when overloaded. Note that `max_workers` must be strictly greater than `min_workers`.
|
||
|
||
* - `min_workers`
|
||
- Integer
|
||
- The minimum number of workers to which the cluster can scale down when underutilized. It is also the initial number of workers the cluster will have after creation.
|
||
|
||
|
||
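A minimal autoscaling sketch follows, assuming that `autoscale` is used in place of a fixed `num_workers`. The cluster starts with `min_workers` workers and `max_workers` is strictly greater, as required above; the other values are placeholders.

```yaml
# Hypothetical bundle fragment; names and sizes are placeholders.
resources:
  jobs:
    my_job:
      job_clusters:
        - job_cluster_key: autoscaling_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
            autoscale:
              min_workers: 2   # also the initial number of workers
              max_workers: 8   # must be strictly greater than min_workers
```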
### jobs.<name>.job_clusters.new_cluster.aws_attributes
|
||
|
||
**`Type: Map`**
|
||
|
||
Attributes related to clusters running on Amazon Web Services.
|
||
If not specified at cluster creation, a set of default values will be used.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `availability`
|
||
- String
|
||
- Availability type used for all subsequent nodes past the `first_on_demand` ones. Note: If `first_on_demand` is zero, this availability type will be used for the entire cluster.
|
||
|
||
* - `ebs_volume_count`
|
||
- Integer
|
||
- The number of volumes launched for each instance. Users can choose up to 10 volumes. This feature is only enabled for supported node types. Legacy node types cannot specify custom EBS volumes. For node types with no instance store, at least one EBS volume needs to be specified; otherwise, cluster creation will fail. These EBS volumes will be mounted at `/ebs0`, `/ebs1`, and so on. Instance store volumes will be mounted at `/local_disk0`, `/local_disk1`, and so on. If EBS volumes are attached, Databricks will configure Spark to use only the EBS volumes for scratch storage because heterogeneously sized scratch devices can lead to inefficient disk utilization. If no EBS volumes are attached, Databricks will configure Spark to use instance store volumes. Note that if EBS volumes are specified, the Spark configuration `spark.local.dir` will be overridden.
|
||
|
||
* - `ebs_volume_iops`
|
||
- Integer
|
||
- If using gp3 volumes, what IOPS to use for the disk. If this is not set, the maximum performance of a gp2 volume with the same volume size will be used.
|
||
|
||
* - `ebs_volume_size`
|
||
- Integer
|
||
- The size of each EBS volume (in GiB) launched for each instance. For general purpose SSD, this value must be within the range 100 - 4096. For throughput optimized HDD, this value must be within the range 500 - 4096.
|
||
|
||
* - `ebs_volume_throughput`
|
||
- Integer
|
||
- If using gp3 volumes, what throughput to use for the disk. If this is not set, the maximum performance of a gp2 volume with the same volume size will be used.
|
||
|
||
* - `ebs_volume_type`
|
||
- String
|
||
- The type of EBS volumes that will be launched with this cluster.
|
||
|
||
* - `first_on_demand`
|
||
- Integer
|
||
- The first `first_on_demand` nodes of the cluster will be placed on on-demand instances. If this value is greater than 0, the cluster driver node in particular will be placed on an on-demand instance. If this value is greater than or equal to the current cluster size, all nodes will be placed on on-demand instances. If this value is less than the current cluster size, `first_on_demand` nodes will be placed on on-demand instances and the remainder will be placed on `availability` instances. Note that this value does not affect cluster size and cannot currently be mutated over the lifetime of a cluster.
|
||
|
||
* - `instance_profile_arn`
|
||
- String
|
||
- Nodes for this cluster will only be placed on AWS instances with this instance profile. If omitted, nodes will be placed on instances without an IAM instance profile. The instance profile must have previously been added to the Databricks environment by an account administrator. This feature may only be available to certain customer plans. If this field is omitted, we will pull in the default from the conf if it exists.
|
||
|
||
* - `spot_bid_price_percent`
|
||
- Integer
|
||
- The bid price for AWS spot instances, as a percentage of the corresponding instance type's on-demand price. For example, if this field is set to 50, and the cluster needs a new `r3.xlarge` spot instance, then the bid price is half of the price of on-demand `r3.xlarge` instances. Similarly, if this field is set to 200, the bid price is twice the price of on-demand `r3.xlarge` instances. If not specified, the default value is 100. When spot instances are requested for this cluster, only spot instances whose bid price percentage matches this field will be considered. Note that, for safety, we enforce this field to be no more than 10000. The default value and documentation here should be kept consistent with CommonConf.defaultSpotBidPricePercent and CommonConf.maxSpotBidPricePercent.
|
||
|
||
* - `zone_id`
|
||
- String
|
||
- Identifier for the availability zone/datacenter in which the cluster resides. This string will be of a form like "us-west-2a". The provided availability zone must be in the same region as the Databricks deployment. For example, "us-west-2a" is not a valid zone ID if the Databricks deployment resides in the "us-east-1" region. This is an optional field at cluster creation, and if not specified, a default zone will be used. If the zone specified is "auto", Databricks will try to place the cluster in a zone with high availability and will retry placement in a different AZ if there is not enough capacity. The list of available zones as well as the default value can be found by using the `List Zones` method.
|
||
|
||
|
||
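The sketch below shows one way the AWS attributes above might be combined: the driver is kept on an on-demand instance through `first_on_demand: 1`, and the remaining nodes use the configured availability type. The availability value `SPOT_WITH_FALLBACK` is an assumption, since the table does not enumerate the allowed values, and the node type and version are placeholders.

```yaml
# Hypothetical bundle fragment; enum values and names are assumptions.
resources:
  jobs:
    my_job:
      job_clusters:
        - job_cluster_key: spot_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 4
            aws_attributes:
              first_on_demand: 1               # driver stays on-demand
              availability: SPOT_WITH_FALLBACK # assumed value for other nodes
              spot_bid_price_percent: 100      # the documented default
              zone_id: auto                    # let Databricks choose the AZ
```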
### jobs.<name>.job_clusters.new_cluster.azure_attributes
|
||
|
||
**`Type: Map`**
|
||
|
||
Attributes related to clusters running on Microsoft Azure.
|
||
If not specified at cluster creation, a set of default values will be used.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `availability`
|
||
- String
|
||
- Availability type used for all subsequent nodes past the `first_on_demand` ones. Note: If `first_on_demand` is zero (which only happens on pool clusters), this availability type will be used for the entire cluster.
|
||
|
||
* - `first_on_demand`
|
||
- Integer
|
||
- The first `first_on_demand` nodes of the cluster will be placed on on-demand instances. This value should be greater than 0, to make sure the cluster driver node is placed on an on-demand instance. If this value is greater than or equal to the current cluster size, all nodes will be placed on on-demand instances. If this value is less than the current cluster size, `first_on_demand` nodes will be placed on on-demand instances and the remainder will be placed on `availability` instances. Note that this value does not affect cluster size and cannot currently be mutated over the lifetime of a cluster.
|
||
|
||
* - `log_analytics_info`
|
||
- Map
|
||
- Defines values necessary to configure and run Azure Log Analytics agent. See [_](#jobs.<name>.job_clusters.new_cluster.azure_attributes.log_analytics_info).
|
||
|
||
* - `spot_bid_max_price`
|
||
- Any
|
||
- The maximum bid price to be used for Azure spot instances. The maximum price for the bid cannot be higher than the on-demand price of the instance. If not specified, the default value is -1, which specifies that the instance cannot be evicted on the basis of price, only on the basis of availability. The value must be either greater than 0 or equal to -1.
|
||
|
||
|
||
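An Azure equivalent might look like the following sketch. The availability value `SPOT_WITH_FALLBACK_AZURE` and the node type are assumptions not listed in the table; `spot_bid_max_price: -1` is the documented default, meaning nodes are evicted only on the basis of availability.

```yaml
# Hypothetical bundle fragment; enum values and names are assumptions.
resources:
  jobs:
    my_job:
      job_clusters:
        - job_cluster_key: azure_spot_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 4
            azure_attributes:
              first_on_demand: 1                     # driver stays on-demand
              availability: SPOT_WITH_FALLBACK_AZURE # assumed value
              spot_bid_max_price: -1                 # never evict on price
```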
### jobs.<name>.job_clusters.new_cluster.azure_attributes.log_analytics_info
|
||
|
||
**`Type: Map`**
|
||
|
||
Defines values necessary to configure and run the Azure Log Analytics agent.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `log_analytics_primary_key`
|
||
- String
|
||
- <needs content added>
|
||
|
||
* - `log_analytics_workspace_id`
|
||
- String
|
||
- <needs content added>
|
||
|
||
|
||
### jobs.<name>.job_clusters.new_cluster.cluster_log_conf
|
||
|
||
**`Type: Map`**
|
||
|
||
The configuration for delivering Spark logs to a long-term storage destination. Two kinds of destinations (DBFS and S3) are supported. Only one destination can be specified for one cluster. If the configuration is given, the logs are delivered to the destination every 5 minutes. The destination of driver logs is `$destination/$clusterId/driver`, while the destination of executor logs is `$destination/$clusterId/executor`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `dbfs`
|
||
- Map
|
||
- A destination must be provided, for example `{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`. See [_](#jobs.<name>.job_clusters.new_cluster.cluster_log_conf.dbfs).
|
||
|
||
* - `s3`
|
||
- Map
|
||
- A destination and either the region or endpoint must be provided, for example `{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`. The cluster IAM role is used to access S3, so make sure the cluster IAM role in `instance_profile_arn` has permission to write data to the S3 destination. See [_](#jobs.<name>.job_clusters.new_cluster.cluster_log_conf.s3).
|
||
|
||
|
||
### jobs.<name>.job_clusters.new_cluster.cluster_log_conf.dbfs
|
||
|
||
**`Type: Map`**
|
||
|
||
A destination must be provided, for example `{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- DBFS destination, e.g. `dbfs:/my/path`.
|
||
|
||
|
||
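Expressed as bundle YAML rather than JSON, the DBFS example above could look like the following sketch. The job and cluster names are placeholders; the destination is the one from the example.

```yaml
# Hypothetical bundle fragment; names are placeholders.
resources:
  jobs:
    my_job:
      job_clusters:
        - job_cluster_key: logged_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 2
            cluster_log_conf:
              dbfs:
                destination: dbfs:/home/cluster_log
```

With this configuration, driver logs would land under `dbfs:/home/cluster_log/<cluster-ID>/driver` and executor logs under `dbfs:/home/cluster_log/<cluster-ID>/executor`, following the path pattern described for `cluster_log_conf`.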
### jobs.<name>.job_clusters.new_cluster.cluster_log_conf.s3
|
||
|
||
**`Type: Map`**
|
||
|
||
A destination and either the region or endpoint must be provided, for example `{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`. The cluster IAM role is used to access S3, so make sure the cluster IAM role in `instance_profile_arn` has permission to write data to the S3 destination.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `canned_acl`
|
||
- String
|
||
- (Optional) Set a canned access control list for the logs, e.g. `bucket-owner-full-control`. If `canned_acl` is set, make sure the cluster IAM role has `s3:PutObjectAcl` permission on the destination bucket and prefix. The full list of possible canned ACLs can be found at http://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl. Note that by default only the object owner gets full control. If you are using a cross-account role for writing data, you may want to set `bucket-owner-full-control` so that the bucket owner can read the logs.
|
||
|
||
* - `destination`
|
||
- String
|
||
- S3 destination, e.g. `s3://my-bucket/some-prefix`. Logs are delivered using the cluster IAM role, so make sure you set a cluster IAM role and that the role has write access to the destination. You cannot use AWS keys to deliver logs.
|
||
|
||
* - `enable_encryption`
|
||
- Boolean
|
||
- (Optional) Flag to enable server side encryption, `false` by default.
|
||
|
||
* - `encryption_type`
|
||
- String
|
||
- (Optional) The encryption type, either `sse-s3` or `sse-kms`. It is used only when encryption is enabled. The default type is `sse-s3`.
|
||
|
||
* - `endpoint`
|
||
- String
|
||
- S3 endpoint, e.g. `https://s3-us-west-2.amazonaws.com`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
|
||
|
||
* - `kms_key`
|
||
- String
|
||
- (Optional) KMS key that is used if encryption is enabled and the encryption type is set to `sse-kms`.
|
||
|
||
* - `region`
|
||
- String
|
||
- S3 region, e.g. `us-west-2`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
|
||
|
||
|
||
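The S3 variant needs an instance profile whose IAM role can write to the bucket. The sketch below sets `region` only (if both `region` and `endpoint` were set, `endpoint` would be used) and enables server-side encryption with the default `sse-s3` type. The instance profile ARN, bucket, and names are hypothetical.

```yaml
# Hypothetical bundle fragment; ARN, bucket, and names are placeholders.
resources:
  jobs:
    my_job:
      job_clusters:
        - job_cluster_key: s3_logged_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 2
            aws_attributes:
              # Role must allow writes (and s3:PutObjectAcl for canned_acl).
              instance_profile_arn: arn:aws:iam::123456789012:instance-profile/log-writer
            cluster_log_conf:
              s3:
                destination: s3://cluster_log_bucket/prefix
                region: us-west-2
                enable_encryption: true
                encryption_type: sse-s3
                canned_acl: bucket-owner-full-control
```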
### jobs.<name>.job_clusters.new_cluster.docker_image
|
||
|
||
**`Type: Map`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `basic_auth`
|
||
- Map
|
||
- See [_](#jobs.<name>.job_clusters.new_cluster.docker_image.basic_auth).
|
||
|
||
* - `url`
|
||
- String
|
||
- URL of the docker image.
|
||
|
||
|
||
### jobs.<name>.job_clusters.new_cluster.docker_image.basic_auth
|
||
|
||
**`Type: Map`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `password`
|
||
- String
|
||
- Password of the user
|
||
|
||
* - `username`
|
||
- String
|
||
- Name of the user
|
||
|
||
|
||
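As a sketch of a `docker_image` mapping with registry credentials, the fragment below reads the user name and password from bundle variables instead of hardcoding them. The image URL and the variable names `registry_username` and `registry_password` are hypothetical, and the example assumes those variables are declared elsewhere in the bundle.

```yaml
# Hypothetical bundle fragment; image URL and variable names are placeholders.
resources:
  jobs:
    my_job:
      job_clusters:
        - job_cluster_key: container_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 2
            docker_image:
              url: registry.example.com/team/runtime:latest
              basic_auth:
                username: ${var.registry_username}  # assumed bundle variable
                password: ${var.registry_password}  # assumed bundle variable
```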
### jobs.<name>.job_clusters.new_cluster.gcp_attributes
|
||
|
||
**`Type: Map`**
|
||
|
||
Attributes related to clusters running on Google Cloud Platform.
|
||
If not specified at cluster creation, a set of default values will be used.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `availability`
|
||
- String
|
||
- This field determines whether the instance pool will contain preemptible VMs, on-demand VMs, or preemptible VMs with a fallback to on-demand VMs if the former is unavailable.
|
||
|
||
* - `boot_disk_size`
|
||
- Integer
|
||
- Boot disk size in GB.
|
||
|
||
* - `google_service_account`
|
||
- String
|
||
- If provided, the cluster will impersonate the Google service account when accessing gcloud services (like GCS). The Google service account must have previously been added to the Databricks environment by an account administrator.
|
||
|
||
* - `local_ssd_count`
|
||
- Integer
|
||
- If provided, each node (workers and driver) in the cluster will have this number of local SSDs attached. Each local SSD is 375GB in size. Refer to [GCP documentation](https://cloud.google.com/compute/docs/disks/local-ssd#choose_number_local_ssds) for the supported number of local SSDs for each instance type.
|
||
|
||
* - `use_preemptible_executors`
|
||
- Boolean
|
||
- This field determines whether the Spark executors will be scheduled to run on preemptible VMs (when set to true) versus standard Compute Engine VMs (when set to false; default). Note: Soon to be deprecated, use the `availability` field instead.
|
||
|
||
* - `zone_id`
|
||
- String
|
||
- Identifier for the availability zone in which the cluster resides. This can be one of the following: - "HA" => High availability, spread nodes across availability zones for a Databricks deployment region [default] - "AUTO" => Databricks picks an availability zone to schedule the cluster on. - A GCP availability zone => Pick One of the available zones for (machine type + region) from https://cloud.google.com/compute/docs/regions-zones.
|
||
|
||
|
||
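A Google Cloud sketch follows. The availability value `PREEMPTIBLE_WITH_FALLBACK_GCP`, the node type, and the service account address are assumptions not given in the table; `zone_id: AUTO` is one of the documented options.

```yaml
# Hypothetical bundle fragment; enum values and names are assumptions.
resources:
  jobs:
    my_job:
      job_clusters:
        - job_cluster_key: gcp_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: n2-highmem-4
            num_workers: 2
            gcp_attributes:
              availability: PREEMPTIBLE_WITH_FALLBACK_GCP  # assumed value
              zone_id: AUTO    # Databricks picks the availability zone
              google_service_account: job-runner@example-project.iam.gserviceaccount.com
```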
### jobs.<name>.job_clusters.new_cluster.init_scripts
|
||
|
||
**`Type: Sequence`**
|
||
|
||
The configuration for storing init scripts. Any number of destinations can be specified. The scripts are executed sequentially in the order provided. If `cluster_log_conf` is specified, init script logs are sent to `<destination>/<cluster-ID>/init_scripts`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `abfss`
|
||
- Map
|
||
- A destination must be provided, for example `{ "abfss" : { "destination" : "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>" } }`. See [_](#jobs.<name>.job_clusters.new_cluster.init_scripts.abfss).
|
||
|
||
* - `dbfs`
|
||
- Map
|
||
- A destination must be provided, for example `{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`. See [_](#jobs.<name>.job_clusters.new_cluster.init_scripts.dbfs).
|
||
|
||
* - `file`
|
||
- Map
|
||
- A destination must be provided, for example `{ "file" : { "destination" : "file:/my/local/file.sh" } }`. See [_](#jobs.<name>.job_clusters.new_cluster.init_scripts.file).
|
||
|
||
* - `gcs`
|
||
- Map
|
||
- A destination must be provided, for example `{ "gcs": { "destination": "gs://my-bucket/file.sh" } }`. See [_](#jobs.<name>.job_clusters.new_cluster.init_scripts.gcs).
|
||
|
||
* - `s3`
|
||
- Map
|
||
- A destination and either the region or endpoint must be provided, for example `{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`. The cluster IAM role is used to access S3, so make sure the cluster IAM role in `instance_profile_arn` has permission to write data to the S3 destination. See [_](#jobs.<name>.job_clusters.new_cluster.init_scripts.s3).
|
||
|
||
* - `volumes`
|
||
- Map
|
||
- A destination must be provided, for example `{ "volumes" : { "destination" : "/Volumes/my-init.sh" } }`. See [_](#jobs.<name>.job_clusters.new_cluster.init_scripts.volumes).
|
||
|
||
* - `workspace`
|
||
- Map
|
||
- A destination must be provided, for example `{ "workspace" : { "destination" : "/Users/user1@databricks.com/my-init.sh" } }`. See [_](#jobs.<name>.job_clusters.new_cluster.init_scripts.workspace).
|
||
|
||
|
||
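As a minimal sketch of the keys above, the following job cluster lists two init scripts that run in the order given. The job name `my_job` and the cluster key `main_cluster` are hypothetical placeholders, the script paths reuse the examples from the table, and the remaining cluster settings are omitted.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      job_clusters:
        - job_cluster_key: main_cluster # hypothetical key
          new_cluster:
            # Other cluster settings (runtime version, node type, and so on) are omitted.
            init_scripts:
              # Scripts run sequentially in the order listed.
              - workspace:
                  destination: /Users/user1@databricks.com/my-init.sh
              - volumes:
                  destination: /Volumes/my-init.sh
```
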
### jobs.<name>.job_clusters.new_cluster.init_scripts.abfss

**`Type: Map`**

The destination must be provided, for example `{ "abfss" : { "destination" : "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>" } }`.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `destination`
     - String
     - The ABFSS destination, for example `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>`.

### jobs.<name>.job_clusters.new_cluster.init_scripts.dbfs

**`Type: Map`**

The destination must be provided, for example `{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `destination`
     - String
     - The DBFS destination, for example `dbfs:/my/path`.

### jobs.<name>.job_clusters.new_cluster.init_scripts.file

**`Type: Map`**

The destination must be provided, for example `{ "file" : { "destination" : "file:/my/local/file.sh" } }`.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `destination`
     - String
     - The local file destination, for example `file:/my/local/file.sh`.

### jobs.<name>.job_clusters.new_cluster.init_scripts.gcs

**`Type: Map`**

The destination must be provided, for example `{ "gcs": { "destination": "gs://my-bucket/file.sh" } }`.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `destination`
     - String
     - The GCS destination URI, for example `gs://my-bucket/some-prefix`.

### jobs.<name>.job_clusters.new_cluster.init_scripts.s3

**`Type: Map`**

The destination and either the region or the endpoint must be provided, for example `{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`. The cluster IAM role is used to access S3, so make sure the IAM role in `instance_profile_arn` has permission to write data to the S3 destination.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `canned_acl`
     - String
     - (Optional) Sets the canned access control list for the logs, for example `bucket-owner-full-control`. If `canned_acl` is set, make sure the cluster IAM role has the `s3:PutObjectAcl` permission on the destination bucket and prefix. The full list of possible canned ACLs can be found at http://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl. By default, only the object owner gets full control. If you use a cross-account role for writing data, you may want to set `bucket-owner-full-control` so that the bucket owner can read the logs.

   * - `destination`
     - String
     - The S3 destination, for example `s3://my-bucket/some-prefix`. Logs are delivered using the cluster IAM role, so make sure you set a cluster IAM role and that the role has write access to the destination. You cannot use AWS keys to deliver logs.

   * - `enable_encryption`
     - Boolean
     - (Optional) Flag to enable server-side encryption. The default is `false`.

   * - `encryption_type`
     - String
     - (Optional) The encryption type, either `sse-s3` or `sse-kms`. It is used only when encryption is enabled. The default type is `sse-s3`.

   * - `endpoint`
     - String
     - The S3 endpoint, for example `https://s3-us-west-2.amazonaws.com`. Either `region` or `endpoint` must be set. If both are set, `endpoint` is used.

   * - `kms_key`
     - String
     - (Optional) The KMS key that is used if encryption is enabled and the encryption type is set to `sse-kms`.

   * - `region`
     - String
     - The S3 region, for example `us-west-2`. Either `region` or `endpoint` must be set. If both are set, `endpoint` is used.

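The following fragment is a sketch of an S3 init script entry with server-side encryption enabled. It is meant to sit under a `new_cluster` mapping. The object path and the KMS key placeholder are hypothetical, and the cluster is assumed to have an instance profile that can access the bucket.

```yaml
init_scripts:
  - s3:
      destination: s3://my-bucket/some-prefix/my-init.sh # hypothetical object path
      region: us-west-2 # set region or endpoint; if both are set, endpoint is used
      enable_encryption: true
      encryption_type: sse-kms # used only because enable_encryption is true
      kms_key: "<kms-key>" # placeholder, used only with sse-kms
```
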
### jobs.<name>.job_clusters.new_cluster.init_scripts.volumes

**`Type: Map`**

The destination must be provided, for example `{ "volumes" : { "destination" : "/Volumes/my-init.sh" } }`.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `destination`
     - String
     - The Unity Catalog volumes file destination, for example `/Volumes/my-init.sh`.

### jobs.<name>.job_clusters.new_cluster.init_scripts.workspace

**`Type: Map`**

The destination must be provided, for example `{ "workspace" : { "destination" : "/Users/user1@databricks.com/my-init.sh" } }`.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `destination`
     - String
     - The workspace files destination, for example `/Users/user1@databricks.com/my-init.sh`.

### jobs.<name>.job_clusters.new_cluster.workload_type

**`Type: Map`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `clients`
     - Map
     - Defines what type of clients can use the cluster, for example notebooks and jobs. See [_](#jobs.<name>.job_clusters.new_cluster.workload_type.clients).

### jobs.<name>.job_clusters.new_cluster.workload_type.clients

**`Type: Map`**

Defines what type of clients can use the cluster, for example notebooks and jobs.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `jobs`
     - Boolean
     - When `jobs` is set, the cluster can be used for jobs.

   * - `notebooks`
     - Boolean
     - When `notebooks` is set, the cluster can be used for notebooks.

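As an illustrative sketch, the following fragment restricts a cluster to job workloads. It is meant to sit under a `new_cluster` mapping, and the chosen values are only an example.

```yaml
new_cluster:
  # Other cluster settings are omitted.
  workload_type:
    clients:
      jobs: true
      notebooks: false
```
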
### jobs.<name>.notification_settings

**`Type: Map`**

Optional notification settings that are used when sending notifications to each of the `email_notifications` and `webhook_notifications` for this job.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `no_alert_for_canceled_runs`
     - Boolean
     - If true, do not send notifications to recipients specified in `on_failure` if the run is canceled.

   * - `no_alert_for_skipped_runs`
     - Boolean
     - If true, do not send notifications to recipients specified in `on_failure` if the run is skipped.

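A minimal sketch of how these settings combine with failure notifications follows. The job name and the email address are hypothetical.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      email_notifications:
        on_failure:
          - someone@example.com # hypothetical recipient
      notification_settings:
        # Suppress on_failure notifications for canceled and skipped runs.
        no_alert_for_canceled_runs: true
        no_alert_for_skipped_runs: true
```
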
### jobs.<name>.parameters

**`Type: Sequence`**

Job-level parameter definitions.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `default`
     - String
     - Default value of the parameter.

   * - `name`
     - String
     - The name of the defined parameter. May only contain alphanumeric characters, `_`, `-`, and `.`.

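For illustration, the following sketch defines two job-level parameters. The job name, parameter names, and default values are hypothetical. Because `default` is a string, the date is quoted.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      parameters:
        - name: env
          default: dev
        - name: run.date # names may contain alphanumeric characters, _, -, and .
          default: "2024-01-01"
```
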
### jobs.<name>.permissions

**`Type: Sequence`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `group_name`
     - String
     - The name of the group that has the permission set in `level`.

   * - `level`
     - String
     - The allowed permission for the user, group, or service principal defined for this permission.

   * - `service_principal_name`
     - String
     - The name of the service principal that has the permission set in `level`.

   * - `user_name`
     - String
     - The name of the user that has the permission set in `level`.

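The sketch below grants permissions to a group, a user, and a service principal. The principal names are hypothetical, and the `level` values shown (`CAN_MANAGE_RUN`, `CAN_VIEW`, `CAN_MANAGE`) are assumed to be valid job permission levels, because this section does not list the allowed values.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      permissions:
        - group_name: data-engineers # hypothetical group
          level: CAN_MANAGE_RUN # assumed permission level
        - user_name: someone@example.com # hypothetical user
          level: CAN_VIEW # assumed permission level
        - service_principal_name: "<application-id>" # placeholder
          level: CAN_MANAGE # assumed permission level
```
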
### jobs.<name>.queue

**`Type: Map`**

The queue settings of the job.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `enabled`
     - Boolean
     - If true, enable queueing for the job. This is a required field.

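As a short sketch, queueing is enabled with a single required key. The job name is hypothetical.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      queue:
        enabled: true
```
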
### jobs.<name>.run_as

**`Type: Map`**

Write-only setting. Specifies the user or service principal that the job runs as. If not specified, the job runs as the user who created the job.

Either `user_name` or `service_principal_name` should be specified. If not, an error is thrown.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `service_principal_name`
     - String
     - The application ID of an active service principal. Setting this field requires the `servicePrincipal/user` role.

   * - `user_name`
     - String
     - The email of an active workspace user. Non-admin users can only set this field to their own email.

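The following sketch runs a job as a service principal. The job name and the application ID placeholder are hypothetical. The commented-out line shows the `user_name` alternative.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      run_as:
        service_principal_name: "<application-id>" # placeholder
        # To run as a user instead, set user_name and omit service_principal_name:
        # user_name: someone@example.com
```
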
### jobs.<name>.schedule

**`Type: Map`**

An optional periodic schedule for this job. The default behavior is that the job only runs when triggered by clicking "Run Now" in the Jobs UI or sending an API request to `runNow`.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `pause_status`
     - String
     - Indicates whether this schedule is paused or not.

   * - `quartz_cron_expression`
     - String
     - A cron expression using Quartz syntax that describes the schedule for a job. See [Cron Trigger](http://www.quartz-scheduler.org/documentation/quartz-2.3.0/tutorials/crontrigger.html) for details. This field is required.

   * - `timezone_id`
     - String
     - A Java timezone ID. The schedule for a job is resolved with respect to this timezone. See [Java TimeZone](https://docs.oracle.com/javase/7/docs/api/java/util/TimeZone.html) for details. This field is required.

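As a sketch, the following schedule runs a job every day at 08:00 UTC. The job name is hypothetical, and `UNPAUSED` is assumed to be an accepted `pause_status` value, because this section does not list the allowed values. Quartz expressions start with a seconds field, so the expression has six fields.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      schedule:
        # Fields: second minute hour day-of-month month day-of-week
        quartz_cron_expression: "0 0 8 * * ?"
        timezone_id: UTC
        pause_status: UNPAUSED # assumed value
```
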
### jobs.<name>.tasks

**`Type: Sequence`**

A list of task specifications to be executed by this job.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `clean_rooms_notebook_task`
     - Map
     - The task runs a [clean rooms](https://docs.databricks.com/en/clean-rooms/index.html) notebook when the `clean_rooms_notebook_task` field is present. See [_](#jobs.<name>.tasks.clean_rooms_notebook_task).

   * - `condition_task`
     - Map
     - The task evaluates a condition that can be used to control the execution of other tasks when the `condition_task` field is present. The condition task does not require a cluster to execute and does not support retries or notifications. See [_](#jobs.<name>.tasks.condition_task).

   * - `dbt_task`
     - Map
     - The task runs one or more dbt commands when the `dbt_task` field is present. The dbt task requires both Databricks SQL and the ability to use a serverless or a pro SQL warehouse. See [_](#jobs.<name>.tasks.dbt_task).

   * - `depends_on`
     - Sequence
     - An optional array of objects specifying the dependency graph of the task. All tasks specified in this field must complete before executing this task. The task will run only if the `run_if` condition is true. The key is `task_key`, and the value is the name assigned to the dependent task. See [_](#jobs.<name>.tasks.depends_on).

   * - `description`
     - String
     - An optional description for this task.

   * - `disable_auto_optimization`
     - Boolean
     - An option to disable auto optimization in serverless.

   * - `email_notifications`
     - Map
     - An optional set of email addresses that is notified when runs of this task begin or complete as well as when this task is deleted. The default behavior is to not send any emails. See [_](#jobs.<name>.tasks.email_notifications).

   * - `environment_key`
     - String
     - The key that references an environment spec in a job. This field is required for Python script, Python wheel and dbt tasks when using serverless compute.

   * - `existing_cluster_id`
     - String
     - If `existing_cluster_id` is specified, the ID of an existing cluster that is used for all runs. When running jobs or tasks on an existing cluster, you may need to manually restart the cluster if it stops responding. We suggest running jobs and tasks on new clusters for greater reliability.

   * - `for_each_task`
     - Map
     - The task executes a nested task for every input provided when the `for_each_task` field is present. See [_](#jobs.<name>.tasks.for_each_task).

   * - `health`
     - Map
     - An optional set of health rules that can be defined for this job. See [_](#jobs.<name>.tasks.health).

   * - `job_cluster_key`
     - String
     - If `job_cluster_key` is specified, this task is executed reusing the cluster specified in `job.settings.job_clusters`.

   * - `libraries`
     - Sequence
     - An optional list of libraries to be installed on the cluster. The default value is an empty list. See [_](#jobs.<name>.tasks.libraries).

   * - `max_retries`
     - Integer
     - An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it completes with the `FAILED` `result_state` or `INTERNAL_ERROR` `life_cycle_state`. The value `-1` means to retry indefinitely and the value `0` means to never retry.

   * - `min_retry_interval_millis`
     - Integer
     - An optional minimal interval in milliseconds between the start of the failed run and the subsequent retry run. The default behavior is that unsuccessful runs are immediately retried.

   * - `new_cluster`
     - Map
     - If `new_cluster` is specified, a description of a new cluster that is created for each run. See [_](#jobs.<name>.tasks.new_cluster).

   * - `notebook_task`
     - Map
     - The task runs a notebook when the `notebook_task` field is present. See [_](#jobs.<name>.tasks.notebook_task).

   * - `notification_settings`
     - Map
     - Optional notification settings that are used when sending notifications to each of the `email_notifications` and `webhook_notifications` for this task. See [_](#jobs.<name>.tasks.notification_settings).

   * - `pipeline_task`
     - Map
     - The task triggers a pipeline update when the `pipeline_task` field is present. Only pipelines configured to use triggered mode are supported. See [_](#jobs.<name>.tasks.pipeline_task).

   * - `python_wheel_task`
     - Map
     - The task runs a Python wheel when the `python_wheel_task` field is present. See [_](#jobs.<name>.tasks.python_wheel_task).

   * - `retry_on_timeout`
     - Boolean
     - An optional policy to specify whether to retry a job when it times out. The default behavior is to not retry on timeout.

   * - `run_if`
     - String
     - An optional value specifying the condition determining whether the task is run once its dependencies have been completed. Valid values: `ALL_SUCCESS` (all dependencies have executed and succeeded), `AT_LEAST_ONE_SUCCESS` (at least one dependency has succeeded), `NONE_FAILED` (none of the dependencies have failed and at least one was executed), `ALL_DONE` (all dependencies have been completed), `AT_LEAST_ONE_FAILED` (at least one dependency failed), and `ALL_FAILED` (all dependencies have failed).

   * - `run_job_task`
     - Map
     - The task triggers another job when the `run_job_task` field is present. See [_](#jobs.<name>.tasks.run_job_task).

   * - `spark_jar_task`
     - Map
     - The task runs a JAR when the `spark_jar_task` field is present. See [_](#jobs.<name>.tasks.spark_jar_task).

   * - `spark_python_task`
     - Map
     - The task runs a Python file when the `spark_python_task` field is present. See [_](#jobs.<name>.tasks.spark_python_task).

   * - `spark_submit_task`
     - Map
     - (Legacy) The task runs the spark-submit script when the `spark_submit_task` field is present. This task can run only on new clusters and is not compatible with serverless compute. In the `new_cluster` specification, `libraries` and `spark_conf` are not supported. Instead, use `--jars` and `--py-files` to add Java and Python libraries and `--conf` to set the Spark configurations. `master`, `deploy-mode`, and `executor-cores` are automatically configured by Databricks; you _cannot_ specify them in parameters. By default, the Spark submit job uses all available memory (excluding reserved memory for Databricks services). You can set `--driver-memory` and `--executor-memory` to a smaller value to leave some room for off-heap usage. The `--jars`, `--py-files`, and `--files` arguments support DBFS and S3 paths. See [_](#jobs.<name>.tasks.spark_submit_task).

   * - `sql_task`
     - Map
     - The task runs a SQL query or file, or it refreshes a SQL alert or a legacy SQL dashboard when the `sql_task` field is present. See [_](#jobs.<name>.tasks.sql_task).

   * - `task_key`
     - String
     - A unique name for the task. This field is used to refer to this task from other tasks. This field is required and must be unique within its parent job. On Update or Reset, this field is used to reference the tasks to be updated or reset.

   * - `timeout_seconds`
     - Integer
     - An optional timeout applied to each run of this job task. A value of `0` means no timeout.

   * - `webhook_notifications`
     - Map
     - A collection of system notification IDs to notify when runs of this task begin or complete. The default behavior is to not send any system notifications. See [_](#jobs.<name>.tasks.webhook_notifications).

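To illustrate how several of these keys fit together, the following sketch defines two tasks where the second runs only after the first succeeds. The job name, task keys, file paths, and cluster ID placeholder are hypothetical. The `notebook_path` and `python_file` keys belong to the `notebook_task` and `spark_python_task` mappings, which are documented in their own sections rather than in the table above.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      tasks:
        - task_key: ingest
          existing_cluster_id: "<cluster-id>" # placeholder
          notebook_task:
            notebook_path: ./src/ingest.ipynb # hypothetical path
          max_retries: 2 # retry an unsuccessful run up to two times
          min_retry_interval_millis: 60000 # wait at least one minute between attempts
          timeout_seconds: 3600 # 0 would mean no timeout
        - task_key: report
          depends_on:
            - task_key: ingest
          run_if: ALL_SUCCESS
          existing_cluster_id: "<cluster-id>" # placeholder
          spark_python_task:
            python_file: ./src/report.py # hypothetical path
```
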
### jobs.<name>.tasks.clean_rooms_notebook_task

**`Type: Map`**

The task runs a [clean rooms](https://docs.databricks.com/en/clean-rooms/index.html) notebook when the `clean_rooms_notebook_task` field is present.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `clean_room_name`
     - String
     - The clean room that the notebook belongs to.

   * - `etag`
     - String
     - Checksum to validate the freshness of the notebook resource (that is, the notebook being run is the latest version). It can be fetched by calling the `cleanroomassets/get` API.

   * - `notebook_base_parameters`
     - Map
     - Base parameters to be used for the clean room notebook job.

   * - `notebook_name`
     - String
     - Name of the notebook being run.

### jobs.<name>.tasks.condition_task

**`Type: Map`**

The task evaluates a condition that can be used to control the execution of other tasks when the `condition_task` field is present. The condition task does not require a cluster to execute and does not support retries or notifications.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `left`
     - String
     - The left operand of the condition task. Can be either a string value or a job state or parameter reference.

   * - `op`
     - String
     - The `EQUAL_TO` and `NOT_EQUAL` operators perform string comparison of their operands, so `"12.0" == "12"` evaluates to `false`. The `GREATER_THAN`, `GREATER_THAN_OR_EQUAL`, `LESS_THAN`, and `LESS_THAN_OR_EQUAL` operators perform numeric comparison of their operands, so `"12.0" >= "12"` evaluates to `true` and `"10.0" >= "12"` evaluates to `false`. Boolean comparison with task values can be implemented with the `EQUAL_TO` and `NOT_EQUAL` operators. If a task value was set to a boolean value, it is serialized to `"true"` or `"false"` for the comparison.

   * - `right`
     - String
     - The right operand of the condition task. Can be either a string value or a job state or parameter reference.

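The sketch below shows a condition task gating a downstream task. The job name, task keys, and notebook path are hypothetical, and the `{{job.parameters.env}}` reference assumes that a job parameter named `env` is defined. Because `EQUAL_TO` compares strings, the downstream task depends on the outcome `"true"` as a quoted string.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      tasks:
        - task_key: is_prod
          condition_task:
            left: "{{job.parameters.env}}" # assumes a job parameter named env
            op: EQUAL_TO # string comparison
            right: prod
        - task_key: deploy
          depends_on:
            - task_key: is_prod
              outcome: "true" # run only when the condition evaluated to true
          notebook_task:
            notebook_path: ./src/deploy.ipynb # hypothetical path
```
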
### jobs.<name>.tasks.dbt_task

**`Type: Map`**

The task runs one or more dbt commands when the `dbt_task` field is present. The dbt task requires both Databricks SQL and the ability to use a serverless or a pro SQL warehouse.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `catalog`
     - String
     - Optional name of the catalog to use. The value is the top level in the 3-level namespace of Unity Catalog (catalog / schema / relation). The catalog value can only be specified if a `warehouse_id` is specified. Requires dbt-databricks >= 1.1.1.

   * - `commands`
     - Sequence
     - A list of dbt commands to execute. All commands must start with `dbt`. This parameter must not be empty. A maximum of 10 commands can be provided.

   * - `profiles_directory`
     - String
     - Optional (relative) path to the profiles directory. Can only be specified if no `warehouse_id` is specified. If no `warehouse_id` is specified and this folder is unset, the root directory is used.

   * - `project_directory`
     - String
     - Path to the project directory. Optional for Git sourced tasks, in which case if no value is provided, the root of the Git repository is used.

   * - `schema`
     - String
     - Optional schema to write to. This parameter is only used when a `warehouse_id` is also provided. If not provided, the `default` schema is used.

   * - `source`
     - String
     - Optional location type of the project directory. When set to `WORKSPACE`, the project is retrieved from the local Databricks workspace. When set to `GIT`, the project is retrieved from a Git repository defined in `git_source`. If the value is empty, the task uses `GIT` if `git_source` is defined and `WORKSPACE` otherwise. Valid values: `WORKSPACE` (the project is located in the Databricks workspace) and `GIT` (the project is located in a cloud Git provider).

   * - `warehouse_id`
     - String
     - ID of the SQL warehouse to connect to. If provided, the profile and connection details are automatically generated and provided to dbt. It can be overridden on a per-command basis by using the `--profiles-dir` command line argument.

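As a sketch, the following task runs two dbt commands against a SQL warehouse. The task key, project path, warehouse ID placeholder, and catalog and schema names are hypothetical, and the compute and library settings that a real dbt task also needs are omitted.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      tasks:
        - task_key: dbt_run
          # Compute and dbt library settings are omitted from this sketch.
          dbt_task:
            project_directory: ./dbt_project # hypothetical path
            commands: # every command must start with dbt; at most 10 commands
              - dbt deps
              - dbt run
            warehouse_id: "<warehouse-id>" # placeholder
            catalog: main # only allowed because warehouse_id is set
            schema: analytics # hypothetical schema
```
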
### jobs.<name>.tasks.depends_on

**`Type: Sequence`**

An optional array of objects specifying the dependency graph of the task. All tasks specified in this field must complete before executing this task. The task will run only if the `run_if` condition is true. The key is `task_key`, and the value is the name assigned to the dependent task.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `outcome`
     - String
     - Can only be specified on condition task dependencies. The outcome of the dependent task that must be met for this task to run.

   * - `task_key`
     - String
     - The name of the task this task depends on.

### jobs.<name>.tasks.email_notifications

**`Type: Map`**

An optional set of email addresses that is notified when runs of this task begin or complete as well as when this task is deleted. The default behavior is to not send any emails.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `no_alert_for_skipped_runs`
     - Boolean
     - If true, do not send email to recipients specified in `on_failure` if the run is skipped. This field is deprecated. Use the `notification_settings.no_alert_for_skipped_runs` field instead.

   * - `on_duration_warning_threshold_exceeded`
     - Sequence
     - A list of email addresses to be notified when the duration of a run exceeds the threshold specified for the `RUN_DURATION_SECONDS` metric in the `health` field. If no rule for the `RUN_DURATION_SECONDS` metric is specified in the `health` field for the job, notifications are not sent.

   * - `on_failure`
     - Sequence
     - A list of email addresses to be notified when a run unsuccessfully completes. A run is considered to have completed unsuccessfully if it ends with an `INTERNAL_ERROR` `life_cycle_state` or a `FAILED` or `TIMED_OUT` `result_state`. If this is not specified on job creation, reset, or update, the list is empty, and notifications are not sent.

   * - `on_start`
     - Sequence
     - A list of email addresses to be notified when a run begins. If not specified on job creation, reset, or update, the list is empty, and notifications are not sent.

   * - `on_streaming_backlog_exceeded`
     - Sequence
     - A list of email addresses to notify when any streaming backlog thresholds are exceeded for any stream. Streaming backlog thresholds can be set in the `health` field using the following metrics: `STREAMING_BACKLOG_BYTES`, `STREAMING_BACKLOG_RECORDS`, `STREAMING_BACKLOG_SECONDS`, or `STREAMING_BACKLOG_FILES`. Alerting is based on the 10-minute average of these metrics. If the issue persists, notifications are resent every 30 minutes.

   * - `on_success`
     - Sequence
     - A list of email addresses to be notified when a run successfully completes. A run is considered to have completed successfully if it ends with a `TERMINATED` `life_cycle_state` and a `SUCCESS` `result_state`. If not specified on job creation, reset, or update, the list is empty, and notifications are not sent.

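A minimal sketch of task-level email notifications follows. The task key, notebook path, and email addresses are hypothetical.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./src/ingest.ipynb # hypothetical path
          email_notifications:
            on_start:
              - someone@example.com
            on_success:
              - someone@example.com
            on_failure:
              - oncall@example.com
```
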
### jobs.<name>.tasks.for_each_task

**`Type: Map`**

The task executes a nested task for every input provided when the `for_each_task` field is present.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `concurrency`
     - Integer
     - An optional maximum allowed number of concurrent runs of the task. Set this value if you want to be able to execute multiple runs of the task concurrently.

   * - `inputs`
     - String
     - Array for the task to iterate on. This can be a JSON string or a reference to an array parameter.

   * - `task`
     - Map
     - Configuration for the task that will be run for each element in the array.

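The following sketch iterates a nested notebook task over a JSON array, running at most two iterations at a time. The task keys, the notebook path, and the parameter name `region` are hypothetical. The `{{input}}` reference and the `base_parameters` key are assumptions about how the current element is passed to the nested task, since neither is described in this section.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      tasks:
        - task_key: process_regions
          for_each_task:
            inputs: '["us", "eu", "apac"]' # JSON array passed as a string
            concurrency: 2 # at most two nested runs at once
            task:
              task_key: process_region_iteration
              notebook_task:
                notebook_path: ./src/process_region.ipynb # hypothetical path
                base_parameters:
                  region: "{{input}}" # assumed reference to the current element
```
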
### jobs.<name>.tasks.health

**`Type: Map`**

An optional set of health rules that can be defined for this job.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `rules`
     - Sequence
     - See [_](#jobs.<name>.tasks.health.rules).

### jobs.<name>.tasks.health.rules

**`Type: Sequence`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `metric`
     - String
     - Specifies the health metric that is being evaluated for a particular health rule. Valid values: `RUN_DURATION_SECONDS` (expected total time for a run, in seconds), `STREAMING_BACKLOG_BYTES` (an estimate of the maximum bytes of data waiting to be consumed across all streams), `STREAMING_BACKLOG_RECORDS` (an estimate of the maximum offset lag across all streams), `STREAMING_BACKLOG_SECONDS` (an estimate of the maximum consumer delay across all streams), and `STREAMING_BACKLOG_FILES` (an estimate of the maximum number of outstanding files across all streams). The four `STREAMING_BACKLOG_*` metrics are in Public Preview.

   * - `op`
     - String
     - Specifies the operator used to compare the health metric value with the specified threshold.

   * - `value`
     - Integer
     - Specifies the threshold value that the health metric should obey to satisfy the health rule.

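As a sketch, the following task defines a run-duration rule and notifies a recipient when the threshold is exceeded. The task key, notebook path, email address, and one-hour threshold are hypothetical, and `GREATER_THAN` is assumed to be an accepted `op` value, because this section does not list the allowed operators.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./src/ingest.ipynb # hypothetical path
          health:
            rules:
              - metric: RUN_DURATION_SECONDS
                op: GREATER_THAN # assumed operator value
                value: 3600 # threshold in seconds
          email_notifications:
            # Sent only because a RUN_DURATION_SECONDS rule is defined above.
            on_duration_warning_threshold_exceeded:
              - oncall@example.com
```
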
### jobs.<name>.tasks.libraries

**`Type: Sequence`**

An optional list of libraries to be installed on the cluster. The default value is an empty list.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `cran`
     - Map
     - Specification of a CRAN library to be installed as part of the library. See [_](#jobs.<name>.tasks.libraries.cran).

   * - `egg`
     - String
     - Deprecated. URI of the egg library to install. Installing Python egg files is deprecated and is not supported in Databricks Runtime 14.0 and above.

   * - `jar`
     - String
     - URI of the JAR library to install. Supported URIs include Workspace paths, Unity Catalog Volumes paths, and S3 URIs. For example: `{ "jar": "/Workspace/path/to/library.jar" }`, `{ "jar" : "/Volumes/path/to/library.jar" }` or `{ "jar": "s3://my-bucket/library.jar" }`. If S3 is used, make sure the cluster has read access on the library. You may need to launch the cluster with an IAM role to access the S3 URI.

   * - `maven`
     - Map
     - Specification of a Maven library to be installed. For example: `{ "coordinates": "org.jsoup:jsoup:1.7.2" }`. See [_](#jobs.<name>.tasks.libraries.maven).

   * - `pypi`
     - Map
     - Specification of a PyPI library to be installed. For example: `{ "package": "simplejson" }`. See [_](#jobs.<name>.tasks.libraries.pypi).

   * - `requirements`
     - String
     - URI of the requirements.txt file to install. Only Workspace paths and Unity Catalog Volumes paths are supported. For example: `{ "requirements": "/Workspace/path/to/requirements.txt" }` or `{ "requirements" : "/Volumes/path/to/requirements.txt" }`.

   * - `whl`
     - String
     - URI of the wheel library to install. Supported URIs include Workspace paths, Unity Catalog Volumes paths, and S3 URIs. For example: `{ "whl": "/Workspace/path/to/library.whl" }`, `{ "whl" : "/Volumes/path/to/library.whl" }` or `{ "whl": "s3://my-bucket/library.whl" }`. If S3 is used, make sure the cluster has read access on the library. You may need to launch the cluster with an IAM role to access the S3 URI.

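The JSON examples in the table translate to YAML as in the following sketch, which installs a pinned PyPI package, a Maven artifact with one exclusion, a wheel, and a requirements file. The task key and cluster ID placeholder are hypothetical, and the library values reuse the examples from the table.

```yaml
resources:
  jobs:
    my_job: # hypothetical job name
      name: my_job
      tasks:
        - task_key: ingest
          existing_cluster_id: "<cluster-id>" # placeholder
          libraries:
            - pypi:
                package: simplejson==3.8.0
            - maven:
                coordinates: org.jsoup:jsoup:1.7.2
                exclusions:
                  - slf4j:slf4j
            - whl: /Workspace/path/to/library.whl
            - requirements: /Workspace/path/to/requirements.txt
```
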
### jobs.<name>.tasks.libraries.cran

**`Type: Map`**

Specification of a CRAN library to be installed as part of the library.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `package`
     - String
     - The name of the CRAN package to install.

   * - `repo`
     - String
     - The repository where the package can be found. If not specified, the default CRAN repo is used.

### jobs.<name>.tasks.libraries.maven

**`Type: Map`**

Specification of a Maven library to be installed. For example: `{ "coordinates": "org.jsoup:jsoup:1.7.2" }`.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `coordinates`
     - String
     - Gradle-style Maven coordinates. For example: `org.jsoup:jsoup:1.7.2`.

   * - `exclusions`
     - Sequence
     - List of dependencies to exclude. For example: `["slf4j:slf4j", "*:hadoop-client"]`. Maven dependency exclusions: https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html.

   * - `repo`
     - String
     - Maven repo to install the Maven package from. If omitted, both Maven Central Repository and Spark Packages are searched.

### jobs.<name>.tasks.libraries.pypi

**`Type: Map`**

Specification of a PyPI library to be installed. For example: `{ "package": "simplejson" }`.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `package`
     - String
     - The name of the PyPI package to install. An optional exact version specification is also supported. Examples: `simplejson` and `simplejson==3.8.0`.

   * - `repo`
     - String
     - The repository where the package can be found. If not specified, the default pip index is used.

### jobs.<name>.tasks.new_cluster
|
||
|
||
**`Type: Map`**
|
||
|
||
If `new_cluster` is specified, a description of a new cluster that is created for each run.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `apply_policy_default_values`
|
||
- Boolean
|
||
- When set to true, fixed and default values from the policy will be used for fields that are omitted. When set to false, only fixed values from the policy will be applied.
|
||
|
||
* - `autoscale`
|
||
- Map
|
||
- Parameters needed in order to automatically scale clusters up and down based on load. Note: autoscaling works best with Databricks Runtime versions 3.0 or later. See [_](#jobs.<name>.tasks.new_cluster.autoscale).
|
||
|
||
* - `autotermination_minutes`
|
||
- Integer
|
||
- Automatically terminates the cluster after it is inactive for this time in minutes. If not set, this cluster will not be automatically terminated. If specified, the threshold must be between 10 and 10000 minutes. Users can also set this value to 0 to explicitly disable automatic termination.
|
||
|
||
* - `aws_attributes`
|
||
- Map
|
||
- Attributes related to clusters running on Amazon Web Services. If not specified at cluster creation, a set of default values will be used. See [_](#jobs.<name>.tasks.new_cluster.aws_attributes).
|
||
|
||
* - `azure_attributes`
|
||
- Map
|
||
- Attributes related to clusters running on Microsoft Azure. If not specified at cluster creation, a set of default values will be used. See [_](#jobs.<name>.tasks.new_cluster.azure_attributes).
|
||
|
||
* - `cluster_log_conf`
|
||
- Map
|
||
- The configuration for delivering Spark logs to a long-term storage destination. Two kinds of destinations (DBFS and S3) are supported. Only one destination can be specified for one cluster. If this configuration is given, the logs are delivered to the destination every 5 minutes. The destination of driver logs is `$destination/$clusterId/driver`, while the destination of executor logs is `$destination/$clusterId/executor`. See [_](#jobs.<name>.tasks.new_cluster.cluster_log_conf).
|
||
|
||
* - `cluster_name`
|
||
- String
|
||
- Cluster name requested by the user. This doesn't have to be unique. If not specified at creation, the cluster name will be an empty string.
|
||
|
||
* - `custom_tags`
|
||
- Map
|
||
- Additional tags for cluster resources. Databricks will tag all cluster resources (e.g., AWS instances and EBS volumes) with these tags in addition to `default_tags`. Notes: Databricks currently allows at most 45 custom tags. Clusters can only reuse cloud resources if the resources' tags are a subset of the cluster tags.
|
||
|
||
* - `data_security_mode`
|
||
- String
|
||
- Data security mode decides what data governance model to use when accessing data from a cluster. The following modes can only be used with `kind`. * `DATA_SECURITY_MODE_AUTO`: Databricks will choose the most appropriate access mode depending on your compute configuration. * `DATA_SECURITY_MODE_STANDARD`: Alias for `USER_ISOLATION`. * `DATA_SECURITY_MODE_DEDICATED`: Alias for `SINGLE_USER`. The following modes can be used regardless of `kind`. * `NONE`: No security isolation for multiple users sharing the cluster. Data governance features are not available in this mode. * `SINGLE_USER`: A secure cluster that can only be exclusively used by a single user specified in `single_user_name`. Most programming languages, cluster features and data governance features are available in this mode. * `USER_ISOLATION`: A secure cluster that can be shared by multiple users. Cluster users are fully isolated so that they cannot see each other's data and credentials. Most data governance features are supported in this mode, but programming languages and cluster features might be limited. The following modes are deprecated starting with Databricks Runtime 15.0 and will be removed in future Databricks Runtime versions: * `LEGACY_TABLE_ACL`: This mode is for users migrating from legacy Table ACL clusters. * `LEGACY_PASSTHROUGH`: This mode is for users migrating from legacy Passthrough on high concurrency clusters. * `LEGACY_SINGLE_USER`: This mode is for users migrating from legacy Passthrough on standard clusters. * `LEGACY_SINGLE_USER_STANDARD`: This mode has neither UC nor passthrough enabled.
|
||
|
||
* - `docker_image`
|
||
- Map
|
||
- See [_](#jobs.<name>.tasks.new_cluster.docker_image).
|
||
|
||
* - `driver_instance_pool_id`
|
||
- String
|
||
- The optional ID of the instance pool to which the driver of the cluster belongs. If the driver pool is not assigned, the cluster uses the instance pool specified by `instance_pool_id`.
|
||
|
||
* - `driver_node_type_id`
|
||
- String
|
||
- The node type of the Spark driver. This field is optional; if unset, the driver node type is set to the same value as `node_type_id`.
|
||
|
||
* - `enable_elastic_disk`
|
||
- Boolean
|
||
- Autoscaling Local Storage: when enabled, this cluster will dynamically acquire additional disk space when its Spark workers are running low on disk space. This feature requires specific AWS permissions to function correctly - refer to the User Guide for more details.
|
||
|
||
* - `enable_local_disk_encryption`
|
||
- Boolean
|
||
- Whether to enable LUKS on the local disks of cluster VMs.
|
||
|
||
* - `gcp_attributes`
|
||
- Map
|
||
- Attributes related to clusters running on Google Cloud Platform. If not specified at cluster creation, a set of default values will be used. See [_](#jobs.<name>.tasks.new_cluster.gcp_attributes).
|
||
|
||
* - `init_scripts`
|
||
- Sequence
|
||
- The configuration for storing init scripts. Any number of destinations can be specified. The scripts are executed sequentially in the order provided. If `cluster_log_conf` is specified, init script logs are sent to `<destination>/<cluster-ID>/init_scripts`. See [_](#jobs.<name>.tasks.new_cluster.init_scripts).
|
||
|
||
* - `instance_pool_id`
|
||
- String
|
||
- The optional ID of the instance pool to which the cluster belongs.
|
||
|
||
* - `is_single_node`
|
||
- Boolean
|
||
- This field can only be used with `kind`. When set to true, Databricks will automatically set single node related `custom_tags`, `spark_conf`, and `num_workers`.
|
||
|
||
* - `kind`
|
||
- String
|
||
-
|
||
|
||
* - `node_type_id`
|
||
- String
|
||
- This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory or compute intensive workloads. A list of available node types can be retrieved by using the `clusters/listNodeTypes` API call.
|
||
|
||
* - `num_workers`
|
||
- Integer
|
||
- Number of worker nodes that this cluster should have. A cluster has one Spark Driver and `num_workers` Executors for a total of `num_workers` + 1 Spark nodes. Note: When reading the properties of a cluster, this field reflects the desired number of workers rather than the actual current number of workers. For instance, if a cluster is resized from 5 to 10 workers, this field will immediately be updated to reflect the target size of 10 workers, whereas the workers listed in `spark_info` will gradually increase from 5 to 10 as the new nodes are provisioned.
|
||
|
||
* - `policy_id`
|
||
- String
|
||
- The ID of the cluster policy used to create the cluster if applicable.
|
||
|
||
* - `runtime_engine`
|
||
- String
|
||
- Determines the cluster's runtime engine, either standard or Photon. This field is not compatible with legacy `spark_version` values that contain `-photon-`. Remove `-photon-` from the `spark_version` and set `runtime_engine` to `PHOTON`. If left unspecified, the runtime engine defaults to standard unless the `spark_version` contains `-photon-`, in which case Photon will be used.
|
||
|
||
* - `single_user_name`
|
||
- String
|
||
- Single user name if `data_security_mode` is `SINGLE_USER`.
|
||
|
||
* - `spark_conf`
|
||
- Map
|
||
- An object containing a set of optional, user-specified Spark configuration key-value pairs. Users can also pass in a string of extra JVM options to the driver and the executors via `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` respectively.
|
||
|
||
* - `spark_env_vars`
|
||
- Map
|
||
- An object containing a set of optional, user-specified environment variable key-value pairs. Note that a key-value pair of the form (X,Y) is exported as is (i.e., `export X='Y'`) while launching the driver and workers. To specify an additional set of `SPARK_DAEMON_JAVA_OPTS`, we recommend appending them to `$SPARK_DAEMON_JAVA_OPTS` as shown in the example below. This ensures that all default Databricks-managed environment variables are included as well. Example Spark environment variables: `{"SPARK_WORKER_MEMORY": "28000m", "SPARK_LOCAL_DIRS": "/local_disk0"}` or `{"SPARK_DAEMON_JAVA_OPTS": "$SPARK_DAEMON_JAVA_OPTS -Dspark.shuffle.service.enabled=true"}`
|
||
|
||
* - `spark_version`
|
||
- String
|
||
- The Spark version of the cluster, e.g. `3.3.x-scala2.11`. A list of available Spark versions can be retrieved by using the `clusters/sparkVersions` API call.
|
||
|
||
* - `ssh_public_keys`
|
||
- Sequence
|
||
- SSH public key contents that will be added to each Spark node in this cluster. The corresponding private keys can be used to log in with the user name `ubuntu` on port `2200`. Up to 10 keys can be specified.
|
||
|
||
* - `use_ml_runtime`
|
||
- Boolean
|
||
- This field can only be used with `kind`. `effective_spark_version` is determined by `spark_version` (DBR release), this field `use_ml_runtime`, and whether `node_type_id` is a GPU node or not.
|
||
|
||
* - `workload_type`
|
||
- Map
|
||
- See [_](#jobs.<name>.tasks.new_cluster.workload_type).
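
A minimal sketch of a job cluster definition using a few of the keys above. The runtime version, node type, Spark setting, and tag are illustrative assumptions; substitute values that are valid for your workspace and cloud.

```yaml
resources:
  jobs:
    my_job:                                   # hypothetical job key
      tasks:
        - task_key: main                      # hypothetical task key
          new_cluster:
            spark_version: 15.4.x-scala2.12   # assumed runtime version
            node_type_id: i3.xlarge           # assumed AWS node type
            num_workers: 2                    # 2 executors + 1 driver = 3 Spark nodes
            spark_conf:
              spark.speculation: "true"       # Spark conf values are strings
            custom_tags:
              team: data-eng                  # hypothetical tag
```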
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.autoscale
|
||
|
||
**`Type: Map`**
|
||
|
||
Parameters needed in order to automatically scale clusters up and down based on load.
|
||
Note: autoscaling works best with Databricks Runtime versions 3.0 or later.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `max_workers`
|
||
- Integer
|
||
- The maximum number of workers to which the cluster can scale up when overloaded. Note that `max_workers` must be strictly greater than `min_workers`.
|
||
|
||
* - `min_workers`
|
||
- Integer
|
||
- The minimum number of workers to which the cluster can scale down when underutilized. It is also the initial number of workers the cluster will have after creation.
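
A minimal sketch of an autoscaling cluster. The job and task keys, runtime version, and node type are assumptions for illustration. The cluster starts with `min_workers` workers, and `max_workers` is strictly greater than `min_workers`, as the table requires.

```yaml
resources:
  jobs:
    my_job:                                   # hypothetical job key
      tasks:
        - task_key: main                      # hypothetical task key
          new_cluster:
            spark_version: 15.4.x-scala2.12   # assumed runtime version
            node_type_id: i3.xlarge           # assumed AWS node type
            autoscale:
              min_workers: 1                  # also the initial worker count
              max_workers: 4                  # must be greater than min_workers
```

This sketch uses `autoscale` in place of a fixed `num_workers`.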
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.aws_attributes
|
||
|
||
**`Type: Map`**
|
||
|
||
Attributes related to clusters running on Amazon Web Services.
|
||
If not specified at cluster creation, a set of default values will be used.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `availability`
|
||
- String
|
||
- Availability type used for all subsequent nodes past the `first_on_demand` ones. Note: If `first_on_demand` is zero, this availability type will be used for the entire cluster.
|
||
|
||
* - `ebs_volume_count`
|
||
- Integer
|
||
- The number of volumes launched for each instance. Users can choose up to 10 volumes. This feature is only enabled for supported node types. Legacy node types cannot specify custom EBS volumes. For node types with no instance store, at least one EBS volume needs to be specified; otherwise, cluster creation will fail. These EBS volumes will be mounted at `/ebs0`, `/ebs1`, and so on. Instance store volumes will be mounted at `/local_disk0`, `/local_disk1`, and so on. If EBS volumes are attached, Databricks will configure Spark to use only the EBS volumes for scratch storage because heterogeneously sized scratch devices can lead to inefficient disk utilization. If no EBS volumes are attached, Databricks will configure Spark to use instance store volumes. Note that if EBS volumes are specified, the Spark configuration `spark.local.dir` will be overridden.
|
||
|
||
* - `ebs_volume_iops`
|
||
- Integer
|
||
- The IOPS to use for each disk when using gp3 volumes. If this is not set, the maximum performance of a gp2 volume with the same volume size will be used.
|
||
|
||
* - `ebs_volume_size`
|
||
- Integer
|
||
- The size of each EBS volume (in GiB) launched for each instance. For general purpose SSD, this value must be within the range 100 - 4096. For throughput optimized HDD, this value must be within the range 500 - 4096.
|
||
|
||
* - `ebs_volume_throughput`
|
||
- Integer
|
||
- The throughput to use for each disk when using gp3 volumes. If this is not set, the maximum performance of a gp2 volume with the same volume size will be used.
|
||
|
||
* - `ebs_volume_type`
|
||
- String
|
||
- The type of EBS volumes that will be launched with this cluster.
|
||
|
||
* - `first_on_demand`
|
||
- Integer
|
||
- The first `first_on_demand` nodes of the cluster will be placed on on-demand instances. If this value is greater than 0, the cluster driver node in particular will be placed on an on-demand instance. If this value is greater than or equal to the current cluster size, all nodes will be placed on on-demand instances. If this value is less than the current cluster size, `first_on_demand` nodes will be placed on on-demand instances and the remainder will be placed on `availability` instances. Note that this value does not affect cluster size and cannot currently be mutated over the lifetime of a cluster.
|
||
|
||
* - `instance_profile_arn`
|
||
- String
|
||
- Nodes for this cluster will only be placed on AWS instances with this instance profile. If omitted, nodes will be placed on instances without an IAM instance profile. The instance profile must have previously been added to the Databricks environment by an account administrator. This feature may only be available to certain customer plans. If this field is omitted, the default from the configuration is used if it exists.
|
||
|
||
* - `spot_bid_price_percent`
|
||
- Integer
|
||
- The bid price for AWS spot instances, as a percentage of the corresponding instance type's on-demand price. For example, if this field is set to 50, and the cluster needs a new `r3.xlarge` spot instance, then the bid price is half of the price of on-demand `r3.xlarge` instances. Similarly, if this field is set to 200, the bid price is twice the price of on-demand `r3.xlarge` instances. If not specified, the default value is 100. When spot instances are requested for this cluster, only spot instances whose bid price percentage matches this field will be considered. Note that, for safety, this field cannot exceed 10000.
|
||
|
||
* - `zone_id`
|
||
- String
|
||
- Identifier for the availability zone/datacenter in which the cluster resides. This string will be of a form like "us-west-2a". The provided availability zone must be in the same region as the Databricks deployment. For example, "us-west-2a" is not a valid zone id if the Databricks deployment resides in the "us-east-1" region. This is an optional field at cluster creation, and if not specified, a default zone will be used. If the zone specified is "auto", Databricks tries to place the cluster in a zone with high availability, and retries placement in a different AZ if there is not enough capacity. The list of available zones as well as the default value can be found by using the `List Zones` method.
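
A minimal sketch of AWS attributes that keeps the driver on an on-demand instance and runs the remaining nodes on spot instances with a fallback. The job and task keys, runtime version, and node type are assumptions; `SPOT_WITH_FALLBACK` and `GENERAL_PURPOSE_SSD` are values of the corresponding Clusters API enums.

```yaml
resources:
  jobs:
    my_job:                                   # hypothetical job key
      tasks:
        - task_key: main                      # hypothetical task key
          new_cluster:
            spark_version: 15.4.x-scala2.12   # assumed runtime version
            node_type_id: i3.xlarge           # assumed AWS node type
            num_workers: 4
            aws_attributes:
              first_on_demand: 1              # driver lands on an on-demand instance
              availability: SPOT_WITH_FALLBACK
              spot_bid_price_percent: 100     # the documented default
              zone_id: auto                   # let Databricks choose and retry zones
              ebs_volume_type: GENERAL_PURPOSE_SSD
              ebs_volume_count: 1
              ebs_volume_size: 100            # GiB; SSD range is 100 - 4096
```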
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.azure_attributes
|
||
|
||
**`Type: Map`**
|
||
|
||
Attributes related to clusters running on Microsoft Azure.
|
||
If not specified at cluster creation, a set of default values will be used.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `availability`
|
||
- String
|
||
- Availability type used for all subsequent nodes past the `first_on_demand` ones. Note: If `first_on_demand` is zero (which only happens on pool clusters), this availability type will be used for the entire cluster.
|
||
|
||
* - `first_on_demand`
|
||
- Integer
|
||
- The first `first_on_demand` nodes of the cluster will be placed on on-demand instances. This value should be greater than 0, to make sure the cluster driver node is placed on an on-demand instance. If this value is greater than or equal to the current cluster size, all nodes will be placed on on-demand instances. If this value is less than the current cluster size, `first_on_demand` nodes will be placed on on-demand instances and the remainder will be placed on `availability` instances. Note that this value does not affect cluster size and cannot currently be mutated over the lifetime of a cluster.
|
||
|
||
* - `log_analytics_info`
|
||
- Map
|
||
- Defines values necessary to configure and run Azure Log Analytics agent. See [_](#jobs.<name>.tasks.new_cluster.azure_attributes.log_analytics_info).
|
||
|
||
* - `spot_bid_max_price`
|
||
- Any
|
||
- The maximum bid price to be used for Azure spot instances. The maximum price for the bid cannot be higher than the on-demand price of the instance. If not specified, the default value is -1, which specifies that the instance cannot be evicted on the basis of price, and only on the basis of availability. The value must be either greater than 0 or equal to -1.
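
A minimal sketch of Azure attributes for a job cluster. The job and task keys, runtime version, and node type are assumptions; `SPOT_WITH_FALLBACK_AZURE` is a value of the Clusters API availability enum for Azure.

```yaml
resources:
  jobs:
    my_job:                                   # hypothetical job key
      tasks:
        - task_key: main                      # hypothetical task key
          new_cluster:
            spark_version: 15.4.x-scala2.12   # assumed runtime version
            node_type_id: Standard_DS3_v2     # assumed Azure node type
            num_workers: 4
            azure_attributes:
              first_on_demand: 1              # keeps the driver on an on-demand instance
              availability: SPOT_WITH_FALLBACK_AZURE
              spot_bid_max_price: -1          # evict only on availability, not price
```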
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.azure_attributes.log_analytics_info
|
||
|
||
**`Type: Map`**
|
||
|
||
Defines values necessary to configure and run Azure Log Analytics agent.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `log_analytics_primary_key`
|
||
- String
|
||
- The primary key for the Azure Log Analytics agent configuration.
|
||
|
||
* - `log_analytics_workspace_id`
|
||
- String
|
||
- The workspace ID for the Azure Log Analytics agent configuration.
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.cluster_log_conf
|
||
|
||
**`Type: Map`**
|
||
|
||
The configuration for delivering Spark logs to a long-term storage destination.
Two kinds of destinations (DBFS and S3) are supported. Only one destination can be specified
for one cluster. If this configuration is given, the logs are delivered to the destination every
5 minutes. The destination of driver logs is `$destination/$clusterId/driver`, while
the destination of executor logs is `$destination/$clusterId/executor`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `dbfs`
|
||
- Map
|
||
- `destination` must be provided. For example: `{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`. See [_](#jobs.<name>.tasks.new_cluster.cluster_log_conf.dbfs).
|
||
|
||
* - `s3`
|
||
- Map
|
||
- `destination` and either `region` or `endpoint` must be provided. For example: `{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`. The cluster IAM role is used to access S3, so make sure the cluster IAM role in `instance_profile_arn` has permission to write data to the S3 destination. See [_](#jobs.<name>.tasks.new_cluster.cluster_log_conf.s3).
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.cluster_log_conf.dbfs
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` must be provided. For example:
|
||
`{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- DBFS destination, e.g. `dbfs:/my/path`.
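
A minimal sketch of delivering cluster logs to DBFS, using the destination from the example above. The job and task keys are hypothetical, and other cluster settings are omitted.

```yaml
resources:
  jobs:
    my_job:                      # hypothetical job key
      tasks:
        - task_key: main         # hypothetical task key
          new_cluster:
            cluster_log_conf:
              dbfs:
                destination: dbfs:/home/cluster_log
```

With this configuration, driver logs would be written under `dbfs:/home/cluster_log/<cluster-ID>/driver` and executor logs under `dbfs:/home/cluster_log/<cluster-ID>/executor`.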
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.cluster_log_conf.s3
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` and either `region` or `endpoint` must be provided. For example:
`{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`
The cluster IAM role is used to access S3. Make sure the cluster IAM role in
`instance_profile_arn` has permission to write data to the S3 destination.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `canned_acl`
|
||
- String
|
||
- (Optional) Set canned access control list for the logs, e.g. `bucket-owner-full-control`. If `canned_acl` is set, make sure the cluster IAM role has `s3:PutObjectAcl` permission on the destination bucket and prefix. The full list of possible canned ACLs can be found at http://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl. Note that by default only the object owner gets full control. If you are using a cross-account role for writing data, you may want to set `bucket-owner-full-control` so that the bucket owner can read the logs.
|
||
|
||
* - `destination`
|
||
- String
|
||
- S3 destination, e.g. `s3://my-bucket/some-prefix`. Logs are delivered using the cluster IAM role, so make sure the cluster IAM role is set and has write access to the destination. You cannot use AWS keys to deliver logs.
|
||
|
||
* - `enable_encryption`
|
||
- Boolean
|
||
- (Optional) Flag to enable server side encryption, `false` by default.
|
||
|
||
* - `encryption_type`
|
||
- String
|
||
- (Optional) The encryption type, either `sse-s3` or `sse-kms`. It is used only when encryption is enabled. The default type is `sse-s3`.
|
||
|
||
* - `endpoint`
|
||
- String
|
||
- S3 endpoint, e.g. `https://s3-us-west-2.amazonaws.com`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
|
||
|
||
* - `kms_key`
|
||
- String
|
||
- (Optional) The KMS key to use if encryption is enabled and the encryption type is set to `sse-kms`.
|
||
|
||
* - `region`
|
||
- String
|
||
- S3 region, e.g. `us-west-2`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
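
A minimal sketch of delivering cluster logs to S3 with server-side encryption. The job and task keys are hypothetical, and the bucket name is the placeholder from the example above. This assumes the cluster's instance profile can write to the bucket and has `s3:PutObjectAcl` permission, because `canned_acl` is set.

```yaml
resources:
  jobs:
    my_job:                      # hypothetical job key
      tasks:
        - task_key: main         # hypothetical task key
          new_cluster:
            cluster_log_conf:
              s3:
                destination: s3://cluster_log_bucket/prefix
                region: us-west-2            # set region or endpoint
                enable_encryption: true
                encryption_type: sse-s3
                canned_acl: bucket-owner-full-control
```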
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.docker_image
|
||
|
||
**`Type: Map`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `basic_auth`
|
||
- Map
|
||
- See [_](#jobs.<name>.tasks.new_cluster.docker_image.basic_auth).
|
||
|
||
* - `url`
|
||
- String
|
||
- URL of the docker image.
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.docker_image.basic_auth
|
||
|
||
**`Type: Map`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `password`
|
||
- String
|
||
- Password of the user
|
||
|
||
* - `username`
|
||
- String
|
||
- Name of the user
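
A minimal sketch of a custom container image with registry credentials supplied through bundle variables. The image URL, variable names, and job and task keys are all hypothetical. Passing credentials as variables rather than literals is a design choice of this sketch, not a requirement stated in this reference.

```yaml
variables:
  registry_username:             # hypothetical variable
    description: Username for the container registry
  registry_password:             # hypothetical variable
    description: Password for the container registry

resources:
  jobs:
    my_job:                      # hypothetical job key
      tasks:
        - task_key: main         # hypothetical task key
          new_cluster:
            docker_image:
              url: registry.example.com/my-team/my-image:latest   # hypothetical image
              basic_auth:
                username: ${var.registry_username}
                password: ${var.registry_password}
```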
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.gcp_attributes
|
||
|
||
**`Type: Map`**
|
||
|
||
Attributes related to clusters running on Google Cloud Platform.
|
||
If not specified at cluster creation, a set of default values will be used.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `availability`
|
||
- String
|
||
- This field determines whether the instance pool will contain preemptible VMs, on-demand VMs, or preemptible VMs with a fallback to on-demand VMs if the former is unavailable.
|
||
|
||
* - `boot_disk_size`
|
||
- Integer
|
||
- Boot disk size in GB.
|
||
|
||
* - `google_service_account`
|
||
- String
|
||
- If provided, the cluster will impersonate the Google service account when accessing Google Cloud services (like GCS). The Google service account must have previously been added to the Databricks environment by an account administrator.
|
||
|
||
* - `local_ssd_count`
|
||
- Integer
|
||
- If provided, each node (workers and driver) in the cluster will have this number of local SSDs attached. Each local SSD is 375GB in size. Refer to [GCP documentation](https://cloud.google.com/compute/docs/disks/local-ssd#choose_number_local_ssds) for the supported number of local SSDs for each instance type.
|
||
|
||
* - `use_preemptible_executors`
|
||
- Boolean
|
||
- This field determines whether the Spark executors will be scheduled to run on preemptible VMs (when set to true) versus standard compute engine VMs (when set to false; default). Note: Soon to be deprecated, use the `availability` field instead.
|
||
|
||
* - `zone_id`
|
||
- String
|
||
- Identifier for the availability zone in which the cluster resides. This can be one of the following: `HA` (the default), which spreads nodes across availability zones for a Databricks deployment region for high availability; `AUTO`, in which Databricks picks an availability zone to schedule the cluster on; or a GCP availability zone, chosen from the zones available for the machine type and region listed at https://cloud.google.com/compute/docs/regions-zones.
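
A minimal sketch of Google Cloud attributes for a job cluster. The job and task keys, runtime version, and node type are assumptions; `PREEMPTIBLE_WITH_FALLBACK_GCP` is a value of the Clusters API availability enum for Google Cloud.

```yaml
resources:
  jobs:
    my_job:                                   # hypothetical job key
      tasks:
        - task_key: main                      # hypothetical task key
          new_cluster:
            spark_version: 15.4.x-scala2.12   # assumed runtime version
            node_type_id: n2-standard-4       # assumed GCP node type
            num_workers: 2
            gcp_attributes:
              availability: PREEMPTIBLE_WITH_FALLBACK_GCP
              zone_id: AUTO                   # Databricks picks the zone
              local_ssd_count: 1              # each local SSD is 375GB
```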
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.init_scripts
|
||
|
||
**`Type: Sequence`**
|
||
|
||
The configuration for storing init scripts. Any number of destinations can be specified. The scripts are executed sequentially in the order provided. If `cluster_log_conf` is specified, init script logs are sent to `<destination>/<cluster-ID>/init_scripts`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `abfss`
|
||
- Map
|
||
- `destination` must be provided. For example: `{ "abfss" : { "destination" : "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>" } }`. See [_](#jobs.<name>.tasks.new_cluster.init_scripts.abfss).
|
||
|
||
* - `dbfs`
|
||
- Map
|
||
- `destination` must be provided. For example: `{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`. See [_](#jobs.<name>.tasks.new_cluster.init_scripts.dbfs).
|
||
|
||
* - `file`
|
||
- Map
|
||
- `destination` must be provided. For example: `{ "file" : { "destination" : "file:/my/local/file.sh" } }`. See [_](#jobs.<name>.tasks.new_cluster.init_scripts.file).
|
||
|
||
* - `gcs`
|
||
- Map
|
||
- `destination` must be provided. For example: `{ "gcs": { "destination": "gs://my-bucket/file.sh" } }`. See [_](#jobs.<name>.tasks.new_cluster.init_scripts.gcs).
|
||
|
||
* - `s3`
|
||
- Map
|
||
- `destination` and either `region` or `endpoint` must be provided. For example: `{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`. The cluster IAM role is used to access S3, so make sure the cluster IAM role in `instance_profile_arn` has permission to write data to the S3 destination. See [_](#jobs.<name>.tasks.new_cluster.init_scripts.s3).
|
||
|
||
* - `volumes`
|
||
- Map
|
||
- `destination` must be provided. For example: `{ "volumes" : { "destination" : "/Volumes/my-init.sh" } }`. See [_](#jobs.<name>.tasks.new_cluster.init_scripts.volumes).
|
||
|
||
* - `workspace`
|
||
- Map
|
||
- `destination` must be provided. For example: `{ "workspace" : { "destination" : "/Users/user1@databricks.com/my-init.sh" } }`. See [_](#jobs.<name>.tasks.new_cluster.init_scripts.workspace).
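
A minimal sketch of a cluster that runs two init scripts in order. Because `init_scripts` is a sequence, each entry names one storage type and its `destination`. The job and task keys and the Unity Catalog volume path are hypothetical.

```yaml
resources:
  jobs:
    my_job:                      # hypothetical job key
      tasks:
        - task_key: main         # hypothetical task key
          new_cluster:
            init_scripts:
              # Runs first: a script stored as a workspace file
              - workspace:
                  destination: /Users/user1@databricks.com/my-init.sh
              # Runs second: a script stored in a Unity Catalog volume
              - volumes:
                  destination: /Volumes/main/default/scripts/my-init.sh   # hypothetical path
```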
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.init_scripts.abfss
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` must be provided. For example:
`{ "abfss" : { "destination" : "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- abfss destination, e.g. `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>`.
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.init_scripts.dbfs
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` must be provided. For example:
|
||
`{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- DBFS destination, e.g. `dbfs:/my/path`.
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.init_scripts.file
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` must be provided. For example:
|
||
`{ "file" : { "destination" : "file:/my/local/file.sh" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- Local file destination, e.g. `file:/my/local/file.sh`.
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.init_scripts.gcs
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` must be provided. For example:
|
||
`{ "gcs": { "destination": "gs://my-bucket/file.sh" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- GCS destination/URI, e.g. `gs://my-bucket/some-prefix`
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.init_scripts.s3
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` and either `region` or `endpoint` must be provided. For example:
`{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`
The cluster IAM role is used to access S3. Make sure the cluster IAM role in
`instance_profile_arn` has permission to write data to the S3 destination.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `canned_acl`
|
||
- String
|
||
- (Optional) Set canned access control list for the logs, e.g. `bucket-owner-full-control`. If `canned_acl` is set, make sure the cluster IAM role has `s3:PutObjectAcl` permission on the destination bucket and prefix. The full list of possible canned ACLs can be found at http://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl. Note that by default only the object owner gets full control. If you are using a cross-account role for writing data, you may want to set `bucket-owner-full-control` so that the bucket owner can read the logs.
|
||
|
||
* - `destination`
|
||
- String
|
||
- S3 destination, e.g. `s3://my-bucket/some-prefix`. Logs are delivered using the cluster IAM role, so make sure the cluster IAM role is set and has write access to the destination. You cannot use AWS keys to deliver logs.
|
||
|
||
* - `enable_encryption`
|
||
- Boolean
|
||
- (Optional) Flag to enable server side encryption, `false` by default.
|
||
|
||
* - `encryption_type`
|
||
- String
|
||
- (Optional) The encryption type, either `sse-s3` or `sse-kms`. It is used only when encryption is enabled. The default type is `sse-s3`.
|
||
|
||
* - `endpoint`
|
||
- String
|
||
- S3 endpoint, e.g. `https://s3-us-west-2.amazonaws.com`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
|
||
|
||
* - `kms_key`
|
||
- String
|
||
- (Optional) The KMS key to use if encryption is enabled and the encryption type is set to `sse-kms`.
|
||
|
||
* - `region`
|
||
- String
|
||
- S3 region, e.g. `us-west-2`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.init_scripts.volumes
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` must be provided. For example:
|
||
`{ "volumes" : { "destination" : "/Volumes/my-init.sh" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- Unity Catalog Volumes file destination, e.g. `/Volumes/my-init.sh`
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.init_scripts.workspace
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` must be provided. For example:
|
||
`{ "workspace" : { "destination" : "/Users/user1@databricks.com/my-init.sh" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- Workspace files destination, e.g. `/Users/user1@databricks.com/my-init.sh`.
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.workload_type
|
||
|
||
**`Type: Map`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `clients`
|
||
- Map
|
||
- Defines what type of clients can use the cluster, for example notebooks or jobs. See [_](#jobs.<name>.tasks.new_cluster.workload_type.clients).
|
||
|
||
|
||
### jobs.<name>.tasks.new_cluster.workload_type.clients
|
||
|
||
**`Type: Map`**
|
||
|
||
Defines what type of clients can use the cluster, for example notebooks or jobs.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `jobs`
|
||
- Boolean
|
||
- With jobs set, the cluster can be used for jobs
|
||
|
||
* - `notebooks`
|
||
- Boolean
|
||
- With notebooks set, this cluster can be used for notebooks
|
||
|
||
|
||
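The `clients` mapping could be used to restrict a cluster to job workloads, as in this minimal sketch. The job key, task, and cluster sizing are assumptions made for illustration only.

```yaml
resources:
  jobs:
    jobs_only_cluster_example:          # hypothetical job key
      name: jobs-only-cluster-example
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: /Workspace/Users/user1@databricks.com/my-notebook  # hypothetical
          new_cluster:
            spark_version: 15.4.x-scala2.12   # assumed runtime version
            node_type_id: i3.xlarge           # assumed node type
            num_workers: 2
            workload_type:
              clients:
                jobs: true        # the cluster can run jobs
                notebooks: false  # the cluster cannot be attached to notebooks
```
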
### jobs.<name>.tasks.notebook_task
|
||
|
||
**`Type: Map`**
|
||
|
||
The task runs a notebook when the `notebook_task` field is present.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `base_parameters`
|
||
- Map
|
||
- Base parameters to be used for each run of this job. If the run is initiated by a call to :method:jobs/runNow with parameters specified, the two parameter maps are merged. If the same key is specified in `base_parameters` and in `run-now`, the value from `run-now` is used. Use [Task parameter variables](https://docs.databricks.com/jobs.html#parameter-variables) to set parameters containing information about job runs. If the notebook takes a parameter that is not specified in the job’s `base_parameters` or the `run-now` override parameters, the default value from the notebook is used. Retrieve these parameters in a notebook using [dbutils.widgets.get](https://docs.databricks.com/dev-tools/databricks-utils.html#dbutils-widgets). The JSON representation of this field cannot exceed 1 MB.
|
||
|
||
* - `notebook_path`
|
||
- String
|
||
- The path of the notebook to be run in the Databricks workspace or remote repository. For notebooks stored in the Databricks workspace, the path must be absolute and begin with a slash. For notebooks stored in a remote repository, the path must be relative. This field is required.
|
||
|
||
* - `source`
|
||
- String
|
||
- Optional location type of the notebook. When set to `WORKSPACE`, the notebook will be retrieved from the local Databricks workspace. When set to `GIT`, the notebook will be retrieved from a Git repository defined in `git_source`. If the value is empty, the task will use `GIT` if `git_source` is defined and `WORKSPACE` otherwise. * `WORKSPACE`: Notebook is located in Databricks workspace. * `GIT`: Notebook is located in cloud Git provider.
|
||
|
||
* - `warehouse_id`
|
||
- String
|
||
- Optional `warehouse_id` to run the notebook on a SQL warehouse. Classic SQL warehouses are not supported; use serverless or pro SQL warehouses. Note that SQL warehouses only support SQL cells; if the notebook contains non-SQL cells, the run will fail.
|
||
|
||
|
||
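A notebook task with base parameters might be declared as in the sketch below. The job key, notebook path, and parameter names are hypothetical, and compute settings are omitted for brevity.

```yaml
resources:
  jobs:
    notebook_example:                   # hypothetical job key
      name: notebook-example
      tasks:
        - task_key: run_notebook
          notebook_task:
            # Absolute workspace path, because the source is WORKSPACE
            notebook_path: /Workspace/Users/user1@databricks.com/reports/daily
            source: WORKSPACE
            base_parameters:
              # The notebook can read this value with dbutils.widgets.get("run_date")
              run_date: "2024-01-01"
              region: emea
```

If a run is started with a `run-now` parameter that uses the key `run_date`, that value replaces the one shown here, following the merge rule described for `base_parameters`.
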
### jobs.<name>.tasks.notification_settings
|
||
|
||
**`Type: Map`**
|
||
|
||
Optional notification settings that are used when sending notifications to each of the `email_notifications` and `webhook_notifications` for this task.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `alert_on_last_attempt`
|
||
- Boolean
|
||
- If true, do not send notifications to recipients specified in `on_start` for the retried runs and do not send notifications to recipients specified in `on_failure` until the last retry of the run.
|
||
|
||
* - `no_alert_for_canceled_runs`
|
||
- Boolean
|
||
- If true, do not send notifications to recipients specified in `on_failure` if the run is canceled.
|
||
|
||
* - `no_alert_for_skipped_runs`
|
||
- Boolean
|
||
- If true, do not send notifications to recipients specified in `on_failure` if the run is skipped.
|
||
|
||
|
||
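The sketch below shows one way these settings could be combined with task-level `email_notifications`. The job key, notebook path, retry count, and email address are placeholders chosen for illustration.

```yaml
resources:
  jobs:
    quiet_retries_example:              # hypothetical job key
      name: quiet-retries-example
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: /Workspace/Users/user1@databricks.com/my-notebook  # hypothetical
          max_retries: 2
          email_notifications:
            on_failure:
              - oncall@example.com      # hypothetical recipient
          notification_settings:
            alert_on_last_attempt: true       # notify on failure only after the last retry
            no_alert_for_canceled_runs: true  # stay silent when a run is canceled
            no_alert_for_skipped_runs: true   # stay silent when a run is skipped
```
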
### jobs.<name>.tasks.pipeline_task

**`Type: Map`**

The task triggers a pipeline update when the `pipeline_task` field is present. Only pipelines configured to use triggered mode are supported.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `full_refresh`
     - Boolean
     - If true, triggers a full refresh on the Delta Live Tables pipeline.

   * - `pipeline_id`
     - String
     - The ID of the pipeline to execute.

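A pipeline task could reference a pipeline defined elsewhere in the same bundle, as sketched here. The job key and the pipeline key `my_pipeline` are hypothetical; the example assumes a pipeline with that key is declared under `resources.pipelines`.

```yaml
resources:
  jobs:
    pipeline_example:                   # hypothetical job key
      name: pipeline-example
      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            # Resolves to the ID of the pipeline declared as resources.pipelines.my_pipeline
            pipeline_id: ${resources.pipelines.my_pipeline.id}
            full_refresh: false
```
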
### jobs.<name>.tasks.python_wheel_task

**`Type: Map`**

The task runs a Python wheel when the `python_wheel_task` field is present.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `entry_point`
     - String
     - Named entry point to use. If it does not exist in the metadata of the package, the function is executed from the package directly using `$packageName.$entryPoint()`.

   * - `named_parameters`
     - Map
     - Command-line parameters passed to the Python wheel task in the form of `["--name=task", "--data=dbfs:/path/to/data.json"]`. Leave it empty if `parameters` is not null.

   * - `package_name`
     - String
     - Name of the package to execute.

   * - `parameters`
     - Sequence
     - Command-line parameters passed to the Python wheel task. Leave it empty if `named_parameters` is not null.

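The following sketch shows how a wheel task using positional `parameters` might look. The package name, entry point, job key, and wheel path are hypothetical, and compute settings are left out.

```yaml
resources:
  jobs:
    wheel_example:                      # hypothetical job key
      name: wheel-example
      tasks:
        - task_key: run_wheel
          python_wheel_task:
            package_name: my_package    # hypothetical package
            entry_point: main           # hypothetical entry point
            parameters:
              - "--name=task"
              - "--data=dbfs:/path/to/data.json"
          libraries:
            - whl: ../dist/*.whl        # hypothetical path to the built wheel
```

Only `parameters` is set here; `named_parameters` is left out because the two fields are alternatives.
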
### jobs.<name>.tasks.run_job_task
|
||
|
||
**`Type: Map`**
|
||
|
||
The task triggers another job when the `run_job_task` field is present.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `dbt_commands`
|
||
- Sequence
|
||
- An array of commands to execute for jobs with the dbt task, for example `"dbt_commands": ["dbt deps", "dbt seed", "dbt run"]`.
|
||
|
||
* - `jar_params`
|
||
- Sequence
|
||
- A list of parameters for jobs with Spark JAR tasks, for example `"jar_params": ["john doe", "35"]`. The parameters are used to invoke the main function of the main class specified in the Spark JAR task. If not specified upon `run-now`, it defaults to an empty list. jar_params cannot be specified in conjunction with notebook_params. The JSON representation of this field (for example `{"jar_params":["john doe","35"]}`) cannot exceed 10,000 bytes. Use [Task parameter variables](https://docs.databricks.com/jobs.html#parameter-variables) to set parameters containing information about job runs.
|
||
|
||
* - `job_id`
|
||
- Integer
|
||
- ID of the job to trigger.
|
||
|
||
* - `job_parameters`
|
||
- Map
|
||
- Job-level parameters used to trigger the job.
|
||
|
||
* - `notebook_params`
|
||
- Map
|
||
- A map from keys to values for jobs with notebook task, for example `"notebook_params": {"name": "john doe", "age": "35"}`. The map is passed to the notebook and is accessible through the [dbutils.widgets.get](https://docs.databricks.com/dev-tools/databricks-utils.html) function. If not specified upon `run-now`, the triggered run uses the job’s base parameters. notebook_params cannot be specified in conjunction with jar_params. Use [Task parameter variables](https://docs.databricks.com/jobs.html#parameter-variables) to set parameters containing information about job runs. The JSON representation of this field (for example `{"notebook_params":{"name":"john doe","age":"35"}}`) cannot exceed 10,000 bytes.
|
||
|
||
* - `pipeline_params`
|
||
- Map
|
||
- Controls whether the pipeline should perform a full refresh. See [_](#jobs.<name>.tasks.run_job_task.pipeline_params).
|
||
|
||
* - `python_named_params`
|
||
- Map
|
||
-
|
||
|
||
* - `python_params`
|
||
- Sequence
|
||
- A list of parameters for jobs with Python tasks, for example `"python_params": ["john doe", "35"]`. The parameters are passed to the Python file as command-line parameters. If specified upon `run-now`, it overwrites the parameters specified in the job settings. The JSON representation of this field (for example `{"python_params":["john doe","35"]}`) cannot exceed 10,000 bytes. Use [Task parameter variables](https://docs.databricks.com/jobs.html#parameter-variables) to set parameters containing information about job runs. Important: These parameters accept only Latin characters (ASCII character set). Using non-ASCII characters returns an error. Examples of invalid, non-ASCII characters are Chinese characters, Japanese kanji, and emojis.
|
||
|
||
* - `spark_submit_params`
|
||
- Sequence
|
||
- A list of parameters for jobs with a spark-submit task, for example `"spark_submit_params": ["--class", "org.apache.spark.examples.SparkPi"]`. The parameters are passed to the spark-submit script as command-line parameters. If specified upon `run-now`, it overwrites the parameters specified in the job settings. The JSON representation of this field (for example `{"spark_submit_params":["--class","org.apache.spark.examples.SparkPi"]}`) cannot exceed 10,000 bytes. Use [Task parameter variables](https://docs.databricks.com/jobs.html#parameter-variables) to set parameters containing information about job runs. Important: These parameters accept only Latin characters (ASCII character set). Using non-ASCII characters returns an error. Examples of invalid, non-ASCII characters are Chinese characters, Japanese kanji, and emojis.
|
||
|
||
* - `sql_params`
|
||
- Map
|
||
- A map from keys to values for jobs with SQL task, for example `"sql_params": {"name": "john doe", "age": "35"}`. The SQL alert task does not support custom parameters.
|
||
|
||
|
||
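A task that triggers another job in the same bundle might be written as in this sketch. The job keys `orchestrator_example` and `downstream_job` and the parameter name are hypothetical; the example assumes `downstream_job` is declared under `resources.jobs`.

```yaml
resources:
  jobs:
    orchestrator_example:               # hypothetical job key
      name: orchestrator-example
      tasks:
        - task_key: trigger_downstream
          run_job_task:
            # Resolves to the ID of the job declared as resources.jobs.downstream_job
            job_id: ${resources.jobs.downstream_job.id}
            job_parameters:
              environment: dev          # hypothetical job-level parameter
```
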
### jobs.<name>.tasks.run_job_task.pipeline_params

**`Type: Map`**

Controls whether the pipeline should perform a full refresh.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `full_refresh`
     - Boolean
     - If true, triggers a full refresh on the Delta Live Tables pipeline.

### jobs.<name>.tasks.spark_jar_task
|
||
|
||
**`Type: Map`**
|
||
|
||
The task runs a JAR when the `spark_jar_task` field is present.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `jar_uri`
|
||
- String
|
||
- Deprecated since 04/2016. Provide a `jar` through the `libraries` field instead. For an example, see :method:jobs/create.
|
||
|
||
* - `main_class_name`
|
||
- String
|
||
- The full name of the class containing the main method to be executed. This class must be contained in a JAR provided as a library. The code must use `SparkContext.getOrCreate` to obtain a Spark context; otherwise, runs of the job fail.
|
||
|
||
* - `parameters`
|
||
- Sequence
|
||
- Parameters passed to the main method. Use [Task parameter variables](https://docs.databricks.com/jobs.html#parameter-variables) to set parameters containing information about job runs.
|
||
|
||
|
||
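A JAR task could be paired with a `jar` library entry, as the description of `jar_uri` suggests. In this sketch the class name, arguments, JAR location, and job key are all hypothetical, and compute settings are omitted.

```yaml
resources:
  jobs:
    jar_example:                        # hypothetical job key
      name: jar-example
      tasks:
        - task_key: run_jar
          spark_jar_task:
            main_class_name: com.example.Main   # hypothetical class
            parameters:
              - "--input"
              - "dbfs:/path/to/input"
          libraries:
            # The JAR that contains main_class_name (hypothetical location)
            - jar: /Volumes/main/default/jars/app.jar
```
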
### jobs.<name>.tasks.spark_python_task
|
||
|
||
**`Type: Map`**
|
||
|
||
The task runs a Python file when the `spark_python_task` field is present.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `parameters`
|
||
- Sequence
|
||
- Command line parameters passed to the Python file. Use [Task parameter variables](https://docs.databricks.com/jobs.html#parameter-variables) to set parameters containing information about job runs.
|
||
|
||
* - `python_file`
|
||
- String
|
||
- The Python file to be executed. Cloud file URIs (such as dbfs:/, s3:/, adls:/, gcs:/) and workspace paths are supported. For Python files stored in the Databricks workspace, the path must be absolute and begin with `/`. For files stored in a remote repository, the path must be relative. This field is required.
|
||
|
||
* - `source`
|
||
- String
|
||
- Optional location type of the Python file. When set to `WORKSPACE` or not specified, the file will be retrieved from the local Databricks workspace or cloud location (if the `python_file` has a URI format). When set to `GIT`, the Python file will be retrieved from a Git repository defined in `git_source`. * `WORKSPACE`: The Python file is located in a Databricks workspace or at a cloud filesystem URI. * `GIT`: The Python file is located in a remote Git repository.
|
||
|
||
|
||
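The sketch below shows a Python file task that reads its script from the workspace. The file path, arguments, and job key are placeholders, and compute settings are omitted.

```yaml
resources:
  jobs:
    python_file_example:                # hypothetical job key
      name: python-file-example
      tasks:
        - task_key: run_script
          spark_python_task:
            # Absolute workspace path, because the source is WORKSPACE
            python_file: /Workspace/Users/user1@databricks.com/etl/main.py
            source: WORKSPACE
            parameters:
              - "--date"
              - "2024-01-01"
```
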
### jobs.<name>.tasks.spark_submit_task

**`Type: Map`**

(Legacy) The task runs the spark-submit script when the `spark_submit_task` field is present. This task can run only on new clusters and is not compatible with serverless compute.

In the `new_cluster` specification, `libraries` and `spark_conf` are not supported. Instead, use `--jars` and `--py-files` to add Java and Python libraries and `--conf` to set the Spark configurations.

`master`, `deploy-mode`, and `executor-cores` are automatically configured by Databricks; you _cannot_ specify them in parameters.

By default, the Spark submit job uses all available memory (excluding reserved memory for Databricks services). You can set `--driver-memory` and `--executor-memory` to a smaller value to leave some room for off-heap usage.

The `--jars`, `--py-files`, and `--files` arguments support DBFS and S3 paths.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `parameters`
     - Sequence
     - Command-line parameters passed to spark-submit. Use [Task parameter variables](https://docs.databricks.com/jobs.html#parameter-variables) to set parameters containing information about job runs.

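Because this task type runs only on new clusters, a configuration along the following lines could be used. The job key, cluster settings, JAR path, and memory value are assumptions made for illustration.

```yaml
resources:
  jobs:
    spark_submit_example:               # hypothetical job key
      name: spark-submit-example
      tasks:
        - task_key: submit
          new_cluster:
            spark_version: 15.4.x-scala2.12   # assumed runtime version
            node_type_id: i3.xlarge           # assumed node type
            num_workers: 2
          spark_submit_task:
            parameters:
              - "--class"
              - org.apache.spark.examples.SparkPi
              - "--executor-memory"
              - 4g                            # leaves some room for off-heap usage
              - dbfs:/path/to/examples.jar    # hypothetical application JAR
              - "10"
```

Libraries and Spark configuration are passed through `parameters` here, since `libraries` and `spark_conf` are not supported in the `new_cluster` specification for this task type.
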
### jobs.<name>.tasks.sql_task

**`Type: Map`**

The task runs a SQL query or file, or it refreshes a SQL alert or a legacy SQL dashboard when the `sql_task` field is present.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `alert`
     - Map
     - If specified, indicates that this job must refresh a SQL alert. See [_](#jobs.<name>.tasks.sql_task.alert).

   * - `dashboard`
     - Map
     - If specified, indicates that this job must refresh a SQL dashboard. See [_](#jobs.<name>.tasks.sql_task.dashboard).

   * - `file`
     - Map
     - If specified, indicates that this job runs a SQL file in a remote Git repository. See [_](#jobs.<name>.tasks.sql_task.file).

   * - `parameters`
     - Map
     - Parameters to be used for each run of this job. The SQL alert task does not support custom parameters.

   * - `query`
     - Map
     - If specified, indicates that this job must execute a SQL query. See [_](#jobs.<name>.tasks.sql_task.query).

   * - `warehouse_id`
     - String
     - The canonical identifier of the SQL warehouse. Serverless or pro SQL warehouses are recommended. Classic SQL warehouses are only supported for SQL alert, dashboard, and query tasks and are limited to scheduled single-task jobs.

### jobs.<name>.tasks.sql_task.alert

**`Type: Map`**

If specified, indicates that this job must refresh a SQL alert.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `alert_id`
     - String
     - The canonical identifier of the SQL alert.

   * - `pause_subscriptions`
     - Boolean
     - If true, the alert notifications are not sent to subscribers.

   * - `subscriptions`
     - Sequence
     - If specified, alert notifications are sent to subscribers. See [_](#jobs.<name>.tasks.sql_task.alert.subscriptions).

### jobs.<name>.tasks.sql_task.alert.subscriptions
|
||
|
||
**`Type: Sequence`**
|
||
|
||
If specified, alert notifications are sent to subscribers.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination_id`
|
||
- String
|
||
- The canonical identifier of the destination to receive email notification. This parameter is mutually exclusive with user_name. You cannot set both destination_id and user_name for subscription notifications.
|
||
|
||
* - `user_name`
|
||
- String
|
||
- The user name to receive the subscription email. This parameter is mutually exclusive with destination_id. You cannot set both destination_id and user_name for subscription notifications.
|
||
|
||
|
||
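An alert refresh with two subscribers might be configured as in the sketch below. The warehouse, alert, and destination identifiers, the user name, and the job key are hypothetical placeholders.

```yaml
resources:
  jobs:
    alert_example:                      # hypothetical job key
      name: alert-example
      tasks:
        - task_key: refresh_alert
          sql_task:
            warehouse_id: "1234567890abcdef"    # hypothetical warehouse ID
            alert:
              alert_id: "11111111-2222-3333-4444-555555555555"   # hypothetical
              pause_subscriptions: false
              subscriptions:
                # Each entry sets either user_name or destination_id, never both
                - user_name: user1@databricks.com
                - destination_id: "66666666-7777-8888-9999-000000000000"   # hypothetical
```
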
### jobs.<name>.tasks.sql_task.dashboard
|
||
|
||
**`Type: Map`**
|
||
|
||
If specified, indicates that this job must refresh a SQL dashboard.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `custom_subject`
|
||
- String
|
||
- Subject of the email sent to subscribers of this task.
|
||
|
||
* - `dashboard_id`
|
||
- String
|
||
- The canonical identifier of the SQL dashboard.
|
||
|
||
* - `pause_subscriptions`
|
||
- Boolean
|
||
- If true, the dashboard snapshot is not taken, and emails are not sent to subscribers.
|
||
|
||
* - `subscriptions`
|
||
- Sequence
|
||
- If specified, dashboard snapshots are sent to subscriptions. See [_](#jobs.<name>.tasks.sql_task.dashboard.subscriptions).
|
||
|
||
|
||
### jobs.<name>.tasks.sql_task.dashboard.subscriptions
|
||
|
||
**`Type: Sequence`**
|
||
|
||
If specified, dashboard snapshots are sent to subscriptions.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination_id`
|
||
- String
|
||
- The canonical identifier of the destination to receive email notification. This parameter is mutually exclusive with user_name. You cannot set both destination_id and user_name for subscription notifications.
|
||
|
||
* - `user_name`
|
||
- String
|
||
- The user name to receive the subscription email. This parameter is mutually exclusive with destination_id. You cannot set both destination_id and user_name for subscription notifications.
|
||
|
||
|
||
### jobs.<name>.tasks.sql_task.file
|
||
|
||
**`Type: Map`**
|
||
|
||
If specified, indicates that this job runs a SQL file in a remote Git repository.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `path`
|
||
- String
|
||
- Path of the SQL file. Must be relative if the source is a remote Git repository and absolute for workspace paths.
|
||
|
||
* - `source`
|
||
- String
|
||
- Optional location type of the SQL file. When set to `WORKSPACE`, the SQL file will be retrieved from the local Databricks workspace. When set to `GIT`, the SQL file will be retrieved from a Git repository defined in `git_source`. If the value is empty, the task will use `GIT` if `git_source` is defined and `WORKSPACE` otherwise. * `WORKSPACE`: SQL file is located in Databricks workspace. * `GIT`: SQL file is located in cloud Git provider.
|
||
|
||
|
||
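A SQL file task that reads its file from the workspace could look like the following sketch. The warehouse ID, file path, parameter name, and job key are assumptions for illustration.

```yaml
resources:
  jobs:
    sql_file_example:                   # hypothetical job key
      name: sql-file-example
      tasks:
        - task_key: run_sql_file
          sql_task:
            warehouse_id: "1234567890abcdef"    # hypothetical warehouse ID
            file:
              # Absolute path, because the source is WORKSPACE
              path: /Workspace/Users/user1@databricks.com/queries/report.sql
              source: WORKSPACE
            parameters:
              start_date: "2024-01-01"          # hypothetical parameter
```

With `source: GIT`, the `path` would instead be relative to the root of the repository defined in the job's `git_source`.
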
### jobs.<name>.tasks.sql_task.query
|
||
|
||
**`Type: Map`**
|
||
|
||
If specified, indicates that this job must execute a SQL query.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `query_id`
|
||
- String
|
||
- The canonical identifier of the SQL query.
|
||
|
||
|
||
### jobs.<name>.tasks.webhook_notifications
|
||
|
||
**`Type: Map`**
|
||
|
||
A collection of system notification IDs to notify when runs of this task begin or complete. The default behavior is to not send any system notifications.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `on_duration_warning_threshold_exceeded`
|
||
- Sequence
|
||
- An optional list of system notification IDs to call when the duration of a run exceeds the threshold specified for the `RUN_DURATION_SECONDS` metric in the `health` field. A maximum of 3 destinations can be specified for the `on_duration_warning_threshold_exceeded` property. See [_](#jobs.<name>.tasks.webhook_notifications.on_duration_warning_threshold_exceeded).
|
||
|
||
* - `on_failure`
|
||
- Sequence
|
||
- An optional list of system notification IDs to call when the run fails. A maximum of 3 destinations can be specified for the `on_failure` property. See [_](#jobs.<name>.tasks.webhook_notifications.on_failure).
|
||
|
||
* - `on_start`
|
||
- Sequence
|
||
- An optional list of system notification IDs to call when the run starts. A maximum of 3 destinations can be specified for the `on_start` property. See [_](#jobs.<name>.tasks.webhook_notifications.on_start).
|
||
|
||
* - `on_streaming_backlog_exceeded`
|
||
- Sequence
|
||
- An optional list of system notification IDs to call when any streaming backlog thresholds are exceeded for any stream. Streaming backlog thresholds can be set in the `health` field using the following metrics: `STREAMING_BACKLOG_BYTES`, `STREAMING_BACKLOG_RECORDS`, `STREAMING_BACKLOG_SECONDS`, or `STREAMING_BACKLOG_FILES`. Alerting is based on the 10-minute average of these metrics. If the issue persists, notifications are resent every 30 minutes. A maximum of 3 destinations can be specified for the `on_streaming_backlog_exceeded` property. See [_](#jobs.<name>.tasks.webhook_notifications.on_streaming_backlog_exceeded).
|
||
|
||
* - `on_success`
|
||
- Sequence
|
||
- An optional list of system notification IDs to call when the run completes successfully. A maximum of 3 destinations can be specified for the `on_success` property. See [_](#jobs.<name>.tasks.webhook_notifications.on_success).
|
||
|
||
|
||
### jobs.<name>.tasks.webhook_notifications.on_duration_warning_threshold_exceeded
|
||
|
||
**`Type: Sequence`**
|
||
|
||
An optional list of system notification IDs to call when the duration of a run exceeds the threshold specified for the `RUN_DURATION_SECONDS` metric in the `health` field. A maximum of 3 destinations can be specified for the `on_duration_warning_threshold_exceeded` property.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `id`
|
||
- String
|
||
-
|
||
|
||
|
||
### jobs.<name>.tasks.webhook_notifications.on_failure
|
||
|
||
**`Type: Sequence`**
|
||
|
||
An optional list of system notification IDs to call when the run fails. A maximum of 3 destinations can be specified for the `on_failure` property.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `id`
|
||
- String
|
||
-
|
||
|
||
|
||
### jobs.<name>.tasks.webhook_notifications.on_start
|
||
|
||
**`Type: Sequence`**
|
||
|
||
An optional list of system notification IDs to call when the run starts. A maximum of 3 destinations can be specified for the `on_start` property.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `id`
|
||
- String
|
||
-
|
||
|
||
|
||
### jobs.<name>.tasks.webhook_notifications.on_streaming_backlog_exceeded
|
||
|
||
**`Type: Sequence`**
|
||
|
||
An optional list of system notification IDs to call when any streaming backlog thresholds are exceeded for any stream.
|
||
Streaming backlog thresholds can be set in the `health` field using the following metrics: `STREAMING_BACKLOG_BYTES`, `STREAMING_BACKLOG_RECORDS`, `STREAMING_BACKLOG_SECONDS`, or `STREAMING_BACKLOG_FILES`.
|
||
Alerting is based on the 10-minute average of these metrics. If the issue persists, notifications are resent every 30 minutes.
|
||
A maximum of 3 destinations can be specified for the `on_streaming_backlog_exceeded` property.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `id`
|
||
- String
|
||
-
|
||
|
||
|
||
### jobs.<name>.tasks.webhook_notifications.on_success
|
||
|
||
**`Type: Sequence`**
|
||
|
||
An optional list of system notification IDs to call when the run completes successfully. A maximum of 3 destinations can be specified for the `on_success` property.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `id`
|
||
- String
|
||
-
|
||
|
||
|
||
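Task-level webhook notifications might be declared as sketched here. The notification IDs and job key are hypothetical; each `id` is assumed to refer to a system notification destination that already exists in the workspace.

```yaml
resources:
  jobs:
    webhook_example:                    # hypothetical job key
      name: webhook-example
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: /Workspace/Users/user1@databricks.com/my-notebook  # hypothetical
          webhook_notifications:
            on_start:
              - id: "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"   # hypothetical
            on_failure:
              # Up to 3 destinations can be listed per event
              - id: "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
              - id: "ffffffff-0000-1111-2222-333333333333"   # hypothetical
```
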
### jobs.<name>.trigger
|
||
|
||
**`Type: Map`**
|
||
|
||
A configuration to trigger a run when certain conditions are met. The default behavior is that the job runs only when triggered by clicking “Run Now” in the Jobs UI or sending an API request to `runNow`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `file_arrival`
|
||
- Map
|
||
- File arrival trigger settings. See [_](#jobs.<name>.trigger.file_arrival).
|
||
|
||
* - `pause_status`
|
||
- String
|
||
- Whether this trigger is paused or not.
|
||
|
||
* - `periodic`
|
||
- Map
|
||
- Periodic trigger settings. See [_](#jobs.<name>.trigger.periodic).
|
||
|
||
* - `table`
|
||
- Map
|
||
- Old table trigger settings name. Deprecated in favor of `table_update`. See [_](#jobs.<name>.trigger.table).
|
||
|
||
* - `table_update`
|
||
- Map
|
||
- See [_](#jobs.<name>.trigger.table_update).
|
||
|
||
|
||
### jobs.<name>.trigger.file_arrival

**`Type: Map`**

File arrival trigger settings.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `min_time_between_triggers_seconds`
     - Integer
     - If set, the trigger starts a run only after the specified amount of time has passed since the last time the trigger fired. The minimum allowed value is 60 seconds.

   * - `url`
     - String
     - URL to be monitored for file arrivals. The path must point to the root or a subpath of the external location.

   * - `wait_after_last_change_seconds`
     - Integer
     - If set, the trigger starts a run only after no file activity has occurred for the specified amount of time. This makes it possible to wait for a batch of incoming files to arrive before triggering a run. The minimum allowed value is 60 seconds.

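A file arrival trigger could be attached to a job as in the following sketch. The storage URL, timing values, and job key are hypothetical, and `UNPAUSED` is assumed to be an accepted `pause_status` value.

```yaml
resources:
  jobs:
    file_arrival_example:               # hypothetical job key
      name: file-arrival-example
      trigger:
        pause_status: UNPAUSED          # assumed value
        file_arrival:
          url: s3://my-bucket/landing/  # hypothetical external location path
          min_time_between_triggers_seconds: 300   # at most one run every 5 minutes
          wait_after_last_change_seconds: 120      # wait for 2 quiet minutes first
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: /Workspace/Users/user1@databricks.com/ingest  # hypothetical
```

Both timing values are above the 60-second minimum stated in the table.
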
### jobs.<name>.trigger.periodic
|
||
|
||
**`Type: Map`**
|
||
|
||
Periodic trigger settings.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `interval`
|
||
- Integer
|
||
- The interval at which the trigger should run.
|
||
|
||
* - `unit`
|
||
- String
|
||
- The unit of time for the interval.
|
||
|
||
|
||
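A periodic trigger might be expressed as shown below. The job key and interval are hypothetical, and `HOURS` is assumed to be one of the accepted `unit` values, which this reference does not enumerate.

```yaml
resources:
  jobs:
    periodic_example:                   # hypothetical job key
      name: periodic-example
      trigger:
        periodic:
          interval: 4
          unit: HOURS                   # assumed unit value
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: /Workspace/Users/user1@databricks.com/my-notebook  # hypothetical
```
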
### jobs.<name>.trigger.table
|
||
|
||
**`Type: Map`**
|
||
|
||
Old table trigger settings name. Deprecated in favor of `table_update`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `condition`
|
||
- String
|
||
- The condition, evaluated over the monitored tables, that determines when to trigger a job run.
|
||
|
||
* - `min_time_between_triggers_seconds`
|
||
- Integer
|
||
- If set, the trigger starts a run only after the specified amount of time has passed since the last time the trigger fired. The minimum allowed value is 60 seconds.
|
||
|
||
* - `table_names`
|
||
- Sequence
|
||
- A list of Delta tables to monitor for changes. The table name must be in the format `catalog_name.schema_name.table_name`.
|
||
|
||
* - `wait_after_last_change_seconds`
|
||
- Integer
|
||
- If set, the trigger starts a run only after no table updates have occurred for the specified time and can be used to wait for a series of table updates before triggering a run. The minimum allowed value is 60 seconds.
|
||
|
||
|
||
### jobs.<name>.trigger.table_update
|
||
|
||
**`Type: Map`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `condition`
|
||
- String
|
||
- The condition, evaluated over the monitored tables, that determines when to trigger a job run.
|
||
|
||
* - `min_time_between_triggers_seconds`
|
||
- Integer
|
||
- If set, the trigger starts a run only after the specified amount of time has passed since the last time the trigger fired. The minimum allowed value is 60 seconds.
|
||
|
||
* - `table_names`
|
||
- Sequence
|
||
- A list of Delta tables to monitor for changes. The table name must be in the format `catalog_name.schema_name.table_name`.
|
||
|
||
* - `wait_after_last_change_seconds`
|
||
- Integer
|
||
- If set, the trigger starts a run only after no table updates have occurred for the specified time and can be used to wait for a series of table updates before triggering a run. The minimum allowed value is 60 seconds.
|
||
|
||
|
||
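The sketch below shows how a table update trigger could watch two tables. The table names, timing values, and job key are hypothetical, and `ANY_UPDATED` is assumed to be an accepted `condition` value, which this reference does not list.

```yaml
resources:
  jobs:
    table_update_example:               # hypothetical job key
      name: table-update-example
      trigger:
        table_update:
          condition: ANY_UPDATED        # assumed condition value
          table_names:
            # Names use the catalog_name.schema_name.table_name format
            - main.sales.orders
            - main.sales.customers
          min_time_between_triggers_seconds: 600
          wait_after_last_change_seconds: 60
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: /Workspace/Users/user1@databricks.com/my-notebook  # hypothetical
```
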
### jobs.<name>.webhook_notifications
|
||
|
||
**`Type: Map`**
|
||
|
||
A collection of system notification IDs to notify when runs of this job begin or complete.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `on_duration_warning_threshold_exceeded`
|
||
- Sequence
|
||
- An optional list of system notification IDs to call when the duration of a run exceeds the threshold specified for the `RUN_DURATION_SECONDS` metric in the `health` field. A maximum of 3 destinations can be specified for the `on_duration_warning_threshold_exceeded` property. See [_](#jobs.<name>.webhook_notifications.on_duration_warning_threshold_exceeded).
|
||
|
||
* - `on_failure`
|
||
- Sequence
|
||
- An optional list of system notification IDs to call when the run fails. A maximum of 3 destinations can be specified for the `on_failure` property. See [_](#jobs.<name>.webhook_notifications.on_failure).
|
||
|
||
* - `on_start`
|
||
- Sequence
|
||
- An optional list of system notification IDs to call when the run starts. A maximum of 3 destinations can be specified for the `on_start` property. See [_](#jobs.<name>.webhook_notifications.on_start).
|
||
|
||
* - `on_streaming_backlog_exceeded`
|
||
- Sequence
|
||
- An optional list of system notification IDs to call when any streaming backlog thresholds are exceeded for any stream. Streaming backlog thresholds can be set in the `health` field using the following metrics: `STREAMING_BACKLOG_BYTES`, `STREAMING_BACKLOG_RECORDS`, `STREAMING_BACKLOG_SECONDS`, or `STREAMING_BACKLOG_FILES`. Alerting is based on the 10-minute average of these metrics. If the issue persists, notifications are resent every 30 minutes. A maximum of 3 destinations can be specified for the `on_streaming_backlog_exceeded` property. See [_](#jobs.<name>.webhook_notifications.on_streaming_backlog_exceeded).
|
||
|
||
* - `on_success`
|
||
- Sequence
|
||
- An optional list of system notification IDs to call when the run completes successfully. A maximum of 3 destinations can be specified for the `on_success` property. See [_](#jobs.<name>.webhook_notifications.on_success).
|
||
|
||
|
||
### jobs.<name>.webhook_notifications.on_duration_warning_threshold_exceeded
|
||
|
||
**`Type: Sequence`**
|
||
|
||
An optional list of system notification IDs to call when the duration of a run exceeds the threshold specified for the `RUN_DURATION_SECONDS` metric in the `health` field. A maximum of 3 destinations can be specified for the `on_duration_warning_threshold_exceeded` property.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `id`
|
||
- String
|
||
-
|
||
|
||
|
||
### jobs.<name>.webhook_notifications.on_failure
|
||
|
||
**`Type: Sequence`**
|
||
|
||
An optional list of system notification IDs to call when the run fails. A maximum of 3 destinations can be specified for the `on_failure` property.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `id`
|
||
- String
|
||
-
|
||
|
||
|
||
### jobs.<name>.webhook_notifications.on_start
|
||
|
||
**`Type: Sequence`**
|
||
|
||
An optional list of system notification IDs to call when the run starts. A maximum of 3 destinations can be specified for the `on_start` property.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `id`
|
||
- String
|
||
-
|
||
|
||
|
||
### jobs.<name>.webhook_notifications.on_streaming_backlog_exceeded
|
||
|
||
**`Type: Sequence`**
|
||
|
||
An optional list of system notification IDs to call when any streaming backlog thresholds are exceeded for any stream.
|
||
Streaming backlog thresholds can be set in the `health` field using the following metrics: `STREAMING_BACKLOG_BYTES`, `STREAMING_BACKLOG_RECORDS`, `STREAMING_BACKLOG_SECONDS`, or `STREAMING_BACKLOG_FILES`.
|
||
Alerting is based on the 10-minute average of these metrics. If the issue persists, notifications are resent every 30 minutes.
|
||
A maximum of 3 destinations can be specified for the `on_streaming_backlog_exceeded` property.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `id`
|
||
- String
|
||
-
|
||
|
||
|
||
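**Example**

The following fragment is a minimal sketch of a streaming backlog alert, assuming a threshold on `STREAMING_BACKLOG_RECORDS`. The job key, the threshold value, and the destination ID are illustrative placeholders, and the job's tasks are omitted.

```yaml
resources:
  jobs:
    my_streaming_job: # hypothetical job key
      name: my-streaming-job
      health:
        rules:
          - metric: STREAMING_BACKLOG_RECORDS
            op: GREATER_THAN
            value: 100000 # assumed backlog threshold, in records
      webhook_notifications:
        on_streaming_backlog_exceeded:
          - id: "<notification-destination-id>" # placeholder ID
```
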
### jobs.<name>.webhook_notifications.on_success
|
||
|
||
**`Type: Sequence`**
|
||
|
||
An optional list of system notification IDs to call when the run completes successfully. A maximum of 3 destinations can be specified for the `on_success` property.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `id`
|
||
- String
|
||
-
|
||
|
||
|
||
## model_serving_endpoints
|
||
|
||
**`Type: Map`**
|
||
|
||
The model_serving_endpoint resource allows you to define [model serving endpoints](/api/workspace/servingendpoints/create). See [_](/machine-learning/model-serving/manage-serving-endpoints.md).
|
||
|
||
```yaml
model_serving_endpoints:
  <model_serving_endpoint-name>:
    <model_serving_endpoint-field-name>: <model_serving_endpoint-field-value>
```
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `ai_gateway`
|
||
- Map
|
||
- The AI Gateway configuration for the serving endpoint. NOTE: only external model endpoints are supported as of now. See [_](#model_serving_endpoints.<name>.ai_gateway).
|
||
|
||
* - `config`
|
||
- Map
|
||
- The core config of the serving endpoint. See [_](#model_serving_endpoints.<name>.config).
|
||
|
||
* - `name`
|
||
- String
|
||
- The name of the serving endpoint. This field is required and must be unique across a Databricks workspace. An endpoint name can consist of alphanumeric characters, dashes, and underscores.
|
||
|
||
* - `permissions`
|
||
- Sequence
|
||
- See [_](#model_serving_endpoints.<name>.permissions).
|
||
|
||
* - `rate_limits`
|
||
- Sequence
|
||
- Rate limits to be applied to the serving endpoint. NOTE: this field is deprecated, please use AI Gateway to manage rate limits. See [_](#model_serving_endpoints.<name>.rate_limits).
|
||
|
||
* - `route_optimized`
|
||
- Boolean
|
||
- Enable route optimization for the serving endpoint.
|
||
|
||
* - `tags`
|
||
- Sequence
|
||
- Tags to be attached to the serving endpoint and automatically propagated to billing logs. See [_](#model_serving_endpoints.<name>.tags).
|
||
|
||
|
||
**Example**
|
||
|
||
The following example defines a <UC> model serving endpoint:
|
||
|
||
```yaml
resources:
  model_serving_endpoints:
    uc_model_serving_endpoint:
      name: "uc-model-endpoint"
      config:
        served_entities:
          - entity_name: "myCatalog.mySchema.my-ads-model"
            entity_version: "10"
            workload_size: "Small"
            scale_to_zero_enabled: true
        traffic_config:
          routes:
            - served_model_name: "my-ads-model-10"
              traffic_percentage: 100
      tags:
        - key: "team"
          value: "data science"
```
|
||
|
||
### model_serving_endpoints.<name>.ai_gateway
|
||
|
||
**`Type: Map`**
|
||
|
||
The AI Gateway configuration for the serving endpoint. NOTE: only external model endpoints are supported as of now.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `guardrails`
|
||
- Map
|
||
- Configuration for AI Guardrails to prevent unwanted data and unsafe data in requests and responses. See [_](#model_serving_endpoints.<name>.ai_gateway.guardrails).
|
||
|
||
* - `inference_table_config`
|
||
- Map
|
||
- Configuration for payload logging using inference tables. Use these tables to monitor and audit data being sent to and received from model APIs and to improve model quality. See [_](#model_serving_endpoints.<name>.ai_gateway.inference_table_config).
|
||
|
||
* - `rate_limits`
|
||
- Sequence
|
||
- Configuration for rate limits which can be set to limit endpoint traffic. See [_](#model_serving_endpoints.<name>.ai_gateway.rate_limits).
|
||
|
||
* - `usage_tracking_config`
|
||
- Map
|
||
- Configuration to enable usage tracking using system tables. These tables allow you to monitor operational usage on endpoints and their associated costs. See [_](#model_serving_endpoints.<name>.ai_gateway.usage_tracking_config).
|
||
|
||
|
||
### model_serving_endpoints.<name>.ai_gateway.guardrails
|
||
|
||
**`Type: Map`**
|
||
|
||
Configuration for AI Guardrails to prevent unwanted data and unsafe data in requests and responses.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `input`
|
||
- Map
|
||
- Configuration for input guardrail filters. See [_](#model_serving_endpoints.<name>.ai_gateway.guardrails.input).
|
||
|
||
* - `output`
|
||
- Map
|
||
- Configuration for output guardrail filters. See [_](#model_serving_endpoints.<name>.ai_gateway.guardrails.output).
|
||
|
||
|
||
### model_serving_endpoints.<name>.ai_gateway.guardrails.input
|
||
|
||
**`Type: Map`**
|
||
|
||
Configuration for input guardrail filters.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `invalid_keywords`
|
||
- Sequence
|
||
- List of invalid keywords. AI guardrail uses keyword or string matching to decide if the keyword exists in the request or response content.
|
||
|
||
* - `pii`
|
||
- Map
|
||
- Configuration for guardrail PII filter. See [_](#model_serving_endpoints.<name>.ai_gateway.guardrails.input.pii).
|
||
|
||
* - `safety`
|
||
- Boolean
|
||
- Indicates whether the safety filter is enabled.
|
||
|
||
* - `valid_topics`
|
||
- Sequence
|
||
- The list of allowed topics. Given a chat request, this guardrail flags the request if its topic is not in the allowed topics.
|
||
|
||
|
||
### model_serving_endpoints.<name>.ai_gateway.guardrails.input.pii
|
||
|
||
**`Type: Map`**
|
||
|
||
Configuration for guardrail PII filter.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `behavior`
|
||
- String
|
||
- Behavior for PII filter. Currently only 'BLOCK' is supported. If 'BLOCK' is set for the input guardrail and the request contains PII, the request is not sent to the model server and 400 status code is returned; if 'BLOCK' is set for the output guardrail and the model response contains PII, the PII info in the response is redacted and 400 status code is returned.
|
||
|
||
|
||
### model_serving_endpoints.<name>.ai_gateway.guardrails.output
|
||
|
||
**`Type: Map`**
|
||
|
||
Configuration for output guardrail filters.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `invalid_keywords`
|
||
- Sequence
|
||
- List of invalid keywords. AI guardrail uses keyword or string matching to decide if the keyword exists in the request or response content.
|
||
|
||
* - `pii`
|
||
- Map
|
||
- Configuration for guardrail PII filter. See [_](#model_serving_endpoints.<name>.ai_gateway.guardrails.output.pii).
|
||
|
||
* - `safety`
|
||
- Boolean
|
||
- Indicates whether the safety filter is enabled.
|
||
|
||
* - `valid_topics`
|
||
- Sequence
|
||
- The list of allowed topics. Given a chat request, this guardrail flags the request if its topic is not in the allowed topics.
|
||
|
||
|
||
### model_serving_endpoints.<name>.ai_gateway.guardrails.output.pii
|
||
|
||
**`Type: Map`**
|
||
|
||
Configuration for guardrail PII filter.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `behavior`
|
||
- String
|
||
- Behavior for PII filter. Currently only 'BLOCK' is supported. If 'BLOCK' is set for the input guardrail and the request contains PII, the request is not sent to the model server and 400 status code is returned; if 'BLOCK' is set for the output guardrail and the model response contains PII, the PII info in the response is redacted and 400 status code is returned.
|
||
|
||
|
||
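**Example**

The following fragment sketches how the input and output guardrail keys described above might be combined. The resource key, endpoint name, keyword, and topics are hypothetical, and the endpoint's `config` mapping is omitted.

```yaml
resources:
  model_serving_endpoints:
    my_endpoint: # hypothetical resource key
      name: "my-external-endpoint"
      ai_gateway:
        guardrails:
          input:
            safety: true
            pii:
              behavior: "BLOCK" # the only supported value per the table above
            invalid_keywords:
              - "internal-codename" # assumed keyword
            valid_topics:
              - "billing" # assumed topics
              - "shipping"
          output:
            safety: true
            pii:
              behavior: "BLOCK"
```
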
### model_serving_endpoints.<name>.ai_gateway.inference_table_config
|
||
|
||
**`Type: Map`**
|
||
|
||
Configuration for payload logging using inference tables. Use these tables to monitor and audit data being sent to and received from model APIs and to improve model quality.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `catalog_name`
|
||
- String
|
||
- The name of the catalog in Unity Catalog. Required when enabling inference tables. NOTE: On update, you have to disable inference table first in order to change the catalog name.
|
||
|
||
* - `enabled`
|
||
- Boolean
|
||
- Indicates whether the inference table is enabled.
|
||
|
||
* - `schema_name`
|
||
- String
|
||
- The name of the schema in Unity Catalog. Required when enabling inference tables. NOTE: On update, you have to disable inference table first in order to change the schema name.
|
||
|
||
* - `table_name_prefix`
|
||
- String
|
||
- The prefix of the table in Unity Catalog. NOTE: On update, you have to disable inference table first in order to change the prefix name.
|
||
|
||
|
||
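**Example**

As an illustration only, the following fragment enables payload logging through AI Gateway. The catalog, schema, and prefix names are assumptions; substitute Unity Catalog objects that exist in your workspace.

```yaml
resources:
  model_serving_endpoints:
    my_endpoint: # hypothetical resource key
      name: "my-external-endpoint"
      ai_gateway:
        inference_table_config:
          enabled: true
          catalog_name: "main" # assumed catalog
          schema_name: "serving_logs" # assumed schema
          table_name_prefix: "my_endpoint" # assumed prefix
```

Because the catalog, schema, and prefix can only be changed after the inference table is disabled, it may be worth settling on these names before the first deployment.
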
### model_serving_endpoints.<name>.ai_gateway.rate_limits
|
||
|
||
**`Type: Sequence`**
|
||
|
||
Configuration for rate limits which can be set to limit endpoint traffic.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `calls`
|
||
- Integer
|
||
- Used to specify how many calls are allowed for a key within the renewal_period.
|
||
|
||
* - `key`
|
||
- String
|
||
- Key field for a rate limit. Currently, only 'user' and 'endpoint' are supported, with 'endpoint' being the default if not specified.
|
||
|
||
* - `renewal_period`
|
||
- String
|
||
- Renewal period field for a rate limit. Currently, only 'minute' is supported.
|
||
|
||
|
||
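**Example**

The following sketch defines one per-user limit and one endpoint-wide limit. The call counts are arbitrary illustrative values, and the resource key and endpoint name are hypothetical.

```yaml
resources:
  model_serving_endpoints:
    my_endpoint: # hypothetical resource key
      name: "my-external-endpoint"
      ai_gateway:
        rate_limits:
          - key: "user"
            calls: 100 # assumed limit per user per minute
            renewal_period: "minute"
          - key: "endpoint"
            calls: 1000 # assumed limit for the whole endpoint per minute
            renewal_period: "minute"
```
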
### model_serving_endpoints.<name>.ai_gateway.usage_tracking_config
|
||
|
||
**`Type: Map`**
|
||
|
||
Configuration to enable usage tracking using system tables. These tables allow you to monitor operational usage on endpoints and their associated costs.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `enabled`
|
||
- Boolean
|
||
- Whether to enable usage tracking.
|
||
|
||
|
||
### model_serving_endpoints.<name>.config
|
||
|
||
**`Type: Map`**
|
||
|
||
The core config of the serving endpoint.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `auto_capture_config`
|
||
- Map
|
||
- Configuration for Inference Tables which automatically logs requests and responses to Unity Catalog. See [_](#model_serving_endpoints.<name>.config.auto_capture_config).
|
||
|
||
* - `served_entities`
|
||
- Sequence
|
||
- A list of served entities for the endpoint to serve. A serving endpoint can have up to 15 served entities. See [_](#model_serving_endpoints.<name>.config.served_entities).
|
||
|
||
* - `served_models`
|
||
- Sequence
|
||
- (Deprecated, use served_entities instead) A list of served models for the endpoint to serve. A serving endpoint can have up to 15 served models. See [_](#model_serving_endpoints.<name>.config.served_models).
|
||
|
||
* - `traffic_config`
|
||
- Map
|
||
- The traffic config defining how invocations to the serving endpoint should be routed. See [_](#model_serving_endpoints.<name>.config.traffic_config).
|
||
|
||
|
||
### model_serving_endpoints.<name>.config.auto_capture_config
|
||
|
||
**`Type: Map`**
|
||
|
||
Configuration for Inference Tables which automatically logs requests and responses to Unity Catalog.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `catalog_name`
|
||
- String
|
||
- The name of the catalog in Unity Catalog. NOTE: On update, you cannot change the catalog name if the inference table is already enabled.
|
||
|
||
* - `enabled`
|
||
- Boolean
|
||
- Indicates whether the inference table is enabled.
|
||
|
||
* - `schema_name`
|
||
- String
|
||
- The name of the schema in Unity Catalog. NOTE: On update, you cannot change the schema name if the inference table is already enabled.
|
||
|
||
* - `table_name_prefix`
|
||
- String
|
||
- The prefix of the table in Unity Catalog. NOTE: On update, you cannot change the prefix name if the inference table is already enabled.
|
||
|
||
|
||
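**Example**

A minimal sketch of `auto_capture_config`, assuming a catalog named `main` and a schema named `serving_logs` already exist. All names here are placeholders, and the rest of the `config` mapping is omitted.

```yaml
resources:
  model_serving_endpoints:
    uc_model_serving_endpoint:
      name: "uc-model-endpoint"
      config:
        auto_capture_config:
          enabled: true
          catalog_name: "main" # assumed catalog
          schema_name: "serving_logs" # assumed schema
          table_name_prefix: "uc_model_endpoint" # assumed prefix
```
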
### model_serving_endpoints.<name>.config.served_entities
|
||
|
||
**`Type: Sequence`**
|
||
|
||
A list of served entities for the endpoint to serve. A serving endpoint can have up to 15 served entities.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `entity_name`
|
||
- String
|
||
- The name of the entity to be served. The entity may be a model in the Databricks Model Registry, a model in the Unity Catalog (UC), or a function of type FEATURE_SPEC in the UC. If it is a UC object, the full name of the object should be given in the form of __catalog_name__.__schema_name__.__model_name__.
|
||
|
||
* - `entity_version`
|
||
- String
|
||
- The version of the model in Databricks Model Registry to be served or empty if the entity is a FEATURE_SPEC.
|
||
|
||
* - `environment_vars`
|
||
- Map
|
||
- An object containing a set of optional, user-specified environment variable key-value pairs used for serving this entity. Note: this is an experimental feature and subject to change. Example entity environment variables that refer to Databricks secrets: `{"OPENAI_API_KEY": "{{secrets/my_scope/my_key}}", "DATABRICKS_TOKEN": "{{secrets/my_scope2/my_key2}}"}`
|
||
|
||
* - `external_model`
|
||
- Map
|
||
- The external model to be served. NOTE: Only one of external_model and (entity_name, entity_version, workload_size, workload_type, and scale_to_zero_enabled) can be specified with the latter set being used for custom model serving for a Databricks registered model. For an existing endpoint with external_model, it cannot be updated to an endpoint without external_model. If the endpoint is created without external_model, users cannot update it to add external_model later. The task type of all external models within an endpoint must be the same. See [_](#model_serving_endpoints.<name>.config.served_entities.external_model).
|
||
|
||
* - `instance_profile_arn`
|
||
- String
|
||
- ARN of the instance profile that the served entity uses to access AWS resources.
|
||
|
||
* - `max_provisioned_throughput`
|
||
- Integer
|
||
- The maximum tokens per second that the endpoint can scale up to.
|
||
|
||
* - `min_provisioned_throughput`
|
||
- Integer
|
||
- The minimum tokens per second that the endpoint can scale down to.
|
||
|
||
* - `name`
|
||
- String
|
||
- The name of a served entity. It must be unique across an endpoint. A served entity name can consist of alphanumeric characters, dashes, and underscores. If not specified for an external model, this field defaults to external_model.name, with '.' and ':' replaced with '-', and if not specified for other entities, it defaults to <entity-name>-<entity-version>.
|
||
|
||
* - `scale_to_zero_enabled`
|
||
- Boolean
|
||
- Whether the compute resources for the served entity should scale down to zero.
|
||
|
||
* - `workload_size`
|
||
- String
|
||
- The workload size of the served entity. The workload size corresponds to a range of provisioned concurrency that the compute autoscales between. A single unit of provisioned concurrency can process one request at a time. Valid workload sizes are "Small" (4 - 4 provisioned concurrency), "Medium" (8 - 16 provisioned concurrency), and "Large" (16 - 64 provisioned concurrency). If scale-to-zero is enabled, the lower bound of the provisioned concurrency for each workload size is 0.
|
||
|
||
* - `workload_type`
|
||
- String
|
||
- The workload type of the served entity. The workload type selects which type of compute to use in the endpoint. The default value for this parameter is "CPU". For deep learning workloads, GPU acceleration is available by selecting workload types like GPU_SMALL and others. See the available [GPU types](https://docs.databricks.com/machine-learning/model-serving/create-manage-serving-endpoints.html#gpu-workload-types).
|
||
|
||
|
||
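**Example**

The following fragment sketches a served entity that sets an explicit name, a GPU workload type, and an environment variable backed by a Databricks secret. The entity name reuses the model from the earlier example; the served entity name, secret scope, and secret key are hypothetical.

```yaml
resources:
  model_serving_endpoints:
    uc_model_serving_endpoint:
      name: "uc-model-endpoint"
      config:
        served_entities:
          - name: "ads-model-gpu" # hypothetical served entity name
            entity_name: "myCatalog.mySchema.my-ads-model"
            entity_version: "10"
            workload_size: "Small"
            workload_type: "GPU_SMALL"
            scale_to_zero_enabled: true
            environment_vars:
              # Secret reference in the form shown in the table above
              OPENAI_API_KEY: "{{secrets/my_scope/my_key}}"
```
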
### model_serving_endpoints.<name>.config.served_entities.external_model
|
||
|
||
**`Type: Map`**
|
||
|
||
The external model to be served. NOTE: Specify either `external_model` or the set of `entity_name`, `entity_version`, `workload_size`, `workload_type`, and `scale_to_zero_enabled`, but not both. The latter set is used for custom model serving of a Databricks registered model. An existing endpoint with `external_model` cannot be updated to remove it, and an endpoint created without `external_model` cannot be updated to add it later. The task type of all external models within an endpoint must be the same.
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `ai21labs_config`
|
||
- Map
|
||
- AI21Labs Config. Only required if the provider is 'ai21labs'. See [_](#model_serving_endpoints.<name>.config.served_entities.external_model.ai21labs_config).
|
||
|
||
* - `amazon_bedrock_config`
|
||
- Map
|
||
- Amazon Bedrock Config. Only required if the provider is 'amazon-bedrock'. See [_](#model_serving_endpoints.<name>.config.served_entities.external_model.amazon_bedrock_config).
|
||
|
||
* - `anthropic_config`
|
||
- Map
|
||
- Anthropic Config. Only required if the provider is 'anthropic'. See [_](#model_serving_endpoints.<name>.config.served_entities.external_model.anthropic_config).
|
||
|
||
* - `cohere_config`
|
||
- Map
|
||
- Cohere Config. Only required if the provider is 'cohere'. See [_](#model_serving_endpoints.<name>.config.served_entities.external_model.cohere_config).
|
||
|
||
* - `databricks_model_serving_config`
|
||
- Map
|
||
- Databricks Model Serving Config. Only required if the provider is 'databricks-model-serving'. See [_](#model_serving_endpoints.<name>.config.served_entities.external_model.databricks_model_serving_config).
|
||
|
||
* - `google_cloud_vertex_ai_config`
|
||
- Map
|
||
- Google Cloud Vertex AI Config. Only required if the provider is 'google-cloud-vertex-ai'. See [_](#model_serving_endpoints.<name>.config.served_entities.external_model.google_cloud_vertex_ai_config).
|
||
|
||
* - `name`
|
||
- String
|
||
- The name of the external model.
|
||
|
||
* - `openai_config`
|
||
- Map
|
||
- OpenAI Config. Only required if the provider is 'openai'. See [_](#model_serving_endpoints.<name>.config.served_entities.external_model.openai_config).
|
||
|
||
* - `palm_config`
|
||
- Map
|
||
- PaLM Config. Only required if the provider is 'palm'. See [_](#model_serving_endpoints.<name>.config.served_entities.external_model.palm_config).
|
||
|
||
* - `provider`
|
||
- String
|
||
- The name of the provider for the external model. Currently, the supported providers are 'ai21labs', 'anthropic', 'amazon-bedrock', 'cohere', 'databricks-model-serving', 'google-cloud-vertex-ai', 'openai', and 'palm'.
|
||
|
||
* - `task`
|
||
- String
|
||
- The task type of the external model.
|
||
|
||
|
||
### model_serving_endpoints.<name>.config.served_entities.external_model.ai21labs_config
|
||
|
||
**`Type: Map`**
|
||
|
||
AI21Labs Config. Only required if the provider is 'ai21labs'.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `ai21labs_api_key`
|
||
- String
|
||
- The Databricks secret key reference for an AI21 Labs API key. If you prefer to paste your API key directly, see `ai21labs_api_key_plaintext`. You must provide an API key using one of the following fields: `ai21labs_api_key` or `ai21labs_api_key_plaintext`.
|
||
|
||
* - `ai21labs_api_key_plaintext`
|
||
- String
|
||
- An AI21 Labs API key provided as a plaintext string. If you prefer to reference your key using Databricks Secrets, see `ai21labs_api_key`. You must provide an API key using one of the following fields: `ai21labs_api_key` or `ai21labs_api_key_plaintext`.
|
||
|
||
|
||
### model_serving_endpoints.<name>.config.served_entities.external_model.amazon_bedrock_config
|
||
|
||
**`Type: Map`**
|
||
|
||
Amazon Bedrock Config. Only required if the provider is 'amazon-bedrock'.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `aws_access_key_id`
|
||
- String
|
||
- The Databricks secret key reference for an AWS access key ID with permissions to interact with Bedrock services. If you prefer to paste your access key ID directly, see `aws_access_key_id_plaintext`. You must provide an access key ID using one of the following fields: `aws_access_key_id` or `aws_access_key_id_plaintext`.
|
||
|
||
* - `aws_access_key_id_plaintext`
|
||
- String
|
||
- An AWS access key ID with permissions to interact with Bedrock services provided as a plaintext string. If you prefer to reference your key using Databricks Secrets, see `aws_access_key_id`. You must provide an API key using one of the following fields: `aws_access_key_id` or `aws_access_key_id_plaintext`.
|
||
|
||
* - `aws_region`
|
||
- String
|
||
- The AWS region to use. Bedrock has to be enabled there.
|
||
|
||
* - `aws_secret_access_key`
|
||
- String
|
||
- The Databricks secret key reference for an AWS secret access key paired with the access key ID, with permissions to interact with Bedrock services. If you prefer to paste your API key directly, see `aws_secret_access_key_plaintext`. You must provide an API key using one of the following fields: `aws_secret_access_key` or `aws_secret_access_key_plaintext`.
|
||
|
||
* - `aws_secret_access_key_plaintext`
|
||
- String
|
||
- An AWS secret access key paired with the access key ID, with permissions to interact with Bedrock services provided as a plaintext string. If you prefer to reference your key using Databricks Secrets, see `aws_secret_access_key`. You must provide an API key using one of the following fields: `aws_secret_access_key` or `aws_secret_access_key_plaintext`.
|
||
|
||
* - `bedrock_provider`
|
||
- String
|
||
- The underlying provider in Amazon Bedrock. Supported values (case insensitive) include: Anthropic, Cohere, AI21Labs, Amazon.
|
||
|
||
|
||
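**Example**

As a sketch under stated assumptions, the following fragment serves an external model through Amazon Bedrock using secret references for both AWS credentials. The model identifier, region, secret scope, and secret keys are placeholders, and the `task` value `llm/v1/chat` is an assumption that does not come from this reference.

```yaml
resources:
  model_serving_endpoints:
    bedrock_endpoint: # hypothetical resource key
      name: "bedrock-endpoint"
      config:
        served_entities:
          - name: "bedrock-chat" # hypothetical served entity name
            external_model:
              name: "<bedrock-model-id>" # placeholder model identifier
              provider: "amazon-bedrock"
              task: "llm/v1/chat" # assumed task type
              amazon_bedrock_config:
                aws_region: "us-east-1" # assumed region with Bedrock enabled
                bedrock_provider: "anthropic"
                aws_access_key_id: "{{secrets/my_scope/aws_access_key_id}}"
                aws_secret_access_key: "{{secrets/my_scope/aws_secret_access_key}}"
```
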
### model_serving_endpoints.<name>.config.served_entities.external_model.anthropic_config
|
||
|
||
**`Type: Map`**
|
||
|
||
Anthropic Config. Only required if the provider is 'anthropic'.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `anthropic_api_key`
|
||
- String
|
||
- The Databricks secret key reference for an Anthropic API key. If you prefer to paste your API key directly, see `anthropic_api_key_plaintext`. You must provide an API key using one of the following fields: `anthropic_api_key` or `anthropic_api_key_plaintext`.
|
||
|
||
* - `anthropic_api_key_plaintext`
|
||
- String
|
||
- The Anthropic API key provided as a plaintext string. If you prefer to reference your key using Databricks Secrets, see `anthropic_api_key`. You must provide an API key using one of the following fields: `anthropic_api_key` or `anthropic_api_key_plaintext`.
|
||
|
||
|
||
### model_serving_endpoints.<name>.config.served_entities.external_model.cohere_config
|
||
|
||
**`Type: Map`**
|
||
|
||
Cohere Config. Only required if the provider is 'cohere'.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `cohere_api_base`
|
||
- String
|
||
- This is an optional field to provide a customized base URL for the Cohere API. If left unspecified, the standard Cohere base URL is used.
|
||
|
||
* - `cohere_api_key`
|
||
- String
|
||
- The Databricks secret key reference for a Cohere API key. If you prefer to paste your API key directly, see `cohere_api_key_plaintext`. You must provide an API key using one of the following fields: `cohere_api_key` or `cohere_api_key_plaintext`.
|
||
|
||
* - `cohere_api_key_plaintext`
|
||
- String
|
||
- The Cohere API key provided as a plaintext string. If you prefer to reference your key using Databricks Secrets, see `cohere_api_key`. You must provide an API key using one of the following fields: `cohere_api_key` or `cohere_api_key_plaintext`.
|
||
|
||
|
||
### model_serving_endpoints.<name>.config.served_entities.external_model.databricks_model_serving_config
|
||
|
||
**`Type: Map`**
|
||
|
||
Databricks Model Serving Config. Only required if the provider is 'databricks-model-serving'.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `databricks_api_token`
|
||
- String
|
||
- The Databricks secret key reference for a Databricks API token that corresponds to a user or service principal with Can Query access to the model serving endpoint pointed to by this external model. If you prefer to paste your API key directly, see `databricks_api_token_plaintext`. You must provide an API key using one of the following fields: `databricks_api_token` or `databricks_api_token_plaintext`.
|
||
|
||
* - `databricks_api_token_plaintext`
|
||
- String
|
||
- The Databricks API token that corresponds to a user or service principal with Can Query access to the model serving endpoint pointed to by this external model provided as a plaintext string. If you prefer to reference your key using Databricks Secrets, see `databricks_api_token`. You must provide an API key using one of the following fields: `databricks_api_token` or `databricks_api_token_plaintext`.
|
||
|
||
* - `databricks_workspace_url`
|
||
- String
|
||
- The URL of the Databricks workspace containing the model serving endpoint pointed to by this external model.
|
||
|
||
|
||
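**Example**

The following fragment is a minimal sketch of an external model that points at a serving endpoint in another Databricks workspace. The workspace URL, endpoint name, secret scope, and secret key are placeholders, and the `task` value is an assumption.

```yaml
resources:
  model_serving_endpoints:
    cross_workspace_endpoint: # hypothetical resource key
      name: "cross-workspace-endpoint"
      config:
        served_entities:
          - external_model:
              name: "<remote-endpoint-name>" # placeholder
              provider: "databricks-model-serving"
              task: "llm/v1/chat" # assumed task type
              databricks_model_serving_config:
                databricks_workspace_url: "https://<workspace-host>" # placeholder
                databricks_api_token: "{{secrets/my_scope/databricks_token}}"
```
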
### model_serving_endpoints.<name>.config.served_entities.external_model.google_cloud_vertex_ai_config
|
||
|
||
**`Type: Map`**
|
||
|
||
Google Cloud Vertex AI Config. Only required if the provider is 'google-cloud-vertex-ai'.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `private_key`
|
||
- String
|
||
- The Databricks secret key reference for a private key for the service account which has access to the Google Cloud Vertex AI Service. See [Best practices for managing service account keys](https://cloud.google.com/iam/docs/best-practices-for-managing-service-account-keys). If you prefer to paste your private key directly, see `private_key_plaintext`. You must provide a private key using one of the following fields: `private_key` or `private_key_plaintext`.
|
||
|
||
* - `private_key_plaintext`
|
||
- String
|
||
- The private key for the service account which has access to the Google Cloud Vertex AI Service provided as a plaintext secret. See [Best practices for managing service account keys](https://cloud.google.com/iam/docs/best-practices-for-managing-service-account-keys). If you prefer to reference your key using Databricks Secrets, see `private_key`. You must provide an API key using one of the following fields: `private_key` or `private_key_plaintext`.
|
||
|
||
* - `project_id`
|
||
- String
|
||
- This is the Google Cloud project id that the service account is associated with.
|
||
|
||
* - `region`
|
||
- String
|
||
- This is the region for the Google Cloud Vertex AI Service. See [supported regions](https://cloud.google.com/vertex-ai/docs/general/locations) for more details. Some models are only available in specific regions.
|
||
|
||
|
||
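**Example**

A sketch of a Google Cloud Vertex AI external model, assuming the service account private key is stored as a Databricks secret. The model name, project ID, region, and secret names are placeholders, and the `task` value is an assumption.

```yaml
resources:
  model_serving_endpoints:
    vertex_endpoint: # hypothetical resource key
      name: "vertex-endpoint"
      config:
        served_entities:
          - external_model:
              name: "<vertex-model-name>" # placeholder
              provider: "google-cloud-vertex-ai"
              task: "llm/v1/chat" # assumed task type
              google_cloud_vertex_ai_config:
                project_id: "<gcp-project-id>" # placeholder
                region: "us-central1" # assumed region
                private_key: "{{secrets/my_scope/vertex_private_key}}"
```
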
### model_serving_endpoints.<name>.config.served_entities.external_model.openai_config
|
||
|
||
**`Type: Map`**
|
||
|
||
OpenAI Config. Only required if the provider is 'openai'.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `microsoft_entra_client_id`
|
||
- String
|
||
- This field is only required for Azure AD OpenAI and is the Microsoft Entra Client ID.
|
||
|
||
* - `microsoft_entra_client_secret`
|
||
- String
|
||
- The Databricks secret key reference for a client secret used for Microsoft Entra ID authentication. If you prefer to paste your client secret directly, see `microsoft_entra_client_secret_plaintext`. You must provide an API key using one of the following fields: `microsoft_entra_client_secret` or `microsoft_entra_client_secret_plaintext`.
|
||
|
||
* - `microsoft_entra_client_secret_plaintext`
|
||
- String
|
||
- The client secret used for Microsoft Entra ID authentication provided as a plaintext string. If you prefer to reference your key using Databricks Secrets, see `microsoft_entra_client_secret`. You must provide an API key using one of the following fields: `microsoft_entra_client_secret` or `microsoft_entra_client_secret_plaintext`.
|
||
|
||
* - `microsoft_entra_tenant_id`
|
||
- String
|
||
- This field is only required for Azure AD OpenAI and is the Microsoft Entra Tenant ID.
|
||
|
||
* - `openai_api_base`
|
||
- String
|
||
- This is a field to provide a customized base URL for the OpenAI API. For Azure OpenAI, this field is required, and is the base URL for the Azure OpenAI API service provided by Azure. For other OpenAI API types, this field is optional, and if left unspecified, the standard OpenAI base URL is used.
|
||
|
||
* - `openai_api_key`
|
||
- String
|
||
- The Databricks secret key reference for an OpenAI API key using the OpenAI or Azure service. If you prefer to paste your API key directly, see `openai_api_key_plaintext`. You must provide an API key using one of the following fields: `openai_api_key` or `openai_api_key_plaintext`.
|
||
|
||
* - `openai_api_key_plaintext`
|
||
- String
|
||
- The OpenAI API key using the OpenAI or Azure service provided as a plaintext string. If you prefer to reference your key using Databricks Secrets, see `openai_api_key`. You must provide an API key using one of the following fields: `openai_api_key` or `openai_api_key_plaintext`.
|
||
|
||
* - `openai_api_type`
|
||
- String
|
||
- This is an optional field to specify the type of OpenAI API to use. For Azure OpenAI, this field is required and selects the security access validation protocol: use azure for access token validation, or azuread for authentication using Azure Active Directory (Azure AD).
|
||
|
||
* - `openai_api_version`
|
||
- String
|
||
- This is an optional field to specify the OpenAI API version. For Azure OpenAI, this field is required, and is the version of the Azure OpenAI service to utilize, specified by a date.
|
||
|
||
* - `openai_deployment_name`
|
||
- String
|
||
- This field is only required for Azure OpenAI and is the name of the deployment resource for the Azure OpenAI service.
|
||
|
||
* - `openai_organization`
|
||
- String
|
||
- This is an optional field to specify the organization in OpenAI or Azure OpenAI.
|
||
|
||
|
||
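**Example**

The following fragment sketches an Azure OpenAI external model that uses access token validation (`openai_api_type: azure`), which makes `openai_api_base`, `openai_api_version`, and `openai_deployment_name` required. The resource URL, deployment name, API version date, model name, and secret names are placeholders, and the `task` value is an assumption.

```yaml
resources:
  model_serving_endpoints:
    azure_openai_endpoint: # hypothetical resource key
      name: "azure-openai-endpoint"
      config:
        served_entities:
          - external_model:
              name: "<openai-model-name>" # placeholder
              provider: "openai"
              task: "llm/v1/chat" # assumed task type
              openai_config:
                openai_api_type: "azure"
                openai_api_base: "https://<resource-name>.openai.azure.com/" # placeholder
                openai_api_version: "<yyyy-mm-dd>" # placeholder version date
                openai_deployment_name: "<deployment-name>" # placeholder
                openai_api_key: "{{secrets/my_scope/azure_openai_key}}"
```

For the standard OpenAI service, the Azure-specific keys can be dropped, leaving `openai_api_key` (or `openai_api_key_plaintext`) as the only required key in `openai_config`.
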
### model_serving_endpoints.<name>.config.served_entities.external_model.palm_config
|
||
|
||
**`Type: Map`**
|
||
|
||
PaLM Config. Only required if the provider is 'palm'.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `palm_api_key`
|
||
- String
|
||
- The Databricks secret key reference for a PaLM API key. If you prefer to paste your API key directly, see `palm_api_key_plaintext`. You must provide an API key using one of the following fields: `palm_api_key` or `palm_api_key_plaintext`.
|
||
|
||
* - `palm_api_key_plaintext`
|
||
- String
|
||
- The PaLM API key provided as a plaintext string. If you prefer to reference your key using Databricks Secrets, see `palm_api_key`. You must provide an API key using one of the following fields: `palm_api_key` or `palm_api_key_plaintext`.
|
||
|
||
|
||
### model_serving_endpoints.<name>.config.served_models
|
||
|
||
**`Type: Sequence`**
|
||
|
||
(Deprecated, use served_entities instead) A list of served models for the endpoint to serve. A serving endpoint can have up to 15 served models.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `environment_vars`
|
||
- Map
|
||
- An object containing a set of optional, user-specified environment variable key-value pairs used for serving this model. Note: this is an experimental feature and subject to change. Example model environment variables that refer to Databricks secrets: `{"OPENAI_API_KEY": "{{secrets/my_scope/my_key}}", "DATABRICKS_TOKEN": "{{secrets/my_scope2/my_key2}}"}`
|
||
|
||
* - `instance_profile_arn`
|
||
- String
|
||
- ARN of the instance profile that the served model will use to access AWS resources.
|
||
|
||
* - `max_provisioned_throughput`
|
||
- Integer
|
||
- The maximum tokens per second that the endpoint can scale up to.
|
||
|
||
* - `min_provisioned_throughput`
|
||
- Integer
|
||
- The minimum tokens per second that the endpoint can scale down to.
|
||
|
||
* - `model_name`
|
||
- String
|
||
- The name of the model in Databricks Model Registry to be served or, if the model resides in Unity Catalog, the full name of the model, in the form of __catalog_name__.__schema_name__.__model_name__.
|
||
|
||
* - `model_version`
|
||
- String
|
||
- The version of the model in Databricks Model Registry or Unity Catalog to be served.
|
||
|
||
* - `name`
|
||
- String
|
||
- The name of a served model. It must be unique across an endpoint. If not specified, this field defaults to `<model-name>-<model-version>`. A served model name can consist of alphanumeric characters, dashes, and underscores.
|
||
|
||
* - `scale_to_zero_enabled`
|
||
- Boolean
|
||
- Whether the compute resources for the served model should scale down to zero.
|
||
|
||
* - `workload_size`
|
||
- String
|
||
- The workload size of the served model. The workload size corresponds to a range of provisioned concurrency that the compute will autoscale between. A single unit of provisioned concurrency can process one request at a time. Valid workload sizes are "Small" (4 - 4 provisioned concurrency), "Medium" (8 - 16 provisioned concurrency), and "Large" (16 - 64 provisioned concurrency). If scale-to-zero is enabled, the lower bound of the provisioned concurrency for each workload size will be 0.
|
||
|
||
* - `workload_type`
|
||
- String
|
||
- The workload type of the served model. The workload type selects which type of compute to use in the endpoint. The default value for this parameter is "CPU". For deep learning workloads, GPU acceleration is available by selecting workload types like GPU_SMALL and others. See the available [GPU types](https://docs.databricks.com/machine-learning/model-serving/create-manage-serving-endpoints.html#gpu-workload-types).
|
||
|
||
|
||
### model_serving_endpoints.<name>.config.traffic_config
|
||
|
||
**`Type: Map`**
|
||
|
||
The traffic config defining how invocations to the serving endpoint should be routed.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `routes`
|
||
- Sequence
|
||
- The list of routes that define traffic to each served entity. See [_](#model_serving_endpoints.<name>.config.traffic_config.routes).
|
||
|
||
|
||
### model_serving_endpoints.<name>.config.traffic_config.routes
|
||
|
||
**`Type: Sequence`**
|
||
|
||
The list of routes that define traffic to each served entity.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `served_model_name`
|
||
- String
|
||
- The name of the served model this route configures traffic for.
|
||
|
||
* - `traffic_percentage`
|
||
- Integer
|
||
- The percentage of endpoint traffic to send to this route. It must be an integer between 0 and 100 inclusive.
|
||
|
||
|
||
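To show how `served_models` and `traffic_config.routes` fit together, here is a minimal sketch of an endpoint that splits traffic 90/10 between two versions of one model. The resource key, endpoint name, Unity Catalog model name, and version numbers are placeholders invented for this example, and `served_models` is deprecated in favor of `served_entities`, so treat this as an illustration of the routing keys rather than a recommended configuration.

```yaml
resources:
  model_serving_endpoints:
    my_endpoint:                      # hypothetical resource key
      name: my-endpoint               # placeholder endpoint name
      config:
        served_models:
          - name: champion            # must be unique across the endpoint
            model_name: main.default.my_model   # placeholder Unity Catalog model
            model_version: "1"
            workload_size: Small
            scale_to_zero_enabled: true
          - name: challenger
            model_name: main.default.my_model
            model_version: "2"
            workload_size: Small
            scale_to_zero_enabled: true
        traffic_config:
          routes:
            # Each route refers to a served model by its name.
            - served_model_name: champion
              traffic_percentage: 90
            - served_model_name: challenger
              traffic_percentage: 10
```

In this sketch the two `traffic_percentage` values are chosen to add up to 100.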
### model_serving_endpoints.<name>.permissions
|
||
|
||
**`Type: Sequence`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `group_name`
|
||
- String
|
||
- The name of the group that has the permission set in level.
|
||
|
||
* - `level`
|
||
- String
|
||
- The permission level granted to the user, group, or service principal named in this entry.
|
||
|
||
* - `service_principal_name`
|
||
- String
|
||
- The name of the service principal that has the permission set in level.
|
||
|
||
* - `user_name`
|
||
- String
|
||
- The name of the user that has the permission set in level.
|
||
|
||
|
||
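Each entry in the sequence pairs one principal key with a `level`. The sketch below assumes the permission level names `CAN_QUERY` and `CAN_MANAGE` for serving endpoints; the group name and user email are placeholders.

```yaml
resources:
  model_serving_endpoints:
    my_endpoint:                       # hypothetical resource key
      permissions:
        - group_name: data-scientists  # placeholder group
          level: CAN_QUERY             # assumed level name
        - user_name: someone@example.com
          level: CAN_MANAGE            # assumed level name
```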
### model_serving_endpoints.<name>.rate_limits
|
||
|
||
**`Type: Sequence`**
|
||
|
||
Rate limits to be applied to the serving endpoint. NOTE: this field is deprecated, please use AI Gateway to manage rate limits.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `calls`
|
||
- Integer
|
||
- Used to specify how many calls are allowed for a key within the renewal_period.
|
||
|
||
* - `key`
|
||
- String
|
||
- Key field for a serving endpoint rate limit. Currently, only 'user' and 'endpoint' are supported, with 'endpoint' being the default if not specified.
|
||
|
||
* - `renewal_period`
|
||
- String
|
||
- Renewal period field for a serving endpoint rate limit. Currently, only 'minute' is supported.
|
||
|
||
|
||
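For reference, a rate limit built from the keys above might look like the following sketch. The resource key and the call count are arbitrary example values, and because the field is deprecated, new configurations would normally manage rate limits through AI Gateway instead.

```yaml
resources:
  model_serving_endpoints:
    my_endpoint:                 # hypothetical resource key
      rate_limits:
        - calls: 100             # example value: 100 calls per renewal period
          key: user              # 'user' or 'endpoint'
          renewal_period: minute # the only supported period
```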
### model_serving_endpoints.<name>.tags
|
||
|
||
**`Type: Sequence`**
|
||
|
||
Tags to be attached to the serving endpoint and automatically propagated to billing logs.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `key`
|
||
- String
|
||
- Key field for a serving endpoint tag.
|
||
|
||
* - `value`
|
||
- String
|
||
- Optional value field for a serving endpoint tag.
|
||
|
||
|
||
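A short sketch of the tag sequence, with made-up keys and values; the second entry omits `value`, which the table above marks as optional:

```yaml
resources:
  model_serving_endpoints:
    my_endpoint:            # hypothetical resource key
      tags:
        - key: team         # placeholder tag
          value: ml-platform
        - key: experimental # tag without a value
```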
## models
|
||
|
||
**`Type: Map`**
|
||
|
||
The model resource allows you to define [legacy models](/api/workspace/modelregistry/createmodel) in bundles. Databricks recommends you use <UC> [registered models](#registered-model) instead.
|
||
|
||
```yaml
models:
  <model-name>:
    <model-field-name>: <model-field-value>
```
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `creation_timestamp`
|
||
- Integer
|
||
- Timestamp recorded when this `registered_model` was created.
|
||
|
||
* - `description`
|
||
- String
|
||
- Description of this `registered_model`.
|
||
|
||
* - `last_updated_timestamp`
|
||
- Integer
|
||
- Timestamp recorded when metadata for this `registered_model` was last updated.
|
||
|
||
* - `latest_versions`
|
||
- Sequence
|
||
- Collection of latest model versions for each stage. Only contains models with current `READY` status. See [_](#models.<name>.latest_versions).
|
||
|
||
* - `name`
|
||
- String
|
||
- Unique name for the model.
|
||
|
||
* - `permissions`
|
||
- Sequence
|
||
- See [_](#models.<name>.permissions).
|
||
|
||
* - `tags`
|
||
- Sequence
|
||
- Tags: Additional metadata key-value pairs for this `registered_model`. See [_](#models.<name>.tags).
|
||
|
||
* - `user_id`
|
||
- String
|
||
- User that created this `registered_model`.
|
||
|
||
|
||
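As an illustration of the keys a bundle author would typically set, the following sketch defines a legacy model with a description, one tag, and one permission entry. The resource key, model name, group name, and the `CAN_READ` level are assumptions made for this example; timestamp and user fields are not set here.

```yaml
resources:
  models:
    my_legacy_model:                  # hypothetical resource key
      name: my-legacy-model           # placeholder; must be unique
      description: Example model registered from a bundle.
      tags:
        - key: team
          value: ml-platform
      permissions:
        - group_name: users           # placeholder group
          level: CAN_READ             # assumed level name
```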
### models.<name>.latest_versions
|
||
|
||
**`Type: Sequence`**
|
||
|
||
Collection of latest model versions for each stage. Only contains models with current `READY` status.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `creation_timestamp`
|
||
- Integer
|
||
- Timestamp recorded when this `model_version` was created.
|
||
|
||
* - `current_stage`
|
||
- String
|
||
- Current stage for this `model_version`.
|
||
|
||
* - `description`
|
||
- String
|
||
- Description of this `model_version`.
|
||
|
||
* - `last_updated_timestamp`
|
||
- Integer
|
||
- Timestamp recorded when metadata for this `model_version` was last updated.
|
||
|
||
* - `name`
|
||
- String
|
||
- Unique name of the model.
|
||
|
||
* - `run_id`
|
||
- String
|
||
- MLflow run ID used when creating `model_version`, if `source` was generated by an experiment run stored in MLflow tracking server.
|
||
|
||
* - `run_link`
|
||
- String
|
||
- Direct link to the run that generated this version.
|
||
|
||
* - `source`
|
||
- String
|
||
- URI indicating the location of the source model artifacts, used when creating `model_version`
|
||
|
||
* - `status`
|
||
- String
|
||
- Current status of `model_version`
|
||
|
||
* - `status_message`
|
||
- String
|
||
- Details on current `status`, if it is pending or failed.
|
||
|
||
* - `tags`
|
||
- Sequence
|
||
- Tags: Additional metadata key-value pairs for this `model_version`. See [_](#models.<name>.latest_versions.tags).
|
||
|
||
* - `user_id`
|
||
- String
|
||
- User that created this `model_version`.
|
||
|
||
* - `version`
|
||
- String
|
||
- Model's version number.
|
||
|
||
|
||
### models.<name>.latest_versions.tags
|
||
|
||
**`Type: Sequence`**
|
||
|
||
Tags: Additional metadata key-value pairs for this `model_version`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `key`
|
||
- String
|
||
- The tag key.
|
||
|
||
* - `value`
|
||
- String
|
||
- The tag value.
|
||
|
||
|
||
### models.<name>.permissions
|
||
|
||
**`Type: Sequence`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `group_name`
|
||
- String
|
||
- The name of the group that has the permission set in level.
|
||
|
||
* - `level`
|
||
- String
|
||
- The permission level granted to the user, group, or service principal named in this entry.
|
||
|
||
* - `service_principal_name`
|
||
- String
|
||
- The name of the service principal that has the permission set in level.
|
||
|
||
* - `user_name`
|
||
- String
|
||
- The name of the user that has the permission set in level.
|
||
|
||
|
||
### models.<name>.tags
|
||
|
||
**`Type: Sequence`**
|
||
|
||
Tags: Additional metadata key-value pairs for this `registered_model`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `key`
|
||
- String
|
||
- The tag key.
|
||
|
||
* - `value`
|
||
- String
|
||
- The tag value.
|
||
|
||
|
||
## pipelines
|
||
|
||
**`Type: Map`**
|
||
|
||
The pipeline resource allows you to create <DLT> [pipelines](/api/workspace/pipelines/create). For information about pipelines, see [_](/delta-live-tables/index.md). For a tutorial that uses the <DABS> template to create a pipeline, see [_](/dev-tools/bundles/pipelines-tutorial.md).
|
||
|
||
```yaml
pipelines:
  <pipeline-name>:
    <pipeline-field-name>: <pipeline-field-value>
```
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `budget_policy_id`
|
||
- String
|
||
- Budget policy of this pipeline.
|
||
|
||
* - `catalog`
|
||
- String
|
||
- A catalog in Unity Catalog to publish data from this pipeline to. If `target` is specified, tables in this pipeline are published to a `target` schema inside `catalog` (for example, `catalog`.`target`.`table`). If `target` is not specified, no data is published to Unity Catalog.
|
||
|
||
* - `channel`
|
||
- String
|
||
- DLT Release Channel that specifies which version to use.
|
||
|
||
* - `clusters`
|
||
- Sequence
|
||
- Cluster settings for this pipeline deployment. See [_](#pipelines.<name>.clusters).
|
||
|
||
* - `configuration`
|
||
- Map
|
||
- String-String configuration for this pipeline execution.
|
||
|
||
* - `continuous`
|
||
- Boolean
|
||
- Whether the pipeline is continuous or triggered. This replaces `trigger`.
|
||
|
||
* - `deployment`
|
||
- Map
|
||
- Deployment type of this pipeline. See [_](#pipelines.<name>.deployment).
|
||
|
||
* - `development`
|
||
- Boolean
|
||
- Whether the pipeline is in Development mode. Defaults to false.
|
||
|
||
* - `edition`
|
||
- String
|
||
- Pipeline product edition.
|
||
|
||
* - `filters`
|
||
- Map
|
||
- Filters on which Pipeline packages to include in the deployed graph. See [_](#pipelines.<name>.filters).
|
||
|
||
* - `gateway_definition`
|
||
- Map
|
||
- The definition of a gateway pipeline to support change data capture. See [_](#pipelines.<name>.gateway_definition).
|
||
|
||
* - `id`
|
||
- String
|
||
- Unique identifier for this pipeline.
|
||
|
||
* - `ingestion_definition`
|
||
- Map
|
||
- The configuration for a managed ingestion pipeline. These settings cannot be used with the 'libraries', 'target' or 'catalog' settings. See [_](#pipelines.<name>.ingestion_definition).
|
||
|
||
* - `libraries`
|
||
- Sequence
|
||
- Libraries or code needed by this deployment. See [_](#pipelines.<name>.libraries).
|
||
|
||
* - `name`
|
||
- String
|
||
- Friendly identifier for this pipeline.
|
||
|
||
* - `notifications`
|
||
- Sequence
|
||
- List of notification settings for this pipeline. See [_](#pipelines.<name>.notifications).
|
||
|
||
* - `permissions`
|
||
- Sequence
|
||
- See [_](#pipelines.<name>.permissions).
|
||
|
||
* - `photon`
|
||
- Boolean
|
||
- Whether Photon is enabled for this pipeline.
|
||
|
||
* - `restart_window`
|
||
- Map
|
||
- Restart window of this pipeline. See [_](#pipelines.<name>.restart_window).
|
||
|
||
* - `schema`
|
||
- String
|
||
- The default schema (database) where tables are read from or published to. The presence of this field implies that the pipeline is in direct publishing mode.
|
||
|
||
* - `serverless`
|
||
- Boolean
|
||
- Whether serverless compute is enabled for this pipeline.
|
||
|
||
* - `storage`
|
||
- String
|
||
- DBFS root directory for storing checkpoints and tables.
|
||
|
||
* - `target`
|
||
- String
|
||
- Target schema (database) to add tables in this pipeline to. If not specified, no data is published to the Hive metastore or Unity Catalog. To publish to Unity Catalog, also specify `catalog`.
|
||
|
||
* - `trigger`
|
||
- Map
|
||
- Which pipeline trigger to use. Deprecated: Use `continuous` instead. See [_](#pipelines.<name>.trigger).
|
||
|
||
|
||
**Example**
|
||
|
||
The following example defines a pipeline with the resource key `hello-pipeline`:
|
||
|
||
```yaml
resources:
  pipelines:
    hello-pipeline:
      name: hello-pipeline
      clusters:
        - label: default
          num_workers: 1
      development: true
      continuous: false
      channel: CURRENT
      edition: CORE
      photon: false
      libraries:
        - notebook:
            path: ./pipeline.py
```
|
||
|
||
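The `catalog` and `target` keys described in the table work as a pair when publishing to Unity Catalog. A minimal sketch, assuming a catalog named `main` and a schema named `sales` (both placeholders), and using serverless compute so that no `clusters` block is shown:

```yaml
resources:
  pipelines:
    uc-pipeline:               # hypothetical resource key
      name: uc-pipeline
      catalog: main            # placeholder Unity Catalog catalog
      target: sales            # placeholder schema inside the catalog
      serverless: true
      libraries:
        - notebook:
            path: ./pipeline.py
```

Under these assumptions, tables defined in the pipeline would be published as `main.sales.<table>`.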
### pipelines.<name>.clusters
|
||
|
||
**`Type: Sequence`**
|
||
|
||
Cluster settings for this pipeline deployment.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `apply_policy_default_values`
|
||
- Boolean
|
||
- Note: This field won't be persisted. Only API users will check this field.
|
||
|
||
* - `autoscale`
|
||
- Map
|
||
- Parameters needed in order to automatically scale clusters up and down based on load. Note: autoscaling works best with DB runtime versions 3.0 or later. See [_](#pipelines.<name>.clusters.autoscale).
|
||
|
||
* - `aws_attributes`
|
||
- Map
|
||
- Attributes related to clusters running on Amazon Web Services. If not specified at cluster creation, a set of default values will be used. See [_](#pipelines.<name>.clusters.aws_attributes).
|
||
|
||
* - `azure_attributes`
|
||
- Map
|
||
- Attributes related to clusters running on Microsoft Azure. If not specified at cluster creation, a set of default values will be used. See [_](#pipelines.<name>.clusters.azure_attributes).
|
||
|
||
* - `cluster_log_conf`
|
||
- Map
|
||
- The configuration for delivering Spark logs to a long-term storage destination. Only dbfs destinations are supported. Only one destination can be specified for one cluster. If the conf is given, the logs will be delivered to the destination every `5 mins`. The destination of driver logs is `$destination/$clusterId/driver`, while the destination of executor logs is `$destination/$clusterId/executor`. See [_](#pipelines.<name>.clusters.cluster_log_conf).
|
||
|
||
* - `custom_tags`
|
||
- Map
|
||
- Additional tags for cluster resources. Databricks will tag all cluster resources (e.g., AWS instances and EBS volumes) with these tags in addition to `default_tags`. Notes: (1) Currently, Databricks allows at most 45 custom tags. (2) Clusters can only reuse cloud resources if the resources' tags are a subset of the cluster tags.
|
||
|
||
* - `driver_instance_pool_id`
|
||
- String
|
||
- The optional ID of the instance pool to which the driver of the cluster belongs. If the driver pool is not assigned, the cluster uses the instance pool with the ID `instance_pool_id` for the driver.
|
||
|
||
* - `driver_node_type_id`
|
||
- String
|
||
- The node type of the Spark driver. Note that this field is optional; if unset, the driver node type will be set as the same value as `node_type_id` defined above.
|
||
|
||
* - `enable_local_disk_encryption`
|
||
- Boolean
|
||
- Whether to enable local disk encryption for the cluster.
|
||
|
||
* - `gcp_attributes`
|
||
- Map
|
||
- Attributes related to clusters running on Google Cloud Platform. If not specified at cluster creation, a set of default values will be used. See [_](#pipelines.<name>.clusters.gcp_attributes).
|
||
|
||
* - `init_scripts`
|
||
- Sequence
|
||
- The configuration for storing init scripts. Any number of destinations can be specified. The scripts are executed sequentially in the order provided. If `cluster_log_conf` is specified, init script logs are sent to `<destination>/<cluster-ID>/init_scripts`. See [_](#pipelines.<name>.clusters.init_scripts).
|
||
|
||
* - `instance_pool_id`
|
||
- String
|
||
- The optional ID of the instance pool to which the cluster belongs.
|
||
|
||
* - `label`
|
||
- String
|
||
- A label for the cluster specification, either `default` to configure the default cluster, or `maintenance` to configure the maintenance cluster. This field is optional. The default value is `default`.
|
||
|
||
* - `node_type_id`
|
||
- String
|
||
- This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. For example, the Spark nodes can be provisioned and optimized for memory or compute intensive workloads. A list of available node types can be retrieved by using the :method:clusters/listNodeTypes API call.
|
||
|
||
* - `num_workers`
|
||
- Integer
|
||
- Number of worker nodes that this cluster should have. A cluster has one Spark Driver and `num_workers` Executors for a total of `num_workers` + 1 Spark nodes. Note: When reading the properties of a cluster, this field reflects the desired number of workers rather than the actual current number of workers. For instance, if a cluster is resized from 5 to 10 workers, this field will immediately be updated to reflect the target size of 10 workers, whereas the workers listed in `spark_info` will gradually increase from 5 to 10 as the new nodes are provisioned.
|
||
|
||
* - `policy_id`
|
||
- String
|
||
- The ID of the cluster policy used to create the cluster if applicable.
|
||
|
||
* - `spark_conf`
|
||
- Map
|
||
- An object containing a set of optional, user-specified Spark configuration key-value pairs. See :method:clusters/create for more details.
|
||
|
||
* - `spark_env_vars`
|
||
- Map
|
||
- An object containing a set of optional, user-specified environment variable key-value pairs. Please note that key-value pair of the form (X,Y) will be exported as is (i.e., `export X='Y'`) while launching the driver and workers. In order to specify an additional set of `SPARK_DAEMON_JAVA_OPTS`, we recommend appending them to `$SPARK_DAEMON_JAVA_OPTS` as shown in the example below. This ensures that all default databricks managed environmental variables are included as well. Example Spark environment variables: `{"SPARK_WORKER_MEMORY": "28000m", "SPARK_LOCAL_DIRS": "/local_disk0"}` or `{"SPARK_DAEMON_JAVA_OPTS": "$SPARK_DAEMON_JAVA_OPTS -Dspark.shuffle.service.enabled=true"}`
|
||
|
||
* - `ssh_public_keys`
|
||
- Sequence
|
||
- SSH public key contents that will be added to each Spark node in this cluster. The corresponding private keys can be used to login with the user name `ubuntu` on port `2200`. Up to 10 keys can be specified.
|
||
|
||
|
||
### pipelines.<name>.clusters.autoscale
|
||
|
||
**`Type: Map`**
|
||
|
||
Parameters needed in order to automatically scale clusters up and down based on load. Note: autoscaling works best with DB runtime versions 3.0 or later.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `max_workers`
|
||
- Integer
|
||
- The maximum number of workers to which the cluster can scale up when overloaded. `max_workers` must be strictly greater than `min_workers`.
|
||
|
||
* - `min_workers`
|
||
- Integer
|
||
- The minimum number of workers the cluster can scale down to when underutilized. It is also the initial number of workers the cluster will have after creation.
|
||
|
||
* - `mode`
|
||
- String
|
||
- Databricks Enhanced Autoscaling optimizes cluster utilization by automatically allocating cluster resources based on workload volume, with minimal impact to the data processing latency of your pipelines. Enhanced Autoscaling is available for `updates` clusters only. The legacy autoscaling feature is used for `maintenance` clusters.
|
||
|
||
|
||
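A sketch of an autoscaling cluster specification follows. The worker counts are arbitrary example values (with `max_workers` strictly greater than `min_workers`, as required above), and `ENHANCED` is assumed here to be the `mode` value that selects Enhanced Autoscaling.

```yaml
resources:
  pipelines:
    hello-pipeline:
      clusters:
        - label: default
          autoscale:
            min_workers: 1    # also the initial number of workers
            max_workers: 5    # must be strictly greater than min_workers
            mode: ENHANCED    # assumed value name for Enhanced Autoscaling
```

Because `autoscale` sets the worker range, this sketch leaves out a fixed `num_workers` value.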
### pipelines.<name>.clusters.aws_attributes
|
||
|
||
**`Type: Map`**
|
||
|
||
Attributes related to clusters running on Amazon Web Services. If not specified at cluster creation, a set of default values will be used.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `availability`
|
||
- String
|
||
- Availability type used for all subsequent nodes past the `first_on_demand` ones. Note: If `first_on_demand` is zero, this availability type will be used for the entire cluster.
|
||
|
||
* - `ebs_volume_count`
|
||
- Integer
|
||
- The number of volumes launched for each instance. Users can choose up to 10 volumes. This feature is only enabled for supported node types. Legacy node types cannot specify custom EBS volumes. For node types with no instance store, at least one EBS volume needs to be specified; otherwise, cluster creation will fail. These EBS volumes will be mounted at `/ebs0`, `/ebs1`, and so on. Instance store volumes will be mounted at `/local_disk0`, `/local_disk1`, and so on. If EBS volumes are attached, Databricks will configure Spark to use only the EBS volumes for scratch storage because heterogeneously sized scratch devices can lead to inefficient disk utilization. If no EBS volumes are attached, Databricks will configure Spark to use instance store volumes. Please note that if EBS volumes are specified, then the Spark configuration `spark.local.dir` will be overridden.
|
||
|
||
* - `ebs_volume_iops`
|
||
- Integer
|
||
- If using gp3 volumes, what IOPS to use for the disk. If this is not set, the maximum performance of a gp2 volume with the same volume size will be used.
|
||
|
||
* - `ebs_volume_size`
|
||
- Integer
|
||
- The size of each EBS volume (in GiB) launched for each instance. For general purpose SSD, this value must be within the range 100 - 4096. For throughput optimized HDD, this value must be within the range 500 - 4096.
|
||
|
||
* - `ebs_volume_throughput`
|
||
- Integer
|
||
- If using gp3 volumes, what throughput to use for the disk. If this is not set, the maximum performance of a gp2 volume with the same volume size will be used.
|
||
|
||
* - `ebs_volume_type`
|
||
- String
|
||
- The type of EBS volumes that will be launched with this cluster.
|
||
|
||
* - `first_on_demand`
|
||
- Integer
|
||
- The first `first_on_demand` nodes of the cluster will be placed on on-demand instances. If this value is greater than 0, the cluster driver node in particular will be placed on an on-demand instance. If this value is greater than or equal to the current cluster size, all nodes will be placed on on-demand instances. If this value is less than the current cluster size, `first_on_demand` nodes will be placed on on-demand instances and the remainder will be placed on `availability` instances. Note that this value does not affect cluster size and cannot currently be mutated over the lifetime of a cluster.
|
||
|
||
* - `instance_profile_arn`
|
||
- String
|
||
- Nodes for this cluster will only be placed on AWS instances with this instance profile. If omitted, nodes will be placed on instances without an IAM instance profile. The instance profile must have previously been added to the Databricks environment by an account administrator. This feature may only be available to certain customer plans. If this field is omitted, we will pull in the default from the conf if it exists.
|
||
|
||
* - `spot_bid_price_percent`
|
||
- Integer
|
||
- The bid price for AWS spot instances, as a percentage of the corresponding instance type's on-demand price. For example, if this field is set to 50, and the cluster needs a new `r3.xlarge` spot instance, then the bid price is half of the price of on-demand `r3.xlarge` instances. Similarly, if this field is set to 200, the bid price is twice the price of on-demand `r3.xlarge` instances. If not specified, the default value is 100. When spot instances are requested for this cluster, only spot instances whose bid price percentage matches this field will be considered. Note that, for safety, we enforce this field to be no more than 10000. The default value and documentation here should be kept consistent with CommonConf.defaultSpotBidPricePercent and CommonConf.maxSpotBidPricePercent.
|
||
|
||
* - `zone_id`
|
||
- String
|
||
- Identifier for the availability zone/datacenter in which the cluster resides. This string will be of a form like "us-west-2a". The provided availability zone must be in the same region as the Databricks deployment. For example, "us-west-2a" is not a valid zone ID if the Databricks deployment resides in the "us-east-1" region. This is an optional field at cluster creation, and if not specified, a default zone will be used. If the zone specified is "auto", Databricks will try to place the cluster in a zone with high availability, and will retry placement in a different AZ if there is not enough capacity. The list of available zones as well as the default value can be found by using the `List Zones` method.
|
||
|
||
|
||
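To make the interaction between `first_on_demand` and `availability` concrete, the sketch below requests four workers, so the cluster has five nodes in total. With `first_on_demand: 1`, the driver is placed on an on-demand instance and the remaining four nodes use the `availability` type. `SPOT_WITH_FALLBACK` is assumed to be a valid `availability` value; the other values are examples.

```yaml
resources:
  pipelines:
    hello-pipeline:
      clusters:
        - label: default
          num_workers: 4
          aws_attributes:
            first_on_demand: 1                # driver node on an on-demand instance
            availability: SPOT_WITH_FALLBACK  # assumed value name for the remaining nodes
            spot_bid_price_percent: 100       # bid at the on-demand price (the default)
            zone_id: auto                     # let Databricks choose a zone with capacity
```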
### pipelines.<name>.clusters.azure_attributes
|
||
|
||
**`Type: Map`**
|
||
|
||
Attributes related to clusters running on Microsoft Azure. If not specified at cluster creation, a set of default values will be used.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `availability`
|
||
- String
|
||
- Availability type used for all subsequent nodes past the `first_on_demand` ones. Note: If `first_on_demand` is zero (which only happens on pool clusters), this availability type will be used for the entire cluster.
|
||
|
||
* - `first_on_demand`
|
||
- Integer
|
||
- The first `first_on_demand` nodes of the cluster will be placed on on-demand instances. This value should be greater than 0, to make sure the cluster driver node is placed on an on-demand instance. If this value is greater than or equal to the current cluster size, all nodes will be placed on on-demand instances. If this value is less than the current cluster size, `first_on_demand` nodes will be placed on on-demand instances and the remainder will be placed on `availability` instances. Note that this value does not affect cluster size and cannot currently be mutated over the lifetime of a cluster.
|
||
|
||
* - `log_analytics_info`
|
||
- Map
|
||
- Defines values necessary to configure and run Azure Log Analytics agent. See [_](#pipelines.<name>.clusters.azure_attributes.log_analytics_info).
|
||
|
||
* - `spot_bid_max_price`
|
||
- Any
|
||
- The max bid price to be used for Azure spot instances. The max price for the bid cannot be higher than the on-demand price of the instance. If not specified, the default value is -1, which specifies that the instance cannot be evicted on the basis of price, and only on the basis of availability. The value must be either greater than 0 or equal to -1.
|
||
|
||
|
||
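A comparable sketch for Azure, assuming `SPOT_WITH_FALLBACK_AZURE` is a valid `availability` value. Setting `spot_bid_max_price` to -1 restates the default described above, under which instances are evicted only on the basis of availability and not price.

```yaml
resources:
  pipelines:
    hello-pipeline:
      clusters:
        - label: default
          azure_attributes:
            first_on_demand: 1                     # keep the driver on an on-demand instance
            availability: SPOT_WITH_FALLBACK_AZURE # assumed value name
            spot_bid_max_price: -1                 # never evict on the basis of price
```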
### pipelines.<name>.clusters.azure_attributes.log_analytics_info
|
||
|
||
**`Type: Map`**
|
||
|
||
Defines values necessary to configure and run the Azure Log Analytics agent.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `log_analytics_primary_key`
|
||
- String
|
||
- <needs content added>
|
||
|
||
* - `log_analytics_workspace_id`
|
||
- String
|
||
- <needs content added>
|
||
|
||
|
||
### pipelines.<name>.clusters.cluster_log_conf
|
||
|
||
**`Type: Map`**
|
||
|
||
The configuration for delivering Spark logs to a long-term storage destination. Only dbfs destinations are supported. Only one destination can be specified for one cluster. If the conf is given, the logs will be delivered to the destination every `5 mins`. The destination of driver logs is `$destination/$clusterId/driver`, while the destination of executor logs is `$destination/$clusterId/executor`.
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `dbfs`
|
||
- Map
|
||
- destination needs to be provided. e.g. `{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`. See [_](#pipelines.<name>.clusters.cluster_log_conf.dbfs).
|
||
|
||
* - `s3`
|
||
- Map
|
||
- destination and either the region or endpoint need to be provided. e.g. `{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`. The cluster IAM role is used to access S3, so make sure the cluster IAM role in `instance_profile_arn` has permission to write data to the S3 destination. See [_](#pipelines.<name>.clusters.cluster_log_conf.s3).
|
||
|
||
|
||
### pipelines.<name>.clusters.cluster_log_conf.dbfs
|
||
|
||
**`Type: Map`**
|
||
|
||
destination needs to be provided. e.g.
|
||
`{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- dbfs destination, e.g. `dbfs:/my/path`
|
||
|
||
|
||
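The JSON snippet above translates directly into bundle YAML. A minimal sketch, reusing the same example destination:

```yaml
resources:
  pipelines:
    hello-pipeline:
      clusters:
        - label: default
          cluster_log_conf:
            dbfs:
              destination: "dbfs:/home/cluster_log"
```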
### pipelines.<name>.clusters.cluster_log_conf.s3
|
||
|
||
**`Type: Map`**
|
||
|
||
destination and either the region or endpoint need to be provided. e.g. `{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`. The cluster IAM role is used to access S3, so make sure the cluster IAM role in `instance_profile_arn` has permission to write data to the S3 destination.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `canned_acl`
|
||
- String
|
||
- (Optional) Set canned access control list for the logs, e.g. `bucket-owner-full-control`. If `canned_acl` is set, please make sure the cluster IAM role has `s3:PutObjectAcl` permission on the destination bucket and prefix. The full list of possible canned ACLs can be found at http://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl. Please also note that by default only the object owner gets full control. If you are using a cross-account role for writing data, you may want to set `bucket-owner-full-control` so that the bucket owner is able to read the logs.
|
||
|
||
* - `destination`
|
||
- String
|
||
- S3 destination, e.g. `s3://my-bucket/some-prefix`. Note that logs will be delivered using the cluster IAM role, so make sure you set a cluster IAM role and that the role has write access to the destination. Please also note that you cannot use AWS keys to deliver logs.
|
||
|
||
* - `enable_encryption`
|
||
- Boolean
|
||
- (Optional) Flag to enable server side encryption, `false` by default.
|
||
|
||
* - `encryption_type`
|
||
- String
|
||
- (Optional) The encryption type, either `sse-s3` or `sse-kms`. It is used only when encryption is enabled; the default type is `sse-s3`.
|
||
|
||
* - `endpoint`
|
||
- String
|
||
- S3 endpoint, e.g. `https://s3-us-west-2.amazonaws.com`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
|
||
|
||
* - `kms_key`
|
||
- String
|
||
- (Optional) KMS key that is used if encryption is enabled and the encryption type is set to `sse-kms`.
|
||
|
||
* - `region`
|
||
- String
|
||
- S3 region, e.g. `us-west-2`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
|
||
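As a rough illustration of the `s3` mapping above, the following sketch delivers cluster logs for a pipeline cluster to S3 with server-side encryption. The pipeline name `my_pipeline`, the bucket, and the prefix are placeholders chosen for this example, not values defined by this reference.

```yaml
resources:
  pipelines:
    my_pipeline:                      # hypothetical pipeline name
      clusters:
        - cluster_log_conf:
            s3:
              destination: s3://cluster_log_bucket/prefix   # placeholder bucket and prefix
              region: us-west-2                              # set either region or endpoint
              enable_encryption: true
              encryption_type: sse-s3                        # the default type when encryption is enabled
              canned_acl: bucket-owner-full-control          # requires s3:PutObjectAcl on the destination
```

Setting `canned_acl` is only useful when another account owns the bucket; it can be omitted otherwise.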
|
||
|
||
### pipelines.<name>.clusters.gcp_attributes
|
||
|
||
**`Type: Map`**
|
||
|
||
Attributes related to clusters running on Google Cloud Platform.
|
||
If not specified at cluster creation, a set of default values will be used.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `availability`
|
||
- String
|
||
- This field determines whether the instance pool will contain preemptible VMs, on-demand VMs, or preemptible VMs with a fallback to on-demand VMs if the former is unavailable.
|
||
|
||
* - `boot_disk_size`
|
||
- Integer
|
||
- boot disk size in GB
|
||
|
||
* - `google_service_account`
|
||
- String
|
||
- If provided, the cluster impersonates the Google service account when accessing Google Cloud services (such as GCS). The Google service account must have previously been added to the Databricks environment by an account administrator.
|
||
|
||
* - `local_ssd_count`
|
||
- Integer
|
||
- If provided, each node (workers and driver) in the cluster will have this number of local SSDs attached. Each local SSD is 375GB in size. Refer to [GCP documentation](https://cloud.google.com/compute/docs/disks/local-ssd#choose_number_local_ssds) for the supported number of local SSDs for each instance type.
|
||
|
||
* - `use_preemptible_executors`
|
||
- Boolean
|
||
- This field determines whether the Spark executors are scheduled to run on preemptible VMs (when set to `true`) or on standard Compute Engine VMs (when set to `false`, the default). Note: This field is soon to be deprecated; use the `availability` field instead.
|
||
|
||
* - `zone_id`
|
||
- String
|
||
- Identifier for the availability zone in which the cluster resides. This can be one of the following: "HA" (the default), which spreads nodes across availability zones for a Databricks deployment region for high availability; "AUTO", where Databricks picks an availability zone to schedule the cluster on; or a specific GCP availability zone, picked from the zones available for the machine type and region listed at https://cloud.google.com/compute/docs/regions-zones.
|
||
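A minimal sketch of a `gcp_attributes` mapping for a pipeline cluster might look like the following. The pipeline name and service account are placeholders, and the `availability` value shown is an assumption about the enum name; confirm the accepted values in the REST API reference before using it.

```yaml
resources:
  pipelines:
    my_pipeline:                      # hypothetical pipeline name
      clusters:
        - gcp_attributes:
            availability: PREEMPTIBLE_WITH_FALLBACK_GCP   # assumed enum value, not confirmed by this reference
            zone_id: AUTO                                  # let Databricks pick the zone
            local_ssd_count: 1                             # one 375GB local SSD per node
            google_service_account: pipeline-sa@my-project.iam.gserviceaccount.com   # placeholder account
```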
|
||
|
||
### pipelines.<name>.clusters.init_scripts
|
||
|
||
**`Type: Sequence`**
|
||
|
||
The configuration for storing init scripts. Any number of destinations can be specified. The scripts are executed sequentially in the order provided. If `cluster_log_conf` is specified, init script logs are sent to `<destination>/<cluster-ID>/init_scripts`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `abfss`
|
||
- Map
|
||
- `destination` must be provided, for example: `{ "abfss" : { "destination" : "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>" } }`. See [_](#pipelines.<name>.clusters.init_scripts.abfss).
|
||
|
||
* - `dbfs`
|
||
- Map
|
||
- `destination` must be provided, for example: `{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`. See [_](#pipelines.<name>.clusters.init_scripts.dbfs).
|
||
|
||
* - `file`
|
||
- Map
|
||
- `destination` must be provided, for example: `{ "file" : { "destination" : "file:/my/local/file.sh" } }`. See [_](#pipelines.<name>.clusters.init_scripts.file).
|
||
|
||
* - `gcs`
|
||
- Map
|
||
- `destination` must be provided, for example: `{ "gcs": { "destination": "gs://my-bucket/file.sh" } }`. See [_](#pipelines.<name>.clusters.init_scripts.gcs).
|
||
|
||
* - `s3`
|
||
- Map
|
||
- Both `destination` and either `region` or `endpoint` must be provided, for example: `{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`. The cluster IAM role is used to access S3; make sure the cluster IAM role in `instance_profile_arn` has permission to write data to the S3 destination. See [_](#pipelines.<name>.clusters.init_scripts.s3).
|
||
|
||
* - `volumes`
|
||
- Map
|
||
- `destination` must be provided, for example: `{ "volumes" : { "destination" : "/Volumes/my-init.sh" } }`. See [_](#pipelines.<name>.clusters.init_scripts.volumes).
|
||
|
||
* - `workspace`
|
||
- Map
|
||
- `destination` must be provided, for example: `{ "workspace" : { "destination" : "/Users/user1@databricks.com/my-init.sh" } }`. See [_](#pipelines.<name>.clusters.init_scripts.workspace).
|
||
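Because `init_scripts` is a sequence and the scripts run in the order given, a configuration with two scripts could be sketched as below. The pipeline name is a placeholder, the script paths reuse the sample values from the table above, and the sketch assumes that each sequence item sets exactly one destination type.

```yaml
resources:
  pipelines:
    my_pipeline:                      # hypothetical pipeline name
      clusters:
        - init_scripts:
            # Runs first: a script stored as a workspace file.
            - workspace:
                destination: /Users/user1@databricks.com/my-init.sh
            # Runs second: a script stored in a Unity Catalog volume.
            - volumes:
                destination: /Volumes/my-init.sh
```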
|
||
|
||
### pipelines.<name>.clusters.init_scripts.abfss
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` must be provided, for example: `{ "abfss" : { "destination" : "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>" } }`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- abfss destination, e.g. `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>`.
|
||
|
||
|
||
### pipelines.<name>.clusters.init_scripts.dbfs
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` must be provided, for example: `{ "dbfs" : { "destination" : "dbfs:/home/cluster_log" } }`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- dbfs destination, e.g. `dbfs:/my/path`
|
||
|
||
|
||
### pipelines.<name>.clusters.init_scripts.file
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` must be provided, for example: `{ "file" : { "destination" : "file:/my/local/file.sh" } }`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- local file destination, e.g. `file:/my/local/file.sh`
|
||
|
||
|
||
### pipelines.<name>.clusters.init_scripts.gcs
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` must be provided, for example: `{ "gcs": { "destination": "gs://my-bucket/file.sh" } }`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- GCS destination/URI, e.g. `gs://my-bucket/some-prefix`
|
||
|
||
|
||
### pipelines.<name>.clusters.init_scripts.s3
|
||
|
||
**`Type: Map`**
|
||
|
||
Both `destination` and either `region` or `endpoint` must be provided, for example: `{ "s3": { "destination" : "s3://cluster_log_bucket/prefix", "region" : "us-west-2" } }`. The cluster IAM role is used to access S3; make sure the cluster IAM role in `instance_profile_arn` has permission to write data to the S3 destination.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `canned_acl`
|
||
- String
|
||
- (Optional) Sets a canned access control list for the logs, e.g. `bucket-owner-full-control`. If `canned_acl` is set, make sure the cluster IAM role has the `s3:PutObjectAcl` permission on the destination bucket and prefix. The full list of canned ACLs can be found at http://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl. By default, only the object owner gets full control. If you use a cross-account role for writing data, you may want to set `bucket-owner-full-control` so that the bucket owner can read the logs.
|
||
|
||
* - `destination`
|
||
- String
|
||
- S3 destination, e.g. `s3://my-bucket/some-prefix`. Logs are delivered using the cluster IAM role, so make sure the cluster IAM role is set and has write access to the destination. You cannot use AWS keys to deliver logs.
|
||
|
||
* - `enable_encryption`
|
||
- Boolean
|
||
- (Optional) Flag to enable server side encryption, `false` by default.
|
||
|
||
* - `encryption_type`
|
||
- String
|
||
- (Optional) The encryption type, either `sse-s3` or `sse-kms`. It is used only when encryption is enabled; the default type is `sse-s3`.
|
||
|
||
* - `endpoint`
|
||
- String
|
||
- S3 endpoint, e.g. `https://s3-us-west-2.amazonaws.com`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
|
||
|
||
* - `kms_key`
|
||
- String
|
||
- (Optional) KMS key that is used if encryption is enabled and the encryption type is set to `sse-kms`.
|
||
|
||
* - `region`
|
||
- String
|
||
- S3 region, e.g. `us-west-2`. Either region or endpoint needs to be set. If both are set, endpoint will be used.
|
||
|
||
|
||
### pipelines.<name>.clusters.init_scripts.volumes
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` must be provided, for example: `{ "volumes" : { "destination" : "/Volumes/my-init.sh" } }`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- Unity Catalog Volumes file destination, e.g. `/Volumes/my-init.sh`
|
||
|
||
|
||
### pipelines.<name>.clusters.init_scripts.workspace
|
||
|
||
**`Type: Map`**
|
||
|
||
`destination` must be provided, for example: `{ "workspace" : { "destination" : "/Users/user1@databricks.com/my-init.sh" } }`.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination`
|
||
- String
|
||
- workspace files destination, e.g. `/Users/user1@databricks.com/my-init.sh`
|
||
|
||
|
||
### pipelines.<name>.deployment
|
||
|
||
**`Type: Map`**
|
||
|
||
Deployment type of this pipeline.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `kind`
|
||
- String
|
||
- The deployment method that manages the pipeline.
|
||
|
||
* - `metadata_file_path`
|
||
- String
|
||
- The path to the file containing metadata about the deployment.
|
||
|
||
|
||
### pipelines.<name>.filters
|
||
|
||
**`Type: Map`**
|
||
|
||
Filters on which Pipeline packages to include in the deployed graph.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `exclude`
|
||
- Sequence
|
||
- Paths to exclude.
|
||
|
||
* - `include`
|
||
- Sequence
|
||
- Paths to include.
|
||
|
||
|
||
### pipelines.<name>.gateway_definition
|
||
|
||
**`Type: Map`**
|
||
|
||
The definition of a gateway pipeline to support change data capture.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `connection_id`
|
||
- String
|
||
- [Deprecated, use connection_name instead] Immutable. The Unity Catalog connection that this gateway pipeline uses to communicate with the source.
|
||
|
||
* - `connection_name`
|
||
- String
|
||
- Immutable. The Unity Catalog connection that this gateway pipeline uses to communicate with the source.
|
||
|
||
* - `gateway_storage_catalog`
|
||
- String
|
||
- Required, Immutable. The name of the catalog for the gateway pipeline's storage location.
|
||
|
||
* - `gateway_storage_name`
|
||
- String
|
||
- Optional. The Unity Catalog-compatible name for the gateway storage location. This is the destination for the data that the gateway extracts. The Delta Live Tables system automatically creates the storage location under the catalog and schema.
|
||
|
||
* - `gateway_storage_schema`
|
||
- String
|
||
- Required, Immutable. The name of the schema for the gateway pipeline's storage location.
|
||
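Putting the keys above together, a gateway pipeline definition could be sketched as follows. All names here (`my_gateway_pipeline`, the connection, catalog, schema, and storage names) are hypothetical placeholders for this illustration.

```yaml
resources:
  pipelines:
    my_gateway_pipeline:                            # hypothetical pipeline name
      gateway_definition:
        connection_name: my_sqlserver_connection    # placeholder Unity Catalog connection
        gateway_storage_catalog: main               # required, immutable
        gateway_storage_schema: ingestion           # required, immutable
        gateway_storage_name: my_gateway_storage    # optional
```

`connection_name` is used here rather than the deprecated `connection_id`.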
|
||
|
||
### pipelines.<name>.ingestion_definition
|
||
|
||
**`Type: Map`**
|
||
|
||
The configuration for a managed ingestion pipeline. These settings cannot be used with the `libraries`, `target`, or `catalog` settings.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `connection_name`
|
||
- String
|
||
- Immutable. The Unity Catalog connection that this ingestion pipeline uses to communicate with the source. This is used with connectors for applications like Salesforce, Workday, and so on.
|
||
|
||
* - `ingestion_gateway_id`
|
||
- String
|
||
- Immutable. Identifier for the gateway that is used by this ingestion pipeline to communicate with the source database. This is used with connectors to databases like SQL Server.
|
||
|
||
* - `objects`
|
||
- Sequence
|
||
- Required. Settings specifying tables to replicate and the destination for the replicated tables. See [_](#pipelines.<name>.ingestion_definition.objects).
|
||
|
||
* - `table_configuration`
|
||
- Map
|
||
- Configuration settings to control the ingestion of tables. These settings are applied to all tables in the pipeline. See [_](#pipelines.<name>.ingestion_definition.table_configuration).
|
||
|
||
|
||
### pipelines.<name>.ingestion_definition.objects
|
||
|
||
**`Type: Sequence`**
|
||
|
||
Required. Settings specifying tables to replicate and the destination for the replicated tables.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `report`
|
||
- Map
|
||
- Select a specific source report. See [_](#pipelines.<name>.ingestion_definition.objects.report).
|
||
|
||
* - `schema`
|
||
- Map
|
||
- Select all tables from a specific source schema. See [_](#pipelines.<name>.ingestion_definition.objects.schema).
|
||
|
||
* - `table`
|
||
- Map
|
||
- Select a specific source table. See [_](#pipelines.<name>.ingestion_definition.objects.table).
|
||
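To show how the `objects` sequence is shaped, the sketch below replicates one whole source schema and one individual table. The pipeline name, connection name, and all catalog, schema, and table names are placeholders; which source fields are needed depends on the connector, so treat this as an illustration rather than a working configuration.

```yaml
resources:
  pipelines:
    my_ingestion_pipeline:                          # hypothetical pipeline name
      ingestion_definition:
        connection_name: my_source_connection       # placeholder Unity Catalog connection
        objects:
          # Replicate every table in a source schema.
          - schema:
              source_schema: sales
              destination_catalog: main
              destination_schema: bronze_sales
          # Replicate a single source table.
          - table:
              source_schema: hr
              source_table: employees
              destination_catalog: main
              destination_schema: bronze_hr
              destination_table: employees_raw      # optional; defaults to the source table name
```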
|
||
|
||
### pipelines.<name>.ingestion_definition.objects.report
|
||
|
||
**`Type: Map`**
|
||
|
||
Select a specific source report.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination_catalog`
|
||
- String
|
||
- Required. Destination catalog to store table.
|
||
|
||
* - `destination_schema`
|
||
- String
|
||
- Required. Destination schema to store table.
|
||
|
||
* - `destination_table`
|
||
- String
|
||
- Required. Destination table name. The pipeline fails if a table with that name already exists.
|
||
|
||
* - `source_url`
|
||
- String
|
||
- Required. Report URL in the source system.
|
||
|
||
* - `table_configuration`
|
||
- Map
|
||
- Configuration settings to control the ingestion of tables. These settings override the table_configuration defined in the IngestionPipelineDefinition object. See [_](#pipelines.<name>.ingestion_definition.objects.report.table_configuration).
|
||
|
||
|
||
### pipelines.<name>.ingestion_definition.objects.report.table_configuration
|
||
|
||
**`Type: Map`**
|
||
|
||
Configuration settings to control the ingestion of tables. These settings override the table_configuration defined in the IngestionPipelineDefinition object.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `primary_keys`
|
||
- Sequence
|
||
- The primary key of the table used to apply changes.
|
||
|
||
* - `salesforce_include_formula_fields`
|
||
- Boolean
|
||
- If true, formula fields defined in the table are included in the ingestion. This setting is only valid for the Salesforce connector.
|
||
|
||
* - `scd_type`
|
||
- String
|
||
- The SCD type to use to ingest the table.
|
||
|
||
* - `sequence_by`
|
||
- Sequence
|
||
- The column names specifying the logical order of events in the source data. Delta Live Tables uses this sequencing to handle change events that arrive out of order.
|
||
|
||
|
||
### pipelines.<name>.ingestion_definition.objects.schema
|
||
|
||
**`Type: Map`**
|
||
|
||
Select all tables from a specific source schema.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination_catalog`
|
||
- String
|
||
- Required. Destination catalog to store tables.
|
||
|
||
* - `destination_schema`
|
||
- String
|
||
- Required. Destination schema to store tables in. Tables with the same name as the source tables are created in this destination schema. The pipeline fails if a table with the same name already exists.
|
||
|
||
* - `source_catalog`
|
||
- String
|
||
- The source catalog name. Might be optional depending on the type of source.
|
||
|
||
* - `source_schema`
|
||
- String
|
||
- Required. Schema name in the source database.
|
||
|
||
* - `table_configuration`
|
||
- Map
|
||
- Configuration settings to control the ingestion of tables. These settings are applied to all tables in this schema and override the table_configuration defined in the IngestionPipelineDefinition object. See [_](#pipelines.<name>.ingestion_definition.objects.schema.table_configuration).
|
||
|
||
|
||
### pipelines.<name>.ingestion_definition.objects.schema.table_configuration
|
||
|
||
**`Type: Map`**
|
||
|
||
Configuration settings to control the ingestion of tables. These settings are applied to all tables in this schema and override the table_configuration defined in the IngestionPipelineDefinition object.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `primary_keys`
|
||
- Sequence
|
||
- The primary key of the table used to apply changes.
|
||
|
||
* - `salesforce_include_formula_fields`
|
||
- Boolean
|
||
- If true, formula fields defined in the table are included in the ingestion. This setting is only valid for the Salesforce connector.
|
||
|
||
* - `scd_type`
|
||
- String
|
||
- The SCD type to use to ingest the table.
|
||
|
||
* - `sequence_by`
|
||
- Sequence
|
||
- The column names specifying the logical order of events in the source data. Delta Live Tables uses this sequencing to handle change events that arrive out of order.
|
||
|
||
|
||
### pipelines.<name>.ingestion_definition.objects.table
|
||
|
||
**`Type: Map`**
|
||
|
||
Select a specific source table.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `destination_catalog`
|
||
- String
|
||
- Required. Destination catalog to store table.
|
||
|
||
* - `destination_schema`
|
||
- String
|
||
- Required. Destination schema to store table.
|
||
|
||
* - `destination_table`
|
||
- String
|
||
- Optional. Destination table name. The pipeline fails if a table with that name already exists. If not set, the source table name is used.
|
||
|
||
* - `source_catalog`
|
||
- String
|
||
- Source catalog name. Might be optional depending on the type of source.
|
||
|
||
* - `source_schema`
|
||
- String
|
||
- Schema name in the source database. Might be optional depending on the type of source.
|
||
|
||
* - `source_table`
|
||
- String
|
||
- Required. Table name in the source database.
|
||
|
||
* - `table_configuration`
|
||
- Map
|
||
- Configuration settings to control the ingestion of tables. These settings override the table_configuration defined in the IngestionPipelineDefinition object and the SchemaSpec. See [_](#pipelines.<name>.ingestion_definition.objects.table.table_configuration).
|
||
|
||
|
||
### pipelines.<name>.ingestion_definition.objects.table.table_configuration
|
||
|
||
**`Type: Map`**
|
||
|
||
Configuration settings to control the ingestion of tables. These settings override the table_configuration defined in the IngestionPipelineDefinition object and the SchemaSpec.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `primary_keys`
|
||
- Sequence
|
||
- The primary key of the table used to apply changes.
|
||
|
||
* - `salesforce_include_formula_fields`
|
||
- Boolean
|
||
- If true, formula fields defined in the table are included in the ingestion. This setting is only valid for the Salesforce connector.
|
||
|
||
* - `scd_type`
|
||
- String
|
||
- The SCD type to use to ingest the table.
|
||
|
||
* - `sequence_by`
|
||
- Sequence
|
||
- The column names specifying the logical order of events in the source data. Delta Live Tables uses this sequencing to handle change events that arrive out of order.
|
||
|
||
|
||
### pipelines.<name>.ingestion_definition.table_configuration
|
||
|
||
**`Type: Map`**
|
||
|
||
Configuration settings to control the ingestion of tables. These settings are applied to all tables in the pipeline.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `primary_keys`
|
||
- Sequence
|
||
- The primary key of the table used to apply changes.
|
||
|
||
* - `salesforce_include_formula_fields`
|
||
- Boolean
|
||
- If true, formula fields defined in the table are included in the ingestion. This setting is only valid for the Salesforce connector.
|
||
|
||
* - `scd_type`
|
||
- String
|
||
- The SCD type to use to ingest the table.
|
||
|
||
* - `sequence_by`
|
||
- Sequence
|
||
- The column names specifying the logical order of events in the source data. Delta Live Tables uses this sequencing to handle change events that arrive out of order.
|
||
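The `table_configuration` mapping can appear at the pipeline level and again on individual objects, with the more specific setting taking precedence. The following sketch illustrates that layering. The names are placeholders, and the `scd_type` values `SCD_TYPE_1` and `SCD_TYPE_2` are assumed enum names that should be checked against the REST API reference.

```yaml
resources:
  pipelines:
    my_ingestion_pipeline:                          # hypothetical pipeline name
      ingestion_definition:
        connection_name: my_source_connection       # placeholder Unity Catalog connection
        # Applied to all tables in the pipeline.
        table_configuration:
          scd_type: SCD_TYPE_1                      # assumed enum value
          primary_keys: [id]
        objects:
          - table:
              source_schema: sales
              source_table: orders
              destination_catalog: main
              destination_schema: bronze_sales
              # Overrides the pipeline-level settings for this table only.
              table_configuration:
                scd_type: SCD_TYPE_2                # assumed enum value
                sequence_by: [updated_at]
```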
|
||
|
||
### pipelines.<name>.libraries
|
||
|
||
**`Type: Sequence`**
|
||
|
||
Libraries or code needed by this deployment.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `file`
|
||
- Map
|
||
- The path to a file that defines a pipeline and is stored in the Databricks Repos. See [_](#pipelines.<name>.libraries.file).
|
||
|
||
* - `jar`
|
||
- String
|
||
- URI of the jar to be installed. Currently only DBFS is supported.
|
||
|
||
* - `maven`
|
||
- Map
|
||
- Specification of a Maven library to be installed. See [_](#pipelines.<name>.libraries.maven).
|
||
|
||
* - `notebook`
|
||
- Map
|
||
- The path to a notebook that defines a pipeline and is stored in the Databricks workspace. See [_](#pipelines.<name>.libraries.notebook).
|
||
|
||
* - `whl`
|
||
- String
|
||
- URI of the whl to be installed.
|
||
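As an illustration of the `libraries` sequence, the sketch below combines a notebook, a file, and a Maven library. The pipeline name and both paths are hypothetical; the Maven coordinates and exclusion reuse the sample values from this reference.

```yaml
resources:
  pipelines:
    my_pipeline:                                    # hypothetical pipeline name
      libraries:
        - notebook:
            path: /Users/user1@databricks.com/my_pipeline_notebook    # placeholder absolute path
        - file:
            path: /Repos/user1@databricks.com/my_repo/transforms.py   # placeholder absolute path
        - maven:
            coordinates: "org.jsoup:jsoup:1.7.2"
            exclusions:
              - "slf4j:slf4j"                       # quoted so YAML treats it as a plain string
```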
|
||
|
||
### pipelines.<name>.libraries.file
|
||
|
||
**`Type: Map`**
|
||
|
||
The path to a file that defines a pipeline and is stored in the Databricks Repos.
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `path`
|
||
- String
|
||
- The absolute path of the file.
|
||
|
||
|
||
### pipelines.<name>.libraries.maven
|
||
|
||
**`Type: Map`**
|
||
|
||
Specification of a maven library to be installed.
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `coordinates`
|
||
- String
|
||
- Gradle-style maven coordinates. For example: "org.jsoup:jsoup:1.7.2".
|
||
|
||
* - `exclusions`
|
||
- Sequence
|
||
- List of dependencies to exclude. For example: `["slf4j:slf4j", "*:hadoop-client"]`. Maven dependency exclusions: https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html.
|
||
|
||
* - `repo`
|
||
- String
|
||
- Maven repo to install the Maven package from. If omitted, both Maven Central Repository and Spark Packages are searched.
|
||
|
||
|
||
### pipelines.<name>.libraries.notebook
|
||
|
||
**`Type: Map`**
|
||
|
||
The path to a notebook that defines a pipeline and is stored in the Databricks workspace.
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `path`
|
||
- String
|
||
- The absolute path of the notebook.
|
||
|
||
|
||
### pipelines.<name>.notifications
|
||
|
||
**`Type: Sequence`**
|
||
|
||
List of notification settings for this pipeline.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `alerts`
|
||
- Sequence
|
||
- A list of alerts that trigger the sending of notifications to the configured destinations. The supported alerts are `on-update-success` (a pipeline update completes successfully), `on-update-failure` (sent each time a pipeline update fails), `on-update-fatal-failure` (a pipeline update fails with a non-retryable, fatal error), and `on-flow-failure` (a single data flow fails).
|
||
|
||
* - `email_recipients`
|
||
- Sequence
|
||
- A list of email addresses notified when a configured alert is triggered.
|
||
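A notification setting pairs a list of recipients with the alerts they should receive. A minimal sketch, assuming a placeholder pipeline name and email address:

```yaml
resources:
  pipelines:
    my_pipeline:                          # hypothetical pipeline name
      notifications:
        - email_recipients:
            - data-team@example.com       # placeholder address
          alerts:
            - on-update-failure
            - on-flow-failure
```

Because `notifications` is a sequence, additional items can route other alerts to different recipients.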
|
||
|
||
### pipelines.<name>.permissions
|
||
|
||
**`Type: Sequence`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `group_name`
|
||
- String
|
||
- The name of the group that has the permission set in level.
|
||
|
||
* - `level`
|
||
- String
|
||
- The allowed permission for user, group, service principal defined for this permission.
|
||
|
||
* - `service_principal_name`
|
||
- String
|
||
- The name of the service principal that has the permission set in level.
|
||
|
||
* - `user_name`
|
||
- String
|
||
- The name of the user that has the permission set in level.
|
||
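Each item in `permissions` grants one level to one principal. The sketch below is illustrative only: the group and user names are placeholders, and the level names `CAN_VIEW` and `CAN_MANAGE` are assumptions about the accepted values rather than something this reference lists.

```yaml
resources:
  pipelines:
    my_pipeline:                          # hypothetical pipeline name
      permissions:
        - level: CAN_VIEW                 # assumed permission level name
          group_name: users               # placeholder group
        - level: CAN_MANAGE               # assumed permission level name
          user_name: someone@example.com  # placeholder user
```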
|
||
|
||
### pipelines.<name>.restart_window
|
||
|
||
**`Type: Map`**
|
||
|
||
Restart window of this pipeline.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `days_of_week`
|
||
- Sequence
|
||
- Days of the week on which the restart is allowed to happen (within a five-hour window starting at `start_hour`). If not specified, all days of the week are used.
|
||
|
||
* - `start_hour`
|
||
- Integer
|
||
- An integer between 0 and 23 denoting the start hour for the restart window in the 24-hour day. Continuous pipeline restart is triggered only within a five-hour window starting at this hour.
|
||
|
||
* - `time_zone_id`
|
||
- String
|
||
- Time zone id of restart window. See https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-conf-mgmt-set-timezone.html for details. If not specified, UTC will be used.
|
||
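For a continuous pipeline, a restart window limited to weekend nights might be sketched like this. The pipeline name is a placeholder, and the uppercase day names are an assumption about the accepted `days_of_week` values.

```yaml
resources:
  pipelines:
    my_pipeline:                              # hypothetical pipeline name
      continuous: true
      restart_window:
        start_hour: 2                         # window runs from 02:00 to 07:00
        days_of_week: [SATURDAY, SUNDAY]      # assumed value format
        time_zone_id: America/Los_Angeles     # UTC is used if omitted
```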
|
||
|
||
### pipelines.<name>.restart_window.days_of_week
|
||
|
||
**`Type: Sequence`**
|
||
|
||
Days of the week on which the restart is allowed to happen (within a five-hour window starting at `start_hour`). If not specified, all days of the week are used.
|
||
|
||
|
||
### pipelines.<name>.trigger
|
||
|
||
**`Type: Map`**
|
||
|
||
Which pipeline trigger to use. Deprecated: Use `continuous` instead.
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `cron`
|
||
- Map
|
||
- See [_](#pipelines.<name>.trigger.cron).
|
||
|
||
* - `manual`
|
||
- Map
|
||
-
|
||
|
||
|
||
### pipelines.<name>.trigger.cron
|
||
|
||
**`Type: Map`**
|
||
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `quartz_cron_schedule`
|
||
- String
|
||
-
|
||
|
||
* - `timezone_id`
|
||
- String
|
||
-
|
||
|
||
|
||
### pipelines.<name>.trigger.manual
|
||
|
||
**`Type: Map`**
|
||
|
||
|
||
|
||
|
||
## quality_monitors
|
||
|
||
**`Type: Map`**
|
||
|
||
The quality_monitor resource allows you to define a <UC> [table monitor](/api/workspace/qualitymonitors/create). For information about monitors, see [_](/machine-learning/model-serving/monitor-diagnose-endpoints.md).
|
||
|
||
```yaml
quality_monitors:
  <quality_monitor-name>:
    <quality_monitor-field-name>: <quality_monitor-field-value>
```
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `assets_dir`
|
||
- String
|
||
- The directory to store monitoring assets (e.g. dashboard, metric tables).
|
||
|
||
* - `baseline_table_name`
|
||
- String
|
||
- Name of the baseline table from which drift metrics are computed. Columns in the monitored table should also be present in the baseline table.
|
||
|
||
* - `custom_metrics`
|
||
- Sequence
|
||
- Custom metrics to compute on the monitored table. These can be aggregate metrics, derived metrics (from already computed aggregate metrics), or drift metrics (comparing metrics across time windows). See [_](#quality_monitors.<name>.custom_metrics).
|
||
|
||
* - `data_classification_config`
|
||
- Map
|
||
- The data classification config for the monitor. See [_](#quality_monitors.<name>.data_classification_config).
|
||
|
||
* - `inference_log`
|
||
- Map
|
||
- Configuration for monitoring inference logs. See [_](#quality_monitors.<name>.inference_log).
|
||
|
||
* - `notifications`
|
||
- Map
|
||
- The notification settings for the monitor. See [_](#quality_monitors.<name>.notifications).
|
||
|
||
* - `output_schema_name`
|
||
- String
|
||
- Schema where output metric tables are created.
|
||
|
||
* - `schedule`
|
||
- Map
|
||
- The schedule for automatically updating and refreshing metric tables. See [_](#quality_monitors.<name>.schedule).
|
||
|
||
* - `skip_builtin_dashboard`
|
||
- Boolean
|
||
- Whether to skip creating a default dashboard summarizing data quality metrics.
|
||
|
||
* - `slicing_exprs`
|
||
- Sequence
|
||
- List of column expressions to slice data with for targeted analysis. The data is grouped by each expression independently, resulting in a separate slice for each predicate and its complements. For high-cardinality columns, only the top 100 unique values by frequency will generate slices.
|
||
|
||
* - `snapshot`
|
||
- Map
|
||
- Configuration for monitoring snapshot tables.
|
||
|
||
* - `table_name`
|
||
- String
|
||
-
|
||
|
||
* - `time_series`
|
||
- Map
|
||
- Configuration for monitoring time series tables. See [_](#quality_monitors.<name>.time_series).
|
||
|
||
* - `warehouse_id`
|
||
- String
|
||
- Optional argument to specify the warehouse for dashboard creation. If not specified, the first running warehouse will be used.
|
||
|
||
|
||
**Example**
|
||
|
||
The following example defines a quality monitor:
|
||
|
||
```yaml
resources:
  quality_monitors:
    my_quality_monitor:
      table_name: dev.mlops_schema.predictions
      output_schema_name: ${bundle.target}.mlops_schema
      assets_dir: /Users/${workspace.current_user.userName}/databricks_lakehouse_monitoring
      inference_log:
        granularities: [1 day]
        model_id_col: model_id
        prediction_col: prediction
        label_col: price
        problem_type: PROBLEM_TYPE_REGRESSION
        timestamp_col: timestamp
      schedule:
        quartz_cron_expression: 0 0 8 * * ? # Run every day at 8am
        timezone_id: UTC
```
|
||
|
||
### quality_monitors.<name>.custom_metrics
|
||
|
||
**`Type: Sequence`**
|
||
|
||
Custom metrics to compute on the monitored table. These can be aggregate metrics, derived metrics (from already computed aggregate metrics), or drift metrics (comparing metrics across time windows).
|
||
|
||
|
||
|
||
|
||
.. list-table::
|
||
:header-rows: 1
|
||
|
||
* - Key
|
||
- Type
|
||
- Description
|
||
|
||
* - `definition`
|
||
- String
|
||
- Jinja template for a SQL expression that specifies how to compute the metric. See [create metric definition](https://docs.databricks.com/en/lakehouse-monitoring/custom-metrics.html#create-definition).
|
||
|
||
* - `input_columns`
|
||
- Sequence
|
||
- A list of column names in the input table the metric should be computed for. Can use ``":table"`` to indicate that the metric needs information from multiple columns.
|
||
|
||
* - `name`
|
||
- String
|
||
- Name of the metric in the output tables.
|
||
|
||
* - `output_data_type`
|
||
- String
|
||
- The output type of the custom metric.
|
||
|
||
* - `type`
|
||
- String
|
||
- Can only be one of ``"CUSTOM_METRIC_TYPE_AGGREGATE"``, ``"CUSTOM_METRIC_TYPE_DERIVED"``, or ``"CUSTOM_METRIC_TYPE_DRIFT"``. The ``"CUSTOM_METRIC_TYPE_AGGREGATE"`` and ``"CUSTOM_METRIC_TYPE_DERIVED"`` metrics are computed on a single table, whereas ``"CUSTOM_METRIC_TYPE_DRIFT"`` metrics compare metrics across the baseline and input tables, or across two consecutive time windows. Aggregate metrics depend only on the existing columns in your table; derived metrics depend on previously computed aggregate metrics; drift metrics depend on previously computed aggregate or derived metrics.
|
||
|
||
|
||
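The three metric types build on one another, which a short configuration can make concrete. The following is a minimal sketch, not a definitive configuration: the monitor key `my_quality_monitor`, the metric names, and the column names `f1` and `f2` are hypothetical. The template variables follow the linked custom metrics documentation, and the `output_data_type` value assumes the JSON form that PySpark's `StructField(...).json()` produces, so verify it against your workspace before relying on it.

```yaml
resources:
  quality_monitors:
    my_quality_monitor:
      # Other monitor settings (table_name, output_schema_name, ...) are omitted.
      custom_metrics:
        # Aggregate: computed directly from columns of the monitored table.
        - name: squared_avg
          type: CUSTOM_METRIC_TYPE_AGGREGATE
          input_columns: [f1, f2]
          definition: "avg(`{{input_column}}` * `{{input_column}}`)"
          # Assumed format: JSON of a Spark StructField.
          output_data_type: '{"name":"output","type":"double","nullable":true,"metadata":{}}'
        # Derived: computed from the aggregate metric defined above.
        - name: root_mean_square
          type: CUSTOM_METRIC_TYPE_DERIVED
          input_columns: [f1, f2]
          definition: "sqrt(squared_avg)"
          output_data_type: '{"name":"output","type":"double","nullable":true,"metadata":{}}'
        # Drift: compares the derived metric across two windows or tables.
        - name: root_mean_square_delta
          type: CUSTOM_METRIC_TYPE_DRIFT
          input_columns: [f1, f2]
          definition: "{{current_df}}.root_mean_square - {{base_df}}.root_mean_square"
          output_data_type: '{"name":"output","type":"double","nullable":true,"metadata":{}}'
```

The definitions are quoted so that YAML treats the Jinja braces as part of a string rather than as a flow mapping.
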
### quality_monitors.<name>.data_classification_config

**`Type: Map`**

The data classification config for the monitor.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `enabled`
     - Boolean
     - Whether data classification is enabled.

### quality_monitors.<name>.inference_log

**`Type: Map`**

Configuration for monitoring inference logs.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `granularities`
     - Sequence
     - Granularities for aggregating data into time windows based on their timestamp. The following static granularities are currently supported: {``"5 minutes"``, ``"30 minutes"``, ``"1 hour"``, ``"1 day"``, ``"<n> week(s)"``, ``"1 month"``, ``"1 year"``}.

   * - `label_col`
     - String
     - Optional column that contains the ground truth for the prediction.

   * - `model_id_col`
     - String
     - Column that contains the ID of the model generating the predictions. By default, metrics are computed per model ID and also across all model IDs.

   * - `prediction_col`
     - String
     - Column that contains the output/prediction from the model.

   * - `prediction_proba_col`
     - String
     - Optional column that contains the prediction probabilities for each class in a classification problem type. Each value in this column should be a map from class label to the prediction probability for a given sample. The column should be of PySpark ``MapType``.

   * - `problem_type`
     - String
     - Problem type the model aims to solve. Determines the type of model-quality metrics that are computed.

   * - `timestamp_col`
     - String
     - Column that contains the timestamps of requests. The column must be either a ``TimestampType`` column or a column whose values can be converted to timestamps through the PySpark ``to_timestamp`` [function](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.to_timestamp.html).

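As an illustration of the optional keys, the following sketch configures an inference log for a classification model. It is a hedged example rather than a reference configuration: the monitor key and all column names are hypothetical, and `PROBLEM_TYPE_CLASSIFICATION` is assumed to be the classification counterpart of the `PROBLEM_TYPE_REGRESSION` value used elsewhere in this article.

```yaml
resources:
  quality_monitors:
    my_classifier_monitor:
      # Other monitor settings (table_name, output_schema_name, ...) are omitted.
      inference_log:
        problem_type: PROBLEM_TYPE_CLASSIFICATION
        timestamp_col: timestamp
        granularities: ["1 hour", "1 day"]
        model_id_col: model_id
        prediction_col: prediction
        # Optional: ground truth column.
        label_col: label
        # Optional: map column from class label to predicted probability.
        prediction_proba_col: prediction_proba
```
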
### quality_monitors.<name>.notifications

**`Type: Map`**

The notification settings for the monitor.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `on_failure`
     - Map
     - Who to send notifications to on monitor failure. See [_](#quality_monitors.<name>.notifications.on_failure).

   * - `on_new_classification_tag_detected`
     - Map
     - Who to send notifications to when new data classification tags are detected. See [_](#quality_monitors.<name>.notifications.on_new_classification_tag_detected).

### quality_monitors.<name>.notifications.on_failure

**`Type: Map`**

Who to send notifications to on monitor failure.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `email_addresses`
     - Sequence
     - The list of email addresses to send the notification to. A maximum of 5 email addresses is supported.

### quality_monitors.<name>.notifications.on_new_classification_tag_detected

**`Type: Map`**

Who to send notifications to when new data classification tags are detected.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `email_addresses`
     - Sequence
     - The list of email addresses to send the notification to. A maximum of 5 email addresses is supported.

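The notification mappings nest as shown in the following sketch. The monitor key and the email addresses are hypothetical placeholders, and each list stays within the documented limit of 5 addresses.

```yaml
resources:
  quality_monitors:
    my_quality_monitor:
      # Other monitor settings (table_name, output_schema_name, ...) are omitted.
      notifications:
        on_failure:
          email_addresses:
            - data-eng@example.com
            - oncall@example.com
        on_new_classification_tag_detected:
          email_addresses:
            - governance@example.com
```
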
### quality_monitors.<name>.schedule

**`Type: Map`**

The schedule for automatically updating and refreshing metric tables.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `pause_status`
     - String
     - Read-only field that indicates whether the schedule is paused.

   * - `quartz_cron_expression`
     - String
     - The expression that determines when to run the monitor. See [examples](https://www.quartz-scheduler.org/documentation/quartz-2.3.0/tutorials/crontrigger.html).

   * - `timezone_id`
     - String
     - The timezone ID (for example, ``"PST"``) in which to evaluate the Quartz expression.

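Quartz cron expressions list their fields in the order seconds, minutes, hours, day of month, month, day of week. As a hedged illustration with a hypothetical monitor key, the following sketch refreshes the metric tables at 6:30 AM UTC on weekdays. `pause_status` is left out because it is read-only.

```yaml
resources:
  quality_monitors:
    my_quality_monitor:
      # Other monitor settings (table_name, output_schema_name, ...) are omitted.
      schedule:
        # sec min hour day-of-month month day-of-week
        quartz_cron_expression: "0 30 6 ? * MON-FRI"
        timezone_id: UTC
```
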
### quality_monitors.<name>.snapshot

**`Type: Map`**

Configuration for monitoring snapshot tables.

### quality_monitors.<name>.time_series

**`Type: Map`**

Configuration for monitoring time series tables.

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `granularities`
     - Sequence
     - Granularities for aggregating data into time windows based on their timestamp. The following static granularities are currently supported: {``"5 minutes"``, ``"30 minutes"``, ``"1 hour"``, ``"1 day"``, ``"<n> week(s)"``, ``"1 month"``, ``"1 year"``}.

   * - `timestamp_col`
     - String
     - Column that contains the timestamps of requests. The column must be either a ``TimestampType`` column or a column whose values can be converted to timestamps through the PySpark ``to_timestamp`` [function](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.to_timestamp.html).

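A time series monitor needs only the two keys in the table above. The following is a minimal sketch in which the monitor key and the column name `event_time` are hypothetical; the granularities are taken from the supported list.

```yaml
resources:
  quality_monitors:
    my_time_series_monitor:
      # Other monitor settings (table_name, output_schema_name, ...) are omitted.
      time_series:
        timestamp_col: event_time
        granularities: ["30 minutes", "1 day"]
```
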
## registered_models

**`Type: Map`**

The registered model resource allows you to define models in <UC>. For information about <UC> [registered models](/api/workspace/registeredmodels/create), see [_](/machine-learning/manage-model-lifecycle/index.md).

```yaml
registered_models:
  <registered_model-name>:
    <registered_model-field-name>: <registered_model-field-value>
```

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `catalog_name`
     - String
     - The name of the catalog where the schema and the registered model reside.

   * - `comment`
     - String
     - The comment attached to the registered model.

   * - `grants`
     - Sequence
     - See [_](#registered_models.<name>.grants).

   * - `name`
     - String
     - The name of the registered model.

   * - `schema_name`
     - String
     - The name of the schema where the registered model resides.

   * - `storage_location`
     - String
     - The cloud storage location under which model version data files are stored.

**Example**

The following example defines a registered model in <UC>:

```yaml
resources:
  registered_models:
    model:
      name: my_model
      catalog_name: ${bundle.target}
      schema_name: mlops_schema
      comment: Registered model in Unity Catalog for ${bundle.target} deployment target
      grants:
        - privileges:
            - EXECUTE
          principal: account users
```

### registered_models.<name>.grants

**`Type: Sequence`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `principal`
     - String
     - The name of the principal that will be granted privileges.

   * - `privileges`
     - Sequence
     - The privileges to grant to the specified entity.

## schemas

**`Type: Map`**

The schema resource type allows you to define <UC> [schemas](/api/workspace/schemas/create) for tables and other assets in your workflows and pipelines created as part of a bundle. Unlike other resource types, a schema has the following limitations:

- The owner of a schema resource is always the deployment user and cannot be changed. If `run_as` is specified in the bundle, it is ignored by operations on the schema.
- Only fields supported by the corresponding [Schemas object create API](/api/workspace/schemas/create) are available for the schema resource. For example, `enable_predictive_optimization` is not supported because it is only available on the [update API](/api/workspace/schemas/update).

```yaml
schemas:
  <schema-name>:
    <schema-field-name>: <schema-field-value>
```

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `catalog_name`
     - String
     - Name of parent catalog.

   * - `comment`
     - String
     - User-provided free-form text description.

   * - `grants`
     - Sequence
     - See [_](#schemas.<name>.grants).

   * - `name`
     - String
     - Name of schema, relative to parent catalog.

   * - `properties`
     - Map
     -

   * - `storage_root`
     - String
     - Storage root URL for managed tables within schema.

**Example**

The following example defines a pipeline with the resource key `my_pipeline` that creates a <UC> schema with the key `my_schema` as the target:

```yaml
resources:
  pipelines:
    my_pipeline:
      name: test-pipeline-{{.unique_id}}
      libraries:
        - notebook:
            path: ./nb.sql
      development: true
      catalog: main
      target: ${resources.schemas.my_schema.id}

  schemas:
    my_schema:
      name: test-schema-{{.unique_id}}
      catalog_name: main
      comment: This schema was created by DABs.
```

A top-level grants mapping is not supported by <DABS>, so if you want to set grants for a schema, define the grants for the schema within the `schemas` mapping. For more information about grants, see [_](/data-governance/unity-catalog/manage-privileges/index.md#grant).

The following example defines a <UC> schema with grants:

```yaml
resources:
  schemas:
    my_schema:
      name: test-schema
      grants:
        - principal: users
          privileges:
            - CAN_MANAGE
        - principal: my_team
          privileges:
            - CAN_READ
      catalog_name: main
```

### schemas.<name>.grants

**`Type: Sequence`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `principal`
     - String
     - The name of the principal that will be granted privileges.

   * - `privileges`
     - Sequence
     - The privileges to grant to the specified entity.

## volumes

**`Type: Map`**

The volume resource type allows you to define and create <UC> [volumes](/api/workspace/volumes/create) as part of a bundle. When deploying a bundle with a volume defined, note that:

- A volume cannot be referenced in the `artifact_path` for the bundle until it exists in the workspace. Hence, if you want to use <DABS> to create the volume, you must first define the volume in the bundle, deploy it to create the volume, and then reference it in the `artifact_path` in subsequent deployments.

- Volumes in the bundle are not prepended with the `dev_${workspace.current_user.short_name}` prefix when the deployment target has `mode: development` configured. However, you can manually configure this prefix. See [_](/dev-tools/bundles/deployment-modes.md#custom-presets).

```yaml
volumes:
  <volume-name>:
    <volume-field-name>: <volume-field-value>
```

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `catalog_name`
     - String
     - The name of the catalog that contains the schema and the volume.

   * - `comment`
     - String
     - The comment attached to the volume.

   * - `grants`
     - Sequence
     - See [_](#volumes.<name>.grants).

   * - `name`
     - String
     - The name of the volume.

   * - `schema_name`
     - String
     - The name of the schema that contains the volume.

   * - `storage_location`
     - String
     - The storage location on the cloud.

   * - `volume_type`
     - String
     -

**Example**

The following example creates a <UC> volume with the key `my_volume`:

```yaml
resources:
  volumes:
    my_volume:
      catalog_name: main
      name: my_volume
      schema_name: my_schema
```

For an example bundle that runs a job that writes to a file in a <UC> volume, see the [bundle-examples GitHub repository](https://github.com/databricks/bundle-examples/tree/main/knowledge_base/write_from_job_to_volume).

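The two-step deployment required for `artifact_path` can be sketched as follows. This is an illustrative configuration, not a prescribed layout: it reuses the `my_volume` definition above, and the `artifacts` subdirectory is a hypothetical choice. Deploy once with only the `resources` block so that the volume exists, then add the `workspace` block for subsequent deployments.

```yaml
# Add this block only after a first deployment has created the volume.
workspace:
  artifact_path: /Volumes/main/my_schema/my_volume/artifacts

resources:
  volumes:
    my_volume:
      catalog_name: main
      name: my_volume
      schema_name: my_schema
```
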
### volumes.<name>.grants

**`Type: Sequence`**

.. list-table::
   :header-rows: 1

   * - Key
     - Type
     - Description

   * - `principal`
     - String
     - The name of the principal that will be granted privileges.

   * - `privileges`
     - Sequence
     - The privileges to grant to the specified entity.
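
As with schemas, grants for a volume are defined inside the volume's own mapping. The following sketch assumes the <UC> volume privileges `READ_VOLUME` and `WRITE_VOLUME`; the principal names are hypothetical placeholders.

```yaml
resources:
  volumes:
    my_volume:
      catalog_name: main
      name: my_volume
      schema_name: my_schema
      grants:
        # Read-only access for a broad group.
        - principal: account users
          privileges:
            - READ_VOLUME
        # Read and write access for a smaller team.
        - principal: my_team
          privileges:
            - READ_VOLUME
            - WRITE_VOLUME
```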