Commit Graph

18 Commits

Author SHA1 Message Date
Denis Bilenko c3b02d9321 Properly read git metadata when inside Workspace
Since there is no .git directory in Workspace file system, we need to make
an API call to fetch git checkout status (root of the repo, current branch, etc).
(api/2.0/workspace/get-status?return_git_info=true).

Refactor Repository to accept repository root rather than calculate it.
This helps, because Repository is currently created in multiple places and
finding the repository root is expensive.
2024-12-02 11:29:06 +01:00
shreyas-goenka b323703c1b
Add validation for single node clusters (#1909)
## Changes
This PR adds a warning validating that the configuration for a single
node cluster is valid for interactive, job, job-task, and pipeline
clusters.

Note: We skip the validation if a cluster policy is configured because
the policy is likely to configure `spark_conf` / `custom_tags` itself.

Note: Terrform originally only had validation for interactive, job, and
job-task clusters. This PR adding the validation for pipeline clusters
as well is new.

This PR follows the same logic as we used to have in Terraform. The
validation was removed from Terraform because we had no way to demote
the error to a warning:
https://github.com/databricks/terraform-provider-databricks/pull/4222

### Background
Single-node clusters require `spark_conf` and `custom_tags` to be
correctly set in the cluster definition for them to function optimally.
The cluster will be created even if incorrectly configured, but its
performance will not be great.

For example, if both `spark_conf` and `custom_tags` are not set and
`num_workers` is 0, then only the driver process will be launched on the
cluster compute instance thus leading to sub-optimal utilization of
available compute resources and no parallelization across worker
processes when processing a spark query.

### Issue

This PR addresses some issues reported in
https://github.com/databricks/cli/issues/1546

## Tests
Unit tests and manually.

Example output of the warning:
```
➜  bundle-playground git:(master) ✗ cli bundle validate
Warning: Single node cluster is not correctly configured
  at resources.pipelines.bar.clusters[0]
  in databricks.yml:29:11

num_workers should be 0 only for single-node clusters. To create a
valid single node cluster please ensure that the following properties
are correctly set in the cluster specification:

  spark_conf:
    spark.databricks.cluster.profile: singleNode
    spark.master: local[*]

  custom_tags:
    ResourceClass: SingleNode
  

Name: foobar
Target: default
Workspace:
  User: shreyas.goenka@databricks.com
  Path: /Workspace/Users/shreyas.goenka@databricks.com/.bundle/foobar/default

Found 1 warning
```
2024-11-22 15:48:09 +00:00
Pieter Noordhuis abfd1713e0
Skip sync warning if no sync paths are defined (#1926)
## Changes

Users can configure the bundle to not synchronize any files with:
```yaml
sync:
  paths: []
```

If it is explicitly configured as an empty list, the validate command
must not warn about not having any files to synchronize. The warning
exists to alert users who are unintentionally not synchronizing any
files (they might have a `.gitignore` pattern that matches everything).

Closes #1663.

## Tests

* New unit test.
2024-11-21 15:03:13 +00:00
Andrew Nester ac71d2e5ce
Fixed adding /Workspace prefix for resource paths (#1866)
## Changes
`/Workspace` prefix needs to be added to `resource_path` as well.

Fixes the issue mentioned here:
https://github.com/databricks/cli/pull/1822#issuecomment-2447697498

Fixes #1867 

## Tests
Added regression test
2024-10-30 17:34:11 +00:00
Andrew Nester f018daf413
Use SetPermissions instead of UpdatePermissions when setting folder permissions based on top-level ones (#1822)
## Changes
Changed to use SetPermissions() to configure the permissions which
remove other permissions on deployment folders.

## Tests
Added unit test
2024-10-29 12:06:38 +00:00
Andrew Nester eaea308254
Added validator for folder permissions (#1824)
## Changes
This validator checks permissions defined in top-level bundle config and
permissions set in workspace for the folders bundle is deployed to. It
raises the warning if the permissions defined in the workspace are not
defined in bundle.

This validator is executed only during `bundle validate` command.

## Tests

```
Warning: untracked permissions apply to target workspace path

The following permissions apply to the workspace folder at "/Workspace/Users/andrew.nester@databricks.com/.bundle/clusters/default" but are not configured in the bundle:
- level: CAN_MANAGE, user_name: andrew.nester@databricks.com
```

---------

Co-authored-by: Pieter Noordhuis <pieter.noordhuis@databricks.com>
2024-10-24 12:36:17 +00:00
Gleb Kanterov 3d9decdda9
Add JobTaskClusterSpec validate mutator (#1784)
## Changes
Add JobTaskClusterSpec validate mutator. It catches the case when tasks
don't which cluster to use.

For example, we can get this error with minor modifications to
`default-python` template:

```yaml
      tasks:
        - task_key: python_file_task
          spark_python_task:
            python_file: ../src/my_project_10/main.py
```

```
 % databricks bundle validate
Error: Missing required cluster or environment settings
  at resources.jobs.my_project_10_job.tasks[0]
  in resources/my_project_10_job.yml:17:11

Task "print_github_stars" requires a cluster or an environment to run.
Specify one of the following fields: job_cluster_key, environment_key, existing_cluster_id, new_cluster.
```

We implicitly rely on "one of" validation, which does not exist. Many
bundle fields can't co-exist, for instance, specifying:
`JobTask.{existing_cluster_id,job_cluster_key}`, `Library.{whl,pypi}`,
`JobTask.{notebook_task,python_wheel_task}`, etc.

## Tests
Unit tests

---------

Co-authored-by: Pieter Noordhuis <pcnoordhuis@gmail.com>
2024-09-25 11:30:14 +00:00
Pieter Noordhuis ceefa80d72
Pass copy of `dyn.Path` to callback function (#1747)
## Changes

Some call sites hold on to the `dyn.Path` provided to them by the
callback. It must therefore never be mutated after the callback returns,
or these mutations leak out into unknown scope.

This change means it is no longer possible for this failure mode to
happen.

## Tests

Unit test.
2024-09-05 11:05:16 +00:00
shreyas-goenka 242d4b51ed
Report all empty resources present in error diagnostic (#1685)
## Changes
This PR addressed post-merge feedback from
https://github.com/databricks/cli/pull/1673.

## Tests
Unit tests, and manually.
```
Error: experiment undefined-experiment is not defined
  at resources.experiments.undefined-experiment
  in databricks.yml:11:26

Error: job undefined-job is not defined
  at resources.jobs.undefined-job
  in databricks.yml:6:19

Error: pipeline undefined-pipeline is not defined
  at resources.pipelines.undefined-pipeline
  in databricks.yml:14:24

Name: undefined-job
Target: default

Found 3 errors
```
2024-08-20 00:22:00 +00:00
Pieter Noordhuis 7de7583b37
Make fileset take optional list of paths to list (#1684)
## Changes

Before this change, the fileset library would take a single root path
and list all files in it. To support an allowlist of paths to list (much
like a Git `pathspec` without patterns; see [pathspec](pathspec)), this
change introduces an optional argument to `fileset.New` where the caller
can specify paths to list. If not specified, this argument defaults to
list `.` (i.e. list all files in the root).

The motivation for this change is that we wish to expose this pattern in
bundles. Users should be able to specify which paths to synchronize
instead of always only synchronizing the bundle root directory.

[pathspec]:
https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefpathspecapathspec

## Tests

New and existing unit tests.
2024-08-19 15:15:14 +00:00
shreyas-goenka 7ae80de351
Stop tracking file path locations in bundle resources (#1673)
## Changes
Since locations are already tracked in the dynamic value tree, we no
longer need to track it at the resource/artifact level. This PR:
1. Removes use of `paths.Paths`. Uses dyn.Location instead.
2. Refactors the validation of resources not being empty valued to be
generic across all resource types.
  
## Tests
Existing unit tests.
2024-08-13 12:50:15 +00:00
Andrew Nester 1fb8e324d5
Added test for negation pattern in sync include exclude section (#1637)
## Changes
Added test for negation pattern in sync include exclude section
2024-07-31 13:42:23 +00:00
shreyas-goenka a52b188e99
Use dynamic walking to validate unique resource keys (#1614)
## Changes
This PR:
1. Uses dynamic walking (via the `dyn.MapByPattern` func) to validate no
two resources have the same resource key. The allows us to remove this
validation at merge time.
2. Modifies `dyn.Mapping` to always return a sorted slice of pairs. This
makes traversal functions like `dyn.Walk` or `dyn.MapByPattern`
deterministic.

## Tests
Unit tests. Also manually.
2024-07-29 13:04:02 +00:00
shreyas-goenka 37b9df96e6
Support multiple paths for diagnostics (#1616)
## Changes
Some diagnostics can have multiple paths associated with them. For
instance, ensuring that unique resource keys are used across all
resources. This PR extends `diag.Diagnostic` to accept multiple paths.

This PR is symmetrical to
https://github.com/databricks/cli/pull/1610/files

## Tests
Unit tests
2024-07-25 15:16:27 +00:00
shreyas-goenka 4bf88b4209
Support multiple locations for diagnostics (#1610)
## Changes
This PR changes `diag.Diagnostics` to allow including multiple locations
associated with the diagnostic message. The diagnostics that now return
multiple locations with this PR are:
1. Warning for unknown keys in config.
2. Use of experimental.run_as
3. Accidental sync.exludes that exclude all files.

## Tests
Existing unit tests pass. New unit test case to assert on error message
when multiple locations are included.

Example output:
```
➜  bundle-playground-2 ~/cli2/cli/cli bundle validate              
Warning: You are using the legacy mode of run_as. The support for this mode is experimental and might be removed in a future release of the CLI. In order to run the DLT pipelines in your DAB as the run_as user this mode changes the owners of the pipelines to the run_as identity, which requires the user deploying the bundle to be a workspace admin, and also a Metastore admin if the pipeline target is in UC.
  at experimental.use_legacy_run_as
  in resources.yml:10:22
     databricks.yml:13:22

Name: fix run_if
Target: default
Workspace:
  User: shreyas.goenka@databricks.com
  Path: /Users/shreyas.goenka@databricks.com/.bundle/fix run_if/default

Found 1 warning
```
2024-07-23 17:20:11 +00:00
Pieter Noordhuis b3c044c461
Use `vfs.Path` for filesystem interaction (#1554)
## Changes

Note: this doesn't cover _all_ filesystem interaction.

To intercept calls where read or stat files to determine their type, we
need a layer between our code and the `os` package calls that interact
with the local file system. Interception is necessary to accommodate
differences between a regular local file system and the FUSE-mounted
Workspace File System when running the CLI on DBR.

This change makes use of #1452 in the bundle struct.

It uses #1525 to access the bundle variable in path rewriting.

## Tests

* Unit tests pass.
* Integration tests pass.
2024-07-03 10:13:22 +00:00
Pieter Noordhuis 424499ec1d
Abstract over filesystem interaction with libs/vfs (#1452)
## Changes

Introduce `libs/vfs` for an implementation of `fs.FS` and friends that
_includes_ the absolute path it is anchored to.

This is needed for:
1. Intercepting file operations to inject custom logic (e.g., logging,
access control).
2. Traversing directories to find specific leaf directories (e.g.,
`.git`).
3. Converting virtual paths to OS-native paths.

Options 2 and 3 are not possible with the standard `fs.FS` interface.
They are needed such that we can provide an instance to the sync package
and still detect the containing `.git` directory and convert paths to
native paths.

This change focuses on making the following packages use `vfs.Path`:
* libs/fileset
* libs/git
* libs/sync

All entries returned by `fileset.All` are now slash-separated. This has
2 consequences:
* The sync snapshot now always uses slash-separated paths
* We don't need to call `filepath.FromSlash` as much as we did

## Tests

* All unit tests pass
* All integration tests pass
* Manually confirmed that a deployment made on Windows by a previous
version of the CLI can be deployed by a new version of the CLI while
retaining the validity of the local sync snapshot as well as the remote
deployment state.
2024-05-30 07:41:50 +00:00
Andrew Nester 27f51c760f
Added validate mutator to surface additional bundle warnings (#1352)
## Changes
All these validators will return warnings as part of `bundle validate`
run

Added 2 mutators: 
1. To check that if tasks use job_cluster_key it is actually defined
2. To check if there are any files to sync as part of deployment

Also added `bundle.Parallel` to run them in parallel

To make sure mutators under bundle.Parallel do not mutate config,
introduced new `ReadOnlyMutator`, `ReadOnlyBundle` and `ReadOnlyConfig`.

Example 

```
databricks bundle validate -p deco-staging
Warning: unknown field: new_cluster
  at resources.jobs.my_job
  in bundle.yml:24:7

Warning: job_cluster_key high_cpu_workload_job_cluster is not defined
  at resources.jobs.my_job.tasks[0].job_cluster_key
  in bundle.yml:35:28

Warning: There are no files to sync, please check your your .gitignore and sync.exclude configuration
  at sync.exclude
  in bundle.yml:18:5

Name: test
Target: default
Workspace:
  Host: https://acme.databricks.com
  User: andrew.nester@databricks.com
  Path: /Users/andrew.nester@databricks.com/.bundle/test/default

Found 3 warnings
```

## Tests
Added unit tests
2024-04-18 15:13:16 +00:00