## Changes
The host URL for databricks workspaces includes the workspaceId by
default as a positional arg. Eg:
https://e2-dogfood.staging.cloud.databricks.com/?o=1234
Thus a user can't simply copy paste the URL today to the auth login
command. They'll see a runtime error:
```
➜ cli git:(main) ✗ databricks auth login --host https://e2-dogfood.staging.cloud.databricks.com/\?o\=xxx --profile new-dg
Error: oidc: fetch .well-known: failed to unmarshal response body: invalid character '<' looking for beginning of value. This is likely a bug in the Databricks SDK for Go or the underlying REST API. Please report this issue with the following debugging information to the SDK issue tracker at https://github.com/databricks/databricks-sdk-go/issues. Request log:
GET /login.html
...
```
## Tests
Unit tests and manually. Now auth login works even when the workspace_id
is included in the URL.
## Changes
This package can be used to marshal a `dyn.Value` as JSON and retain the
ordering of keys in a mapping. Unlike the default behavior of
`json.Marshal,` the output does not encode HTML characters.
Otherwise, this is no different from using `JSON.Marshal` with
`v.AsAny().`
## Tests
Unit tests.
## Changes
Test failures indicate that both stdout and stderr are consumed, yet the
content of stdout doesn't end up in the intended output. This can happen
if the goroutines responsible for writing to the combined output buffer
attempt to write to the same underlying buffer concurrently.
Example failure:
```
=== RUN TestBackgroundCombinedOutput
background_test.go:65:
Error Trace: D:/a/cli/cli/libs/process/background_test.go:65
Error: elements differ
extra elements in list A:
([]interface {}) (len=1) {
(string) (len=1) "2"
}
listA:
([]string) (len=2) {
(string) (len=1) "1",
(string) (len=1) "2"
}
listB:
([]string) (len=1) {
(string) (len=1) "1"
}
Test: TestBackgroundCombinedOutput
```
With the test body:
ca45e53f42/libs/process/background_test.go (L48-L66)
With the implementation of `WithCombinedOutput`:
ca45e53f42/libs/process/opts.go (L72-L78)
Notice that `c.Stdout` does get the "2", or the test failure would have
included the relevant assertion error. This leads me to believe that
there is a race on writing to `buf` from the two goroutines writing to
`c.Stdout` and `c.Stderr`.
## Tests
The test passes. If this PR has the intended effect remains to be
seen...
Change the default-python template to not set the `catalog` field for
the pipeline for workspaces that set `hive_metastore` as the default
catalog. The Pipelines service currently returns an error when that
value is used for the `catalog` field.
This is the most simple fix for this issue, which was reported by a
customer. As a followup, we should look at whether we want to prompt for
a catalog instead, possibly just for this specific scenario.
## Changes
This change allows the `sync` command to work from [git
worktrees](https://git-scm.com/docs/git-worktree).
## Tests
* Added unit tests for traversal of worktree related files.
* Manually confirmed that synchronization of files from a main checkout,
as well as a worktree, observed the same ignore rules (both locally
defined as well as from `$GIT_DIR/info/exclude`).
---------
Co-authored-by: Pieter Noordhuis <pieter.noordhuis@databricks.com>
## Changes
This file is at `info/exclude`, and not `info/excludes`.
Also see https://git-scm.com/docs/gitignore.
## Tests
Manually confirmed that these ignore patterns are now picked up. I
created a repository with a pattern in this file and ran `sync` to
confirm it ignores files matching the pattern.
## Changes
I took the examples from https://yaml.org/spec/1.2.2.
The required modifications to the loader are:
* Correctly parse floating point infinities and NaN
* Correctly parse octal numbers per the YAML 1.2 spec
* Treat "null" keys in a map as valid
## Tests
Existing and new unit tests pass.
## Changes
The issue reported in #1828 illustrates how using a YAML timestamp-like
value (a date in this case) causes an issue during conversion to and
from the typed configuration tree.
We use the `AsAny()` function on the `dyn.Value` when normalizing for
the `any` type. We only use the `any` type for variable values, because
they can assume every type. The `AsAny()` function returns a `time.Time`
for the time value during conversion **to** the typed configuration
tree. Upon conversion **from** the typed configuration tree back into
the dynamic configuration tree, we cannot distinguish a `time.Time`
struct from any other struct.
To address this, we use the underlying string value of the time value
when we normalize for the `any` type.
Fixes#1828.
## Tests
Existing unit tests pass
## Changes
Fixed unmarshalling json input into `interface{}` type
Commands like `api post` support free form request input, so it should
be unmarshaled correctly
## Tests
Added regression test + E2E test pass
## Changes
Added JSON input validation for CLI commands. Now when invalid JSON
passed as a payload to CLI commands, CLI performs input normalisation
and detects if there are any mismatches such as incorrect types, unknown
fields and etc.
This diagnostic information is printed in standard error output and does
not block command execution, so the change is backward compatible.
Fixes#1769#1764#1625#1560
## Tests
Added unit tests
```
andrew.nester@HFW9Y94129 ~ % databricks jobs create --json '{"seeti}'
Error: error decoding JSON at (inline):1:2: unexpected EOF
andrew.nester@HFW9Y94129 ~ % databricks jobs create --json '{"seeti": true}'
Warning: unknown field: seeti
in (inline):1:9
Error: Job settings must be specified.
```
---------
Co-authored-by: Pieter Noordhuis <pieter.noordhuis@databricks.com>
## Changes
This extends the `{{default_catalog}}` helper in templates to ignore any
`PERMISSION_DENIED` error. We're still reviewing when exactly this error
occurs, but if it does, it should not break templates. We should fall
back to assuming there's no default catalog (and no UC) instead.
## Testing
I have not been able to reproduce this issue, but there is a customer
report about "access denied to clusters that don't have unity catalog
enabled" being returned on a non-UC workspace. The error code in this PR
corresponds to that message.
## Next steps
We'll work together with the UC team to review if this error even makes
sense for this API. If that discussion leads to a behavior change in the
API we can update the CLI code again.
## Changes
The two functions `GetShortUserName` and `IsServicePrincipal` are
unrelated to auth or the purpose of the auth package. This change moves
them into their own package and updates `IsServicePrincipal` to take an
`*iam.User` argument instead of a string username.
## Tests
Tests pass.
## Changes
This adds diagnostics for collaborative (production) deployment
scenarios, including:
- Bob deploys a bundle that is normally deployed by Alice, but this
fails because Bob can't write to `/Users/Alice/.bundle`.
- Charlie deploys a bundle that is normally deployed by Alice, but this
fails because he can't create a new pipeline where Alice would be the
owner.
- Alice deploys a bundle where she didn't list herself as one of the
CAN_MANAGE users in permissions. That can work, but is probably a
mistake.
## Tests
Unit tests, manual testing.
## Changes
We want to encourage a pattern of specifying only a single resource in a
YAML file when the `.(resource-type).yml` extension is used (for
example, `.job.yml`). This convention could allow us to bijectively map
a resource YAML file to its corresponding resource in the Databricks
workspace.
This PR:
1. Emits a recommendation diagnostic when we detect this convention is
being violated. We can promote this to a warning when we want to
encourage this pattern more strongly.
2. Visualises the recommendation diagnostics in the `bundle validate`
command.
**NOTE:** While this PR also shows the recommendation for `.yaml` files,
we do not encourage users to use this extension. We only support it here
since it's part of the YAML standard and some existing users might
already be using `.yaml`.
## Tests
Unit tests and manually. Here's what an example output looks like:
```
Recommendation: define a single job in a file with the .job.yml extension.
at resources.jobs.bar
resources.jobs.foo
in foo.job.yml:13:7
foo.job.yml:5:7
The following resources are defined or configured in this file:
- bar (job)
- foo (job)
```
---------
Co-authored-by: Lennart Kats (databricks) <lennart.kats@databricks.com>
## Changes
Due to platform changes, all libraries, notebooks and etc. paths used in
Databricks must be started with either /Workspace or /Volumes prefix.
This PR makes sure that all bundle paths are correctly prefixed.
Note: this change is a breaking change if user previously configured and
used `/Workspace/Workspace` folder in their workspace file system or
having `/Workspace/${workspace.root_path}...` pattern configured
anywhere in their bundle config
Fixes: #1751
AI:
- [x] Scan DABs config and error out on
`/Workspace/${workspace.root_path}...` pattern usage
## Tests
Added unit tests
---------
Co-authored-by: Pieter Noordhuis <pieter.noordhuis@databricks.com>
## Changes
We want to encourage a pattern of only specifying a single resource in a
YAML file when an `.<resource-type>.yml` (like `.job.yml`) is used. This
convention could allow us to bijectively map a resource YAML file to
it's corresponding resource in the Databricks workspace.
This PR simply makes the built-in templates compliant to this format.
## Tests
Existing tests.
## Changes
- Extract sync output logic from `cmd/sync` into `lib/sync`
- Add hidden `verbose` flag to the `bundle deploy` command, it's false
by default and hidden from the `--help` output
- Pass output handler to the `deploy/files/upload` mutator if the
verbose option is true
The was an idea to use in-place output overriding each past file sync
event in the output, bit that wont work for the extension, since it
doesn't display deploy logs in the terminal.
Example output:
```
~/tmp/defpy: ~/cli/cli bundle deploy --sync-progress
Building defpy...
Uploading defpy-0.0.1+20240917.112755-py3-none-any.whl...
Uploading bundle files to /Users/ilia.babanov@databricks.com/.bundle/defpy/dev/files...
Action: PUT: requirements-dev.txt, resources/defpy_pipeline.yml, pytest.ini, src/defpy/main.py, src/defpy/__init__.py, src/dlt_pipeline.ipynb, tests/main_test.py, src/notebook.ipynb, setup.py, resources/defpy_job.yml, .vscode/extensions.json, .vscode/settings.json, fixtures/.gitkeep, .vscode/__builtins__.pyi, README.md, .gitignore, databricks.yml
Uploaded tests
Uploaded resources
Uploaded fixtures
Uploaded .vscode
Uploaded src/defpy
Uploaded requirements-dev.txt
Uploaded .gitignore
Uploaded fixtures/.gitkeep
Uploaded src/defpy/__init__.py
Uploaded databricks.yml
Uploaded README.md
Uploaded setup.py
Uploaded .vscode/__builtins__.pyi
Uploaded .vscode/extensions.json
Uploaded src/dlt_pipeline.ipynb
Uploaded .vscode/settings.json
Uploaded resources/defpy_job.yml
Uploaded pytest.ini
Uploaded src/defpy/main.py
Uploaded tests/main_test.py
Uploaded resources/defpy_pipeline.yml
Uploaded src/notebook.ipynb
Initial Sync Complete
Deploying resources...
Updating deployment state...
Deployment complete!
```
Output example in the extension:
<img width="1843" alt="Screenshot 2024-09-19 at 11 07 48"
src="https://github.com/user-attachments/assets/0fafd095-cdc6-44b8-b482-27a38ada0330">
## Tests
Manually for the `sync` and `bundle deploy` commands + vscode extension
sync and deploy flows
## Summary
Enables Unity Catalog for pipelines in the default template. Pipelines
will default to non-Unity Catalog pipelines if a catalog is not
specified.
*Small caveat*: there are cases where admins lock down the default
catalog of a workspace and don't allow the creation of a new schema
there. If that happens, the pipeline would fail at runtime with a clear
error indicating what happened. ("PERMISSION_DENIED: User does not have
CREATE SCHEMA on Catalog 'main'."). I've seen this with an internal
Databricks workspace, where creating new non-UC schemas wasn't locked
down, but creation in the `main` was.
## Testing
- Validated on a non-UC + UC workspace. The catalog selection logic here
is the same as applied for the SQL templates.
## Summary
Use the friendly name of service principals when shortening their name.
This change is helpful for the prefix in development mode. Instead of
adding a prefix like `[dev 1706906c-c0a2-4c25-9f57-3a7aa3cb8123]`, we'll
prefix like `[dev my_principal]`.
## Summary
Simplifies template by using the periodic trigger syntax instead of the
cron schedule syntax. Periodic triggers are simpler to configure,
simpler to read, and make sure that workloads are spread out through the
day. We only recommend cron syntax for advanced cases or when more
control is needed.
## Testing
* Templates validation via unit tests
* Manual validation that the new triggers work as expected in dev/prod
## Changes
This PR makes sweeping changes to the way we generate and test the
bundle JSON schema. The main benefits are:
1. More modular JSON schema. Every definition in the schema now is one
level deep and points to references instead of inlining the entire
schema for a field. This unblocks PyDABs from taking a dependency on the
JSON schema.
2. Generate the JSON schema during CLI code generation. Directly stream
it instead of computing it at runtime whenever a user calls `databricks
bundle schema`. This is nice because we no longer need to embed a
partial OpenAPI spec in the CLI. Down the line, we can add a `Schema()`
method to every struct in the Databricks Go SDK and remove the
dependency on the OpenAPI spec altogether. It'll become more important
once we decouple Go SDK structs and methods from the underlying APIs.
3. Add enum values for Go SDK fields in the JSON schema. Better
autocompletion and validation for these fields. As a follow-up, we can
add enum values for non-Go SDK enums as well (created internal ticket to
track).
4. Use "packageName.structName" as a key to read JSON schemas from the
OpenAPI spec for Go SDK structs. Before, we would use an unrolled
presentation of the JSON schema (stored in `bundle_descriptions.json`),
which was complex to parse and include in the final JSON schema output.
This also means loading values from the OpenAPI spec for `target` schema
works automatically and no longer needs custom code.
5. Support recursive types (eg: `for_each_task`). With us now using
$refs everywhere it's trivial to support.
6. Using complex variables would be invalid according to the schema
generated before this PR. Now that bug is fixed. In the future adding
more custom rules will be easier as well due to the single level nature
of the JSON schema.
Since this is a complete change of approach in how we generate the JSON
schema, there are a few (very minor) regressions worth calling out.
1. We'll lose a few custom descriptions for non Go SDK structs that were
a part of `bundle_descriptions.json`. Support for those can be added in
the future as a followup.
2. Since now the final JSON schema is a static artefact, we lose some
lead time for the signal that JSON schema integration tests are failing.
It's okay though since we have a lot of coverage via the existing unit
tests.
## Tests
Unit tests. End to end tests are being added in this PR:
https://github.com/databricks/cli/pull/1726
Previous unit tests were all deleted because they were bloated. Effort
was made to make the new unit tests provide (almost) equivalent
coverage.
## Changes
Some call sites hold on to the `dyn.Path` provided to them by the
callback. It must therefore never be mutated after the callback returns,
or these mutations leak out into unknown scope.
This change means it is no longer possible for this failure mode to
happen.
## Tests
Unit test.
## Changes
This updates the templates to include a `permissions` section. Having a
permissions section is a best practice, is helpful to understand the
notion of permissions, and helps diagnose permission errors
(https://github.com/databricks/cli/pull/1386).
This is a cherry-pick from https://github.com/databricks/cli/pull/1387.
This change was verified to work both in dev and prod. Existing unit
tests validate the validity of the templates in these modes.
## Changes
If not explicitly quoted, the YAML loader interprets a value like
`2024-08-29` as a timestamp. Such a value is usually intended to be a
string instead. Our normalization logic was not able to turn a time
value back into the original string.
This change boxes the time value to include its original string
representation. Normalization of one of these values into a string can
now use the original input value.
## Tests
Unit tests in `libs/dyn/convert`.
## Changes
`TestAccFilerWorkspaceFilesExtensionsErrorsOnDupName` recently started
failing in our nightlies because the upstream `import` API was changed
to [prohibit conflicting file
paths](https://docs.databricks.com/en/release-notes/product/2024/august.html#files-can-no-longer-have-identical-names-in-workspace-folders).
Because existing conflicting file paths can still be grandfathered in,
we need to retain coverage for the test. To do this, this PR:
1. Removes the failing
`TestAccFilerWorkspaceFilesExtensionsErrorsOnDupName`
2. Add an equivalent unit test with the `list` and `get-status` API
calls mocked.
## Changes
Make `pydabs/venv_path` optional. When not specified, CLI detects the
Python interpreter using `python.DetectExecutable`, the same way as for
`artifacts`. `python.DetectExecutable` works correctly if a virtual
environment is activated or `python3` is available on PATH through other
means.
Extract the venv detection code from PyDABs into `libs/python/detect`.
This code will be used when we implement the `python/venv_path` section
in `databricks.yml`.
## Tests
Unit tests and manually
---------
Co-authored-by: Pieter Noordhuis <pcnoordhuis@gmail.com>
## Changes
Before this change, the fileset library would take a single root path
and list all files in it. To support an allowlist of paths to list (much
like a Git `pathspec` without patterns; see [pathspec](pathspec)), this
change introduces an optional argument to `fileset.New` where the caller
can specify paths to list. If not specified, this argument defaults to
list `.` (i.e. list all files in the root).
The motivation for this change is that we wish to expose this pattern in
bundles. Users should be able to specify which paths to synchronize
instead of always only synchronizing the bundle root directory.
[pathspec]:
https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefpathspecapathspec
## Tests
New and existing unit tests.
## Changes
In https://github.com/databricks/cli/pull/1202 the semantics of
`cmdio.RenderJson` was changes to always render the JSON object. Before
we would only render it if `--output json` was specified.
This PR fixes the logs to print human-readable log lines instead of a
JSON object.
This PR also removes the now unused `cmdio.Render` method.
## Tests
Manually:
```
➜ bundle-playground git:(master) ✗ cli workspace import-dir ./tmp /Users/shreyas.goenka@databricks.com/test-import-1 -p aws-prod-ucws
Importing files from ./tmp
a -> /Users/shreyas.goenka@databricks.com/test-import-1/a
Import complete. The files are available at /Users/shreyas.goenka@databricks.com/test-import-1
```
```
➜ bundle-playground git:(master) ✗ cli workspace export-dir /Users/shreyas.goenka@databricks.com/test-export-1 ./tmp-2 -p aws-prod-ucws
Exporting files from /Users/shreyas.goenka@databricks.com/test-export-1
/Users/shreyas.goenka@databricks.com/test-export-1/b -> tmp-2/b
Exported complete. The files are available at ./tmp-2
```
## Changes
The `auth login` command today prefers a host URL specified in a profile
before selecting the one explicitly provided by a user as a command line
argument.
This PR fixes this bug and refactors the code to make it more linear and
easy to read. Note that the same issue exists in the `auth token`
command and is fixed here as well.
## Tests
Unit tests, and manual testing.
## Changes
Since locations are already tracked in the dynamic value tree, we no
longer need to track it at the resource/artifact level. This PR:
1. Removes use of `paths.Paths`. Uses dyn.Location instead.
2. Refactors the validation of resources not being empty valued to be
generic across all resource types.
## Tests
Existing unit tests.
## Changes
While investigating #1629, I found that Go doesn't allow characters
outside the set documented at
https://pkg.go.dev/golang.org/x/mod/module#CheckFilePath.
To fix this, I changed the relevant test case to create the fixtures it
needs instead of loading it from the `testdata` directory (in
`renderer_test.go`).
Some test cases in `config_test.go` depended on templated paths without
needing to do so. In the process of fixing this, I refactored these
tests slightly to reduce dependencies between them.
This change also adds a test case to ensure that all files in the
repository are allowed to be part of a module (per the earlier
`CheckFilePath` function).
Fixes#1629.
## Tests
I manually confirmed I could import the repository as a Go module.
## Changes
This PR adds autocomplete for cat, cp, ls, mkdir and rm.
The new completer can do completion for any `Filer`. The command
completion for the `sync` command can be moved to use this general
completer as a follow-up.
## Tests
- Tested manually against a workspace
- Unit tests
## Changes
This PR:
1. Uses dynamic walking (via the `dyn.MapByPattern` func) to validate no
two resources have the same resource key. The allows us to remove this
validation at merge time.
2. Modifies `dyn.Mapping` to always return a sorted slice of pairs. This
makes traversal functions like `dyn.Walk` or `dyn.MapByPattern`
deterministic.
## Tests
Unit tests. Also manually.
## Changes
Some diagnostics can have multiple paths associated with them. For
instance, ensuring that unique resource keys are used across all
resources. This PR extends `diag.Diagnostic` to accept multiple paths.
This PR is symmetrical to
https://github.com/databricks/cli/pull/1610/files
## Tests
Unit tests
## Changes
Right now we ask users for two confirmations when destroying a bundle.
One to destroy the resources and one to delete the files. This PR
consolidates the two prompts into one.
## Tests
Manually
Destroying a bundle with no resources:
```
➜ bundle-playground git:(master) ✗ cli bundle destroy
All files and directories at the following location will be deleted: /Users/shreyas.goenka@databricks.com/.bundle/bundle-playground/default
Would you like to proceed? [y/n]: y
No resources to destroy
Updating deployment state...
Deleting files...
Destroy complete!
```
Destroying a bundle with no remote state:
```
➜ bundle-playground git:(master) ✗ cli bundle destroy
No active deployment found to destroy!
```
When a user cancells a deployment:
```
➜ bundle-playground git:(master) ✗ cli bundle destroy
The following resources will be deleted:
delete job job_1
delete job job_2
delete pipeline foo
All files and directories at the following location will be deleted: /Users/shreyas.goenka@databricks.com/.bundle/bundle-playground/default
Would you like to proceed? [y/n]: n
Destroy cancelled!
```
When a user destroys resources:
```
➜ bundle-playground git:(master) ✗ cli bundle destroy
The following resources will be deleted:
delete job job_1
delete job job_2
delete pipeline foo
All files and directories at the following location will be deleted: /Users/shreyas.goenka@databricks.com/.bundle/bundle-playground/default
Would you like to proceed? [y/n]: y
Updating deployment state...
Deleting files...
Destroy complete!
```
## Changes
This PR changes `diag.Diagnostics` to allow including multiple locations
associated with the diagnostic message. The diagnostics that now return
multiple locations with this PR are:
1. Warning for unknown keys in config.
2. Use of experimental.run_as
3. Accidental sync.exludes that exclude all files.
## Tests
Existing unit tests pass. New unit test case to assert on error message
when multiple locations are included.
Example output:
```
➜ bundle-playground-2 ~/cli2/cli/cli bundle validate
Warning: You are using the legacy mode of run_as. The support for this mode is experimental and might be removed in a future release of the CLI. In order to run the DLT pipelines in your DAB as the run_as user this mode changes the owners of the pipelines to the run_as identity, which requires the user deploying the bundle to be a workspace admin, and also a Metastore admin if the pipeline target is in UC.
at experimental.use_legacy_run_as
in resources.yml:10:22
databricks.yml:13:22
Name: fix run_if
Target: default
Workspace:
User: shreyas.goenka@databricks.com
Path: /Users/shreyas.goenka@databricks.com/.bundle/fix run_if/default
Found 1 warning
```
## Changes
Add support for google/uuid.New() to DAB templates.
This is needed to generate UUIDs in downstream templates like MLOps
Stacks.
## Tests
Unit tests.
## Changes
By default, construct a read/write instance. If constructed in read-only
mode, the underlying filer is wrapped in a readahead cache.
## Tests
* Filer integration tests pass.
* Manual test that caching is enabled when running on WSFS.
## Changes
The reason this readahead cache exists is that we frequently need to
recursively find all files in the bundle root directory, especially for
sync include and exclude processing. By caching the response for every
file/directory and frontloading the latency cost of these calls, we
significantly improve performance and eliminate redundant operations.
## Tests
* [ ] Working on unit tests
## Changes
This PR fixes a performance bug that led downloaded files (e.g. with
`databricks fs cp dbfs:/Volumes/.../somefile .`) to be buffered in
memory before being written.
Results from profiling the download of a ~100MB file:
Before:
```
Type: alloc_space
Showing nodes accounting for 374.02MB, 98.50% of 379.74MB total
```
After:
```
Type: alloc_space
Showing nodes accounting for 3748.67kB, 100% of 3748.67kB total
```
Note that this fix is temporary. A longer term solution should be to use
the API provided by the Go SDK rather than making an HTTP request
directly from the CLI.
fix#1575
## Tests
Verified that the CLI properly download the file when doing the
profiling.