
May 18, 2026

How to Set Up Databricks the Right Way Without Starting from Scratch


Author:

Kaja Grzybowska

Reading time:

14 minutes

As a unified data platform, Databricks gives engineering teams the freedom to design pipelines, environments, and workflows almost any way they want, which is exactly what makes it powerful, and exactly what makes the initial setup non-trivial.

In fact, Databricks’ flexibility doesn’t come pre-configured.

Someone has to make the decisions, and making them well requires knowing the platform deeply.

Vadym Mariiechko, Data Engineer at Addepto, a Databricks partner, has spent years doing exactly that.

Vadym Mariiechko

Data Engineer at Addepto

Vadym specializes in production-grade Databricks architecture, with years of experience designing secure environments, isolating developer workflows, and implementing robust CI/CD pipelines. He is the creator of the open-source Databricks Bundle Template.

Working across projects, he accumulated hands-on knowledge of Databricks best practices that doesn’t come from documentation: how to structure multi-environment deployments, how to isolate developer workspaces so teams don’t overwrite each other’s work, how to wire up CI/CD so code moves from a laptop to production reliably.

He packaged that knowledge into the Databricks Bundle Template, an open-source DABs template that walks a team through a short configuration process and generates a production-ready, correctly structured Databricks project from scratch. It won’t cover every exotic edge case, but for the vast majority of data engineering setups, it provides the starting point most teams otherwise have to build themselves.

databricks-bundle-template

An open-source DABs template that generates a production-ready, correctly structured Databricks project — multi-environment setup, CI/CD wiring, and developer isolation included.


We sat down with Vadym to talk about what it actually takes to set up Databricks for a real engineering team, why the platform’s flexibility is both its greatest strength and its steepest learning curve, and how he turned years of hands-on experience into an open-source tool that gives other teams a production-ready starting point from day one.

Key Takeaways

Databricks is deliberately flexible: the structural decisions around environments, team isolation, and deployment workflows are always left to the team, and getting them right requires platform experience most teams are still building.

Declarative Automation Bundles (DABs) are Databricks’ native way to manage infrastructure as code and give each developer their own isolated environment, but setting them up correctly is non-trivial without a solid starting point.

The Databricks Bundle Template encodes structural decisions (multi-environment setup, CI/CD wiring, developer isolation) based on real project experience, covering the foundation without touching business logic.

The template supports three cloud providers (AWS, Azure, GCP), three CI/CD platforms (GitHub Actions, Azure DevOps, GitLab), and both classic and serverless compute, with a test matrix that automatically verifies combinations.

An asset library sits alongside the core template: a growing catalog of standalone, reusable solutions for recurring Databricks problems, installable into any existing project.

Addepto: You work with Databricks every day. What was it that finally made you think, “I need to build something here”?

Vadym Mariiechko: It wasn’t a single moment, honestly. It built up over time. I was working on a project where we were building an Intermodal Data Platform from scratch. It was a small team: a few developers on the client side and usually one or two of us from Addepto. And we kept hitting the same friction points around how we organised our work on Databricks.

The most obvious symptom was that developers would write code locally in their own IDE and then manually copy and paste files into Databricks. No automation, no structure. And when two people needed to work on something similar – the same pipeline, say – they’d immediately run into each other. You’d deploy a pipeline tied to a table, and the next developer couldn’t deploy theirs against the same resource. One person’s work would overwrite the other’s.

So I started putting structure around it. Not guardrails in a restrictive sense, more like: here’s how we work, here’s who owns what, here’s how code moves from your laptop to production. And as I built that structure, I realised it wasn’t specific to that project at all.

Every team working on Databricks needs to solve the same foundational problems. I’d solved them. It made sense to make that reusable.

For readers who haven’t worked with Databricks, can you explain what Databricks actually is, and why it needs a template on top of it?

Sure. Databricks is a data platform – you use it to build pipelines, run jobs, process large amounts of data. It’s extremely powerful, but it’s also deliberately open. It doesn’t tell you how to work. It gives you the tools and steps back.

Databricks provides something called Declarative Automation Bundles – DABs – which lets you define your infrastructure as code. Instead of creating jobs manually through the UI one by one, you write a configuration file describing what you want, and it gets deployed consistently. And crucially, each developer gets their own isolated environment, and nobody touches anyone else’s work.

Definition

DABs — Declarative Automation Bundles

Databricks’ native way to manage infrastructure as code. Instead of creating jobs manually through the UI, teams write configuration files describing what they want — environments, jobs, compute settings — and DABs deploys it consistently. Each developer gets their own isolated environment, so no one overwrites anyone else’s work.
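To make the isolation idea concrete, here is a minimal Python sketch of how per-developer naming keeps deployments from colliding. It only models the idea and does not call Databricks. The `[dev <user>]` prefix mirrors the way bundles label resources in development mode; the function name `resolved_name`, the target names, and the job name are assumptions made for illustration.

```python
def resolved_name(base_name: str, target: str, user: str) -> str:
    """Return the name a resource would be deployed under.

    Hypothetical model: in a development target every resource is
    prefixed with the developer's short name, so two developers can
    deploy the same job definition without overwriting each other.
    Shared targets (staging, prod) keep the plain name.
    """
    if target == "dev":
        return f"[dev {user}] {base_name}"
    return base_name


# Two developers deploy the same job definition to their dev targets.
alice_job = resolved_name("daily_ingest", "dev", "alice")
bob_job = resolved_name("daily_ingest", "dev", "bob")

print(alice_job)  # [dev alice] daily_ingest
print(bob_job)    # [dev bob] daily_ingest
print(resolved_name("daily_ingest", "prod", "ci_bot"))  # daily_ingest
```

Because the two development names differ, neither deployment replaces the other, while the shared targets still resolve to a single canonical resource.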

That all sounds clean in theory. But setting it up from scratch means making dozens of interconnected decisions: how many environments, what compute type, which CI/CD platform, how to structure your data layers.

Those decisions have consequences that compound over time. If you get them wrong early on, you’re dealing with the fallout for months. A template means those decisions have already been made, correctly, based on real project experience, so you can skip the setup phase entirely and start building things that actually matter to your project.

Databricks does ship its own default bundle template, though. What’s the gap? Why wasn’t that enough?

Databricks goes very wide; they have to. The platform serves every possible use case, every industry, every cloud, so it makes sense that their template is flexible rather than opinionated.

It handles the technical scaffolding but deliberately leaves the structural decisions to you: how your team works, how environments are separated, how deployment flows. That’s not a flaw, it’s a reflection of what Databricks is.

The gap is that making those structural decisions well requires experience with the platform. It’s a bit like the difference between vanilla JavaScript and something like React or Nuxt. JavaScript gives you everything you need; React is just an opinionated layer built on top that encodes how most teams should probably be working.

My template is that opinionated layer for Databricks. It’s not about adding things Databricks forgot. It’s about encoding the right answers to questions that every team has to answer anyway, based on what actually works across real projects. And it stays completely out of your business logic: what you build inside Databricks is still entirely up to you.

But the foundation of how you work is set up correctly from the start.

What does it actually feel like to use it for the first time? Walk me through those first few minutes.

You run a command in the Databricks CLI, which fetches the template and starts asking you configuration questions. It’s a short conversation, really – how many environments do you want, two or three? What compute type, classic or serverless? Which cloud provider, which CI/CD platform? You answer those, and what comes out the other end is a complete project skeleton: folder structure, environment configurations, sample ETL jobs you can run immediately to verify everything works, and CI/CD pipelines already connected.

The decisions that would normally take days of research and trial and error, the kind you only get right after working with Databricks across multiple real projects, are already made.

You’re not starting from zero. You’re starting from a solid, well-structured foundation that reflects how Databricks actually recommends working.
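The configuration conversation described above can be pictured as a small set of validated answers. The sketch below is illustrative only: the field names in `TemplateAnswers` are hypothetical and are not the template's real prompt keys, and the template path is left as a placeholder. The `databricks bundle init` command and its `--config-file` flag are part of the Databricks CLI, but the command is only built here, never executed.

```python
import json
from dataclasses import dataclass, asdict

# Options named in the interview; the field names below are hypothetical
# and are not the template's real prompt keys.
CLOUDS = {"aws", "azure", "gcp"}
CICD_PLATFORMS = {"github_actions", "azure_devops", "gitlab"}
COMPUTE_TYPES = {"classic", "serverless"}
ENVIRONMENT_COUNTS = {2, 3}


@dataclass(frozen=True)
class TemplateAnswers:
    environments: int
    compute: str
    cloud: str
    cicd: str

    def validate(self) -> None:
        """Raise ValueError if any answer is outside the supported options."""
        checks = [
            ("environments", self.environments, ENVIRONMENT_COUNTS),
            ("compute", self.compute, COMPUTE_TYPES),
            ("cloud", self.cloud, CLOUDS),
            ("cicd", self.cicd, CICD_PLATFORMS),
        ]
        for name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")


def init_command(template_path: str, config_file: str) -> list[str]:
    """Build (but do not run) a `databricks bundle init` invocation."""
    return ["databricks", "bundle", "init", template_path, "--config-file", config_file]


answers = TemplateAnswers(
    environments=3, compute="serverless", cloud="azure", cicd="github_actions"
)
answers.validate()
print(json.dumps(asdict(answers), indent=2))
print(init_command("<template-path-or-url>", "answers.json"))
```

Writing the answers to a JSON file and passing it with `--config-file` is one way to make the same setup repeatable without answering the prompts by hand.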

Instead of “sewing” the suit yourself, you answer a few questions from a tailor and get something properly fitted. Does that land for you?

I like the image, but I’d push it a bit further.

Bootstrap is actually the analogy I’d use. With Bootstrap, you don’t write CSS from scratch. If you need a green button, for example, you get a perfectly built green button. You don’t think about the implementation at all. You just use it.

My template works the same way for Databricks infrastructure. The underlying setup, the part that requires deep platform knowledge, is already done. You get a working, correctly structured project, and you build from there.

Where the tailor analogy breaks down is that a tailor makes something unique to you. Bootstrap doesn’t, and neither does my template.

It’s opinionated by design. You trade flexibility on the foundational decisions, which are honestly similar across most projects, for a setup that works correctly from day one.

The vast majority of data engineering teams will never hit the edge cases that fall outside it. And for everyone else, it’s still a much better starting point than an empty config.

Let’s talk about the complexity underneath that simplicity. You support three cloud providers, three CI/CD platforms, and both classic and serverless compute. The decision tree must be enormous. How did you stop it from becoming a maintenance nightmare?

It is a big tree, and the combinations multiply fast. The way I handled it was with two levels of testing.

The first is a test matrix for the template itself. It automatically checks a large set of configurations – serverless compute with GitHub Actions, classic compute with Azure DevOps, and so on – to verify that each combination produces the right output. If you select serverless, serverless shows up everywhere it should. No mismatched references, no missing dependencies somewhere deep in the project.
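A test matrix of this kind can be sketched in a few lines of Python. This is not the template's actual test suite: `render_project` is a hypothetical stand-in that returns fake file contents, and the file paths are invented for illustration. The sketch enumerates every cloud, CI/CD, and compute combination and flags any generated file that mentions the compute type that was not selected. The environment-count option is left out to keep it short.

```python
from itertools import product

CLOUDS = ["aws", "azure", "gcp"]
CICD_PLATFORMS = ["github_actions", "azure_devops", "gitlab"]
COMPUTE_TYPES = ["classic", "serverless"]


def render_project(cloud: str, cicd: str, compute: str) -> dict[str, str]:
    """Hypothetical stand-in for template generation.

    Returns a mapping of generated file path to file content. A real
    matrix test would render the actual template instead.
    """
    return {
        "databricks.yml": f"cloud: {cloud}\ncompute: {compute}\n",
        "resources/sample_job.yml": f"compute: {compute}\n",
        f"ci/{cicd}.yml": f"deploy_with: {compute}\n",
    }


def check_consistency(files: dict[str, str], compute: str) -> list[str]:
    """Return the files that mention a compute type other than the chosen one."""
    other_types = [c for c in COMPUTE_TYPES if c != compute]
    return [
        path
        for path, content in files.items()
        if any(other in content for other in other_types)
    ]


matrix = list(product(CLOUDS, CICD_PLATFORMS, COMPUTE_TYPES))
failures = {}
for cloud, cicd, compute in matrix:
    bad_files = check_consistency(render_project(cloud, cicd, compute), compute)
    if bad_files:
        failures[(cloud, cicd, compute)] = bad_files

print(f"{len(matrix)} combinations checked, {len(failures)} inconsistent")
```

Three clouds, three CI/CD platforms, and two compute types already give 18 combinations, which is why checking them by hand after every change does not scale.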

The second level is testing inside Databricks itself: that’s about the code you write after the template is set up. I include a test placeholder in the example project that shows how to structure unit tests for your jobs, and those tests get picked up and run automatically by the CI/CD pipeline as code moves through environments.
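As a rough picture of what a unit test for job code can look like, the sketch below keeps the transformation logic in a plain function so it can be tested without a cluster. It is not the placeholder shipped with the template; the function `normalize_container_ids` and its sample data are invented for illustration.

```python
def normalize_container_ids(rows: list[dict]) -> list[dict]:
    """Hypothetical job logic: trim and upper-case IDs, drop rows without one."""
    cleaned = []
    for row in rows:
        container_id = (row.get("container_id") or "").strip().upper()
        if container_id:
            cleaned.append({**row, "container_id": container_id})
    return cleaned


def test_normalize_container_ids() -> None:
    rows = [
        {"container_id": " abcu1234567 ", "weight_kg": 1200},
        {"container_id": None, "weight_kg": 900},
        {"container_id": "", "weight_kg": 300},
    ]
    assert normalize_container_ids(rows) == [
        {"container_id": "ABCU1234567", "weight_kg": 1200}
    ]


# Called directly here; a test runner such as pytest would discover it by name.
test_normalize_container_ids()
```

Keeping the logic separate from reading and writing tables is a design choice that lets a CI pipeline run such tests quickly on every change.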

The matrix is what makes the breadth sustainable. Databricks supports so many configurations because teams’ needs genuinely vary. But that same breadth means you can’t manually verify every combination every time you change something. Automating the verification of the template itself is what makes it possible to support all those options without the whole thing slowly falling apart.

Cloud providers supported: AWS, Azure, GCP

CI/CD platforms: GitHub Actions, Azure DevOps, GitLab

Compute types: classic and serverless, both fully supported

Security is another area where the complexity could spiral. You built environment-aware group configurations in from the start. Why was that a priority?

It’s worth being precise here, because there are actually two distinct layers of security in a Databricks setup, and my template only covers one of them.

The higher layer – provisioning workspaces, managing users, controlling access at the platform level – that sits above what the template does. What my template handles is the internal layer: how developers are isolated from each other within a project, how staging data stays separate from production, how group permissions inside a bundle are configured so nobody accidentally steps into someone else’s environment.

That internal structure is what causes day-to-day friction on Databricks teams – it’s quieter than a platform-level security failure, but it slows teams down constantly. The template enforces it correctly from the start, in line with how Databricks recommends structuring multi-developer projects.
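One way to picture environment-aware group configuration is a mapping from deployment target to the permission level each group receives. The sketch below is an assumption-heavy illustration, not the template's configuration: the group names and the mapping are hypothetical, and only the `CAN_VIEW` / `CAN_RUN` / `CAN_MANAGE` level names follow Databricks' permission vocabulary.

```python
# Hypothetical group names and mapping; only the level strings follow
# the CAN_VIEW / CAN_RUN / CAN_MANAGE style used by Databricks.
GROUP_PERMISSIONS = {
    "dev": {"developers": "CAN_MANAGE"},
    "staging": {"developers": "CAN_VIEW", "release_managers": "CAN_MANAGE"},
    "prod": {"developers": "CAN_VIEW", "platform_admins": "CAN_MANAGE"},
}


def permissions_for(target: str) -> list[dict[str, str]]:
    """Return permission entries for one target, sorted by group name."""
    if target not in GROUP_PERMISSIONS:
        raise KeyError(f"unknown target: {target}")
    return [
        {"group_name": group, "level": level}
        for group, level in sorted(GROUP_PERMISSIONS[target].items())
    ]


def can_modify(target: str, group: str) -> bool:
    """True only if the group holds CAN_MANAGE on the given target."""
    return GROUP_PERMISSIONS.get(target, {}).get(group) == "CAN_MANAGE"


for target in GROUP_PERMISSIONS:
    print(target, permissions_for(target))
```

In this model developers can change anything in their development target but can only view staging and production, which is the kind of separation the answer above describes.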

You also built something called an asset library alongside the template – which is a different concept. What is it, and why keep it separate rather than just expanding the core template?

The core template generates a project skeleton, the foundation. Once that’s in place, though, you still need to build things inside it, and some of those things come up again and again across different projects. The asset library is a catalog of standalone mini-templates: specific features or solutions you can install into any existing bundle project.

Think of it as extending the Bootstrap analogy. The template is Bootstrap itself. The asset library is the component ecosystem that grows around it: specific, ready-to-use solutions for problems that keep recurring in real Databricks work.

The reason I kept it separate is that these assets vary enormously in scope. Some are general-purpose utilities. Some solve a very specific problem. Folding them into the main template would make it unwieldy and harder to navigate.

As a library, you pull in exactly what you need, when you need it. And it means contributors can add assets without ever touching the core template, which is important if you want the library to actually grow.

The first asset I published is sdp-checkpoint-recovery – it automatically resets checkpoint state on a pipeline when a source table gets dropped. Specific problem, but one that comes up on real projects. Now instead of debugging it from scratch, you install the asset and move on.

Let’s make this specific. Paint me a picture: a team before this template, and the same team after.

Before: developers are writing code locally, copying files manually into Databricks, creating jobs one by one through the UI. When two people work on similar things, nobody’s quite sure what belongs to whom, what’s safe to change, what will break if they touch it. There’s no reliable path from “I finished building this” to “it’s running in production.” A lot of things live in people’s heads.

After: each developer has their own isolated environment on Databricks – their own jobs, their own data, their own space to work without risk of collision. They build a feature, finish it, submit it for code review. Once it’s approved, the CI/CD pipeline, already set up by the template, automatically deploys to a staging environment that closely mirrors production. Tests run. If everything passes, the code goes to production.

The difference isn’t just speed. It’s that the whole workflow is defined and repeatable. Databricks has the native capabilities to support all of this – the platform is more than powerful enough. The template just makes sure teams are actually using those capabilities correctly, from day one, rather than spending weeks working out how to wire it all together.
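The "after" workflow is essentially a sequence of gates. The following Python sketch models that sequence under simple assumptions: the `Change` fields and the `promote` function are hypothetical names, and a real pipeline would run deployments and tests rather than read booleans.

```python
from dataclasses import dataclass


@dataclass
class Change:
    author: str
    approved: bool            # outcome of code review
    staging_tests_pass: bool  # result of tests run after the staging deploy


def promote(change: Change) -> list[str]:
    """Return the ordered list of environments the change reached."""
    reached = [f"dev ({change.author})"]  # isolated developer environment
    if not change.approved:
        return reached  # stops at review
    reached.append("staging")
    if not change.staging_tests_pass:
        return reached  # stops before production
    reached.append("prod")
    return reached


print(promote(Change("alice", approved=True, staging_tests_pass=True)))
print(promote(Change("bob", approved=True, staging_tests_pass=False)))
print(promote(Change("carol", approved=False, staging_tests_pass=True)))
```

The point of the model is that production is reachable only through review and a passing staging run, with no manual copy step in between.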

What was the hardest design decision you had to make building this?

How to handle teams that already have an existing project. The template is ideal for starting from zero; that’s its natural context. But many teams aren’t starting from zero. They have existing workflows, existing code, existing conventions.

I decided not to try to make it a migration tool. I wouldn’t recommend tearing down an existing project just to adopt this. The better path is to generate the skeleton separately and selectively bring over the parts that are useful: the CI/CD pipelines, the environment configurations, and specific structural blocks. The template is modular enough to support that kind of adoption.

Am I still convinced it was the right call? Yes. Trying to make the template smart enough to merge cleanly with an arbitrary existing project would add enormous complexity and would probably work poorly in most real cases. Better to give people well-structured building blocks they can incorporate on their own terms, at their own pace.

Last question. This is open source and you’re actively inviting people to contribute. What does a good first contribution look like, and what does the project most need right now?

The asset library is where the most interesting contributions can happen right now. The framework is already in place: scalable, documented, ready for new assets. If you’ve solved a recurring Databricks problem in your own work, you can package that solution and contribute it. That’s really all it takes.

Each asset has a standard structure and a clear description format so other users understand exactly what it does and when to reach for it. The CONTRIBUTING.md walks through the process in detail.

What the project most needs is people contributing the solutions they’ve already built — the things that come up on real Databricks projects and have to be solved from scratch every time. The whole point of the asset library is to be that shared catalog: a growing collection of ready-to-use solutions built by people who’ve actually needed them in production. The infrastructure is there. Now it just needs to be filled.

Built something useful in Databricks?

If you’ve solved a recurring Databricks problem in production, the asset library is the right place for it. Package your solution, follow the CONTRIBUTING.md, and add it to a catalog that saves other engineers the same debugging time you already spent.


FAQ

How does the Databricks Bundle Template differ from Databricks' own default template?

Databricks’ default template handles technical scaffolding but deliberately leaves structural decisions to the team — how environments are separated, how deployment flows, how developers stay isolated from each other. The Bundle Template by Vadym Mariiechko is an opinionated layer on top: it encodes those structural decisions based on real project experience, so teams don’t have to figure them out from scratch.

Do I need to be a Databricks expert to use the template?

No, that’s the point. The template is designed precisely for teams that are still building their Databricks expertise. You answer a short set of configuration questions through the CLI, and the output is a correctly structured, production-ready project. The decisions that would normally require deep platform knowledge are already made for you.

Which cloud providers and CI/CD platforms does the template support?

The template supports three cloud providers (AWS, Azure, GCP) and three CI/CD platforms (GitHub Actions, Azure DevOps, GitLab). Both classic and serverless compute are supported, and a test matrix automatically verifies that each combination produces the correct output.

What is the asset library, and how is it different from the core template?

The core template generates a project foundation — folder structure, environment configurations, CI/CD pipelines. The asset library is a separate catalog of standalone mini-templates, each solving one specific recurring Databricks problem. Assets can be installed into any existing bundle project without touching the core setup.

Can I use the template if my team already has an existing Databricks project?

The template is designed for projects starting from scratch. For existing projects, the recommended approach is to generate the skeleton separately and selectively bring over the parts that are useful — CI/CD pipelines, environment configurations, specific structural blocks — rather than attempting a full migration.

How can I contribute to the project?

The asset library is where contributions are most needed right now. If you’ve solved a recurring Databricks problem in production, you can package that solution as an asset and submit it. The CONTRIBUTING.md in the repository walks through the process. The framework is already in place — the library just needs engineers who’ve already solved real problems to share those solutions.

Category:

Data Engineering
