April 24, 2025

Modern Data Engineering Toolset: A Practical Perspective

Author: Vadym Mariiechko, Data Engineer

Reading time: 4 minutes


Selecting the right tools can significantly influence the success of any data engineering project. While the field offers a vast range of platforms and frameworks, the examples shared here reflect one of our real-world projects at Addepto, where Databricks features prominently. Keep in mind that every organization has unique requirements, and the technologies mentioned here are tailored for our current project scope.

Core Data Platform Technologies

In many modern data engineering scenarios, you’ll see a blend of powerful processing engines, orchestration tools, and storage solutions. For one of our current projects, we leverage Databricks as our central data intelligence platform because it offers an integrated environment for data processing, analytics, and machine learning.

Key platform components we’re exploring in this project include:

  • dltHub for quick and efficient development of batch pipelines, particularly in the early stages (see the sketch after this list)
  • Databricks Workflows for orchestrating batch processing pipelines
  • Databricks DLT for orchestrating and managing streaming data pipelines
  • Apache Spark for large-scale data processing
  • Kafka for real-time messaging and event streaming
  • dbt or SQLMesh for advanced data transformations
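
To give a feel for the dltHub piece, here is a minimal batch-ingestion sketch. The REST endpoint, pipeline name, and dataset name are purely illustrative, and credentials for the Databricks destination would come from dlt's usual config/secrets mechanism rather than the code itself.

```python
import dlt
import requests


@dlt.resource(name="orders", write_disposition="append")
def orders():
    # Hypothetical REST endpoint returning a JSON array of records
    response = requests.get("https://api.example.com/orders")
    response.raise_for_status()
    yield from response.json()


# Destination credentials are resolved from dlt's secrets/config, not hard-coded here
pipeline = dlt.pipeline(
    pipeline_name="orders_batch",
    destination="databricks",
    dataset_name="raw",
)

if __name__ == "__main__":
    print(pipeline.run(orders()))
```

A nice property of this setup is that swapping the destination (for example, to a local DuckDB) lets you test the same pipeline code before pointing it at Databricks.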

Visualization and Dashboard Development

Communicating insights effectively is a critical part of a Data Engineer’s role. We often use Databricks Apps in combination with frameworks like Streamlit and Folium to quickly prototype and demonstrate interactive dashboards. This setup allows us to:

  • Develop and deploy prototypes rapidly
  • Facilitate interactive data exploration for stakeholders
  • Integrate seamlessly with the underlying data platform

In short, it’s a convenient way to showcase early insights without spinning up a separate infrastructure for visualization.
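
As a rough sketch of what such a prototype can look like, the script below renders a small Folium map inside Streamlit. It assumes the streamlit-folium bridge package and uses hard-coded placeholder locations; in a Databricks Apps deployment the data would instead be queried from the platform.

```python
import folium
import streamlit as st
from streamlit_folium import st_folium  # pip install streamlit-folium

st.title("Site activity prototype")

# Placeholder records; a real app would query these from the data platform.
locations = [
    {"name": "Warehouse A", "lat": 52.23, "lon": 21.01, "events": 128},
    {"name": "Warehouse B", "lat": 50.06, "lon": 19.94, "events": 73},
]

m = folium.Map(location=[51.1, 20.5], zoom_start=6)
for loc in locations:
    folium.CircleMarker(
        location=[loc["lat"], loc["lon"]],
        radius=max(6, loc["events"] / 20),
        tooltip=f"{loc['name']}: {loc['events']} events",
    ).add_to(m)

st_folium(m, width=700)
```

Running it is a single `streamlit run app.py`, which is a big part of why this approach works so well for early demos.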

Development Environment

A well-configured development environment is crucial for productivity. The essential tools include:

  • Cursor IDE, an AI-assisted code editor
  • ChatGPT reasoning models for deeper analysis, architectural insights, and code reasoning
  • Obsidian for organizing development notes and ideas
  • Draw.io for quick, clear architectural diagrams
  • Lightshot for quick screenshot annotations

Project Management and Collaboration

Effective collaboration is essential in modern data engineering projects. The standard toolkit includes:

  • Slack for internal team communication
  • Microsoft Teams for client interactions
  • Azure DevOps for task tracking and project management
  • Excel for sharing structured data and presenting analyses
  • Word for writing and sharing documentation, analyses, and technical reports with the team

Real-Time Processing Capabilities

Many modern data projects eventually move beyond batch processing into real-time or near-real-time pipelines. In our current work, we use Spark Structured Streaming on Databricks, coupled with Auto Loader and Databricks DLT, to handle multiple streaming data sources (a minimal sketch follows the list):

  • WebSocket-based ingestion: A custom service monitors a live data feed via WebSocket APIs and stores the JSON responses in blob storage.
  • Auto Loader: Databricks Auto Loader picks up the incoming files automatically and lands them in raw tables for further transformation.
  • Streaming pipelines: We use Databricks DLT to orchestrate continuous transformation, cleaning, and enrichment of the streaming data into our master table.
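
To make the pattern concrete, here is a minimal sketch of how Auto Loader and DLT fit together in a Delta Live Tables pipeline. The storage paths, table names, and column names (event_id, event_time) are placeholders rather than our production configuration, and `spark` is provided by the DLT runtime instead of being created in the code.

```python
import dlt  # Databricks Delta Live Tables module, available inside DLT pipelines
from pyspark.sql import functions as F

RAW_PATH = "/mnt/landing/websocket-feed/"  # placeholder blob-storage location


@dlt.table(name="feed_raw", comment="Raw JSON files picked up by Auto Loader")
def feed_raw():
    return (
        spark.readStream.format("cloudFiles")  # Auto Loader source
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/feed")
        .load(RAW_PATH)
    )


@dlt.table(name="feed_clean", comment="Cleaned, deduplicated streaming records")
@dlt.expect_or_drop("has_event_time", "event_time IS NOT NULL")
def feed_clean():
    return (
        dlt.read_stream("feed_raw")
        .withColumn("ingested_at", F.current_timestamp())
        .dropDuplicates(["event_id"])
    )
```

Each decorated function becomes a managed streaming table, so orchestration, retries, and data-quality expectations come from the pipeline itself rather than custom glue code.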

While Kafka remains a popular choice for event streaming, our project currently relies on custom ingestion services. In future phases, we may integrate Kafka into our DLT pipelines for enhanced real-time processing capabilities.

Learning and Staying Updated

Data engineering evolves rapidly. Besides hands-on experimentation, these resources offer valuable insights:

  • The comprehensive “data-engineer-handbook” on GitHub
  • Industry experts such as Zach Wilson, Benjamin Rogojan (Seattle Data Guy), and Michael Kahan from Kahan Data Solutions
  • The Modern Data Stack website
  • Start Data Engineering platform
  • The Databricks technical blog for insights on platform updates and best practices

Best Practices for Tool Selection

When choosing tools for a data engineering project, consider:

  • Project Requirements
    • Immediate business needs
    • Long-term scalability requirements
    • Real-time vs. batch processing needs
  • Team Expertise
    • Existing skill sets
    • Learning curve for new tools
    • Available training resources
  • Integration Capabilities
    • Compatibility with existing systems
    • API availability
    • Data format support

Future Considerations

The toolset should be flexible enough to accommodate:

  • Shifting from batch to real-time processing
  • Scaling data operations
  • Incorporating new data sources
  • Adapting to changing business requirements

 

Conclusion

The modern data engineering toolset is diverse and constantly evolving. Success lies not just in knowing these tools, but in understanding when and how to apply them effectively. Start with the core essentials, and gradually expand your toolkit based on project requirements and team capabilities.

For those beginning their data engineering journey, remember that mastery of these tools comes through consistent practice and hands-on project experience. Focus on building a strong foundation with core tools before expanding to more specialized solutions.


