April 24, 2025

Modern Data Engineering Toolset: A Practical Perspective

Author: Vadym Mariiechko, Data Engineer

Reading time: 4 minutes


Selecting the right tools can significantly influence the success of any data engineering project. While the field offers a vast range of platforms and frameworks, the examples shared here reflect one of our real-world projects at Addepto, where Databricks features prominently. Keep in mind that every organization has unique requirements, and the technologies mentioned here are tailored for our current project scope.

Core Data Platform Technologies

In many modern data engineering scenarios, you’ll see a blend of powerful processing engines, orchestration tools, and storage solutions. For one of our current projects, we leverage Databricks as our central data intelligence platform because it offers an integrated environment for data processing, analytics, and machine learning.

Key platform components we’re exploring in this project include:

  • dltHub for quick and efficient development of batch pipelines, particularly in the early stages (see the sketch after this list)
  • Databricks Workflows for orchestrating batch processing pipelines
  • Databricks DLT for orchestrating and managing streaming data pipelines
  • Apache Spark for large-scale data processing
  • Kafka for real-time messaging and event streaming
  • dbt or SQLMesh for advanced data transformations
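
To give a feel for the dltHub piece, here is a minimal batch-ingestion sketch. The REST endpoint, pipeline name, and dataset name are purely illustrative, and credentials for the Databricks destination would come from dlt's usual config/secrets mechanism rather than the code itself.

```python
import dlt
import requests


@dlt.resource(name="orders", write_disposition="append")
def orders():
    # Hypothetical REST endpoint returning a JSON array of records
    response = requests.get("https://api.example.com/orders")
    response.raise_for_status()
    yield from response.json()


# Destination credentials are resolved from dlt's secrets/config, not hard-coded here
pipeline = dlt.pipeline(
    pipeline_name="orders_batch",
    destination="databricks",
    dataset_name="raw",
)

if __name__ == "__main__":
    print(pipeline.run(orders()))
```

A nice property of this setup is that swapping the destination (for example, to a local DuckDB) lets you test the same pipeline code before pointing it at Databricks.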

Visualization and Dashboard Development

Communicating insights effectively is a critical part of a Data Engineer’s role. We often use Databricks Apps in combination with frameworks like Streamlit and Folium to quickly prototype and demonstrate interactive dashboards. This setup allows us to:

  • Develop and deploy prototypes rapidly
  • Facilitate interactive data exploration for stakeholders
  • Integrate seamlessly with the underlying data platform

In short, it’s a convenient way to showcase early insights without spinning up a separate infrastructure for visualization.
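
As a rough sketch of what such a prototype can look like, the script below renders a small Folium map inside Streamlit. It assumes the streamlit-folium bridge package and uses hard-coded placeholder locations; in a Databricks Apps deployment the data would instead be queried from the platform.

```python
import folium
import streamlit as st
from streamlit_folium import st_folium  # pip install streamlit-folium

st.title("Site activity prototype")

# Placeholder records; a real app would query these from the data platform.
locations = [
    {"name": "Warehouse A", "lat": 52.23, "lon": 21.01, "events": 128},
    {"name": "Warehouse B", "lat": 50.06, "lon": 19.94, "events": 73},
]

m = folium.Map(location=[51.1, 20.5], zoom_start=6)
for loc in locations:
    folium.CircleMarker(
        location=[loc["lat"], loc["lon"]],
        radius=max(6, loc["events"] / 20),
        tooltip=f"{loc['name']}: {loc['events']} events",
    ).add_to(m)

st_folium(m, width=700)
```

Running it is a single `streamlit run app.py`, which is a big part of why this approach works so well for early demos.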

Development Environment

A well-configured development environment is crucial for productivity. The essential tools include:

  • Cursor IDE, an AI-assisted code editor
  • ChatGPT reasoning models for deeper analysis, architectural insights, and code reasoning
  • Obsidian for organizing development notes and ideas
  • Draw.io for quick, clear architectural diagrams
  • Lightshot for quick screenshot annotations

Project Management and Collaboration

Effective collaboration is essential in modern data engineering projects. The standard toolkit includes:

  • Slack for internal team communication
  • Microsoft Teams for client interactions
  • Azure DevOps for task tracking and project management
  • Excel for sharing structured data and presenting analyses
  • Word for writing and sharing documentation, analyses, and technical reports with the team

Real-Time Processing Capabilities

Many modern data projects eventually move beyond batch processing into real-time or near-real-time pipelines. In our current work, we use Spark Structured Streaming on Databricks, coupled with Auto Loader and Databricks DLT, to handle multiple streaming data sources (a minimal sketch follows the list):

  • WebSocket-based ingestion: A custom service monitors a live data feed via WebSocket APIs and stores the JSON responses in blob storage.
  • Auto Loader: Databricks Auto Loader picks up the incoming files automatically and lands them in raw tables for further transformation.
  • Streaming pipelines: We use Databricks DLT to orchestrate continuous transformation, cleaning, and enrichment of the streaming data into our master table.
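
To make the pattern concrete, here is a minimal sketch of how Auto Loader and DLT fit together in a Delta Live Tables pipeline. The storage paths, table names, and column names (event_id, event_time) are placeholders rather than our production configuration, and `spark` is provided by the DLT runtime instead of being created in the code.

```python
import dlt  # Databricks Delta Live Tables module, available inside DLT pipelines
from pyspark.sql import functions as F

RAW_PATH = "/mnt/landing/websocket-feed/"  # placeholder blob-storage location


@dlt.table(name="feed_raw", comment="Raw JSON files picked up by Auto Loader")
def feed_raw():
    return (
        spark.readStream.format("cloudFiles")  # Auto Loader source
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/feed")
        .load(RAW_PATH)
    )


@dlt.table(name="feed_clean", comment="Cleaned, deduplicated streaming records")
@dlt.expect_or_drop("has_event_time", "event_time IS NOT NULL")
def feed_clean():
    return (
        dlt.read_stream("feed_raw")
        .withColumn("ingested_at", F.current_timestamp())
        .dropDuplicates(["event_id"])
    )
```

Each decorated function becomes a managed streaming table, so orchestration, retries, and data-quality expectations come from the pipeline itself rather than custom glue code.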

While Kafka remains a popular choice for event streaming, our project currently relies on custom ingestion services. In future phases, we may integrate Kafka into our DLT pipelines for enhanced real-time processing capabilities.

Learning and Staying Updated

Data engineering evolves rapidly. Besides hands-on experimentation, these resources offer valuable insights:

  • The comprehensive “data-engineer-handbook” on GitHub
  • Industry experts such as Zach Wilson, Benjamin Rogojan (Seattle Data Guy), and Michael Kahan from Kahan Data Solutions
  • The Modern Data Stack website
  • Start Data Engineering platform
  • The Databricks technical blog for insights on platform updates and best practices

Best Practices for Tool Selection

When choosing tools for a data engineering project, consider:

  • Project Requirements
    • Immediate business needs
    • Long-term scalability requirements
    • Real-time vs. batch processing needs
  • Team Expertise
    • Existing skill sets
    • Learning curve for new tools
    • Available training resources
  • Integration Capabilities
    • Compatibility with existing systems
    • API availability
    • Data format support

Future Considerations

The toolset should be flexible enough to accommodate:

  • Shifting from batch to real-time processing
  • Scaling data operations
  • Incorporating new data sources
  • Adapting to changing business requirements

 

Conclusion

The modern data engineering toolset is diverse and constantly evolving. Success lies not just in knowing these tools, but in understanding when and how to apply them effectively. Start with the core essentials, and gradually expand your toolkit based on project requirements and team capabilities.

For those beginning their data engineering journey, remember that mastery of these tools comes through consistent practice and hands-on project experience. Focus on building a strong foundation with core tools before expanding to more specialized solutions.


