Question

best practice for architecture data pipeline with open source tools on premise

  • 24 January 2024
  • 6 replies
  • 146 views

Hello everyone
I`m junior in data engineering and want implement data pipeline for company that I working in that I have 3 questions:
1- What is the best practice for the architecture of implementing data lake and data warehouse?

2- What open source tools are available for this?

3- My data sources are postgresql and mongodb, what open source tools can I use to implement ETL and data ware house?


6 replies

Userlevel 4
Badge

Hi Mohammad, 

Hope this message finds you well! I will try to give you my perspective on these three questions below. When it comes to setting up a solid data lake and data warehouse, it's like navigating a dynamic landscape where concepts are constantly evolving. The key to success lies in finding the sweet spot between scalability, flexibility, and keeping things manageable.

Trying to sum up this complex world in a brief answer is a bit like trying to fit a puzzle piece into the wrong spot. It just doesn't do justice to the intricacies involved. So, it's essential to dive into recent literature, where the real gems of knowledge are waiting to be discovered.

Speaking of gems, two books that really hit the mark in capturing the essence of modern data architecture are "Maturing the Snowflake Data Cloud" by Andrew Carruthers and "The Cloud Data Lake" by Rukmani Gopalan. These books don't just scratch the surface; they dive deep into templated approaches, governance strategies, and crafting a robust cloud data architecture. It's like having a mentor guide you through the real-world challenges. 

We at DataOps.Live also provide a great read that we highly recommend. You can download a free copy of our “DataOps for Dummies” from here: https://www.dataops.live/dataops-for-dummies-download

But hey, these are just the tip of the iceberg. The world of data architecture has a treasure trove of insights from various thought leaders. Each book is like a unique lens, offering a different perspective on how to tackle the hurdles of building effective data solutions.

So, my advice? Go on a bit of a reading spree. Explore different sources, soak in the diverse viewpoints, and build yourself a toolkit of approaches. This way, you not only get a richer understanding of the subject but also become adept at adapting these insights to the specific twists and turns your data engineering projects might throw at you. Happy reading!

Userlevel 4
Badge

When it comes to open-source tools for implementing ETL and data warehousing, there are several widely recognised solutions that have gained significant industry acclaim. Notable among these are:

  • Apache Airflow
  • Apache Kafka
  • DBT
  • Singer.io
  • Talend Open Studio

At DataOps.Live, we offer pre-configured orchestration solutions for some of these open-source tools, as well as for some of their paid counterparts. For detailed information and guidance on orchestrating these tools seamlessly, you can explore our documentation page at: DataOps.Live Documentation. It's important to note that while our platform may not explicitly list every tool, we offer the flexibility to create custom solutions tailored to specific requirements. Our approach is inherently tool-agnostic, allowing users to build and implement solutions that best suit their unique needs.

Userlevel 4
Badge

On your final question, we must admit Snowflake emerges as a highly recommended Data Warehousing solution for your data management needs. Leveraging its cloud-based architecture, Snowflake provides a scalable and efficient platform for storing, processing, and analysing large volumes of data.

When it comes to ETL’ing data from your PostgreSQL and MongoDB sources into the data warehouse, you have a variety of options. Open-source solutions, as previously mentioned, offer cost-effective alternatives. However, you may also consider exploring paid competitors that provide advanced features and support for more complex ETL processes.

To streamline and orchestrate the entire data pipeline, a robust choice is the DataOps.Live platform. This platform not only enhances the flexibility of your data operations but also brings efficiency and automation to the management of your data workflows. With DataOps.Live, you can create, monitor, and optimise your ETL processes, ensuring a smooth and reliable data flow throughout your entire ecosystem.

By incorporating Snowflake as your data warehouse and utilising the capabilities of DataOps.Live, you empower your organisation with a comprehensive and flexible data infrastructure. This combination allows you to efficiently manage, transform, and derive valuable insights from your data, contributing to more informed decision-making and improved business outcomes.

Userlevel 4
Badge

@mohammadnaghibi does @Traycho Milev answer help? 

yeah,0 it was helpful
but there are two points one: I can`t use cloud based services and must to use open source tools
two: I need more explain about open source tools and how use them next each other and I have challenge for implement ETL from mongodb to postgres
 

Userlevel 4
Badge

Hi @mohammadnaghibi If you have limitations on using cloud-based services and you strongly prefer open-source tools, there are several options available for implementing ETL from MongoDB to PostgreSQL using open-source tools. Below, I'll provide a general overview and suggest some tools you can use:

  1. Extract (E):

    • Use mongoexport: This is a command-line tool provided by MongoDB (link here) to export data in various formats. You can export data from MongoDB collections into JSON or CSV files.
    • Example command would be:

      mongoexport --db yourDB --collection yourCollection --out data.json

  2. Transform (T):

    • Use a programming language: Open-source programming languages like Python can be used for data transformation. You can read the exported data, perform necessary transformations, and prepare it for loading into PostgreSQL.
    • Example Python script using pandas:
      import pandas as pd

      # Read data from MongoDB export we did in Step 1
      data = pd.read_json('data.json')

      # Perform transformations
      # ...

      # Save transformed data
      data.to_csv('transformed_data.csv', index=False)

       

  3. Load (L):

    • Use psql (PostgreSQL command-line tool - link here): This tool allows you to interact with PostgreSQL from the command line. You can create tables and load data into PostgreSQL using SQL commands.
    • psql -U your_username -d your_database -c "CREATE TABLE your_table (column1 datatype, column2 datatype, ...);"

       

  4. Orchestration:

    • For orchestrating the ETL process, you can use tools like Apache Airflow or Luigi  link here. These tools allow you to schedule and monitor your ETL workflows. While not open source, DataOps.live is a great choice for getting the job done effectively. Throughout each stage of your development process, you can leverage excellent support.
  5. Challenges and Considerations:

    • Schema Mapping:
      • Ensure that you map the MongoDB schema to the PostgreSQL schema correctly.
    • Data Consistency:
      • Consider how to handle consistency between the MongoDB and PostgreSQL databases during the ETL process.
    • Incremental Updates:
      • If this is a recurring ETL process, think about how to handle incremental updates to the data.
  6. Overall Goal: ETL from MongoDB to PostgreSQL:​​​​​​

    • The initial actions you need to undertake include:

      • Implement a script or workflow that regularly extracts data from a MongoDB collection, transforms it to match the PostgreSQL schema, and loads it into the PostgreSQL database.

      • You can use the provided examples as a starting point and expand upon them to meet your specific requirements.

Remember to install necessary libraries or tools, such as pandas for Python, and configure your MongoDB and PostgreSQL connections accordingly. Adjust the examples based on your specific use case and data structures.

If you have additional questions, feel free to reach out to us.

Reply