by raihansaputra on 7/23/2024, 9:41:16 AM with 0 comments
I need to ingest files, run Python notebooks or scripts (pandas etc.) on them, and output the results to a database, files, and dashboard(s). Ideally the runs/executions can be graphed and versioned, so they can be re-run when the code or data changes.
I keep exploring ETL/ELT/DAG/MLOps tools, and they all seem to be either very complicated or gated behind enterprise pricing.
I don't mind building another web app to ingest the data into S3 and/or display it from the database, but ideally it's all integrated.
Some I've looked into:
- Windmill: The best fit, except that the Apps portion also requires users to be subscribed to the workspace (even when self-hosted). This can be worked around with a separate web app that uploads to an S3 bucket, with Windmill ingesting from there. The enterprise offerings are eye-wateringly expensive for businesses outside the first world.
- Pachyderm: An interesting offering since they focus on data versioning and data pipelines, but I'm unsure about the dashboard part.
Airflow, Dagster, Temporal, Prefect, etc. are a bit too code-heavy for my use case. I want the graphs and the larger logic flows to be understandable by people who aren't data engineers.
I don't have large data volumes to deal with, so ideally something that runs on one machine by default and scales as needed. Clarity, dependability, and simplicity of deployment are the priorities.
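To make the shape of the job concrete, here is a minimal sketch of one ingest, transform, and store run on a single machine. It uses only the Python standard library so it runs anywhere; `transform` is a stand-in for the pandas/notebook step, and the table names (`runs`, `totals`), the CSV columns (`category`, `amount`), and the `code_version` label are all made up for illustration, not taken from any of the tools above.

```python
import csv
import hashlib
import io
import sqlite3
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Short content hash used to version an input payload."""
    return hashlib.sha256(data).hexdigest()[:12]


def transform(rows):
    """Placeholder for the pandas/notebook step: total `amount` per `category`."""
    totals = {}
    for row in rows:
        key = row["category"]
        totals[key] = totals.get(key, 0.0) + float(row["amount"])
    return totals


def run_pipeline(conn, raw: bytes, code_version: str) -> str:
    """Ingest one CSV payload, transform it, and record the run.

    The run id combines the data hash and a caller-supplied code version
    (hypothetical label, e.g. a git commit), so a re-run with the same data
    and code overwrites its earlier result instead of duplicating it.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs "
        "(run_id TEXT PRIMARY KEY, data_hash TEXT, code_version TEXT, started_at TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS totals (run_id TEXT, category TEXT, total REAL)"
    )

    data_hash = fingerprint(raw)
    run_id = f"{data_hash}-{code_version}"
    started_at = datetime.now(timezone.utc).isoformat()

    rows = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    totals = transform(rows)

    # One transaction per run: either the whole run is recorded or none of it.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO runs VALUES (?, ?, ?, ?)",
            (run_id, data_hash, code_version, started_at),
        )
        conn.execute("DELETE FROM totals WHERE run_id = ?", (run_id,))
        conn.executemany(
            "INSERT INTO totals VALUES (?, ?, ?)",
            [(run_id, cat, total) for cat, total in sorted(totals.items())],
        )
    return run_id
```

A dashboard would then just query `totals` joined to `runs`. What I'm looking for is a tool that gives me this plus the graph view, the run history, and the file upload, without me having to build each piece myself.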
Any suggestions?