Hacker News Clone

Show HN: I built an open-source data pipeline tool in Go

by karakanb on 12/17/2024, 4:40:31 PM with 48 comments

by NortySpock on 12/17/2024, 9:07:32 PM
Interesting, I've been looking for a system / tool that acknowledges that a dbt transformation pipeline tends to be joined-at-the-hip with the data ingestion mode....
As I read through the documentation: do you have a mode in ingestr that lets you specify the maximum lateness of a file (for late-arriving rows, files, or backfills)? I didn't see it in my brief read-through.
https://bruin-data.github.io/bruin/assets/ingestr.html
Reminds me a bit of Benthos / Bento / RedPanda Connect (in a good way)
Interested to kick the tires on this (compared to, say, Python dlt)
by peterm4 on 12/18/2024, 9:31:43 PM
I'd absolutely love to love this.
Using dbt at $JOB, and building a custom dbt adapter for our legacy data repos, I've slowly developed a difficult relationship with dbt's internals and externals. I struggle with the way it (Python) handles concurrency, threading, timeouts with long-running (4hr+) jobs, and the like. Not to mention inconsistencies in the way it handles Jinja in config files vs SQL files. Also its lack of ingestion handling and VSCode/editor support, which it seems like Bruin considers very well! Since I started poking around on the inside of dbt I've felt like Go or Rust would be a far more suitable platform for a pipeline building tool, and this looks to be going in a great direction, so congrats on the launch and best of luck with your cloud offering.
That being said, I tried starting the example bruin pipeline with duckdb on a current data project, and I'm having no luck getting the connection to appear with `bruin connections list` so nothing will run. So looks like I'm going to have to stick with dbt for now. Might be worth adding some more documentation around the .bruin.yml file; dbt has great documentation listing the purpose and layout of each file in the folder which is very helpful when trying to set things up.
by jmccarthy on 12/17/2024, 7:38:09 PM
Burak - one wish I've had recently is for a "py data ecosystem compiler", specifically one which allows me to express structures and transformations in dbt and Ibis, but not rely on Python at runtime. [Go|Rust]+[DuckDB|chDB|DataFusion] for the runtime. Bruin seems very close to the mark! Following.
by halfcat on 12/18/2024, 3:39:01 AM
I always thought Hamilton [1] does a good job of giving enough visual hooks that draw you in.
I also noticed this pattern where library authors sometimes do a bit extra in terms of discussing and even promoting their competitors, and it makes me trust them more. A “here’s why ours is better and everyone else sucks …” section always comes across as the infomercial character who is having quite a hard time peeling an apple, to the point you wonder if this is the first time they’ve used hands.
One thing I wish for is a tool that’s essentially just Celery that doesn’t require a message broker (and can just use a database), and which is supported on Windows. There’s always a handful of edge cases where we’re pulling data from an old 32-bit system on Windows. And basically every system has some not-quite-ergonomic workaround that’s as much work as if you’d just built it yourself.
It seems like it’s just sending a JSON message over a queue or HTTP API and the worker receives it and runs the task. Maybe it’s way harder than I’m envisioning (but I don’t think so because I’ve already written most of it).
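For what it's worth, the database-backed version of that idea can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not how Bruin or Celery works: it assumes SQLite as the database, and the `tasks` table and the helpers `init_queue`, `enqueue`, and `run_next` are hypothetical names made up for the example. It skips retries, timeouts, and ordering between dependent tasks.

```python
import json
import sqlite3


def init_queue(conn):
    """Create the (hypothetical) tasks table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending', "
        "result TEXT)"
    )
    conn.commit()


def enqueue(conn, name, **kwargs):
    """Store a task as a JSON message; the database row is the queue entry."""
    payload = json.dumps({"task": name, "kwargs": kwargs})
    cur = conn.execute("INSERT INTO tasks (payload) VALUES (?)", (payload,))
    conn.commit()
    return cur.lastrowid


def run_next(conn, registry):
    """Claim the oldest pending task, run it, and record the outcome.

    Returns the task id, or None if nothing was pending or another
    worker claimed the row first.
    """
    row = conn.execute(
        "SELECT id, payload FROM tasks WHERE status = 'pending' "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    task_id, payload = row

    # The status check in the WHERE clause makes the claim conditional,
    # so two workers cannot both move the same row to 'running'.
    claimed = conn.execute(
        "UPDATE tasks SET status = 'running' "
        "WHERE id = ? AND status = 'pending'",
        (task_id,),
    )
    conn.commit()
    if claimed.rowcount != 1:
        return None

    message = json.loads(payload)
    try:
        result = registry[message["task"]](**message["kwargs"])
        status, stored = "done", json.dumps(result)
    except Exception as exc:  # record the failure instead of crashing the worker
        status, stored = "failed", json.dumps(str(exc))
    conn.execute(
        "UPDATE tasks SET status = ?, result = ? WHERE id = ?",
        (status, stored, task_id),
    )
    conn.commit()
    return task_id
```

A worker would then be a loop that calls `run_next(conn, {"add": add})` and sleeps briefly when it returns `None`. A shared server database would be needed for workers on different machines; SQLite here just keeps the sketch self-contained.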
I guess that’s one thing I’m not clear on with Bruin: can I run workers in different physical locations and have them carry out the tasks in the right order? Or is this more of a centralized thing (meaning even if it’s K8s or Dask or Ray, those are all run in a cluster which happens to be distributed, but they’re all machines sitting in the same subnet, which isn’t the definition of a “distributed task” I’m going for)?
[1] https://github.com/DAGWorks-Inc/hamilton
by thruflo on 12/17/2024, 6:33:55 PM
It’s pretty remarkable what Bruin brings together into a single tool / workflow.
If you’re doing data analytics in Python it’s well worth a look.
by mushufasa on 12/17/2024, 11:19:57 PM
Hi Burak, thanks for posting! We're looking for a tool in this space and I'll take a look.
Does Bruin support specifying and visualizing DAGs? I didn't see that in the documentation via a quick look, but I thought to ask because you may use different terminology that can be a substitute.
by alpb on 12/17/2024, 11:35:08 PM
Congrats Burak, I can tell a lot of work has gone into this. If I may recommend, a comparison of this project with similar other/state-of-the-art projects would be really good to have in your documentation set for others to understand how your approach differs from them.
by havef on 12/18/2024, 8:45:16 AM
Hi Burak, it looks interesting. I was wondering, do you know about Redpanda Connect? Maybe you can take advantage of some of its ready-made components. It is also developed in Go.
- https://docs.redpanda.com/redpanda-connect/home/
- https://github.com/redpanda-data/connect
by JeffMcCune on 12/17/2024, 7:49:35 PM
Congrats on the launch! Since this is Go have you considered using CUE or looked at their flow package? Curious how you see it relating or helping with data pipelines.
by ellisv on 12/17/2024, 7:18:57 PM
Direct link to the documentation:
https://bruin-data.github.io/bruin/
by gigatexal on 12/18/2024, 8:21:00 PM
Ingestion with DLT likely would have given you more connections to things. Still very cool. I saw you talking about this on LinkedIn.
by producthunter90 on 12/17/2024, 5:07:02 PM
How does it handle scheduling or orchestrating pipeline runs? Do you integrate with tools like Airflow, or is there a built-in solution for that?
by evalsock on 12/18/2024, 6:04:18 AM
Do you have integration for ML orchestration to reuse bruin inside our existing pipeline?
by wodenokoto on 12/18/2024, 6:06:23 AM
That ingestr CLI you also developed and just casually reference seems very, very cool!
by Multrex on 12/17/2024, 11:12:44 PM
Why is there no MySQL integration? Do you plan to add it? MySQL is very popular.
by sakshy14 on 12/18/2024, 6:08:37 AM
I just used your getting started guide and it's freaking amazing
by kyt on 12/17/2024, 6:41:53 PM
Why use this over Meltano?
by uniquenamehere on 12/18/2024, 1:48:22 AM
This looks cool! How would this compare to Benthos?
by kakoni on 12/17/2024, 9:59:00 PM
Is dlt part of bruin-stack?
by drchaim on 12/18/2024, 12:20:47 AM
Interesting, congrats! I've felt the same challenges but ended up using custom Python with dbt and DuckDB. I'll take a look!
by tony_francis on 12/17/2024, 8:11:10 PM
How does this compare to ray data?