• by NortySpock on 12/17/2024, 9:07:32 PM

    Interesting, I've been looking for a system / tool that acknowledges that a dbt transformation pipeline tends to be joined at the hip with the data ingestion mode.

    As I read through the documentation: do you have a mode in ingestr that lets you specify the maximum lateness of a file (for late-arriving rows, files, or backfills)? I didn't see it in my brief read-through.

    https://bruin-data.github.io/bruin/assets/ingestr.html

    Reminds me a bit of Benthos / Bento / RedPanda Connect (in a good way)

    Interested to kick the tires on this (compared to, say, Python dlt)

  • by peterm4 on 12/18/2024, 9:31:43 PM

    I'd absolutely love to love this.

    Using dbt at $JOB, and building a custom dbt adapter for our legacy data repos, I've slowly developed a difficult relationship with dbt's internals and externals. I've been struggling with the way it (Python) handles concurrency, threading, and timeouts on long-running (4hr+) jobs, and the like. Not to mention inconsistencies in the way it handles Jinja in config files vs SQL files, plus its lack of ingestion handling and VSCode/editor support, both of which it seems like Bruin considers very well! Since I started poking around inside dbt I've felt like Go or Rust would be a far more suitable platform for a pipeline-building tool, and this looks to be going in a great direction, so congrats on the launch and best of luck with your cloud offering.

    That being said, I tried starting the example Bruin pipeline with DuckDB on a current data project, and I'm having no luck getting the connection to appear with `bruin connections list`, so nothing will run. So it looks like I'm going to have to stick with dbt for now. It might be worth adding some more documentation around the `.bruin.yml` file; dbt has great documentation listing the purpose and layout of each file in the folder, which is very helpful when trying to set things up.
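
    For anyone else stuck at the same point, this is the rough shape of a `.bruin.yml` I pieced together from the getting-started material — treat every field name and value here as an unverified assumption, not confirmed schema:

    ```yaml
    # Sketch only — key names are my best guess, not confirmed Bruin schema.
    default_environment: default
    environments:
      default:
        connections:
          duckdb:
            - name: duckdb-default      # should match the connection name the pipeline's assets reference
              path: ./data/analytics.db # assumed path to a local DuckDB file
    ```

    If `bruin connections list` still comes up empty, a mismatch between the connection `name` here and the name the pipeline references is the first thing I'd check.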

  • by jmccarthy on 12/17/2024, 7:38:09 PM

    Burak - one wish I've had recently is for a "py data ecosystem compiler", specifically one which allows me to express structures and transformations in dbt and Ibis, but not rely on Python at runtime. [Go|Rust]+[DuckDB|chDB|DataFusion] for the runtime. Bruin seems very close to the mark! Following.

  • by halfcat on 12/18/2024, 3:39:01 AM

    I always thought Hamilton [1] does a good job of giving enough visual hooks that draw you in.

    I also noticed this pattern where library authors sometimes do a bit extra in terms of discussing and even promoting their competitors, and it makes me trust them more. A “here's why ours is better and everyone else sucks…” section always comes across like the infomercial character who is having such a hard time peeling an apple that you wonder if this is the first time they've used their hands.

    One thing I wish for is a tool that's essentially just Celery but doesn't require a message broker (it can just use a database), and which is supported on Windows. There's always a handful of edge cases where we're pulling data from an old 32-bit system on Windows. And basically every system has some not-quite-ergonomic workaround that's as much work as if you'd just built it yourself.

    It seems like it’s just sending a JSON message over a queue or HTTP API and the worker receives it and runs the task. Maybe it’s way harder than I’m envisioning (but I don’t think so because I’ve already written most of it).
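    For what it's worth, the broker-less version really can be that small. A sketch of the idea — a plain database table as the queue, JSON payloads, workers polling for pending rows. Everything here (table layout, function names, SQLite standing in for "just a database") is hypothetical, and a real version would need atomic claiming, retries, and timeouts:

    ```python
    import json
    import sqlite3

    # Tasks are JSON rows in an ordinary database table; no broker involved.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks ("
        "id INTEGER PRIMARY KEY, payload TEXT, status TEXT DEFAULT 'pending')"
    )

    REGISTRY = {}  # task name -> callable

    def task(fn):
        """Register a function so workers can look it up by name."""
        REGISTRY[fn.__name__] = fn
        return fn

    def enqueue(task_name, **kwargs):
        """Producer side: insert a JSON message describing the task."""
        conn.execute(
            "INSERT INTO tasks (payload) VALUES (?)",
            (json.dumps({"task": task_name, "kwargs": kwargs}),),
        )
        conn.commit()

    def run_one():
        """Worker side: claim one pending task, run it, mark it done.

        NOTE: this SELECT-then-UPDATE claim is not safe with multiple
        workers; a real version needs an atomic claim (e.g. UPDATE ...
        RETURNING, or SELECT ... FOR UPDATE SKIP LOCKED on Postgres).
        """
        row = conn.execute(
            "SELECT id, payload FROM tasks WHERE status = 'pending' LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        task_id, payload = row
        conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (task_id,))
        msg = json.loads(payload)
        result = REGISTRY[msg["task"]](**msg["kwargs"])
        conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
        conn.commit()
        return result

    @task
    def add(x, y):
        return x + y

    enqueue("add", x=2, y=3)
    ```

    The Windows story is exactly why the database-as-queue shape appeals here: there's no broker daemon to install, just a connection string.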

    I guess that’s one thing I’m not clear on with Bruin: can I run workers in different physical locations and have them carry out the tasks in the right order? Or is this more of a centralized thing? (Meaning even if it’s K8s or Dask or Ray, those are all run in a cluster which happens to be distributed, but they’re all machines sitting in the same subnet, which isn’t the definition of a “distributed task” I’m going for.)

    [1] https://github.com/DAGWorks-Inc/hamilton

  • by thruflo on 12/17/2024, 6:33:55 PM

    It’s pretty remarkable what Bruin brings together into a single tool / workflow.

    If you’re doing data analytics in Python it’s well worth a look.

  • by mushufasa on 12/17/2024, 11:19:57 PM

    Hi Burak, thanks for posting! We're looking for a tool in this space and i'll take a look.

    Does Bruin support specifying and visualizing DAGs? I didn't see that in the documentation on a quick look, but I thought I'd ask because you may use different terminology for the same concept.

  • by alpb on 12/17/2024, 11:35:08 PM

    Congrats Burak, I can tell a lot of work has gone into this. If I may make a recommendation: a comparison of this project with other similar / state-of-the-art projects would be really good to have in your documentation, so others can understand how your approach differs from theirs.

  • by havef on 12/18/2024, 8:45:16 AM

    Hi Burak, this looks interesting. I was wondering, do you know about Redpanda Connect? Maybe you can take advantage of some of its ready-made components. It's also developed in Go:

    - https://docs.redpanda.com/redpanda-connect/home/

    - https://github.com/redpanda-data/connect

  • by JeffMcCune on 12/17/2024, 7:49:35 PM

    Congrats on the launch! Since this is Go have you considered using CUE or looked at their flow package? Curious how you see it relating or helping with data pipelines.

  • by ellisv on 12/17/2024, 7:18:57 PM

    Direct link to the documentation:

    https://bruin-data.github.io/bruin/

  • by gigatexal on 12/18/2024, 8:21:00 PM

    Doing ingestion with dlt likely would have given you more connectors out of the box. Still very cool. I saw you talking about this on LinkedIn.

  • by producthunter90 on 12/17/2024, 5:07:02 PM

    How does it handle scheduling or orchestrating pipeline runs? Do you integrate with tools like Airflow, or is there a built-in solution for that?

  • by evalsock on 12/18/2024, 6:04:18 AM

    Do you have an integration for ML orchestration, so we can reuse Bruin inside our existing pipeline?

  • by wodenokoto on 12/18/2024, 6:06:23 AM

    That ingestr CLI you also developed and just casually reference seems very, very cool!

  • by Multrex on 12/17/2024, 11:12:44 PM

    Why is there no MySQL integration? Do you plan to add it? MySQL is very popular.

  • by sakshy14 on 12/18/2024, 6:08:37 AM

    I just used your getting started guide and it's freaking amazing

  • by kyt on 12/17/2024, 6:41:53 PM

    Why use this over Meltano?

  • by uniquenamehere on 12/18/2024, 1:48:22 AM

    This looks cool! How would this compare to Benthos?

  • by kakoni on 12/17/2024, 9:59:00 PM

    Is dlt part of bruin-stack?

  • by drchaim on 12/18/2024, 12:20:47 AM

    Interesting, congrats! I've felt the same challenges but ended up using custom Python with dbt and DuckDB. I'll take a look!

  • by tony_francis on 12/17/2024, 8:11:10 PM

    How does this compare to Ray Data?