by hodgesrm on 2/3/2022, 2:46:40 AM
Try ClickHouse. You can set it up on a laptop, in a VM, or get a trial on one of the cloud services. You didn't say which RDS database you are using, but ClickHouse can read data directly from MySQL and PostgreSQL. Pull the data into tables on ClickHouse and try out your analytic queries to see whether performance seems reasonable.
Hundreds of millions of rows is relatively small for ClickHouse, so you don't need to worry about advanced optimizations like compression codecs or materialized views. Standard data types and the default LZ4 compression should be just fine.
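For example, here is a minimal sketch of the pull, assuming RDS Postgres and made-up host, credentials, table, and column names (for MySQL, swap the postgresql() table function for mysql()):

    CREATE DATABASE IF NOT EXISTS analytics;

    -- Copy the source table into a plain MergeTree table; default LZ4 compression,
    -- no special codecs or materialized views needed at this scale.
    CREATE TABLE analytics.orders
    ENGINE = MergeTree
    ORDER BY (created_at, order_id)
    AS
    SELECT *
    FROM postgresql('my-rds-host:5432', 'appdb', 'orders', 'readonly_user', 'secret');

Once the data is in, point your analytic queries at analytics.orders and see how they perform.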
Here's a quickstart video that will teach you most of what you need to get started: https://www.youtube.com/watch?v=phTu24qCIw0
I work on ClickHouse at Altinity and created the video above, so I'm kind of biased. ClickHouse is not the only OSS data warehouse; there are also Druid and Pinot. I think, however, that for this particular case where you are pulling from RDS, ClickHouse will turn out to be a good solution.
Good luck with whatever approach you choose!
by edmundsauto on 2/2/2022, 11:46:05 AM
ClickHouse for storage, Airflow for orchestration, dbt for reusability in your models and transforms. There are a bunch of decent extractors that have prebuilt connectors to pull data from third parties.
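As a rough sketch, a dbt model on top of ClickHouse is just a SQL file; this assumes the dbt-clickhouse adapter and a hypothetical staging model named stg_orders, with made-up column names:

    -- models/marts/daily_orders.sql
    {{ config(materialized='table') }}

    select
        toDate(created_at) as order_date,
        count()            as order_count,
        sum(total_amount)  as revenue
    from {{ ref('stg_orders') }}
    group by order_date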
Most of all, though, you should think through how users will discover data, and how you will check quality. Once your transforms start getting complicated, you’ll want this (and maybe a metric store).
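For the quality side, dbt's singular tests are a cheap place to start: a test is just a SQL file that returns the rows violating an expectation, and the build fails if any come back (the model and column below are hypothetical):

    -- tests/assert_no_negative_revenue.sql
    -- dbt marks this test as failed if the query returns any rows.
    select *
    from {{ ref('daily_orders') }}
    where revenue < 0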
Since your data isn’t huge, build things so you can rerun them in a different architecture should you outgrow your setup. Keep a certain level of raw data in AWS where possible, and make sure your dbt build works from scratch.
While your data size is small, focus on delivering value from as simple an architecture as you can get away with. If you can rebuild your warehouse because you’re using standards, the underlying tech doesn’t matter as much as the processes and culture you establish.
by brudgers on 2/2/2022, 4:29:44 PM
Have you considered hiring a consultant experienced in solving this class of problem? This is ordinary business process stuff, not your company's core competence.
And it is not something your team has experience doing (presumably, given that you're asking).
Because it's all AWS, this is commodity work. Go to LinkedIn and find someone to talk to. Good luck.
by seektable on 2/4/2022, 7:49:16 PM
> - Reporting (Metabase, Superset, Quicksight?)
SeekTable may be useful for tabular reports, and it is also good for embedding.
I work at a small (~100 person) bootstrapped company. We are hosted almost exclusively on AWS, and we use RDS for most of our transactional DB needs. Up until now, our analytics has been running on a combination of Metabase (hooked up to our RDS databases) and internal admin reporting. It works, but it's all a bit clunky and we're outgrowing its capabilities. At this point in time, this is for internal (i.e. management) analytics only.
We are at the stage now where we need to start doing it "properly". I have a reasonably good understanding of tech in general, including on the infra, database, and SQL side. I have also done a fair bit of research into data warehousing, ELT/ETL pipelines, etc.
To start off with, I'd like to get a proof of concept up and running, just by extracting data from our RDS databases (we don't have a HUGE amount of data - a few hundred million records spread over 2 DBs). There are other data sources that we'd like to connect to, but that can come later.
We would prefer AWS or self hosted, but we are open to any possibilities (Google, Fivetran, Snowflake, etc). I guess there are 3 pieces to the puzzle:
- Data warehouse (I almost think we should just push this all into another RDS database and denormalise, aggregate, etc. where appropriate - see the rough sketch after this list).
- ETL/ELT
- Reporting (Metabase, Superset, Quicksight?)
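To illustrate the first point, something along these lines is what I have in mind for the denormalised tables - a purely illustrative daily summary pushed into a separate reporting Postgres RDS (table and column names are made up):

    CREATE SCHEMA IF NOT EXISTS reporting;

    -- Denormalised, pre-aggregated table for the reporting tool to query directly.
    CREATE TABLE reporting.daily_orders AS
    SELECT
        c.customer_id,
        c.customer_name,
        date_trunc('day', o.created_at) AS order_day,
        count(*)                        AS order_count,
        sum(o.total_amount)             AS revenue
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.customer_name, date_trunc('day', o.created_at);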
We're looking for the simplest option that we can use to get up and running, but also have a clear path to growing with us over time. Do you have any suggestions on the best way to get started?