• by dekhn on 12/4/2019, 7:17:03 PM

    I recommend the book "Managing Gigabytes", which, while dated, is still relevant. The title doesn't indicate this, but it's heavily focused on data structures for indexing text documents.

    But Elasticsearch running on a cloud VM with an attached EBS volume would be a fast way to get work done.

  • by 1e10 on 12/4/2019, 7:11:19 PM

    1 TB is nothing these days. If you insist on the cloud, then Hetzner could be the best bang for the buck. Otherwise, a similar desktop system can be acquired for less than 1,000 USD.

    I’d start with Solr or Elasticsearch and a simple home-rolled Python indexing script.
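
    A minimal sketch of such a script, assuming one plain-text file per document, the official elasticsearch Python client (8.x), and a made-up index name ("docs") and ./corpus layout:

      # Walk a directory of .txt files and bulk-index them into Elasticsearch.
      from pathlib import Path
      from elasticsearch import Elasticsearch
      from elasticsearch.helpers import bulk

      es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

      def actions():
          # One bulk action per file; the file path doubles as the document id.
          for path in Path("corpus").rglob("*.txt"):
              yield {
                  "_index": "docs",
                  "_id": str(path),
                  "_source": {
                      "path": str(path),
                      "body": path.read_text(errors="ignore"),
                  },
              }

      bulk(es, actions())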

    Then you can use the Solr admin UI or something like a Jupyter notebook for iterative querying.
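
    Iterative querying from a notebook cell could look something like this, reusing the client and the made-up "docs" index from the sketch above:

      # Full-text match query; tweak the terms and rerun the cell as you explore.
      resp = es.search(index="docs", query={"match": {"body": "search terms"}}, size=10)
      for hit in resp["hits"]["hits"]:
          print(hit["_score"], hit["_source"]["path"])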

    I’m not an expert on index tuning, but you might even be able to dump it all into Postgres with JSON types.
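
    A rough sketch of that route, assuming psycopg2, Postgres 12+ (for generated columns), and made-up table and column names:

      # Store each document as jsonb and index the body with built-in full-text search.
      import psycopg2
      from psycopg2.extras import Json

      conn = psycopg2.connect("dbname=corpus")  # connection string is a placeholder
      cur = conn.cursor()

      cur.execute("""
          CREATE TABLE IF NOT EXISTS docs (
              id  serial PRIMARY KEY,
              doc jsonb NOT NULL,
              tsv tsvector GENERATED ALWAYS AS
                  (to_tsvector('english', doc->>'body')) STORED
          );
          CREATE INDEX IF NOT EXISTS docs_tsv_idx ON docs USING gin (tsv);
      """)

      cur.execute("INSERT INTO docs (doc) VALUES (%s)",
                  (Json({"path": "example.txt", "body": "some text"}),))

      cur.execute("SELECT doc->>'path' FROM docs WHERE tsv @@ plainto_tsquery('english', %s)",
                  ("some query",))
      print(cur.fetchall())
      conn.commit()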

    Best of luck!

  • by johnnycarcin on 12/9/2019, 10:57:28 PM

    Coming back to this, I stumbled over https://docs.alephdata.org/ while looking at options. It is a bit more heavyweight than plain Elasticsearch, but it has some nice additions that might make it worthwhile depending on your situation.