• by dmpetrov on 8/17/2024, 10:16:40 PM

    Hey! I'm one of the creators of DataChain.

    DataChain runs on your local machine and manages files in storage (like images and PDFs in S3 or GCP). Users can slice and dice their files using metadata. Examples:

    - Download only the files labeled "Cats" instead of the whole dataset. Use JSON/Parquet files to get the labels.

    - Use LLMs to generate metadata. E.g., "Are there more than 3 people in the image?".

    - Add custom metadata to create a rich "DataFrame" of your files.
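
    To make the first bullet concrete, here is a minimal sketch of the idea in plain Python. This is not DataChain's API. The JSON layout, the paths, and the select_paths helper are all made up for illustration; it only shows how label metadata can narrow down which files are worth downloading.

```python
import json

# Hypothetical label metadata, as it might sit in a JSON file next to the images.
LABELS_JSON = """
[
  {"path": "images/001.jpg", "label": "Cats"},
  {"path": "images/002.jpg", "label": "Dogs"},
  {"path": "images/003.jpg", "label": "Cats"}
]
"""


def select_paths(labels_json: str, wanted: str) -> list[str]:
    """Return the paths whose label equals `wanted`, in file order."""
    records = json.loads(labels_json)
    return [r["path"] for r in records if r["label"] == wanted]


# Only these paths would need to be fetched from storage.
cat_paths = select_paths(LABELS_JSON, "Cats")
print(cat_paths)
```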

    The DataFrame API is based on Python (Pydantic), but queries over Python objects are transpiled to database queries (SQLite). Or you can just convert all the metadata to Pandas if you prefer.
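
    A toy sketch of what "Python expression in, SQLite query out" can look like, using only the standard library. This is not how DataChain implements it. Column, Condition, run_filter, and the files table are hypothetical names, used only to show the general technique of overloading Python operators to build a parameterized SQL query.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """A SQL WHERE fragment plus its bound parameters."""
    sql: str
    params: tuple


class Column:
    """Hypothetical column handle: comparisons build SQL instead of booleans."""

    def __init__(self, name: str):
        self.name = name  # assumed to be a trusted identifier, not user input

    def __eq__(self, other):
        return Condition(f"{self.name} = ?", (other,))

    def __gt__(self, other):
        return Condition(f"{self.name} > ?", (other,))


def run_filter(conn: sqlite3.Connection, table: str, cond: Condition) -> list[str]:
    """Run the condition inside SQLite and return the matching file paths."""
    query = f"SELECT path FROM {table} WHERE {cond.sql} ORDER BY path"
    return [row[0] for row in conn.execute(query, cond.params)]


# Made-up metadata table standing in for a file index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT, label TEXT, num_people INTEGER)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [("a.jpg", "Cats", 0), ("b.jpg", "Dogs", 4), ("c.jpg", "Cats", 5)],
)

label = Column("label")
num_people = Column("num_people")

cats = run_filter(conn, "files", label == "Cats")     # WHERE label = ?
crowded = run_filter(conn, "files", num_people > 3)   # WHERE num_people > ?
print(cats, crowded)
```

    The filtering happens in the database, so the Python side never has to load every row into memory.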

    WDYT? I’d love to hear your thoughts!