by Obertr on 9/22/2023, 2:26:43 PM with 0 comments
Hey! I am an AI engineer, and I am currently trying to set up a GPU endpoint for inference on a GTE embedding model.
Right now our price per 1k tokens is exactly the same as OpenAI's ada-002.
I run ONNX Runtime inference on runpod.io, so we pay per second.
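For context, the setup looks roughly like the sketch below: ONNX Runtime with the CUDA execution provider, plus mean pooling on top. The model path and pooling details here are illustrative assumptions, not our exact production code.

```python
# Rough sketch of the setup: GTE embeddings via ONNX Runtime on GPU.
# The exported model path ("gte-base.onnx") is a placeholder.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")
session = ort.InferenceSession(
    "gte-base.onnx",  # hypothetical ONNX export of the GTE model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

texts = ["example sentence to embed"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="np")

# Only feed the inputs the exported graph actually declares.
expected = {i.name for i in session.get_inputs()}
feed = {k: v for k, v in inputs.items() if k in expected}
outputs = session.run(None, feed)

# Mean-pool the last hidden state over non-padding tokens.
hidden = outputs[0]  # shape: (batch, seq_len, dim)
mask = inputs["attention_mask"][..., None].astype(hidden.dtype)
embeddings = (hidden * mask).sum(axis=1) / mask.sum(axis=1)
```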
I know it is theoretically possible to cut the cost much further, but I am limited in the number of experiments I can run.
I am wondering if there is anyone who could help me figure out low-level NVIDIA GPU optimization.
Please DM me if you have the expertise and can help!
https://x.com/karmedge