I mean it in the sense that I can upload a low-quality phone photo of a page from a Chinese cookbook and it will OCR it, translate it into English, and give me a summary of the ingredients.
I’ve been looking into vision models, but they seem daunting to set up, and the specs list things like 384x384 input resolution, so it doesn’t seem like they’d be able to do what I’m looking for. Am I even searching in the right direction?
No idea what your skill level is, but try installing Open WebUI and downloading any of the Ollama vision models.
There’s a bit of a learning curve to running Docker, but ChatGPT can easily get you to the point where it’s running.
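Once Ollama itself is up, you don’t even need the UI to test things. Here’s a rough sketch of hitting Ollama’s REST API from Python with an image attached. It assumes Ollama is on its default port (11434) and that you’ve pulled some vision model (e.g. `ollama pull llama3.2-vision`); the model name, image path, and prompt are all placeholders:

```python
import base64
import json
import urllib.request

# Read the photo and base64-encode it; Ollama's chat API takes images
# as base64 strings. "cookbook.jpg" is a placeholder path.
with open("cookbook.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "llama3.2-vision",  # any vision model you've pulled
    "stream": False,
    "messages": [{
        "role": "user",
        "content": "OCR this page, translate it into English, "
                   "and summarize the ingredients.",
        "images": [image_b64],
    }],
}

req = urllib.request.Request(
    "http://localhost:11434/api/chat",  # Ollama's default local endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())

print(reply["message"]["content"])
```

If that works from a script but not in the browser, you know the problem is the UI and not the model.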
I’m not sure if I’m doing something wrong here, but Open WebUI has been weird for me. I tried running nanonets-ocr, but it only read the last few lines visible in the photo. And other models would start reprocessing the whole chat and ignore the latest image I posted, answering with the context of the previous reply instead… Web search is easy with it, though, so I’ll keep an eye on it and maybe try again later.
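Edit: one thing I might try for the “only reads the last lines” issue: if the model downscales tall phone photos internally (that 384x384 spec again), slicing the page into overlapping strips and OCR-ing each one separately could help. A rough, untested sketch using Pillow and the ollama Python client; the model name and paths are placeholders:

```python
from PIL import Image  # pip install pillow
import ollama          # pip install ollama; needs a running Ollama server

# If the model downscales tall photos internally, cut the page into
# overlapping horizontal strips so every line gets seen at least once.
page = Image.open("cookbook.jpg")
w, h = page.size
strip_h = w  # roughly square strips suit square model inputs
results = []
for top in range(0, h, strip_h // 2):  # 50% overlap so no line is cut in half
    strip = page.crop((0, top, w, min(top + strip_h, h)))
    strip.save("/tmp/strip.png")
    reply = ollama.chat(
        model="llama3.2-vision",
        messages=[{
            "role": "user",
            "content": "Transcribe all text in this image.",
            "images": ["/tmp/strip.png"],
        }],
    )
    results.append(reply["message"]["content"])

print("\n".join(results))
```

You’d still have to de-duplicate the overlapping lines afterwards, but at least it would show whether the truncation is a resolution problem or an Open WebUI problem.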