Hi JoyCaption developers,
I’m using JoyCaption locally on an Apple Silicon MacBook Pro with an M4 Max GPU. My use case is accessibility: I am building a local video-to-audio-description workflow for blind users, where JoyCaption captions sampled video frames and the app turns those captions into spoken narration.
The current Hugging Face / LLaVA-style JoyCaption model can run on Mac in some setups, but Apple Silicon support is still difficult compared with CUDA. PyTorch MPS / Metal compatibility, memory use, dtype handling, and speed can be challenging, especially with newer or larger JoyCaption versions.
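To make the device and dtype concerns concrete, here is a minimal sketch of the kind of fallback logic a Mac setup tends to need. The function name `pick_device_and_dtype` and the dtype choices (bfloat16 on CUDA, float16 on MPS, float32 on CPU) are illustrative assumptions, not JoyCaption's documented configuration; the right dtype on MPS may depend on the PyTorch and macOS versions in use.

```python
import importlib.util


def pick_device_and_dtype():
    """Return a (device, dtype_name) pair for local inference.

    Hypothetical helper: the dtype per device is an assumption for
    illustration, not a recommendation from the JoyCaption project.
    """
    # Fall back to CPU defaults if PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return "cpu", "float32"

    import torch

    if torch.cuda.is_available():
        return "cuda", "bfloat16"

    # torch.backends.mps is missing on older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        # Assumption: float16 is used here because bfloat16 support on MPS
        # varies by PyTorch and macOS version.
        return "mps", "float16"

    return "cpu", "float32"


device, dtype_name = pick_device_and_dtype()
print(f"device={device} dtype={dtype_name}")
```

With PyTorch installed, the returned name can be turned into a dtype with `getattr(torch, dtype_name)` and passed to the model loader together with the device.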
I wanted to ask whether you would consider one of the following for future JoyCaption releases:
- Better tested Apple Silicon support through PyTorch MPS / Metal
- An MLX-compatible version for Apple Silicon
- A Core ML export or conversion guide
- A smaller or quantized Mac-friendly model
- Clear Mac setup notes for Transformers users
This would be very helpful for accessibility-focused local apps. Many blind users and Mac users would benefit from a strong local image/video captioning model that does not require NVIDIA CUDA hardware.
My current system:
- Apple Silicon MacBook Pro, M4 Max
- macOS
- PyTorch MPS / Metal
- Hugging Face Transformers
- Local JoyCaption model folder
Thank you for building JoyCaption and making it available to the community.