Running the latency demo with CUDA 13 generates invalid tokens; switching to CUDA 12.8 fixes the issue

When I build ml-llama with CUDA 13 and then run the command below:

`python megakernels/scripts/generate.py mode=mk prompt="tell me a funny joke about cookies" ntok=100`

All generated tokens except the first one are "!". 

After asking Gemini AI, I switched to a cloud VM with CUDA 12.8, and the demo works there.
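
For anyone who wants to check which toolkit they are building against before running the demo, here is a minimal sketch. It assumes `nvcc` is on `PATH` and that its `--version` output contains a `release X.Y` string. The helper names `parse_nvcc_release` and `toolkit_release` are hypothetical and not part of this repo, and the "13 or newer" warning is based only on the behavior described in this report.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Return (major, minor) parsed from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def toolkit_release():
    """Run `nvcc --version` if nvcc is on PATH and parse the release number."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    release = toolkit_release()
    if release is None:
        print("Could not determine the CUDA toolkit version (is nvcc on PATH?)")
    elif release[0] >= 13:
        # Assumption: based solely on this report, CUDA 13 builds may emit "!" tokens.
        print(f"CUDA {release[0]}.{release[1]} detected; this report saw invalid tokens with CUDA 13")
    else:
        print(f"CUDA {release[0]}.{release[1]} detected")
```

Note that this only reports the toolkit that `nvcc` belongs to; the CUDA version a PyTorch install was built with can differ and is available separately as `torch.version.cuda`.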