Hi,
I'm trying to reproduce the results reported in "InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning", but I'm having difficulty reproducing the InstructBLIP (Vicuna-7B) results on the Flickr30K test set for the image captioning task.
I'm using the model from Hugging Face and running the code snippet below, and I'm getting a CIDEr score of 60.9, while the reported score is 82.4.
I'm using the prompt reported in the paper, "A short image description: ", and the decoding hyperparameters from the Hugging Face example (see the sketch below). Am I using the correct hyperparameters and prompt?
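For reference, here's a minimal sketch of what I'm running per image — essentially the Hugging Face example with the caption prompt swapped in. The model id and decoding values follow the HF model card; the image path is a placeholder, and this is a sketch rather than my exact evaluation loop:

```python
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_id = "Salesforce/instructblip-vicuna-7b"
processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(model_id)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# One Flickr30K test image (hypothetical path); the real loop
# iterates over the full test split.
image = Image.open("flickr30k/example.jpg").convert("RGB")
prompt = "A short image description: "
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# Decoding hyperparameters taken from the Hugging Face example.
outputs = model.generate(
    **inputs,
    do_sample=False,
    num_beams=5,
    max_length=256,
    min_length=1,
    repetition_penalty=1.5,
    length_penalty=1.0,
)
caption = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(caption)
```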
PS: using the same hyperparameters with the prompt "A short image caption." increases the CIDEr score to 83.1.
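And in case the scoring step itself matters, this is roughly how I'm computing CIDEr — a minimal sketch assuming pycocoevalcap's Cider scorer, with toy captions standing in for the real Flickr30K references:

```python
from pycocoevalcap.cider.cider import Cider

# Dicts keyed by image id; values are lists of caption strings.
# In practice the captions are tokenized first (e.g. with
# pycocoevalcap's PTBTokenizer) — toy data shown here.
gts = {
    "1": ["a man riding a horse on a beach",
          "a person rides a horse along the shore"],
    "2": ["two dogs playing in the snow"],
}
res = {
    "1": ["a man rides a horse on the beach"],
    "2": ["dogs playing in snow"],
}

scorer = Cider()
corpus_score, per_image = scorer.compute_score(gts, res)
print(f"CIDEr: {corpus_score * 100:.1f}")  # the paper reports CIDEr x 100
```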