With $1 on Windows, pressing Enter doesn't seem to remove the overlay. Also, both NVDA and the extension read out the caption. If it were me, I would probably modify the alt text of the original image that the user right-clicked on, setting it to the returned caption, and then just play a small sound to signify that the work has been done. That way, the user can review the caption with their regular screen reader commands, without losing their place in the web page or having to listen to a third-party voice reading it.
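To make the alt-text suggestion concrete, here is a rough TypeScript sketch. I don't know how the extension is structured internally, so `applyCaption`, `ImageLike`, and `playCue` are hypothetical names of my own, and the caption is assumed to arrive as a plain string.

```typescript
// Minimal structural type so the sketch does not depend on the DOM;
// a real HTMLImageElement satisfies it because it has an `alt` property.
interface ImageLike {
  alt: string;
}

// Hypothetical helper: writes the returned caption into the image's alt
// text, then calls a caller-supplied cue (for example, a short beep).
// Nothing is spoken here; the user's own screen reader reads the new alt.
function applyCaption(
  img: ImageLike,
  caption: string,
  playCue: () => void
): string {
  const previousAlt = img.alt;
  img.alt = caption.trim();
  playCue();
  // Returned so the caller can restore the original alt text if needed.
  return previousAlt;
}
```

In the content script this might be called as `applyCaption(clickedImage, caption, () => { void new Audio(cueUrl).play(); })`, where `clickedImage` and `cueUrl` are again assumed names for the right-clicked element and a short sound file shipped with the extension.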
Also, if the image contains text, your extension doesn't seem to indicate that at all. Even if it can't do OCR itself, it would be useful if it returned something like "Also contains text". Many screen readers (NVDA and JAWS at least) have a command to perform OCR on an image. However, OCR only recognizes the text; it doesn't describe the image at all. So, for example, when I OCR the first image on $1 with NVDA, it returns:
> Testimonial
>
> This is a test of the testimonial section.
>
> -- John Doe, US
When I ask your extension for a caption, it returns:
> a close up of a person holding a wii remote
I have no idea where it's getting that; I'm totally blind myself. Nevertheless, the point of the image is the text, so a hint that the user should try OCR would be useful.
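A rough sketch of how such a hint could be attached to the caption, in TypeScript. This assumes the captioning step can report whether it detected text; the `containsText` flag, the `CaptionResult` shape, and the function name are all hypothetical, since I don't know what the captioning service actually returns.

```typescript
// Hypothetical shape of a captioning result. `containsText` is an assumed
// flag, not something the extension is known to provide today.
interface CaptionResult {
  caption: string;
  containsText: boolean;
}

// Appends a short hint when the image is reported to contain text, so the
// user knows to try their screen reader's OCR command.
function describeWithOcrHint(result: CaptionResult): string {
  const caption = result.caption.trim();
  if (!result.containsText) {
    return caption;
  }
  if (caption.length === 0) {
    return "Contains text; try OCR.";
  }
  // Close the caption with a period unless it already ends a sentence.
  const base = /[.!?]$/.test(caption) ? caption : caption + ".";
  return base + " Also contains text; try OCR.";
}
```

The hint goes after the caption so that anyone who stops listening early still hears the description first.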