Is it open source or are there plans to make it so? I'd be very interested in how you do frame-accurate video annotations in the browser, as I found the currentTime attribute of HTML video elements to be too unreliable for this. (Context: I maintain a video annotation tool that is mostly used for marine imaging.)
For Anno we are using HTML video elements / YouTube embeds, which have the same frame inaccuracy from currentTime. For our editor we have a WebGL-based solution that seems to get better frame granularity, but it still has some floating-point rounding errors when converting frames <-> seconds.
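A minimal sketch of one way to limit the frames <-> seconds rounding problem, assuming a constant frame rate that is known as an exact fraction. This is not the approach used in the editor mentioned above, and the names FrameRate, frameToSeconds and secondsToFrame are made up for illustration. The idea is to keep the rate as numerator/denominator and to aim seeks at the middle of a frame, so a tiny rounding error cannot push the result across a frame boundary:

```typescript
// Hypothetical helpers; names and the half-frame offset are illustrative assumptions.

/** Frame rate as an exact fraction, e.g. 30000/1001 for NTSC "29.97" fps. */
interface FrameRate {
  num: number; // frames ...
  den: number; // ... per this many seconds
}

/** Time in seconds at the middle of `frame`, intended as a seek target. */
function frameToSeconds(frame: number, fps: FrameRate): number {
  return ((frame + 0.5) * fps.den) / fps.num;
}

/** Index of the frame that covers `seconds`, assuming frame 0 starts at t = 0. */
function secondsToFrame(seconds: number, fps: FrameRate): number {
  // Small epsilon (in frames) absorbs rounding error when `seconds`
  // sits exactly on a frame boundary.
  const EPS = 1e-6;
  return Math.floor((seconds * fps.num) / fps.den + EPS);
}
```

Usage would be something like `video.currentTime = frameToSeconds(n, { num: 30000, den: 1001 })`. Whether the browser then actually displays frame n is a separate question, and videos with variable frame rate or a non-zero start time would need real per-frame timestamps instead.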
There's also requestVideoFrameCallback() and seekToNextFrame() in the Web APIs, but they are still experimental and not supported by all browsers. The WebCodecs API is also experimental and not supported in all browsers, but it would allow you to decode and grab the individual frames from a video and draw them to a canvas.
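For reference, a rough sketch of how requestVideoFrameCallback() could be used to track the displayed frame. It relies on the `mediaTime` field of the callback metadata (the presentation timestamp of the frame that was shown) and assumes a known, constant frame rate with frame 0 at t = 0. The FrameMetadata and FrameCallbackSource types and the watchFrames name are made up here so the snippet does not depend on DOM typings; a real video element in a supporting browser should fit that shape.

```typescript
// Minimal structural types (illustrative, not the official DOM typings).
interface FrameMetadata {
  mediaTime: number; // presentation timestamp of the displayed frame, in seconds
}
interface FrameCallbackSource {
  requestVideoFrameCallback(
    cb: (now: number, metadata: FrameMetadata) => void
  ): number;
}

/**
 * Calls `onFrame` with an estimated frame index whenever a new frame is
 * presented. `fps` is assumed to be known and constant.
 * Returns a function that stops the loop.
 */
function watchFrames(
  video: FrameCallbackSource,
  fps: number,
  onFrame: (frame: number) => void
): () => void {
  let active = true;
  const tick = (_now: number, metadata: FrameMetadata): void => {
    if (!active) return;
    // mediaTime marks the start of the frame, so rounding gives its index.
    onFrame(Math.round(metadata.mediaTime * fps));
    // Callbacks are one-shot, so register again for the next frame.
    video.requestVideoFrameCallback(tick);
  };
  video.requestVideoFrameCallback(tick);
  return () => {
    active = false;
  };
}
```

This only reports which frame was presented; it does not make seeking itself any more precise.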
In Chrome and Edge you can use WebCodecs to decode and display video frames one by one (this is what I used for the video editor at https://vidmix.app).
In other browsers you could compile FFmpeg to WebAssembly and use that for frame-by-frame decoding, but it's not going to be nearly as performant.
Decoding the video manually seems like such overkill when all that's needed is essentially a currentFrame property. But it seems I have to explore this avenue as well. I only aim to support Chrome and Firefox, but as much as I hate to admit it, Firefox is somewhat lacking on the video side anyway (there is a bug that makes it display different video frames for the same currentTime than other browsers or FFmpeg [1]). Therefore I already discourage using Firefox for video annotation.
Yes, the shortcomings of the HTML video tag are quite unfortunate. Just some very small improvements would make it a lot more useful. One other issue I found at the time I was looking at it (not sure if it's still the case) was that stepping one frame forward wasn't any faster than seeking from a random location. It seemed like the video was decoded from the last keyframe every time, making it inefficient to iterate through the frames.
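To put rough numbers on that, here is a simplified cost model (not a description of what any browser actually does). It assumes frames can only be decoded in order starting from a keyframe, ignores B-frame reordering, and uses made-up helper names. It compares re-decoding from the preceding keyframe on every step with a decoder that keeps its state between steps:

```typescript
/** Last keyframe at or before `target`. `keyframes` is sorted ascending; frame 0 is assumed to be a keyframe. */
function precedingKeyframe(keyframes: readonly number[], target: number): number {
  let best = 0;
  for (const k of keyframes) {
    if (k <= target) best = k;
    else break;
  }
  return best;
}

/** Frames decoded when every seek restarts at the preceding keyframe. */
function framesDecodedFromKeyframe(keyframes: readonly number[], target: number): number {
  return target - precedingKeyframe(keyframes, target) + 1;
}

/** Frames decoded when the decoder still holds the state for frame `previous`. */
function framesDecodedSequential(
  keyframes: readonly number[],
  previous: number,
  target: number
): number {
  const key = precedingKeyframe(keyframes, target);
  // Continuing is only possible when moving forward without crossing a keyframe.
  if (target > previous && previous >= key) return target - previous;
  return target - key + 1;
}
```

With a keyframe every 250 frames, stepping from frame 248 to 249 costs 250 decoded frames in the first model and 1 in the second, which would match the kind of slowdown described above.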