Poor execution, but great idea. Give users random frames. Obtain multiple drawings for each frame. Average them and reject drawings that are too different from the average. Voila!
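A minimal sketch of the average-and-reject step, assuming every drawing of a frame arrives as an equally sized grayscale grid (a list of rows with values from 0.0 to 1.0). The function names and the `max_diff` threshold are made up for illustration, and mean absolute pixel difference is just one possible notion of "too different".

```python
def average_frame(drawings):
    """Per-pixel mean of equally sized grayscale drawings (lists of rows)."""
    n = len(drawings)
    height, width = len(drawings[0]), len(drawings[0][0])
    return [[sum(d[y][x] for d in drawings) / n for x in range(width)]
            for y in range(height)]


def mean_abs_diff(a, b):
    """Mean absolute per-pixel difference between two drawings."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def reject_outliers(drawings, max_diff=0.3):
    """Drop drawings far from the average, then re-average the survivors.

    max_diff is a hypothetical threshold; a real value would need tuning.
    """
    avg = average_frame(drawings)
    kept = [d for d in drawings if mean_abs_diff(d, avg) <= max_diff]
    return kept, (average_frame(kept) if kept else avg)
```

Note that the first average still includes the outliers, so a single pass like this only works when bad drawings are a minority.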
Give users a choice - do they want to draw a frame (will take a couple of minutes, many people don't have time/patience) - or do they want to vote on a few frames?
For each frame, get at least 4 different drawings.
For voters, show them 2-4 frames and have them pick the best one.
If there are more than 4 versions of a frame, do a bracket-based competition to determine the best frame.
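The bracket could be sketched roughly like this, assuming pairwise matchups for simplicity (the 2-4 frame ballots would need a slightly different loop). `vote` is a hypothetical callback that shows two drawings to voters and returns the winner.

```python
def run_bracket(candidates, vote):
    """Single-elimination bracket over candidate drawings.

    vote(a, b) is a hypothetical callback returning whichever of the two
    candidates the voters preferred.
    """
    current = list(candidates)
    if not current:
        raise ValueError("need at least one candidate")
    while len(current) > 1:
        next_round = []
        # Pair up neighbours; each matchup sends one winner forward.
        for i in range(0, len(current) - 1, 2):
            next_round.append(vote(current[i], current[i + 1]))
        # With an odd count, the last candidate gets a bye.
        if len(current) % 2 == 1:
            next_round.append(current[-1])
        current = next_round
    return current[0]
```

A bracket needs only one fewer matchup than there are candidates, which is the appeal when voter attention is scarce, but a single unlucky matchup can knock out a good drawing early.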
This could be an ongoing process. Generate a version of the movie using the best available frames, but continue allowing people to redraw frames. The movie would continue getting better and better over time.
Then, when you watch the movie, maybe you could have a slider to control which "generation" of the movie you're watching. Slide all the way to the left and it will be really primitive, slide all the way to the right and it will be the best version.
I can't even begin to guess how you would numerically average or compare hairline drawings.
If they were fully shaded frames it would work, but for hairlines I have no idea.
Also, I observed that some frames left a dark background white, while others filled it in as black. So you'd have to set up a lot of drawing "rules" to ensure it was even conceptually meaningful to generate an average in the first place.
Flag images that have less than a certain percentage of the canvas covered. They're probably garbage.
Pass the reference frame and each candidate drawing through a CNN. Measure the cosine similarity between the candidate embedding and the reference embedding. Flag drawings that fall below a threshold.
Pass candidates through CLIP. Flag images with obvious garbage descriptions.
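The first two checks above could look something like this. It's only a sketch: it assumes dark ink on a light background with pixel values from 0.0 (black) to 1.0 (white), and it assumes the embeddings have already been produced by some model and are plain lists of floats. All names and thresholds here are hypothetical.

```python
import math


def ink_coverage(image, ink_threshold=0.5):
    """Fraction of pixels darker than ink_threshold (assumes dark ink on white)."""
    pixels = [p for row in image for p in row]
    return sum(1 for p in pixels if p < ink_threshold) / len(pixels)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_drawing(image, candidate_emb, reference_emb,
                 min_coverage=0.01, min_similarity=0.5):
    """Return the reasons a drawing looks suspicious (empty list means it passes).

    Both thresholds are placeholders that would need tuning on real data.
    """
    reasons = []
    if ink_coverage(image) < min_coverage:
        reasons.append("too little ink")
    if cosine_similarity(candidate_emb, reference_emb) < min_similarity:
        reasons.append("embedding far from reference")
    return reasons
```

Returning reasons instead of a bare boolean makes it easier to send flagged drawings to a human for a second look rather than rejecting them outright.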
The problem isn't just approximate positioning of lines, but radically different artistic styles and amounts of shading and detail. Is someone's hair made of one line, five lines, or 100 lines? Does a face have 3 simple lines or 50 little lines that give a sense of skin texture?
Distance fields would work great for font outlines, but not for translating movie frames.
Oh, I don't know. How well does it have to work? If anything I'm suddenly very curious what kind of results you'd get from converting line art to distance fields, averaging/interpolating, and converting back. Maybe I'll write some code...
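For anyone who wants to try it, here is a rough sketch of that pipeline. It assumes each drawing is a small boolean grid (True = ink) and uses a brute-force unsigned distance field, which is only practical for tiny images. The function names and the `max_distance` cutoff are made up for this example.

```python
import math


def distance_field(ink):
    """Unsigned distance from every pixel to the nearest ink pixel (brute force)."""
    points = [(x, y) for y, row in enumerate(ink) for x, v in enumerate(row) if v]
    if not points:
        raise ValueError("drawing has no ink")
    return [[min(math.hypot(x - px, y - py) for px, py in points)
             for x in range(len(row))]
            for y, row in enumerate(ink)]


def average_fields(fields):
    """Per-pixel mean of equally sized distance fields."""
    n = len(fields)
    height, width = len(fields[0]), len(fields[0][0])
    return [[sum(f[y][x] for f in fields) / n for x in range(width)]
            for y in range(height)]


def field_to_ink(field, max_distance=1.0):
    """Convert back to line art: ink wherever the averaged distance is small."""
    return [[value <= max_distance for value in row] for row in field]


def vertical_line(x_pos, width=9, height=3):
    """Test drawing: a single full-height vertical stroke at column x_pos."""
    return [[x == x_pos for x in range(width)] for _ in range(height)]


fields = [distance_field(vertical_line(3)), distance_field(vertical_line(5))]
blended = field_to_ink(average_fields(fields))
```

With two parallel strokes two pixels apart, the averaged unsigned field is flat between them, so this plain threshold inks columns 3 through 5 as a band rather than producing one stroke in the middle. How to convert back to a clean line is the open part.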