So we are 90% sure it is a tiger but only 68% sure it is a land animal? I don't think that makes sense.
It could be that this is a weakness of seeding AI data with human inputs. I can believe that 90% of people who saw the video would agree that it is a tiger, while fewer would agree it is a terrestrial animal, because they don't know what terrestrial means.
It's probably more likely that they want each output to be independent of the others. Certain features may be strongly associated with a tiger, but not necessarily indicative of a terrestrial animal. If the 9.89% chance that "tiger" was wrong had turned out to be the case, that shouldn't influence whether or not it was a terrestrial animal. In my opinion, the consumer of the output values should be able to rely on these fields independently and make these associations themselves. Although I totally agree a second pass could be useful as a separate data set.
You're right (and nitpicking nitpicks seems appropriate to me :P).
But, I'm pretty sure the assumptions of logistic regression are even stronger than just that. The inputs are assumed to be independent given the output class, and the log odds of the output vary as a linear function of each input. The first one is essentially the naive Bayes assumption, and the second one is completely unreasonable for almost any problem ever (roughly equivalent to assuming every dataset has a multivariate normal distribution). If they are both correct, though, you get a perfectly good Bayesian posterior probability of each output class.
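To make the second assumption concrete, here's a minimal sketch of that form (the weights and features are made up; the point is just that the log odds are linear in the inputs):

    import numpy as np

    # Logistic regression in its basic form: the log odds of the positive class
    # are assumed to be a linear function of the inputs, so the posterior
    # probability is a sigmoid of that linear score.
    def predict_proba(x, w, b):
        log_odds = np.dot(w, x) + b              # linear in each input
        return 1.0 / (1.0 + np.exp(-log_odds))   # sigmoid -> P(class | x)

    # Hypothetical weights and features. If the linearity assumption holds,
    # this number is a calibrated posterior; if not, it's just a score.
    print(predict_proba(np.array([0.3, 1.2]), np.array([2.0, -0.5]), 0.1))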
I think the lesson is that gradient descent will build a decent function approximation out of pretty much anything powerful enough, which is why neural networks still work even when probability theory has been thrown completely out the window.
It's really tough to take the outputs seriously if they are not even on the same scale. Tiger, cat, and Bengal tiger all imply terrestrial animal, so the scores should be consistent with that: terrestrial animal would need to be at least max(tiger, cat, Bengal tiger).
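A tiny sketch of that constraint (the tiger and terrestrial-animal scores are from the example; the other scores and the parent/child hierarchy are made up and would have to come from an ontology):

    # A parent label's score should be at least the max of its children's scores.
    scores = {"tiger": 0.90, "cat": 0.85, "bengal tiger": 0.82,
              "terrestrial animal": 0.68}
    children = {"terrestrial animal": ["tiger", "cat", "bengal tiger"]}

    def enforce_hierarchy(scores, children):
        fixed = dict(scores)
        for parent, kids in children.items():
            fixed[parent] = max(fixed[parent], max(fixed[k] for k in kids))
        return fixed

    print(enforce_hierarchy(scores, children))  # terrestrial animal -> 0.90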
Maybe. Or maybe this is a human-centric view. Imagine that the classifier worked on sound. A low growl could be a cat, a tiger or a submarine engine. Then the probabilities might be flipped - if it's a land animal, it might be 40/60 that it's a tiger or a cat.
A visual classifier that identifies "4 moving things" might indicate some kind of land animal, or a slow-motion video of a dragonfly in flight.
Sample/"evidence"-based reasoning will always have these kind of odd inconsistencies - I'm not sure if mapping such output to a logic model is an improvement. It might be - to take output from a classifier like this, and plug it into an expert system like a Prolog/datalog database or something. Or it might just end up being just as limited as those systems already are.
But when one says "tiger implies terrestrial mammal (or animal)", one is really talking about ontologies -- perhaps training the classifier to come up with things like "90% sure four legs, 60% sure fur" and plugging that into a logic-based system would yield good hybrid systems?
I do think one would then lose the "magic" effectiveness of these pure(ish) learning systems though? Perhaps someone more familiar with the domains might shed some light?
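A back-of-the-envelope sketch of that hybrid, using the "90% sure four legs, 60% sure fur" outputs from above (the attribute names, rules, and min-combination are all invented for illustration):

    # The network outputs probabilities for low-level attributes; a hand-written
    # ontology/rule layer combines them into higher-level concepts.
    attributes = {"four_legs": 0.90, "fur": 0.60, "stripes": 0.80}

    rules = {
        "terrestrial_animal": ["four_legs"],
        "mammal": ["four_legs", "fur"],
        "tiger": ["four_legs", "fur", "stripes"],
    }

    def infer(attributes, rules):
        # Naive conjunction: a concept is only as likely as its weakest
        # required attribute.
        return {concept: min(attributes[a] for a in needed)
                for concept, needed in rules.items()}

    print(infer(attributes, rules))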
It seems like an interesting next step after producing these output labels might be to use something like ConceptNet [0] to evaluate the relationships between the labels and somehow incorporate this as feedback.
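Something like this, maybe (assuming ConceptNet's public relatedness endpoint behaves the way I remember; the label list is made up):

    import requests

    # Score how related each pair of predicted labels is, so that label sets
    # whose members don't hang together can be flagged for a second pass.
    def relatedness(a, b):
        resp = requests.get("http://api.conceptnet.io/relatedness",
                            params={"node1": f"/c/en/{a}", "node2": f"/c/en/{b}"})
        return resp.json().get("value", 0.0)

    labels = ["tiger", "terrestrial_animal", "submarine"]
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            print(a, b, relatedness(a, b))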
Indeed. I wonder if the networks could learn a more and more general description of the concepts in the hierarchy when going up that hierarchy.
E.g. there's a bunch of tiger species [0] each with a specific underlying model, but some traits are common to all tiger species. And some traits are common to, say, carnivores. Could you share parts of the model/network via that hierarchy, e.g. that a tiger inherits some parts of the model for carnivores etc.
For example, can the 'stripe model' or 'fang model' be shared among tigers and carnivores, etc.?
I tried something similar for my thesis, but this was before the advent of DNNs etc.
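A rough sketch of what the shared-trunk idea might look like with today's tooling (layer sizes, heads, and the choice of MobileNetV2 are all placeholders, not anyone's actual architecture):

    import tensorflow as tf

    # One shared feature extractor (the traits common to, say, carnivores),
    # with separate heads for the coarse label and the fine-grained label.
    inputs = tf.keras.Input(shape=(224, 224, 3))
    trunk = tf.keras.applications.MobileNetV2(include_top=False,
                                              pooling="avg", weights=None)(inputs)

    carnivore_head = tf.keras.layers.Dense(1, activation="sigmoid",
                                           name="carnivore")(trunk)
    tiger_features = tf.keras.layers.Dense(64, activation="relu")(trunk)
    tiger_head = tf.keras.layers.Dense(1, activation="sigmoid",
                                       name="tiger")(tiger_features)

    model = tf.keras.Model(inputs, [carnivore_head, tiger_head])
    model.compile(optimizer="adam", loss="binary_crossentropy")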
This is spot on. What makes the above probabilities look incorrect is that people are assuming that the algorithm understands the relationship between tiger and animal the same way that humans do. Clearly they are evaluating each independently.
I wonder if Snapchat is/will become a large user of this service? Depending on the average response time of this API, Snapchat could get much better ad targeting analyzing their Stories content.
I imagine that they have something similar in house that they run since it is pretty vital to their core business, but you never know.
I think the most commercially successful application of computer vision has been quality-control devices (citation needed). Agriculture is very interested in CV for a return-optimization technique known as precision farming. Manufacturers pay for inspection of production throughout the pipeline. To predict where mass-market CV could be successful, I think we should look for industries with similar problems that cannot currently afford a bespoke custom modeling solution.
CV will be the technology that kills what jobs we still have left: in agriculture (picking fruit, weeding), in logistics (picking items from shelves), in custodial work (cleaning bots), security, driving, reading MRIs and x-rays - practically all the jobs that depended on vision and could only be done by people in the last 60 years are going to be automated. When CV is fully deployed, the world will be totally different.
I'm pretty excited about precision agriculture, but for plant life on Earth it will mean that essentially nothing will grow outside the system. Bots are going to monitor all plant life. It might be plant utopia for some species, with timely water and nutrient dispensing, but for other species it might mean being automatically killed by agrobots.
I feel like the places where CV always proves most useful are places where either we need more eyes than would be financially viable, or we need eyes in places where it's hard/unsafe/expensive to have them.
The tipping point between training+pay for a human vs. an ML/CV system is coming down for a lot of tasks.
I saw a Microsoft spotlight last fall on a company that's using drones + CV to inspect power infrastructure in Scandinavia. As much as the tech to do that has cost, it's probably cheaper than sending out helicopters with specialists all the time to check power line towers for wear/damage.
Or we need eyes that can drive a physical response faster & more accurately than a human can react. This is the whole robotics/drone/hardware market.
Right now, most of these efforts work at the same speed humans work (probably because we still need to monitor them to make sure they're doing the right thing). But imagine self-driving cars with no traffic lights, because the cars can react & communicate fast enough to avoid a collision without any macro-scale signaling. Manufacturing processes where molten metal is shaped directly into whatever shape desired, where computers do real-time calculations of the effect of gravity & cooling rate on the position of the material. Airliners that never land and can take on passengers anywhere, because passengers ascend in a personal drone that mates with the mothership under computer control.
Unfortunately CVaaS isn't really suitable for this, because the network delays transmitting the data to a datacenter will outweigh any speed benefits of computerizing it. But I could see a big market for services that let you train a model in the cloud at high-speed, and then download the computed model for execution on a local GPU or TPU.
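For example, with TensorFlow the flow might look roughly like this (paths are placeholders; the point is just that the heavy training stays in the cloud and only the exported model ships to the device):

    import tensorflow as tf

    # Convert a SavedModel exported from cloud training into a compact file
    # that can be shipped to and executed on the local device.
    converter = tf.lite.TFLiteConverter.from_saved_model("./exported_model")
    tflite_model = converter.convert()

    with open("model_for_edge_device.tflite", "wb") as f:
        f.write(tflite_model)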
For human level vision and speech (and even search) long term we might not need computational clouds and data centers. We don't see them in nature. The models/indexes will likely just come prepackaged on device.
If you could share with me what mobile phone / web browser you used that produced the styling issue, I'll be sure to pass it on to the relevant people within Google so that it gets resolved.
Also if you send me an email at bookman@google.com I'll be sure to update you as to when the styling errors are resolved.
I'm curious about how much use these general-purpose computer vision APIs are actually getting. How many companies out there really want to sift through a lot of photos to find ones that contain "sailboat"? I'm inclined to think a lot more companies would want to find "one of these five different specific kinds of sailboats performing this action", which is definitely not among the tens of thousands of predefined labels that Google and Amazon offer with their general-purpose models.
High-quality custom model training as a service seems much more compelling.
It's good for advertising. For example, Facebook analyzes all of your photos and notices a high number of images with sailboats.
That means you might have an interest in sailing, travel offers, or outdoor equipment, and Facebook can test that theory, and see if related advertisements have higher conversion rates.
It's also beneficial for user engagement, and by analyzing your photos, Facebook could recommend related groups in your area, or upcoming sailing events.
On the other end, imagine you make cosplay outfits for a living. You want to promote your business. Wouldn't it be efficient if you could only show your advertisement to Instagram or Snapchat users that post a high percentage of cosplay photos? That would result in a much higher click-through rate, and they could charge a higher premium for those targeted advertisements.
Or, what if you run a wallpaper site, where users upload wallpapers? You could automatically categorize and tag those images for users to search. Or, if someone is viewing a wallpaper of a sunset, you might want to show related sunset wallpapers that could interest them. That's pretty powerful. You could upload one million photos, and with a little work have them all nicely arranged in categories.
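The auto-tagging part is pretty cheap to sketch with the Cloud Vision client (method and field names as I understand the Python library; the filename and threshold are made up):

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    def tag_wallpaper(path):
        with open(path, "rb") as f:
            image = vision.Image(content=f.read())
        response = client.label_detection(image=image)
        # Keep only reasonably confident labels as tags.
        return [label.description
                for label in response.label_annotations if label.score > 0.7]

    print(tag_wallpaper("sunset_wallpaper.jpg"))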
General purpose computer vision APIs are good if you're looking for breadth of concepts across many categories. For example, if you're Shutterstock and you're trying to make images searchable with very widely used, generally accepted concepts like "flower" and "car" then a general model would be good enough for you.
Custom computer vision models are good if you're looking for depth in certain categories. For example, if you're a gardening app and you want to take a pic of a flower and be able to recognize different species of flowers, then custom training is required.
There are some options with computer vision API companies where they will let you do custom model training. IBM will do custom training as a service for $$$$ but if you don't want to pay like crazy, Clarifai has a free (to a certain point) offering that lets you train a custom image recognition model on your own https://developer.clarifai.com/guide/train#train
It depends on the data set and the images in question.
Of course, taking as an example a sports clothing company whose digital assets are mostly related to their products, a general-purpose API might not get the subtle differences between two similar shoes, or clothing lines from different seasons. But it might be enough to help catalog that one image or video has a sports person in it, and the other is a fashion shoot or product shot.
I have been on the beta program for this and generally the results in our testing have been very good. I particularly like how granular the data can get.
What do you guys plan to do with it? Another poster mentioned how it seems hard to imagine a business that has a model around "finding the sailboats in this batch of pics."
The business prop is easy and already hinted at by another user here.
People post millions of hours of themselves in natural settings on Snapchat. If you can recognize their settings (objects, environments) and cluster/categorize users then you can target advertising even more intrusively than Google et al already do.
We build systems that let our customers manage their video, usually in large quantities. Some of this video spans a very long period of time and the only description they have of it is in the filename.
Getting proper structured metadata from content has traditionally been expensive as it has required humans, sometimes trained as librarians, so providing the ability to extract some meaning from video becomes valuable.
Even for systems that have trained librarians, it can still help to have a system that highlights the general content so they can further refine it and bring it in line with a taxonomy.
Google Vision API also helps because the metadata can be placed against timecode, which makes it easier to search for particular subjects within a video that might otherwise not be found easily by quickly browsing the videos.
There are other use cases depending on the customer's needs (such as a customer making sure that certain things are not in video about to go to air), but none of it has to do with marketing.
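For what it's worth, pulling timecoded labels out of the announced Video Intelligence API looks roughly like this with the Python client (request shape and field names as I understand a recent client version; the bucket path is a placeholder):

    from google.cloud import videointelligence

    client = videointelligence.VideoIntelligenceServiceClient()
    operation = client.annotate_video(request={
        "input_uri": "gs://my-bucket/nature_rushes.mp4",
        "features": [videointelligence.Feature.LABEL_DETECTION],
    })
    result = operation.result(timeout=600)

    # Shot-level labels come back with start/end offsets, so the metadata can
    # be stored against timecode.
    for label in result.annotation_results[0].shot_label_annotations:
        for segment in label.segments:
            start = segment.segment.start_time_offset.total_seconds()
            end = segment.segment.end_time_offset.total_seconds()
            print(label.entity.description, round(start, 1), round(end, 1))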
It was really entertaining listening to Fei-Fei Li talk about AI and ML at Google Cloud. If you get the chance, check it out on YouTube. I especially liked how she referred to video as once being the "dark matter" of vision AI.
The demo picture they chose is interesting. It's obviously a tiger, and is identified as such with only 90% probability. I appreciate the difficulty of the problem and how big of a success it is to achieve even that level of confidence, but that low level of confidence really shows how far we are from being able to simply trust computer vision. Still useful from an information retrieval perspective, I expect.
That's kinda true, but (regularization aside) for standard loss functions the loss is minimized when the model is well calibrated, right? Given the scores in the image (97% animal, 90% tiger, etc.), they seem to be binary classifiers, e.g. "is this a tiger?" So of all scores in the neighborhood of 90%, 90% should be "yes it is," making it a measure of confidence compatible with probability.
Please someone correct me if I'm wrong, but I'm pretty sure that's how it works, just like how logistic regression gives you a probability.
What you need to do is take the top prediction and see how accurate it is compared to a test set. The scores on the picture represent confidence, not accuracy.
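A quick sketch of that kind of check (toy data standing in for a real labeled test set):

    import numpy as np

    # Bucket predictions by score and see whether ~90% of the examples scored
    # near 0.9 are actually positive - i.e. whether confidence is calibrated.
    y_score = np.random.rand(10000)
    y_true = (np.random.rand(10000) < y_score).astype(int)  # calibrated toy data

    bins = np.linspace(0.0, 1.0, 11)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_score >= lo) & (y_score < hi)
        if mask.any():
            print(f"{lo:.1f}-{hi:.1f}: predicted ~{y_score[mask].mean():.2f}, "
                  f"actual {y_true[mask].mean():.2f}")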
I think there is a need for a comprehensive system for image and video data analytics, much like how we today have relational databases (Postgres, MySQL) and full-text search engines (Lucene/Solr). The approach Google and Amazon have been taking, which involves providing a "tagging" API, is frankly unimaginative.
I am working on Deep Video Analytics, an open-source visual search and analytics platform for images and videos. The goal of Deep Video Analytics is to become a quickly customizable platform for developing visual & video analytics applications, while benefiting from seamless integration with state-of-the-art models released by the vision research community. It's currently in very active development but still well tested and usable without having to write any code.
I would be interested to know more about this, particularly the database and what you plan to do with it in the future (I am assuming the restrictive license on the GitHub project is there for a purpose at the moment).
Sorry about the license; I am trying to reach a beta version within a month, along with a system-description paper that outlines the long-term vision behind building such a system. At that point I plan on relaxing the license. There are certain constraints, such as making sure that all underlying models are correctly licensed. Also, FAISS, which I use, is licensed by Facebook under an explicit non-commercial license.
Correct me if I'm wrong, but this is just a frame-by-frame labeling. You can download whatever pre-trained CNN, pass individual frames through it and get the same result.
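The do-it-yourself version is roughly this (the filename, sampling rate, and choice of ResNet50 are arbitrary):

    import cv2
    import numpy as np
    import tensorflow as tf

    # Pull frames with OpenCV, run each through an off-the-shelf ImageNet
    # model, and keep the top label per sampled frame.
    model = tf.keras.applications.ResNet50(weights="imagenet")

    cap = cv2.VideoCapture("tiger_clip.mp4")
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % 30 == 0:  # roughly one frame per second at 30 fps
            img = cv2.resize(frame, (224, 224))[:, :, ::-1]  # BGR -> RGB
            x = tf.keras.applications.resnet50.preprocess_input(
                np.expand_dims(img.astype("float32"), 0))
            top = tf.keras.applications.resnet50.decode_predictions(
                model.predict(x), top=1)[0][0]
            print(frame_idx, top[1], round(float(top[2]), 3))
        frame_idx += 1
    cap.release()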
I don't know if they are doing it right now in this API, but Google and plenty of others have demonstrated recognition of actions in videos. I expect they'll add it later if it's not there right now.
At the very least in this release they are identifying scene changes, which is a dynamic property.
True. But then you have to deploy and maintain that CNN yourself.
The value prop is similar to, say, Twilio. Though, arguably, it's easier to run your own pre-trained CNN than it is to replicate the telephony, VoIP, and video conferencing stuff Twilio provides.
Also, presumably Google is hoping that they can continue to train and improve their CNN so that it's always just a little better than the best free-to-download ones.
As a Cloud Prediction API user, it makes me a bit uneasy to see it left out of the image of their product suite. Is it effectively in maintenance mode now? I feel like TensorFlow is overkill for what I need and my use case doesn't fit into image/speech/video detection.
It can help more than superficially; it can become a starting point for humans to go and refine classification into a controlled taxonomy.
The Google implementation can also detect when different people are speaking, which is useful - though it takes someone to tag who is who.
As I mentioned in another post, it can also highlight where things are happening in a video - for instance, a 2hr video where a scene suddenly appears, like in nature video rushes where an animal appears for only a few seconds.
Other types of video analysis can also detect problems in the video/audio, such as dropped frames, noise or colour gamut issues.
We are talking about our underlying tech at Next conference - https://goo.gl/3ihXth
We are clearly using frame-level annotations, but we also have additional models that aggregate visual and additional information to provide entities at the shot level or video level. (PM at Google)
It would be completely naive to implement it that way, considering video adds an entirely new attribute over images, which of course is time.
I don't know shit about ML - talking out of my ass here - but I'd be surprised if the algorithms didn't account for changes over time or canonical entity recognition (is this the same boat that was in the last image)?
The linked press release shows an animal is detected -- tiger etc. It does not say tiger running or hunting, which is where the time component would have been used.
> nouns such as “dog,” “flower” or “human” or verbs such as “run,” “swim" or “fly”
That out of the way... I suspect you wouldn't need video to detect those things...
And the screenshot you're referring to is a specific application of the API... not a kitchen sink:
> It can even provide contextual understanding of when those entities appear; for example, searching for “Tiger” would find all precise shots containing tigers across a video collection in Google Cloud Storage.
Look at their example above.