Cloud Video Intelligence API (cloud.google.com)
259 points by hurrycane on March 8, 2017 | 87 comments



I think their model should take a second pass on the words and probabilities, independent of the video.

Look at their example:

  Animal: 97.76%
  Tiger: 90.11%
  Terrestrial animal: 68.17%
So we are 90% sure it is a tiger but only 68% sure it is a land animal? I don't think that makes sense.

It could be that this is a weakness of seeding AI data with human inputs. I can believe that 90% of people who saw the video would agree that it is a tiger, while fewer would agree it is a terrestrial animal, because they don't know what terrestrial means.


It's more likely that they want each output to be independent of the others. Certain features may be strongly associated with a tiger but not necessarily indicative of a terrestrial animal. Even if the 9.89% chance that the tiger label is wrong turned out to be the case, that shouldn't influence whether or not it's a terrestrial animal. In my opinion, the consumer of the output values should be able to rely on each field independently and make those associations themselves. That said, I totally agree a second pass could be useful as a separate data set.


Still, in any consistent way of assigning probabilities to events, if A implies B, then P(A) <= P(B).

Neural network outputs are not probabilities. I think that's the main lesson here.


Given that the last layer of a NN is a logistic regression, they are in fact well-calibrated probabilities under the assumption of disjoint classes.

The issue at hand is training them on overlapping classes :-)
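
For concreteness, here's a minimal sketch of the difference (toy numbers of my own, not anything from the API): a softmax head treats the classes as mutually exclusive, while independent sigmoid heads score each label on its own, which is what you want for overlapping labels like "tiger" and "terrestrial animal", but then nothing forces the outputs to be consistent with each other.

  import numpy as np

  # Hypothetical raw scores for [animal, tiger, terrestrial animal]
  logits = np.array([2.0, 3.5, 0.8])

  # Softmax: assumes exactly one class is true; outputs sum to 1.
  softmax = np.exp(logits) / np.exp(logits).sum()   # ~[0.17, 0.77, 0.05]

  # Independent sigmoids: one yes/no question per label; outputs don't sum
  # to 1, and nothing constrains "tiger" to be <= "terrestrial animal".
  sigmoids = 1.0 / (1.0 + np.exp(-logits))          # ~[0.88, 0.97, 0.69]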

I will shut up now, sorry for nitpicking.


You're right (and nitpicking nitpicks seems appropriate to me :P).

But, I'm pretty sure the assumptions of logistic regression are even stronger than just that. The inputs are assumed to be independent given the output class, and the log odds of the output vary as a linear function of each input. The first one is essentially the naive Bayes assumption, and the second one is completely unreasonable for almost any problem ever (roughly equivalent to assuming every dataset has a multivariate normal distribution). If they are both correct, though, you get a perfectly good Bayesian posterior probability of each output class.

I think the lesson is that gradient descent will build a decent function approximation out of pretty much anything powerful enough, which is why neural networks still work even when probability theory has been thrown completely out the window.


It's really tough to take the outputs seriously if they aren't even on the same scale. Tiger, cat, and Bengal tiger all imply terrestrial animal, so the scores should be consistent with that: terrestrial animal would need to be at least max(tiger, cat, Bengal tiger).
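
A crude post-processing step along those lines, assuming you have (or build) a child-to-parent label map; this is just my sketch, not something the API does:

  # Hypothetical ontology: child label -> parent label.
  PARENTS = {
      "tiger": "terrestrial animal",
      "cat": "terrestrial animal",
      "bengal tiger": "tiger",
      "terrestrial animal": "animal",
  }

  def enforce_hierarchy(scores):
      """Raise each ancestor's score to at least the score of its descendants."""
      fixed = dict(scores)
      for label in scores:
          parent = PARENTS.get(label)
          while parent is not None:
              fixed[parent] = max(fixed.get(parent, 0.0), fixed[label])
              parent = PARENTS.get(parent)
      return fixed

  scores = {"animal": 0.9776, "tiger": 0.9011, "terrestrial animal": 0.6817}
  print(enforce_hierarchy(scores))
  # {'animal': 0.9776, 'tiger': 0.9011, 'terrestrial animal': 0.9011}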


Maybe. Or maybe this is a human-centric view. Imagine that the classifier worked on sound. A low growl could be a cat, a tiger or a submarine engine. Then the probabilities might be flipped - if it's a land animal, it might be 40/60 that it's a tiger or a cat.

A visual classifier that identifies "four moving things" might indicate some kind of land animal, or a slow-motion video of a dragonfly in flight.

Sample/"evidence"-based reasoning will always have these kind of odd inconsistencies - I'm not sure if mapping such output to a logic model is an improvement. It might be - to take output from a classifier like this, and plug it into an expert system like a Prolog/datalog database or something. Or it might just end up being just as limited as those systems already are.

But when one says "tiger implies terrestrial mammal (or animal)", one is really talking about ontologies -- perhaps training the classifier to come up with things like "90% sure four legs, 60% sure fur" and plugging that into a logic based system would yield good hybrid systems?

I do think one would then lose the "magic" effectiveness of these pure(ish) learning systems, though? Perhaps someone more familiar with the domains might shed some light?


Perhaps it's confused by the many images and videos of tigers swimming in water?


Thanks a lot Life of Pi


It seems like an interesting next step after producing these output labels might be to use something like ConceptNet [0] to evaluate the relationships between the labels and somehow incorporate this as feedback.

[0] http://conceptnet.io
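
For anyone curious, ConceptNet has a public REST API; a rough sketch of pulling the IsA relations for a label (I'm going from memory on the response shape, so treat the field names as assumptions):

  import requests

  def isa_relations(label):
      """Ask ConceptNet what a label 'is a' kind of, e.g. tiger -> animal."""
      url = "http://api.conceptnet.io/query"
      params = {"start": "/c/en/" + label, "rel": "/r/IsA", "limit": 50}
      edges = requests.get(url, params=params).json().get("edges", [])
      return [edge["end"]["label"] for edge in edges]

  print(isa_relations("tiger"))  # e.g. ['a feline', 'an animal', ...]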


Indeed. I wonder if the networks could learn a more and more general description of the concepts when going up the hierarchy. E.g. there's a bunch of tiger species [0], each with a specific underlying model, but some traits are common to all tiger species, and some traits are common to, say, carnivores. Could you share parts of the model/network via that hierarchy, e.g. so that a tiger inherits some parts of the model for carnivores?

For example, could the 'stripe model' or 'fang model' be shared among tigers, carnivores, etc.?

I tried something similar for my thesis, but this was before the advent of DNNs etc.

[0] http://conceptnet.io/c/en/tiger?rel=/r/IsA&limit=1000
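
That kind of sharing is roughly what a shared trunk with multiple heads gives you. A toy Keras sketch, with made-up layer sizes, where a fine-grained species head and a coarser carnivore head both reuse the same learned features:

  from keras.layers import Input, Conv2D, GlobalAveragePooling2D, Dense
  from keras.models import Model

  frames = Input(shape=(224, 224, 3))

  # Shared trunk: features like stripes, fur and fangs get learned once here...
  trunk = Conv2D(32, 3, activation="relu")(frames)
  trunk = GlobalAveragePooling2D()(trunk)

  # ...and are reused by heads at different levels of the hierarchy.
  carnivore = Dense(1, activation="sigmoid", name="carnivore")(trunk)
  species = Dense(9, activation="softmax", name="tiger_species")(trunk)

  model = Model(inputs=frames, outputs=[carnivore, species])
  model.compile(optimizer="adam",
                loss={"carnivore": "binary_crossentropy",
                      "tiger_species": "categorical_crossentropy"})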


This is spot on. What makes the above probabilities look incorrect is that people are assuming that the algorithm understands the relationship between tiger and animal the same way that humans do. Clearly they are evaluating each independently.


Or even something like a tree of life diagram entered in. (I forget what those are called)


> So we are 90% sure it is a tiger but only 68% sure it is a land animal? I don't think that makes sense.

Could be a carpet with the image of a tiger, or a toy, or a person in a costume.


Tigers are 38.57% arboreal.


I wonder if Snapchat is/will become a large user of this service? Depending on the average response time of this API, Snapchat could get much better ad targeting analyzing their Stories content.

I imagine that they have something similar in house that they run since it is pretty vital to their core business, but you never know.


Snapchat is big enough to do it in house without paying Google and donating their data to Google's NNs in the process.

For volume consumers, Google should be paying Snapchat to learn from their data.


They have just committed to spending $1bn on GCP though, so it's not unrealistic that they'd leverage some of the rest of the tool suite.


It's actually $2bn [1], $400 million per year over 5 years.

1: http://www.recode.net/2017/2/2/14492026/snap-ipo-2-billion-c...


This API is not cheap


How long until someone runs an attack like https://arxiv.org/abs/1609.02943 and provides the model for free?


This field moves quickly.

Embedding Watermarks into Deep Neural Networks

https://arxiv.org/abs/1701.04082


Snapchat is required to spend buckets of money with Google...


They're more than likely on this already.

Any photo you save in the app is already categorized by content.


I think the most commercially successful application of computer vision has been quality-control devices (citation needed). Agriculture is very interested in CV for a return-optimization technique known as precision farming. Manufacturers pay for inspection of production throughout the pipeline. To predict where mass-market CV could be successful, I think we should look for industries that have similar problems but cannot currently afford a bespoke custom modeling solution.


CV will be the technology that kills what jobs we still have left: in agriculture (picking fruit, weeding), in logistics (picking items from shelves), in custodial work (cleaning bots), security, driving, reading MRIs and x-rays - practically all the jobs that depended on vision and could only be done by people in the last 60 years are going to be automated. When CV is fully deployed, the world will be totally different.

I'm pretty excited about precision agriculture, but for plant life on Earth it will mean that essentially nothing grows outside the system. Bots are going to monitor all plant life. It might be a plant utopia for some species, with timely water and nutrient dispensing, but for other species it might mean being automatically killed by agrobots.


> When CV is fully deployed, the world will be totally different.

It's really scary to think about the long-term consequences of this. You know what evolution does to features not required for survival anymore.


I feel like the places where CV always proves most useful are places where either we need more eyes than would be financially viable, or we need eyes in places where it's hard/unsafe/expensive to have them.

The cost tipping point between training and paying a human vs. an ML/CV system is coming down for a lot of tasks.

I saw a Microsoft spotlight last fall on a company that's using drones + CV to inspect power infrastructure in Scandinavia. As much as the tech to do that has cost, it's probably cheaper than sending out helicopters with specialists all the time to check power line towers for wear/damage.


Or we need eyes that can drive a physical response faster & more accurately than a human can react. This is the whole robotics/drone/hardware market.

Right now, most of these efforts work at the same speed humans work (probably because we still need to monitor them to make sure they're doing the right thing). But imagine self-driving cars with no traffic lights, because the cars can react & communicate fast enough to avoid a collision without any macro-scale signaling. Manufacturing processes where molten metal is shaped directly into whatever form is desired, where computers do real-time calculations of the effect of gravity & cooling rate on the position of the material. Airliners that never land and can take on passengers anywhere, because passengers ascend in a personal drone that mates with the mothership under computer control.

Unfortunately CVaaS isn't really suitable for this, because the network delays transmitting the data to a datacenter will outweigh any speed benefits of computerizing it. But I could see a big market for services that let you train a model in the cloud at high-speed, and then download the computed model for execution on a local GPU or TPU.
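
The "train in the cloud, run locally" half already looks pretty simple once you have an exported model; a sketch with placeholder paths and shapes:

  import numpy as np
  from keras.models import load_model

  # Hypothetical: a model trained in the cloud, then downloaded to the device.
  model = load_model("downloaded_model.h5")

  frame = np.random.rand(1, 224, 224, 3).astype("float32")  # stand-in for a camera frame
  scores = model.predict(frame)  # runs locally; no network round trip per frame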


For human level vision and speech (and even search) long term we might not need computational clouds and data centers. We don't see them in nature. The models/indexes will likely just come prepackaged on device.


This is only a matter of time. An ImageNet model can already run on my iPhone's GPU with BNNS. It's not real time yet, but it's getting there very soon.

Once you have real-time local inference on your mobile phone: boom!


It amazes me how smart these guys at Google are, and yet they can't design a mobile site if their lives depended on it:

http://imgur.com/bXGuNfL


If you could share with me what mobile phone / web browser you used that produced the styling issue, I'll be sure to pass it on to the relevant people within Google so that it gets resolved.

Also if you send me an email at bookman@google.com I'll be sure to update you as to when the styling errors are resolved.

(Disclaimer: I work for Google Cloud)


iPhone SE, iOS 10.2.1

Also, see my previous comment on a similar issue with Google Cloud Calculator:

https://news.ycombinator.com/item?id=13729466


A code change was just pushed out to fix this css styling bug, and is live in production.

Sorry for the inconvenience, and please don't hesitate to forward any other bugs you find to me at bookman@google.com


Thanks. Submitted your issue, along with your Google Cloud Calculator issue. Feel free to email me if you'd like to be cc'd on the status.


^ lol, too true


I'm curious about how much use these general-purpose computer vision APIs are actually getting. How many companies out there really want to sift through a lot of photos to find ones that contain "sailboat"? I'm inclined to think a lot more companies would want to find "one of these five different specific kinds of sailboats performing this action", which is definitely not among the tens of thousands of predefined labels that Google and Amazon offer with their general-purpose models.

High-quality custom model training as a service seems much more compelling.


It's good for advertising. For example, Facebook analyzes all of your photos, and realizes there are a high number of images with sailboats.

That means you might have an interest in sailing, travel offers, or outdoor equipment, and Facebook can test that theory, and see if related advertisements have higher conversion rates.

It's also beneficial for user engagement, and by analyzing your photos, Facebook could recommend related groups in your area, or upcoming sailing events.

On the other end, imagine you make cosplay outfits for a living. You want to promote your business. Wouldn't it be efficient if you could only show your advertisement to Instagram or Snapchat users that post a high percentage of cosplay photos? That would result in a much higher click-through rate, and they could charge a higher premium for those targeted advertisements.

Or, what if you run a wallpaper site, where users upload wallpapers? You could automatically categorize and tag those images for users to search. Or, if someone is viewing a wallpaper of a sunset, you might want to show related sunset wallpapers that could interest them. That's pretty powerful. You could upload one million photos, and with a little work have them all nicely arranged in categories.


General purpose computer vision APIs are good if you're looking for breadth of concepts across many categories. For example, if you're Shutterstock and you're trying to make images searchable with very widely used, generally accepted concepts like "flower" and "car" then a general model would be good enough for you.

Custom computer vision models are good if you're looking for depth in certain categories. For example, if you're a gardening app and you want to take a pic of a flower and be able to recognize different species of flowers, then custom training is required.

There are some options with computer vision API companies where they will let you do custom model training. IBM will do custom training as a service for $$$$ but if you don't want to pay like crazy, Clarifai has a free (to a certain point) offering that lets you train a custom image recognition model on your own https://developer.clarifai.com/guide/train#train


I believe it's public knowledge that the Vision API has already processed more than a billion photos since launch, which is likely a good sign of usage.


One immediate need is NSFW flagging, especially things that might indicate abuse.


The Cloud Vision API already offers this: https://cloud.google.com/vision/docs/detecting-safe-search

Disclaimer: I work on Google Cloud, but not Vision.
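
For reference, the rough shape of a SafeSearch request against the REST endpoint (this follows the public docs as I remember them, so double-check the field names):

  import base64
  import requests

  API_KEY = "YOUR_API_KEY"  # placeholder
  URL = "https://vision.googleapis.com/v1/images:annotate?key=" + API_KEY

  with open("frame.jpg", "rb") as f:
      content = base64.b64encode(f.read()).decode("utf-8")

  body = {"requests": [{
      "image": {"content": content},
      "features": [{"type": "SAFE_SEARCH_DETECTION"}],
  }]}

  resp = requests.post(URL, json=body).json()
  # Likelihood buckets (e.g. VERY_UNLIKELY .. VERY_LIKELY) per category.
  print(resp["responses"][0]["safeSearchAnnotation"])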


This is a feature that Microsoft's computer vision API offers (in contrast to AWS Rekognition and other services): https://www.microsoft.com/cognitive-services/en-us/computer-...


Content moderator is what you're looking for: https://www.microsoft.com/cognitive-services/en-us/content-m...


It depends on the data set and the images in question.

Take, for example, a sports clothing company whose digital assets are mostly related to its products: a general-purpose API might not get the subtle differences between two similar shoes, or between clothing lines from different seasons. But it might be enough to help catalog that one image or video has a sportsperson in it, and another is a fashion shoot or product shot.


Can't you do model training with the google vision API?


I have been on the beta program for this and generally the results in our testing have been very good. I particularly like how granular the data can get.


What do you guys plan to do with it? Another poster mentioned how it seems hard to imagine a business that has a model around "finding the sailboats in this batch of pics."


The business prop is easy and already hinted at by another user here.

People post millions of hours of themselves in natural settings on Snapchat. If you can recognize their settings (objects, environments) and cluster/categorize users then you can target advertising even more intrusively than Google et al already do.


We build systems that let our customers manage their video, usually in large quantities. Some of this video spans a very long period of time, and the only description they have of it is in the filename.

Getting proper structured metadata from content has traditionally been expensive because it has required humans, sometimes trained as librarians, so providing the ability to extract some meaning from video becomes valuable.

Even for systems that have trained librarians, it can still help to have a system that highlights the general content so they can refine it further and bring it in line with a taxonomy.

The Google Vision API also helps because the metadata can be placed against timecode, further helping the ability to search for particular subjects within a video that might otherwise not be found easily by quickly browsing the videos.

There are other use cases depending on the customer's needs (such as a customer making sure that certain things are not in video about to go to air), but none of it has to do with marketing.
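
For reference, the request side of that workflow is small. A sketch against the REST surface; the API is still in beta, so the field names here are assumptions based on the docs:

  import requests

  API_KEY = "YOUR_API_KEY"  # placeholder
  URL = ("https://videointelligence.googleapis.com/v1beta1/videos:annotate"
         "?key=" + API_KEY)

  # Kick off label detection on a video already sitting in Cloud Storage.
  body = {
      "inputUri": "gs://my-bucket/rushes/episode-01.mp4",  # hypothetical path
      "features": ["LABEL_DETECTION"],
  }
  operation = requests.post(URL, json=body).json()

  # The call is asynchronous: it returns a long-running operation whose
  # result, once polled, contains labels with time offsets, which is what
  # lets you jump to the exact shots where a subject appears.
  print(operation["name"])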


It was really entertaining listening to Fei-Fei Li talk about AI and ML at Google Cloud. If you get the chance, check it out on YouTube. I especially liked how she referred to video as once being the "dark matter" of vision AI.



Would you mind sharing a YouTube link?


The demo picture they chose is interesting. It's obviously a tiger, and is identified as such with only 90% probability. I appreciate the difficulty of the problem and how big of a success it is to achieve even that level of confidence, but that low level of confidence really shows how far we are from being able to simply trust computer vision. Still useful from an information retrieval perspective, I expect.


You realize that softmax scores aren't probabilities, right?

It's just a relative measure of confidence, scaled such that they all sum to 1.0.


That's kinda true, but (regularization aside) standard loss functions are minimized when the outputs are well calibrated, right? Given the scores in the image (97% animal, 90% tiger, etc.), these seem to be binary classifiers, e.g. "is this a tiger?" So of all scores in the neighborhood of 90%, about 90% should be "yes it is," making it a measure of confidence compatible with probability.

Please someone correct me if I'm wrong, but I'm pretty sure that's how it works, just like how logistic regression gives you a probability.
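
A quick way to test that on held-out data is a reliability check: bucket the predictions by reported confidence and compare to the empirical hit rate. A sketch with made-up numbers:

  import numpy as np

  def reliability(scores, labels, bins=10):
      """Per confidence bucket, compare mean predicted score to empirical accuracy."""
      scores, labels = np.asarray(scores), np.asarray(labels)
      edges = np.linspace(0.0, 1.0, bins + 1)
      for lo, hi in zip(edges[:-1], edges[1:]):
          mask = (scores >= lo) & (scores < hi)
          if mask.any():
              print(f"{lo:.1f}-{hi:.1f}: predicted {scores[mask].mean():.2f}, "
                    f"actual {labels[mask].mean():.2f} (n={mask.sum()})")

  # Hypothetical test set: model scores for "is this a tiger?" and ground truth.
  reliability(scores=[0.91, 0.88, 0.93, 0.45, 0.10],
              labels=[1, 1, 0, 1, 0])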


You can't add up any of the numbers in the picture to get 1.0 (or 100).


What you need to do is take the top prediction and see how accurate it is against a test set. The scores on the picture represent confidence, not accuracy.


I think there is a need for a comprehensive system for image and video data analytics, much like how we today have relational databases (Postgres, MySQL) and full-text search engines (Lucene/Solr). The approach Google and Amazon have been taking, which involves providing a "tagging" API, is frankly unimaginative.

I am working on Deep Video Analytics, an open source visual search and analytics platform for images and videos. The goal of Deep Video Analytics is to become a quickly customizable platform for developing visual and video analytics applications, while benefiting from seamless integration with state-of-the-art models released by the vision research community. It's currently in very active development but is well tested and usable without having to write any code.

https://github.com/AKSHAYUBHAT/DeepVideoAnalytics

https://deepvideoanalytics.com


I would be interested to know more about this, particularly the database and what you plan to do with it in the future (I am thinking the license on the GitHub project is obviously restrictive for a purpose at the moment).


Sorry about the license, I am trying to reach a beta version within a month along with a system-description paper that outlines the long term vision behind building such a system. At that point I plan on relaxing the license. There are certain constraints such as making sure that all underlying models are correctly licensed. Also FAISS which I use is licensed by Facebook under an explicit non-commercial license.


Correct me if I'm wrong, but this is just frame-by-frame labeling. You could download any pre-trained CNN, pass individual frames through it, and get the same result.


I don't know if they are doing it right now in this API, but Google and plenty of others have demonstrated recognition of actions in videos. I expect they'll add it later if it's not there right now.

At the very least in this release they are identifying scene changes, which is a dynamic property.


True. But then you have to deploy and maintain that CNN yourself.

The value prop is similar to, say, Twilio. Though, arguably, it's easier to run your own pre-trained CNN than it is to replicate the telephony, VoIP, and video conferencing stuff Twilio provides.

Also, presumably Google is hoping that they can continue to train and improve their CNN so that it's always just a little better than the best free-to-download ones.


Yes, but I mean that if you analyze video as individual frames, you can't declare "we do video analysis", because video is a completely different domain.

There are papers like [1] where CNN output is used as input to an RNN, which performs deeper context analysis. The results aren't exciting, though.

[1] http://cs231n.stanford.edu/reports2016/221_Report.pdf
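
The CNN-into-RNN setup looks roughly like this; a toy Keras sketch (real systems would use a pretrained CNN and many more frames):

  from keras.layers import Input, TimeDistributed, Conv2D, GlobalAveragePooling2D, LSTM, Dense
  from keras.models import Model

  # A short clip as a stack of frames: (frames, height, width, channels).
  clip = Input(shape=(16, 224, 224, 3))

  # Per-frame CNN features, applied identically to every frame...
  x = TimeDistributed(Conv2D(32, 3, activation="relu"))(clip)
  x = TimeDistributed(GlobalAveragePooling2D())(x)

  # ...then an RNN reads the sequence, so "running" vs. "standing" can depend
  # on how the features change over time, not just on any single frame.
  x = LSTM(64)(x)
  action = Dense(10, activation="softmax")(x)  # e.g. 10 hypothetical action classes

  model = Model(clip, action)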


It also mentions verbs like "swimming", so it may detect those across several frames? Not sure how much more accurate that is than frame-by-frame analysis.


I suspect it's not a detected action, but just a picture labeled "swimming", like one of these (random image from internet): http://www.ambientegallerie.com/blog/wp-content/uploads/2016...


There's an alternative out there from a company called Valossa that is more comprehensive than what Google is now offering: https://val.ai


I've seen the Valossa offering and it is indeed impressive. Insane amounts of visual data on videos.


As a Cloud Prediction API user, it makes me a bit uneasy to see it left out of the image of their product suite. Is it effectively in maintenance mode now? I feel like TensorFlow is overkill for what I need and my use case doesn't fit into image/speech/video detection.


Sounds similar to a company I worked with that took security camera footage from restaurants and identified employee theft and process inefficiencies.


I wonder if you could use this to upload recordings from your DVR and have it determine the likely timecode of commercial breaks...
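
One heuristic you could layer on top of the API's shot annotations: commercials tend to be cut much faster than the show itself, so a long run of short shots is a reasonable first guess at an ad break. A toy sketch with invented thresholds:

  def likely_ad_breaks(shots, max_shot_len=4.0, min_run=8):
      """Given (start, end) shot times in seconds, flag runs of rapid cuts,
      a rough heuristic for commercial breaks."""
      breaks, run_start, run_len = [], None, 0
      for start, end in shots:
          if end - start <= max_shot_len:
              run_start = start if run_len == 0 else run_start
              run_len += 1
          else:
              if run_len >= min_run:
                  breaks.append((run_start, start))
              run_len = 0
      if run_len >= min_run:
          breaks.append((run_start, shots[-1][1]))
      return breaks

  # Hypothetical shot boundaries pulled from the API's shot annotations:
  shots = [(0, 95), (95, 97), (97, 99.5), (99.5, 101), (101, 104), (104, 106),
           (106, 109), (109, 112), (112, 115), (115, 210)]
  print(likely_ad_breaks(shots, min_run=5))  # -> [(95, 115)]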


Not the first; https://clarifai.com has a similar service.


Is there any storage-related cost (i.e. retrieval or egress cost) when you call this on a file stored in Google Cloud Storage?


It's awesome, but I can't really see any application besides content filtering and superficial content classification.


It can help more than superficially; it can become a starting point for humans to refine the classification into a controlled taxonomy.

The Google implementation can also detect when different people are speaking, which is useful, though it takes someone to tag who is who.

As I mentioned in another post, it can also highlight where things are happening in a video, for instance a two-hour video where a scene suddenly appears, like nature rushes where an animal shows up for only a few seconds.

Other types of video analysis can also detect problems in the video/audio, such as dropped frames, noise or colour gamut issues.


When you use these Google APIs, can Google keep/use your data in any way?


Here's a link to the Google Cloud Terms of Service: https://cloud.google.com/terms/ (you may be interested in section 5.2: Use of Customer Data).

Disclaimer: I work for Google but am definitely not a lawyer and can't authoritatively speak for Google here.


what is the "video" bit here? This is just running image recognition on a bunch of frames.


We are talking about our underlying tech at the Next conference: https://goo.gl/3ihXth We are clearly using frame-level annotations, but we also have additional models that aggregate visual and other information to provide aggregate-level entities at the shot or video level. (PM at Google)


How do you know the implementation details?

It would be completely naive to implement it that way, considering there is an entirely new attribute that video has over images, which of course is time.

I don't know shit about ML (talking out of my ass here), but I'd be surprised if the algorithms didn't account for changes over time or canonical entity recognition (is this the same boat that was in the last image?).


The linked press release shows an animal is detected -- tiger etc. It does not say tiger running or hunting, which is where the time component would have been used.


the press release says:

> nouns such as “dog,” “flower” or “human” or verbs such as “run,” “swim" or “fly”

That out of the way... I suspect you wouldn't need video to detect those things...

And the screenshot you're referring to is a specific application of the API, not a kitchen sink:

> It can even provide contextual understanding of when those entities appear; for example, searching for “Tiger” would find all precise shots containing tigers across a video collection in Google Cloud Storage.


I have seen it detect that a car is drifting..


Cronenberg inception porn is coming


[flagged]


I've never seen spam like this on HN before...



