They make a big deal about IBM using images published under a Creative Commons license without asking for permission, when the whole point of CC is to give permission to anyone, without having to ask. The people who are now surprised that their images are being used for purposes they disagree with should probably have used a different license, but I guess the messaging around CC (which tends to emphasize reuse by artists) makes it difficult.
Using images with a restriction to non-commercial purposes is a bit more of a gray area, depending on how you separate commercial from non-commercial activity. Since they share the data set with researchers at other organizations (presumably including competitors), I'd consider it non-commercial enough, because they don't gain a competitive advantage, but the details might have to be fought out in court.
As someone who has done license clearance while employed by IBM I can assure you that their legal team is all about following the rules on this stuff.
That said, from the article you have this: "NBC News obtained IBM’s dataset from a source after the company declined to share it, saying it could be used only by academic or corporate research groups." which tells me that IBM has restricted distribution to non-commercial activities, and this: "To build its Diversity in Faces dataset, IBM says it drew upon a collection of 100 million images published with Creative Commons licenses that Flickr’s owner, Yahoo, released as a batch for researchers to download in 2014." So they started with a data set that someone else had re-licensed for this purpose already. (That would be Yahoo!)
Back in the day, Yahoo!'s terms of service for Flickr were that you gave Yahoo! its own license to your work and could specify the license that others got if they downloaded your work. So I can imagine that it is entirely possible/legal for Yahoo! to exercise their rights under that ToS to relicense the photos how they saw fit (and remember Verizon/Yahoo! was trying to make the asset as valuable as possible, and this would contribute positively to that effort).
I expect that somewhere someone has sold the old classmates.com archives of images and/or digitized a few thousand yearbooks for images. It is not too hard to find sources of hundreds of images where they include a head shot, are all equally lit, and have a small number of backgrounds to remove to leave just the facial features.
> As someone who has done license clearance while employed by IBM I can assure you that their legal team is all about following the rules on this stuff.
As a few HNers are no doubt aware, IBM allegedly obtained explicit permission to use Douglas Crockford's software for Evil rather than Good, because it couldn't guarantee that its customers aren't doing Evil.
But the CC license requires attribution (and, in some variants, prohibits commercial use). It's not clear to me that IBM followed these requirements--presumably a trained neural network counts as a derived work, so far as the CC license is concerned.
On the one hand, US Copyright is very explicitly attached to a piece of work, not the facts or ideas contained in it. Per 17 USC 102:
> In no case does copyright protection for an original work
> of authorship extend to any idea, procedure, process,
> system, method of operation, concept, principle, or
> discovery, regardless of the form in which it is described,
> explained, illustrated, or embodied in such work.
You are therefore free to create works of your own that analyze facts contained in others' copyrighted works, comment on their ideas, and so on. This is always true if you don't include any of their copyrighted work, and often true, via fair use, if you only include the small pieces needed for your commentary. Accordingly, it seems pretty clear that you could analyze a huge collection of copyrighted portraits and do whatever you want with the results (distribution of hair colors, eye positions relative to nose, how these vary within/across individuals, and so on).
The counterargument would appear to be that derivative works include the "abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted." Neural networks do seem to store their training data, at least in some form, and there's a fuzzy line between extracting some facts from data (which is fine) and omitting data to create an abridged version (which isn't).
I wouldn't want to bet much either way, but I do think it would be a little odd for copyright to limit how much you can "learn" from something, either manually or via machine.
>presumably a trained neural network counts as a derived work
I just spent about 15 minutes trying to confirm that and... I have no idea. I suppose it's not surprising that a software engineer would not be able to suss that out in 15 minutes. Every definition I find tends to focus on art, visual and auditory creations.
Disregarding a legal interpretation (you know, the thing that actually matters), I can see it either way. Certainly the model is based off of data derived from the characteristics of these images. On the other hand, if I saw e.g. a shade of blue in one of these images that I liked, would I need to provide attribution if I measured it and used it in my own work? I have no idea, I suppose I'm just thinking out loud here. I do understand that taking something to a logical extreme (the color example) is not the be-all and end-all of legal arguments.
Isn't the purpose of a ML model to describe actual facts, derived mechanically and not creatively, and thus might not be subject to US copyright in the first place?
It may not constitute a new copyright (like a database copyright on a new collection of stuff copyrighted by others), but that doesn't invalidate existing copyrights and the license requirements under which the copyrighted material was used.
But I'm not a lawyer, so no idea how the interactions are in that situation. The closest analogy I could think of are sampling in music: how much must the original works be atomized before they don't count anymore?
But you could argue that the weights of the model wouldn't be what they are without the copyrighted work. Since their model does use the work, a part of it was "derived" from that piece.
This isn't about the various senses the English word "derived" can have, it's about the specific meaning of the legal term "derivative work." That would start with including major elements of another work and those elements being subject to copyright (e.g. in the US, book titles are not). There's no point in talking about how one could argue that something wouldn't be what it is without something else, that's not the legal basis at all. In no legal sense are the Indiana Jones movies derivative works of the old 1930's serials just because they were sources of inspiration.
This is nuanced enough it would need to be tested in court. I suspect we won't really be sure of the answer for at least another 20 years. Computers have existed for 80 years and we're still debating if APIs are copyrightable.
Almost certainly a trained neural network would be classified as a "collection" under copyright. Interestingly, I suspect that this collection is not a derived work of the individual photos because the photos are not actually included in the trained neural net. Instead the network collects facts about the photos that allows it to recognise the faces.
Not a lawyer and I bet you could get a case to go to court arguing otherwise, but this is my guess on what the result would be.
I guess you can use the trained neural network to generate (hallucinate) images, and those images will depend on the original training images, and may contain features of those images.
I suppose that even if you consider the neural network as a black box, you can generate images that bear some resemblance to the training data in some indirect way. For example by walking along a gradient of the output with respect to changes in the input.
Even with the color example, there are lots of restrictions on the use of certain colors for certain uses. For example, a paint store may not be able to mix a certain shade of paint that is (somehow) legally “owned” by a specific brand.
I thought that had to do with either trademark (e.g. logo for a competing company) or the method by which a color is manufactured (assuming it is not simple/obvious.) I don't believe you can literally lock down a color with no exceptions.
First, you can't copyright a color, period. You can trademark a color, but that trademark only applies to a very specific use of that color. For instance, you can paint your house or non-delivery-service business in "UPS Brown" without fear, but you couldn't use it in conjunction with a delivery service.
The purpose of trademark is to eliminate customer confusion, where they may think they're doing business with one entity when in fact they are doing business with another. Non-confusing uses of trademarks are legal.
> a paint store may not be able to mix a certain shade of paint that is (somehow) legally “owned” by a specific brand.
A paint store can legally mix paints with those colors, unless you've told them that you're going to be using the color for a trademark-infringing purpose.
Stores may voluntarily decline to mix trademark colors to avoid even the possibility of a lawsuit, but it isn't legally required.
TFA mentions contacting the Flickr users who uploaded the images, so for all we know, they were attributed appropriately. (Not all CC licenses are CC-BY anyway.)
The problem of treating a neural network as derived work is exactly why IBM said they wouldn't use the dataset in their products. Instead, they'll likely train various different networks, note which ones performed best, write a paper about it and throw the trained networks away. So long as they do that, they're not infringing on anyone's copyright.
The CC licenses were created before using this kind of content for machine learning was a widely known possibility. They should probably be rewritten, or a new type of license created, so that you can disallow this kind of usage.
The CC non-profit was set up in 2001 according to Wikipedia, yet people were doing things like face detection in the 80s, which would have required using images of faces.
2001 was before mobile phones with cameras. It was years before big social networks. Years before the selfie became common. There simply weren’t that many images of faces on the web, comparatively speaking. As Creative Commons arose, the idea of harvesting an enormous amount of ordinary people’s faces was not something people were thinking about.
yeah I'm pretty sure of that timing hence my usage of the phrase 'widely known'.
Sure, there were face detection projects before 2001, with much lower quality results, and maybe not even familiar to the creators of CC.
Controlling a fetus' DNA is just getting ramped up, but of course there has been a good deal of academic knowledge of the possibilities before hand. Do you think there may be some laws or contractual formats worked out in the last few years that might apply in the area, yet have not adequately taken the implications into consideration?
The guidance given by the lawyers at my last company was "everything we do is commercial activity, even research", thus prohibiting the use of anything tagged "not for commercial use". I expect that it's the same way at IBM.
Interestingly I experienced something similar in academia. I combined several sources of open data to gather all of an author's publications and link to their respective Open Access versions, and have had objections from people who generally identify as Open Data advocates and had consciously freely licensed that data.
As these are photos from Flickr, they can include random people who are not aware of how their photos are used and where they are published. So it doesn't matter what license the photo has. I think it should be illegal to publish other people's faces on the Internet or TV, or recognise them without getting their consent. Also, search engines like Google should not be allowed to index such photos.
so all video of the president would be unpublishable unless cropped to show just the president? those 100s of people around the president didn't give their permission. or how about all the people in that dash cam video of an accident.
Software engineering is engineering. Engineering can be accurately described as a mix of art and science. That's why in the olden days, "engineering" wasn't used outside of railroads -- the term used instead was "useful arts".
I'm not usually one to rant, but I'm getting a bit sick of media outlets behaving like they don't have a duty to understand the subjects they are writing about. While in the past it may have only been "techies" who understood these things, at least the general understanding of how privacy rules have developed exists in the populace. All of us read these articles and give the author the "layman's" excuse like a southerner excuses their grandparents' racist vocabulary. I'd like to make the situation very clear:
This article is clickbait, it's attempting to inflame readers by misinforming them and/or feeding common misconceptions. As journalists we should expect more of them. I do. We shouldn't allow ourselves the luxury of accepting "that's just the way journalism is now". It wasn't always this way, and it doesn't need to be.
I beg of readers: know when you are being played to raise agendas based on false premises. The author just wants to stir up the public so they have more to write about later. If by some chance the author believes anything they wrote, then perhaps NBC should consider moving them to the obits.
A friend and I built a web crawler that published images to a pool of workers that generated feature embeddings for all faces it found. After indexing all the feature vectors, we had effectively built a search engine for faces powered by images found on the public web.
The results were terrifying and they really affected me in a very negative way. Needless to say, we moved on to other projects. The world is not ready for the harm such technologies can cause.
Our index contained 100+ million faces and the compute costs were obscene.
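For readers wondering what the "search engine for faces" part amounts to once the embeddings exist, here is a minimal sketch, assuming each face has already been turned into a fixed-length feature vector by some model. The `FaceIndex` class, the three-dimensional vectors, and the example.com URLs are all made up for illustration; the commenter did not describe their implementation. An index of 100+ million faces would need an approximate nearest-neighbor structure rather than this brute-force scan.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class FaceIndex:
    """Hypothetical brute-force index over face embeddings."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, List[float]]] = []

    def add(self, source_url: str, embedding: Sequence[float]) -> None:
        # Store the page the face was found on next to its unit-length embedding.
        self._entries.append((source_url, normalize(embedding)))

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[str, float]]:
        # Score every stored face against the query, highest similarity first.
        q = normalize(query)
        scored = [
            (url, sum(a * b for a, b in zip(q, emb)))
            for url, emb in self._entries
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Toy 3-dimensional embeddings; real face models emit hundreds of dimensions.
index = FaceIndex()
index.add("https://example.com/a.jpg", [0.9, 0.1, 0.0])
index.add("https://example.com/b.jpg", [0.0, 1.0, 0.2])
index.add("https://example.com/c.jpg", [0.8, 0.2, 0.1])

matches = index.search([1.0, 0.1, 0.0], k=2)
```

The point of the sketch is how little is left to build once a crawler and an embedding model exist: a new photo becomes one more vector, and "who is this?" becomes a sort by similarity.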
Pretty sure Google, Facebook, and Apple already have this. They limit it to little convenience features like "Auto-tag your friends" or "Here are some friends you may know" because they don't want to scare people, but they have both the data and the models to do large-scale person search over the entire world. If they wanted to they could launch a search engine where you snap a picture with your cell phone and they tell you who it is, what their home & work addresses are, who their friends are, where they've been over the last week, which businesses they like to frequent, and probably roughly how much money they spend.
Right, I don't want it matching general public but I'm curious why a subset of this feature isn't commonplace: celebrities. As a user, I want this all the time to identify an actor, a politician, musician, etc.
It would take a random new photo and give you 3 or 4 likely matches.
Tineye only matches already known photos. Google/Bing image search will abstract a given photo and show you more of the same type: eg more white men wearing red shirts, rather than identifying the person and showing only photos of that person.
Even scarier. Imagine someone with evil intent feeding captured feature vectors for any individual into a 3D printer scripted to render high-def latex masks? Privacy could then be the least of our concerns.
The article takes a very long time to raise that the photographs in the collection had Creative Commons licenses.
Perhaps the specific use they're being put to isn't covered by the particular CC licence of each one, but until that is someone's claim I don't see this is quite the issue it's portrayed as.
If it's a full on frontal photo, in a number of countries, it is not for the photographer to decide what the license of that image is and any such attribution as CC might very well be invalid.
Yes, and when uploading you are required to confirm you have the rights to do so with that photo. If we aren't going to be able to trust these types of self-declarations we pretty much can't have websites where users upload photos anymore.
These include just strangers on the street - anyone who doesn't consent to public photos. There are thousands of lovingly censored faces in her photos.
If everyone had done this I guess the situation would be different...
Ironically, using a neural network trained from this wide-sample dataset harvested from the public Internet, one could easily automate that process. And the result would be significantly less reliable without a wide sample of lighting conditions, genders, ethnicities, etc.
This is common in Japan. If TV reporters are taking an interview in a street, for example, they hide the faces of random people passing by because they did not consent to being on TV. They really care about privacy unlike people in other countries.
At a recent event, all of my friends and relatives took lots of photos, then cautioned each other not to use Facebook to post or share any of the photos because of privacy and facial recognition concerns.
All of them did, however, use WhatsApp to share the photos with each other. <facepalm>
According to the current version of the iPhone app, yes, my group chats are E2E encrypted.
I believe the mechanism is to use a generic / per message key to encrypt each individual message, and then use each recipient's public key to encrypt the encryption key before sending that (along with a link to the cipher text) to each end user.
This is also why sending a video the first time takes forever (encrypt, upload, encrypt key, send key, send link), while forwarding it to another person (encrypt key, send key, send link) does not.
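The scheme described above (encrypt the media once with a per-message key, then encrypt that small key separately for each recipient) can be sketched as follows. This is only an illustration of the commenter's description, not a confirmed account of WhatsApp's protocol. The cipher is a toy XOR keystream that is NOT secure, and `wrap_key` uses a per-recipient secret as a stand-in for real public-key encryption; every function name here is hypothetical.

```python
import hashlib
import secrets
from typing import Dict, Tuple


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with a SHA-256 counter keystream.

    Illustration only, NOT secure. A real app would use an authenticated
    cipher such as AES-GCM.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + (offset // 32).to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


def wrap_key(recipient_secret: bytes, message_key: bytes) -> bytes:
    """Stand-in for encrypting the message key to a recipient's public key."""
    return keystream_xor(recipient_secret, message_key)


# XOR is its own inverse in this toy scheme, so unwrapping is the same call.
unwrap_key = wrap_key


def send_media(media: bytes, recipients: Dict[str, bytes]) -> Tuple[bytes, bytes, Dict[str, bytes]]:
    """First send: encrypt the large payload once, wrap the key per recipient."""
    message_key = secrets.token_bytes(32)
    ciphertext = keystream_xor(message_key, media)  # the expensive step
    wrapped = {name: wrap_key(secret, message_key) for name, secret in recipients.items()}
    return message_key, ciphertext, wrapped


def forward_media(message_key: bytes, recipient_secret: bytes) -> bytes:
    """Forward: the stored ciphertext is reused, only the 32-byte key is re-wrapped."""
    return wrap_key(recipient_secret, message_key)


recipients = {"alice": secrets.token_bytes(32), "bob": secrets.token_bytes(32)}
video = b"pretend this is a large video file " * 4
message_key, ciphertext, wrapped = send_media(video, recipients)

alice_key = unwrap_key(recipients["alice"], wrapped["alice"])
alice_view = keystream_xor(alice_key, ciphertext)

carol_secret = secrets.token_bytes(32)
carol_wrapped = forward_media(message_key, carol_secret)
carol_view = keystream_xor(unwrap_key(carol_secret, carol_wrapped), ciphertext)
```

Under this design the cost of forwarding is independent of the file size, which would match the observation that the first send is slow and forwards are fast.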
Yes. It's actually one of the strongest points of whatsapp compared to other mainstream messaging services (though Telegram is often touted as a more trustworthy alternative, it does not E2E encrypt by default and in my experience that means almost no-one does it). They've stuck with it even though it results in somewhat goofy UX (for example, no server-side logs, you can't log into other devices without your phone being online, etc).
Yes they are supposed to be, when a user joins a chat they cannot see prior messages and I believe this is a byproduct of their key being added to the ring.
They actually did a smart thing. Whatsapp is a private messenger, and a random person from outside cannot see the photo, while social networks are the opposite: they try to make everything you post available to the maximum number of people. People might think their photo will be visible only to their friends and might not be aware that it can now be scraped by anyone.
How much do you believe Whatsapp and Facebook are connected in the backend? Any info on this specifically, especially since there are rumors stirring up that Whatsapp, Facebook, and Instagram are going to come together very soon. I mean your friends and family did choose an encrypted service over an unencrypted one. Better something than nothing, no?
I firmly believe they're fully connected, and all information that is possible and reasonable to share is being shared. They paid $22 billion for WhatsApp. What did they pay for, exactly?
Does WhatsApp save the pictures for their own use?
Does their TOS/Privacy Policy outline what they (if anything) do with photos? I do not have whatsapp or facebook, so I don't know, but similar to OP, many family members do.
Your Messages. We do not retain your messages in the
ordinary course of providing our Services to you. Once your
messages (including your chats, photos, videos, voice
messages, files, and share location information) are
delivered, they are deleted from our servers. Your messages
are stored on your own device. If a message cannot be
delivered immediately (for example, if you are offline), we
may keep it on our servers for up to 30 days as we try to
deliver it. If a message is still undelivered after 30 days,
we delete it. To improve performance and deliver media
messages more efficiently, such as when many people are
sharing a popular photo or video, we may retain that content
on our servers for a longer period of time. We also offer
end-to-end encryption for our Services, which is on by
default, when you and the people with whom you message use a
version of our app released after April 2, 2016. End-to-end
encryption means that your messages are encrypted to protect
against us and third parties from reading them.
They claim pretty directly otherwise, right on their faq page.
> WhatsApp end-to-end encryption ensures only you and the person you're communicating with can read what's sent, and nobody in between, not even WhatsApp.
Is your claim that Facebook is plainly lying about this? That would be a pretty high-risk thing to do, even for them: their usual MO is to cover their abuses in a couple layers of legalese and deniability.
I don't think they do, or WhatsApp's encryption would be of minimal value, considering that the realistic threat model surrounding WhatsApp use involves corporate and government snooping...
This brings up a very interesting and important question around differential privacy.
If I scrape millions of photos from Facebook (including yours) then train a differentially private model that can extract features from a new face, is that a privacy violation?
A differentially private model is one in which you cannot identify the inclusion of any single datapoint, which means you cannot tell the difference between a model trained on the dataset and the same model trained on the same dataset with the addition of your one datapoint.
You might argue it’s a privacy violation because the scraping process might involve people looking at your images, but what if that was fully automated and nobody ever looked at your images? The model can be trained and then the data immediately deleted...
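For reference, the usual formal statement is slightly weaker than "cannot tell the difference": the two cases must be hard to distinguish, up to a privacy budget. In the standard definition, a randomized training procedure $M$ is $\varepsilon$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in a single record and every set $S$ of possible outputs (here, trained models), the following holds. The symbols $M$, $D$, $D'$, $S$ and $\varepsilon$ are notation introduced here, not taken from the comment.

```latex
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \, \Pr[\, M(D') \in S \,]
```

With a small $\varepsilon$ the factor $e^{\varepsilon}$ is close to 1, so adding or removing your one photo changes the probability of any outcome only slightly. With a large $\varepsilon$ the guarantee is correspondingly weak, so the answer to the question above depends on the budget actually used.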
The irony of all this being, when I use Facebook marketplace with a specially-created public profile that has no discernible photos I get accused of being a spammer. Maybe Facebook could implement some kind of trustworthy algorithm that says "we believe this account is not a scammer". And I don't have to deal with not being able to fully use online systems that assume a photograph==trust.
Even if you upload a photo, FB will block the account if the photo doesn’t show your entire face. I set up a Facebook account where, after a holiday in Morocco, my profile photo was me in a turban where my mouth was obscured by part of the turban. I got a message from FB saying that this was not sufficient and I would have to upload another photo where no part of my face was covered.
This sounds like bs to me. There is a significant number of users with profile photos that don't contain a face and this is the first time I'm hearing this...
Those users probably have performed various other actions that reassured FB they had not signed up to spam (joining groups, adding many friends, etc.). It was the combination of being a new signup with an emptyish profile and having no clearly identifiable face that led to my account being blocked, with a message that specifically said I would not be able to regain access to my account until I uploaded a photo where my face was fully visible.
Facebook don't want you to create separate accounts for different purposes (i.e. one just for marketplace). So they aren't going to optimise the product around your use case.
Other users: if I post something in the marketplace for sale, invariably I get a comment from someone in the community "watch out, might be a spammer because they don't have any photos"
We are truly in the "revolution" part of the Information Age now. Our social and legal structures are as ill-equipped to handle problems like this as the Georgian-era England was able to handle things like millions of its people becoming factory workers.
I don't know what the future is going to look like, but, man, we're going to be going through some shit to get there.
The moment someone else puts up a picture that you happen to be in (esp. family, friends) and then tags it with your name -- your image and name can be scooped up and cataloged.
Even that's not strictly required. If they can obtain data from your own Facebook app (or an app using the Facebook SDK), it can place you in that area around the time the photo was taken, and given that it's friends you have connections with on Facebook, it's easy enough to surmise it's you without the necessary confirmation.
Seems very Orwellian, to be sure, but not out of the grasps of the end-goals of data harvesting/profiling.
> If they can obtain data from your own Facebook app (or an app using the Facebook SDK)
Not for me, as I don't have a Facebook app installed, and I firewall off all outgoing traffic just to make sure no apps are phoning home without my explicit permission.
I know, but I've largely trained my friends and family to not do that sort of thing.
Regardless, I think that it's a fallacy to say that because it's impossible to defend yourself perfectly, then it's not worth defending yourself as well as you can.
There's also nothing you can do about appearing in friends' photos, or in the background of random strangers' photos in public places, or that article for the school newspaper that celebrated your basketball team victory in 11th grade, or (in many cases) headshots on your employer's team page. And that's only the legal and innocent ones. Somebody hacks into your phone or your company's org chart and suddenly a whole bunch of your photos are out there on the dark web.
Imagine being a kid born after Facebook became a thing, and finding out later in life that your parents robbed you of the chance to decide for yourself whether or not you wanted corporations to have that data.
In Russia, developers of face search engines like findface or searchface usually scrape images from the largest Russian social network, VK. They usually do it following that motto about asking for forgiveness, and there are no effective legal countermeasures anyway.
I think there should be no problem scraping other social networks like Facebook, Instagram or Twitter from countries where there are no legal restrictions and the photos are considered to be "public data". You can outsource face recognition tasks to such countries.
There are lots of other ML models built using training data derived from the public. Voice models, gait recognition models, WiFi mobility models, Bluetooth location models, activity recognition models, emotion models, etc. It's essentially impossible for people to tell if their data was used to train these models. And it's very difficult to know what companies and individuals might do with access to both live data streams and these models. Worth thinking about though.
This is stupid. People put their stuff on the Internet and someone used it. It's good, not bad. We all participated in training data at this point. Google crawls us all.
Someone once mentioned to me that scraping online dating profiles was an excellent source of thousands of faces nicely and cleanly labelled by gender, age, and ethnicity, if those are the class labels you are looking for.
I'd say there are lots of unethical use cases but also a few ethical use cases of such a trained model.
> The dataset does not link the photos of people’s faces to their names, which means any system trained to use the photos would not be able to identify named individuals.
But isn't that like the point of a facial recognition algorithm? Recognizing individuals by their faces? Presumably from a reference image that has a name?
Also it seems pretty trivial to reverse lookup the images if they were from a public source and some of those will have names, unless they are significantly downsampled.
Not necessarily. One thing that a wide-net dataset can be extremely useful for is improving the neural net models to better recognize people outside the traditional academic dataset a lot of older-generation neural nets were trained on, which is to say: the sorts of people who have the time and need-awareness to put their faces in some university's dataset for training neural nets (which is to say: white guys ;) ).
You can use faces without names attached to improve the engine's modeling for recognizing human faces in general (and more importantly: improve the system's ability to distinguish human and animal faces).
(Your first comment is pretty interesting by itself, incidentally: both NBC News and your comment make an assumption that is not universally true about the technology. Face recognition is a much wider space than "recognize an individual by their face." Clustering of similar faces, emotion analysis, camera targeting, human presence / absence can all be done without name labels).
> As the algorithms get more advanced — meaning they are better able to identify women and people of color, a task they have historically struggled with — legal experts and civil rights advocates are sounding the alarm on researchers’ use of photos of ordinary people.
This is written as if algorithms were sentient beings overcoming the next level of obstacles, rather than just being written by mostly white men who train them on photos of people who mostly look like themselves.