Some missing context is that the data is shared via the DeepSeek app's use of ByteDance analytics/configuration frameworks. So this is not a backroom deal where DeepSeek handed over the chat history for its user base, but rather ongoing analytics data being sent from the DeepSeek mobile app.
Besides the usual analytics data (device metadata, user behavior, app performance, errors, etc), it's possible raw chat data is being shared as well, but it's not a smoking gun.
We analyzed the iOS app[1] and observed similar traffic as well as a number of basic security issues (hardcoded encryption keys, use of 3DES and some traffic over HTTP).
Thanks for writing this article! I quite enjoyed it.
Question: does the DeepSeek app's use of hardcoded encryption keys rise beyond an attempt to obfuscate and protect the app's private API endpoints? I believe this is an attempt to make abusing their mobile app's private web APIs more difficult, since even with cert-pinning disabled and HTTPS MITM'd you still can't observe the real traffic and replicate their requests.
If all it's doing is obfuscation though, then I don't understand why pointing out that the keys are hardcoded is meaningful. It certainly doesn't engender trust. But if the app's binary is ultimately decoding some encrypted data, it needs the key, meaning the key is ultimately available to the reverse engineer. Whether it's hardcoded or not doesn't matter.
It's a bad look, but if the app used the latest tech and assigned each client its own symmetric encryption key for a session, wouldn't you still be able to access the same data? What would be meaningfully different from a security perspective if they had done this obfuscation better?
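A minimal sketch of the point being argued here, in Python. It uses a toy SHA-256 keystream as a stand-in cipher (not 3DES, and not anything taken from the DeepSeek app), and every name and key value in it is hypothetical. It only illustrates that the client has to hold the decryption key in both designs, so someone who controls the device can recover the plaintext either way.

```python
import hashlib
import os


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256(key + nonce + counter). Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_crypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; the same call encrypts and decrypts."""
    stream = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


# Design 1: one key compiled into the binary (hypothetical value),
# identical on every install.
HARDCODED_KEY = b"hypothetical-key-baked-into-app"


def new_session_key() -> bytes:
    """Design 2: a fresh random key per session, held in client memory."""
    return os.urandom(32)


def observe_as_reverse_engineer(client_key: bytes, nonce: bytes, wire: bytes) -> bytes:
    """Whoever controls the device can read the key the client is using,
    from the binary (design 1) or from process memory (design 2)."""
    return xor_crypt(client_key, nonce, wire)


payload = b'{"event": "app_open"}'
nonce = os.urandom(12)

for key in (HARDCODED_KEY, new_session_key()):
    wire = xor_crypt(key, nonce, payload)  # what a MITM proxy would see
    recovered = observe_as_reverse_engineer(key, nonce, wire)
    print(recovered == payload)
```

The one difference the sketch does surface: the hardcoded key is identical on every install, so extracting it once is enough to read traffic captured from any device, whereas a per-session key has to be pulled from the specific running session.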
Apple disallows HTTP by default, but you can flip a bit in the config to allowlist some or all endpoints for HTTP. It's not clear what App Store review actually does with this info when you submit.
Despite their goal of enforcing in 2017, it is still not a hard requirement. Back then, about 80% of the apps we tested disabled ATS either partially or fully [1]. It’s rare to see Apple walk something back [2], but here is a blog at the time that talked about it [3].
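For anyone who wants to check this on an app they have the bundle for, here is a rough Python sketch that reads an Info.plist and reports the App Transport Security exceptions it declares. The key names (`NSAppTransportSecurity`, `NSAllowsArbitraryLoads`, `NSExceptionDomains`, `NSExceptionAllowsInsecureHTTPLoads`) are Apple's documented ATS keys; the function name, the sample plist, and the `example.com` domain are made up for illustration, and the sketch ignores the other ATS keys.

```python
import plistlib


def ats_exceptions(info_plist_bytes: bytes) -> dict:
    """Summarize ATS opt-outs declared in an Info.plist (XML or binary)."""
    info = plistlib.loads(info_plist_bytes)
    ats = info.get("NSAppTransportSecurity", {})
    # Domains that are individually allowed to load over plain HTTP.
    insecure_domains = sorted(
        domain
        for domain, cfg in ats.get("NSExceptionDomains", {}).items()
        if cfg.get("NSExceptionAllowsInsecureHTTPLoads", False)
    )
    return {
        # True means ATS is disabled for all connections.
        "allows_arbitrary_loads": bool(ats.get("NSAllowsArbitraryLoads", False)),
        "insecure_http_domains": insecure_domains,
    }


# Hypothetical plist with one per-domain HTTP exception.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>NSAppTransportSecurity</key>
  <dict>
    <key>NSAllowsArbitraryLoads</key><false/>
    <key>NSExceptionDomains</key>
    <dict>
      <key>example.com</key>
      <dict>
        <key>NSExceptionAllowsInsecureHTTPLoads</key><true/>
      </dict>
    </dict>
  </dict>
</dict>
</plist>"""

print(ats_exceptions(SAMPLE))
```

A plist with no `NSAppTransportSecurity` entry at all reports no exceptions, which corresponds to the default of HTTP being disallowed.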
Would you say that US-based apps that use e.g. Google Analytics, and therefore share information with Google, "surface the interconnectedness between all of these firms" and are a good reason to e.g. ban apps from US-based developers?
Not the op, but yes, I would; this is why I approve of GDPR and the cookie popup rules and am actively angry at every company that thinks it's legit to share browsing habits with more "trusted partner" companies than there were students in my secondary school.
My comment starts with the reality that some people (e.g. U.S. Congress) find cause for concern WRT Chinese apps.
This is the reason, say, revelations about interconnectedness matter when it comes to Chinese apps versus U.S. apps.
You may disagree about whether there should be cause for concern, but that's another matter.
But, if you're asking me if I personally think there's cause for concern around allowing a foreign adversary access to your citizenry via social media platforms, then the answer is yes.
And, of course, China itself also believes it's a problem, which is why U.S. social media is banned there.
If you consider all Chinese apps suspect because they're Chinese then it doesn't make any difference whether they're connecting to another app or not. For anyone who doesn't already think Chinese apps are automatically their adversary, I don't see how a Chinese app using a Chinese metrics framework in exactly the same way that is completely routine for US apps using US metrics frameworks (and indeed any number of other countries) is supposed to move the needle on how suspicious this app is.
>If you consider all Chinese apps suspect, then it doesn't make any difference whether they're connecting to another app or not.
>For anyone who doesn't already think Chinese apps are automatically their adversary...
I think we're drifting from the original context. My point was that some people, including many in U.S. Congress, do take issue with at least some Chinese apps (let's just say TikTok here). This concern is at least partially WRT its data collection/handling and espionage. So any other apps that connect to it (or its parent company's products) and provide data would obviously also be viewed as problematic.
Incidentally, this isn't necessarily related to whether that other app is China-based.
You brought in the question of whether U.S. companies would face similar scrutiny for connecting to other U.S. companies, and I was merely explaining the difference.
No one cares about the details. (Heck, I'd be willing to wager good money that the politicians and most of their staffers don't even understand the details). In the end, it's just one more reason that Chinese models will not be legal in the US in the near future.
Protectionism can be dumb, but if competition from China is decimating the US LLM market, making the cheaper, better competitors illegal probably sounds like sound advice to someone like Trump?
Following typical tropes about China, "we" decided to ban space cooperation with them because they were just going to steal American space tech or whatever. That's why, to this day, you never see Chinese on the ISS. Of course China then became the 2nd largest player in space, behind only SpaceX, launched and manned their own space station, sent a rover to Mars, carried out unprecedented sample return missions from the far side of the Moon, and just generally ran circles around the US sans SpaceX.
If it wasn't for this dumb law, it's likely NASA would have been able to use Russia, China, and SpaceX as redundancies for getting Americans to the ISS as one country/company fell out of favor with this administration or that. As it was, we ended up turning to Boeing for a redundancy. For those that don't follow space news, the 2 astronauts Boeing [barely] sent to the ISS are still stranded up there after their vessel was deemed too dangerous to return in.
I oft wondered what it would have been like to live in Rome circa 460.
Yeah, they act holier than thou when someone else takes data but then turn around and do it themselves; I think that's called hypocrisy. Besides, once data goes to your ISP it's gone, so aren't we better off just limiting the data that we want to keep private?
Plot twist: all these people sharing on twitter yet another creative way of mentioning Xi and Tiananmen in a conversation without triggering the protection (count to 11 in roman numbers, leetspeak etc) were in fact collecting the training data for the nextgen LLM-based protection. Well played!
Yes, they probably all do that. Anthropic promised to pay the winner that broke all their protections. That way they get tens of thousands of free workers trying to get the money. Much cheaper than $300k engineers.
A US Tiananmen-comparable example would be ChatGPT censoring George Floyd's death or killing of Native Americans, etc. ChatGPT doesn't censor these topics
Huh? TPTB in the US do not try to censor those topics; if anything they encourage discussion of them (or at least did until this year). US "AI" systems censor much the same topics as US social networks, just as Chinese "AI" systems censor much the same topics as Chinese social networks.
Major Chinese tech companies often collaborate with government entities, potentially compromising user privacy. Given China's regulatory environment, where authorities can access data held by domestic firms, users worldwide should exercise caution when engaging with platforms from such backgrounds.
Here is my story. I needed to buy a central console for my car (purchased a while ago at a used car lot). Went to Amazon and made my selection. Next thing I see is the warning: this particular console will not fit your car, which is MAKE: XXXY, MODEL: YYYY, YEAR: ZZZZ. How's that for data sharing.
"Unintentionally exposed" and "deliberately gave" are two meaningfully different actions, both of which are examples of why much better regulation and legislation of individuals rights over their data are needed.
Shouldn't this be the other way around? TikTok has the most user data for any LLM to train with. I bet they will make a killing with it, unless of course the CCP decrees that they share it for free.
Secondly, most data in China is shared among most companies anyway, because, firstly, the government (not necessarily CCP) orders most companies to share data with "technological leaders" and "strategically important" companies, and secondly because computer security is mostly an alien concept to Chinese.
Copyright (broadly speaking, most restrictions on unrestricted dissemination of data) is what is killing the US economy.
you built an internal project, co-hosted with a database, with a password 'abc123'
a month later, your manager decided to share it with other teams, the decision was made in a meeting to which you weren't invited
when the manager came to you, you asked:
- how about giving me a week to make it a saas, with authn/authz
- no, we don't have the time, just tell them the endpoint and the password
another month later, something changed, your company built a partnership with another company, your manager decided to share the project with teams in the other company
you asked:
- how about we do something like virtual network peering so that we can share a connected network with our partner
- it's complex, we can not change the network status of our partner, and we don't have a responsible role for this work, just give them the endpoint and the password
password 'abc123' is just an analogy, in this case, there's no password at all
What useful textual user data do you see coming from TikTok? All the text seems very low quality, to the point where I naively assume that including it in training data would decrease performance.
As the sibling commenter mentioned, the video data itself is useful as we see a rise in multimodal models, but also..
(1) all videos are captioned, automatically then often again by the content creator manually. This data alone is extremely valuable for training purposes.
(2) the videos contain great information about slang terms, and youth vernacular. Which is unique data that is harder to find elsewhere.
(3) young people seem to use TikTok as a search engine, so presumably some of the videos' content must be explicitly valuable enough as an information source, similar to YouTube.
> These references suggest deep integration with ByteDance's analytics and performance monitoring infrastructure
I mean when I visit a random website or open a random app, I kind of expect that it will use something like Google Analytics or Firebase Crashlytics so that my "user data" is shared with Google.
If the article wants me to feel outraged about this practice, I don't. I understand that analytics and performance monitoring are often outsourced to a third party, often without a choice of turning off the analytics and performance monitoring features in the first place.
I use the DeepSeek app happily without giving it any data I consider private. I have a separate local DeepSeek distilled model for that.
Trump needs to enforce PAFACA and ban TikTok, but also ban DeepSeek, which has the same exact issues since it is also effectively operated by a foreign adversary and poses various security threats.
Your source doesn't even mention "propaganda". Moreover, while censorship is concerning, I don't see how that practically affects users. If I want to know how to center a div, who cares if it's cagey about what happened in 1989?
"Western" AI is also arguably full of "propaganda/censorship". Remember when chatgpt just came out, and conservatives were lambasting it for being "woke"?
"Neither Feroot nor the other researchers observed data transferred to China Mobile when testing logins in North America, but they could not rule out that data for some users was being transferred to the Chinese telecom."
> "Western" AI is also arguably full of "propaganda/censorship". Remember when chatgpt just came out, and conservatives were lambasting it for being "woke"?
Clearly a false equivalence. You think government propaganda compelled by a dictatorship with access to a military and nukes is the same thing?
Which government are you talking about? Like the US banned tiktok because Israel did not like it [1]?
Let us be honest here.. China censors directly. US censors indirectly through the private companies [2] and through covert use of force [3]. If you had a pro-Russian stance in 2022 or pro-Palestine stance, you will see your content censored in very subtle ways in US.
I guess we know how many bots are commenting on this article based on how many of them are talking about the US, when the article is about South Korea?
For a single sentence on an entire article about South Korea and a pundit comment by a company whose job it is to look at this sort of thing? If you pick the US as your thing to comment on in an article like this, maybe you're just a bot, or even hired hand, for making comments about the US without bothering to understand that you're obviously commenting out of place.
You and I can't buy it because they don't want their competitors getting it. But they'll happily use it to target ads at you, and the US government has access to it and can use it to decide who they want to send their CIA kidnap-torture squads after.
Which is in this case a pretty important distinction. Letting another company leverage user data within the bounding zone which you've defined is not the same thing as is being alleged here, which is actually sharing data.
It's quite literally the difference between exposing a public API and actually handing over the contents of the database.
> Letting another company leverage user data within the bounding zone which you've defined is not the same thing as is being alleged here, which is actually sharing data.
you mean how every single US tech company shares data with Google and Meta? How you browse a website, and, in an instant, ads show in Meta products? "user behaviour and device metadata [are] likely sent to ByteDance servers", lol, all your user behavior and device metadata are sent to Google and Meta servers. South Korea too afraid to say the same thing about USA. And surprised Pikachu face about all the downvotes in this thread on users pointing out the same thing about US tech companies, lol, propaganda and ethnonationalism is a powerful force
If you're shocked or even the slightest bit surprised, then I can't imagine how blissful your life is to be so unaware about how much corporations are sharing data with each other.
Like, I wholeheartedly expect that if I mention Beyblade toys on Facebook, then the next time I visit Amazon, they'll be suggesting Beyblades even if I've never even searched Amazon for toys, let alone Beyblade.
ByteDance's entire business model is based on user-targeting and showing users things they might enjoy watching, so they can push more ads to them. I wouldn't be surprised if they bought the data to train their own LLMs.
I recently had an experience that genuinely surprised me: I was watching a Peruvian video on YouTube, and I clicked on the creator's Instagram profile link in the description. Literally a few minutes later I received a promotional email with services and investment opportunities from an official Peruvian government email. Somehow opening an Instagram profile of a Peruvian creator got me tagged as a potential investor? But the most shocking part was how quickly this all happened.
There's basically no credible evidence of this happening. All there is are vague anecdotes which are easily explained with confirmation bias and/or the birthday paradox.
If the argument is that there's no credible evidence, retorting with a vague question doesn't really help your case. If anything it reinforces the original claim that there's no credible evidence.
Being "biased" isn't remotely close to outright lying. Despite all the exasperation about Fox News being "fake news" or whatever, they very rarely outright lie.
Weird hill to die on, man. Like, sure credible evidence is one of the most important things in the world... but what, are you honestly saying that you're going to be surprised if WhatsApp turned out to be leaking data?
We don't need the pitchforks just yet, sure, but shit, you have to remain realistic about these things.
>but what, are you honestly saying that you're going to be surprised if WhatsApp turned out to be leaking data?
Your words, not mine. I never made such claims, and you're trying to move the goalposts from "Meta does this" to "I'll be surprised if Meta does this".
I'm not moving goalposts. I didn't accuse WhatsApp of leaking data, stop twisting other people's words.
I think you mean "I'll NOT be surprised if Meta does this", which is the reasonable position of any rational person to take.
I'm allowed to extrapolate expectations of future behaviour, based on past behaviour. Doing otherwise is naive, dangerously so if you're responsible for someone else's security or privacy.
The truth is even worse; reddit has enough of a profile built on you that they can predict your penchant for beyblades without even needing your whatsapp chats.
WhatsApp is a closed-source client that you cannot trust to faithfully and correctly implement the protocol, or be free of backdoors that allow Meta to snoop on your conversations.
At least according to Meta's marketing, WhatsApp is E2E encrypted. And they make ads just for this: you can literally see billboards in NYC that advertise the encrypted messaging part of the product. It would completely destroy WhatsApp and Meta's brand if there were a backdoor somewhere. Well, Meta was never a great company to begin with, but nobody would lie about this and destroy their brand this way.
And I truly believe Meta has an incentive to do so. They had to reveal a conversation on Facebook Messenger on the topic of abortion after the police asked for it, which resulted in someone being put in jail. Regardless of Meta's (or rather Zuck's) ever changing political position, they don't want to have liability over anything like this. They want to walk away and just say to the cops, look it's all encrypted, there's nothing we can share with you.
Better keep conspiracy theories to yourself. It's ok to question things, but better back that up with evidence.
In case this is not clear enough: you'd better come up with some real arguments with concrete evidence, or move on. Nobody has time for meaningless speculation.
>...corporations are sharing data with each other.
>I wholeheartedly expect that if I mention Beyblade toys on Facebook...
Isn't the lede here that this isn't just some random data sharing agreement between companies, but that these are both Chinese companies, and the recipient of the data has been banned in the U.S. precisely because of data concerns?
Things can be shocking (as in: causing indignation or disgust), yet totally unsurprising. In fact, I'd argue that most newsworthy events tend to be both terrible and entirely expected, given incentives and the way the world is set up to work.
I think the problem can be solved easily by forcing the company behind DeepSeek to simply redirect all the data they've gathered on their users directly into a CIA database. Surely this will be considered a good compromise.
That won’t protect you from its propaganda/censorship. Some versions of DeepSeek’s models have bias built in - as in it’s not just implemented by their service/app. But offline does protect you from privacy/security issues.
Yes if you search, lots of people have shared evidence of this. But it depends on which model you’re using, as some seemingly don’t have the bias built into their training.
From the 5th paragraph of the article, Americans are complaining:
> Since then, multiple countries have warned that user data may not be properly protected, and in February a US cybersecurity company alleged potential data sharing between DeepSeek and ByteDance.
Bruhh, your iphone and android will literally “share” what you are saying even in private with anyone they can find for advertising… so this should not be surprising
... in the same way a lot of websites in this world 'shared user data' with Google.
Through Google Analytics.
Yeah, believe it or not. ByteDance has a cloud offering. And it includes a frontend APM product. And DeepSeek used that. How surprising! A Chinese company used a Chinese cloud.
Oh, and chat.deepseek.com resolves to a Huawei Cloud IP address in China. It resolves to Cloudflare outside of mainland China, but who knows, maybe they just decided to wrap with another CDN and their servers are still on Huawei Cloud. So they sent data to Huawei, too. I repeat, H-U-A-W-E-I. That cursed telecom equipment company in the States.
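If you want to check the resolution claim from your own vantage point, a small Python sketch along these lines would do it. The function name and the injectable `resolver` parameter are my own additions for illustration; the answer depends on where you resolve from, and mapping an IP to a cloud provider needs a separate WHOIS/ASN lookup that is not shown here.

```python
import socket


def resolve_ips(hostname: str, resolver=socket.getaddrinfo) -> list:
    """Return the sorted, de-duplicated IP addresses a hostname resolves to."""
    infos = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Results differ by location and over time, so no output is shown here.
    print(resolve_ips("chat.deepseek.com"))
```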
Here's the SecurityScorecard article that brought attention to this: https://securityscorecard.com/blog/a-deep-peek-at-deepseek/#...