
>San Francisco-based Twitter also disclosed that it had discovered an error in how it had measured its user base since 2014 and revised its estimates downward

How is this possible? Can't they just do something like fetch_all_users, filtered by "not banned"?




Monthly actives is generally the primary metric and it's not so easy to calculate.


Quote from http://money.cnn.com/2017/10/26/technology/business/twitter-...

"These third-party applications used Digits, a software development kit of our now-divested Fabric platform, that allowed third-party applications to send authentication messages via SMS through our systems, which did not relate to activity on the Twitter platform," the company explained in its earnings report.

Really seems like something they should have caught earlier.


Do you mean it’s not easy to define? It shouldn’t be difficult to calculate any particular metric going forward, but it’s important to define what it means to be “active.”


Calculating these metrics at scale is not trivial.


In real time, yes.

But the user database should already have backups, importing those backups into an analysis server should be easy, and running queries like that on an analysis server should be easy.

Counting messages, or users with X messages, etc. is also largely a function of whether your backup/restore system works. But this time you do it in chunks.
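A rough sketch of the "do it in chunks" idea, assuming the restored backup can be read as a sequence of chunks and that each chunk is just an iterable of author user ids (one entry per message). The function name and the data shape are made up for illustration.

```python
from collections import Counter

def users_with_at_least(chunks, min_messages):
    """Count users who authored at least min_messages messages across all chunks."""
    counts = Counter()
    for chunk in chunks:
        # Each chunk is processed on its own; only the running
        # per-user tallies are kept between chunks.
        counts.update(chunk)
    return sum(1 for n in counts.values() if n >= min_messages)

# Hypothetical data: three chunks of message author ids.
chunks = [["a", "b", "a"], ["b", "c"], ["a"]]
result = users_with_at_least(chunks, min_messages=2)
```

The per-user tallies still have to fit in memory here, which is the part that stops being true at a large enough scale.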


I helped build Twitter's data platform, 2010-2016.

There isn't an "analysis server" and analyzing user activity is not done on a "user database backup" at Twitter's scale, though indeed that's a common way that would be done for smaller businesses.

By the way, if by user db you literally mean the db with user accounts, that's not the right data source -- you want the user _activity_ db to count active users, and for high-scale applications, those are different things. Presumably user activity updates are orders of magnitude more frequent than user object updates. You don't want to thrash your user db by constantly updating some "last seen at" field. Put that stuff somewhere else.

That said, it's true that counting is simple: it's just a job for Hadoop / Spark / your distributed computing platform of choice. Filter, distinct, count. It's not even hard in real time if you have enough RAM or are OK with approximate counts with bounded error, thanks to Storm, Heron, Flink, etc.
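A minimal sketch of "filter, distinct, count" in plain Python rather than Hadoop or Spark. The event fields (`user_id`, `timestamp`) and the 30-day window are assumptions for illustration, not Twitter's actual schema or definition of "monthly active."

```python
from datetime import datetime, timedelta

def monthly_active_users(events, as_of, window_days=30):
    """Count distinct user_ids with at least one event in the window ending at as_of."""
    start = as_of - timedelta(days=window_days)
    # Filter: keep events inside the window (start, as_of].
    in_window = (e for e in events if start < e["timestamp"] <= as_of)
    # Distinct: collapse to unique user ids.
    active = {e["user_id"] for e in in_window}
    # Count.
    return len(active)

# Hypothetical activity events.
events = [
    {"user_id": 1, "timestamp": datetime(2017, 10, 1)},
    {"user_id": 1, "timestamp": datetime(2017, 10, 20)},
    {"user_id": 2, "timestamp": datetime(2017, 10, 15)},
    {"user_id": 3, "timestamp": datetime(2017, 8, 1)},  # outside the window
]
mau = monthly_active_users(events, as_of=datetime(2017, 10, 26))
```

The exact set is what needs "enough RAM"; the approximate variant would swap it for a cardinality sketch, which is not shown here.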

Defining what exactly constitutes an active user and catching edge cases such as this Digits thing is where things get tricky; the number of weird scenarios that cause under/overcount for what seem like reasonable and straightforward definitions would surprise you.

@baddox nailed it.


Thanks. Note that I wasn't trying to guess at what twitter does, just to provide a workflow that should be viable almost anywhere, in the absence of easier options. It's good to hear that the underlying idea of "calculating the metric isn't the hard part" is true.


Oh, fair that that would be a more important metric, but when they said "user base" I incorrectly assumed they meant "all registered users."


I bet an engineer noticed and told a manager that the numbers were technically lower, then upper management found out and decided to release that info now that they already have massive mindshare and it won't hurt them as much.


And users who activated, users who logged on in the last X months, users not deleted, no duplicates caused by some obscure event synchronization issue, etc. Bugs are easy.


I believe they attributed this to MAU/DAU of Digits[0] who are not otherwise MAU/DAU of Twitter. Ostensibly, users who use Digits to sign into 3rd-party apps ride in the same DB as bona fide Twitter users, and they just didn't discount them.

[0]: https://techcrunch.com/2014/10/22/twitter-launches-digits-a-...
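A rough sketch of the kind of discounting that was apparently missing, assuming each activity event carries a hypothetical `source` tag. The tag values (`twitter`, `digits_sms`) and the event shape are invented for illustration; nothing here is Twitter's real schema.

```python
def count_active(events, excluded_sources=frozenset()):
    """Count distinct users with at least one event from a non-excluded source."""
    return len({e["user_id"] for e in events if e["source"] not in excluded_sources})

# Hypothetical events: "b" only ever shows up via Digits SMS auth.
events = [
    {"user_id": "a", "source": "twitter"},
    {"user_id": "b", "source": "digits_sms"},
    {"user_id": "c", "source": "twitter"},
    {"user_id": "c", "source": "digits_sms"},
]
naive = count_active(events)                                        # counts "b"
corrected = count_active(events, excluded_sources={"digits_sms"})  # does not
```

User "c" still counts after the correction because they also have on-platform activity; only users seen exclusively through the excluded source drop out.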



