I have been working on a project in a similar vein: rq[0]. Mine started out as an attempt to make a jq-like frontend for the Rego[1] language. However, I do find myself using it to simply convert from one format to another, and for pretty printing, quite often.

The interactive mode that qq has is really slick. I didn't torture test it, but it worked pretty smoothly with a quick test.

I see that the XML support is implemented using the same library as rq. This definitely has some problems, because the data model of XML does not map very cleanly onto JSON. In particular, I recall that I had problems using mxj to marshal arrays, and had to use this[2] ugly hack instead. qq seems to have done this a little more cleanly[3]. I may just have to lift this particular trick.
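
For anyone who hasn't hit this: XML has no native array type, so a typical XML-to-JSON mapping (generic behavior, not a claim about mxj or qq specifically) goes roughly like this:

    <!-- two <item> children: most XML-to-JSON mappers emit an array -->
    <items><item>a</item><item>b</item></items>
      => {"items": {"item": ["a", "b"]}}

    <!-- one <item> child: the array collapses to a scalar -->
    <items><item>a</item></items>
      => {"items": {"item": "a"}}

Round-tripping can't tell whether a lone child was "really" a one-element array, which is what makes marshaling arrays back to XML awkward.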

I definitely found CSV output to be challenging. Converting arbitrary JSON style data into something table shaped is pretty tricky. This[4] is my attempt. I have found it works well enough, though it isn't perfect.
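
As a sketch of the general approach (a from-scratch illustration, not rq's actual code): accept a slice of flat maps, use the sorted union of keys as the header, and stringify each cell:

    package main

    import (
        "encoding/csv"
        "fmt"
        "os"
        "sort"
    )

    // toCSV writes a slice of flat maps as CSV: the header is the sorted
    // union of all keys, and missing cells are left empty.
    func toCSV(rows []map[string]any) error {
        seen := map[string]bool{}
        for _, r := range rows {
            for k := range r {
                seen[k] = true
            }
        }
        var header []string
        for k := range seen {
            header = append(header, k)
        }
        sort.Strings(header)

        w := csv.NewWriter(os.Stdout)
        if err := w.Write(header); err != nil {
            return err
        }
        for _, r := range rows {
            rec := make([]string, len(header))
            for i, k := range header {
                if v, ok := r[k]; ok {
                    rec[i] = fmt.Sprint(v)
                }
            }
            if err := w.Write(rec); err != nil {
                return err
            }
        }
        w.Flush()
        return w.Error()
    }

    func main() {
        _ = toCSV([]map[string]any{
            {"name": "a", "n": 1},
            {"name": "b", "n": 2, "extra": true},
        })
    }

Nested values are where it gets tricky; fmt.Sprint is a cop-out for anything non-scalar.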

I can also see that qq hasn't run into the byte order mark problem with CSV inputs yet. Might want to check out spkg/bom[5].
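
If I'm remembering the spkg/bom API correctly (worth double-checking against the package docs), the fix is a one-line wrapper around the input reader:

    package main

    import (
        "encoding/csv"
        "os"

        "github.com/spkg/bom"
    )

    func main() {
        // bom.NewReader strips a leading byte order mark if one is
        // present, and otherwise passes the stream through untouched,
        // so the CSV parser never sees it. (From memory; verify against
        // the package docs.)
        r := csv.NewReader(bom.NewReader(os.Stdin))
        records, err := r.ReadAll()
        if err != nil {
            panic(err)
        }
        _ = records
    }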

One final breadcrumb I'll drop - I drew a lot of inspiration for rq's input parsers from conftest[6]. That may be a good resource to find more formats and see specific usage examples of them.

Thanks for sharing! It's really interesting to see some of the convergent evolution between rq and qq.

0 - https://git.sr.ht/~charles/rq

1 - https://www.openpolicyagent.org/docs/latest/policy-language/

2 - https://git.sr.ht/~charles/rq/tree/c67df633c0438763956ff8646...

3 - https://github.com/JFryy/qq/blob/2f750f04def47bec9be100b7c89...

4 - https://git.sr.ht/~charles/rq/tree/c67df633c0438763956ff8646...

5 - https://github.com/spkg/bom

6 - https://github.com/open-policy-agent/conftest


Hi Charles,

rq was shared with me yesterday, and I just wanted to say it's very impressive. I had heard of OPA/Gatekeeper and had looked into Rego before for policy assertions w/ Terraform, but I was not aware the language was so expressive until I saw rq. Also, the number of codecs rq supports, and their quality, is really great.

It is really neat seeing a lot of tools solve a similar problem in such unique ways (especially the case with rq), and it has been a lot of fun reading your experiences here. Thanks for sharing your expertise with de-serializing/serializing content - it is really cathartic to hear you mention the challenges you solved with XML and CSV. I really like how you solved CSV output/input; the conditions you chose for evaluating the input data make a lot of sense and are really comprehensive. It bothered me too, since the content would either need to be a matrix or a slice of maps, but seeing as jq has string formatting that can convert things to @csv and @tsv, I was at a bit of a standstill on how to approach it.

Thanks so much for the breadcrumbs! I look forward to reading these in more detail over the week/weekend :)


Wow, didn’t realize it had enough legs for people to be hearing about it except via me! Awesome to hear that.

Rego is “for” those authz cases like the ones you mentioned in the sense that it’s definitely designed with those in mind, and I do think it does a good job for those needs. OPA itself is definitely geared for use as a microservice or container sidecar, talking over the wire. That’s kinda hard to use in a shell script though.

Once I learned it, I found myself using opa eval for searching and transforming data, eventually so much so that I made a shell script called “rq” that was basically opa eval -I -f pretty… the rest is history.
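
Reconstructed from memory, so treat it as a sketch rather than the real thing (the "$@" pass-through is my guess at how the query got forwarded):

    #!/bin/sh
    # read JSON on stdin (-I), pretty-print the result (-f pretty),
    # forward the query and any extra flags (hypothetical reconstruction)
    exec opa eval -I -f pretty "$@"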


Hey! fq author here, just want to say that I've looked at rq also :) Nice to see people exploring the development of tools like this.


Wow, this is cool seeing you post here; fq might be the most innovative modern CLI tool I’ve seen. The querying of historic and archival formats that fq provides is a big inspiration.


Thanks for the kind words! I have to pass along most of the credit to the designers of jq and the jq CLI interface. Lots of care and thought have gone into those; gojq was also a big enabler. I feel more like a plumber who half-accidentally connected a bitstream decoder to a hacked-up version of jq :D


If fq is primarily a plumbing project, qq is a janitorial one at this stage (maybe qq can be more akin to plumbing in the future). All things considered, I really appreciate your work with the community!


It does involve some quite complicated plumbing :) I hope I've made at least some more people interested in jq as a language and related tools, with fq, jqjq, and jq-lsp, and by helping maintain the original jq project.


I mean, to be fair, rq's support for semi-structured formats is unprecedented, so I could see why!

Thanks for sharing about rq here. I really should give OPA a try sometime; it seems really powerful for setting policies in Kubernetes or Terraform, for instance. When I first heard of Rego, I was very interested, but it didn’t quite click. I can see that would not have been the case had rq been available at the time.


This is the beauty of open source. I love that you’re being so collaborative here instead of seeing another tool as competition.


What's better than 1 nifty tool for querying semistructured data?

2 nifty tools for querying semistructured data!


I wrote a wrapper around these some years ago which I’ve been using since. It’s been working pretty well for me.

https://git.sr.ht/~charles/dotfiles/tree/171c95a20394552e02a...


There isn’t consensus on who the Sea Peoples were.

You might be interested in the book “1177 B.C.: The Year Civilization Collapsed” by Eric H. Cline, which covers what we know about the pre-collapse civilizations as well as the collapse itself. It’s very fascinating, though it seems a lot of the details are simply lost to time.


Have any other book suggestions? These ancient civilizations have been of great interest to me lately; it doesn't have to be specific to the Mediterranean.


I feel like requiring users to enter into a legal agreement with a private third party in order to perform legally required obligations for a government entity has some troubling edge cases.

What happens if a user has a dispute with ID.me, who decides to terminate their account for violating their ToS? Something tells me they will still be on the hook to pay their taxes (presumably via paper, for however long that's still supported).


What if a person can't use a phone/camera, per an ADA claim? What is the reasonable accommodation?


I wonder if we'll see a comeback of hand-curated directories of content? I feel like the "awesome list" trend is maybe the start of something there.

I would be willing to pay an annual fee to have access to well-curated search results with all the clickbait, blogspam, etc. filtered out.

Until then, I recommend uBlacklist[0], which allows you to hide sites by domain in the search results page for common search engines.

0 - https://github.com/iorate/uBlacklist


> hide sites by domain

This gives me the idea to build a search engine that only contains content from domains that have been vouched for. Basically, you'd have an upvote/downvote system for the domains, perhaps with some walls to make sure only trusted users can upvote/downvote. It seems like in practice, many people do this anyway. This could be the best of both worlds between directories and search engines.
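
A minimal sketch of the core mechanism, just to make the idea concrete (all names hypothetical, not an existing system):

    package main

    import "fmt"

    // VoteStore tallies per-domain votes from trusted users; domains
    // below the threshold are excluded from the index entirely.
    type VoteStore struct {
        votes     map[string]int
        threshold int
    }

    func NewVoteStore(threshold int) *VoteStore {
        return &VoteStore{votes: map[string]int{}, threshold: threshold}
    }

    // Vote records +1 or -1 for a domain. A real system would gate this
    // behind the "walls" mentioned above (account age, reputation, etc.).
    func (s *VoteStore) Vote(domain string, up bool) {
        if up {
            s.votes[domain]++
        } else {
            s.votes[domain]--
        }
    }

    // Vouched reports whether a domain has enough net votes to be served.
    func (s *VoteStore) Vouched(domain string) bool {
        return s.votes[domain] >= s.threshold
    }

    func main() {
        s := NewVoteStore(2)
        s.Vote("example.org", true)
        s.Vote("example.org", true)
        s.Vote("spam.example", false)
        fmt.Println(s.Vouched("example.org"), s.Vouched("spam.example"))
    }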


I don't think this would change a lot; you would probably just raise big sites (Pinterest, Facebook) much higher in the rankings, as the 99% of users who are non-programmers would vouch for them.

You could counter that somewhat by having a "people who liked X also like Y" mechanism, but that quickly brings you back to search bubbles.

In that sense, Google probably should/could do a better job by profiling you: if you never click through to a page, lower it in the rankings. Same with preferences: if I mainly use a specific programming language and search for "how to do X", they could give me results only in that language.

In the end that will probably make my search results worse, as I am not only using one language ... and sometimes I actually click on Pinterest :-(


You don't need upvotes/downvotes. If someone searches for X and clicks on results, you just record when they stop trying sites or new search terms, since you can assume the query has been answered. Reward that site. Most search engines are already doing this in some form.
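
A rough sketch of that heuristic, with made-up names, just to make it concrete: within a session, credit the last clicked site once the user goes quiet:

    package main

    import (
        "fmt"
        "time"
    )

    // Click is one result click within a search session.
    type Click struct {
        Site string
        At   time.Time
    }

    // creditSession implements the implicit-feedback idea above: if the
    // session goes idle (no new clicks or queries), assume the *last*
    // clicked site answered the query and reward it.
    func creditSession(clicks []Click, sessionEnd time.Time, idle time.Duration, score map[string]int) {
        if len(clicks) == 0 {
            return
        }
        last := clicks[len(clicks)-1]
        if sessionEnd.Sub(last.At) >= idle {
            score[last.Site]++
        }
    }

    func main() {
        score := map[string]int{}
        start := time.Now()
        clicks := []Click{
            {"blogspam.example", start},
            {"useful.example", start.Add(30 * time.Second)},
        }
        // the user goes quiet for 10 minutes after the second click
        creditSession(clicks, start.Add(10*time.Minute), 5*time.Minute, score)
        fmt.Println(score) // map[useful.example:1]
    }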


This is what Google already does, does it not? Why else would they be recording outbound clicks?

Unfortunately, this doesn't entirely solve the problem. Counting clicks doesn't work because you don't know if the result was truly worthwhile or if the user was just duped by clickbait.

As you say, tracking when they stop trying sites is better, but I don't know how good that signal-to-noise ratio is. A low-quality site might answer my question, sort of, perhaps in the minimal way that gets me to stop searching. But perhaps it wouldn't answer it nearly as well as a high-quality site that would really illuminate and explain things and lead to more useful information for me to explore. Both those scenarios would look identical to the click-tracking code of the search engine.


If I click on link 1 then click on link 2 several minutes later, 1 probably sucked. The difficulty is if I click on 1 and then 2 quickly, it just means I’m opening a bunch of tabs proactively.


Often you don’t know if a site is legit or not without first visiting it.

And new clone sites launch all the time, so I’m always clicking on search results to new clone sites that I’ve never seen before so can’t avoid them in results.


Yeah, when I get caught by these SEO spam sites, it's because they didn't have a similar ranking to the SO thread that they ripped off, so it wasn't immediately apparent.


> This gives me the idea to build a search engine that only contains content from domains that have been vouched for.

Just giving us personal blocklists would help a lot.

Then, if search engines realized most people block certain websites, they could also let that affect ranking.


You can access one without paying a dime.

http://teclis.com

Problem is people usually want one general search engine, not a collection of niche ones.


It’s not a directory. Hand-crafted descriptions, instead of random citations and/or marketing from the site itself, are what make something a directory. This one is a search engine. Maybe it’s a good one for its purpose, but who knows; without an ability to navigate, you can’t tell.

> Problem is people usually want one general search engine, not a collection of niche ones.

In my opinion, the reason they want a general search engine is that they think inside their box (search -> general search). What they really want is a way to discover things and quick summaries about them: “$section: $what $what_it_does $see_also”. Search engines abuse this necessity and suggest deceitfully summarized ads instead.


The trouble is, how do you prevent Sybil attacks? The spammers might vote for their own sites.

https://en.wikipedia.org/wiki/Sybil_attack


It would be better if my votes only affected my own search results.


I would point out that sixels[0] exist. There is a nice library, libsixel[1], for working with them, which includes bindings for many languages. If the author of sixel-tmux[2][3] is to be believed[4], the relative lack of adoption is a result of unwillingness on the part of maintainers of some popular open source terminal libraries to implement sixel support.

I can't comment on that directly, but I will say, it's pretty damn cool to see gnuplot generating output right into one's terminal. lsix[5] is pretty handy as well.

But yeah, I agree, I'm not a fan of all the work that has gone into "terminal graphics" based on Unicode. It's a dead end, as was clear to DEC even back in '87 (and that's setting aside that the VT220[6] had its own drawing capabilities, though they were more limited). Maybe sixel isn't the best possible way of handling this, but it does have the benefit of 34 years of backwards compatibility, and with the right software, you can already use it _now_.
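
To give a feel for how simple the format is, here is a minimal hand-rolled sixel in Go, written from the spec as I remember it (assumes a sixel-capable terminal, e.g. xterm started with -ti vt340; treat the details as a sketch):

    package main

    import "fmt"

    func main() {
        // Each data character encodes a 6-pixel-tall column; '~' is all
        // six pixels on, '!60' repeats the next character 60 times, and
        // '-' moves down to the next 6-pixel band.
        fmt.Print("\x1bPq")       // DCS ... q: enter sixel mode
        fmt.Print("#1;2;90;40;0") // define color 1 as RGB percentages
        fmt.Print("#1")           // select color 1
        fmt.Print("!60~")         // first 6-pixel band, 60 columns wide
        fmt.Print("-")            // next band
        fmt.Print("!60~")         // total: a 60x12 filled rectangle
        fmt.Print("\x1b\\")       // ST: leave sixel mode
        fmt.Println()
    }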

0 - https://en.wikipedia.org/wiki/Sixel

1 - https://saitoha.github.io/libsixel/

2 - https://github.com/csdvrx/sixel-tmux

3 - https://news.ycombinator.com/item?id=28756701

4 - https://github.com/csdvrx/sixel-tmux/blob/main/RANTS.md

5 - https://github.com/hackerb9/lsix

6 - https://en.wikipedia.org/wiki/VT220


> If the author of sixel-tmux[2][3] is to be believed[4], the relative lack of adoption is a result of unwillingness on the part of maintainers of some popular open source terminal libraries to implement sixel support.

If you have any doubt, look no further than this thread: the sixel format is attacked not for any technical reasons, but for its age, RIGHT HERE ON HN:

>> "That's a protocol that's a good forty years old, and even that is not supported. And I can see why, why on earth would you want to be adding support for that in 2021? What a ridiculous state of affairs."

What's ridiculous is, with so many examples and quotes, some people still think I must be "emotional" (I had a long discussion here... https://news.ycombinator.com/item?id=28761043 ) or that a few million colors is not sufficient for the terminal (!)

There is none so blind as those who will not see...


Sixels are fun, but I was disappointed by libsixel. It’s not really a general-purpose library; most of it is there only to implement the various command-line arguments of img2sixel. Most of the functions determine what to do by parsing strings taken from the command-line arguments, so reusing it is super annoying.

When implementing a program that outputs sixels, you are better off looking elsewhere. SDL1.2-SIXEL is a good choice in general, if you are writing C or don’t mind using the C bindings for your preferred language.


That’s interesting. Do you think sixels could work for the baseline tests? Would it be feasible to have them display nicely in an IDE, like VS Code or Visual Studio?


I don’t see why sixels couldn’t work. You’d probably want a tool to decode them, diff the images, and then output another sixel image. I’m admittedly not sure such a tool exists off the shelf, though.

I’m not aware of any text editors supporting sixels, which could make preparing the tests a challenge. Certainly, you could imagine a text editor supporting them, but personally I’m not aware of one that does.

I will concede that for your specific use case, an off the shelf ASCII plotting library probably involves less custom tooling.


> I don’t see why sixels couldn’t work.

Sixels will work: they are fast enough to allow YouTube video playback!!!

https://github.com/saitoha/FFmpeg-SIXEL/blob/sixel/README.md

The problem is NOT THE FORMAT, the problem is the lack of tooling: links and w3m are among the rare text browsers that can display images in the console.

It's just a matter of the browser sending the image to the terminal in some format it can understand, but if that hasn't been thought about as a possibility, it's going to be far more complicated than just adding a new format, as you will have to work both on the text reflow issues (e.g.: how do you select the size of the placeholder when it's expressed in characters?) and on the picture display issues.

Said differently, it would be easier to have a console IDE that supported graphics if any format whatsoever (sixel, kitty...) was supported by a console IDE; we could then argue about the ideal format.

Arguing about the ideal format BEFORE letting the ecosystem grow using whatever solution there is only results in a negative loop.

It's like a startup arguing about the ideal technology stack even before trying to find product-market fit!!

Personally, I do not care much about the sixel, kitty, or iTerm formats - all I want is to see some kind of support for a format that's popular enough for tools using it to emerge.

Yes, it would be better if that supported format was the option that had the greatest chance of succeeding, but right now, that is a very remote concern: first we need tools, then if in the worst case they are for a "bad" format, we can write transcoders to whatever format people prefer!

Right now, there is rarely any "input" to transcode (how many console tools support, say, the iTerm format?), so we have a much bigger bootstrapping problem.

> an off the shelf ASCII plotting library probably involves less custom tooling

With a terminal like msys2 or xterm, no custom tooling is required: just use the regular gnuplot after doing the export for the desired resolution, font, and font size.

gnuplot is far more standard than plotting libraries, which often require special Unicode fonts on top of requiring you to use their specific format.
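
For example, something along these lines works, assuming your gnuplot build includes the sixelgd terminal (worth verifying with 'set terminal' inside gnuplot first):

    # assumes a sixel-capable terminal and a gnuplot built with sixelgd
    gnuplot -e 'set terminal sixelgd size 800,480; plot sin(x)'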


I find kitty's graphics protocol to be a superior implementation of the idea: https://sw.kovidgoyal.net/kitty/graphics-protocol/


That's a protocol that's a good forty years old, and even that is not supported. And I can see why, why on earth would you want to be adding support for that in 2021? What a ridiculous state of affairs.


> That's a protocol that's a good forty years old

34 years old, actually. I guess we can go ahead and deprecate the x86 instruction set, TCP/IP, ASCII, C, tar, and many other tools and standards that are old.

> and even that is not supported.

xterm supports VT340 emulation, sixels included. I use this semi-regularly. I believe mintty also supports sixels, plus a handful of others. The libsixel website has a full list.

> And I can see why, why on earth would you want to be adding support for that in 2021?

You might want to read your own post ( https://news.ycombinator.com/item?id=28856005 ).

What’s your great idea as opposed to sixels?


Personally, I have never used any of these features.

I'm not even really sure what "Expanded Dark Mode" is.

DoH is probably good for many people, but I prefer to run my own DNS server (and I VPN back into my LAN over WireGuard when not at home).


The only BlackBerry I ever owned was the BlackBerry Classic, back when that was still contemporary. Best phone I ever had. The UX was very consistent between apps, everything targeting it natively tended to be quite speedy, and the keyboard was excellent.

I also miss the "BlackBerry Hub" feature, which would aggregate your emails, BlackBerry messenger messages, and SMS messages into a single UI. It even pulled in notifications from Android apps, though opening them switched to that app rather than letting you reply in-line.

I bought mine after they had already released Android compatibility for any APK you cared to load, but unfortunately I think that feature was too little, too late.

I've been on an iPhone SE since around 2016. If I had the option to go back to using the BB Classic hardware/OS as it was when I switched, but with third-party app support and security updates, I would do it without a second thought.


I still want Apple to make a Hub thing inside iOS. I hate having to use 6 messaging apps to reach people I love, but as 'data providers' I'd keep them installed.


Microsoft tried doing something like this with Windows Phone. It was OK, but the fundamental issue here is that the messaging apps themselves absolutely do not want such a thing, because it would commoditize them and keep the user away from revenue-boosting gimmicks and dark patterns. They will never allow it to be built. RIM was only able to do it before companies realized the value of messaging lock-in; even if it still existed in any real capacity today, this feature would not.


There are startups focusing on this: https://Beeperhq.com and https://texts.com


Blackberry Hub was a thing of beauty. A single message queue for things I need to be aware of would make my life so much easier.


I agree, it was wonderful on a Passport.

But I do wonder what a mess would have resulted if it had collided with mass-market Android and the shovelfuls of software that treat notifications like a '90s systray. BBOS 10 not being Android might have been a bit of a moat against that.


With the right API, it would have been great - I don't want a notification from Instagram that friend X liked my story, but if all app notifications were segregated by user, that firehose of notifications would be way more manageable.


This. I started with a BB Storm and stayed through several BB10 handsets up to the Passport. There was no second-guessing the UX with BlackBerry. The menu was always in the same place, and Hub was great when you have so many accounts.

I found the UX in iPhone apps so irritating. Settings could be virtually anywhere and were commonly scattered across multiple places.

That said, I now use my iPhone very differently to how I used my BlackBerry, and I wonder if I would still appreciate the BlackBerry features if I went back.

By this I mean I get virtually no notifications. I don't have work emails on my iPhone, and I have only the red badge icon turned on for personal email accounts. WhatsApp only fetches new messages when I open the app. The only app notification I get is from Screen Time every Sunday.

One of the best things about BlackBerry was the subtlety of notifications, but I've just chosen to go low-notification with the iPhone and I don't think I'll ever revert that.


Windows phone had a hub-like UI, and I absolutely loved it.


Regular expressions, deterministic finite automata, and nondeterministic finite automata are all equivalent[0][1].

All three of these representations are capable of describing any regular language (set of symbol sequences, or more intuitively a set of strings), and the fact that a language can be described by an NFA, DFA, or RE implies that it is regular.

I am not hugely familiar with Perl's "extended regular expression" system; however, I was under the impression that the set of languages it can recognize is a superset of the set of all regular languages. Based on [2], it would appear that Perl regexes can recognize all regular languages, plus some beyond them, up to parts of the set of all Turing-recognizable languages. For example, backreferences let the well-known regex /^1?$|^(11+?)\1+$/ match exactly the non-prime numbers written in unary, a language that is not regular.

0 - Introduction to the Theory of Computation, 3rd ed., Michael Sipser, Thm 1.39, p. 55.

1 - Introduction to the Theory of Computation, 3rd ed., Michael Sipser, Thm 1.54, p. 67.

2 - https://www.perlmonks.org/?node_id=809842


FWIW, the equivalence between NFAs and DFAs requires an exponential space increase in the worst case to encode the NFA as a DFA. The classic example is the language of strings whose n-th symbol from the end is 1: an NFA needs only n+1 states, but any DFA needs 2^n states, because it must remember the last n symbols. And with an exponential space blowup you can encode a lot of things as DFAs (I'm pretty sure you could encode a Turing machine that uses bounded space on the tape as a DFA with exponentially more states: "just" make each possible configuration one state in the DFA).
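
To make that blowup concrete, here is a minimal Go sketch of the subset construction on exactly that example; the reachable DFA state count doubles with each increment of n:

    package main

    import "fmt"

    // Counts reachable DFA states produced by the subset construction for
    // the classic language L_n = { w over {0,1} : the n-th symbol from the
    // end of w is 1 }. The NFA has n+1 states; the DFA needs 2^n, because
    // it has to remember the last n symbols seen.
    func dfaStates(n int) int {
        // NFA transition function. State 0 loops on both symbols and, on a
        // '1', also guesses "this is the n-th symbol from the end"; states
        // 1..n-1 just advance; state n (accepting) has no outgoing edges.
        delta := func(state int, sym byte) []int {
            switch {
            case state == 0 && sym == '0':
                return []int{0}
            case state == 0 && sym == '1':
                return []int{0, 1}
            case state < n:
                return []int{state + 1}
            default:
                return nil
            }
        }

        // BFS over subsets of NFA states, encoded as bitmasks.
        start := 1 << 0
        seen := map[int]bool{start: true}
        queue := []int{start}
        for len(queue) > 0 {
            set := queue[0]
            queue = queue[1:]
            for _, sym := range []byte{'0', '1'} {
                next := 0
                for s := 0; s <= n; s++ {
                    if set&(1<<s) != 0 {
                        for _, t := range delta(s, sym) {
                            next |= 1 << t
                        }
                    }
                }
                if !seen[next] {
                    seen[next] = true
                    queue = append(queue, next)
                }
            }
        }
        return len(seen)
    }

    func main() {
        for n := 1; n <= 10; n++ {
            fmt.Printf("n=%2d: NFA states %2d, DFA states %4d\n", n, n+1, dfaStates(n))
        }
    }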


At the end of the day, even if you assume good faith on Google's part (which I think is quite a leap), causing the user to present more entropy to the site will make them easier to fingerprint.

256 topics would be ceil(log2(256)) = 8 bits of entropy

30,000 topics would be ceil(log2(30000)) = 15 bits of entropy

As a reminder, there are ~ 10 billion people on earth, so if you have 34 bits of entropy or so, you can uniquely identify each person.

So really, the way to think of this is as "Google considers making FLoC 20% less effective at fingerprinting users", and that's not even considering other sources of entropy, like user agent or screen size.
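
As a runnable check of the arithmetic above (a sketch; it treats each value as a single uniformly distributed label, which is the best case for the user):

    package main

    import (
        "fmt"
        "math"
    )

    // bits is the entropy, in whole bits, of one uniform choice among n
    // alternatives: ceil(log2(n)).
    func bits(n float64) int {
        return int(math.Ceil(math.Log2(n)))
    }

    func main() {
        fmt.Println(bits(256))   // 8: 256 topics
        fmt.Println(bits(30000)) // 15: 30,000 topics
        fmt.Println(bits(10e9))  // 34: enough to single out one of ~10 billion people
    }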


> and that's not even considering other sources of entropy, like user agent or screen size.

As a reminder: Chrome sends 16 bits of x-client-data with every http request aimed at Google servers. So they already have half the bits they need to uniquely identify your system without FLoC.


The X-Client-Data header is only for evaluating the effect of changes to Chrome, not ad targeting: https://www.google.com/chrome/privacy/whitepaper.html#variat...

Earlier comment with more details: https://news.ycombinator.com/item?id=27367482

(Disclosure: I work on ads at Google, speaking only for myself)


> The X-Client-Data header is only for evaluating the effect of changes to Chrome, not ad targeting

Just like the 2FA phone numbers that people gave to Facebook were for "security purposes" only, but it later turned out they were being used for ad targeting?

https://techcrunch.com/2018/09/27/yes-facebook-is-using-your...


Saying "well they could be lying" kinda makes the whole discussion moot doesn't it? Why even bother talking about FloC because they could be lying about that too.


Because neither Facebook nor Google is a single monolithic decision-maker that understands what it itself is doing. Instead, they're fragmented organizations with many different groups with competing interests and goals within them.

More concretely, I think it's easy to believe that:

- The Facebook software developers and product managers who originally built and promoted phone 2FA were being earnest when they said the data would never be used for advertising.

- Some number of years later, someone elsewhere in the organization successfully got themselves access to that information without the knowledge/approval of the first group of people--who in all likelihood don't even work at Facebook anymore--and broke that original promise.

Throwing your hands up in the air and crying "well if they're lying, then all is for naught!" ignores the fact that large organizations act in complex ways, and even if you assume good faith on behalf of the current set of actors, you still need to push for systems which remain ethical and safe if some future set of actors turns out to be complete scumbags.


Irrespective of whether they're telling the truth or lying, saying Chrome sends 16 bits of x-client-data that can be used to identify you means Chrome sends 16 bits of x-client-data that can be used to identify you.


Exactly. The protocols need to not depend on “because I said so” or “pinky swear”.

This is the same problem with Apple’s new SpywareKit.


FLoC is open source in Chromium. They're not lying about that. What they do with Google-specific information originating from Chrome is where skepticism applies.


It's understandable that people have concerns about an HTTP header that could be used as part of a fingerprinting system.

Is there an option in Chrome to opt out of this data being sent to Google?


I don't really understand this mindset. Google controls Chrome. They openly track every page you visit and show it to you at https://myactivity.google.com/ .

It's possible that Google is tracking you with FLoC or with extra HTTP headers or whatever. But they're also openly tracking you all the time anyway because you use Chrome. If you don't trust them to use the data they collect responsibly, don't use Chrome. (I'm not saying you shouldn't pressure Chrome to collect less data, I'm saying it doesn't make any sense to theorize about secret HTTP header fingerprinting operations when they're making literally no effort to hide the much bigger data collection operation right in front of you.)


> They openly track every page you visit and show it to you at https://myactivity.google.com/ .

What if I'm not signed into Chrome, or have that feature disabled?


My guess is they are still tracking you. Kind of like how Facebook creates shadow profiles for people who don't have an account. So they do all the same tracking, you just don't get to see it in a nice dashboard!


> 256 topics would be ceil(log2(256)) = 8 bits of entropy

Unless several topics can be assigned to a person (which seems to be implied in the article), in which case that's 256 bits of entropy available to classify each person.

> As a reminder, there are ~ 10 billion people on earth, so if you have 34 bits of entropy or so, you can uniquely identify each person.

Yeah, well theoretically you could. But that assumes that browsers are able to extract and balance some very arbitrary and very specific information from the browsing habits of all people on earth in a perfect decision tree.

In practice, lots of browsing habits overlap, making this decision tree far less discriminating and powerful than the theoretically optimal one.

Though I think you are absolutely correct that in practice the number of bits needed to build a classifier able to uniquely classify each person must be pretty low. Maybe a few hundred.

That may very well be possible with those 256 topics mentioned in that article.

Also, I don't understand the difference between cohorts and topics, apart from the fact that topics are less numerous and can have appealing names?


> Unless several topics can be assigned to a person (which seems to be implied in the article), in which case that's 256 bits of entropy available to classify each person.

Good catch, forgot this was a bit vector, not a single key.

> Yeah, well theoretically you could. But that assumes that browsers are able to extract and balance some very arbitrary and very specific information from the browsing habits of all people on earth in a perfect decision tree.

Not really; people have found in the past that combinations of user agent, screen resolution, installed fonts, installed extensions, and things of that sort can come very close to uniquely identifying individual people.

> Though I think you are absolutely correct that in practice the number of bits to build up a classifier able to uniquely classify each person must be pretty low. Maybe a few hundreds.

Exactly. It might not narrow it down to one person, but perhaps a relatively small pool.


I imagine that advertisers wouldn't have access to the entire bit vector.

Google would also have to limit the number of bits an advertiser has access to.


I'm not exactly sure what you're doing with your math there, but I think you probably should include what you think is the current entropy of your browsing sessions...

Considering you're already aware of screen size and user agent, and other forms of fingerprinting, you should probably realize that in the pre-FLoC world, you're likely already 100% identified by numerous ad networks.


While your general assertion is true (log2(10 billion), ceil'd, is indeed 34), it is also misleading, for it assumes that your identifier will be almost completely uniformly distributed. That is very hard to achieve; in fact, every team that's trying to track users is effectively trying to solve this problem.


Where can I read an introduction to the concept of "entropy" in the sense that you're using it?

I understand it's an information-theoretical concept, and also understand it's somehow related to randomness, but I'm not sure exactly how, and I would like to have a more precise understanding.


I'm going to copy something I sent to a 13-year-old to explain entropy in simple terms. It came up when we were talking about encryption. Reading forwards goes from dense/mathematical to conceptual; reading sections in reverse order does the opposite. This probably won't be useful to you, but I have found it useful in other situations.

N bits of entropy refers to 2^N possible states.

Cryptanalysis:

AES-128 has a key size of 128 bits, so there are 2^128 possible AES-128 keys. A brute-force attack capable of testing 2^128 keys can break any AES-128 key with certainty.

Fingerprinting:

If a website measures your "uniqueness", saying "one in over 14 thousand people" isn't a great way to measure it, because that number changes exponentially. Since we're dealing with possible states, i.e. possible combinations of screen size, user-agent, etc., we instead take the base-2 logarithm of this to get a count of entropy bits (log2(14000) ≈ 13.8 bits).

Thermal physics:

The second law of thermodynamics states that spontaneous changes in a system should move from a low- to a high-entropy state. Hot particles are far apart and moving a lot; there are many possible states. Cold particles are moving around less and can't change as easily; there are fewer possible states. Heat cannot move from cold things to hot things on its own, but it can move from hot things to cold things. Think of balls on a billiards table moving apart rather than together.

Entropy of the whole universe is perpetually on the rise. In an unimaginably long time, the most popular understanding is that particles will all be so far apart that they'll never interact. The universe will look kind of like white noise: an endless sea of random-like movement, where everything adds up to nothing, everywhere and forever.


> A brute-force attack capable of testing 2^128 keys can break any AES-128 key with certainty.

One minor caveat: You have to be able to recognize when you've found the right key. If the message is short (less than the key size) then it is likely that there are multiple keys that can decode the ciphertext to a plausible message and you have no way to know which one was correct. This is why an ideal One-Time Pad is considered unbreakable even by brute force: For any possible message of size less than or equal to the ciphertext there exists a key which will decode the ciphertext into that message.
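
A tiny Go demonstration of that point (illustrative values only): for any same-length target plaintext, there exists a key that "decrypts" the ciphertext to it, so brute force learns nothing:

    package main

    import "fmt"

    // xor XORs two equal-length byte slices, the core of a one-time pad.
    func xor(a, b []byte) []byte {
        out := make([]byte, len(a))
        for i := range a {
            out[i] = a[i] ^ b[i]
        }
        return out
    }

    func main() {
        msg := []byte("ATTACK")
        key := []byte{0x8f, 0x3a, 0xc1, 0x55, 0x02, 0x99} // the "real" pad

        ct := xor(msg, key)

        // A brute-forcer who tries this other key gets an equally
        // plausible plaintext; nothing about ct says which was "right".
        decoyKey := xor(ct, []byte("DEFEND"))

        fmt.Printf("key1 -> %s\n", xor(ct, key))      // ATTACK
        fmt.Printf("key2 -> %s\n", xor(ct, decoyKey)) // DEFEND
    }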


This is a wonderful overview — thank you for writing it! It really helps navigate some of the more dense mathematical introductions out there.


Assuming light background in introductory probability, try the first 4 pages of this: https://web.stanford.edu/~montanar/RESEARCH/BOOK/partA.pdf


Spot on, thank you kindly!

