pvankessel's comments | Hacker News

Heh, that's actually pretty compelling - it sounds like a darker twist on the plot of a Stargate episode I recently rewatched: https://en.m.wikipedia.org/wiki/Revisions_(Stargate_SG-1)


Are there any models out there for cleaning up an image, not just upscaling? I have a bunch of old photos taken on early low-res point-and-shoots that have JPEG artifacts etc., and this seems like something a modern model could easily be fine-tuned to resolve, but every few months I look around and have yet to find anything.


Check out these models: https://replicate.com/collections/image-restoration

Most of them can be run locally, but I’d recommend testing them with replicate before investing in understanding cog/docker/hf…


Oh, this sounds exactly like what I've been looking for, can't wait to give these a try - many thanks


Here's a study I did a few years ago that broke down popular YT videos into more granular categories, might be of interest: https://www.pewresearch.org/internet/2019/07/25/childrens-co...


This is such a clever way of sampling, kudos to the authors. Back when I was at Pew we tried to map YouTube using random walks through the API's "related videos" endpoint and it seemed like we hit a saturation point after a year, but the magnitude described here suggests there's quite a long tail that flies under the radar. Google started locking down the API almost immediately after we published our study, so I'm glad to see folks still pursuing research with good old-fashioned scraping. Our analysis was at the channel level and focused only on popular ones, but it's interesting how some of the figures on TubeStats are pretty close to what we found (e.g. language distribution): https://www.pewresearch.org/internet/2019/07/25/a-week-in-th...


> Google started locking down the API almost immediately after we published our study

Isn't this ironic, given how google bots scour the web relentlessly and hammer sites almost to death?


> google bots scour the web relentlessly and hammer sites almost to death

I have been hosting sites and online services for a long time now and have never had this problem, or even heard of it before.

If your site can't even handle a crawler, you need to seriously question your hosting provider, or your architecture.


Perhaps stop and reconsider such a dismissive opinion given that "you've never had this issue before" then? Or go read up a bit more on how crawlers work in 2023.

If your site is very popular and the content changes frequently, you can find yourself getting crawled at a higher frequency than you might want, particularly since Google can crawl your site with a high degree of concurrency, hitting many pages at once, which might not be great for your backend services if you're not used to that level of simultaneous traffic.

"Hammered to death" is probably hyperbole but I have worked with several clients who had to use Google's Search Console tooling[0] to rate-limit how often Googlebot crawled their site because it was indeed too much.

0: https://developers.google.com/search/docs/crawling-indexing/...


I have a website that gets crawled at least 50 times per second. Is that a big deal? No, not really. The site is probably doing 10,000 requests per second. I mean, a popular site is indexed a lot. Your webserver should be designed for it. What tech are you using, if I may ask?


My specific case doesn't really matter (and my examples are from some years ago and of smaller clients, not my own setup).

My point was that people provision capacity ideally based on observed or expected traffic, and that crawlers can, and do, show up and exceed that capacity sometimes, having a negative effect on your customers' experience.

But you are correct that it's absolutely manageable. And telling crawlers to slow the F down is one of the tools you can use to manage it. :-)


If your site is popular and you have a problem with crawlers, use robots.txt (in particular the Crawl-delay directive).
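
For example, a minimal robots.txt hint might look like the snippet below. Note that this is only advisory: Googlebot in particular ignores Crawl-delay, so for Google you'd fall back on the Search Console crawl-rate settings mentioned upthread.

    User-agent: *
    Crawl-delay: 10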

Also, for less friendly crawlers a rate limiter is needed anyway :(

(Of course the existence of such tools doesn't give any crawler carte blanche to overload sites... but say a crawler implements some sensing based on response times: response times only climb once the load is already significant, which can definitely raise some eyebrows, and with autoscaling can cost site operators a lot of money.)


I worked at a company back in 2005-2010 where we had a massive problem with Googlebot crawlers hammering our servers, stuff like 10-100x the organic traffic.

That's pre-cloud ubiquity so scaling up meant buying servers, installing them on a data center, and paying rent for the racks. It was a fucking nightmare to deal with.


"Rules for thee, but not for me"


This is one of the most important parts of the EU's upcoming Digital Services Act in my opinion. Platforms have to share data with (vetted) researchers, public interest groups and journalists.


For aggregated data and stats like this I think it could be fully publicly available.


Vetted always means people with the time, resources and desire to navigate through the vetting process, which makes them biased.


You might say the same thing about doing research in general


I would argue it's better than nothing, and what are they going to be biased towards?


Are you talking about Europe? They're certainly going to be biased against Google and any US tech giant.

I'm biased against Google, but I'm honest about it. I don't ask "what could I possibly be biased about?"


This would find things like unlisted videos which don’t have links to them from recommendations.


That’s a really good point. I wonder if they have an estimate of the percentage of YouTube videos that are unlisted.


This technique isn't new. Biologists use it to count the number of fish in a lake. (Catch 100 fish, tag them, wait a week, catch 100 fish again, and count the number of tagged fish in the second batch.)


That's essentially the Lincoln-Petersen estimator. You can use this type of approach to estimate the number of bugs in your code too! If reviewer A catches 4 bugs, and reviewer B catches 5 bugs, with 2 being the same, then you can estimate there are 10 total bugs in the code (7 caught, 3 uncaught) based on the Lincoln-Petersen estimator.
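
A quick sketch of that arithmetic (the function name is mine, purely for illustration):

    def lincoln_petersen(marked: int, caught: int, recaptured: int) -> float:
        # Classic mark-recapture estimate of total population size.
        return marked * caught / recaptured

    # The bug-review example: reviewer A "marks" 4 bugs, reviewer B
    # "catches" 5, and 2 of them overlap -> roughly 10 bugs in total.
    print(lincoln_petersen(4, 5, 2))  # 10.0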


A similar approach is “bebugging” or fault seeding: purposely adding bugs to measure the effectiveness of your testing and to estimate how many real bugs remain. (Just don’t forget to remove the seeded bugs!)

https://en.m.wikipedia.org/wiki/Bebugging


But this implies that all bugs are equally likely to be found, which I would highly doubt, no?


Yes, it's obviously not a perfect estimate, but can be directionally helpful.

You could bucket bugs into categories by severity or type and that might improve the estimate, as well.


Oh this is a really interesting concept.

I guess it underestimates the number of hard-to-find bugs, though, since it assumes the same likelihood of being found.


That's not actually the technique the authors are using. Catching 100 fish would be analogous to "sample 100 YouTube videos at random", but they don't have a direct method of doing so. Instead, they're guessing possible YouTube video links at random and seeing how many resolve to videos.

In the "100 fish" example, the formula for approximating the total number of fish is:

    total ~= (marked * caught) / tagged
    (where marked = caught = 100 in the example, and "tagged" is the number of tagged fish in the second catch)
In their YouTube sampling method, the formula for approximating the total number of videos is:

    total ~= (valid / tried) * 2^64
Notice that this is flipped: in the fish example the main measurement is "tagged" (the number of fish that were tagged the second time you caught them), which is in the denominator. But when counting YouTube videos, the main measurement is "valid" (the number of urls that resolved to videos), which is in the numerator.
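
As a tiny sketch (the name and numbers are mine, made up purely to show the shape of the formula), the video estimator is just:

    def estimate_total_videos(valid_hits: int, ids_tried: int) -> float:
        # Hit rate times the size of the 64-bit ID space.
        return valid_hits / ids_tried * 2**64

    # Hypothetical numbers for illustration only:
    print(estimate_total_videos(40, 10_000_000))  # ~7.4e13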


Did you understand where the 2^64 came from in their explanation btw? I would have thought it would be (64^10)*16 according to their description of the string.

Edit: Oh, because 64^10 * 16 = (2^6)^10 * 2^4 = 2^60 * 2^4 = 2^64


The YouTube identifiers are actually 64-bit integers encoded using URL-safe base64 encoding, hence the limited number of possible characters for the 11th position.
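
As a rough sketch of that mapping (assuming standard URL-safe base64, which matches the description above):

    import base64, secrets

    def random_video_id() -> str:
        # 64 bits = 8 bytes; URL-safe base64 of 8 bytes is 11 characters
        # plus one '=' of padding, which gets dropped.
        return base64.urlsafe_b64encode(secrets.token_bytes(8)).rstrip(b"=").decode()

    # The 11th character encodes only the final 4 bits of the integer,
    # so it can take just 16 of the 64 possible base64url values.
    print(random_video_id())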


Do you get the same 100 dumb fish?


Why are they dumb? Free tag.


Imagine being the only fish without a tag. Everyone at school will know how lame you are.


It would be illegal not to have a tag. If the fish has nothing to hide, it shouldn't worry about being tagged.

And, also, the fish gets tagged for its own good.


>Everyone at school will know how lame you are.

They'll even call you tinfoil fish.


This comment. Please see here.


Catching fish is theoretically not perfectly random (risk-averse fish are less likely to get selected/caught) but that's the best method in those circumstances and it's reasonable to argue that the effect is insignificant.


You make a very weak argument, and are simply assuming the conclusion.

What makes it the "best method"? Would it be better to use a seine, or a trap, or hook-and-line? How would we know if there are subpopulations that have different likelihood of capture by different methods?

To say it's "reasonable to argue that the effect is insignificant" is purely assertion. Why is it unreasonable to argue that a fish could learn from the first experience and be less likely to be captured a second time?

If what you mean is that it's better than a completely blind guess, then I'd agree. But it's not clearly the best method nor is it clearly unbiased.


Fair points. But mark-recapture is about practicality. It's not perfect, but it's a solid compromise between accuracy and feasibility (so I mean "best" in those regards, to be 100% clear). Sure, different methods might skew results, but this technique is about getting a reliable estimate, not pinpoint accuracy. As for learning behavior in fish, that's considered in many studies (along with many other factors, like those listed here: https://fishbio.com/fate-chance-encounters-mark-recapture-st... ), but overall it doesn't hugely skew the population estimates. So, again, it's about what works best in the field, not in theory.


In my experience conservation biologists are really good at finding animals in the wild. Much better than a typical SWE or typical business person.


Wouldn't a previously caught fish be less likely to fall for the same trick a second time?


only if you're within a 100 mile radius of me the ultimate dumb fish


I made the same connection, but it's still the first time I've seen it used for looking up IDs in reverse.


It's not even new in the YouTube space, as they acknowledge, citing this paper from 2011:

https://dl.acm.org/doi/10.1145/2068816.2068851


Also related is the unseen species problem (if you sample N things, and get Y repeats, what's the estimated total population size?).

https://en.wikipedia.org/wiki/Unseen_species_problem http://www.stat.yale.edu/~yw562/reprints/species-si.pdf
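
One classic estimator for that setup is Chao1, which extrapolates from how many items you've seen exactly once and exactly twice. A rough sketch (using the standard textbook formula, not anything from the linked paper):

    from collections import Counter

    def chao1(sample: list) -> float:
        counts = Counter(sample)
        s_obs = len(counts)                             # distinct species observed
        f1 = sum(1 for c in counts.values() if c == 1)  # seen exactly once
        f2 = sum(1 for c in counts.values() if c == 2)  # seen exactly twice
        if f2 == 0:
            return s_obs + f1 * (f1 - 1) / 2            # bias-corrected fallback
        return s_obs + (f1 * f1) / (2 * f2)

    # Three singletons (a, c, e) and two doubletons (b, d):
    print(chao1(["a", "b", "b", "c", "d", "d", "e"]))   # 5 + 9/4 = 7.25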


> You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.

Won't this mess up the stats, though? It's like a lake monster randomly swapping an untagged fish for a tagged one as you catch them.


Isn’t this just a variation of the Monte Carlo method?


That's only vaguely the same. It would be much closer if they divided the lake into a 3D grid and sampled random cubes from it.


I think YouTube locked down their APIs after the Cambridge Analytica scandal.


in the end, that scandal was the open web's official death sentence :(


The issue wasn't the analytics either. The issue was the engagement algorithms and lack of accountability. Those problems still exist today.


So as usual, the exploitative agents get to destroy the commons and come out on top.

We need to figure out how to target the malicious individuals and groups instead of getting so creeped out by them that we destroy most of the much-praised democratization of computing. Between this and the locking down of local desktop and mobile software and hardware, we've never gotten to the promised "bicycle for the mind".


no one promised you anything


And what kind of accountability is that? An engagement algorithm is a simple thing that gives people more of what they want. It just turns out that what we want is a lot more negative than most people are willing to admit to themselves.


I would rephrase that to 'what we predictably respond to'.

You can legitimately claim that people respond in a very striking and predictable way to being set on fire, and even find ways to exploit this behavior for your benefit somehow, and it still doesn't make setting people on fire a net benefit or a service to them in any way.

Just because you can condition an intelligent organism in a certain way doesn't make that become a desirable outcome. Maybe you're identifying a doomsday switch, an exploit in the code that resists patching and bricks the machine. If you successfully do that, it's very much on you whether you make the logical leap to 'therefore we must apply this as hard as possible!'


Engagement can be quite unrelated to what people like. A well-crafted troll comment will draw tons of engagement, but not because people like it.


If people didn't like engaging with troll comments, they wouldn't do it. It's not required, and they aren't getting paid.


This comment has a remarkable lack of nuance in it. That isn't even remotely close to how human motivation works. We do all kinds of things motivated by emotions that have nothing to do with "liking" it.


I don't think people "like" it as much as hate elicits a response from your brain, like it or not.

If people had perfect self-control, they wouldn't do it. IMO it's somewhat irresponsible for the algorithm makers to profit from that - it's basically selling an unregulated, heavily optimized drug. They downrank scammy content, for instance, which limits its reach - why not also downrank trolling? (Obviously because the former directly impacts profits and the latter doesn't, but still.)


This is really a childlike understanding of the world.


In which ways were the Cambridge Analytica thing and the openness of Youtube APIs (or other web APIs) related? I just don't see the connection


The original open API from Facebook was opened for the benefit of good actors who wanted to use the data. You can disagree with how it was used, but you can't disagree with the intention.

With the CA scandal, all the big companies now lock down their app data and sell ads strictly through their limited APIs, so ad buyers have much less control than before.

It's basically saying: you can't behave with open data, so now we'll only do business.


CA was about 3rd parties scraping private user data.

Companies are locking down access to public posts. This has nothing to do with CA, just with companies moving away from the open web towards vertical integration.

Companies requiring users to login to view public posts (Twitter, Instagram, Facebook, Reddit) has nothing to do with protecting user data. It's just that tech companies now want to be in control of who can view their public posts.


I'm a bit hazy on the details of the event but the spirit still applies: there was more access to data for uses that were not 100% profit-driven. Now it's locked down, as the companies want to cover their asses and don't want another CA.


Wasn't the "open" data policy what let Clearview AI build profiles and provide them to US government departments?


They actually held out for a couple of years after Facebook and didn't start forcing audits and cutting quotas until 2019/2020


[flagged]


It is a little more sophisticated. They say they use an exploit where a URL with five characters plus a dash gets autocompleted by YouTube (I wonder why that is). That apparently improves sampling efficiency by a factor of about 32,000.


First thing I thought of, 7 years old and more prophetic by the day: https://m.youtube.com/watch?v=7Pq-S557XQU

It's only a matter of time. Five years, twenty - we need to prepare for mass unemployment.


Not surprised to see this pop up. I interviewed for a position with Amazon a few months back to lead up a new research program to determine how they can better recruit and retain hourly workers. I had no intention of taking the job, but was curious so I took the interview. What stood out was just the sheer scale at which they're operating - they're literally up against the constraints of domestic labor supply. I have plenty of strong opinions about how they treat their workers and have no desire to work for such a company, but I was surprised to find that I did sympathize with them to an extent - it's not just about offering better pay and bathroom breaks, they're also on the verge of exhausting the viable labor market. I wish whoever took the job the best of luck - I hope that they're taking the research effort seriously and it's not just performance art.


The article suggests they are on the verge of exhausting the labor supply because so many people have already been there and left, and they can't get anyone to stay for more than a couple of years -- and that this high turnover has in the past been intentional on the employer's part.

If true, that puts a different light on things -- how the combination of having such a large labor force and a strategy to intentionally have high turnover combine to exhaust the labor supply, sure.


Oh don't get me wrong, much of it is definitely the result of their policies and could have been avoided if they hadn't treated their labor force like a discardable, consumable resource for years. I just thought it was interesting to see something pop up in the news that echoes what I got a glimpse of a few months ago - that Amazon is starting to recognize that they have a problem on their hands (finally). Just thinking about it in dispassionate scientific terms, it's a fascinating and unique problem - it might be too late for them to pivot and shift back towards "sustainable" practices, and if they fail I'm certainly going to enjoy the schadenfreude, but I'm really curious to see how they attempt to deal with all this.


>I interviewed for a position with Amazon a few months back to lead up a new research program to determine how they can better recruit and retain hourly workers.

Some companies take pride in promoting from within. McDonald's is well known for turning entry-level workers into store managers, then regional managers, and so on. Walmart does this, too.

Does Amazon not have such a culture?


They definitely have some fluff pieces talking about cases where it happens; no idea how common it is relative to those others, though.


Lol, not yet it seems


> also on the verge of exhausting the viable labor market [at this level of pay]


I mean, that's definitely a huge factor, but there IS a limit to how much Amazon can pay their hundreds of thousands of workers and still remain competitive - and I have NO idea where that threshold is, which is why I'm fascinated to see how this shakes out. Either Amazon workers start getting treated better which is great, or the company collapses and turns into an unprecedented case study of what not to do. Either way, I'm grabbing some popcorn!


I love Vaillant's work. His and many other studies definitely point to the importance of relationships as determinants of happiness. We found evidence of the same a few years ago - people who mention their friends and their spouse/partner rate their lives more highly: https://www.pewresearch.org/fact-tank/2018/11/20/americans-w...

What I'm really interested in, though, is not just what the formula for happiness is - it's whether people are actually tapping into the key components of that formula. That is, knowing that strong and healthy relationships bring meaning and satisfaction to life, to what extent do people have and prioritize those relationships?

The fact that our latest research shows family and friends ranking highly around the world suggests that many people are doing just that, but there are other places where there might be room for improvement. For example, compared to four years ago, Americans are half as likely to mention their spouse or romantic partner when describing where they find meaning in life: https://www.pewresearch.org/fact-tank/2021/11/18/where-ameri...


We started off by trying LDA and NMF, but the topics were too messy so we wound up switching to CorEx (https://github.com/gregversteeg/corex_topic), which is a semi-supervised algo that lets you "nudge" the model in the right direction using anchor terms. By the time our topics started looking coherent, it turned out that a regex with the anchor terms we'd picked outperformed the model itself. This case study was on a relatively small sample of relatively short documents (~4k survey open-ends) but for what it's worth, we also tried to use topic models to classify congressional Facebook posts (much larger corpus and longer documents) and the results were the same.
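
For anyone curious, here's a rough sketch of the anchored setup (based on the corextopic package's documented interface; the documents, anchors, and parameter values are just placeholders):

    from sklearn.feature_extraction.text import CountVectorizer
    from corextopic import corextopic as ct

    # Placeholder corpus standing in for short survey open-ends.
    docs = [
        "I spend time with my family and friends every weekend",
        "My doctor says my health has improved a lot",
        "Work has been stressful but my friends keep me sane",
    ]

    vectorizer = CountVectorizer(binary=True, stop_words="english")
    X = vectorizer.fit_transform(docs)
    vocab = list(vectorizer.get_feature_names_out())

    # "Nudge" topics toward domain concepts with anchor terms.
    anchors = [["family", "friends"], ["health", "doctor"]]
    model = ct.Corex(n_hidden=5, seed=42)
    model.fit(X, words=vocab, anchors=anchors, anchor_strength=3)

    for i, topic in enumerate(model.get_topics(n_words=8)):
        print(i, [term for term, *_ in topic])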

Overfitting is certainly part of the problem - in one of my earlier posts I talk about "conceptually spurious words," which are essentially the product of overfitting - but the more difficult problem is polysemy. I'm sure there are ways to mitigate that - expanding the feature space with POS tagging, etc. - but ultimately I think the solution is to simply avoid using a dimensionality reduction method for text classification. Supervised models are clearly the way to go - even if those "models" are just keyword dictionaries curated based on domain knowledge.


This is really great - and kudos for providing methodological details and code! I love seeing this kind of large-scale descriptive research, it's a real bummer that YouTube is starting to close up access to their API. We did a similar kind of analysis looking at videos posted by popular channels last year, including some analysis of keywords that boosted views - figure you might find it interesting (and I'd love to see if our findings hold up with your dataset!) https://www.pewresearch.org/internet/2019/07/25/a-week-in-th...


Thank you. The analysis you've done seems really interesting and gave me some more ideas to implement in future analyses.

For keywords that boost views, I'll run a similar analysis on my data and report the results to you here in another comment. Probably this weekend.

And I just heard from you that YouTube is starting to close up access to their API. I visit the YT API website from time to time but haven't noticed such a thing. It's really a bummer. They should be more open, not the opposite. Do you have a source where I can read more about that?


Would love to see what you come up with, will stay tuned!

As for the API restrictions, they aren't advertising it but about a year ago they started warning users about forthcoming extensive audits to maintain access, and about six months ago they started reducing access for API keys if you stopped maxing them out for a day or more. Our last API key got shut down for good a couple of weeks ago. We're going to try to fill out the form and get our access reinstated, but I'm not sure how willing they'll be to allow access for research. The form seems intended for client-facing apps. But who knows - Facebook/CrowdTangle/Twitter have been very supportive of legitimate research initiatives, I'm hoping YouTube follows that trend!


Hey, I was not able to do the analysis we talked about. The analysis received a lot of interaction after I published it here and I was contacted by many people, so I've been too busy. I hope I can do it soon. When I do, I'll notify you in a comment here. Or provide me with your email.


To clarify - the current API has a limit of 10,000 "query points" per day for new API keys (most endpoints cost 1-5 points per query). It used to be 1 million; they've since throttled everyone down and started forcing audits. 10k is still something, but it certainly doesn't allow large scale research.


It used to be 50 million - 5,000 times the current quota. I would have expected it to increase as bandwidth got cheaper, yet for whatever reason it has decreased.

https://stackoverflow.com/questions/15568405/youtube-api-lim...


Wow. That makes me sad and suspicious. Why would they close themselves to this degree? From 50 million to 10,000!


Wow, I had no idea. 1m was enough to work with, but... wow. Thanks for the reference.


It is really sad to see one of the biggest Internet platforms, and the biggest video platform, apply more restrictions. If they opened up their data more and imposed fewer restrictions, we would be sitting in front of a gold mine of data.

With the current restrictions, maybe multiple accounts + VPNs are essential for large-scale research.


Full report is here, in case it interests anyone: https://www.pewinternet.org/2019/07/25/a-week-in-the-life-of...


This is a very good report.

Makes me wonder why such data has to be generated by an external "fact tank" when, with some caveats, it could come directly from YouTube.


The YouTube Data API is a bit...arbitrarily fussy. (example: https://twitter.com/minimaxir/status/1154482894850498561)

I'm tempted to revamp my old scraper for it and open source it.

