Anyone who wants to demystify ML should read The StatQuest Illustrated Guide to Machine Learning [0] by Josh Starmer.
To this day I haven't found a teacher who could express complex ideas as clearly and concisely as Starmer does. It's written in an almost children's-book-like format that is very easy to read and understand. He also just published a book on neural networks that is just as good. Highly recommend even if you are already an expert, as it will give you great ways to teach and communicate complex ideas in ML.
> Europe’s dependence on the United States for its security means that the United States possesses a de facto veto on the direction of European defense. Since the 1990s, the United States has typically used its effective veto power to block the defense ambitions of the European Union. This has frequently resulted in an absurd situation where Washington loudly insists that Europe do more on defense but then strongly objects when Europe’s political union—the European Union—tries to answer the call. This policy approach has been a grand strategic error—one that has weakened NATO militarily, strained the trans-Atlantic alliance, and contributed to the relative decline in Europe’s global clout. As a result, one of America’s closest partners and allies of first resort is not nearly as powerful as it could be.
This guide doesn't tell you what to do; it gives you much of the lingo and a bunch of framework choices that you can use, or that your existing teammates may already be using.
Lots of Pandas hate in this thread. However, for folks with lots of lines of Pandas in production, Fireducks can be a lifesaver.
I've had the chance to play with it on some of my code; queries that ran in 8+ minutes came down to 20 seconds.
Re-writing in Polars involves more code changes.
However, with Pandas 2.2+ and Arrow, you can use .pipe to move data to Polars, run the slow computation there, and then zero-copy back to Pandas. Like so...
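A minimal sketch of the pattern, assuming Arrow-backed dtypes (the group-by here is a hypothetical stand-in for whatever the actual slow computation is):

import pandas as pd
import polars as pl

# Arrow-backed dtypes make the pandas <-> Polars round trip (near) zero-copy
df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]}).convert_dtypes(
    dtype_backend="pyarrow"
)

result = (
    df.pipe(pl.from_pandas)          # pandas -> Polars
      .group_by("key")               # hypothetical slow computation, run in Polars
      .agg(pl.col("value").sum())
      .to_pandas(use_pyarrow_extension_array=True)  # Polars -> pandas, zero-copy
)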
OOP is an industry of its own which generates a ton of incidental complexity. See "Object-Oriented Programming is Bad" by Brian Will (https://www.youtube.com/watch?v=QM1iUe6IofM) and most of Rich Hickey's excellent videos, especially his keynote at RailsConf 2012, where he basically told the Ruby crowd they're doing it wrong (https://www.youtube.com/watch?v=rI8tNMsozo0).
1. Live coding, in Zoom or in person. Don't play gotcha on the language choice (unless there's a massive gulf in skill transference, like a webdev interviewing for an embedded C position). Pretend the 13 languages on the candidate's resume don't exist. Tell them it can be any of these x languages, where x is every language you, the interviewer, feel comfortable writing leetcode in.
2. Write some easy problem in that language. I always go with some inefficient layout for the input data, then ask for something that's only one or two for loops away from a stupidly simple brute-force solution; a good, hygienic layout of the input data would have made it a single hashtable lookup (see the sketch at the end of this comment).
3. Run the 45 minute interview with a lot of patience and positive feedback. One of the best hires in our department had first-time interview nerves and couldn't do anything for the first 10 minutes. I just complimented their thinking-out-loud, laughed at their jokes, and kept them from overthinking it.
4. 80% of interviewees will fail to write a meaningful loop. For the other 20%, spend the rest of the time talking about possible tradeoffs, anecdotes they share about similar design decisions, etc. The candidate will think you're typing up their scoring criteria on your laptop, but you already passed them and generated a pop-sci personality-test result for them of questionable accuracy. You're fishing for specific things to support your assessment, e.g. that they're good at both making and reviewing snap decisions and in doing so successfully saved a good portion of interview time, which contributed to their success. If it uses a weasel word, it's worth writing down.
5. Spend an hour (yes, longer than the interview) (and yes, block this time off in your calendar) writing your interview assessment. Start with a 90s-television-tier assessment. For example: the candidate is nimble, constantly creating compelling technical alternatives, but is not focused on one, and they often communicate in jargon. DO NOT WRITE THIS DOWN. This is the lesson you want the geriatric senior management to take away from reading your assessment. Compose relatively long (I do 4 paragraphs minimum) prose that describes a slightly less stereotyped version of the above with plenty of examples, which you spent most of the interview time specifically fishing for. If the narrative is contradicted by the evidence, it's okay to re-write the narrative so the evidence fits.
6. When you're done, skim the job description you're hiring for. If there's a mismatch between that and the narrative you wrote, change your decision to no hire and explain why.
Doing this has gotten me eye rolls from coworkers but compliments at director+ level. I have had the CTO quote me once in a meeting. Putting that in my performance review packet made the whole thing worth it.
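To make step 2 concrete, here's a hypothetical example of the kind of problem I mean (data and names invented for illustration):

# Input arrives as a flat list of (user, item) pairs, so the brute-force
# answer is one simple loop...
orders = [("alice", "book"), ("bob", "pen"), ("alice", "mug")]

def purchases(orders, user):
    result = []
    for u, item in orders:
        if u == user:
            result.append(item)
    return result

print(purchases(orders, "alice"))  # ['book', 'mug']

# ...whereas a hygienic layout of the same data would have made it a single
# hashtable lookup:
from collections import defaultdict

by_user = defaultdict(list)
for u, item in orders:
    by_user[u].append(item)

print(by_user["alice"])  # ['book', 'mug']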
You can configure git to use custom diff tools; I take advantage of this with the following in my .gitconfig:
[diff "pdf"]
command = ~/bin/git-diff-pdf
And in my .gitattributes I enable the above with:
*.pdf binary diff=pdf
~/bin/git-diff-pdf does a diff of the output of `pdftotext -layout` (from poppler) and also runs pdf-compare-phash.
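A rough sketch of what such a driver can look like (illustrative, not the exact script; the pdf-compare-phash invocation in particular is assumed):

#!/usr/bin/env python3
# Sketch of a git external diff driver for PDFs. git invokes it as:
#   <path> <old-file> <old-hex> <old-mode> <new-file> <new-hex> <new-mode>
import difflib
import subprocess
import sys

old, new = sys.argv[2], sys.argv[5]

def pdf_text(path):
    # `pdftotext -layout <pdf> -` (poppler) prints the extracted text to stdout
    result = subprocess.run(["pdftotext", "-layout", path, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines(keepends=True)

# Textual diff of the extracted text
sys.stdout.writelines(difflib.unified_diff(pdf_text(old), pdf_text(new),
                                           fromfile=old, tofile=new))

# Also compare perceptual hashes of the rendered pages (invocation assumed)
subprocess.run(["pdf-compare-phash", old, new])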
To use this custom diff with `git show`, you need to add an extra flag (`git show --ext-diff`), but `git diff` uses it automatically.
LiveEO | Berlin, Germany | Hybrid or REMOTE in EU | Full-time
Hiring manager here. We're building e2e applications using ML/AI and satellite imagery to identify deforestation in EU supply chains - and more.
First of all, apologies to those of you who haven't received a response yet; I was a bit overwhelmed by the number of applications. On the Data Scientist position: we are looking for someone who brings additional remote sensing/satellite imagery expertise into our team, so I'm sorry if I haven't followed up with you and you applied without this background.
We need people who know how to build efficient and performant ML pipelines for our data scientists (experience with any sort of computer-vision data cubes is beneficial; it doesn't need to be satellite imagery) and for e2e production inference pipelines, engineers who implement complex supply chain business logic using nestjs, engineers who build wonderful user interfaces in react, and data scientists who solve really hard problems in satellite imagery segmentation using deep learning:
- Staff/Senior Data Scientists with experience in remote sensing, satellite imagery, computer vision, time-series image classification/data hypercubes and foundation models (not yet online, please send me an email, see below)
Please reach out to me with your CVs and questions: sven dot mesecke @ company domain
> So much of science became only possible because of better instruments.
I would argue a stronger claim: experimental confirmation of theories and better measurements must always bootstrap each other. The history of temperature has many examples [1].
I'll give some more detail. concurrent.futures is designed to be a new consistent API wrapper around the functionality in the multiprocessing and threading libraries. One example of an improvement is the API for the map function. In multiprocessing, it only accepts a single argument for the function you're calling so you have to either do partial application or use starmap. In concurrent.futures, the map function will pass through any number of arguments.
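For example (a minimal sketch; `power` is a made-up function):

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Pool

def power(base, exp):
    return base ** exp

if __name__ == "__main__":
    # multiprocessing.Pool.map passes a single argument, so multi-argument
    # functions need starmap (or functools.partial):
    with Pool() as pool:
        print(pool.starmap(power, [(2, 3), (3, 2)]))      # [8, 9]

    # concurrent.futures takes one iterable per argument, like built-in map:
    with ProcessPoolExecutor() as executor:
        print(list(executor.map(power, [2, 3], [3, 2])))  # [8, 9]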
The API was designed to be a standard that could be used by other libraries. Before, if you started with threads and then realised you were GIL-limited, switching from the threading module to the multiprocessing module was a complete change. With concurrent.futures, the only thing that needs to change is:
with ThreadPoolExecutor() as executor:
    executor.map(...)
to
with ProcessPoolExecutor() as executor:
    executor.map(...)
The API has been adopted by third-party modules too, so you can do Dask distributed computing with:
with distributed.Client().get_executor() as executor:
    executor.map(...)
or MPI (via mpi4py) with
with MPIPoolExecutor() as executor:
    executor.map(...)
Just one aspect of the huge, tragicomic waste that resulted from the medical establishment's position that airborne transmission couldn't be real because it reminded everyone too much of miasma theory.
Frankly, even though no one has asked for my opinion, the amount of effort people put into the code of conduct (and the discussion around it) is ridiculous; utterly ridiculous. In a time long ago, before everyone became incapable of compromising and cooperating with one another, we had two rules in an informal code of conduct for interacting with one another:
1. Be cool
2. Don't be an asshole
And then we all went back to trying to achieve the task at hand.
One of my favorite ideas in Structure is Kuhn's observation that contradictory evidence seldom suffices to overturn a scientific theory. Contradictory evidence is the norm. What's required to overturn a scientific theory is both contradictory evidence and a better theory - and even then adoption is often grudging and uneven.
I've found that idea to be very valuable when navigating disagreements at work (and in life generally). Refutation is seldom constructive in isolation - you also need to bring a constructive alternative if you want to keep things moving forward.
I guess that sounds like common sense, but I'm surprised by how often people cleave to exclusively negative argumentation (particularly in political discussions).
I'm a professional forecaster (i.e. getting paid for it) at a large e-commerce company. We have extensive experience with Prophet and a host of other approaches (all the traditional models in Hyndman's book/R package, some scattered LSTM/NN implementations). Here's my quick take (the article is a lot more extensive than the median blogpost, and likely warrants a more extensive study than I have time for right now.)
Prophet's main claims ("Get a reasonable forecast on messy data with no manual effort. Prophet is robust to outliers, missing data, and dramatic changes in your time series.") are surely exaggerated. As the article shows, time series come in many different shapes, and many of them are not handled properly.
It deals well with distant-past or middle-of-the-sample outliers, but not with recent outliers. It cannot deal with level changes (as opposed to trend/slope changes). None of this should be a surprise if you take some time to understand the underlying model, which unlike most neural nets is very easy to completely understand and visualise: it's really a linear regression model with fixed-frequency periodic components (for yearly and weekly seasonality) and a somewhat-flexible piecewise-linear trend. The strong assumption that the trend is continuous (with flexible slopes that pivot around a grid of trend breakpoints, which are trimmed by regularisation) accounts for most of the cases where the forecasts are clearly wrong.
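As a toy sketch of that structure (illustrative only; Prophet itself adds regularisation on the changepoint slopes and fits via a Stan backend rather than plain least squares):

import numpy as np

def design_matrix(t, changepoints, yearly_order=10, weekly_order=3):
    cols = [np.ones_like(t), t]                    # intercept + base slope
    for cp in changepoints:                        # slope pivots at changepoints
        cols.append(np.maximum(0.0, t - cp))
    for k in range(1, yearly_order + 1):           # yearly Fourier terms
        cols += [np.sin(2 * np.pi * k * t / 365.25),
                 np.cos(2 * np.pi * k * t / 365.25)]
    for k in range(1, weekly_order + 1):           # weekly Fourier terms
        cols += [np.sin(2 * np.pi * k * t / 7),
                 np.cos(2 * np.pi * k * t / 7)]
    return np.column_stack(cols)

t = np.arange(3 * 365, dtype=float)                # three years of daily data
y = 0.05 * t + 5.0 * np.sin(2 * np.pi * t / 365.25) + np.random.randn(t.size)
X = design_matrix(t, changepoints=np.quantile(t, np.linspace(0.1, 0.9, 25)))
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # plain least squares here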
That said, it does occupy a bit of a sweet spot in commercial forecasting applications. It's largely tuned for a few years of daily data with strong and regular weekly and yearly seasonalities (and known holidays), or a few weeks/months of intraday data with intraday and weekday seasonalities. Such series are abundant in commerce, but a bit of a weak spot for the traditional ARIMA and seasonal exponential smoothers in Hyndman's R package, which tended to be tuned on monthly or quarterly data, where Prophet often performs worse. In our experience, for multiple years of daily commercial-activity data, there are no automated approaches that easily outperform Prophet. You can get pretty similar (or slightly better) results with Hyndman's TBATS model if you choose the periodicities properly (not surprising, as the underlying trend-season-weekday model is pretty similar to Prophet's, but a bit more sophisticated). Some easy wins for the Prophet devs would probably be to incorporate a Box-Cox step in the model and a short-term ARMA error correction; then the model really resembles TBATS. You can usually get better results with NNs that are a bit more tuned to the dataset. But if you know nothing a priori about the data except that it's a few years of sales data, your fancy NN will probably resemble Prophet's trend-season-weekday model anyway.
All of these assume that we're trying to forecast any time series' future only from its own past. If you want to predict (multiple) time series using multiple series as inputs/predictors, that's a whole new level of difficulty. I don't know of a good automatic/fast/scalable approach that properly guards against overfitting. Good results for multiple-input forecasting approaches probably require some amount of non-scalable "domain knowledge".
This always comes up in these threads and I always wonder if the commenters ever actually used Usenet.
DejaNews isn't and never was Usenet, it was an archive, and Google Groups was just another Usenet client. Google Groups embrace-extend-extinguished Usenet as much as Gmail embrace-extend-extinguished email, and it got some cachet from having historical posts.
You can still use Usenet as much as you could 20 years ago, and while it was nice a decade or two ago to be able to browse historical threads in google groups, now the Internet Archive has an excellent Usenet archive[1] so we don't have to trust a giant corporation with ADD to hold onto history for us.
Usenet still exists and is actually getting better with time. There's not much spam, and the only people left are geeks who care about the topic. The Eternal September is over and it's time to start again. You can get text-only access for your Usenet client (or use a web interface) at https://www.eternal-september.org/ . I post on alt.startrek and alt.cyberpunk every now and then and get decent conversations. I also posted back in the 90s.
Usenet need not rely on Google Groups' mangling of DejaNews.
My startup is working on novel microalgae photobioreactors - we are raising an angel/pre-seed round right now.
Our system is far more energy- and space-efficient than conventional PBRs and raceways - we do this via a proprietary mechanism that greatly increases the surface-area-to-volume ratio of liquid water in our reactors. Feel free to drop us a line at info[at]skyfarmclimate.tech
Please allow me to state a more refined stance than either Sabine's or the usual perception of the kind of criticism levelled against theoretical physics in particular. I'm going to use biochemistry and several other fields as examples to compare and contrast what happens when "religiosity" creeps into a field.
Something that stuck with me was an article[1] describing how certain fields of medicine had languished, whereas others had made dramatic strides forward during the same period. For example, psychology is basically 90% hogwash. Meanwhile, the study of the brain's biochemistry and therapeutic medication-based remedies for many common brain dysfunctions have made huge strides forward in recent decades.
Back in the mid 20th century, the only treatment for most brain-related problems was "talk to a psychiatrist". This generally did nothing except help psychiatrists rack up billable hours.
When medications eliminated the "need" for one-on-one therapy sessions, how do you think it went, trying to convince psychiatrists that their entire lucrative field is a giant waste of time?
Yeah.
Precisely how you think it went.
Similarly, when I was at university in the late 1990s, the entire field of AI research revolved around logic alone. As in, some person or a few people had decided that the path to general intelligence was to be via "fuzzy logic", or "predicate logic", or "probabilistic logic", or... well... some kind of logic at any rate. We had to learn Lisp, Prolog, Boolean algebra, and so on.
That entire path was a giant waste of time, but... none of it was wrong. Not one bit. They were correct, confidently making slow but steady progress, and they weren't even going in the wrong direction! They were advancing AI! Just... very slowly, and with no chance of achieving the modern miracles of AI such as text-to-image generation (DALL·E, Imagen, Stable Diffusion, etc.).
Let's circle back to Sabine's criticism of theoretical physics.
They're the same.
They're confident that they're correct because they're not wrong!
Sabine doesn't argue that they're wrong. I also don't believe physicists are wrong, and I've even studied physics to a graduate level, agreeing with at least 99% of it.
What I believe is that theoretical physics specifically has gone down a parallel path that leads to a dead end.
The path leads in the right direction, and every step along it leads to forward progress.
The issue is that there are people that simply refuse to believe that maybe, just maybe, we're all stuck in a dead end.
Quantum Mechanics researchers especially love to point to a small number of successes such as the precision numerical calculations related to the anomalous magnetic dipole moment of the electron as proof that they're on the right path. This is cheerfully ignoring the minor detail that the equivalent calculation for the muon has a 3.5 standard deviation error when compared to reality. Oops. Just ignore that. We're on the right path, remember, because the step we took with the electron was forward!
This is the problem.
I remember my professors being so earnest about AI and logic, and how "one day", there will be intelligent robots programmed in Prolog.
I also remember my Quantum Mechanics lessons where the professor very earnestly told me that microscopic systems change magically only when you look at them, which is patent nonsense, but is repeated to this day, just like Freud's idiocy is still taught to psychiatrists decades after much of their profession has been largely superseded by therapeutic drugs.
[1] I wish I could dig it up, but I read it over a decade ago. The gist is pretty clear though, in that not all "sciences" are equal, and small branches even within the same larger field are often dramatically more effective and/or correct due to "cultural" reasons, or a better approach. The article listed many examples, but the criticism of certain areas of medicine and psychology were the most scathing. For a similar rant, see Richard Feynman's take on the topic: http://people.cs.uchicago.edu/~ravenben/cargocult.html
> But for subjects like Civil, Electrical, sciences like Physics, Biology, I suspect the story is quite different.
I can speak to Biology; most undergrads are left with little opportunity, as most roles require a graduate degree to do anything meaningful when the economy is actually favorable, and they often seek roles outside their field when it's not. As an undergrad who was lucky enough to enter the field in the wake of the GFC for long enough to just pay off my student debt, I've met so many biologists/biochemists who either couldn't find anything in their field, or simply gave up trying and sought the highest-paid and somewhat stable role they could get 'that required a degree.' It's quite clear that the system has failed, and that it is designed to onboard ever more STEM grads even as ever-growing numbers of degree holders chase a fixed number of open roles.
Tech is unique in that it is one of those few industries where the barrier to entry is rather low, but attrition and competence level vet viable candidates almost entirely on their own--the gatekeeping interview process is still an unnecessary hurdle.
Personally, for all the controversy Eric Weinstein has generated, directly and indirectly, through his own actions, the concept of Embedded Growth Obligations was incredibly well researched and is consistent with what we have seen unfold since 2005, solidifying in the financial collapse of 2008 [0]:
> Embedded Growth Obligations are the way in which institutions plan their future predicated on legacies of growth. And since the period between the end of World War II in 1945 and the early 70s had such an unusually beautiful growth regime, many of our institutions became predicated upon low-variance technology-led, stable, broadly distributed growth. Now, this is a world we have not seen in an organic way since the early 1970s, and yet, because it was embedded in our institutions, what we have is a world in which the expectation is still present in the form of an embedded growth obligation. That is, the pension plans, the corporate ladders, are all still built very much around a world that has long since vanished.
> We have effectively become a Growth Cargo Cult. That is, once upon a time, planes used to land in the Pacific, let's say, during World War II, and Indigenous people looked at the air strips and the behavior of the air traffic controllers, and they've been mimicking those behaviors in the years since as ritual, but the planes no longer land. Well, in large measure, our institutions are built for a world in which growth doesn't happen in the same way anymore.
I am using C++ for sioyek, which is a PDF reader designed for reading research papers and textbooks: https://sioyek.info/
Other libraries used:
* MuPDF for PDF rendering
* Qt5 for UI
* sqlite for database
I also recently added a plugin system which is language-neutral, but I wrote a Python wrapper around it (see an example here: https://sioyek.info/#extensible).
Doing it the other way around makes more sense, i.e. writing SQL to generate code. See https://github.com/kyleconroy/sqlc. Why aren't more people following this approach?
I disagree. I think we should generalize more, beyond codes of conduct: If you are building an adjudication process for resolving non-criminal personal conflicts (whether that be a Code of Conduct, an HR department, a Title IX proceeding, a professional organization, or something else), you should take a look at Anglo-derived common law and the safeguards against abuse that have been evolved over the centuries.
That doesn't mean everything needs to go through the courts; it means that if your process allows something that Anglo common law does not allow, you should have a good answer for why that is. Does it allow anonymous accusations? Is the accused allowed to know the charges against them, before a finding of guilt is rendered? Is there a presumption of innocence? Is the accused allowed to have a trusted third party - one who knows the rules of the game - to advocate on their behalf? Who, exactly, is responsible for deciding matters of fact vs matters of "law"? Is there an appeals process to fix possibly incorrect decisions?
Going by the linked document by Valerie Aurora, a good Code of Conduct allows anonymous accusations, the accused does not get to know the charges against them before a finding is rendered, there is no presumption of innocence, the accused does not get a third party advocate, matters of fact are necessarily decided by the same committee that makes the rules, and there is no appeal process.
This doesn't mean that such a committee will always do wrong. But I think it's worth thinking about how people operating in bad faith (either on the committee, or reporters to it) can abuse those features to achieve goals that are not actually aligned with what the Code of Conduct is trying to do. Yes, it's true that people can't be put in jail for these sorts of things, but a poorly-run adjudication process can have significant negative personal and financial effects on people.
"use cached versions of pre-computed artifacts" This is really nice. I wonder if this also covers partial pre-computations, for example when the same subquery is reused across several pipelines.
For example, one of the "data thugs" (James Heathers) wrote a tool that could detect impossible numbers in psych papers, like means/std devs that couldn't possibly have been the result of any allowable combination of results. Some very high percentage of papers failed; I think it was 40%.
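A minimal sketch of that kind of consistency check (this is the idea behind the GRIM test; assuming integer-valued responses, a mean of n values rounded to two decimals must be reachable from some integer sum):

def grim_consistent(reported_mean, n, decimals=2):
    # The underlying sum of n integer responses must itself be an integer,
    # so test the integer sums closest to reported_mean * n.
    target = reported_mean * n
    return any(round(k / n, decimals) == round(reported_mean, decimals)
               for k in (int(target), int(target) + 1))

print(grim_consistent(3.48, 25))  # True: 87 / 25 == 3.48
print(grim_consistent(3.47, 25))  # False: no sum of 25 integers gives 3.47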
And of course psychology isn't the worst. Epidemiology is worse than psychology. The methodological problems there are terrifying. Good luck getting public policy to even accept that it's happened, let alone do anything about it.