> It compiles a custom kernel for every operation, allowing extreme shape specialization.
This doesn't matter. Just look at the performance achieved by CuDNN kernels (which back PyTorch), they're dynamically shaped and hit near peak. For dense linear algebra at the size of modern neural networks, optimizing for the loop bound condition won't help much.
> All tensors are lazy, so it can aggressively fuse operations.
This matters. PyTorch teams are trying to implement that now (they have LazyTensor, AITemplate, TorchDynamo), but I'm not sure of the status (it's been tried repeatedly).
> The backend is 10x+ simpler, meaning optimizing one kernel makes everything fast.
The first part of that sentence matters, the second part doesn't. Kernels are already fast and their reuse outside of being fused into each other (which you need a full linear algebra compiler to do) isn't very high. If you make sum fast, you have not made matrix multiplication fast even though MM has a sum in it. It just isn't that easy to compose operations and still hit 80+% of hardware efficiency.
But it is easier to iterate fast and build a seamless lazy compiler if your backend is simple. You can pattern match more easily and ensure you handle edge cases without insanely complicated things like alias analysis (which PyTorch has to do).
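To make that concrete, here's a toy sketch (entirely invented for illustration, not tinygrad's actual code) of why a tiny op vocabulary makes fusion a simple pattern match: with only a few op categories, "can these two nodes become one kernel?" is a set-membership test rather than alias analysis.

# Toy lazy graph; op names and structure are made up for illustration only.
ELEMENTWISE = {"RELU", "LOG", "ADD", "MUL"}

class LazyOp:
    def __init__(self, op, srcs=()):
        self.op, self.srcs = op, list(srcs)

def fuse(node):
    """Merge chains of elementwise ops into a single kernel description."""
    node.srcs = [fuse(s) for s in node.srcs]
    if node.op not in ELEMENTWISE:
        return node
    ops, leaves = [], []
    for s in node.srcs:
        if s.op in ELEMENTWISE:
            ops += getattr(s, "ops", [s.op])   # inline the child's op sequence
            leaves += s.srcs                   # and inherit its inputs
        else:
            leaves.append(s)
    fused = LazyOp(node.op, leaves)
    fused.ops = ops + [node.op]
    return fused

x, y = LazyOp("LOAD"), LazyOp("LOAD")
g = fuse(LazyOp("RELU", [LazyOp("ADD", [x, y])]))
print(g.ops)   # ['ADD', 'RELU'] -- one kernel instead of two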
While this is true for most common GEMM-looking ops, if you stray off the beaten path things get slow (odd channel sizes, batch sizes, etc...). Right now in PyTorch, GroupNorm is 2x slower than BatchNorm. There's no fundamental reason, just that the kernels loop over axes in a less-than-ideal order. Dynamic recompilation lets you change the loop order too, not just deal with boundary conditions.
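That gap is easy to measure if anyone wants to check it on their own hardware. A rough timing sketch (assumes a CUDA GPU; the tensor and layer shapes here are just picked for illustration):

import time
import torch
import torch.nn as nn

x = torch.randn(64, 128, 56, 56, device="cuda")
bn = nn.BatchNorm2d(128).cuda()
gn = nn.GroupNorm(32, 128).cuda()

def bench(mod, iters=100):
    for _ in range(10):          # warmup
        mod(x)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        mod(x)
    torch.cuda.synchronize()
    return (time.time() - t0) / iters * 1e3  # ms per forward pass

print(f"BatchNorm2d: {bench(bn):.3f} ms, GroupNorm: {bench(gn):.3f} ms")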
Yea, makes sense. I think there's something to be said for dynamic compilation solving this problem more elegantly than providing tons of hand-tuned kernels (PyTorch is 890MB lmao https://pypi.org/project/torch/#files), but I don't think it's a strict reason for a performance win.
> change the loop order too
Memory layout as well! I'm 100% for dynamic compilation, but I'm claiming that it really finds its stride when you fuse things.
Agreed. For anything at all common, most of the gains will be from fusion, the rest is just free. PyTorch also uses tons of GPU memory after only initializing, I wonder if it's copying all the kernels in?
Jax preallocates 90% of available GPU memory when the first operation is run to minimize allocation overhead. Can PyTorch grab that VRAM for a similar reason?
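For reference, that JAX behaviour is configurable through the documented XLA client environment variables, e.g.:

import os
# must be set before JAX touches the GPU
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"  # allocate on demand instead of grabbing 90%
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = ".50"   # or keep preallocation but cap the fraction
import jax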
Yes, PyTorch uses what they call a caching memory allocator[0]; basically it seems like they allocate a very large chunk of GPU memory and implement a heap on top of it. If needed, they expose some knobs and functions that let you control it and observe memory usage.
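A few of those knobs/observability hooks, for anyone curious (these are real torch.cuda APIs; assumes a CUDA build):

import torch

x = torch.randn(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated())  # bytes currently backing live tensors
print(torch.cuda.memory_reserved())   # bytes the caching allocator is holding from the driver
print(torch.cuda.memory_summary())    # detailed per-pool breakdown
torch.cuda.empty_cache()              # hand unused cached blocks back to the driver
# The allocator can also be tuned via the PYTORCH_CUDA_ALLOC_CONF env var,
# e.g. PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128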
Whole-net performance at comma: when we switch from BatchNorm to GroupNorm it adds 70ms to the training step time, and it's -70ms for no norm. We also wrote a custom AllNorm that's like 10% slower than BatchNorm (and I put several hours into trying to optimize it). Obviously not indicative of everyone's experience, but my point is that BatchNorm is hyperoptimized and the others, which are pretty much the same thing, aren't.
Thanks, that's certainly helpful anecdotal evidence. Yeah, it seems like there should be an "AllNorm" implementation that covers all cases and is just fast. I was wondering because I'm currently looking at math_group_norm, which was ported from PyTorch/XLA, and it results in a really weird decomposition that I'm astonished works at all. https://github.com/pytorch/pytorch/blob/master/aten/src/ATen...
I'm also wondering if the handcoded backward passes are actually "numerically correct", because e.g. epsilon doesn't appear in them at all. Someone worked out the gradients manually for BN here: https://web.archive.org/web/20180826123459/http://cthorey.gi...
You can clearly see epsilon appearing in the output. And of course there's the whole training vs. eval mode thing with BN which GN doesn't have.
Or, better, identifying that the machine has a primitive that is better than doing each op individually. For example, a multiply-accumulate instruction vs a multiply and separate accumulate. The source code still says "a*b+c", the compiler is just expected to infer the MAC instruction.
Yep! This is an assumed optimization when it comes to modern linear algebra compilers. New primitives go way beyond FMAs: full matrix multiplies on NVIDIA/Intel and outer-product accumulates on Apple silicon. It's also expected that these are used nearly optimally (or you've got a bug).
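As a tiny illustration (JAX used here only because it makes the traced program visible): the frontend keeps a*b+c as a separate mul and add, and mapping that onto FMA / MMA hardware is left entirely to the backend codegen.

import jax

def mac(a, b, c):
    return a * b + c

# The traced program is still mul-then-add; the backend decides whether it becomes an FMA.
print(jax.make_jaxpr(mac)(1.0, 2.0, 3.0))
# prints roughly: { lambda ; a b c. let d = mul a b; e = add d c in (e,) }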
The only thing I'd recommend is exposing "eval()" or something to let users tell you when they want you to evaluate things. It'll save a ton of time when it comes to hot-fixing performance and memory use issues. It's really hard to determine when to evaluate, and although it's a fun problem to figure out, it's nice to have an escape hatch for users to just tell you. (Flashlight has explored this and written about it here: https://fl.readthedocs.io/en/latest/debugging.html?highlight...)
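Something like this is what I mean, as a toy sketch (hypothetical names, numpy standing in for the backend; tinygrad today forces evaluation via .numpy(), as shown further down the thread):

import numpy as np

class Lazy:
    def __init__(self, thunk):
        self._thunk, self._val = thunk, None

    def __add__(self, other):
        return Lazy(lambda: self.eval() + (other.eval() if isinstance(other, Lazy) else other))

    def eval(self):
        # user-visible escape hatch: materialize here, cache the result
        if self._val is None:
            self._val = self._thunk()
        return self._val

x = Lazy(lambda: np.zeros(4))
y = (x + 1) + 2
y.eval()          # the user decides exactly when the pending graph gets run
print(y._val)     # [3. 3. 3. 3.]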
> It's extremely simple, and breaks down the most complex networks into 4 OpTypes:
>
> - UnaryOps operate on one tensor and run elementwise. RELU, LOG, RECIPROCAL, etc...
> - BinaryOps operate on two tensors and run elementwise to return one. ADD, MUL, etc...
> - ReduceOps operate on one tensor and return a smaller tensor. SUM, MAX
> - MovementOps operate on one tensor and move the data around, copy-free with ShapeTracker. RESHAPE, PERMUTE, EXPAND, etc...
>
> But how...where are your CONVs and MATMULs? Read the code to solve this mystery.
Ok, I was curious, so I read the code. The answer is that it represents a MATMUL as a 1x1 CONV. And it lied about CONV, which is a ProcessingOps.CONV and explicitly represented and implemented: https://github.com/geohot/tinygrad/blob/c0050fab8ff0bc667e40... Quite the letdown of figuring out this 'mystery'.
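To be fair to the README, the underlying claim does hold: a matmul is expressible with just the movement/elementwise/reduce primitives, even if the shipped code takes the explicit-CONV shortcut. A numpy sketch of the decomposition (mine, not tinygrad's code):

import numpy as np

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)

# RESHAPE/EXPAND A to (3, 4, 1) and B to (1, 4, 5), MUL elementwise, then SUM over k
out = (A[:, :, None] * B[None, :, :]).sum(axis=1)

assert np.allclose(out, A @ B)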
That's cool, am I right in assuming that you want to automate the production of efficient GPU (or other accelerator) code based on these low level primitives? But you would still need a piece of sorcery that can produce high performance OpenCL code, right? And that code could be different for every device, so you would need some trial and error, benchmark-based compilation at the very least. Or would OpenCL code be generated by hand for each device?
Working on parameterizing a search space that includes more than the local group size. The end dream is some ML guided search to optimize the kernels :)
OK, generally I think you're doing exactly what I believe ML is lacking right now. Another huge opportunity is, instead of taking the average neural network and designing accelerators for it, designing hardware-friendly networks that run well on a sane accelerator designed to work with only these specialised networks (one that doesn't need 80% of the chip area for on-chip memory, for example). These might end up being completely different networks to what researchers use today. I work in this area and I think it's also possible to use the loss function to optimise the network for a specific HW.
I've done some work in the past on NN representations and you actually can represent Conv and MatMul in more primitive ways. I ended up writing an IR called loop_tool that exposes this stuff:
There is a MAX but not a MIN? Is that because max(x,y) = -min(-x,-y)? But then why is there a SUB? Why is there a RELU if it's only max(0,x)? Maybe MIN is just too rare to be worth implementing?
We could have NEG instead of SUB, but with the constant folding it's a wash. DIV is already an HLOP with reciprocal (used to use POW, but that was slower). And what would you implement RELU in terms of?
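For anyone following along, the identities being discussed, spelled out in numpy (just for illustration):

import numpy as np

x = np.array([-2.0, -0.5, 1.0, 3.0])
y = np.array([ 1.0, -1.0, 0.5, 2.0])

# MIN recovered from MAX plus negation, so a MIN primitive isn't strictly needed
assert np.allclose(np.minimum(x, y), -np.maximum(-x, -y))

# RELU written via MAX against a broadcast zero -- possible, but it's such a common
# elementwise op that keeping it as its own unary primitive is cheap
relu = np.maximum(x, 0.0)
print(relu)   # [0. 0. 1. 3.]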
Just looking at the code from my phone, but it seems that the conv op calls another primitive and einsum, which I believe is just a fancy MUL with broadcasting? So it might still be technically correct?
Einsum is an expressive way of doing elementwise products and then possibly reducing them. An einsum is essentially a description of the dimensions of the input tensors and the dimensions of the resulting output after multiplication. If the output has reduced dimensions, then a summation is applied over them. The package einops provides reductions such as summation, averaging, and so on.
For example, the einsum "b k n p, k -> b k n p" broadcasts the second tensor b to b[None, :, None, None] and does elementwise multiplication. It can be changed to a vector product by writing "b k n p, k -> b n p", which for all intents and purposes is identical to a.transpose(0, 2, 3, 1) @ b.
I can easily recommend the einops package; using einsum simplifies things significantly.
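If anyone wants to convince themselves of the equivalence above, it checks out numerically (shapes picked arbitrarily):

import numpy as np

a = np.random.randn(2, 3, 4, 5)   # b k n p
b = np.random.randn(3)            # k

broadcast = np.einsum("bknp,k->bknp", a, b)   # elementwise scale per k
assert np.allclose(broadcast, a * b[None, :, None, None])

reduced = np.einsum("bknp,k->bnp", a, b)      # same, then sum over k
assert np.allclose(reduced, a.transpose(0, 2, 3, 1) @ b)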
I must say they gained instant credibility with the minimalistic website given how fast it loaded.
Code looks simple and easy to follow, and I love how the comments are constantly mentioning hardware characteristics, making maxing the hardware the goal. It seems that it’s trying to achieve this by jitting optimal code for the operations at hand rather than hand-optimizing kernels, and betting that the small number of operations will make tuning the codegen tractable.
I haven’t kept up much with what’s happening in ML, but at least in the realm of columnar database engines, interpreting a series of hand-optimized kernels seems to be the dominant approach over compiling a vectorized query plan. Are compilers good enough at optimizing ML operations that specializing on input shape makes a difference over hand-tuned kernels?
I like this too and I don't understand the downvotes. It says a lot about the philosophy of the project. Minimalist, bold, brutalist, no-frills first principles thinking. For better and worse.
There's an interesting roadmap in the "cherry" folder of the git repo[0]. It begins by bringing up a design on FPGA and ends with selling the company for $1B+ by building accelerator cards to compete with NVIDIA:
Cherry Three (5nm tapeout)
=====
* Support DMA over PCI-E 4.0. 32 GB/s
* 16 cores
* 8M elements in on board RAM of each core (288 MB SRAM on chip)
* Shared ~16GB GDDR6 between cores. Something like 512 GB/s
* 16x 32x32x32 matmul = 32768 mults
* 1 PFLOP @ 1 ghz (finally, a petaflop chip)
* Target 300W, power savings from process shrink
* This card should be on par with a DGX A100 and sell for $2000
* At this point, we have won.
* The core Verilog is open source, all the ASIC speed tricks are not.
* Cherry will dominate the market for years to come, and will be in every cloud.
* Sell the company for $1B+ to anyone but NVIDIA
As it was recently discussed at length here on HN [0] (401 comments), George Hotz (the lead of tinygrad) is taking time off his self-driving startup comma.ai [1]. Curious if this would help or hurt tinygrad progress.
How is it compared to JAX? After TensorFlow and PyTorch, JAX seems very simple, basically an accelerated numpy with just a few additional useful features like automatic differentiation, vectorization and jit-compilation. In terms of API I don't see how you can go any simpler.
I've just tried making a loop in a jit-compiled function and it just worked:
>>> import jax
>>> def a(y):
... x = 0
... for i in range(5):
... x += y
... return x
...
>>> a(5)
25
>>> a_jit = jax.jit(a)
>>> a_jit(5)
DeviceArray(25, dtype=int32, weak_type=True)
It definitely works, JAX only sees the unrolled loop:
x = 0
x += y
x += y
x += y
x += y
x += y
return x
The reason you might need `jax.lax.fori_loop` or some such is if you have a long loop with a complex body. Replicating a complex body many times means you end up with a huge computation graph and slow compilation.
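For completeness, the rolled-up version looks like this (jax.lax.fori_loop is the real API; the loop stays a single while-loop in the compiled program instead of five copies of the body):

import jax

@jax.jit
def a_rolled(y):
    # the body takes (loop index, carry) and returns the new carry
    return jax.lax.fori_loop(0, 5, lambda i, x: x + y, 0)

print(a_rolled(5))   # 25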
Fused into one operation since the Tensor isn't resolved until I call .numpy()
kafka@tubby:/tmp$ cat fuse.py
from tinygrad.tensor import Tensor
x = Tensor.zeros(1)
for i in range(5):
  x += i
print(x.numpy())
kafka@tubby:/tmp$ OPT=2 GPU=1 DEBUG=2 python3 fuse.py
using [<pyopencl.Device 'Apple M1 Max' on 'Apple' at 0x1027f00>]
**CL** 0 elementwise_0 args 1 kernels [1, 1, 1] None OPs 0.0M/ 0.00G mem 0.00 GB tm 0.15us/ 0.00ms ( 0.03 GFLOPS)
**CL** copy OUT (1,)
[10.]
He mentioned in a recent stream that he dislikes the complexity of the XLA instruction set used by JAX. So it's less the user-facing API, and more the inner workings of the library.
There are libraries like TensorFlow and PyTorch that allow the user to define their neural net in simple, readable Python code, and they internally "compile" and optimize your neural net to run on GPUs and such.
Tinygrad is like a very, very lean PyTorch with a different philosophy -- it intends to keep the codebase and API surface very very small and focus most of its energy on optimizing the way the output neural net runs on physical hardware.
The author, George Hotz, has observed in the last few years that neural net performance is hindered by lack of optimization here, particularly around memory accesses.
It's funny that geohot/tinygrad chooses to not meet the PEP8 standards [0] just to stay on brand (<1000 lines). Black [1] or any other python autoformatter would probably 2x the lines of code.
I understand that the Python code is mostly driving faster low-level code, but I wonder how much time is effectively wasted by not using a lower-level language.
From my experience with game engines, it often turns out to be a bad idea (for performance and maintainability) to mix C/C++ and Lua or C#.
I would argue that there are performance *benefits* for a developer in running Python code: due to how programs are run in Python (Jupyter notebooks), you can basically change the program on the fly instead of recompiling and restarting it, as you would with compiled languages. And yeah, the CPU does very, very little in modern DL workloads, and it is commonplace for CPU-side Python code to be jitted and vectorized, so the performance difference isn't as large as you would think.
Another benefit to interactivity is when exploring/using bad code. In academia, you'll often be importing the worst and least-well-documented code you've ever seen.
Being able to interactively experiment with someone's 500-line, 0-documentation function is often a better path to understanding than directly reading the code.
If you care exclusively about numerical stability and performance, why _this_ set of operators (e.g., there're plenty of good reasons to include expm1 or log1p, and certainly trigonometric functions)? It'd be an interesting research problem to measure and identify the minimal subset of operators (and I suspect it'd look different from what you'd expect from an FPU).
If you care exclusively about minimalism, why not limit yourself to the Meijer-G function (or some other general-purpose alternative)?
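The expm1/log1p part at least is easy to make concrete: for tiny x, computing 1 + x rounds away most of the information before log ever sees it.

import numpy as np

x = 1e-12
print(np.log(1 + x))   # several digits already lost to the 1 + x rounding
print(np.log1p(x))     # ~1e-12, accurate to full double precision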
It was OK as an educational tool, but now they don't count the GPU implementation in the 1000 lines, so it is not that small. Considering the code style, it is closer to 20k+ lines once formatted and with the GPU code included.
It also doesn't support bfloat16 so is doomed to be 2x slower.
The actual code of tinygrad is less than 5k lines. There are also 1600 lines of tests and around 2k lines of example models. And I didn't count the unfinished support for geohot's own neural network accelerator (the Verilog for that accelerator sits in the repo too), which is abandoned.
I believe neural networks are overhyped sometimes.
They are not always the best tool for the job. There are lots of other ML techniques such as SVM, naive Bayes, k-nearest neighbors, decision trees, logistic regression, random forests, etc. that nobody is using because they lack the hype factor.
If something lacks keywords like neural network, deep learning, or reinforcement learning, then it is deemed not cool.
What happens when your model exhibits a discriminating bias? How do you find out what is going wrong? Knowing what the model pays attention to can be pretty helpful.
Didn't all recommendation engines move to two-tower-like models? I remember that it "solved" the freshness problem (i.e. when adding a new item to your catalog, how do you recommend it to users if there are no ratings/interactions). Of course, as long as you have a good model that creates item embeddings.
Regarding time series, hasn't everyone moved to attention-based models?
Not challenging your answer, just curious. I work mostly with Graph NNs and quite a bit out of touch with the rest of the field.
The black box nature of a neural net is a problem. For model based design, a bit more accuracy out of a black box doesn't really help when you need, for example, state space matrices in a control design.
I'm no expert but can you show how those techniques can be used to solve the same problems NNs can? Like SOTA image recognition, chess / go, STT, TTS etc?
NN based sentiment analysis is certainly a lot better than non-NN based techniques.
Classification depends on the problem (and mostly the data size). Boosting is certainly competitive on tabular data and widely used everywhere I've worked.
No one talks about it (except on Kaggle) because it's pretty much at a local maximum. All the improvement comes from manual feature engineering.
But modern techniques using NNs on tabular data are competitive with boosting and do away with a lot of the feature engineering. That's a really interesting development.
> NN based sentiment analysis is certainly a lot better than non-NN based techniques.
I wouldn't say this. Sentiment analysis trained on the standard datasets is one place where performance is barely better than old-school linear classifiers. They remained brittle and easy to trick until recent flexible systems based on question answering, zero-shot entailment, or lotsa instruction finetuning (improving in that order). I strongly advise against using something fine-tuned solely on sentiment datasets. It'd be a total waste.
> Sentiment analysis trained on the standard datasets is one place where performance is barely better than old-school linear classifiers
Well yeah. But why would you do that?
Do what eveyrone does: Train on large scale a language corpus (or use a pre-trained model) then finetune for sentiment analysis.
> I strongly advise against using something fine-tuned solely on sentiment datasets
Did you mean trained on sentiment datasets? I agree with that.
Otherwise, well, [1] is a decent overview of the field. I think Document Vectors using Cosine Similarity [2] at 17 is the highest rated that isn't an NN trained on a large corpus and fine-tuned on the sentiment task. Even that uses document vectors that are trained on a large language corpus.
No, I meant finetuned. I also meant finetuned when I said trained. Experience with applying finetuned sentiment classifiers to real-world data found the gain vs. the cost of running them not to be worth it. They remain nearly as brittle as cheaper classifiers and have a habit of glomming too much onto certain adjectives. They are also prone to overfitting to the finetuning data's domain. Transformers trained not specifically on sentiment but on general domains like question answering or entailment are just leagues better for sentiment tasks.
But we aren't. Outside of using AEs for embeddings and then feeding them through a boosted tree model I don't know anyone using NNs for tabular data. We all use XGBoost or Catboost, etc.
Don't think you really know the field. On my team we almost exclusively use XGBoost or other boosted tree methods because it is typically the best model for tabular data. If we were working on CV or NLP that would be a different story and for that Neural Nets are by far the best models.
I come from probably the most atheistic country in the world (CZ)
Yet I had made this bashing comment about the Bible.
IMHO anyone can believe in whatever they want. Christianity, Islam, anything.
I (and I would say every friend of mine) don't care about what you believe in, but if you publicly preach some religion, prepare to be made fun of, or take a stand and try to defend your religion with arguments. But no blind faith here.
(Personally, if I like some religion it's Shinto.)
I think it's orthogonal. There are tons of smart people who believe in God. (Knuth has already been mentioned.)
If God wanted, He could make himself apparent to everyone. Clearly that isn't the case; there is room to doubt or to believe no matter how smart or accomplished you are.
The Bible is a very specific version of a creator that the world already gives you enough evidence to disprove ten times over. You could argue that the possibility of the Bible being divine is not zero, but it's less than any epsilon people care about. The reasons it endures are cultural, which is clear from its correlations with geography and demographics.
If anybody is dealing with procrastination, watch George Hotz live-streaming 10h straight working on this library [1][2]. Does he take some supplements to do this? There is even a 19.5h stream [3].
Actually, I have a local OBS setup to record myself; just instead of streaming, I do recordings for my own inspection. The important part is to do the inspection afterwards. It works wonders.
It isn't a 19.5-hour stream; I went and randomly clicked on a couple of timestamps and stumbled upon [1]. So it's two almost-10-hour-long streams put together because they are thematically similar.
>Does he take some supplements to do this? There is even a 19.5h stream [3]
It's probably the fact he has an audience. I can't speak for him, but that'd sure as hell light a fire under my ass—or at least significantly reduce procrastination.
Also, I find in periods where I've worked ~17 hours straight that the tiredness calms my brain to the point I'm normal and makes focus easy, albeit difficult in a different way due to fatigue. There's a weird drone zone there that's nice. Not something to make a habit of, though.
If you enjoy what you are doing it's pretty easy to work on something that long, I've had gaming sessions last as long back in the day with friends and some of those games are as demanding in terms of focus as programming.
The external motivation of having an audience would also help.
I think it's a combination of: 1) he's really passionate about what he's doing, 2) he sees the problem as a real challenge, 3) he doesn't have a corporate structure on his back giving him deadlines and pressure.
Is 10 hours really _that_ strange? You are (hopefully) focusing 8 hours "straight" during work _every day_.
If you watch Hotz's streams he takes small breaks to talk with chat and to meme around (just like everyone else during their work days) and he eats lunch and whatever (again just like everyone else).
What I'm trying to say is that Hotz isn't a superman on Adderall; he is just working on stuff he is excited about.
> You are (hopefully) focusing 8 hours "straight" during work _every day_.
I refuse to believe that 8 hours straight focus every day is common.
I have about 4 - 6 hours of really focused, deep work available. 6 hours if I am really interested in the project and 4 hours on normal days. The rest is low-focus work like writing email, planning ahead, attending workshops, reading up on updates for relevant libraries, reading documentation, etc.
When I was around 15 I used to do 10 hours of x86 assembly programming, and then several hours every day after school, for a month or so in a row. My parents would have to force me away from the computer.
I attribute it to a younger brain, NO internet and NO fun distractions. At 42 I just don't see how I did it, and I know I could never be that focused. Just sitting still for 4 hours makes me feel queasy now, and I need to use my physical body in some way.
Yea, I miss youth. I'm in my 30s and all-nighters are not the same anymore :(. When I was young I'd do 2-3 in a week and with a four-hour nap I would recover.
Now after those I lay down and I can't get up for a couple hours with all this aching in my limbs lol.
I'd say 99.9% of the western world's workforce doesn't focus for 8 hours straight in the work day - it's not really even possible to in a great many jobs where there's context switching (meetings etc.)
The bigger problem is recording taking too much CPU. That's why I don't record the full work day, just chunks, whenever I feel I'm procrastinating. YouTube is an option here; I've tested it, however the CPU problem doesn't go away.
My plan is to make this obs-ndi plugin work on Ubuntu, so I will be able to record on Ubuntu and take the load off the Mac, which is my primary laptop.
PS. I forgot to read the obs-ndi instructions properly; it works OK, so now I can delegate recording to the second laptop.
It wouldn't be huge. I once recorded a week of me using my PC (so around 80 hours in total), and it was sub-100 gigs. It was 1080p with decent quality; don't remember the FPS though.
I just quickly loop through them and categorize chunks of time, just to see how I work. I have an iPhone as input, put on the table to the right, which also captures my posture - I slouch almost all the time.
There is a problem however - I work on an M1 Mac and OBS recordings take a full 4 out of 8 cores, so I actually only record when I notice I am starting to procrastinate. I remove all recordings afterwards to save space. OBS is turned on all the time though.
I wanted to use obs-ndi to combine output from another laptop, but I have some issues with it so I just record the Mac atm. I also have a powerful desktop on the side and can ssh between all of those by name, but the desktop is noisy so it's off most of the time. There is also a Raspberry Pi with a simple script with which I can turn on the desktop remotely via a Wake-on-LAN UDP packet, with DNS handled via https://www.noip.com, which my router has an integration with, but I've actually never used it. I did this setup to justify purchasing the powerful desktop in the first place :) humble brag, I know.
I think posts like this are only getting upvotes because George Hotz owns the project. I do see value in simple code, but the constraint of 1000 LOC makes little sense to me, especially when the code is formatted poorly.
This will get downvoted, but reading the comments here I don't understand the cult/respect for him.
Siding with the most successful CTF team ever (PPP), he won DEF CON two times.
He made a startup with funding that makes a cool 'niche' product.
I just think guys like Chris Lattner or Dave Cutler, who made so much impact on real computing, deserve so much more respect, but I guess the norm here is to admire this guy.
Your list lacks the reason for his initial fame: iPhone and PS3 jailbreaking.
And I think you're downplaying the achievements of Comma AI -- it may still be somewhat niche, but its product is better than Tesla Autopilot for highway driving (they aren't there on city driving / FSD yet), all with an absolutely tiny team.
Re: Comma AI. This is what it tells me about my run-of-the-mill Toyota:
> openpilot upgrades your Toyota Highlander Hybrid with automated lane centering at all speeds, and adaptive cruise control that automatically resumes from a stop.
Both are annoying artificial limitations Toyota put in place, presumably to avoid abuse by inattentive drivers.
I mean it can't change lanes. What does it do exactly?
Comma AI deliberately made lane change require a small bit of human intervention for safety reasons. The human hits the blinker and gives the wheel a tiny nudge in the direction, and then openpilot will complete the lane change and resume driving in the new lane.
The theory is that at the current level of ability of software like Tesla's and Comma's, it's probably a good idea for a human to be paying more attention during a lane change maneuver. Comma is of the opinion that the level of autonomy Teslas have is probably unnecessarily unsafe. Comma cares a lot about safety (e.g. they have much more sophisticated driver monitoring than Tesla).