
Hey, thanks for noticing - here's the MIT-licensed library it's based on: https://github.com/expectedparrot/edsl


Very cool - I've been working on an open-source Python package that lets you do some similar things (https://github.com/expectedparrot/edsl).

Here's an example of the Enron email demo using the edsl syntax/package & a few different LLMs: https://www.expectedparrot.com/content/6607caa1-efc5-439f-85...
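
If you want a feel for the syntax, here's a rough sketch of posing one question about an email to a couple of models (class and method names are from memory of the edsl docs, so treat them as approximate and check the repo for the current API):

    # Rough sketch; API names (QuestionMultipleChoice, Scenario, Model, .by()/.run())
    # are from memory of the edsl docs and may differ slightly from the current release.
    from edsl import QuestionMultipleChoice, Scenario, Model

    q = QuestionMultipleChoice(
        question_name="is_spam",
        question_text="Is the following email spam? {{ email }}",
        question_options=["Yes", "No"],
    )

    scenarios = [Scenario({"email": text}) for text in ["Free money!!!", "Meeting at 3pm?"]]
    models = [Model("gpt-4o"), Model("gpt-4o-mini")]  # example model names

    results = q.by(scenarios).by(models).run()
    results.select("model.model", "scenario.email", "answer.is_spam").print()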


Thanks for sharing! It handled the emails very well.


That is very cool, thank you for sharing.


Thanks! Because it got some positive reaction here, I did a little thread on how you can turn this flow into an API: https://x.com/johnjhorton/status/1823672992624242895



Is it so implausible that the training process that creates LLMs might learn features of human behavior that could then be uncovered via experimentation? I showed, empirically, that one can replicate several findings in behavioral economics with AI agents. Perhaps the model "knows" how to behave from these papers, but I think the more plausible interpretation is that it learned about human preferences (against price gouging, status quo bias, & so on) from its training. As such, it seems quite likely that there are other latent behaviors captured by LLMs and yet to be discovered.
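
To make that concrete, here's a minimal sketch (not the code from the paper) of endowing an agent with a "persona" and posing a Kahneman-Knetsch-Thaler-style price-gouging scenario via the OpenAI chat API; the model name and the exact wording are just placeholders:

    # Minimal sketch (not the paper's actual code): give an LLM a persona, pose a
    # fairness scenario, and read off its judgment. Model name is a placeholder.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    personas = [
        "You are a socialist who is skeptical of the price system.",
        "You are a libertarian who believes prices should clear markets.",
    ]
    scenario = (
        "A hardware store has been selling snow shovels for $15. The morning after "
        "a large snowstorm, the store raises the price to $20. Rate this action as: "
        "Completely Fair, Acceptable, Unfair, or Very Unfair."
    )

    for persona in personas:
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[
                {"role": "system", "content": persona},
                {"role": "user", "content": scenario},
            ],
        )
        print(persona, "->", reply.choices[0].message.content)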


> As such, it seems quite likely that there are other latent behaviors captured by LLMs and yet to be discovered.

>> What NN topology can learn a quantum harmonic model?

Can any LLM do n-body gravity? What does it say when it doesn't know, or doesn't have confidence in its estimates?

>> Quantum harmonic oscillators have also found application in modeling financial markets. Quantum harmonic oscillator: https://en.wikipedia.org/wiki/Quantum_harmonic_oscillator

"Modeling stock return distributions with a quantum harmonic oscillator" (2018) https://iopscience.iop.org/article/10.1209/0295-5075/120/380...

... Nudge, nudge.

Behavioral economics: https://en.wikipedia.org/wiki/Behavioral_economics

https://twitter.com/westurner/status/1614123454642487296

Virtual economies do afford certain opportunities for economic experiments.


The potential hole in your thinking is the end of your paper, where you advise how to get good answers: ask questions in an economist-PhD style! This presents a problem that is left unaddressed.


Are you referring to this: "What kinds of experiments are likely to work well? Given current capabilities, games with complex instructions are not presently likely to work well, but with more advanced LLMs on the horizon, this is likely to change. I should also note that research questions like what is “the effect of x on y” are likely to work much better than questions like “what is the level of x?.” Consider that in my Kahneman et al. (1986) example, I can create AI “socialists” who are not too keen on the price system generally. If I polled them about who they want for president, there is no reason to think it would generalize to the population at large. But if my research question was “what is the effect of the size of the price increase on moral judgments” I might be able to make progress. That being said, it might be possible to create agents with the correct “weights” to get not just qualitative results but also quantitatively accurate results. I did not try, but one could imagine choosing population shares for the Charness and Rabin (2002) “types” to match moments with reality, then using that population for other scenarios." --- To clarify, this is about what research questions are likely to work well here, not what questions posed to LLMs will work well.
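
On the "match moments" point at the end of that passage, here's a toy sketch of what I have in mind (the per-type behaviors and the observed moment are made up for illustration):

    # Toy sketch: choose population shares over behavioral "types" so the simulated
    # aggregate matches an observed moment. All numbers are invented for illustration.
    import numpy as np
    from scipy.optimize import minimize

    # fraction of each type predicted to choose the "generous" option in some game
    type_behavior = np.array([0.9, 0.5, 0.1])  # e.g., altruists, reciprocators, selfish
    observed_share = 0.42                      # moment from a real experiment

    def loss(x):
        shares = np.abs(x) / np.abs(x).sum()   # keep shares on the simplex
        return (type_behavior @ shares - observed_share) ** 2

    res = minimize(loss, x0=np.ones(3) / 3)
    shares = np.abs(res.x) / np.abs(res.x).sum()
    print("calibrated type shares:", shares.round(3))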


By posing research questions, you get research conclusions from the same field of study. The whole thing is not a model of human thinking in the text world, but rather a model of economic research papers.


I'm sorry, I don't follow - is your claim that when, say, an AI agent exhibits status quo bias in responding to decision scenarios (e.g., a preference for options posed as the status quo relative to a neutral framing - Figure 3), the reason this happens, empirically, is that the LLM has been trained on text describing status quo bias? E.g., like saying that if an apple fell to the ground in a game, it was because the physics engine had been programmed w/ the laws of gravity?


You are posing questions to the AI that only economists ever ask. You think you are instructing it to reason “as a libertarian”, but you are actually using so much economics lingo that the AI is effectively answering “based on economist descriptions of libertarian decision making, what decision should the AI make?”

Imagine this scenario. You have a group of students and you teach them how libertarians, socialists, optimists, etc. empirically respond to game theory questions. For the final exam, you ask them “assuming you are a libertarian, what would you do in this game?” Now the students mostly get the answers right according to economic theory. By teaching economic theory and having students regurgitate the ideas on an exam, the exam results provide nothing new for the field of economics. The AI is answering questions just like the students taking the final exam.

It would be like me teaching my child lots of things, and then, when my child shares my own opinions, taking that as evidence my beliefs are correct. Since I already believe my beliefs are correct, it is natural, but incorrect, to think the child’s utterances offer confirmation.


Got it - so it is the "performativity critique" - the idea that the LLM "knows" economic theories and responds in accordance with those theories. I don't think that's very likely because (a) econ writing is presumably a tiny, tiny fraction of the corpus and (b) it would imply an amazing degree of transfer learning - e.g., it would know to apply "status quo bias" (because it read the papers) to new scenarios. But as the paper makes clear, you can't use it to "confirm" theories; rather, you use it like economists use other models - to explore behavior and generate testable predictions cheaply that you can then go test with actual humans in realistic scenarios. The last experiment in the paper is from an experiment in a working paper of mine. There's no way the LLM knows this result, but if I had reversed the temporal order (create the scenario w/ the LLM, then run the experiment), it could have guided what to look at. That's what's likely to be scientifically useful. Anyway, thanks for engaging.


Yeah - so I think this is worth exploring. Given how many tokens you can jam in the prompt even w/ GPT3, I think you could do some pretty complex game play, at least compared to what is typical in the lab - e.g., I think you could easily have it remember how 100 or so other agents behaved in some kind of public goods game.
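
Something like the following sketch (not working experiment code; the model name and wording are placeholders), where the other agents' history gets summarized into the prompt:

    # Sketch of the "jam the history into the prompt" idea: summarize how 100 other
    # agents behaved in a public goods game and ask the model for its own move.
    import random
    from openai import OpenAI

    client = OpenAI()

    history = [random.randint(0, 10) for _ in range(100)]  # past contributions out of 10 tokens
    summary = (
        f"In the last round, 100 players contributed to a public pot "
        f"(mean {sum(history) / len(history):.1f} of 10 tokens; min {min(history)}, max {max(history)}). "
        "Contributions are doubled and split equally among all players."
    )
    prompt = summary + " You have 10 tokens. How many do you contribute? Reply with a number."

    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    print(reply.choices[0].message.content)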


Reminds me of Facebook's CICERO model:

> Facebook's CICERO artificial intelligence has achieved “human-level performance” in the board game Diplomacy, which is notable for the fact that it's a game built on human interaction, not moves and manoeuvres, like, say, chess.

It was the same scenario - many agents, multiple rounds, complex dialogue-based interactions.


That's actually a pretty cool analogy; even the decision-making is arguably quite close to how human decision making actually happens (which involves a lot more exchange of words than just transmitting coded information like "accept proposal to exchange X of good Y for Z monetary units"). It might be a bit tricky to get an AI to really "understand" the implications of its response, but it's cool as a thought experiment.


oh hey - it's my paper! If anyone is interested in exploring these ideas, feel free to get in touch (@johnjhorton, https://www.john-joseph-horton.com/). FWIW - I think it would be really neat to build a Python library w/ some tools for constructing & running experiments of different kinds. I think the paper only scratches the surface of what's possible (esp. once GPT4-ish has an API).


This is really cool. I had a similar thought that GPT3 could be used to simulate political polling. A few weeks ago I tried telling GPT3 that it was part of a specific demographic (age, gender, race, income, political leaning, etc) and then asked it how it would respond to certain political questions (I tried gun control, immigration, abortion and some other issues). GPT3 was able to change its answers in believable ways depending on what demographic I instructed it to be.

My thinking was that this could be used as a quick polling test to see how the real population may respond to certain new ideas.

More work would need to be done to calibrate it, as without specific demographic details the answers tended to be liberal-leaning. But it's an interesting idea that could be used to create instant focus tests on any number of topics.


> GPT3 was able to change its answers in believable ways depending on what demographic I instructed it to be.

Doesn't this just mean that your own preconceptions about those demographics match the language model's preconceptions? How would we know that it matches reality when presented with novel ideas/concepts that we want to get feedback on?


That's part of what I mean by it needing to be calibrated. Initially some polling could be done with real people and the GPT agents. Whatever calibration factors are needed to make those two line up could then be used when asking the GPT agents novel questions.
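
Concretely, a toy version of that calibration step might look like this (all numbers are made up, and a real version would need proper weighting and uncertainty estimates):

    # Toy sketch: estimate a per-demographic offset on a benchmark question where
    # real polling exists, then apply the same offset to simulated answers on a
    # novel question. All numbers are invented.
    benchmark_real = {"18-29": 0.55, "30-64": 0.48, "65+": 0.40}  # real poll, share agreeing
    benchmark_gpt  = {"18-29": 0.70, "30-64": 0.52, "65+": 0.35}  # GPT "agents", share agreeing

    offsets = {g: benchmark_real[g] - benchmark_gpt[g] for g in benchmark_real}

    novel_gpt = {"18-29": 0.62, "30-64": 0.45, "65+": 0.30}       # simulated, novel question
    novel_calibrated = {g: novel_gpt[g] + offsets[g] for g in novel_gpt}
    print(novel_calibrated)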


> I had a similar thought that GPT3 could be used to simulate political polling.

There's a good paper for that task. See my other post: https://news.ycombinator.com/item?id=34385489


You might like to include this experiment in future research: https://www.science.org/doi/10.1126/sciadv.1600451


that's a great suggestion - thanks!


I'm one of the authors of the original paper the article is based on. Here's a link to the actual research: https://www.john-joseph-horton.com/papers/schumpeter.pdf


It wasn't dishonest - they didn't have the data for those multi-site employers b/c of how Washington state keeps records. The authors are upfront about this and try to address the limitation by adding a survey component---they talked to these employers to see if their responses to the minimum wage were different.


I agree with that, so maybe my post wasn't as clear as it could be.

I would say that in the transmission between the study and the mainstream press, something happened that is best described as "dishonest", or at the very least a reckless disregard for nuance. As you point out, and I agree, the study itself doesn't have that problem. The study itself is quite clear on its limitations.

So where's the issue? Is it just the general incompetence of the mainstream press (I could absolutely believe that)? Is it the authors crossing every 't' in the study, but then overstating the findings and dropping the caveats in their interviews with the media? Is it the UW PR staff overstating the findings in the way PR staffs are wont to do? Or a combination of all of these?

I could believe any of those things. But when this study came out a few days ago, I had to read the study itself to have even a remote awareness of what it actually said (and remember, I think the findings are probably correct). And yes, I think that process is best described with the word "dishonest".


Mentioned above, but the survey data indicate the effects on multi-site employers were likely more negative.

http://m.startribune.com/seattle-study-shows-low-wage-jobs-d...


Yeah - I think this is just inherent in the transmission of academic work to a broader audience. Nuance and caveats get stripped away, in part because the journalists themselves aren't capable of assessing these features.


Full disclosure: I worked at oDesk for 2 years as their staff economist and still consult with the company (I'm now a professor at NYU Stern).

A few points on issues raised in this thread:

(1) I can assure you---and I really should do some blog posts on this---that clients/employers are not nearly as price-sensitive as people believe. When you try to model employers choosing whom to hire from their pool of applicants, you have to work really hard in specifying the model to get demand curves to slope downward. In other words, "price" often gets the wrong sign, meaning it looks like the higher the bid, the more likely an applicant is to get hired. What's going on is that clients can and will pay more for better, more experienced developers, and these developers bid accordingly (see the toy simulation below, after these points). However, as a freelancer, finding those kinds of employers can be a challenge.

To help deal with this search problem, we asked employers up front to state their relative preferences over price and quality. For example, employers could state they were looking for high quality at a high price, or less experienced workers at a lower price. During the experimentation phase, we randomized whether these employer/client preferences were revealed to applying freelancers. We found that we could induce substantial sorting by freelancers to job/employers of the "right" type, raising wages and project sizes at the high end. This feature is now universal and helps freelancers get in front of clients that are a good fit for them.

(2) After very extensive experimentation, oDesk did in fact impose a minimum wage. It was set at a price point that improved the quality of people getting hired, but was not so high that jobs weren't being filled. Obviously, this level of minimum wage doesn't touch the high end of the market, but picking a minimum wage is a real balancing act, and setting it too high can definitely price some work out of the market. It also "pulls up the ladder" and prevents new workers from getting started in the market, which oDesk understandably wants to avoid.

(3) There is a problem with too many low-quality applications. The problem is similar to what's going on in college admissions---because it's cheap to apply, people apply to almost every school, whether or not it's a good fit. oDesk recently started using the Elance "Connects" system, which imposes a meaningful quota on applications, and that in turn seems to be improving application quality. It is a hard problem, though, because if you set the quota too high, you get the bad spam equilibrium, and if you set it too low, you choke off liquidity.

(4) There is a problem with inflated reputations---it's a general problem in online marketplaces, particularly those with bilateral reputation systems. However, oDesk has done something quite clever that seems to be working well: collecting private feedback from both workers and clients after a contract and then eventually disclosing non-identifiable aggregates of those ratings to future employers/workers (a stylized sketch of the idea is below). These ratings are far more truthful on a host of measures, are harder to subvert by begging for good feedback, and, so far, aren't getting inflated.
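
On point (1), here's a toy simulation (synthetic data, not oDesk data) of why the naive regression gets the sign "wrong": employers in the simulation genuinely dislike higher bids, but better developers bid more, so hiring appears to rise with the bid until you control for quality:

    # Toy simulation of "price gets the wrong sign": higher-quality applicants bid
    # more AND get hired more, so a naive logit of hiring on the bid alone shows an
    # upward-sloping "demand curve". Controlling for quality flips the sign.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 5000
    quality = rng.normal(size=n)
    bid = 20 + 8 * quality + rng.normal(scale=2, size=n)               # better devs bid more
    utility = 2.0 * quality - 0.1 * (bid - 20) + rng.logistic(size=n)  # employers do dislike price
    hired = (utility > 0).astype(int)

    naive = sm.Logit(hired, sm.add_constant(bid)).fit(disp=0)
    full = sm.Logit(hired, sm.add_constant(np.column_stack([bid, quality]))).fit(disp=0)
    print("naive bid coefficient:     ", naive.params[1])  # comes out positive
    print("controlled bid coefficient:", full.params[1])   # negative, the "right" sign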
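
And on point (4), a stylized sketch of the kind of mechanism I mean (this is not oDesk's actual implementation; the disclosure threshold and the rounding are invented):

    # Stylized sketch: ratings are collected privately and only a coarse aggregate
    # is disclosed, and only once enough ratings exist that no single rating can be
    # traced back to a particular client. The threshold is made up.
    from collections import defaultdict

    MIN_RATINGS = 5  # illustrative disclosure threshold
    private_ratings = defaultdict(list)

    def record_private_rating(freelancer_id, score):
        private_ratings[freelancer_id].append(score)

    def public_aggregate(freelancer_id):
        scores = private_ratings[freelancer_id]
        if len(scores) < MIN_RATINGS:
            return None  # not enough history to disclose without identifying anyone
        return round(sum(scores) / len(scores), 1)

    record_private_rating("dev_42", 4)
    record_private_rating("dev_42", 5)
    print(public_aggregate("dev_42"))  # None until MIN_RATINGS ratings have accrued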


If you're looking for a discussion of running economics-style experiments on MTurk (as well as some advice on keeping online experiments internally valid): http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1591202

