That's wonderful. I'd like that to not change the order of the letters, but change the highlight order: do a round 1 in frequency order first (just do the first, say, 6 letters), then do a round 2 which is standard order.
I probably am not making much sense. Look at where I'm coming from in the world of Assistive Tech: https://docs.acecentre.org.uk/products/echo (go to around the 5 minute mark in the video).
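Roughly what I mean, as a throwaway sketch (made-up frequency order, not anything Echo actually does):

```python
# Throwaway sketch, not anything Echo actually does. The displayed letters
# stay in their usual order; only the scan/highlight order changes.
STANDARD_ORDER = list("abcdefghijklmnopqrstuvwxyz")
FREQUENCY_ORDER = list("etaoinshrdlcumwfgypbvkjxqz")  # rough English frequency

def highlight_order(n_frequent=6):
    # Round 1: the n most frequent letters; round 2: a full pass in standard order.
    return FREQUENCY_ORDER[:n_frequent] + STANDARD_ORDER

print(highlight_order()[:9])  # ['e', 't', 'a', 'o', 'i', 'n', 'a', 'b', 'c']
```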
- I don't like having to put {{ ref('source') }} everywhere. I think the tool should parse dependencies automatically. I wrote more about this here: https://maxhalford.github.io/blog/dbt-ref-rant/
- I don't like the idea that each .sql file has to have an associated .yml file. It feels better to have everything in one place. For instance, with lea you can add a @UNIQUE tag as an SQL comment to unit test a column for uniqueness.
Moreover, although dbt brought a shift in the way we do data (which is great), it's very straightforward under the hood. It boils down to parsing queries, organizing them in a DAG, and processing said DAG. dbt feels bloated to me. Also, it seems to me that some of the newer cool features are going to be put behind a paywall (e.g. metric layers).
Thanks for the answer! Congrats!
I never went very deep on dbt, but I can't help but think:
"2022: dbt raises at a $4.2b valuation.
2023: Hi there! I built a dbt lite over the weekend because I was frustrated with ref('source'), and it does the job for us!"
Hard to understand the defensibility and valuation
Hey there HN. lea is a tool we developed over the past year at Carbonfact. Carbonfact is a platform that helps fashion brands decarbonize. We believe in doing this in a data-driven way, and lea is a cornerstone for us.
Hey, great work. Do you think this algorithm would be amenable to being run online? I'm the author of River (https://riverml.xyz), where we're looking for good online clustering algorithms.
Definitely possible, but it would require some extensions to the algorithm. More specifically, as new datapoints enter the stream, they could be compared with the existing medoids to see if swapping them would lower the clustering loss.
This would be a nontrivial engineering effort and I likely won't be able to do it myself (I'm a PhD student about to graduate), but if you or your team is interested in adapting BanditPAM to the streaming setting, please feel free to reach out! My email's motiwari@stanford.edu
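To give a flavour of what I mean (a rough sketch, not BanditPAM itself, and ignoring the bandit-based sampling that makes it efficient):

```python
import numpy as np

def maybe_swap_medoid(medoids, window, x_new, max_window=1000):
    """Rough streaming sketch: when a new point arrives, check whether using it
    as a medoid in place of an existing one lowers the clustering loss over a
    sliding window of recent points. Not BanditPAM itself."""
    window = np.vstack([window, x_new])

    def loss(candidates):
        # Sum of distances from each windowed point to its nearest medoid.
        dists = np.linalg.norm(window[:, None, :] - candidates[None, :, :], axis=-1)
        return dists.min(axis=1).sum()

    best, best_loss = medoids, loss(medoids)
    for i in range(len(medoids)):
        candidate = medoids.copy()
        candidate[i] = x_new  # try swapping the new point in for medoid i
        candidate_loss = loss(candidate)
        if candidate_loss < best_loss:
            best, best_loss = candidate, candidate_loss
    return best, window[-max_window:]
```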
In hindsight, this is harder than it seems if you don't already have access to data, so I'll try to be as informative as possible in my response.
It's hard to find this information out there, so here's ~all you need to know.
Data is usually behind paywalls, unfortunately. Industry standards are the Bloomberg Terminal (ridiculously expensive, 5 digits $), FactSet (very expensive, 4 digits), Capital IQ (expensive, not sure)... but there are a number of up-and-coming startups trying to disrupt the space, so you may be able to grab data from them. I think https://atom.finance has a 7-day free trial that you could use to play around.
P/E simply means the company's _P_rice per share divided by _E_arnings per share. Cancel out the "per share" terms and you get total market capitalization (which is the value of the total equity) divided by net income (since "earnings per share" really means "net income per share")
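In code, with made-up numbers just to show that the two formulations are the same thing:

```python
# Hypothetical numbers, purely to illustrate the identity.
price_per_share = 60.0
shares_outstanding = 4_300_000_000
net_income = 9_500_000_000

eps = net_income / shares_outstanding          # earnings per share
market_cap = price_per_share * shares_outstanding

pe_from_per_share = price_per_share / eps      # P / E
pe_from_totals = market_cap / net_income       # market cap / net income
# Both come out to ~27.2x.
```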
So the "P" is easy to get. It's your Adj Close.
The "E" is trickier as it can mean a lot of things. Diluted EPS from financial statements? Last year's EPS? Management's guidance for EPS? None of those are actually correct even if they are all "EPS"
Importantly--and contrary to 99% of the info you will find online--the most relevant EPS numbers are forward estimates of EPS, usually for the next twelve months ("NTM"). Those are based on an average or median of analyst estimates, which is called "consensus". These are analysts from financial institutions who build their own little models based on their own views of where the business is going to go, informed by recent earnings, management's color in earnings calls and filings, etc.
Believe it or not, as hairy as that sounds, EPS is fairly easy to get as it's a metric that has less room for interpretation than, say, EBITDA.
So you're not going to go out there, read all these (paid) analyst reports, find their EPS, calculate the median, etc. Bloomberg, Capital IQ, FactSet do this for you and it's easily observable for the end user (that's their business).
The thing is, as you may have guessed, "next twelve months" is a moving target across time. Analysts usually provide estimates for the current fiscal year (i.e. FY 2023, ending 12/31/2023 for most companies) and the following year, ending 12/31/2024. Let's call these FY0_EPS and FY1_EPS, for simplicity.
You might be tempted to just take a moving average of these two estimates, so that on 1/1/2023 it is 100% of FY0_EPS + 0% of FY1_EPS, on 1/2/2023 it is 99.9% + 0.1%, and you gradually "move forward in time" as the days pass. That sort of works (and definitely checks the box for a proof-of-concept like in your post), but for the sake of completeness, I'll just say that the right-er approach is to only "move forward in time" when new earnings are released. So it doesn't matter if we're on 1/1/2023 or 2/1/2023--what matters is the latest reported quarter. Take Coca-Cola for instance (https://www.bamsec.com/companies/21344/coca-cola-co). Let's roll the tape backward one year. They reported FY 2021 earnings on 2/22/2022, at which point analysts published new estimates in revised models, so from that day forward until the next quarterly earnings we take 100% FY0_EPS + 0% FY1_EPS, where these correspond to estimates for FY 2022 and FY 2023, respectively.
On 4/1/2022, Coca-Cola reported Q1 2022 results, analysts published new estimates, and we now take 75% FY0_EPS + 25% FY1_EPS. On 7/1/2022, we move forward another quarter, so 50% + 50%, then 25% + 75% starting on 10/26, and then back to square one with 100% + 0%, except FY0_EPS now means FY 2023 (vs. FY 2022 previously) and FY1_EPS means FY 2024.
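In code, that weight schedule is basically (a sketch, keyed off how many quarters of the current fiscal year have been reported):

```python
def ntm_weights(quarters_reported):
    """Sketch: 0 quarters reported (right after the annual report) -> 100%/0%,
    then 75%/25%, 50%/50%, 25%/75% after each quarterly release."""
    w1 = quarters_reported / 4   # weight on next fiscal year's estimate (FY1_EPS)
    w0 = 1 - w1                  # weight on current fiscal year's estimate (FY0_EPS)
    return w0, w1

for q in range(4):
    print(q, ntm_weights(q))  # (1.0, 0.0), (0.75, 0.25), (0.5, 0.5), (0.25, 0.75)
```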
So your table is something like (I'm making up numbers)
With that, you can take NTM0_Weight and NTM1_Weight to calculate NTM_EPS by multiplying those weights by FY0_EPS and FY1_EPS. And then you can take AdjClose / NTM_EPS to calculate the P/E.
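With a hypothetical row in place of the table (made-up numbers again):

```python
# Made-up values standing in for one row of the table above.
adj_close = 60.0
fy0_eps, fy1_eps = 2.40, 2.60          # consensus EPS estimates for FY0 and FY1
ntm0_weight, ntm1_weight = 0.75, 0.25  # e.g. one quarter into the fiscal year

ntm_eps = ntm0_weight * fy0_eps + ntm1_weight * fy1_eps  # 2.45
pe_ntm = adj_close / ntm_eps                             # ~24.5x
```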
Why is this useful? Because in theory you can take the average P/E of companies X, Y and Z in one industry and compare it to a fourth company W. Is W's P/E multiple above or below the industry average? You now know if they are over or undervalued, respectively, which means you know if you should buy or sell that stock (if you believe you picked the right "comparable" companies in that industry)
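A toy version of that comparison, with illustrative multiples:

```python
# Hypothetical NTM P/E multiples for three comparable companies X, Y and Z.
peer_pe = {"X": 18.0, "Y": 22.0, "Z": 20.0}
industry_average = sum(peer_pe.values()) / len(peer_pe)  # 20.0x

w_pe = 16.0  # company W's NTM P/E
if w_pe < industry_average:
    print("W trades below its peers -> potentially undervalued")
else:
    print("W trades above its peers -> potentially overvalued")
```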
This is just one example... there are all sorts of similar analyses done daily in the financial services industry. I'm not saying it's easy to extract alpha from trading on these, but that's the framework
My pleasure! I've spent the better part of the last decade doing this stuff and I appreciate how hard it is to find resources on it, so I thought I'd share since you mentioned you were interested in learning.
See PDF page 14. Note the lines called "Composite P / NTM EPS", which they built as a blend of American Eagle's, Tilly's, and Zumiez's P/E multiples, which are companies X, Y, and Z in my comment above (for some reason they gave AE double the weight, which is unusual), and compared it to Heat's P/E multiple (Heat was the codename for the retailer Rue21, or hypothetical company W in my example above).
That's too bad, I would have expected it to work out of the box. Other than rewriting the query in a different way, I'm not sure I see an easy workaround. Are you still working on this?
Hehe I was wondering if someone would catch that. Rest assured, I know the difference between online and stochastic gradient descent. I admit I used stochastic on Hacker News because I thought it would generate more engagement.
What are some adversarial cases for gradient descent, and/or what sort of provenance information (e.g. DVC.org or W3C PROV) should be tracked for a production ML workflow?
We built model & data provenance into our open source ML library, though it's admittedly not the W3C PROV standard. There were a few gaps in it until we built an automated reproducibility system on top of it, but now it's pretty solid for all the algorithms we implement. Unfortunately, some of the things we wrap (notably TensorFlow) aren't reproducible enough due to some unfixed bugs. There's an overview of the provenance system in this reprise of the JavaOne talk I gave: https://www.youtube.com/watch?v=GXOMjq2OS_c. The library is on GitHub: https://github.com/oracle/tribuo.
I agree. Databases are going to be here for a long time, and we're barely scratching the surface of making people productive with them. dbt is just the beginning.
I'm watching it, it's really good. Montana makes a great point: you can move data to the models, or move the models to the data. Data is typically larger than models, so it makes sense to go with the latter.
Thanks for watching! I should really up the production quality haha, but this is what I can kinda manage with my existing workload. I don't know how the pro YouTubers make these calls interesting.