
For the 404s (assuming the status code isn't a 4xx), request a URL that you strongly suspect won't exist, then compare that response (Levenshtein distance, bag of words, etc.) against the site's about, ideas, etc. pages. If one of them is very similar to it, that page is probably just the site's "not found" page.
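
A minimal sketch of that comparison, using the bag-of-words option with cosine similarity. The function names and the 0.9 threshold are assumptions for illustration, not anything from the project's code; `page_text` is the text of the candidate about/ideas page and `probe_text` is the text returned for the URL that shouldn't exist.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word occurrences."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page is not similar to anything
    return dot / (norm_a * norm_b)


def looks_like_soft_404(page_text: str, probe_text: str, threshold: float = 0.9) -> bool:
    """True if the page is nearly identical to the response for a bogus URL.

    The threshold is an assumed starting point and would need tuning on real pages.
    """
    similarity = cosine_similarity(bag_of_words(page_text), bag_of_words(probe_text))
    return similarity >= threshold
```

Bag of words ignores word order, so it tolerates small template differences (a changed path in the message, a different footer year) better than an exact match would. Levenshtein distance would work too, but gets slow on long pages.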



> For the 404s (assuming the status code isn't a 4xx)

Most do return a 4xx code; I checked myself. Some may be a 301/302 redirect to a 4xx page that their crawler isn't handling properly.
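
To illustrate the redirect case: the decision should be based on the last status code in the redirect chain, not the first. A rough sketch, where `status_chain` is a hypothetical list of the status codes seen at each hop (this is not Crawlee's API, just the logic):

```python
from typing import Sequence


def is_error_page(status_chain: Sequence[int]) -> bool:
    """Decide from the final hop whether a fetch ended on an error page.

    status_chain holds the status code of every hop in order,
    e.g. [301, 404] for a redirect that lands on a 404.
    """
    if not status_chain:
        return True  # no response at all: treat as an error
    final_status = status_chain[-1]
    return 400 <= final_status < 600
```

So `[301, 404]` counts as an error even though the first response was a redirect, while `[302, 200]` does not.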


Good point. We're using https://crawlee.dev, I think there's a way to handle more status codes as errors...

Right now it only excludes pages based on the text content: https://github.com/lindylearn/aboutideasnow/blob/main/apps/a...


I think the OpenAI embeddings API could be useful here. Perhaps one of the neurons responds to corporate speak.



