
lxml is definitely faster, but I've found BSoup to be more forgiving with poorly formatted DOMs



BeautifulSoup is poorly maintained; you have to be very specific about which version you're using.

Note: lxml has a number of repair modes that allow it to parse virtually anything. CPU cycles and memory use go up quite a bit when they're activated, but it's still better than BeautifulSoup.
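
For anyone who wants to try this, here is a minimal sketch. I'm assuming "repair modes" refers to the parser's `recover` option; the broken input string is made up for illustration.

```python
from lxml import etree

# Hypothetical broken input: the second <item> is never closed.
broken = "<root><item>one</item><item>two</root>"

# The default XML parser is strict and rejects the document.
try:
    etree.fromstring(broken)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# recover=True asks libxml2 to keep going and build whatever tree it can.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(broken, parser)

print("strict parse succeeded:", strict_ok)
print(etree.tostring(root, encoding="unicode"))
```

Note that `etree.HTMLParser` already has `recover` enabled by default, so the option mostly matters when you are parsing XML.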


Thankfully, lxml has a slower-but-more-forgiving mode for poorly formatted HTML, which takes advantage of BeautifulSoup: http://lxml.de/elementsoup.html
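
A rough sketch of what that looks like, assuming both lxml and BeautifulSoup (bs4) are installed. The tag soup here is a made-up example, not anything from a real page.

```python
from lxml import etree
from lxml.html.soupparser import fromstring

# Hypothetical tag soup: unclosed <p> and <b> elements.
tag_soup = "<p>first<p>second <b>bold"

# BeautifulSoup does the parsing; lxml converts the result into its own tree.
root = fromstring(tag_soup)

# From here on it is an ordinary lxml tree, so XPath, itertext() etc. work.
text = "".join(root.itertext())
print(root.tag)
print(etree.tostring(root, encoding="unicode"))
```

The result is a regular lxml element tree, so the rest of a script does not need to know which parser produced it.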


I've found the exact opposite. BSoup will choke on invalid tags in the DOM, such as: <div id="content"><content>…</content></div>

If I try to return the innerHTML of #content, I get '<div id="content"><content>' as a string, nothing else.

While I know that’s inexcusable markup, it’s nothing I have control over.

lxml (if it builds on the target system) has been much better for my scripts.
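
Roughly what I mean, as a sketch rather than my actual script. The "hello" text is a placeholder for the real content, and `inner_html` is a helper I'm defining here, not part of lxml.

```python
import lxml.html

# Placeholder markup with a non-standard <content> tag inside #content.
markup = '<div id="content"><content>hello</content></div>'

doc = lxml.html.document_fromstring(markup)
div = doc.get_element_by_id("content")

def inner_html(el):
    """Serialize an element's leading text and children, without its own tag."""
    parts = [el.text or ""]
    for child in el:
        # tostring() includes each child's tail text by default.
        parts.append(lxml.html.tostring(child, encoding="unicode"))
    return "".join(parts)

inner = inner_html(div)
print(inner)
```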



