Did You Know: BeautifulSoup's bits are rotting (crummy.com)
72 points by andrewljohnson on March 12, 2009 | 37 comments



I've been using BeautifulSoup on a project and noticed the exact problems he's mentioning. I actually ended up filtering the source with a regexp to remove script tags and their contents prior to parsing because of the HTMLParser weirdness. It wasn't a pleasant experience. The whole time I was doing this, I kept looking at my nice Firebug element tree and wondering "Why am I even going to this trouble?"
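Something along these lines (a simplified sketch, not the exact regexp I used, and stripping script blocks with a regex is admittedly fragile):

  import re

  # crude pre-filter: drop <script>...</script> blocks, contents and all,
  # before the HTML ever reaches HTMLParser
  script_re = re.compile(r'<script[^>]*>.*?</script>', re.DOTALL | re.IGNORECASE)

  def strip_scripts(html):
    return script_re.sub('', html)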

Does anyone else wonder why we're writing all these parsers when both Mozilla and WebKit have reliable, robust parsers that are actively maintained? How difficult would it be to package up the existing code and distribute it with wrappers for Python, Ruby, etc.? I assume there's something I don't know, because not only has it not been done, but no one seems to want to talk about it.


I had the same problem using 3.1.0, and with some suggestions from the newsgroup, the html5lib alternative works fairly well. I haven't had a problem so far parsing about six sites I previously had to clean up using regexps.


I've always wondered this too. It is very strange that no one wants to talk about it. Maybe we could all get together and put up a bounty somewhere for someone to make this?


I briefly looked into doing this. The answer is that it's pretty damn difficult, at least in the case of Mozilla.


Actually, I think it would be pretty easy if you are willing to have a running Mozilla process. Just connect to it with MozRepl, get it to render a page, and then inspect the DOM with JavaScript. (This could be library-ed up so that you get a W3C DOM back on the Python side, or whatever.)
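A hand-wavy sketch of what that could look like from Python (assuming MozRepl is installed in Firefox and listening on its default telnet port, 4242; the prompt string and the load-waiting are glossed over):

  import telnetlib

  repl = telnetlib.Telnet('localhost', 4242)   # MozRepl's default port
  repl.read_until('repl> ')                    # wait for the MozRepl prompt
  repl.write("content.location.href = 'http://example.com/';\n")
  repl.read_until('repl> ')                    # a real version would wait for onload
  repl.write("content.document.title;\n")      # evaluate JS against the live DOM
  print repl.read_until('repl> ')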

I use a similar technique to get emacs to syntax-highlight my slides. Connect to the running emacs (with all my settings), run htmlify via emacsclient --eval, and enjoy perfect highlighting!


Sorry, yes -- I definitely don't want a running Mozilla process. Plus it's not at all clear that it's possible to run Mozilla headless, though I didn't look that hard.


You can run any X app headless with Xvfb.
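For example (a minimal sketch; the display number is arbitrary, and the sleep is a crude stand-in for properly waiting on the X server):

  import os, subprocess, time

  xvfb = subprocess.Popen(['Xvfb', ':99'])   # framebuffer in memory, no real display
  time.sleep(1)                              # crude: give the server a moment to start
  os.environ['DISPLAY'] = ':99'
  subprocess.Popen(['firefox', 'http://example.com/'])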


Ah, cool -- it's just that my servers don't run X, or really have enough RAM to spare for 30 copies of X, Mozilla, and other associated stuff. I really just need a relatively compact parsing engine.


I'm not sure why you would need 30 copies of X or Mozilla.

Either way, it is kind of inelegant, but it is hard to pick and choose parts of Mozilla; running the whole thing is probably the simplest way to let Mozilla parse your HTML. (That, however, may not be necessary. I have done a lot of screen-scraping, and I have never encountered anything that HTML::TreeBuilder got confused by. Lately, I've been using libxml2, and that has also worked very well. Zero problems.)


This is so unfortunate. It's such a great piece of software that so many of us depend on.

It's really too bad that there's not enough money in it for Leonard to keep it up. But, I have no bitterness, just thanks!


Your title really rubs me the wrong way. This isn't bitrot; it's actually quite the opposite: the problem showed up because he does actively maintain the code. He made the latest release compatible with future versions of the standard Python distribution.

He's standing up to say he's going to honor his responsibility to this code even though he doesn't enjoy it anymore, but that that doesn't include writing HTML parsers, and you come along and scream 'bitrot'. Sorry, but that's kind of an assholish thing of you to do.


Its performance is getting worse over time because keeping it fast requires more maintenance than anyone is willing to give it, at least so far. I would call that bit rot too.

I think both of you agree that the original author deserves only thanks.


But that's not what the linked article is about at all. If you have benchmarks and you want to write that article, by all means, do it.


I think you must have woken up on the wrong side of the bed. I certainly had no intention of impugning the author of the code, and in fact, I thanked him with the thread-opening comment.

Personally, I can't/won't pitch in on BeautifulSoup, but I thought maybe if I posted the page to this forum, a better hacker than myself might jump in. We came across the behavior he describes in building TrailBehind, and I just thought I'd share with the community.

As for bit rot, that's a pretty old term that just means code breaks down as it ages.


'Bit rot' is a pretty loaded term; it's not as benign as you're pretending it is. It's one of those things that you can say about your own project, or in pointing out a specific problem within a codebase, but to say it about a project as a whole that has an active maintainer (especially after he releases an update to avoid bit rot going forward and then asks for help dealing with the upstream problems), that's assholish. Sorry, but it just is.


Assuming malice over the OP's protests is kind of an assholish thing for you to do. I don't view 'bit rot' as a loaded term, and if the OP says he doesn't either, then perhaps you should take him at his word and not accuse him of 'pretending' otherwise. If a large part of the community sees the term in the same light as you do, then perhaps it was a poor choice of title, but I see nothing to indicate that there was bad intent here.


I'm not assuming malice; I don't care if he's intentionally being rude or not. A lack of intent isn't a free pass. This site gets visited by a lot of developers, probably a lot of whom use Python, and probably a fair share of those use BS, and accusing the project of rotting, especially when that's not what's actually happening, could have an actual negative impact on it going forward.

Titles get changed here all the time, why couldn't this one be changed to just a simple statement of fact, something like: "Beautiful Soup switches parsers, developer requests help replacing lost functionality"? Instead we get this sensational and misleading title, and, yeah, it pisses me off.


Yeah, that would have been a great headline. Then, no one would have read the post, no one would find out the guy would like some help, and the project would be that much closer to sinking into oblivion.

Enough of this nonsense. I think the highly popular nature of the post, and the productive discussion that ensued, means that the post and headline were good. I certainly did the guy more good than harm, by bringing the attention of a community of hackers to this issue.

This reminds me of when I was editor of my college newspaper, and I wrote a headline that said "Student Raped in Gesling Stadium." The university and the campus cops all wanted us to say "sexual assault" and were pitching a fit. But we felt the strong language was justified, and we printed the word RAPE in big bold block letters. As a result, hundreds of thousands of dollars were immediately spent on improving campus security.

In this case, I wrote that headline because I knew people would read it, and I knew that they wanted to read it, and I knew that they should read it. The last thing I was doing was trying to give the guy a hard time.

I think Bit Rot is pretty catchy.


Yeah, my bad, you're totally right, this site (and the world) will be so much better when everyone starts writing catchy, sensational headlines that don't reflect the actual content of their articles. I'm glad you're fighting the good fight to bring that mentality here, I really don't get enough of it at Reddit.


If you didn't assume bad intent (or even if you did, really), then perhaps you should have approached this more civilly. You called his actions 'assholish' and accused him of 'pretending'. These terms read to me as an attack on the author, not just his headline. For someone complaining about a rude headline, your comments come off as awfully rude themselves.


You may be right, but I approached it with the idea that he had already made one assholish statement and so there was a higher probability that he was an asshole in general. It takes one to know one, I suppose, but I try to limit my assholishness to people who have already demonstrated a good bit of their own.


I've had a couple of friends bitten by BeautifulSoup's sudden loss of functionality; dropping in html5lib's BeautifulSoup mode turned out to be a more than adequate replacement.


He's not saying he lacks the money; he's saying he lacks the time.


Yes, he is also saying it's about the money:

"To make the time for Beautiful Soup development, I'd need enough money to make it my job, and that's too much money to ask for or to expect."

Money == Time


After having run various html/xml/rss parsers against a 1B page web crawl, I'd have to say that it's pretty rare to find ones that can actually pass the web fuzz test. Most seem to have been written from a more spec-driven approach. This is fine in a controlled environment, but pretty useless if you want to turn the code loose on real world web data.

Some of the stuff we find, like 1-in-80M core dumps, is to be expected, because it's so rare and most folks don't have that much test data. But many other bugs could be found by simply running a parser against a few hundred random URLs from the DMOZ RDF dump. I wish more lib developers would do this.
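Even a dumb harness would catch a lot. Something like this (a sketch; urls.txt is a hypothetical file of URLs sampled from the dmoz dump, and html5lib is just one parser you could point it at):

  import urllib2
  import html5lib

  parser = html5lib.HTMLParser()
  for url in open('urls.txt'):
    url = url.strip()
    try:
      parser.parse(urllib2.urlopen(url))
    except Exception, e:
      # any blowup here is a bug report (or at least a network error to filter out)
      print 'FAIL %s: %s' % (url, e)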


I'm sure the html5lib guys would love to hear about parser bugs exposed by a corpus that large:

http://code.google.com/p/html5lib/

Especially since html5lib is supposed to follow the HTML5 parsing rules, which were basically reverse-engineered from IE's HTML parsing, so they ought to work for every web page in existence.


I don't think anything is going to work on every web page in existence. Perhaps strlen.


Yeah, since I just wrote a spider last night using html5lib, and had to wrap it up in a try block, I can categorically say that it doesn't work for all webpages:

  import html5lib
  from html5lib import treebuilders

  # build an html5lib parser that hands back a BeautifulSoup tree
  parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
  try:
    document = parser.parse(response)
  except Exception, e:
    # response is whatever file-like object the spider fetched
    print 'parse failed ' + str(e)
    return


And strlen certainly wouldn't if you actually expect a correct answer. Can't guess at encodings... :)


Such a great project should have little trouble finding good devs. Imagine how many bright young hackers would kill to be an official contributor to BeautifulSoup.

He could make an earnest attempt at finding other people to work on it and just do code reviews. Be a figure of advice and authority while doing no real work. That would be great IMHO.


Well it is free software. If he's throwing in the towel, couldn't someone just put the code on github/bitbucket and run with it?


Wait, why not just port SGMLParser to Python 3.0? Did I miss something?


Here's the note in PEP 3108:

  sgmllib [done]
    * Does not fully parse SGML.
    * In the stdlib for support to htmllib which is slated for removal.

Based on that and the standard docs, it looks like it was lost in the standard library reorganization. HTMLParser and the other HTML-related libraries were merged into a new html module; sgmllib's parser was an incomplete implementation of SGML that only htmllib used, so apparently in the reorg it was deemed unnecessary and scrapped.

http://www.python.org/dev/peps/pep-3108/#id53


But the sgmllib that worked for BeautifulSoup, as it existed in Python 2.5, would be suitable for BeautifulSoup to keep using as if it were the last good version. I'm not suggesting it be re-added to the standard library, just that it be added to BeautifulSoup.
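Something like this, say (purely a sketch; the vendored module path is hypothetical, and the point is just to prefer the stdlib copy while it still exists):

  try:
    import sgmllib                      # still in the stdlib through Python 2.x
  except ImportError:
    from BeautifulSoup import sgmllib   # hypothetical: a bundled copy of the 2.5 module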


Personally, I far prefer lxml to BeautifulSoup. The latter is incredibly slow and leaks memory like a sieve unless you manually tear apart object trees. That said, BS is easier to write with in many cases. Just don't use it for any heavy work.
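For what it's worth, the lxml version of a typical scrape is short (a minimal sketch; junk_html stands in for whatever malformed markup you fetched):

  import lxml.html

  doc = lxml.html.fromstring(junk_html)   # lenient parse, copes with broken markup
  for href in doc.xpath('//a/@href'):     # then query the tree with XPath
    print href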


Does Python have a libxml2 binding? I have had pretty good luck with its parse_html_string function.

Failing that, you can always use Perl and HTML::TreeBuilder / HTML::Parser. They work pretty well on malformed input.


That it does; it works like this:

  import libxml2

  # parse leniently and silence libxml2's error spew
  parse_options = libxml2.HTML_PARSE_RECOVER | libxml2.HTML_PARSE_NOERROR | libxml2.HTML_PARSE_NOWARNING

  # htmlReadDoc, not readDoc: readDoc expects well-formed XML
  xml_document = libxml2.htmlReadDoc(junk_html, None, None, parse_options)

  clean_xhtml = xml_document.getRootElement().serialize()

Note: this method of "cleaning" works by building an XML tree out of HTML, but HTML is not XML, so empty elements such as <textarea></textarea> get serialized as self-closing tags, and a browser will then treat any HTML after the tag as being inside the textarea. Don't use this if you still want to send the output to a browser.

EDIT: fixed formatting.



