* encoding detection from <meta> tags doesn't normalize encoding names - Python doesn't use the same names as HTML;
* I'm still not sure encoding detection is correct, as it's unclear what the priorities are in the current implementation. They should be: 1) Content-Type header; 2) BOM marks; 3) encoding in meta tags (or the XML-declared encoding, if you support it); 4) content-based guessing - chardet, etc., or just a default value. I.e. the encoding in meta should have lower priority than the Content-Type header but higher priority than chardet, and if I understand it properly, response.text is decoded using both the Content-Type header and chardet.
* lxml's fromstring handles XML (XHTML) encoding declarations, and it can fail on unicode data (http://lxml.de/parsing.html#python-unicode-strings), so passing response.text to fromstring is not good. At the same time, relying on lxml to detect the encoding is not enough, as HTTP headers should have higher priority. In parsel we re-encode the text to utf-8 and force lxml's utf-8 parser to solve this: https://github.com/scrapy/parsel/blob/f6103c8808170546ecf046....
* absolute_links doesn't look correct for base URLs which contain a path. It may also have issues with URLs like tel:1122333 or mailto: links.
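The priority chain described in the second bullet can be sketched as a simple fallback ladder. The detector arguments here are hypothetical placeholders (a real implementation would derive them from the response headers, body prefix, and chardet), not requests-html's actual API:

```python
import codecs

def detect_encoding(headers_enc, bom_enc, meta_enc, chardet_enc,
                    default="utf-8"):
    """Return the first applicable encoding, in priority order:
    Content-Type header > BOM > <meta>/XML declaration > chardet > default.
    Each argument is a candidate codec label, or None if that source
    produced nothing."""
    for candidate in (headers_enc, bom_enc, meta_enc, chardet_enc):
        if candidate:
            try:
                # Normalize to the canonical Python codec name.
                return codecs.lookup(candidate).name
            except LookupError:
                continue  # skip labels Python doesn't recognize
    return default

# The meta declaration wins only because nothing above it fired:
print(detect_encoding(None, None, "ISO-8859-1", "utf-8"))  # iso8859-1
```

The key design point is that lower-priority sources are never consulted when a higher-priority one yields a usable codec, which is exactly the ordering the bullet argues for.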
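The lxml failure mode in the third bullet is easy to reproduce, and the parsel-style workaround is small. A sketch, assuming lxml is installed; the document string is made up for illustration:

```python
from lxml import etree, html

# A unicode string that still carries an encoding declaration:
UNICODE_DOC = '<?xml version="1.0" encoding="utf-8"?><root>hi</root>'

# Passing such a string to etree.fromstring makes lxml raise ValueError,
# since the declaration contradicts the already-decoded input.
try:
    etree.fromstring(UNICODE_DOC)
except ValueError as exc:
    print(exc)

# The workaround: re-encode to utf-8 bytes and force a utf-8 parser,
# so lxml's own detection cannot disagree with the decoding we chose.
parser = html.HTMLParser(encoding="utf-8")
tree = html.fromstring(UNICODE_DOC.encode("utf-8"), parser=parser)
print(tree.text_content())
```

Forcing the parser encoding is what keeps the HTTP-header-based decoding authoritative rather than letting lxml second-guess it.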
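On the absolute_links point: the standard library's urljoin already handles both of these cases - it resolves relative links against the base URL's full path and leaves non-hierarchical schemes like tel: and mailto: untouched - so any hand-rolled joining that only prepends the origin will disagree with it. A minimal illustration:

```python
from urllib.parse import urljoin

BASE = "https://example.com/docs/page.html"  # base URL with a path

for href in ["guide.html", "/about", "tel:1122333", "mailto:a@b.com"]:
    print(urljoin(BASE, href))
# Relative links resolve against the base path, absolute paths against
# the origin, and tel:/mailto: pass through unchanged:
#   https://example.com/docs/guide.html
#   https://example.com/about
#   tel:1122333
#   mailto:a@b.com
```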
For encoding detection in Scrapy we're using https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f.... It works well overall; its weakness is that it doesn't require or build an HTML tree, extracting meta information from only the first 4Kb using a regex (the 4Kb limit is not good). Other than that, it does all the right things AFAIK.
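A minimal sketch of that approach - scan only the first 4Kb for a charset declaration and normalize the HTML label to a Python codec name. The regex and the override table here are simplified stand-ins, not w3lib's actual implementation; the overrides reflect the WHATWG Encoding Standard, which maps labels like iso-8859-1 to windows-1252:

```python
import codecs
import re

# Look for charset= inside a <meta> tag (covers both <meta charset=...>
# and <meta http-equiv=... content="...; charset=...">).
_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

# Labels whose HTML meaning differs from Python's codec of the same
# name, per the WHATWG Encoding Standard (simplified, not exhaustive):
_WHATWG_OVERRIDES = {"iso-8859-1": "cp1252", "ascii": "cp1252"}

def meta_declared_encoding(body):
    """Extract a declared encoding from the first 4Kb of an HTML body
    (bytes), normalized to a Python codec name, or None."""
    match = _CHARSET_RE.search(body[:4096])
    if not match:
        return None
    label = match.group(1).decode("ascii", "replace").lower()
    if label in _WHATWG_OVERRIDES:
        return _WHATWG_OVERRIDES[label]
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None  # unknown label; fall through to the next detector

print(meta_declared_encoding(b'<meta charset="ISO-8859-1">'))  # cp1252
```

The 4Kb slice is what makes this fast without a parse, and also exactly the weakness noted above: a declaration past that boundary is silently missed.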
That’s just my preference. Keep the features that are best for your specific use cases. It’s your project after all. I think that’s better than design by committee. Linus didn’t write Linux for other people.
If you’re doing markdown conversion well, keep it. APIs are not about doing the smallest possible thing; they’re about designing a clean abstraction of the problem you’re trying to solve.
Sure, if APIs existed within a vacuum. In the real world we have to worry about ease of packaging, ease of maintenance, and various other practical costs.
For what it is worth, I strongly prefer leaving it in. Sometimes I want to just generate documentation for different things based on different sources and being able to just clean the html by rendering to Markdown would be great.
There is always room for improvement. I think it's easy to underestimate the time required to dwell in a thing before we really understand it. This is true for runtimes, libraries, even entire programming paradigms. We want to take shortcuts using abstract reasoning, and that works well usually, but sometimes you just gotta use something for years. (Alas, one lifetime may be too short to dwell in all the various paradigms the way they each require. But I digress...)
It wasn't at first, actually. It was originally a wrapper around urllib2 — Andrey and I collabed early on in both projects' histories to make them what they are today.
These are all improvements I'd like to see made to the software. It's all about getting a nice API in place first, then making it perfect second.