kenneth_reitz's comments

GPL dependency removed.

These are all improvements I'd like to see made to the software. It's all about getting a nice API in place first, then making it perfect second.


I addressed most of your issues, like not using urlparse in the latest release.

With libraries like these, it's all about getting the API right first, then optimizing for perfection second. :)


<base> tag is now implemented as well. Thanks for bringing that to my attention — I wasn't aware of it!
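
For anyone following along, here's a minimal sketch of what <base>-aware link resolution looks like. It's illustrative only (plain lxml plus urljoin), not the library's actual internals:

    from urllib.parse import urljoin
    from lxml import html

    page_url = "http://example.com/articles/post.html"
    doc = html.fromstring(
        '<html><head><base href="/static/"></head>'
        '<body><a href="logo.png">logo</a></body></html>'
    )

    # A <base href>, resolved against the page URL, replaces the page URL
    # as the base for every relative link in the document.
    base_hrefs = doc.xpath('//base/@href')
    base_url = urljoin(page_url, base_hrefs[0]) if base_hrefs else page_url

    for href in doc.xpath('//a/@href'):
        print(urljoin(base_url, href))
        # -> http://example.com/static/logo.png (not .../articles/logo.png)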


:thumbs up:

A second iteration of review:

* encoding detection from <meta> tags doesn't normalize encodings - Python doesn't use the same names as HTML;

* I'm still not sure encoding detection is correct, as it's unclear what the priorities are in the current implementation. They should be: 1) Content-Type header; 2) BOM marks; 3) encoding in meta tags (or XML-declared encoding, if you support it); 4) content-based guessing (chardet, etc.) or just a default value. I.e. the encoding in meta should have lower priority than the Content-Type header, but higher priority than chardet; if I understand it properly, response.text is currently decoded using both the Content-Type header and chardet. (A concrete sketch is at the end of this comment.)

* lxml's fromstring handles XML (XHTML) encoding declarations, and it may fail on unicode data (http://lxml.de/parsing.html#python-unicode-strings), so passing response.text to fromstring is not a good idea. At the same time, relying on lxml to detect the encoding is not enough, as HTTP headers should have higher priority. In parsel we re-encode the text to UTF-8 and force a UTF-8 parser for lxml to solve this (see the sketch after this list): https://github.com/scrapy/parsel/blob/f6103c8808170546ecf046....

* when extracting links, it is not enough to use raw @href attribute values, as they are allowed to have leading and trailing whitespace (see https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f...)

* absolute_links doesn't look correct for base URLs that contain a path. It may also have issues with URLs like tel:1122333 or mailto: links.
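
To make the last three points concrete, here's a rough sketch; the helper names are mine, and the re-encode-to-UTF-8 trick is the same idea parsel uses:

    from urllib.parse import urljoin
    from lxml import etree

    def parse_html(text):
        # Re-encode to UTF-8 and force a UTF-8 parser, so lxml never trips
        # over an XML/XHTML encoding declaration inside already-decoded text.
        parser = etree.HTMLParser(encoding='utf-8')
        return etree.fromstring(text.encode('utf-8'), parser=parser)

    def absolute_links(tree, base_url):
        for href in tree.xpath('//a/@href'):
            href = href.strip()  # leading/trailing whitespace is allowed in @href
            if href.startswith(('mailto:', 'tel:', 'javascript:')):
                continue         # skip non-HTTP schemes instead of joining them
            # urljoin handles base URLs that contain a path, e.g.
            # urljoin('http://example.com/a/b.html', 'c.html') -> '.../a/c.html'
            yield urljoin(base_url, href)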

For encoding detection we're using https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f... in Scrapy. It works well overall; its weakness is that it doesn't build or parse an HTML tree, extracting meta information only from the first 4KB using a regex (the 4KB limit is not good). Other than that, it does all the right things AFAIK.
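
To make the priority order above concrete, here's a minimal sketch built around w3lib's html_to_unicode (which also normalizes HTML encoding labels to Python codec names). The chardet hook and the function name are my own wiring, not requests code:

    # Priority: Content-Type header > BOM > <meta>/XML declaration > chardet > default.
    from w3lib.encoding import html_to_unicode

    def decode_response(response):
        def chardet_guess(body_bytes):
            # Content-based guessing, consulted only when the header, BOM
            # and <meta> declarations yield nothing usable.
            import chardet
            return chardet.detect(body_bytes).get('encoding')

        encoding, text = html_to_unicode(
            response.headers.get('Content-Type'),
            response.content,            # raw bytes, not response.text
            default_encoding='utf-8',
            auto_detect_fun=chardet_guess,
        )
        return encoding, text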


Thanks for the feedback; w3lib is now integrated!


Boo, you should have just made it GPL.


I consider it a nice-to-have, but we can definitely remove it if deemed unnecessary. Want to open an issue about it?


That’s just my preference. Keep the features that are best for your specific use cases. It’s your project after all. I think that’s better than design by committee. Linus didn’t write Linux for other people.


If you’re doing markdown conversion well, keep it. APIs are not about doing the smallest thing; they're about designing a clean abstraction of the problem you're trying to solve.


Sure, if APIs existed in a vacuum. In the real world we have to worry about ease of packaging, ease of maintenance, and various other practical costs.


For what it's worth, I strongly prefer leaving it in. Sometimes I want to generate documentation for different things from different sources, and being able to clean up the HTML by rendering it to Markdown would be great.


I personally find it useful, although I can see why it feels out of place.

If you do decide to remove it, I think it's worth adding as an example in the README.
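
If it does end up as a README example, here's one possible shape, using the html2text package as a stand-in (an assumption on my part; the built-in feature may work differently):

    import requests
    import html2text

    response = requests.get('https://example.com/docs/page.html')
    # Render the fetched HTML down to Markdown for use in docs.
    markdown = html2text.html2text(response.text)
    print(markdown)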


Could be made an optional feature in setup.py.
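
Something like this, for instance; the extra name, package name, and dependencies are placeholders, but the setuptools mechanism is standard, and users would opt in at install time:

    from setuptools import setup

    setup(
        name='your-package',                 # placeholder name
        install_requires=['requests', 'lxml'],
        extras_require={
            # Installed only on request: pip install your-package[markdown]
            'markdown': ['html2text'],       # illustrative dependency
        },
    )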


Definitely shouldn’t be there.


updated :)


That bug had been fixed, but the docs hadn't been updated. Fixed now :)


Looks like there's still some room for improvement though!


There is always room for improvement. I think it's easy to underestimate the time required to dwell in a thing before we really understand it. This is true for runtimes, libraries, even entire programming paradigms. We want to take shortcuts using abstract reasoning, and that works well usually, but sometimes you just gotta use something for years. (Alas, one lifetime may be too short to dwell in all the various paradigms the way they each require. But I digress...)


Always. Very nice work, though. Thanks for pipenv as well; it makes working with venvs tolerable.


It wasn't at first, actually. It was originally a wrapper around urllib2; Andrey and I collaborated early on in both projects' histories to make them what they are today.


Very different use cases, imo. MechanicalSoup emulates a web browser experience. This is more for scraping.


Aaron Swartz


My thoughts exactly :)


See also: pipenv :)

http://docs.pipenv.org/en/latest/

Very different tools, but they cover the same surface area, in spots.


Friendly competition {^.^}


I'm bipolar!

