kenneth_reitz's comments

GPL dependency removed.

These are all improvements I'd like to see made to the software. It's all about getting a nice API in place first, then making it perfect second.


I addressed most of your issues, like not using urlparse in the latest release.

With libraries like these, it's all about getting the API right first, then optimizing for perfection second. :)


<base> tag is now implemented as well. Thanks for bringing that to my attention — I wasn't aware of it!
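
For anyone following along, here's a minimal sketch of what <base>-aware link resolution looks like. It's illustrative only (plain lxml plus urljoin), not the library's actual internals:

    from urllib.parse import urljoin
    from lxml import html

    page_url = "http://example.com/articles/post.html"
    doc = html.fromstring(
        '<html><head><base href="/static/"></head>'
        '<body><a href="logo.png">logo</a></body></html>'
    )

    # A <base href>, resolved against the page URL, replaces the page URL
    # as the base for every relative link in the document.
    base_hrefs = doc.xpath('//base/@href')
    base_url = urljoin(page_url, base_hrefs[0]) if base_hrefs else page_url

    for href in doc.xpath('//a/@href'):
        print(urljoin(base_url, href))
        # -> http://example.com/static/logo.png (not .../articles/logo.png)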


:thumbs up:

A second iteration of review:

* encoding detection from <meta> tags doesn't normalize encodings - Python doesn't use the same names as HTML;

* I'm still not sure encoding detection is correct, as it's unclear what the priorities are in the current implementation. They should be: 1) Content-Type header; 2) BOM marks; 3) encoding in meta tags (or XML-declared encoding, if you support it); 4) content-based guessing (chardet, etc.) or just a default value. I.e. the encoding in meta should have lower priority than the Content-Type header, but higher priority than chardet; if I understand it properly, response.text is currently decoded using both the Content-Type header and chardet. (A concrete sketch is at the end of this comment.)

* lxml's fromstring handles XML (XHTML) encoding declarations, and it may fail on unicode data (http://lxml.de/parsing.html#python-unicode-strings), so passing response.text to fromstring is not a good idea. At the same time, relying on lxml to detect the encoding is not enough, as HTTP headers should have higher priority. In parsel we re-encode the text to UTF-8 and force a UTF-8 parser for lxml to solve this (see the sketch after this list): https://github.com/scrapy/parsel/blob/f6103c8808170546ecf046....

* when extracting links, it is not enough to use raw @href attribute values, as they are allowed to have leading and trailing whitespace (see https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f...)

* absolute_links doesn't look correct for base URLs that contain a path. It may also have issues with URLs like tel:1122333 or mailto: links.
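
To make the last three points concrete, here's a rough sketch; the helper names are mine, and the re-encode-to-UTF-8 trick is the same idea parsel uses:

    from urllib.parse import urljoin
    from lxml import etree

    def parse_html(text):
        # Re-encode to UTF-8 and force a UTF-8 parser, so lxml never trips
        # over an XML/XHTML encoding declaration inside already-decoded text.
        parser = etree.HTMLParser(encoding='utf-8')
        return etree.fromstring(text.encode('utf-8'), parser=parser)

    def absolute_links(tree, base_url):
        for href in tree.xpath('//a/@href'):
            href = href.strip()  # leading/trailing whitespace is allowed in @href
            if href.startswith(('mailto:', 'tel:', 'javascript:')):
                continue         # skip non-HTTP schemes instead of joining them
            # urljoin handles base URLs that contain a path, e.g.
            # urljoin('http://example.com/a/b.html', 'c.html') -> '.../a/c.html'
            yield urljoin(base_url, href)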

For encoding detection we're using https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f... in Scrapy. It works well overall; its weakness is that it doesn't build or parse an HTML tree, extracting meta information only from the first 4KB using a regex (the 4KB limit is not good). Other than that, it does all the right things AFAIK.
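
To make the priority order above concrete, here's a minimal sketch built around w3lib's html_to_unicode (which also normalizes HTML encoding labels to Python codec names). The chardet hook and the function name are my own wiring, not requests code:

    # Priority: Content-Type header > BOM > <meta>/XML declaration > chardet > default.
    from w3lib.encoding import html_to_unicode

    def decode_response(response):
        def chardet_guess(body_bytes):
            # Content-based guessing, consulted only when the header, BOM
            # and <meta> declarations yield nothing usable.
            import chardet
            return chardet.detect(body_bytes).get('encoding')

        encoding, text = html_to_unicode(
            response.headers.get('Content-Type'),
            response.content,            # raw bytes, not response.text
            default_encoding='utf-8',
            auto_detect_fun=chardet_guess,
        )
        return encoding, text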


Thanks for the feedback; w3lib is now integrated!


Boo, you should have just made it GPL.


I consider it a nice-to-have, but we can definitely remove it if deemed unnecessary. Want to open an issue about it?


That’s just my preference. Keep the features that are best for your specific use cases. It’s your project after all. I think that’s better than design by committee. Linus didn’t write Linux for other people.


If you’re doing markdown conversion well, keep it. APIs are not about doing the smallest thing; they're about designing a clean abstraction of the problem you're trying to solve.


Sure, if APIs existed in a vacuum. In the real world we have to worry about ease of packaging, ease of maintenance, and various other practical costs.


For what it's worth, I strongly prefer leaving it in. Sometimes I want to generate documentation for different things from different sources, and being able to clean up the HTML by rendering it to Markdown would be great.


I personally find it useful, although I can see why it feels out of place.

If you do decide to remove it, I think it's worth adding as an example in the README.
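
If it does end up as a README example, here's one possible shape, using the html2text package as a stand-in (an assumption on my part; the built-in feature may work differently):

    import requests
    import html2text

    response = requests.get('https://example.com/docs/page.html')
    # Render the fetched HTML down to Markdown for use in docs.
    markdown = html2text.html2text(response.text)
    print(markdown)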


Could be made an optional feature in setup.py.
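
Something like this, for instance; the extra name, package name, and dependencies are placeholders, but the setuptools mechanism is standard, and users would opt in at install time:

    from setuptools import setup

    setup(
        name='your-package',                 # placeholder name
        install_requires=['requests', 'lxml'],
        extras_require={
            # Installed only on request: pip install your-package[markdown]
            'markdown': ['html2text'],       # illustrative dependency
        },
    )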


Definitely shouldn’t be there.


updated :)


That bug had been fixed, but the docs hadn't been updated. Fixed now :)


Looks like there's still some room for improvement though!


There is always room for improvement. I think it's easy to underestimate the time required to dwell in a thing before we really understand it. This is true for runtimes, libraries, even entire programming paradigms. We want to take shortcuts using abstract reasoning, and that works well usually, but sometimes you just gotta use something for years. (Alas, one lifetime may be too short to dwell in all the various paradigms the way they each require. But I digress...)


Always. Very nice work, though. Thanks for pipenv as well; it makes working with venvs tolerable.


It wasn't at first, actually. It was originally a wrapper around urllib2; Andrey and I collaborated early on in both projects' histories to make them what they are today.


Very different use cases, imo. MechanicalSoup emulates a web browser experience. This is more for scraping.


Aaron Swartz


My thoughts exactly :)


See also: pipenv :)

http://docs.pipenv.org/en/latest/

Very different tools, but they cover the same surface area, in spots.


Friendly competition {^.^}


I'm bipolar!

