Client libraries are better when they have no API? (csvbase.com)
159 points by todsacerdoti on April 10, 2024 | 100 comments


Total clickbait title: obviously, being able to make an HTTP request to an endpoint and get a CSV back is an API. Respecting the “Accept” header to return different data formats based on what the client requests is totally fine/reasonable, but it doesn’t magically make an interface that’s designed for applications to program against not an API.

Also, while CSV is a nice/convenient data format for a data analytics use case (like this), it’s certainly not a format I’d choose for an API where clients are likely to be more standard CRUD-ish apps. JSON is great for those, CSVs (with their trickier parsing, “everything is a string” data types, and enforced flatness) are a pain in the ass.

I did think it was interesting to learn about fsspec, didn’t know about that! And this style of client library does seem like a good/convenient one for this specific Python data analysis use case. Enjoyed that part of the article, but had to wade through a fair bit of clickbait-style writing to get there.


CSV is an awful non-standard standard that nobody should use.

To understand how awful it is, you just have to ask two questions: "How do I represent empty values?" and "What do I do when my separator is contained in the data?"

There are answers to these questions, but there's no standard backing up those answers. And that's what makes CSV a PITA to deal with. It's such a loose "standard" that anything goes.

XML and JSON are FAR better options even when you just want a table of data.


> To understand how awful it is, you just have to ask two questions

Those are rookie questions. What about when the data contains newline/crlf? What if the data contains the quote character and newline?

And why is the file mostly Windows-1252 encoded except some fields that are sometimes, at random, UTF-8 encoded?


Still basic questions.

The real kicker is when your fellow users are opening the CSV in a spreadsheet with a locale that prefers commas for fractional currency amounts and then saving the file back.


JSON is not without its own problems, not least that it gets very close to what folks want but just falls short when things get complicated.

https://en.wikipedia.org/wiki/CDATA is something far too few people pay attention to when considering the complications of object notation and why markup languages have some escape hatches.


The CSV standard is RFC 4180, although it's merely informational.

https://www.ietf.org/rfc/rfc4180.txt


It always pains me to say, but when processing tabular data from sources I don't personally control, I prefer the xlsx format. It's usually well supported by most export processes (in contrast to e.g. the LibreOffice equivalent) and I've never had problems finding a library that can parse it. And you never have to solve the "anything goes" CSV problems you're talking about.
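For what it's worth, the reading side is a one-liner these days (a sketch; assumes pandas with the openpyxl engine installed, and "export.xlsx" is a made-up filename):

    import pandas as pd

    # pandas delegates xlsx parsing to openpyxl; no delimiter/quoting/encoding guesswork
    df = pd.read_excel("export.xlsx")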


> How do I represent empty values

Empty string. For example, here are three in a row:

  ,,
> What do I do when my separator is contained in my values?

Use quotes. For example, here is a single comma:

  ","


I don't understand the question about empty values. A row with three empty values is just:

,,

And yeah, dealing with separators is annoying, but pretty much every reader supports quoted values and escapes (not every writer cares to add them, though).

In practice I use CSV to process large amounts of tabular data (like logs, events, etc) where I care about greppability and performance more than about the potential for 0.001% of corrupted data. YMMV of course; if getting it 100% right is important, use something else (JSON is not without sins too; consider that JSON numbers are often parsed as floats).


Is it ''?

Or is it ,,,,,,,

Or is it tabtabtabtab

Or perhaps it's "NULL,NULL,NULL"

Part of my work is data ingestion and I've seen all these (and more) as answers to the "empty values" question.

I'm not saying that other formats are without their problems; they certainly have them. However, CSV doesn't just have those problems, it has multiple other problems on top of them.

It's a basic idea with really obvious edge cases addressed in multiple ways depending on who is producing these documents.


CSV is extremely underspecified, but that doesn't mean it deserves the blame for software that fails to implement even the one thing it inherently specifies. A sequence of tabs is one value in a CSV. A sequence of semicolons is one value in a CSV. Any software that thinks otherwise is buggy, and unfortunately that includes some extremely popular software that is supposed to be good at tables (Excel).
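Easy to demonstrate with a conforming parser (a throwaway sketch):

    import csv

    # tabs are just data; the comma is the only delimiter
    print(next(csv.reader(["a\tb\tc,second"])))   # ['a\tb\tc', 'second']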


I would guess that their point is that CSV doesn't have any official null-vs-empty-string differentiation mechanism.


json lines/ndjson is great.


The site has an API.

The Python library has no Python API, serving instead just as a plugin for fsspec. You then use pandas or anything else the same as you did before; the Python library adds no (Python) API. I guess technically "use the custom `csvbase://` scheme in the URIs you supply as input, instead of `http`" could be called an "API" if you really want to play gotcha.

I think the point is legit -- the Python programmer has to reference nothing specific to the client library here other than the custom URI scheme, to then use remote data from the site with any one of several existing Python data libraries, via their own Python APIs.
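Concretely (this is the call from the article; note that nothing csvbase-specific is imported):

    import pandas as pd

    # the unknown URL scheme is handed to fsspec, which resolves it to the
    # csvbase-client backend, provided that package is installed
    df = pd.read_csv("csvbase://calpaterson/opcodes-6502")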


> Total clickbait title: obviously, being able to make an HTTP request to an endpoint and get a CSV back is an API. Respecting the “Accept” header to return different data formats based on what the client requests is totally fine/reasonable, but it doesn’t magically make an interface that’s designed for applications to program against not an API.

I'm sorry about the title. As I said below, that was really meant in the spirit of fun.

On the subject of csvbase's content negotiation - yes, that is an API. That was covered here some time ago when I wrote about it before: https://news.ycombinator.com/item?id=37526047

The "no API" bit I'm talking about in this article is basically the "trick" (or whatever word you want to use) of avoiding having any user-facing interface and just hooking into stuff that is already there. There is no "API surface" here for the user to learn beyond a url scheme. I think that's nice. And it's mainly what I'm talking about.

> Also, while CSV is a nice/convenient data format for a data analytics use case (like this), it’s certainly not a format I’d choose for an API where clients are likely to be more standard CRUD-ish apps. JSON is great for those, CSVs (with their trickier parsing, “everything is a string” data types, and enforced flatness) are a pain in the ass.

Without wanting to sound too much like a sales pitch: csvbase does offer JSON. Try https://csvbase.com/calpaterson/opcodes-6502.jsonl for JSON lines (or https://csvbase.com/calpaterson/opcodes-6502.json (no 'l') for a paged plain-JSON interface).
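e.g.:

    curl https://csvbase.com/calpaterson/opcodes-6502.jsonl   # one JSON object per line
    curl https://csvbase.com/calpaterson/opcodes-6502.json    # paged plain JSON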

I personally think there is no ideal format for this at the moment. JSON is very very large and slow to parse. CSV has well known problems though has massive compatibility and often works well in practice. Parquet is probably closest to the ideal and excellent in many respects but is quite complicated to parse (moreso than CSV? perhaps) and anyway is effectively unstreamable - actually quite annoying for something like csvbase where you really _don't_ want to materialise the dataset while serving it.

> I did think it was interesting to learn about fsspec, didn’t know about that!

Yes, it is cool, isn't it. Millions of downloads, terabytes of bandwidth on PyPI, and no one has heard of it.


> I'm sorry about the title. As I said below, that was really meant in the spirit of fun.

There's no need to apologize! I and many others understood exactly what you meant by "no API". Some other people didn't understand the distinction you were drawing and chose to interpret their lack of understanding as you misleading them somehow, but that's on them, not you.

It was a great article that I thoroughly enjoyed! Thanks for sharing!


> That barely qualifies as an API, it's just HTTP. So what possible use could a client library be?

Wikipedia: API stands for Application Programming Interface. In the context of APIs, the word Application refers to any software with a distinct function. Interface can be thought of as a contract of service between two applications.

This is an interface between two applications.

I do like the simplicity of the APIs provided, but they _are_ APIs


Perhaps the author is referring to "annoying APIs"?

- Where there is an "API key" and a "secret key" when in reality the idiot maintainers could have just concatenated the two and called it a "key". The customer doesn't need to know the abstraction on the server side

- Where one needs to create an "account" and then a "project" before one can even create a "key". WTF is a project

- Where one needs to do a dumb SMS 2FA to even get an API key

- Where one needs to do multiple steps and try/except blocks to get a result in a plain format


SMS 2FA to get an API key usually shows up when you get free credits, for one or both of a) getting your phone number for marketing purposes and b) making it slightly harder to bot-farm those credits on hundreds of different accounts (you need to do LRN dips and HLR checks to actually make this somewhat useful)


But if you say "API's are better when they're not annoying" it doesn't sound nearly as witty.


API key has nothing to do with what an API is. API is just how you ask the software to do something, or tell it what to do. API key is just for it to confirm who you are. Before the internet, every piece of software still had an API. Even the OS has an API.


I smell an OAuth2.0 hater =]


Yeah I hate OAuth.

I really wish APIs were as simple as having a free tier that you can just use, and if you want more, just POST $10 worth of Solana to some endpoint and the API gives your IP 10000 more requests. No accounts, no keys, just simple.

I'm not a cryptocurrency fanatic but making APIs easier to access is one really good use case for it. It makes paying for an API as easy as paying for a lemonade at a street stand with cash. No accounts, no billing addresses, no subscriptions, just POST over a nickel along with your API request and get a response, and API owner gets paid.


No keys or accounts, but you need to maintain a (volatile as they all are) cryptocurrency wallet? It would be nice if ye olde banks and fintech actually came up with some standard for a sort of digital cashier's check though.

IP-based requests seems very restrictive. I'd rather post some sort of payment to an endpoint then get some arbitrary secret back that I could use/save/distribute. Maybe you could choose the mode you wanted though, so if you were confident your IP would remain stable and not be shared with anyone else, you could choose that mode.


The accounts are usually an anti-abuse feature; few people want to provide services to faceless bad actors on the internet.


Having to force all your requests through a single IP and maintain some weird crypto wallet is easier than grabbing an OAuth library?


Death to APIs. Long live APIs!


Abstraction rediscovered


Person 1 adds code that uses this with Pandas. Person 2 sees the csvbase:// URIs and copy/pastes them to use with some other library Foozle that uses URIs and it works because Foozle also happens to use fsspec. Foozle migrates off of fsspec to some other filesystem wrapper. 3 years later, after persons 1 and 2 have left the company, person 3 bumps the version of Foozle the project uses and suddenly tests are failing because they can't resolve csvbase:// URIs, but the Foozle release notes say nothing about removing support for csvbase:// URIs.


Action at a distance, and across time.

I just read this article and have very conflicting feelings about it. It is clever, and the nice kind of clever that does not require one to be a mega-brain. On the other hand, it creates invisible and uncontrollable dependencies, such as the one you describe.

Another drawback: something is broken and I want to set a debug point in the code that fetches the CSV data. Unless you know about fsspec it will be hard to follow the breadcrumbs to know how this library injects itself into your code.


The problem can be even more immediate. Person 2 (a junior or intern) is given the normally mundane task to go through the code base and replace pandas with foozle, a faster, but less mature alternative. They try to replace the pd.read_csv() calls with foozle's equivalent, say fz.read_csv(). So, pd.read_csv("csvbase://") becomes fz.read_csv("csvbase://"). Only, Foozle hasn't implemented fsspec and person 2 doesn't know anything about it either. Fun times.


How is this different from the risk that a dataframe library will stop supporting S3:// URIs? Why wouldn't Foozle's release notes mention a major protocol change? How is this scenario different from any other library making a major change without documenting it?


Because fsspec handles the extensions directly, Foozle does not have to opt in to get the csvbase:// support; all that's required is for the csvbase Python client to be installed. So users of Foozle might start depending on these extensions without Foozle ever finding out about it.

But I guess that's the reality of making libraries, even bugs will be relied on, we really don't have any methods for evolving software ecosystems reliably and compatibly.


Not sure what my takeaway should be here. It feels like one can tell the same story with any third-party dependency. Are you saying one should not use dependencies at all?


In this hypothetical case, there would be some floundering to discover the cause, and a two-line fix. Rather than having the two extra lines from the start. Seems acceptable risk to me.


Exaggerative title aside, it's really cool to learn about how fsspec lets you essentially make a custom filesystem in a few lines of code, as well as the (important) notion that the very existence of a Python package can cause a plugin to be installed at a deep level, to be used automatically by other packages.

https://filesystem-spec.readthedocs.io/en/latest/developer.h...

https://setuptools.pypa.io/en/latest/userguide/entry_point.h...

(As a side note here, you might think that "pip install" can only add functionality, and that you can add extra packages locally for your convenience, or keep them around when you switch branches, without having your dev environment diverge from production behavior. But if they register plugins with other packages via the entry point functionality, you might end up coding things that depend on behavior that's not present in other environments! This is especially common in the pytest ecosystem. CI is vital here!)
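To make the mechanism concrete, here's roughly what a backend looks like, using explicit runtime registration rather than an entry point (a toy, hypothetical filesystem, not csvbase-client's actual code; a package's entry point in the "fsspec.specs" group does the same registration implicitly at install time):

    import io

    import fsspec
    import pandas as pd
    from fsspec.spec import AbstractFileSystem

    class HelloFileSystem(AbstractFileSystem):
        # Toy backend: every path resolves to the same tiny CSV payload.
        protocol = "hello"

        def _open(self, path, mode="rb", **kwargs):
            return io.BytesIO(b"a,b\n1,2\n")

    # Explicit registration; an installed entry point would make this
    # happen without any import at all.
    fsspec.register_implementation("hello", HelloFileSystem)

    print(pd.read_csv("hello://whatever"))   # columns a and b, one row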


> Exaggerative title aside

I apologise for the title. Not intended to mislead, just to be a bit of fun :)

> it's really cool to learn about how fsspec [...]

I'm glad you took something from it. At the last place I worked I spent a lot of time building a library to persist dataframes (intended audience: data analysts) and in retrospect I wish we'd thought of the approach of just having a `csvbase://` url scheme via fsspec.

Currently it doesn't support Parquet, which they would have needed, though. Still some work to do for csvbase to accept Parquet uploads.


I'm curious what the author's loose definition of "API" is, if all of those APIs that were presented were somehow not APIs to the author (strictly speaking, they were definitely all APIs).


My initial reaction was also "if I can talk to your software, that's an API."

Reading through the article, though, the author isn't advocating for "no APIs," they're advocating for minimalist APIs that use agent detection to "do the right thing."

Chrome tells the API that it can accept HTML, so the server sends data formatted inside of a web page, with an HTML table.

Curl doesn't send that header, so the server sends unformatted CSV. But you could send an Accept header to get the HTML if you wanted.

The benefit of this is that for most use cases, this will "just work."

The downside is, if you want to view the CSV in a browser, or the web page in curl, you need to know (or guess) how the server is deciding what to send you, and take the correct action. An API documented with OpenAPI, while more complicated, explicitly tells you what you can do.
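For instance (a sketch; same URL, the Accept header picks the representation):

    curl https://csvbase.com/calpaterson/opcodes-6502                         # CSV
    curl -H 'Accept: text/html' https://csvbase.com/calpaterson/opcodes-6502  # web page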


> My initial reaction was also "if I can talk to your software, that's an API."

I've long held that software ages far better if you just eliminate the ability of people to provide it input.


And users. We wouldn't have nearly as many issues if it weren't for those darned users...


PEBKAC vulnerability still unpatched all these decades later.


Not much use then either; it's like being encased in resin. Limit the ability to provide input, maybe?


It just transfers part of the complexity to “I have to know this will send CSV” and “which flavor of CSV?” and “what will the CSV columns be?”, and so on. In the end, you still have to know the format, and, as opposed to a programming-language library API, it’s less discoverable, tends to be less documented, and is mostly untyped (a Python API could at least be gradually typed).


I personally thought the author was pretty clear in the article. My interpretation is that there is no `csvbase-client` API for a user to learn, manage, or even import. The end user continues using their preferred dataframe library, but now has access to data via the "csvbase://" URI as if it was a csv file on their local file system.


It seems to mean no API that is new, unfamiliar, or something to that effect for the user to learn. You can use the Pandas, Polars, etc. APIs that you are already accustomed to.


Have you never used a client library that provides its own API that wraps the web API? Something like Stripe, where you have domain objects and helper functions and other incantations that are layered on top of the bare http?

That's what the author is saying they don't have. When you import their client library you just have a new URI scheme you can use anywhere that accepts a URI, not layers of extra classes and methods to learn how to interact with.


Wouldn't it be better to just register http as a protocol with fsspec and use

    http://csvbase.example/username/dataset/table
instead of creating a whole protocol handler just for an online service?


http is already registered as a protocol and you can indeed do that already. It's printed on every table page. eg on https://csvbase.com/calpaterson/opcodes-6502:

    import pandas as pd; opcodes_6502 = pd.read_csv('https://csvbase.com/calpaterson/opcodes-6502', index_col=0)
But writing back is harder. It's not easy to make pandas do an HTTP PUT and then insert the HTTP basic auth and so on. Plus (not discussed in the article) there is a cache to avoid redownloading the data when it hasn't changed, which is my personal bete noire in "data science" such as it is.
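With the fsspec backend, the write path looks just like the read path (a sketch; 'yourname' is a placeholder, and you'd need csvbase-client installed with write access to that table):

    import pandas as pd

    df = pd.read_csv('csvbase://calpaterson/opcodes-6502')
    # to_csv goes back through the csvbase:// filesystem, which handles
    # the PUT and the auth
    df.to_csv('csvbase://yourname/opcodes-6502')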


Ah, now I understand the motive! I wasn't thinking about writing back to csvbase at all.


What happens if N projects attempt this?


This is a great example of digging into existing functionality, understanding how it works, and building on it. I've seen way too many examples of re-inventing the wheel just because someone wasn't willing to put in the effort to research what's already out there. This is the type of example I would share with my junior developers on how to do good dev, especially in an older enterprise codebase.


I think the thing that prevents efficient research is the presentation of what is already there. There are lots of great things presented very poorly or left undocumented. Anybody developing should worry much more about clearly explaining and presenting what is out there, rather than building amazing features that nobody will use because they are not aware of them.

But this must be a management vision and effort; if you are always pressed just to deliver something working, you will end up with lots of people reinventing the wheel internally.


I don’t know if the post lives up to its title, but it’s a very nice introduction to a little-known piece of Python machinery


Those examples in the article are what I would consider APIs. It is just that they are written in a way that is _composable_, and the primitives are scoped in a way that makes composing them versatile. Having a library that facilitates that composition allows for an ecosystem.

The Rack protocol did something similar for http servers in the Ruby world, allowing for a number of middleware. Whole frameworks (Rails, Sinatra, as examples) could be mounted on specific routes.


> For whatever reason, fsspec is not that well known. It has less than 800 stars on github.

Sorry, I was the 800th, your post is now outdated!


From a low to low-mid developer, I feel like this is a great introduction to (or at least a good reason for) inheritance (from a 10k-foot perspective) and why it's awesome


After reading the article and the comments, I’m still not sure if TFA means “API” in the HTTP/REST sense or in the programming-language library sense.


When I started using the www the term "API" referred to functions in programming languages not to httpd configurations. It has been bizarre to see the appearance of this terminology to refer to websites.

In every instance I have seen, the free "web API" involves an extra HTTP header(s) that, in lieu of a common one such as "User-Agent", can be used to track, rate-limit and/or selectively block a www user.

The upside of the "web API" idea IMO is the serving of public information in formats other than HTML or PDF. It's great.

But why not just do this without using the extra HTTP header(s), tracking, and limitations?


serverless - someone else's servers

adaptors - someone else's APIs


Returning CSV is an API though. A terrible one, as "CSV" means a dozen different ways of encoding the special cases/escaping in the columns


The “just” swear jar’s overflowing for this article.

It sounds like an API and it looks like an API, but don’t let that fool you…


> On csvbase, you can just curl any table url and get a csv file.

that's still an API


Yes, though as I try to explain in https://news.ycombinator.com/item?id=39993176 the bit I was trying to get at is that for `pd.read_csv('csvbase://calpaterson/opcodes-6502')` there is no API surface at all. And you can do writes with `pd.DataFrame.to_csv`. I personally think that is fun.

PS: Thanks for all of your work over the years (decades?) on sqlalchemy. I'm sure you're not too surprised to learn that the lion's share of code in csvbase is calls to SQLA core.


> I'm sure you're not too surprised to learn that the lion's share of code in csvbase is calls to SQLA core.

Really! A bit surprised, sure - wasn't really sure what the tool was actually doing to persist data :)


Why not use the http link to the csv data?


How is that not an API?

It’s offloaded the work to an adapter.


Confession: I hate that. Hate it. It does fiddly magic at package installation time and then there’s zero indication that anything’s happened. It’s magic. When I’m looking at the end result, there’s no hint whatsoever at how those new URLs are being handled. Did Python somehow learn csvbase://? Pandas? Where’s that coming from? How do I fix it if it breaks?

Please don’t do that.

Edit: Explicit patching would be just fine, like:

  from my_project import csvbase_patch
  csvbase_patch()

  p = pandas.read_csv("csvbase://…")
Then there’s a giant indicator inside the same module that there’s magic happening. If I cmd-f “csvbase”, the magic string in the URL, I’ll stumble across that patch function. Then I can jump to its definition to see what’s happening.

This is Python. Hidden monkeypatching isn’t how we do things.


I generally agree: explicit is better than implicit, magic can be dangerous.

But the author is not doing any monkeypatching or changing how pandas works. He is using fsspec [0] to create a filesystem interface that pulls data from his site. fsspec appears to be somewhat standard since pandas, polars, Dask, and other libraries use it.

As soon as I understand that fsspec exists and this library uses it, there is no more magic. I would prefer if the specification of which fsspec backend to use were not embedded as part of a string, but overall this approach seems pretty reasonable.

[0] https://filesystem-spec.readthedocs.io/en/latest/


It’s still magic, just implemented with a shared crystal ball. If I learned about fsspec at 2AM during a prod outage via reading tracebacks instead of reading code, I’d be a font of creative profanity.

The idea is nice! I’m not saying fsspec is bad. It’s not. It’s neat! I just strongly abhor the idea that pip installing a package changes runtime behavior whether or not that package is ever even imported.


Yeah, I'd prefer the pandas call look like

    from csvbase import CSVBASE_FS
    pd.read_csv('//calpaterson/onion-vox-pops', fsspec=CSVBASE_FS)


I vastly, hugely, enormously prefer that pattern. It’s not as pretty as the blog post, but it’s so explicit and easy to understand at a glance. If I’d never heard of csvbase or fsspec before seeing that code, I’d still know what to look for and how to debug it.

At 2PM when I’m well rested, caffeinated, and alert, the original code is clever and lovely. At 2AM, I want this kind of easy to understand explicitness.


Looks like explicit fsspec usage is possible:

    from io import StringIO

    import fsspec.implementations.local
    import pandas as pd

    fs = fsspec.implementations.local.LocalFileSystem()
    with fs.open('test.csv', mode='r') as fp:
        print(pd.read_csv(StringIO(fp.read())))


If fsspec is an interface, then he is using an API and the whole blog post is just a pile of clickbaity BS.


Yes, he is using an API. Even using curl over HTTP is an API, strictly speaking.

But, for simple use cases, what he is doing beats building yet another client library or defining REST or RPC endpoints.


He is defining an HTTP response based on the Accept header, no?

The resource is represented by the URL, no?

What else is needed for this to be called a "REST API"?


> This is Python. Hidden monkeypatching isn’t how we do things.

The author didn't make this pattern up; it's how fsspec officially recommends implementing backends [0]. Given how widely used fsspec is I think it's fair to say that hidden monkeypatching is how we do things... sometimes.

[0] https://filesystem-spec.readthedocs.io/en/latest/developer.h...


Monkeypatching, DLL injection, JAR classpaths - huge sources of pain. It is how it's done, but in certain contexts any benefit is immediately outweighed as the bugs start accumulating.


Right, in certain contexts (I would even say most contexts) it can be a problem. However, it's not obvious to me that this—a well-defined interface for injecting plugins that's specified by a library that functions as low-level infrastructure—is one of those contexts.


It needs to be scoped to the dep so conflicting deps have independent dependencies.

If foo 1.1 takes a dep on Bar 1.2, and baz 2.3 takes a dep on Bar 1.4, both Bar contexts are tested with their respective libraries. Forcing Bar to a particular global version can have problems, and both versions of Bar are needed to have tested behavior.

The examples mentioned, and the OP's approach, change global behavior, vs. proper dependency management.


I like Go's approach to this same idea. The standard database/sql library provides the standard API and the individual database drivers implement their own backend. You can use the URI connection string for your database (i.e. postgres://...) though only after including the driver in your file's imports. There's even the idiomatic underscore prefix on the package import to note that you're only importing it for how its presence affects another package. Unfortunately there's no way to say which package you're affecting, but it's still better than hidden changes.


So basically JDBC? :) I think a similar approach is used with the crypto provider API and some others

IIRC (in JDBC) you also used to have to do `Class.forName("name.of.it")` somewhere before trying to do any DB access, to ensure that the static initializers had actually run, but I don't believe it's necessary anymore

(And then of course you have Spring Boot autoconfiguring which is another level of magic up, using automatic subclassing and proxy injection to add things like transaction management. And then you can get into proper classloader hackery)


Essentially, yes. And you need to register before use in Go too since reflection is far more limited in Go - often this is done at init time, so you just import the package that does the registration [somehow], but ultimately you just have to do it before you use something: https://pkg.go.dev/database/sql#Register

The "data source name" string when connecting is... basically a JDBC connection string, and some adapters use exactly that iirc, but it's fundamentally an unstructured string that just serves the same purpose. Plugins can use anything they like, and style varies.


I'm sorry you're not a fan :)

I do take your point - to an extent.

I think the subject you're touching on is basically that of configuration. Configuration is what allows you to change the behaviour of code without making a code change. Configuration is of course really powerful, for good and evil. Someone below mentions ODBC which, yes, 100% is a great example. But the one that sticks in my mind is resolv.conf.

I would contend (:)) that csvbase-client is not doing "fiddly magic at package installation time" but supplying configuration for a url scheme. In much the same way that installing boto3 does for using the `s3://` url scheme.

"Magic" in my book would be monkeypatching the methods of other libs. I would hate to write `csvbase_patch` and have it execute on the import of a specific module. Like you, I prefer to live in a low-magic fantasy world.


To be super clear, I'm not a fan of this specific thing. The rest of csvbase may be the best thing ever!

This kind of configuration feels very magical in that it's completely behind the scenes. There's nothing in code I can look at that says "this URL should be handed to this adapter". There's no env var that gets slurped in. No .env file that's parsed. There's nothing checked into git that tells me how that URL scheme could possibly work, except the existence of a package in pyproject.toml -- one that's not even imported, just present.

That makes it stateful in a bad way. If `poetry install` (or `pip install -r requirements.txt` or whatever) didn't complete, the module using it will still load and run up until the point that it crashes because that URL doesn't load. If I ran an automated process to prune unused modules from poetry/pip, suddenly behavior changes. I don't like that idea one bit.


> This is Python. Hidden monkeypatching isn’t how we do things.

It kind of is, though. This isn’t the first time I’ve seen this; off the top of my head, I believe the Stackdriver libraries do something similar but to the stdlib logging library.

The fact that it’s possible is also a tacit endorsement of doing things this way. Many languages just don’t permit adding to or modifying a namespace.

I would argue monkey-patching is a bad pattern in general, regardless of whether it’s manual or automatic.

IMO, fsspec should have a private package-level variable with a list of these adapters, and expose a ‘fsspec.register_adapter(adapter)’ method to add things to it. I don’t see a need or use in patching here.

It just seems fraught with issues that could be avoided with an adapter registry. Testing seems easier too; I can’t imagine the pain of unit testing a bunch of adapters if they each try to modify the fsspec package directly.


Do you get upset when you see one of create_engine("postgresql://"), create_engine("mysql://"), or create_engine("sqlite://")?

The author is just adding a new backend for a common protocol in the data world. Pandas, Polars, Dask, DuckDB, etc. all support this protocol and the type of people who want to access a dedicated CSV data archive would probably much rather keep their current client API and just add a new connection URI string vs. adding an entire client API for just one data source (or dealing with making requests and passing the data into the dataframe).


I didn't say it was uncommon, just that it's a bad idea.

There's no need whatsoever for a separate client API. There could be a convention like:

  from csvbase import loader as csvloader
  df = pd.read_csv(csvloader("calpaterson/onion-vox-pops"))
The user wouldn't have to know anything but what to import to fetch a certain thing, and it's explicit about what's coming from there. There's also less risk of mistyping the URL string ("oops, I just typed cvsbase and accidentally loaded a list of CVS drugstores"), and code completion can tell you that csvloader() fetches things through the csvbase module.

Cons:

- It takes 5 seconds longer to type, one time.

Pros:

- You can tell what the code does at a glance.


You're missing the point. It's not about the use of bespoke URIs. This is the debate on "convention over configuration" revisited. Conventions are cute and always look clever while you're reading documentation. But they're a pain during maintenance, as they force others who don't know them to first go read the doc, instead of simply inferring from the previously researched and clearly stated declarations how all the wires connect.

When I see `pd.read_csv("csvbase://")` during debug, I wonder how pandas knows to speak to csvbase (as the article anticipates). Nothing is imported. Nothing is configured. Things just speak to one another. So, can I also call pd.read_csv("other_csv_server://") like this? When I replace pandas with koalas, will koalas.read_csv("csvbase://") also work? How the wires connect between pandas and csvbase is hidden. Unless you know that the two are obeying some implicit lower layer (the fsspec standard), this becomes a mystery. Mysteries are the last thing you want when debugging.

I don't know which `create_engine()` function you're alluding to. The one I know and have used comes from SQLAlchemy. How it works has always been obvious. I've never seen any mention of fsspec. I looked at its code and it's predictably just a convenient syntax to specify connection information in a single string. The string is simply parsed to extract connection attributes, which are then relayed to the lower DBAPI. There's no mystery involved.


> Hidden monkeypatching isn’t how we do things.

The original authors of several codebases I've been saddled with over the years would disagree with you.

This is Python and you are a peasant. Yes, most of the time the snake along the road is a snake, but occasionally it's a Basilisk waiting to destroy you with its gaze. Knowing this, you, the weary traveler, carry inconvenient tools such as mirrors to protect yourself and are well versed in superstition and arcane dark arts. You never let your guard down, because yes, there is magic afoot.


It's not like "Zen of Python" is an inviolable law, but things that break it are not, in my opinion, "Pythonic". The top 2 aphorisms are:

"Beautiful is better than ugly."

I think this is ugly. It's visually appealing, but my editor doesn't know that "csvbase" is a magic symbol that selects the code path to follow. That deep ugliness outweighs any skin-deep cleverness.

"Explicit is better than implicit."

That sums it up. This is implicit. `pip install foo` changing runtime behavior of packages that never import foo is as implicit as it gets. From a developer's and a security engineer's point of view, I loathe that code is running simply because it exists.

It's possible to do these things. I contend that it's not Pythonic, and we should not be doing them.


It's absolutely terrible. My first job out of college I was basically only fixing bugs in a Python codebase. 99% of the bugs were from monkey patching and null pointer exceptions where people were trying to access properties on objects in places where they were not instantiated yet.

People that write code like this should be forced into bug jail like I was for a year.


Disagree here. JDBC, ODBC: standards are great. There are old tools that can connect to generic databases because of said standards, you just need the right JDBC URL, or perhaps in Windows, ODBC. It's what makes interoperability work amazing things without having to recompile your code.

"hidden monkeypatching" is essentially how all of Java works.


Lacking much Java experience, I’ll accept this as true.

If so, that’s another reason I’m uninterested in Java. Magic behavior changes based on the prefix of a string passed into a function doesn’t appeal to me at all. The Python equivalent for a database might look like (typed from memory):

  from psycopg import connect
  conn = connect(host, username, password, dbname)
Where connect() returns a database adapter object with DB-API methods. When something breaks, I can see which DB code is involved. It’s right there in the import. My IDE doesn’t have to parse strings to know which code path subsequent method calls are following.


And it worked out super well for everybody in 2021 :)


> Explicit patching would be just fine

Not fine. Just less horrible.


Thanks for the edit. I agree it should be explicit and then I actually think it's quite a nice solution.


Please, for the love of god, listen to this commenter. I’m coming from the Ruby side of things, and hidden monkeypatching will be the death of us all.


When I wrote “This is Python”, I resisted the urge to finish with “not Ruby”.


I was going to say "this reminds me of Ruby". :-D



