Self-Directed Pandas Crash Course (kellyfoulk.herokuapp.com)
109 points by luu on Feb 6, 2021 | hide | past | favorite | 41 comments


Pandas is basically impossible for me to use without dozens of Google searches, even after being familiar with it for over 7 years. Of course, I don't use it daily, but it's one of those pieces of software with a very non-intuitive API. Does anyone else find it difficult to use, or is it just me?

In particular, I find this answer infuriating [1]. I've come across it so many times. Look, I have a CSV with 200 rows and I need to loop through them in the most intuitive way. Sure, it's not optimal, but I don't need fast code. I have a mental model of how to modify this dataframe. Let me do it, please.

[1] https://stackoverflow.com/a/55557758


I understand where you're coming from (yes yes there's a theoretically "good" way of doing this, but come on, why can't I just do the simplest thing), but I also sympathize with the spirit of that SO answer (although I agree that its presentation is wanting).

Pandas is really a DSL unto itself and is heavily influenced by R, where the same dynamic happens. Programmers coming from a background where procedural control flow constructs are basically second nature bump up against statisticians for whom array-based programming (in the form of overloaded mathematical notation acting on both scalar and vector values) is second nature.

R and pandas are both very array-oriented programming languages (the most extreme example of this might be early-era APL) and it's really going against the grain to implement things with explicit iteration.

It's kind of like trying to program in Python without using loops or list comprehensions and asking how to do everything with recursion. You can... but someone is bound to point out that doing everything with recursion (and the concomitant trampolines to prevent stack overflows) is not the Pythonic way.

(Also separately @dang, I feel like I'm running into a very minor bug with time stamps, where when I'm composing this reply I get "9 hours ago" for systemvoltage, but in the main thread I get "3 hours ago")


You’re making a great point there about the procedural mindset being applied to array programming. But the thing is, I feel like array-based programming should lend itself naturally to functional approaches. And Pandas does do this to an extent.

My problem is that this is super inconsistent. Some things are done as a method call on an object, others by passing the object to a pandas function and others yet by passing a function to a method on an object. This is the major source of frustration for me.

Maybe there is some logic to it, but I haven’t found it yet, and I think that is a sign of bad API design. It’s like PHP to me: all nicely documented, but useless without Googling everything.
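To make that inconsistency concrete, here's a small sketch of the three styles side by side (toy DataFrame, nothing from the article):

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})

# Style 1: a method call on the object
sorted_df = df.sort_values("a")

# Style 2: passing the object to a top-level pandas function
stacked = pd.concat([df, df], ignore_index=True)

# Style 3: passing a function to a method on the object
doubled = df["a"].apply(lambda x: x * 2)
```

Three different call shapes for three equally "table-ish" operations, which is exactly the pattern being complained about.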


SQL is more natural to me for this sort of declarative DSL. Is there a Pandas-like package that can use in-memory tables and accept SQL queries?
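The standard library's sqlite3 module gets part of the way there with an in-memory database; a rough sketch (table and column names invented):

```python
import sqlite3

# An in-memory table queried with plain SQL -- no pandas required
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, miles REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("denver", 12.5), ("boulder", 30.0), ("denver", 7.5)],
)

# Declarative aggregation instead of a groupby call chain
rows = conn.execute(
    "SELECT city, SUM(miles) FROM trips GROUP BY city ORDER BY city"
).fetchall()
```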

Regarding `pd.read_sql()`: that's totally awesome.

I realize that Pandas is a very useful tool for many thousands of developers; I only have a problem with its interface. Obviously, I've been using it for many years for a reason!


DuckDB is a really promising project for just that: https://duckdb.org/


That's freaking cool, thanks for sharing.


I actually don't like Pandas, but I think it's pretty obvious that the array-language style is an incredibly powerful way to manipulate blocks of multidimensional data. People here are complaining about not being able to use for loops, but why would you want to use them in the first place? Say you want to add two vectors together. Looping over each index, adding the elements, and assigning them to a third vector one by one is not only inexpressive (and, like everything else written in Python, computationally inefficient), but a conceptually inappropriate solution to the problem.

You are operating on n-dimensional arrays, not elements, so you need to write code that expresses that intent.

Anyone who thinks that anything besides writing "a + b" to add two matrices together is a good or simple solution is crazy.

Operate at a higher conceptual level. Don't use loops. Transform and compose your data, not your datums.
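The contrast, sketched in NumPy terms with two small made-up vectors:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Element by element: verbose, and conceptually at the wrong level
c_loop = np.empty_like(a)
for i in range(len(a)):
    c_loop[i] = a[i] + b[i]

# Array level: "a + b" expresses the intent directly
c_vec = a + b
```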


Fortran has had array operations since the 1990 standard, but I believe that programmers who are familiar with that functionality still often use loops over array elements. I do. Not all calculations map conveniently to array operations, and sometimes you can exit the loop early when a condition is met. It's a benefit of a compiled, statically typed language that you can use loops without suffering a speed penalty.


For folks doing this everyday, absolutely. For anyone just learning, it's Yet Another Thing to learn. Sure, the complete programmer will be able to think in better abstractions, but most people fail to reach that level. In part because of the complexity of early stage learning.

It's a balance, and I think pandas is on the more complex side of things.


If you don’t want the conceptual model and features pandas provides, is there a reason you don’t just use regular Python without pandas?


Pandas has really great ingesting / exporting helpers, is easy to chart, and lots of tutorials reach for it first.

Those might be more excuses than reasons, but it’s my experience.


It's a conceptual spreadsheet or table. Why wouldn't a read-write (element values) row iterator always be available? SQL tables can do it and numpy.ndarrays can do it.


100% my experience whenever I work with it, and I've been working with it on and off for about five years. I get its appeal -- it fits that fuzzy place where a database is too heavy, but vanilla python is cumbersome... but damn, is it tough to work with. Almost without fail, I end up scrapping the pandas code in favor of vanilla python, either from usability issues or from yet another library's false promise that it works well with dataframes. So many hours lost in using pandas.

That said, one of its best features, and probably the only thing I use it for these days is, `pd.read_sql(sqlstr, conn).to_csv(fp)`. This is far less cumbersome than using psycopg2.
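A self-contained sketch of that pattern using an in-memory SQLite connection instead of psycopg2 (table name and query are invented):

```python
import io
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.execute("INSERT INTO users VALUES ('ada', 36), ('alan', 41)")

# Query straight into a DataFrame, then straight out to CSV
buf = io.StringIO()
pd.read_sql("SELECT * FROM users ORDER BY name", conn).to_csv(buf, index=False)
csv_text = buf.getvalue()
```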

Edit: for charting these days, similar to OP's visualizations, I highly recommend vega-lite.


Funny that the example you give is one of the (many) frustrations with Pandas.

It’s ‘read_<format>’ to read something in, but ‘to_<format>’ to write it out? In which world is this intuitive? Surely it should be read/write or even from/to?


It's `pandas.read_<format> -> DataFrame` and `DataFrame.to_<format>`. So pandas reads a data format and gives you a dataframe, while the `to_<format>` methods are instance methods of the DataFrame that write it out to a format. To me that makes sense.
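The asymmetry in code, round-tripping through a string buffer:

```python
import io
import pandas as pd

csv_text = "x,y\n1,2\n3,4\n"

# Module-level reader: format in, DataFrame out
df = pd.read_csv(io.StringIO(csv_text))

# Instance-level writer: DataFrame in, format out
out = df.to_csv(index=False)
```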


Sure, you can create a sort-of logic to fit the existing naming, but it's still something you have to learn.

Imagine you were using Pandas for the first time, working through a tutorial, and you've just learned that pd.read_csv reads a csv into a dataframe.

Intuitively, what would you expect the corresponding output function to be called? I'd hazard that a vast majority of people would guess at some version of write_csv, and would experiment with either pd.write_csv, or df.write_csv.


Pandas isn't great, but I find it significantly nicer to use than normal Python. Why do you want to write 100 lines of dumb loopy code when you can probably solve the same problem with a couple of lines of Pandas?

If you really just want to use loops and stuff (which I would discourage) just use a list or a dict or something.


As you indicate, Pandas is effectively another language, distinct from regular Python. E.g. it doesn't really use loops. So I think what irritates a lot of people about pandas is that they think "Cool. I can solve this complex problem in a couple of lines of Python code with pandas." then get irritated when they find that the couple of lines of pandas they need to use is "Incomprehensible pandas gibberish" instead of the familiar Python code they were expecting.


"If you really just want to use loops and stuff (which I would discourage) just use a list or a dict or something."

Like my post said, I favor vanilla python over pandas... so yes, I use lists, dicts, and somethings. FWIW, though, my workflow pushes everything into postgres, and things that would normally go into pandas are instead accessed through SQL with helper functions.


Same here. Why re-invent SQL as a weird object system?

I've also recently found that using the sqlite3 command line tool greatly increases my productivity when doing data-sciency stuff. It's a super fast, super simple way of making sense of CSV data, especially for the selecting and joining operations that are so unintuitive in pandas. Once that's done, I can dump the results into Jupyter or RStudio for transformation and/or visualization.

Another personal productivity win I've discovered is using two-line python scripts to write really long and repetitive SQL commands (select count a, b, c... from huge_table where d... and e... and f...; etc.) and then running them in SQLite.
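A minimal sketch of that trick (column and table names invented):

```python
# Generate a long, repetitive aggregate query instead of typing it out
cols = ["a", "b", "c", "d"]
selects = ", ".join(f"COUNT({c}) AS n_{c}" for c in cols)
sql = f"SELECT {selects} FROM huge_table WHERE a IS NOT NULL;"
```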


It has happened that I’ve searched StackOverflow with Pandas questions and found an answer written by me.


Here are two options which various people might find intuitive:

1. Make a good old numeric for loop like "for ii in range(len(df))" with df.iloc, df.loc, etc. (df.ix is long deprecated).

2. Use df.apply() to create a new DataFrame with your changes.

Both of these are mentioned in brief answers to that SO post. But not in the accepted answer. A lot of the answers focus on the most efficient ways to do things, even though the question was very basic and did not imply the data were large.
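Both options in miniature, on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2, 3]})

# Option 1: plain positional loop with .iloc
total = 0
for i in range(len(df)):
    total += df.iloc[i]["n"]

# Option 2: df.apply over rows to build a new column
df["doubled"] = df.apply(lambda row: row["n"] * 2, axis=1)
```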


It's not even about performance: writing vectorized code is just superior. The array language model is very powerful, and you can literally write a couple of lines of code that would take hundreds of lines of normal Python.

That being said, Pandas isn't a very good array language. I really like kdb+/q and find it better and more expressive than almost any other language I've used.


I'm in this boat, and I'm decent at SQL and find the Numpy API to be pretty intuitive, so I'm not sure what it is.


This is reassuring. Anytime I use pandas I’m like wtf? (First encountered it like a year ago)


it's shit


I just read the GitHub repo mentioned in the article and it looks like gibberish to me.

Pandas to me are still big lazy black and white bears eating bamboo, until someone can point me to something more intelligible.


Khuyen Tran has a nice collection of bite-sized tricks and tips for learning Pandas on her blog: https://mathdatasimplified.com/page/1/?s=pandas https://mathdatasimplified.com/page/2/?s=pandas https://mathdatasimplified.com/page/3/?s=pandas

It's not like I'll remember all the syntax, but good to know what tools/tricks exist.


I like the outline you have here for a crash course on Pandas. I've been tinkering with it myself off and on for a while, but have been wanting to really dig in lately for the same reasons you mention. Just a couple of nitpicks:

1. The style of your website makes it pretty much impossible to tell whether a bit of text links anywhere. I only figured it out after clicking on the word "here" in the first paragraph, then coming back and clicking on essentially everything. And none of that happened until I visited the site on my desktop instead of my phone and slowed down to read things carefully.

2. It would be great to get a link to the denvergov.org data set, or corresponding area of the site.


Thanks! I enjoyed working through the 100 Pandas Puzzles repo and would recommend it - I've found myself going back to reference my answers many times. Thanks for the critique on the links, I hadn't realized how hard they were to see until now. I'm in the middle of refactoring the site and I'll add that to the to-do list.


He links it in the first sentence under Resources. As you mention, it's a "here" link with no styling.

One shouldn't have to, but FWIW the HTML of his page is very clean so you can see all of the links by viewing the source.


*She



Subscribed. Thanks for putting together this quality content. I was checking out your merge-dataframes video the other day, which I was led to from a Stack Overflow post. I'm not a novice with Pandas, but I find that things stick better after repeated exposure, and from different sources at that. The various ways of using merge/join/concat can be tricky to keep straight at first. I will be sure to dig deeper into your content in the near future, including the series of videos on Numpy.


Not mine, I'd have mentioned if it was my resource.

I saw it on: https://www.reddit.com/r/Python/comments/lain0r/hey_reddit_h...


I'm delighted at the multiple ways one can parse the headline!


Well that’s apt for Pandas: Eats, Shoots & Leaves

https://en.wikipedia.org/wiki/Eats,_Shoots_%26_Leaves


Yea :)

  * Self-Directed, Pandas Crash Course
  * Self-Directed Pandas, Crash Course

Rewriting it as "Self-Directed Crash Course on Pandas" would eliminate the ambiguity.


Good point, I'll add that to my to-do list


I think Corey Schafer's tutorial[1] is a great place to get started. His voice is clear even on 2x and he doesn't bore me.

[1] https://youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDC...


I love all of his tutorials, good recommendation



