mywaifuismeta · on May 1, 2022

Can someone explain what the purpose of duckdb is? From what I gather from the docs, it's a column-oriented in-memory SQL database? So I would use it over something like in-memory sqlite simply because it's faster due to vectorization? Or are there other aspects that I'm missing?

akdor1154 · on May 1, 2022

Like SQLite, it's not necessarily in-memory, it has an on-disk format as well. I haven't given it a massive workout for bigger-than-memory workloads but in theory they are possible.

With that said, the resident memory size of stuff-in-parquet/arrow/duckdb is a lot less in practice than stuff-in-pandas (I know) and stuff-in-sqlite (I believe), so it still enables more in-memory workloads than you might otherwise be able to do.

wmwmwm · on May 1, 2022

I really like that you can use duckdb to run SQL queries against an on-disk directory structure of parquet files with no preloading into RAM or other db formats. Super useful/quick and only one line of code! Instead of selecting from a table you just pass a glob pattern into the SQL, and since it's a one-liner I can use it in ad hoc notebooks too. It even has a to_df() method on the query result so you can get it into pandas for further manipulation.

kthejoker2 · on May 1, 2022

As you noted, it's column-oriented, so designed for OLAP workloads rather than OLTP workloads. Think data warehousing, BI, Big Data analytics, feature engineering ...
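
A toy sketch of why the column-oriented layout suits OLAP-style queries, in plain Python. This only illustrates the idea and is not how DuckDB stores data internally; the table and its values are invented.

```python
# Row-oriented layout: each record is kept together, which suits OLTP-style
# work such as looking up or updating one order at a time.
rows = [
    {"order_id": 1, "customer": "ann", "amount": 20.0},
    {"order_id": 2, "customer": "bob", "amount": 35.5},
    {"order_id": 3, "customer": "ann", "amount": 4.5},
]

# Column-oriented layout: each column is kept together, which suits
# OLAP-style work such as aggregating one column over many records.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["ann", "bob", "ann"],
    "amount": [20.0, 35.5, 4.5],
}


def total_amount_rows(records):
    # Has to walk every full record even though only one field is needed.
    return sum(record["amount"] for record in records)


def total_amount_columns(table):
    # Reads only the "amount" column; the other columns are never touched.
    return sum(table["amount"])


row_total = total_amount_rows(rows)
column_total = total_amount_columns(columns)
print(row_total, column_total)
```

Both functions return the same total; the difference is how much unrelated data each one has to touch to get there.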

mywaifuismeta · on May 1, 2022

But isn't it in-memory? In my experience, data warehousing, BI, analytics, etc. are usually defined by large data in a data lake or warehouse that doesn't fit into memory. How would duckdb help here?

1egg0myegg0 · on May 1, 2022

DuckDB can certainly handle larger than memory data! It's a streaming engine, so it doesn't need everything in memory all at once. Some operators (like sorting) can also buffer out to disk if you run low on memory mid-computation. We are working to expand that to more operators! It also can selectively read from parquet for only the columns and row groups that you need. (Disclaimer - I do documentation for DuckDB!)

FridgeSeal · on May 1, 2022

No, it can query larger-than-memory datasets.

mywaifuismeta · on April 30, 2022

I'm also a grump and I know it's useless when the word "superpower" is in the title. I'm just allergic to that word and anyone who uses it immediately loses credibility.

mywaifuismeta · on April 30, 2022

I'm not part of this group myself, but I've seen this quite a lot as people get older. My anecdotal evidence is that it's less about becoming a "believer" in the supernatural, and more about realizing that many of the traditions and practices can be useful for personal happiness, mental health, and sense of belonging, regardless of the science behind them. Praying is not so different from e.g. meditation or yoga, just a different type of practice.

mywaifuismeta · on April 25, 2022

Twitter is just another social media fad that will go away and be replaced by the next one, Musk or not. I don't think anyone can "solve" that.

CydeWeys · on April 25, 2022

The timelines here are pretty long. Facebook has been around for 18 years, Twitter for 16. That's more than a "fad". These companies are lasting as long or longer than other tech companies, certainly longer than your average start-up.

mywaifuismeta · on April 26, 2022

Except that Facebook is nearly irrelevant these days. Facebook is relevant only because of its acquisitions of apps like Instagram, WhatsApp, Oculus, etc. Twitter doesn't have that.

fortran77 · on April 25, 2022

They all fall! It used to be quicker, but they all do. It's clear Facebook is fading. And it's a separate issue from anything "wrong" the platform does. The userbase simply ages, and kids don't want to be where their parents are. I have a Facebook account that I only use for high school reunions. My classmates post photos of their grandchildren, etc.

objektif · on April 25, 2022

Although I generally agree that most social media platforms are fads, I am not sure about Twitter. At least I know that it will stay around for a long time. It has done so despite all its problems so far.

mywaifuismeta · on April 13, 2022

> At the same time there seems to be a lot of anecdotal evidence that a lot of developers are now flocking to these crypto companies, enough that I think it's fair to assume it's likely true.

Crypto companies are overfunded by VCs and offer amazing packages, more than FAANG in many cases, plus the token upside. Who doesn't want to make a quick buck?

Also, it's pretty interesting technology that can be fun to work with. Even if it's useless, it scratches the itch of working on something deeply technical instead of the boring old web technologies.

antifa · on April 13, 2022

And if you ever get bored of your overpaid crypto job, just pretend to get hacked and walk away with the company's cold wallet.

mywaifuismeta · on April 12, 2022

Am I missing something or is this nonsense? The "Big Model" paper is a survey paper that summarizes the current state of the field, while the "original" paper presents a specific technique. The former, as a small part of the survey, describes what was presented in the original and uses the same language the authors use. Why wouldn't they? Why would they need to re-phrase the facts and methods the authors described themselves when all they're doing is giving an overview?

They are not claiming to have done something they didn't.

Same argument about copying super generic introductions like "Deep Learning has been successful at...." - does each author really need to come up with their own variation of widely accepted facts to avoid "plagiarism" ?

fwilliams · on April 13, 2022

To quote the article:

> But even putting aside the fact that claiming someone else's writing as one's own is wrong, the value in survey papers is in how they re-frame the field. A survey paper that just copies directly from the prior paper hasn't contributed anything new to the field that couldn't be obtained from a list of references.

Good survey papers can be important contributions in their own right (e.g. [1]). A good survey should contextualize works within a subject area with respect to each other and identify high level trends/ideas in that subject. These connections are not only useful for learning a topic, but also for positioning novel work or identifying under-researched areas to focus on.

If the authors felt that one of the papers they plagiarized concisely expressed what they wanted to say, they could simply quote and cite that work. Otherwise, it could be construed that the authors are claiming to be the ones drawing the conclusions they wrote. Moreover, from the article, the survey in question seems to be pretty egregiously plagiarizing, which deserves to be called out/shamed.

[1] https://arxiv.org/abs/2111.11426

mywaifuismeta · on April 13, 2022

I disagree with this:

> But even putting aside the fact that claiming someone else's writing as one's own is wrong, the value in survey papers is in how they re-frame the field. A survey paper that just copies directly from the prior paper hasn't contributed anything new to the field that couldn't be obtained from a list of references.

Whether or not a survey paper is "good" is irrelevant here. Yes, a survey paper that just lists other papers may be a bad survey paper, but it does nothing wrong as long as it cites the original papers, which this one does. A bad survey paper may not be published in a journal, that's what peer review is for, but there is nothing wrong with publishing it openly on the web.

And there is still value in aggregating other papers, even if it's just a list with descriptions. That's the reason why these "awesome-XX" GitHub repos are so popular. Time to hunt them down?

fwilliams · on April 13, 2022

If you look at the plagiarized language in the article, it seems as if the BM paper authors are claiming contributions (emphasis mine). Credit is a major currency in research, and it's important to give it where it is due. If someone did this with one of my papers, I'd be quite upset.

For example (Emphasis mine):

> The risks of data memorization, for example, the ability to extract sensitive data such as valid phone numbers and IRC usernames, are highlighted by Carlini et al. [41]. While their paper identifies 604 samples that GPT-2 emitted from its training set, we show that over 1 of the data most models emit is memorized training data. In computer vision, memorization of training data has been studied from various angles for both discriminative and generative models Deduplicating training data does not hurt perplexity: models trained on deduplicated datasets have no worse perplexity compared to baseline models trained on the original datasets. In some cases, deduplication reduces perplexity by up to 10%. Further, because recent LMs are typically limited to training for just a few epochs

mywaifuismeta · on April 13, 2022

Yes, I agree that's bad, but it looks like sloppy copy and pasting as opposed to intentional plagiarism to claim contributions. Would it have been okay if they said "they" instead of "we"?

fwilliams · on April 13, 2022

Then who is "they" in this situation? You need a citation!

bijjgi · on April 13, 2022

Yes, you are missing something. No, this is not nonsense. It’s considered academic plagiarism to copy paste text without indicating the passage is a direct quote. They should re-phrase the text. Not only is it academic plagiarism, it’s also poor writing, because the copied text is providing a different framing than the overview article, and because the copied text is in some places incoherent due to formatting (e.g. we show that over 1 [sic] of the data most models emit).

You are wrong, the Big Models paper does in fact claim to have done something that they did not, e.g. “We introduce two complementary methods for performing deduplication.” They do not introduce these methods. The text they are lifting did.

What you are missing is long standing norms around academic plagiarism and false claims in the Big Models paper (as a result of copy pasting language).

shihab · on April 13, 2022

This clearly falls into the federal government's definition of plagiarism.

(I'd know: I just had to do a mandatory course on research ethics for grad school, and exactly this scenario was used there as an example.)

frozenport · on April 13, 2022

>>summarizes the current state of the field, while the "original" paper is specific technique.

They copied the summaries of those techniques.

>>Why would they need to re-phrase the facts and methods the authors described themselves when all they're doing is giving an overview?

A review paper is a synthesis of ideas, and they copied the synthesis of other authors.

Ultimately, this is what plagiarism looks like in the case of a review paper.

godelski · on April 12, 2022

You should describe the same concepts, but you should use your own voice. A survey paper is not just copy pasting blocks of text and adding connecting language. A survey itself should have its own story and therefore your own voice. You're allowed to quote, sure, but what is shown is beyond a reasonable quotation and the "quoter" doesn't reference that the section was quoted.

mywaifuismeta · on April 10, 2022

Same thing. Incredibly slow and clunky for me as well. Seems like all tools go through this evolution because you know, somehow they have to get VC-sized returns instead of staying small and nimble.

I stopped using tools with any kind of lock-in or custom format because I know they will eventually degrade into something unusable.

mywaifuismeta · on March 29, 2022

Do you really not know anything about these topics or is this post a lie? :)

In case what you're saying is true, props to you for realizing yourself that you're doing something wrong. People stretch the truth sometimes and exaggerate to sound smarter, but in your case it seems to be more than that. You are making up facts (like living in South Africa) that are outright lies. I believe you may need help and should talk to a therapist about this. It may be a sign of something bigger that should be addressed.

mywaifuismeta · on March 20, 2022

How old are you? I used to be like you, but ever since I hit my mid-30s I started to find more fulfillment in books that are not engineering and science focused. Unless it's something directly relevant to what I'm currently working on, I would end up never using concepts from scientific books in real life and forget them after a couple of years. And if I really need to learn something relevant to my current work, I can do it "on-demand" anytime, without needing to be proactive about it.

On the other hand, books that change the way you think about life can have a more profound, long-lasting impact, even if they are just opinions.

Existenceblinks · on March 20, 2022

I am a couple of years younger than you. For non-technical stuff, I'm more comfortable with video format, and there are going to be several videos on the same topic, which is better than getting a narrow story from a single book. Podcasts are kinda OK too. Knowledge these days needs competition; with books, you're likely to buy just a single one on a given topic based on reviews. But for video or audio content, there are plenty!

mywaifuismeta · on March 15, 2022

You need to look beneath the surface. There is a fundamental difference. In Web3, not everyone can post pictures, only those who are rich. Furthermore, it is decentralized. Decentralized! No government can take away your cat or anime waifu picture! Web3 is also more than that. It is a platform. VCs love platforms. On this platform, anyone can create money and claim it's worth something! If you do this right, you can cash out before anyone notices! Kind of like overvalued startups sold to clueless buyers, but without all the middlemen.

hirako2000 · on March 15, 2022

It's a bit like saying that with the discovery of threaded cotton, we will soon replace whatever it is we used to make clothing and make everyone pay a premium for the new material, so that only the rich won't end up naked. Yes, some business models are money grabs with no real value added, but to represent the potential of web3, which itself is still unknown, this way is an oversimplification that focuses on the bad side of what's been happening in the space.

Don't brush the valuable use cases away. Guaranteed immutability, integrity, and even anonymity, all through trustless networks that guarantee incentives for the maintainers of those networks, can't honestly be so narrowly represented.

Fun preview, though, of what will be mistakenly built on networks such as Eth that are nowhere near scalable for this kind of use case.