Can someone explain what the purpose of duckdb is? From what I gather from the docs, it's a column-oriented in-memory SQL database? So I would use it over something like in-memory sqlite simply because it's faster due to vectorization? Or are there other aspects that I'm missing?
Like SQLite, it's not necessarily in-memory, it has an on-disk format as well. I haven't given it a massive workout for bigger-than-memory workloads but in theory they are possible.
With that said, the resident memory size of stuff-in-parquet/arrow/duckdb is a lot less in practice than stuff-in-pandas (I know) and stuff-in-sqlite (I believe), so it still enables more in-memory workloads than you might otherwise be able to do.
I really like that you can use duckdb to run SQL queries against an on-disk directory structure of parquet files with no preloading into RAM or other db formats. Super useful/quick and only one line of code! Instead of selecting from a table you just pass a glob pattern into the SQL, and since it's a one-liner I can use it in ad hoc notebooks too. It even has a to_df() method on the query result so you can get it into pandas for further manipulation.
As you noted, it's column-oriented, so designed for OLAP workloads rather than OLTP workloads. Think data warehousing, BI, Big Data analytics, feature engineering ...
But isn't it in-memory? In my experience, data warehousing, BI, analytics, etc are usually defined by large data in data lake or warehouse that doesn't fit into memory. How would duckdb help here?
DuckDB can certainly handle larger than memory data! It's a streaming engine, so it doesn't need everything in memory all at once. Some operators (like sorting) can also buffer out to disk if you run low on memory mid-computation. We are working to expand that to more operators! It also can selectively read from parquet for only the columns and row groups that you need. (Disclaimer - I do documentation for DuckDB!)
I'm also a grump and I know it's useless when the word "superpower" is in the title. I'm just allergic to that word and anyone who uses it immediately loses credibility.
I'm not part of this group myself, but I've seen this quite a lot as people get older. My anecdotal evidence is that it's less about becoming a "believer" in the supernatural, and more about realizing that many of the traditions and practices can be useful for personal happiness, mental health, and sense of belonging, regardless of the science behind them. Praying is not so different from e.g. meditation or yoga, just a different type of practice.
The timelines here are pretty long. Facebook has been around for 18 years, Twitter for 16. That's more than a "fad". These companies are lasting as long or longer than other tech companies, certainly longer than your average start-up.
Except that Facebook is nearly irrelevant these days. Facebook is relevant only because of its acquisitions of apps like Instagram, WhatsApp, Oculus, etc. Twitter doesn't have that.
They all fall! It used to be quicker, but they all do. It's clear Facebook is fading. And it's a separate issue from anything "wrong" the platform does. The userbase simply ages, and kids don't want to be where their parents are. I have a Facebook account that I only use for high school reunions. My classmates post photos of their grandchildren, etc.
Although I generally agree that most social media platforms are fads, I am not sure about Twitter. At least I know that it will stay around for a long time. It has done so despite all its problems so far.
> At the same time there's seems to be a lot of anecdotal evidence that a lot of developers are now flocking to these crypto companies, enough that I think its fair to assume its likely true.
Crypto companies are overfunded by VCs and offer amazing packages, more than FAANG in many cases, plus the token upside. Who doesn't want to make a quick buck?
Also, it's pretty interesting technology that can be fun to work with. Even if it's useless, it scratches the itch of working on something deeply technical instead of the boring old web technologies.
Am I missing something or is this nonsense? The "Big Model" paper is a survey paper that summarizes the current state of the field, while the "original" paper is about a specific technique. The former, as a small part of the survey, describes what was presented in the original and uses the same language the authors use. Why wouldn't they? Why would they need to re-phrase the facts and methods the authors described themselves when all they're doing is giving an overview?
They are not claiming to have done something they didn't.
Same argument about copying super generic introductions like "Deep Learning has been successful at...." - does each author really need to come up with their own variation of widely accepted facts to avoid "plagiarism" ?
> But even putting aside the fact that claiming someone else's writing as one's own is wrong, the value in survey papers is in how they re-frame the field. A survey paper that just copies directly from the prior paper hasn't contributed anything new to the field that couldn't be obtained from a list of references.
Good survey papers can be important contributions in their own right (e.g. [1]). A good survey should contextualize works within a subject area with respect to each other and identify high level trends/ideas in that subject. These connections are not only useful for learning a topic, but also for positioning novel work or identifying under-researched areas to focus on.
If the authors felt that one of the papers they plagiarized concisely expressed what they wanted to say, they could simply quote and cite that work. Otherwise, it could be construed that the authors are claiming to be the ones drawing the conclusions they wrote. Moreover, from the article, the survey in question seems to be pretty egregiously plagiarizing, which deserves to be called out/shamed.
> But even putting aside the fact that claiming someone else's writing as one's own is wrong, the value in survey papers is in how they re-frame the field. A survey paper that just copies directly from the prior paper hasn't contributed anything new to the field that couldn't be obtained from a list of references.
Whether or not a survey paper is "good" is irrelevant here. Yes, a survey paper that just lists other papers may be a bad survey paper, but it does nothing wrong as long as it cites the original papers, which this one does. A bad survey paper may not be published in a journal, that's what peer review is for, but there is nothing wrong with publishing it openly on the web.
And there is still value in aggregating other papers, even if it's just a list with description. That's the reason why these "awesome-XX" Github repos are so popular. Time to hunt them down?
If you look at the plagiarized language in the article, it seems as if the BM paper authors are claiming contributions (emphasis mine). Credit is a major currency in research, and it's important to give it where it is due. If someone did this with one of my papers, I'd be quite upset.
For example (Emphasis mine):
> The risks of data memorization, for example, the ability to extract sensitive data such as valid phone numbers and IRC usernames, are highlighted by Carlini et al. [41]. While their paper identifies 604 samples that GPT-2 emitted from its training set, we show that over 1 of the data most models emit is memorized training data. In computer vision, memorization of training data has been studied from various angles for both discriminative and generative models Deduplicating training data does not hurt perplexity: models trained on deduplicated datasets have no worse perplexity compared to baseline models trained on the original datasets. In some cases, deduplication reduces perplexity by up to 10%. Further, because recent LMs are typically limited to training for just a few epochs
Yes, I agree that's bad, but it looks like sloppy copy and pasting as opposed to intentional plagiarism to claim contributions. Would it have been okay if they said "they" instead of "we"?
Yes, you are missing something. No, this is not nonsense. It’s considered academic plagiarism to copy paste text without indicating the passage is a direct quote. They should re-phrase the text. Not only is it academic plagiarism, it’s also poor writing, because the copied text is providing a different framing than the overview article, and because the copied text is in some places incoherent due to formatting (e.g. we show that over 1 [sic] of the data most models emit).
You are wrong, the Big Models paper does in fact claim to have done something that they did not, e.g. “We introduce two complementary methods for performing deduplication.” They do not introduce these methods. The text they are lifting did.
What you are missing is long standing norms around academic plagiarism and false claims in the Big Models paper (as a result of copy pasting language).
You should describe the same concepts, but you should use your own voice. A survey paper is not just copy pasting blocks of text and adding connecting language. A survey itself should have its own story and therefore your own voice. You're allowed to quote, sure, but what is shown is beyond a reasonable quotation and the "quoter" doesn't reference that the section was quoted.
Same thing. Incredibly slow and clunky for me as well. Seems like all tools go through this evolution because you know, somehow they have to get VC-sized returns instead of staying small and nimble.
I stopped using tools with any kind of lock-in or custom format because I know they will eventually degrade into something unusable.
Do you really not know anything about these topics or is this post a lie? :)
In case what you're saying is true, props to you for realizing yourself that you're doing something wrong. People stretch the truth sometimes and exaggerate to sound smarter, but in your case it seems to be more than that. You are making up facts (like living in South Africa) that are outright lies. I believe you may need help and should talk to a therapist about this. It may be a sign of something bigger that should be addressed.
How old are you? I used to be like you, but ever since I've hit mid-30s I started to find more fulfillment in books that are not engineering and science focused. Unless it's something directly relevant to what I'm currently working on, I would end up never using concepts from scientific books in real life and forget them after a couple of years. And if I really need to learn something relevant to my current work, I can do it "on-demand" anytime, without needing to be proactive about it.
On the other hand, books that change the way you think about life can have a more profound, long-lasting impact, even if they are just opinions.
I am a couple of years younger than you. For non-technical stuff, I'm more comfortable with video: there are usually several videos on the same topic, which is better than getting a narrow story from a single book. Podcasts are kind of OK too. Knowledge these days needs competition; with books, you're likely to buy just one per set of topics based on reviews. But for video or audio content, there are plenty!
You need to look beneath the surface. There is a fundamental difference. In Web3, not everyone can post pictures, only those who are rich. Furthermore, it is decentralized. Decentralized! No government can take away your cat or anime waifu picture! Web3 is also more than that. It is a platform. VCs love platforms. On this platform, anyone can create money and claim it's worth something! If you do this right, you can cash out before anyone notices! Kind of like overvalued startups sold to clueless buyers, but without all the middlemen.
It's a bit like saying that with the discovery of threaded cotton, we will soon replace whatever it is we used to make clothing and make everyone pay a premium for the new material, so that only the rich won't end up naked.
Yes, some business models are money grabs with no real value added, but using them to represent the potential of web3, which is itself still unknown, is an oversimplification that focuses only on the bad side of what's been happening in the space.
Don't brush the valuable use cases away. Guaranteed immutability, integrity, and even anonymity, all through trustless networks that guarantee incentives for the maintainers of those networks, can't honestly be so narrowly represented.
Fun preview, though, of what will be mistakenly built on networks such as Eth that are nowhere near scalable enough for this kind of use case.