
uBlock Origin did it for me: https://ublockorigin.com/


TLP [1] covers this use case pretty well. I confess I didn't expect this to still be an issue these days, until I bought my latest laptop and had to run TLP on Kubuntu 20.04 to get battery life on par with the factory specs; TLP almost tripled it.

[1] https://wiki.archlinux.org/title/TLP


Here's a detailed and very interesting writeup about the development of the IBM PC: https://www.filfre.net/2012/05/the-ibm-pc-part-1/

The blog itself is great, providing detailed accounts of iconic products and business initiatives in the tech industry.


BOINC tasks are validated by running multiple replicas of each task on different hosts/users and comparing the results; credit, and hence GRC, is only granted for tasks with valid results: https://boinc.berkeley.edu/trac/wiki/CreditNew
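
As an illustration only (this is not BOINC's actual validator; the function name and quorum size are assumptions), the replication idea boils down to something like this:

    # Toy sketch of quorum-based validation: results from several hosts are
    # compared, and credit goes only to hosts whose result matches the
    # consensus. Real BOINC validation is far more involved.
    from collections import Counter

    def grant_credit(results, quorum=2):
        """results: list of (host_id, result_hash) pairs for one task."""
        counts = Counter(result_hash for _host, result_hash in results)
        canonical, agreeing = counts.most_common(1)[0]
        if agreeing >= quorum:
            # Only hosts whose result matches the canonical one get credit.
            return [host for host, r in results if r == canonical]
        return []  # no consensus, the task would be re-issued

    print(grant_credit([("a", "h1"), ("b", "h1"), ("c", "h2")]))  # ['a', 'b']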


The legal right to demand rectification exists in at least one EU country: https://www.erc.pt/pt/perguntas-frequentes/sobre-a-imprensa (in Portuguese)


Midomi does this and has been around for a while, and probably raises fewer privacy concerns: https://www.midomi.com/


All the tools you mention are great for their specific use cases, e.g. Snowflake, BigQuery, and Redshift are great for analytics over big data, the very common COUNT/SUM/AVG/PARTITION/GROUP BY analysis. But as with any other tool, they're not a good fit for every analysis method or every data type, which is what I think a data lake aims to support. Analyzing a large JSON dataset on Snowflake is possible, but it's either too slow or too expensive compared with more appropriate tools (e.g. Elasticsearch or a Python notebook running PySpark).
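
For illustration, exploring raw JSON with PySpark can be as direct as the sketch below (bucket path and column names are made up):

    # Hedged sketch: explore a large JSON dataset in place, no warehouse load.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("json-exploration").getOrCreate()

    # Schema is inferred straight from the raw files on S3.
    events = spark.read.json("s3://example-bucket/raw/events/")

    events.groupBy("event_type").agg(F.count("*").alias("n")).orderBy(F.desc("n")).show()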

Having a data lake - which I understand as a repository of raw data of diverse types, regardless of the tools - built on a service like S3 is very useful when you have multiple use cases over data of different kinds.

For example, you could store audio files from customer calls and have them processed automatically by Spark jobs (e.g. for transcript and stats generation), structure and store call stats on a database for analytics, and do further analysis via notebooks on data science initiatives (e.g. sentiment analysis). This is akin to having a staging area for complex and diverse data types, and S3 is useful for this because of its speed, scale and management features.
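
A rough sketch of that flow, assuming an upstream job has already produced transcripts from the audio (bucket paths, columns and the JDBC target are all made-up examples, and credentials/driver details are omitted):

    # Spark job: derive per-call stats from transcripts staged on S3 and land
    # them in a relational database for analytics.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("call-stats").getOrCreate()

    transcripts = spark.read.json("s3://example-datalake/staging/transcripts/")

    call_stats = (
        transcripts
        .withColumn("word_count", F.size(F.split("transcript_text", " ")))
        .groupBy("call_id", "operator_id")
        .agg(F.sum("word_count").alias("words"),
             F.max("duration_seconds").alias("duration_seconds"))
    )

    # Structured stats go to a database, ready for SQL/BI analysis.
    (call_stats.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://analytics-db:5432/calls")
        .option("dbtable", "call_stats")
        .option("user", "etl")
        .mode("append")
        .save())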

Teradata or Snowflake aren't a great fit for use cases like these, but they are great if the use case is to get answers to questions like "top 3 operators per team in volume of calls, by department and region, in last quarter" if the volume of calls is big.

If I understood correctly, your comment was more focused on why use new tools when the existing ones are mature, but I think big data tools have had to become more specialized and targeted at specific use cases. If the question is "why build more than one data lake", the only reason I can see is organizational: teams or different areas of an organization either need their own data lake because they have specific needs (which is rare), or won't/can't collaborate with others on a shared asset.


I agree that no matter what you're going to store unstructured binary data, like audio files, in object storage. But that is perfectly compatible with storing structured data in a relational database.

> they are great if the use case is to get answers to questions like "top 3 operators per team in volume of calls..."

You are straw-manning Snowflake/BQ. Just because they are SQL database systems doesn't mean you have to do 100% of your analysis in SQL. You can use other systems, like Spark, PyTorch, or TensorFlow, to work with data that you manage inside an RDBMS. There are some issues with bandwidth getting data between systems, but these issues are getting solved (by Arrow!) and in the meantime unload-to/load-from S3 is a good workaround.
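
Concretely, the unload-to/load-from S3 workaround looks roughly like the sketch below: the warehouse exports a query result to S3 (e.g. via its unload/COPY mechanism), Spark works on that copy, and the output can be loaded back the same way. Paths and columns are illustrative only:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("warehouse-offload").getOrCreate()

    # Data previously unloaded from the warehouse to S3 as Parquet.
    calls = spark.read.parquet("s3://example-unload/calls/")

    enriched = calls.filter(calls.duration_seconds > 0)

    # Write results back to S3, to be loaded into the warehouse again.
    enriched.write.mode("overwrite").parquet("s3://example-unload/calls_enriched/")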

I've heard a lot of people make these same arguments and I've tentatively concluded it's mostly motivated reasoning. Engineers like to engineer things. They start by trying to make the obvious, boring system work, but when they run into an obstacle they immediately jump to "I need to build a new system using $TECHNOLOGY."


uBlock Origin is pretty effective for that issue: https://addons.mozilla.org/en-US/firefox/addon/ublock-origin...


>> Controlling cost is the hard part. You may only need a cluster for 1 hour per day for a nightly aggregation job. Kubernetes clusters are not easy to provision and de-provision, so you end up paying for a cluster for 24 hour days and use it for only 1 hour.

What is the benefit of using Kubernetes to deploy Spark jobs then? Is that approach meant to achieve independence from the hardware?

I'm asking because that is fairly trivial to achieve using, at least, a provider like AWS: you can build a CloudFormation template (or use the AWS API or the web UI) to launch AWS EMR clusters with specific hardware and run any Spark jars, and you can use services like Data Pipeline or Glue to schedule and/or automate the whole process. So you can use AWS services to set up a schedule that periodically spins up a cluster with whatever machines you need to run a Spark app and decommissions it as soon as it's done.
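
Via the API, the transient-cluster approach is roughly the sketch below (release label, instance types, roles and S3 paths are placeholder assumptions): the cluster runs one spark-submit step and terminates itself when the step finishes.

    import boto3

    emr = boto3.client("emr", region_name="eu-west-1")

    emr.run_job_flow(
        Name="nightly-aggregation",
        ReleaseLabel="emr-6.10.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the step
        },
        Steps=[{
            "Name": "run-spark-app",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit",
                         "--conf", "spark.executor.memory=8g",  # app-specific tuning
                         "s3://example-bucket/jars/nightly-aggregation.jar"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )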

In this case, the EMR cluster comes with the myriad of Hadoop tools and services (and Spark, and other relevant software) preinstalled and ready to use. And most relevant Spark settings are already optimized for the cluster's hardware, but not for the Spark app itself, which is what this solution seems to address.


DBeaver is great as a mature, cross-platform SQL editor, and packs a lot of database design and management tools: https://dbeaver.io


I used it for a bit and found it lacking in text editing capabilities.


Your comment would have been more useful if you had elaborated on which text editing capabilities DBeaver currently lacks.


Multiple cursors, jumping around in the text without the mouse, etc.

