
While I'm happy to hear about a great success story for a great piece of open source software, Elasticsearch has done application developers a disservice by making them lazy about learning the ins and outs of the various analytical/transactional/storage backend systems out there.

Echoing other commenters, Elasticsearch is hardly the best tool for many kinds of analytics. In fact, it is strictly not a good tool for several use cases. For starters:

1. It's not good at joining two or more data sources

2. It's not good at complex analytical processing like window functions (for example, calculating session length from the deltas of consecutive timestamps, partitioned by user_id and ordered by time).
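
To make (2) concrete, here is a rough sketch in pandas of the kind of computation I mean; the data and column names are made up for illustration, and in a real analytical database you would express this with LAG()/SUM() OVER (PARTITION BY user_id ORDER BY ts):

    import pandas as pd

    # Toy clickstream; user_id and ts are illustrative column names.
    events = pd.DataFrame({
        "user_id": ["a", "a", "a", "b", "b"],
        "ts": pd.to_datetime([
            "2015-10-01 10:00:00", "2015-10-01 10:05:00", "2015-10-01 10:07:00",
            "2015-10-01 11:00:00", "2015-10-01 11:30:00",
        ]),
    })

    events = events.sort_values(["user_id", "ts"])
    # Delta between consecutive events per user (the "window" part).
    events["delta"] = events.groupby("user_id")["ts"].diff()
    # Naive session length per user: sum of the deltas (no idle-timeout logic).
    print(events.groupby("user_id")["delta"].sum())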

Of course, it's also good at many things, like simple filtering and aggregation against "real-time" data. Being in-memory really helps with performance, and with the right tools, it's horizontally scalable. Elastic's commercial support is also not to be discounted.

However, as an old OLAP fart who spent years optimizing KDB+ queries, I am deeply concerned about the willful ignorance of data processing systems that I see among Elasticsearch fans. Just take my word for it and study Postgres (with the cstore_fdw columnar extension) and other real databases, in-memory or otherwise, open source or proprietary, so that you won't shoot yourself (or your future co-workers) in the foot trying to shoehorn Elasticsearch and its ilk into workloads they are poorly suited for. (To be fair, I see a similar tendency among Splunk zealots.)


> Of course, it's also good at many things like simple filtering and aggregation against "real-time" data.

And also fulltext search at scale, which is basically its primary use case.

PostgreSQL's fulltext search isn't quite at the same level. The last time I looked into its capabilities, it didn't fully support TF-IDF. (I don't think it keeps track of corpus frequencies for terms.) Interestingly, I think SQLite's fulltext support does include TF-IDF, but I could be misremembering.
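
To spell out the "corpus frequencies" point: the IDF half of TF-IDF needs to know, for every term, how many documents in the whole corpus contain it, and that is the bookkeeping I don't believe Postgres's built-in ranking does. A toy sketch of the computation (my own illustration, not any engine's actual implementation):

    import math
    from collections import Counter

    # Toy corpus, purely for illustration.
    docs = ["the quick brown fox", "the lazy dog", "the quick dog jumps"]
    tokenized = [d.split() for d in docs]
    n_docs = len(docs)
    # Document frequency: how many documents each term appears in.
    df = Counter(term for doc in tokenized for term in set(doc))

    def tf_idf(term, doc_tokens):
        tf = doc_tokens.count(term) / len(doc_tokens)
        idf = math.log(n_docs / df[term]) if df[term] else 0.0
        return tf * idf

    print(tf_idf("quick", tokenized[0]))  # rarer term, nonzero score
    print(tf_idf("the", tokenized[0]))    # appears in every doc, score 0.0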

I mean, the Elasticsearch docs are pretty clear that joining doesn't work well (or really, at all). I'm not sure how being clear about the trade-offs of your software is "doing a disservice." Sometimes you don't need to store relational data. Sometimes you do need to store relational data, but the other benefits of Elasticsearch outweigh the cost of shoehorning relational data into what is effectively a document database.

If your only complaint is that people misuse software... Well... Yeah. It's been happening for a while now. We should help educate others. I'm not sure your approach is the most constructive.


What surprised me a bit is that pg decided to use the word "bias" without any clarification, considering his background in computer science and AI.

Anyway, I think pg's whole argument is rather moot because the three assumptions that he states are incredibly difficult to measure (part of the reason it is so difficult to argue for or against affirmative action without coming across as "biased").


Unfortunately I think this is a problem with many of his essays. They often present a very specific argument with reservations, which makes the argument very hard to disagree with, since you have to argue relevance, and that requires a lot more insight. It's therefore taken as truth by readers, even if the original argument doesn't support their conclusion. In general I think they should be seen as opinion pieces rather than essays. I have a hard time seeing many of them being up to, e.g., a basic university standard.


This reminded me of an interview of David Foster Wallace

"...here’s this fundamental difference that comes up in freshman comp and haunts you all the way through teaching undergrads: there is a fundamental difference between expressive writing and communicative writing. One of the biggest problems in terms of learning to write, or teaching anybody to write, is getting it in your nerve endings that the reader cannot read your mind. That what you say isn’t interesting simply because you, yourself, say it. Whether that translates to a feeling of obligation to the reader I don’t know, but we’ve all probably sat next to people at dinner or on public transport who are producing communication signals but it’s not communicative expression. It’s expressive expression, right? And actually it’s in conversation that you can feel most vividly how alienating and unpleasant it is to feel as if someone is going through all the motions of communicating with you but in actual fact you don’t even need to be there at all."

"Conversations with David Foster Wallace" (Literary Conversations Series, page 113

A big thing that pg seems to be unaware of is that most of us are expressive, not communicative, when we talk. Stylistically, it's true that plain English is the way to go. However, the deeper problem lies in our (in)ability to communicate our thoughts, in writing or in speech.


I agree. One of my mentors recommended that I write down my own version and share it with my team. Here is my attempt: http://kiyototamura.tumblr.com/post/130937953602/good-market...


I actually began using Acme a couple of months ago: it's kind of interesting how much I have become accustomed to mouse-driven interaction and the lack of syntax highlighting. Humans indeed are creatures of habit.


I think you are conflating the need to centralize logs with the need to have them around at all. The OP is saying that centralizing them might not always make sense, and I tend to agree (and I say this as someone who maintains a popular open source log collector).

If I were to play the devil's advocate, the real need for raw log data in a centralized location comes from folks outside of Ops: data analysts and data scientists.


Log data should always be centralized if the machines they're stored on are ephemeral.


All machines are ephemeral. Some people just don't realise it.


You are preaching to the choir here. Fluentd (as a proxy for my logging-related beliefs) pushed Docker to support logging long before the logging driver existed, and it is now one of the officially supported logging drivers.

What you don't seem to realize is that the cost of centralized logging is not always worth it. Machines are ephemeral, and so are many application-related problems. It's one thing to counter the OP by saying that centralized logging has merits (and I believe the OP agrees with that statement) and another to say that centralized logging is always a must.


Why? My whole point is that the data isn't useful beyond the life of the machine so why store it?


If the logs might contain the stacktrace of a request that brought down your application, how do you look at them if your health-checker helpfully terminates the now-considered-wedged app VM, and your logs also "helpfully" disappear right along with it?

Half the point of logs is that they're the examinable blast-wave of something that might no longer exist.


But why do you care if a request took down the application, unless it happens repeatedly? Any sufficiently complex system will have transient errors. Why waste time tracking them down?

If it happens repeatedly, then you add central logging until you find the problem.


Maybe the app comes back up with all its users missing. Maybe you get an email a day later telling you you've been hacked and listing details of your database. Or maybe the one instance wedged itself in a weird way (crashed with a lock open, etc.) and destabilized the system, and you spent sixteen hours getting everything back up.

The severity of a bug has nothing to do with its commonality. Sometimes there's something that happens once, and bankrupts your business. Security vulnerabilities are such.

However, I'm more confused by the statement "add central logging"—how are you doing this, and how much time does it cost you? If you mean enabling logging code you already wrote using ops-time configuration (in effect, increasing the log-level) then I can see your point. If you mean adding logging code, then you're making your ops work block on developers.

Either way, what is the cost you're imagining to central logging, that you would consider adding it in specific cases where it's "worth it", but not otherwise? It's just a bunch of streams going somewhere, getting collated, getting rotated, and then getting discarded. The problem itself is about as well-known as load-balancing; it's infrastructure you just set up once and don't have to think about. It doesn't even have scaling problems!


I think there are two kinds of logging that get conflated in the industry: logging for devops and logging for analytics.

For logging for devops, I 100% agree with you. Looking at application metrics rather than raw logs is far more productive, and the raw logs should only be consulted after you have triaged the situation based on the metrics monitored.

However, there is another kind of logging, and that's for data science and analytics. Here, it's hugely helpful to have centralized logging. Hell, it is a must. The last thing you want is data scientists with shaky Linux knowledge ssh-ing into your prod machines. At the same time, logs are the best source of customer behavior data to inform product insights, etc. By centralizing these logs and making them available on S3 or HDFS or something, you can point them there and have everyone win.

Among Fluentd users, we definitely see both camps. As a matter of fact, one of the reasons I think people like Fluentd is that it enables both monitoring and log aggregation within a single tool.
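
For a sense of what I mean by serving both camps from one collector, here is a hypothetical Fluentd config sketch; the tag, paths, and bucket name are placeholders, and the S3 output assumes the fluent-plugin-s3 plugin is installed:

    # Receive events forwarded from application nodes.
    <source>
      @type forward
      port 24224
    </source>

    # Copy each event to two places: stdout stands in here for a
    # monitoring/metrics pipeline, and S3 is the archive for analytics.
    <match app.**>
      @type copy
      <store>
        @type stdout
      </store>
      <store>
        @type s3
        s3_bucket example-log-archive      # placeholder bucket name
        path logs/
        time_slice_format %Y%m%d%H
        buffer_path /var/log/fluent/s3     # local file buffer for retries
      </store>
    </match>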


You're absolutely right. I was only talking about DevOps logging (should have made that clearer). Logging for data science is a totally different ball game.


Does that change your opinion about having a centralized log data store? If you need it anyway for your data scientists, why not give your security/devops people access to it when they need to debug a problem?


It doesn't change my opinion because the logging for data scientists is different than for DevOps. For data scientists I assume it would be all application information going into either a queue or stream processor, or being inserted directly into a database, or being pulled out of a database during ETL.

Stuff going to syslog isn't generally going to be used for data science.


One alternative is to replicate some of your data into a different DB that is safe for the data scientists to use. And that way, they have raw data to play with instead of logs.


Curious what you mean by "some of your data"? What kinds of data? Usually, I think of the logs as the raw data; everything else is derived data & analysis.


Well, it would be data from your databases, and anything else they would find useful, excluding private user information.


Compared to GCP, perhaps, but AWS's support is pretty atrocious too. The real reason AWS wins is because they know how to sell platforms: as you said, it's about more services, more options, and yes, more _selling_. You can't just build the best components and wait for people to try them. You have to go listen to customers and propose how to build what they need using your platform. This is by far the biggest difference between GCP and AWS: the technical salesperson mindset.


disclaimer: I'm the original author, and a technical salesperson of sorts at GCP

What could we do to make this better? I'm super serious; I talk to customers every day, and we're working hard to deliver a platform (all of it - product, docs, support, sales, customer advocacy, OSS, etc) that exceeds expectations and delivers results. What could we do to impress you from a customer engagement standpoint?


Your experience with AWS support is generally correlated with how much you pay. After many years of grumbling, we're finally at the highest level and support is pretty great. As it should be, given the price.

You're exactly right about the "technical salesperson mindset". It's awesome to see new features come out several times a year that feel like a direct response to our feedback. Google is impressive, but I don't get the feeling they would be as responsive to customer requests.


"Your experience with AWS support is generally correlated with how much you pay"

Can't say I've found this to be the case. We spend a minor fortune each month, have premium support, and get really poor help. I thought mostly everyone agreed AWS support stinks.

Part of the problem, I've realized, is we are not contacting support for trivial, common items. Rather we engage with them on outages in their service (ahem "elevated error rates"). On these and other similarly complex issues, their front-line support staff is ill equipped and they mostly resort to saying "I've escalated to the service team."


>We'd love to use a ruby-based solution like this, but the docs say it will lose data whenever the receiving end crashes. Any plans to fix that?

Where does it say this? I don't think this was ever the case for Fluentd.

>The way it was described in the docs gave me the impression there is no acknowledgement of network writes - if that's true won't even clean shutdowns lose data sometimes?

This is not true. All writes are acknowledged over TCP, at least between Fluentd and Fluentd.
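
For what it's worth, the forward output can also be told to wait for an explicit per-chunk acknowledgement from the receiving Fluentd instead of trusting the TCP connection alone. Roughly like this (the host is a placeholder, and I'm going from memory on the option names):

    <match app.**>
      @type forward
      # Wait for the downstream Fluentd to ack each chunk (at-least-once
      # delivery) rather than assuming the TCP write was enough.
      require_ack_response true
      ack_response_timeout 30
      <server>
        host aggregator.example.internal   # placeholder
        port 24224
      </server>
    </match>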


> Where does it say this? I don't think this was ever the case for Fluentd.

It's in http://docs.fluentd.org/articles/high-availability#forwarder..., which says:

However, possible message loss scenarios do exist:

The process [log forwarder’s fluentd from the paragraph above] dies immediately after receiving the events, but before writing them into the buffer.

Is this document out of date?


>Is this document out of date?

No, in the described case, the message can get lost. This is a really unlikely scenario, though. The only real-world case that I know of first-hand is using the file buffer and somehow being unable to write to disk, possibly because the disk is full. Something like that can be prevented by a fairly routine set of server monitoring alerts.


Interesting. Do you run hekad on the host machine?

