[VIDEO] Exploring Observability Using AWS ELK Stack

Ricardo Ferreira is the Principal Developer Advocate at Elastic in the United States. His mission is to inspire communities and "make developers fall in love with technology." A passionate professional himself, he has spent the past two decades as a programmer and solutions architect, a journey that led to his current role at the company behind the famous ELK Stack.

In an informal conversation (see video below) hosted by Bruno Pereira, CEO of the Brazilian system monitoring startup Elvenworks, Ricardo discussed Site Reliability Engineering (SRE), Observability, distributed systems, and the challenges SRE engineers face with cloud computing tools and services:

[Bruno Pereira]: Elasticsearch was launched in 2010 as a search engine solution. At that time, Solr was the established alternative, alongside other pre-cloud options, that is, tools not inherently designed to work as distributed architectures. Elasticsearch was initially introduced as a search solution, but it also doubles as a document-oriented database technology. Nine years later, it evolved to offer an integrated solution encompassing log management, Observability, and more. So, my first question is: what is the strategy behind the product portfolio of a company that became worth $15 billion in less than 10 years?

[Ricardo Ferreira]: I like to explain Elastic from the perspective of a search product. Shay Banon, the guy who came up with Elasticsearch, nailed it with this slogan: everything we do today, as technologists and developers, is a search. Only the format and the way we think about it change, but at its core, it is a search. So, the entire portfolio that Elastic has created to date revolves around this concept of search, and obviously we're talking about Elasticsearch. Elasticsearch is to Elastic what the Oracle database is to Oracle: it is the heart of everything Elastic does.

However, Elastic built its strategy around creating (or acquiring) technologies that would become so standard in a developer's life that, in 7 years – which is where we find ourselves today – it could start creating solutions and products on top of that platform without running into the problem most companies have: "Where do I find the workforce?" So, when you ask me what Elastic is today, it continues to be the company behind the Elastic Stack.

That is: Logstash, Kibana, and Elasticsearch continue to be key technologies for the organization, but today we are diversifying our portfolio. For example, we have three solutions built on top of the Elastic Stack: the Security solution, the Enterprise Search solution (a more specific search niche, kind of competing with Google), and the Observability solution, which is the theme we will discuss in more depth shortly.

But the point is: in terms of strategy, Elastic has diversified its product portfolio in a well-thought-out manner. And why do I say this? Because the company spent about 10 years building a foundation so that, for example, if Elastic's security solution, which sits on top of the Elastic Stack, simply "breaks down," the question "where do I find a professional who understands Logstash?" is no longer a difficult one. Nowadays it's not so hard to find, or even train, that workforce.

I made the comparison with Oracle because I've worked there, but if we stop to think about it, that's exactly what Oracle did: it created the database and then began to invest in Oracle Fusion Middleware to start creating applications on top of the platform. So, this is a very interesting business model, and that's what Elastic is doing today. It is a security, Observability, and Enterprise Search company, but at its core everything is still tied together by the good old Elastic Stack.

BP: Elasticsearch emerged at a time when companies were starting to dig into cloud migration and move away from on-premises setups. But once they hit the cloud, boom: they ran into serverless, containers, and all sorts of processing models we now deal with. Monitoring turned into a real headache because it wasn't just about keeping an eye on some static thing anymore; we needed to closely observe the architectures and what was being done in production in "n" different ways. So, what do you consider a recommended architecture, from the perspective of Observability and Reliability (SRE), for those who wish to monitor their systems "like a pro," just as the top tech product companies do today?

RF: You mentioned the issue of change, on how technology evolves, especially with the advent of the cloud, and I was reminded of something you said that I haven't forgotten to this day, which is that analogy between “pets” and “cattle”. Back in the day, we looked at our architecture and knew each of the “animals” we had at home, but today we're dealing with cattle, meaning there's such an astonishing number of “cows” grazing together that we can no longer name them individually. So, the first point I notice, speaking both as a user and technology provider, is that most people view Observability as some sort of “magic box.”

That is, once the data is there, everything looks beautiful – as it indeed is – the data is already stored, and then we start talking about things like machine learning and anomaly detection, which is wonderful. However, the problem is that nobody talks about what precedes that: the journey to identify which telemetry data needs to be collected, how it's collected, what the overhead of collecting it is, and most importantly, how you're going to normalize all of it.

Let me explain where I'm going with this. It's good to discuss what nobody wants to discuss. That's the first point: don't avoid a discussion you need to have just because you know it will hurt. The point I'm trying to make is this: everyone says that Observability is based on three pillars – metrics, logs, and tracing. And it is; I'm not saying it's not. The problem is that they are not distinct or isolated pillars, yet people tend to treat them as isolated things. What often happens is something like this: "We're going to consolidate our telemetry project, and for metrics we'll use Prometheus, with a team responsible only for Prometheus; for tracing we'll adopt Jaeger or Zipkin and put it all over here (another silo); and for logs we'll use Elasticsearch or Splunk," or something like that.

Instinctively, people continue thinking in silos. And I think they do that because it's easy and comfortable. But I believe that for anyone wanting to structure a really serious Observability initiative, the first thing you need to do is precisely avoid creating these silos, even though it may "hurt" and lead to conflicts of interest, for example. But even that conflict is normal and part of the dynamics of creating a unified strategy. So, it's really worth promoting these discussions sooner rather than avoiding them.

This concept of creating teams dedicated to something is in the DNA of professionals today, old or new. And that's what needs to change. I'm not getting into the merits of right or wrong, but I think if you want Observability to work, you have to integrate and unify. And then comes the second part of what I would suggest: if you're going to unify and want an integrated view of logs, traces, and metrics, then when you're viewing the trace of a transaction, you need to be able to see the logs that belong to that transaction; and when you're looking at a metric of a host or a container – errors per second, for example – you have to be able to derive all the transactions behind that metric.

And ask yourself: what were all the transactions that generated those errors? For everything to be integrated, you need to normalize. There has to be this normalization process (we just say "telemetry data" so we don't keep repeating log, metric, and trace). You need to create a unified representation of telemetry to have an integrated view of everything. So, these are the discussions I think need to happen. They are more about conversations than about which technical tools we will decide to use. And this discussion has to start at design time.
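
As a rough illustration of what that normalization can look like in practice, here is a minimal sketch in Python. The field names are hypothetical (loosely inspired by the idea behind common schemas such as Elastic's ECS): log, span, and metric events share the same shape and the same trace id, which is what makes the integrated view possible.

```python
# Minimal sketch: log, span, and metric events normalized into one shape
# so they can be correlated by a shared trace id. Field names here are
# illustrative, not an official schema.

from collections import defaultdict

telemetry_events = [
    {"kind": "log", "trace.id": "abc123", "service.name": "checkout",
     "log.level": "error", "message": "payment gateway timeout"},
    {"kind": "span", "trace.id": "abc123", "service.name": "checkout",
     "span.name": "POST /purchase", "duration_ms": 5100},
    {"kind": "metric", "trace.id": "abc123", "service.name": "checkout",
     "metric.name": "http.errors_per_second", "value": 3.0},
]

def correlate_by_trace(events):
    """Group log, span, and metric events that belong to the same trace."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["trace.id"]].append(event)
    return grouped

# Because everything shares trace.id, the "integrated view" is one lookup.
for trace_id, events in correlate_by_trace(telemetry_events).items():
    print(trace_id, [e["kind"] for e in events])
```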

Design is another thing I try to work on when I'm helping someone with the Observability side of their systems. There seem to be three different schools: one that thinks Observability is Ops' responsibility; one that thinks it belongs to Ops but also to the developer, who has to instrument the application and decide what gets logged and what doesn't (and it's true, that needs to be done too); and a third, almost invisible school, which thinks of Observability as a design discussion, first and foremost.

If you are a functional requirements analyst, when you're conceptualizing your product you need to think about which transactions are the most important – the transaction that finalizes a purchase, for example. You have to identify how you're going to collect relevant metrics from it and how you're going to correlate those metrics with the transactions. In my view, this is not a discussion that should involve only the Development and Product teams. No, everyone has to take part! Do you know why? Because when a problem arises, the people who will really want to know about it are the ones who designed the product.
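
To make the design-time point concrete, here is a hedged sketch of what instrumenting such a key transaction could look like with the OpenTelemetry Python API. The span and attribute names, as well as the finalize_purchase and charge_payment functions, are hypothetical; exporter and SDK configuration are omitted because they are project-specific.

```python
# Hedged sketch: instrumenting a business-critical "finalize purchase"
# transaction with the OpenTelemetry Python API. Names are hypothetical;
# exporter/SDK setup is omitted (without it, the API acts as a no-op).

from opentelemetry import trace

tracer = trace.get_tracer("shop.checkout")  # hypothetical instrumentation name

def charge_payment(order_id: str, amount: float) -> None:
    """Placeholder for the real payment integration."""

def finalize_purchase(order_id: str, amount: float) -> None:
    # This span exists because, at design time, the product team agreed
    # this transaction must always be observable.
    with tracer.start_as_current_span("checkout.finalize_purchase") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.amount", amount)
        try:
            charge_payment(order_id, amount)
        except Exception as exc:
            # Record the failure on the same span so logs and metrics can
            # be correlated back to this exact transaction.
            span.record_exception(exc)
            span.set_attribute("checkout.failed", True)
            raise

finalize_purchase("order-42", 99.90)
```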

BP: Successful products, with high traffic peaks and many transactions, generate large volumes of logs. This can be a challenge for those designing system architectures capable of handling that demand efficiently. During events like Black Friday, for example, traffic can increase significantly, requiring solutions that support these spikes in access without excessive cost or operational complexity. Cloud computing offers good answers to these fluctuations, allowing for scalable log structures that serve peak periods well without adding unnecessary cost the rest of the time. Considering this context, what approaches can we use to manage traffic peaks economically and sustainably?

RF: I'll answer this using my experience with Elasticsearch. I'm not sure how other log solutions work, but in the case of Elasticsearch, it's a combination of practices, implementations, and culture you have to adopt. First off, Elasticsearch is extremely fast at reads, even with huge data volumes. That's its main feature. There are use cases with petabytes of data where Elasticsearch still answers queries over data that was indexed just half a millisecond ago, in (almost) real time. However, it wasn't made for massive data writing. So when you expose it to applications or agents writing directly to it, you're taking a serious risk – although I've seen a properly sized Elasticsearch handle 100,000 documents per second, so it does support that.
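
As an aside on the "properly sized" caveat: write-heavy workloads usually batch documents instead of indexing them one request at a time. Below is a minimal sketch with the official Elasticsearch Python client; the endpoint, index name, and chunk size are placeholders, not sizing advice.

```python
# Hedged sketch: batching writes with the Elasticsearch bulk helper instead
# of indexing one document per request. Endpoint, index name, and chunk
# size are placeholders.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def log_actions(lines):
    # Each action tells the bulk helper which index the document goes to.
    for line in lines:
        yield {"_index": "app-logs", "_source": {"message": line}}

sample_lines = [f"request {i} completed" for i in range(1000)]

# helpers.bulk groups the documents into chunks, which is far cheaper than
# a thousand individual index calls.
success, _ = helpers.bulk(es, log_actions(sample_lines), chunk_size=500)
print(f"indexed {success} documents")
```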

But you brought up an important point, which is seasonality. Take Black Friday, for example. When we face situations like that, what's advisable is to place a buffer in front of Elasticsearch. What I call a buffer is a component designed to handle massive writing, like Apache Kafka, Apache Pulsar, AWS Kinesis, Google Cloud Pub/Sub, or Azure Event Hubs. Because these systems are designed for massive writing, if there's a seasonal spike they can absorb it, and they are persistent. That's different from a messaging system, which is why we distinguish between streaming and messaging: the main difference is that streaming is persistent, and because it's persistent, you can go back in time and reprocess things. These are very interesting features to have in an architecture that needs to support seasonality. That's the first point.
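
To make that buffering idea concrete, here's a minimal sketch assuming Apache Kafka with the kafka-python client: applications publish log events to a topic built for massive, persistent writes, and a separate consumer drains the topic into Elasticsearch at whatever pace the cluster can sustain. Endpoints, topic, and index names are placeholders.

```python
# Hedged sketch: a persistent write buffer in front of Elasticsearch using
# Kafka (kafka-python client). Endpoints, topic, and index names are
# placeholders; a production consumer would batch with the bulk API and
# handle retries and backpressure.

import json
from kafka import KafkaProducer, KafkaConsumer
from elasticsearch import Elasticsearch

# Producer side: applications publish telemetry to a topic designed for
# massive, persistent writes instead of hitting Elasticsearch directly.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send("app-logs", {"service": "checkout", "message": "order created"})
producer.flush()

# Consumer side: drains the topic at a pace the cluster can sustain.
# Because Kafka is persistent, a Black Friday spike simply accumulates in
# the topic and can even be reprocessed later.
es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "app-logs",
    bootstrap_servers="localhost:9092",
    group_id="log-indexer",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    es.index(index="app-logs", document=message.value)
```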

The second point, still talking about Elasticsearch, is that the architecture has changed a lot in recent releases, starting around release 7.9 (it's now at 7.14). A lot has changed since then, but the relevant background is that Elasticsearch has always supported the concept of data tiers. What are data tiers? For example, I can assign all my "hot" data to one set of nodes in my cluster backed by SSDs, a nicer kind of block storage, and keep my "warm" data on magnetic, rotational disks, which are a bit cheaper and similarly durable but not as performant, so it becomes economically viable. That has always existed. Now, what happened in the latest releases, and this relates to what we're talking about, is the introduction of the frozen data tier together with searchable snapshots: they can offload the entire Elasticsearch dataset so that the data is written to object storage, which we know is durable and infinitely cheaper than block storage, and also more resilient, with greater geographic distribution and redundancy across availability zones.
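
For reference, the hot/warm/frozen layout Ricardo describes is typically expressed as an index lifecycle management (ILM) policy sent to Elasticsearch's _ilm/policy endpoint. The sketch below is only illustrative: the policy name, ages, rollover sizes, and snapshot repository are placeholder values.

```python
# Hedged sketch: an ILM policy that rolls data from hot nodes to warm nodes
# and finally to the frozen tier backed by a searchable snapshot in object
# storage. Policy name, ages, sizes, and the repository are placeholders.

import requests

ES_URL = "http://localhost:9200"  # placeholder endpoint

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_size": "50gb", "max_age": "1d"}}
            },
            "warm": {
                "min_age": "7d",
                # Warm data stays local, but on cheaper hardware.
                "actions": {"allocate": {"require": {"data": "warm"}}}
            },
            "frozen": {
                "min_age": "30d",
                # The index is offloaded to object storage and pulled into a
                # local cache only when it is actually searched.
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "my-s3-repo"}
                }
            },
        }
    }
}

response = requests.put(f"{ES_URL}/_ilm/policy/logs-tiering", json=policy)
print(response.status_code, response.json())
```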

But, bringing it back to the world of Elasticsearch, this was the big headache: how do you keep Elasticsearch economically viable when you think about seasonality and scale? That's the big pain. In the past, you had to grow an Elasticsearch cluster with storage and compute coupled together: each node was a combination of storage, RAM, and CPU. Now, you can expand your storage, which is the object storage, and keep a finite set of compute nodes effectively taking care of only the distributed-protocol part of Elasticsearch. That yields enormous savings, letting you accommodate years of data while everything remains transparent, because the API is the same.

But will the performance be the same? No. If you run a query against a document stored in S3, Elasticsearch will transparently fetch that document, rematerialize it into the cluster, put it in the cache, and serve it. Obviously, that won't perform like an SSD. But what's the advantage? Once the document is in the cache, the second request will be much faster – this is another thing Elasticsearch has introduced: there's a cache layer now.

Conclusion

Observability is not just a tool or a set of practices; it is an essential element for the Reliability, security, and effectiveness of systems in a cloud environment. As we move deeper into SRE in the cloud era, companies and professionals must be prepared (and properly equipped) to face the challenges of traffic peaks and scaling with effective and innovative strategies. Success in Observability, as Ricardo Ferreira highlights in this conversation, doesn't lie solely in the Elastic Stack itself (of course!), but also in the ability to integrate, adapt, and anticipate future needs. That is what ensures that systems on AWS not only withstand tremendous load but also scale in response to constantly evolving demands.

--

Originally published at: https://elven.works/observabilidade-elastic/ [Sep/2021]