A business generates enormous quantities of data every single day, and as the business grows and changes, that volume multiplies. To stay ahead of the competition, a business should collect and analyze this data. In this tutorial, I will share how I achieved data ingestion at scale using Logstash.
Nowadays every business runs on information, but there is a problem: the data is rarely in one place. It may have been created or stored on various systems and in different formats, which can make analyzing it very difficult. One of the products I work on consumes and analyzes huge quantities of data. For this purpose I used Elasticsearch, a distributed NoSQL database and one of the tools provided by the Elastic Stack. It helps you organize, parse, and analyze huge quantities of data.
Logstash is a plugin-based data collection and processing engine that makes it easy to collect, process, and forward data to different systems. The important part is that it helps in normalizing different schemas: data gathered from different systems is made available in a single format. Processing is organized into one or more pipelines. In each pipeline, one or more input plugins receive or collect data, which is placed on an internal queue; from there it is processed by any configured filter plugins and then pushed to the output plugin, in our case Elasticsearch.
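To make the input, filter, output structure concrete, here is a minimal sketch of a pipeline configuration. The file name, field name, and index name are hypothetical, and the Elasticsearch host is assumed to be local:

```conf
# minimal-pipeline.conf -- a hypothetical example pipeline
input {
  stdin { }                      # read events from standard input
}

filter {
  mutate {
    # enrich/normalize each event with an extra field
    add_field => { "ingested_by" => "logstash" }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]   # assumed local Elasticsearch
    index => "demo-index"                # hypothetical index name
  }
}
```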
The product uses 13 pipelines, each using the JDBC input plugin to ingest data from Redshift. A Data Integration job runs on a schedule and incrementally pushes data to our source Redshift cluster. On the other end, we schedule each Logstash pipeline to run after its respective Data Integration job completes.
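A JDBC input for such a pipeline might look like the following sketch. The driver path, connection string, credentials, table, and schedule here are all placeholders, not our actual values; the `sql_last_value` tracking mechanism is what makes the load incremental:

```conf
# Hypothetical JDBC input pulling incrementally from Redshift.
input {
  jdbc {
    jdbc_driver_library => "/opt/drivers/redshift-jdbc42.jar"      # placeholder path
    jdbc_driver_class => "com.amazon.redshift.jdbc42.Driver"
    jdbc_connection_string => "jdbc:redshift://example-cluster:5439/dev"
    jdbc_user => "logstash_user"
    schedule => "0 * * * *"      # cron syntax: run hourly, after the DI job finishes
    statement => "SELECT * FROM events WHERE updated_at > :sql_last_value"
    use_column_value => true     # track a column so each run only fetches new rows
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}
```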
We also divided the pipelines between 2 Logstash instances to share the load, on two boxes with 8-core CPUs and 32 GB of memory, each running a Logstash process with Xmx set to 30 GB. Unfortunately, we ran into out-of-memory issues, and sometimes "Elasticsearch node not available" errors while sending data to the output plugin. So we used the Logstash monitoring APIs, such as node info, node stats, and hot threads, and found that at least 3-4 pipelines were handling high-load data. High loads are around 20 million records and above; low loads are anywhere below 1 million.
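These monitoring APIs are plain HTTP endpoints. The commands below assume Logstash's API is listening on its default port 9600 on localhost:

```shell
# Node info: pipeline, OS, and JVM details
curl -s "http://localhost:9600/_node?pretty"

# Node stats: per-pipeline event counts, queue sizes, JVM heap usage
curl -s "http://localhost:9600/_node/stats/pipelines?pretty"

# Hot threads: the busiest Logstash threads, useful for spotting CPU-heavy pipelines
curl -s "http://localhost:9600/_node/hot_threads?human=true"
```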
We tested with only one pipeline carrying high-load data and observed that within 3 hours CPU usage hit 100% and memory was fully utilized, resulting in a Logstash process crash.
We replaced the default, small in-memory queue with a large persisted queue on disk to improve reliability and resiliency. For this, we made the following changes in the Logstash configuration.
- queue.type: persisted
- Enables persisted queues. The default is memory.
- queue.max_bytes: 4gb
- The total capacity of the queue in bytes. The default is 1024mb.
- queue.max_events: 10000
- The maximum number of events allowed in the queue. The default is 0 (unlimited).
- queue.checkpoint.writes: 1
- Forces a checkpoint after each event is written, for durability. The default is 1024.
With the above settings, Logstash buffers events on disk until the queue reaches 4 GB or a maximum of 10,000 events. This handles back pressure: when the queue is full, Logstash applies back pressure on the inputs to stall data flowing into Logstash. The queue commits to disk through a mechanism called checkpointing; in our case a checkpoint is taken after each event. So even if Logstash is terminated or there is a hardware failure, we will not lose any data, as it is persisted to disk.
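In logstash.yml, the queue settings above look like this sketch (the optional path.queue line is an assumption about where one might place the queue, not part of our setup):

```yaml
# logstash.yml (queue settings only)
queue.type: persisted          # persist the queue on disk instead of in memory
queue.max_bytes: 4gb           # cap the on-disk queue at 4 GB
queue.max_events: 10000        # or at most 10,000 buffered events
queue.checkpoint.writes: 1     # checkpoint after every written event for durability
# path.queue: /var/lib/logstash/queue   # optional: put the queue on a dedicated disk
```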
Apart from the above changes, we also modified the schedule of our high-load pipelines. We changed them to trigger in between runs of the Data Integration job, dividing the load across multiple schedules on Logstash.
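Staggering is done through the schedule option of each pipeline's JDBC input. The cron expressions below are purely illustrative, chosen so the two high-load pipelines fire at different times between Data Integration runs:

```conf
# pipeline_orders.conf (hypothetical pipeline name)
input {
  jdbc {
    # ...connection settings as before...
    schedule => "0 1,13 * * *"   # 01:00 and 13:00
  }
}

# pipeline_clicks.conf (hypothetical pipeline name)
input {
  jdbc {
    # ...connection settings as before...
    schedule => "0 7,19 * * *"   # 07:00 and 19:00, offset from the other pipeline
  }
}
```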
Having 2 or more Logstash instances share the pipeline load with persisted queues helps balance data ingestion. Setting max_bytes and max_events to smaller values improves queue performance. Also, tuning the schedule around the input data is a must. This is how I achieved data ingestion at scale using Logstash. Let me know in the comments section how you use Logstash.