
Scalable Logging for Microservices

Updated: Jul 14, 2023

A log management and exploration platform is essential for a team of engineers working on the backend of a medium- to large-sized system. It lets them keep track of logs and explore them easily, which is crucial for understanding how the system is behaving and for identifying issues. Unfortunately, it is often overlooked and given too little attention.


There are many options for such a platform: managed services from cloud providers, Software-as-a-Service (SaaS) vendors, or open-source software that you host on your own infrastructure.


But it is not as simple as picking any solution. Each option has its own strengths and weaknesses, and there are many factors to weigh against your specific requirements to find the best fit.


In this article, we will look at the challenges we faced when searching for a log exploration solution at Carousell, the solution we ultimately chose, and the advantages and disadvantages we encountered along the way.


Scalable Logging for Microservices: The Challenge

To understand the scale of the challenge, let's consider a sample of the volume of logs we were dealing with.

[Image: a sample of the log volumes the platform handles]

Identifying Problems:

As an internal tool, the log platform primarily catered to engineers themselves. To identify the requirements, we engaged in discussions with the engineers who would be using the platform. This process was relatively smooth since engineers understand the importance of providing clear requirements.


Here are the problems highlighted by our engineers:

  1. Lack of Centralized Access in a Microservices Environment: Observability is a major challenge in a microservices architecture. Engineers needed a way to centrally access logs from all services, especially when pinpointing the root cause of complex bugs. Therefore, the log platform needed to provide a centralized point of access for logs across all services.

  2. Ineffective Searching and Filtering: Mountains of data are meaningless if they can't be retrieved and consumed effectively. The platform needed to support querying capabilities, including "field-level" filters and full-text search of log payloads. Additionally, basic aggregation and faceting on selected filters were required to provide quantitative insights.

  3. Lack of Defined and Parsed Log Formats: Logs needed to be parseable to derive useful fields from the text payload. Key fields such as log level (DEBUG, INFO, ERROR), timestamp, and log origin (service name, instance, etc.) were crucial for a functional log exploration experience.

  4. Absence of Live Tailing for Real-time Insights: Engineers expressed a need for the ability to live "tail" logs in the central dashboard. This feature would provide a real-time view of logs as they streamed in, applying necessary filters and enhancing debugging capabilities.

  5. Cost Inefficiency due to Log Ingestion Volume: The previous log-processing solutions adopted at Carousell posed a significant cost challenge. Log payloads were often substantial in size and quantity, requiring significant compute resources (CPU, RAM, disks, network bandwidth) and inflating infrastructure costs. Thus, a fundamental requirement was to provide controls for log ingestion, enabling engineers to selectively enable ingestion at a specific log level for each service and avoid ingesting logs unnecessarily.

  6. Loss of Log Data during Ingestion: Maintaining log consistency (completeness and order) was paramount. Lossy logs would cause frustration for engineers relying on the platform for production issue debugging, reducing trust and adoption. Ensuring log consistency became a key requirement.

  7. Onboarding Complexity for Engineers: To minimize disruptions to engineers' productivity, the log platform needed to ensure minimal onboarding effort. Seamless access to logs without manual account creation was an ideal scenario to empower product engineers and allow them to focus on creating value for users.

To address these problems, we embarked on building a log platform that incorporated the following solutions:

  1. Centralized Log Access: The log platform was designed to provide centralized access to logs from all services in a microservices environment.

  2. Enhanced Searching and Filtering: The platform supported querying capabilities, including "field-level" filters and full-text search, along with basic aggregation and faceting for quantitative insights.

  3. Log Format Definition and Parsing: Logs were made parseable, enabling extraction of important fields such as log level, timestamp, and log origin for a functional log exploration experience.

  4. Real-time Log Tailing: The log platform incorporated the ability to live "tail" logs on the central dashboard, offering real-time insights with applied filters for effective debugging.

  5. Cost-Effective Log Ingestion: Controls for log ingestion were provided, allowing engineers to selectively enable log ingestion at the desired level, reducing unnecessary ingestion and optimizing infrastructure costs.

  6. Log Consistency and Integrity: The log platform ensured the maintenance of log consistency, avoiding lossy logs and preserving the trust and reliability of the solution.

  7. Seamless Onboarding: Efforts were made to minimize onboarding complexity, allowing product engineers to seamlessly gain access to logs without the need for manual account creation.


Architecture

[Image: high-level architecture of the log platform]


A. Log Service Agent

The log service agent is responsible for collecting logs and sending them to the log service. It runs alongside each microservice as a separate component, called a sidecar, in every pod where the microservice runs. The sidecar and the main application share the same storage volume for log files. This setup allows us to manage the log agent independently without affecting the main application.


The log service agent is designed to use very few computing resources. It requires a small amount of CPU (roughly 30 millicores) and memory, which helps keep our overall infrastructure costs under control.


To understand how the log service agent works, let's start with how application code writes logs. All services use a common framework that provides APIs for application code to write logs to disk in separate log files.


[Image: how application code writes logs to per-service log files on disk]
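
As a rough illustration, here is a minimal sketch of what such a framework wrapper could look like in Go. The logfw package name, the Entry fields, and the JSON-lines format are assumptions made for this example; the article does not show the actual framework API used at Carousell.

```go
// A minimal sketch of a framework wrapper that writes structured log lines to
// a file on the shared volume. The logfw package name, the Entry fields, and
// the JSON-lines format are assumptions; the article does not show the actual
// framework API used at Carousell.
package logfw

import (
	"encoding/json"
	"os"
	"time"
)

// Entry is one structured log line as it lands on disk.
type Entry struct {
	Timestamp time.Time `json:"ts"`
	Level     string    `json:"level"`
	Service   string    `json:"service"`
	Message   string    `json:"msg"`
}

// Logger appends JSON-encoded entries to a per-service log file.
type Logger struct {
	service string
	file    *os.File
}

// New opens (or creates) the log file that the sidecar agent will later tail.
func New(service, path string) (*Logger, error) {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, err
	}
	return &Logger{service: service, file: f}, nil
}

// Log writes one entry as a single JSON line so the pipeline can parse it later.
func (l *Logger) Log(level, msg string) error {
	e := Entry{Timestamp: time.Now().UTC(), Level: level, Service: l.service, Message: msg}
	b, err := json.Marshal(e)
	if err != nil {
		return err
	}
	_, err = l.file.Write(append(b, '\n'))
	return err
}
```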


When the log files become too big, a log daemon running on the application pod rotates the files. This means that once a log file reaches a certain size limit, it is replaced with a new one. If a service generates a large volume of logs, the files will be rotated frequently. In this case, the log agent needs to keep track of which files are being rotated and how many logs have been read from each file.
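
To make rotation tracking concrete, here is a simplified Go sketch of a scan pass that lists the log directory and decides where to resume reading each file, using offsets from a state store. The names and the StateStore interface are illustrative assumptions; real rotation handling (renames, inode changes) is more involved.

```go
// A simplified sketch of rotation-aware scanning: list the log directory and
// work out where to resume each file from a state store. File-identity
// handling (inodes, renames) is simplified; names and types are assumptions.
package agent

import (
	"os"
	"path/filepath"
)

// StateStore persists how far each log file has been read.
type StateStore interface {
	Offset(path string) int64         // 0 if the file is unknown
	SetOffset(path string, off int64) // checkpoint after a successful publish
}

// scanOnce returns, for every log file in dir, the offset reading should resume
// from. New files produced by rotation appear as unknown paths with offset 0.
func scanOnce(dir string, store StateStore) (map[string]int64, error) {
	paths, err := filepath.Glob(filepath.Join(dir, "*.log"))
	if err != nil {
		return nil, err
	}
	resume := make(map[string]int64, len(paths))
	for _, p := range paths {
		info, err := os.Stat(p)
		if err != nil {
			continue // the file may have been rotated away; skip it this pass
		}
		off := store.Offset(p)
		if off > info.Size() {
			off = 0 // the file was truncated or replaced in place: start over
		}
		resume[p] = off
	}
	return resume, nil
}
```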


While the application code continues to write logs to the log files, the log agent in the sidecar container continuously reads the log files and sends the logs to the log service.


Since the log agent is running as a sidecar, it functions as a plug-and-play service. This means we can easily attach or detach the log agent whenever we need to without disrupting the main application.


Components of Log Service Agent

Here are the components of the Log Service Agent, explained in detail; a condensed sketch of how they fit together follows the list:

  1. Scanner: The scanner is a continuously running goroutine that scans for any log file rotation. It starts a new reader for each new log file. It also retrieves the initial state of each file from the state store to determine where to start reading the log file, especially if there was a previous failure.

  2. Reader: The reader is responsible for reading the assigned log files. It reads the logs and sends them to the buffer. The reader stops reading when it reaches the end-of-file (EOF) of the log file.

  3. Buffer: The buffer accepts logs from all the readers and holds them temporarily. It flushes the accumulated logs to the publisher when the buffer is full or when a configured flush interval elapses, whichever comes first.

  4. Publisher: The publisher is responsible for streaming the logs to the defined store or API. It receives the logs from the buffer and sends them to the designated destination for storage or further processing.

  5. StateStore: The state store is responsible for maintaining checkpoints that indicate which point in a log file has been read. It keeps track of the progress of log file reading, allowing the system to resume from the last checkpoint in case of failures or restarts.

  6. Go Channel: A buffered channel is established between the readers and the buffer. All the readers push the logs they read from files into this buffered channel, which acts as a queue until the buffer consumes them. It controls the flow of logs between the readers and the buffer: when the channel fills up, sends block, creating back pressure that halts the readers until the buffer starts consuming from the channel again. This ensures the logs are handled properly.
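
To make the flow concrete, here is a condensed sketch in Go of how a reader, the buffered channel, the buffer, and a publisher could be wired together. The names, the batching parameters, and the Publisher interface are assumptions for illustration; the scanner and state store described above are omitted to keep the example short.

```go
// A condensed sketch of how a reader, the buffered channel, the buffer, and a
// publisher could be wired together. Names, batching parameters, and the
// Publisher interface are illustrative assumptions, not the actual agent code.
package agent

import (
	"bufio"
	"context"
	"io"
	"os"
	"time"
)

// Publisher streams a batch of log lines to the log service (e.g. over gRPC).
type Publisher interface {
	Publish(ctx context.Context, lines []string) error
}

// readFile tails one log file from a given offset and pushes every line into
// the shared channel. A full channel blocks the send; this is the back
// pressure that halts readers until the buffer catches up.
func readFile(path string, offset int64, out chan<- string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Seek(offset, io.SeekStart); err != nil {
		return err
	}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		out <- sc.Text() // blocks when the channel (the queue) is full
	}
	return sc.Err()
}

// runBuffer drains the channel and flushes to the publisher when the batch is
// full or when the flush interval elapses, whichever comes first.
func runBuffer(ctx context.Context, in <-chan string, pub Publisher, batchSize int, flushEvery time.Duration) {
	batch := make([]string, 0, batchSize)
	ticker := time.NewTicker(flushEvery)
	defer ticker.Stop()

	flush := func() {
		if len(batch) == 0 {
			return
		}
		_ = pub.Publish(ctx, batch) // real code would retry and checkpoint offsets
		batch = batch[:0]
	}

	for {
		select {
		case line, ok := <-in:
			if !ok { // all readers are done; flush what is left and stop
				flush()
				return
			}
			batch = append(batch, line)
			if len(batch) >= batchSize {
				flush()
			}
		case <-ticker.C:
			flush()
		case <-ctx.Done():
			flush()
			return
		}
	}
}
```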


B. Log Service

As shown in the architecture diagram above, the log agents publish log messages in bulk to the log service. Here's what the log service provides:

  1. gRPC Sink for Log Export: The log service offers a gRPC sink that allows log agents to export their log messages to the service. This enables efficient communication between the log agents and the log service.

  2. Publishing Logs to Kafka: The log service takes the log messages from the log agents and publishes them to Kafka, a messaging system, with minimal processing. This ensures that the log messages are efficiently handled and delivered to the appropriate destinations.

  3. No Log Parsing: The log service does not parse the log messages. It simply forwards them to Kafka without performing any extensive analysis or modification.

  4. Logging Configurations and Backend for Log Dashboard UI: The log service manages the logging configurations and serves as the backend for the log dashboard user interface (UI) used for configuration. This allows users to easily configure and manage their logging settings through the log dashboard.

  5. Interface for Log Workers: The log service provides an interface for log workers to read configurations. This ensures that the log workers have access to the necessary settings and can perform their tasks effectively.

The log service acts as a mediator between the log agents and Kafka. Instead of connecting the log agents directly to Kafka, which could be problematic in the long term due to the large number of pods running in production, the log agents call the log service with their log messages using gRPC. The log service then forwards these log messages to Kafka. This approach allows us to decouple the complexity of sending messages to Kafka from the log agents, ensuring scalability and efficient log message handling.
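
The sketch below illustrates this forwarding path: a gRPC-style handler that accepts a batch of raw log payloads and republishes each one to Kafka without parsing it. The request shape, the kafkaProducer interface, and the choice of keying messages by service name are assumptions for illustration; a real implementation would use generated protobuf types and a Kafka client such as sarama or kafka-go.

```go
// A rough sketch of the log service's gRPC sink: it accepts a batch of raw log
// payloads from an agent and republishes them to Kafka without parsing them.
// The request/response types and the kafkaProducer interface are hypothetical.
package logservice

import "context"

// ExportRequest is a hypothetical stand-in for the generated gRPC request type.
type ExportRequest struct {
	ServiceName string
	Payloads    [][]byte // raw log lines, untouched by the log service
}

// ExportResponse is the matching (empty) response type.
type ExportResponse struct{}

// kafkaProducer abstracts the Kafka client so the handler stays small.
type kafkaProducer interface {
	Produce(ctx context.Context, topic string, key, value []byte) error
}

// Server implements the gRPC sink that agents call with batched logs.
type Server struct {
	producer kafkaProducer
	topic    string
}

// Export forwards each raw payload to Kafka. Keying by service name is one
// possible way to keep a service's logs in a single partition and in order.
func (s *Server) Export(ctx context.Context, req *ExportRequest) (*ExportResponse, error) {
	for _, payload := range req.Payloads {
		if err := s.producer.Produce(ctx, s.topic, []byte(req.ServiceName), payload); err != nil {
			return nil, err
		}
	}
	return &ExportResponse{}, nil
}
```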


C. Workers

Workers in our system act as Kafka consumers, which decouples log processing from ingestion. Here is what the workers do:

  1. Understanding Log Configuration: Workers are responsible for understanding the log configuration for each service. They know which logs should be ingested and processed based on the defined configurations.

  2. Deciding Log Ingestion: Once workers receive log messages, they make decisions on whether to ingest the logs or not, based on the log configuration. This helps filter out unnecessary logs and ensures that only relevant ones are processed further.

  3. Parsing Log Messages: Workers parse the log messages to extract important information like log level (e.g., DEBUG, INFO), service name, and other metadata. This extracted information helps in organizing and categorizing the logs effectively.

  4. Ingesting Logs into Log Storage: At the end of the logging pipeline, workers are responsible for ingesting the parsed logs into the log storage. They ensure that the logs are securely stored and available for future reference or analysis.

This setup allows us to have multiple workers running concurrently, processing logs simultaneously. It also enables us to write the processed logs into different storage systems, depending on our specific requirements. By using this approach, we were able to fulfill our requirements of scalability, efficient log processing, and flexibility in choosing storage systems for the logs.
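
Here is a simplified sketch of one worker's per-message path, from raw Kafka payload to log storage. The LogConfig shape, the parsed Entry fields, and the Storage interface are illustrative assumptions; the actual configuration schema and storage backend are not described in the article.

```go
// A simplified sketch of one worker's per-message path: parse the raw line,
// check the service's logging configuration, and ingest into log storage.
// LogConfig, Entry, and Storage are illustrative assumptions.
package worker

import (
	"encoding/json"
	"fmt"
)

var levelRank = map[string]int{"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3}

// LogConfig is the per-service configuration the worker reads via the log service.
type LogConfig struct {
	MinLevel string // "ERROR" by default, "DEBUG" while troubleshooting
}

// Entry is the parsed form of one raw log line (assuming JSON-lines logs).
type Entry struct {
	Timestamp string `json:"ts"`
	Level     string `json:"level"`
	Service   string `json:"service"`
	Message   string `json:"msg"`
}

// Storage is whatever backend the parsed logs are ultimately ingested into.
type Storage interface {
	Ingest(e Entry) error
}

// handleMessage decides whether to ingest one raw log line consumed from Kafka.
func handleMessage(raw []byte, cfg LogConfig, store Storage) error {
	var e Entry
	if err := json.Unmarshal(raw, &e); err != nil {
		return nil // unparsable logs are skipped (see the rollout note below)
	}
	if levelRank[e.Level] < levelRank[cfg.MinLevel] {
		return nil // below the configured level for this service: drop it
	}
	if err := store.Ingest(e); err != nil {
		return fmt.Errorf("ingest %s log from %s: %w", e.Level, e.Service, err)
	}
	return nil
}
```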


D. Log Control Dashboard

The log control dashboard is a simple user interface that allows engineers to configure how logs are ingested. It provides options to control the log level and sampling rate for each service.


Engineers can use the dashboard to set the log level for each service. By default, services are initialized with the log level set to capture known errors and exceptions (ERROR logs). However, when troubleshooting or needing more detailed logs, engineers can change the log level to DEBUG. This allows us to store logs that are relevant to the current state of our services and helps us manage logging costs.
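
As a minimal sketch, the two controls exposed by the dashboard (log level and sampling rate) could be combined into a single ingestion decision like the one below. The ServiceConfig fields and the shouldIngest helper are hypothetical; the article does not describe the exact configuration schema.

```go
// A minimal sketch of how the two dashboard controls (log level and sampling
// rate) could combine into a single ingestion decision. The ServiceConfig
// fields and the shouldIngest helper are hypothetical.
package logconfig

import "math/rand"

var levelRank = map[string]int{"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3}

// ServiceConfig mirrors the controls exposed by the log control dashboard.
type ServiceConfig struct {
	MinLevel     string  // "ERROR" by default, "DEBUG" when troubleshooting
	SamplingRate float64 // fraction of eligible logs to keep, 0.0 to 1.0
}

// shouldIngest applies the level filter first, then probabilistic sampling.
func shouldIngest(level string, cfg ServiceConfig) bool {
	if levelRank[level] < levelRank[cfg.MinLevel] {
		return false
	}
	return rand.Float64() < cfg.SamplingRate
}
```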


The log agent is the only component that needs to run inside service pods; the rest of the logging infrastructure is managed independently by our platform team. Since the log agent is deployed as a sidecar, onboarding a service requires no effort from engineers. The rollout is entirely transparent, and engineers can start using the platform immediately.


Challenges During Rollout

Although we didn't encounter any major incidents during the rollout, we faced a few unexpected challenges:

  1. Unintentional Log Ingestion: Initially, we allowed ingestion of all log levels, including logs that couldn't be parsed, which led to a higher log volume than expected. We quickly shipped a fix to disable ingestion of unparsable logs, bringing the volume back under control.

  2. High CPU Usage with Go Tickers: We missed stopping some time.Ticker instances used in the code, which caused CPU usage to increase gradually over time. Fortunately, we identified and fixed the issue promptly; a minimal illustration of the pattern follows below. We plan to share our debugging process in a future blog post.
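
For illustration, the snippet below shows the general shape of this kind of leak and its fix; the actual code that leaked is not shown in the article.

```go
// An illustration of the general time.Ticker pattern and its fix: a ticker
// that is never stopped keeps its underlying timer alive. The actual code
// that leaked at Carousell is not shown in the article.
package pollers

import (
	"context"
	"time"
)

// pollWithLeak never stops its ticker, so the underlying timer keeps firing
// even after the function returns; many leaked timers add up to steadily
// growing CPU usage.
func pollWithLeak(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	for {
		select {
		case <-ticker.C:
			// ... periodic work ...
		case <-ctx.Done():
			return // BUG: ticker.Stop() is never called
		}
	}
}

// pollFixed releases the ticker as soon as the function exits.
func pollFixed(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// ... periodic work ...
		case <-ctx.Done():
			return
		}
	}
}
```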


Conclusion

Handling logs at scale in a microservices environment is a significant challenge. We took on the task of building an in-house framework while using managed solutions where they fit best. The log platform has proven effective in giving Carousell engineers a comprehensive way to access logs. More importantly, it has given us valuable opportunities to learn and tackle core engineering problems, ultimately enabling us to deliver more value to our users.
