Low-Code Development: Leverage low and no code to streamline your workflow so that you can focus on higher priorities.
DZone Security Research: Tell us your top security strategies in 2024, influence our research, and enter for a chance to win $!
Performance refers to how well an application conducts itself compared to an expected level of service. Today's environments are increasingly complex and typically involve loosely coupled architectures, making it difficult to pinpoint bottlenecks in your system. Whatever your performance troubles, this Zone has you covered with everything from root cause analysis, application monitoring, and log management to anomaly detection, observability, and performance testing.
 
                                .NET 9 and C# 13: New Features and Improvements
API Gateway vs. Load Balancer
In today’s Information Technology (IT) digital transformation world, many applications are getting hosted in cloud environments every day. Monitoring and maintaining these applications daily is very challenging and we need proper metrics in place to measure and take action. This is where the importance of implementing SLAs, SLOs, and SLIs comes into the picture and it helps in effective monitoring and maintaining the system performance. Defining SLA, SLO, SLI, and SRE What Is an SLA? (Commitment) A Service Level Agreement is an agreement that exists between the cloud provider and client/user about measurable metrics; for example, uptime check, etc. This is normally handled by the company's legal department as per business and legal terms. It includes all the factors to be considered as part of the agreement and the consequences if it fails; for example, credits, penalties, etc. It is mostly applicable for paid services and not for free services. What Is an SLO? (Objective) A Service Level Objective is an objective the cloud provider must meet to satisfy the agreement made with the client. It is used to mention specific individual metric expectations that cloud providers must meet to satisfy a client’s expectation (i.e., availability, etc). This will help clients to improve overall service quality and reliability. What Is an SLI? (How Did We Do?) A Service Level Indicator measures compliance with an SLO and actual measurement of SLI. It gives a quantified view of the service's performance (i.e., 99.92% of latency, etc.). Who Is an SRE? A Site Reliability Engineer is an engineer who always thinks about minimizing gaps between software development and operations. This term is slightly related to DevOps, which focuses on identifying the gaps. An SRE creates and uses automation tools to monitor and observe software reliability in production environments. In this article, we will discuss the importance of SLOs/SLIs/SLAs and how to implement them into production applications by a Site Reliability Engineer (SRE). Implementation of SLOs and SLIs Let’s assume we have an application service that is up and running in a production environment. The first step is to determine what an SLO should be and what it should cover. Example of SLOs SLO = Target Above this target, GOOD Below this target, BAD: Needs an action item While setting up a Target, please do not consider it 100% reliable. It is practically not possible and it fails most of the items due to patches, deployments, downtime, etc. This is where Error Budget (EB) comes into the picture. EB is the maximum amount of time that a service can fail without contractual consequences. For example: SLA = 99.99% uptime EB = 55 mins and 35 secs per year, or 4 mins and 23 secs per month, the system can go down without consequences. A step is how to measure this SLO, and it is where SLI comes into the picture, which is an indicator of the level of service that you are providing. Example of SLIs HTTP reqs = No. of success/total requests Common SLI Metrics Durability Response time Latency Availability Error rate Throughput Leverage automation of deployment monitoring and reporting tools to check SLIs and detect deviations from SLOs in real-time (i.e., Prometheus, Grafana, etc.). Category SLO SLI Availability 99.92% uptime/month X % of the time app is available Latency 92% of reqs with response time under 240 ms X average resp time for user reqs Error rate Less than 0.8% of requests result in errors X % of reqs that fail Challenges SLA: Normally, SLAs are written by business or legal teams with no input from technical teams, which results in missing key aspects to measure. SLO: Not able to measure or too broad to calculate SLI: There are too many metrics and differences in capturing and calculating the measures. It leads to lots of effort for the SREs and gives less beneficial results. Best Practices SLA: Involve the technical team when SLAs are written by the company's business/legal team and the provider. This will help to reflect exact tech scenarios into the agreement. SLO: This should be simple, and easily measurable to check, whether we are in line with objectives or not. SLI: Define all standard metrics to monitor and measure. It will help SREs to check the reliability and performance of the services. Conclusion Implementation of SLAs, SLOs, and SLIs should be included as part of the system requirements and design and it should be in continuous improvement mode. SREs need to understand and take responsibility for how the systems serve the business needs and take necessary measures to minimize the impact.
Relational Databases are the bedrock of any FinTech application, especially for OLTP (Online transaction Processing). This foundational component in any application architecture usually poses challenges around scaling as the business expands rapidly. So, it is imperative that all database activities are monitored closely in the production environment and issues like long-running queries are tracked and resolved. This article will explore the FinTech case study, which has built a Lending Platform. The company uses the MySQL database hosted in AWS as part of the AWS RDS service. It has multiple microservices using different database schemas hosted on the database instance. The MVP product offering was launched a few years back, and since then, they have been incorporating new features into the FinTech platform. We will cover commonly identified database issues and what was done to resolve these issues. Common Mistakes This section discusses common mistakes identified, the steps to resolve the issues, and additional guidelines. Using Database Views Inside Stored Procedures After it started seeing database growth, the major issue identified in the initial days with the MySQL database was that usage of views inside stored procedures resulted in long-running queries and full table scans. The pattern we saw with the developers was that they built multiple reusable views and used those in the stored procedures. The example below is a stored procedure invoking a view uv_consultations: MySQL CREATE PROCEDURE `usp_get_consultation`( in_dso_ref_id varchar(45) ) BEGIN select * from uv_consultations where dso_ref_id=in_dso_ref_id; END Here is what the view looks like: MySQL CREATE OR REPLACE VIEW uv_consultations AS SELECT c.id AS consultation_id, c.dso_ref_id AS dso_ref_id, c.pat_id AS pat_id, p.first_name AS pat_first_name, p.last_name AS pat_last_name, p.dso_pat_id AS dso_pat_id, p.dob AS pat_dob, COUNT(cn.id) AS notifs FROM ((((consultation c JOIN patient p ON ((c.pat_id = p.id))) LEFT JOIN application app ON ((c.id = app.consultation_id))) LEFT JOIN responsible_party rp ON ((app.primary_rp_id = rp.id))) LEFT JOIN consultation_notif cn ON (((cn.consultation_id = c.id) AND (cn.read = 0) AND (cn.hide = 0)))) GROUP BY c.id So, if we look at the view, it has multiple joins and a group by. It's a complex view query. MySQL executes views in two ways: view merging and view materialization. View Merging In this approach, MySQL merges the view query with the outer query in which the view has been referenced. The biggest advantage is that it uses underlying table indexes of the actual base table, improving the query timing. However, there are limitations to view merging. The query within the view must be a simple select statement with no aggregations, joins, sub-queries, or DISTINCT clauses. View Materialization When MySQL cannot perform a merge view, it defaults to View Materialization. In this approach, MySQL stores the view results in a temporary internal table and then uses the query on the view to query the internal table. The drawback is that it does not use base table indexes, causing the queries on the view to run slower. Views with GROUP BY, aggregation, DISTINCT, and complex joins trigger view materialization. In the above example, the view has aggregation and multiple joins, so the stored procedure's execution results in view materialization, which causes it to execute slowly. To mitigate this, refactored the stored procedure to use the direct complex SQL query and avoid the intermittent layer of views. Leading Wildcard String Comparison If we use the LIKE operator and the leading wild card search, e.g., LIKE ‘%AMOL’, then the query will not use the index on that column. Below is the sample query: MySQL SELECT COUNT(*) INTO v_ext_partner_service_count FROM loan_ext_payload WHERE loan_acct_id = v_loan_acct_id AND ext_partner_service LIKE '%CREDIT_PULL' For LIKE ‘CREDIT_PULL%’, MySQL uses the index efficiently as the indexes are structured in a way that this type of search makes them inherently faster. On the contrary, for the leading wildcard LIKE ‘%CREDIT_PULL’, MySQL execution engine looks at each entry in the index if it ends with ‘CREDIT_PULL.’ The index is optimized for prefixing (CREDIT_PULL% ), not suffixing (%CREDIT_PULL); its performance benefits are wiped out with leading wildcard string comparisons. The recommendation is to avoid using prefixed wild card string comparison or search. If this is unavoidable, use a full-text search. Using Functions as Part of the WHERE Clause Using functions for filtering records using a WHERE condition can harm query performance. For instance, using FIND_IN_SET() method as in the below query: MySQL SELECT code, code_type, message, message_es FROM locale_messages WHERE code_type = v_code_type AND FIND_IN_SET(code, v_codes) > 0; FIND_IN_SET returns the position of a string if it is present (as a substring) within a list of strings. Using FIND_IN_SET in the query causes a full table scan as it does not use the index. MySQL has to scan each record to evaluate the function, which could be very expensive from a computing perspective for larger tables. Similarly, mathematical, string, or date functions would have a similar side effect. Below are the examples: MySQL SELECT loan_id, loan_acct_id FROM loan_application WHERE YEAR(create_date) = 2024; MySQL SELECT loan_id, loan_acct_id FROM loan_application WHERE YEAR(create_date) = 2024; MySQL SELECT loan_id, loan_acct_id FROM loan_application WHERE ABS(merchant_discount_fee) > 5; If we cast columns as part of the where clause, it again leads to a full table scan, impacting the query performance. Example as below: MySQL SELECT loan_id, loan_acct_id FROM loan_application WHERE CAST(approval_datetime AS DATE) = '2024-06-01'; As a resolution for FIND_IN_SET, it enabled full-text search. For all other function-based where conditions, function-based indexes were created. Performing Deletes as Part of Regular Transactions Frequent delete operations as part of regular transaction processing can degrade database performance. Deleting rows from a table leads to index rebalancing. MySQL has to update the indexes on the columns for that table as it has to remove the index entries. For rollback and recovery, transaction logs are created for all delete operations. This can lead to rapid growth of transaction logs. The rows deleted in random order fragment tables and indexes degrade performance due to scattered reads. To mitigate regular delete operations, rows can be soft-deleted, and then, an offline batch job could perform a hard delete during non-peak hours. Non-peak hours allow the system to use its resources effectively. Batch or bulk deletes are more performant than the deletes span across regular intervals. Executing Multiple Inserts From the Application Instead of Batching If we have to insert multiple child rows for the parent record from the application layer, we will execute a stored procedure for each insert instead of batching it. Each call on insert needs to establish a database connection from the application layer and execute individual stored procedures for each record, which could be inefficient. To avoid multiple round trips to the database server, the multi-insert statement was built as below on the application layer and then sent to the database-stored procedure to execute it. MySQL INSERT INTO verif_user_attrib_result (id, application_id, user_attrib_id, doc_type_id, verif_result, verif_notes, verif_date, doc_upload_count) values(148712,146235,1,NULL,1,NULL,NULL,0), (148712,146235,2,NULL,1,NULL,NULL,0), (148712,146235,3,NULL,1,NULL,NULL,0), (148712,146235,4,NULL,-1,NULL,NULL,0); MySQL CREATE PROCEDURE p_verif_create_user_attrib_results ( IN v_user_attrib_result_multi_insert TEXT, out v_status int) BEGIN SET v_status = 1; SET @qry_user_attrib_result = v_user_attrib_result_multi_insert; PREPARE stmt_user_attrib_result from @qry_user_attrib_result; EXECUTE stmt_user_attrib_result; DEALLOCATE PREPARE stmt_user_attrib_result; SET v_status = 0; END Recommendations Connection Labels In the microservices world, where multiple microservices connect to different or the same schemas of the database server, monitoring the connection requests from the different microservices often becomes difficult. If there is an issue with connection pools, connection draining, or database performance issues, we cannot identify which connection belongs to which service. This is where program_name connection attributes come in handy. Java jdbc:mysql://db.fintech.com:3306/icwdev?cacheCallableStmts=true &callableStmtCacheSize=1000&connectionAttributes=program_name:loan-application-api This attribute labels every connection to the respective program name. This connection identifier helps segregate database issues with a specific service. Apart from diagnosing issues, it helps in enhanced monitoring. We can build query performance metrics and error rates for specific microservices. Purging Large Tables For large tables, we purge the data based on purging criteria. Utility tables managing user access tokens are periodically purged. To manage purging effectively, we implement partitioning on the tables. Partitioning is time-based Index Columns Used in Join We ensure that all columns listed in join conditions are indexed. By including index columns in join conditions, the database engine filters the data efficiently using indexed data, thus avoiding the entire table scan. Keep Transactions Smaller In high-concurrency applications, we must ensure smaller transactions for optimal database performance. Smaller transactions reduce the duration of data locking and minimize resource contention, which helps reduce resource contention, avoid deadlocks, and improve the transaction success rate. Miscellaneous Common Asks We need to ensure that UPDATE statements include a WHERE clause. At a FinTech organization, some complex stored procedures missed the WHERE clause, leading to unintended behaviors in the test environment. Strict no to “SELECT *” in queries. Conclusion We have covered common pitfalls and resolutions for fine-tuning the application and underlying MySQL database performance through the FinTech application case study. Here are some important suggestions: avoid using complex views in stored procedures, stay away from leading wildcard searches, monitor queries with full table scans, and take appropriate corrective actions. Perform delete operations during offline hours through the batch job. Avoid multiple roundtrips to the server for multi-inserts; instead, use batching. Define a purging strategy. Have indexes defined on all join columns, and finally, try to keep the transaction scope smaller. With these recommendations incorporated, we can provide optimal database performance and faster end-user applications. Continuous monitoring and proactive issue resolution help maintain consistent application performance and availability.
In Site Reliability Engineering (SRE), the ability to quickly and effectively troubleshoot issues within Linux systems is crucial. This article explores advanced troubleshooting techniques beyond basic tools and commands, focusing on kernel debugging, system call tracing, performance analysis, and using the Extended Berkeley Packet Filter (eBPF) for real-time data gathering. Kernel Debugging Kernel debugging is a fundamental skill for any SRE working with Linux. It allows for deep inspection of the kernel's behavior, which is critical when diagnosing system crashes or performance bottlenecks. Tools and Techniques GDB (GNU Debugger) GDB can debug kernel modules and the Linux kernel. It allows setting breakpoints, stepping through the code, and inspecting variables. GNU Debugger Official Documentation: This is the official documentation for GNU Debugger, providing a comprehensive overview of its features. KGDB The kernel debugger allows the kernel to be debugged using GDB over a serial connection or a network. Using kgdb, kdb, and the kernel debugger internals provides a detailed explanation of how kgdb can be enabled and configured. Dynamic Debugging (dyndbg) Linux's dynamic debug feature enables real-time debugging messages that help trace kernel operations without rebooting the system. The official Dynamic Debug page describes how to use the dynamic debug (dyndbg) feature. Tracing System Calls With strace strace is a powerful diagnostic tool that monitors the system calls used by a program and the signals received by a program. It is instrumental in understanding the interaction between applications and the Linux kernel. Usage To trace system calls, strace can be attached to a running process or start a new process under strace. It logs all system calls, which can be analyzed to find faults in system operations. Example: Shell root@ubuntu:~# strace -p 2009 strace: Process 2009 attached munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 In the above example, the -p flag is the process, and 2009 is the pid. Similarly, you can use the -o flag to log the output to a file instead of dumping everything on the screen. You can review the following article to understand system calls on Linux with strace. Performance Analysis With perf perf is a versatile tool used for system performance analysis. It provides a rich set of commands to collect, analyze, and report on hardware and software events. Key Features perf record: Gathers performance data into a file, perf.data, which can be further analyzed using perf report to identify hotspots perf report: This report analyzes the data collected by perf record and displays where most of the time was spent, helping identify performance bottlenecks. Event-based sampling: perf can record data based on specific events, such as cache misses or CPU cycles, which helps pinpoint performance issues more accurately. Example: Shell root@ubuntu:/tmp# perf record ^C[ perf record: Woken up 17 times to write data ] [ perf record: Captured and wrote 4.619 MB perf.data (83123 samples) ] root@ubuntu:/tmp# root@ubuntu:/tmp# perf report Samples: 83K of event 'cpu-clock:ppp', Event count (approx.): 20780750000 Overhead Command Shared Object Symbol 17.74% swapper [kernel.kallsyms] [k] cpuidle_idle_call 8.36% stress [kernel.kallsyms] [k] __do_softirq 7.17% stress [kernel.kallsyms] [k] finish_task_switch.isra.0 6.90% stress [kernel.kallsyms] [k] el0_da 5.73% stress libc.so.6 [.] random_r 3.92% stress [kernel.kallsyms] [k] flush_end_io 3.87% stress libc.so.6 [.] random 3.71% stress libc.so.6 [.] 0x00000000001405bc 2.71% kworker/0:2H-kb [kernel.kallsyms] [k] ata_scsi_queuecmd 2.58% stress libm.so.6 [.] __sqrt_finite 2.45% stress stress [.] 0x0000000000000f14 1.62% stress stress [.] 0x000000000000168c 1.46% stress [kernel.kallsyms] [k] __pi_clear_page 1.37% stress libc.so.6 [.] rand 1.34% stress libc.so.6 [.] 0x00000000001405c4 1.22% stress stress [.] 0x0000000000000e94 1.20% stress [kernel.kallsyms] [k] folio_batch_move_lru 1.20% stress stress [.] 0x0000000000000f10 1.16% stress libc.so.6 [.] 0x00000000001408d4 0.84% stress [kernel.kallsyms] [k] handle_mm_fault 0.77% stress [kernel.kallsyms] [k] release_pages 0.65% stress [kernel.kallsyms] [k] super_lock 0.62% stress [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0.61% stress [kernel.kallsyms] [k] blk_done_softirq 0.61% stress [kernel.kallsyms] [k] _raw_spin_lock 0.60% stress [kernel.kallsyms] [k] folio_add_lru 0.58% kworker/0:2H-kb [kernel.kallsyms] [k] finish_task_switch.isra.0 0.55% stress [kernel.kallsyms] [k] __rcu_read_lock 0.52% stress [kernel.kallsyms] [k] percpu_ref_put_many.constprop.0 0.46% stress stress [.] 0x00000000000016e0 0.45% stress [kernel.kallsyms] [k] __rcu_read_unlock 0.45% stress [kernel.kallsyms] [k] dynamic_might_resched 0.42% stress [kernel.kallsyms] [k] _raw_spin_unlock 0.41% stress [kernel.kallsyms] [k] __mod_memcg_lruvec_state 0.40% stress [kernel.kallsyms] [k] mas_walk 0.39% stress [kernel.kallsyms] [k] arch_counter_get_cntvct 0.39% stress [kernel.kallsyms] [k] rwsem_read_trylock 0.39% stress [kernel.kallsyms] [k] up_read 0.38% stress [kernel.kallsyms] [k] down_read 0.37% stress [kernel.kallsyms] [k] get_mem_cgroup_from_mm 0.36% stress [kernel.kallsyms] [k] free_unref_page_commit 0.34% stress [kernel.kallsyms] [k] memset 0.32% stress libc.so.6 [.] 0x00000000001408c8 0.30% stress [kernel.kallsyms] [k] sync_inodes_sb 0.29% stress [kernel.kallsyms] [k] iterate_supers 0.29% stress [kernel.kallsyms] [k] percpu_counter_add_batch Real-Time Data Gathering With eBPF eBPF allows for creating small programs that run on the Linux kernel in a sandboxed environment. These programs can track system calls and network messages, providing real-time insights into system behavior. Applications Network monitoring: eBPF can monitor network traffic in real-time, providing insights into packet flow and protocol usage without significant performance overhead. Security: eBPF helps implement security policies by monitoring system calls and network activity to detect and prevent malicious activities. Performance monitoring: It can track application performance by monitoring function calls and system resource usage, helping SREs optimize performance. Conclusion Advanced troubleshooting in Linux involves a combination of tools and techniques that provide deep insights into system operations. Tools like GDB, strace, perf, and eBPF are essential for any SRE looking to enhance their troubleshooting capabilities. By leveraging these tools, SREs can ensure the high reliability and performance of Linux systems in production environments.
With the complexity of modern software applications, one of the biggest challenges for developers is simply understanding how applications behave. Understanding the behavior of your app is key to maintaining its stability, performance, and security. This is a big reason why we do application logging: to capture and record events through an application’s lifecycle so that we can gain valuable insights into our application. What kinds of insights? Application activity (user interactions, system events, and so on), errors and exceptions, resource usage, potential security threats, and more. When developers can capture and analyze these logs effectively, this improves application stability and security, which, in turn, improves the user experience. It’s a win-win for everybody. Application logging is easy—if you have the right tools. In this post, we’ll walk through using Heroku Logplex as a centralized logging solution. We’ll start by deploying a simple Python application to Heroku. Then, we’ll explore the different ways to use Logplex to view and filter our logs. Finally, we’ll show how to use Logplex to send your logs to an external service for further analysis. Ready to dive in? Let’s start with a brief introduction to Heroku Logplex. Introducing Heroku Logplex Heroku Logplex is a central hub that collects, aggregates, and routes log messages from various sources across your Heroku applications. Those sources include: Dyno logs: generated by your application running on Heroku dynos. Heroku logs: generated by Heroku itself, such as platform events and deployments. Custom sources: generated by external sources, such as databases or third-party services. By consolidating logs in a single, central place, Logplex simplifies log management and analysis. You can find all your logs in one place for simplified monitoring and troubleshooting. You can perform powerful filtering and searching on your logs. And you can even route logs to different destinations for further processing and analysis. Core Components At its heart, Heroku Logplex consists of three crucial components that work together to streamline application logging: 1. Log sources are the starting points where log messages originate within your Heroku environment. They are your dyno logs, Heroku logs, and custom sources which we mentioned above. 2. Log drains are the designated destinations for your log messages. Logplex allows you to configure drains to route your logs to various endpoints for further processing. Popular options for log drains include: External logging services with advanced log management features, dashboards, and alerting capabilities. Examples are Datadog, Papertrail, and Sumo Logic. Notification systems that send alerts or notifications based on specific log entries, enabling real-time monitoring and troubleshooting. Custom destinations such as your own Syslog or web server. 3. Log filters are powerful tools that act as checkpoints, allowing you to refine the log messages before they reach their final destinations. Logplex allows you to filter logs based on source, log level, and even message content. By using filters, you can significantly reduce the volume of data sent to your drains, focusing only on the most relevant log entries for that specific destination. Routing and Processing As Logplex collects log messages from all your defined sources, it passes these messages through your configured filters, potentially discarding entries that don't match the criteria. Finally, filtered messages are routed to their designated log drains for further processing or storage. Alright, enough talk. Show me how, already! Integrating Logplex With Your Application Let’s walk through how to use Logplex for a simple Python application. To get started, make sure you have a Heroku account. Then, download and install the Heroku CLI. Demo Application You can find our very simple Python script (main.py) in the GitHub repo for this demo. Our script runs an endless integer counter, starting from zero. With each iteration, it emits a log message (cycling through log levels INFO, DEBUG, ERROR, and WARN). Whenever it detects a prime number, it emits an additional CRITICAL log event to let us know. We use isprime from the sympy library to help us determine if a number is prime. To run this Python application on your local machine, first clone the repository. Then, install the dependencies: Plain Text (venv) ~/project$ pip install -r requirements.txt Next, start up the Python application. We use gunicorn to spin up a server that binds to a port, while our prime number logging continues to run in the background. (We do this because a Heroku deployment is designed to bind to a port, so that’s how we’ve written our application even though we’re focused on logging). Plain Text (venv) ~/project$ gunicorn -w 1 --bind localhost:8000 main:app [2024-03-25 23:18:59 -0700] [785441] [INFO] Starting gunicorn 21.2.0 [2024-03-25 23:18:59 -0700] [785441] [INFO] Listening at: http://127.0.0.1:8000 (785441) [2024-03-25 23:18:59 -0700] [785441] [INFO] Using worker: sync [2024-03-25 23:18:59 -0700] [785443] [INFO] Booting worker with pid: 785443 {"timestamp": "2024-03-25T23:18:59.507828Z", "level": "INFO", "name": "root", "message": "New number", "Number": 0} {"timestamp": "2024-03-25T23:19:00.509182Z", "level": "DEBUG", "name": "root", "message": "New number", "Number": 1} {"timestamp": "2024-03-25T23:19:01.510634Z", "level": "ERROR", "name": "root", "message": "New number", "Number": 2} {"timestamp": "2024-03-25T23:19:02.512100Z", "level": "CRITICAL", "name": "root", "message": "Prime found!", "Prime Number": 2} {"timestamp": "2024-03-25T23:19:05.515133Z", "level": "WARNING", "name": "root", "message": "New number", "Number": 3} {"timestamp": "2024-03-25T23:19:06.516567Z", "level": "CRITICAL", "name": "root", "message": "Prime found!", "Prime Number": 3} {"timestamp": "2024-03-25T23:19:09.519082Z", "level": "INFO", "name": "root", "message": "New number", "Number": 4} Simple enough. Now, let’s get ready to deploy it and work with logs. Create the App We start by logging into Heroku through the CLI. Plain Text $ heroku login Then, we create a new Heroku app. I’ve named my app logging-primes-in-python, but you can name yours whatever you’d like. Plain Text $ heroku apps:create logging-primes-in-python Creating ⬢ logging-primes-in-python... done https://logging-primes-in-python-6140bfd3c044.herokuapp.com/ | https://git.heroku.com/logging-primes-in-python.git Next, we create a Heroku remote for our GitHub repo with this Python application. Plain Text $ heroku git:remote -a logging-primes-in-python set git remote heroku to https://git.heroku.com/logging-primes-in-python.git A Note on requirements.txt and Procfile We need to let Heroku know what dependencies our Python application needs, and also how it should start up our application. To do this, our repository has two files: requirements.txt and Procfile. The first file, requirements.txt, looks like this: Plain Text python-json-logger==2.0.4 pytest==8.0.2 sympy==1.12 gunicorn==21.2.0 And Procfile looks like this: Plain Text web: gunicorn -w 1 --bind 0.0.0.0:${PORT} main:app That’s it. Our entire repository has these files: Plain Text $ tree . ├── main.py ├── Procfile └── requirements.txt 0 directories, 3 files Deploy the Code Now, we’re ready to deploy our code. We run this command: Plain Text $ git push heroku main … remote: Building source: remote: remote: -----> Building on the Heroku-22 stack remote: -----> Determining which buildpack to use for this app remote: -----> Python app detected … remote: -----> Installing requirements with pip … remote: -----> Launching... remote: Released v3 remote: https://logging-primes-in-python-6140bfd3c044.herokuapp.com/ deployed to Heroku remote: remote: Verifying deploy... done. Verify the App Is Running To verify that everything works as expected, we can dive into Logplex right away. Logplex is enabled by default for all Heroku applications. Plain Text $ heroku logs --tail -a logging-primes-in-python … 2024-03-22T04:34:15.540260+00:00 heroku[web.1]: Starting process with command `gunicorn -w 1 --bind 0.0.0.0:${PORT} main:app` … 2024-03-22T04:34:16.425619+00:00 app[web.1]: {"timestamp": "2024-03-22T04:34:16.425552Z", "level": "INFO", "name": "root", "message": "New number", "taskName": null, "Number": 0} 2024-03-22T04:34:17.425987+00:00 app[web.1]: {"timestamp": "2024-03-22T04:34:17.425837Z", "level": "DEBUG", "name": "root", "message": "New number", "taskName": null, "Number": 1} 2024-03-22T04:34:18.000000+00:00 app[api]: Build succeeded 2024-03-22T04:34:18.426354+00:00 app[web.1]: {"timestamp": "2024-03-22T04:34:18.426205Z", "level": "ERROR", "name": "root", "message": "New number", "taskName": null, "Number": 2} 2024-03-22T04:34:19.426700+00:00 app[web.1]: {"timestamp": "2024-03-22T04:34:19.426534Z", "level": "CRITICAL", "name": "root", "message": "Prime found!", "taskName": null, "Prime Number": 2} We can see that logs are already being written. Heroku’s log format is following this scheme: Plain Text timestamp source[dyno]: message Timestamp: The date and time recorded at the time the dyno or component produced the log line. The timestamp is in the format specified by RFC5424 and includes microsecond precision. Source: All of your app’s dynos (web dynos, background workers, cron) have the source app. Meanwhile, all of Heroku’s system components (HTTP router, dyno manager) have the source heroku. Dyno: The name of the dyno or component that wrote the log line. For example, web dyno #1 appears as web.1, and the Heroku HTTP router appears as router. Message: The content of the log line. Logplex splits any lines generated by dynos that exceed 10,000 bytes into 10,000-byte chunks without extra trailing newlines. It submits each chunk that is submitted as a separate log line. View and Filter Logs We’ve seen the first option for examining our logs, the Heroku CLI. You can use command line arguments, such as --source and --dyno, to use filters and specify which logs to view. To specify the number of (most recent) log entries to view, do this: Plain Text $ heroku logs --num 10 To filter down logs to a specific dyno or source, do this: Plain Text $ heroku logs --dyno web.1 $ heroku logs --source app Of course, you can combine these filters, too: Plain Text $ heroku logs --source app --dyno web.1 The Heroku Dashboard is another place where you can look at your logs. On your app page, click More -> View logs. Here is what we see: If you look closely, you’ll see different sources: heroku and app. Log Drains Let’s demonstrate how to use a log drain. For this, we’ll use BetterStack (formerly Logtail). We created a free account. After logging in, we navigated to the Source page and clicked Connect source. We enter a name for our source and select Heroku as the source platform. Then, we click Create source. After creating our source, BetterStack provides the Heroku CLI command we would use to add a log drain for sending logs to BetterStack. Technically, this command adds an HTTPS drain that points to an HTTPS endpoint from BetterStack. We run the command in our terminal, and then we restart our application: Plain Text $ heroku drains:add \ "https://in.logs.betterstack.com:6515/events?source_token=YKGWLN7****************" \ -a logging-primes-in-python Successfully added drain https://in.logs.betterstack.com:6515/events?source_token=YKGWLN7***************** $ heroku restart -a logging-primes-in-python Almost instantly, we begin to see our Heroku logs appear on the Live tail page at BetterStack. By using a log drain to send our logs from Heroku Logplex to an external service, we can take advantage of the features from BetterStack to work with our Heroku logs. For example, we can create visualization charts and configure alerts on certain log events. Custom Drains In our example above, we created a custom HTTPS log drain that happened to point to an endpoint from BetterStack. However, we can send our logs to any endpoint we want. We could even send our logs to another Heroku app! Imagine building a web service on Heroku that only Heroku Logplex can make POST requests to. Logging Best Practices Before we conclude our walkthrough, let’s briefly touch on some logging best practices. Focus on relevant events: Log only the information that’s necessary to understand and troubleshoot your application's behavior. Prioritize logging application errors, user actions, data changes, and other crucial activities. Enrich logs with context: Include details that provide helpful context to logged events. Your future troubleshooting self will thank you. So, instead of just logging "User logged in," capture details like user ID, device information, and relevant data associated with the login event. Embrace structured logging: Use a standardized format like JSON to make your logs machine-readable. This allows easier parsing and analysis by logging tools, saving you time in analysis. Protect sensitive data: Never log anything that could compromise user privacy or violate data regulations. This includes passwords, credit card information, or other confidential data. Take advantage of log levels: Use different log levels (like DEBUG, INFO, WARNING, and ERROR) to categorize log events based on their severity. This helps with issue prioritization, allowing you to focus on critical events requiring immediate attention. Conclusion Heroku Logplex empowers developers and operations teams with a centralized and efficient solution for application logging within the Heroku environment. While our goal in this article was to provide a basic foundation for understanding Heroku Logplex, remember that the platform offers a vast array of advanced features to explore and customize your logging based on your specific needs. As you dig deeper into Heroku’s documentation, you’ll come across advanced functionalities like: Customizable log processing: Leverage plugins and filters to tailor log processing workflows for specific use cases. Real-time alerting: Configure alerts based on log patterns or events to proactively address potential issues. Advanced log analysis tools: Integrate with external log management services for comprehensive log analysis, visualization, and anomaly detection. By understanding the core functionalities and exploring the potential of advanced features, you can leverage Heroku Logplex to create a robust and efficient logging strategy. Ultimately, good logging will go a long way in enhancing the reliability, performance, and security of your Heroku applications.
In any microservice, managing database interactions with precision is crucial for maintaining application performance and reliability. Usually, we will unravel weird issues with database connection during performance testing. Recently, a critical issue surfaced within the repository layer of a Spring microservice application, where improper exception handling led to unexpected failures and service disruptions during performance testing. This article delves into the specifics of the issue and also highlights the pivotal role of the @Transactional annotation, which remedied the issue. Spring microservice applications rely heavily on stable and efficient database interactions, often managed through the Java Persistence API (JPA). Properly managing database connections, particularly preventing connection leaks, is critical to ensuring these interactions do not negatively impact application performance. Issue Background During a recent round of performance testing, a critical issue emerged within one of our essential microservices, which was designated for sending client communications. This service began to experience repeated Gateway time-out errors. The underlying problem was rooted in our database operations at the repository layer. An investigation into these time-out errors revealed that a stored procedure was consistently failing. The failure was triggered by an invalid parameter passed to the procedure, which raised a business exception from the stored procedure. The repository layer did not handle this exception efficiently; it bubbled up. Below is the source code for the stored procedure call: Java public long createInboxMessage(String notifCode, String acctId, String userId, String s3KeyName, List<Notif> notifList, String attributes, String notifTitle, String notifSubject, String notifPreviewText, String contentType, boolean doNotDelete, boolean isLetter, String groupId) throws EDeliveryException { try { StoredProcedureQuery query = entityManager.createStoredProcedureQuery("p_create_notification"); DbUtility.setParameter(query, "v_notif_code", notifCode); DbUtility.setParameter(query, "v_user_uuid", userId); DbUtility.setNullParameter(query, "v_user_id", Integer.class); DbUtility.setParameter(query, "v_acct_id", acctId); DbUtility.setParameter(query, "v_message_url", s3KeyName); DbUtility.setParameter(query, "v_ecomm_attributes", attributes); DbUtility.setParameter(query, "v_notif_title", notifTitle); DbUtility.setParameter(query, "v_notif_subject", notifSubject); DbUtility.setParameter(query, "v_notif_preview_text", notifPreviewText); DbUtility.setParameter(query, "v_content_type", contentType); DbUtility.setParameter(query, "v_do_not_delete", doNotDelete); DbUtility.setParameter(query, "v_hard_copy_comm", isLetter); DbUtility.setParameter(query, "v_group_id", groupId); DbUtility.setOutParameter(query, "v_notif_id", BigInteger.class); query.execute(); BigInteger notifId = (BigInteger) query.getOutputParameterValue("v_notif_id"); return notifId.longValue(); } catch (PersistenceException ex) { logger.error("DbRepository::createInboxMessage - Error creating notification", ex); throw new EDeliveryException(ex.getMessage(), ex); } } Issue Analysis As illustrated in our scenario, when a stored procedure encountered an error, the resulting exception would propagate upward from the repository layer to the service layer and finally to the controller. This propagation was problematic, causing our API to respond with non-200 HTTP status codes—typically 500 or 400. Following several such incidents, the service container reached a point where it could no longer handle incoming requests, ultimately resulting in a 502 Gateway Timeout error. This critical state was reflected in our monitoring systems, with Kibana logs indicating the issue: `HikariPool-1 - Connection is not available, request timed out after 30000ms.` The issue was improper exception handling, as exceptions bubbled up through the system layers without being properly managed. This prevented the release of database connections back into the connection pool, leading to the depletion of available connections. Consequently, after exhausting all connections, the container was unable to process new requests, resulting in the error reported in the Kibana logs and a non-200 HTTP error. Resolution To resolve this issue, we could handle the exception gracefully and not bubble up further, letting JPA and Spring context release the connection to the pool. Another alternative is to use @Transactional annotation for the method. Below is the same method with annotation: Java @Transactional public long createInboxMessage(String notifCode, String acctId, String userId, String s3KeyName, List<Notif> notifList, String attributes, String notifTitle, String notifSubject, String notifPreviewText, String contentType, boolean doNotDelete, boolean isLetter, String groupId) throws EDeliveryException { ……… } The implementation of the method below demonstrates an approach to exception handling that prevents exceptions from propagating further up the stack by catching and logging them within the method itself: Java public long createInboxMessage(String notifCode, String acctId, String userId, String s3KeyName, List<Notif> notifList, String attributes, String notifTitle, String notifSubject, String notifPreviewText, String contentType, boolean doNotDelete, boolean isLetter, String loanGroupId) { try { ....... query.execute(); BigInteger notifId = (BigInteger) query.getOutputParameterValue("v_notif_id"); return notifId.longValue(); } catch (PersistenceException ex) { logger.error("DbRepository::createInboxMessage - Error creating notification", ex); } return -1; } With @Transactional The @Transactional annotation in Spring frameworks manages transaction boundaries. It begins a transaction when the annotated method starts and commits or rolls it back when the method completes. When an exception occurs, @Transactional ensures that the transaction is rolled back, which helps appropriately release database connections back to the connection pool. Without @Transactional If a repository method that calls a stored procedure is not annotated with @Transactional, Spring does not manage the transaction boundaries for that method. The transaction handling must be manually implemented if the stored procedure throws an exception. If not properly managed, this can result in the database connection not being closed and not being returned to the pool, leading to a connection leak. Best Practices Always use @Transactional when the method's operations should be executed within a transaction scope. This is especially important for operations involving stored procedures that can modify the database state. Ensure exception handling within the method includes proper transaction rollback and closing of any database connections, mainly when not using @Transactional. Conclusion Effective transaction management is pivotal in maintaining the health and performance of Spring Microservice applications using JPA. By employing the @Transactional annotation, we can safeguard against connection leaks and ensure that database interactions do not degrade application performance or stability. Adhering to these guidelines can enhance the reliability and efficiency of our Spring Microservices, providing stable and responsive services to the consuming applications or end users.
As a Linux administrator or even if you are a newbie who just started using Linux, having a good understanding of useful commands in troubleshooting network issues is paramount. We'll explore the top 10 essential Linux commands for diagnosing and resolving common network problems. Each command will be accompanied by real-world examples to illustrate its usage and effectiveness. 1. ping Example: ping google.com Shell test@ubuntu-server ~ % ping google.com -c 5 PING google.com (142.250.189.206): 56 data bytes 64 bytes from 142.250.189.206: icmp_seq=0 ttl=58 time=14.610 ms 64 bytes from 142.250.189.206: icmp_seq=1 ttl=58 time=18.005 ms 64 bytes from 142.250.189.206: icmp_seq=2 ttl=58 time=19.402 ms 64 bytes from 142.250.189.206: icmp_seq=3 ttl=58 time=22.450 ms 64 bytes from 142.250.189.206: icmp_seq=4 ttl=58 time=15.870 ms --- google.com ping statistics --- 5 packets transmitted, 5 packets received, 0.0% packet loss round-trip min/avg/max/stddev = 14.610/18.067/22.450/2.749 ms test@ubuntu-server ~ % Explanation ping uses ICMP protocol, where ICMP stands for internet control message protocol and ICMP is a network layer protocol used by network devices to communicate. ping helps in testing the reachability of the host and it will also help in finding the latency between the source and destination. 2. traceroute Example: traceroute google.com Shell test@ubuntu-server ~ % traceroute google.com traceroute to google.com (142.250.189.238), 64 hops max, 52 byte packets 1 10.0.0.1 (10.0.0.1) 6.482 ms 3.309 ms 3.685 ms 2 96.120.90.197 (96.120.90.197) 13.094 ms 10.617 ms 11.351 ms 3 po-301-1221-rur01.fremont.ca.sfba.comcast.net (68.86.248.153) 12.627 ms 11.240 ms 12.020 ms 4 ae-236-rar01.santaclara.ca.sfba.comcast.net (162.151.87.245) 18.902 ms 44.432 ms 18.269 ms 5 be-299-ar01.santaclara.ca.sfba.comcast.net (68.86.143.93) 14.826 ms 13.161 ms 12.814 ms 6 69.241.75.42 (69.241.75.42) 12.236 ms 12.302 ms 69.241.75.46 (69.241.75.46) 15.215 ms 7 * * * 8 142.251.65.166 (142.251.65.166) 21.878 ms 14.087 ms 209.85.243.112 (209.85.243.112) 14.252 ms 9 nuq04s39-in-f14.1e100.net (142.250.189.238) 13.666 ms 192.178.87.152 (192.178.87.152) 12.657 ms 13.170 ms test@ubuntu-server ~ % Explanation Traceroute shows the route packets take to reach a destination host. It displays the IP addresses of routers along the path and calculates the round-trip time (RTT) for each hop. Traceroute helps identify network congestion or routing issues. 3. netstat Example: netstat -tulpn Shell test@ubuntu-server ~ % netstat -tuln Active LOCAL (UNIX) domain sockets Address Type Recv-Q Send-Q Inode Conn Refs Nextref Addr aaf06ba76e4d0469 stream 0 0 0 aaf06ba76e4d03a1 0 0 /var/run/mDNSResponder aaf06ba76e4d03a1 stream 0 0 0 aaf06ba76e4d0469 0 0 aaf06ba76e4cd4c1 stream 0 0 0 aaf06ba76e4ccdb9 0 0 /var/run/mDNSResponder aaf06ba76e4cace9 stream 0 0 0 aaf06ba76e4c9e11 0 0 /var/run/mDNSResponder aaf06ba76e4d0b71 stream 0 0 0 aaf06ba76e4d0aa9 0 0 /var/run/mDNSResponder test@ubuntu-server ~ % Explanation Netstat displays network connections, routing tables, interface statistics, masquerade connections, and multicast memberships. It's useful for troubleshooting network connectivity, identifying open ports, and monitoring network performance. 4. ifconfig/ip Example: ifconfig or ifconfig <interface name> Shell test@ubuntu-server ~ % ifconfig en0 en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500 options=6460<TSO4,TSO6,CHANNEL_IO,PARTIAL_CSUM,ZEROINVERT_CSUM> ether 10:9f:41:ad:91:60 inet 10.0.0.24 netmask 0xffffff00 broadcast 10.0.0.255 inet6 fe80::870:c909:df17:7ed1%en0 prefixlen 64 secured scopeid 0xc inet6 2601:641:300:e710:14ef:e605:4c8d:7e09 prefixlen 64 autoconf secured inet6 2601:641:300:e710:d5ec:a0a0:cdbb:79a7 prefixlen 64 autoconf temporary inet6 2601:641:300:e710::6cfc prefixlen 64 dynamic nd6 options=201<PERFORMNUD,DAD> media: autoselect status: active test@ubuntu-server ~ % Explanation ifconfig and ip commands are used to view and configure network parameters. They provide information about the IP address, subnet mask, MAC address, and network status of each interface. 5. tcpdump Example:tcpdump -i en0 tcp port 80 Shell test@ubuntu-server ~ % tcpdump -i en0 tcp port 80 tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on en0, link-type EN10MB (Ethernet), snapshot length 524288 bytes 0 packets captured 55 packets received by filter 0 packets dropped by kernel test@ubuntu-server ~ % Explanation Tcpdump is a packet analyzer that captures and displays network traffic in real-time. It's invaluable for troubleshooting network issues, analyzing packet contents, and identifying abnormal network behavior. Use tcpdump to inspect packets on specific interfaces or ports. 6. nslookup/dig Example: nslookup google.com or dig Shell test@ubuntu-server ~ % nslookup google.com Server: 2001:558:feed::1 Address: 2001:558:feed::1#53 Non-authoritative answer: Name: google.com Address: 172.217.12.110 test@ubuntu-server ~ % test@ubuntu-server ~ % dig google.com ; <<>> DiG 9.10.6 <<>> google.com ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46600 ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;google.com. IN A ;; ANSWER SECTION: google.com. 164 IN A 142.250.189.206 ;; Query time: 20 msec ;; SERVER: 2001:558:feed::1#53(2001:558:feed::1) ;; WHEN: Mon Apr 15 22:55:35 PDT 2024 ;; MSG SIZE rcvd: 55 test@ubuntu-server ~ % Explanation nslookup and dig are DNS lookup tools used to query DNS servers for domain name resolution. They provide information about the IP address associated with a domain name and help diagnose DNS-related problems such as incorrect DNS configuration or server unavailability. 7. iptables/firewalld Example: iptables -L or firewall-cmd --list-all Shell test@ubuntu-server ~# iptables -L Chain INPUT (policy ACCEPT) target prot opt source destination Chain FORWARD (policy DROP) target prot opt source destination Chain OUTPUT (policy ACCEPT) target prot opt source destination test@ubuntu-server ~# Explanation iptables and firewalld are firewall management tools used to configure packet filtering and network address translation (NAT) rules. They control incoming and outgoing traffic and protect the system from unauthorized access. Use them to diagnose firewall-related issues and ensure proper traffic flow. 8. ss Example: ss -tulpn Shell test@ubuntu-server ~# Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port udp UNCONN 0 0 *:161 *:* udp UNCONN 0 0 *:161 *:* test@ubuntu-server ~# Explanation ss is a utility to investigate sockets. It displays information about TCP, UDP, and UNIX domain sockets, including listening and established connections, connection state, and process IDs. ss is useful for troubleshooting socket-related problems and monitoring network activity. 9. arp Example: arp -a Shell test@ubuntu-server ~ % arp -a ? (10.0.0.1) at 80:da:c2:95:aa:f7 on en0 ifscope [ethernet] ? (10.0.0.57) at 1c:4d:66:bb:49:a on en0 ifscope [ethernet] ? (10.0.0.83) at 3a:4a:df:fe:66:58 on en0 ifscope [ethernet] ? (10.0.0.117) at 70:2a:d5:5a:cc:14 on en0 ifscope [ethernet] ? (10.0.0.127) at fe:e2:1c:4d:b3:f7 on en0 ifscope [ethernet] ? (10.0.0.132) at bc:d0:74:9a:51:85 on en0 ifscope [ethernet] ? (10.0.0.255) at ff:ff:ff:ff:ff:ff on en0 ifscope [ethernet] mdns.mcast.net (224.0.0.251) at 1:0:5e:0:0:fb on en0 ifscope permanent [ethernet] ? (239.255.255.250) at 1:0:5e:7f:ff:fa on en0 ifscope permanent [ethernet] test@ubuntu-server ~ % Explanation arp (Address Resolution Protocol) displays and modifies the IP-to-MAC address translation tables used by the kernel. It resolves IP addresses to MAC addresses and vice versa. arp is helpful for troubleshooting issues related to network device discovery and address resolution. 10. mtr Example: mtr Shell test.ubuntu.com (0.0.0.0) Tue Apr 16 14:46:40 2024 Keys: Help Display mode Restart statistics Order of fields quit Packets Ping Host Loss% Snt Last Avg Best Wrst StDev 1. 10.0.0.10 0.0% 143 0.8 9.4 0.7 58.6 15.2 2. 10.0.2.10 0.0% 143 0.8 9.4 0.7 58.6 15.2 3. 192.168.0.233 0.0% 143 0.8 9.4 0.7 58.6 15.2 4. 142.251.225.178 0.0% 143 0.8 9.4 0.7 58.6 15.2 5. 142.251.225.177 0.0% 143 0.8 9.4 0.7 58.6 15.2 Explanation mtr (My traceroute) combines the functionality of ping and traceroute into a single diagnostic tool. It continuously probes network paths between the host and a destination, displaying detailed statistics about packet loss, latency, and route changes. Mtr is ideal for diagnosing intermittent network problems and monitoring network performance over time. Mastering these commands comes in handy for troubleshooting network issues on Linux hosts.
In today’s digital landscape, where user expectations for speed and responsiveness are at an all-time high, optimizing system performance is crucial for businesses to stay competitive. One effective approach to address performance bottlenecks is the implementation of caching mechanisms. In this article, we’ll delve into the importance of caching in enhancing system performance, explore various caching strategies and cache invalidation methods, and examine common cache-related problems along with real-world solutions. Problem Statement Consider a popular e-commerce platform experiencing a surge in user traffic during a festive season sale. As the number of users increases, so does the volume of database queries, resulting in sluggish performance and delayed response times. This performance degradation impacts user experience and may lead to lost sales opportunities for the business. Caching as a Solution To mitigate the performance issues caused by increased database queries, the e-commerce platform decides to implement caching. By caching frequently accessed product information, such as product details, prices, and availability, in memory, the platform aims to reduce the need for repetitive database queries and deliver faster response times to users browsing the website or making purchases. How Caching Works When a user visits the e-commerce platform to view product listings, the system first checks the cache for the requested product information. If the data is found in the cache (cache hit), it is retrieved quickly and served to the user, resulting in a seamless browsing experience. However, if the data is not present in the cache (cache miss), the system retrieves the information from the database, serves it to the user, and updates the cache to prevent future incidents of not finding the data in the cache. If the data is found in the cache it is called cache hit and if not, it’s called cache miss. But, how does the desired data get loaded to the cache? There are several caching strategies that help us load the data to the cache. Caching strategies are approaches to manage how data is stored and retrieved in a cache to optimize performance and efficiency. These strategies determine how data is cached, when it is updated or invalidated, and how it is accessed. Caching Strategies Read-Through Cache With this approach, when a cache miss occurs, the system automatically fetches the data from the database, populates the cache, and returns the data to the user. Example: Memcached is used for caching frequently accessed database queries in a web application. 2. Cache-Aside (Lazy Loading) In cache-aside caching, the application first checks the cache for the requested data. If the data is not found in the cache, the application fetches it from the database, populates the cache, and returns the data to the user. Example: MongoDB caching frequently accessed user profiles in a social media platform. 3. Write-Back Cache With write-back caching, data is written to the cache first and then asynchronously written to the underlying storage at a later time. This strategy improves write performance by reducing the latency associated with synchronous disk writes. Example: A database system caching write operations in memory before committing them to disk to improve overall system responsiveness and throughput. 4. Write-Around Cache In this strategy, data is written directly to the underlying storage without being cached initially. Only subsequent read requests for the same data trigger caching. This approach is beneficial for data with low access frequency or data that doesn’t benefit from caching. Example: A file storage system where large files are written directly to disk without being cached to optimize cache utilization for frequently accessed files. 5. Write-Through Cache In this strategy, every write operation to the database is also written to the cache simultaneously. This ensures that the cache is always up-to-date with the latest data. Example: Redis cache with write-through caching for user session management. Cache Invalidation Methods Okay, now that we’ve loaded the data in the cache, how do we make sure it’s still good? Since cached data can be outdated, we need ways to check if it’s still reliable. Cache invalidation methods confirm that the cached data matches the latest information from the original source. Cache invalidation is the process of removing or updating cached data when it becomes stale or outdated. This ensures that the cached data remains accurate and reflects the most recent changes from the source of truth, such as a database or a web service. “Purge,” “refresh,” and “ban” are commonly used cache invalidation methods that are frequently used in the application cache, content delivery networks (CDNs), and web proxies. Here’s a brief description of a few famous cache invalidation methods: 1. Time-Based Invalidation Cached data is invalidated after a specified period to ensure freshness. Example: Expire user authentication tokens in the cache after a certain time interval. Time-To-Live(TTL) Expiration This method involves setting a time-to-live value for cached content, after which the content is considered stale and must be refreshed. When a request is received for the content, the cache checks the time-to-live value and serves the cached content only if the value hasn’t expired. If the value has expired, the cache fetches the latest version of the content from the origin server and caches it. For example, in a news application, articles in the cache may have a TTL of 24 hours. After 24 hours, the articles expire and are removed from the cache. 2. Event-Based Invalidation Cache entries are invalidated based on specific events or triggers, such as data updates or deletions. Example: Invalidate product cache when inventory levels change in an e-commerce platform. Purge The purge method removes cached content for a specific object, URL, or a set of URLs. It’s typically used when there is an update or change to the content and the cached version is no longer valid. When a purge request is received, the cached content is immediately removed, and the next request for the content will be served directly from the origin server. Refresh Fetches requested content from the origin server, even if cached content is available. When a refresh request is received, the cached content is updated with the latest version from the origin server, ensuring that the content is up-to-date. Unlike a purge, a refresh request doesn’t remove the existing cached content; instead, it updates it with the latest version. Ban The ban method invalidates cached content based on specific criteria, such as a URL pattern or header. When a ban request is received, any cached content that matches the specified criteria is immediately removed, and subsequent requests for the content will be served directly from the origin server. Stale-While-Revalidate This method is used in web browsers and CDNs to serve stale content from the cache while the content is being updated in the background. When a request is received for a piece of content, the cached version is immediately served to the user, and an asynchronous request is made to the origin server to fetch the latest version of the content. Once the latest version is available, the cached version is updated. This method ensures that the user is always served content quickly, even if the cached version is slightly outdated. Cache Eviction After the cache invalidation, it appears that the cache is full. But, the cache misses have increased tremendously. What steps should be taken next? We need to make room in the cache for new items and reduce cache misses. This process, known as cache eviction, is crucial for optimizing cache performance. Cache eviction refers to the process of removing items from a cache to make room for new entries when the cache reaches its capacity limit. Cache eviction policies determine which items are selected for removal based on certain criteria. There are various cache eviction policies: 1. Least Recently Used (LRU) Removes the least recently accessed items when space is needed for new entries. For instance, in a web browser’s cache using LRU, if the cache is full and a user visits a new webpage, the least recently accessed page is removed to make room for the new one. 2. Most Recently Used (MRU) Evicts the most recently accessed items to free up space when necessary. For example, in a mobile app cache, if the cache is full and a user accesses a new feature, the most recently accessed feature is evicted to accommodate the new data. 3. First-In-First-Out (FIFO) Evicts items based on the order they were added to the cache. In a FIFO cache, the oldest items are removed first when space is required. For instance, in a messaging app, if the cache is full and a new message arrives, the oldest message in the cache is evicted. 4. Least Frequently Used (LFU) Removes items that are accessed the least frequently. LFU keeps track of access frequency, and when space is limited, it removes items with the lowest access count. For example, in a search engine cache, less frequently searched queries may be evicted to make room for popular ones. 5. Random Replacement (RR) Selects items for eviction randomly without considering access patterns. In a cache using RR, any item in the cache may be removed when space is needed. For instance, in a gaming app cache, when new game assets need to be cached, RR randomly selects existing assets to replace. Each cache eviction policy has its advantages and disadvantages, and the choice depends on factors such as the application’s access patterns, memory constraints, and performance requirements. The e-commerce platform in our example has implemented a caching solution, leading to noticeable improvements in website performance. However, several cache-related issues have recently emerged. Cache-related problems occur due to issues such as multiple users or processes accessing data simultaneously, outdated cached data, limited cache size, network problems, incorrect cache configurations, implementation errors, dependencies on external systems, and inadequate cache warming strategies. Common Cache-Related Problems and Solutions 1. Thundering Herd Problem This occurs when multiple requests simultaneously trigger cache misses, overwhelming the system. Solution Implement cache warming techniques to pre-load frequently accessed data into the cache during low-traffic periods. 2. Cache Penetration This happens when malicious users bombard the system with requests for non-existent data, bypassing the cache and causing unnecessary load on the backend. Solution Implement input validation and rate limiting to mitigate the impact of cache penetration attacks. 3. Cache Breakdown This occurs when the cache becomes unavailable or unresponsive, leading to increased load on the backend database. Solution Implement cache redundancy and failover mechanisms to ensure high availability and fault tolerance. 4. Cache Crash This happens when the cache experiences failures or crashes due to software bugs or hardware issues. Solution Regularly monitor cache health and performance, and implement automated recovery mechanisms to quickly restore cache functionality. In companies like Amazon, Netflix, Meta, Google and eBay distributed caching solutions like Redis or Memcached are commonly used for caching dynamic data, while CDNs are employed for caching static content and optimizing content delivery to users. Summary In summary, caching is vital for optimizing system performance in high-traffic environments. Strategies like read-through cache and cache-aside cater to diverse needs, while invalidation methods ensure data accuracy. However, issues like thundering herds and cache breakdowns pose challenges. Solutions like cache warming and redundancy mitigate these problems, ensuring system resilience. Overall, a well-designed caching strategy is essential for delivering optimal performance in dynamic digital landscapes. Video Must Read for Continuous Learning Head First Design Patterns Clean Code: A Handbook of Agile Software Craftsmanship Java Concurrency in Practice Java Performance: The Definitive Guide Designing Data-Intensive Applications Designing Distributed Systems Clean Architecture Kafka — The Definitive Guide Becoming An Effective Software Engineering Manager
1. Use "&&" to Link Two or More Commands Use “&&” to link two or more commands when you want the previous command to be succeeded before the next command. If you use “;” then it would still run the next command after “;” even if the command before “;” failed. So you would have to wait and run each command one by one. However, using "&&" ensures that the next command will only run if the preceding command finishes successfully. This allows you to add commands without waiting, move on to the next task, and check later. If the last command ran, it indicates that all previous commands ran successfully. Example: Shell ls /path/to/file.txt && cp /path/to/file.txt /backup/ The above example ensures that the previous command runs successfully and that the file "file.txt" exists. If the file doesn't exist, the second command after "&&" won't run and won't attempt to copy it. 2. Use “grep” With -A and -B Options One common use of the "grep" command is to identify specific errors from log files. However, using it with the -A and -B options provides additional context within a single command, and it displays lines after and before the searched text, which enhances visibility into related content. Example: Shell % grep -A 2 "java.io.IOException" logfile.txt java.io.IOException: Permission denied (open /path/to/file.txt) at java.io.FileOutputStream.<init>(FileOutputStream.java:53) at com.pkg.TestClass.writeFile(TestClass.java:258) Using grep with -A here will also show 2 lines after the “java.io.IOException” was found from the logfile.txt. Similarly, Shell grep "Ramesh" -B 3 rank-file.txt Name: John Wright, Rank: 23 Name: David Ross, Rank: 45 Name: Peter Taylor, Rank: 68 Name Ramesh Kumar, Rank: 36 Here, grep with -B option will also show 3 lines before the “Ramesh” was found from the rank-file.txt 3. Use “>” to Create an Empty File Just write > and then the filename to create an empty file with the name provided after > Example: Shell >my-file.txt It will create an empty file with "my-file.txt" name in the current directory. 4. Use “rsync” for Backups "rsync" is a useful command for regular backups as it saves time by transferring only the differences between the source and destination. This feature is especially beneficial when creating backups over a network. Example: Shell rsync -avz /path/to/source_directory/ user@remotehost:/path/to/destination_directory/ 5. Use Tab Completion Using tab completion as a habit is faster than manually selecting filenames and pressing Enter. Typing the initial letters of filenames and utilizing Tab completion streamlines the process and is more efficient. 6. Use “man” Pages Instead of reaching the web to find the usage of a command, a quick way would be to use the “man” command to find out the manual of that command. This approach not only saves time but also ensures accuracy, as command options can vary based on the installed version. By accessing the manual directly, you get precise details tailored to your existing version. Example: Shell man ps It will get the manual page for the “ps” command 7. Create Scripts For repetitive tasks, create small shell scripts that chain commands and perform actions based on conditions. This saves time and reduces risks in complex operations. Conclusion In conclusion, becoming familiar with these Linux commands and tips can significantly boost productivity and streamline workflow on the command line. By using techniques like command chaining, context-aware searching, efficient file management, and automation through scripts, users can save time, reduce errors, and optimize their Linux experience.
The Advantages of Elastic APM for Observing the Tested Environment My first use of the Elastic Application Performance Monitoring (Elastic APM) solution coincides with projects that were developed based on microservices in 2019 for the projects on which I was responsible for performance testing. At that time (2019) the first versions of Elastic APM were released. I was attracted by the easy installation of agents, the numerous protocols supported by the Java agent (see Elastic supported technologies) including the Apache HttpClient used in JMeter and other languages (Go, .NET, Node.js, PHP, Python, Ruby), and the quality of the dashboard in Kibana for the APM. I found the information displayed in the Kibana APM dashboards to be relevant and not too verbose. The Java agent monitoring is simple but displays essential information on the machine's OS and JVM. The open-source aspect and the free solution for the main functions of the tool were also decisive. I generalize the use of the Elastic APM solution in performance environments for all projects. With Elastic APM, I have the timelines of the different calls and exchanges between web services, the SQL queries executed, the exchange of messages by JMS file, and monitoring. I also have quick access to errors or exceptions thrown in Java applications. Why Integrate Elastic APM in Apache JMeter By adding Java APM Agents to web applications, we find the services called timelines in the Kibana dashboards. However, we remain at a REST API call level mainly, because we do not have the notion of a page. For example, page PAGE01 will make the following API calls: /rest/service1 /rest/service2 /rest/service3 On another page, PAGE02 will make the following calls: /rest/service2 /rest/service4 /rest/service5 /rest/service6 The third page, PAGE03, will make the following calls: /rest/service1 /rest/service2 /rest/service4 In this example, service2 is called on 3 different pages and service4 in 2 pages. If we look in the Kibana dashboard for service2, we will find the union of the calls of the 3 calls corresponding to the 3 pages, but we don't have the notion of a page. We cannot answer "In this page, what is the breakdown of time in the different REST calls," because for a user of the application, the notion of page response time is important. The goal of the jmeter-elastic-apm tool is to add the notion of an existing page in JMeter in the Transaction Controller. This starts in JMeter by creating an APM transaction, and then propagating this transaction identifier (traceparent) with the Elastic agent to an HTTP REST request to web services because the APM Agent recognizes the Apache HttpClient library and can instrument it. In the HTTP request, the APM Agent will add the identifier of the APM transaction to the header of the HTTP request. The headers added are traceparent and elastic-apm-traceparent. We start from the notion of the page in JMeter (Transaction Controller) to go to the HTTP calls of the web application (gestdoc) hosted in Tomcat. In the case of an application composed of multi-web services, we will see in the timeline the different web services called in HTTP(s) or JMS and the time spent in each web service. This is an example of technical architecture for a performance test with Apache JMeter and Elastic APM Agent to test a web application hosted in Apache Tomcat. How the jmeter-elastic-apm Tool Works jmeter-elastic-apm adds Groovy code before a JMeter Transaction Controller to create an APM transaction before a page. In the JMeter Transaction Controller, we find HTTP samplers that make REST HTTP(s) calls to the services. The Elastic APM Agent automatically adds a new traceparent header containing the identifier of the APM transaction because it recognizes the Apache HttpClient of the HTTP sampler. The Groovy code terminates the APM transaction to indicate the end of the page. The jmeter-elastic-apm tool automates the addition of Groovy code before and after the JMeter Transaction Controller. The jmeter-elastic-apm tool is open source on GitHub (see link in the Conclusion section of this article). This JMeter script is simple with 3 pages in 3 JMeter Transaction Controllers. After launching the jmeter-elastic-apm action ADD tool, the JMeter Transaction Controllers are surrounded by Groovy code to create an APM transaction before the JMeter Transaction Controller and close the APM transaction after the JMeter Transaction Controller. In the “groovy begin transaction apm” sampler, the Groovy code calls the Elastic APM API (simplified version): Groovy Transaction transaction = ElasticApm.startTransaction(); Scope scope = transaction.activate(); transaction.setName(transactionName); // contains JMeter Transaction Controller Name In the “groovy end transaction apm” sampler, the groovy code calls the ElasticApm API (simplified version): Groovy transaction.end(); Configuring Apache JMeter With the Elastic APM Agent and the APM Library Start Apache JMeter With Elastic APM Agent and Elastic APM API Library Declare the Elastic APM Agent URLto find the APM Agent: Add the ELASTIC APM Agent somewhere in the filesystem (could be in the <JMETER_HOME>\lib but not mandatory). In <JMETER_HOME>\bin, modify the jmeter.bat or setenv.bat. Add Elastic APM configuration like so: Shell set APM_SERVICE_NAME=yourServiceName set APM_ENVIRONMENT=yourEnvironment set APM_SERVER_URL=http://apm_host:8200 set JVM_ARGS=-javaagent:<PATH_TO_AGENT_APM_JAR>\elastic-apm-agent-<version>.jar -Delastic.apm.service_name=%APM_SERVICE_NAME% -Delastic.apm.environment=%APM_ENVIRONMENT% -Delastic.apm.server_urls=%APM_SERVER_URL% 2. Add the Elastic APM library: Add the Elastic APM API library to the <JMETER_HOME>\lib\apm-agent-api-<version>.jar. This library is used by JSR223 Groovy code. Use this URL to find the APM library. Recommendations on the Impact of Adding Elastic APM in JMeter The APM Agent will intercept and modify all HTTP sampler calls, and this information will be stored in Elasticsearch. It is preferable to voluntarily disable the HTTP request of static elements (images, CSS, JavaScript, fonts, etc.) which can generate a large number of requests but are not very useful in analyzing the timeline. In the case of heavy load testing, it's recommended to change the elastic.apm.transaction_sample_rate parameter to only take part of the calls so as not to saturate the APM Server and Elasticsearch. This elastic.apm.transaction_sample_rate parameter can be declared in <JMETER_HOME>\jmeter.bat or setenv.bat but also in a JSR223 sampler with a short Groovy code in a setUp thread group. Groovy code records only 50% samples: Groovy import co.elastic.apm.api.ElasticApm; // update elastic.apm.transaction_sample_rate ElasticApm.setConfig("transaction_sample_rate","0.5"); Conclusion The jmeter-elastic-apm tool allows you to easily integrate the Elastic APM solution into JMeter and add the notion of a page in the timelines of Kibana APM dashboards. Elastic APM + Apache JMeter is an excellent solution for understanding how the environment works during a performance test with simple monitoring, quality dashboards, time breakdown timelines in the different distributed application layers, and the display of exceptions in web services. Over time, the Elastic APM solution only gets better. I strongly recommend it, of course, in a performance testing context, but it also has many advantages in the context of a development environment used for developers or integration used by functional or technical testers. Links Command Line Tool jmeter-elastic-apm JMeter plugin elastic-apm-jmeter-plugin Elastic APM Guides: APM Guide or Application performance monitoring (APM)
The world of Telecom is evolving at a rapid pace, and it is not just important, but crucial for operators to stay ahead of the game. As 5G technology becomes the norm, it is not just essential, but a strategic imperative to transition seamlessly from 4G technology (which operates on OpenStack cloud) to 5G technology (which uses Kubernetes). In the current scenario, operators invest in multiple vendor-specific monitoring tools, leading to higher costs and less efficient operations. However, with the upcoming 5G world, operators can adopt a unified monitoring and alert system for all their products. This single system, with its ability to monitor network equipment, customer devices, and service platforms, offers a reassuringly holistic view of the entire system, thereby reducing complexity and enhancing efficiency. By adopting a Prometheus-based monitoring and alert system, operators can streamline operations, reduce costs, and enhance customer experience. With a single monitoring system, operators can monitor their entire 5G system seamlessly, ensuring optimal performance and avoiding disruptions. This practical solution eliminates the need for a complete overhaul and offers a cost-effective transition. Let's dive deep. Prometheus, Grafana, and Alert Manager Prometheus is a tool for monitoring and alerting systems, utilizing a pull-based monitoring system. It scrapes, collects, and stores Key Performance Indicators (KPI) with labels and timestamps, enabling it to collect metrics from targets, which are the Network Functions' namespaces in the 5G telecom world. Grafana is a dynamic web application that offers a wide range of functionalities. It visualizes data, allowing the building of charts, graphs, and dashboards that the 5G Telecom operator wants to visualize. Its primary feature is the display of multiple graphing and dashboarding support modes using GUI (Graphical user interface). Grafana can seamlessly integrate data collected by Prometheus, making it an indispensable tool for telecom operators. It is a powerful web application that supports the integration of different data sources into one dashboard, enabling continuous monitoring. This versatility improves response rates by alerting the telecom operator's team when an incident emerges, ensuring a minimum 5G network function downtime. The Alert Manager is a crucial component that manages alerts from the Prometheus server via alerting rules. It manages the received alerts, including silencing and inhibiting them and sending out notifications via email or chat. The Alert Manager also removes duplications, grouping, and routing them to the centralized webhook receiver, making it a must-have tool for any telecom operator. Architectural Diagram Prometheus Components of Prometheus (Specific to a 5G Telecom Operator) Core component: Prometheus server scrapes HTTP endpoints and stores data (time series). The Prometheus server, a crucial component in the 5G telecom world, collects metrics from the Prometheus targets. In our context, these targets are the Kubernetes cluster that houses the 5G network functions. Time series database (TSDB): Prometheus stores telecom Metrics as time series data. HTTP Server: API to query data stored in TSDB; The Grafana dashboard can query this data for visualization. Telecom operator-specific libraries (5G) for instrumenting application code. Push gateway (scrape target for short-lived jobs) Service Discovery: In the world of 5G, network function pods are constantly being added or deleted by Telecom operators to scale up or down. Prometheus's adaptable service discovery component monitors the ever-changing list of pods. The Prometheus Web UI, accessible through port 9090, is a data visualization tool. It allows users to view and analyze Prometheus data in a user-friendly and interactive manner, enhancing the monitoring capabilities of the 5G telecom operators. The Alert Manager, a key component of Prometheus, is responsible for handling alerts. It is designed to notify users if something goes wrong, triggering notifications when certain conditions are met. When alerting triggers are met, Prometheus alerts the Alert Manager, which sends alerts through various channels such as email or messenger, ensuring timely and effective communication of critical issues. Grafana for dashboard visualization (actual graphs) With Prometheus's robust components, your Telecom operator's 5G network functions are monitored with diligence, ensuring reliable resource utilization, tracking performance, detection of errors in availability, and more. Prometheus can provide you with the necessary tools to keep your network running smoothly and efficiently. Prometheus Features The multi-dimensional data model identified by metric details uses PromQL (Prometheus Querying Language) as the query language and the HTTP Pull model. Telecom operators can now discover 5G network functions with service discovery and static configuration. The multiple modes of dashboard and GUI support provide a comprehensive and customizable experience for users. Prometheus Remote Write to Central Prometheus from Network Functions 5G Operators will have multiple network functions from various vendors, such as SMF (Session Management Function), UPF (User Plane Function), AMF (Access and Mobility Management Function), PCF (Policy Control Function), and UDM (Unified Data Management). Using multiple Prometheus/Grafana dashboards for each network function can lead to a complex and inefficient 5G network operator monitoring process. To address this, it is highly recommended that all data/metrics from individual Prometheus be consolidated into a single Central Prometheus, simplifying the monitoring process and enhancing efficiency. The 5G network operator can now confidently monitor all the data at the Central Prometheus's centralized location. This user-friendly interface provides a comprehensive view of the network's performance, empowering the operator with the necessary tools for efficient monitoring. Grafana Grafana Features Panels: This powerful feature empowers operators to visualize Telecom 5G data in many ways, including histograms, graphs, maps, and KPIs. It offers a versatile and adaptable interface for data representation, enhancing the efficiency and effectiveness of your data analysis. Plugins: This feature efficiently renders Telecom 5G data in real-time on a user-friendly API (Application Programming Interface), ensuring operators always have the most accurate and up-to-date data at their fingertips. It also enables operators to create data source plugins and retrieve metrics from any API. Transformations: This feature allows you to flexibly adapt, summarize, combine, and perform KPI metrics query/calculations across 5G network functions data sources, providing the tools to effectively manipulate and analyze your data. Annotations: Rich events from different Telecom 5G network functions data sources are used to annotate metrics-based graphs. Panel editor: Reliable and consistent graphical user interface for configuring and customizing 5G telecom metrics panels Grafana Sample Dashboard GUI for 5G Alert Manager Alert Manager Components The Ingester swiftly ingests all alerts, while the Grouper groups them into categories. The De-duplicator prevents repetitive alerts, ensuring you're not bombarded with notifications. The Silencer is there to mute alerts based on a label, and the Throttler regulates the frequency of alerts. Finally, the Notifier will ensure that third parties are notified promptly. Alert Manager Functionalities Grouping: Grouping categorizes similar alerts into a single notification system. This is helpful during more extensive outages when many 5G network functions fail simultaneously and when all the alerts need to fire simultaneously. The telecom operator will expect only to get a single page while still being able to visualize the exact service instances affected. Inhibition: Inhibition suppresses the notification for specific low-priority alerts if certain major/critical alerts are already firing. For example, when a critical alert fires, indicating that an entire 5G SMF (Session Management Function) cluster is not reachable, AlertManager can mute all other minor/warning alerts concerning this cluster. Silences: Silences are simply mute alerts for a given time. Incoming alerts are checked to match the regular expression matches of an active silence. If they match, no notifications will be sent out for that alert. High availability: Telecom operators will not load balance traffic between Prometheus and all its Alert Managers; instead, they will point Prometheus to a list of all Alert Managers. Dashboard Visualization Grafana dashboard visualizes the Alert Manager webhook traffic notifications as shown below: Configuration YAMLs (Yet Another Markup Language) Telecom Operators can install and run Prometheus using the configuration below: YAML prometheus: enabled: true route: enabled: {} nameOverride: Prometheus tls: enabled: true certificatesSecret: backstage-prometheus-certs certFilename: tls.crt certKeyFilename: tls.key volumePermissions: enabled: true initdbScriptsSecret: backstage-prometheus-initdb prometheusSpec: retention: 3d replicas: 2 prometheusExternalLabelName: prometheus_cluster image: repository: <5G operator image repository for Prometheus> tag: <Version example v2.39.1> sha: "" podAntiAffinity: "hard" securityContext: null resources: limits: cpu: 1 memory: 2Gi requests: cpu: 500m memory: 1Gi serviceMonitorNamespaceSelector: matchExpressions: - {key: namespace, operator: In, values: [<Network function 1 namespace>, <Network function 2 namespace>]} serviceMonitorSelectorNilUsesHelmValues: false podMonitorSelectorNilUsesHelmValues: false ruleSelectorNilUsesHelmValues: false Configuration to route scrape data segregated based on the namespace and route to Central Prometheus. Note: The below configuration can be appended to the Prometheus mentioned in the above installation YAML. YAML remoteWrite: - url: <Central Prometheus URL for namespace 1 by 5G operator> basicAuth: username: name: <secret username for namespace 1> key: username password: name: <secret password for namespace 1> key: password tlsConfig: insecureSkipVerify: true writeRelabelConfigs: - sourceLabels: - namespace regex: <namespace 1> action: keep - url: <Central Prometheus URL for namespace 2 by 5G operator> basicAuth: username: name: <secret username for namespace 2> key: username password: name: <secret password for namespace 2> key: password tlsConfig: insecureSkipVerify: true writeRelabelConfigs: - sourceLabels: - namespace regex: <namespace 2> action: keep Telecom Operators can install and run Grafana using the configuration below. YAML grafana: replicas: 2 affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: "app.kubernetes.io/name" operator: In values: - Grafana topologyKey: "kubernetes.io/hostname" securityContext: false rbac: pspEnabled: false # Must be disabled due to tenant permissions namespaced: true adminPassword: admin image: repository: <artifactory>/Grafana tag: <version> sha: "" pullPolicy: IfNotPresent persistence: enabled: false initChownData: enabled: false sidecar: image: repository: <artifactory>/k8s-sidecar tag: <version> sha: "" imagePullPolicy: IfNotPresent resources: limits: cpu: 100m memory: 100Mi requests: cpu: 50m memory: 50Mi dashboards: enabled: true label: grafana_dashboard labelValue: "Vendor name" datasources: enabled: true defaultDatasourceEnabled: false additionalDataSources: - name: Prometheus type: Prometheus url: http://<prometheus-operated>:9090 access: proxy isDefault: true jsonData: timeInterval: 30s resources: limits: cpu: 400m memory: 512Mi requests: cpu: 50m memory: 206Mi extraContainers: - name: oauth-proxy image: <artifactory>/origin-oauth-proxy:<version> imagePullPolicy: IfNotPresent ports: - name: proxy-web containerPort: 4181 args: - --https-address=:4181 - --provider=openshift # Service account name here must be "<Helm Release name>-grafana" - --openshift-service-account=monitoring-grafana - --upstream=http://localhost:3000 - --tls-cert=/etc/tls/private/tls.crt - --tls-key=/etc/tls/private/tls.key - --cookie-secret=SECRET - --pass-basic-auth=false resources: limits: cpu: 100m memory: 256Mi requests: cpu: 50m memory: 128Mi volumeMounts: - mountPath: /etc/tls/private name: grafana-tls extraContainerVolumes: - name: grafana-tls secret: secretName: grafana-tls serviceAccount: annotations: "serviceaccounts.openshift.io/oauth-redirecturi.first": https://[SPK exposed IP for Grafana] service: targetPort: 4181 annotations: service.alpha.openshift.io/serving-cert-secret-name: <secret> Telecom Operators can install and run Alert Manager using the configuration below. YAML alertmanager: enabled: true alertmanagerSpec: image: repository: prometheus/alertmanager tag: <version> replicas: 2 podAntiAffinity: hard securityContext: null resources: requests: cpu: 25m memory: 200Mi limits: cpu: 100m memory: 400Mi containers: - name: config-reloader resources: requests: cpu: 10m memory: 10Mi limits: cpu: 25m memory: 50Mi Configuration to route Prometheus Alert Manager data to the Operator's centralized webhook receiver. Note: The below configuration can be appended to the Alert Manager mentioned in the above installation YAML. YAML config: global: resolve_timeout: 5m route: group_by: ['alertname'] group_wait: 30s group_interval: 5m repeat_interval: 12h receiver: 'null' routes: - receiver: '<Network function 1>' group_wait: 10s group_interval: 10s group_by: ['alertname','oid','action','time','geid','ip'] matchers: - namespace="<namespace 1>" - receiver: '<Network function 2>' group_wait: 10s group_interval: 10s group_by: ['alertname','oid','action','time','geid','ip'] matchers: - namespace="<namespace 2>" Conclusion The open-source OAM (Operation and Maintenance) tools Prometheus, Grafana, and Alert Manager can benefit 5G Telecom operators. Prometheus periodically captures all the status of monitored 5G Telecom network functions through the HTTP protocol, and any component can be connected to the monitoring as long as the 5G Telecom operator provides the corresponding HTTP interface. Prometheus and Grafana Agent gives the 5G Telecom operator control over the metrics the operator wants to report; once the data is in Grafana, it can be stored in a Grafana database as extra data redundancy. In conclusion, Prometheus allows 5G Telecom operators to improve their operations and offer better customer service. Adopting a unified monitoring and alert system like Prometheus is one way to achieve this.
Joana Carvalho
                                            Site Reliability Engineering,
                                                
                                            Virtuoso
                                        
Eric D. Schabell
                                            Director Technical Marketing & Evangelism,
                                                
                                            Chronosphere