Advanced Linux Troubleshooting Techniques for Site Reliability Engineers

Explore advanced troubleshooting techniques beyond basic tools and commands, focusing on kernel debugging, system call tracing, performance analysis, and eBPF.

By Prashanth Ravula · May. 15, 24 · Tutorial

In Site Reliability Engineering (SRE), the ability to quickly and effectively troubleshoot issues within Linux systems is crucial. This article explores advanced troubleshooting techniques beyond basic tools and commands, focusing on kernel debugging, system call tracing, performance analysis, and using the Extended Berkeley Packet Filter (eBPF) for real-time data gathering.

Kernel Debugging

Kernel debugging is a fundamental skill for any SRE working with Linux. It allows for deep inspection of the kernel's behavior, which is critical when diagnosing system crashes or performance bottlenecks.

Tools and Techniques

GDB (GNU Debugger)

GDB can debug kernel modules and the Linux kernel. It allows setting breakpoints, stepping through the code, and inspecting variables. 

  • GNU Debugger Official Documentation: This is the official documentation for GNU Debugger, providing a comprehensive overview of its features.

KGDB

KGDB is the kernel's built-in debugger stub: it lets GDB, running on a second machine, debug a live kernel over a serial connection or a network. The kernel documentation page "Using kgdb, kdb and the kernel debugger internals" explains in detail how kgdb can be enabled and configured.
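As a rough sketch of what such a session involves, the script below prints the usual sequence for debugging over a serial line. It assumes a target kernel built with CONFIG_KGDB and CONFIG_KGDB_SERIAL_CONSOLE, a matching vmlinux with debug info on the debugging host, and ttyS0 at 115200 baud on both ends. The device names and the print_kgdb_steps helper are illustrative and not part of the original article. The script only prints the commands, because triggering the debugger halts the target kernel.

```shell
#!/bin/sh
# Print the steps for a kgdb session over a serial line (nothing is executed).
# Assumptions: target kernel has CONFIG_KGDB and CONFIG_KGDB_SERIAL_CONSOLE,
# and the host has the matching vmlinux built with debug info.

print_kgdb_steps() {
    target_tty=$1   # serial port name on the target, e.g. ttyS0
    host_dev=$2     # serial device on the debugging host, e.g. /dev/ttyS0
    baud=$3         # line speed, identical on both ends

    cat <<EOF
# On the target (as root): bind kgdb to the serial console at runtime
echo ${target_tty},${baud} > /sys/module/kgdboc/parameters/kgdboc
# On the target: stop the kernel and wait for the debugger (SysRq-g)
echo g > /proc/sysrq-trigger
# On the host: attach GDB to the halted kernel
gdb ./vmlinux
(gdb) set serial baud ${baud}
(gdb) target remote ${host_dev}
EOF
}

print_kgdb_steps ttyS0 /dev/ttyS0 115200
```

To debug early boot instead, the same binding can be given on the kernel command line as kgdboc=ttyS0,115200 together with kgdbwait, which makes the kernel pause until GDB connects.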

Dynamic Debugging (dyndbg)

Linux's dynamic debug feature lets you switch individual kernel debug messages (pr_debug() and dev_dbg() call sites) on and off at runtime, which helps trace kernel operations without rebuilding the kernel or rebooting the system. The official Dynamic Debug page describes how to use the dynamic debug (dyndbg) feature.
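A minimal sketch of how this looks in practice follows. It assumes a kernel built with CONFIG_DYNAMIC_DEBUG and debugfs mounted at /sys/kernel/debug. The module name usbcore is only a stand-in, and the dyndbg_module helper is hypothetical. When the control file is not writable, the script prints what it would have written instead of failing.

```shell
#!/bin/sh
# Toggle pr_debug()/dev_dbg() output for one module via dynamic debug.
# Assumptions: kernel built with CONFIG_DYNAMIC_DEBUG, debugfs mounted at
# /sys/kernel/debug, and "usbcore" as a stand-in module name.

CONTROL=${CONTROL:-/sys/kernel/debug/dynamic_debug/control}

# dyndbg_module MODULE FLAGS  ->  writes a query such as "module usbcore +p"
dyndbg_module() {
    query="module $1 $2"
    if [ -w "$CONTROL" ]; then
        printf '%s\n' "$query" > "$CONTROL"
    else
        # No writable control file (not root, or feature not enabled):
        # show what would have been written instead.
        printf 'would write to %s: %s\n' "$CONTROL" "$query"
    fi
}

dyndbg_module usbcore +p    # enable the module's debug messages
dyndbg_module usbcore -p    # disable them again
```

Once a call site is enabled, its messages appear in the kernel log, so following dmesg while reproducing the problem is usually enough to see them.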

Tracing System Calls With strace

strace is a powerful diagnostic tool that records the system calls a program makes and the signals it receives. It is instrumental in understanding the interaction between applications and the Linux kernel.

Usage

To trace system calls, strace can either attach to a running process or launch a new process under its control. It logs every system call with its arguments and return value, and the log can then be analyzed to find faults in system operations.

Example:

Shell
 
root@ubuntu:~# strace -p 2009
strace: Process 2009 attached
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0


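In the above example, the -p flag tells strace to attach to an existing process, and 2009 is that process's PID. The trace shows the process repeatedly mapping and unmapping the same anonymous region of roughly 128 MiB. Similarly, you can use the -o flag to log the output to a file instead of dumping everything on the screen. To learn more, review the article on understanding system calls on Linux with strace.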
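Once a trace has been saved with -o, a quick tally of calls by name often shows where to look first. The sketch below is one way to do that with awk. The count_syscalls helper is hypothetical, and the three sample lines are copied from the output above in place of a real log file.

```shell
#!/bin/sh
# Count system calls by name in a saved strace log.
# The sample below stands in for a file written with: strace -o LOG -p PID

count_syscalls() {
    # Keep only lines that start with "name(", strip everything from the
    # first "(" onward, and count occurrences of each name. Lines that are
    # not calls (signals, exit status, attach notices) are skipped.
    awk '/^[a-z_0-9]+\(/ { name = $0; sub(/\(.*/, "", name); n[name]++ }
         END { for (s in n) print n[s], s }' "$1" |
        sort -rn
}

log=$(mktemp)
cat > "$log" <<'EOF'
munmap(0xe02057400000, 134221824)       = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000
munmap(0xe02057400000, 134221824)       = 0
EOF

count_syscalls "$log"
rm -f "$log"
```

For the sample lines this prints "2 munmap" followed by "1 mmap". strace can also produce a similar summary itself through its -c flag, which reports call counts, errors, and time per system call.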

Performance Analysis With perf

perf is a versatile tool used for system performance analysis. It provides a rich set of commands to collect, analyze, and report on hardware and software events.

Key Features

  1. perf record: Gathers performance data into a file, perf.data, which can then be analyzed with perf report to identify hotspots.
  2. perf report: Analyzes the data collected by perf record and shows where most of the time was spent, helping identify performance bottlenecks.
  3. Event-based sampling: perf can record data based on specific events, such as cache misses or CPU cycles, which helps pinpoint performance issues more accurately.
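The features above can be combined into a short, bounded profiling run against a single process. The following is a sketch under a few assumptions: perf is installed, the caller is allowed to profile the target (root, or a permissive kernel.perf_event_paranoid setting), and the profile_pid helper and the PID in the usage comment are purely illustrative.

```shell
#!/bin/sh
# Sample one process for a fixed window, then print a text report.
# Assumptions: perf is installed and the caller may profile the target
# (root or a permissive kernel.perf_event_paranoid setting).

profile_pid() {
    pid=$1; seconds=$2; out=${3:-perf.data}
    if ! command -v perf >/dev/null 2>&1; then
        echo "perf not found; install your distribution's perf package" >&2
        return 1
    fi
    # -F 99: sample at 99 Hz; -g: record call graphs;
    # "sleep N" only bounds how long perf stays attached to the PID.
    perf record -F 99 -g -p "$pid" -o "$out" -- sleep "$seconds" &&
        perf report -i "$out" --stdio | head -n 40
}

# Event-based sampling: sample on cache misses instead of the default event.
#   perf record -e cache-misses -p PID -- sleep 10

# Example (hypothetical PID):
#   profile_pid 2009 10
```

Attaching to one PID for a fixed number of seconds keeps perf.data small and avoids mixing in idle time from the rest of the system, which is what dominates a system-wide capture on a mostly idle host.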

Example:

Shell
 
root@ubuntu:/tmp# perf record
^C[ perf record: Woken up 17 times to write data ]
[ perf record: Captured and wrote 4.619 MB perf.data (83123 samples) ]

root@ubuntu:/tmp#

root@ubuntu:/tmp# perf report
Samples: 83K of event 'cpu-clock:ppp', Event count (approx.): 20780750000
Overhead  Command          Shared Object             Symbol
  17.74%  swapper          [kernel.kallsyms]         [k] cpuidle_idle_call
   8.36%  stress           [kernel.kallsyms]         [k] __do_softirq
   7.17%  stress           [kernel.kallsyms]         [k] finish_task_switch.isra.0
   6.90%  stress           [kernel.kallsyms]         [k] el0_da
   5.73%  stress           libc.so.6                 [.] random_r
   3.92%  stress           [kernel.kallsyms]         [k] flush_end_io
   3.87%  stress           libc.so.6                 [.] random
   3.71%  stress           libc.so.6                 [.] 0x00000000001405bc
   2.71%  kworker/0:2H-kb  [kernel.kallsyms]         [k] ata_scsi_queuecmd
   2.58%  stress           libm.so.6                 [.] __sqrt_finite
   2.45%  stress           stress                    [.] 0x0000000000000f14
   1.62%  stress           stress                    [.] 0x000000000000168c
   1.46%  stress           [kernel.kallsyms]         [k] __pi_clear_page
   1.37%  stress           libc.so.6                 [.] rand
   1.34%  stress           libc.so.6                 [.] 0x00000000001405c4
   1.22%  stress           stress                    [.] 0x0000000000000e94
   1.20%  stress           [kernel.kallsyms]         [k] folio_batch_move_lru
   1.20%  stress           stress                    [.] 0x0000000000000f10
   1.16%  stress           libc.so.6                 [.] 0x00000000001408d4
   0.84%  stress           [kernel.kallsyms]         [k] handle_mm_fault
   0.77%  stress           [kernel.kallsyms]         [k] release_pages
   0.65%  stress           [kernel.kallsyms]         [k] super_lock
   0.62%  stress           [kernel.kallsyms]         [k] _raw_spin_unlock_irqrestore
   0.61%  stress           [kernel.kallsyms]         [k] blk_done_softirq
   0.61%  stress           [kernel.kallsyms]         [k] _raw_spin_lock
   0.60%  stress           [kernel.kallsyms]         [k] folio_add_lru
   0.58%  kworker/0:2H-kb  [kernel.kallsyms]         [k] finish_task_switch.isra.0
   0.55%  stress           [kernel.kallsyms]         [k] __rcu_read_lock
   0.52%  stress           [kernel.kallsyms]         [k] percpu_ref_put_many.constprop.0
   0.46%  stress           stress                    [.] 0x00000000000016e0
   0.45%  stress           [kernel.kallsyms]         [k] __rcu_read_unlock
   0.45%  stress           [kernel.kallsyms]         [k] dynamic_might_resched
   0.42%  stress           [kernel.kallsyms]         [k] _raw_spin_unlock
   0.41%  stress           [kernel.kallsyms]         [k] __mod_memcg_lruvec_state
   0.40%  stress           [kernel.kallsyms]         [k] mas_walk
   0.39%  stress           [kernel.kallsyms]         [k] arch_counter_get_cntvct
   0.39%  stress           [kernel.kallsyms]         [k] rwsem_read_trylock
   0.39%  stress           [kernel.kallsyms]         [k] up_read
   0.38%  stress           [kernel.kallsyms]         [k] down_read
   0.37%  stress           [kernel.kallsyms]         [k] get_mem_cgroup_from_mm
   0.36%  stress           [kernel.kallsyms]         [k] free_unref_page_commit
   0.34%  stress           [kernel.kallsyms]         [k] memset
   0.32%  stress           libc.so.6                 [.] 0x00000000001408c8
   0.30%  stress           [kernel.kallsyms]         [k] sync_inodes_sb
   0.29%  stress           [kernel.kallsyms]         [k] iterate_supers
   0.29%  stress           [kernel.kallsyms]         [k] percpu_counter_add_batch


Real-Time Data Gathering With eBPF

eBPF lets you write small programs that run inside the Linux kernel in a sandboxed environment, after the kernel's verifier has checked them for safety. These programs can attach to system calls and network events, providing real-time insights into system behavior.

Applications

  • Network monitoring: eBPF can monitor network traffic in real-time, providing insights into packet flow and protocol usage without significant performance overhead.
  • Security: eBPF helps implement security policies by monitoring system calls and network activity to detect and prevent malicious activities.
  • Performance monitoring: It can track application performance by monitoring function calls and system resource usage, helping SREs optimize performance.
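
As a small illustration of the system call tracking mentioned above, the sketch below uses bpftrace, a high-level front end for eBPF that the original article does not cover. It assumes bpftrace is installed, the kernel exposes the raw_syscalls tracepoint, and the script runs as root. If those conditions are not met, it prints a message and loads nothing.

```shell
#!/bin/sh
# Count system calls per process name for five seconds with bpftrace.
# Assumptions: bpftrace is installed, the kernel exposes the raw_syscalls
# tracepoint, and the script runs as root.

# @[comm] is a map keyed by process name; interval:s:5 ends the run after
# five seconds, at which point bpftrace prints the map.
PROG='tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }
interval:s:5 { exit(); }'

if command -v bpftrace >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    bpftrace -e "$PROG" || echo "bpftrace could not run the program" >&2
else
    echo "bpftrace unavailable or not running as root; program not loaded" >&2
fi
```

The counting happens inside the kernel and only the final per-process totals are copied to user space, which is why this style of tracing tends to cost far less than logging every call individually.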

Conclusion

Advanced troubleshooting in Linux involves a combination of tools and techniques that provide deep insights into system operations. Tools like GDB, strace, perf, and eBPF are essential for any SRE looking to enhance their troubleshooting capabilities. By leveraging these tools, SREs can ensure the high reliability and performance of Linux systems in production environments.
