Notes on Analysing Behaviour Using Events and Tracepoints

Documentation written by Mel Gorman
PCL information heavily based on email from Ingo Molnar
1. Introduction
===============
Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
creating custom kernel modules to register probe functions using the event
tracing infrastructure.

Simplistically, tracepoints represent important events that can be
taken in conjunction with other tracepoints to build a "Big Picture" of
what is going on within the system. There are a large number of methods for
gathering and interpreting these events. Lacking any current best practices,
this document describes some of the methods that can be used.

This document assumes that debugfs is mounted on /sys/kernel/debug and that
the appropriate tracing options have been configured into the kernel. It is
assumed that the PCL tool tools/perf has been installed and is in your path.
2. Listing Available Events
===========================

2.1 Standard Utilities
----------------------
All possible events are visible from /sys/kernel/debug/tracing/events. Simply
calling

  $ find /sys/kernel/debug/tracing/events -type d

will give a fair indication of the number of events available.
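On kernels that provide it, the available_events file gives a flat list of
tracepoints, one per line, which is convenient for grepping. A minimal sketch,
assuming debugfs is mounted as described above:

  $ grep '^kmem:' /sys/kernel/debug/tracing/available_events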
2.2 PCL (Performance Counters for Linux)
----------------------------------------
Discovery and enumeration of all counters and events, including tracepoints,
are available with the perf tool. Getting a list of available events is a
simple case of:
  $ perf list 2>&1 | grep Tracepoint
  ext4:ext4_free_inode                     [Tracepoint event]
  ext4:ext4_request_inode                  [Tracepoint event]
  ext4:ext4_allocate_inode                 [Tracepoint event]
  ext4:ext4_write_begin                    [Tracepoint event]
  ext4:ext4_ordered_write_end              [Tracepoint event]
  [ .... remaining output snipped .... ]
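perf list also accepts a glob to restrict the output to a single subsystem.
As a minimal sketch, assuming the kmem tracepoints are compiled into the
kernel and the perf version in use supports event globs:

  $ perf list 'kmem:*'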
3. Enabling Events
==================

3.1 System-Wide Event Enabling
------------------------------
See Documentation/trace/events.txt for a proper description of how events
can be enabled system-wide. A short example of enabling all events related
to page allocation would look something like:

  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
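Events enabled this way remain active, and appear in trace and trace_pipe,
until they are explicitly switched off again. A minimal sketch mirroring the
loop above:

  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 0 > $i; done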
3.2 System-Wide Event Enabling with SystemTap
---------------------------------------------
In SystemTap, tracepoints are accessible using the kernel.trace() function
call. The following is an example that reports every 5 seconds which processes
were allocating pages.
  global page_allocs

  probe kernel.trace("mm_page_alloc") {
          page_allocs[execname()]++
  }

  function print_count() {
          printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
          foreach (proc in page_allocs-)
                  printf("%-25d %s\n", page_allocs[proc], proc)
          printf ("\n")
          delete page_allocs
  }

  probe timer.s(5) {
          print_count()
  }
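Assuming the script above is saved as page-alloc.stp (a name chosen only for
this example), it can be run with the stap front-end. Root privileges are
normally required:

  $ stap page-alloc.stp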
3.3 System-Wide Event Enabling with PCL
---------------------------------------
By specifying the -a switch and profiling the sleep command, system-wide
events can be examined for a fixed duration of time:

  $ perf stat -a \
	-e kmem:mm_page_alloc -e kmem:mm_page_free \
	-e kmem:mm_page_free_batched \
	sleep 10

   Performance counter stats for 'sleep 10':

           9630  kmem:mm_page_alloc
           2143  kmem:mm_page_free
           7424  kmem:mm_page_free_batched

   10.002577764  seconds time elapsed
Similarly, one could execute a shell and exit it as desired to get a report
at that point.
3.4 Local Event Enabling
------------------------

Documentation/trace/ftrace.txt describes how to enable events on a per-thread
basis using set_ftrace_pid.
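Event output can also be narrowed to a single process using the per-event
filter files described in Documentation/trace/events.txt. This is a minimal
sketch, assuming event filtering is configured into the kernel and 1234 is a
hypothetical target PID:

  $ echo 'common_pid == 1234' > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/filter
  $ echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable
  $ cat /sys/kernel/debug/tracing/trace_pipe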
3.5 Local Event Enablement with PCL
-----------------------------------
Events can be activated and tracked for the duration of a single process
using PCL as follows:

  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
	-e kmem:mm_page_free_batched ./hackbench 10
  Time: 0.909

   Performance counter stats for './hackbench 10':

          17803  kmem:mm_page_alloc
          12398  kmem:mm_page_free
           4827  kmem:mm_page_free_batched

    0.973913387  seconds time elapsed
4. Event Filtering
==================

Documentation/trace/ftrace.txt covers in depth how to filter events in ftrace.
Alternatively, events can be filtered with grep and awk on trace_pipe, or with
any script that reads trace_pipe.
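As an illustration of the grep/awk approach, the following minimal sketch
counts mm_page_alloc events per task name. It reads the trace snapshot file
rather than trace_pipe to avoid blocking, and assumes the default
human-readable format where the first field is the task name and PID joined
by a hyphen:

  $ grep 'mm_page_alloc:' /sys/kernel/debug/tracing/trace | \
	awk '{ sub(/-[0-9]+$/, "", $1); count[$1]++ }
	     END { for (p in count) printf "%-25s %d\n", p, count[p] }'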
5. Analysing Event Variances with PCL
=====================================

Any workload can exhibit variances between runs and it can be important
to know what the standard deviation is. By and large, this is left to the
performance analyst to do by hand. Where counts of discrete event occurrences
are what the performance analyst needs, perf can do the work:
  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free \
	-e kmem:mm_page_free_batched ./hackbench 10
  Time: 0.890
  Time: 0.895
  Time: 0.915
  Time: 1.001
  Time: 0.899

   Performance counter stats for './hackbench 10' (5 runs):

          16630  kmem:mm_page_alloc         ( +-   3.542% )
          11486  kmem:mm_page_free          ( +-   4.771% )
           4730  kmem:mm_page_free_batched  ( +-   2.325% )

    0.982653002  seconds time elapsed   ( +-   1.448% )
If a higher-level metric is required that depends on some aggregation of the
discrete events, then a script needs to be developed.
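A minimal sketch of such an aggregation, assuming the plain numeric perf stat
output format shown above, sums the allocation and free counts into a
hypothetical "net pages allocated" figure:

  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
	./hackbench 10 2>&1 | \
	awk '/kmem:mm_page_alloc/ { alloc = $1 }
	     /kmem:mm_page_free/  { freed = $1 }
	     END { printf "net pages allocated: %d\n", alloc - freed }'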
Using --repeat, it is also possible to view how events are fluctuating over
time on a system-wide basis using -a and sleep.
  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
	-e kmem:mm_page_free_batched \
	-a --repeat 10 \
	sleep 1

   Performance counter stats for 'sleep 1' (10 runs):

           1066  kmem:mm_page_alloc         ( +-  26.148% )
            182  kmem:mm_page_free          ( +-   5.464% )
            890  kmem:mm_page_free_batched  ( +-  30.079% )

    1.002251757  seconds time elapsed   ( +-   0.005% )
6. Higher-Level Analysis with Helper Scripts
============================================
When events are enabled, the events that are triggering can be read from
/sys/kernel/debug/tracing/trace_pipe in human-readable format, although binary
options exist as well. By post-processing the output, further information can
be gathered on-line as appropriate. Examples of post-processing might include

  o Reading information from /proc for the PID that triggered the event
  o Deriving a higher-level event from a series of lower-level events
  o Calculating latencies between two events
Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
script that can read trace_pipe from STDIN or a copy of a trace. When used
on-line, it can be interrupted once to generate a report without exiting
and twice to exit.
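A minimal sketch of running it on-line, assuming a standard debugfs mount
(invocation details and options are documented in the script itself):

  $ cat /sys/kernel/debug/tracing/trace_pipe | \
	perl Documentation/trace/postprocess/trace-pagealloc-postprocess.pl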
Simplistically, the script just reads STDIN and counts up events, but it can
also do more, such as

  o Derive high-level events from many low-level events. If a number of pages
    are freed to the main allocator from the per-CPU lists, it recognises
    that as one per-CPU drain even though there is no specific tracepoint
    for that event
  o Aggregate based on PID or individual process number
  o In the event memory is getting externally fragmented, report
    whether the fragmentation event was severe or moderate
  o When receiving an event about a PID, record who the parent was so
    that if large numbers of events are coming from very short-lived
    processes, the parent process responsible for creating all the helpers
    can be identified
7. Lower-Level Analysis with PCL
================================
There may also be a requirement to identify what functions within a program
were generating events within the kernel. To begin this sort of analysis, the
data must be recorded. At the time of writing, this required root:

  $ perf record -c 1 \
	-e kmem:mm_page_alloc -e kmem:mm_page_free \
	-e kmem:mm_page_free_batched \
	./hackbench 10
  Time: 0.894
  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]

Note the use of '-c 1' to set the sample period to 1 so that every occurrence
of an event is recorded. The default sample period is quite high to minimise
overhead, but the information collected can be very coarse as a result.
This record produced a file called perf.data which can be analysed using
perf report.
  $ perf report
  # Samples: 30922
  #
  # Overhead  Command                         Shared Object
  # ........  .........  ................................
  #
      87.27%  hackbench  [vdso]
       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so
       2.62%  hackbench  /lib/ld-2.9.so
       1.52%       perf  [vdso]
       1.22%  hackbench  ./hackbench
       0.48%  hackbench  [kernel]
       0.02%       perf  /lib/i686/cmov/libc-2.9.so
       0.01%       perf  /usr/bin/perf
       0.01%       perf  /lib/ld-2.9.so
       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
  #
  # (For more details, try: perf report --sort comm,dso,symbol)
  #
According to this, the vast majority of events were triggered within the
VDSO. With simple binaries, this will often be the case, so let's take a
slightly different example. In the course of writing this, it was noticed
that X was generating an insane number of page allocations, so let's look
at it:
  $ perf record -c 1 -f \
	-e kmem:mm_page_alloc -e kmem:mm_page_free \
	-e kmem:mm_page_free_batched \
	-p `pidof X`
This was interrupted after a few seconds and a report generated:

  $ perf report
  # Samples: 27666
  #
  # Overhead  Command                            Shared Object
  # ........  .......  .......................................
  #
      51.95%     Xorg  [vdso]
      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so
       0.01%     Xorg  [kernel]
  #
  # (For more details, try: perf report --sort comm,dso,symbol)
  #
So, almost half of the events are occurring in a library. To get an idea of
which symbol is responsible:

  $ perf report --sort comm,dso,symbol
  # Samples: 27666
  #
  # Overhead  Command                            Shared Object  Symbol
  # ........  .......  .......................................  ......
  #
      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
       0.01%     Xorg  [kernel]                                 [k] read_hpet
       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
To see where within the function pixmanFillsse2 things are going wrong:

  $ perf annotate pixmanFillsse2
  [ ... ]
    0.00 :  34eeb:       0f 18 08                prefetcht0 (%eax)
         :      }
         :
         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
         :      _mm_store_si128 (__m128i *__P, __m128i __B)
         :      {
         :        *__P = __B;
   12.40 :  34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
    0.00 :  34ef5:       ff
   12.40 :  34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
    0.00 :  34efd:       ff
   12.39 :  34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
    0.00 :  34f05:       ff
   12.67 :  34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
    0.00 :  34f0d:       ff
   12.58 :  34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
   12.31 :  34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
   12.40 :  34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
   12.31 :  34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
At a glance, it looks like the time is being spent copying pixmaps to
the card. Further investigation would be needed to determine why pixmaps
are being copied around so much, but a starting point would be to take the
ancient build of libpixman, totally forgotten about for months, out of the
library path!