William Cohen
William Cohen has been a developer of performance tools at Red Hat for over a decade and has worked on a number of the performance tools in Red Hat Enterprise Linux and Fedora such as OProfile, PAPI, SystemTap, and Dyninst.
William Cohen's contributions
Programmer's Model of a Processor Executing Instructions Versus Reality
William Cohen
Everything on a computer system eventually ends up being run as a sequence of machine instructions. People want to keep things simple and understandable even if that is not really the way that things work. The simple programmer's model of a Reduced Instruction Set Computer (RISC) processor executing those machine language instructions is a loop of the following steps, each step finished before moving on to the next step: fetch instruction; decode instruction and fetch register operands; execute arithmetic computation...
Monitoring: Corollary to "Release Early and Often"
William Cohen
Each day, when I get off the elevator and walk to my desk at Red Hat, I am greeted by a very large sign that says "Release early, release often."
Which tasks are periodically taking processor time?
William Cohen
When running a latency-sensitive application, one might notice a delay that occurs on a regular basis (for example, every 5 minutes). The SystemTap periodic.stp script can provide some possible causes of that regular delay. The script generates a list of the number of times that various scheduled functions run and the time between each scheduled execution. In the case of a delay every five minutes, one would run the periodic script for tens of minutes and then look...
Which task is getting all the CPU processor cycles?
William Cohen
If an important task is processor limited, one would like to make sure that the task is getting as much processor time as possible and that other tasks are not delaying its execution. The SystemTap example script, cycle_thief.stp, lists what interrupts and other tasks run on the same processor as the important task. The cycle_thief.stp script provides the following pieces of information: the number of times the monitored task migrated; a histogram of the duration of time...
Examining Huge Pages or Transparent Huge Pages performance
William Cohen
All modern processors use page-based mechanisms to translate user-space processes' virtual addresses into physical addresses for RAM. The pages are commonly 4KB in size, and the processor can hold a limited number of virtual-to-physical address mappings in the Translation Lookaside Buffers (TLB). The number of TLB entries ranges from tens to hundreds of mappings. This limits a processor to a few megabytes of memory it can address without changing the TLB entries. When a virtual-to-physical address mapping is not in...
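The "few megabytes" figure in the excerpt above follows from simple arithmetic: the memory reachable without changing TLB entries is the number of entries times the page size. The following is a minimal sketch of that calculation. The entry counts (64 and 512) and the 2 MiB huge page size are illustrative assumptions, not values taken from the article; 2 MiB is a common huge page size on x86-64, and the helper name `tlb_reach` is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_size):
    """Bytes addressable without replacing any TLB entry."""
    return entries * page_size


if __name__ == "__main__":
    # Entry counts are assumed examples within the "tens to hundreds" range.
    for entries in (64, 512):
        small = tlb_reach(entries, 4 * KIB)  # 4 KiB pages, as in the text
        huge = tlb_reach(entries, 2 * MIB)   # assumed 2 MiB huge pages
        print(
            f"{entries} entries: {small / MIB:g} MiB with 4 KiB pages, "
            f"{huge / MIB:g} MiB with 2 MiB pages"
        )
```

Under these assumptions, the same number of TLB entries covers 512 times more memory with 2 MiB pages than with 4 KiB pages, which is the motivation for examining huge page usage.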
Determining whether an application has poor cache performance
William Cohen
Modern computer systems include cache memory to hide the higher latency and lower bandwidth of RAM from the processor. The cache has access latencies ranging from a few processor cycles to ten or twenty cycles, rather than the hundreds of cycles needed to access RAM. If the processor must frequently obtain data from RAM rather than from the cache, performance will suffer. With Red Hat Enterprise Linux 6 and newer distributions, the system's use of cache can be measured...
Profiling Ruby Programs
William Cohen
The Ruby interpreter includes a profiling tool which is invoked with the -rprofile option on the command line. Below is an example running the Ruby Fibonacci program (fib.rb) included in Ruby documentation samples. The list of functions is sorted from most to least time spent exclusively in the function (self seconds). The first column provides the percentage of self seconds for each function. The cumulative seconds indicates the amount of time spent in that function and the functions...
Profiling Python Programs
William Cohen
For RHEL6 and newer distributions, tools are available to profile Python code and to generate dynamic call graphs of a program's execution. Flat profiles can be obtained with the cProfile module, and dynamic call graphs can be obtained with pycallgraph. The cProfile Python module records information about each of the Python methods run. For older versions of Python that do not include the cProfile module, you can use the higher-overhead profile module. Profiling is fairly simple with the cProfile module...
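As a rough illustration of the flat profile mentioned above, the sketch below runs a small workload under the standard library's cProfile module and formats the result with pstats. The recursive `fib` workload and the `profile_fib` helper are hypothetical names chosen for this example; they do not come from the article.

```python
import cProfile
import io
import pstats


def fib(n):
    """Deliberately slow recursive Fibonacci, used only as a workload."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def profile_fib(n):
    """Run fib(n) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fib(n)
    profiler.disable()

    # Format the collected statistics into a string instead of stdout.
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 entries
    return result, out.getvalue()


if __name__ == "__main__":
    value, report = profile_fib(15)
    print(value)
    print(report)
```

An existing script can also be profiled without modification by running it as `python -m cProfile script.py`, where `script.py` stands in for the program of interest.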