Notes from JavaOne 2005...
Jack Catchpoole, Ian Formanek, Patrick Keegan, Gregg Sporar, all of Sun Microsystems
Why is profiling hard? Tools are expensive. Setup complex. App does not run well in profiler. App runs differently in profiler. Results are hard to analyze. Even more so for large apps. More levels of abstraction. More code. More unknown code. Large heaps. Many threads. Highly repetitive actions.
Performance problems in large apps: total code size may be huge (100s of KLOCs). Problems can happen in any code layer.
To deliver apps that perform well, you may need to check all the pieces. There are tools to help, but they don't always make it easy.
Conventional profiling and its limitations. JVMPI/JVMTI. Bytecode instrumentation is the main technique - rewriting code to emit events about program actions. VM-generated events are an alternative: the VM can generate its own events on method entry/exit, object allocation, etc. Sampling-based profiling periodically samples what methods are executing; low overhead.
Bytecode instrumentation is flexible. Most common use is CPU and memory profiling. For CPU profiling, inject a methodEntry(methodId) call before the first bytecode, and methodExit(methodId) before each return. Memory profiling injects objectAllocated(Object) after each new. Or, could just instrument constructors; array allocations still need in-place instrumentation. Bytecode overhead can be huge or can be very small; impact is directly proportional to the amount of info you get out. Can control it by being selective in what you instrument.
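A source-level sketch of what the injected calls amount to. The Profiler class, the method IDs, and the counters are illustrative names, not a real profiler API; this single-threaded, non-reentrant version just shows the entry/exit wiring.

```java
// Sketch: source-level equivalent of CPU-profiling bytecode instrumentation.
// Profiler, methodEntry, and methodExit are hypothetical names.
import java.util.HashMap;
import java.util.Map;

public class Profiler {
    static final Map<Integer, Long> invocations = new HashMap<>();
    static final Map<Integer, Long> entryTimes = new HashMap<>();   // non-reentrant sketch
    static final Map<Integer, Long> totalNanos = new HashMap<>();

    static void methodEntry(int methodId) {
        invocations.merge(methodId, 1L, Long::sum);
        entryTimes.put(methodId, System.nanoTime());
    }

    static void methodExit(int methodId) {
        long elapsed = System.nanoTime() - entryTimes.get(methodId);
        totalNanos.merge(methodId, elapsed, Long::sum);
    }

    // What an instrumented method looks like after injection:
    static int square(int x) {
        Profiler.methodEntry(42);          // injected before the first bytecode
        try {
            return x * x;
        } finally {
            Profiler.methodExit(42);       // injected before each return
        }
    }
}
```

A real profiler keeps per-thread stacks of entry times so that recursive and concurrent calls don't clobber each other.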
VM can generate its own events, but only a limited number of events are supported. Used sparingly for profiling these days, since it breaks VM optimizations, leading to poor performance, and has very little flexibility. Some events can't be obtained any other way (e.g., garbage collection start/finish).
Sampling-based gives CPU profiling only. Looks at each thread's stack periodically (about every 10-100 ms). Good: very low overhead (1-2%), proportional to the number of threads. Bad: no info on the number of invocations, and accuracy isn't as good.
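A minimal sampler sketch. Thread.getAllStackTraces() is a real JDK API; the Sampler class and the idea of counting only the topmost frame are illustrative simplifications.

```java
// Sketch of a sampling profiler: periodically capture every thread's stack
// and count the method on top of each one.
import java.util.HashMap;
import java.util.Map;

public class Sampler {
    final Map<String, Integer> hits = new HashMap<>();

    // Take 'n' samples, one every 'intervalMs' milliseconds.
    void run(int n, long intervalMs) {
        for (int i = 0; i < n; i++) {
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                if (stack.length > 0) {
                    // Attribute this sample to the topmost frame.
                    String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                    hits.merge(top, 1, Integer::sum);
                }
            }
            try {
                Thread.sleep(intervalMs);   // real profilers use ~10-100 ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

Note what's missing, per the limitations above: nothing here counts invocations, and a method that runs often but briefly between samples is under-reported.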
Profiling overhead: overall slowdown of application. Disproportional slowdown can affect application behavior. Can also prevent JVM optimizations like JIT, inlining, etc. Because of this, in some cases the use of profilers becomes impractical.
Minimizing profiling overhead: Reduce what gets instrumented with package name filters, ignoring simple methods (getters and setters). Problems - can filter out important data. Results are hard to analyze.
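A sketch of what such filtering might look like at instrumentation time. The class name and the getter/setter heuristic (matching on method-name prefixes) are illustrative assumptions, not any real profiler's filter API.

```java
// Sketch: decide per method whether to instrument, using package-name
// filters and a cheap getter/setter heuristic. Names are hypothetical.
public class InstrumentationFilter {
    private final String[] packagePrefixes;

    InstrumentationFilter(String... packagePrefixes) {
        this.packagePrefixes = packagePrefixes;
    }

    boolean shouldInstrument(String className, String methodName) {
        // Heuristic: getters/setters are usually too small to matter, and
        // instrumenting them costs more overhead than the info is worth.
        if (methodName.startsWith("get") || methodName.startsWith("set")
                || methodName.startsWith("is")) {
            return false;
        }
        for (String prefix : packagePrefixes) {
            if (className.startsWith(prefix)) {
                return true;    // inside a package of interest
            }
        }
        return false;           // outside every package filter
    }
}
```

The downside stated above is visible here: anything filtered out is simply invisible in the results, even if it's where the time actually goes.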
Conventional profiling summary: no method delivers all info at low cost. Bytecode instrumentation is most popular, but there is always a tradeoff between information gained and overhead.
Can profiling of large apps be improved? Yes. Dynamic bytecode instrumentation with some additional algorithms can reduce overhead dramatically.
What is Dynamic BCI? Main idea is that bytecodes get instrumented at runtime. Next time a method is called, the new (instrumented) version is executed. Same technique is used for fix-and-continue in debugging. Uses JVMTI's RedefineClasses(), exposed at the Java level as Instrumentation.redefineClasses(ClassDefinition...).
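A structural sketch of the Java-level entry point. Instrumentation and ClassDefinition are real java.lang.instrument APIs; the DynamicBci class is hypothetical, and in practice the JVM only calls premain when started with -javaagent, so this shows the shape rather than a standalone demo.

```java
// Sketch: dynamic BCI via java.lang.instrument. Requires launching with
// -javaagent so the JVM hands us an Instrumentation instance.
import java.lang.instrument.ClassDefinition;
import java.lang.instrument.Instrumentation;
import java.lang.instrument.UnmodifiableClassException;

public class DynamicBci {
    static Instrumentation inst;

    // Called by the JVM at startup when run with -javaagent:profiler.jar
    public static void premain(String args, Instrumentation instrumentation) {
        inst = instrumentation;
    }

    // Swap in newly instrumented bytecode; the next call of each affected
    // method executes the new version.
    static void redefine(Class<?> clazz, byte[] instrumentedBytecode)
            throws ClassNotFoundException, UnmodifiableClassException {
        inst.redefineClasses(new ClassDefinition(clazz, instrumentedBytecode));
    }
}
```

The same call path with the original (uninstrumented) bytecode is what lets instrumentation be removed entirely, which is the benefit listed next.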
Benefits of Dynamic BCI: instrumentation can change during application runtime. Instrumentation can be removed entirely, profiled app runs at full speed. Instrumentation can be injected adaptively. You can tune the overhead/info ratio while the app is running.
Selective CPU profiling - typically only interested in the performance of some specific application feature. Ideally, only the methods that implement this feature need to be instrumented. Before profiling, specify one or more root methods. When the profiled app's VM loads a class containing a root method, the profiler instruments the root method, determines the transitive closure of methods the root method can call, and instruments those too. At execution time, when an instrumented method runs, the instrumentation determines whether it is within a root method's execution. If not, it exits quickly with negligible overhead; if it is, it gathers the profiling info it needs.
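A sketch of that runtime check. The per-thread "inside a root method" flag is one plausible way to implement it; RootAwareProfiler and its counters are hypothetical names, not the actual NetBeans Profiler internals.

```java
// Sketch: selective profiling via a per-thread flag set by root methods.
// Instrumented non-root methods bail out immediately when the flag is off.
import java.util.concurrent.atomic.AtomicLong;

public class RootAwareProfiler {
    // Per-thread flag: are we currently inside a root method?
    static final ThreadLocal<Boolean> insideRoot = ThreadLocal.withInitial(() -> false);
    static final AtomicLong recordedEvents = new AtomicLong();

    static void rootMethodEntry() { insideRoot.set(true); }
    static void rootMethodExit()  { insideRoot.set(false); }

    static void methodEntry(int methodId) {
        if (!insideRoot.get()) {
            return;             // not under a root: quick exit, negligible overhead
        }
        recordedEvents.incrementAndGet();   // under a root: gather profiling info
    }

    // Example: a designated root method and a callee in its transitive closure.
    static void handleRequest() {
        rootMethodEntry();
        try {
            parse();            // instrumented callee
        } finally {
            rootMethodExit();
        }
    }

    static void parse() {
        methodEntry(7);
        // ... real work ...
    }
}
```

Calling parse() from anywhere else records nothing; calling it via handleRequest() records an event, which is exactly the selective behavior described above.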
Memory profiling and leak debugging. In many cases, especially for large apps, keeping track of objects is expensive. Idea: if we keep track of a subset of objects, the results are still valid with much lower overhead.
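A sketch of that subset idea: track only every Nth allocation and scale the counts back up. The class and field names are illustrative, not a real profiler API.

```java
// Sketch: sampled memory profiling. Instead of recording every allocation,
// record 1 out of every N; statistics over the subset stay representative.
public class SampledAllocTracker {
    final int samplingFactor;      // track 1 out of every N allocations
    long seen = 0;
    long tracked = 0;

    SampledAllocTracker(int samplingFactor) {
        this.samplingFactor = samplingFactor;
    }

    // Conceptually injected after each allocation of an instrumented class.
    void objectAllocated(Object obj) {
        seen++;
        if (seen % samplingFactor == 0) {
            tracked++;             // a real profiler would record stack trace, size, etc.
        }
    }

    // Estimate of the true allocation count from the tracked subset.
    long estimatedAllocations() {
        return tracked * samplingFactor;
    }
}
```

The overhead now scales with tracked objects rather than total allocations, which is what makes this affordable on large apps.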
Dynamic BCI and memory overhead. Typically a limited number of object types are of interest when profiling memory use, but we don't always know which ones they are.
The hardest to debug are slow leaks of small amounts of memory. Problems - figuring out which objects are leaking is hard. The heaps are large and leaking objects are not obvious. Comparing heap snapshots doesn't help. Can take weeks in a production environment to reach a situation that causes trouble. Objects tend to be either very short lifespan or long-lived, there isn't much in between.
Demos using NetBeans 4.1.