It is hard to predict how users will engage with your application before it hits production. You can invest much or little in your attempts to predict demand, but either way your best educated guess cannot dictate the future. If you manage to avoid performance problems on first release, monitoring is paramount to ensuring that things stay under control as users find new ways to torture your application. If you do have problems on initial deployment, monitoring is paramount to finding and addressing them quickly. Either way, production monitoring is required.
There are many powerful tools available to monitor any modern Java application. Beet is designed to augment, not replace, the information that these tools provide. What matters most is that you actually measure your system to understand where the problems lie. With measurements in hand, you can verify whether a candidate fix improves the readings you get on those metrics. Measurement also enables you to devise tests that load the system in a way that resembles actual production load patterns, rather than relying on expectations and guesses.
System resource monitoring tools are widely available and require no special accommodation in your application. For example, most modern Windows systems include Performance Monitor, and most Unix systems include some variant of sar. In either case, you should be able to identify bottlenecks at the resource level (disk, memory, CPU, network). If your system is simply being asked to do more than its resources allow, and you can address the bottleneck with better hardware, that's excellent.
If you are having application responsiveness problems and your resources are underutilized, you can't hope to get much more out of your resource monitors and must dig deeper. If your system is overutilized with two users logging in and nothing more, you must dig deeper. In short, you should anticipate a need for deeper visibility into your application than what resource monitoring tools alone can provide.
Modern Java application servers provide a wealth of information to JMX clients (jconsole, MC4J). If your application is overpowering the CPU, you can quickly identify which threads are eating your processor and what code they are executing while they are doing it. You can view heap memory use statistics, identify deadlocked threads, and usually (depending on application server) exercise some control over user sessions.
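Much of the data that JMX clients display is also available programmatically through the platform MXBeans, without any application server at all. The sketch below (class name and output format are illustrative, not part of any Beet API) shows how deadlock detection and per-thread state can be read from the standard `java.lang.management` interfaces:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadSnapshot {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        // Check for deadlocks; findDeadlockedThreads() returns null when none exist.
        long[] deadlocked = threads.findDeadlockedThreads();
        System.out.println("Deadlocked threads: "
                + (deadlocked == null ? 0 : deadlocked.length));

        // Dump basic state for every live thread (no locked monitors/synchronizers).
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            System.out.println(info.getThreadId() + " "
                    + info.getThreadName() + " " + info.getThreadState());
        }
    }
}
```

The same beans are what jconsole reads remotely; exposing them over the network is where the JMX configuration (and the security concerns discussed below) come in.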
A significant limitation of JMX monitoring tools is that it is difficult to perform post-mortem analysis after your system has become unresponsive. If your application is leaking memory or pegging your processor, users of JMX clients are frequently as out of luck as users of your application. Some IT departments will also consider JMX access a security risk (not entirely without merit) and refuse to activate it.
All major web servers will provide some kind of configurable access logging, usually in the style of the Apache web server. An access log may include which URL was requested, when the request was made, which user made the request, a session ID associated with the request, how long the server spent processing the request, how large the response was, and whether the processing resulted in any errors. The access log is the primary data source for many user analytics packages. If you are writing a web application, access logging is a powerful tool for understanding how users are accessing your application and how their behavior changes over time.
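The fields described above are positional in Apache-style formats, so extracting them is a matter of pattern matching. As a minimal sketch (the exact format depends on your server's configuration; this assumes Apache's Common Log Format, and the sample line is fabricated):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccessLogParser {
    // Apache Common Log Format: host ident authuser [timestamp] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    public static void main(String[] args) {
        String line = "192.168.0.10 - alice [10/Oct/2010:13:55:36 -0700] "
                + "\"GET /app/login HTTP/1.1\" 200 2326";
        Matcher m = CLF.matcher(line);
        if (m.matches()) {
            System.out.println("user=" + m.group(3));     // authenticated user, if any
            System.out.println("request=" + m.group(5));  // method, URL, protocol
            System.out.println("status=" + m.group(6));   // HTTP status code
        }
    }
}
```

Combined and custom formats add fields (referrer, user agent, processing time), but the parsing approach is the same.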
Many applications implement custom authentication without integrating it into the container's standard authentication mechanisms. Whether or not this is a good idea is a separate issue, but at the very least these applications lose the association between a specific user account and the entries in the access logs. You can then only understand how your users behave as an entire community, not as individuals or classes of users. You can make inferences from session IDs and client IP addresses, but you generally won't know which roles are attached to which requests.
Even in cases where you know which user caused a request, access logs do not directly tie requests to resource utilization and system behavior. Why did the request take so long to process? Why is the response so large? What parameters were provided with the request? You may have separate logging of database activity or even of method calls, but you generally must correlate events using timestamps. In a high-volume application where many events happen within the same second, this is effectively impossible to do with any reasonable accuracy.
The use of toolkit logging (Java Logging API, Apache Commons Logging, log4j) is widely practiced in the global Java development community. Logging statements allow you to select which information will be most useful for your own troubleshooting. Logging toolkits are highly configurable and allow you to dynamically increase the volume of information in areas where trouble is suspected.
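The Java Logging API from the JDK illustrates the pattern shared by all of these toolkits; the class and logger name below are hypothetical, chosen only for the example:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class OrderService {
    // By convention the logger is named after the class that owns it.
    private static final Logger LOG = Logger.getLogger(OrderService.class.getName());

    void placeOrder(String id) {
        LOG.fine("placing order " + id);         // detail message, off by default
        LOG.info("order " + id + " accepted");   // normal operational message
    }

    public static void main(String[] args) {
        OrderService svc = new OrderService();
        svc.placeOrder("1001");

        // Turn up the volume where trouble is suspected. In practice this is
        // done through a logging.properties file or a management console, not
        // in code; note that handlers must also be configured to pass FINE
        // records before they actually appear in the output.
        LOG.setLevel(Level.FINE);
        svc.placeOrder("1002");
    }
}
```

log4j and Commons Logging differ in configuration syntax and level names, but the idiom (a per-class logger, leveled statements, externally tunable thresholds) is the same.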
One major limitation of logging toolkits is that high-volume logging is expensive -- it fills up your disk and takes processor time away from your application. Learning when and what to log is an art that most programmers acquire naturally and will not be discussed here. But even the most skillful programmer will frequently experience the logging hole: information seems to pour out of every part of the system except the one that is behaving badly. After all, if you had known it was going to be a problem, you wouldn't have written it that way, right? The proactive, compile-time nature of logging is its essential weakness. Zealously adding a logging statement at every conditional branch, at every method entry and exit, and at every object instantiation and finalization is time-consuming and error-prone; it erodes application performance, decreases code legibility, and generates log noise that distracts from the truly valuable information you are logging. A unique frustration of troubleshooting a flagging system is discovering that a critical stack trace has been rotated out of existence by a debug message logged every 3 seconds. Of course you can then tune your configuration appropriately, but at that point you must wait for the error to happen again. And you can't add new logging statements without a recompile and redeploy, which in tightly controlled environments may be difficult or entirely forbidden.
Assume that you have precisely predicted your logging needs and have a calibrated configuration that tells you exactly what your system is doing. Even then, as with access logging, you must often do extra work to infer relationships between the toolkit logging and other streams of information (database logs, access logs, performance monitors).
The Java virtual machine now hosts a rich toolset for collecting performance data about an application -- method calls, class loading and unloading, garbage collector behavior. Profiling tools like Eclipse TPTP allow you to connect to a virtual machine and mine this data in exhaustive detail.
This is excellent in a test environment; but at the time of this writing (and for a long time before), a profiler is not a tool that can easily be used in production. The impact on system responsiveness is massive, and interference from the monitoring itself can obscure the source of the problem. Profilers are generally a "last mile" tool: you've identified the problem scenario (which users, what they are doing, when), you can reproduce the problem in a controlled environment, and you want to know what the problem is at the level of bits and booleans. Even less invasive tools like verbose GC and allocation logging must be used with care (usually reactively), and again require tedious correlation analysis with other data sources to arrive at meaningful results.
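A fraction of what a profiler collects can be sampled cheaply in production through the standard memory and garbage collector MXBeans. A minimal sketch (class name and output format are illustrative):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class GcSample {
    public static void main(String[] args) {
        // Cumulative collection counts and times since JVM start,
        // one bean per collector (e.g. young and old generation).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms total");
        }

        // Current heap occupancy.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("heap used: " + heap.getUsed() + " / " + heap.getMax());
    }
}
```

Polling these beans periodically costs almost nothing, which makes them a reasonable always-on complement to the reactive, heavyweight tools described above.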