
My first effort is a little Python package that provides a decorator/context-manager class that makes capturing and logging process execution times easier and more consistent. I picked this one to start with for a variety of reasons:
- The code involved was relatively minimal, so I could write it in a couple of afternoons, including testing, giving me time to focus on learning the newest standard processes involved in creating Python packages.
- That also allowed me to work through the CI/CD processes I wanted to establish for any future Python packaging projects, specifically Bitbucket Pipelines.
- It integrates nicely, I think, with standard AWS CloudWatch log ➔ metrics-filters ➔ dashboards/alarms processes and resources.
- It helps keep logs that are used to capture metrics consistent and more efficient (though in fairness, that efficiency gain is, perhaps, trivial).
The back-story behind this package is fairly straightforward, though I've abstracted the circumstances and details for various reasons. Assume that you have a system that runs in AWS. It could be an API Gateway backed by one to many Lambda Functions, a Fargate task, or anything, really, with inner functions or methods whose successes, errors, and elapsed execution times you need visibility into for every request. For the sake of illustration, let's assume a Fargate task that:
- Accepts some identifiers for data that needs to be retrieved from APIs that you have no control over. For purposes of illustration, I'll call them the Person, Place and Thing APIs.
- Retrieves the actual data for the identified items from their relevant APIs.
- Does some manipulation of the data to assemble it into a payload for a call to another API.
- Makes that API call to analyze the collected data.
- Writes that analysis to a data store.
Putting all of that together looks something like this:
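In skeleton form, with placeholder sleeps standing in for the real API calls and the data-store write, and using the function names that appear in the log output later in this post, the structure is along these lines:

import asyncio


async def get_person_data(person_id):
    # Stand-in for the call to the external Person API
    await asyncio.sleep(1)
    return {'person_data': (person_id, {})}


async def get_place_data(place_id):
    # Stand-in for the call to the external Place API
    await asyncio.sleep(1)
    return {'place_data': (place_id, {})}


async def get_thing_data(thing_id):
    # Stand-in for the call to the external Thing API
    await asyncio.sleep(1)
    return {'thing_data': (thing_id, {})}


async def analyze_data(payload):
    # Stand-in for the call to the analysis API
    await asyncio.sleep(1)
    return payload


async def write_data(payload):
    # Stand-in for writing the analysis to the data store
    await asyncio.sleep(1)
    return True


async def async_main(person_id, place_id, thing_id):
    # Retrieve the Person, Place and Thing data concurrently
    results = await asyncio.gather(
        get_person_data(person_id),
        get_place_data(place_id),
        get_thing_data(thing_id),
    )
    # Assemble the retrieved data into a single payload,
    # analyze it, and persist the analysis
    payload = {key: value for item in results for key, value in item.items()}
    analysis = await analyze_data(payload)
    await write_data(analysis)


def main(person_id, place_id, thing_id):
    asyncio.run(async_main(person_id, place_id, thing_id))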
There are, I think, two main categories of logging-related messages to be concerned with. The first is messages for troubleshooting purposes: information that would help an engineer track down which parts of a process are succeeding and which are failing. The approach that I have advocated for over the last several years is a fairly simple rule-set (a minimal sketch applying it follows the list):
- Entry-point functions or methods should info-log their initial call: my_entry_point called, for example.
- Entry-point functions or methods should debug-log their inputs. In Python, this can be as simple as logging f'{vars()}' at the top of the code, but other, more sophisticated approaches are possible (or even desirable) too.
- All functions or methods should log their exit state. That takes two basic forms:
  - For successful completion without errors, info-logging that is sufficient: my_child_function completed successfully, for example. In addition, debug-logging what will be returned is advised.
  - Any errors raised should be captured and logged, with the inputs that led to the error. Logging the stack-trace at a debug level is also advised.
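Applied to one of the child functions from the hypothetical task above, a minimal synchronous sketch of that rule-set, with the actual retrieval reduced to a placeholder, looks like this:

import logging
import traceback


def get_person_data(person_id):
    # Entry-point rules: info-log the call, debug-log the inputs
    logging.info('get_person_data called')
    inputs = vars()
    logging.debug(f'get_person_data inputs: {inputs}')
    try:
        # Stand-in for the actual Person API retrieval
        result = {'person_data': (person_id, {})}
    except Exception as error:
        # Exit-state rules for errors: log the error, the inputs that
        # led to it, and the stack-trace at debug level, then re-raise
        logging.error(f'{error.__class__.__name__} in get_person_data: {error}')
        logging.debug(f'get_person_data inputs: {inputs}')
        logging.debug(traceback.format_exc())
        raise
    else:
        # Exit-state rules for success: info-log completion,
        # debug-log the return value
        logging.info('get_person_data completed successfully')
        logging.debug(f'{result}')
        return result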
When everything is running according to plan, that leads to one log-message per function or method called, plus one at the start of the entry-point, each of which is intentionally small to reduce the amount of logged data being paid for. For example (these examples are all asyncio-based code, unless noted otherwise):
[INFO] 12:05:57,008: main called
[INFO] 12:06:01,857: get_place_data completed without error
[INFO] 12:06:02,241: get_person_data completed without error
[INFO] 12:06:02,811: get_thing_data completed without error
[INFO] 12:06:05,387: analyze_data completed without error
[INFO] 12:06:05,748: write_data completed without error
[INFO] 12:06:05,748: async_main completed
[INFO] 12:06:05,750: main completed
This captures the entire process wrapped by the main function, and shows that all of the relevant sub-processes ran to completion without error. This is, if not the optimum level of logging information for troubleshooting purposes, at least very close to it. In situations where there are no explicit errors being raised, but the results of a process are suspect or just wrong, this level of logging can still show some things over and above the “there were no errors” scenario:
- Whether the process as a whole was executing as expected.
- Which child processes executed (if there are any), and which did not.
It also affords the ability to simply turn debug-level logging on to see more details:
[INFO] 14:08:08,408: main called
[DEBUG] 14:08:08,408: main: {}
[DEBUG] 14:08:08,408: async_main called: {}
[DEBUG] 14:08:11,206: get_person_data called: {}
[DEBUG] 14:08:11,206: get_place_data called: {}
[DEBUG] 14:08:11,206: get_thing_data called: {}
[INFO] 14:08:13,873: get_thing_data completed without error
[DEBUG] 14:08:13,874: {'thing_data': ('the Revolver', {})}
[INFO] 14:08:13,977: get_place_data completed without error
[DEBUG] 14:08:13,978: {'place_data': ('The Library', {})}
[INFO] 14:08:14,001: get_person_data completed without error
[DEBUG] 14:08:14,001: {'person_data': ('Professor Plum', {})}
[DEBUG] 14:08:14,001: analyze_data called: { 'payload': { 'person_data': {'Professor Plum': {}}, 'place_data': {'The Library': {}}, 'thing_data': {'the Revolver': {}}} }
[INFO] 14:08:16,537: analyze_data completed without error
[DEBUG] 14:08:16,537: { 'person_data': { 'Professor Plum': {}}, 'place_data': {'The Library': {}}, 'thing_data': {'the Revolver': {}} }
[DEBUG] 14:08:16,537: write_data called: { 'payload': { 'person_data': {'Professor Plum': {}}, 'place_data': {'The Library': {}}, 'thing_data': {'the Revolver': {}}} }
[INFO] 14:08:17,145: write_data completed without error
[DEBUG] 14:08:17,145: True
[INFO] 14:08:17,145: async_main completed
[INFO] 14:08:17,146: main completed
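One simple way to provide that switch, assuming the level is supplied through an environment variable (the LOG_LEVEL name here is just an illustration), is to resolve it wherever logging is configured:

import logging
import os

# Default to INFO; setting LOG_LEVEL=DEBUG in the task's environment
# switches the debug-level messages on without a code change
LOG_LEVEL = os.getenv('LOG_LEVEL', 'INFO').upper()

logging.basicConfig(
    format='[%(levelname)s] %(asctime)s: %(message)s',
    level=getattr(logging, LOG_LEVEL, logging.INFO),
)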
In situations where errors arise, this approach will usually log the source error, any “intermediate” errors that happen as a result, and a “final” error, making the process of tracking down the source of the error, and the inputs that led to it, relatively simple:
[INFO] 14:02:13,003: main called
[ERROR] 14:02:15,672: RuntimeError in get_thing_data: with {}: Thing API failure
[INFO] 14:02:18,108: get_person_data completed without error
[INFO] 14:02:18,294: get_place_data completed without error
[ERROR] 14:02:18,295: RuntimeError in async_main: 1 child processes failed: [RuntimeError('Thing API failure')]
[INFO] 14:02:18,296: main completed
This still provides a reasonable view of a process' execution: usually enough for an engineer to be able to replicate the error locally, since the inputs that caused the error are captured. If more detail is needed, though, the same switch to debug-level logging will provide it.
The second main category of logging messages is concerned with providing data-points that can be picked up by metrics filters and used to provide observability into the performance of a system or its processes. In an AWS CloudWatch context, these data-points can be collected with a Metric Filter, and the collected metrics can drive displays in CloudWatch Dashboards and feed Alarms that notify someone when performance falls outside an expected range or crosses some value threshold. The metric filter setup uses one of several filter syntaxes to identify names in the associated log group's messages, and defines how the value for each matched name is determined.
If storage of logs is an ongoing cost concern, note that the metrics extracted from logs persist after the logs they were extracted from have been deleted. That allows the log retention policy to be set for a much shorter timeframe — just long enough that the passage of a long weekend won’t completely eliminate them, for example — without affecting the metrics extracted in any meaningful fashion.
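For a CloudFormation-managed log group, that retention policy is a single property; a minimal sketch, assuming a log group like the DataWriterLogGroup referenced by the metric filters later in this post (the log-group name itself is illustrative), would be:

# ...
Resources:
  # ...
  DataWriterLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      # The log-group name here is illustrative
      LogGroupName: !Sub '/${ApplicationName}/data-writer'
      # Keep the raw log data for a week; metrics already extracted
      # from it are retained by CloudWatch independently
      RetentionInDays: 7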
The metrics themselves will be captured in one-minute buckets by default, and over time it appears that those buckets are merged into hourly data-sets. I've observed this happening, but I can't confirm whether the higher-resolution data really becomes unavailable, or determine at what age that happens. I'm reasonably certain, from memory, that it's over a month, and it seems likely that it's closer to three months.
One of the more efficient logging output formats, from a metric filter standpoint, is to log data as a JSON object, with fields and values that filter definitions can leverage. By way of example, capturing the execution times of the functions in the hypothetical Fargate task above might issue a log-message with the following values (pretty-printed for ease of reading; there’s no need for it to be pretty in the log message itself):
{
"async_main": 2.4285507202148438,
"get_place_data": 2.770256280899048,
"get_person_data": 2.859272003173828,
"get_thing_data": 2.9412269592285156,
"async_results": 2.9431209564208984,
"analyze_data": 2.139224052429199,
"write_data": 0.7172739505767822,
"main": 8.23118805885315
}
Functionally, there's no reason these metric values would have to be logged in a single line. Logging each value independently is only (trivially) less efficient, but in my experience, troubleshooting efforts that need access to the raw metrics data are easier to deal with if they are all in one place.
With all of those factors in mind, I wrote a small Python package to capture elapsed time values with associated names, and write the entire collection of them out to a single log-line: the goblinfish-metrics-trackers package. It also provides a mechanism for adding arbitrary metrics to that output, allowing a single log-message to capture both “latencies” (elapsed times for any processes of interest) and other points of interest (like error-counts). To reduce the amount of data stored in the logs and allow for better consistency in the metrics filters later on, the elapsed times are output as milliseconds, with no more than three decimal places of precision (e.g. 1234.567). An example of the final output for the Fargate structure above, pretty-printed, is:
{
"latencies":{
"async_main":2869.3,
"get_place_data":2324.621,
"get_thing_data":2427.277,
"get_person_data":2871.321,
"async_results":2872.099,
"analyze_data":2199.275,
"main":7942.849
},
"metrics":{
"write_data_errors":1
}
}
The corresponding general logging output for the test-run that generated this log message was:
[INFO] 09:38:18,861: main called
[INFO] 09:38:23,246: get_thing_data completed without error
[INFO] 09:38:23,545: get_place_data completed without error
[INFO] 09:38:23,859: get_person_data completed without error
[INFO] 09:38:26,426: analyze_data completed without error
[ERROR] 09:38:26,426: RuntimeError in write_data: Write data failure
[ERROR] 09:38:26,426: RuntimeError in async_main: Write data failure
[INFO] 09:38:26,427: main completed
The function that raised the error (an asynchronous function in this case), incorporating the logging policy described earlier, using a WRITE_FAIL_RATE value of 0.99 to force a fake error, and capturing that error metric, was:
# Module "Constants" and Other Attributes
tracker = ProcessTracker()
# ...
async def write_data(payload):
inputs = vars()
logging.debug(f'write_data called: {inputs}')
try:
if random.random() < WRITE_FAIL_RATE:
raise RuntimeError('Write data failure')
sleep_for = random.randrange(250, 1_000) / 1000
with tracker.timer('write_data'):
await asyncio.sleep(sleep_for)
except Exception as error:
logging.error(
f'{error.__class__.__name__} in write_data: {error}'
)
logging.debug(f'write_data inputs: {inputs}')
tracker.set_metric('write_data_errors', 1)
raise
else:
result = True
logging.info('write_data completed without error')
logging.debug(f'{result}')
return result
The synchronous equivalent of this function can use the @tracker.track decorator on the function. When the decorator is applied to an async function, it still runs without error, but it only captures the elapsed time needed for an asynchronous promise or future to be created, which is not terrifically useful in general. Apart from that, it is nearly identical:
# Module "Constants" and Other Attributes
tracker = ProcessTracker()
# ...
@tracker.track
def write_data(payload):
inputs = vars()
logging.debug(f'write_data called: {inputs}')
try:
if random.random() < WRITE_FAIL_RATE:
raise RuntimeError('Write data failure')
sleep_for = random.randrange(250, 1_000) / 1000
time.sleep(sleep_for)
except Exception as error:
logging.error(
f'{error.__class__.__name__} in write_data: {error}'
)
logging.debug(f'write_data inputs: {inputs}')
        # Here is the error-count metric
tracker.set_metric('write_data_errors', 1)
raise
else:
result = True
logging.info('write_data completed without error')
logging.debug(f'{result}')
return result
The general logging output also looks pretty much the same as the output for the asynchronous function version, though as I’m writing this, it also generated a stack trace.
The CloudFormation resource definitions for the related metric filters don't need to be terribly complicated: there are only seven properties that need to be provided for each, with one more (FilterName) that is more of a convenience item. The CloudFormation YAML required to capture the latency and error-count metrics for the data-writer function ends up looking like this:
# ...
Resources:
  # ...
  DataWriterLatencyMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      FilterName: data-writer-latency-metric-filter
      # This is what triggers picking up the value
      FilterPattern: '{ $.latencies.write_data > 0 }'
      # This is the source of the log-messages to be evaluated
      LogGroupName: !Ref DataWriterLogGroup
      MetricTransformations:
        - MetricName: data-writer-latency
          # This is a grouping name, so that all the metrics for
          # the application will be kept in the same namespace
          MetricNamespace: !Ref ApplicationName
          # This is the value that will be picked up
          MetricValue: '$.latencies.write_data'
          # This is the unit-type for the metric
          Unit: Milliseconds
  # ...
  DataWriterErrorCountMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      FilterName: data-writer-error-count-metric-filter
      FilterPattern: '{ $.metrics.write_data_errors > 0 }'
      LogGroupName: !Ref DataWriterLogGroup
      MetricTransformations:
        - MetricName: data-writer-error-count
          MetricNamespace: !Ref ApplicationName
          MetricValue: '$.metrics.write_data_errors'
          Unit: Count
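Those metrics can then feed the dashboards and alarms mentioned earlier. As a sketch, an alarm that fires when the data-writer latency averages above one second over a five-minute period (the alarm name and threshold here are illustrative) could be defined alongside the filters:

# ...
  DataWriterLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: data-writer-latency-alarm
      # The namespace and metric name match the latency metric filter above
      Namespace: !Ref ApplicationName
      MetricName: data-writer-latency
      Statistic: Average
      Unit: Milliseconds
      # Evaluate the average over a single five-minute period
      Period: 300
      EvaluationPeriods: 1
      # Alarm if the average write latency exceeds 1,000 milliseconds
      Threshold: 1000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching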
Full examples are available in the package repository, both modeled to simulate the Fargate example above:
- An asynchronous version
- A synchronous version