The first article introduced an overview of the goals and architecture of log processing; the next two articles will cover inputs and outputs – how data (both logs and metrics) can be forwarded into monitoring and how the data can be viewed after processing.
There are two ways of forwarding data into the monitoring platform: automatic and manual. The first one – automatic – is currently used in testing environments, where both logs and metrics are continuously collected and forwarded for processing. On the other hand, when YSoft SafeQ is deployed at a customer's site, such an approach is seldom possible because of security concerns and the additional performance requirements for the monitoring server. Instead, only specific log files containing the problem are transferred from the customer, and these have to be uploaded manually.

Automatic log forwarding

The simplest way to forward logs would be to configure the logging framework to send logs directly over the network; however, such a solution does not cope with network outages, which can be part of tests. Some logging frameworks can be configured with a failover logging destination (if the network does not work, logs are written into files), but these files would need another mechanism to upload them automatically.
Instead, logs are sent to a local port of a log forwarder, which has to be installed. We currently use Logstash, which (since version 5.0) has a persistent queue. If the network works properly, logs are sent before they are flushed to disk; however, if there is a network outage, logs are written to disk and there is no danger of overflowing RAM.
Logstash has two other roles. The first one is to unify log formats, since logs generated by different logging frameworks have different formats. That could be done on the monitoring servers, but doing it at the source makes the processing simpler. The other role is enriching logs with additional information, such as the hostname and the name of the deployment group.
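To illustrate, a minimal Logstash pipeline for this setup could look roughly like the sketch below. The port, field values and the output destination are purely illustrative, not our actual configuration, and the persistent queue itself is enabled separately in logstash.yml (queue.type: persisted).

input {
  tcp {
    port  => 5044          # local port the logging framework sends JSON logs to
    codec => json_lines
  }
}

filter {
  mutate {
    # enrichment added by the forwarder
    add_field => {
      "hostname"         => "safeq-server-01"
      "deployment_group" => "default"
    }
  }
}

output {
  # forward to the queue on the monitoring server (destination is illustrative)
  kafka {
    bootstrap_servers => "monitoring.example.com:9092"
    topic_id          => "logs"
  }
}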
Telegraf is deployed next to Logstash to collect various host metrics, which are again forwarded to the monitoring servers. Note that Telegraf does not support a persistent queue, so it sends metrics into Logstash, which provides the necessary buffering.
Logstash and Telegraf are installed by Calf, our internal tool. Calf can easily be configured and installed as a service, and it is responsible for installing, configuring and running both Logstash and Telegraf. That makes the usage of both tools much easier.

Log and metrics collection schema

Manual log uploading

The main goal of manual log uploading is clear: forward logs to the monitoring servers in the same format as the previous method. That requires parsing the logs and adding additional information.
The logs for automatic processing are generated directly in JSON format; the logs in files, on the other hand, are written as plain lines. These lines have to be parsed, and GROK patterns (basically named regexes) are used for this purpose. More can be found here; there is also a simple way to construct GROK patterns.

log:
2017-05-19 10:03:19,368 DEBUG pool-9-thread-14| RemotePeerServer| [RemotePeer{name='1dc5e474-1abc-43fc-85c9-7e5e786919ef', state='ONLINE', session='ZeroMQSessio

grok pattern:
^%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} *(?<thread>[\w-]*)\| *%{WORD:loggerName}\| *%{GREEDYDATA:message}

result:
{
  "timestamp": "2017-05-19 10:03:19,368",
  "level": "DEBUG",
  "thread": "pool-9-thread-14",
  "loggerName": "RemotePeerServer",
  "message": "[RemotePeer{name='1dc5e474-1abc-43fc-85c9-7e5e786919ef', state='ONLINE', session='ZeroMQSessio"
}

However, when manually uploading files, it is necessary to provide additional information about the log file, specifically the hostname, the name of the deployment group and the component name, since each component of YSoft SafeQ has a different log format. Logstash is again used for log uploading, but it is wrapped in a Python script for better usability.

cat spoc*.log | python import.py -c spoc -ip localhost -g default

This article is the first one of a planned series focused on log processing and, therefore, YSoft SafeQ monitoring. It explains the goals of log monitoring and the intended use cases, as well as the requirements for the designed architecture. A brief description of the high-level view of the designed architecture follows; the architecture itself will be properly described in one of the following articles.

Goals of YSoft SafeQ log processing

Logs contain information about the behaviour of a SafeQ deployment, but each log carries only limited local information (such as a single exception). This can directly lead to an understanding of what has happened, but sometimes more logs, even from different components, are needed.

The main goal of log processing is to collect all logs in a central location, unify them into a single format to simplify their structure, compute additional information, such as the duration of a print process, and finally index this information into a database to allow searching and visualization in graphs.

Such graphs are a much faster way to obtain information about YSoft SafeQ behaviour than going through individual log files; for example, spikes in printing time can easily be identified and possibly correlated with additional WARN or ERROR logs. That allows us to understand YSoft SafeQ faster and better.

All of that is usually used for fixing bugs, but it can serve many other purposes, such as helping to improve performance or monitoring user-related activities. On the other hand, there are several requirements which need to be satisfied by log processing.

Requirements of log processing architecture

First of all, log processing shouldn't significantly increase the performance requirements of YSoft SafeQ servers; therefore, logs should be forwarded over the network to a different server. That requires some network bandwidth, but on the other hand, logs don't have to be saved on the local hard disk. Also, some network bandwidth can be saved by compression, in exchange for some CPU time.

Even with logs being sent over the network, the reliability should stay the same as with writing logs directly into files. That is ensured by a number of mechanisms:

  • when a connection is temporarily unavailable, generated logs should be buffered on the hard disk and automatically resent once the connection is re-established
  • a reliable protocol (such as TCP) is used
  • the target server can work properly even under heavy load; note that multiple YSoft SafeQ servers can forward logs to one destination and there can also be big spikes when many logs are generated at once

Another requirement for log processing is low end-to-end latency. While it is no problem for reporting purposes to collect and aggregate data once a week, for debugging or bug fixing the logs should be processed within seconds or minutes.

Lastly, the processing of logs should be scalable, so it can easily be scaled up for bigger YSoft SafeQ deployments.

(Note: these are not the only requirements, but the most important ones. More detailed info can be found in my bachelor thesis)

Architecture overview

The main idea is that generated logs are processed as a stream. As soon as a log is generated, it is forwarded from the YSoft SafeQ server to a queue on a monitoring server. The queue serves as a buffer when there is a spike in the number of incoming logs or when the rest of the monitoring is being reconfigured. Kafka is used as a persistent and fast-performing queue.

Various tools and software can be used "above" the queue to unify logs, aggregate various information, compute the duration of processes (such as the duration of a print job) and correlate logs and metrics. Still, logs are processed as a stream, therefore they can't be replayed or queried as in conventional databases, which makes aggregation more complex. For example, computing the elapsed time between two logs needs to consider out-of-order or missing logs; this will be explained in detail in one of the following articles. On the other hand, such an approach makes the overall latency much lower and it is more efficient (despite more complex algorithms), since logs are processed just once.

Finally, logs, metrics and computed data are stored and possibly visualized. Some data can be stored only in files (such as logs), but metrics computed from them (such as the duration of processes) can be indexed into a database to allow visualization.

Moreover, this architecture can easily be extended with another source of data, another tool for log processing or a different storage method, online and without affecting the rest of the processing pipelines.

The next article will cover data visualizations, with real examples from our testing environments.

 

Chef is an automation platform designed to help with the deployment and provisioning process during software development and in production. Chef can, in cooperation with other deployment tools, transform the whole product environment into infrastructure as code.

DSL

Chef provides a custom DSL that lets its users define the whole environment as a set of resources, together forming recipes, which can be further grouped into cookbooks. The DSL is based on Ruby, which adds a level of flexibility by offering Ruby's language constructs to aid development. A basic example of a resource is a file with specified content:
file 'C:\app\app.config' do
    content "server_port = #{port}"
end
Upon execution, Chef will make sure the defined file exists and has the correct content. If a file with the same content already exists, Chef will finish without updating the resource, letting developers know the environment was already in the desired state before the Chef run.

Provisioning

The resources have built-in validations ensuring that only the changes in configuration are applied to an existing environment. This lets users execute recipes repeatedly with only minor adjustments, and Chef will make the necessary changes in your environment, leaving the correctly defined resources untouched.
This is especially handy in a scenario when an environment is already deployed and developers keep updating the recipes with new resources and managing configurations of deployed components. Here, with correctly defined validations, the recipes will be executed on target machines repeatedly, always updating the environment without modifying the parts of the environment which are already up to date.
This behavior can be illustrated by the following example:
my_tool = maven 'tool.exe' do
    artifact_id     'tool'
    group_id        'com.ysoft'
    version         '1.0.0'
    dest            'C:\utils'
    packaging       'exe'
end

execute 'run tool.exe' do
    command "#{my_tool.dest}\\#{my_tool.name} > #{my_tool.dest}\\tool.output"
    not_if { ::File.exist?("#{my_tool.dest}\\tool.output") }
end
In this example, the goal is to download an exe file and run it exactly once (only the first run of this recipe should update the environment). The maven resource internally validates whether the given artifact has already been downloaded (the file C:\utils\tool.exe would already exist).
The problem is with the execute resource, as it has no way of checking whether it has been run before, thus potentially executing multiple times. Users can, however, define restrictions themselves, in this case with the not_if attribute. It will prevent the resource from executing again, as it checks for the existence of the tool output from previous runs.

Architecture

To enable environment provisioning, Chef operates in a client-server architecture with a pull-based model.
Chef server represents the storage of everything necessary for deployment and provisioning. It stores cookbooks, templates, data bags, policies and metadata describing each registered node.
Chef client is installed on every machine managed by Chef server. It is responsible for contacting Chef server and checking whether there are new configurations to be applied (hence the pull-based model).
ChefDK workstation is the machine from which the whole Chef infrastructure is operated. Here, the cookbooks are developed and Chef server is managed.
In this example, we can differentiate between the Chef infrastructure (blue) and the managed environment (green). The process of deployment and provisioning is as follows:
  1. A developer creates/modifies a cookbook and uploads it to the Chef server.
  2. Chef client requests the server for changes in the recipes.
  3. If there are changes to be made, Chef server notifies the client.
  4. The client initiates a Chef run with the new recipes.
Note here that in a typical Chef environment, Chef client is set to request the server for changes periodically, to automate the process of configuration propagation.

Serverless deployment

When only the deployment of the environment is necessary (e.g. for a simple installation of a product where no provisioning is required), in an offline deployment, or while testing, much of the operational overhead of Chef can be avoided by leaving out the server completely.
Chef client (with additional tools from ChefDK) can operate in a local mode. In such a case, everything necessary for the deployment, including the recipes, is stored on the Chef client, which will act as a dummy server for the duration of the Chef run.
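As a rough sketch (the run-list and cookbook names are made up for illustration), such a local-mode run can be started from a directory containing the cookbooks:

chef-client --local-mode --runlist 'recipe[safeq_deploy::default]'

Chef client then converges the machine against the local cookbooks in the same way it would against recipes pulled from a real Chef server.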
 
Here, you can see the architecture of a serverless deployment. The process is as follows:
  1. Chef client deploys a dummy server and points it to cookbooks stored on the same machine.
  2. Chef client from now on acts as the client in the example above and requests the server for changes in the recipes.
  3. Chef Server notifies the client of the changes and a new Chef run is initiated.

Conclusion

Chef is a promising tool that has the potential to help us improve not only the products we offer, but also to make the process of development and testing easier.
In combination with infrastructure deployment tools (like Terraform) that we are currently researching, automation of product deployment and provisioning can allow our developers to focus on important tasks instead of dealing with the deployment of testing environments or manually updating configuration files across multiple machines.

This is an introduction to the UAM project – a management server for Universal appliances (the new TPv4, eDee and SafeQube) – and the Vert.x framework on which it is developed.

There was a need for easy installation and upgrade of the OS and applications running on a UA, so we needed a system to manage UAs with as little user interaction as possible.
The main requirements are robustness (we will deal with unreliable networks and possible power outages), small executables (there is limited storage on the SafeQube) and little disk access (we don't want to wear out our SafeQube flash disks).

The system has 2 parts:

  • UA management server – server application which will be deployed next to CML and will contain APIs for system administration
  • UA management server proxy – application cluster which will be deployed on SafeQubes and will cache application binaries and configuration

Both applications are developed using the Vert.x framework. The UAM server also uses a database as configuration storage, and the proxy applications use Hazelcast for clustering.

Vert.x is an application framework for programming servers (especially web servers). It is built on top of Netty so no application container like Tomcat or Jetty is necessary.

I will show several examples of working with Vert.x so you can try out writing your own applications.

Starting a web server is simple:

vertx.createHttpServer()
    .requestHandler(request -> {
        System.out.println("Received request from " + request.remoteAddress().host());
    })
    .listen(80);

Using this approach is fast, but you have to decide which business logic to call for which endpoint yourself, and it can get quite messy.
Fortunately, we can do better using Vert.x web module:

Router router = Router.router(vertx);
router.route("/hello").handler(context -> {
    String name = context.request().getParam("name");
    context.response().end("Hello " + name);
});

router.get("/ping").handler(context -> {
    context.response().end("Pong!");
});

vertx.createHttpServer()
     .requestHandler(router::accept)
     .listen(80);

This way, we can easily separate business logic for our endpoints.

Do you need request logging or security? You can add more routes for the same endpoints. Routes are then processed in the order in which they were added to the router and can either call the next route in the chain or just end the response.

Router router = Router.router(vertx);
router.route().handler(CookieHandler.create());
router.get("/secured-resource").handler(context -> {
    if ("password".equals(context.getCookie("securityToken"))) {
        context.next();
    } else {
        context.response().setStatusCode(401).end("sorry, you are not authorized to see the secured stuff");
    }
});

router.get("/secured-resource").handler(context -> {
    context.response().end("top secret stuff");
});

Do you want more than just APIs? There are template engines for Handlebars, MVEL, Thymeleaf or Jade if you want to use their functionality.

router.get().handler(TemplateHandler.create(JadeTemplateEngine.create()));

Things get more interesting when you configure Vert.x instances to form a cluster. Currently, the only production-ready cluster manager is built on top of Hazelcast, so we will use it.

VertxOptions options = new VertxOptions();
options.setClustered(true);
options.setClusterHost("192.168.1.2");

Vertx.clusteredVertx(options, result -> {
    if (result.succeeded()) {
        Vertx vertx = result.result();
        //initialize what we need
    } else {
        LOGGER.error("Failed to start vertx cluster", result.cause());
    }
});

As you can see, starting a Vert.x cluster isn't that complicated. Just be careful to set the correct cluster host address, or you may find out that the cluster doesn't behave as you would expect.
And what functionality is available for clustered applications? There is a clustered event bus, where you can use point-to-point or publish-subscribe messaging. Just register your consumers and send them messages to logical addresses without ever caring on which node the consumer resides.

vertx.eventBus().consumer("/echo", message -> {
    message.reply(message.body());
});

vertx.eventBus().send("/echo", "ping");

I hope this short introduction to Vert.x will make you curious to look at its documentation and start writing your own applications on top of it.

One of the systems our team develops is a UI for end users, where users can view and manage their print-related data.

The system is designed as a simple web application where we make AJAX calls to Spring controllers, which delegate the calls to two other systems; no database is present.

One of the requirements on the system was to support about 1000 concurrent users. Since Tomcat has 200 threads by default and the calls to the other systems may take long (fortunately, this is not the case at the moment), we have decided to make use of Servlet 3.0 async support. This way, each AJAX call from the browser uses up a Tomcat thread only for the preparation of a call to the other system. The calls are handled by our asynchronous library for communication with SafeQ and an asynchronous HTTP client for communication with the Payment System, both of which have their own thread pools and fill in the response when they get a reply.
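For illustration, a controller in this style could look roughly like the sketch below. The class, client, payload and endpoint names are made up and the real code differs; the point is that the Tomcat thread only registers a callback and returns a DeferredResult, which is completed later on the client library's own thread.

@RestController
public class EntitlementController {

    private final PaymentSystemClient paymentSystemClient; // hypothetical async client with its own thread pool

    public EntitlementController(PaymentSystemClient paymentSystemClient) {
        this.paymentSystemClient = paymentSystemClient;
    }

    @RequestMapping("/entitlement")
    public DeferredResult<Entitlement> getEntitlement(@RequestParam String userGuid) {
        DeferredResult<Entitlement> response = new DeferredResult<>();
        // the Tomcat thread returns right after this call; the callback fills in the response later
        paymentSystemClient.getEntitlement(new ListenableFutureCallback<Entitlement>() {
            @Override
            public void onSuccess(Entitlement entitlement) {
                response.setResult(entitlement);
            }

            @Override
            public void onFailure(Throwable ex) {
                response.setErrorResult(ex);
            }
        }, userGuid, null);
        return response;
    }
}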

Since we depend so much on other systems’ performance, we wanted to monitor the execution time of the requests for better tracking and debugging of production problems.

There are several endpoints for each system and more can come later, so in order to avoid duplication, we have decided to leverage Spring's aspect-oriented programming support. We have created an aspect for each business service (one for SafeQ, one for the Payment System), and it was time to implement the measurement logic.
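Both aspects are bound to the business service methods through a pointcut. The remoteServiceMethod() pointcut referenced in the snippets below could be declared roughly like this (the package name is purely illustrative):

@Pointcut("execution(public * com.ysoft.enduserui.remote..*Service.*(..))")
public void remoteServiceMethod() {
}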

In the synchronous scenario, things are pretty simple. Just intercept the call, note the start and end times, and if the call took too long, log it in a log file.

@Around("remoteServiceMethod()")
public Object restTimingAspect(ProceedingJoinPoint joinPoint) throws Throwable {
    long start = System.currentTimeMillis();

    Object result = joinPoint.proceed();

    long end = System.currentTimeMillis();

    long executionInMillis = end - start;
    if (executionInMillis > remoteServiceCallDurationThresholdInMillis) {
        LOGGER.warn("Execution of {} took {} ms.", joinPoint, executionInMillis);
    }

    return result;
}

This won't work in our asynchronous case. The call to joinPoint.proceed(), which makes the call to the other system, returns immediately without waiting for a reply. The reply is processed in a callback provided to one of the async communication libraries. So we have to do a bit more.

We know the signature of our business methods. One of the arguments is always a callback, which will process the reply.

public void getEntitlement(ListenableFutureCallback callback, String userGuid, Long costCenterId)

If we want to add our monitoring logic in a transparent way, we have to create a special callback implementation which will wrap the original callback and track the total execution time.

class TimingListenableFutureCallback implements ListenableFutureCallback {

    private ListenableFutureCallback delegate;
    private StopWatch timer = new StopWatch();
    private String joinPoint;

    public TimingListenableFutureCallback(ListenableFutureCallback delegate, String joinPoint) {
        this.delegate = delegate;
        this.joinPoint = joinPoint;
        timer.start();
    }
        
    @Override
    public void onSuccess(Object result) {
        logExecution(timer, joinPoint);
        delegate.onSuccess(result);
    }

    @Override
    public void onFailure(Throwable ex) {
        logExecution(timer, joinPoint);
        delegate.onFailure(ex);
    }
}

And then we have to call the target business method with a properly wrapped callback argument.

@Around("remoteServiceMethod() && args(callback,..)")
public Object restTimingAspect(ProceedingJoinPoint joinPoint, ListenableFutureCallback callback) throws Throwable {
    String joinPointName = computeJoinPointName(joinPoint);
    
    Object[] wrappedArgs = Arrays.stream(joinPoint.getArgs()).map(arg -> {
        return arg instanceof ListenableFutureCallback ? wrapCallback((ListenableFutureCallback) arg, joinPointName) : arg;
    }).toArray();
    
    LOGGER.trace("Calling remote service operation {}.", joinPointName);
    
    return joinPoint.proceed(wrappedArgs);
}

There is a similar implementation for the other async messaging library.
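For completeness, the logExecution and wrapCallback helpers referenced above are not shown in the snippets; a minimal sketch of what they could look like follows (the threshold constant and helper placement are illustrative, not the actual implementation):

// illustrative threshold; in the real aspect it is a configurable field
private static final long DURATION_THRESHOLD_IN_MILLIS = 1000;

// hypothetical helper shared by the timing callbacks
private static void logExecution(StopWatch timer, String joinPointName) {
    timer.stop();
    long executionInMillis = timer.getTotalTimeMillis();
    if (executionInMillis > DURATION_THRESHOLD_IN_MILLIS) {
        LOGGER.warn("Execution of {} took {} ms.", joinPointName, executionInMillis);
    }
}

// hypothetical helper in the aspect that wraps the original callback with the timing one
private ListenableFutureCallback wrapCallback(ListenableFutureCallback callback, String joinPointName) {
    return new TimingListenableFutureCallback(callback, joinPointName);
}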

I hope this solution will help you solve similar problems in your applications in an elegant manner 🙂

Wishing you good luck and great success in year 2016.

Here is a small puzzle game for you:

PF 2016 Puzzle Game

http://ysofters.com/pf2016

Source code is available at GitHub.

You can play the game on tablet, phone or computer.

Based on: Enigma game, Kiwi.JS, Font Awesome.

Limitation for Windows 10 users with touch displays: the mouse does not work, you can only use touch. The bug was already reported to the authors of the Kiwi.JS framework.

You can also enjoy PF games from previous years: PF 2015, PF 2014, PF 2013, PF 2012, PF 2011, PF 2010.