Data processing is a common practice in almost every branch of science and research. Some factors, however, turn this routine activity into a real challenge. One factor is the amount of data, together with the amount of processing to be done on it; both have grown steadily with advances in science and engineering. Another common factor, which complicates the situation further, is a tight constraint on processing time, which in the extreme case calls for real-time processing. The problem we are trying to solve falls squarely into this category. On the one hand, the pnCCD camera, an energy-dispersive CCD with fast read-out, generates a large volume of data. To make this concrete, consider the version of the camera we are using, which can generate 400 images of 384 × 384 pixels per second. Since each pixel value is represented by two bytes, the camera generates 400 × 384 × 384 × 2 = 117,964,800 bytes per second, that is, more than 110 MB/s. On the other hand, the problem calls for real-time processing of the generated data, so that the user can work with the data interactively and instantly see the results of the processing applied.
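The data-rate estimate above can be checked with a few lines of C++. This is only a sketch of the arithmetic: the figures (400 frames per second, 384 × 384 pixels, two bytes per pixel) come from the text, while the names CameraSpec, bytes_per_frame, bytes_per_second and kPnccd are hypothetical and not part of any pnCCD software.

```cpp
#include <cassert>
#include <cstdint>

// Camera figures quoted in the text. The struct and function names are
// illustrative assumptions, not part of any real pnCCD software.
struct CameraSpec {
    std::uint64_t frames_per_second;
    std::uint64_t width_px;
    std::uint64_t height_px;
    std::uint64_t bytes_per_pixel;
};

// Size of one raw frame in bytes.
constexpr std::uint64_t bytes_per_frame(const CameraSpec& c) {
    return c.width_px * c.height_px * c.bytes_per_pixel;
}

// Raw data generation rate in bytes per second.
constexpr std::uint64_t bytes_per_second(const CameraSpec& c) {
    return bytes_per_frame(c) * c.frames_per_second;
}

// 400 frames/s, 384 x 384 pixels, 2 bytes per pixel.
constexpr CameraSpec kPnccd{400, 384, 384, 2};

// Checked at compile time: 294,912 bytes per frame, 117,964,800 bytes/s.
static_assert(bytes_per_frame(kPnccd) == 294912ULL, "frame size");
static_assert(bytes_per_second(kPnccd) == 117964800ULL, "data rate");
```

The result corresponds to roughly 118 MB/s in decimal megabytes, or 112.5 MiB/s in binary units, so "more than 110 MB/s" holds under either convention.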
The fast-paced growth of applications with such requirements has, however, led to the production of more powerful processing solutions. Traditionally, the semiconductor industry has followed two prominent trends to deliver more powerful processors. In the first, the emphasis is on increasing the clock rate of the processor, with the natural consequence that more data is processed per unit of time. This trend has slowed down notably in recent years because of the technical limits encountered when increasing processor frequency. In response to these limits, the industry has shifted towards a second major trend, in which multiple instances of the same processing unit (the so-called core) are bundled into a single processor. Ideally, the processing power delivered by the processor is then multiplied by the number of cores while the processor frequency remains unchanged. Following this second trend, the market has seen the introduction of multi-core CPUs which, together with slight increases in processor frequency, have given consumers a notable boost in computational power. Nevertheless, there are still applications which require much more computational power than today's CPUs offer. To meet the needs of this class of applications, many-core processors such as Graphics Processing Units (GPUs) have been used extensively. GPUs have far more cores than multi-core CPUs, often by a factor of a hundred or more, although this increase comes at the expense of less flexibility in the programming features offered. Having hundreds of cores makes GPUs the hardware of choice for many problems which are parallel in nature.
As already mentioned, GPUs are suitable for applications whose data can be processed in a highly parallel manner. Image processing problems in general, and the pnCCD data processing problem in particular, fall well within this category, since a significant portion of the processing can be done in parallel on the image pixels. With these considerations in mind, we have chosen GPUs as the core hardware infrastructure for processing pnCCD images. There are still parts of the processing, however, which do not follow the parallel processing paradigm supported by GPUs and are better handled by CPUs. Our hardware therefore also incorporates a considerable amount of CPU computational power, although the main emphasis remains on GPUs. More specifically, the main computational power in our project is provided by a workstation equipped with four NVIDIA Tesla C2050 GPUs alongside two quad-core E5630 server CPUs with hyper-threading enabled. This workstation can be paired with a PC from which the user can interactively control the processing done on the pnCCD data and see the results. The platform can be further extended with a third computer which is connected to the pnCCD camera and transmits the captured data over the network to the workstation in a client-server fashion.
As said, the main processing power is provided by the GPUs in the workstation. We therefore need a mechanism to exploit the GPUs as efficiently as possible. To meet this need, we have developed a software framework, which is also our major contribution to the project. The framework is responsible for efficiently managing the processing done on the GPUs, with the aim of maximizing hardware utilization. The most fundamental question in the framework design is how to distribute the processing workload among the GPUs. Not surprisingly, this design decision greatly affects the other features of the framework. Since the four available GPUs have identical technical characteristics, a natural policy is to distribute the workload evenly among them. A straightforward way to realize this policy is to split the processing done on each pnCCD frame into four 'equal' parts and assign each part to a separate GPU. Although this approach appears simple, it runs into serious technical problems in implementation. The major difficulty lies in its very definition: the processing has to be divided into 'equal' portions, whereas we do not know in advance what specific processing will be done on each frame (let alone how to judge whether two parts of the processing are equal in terms of workload, even when they are known in advance). To resolve this fundamental problem, we have implemented a different approach, in which the same complete processing runs on every GPU, with each GPU operating on a separate frame. To realize this approach, we launch as many CPU threads as there are GPUs, with each CPU thread controlling a separate GPU. Using multiple CPU threads has two benefits: it allows more efficient synchronization when blocking functions are invoked on the GPUs, and it distributes the CPU processing workload better among the CPU cores than a single CPU thread would.
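The frame-per-GPU approach described above can be sketched in standard C++ as follows. This is a minimal illustration under stated assumptions, not the framework's actual code: the GPUs are simulated by plain CPU work so that the sketch runs without CUDA, the functions process_frame and run_one_thread_per_gpu are hypothetical, and handing out frames through a shared atomic counter is an assumption, since the text does not say how frames are assigned to GPUs.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the complete per-frame processing. In a CUDA
// build, this is where the kernels would run on the thread's own device.
inline int process_frame(int frame_value) { return frame_value * 2; }

// Runs the same complete processing on every frame, using one CPU thread per
// (simulated) GPU. Fills `results` with one output per frame and returns,
// for each frame, the index of the device that processed it.
inline std::vector<int> run_one_thread_per_gpu(const std::vector<int>& frames,
                                               std::vector<int>& results,
                                               int num_gpus) {
    results.assign(frames.size(), 0);
    std::vector<int> owner(frames.size(), -1);
    std::atomic<std::size_t> next_frame{0};  // assumed frame hand-out scheme

    auto worker = [&](int gpu_id) {
        // A CUDA version would bind this thread to its device here,
        // for example with cudaSetDevice(gpu_id).
        for (;;) {
            const std::size_t i = next_frame.fetch_add(1);
            if (i >= frames.size()) break;          // no frames left
            results[i] = process_frame(frames[i]);  // whole processing, one frame
            owner[i] = gpu_id;                      // each slot written by one thread
        }
    };

    std::vector<std::thread> threads;
    for (int g = 0; g < num_gpus; ++g) threads.emplace_back(worker, g);
    for (auto& t : threads) t.join();
    return owner;
}
```

Because every thread takes a whole frame, no thread needs to know how expensive the processing is or how to cut it into equal parts; a thread whose device finishes early simply takes the next frame.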
Other features of the framework include transparent synchronization of the processing, transparent GPU memory management, and an implementation in C++ and CUDA. Furthermore, the processing done on each pnCCD frame can be described by a flowchart, hereafter called the processing graph. The processing graph consists of a number of modules interconnected by data flow connections (Figure 1 shows a typical processing graph). Three main types of modules are supported: source and sink modules, which represent the flow of data into and out of the processing graph, respectively, and processing modules, which represent the processing done on the data. A processing module can contain GPU code, CPU code, or both. The modules can be connected to each other using five different types of connections, as shown in Figure 2. One point worth mentioning is the support for feedback connections, which allow the output of a module at time step t to be used by the same or another module at time step t + 1. This enables stateful processing of data, where the output of a system at a given moment depends not only on the input at that moment but also on the inputs at earlier moments. As a final note, the software framework is not restricted to the processing of pnCCD data; it is general enough to process sequences of incoming packets of any type of data (so-called data streams).
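The effect of a feedback connection can be illustrated with a small C++ sketch. It is not the framework's API: FeedbackModule, run_stream and running_sum are hypothetical names, the packets are reduced to plain integers, and only the feedback connection is modeled, since the other connection types are defined in Figure 2 rather than in the text.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical processing module with a single feedback connection: its
// output at time step t is fed back as a second input at time step t + 1.
struct FeedbackModule {
    std::function<int(int input, int previous_output)> process;
    int previous_output;  // value carried by the feedback connection

    int step(int input) {
        previous_output = process(input, previous_output);
        return previous_output;
    }
};

// Source -> module -> sink: passes each packet of the stream through the
// module in order and collects the outputs.
inline std::vector<int> run_stream(FeedbackModule module,
                                   const std::vector<int>& source) {
    std::vector<int> sink;
    for (int packet : source) sink.push_back(module.step(packet));
    return sink;
}

// Example of stateful processing: a running sum, where each output depends
// on the current input and, through the feedback, on all earlier inputs.
inline std::vector<int> running_sum(const std::vector<int>& source) {
    FeedbackModule module{[](int in, int prev) { return in + prev; }, 0};
    return run_stream(module, source);
}
```

For the input stream 1, 2, 3, 4 the running sum yields 1, 3, 6, 10: the third output differs from the third input only because of the state carried by the feedback connection.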
Application 1:
The framework has been successfully employed to extract spot positions from a sequence of pnCCD images. Spots are defined as areas in the pnCCD image plane with a high density of incident X-ray photons. The GPU algorithms developed for this application produce results comparable to those of a single-threaded sequential CPU algorithm developed by physicists, while running more than 7 times faster. Further details can be found in the paper titled “Fast GPU-Based Spot Extraction for Energy-Dispersive X-Ray Laue Diffraction”.