I am currently working in the areas of multicore architecture and cloud platforms for High Performance Computing (HPC) applications. Brief descriptions of my recent projects follow. Please feel free to contact me if you would like to know more about, or collaborate on, any of these areas.
Shared Cache Management:
A project about partitioning the ways of a shared last-level cache among concurrently-executing threads of a data-parallel multithreaded application running on a chip-multiprocessor.
LRU-based replacement policies divide the cache among competing threads according to the demand each thread generates, which in many cases does not lead to the most efficient use of the cache. Explicitly partitioning the ways of the cache among the threads instead improves both cache utilization and fairness, because threads with high demand but poor reuse behavior cannot encroach on the shares of others. Unlike prior work on way-partitioning for unrelated threads in a multiprogrammed workload, the domain of multithreaded programs requires both throughput and fairness. Additionally, current data-parallel programs show no obvious thread differences to exploit: threads see nearly identical IPC and data reuse as they progress, as expected for a well-written, load-balanced data-parallel program.
Despite the balance and symmetry among threads, we find that a balanced partitioning of cache ways between threads is suboptimal. A better strategy temporarily imbalances the partitions between threads, adapting to their locality behavior as captured by dynamic set-specific reuse distance (SSRD), and thereby improves cache utilization and application performance. Cumulative SSRD histograms have knees that correspond to different important working sets; cache ways can therefore be taken away from a thread with only minimal performance impact if that thread is currently operating far from a knee. Those ways can then be given to a single “preferred” thread to push it over the next knee. The preferred thread is chosen in round-robin fashion to ensure balanced progress over the execution.
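The mechanics of SSRD can be sketched in a few lines. The following is a toy model, not the project's actual hardware mechanism: it computes per-set LRU stack distances from an address trace and then picks a "knee" as the smallest way count covering most of the observed reuses. The set count, block size, and coverage threshold are illustrative placeholders.

```python
from collections import defaultdict

NUM_SETS = 4        # toy cache: 4 sets (real last-level caches have thousands)
BLOCK = 64          # 64-byte cache lines

def ssrd_histograms(trace):
    """Per-set reuse-distance histograms from an address trace.

    Reuse distance here is LRU stack distance: the number of distinct
    lines mapping to the same set that were touched between two
    accesses to the same line.
    """
    stacks = defaultdict(list)                     # per-set LRU stacks
    hists = defaultdict(lambda: defaultdict(int))  # set -> {distance: count}
    for addr in trace:
        line = addr // BLOCK
        s = line % NUM_SETS
        stack = stacks[s]
        if line in stack:
            d = stack.index(line)                  # stack distance of this reuse
            hists[s][d] += 1
            stack.remove(line)
        stack.insert(0, line)                      # move line to MRU position
    return hists

def knee(hist, total_ways, coverage=0.9):
    """Smallest way count whose cumulative histogram covers `coverage`
    of the reuses -- a crude stand-in for the knee of the cumulative
    SSRD curve."""
    total = sum(hist.values())
    if total == 0:
        return 1
    acc = 0
    for w in range(total_ways):
        acc += hist.get(w, 0)
        if acc / total >= coverage:
            return w + 1
    return total_ways
```

A thread whose current knee is far below its allocated way count is a candidate to donate ways; the donated ways go to the round-robin "preferred" thread described above.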
Parallel File System-as-a-service in the Cloud:
A project on how to effectively integrate a parallel file system (PFS) in a cloud environment and provide PFS as a service to cloud users running HPC applications.
PFS as a service in the cloud would allow users to provision and configure PFS storage dynamically and securely. Initial results indicate that the best performance can be obtained by providing a native PFS installation as a service and using file-system passthrough from the VM instances to access the files directly. Most of the work in this project was done when I was working at the Information Sciences Institute in Arlington, Virginia as a visiting researcher.
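One common way to realize such file-system passthrough from a VM is QEMU's 9p "virtfs" export. The snippet below is only an illustrative sketch with placeholder paths and tags; it is not necessarily the exact mechanism used in the project.

```shell
# Host: export a directory (e.g. the native PFS client's mount point)
# into the guest via QEMU's 9p virtfs passthrough.
qemu-system-x86_64 \
    -m 4096 -enable-kvm \
    -drive file=guest.img,if=virtio \
    -virtfs local,path=/mnt/pfs,mount_tag=pfs0,security_model=passthrough

# Guest: mount the exported tag using the 9p filesystem over virtio.
mount -t 9p -o trans=virtio,version=9p2000.L pfs0 /mnt/pfs
```

With this setup the guest accesses PFS files directly through the host's native PFS client, avoiding a second network hop through a guest-side file-system client.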
Multiprocessor Reliability:
A project introducing a novel micro-architectural modification that improves the yield and reliability of homogeneous chip multiprocessors (CMPs).
Device reliability and manufacturability have emerged as dominant concerns for end-of-roadmap CMOS devices. An increasing number of hardware failures are attributed to manufacturability or reliability problems. Maintaining an acceptable manufacturing yield for chips containing tens of billions of transistors, with large variations in device parameters, has been identified as a major challenge. Additionally, today’s nanometer-scale devices suffer accelerated aging because of the extreme operating temperatures and electric fields to which they are subjected. Near-threshold transistor operation promises large reductions in chip power at the cost of reliability. For future computing systems, therefore, the architecture must play an important role in maintaining acceptable yield and reliability atop unreliable substrates.
To this end, we investigate a micro-architectural scheme for improving the yield and reliability of homogeneous CMPs. The proposed solution is a hardware framework that exploits the redundancy inherent in a multi-core system to keep the system operational in the face of partial failures. A micro-architectural modification allows a faulty core to use another core’s resources to service any instruction that the faulty core cannot execute correctly by itself. This service improves yield and reliability, but may cost performance. The target platforms for quantitative evaluation of performance under degradation are dual-core and quad-core chip multiprocessors with one or more cores sustaining partial failure. Simulation studies indicate that when a large, high-latency, sparingly-used unit such as a floating-point unit fails in a core, correct execution can be sustained through outsourcing with at most a 16% performance impact for a floating-point-intensive application. For applications with moderate floating-point load, the degradation is insignificant. The performance impact can be mitigated further by judicious selection of the core to commandeer, based on the current load on each candidate core. Thanks to resource re-use, the area overhead is also insignificant.
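The first-order performance effect of outsourcing can be illustrated with a simple CPI model. This is a back-of-the-envelope sketch with made-up latencies and instruction mixes, not the project's simulation methodology:

```python
def outsourcing_slowdown(fp_frac, base_cpi, fp_local_lat, fp_remote_lat):
    """Toy CPI model of instruction outsourcing.

    Every floating-point instruction on the faulty core pays the
    remote-execution latency instead of its local latency; all other
    instructions are unaffected. Returns the slowdown factor
    (degraded CPI / baseline CPI). All parameters are hypothetical,
    for illustration only.
    """
    extra_cpi = fp_frac * (fp_remote_lat - fp_local_lat)
    return (base_cpi + extra_cpi) / base_cpi

# Example: 2% FP instructions, baseline CPI of 1.0, and an extra
# 8 cycles per outsourced FP op -> a 1.16x slowdown, i.e. a 16%
# performance impact (parameters chosen to mirror the figure above).
slowdown = outsourcing_slowdown(fp_frac=0.02, base_cpi=1.0,
                                fp_local_lat=4, fp_remote_lat=12)
```

The model also makes the load-aware core selection intuitive: a busier helper core effectively raises `fp_remote_lat`, so commandeering the least-loaded candidate minimizes the slowdown.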