Tuesday, June 9, 2020

Controlling page cache

The topic of this post is the orange portion of RAM seen after running the 'htop' command on Linux. It is called the "page cache" and it is the part of RAM that contains recently accessed (read or written) disk pages (4K blocks of files). When a user attempts to read a cached part of the disk, the operating system does not perform any disk access and instead returns the pages from RAM. All this happens in a way completely transparent to the user, and of course the OS keeps track of whether the cached pages have changed in the meantime (when this happens, a disk read occurs as normal). Importantly, at any point the page cache is free to be reused by other processes. Namely, whenever a program allocates RAM, the operating system /may/ choose to return a block of RAM that overlaps the page cache (in which case the contents of that block are forgotten by the OS). The operating system will, however, avoid doing that for as long as possible and instead try to return a block of RAM that does not contain any cached pages. Wikipedia also has a quite good intro on this subject, so feel free to give it a read: https://en.wikipedia.org/wiki/Page_cache.

In most practical situations having a page cache is a good thing, as it boosts the performance of the system essentially for free. However, this mechanism can be very problematic when one is trying to accurately measure the performance of some function or data structure operating in external memory. The page cache is enabled by default, and I am not aware of a simple way to disable it system-wide (and if one existed, it would almost certainly require admin-level access to the system). Fortunately, once we have an understanding of how it works and a few scripts at hand, it becomes quite easy to manage.

To illustrate the problem, consider the following C++ code (error handling omitted for simplicity) that sequentially streams a file given as a parameter and outputs (three times) a simple checksum.

NOTE: At the end of the article there is a link to the archive containing all the source code and scripts necessary to reproduce the experiments. Also, note that the code has only been tested on Linux/PC.

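A minimal sketch of such a program could look as follows (I am assuming a plain fread loop with a simple XOR of all bytes as the checksum; any cheap checksum works equally well for this experiment):

// compute_checksum.cpp -- stream a file three times, each time
// printing a simple checksum and the elapsed wall-clock time.
#include <cstdio>
#include <cstdlib>
#include <sys/time.h>

// Return the current wall-clock time in seconds.
double wclock() {
  timeval tv;
  gettimeofday(&tv, 0);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(int argc, char **argv) {
  if (argc != 2) {
    std::fprintf(stderr, "Usage: %s FILE\n", argv[0]);
    std::exit(EXIT_FAILURE);
  }
  static const size_t bufsize = 1 << 20;  // 1MiB read buffer
  unsigned char *buf = new unsigned char[bufsize];
  for (int iter = 0; iter < 3; ++iter) {
    double start = wclock();
    std::FILE *f = std::fopen(argv[1], "r");
    unsigned char checksum = 0;
    size_t count;
    while ((count = std::fread(buf, 1, bufsize, f)) > 0)
      for (size_t i = 0; i < count; ++i)
        checksum ^= buf[i];  // XOR of all bytes in the file
    std::fclose(f);
    std::printf("checksum = %u, time = %.2fs\n",
        (unsigned)checksum, wclock() - start);
  }
  delete[] buf;
  return EXIT_SUCCESS;
}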

Assuming you have compute_checksum.cpp (containing the above code) in your current directory, running the following set of commands in succession
 
 $ dd if=/dev/zero of=./input.txt count=1024 bs=1048576
 $ g++ -o compute_checksum compute_checksum.cpp -std=c++0x -O3 -march=native
 $ ./compute_checksum ./input.txt

will:

1. Create a 1GiB test file called input.txt.
2. Compile the above program.
3. Run it on input.txt file.

On my computer it gives the following output:

 checksum = 0, time = 0.52s
 checksum = 0, time = 0.45s
 checksum = 0, time = 0.46s

The measurements are pretty consistent, which is reassuring. A problem arises, however, when one asks: how fast is the disk access on the test machine? My disk can read at a rate of around 200MiB/s, so 0.5s is not nearly enough time to read a 1024MiB file into RAM. The explanation is as follows: as the file was being created, the operating system cached it, so that subsequent accesses are fast. So, how can we measure the true performance of this program?

Solution 1: Reboot the computer and rerun compute_checksum. This will work, but:

(a) What if you are using a third-party cluster (rebooting requires sudo)?
(b) Even if you have the privileges necessary to reboot, you would still need to do it every time you modify your program, and rebooting can take a minute or so. This solution is thus not very practical.

Solution 2: Use this (or some other) program to read some other file(s) into RAM.

After reading a sufficient amount of data, this will indeed overwrite the cache; it is a slightly better, though still somewhat "dirty", solution. Its advantage is that it does not require sudo privileges.
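For instance, assuming some other sufficiently large file is at hand (the file name below is just a placeholder), a single sequential scan is enough:

 $ dd if=./some_other_large_file of=/dev/null bs=1048576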

Solution 3: Use a script that empties the entire page cache. This is quite simple and can be accomplished by running the following command (it is customary to run sync first, so that dirty pages are written back and become freeable). Unfortunately, this again requires sudo.

echo 3 | sudo tee /proc/sys/vm/drop_caches

Solution 4: So far we have only dealt with the problem of guaranteeing that the cache is empty when we start the program. This, however, does not solve the following issue. Suppose your test machine has a lot of RAM, say 512GiB, and to demonstrate the scalability of your algorithm, you would like your experiments to simulate the scenario where the machine has much less RAM than the size of the input (a typical choice here is a few GiB, to simulate a commodity machine). Even if:
  1. The algorithm is only allocating and using a small amount of RAM (a few GiB), and
  2. We successfully empty the page cache at the beginning of the program,
then, if during its execution we scan the same file multiple times (or simply read some temporary file that our program has just generated), the measured performance of the algorithm will not, due to caching, be the same as if the algorithm were running on a machine with just a few GiB of RAM.

Fortunately, there exists a solution that addresses all of the above problems and does not require sudo. Linux implements a system call named posix_fadvise that does exactly what we need. The important bits from the manual (https://www.man7.org/linux/man-pages/man2/posix_fadvise.2.html) are as follows:

NAME

    posix_fadvise - predeclare an access pattern for file data

SYNOPSIS

    #include <fcntl.h>
    int posix_fadvise(int fd, off_t offset, off_t len, int advice);

DESCRIPTION

    Programs can use posix_fadvise() to announce an intention to access
    file data in a specific pattern in the future, thus allowing the
    kernel to perform appropriate optimizations.

    The advice applies to a (not necessarily existent) region starting at
    offset and extending for len bytes (or until the end of the file if
    len is 0) within the file referred to by fd.  The advice is not
    binding; it merely constitutes an expectation on behalf of the
    application.
    
    Permissible values for advice include:

    (...)

    POSIX_FADV_DONTNEED
  
        The specified data will not be accessed in the near future.

        POSIX_FADV_DONTNEED attempts to free cached pages associated
        with the specified region.  This is useful, for example, while
        streaming large files.  A program may periodically request the
        kernel to free cached data that has already been used, so that
        more useful cached pages are not discarded instead.

    (...)

Let us try it. The following program drops all cached pages of a file given as an argument.

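A minimal sketch (error handling again omitted; note the fdatasync call before the advice, which I am adding because dirty pages cannot be dropped):

// empty_page_cache.cpp -- evict all cached pages of the given file.
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    std::fprintf(stderr, "Usage: %s FILE\n", argv[0]);
    std::exit(EXIT_FAILURE);
  }
  int fd = open(argv[1], O_RDONLY);

  // Write back any dirty pages first; POSIX_FADV_DONTNEED can
  // only drop clean pages.
  fdatasync(fd);

  // offset = 0 and len = 0 make the advice apply to the whole file.
  posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

  close(fd);
  return EXIT_SUCCESS;
}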

Now, assuming we saved the above code to empty_page_cache.cpp, running

 $ g++ -o empty_page_cache empty_page_cache.cpp -std=c++0x -O3 -march=native
 $ ./empty_page_cache ./input.txt
 $ ./compute_checksum ./input.txt

gives the following result on my computer (note the difference on the first line).

 checksum = 0, time = 5.14s
 checksum = 0, time = 0.50s
 checksum = 0, time = 0.45s

Indeed, on the first scan of the file we actually needed to read it from disk, whereas the two subsequent scans read the file (with the transparent help of the OS) directly from RAM. Note that all the caching happens completely outside our control: in the second and third runs we are still opening the file and reading its contents using fread, and yet no transfer between disk and RAM occurs.

Of course, there is no reason to have two programs. We can copy the cache-emptying code into the previous program and call it every time we scan a file. Or better yet, have a designated function that opens a file and simultaneously empties the corresponding pages from the cache. For example:

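Here is a minimal sketch of such a function (checksum_uncached.cpp below is then simply compute_checksum.cpp with the call to std::fopen replaced by fopen_uncached):

#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Open a file with fopen, but first evict its cached pages, so
// that the subsequent freads actually touch the disk.
std::FILE *fopen_uncached(const char *filename, const char *mode) {
  int fd = open(filename, O_RDONLY);
  if (fd != -1) {
    fdatasync(fd);                                 // flush dirty pages
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);  // drop cached pages
    close(fd);
  }
  return std::fopen(filename, mode);
}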

Assuming we saved the above code to checksum_uncached.cpp, running

 $ g++ -o checksum_uncached checksum_uncached.cpp -std=c++0x -O3 -march=native
 $ ./checksum_uncached ./input.txt

gives the following result on my computer:

 checksum = 0, time = 4.98s
 checksum = 0, time = 4.86s
 checksum = 0, time = 4.92s

There is one more subtle thing to mention about this solution. We claimed that the above idea (of always emptying the corresponding pages when opening any file) allows us to simulate the situation where the algorithm runs on a machine with a lot less RAM. One might, however, argue that this simulation is not perfect, as follows.
 
Assume our old algorithm (without any countermeasures against page caching) was indeed running on a "commodity" machine with, say, 4GiB of RAM, was allowed (when invoked from the command line) to use 3.5GiB, but was actually using only 2GiB at some stages. It would then benefit a little from caching in those stages. There would not be anything "unfair" about reporting measurements from such an experiment, and they should indeed be accepted as the true performance of the algorithm. One might now say that in the above solution (where we /always/ empty the cache when opening a file) we are depriving ourselves of these legitimate time savings, and the resulting performance on the large-RAM machine will actually be *worse* than it would be on the machine with less RAM but with caching allowed (i.e., the setup we accepted as fair).
 
The resolution of this argument is as follows. The above scenario can indeed happen for some programs. However, it will never happen for an /efficiently designed/ external-memory algorithm. The key observation is that any external-memory program that benefits from a piece of data accidentally residing in RAM can be redesigned to work equally efficiently without such an assumption. In our case, the stages of computation where the algorithm was using 2GiB and yet benefiting from the page cache should be reorganized so that the data that accidentally stayed in RAM is explicitly maintained in RAM by the algorithm, which should then explicitly use the full 3.5GiB of allowed memory. In other words, as long as the program is not "obviously inefficient", any measurement on the large-RAM machine (with the above page-clearing mechanism enabled) will indeed be equivalent to running on the commodity-RAM machine.

Summary


One more solution to the problem exists, based on opening the file with the O_DIRECT flag. This, however, is material for a separate post.

My preferred solution is Solution 4, as it does not have the restrictions that come with using O_DIRECT, can be easily embedded in the code, and does not require sudo access. It is also faster than the other solutions, as it only drops the cached pages of the files we specify.


Supplementary Materials:


The attached package contains scripts and source code (this version of the code includes some error handling, but is otherwise the same as above) that allow easy reproduction of the above experiments. The code has been tested only on Linux/PC. After extracting the package, enter the directory and type 'make experiment' in the terminal. If everything goes well, this will run the experiment and then clean up after itself (i.e., remove all temporary files), returning the directory to its initial state.

Additionally, the package contains a parallelized (using OpenMP) version of the code, which might be useful for demonstrating the effect of the page cache on systems with an SSD (where read speeds can easily reach 500MiB/s on a typical laptop and over 1GiB/s on higher-end systems).

2020-07-04: Update: Thanks to Manuel Penschuck for pointing out the missing optimization flags and for comments on running the experiment on an SSD!

Wednesday, May 27, 2020

First post

Hello, world! I decided to start a blog to share some of my experience working in the field of Computer Science known as Algorithm Engineering (equivalently: Experimental Algorithmics). During my Ph.D. at the University of Helsinki (2011-2015), where I was a member of the Practical Algorithms and Data structures on Strings (PADS) group, I was involved in a number of projects mostly focused on the efficient implementation of string algorithms in the external-memory (EM), parallel (multicore), and good old RAM (internal-memory) models. Prior to that, I gained some experience as a Software Engineer Intern at Google in Zurich. And prior to that, for about 8 years, I was obsessed with programming competitions (think: Google Code Jam, Topcoder, ACM ICPC). Check out, for example, my Sphere Online Judge profile.

Example papers I worked on:
[JEA1], [JEA2], [JEA3],
[ESA1], [CPM1], [SEA1].

During that time I learned a lot about programming and computer architecture, but what I aim to share in this blog is some experience with running experiments. I will mostly focus on lesser-known issues that arise often. The idea is to be brief: bring attention to some specific technique or issue, share some scripts or code snippets, and give links for further reading. Most importantly, however, I want to highlight where in the process of implementing/running the code a specific issue arises and how to avoid it or take advantage of it, without dwelling on the technical details.

I chose to cover these topics because there is quite a lot to say, and yet the materials on them are scattered all over the internet. In addition, I hope this blog will provide some forum for discussion, and that I will learn something in the process as well. Theory blogs seem to cover their field quite well these days, but it seems to me there is comparatively little discussion of experimental algorithmics.

Examples of things I would like to cover are: out-of-order execution (= reducing the cost of cache misses), page cache (i.e., why running experiments involving streaming/EM algorithms requires extra care), restricting the amount of RAM using GRUB options (and how to deal with it when altering GRUB settings is not in your power), controlling the number of threads (commands: numactl and taskset), dealing with NUMA systems (i.e., how to get the same time measurement every time), measuring peak RAM allocation, measuring peak disk space allocation, cache-aligned memory allocation, managing data structures with complicated allocation patterns, asynchronous I/O (+ condition variables), and a bit about choosing the right buffer size when doing sequential disk scans.

The material comes from my experience of writing code and running experiments in C++ on the Linux/PC platform, but I think a lot of it transfers well to Mac/OSX. In many cases, the snippets I share can also be found inside the implementations of my code on my Github profile (most of the code I've written is still, however, available on the PADS webpage; I am in the process of updating it and migrating to Github). For more updates on current activities, check out my Twitter page.