[REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang

Wed Aug 4 19:48:36 UTC 2021

Excerpts from Linus Torvalds's message of August 4, 2021 12:31 pm:
> Your program is buggy.
> 
> On Wed, Aug 4, 2021 at 8:37 AM Alex Xu (Hello71) <alex_y_xu at yahoo.ca> wrote:
>>
>>     pipe(pipefd);
>>     printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
>>     printf("new buffer:  %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
> 
> Yeah, what did you expect this to do? You said you want a minimal
> buffer, you get a really small buffer.
> 
> Then you try to write multiple messages to the pipe that you just said
> should have a minimum size.
> 
> Don't do that then.
> 
>> /proc/x/stack shows that the remaining thread is hanging at pipe.c:560.
>> It looks like not only there needs to be space in the pipe, but also
>> slots.
> 
> Correct. The fullness of a pipe is not about whether it has the
> possibility of merging more bytes into an existing not-full slot, but
> about whether it has empty slots left.
> 
> Part of that is simply the POSIX pipe guarantees - a write of size
> PIPE_BUF or less is guaranteed to be atomic, so it mustn't be split
> among buffers.
> 
> So a pipe must not be "writable" unless it has space for at least that
> much (think select/poll, which don't know the size of the write).
> 
> The fact that we might be able to reuse a partially filled buffer for
> smaller writes is simply not relevant to that issue.
> 
> And yes, we could have different measures of "could write" for
> different writes, but we just don't have or want that complexity.
> 
> Please don't mess with F_SETPIPE_SZ unless you have a really good
> reason to do so, and actually understand what you are doing.
> 
> Doing a F_SETPIPE_SZ, 0 basically means "I want the mimimum pipe size
> possible". And that one accepts exactly one write at a time.
> 
> Of course, the exact semantics are much more complicated than that
> "exactly one write". The pipe write code will optimistically merge
> writes into a previous buffer, which means that depending on the
> pattern of your writes, the exact number of bytes you can write will
> be very different.
> 
> But that "merge writes into a previous buffer" only appends to the
> buffer - not _reuse_ it - so when each buffer is one page in size,
> what happens is that you can merge up to 4096 bytes worth of writes,
> but then after that the pipe write will want a new buffer - even if
> the old buffer is now empty because of old reads.
> 
> That's why your test program won't block immediately: both writers
> will actually start out happily doing writes into that one buffer that
> is allocated, but at some point that buffer ends, and it wants to
> allocate a new buffer.
> 
> But you told it not to allocate more buffers, and the old buffer is
> never completely empty because your readers never read _everythign_,
> so it will hang, waiting for you to empty the one minimal buffer it
> allocated. And that will never happen.
> 
> There's a very real reason why we do *not* by default say "pipes can
> only ever use only one buffer".
> 
> I don't think this is a regression, but if you have an actual
> application - not a test program - that does crazy things like this
> and used to work (I'm not sure it has ever worked, though), we can
> look into making it work again.
> 
> That said, I suspect the way to make it work is to just say "the
> minimum pipe size is two slots" rather than change the "we want at
> least one empty slot". Exactly because of that whole "look, we must
> not consider a pipe that doesn't have a slot writable".
> 
> Because clearly people don't understand how subtle F_SETPIPE_SZ is.
> It's not really a "byte count", even though that is how it's
> expressed.
> 
>                    Linus
> 

I agree that if this only affects programs which intentionally adjust 
the pipe buffer size, then it is not a huge issue. The problem, 
admittedly buried very close to the bottom of my email, is that the 
kernel will silently provide one-page pipe buffers if the user has 
exceeded 16384 (by default) pipe buffer pages allocated. Try this 
slightly more complicated program:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int pipefd[2];

void *thread_start(void *arg) {
    char buf[1];
    int i;
    for (i = 0; i < 1000000; i++) {
        read(pipefd[0], buf, sizeof(buf));
        if (write(pipefd[1], buf, sizeof(buf)) == -1)
            break;
    }
    printf("done %d\n", i);
    return NULL;
}

int main() {
    signal(SIGPIPE, SIG_IGN);
    // pretend there are a very large number of make processes running
    for (int i = 0; i < 1025; i++)
    {
        pipe(pipefd);
        write(pipefd[1], "aa", 2);
    }
    printf("%d %d\n", pipefd[0], pipefd[1]);
    printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
    //printf("new buffer:  %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
    pthread_t thread1, thread2;
    pthread_create(&thread1, NULL, thread_start, NULL);
    pthread_create(&thread2, NULL, thread_start, NULL);
    sleep(3);
    close(pipefd[0]);
    close(pipefd[1]);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
}

You may need to raise your RLIMIT_NOFILE before running the program.

With default settings (fs.pipe-user-pages-soft = 16384) the init buffer 
will be 4096, no mangling required. I think this could be probably be 
solved if the kernel instead reduced over-quota pipes to two pages 
instead of one page. If someone wants to set a one-page pipe buffer, 
then they can suffer the consequences, but the kernel shouldn't silently 
hand people that footgun.

Regards,
Alex.