Channel: my tech blog » Linux

Octave: Empty plots (after “figure”)


Running Octave 4.2.2 on Linux Mint 19, I occasionally got plots with nothing in them. Solution: Change the graphics toolkit to gnuplot.

Simply put, add a ~/.octaverc file reading

graphics_toolkit("gnuplot")

and rerun Octave.

By the way, for zooming in, right-click the mouse on the first point, and left-click on the second.


Linux CUSE (and FUSE): Why I ditched two months of work with it


Introduction

If you’re planning to use CUSE (or FUSE) for an application you care about, this post is for you. That includes my future self. I’m summarizing my not-so-pleasant journey with this framework here, with focus on how I gradually realized that I should start from scratch with an old-school kernel module instead.

Most importantly, if you run CUSE on a v5.0 to v5.3 Linux kernel, you’re in for an imminent OOPS that requires an immediate reboot of the computer. This was the final straw for me (more like a huge log). Even if the user-space driver detected the kernel version and refused to run on kernels that would crash, that would mean it wouldn’t run on the most common distributions at the time of release. And I asked myself whether I want to depend on a subsystem that is maintained this way.

Maybe I should have listened to what Linus had to say about FUSE back in 2011:

People who think that userspace filesystems are realistic for anything but toys are just misguided.

Unfortunately, it seems like the overall attitude towards FUSE is more or less in that spirit, hence nobody gets alarmed when the relevant code gets messier than is usually allowed: FUSE is nice for that nifty GUI that allows me to copy some files from my smartphone to the computer over a USB cable. It fails when there are many files, but what did I expect. Maybe it’s a problem with FUSE, maybe with the MTP/PTP protocol, but the real problem is that it’s treated as a toy.

As for myself, I was tempted to offer a user-space device driver for a USB product I’ve designed. A simple installation, possibly from binaries, running on virtually any computer. CUSE has been around for many years, and opens a file in /dev with my name of choice. It makes the device file behave as if it were backed by a driver in the kernel (more or less). What could possibly go wrong?

And a final note before the storytelling: This post was written in the beginning of 2020. Sometimes things change after a while. Not that they usually do, but who knows?

Phase I: Why I opted out of libfuse

The natural and immediate choice for working with FUSE is to use its ubiquitous library, libfuse. OK, how does the API go? How does it work?

libfuse’s git commits date back to 2001, and the project is alive by all means, with several commits and version updates every month. As for documentation, the doc/ subdirectory doesn’t help much, and its mainpage.dox says it straight out:

The authoritative source of information about libfuse internals (including the protocol used for communication with the FUSE kernel module) is the source code.

Simply put, nothing is really documented, read the source and figure it out yourself. There’s also an example/ directory with example code, showing how to get it done. Including a couple of examples for CUSE. But no API at all. Nothing on the fine details that make the difference between “look it works, oops, now it doesn’t” and something you can rely upon.

As for the self-documenting code, it isn’t a very pleasant experience, as it’s clearly written in “hack now, clean up later (that is, never)” style.

There are however scattered pieces of documentation, for example:

So with the notion that messy code is likely to bite back, I decided to skip libfuse and talk with /dev/cuse directly. I mean, kernel code can’t be that messy, can it?

It took me quite some time to reverse-engineer the CUSE protocol, and I’ve written a couple of posts on this matter: This and this.

Phase II: Accessing /dev/cuse causing a major OOPS

After nearly finishing my CUSE-based (plus libusb and epoll) driver on a Linux v4.15 machine, I gave it a test run on a different computer, running kernel v5.3. And that went boooom.

Namely, just when trying to close /dev/cuse, an OOPS message as follows appeared, leaving Linux limping, requiring an immediate reboot:

kernel: BUG: spinlock bad magic on CPU#0, cat/951
kernel: general protection fault: 0000 [#1] PREEMPT SMP PTI
kernel: CPU: 0 PID: 951 Comm: cat Tainted: G           O      5.3.0-USBTEST1 #1
kernel: RIP: 0010:spin_bug+0x6a/0x96
kernel: Code: 04 00 00 48 8d 88 88 06 00 00 48 c7 c7 90 ef d5 81 e8 8c af 00 00 41 83 c8 ff 48 85 db 44 8b 4d 08 48 c7 c1 85 ab d9 81 74 0e <44> 8b 83 c8 04 00 00 48 8d 8b 88 06 00 00 8b 55 04 48 89 ee 48 c7
kernel: RSP: 0018:ffffc900008abe18 EFLAGS: 00010202
kernel: RAX: 0000000000000029 RBX: 6b6b6b6b6b6b6b6b RCX: ffffffff81d9ab85
kernel: RDX: 0000000000000000 RSI: ffff88816da16478 RDI: 00000000ffffffff
kernel: RBP: ffff88815a109248 R08: 00000000ffffffff R09: 000000006b6b6b6b
kernel: R10: ffff888159b58c50 R11: ffffffff81c5cd00 R12: ffff88816ae00010
kernel: R13: ffff88816a165e78 R14: 0000000000000012 R15: 0000000000008000
kernel: FS:  00007ff8be539700(0000) GS:ffff88816da00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007ffe4fa2fee8 CR3: 000000016b5d0002 CR4: 00000000003606f0
kernel: Call Trace:
kernel: do_raw_spin_lock+0x19/0x84
kernel: fuse_prepare_release+0x3b/0xe7 [fuse]
kernel: fuse_sync_release+0x37/0x49 [fuse]
kernel: cuse_release+0x16/0x22 [cuse]
kernel: __fput+0xf0/0x1c2
kernel: task_work_run+0x73/0x86
kernel: exit_to_usermode_loop+0x4e/0x92
kernel: do_syscall_64+0xc9/0xf4
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9

OK, so why did this happen?

To make a long story short, because a change was made in the FUSE kernel code without testing it on CUSE. I mean, no test at all.

The gory details: A spinlock was added to struct fuse_inode, but someone forgot that CUSE doesn’t have such a struct related to it, because it’s on /dev and not on a FUSE mounted filesystem. Small mistake, no test, big oops.

Even more gory details: Linux kernel commit f15ecfef058d94d03bdb35dcdfda041b3de9d543 adds a spinlock check in fuse_prepare_release() (among others), saying

	if (likely(fi)) {
		spin_lock(&fi->lock);
		list_del(&ff->write_entry);
		spin_unlock(&fi->lock);
	}

For this to even be possible, an earlier commit (ebf84d0c7220c7c9b904c405e61175d2a50cfb39) adds a struct fuse_inode *fi argument to fuse_prepare_release(), and also makes sure that it’s populated correctly. In particular, in cuse.c, it goes:

struct fuse_inode *fi = get_fuse_inode(inode);

(what would I do without git blame?).

But wait. What inode? Apparently, the idea was to get the inode’s struct fuse_inode, which is allocated and initialized by fuse_alloc_inode() in fs/fuse/inode.c. However, this function is called only as a superblock operation, in other words, when the kernel wants to create a new inode on a mounted FUSE filesystem. A CUSE device file doesn’t have such an entry allocated at all!

get_fuse_inode() is just a container_of(). In other words, it assumes that @inode is an entry inside a struct fuse_inode, and returns the address of the struct. But it’s not. In the CUSE case, the inode belongs to a devfs or something. get_fuse_inode() returns just some random address, and no wonder do_raw_spin_lock() whines that it’s called on something that isn’t a spinlock at all.
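To illustrate what goes wrong, here is a minimal user-space sketch of the arithmetic behind container_of(). The struct and function names (everything ending with _sketch, outer_from_inode(), inode_belongs_to()) are made up for this example and are not the kernel’s definitions. Only the principle is the same: the address of the outer struct is obtained by subtracting the member’s offset from the member’s address, with no check that there is an outer struct there at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins, NOT the kernel's structures. */
struct inode_sketch {
	unsigned long i_ino;
};

struct fuse_inode_sketch {
	unsigned int lock_magic;    /* plays the role of the spinlock */
	struct inode_sketch inode;  /* the embedded inode */
};

/*
 * Same arithmetic as container_of(): subtract the member's offset from
 * the member's address. Done on uintptr_t, so nothing is dereferenced.
 */
uintptr_t outer_from_inode(const struct inode_sketch *inode)
{
	assert(inode != NULL);
	return (uintptr_t)inode - offsetof(struct fuse_inode_sketch, inode);
}

/* Returns 1 if the computed address really is @fi, 0 otherwise. */
int inode_belongs_to(const struct inode_sketch *inode,
		     const struct fuse_inode_sketch *fi)
{
	return outer_from_inode(inode) == (uintptr_t)fi;
}
```

For an inode that is embedded in a struct fuse_inode_sketch, inode_belongs_to() returns 1. For a standalone inode, which is the CUSE situation, the computed address just points at whatever happens to lie in memory before the inode.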

The relevant patches were submitted by Kirill Tkhai and committed by Miklos Szeredi. Neither of whom made the simplest go-no-go test on CUSE after this change, of course, or they would have spotted the problem right away. What damage could a simple change in cuse.c make, after all?

The patch that fixed it

This issue is fixed in kernel commit 56d250ef9650edce600c96e2f918b9b9bafda85e (effective in kernel v5.4) by Miklos Szeredi, saying “It’s a small wonder it didn’t blow up until now”. Irony at its best. He should have written “the FUSE didn’t blow”.

So this bug lived from v5.0 to v5.3 (inclusive), something like 8 months. 8 months without a single minimal regression test by the maintainer or anyone else.

The patch removes the get_fuse_inode() call in cuse.c, and calls fuse_prepare_release() with a NULL instead. Meaning there is no inode, like it should.

FUSE / CUSE signal handling: The very gory details


First: If you’re planning on using FUSE / CUSE for an application, be sure to read this first. It also explains why I didn’t just take what libfuse offered.

Overview

This is a detour from another post of mine, which dissects the FUSE / CUSE kernel driver. I wrote this separate post on signal handling because of some confusion on the matter, which ended up with little to phone home about.

To understand why signals are a tricky issue, suppose that an application program is blocking on a read() from a /dev file that is generated by CUSE. The server (i.e. the driver of this device file in userspace) has collected some of the data, and is waiting for more, which is why it doesn’t complete the request. And then a “harmless” signal (say, SIGCHLD) is sent to the application program.

Even though that program is definitely not supposed to terminate on that signal, the read() should return ASAP. And because the server has already collected some data (and possibly consumed it from its source), the read() should return with the number of bytes already read, and not with an -EINTR (which is the response if there is no data when returning on an interrupt).

So the FUSE / CUSE must notify the server that an interrupt has arrived, so that the relevant request is finished quickly, this way or another. To make things even trickier, it might very well be, that while notification on the interrupt is being prepared and sent to the server, the server has already finished the request, and is in the middle of returning the response.

Luckily, the FUSE / CUSE kernel interface offers a simple solution to this: An INTERRUPT request is sent to the server in response to an interrupt to the application program, with a unique ID number that matches a previously sent request. The server responds by returning a response for the said request as usual, possibly with an -EINTR status, exactly like a kernel character driver’s response to a signal.

The only significant race condition is when the server has already finished handling the request, for which the INTERRUPT request arrives, and has therefore forgotten the unique ID that comes with it. In this case, the server can simply ignore the INTERRUPT request — it has done the right thing anyhow.
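A rough sketch of this logic on the server side, assuming a simple table of read() requests that haven’t been answered yet. All names here (struct pending_read, struct reply, remember_request(), handle_interrupt()) are hypothetical and belong neither to libfuse nor to the kernel’s interface. The @unique passed to handle_interrupt() is that of the original request, and actually writing the response to /dev/cuse is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16  /* arbitrary limit for this sketch */

/* One read() request that the server hasn't answered yet. */
struct pending_read {
	uint64_t unique;       /* @unique of the original request */
	uint32_t bytes_ready;  /* data collected so far */
	int in_use;
};

struct server_state {
	struct pending_read pending[MAX_PENDING];
};

/* What to send back for the original request. */
struct reply {
	uint64_t unique;
	int32_t error;         /* 0 or a negative errno value */
	uint32_t payload_len;  /* number of data bytes to return */
};

/* Remember a request that can't be completed right away. */
int remember_request(struct server_state *s, uint64_t unique)
{
	assert(s != NULL);
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (!s->pending[i].in_use) {
			s->pending[i] = (struct pending_read){
				.unique = unique, .bytes_ready = 0, .in_use = 1
			};
			return 0;
		}
	}
	return -ENOMEM;
}

/*
 * Handle an INTERRUPT for the request @unique. If that request is still
 * pending, fill in @r (partial data if there is any, -EINTR otherwise),
 * forget the request and return 1. If it's unknown, it has already been
 * answered: return 0, and the INTERRUPT is simply ignored.
 */
int handle_interrupt(struct server_state *s, uint64_t unique, struct reply *r)
{
	assert(s != NULL && r != NULL);
	for (size_t i = 0; i < MAX_PENDING; i++) {
		struct pending_read *p = &s->pending[i];

		if (!p->in_use || p->unique != unique)
			continue;
		r->unique = unique;
		r->payload_len = p->bytes_ready;
		r->error = p->bytes_ready ? 0 : -EINTR;
		p->in_use = 0;
		return 1;
	}
	return 0;
}
```

Note that the “too late” case needs no special treatment: The lookup fails, and nothing is sent.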

So why this long post? Because I wanted to be sure, and because the little documentation there is on this topic, as well as the implementation in libfuse are somewhat misleading. Anyhow, the bottom line has already been said, if you’d like to TL;DR this post.

The official version

There is very little documentation on FUSE in general, however there is a section in the kernel source tree’s Documentation/filesystems/fuse.txt:

If a process issuing a FUSE filesystem request is interrupted, the following will happen:

  1. If the request is not yet sent to userspace AND the signal is fatal (SIGKILL or unhandled fatal signal), then the request is dequeued and returns immediately.
  2. If the request is not yet sent to userspace AND the signal is not fatal, then an “interrupted” flag is set for the request. When the request has been successfully transferred to userspace and this flag is set, an INTERRUPT request is queued.
  3. If the request is already sent to userspace, then an INTERRUPT request is queued.

INTERRUPT requests take precedence over other requests, so the userspace filesystem will receive queued INTERRUPTs before any others.

The userspace filesystem may ignore the INTERRUPT requests entirely, or may honor them by sending a reply to the original request, with the error set to EINTR.

It is also possible that there’s a race between processing the original request and its INTERRUPT request. There are two possibilities:

  1. The INTERRUPT request is processed before the original request is processed
  2. The INTERRUPT request is processed after the original request has been answered

If the filesystem cannot find the original request, it should wait for some timeout and/or a number of new requests to arrive, after which it should reply to the INTERRUPT request with an EAGAIN error. In case 1 the INTERRUPT request will be requeued. In case 2 the INTERRUPT reply will be ignored.

The description above is correct (see the detailed dissection of the kernel code below). However, beginning from the “race condition” part, it gets somewhat confusing.

Race condition?

In the rest of this post, there’s a detailed walkthrough of the involved functions in the v5.3.0 kernel, and there’s apparently no chance for the race condition mentioned in fuse.txt. It’s not even an old bug that was relevant when interrupt handling was introduced with Git commit a4d27e75ffb7b (where the text cited above in fuse.txt was added as well): Even looking at the original commit, there’s a clear locking mechanism that prevents any race condition in the kernel code. This was later replaced with memory barriers, which should work just the same.

All in all: An INTERRUPT request is queued, if at all, only after the related request has been submitted as the answer to a read() by the server.

So what is this all about, then? A multi-threaded server, which spreads requests randomly among work threads, might indeed handle requests in a random order. It seems like this is what the “race condition” comment refers to.

The solution to the non-existing problem

Had there been a possibility that an INTERRUPT request may arrive before the request it relates to, the straightforward solution would be to maintain an orphan list of Unique IDs of INTERRUPT requests that didn’t have a request processed when the INTERRUPT request arrived. This list would then be filled with INTERRUPT requests that arrived too early (before the related request) or too late (after the request was processed).

Then, for each non-INTERRUPT request that arrives, see if it’s in the list, and if so, remove the Unique ID from the list, and treat the request as interrupted.

But the requests that were added into the list because of the “too late” scenario will never get off the list this way. So some garbage collection mechanism is necessary.

The FUSE driver facilitates this by allowing a response with an -EAGAIN status to INTERRUPT requests. Even though no response is needed to INTERRUPT requests, an -EAGAIN response will cause the repeated queuing of the INTERRUPT request by the kernel if the related request is still pending, and otherwise do nothing.

So occasionally, the server may go through its list of orphans, and send an -EAGAIN response to each entry, and delete this entry as the response is sent. If the deleted entry is still relevant, it will be re-sent by the kernel, so it’s re-listed (or possibly handled directly if the related request has arrived in the meantime). Entries from the “too late” scenario won’t be re-listed, because the kernel will do nothing in reaction to the -EAGAIN response.
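The bookkeeping described above could look roughly like this. It’s a sketch with made-up names (struct orphan_list, orphan_add(), orphan_claim(), orphan_drain()) and a fixed-size array for simplicity. Writing the -EAGAIN responses to the kernel is left to the caller, and this is not meant to reflect how libfuse lays out its data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ORPHANS 32  /* arbitrary limit for this sketch */

struct orphan_list {
	uint64_t unique[MAX_ORPHANS];
	size_t count;
};

/* An INTERRUPT arrived, but no matching request is known: list it. */
int orphan_add(struct orphan_list *l, uint64_t unique)
{
	assert(l != NULL);
	if (l->count == MAX_ORPHANS)
		return -1;
	l->unique[l->count++] = unique;
	return 0;
}

/*
 * A regular request arrived. Returns 1 if its INTERRUPT came first, in
 * which case the entry is unlisted and the request should be treated as
 * interrupted. Returns 0 otherwise.
 */
int orphan_claim(struct orphan_list *l, uint64_t unique)
{
	assert(l != NULL);
	for (size_t i = 0; i < l->count; i++) {
		if (l->unique[i] == unique) {
			l->count--;
			l->unique[i] = l->unique[l->count]; /* fill the hole */
			return 1;
		}
	}
	return 0;
}

/*
 * Garbage collection: move up to @max entries into @out and delete them
 * from the list. The caller answers each with -EAGAIN. Entries that are
 * still relevant get re-sent by the kernel, the others silently vanish.
 */
size_t orphan_drain(struct orphan_list *l, uint64_t *out, size_t max)
{
	size_t n = 0;

	assert(l != NULL && out != NULL);
	while (l->count > 0 && n < max)
		out[n++] = l->unique[--l->count];
	return n;
}
```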

This is the solution suggested in fuse.txt on the race conditions issue. The reason this solution is suggested in the kernel’s documentation, even though it relates to a problem in a specific software implementation, is probably to explain the motivation to the -EAGAIN feature. But boy, was it confusing.

How libfuse handles INTERRUPT requests

Spoiler: The solution to the non-existent problem is implemented in libfuse 3.9.0 (and way back) as described above. The related comment was written based upon a problem that arose with libfuse. Which is multithreaded, of course.

The said garbage collection mechanism is run on the first entry in the list of orphaned INTERRUPT requests each time a non-INTERRUPT request arrives and has no match against any of the list’s members. This ensures that the list is emptied quite quickly, and without risk of an endless loop circulation of INTERRUPT requests, because the arrival of a non-INTERRUPT request means that the queue for INTERRUPT requests in the kernel was empty at that moment. A quirky solution to a quirky problem.

Note that even when libfuse is run with debug output, it’s difficult to say anything about ordering, as the debug output shows processing, not arrival. And the log messages come from different threads.

The problem of unordered processing of INTERRUPT requests could have been solved much more elegantly of course, but libfuse is a patch on patch, so they made another one.

And for the interested, this is the which-function-calls-what in libfuse.

So in libfuse’s fuse_lowlevel.c, the method for handling interrupts, do_interrupt(), first attempts to find the related request, and if it fails, it adds an entry to a session-specific list, se->interrupts. Then there’s check_interrupt(), which is called by fuse_session_process_buf_int() for each arriving request that isn’t an INTERRUPT itself. This function looks up the list for the request, and if it’s found, it sets that request’s “interrupted” flag, and removes it from the list. Otherwise, if the list is non-empty, it removes the first entry of se->interrupts and returns it to the caller, which initiates an EAGAIN for that.

Read the source

Since this is an important topic, let’s look at how this is implemented. So from this point until the end of this post, these are dissection notes of the v5.3.0 kernel source. There are commits applied all the time in this region, but in essence it seems to have been the same for a long time.

Generally speaking, all I/O operations that are initiated by the application program (read(), write(), etc.) end up with the setup of a fuse_req structure containing the request information in file.c, and its submission to the server front-end with a call to fuse_request_send(), which is defined in dev.c. If the I/O is asynchronous, fuse_async_req_send() is called instead, but that’s irrelevant for the flow discussed now. fuse_request_send() calls __fuse_request_send(), which in turn calls queue_request() which puts the request in the queue, and more importantly, request_wait_answer(), which puts the process to sleep until the request is completed (or something else happens…).

And now details…

So what does request_wait_answer() do? First, let’s get acquainted with some of the flags that are related to each request (i.e. in struct fuse_req’s flags entry), see also fuse_i.h:

  • FR_FINISHED: request_end() has been called for this request, which happens when the response for this request has arrived (but not processed yet — when that is done the request is freed). Or when it has been aborted for whatever reason (and once again, the error has not been processed yet).
  • FR_PENDING: The request is on the list of requests for transmission to the server. The flag is set when the fuse_req structure of a new request is initialized, and cleared when fuse_dev_do_read() has completed a server’s read() request. Or alternatively, failed for some reason, in which case request_end() has been called to complete the request with an error. So when it’s set, the request has not been sent to the server, but when cleared, it doesn’t necessarily mean it has.
  • FR_SENT: The request has been sent to the server. This is set by fuse_dev_do_read() when nothing can fail anymore. It differs from !FR_PENDING in that FR_PENDING is cleared when there’s an error as well.
  • FR_INTERRUPTED: This flag is set if an interrupt arrived while waiting for a response from the server.
  • FR_FORCE: Force sending of the request even if interrupted
  • FR_BACKGROUND: This is a background request. Irrelevant for the blocking scenario discussed here.
  • FR_LOCKED: Never mind this: It only matters when tearing down the FUSE connection and aborting all requests, and it determines the order in which this is done. It means that data is being copied to or from the request.

request_wait_answer()

With this at hand, let’s follow request_wait_answer() step by step:

  • Wait (with wait_event_interruptible(), sleeping) for FR_FINISHED to be set. Simply put, wait until a response or any interrupt has arrived.
  • If FR_FINISHED is set, the function returns (it’s a void function, and has no return value).
  • If any interrupt occurred while waiting, set FR_INTERRUPTED and check FR_SENT. If the latter was set, call queue_interrupt() to queue the request on the list of pending interrupt requests (unless it is already queued, as fixed in commit 8f7bb368dbdda. The same struct fuse_req is likely to be listed in two lists; one for the pending request and the second for the interrupt).

Note that the three bullets above are skipped if the FUSE connection had the “no_interrupt” flag set on invocation of request_wait_answer(). This flag is set if the server has answered any interrupt request in the current session’s past with an -ENOSYS.

  • Wait again for FR_FINISHED to be set, now with wait_event_killable(). This puts the process in the TASK_KILLABLE state, so it returns only when the condition is met or on a fatal signal. If wait_event_interruptible() was awakened by a fatal signal to begin with, there will be no waiting at all at this stage (because the signal is still pending).
  • If FR_FINISHED is set, the function returns. This means that a response has been received for the request itself. As explained below, this is unrelated to the interrupt request’s fate.
  • Otherwise, there’s a fatal signal pending. If FR_PENDING is set (the request has not been sent to the server yet), the request is removed from the queue for transmission to the server (with due locking). Its status is set to -EINTR, and the function returns.

Note that these three bullets are skipped if the FR_FORCE flag is set for this request. And then, there’s the final step if none of the above got the function to return:

  • Once again, wait for FR_FINISHED to be set, but this time with the dreaded, non-interruptible wait_event(). In simple words, if the server doesn’t return a response for the request, the application that is blocking on the I/O call is sleeping and non-killable. This is not so bad, because if the server is killed (and hence closes /dev/fuse or /dev/cuse), all its requests are marked with FR_FINISHED.

To see the whole picture, a close look is needed on fuse_dev_do_read() and fuse_dev_do_write(), which are the functions that handle the request and response communication (respectively) with the driver.

fuse_dev_do_write()

Starting with fuse_dev_do_write(), which handles responses: After a few sanity checks (e.g. that the data lengths are in order), it looks up the request based upon the @unique field (for responses to interrupt requests, the original request is looked for). If the request isn’t found, the function returns with -ENOENT.

If the response has an odd @unique field, it’s an interrupt request response. If the @error field is -ENOSYS, the “no_interrupt” flag is set for the current connection (see above). If it’s -EAGAIN, another interrupt request is queued immediately. Otherwise the interrupt request response is ignored and the function returns. In other words, except for the two error codes just mentioned, it’s pointless to send them. The desired response to an interrupt request is to complete the original request, not responding to the interrupt request.
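To make the two useful kinds of responses concrete, here is a small sketch that fills in response headers. The struct mirrors struct fuse_out_header as I understand it from include/uapi/linux/fuse.h (please verify against your own headers), and the helper names make_reply(), complete_with_eintr() and requeue_interrupt() are made up. Writing the header to /dev/cuse is not shown.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirror of struct fuse_out_header; check it against your kernel headers. */
struct fuse_out_header_sketch {
	uint32_t len;     /* total length of the response, header included */
	int32_t error;    /* 0 or a negative errno value */
	uint64_t unique;  /* copied from the request */
};

/* Build a response without data, carrying only an error status. */
struct fuse_out_header_sketch make_reply(uint64_t unique, int err)
{
	struct fuse_out_header_sketch out;

	assert(err > 0);  /* a positive errno value, e.g. EINTR */
	out.len = sizeof(out);
	out.error = -err;
	out.unique = unique;
	return out;
}

/* The desired reaction: finish the ORIGINAL request with -EINTR. */
struct fuse_out_header_sketch complete_with_eintr(uint64_t orig_unique)
{
	return make_reply(orig_unique, EINTR);
}

/*
 * The -EAGAIN response to the interrupt request itself. Its @unique is
 * the original one with bit 0 set, which makes it odd.
 */
struct fuse_out_header_sketch requeue_interrupt(uint64_t orig_unique)
{
	return make_reply(orig_unique | 1, EAGAIN);
}
```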

So now to handling regular responses: The first step is to clear FR_SENT, which sort-of breaks the common sense meaning of this flag, but it’s probably a small hack to reduce the chance of an unnecessary interrupt request, as the original request is just about to finish.

The response’s content is then copied into kernel memory, and request_end() is called, which sets FR_FINISHED, then removes the request from the queue of pending interrupts (if it’s queued there), and after that it returns with due error code (typically success).

So not much interesting here.

fuse_dev_do_read() step by step

The function returns with -EAGAIN if /dev/fuse or /dev/cuse was opened in non-blocking mode, and there’s no data to supply. Otherwise, it waits with wait_event_interruptible_exclusive_locked() until there’s a request to send to the server in any of the three queues (INTERRUPT, FORGET or regular requests queues). If the server process got an interrupt, the wait function returns with -ERESTARTSYS, and so does this function (is this a bug? It should be -EINTR).

First, the queue of pending interrupts is checked. If there’s any entry there, fuse_read_interrupt() is called, which generates a FUSE_INTERRUPT request with the @unique field set to the original request’s @unique, ORed with FUSE_INT_REQ_BIT (which equals 1). The request is copied into the user-space buffer, and fuse_dev_do_read() returns with the size of this request.

Second, FORGET requests are submitted, if such are queued.

If none of the INTERRUPT and FORGET were sent, the first entry in the request queue is dequeued, and its FR_PENDING flag is cleared. The I/O data handling then takes place.

Just before returning, the FR_SENT flag is set, and then FR_INTERRUPTED is checked. If the latter is set, queue_interrupt() is called to queue the request on the list of pending interrupt requests (unless it is already queued. Once again, the same struct fuse_req is likely to be listed in two lists; one for the pending request and the second for the interrupt). Together with request_wait_answer(), this ensures that an interrupt is queued as soon as FR_SENT is set: If the waiting function returned before FR_SENT is set, FR_INTERRUPTED is set by request_wait_answer() before checking FR_SENT, so fuse_dev_do_read() will call queue_interrupt() after setting FR_SENT. If the waiting function returned after FR_SENT is set, request_wait_answer() will call queue_interrupt(). And in case of a race condition, both will call this function; note that each of the two racers sets one flag and then checks the other one, in reverse order with respect to each other. And calling queue_interrupt() twice results in queuing the interrupt request only once.
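This set-one-flag-then-check-the-other handshake can be modeled in user space with C11 atomics. The sketch below is only a toy: struct toy_req and the toy_*() functions are invented names, and sequentially consistent atomics stand in for the kernel’s bit operations and memory barriers. It shows why at least one of the two sides always ends up queuing the interrupt, and why it’s queued only once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* A toy request with just the pieces the handshake needs. */
struct toy_req {
	atomic_bool sent;         /* plays the role of FR_SENT */
	atomic_bool interrupted;  /* plays the role of FR_INTERRUPTED */
	atomic_bool intr_queued;  /* "already on the interrupt list" */
	atomic_int times_queued;  /* for demonstration only */
};

void toy_req_init(struct toy_req *req)
{
	assert(req != NULL);
	atomic_init(&req->sent, false);
	atomic_init(&req->interrupted, false);
	atomic_init(&req->intr_queued, false);
	atomic_init(&req->times_queued, 0);
}

/* Idempotent, like queue_interrupt(): only the first call has an effect. */
void toy_queue_interrupt(struct toy_req *req)
{
	if (!atomic_exchange(&req->intr_queued, true))
		atomic_fetch_add(&req->times_queued, 1);
}

/* Waiter side, cf. request_wait_answer(): set INTERRUPTED, check SENT. */
void toy_signal_arrived(struct toy_req *req)
{
	atomic_store(&req->interrupted, true);
	if (atomic_load(&req->sent))
		toy_queue_interrupt(req);
}

/* Reader side, cf. fuse_dev_do_read(): set SENT, check INTERRUPTED. */
void toy_request_sent(struct toy_req *req)
{
	atomic_store(&req->sent, true);
	if (atomic_load(&req->interrupted))
		toy_queue_interrupt(req);
}
```

If the signal arrives first, nothing is queued until toy_request_sent() runs. If the request was sent first, toy_signal_arrived() does the queuing. If the two run concurrently, whichever side stores its flag last is bound to see the other side’s flag, so the interrupt can’t fall between the chairs.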

FUSE / CUSE kernel driver dissection notes


What this post is about

Before anything: If you’re planning on using FUSE / CUSE for an application, be sure to read this first. It also explains why I bothered looking at the kernel code instead of using libfuse.

So these are some quite random notes I took while trying to figure out how to talk with /dev/cuse directly by reading the kernel sources. I’m probably not going to touch CUSE with a five-foot stick again, so maybe this will help someone out there.

Everything said here relates to Linux v5.3. As FUSE is a bit of a hack-on-demand kind of filesystem, things change all the time.

CUSE vs. FUSE

CUSE is FUSE’s little brother, allowing the generation of a single device file in /dev, with the driver implemented in user space. Compared with FUSE’s ability to mount an entire filesystem, CUSE is much lighter, and is accordingly implemented as a piggy-back on the FUSE driver.

CUSE and FUSE are reached from user space through different device files: A server (i.e. driver) for FUSE opens /dev/fuse, and a server for CUSE opens /dev/cuse.

Note that the user application program that opens /dev/cuse or /dev/fuse is called the server. It’s actually a driver, but the latter term is saved for the FUSE kernel framework.

The driver for /dev/cuse is implemented entirely in fs/fuse/cuse.c, and it does quite little: All file operation methods for /dev/cuse are redirected to those used for /dev/fuse (by literally copying the list of methods), except for open and release.

The CUSE-specific method for open runs a slightly different initialization procedure against the server (more about this below) and eventually generates a character device file instead of making a filesystem mountable.

This character device file is assigned I/O methods that are handled in cuse.c, however their implementation relies heavily on functions that are defined in the mainline FUSE driver. Effectively, this device file is a FUSE file which is forced to use “direct I/O” methods to present a data pipe abstraction.

It might very well be that it’s possible to obtain the same result by setting up a small mounted filesystem with a file with certain settings, but I haven’t investigated this further. It seems however that the application program will have to open the file with the O_DIRECT flag for this to work. See Documentation/filesystems/fuse-io.txt in the kernel source tree.

The relevant source files

The FUSE filesystem handles I/O requests of two completely different types: Those related to the file system that is mounted in relation to it (or the device file generated on behalf of CUSE), and those related to the character device which the FUSE / CUSE server opens. This might cause a slight confusion, but the kernel code sticks to a naming convention that pretty much avoids it.

The interesting files in the kernel tree:

  • fs/fuse/file.c: Methods for handling I/O requests from the FUSE-mounted file system. The typical function name prefix is fuse_file_*.
  • fs/fuse/dev.c: Methods for handling I/O requests from /dev/fuse. The typical function name prefix is fuse_dev_*.
  • fs/fuse/cuse.c: CUSE-specific driver. Responsible for generating /dev/cuse, and making it behave quite like /dev/fuse. In fact, it routes a lot of function calls to the FUSE driver. The typical function name prefix is cuse_channel_* for methods handling I/O requests from /dev/cuse. Functions named just cuse_* are handlers for the CUSE-generated character device. Note that the /dev/cuse character device is referred to as the “channel” so it’s not confused with the other one.
  • include/uapi/linux/fuse.h: Header file with all structures and constants that are visible in user space.
  • fs/fuse/fuse_i.h: Header file with everything that isn’t visible from user space.

FUSE protocol

It’s probably necessary to be acquainted with writing a Linux kernel character device (at least) in order to understand the nuts and bolts of FUSE. It’s actually helpful to have worked with a device driver for Microsoft Windows as well, since the flow of I/O requests resembles the IRP concept in Windows’ driver model:

Each I/O request by the user space program goes into the kernel and is translated into a data structure which contains the information, and that data structure is handed over to the server (i.e. the driver in user space). The server queues the request for processing and acknowledges its reception, but not its completion. Rather, the server processes the request in its own free time, and when finished, it turns it back to the I/O system that requested it, along with the status and possibly data. If the user program blocks on the completion of the I/O system call (async I/O is also supported), it does so until the server turns back the request.

So there’s a flow of requests arriving from /dev/fuse (or /dev/cuse, as applicable), and a flow of responses written to the same file descriptor by the driver. The relation between the requests and responses is asynchronous (which is the main resemblance with IRPs), so the responses may arrive in no particular order.

The main difference from Windows’ IRP model is that Windows’ kernel makes calls to I/O operation handlers in the device driver (just like a Linux driver, but with the driver’s hands tied) with a pointer to the IRP. With FUSE, all requests go through a single pipe (good old UNIX design philosophy) and the driver chooses what to do with each. Also, in Windows, there’s a special treatment of requests that can be finished immediately — the driver can return with a status saying so. FUSE’s take on this matter is congratulations, finish the request and submit the response. Now or later makes no essential difference.

This way or another, the FUSE / CUSE server should not block or otherwise delay the reception of requests from /dev/fuse while handling a previous request (a Windows device driver is not allowed to block because it runs in arbitrary thread context, but that’s really irrelevant here). Even if it can’t or isn’t expected to handle another request before the current one is done, it must keep receiving requests while handling previous ones, at least for one reason: Accepting requests to handle a signal (interrupt) for an already queued request. More on that below.

The other side of the coin: A read() call from /dev/fuse or /dev/cuse may block, and will do so until there’s a request available to handle. On the other hand, a write() never blocks (which makes sense, since it merely informs the kernel driver that a request has been finished). The poll() system call is implemented, so epoll() and select() can be used on /dev/fuse and /dev/cuse, rather than blocking a thread on waiting for a request (libfuse doesn’t take advantage of this).
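As a rough illustration of waiting for a request without blocking in read(), here is a minimal sketch that uses the plain POSIX poll() call on a file descriptor. The helper name wait_for_request() is made up for this example, and the descriptor is assumed to be an already opened /dev/cuse or /dev/fuse (any pollable descriptor behaves the same way for the purpose of the sketch).

```c
#include <poll.h>

/* Hypothetical helper: wait until fd has something to read, or until
 * timeout_ms milliseconds have elapsed.
 * Returns 1 if fd is readable, 0 on timeout, -1 on error. */
static int wait_for_request(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc = poll(&pfd, 1, timeout_ms);

	if (rc < 0)
		return -1; /* poll() itself failed (errno says why) */
	if (rc == 0)
		return 0; /* nothing arrived within the timeout */

	/* Something happened on fd: only POLLIN means "go ahead and read()" */
	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

In a single-threaded server, the same descriptor would typically sit in one epoll or poll set together with the other event sources, so that no source starves the others.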

I/O requests

A request from /dev/fuse or /dev/cuse starts with a header of the following form (defined in the kernel’s include/uapi/linux/fuse.h and libfuse’s libfuse/include/fuse_kernel.h):

struct fuse_in_header {
	uint32_t	len;
	uint32_t	opcode;
	uint64_t	unique;
	uint64_t	nodeid;
	uint32_t	uid;
	uint32_t	gid;
	uint32_t	pid;
	uint32_t	padding;
};

The header is then followed by data that is related to the request, if necessary.

@len is the number of bytes in the request, including the fuse_in_header struct itself.

@opcode says what operation is requested, out of those listed in enum fuse_opcode in the same header files (the opcodes are also listed and explained on this page).

@unique is the identifier of the request, to be used in the response. Note that if bit 0 is set (i.e. @unique is odd), the request is an interrupt notification to another request (with the @unique after clearing bit 0). This is not true on all kernel versions however.

The rest — nodeid, uid, gid and pid are quite obvious. But it’s noteworthy that the process ID is exposed to the driver in user space.
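To make the layout concrete, here is a minimal sketch of how a server might validate the result of a single read() and locate the data that follows the header. The struct is a local mirror of struct fuse_in_header (real code would include the kernel’s header file instead), and parse_request() is a name invented for this example.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Local mirror of struct fuse_in_header, only to keep the sketch
 * self-contained. */
struct in_header {
	uint32_t len;
	uint32_t opcode;
	uint64_t unique;
	uint64_t nodeid;
	uint32_t uid;
	uint32_t gid;
	uint32_t pid;
	uint32_t padding;
};

/* Hypothetical helper: check one request as returned by read().
 * buf holds nread bytes. On success, returns 0, copies the header into
 * *hdr and points *payload at the bytes that follow it.
 * Returns -1 if the buffer can't be a complete request. */
static int parse_request(const uint8_t *buf, size_t nread,
			 struct in_header *hdr,
			 const uint8_t **payload, size_t *payload_len)
{
	if (nread < sizeof(*hdr))
		return -1; /* too short to even hold a header */

	/* memcpy() avoids any assumption on the alignment of buf */
	memcpy(hdr, buf, sizeof(*hdr));

	if (hdr->len != nread)
		return -1; /* @len covers the header plus the data */

	*payload = buf + sizeof(*hdr);
	*payload_len = nread - sizeof(*hdr);
	return 0;
}
```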

Each read() from /dev/{cuse,fuse} dequeues exactly one request from one of the kernel driver’s request queues: One for INTERRUPT requests, one for FORGET requests, and one for all the others. They are prioritized in this order (i.e. INTERRUPT requests go before any other etc.).

The read() call is atomic: It must request a number of bytes that is larger than or equal to the request’s @len, or the request is discarded and -EIO is returned instead. For this reason, the number of bytes requested in any read() from /dev/cuse or /dev/fuse must be at least max_write plus the size of the non-data parts of the request (struct fuse_in_header and the struct that is specific to the opcode). @max_write is the value submitted in the cuse_init_out structure in response to an INIT request (see below), and is expected to be 4096 at least.

However oddly enough, in libfuse’s fuse_lowlevel.c it says

	se->bufsize = FUSE_MAX_MAX_PAGES * getpagesize() +
		FUSE_BUFFER_HEADER_SIZE;

(se->bufsize is the size of the session’s buffer for arriving requests) and then libfuse’s fuse_i.h goes

#define FUSE_MAX_MAX_PAGES 256

but how is that an upper limit of something?

I/O responses

Responses are written by the server into the same file descriptor of /dev/fuse or /dev/cuse, starting with a header as follows:

struct fuse_out_header {
	uint32_t	len;
	int32_t		error;
	uint64_t	unique;
};

The meaning of @len and @unique is the same as in the request: @len includes the header itself, and @unique is a copy of the identifier of the request (with some extra care when handling interrupt requests).

@error is the status. Zero means success, negative numbers are in the well-known form of -EAGAIN, -EINVAL etc. It’s expected to be zero or negative (but not below -999). If it’s non-zero, the response must consist of a header only, or the write() call that submits the response returns -EINVAL.

A response write() is atomic as well: The number of bytes requested in the call must equal @len, or the call returns -EINVAL.
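The rules above can be sketched as a small function that assembles a complete response in one buffer, ready for a single write(). The struct is a local mirror of struct fuse_out_header, and build_response() is a hypothetical name. It only covers responses to requests (notifications, which use @error differently, are not handled here).

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Local mirror of struct fuse_out_header, to keep the sketch
 * self-contained. */
struct out_header {
	uint32_t len;
	int32_t error;
	uint64_t unique;
};

/* Hypothetical helper: build a response to the request identified by
 * unique. Returns the number of bytes to hand to one write() call,
 * or 0 if the arguments are invalid or the buffer is too small. */
static size_t build_response(uint8_t *buf, size_t cap, uint64_t unique,
			     int32_t error, const void *payload,
			     size_t payload_len)
{
	struct out_header h;

	if (error > 0 || error < -999)
		return 0; /* status must be zero or a small negative number */
	if (error != 0)
		payload_len = 0; /* an error response is a header only */
	if (cap < sizeof(h) + payload_len)
		return 0;

	h.len = (uint32_t)(sizeof(h) + payload_len); /* includes the header */
	h.error = error;
	h.unique = unique; /* copied from the request */

	memcpy(buf, &h, sizeof(h));
	if (payload_len)
		memcpy(buf + sizeof(h), payload, payload_len);

	return h.len;
}
```

The returned value is exactly the byte count that the write() call should be made with, since the kernel insists that it equals @len.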

How requests are made in the kernel code

For each request to the server, a struct fuse_req is allocated and initialized to contain the information on the request to send and what the answer is about to look like. This begins with calling fuse_get_req() or fuse_get_req_for_background(), which both call __fuse_get_req(struct fuse_conn *fc, unsigned npages, bool for_background).

To make a long story short, this function allocates the memory for the struct fuse_req itself as well as an array of npages entries of struct page and struct fuse_page_desc. It also initializes several functional fields of the structure, among others the pages, page_descs and max_pages entries, as well as setting the reference count to 1, setting the FR_PENDING flag and initializing the two list headers and the wait queue. The pid, uid and gid fields in the information for the request are also set.

Then the fuse_req structure is set up specifically for the request. In particular, the @end entry points at the function to call by request_end() following the arrival of a response from the server or the abortion of the request.

The fuse_req has two entries, @in and @out, which are of type fuse_in and fuse_out, respectively. Note that “in” and “out” are from the server’s perspective, so “in” means kernel to server and vice versa.

struct fuse_arg {
	unsigned size;
	void *value;
};

struct fuse_in_arg {
	unsigned size;
	const void *value;
};

struct fuse_in {
	struct fuse_in_header h;
	unsigned argpages:1;
	unsigned numargs;
	struct fuse_in_arg args[3];
};

struct fuse_out {
	struct fuse_out_header h;
	unsigned argvar:1;
	unsigned argpages:1;
	unsigned page_zeroing:1;
	unsigned page_replace:1;
	unsigned numargs;
	struct fuse_arg args[2];
};

Despite the complicated outline, the usage is quite simple. It’s described in detail in the rest of this section, but in short: The request consists of a fuse_in_header followed by arguments, which are just a concatenation of strings (there are @in.numargs of them), which are set up when the request is prepared. @value and @size are set up in an array of struct fuse_in_arg.

The response is a concatenation of fuse_out_header and @out.numargs arguments, which are once again concatenated strings. The sizes and buffers are set up when the request is generated. The @argvar flag is possibly set to allow a shorter response at the expense of the last argument. Look at the function pointed to by @end for how these arguments are interpreted.

And now the longer version of the two clauses above:

When a request is prepared for transmission to the server by fuse_dev_do_read(), it concatenates the @h entry in the struct fuse_in with @numargs “arguments”. Each “argument” is a string, which is represented as a fuse_in_arg entry in the @args array, by a pointer @value and the number of bytes given as @size. So it’s a plain string concatenation of @numargs + 1 strings, the first with a fixed size (of struct fuse_in_header) and some variable-length strings. What makes it seem complicated is the paging-aware data copying mechanism.

As for handling the arrival of responses from the server: Except for notifications and interrupt replies, fuse_dev_do_write() handles the write() request, which must include everything in the buffer submitted, as follows. The first bytes are copied into the fuse_req’s @out.h, or in other words, the fuse_out’s @h entry. So this consumes the number of bytes in a struct fuse_out_header.

The rest is chopped into arguments (by copy_out_args() ), following the same convention of @numargs concatenated strings, each having the length of @size and written into the buffer pointed by @value. @numargs as well as entries of the struct fuse_arg array are set when preparing the request — when the response arrives, the relevant buffers are filled. And don’t confuse struct fuse_arg with struct fuse_args, which is completely different.

copy_out_args() checks the header’s @error field before copying anything. If it’s non-zero, no copying takes place: The response is supposed to consist of a struct fuse_out_header only.

The last argument of a response from the server may be shorter (possibly zero length) than its respective @size entry if and only if the @argvar entry in the related struct fuse_out is set (which is possibly done when preparing the request). If this is the case, the server simply submits fewer bytes than the sum of the header + all arguments’ @size, and the last argument is shortened accordingly. This may sound complicated, but it just means, for example, that a response to READ submits the data that it managed to collect.
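The length rule for responses can be modeled in a few lines of user-space C. This is a simplified model written for this explanation, not the kernel’s code, and last_arg_len() is a made-up name. It takes the @size of each expected argument, the @argvar flag and the number of bytes that arrived after the header, and tells how long the last argument turns out to be.

```c
#include <stddef.h>

/* Simplified model of the length check on a successful response.
 * sizes[]  : the @size of each expected argument (numargs entries)
 * argvar   : nonzero if the last argument is allowed to be shorter
 * body_len : number of bytes in the response after the header
 * Returns the actual length of the last argument, or -1 if a response
 * of this length is rejected. */
static long last_arg_len(const size_t *sizes, unsigned numargs,
			 int argvar, size_t body_len)
{
	size_t fixed = 0, total;
	unsigned i;

	if (numargs == 0)
		return body_len == 0 ? 0 : -1;

	/* All arguments except the last one always have their full size */
	for (i = 0; i + 1 < numargs; i++)
		fixed += sizes[i];
	total = fixed + sizes[numargs - 1];

	if (body_len == total)
		return (long)sizes[numargs - 1]; /* nothing was shortened */

	/* A shorter response is accepted only with argvar set, and only
	 * if the missing bytes all belong to the last argument */
	if (argvar && body_len >= fixed && body_len < total)
		return (long)(body_len - fixed);

	return -1; /* too long, or too short without argvar */
}
```

For example, with two arguments of 72 and 4096 bytes and @argvar set, a response carrying 72 + 17 bytes after the header leaves 17 bytes for the last argument.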

Once again, all this sounds a bit scary, but take the relevant snippet from cuse_send_init() defined in the kernel’s fs/fuse/cuse.c:

	req->in.h.opcode = CUSE_INIT;
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(struct cuse_init_in);
	req->in.args[0].value = arg;
	req->out.numargs = 2;
	req->out.args[0].size = sizeof(struct cuse_init_out);
	req->out.args[0].value = outarg;
	req->out.args[1].size = CUSE_INIT_INFO_MAX;
	req->out.argvar = 1;
	req->out.argpages = 1;
	req->pages[0] = page;
	req->page_descs[0].length = req->out.args[1].size;
	req->num_pages = 1;
	req->end = cuse_process_init_reply;
	fuse_request_send_background(fc, req);

It’s quite clear: The driver sends one argument (i.e. one string) after the header, and expects two back in the response. And the function that handles the response is cuse_process_init_reply(). So it’s fairly easy to tell what is sent and what is expected in return.

How CUSE implements read()

The CUSE driver (cuse.c) assigns cuse_read_iter() for the read_iter fops method. This function sets the file position to zero, and calls fuse_direct_io(), defined in file.c. Not to be confused with fuse_direct_IO(), defined in the same file.

The latter function retrieves the number of bytes to process as its local variable @count. It then loops on sending requests and retrieving the data as follows (outlined for non-async I/O): fuse_send_read() is called for sending a READ request to the server by calling fuse_read_fill() and fuse_request_send(). The latter is defined in dev.c, and calls __fuse_request_send(), which queues the request for transmission (with queue_request()) and then waits (i.e. blocks, sleeps) until the response with a matching unique ID has arrived (by calling request_wait_answer()). This happens by virtue of the server’s invocation of a write() on its /dev/cuse filehandle, with a matching unique ID.

Back to the loop on @count, fuse_send_read() returns with the number of bytes of the response’s first argument — that is, the length of the data that arrived. The loop hence continues with checking the error status of the response (in the @error field). If there was an error, or if there were fewer bytes than requested in the response, the loop terminates. The same happens if @count is zero after deducting the number of arrived bytes from it.

The return value of fuse_direct_io(), which is also the return value of cuse_read_iter(), is the number of bytes that were read (in total), if this number is non-zero, even if the loop quit because of an error. Only if no bytes were received, the function returns the @error field in the response (which is zero if there was neither an error nor data).

The rationale behind the loop and the way it handles errors is that a single read() request by the application may be chopped into several READ requests if the read() can’t be fit into a single READ request (i.e. the read()’s @count is larger than max_read, as specified on the INIT response). It’s therefore necessary to iterate.
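The loop’s behavior, as described above, can be captured in a small user-space model. This is not the kernel’s code: direct_read_model(), the callback type and the stub server below are all invented for the illustration. The callback stands for one READ round trip with the server.

```c
#include <stddef.h>
#include <errno.h>

/* Stand-in for one READ round trip: ask for up to want bytes, get back
 * the number of bytes delivered (>= 0) or a negative errno value. */
typedef long (*read_req_fn)(size_t want, void *ctx);

/* Model of the loop: chop a read() of count bytes into READ requests
 * of max_read bytes at most. */
static long direct_read_model(size_t count, size_t max_read,
			      read_req_fn send_read, void *ctx)
{
	size_t done = 0;
	long err = 0;

	while (count > 0) {
		size_t want = count < max_read ? count : max_read;
		long got = send_read(want, ctx);

		if (got < 0 || (size_t)got > want) {
			err = got < 0 ? got : -EIO; /* error or bogus length */
			break;
		}
		done += (size_t)got;
		count -= (size_t)got;
		if ((size_t)got < want)
			break; /* short response: stop asking for more */
	}

	/* Data that already arrived wins over a later error */
	return done > 0 ? (long)done : err;
}

/* Stub server for trying the model out: delivers data until avail runs
 * out, then fails with fail_with (or returns 0 if fail_with is 0). */
struct stub {
	size_t avail;
	long fail_with;
};

static long stub_read(size_t want, void *ctx)
{
	struct stub *s = ctx;
	size_t n;

	if (s->avail == 0 && s->fail_with)
		return s->fail_with;
	n = want < s->avail ? want : s->avail;
	s->avail -= n;
	return (long)n;
}
```

With max_read set to 4096, a read() of 10000 bytes turns into three READ requests when the stub has enough data. If the stub runs dry and fails after the first 4096 bytes, the model still returns 4096, and the error is reported only when nothing was delivered at all.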

How CUSE implements write()

The CUSE driver (cuse.c) assigns cuse_write_iter() for the write_iter fops method. This function sets the file position to zero, and like cuse_read_iter(), it calls fuse_direct_io(), defined in file.c. Only with different arguments to tell the latter function that the data goes in the opposite direction.

fuse_direct_io() calls fuse_send_write() instead of fuse_send_read(), which calls fuse_write_fill() instead of fuse_read_fill(). And then fuse_request_send() is called, which sends the request and waits for its response. fuse_send_write() returns with the number of bytes that were actually written, as it appears in the @size entry of the struct fuse_write_out in the response.

Note that the kernel driver sends a buffer along with the WRITE call, and the server chooses how much to consume from it, and then tells the kernel about that in the response. This requires a small discussion on partial handling of write().

The tricky thing with a write() is that the application program supplies a buffer to write, along with the number of bytes to write. It’s perfectly fine to point to a huge buffer and set the count to the entire buffer. Any character device driver may write the entire buffer, or just as much as it can at the moment, and return the number of bytes written. The fact that a huge number of bytes were requested makes no difference, because the character device driver treats the request as if it was for the number of bytes it could write. The rest of the buffer is ignored.

So there are two problems, both arising when the buffer of the write() from the application program is large: One is how to make sure that the server has allocated a buffer large enough to receive the data in one go (recall that both requests and responses must be done in a single I/O operation). The second and smaller problem is the wasted I/O of data in a WRITE request that is eventually ignored, because the server chose to consume less than available.

To prevent huge buffers from being transmitted to the server and then ignored, the server supplies a max_write parameter in its response to an INIT request, which sets the maximal number of bytes that are transmitted in a WRITE request to the server (it should be 4096 or larger). So the write() operation is chopped up into smaller buffers by FUSE / CUSE as necessary to meet this restriction.

This parameter is a tradeoff between reducing the number of I/Os with the server and the possibility to waste data transfers. libfuse picks 128 kB.

There is no similar problem with read() calls, because the server submits only the bytes actually read in the response, after a response header that says how many bytes are submitted. Nevertheless, there is a separate max_read limit for CUSE sessions (but not for FUSE, which copies it from max_write).

Handling interrupts (signals)

There is a lot of fuss about this topic, which is discussed on a separate post. To make a long story short, a server must be able to process INTERRUPT requests. To the server, such a request is just like the others, in the sense that it consists of a struct fuse_in_header followed by a single argument:

struct fuse_interrupt_in {
	uint64_t	unique;
};

The function that implements this in the kernel is fuse_read_interrupt() in dev.c.

Note that there are two @unique IDs in the request. One is in the header, which is the ID of the interrupt request itself. The second is in the argument, which is the unique ID of the request that should be interrupted. The server should not assume any special connection between the two (there is such a connection since kernel v4.20, due to commit c59fd85e4fd07).

When a server receives an INTERRUPT request, it shall immediately send a response (i.e. completion) of the request with the @unique given in the argument. An -EINTR status may be reported, in accordance with the common POSIX rules.

Note that even though an INTERRUPT request is guaranteed to be conveyed to the server after the request it relates to, it may arrive after the server’s response has been submitted if a race condition occurs. As a result, the server may receive INTERRUPT requests with a @unique ID that it doesn’t recognize (because it has removed its records while responding). Therefore, the server should ignore such requests.

On the other hand, if multiple threads fetch requests from the same file descriptor (of /dev/cuse or /dev/fuse), one thread may decode the INTERRUPT request before the original request has been recorded. This possibility is present in the libfuse implementation, and is the reason behind the complication discussed in that other post.
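For a single-threaded server, the bookkeeping that this calls for is quite small. Below is a minimal sketch with a fixed-size table of the requests that haven’t been responded to yet. All names (pending_set, pending_add(), on_interrupt()) and the table size are assumptions made for this example, not taken from any existing code.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_PENDING 16 /* arbitrary limit for the sketch */

/* The @unique IDs of requests that were received but not completed */
struct pending_set {
	uint64_t unique[MAX_PENDING];
	int used[MAX_PENDING];
};

/* Record a request that will be completed later.
 * Returns 0 on success, -1 if the table is full. */
static int pending_add(struct pending_set *s, uint64_t unique)
{
	size_t i;

	for (i = 0; i < MAX_PENDING; i++) {
		if (!s->used[i]) {
			s->used[i] = 1;
			s->unique[i] = unique;
			return 0;
		}
	}
	return -1;
}

/* Handle an INTERRUPT request. target is the @unique found in the
 * request's struct fuse_interrupt_in argument (not the one in the header).
 * Returns 1 if the target was pending: The caller should complete it
 * right away, possibly with -EINTR.
 * Returns 0 if the target is unknown: The interrupt is silently ignored,
 * since the response has most likely been submitted already. */
static int on_interrupt(struct pending_set *s, uint64_t target)
{
	size_t i;

	for (i = 0; i < MAX_PENDING; i++) {
		if (s->used[i] && s->unique[i] == target) {
			s->used[i] = 0; /* no longer pending */
			return 1;
		}
	}
	return 0;
}
```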

POLL requests

Poll is different from many other requests in that it requires two (or even more) responses from the server:

  • An immediate response, with the bitmap informing which operations are possible right away
  • Possibly additional notifications, when one or more of the selected operations have become possible.

fuse_file_poll() in file.c handles poll() calls on a file. It queues a FUSE_POLL request, with one argument, consisting of a fuse_poll_in struct:

struct fuse_poll_in {
	uint64_t	fh;
	uint64_t	kh;
	uint32_t	flags;
	uint32_t	events;
};

The @events entry is set with

inarg.events = mangle_poll(poll_requested_events(wait));

which supplies a bitmap of the events that are waited for in POSIX style (mangle_poll() is defined in the kernel’s poll.h, which does the conversion).

@flags may have one flag set, FUSE_POLL_SCHEDULE_NOTIFY, saying that there’s a process actually waiting. If it’s set, the server is required to send a notification when the file becomes ready. If cleared, the server may send such a notification, but it will be ignored.

@fh and @kh are the file’s file handle, in userspace and kernel space respectively (the latter is systemwide unique).

If there is a process waiting, the file is then registered in a dedicated data structure (an RB tree), and will be kept there until the file is released. The underlying idea is that if a file descriptor has been polled once, it’s likely to be polled many more times.

Either way, the POLL request is submitted, and the server is expected to submit a response with a poll bitmap, which is deconverted into kernel format, and used as the poll() return value. Consequently, poll() blocks until the response arrives.

Should the server respond with an -ENOSYS status, no more POLL requests are sent to the server for the rest of the session, and DEFAULT_POLLMASK is returned on this and all subsequent poll() calls. It’s defined in poll.h:

#define DEFAULT_POLLMASK (EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM)

So there’s the poll response:

struct fuse_poll_out {
	uint32_t	revents;
	uint32_t	padding;
};

Rather trivial — just the events that are active.

More interesting are the notifications. The server may send a notification anytime by setting @unique to zero and the @error field to the code of the notification request (FUSE_NOTIFY_POLL == 1). The @opcode field is ignored in this case (there is no opcode for notifications).

There’s one argument in a poll notification:

struct fuse_notify_poll_wakeup_out {
	uint64_t	kh;
};

where @kh echoes back the value in the poll request.
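Putting the pieces together, a poll notification is just 24 bytes. Here is a minimal sketch that assembles one for a single write(). The structs are local mirrors of the ones shown above, and build_poll_notification() is a name made up for this example.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define NOTIFY_POLL 1 /* the value of FUSE_NOTIFY_POLL, as mentioned above */

/* Local mirrors of struct fuse_out_header and
 * struct fuse_notify_poll_wakeup_out */
struct out_header {
	uint32_t len;
	int32_t error;
	uint64_t unique;
};

struct notify_poll_wakeup_out {
	uint64_t kh;
};

/* Hypothetical helper: build a poll wakeup notification for the file
 * that is identified by kh (as received in the POLL request).
 * Returns the number of bytes to write(), or 0 if buf is too small. */
static size_t build_poll_notification(uint8_t *buf, size_t cap, uint64_t kh)
{
	struct out_header h;
	struct notify_poll_wakeup_out arg = { .kh = kh };
	size_t total = sizeof(h) + sizeof(arg);

	if (cap < total)
		return 0;

	h.len = (uint32_t)total;
	h.error = NOTIFY_POLL; /* notification code goes in @error */
	h.unique = 0;          /* zero marks this as a notification */

	memcpy(buf, &h, sizeof(h));
	memcpy(buf + sizeof(h), &arg, sizeof(arg));
	return total;
}
```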

In dev.c, fuse_notify() calls fuse_notify_poll(), which in turn calls fuse_notify_poll_wakeup() (in file.c) after a few sanity checks.

fuse_notify_poll_wakeup() looks up the value of @kh entry in the dedicated data structure. If it’s not found, the notification is silently ignored. This is considered OK, since the server is allowed to send notifications even if FUSE_POLL_SCHEDULE_NOTIFY wasn’t set.

If the entry is found, wake_up_interruptible_sync() is called on the file’s wait queue that is used only in relation to poll (which is known from the entry in the data structure). That’s it.

poll() is supported by FUSE since kernel v2.6.29 (Git commit 95668a69a4bb8, Nov 2008)

CUSE INIT requests

The bringup of the device file is initiated by the kernel driver, which sends a CUSE_INIT request. The server sets up the connection and device file’s attributes by responding to this request.

In cuse.c, cuse_channel_open() implements /dev/cuse’s method for open(). Aside from allocating and initializing a struct cuse_conn for containing the private data of this connection, it calls cuse_send_init() for queuing a CUSE_INIT (opcode 4096) request to the new file handle. Note that this is different from the FUSE_INIT (opcode 26) that arrives from /dev/fuse.

The request consists of a struct fuse_in_header concatenated with a struct cuse_init_in:

struct cuse_init_in {
	uint32_t	major;
	uint32_t	minor;
	uint32_t	unused;
	uint32_t	flags;
};

The major and minor fields are the FUSE_KERNEL_VERSION and FUSE_KERNEL_MINOR_VERSION, telling the server which FUSE version the kernel offers. flags is set to 0x01, which is CUSE_UNRESTRICTED_IOCTL.

The pid, uid and gid in the header are those of the process that opened /dev/cuse — not really interesting. @unique is typically 1 (but don’t rely on it — it can be anything in future versions). On fairly recent kernels, it continues with 2 and increments by 2 for each request to follow. On older kernels, it just counts upwards with steps of 1. The unique ID mechanism was changed in kernel commit c59fd85e4fd07 (September 2018, v4.20) for the purpose of allowing a hash of unique IDs in the future.

The response is a string concatenation of the following three elements (header + two arguments):

  • A struct fuse_out_header, with the header for the response (with @unique typically set to 1)
  • A struct cuse_init_out with some information (more on that below)
  • A null-terminated string that reads e.g. “DEVNAME=mydevice” (without the quotes, of course) for generating the device file /dev/mydevice. Don’t forget to actually write the null byte in the end, or the device generation fails with a “CUSE: info not properly terminated” in the kernel log.

struct cuse_init_out is defined as

struct cuse_init_out {
	uint32_t	major;
	uint32_t	minor;
	uint32_t	unused;
	uint32_t	flags;
	uint32_t	max_read;
	uint32_t	max_write;
	uint32_t	dev_major;
	uint32_t	dev_minor;
	uint32_t	spare[10];
};

The fields of cuse_init_out are as follows:

  • @major and @minor have the same meaning as these fields in struct cuse_init_in, but they reflect the version that the server is designed for, and hence rule the session. As of kernel v5.3 (which implements FUSE version 7.26), @major must be 7 and @minor at least 11, or the initialization fails. FUSE 7.11 was introduced in kernel v2.6.29 in 2008. See include/uapi/linux/fuse.h in the kernel sources for the revision history.
  • @max_read and @max_write are the maximal number of bytes in the payload of a READ and WRITE request, respectively. Note that @max_write forces read() requests from /dev/cuse to supply a @count parameter of at least @max_write + the size of struct fuse_out_header + the size of struct fuse_write_out, or WRITE requests may fail. Same goes for @max_read and struct fuse_in_header and struct fuse_read_in. What counts is the length of the requests and their possible responses, which includes the lengths of the non-data parts.
  • @flags: If bit 0 (CUSE_UNRESTRICTED_IOCTL) is set, unrestricted ioctls are enabled.
  • @dev_major and @dev_minor are the created device file’s major and minor numbers. This means that the server needs to make sure that they aren’t already allocated.
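As an illustration of the three-part response described above, the following sketch assembles it in one buffer for a single write(). The structs are local mirrors of struct fuse_out_header and struct cuse_init_out, and build_init_response() is a hypothetical helper. The values that go into the struct are up to the caller.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Local mirrors of the structures, to keep the sketch self-contained */
struct out_header {
	uint32_t len;
	int32_t error;
	uint64_t unique;
};

struct init_out {
	uint32_t major;
	uint32_t minor;
	uint32_t unused;
	uint32_t flags;
	uint32_t max_read;
	uint32_t max_write;
	uint32_t dev_major;
	uint32_t dev_minor;
	uint32_t spare[10];
};

/* Hypothetical helper: build the response to a CUSE_INIT request:
 * header + struct + "DEVNAME=<devname>" including its null byte.
 * unique is copied from the request's header.
 * Returns the number of bytes to write(), or 0 if buf is too small. */
static size_t build_init_response(uint8_t *buf, size_t cap, uint64_t unique,
				  const struct init_out *out,
				  const char *devname)
{
	static const char key[] = "DEVNAME=";
	size_t klen = sizeof(key) - 1;     /* without the key's own null byte */
	size_t nlen = strlen(devname) + 1; /* with the terminating null byte */
	size_t total = sizeof(struct out_header) + sizeof(*out) + klen + nlen;
	struct out_header h;
	uint8_t *p = buf;

	if (cap < total)
		return 0;

	h.len = (uint32_t)total;
	h.error = 0;
	h.unique = unique;

	memcpy(p, &h, sizeof(h));
	p += sizeof(h);
	memcpy(p, out, sizeof(*out));
	p += sizeof(*out);
	memcpy(p, key, klen);
	p += klen;
	memcpy(p, devname, nlen); /* the null byte is copied as well */

	return total;
}
```

Copying strlen() + 1 bytes is what takes care of the null byte that the kernel insists on.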

FORGET requests

These requests inform a FUSE server that there’s no need to retain information on a specific inode. This request will never appear on /dev/cuse.

usbpiper: A single-threaded /dev/cuse and libusb-based endpoint to device file translator

Introduction

Based upon CUSE, libusb and the kernel’s epoll capability, this is a single-threaded utility which generates one /dev/usbpiper_* device file for each bulk / interrupt endpoint on a USB device. For example, /dev/usbpiper_bulk_in_01 and /dev/usbpiper_bulk_out_03.

It’s an unfinished project, that was stopped before a lot of obvious tasks in the TODO list were done. This is why several parameters are hardcoded and some memory allocations aren’t freed. Plus several other implications listed below.

It’s available at Github: https://github.com/billauer/usbpiper

I eventually went for a good old kernel driver instead. This post explains why, and you probably want to read it if you have plans on this utility or want to use FUSE or CUSE otherwise. That post also explains why I went right on to /dev/cuse rather than using libfuse.

Nevertheless, the project may very well be useful for development of USB projects, as a boilerplate or a getting-started utility. It also shows how to implement epoll-based asynchronous USB transfers, as well as implementing a CUSE-based device file driver in userspace, implementing the protocol of /dev/cuse directly (i.e. without relying on libfuse). And all this as a single thread program.

But what was the utility meant to do in the first place?

The underlying idea is simple: With a single-threaded userspace program, create a plain character device for each BULK (or INTERRUPT) endpoint that is found on a selected USB device, and allow data to be sent to each OUT endpoint by opening a device file, and just writing data to it. With “cat” for example. And the other way around, read data from each IN endpoint by reading data from another device file. This description is simplistic, however it may work quite well when working on a USB device project. Just be sure to read the details below on setting up usbpiper. Doing that pretty much covers the necessary gory details.

What usbpiper definitely isn’t: It’s NOT a user-space driver for XillyUSB (a generic FPGA IP core for SuperSpeed USB 3.0, based upon the FPGA’s Gigabit transceivers). XillyUSB requires a dedicated driver, which implements a specific protocol with the IP core.

Confusing usbpiper with XillyUSB’s driver is easy, because both share the idea of plain device files for I/O with a USB device. In fact, usbpiper started off as a user-space driver for XillyUSB, but never got to the point of covering XillyUSB’s protocol.

Another possible source of confusion is usbfs. It’s a USB filesystem, so what is there to add? So yes, usbfs is used by libusb to allow a low-level driver for a USB device to be written in user space (usbpiper uses this interface, of course). It doesn’t allow a simple access to the data.

It’s recommended to look at this post on the protocol with /dev/cuse before diving into this one.

What works and in what ways it’s unfinished

usbpiper is executed with no arguments. It takes control of the selected USB device’s interface (which one — see below) and creates a /dev/usbpiper_* device file for each bulk or interrupt endpoint that it finds. The file’s name reflects the endpoint’s number, direction and bulk vs. interrupt.

It has however only been tested on bulk endpoints. Interrupt endpoints may work, but have not been tested, and isochronous endpoints are ignored. Also, usbpiper doesn’t free memory properly, in particular not buffers and other memory-consuming stuff related to libusb.

Several parameters would normally be set through command-line parameters, but they are hardcoded.

The verbosity level can be set by editing some defines in usbpiper.h. In particular, a lot of messages are reduced by replacing

#define DEBUG(...) { fprintf(stderr, __VA_ARGS__); }

with

#define DEBUG(...)

In usbpiper.c, max_size defines the largest number of bytes that can be handled in a CUSE READ or WRITE request.

In usb.c, the following parameters are hardcoded:

  • FIFOSIZE: The effective number of bytes in the FIFO between the CUSE and USB units. The actual FIFO size for OUT endpoints is larger by max_size, for reasons explained in the “Basic data flow principle” section below.
  • vendorID and prodID define the device to be targeted. Note that the find_device() function in usb.c explicitly finds the device from the list of devices on the bus, so it can be altered to select the device based upon other criteria.
  • int_idx and alt_idx are the Interface and Alternate Setting indexes for selection on the device. More on this issue below.
  • td_bufsize is the size of the buffer that goes with each transfer. Set to 64 kiB, which is probably an overkill for most devices, but reasonable for proper bandwidth utilization with SuperSpeed devices. Also see below why it should be large when working with just some device.
  • numtd: The maximal number of outstanding transfers for each endpoint. A large number is good for high-bandwidth applications (with SuperSpeed) since it gives the hardware controller several transfers in a row before software intervention is required. Make it too big, and libusb_submit_transfer() may fail (the controller got more than it could accept).

Features that were meant to be added, but I never got there:

  • Array size of epoll should be dynamic (number of held file descriptors). Currently it’s ARRAYSIZE in usbpiper.c.
  • A file was supposed to be bidirectional. Makes no sense in this usage scenario, and bidirectional was never tested.
  • Non-blocking OPEN not supported
  • Was intended to support USB hotplugging
  • Adaption to XillyUSB’s protocol

USB Transfers and why you should care about them

There is a good reason why there isn’t any pipe-like plain device file interface for any USB device by default: usbpiper overlooks several details in the communication of a USB device.

The most important issue is that USB communication is divided into transfers, and is generally not treated as a continuous stream of data. The underlying model in the USB spec is that the host software initiates a transfer of a given number of bytes (in or out), the USB framework carries it out against the device, and then informs the software that it has been finished. The USB spec’s authors seem to have thought that the mainline usage of the USB bus would be done with a function call saying something like “send this packet of data to the device”. Or another function saying “receive X bytes from the device”, which returns with a pointer to the data buffer.

The USB framework supports asynchronous transfers, of course, but that doesn’t change the notion that the host’s software explicitly requests each transfer with a given number of bytes. All communication is cut into packet-like chunks of data with clear boundaries. The device is allowed to divert from the host’s transfer requests only in one way: On IN endpoints, it’s allowed to terminate a transfer with fewer bytes than the host expected, and this is not considered an error.

However generally speaking, any software that communicates with a device directly (i.e. a device driver) is expected to know when the device expects transfers and of what size. usbpiper ignores this completely. Therefore, it may very well not work properly with just any device. This is less of an issue if the device is developed along with using usbpiper.

The three points to note are hence:

  • usbpiper sets the byte count of OUT transfers according to the momentary buffer fill, up to a certain limit (td_bufsize). If the device expects a certain number of bytes in the transfer (which is legit) or the transfers are longer than it can take — things will break, of course. A device may also be sensitive to transfer boundaries, which usbpiper pays no attention to. If the device expects a fixed length for all transfers, this issue can be worked around by modifying try_queue_bulkout() to never send a partially filled transfer, and setting the desired length instead of td_bufsize.
  • usbpiper sets td_bufsize as the length of IN transfers, however the host doesn’t inform the device on how long the transfer is expected to be. The device driver is supposed to know the maximal length of an IN transfer that the device will respond with, and prepare a buffer long enough. Otherwise, a babbling error results (libusb returns LIBUSB_ERROR_OVERFLOW). td_bufsize is set to 64 kiB which is unlikely to be exceeded by USB devices — but this isn’t guaranteed.
  • Another issue with IN endpoints is that the information on where the boundaries of the transfers are is lost: usbpiper just copies the data into a FIFO, which is read continuously on the other side. If the protocol of an IN endpoint relies on the driver knowing where a transfer started, usbpiper won’t be useful. This can be the case if the transfers are packets with a header, but without a data length field. Such a protocol makes sense against a driver that receives the transfers directly.

Interfaces and alternate settings

A USB device may present several interfaces, and each interface may have alternate settings. This isn’t a gory technical detail, but can be the difference between getting your device working with usbpiper or not, in particular if it’s not something you designed yourself.

Even though a device is assigned an address on the USB bus, any USB driver claims the control of an interface of that device. In other words, it’s perfectly normal that several, possibly independent drivers control a single physical device. Examples are a keyboard / mouse combo device or a sound card with MIDI and joystick interface (not so common today). Or a scanner / printer, which also acts as a card reader:

$ usb-devices
T:  Bus=01 Lev=03 Prnt=44 Port=03 Cnt=01 Dev#= 45 Spd=480 MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=03f0 ProdID=7a11 Rev=01.00
S:  Manufacturer=HP
S:  Product=Photosmart B109a-m
S:  SerialNumber=MY5687428T02D2
C:  #Ifs= 4 Cfg#= 1 Atr=c0 MxPwr=2mA
I:  If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=cc Prot=00 Driver=(none)
I:  If#= 1 Alt= 0 #EPs= 2 Cls=07(print) Sub=01 Prot=02 Driver=usblp
I:  If#= 2 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none)
I:  If#= 3 Alt= 0 #EPs= 2 Cls=08(stor.) Sub=06 Prot=50 Driver=usb-storage

Note that the device effectively behaves like two independent devices: A scanner / printer and a USB disk.

It’s therefore important to set not just the Vendor / Product IDs correctly, but also the interface. usb-devices and lsusb -vv may help in making the correct selection.

Alternate setting is less common, but a single interface may have different usage modes. If present, this must be set correctly as well.

Basic data flow principle

The purpose of the utility is to move data from a USB endpoint to a CUSE device file or vice versa. To accomplish this, there is a plain RAM FIFO allocated for each such data stream.

For an IN endpoint, the USB frontend queues asynchronous transfer requests using libusb. For each IN transfer that is finished, the data is copied into the relevant FIFO. On the FIFO’s other side, the read() calls on the device file (i.e. CUSE READ requests) are fulfilled, as necessary, by submitting data that is fetched from the FIFO. Overflow of the FIFO is prevented by queuing IN transfer requests only when there’s enough room in the FIFO to accept the data that all outstanding requests may carry, if they all return with a full buffer. Underflow is not an issue, but the read() call isn’t completed if there is no data to submit, in which case read() blocks.
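The overflow prevention rule for IN endpoints boils down to a simple worst-case check. The following is a Python sketch of that condition as I read it from the description above, not the actual code: the function and its parameter names (except td_bufsize) are assumptions for the sake of the example.

```python
def can_queue_in_transfer(fifo_size, fifo_fill, outstanding, td_bufsize):
    """True if one more IN transfer may be queued without risking overflow.

    Worst case: every outstanding transfer, plus the new one, returns
    with a full buffer of td_bufsize bytes.
    """
    vacant = fifo_size - fifo_fill
    return (outstanding + 1) * td_bufsize <= vacant

# Example: a FIFO of 4 buffers' worth, empty, with 3 transfers in flight
print(can_queue_in_transfer(4 * 65536, 0, 3, 65536))
```

With this rule, a completed transfer can always be copied into the FIFO right away, so the libusb callback never needs to wait.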

For an OUT endpoint, the handler of a write() call (i.e. CUSE WRITE requests) copies the data into the relevant FIFO. As a result of the FIFO containing data, the USB frontend may queue new OUT transfers with the data available. It may also not do so, in particular if the number of already outstanding transfers stands at the maximum allowed. The FIFO is protected from overflow by blocking the write() call until there is enough room in the FIFO. The exact condition relates to the fact that the length of the data buffer of each CUSE WRITE request is limited by a number (max_size in the code) that is set during CUSE initialization. A WRITE request is hence not completed (hence preventing another one) until there is room for max_size additional bytes in the FIFO, after writing the current request’s data to the FIFO. This ensures that the usbpiper process always has somewhere to put the data, and doesn’t need to block, which it’s not allowed to, being a single-threaded utility.

The requirement of always having max_size bytes of data vacant in the FIFO gets slightly trickier when a WRITE request is interrupted (i.e. receives an INTERRUPT request on its behalf). This forces usbpiper to immediately complete the request. In order to ensure the requirement on the FIFO, usbpiper possibly unwinds the FIFO, throwing away data so that the FIFO’s fill is no more than its capacity minus max_size bytes. This doesn’t break the data stream’s integrity or continuity, because the write() call returns with the number of bytes actually written (or an -EINTR, if none). If the FIFO was unwound, the number of bytes that were discarded is subtracted from write()’s return value, giving the caller of write() the correct picture of how much data was consumed.
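The arithmetic of the unwinding can be sketched as follows. This is a Python illustration of my reading of the paragraph above, not a transcript of usbpiper’s code; the function name and the (new fill, result) return convention are assumptions.

```python
import errno

def complete_interrupted_write(fifo_size, fifo_fill, request_len, max_size):
    """Return (new_fill, result) for a WRITE that must be completed now.

    fifo_fill already includes the request_len bytes of this request.
    Only bytes that belong to this request are ever discarded.
    """
    limit = fifo_size - max_size         # highest fill leaving max_size vacant
    discard = max(0, fifo_fill - limit)  # how far the FIFO must be unwound
    discard = min(discard, request_len)  # guard: never touch older data
    consumed = request_len - discard
    result = consumed if consumed > 0 else -errno.EINTR
    return fifo_fill - discard, result

# Example: 1000-byte FIFO, max_size of 100, and a 100-byte request that
# brought the fill to 950: half of the request is thrown away.
print(complete_interrupted_write(1000, 950, 100, 100))
```

Since the previous WRITE request was completed only when max_size bytes were vacant, the data that needs discarding never exceeds the current request’s length, so the guard is there just for safety.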

Execution flow

Recall from above that usbpiper doesn’t rely on libfuse, but rather communicates with the CUSE framework directly through /dev/cuse.

As the utility’s single thread needs to divide attention between the USB tasks and those related to CUSE, a single epoll file descriptor is allocated for all open /dev/cuse files as well as those supplied by the libusb framework. An epoll_wait() event loop is implemented in usbpiper.c: Each entry in the epoll_event array contains a pointer to a small structure, which contains a function to call and a pointer to private data to pass to the function.
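The dispatch pattern can be sketched in Python as follows. Note the difference from the C implementation: Python’s select.epoll only hands back the file descriptor, so a dictionary stands in for the pointer stored in each epoll_event entry. The handler, the dictionary and the use of a plain pipe as the event source are all made up for the demonstration.

```python
import os
import select

def on_readable(fd, events, private):
    # Handler: read what is available and tag it with the private data.
    return private, os.read(fd, 4096)

def run_once(ep, handlers, timeout=0):
    # One pass of the event loop: dispatch every ready fd to its handler.
    results = []
    for fd, events in ep.poll(timeout):
        func, private = handlers[fd]
        results.append(func(fd, events, private))
    return results

def demo():
    r, w = os.pipe()
    ep = select.epoll()
    try:
        # fd -> (function to call, private data to pass to it)
        handlers = {r: (on_readable, "pipe-reader")}
        ep.register(r, select.EPOLLIN)
        os.write(w, b"hello")
        return run_once(ep, handlers)
    finally:
        ep.close()
        os.close(r)
        os.close(w)

print(demo())
```

The event loop itself knows nothing about what each file descriptor is for, which is what allows mixing /dev/cuse descriptors with those of libusb in the same loop.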

The communication protocol with /dev/cuse is discussed in another post. For the purpose of the current topic, the CUSE kernel framework creates a device file in /dev/ each time /dev/cuse is opened and a simple read-write handshake is completed. After this, for each operation on the related device file (e.g. open(), read(), write() etc.) a request packet is passed to the server (i.e. usbpiper in this case) by virtue of read() calls on the /dev/cuse file handle. The operation blocks until the server responds by writing a buffer to the same file handle, which contains a status header and possibly data. Responses to requests are not necessarily written in the same order as the requests. A unique ID number in the said status header ensures the pairing between requests and their responses.

read() calls from /dev/cuse block when there’s nothing to do, and are therefore subject to epoll in usbpiper. write() calls never block.

However this is not enough: For example, an epoll entry may indicate a new WRITE request on a CUSE file descriptor, which fills one of the FIFOs with data. As a result, there might be a new opportunity to queue new USB transfers. There are many software design approaches for how to make one action trigger others — the one taken in usbpiper is the simplest and messiest: Letting the performer of the action call the functions that may benefit from the opportunity directly. In the given example, this means that process_write() calls try_queue_bulkout() directly. The latter calls try_complete_write() in turn.

The function nomenclature in this utility is consistent in that several functions have a try_*() prefix to mark that they are opportunity oriented. It would have been equally functional, cleaner and more elegant (however less efficient) to call all try_*() functions on behalf of all endpoints and device files. Or alternatively, maintain some queue of try_*() function calls, however this wouldn’t take away the need for awareness of which actions may open what opportunity.

Delays and timeouts

There are a couple of situations where a timer is required. A timerfd is allocated for each device file, serving the following two scenarios:

  • Related to IN endpoints: When a READ request can’t be completed with the full number of bytes that are required, usbpiper waits up to 10 ms for data from the IN endpoint to fill the relevant FIFO. After this timeout, try_complete_read() completes the request as soon as there is any data in the FIFO. The rationale is to avoid a flood of READ requests and responses if the data arrives frequently and in small chunks.
  • Related to OUT endpoints: When a RELEASE request arrives, and there is still data in the relevant FIFO, try_complete_release() waits up to 1000 ms for the FIFO to be drained by the OUT endpoint. After this, try_complete_release() completes the request, hence closing the related device file (not /dev/cuse) after emptying the FIFO.

A single timer can be used for both tasks, because a RELEASE can’t occur before all outstanding requests have been completed on the related device file (Linux’ device file API ensures that). Besides, each device file can be related only to either an IN or OUT endpoint, so once again, the timer won’t be necessary for both uses at the same time.
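The rule for completing a READ request, as described in the first bullet above, can be written as a small decision function. This is a Python sketch of the logic only, with made-up names; the real try_complete_read() of course deals with the timerfd and the CUSE response as well.

```python
def read_completion_len(requested, fifo_fill, timer_expired):
    """Number of bytes to complete a READ with now, or None to keep waiting."""
    if fifo_fill >= requested:
        return requested        # enough data: complete in full
    if timer_expired and fifo_fill > 0:
        return fifo_fill        # after the 10 ms: any data will do
    return None                 # nothing to hand over yet

print(read_completion_len(4096, 100, False))  # still within the 10 ms
print(read_completion_len(4096, 100, True))   # timeout passed
```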

A similar 10 ms timeout could have been implemented for OUT endpoints, i.e. generate an OUT transfer only if the FIFO contains enough data for a full transfer buffer. This wouldn’t require another timer, for the first reason given above. However this possibility was dropped in favor of another mechanism for preventing unnecessary I/O: try_queue_bulkout() submits a transfer with less than a full buffer only if there is no other outstanding transfer on the same endpoint. The reason for opting out of the 10 ms timer for this purpose has to do with the original purpose of usbpiper, as a driver for XillyUSB (which didn’t materialize).

Jots on named pipes (FIFOs in Linuxish)

Major disclaimer

These are pretty random jots that I made while evaluating named pipes as a solution for a project. I eventually went for a driver in the kernel for various reasons, so I never got to verify that anything written below is actually correct.

I’ve also written a small post on epoll with named pipes (in Linux, of course) along with a tiny test program.

Linux named pipes (FIFOs)

Section 3.162 of IEEE Std 1003.1-2001, Base Definitions, defines FIFO Special File (or FIFO) trivially, and refers to POSIX IEEE Std 1003.1-2001 for “Other characteristics” related to lseek( ), open( ), read( ), and write( ). In section 3.273 it defines Pipe, and says that it “behaves identically to a FIFO special file”.

From POSIX IEEE Std 1003.1-2001, System Interfaces, Issue 6, line 27153: “When opening a FIFO with O_RDONLY or O_WRONLY set: (…) If O_NONBLOCK is clear, an open( ) for reading-only shall block the calling thread until a thread opens the file for writing. An open( ) for writing-only shall block the calling thread until a thread opens the file for reading.”

Even though the POSIX standard is available for download at a cost of a few hundred dollars, there’s the Open Group Base Specification, which matches it quite well: Base Definitions and System Interfaces.

This is a good source on pipes.

  • The buffer for each pipe is 64 kB by default on Linux
  • An anonymous pipe (created with pipe() ) is just like a named pipe (i.e. a FIFO special file), only that the node is in pipefs, which only the kernel has access to. There are slight differences in how they’re handled.

Reading the source code (fs/pipe.c)

  • This source clearly handles both named and anonymous pipes (the behavior differs slightly).
  • There are F_SETPIPE_SZ and F_GETPIPE_SZ fcntl calls, which clearly allow setting the buffer size. Added in patch 35f3d14dbbc58 from 2010 (v2.6.35).
  • fifo_open(): The is_pipe boolean is true if the FIFO is a pipe() (has a magic of PIPE_FS). In other words, if it’s false, it’s a named pipe.
  • If a FIFO is opened for read or write (not both), the open() call blocks until there’s a partner, unless O_NONBLOCK is set, in which case the behavior is somewhat messy. The comments imply that this is required by POSIX.
  • It’s perfectly fine that a FIFO is opened multiple times both for read and write. Only one reader gets each piece of data, and quite randomly. Writers contribute to the stream independently.
  • FIFOs and pipes implement the FIONREAD ioctl() command for telling how many bytes are immediately ready for reading. This is the only ioctl() command implemented however.
  • The flags returned by epoll_wait() are as understood from pipe_poll(). Not clear what purpose the list of EPOLL* flags in those calls to wake_up_interruptible_sync_poll() has.
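The buffer size fcntl calls and the FIONREAD ioctl() mentioned in this list are easy to try out from Python on Linux. A small sketch, with a few assumptions: the numeric fallbacks 1031 and 1032 are the Linux values of F_SETPIPE_SZ and F_GETPIPE_SZ (recent Python versions export them in the fcntl module), and the function name and its default arguments are mine.

```python
import fcntl
import os
import struct
import termios

# Linux-specific values, used only if this Python's fcntl lacks the names
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)

def pipe_info(nbytes=100, new_size=4096):
    """Return (initial buffer size, size after resize, bytes pending)."""
    r, w = os.pipe()
    try:
        initial = fcntl.fcntl(w, F_GETPIPE_SZ)
        fcntl.fcntl(w, F_SETPIPE_SZ, new_size)   # kernel rounds up to pages
        resized = fcntl.fcntl(w, F_GETPIPE_SZ)
        os.write(w, b"x" * nbytes)
        # FIONREAD fills in an int with the number of readable bytes
        raw = fcntl.ioctl(r, termios.FIONREAD, struct.pack("i", 0))
        pending = struct.unpack("i", raw)[0]
        return initial, resized, pending
    finally:
        os.close(r)
        os.close(w)

print(pipe_info())
```

This uses an anonymous pipe for brevity. The same calls should apply to a file descriptor of a named pipe, since both are handled by the same source file.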

Things to look at:

  • What if a file descriptor open for read is closed and there’s no write() blocking on the other side? Is there a way to get a software notification? A zero-length write always succeeds, even if the other side is closed (this case is handled explicitly before the part that produces the EPIPE / SIGPIPE).
  • Same in the other direction: A descriptor open for write closes, but there’s no read() to get the EOF. Likewise, a zero-length read() always succeeds, even if the other side is closed (which makes sense, because even if it’s closed, reading should continue until the end).
  • Is there a way to wait for the partner without blocking, or must a thread block on it?
  • What happens on a close() followed immediately by an open()?
  • What about user opening the file in the wrong direction?

open() with O_NONBLOCK? (not)

It’s tempting to open a FIFO with O_NONBLOCK, so that no thread needs to block while waiting for the other side to be opened. POSIX IEEE Std 1003.1-2001, System Interfaces, Issue 6, says in the part defining open(), page 836:

When opening a FIFO with O_RDONLY or O_WRONLY set:

  • If O_NONBLOCK is set, an open( ) for reading-only shall return without delay. An open( ) for writing-only shall return an error if no process currently has the file open for reading.
  • If O_NONBLOCK is clear, an open( ) for reading-only shall block the calling thread until a thread opens the file for writing. An open( ) for writing-only shall block the calling thread until a thread opens the file for reading.

In the list of error codes for open(), ENXIO is bound (along with another irrelevant scenario) to: O_NONBLOCK is set, the named file is a FIFO, O_WRONLY is set, and no process has the file open for reading.

Linux’ implementation of FIFOs follows this exactly.

This rules out using O_NONBLOCK for opening a file for write: It will simply not work. As for opening a file for read, it will work, but the epoll_wait() call won’t wake up the process before there is data to read. Merely opening the other side doesn’t generate any event.
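The behavior of open() with O_NONBLOCK on a FIFO is simple to check. Here is a small Python sketch for Linux, which creates a FIFO in a temporary directory (the function name and the return convention are mine):

```python
import errno
import os
import tempfile

def probe_nonblocking_open():
    """Return (errno of write-only open with no reader, read-only open OK)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "fifo")
        os.mkfifo(path)
        try:
            fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
            os.close(fd)
            wr_errno = 0
        except OSError as e:
            wr_errno = e.errno           # expected: ENXIO, nobody is reading
        # Read-only open returns without delay, with no writer around
        rd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
        os.close(rd)
        return wr_errno, True

wr_errno, rd_ok = probe_nonblocking_open()
print(errno.errorcode.get(wr_errno, wr_errno), rd_ok)
```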

Windows named pipes

The basic reading: Microsoft’s introduction and a list of relevant API functions.

General notes:

  • The concept of named pipes in Windows seems to resemble UNIX domain sockets in that it’s formed with a client / server terminology. Unlike domain sockets, the client may “connect” with a plain file open() (actually CreateFile and variants). Hence Windows’ named pipes can be truly full duplex between two processes.
  • But named pipes can also be accessed remotely. The permissions need to be set explicitly to avoid that.
  • The client’s opening of a named pipe is known by the return of ConnectNamedPipe (or a respective event in OVERLAPPED mode).
  • Pipe names are not case-sensitive.
  • A pipe is created (by the server) with access mode PIPE_ACCESS_INBOUND, PIPE_ACCESS_OUTBOUND or PIPE_ACCESS_DUPLEX, indicating its direction, as well as the number of instances — the number of times the pipe can be opened by clients. There’s PIPE_TYPE_BYTE and PIPE_TYPE_MESSAGE types, the former creating a UNIX-like stream, and the second treats writes to the pipe as atomic messages.
  • A named pipe ceases to exist when its number of instances goes to zero. It’s an object, not a file. Hence for a sensible application where a single process generates the named pipe, they vanish when the process terminates.
  • Use PIPE_WAIT even when non-blocking behavior is required. Don’t use the PIPE_NOWAIT flag on creation for achieving non-blocking use of the pipe. Overlapped access is the correct tool. PIPE_NOWAIT is available to allow compatibility with some Microsoft software accident.
  • If a client needs to connect to a pipe which has all its instances already connected, it may wait for its turn with WaitNamedPipe.
  • When finishing a connection with a client, FlushFileBuffers() should be called to flush pending written data (if the client didn’t close the connection first) and then DisconnectNamedPipe().
  • The suitable mode for working is overlapped I/O. This is the official example.
  • There’s a scary remark on this page and this claiming that named pipes are effectively useless for IPC, and that the object’s name has changed, and this is also discussed here. It seems however that this remark doesn’t relate to anything else than UWP apps. Or else it wouldn’t have caused the vulnerability mentioned here, and how could Docker use it this way?

Windows named pipes: Detecting disconnection of client

The word out there is that a disconnection can’t be detected unless there’s an outstanding ReadFile or WriteFile request. This holds true in particular for TCP sockets. So first try this: Make sure there’s only one instance, and let the server call WaitNamedPipeA as soon as a connection is made. This call will hopefully return when the real client disconnects. This will not work if Windows wants the server to disconnect as well before considering the pipe instance vacant enough. It might, because the client is allowed to connect before the server. It all depends on when Windows decrements the reference count and cleans up.

A few epoll jots

Just a few things I wrote down while getting the hang of Linux’ epoll working with a named pipe. There’s also a little test program at Github.

  • Be sure to read this and this.
  • An event list for a file descriptor can be added only once with epoll_ctl(…, EPOLL_CTL_ADD, …). Calling epoll_ctl for adding an event entry for a file descriptor which is already listed results in an EEXIST error (the manual says so, and hey, it also happens).
  • The @events member passed to epoll_ctl() is an OR of all events to watch. The @events member of each entry in the array of events returned by epoll_wait() holds the events that are in effect.
  • It’s fine to register events that are unrelated (i.e. will never happen), not that there’s any reason to do so deliberately.
  • If several events are triggered for the same file descriptor, they are ORed in one array entry by epoll_wait().
  • Without the EPOLLET flag (edge-triggered), the same event keeps appearing endlessly until cleared by some I/O action.
  • In particular, EPOLLHUP is returned continuously on a FIFO (named pipe) opened for read with the other side unopened.
  • Same for EPOLLERR with a FIFO opened for write.
  • In edge-triggered mode (with EPOLLET) an event is generated each time new data is fed or drained on the other side, even if the previous data hasn’t been cleared. In this sense, it isn’t really edge-triggered. Probably the calls to wake_up() (in different variants) in the driver causes this.
  • As expected, if a FIFO is opened for read with O_NONBLOCK, there is no event whatsoever when the other side is opened — only when data arrives.
  • Important: If the counterpart side is closed and then reopened while there is no epoll_wait() blocking, this will go unnoticed. The solution is probably to have a tight loop only picking up events, and let some other thread take it from there.
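Two of the points above (EEXIST on a second EPOLL_CTL_ADD, and the level-triggered event that keeps coming back until some I/O clears it) can be observed with a few lines of Python on Linux. This sketch uses an anonymous pipe rather than a named one to keep it short; the function name is mine.

```python
import errno
import os
import select

def epoll_observations():
    """Return (errno of second add, 1st poll, 2nd poll, poll after read)."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)       # EPOLL_CTL_ADD
        try:
            ep.register(r, select.EPOLLIN)   # adding the same fd again
            second_add = 0
        except OSError as e:
            second_add = e.errno
        os.write(w, b"x")
        first = ep.poll(0)
        second = ep.poll(0)   # no EPOLLET: the same event shows up again
        os.read(r, 1)
        third = ep.poll(0)    # the read cleared it
        return second_add, first, second, third
    finally:
        ep.close()
        os.close(r)
        os.close(w)

print(epoll_observations())
```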

Microsoft’s outlook.com servers and the art of delivering mails to them

Introduction

Still in 2020, it seems like Microsoft lives up to its reputation: Being arrogant, thinking that anyone in business must be a huge corporation, and in particular ending up completely ridiculous. Microsoft’s mail servers, which accept mail on behalf of Hotmail, MSN, Office 365, Outlook.com, or Live.com users are no exception. This also affects companies and other entities which use their own domain names, but use Microsoft’s services for handling mail.

This post summarizes my personal experience and accumulated knowledge with delivering mail to their servers. I use a simple Linux sendmail SMTP MTA on a virtual server for handling the delivery of my own private mails as well as a very low traffic of transactional mails from a web server. All in all, it’s about 100 mails / month coming out from that server to all destinations.

So one server, one IP address with a perfect reputation on all open spam reputation trackers, with SPF, DKIM and DMARC records all in place properly.

One may ask why I’m not relying on existing mail delivery services or my ISP. Answer is simple: Any commercial mail delivery server is likely to have its reputation contaminated by some spammer, no matter what protection measures they take. When that happens, odds are that emails will just disappear, because the ISP has little interest in forwarding the bounce message saying that delivery failed. On a good day, they will be handling the problem quickly, and yet the sender of the lost mail won’t be aware that the correspondence is broken.

For this reason, it’s quite likely that small businesses will go on keeping their own, small, email delivery servers, maintaining their own reputation. So when Outlook’s servers are nasty with a single-IP server, they’re not just arrogant, but they are causing delivery issues with small to medium businesses.

To do when setting up the server

For starter info, go here. Microsoft is pretty upfront about not being friendly to new IP addresses (see troubleshooting page for postmasters).

So it’s a very good idea to create a Microsoft account to log into their services, and then join their Smart Network Data Services (SNDS) and Junk Mail Reporting Program. This is the start page for both of these services.

SNDS allows the owner of a mail server to register its IP address range (“Request Access”), so its status can be monitored (“View IP Status”) over time. When all is fine, the IP Status page says “All of the specified IPs have normal status”, and when they don’t like this or that IP address, it’s more like this:

[Screenshot: Microsoft SNDS blocked IP]

The Junk Mail Reporting Program (JMRP) allows the owner of the mail server to receive notifications (by email) when a mail message is delivered but deemed suspicious, either by an end-user (marking it as spam) or by automatic means. So it’s a good idea to create a special email address for this purpose and fill in the JMRP form. Even for the sake of claiming that you got no complaints when contacting support later on.

Note that this is important for delivery of mail to any institution that relies on Microsoft’s mail infrastructure. A proper IP address blacklist delisting takes you from

Mar 11 20:18:23 sm-mta[5817]: x2BKIL2H005815: to=<xxxxxxx@mit.edu>, delay=00:00:02, xdelay=00:00:02, mailer=esmtp, pri=121914, relay=mit-edu.mail.protection.outlook.com. [104.47.42.36], dsn=5.7.606, stat=User unknown

(but the bounce message indicated that it’s not an unknown user, but a blacklisted IP number) to

Mar 11 21:15:12 sm-mta[6170]: x2BLF8rT006168: to=<xxxxxxx@mit.edu>, delay=00:00:03, xdelay=00:00:03, mailer=esmtp, pri=121915, relay=mit-edu.mail.protection.outlook.com. [104.47.42.36], dsn=2.0.0, stat=Sent (<5C86CFDC.6000206@example.com> [InternalId=11420318042095, Hostname=DM5PR01MB2345.prod.exchangelabs.com] 11012 bytes in 0.191, 56.057 KB/sec Queued mail for delivery)

Note that the session response said nothing about a blacklisted IP, however the bounce message (not shown here) did.
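For keeping an eye on such rejections, the relevant fields can be pulled out of the sendmail log lines with a few regular expressions. This is just a Python sketch based on the two log lines shown above: the function, the returned dictionary’s keys and the rule that a DSN code beginning with “5.” means failure are my own choices, and I haven’t tested it against other log formats.

```python
import re

def parse_delivery(line):
    """Extract recipient, relay, DSN code and status from a sendmail log line."""
    fields = {}
    for key in ("to", "relay", "dsn"):
        m = re.search(r"\b%s=([^,]+)" % key, line)
        if m:
            fields[key] = m.group(1).strip()
    m = re.search(r"\bstat=(.*)$", line)
    if m:
        fields["stat"] = m.group(1)
    # A DSN code of class 5 is a permanent failure
    fields["failed"] = fields.get("dsn", "").startswith("5.")
    return fields

line = ("Mar 11 20:18:23 sm-mta[5817]: x2BKIL2H005815: "
        "to=<xxxxxxx@mit.edu>, delay=00:00:02, xdelay=00:00:02, "
        "mailer=esmtp, pri=121914, "
        "relay=mit-edu.mail.protection.outlook.com. [104.47.42.36], "
        "dsn=5.7.606, stat=User unknown")

print(parse_delivery(line))
```

As the example above shows, the stat= field can be misleading (“User unknown” for a blacklisted IP address), so the DSN code and the bounce message itself remain the things to look at.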

Finally, Microsoft suggests getting a certification from Return Path. A paid-for service, clearly intended for large companies and in particular mass mailers to get their spam delivered. Microsoftish irony at its best.

To do when things go wrong

First things first, read the bounce message. If it says that it’s on Microsoft’s IP blacklist, go to the Office 365 Anti-Spam IP Delist Portal and delist it.

Then check the IP’s status (requires logging in). If you’re blocked, contact support. This doesn’t require a Microsoft login account, by the way. I’m not sure if this link to the support page is valid in the long run, so note that it can also be found on SNDS’ main page (“contact sender support”) as well as on the Troubleshooting page.

My own ridiculous experience

I kicked off my mail server a bit more than a year ago. There was some trouble in the beginning, but that was no surprise. Then things got settled and working for a year, and only then, suddenly & out of the blue, a mail to a Hotmail address bounced with:

Action: failed
Status: 5.7.1
Diagnostic-Code: SMTP; 550 5.7.1 Unfortunately, messages from [193.29.56.92] weren't sent. Please contact your Internet service provider since part of their network is on our block list (S3140). You can also refer your provider to http://mail.live.com/mail/troubleshooting.aspx#errors. [VE1EUR01FT021.eop-EUR01.prod.protection.outlook.com]

And indeed, checking the IP status indicated that it was blocked “because of user complaints or other evidence of spamming”.

So first I went to the mail logs. Low traffic. No indication that the server has been tricked into sending a lot of mails. No indication that it has been compromised in any way. And when a server has been compromised, you know it.

No chance that there were user complaints, because I got nothing from JMRP. So what was the “evidence of spamming”?

My best guess: A handful of transactional mail messages (at most) to their servers for authenticating email addresses, which were marked suspicious by their super software. Putting these messages in quarantine for a few hours is the common solution when that happens. Spam is about volume. If all you got was 4-5 messages, how could that be a spam server? Only if you look at percentage. 100% suspicious. Silly or what?

So I filled in the contact support form, and soon enough I got a message saying a ticket has been opened, and 30 minutes later saying

We have completed reviewing the IP(s) you submitted. The following table contains the results of our investigation.

Not qualified for mitigation
193.29.56.92
Our investigation has determined that the above IP(s) do not qualify for mitigation. These IP(s) have previously received mitigations from deliverability support, and have failed to maintain patterns within our guidelines, so they are ineligible for additional mitigation at this time.

Cute, heh? And that is followed by a lot of general advice, basically copied from the website, recommending to join JMRP and SNDS. Which I had a year earlier, of course. The script that responded didn’t even bother to check that.

But it also said:

To have Deliverability Support investigate further, please reply to this email with a detailed description of the problem you are having, including specific error messages, and an agent will contact you.

And so I did. I wrote that I had joined those two programs a year ago, that the mail volume is low and so on. I doubt it really made a difference. After sending the reply, I got a somewhat automated response rather quickly, now with a more human touch:

Hello,

My name is Ayesha and I work with the Outlook.com Deliverability Support Team.

IP: 193.29.56.92

We will be looking into this issue along with the Escalations Team. We understand the urgency of this issue and will provide an update as soon as this is available. Rest assured that this ticket is being tracked and we will get back to you as soon as we have more information to offer.

Thank you for your patience.

Sincerely,
Ayesha

Outlook.com Deliverability Support

And then, a few days later, another mail:

Hello,

My name is Yaqub and I work with the Outlook.com Deliverability Support Team.

Recent activity coming from your IP(s): ( 193.29.56.92) has been flagged by our system as suspicious, causing your IP to become blocked. I have conducted an investigation into the emails originating from your IP space and have implemented mitigation for your deliverability problem. This process may take 24 – 48 hours to replicate completely throughout our system.

Please note that lifting the block does not guarantee that your email will be delivered to a user’s inbox. However, here are some things that can help you with delivery:

(and here came the same suggestions on JMRP and SNDS)

And about 24 hours later, the IP status went back to OK again. And my emails went through normally.

Well, almost. A few days later, I attempted to send an email to a live.co.uk destination, and once again, I got the same rejection message (in block list, S3140). The only difference was that the mail server on the other side had been hotmail-com.olc.protection.outlook.com (residing in the US), and was now eur.olc.protection.outlook.com (somewhere in Europe).

I checked the IP’s status in SNDS and it was fine. So updating the Europeans on the updated IP status takes a bit of time, or what?

So I replied to the last email I got from Microsoft’s support, saying it failed with live.co.uk. I didn’t get any reply, but a few hours later I tried again, and the mail went through. Coincidence or not.

This time I also caught the related messages from the mail log. It’s

May 01 15:10:28 sm-mta[2239]: 041FASMh002237: to=<xxxxx@live.co.uk>, ctladdr=<eli@billauer.co.il> (510/500), delay=00:00:00, xdelay=00:00:00, mailer=esmtp, pri=121816, relay=eur.olc.protection.outlook.com. [104.47.1.33], dsn=5.0.0, stat=Service unavailable
May 01 15:10:28 sm-mta[2239]: 041FASMh002237: 041FASMh002239: DSN: Service unavailable

for a failure, and

May 02 06:23:00 sm-mta[4024]: 0426Mx1I004021: to=<xxxxx@live.co.uk>, ctladdr=<eli@billauer.co.il> (510/500), delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=121808, relay=eur.olc.protection.outlook.com. [104.47.18.97], dsn=2.0.0, stat=Sent (<5EAD11C3.20105@billauer.co.il> [InternalId=21887153366859, Hostname=AM6EUR05HT060.eop-eur05.prod.protection.outlook.com] 10627 bytes in 0.246, 42.064 KB/sec Queued mail for delivery -> 250 2.1.5)

for success.

Lesson learned: Contact support and insist.

And the lesson to all those using Microsoft’s mail services: Your provider cuts off your email contacts arbitrarily. Because they are Microsoft.


Firejail: Putting a program in its own little container

Introduction

Firejail is a lightweight security utility which ties the hands of running processes, somewhat like AppArmor and SELinux. However it takes the mission towards the Linux kernel’s cgroups and namespaces. It’s in fact a bit of a container-style virtualization utility, which creates sandboxes for running specific programs: Instead of a container for an entire operating system, it makes one for each application (i.e. the main process and its children). Rather than disallowing access to files and directories by virtue of permissions, it simply makes sure they aren’t visible to the processes. Same goes for networking.

By virtue of cgroups, several security restrictions are also put in place regardless, if so desired. Certain syscalls can be prevented etc. But at the end of the day, think container virtualization. A sandbox is created, and everything happens inside it. It’s also easy to add processes to an existing sandbox (in particular, start a new shell). Not to mention the joy of shutting down a sandbox, that is, killing all processes inside it.

While the main use of Firejail is to protect the file system from access and tampering by malicious or infected software, it also allows more or less everything that a container-style virtual machine does: Control of network traffic (volume, dedicated firewall, which physical interfaces are exposed) as well as activity (how many subprocesses, CPU and memory utilization etc.). And like a virtual machine, it also allows statistics on resource usage.

Plus spoofing the host name, restricting access to sound devices, X11 capabilities and a whole range of stuff.

And here’s the nice thing: It doesn’t require root privileges to run. Sort of. The firejail executable is run with setuid.

It’s however important to note that firejail doesn’t create a stand-alone container. Rather, it mixes and matches files from the real file system and overrides selected parts of the directory tree with temporary mounts. Or overlays. Or whiteouts.

In fact, compared with the accurate rules of a firewall, its behavior is quite loose and inaccurate. For a newbie, it’s a bit difficult to predict exactly what kind of sandbox it will set up given this or other setting. It throws in all kind of files of its own into the temporary directories it creates, which is very helpful to get things up and running quickly, but that doesn’t give a feeling of control.

Generally speaking, everything that isn’t explicitly handled by blacklisting or whitelisting (see below) is accessible in the sandbox just like outside it. In particular, it’s the user’s responsibility to hide away all those system-specific mounted filesystems (do you call them /mnt/storage?). If desired, of course.

Major disclaimer: This post is not authoritative in any way, and contains my jots as I get to know the beast. In particular, I may mislead you to think something is protected even though it’s not. You’re responsible for your own decisions.

The examples below are with firejail version 0.9.52 on a Linux Mint 19.

Install

# apt install firejail
# apt install firetools

By all means, go

$ man firejail

after installation. It’s also worth looking at /etc/firejail/ to get an idea of what protection measures are typically used.

Key commands

Launch FireTools, a GUI front end:

$ firetools &

And the “Tools” part has a nice listing of running sandboxes (right-click the ugly thing that comes up).

Now some command line examples. I name the sandboxes in these examples, but I’m not sure it’s worth bothering.

List existing sandboxes (or use FireTools, right-click the panel and choose Tools):

$ firejail --list

Assign a name to a sandbox when creating it

$ firejail --name=mysandbox firefox

Shut down a sandbox (kill all its processes, and clean up):

$ firejail --shutdown=mysandbox

If a name wasn’t assigned, the PID given in the list can be used instead.

Disallow the root user in the sandbox

$ firejail --noroot

Create overlay filesystem (mounts read/write, but changes are kept elsewhere)

$ firejail --overlay firefox

There’s also --overlay-tmpfs for overlaying tmpfs only, as well as --overlay-clean to clean the overlays, which are stored in $HOME/.firejail.

To create a completely new home directory (and /root) as temporary filesystems (private browsing style), so they are volatile:

$ firejail --private firefox

Better still,

$ firejail --private=/path/to/extra-homedir firefox

This uses the directory in the given path as a persistent home directory (some basic files are added automatically). This path can be anywhere in the filesystem, even in parts that are otherwise hidden (i.e. blacklisted) to the sandbox. So this is probably the most appealing choice in most scenarios.

Don’t get too excited, though: Other mounted filesystems remain unprotected (at different levels). This just protects the home directory.

By default, a whole bunch of security rules are loaded when firejail is invoked. To start the container without this:

$ firejail --noprofile

A profile can be selected with the --profile=filename flag.

Writing a profile

If you really want to have a sandbox that protects your computer with relation to a specific piece of software, you’ll probably have to write your own profile. It’s no big deal, except that it’s a bit of trial and error.

First read the manpage:

$ man firejail-profile

It’s easiest to start from a template: Launch FireTools from a shell, right-click the ugly thing that comes up, pick “Configuration Wizard”, and create a custom security profile for one of the listed applications (the one that most resembles the application for which the profile is being set up).

Then launch the application from FireTools. The takeaway is that it writes out the configuration file to the console. Start with that.

Whitelisting and blacklisting

First and foremost: Always run a

$ df -h

inside the sandbox to get an idea of what is mounted. Blacklist anything that isn’t necessary. Doing so to entire mounts removes the related mount from the df -h list, which makes it easier to spot things that shouldn’t be there.
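
To make the “blacklist anything that isn’t necessary” step less manual, the list of mounts can be turned into candidate profile lines. The following is only a sketch: candidate_blacklist, the PSEUDO_FS set and the keep list are names made up for this illustration, and the result is a starting point to review by hand, not a verdict on what is safe.

```python
# Filesystem types that are usually kernel-provided pseudo filesystems.
# This set is an assumption for the sketch, not an exhaustive list.
PSEUDO_FS = {"proc", "sysfs", "devtmpfs", "devpts", "tmpfs", "cgroup",
             "cgroup2", "securityfs", "debugfs", "mqueue", "pstore"}


def candidate_blacklist(mounts_text, keep=("/",)):
    """Return "blacklist <dir>" lines for mounts worth reviewing.

    mounts_text is in the format of /proc/mounts: device, mount point,
    filesystem type, options and two numeric fields on each line.
    """
    lines = []
    for row in mounts_text.splitlines():
        fields = row.split()
        if len(fields) < 3:
            continue  # Not a mount entry
        mountpoint, fstype = fields[1], fields[2]
        if fstype in PSEUDO_FS or mountpoint in keep:
            continue
        lines.append("blacklist " + mountpoint)
    return lines


# Made-up sample. Inside a sandbox, read /proc/mounts instead.
sample = """/dev/sda1 / ext4 rw,relatime 0 0
proc /proc proc rw,nosuid 0 0
/dev/sdb1 /mnt/storage ext4 rw,relatime 0 0
"""
print("\n".join(candidate_blacklist(sample)))
```

With the made-up sample, only /mnt/storage is suggested. Whether the root filesystem and the pseudo filesystems should really be left alone depends on the use case.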

It’s also a good idea to start a sample bash session with the sandbox, and get into the File Manager in FireTools’ “Tools” section for each sandbox.

But then, what is whitelisting and blacklisting, exactly? These two terms are used all over the docs, somehow assuming we know what they mean. So I’ll try to nail it down.

Whitelisting isn’t anywhere near what one would think it is: By whitelisting certain files and/or directories, the original files/directories appear in the sandbox but all other files in their vicinity are invisible. Also, changes in the same vicinity are temporary to the sandbox session. The idea seems to be that if files and/or directories are whitelisted, everything else close to it should be out of sight.

Or as put in the man page:

A temporary file system is mounted on the top directory, and the whitelisted files are mount-binded inside. Modifications to whitelisted files are persistent, everything else is discarded when the sandbox is closed. The top directory could be user home, /dev, /media, /mnt, /opt, /srv, /var, and /tmp.

So for example, if any file or directory in the home directory is whitelisted, the entire home directory becomes overridden by an almost empty home directory plus the specifically whitelisted items. For example, from my own home directory (which is populated with a lot of files):

$ firejail --noprofile --whitelist=/home/eli/this-directory
Parent pid 31560, child pid 31561
Child process initialized in 37.31 ms

$ find .
.
./.config
./.config/pulse
./.config/pulse/client.conf
./this-directory
./this-directory/this-file.txt
./.Xauthority
./.bashrc

So there are just a few temporary files that firejail was kind enough to add for convenience. Changes made in this-directory/ are persistent since it’s bind-mounted into the temporary directory, but everything else is temporary.

Quite unfortunately, it’s not possible to whitelist a directory outside the specific list of hierarchies (unless bind mounting is used, but that requires root). So if the important stuff is on some /hugedisk, only a bind mount will help (or is this the punishment for not putting it as /mnt/hugedisk?).

But note that the --private= flag allows setting the home directory to anywhere on the filesystem (even inside a blacklisted region). This ad-hoc home directory is persistent, so it’s not like whitelisting, but even better in some scenarios.

Alternatively, it’s possible to blacklist everything but a certain part of a mount. That’s a bit tricky, because if a new directory appears after the rules are set, it remains unprotected. I’ll explain why below.

Or if that makes sense, make the entire directory tree read-only, with only a selected part read-write. That’s fine if there’s no issue with data leaking, just the possibility of malware sabotage.

So now to blacklisting: Firejail implements blacklisting by mounting an empty, read-only-by-root file or directory on top of the original file. And indeed,

$ firejail --blacklist=delme.txt
Reading profile /etc/firejail/default.profile
Reading profile /etc/firejail/disable-common.inc
Reading profile /etc/firejail/disable-passwdmgr.inc
Reading profile /etc/firejail/disable-programs.inc

** Note: you can use --noprofile to disable default.profile **

Parent pid 30288, child pid 30289
Child process initialized in 57.75 ms
$ ls -l
[ ... ]
-r--------  1 nobody nogroup     0 Jun  9 22:12 delme.txt
[ ... ]
$ less delme.txt
delme.txt: Permission denied

There are --noblacklist and --nowhitelist flags as well. However, these merely cancel future or automatic black- or whitelistings. In particular, one can’t blacklist a directory and whitelist a subdirectory. It would have been very convenient, but since the parent directory is overridden with a whiteout directory, there is no access to the subdirectory. So each and every subdirectory must be blacklisted separately with a script or something, and even then if a new subdirectory pops up, it’s not protected at all.
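
The “blacklist each subdirectory separately with a script” workaround could look something like this. It’s a sketch under assumptions: blacklist_all_but is a made-up helper, /hugedisk and the directory names are placeholders, and the output is meant to be pasted into a profile. Directories created after the list was generated are not covered.

```python
import os


def blacklist_all_but(parent, keep, entries=None):
    """Return profile lines blacklisting everything in parent except keep.

    entries defaults to the current content of parent. Passing it
    explicitly allows trying this out without touching the disk.
    """
    if entries is None:
        entries = os.listdir(parent)
    return ["blacklist " + os.path.join(parent, name)
            for name in sorted(entries) if name not in keep]


# Placeholder names: keep /hugedisk/shared visible, hide the rest.
for line in blacklist_all_but("/hugedisk", {"shared"},
                              entries=["private", "shared", "backup"]):
    print(line)
```

Regenerating the list every time the sandbox is started narrows the window in which a new directory is exposed, but doesn’t close it.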

There’s also a --read-only flag, which allows setting certain paths and files as read-only. There’s --read-write too, of course. When a directory or file is whitelisted, it must be flagged read-only separately if so desired (see man firejail).

Mini-strace

Trace all processes in the sandbox (in particular accesses to files and network). Much easier than using strace, when all we want is “which files are accessed?”

$ firejail --trace

And then just run any program to see what files and network sockets it accesses. And things of that sort.

Linux Wine jots

General

These are just a few jots on Wine. I guess this post will evolve over time.

I’m running Wine version 4.0 on Linux Mint 19, running on an x86_64.

First run

Every time Wine is run on a blank (or absent) directory given by WINEPREFIX, it installs a Windows environment. Which Windows version and several other attributes can be set with Wine Configuration:

$ WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/winecfg

It often suggests to install Wine Mono and Wine Gecko. I usually tend to agree.

This installation downloads three files into .cache/wine/: wine_gecko-2.47-x86_64.msi, wine_gecko-2.47-x86.msi and wine-mono-4.7.5.msi. This is why Wine doesn’t ask for permission to install these when setting up new Windows environments after the first time.

Install and use Winetricks

It’s a good idea in general, and it allows easy installation of Microsoft runtime environments:

# apt install winetricks
# apt install wine32-development

And now to install the Visual Studio 6 runtime environment, for example (solving some error message on not being able to import isskin.dll or isskinu.dll):

$ WINEPREFIX=/path/to/winedir winetricks vcrun6

Prevent browser popup

Wine has this thing that it opens a browser when so requested by the Windows application. That can be annoying at times, and get the program stuck when run inside a firejail. To prevent this altogether, just delete two files:

  • drive_c/windows/syswow64/winebrowser.exe
  • drive_c/windows/system32/winebrowser.exe
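
If there are several Wine directories to treat, the deletion can be scripted. This is a minimal sketch, assuming the two paths listed above relative to WINEPREFIX. The name remove_winebrowser is made up for the illustration.

```python
from pathlib import Path

# The two files listed above, relative to the WINEPREFIX directory.
BROWSER_FILES = (
    "drive_c/windows/syswow64/winebrowser.exe",
    "drive_c/windows/system32/winebrowser.exe",
)


def remove_winebrowser(wineprefix):
    """Delete winebrowser.exe from a Wine directory, if present.

    Returns the list of paths that were actually removed.
    """
    removed = []
    for relative in BROWSER_FILES:
        target = Path(wineprefix) / relative
        if target.is_file():
            target.unlink()
            removed.append(str(target))
    return removed


# Usage, with the same placeholder path as in the commands of this post:
# print(remove_winebrowser("/path/to/winedir"))
```

Running it a second time on the same directory returns an empty list, so it’s harmless to repeat.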

Open explorer

The simplest way to start: Open the file explorer:

$ WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/wine explorer

DOS command line

$ WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/wine cmd

This is better than expected: The command session is done directly in the console (no new window opened). Like invoking a shell.

Use with firejail

Windows equals viruses, and Wine doesn’t offer any protection against that. Since the entire filesystem is accessible from Z: (more on that below), it’s a good idea to run Wine from within a firejail mini-container. I have a separate post on firejail.

The execution of the program then looks something like (non-root user):

$ firejail --profile=~/my.profile --env=WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/wine 'C:\Program Files\Malsoft\Malsoft.exe' &

The my.profile file depends on what the Windows program is expected to do. I discuss that briefly in that post, however this is something that worked for me:

include /etc/firejail/disable-common.inc
include /etc/firejail/disable-passwdmgr.inc
private-tmp
private-dev

# All relevant directories are read-only by default, not /opt. So add it.
read-only /opt
#
# This whitelisting protects the entire home directory.
# .cache/wine is where the Gecko + Mono installation files are kept.
# They can't be downloaded, because of "net none" below
mkdir ~/sandboxed/
mkdir ~/.cache/wine
whitelist ~/sandboxed/
whitelist ~/.cache/wine

net none
nonewprivs
caps.drop all
noroot
# blacklist everything that can be harmed
#
blacklist /mnt
blacklist /cdrom
blacklist /media
blacklist /boot

Notes:

  • Note the “net none” part. Networking completely disabled. No access to the internet nor the local network.
  • Be sure to blacklist any system-specific mount, in particular those that are writable by the regular user. Do you have a /hugestorage mount? That one.
  • There’s a seccomp filter option that often appears in template profiles. It got a program in Wine completely stuck. It prevents certain system calls, so no doubt it adds safety, but it got in the way of something in my case.

Poor man’s sandboxing

If you’re too lazy to use firejail, you can remove some access to the local storage by virtue of Wine’s file system bindings. This is worth almost nothing, but almost nothing is more than nothing.

$ WINEPREFIX=/path/to/winedir /opt/wine-stable/bin/winecfg

In the “Drives” tab, remove Z:, and in the Desktop Integration tab, go through each of the folders and uncheck “Link to”.

This doesn’t prevent a Wine-aware Windows program from accessing the machine through the plain Linux API with your user permissions, just like any Linux program, and the root directory is still visible in Windows’ file browsing utilities. Yet, simple Windows programs expect any file system to be mapped to a drive letter, and these steps prevent that. Not much, but once again, better than nothing.

LG OLED with a Linux computer: Getting that pitch black

Introduction

So I got myself an LG OLED65B9. It’s huge and a really nice piece of electronics. I opted out of the bells and whistles, and connected it via HDMI to my already existing media computer, running Linux Mint 18.1. All I wanted was a plain (yet very high quality) display.

However at some point I noticed that black wasn’t displayed as black. I opened GIMP, drew a huge black rectangle, and it displayed as dark grey. At first I thought that the screen was defective (or that I was overly optimistic expecting that black would be complete darkness), but then I tried an image from a USB stick, and reassured myself that black is displayed as complete darkness. As it should. Or why did I pay extra for an OLED?

Because I skipped the “play with the new toy” phase with this display, I’m 100% sure it’s at its factory settings. It’s not something I messed up.

I should mention that I use plain HD resolution of 1920x1080. The screen can do much better than that (see list of resolutions below), and defaults to 3840x2160 with my computer, but it’s quite pointless: I don’t know about you, but I have nothing to show that goes higher than 1080p. And the computer’s graphics stutters at 4k UHD. So why push it?

I have a previous post on graphics modes, and one on the setup of the Brix media center computer involved.

So why is black displayed LCD style?

The truth is that I don’t know. But it seems to be a problem only with standard 16:9 graphics modes. When switching to modes that are typical for computers (5:4 and 4:3 aspect ratios), the image was stretched to the entire screen, and black areas showed as pitch black. I’m not sure about this conclusion, and even less do I have an idea why this would happen or why a properly designed display would “correct” a pixel that arrives as RGB all zeros to something brighter.

Also, the black level on text consoles (Ctrl-Alt-F1 style) is horrible. But I’m not sure about which resolution they use.

An idea that crossed my mind is that maybe the pixels are sent as YCbCr in some modes or maybe the computer goes “Hey, I’m a TV now, let’s do some color correction nobody asked for” when standard HDTV aspect ratios are used. If anything, I would go for the second possibility. But xrandr’s verbose output implies that both brightness and gamma are set to 1.0 for the relevant HDMI output, even when black isn’t black.
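
A related possibility, which is not verified on this setup: The “Broadcast RGB” property in the xrandr listing further down is set to Automatic, and one of its supported values is “Limited 16:235”. If the driver picks the limited range for standard TV modes while the display treats the signal as full range, black arrives as level 16 rather than 0. The arithmetic of that mapping, as a sketch (full_to_limited is a made-up name, and the linear 8-bit scaling is the usual convention, assumed here rather than read from the driver):

```python
def full_to_limited(value):
    """Map a full-range 8-bit level (0..255) to the limited range (16..235)."""
    return 16 + round(value * 219 / 255)


# Black, mid grey and white in full range, and what limited range makes of them
for level in (0, 128, 255):
    print(level, "->", full_to_limited(level))
```

So an all-zero pixel would go out as 16 out of 255, which is a dark grey on a display that expects full range. If that is what happens, setting the property to Full with xrandr’s --set option might be worth a try as an alternative to the fix below, but that is a guess.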

The graphics adapter is the Intel Celeron J3160’s on-chip “HD Graphics” processor (8086:22b1), so nothing fancy here.

The fix (for now?)

This just worked for me, and I didn’t feel like playing with it further. So I can’t assure that this is a consistent solution, but anyhow.

The idea is that since the problem arises with standard 16:9 modes, maybe make up a non-standard one?

Unlike the case with my previous TV, using cvt to calculate the timing parameters turned out to be a good idea.

$ cvt 1920 1080 60
# 1920x1080 59.96 Hz (CVT 2.07M9) hsync: 67.16 kHz; pclk: 173.00 MHz
Modeline "1920x1080_60.00"  173.00  1920 2048 2248 2576  1080 1083 1088 1120 -hsync +vsync
$ xrandr -d :0 --newmode "try" 173.00  1920 2048 2248 2576  1080 1083 1088 1120 -hsync +vsync
$ xrandr -d :0 --addmode HDMI3 try
$ xrandr -d :0 --output HDMI3 --mode try

At this point I got a proper 1920x1080 on the screen, with black pixels as dark as when the display is powered off. The output of xrandr after this was somewhat unexpected, yet functionally what I wanted:

$ xrandr -d :0 --verbose
  1280x720 (0x4b) 74.250MHz +HSync +VSync +preferred
        h: width  1280 start 1390 end 1430 total 1650 skew    0 clock  45.00KHz
        v: height  720 start  725 end  730 total  750           clock  60.00Hz
  1920x1080 (0x141) 173.000MHz -HSync +VSync *current
        h: width  1920 start 2048 end 2248 total 2576 skew    0 clock  67.16KHz
        v: height 1080 start 1083 end 1088 total 1120           clock  59.96Hz
  1920x1080 (0x10c) 148.500MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  67.50KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  60.00Hz
 [ ... ]
  try (0x13e) 173.000MHz -HSync +VSync
        h: width  1920 start 2048 end 2248 total 2576 skew    0 clock  67.16KHz
        v: height 1080 start 1083 end 1088 total 1120           clock  59.96Hz

So the mode in effect didn’t turn out to be the one I generated (“try”), but a replica of its parameters, marked as 0x141 (and 0x13a on another occasion). This mode wasn’t there before.
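
For what it’s worth, the clock figures in the listing above follow directly from the pixel clock and the total line and frame lengths, which makes it easy to compare the made-up mode with the standard one. A minimal sketch (modeline_rates is a made-up helper name):

```python
def modeline_rates(pclk_mhz, htotal, vtotal):
    """Return (horizontal rate in kHz, refresh rate in Hz) of a mode."""
    hsync_khz = pclk_mhz * 1000.0 / htotal   # Lines per second, in kHz
    refresh_hz = hsync_khz * 1000.0 / vtotal  # Frames per second
    return hsync_khz, refresh_hz


# The mode generated by cvt vs. the standard 1920x1080 mode at 148.5 MHz
for name, pclk, htotal, vtotal in (("cvt", 173.00, 2576, 1120),
                                   ("standard", 148.50, 2200, 1125)):
    hsync, refresh = modeline_rates(pclk, htotal, vtotal)
    print(f"{name}: hsync {hsync:.2f} kHz, refresh {refresh:.2f} Hz")
```

This gives 67.16 kHz and 59.96 Hz for the cvt mode, and 67.50 kHz and 60.00 Hz for the standard one, matching xrandr’s output. The two modes are nearly identical in rates, so whatever makes the difference for the black level is probably not the timing as such.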

I don’t quite understand how this happened. Maybe Cinnamon’s machinery did this. It kind of gets in the way all the time, and at times it didn’t let me set just any mode I liked with xrandr, so maybe that. This whole thing with graphics modes is completely out of control.

I should mention that there is no problem with sound in this mode (or any other situation I tried). Not that there should be, but at some point I thought maybe there would be, because the mode implies a computer and not a TV-something. But no issues at all. Actually, the screen’s loudspeakers are remarkably good, with a surprisingly present bass, but that’s a different story.

List of graphics modes

Just in case this interests anyone, this is the output of a full resolution list:

$ xrandr -d :0 --verbose
[ ... ]
HDMI3 connected primary 3840x2160+0+0 (0x1ba) normal (normal left inverted right x axis y axis) 1600mm x 900mm
	Identifier: 0x48
	Timestamp:  -1469585217
	Subpixel:   unknown
	Gamma:      1.0:1.0:1.0
	Brightness: 1.0
	Clones:
	CRTC:       0
	CRTCs:      0
	Transform:  1.000000 0.000000 0.000000
	            0.000000 1.000000 0.000000
	            0.000000 0.000000 1.000000
	           filter:
	EDID:
		00ffffffffffff001e6da0c001010101
		011d010380a05a780aee91a3544c9926
		0f5054a1080031404540614071408180
		d1c00101010104740030f2705a80b058
		8a0040846300001e023a801871382d40
		582c450040846300001e000000fd0018
		781e871e000a202020202020000000fc
		004c472054560a20202020202020012d
		02035af1565f101f0413051403021220
		212215015d5e6263643f403209570715
		07505707016704033d1ec05f7e016e03
		0c001000b83c20008001020304e200cf
		e305c000e50e60616566eb0146d0002a
		1803257d76ace3060d01662150b05100
		1b304070360040846300001e00000000
		0000000000000000000000000000008b
	aspect ratio: Automatic
		supported: Automatic, 4:3, 16:9
	Broadcast RGB: Automatic
		supported: Automatic, Full, Limited 16:235
	audio: auto
		supported: force-dvi, off, auto, on
  3840x2160 (0x1ba) 297.000MHz +HSync +VSync *current +preferred
        h: width  3840 start 4016 end 4104 total 4400 skew    0 clock  67.50KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  30.00Hz
  4096x2160 (0x1bb) 297.000MHz +HSync +VSync
        h: width  4096 start 5116 end 5204 total 5500 skew    0 clock  54.00KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  24.00Hz
  4096x2160 (0x1bc) 296.703MHz +HSync +VSync
        h: width  4096 start 5116 end 5204 total 5500 skew    0 clock  53.95KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  23.98Hz
  3840x2160 (0x1bd) 297.000MHz +HSync +VSync
        h: width  3840 start 4896 end 4984 total 5280 skew    0 clock  56.25KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  25.00Hz
  3840x2160 (0x1be) 297.000MHz +HSync +VSync
        h: width  3840 start 5116 end 5204 total 5500 skew    0 clock  54.00KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  24.00Hz
  3840x2160 (0x1bf) 296.703MHz +HSync +VSync
        h: width  3840 start 4016 end 4104 total 4400 skew    0 clock  67.43KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  29.97Hz
  3840x2160 (0x1c0) 296.703MHz +HSync +VSync
        h: width  3840 start 5116 end 5204 total 5500 skew    0 clock  53.95KHz
        v: height 2160 start 2168 end 2178 total 2250           clock  23.98Hz
  1920x1080 (0x1c1) 297.000MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock 135.00KHz
        v: height 1080 start 1084 end 1089 total 1125           clock 120.00Hz
  1920x1080 (0x1c2) 297.000MHz +HSync +VSync
        h: width  1920 start 2448 end 2492 total 2640 skew    0 clock 112.50KHz
        v: height 1080 start 1084 end 1094 total 1125           clock 100.00Hz
  1920x1080 (0x1c3) 296.703MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock 134.87KHz
        v: height 1080 start 1084 end 1089 total 1125           clock 119.88Hz
  1920x1080 (0x16c) 148.500MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  67.50KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  60.00Hz
  1920x1080 (0x1c4) 148.500MHz +HSync +VSync
        h: width  1920 start 2448 end 2492 total 2640 skew    0 clock  56.25KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  50.00Hz
  1920x1080 (0x16d) 148.352MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  67.43KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  59.94Hz
  1920x1080i (0x10c) 74.250MHz +HSync +VSync Interlace
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.75KHz
        v: height 1080 start 1084 end 1094 total 1125           clock  60.00Hz
  1920x1080i (0x10d) 74.250MHz +HSync +VSync Interlace
        h: width  1920 start 2448 end 2492 total 2640 skew    0 clock  28.12KHz
        v: height 1080 start 1084 end 1094 total 1125           clock  50.00Hz
  1920x1080 (0x1c5) 74.250MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.75KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  30.00Hz
  1920x1080 (0x1c6) 74.250MHz +HSync +VSync
        h: width  1920 start 2448 end 2492 total 2640 skew    0 clock  28.12KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  25.00Hz
  1920x1080 (0x1c7) 74.250MHz +HSync +VSync
        h: width  1920 start 2558 end 2602 total 2750 skew    0 clock  27.00KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  24.00Hz
  1920x1080i (0x10e) 74.176MHz +HSync +VSync Interlace
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.72KHz
        v: height 1080 start 1084 end 1094 total 1125           clock  59.94Hz
  1920x1080 (0x1c8) 74.176MHz +HSync +VSync
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.72KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  29.97Hz
  1920x1080 (0x1c9) 74.176MHz +HSync +VSync
        h: width  1920 start 2558 end 2602 total 2750 skew    0 clock  26.97KHz
        v: height 1080 start 1084 end 1089 total 1125           clock  23.98Hz
  1280x1024 (0x1b5) 108.000MHz +HSync +VSync
        h: width  1280 start 1328 end 1440 total 1688 skew    0 clock  63.98KHz
        v: height 1024 start 1025 end 1028 total 1066           clock  60.02Hz
  1360x768 (0x4b) 85.500MHz +HSync +VSync
        h: width  1360 start 1424 end 1536 total 1792 skew    0 clock  47.71KHz
        v: height  768 start  771 end  777 total  795           clock  60.02Hz
  1152x864 (0x1ca) 81.579MHz -HSync +VSync
        h: width  1152 start 1216 end 1336 total 1520 skew    0 clock  53.67KHz
        v: height  864 start  865 end  868 total  895           clock  59.97Hz
  1280x720 (0x110) 74.250MHz +HSync +VSync
        h: width  1280 start 1390 end 1430 total 1650 skew    0 clock  45.00KHz
        v: height  720 start  725 end  730 total  750           clock  60.00Hz
  1280x720 (0x111) 74.250MHz +HSync +VSync
        h: width  1280 start 1720 end 1760 total 1980 skew    0 clock  37.50KHz
        v: height  720 start  725 end  730 total  750           clock  50.00Hz
  1280x720 (0x112) 74.176MHz +HSync +VSync
        h: width  1280 start 1390 end 1430 total 1650 skew    0 clock  44.96KHz
        v: height  720 start  725 end  730 total  750           clock  59.94Hz
  1024x768 (0x113) 65.000MHz -HSync -VSync
        h: width  1024 start 1048 end 1184 total 1344 skew    0 clock  48.36KHz
        v: height  768 start  771 end  777 total  806           clock  60.00Hz
  800x600 (0x115) 40.000MHz +HSync +VSync
        h: width   800 start  840 end  968 total 1056 skew    0 clock  37.88KHz
        v: height  600 start  601 end  605 total  628           clock  60.32Hz
  720x576 (0x116) 27.000MHz -HSync -VSync
        h: width   720 start  732 end  796 total  864 skew    0 clock  31.25KHz
        v: height  576 start  581 end  586 total  625           clock  50.00Hz
  720x576i (0x117) 13.500MHz -HSync -VSync Interlace
        h: width   720 start  732 end  795 total  864 skew    0 clock  15.62KHz
        v: height  576 start  580 end  586 total  625           clock  50.00Hz
  720x480 (0x118) 27.027MHz -HSync -VSync
        h: width   720 start  736 end  798 total  858 skew    0 clock  31.50KHz
        v: height  480 start  489 end  495 total  525           clock  60.00Hz
  720x480 (0x119) 27.000MHz -HSync -VSync
        h: width   720 start  736 end  798 total  858 skew    0 clock  31.47KHz
        v: height  480 start  489 end  495 total  525           clock  59.94Hz
  640x480 (0x11c) 25.200MHz -HSync -VSync
        h: width   640 start  656 end  752 total  800 skew    0 clock  31.50KHz
        v: height  480 start  490 end  492 total  525           clock  60.00Hz
  640x480 (0x11d) 25.175MHz -HSync -VSync
        h: width   640 start  656 end  752 total  800 skew    0 clock  31.47KHz
        v: height  480 start  490 end  492 total  525           clock  59.94Hz
  720x400 (0x1cb) 28.320MHz -HSync +VSync
        h: width   720 start  738 end  846 total  900 skew    0 clock  31.47KHz
        v: height  400 start  412 end  414 total  449           clock  70.08Hz

So it even supports fallback mode with a 25.175 MHz clock if one really insists.

When umount says target is busy, but no process can be blamed

A short one: What to do if unmount is impossible with a

# umount /path/to/mount
umount: /path/to/mount: target is busy

but grepping the output of lsof for the said path yields nothing. In other words, the mount is busy, but no process can be blamed for accessing it (even as a home directory).

If this happens, odds are that the filesystem is exported over NFS, and is held by some remote machine that mounted it. The access might have been over long ago, but the mount is still considered busy. So the solution for this case is simple: Restart the NFS daemon. On Linux Mint 19 (and probably a lot of others) it’s simply

# systemctl restart nfs-server

and after this, umount is successful (hopefully…)
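
Before restarting the daemon, it may be worth checking whether anything under the mount point is exported at all. The sketch below assumes that the exports are listed in /etc/exports with the exported path as the first field of each line, and it ignores quoting and line continuations. The name exports_under is made up.

```python
import os


def exports_under(exports_text, mountpoint):
    """Return the exported paths that are at or below mountpoint."""
    mountpoint = os.path.normpath(mountpoint)
    prefix = mountpoint.rstrip("/") + "/"
    hits = []
    for line in exports_text.splitlines():
        line = line.split("#", 1)[0].strip()  # Drop comments
        if not line:
            continue
        path = os.path.normpath(line.split()[0])
        if path == mountpoint or path.startswith(prefix):
            hits.append(path)
    return hits


# Made-up content in the style of /etc/exports
sample = """# shared stuff
/path/to/mount/share 192.168.1.0/24(rw,sync)
/srv/other *(ro)
"""
print(exports_under(sample, "/path/to/mount"))
```

An empty list means that NFS is probably not to blame, and the search should go on elsewhere.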

Turning off DSN on sendmail to prevent backscatter

I sent that?

One morning, I got a bounce message from my own sendmail mail server, saying that it failed to deliver a message I never sent. That’s red alert. It means that someone managed to provoke my mail server to send an outbound message. It’s red alert, because my mail server effectively relays spam to any destination that the spammer chooses. This could ruin the server’s reputation horribly.

It turned out that an arriving mail required a return receipt, which was destined to just some mail address. There’s an SMTP feature called Delivery Status Notification (DSN), which allows the client connecting to the mail server to ask for a mail “in return”, informing the sender of the mail if it was properly delivered. The problem is that the MAIL FROM / From addresses could be spoofed, pointing at a destination to spam. Congratulations, your mail server was just tricked into sending spam. This kind of trickery is called backscatter.

Checking my own mail logs, the DSN is a virtually unused feature. So it’s probably just something spammers can take advantage of.

The relevant RFC for DSN is RFC 1891 (since obsoleted by RFC 3461). Further explanations on DSN can be found in one of sendmail’s tutorial pages.

How to turn DSN off

First, I recommend checking if it’s not disabled already, as explained below. In particular, if the paranoid-level “goaway” privacy option is used, DSN is turned off anyhow.

It’s actually easy. Add the noreceipts option to PrivacyOptions. More precisely, edit /etc/mail/sendmail.mc and add noreceipts to the list of already existing options. In my case, it ended up as

define(`confPRIVACY_FLAGS',dnl
`needmailhelo,needexpnhelo,needvrfyhelo,restrictqrun,restrictexpand,nobodyreturn,noetrn,noexpn,novrfy,noactualrecipient,noreceipts')dnl

and then run “make” in /etc/mail, and restart sendmail.
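
To check several servers’ configuration files for this, something like the following sketch can be used. It assumes that confPRIVACY_FLAGS is defined in the same layout as above (with or without the dnl line break), and it considers DSN disabled if either noreceipts or goaway is in the list. The function names are made up.

```python
import re

# Matches define(`confPRIVACY_FLAGS',dnl <newline> `flag1,flag2,...')
FLAGS_RE = re.compile(
    r"define\(`confPRIVACY_FLAGS',\s*(?:dnl[^\n]*\n)?\s*`([^']*)'\)")


def privacy_flags(mc_text):
    """Return the list of privacy flags found in sendmail.mc content."""
    match = FLAGS_RE.search(mc_text)
    if match is None:
        return []
    return [flag.strip() for flag in match.group(1).split(",") if flag.strip()]


def dsn_disabled(mc_text):
    """True if the flags include noreceipts or goaway."""
    flags = privacy_flags(mc_text)
    return "noreceipts" in flags or "goaway" in flags


# A shortened version of the definition shown above
sample = "define(`confPRIVACY_FLAGS',dnl\n`needmailhelo,novrfy,noreceipts')dnl\n"
print(privacy_flags(sample), dsn_disabled(sample))
```

This only tells what the .mc file says. Whether the running daemon actually uses it is a different question, which the test in the next section answers.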

Turning off DSN is often recommended against in different sendmail guides, because it’s considered a “valuable feature” or so. As mentioned above, I haven’t seen it used by anyone else than spammers.

Will my mail server do DSN?

Easy to check, because the server announces its willingness to fulfill DSN requests at the beginning of the SMTP session, with the 250-DSN line in its response to EHLO, as in the sample session below:

<<< 220 mx.mymailserver.com ESMTP MTA; Wed, 15 Jul 2020 10:22:32 GMT
>>> EHLO localhost.localdomain
<<< 250-mx.mymailserver.com Hello 46-117-33-227.bb.netvision.net.il [46.117.33.227], pleased to meet you
<<< 250-ENHANCEDSTATUSCODES
<<< 250-PIPELINING
<<< 250-8BITMIME
<<< 250-SIZE
<<< 250-DSN
<<< 250-DELIVERBY
<<< 250 HELP
>>> MAIL FROM:<spamvictim@billauer.co.il>
<<< 250 2.1.0 <spamvictim@billauer.co.il>... Sender ok
>>> RCPT TO:<legal_address@billauer.co.il> NOTIFY=SUCCESS
<<< 250 2.1.5 <legal_address@billauer.co.il>... Recipient ok
>>> DATA
<<< 354 Enter mail, end with "." on a line by itself
>>> MIME-Version: 1.0
>>> From: spamvictim@billauer.co.il
>>> To: legal_address@billauer.co.il
>>> Subject: Testing email.
>>>
>>>
>>> Just a test, please ignore
>>> .
<<< 250 2.0.0 06FAMWa1014200 Message accepted for delivery
>>> QUIT
<<< 221 2.0.0 mx.mymailserver.com closing connection

To test a mail server for its behavior with DSN, the script that I’ve already published can be used. To make it request a return receipt, the two lines that set the SMTP recipient should be changed to

  die("Failed to set receipient\n")
    if (! ($smtp->recipient( ($to_addr ), { Notify => ['SUCCESS'] } ) ) );

This change causes the NOTIFY=SUCCESS part in the RCPT TO line, which effectively requests a receipt from the server when the mail is properly delivered.

Note that if DSN isn’t supported by the mail server (possibly because of the privacy option fix shown above), the SMTP session looks exactly the same, except that the 250-DSN line will be absent. Then the mail server just ignores the NOTIFY=SUCCESS part silently, and responds exactly as before.

However when running the Perl script, the Net::SMTP will be kind enough to issue a warning to its stderr:

Net::SMTP::recipient: DSN option not supported by host at ./testmail.pl line 36.
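
The same check can be made without sending a mail at all, by looking for the DSN keyword in the server’s response to EHLO. A sketch in Python: advertises_dsn works on a captured response (with or without the “<<< ” prefix used in the session above), and server_advertises_dsn asks a live server through the standard smtplib module. Both function names are made up, and the host and port are whatever applies to the server being tested.

```python
import smtplib


def advertises_dsn(ehlo_reply):
    """True if a multi-line EHLO response contains the DSN keyword."""
    for line in ehlo_reply.splitlines():
        line = line.strip()
        if line.startswith("<<< "):
            line = line[4:]  # Transcript prefix, as in the session above
        # Response lines are "250-KEYWORD ..." or "250 KEYWORD ..." (last one)
        if len(line) >= 4 and line[:3] == "250":
            keyword = line[4:].split(" ", 1)[0]
            if keyword.upper() == "DSN":
                return True
    return False


def server_advertises_dsn(host, port=25):
    """Connect, say EHLO and report whether DSN is offered."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.ehlo()
        return smtp.has_extn("dsn")


reply = "<<< 250-ENHANCEDSTATUSCODES\n<<< 250-DSN\n<<< 250 HELP"
print(advertises_dsn(reply))
```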

The mail addresses I used in the sample session above are bogus, of course, but note that the spam victim is the sender of the email, because that’s where the return receipt goes. On top of that, the RCPT TO address will also get a spam message, but that’s the smaller problem, as it’s yet another spam message arriving, not one that is sent away from our server.

I should also mention that Notify can be a comma-separated list of events, e.g.

RCPT TO:<bad_address@billauer.co.il> NOTIFY=SUCCESS,FAILURE,DELAY

however FAILURE doesn’t include the user not being known to the server, in which case the message is dropped anyhow without any DSN message generated. So as a spam trick, one can’t send mails to random addresses, and issue spam bounce messages because they failed. That would have been too easy.

In the mail logs

A session like the sample shown above causes the following lines in mail.log. Note the “DSN: Return receipt” line, which indicates that the return receipt mechanism was fired off.

Jul 15 10:15:31 sm-mta[12697]: 06FAFTbL012697: from=<spamvictim@billauer.co.il>, size=121, class=0, nrcpts=1, msgid=<202007151015.06FAFTbL012697@mx.mymailserver.com>, proto=ESMTP, daemon=IPv4-port-587, relay=46-117-33-227.bb.netvision.net.il [46.117.33.227]
Jul 15 10:15:31 sm-mta[12698]: 06FAFTbL012697: to=<legal_address@billauer.co.il>, ctladdr=<spamvictim@billauer.co.il> (1010/500), delay=00:00:01, xdelay=00:00:00, mailer=local, pri=30456, dsn=2.0.0, stat=Sent
Jul 15 10:15:31 sm-mta[12698]: 06FAFTbL012697: 06FAFVbL012698: DSN: Return receipt
Jul 15 10:15:31 sm-mta[12698]: 06FAFVbL012698: to=<spamvictim@billauer.co.il>, delay=00:00:00, xdelay=00:00:00, mailer=local, pri=30000, dsn=2.0.0, stat=Sent
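To keep an eye on whether the server is sending out receipts, those "DSN: Return receipt" lines can be picked out of the log. A minimal sketch in Python, assuming the log lines look exactly like the ones above (sendmail logging with the sm-mta tag). The function name is mine.

```python
import re

# Matches e.g.
# "sm-mta[12698]: 06FAFTbL012697: 06FAFVbL012698: DSN: Return receipt"
RECEIPT_RE = re.compile(r"sm-mta\[\d+\]: (\w+): (\w+): DSN: Return receipt")

def find_return_receipts(log_lines):
    """Return (original queue ID, receipt queue ID) pairs."""
    found = []
    for line in log_lines:
        m = RECEIPT_RE.search(line)
        if m:
            found.append((m.group(1), m.group(2)))
    return found
```

The first ID of each pair leads back to the message that requested the receipt, and the second one to the receipt message itself.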

The receipt

Since I’m at it, this is what a receipt message for the sample session above looks like:

Received: from localhost (localhost)	by mx.mymailserver.com
 (8.14.4/8.14.4/Debian-8+deb8u2) id 06FAFVbL012698;	Wed, 15 Jul 2020
 10:15:31 GMT
Date: Wed, 15 Jul 2020 10:15:31 GMT
From: Mail Delivery Subsystem <MAILER-DAEMON@billauer.co.il>
Message-ID: <202007151015.06FAFVbL012698@mx.mymailserver.com>
To: <spamvictim@billauer.co.il>
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status;
 boundary="06FAFVbL012698.1594808131/mx.mymailserver.com"
Subject: Return receipt
Auto-Submitted: auto-generated (return-receipt)
X-Mail-Filter: main

This is a MIME-encapsulated message

--06FAFVbL012698.1594808131/mx.mymailserver.com

The original message was received at Wed, 15 Jul 2020 10:15:30 GMT
from 46-117-33-227.bb.netvision.net.il [46.117.33.227]

   ----- The following addresses had successful delivery notifications -----
<legal_address@billauer.co.il>  (successfully delivered to mailbox)

   ----- Transcript of session follows -----
<legal_address@billauer.co.il>... Successfully delivered

--06FAFVbL012698.1594808131/mx.mymailserver.com
Content-Type: message/delivery-status

Reporting-MTA: dns; mx.mymailserver.com
Received-From-MTA: DNS; 46-117-33-227.bb.netvision.net.il
Arrival-Date: Wed, 15 Jul 2020 10:15:30 GMT

Final-Recipient: RFC822; legal_address@billauer.co.il
Action: delivered (to mailbox)
Status: 2.1.5
Last-Attempt-Date: Wed, 15 Jul 2020 10:15:31 GMT

--06FAFVbL012698.1594808131/mx.mymailserver.com
Content-Type: text/rfc822-headers

Return-Path: <spamvictim@billauer.co.il>
Received: from localhost.localdomain (46-117-33-227.bb.netvision.net.il [46.117.33.227])
	by mx.mymailserver.com (8.14.4/8.14.4/Debian-8+deb8u2) with ESMTP id 06FAFTbL012697
	for <legal_address@billauer.co.il>; Wed, 15 Jul 2020 10:15:30 GMT
Date: Wed, 15 Jul 2020 10:15:29 GMT
Message-Id: <202007151015.06FAFTbL012697@mx.mymailserver.com>
MIME-Version: 1.0
From: spamvictim@billauer.co.il
To: legal_address@billauer.co.il
Subject: Testing email.

--06FAFVbL012698.1594808131/mx.mymailserver.com--

But note that if DSN is used by a spammer to trick our mail server, we will get the failure notice that results from sending this message to the other server. If we’re lucky enough to get anything at all: If the message is accepted, we’ll never know our server has been sending spam.

Root over NFS remains read only with Linux v5.7

Upgrading the kernel should be quick and painless…

After upgrading the kernel from v5.3 to v5.7, a lot of systemd services failed (on Debian 8), in particular systemd-remount-fs:

● systemd-remount-fs.service - Remount Root and Kernel File Systems
   Loaded: loaded (/lib/systemd/system/systemd-remount-fs.service; static)
   Active: failed (Result: exit-code) since Sun 2020-07-26 15:28:15 IDT; 17min ago
     Docs: man:systemd-remount-fs.service(8)
           http://www.freedesktop.org/wiki/Software/systemd/APIFileSystems
  Process: 223 ExecStart=/lib/systemd/systemd-remount-fs (code=exited, status=1/FAILURE)
 Main PID: 223 (code=exited, status=1/FAILURE)

Jul 26 15:28:15 systemd[1]: systemd-remount-fs.service: main process exited, code=exited, status=1/FAILURE
Jul 26 15:28:15 systemd[1]: Failed to start Remount Root and Kernel File Systems.
Jul 26 15:28:15 systemd[1]: Unit systemd-remount-fs.service entered failed state.

and indeed, the root NFS remained read-only (checked with the “mount” command), which explains why so many other services failed.

After an strace session, I managed to nail down the problem: The system call to mount(), which was supposed to do the remount, simply failed:

mount("10.1.1.1:/path/to/debian-82", "/", 0x61a250, MS_REMOUNT, "addr=10.1.1.1") = -1 EINVAL (Invalid argument)

On the other hand, any attempt to remount another read-only NFS mount, which had been mounted the regular way (i.e. after boot) went through clean, of course:

mount("10.1.1.1:/path/to/debian-82", "/mnt/tmp", 0x61a230, MS_REMOUNT, "addr=10.1.1.1") = 0

The only apparent difference between the two cases is the third argument, which is ignored for MS_REMOUNT according to the manpage.

The manpage also says something about the EINVAL return value:

EINVAL A remount operation (MS_REMOUNT) was attempted, but source was not already mounted on target.

A hint to the problem could be that the type of the mount, as listed in /proc/mounts, is “nfs” for the root mounted filesystem, but “nfs4” for the one in /mnt/tmp. The reason for this difference isn’t completely clear.
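Checking the mount type doesn’t require reading through /proc/mounts by eye. Here is a small Python sketch that pulls out the type field for a given mount point. The function name is made up, and it assumes the usual six-field layout of /proc/mounts.

```python
def mount_types(mounts_text, mount_point):
    """Return the filesystem types listed for a mount point, in order."""
    types = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, type, options, dump, pass
        if len(fields) >= 3 and fields[1] == mount_point:
            types.append(fields[2])
    return types
```

Feed it with the content of /proc/mounts, e.g. mount_types(open("/proc/mounts").read(), "/"). A list is returned because the same mount point may appear more than once.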

The solution

So it’s all about that little hint: If the nfsroot is selected to boot as version 4, then there’s no problem remounting it. Why it made a difference from one kernel version to another is beyond me. So the fix is to add nfsvers=4 to the nfsroot assignment. Something like

root=/dev/nfs nfsroot=10.1.1.1:/path/to/debian-82,nfsvers=4

For the record, I ran the remount command under strace once again, and exactly the same system call was made, including that most-likely-ignored 0x61a250 argument, and it simply returned success (zero) instead of EINVAL.

As a side note, the rootfstype=nfs in the kernel command line is completely ignored. Write any junk instead of “nfs” and it makes no difference.

Another yak shaved successfully.

The sledge hammer: Forcing a permanent screen resolution mode on Linux

When to do this

Because Gnome desktop is sure it knows what’s best for me, and it’s virtually impossible to just tell it that I want this screen resolution mode and no other, there is only one option left: Lie about the monitor’s graphics mode capabilities. Make the kernel feed it with fake screen information (EDID), that basically says “there is only this resolution”. Leave it with one choice only.

What is EDID? It’s a tiny chunk of information that is stored on a small EEPROM memory on the monitor. The graphics card fetches this blob through two I2C wires on the cable, and deduces from it what graphics mode (with painfully detailed timing parameters) the monitor supports. It’s that little hex blob that appears when you go xrandr --verbose.

I should mention a post in Gentoo forum, which suggests making X ignore EDID info by using

Option       "UseEDID" "false"
Option       "UseEDIDFreqs" "false"

in /etc/X11/xorg.conf, or is it a file in /usr/share/X11/xorg.conf.d/? And then just set the screen mode old-school. Didn’t bother to check this. There are too many players in this game. Faking EDID seemed to be a much better idea than to ask politely not to consider it.

How to feed a fake EDID

The name of the game is Kernel Mode Setting (KMS). Among others, it allows loading a file from /lib/firmware which is used as the screen information (EDID) instead of getting it from the screen.

For this to work, the CONFIG_DRM_LOAD_EDID_FIRMWARE kernel compilation option must be enabled (set to “y”).

Note that unless Early KMS is required, the firmware file is loaded after the initramfs stage. In other words, it’s not necessary to push the fake EDID file into the initramfs, but it’s OK to have it present only in the filesystem that is mounted after the initramfs.

The EDID file should be stored in /lib/firmware/edid (create the directory if necessary) and the following command should be added to the kernel command line:

drm_kms_helper.edid_firmware=edid/fake_edid.bin

(for kernels 4.15 and later, there’s a drm.edid_firmware parameter that is supposed to be better in some way).

Generating a custom EDID file

I needed a special graphics mode to solve a problem with my OLED screen. Meaning I had to cook my own EDID file. It turned out quite easy, actually.

The kernel’s doc for this is Documentation/admin-guide/edid.rst

In the kernel’s tools/edid, edit one of the asm files (e.g. 1920x1080.S) and set the parameters to the correct mode. This file has just defines. The actual data format is produced in edid.S, which is included at the bottom. The output in this case is 1920x1080.bin. Note that the C file (1920x1080.c) is an output as well in this case — for reference of some other use, I guess.

And then just type “make” in tools/edid/ (don’t compile the kernel, that’s really not necessary for this).

The numbers in the asm file are in a slightly different notation, as explained in the kernel doc. Not a big deal to figure out.

In my case, I translated this xrandr mode line

  oledblack (0x10b) 173.000MHz -HSync +VSync
        h: width  1920 start 2048 end 2248 total 2576 skew    0 clock  67.16KHz
        v: height 1080 start 1083 end 1088 total 1120           clock  59.96Hz

to this:

/* EDID */
#define VERSION 1
#define REVISION 3

/* Display */
#define CLOCK 173000 /* kHz */
#define XPIX 1920
#define YPIX 1080
#define XY_RATIO XY_RATIO_16_9
#define XBLANK 656
#define YBLANK 40
#define XOFFSET 128
#define XPULSE 200
#define YOFFSET 3
#define YPULSE 5
#define DPI 96
#define VFREQ 60 /* Hz */
#define TIMING_NAME "Linux FHD"
/* No ESTABLISHED_TIMINGx_BITS */
#define HSYNC_POL 0
#define VSYNC_POL 0

#include "edid.S"
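The translation from the xrandr numbers to the defines is plain subtraction. As a sketch (in Python, the function name is mine), this is the arithmetic I’m assuming, fed with the numbers of the mode line above:

```python
def edid_defines(width, hstart, hend, htotal, height, vstart, vend, vtotal):
    """Derive the blanking/sync defines from xrandr's h: and v: numbers."""
    return {
        "XPIX": width,
        "XBLANK": htotal - width,    # total horizontal blanking, in pixels
        "XOFFSET": hstart - width,   # from end of picture to sync start
        "XPULSE": hend - hstart,     # sync pulse width
        "YPIX": height,
        "YBLANK": vtotal - height,   # total vertical blanking, in lines
        "YOFFSET": vstart - height,
        "YPULSE": vend - vstart,
    }

for name, value in edid_defines(1920, 2048, 2248, 2576,
                                1080, 1083, 1088, 1120).items():
    print("#define", name, value)
```

The output agrees with the values in the file above (XBLANK 656, XOFFSET 128, XPULSE 200, YBLANK 40, YOFFSET 3, YPULSE 5). CLOCK is just the pixel clock in kHz.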

There seems to be a distinction between standard resolution modes and those that aren’t. I got away with this, because 1920x1080 is a standard mode. It may be slightly trickier with a non-standard mode.

When it works

This is what it looks like when all is well. First, the kernel logs. In my case, because I didn’t put the file in the initramfs, loading it fails twice:

[    3.517734] platform HDMI-A-3: Direct firmware load for edid/1920x1080.bin failed with error -2
[    3.517800] [drm:drm_load_edid_firmware [drm_kms_helper]] *ERROR* Requesting EDID firmware "edid/1920x1080.bin" failed (err=-2)

and again:

[    4.104528] platform HDMI-A-3: Direct firmware load for edid/1920x1080.bin failed with error -2
[    4.104580] [drm:drm_load_edid_firmware [drm_kms_helper]] *ERROR* Requesting EDID firmware "edid/1920x1080.bin" failed (err=-2)

But then, much later, it loads properly:

[   19.864966] [drm] Got external EDID base block and 0 extensions from "edid/1920x1080.bin" for connector "HDMI-A-3"
[   93.298915] [drm] Got external EDID base block and 0 extensions from "edid/1920x1080.bin" for connector "HDMI-A-3"
[  109.573124] [drm] Got external EDID base block and 0 extensions from "edid/1920x1080.bin" for connector "HDMI-A-3"
[ 1247.290084] [drm] Got external EDID base block and 0 extensions from "edid/1920x1080.bin" for connector "HDMI-A-3"

Why several times? Well, the screen resolution is probably set up several times as the system goes up. There’s clearly a quick screen flash a few seconds after the desktop goes up. I don’t know exactly why, and at this stage I don’t care. The screen is at the only mode allowed, and that’s it.

And now to how xrandr sees the situation:

$ xrandr -d :0 --verbose
[ ... ]
HDMI3 connected primary 1920x1080+0+0 (0x10c) normal (normal left inverted right x axis y axis) 500mm x 281mm
 Identifier: 0x48
 Timestamp:  21339
 Subpixel:   unknown
 Gamma:      1.0:1.0:1.0
 Brightness: 1.0
 Clones:   
 CRTC:       0
 CRTCs:      0
 Transform:  1.000000 0.000000 0.000000
 0.000000 1.000000 0.000000
 0.000000 0.000000 1.000000
 filter:
 EDID:
 00ffffffffffff0031d8000000000000
 051601036d321c78ea5ec0a4594a9825
 205054000000d1c00101010101010101
 010101010101944380907238284080c8
 3500f41911000018000000ff004c696e
 75782023300a20202020000000fd003b
 3d424412000a202020202020000000fc
 004c696e7578204648440a2020200045
 aspect ratio: Automatic
 supported: Automatic, 4:3, 16:9
 Broadcast RGB: Automatic
 supported: Automatic, Full, Limited 16:235
 audio: auto
 supported: force-dvi, off, auto, on
 1920x1080 (0x10c) 173.000MHz -HSync -VSync *current +preferred
 h: width  1920 start 2048 end 2248 total 2576 skew    0 clock  67.16KHz
 v: height 1080 start 1083 end 1088 total 1120           clock  59.96Hz

Compare the EDID part with 1920x1080.c, which was created along with the binary:

{
 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00,
 0x31, 0xd8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
 0x05, 0x16, 0x01, 0x03, 0x6d, 0x32, 0x1c, 0x78,
 0xea, 0x5e, 0xc0, 0xa4, 0x59, 0x4a, 0x98, 0x25,
 0x20, 0x50, 0x54, 0x00, 0x00, 0x00, 0xd1, 0xc0,
 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01,
 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x94, 0x43,
 0x80, 0x90, 0x72, 0x38, 0x28, 0x40, 0x80, 0xc8,
 0x35, 0x00, 0xf4, 0x19, 0x11, 0x00, 0x00, 0x18,
 0x00, 0x00, 0x00, 0xff, 0x00, 0x4c, 0x69, 0x6e,
 0x75, 0x78, 0x20, 0x23, 0x30, 0x0a, 0x20, 0x20,
 0x20, 0x20, 0x00, 0x00, 0x00, 0xfd, 0x00, 0x3b,
 0x3d, 0x42, 0x44, 0x12, 0x00, 0x0a, 0x20, 0x20,
 0x20, 0x20, 0x20, 0x20, 0x00, 0x00, 0x00, 0xfc,
 0x00, 0x4c, 0x69, 0x6e, 0x75, 0x78, 0x20, 0x46,
 0x48, 0x44, 0x0a, 0x20, 0x20, 0x20, 0x00, 0x45,
};

So it definitely took the bait.
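For a deeper look than comparing hex by eye, the timing can be decoded back from the blob. The following Python sketch takes the hex dump from xrandr above and unpacks the first detailed timing descriptor, which sits at bytes 54 to 71 of the EDID block. The function name and the dictionary keys are mine, chosen to match the defines in the asm file.

```python
XRANDR_EDID = """
00ffffffffffff0031d8000000000000
051601036d321c78ea5ec0a4594a9825
205054000000d1c00101010101010101
010101010101944380907238284080c8
3500f41911000018000000ff004c696e
75782023300a20202020000000fd003b
3d424412000a202020202020000000fc
004c696e7578204648440a2020200045
"""

def decode_first_timing(edid):
    """Decode the first detailed timing descriptor (bytes 54 to 71)."""
    d = edid[54:72]
    return {
        "clock_khz": (d[0] | (d[1] << 8)) * 10,  # stored in 10 kHz units
        "xpix": d[2] | ((d[4] & 0xF0) << 4),
        "xblank": d[3] | ((d[4] & 0x0F) << 8),
        "ypix": d[5] | ((d[7] & 0xF0) << 4),
        "yblank": d[6] | ((d[7] & 0x0F) << 8),
        "xoffset": d[8] | ((d[11] & 0xC0) << 2),
        "xpulse": d[9] | ((d[11] & 0x30) << 4),
        "yoffset": (d[10] >> 4) | ((d[11] & 0x0C) << 2),
        "ypulse": (d[10] & 0x0F) | ((d[11] & 0x03) << 4),
    }

edid = bytes.fromhex("".join(XRANDR_EDID.split()))
checksum_ok = sum(edid) % 256 == 0  # all 128 bytes must sum to 0 mod 256
timing = decode_first_timing(edid)
print(checksum_ok, timing)
```

The decoded numbers are those of the asm file: a 173000 kHz clock, 1920 and 1080 pixels, 656 and 40 for blanking, and so on.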


Writing to a disk even when df says zero available space

Just a quick note to remind myself: There’s a gap between the size of a disk, the used space and the available space. It’s quite well-known that a certain percentage of the disk (that’s 200 GB on a 3.6 TB backup disk) is saved for root-only writes.

So the reminder is: No problem filling the disk beyond the Available = zero blocks point if you’re root. And it doesn’t matter if the files written don’t belong to root. The show goes on.
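The size of the root-only reserve can be read directly with statvfs(), which gives both the total count of free blocks and the count available to non-root users. A minimal Python sketch (the function names are mine):

```python
import os

def reserved_bytes(st):
    """Space that only root can write to, given a statvfs result."""
    # f_bfree counts all free blocks, f_bavail only those that
    # unprivileged users may use. The difference is the root reserve.
    return (st.f_bfree - st.f_bavail) * st.f_frsize

def report(path):
    st = os.statvfs(path)
    print("Available to users: %d bytes" % (st.f_bavail * st.f_frsize))
    print("Left for root only: %d bytes" % reserved_bytes(st))
```

Run report() on the backup disk’s mount point. Once the disk is filled beyond the zero available blocks point, the second number is what is left before the disk is really full.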

Also, the numbers shown by df are updated only when the file written to is closed. So if a very long file is being copied, it might freeze for a while, and then boom.

This is important in particular when using the disk just for backing up data, because the process doing the backup is root, but the files aren’t.

But whatever you do, don’t press CTRL-C while the extracting goes on. If tar quits in the middle, there will be file ownerships and permissions unset, and symlinks set to zero-length files too. It wrecks the entire backup, even in places far away from where tar was working when it was stopped.

Installing Vivado 2020.1 on Linux Mint 19

… or any other “unsupported” Linux distribution.

… or: How to trick the installer into thinking you’re running one of the supported OSes.

So I wanted to install Vivado 2020.1 on my Linux Mint 19 (Tara) machine. I downloaded the full package, and ran xsetup. A splash window appeared, and soon after it another window popped up, saying that my distribution wasn’t supported, listing those that were, and telling me that I could click “OK” to continue nevertheless. Which I did.

But then nothing happened. Completely stuck. And there was an error message on the console, reading:

$ ./xsetup
Exception in thread "SPLASH_LOAD_MESSAGE" java.lang.IllegalStateException: no splash screen available
	at java.desktop/java.awt.SplashScreen.checkVisible(Unknown Source)
	at java.desktop/java.awt.SplashScreen.getBounds(Unknown Source)
	at java.desktop/java.awt.SplashScreen.getSize(Unknown Source)
	at com.xilinx.installer.gui.H.run(Unknown Source)
Exception in thread "main" java.lang.IllegalStateException: no splash screen available
	at java.desktop/java.awt.SplashScreen.checkVisible(Unknown Source)
	at java.desktop/java.awt.SplashScreen.close(Unknown Source)
	at com.xilinx.installer.gui.G.b(Unknown Source)
	at com.xilinx.installer.gui.InstallerGUI.G(Unknown Source)
	at com.xilinx.installer.gui.InstallerGUI.e(Unknown Source)
	at com.xilinx.installer.api.InstallerLauncher.main(Unknown Source)

This issue is discussed in this thread of Xilinx’ forum. Don’t let the “Solved” title mislead you: They didn’t solve it at all. But one of the answers there gave me the direction: Fool the installer to think my OS is supported, after all. In this specific case there was no problem with the OS, but a bug in the installer that caused it to behave silly after that popup window.

It was also suggested to install Vivado in batch mode with

./xsetup -b ConfigGen

however this doesn’t allow for selecting what devices I want to support. And this is a matter of tons of disk space.

So to make it work, I made changes in some files in /etc, and kept the original files in a separate directory. I also needed to move /etc/lsb-release into that directory, so it wouldn’t mess things up.

I changed /etc/os-release (which is in fact a symlink to ../usr/lib/os-release on my machine, so watch it) to

NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

and /etc/lsb-release

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"

This might very well be an overkill, but once I got the installation running, I didn’t bother to check what the minimal change is. Those who are successful might comment below on this, maybe?

Note that this might not work on a Red Hat based OS, because it seems that there are distribution-dependent issues. Since Linux Mint 19 is derived from Ubuntu 18.04, faking an earlier Ubuntu distro didn’t cause any problems.

This failed for me repeatedly at first, because I kept a copy of the original files in the /etc directory. This made the installation tool read the original files as well as the modified ones. Using strace, I found that the tool called cat with

execve("/bin/cat", ["cat", "/etc/lsb-release", "/etc/os-release", "/etc/real-upstream-release"], 0x5559689a47b0 /* 55 vars */) = 0

It looks like “cat” was called with some wildcard, maybe /etc/*-release? So the solution was to move the original files away to a new directory /etc/real/, and create the fake ones in their place.

Another problem was probably that /etc/lsb-release is a directory in my system. That made the “cat” return a failure status, which can’t be good.
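To see in advance what such a wildcard would pick up, here is a small Python sketch that lists everything matching *-release in a directory, along with what kind of entry it is. Note that the /etc/*-release pattern is only my guess about what the installer does, and the function name is made up.

```python
import glob
import os

def release_files(etc_dir="/etc"):
    """List what an /etc/*-release wildcard would match."""
    result = []
    for path in sorted(glob.glob(os.path.join(etc_dir, "*-release"))):
        if os.path.isdir(path):
            kind = "directory"  # "cat" fails on these
        elif os.path.islink(path):
            kind = "symlink to " + os.readlink(path)
        else:
            kind = "file"
        result.append((os.path.basename(path), kind))
    return result
```

Anything in the output of release_files() beyond the fake files is a candidate for moving away before running the installer.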

Of course I have an opinion on the entire OS support situation, but if you’re reading this, odds are that you agree with me anyhow.

Perl + Linux: Properly cleaning up a forking script after it exits

Leave no leftover children

One of the really tricky things about a Perl script that forks this way or another, is how to make sure that the children vanish after the parent has exited. This is an issue both if the children were created with a fork() call, or with a safe pipe, as with

my $pid = open(my $fd, '-|');

It may seem to work fine when the main script is terminated with a CTRL-C. The children will indeed vanish. But try killing the main script with a “kill” command, and the parent dies, but the children remain alive and kicking.

The Linux-only solution is

use Linux::Prctl;

and then, in the part of the script that runs as a child, do

Linux::Prctl::set_pdeathsig(9);

immediately after the branch between parent and child. This tells Linux to send a SIGKILL to the process that made this call (i.e. the child) as soon as the parent exits. One might be more gentle with a SIGTERM (number 15). But the idea is the same. Parent is away, get the hammer.

To get the Perl module:

# apt install liblinux-prctl-perl

And BTW, SIGPIPE doesn’t help here, even if there’s a pipe between the two processes: It’s delivered only when the child process attempts to write to a pipe that is closed on the other end. If it doesn’t, the broken pipe is never sensed. And if it’s on the reading side, there’s no SIGPIPE at all: The pipe just gives an EOF when the data is exhausted.

The pdeathsig can of course be used in non-Perl programs as well. This is the Perl example.
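As an illustration of the non-Perl case, this is a sketch of the same thing in Python, calling prctl() directly through ctypes (Linux only). The function names are mine, and the two constants are taken from linux/prctl.h.

```python
import ctypes
import signal

PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>
PR_GET_PDEATHSIG = 2

_libc = ctypes.CDLL(None, use_errno=True)

def set_pdeathsig(sig=signal.SIGKILL):
    """Ask Linux to send sig to this process when its parent exits."""
    if _libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(sig), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")

def get_pdeathsig():
    """Read back the current setting (0 means no signal)."""
    sig = ctypes.c_int()
    if _libc.prctl(PR_GET_PDEATHSIG, ctypes.byref(sig), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_GET_PDEATHSIG) failed")
    return sig.value
```

Like in Perl, set_pdeathsig() should be called in the child immediately after the fork, for example right after os.fork() returns zero.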

Multiple safe pipes

When a process generates multiple children, there’s a problem with the fact that the children inherit the already existing opened file descriptors. For example, when the main script creates multiple children by virtue of safe pipes for read (calling open(my $fd, '-|') repeatedly, so the children write and parent reads): Looking at /proc/PID/fd of the children, it’s clear that they have a lot of pipes opened that they have nothing to do with.

This prevents the main script (the parent), as well as some of the children, from terminating, even after either side calls exit() or die(). These processes don’t turn into zombies, but remain plain unterminated processes in the stopped state. At least so it turned out on my Perl v5.26.1 on an x86_64 Linux machine.

The problem in this case occurs when a pipe has pending data at the time the main script attempts to terminate, for example because of a print to STDOUT in the child (which is redirected to the pipe going to the parent). This is problematic, because the child process will attempt to write the remaining data just before quitting (STDOUT is flushed). The process will block forever on this write() call. Since the child doesn’t terminate, the parent process blocks on wait(), and doesn’t terminate either. It’s a deadlock. Even if close() isn’t called explicitly in the main script, the automatic file descriptor close before termination will behave exactly the same: It waits for the child process.

What usually happens in this situation is that when the parent closes the file descriptor, it sends a SIGPIPE to the child. The blocking write() returns as a result with an EPIPE status (Broken pipe), and the child process terminates. This allows the parent’s wait() to reap the child, and the parent process can continue.

And here’s the twist: If the file descriptor belongs to several processes after forking, SIGPIPE is sent to the child only when the last copy of the file descriptor is closed. As a result, when the parent process attempts to close one of its pipes, SIGPIPE isn’t sent if the children haven’t closed their copies of the same pipe file descriptor. The deadlock described above occurs.

This can be worked around by making sure to close the pipes so that the child processes are reaped in the reverse order of their creation. But it’s much simpler to just close the unnecessary file descriptors on the children’s side.
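The effect of a leftover copy of the file descriptor is easy to demonstrate within a single process. In this Python sketch, os.dup() stands in for the copies that forked siblings inherit (that’s a simplification on my side, but the kernel treats both the same way). Python ignores SIGPIPE by default, so the failed write shows up as a BrokenPipeError exception instead of killing the process.

```python
import os

def writer_gets_epipe(inherited_copies):
    """Close the reader's end of a pipe, then try to write to it."""
    r, w = os.pipe()
    # os.dup() stands in for copies that forked siblings would hold
    leftovers = [os.dup(r) for _ in range(inherited_copies)]
    os.close(r)  # the parent closes its own read end
    try:
        os.write(w, b"x")
        return False  # a read end is still open somewhere: no EPIPE
    except BrokenPipeError:
        return True   # the last read end is gone: the write fails
    finally:
        os.close(w)
        for fd in leftovers:
            os.close(fd)

print(writer_gets_epipe(0))
print(writer_gets_epipe(2))
```

This prints True and then False: With no copies left around, the writer is told that the pipe is broken. With copies held by someone else, the writer goes on as if nothing happened.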

So the solution is to go

foreach my $fd (@safe_pipe_fds) {
  close($fd)
   and print STDERR "What? Closing unnecessary file descriptor was successful!\n";
}

on the child’s side, immediately after the call to set_pdeathsig(), as mentioned above.

All of these close() calls should fail with an ECHILD (No child processes) status: The close() call attempts to waitpid() for the main script’s children (closing a pipe waits for the process on the other side to terminate), which fails because only the true parent can do that. Regardless, the file descriptors are indeed closed, and each child process holds only the file descriptors it needs to. And most importantly, there’s no problem terminating.

So the error message is given when the close is successful. The “and” part isn’t a mistake.

It’s also worth mentioning, that exactly the same close() (with a failed wait() call) occurs anyhow when the child process terminates (I’ve checked it with strace). The code snippet above just makes it earlier, and solves the deadlock problem.

Either way, it’s probably wiser to use pipe() and fork() except for really simple one-on-one IPC between a script and itself, so that all this file descriptor and child reaping is done on the table.

As for pipes to and from other executables with open(), that’s not a problem. I mean calls such as open(IN, "ps aux|") etc. That’s because Perl automatically closes all file descriptors except STDIN, STDOUT and STDERR when calling execve(), which is the syscall for executing another program.

Or more precisely, it sets the FD_CLOEXEC flag for all files opened with a file number above $^F (a.k.a $SYSTEM_FD_MAX), which defaults to 2. So it’s actually Linux that automatically closes the files on a call to execve(). The possible problem mentioned above with SIGPIPE is hence solved this way. Note that this is something Perl does for us, so if you’re writing a program in C and plan to call execve() after a fork — by all means close all file descriptors that aren’t needed before doing that.
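The flag itself can be inspected with fcntl(). Here is a sketch in Python, which (since version 3.4) takes a similar approach to Perl’s and sets the flag on the file descriptors it creates. The helper’s name is mine.

```python
import fcntl
import os

def has_cloexec(fd):
    """True if the kernel will close fd on a call to execve()."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

r, w = os.pipe()
before = has_cloexec(r)      # Python creates pipes with the flag set
os.set_inheritable(r, True)  # clears FD_CLOEXEC on this descriptor
after = has_cloexec(r)
os.close(r)
os.close(w)
print(before, after)
```

The flag is set on the fresh pipe, and cleared after the call to os.set_inheritable(). The latter is what one does for a file descriptor that the executed program is supposed to use.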

Linux kernel: Dumping a module’s content for regression check

After making a lot of whitespace reorganization in a kernel module (indentation, line breaks, fixing things reported by sparse and checkpatch), I wanted to make sure I didn’t really change anything. All edits were of the type that the compiler should be indifferent about, but how can I be sure I didn’t change anything accidentally?

It would have been nice if the compiler’s object files were identical before and after the changes, but that doesn’t happen. So instead, let’s hope it’s enough to verify that the executable assembly code didn’t change, and neither did the string literals.

The idea is to make a disassembly of the executable part and dump the part that contains the literal strings, and output everything into a single file. Do that before and after the changes (git helps here, of course), and run a plain diff on the couple of files.

Which boils down to this little script:

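#!/bin/bash
# Usage: regress.sh themodule.ko > dump.txt

# Disassembly of the executable sections
objdump -d "$1"
# Hex dump of the read-only data and the string literals
objdump -s -j .rodata -j .rodata.str1.1 "$1"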

and run it on the compiled module, e.g.

$ ./regress.sh themodule.ko > original.txt

The script first makes the disassembly, and then makes a hex dump of two sections in the ELF file. Most interesting is the .rodata.str1.1 section, which contains the string literals. That’s the name of this section on a v5.7 kernel, anyhow.

Does it cover everything? Can I be sure that I did nothing wrong if the outputs before and after the changes are identical? I don’t really know. I know for sure that it detects the smallest change in the code, as well as a change in any error message string I had (and that’s where I made a lot of changes), but maybe there are some accidents that this check doesn’t cover.

As for how I found the names of the sections: Pretty much trying them all. The list of sections in the ELF file can be found with

$ readelf -S themodule.ko

However only those marked with PROGBITS type can be dumped with objdump -s (or more precisely, will be found with the -j flag). I think. It’s not like I really understand what I’m doing here.
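Instead of trying them all by hand, the PROGBITS sections can be filtered out of readelf’s output. A Python sketch, assuming the usual layout of readelf -S (section number in square brackets, then name, then type). The function name is mine.

```python
import re

# A section header line of "readelf -S" looks like
# "  [ 1] .text             PROGBITS         0000000000000000  00000040"
SECTION_RE = re.compile(r"^\s*\[\s*\d+\]\s+(\S+)\s+PROGBITS\b")

def progbits_sections(readelf_output):
    """Return the names of the PROGBITS sections, in order of appearance."""
    names = []
    for line in readelf_output.splitlines():
        m = SECTION_RE.match(line)
        if m:
            names.append(m.group(1))
    return names
```

The function takes the text that readelf -S prints out, for example as obtained with subprocess.run(["readelf", "-S", "themodule.ko"], capture_output=True, text=True).stdout.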

Bottom line: This check is definitely better than nothing.

Creating a tarball for distribution (without user/group information)

A tarball is the common way to convey several files on UNIX systems. But because tar was originally intended for backup, it stores not only the permission information, but also the owner and group of each file. Try listing the content of a tarball with e.g.

$ tar -tzvf thestuff.tar.gz

Note the “v” flag that goes along with the flag for listing, “t”: It causes tar to print out ownership and permission information.

This doesn’t matter much if the tarball is extracted as a non-root user on the other end, because tar doesn’t set the user and group ID in that case: The extracted files get the uid/gid of the process that extracted them.

However, if the user at the other end extracts the tarball as root, the original uid/gid is assigned, which may turn out confusing.

To avoid this, tell tar to assign user root to all files in the archive. This makes no difference if the archive is extracted by a non-root user, but sets the ownership to root if extracted by root. In fact, it sets the ownership to the extracting user in both cases, which is what one would expect.

So this is the command to use to create an old-school .tar.gz tarball:

$ tar --owner=0 --group=0 -czf thestuff.tar.gz thestuff

Note that you don’t have to be root to do this. You’re just creating a plain file with your own ownership. It’s extracting these files as root that requires root permissions (if so desired).
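If the tarball is created by a script rather than on the command line, the same can be done with Python’s tarfile module, using a filter function that rewrites the ownership of each entry. This is just a sketch of the idea, and the function names are mine.

```python
import os
import tarfile

def as_root(member):
    """Filter for TarFile.add(): store every entry as owned by root."""
    member.uid = member.gid = 0
    member.uname = member.gname = "root"
    return member

def make_tarball(tarball, directory):
    """Create a .tar.gz of a directory without user/group information."""
    # Store the entries under the directory's own name only
    arcname = os.path.basename(os.path.normpath(directory))
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(directory, arcname=arcname, filter=as_root)
```

So make_tarball("thestuff.tar.gz", "thestuff") is intended as the counterpart of the tar command above.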
