Writing a panel applet for Cinnamon: The basics

December 8, 2018, 11:15 am

Introduction

What I wanted: A simple applet on Cinnamon, which allows me to turn a service on and off (hostapd, a Wifi hotspot). I first went for the Argos catch-all extension, and learned that Cinnamon isn’t gnome-shell, and in particular that extensions for gnome-shell don’t (necessarily?) work with Cinnamon.

Speaking of which, my system is Linux Mint 19 on an x86_64, with

$ cinnamon --version
Cinnamon 3.8.9

So I went for writing the applet myself. Given the so-so level of difficulty, I should have done that to begin with.

Spoiler: I’m not going to dive into the details of that, because my hostapd-firewall-DHCP daemon setting is quite specific. Rather, I’ll discuss some general aspects of writing an applet.

So what is it like? Well, quite similar to writing something useful in JavaScript for a web page. Cinnamon’s applets are in fact written in JavaScript, and it feels pretty much the same. In particular, this thing about nothing happening when there’s an error, and now go figure what it was. And yes, there’s an error log console which helps with syntax errors (reminiscent of a browser’s error console, and discussed below), but run-time errors often just lead to nothing. A situation that is familiar to anyone with JavaScript experience.

And I also finally understand why the cinnamon process hogs CPU all the time. OK, it’s usually just a few percent, but still, what is it doing all that time with no user activity? Answer: Running some JavaScript, I suppose.

But all in all, if you’re good with JavaScript, understand the concepts of GUI programming and events, and are fairly OK with object oriented programming, it’s quite fun. And there’s another thing you’d better be good at:

Read The Source

As of December 2018, the API for Cinnamon applets is hardly documented, and it’s somewhat messy. So after reading a couple of tutorials (See “References” at the bottom of this post), the best way to grasp how to get X done is by reading the sources of existing applets:

System-installed: /usr/share/cinnamon/applets
User-installed: ~/.local/share/cinnamon/applets
Cinnamon’s core JavaScript sources: /usr/share/cinnamon/js

Each of these contains several subdirectories, typically with the form name@creator, one for each applet that is available for adding to the panels. Each of these has at least two files, which are also those to supply for your own applet:

metadata.json, which contains some basic info on the applet (probably used while selecting applets to add).
applet.js, which contains the JavaScript code for the applet.

It doesn’t matter if they’re executable, even though they often are.

There may also be additional *.js files.

There might also be a po/ directory, which often contains .po and .pot files that are intended for localizing the text displayed to the user. These go along with the _() function in the JavaScript code. For the purposes of a simple applet, these are not necessary. Ignore these _("Something") things in the JavaScript code, and read them as just "Something".

Some applets allow parameter setting. The runtime values for these are at ~/.cinnamon, which contains configuration data etc.

Two ways to object orient

Unfortunately, there are two styles for defining the applet class, both of which are used. This is a matter of minor confusion if you read the code of a few applets, and is therefore worth noting: Some of the applets use JavaScript class declarations (extending a built-in class), e.g.

class CinnamonSoundApplet extends Applet.TextIconApplet {
    constructor(metadata, orientation, panel_height, instanceId) {
        super(orientation, panel_height, instanceId);

and others use the “prototype” syntax:

MyApplet.prototype = {
  __proto__: Applet.IconApplet.prototype,

and so on. I guess they’re equivalent, despite the difference in syntax. Note that in the latter format, the constructor is a function called _init().

This way or another, all classes that employ timeout callbacks should have a destroy() method (no underscore prefix) to cancel them before quitting.

I wasn’t aware of these two syntax possibilities, and therefore started from the first applet I got my hands on. It happened to be written in the “prototype” syntax, which is probably the less preferable choice. I’m therefore not so sure my example below is a good starter.
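
To see the two styles side by side without Cinnamon in the way, here is a small sketch in plain JavaScript. BaseApplet is a made-up stand-in for something like Applet.IconApplet (this is not how Cinnamon actually defines its classes), and the names ProtoApplet and ClassApplet are hypothetical as well. The point is only that both ways end up with an object that inherits the base class’ methods:

```javascript
// Hypothetical stand-in for a Cinnamon base class such as Applet.IconApplet.
// It is written in the "prototype" style, with _init() as the constructor.
function BaseApplet(orientation) {
  this._init(orientation);
}

BaseApplet.prototype = {
  _init: function(orientation) {
    this.orientation = orientation;
  },
  describe: function() {
    return "applet at " + this.orientation;
  },
};

// Style 1: "prototype" syntax. The parent's _init() is called explicitly.
function ProtoApplet(orientation) {
  this._init(orientation);
}

ProtoApplet.prototype = {
  __proto__: BaseApplet.prototype,

  _init: function(orientation) {
    BaseApplet.prototype._init.call(this, orientation);
    this.style = "prototype";
  },
};

// Style 2: class declaration. super() replaces the explicit call
// to the parent's _init().
class ClassApplet extends BaseApplet {
  constructor(orientation) {
    super(orientation);
    this.style = "class";
  }
}

console.log(new ProtoApplet("top").describe()); // prints "applet at top"
console.log(new ClassApplet("top").describe()); // prints "applet at top"
```

Either way, describe() is found through the prototype chain, which is why applets written in both styles can coexist.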

Getting Started

It’s really about three steps to get an applet up and running.

Create a directory in ~/.local/share/cinnamon/applets/ and put the two files there: metadata.json and applet.js.
Restart Cinnamon. No, it’s not as bad as it sounds. See below.
Install the applet to some panel, just like any other applet.

I warmly suggest copying an existing applet and hacking it. You can start with the skeleton applet I’ve listed below, but there are plenty of others available on the web, in particular along with tutorials.

The development cycle (or: how to “run”)

None of the changes made in the applet’s directory (well, almost none) take any effect until Cinnamon is restarted, and when it is, everything is in sync. It’s not like a reboot, and it’s fine to do on the computer you’re working on, really. All windows remain in their workspaces (even though the windows’ tabs at the panel may change order). No reason to avoid this, even if you have a lot of windows opened. Done it a gazillion times.

So how to restart Cinnamon: ALT-F2, type “r” and Enter. Then cringe as your desktop fades away and be overwhelmed when it returns, and nothing bad happened.

If something is wrong with your applet (or otherwise), there’s a notification saying “Problems during Cinnamon startup” elaborating that “Cinnamon started successfully, but one or more applets, desklets or extensions failed to load”. From my own experience, that’s as bad as it gets: The applet wasn’t loaded, or doesn’t run properly.

Press Win+L (or ALT-F2, then type “lg” and Enter, or type “cinnamon-looking-glass” at a shell prompt as a non-root user) to launch the Looking Glass tool (called “Melange”). The Log tab is helpful, with detailed error messages (colored red, which helps). Alternatively, look for the detailed error message in .xsession-errors in your home directory.

Note that the error message often appears before the line saying that the relevant applet was loaded.

OK, so now to some more specific topics.

Custom icons

Icons are referenced by their file name, without extension, in the JavaScript code as well as the metadata.json file (as “icon” assignment). The search path is the applet’s own icons/ subdirectory and the system icons, present at /usr/share/icons/.

My own experience is that creating an icons/ directory side-by-side with applet.js, and putting a PNG file named wifi-icon-off.png there makes a command like

this.set_applet_icon_name("wifi-icon-off");

work for setting the applet’s main icon on the panel. The PNG’s transparency is honored. The official file format is SVG, but who’s got patience for that.

The same goes for menu items with icons:

item = new PopupMenu.PopupIconMenuItem("Access point off", "wifi-icon-off", St.IconType.FULLCOLOR);

item.connect('activate', Lang.bind(this, function() {
   Main.Util.spawnCommandLine("/usr/local/bin/access-point-ctl off");
}));
this.menu.addMenuItem(item);

My own experience with the menu items is that if the icon file isn’t found, Cinnamon silently puts an empty slot instead. JavaScript-style no fussing.

I didn’t manage to achieve something similar with the “icon” assignment in metadata.json, so the choices are either to save the icon in /usr/share/icons/, or use one of the system icons, or eliminate the “icon” assignment altogether from the JSON file. I went for the last option. This resulted in a dull default icon when installing the applet, but this is of zero importance for an applet I’ve written myself.

Running shell commands from JavaScript

The common way to execute a shell command is e.g.

const Main = imports.ui.main;

Main.Util.spawnCommandLine("gnome-terminal");

The assignment of Main is typically done once, and at the top of the script, of course.

When the output of the command is of interest, it becomes slightly more difficult. The following function implements the parallel of the Perl backtick operator: Run the command, and return the result as a string. Note that unlike its bash counterpart, newlines remain newlines, and are not translated into spaces:

const GLib = imports.gi.GLib;

function backtick(command) {
  try {
    let [result, stdout, stderr] = GLib.spawn_command_line_sync(command);
    if (stdout != null) {
      return stdout.toString();
    }
  }
  catch (e) {
    global.logError(e);
  }

  return "";
}

and then one can go e.g.

let output = backtick("/bin/systemctl is-active hostapd");

after which output is a string containing the result of the execution (with a trailing newline, by the way).
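
Because of that trailing newline, comparing the string directly against "active" fails. A minimal sketch of interpreting the result, in plain JavaScript (the helper name isServiceActive is made up for this example, and it only looks at the text, so it works on whatever backtick() returned):

```javascript
// Decide whether "systemctl is-active" reported a running unit.
// trim() removes the trailing newline (and any other surrounding whitespace).
function isServiceActive(output) {
  return output.trim() === "active";
}

console.log(isServiceActive("active\n"));   // prints true
console.log(isServiceActive("inactive\n")); // prints false
console.log(isServiceActive(""));           // prints false (e.g. backtick() failed)
```

In the applet, this would go along the lines of isServiceActive(backtick("/bin/systemctl is-active hostapd")), for example to choose which icon to show.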

As of December 2018, there’s no proper documentation of Cinnamon’s GLib wrapper; however, the documentation of the C library can give an idea.

My example applet

OK, so here’s a skeleton applet to get started with.

Its pros:

It’s short, quite minimal, and keeps the mumbo-jumbo to a minimum
It shows a simple drop-down menu display applet, which allows running a different shell command from each entry.

Its cons:

It’s written in the less-preferable “prototype” syntax for defining objects.
It does nothing useful. In particular, the shell commands it executes exist only on my computer.
It depends on a custom icon (see “Custom icons” above). Maybe this is an advantage…?

So if you want to give it a go, create a directory named ‘wifier@eli’ (or anything else?) in ~/.local/share/cinnamon/applets/, and put this as metadata.json:

{
    "description": "Turn Wifi Access Point on and off",
    "uuid": "wifier@eli",
    "name": "Wifier"
}

And this as applet.js:

const Applet = imports.ui.applet;
const Lang = imports.lang;
const St = imports.gi.St;
const Main = imports.ui.main;
const PopupMenu = imports.ui.popupMenu;
const UUID = 'wifier@eli';

function MyApplet(orientation, panelHeight, instanceId) {
  this._init(orientation, panelHeight, instanceId);
}

MyApplet.prototype = {
  __proto__: Applet.IconApplet.prototype,

  _init: function(orientation, panelHeight, instanceId) {
    Applet.IconApplet.prototype._init.call(this, orientation, panelHeight, instanceId);

    try {
      this.set_applet_icon_name("wifi-icon-off");
      this.set_applet_tooltip("Control Wifi access point");

      this.menuManager = new PopupMenu.PopupMenuManager(this);
      this.menu = new Applet.AppletPopupMenu(this, orientation);
      this.menuManager.addMenu(this.menu);

      this._contentSection = new PopupMenu.PopupMenuSection();
      this.menu.addMenuItem(this._contentSection);

      // First item: Turn on
      let item = new PopupMenu.PopupIconMenuItem("Access point on", "wifi-icon-on", St.IconType.FULLCOLOR);

      item.connect('activate', Lang.bind(this, function() {
        Main.Util.spawnCommandLine("/usr/local/bin/access-point-ctl on");
      }));
      this.menu.addMenuItem(item);

      // Second item: Turn off
      item = new PopupMenu.PopupIconMenuItem("Access point off", "wifi-icon-off", St.IconType.FULLCOLOR);

      item.connect('activate', Lang.bind(this, function() {
        Main.Util.spawnCommandLine("/usr/local/bin/access-point-ctl off");
      }));
      this.menu.addMenuItem(item);
    }
    catch (e) {
      global.logError(e);
    }
  },

  on_applet_clicked: function(event) {
    this.menu.toggle();
  },
};

function main(metadata, orientation, panelHeight, instanceId) {
  let myApplet = new MyApplet(orientation, panelHeight, instanceId);
  return myApplet;
}

Next, create an “icons” subdirectory (e.g. ~/.local/share/cinnamon/applets/wifier@eli/icons/) and put a small (32 x 32 ?) PNG image there as wifi-icon-off.png, which functions as the applet’s top icon. Possibly download mine from here.

Anyhow, be sure to have an icon file. Otherwise there will be nothing on the panel.

Finally, restart Cinnamon, as explained above. You will get errors when trying the menu items (failed execution), but don’t worry — nothing bad will happen.

References

A JavaScript object reference (there’s probably some “original” out there, but where?)
The main Cinnamon reference page (don’t expect too much)
Creating an Applet
Writing an Applet
Lang.bind() explained here.

Solved: Missing ktorrent icon on Linux Mint / Cinnamon

December 24, 2018, 12:08 am

Running ktorrent on Linux Mint 19 (Tara), the famous downwards-arrow icon was invisible on the system tray. Which made it appear like the program had quit when it was actually minimized. Clicking the empty box made ktorrent re-appear.

Solution: Invoke the Qt5 configuration tool

$ qt5ct

and under the Appearance tab set “Style” to gtk2 (I believe it was “Fusion” before). It’s not just prettier generally, but after restarting ktorrent, the icon is there.

Actually, it’s probably not about the style, but the fact that qt5ct was run. Because before making the change, ktorrent printed out the following when launched from the command line:

Mon Dec 24 09:52:55 2018: Qt Warning: QSystemTrayIcon::setVisible: No Icon set
Warning: QSystemTrayIcon::setVisible: No Icon set
Mon Dec 24 09:52:55 2018: Starting minimized
Mon Dec 24 09:52:55 2018: Started update timer
Mon Dec 24 09:52:55 2018: Qt Warning: inotify_add_watch("/home/eli/.config/qt5ct") failed: "No such file or directory"
Warning: inotify_add_watch("/home/eli/.config/qt5ct") failed: "No such file or directory"

The “No Icon set” warning is misleading, because it continued to appear. This is after the fix, with the icon properly in place in the tray:

Mon Dec 24 10:16:17 2018: Qt Warning: QSystemTrayIcon::setVisible: No Icon set
Warning: QSystemTrayIcon::setVisible: No Icon set

Anyhow, problem fixed. For me, that is.

And why ktorrent? Because its last reported vulnerability was in 2009, compared with “Transmission” which had a nasty issue in January 2018. Actually, the exploit in Transmission is interesting by itself, with a clear lesson: If you set up a webserver on the local host for any purpose, assume anyone can access it. Setting it to respond to 127.0.0.1 only doesn’t help.

Cinelerra 2019 notes

January 7, 2019, 11:43 pm

Cinelerra is alive and kicking. I’ve recently downloaded the “goodguy” revision of Cinelerra, Nov 29 2018 build (called “cin” for some reason), which is significantly smoother than the tool I’m used to.

Notes:

There are now two ways to set the effects’ attributes: With the good old magnifying glass (“Controls”), and with the gear (“Presets”), which gives a textual interface.
Unlike what I’m used to, the Controls only set the global parameters, even with “Genererate keyframes while tweeking” on (spelling as used in Cinelerra).
In order to create keyframes, enable “Generate keyframes” and go for the “gear” tool. That isn’t much fun, because the settings are manual.
If the Controls are used, all keyframes get the value.

Replacing ntpd with systemd-timesyncd (Mint 18.1)

January 27, 2019, 6:57 am

Introduction

It all began when I noticed that my media center Linux machine (Linux Mint 18.1, Serena) finished a TV recording a bit earlier than expected. Logging in and typing “date”, I was quite surprised to find out that the time was off by half a minute.

The first question that comes to mind is why the time synchronization didn’t work. The second is, if it didn’t work, how come I hadn’t noticed this issue earlier? The computer has been in use as a media center for a little less than two years.

What happened

It turns out (and it wasn’t easy to tell) that the relevant daemon was ntpd.

So what’s up, ntp?

$ systemctl status ntp
● ntp.service - LSB: Start NTP daemon
   Loaded: loaded (/etc/init.d/ntp; enabled; vendor preset: enabled)
   Active: active (exited) since Wed 2018-12-19 12:38:06 IST; 1 months 7 days ag
     Docs: man:systemd-sysv-generator(8)
  Process: 1257 ExecStop=/etc/init.d/ntp stop (code=exited, status=0/SUCCESS)
  Process: 1385 ExecStart=/etc/init.d/ntp start (code=exited, status=0/SUCCESS)

Dec 19 12:38:06 tv systemd[1]: Starting LSB: Start NTP daemon...
Dec 19 12:38:06 tv ntp[1385]:  * Starting NTP server ntpd
Dec 19 12:38:06 tv ntp[1385]:    ...done.
Dec 19 12:38:06 tv systemd[1]: Started LSB: Start NTP daemon.
Dec 19 12:38:06 tv ntpd[1398]: proto: precision = 0.187 usec (-22)
Dec 19 12:38:08 tv systemd[1]: Started LSB: Start NTP daemon.

Looks fairly OK. Maybe the logs can tell something?

$ journalctl -u ntp
Dec 19 12:38:02 tv systemd[1]: Stopped LSB: Start NTP daemon.
Dec 19 12:38:02 tv systemd[1]: Starting LSB: Start NTP daemon...
Dec 19 12:38:02 tv ntp[1055]:  * Starting NTP server ntpd
Dec 19 12:38:02 tv ntpd[1074]: ntpd 4.2.8p4@1.3265-o Wed Oct  5 12:34:45 UTC 2016 (1): Starting
Dec 19 12:38:02 tv ntpd[1076]: proto: precision = 0.175 usec (-22)
Dec 19 12:38:02 tv ntp[1055]:    ...done.
Dec 19 12:38:02 tv systemd[1]: Started LSB: Start NTP daemon.
Dec 19 12:38:02 tv ntpd[1076]: Listen and drop on 0 v6wildcard [::]:123
Dec 19 12:38:02 tv ntpd[1076]: Listen and drop on 1 v4wildcard 0.0.0.0:123
Dec 19 12:38:02 tv ntpd[1076]: Listen normally on 2 lo 127.0.0.1:123
Dec 19 12:38:02 tv ntpd[1076]: Listen normally on 3 lo [::1]:123
Dec 19 12:38:02 tv ntpd[1076]: Listening on routing socket on fd #20 for interface updates
Dec 19 12:38:03 tv ntpd[1076]: error resolving pool 0.ubuntu.pool.ntp.org: Temporary failure in name resolution (-3)
Dec 19 12:38:04 tv ntpd[1076]: error resolving pool 1.ubuntu.pool.ntp.org: Temporary failure in name resolution (-3)
Dec 19 12:38:05 tv ntpd[1076]: error resolving pool 2.ubuntu.pool.ntp.org: Temporary failure in name resolution (-3)
Dec 19 12:38:06 tv systemd[1]: Stopping LSB: Start NTP daemon...
Dec 19 12:38:06 tv ntp[1257]:  * Stopping NTP server ntpd
Dec 19 12:38:06 tv ntp[1257]:    ...done.
Dec 19 12:38:06 tv systemd[1]: Stopped LSB: Start NTP daemon.
Dec 19 12:38:06 tv systemd[1]: Stopped LSB: Start NTP daemon.
Dec 19 12:38:06 tv systemd[1]: Starting LSB: Start NTP daemon...
Dec 19 12:38:06 tv ntp[1385]:  * Starting NTP server ntpd
Dec 19 12:38:06 tv ntp[1385]:    ...done.
Dec 19 12:38:06 tv systemd[1]: Started LSB: Start NTP daemon.
Dec 19 12:38:06 tv ntpd[1398]: proto: precision = 0.187 usec (-22)
Dec 19 12:38:08 tv systemd[1]: Started LSB: Start NTP daemon.

Hmmm… There is some kind of trouble there, but it was surely resolved. Or? In fact, there was no ntpd process running, so maybe it just died?

Let’s try to restart the daemon, and see what happens. As root,

# systemctl restart ntp

after which the log went

Jan 26 20:36:46 tv systemd[1]: Stopping LSB: Start NTP daemon...
Jan 26 20:36:46 tv ntp[32297]:  * Stopping NTP server ntpd
Jan 26 20:36:46 tv ntp[32297]: start-stop-daemon: warning: failed to kill 1398: No such process
Jan 26 20:36:46 tv ntp[32297]:    ...done.
Jan 26 20:36:46 tv systemd[1]: Stopped LSB: Start NTP daemon.
Jan 26 20:36:46 tv systemd[1]: Starting LSB: Start NTP daemon...
Jan 26 20:36:46 tv ntp[32309]:  * Starting NTP server ntpd
Jan 26 20:36:46 tv ntp[32309]:    ...done.
Jan 26 20:36:46 tv systemd[1]: Started LSB: Start NTP daemon.
Jan 26 20:36:46 tv ntpd[32324]: proto: precision = 0.187 usec (-22)
Jan 26 20:36:46 tv ntpd[32324]: Listen and drop on 0 v6wildcard [::]:123
Jan 26 20:36:46 tv ntpd[32324]: Listen and drop on 1 v4wildcard 0.0.0.0:123
Jan 26 20:36:46 tv ntpd[32324]: Listen normally on 2 lo 127.0.0.1:123
Jan 26 20:36:46 tv ntpd[32324]: Listen normally on 3 enp3s0 10.1.1.22:123
Jan 26 20:36:46 tv ntpd[32324]: Listen normally on 4 lo [::1]:123
Jan 26 20:36:46 tv ntpd[32324]: Listen normally on 5 enp3s0 [fe80::f757:9ceb:2243:3e16%2]:123
Jan 26 20:36:46 tv ntpd[32324]: Listening on routing socket on fd #22 for interface updates
Jan 26 20:36:47 tv ntpd[32324]: Soliciting pool server 118.67.200.10
Jan 26 20:36:48 tv ntpd[32324]: Soliciting pool server 210.23.25.77
Jan 26 20:36:49 tv ntpd[32324]: Soliciting pool server 211.233.40.78
Jan 26 20:36:50 tv ntpd[32324]: Soliciting pool server 43.245.49.242
Jan 26 20:36:30 tv ntpd[32324]: Soliciting pool server 45.76.187.173
Jan 26 20:36:30 tv ntpd[32324]: Soliciting pool server 46.19.96.19
Jan 26 20:36:31 tv ntpd[32324]: Soliciting pool server 210.173.160.87
Jan 26 20:36:31 tv ntpd[32324]: Soliciting pool server 119.28.206.193
Jan 26 20:36:49 tv ntpd[32324]: Soliciting pool server 133.243.238.244
Jan 26 20:36:49 tv ntpd[32324]: Soliciting pool server 91.189.89.199

Aha! So this is what a kickoff of ntpd should look like! Clearly ntpd didn’t recover all that well from the lack of internet connection (I suppose) during the media center’s bootup. Maybe it died, and was never restarted. The irony is that systemd has a wonderful mechanism for restarting failing daemons, but ntpd is still under the backward-compatible LSB interface. So the system silently remained with no time synchronization.

Go the systemd way

systemd supplies its own lightweight time synchronization mechanism, systemd-timesyncd. It makes much more sense, as it doesn’t open NTP ports as a server (like ntpd does, one may wonder what for), but just synchronizes the computer it runs on to the remote NTP server. And judging from my previous experience with systemd, in the event of multiple solutions, go for the one systemd offers. In fact, it’s sort-of enabled by default:

$ systemctl status systemd-timesyncd
● systemd-timesyncd.service - Network Time Synchronization
   Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled; vendor preset: enabled)
  Drop-In: /lib/systemd/system/systemd-timesyncd.service.d
           └─disable-with-time-daemon.conf
   Active: inactive (dead)
Condition: start condition failed at Wed 2018-12-19 12:38:01 IST; 1 months 7 days ago
           ConditionFileIsExecutable=!/usr/sbin/VBoxService was not met
     Docs: man:systemd-timesyncd.service(8)

Start condition failed? What’s this? Let’s look at the drop-in file:

$ cat /lib/systemd/system/systemd-timesyncd.service.d/disable-with-time-daemon.conf
[Unit]
# don't run timesyncd if we have another NTP daemon installed
ConditionFileIsExecutable=!/usr/sbin/ntpd
ConditionFileIsExecutable=!/usr/sbin/openntpd
ConditionFileIsExecutable=!/usr/sbin/chronyd
ConditionFileIsExecutable=!/usr/sbin/VBoxService

Oh please, you can’t be serious. Disabling the execution because of the existence of a file? If another NTP daemon is installed, does it mean it’s being enabled? In particular, if VBoxService is installed, does it mean we’re running as guests on a virtual machine? Like, seriously, someone might just install the Virtual Box client tools for no reason at all, and poof, there goes the time synchronization without any warning (note that this wasn’t the problem I had).

Moving to systemd-timesyncd

As mentioned earlier, systemd-timesyncd is enabled by default, but one may insist:

# systemctl enable systemd-timesyncd.service

(No response, because it’s enabled anyhow)

However, in order to make it work, remove the condition that prevents it from running:

# rm /lib/systemd/system/systemd-timesyncd.service.d/disable-with-time-daemon.conf

and then disable and stop ntpd:

# systemctl disable ntp
# systemctl stop ntp

On my computer, the other two time synchronizing tools (openntpd and chrony) aren’t installed, so there’s no need to worry about them.

And then we have timedatectl

Not directly related, but still worth mentioning:

$ timedatectl
      Local time: Sat 2019-01-26 21:22:57 IST
  Universal time: Sat 2019-01-26 19:22:57 UTC
        RTC time: Sat 2019-01-26 19:22:57
       Time zone: Asia/Jerusalem (IST, +0200)
 Network time on: yes
NTP synchronized: yes
 RTC in local TZ: no

Systemd is here to take control of everything, obviously.

Perl, DBI and MySQL wrongly reads zeros from database

February 15, 2019, 7:29 am

TL;DR: SELECT queries in Perl for numerical columns suddenly returned zeros after a software upgrade.

This is a really peculiar problem I had after my web hosting provider upgraded some database related software on the server: Numbers that were read with SELECT queries from the database were suddenly all zeros.

Spoiler: It’s about running Perl in Taint Mode.

The setting was DBD::mysql version 4.050, DBI version 1.642, Perl v5.10.1, and MySQL Community Server version 5.7.25 on a Linux machine.

For example, the following script is supposed to print the number of rows in the “session” table:

#!/usr/bin/perl -T -w
use warnings;
use strict;
require DBI;

my $dbh = DBI->connect( "DBI:mysql:mydb:localhost", "mydb", "password",
		     { RaiseError => 1, AutoCommit => 1, PrintError => 0,
		       Taint => 1});

my $sth = $dbh->prepare("SELECT COUNT(*) FROM session");

$sth->execute();

my @l = $sth->fetchrow_array;
my $s = $l[0];
print "$s\n";

$sth->finish();
$dbh->disconnect;

But instead, it prints zero, even though there are rows in the said table. Turning off taint mode by removing the “-T” flag in the shebang line gives the correct output. Needless to say, accessing the database with the “mysql” command-line utility client gave the correct output as well.

This is true for any numeric readout from this MySQL wrapper. This is in particular problematic when an integer is used as a user ID of a web site, and fetched with

my $sth = db::prepare_cached("SELECT id FROM users WHERE username=? AND passwd=?");
$sth->execute($name, $password);
my ($uid) = $sth->fetchrow_array;
$sth->finish();

If the credentials are wrong, $uid will be undef, as usual. But if any valid user gives correct credentials, it’s allocated user number 0. Which I was cautious enough not to allocate as the site’s supervisor, but that’s actually a common choice (what’s the UID of root on a Linux system?).

A softer workaround, instead of dropping the “-T” flag, is to set the TaintIn flag in the DBI->connect() call, instead of Taint. The latter stands for TaintIn and TaintOut, and so this fix effectively disables TaintOut, hence tainting of data from the database is disabled. And in this case, disabling tainting of this data also skips the zero-value bug. This leaves all other tainting checks in place, in particular that of data supplied from the network. So not enforcing sanitizing data from the database is a small sacrifice (in particular if the script already has mileage running with the enforcement on).

And in the end I wonder if I’m the only one who uses Perl’s tainting mechanism. I mean, if there are still (2019) advisories on SQL injections (mostly PHP scripts), maybe people just don’t care much about things of this sort.

When mplayer plays a black window (or: Cinnamon leaking GPU memory)

April 14, 2019, 5:22 am

The incident

All of a sudden, playing videos with Mplayer opened a black window. Sometimes going fullscreen helped, sometimes it didn’t, and sometimes the video played but without OSD. ffplay worked, but somewhat limping.

Setting: Linux Mint 19 on an x86_64, with a couple of fanless GeForce GT 1030 graphics cards and Cinnamon 3.8.9.

Mplayer’s output in this situation:

Playing IHS_1235.MOV.
libavformat version 57.83.100 (external)
libavformat file format detected.
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f858e2362a0]Protocol name not provided, cannot determine if input is local or a network protocol, buffers and access patterns cannot be configured optimally without knowing the protocol
[lavf] stream 0: video (h264), -vid 0
[lavf] stream 1: audio (pcm_s16le), -aid 0, -alang eng
VIDEO:  [H264]  1920x1080  24bpp  59.940 fps  36067.5 kbps (4402.8 kbyte/s)
==========================================================================
Opening video decoder: [ffmpeg] FFmpeg's libavcodec codec family
libavcodec version 57.107.100 (external)
Selected video codec: [ffh264] vfm: ffmpeg (FFmpeg H.264)
==========================================================================
Opening audio decoder: [pcm] Uncompressed PCM audio decoder
AUDIO: 48000 Hz, 2 ch, s16le, 1536.0 kbit/100.00% (ratio: 192000->192000)
Selected audio codec: [pcm] afm: pcm (Uncompressed PCM)
==========================================================================
AO: [pulse] 48000Hz 2ch s16le (2 bytes per sample)
Starting playback...
Movie-Aspect is undefined - no prescaling applied.
VO: [vdpau] 1920x1080 => 1920x1080 Planar YV12
[vdpau] Error when calling vdp_output_surface_create: The system does not have enough resources to complete the requested operation at this time.
[vdpau] Error when calling vdp_output_surface_create: The system does not have enough resources to complete the requested operation at this time.
[vdpau] Error when calling vdp_output_surface_create: The system does not have enough resources to complete the requested operation at this time.
[vdpau] Error when calling vdp_output_surface_create: The system does not have enough resources to complete the requested operation at this time.
[vdpau] Error when calling vdp_presentation_queue_block_until_surface_idle: An invalid handle value was provided.
[vdpau] Error when calling vdp_video_mixer_render: An invalid handle value was provided.
[vdpau] Error when calling vdp_presentation_queue_display: An invalid handle value was provided.
A:   0.2 V:   0.0 A-V:  0.216 ct:  0.000   0/  0 ??% ??% ??,?% 0 0
[vdpau] Error when calling vdp_presentation_queue_block_until_surface_idle: An invalid handle value was provided.
[vdpau] Error when calling vdp_video_mixer_render: An invalid handle value was provided.
[vdpau] Error when calling vdp_presentation_queue_block_until_surface_idle: An invalid handle value was provided.
[vdpau] Error when calling vdp_video_mixer_render: An invalid handle value was provided.
[vdpau] Error when calling vdp_presentation_queue_display: An invalid handle value was provided.
[vdpau] Error when calling vdp_presentation_queue_display: An invalid handle value was provided.

And a lot of error messages, with “invalid handle value was provided” all over the place.

What does the graphics card have to say?

Opening Nvidia’s graphical control panel (Nvidia X Server Settings), it turns out that “User Dedicated Memory” stands at 1864 MB out of 1998 MB (93%). No wonder things don’t work.

OK, so who’s eating up all the RAM? I have a wild guess, but there’s nothing like getting it in black and white:

$ nvidia-smi
Sun Apr 14 14:39:40 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37                 Driver Version: 396.37                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 1030     Off  | 00000000:17:00.0 Off |                  N/A |
|  0%   41C    P8    N/A /  30W |      1MiB /  2001MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GT 1030     Off  | 00000000:65:00.0  On |                  N/A |
|  0%   51C    P8    N/A /  30W |   1914MiB /  1998MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      1803      G   /usr/lib/xorg/Xorg                           433MiB |
|    1      2373      G   cinnamon                                    1310MiB |
|    1     54180      G   ...uest-channel-token=14764917277860092693   165MiB |
|    1     68188      G   /usr/bin/nvidia-settings                       0MiB |
+-----------------------------------------------------------------------------+

(The memory consumption figures are at the far right of each line. Scroll to see them.)

At that very moment the cinnamon process had slurped quite some CPU RAM as well: 5.7 GB virtual memory allocated and 1.3 GB resident (real RAM). So it’s leaking memory everywhere. That’s after running for two months.

The other hog is Google Chrome, by the way (165 MiB), also after running continuously for two months.
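To avoid eyeballing the table, the process rows can be pulled out with a few lines of Python. This is only a sketch: it assumes the table layout of the driver version shown above (other versions of nvidia-smi format the table differently), and the function name gpu_memory_by_process is made up for this example.

```python
import re

# Matches the process rows of the nvidia-smi table layout shown above, e.g.
# |    1      2373      G   cinnamon                                    1310MiB |
# The layout is an assumption: other driver versions print other columns.
PROCESS_ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+(?P<pid>\d+)\s+(?P<type>\S+)\s+"
    r"(?P<name>.+?)\s+(?P<mib>\d+)MiB\s*\|$"
)


def gpu_memory_by_process(smi_output):
    """Return (process name, PID, MiB) tuples, biggest consumer first."""
    rows = []
    for line in smi_output.splitlines():
        m = PROCESS_ROW.match(line.strip())
        if m:
            rows.append((m["name"], int(m["pid"]), int(m["mib"])))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Feeding it the captured output of nvidia-smi (for example through subprocess.run) puts the biggest consumer first, which on the listing above is cinnamon.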

Solution

The solution is surprisingly simple and harmless: Restart Cinnamon. Yes, you can do this even if there are a lot of windows open, spread out in different workspaces. They will remain in place, don’t worry. Only the tabs will be reordered within each workspace, but that’s a really small price. To do this (as I mentioned on another post):

Press ALT-F2, type “r” and Enter. Look away for a few seconds, because what happens next looks like a sudden reboot, but it isn’t. All comes back.

Except a lot of memory has been freed. Resident CPU RAM went down from 1.3 GB to 256 MB, but even more important:

$ nvidia-smi
Sun Apr 14 14:49:19 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37                 Driver Version: 396.37                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 1030     Off  | 00000000:17:00.0 Off |                  N/A |
|  0%   41C    P8    N/A /  30W |      1MiB /  2001MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GT 1030     Off  | 00000000:65:00.0  On |                  N/A |
|  0%   52C    P0    N/A /  30W |    701MiB /  1998MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      1803      G   /usr/lib/xorg/Xorg                           498MiB |
|    1      2373      G   cinnamon                                      17MiB |
|    1     54180      G   ...uest-channel-token=14764917277860092693   177MiB |
+-----------------------------------------------------------------------------+

That’s a crash diet until the next time. Once a month, I guess.


Solved: systemd boot waits 90 seconds on net-devices-eth0

May 22, 2019, 4:07 am


Introduction

After installing wireshark (and tons of packages it depends on) on a rather fresh and bare-bones Debian 8 (Jessie), I got the “A start job is running for sys-subsystem-net-devices-eth0.device” message for a minute and a half on every boot.

It was exceptionally difficult to find the reason, because so many packages were installed along with wireshark.

This is the short version of how this was solved. For the entire battery of stuff I tried out, I’ve written a separate post.

Bad omens

# systemctl status sys-subsystem-net-devices-eth0.device
● sys-subsystem-net-devices-eth0.device
   Loaded: loaded
   Active: inactive (dead)

May 20 12:47:07 diskless systemd[1]: Expecting device sys-subsystem-net-devices-eth0.device...
May 20 12:48:37 diskless systemd[1]: Job sys-subsystem-net-devices-eth0.device/start timed out.
May 20 12:48:37 diskless systemd[1]: Timed out waiting for device sys-subsystem-net-devices-eth0.device.

OK. Not surprising it’s not active. So start manually…?

# systemctl start sys-subsystem-net-devices-eth0.device
Job for sys-subsystem-net-devices-eth0.device timed out.

The second line appeared after a minute and a half, of course.

So I went to another, more recent machine (Mint 19) and ran

$ systemctl status sys-subsystem-net-devices-eth0.device
● sys-subsystem-net-devices-eth0.device - Killer E2500 Gigabit Ethernet Controll
   Loaded: loaded
   Active: active (plugged) since Wed 2019-02-20 14:48:54 IST; 2 months 30 days
   Device: /sys/devices/pci0000:00/0000:00:1c.2/0000:04:00.0/net/eth0

And then comparing the outputs of just

$ systemctl

it became evident that *.device units are listed on the Mint 19 machine, but not on Debian 8.

Which led me to the conclusion that sys-subsystem-net-devices-eth0.device isn’t meant to be on Debian 8. That the problem isn’t that it’s not starting when commanded to do so, but that it’s not supposed to be started that way. The problem is that some other unit requests it.

As far as I understand, these .device units should become active and inactive by a systemd-udev event by virtue of udev labeling. They are there to trigger other units that depend on them, not to be controlled explicitly. For some reason they aren’t activated on the Debian 8 machine, despite udev rules being roughly the same as on the Mint 19 machine.

In the absence of proper docs (?), I’m left to guess that requesting a start or stop on device units means waiting for them to reach the desired state by themselves. This goes along with an observation I’ve made with strace, showing that systemd does nothing meaningful until it times out. So most likely, it just looked up the state of the device unit, saw it wasn’t started, and then went to sleep, essentially waiting for a udev event to bring the unit to the desired state, and consequently return a success status to the start request.

In fact, when I tried “systemctl stop” on the eth0 device on Mint 19 (i.e. the machine on which it was already loaded) it got stuck exactly the same way as for starting it on Debian 8. So that command probably meant “wait until eth0 goes away”.

Closing in

The trick is now to find which unit causes the attempt to kick off sys-subsystem-net-devices-eth0.device.

# journalctl -x

[ ... ]

May 20 11:41:20 diskless systemd[1]: Job sys-subsystem-net-devices-eth0.device/start timed out.
May 20 11:41:20 diskless systemd[1]: Timed out waiting for device sys-subsystem-net-devices-eth0.device.
-- Subject: Unit sys-subsystem-net-devices-eth0.device has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit sys-subsystem-net-devices-eth0.device has failed.
--
-- The result is timeout.
May 20 11:41:20 diskless systemd[1]: Dependency failed for ifup for eth0.
-- Subject: Unit ifup@eth0.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ifup@eth0.service has failed.
--
-- The result is dependency.

In English: It’s clear that sys-subsystem-net-devices-eth0.device is the unit that didn’t manage to kick off. But the more important clue is that ifup@eth0.service failed, because it depends on the former. The easier solution lies in the latter.
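When the log is long, the same reading can be done mechanically. The following is a rough sketch that relies only on the two explanatory line formats visible in the excerpt above (“-- Unit ... has failed.” and “-- The result is ...”). The function name failed_units is invented for the example, and other systemd versions may word these lines differently.

```python
import re

# The two explanatory line formats that appear in the log excerpt above.
UNIT_FAILED = re.compile(r"^-- Unit (?P<unit>\S+) has failed\.$")
RESULT = re.compile(r"^-- The result is (?P<result>\S+?)\.$")


def failed_units(journal_text):
    """Map each failed unit to the result reported for it."""
    results = {}
    current = None
    for line in journal_text.splitlines():
        line = line.strip()
        m = UNIT_FAILED.match(line)
        if m:
            current = m["unit"]
            continue
        m = RESULT.match(line)
        if m and current is not None:
            results[current] = m["result"]
            current = None
    return results
```

Units with the result “dependency” are the ones worth looking at first, since they name who asked for the unit that timed out.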

But frankly, I don’t really understand what happened here. If eth0 was detected by systemd, why wasn’t the relevant device unit activated? Or if it wasn’t, why was ifup@eth0.service kicked off? The relevant unit file is a wildcard service, not naming any specific device name.

Solution

The textbook solution is to find out why .device units aren’t generated at all on my Debian 8 system and fix that, so there won’t be any delay. The correct solution in some cases is to manipulate the udev rules, adding a TAG+="systemd" rule to the device, so that the device unit is started automatically by systemd (man systemd.device). In my case this tag was already there, so it’s probably some issue with the service that’s supposed to respond to the udev event. So that’s a dead end.
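Checking for the tag can be scripted. This sketch assumes that the output of “udevadm info” on the device lists properties as “E: KEY=value” lines and that tags show up as a colon-delimited TAGS property (something like “E: TAGS=:systemd:”). That format is from memory, not from this machine’s output, so verify it before relying on it. The function name has_systemd_tag is made up.

```python
def has_systemd_tag(udevadm_info_output):
    """Tell whether the device's udev database entry carries the systemd tag.

    Assumes property lines of the form "E: TAGS=:tag1:tag2:" (an assumption
    about the output format of "udevadm info", to be checked on the target).
    """
    prefix = "E: TAGS="
    for line in udevadm_info_output.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # ":systemd:" splits into ["", "systemd", ""]
            return "systemd" in line[len(prefix):].split(":")
    return False
```

If this returns True and the device unit still never becomes active, as in my case, the udev rules are probably not where the problem is.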

So go the clumsy way: Remove the unit file that requests the device unit (or maybe I should have masked it by adding a file in /etc?). In this case, it’s /lib/systemd/system/ifup@.service, which said:

[Unit]
Description=ifup for %I
After=local-fs.target network-pre.target networking.service systemd-sysctl.service
Before=network.target
BindsTo=sys-subsystem-net-devices-%i.device
After=sys-subsystem-net-devices-%i.device
ConditionPathIsDirectory=/run/network
DefaultDependencies=no

[Service]
ExecStart=/sbin/ifup --allow=hotplug %I
ExecStop=/sbin/ifdown %I
RemainAfterExit=true

and then make sure this had no adverse side effects (none found so far). Actually, removing this file can’t be worse than it was when it took 90 seconds to boot, because this service wasn’t launched anyhow, as its precondition never started.
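Since so many packages were installed at once, it helps to scan unit files for whoever references a .device unit. Below is a minimal sketch of such a scan for a single file’s text. The function name device_dependencies and the list of keys it checks are my own choices, and it only expands %i, which is enough for a simple instance name like eth0.

```python
# Dependency directives worth checking; this list is a choice made for the
# sketch, not something taken from systemd's documentation in full.
DEPENDENCY_KEYS = ("BindsTo", "Requires", "Wants", "After")


def device_dependencies(unit_text, instance):
    """List (directive, unit) pairs in [Unit] that refer to .device units."""
    found = []
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section != "Unit" or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key not in DEPENDENCY_KEYS:
            continue
        for name in value.split():
            # Simplification: only %i is expanded, and no escaping is applied
            name = name.replace("%i", instance)
            if name.endswith(".device"):
                found.append((key, name))
    return found
```

Run on the unit file above with the instance “eth0”, this reports the BindsTo and the second After line, both pointing at sys-subsystem-net-devices-eth0.device. Looping over the files in /lib/systemd/system with this function would have pointed at ifup@.service directly.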


systemd / DBus debugging starter pack

May 24, 2019, 11:05 am


Introduction

Trying to solve a 90 second wait-on-some-start-job on each boot situation, I found that there’s little info on how to tackle a problem like this out there. Most web pages usually go “I did this, hurray it worked!”, but where do you start solving a problem like this if none of the do-this-do-that advice helps?

So this is a random pile of things to try out. Most of the things shown below didn’t solve my own issue, but these are the tools I collected in my toolbox.

I did this on a Debian 8 (Jessie), and the problem was that boot was stuck for a minute and a half on “A start job is running for sys-subsystem-net-devices-eth0.device”. It’s not directly relevant, except that it influences what I tried out, and hence listed below.

I have written a shorter version of this post which focuses on the specific problem. This post is more about the techniques for figuring out what’s going on.

PID 1 at your service

The tricky part of systemd is that much of the activity is done directly by the systemd process, having PID 1. Requests to start and stop services and other units are sent via DBus messages, i.e. over connections to UNIX sockets. To someone who is used to the good-old-systemV Linux, this is voodoo at its worst, but there are simple ways to keep track of this, as shown below.

In particular, don’t strace the “systemctl start” process — it just sends the request over DBus. Rather, attach strace to PID 1, also explained below. That’s where the fork to the actual job process takes place, if at all.

And don’t get confused by having /org/freedesktop/ appearing everywhere in the logs. It doesn’t necessarily have anything to do with the desktop (if such exists), and is likewise relevant to a non-graphical system. DBus started as a solution for desktop machines, and that’s the only reason “freedesktop” is everywhere.

First things first

Read the man page, “man systemd.device” in my case. If there’s another computer with different configuration, see what happens there. What does it look like when it works?

journalctl -x

As mentioned on this page, if something went wrong during boot, check out the log to see why. The -x flag adds valuable info for solving issues of this sort.

For example,

# journalctl -x

[ ... ]

May 20 11:41:20 diskless systemd[1]: Job sys-subsystem-net-devices-eth0.device/start timed out.
May 20 11:41:20 diskless systemd[1]: Timed out waiting for device sys-subsystem-net-devices-eth0.device.
-- Subject: Unit sys-subsystem-net-devices-eth0.device has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit sys-subsystem-net-devices-eth0.device has failed.
--
-- The result is timeout.
May 20 11:41:20 diskless systemd[1]: Dependency failed for ifup for eth0.
-- Subject: Unit ifup@eth0.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ifup@eth0.service has failed.
--
-- The result is dependency.

Now, how to read it: It’s clear that sys-subsystem-net-devices-eth0.device is the unit that didn’t manage to kick off. But the more important clue is that ifup@eth0.service failed, because it depends on the former.

It’s important, because it explains why an attempt to launch sys-subsystem-net-devices-eth0.device was done in the first place. A lot of “there I fixed it” pages on the web disable the latter service, and get rid of the problem, not necessarily understanding how and why.

So here’s why: On some relatively early systemd versions, *.device units simply won’t launch. It’s worked around by making sure that no other unit requests them. But then some software package isn’t aware of this, and requests a .device unit and there’s the deadlock. Or more precisely, waiting 90s for the timeout.

Kicking it off manually

Obviously, the unit is inactive:

# systemctl status sys-subsystem-net-devices-eth0.device
● sys-subsystem-net-devices-eth0.device
   Loaded: loaded
   Active: inactive (dead)

May 20 12:47:07 diskless systemd[1]: Expecting device sys-subsystem-net-devices-eth0.device...
May 20 12:48:37 diskless systemd[1]: Job sys-subsystem-net-devices-eth0.device/start timed out.
May 20 12:48:37 diskless systemd[1]: Timed out waiting for device sys-subsystem-net-devices-eth0.device.

So try to start it manually (this little session took 90 seconds, right?)

# systemctl start sys-subsystem-net-devices-eth0.device
Job for sys-subsystem-net-devices-eth0.device timed out.

The important takeaway is that we can repeat the problem on a running system (as opposed to a booting one). This allows running some tools for looking at what happens.

On the other hand, I’m not all that sure .device units are supposed to be started or stopped with systemctl at all. Or more likely, that requesting a start or stop on device units means waiting for them to reach the desired state by themselves. This goes along with the observation I made with strace (below), showing that systemd does nothing meaningful until it times out. So most likely, it just looked up the state of the device unit, saw it wasn’t started, and then went to sleep, essentially waiting for a udev event to bring the unit to the desired state, and consequently return a success status to the start request.

In fact, when I tried “systemctl stop” on the eth0 device on another machine, on which the device unit was activated automatically, it got stuck exactly the same way as for starting it on Debian 8.

As far as I understand, these should become active and inactive by a systemd-udev event by virtue of udev labeling. They are there to trigger other units that depend on them, not to be controlled explicitly.

But here comes a major red herring: Curiously enough, during the 90 seconds of waiting, “systemctl start” created a child process, “/bin/systemd-tty-ask-password-agent --watch”. One can easily be misled into thinking that it’s this child process that blocks the completion of the former command.

So first, let’s convince ourselves that it’s not the problem, because running

# systemctl --no-ask-password start sys-subsystem-net-devices-eth0.device

doesn’t create this second process, but is stuck nevertheless.

This systemd-tty-ask-password-agent process listens for system-wide requests for obtaining a password from the user (e.g. when opening a crypto disk), and does that job if necessary. systemctl launches it just in case, regardless of the unit requested for starting. This is the way to make sure passwords are collected, if so needed. This process is usually not visible, because systemctl commands typically don’t last very long. More about it here.

Actually, checking with strace, systemctl was blocking all those 90 seconds on a ppoll(), waiting for some response from the /run/systemd/private UNIX socket. That’s the DBus connection with process number 1, systemd. In other words, systemctl requested the start of the unit over DBus, and then waited for the result for 90 seconds, at which point it got the answer that the attempt timed out.

Listening to DBus

There are two utilities, dbus-monitor and “busctl monitor”, for dumping DBus messages (an eavesdrop add-on may be required to allow system-wide message monitoring, but this was not the case on my system).

So on the invocation of

# systemctl start sys-subsystem-net-devices-eth0.device

the output of

# dbus-monitor --system

was

signal sender=org.freedesktop.DBus -> dest=:1.8 serial=2 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=NameAcquired
   string ":1.8"
signal sender=:1.0 -> dest=(null destination) serial=127 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=UnitNew
   string "sys-subsystem-net-devices-eth0.device"
   object path "/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2deth0_2edevice"
signal sender=:1.0 -> dest=(null destination) serial=128 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=JobNew
   uint32 173
   object path "/org/freedesktop/systemd1/job/173"
   string "sys-subsystem-net-devices-eth0.device"
signal sender=:1.0 -> dest=(null destination) serial=129 path=/org/freedesktop/systemd1/job/173; interface=org.freedesktop.DBus.Properties; member=PropertiesChanged
   string "org.freedesktop.systemd1.Job"
   array [
      dict entry(
         string "State"
         variant             string "running"
      )
   ]
   array [
   ]

and when the timeout occurs with a

Job for sys-subsystem-net-devices-eth0.device timed out.

the following output is captured on the DBus:

signal sender=:1.0 -> dest=(null destination) serial=141 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=JobRemoved
   uint32 173
   object path "/org/freedesktop/systemd1/job/173"
   string "sys-subsystem-net-devices-eth0.device"
   string "timeout"
signal sender=:1.0 -> dest=(null destination) serial=142 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=UnitRemoved
   string "sys-subsystem-net-devices-eth0.device"
   object path "/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2deth0_2edevice"

Clearly, everything was done by the systemd main process, and almost nothing by the process created from console.

The “sender=:1.0” part is the sender’s unique name on the bus, which in this case belongs to process number 1 (systemd). Try

$ busctl

(that’s short for “busctl list”) to get a mapping between these addresses and processes.

See the number 173 in the object paths all over in the dbus traffic? That’s the job number as listed in

# systemctl list-jobs
JOB UNIT                                  TYPE  STATE
173 sys-subsystem-net-devices-eth0.device start running

1 jobs listed.

Note that these job numbers have absolutely nothing to do with the Linux PIDs.
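The object paths in the dump are easier to read once decoded. Judging by the paths above, every character of the unit name that isn’t a letter or a digit is replaced by an underscore and two hex digits (“-” becomes _2d, “.” becomes _2e). The sketch below follows that rule as I understand it, including the details that a leading digit is escaped too and that an empty name is encoded as a lone underscore. All function names are invented for the example.

```python
import re

JOB_PREFIX = "/org/freedesktop/systemd1/job/"


def encode_bus_label(name):
    """Escape a unit name the way it appears in the object paths above."""
    if not name:
        return "_"
    out = []
    for i, byte in enumerate(name.encode("utf-8")):
        ch = chr(byte)
        is_alpha = "a" <= ch <= "z" or "A" <= ch <= "Z"
        is_digit = "0" <= ch <= "9"
        if is_alpha or (is_digit and i > 0):
            out.append(ch)
        else:
            out.append("_%02x" % byte)
    return "".join(out)


def decode_bus_label(label):
    """Turn the last component of a unit object path back into a unit name."""
    if label == "_":
        return ""
    raw = re.sub(
        rb"_([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        label.encode("ascii"),
    )
    return raw.decode("utf-8")


def job_id_from_path(path):
    """Extract the systemd job number from a job object path, or None."""
    tail = path[len(JOB_PREFIX):]
    if path.startswith(JOB_PREFIX) and tail.isdigit():
        return int(tail)
    return None
```

This way a captured dbus-monitor log can be matched against the output of “systemctl list-jobs” by job number and by unit name.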

Using strace

strace is often very useful for resolving OS problems. It’s however important to realize that the old-fashioned way of stracing the process created on command line will probably not yield much information, because this process only sends a request over DBus.

Instead, strace the process that does the actual work: PID 1, the Mother Of All Processes, the almighty systemd itself. I have to admit that I was first intimidated by the idea of attaching strace to this process, but it turns out that it’s usually quite calm, and spits out relatively little unrelated mumbo-jumbo.

Bonus: It’s always the same command:

# strace -p 1 -s 128 -ff -o systemd-trace

This makes a file for each process systemd may fork into. If things went wrong because some process didn’t execute properly, this is how we catch it.

For example, when running the said “systemctl start sys-subsystem-net-devices-eth0.device” command, this was the output:

accept4(12, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 13
getsockopt(13, SOL_SOCKET, SO_PEERCRED, {pid=901, uid=0, gid=0}, [12]) = 0
open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18
read(18, "\270\305\231\206+&\262MI\313[\337y}\314V", 16) = 16
close(18)                               = 0
fcntl(13, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(13, F_GETFD)                      = 0x1 (flags FD_CLOEXEC)
fstat(13, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
setsockopt(13, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
setsockopt(13, SOL_SOCKET, 0x22 /* SO_??? */, [0], 4) = 0
getsockopt(13, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(13, SOL_SOCKET, 0x21 /* SO_??? */, [8388608], 4) = 0
getsockopt(13, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
setsockopt(13, SOL_SOCKET, 0x20 /* SO_??? */, [8388608], 4) = 0
getsockopt(13, SOL_SOCKET, SO_PEERCRED, {pid=901, uid=0, gid=0}, [12]) = 0
fstat(13, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
getsockopt(13, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0
getsockname(13, {sa_family=AF_LOCAL, sun_path="/run/systemd/private"}, [23]) = 0
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"\0AUTH EXTERNAL 30\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n", 256}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 45
epoll_ctl(4, EPOLL_CTL_ADD, 13, {0, {u32=2784593808, u64=94857137325968}}) = 0
open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18
read(18, "\10,\360z\t\363+\355D\2556NLkhL", 16) = 16
close(18)                               = 0
open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18
read(18, "$F\3302\215\326\320\251\261\240\217\232\224\1\346\205", 16) = 16
close(18)                               = 0
open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18
read(18, "|\313E\273R\264 \375\v\245\235\206h\247\30-", 16) = 16
close(18)                               = 0
open("/dev/urandom", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 18
read(18, "{\200\255\356\26\341b4V_P\225aHkO", 16) = 16
close(18)                               = 0
epoll_ctl(4, EPOLL_CTL_MOD, 13, {EPOLLIN|EPOLLOUT, {u32=2784593808, u64=94857137325968}}) = 0
timerfd_settime(29, TFD_TIMER_ABSTIME, {it_interval={0, 0}, it_value={2519, 883043000}}, NULL) = 0
epoll_wait(4, {{EPOLLOUT, {u32=2784593808, u64=94857137325968}}}, 33, 0) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 359675827}) = 0
timerfd_settime(29, TFD_TIMER_ABSTIME, {it_interval={0, 0}, it_value={2508, 883043000}}, NULL) = 0
epoll_wait(4, {{EPOLLOUT, {u32=2784593808, u64=94857137325968}}}, 33, 0) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 359719924}) = 0
sendmsg(13, {msg_name(0)=NULL, msg_iov(3)=[{"OK b8c599862b26424d89cb5bdf797dcc56\r\nAGREE_UNIX_FD\r\n", 52}, {NULL, 0}, {NULL, 0}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 52
epoll_ctl(4, EPOLL_CTL_MOD, 13, {EPOLLIN, {u32=2784593808, u64=94857137325968}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=2784593808, u64=94857137325968}}}, 33, -1) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 359793419}) = 0
epoll_wait(4, {{EPOLLIN, {u32=2784593808, u64=94857137325968}}}, 33, -1) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 359846496}) = 0
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\0018\0\0\0\1\0\0\0\240\0\0\0\1\1o\0\31\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1\0\0\0\0\0\0\0\3\1s\0\t\0\0\0StartUnit\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop."..., 208}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 208
getuid()                                = 0
sendmsg(13, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1&\0\0\0\1\0\0\0\17\0\0\0\5\1u\0\1\0\0\0\10\1g\0\1o\0\0", 32}, {"!\0\0\0/org/freedesktop/systemd1/job/242\0", 38}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 70
sendmsg(13, {msg_name(0)=NULL, msg_iov(2)=[{"l\4\1\1H\0\0\0\2\0\0\0\206\0\0\0\1\1o\0!\0\0\0/org/freedesktop/systemd1/job/242\0\0\0\0\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChange"..., 152}, {"\34\0\0\0org.freedesktop.systemd1.Job\0\0\0\0\34\0\0\0\5\0\0\0State\0\1s\0\0\0\0\7\0\0\0running\0\0\0\0\0", 72}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 224
sendmsg(35, {msg_name(0)=NULL, msg_iov(2)=[{"l\4\1\1H\0\0\0\302\0\0\0\206\0\0\0\1\1o\0!\0\0\0/org/freedesktop/systemd1/job/242\0\0\0\0\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChange"..., 152}, {"\34\0\0\0org.freedesktop.systemd1.Job\0\0\0\0\34\0\0\0\5\0\0\0State\0\1s\0\0\0\0\7\0\0\0running\0\0\0\0\0", 72}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 224
epoll_wait(4, {{EPOLLIN, {u32=2784593808, u64=94857137325968}}}, 33, 0) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 360201428}) = 0
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\1*\0\0\0\2\0\0\0\227\0\0\0\1\1o\0\31\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1\0\0\0\0\0\0\0\3\1s\0\7\0\0\0GetUnit\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.systemd1"..., 186}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 186
sendmsg(13, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1S\0\0\0\3\0\0\0\17\0\0\0\5\1u\0\2\0\0\0\10\1g\0\1o\0\0", 32}, {"N\0\0\0/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2deth0_2edevice\0", 83}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 115
epoll_wait(4, {}, 33, 0)                = 0
clock_gettime(CLOCK_BOOTTIME, {2508, 360292582}) = 0
epoll_wait(4, {{EPOLLIN, {u32=2784593808, u64=94857137325968}}}, 33, -1) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 360319860}) = 0
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\0019\0\0\0\3\0\0\0\300\0\0\0\1\1o\0N\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2deth0_2edevice\0\0\3\1s\0\3\0\0\0Get\0\0\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Pro"..., 241}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=901, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 241
lstat("/etc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/etc/systemd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/etc/systemd/system", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/etc/systemd/system/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
lstat("/run", {st_mode=S_IFDIR|0755, st_size=620, ...}) = 0
lstat("/run/systemd", {st_mode=S_IFDIR|0755, st_size=400, ...}) = 0
lstat("/run/systemd/system", {st_mode=S_IFDIR|0755, st_size=120, ...}) = 0
lstat("/run/systemd/system/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
lstat("/run", {st_mode=S_IFDIR|0755, st_size=620, ...}) = 0
lstat("/run/systemd", {st_mode=S_IFDIR|0755, st_size=400, ...}) = 0
lstat("/run/systemd/generator", {st_mode=S_IFDIR|0755, st_size=360, ...}) = 0
lstat("/run/systemd/generator/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
lstat("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/local", {st_mode=S_IFDIR|S_ISGID|0775, st_size=4096, ...}) = 0
lstat("/usr/local/lib", {st_mode=S_IFDIR|S_ISGID|0775, st_size=4096, ...}) = 0
lstat("/usr/local/lib/systemd", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
lstat("/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/lib/systemd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/lib/systemd/system", {st_mode=S_IFDIR|0755, st_size=36864, ...}) = 0
lstat("/lib/systemd/system/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
lstat("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/lib/systemd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/lib/systemd/system", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
lstat("/run", {st_mode=S_IFDIR|0755, st_size=620, ...}) = 0
lstat("/run/systemd", {st_mode=S_IFDIR|0755, st_size=400, ...}) = 0
lstat("/run/systemd/generator.late", {st_mode=S_IFDIR|0755, st_size=440, ...}) = 0
lstat("/run/systemd/generator.late/sys-subsystem-net-devices-eth0.device.d", 0x7ffec2dd0540) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/run/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/run/systemd/generator/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/local/lib/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/systemd/system/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/run/systemd/generator.late/sys-subsystem-net-devices-eth0.device.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
sendmsg(13, {msg_name(0)=NULL, msg_iov(2)=[{"l\2\1\1\10\0\0\0\4\0\0\0\17\0\0\0\5\1u\0\3\0\0\0\10\1g\0\1v\0\0", 32}, {"\1b\0\0\0\0\0\0", 8}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 40
epoll_wait(4, {{EPOLLIN, {u32=3, u64=3}}}, 33, -1) = 1
clock_gettime(CLOCK_BOOTTIME, {2508, 883466886}) = 0
read(29, "\1\0\0\0\0\0\0\0", 8)         = 8
timerfd_settime(29, TFD_TIMER_ABSTIME, {it_interval={0, 0}, it_value={2509, 383043000}}, NULL) = 0
epoll_wait(4, {{EPOLLIN, {u32=3, u64=3}}}, 33, -1) = 1

So what we have here is the acceptance of the connection, followed by sending and receiving messages like those captured with dbus-monitor above, but nothing meaningful was executed: There was no productive system call, and systemd didn’t fork. But we can also see which files (actually, directories) systemd was looking for, and didn’t find: It really wanted to find some sys-subsystem-net-devices-eth0.device.d in one of the famous paths. Not that it matters so much, though.
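Picking the failed lookups out of a long trace by eye gets tedious. Here is a small sketch that lists each path for which a system call returned ENOENT, in order of first appearance. It assumes one system call per line, as in the dump above, and doesn’t handle paths containing escaped quotes. The function name missing_paths is made up.

```python
import re

# A call whose first string argument is a path and whose result is ENOENT,
# e.g. lstat("/some/path", 0x...) = -1 ENOENT (No such file or directory)
ENOENT_CALL = re.compile(r'^\w+\((?:AT_FDCWD, )?"(?P<path>[^"]+)".*= -1 ENOENT')


def missing_paths(trace_text):
    """Return the paths that were looked up and not found, without repeats."""
    seen = []
    for line in trace_text.splitlines():
        m = ENOENT_CALL.match(line.strip())
        if m and m["path"] not in seen:
            seen.append(m["path"])
    return seen
```

The result is handy for comparing two machines: if both lists are the same, as they were for me, the missing directories aren’t the reason for the difference in behavior.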

By contrast and for example, if “systemctl start atd” is launched and atd is not already running, systemd (as process 1) forks into another process and calls execve(“/usr/sbin/atd”) on the forked process (after a whole lot of cgroup stuff, closing files etc.). If the same systemctl command is called with the atd service already running, there is no such fork (not surprisingly, systemd does nothing when attempting to start an already started service).
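To see quickly whether systemd launched anything at all, the per-process files that strace writes with -ff (named after the -o argument plus a dot and the PID) can be scanned for execve calls. This is a sketch only. The function name executed_programs is hypothetical, and it looks at the first argument of execve and nothing else.

```python
import glob
import re

EXECVE = re.compile(r'^execve\("(?P<path>[^"]+)"')


def executed_programs(prefix):
    """Map PID -> list of programs passed to execve, from files <prefix>.<PID>."""
    result = {}
    for filename in sorted(glob.glob(glob.escape(prefix) + ".*")):
        suffix = filename[len(prefix) + 1:]
        if not suffix.isdigit():
            continue  # not one of strace's per-process files
        with open(filename, errors="replace") as trace:
            paths = [m["path"] for m in map(EXECVE.match, trace) if m]
        if paths:
            result[int(suffix)] = paths
    return result
```

With the strace command above, calling executed_programs("systemd-trace") after a successful “systemctl start atd” should show an entry for /usr/sbin/atd, while the stuck device unit leaves the result empty.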

For the record, the failed lookup of directories isn’t the problem: I had the luxury of trying exactly the same on a machine that doesn’t get stuck on starting sys-subsystem-net-devices-eth0.device, and the strace looked the same. Except that the systemd job was terminated immediately and successfully, rather than getting stuck.

For me, this was the moment I realized that this unit shouldn’t be started at all on the system it gets stuck on.

Checking udev

If a boot problem is related to a device, maybe something went wrong with the device’s bringup, which in turn prevented the relevant .device unit from becoming active, and then some other unit waits for it…?

So what is running when eth0 is detected?

# udevadm test /sys/class/net/eth0
calling: test
version 215
This program is for debugging only, it does not run any program
specified by a RUN key. It may show incorrect results, because
some values may be different, or not available at a simulation run.

load module index
Network interface NamePolicy= disabled on kernel commandline, ignoring.
timestamp of '/etc/systemd/network' changed
timestamp of '/lib/systemd/network' changed
Parsed configuration file /lib/systemd/network/99-default.link
Created link configuration context.
timestamp of '/etc/udev/rules.d' changed
read rules file: /lib/udev/rules.d/42-usb-hid-pm.rules
read rules file: /lib/udev/rules.d/50-bluetooth-hci-auto-poweron.rules
read rules file: /lib/udev/rules.d/50-firmware.rules
read rules file: /lib/udev/rules.d/50-udev-default.rules
read rules file: /lib/udev/rules.d/55-dm.rules

[ ... ]

read rules file: /etc/udev/rules.d/90-local-imagedisk.rules
read rules file: /lib/udev/rules.d/95-cd-devices.rules
read rules file: /lib/udev/rules.d/95-udev-late.rules
read rules file: /lib/udev/rules.d/97-hid2hci.rules
read rules file: /lib/udev/rules.d/99-systemd.rules
rules contain 393216 bytes tokens (32768 * 12 bytes), 23074 bytes strings
21081 strings (168928 bytes), 18407 de-duplicated (148529 bytes), 2675 trie nodes used
NAME 'eth0' /etc/udev/rules.d/70-persistent-net.rules:2
IMPORT builtin 'net_id' /lib/udev/rules.d/75-net-description.rules:6
IMPORT builtin 'hwdb' /lib/udev/rules.d/75-net-description.rules:12
IMPORT builtin 'path_id' /lib/udev/rules.d/80-net-setup-link.rules:5
IMPORT builtin 'net_setup_link' /lib/udev/rules.d/80-net-setup-link.rules:11
Config file /lib/systemd/network/99-default.link applies to device eth0
RUN 'net.agent' /lib/udev/rules.d/80-networking.rules:1
RUN '/lib/systemd/systemd-sysctl --prefix=/proc/sys/net/ipv4/conf/$name --prefix=/proc/sys/net/ipv4/neigh/$name --prefix=/proc/sys/net/ipv6/conf/$name --prefix=/proc/sys/net/ipv6/neigh/$name' /lib/udev/rules.d/99-systemd.rules:61
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/net/eth0
ID_BUS=pci
ID_MODEL_FROM_DATABASE=RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (Motherboard)
ID_MODEL_ID=0x8168
ID_NET_DRIVER=r8169
ID_NET_NAME_MAC=enx408d5c4d1b15
ID_NET_NAME_PATH=enp1s0
ID_PATH=pci-0000:01:00.0
ID_PATH_TAG=pci-0000_01_00_0
ID_PCI_CLASS_FROM_DATABASE=Network controller
ID_PCI_SUBCLASS_FROM_DATABASE=Ethernet controller
ID_VENDOR_FROM_DATABASE=Realtek Semiconductor Co., Ltd.
ID_VENDOR_ID=0x10ec
IFINDEX=2
INTERFACE=eth0
SUBSYSTEM=net
SYSTEMD_ALIAS=/sys/subsystem/net/devices/eth0
TAGS=:systemd:
USEC_INITIALIZED=20767
run: 'net.agent'
run: '/lib/systemd/systemd-sysctl --prefix=/proc/sys/net/ipv4/conf/eth0 --prefix=/proc/sys/net/ipv4/neigh/eth0 --prefix=/proc/sys/net/ipv6/conf/eth0 --prefix=/proc/sys/net/ipv6/neigh/eth0'
unload module index
Unloaded link configuration context.

Note that the “systemd” tag is in place, and so is the SYSTEMD_ALIAS assignment. So there’s probably no reason udev-wise why there was no .device unit activated.
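
The two things checked above can also be picked out of the output mechanically. This is a minimal sketch, assuming the property lines look exactly like in the listing above (TAGS=:systemd: and SYSTEMD_ALIAS=...). The function name is made up for this example.

```shell
# Hypothetical helper: reads "udevadm test" style KEY=value lines on stdin
# and reports whether the device carries the systemd tag and an alias.
# Exit status: 0 if the systemd tag is present, 1 otherwise.
check_systemd_tag() {
    awk -F= '
        $1 == "TAGS" && $2 ~ /:systemd:/ { tagged = 1 }
        $1 == "SYSTEMD_ALIAS"            { al = $2 }
        END {
            if (tagged) print "systemd tag: present"
            else        print "systemd tag: MISSING"
            if (al != "") print "alias: " al
            else          print "alias: none"
            exit(tagged ? 0 : 1)
        }'
}
```

Something like "udevadm test /sys/class/net/eth0 2>&1 | check_systemd_tag" should then boil the long listing down to two lines.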

hwdb.bin

Update the file /etc/udev/hwdb.bin:

# udevadm hwdb --update

Note that it doesn’t touch /lib/udev/hwdb.bin, and I’m unclear how they interact, if at all (“wild” guess: /etc/udev/hwdb.bin overrules the one in /lib/udev, if it exists).

On newer systems it appears to be “systemd-hwdb update”.

↧

Linux: Writing a Windows MBR instead of grub

June 13, 2019, 11:22 pm

On a test computer, I use /dev/sda1 to contain whatever operating system I need for the moment. At some point, I installed Linux Mint 19.1 properly on that partition, and then I wanted to return to Windows 10. After writing the Windows 10 image to /dev/sda1, I got a message from grub saying it didn’t detect the filesystem.

Hmmm… So the MBR was overwritten by GRUB, and now I need to get it back to Windowish. One can use Microsoft’s rescue disk-on-key, or the quick hack: Download ms-sys, compile with plain “make”, don’t bother to install, and just go from the bin/ directory:

# ./ms-sys -7 /dev/sda

and Windows 10 boots like a charm.

↧

ImageMagick convert to scale jpgs

June 21, 2019, 12:28 am

Instead of using my scanner, I put my cell phone on some kind of stand, and shot a lot of paper documents (voice activation is a blessing). But then the files are unnecessarily large. Don’t need all that resolution. So

$ for i in * ; do convert "$i" -scale '33%' -quality 75 "smaller/scan_$i" ; done

And the files are 100-200k each with enough resolution to see the fine print.
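
A slightly more defensive variant of the loop above, as a sketch: it creates the smaller/ directory if it’s missing, and skips anything that isn’t a regular file (a plain * also matches the smaller directory itself). The CONVERT variable is my own addition, there only so the loop can be dry-run with echo.

```shell
# Sketch: scale every regular file in the current directory into smaller/.
# CONVERT is a made-up override so the loop can be dry-run with "echo".
shrink_scans() {
    mkdir -p smaller || return 1
    for i in * ; do
        [ -f "$i" ] || continue     # skips the smaller/ directory itself
        "${CONVERT:-convert}" "$i" -scale '33%' -quality 75 "smaller/scan_$i"
    done
}
```

Running "CONVERT=echo shrink_scans" first prints the commands that would be executed, without touching any image.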

↧

VIA VL805 USB 3.0 PCIe adapter: Forget about Linux

July 15, 2019, 10:16 pm

TL;DR

Bought an Orico PCIe adapter for USB 3.0 (PVU3-5O2I) for testing a USB device I’m developing. It has the VL805 chipset (1106/3483) which isn’t xHCI compliant. So it works only with the vendor’s own drivers for Windows, which you’ll have to struggle a bit to install.

Attempt with Linux

Note that the device is detected by its class (xHCI), and not by its Vendor / Product IDs.

The following was found in the kernel log while booting:

[    0.227014] pci 0000:03:00.0: [1106:3483] type 00 class 0x0c0330
[    0.227042] pci 0000:03:00.0: reg 0x10: [mem 0xdf000000-0xdf000fff 64bit]
[    0.227104] pci 0000:03:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[    0.227182] pci 0000:03:00.0: System wakeup disabled by ACPI

and

[    0.325254] pci 0000:03:00.0: xHCI HW did not halt within 16000 usec status = 0x14

and then

[    1.474178] xhci_hcd 0000:03:00.0: xHCI Host Controller
[    1.474421] xhci_hcd 0000:03:00.0: new USB bus registered, assigned bus number 3
[    1.505919] xhci_hcd 0000:03:00.0: Host not halted after 16000 microseconds.
[    1.506066] xhci_hcd 0000:03:00.0: can't setup: -110
[    1.506241] xhci_hcd 0000:03:00.0: USB bus 3 deregistered
[    1.506494] xhci_hcd 0000:03:00.0: init 0000:03:00.0 fail, -110
[    1.506640] xhci_hcd: probe of 0000:03:00.0 failed with error -110

The error message comes from xhci_halt() defined in drivers/usb/host/xhci.c, and doesn’t seem to indicate anything special, except that the hardware doesn’t behave as expected.

Update firmware, maybe?

The idea was to try updating the firmware on the card. Maybe that will help?

So I downloaded the driver from the manufacturer and the firmware fix tool from Station Drivers.

Ran the firmware fix tool before installing the driver on Windows. It went smoothly. Then power-cycled the computer completely and booted Linux again (the instructions require that). Exactly the same error as above.

Went for Windows again, ran the firmware update tool, and this time read the firmware revision. It was indeed 013704, as it should be. So this doesn’t help.

Install driver on Windows 10

Checking in the Device Manager, the card was found as an xHCI controller, but with the “device cannot start (Code 10)” error. In other words, Windows’ xHCI driver didn’t like it either.

Attempted installation of the driver. Failed with “Sorry, the install wizard can’t find the proper component for the current platform. Please press OK to terminate the install Wizard”. What it actually means is that the installation software (just downloaded from the hardware vendor) hasn’t heard about Windows 10, and could therefore not find an appropriate driver.

So I found the directory to which the files were extracted, somewhere under C:\Users\{myuser}\AppData\Local\Temp\is-VJVK5.tmp, and copied the USetup directory from there. Then selected xhcdrv.inf for driver installation. It’s intended for Windows 7, but it so happens that drivers for Windows 7 and Windows 10 are generally the same. It’s the installer that was unnecessarily fussy.

After installing this driver, a “VIA USB eXtensible Host Controller” entry appeared in the USB devices list of the Device Manager, and it said it works properly.

After a reboot, there was “xHCI Root Hub 0” under “Other Devices” of the Device Manager, with the error message “The drivers for this device are not installed”. The driver for it was available under the same USetup directory (ViaHub3.inf).

This added “VIA USB 2 Hub” and “VIA USB 3 Root Hub” to the list of USB devices, and believe it or not, the card started working.

Bottom line: It does work with its own very special drivers for Windows, with a very broken setup procedure.

↧

Linux: Atheros QCA6174’s Bluetooth disappearing after reboot

July 21, 2019, 3:26 am

When Bluetooth goes poof

Having rebooted my computer after a few months of continuous operation, I suddenly failed to use my Bluetooth headphones. It took some time to figure out that the problem wasn’t with the Cinnamon 3.8.9 fancy stuff, nor the DBus interface, which produced error messages. There was simply no Bluetooth device in the system to talk to.

Prior to this mishap, my Atheros QCA6174 had worked flawlessly and reliably for several months, both as a Wifi adapter and a Bluetooth adapter.

For the record, I have a Linux Mint 19 Tara machine with 4.15.0-20-generic kernel on an X299 AORUS Gaming 7 motherboard, running in 64 bit mode of course.

I’ll jump to the spoiler: If you happen to have a Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter on your machine, never just reboot the machine: Shut down the computer completely, and disconnect main power for a minute or so with the power supply’s switch. Just powering off the computer the normal way isn’t enough. The device probably continues to get power from the motherboard when the computer is off by virtue of its own power control.

Powering off the computer this way is what solved it for me. However there are also some rumors on the web, which I can’t confirm, about Bluetooth coming back to life after loading Windows on the same computer. Or turning Bluetooth off and on again with the BIOS. My guess is that due to a bug, the chip sometimes needs some kind of tickle on shutdown or when starting, or Bluetooth is lost. Something that is worked around with a hush-hush fix in the driver for Windows, but the Linux driver doesn’t do the same.

This post goes down to the gory details, partly for the sake of quick diagnostics in the future, and partly because Bluetooth tends to be a mystery thing. So I’m trying to give an idea of what’s going on.

Why it’s confusing

Here’s the thing: The Qualcomm QCA6174 connects to the motherboard as a PCIe device as a Wifi adapter, and to the USB bus as a Bluetooth device. Sounds weird, but that’s the way it is. Bluetooth has been wonky for 20 years, and that’s probably its destiny.

So there are wires going from the QCA6174 to the PCIe bus and other wires from the same device going to one of the ports of the motherboard’s USB root hub (see details below). On my specific motherboard, the Bluetooth interface of the QCA6174 is connected to port 13 of the root hub, that facilitates the motherboard’s physical USB connector at the back of the computer. So while designing the board, they wired some of the D+/D- wires to the physical ports at the back, and a couple of those go to the QCA6174. I’m saying this over and over again, because it’s so counterintuitive.

Counterintuitive, but it seems like it’s quite common. Intel’s AC 7260 Wifi / Bluetooth combo seems to do exactly the same thing.

As a PCIe device, QCA6174 has Vendor / Product IDs 168c:003e. As a USB device, it’s 0cf3:e300. Confusing? It won’t surprise me if the Wifi and Bluetooth interfaces are two independent units on the same chip, that happen to share an antenna.

Apparently, when the QCA6174 has a bad day, the PCIe interface wakes up properly, and USB doesn’t. The result is that the Wifi works fine, but the Bluetooth is absent.

To add some confusion, the kernel source’s drivers/net/wireless/ath/ath10k/usb.c matches USB device 13b1:0042, which is indeed a Linksys device (the comment in the code says Linksys WUSB6100M). Not clear why it’s there.

On the other hand, drivers/bluetooth/btusb.c matches a whole range of Atheros USB devices, among others 0cf3:e300, calling it “QCA ROME” in the comments. So it’s the btusb module that takes care of QCA6174’s Bluetooth interface, not anything in ath/ath10k. Cute, isn’t it?

What it looks like when it works

When trying to figure out what’s wrong, it helps to know what it looks like when it’s OK. So below is a lot of info that was collected when I got the Bluetooth up and running.

When it failed, everything looked exactly the same in relation to the device’s PCIe interface, but there was absolutely nothing related to USB and Bluetooth: No entry for the device in lsusb, hciconfig nor rfkill, as shown below.

Kernel log output on behalf of the device, as connected to the PCIe bus. Note that the exact same logs appeared when the Bluetooth device was absent. Exactly-exactly. Down to the single character, I’ve compared them. So this isn’t really relevant, but anyhow:

[    0.126428] pci 0000:03:00.0: [168c:003e] type 00 class 0x028000
[    0.126456] pci 0000:03:00.0: reg 0x10: [mem 0x92800000-0x929fffff 64bit]
[    0.126555] pci 0000:03:00.0: PME# supported from D0 D3hot D3cold

[ ... ]

[   17.616738] ath10k_pci 0000:03:00.0: enabling device (0000 -> 0002)
[   17.617514] ath10k_pci 0000:03:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0

[ ... ]

[   17.915091] ath10k_pci 0000:03:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:03:00.0.bin failed with error -2
[   17.915109] ath10k_pci 0000:03:00.0: Direct firmware load for ath10k/cal-pci-0000:03:00.0.bin failed with error -2
[   17.926172] ath10k_pci 0000:03:00.0: qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1a56:1535
[   17.926173] ath10k_pci 0000:03:00.0: kconfig debug 0 debugfs 1 tracing 1 dfs 0 testmode 0
[   17.926505] ath10k_pci 0000:03:00.0: firmware ver WLAN.RM.4.4.1-00124-QCARMSWPZ-1 api 6 features wowlan,ignore-otp crc32 d8fe1bac
[   18.078191] ath10k_pci 0000:03:00.0: board_file api 2 bmi_id N/A crc32 506ce037

[ ... ]

[   18.642461] ath10k_pci 0000:03:00.0: Unknown eventid: 3
[   18.658195] ath10k_pci 0000:03:00.0: Unknown eventid: 118809
[   18.661096] ath10k_pci 0000:03:00.0: Unknown eventid: 90118
[   18.661772] ath10k_pci 0000:03:00.0: htt-ver 3.56 wmi-op 4 htt-op 3 cal otp max-sta 32 raw 0 hwcrypto 1

Note that two attempts to load firmware failed, but apparently the third went OK. Don’t let these error messages mislead you: The kernel messages in this respect were the same when the Bluetooth appeared and when it didn’t.

The “Unknown eventid” may appear more than once.

Its entry with plain lspci (unrelated entries removed):

$ lspci
03:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)

And now to the parts that were missing completely when the Bluetooth device didn’t appear: The logs on behalf of the device, connected to the USB bus:

[    3.764868] usb 1-13: new full-speed USB device number 12 using xhci_hcd
[    3.913930] usb 1-13: New USB device found, idVendor=0cf3, idProduct=e300
[    3.915610] usb 1-13: New USB device strings: Mfr=0, Product=0, SerialNumber=0

Plain lsusb:

$ lsusb
[ ... ]
Bus 001 Device 012: ID 0cf3:e300 Atheros Communications, Inc.
[ ... ]

lsusb, tree view (a lot of irrelevant stuff excluded):

$ lsusb -t
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/16p, 480M
    |__ Port 13: Dev 12, If 1, Class=Wireless, Driver=btusb, 12M
    |__ Port 13: Dev 12, If 0, Class=Wireless, Driver=btusb, 12M
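
Since the lsusb entry is the first thing that goes missing, checking for it can be scripted. A minimal sketch, assuming the plain lsusb line format shown above. The function name is made up, and the ID is the one my adapter happens to have.

```shell
# Hypothetical helper: succeeds if the given vendor:product pair shows up
# in plain "lsusb" output supplied on stdin.
usb_id_present() {
    grep -Eqi "ID $1( |\$)"
}
```

For example, "lsusb | usb_id_present 0cf3:e300 || echo Bluetooth is gone again" tells quickly whether it’s time for a real power-off.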

How to check if a Bluetooth device is present

There is no device file for a Bluetooth interface, exactly as there’s none for a network interface. Like there’s eth0 for Ethernet, there’s hci0 for Bluetooth.

hciconfig grabs the info by opening a socket. As in the relevant strace:

socket(PF_BLUETOOTH, SOCK_RAW, 1)

So this is what it looked like with the Bluetooth device present (without it, hciconfig simply prints nothing):

$ hciconfig
hci0:	Type: Primary  Bus: USB
	BD Address: xx:xx:xx:xx:xx:xx  ACL MTU: 1024:8  SCO MTU: 50:8
	UP RUNNING PSCAN ISCAN
	RX bytes:1386 acl:0 sco:0 events:94 errors:0
	TX bytes:5494 acl:0 sco:0 commands:94 errors:0

Real hex numbers appear instead of the xx’s above. Use hciconfig -a for more verbose output.

And the device appears in the rfkill list, and shouldn’t be blocked.

$ rfkill list
0: hci0: Bluetooth
	Soft blocked: no
	Hard blocked: no
1: phy0: Wireless LAN
	Soft blocked: no
	Hard blocked: no

These two show that the kernel supplies a Bluetooth device to the higher software levels. If Bluetooth doesn’t work, there are other reasons…
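
The rfkill part of this check can be automated as well. A sketch only, assuming the output format of “rfkill list” is as shown above. The function name is mine.

```shell
# Hypothetical helper: reads "rfkill list" output on stdin and prints a
# line for every Bluetooth entry that is soft or hard blocked.
# Exit status: 0 if at least one Bluetooth entry exists, 1 if there's none.
bt_rfkill_status() {
    awk '
        /^[0-9]+:/ {                      # e.g. "0: hci0: Bluetooth"
            is_bt = ($0 ~ /Bluetooth/)
            name = $2; sub(/:$/, "", name)
            if (is_bt) found = 1
            next
        }
        is_bt && /blocked: yes/ { print name ": " $1 " blocked" }
        END { exit(found ? 0 : 1) }'
}
```

With “rfkill list | bt_rfkill_status”, no output and a zero exit status means that a Bluetooth device exists and isn’t blocked. A non-zero exit status means there’s no Bluetooth entry at all, which is what I had when the device disappeared.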

↧

Linux: Command-line utilities for obtaining information

July 21, 2019, 3:32 am

There are many ways to ask a Linux machine how it’s doing. I’ve collected a few of them, mostly for my own reference. I guess I’ll add more items as I run across new ones.

General Info

inxi -Fxxxz (neat output, but makes the system send me security “password required” alert mails because of attempts to execute hddtemp).
hwinfo
lshw
Temperature and fans: sensors

Status

Logs: journalctl and dmesg
systemctl, with all its variants
Network: ifconfig
Wifi: iwconfig
Bluetooth: hciconfig
CPU: lscpu
PCI bus: lspci
USB: lsusb
RAID: mdadm --detail /dev/md0

Operating system

List open files: lsof
Block devices and partitions: blkid and lsblk
List namespaces: lsns
List loaded kernel modules: lsmod
List locks: lslocks

↧

apt / dpkg: Ignore error in post-install script

July 27, 2019, 10:02 am

You have been warned

This post shows how to cripple the installation of a Debian package, and make the system think it went through OK. This might very well bring your system’s behavior towards the exotic, unless you know perfectly what you’re doing.

In a few cases, like the one shown below, it might actually be a good idea.

Introduction

Sometimes the post-installation script of Debian packages fails for a good reason. So good that one wants to ignore the failure, and mark the package as installed, so its dependencies are in place. And so apt stops nagging about it.

On my Linux Mint 19 machine, this happened to me with grub-efi-amd64: Its installation involves updating something in /boot, which is mounted read-only, exactly for that reason: The system boots perfectly, why fiddle?

And indeed, it’s not fully installed (note the iF part):

$ dpkg -l grub-efi-amd64
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                                  Version                         Architecture                    Description
+++-=====================================================-===============================-===============================-================================================================================================================
iF  grub-efi-amd64                                        2.02-2ubuntu8.13                amd64                           GRand Unified Bootloader, version 2 (EFI-AMD64 version)
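
To find all packages that are in such a state, the status column can be filtered. A minimal sketch, assuming the “dpkg -l” layout shown above (a header that ends with the +++- line, then one package per line). The function name is made up.

```shell
# Hypothetical helper: reads "dpkg -l" output on stdin and prints the
# status and name of every package whose status column isn't "ii".
dpkg_not_ok() {
    awk '
        body && NF >= 2 && $1 != "ii" { print $1, $2 }
        /^\+\+\+-/                    { body = 1 }   # end of the header
    '
}
```

Run as “dpkg -l | dpkg_not_ok”. Note that this also lists packages that were removed with their configuration files left behind (“rc”), which is harmless.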

There must be an easy way around it…?

OK, so how about giving it a push?

# dpkg --configure --force-all grub-efi-amd64
Setting up grub-efi-amd64 (2.02-2ubuntu8.13) ...
Installing for x86_64-efi platform.
grub-install: error: cannot delete `/boot/grub/x86_64-efi/lsacpi.mod': Read-only file system.
Failed: grub-install --target=x86_64-efi
WARNING: Bootloader is not properly installed, system may not be bootable
cp: cannot create regular file '/boot/grub/unicode.pf2': Read-only file system
dpkg: error processing package grub-efi-amd64 (--configure):
 installed grub-efi-amd64 package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 grub-efi-amd64

Not only didn’t it work, but this error message appears every time I try installing anything with apt. It will always attempt to finish that installation. And fail.

After quite some googling, I’m convinced that there’s no way to tell apt or dpkg to ignore a failed post-installation. Not with a million warnings and confirmations. Nothing. The post installation must succeed, by hook or by crook. The packaging machinery simply won’t register a package as fully installed otherwise.

The ugly fix

So that leaves us with the crook. Luckily, the script in question is in a known place, waiting to be edited:

# vi /var/lib/dpkg/info/grub-efi-amd64.postinst

For any other package, just replace the grub-efi-amd64 with what you have.

And just add an “exit 0” as shown below. This makes the script return with success, without doing anything. You may want to examine what it would do, possibly perform some of the operations manually etc. But anyhow, it’s just this:

#!/bin/bash
set -e

exit 0;

[ ... ]

And then try again:

# dpkg --configure grub-efi-amd64
Setting up grub-efi-amd64 (2.02-2ubuntu8.13) ...

Like a charm, of course. And now the package is happily installed:

$ dpkg -l grub-efi-amd64
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================
ii  grub-efi-amd64 2.02-2ubuntu amd64        GRand Unified Bootloader, version

And don’t forget to remove that edit afterwards. Or possibly it might cause issues in the future…?
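
To make removing the edit easier, the same trick can be done with a backup copy of the original script. This is just a sketch: The function name and the backup location are made up, and it puts the “exit 0” right after the first line (the shebang) rather than after “set -e”, which has the same effect.

```shell
# Sketch: put "exit 0" right after the first line of a maintainer script,
# keeping a copy of the original in a backup file chosen by the caller.
bypass_postinst() {
    script=$1 backup=$2
    [ -f "$script" ] || return 1
    if [ -e "$backup" ]; then return 1; fi   # never clobber an earlier backup
    cp -p "$script" "$backup" || return 1
    # Rewrite the script itself, so its permissions remain as they were.
    awk 'NR == 1 { print; print "exit 0"; next } { print }' \
        "$backup" > "$script"
}
```

For example, “bypass_postinst /var/lib/dpkg/info/grub-efi-amd64.postinst /root/grub-efi-amd64.postinst.orig” as root, then “dpkg --configure” as above, and finally copy the backup file back to its original place with “cp -p”.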

↧

Making a snapshot of a full Ubuntu / Mint repository on the local disk

July 27, 2019, 10:05 am

What’s that good for?

This isn’t about maintaining a local repository that mirrors its original, following along with its changes. The idea is to avoid the upgrades of a lot of packages every time I want to install a new one with apt. Maybe I should mention that I don’t allow automatic upgrades on my machine? Exactly like I don’t leave my car keys to the mechanic, so he can make any fixes he considers would make my car better. Every day.

Before I get into the technical details, I’ll have my say on the culture of upgrading everything all the time. Just in case someone with influence on the matter reads this. Or maybe someone ready to maintain a non-updating mirror…?

The way packaging is made today is that each package requires the latest-latest dependencies just because they happened to exist, not because they’re needed. I mean, forcing an unnecessary upgrade of other packages is fine, because how could upgrading be wrong? Or go wrong?

But upgrading is good…?

Most people believe upgrading software is generally good. Personally, I don’t. Every now and then, an upgrade breaks something that worked, and even seemingly harmless upgrades of minor pieces of software can force me into a session of debugging my system. It may very well be that the upgrade rectified something that was wrong before. But this way or another, my computer worked before, and after the upgrade it didn’t. As has already been said:

If it ain’t broke, don’t fix it.

Upgrades sometimes involve security fixes. Staying with old versions is considered neglecting your security. This may be true when the computer is a server or a multiuser machine, and strangers are allowed to do this and that with it. However when it comes to single-user desktops that are properly set up (plus a firewall), the risk for an upfront security exploit is rather minimal. I always ask people when they last heard about a personal Linux desktop being compromised by virtue of a vulnerability, and nobody can come up with such a case.

I’ve also discussed this issue with several guys who are responsible for major Linux systems, for which a downtime means real damage, and stability is important. The typical conversation contains an apology for using old distributions and old software, and them reassuring me that they understand that upgrading is important. It’s just that in their specific case they have to stick to a certain, old Linux distribution to keep the system running continuously.

So it’s a risk management question: The risk of having the computer messed up by an upgrade (with probability converging to 1 as time increases) vs the probability of the desktop being attacked (probability not known, as no such event is known to me). Given that I fix significant security issues by other means, as they occur.

So my decision is clear, and here’s how to do it.

apt-mirror

All said below relates to Linux Mint 19, but most likely applies to a wide range of Debian-based distros.

apt-mirror is a cleverly written Perl script that mirrors selected Debian repositories into the local disk. It’s one of those utilities that simply do the job with the practical, real-life details taken care of correctly. In the proper Perl spirit, in short.

In essence:

Install apt-mirror with a plain apt command.
Change the ownership of the directory to which the packages go (given as base_path in the config file) to apt-mirror.
Don’t set up the cron job, as we’re not into having it updated (possibly delete /etc/cron.d/apt-mirror).

Set up mirror.list

The default /etc/apt/mirror.list is generally fine, with nthreads set to 20 by default, which is OK.

You may want to set base_path in /etc/apt/mirror.list to something other than the default.

Then copy all repositories listed in /etc/apt/sources.list.d into mirror.list. This is just copying the lines beginning with “deb” as is.

Well, probably not. If you’re running on a 64-bit machine (is there anyone not?), set /etc/apt/mirror.list to download packages for amd64 and i386. This will grow the disk consumption from 140 GB to 193 GB (YMMV), but sometimes these i386 packages are handy.

For this to work, each line must appear twice. So if the original “deb” line said

deb http://packages.linuxmint.com tara main upstream import backport

these two should appear in mirror.list:

deb-i386 http://packages.linuxmint.com tara main upstream import backport
deb-amd64 http://packages.linuxmint.com tara main upstream import backport

Otherwise apt-mirror downloads only the packages for the current arch. One can also add a deb-src for the mirror repository as well, if desired.
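
The doubling of lines is easy to automate. A minimal sketch, assuming plain “deb” lines without any [options] part. The function name is made up.

```shell
# Sketch: turn each "deb ..." line on stdin into a deb-i386 / deb-amd64
# pair for mirror.list. Other lines (comments, deb-src) are dropped.
deb_lines_for_mirror() {
    awk '$1 == "deb" {
        rest = $0
        sub(/^[ \t]*deb[ \t]+/, "", rest)
        print "deb-i386 " rest
        print "deb-amd64 " rest
    }'
}
```

So something like “cat /etc/apt/sources.list.d/*.list | deb_lines_for_mirror” produces the lines to paste into mirror.list.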

Running apt-mirror

Run apt-mirror as root with

# su - apt-mirror -c apt-mirror

The cool part with running it this way is that if you go CTRL-C (you want to do that, apt-mirror runs forever on the first attempt, downloading ~200 GiB or so), the child processes (a lot of wgets) are killed gracefully as well.

How it works: First it does some crunching of the repositories’ metadata. After a short while, it generates 20 processes, each for downloading a list of URLs:

wget --no-cache --limit-rate=100m -t 5 -r -N -l inf -o /opt/apt-mirror/var/archive-log.0 -i /opt/apt-mirror/var/archive-urls.0

all of which run in /opt/apt-mirror/mirror, which is the target for the files.

And then it just waits. The output shown on the console (like “[20]…”) is the number of processes still running.

Configure apt to use local repositories only

First of all, move the “mirror” subdirectory away to some other place, so it’s out of sight for apt-mirror. No more updates. For example, into /var/local-apt-repo. I also suggest changing its owner to root at this stage:

# chown -R root:root local-apt-repo/

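Then edit apt’s repository lists (the files in /etc/apt/sources.list.d) so that they point at the local copy. So a line saying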

deb http://packages.linuxmint.com tara main upstream import backport

changes into (if the repository is kept in /var/local-apt-repo/)

deb file:///var/local-apt-repo/packages.linuxmint.com tara main upstream import backport
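
This rewrite of the “deb” lines is a plain substitution, so it can be left to sed. A sketch, assuming that the local directory has one subdirectory per host, as apt-mirror leaves it, and that the path contains no # or & characters. The function name is mine.

```shell
# Sketch: rewrite "deb http://host/..." lines on stdin so they point at a
# local copy. $1 is the directory holding the per-host subdirectories.
localize_deb_lines() {
    sed -E "s#^(deb[[:space:]]+)https?://#\\1file://${1%/}/#"
}
```

For example, “localize_deb_lines /var/local-apt-repo < some-file.list” prints the modified lines, which can be reviewed before replacing the original file.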

and make apt aware of the change:

# apt clean
# apt update

Verify that only local files are accessed (it prints out the paths) and that there are no errors. Those opting out of downloading the i386 repositories as well will get a lot of error messages at the end, like

E: Failed to fetch file:/var/local-apt-repo/packages.linuxmint.com/dists/tara/main/binary-i386/Packages  File not found - /var/local-apt-repo/packages.linuxmint.com/dists/tara/main/binary-i386/Packages (2: No such file or directory)

I suppose it’s harmless, except that there will be no i386 packages to work with, but I don’t really know, as I went for downloading packages for both archs.

And then comes the last session of upgrades. At a convenient time for tackling possible upgrade side effects, go

# apt list --upgradable

and then try to upgrade packages in small chunks, so that the changes each make can be tracked (in particular if you have a git repo on /etc, like myself), with

# apt install --only-upgrade package-name

For convenience, package-name may include wildcards with * (use single quotes or escape it with backslash, i.e. \*).

After all this is done, fix whatever broke because of all the upgrades. In my case I lost graphics acceleration on my NVidia card, solved by manually reinstalling the drivers as originally downloaded from the vendor. Just to remind me why I’m doing all this.

If apt-file is also installed (it’s a good idea to have it), this is also a good time to go

# apt-file update

How and why the local repo is self-contained

It’s not worth much to take a snapshot that can’t be relied upon forever. The fact that the download process typically takes a few days, most likely with several interruptions in the middle, doesn’t contribute to the feeling of reassurance. I ran my downloads on nights only, for example (hey, I want a decent internet connection during the day).

This is solved in a surprisingly simple manner: The Packages files contain the list of files required for constituting a self-contained repository. apt-mirror first downloads these files, then it downloads the files they list, and only then updates the Packages files in the mirror. At all times, all required files are in place.

This is why it’s safe to stop apt-mirror in the middle: Even though the running wget processes will leave some files half-downloaded, they will be fixed on the next run: apt-mirror compares the size of the file on disk with the size declared in the respective Packages file (in its need_update() function). So all files in the Packages files must exist and be of the correct size. This is apt-mirror’s view of a file being in place.

One could also compare the SHA sums of each file in the entire repo. I haven’t found such utility, and I’m not sure it’s worth the effort.
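
For what it’s worth, such a check can be improvised with awk and sha256sum. This is a sketch under a few assumptions: The Packages file is uncompressed, each stanza has a Filename: and a SHA256: field, and the file names contain no spaces. The function name is made up.

```shell
# Sketch: check every file listed in an uncompressed Packages file against
# its SHA256 field. $1 = Packages file, $2 = the directory that the
# "Filename:" paths are relative to. Prints a line per bad or missing file.
verify_packages_sha256() {
    awk '
        $1 == "Filename:" { file = $2 }
        $1 == "SHA256:"   { sum  = $2 }
        NF == 0           { emit() }          # a blank line ends a stanza
        END               { emit() }
        function emit() {
            if (file != "" && sum != "") print sum "  " file
            file = ""; sum = ""
        }' "$1" |
    ( cd "$2" && sha256sum -c --quiet - 2>/dev/null )
}
```

It prints nothing and returns success if all files match. Keep in mind that this reads every single file in the repository, so it takes a while on 200 GB of data.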

It would be somewhat reassuring to run apt-mirror after its completion, and see that it downloads nothing. That doesn’t seem to happen, however. I ended up downloading one archive file of 596 MiB each time (or so apt-mirror said), but then going

$ find /opt/apt-mirror/ -iname \*.deb -cmin -3

found no files. So this was probably only metadata loaded (indeed, dropping the *.deb requirement listed a lot of files).

Reducing wasted disk space

A side effect of the way apt-mirror works is that outdated packages remain in the repository: When apt-mirror is re-run, it makes sure that all files in the current Packages files are downloaded. When a package is updated, a new package file is listed, and the old one just vanishes from the Packages file. But apt-mirror doesn’t delete it, as it’s in the process of updating the repository. The old Packages file is still in effect.

Also, in a real-life mirror scenario, someone could be in the middle of an installation which is based upon several files. So the unnecessary files can only be deleted after the Packages files have been updated (i.e. when apt-mirror finishes) plus the maximal time one could imagine an installation to take. Actually, in a continuously updating web mirror situation, removing a package file will break things for end-users until they run “apt update”. So a real mirror with happy end users should probably not delete files all that often.

Anyhow, apt-mirror creates a clean.sh script in the var/ subdirectory, which deletes all files that aren’t required by the current set of Packages files. It should be executed to get rid of those, when the time is right. Note that the script changes directory to the absolute path to which it downloaded the mirror (so watch out if you move that directory later on).

# su - apt-mirror -c /opt/apt-mirror/var/clean.sh

For this script to be generated, add “clean” lines in mirror.list, like

clean http://packages.linuxmint.com

If there are several “deb” lines for the same host, one “clean” line like the above covers them all.

Another waste of disk space is that security.ubuntu.com contains a lot of packages that are already in other repositories. As this entire repo takes 40 GB (30 GB for amd64 alone), it’s unfortunate. One possibility would be to write a script that scans the directories for identical files (based upon SHA sums) and removes one file, replacing it with a symbolic link. Or maybe get this info from the Packages file. Or, like I did, not bother at all.
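For the record, the scanning half of such a script could look something like the sketch below. It only lists duplicates and deliberately doesn’t touch anything; replacing files with symbolic links is left out. The function name is made up, and it assumes sha256sum from coreutils and file names without newlines or backslashes.

```shell
# Print every file under $1 whose content is identical to a file listed
# earlier (same SHA-256 sum). Nothing is deleted or linked.
list_duplicates() {
    find "$1" -type f -exec sha256sum {} + |
    LC_ALL=C sort |
    awk '{
        hash = $1
        path = substr($0, length($1) + 3)   # skip the hash and its two separator characters
        if (hash == prev) print path " duplicates " first
        else { prev = hash; first = path }
    }'
}

# Hypothetical usage:
#   list_duplicates /opt/apt-mirror/mirror
```

On a repo of this size, hashing everything takes a while, so comparing file sizes first would be a sensible optimization.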

↧

Linux: Permanent graphics mode (resolution) on Cinnamon

August 3, 2019, 9:04 pm

The goal

Quite simple: Set a fixed graphics mode on the computer screen.

More precisely, make Cinnamon (version 3.2.6) on Linux Mint 18.1 (Serena) show the desktop with a predefined resolution, no matter what happens. Spoiler: I failed. But I got close enough for practical purposes, and collected a lot of knowledge while trying. So here it is.

The reason I need this: On the machine mentioned here, I have two screens connected through an HDMI splitter, so the monitor identification is somewhat shaky, and it’s not clear which piece of info the computer gets each time. To make it even trickier, the graphics mode I need is only listed in the EDID information submitted by one of the monitors. In other words: More often than not, the computer doesn’t know it’s allowed to use the mode I want it to use.

This situation meets the somewhat arrogant “I know what’s best, I never fail” attitude often seen in graphics software. There is more than one automatic mechanism for changing the resolution to “what is correct”, so just changing the resolution with xrandr doesn’t cut it. The underlying mechanisms seem to change frequently from one version to another, and having them documented is probably too much to ask for. It seems like there are some race conditions taking place between different utilities that have a say on this matter. Possibly the reason for the problem I tried to solve on this post.

For clarity: EDID is a chunk of data that is typically stored on a small non-volatile memory chip (usually an EEPROM) in the monitor. This data is fetched through I2C wires that are part of a DVI / HDMI / VGA connector when the monitor is plugged in. This is how the computer knows not only the commercial name of the monitor, but also what graphics modes it supports and prefers.

How cinnamon selects the resolution to use

So — the first question is: How does (my very specific) Cinnamon determine which screen resolution is the “correct” one?

This is a journey into the realm of mystery and uncertainty, but it seems like the rationale is to remember previously connected monitors, along with a separate user-selected graphics mode for each.

So the steps are probably something like:

Grab the list of allowed resolution modes, as presented by xrandr, for the relevant monitor (through libxrandr?). This is typically the set of modes listed in the monitor’s EDID information, but it’s possible to add modes as well (see below).
If there’s a user logged in, look up .config/monitors.xml in that user’s home directory. If an entry matches the monitor’s product identification, apply the resolution selected there. This file is changed by Cinnamon’s Display settings utility (among others, I guess), and represents the user’s preferences.
There’s possibly also a global default monitors.xml at /etc/gnome-settings-daemon/xrandr/. I don’t have such a file, and it’s not clear whether it would have any effect if it existed. I haven’t tried this one.
If there’s no matching (or adequate?) mode setting in monitors.xml (or no user logged in), choose the preferred mode, as pointed at by xrandr.

This way or another, monitors.xml only lists width, height and rate for each graphics mode, without the timing details that are required to run it properly. So if the resolution requested in monitors.xml isn’t listed by xrandr, there is no way to request it, as there is crucial information missing. This isn’t supposed to happen ever, since the utility that sets the user’s preferences isn’t supposed to select a mode that the monitor doesn’t support. But if it does, the logical thing would be to ignore the resolution in monitors.xml, and go on with the monitor’s preferred mode. In reality, it appears like this causes the blank screen that I’ve mentioned on this post.

The automatic setting of resolution seems to take place when some kind of X session starts (the login screen and after the user logs in) as well as when a new monitor is hotplugged. Setting a monitor’s mode with xrandr seems to trigger an automatic setting as well sometimes. Having tried to set the resolution with xrandr a few times, it reverts sometimes to the automatic setting, and sometimes it stays with the one I set. Go figure.

How I got it done

Since there are all kinds of ghosts in the system that insist on “fixing” the display resolution, I might as well play along. So the trick is as follows:

Edit ~/.config/monitors.xml (manually), setting the resolution for all monitors listed to the one I want.
Make sure that the desired graphics mode, along with its timing parameters, is listed by xrandr, even if the monitor didn’t mention it in its EDID info.

The first step is relatively easy. The entries in the XML file look like this:

      <output name="HDMI3">
          <vendor>SNY</vendor>
          <product>0x0801</product>
          <serial>0x01010101</serial>
          <width>1360</width>
          <height>768</height>
          <rate>60</rate>
          <x>0</x>
          <y>0</y>
          <rotation>normal</rotation>
          <reflect_x>no</reflect_x>
          <reflect_y>no</reflect_y>
          <primary>yes</primary>
      </output>

This is after editing the file. I needed 1360 x 768 @60 Hz, as shown above. So just set the width, height and rate tags in the XML file for all entries. No matter what monitor the system thinks it sees, the “user preference” is the same.
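I made this edit by hand, but if there are many entries, something like the following sketch could do it in one go. It assumes that each of the three tags sits on a line of its own, as in the excerpt above, and the function name is made up. Keep a backup copy of the file before trying anything like this.

```shell
# Set the same width, height and rate in every <output> entry of a
# monitors.xml file.
#   $1 = path to monitors.xml, $2 = width, $3 = height, $4 = rate
set_monitor_mode() {
    sed -E \
        -e "s|<width>[0-9]+</width>|<width>$2</width>|" \
        -e "s|<height>[0-9]+</height>|<height>$3</height>|" \
        -e "s|<rate>[0-9.]+</rate>|<rate>$4</rate>|" \
        "$1" > "$1.new" && mv "$1.new" "$1"
}

# Hypothetical usage:
#   cp ~/.config/monitors.xml ~/monitors.xml.bak
#   set_monitor_mode ~/.config/monitors.xml 1360 768 60
```

All other tags (position, rotation, primary and so on) are left as they were.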

Now making sure that the mode exists: Add something like the following as /etc/X11/Xsession.d/10add-xrandr-mode (owned by root, not executable, no shebang):

xrandr -d :0 --newmode "hdmi_splitter" 85.5 1360 1424 1536 1792 768 771 777 795 +hsync +vsync
xrandr -d :0 --addmode HDMI3 hdmi_splitter
xrandr -d :0 --output HDMI3 --mode hdmi_splitter

Needless to say (?), this relates to the specific graphics mode.

So this file is executed every time X is started (and hence the xrandr modes list is cleared). All it does is make sure that the relevant output port (HDMI3) knows how to display 1360 x 768. Note that the name of the mode has no particular significance, and that the frame rate isn’t given explicitly, but is calculated by the tools. I got these figures from an xrandr readout with the desired monitor connected directly. See the full listing at the end of this post. It’s the first entry there.

The third command actually switches the display to the desired mode. It can be removed actually, because it’s overridden very soon anyhow. Nevertheless, it shows the command that can be used manually on console, given the two earlier commands (should not be needed, given that the mode is invoked automatically, fingers crossed).

That’s it. Except for occasional glitches (getting full control of this was too much to expect), the two actions mentioned above are enough to get the mode I wanted. Not the “no matter what” I wanted, but close enough.

As for the -d :0 flag: it’s required in remote sessions and scripts. Alternatively, start with an

$ export DISPLAY=:0

Using cvt to obtain the timing parameters (not!)

It’s suggested on some websites to obtain the timing parameters with something like

$ cvt 1360 768 60
# 1360x768 59.80 Hz (CVT) hsync: 47.72 kHz; pclk: 84.75 MHz
Modeline "1360x768_60.00"   84.75  1360 1432 1568 1776  768 771 781 798 -hsync +vsync

I tried this, and the monitor didn’t sync on the signal. It’s indeed a pretty lousy monitor to miss on a DVI signal, and still.

Note the small differences between the timing parameters — that’s probably the reason for this failure. So when the real parameters can be obtained, use them. There is no secret catch-all formula for all graphics modes. The formula works on a good day.
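To put numbers on these differences, the line and frame rates can be worked out from a modeline: the horizontal rate is the pixel clock divided by the horizontal total, and the refresh rate is the pixel clock divided by the product of both totals. A small sketch follows, with a made-up function name; the figures fed into it are taken from the two modelines in this post.

```shell
# Compute line and frame rate from a modeline.
#   $1 = pixel clock in MHz, $2 = horizontal total, $3 = vertical total
modeline_rates() {
    LC_ALL=C awk -v clk="$1" -v ht="$2" -v vt="$3" 'BEGIN {
        printf "hsync %.2f kHz, refresh %.2f Hz\n",
               clk * 1000 / ht, clk * 1000000 / (ht * vt)
    }'
}

modeline_rates 85.5 1792 795    # the mode reported by the monitor
modeline_rates 84.75 1776 798   # the mode suggested by cvt
```

This gives 47.71 kHz / 60.02 Hz for the monitor’s own mode, and 47.72 kHz / 59.80 Hz for cvt’s suggestion, in line with the xrandr and cvt readouts. So the rates are nearly the same, and the differences are in the sync pulse positions and widths, as well as the hsync polarity (+hsync vs. -hsync).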

Hands off, Cinnamon’s daemon!

My original idea was to turn off all automatic graphics mode setting mechanisms, and stay with a single xrandr command, running from /etc/X11/Xsession.d/ or something. It was a great idea, but it didn’t work: I saw a momentary switch to the mode I wanted, and then it changed to something else. I could have added some kind of daemon of my own, that waits a bit and then changes the mode with xrandr, but that’s just adding another daemon to wrestle with the others.

So this didn’t really help, but I’ll leave it here anyway, in case someone wants to change the display mode without having some daemon change it back. Note that according to this page, using gsettings as shown below works only on Cinnamon versions before 3.4, after which the procedure is different (I haven’t tried it, however): Copy /etc/xdg/autostart/cinnamon-settings-daemon-xrandr.desktop to $HOME/.config/autostart. Then append the line Hidden=true to the copied file.

In short, YMMV. Here’s how I did it on my system (and then found it’s not good enough, as mentioned above).

Resolution mode settings made with xrandr will be sporadically overridden by cinnamon-settings-daemon, which has a lot of plugins running for different housekeeping tasks. One of them is to keep X-Window’s display resolution in sync with .config/monitors.xml. So disable it.

Following my own post, this is typically the setting for the said plugin:

$ gsettings list-recursively org.gnome.settings-daemon.plugins.xrandr
org.gnome.settings-daemon.plugins.xrandr active true
org.gnome.settings-daemon.plugins.xrandr priority 0
org.gnome.settings-daemon.plugins.xrandr default-monitors-setup 'follow-lid'
org.gnome.settings-daemon.plugins.xrandr default-configuration-file '/etc/gnome-settings-daemon/xrandr/monitors.xml'

So turn it off:

$ gsettings set org.gnome.settings-daemon.plugins.xrandr active false

and then check again with the list-recursively command above.

xrandr output: The full list of modes

Just for reference, these are the modes given by xrandr for the monitor I did all this for:

$ xrandr -d :0 --verbose

[ ... ]

  1360x768 (0x4b) 85.500MHz +HSync +VSync *current +preferred
        h: width  1360 start 1424 end 1536 total 1792 skew    0 clock  47.71KHz
        v: height  768 start  771 end  777 total  795           clock  60.02Hz
  1920x1080i (0x10b) 74.250MHz -HSync -VSync Interlace
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.75KHz
        v: height 1080 start 1084 end 1094 total 1125           clock  60.00Hz
  1920x1080i (0x10c) 74.250MHz +HSync +VSync Interlace
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.75KHz
        v: height 1080 start 1084 end 1094 total 1125           clock  60.00Hz
  1920x1080i (0x10d) 74.250MHz +HSync +VSync Interlace
        h: width  1920 start 2448 end 2492 total 2640 skew    0 clock  28.12KHz
        v: height 1080 start 1084 end 1094 total 1125           clock  50.00Hz
  1920x1080i (0x10e) 74.176MHz +HSync +VSync Interlace
        h: width  1920 start 2008 end 2052 total 2200 skew    0 clock  33.72KHz
        v: height 1080 start 1084 end 1094 total 1125           clock  59.94Hz
  1280x720 (0x10f) 74.250MHz -HSync -VSync
        h: width  1280 start 1390 end 1430 total 1650 skew    0 clock  45.00KHz
        v: height  720 start  725 end  730 total  750           clock  60.00Hz
  1280x720 (0x110) 74.250MHz +HSync +VSync
        h: width  1280 start 1390 end 1430 total 1650 skew    0 clock  45.00KHz
        v: height  720 start  725 end  730 total  750           clock  60.00Hz
  1280x720 (0x111) 74.250MHz +HSync +VSync
        h: width  1280 start 1720 end 1760 total 1980 skew    0 clock  37.50KHz
        v: height  720 start  725 end  730 total  750           clock  50.00Hz
  1280x720 (0x112) 74.176MHz +HSync +VSync
        h: width  1280 start 1390 end 1430 total 1650 skew    0 clock  44.96KHz
        v: height  720 start  725 end  730 total  750           clock  59.94Hz
  1024x768 (0x113) 65.000MHz -HSync -VSync
        h: width  1024 start 1048 end 1184 total 1344 skew    0 clock  48.36KHz
        v: height  768 start  771 end  777 total  806           clock  60.00Hz
  800x600 (0x114) 40.000MHz +HSync +VSync
        h: width   800 start  840 end  968 total 1056 skew    0 clock  37.88KHz
        v: height  600 start  601 end  605 total  628           clock  60.32Hz
  720x576 (0x115) 27.000MHz -HSync -VSync
        h: width   720 start  732 end  796 total  864 skew    0 clock  31.25KHz
        v: height  576 start  581 end  586 total  625           clock  50.00Hz
  720x576i (0x116) 13.500MHz -HSync -VSync Interlace
        h: width   720 start  732 end  795 total  864 skew    0 clock  15.62KHz
        v: height  576 start  580 end  586 total  625           clock  50.00Hz
  720x480 (0x117) 27.027MHz -HSync -VSync
        h: width   720 start  736 end  798 total  858 skew    0 clock  31.50KHz
        v: height  480 start  489 end  495 total  525           clock  60.00Hz
  720x480 (0x118) 27.000MHz -HSync -VSync
        h: width   720 start  736 end  798 total  858 skew    0 clock  31.47KHz
        v: height  480 start  489 end  495 total  525           clock  59.94Hz
  720x480i (0x119) 13.514MHz -HSync -VSync Interlace
        h: width   720 start  739 end  801 total  858 skew    0 clock  15.75KHz
        v: height  480 start  488 end  494 total  525           clock  60.00Hz
  720x480i (0x11a) 13.500MHz -HSync -VSync Interlace
        h: width   720 start  739 end  801 total  858 skew    0 clock  15.73KHz
        v: height  480 start  488 end  494 total  525           clock  59.94Hz
  640x480 (0x11b) 25.200MHz -HSync -VSync
        h: width   640 start  656 end  752 total  800 skew    0 clock  31.50KHz
        v: height  480 start  490 end  492 total  525           clock  60.00Hz
  640x480 (0x11c) 25.175MHz -HSync -VSync
        h: width   640 start  656 end  752 total  800 skew    0 clock  31.47KHz
        v: height  480 start  490 end  492 total  525           clock  59.94Hz

The vast majority are standard VESA modes.

↧

Linux / X-Windows: Which process owns this window?

September 3, 2019, 12:11 am

Once in a while, there’s a piece of junk on the desktop, and the question is who should be blamed for it.

The short answer is:

$ xwininfo

and fetch the window’s ID from the line at the beginning saying e.g.

xwininfo: Window id: 0x860000a "xclock"

And next, fetch the alleged process ID:

$ xprop -id 0x860000a | grep _NET_WM_PID
_NET_WM_PID(CARDINAL) = 58637

Note that it’s the duty of the program that generated the window to set this value correctly, so it may be absent or wrong. If the window belongs to a client on another machine, this process ID might be misleading, as it’s on the process table of the client (check out WM_CLIENT_MACHINE, also given by xprop).

If _NET_WM_PID isn’t helpful, try to look for other hints in xprop’s answer, or the correct process can be found with the rather complicated method described on this page.
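The two steps above can be glued together. The following is just a convenience sketch with a made-up function name: it picks the number out of xprop’s output, so it can be handed over to ps. The same caveats about _NET_WM_PID being absent, wrong or belonging to another machine apply.

```shell
# Read xprop output on stdin, print the value of _NET_WM_PID (if any)
pid_from_xprop() {
    sed -n 's/^_NET_WM_PID(CARDINAL) = \([0-9][0-9]*\)$/\1/p'
}

# Typical use on a live X session (window ID as obtained from xwininfo):
#   ps -o pid,user,args -p "$(xprop -id 0x860000a | pid_from_xprop)"
```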

↧

MySQL, OOM killer, overcommitting and other memory related issues

October 13, 2019, 10:21 am

It started with an error message

This post is a bit of a coredump of myself attempting to resolve a sudden web server failure. And even more important, understand why it happened (check on that) and try to prevent it from happening in the future (not as lucky there).

I’ve noticed that there are many threads on the Internet on why mysqld died suddenly, so to make a long story short: mysqld has the exact profile that the OOM killer is looking for: Lots of resident RAM, and it’s not a system process. Apache gets killed every now and then for the same reason.

This post relates to a VPS hosted Debian 8, kernel 3.10.0, x86_64. The MySQL server is a 5.5.62-0+deb8u1 (Debian).

As always, it started with a mail notification from some cronjob complaining about something. Soon enough it was evident that the MySQL server was down. And as usual, the deeper I investigated this issue, the more I realized that this was just the tip of the iceberg (the kind that doesn’t melt due to global warming).

The crash

So first, it was clear that the MySQL server had restarted itself a couple of days before disaster:

191007  9:25:17 [Warning] Using unique option prefix myisam-recover instead of myisam-recover-options is deprecated and will be removed in a future release. Please use the full name instead.
191007  9:25:17 [Note] Plugin 'FEDERATED' is disabled.
191007  9:25:17 InnoDB: The InnoDB memory heap is disabled
191007  9:25:17 InnoDB: Mutexes and rw_locks use GCC atomic builtins
191007  9:25:17 InnoDB: Compressed tables use zlib 1.2.8
191007  9:25:17 InnoDB: Using Linux native AIO
191007  9:25:17 InnoDB: Initializing buffer pool, size = 128.0M
191007  9:25:17 InnoDB: Completed initialization of buffer pool
191007  9:25:17 InnoDB: highest supported file format is Barracuda.
InnoDB: The log sequence number in ibdata files does not match
InnoDB: the log sequence number in the ib_logfiles!
191007  9:25:17  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
191007  9:25:19  InnoDB: Waiting for the background threads to start
191007  9:25:20 InnoDB: 5.5.62 started; log sequence number 1427184442
191007  9:25:20 [Note] Server hostname (bind-address): '127.0.0.1'; port: 3306
191007  9:25:20 [Note]   - '127.0.0.1' resolves to '127.0.0.1';
191007  9:25:20 [Note] Server socket created on IP: '127.0.0.1'.
191007  9:25:21 [Note] Event Scheduler: Loaded 0 events
191007  9:25:21 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.62-0+deb8u1'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  (Debian)
191007  9:25:28 [ERROR] /usr/sbin/mysqld: Table './mydb/wp_options' is marked as crashed and should be repaired
191007  9:25:28 [Warning] Checking table:   './mydb/wp_options'
191007  9:25:28 [ERROR] /usr/sbin/mysqld: Table './mydb/wp_posts' is marked as crashed and should be repaired
191007  9:25:28 [Warning] Checking table:   './mydb/wp_posts'
191007  9:25:28 [ERROR] /usr/sbin/mysqld: Table './mydb/wp_term_taxonomy' is marked as crashed and should be repaired
191007  9:25:28 [Warning] Checking table:   './mydb/wp_term_taxonomy'
191007  9:25:28 [ERROR] /usr/sbin/mysqld: Table './mydb/wp_term_relationships' is marked as crashed and should be repaired
191007  9:25:28 [Warning] Checking table:   './mydb/wp_term_relationships'

And then, two days later, it crashed for real. Or actually, got killed. From the syslog:

Oct 09 05:30:16 kernel: OOM killed process 22763 (mysqld) total-vm:2192796kB, anon-rss:128664kB, file-rss:0kB

and

191009  5:30:17 [Warning] Using unique option prefix myisam-recover instead of myisam-recover-options is deprecated and will be removed in a future release. Please use the full name instead.
191009  5:30:17 [Note] Plugin 'FEDERATED' is disabled.
191009  5:30:17 InnoDB: The InnoDB memory heap is disabled
191009  5:30:17 InnoDB: Mutexes and rw_locks use GCC atomic builtins
191009  5:30:17 InnoDB: Compressed tables use zlib 1.2.8
191009  5:30:17 InnoDB: Using Linux native AIO
191009  5:30:17 InnoDB: Initializing buffer pool, size = 128.0M
InnoDB: mmap(137363456 bytes) failed; errno 12
191009  5:30:17 InnoDB: Completed initialization of buffer pool
191009  5:30:17 InnoDB: Fatal error: cannot allocate memory for the buffer pool
191009  5:30:17 [ERROR] Plugin 'InnoDB' init function returned error.
191009  5:30:17 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
191009  5:30:17 [ERROR] Unknown/unsupported storage engine: InnoDB
191009  5:30:17 [ERROR] Aborting

191009  5:30:17 [Note] /usr/sbin/mysqld: Shutdown complete

The mmap() is most likely anonymous (i.e. not related to a file), as I couldn’t find any memory mapped file that is related to the mysql processes (except for the obvious mappings of shared libraries).

The smoking gun

But here comes the good part: It turns out that the OOM killer had been active several times before. It just so happens that the killed processes were relaunched successfully each time. It was the relaunch that failed this time; otherwise I wouldn’t have noticed this was going on.

This is the output of plain “dmesg”. All OOM entries but the last one were not available with journalctl, as old entries had been deleted.

[3634197.152028] OOM killed process 776 (mysqld) total-vm:2332652kB, anon-rss:153508kB, file-rss:0kB
[3634197.273914] OOM killed process 71 (systemd-journal) total-vm:99756kB, anon-rss:68592kB, file-rss:4kB
[4487991.904510] OOM killed process 3817 (mysqld) total-vm:2324456kB, anon-rss:135752kB, file-rss:0kB
[4835006.413510] OOM killed process 23267 (mysqld) total-vm:2653112kB, anon-rss:131272kB, file-rss:4404kB
[4835006.767112] OOM killed process 32758 (apache2) total-vm:282528kB, anon-rss:11732kB, file-rss:52kB
[4884915.371805] OOM killed process 825 (mysqld) total-vm:2850312kB, anon-rss:121164kB, file-rss:5028kB
[4884915.509686] OOM killed process 17611 (apache2) total-vm:282668kB, anon-rss:11736kB, file-rss:444kB
[5096265.088151] OOM killed process 23782 (mysqld) total-vm:4822232kB, anon-rss:105972kB, file-rss:3784kB
[5845437.591031] OOM killed process 24642 (mysqld) total-vm:2455744kB, anon-rss:137784kB, file-rss:0kB
[5845437.608682] OOM killed process 3802 (systemd-journal) total-vm:82548kB, anon-rss:51412kB, file-rss:28kB
[6896254.741732] OOM killed process 11551 (mysqld) total-vm:2718652kB, anon-rss:144116kB, file-rss:220kB
[7054957.856153] OOM killed process 22763 (mysqld) total-vm:2192796kB, anon-rss:128664kB, file-rss:0kB

Or, after calculating the time stamps (using the last OOM message as a reference):

Fri Aug 30 15:17:36 2019 OOM killed process 776 (mysqld) total-vm:2332652kB, anon-rss:153508kB, file-rss:0kB
Fri Aug 30 15:17:36 2019 OOM killed process 71 (systemd-journal) total-vm:99756kB, anon-rss:68592kB, file-rss:4kB
Mon Sep  9 12:27:30 2019 OOM killed process 3817 (mysqld) total-vm:2324456kB, anon-rss:135752kB, file-rss:0kB
Fri Sep 13 12:51:05 2019 OOM killed process 23267 (mysqld) total-vm:2653112kB, anon-rss:131272kB, file-rss:4404kB
Fri Sep 13 12:51:05 2019 OOM killed process 32758 (apache2) total-vm:282528kB, anon-rss:11732kB, file-rss:52kB
Sat Sep 14 02:42:54 2019 OOM killed process 825 (mysqld) total-vm:2850312kB, anon-rss:121164kB, file-rss:5028kB
Sat Sep 14 02:42:54 2019 OOM killed process 17611 (apache2) total-vm:282668kB, anon-rss:11736kB, file-rss:444kB
Mon Sep 16 13:25:24 2019 OOM killed process 23782 (mysqld) total-vm:4822232kB, anon-rss:105972kB, file-rss:3784kB
Wed Sep 25 05:31:36 2019 OOM killed process 24642 (mysqld) total-vm:2455744kB, anon-rss:137784kB, file-rss:0kB
Wed Sep 25 05:31:36 2019 OOM killed process 3802 (systemd-journal) total-vm:82548kB, anon-rss:51412kB, file-rss:28kB
Mon Oct  7 09:25:13 2019 OOM killed process 11551 (mysqld) total-vm:2718652kB, anon-rss:144116kB, file-rss:220kB
Wed Oct  9 05:30:16 2019 OOM killed process 22763 (mysqld) total-vm:2192796kB, anon-rss:128664kB, file-rss:0kB
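The conversion is simple arithmetic: subtract the dmesg stamp in question from the reference’s dmesg stamp, and go back that many seconds from the reference’s wall-clock time. A sketch of this follows, assuming GNU date, whole seconds only, and no daylight saving change between the two points in time. The function name is made up.

```shell
# Convert a dmesg time stamp to wall-clock time, using one message whose
# wall-clock time is known as the reference.
#   $1 = dmesg stamp to convert
#   $2 = dmesg stamp of the reference message
#   $3 = wall-clock time of the reference message
dmesg_to_date() {
    stamp=${1%.*}          # drop the fraction of a second
    ref_stamp=${2%.*}
    ref_epoch=$(date -d "$3" +%s)
    LC_ALL=C date -d "@$((ref_epoch - (ref_stamp - stamp)))" '+%a %b %e %H:%M:%S %Y'
}

# The first OOM entry, with the last one as the reference:
dmesg_to_date 3634197.152028 7054957.856153 '2019-10-09 05:30:16'
```

Note that these time stamps drift if the machine has been suspended, so the result is only as good as the assumption that the kernel’s clock kept running all along.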

anon-rss is the resident RAM consumed by the process itself (anonymous = not memory mapped to a file or something like that).

total-vm is the total size of the Virtual Memory in use. This isn’t very relevant (I think), as it involves shared libraries, memory mapped files and other segments that don’t consume any actual RAM or other valuable resources.

So now it’s clear what happened. Next, to some finer resolution.

The MySQL keepaliver

The MySQL daemon is executed by virtue of a SysV init script, which launches /usr/bin/mysqld_safe, a patch-on-patch script to keep the daemon alive, no matter what. It restarts the mysqld daemon if it dies for any or no reason, and should also produce log messages. On my system, it’s executed as

/usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/log/mysql/error.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306

The script issues log messages when something unexpected happens, but they don’t appear in /var/log/mysql/error.log or anywhere else, even though the file exists, is owned by the mysql user, and has quite a few messages from the mysql daemon itself.

I tried changing

/usr/bin/mysqld_safe > /dev/null 2>&1 &

to

/usr/bin/mysqld_safe --syslog > /dev/null 2>&1 &

Frankly speaking, I don’t think this made any difference. I’ve seen nothing new in the logs.

It would have been nicer having the messages in mysql/error.log, but at least they are visible with journalctl this way.

Shrinking the InnoDB buffer pool

As the actual failure was on attempting to map memory for the buffer pool, maybe make it smaller…?

Launch the MySQL client as the root user:

$ mysql -u root --password

and check the InnoDB status, as suggested on this page:

mysql> SHOW ENGINE INNODB STATUS;

[ ... ]

----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 137363456; in additional pool allocated 0
Dictionary memory allocated 1100748
Buffer pool size   8192
Free buffers       6263
Database pages     1912
Old database pages 725
Modified db pages  0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 1894, created 18, written 1013
0.00 reads/s, 0.00 creates/s, 0.26 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 1912, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]

I’m really not an expert, but if “Free buffers” is 75% of the total allocated space, I’ve probably allocated too much. So I reduced it to 32 MB — it’s not like I’m running a high-end server. I added /etc/mysql/conf.d/innodb_pool_size.cnf (owned by root, 0644) reading:

# Reduce InnoDB buffer size from default 128 MB to 32 MB
[mysqld]
innodb_buffer_pool_size=32M

Restarting the daemon, it says:

----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 34340864; in additional pool allocated 0
Dictionary memory allocated 1100748
Buffer pool size   2047
Free buffers       856
Database pages     1189
Old database pages 458
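The figures in these readouts are given in pages, not bytes. Assuming InnoDB’s default page size of 16 KiB (I haven’t changed it), the following sketch translates the “Buffer pool size” and “Free buffers” lines into something more readable. The function name is made up.

```shell
# Summarize InnoDB buffer pool usage.
#   $1 = "Buffer pool size" in pages, $2 = "Free buffers" in pages
pool_summary() {
    LC_ALL=C awk -v total="$1" -v free="$2" 'BEGIN {
        page_kib = 16   # assumed InnoDB page size in KiB
        printf "%.1f MiB pool, %.1f%% free\n",
               total * page_kib / 1024, free * 100 / total
    }'
}

pool_summary 8192 6263   # before the change
pool_summary 2047 856    # after the change
```

That is 128.0 MiB with 76.5% free before the change, and 32.0 MiB with 41.8% free after it. The “Total memory allocated” figure is slightly larger than the pool itself in both readouts, which I take to be bookkeeping overhead.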

And finally, repair the tables

Remember those warnings that the tables were marked as crashed? That’s the easy part:

$ mysqlcheck -A --auto-repair

That went smoothly, with no complaints. After all, it wasn’t really a crash.

Some general words on OOM

This whole idea that the kernel should do Roman Empire style decimation of processes is criticized by many, but it’s probably not such a bad idea. The root cause lies in the fact that the kernel agrees to allocate more RAM than it actually has. This is possible at all because the kernel doesn’t really allocate RAM when a process asks for memory with a brk() call; it only allocates the memory space segment. The actual RAM is allocated only when the process attempts to access a page that hasn’t been backed by RAM yet. The access attempt causes a page fault, the kernel quickly finds some RAM and returns from the page fault as if nothing happened.

So when the kernel responds with an -ENOMEM, it’s not because it doesn’t have any RAM, but because it doesn’t want to.

More precisely, the kernel keeps account of how much memory it has given away (system-wise and/or cgroup-wise) and makes a decision. The common policy is to overcommit to some extent, that is, to allow the total allocated RAM to exceed the total physical RAM. Even, and in particular, if there’s no swap.

The common figure is to overcommit by 50%: For a 64 GiB RAM computer, there might be 96 GiB of promised RAM. This may seem like an awfully stupid thing to do, but hey, it works. If that concept worries you, modern banking (with real money, that is) might worry you even more.

The problem arises when the processes run to the bank. That is, when the processes access the RAM they’ve been promised, and at some point the kernel has nowhere to take memory from. Let’s assume there’s no swap, all disk buffers have been flushed, all rabbits have been pulled. There’s a process waiting for memory, and it can’t go back running until the problem has been resolved.

Linux’ solution to this situation is to select a process with a lot of RAM and little importance. How the kernel makes that judgement is documented everywhere. The important point is that it’s not necessarily the process that triggered the event, and that it will usually be the same victim over and over again. In my case, mysqld is the favorite. Big, fat, and not a system process.
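The kernel exposes its current opinion on each process as /proc/<pid>/oom_score (higher means more likely to be killed). The following sketch lists the top candidates. The function name is made up, and the proc directory is a parameter only to make the function easy to try out on a fake directory tree.

```shell
# List the processes with the highest OOM score, most likely victim first.
#   $1 = proc directory (normally /proc), $2 = number of lines (default 5)
oom_candidates() {
    for dir in "$1"/[0-9]*; do
        [ -r "$dir/oom_score" ] || continue
        printf '%s %s %s\n' "$(cat "$dir/oom_score" 2>/dev/null)" \
            "${dir##*/}" "$(cat "$dir/comm" 2>/dev/null)"
    done | sort -rn | head -n "${2:-5}"
}

oom_candidates /proc
```

Each output line consists of the score, the process ID and the command name. Whether these numbers mean much inside a container-style VPS, where the host’s kernel is the one doing the killing, is a different question.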

Thinking about it, the OOM killer is a good solution to get out of a tricky situation. The alternative would have been to deny memory to processes just launched, including the administrator’s attempt to rescue the system. Or an attempt to shut it down with some dignity. So sacrificing a large and hopefully not-so-important process isn’t such a bad idea.

Why did the OOM kick in?

This all took place on a VPS virtual machine with 1 GB leased RAM. With the stuff running on that machine, there’s no reason in the world that the total actual RAM consumption would reach that limit. This is a system that typically has 70% of its memory marked as “cached” (i.e. used by disk cache). This should be taken with a grain of salt, as “top” displays data from some bogus /proc/vmstat, and still.

As can be seen in the dmesg logs above, the amount of resident RAM of the killed mysqld process was 120-150 MB or so. Together with the other memory hog, apache2, they reach 300 MB. That’s it. No reason for anything drastic.

Having said that, it’s remarkable that the total-vm stood at 2.4-4.3 GB when killed. This is much higher than the typical 900 MB visible usually. So maybe there’s some kind of memory leak, even if it’s harmless? Looking at mysql over time, its virtual memory allocation tends to grow.

VPS machines do have a physical memory limit imposed, by virtue of the relevant cgroup’s memory.high and memory.max limits. In particular the latter — if the cgroup’s total consumption exceeds memory.max, OOM kicks in. This is how the illusion of an independent RAM segment is made on a VPS machine. Plus faking some /proc files.
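For what it’s worth, these limits can be read directly from a cgroup v2 directory, if the guest is allowed to see one. This is only a sketch: the function name is made up for illustration, and /sys/fs/cgroup as the mount point is an assumption that may not hold inside a container-style VPS.

```python
from pathlib import Path


def read_cgroup_memory(cgroup_dir):
    """Return (current_bytes, max_bytes) for a cgroup v2 directory.

    A value is None when the file is absent (e.g. memory.max in the root
    cgroup) or when memory.max holds the literal string "max" (no limit).
    """
    base = Path(cgroup_dir)

    def read_value(name):
        path = base / name
        if not path.exists():
            return None
        text = path.read_text().strip()
        return None if text == "max" else int(text)

    return read_value("memory.current"), read_value("memory.max")
```

Calling read_cgroup_memory("/sys/fs/cgroup") on a machine with cgroup v2 mounted there gives the usage and the limit in bytes, which is the pair the OOM decision inside the cgroup is based on.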

But there’s another explanation: Say that a VPS service provider takes a computer with 16 GB RAM, and places 16 VPS machines with 1 GB leased RAM each. What will the overall actual RAM consumption be? I would expect it to be much lower than 16 GB. So why not add a few more VPS machines, and make some good use of the hardware? It’s where the profit comes from.

Most of the time, there will be no problem. But occasionally, this will cause RAM shortages, in which case the kernel’s global OOM looks for a victim. I suppose there’s no significance to cgroups in this matter. In other words, the kernel sees all processes in the system the same, regardless of which cgroup (and hence VPS machine) they belong to. Which means that the process killed doesn’t necessarily belong to the VPS that triggered the problem. The processes of one VPS may suddenly demand their memory, but some other VPS will have its processes killed.

Conclusion

Shrinking the buffer pool of mysqld was probably a good idea, in particular if a computer-wide OOM killed the process — odds are that it will kill some other mysqld instead this way.
Possibly restart mysql with a cronjob every day to keep its memory consumption in control. But this might create problems of its own.
It’s high time to replace the VPS guest with KVM or similar.

—————————————————————-

Rambling epilogue: Some thoughts about overcommitting

The details of how overcommitting is accounted for are given in the kernel tree’s Documentation/vm/overcommit-accounting. But to make a long story short, it’s done in a sensible way. In particular, if a piece of memory is shared by threads and processes, it’s only accounted for once.

Relevant files: /proc/meminfo and /proc/vmstat

It seems like CommitLimit and Committed_AS are not available on a VPS guest system. But the OOM killer probably knows these values (or was it because /proc/sys/vm/overcommit_memory was set to 1 on my system, meaning “Always overcommit”?).
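A quick way to check whether these two fields are present, and how far apart they are, is to parse /proc/meminfo. The following is a minimal sketch with made-up function names. It copes with the fields being absent, as they seemed to be on my VPS guest.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field name: number}.

    Most fields are given in kB; the unit suffix is simply dropped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_headroom(fields):
    """Return CommitLimit - Committed_AS (kB), or None if either is missing."""
    if "CommitLimit" not in fields or "Committed_AS" not in fields:
        return None
    return fields["CommitLimit"] - fields["Committed_AS"]
```

Feed it the content of /proc/meminfo, e.g. parse_meminfo(open("/proc/meminfo").read()). A negative headroom means that more memory has been promised than the limit allows, which only matters when overcommit_memory is 2.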

To get a list of the current memory hogs, run “top” and press shift-M as it’s running.

To get an idea on how a process behaves, use pmap -x. For example, looking at a mysqld process (run as root, or no memory map will be shown):

# pmap -x 14817
14817:   /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/log/mysql/error.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306
Address           Kbytes     RSS   Dirty Mode  Mapping
000055c5617ac000   10476    6204       0 r-x-- mysqld
000055c5623e6000     452     452     452 r---- mysqld
000055c562457000     668     412     284 rw--- mysqld
000055c5624fe000     172     172     172 rw---   [ anon ]
000055c563e9b000    6592    6448    6448 rw---   [ anon ]
00007f819c000000    2296     320     320 rw---   [ anon ]
00007f819c23e000   63240       0       0 -----   [ anon ]
00007f81a0000000    3160     608     608 rw---   [ anon ]
00007f81a0316000   62376       0       0 -----   [ anon ]
00007f81a4000000    9688    7220    7220 rw---   [ anon ]
00007f81a4976000   55848       0       0 -----   [ anon ]
00007f81a8000000     132       8       8 rw---   [ anon ]
00007f81a8021000   65404       0       0 -----   [ anon ]
00007f81ac000000     132       4       4 rw---   [ anon ]
00007f81ac021000   65404       0       0 -----   [ anon ]
00007f81b1220000       4       0       0 -----   [ anon ]
00007f81b1221000    8192       8       8 rw---   [ anon ]
00007f81b1a21000       4       0       0 -----   [ anon ]
00007f81b1a22000    8192       8       8 rw---   [ anon ]
00007f81b2222000       4       0       0 -----   [ anon ]
00007f81b2223000    8192       8       8 rw---   [ anon ]
00007f81b2a23000       4       0       0 -----   [ anon ]
00007f81b2a24000    8192      20      20 rw---   [ anon ]
00007f81b3224000       4       0       0 -----   [ anon ]
00007f81b3225000    8192       8       8 rw---   [ anon ]
00007f81b3a25000       4       0       0 -----   [ anon ]
00007f81b3a26000    8192       8       8 rw---   [ anon ]
00007f81b4226000       4       0       0 -----   [ anon ]
00007f81b4227000    8192       8       8 rw---   [ anon ]
00007f81b4a27000       4       0       0 -----   [ anon ]
00007f81b4a28000    8192       8       8 rw---   [ anon ]
00007f81b5228000       4       0       0 -----   [ anon ]
00007f81b5229000    8192       8       8 rw---   [ anon ]
00007f81b5a29000       4       0       0 -----   [ anon ]
00007f81b5a2a000    8192       8       8 rw---   [ anon ]
00007f81b622a000       4       0       0 -----   [ anon ]
00007f81b622b000    8192      12      12 rw---   [ anon ]
00007f81b6a2b000       4       0       0 -----   [ anon ]
00007f81b6a2c000    8192       8       8 rw---   [ anon ]
00007f81b722c000       4       0       0 -----   [ anon ]
00007f81b722d000   79692   57740   57740 rw---   [ anon ]
00007f81bc000000     132      76      76 rw---   [ anon ]
00007f81bc021000   65404       0       0 -----   [ anon ]
00007f81c002f000    2068    2052    2052 rw---   [ anon ]
00007f81c03f9000       4       0       0 -----   [ anon ]
00007f81c03fa000     192      52      52 rw---   [ anon ]
00007f81c042a000       4       0       0 -----   [ anon ]
00007f81c042b000     192      52      52 rw---   [ anon ]
00007f81c045b000       4       0       0 -----   [ anon ]
00007f81c045c000     192      64      64 rw---   [ anon ]
00007f81c048c000       4       0       0 -----   [ anon ]
00007f81c048d000     736     552     552 rw---   [ anon ]
00007f81c0545000      20       4       0 rw-s- [aio] (deleted)
00007f81c054a000      20       4       0 rw-s- [aio] (deleted)
00007f81c054f000    3364    3364    3364 rw---   [ anon ]
00007f81c0898000      44      12       0 r-x-- libnss_files-2.19.so
00007f81c08a3000    2044       0       0 ----- libnss_files-2.19.so
00007f81c0aa2000       4       4       4 r---- libnss_files-2.19.so
00007f81c0aa3000       4       4       4 rw--- libnss_files-2.19.so
00007f81c0aa4000      40      20       0 r-x-- libnss_nis-2.19.so
00007f81c0aae000    2044       0       0 ----- libnss_nis-2.19.so
00007f81c0cad000       4       4       4 r---- libnss_nis-2.19.so
00007f81c0cae000       4       4       4 rw--- libnss_nis-2.19.so
00007f81c0caf000      28      20       0 r-x-- libnss_compat-2.19.so
00007f81c0cb6000    2044       0       0 ----- libnss_compat-2.19.so
00007f81c0eb5000       4       4       4 r---- libnss_compat-2.19.so
00007f81c0eb6000       4       4       4 rw--- libnss_compat-2.19.so
00007f81c0eb7000       4       0       0 -----   [ anon ]
00007f81c0eb8000    8192       8       8 rw---   [ anon ]
00007f81c16b8000      84      20       0 r-x-- libnsl-2.19.so
00007f81c16cd000    2044       0       0 ----- libnsl-2.19.so
00007f81c18cc000       4       4       4 r---- libnsl-2.19.so
00007f81c18cd000       4       4       4 rw--- libnsl-2.19.so
00007f81c18ce000       8       0       0 rw---   [ anon ]
00007f81c18d0000    1668     656       0 r-x-- libc-2.19.so
00007f81c1a71000    2048       0       0 ----- libc-2.19.so
00007f81c1c71000      16      16      16 r---- libc-2.19.so
00007f81c1c75000       8       8       8 rw--- libc-2.19.so
00007f81c1c77000      16      16      16 rw---   [ anon ]
00007f81c1c7b000      88      44       0 r-x-- libgcc_s.so.1
00007f81c1c91000    2044       0       0 ----- libgcc_s.so.1
00007f81c1e90000       4       4       4 rw--- libgcc_s.so.1
00007f81c1e91000    1024     128       0 r-x-- libm-2.19.so
00007f81c1f91000    2044       0       0 ----- libm-2.19.so
00007f81c2190000       4       4       4 r---- libm-2.19.so
00007f81c2191000       4       4       4 rw--- libm-2.19.so
00007f81c2192000     944     368       0 r-x-- libstdc++.so.6.0.20
00007f81c227e000    2048       0       0 ----- libstdc++.so.6.0.20
00007f81c247e000      32      32      32 r---- libstdc++.so.6.0.20
00007f81c2486000       8       8       8 rw--- libstdc++.so.6.0.20
00007f81c2488000      84       8       8 rw---   [ anon ]
00007f81c249d000      12       8       0 r-x-- libdl-2.19.so
00007f81c24a0000    2044       0       0 ----- libdl-2.19.so
00007f81c269f000       4       4       4 r---- libdl-2.19.so
00007f81c26a0000       4       4       4 rw--- libdl-2.19.so
00007f81c26a1000      32       4       0 r-x-- libcrypt-2.19.so
00007f81c26a9000    2044       0       0 ----- libcrypt-2.19.so
00007f81c28a8000       4       4       4 r---- libcrypt-2.19.so
00007f81c28a9000       4       4       4 rw--- libcrypt-2.19.so
00007f81c28aa000     184       0       0 rw---   [ anon ]
00007f81c28d8000      36      28       0 r-x-- libwrap.so.0.7.6
00007f81c28e1000    2044       0       0 ----- libwrap.so.0.7.6
00007f81c2ae0000       4       4       4 r---- libwrap.so.0.7.6
00007f81c2ae1000       4       4       4 rw--- libwrap.so.0.7.6
00007f81c2ae2000       4       4       4 rw---   [ anon ]
00007f81c2ae3000     104      12       0 r-x-- libz.so.1.2.8
00007f81c2afd000    2044       0       0 ----- libz.so.1.2.8
00007f81c2cfc000       4       4       4 r---- libz.so.1.2.8
00007f81c2cfd000       4       4       4 rw--- libz.so.1.2.8
00007f81c2cfe000       4       4       0 r-x-- libaio.so.1.0.1
00007f81c2cff000    2044       0       0 ----- libaio.so.1.0.1
00007f81c2efe000       4       4       4 r---- libaio.so.1.0.1
00007f81c2eff000       4       4       4 rw--- libaio.so.1.0.1
00007f81c2f00000      96      84       0 r-x-- libpthread-2.19.so
00007f81c2f18000    2044       0       0 ----- libpthread-2.19.so
00007f81c3117000       4       4       4 r---- libpthread-2.19.so
00007f81c3118000       4       4       4 rw--- libpthread-2.19.so
00007f81c3119000      16       4       4 rw---   [ anon ]
00007f81c311d000     132     112       0 r-x-- ld-2.19.so
00007f81c313e000       8       0       0 rw---   [ anon ]
00007f81c3140000      20       4       0 rw-s- [aio] (deleted)
00007f81c3145000      20       4       0 rw-s- [aio] (deleted)
00007f81c314a000      20       4       0 rw-s- [aio] (deleted)
00007f81c314f000      20       4       0 rw-s- [aio] (deleted)
00007f81c3154000      20       4       0 rw-s- [aio] (deleted)
00007f81c3159000      20       4       0 rw-s- [aio] (deleted)
00007f81c315e000      20       4       0 rw-s- [aio] (deleted)
00007f81c3163000      20       4       0 rw-s- [aio] (deleted)
00007f81c3168000    1840    1840    1840 rw---   [ anon ]
00007f81c3334000       8       0       0 rw-s- [aio] (deleted)
00007f81c3336000       4       0       0 rw-s- [aio] (deleted)
00007f81c3337000      24      12      12 rw---   [ anon ]
00007f81c333d000       4       4       4 r---- ld-2.19.so
00007f81c333e000       4       4       4 rw--- ld-2.19.so
00007f81c333f000       4       4       4 rw---   [ anon ]
00007ffd2d68b000     132      68      68 rw---   [ stack ]
00007ffd2d7ad000       8       4       0 r-x--   [ anon ]
ffffffffff600000       4       0       0 r-x--   [ anon ]
---------------- ------- ------- -------
total kB          640460   89604   81708

The totals of the Kbytes and RSS columns at the bottom match the VIRT and RES figures shown by “top”.

I should emphasize that this is a freshly started mysqld process. Give it a few days to run, and some extra 100 MB of virtual space is added (not clear why) plus some real RAM, depending on the setting.

I’ve marked six anonymous segments (the ones of 55-65 MB each with mode “-----” in the listing above) that are completely virtual (no resident memory at all), summing up to ~360 MB. This means that they are counted in as 360 MB at least once, and that’s for a process that only uses 90 MB for real.
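Picking out such segments by eye gets tedious, so here is a small sketch that sums them from the text output of pmap -x. The function name and the min_kbytes threshold are my own inventions for illustration. It relies on the column layout shown in the listing above (address, Kbytes, RSS, Dirty, Mode, Mapping).

```python
def sum_unbacked_kbytes(pmap_output, min_kbytes=0):
    """Sum the Kbytes column of `pmap -x` rows with RSS == 0 and mode '-----'.

    Rows smaller than min_kbytes are skipped, which filters out the 4 kB
    guard pages and the ~2 MB gaps inside shared libraries.
    """
    total = 0
    for line in pmap_output.splitlines():
        cols = line.split()
        # Data rows look like: address kbytes rss dirty mode mapping...
        if len(cols) < 6:
            continue
        kbytes, rss, mode = cols[1], cols[2], cols[4]
        if not (kbytes.isdigit() and rss.isdigit()):
            continue  # header line or the command line at the top
        if int(rss) == 0 and mode == "-----" and int(kbytes) >= min_kbytes:
            total += int(kbytes)
    return total
```

With min_kbytes=50000, only the six large segments of the listing above are counted, which adds up to 377676 kB. That is in line with the ~360 MB figure.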

My own anecdotal test on another machine with a 4.4.0 kernel showed that putting /proc/sys/vm/overcommit_ratio below what was actually committed (making /proc/meminfo’s CommitLimit smaller than Committed_AS) didn’t have any effect unless /proc/sys/vm/overcommit_memory was set to 2. And when I did that, the OOM wasn’t called, but instead I had a hard time running new commands:

# echo 2 > /proc/sys/vm/overcommit_memory
# cat /proc/meminfo
bash: fork: Cannot allocate memory

So this is what it looks like when memory runs out and the system refuses to play ball.
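For reference, when overcommit_memory is 2, the kernel documentation gives the commit limit as the swap size plus overcommit_ratio percent of the RAM (minus huge pages). A minimal sketch of that arithmetic follows. The function name is mine, and it assumes that /proc/sys/vm/overcommit_kbytes is left at zero, so that the ratio is what counts.

```python
def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio, hugetlb_kb=0):
    """CommitLimit in kB for overcommit_memory = 2, per the kernel docs:
    swap + (RAM - huge pages) * overcommit_ratio / 100
    """
    return swap_total_kb + (mem_total_kb - hugetlb_kb) * overcommit_ratio // 100


# A 1 GiB machine without swap, with a ratio of 50:
print(commit_limit_kb(1048576, 0, 50))  # 524288 kB, i.e. 512 MiB
```

So with no swap and a ratio of 50, switching to mode 2 allows only half of the RAM to be committed, which is why a system that was happily overcommitted can suddenly refuse to fork.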


A VoIP phone at home: The tech details on leaving your phone company

November 11, 2019, 9:05 am


Introduction

This is some information and hard-learned wisdom I collected while setting up an Israeli phone number for use with a VoIP phone, so I can accept calls with a regular Israeli phone number and also make outbound calls. Important to note is that I did this without the local ISPs that provide this service. In particular, I did this for the sake of leaving Israeli Netvision’s service, which required a certain arrangement between the phone adapter and the ADSL line.

First and foremost: Setting up a VoIP line for a regular phone number is a time-consuming task, which requires quite a bit of technical understanding in computer networks. Both service providers and hardware vendors behave as if only phone experts deal with them, and that’s the kind of support and documentation to expect. Unless you’re familiar with some internet protocols and know how to configure a firewall and work with a sniffer, you’re in for some big-time frustration.

The biggest, ehm, alternative truth, related to VoIP is “phone line within minutes”. Expect it to take a few days at best. On a good day, this will be because of your own learning curve. And if you’re not the computer geek type to enjoy exploring a technical topic, odds are that it’s going to be extremely annoying.

The VoIP trinity

There are basically three components that need to be set up for this to work:

A phone number must be allocated or ported. This is referred to as a Direct Inward Dial or Direct Dial In service (DID / DDI). This means that some phone company knows it should relay phone calls of that number to a VoIP service provider, rather than to its own internal phone network.
An SIP trunk for connecting calls between yourself and the VoIP supplier over the internet. Or something of a similar nature.
A VoIP client: A program running on your computer / smartphone or a physical VoIP phone. Just remember that those innocent-looking handsets are actually small Linux computers with a web interface for configuring the network connection (SIP details and credentials, connection method to the Internet and whatnot). Most of the disappointed reviews on these phones are from people who expected a plug and play experience.

Typically (or always?), the first two items are a package deal, supplied by a single service provider. So you’ll sign up for the phone number, and once it has been set up, you’ll set up the VoIP connection with the same company’s servers.

Note that the DID phone number doesn’t have to be in your own country. As a matter of fact, your geographic location doesn’t matter. So if you want to supply a local number to dial into for customers worldwide, this is the way. However, if you just want to be in contact with friends and family abroad, this is way too much work to set up.

And since I’m at it — a DID provider doesn’t necessarily relay to a VoIP network. The service might very well be plain forwarding to another regular phone number. This might be the optimal solution just to create an international presence, but without the VoIP headache.

SIP trunks…?

This is the most confusing issue. All I want is a single phone line, and suddenly I get the word “SIP trunk” everywhere. Do I really need it? I want a glass of milk, and I get the whole cow? The answer is, well, yes. But no fear, this cow supplies exactly the glass of milk you’ll need.

So first, let’s understand why it’s called an SIP trunk: For the sake of argument, say that you have an office with a number of phone lines connected to an in-office phone relay, and all internal phones can now connect with each other. For an outside world connection, a pool of voice circuits is set up, traditionally (that is, 1990-ish) through a single high-speed (~2 Mb/s) digital link to the local telephone company. This digital link is called a “trunk”, and phone calls in and out are allocated in TDM-multiplexed slots on the digital link as these calls are initiated. The telephony company allocates phone numbers, and when someone calls one of these, the trunk is used to carry both the signaling (which phone number is called) and voice into the office.

The “SIP trunk” uses an existing internet connection for the same thing. Instead of a fixed wire, UDP packets carry both signaling and voice. Instead of the regular phone company, an SIP server makes the connection between the VoIP link and real phone numbers. The in-office telephone relay registers itself on the SIP server, tells it what phone number it covers (so incoming calls are relayed to it) and proves its authenticity with some passphrase.

After registration, the in-office phone relay can initiate outgoing calls as required by someone in the office calling out, or accept inbound calls, in which case it will ring one of the phones in the office.

And here’s the point: At the end of the day, that VoIP handset does the same as that in-office phone relay: Requests outgoing calls and accepts incoming calls, with the same SIP protocol. That’s why some kind of trunk is set up (a SIP trunk or some other type) with the capacity of typically one phone number.

That also explains why the whole thing gets complicated: Setting up that little IP phone, you get into the shoes of a small business’ phone technician (of the rather high-end type, actually). So there are a few technical details to understand, and the service providers are somewhat adapted to work with people that do this for a living. You’re supposed to know what you’re doing.

Setting up a simple software SIP client

The protocol for maintaining the telephone signalling and session is called SIP. Hence either a piece of software running on the computer or some dedicated hardware phone can do the job. Or a combination of both.

After some looking around, I went for Linphone for this purpose. It’s simple and to the point.

# apt install linphone

How to set up an SIP connection on Linphone: Dismiss the setup assistant. Instead, go for Options > Preferences. Leave the Network settings with their default values: SIP (UDP) at port 5060, Direct connection to the Internet (even though I have iptables doing both firewall and NAT; it does allow established and related connections, however). In Multimedia settings, leave the echo canceling on (not that it helps much).

Then go to the “Manage SIP Accounts” tab, and go for the ” + Add” button (not the Wizard). Set Your SIP Identity to e.g. “sip:123456@core-sip-qts.avoxi.com” and SIP Proxy address to “sip:core-sip-qts.avoxi.com”. The “123456″ should be replaced with the DID phone number (which is also the first part of the credentials given by the service provider). Leave “Route” empty and registration duration is set automatically to 3600 secs, which is fine. I use Avoxi’s SIP server in this example, but see below on my choice of service provider.

Why the proxy address is used twice is beyond me (once in the identity and then as the proxy address), but this is commonly seen.

Immediately on clicking OK, Linphone prompts for the password. The password, also given by the service provider, is prompted for only on the first attempt to connect. A couple of seconds after supplying the password, the status line at the bottom of Linphone’s main window reads “Registration on <core-sip-qts.avoxi.com> succesful”. It better be.

A debug log is available with Help > Show debug window. It shows, among a lot of other stuff, the SIP protocol exchange.

Keep in mind that linphone keeps running even after closing the main window. To really quit it, do so in the icon on the desktop’s toolbar.

A phone adapter

If you want to keep your regular phone, a SIP adapter will do the job. Even though I didn’t go this path, I considered a PAP2T from AliExpress at $20 or something. No experience with it however.

A real IP phone: Grandstream GXP1610

First, some general words: There are several SIP phones out there, and the reviews on this specific one are mixed, and there’s a good reason for that. It’s generally OK, with a lot of specified features, but at the same time it misses on the small details. For example, it has a wall mount option, but the handset will fall off if you really try that. It can’t be flat on the table either, because the plugs are in the back. So only the table upright position is an option.

The ring tone options are rather poor for a machine that is effectively a computer. Setting it up to allow simple dialing of local phone numbers is a riddle. And the documentation is pretty lacking. There’s a lot of detail on esoteric issues, much less on how to get started with the obvious stuff. One can imagine the computer geeks adding more and more software features, but with nobody looking at the overall usage experience.

That said, it’s fine once set up, in particular if it’s intended for sporadic use, and it’s low-cost. It makes sense in an office, where there’s an IT person to handle the installation and setup.

So now to how to set it up for simple use.

First, some documentation: Download the User Guide from Grandstream’s resource page for the simple use. For configuration, download the Administration Guide. There’s also a Security Manual, which goes through a few security options with the phone.

Plug in the power supply and wait for a few seconds. The phone says “Booting” and then it boots for a while (it takes a minute, like almost exactly 60 seconds). Plug the phone’s “LAN” Ethernet jack into the local LAN.

This is the time to mention that the phone’s web configuration interface is on the same port as the voice communication. It’s quite common to do the configuration on a separate Ethernet port, but this is not the case. This means that anyone with access to the phone from the LAN (or web?) can fiddle with its configuration. Maybe a good idea in an office with an IT department handling the phones. So setting up a firewall to prevent intrusion from outside is a good idea (if possible). There’s a separate “PC” Ethernet port, but it’s not clear what it is for.

The phone functions as a DHCP client by default, so it gets its address and displays it on the LCD (or press NextScr). Address 0.0.0.0 means that no address has been obtained with DHCP and static IP is disabled.

If DHCP isn’t enabled, press the button in the middle of the four arrows, and navigate: System > Network > IPv4 Settings, select DHCP. The menu returns to IPv4 Settings, meaning it has accepted the selection. Pressing “back” makes the phone ask if we want to reboot, so yes.

To get started, open a browser and type the IP address of the phone. The web interface asks for username and password, and it’s admin/admin, not surprisingly. The web app forces a password change if these are used. Note to self: Look for the phone-login-password.txt file.

To set up a SIP account, go to Accounts > Account 1 > General Settings and fill in the Account Name, SIP Server, SIP User ID (without the “@” and proxy server) and Authenticate Password. Click “Save and Apply”. Then check with Status > Account Status. SIP Registration should say Yes, meaning that the phone is functional (and there’s an icon on the LCD screen, see below).

Then enter Accounts > Account 1 > Audio Settings and set the Preferred Vocoder – choice 1 to G.722, then PCMA, then PCMU. These sounded best in my tests.

There’s also the Accounts > Account 1 > Call Settings which allows setting up local area codes and restrictions, but I didn’t bother — it looks like a riddle in regular expressions. I don’t expect to call out a lot from this phone, so I’ll use the full international number when necessary.

The ring tone: There are four ring tones to choose from, available on the LCD menu under Preferences > Ring Tone. Aside from the “default ring tone” there are three not-so-impressive choices.

The “default ring tone” can be configured through the web interface, but only as a composition of two frequencies. This is set as the “System Ring Tone” on Settings > Preferences > Ring Tone. The default is a plain dual tone going on and off (defined with the string “f1=440,f2=480,c=200/400;”). I changed it to something lighter with “f1=440,f2=480,c=10/30-30/170;”, which is one 100 ms tone, 300 ms pause, then 300 ms tone and 1700 ms pause. The ring volume can be adjusted with the arrow keys: Preferences > Ring Volume.
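To double-check such a string before typing it into the web interface, a small parser helps. The sketch below is based only on the two strings shown above and on my reading that the cadence numbers are in units of 10 ms. The function name is made up, and the unit is an assumption.

```python
def parse_ring_tone(spec, unit_ms=10):
    """Split a tone string like "f1=440,f2=480,c=10/30-30/170;" into
    a list of frequencies (Hz) and a list of (on_ms, off_ms) pairs.

    unit_ms is the assumed duration of one cadence unit.
    """
    freqs = []
    cadence = []
    for item in spec.strip().rstrip(";").split(","):
        key, _, value = item.partition("=")
        key = key.strip()
        if key.startswith("f"):
            freqs.append(int(value))
        elif key == "c":
            # Cadence segments are separated by "-", each being on/off
            for segment in value.split("-"):
                on, off = segment.split("/")
                cadence.append((int(on) * unit_ms, int(off) * unit_ms))
    return freqs, cadence


print(parse_ring_tone("f1=440,f2=480,c=10/30-30/170;"))
# ([440, 480], [(100, 300), (300, 1700)])
```

The default string comes out as a single (2000, 4000) pair, that is two seconds of tone and four seconds of silence.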

Finally, General Settings > Preferences > Date and Time (in the web GUI) for setting the correct time zone. Don’t expect it to get the daylight saving time correctly (at least not in Israel).

The registration status is indicated on the icon at the LCD screen’s upper left:

A filled T shape: Properly registered ready for phone calls
Same, but hollow icon (the “T” is absent): A LAN connection is present but no valid registration is in effect (possibly because of a rejection by remote SIP server)
No icon: No connection to the LAN port.

A DID / DDI provider

A Direct Inward Dial or Direct Dial In service is required to relay incoming phone calls over a VoIP link. There are a lot of providers to choose from. I looked for one that could port an Israeli number to their service, and with a low monthly price.

To make a (very) long story short, I had two finalists: DIDWW and Avoxi. Spoiler: I went for DIDWW. But the way there was interesting.

I first started checking with DIDWW, mainly because their monthly fee was lowest (setup fee of $2.50 and then a monthly fee of $2.50 for all phone numbers, with $0.01 per minute fee, including incoming calls). But there were some worrying signs. First, I had to create an account and log in, just to see their prices. And even worse, at the very first order, one is required to top up a balance of $50, just to begin. One gets aware of this only after registering and at the last stage of making the order. So it looked like a bit of an ugly sales trick. Their “Terms and Conditions” explicitly says that this isn’t reimbursable.

So do they charge $50 upfront to make me a hostage, or is it a sign of serious intentions? I decided to check up Avoxi. $4.50 monthly at lowest plan, $0.04 / minute. More expensive, but no upfront payments. Actually, they offered a free phone line to try for a while.

This is where I’ve deleted quite a few lines describing a lot of good intentions but not so much competence by Avoxi’s support, that eventually pushed me back to DIDWW. I put the $50 on the table, and soon enough I had a working phone line to test. From there I went through the porting procedure with them. It was no fraud, it turned out. They have a pdf document with prices, so it’s not something they fiddle with. And their support is quick and to the point, at least so far.

Setting up a DIDWW account

DIDWW has a proper web interface for helping yourself, but this is VoIP, and you’re supposed to behave like this is what you’ve been doing all your life. So this is a short survival guide to follow after having a phone number allocated on your account.

So here it comes: The DID number is linked with a “Phone Systems” trunk, which is apparently a powerful tool for routing phone calls between queues, DTMF menus, voice mail, fax and well, human response (see user manual). It’s a bit of a cow when needing a glass of milk (a recurring motif in this post), but this is how DIDWW offers a SIP phone connection to their numbers. Plus a lot of features to add on later.

Sign up for the “Phone Systems” product at the bottom of the dashboard to the left, “Lifetime Free” plan, which covers exactly one circuit, and launch the service in the web interface. Follow their tutorial on setting up an SIP account with Phone Systems.

After this, there’s a “Phone Systems” trunk in the list of Voice IN trunks (or add one). The CLI possibilities merely allow selecting how Calling ID is displayed, so it’s not that important. Check “Map all DIDs” so the phone numbers are related to this trunk.

Then to the DIDWW Phone System’s interface: Click the menu icon at the top right, pick Add a New Contact. Add the name and then pick “Add a New Contact Method”. Pick “SIP Account”. Enable “Enable outbound calls” and select the External caller ID to use. Don’t enable “Allowed IPs”.

After clicking “Save”, pick the SIP details drop-down to see the SIP access information. And then pick “Finish”.

Now some graph games to make this phone line live. Exit the right-side settings, and drag-drop “Phone number” into the canvas. Select the desired phone number, and a name to appear on the icon on the canvas. Click Save.

Add a “Ring Group” the same way, and add a single ring destination, namely the contact you set up before. And connect a wire between the two. And that’s it! It should look something like this:

(Screenshot: a “Phone number” icon wired to a “Ring Group” icon on the canvas)

At “My DID Numbers”, the relevant number should appear, with the Trunk set to the Phone Systems trunk set up for this phone number. Capacity should be 100 with a green dot. If it isn’t, click the dot and set up “Pay Per Minute”.

RTP packets came from 46.19.210.34, which is located in Ireland with an 84 ms ping from Israel, so it’s relatively fine.

The parameters are something like:

Username: h8rn7gkb0p
Password: ymqhn2f50k
Domain: sip.phone.systems

Note that if you’re using Linphone, the actual user name given to the SIP phone is sip:h8rn7gkb0p@sip.phone.systems. But when setting the same thing on a Grandstream IP phone, it’s just h8rn7gkb0p. You should know this. You’ve been doing this all your life.
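To spell out the difference, here is a tiny sketch that turns the provider’s username and domain into the values each client wants, following the Linphone and Grandstream setups described above. The function name and the dictionary keys are just labels I made up, not anything the clients themselves use.

```python
def sip_settings(username, domain):
    """Map a provider's username + domain to per-client field values."""
    return {
        # Linphone wants full SIP URIs
        "linphone_identity": f"sip:{username}@{domain}",
        "linphone_proxy": f"sip:{domain}",
        # The Grandstream web interface wants the bare parts
        "grandstream_user_id": username,
        "grandstream_sip_server": domain,
    }


for name, value in sip_settings("h8rn7gkb0p", "sip.phone.systems").items():
    print(f"{name}: {value}")
```

The password is the same in both cases and is entered as is.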

Bonus with using Phone Systems: It’s possible to turn the phone number temporarily into a receiving fax (yes, we’re in 2019, and the bank wants to send me one). Just drag “Fax” into the canvas and configure it for delivery through email (or some other way you prefer). The configuration is somewhat tangled, but quite straightforward. Incoming faxes arrive as a pdf file. Leave the “Ring Group” icon in place, and make the connection to the Fax, as long as this service is desired instead.

Porting a phone number to DIDWW

It’s a matter of filling in a web form with the desired number. Then upload a Letter of Authorization (short thing), some kind of ID (passport in my case, so it’s in English) and the latest invoice from the previous supplier. On the next business day I got an email confirming that the phone number is portable, asking me to confirm the porting fee and the operation in general on the web interface (on the “porting” tab). Clicking on the link that came with the email, I got the message “You have no Portable items”, and instead the phone number was under “In Progress”. Which makes sense, because the porting fee is zero (in Israel). So the email was somewhat misleading: there was no action necessary from my side.

Actually, there was no more to do from my side. 10 days after I submitted the application (which was on a Saturday) I got an email saying that the porting was finished. That was one day after the date given as the target during the process, but well within the official lead time. So it was really quick and painless. Phone number is up and running.

Troubleshooting

So what if it just doesn’t work? You go to the sniffer. But what should it look like? So here’s an example of packets on the wire.

First, registration. Rule number one is that if the SIP server doesn’t like the user ID and/or domain of the REGISTER request, it responds with a 403 Forbidden response. It doesn’t ignore the request. So if there’s no response at all, it’s not a matter of user identity or anything of that sort. Odds are you’re talking with the wrong server.
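That rule of thumb can be put as a tiny decision helper. This is only a sketch: the function name and the wording of the verdicts are made up for illustration, and the input is assumed to be the first line of whatever the sniffer shows coming back from the server (or None if nothing came back).

```python
from typing import Optional


def explain_register_reply(status_line: Optional[str]) -> str:
    """Interpret the first line of the reply to a REGISTER request.

    status_line is e.g. "SIP/2.0 401 Unauthorized", or None if the
    sniffer shows no reply at all.
    """
    if status_line is None:
        return "no reply: odds are this is the wrong server"
    parts = status_line.split(None, 2)
    if len(parts) < 2 or not parts[0].startswith("SIP/") or not parts[1].isdigit():
        return "not a SIP status line"
    code = int(parts[1])
    if code == 200:
        return "registered"
    if code == 401:
        return "challenge: normal, retry with a digest"
    if code == 403:
        return "rejected: check the user ID and domain"
    return "other response: %d" % code


print(explain_register_reply("SIP/2.0 401 Unauthorized"))
```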

Now to a session between the Grandstream phone and the SIP server that works with DIDWW. Only packet content is shown below. The user name shown below is not valid anymore. The phone’s IP address on the LAN it works on is supposedly 10.11.12.13. Not clear why this is exposed in the SIP session.

So the phone says hello with

REGISTER sip:sip.phone.systems SIP/2.0
Via: SIP/2.0/UDP 10.11.12.13:5060;branch=z9hG4bK1124928411;rport
From: <sip:h8rn7gkb0p@sip.phone.systems>;tag=2070587275
To: <sip:h8rn7gkb0p@sip.phone.systems>
Call-ID: 1534949761-5060-18@BA.B.B.CAE
CSeq: 2221 REGISTER
Contact: <sip:h8rn7gkb0p@10.11.12.13:5060>;reg-id=1;+sip.instance="<urn:uuid:00000000-0000-1000-8000-000B82EF9E5C>"
Authorization: Digest username="h8rn7gkb0p", realm="sip.phone.systems", nonce="XchKwF3ISZRG4VSl3lYx9geIizu9DZUGlS27c4A=", uri="sip:sip.phone.systems", response="fb33a3645a58de7ae1fab14d23196de8", algorithm=MD5
Max-Forwards: 70
User-Agent: Grandstream GXP1610 1.0.4.128
Supported: path
Expires: 3600
Allow: INVITE, ACK, OPTIONS, CANCEL, BYE, SUBSCRIBE, NOTIFY, INFO, REFER, UPDATE, MESSAGE
Content-Length: 0

to which the server says “nice try, but you have to prove me your love first”:

SIP/2.0 401 Unauthorized
Via: SIP/2.0/UDP 10.11.12.13:5060;branch=z9hG4bK1124928411;rport=5060;received=109.186.90.35
From: <sip:h8rn7gkb0p@sip.phone.systems>;tag=2070587275
To: <sip:h8rn7gkb0p@sip.phone.systems>;tag=3cbf29d5022d29bd5eb970c4fa286be5.83a2
Call-ID: 1534949761-5060-18@BA.B.B.CAE
CSeq: 2221 REGISTER
WWW-Authenticate: Digest realm="sip.phone.systems", nonce="XchOhl3ITVpwj9Uu+eGasPntV1gjDNiVlT6+ZYA="
Server: hedgehog v7p0
Content-Length: 0

Scary, huh? It says Unauthorized. It makes it look like an error. It isn’t. Some just say “no” to begin with.

So the phone says “I know the answer to your challenge” (using the password to produce a digest):

REGISTER sip:sip.phone.systems SIP/2.0
Via: SIP/2.0/UDP 10.11.12.13:5060;branch=z9hG4bK489633282;rport
From: <sip:h8rn7gkb0p@sip.phone.systems>;tag=2070587275
To: <sip:h8rn7gkb0p@sip.phone.systems>
Call-ID: 1534949761-5060-18@BA.B.B.CAE
CSeq: 2222 REGISTER
Contact: <sip:h8rn7gkb0p@10.11.12.13:5060>;reg-id=1;+sip.instance="<urn:uuid:00000000-0000-1000-8000-000B82EF9E5C>"
Authorization: Digest username="h8rn7gkb0p", realm="sip.phone.systems", nonce="XchOhl3ITVpwj9Uu+eGasPntV1gjDNiVlT6+ZYA=", uri="sip:sip.phone.systems", response="9ff65eb2e0c784af08cd11cc1a7a489f", algorithm=MD5
Max-Forwards: 70
User-Agent: Grandstream GXP1610 1.0.4.128
Supported: path
Expires: 3600
Allow: INVITE, ACK, OPTIONS, CANCEL, BYE, SUBSCRIBE, NOTIFY, INFO, REFER, UPDATE, MESSAGE
Content-Length: 0

The server is impressed by the persistence, and opens its doors:

SIP/2.0 200 OK
Via: SIP/2.0/UDP 10.11.12.13:5060;branch=z9hG4bK489633282;rport=5060;received=109.186.90.35
From: <sip:h8rn7gkb0p@sip.phone.systems>;tag=2070587275
To: <sip:h8rn7gkb0p@sip.phone.systems>;tag=9ac8b8c8d68bc095abf326021301853f-0b5b
Call-ID: 1534949761-5060-18@BA.B.B.CAE
CSeq: 2222 REGISTER
Contact: <sip:h8rn7gkb0p@10.11.12.13:5060>;expires=1800;+sip.instance="<urn:uuid:00000000-0000-1000-8000-000B82EF9E5C>";reg-id=1
Server: hedgehog v7p0
Content-Length: 0

This concludes the registration.
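For the curious, the “answer to the challenge” is plain HTTP-style digest authentication. Since the WWW-Authenticate header above carries no qop parameter, the response should be the older RFC 2617 form: MD5 over the hashed credentials, the nonce and the hashed method + URI. A minimal sketch follows. The function names are mine, and the password is a placeholder, because the real one obviously isn’t in the capture, so the output won’t match the response field shown above.

```python
import hashlib


def md5_hex(text: str) -> str:
    """MD5 of a string, as lowercase hex."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri, nonce):
    """Digest response for a challenge without qop (RFC 2617 style)."""
    ha1 = md5_hex("%s:%s:%s" % (username, realm, password))
    ha2 = md5_hex("%s:%s" % (method, uri))
    return md5_hex("%s:%s:%s" % (ha1, nonce, ha2))


# Placeholder password: an assumption, not the one used in the capture
print(digest_response(
    "h8rn7gkb0p", "sip.phone.systems", "not-the-real-password",
    "REGISTER", "sip:sip.phone.systems",
    "XchOhl3ITVpwj9Uu+eGasPntV1gjDNiVlT6+ZYA="))
```

The point is that the password itself never travels on the wire, only a hash that depends on the server’s nonce.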

And then, every 30 seconds, the server asks (this is specific to Phone Systems):

OPTIONS sip:h8rn7gkb0p@10.11.12.13:5060 SIP/2.0
Via: SIP/2.0/UDP 46.19.209.28:5060;branch=z9hG4bK5106755
From: sip:keepalive@sip.phone.systems;tag=uloc-18-5dbada3b-30f7-092741-1997ebd9-80e8b223
To: sip:h8rn7gkb0p@10.11.12.13:5060
Call-ID: 51e402d-38b1e157-4dd7e13@46.19.209.28
CSeq: 1 OPTIONS
Content-Length: 0

and the phone responds with

SIP/2.0 200 OK
Via: SIP/2.0/UDP 46.19.209.28:5060;branch=z9hG4bK5106755
From: <sip:keepalive@sip.phone.systems>;tag=uloc-18-5dbada3b-30f7-092741-1997ebd9-80e8b223
To: <sip:h8rn7gkb0p@10.11.12.13:5060>;tag=394229760
Call-ID: 51e402d-38b1e157-4dd7e13@46.19.209.28
CSeq: 1 OPTIONS
Supported: replaces, path, timer
User-Agent: Grandstream GXP1610 1.0.4.128
Allow: INVITE, ACK, OPTIONS, CANCEL, BYE, SUBSCRIBE, NOTIFY, INFO, REFER, UPDATE, MESSAGE
Content-Length: 0

The purpose of this eternal nagging is most likely to refresh any firewall’s memory on the existence of a UDP link, in particular if there’s NAT involved (more on this below). Maybe also for checking that there’s still a phone on the other end (not sure if it’s so important, from a server’s perspective).

With Linphone connected to Avoxi, it was the client that kept the UDP link alive with some short dummy UDP packets. Looks like this is down to each phone service.
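A client-side keepalive of that kind boils down to something like the sketch below. The payload (a double CRLF) and the 30 second default are assumptions for illustration: I didn’t record what Linphone actually put in its dummy packets.

```python
import socket
import time


def send_keepalive(sock: socket.socket, server: tuple) -> None:
    """Send one short dummy datagram to refresh the NAT / firewall entry."""
    sock.sendto(b"\r\n\r\n", server)  # payload is an assumption


def keepalive_loop(sock, server, interval=30.0, rounds=None):
    """Send a keepalive every `interval` seconds, `rounds` times (None = forever)."""
    sent = 0
    while rounds is None or sent < rounds:
        send_keepalive(sock, server)
        sent += 1
        time.sleep(interval)
```

Note that sock has to be the very same UDP socket that the SIP client registered with. The NAT entry is tied to the source port, so packets from another socket keep alive nothing useful.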

Does it work with NAT?

The short answer: It does for me, out of the box. No need for any special firewall rule or something. With plain Linux iptables NAT, that is. But if your phone is behind NAT or a firewall, be sure to check that some kind of keepalive UDP packets are exchanged every minute or so. Otherwise, the firewall might forget the UDP connection and not let through an incoming call.

This page explains a typical handshake over a NAT router.

I took an interest in this after failing to receive inbound calls with Linphone, despite having no issues whatsoever with outbound calls. In other words, no problem dialing from the computer, but unavailable when trying to dial to it.

This had nothing to do with NAT and firewalls. The INVITE requests that start off a phone conversation are sent through the UDP link that is constantly maintained with keepalive packets. Hence the server knows at which IP address it should find the SIP client, and the NAT / Firewall remembers the UDP link. So the rule is that if the registration went through fine, there are no excuses. If sound doesn’t come through after the phone is picked up, that’s another story, but iptables should handle this well if it is set to allow related connections (and it should).

One thing that surprised me was that the audio UDP (RTP) packets start streaming as soon as the phone starts ringing on the other side. This is the common practice with cellular phones, and still. Even more surprising was that they came from a completely different server, using a UDP port that is unrelated to anything before. How did the NAT know how to forward this?

The answer lies in a UDP packet sent from the “regular” SIP host, saying (example with Avoxi server, some numbers xxx’ed):

183 Session Progress
Via: SIP/2.0/UDP 10.1.1.22:5060;received=109.186.xx.xx;branch=z9hG4bK278239419;rport=5060
Record-Route: <sip:199.244.96.39:5060;transport=udp;lr>
Contact: sip:199.244.96.46:5070
To: <sip:9724xxxxxx@core-sip-qts.avoxi.com>;tag=rvguhtm5fjvevrnd.i
From: <sip:9723xxxxxxx@core-sip-qts.avoxi.com>;tag=680205676
Call-ID: 1519032647
CSeq: 21 INVITE
Allow: INVITE, ACK, BYE, CANCEL, INFO, SUBSCRIBE, NOTIFY, REFER, MESSAGE, OPTIONS, UPDATE
Content-Type: application/sdp
Server: Sippy
Content-Length: 240

v=0
o=Sippy 219713410032301436 1 IN IP4 199.244.96.46
s=SIP Media Capabilities
t=0 0
m=audio 49610 RTP/AVP 0 101
c=IN IP4 199.244.96.46
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-15
a=sendrecv
a=ptime:20

This is the last stage in the Session Description Protocol (SIP/SDP) session, which started with the INVITE request.

And then RTP/UDP packets started to arrive from IP address 199.244.96.46’s port 49610 (to destination port 7078, but that doesn’t matter). So obviously this is how the NAT got prepared to let through the related link.
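To make it concrete, this is the information that anything tracking the session has to pick out of the SDP body above: the address on the c= line and the port on the m=audio line. A rough Python sketch (the function name is made up, and it only handles the simple single-stream IPv4 case shown here):

```python
def rtp_endpoint(sdp: str):
    """Return (ip, port) of the first audio stream announced in an SDP body."""
    ip = None
    port = None
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("c=IN IP4 ") and ip is None:
            ip = line.split()[2]          # c=IN IP4 <address>
        elif line.startswith("m=audio ") and port is None:
            port = int(line.split()[1])   # m=audio <port> RTP/AVP ...
    return ip, port


SDP = """v=0
o=Sippy 219713410032301436 1 IN IP4 199.244.96.46
s=SIP Media Capabilities
t=0 0
m=audio 49610 RTP/AVP 0 101
c=IN IP4 199.244.96.46
a=rtpmap:0 PCMU/8000
"""

print(rtp_endpoint(SDP))  # ('199.244.96.46', 49610)
```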

↧

systemd: Reacting to USB NIC hotplugging (post-up scripting)

November 11, 2019, 9:17 am

≫ Next: Octave: Empty plots (after “figure”)

≪ Previous: A VoIP phone at home: The tech details on leaving your phone company

The problem

Using Linux Mint 19, I have a network device that needs DHCP address allocation, connected through a USB network dongle. When I plug the dongle in, the network interface appears, but the DHCP daemon ignores eth2 (the assigned network device name) and doesn’t respond to the DHCP discovery packets arriving on it. Restarting the DHCP server well after plugging in the USB network card solves the issue.

I should mention that I use a vintage DHCP server for one reason or another (not necessarily a good one). There’s a good chance that a systemd-aware DHCP daemon will resynchronize itself following a network hotplug event. It’s evident that avahi-daemon, hostapd, systemd-timesyncd and vmnet-natd trigger some activity as a result of the new network device.

Most notable is systemd-timesyncd, which goes

Nov 11 11:25:59 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.

twice, once when the new device appears, and a second time when it is configured. See sample kernel log below.

It’s not clear to me how these daemons get their notification on the new network device. I could have dug deeper into this, but ended up with a rather ugly solution. I’m sure this can be done better, but I’ve wasted enough time on this — please comment below if you know how.

Setting up a systemd service

The very systemd way to run a script when a networking device appears is to add a service. Namely, add this file as /etc/systemd/system/eth2-up.service:

[Unit]
Description=Restart dhcp when eth2 is up

[Service]
ExecStart=/bin/sleep 10 ; /bin/systemctl restart my-dhcpd
Type=oneshot

[Install]
WantedBy=sys-subsystem-net-devices-eth2.device

And then activate the service:

# systemctl daemon-reload
# systemctl enable eth2-up

The concept is simple: A one-shot service is pulled in by the relevant device unit. When the device appears, what’s on ExecStart is run, the DHCP server is restarted, end of story.

I promised ugly, didn’t I: Note the 10 second sleep before kicking off the daemon restart. This is required because the service is launched when the networking device appears, and not when it’s fully configured. So starting the DHCP daemon right away misses the point (or simply put: It doesn’t work).

I guess the DHCP daemon will be restarted one extra time on boot due to this extra service. In that sense, the 10 seconds delay is possibly better than restarting it soon after, or while, it is being started by systemd in general.

So with the service activated, this is what the log looks like (the restarting of the DHCP server not included):

Nov 11 11:25:54 kernel: usb 1-12: new high-speed USB device number 125 using xhci_hcd
Nov 11 11:25:54 kernel: usb 1-12: New USB device found, idVendor=0bda, idProduct=8153
Nov 11 11:25:54 kernel: usb 1-12: New USB device strings: Mfr=1, Product=2, SerialNumber=6
Nov 11 11:25:54 kernel: usb 1-12: Product: USB 10/100/1000 LAN
Nov 11 11:25:54 kernel: usb 1-12: Manufacturer: Realtek
Nov 11 11:25:54 kernel: usb 1-12: SerialNumber: 001000001
Nov 11 11:25:55 kernel: usb 1-12: reset high-speed USB device number 125 using xhci_hcd
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001002
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001002
Nov 11 11:25:55 kernel: r8152 1-12:1.0 eth2: v1.09.9
Nov 11 11:25:55 mtp-probe[59372]: checking bus 1, device 125: "/sys/devices/pci0000:00/0000:00:14.0/usb1/1-12"
Nov 11 11:25:55 mtp-probe[59372]: bus: 1, device: 125 was not an MTP device
Nov 11 11:25:55 upowerd[2203]: unhandled action 'bind' on /sys/devices/pci0000:00/0000:00:14.0/usb1/1-12
Nov 11 11:25:55 systemd-networkd[65515]: ppp0: Link is not managed by us
Nov 11 11:25:55 systemd-networkd[65515]: vmnet8: Link is not managed by us
Nov 11 11:25:55 systemd-networkd[65515]: vmnet1: Link is not managed by us
Nov 11 11:25:55 networkd-dispatcher[1140]: WARNING:Unknown index 848 seen, reloading interface list
Nov 11 11:25:55 systemd-networkd[65515]: lo: Link is not managed by us
Nov 11 11:25:55 systemd-networkd[65515]: eth2: IPv6 successfully enabled
Nov 11 11:25:55 systemd[1]: Starting Restart dhcp when eth2 is up...
Nov 11 11:25:55 kernel: IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 vmnetBridge[1620]: Adding interface eth2 index:848
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001043
Nov 11 11:25:55 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.
Nov 11 11:25:55 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001003
Nov 11 11:25:55 vmnetBridge[1620]: Removing interface eth2 index:848
Nov 11 11:25:55 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00001003
Nov 11 11:25:55 upowerd[2203]: unhandled action 'bind' on /sys/devices/pci0000:00/0000:00:14.0/usb1/1-12/1-12:1.0
Nov 11 11:25:55 kernel: IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
Nov 11 11:25:55 systemd-timesyncd[1101]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Nov 11 11:25:55 kernel: userif-3: sent link down event.
Nov 11 11:25:55 kernel: userif-3: sent link up event.
Nov 11 11:25:57 vmnetBridge[1620]: RTM_NEWLINK: name:eth2 index:848 flags:0x00011043
Nov 11 11:25:57 vmnetBridge[1620]: Adding interface eth2 index:848
Nov 11 11:25:57 systemd-networkd[65515]: eth2: Gained carrier
Nov 11 11:25:57 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.
Nov 11 11:25:57 avahi-daemon[1115]: Joining mDNS multicast group on interface eth2.IPv4 with address 10.20.30.1.
Nov 11 11:25:57 avahi-daemon[1115]: New relevant interface eth2.IPv4 for mDNS.
Nov 11 11:25:57 avahi-daemon[1115]: Registering new address record for 10.20.30.1 on eth2.IPv4.
Nov 11 11:25:57 kernel: r8152 1-12:1.0 eth2: carrier on
Nov 11 11:25:57 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Nov 11 11:25:57 vmnet-natd[1845]: RTM_NEWLINK: name:eth2 index:848 flags:0x00011043
Nov 11 11:25:57 vmnet-natd[1845]: RTM_NEWADDR: index:848, addr:10.20.30.1
Nov 11 11:25:57 systemd-timesyncd[1101]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Nov 11 11:25:58 kernel: userif-3: sent link down event.
Nov 11 11:25:58 kernel: userif-3: sent link up event.
Nov 11 11:25:59 avahi-daemon[1115]: Joining mDNS multicast group on interface eth2.IPv6 with address fe80::2e0:4cff:fe68:71d.
Nov 11 11:25:59 avahi-daemon[1115]: New relevant interface eth2.IPv6 for mDNS.
Nov 11 11:25:59 systemd-networkd[65515]: eth2: Gained IPv6LL
Nov 11 11:25:59 avahi-daemon[1115]: Registering new address record for fe80::2e0:4cff:fe68:71d on eth2.*.
Nov 11 11:25:59 systemd-networkd[65515]: eth2: Configured
Nov 11 11:25:59 systemd-timesyncd[1101]: Network configuration changed, trying to establish connection.
Nov 11 11:25:59 systemd-timesyncd[1101]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).

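The two lines worth attention above are systemd’s “Starting Restart dhcp when eth2 is up...” at 11:25:55 and systemd-networkd’s “eth2: Configured” at 11:25:59: There are 4 seconds between the activation of the script and systemd-networkd’s declaration that it’s finished with it.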
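The 4 seconds are simply the difference between two timestamps in the log above. For comparing other pairs of lines, something like this does the job (the year is an assumption taken from the date of this post, since syslog timestamps don’t carry one):

```python
from datetime import datetime


def log_time(line: str) -> datetime:
    """Parse the "Nov 11 11:25:55" prefix of a syslog line."""
    # The year is not in the log; 2019 is assumed
    return datetime.strptime("2019 " + line[:15], "%Y %b %d %H:%M:%S")


started = "Nov 11 11:25:55 systemd[1]: Starting Restart dhcp when eth2 is up..."
configured = "Nov 11 11:25:59 systemd-networkd[65515]: eth2: Configured"

gap = (log_time(configured) - log_time(started)).total_seconds()
print(gap)  # 4.0
```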

It would have been much nicer to kick off the script where systemd-timesyncd detects the change for the second time. It would have been much wiser had WantedBy=sys-subsystem-net-devices-eth2.device meant that the target is reached when it’s actually configured. Once again, if someone has an idea, please comment below.
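One direction I haven’t tried on this setup: Replace the fixed sleep with a small script that polls the interface’s state in sysfs, and returns as soon as the link is up. The sketch below is Python, the function name and the timeout values are arbitrary, and it relies on the operstate file that the kernel exposes for each network interface. Whether “link is up” is late enough for the DHCP daemon is an assumption, not something I’ve verified.

```python
import os
import time


def wait_for_link(iface, timeout=30.0, poll=0.5, sysfs_root="/sys/class/net"):
    """Poll <sysfs_root>/<iface>/operstate until it reads "up".

    Returns True once the link is up, False if the timeout expires first.
    """
    path = os.path.join(sysfs_root, iface, "operstate")
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as f:
                if f.read().strip() == "up":
                    return True
        except OSError:
            pass  # interface not there (yet)
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


if __name__ == "__main__":
    # Exit status 0 if eth2 came up in time, 1 otherwise
    raise SystemExit(0 if wait_for_link("eth2") else 1)
```

The sysfs_root parameter is there only to allow trying this out against a fake directory tree.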

A udev rule instead

The truth is that I started off with a udev rule first, ran into the problem with the DHCP server being restarted too early, and tried to solve it with systemd as shown above, hoping that it would work better. The bottom line is that it’s effectively the same. So here’s the udev rule, which I kept as /etc/udev/rules.d/99-network-dongle.rules:

SUBSYSTEM=="net", ACTION=="add", KERNEL=="eth2", RUN+="/bin/sh -c '/bin/sleep 10 ; /bin/systemctl restart my-dhcpd'"

Note that I nail down the device by its name (eth2). It would have been nicer to do it based upon the USB device’s Vendor / Product IDs, however that failed for me. Somehow, it didn’t match when using SUBSYSTEM=="usb". How I ensure the repeatable eth2 name is explained in this post.

Also worth noting are these two commands, which help getting the udev rule together:

# udevadm test -a add /sys/class/net/eth2
# udevadm info -a /sys/class/net/eth2
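As for the Vendor / Product IDs: These are attributes of the USB device that is a parent of eth2, not of the network device itself, which is also what udevadm info -a shows when it walks up the chain of parent devices. A sketch of where they sit in sysfs, assuming the usual layout for a USB NIC (the function name is mine, and the sysfs_root parameter exists only for experimenting with a fake tree):

```python
import os


def usb_ids(iface, sysfs_root="/sys/class/net"):
    """Return (idVendor, idProduct) of the USB device behind iface, or None."""
    # <iface>/device resolves to the USB interface (e.g. .../1-12/1-12:1.0).
    # Its parent directory is the USB device, which holds idVendor / idProduct.
    usb_interface = os.path.realpath(os.path.join(sysfs_root, iface, "device"))
    usb_device = os.path.dirname(usb_interface)
    try:
        with open(os.path.join(usb_device, "idVendor")) as v, \
             open(os.path.join(usb_device, "idProduct")) as p:
            return v.read().strip(), p.read().strip()
    except OSError:
        return None  # not a USB device, or no such interface


if __name__ == "__main__":
    print(usb_ids("eth2"))
```

With the dongle from the log above, I’d expect this to come up with 0bda and 8153.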

↧