[11:11] <Laney> bah
[11:11] <Laney> now I can't reproduce https://bugs.launchpad.net/bugs/1730717 very well
[11:11] <Laney> is there something I can install on all the test VMs to put them into high load easily?
[11:15] <Laney> guess I can write a userdata thing to install stress-ng and run that?
[11:29] <apw> right that thing will make it sad in all sorts of different ways, is it cpu load you need
[11:31] <Laney> not sure
[11:31] <Laney> we were suspecting it happens more when the cloud is 'busy'
[11:31] <Laney> but don't know in which way specifically
[11:31] <Laney> I'll just turn on all the things I guess
[11:32]  * Laney has cleared out lcy01 from adt instances for the time being
[11:47] <Laney> GREAT now I can't schedule instances at all
[11:57] <Laney> ubuntu@laney-test1:~$ uptime 11:57:35 up 3 min,  2 users,  load average: 509.30, 210.40, 79.27
[11:57] <Laney> that's working then :-)
[12:09] <apw> heheh
[12:16] <TJ-> Any ideas how to further diagnose an ecryptfs mount failure, when the internal mount() call returns "No such file or directory"? This is on bare-metal and both source and target are accessible.
[12:17] <TJ-> Is there something internal via  debugfs I could watch?
[12:44] <Laney> 50 stressy units coming our way
[13:29] <apw> TJ-, hmmm
[13:31] <apw> TJ-, nothing in dmesg at the time it occurs ?
[13:33] <apw> TJ-, all the things in ecryptfs that could return ENOENT seem to report something in dmesg first
[13:33] <apw> TJ-, otherwise it is generic
[13:35] <TJ-> apw: no; it's actually quite a severe problem I've been working on for a few days with an Ubuntu 17.04 user. They recently used the GUI to change their user password, after which they could not log-in. Turns out the encrypted home was no longer being unlocked, suggesting the GUI somehow didn't rewrap the ecryptfs passphrase. The user has the original key (hex passphrase) but that wasn't working either.
[13:35] <TJ->  Then I spotted the Private.sig entries were not the current keys, so wrote a script to automate test and mount. We found /sbin/mount.ecryptfs_private was failing due to the underlying "mount()" call reporting "No such file or directory". So, looking for what the real cause is.
[13:35] <TJ-> This was discovered via strace
[13:36] <apw> TJ-, and nothing in dmesg, hmm
[13:36] <TJ-> apw: No
[13:36] <apw> what were the parameters to mount when it failed
[13:36] <apw> in your strace output, PM me if they might be sensitive
[13:37] <TJ-> They're elided unfortunately, but I used the source code to figure out the entire mount command and we tried that manually. Of course that wouldn't work, since mount needs calling with sudo but the root user doesn't have the user's keyring attached
[13:38] <TJ-> apw: The user account name is "a1". Everything about it checks out (ownership, permissions). Here's the strace: http://paste.ubuntu.com/26023608/
[13:39] <TJ-> This'd be the equivalent manually: sudo mount -t ecryptfs /home/a1/.Private /home/a1  -o rw,nosuid,nodev,relatime,ecryptfs_fnek_sig=782cb407b85d0079,ecryptfs_sig=769688550d78ced9,ecryptfs_cipher=aes,ecryptfs_key_bytes=16,ecryptfs_unlink_sigs
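For reference, the option string in that manual command can be assembled from the two key signatures. This is only a minimal sketch: the function name `ecryptfs_mount_options` is hypothetical, the fixed options are copied from the command TJ- pasted, and the check that each signature is 16 lowercase hex characters is an assumption based on the two signatures shown in the log.

```python
HEX_DIGITS = set("0123456789abcdef")


def ecryptfs_mount_options(sig, fnek_sig, cipher="aes", key_bytes=16):
    """Build the -o option string for a manual ecryptfs mount.

    Hypothetical helper: the option names and order mirror the command
    quoted in the log; nothing here talks to the kernel or the keyring.
    """
    for name, value in (("ecryptfs_sig", sig), ("ecryptfs_fnek_sig", fnek_sig)):
        # Assumption: signatures are 16 lowercase hex characters, like the
        # two values quoted in the log.
        if len(value) != 16 or not set(value) <= HEX_DIGITS:
            raise ValueError(f"{name} does not look like a key signature: {value!r}")
    return ",".join([
        "rw", "nosuid", "nodev", "relatime",
        f"ecryptfs_fnek_sig={fnek_sig}",
        f"ecryptfs_sig={sig}",
        f"ecryptfs_cipher={cipher}",
        f"ecryptfs_key_bytes={key_bytes}",
        "ecryptfs_unlink_sigs",
    ])


if __name__ == "__main__":
    # The two signatures quoted in the log.
    print(ecryptfs_mount_options("769688550d78ced9", "782cb407b85d0079"))
```

The result would be passed after `-o` in the `sudo mount -t ecryptfs` command above. As discussed in the log, the mount can still fail if the matching keys are not in the keyring of the user doing the mount.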
[13:40] <apw> and .Private exists
[13:40] <TJ-> apw: oh yes, everything is there. Pastebin's available if you want to check
[13:40] <apw> no i trust you
[13:41] <TJ-> I may have missed something :)
[13:42] <TJ-> Interesting aside - I had to make strace setuid-root to be able to capture the trace, else the setreuid() failed
[13:43] <apw> TJ-, so if those signatures were not for keys the filesystem contains
[13:43] <apw> then you would get ENOENT
[13:44] <TJ-> The key sigs matched the ones in Private.sig 
[13:44] <apw> those are outside the filesystem though
[13:44] <apw> that is nothing to do with what the filesystem has inside
[13:46] <apw> anyhow, maybe not that, i wonder what else
[13:47] <TJ-> Right. The thing is, a few days ago when trying to manually mount, the user may have accidentally mistyped the hex passphrase. The mount succeeded but obviously the content was garbage. I got him to unmount immediately, but the few files that got updated show up in .Private/ because the key-sig encodings changed for those files (in the filenames)
[13:48] <apw> so they have mounted using the wrong passphrase over that directory
[13:48] <TJ-> Yes, once.
[13:49] <apw> when you say they show up in .Private, when ?
[13:49] <apw> or are they the only things in there now
[13:49] <TJ-> But you can see it here... starting line 55 http://paste.ubuntu.com/26023177/
[13:49] <TJ-> you'll see the original files-sigs prior to Nov 20th, and then the 15-ish changed when the wrong key was used
[13:52] <TJ-> I wonder, is there a way to convert the filename encoded signatures back to hex so they can be matched to the Private.sig entries?
[13:55] <apw> ok there is definitely no report for ecryptfs in dmesg, if that is so
[13:55] <TJ-> no
[13:55] <apw> then the ecryptfs mount did not fail, it would always have reported something itself if it fails
[13:55] <TJ-> right, which is why I was asking for inspiration to debug this further :)
[13:57] <TJ-> The user has a keepassxc database in there, else he'd just redo the entire thing
[13:58] <apw> where is . in that mount, is that home ?
[13:59] <apw> have you tried mounting it somewhere else
[14:02] <apw> it would return ENOENT if . was a deleted directory
[14:03] <TJ-> Yes, it is. The "." is hard coded in mount.ecryptfs_private! That was the first thing I chased down, thinking it was wrong
[14:03] <apw> but you could like
[14:03] <apw> mkdir tmp
[14:03] <apw> cd tmp
[14:03] <apw> and run it there ?
[14:04] <TJ-> circa line 687 of src/utils/mount.ecryptfs_private.c:: if (mount(src, ".", FSTYPE, MS_NOSUID | MS_NODEV, opt) == 0) {
[14:04] <TJ-> Tried that too; doing a  "pushd /tmp" just before the call to /sbin/mount.ecryptfs_private (in my test script)
[14:05] <apw> you likely could not mount it over /tmp
[14:05] <apw> for other reasons
[14:05] <apw> it will also return ENOENT if the "." in this context is marked dont_mount, which also seems to be removed/renamed things
[14:06] <apw> have you tried mounting a ramfs over the "."
[14:06] <apw> that might tell you if any of those tests would fail
[14:06] <TJ-> No, but it doesn't work that way in mount.ecryptfs_private - it calls getpwuid(), then does a chdir() to the home, *then* uses "."
[14:07] <apw> you could try the tmpfs thing then
[14:07] <apw> see if "." is sick for some other reason
[14:07] <TJ-> No, haven't tried mounting ramfs over ., but . will always be $HOME (as in what getent passwd a1 shows)
[14:07] <apw>         printk(KERN_ERR "%s; rc = [%d]\n", err, rc);
[14:08] <TJ-> I've run the same tests exhaustively on multiple test user accounts whilst messing up the sigs etc, and not been able to recreate this mount fail
[14:08] <apw> as we have that ^ in ecryptfs_mount, i don't see how it can be ecryptfs related, so it should fail for other things
[14:09] <TJ-> I wondered if it was keyring related, but the user wrote down the original hex key and is using that now
[14:09] <apw> how did you look for the errors from btw
[14:10] <TJ-> errors from...? I'm not sure I understand what you're asking?
[14:11] <apw> in dmesg
[14:11] <TJ-> oh, got the user to pastebin the log
[14:11] <apw> have you got that pastebin ?
[14:12] <TJ-> Not in a tab right now... I only kept the ones that had 'interesting' info - e.g. something related. We got through about 60 pastebins!
[14:12] <apw> as the messages are not tagged for ecryptfs
[14:12] <TJ-> But, I told him yesterday I'd seek advice then we'd do another debug session, so I can collect that
[14:13] <apw> it would say like "Getting sb failed; rc=N"
[14:13] <apw> rather than being obviously related
[14:13] <apw> rubbish errors for the win
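Since these messages carry no "ecryptfs" tag, a plain grep for the module name would miss them. A minimal sketch of scanning a saved dmesg capture for the `"%s; rc = [%d]"` shape of the printk apw quoted: the function name `find_rc_lines` and the sample log lines are made up for illustration, and the exact message text in a real kernel may differ from what was quoted from memory.

```python
import errno
import re

# Matches lines ending in "<message>; rc = [<number>]", with an optional
# leading dmesg timestamp such as "[  102.500000] ".
RC_LINE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?(?P<message>.*); rc = \[(?P<rc>-?\d+)\]\s*$"
)


def find_rc_lines(dmesg_text):
    """Return (message, rc, errno_name) for every 'rc = [N]' style line."""
    hits = []
    for line in dmesg_text.splitlines():
        match = RC_LINE.match(line)
        if match:
            rc = int(match.group("rc"))
            # Kernel return codes are negative errno values, so -2 -> ENOENT.
            name = errno.errorcode.get(-rc, "unknown")
            hits.append((match.group("message"), rc, name))
    return hits


if __name__ == "__main__":
    # Made-up sample capture, not taken from the user's machine.
    sample = (
        "[  101.000001] EXT4-fs (sda1): mounted filesystem\n"
        "[  102.500000] Getting sb failed; rc = [-2]\n"
    )
    for message, rc, name in find_rc_lines(sample):
        print(f"{message!r} rc={rc} ({name})")
```

This only narrows down candidate lines. Other subsystems may log in the same shape, so each hit still needs reading in context.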
[14:13] <TJ-> I was reading every line *very* carefully, also had "dmesg -w | tail" to capture anything that happened whilst the mount command was running
[14:14] <TJ-> I'll capture a dmesg run with the user next time I make contact
[14:14] <TJ-> I can get him to boot with 'debug' too which might add something
[14:16] <TJ-> in case it's of interest, this is the test/diagnostic script I've created: http://iam.tj/projects/ubuntu/ecryptfs-regenerate-wrapper_signatures_mount_test
[14:18] <apw> i cannot see anything outside of ecryptfs which would make this mount fail that would not, i believe, stop any mount
[14:18] <apw> and if it was inside, then it must seemingly be logged in dmesg
[14:19] <apw> so i guess confirming dmesg is clear, and that mounting tmpfs or something in the same place first
[14:19] <apw> to confirm something can be mounted there
[14:19] <apw> other than that, hrm, you need some debug in your kernel i reckon
[14:20] <TJ-> Yeah... it's a weird one. What annoys me is, the origin of all this is the GUI failing to rewrap the passphrase on password-change; and I've hit that myself a few times at random since 14.04 at least but never figured out why
[14:21] <TJ-> well actually not even failing to rewrap... it does rewrap, but doesn't use the new user password, which is worse, so trying to unwrap with old or new password fails
[14:21] <apw> no i don't think i have ever really been comfortable with ecryptfs and its madness
[14:21] <apw> i am glad we are moving to using native encryption on the filesystem
[14:22] <TJ-> Thanks for your help on this, I'm glad it's not just me missing something stupidly obvious
[14:22] <TJ-> I was planning on bugging Tyler with it :)
[14:23] <apw> possibly still worth it
[14:23] <TJ-> Yes... I'm going to write up a cogent bug report that doesn't look like a novel :)
[14:24] <apw> TJ-, i do wonder if there is any mileage in moving the things with the wrong FNEK out of there
[14:24] <apw> i can't see it checking if any of them are bad as part of mount, but
[14:25] <TJ-> apw: I was planning on making an entire copy to another location, so we can test safely :)
[14:26] <TJ-> The 'wrong key files' issue was the user logged in at the console and the $HOME/.ecryptfs/auto-mount file was there so pam_ecryptfs did the mount. As soon as I realised I had that deleted
[14:52] <TJ-> apw: I think you're onto something with the different keys. Looking at the kernel source, all the -ENOENT are in keystore.c and most seem related to calls to ecryptfs_find_auth_tok_for_sig() ... looking at calls to that function, one that stands out is ecryptfs_parse_tag_70_packet() which parses the signatures out of the filename. Now, if the lower superblock is mounted and then it is trying to decrypt
[14:52] <TJ-> one of those filenames encoded with the possibly-mistyped-hex-key that might be triggering -ENOENT, but related to the keyring, not the file-system
[14:53] <apw> that should be testable, make two test mounts and umount them, then copy a file from one to the other
[14:54] <apw> and see if that blows its brains up
[14:55] <TJ-> Yes
[14:56] <TJ-> I'm going to do that here after I've run the Huskies into the ground :)
[14:56] <apw> if it does that is one for the faq
[14:56] <TJ-> I need the fresh air after all this digging
[14:56] <apw> (of doom)
[14:56] <apw> heh i know the feeling, lick
[14:56] <TJ-> hehehe yeah
[14:56] <apw> luck, and lick if there are huskies
[14:57] <TJ-> We've got 50kmh wind gusts just now so I should take a hang glider out and let them tow me :D
[14:58] <apw> heh, a plan indeed, and much more fun than reading kernel code slowly