[13:38] cjwatson, wgrant, good $time_of_the_day! If you have a few spare minutes, can you tell me if you know of ways in Launchpad scripts to close a connection manually and if that has any (side-)effects when the script tries to access the database again? I ask because after the transaction commit following https://git.launchpad.net/launchpad/tree/lib/lp/archivepublisher/scripts/generate_contents_files.py#n329, the generate-content-files script [13:38] keeps a database connection open for hours and hours together, coming in the way of fastdowntime deployments. [13:39] I am reading the source for examples but haven't find any leads so far. [13:40] This is a somewhat recent issue and I wouldn't be surprised if it starteed happening after the Postgres 12 upgrade last August when wgrant was still working for Canonical. [13:40] But I don't have the history for this. [13:40] I thought fastdowntimes only cared about open _transactions_, not open _sessions_. [13:41] And the commit ends the transaction, by definition. [13:41] I thought so too after our last conversation on this topic. [13:41] But let me double check [13:42] The pg_stat_activity row that the preflight check is complaining about as an open transaction, is in idle state and the query is 'COMMIT' (which I understand to be the last query executed on that connection) [13:42] Ah, it's possible that the changes I had to make to FRAGILE_USERS (database/schema/preflight.py) means that check_fragile_connections now trips on generate-contents-files. [13:43] wait_event_type = Client, wait_event = ClientRead [13:43] It is possible since it uses the same ftpmaster juju postgres user [13:43] See the comment at the top of FRAGILE_USERS. I suspect there is a way to fix this but I ran into this quite close to the end of my time on the team and didn't have time to figure it out. [13:43] https://code.launchpad.net/~cjwatson/launchpad/+git/launchpad/+merge/456105 [13:45] Now that you've upgraded to a newer PostgreSQL, can you see whether there's any way to determine the actual role that you've switched to using SET ROLE, rather than the original user that you connected as? [13:46] Or maybe you can come up with some other way to detect the fragile cases reliably. But that is the area you need to be looking in. [13:47] Before switching to Juju, the various scripts there connected using different DB users so it was easier to tell them apart during preflight. [13:50] https://stackoverflow.com/questions/20557702/which-postgres-system-table-stores-a-map-of-pids-to-session-authorizations, but that was a long time ago ... [13:51] Also https://www.postgresql.org/message-id/48B503A7.30606%40guruhut.com, similarly ancient [13:57] guruprasad: If you could cause pg_stat_activity.application_name to be set, that would work too - it doesn't _have_ to be the role, just something that lets preflight distinguish between the various scripts. That's set when you connect (see https://www.postgresql.org/docs/17/libpq-connect.html), but it doesn't require different credentials - so maybe add an option in lp.services.scripts.base [13:57] similar to dbuser to allow a script to add an application_name to the connection string, and then preflight would have something to go on? [13:57] guruprasad: Untested, but if it works then I think that would be my favourite option by far. [13:59] guruprasad: LaunchpadScript could maybe even just stick application_name={self.name} into the connection string, and then you wouldn't have to modify any individual scripts. [13:59] Thanks. This is helpful. [14:00] Last week during the Postgres focal upgrade, we noticed the same script running holding a connection [14:00] When we killed it on pgbouncer, it looked like the script had terminated too. [14:00] Yeah, I suspected this would come up at some point. [14:00] So I thought that connections were important too, not just transpactions. [14:00] *transactions [14:00] Not really, just an accident emerging from the rest of how preflight is implemented. [14:01] An open connection outside a transaction isn't a problem for DDL. [14:01] So are you saying that a kill on the pgbouncer for applying DDL changes shouldn't kill a script like generate-contents-files? [14:01] I'm saying you should arrange not to have to kill it in the first place :) [14:02] That makes a lot of sense but is also confusing. [14:02] So let me go read up. [14:02] The problem is just that the detection of running fragile scripts is misfiring. [14:02] generate-contents-files deliberately arranges to do most of its slow work outside a transaction in order that it won't cause a problem for DB upgrades. [14:02] (and other things that are sensitive to such things) [14:03] But I introduced a regression as part of the migration to Juju that made that detection have false positives. [14:03] I knew about it, but at the time it was the least bad alternative. [14:04] I think I looked briefly at application_name at the time but only noticed that it wasn't set usefully and didn't look into how to set it. Now that I see how to set it, it seems to me that that's obviously a better option. [14:08] This makes sense [14:49] By "you should arrange to not have it kill it in the first place", do you mean the generate-contents-files script should handle the database connection getting killed while it is doing non-DB operations under the DatabaseBlockedPolicy context and a fastdowntime deployment happens? [14:52] guruprasad: No, I mean that preflight should be fixed so it doesn't kill a generate-contents-files script that only has a connection open, not a transaction. [14:52] guruprasad: There's no need for it to kill it. It only thinks it needs to do so because of the regression described above. [14:53] (preflight/full-update/whatever) [14:55] But the actual <10s outage during the DDL application will kill connections on pgbouncer, no? [14:56] Maybe. I wouldn't worry about that, though. [14:56] If you can make it more robust, fine, but otherwise it will catch up the next day. [14:56] The main thing is to not have it get in the way such that it requires manual work. [14:57] Yeah, that makes sense. [14:58] The only case where this is actually a real problem is if it means Contents-* files end up being out of date in a release because the DB deployment was on the same day as the release ... I think that might have happened once, but it's rare since the LP team is usually not mad enough to try DB deployments on the same day as an Ubuntu release :-) [14:59] (Or the night before, I guess, since it runs at 21:02.) [14:59] LP team was frustrated by not being able to deploy DB changes and mad enough that they decided to attempt it on release day - 2, stopped services for nearly 2 hours before giving up. [15:00] And by which time, Clinton had received a lot of questions and escalations :) [15:00] Even in that case, we just get the previous day's Contents-* files, which are usually close enough. [15:00] They're just lists of file names, so last-minute changes usually don't affect them. [15:01] (much, anyway) [15:02] It used to be the case that LP changes were frozen around Ubuntu releases. We relaxed that a lot as our deployment processes got better, but DB deployments are disruptive enough that I'd suggest it should still be a rule for them.