[00:00] tgardner: (Took a meeting break) Yes, I should be able to take care of what is needed, though RightScale keeps changing the best way to set up their integration software with Ubuntu on EC2.
[00:26] ogasawara: 9a7e849 boots ok
[00:27] kees: cool, I expected it to pass looking at the commit
[00:27] I guess it's 9908ff736adf261e749b4887486a32ffa209304c causing the problem then?
[00:28] kees: yah, that's what the bisect is pointing to
[00:29] kees: k, I'm going to try revert just that commit and build a final test kernel
[00:29] ogasawara: sweet
[00:44] ccheney, http://kernel.ubuntu.com/~rtg/mainline/4f570f995f68ef77aae7e5a441222f59232f2d0e-maverick/linux-image-2.6.32-020632rc3g4f570f9-generic_2.6.32-020632rc3g4f570f9_amd64.deb
=== kamal-away is now known as kamal
[00:59] kees: this is the latest 2.6.35-6.7 Maverick kernel with the 9908ff7 reverted - http://people.canonical.com/~ogasawara/lp597075/2.6.35-6.7-9908ff7/
[00:59] tgardner-afk, thanks
[00:59] kees: I'm still building the mainline kernel with the commit reverted
[01:05] tgardner-afk, looks good
[01:05] be back in about 30m
[01:12] kees: http://people.canonical.com/~ogasawara/lp597075/2.6.35-rc3-9908ff7/ - that's the latest 2.6.35-rc3 mainline kernel with 9908ff7 reverted
[01:16] ogasawara: lp597075/2.6.35-6.7-9908ff7 did not boot :(
[01:16] trying the mainline next...
[01:16] kees: ugh, was afraid of that
[01:23] ogasawara: :( lp597075/2.6.35-rc3-9908ff7 did not boot either.
[01:23] boo
[01:23] ogasawara: this is really odd. are there maybe multiple problems?
[01:23] kees: I'm thinking it's a commit outside of i915 that might be triggering the i915 issue
[01:24] oh, the bisection was targeting only changes in i915, but there were other commits between it?
[01:24] kees: yep
[01:24] gotcha. craps.
[01:25] well, if you can show me the commands you're running, I can take this over maybe. I have access to the builders.
[01:25] * ccheney back
[01:25] kees: ooh, have access to tyler?
[01:25] tgardner-afk, ok yea your 4f57f9 kernel was good
[01:25] kees: that's what I was using as the build box
[01:27] ogasawara: yup, I'm on tyler
[01:28] kees: git clone git://kernel.ubuntu.com/ubuntu/kteam-tools.git
[01:28] done
[01:29] kees: cp /home/ogasawara/kteam-tools/kteam-tools/mainline-build/mainline-build-one to /home/kees/kteam-tools/kteam-tools/mainline-build/mainline-build-one
[01:30] kees: It's just a tweaked version of the script to use the tyler chroots
[01:30] ogasawara, did you swipe the one I've been using on emerald?
[01:31] ogasawara: okay (removed extra /kteam-tools from path, but good now)
[01:31] tgardner-afk: nope, saw you'd added a README to the repo and then tweaked the script to make it work for me on tyler
[01:31] I'm gonna spend tomorrow making them work on any machine, not just zinc
[01:31] kees: oops, copy and paste error
[01:32] tgardner-afk: that'd be awesome, I found it really useful doing these mainline bisects for kees
[01:32] ogasawara, yep, I've been doing 'em for 2 days now
[01:33] kees: so I did a mkdir lp597075
=== tgardner-afk is now known as tgardner
[01:33] kees: just to keep things separate
[01:33] tgardner: The currently published Maverick kernel on EC2 is 2.6.32-305.9. Do you know if the "scheduler fixes" you mentioned are available in that version or would we need to wait for the next release?
[01:33] ogasawara: okay
[01:33] kees: then in lp597075 I cloned the mainline git tree, git clone --reference /usr3/ubuntu/linux-2.6.git/ /usr3/ubuntu/linux-2.6.git/
[01:34] okay, done
[01:34] kees: cd linux-2.6
[01:35] kees: then run the following to build `/home/kees/kteam-tools/mainline-build/mainline-build-one maverick`
[01:35] which sha1 do I want to start with?
[01:35] erichammond, hmm, I have not backported the EC2 kernel from Maverick. All I have are the standard kernels. I may have misled you into thinking that a standard kernel would work, but that's only for UEC, not Amazon, right?
[01:36] kees: ah, so in a separate linux-2.6 tree I kept track of the bisect
[01:36] kees: git bisect start v2.6.35-rc2 v2.6.35-rc1
[01:37] tgardner: We are only interested in Amazon EC2 kernels as this is where the application servers run.
[01:37] separate tree? ooooh. one tree doing the bisect, the other tree doing the builds?
[01:37] kees: right
[01:37] kees: I just liked being able to read the bisect output uninterrupted
[01:37] ogasawara: and now I just get to do a full-blown bisect without limits
[01:37] erichammond, right. I don't think jjohansen has a working Maverick kernel for EC2 yet, correct?
[01:37] kees: I think I could replay it, but am too lazy to look it up
[01:38] ogasawara, you know it keeps a log? .git/BISECT_LOG
[01:38] kees: yep :( looked like around ~1500 commits between rc1 and rc2
[01:38] tgardner: there is?
[01:38] tgardner, erichammond: yes I am trying to get the EC2 kernel for Maverick up
[01:38] ogasawara: so the sha1 I give is the one git bisect spits out?
[01:39] i.e.:
[01:39] 6$ git bisect start v2.6.35-rc2 v2.6.35-rc1
[01:39] Bisecting: 373 revisions left to test after this (roughly 9 steps)
[01:39] kees: yep
[01:39] [b1413357d924792e2e332dcb6b712a7fb2a5fb25] fbdev: fix frame buffer devices menu
[01:39] ^^^ that?
[01:39] tgardner, jjohansen: I saw a note about that in the latest server-team meeting summary, but I was able to get the Maverick alpha-1 AMI running on EC2.
[01:39] kees: yah, so use b1413..
[01:39] tgardner: well, I'll be damned. /me makes a note
[01:39] erichammond: right
[01:39] ogasawara: does mainline-build-one correctly maximize CPU parallelization in the build
[01:39] ?
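(Editor's aside on the bisect above.) The idea behind `git bisect` can be sketched as a plain binary search over an ordered list of commits. This is only an illustration under simplifying assumptions: real kernel history is a graph rather than a list, git's own step estimate is computed differently, and the names `first_bad` and `is_bad` are hypothetical, not part of git or kteam-tools.

```python
import math


def first_bad(commits, is_bad):
    """Return (first bad commit, number of tests run).

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is also bad (linear history).
    """
    lo, hi = 0, len(commits) - 1   # lo: known good, hi: known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1                 # one "build a kernel and boot it" cycle
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], tests


# Hypothetical history of ~1500 commits, as between -rc1 and -rc2 in the log.
commits = list(range(1500))
culprit, tests = first_bad(commits, lambda c: c >= 1234)
upper_bound = math.ceil(math.log2(len(commits)))
```

With roughly 1500 commits the search needs at most 11 build-and-boot cycles. It also shows why a bisect restricted to the i915 paths can mislead: any commit outside the searched set is never a candidate, so the culprit may be missed.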
[01:39] erichammond, my understanding from jjohansen is that it still fails in some zones
[01:39] kees: not sure about that, but it'd crank em out pretty fast
[01:40] ogasawara, kees: it does
[01:40] tgardner, erichammond: specifically the pv-ops stuff fails in some zones, the full Xen has other bugs, and problems
[01:40] fatal: 'maverick' does not appear to be a git repository
[01:40] ogasawara: I seem to be missing something
[01:41] kees: shoot, forgot to have you add the remote
[01:41] ah-ha!
[01:41] tgardner, jjohansen: For this Asterisk test, we only need the 64-bit kernel in us-east-1, but it sounds like it will need to wait anyway.
[01:41] ogasawara: git remote add maverick /usr3/ubuntu/ubuntu-maverick.git/ ?
[01:41] kees: git remote add maverick git://kernel.ubuntu.com/ubuntu/ubuntu-maverick.git
[01:41] erichammond: try me again tomorrow, I have another kernel I will be testing soon
[01:42] kees, for i in dapper hardy jaunty karmic lucid maverick; do git remote add $i git://kernel.ubuntu.com/ubuntu/ubuntu-$i.git; done
[01:42] oh, not the local one?
[01:42] kees: you could probably use the local one too
[01:42] kees: it's sync'd regularly
[01:42] kees, it doesn't make much difference after the first pull
[01:43] ogasawara, tgardner: cool, now it's rockin' :)
[01:43] kees: sweet
[01:44] kees, tyler is a wuss, you oughta try emerald.
[01:44] or tangerine
[01:44] tgardner: I was doing my Yama builds on tangerine. woof.
[01:44] tangerine and emerald are basically the same machine
[01:45] I wonder how much ccache would help when doing a bisect
[01:46] kees, I found on emerald and tangerine it makes about a 10 second diff over a full 3 arch pass on Lucid
[01:46] so, not much
[01:46] holy cow, that's not so great. :P
[01:48] kees: ok, I gotta bail for a bit. keep me posted how the bisect goes.
[01:51] ogasawara: cool; thanks!
[01:51] tgardner, another build failure showed up
[01:52] ccheney, copying now
[01:52] ok
[01:52] 150e6c6 ?
[01:53] ccheney, http://kernel.ubuntu.com/~rtg/mainline/150e6c67f4bf6ab51e62defc41bd19a2eefe5709-maverick/linux-image-2.6.32-020632rc3g150e6c6-generic_2.6.32-020632rc3g150e6c6_amd64.deb
[01:58] tgardner, good
[02:00] tgardner, er it works i mean :)
[02:00] ccheney, Bisecting: 2 revisions left to test after this
[02:01] great :)
[02:01] back on later
=== jjohansen is now known as jjohansen-afk
[02:08] ogasawara, tgardner: should I expect this thing to work on tangerine? it only seems to build on tyler.
[02:08] oh, er
[02:08] /home/kees/kteam-tools/mainline-build/mainline-build-one: line 128: dch: command not found
[02:08] heh
[02:09] see mine stuff in emerald:~rtg/kteam-tools/mainline-build
[02:09] kees, see my*
[02:10] kees, its in a transitional phase. I'm gonna fix it tomorrow so that it works everywhere
[02:10] tgardner: okay, cool
[02:10] * kees runs away for dinner
[02:20] ccheney, http://kernel.ubuntu.com/~rtg/mainline/e00ef7997195e4f8e10593727a6286e2e2802159-maverick/linux-image-2.6.32-020632rc5ge00ef79-generic_2.6.32-020632rc5ge00ef79_amd64.deb
[02:20] testing
[02:27] tgardner, works
[02:27] ccheney, ack
[02:28] looks like it must be cc56f7de7f00d188c7c4da1e9861581853b9e92f, the other commit is supposedly ps3 specific
[02:29] ccheney, probably, but lets play it through
[02:30] yea
[02:30] ccheney, this one has a good vibe though.
[02:36] tgardner, there is a v2 of the patch, if i understand what is on lkml that was posted on 5/29
[02:37] may be related to the bug we are seeing, if it is indeed that patch causing the problem
[02:37] ccheney, I'll have it in 5 minutes
[02:37] ok
[02:43] ccheney, http://kernel.ubuntu.com/~rtg/mainline/cc56f7de7f00d188c7c4da1e9861581853b9e92f-maverick/linux-image-2.6.32-020632rc5gcc56f7d-generic_2.6.32-020632rc5gcc56f7d_amd64.deb
[02:44] testing
[02:47] Anyone can tell me if the current maverick kernel present on the PPA (2.6.35-5) already has vga-switchero compiled as default ?
[02:47] Yes.
[02:47] It does.
[02:48] thanks RAOF
[02:49] i'm trying to switch to my RADEON graphics with "echo DDIS > /sys/kernel/debug/vgaswitcheroo/switch"
[02:49] but i receive a blackscreen
[02:49] is the command right ?
[02:50] I forget.
[02:51] From what I understand, that's highly unlikely to work properly if X is running.
[02:51] tgardner, failed test, i am going to verify e00ef79 is actually good and can run instances to be sure its the cc56f7d
[02:52] yep, after type that command i restart X
[02:52] but the screen just go black
[02:52] Only the intel graphics are working
[02:53] benjamim: I'd probably grep the git log for the switcheroo commit to see what to do.
[02:53] ccheney, this next kernel is the absolute proof since that last commit is now skipped (if it doesn't fail)
[02:54] tgardner, ok
[02:55] RAOF, could you tell what is the file that the switchero log is stored ?
[02:56] benjamim: It's not - it's in the VCS history. You'd need to clone the kernel's git tree. Alternatively, go to git.kernel.org and use the search box in Linus' tree.
[02:58] oh, i did not know that, I'll take a look.
[02:58] thanks for help RAOF
[02:59] tgardner, yea e00ef79 ran an instance so its good, i saw a different type of failure but we think its a different unrelated bug
[03:00] the other bug is probably due to eucalyptus not doing error checking
[03:08] ccheney, http://kernel.ubuntu.com/~rtg/mainline/f21121cde6e617b90cd03ce083652ca543004dc2-maverick/linux-image-2.6.32-020632rc5gf21121c-generic_2.6.32-020632rc5gf21121c_amd64.deb
[03:08] testing
[03:12] tgardner, looks good so far
[03:13] ccheney, play it all the way, just like the others.
[03:15] tgardner, its still running, just giving early feedback :)
[03:16] takes around 15m to run the test 10 times
[03:16] * ccheney has to go to the grocery store, bbia 1hr
[03:17] so far passed the test 3/3 with 7 left to try
[03:17] ccheney, k, see you in a bit
[04:27] ccheney, back in the AM
[05:16] i got swallowed by walmart
[05:16] sorry for being gone so long :-\
[05:17] * ccheney needs to remember to look at the list his wife writes up before leaving to make sure it is detailed enough to buy
[05:25] 1) "items"
=== kamal is now known as kamal-away
[08:14] Morning
[08:16] Good morning smb! Welcome to another wonder filled day
[08:16] cking, Too true
[08:16] cking, Just experienced the wonders of democratic bit reading. :-P
[08:16] apw or ogasawara: any chance that enable-multiple-ring-buffer.patch can get backported for maverick?
[08:17] * cking wonders what "democratic bit reading" is
[08:17] LLStarks, Not apw and ogasawara will not be around till later
[08:17] okay
[08:17] any other kernel managers that i should speak to?
[08:17] cking, You read 5 times and let the majority decide whether its 0 or 1
[08:18] smb, not sure that's entirely a good idea. a dictatorship suits me better. as long as I'm in control ;-)
[08:19] LLStarks, Depends on the urgency and the patch. And as I still wait for my mail to sync I don't even know whether apw has written something about this patch
[08:20] cking, In my case democracy seems the better choice as that allows the EDID to be ok even with some monitor switch in between
[08:20] it's for vaapi support on intel
[08:20] smb, I get your drift now
[08:20] kinda stupid to not have it since libva is now in ubuntu archives
[08:21] http://intellinuxgraphics.org/h264.html
[08:21] LLStarks, Have you sent the patch to kernel-team@lists.ubuntu.com with some short explanation?
[08:21] i'm not on the mailing list
[08:22] That ended up missing 2.6.35, did it? It's been on the intel-gfx lists for some time - I kinda thought it made it into Linus' tree.
[08:22] made 2.6.35 from what i've heard
[08:22] LLStarks, You could still send to it. Though it then take a bit of time for it to get moderated to the list
[08:22] not sure though
[08:22] Then rejoice, for Maverick will have 2.6.35
[08:23] like i said, i'm not sure if it made the cut
[08:24] ah. it did. http://permalink.gmane.org/gmane.comp.video.dri.devel/46855
[08:24] git log says that its in 2.6.35-rc3, so it's in Maverick now.
[08:24] hopefully everything else required falls into place.
[08:24] The easiest form of backporting!
[08:25] RAOF, Very true
[08:25] RAOF, Btw, just out of interest. There have been a few bugs about invalid EDIDs. Do you know whether they are all over adapter space or mainly intel?
[08:27] The EDID processing code is in the shared drm code. EDID processing recently got more strict about the checksum being accurate (ie: actually checking the checksum, rather than completely ignoring it).
[08:28] RAOF, Right the EDID code itself is. But the implementation of the bit banging functions are driver specific. In my case intel_i2c.c
[08:28] RAOF, And I seem to experience flakyness when the monitor cable is not directly connected to the laptop
[08:29] RAOF, But as of this mornings "democratic" approach it seems to work. With mostly a 4:1 vote for the bit being 0
[08:29] Urgh :)
[08:29] Oh yeah. :)
[08:30] Especially as the i915 code has so many bit values defined that I am completely lost about their intention
[08:30] I'm not sure how I'd tell if a bad EDID on a bug was due to the monitor or the i2c interface. It's not something I've regularly asked reporters about.
[08:31] And there are plenty of monitors with bad EDIDs to go around.
[08:32] Yes, that might be specific to my case. The interesting part would be the number of messages. If the remainder is always the same, I guess the EDID /monitor is at fault
[08:32] If, like in my case, the remainder always changes, there is flakyness
[08:33] Yeah. I *have* noticed some changes in that sort of area for i915 but not radeon or nouveau.
[08:37] That would sort of match my experience that all radeon (I think I have not tried nvidia) laptops when connected to that same monitor switch would get the right EDID but not that i915 based netbook
[08:38] And as it drove me nuts (because I mainly work on that netbook and it would only allow the high mode after much cursing and tricking around) I played around with that yesterday and today
[08:39] I think it might be at least in part because Intel chips are entirely integrated, so they rely on third parties to actually wire stuff up. ATi and nVidia both either (a) are discrete cards, wired up by people whose job it is to wire up discrete cards, or (b) on motherboards wired up by AMD or nVidia.
[08:42] Maybe also because compared to radeon i2c code which only seems to set unset a bit in the registers, the intel implementation has dir bits and a mask and val bits and a mask and I think I really need to talk to somebody that can actually make sense of that. :) I'll try Eric, maybe he can tell me what all of this means
[11:12] smb, http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-hardy.git;a=commit;h=13f8acafe67361d7afda05c23a0408820e0cb468
=== ghostcube_ is now known as ghostcube
=== ayan-afk is now known as ayan
[13:35] ccheney, http://kernel.ubuntu.com/~rtg/mainline/3b2e8e02ca5a31b6f8db8de05becb613d819622a-maverick/linux-image-2.6.35-6-server_2.6.35-6.7_amd64.deb
[13:48] tgardner, thanks, testing
[13:58] tgardner, seems to be working, doing full test and going to run instance to be sure
[13:58] ccheney, ok, I've been looking at the patch. most of the changes are just pointer guards, but there is one change that could affect the behavior.
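(Editor's aside on the morning's EDID thread.) The "democratic bit reading" smb describes, together with the checksum remainder he mentions, can be sketched roughly as below. This is not the i915 or intel_i2c.c code: `read_bit` is a hypothetical callable standing in for one bit-banged I2C read, and the only fact relied on is that the bytes of a 128-byte EDID block sum to 0 modulo 256.

```python
from collections import Counter


def majority_bit(read_bit, samples=5):
    """Sample a flaky line `samples` times (odd) and return the majority value."""
    votes = Counter(read_bit() for _ in range(samples))
    return 1 if votes[1] > votes[0] else 0


def edid_remainder(block):
    """Sum of all bytes modulo 256; 0 means the checksum is valid."""
    return sum(block) % 256


def edid_checksum_ok(block):
    """A base EDID block is 128 bytes whose sum is 0 modulo 256."""
    return len(block) == 128 and edid_remainder(block) == 0


# Hypothetical 4:1 vote for 0, like the one reported in the log.
reads = iter([0, 1, 0, 0, 0])
voted = majority_bit(lambda: next(reads))

# Build a made-up block with a correct checksum byte, then corrupt one byte.
payload = bytes(range(127))
good_block = payload + bytes([(256 - sum(payload) % 256) % 256])
bad_block = bytes([good_block[0] ^ 0x01]) + good_block[1:]
```

Following smb's reasoning, a remainder that is wrong but identical on every read points at the monitor's EDID data, while a remainder that changes between reads points at a flaky transfer.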
[13:59] ok
[14:28] tgardner, the pad block corrupted message doesn't appear but it seems to not run the instance, i am going to try it again and see if it will work
[14:29] tgardner, its possible there is more than one bug, i am going to look into finding other ways to verify if the image was registered properly
=== sconklin-gone is now known as sconklin
[14:34] ccheney, a different behavior than the previous successful runs?
[14:35] with this kernel it appeared to register the image properly but then when i tried to launch an instance using the image it did not work, both worked before, i may need to retest your last kernel from last night to make sure it was working for running instance, but i am pretty sure it was
[14:35] i slept since then so i have forgotten a bit :-\
[14:37] ccheney, retry 2.6.35-5.6 to verify it still produces the PAD error, then I'll build a kernel based on that version but with the offending commit reverted.
[14:37] ok will do
[14:40] tgardner, it ran this time, i think its probably an issue with eucalyptus that caused it to not work the first attempt
[14:40] ccheney, does it pass the 10 loop test?
[14:40] i'll try a few more times to make sure it consistently is working, but i wouldn't be surprised if it was euca for that part
[14:41] tgardner, yea the 10 loop test passed, that just tests the registration for the pad bug
[14:41] this is still on 2.6.35-6
[14:41] the step i do after testing the pad bug is gone is to run the instance which seemed to hang the first attempt but worked on the second
[14:41] will get back to you soon about it, talking to the guys about it
[14:42] ccheney, just to be clear, you're running the kernel with cc56f7de7f00d188c7c4da1e9861581853b9e92f reverted?
[14:42] i am running http://kernel.ubuntu.com/~rtg/mainline/3b2e8e02ca5a31b6f8db8de05becb613d819622a-maverick/linux-image-2.6.35-6-server_2.6.35-6.7_amd64.deb
[14:43] ccheney, k, I'll add that to the bug report
[14:43] tgardner, ok
[14:48] tgardner, yea it seems to fix it, it only failed the first time of running instance and i don't have a great track record of instances running on my box anyway due to other weird eucalyptus bugs
[14:48] so i would consider the test to be successful
[14:48] tgardner: what are our options wrt A2 ? would you consider using that "current minus problematic patch" kernel for A2 ?
[14:49] s/patch/commit/
[14:53] ttx, um, I don't completely understand _why_ that patch causes a failure. As far as I can tell it merely implements more rigorous error checking, which the Java machine may be tripping over. I'm still of the opinion that you should release with a Lucid kernel 'cause I'm not yet willing to advise that we revert that commit for the master kernel.
[14:54] tgardner: hm...
[14:56] ttx, Linus has been gone for 2 weeks, so there may well be fixes for this already in the pipeline. I'll bug the authors about it.
[14:56] tgardner: I thought there would be more value in testing a 2.6.35 kernel in the A2 milestone than to revert to a Lucid one.
[14:56] tgardner: there have been some patches proposed by the same author that seem to "fix" things, ccheney found them
[14:57] ttx, perhaps I read a different thread. Which one did you read?
[14:57] No clue if they fix anything relevant for us though
[14:57] might be worth generating a test kernel with them and see
[14:57] * ttx looks at reference
[14:58] ttx, I read this one http://marc.info/?l=linux-fsdevel&m=127488511324250&w=2 which was inconclusive
[14:59] tgardner: there are a few others at http://marc.info/?a=119980668900012&r=1&w=2 but I have no clue if they are more conclusive
[15:00] ttx, which is why I thought I'd contact the authors, 'cause I also have no clue
[15:01] tgardner: if you're more comfortable with shipping a Lucid kernel for Maverick A2 than the modified 2.6.35 one, we should go that way
[15:01] I'm missing historical reference to see what's most acceptable for an A2
[15:02] tgardner: maybe that's a subject for the release meeting in one hour
[15:03] ttx, what is the use of testing a Maverick kernel in UEC if we _already_ know it doesn't work. It seems to me that it'll preclude testing of other UEC features.
[15:04] tgardner: agreed, by "modified 2.6.35 one" I was thinking the one that ccheney just tested
[15:04] the issue being it affects more than just UEC, but all servers
[15:05] ttx, well, I'm not ready to recommend that ogasawara revert that commit. I assume that the A2 release _must_ use the kernel from the archive, correct?
[15:05] tgardner: that's my understanding as well
[15:05] tgardner: and I'd prefer not to have to spin a new kernel, I uploaded what I hoped to be our final A2 kernel last night
[15:06] ogasawara, yep, agreed
[15:06] and it's cutting it close to do another upload
[15:07] tgardner: if A2 uses the one from the archive, how can you make it "ship the Lucid kernel" ?
[15:08] another solution is to just releasenote the fact that for UEC, you should revert to a Lucid kernel due to a yet-to-be-investigated bug
[15:08] ttx, I guess its a matter of seeding? The Lucid kernel _does_ live in the archive, just in a different release.
[15:08] tgardner: ah ok.
[15:09] i'm not sure if you can seed something not in your pocket
[15:09] but that would be a question to ask someone like cjwatson probably
[15:09] ogasawara: I'm unsure we should just prevent non-UEC server use cases from testing the 2.6.35 kernel in the A2 milestone
[15:09] if we can't and we want to use the old kernel we can probably get it copied to maverick
[15:09] documenting the workaround for the UEC users sounds like the best move here
[15:10] hmm actually if it has the same source package name then probably the only way to do it is by telling users to install (but i may be wrong)
[15:10] * tgardner gets breakfast. biab.
[15:11] tgardner, ogasawara: i'll mention the issue in the release meeting since the release team should decide on what they prefer to have in A2
[15:11] (and will confirm what our options are, if we have more than one)
[15:12] ttx: ack
[15:12] ttx, yea our only options might be to either respin or to document in the a2 release notes, but i don't know if they have some magic they can perform on the image outside of regular seed generation
[15:15] is there a way to keep my system at 10.04, but somehow tell apt to upgrade my kernel from maverick?
[15:15] so far I've just been manually installing kernels
[15:16] ccheney: that's what I'll ask at the release meeting
[15:16] cnd, you could run the LTS backport.
[15:17] tgardner: can you remind me whether that differs in any way from the real maverick kernel?
[15:17] config options?
[15:17] cnd, should be identical.
[15:17] tgardner: are they now produced at the same time as the maverick kernels, or do they lag?
[15:18] cnd, the only diff is that the backport is built with the Lucid toolchain
[15:18] cnd, obviously it lags because its dependent on maverick, but usually no more than a day or so
[15:18] ok
[15:18] thanks
[15:19] cnd, it depends somewhat on the crazy shit ogasawara has done to maverick. I'm working on the backport right, but apparmor is giving me some grief.
[15:20] heh
[15:31] tgardner, anything else you would like me to test? whenever the other patch gets reviewed by Linus I can test it out as well
[15:32] ccheney, it'll likely take a few days.
[15:32] tgardner, no problem just let me know when you are ready
[15:42] ccheney, just to be clear, this PAD error is emitted by a node host kernel when it is attempting to unpack and load a KVM guest, right?
[15:42] tgardner, is this bug on your radar? bug 597904
[15:42] Launchpad bug 597904 in linux (Ubuntu Maverick) (and 1 other project) "No video on beagleboard with 2.6.35 kernels. (affects: 1) (heat: 10)" [Critical,New] https://launchpad.net/bugs/597904
[15:42] JFo, it should get on mpoirier's radar
[15:43] k
[15:43] mpoirier, you around?
[15:43] sure am
[15:43] tgardner, its emitted when attempting to uec-publish-tarball on the walrus server
[15:43] k, have you seen that bug I just pointed to tgardner?
[15:43] hold on, looking at it...
[15:43] k
[15:43] tgardner, the node doesn't do anything with it at all, the node is the system running the kvm instance that you start using the image off the walrus server
[15:44] tgardner, i hope that was clear, i can try to reword a bit better if it wasn't :)
[15:44] ccheney, clear as mud. explain in a language that upstream will understand
[15:44] tgardner, ok
[15:44] JFo: I am aware of the issue - been present since I joined the company.
[15:45] cool
[15:45] just worried as I got mail indicating it was critical, etc.
[15:45] tgardner, so this bug shows up on the server running 'walrus' which is the equivalent of Amazon S3 if you know what that is?
[15:45] tgardner, reading my uecglossary page :)
[15:45] JFo: they are all critical right now...
[15:45] and they set milestone to Alpha2 and the kernel freeze is today
[15:45] JFo won't happen.
[15:45] tgardner, so its running on the box not running the node controller, so not the one running the images in VMs
[15:46] tgardner, when you start an instance (VM) the software provides the image you tell it to use via 'walrus' S3 to the node controller to use as the base image to run for the VM
[15:46] JFo: we need to coordinate with the mobile team.
[15:46] yeah, I agree
[15:46] tgardner, our bug is failing when attempting to register the image (make it available for use) to the 'walrus' S3 service
[15:47] just wanted to verify that no one had given them the idea that it was possible :)
[15:47] ccheney, more fundamentally, this is the walrus host kernel attempting to decrypt a file?
[15:47] tgardner, to explain better than that i may need to defer to ttx
[15:47] tgardner, yes
[15:47] thanks for looking mpoirier
[15:47] tgardner, so not a vm kernel, actual bare metal kernel
[15:47] tgardner: in fact it's walrus calling a Java decryption routine
[15:48] ccheney, and your 10 cycle test simply runs that decryption 10 times?
[15:48] JFo. I'll have a chat with tgardner on how to proceed with the workload and what to prioritize first.
[15:48] ttx, i couldn't find where bouncycastle or openjdk-6 is actually using sendfile but i may be lacking my grep skills
[15:48] when used against large files, ultimately it fails with that pad block error
[15:48] tgardner, essentially, my script attempts to register the image which does various things unknown to me including the decryption
[15:48] ttx, ccheney: ok, that seems to be the fundamental issue.
[15:48] tgardner, and repeats that 10 times, yes
[15:49] ttx: i did notice even when it does work it mentions something about the file being too large in the euca logs
[15:49] ttx, i don't know if you saw that message before
[15:49] I suspect the JVM does things that are no longer well supported by the kernel
[15:50] some cache error message i'll see if i can find it, might be in the bug report, will check
[15:50] smoser pointed at java.nio.channels.FileChannel transferTo using sendfile
[15:50] mpoirier, ok
[15:50] hmm the cache message seems not in the bug report, i'll add it in case its useful
[15:51] ttx, do you happen to know what package that is in? seems not in the main openjdk-6 source
[15:51] * ttx looks into his mighty java-Contents file
[15:52] * ccheney doesn't know enough about java to know how to locate where it would be
[15:52] openjdk-6-jre-headless rt.jar java.nio.channels.FileChannel
[15:52] if its a bug in java we can get doko to look into it
[15:52] it would seem unlikely to me that java would utilize and expose something that isn't standard.
[15:52] hmm weird
[15:52] i grepped over openjdk-6 source
[15:52] ttx, do you think you could develop a simple reproducer? Something that an upstream developer could use to stimulate the bug without having to have all of the UEC stuff?
[15:53] ccheney: didn't the eucalyptoids provide us with that ?
[15:53] its possible that its a java bug, but i would think, given 2 years of function and then breakage in linux and expectation of java to run cross unix, that its kernel regression.
[15:53] ttx, iirc someone said there is a huge reproducer in java attached to the bug
[15:53] smoser, its clearly kernel regression (or at least a change in behavior)
[15:53] right, it's just unclear why
[15:53] yea 134MB test program
[15:54] ah, I see it.
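(Editor's aside on the reproducer request.) One shape a small, UEC-free test could take is a file-to-file copy through `sendfile` followed by a checksum comparison. The log does not establish that a plain `sendfile` copy triggers the "pad block corrupted" failure, so this is only a sketch under that untested assumption. It assumes Linux, where Python's `os.sendfile` accepts a regular file as the destination; the helper names and file names are made up.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Hash a file in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sendfile_copy(src, dst):
    """Copy src to dst using os.sendfile; return the number of bytes copied."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        size = os.fstat(fin.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(fout.fileno(), fin.fileno(), offset, size - offset)
            if sent == 0:          # unexpected end of input
                break
            offset += sent
    return offset


def roundtrip_ok(size_bytes):
    """Write random data, copy it with sendfile, and compare checksums."""
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "src.bin")
        dst = os.path.join(workdir, "dst.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(size_bytes))
        copied = sendfile_copy(src, dst)
        return copied == size_bytes and sha256_of(src) == sha256_of(dst)
```

Since the failure reportedly shows up with large files, a real test would loop `roundtrip_ok` with sizes well beyond the 1 MiB used for a quick check, ten times or more, in the spirit of ccheney's 10 loop test.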
[15:54] its either regression or undefined behavior that was relied on
[15:55] ccheney, If I can reproduce it, then I can likely narrow down the exact line in the commit that is causing this issue
[15:55] tgardner, yea, looking to see if i can find the sendfile code used
[15:56] umm we have patches in the FileChannel area
[15:56] * ccheney hopes we didn't cause this bug ourselves
[15:56] actually hmm no those aren't ubuntu patches but something else
[15:56] * ccheney upstream source looks a bit odd
[15:57] ccheney: the bug is occurring on RedHat too
[15:57] ccheney: so it's not "just us"
[15:57] oh ok
[15:57] it's anyone with 2.6.34+
[15:57] ttx, do you have a reference to the RedHat report?
[15:57] yea the patches are from icedtea, which i think is similar to ooo-build for java
[15:57] i still don't see a reference to sendfile though
[15:57] it's a forum post at eucalyptus.com
[15:58] http://open.eucalyptus.com/forum/decrypting-image-exception
[15:58] maybe it gets generated in some way
[15:58] * ccheney is going to write up the preliminary information to the mailing list in a few minutes
=== virtuald_ is now known as virtuald
[16:00] * ccheney still isn't on the list apparently, or hasn't gotten the response back yet
[16:01] ah ha my openjdk-6 source wasn't fully unpacked it has a bunch of extra stuff plus a huge tarball in the dir
[16:02] no wonder i saw no reference to sendfile in my grep
[16:02] now i found it
[16:02] ogasawara, the only item for us in the release meeting is the 'pad block corrupted' issue. You're aware of the status changes, i.e., we've narrowed it down to a single kernel commit?
[16:03] jdk/src/solaris/native/sun/nio/ch/FileChannelImpl.c
[16:03] tgardner: yep, got it all written up to paste
[16:03] ogasawara, cool. on top of things as always :)
[16:04] its definitely special cased for solaris vs linux
[16:05] so not fully standard across *nix anyway
[16:05] http://pastebin.com/z5juRcCB
=== kamal-away is now known as kamal
=== bjf[afk] is now known as bjf
[16:45] apw: ogasawara: I want to grab the latest kernel ddeb, but the amd64 version is missing
[16:46] I see that the previous kernel, 2.6.35-5, is also missing amd64 ddebs
[16:46] cnd: Is it still building?
[16:46] amd64 that is
[16:46] cnd, its still building
[16:46] oh
[16:46] ccheney, https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/588861/comments/10 doesn't reproduce the bug.
[16:46] Launchpad bug 588861 in linux (Ubuntu Maverick) (and 4 other projects) ""pad block corrupted" error when trying to register an image with 2.6.34 kernel (affects: 1) (heat: 12)" [High,In progress]
[16:47] tgardner, oh ok
[16:47] ogasawara: tgardner: launchpad says it's been done building for 2 hours
[16:47] tgardner, i'll grab it and see if it happens with that for me
[16:47] ccheney, dang, it would be nice to a have a simple reproducer
[16:48] cnd, it takes awhile for the publisher to run, plus things may be starting to freeze for A2
[16:48] tgardner, i'm writing an email to upstream about it so maybe they will know how to make a simple test case
[16:48] tgardner: ok, I suppose I'll chill for a bit
[16:48] er upstream eucalyptus i mean not kernel
[16:48] ccheney, hmm, this is gonna be a kernel issue I think.
[16:49] unless the java engine is just borked
[16:50] tgardner, yea i don't know the code well enough to be able to tell
[16:50] ccheney, its in openjdk-6-jdk, right?
[16:51] tgardner, yes sendfile is in that, if you download it and unpack the source there is a large tarball with the actual source inside of that
[16:51] working on it....
[16:51] so double packed debian source [16:51] that's how i managed to overlook the sendfile call earlier [16:52] tgardner, i attached a pastebin link of the file using it to the bug report earlier [17:01] jjohansen: just fyi, I went ahead and marked the two apparmor A2 work items you had as done (sync distro apparmor with upstream version, and update compatibility patch) [17:01] ogasawara: thanks, I meant to do that [17:01] jjohansen: was selfish really, I just wanted our burn down charts to look good for the release meeting :) [17:02] hehe :) [17:04] * ogasawara bails for a few hrs [17:05] tgardner: ^^ deflect any fires should there be any [17:08] JFo: Any suggestions on how to get apport data through a serial console on a system that doesn't have a released kernel? Re: bug 597904 [17:08] Launchpad bug 597904 in linux (Ubuntu Maverick) (and 1 other project) "No video on beagleboard with 2.6.35 kernels. (affects: 1) (heat: 10)" [Critical,New] https://launchpad.net/bugs/597904 [17:09] GrueMaster, not off the top of my head [17:10] Then QUIT MARKING OMAP BUGS AS INVALID. [17:10] thank you. [17:10] GrueMaster, before you start yelling... probably best to give an example where I did [17:11] I just did. [17:11] GrueMaster, get your facts straight [17:11] I set it incomplete [17:12] because I asked for info [17:12] and for the record, and your general fund of knowledge [17:12] I don't take people yelling at me very well [17:12] especially when they are wrong [17:12] System isn't booting. Best info we are able to gather is through minicom logs. [17:12] please refer to my boss pgraner if you have issues [17:13] This isn't an x86 system that has multiple tons of ways to collect this info. [17:13] And because we need to get it working, we are getting hand-built kernels prior to release (which apport considers invalid). [17:13] I know that, hence my question as to whether it was possible [17:14] Nope. Not possible.
[17:14] like I said, get your facts straight [17:14] then fine, add a comment to the bug to that effect, if necessary and we can move on [17:14] see how easy that could have been? [17:21] * smb moves on to boc and we. See you next week [17:28] JFo: sigh. I owe you an apology. I'm letting my frustrations of trying to get this image working get the best of me. I shouldn't have taken it out on you like that. [17:29] apology accepted. If you need to vent to me about something like that in the future, may I suggest a private window? :) [17:30] * JFo steps away for lunch [17:31] mpoirier, you around? [17:31] I am. [17:31] whazup ? [17:31] mpoirier, did Lucid omap3 have video? Is this a maverick regression? [17:31] ogasawara, tgardner: this bisect script rocks. :) [17:32] tgardner, maverick regression [17:32] let's mumble... [17:32] kees, If I can get clear of UEC I'll try and make it better [17:32] tgardner: Yes, Lucid works fine on omap 3. In fact, I tried the latest 2.6.35 kernel on that image as well. [17:33] tgardner: works great for me already. :) [17:33] ogasawara: I seem to be zeroing in on the middle of a mess of i915 commits again... we'll see how it goes. [17:34] I have just been informed that there is now a 2.6.35-6 release kernel (not a hand rolled one). I'm going to pull that and retry it just to verify. Might even get some apport data. [17:37] * JFo crosses his fingers [17:45] kees: I'm about to leave for the day, but if you end up sending email, I'm interested in anything having to do with i915 [17:45] sconklin: okay, sure. are you seeing problems too? [17:46] kees: I'm not, but I don't know which problems, release, or hardware you're talking about - I'm just trying to keep up with i915 in general [17:53] sconklin: okay, cool. I have a Q35 intel mainboard with onboard G35 (?) that since -rc2 hangs on boot. I seem to be the only person with this problem. :P [17:53] (pci id 8086:29b3) [17:54] kees: Hmm. Hang is not good.
I'm not paying much attention at all to maverick recently, been working on lucid stable updates. Have you tried the simple things like disabling kms? [17:55] sconklin: booting with "nomodeset" you mean? [17:55] kees: yes [17:55] yeah, that doesn't work. :) [17:55] the only way I could get past the initramfs was to blacklist i915 _and_ intel_agp. [17:55] kees: that's a new one on me [17:55] sorry [17:55] then once into root fs, they weren't blacklisted any more (initramfs bug) and udev immediately loads them and hangs the system. [17:56] and if I blacklist them on the rootfs, when X starts, it ignores the blacklist and loads them again. it's been a really fun few days. ;) [17:58] AGP needs to be built in. [17:58] we got heavily hit by having it modular. afaik upstream is not properly testing modular agp [17:59] * [alpha, amd64, i386, amd64, powerpc] Make all AGP driver built-in to [17:59] workaround race-condition between DRM and AGP. [17:59] it was directly asked by airlied. [18:01] kees: ^^ [18:02] oh, lovely [18:03] ogasawara: ^^ [18:03] tgardner, yea their test case works for me even on the broken kernel, so it's not a good enough test apparently [18:03] tgardner, i'm going to try to modify it to use the same image i was previously using to see if that makes a difference [18:04] kees, the race is a generic problem. we had a macro that would allow you to force a symbolic dependency, but ogasawara ripped it out of Maverick 'cause it wasn't getting used anywhere. [18:04] tgardner: er, so, is my problem not due to this? /me isn't sure what to test next. [18:05] kees, I'm not sure. Try changing the config to build-in AGP [18:06] will do [18:10] JFo: No joy here. [18:10] *** Problem in linux-image-2.6.35-6-omap [18:10] The problem cannot be reported: [18:10] This is not a genuine Ubuntu package [18:10] ccheney, I'm building an instrumented kernel that might give me enough information to indicate which part of that patch is causing problems.
[18:10] ok [18:12] I might be getting this due to bug 595949. [18:12] Launchpad bug 595949 in linux-meta-ti-omap (Ubuntu Maverick) (and 3 other projects) "linux-meta-ti-omap depends on the wrong binary kernel in maverick (affects: 1) (heat: 564)" [Low,In progress] https://launchpad.net/bugs/595949 === sconklin is now known as sconklin-gone [18:14] GrueMaster, this is omap3, right? [18:14] yes [18:14] GrueMaster, k, lemme have a look [18:15] It may be possible that this is because the kernel was just released and not on my mirror yet. [18:16] GrueMaster, AFAIK it hasn't completed building yet. ogasawara upload the meta package before amd64 or armel were complete [18:17] I have the kernel. It was at https://edge.launchpad.net/ubuntu/+source/linux/2.6.35-6.7/+build/1811239 [18:17] or am I missing something? [18:17] GrueMaster, what meta-package are you using? It should be linux-omap or linux-image-omap [18:18] This is a hand rolled image initially running 2.6.34-5. I downloaded linux-image-2.6.35-6-omap from the above link. [18:19] GrueMaster, ok, so why do you think you're hitting bug # 595949 ? [18:20] Has it been fixed yet? [18:20] My image was created 6/22. [18:20] GrueMaster, yes, but I just noticed that there might still be a bogus package in the archive. [18:21] And I haven't seen anything kernel related with apt-get update. [18:21] hmmm [18:21] GrueMaster, lemme chat up some archive admins about this [18:25] GrueMaster, 'dpkg -l|grep ti-omap' [18:26] none. [18:26] GrueMaster, then you don't have the meta package installed? [18:26] as I said earlier, it is broken (or was on 6/22 when this image was created). 
[18:27] GrueMaster, um, try 'dpkg -l|grep omap' [18:27] ii linux-image-2.6.34-5-omap 2.6.34-5.14 Linux kernel image for version 2.6.34 on OMA [18:27] ii linux-image-2.6.35-2-omap 2.6.35-2.3 Linux kernel image for version 2.6.35 on OMA [18:27] ii linux-image-2.6.35-6-omap 2.6.35-6.7 Linux kernel image for version 2.6.35 on TI [18:28] GrueMaster, what happens with 'sudo apt-get install linux-image-omap' ? [18:29] "apt-cache search ti-omap" only lists the linux-headers for 2.6.33|34|35 and omap4 meta packages. [18:29] its not ti-omap, just omap [18:30] It wants to revert to 2.6.35-5-omap kernel. [18:32] anyone care to apply the proposed patch in bug #595489 [18:32] GrueMaster, after you've done an update? 'apt-get update' ? [18:32] Launchpad bug 595489 in linux (Ubuntu) (and 1 other project) "lvm snapshot causes deadlock in 2.6.35 (affects: 1) (heat: 6)" [High,In progress] https://launchpad.net/bugs/595489 [18:33] Just a sec. Updated mirror, now updating apt. [18:34] psusi, it ain't gonna happen for A2, but I've milestoned it for A3 [18:35] doh [18:35] psusi, is Eric gonna get it into .35 ? Linux won't be back until next week. [18:36] Linus* [18:36] he fired off the patch to the mailing lists and I guess it's waiting for Linus to apply [18:36] but since kernel freeze starts today, I think we're going to have to apply it ourselves? [18:37] psusi, ok, this issue won't get lost now that its milestoned. I would prefer to get the fix direct from Linus' tree [18:37] yes, but that won't happen because of the kernel freeze will it? [18:37] psusi, we don't freeze the kernel until the official 2.6.35 is released [18:38] ohh... I thought I read that today begins kernel freeze... [18:38] this is just a momentary freeze in order to get A2 out the door [18:38] ahhh [18:38] so there time yet [18:38] there is* [18:39] then hopefully it will get into .35 final... 
though Eric said there is another deadlock possible, but I have yet to hit it [18:39] he's working on tracking that down now [18:40] ccheney, http://kernel.ubuntu.com/~rtg/mainline/ae0f36f0b964caf916c2ffc9f84b28c0f91c18a2-maverick/linux-image-2.6.35-6-server_2.6.35-6.7_amd64.deb [18:40] ccheney, do a 'sudo dmesg -c' right before you start your test [18:41] maks_, sconklin-gone, tgardner: building with intel_agp built-in did not fix it. the bisect has resulted, finally, in this one: f1befe71fa7a79ab733011b045639d8d809924ad [18:41] I'm building a maverick kernel with that reverted now... [18:41] kees, you have an older i945? [18:41] yes [18:42] G35, though, not G33 [18:42] kees, you should ask Kyle about this. He did a bunch of i945 stuff a few years ago. [18:43] tgardner: I figured I'd poke Anholt since I think he's in my timezone. (if I can find him) [18:43] * tgardner lunches for a bit [18:47] tgardner: Ok, I am officially giving up trying to get apport data on this omap bug. I have followed all the steps on https://help.ubuntu.com/community/ReportingBugs based on a headless system, but apport is now trying to collect data on my desktop system (different arch altogether). [18:54] I'd like to know what the fix for #554569 is [18:54] the linked rh and freedesktop bugs all point to different patches [18:56] oh the freedesktop bug finally points to something interesting cool. [18:59] * bjf will be back in a bit === bjf is now known as bjf[afk] [19:03] ccheney, any update? [19:10] tgardner, not yet, sorry was at lunch [19:11] tgardner, no response from my email and still looking into how to do the test with my own image [19:12] ccheney, well, just rerun the original UEC failure case with this latest kernel so I can get some output from that patch [19:12] 2.6.35-6 ?
[19:13] ccheney, http://kernel.ubuntu.com/~rtg/mainline/ae0f36f0b964caf916c2ffc9f84b28c0f91c18a2-maverick/linux-image-2.6.35-6-server_2.6.35-6.7_amd64.deb [19:13] oh nm i see you sent me a new one [19:19] tgardner, appears broken, is that expected? [19:19] i'll see how it does over all 10 tests but it already has one failure [19:20] ccheney, yes, I expect it to be broken. I want the dmesg output [19:20] oh ok [19:20] [ 211.296957] /home/rtg/kern/maverick/kern-64/ubuntu-maverick/fs/read_write.c:849 should this happen? [19:20] i'll get the whole log once its done [19:21] ccheney, I only need the results of one pass. _Before_ you start it do 'sudo dmesg -c' to clear the kernel buffer [19:21] oh oops, i need to redo it then [19:22] ok as soon as it finishes the first pass i will get it to you [19:23] all i saw was once for one pass: [ 365.268713] /home/rtg/kern/maverick/kern-64/ubuntu-maverick/fs/read_write.c:849 should this happen? [19:24] that was the entire output of dmesg during the first pass of the run [19:24] ccheney, yeah, its some debug, but I think its harmless [19:24] oh that wasn't my question it was this "[ 365.268713] /home/rtg/kern/maverick/kern-64/ubuntu-maverick/fs/read_write.c:849 should this happen?" exactly [19:24] :) [19:24] ccheney, oh, yes. its the debug I put in. [19:25] ok [19:30] tgardner, so what does the special output mean? 
:) [19:30] ccheney, hang on, I'm spinning one more [19:30] ok [19:30] ccheney, 10 minutes or so [19:31] ok [19:45] ccheney, http://kernel.ubuntu.com/~rtg/mainline/0a87a0c1b12f56bd556fd4506041966717c87fb1-maverick/linux-image-2.6.35-6-server_2.6.35-6.7_amd64.deb [19:45] same drill as before [19:46] ok will test [19:46] tgardner, i may have to disappear for an extended time in the next week, if so contact kirkland for how to proceed, i don't want to go into details in public [19:47] tgardner, most of the server team knows the details though if it does happen so they can fill you in [19:47] ccheney, ack [19:47] tgardner, will be a medical emergency if it happens [20:01] ccheney, you haven't disappeared already, have you? [20:04] no sorry [20:04] had to talk my father in law to fill him in on my wife's status [20:05] ccheney, I'm about EOD, but wanted to ponder the results of that last patch [20:05] ok let me see if it finished running [20:05] doh i ran off to tell him before i rebooted and started it [20:05] it should take about 3-4m [20:11] tgardner, [ 129.980324] /home/rtg/kern/maverick/kern-64/ubuntu-maverick/fs/read_write.c:849 ffffffff816256e0 (null) [20:11] repeated 20 times [20:11] most within a second, first was 50s before i'll pastebinit [20:12] tgardner, http://pastebin.ubuntu.com/455139/ [20:12] ccheney, ack [20:13] ccheney, hmm, I wonder if I restore the original lines there if it'll make a diff. you gonna be around for another 20 minutes? [20:14] yea [20:14] k, gimme 15 minutes [20:14] tgardner, i'll be around the rest of the day unless emergency happens [20:14] i'll do my best to alert everyone to that case if time permits [20:18] -> Lunch [20:33] ccheney, http://kernel.ubuntu.com/~rtg/mainline/afef312909fa10e603a05e030b2ee2feebde8d9f-maverick/linux-image-2.6.35-6-server_2.6.35-6.7_amd64.deb [20:34] testing [20:34] this should be the last one. === bjf[afk] is now known as bjf [20:38] ccheney, I left the debug in along with the restored code. 
It might just work. [20:38] ok will see what happens [20:45] tgardner, we got a bit of bad news, probably more to follow in the next week, but i'm still around today [20:45] ccheney, I'm definitely outta here after this test. [20:46] ok [20:46] just one result from 1 loop [20:46] [ 193.874005] /home/rtg/kern/maverick/kern-64/ubuntu-maverick/fs/read_write.c:849 ffffffff816256e0 (null) [20:47] ccheney, did the PAD error show up? [20:47] oh i forgot to check that, sorry :-\ [20:47] no i think it is working [20:47] i'll rerun [20:47] i didn't clear my euca logs [20:47] it looks like it worked though [20:48] yea it worked, i checked the timestamp from my script vs the timestamp of the last error on the log [20:48] haven't run an instance yet, but it seems to no longer show pad corrupted [20:48] yea no more pad corrupted error [20:49] what did you change? [20:49] ccheney, cool. I'll send a note to upstream about it later tonight. Add your results to the LP report. [20:49] ok [20:49] ccheney, I restored the 2 lines of code that made any substantive difference in the behavior of the patch [20:52] ok [21:15] ccheney, upstream email sent. I'm outta here.
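(Editor's aside) Checking a pass by eye, as done above, amounts to clearing the kernel buffer with 'sudo dmesg -c', running one pass, then counting the read_write.c:849 lines. A rough sketch of the counting step follows, assuming the dmesg text has already been captured into a string. parse_debug_lines and its regex are illustrative names, not something used in the log, and the sample lines in the usage note are the ones quoted in the conversation.

```python
import re

# Matches lines like:
# [  129.980324] /path/to/fs/read_write.c:849 ffffffff816256e0 (null)
DEBUG_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<path>\S+):(?P<line>\d+)\s*(?P<rest>.*)$"
)


def parse_debug_lines(dmesg_text, suffix="fs/read_write.c"):
    """Return (timestamp, line_number, payload) for each matching debug line."""
    hits = []
    for raw in dmesg_text.splitlines():
        m = DEBUG_RE.match(raw.strip())
        # Keep only messages whose source path ends with the given suffix.
        if m and m.group("path").endswith(suffix):
            hits.append(
                (float(m.group("ts")), int(m.group("line")), m.group("rest"))
            )
    return hits
```

Fed the two 849 lines pasted in the log plus any unrelated kernel messages, this returns only the two debug entries, so a count per pass (1 versus 20, as reported above) falls out of len(parse_debug_lines(text)).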