[12:03] sounds like a potential pain, testing wise.
[12:04] yes
[12:04] indeed
[12:04] but the first thing i will do is to test your combination
[12:04] if i can reproduce it, i know what i need to tell people to do :)
[12:05] because the point is that the UP kernel is well.. UP
[12:05] it doesn't use anything of the SMP code if not to detect how many CPU's are available on the system
[12:05] and that's done parsing the OBP devicetree
[12:06] the test i did on your machine was to disable thread 0 of core 1
[12:06] (your booting CPU basically)
[12:07] because your CPU has no core+
[12:07] core0 even
[12:07] but the ALOM and OBP do some magic cpu virtual remapping
[12:07] so even if in reality the code is running on core1 thread0, the kernel sees it as cpu0
[12:07] that might be buggy
=== braddr wonders if he has a 6 core model where cpu 0 was flawed somehow so it was disabled and shipped as a 4 core.
[12:09] braddr: all the cpu's are born as 8 cores
[12:09] ah
[12:09] the 6/4 core models are 8 core where 8 - $x core did fail
[12:09] so they get disabled in hw
[12:10] if you do a showcomponent in the ALOM you can see it
[12:10] the first block of MP/CPU0/.. are the core/thread set
[12:10] the second block (8 entries) are the 4 dual bus for memory access
[12:10] MB/CPU0/R0/D1 iirc
=== braddr nods.
[12:11] MB/CMP0/P0
[12:11] <- core0 thread0
[12:11] so, barring remapping, cpu p0-3 and 20-31 were disabled
[12:11] you don't have that
[12:12] MB/CMP0/CH0/R0/D0 <- memory
[12:12] basically yes. the 4 cores that did not pass the test have been disabled
[12:12] that's also why (as you can see) there are "holes"
[12:13] one thing that did puzzle me also was the ALOM POST process
[12:14] so, on your and david's systems, at least one of them has 0-3 disabled?
[12:14] in both your machine and my machine i did disable the booting CPU
[12:14] and the POST was still running on it..
[12:14] so that's what triggered a bell
[12:14] nope.. on all our machines there is a core0/thread0 working
[12:15] david has 1x8 and 1x6
[12:15] i have 1x6 and 2x8
[12:15] but all with core0
[12:15] yours is "special"
=== braddr waits for the short bus to stop by
[12:15] eheh
[12:15] but we will figure it out
[12:16] it might as well be a case we don't handle properly OR a combination of bugs
[12:16] we will see soon hopefully
[12:16] it'll get sorted out, I have no doubts.
[12:16] yepos
=== braddr [n=braddr@209.189.198.126] has joined #ubuntu-ports
=== braddr [n=braddr@209.189.198.126] has joined #ubuntu-ports
[07:59] hark.. is that the sound of a jet passing overhead or did someone just turn on my t2000? :P
[07:59] morning braddr :)
[08:01] g'morning
[08:02] brb.. my windows desktop's networking is acting all screwy.. gonna reboot.
=== braddr [n=braddr@209.189.198.126] has left #ubuntu-ports []
=== braddr [n=braddr@209.189.198.126] has joined #ubuntu-ports
[08:08] what's on the testing block for today/tonight?
[08:09] david has a possible workaround for your problem, but i think he is actually eating to power up brain and brainstorm on a proper solution
[08:10] i just woke up and i am going to test another fix on your box
[08:10] "another possible fix"
[08:16] boot.img-debian-sparc <- tsk ;)
[08:16] debian doesn't support Niagara :D
[08:17] hey, fire me. :)
[08:18] notice the date on that file?
[08:19] nah.. i was just teasing :P
[08:19] I know
[08:30] these are the moments in which i wish gzip could fork on N cpu's
[08:30] heh
[08:30] yup.. lots of cpu's just don't help single threaded apps.
[08:39] about to boot the new image..
[08:40] you can go in console read only mode if you like
[08:40] boot.img-fabbione-9 <-
=== braddr crosses his fingers.
[08:42] oh well
[08:42] kablooie
[08:43] crap
[08:45] shouldn't need to be that verbose.. just boot net mem=1024k should be enough
[08:45] DOUBLE TYPO
[08:45] go fabion!
[08:45] er.. 1024m
[08:45] it depends on the tftpd you are using
[08:46] tftp-hpa doesn't support broadcast
[08:46] so you need to specify the tftp server address
[08:46] It's been working fine for me forever. :)
[08:46] it doesn't here and never did
[08:46] might be my anal firewall
[08:48] THERE WE GO
[08:48] I wasn't watching, but just saw the screen reset.
[08:48] i am at the language selector
[08:49] yup.. I see. :)
[08:49] is there an extra disk i can use to test an install?
[08:49] i would like to make sure that it doesn't crash later
=== braddr tries to remember if he did anything interesting with disk1 beyond just do the install back when we had it working.
[08:50] drop to a shell and mount the large partition?
[08:50] I definitely don't have a spare disk, but I'm thinking that it'd be fine to reinstall on disk1
[08:50] ok.. give me a few minutes to get to the partitioner
[08:50] otherwise we can't mount anything
[08:51] might be quicker to just boot that disk. :)
[08:51] on the other hand.. gotta get that far to do an install for real anyway
[08:52] yeah i think i am going to check a couple of more things before we do the install
[08:55] yeps.. ok it's reproducible
[08:55] let's see what's the watermark for the issue to appear
[08:56] want me to take control for a sec and check that disk for wipe-ability?
[08:56] in a few minutes
[08:56] if that's ok with you
[08:56] take your time
[08:56] thanks
[08:57] this mem thing opens a completely new frontier :)
[08:57] the amount of memory seems to be a factor? Bizarre.
[08:57] i have 8GB too
[08:57] and it works
[08:58] it might be a very complex combination at this point
[08:59] 2g boots
=== fabbione attempts 4
=== fabbione attempts 8
[09:06] kablooie
[09:07] yeps
[09:07] that's good..
[09:07] it was expected
[09:10] if this goes well.. try 8G-1M or -1k even
[09:10] yeah
[09:13] interesting
[09:13] yes
[09:15] oh, sure, return to bisecting.
[09:15] yeps
[09:16] it might even be 7G+1m
[09:16] hem
=== fabbione hides in a corner
[09:18] stick with bisecting.. it won't take that many steps to figure out.
[09:18] yeah i know
[09:18] i forgot the m at the end
[09:19] and thus the upper bound has been changed.
[09:20] er, lower bound
[09:20] yeps
[09:20] except i keep forgetting the m :)
[09:20] what's it default to, bytes?
[09:21] kb iirc
[09:21] 7680-8191..
[09:22] no worries.. that one is normal because i did interrupt the boot
[09:22] pick a number, any number.. where will the tip over point land.
[09:22] 7936m <- 7.75G
[09:25] 7936-8192.. I'll put my money on 7999/8000 for no good reason. :)
[09:26] 8064 now :)
[09:26] it might even be bad ram
[09:26] in UP used for something and it crashes
[09:27] in SMP used by some init tables that are not used once init is done
[09:27] that would be ironic
[09:28] that would be hilarious and certainly ironic.
[09:28] I certainly haven't done much that stresses memory to hell and back.
[09:29] 8064 is good
[09:29] 8064-8191.. so much for my first pointless guess
[09:29] time for easier numbers.. 8100
[09:30] that would be.... 8064+64
[09:30] 8128 ?
[09:30] yeah
[09:30] i am pretty sure this will work too
[09:34] last 32M :)
[09:34] 8128-8191, running out of midpoints.
[09:35] for your blog -9 is just the dapper kernel from git with security fixes and stuff
[09:35] -10 that i am about to boot after the bisect had DEBUG_BOOTMEM on
[09:36] on bellevue: ~braddr/.www/t2000/boots.txt -- I just made it world writable, so you can edit
[09:38] nah i am not going to edit www please :)
[09:38] so this one boots too
[09:38] well, you've been testing stuff when I'm not around, so it might be worth making notes. I just made that one file writable
[09:39] I'll add this block though
[09:39] oh i didn't do much before..
[09:39] this is realtime debugging ;)
[09:39] but i need a short break
[09:40] -9 git as of what date or id?
[09:41] -10 is -9 + debug_bootmem?
[09:41] anything to note about -8 and boot.img-dapper-orig?
[09:42] I think -8 was just current as of that day, but I'm not sure about that
[09:43] -8 is old
[09:43] -dapper-orig is just the actual dapper kernel.
[09:43] -9 is just git as of yesterday
[09:43] no special id
[09:43] -10 is -9+DEBUG_BOOTMEM
[09:44] i didn't boot -10 yet
=== braddr nods.
[09:44] we are still booting -9
[09:44] I saw. Where are we with the memory boot ranges?
[09:45] 8176
[09:45] booting now
[09:45] highest good ?
[09:45] davem suspects that we are sharing a page with some firmware stuff and everything goes bad
[09:45] that's the bisect
[09:46] and it works
[09:46] so it's lower bound now
[09:46] okey
[09:46] so next is....
[09:46] 8182
[09:46] no
[09:46] 8184
[09:49] there
[09:50] sorry.. not paying attention to the boot attempts, updating the blog of what I've been doing with the box and the testing you've been doing.
[09:50] bah
[09:51] 8184 does error
[09:51] and i forgot m again
[09:51] so lower mark 8176 and higher mark is 8184
[09:51] a hah.. new top end point!
[09:55] new top 8180
[09:55] 8176 - 8180
[09:55] down to a 4m range
=== fabbione tests 8178
[09:57] Check cable and try again
[09:57] Link Down
[09:57] Timed out waiting for Autonegotation to complete
[09:57] what have you done?
[09:57] it went up again.. brrr
[09:57] scary
[09:57] odd.. nothing, but it retried and made it
=== braddr checks bellevue's dmesg log
[09:58] nothing interesting on that side
[09:59] ok 8178 is good
[09:59] 8178 - 8180
[09:59] last boot!
[09:59] nope...
[09:59] we need to get down to kb
[09:59] well, unless you wanna dip into the k's.
[09:59] heh
[09:59] 8179
[10:00] well ideally we need to find the page that's making this issue
=== braddr nods.
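Editor's note: the manual search above (2g boots, 8g fails, then 7936, 8064, 8128, 8176, 8184, 8180, 8178, 8179) is a plain bisection over the mem= value. A minimal sketch in Python, assuming a hypothetical `boots(mem_mb)` callback that stands in for "netboot with mem=<n>m and see whether it survives"; nothing like it exists in the log, where each probe was a real power cycle.

```python
def bisect_mem(good, bad, boots):
    """Narrow [good, bad] until they are adjacent.

    Invariant: boots(good) is True and boots(bad) is False.
    `boots` is a hypothetical stand-in for one manual boot attempt.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if boots(mid):
            good = mid   # mid boots: raise the low mark
        else:
            bad = mid    # mid fails: lower the high mark
    return good, bad


# Simulated machine whose tip-over point matches the log (8179 good, 8180 bad).
low, high = bisect_mem(1024, 8192, lambda mem_mb: mem_mb <= 8179)
print(low, high)
```

Each probe halves the range, so a 7168 MB window needs about 13 boots, which is why sticking with bisecting beats guessing.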
[10:00] a boundary of 4k would do
[10:01] get the page and have the kernel dump the entire thing before the crash and if possible as part of the oops or whatever that die point is
[10:01] the problem is that we can't dump the page
[10:02] as soon as we access that page the hypervisor blocks the system
[10:02] that's the error we are getting
[10:02] we need to understand what is there and why
[10:02] oh
[10:02] slip the guy a bit of valium, he's obviously a little too high strung.
[10:02] 8179 is good
[10:03] so now.. some maths :)
[10:04] 8375296 - 8376320
[10:04] range in kb
[10:04] 2 boots to establish an acceptable good and bad value?
[10:04] we will bisect that now :)
[10:04] it's just one mega
[10:05] so now it is 8179+512k
[10:07] i think we found it :)
[10:08] wow.. it doesn't even break!
[10:08] for a definition of we that has you doing all the work.
=== fabbione powersoff
[10:09] so low mark is still 8375296
[10:09] highmark is 8375808
[10:09] let's try in the middle
[10:09] worth repeating the last try?
[10:09] yes, but not now
[10:09] i want to test lowmark+256k
[10:10] if it still hangs, i want to test passing a known to work mem in kb
=== braddr nods
[10:10] we need to make 100% sure we are not hitting other bugs, like MB = 1000kb instead of 1024
=== braddr recalls suggesting that. :)
[10:11] i am pretty sure it's done in 2^ values
[10:11] but you may never know ;)
[10:11] all we need is a good boot to see it in the boot output
=== braddr eyes it warily.
[10:15] it's hanging again
[10:15] waits a few secs..
[10:15] wait.. is this the -10 kernel?
[10:15] -9
[10:15] we didn't boot 10 at all
[10:15] ok.. just checking
[10:15] this is the symptom we used to see with bootmem I think.
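Editor's note: the "some maths" step above is the MB-to-KB conversion of the two marks. A small Python sketch of that arithmetic, assuming binary units (1 MB = 1024 KB) as the log itself guesses; the helper name `mb_to_kb` is made up for illustration.

```python
def mb_to_kb(mb, extra_kb=0, kb_per_mb=1024):
    """Convert a mem= value in MB (plus an optional KB offset) to KB."""
    return mb * kb_per_mb + extra_kb


print(mb_to_kb(8179))        # low mark quoted in the log: 8375296
print(mb_to_kb(8180))        # high mark quoted in the log: 8376320
print(mb_to_kb(8179, 512))   # the "8179+512k" probe: 8375808

# The "MB = 1000kb" worry: a decimal conversion would land far below the low mark.
print(mb_to_kb(8179, kb_per_mb=1000))
```

If the kernel used 1000 KB per MB, the same nominal value would be roughly 192 MB short of the binary figure, so a single known-good boot showing the reported memory size is enough to rule it out.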
[10:15] yeps it hangs
=== braddr checks his notes
[10:16] probably because we are scraping the same page
[10:16] bootmem does basically clean all mem allocation
[10:16] oh neat
[10:17] sc> reset doesn't do the trick without poweroff/poweron
[10:18] now i am booting with 8179 in kbytes
[10:18] just to make sure we are converting it properly
[10:18] hrm.. my notes are insufficient
[10:18] and guess what.. we are not
=== braddr kicks himself.
[10:18] try it in m again
[10:18] yeps
[10:19] actually
[10:19] I was seeing this sort of boot hang with my hand built kernels, and I don't have enough notes on the differences. I did get one of them to boot and I want to say that bootmem was one of the differences, but I'm not confident enough in that memory.
[10:19] i think mem is in bytes..
[10:19] let me check
[10:20] it was definitely hanging to the point of needing to reset it
[10:20] just append a k
[10:20] remove the doubt
[10:21] mem=nn[KMG]
[10:21] it doesn't say the default
[10:21] so i guess it's bytes
[10:21] this is 8179m in k
[10:21] score
[10:23] so the last boots were just wrong
=== fabbione resets the counters
[10:23] ok.. so there's insufficient checking for minimum memory somewhere in early bootup. :)
[10:23] 8375296 - 8376320
[10:23] 8375296 (8179) = good
[10:24] 8376320 (8180) = booting now....
=== braddr hopes, for our sanity's sake, it fails
[10:26] math is not an opinion in these cases ;)
[10:26] sure sure.. but as I get older, I'm starting to believe more and more that hardware can be spiteful.
[10:26] hehe
[10:26] see.. boom
[10:27] yup
[10:27] 8375808k (8179+512k) = booting now..
=== braddr wanders to the kitchen to see what he can find to eat and drink. I kinda forgot to do something about dinner.
[10:28] ehhe
[10:28] yeah i will spend MAX another hour debugging
[10:28] I was gonna grab a subway sandwich or something before they closed, but that was 90 minutes ago.. oops.
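Editor's note: the units confusion above (a bare number versus mem=nn[KMG]) can be illustrated with a tiny parser. This is a sketch in Python, not the kernel's code; it assumes, as guessed in the log, that a value with no suffix is taken as bytes, and the function name `parse_mem` is made up.

```python
def parse_mem(value):
    """Parse a mem=nn[KMG] style value into bytes.

    Assumption (from the log's guess): no suffix means bytes.
    """
    shifts = {"k": 10, "m": 20, "g": 30}
    text = value.strip().lower()
    if text and text[-1] in shifts:
        return int(text[:-1]) << shifts[text[-1]]
    return int(text)


print(parse_mem("8179m"))      # 8576303104 bytes
print(parse_mem("8375296k"))   # same limit, written in KB
print(parse_mem("8375296"))    # no suffix: only about 8 MB
```

Under that assumption, dropping the trailing k or m shrinks the limit by a factor of 1024 or more, which would explain why the suffix-less boots hung so early and had to be thrown out.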
[10:28] it's sunday, i am tired and i want to get ready for the F1 race :P
[10:28] subway rocks!
[10:28] they're not bad.. good for the sandwich category.
[10:29] yeps
[10:29] back shortly
[10:29] sure
[10:30] bang
[10:30] new highmark :)
[10:30] lowmark= 8375296 (8179)
[10:30] highmark= 8375808k (8179+512k)
=== fabbione bisects
[10:34] 8375552 (8179+256k) = good
[10:38] 8375680 (8179+384k) = ahhhh new abort
[10:38] i think we got it
[10:39] that one just stops booting
[10:39] Booting Linux...
[10:39] Program terminated
[10:39] so it's hitting something
[10:42] the limit seems to be 8375552 (8179+256k) = good
[10:43] that + 128 = stops immediately
[10:43] that + 64 = init the console and stops
[10:43] clearly we don't catch the hypervisor error that early in the boot
[10:44] bang bang
[10:44] we are there
[10:50] summary:
[10:50] from mem=1024m to mem=8375552k (8179m+256k) is all good.
[10:50] 8375552k + 32k = hangs hard at Booting Linux...
[10:50] 8375552k + 64k = init the console and abort "Program terminated"
8375552k + 128k = "Booting Linux...\nProgram terminated"
[10:50] 8375552k + 256k = as above
[10:50] 8375552k + 512k = boots but stops at the hypervisor error we knew about.
[10:50] i am switching to -10
[11:00] http://people.ubuntu.com/~fabbione/t2k-memboot-results.txt
=== braddr captures that in his log as well.
[11:11] ok i am done for today :)
[11:12] good progress
[11:12] i guess we are pretty close now
[11:12] I look forward to hearing what davidm has to say about it
[11:13] yeah he went to sleep not too long ago
[11:13] so it will be not before tomorrow
=== braddr nods, "I should have, but luckily it's the weekend."
[11:13] i think i will try booting the bootmem kernel on my machine to see the diff
[11:13] but later :)
[11:14] have a good rest of the weekend
[11:14] thanks
[11:14] good night
=== ChanServ [ChanServ@services.] has joined #ubuntu-ports
=== ajmitch [n=ajmitch@203.89.166.123] has joined #ubuntu-ports
=== jb-home [n=jbailey@modemcable139.249-203-24.mc.videotron.ca] has joined #ubuntu-ports
=== ajmitch__ [n=ajmitch@203.89.166.123] has joined #ubuntu-ports
=== ajmitch [n=ajmitch@203.89.166.123] has joined #ubuntu-ports
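Editor's note: since the goal stated earlier was to find the page behind the crash, the last-good limit from the summary (mem=8375552k) can be turned into a byte offset and a page number. A rough Python sketch under two loud assumptions that the log does not confirm: that the mem= limit maps to a physical offset counted from 0, and that the page size is the 4k mentioned in the chat (passed as a parameter so another size can be tried). The helper names are made up.

```python
KIB = 1024


def limit_offset(mem_kb):
    """Byte offset of a mem= limit given in KB (assumes memory starts at 0)."""
    return mem_kb * KIB


def page_number(offset, page_size=4 * KIB):
    """Index of the page containing `offset` (page size is an assumption)."""
    return offset // page_size


last_good = limit_offset(8375552)
print(hex(last_good))              # 0x1ff340000
print(page_number(last_good))      # 2093888 with 4k pages
print(last_good % (8 * KIB) == 0)  # also aligned for 8k pages: True
```

On that reading, the first bytes that cause trouble sit just above offset 0x1ff340000, which is the kind of figure worth handing over alongside the results file.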