[12:03] <braddr> sounds like a potential pain, testing wise.
[12:04] <fabbione> yes
[12:04] <fabbione> indeed
[12:04] <fabbione> but the first thing i will do is to test your combination
[12:04] <fabbione> if i can reproduce it, i know what i need to tell people to do :)
[12:05] <fabbione> because the point is that the UP kernel is well.. UP
[12:05] <fabbione> it doesn't use anything of the SMP code if not to detect how many CPU's are available on the system
[12:05] <fabbione> and that's done parsing the OBP devicetree
[12:06] <fabbione> the test i did on your machine was to disable thread 0 of core 1
[12:06] <fabbione> (your booting CPU basically)
[12:07] <fabbione> because your CPU has no core+
[12:07] <fabbione> core0 even
[12:07] <fabbione> but the ALOM and OBP do some magic cpu virtual remapping
[12:07] <fabbione> so even if in reality the code is running on core1 thread0, the kernel sees it as cpu0
[12:07] <fabbione> that might be buggy
[12:09] <fabbione> braddr: all the cpu's are born as 8 cores
[12:09] <braddr> ah
[12:09] <fabbione> the 6/4 core models are 8 core where 8 - $x core did fail
[12:09] <fabbione> so they get disabled in hw
[12:10] <fabbione> if you do a showcomponent in the ALOM you can see it
[12:10] <fabbione> the first block of MP/CPU0/.. are the core/thread set
[12:10] <fabbione> the second block (8 entries) are the 4 dual bus for memory access
[12:10] <fabbione> MB/CPU0/R0/D1 iirc
[12:11] <fabbione>     MB/CMP0/P0 
[12:11] <fabbione>  <- core0 thread0
[12:11] <braddr> so, barring remapping, cpu p0-3 and 20-31 were disabled
[12:11] <fabbione> you don't have that
[12:12] <fabbione>     MB/CMP0/CH0/R0/D0 <- memory 
[12:12] <fabbione> basically yes. the 4 cores that did not pass the test have been disabled
[12:12] <fabbione> that's also why (as you can see) there are "holes"
[12:13] <fabbione> one thing that did puzzle me also was the ALOM POST process
[12:14] <braddr> so, on your and david's systems, at least one of them has 0-3 disabled?
[12:14] <fabbione> in both your machine and my machine i did disable the booting CPU
[12:14] <fabbione> and the POST was still running on it..
[12:14] <fabbione> so that's what triggered a bell
[12:14] <fabbione> nope.. on all our machines there is a core0/thread0 working
[12:15] <fabbione> david has 1x8 and 1x6
[12:15] <fabbione> i have 1x6 and 2x8
[12:15] <fabbione> but all with core0
[12:15] <fabbione> yours is "special"
[12:15] <fabbione> eheh
[12:15] <fabbione> but we will figure it out
[12:16] <fabbione> it might as well be a case we don't handle properly OR a combination of bugs
[12:16] <fabbione> we will see soon hopefully
[12:16] <braddr> it'll get sorted out, I have no doubts.
[12:16] <fabbione> yepos
[07:59] <braddr> hark.. is that the sound of a jet passing overhead or did someone just turn on my t2000? :P
[07:59] <fabbione> morning braddr :)
[08:01] <braddr> g'morning
[08:02] <braddr> brb.. my windows desktop's networking is acting all screwy.. gonna reboot.
[08:08] <braddr> what's on the testing block for today/tonight?
[08:09] <fabbione> david has a possible workaround for your problem, but i think he is actually eating to power up brain and brainstorm on a proper solution
[08:10] <fabbione> i just woke up and i am going to test another fix on your box
[08:10] <fabbione> "another possible fix"
[08:16] <fabbione> boot.img-debian-sparc <- tsk ;)
[08:16] <fabbione> debian doesn't support Niagara :D
[08:17] <braddr> hey, fire me. :)
[08:18] <braddr> notice the date on that file?
[08:19] <fabbione> nah.. i was just teasing :P
[08:19] <braddr> I know
[08:30] <fabbione> these are the moments in which i wish gzip could fork on N cpu's
[08:30] <braddr> heh
[08:30] <braddr> yup.. lots of cpu's just don't help single threaded apps.
[08:39] <fabbione> about to boot the new image.. 
[08:40] <fabbione> you can go in console read only mode if you like
[08:40] <fabbione> boot.img-fabbione-9 <-
[08:42] <fabbione> oh well
[08:42] <braddr> kablooie
[08:43] <fabbione> crap
[08:45] <braddr> shouldn't need to be that verbose.. just boot net mem=1024k should be enough
[08:45] <fabbione> DOUBLE TYPO
[08:45] <fabbione> go fabion!
[08:45] <braddr> er.. 1024m
[08:45] <fabbione> it depends on the tftpd you are using
[08:46] <fabbione> tftp-hpa doesn't support broadcast
[08:46] <fabbione> so you need to specify the tftp server address
[08:46] <braddr> It's been working fine for me forever. :)
[08:46] <fabbione> it doesn't here and never did
[08:46] <fabbione> might be my anal firewall
[08:48] <fabbione> THERE WE GO
[08:48] <braddr> I wasn't watching, but just saw the screen reset.
[08:48] <fabbione> i am at the language selector
[08:49] <braddr> yup.. I see. :)
[08:49] <fabbione> is there an extra disk i can use to test an install?
[08:49] <fabbione> i would like to make sure that it doesn't crash later
[08:50] <braddr> drop to a shell and mount the large partition?
[08:50] <braddr> I definitely don't have a spare disk, but I'm thinking that it'd be fine to reinstall on disk1
[08:50] <fabbione> ok.. give me a few minutes to get to the partitioner
[08:50] <fabbione> otherwise we can't mount anything
[08:51] <braddr> might be quicker to just boot that disk. :)
[08:51] <braddr> on the other hand.. gotta get that far to do an install for real anyway
[08:52] <fabbione> yeah i think i am going to check a couple of more things before we do the install
[08:55] <fabbione> yeps.. ok it's reproducible
[08:55] <fabbione> let's see what's the watermark for the issue to appear
[08:56] <braddr> want me to take control for a sec and check that disk for wipe-ability?
[08:56] <fabbione> in a few minutes
[08:56] <fabbione> if that's ok with you
[08:56] <braddr> take your time
[08:56] <fabbione> thanks
[08:57] <fabbione> this mem thing opens a completely new frontier :)
[08:57] <braddr> the amount of memory seems to be a factor?  Bizarre.
[08:57] <fabbione> i have 8GB too
[08:57] <fabbione> and it works
[08:58] <fabbione> it might be a very complex combination at this point
[08:59] <fabbione> 2g boots
[09:06] <braddr> kablooie
[09:07] <fabbione> yeps
[09:07] <fabbione> that's good..
[09:07] <fabbione> it was expected
[09:10] <braddr> if this goes well.. try 8G-1M or -1k even
[09:10] <fabbione> yeah
[09:13] <braddr> interesting
[09:13] <fabbione> yes
[09:15] <braddr> oh, sure, return to bisecting.
[09:15] <fabbione> yeps
[09:16] <fabbione> it might even be 7G+1m
[09:16] <fabbione> hem
[09:18] <braddr> stick with bisecting.. it won't take that many steps to figure out.
[09:18] <fabbione> yeah i know
[09:18] <fabbione> i forgot the m at the end
[09:19] <braddr> and thus the upper bound has been changed.
[09:20] <braddr> er, lower bound
[09:20] <fabbione> yeps
[09:20] <fabbione> except i keep forgetting the m :)
[09:20] <braddr> what's it default to, bytes?
[09:21] <fabbione> kb iirc
[09:21] <braddr> 7680-8191..
[09:22] <fabbione> no worries.. that one is normal because i did interrupt the boot
[09:22] <braddr> pick a number, any number.. where will the tip over point land.
[09:22] <fabbione> 7936m <- 7.75G
[09:25] <braddr> 7936-8192.. I'll put my money on 7999/8000 for no good reason. :)
[09:26] <fabbione> 8064 now :)
[09:26] <fabbione> it might even be bad ram
[09:26] <fabbione> in UP used for something and it crashes
[09:27] <fabbione> in SMP used by some init tables that are not used once init is done
[09:27] <fabbione> that would be ironic
[09:28] <braddr> that would be hilarious and certainly ironic.
[09:28] <braddr> I certainly haven't done much that stresses memory to hell and back.
[09:29] <fabbione> 8064 is good
[09:29] <braddr> 8064-8191.. so much for my first pointless guess
[09:29] <braddr> time for easier numbers.. 8100
[09:30] <fabbione> that would be.... 8064+64
[09:30] <fabbione> 8128 ?
[09:30] <fabbione> yeah
[09:30] <fabbione> i am pretty sure this will work too
[09:34] <fabbione> last 32M :)
[09:34] <braddr> 8128-8191, running out of midpoints.
[09:35] <fabbione> for your blog -9 is just the dapper kernel from git with security fixes and stuff
[09:35] <fabbione> -10 that i am about to boot after the bisect had DEBUG_BOOTMEM on
[09:36] <braddr> on bellevue: ~braddr/.www/t2000/boots.txt -- I just made it world writable, so you can edit
[09:38] <fabbione> nah i am not going to edit www please :)
[09:38] <fabbione> so this one boots too
[09:38] <braddr> well, you've been testing stuff when I'm not around, so it might be worth making notes.  I just made that one file writable
[09:39] <braddr> I'll add this block though
[09:39] <fabbione> oh i didn't do much before..
[09:39] <fabbione> this is realtime debugging ;)
[09:39] <fabbione> but i need a short break
[09:40] <braddr> -9 git as of what date or id?
[09:41] <braddr> -10 is -9 + debug_bootmem?
[09:41] <braddr> anything to note about -8 and boot.img-dapper-orig?
[09:42] <braddr> I think -8 was just current as of that day, but I'm not sure about that
[09:43] <fabbione> -8 is old
[09:43] <fabbione> -dapper-orig is just the actual dapper kernel.
[09:43] <fabbione> -9 is just git as of yesterday
[09:43] <fabbione> no special id
[09:43] <fabbione> -10 is -9+DEBUG_BOOTMEM
[09:44] <fabbione> i didn't boot -10 yet
[09:44] <fabbione> we are still booting -9
[09:44] <braddr> I saw.  Where are we with the memory boot ranges?
[09:45] <fabbione> 8176
[09:45] <fabbione> booting now
[09:45] <braddr> highest good ?
[09:45] <fabbione> davem suspects that we are sharing a page with some firmware stuff and everything goes bad
[09:45] <fabbione> that's the bisect
[09:46] <fabbione> and it works
[09:46] <fabbione> so it's lower bound now
[09:46] <braddr> okey
[09:46] <fabbione> so next is....
[09:46] <fabbione> 8182
[09:46] <fabbione> no
[09:46] <fabbione> 8184
[09:49] <fabbione> there
[09:50] <braddr> sorry.. not paying attention to the boot attempts, updating the blog of what I've been doing with the box and the testing you've been doing.
[09:50] <fabbione> bah
[09:51] <fabbione> 8184 does error
[09:51] <fabbione> and i forgot m again
[09:51] <fabbione> so lower mark 8176 and higher mark is 8184
[09:51] <braddr> a hah.. new top end point!
[09:55] <fabbione> new top 8180
[09:55] <fabbione> 8176 - 8180
[09:55] <braddr> down to a 4m range
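Note: the narrowing done by hand above is a plain bisection between a known-good and a known-bad `mem=` value. A minimal sketch in Python, assuming a hypothetical `boots(mem_mb)` callback that stands in for one real boot attempt (nothing like it exists in the log; each "call" there was a manual netboot):

```python
def bisect_mem(lo_good, hi_bad, boots, step=1):
    """Narrow (lo_good, hi_bad) until the two marks are at most `step` apart.

    lo_good must be a value that boots, hi_bad a value that fails.
    `boots` is a hypothetical callback standing in for a real boot attempt.
    """
    while hi_bad - lo_good > step:
        mid = (lo_good + hi_bad) // 2
        if boots(mid):
            lo_good = mid   # boot worked: raise the low mark
        else:
            hi_bad = mid    # boot failed: lower the high mark
    return lo_good, hi_bad


# Illustrative stand-in only: pretend everything above 8179 MB fails.
def fake_boots(mem_mb):
    return mem_mb <= 8179


print(bisect_mem(1024, 8192, fake_boots))
```

The midpoints will not match the ones picked in the channel, since those were chosen by hand, but the low mark only ever moves up on a good boot and the high mark only ever moves down on a bad one.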
[09:57] <fabbione> Check cable and try again
[09:57] <fabbione> Link Down
[09:57] <fabbione> Timed out waiting for Autonegotation to complete
[09:57] <fabbione> what have you done?
[09:57] <fabbione> it went up again.. brrr
[09:57] <fabbione> scary
[09:57] <braddr> odd.. nothing, but it retried and made it
[09:58] <braddr> nothing interesting on that side
[09:59] <fabbione> ok 8178 is good
[09:59] <fabbione> 8178 - 8180
[09:59] <braddr> last boot!
[09:59] <fabbione> nope...
[09:59] <fabbione> we need to get down to kb
[09:59] <braddr> well, unless you wanna dip into the k's.
[09:59] <braddr> heh
[09:59] <fabbione> 8179
[10:00] <fabbione> well ideally we need to find the page that's making this issue
[10:00] <fabbione> a boundary of 4k would do
[10:01] <braddr> get the page and have the kernel dump the entire thing before the crash and if possible as part of the oops or whatever that die point is
[10:01] <fabbione> the problem is that we can't dump the page
[10:02] <fabbione> as soon as we access that page the hypervisor blocks the system
[10:02] <fabbione> that's the error we are getting
[10:02] <fabbione> we need to understand what is there and why
[10:02] <braddr> oh
[10:02] <braddr> slip the guy a bit of valium, he's obviously a little too high strung.
[10:02] <fabbione> 8179 is good
[10:03] <fabbione> so now.. some maths :)
[10:04] <fabbione> 8375296 - 8376320
[10:04] <fabbione> range in kb
[10:04] <braddr> 2 boots to establish an acceptable good and bad value?
[10:04] <fabbione> we will bisect that now :)
[10:04] <fabbione> it's just one mega
[10:05] <fabbione> so now it is 8179+512k
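Note: the kilobyte figures quoted above follow from 1 MB = 1024 kB. A small Python sketch of that arithmetic, plus a rough count of how many more halvings a 1 MB range needs to reach the 4k boundary mentioned earlier (the helper names are made up for illustration):

```python
KB_PER_MB = 1024


def mb_to_kb(mb):
    """Convert megabytes to kilobytes using 1 MB = 1024 kB."""
    return mb * KB_PER_MB


def halvings_needed(range_kb, target_kb):
    """Count how many times a range must be halved to reach target_kb."""
    steps = 0
    while range_kb > target_kb:
        range_kb //= 2
        steps += 1
    return steps


low_kb = mb_to_kb(8179)            # last known-good value, in kB
high_kb = mb_to_kb(8180)           # first known-bad value, in kB
mid_kb = (low_kb + high_kb) // 2   # 8179 MB + 512 kB

print(low_kb, high_kb, mid_kb)
print(halvings_needed(high_kb - low_kb, 4))
```

This only checks the numbers; it says nothing about which unit the kernel assumes when `mem=` is given without a suffix.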
[10:07] <fabbione> i think we found it :)
[10:08] <fabbione> wow.. it doesn't even break!
[10:08] <braddr> for a definition of we that has you doing all the work.
[10:09] <fabbione> so low mark is still 8375296
[10:09] <fabbione> highmark is 8375808
[10:09] <fabbione> let's try in the middle
[10:09] <braddr> worth repeating the last try?
[10:09] <fabbione> yes, but not now
[10:09] <fabbione> i want to test lowmark+256k
[10:10] <fabbione> if it still hangs, i want to test passing a known to work mem in kb
[10:10] <fabbione> we need to make 100% sure we are not hitting other bugs, like MB = 1000kb instead of 1024
[10:11] <fabbione> i am pretty sure it's done in 2^ values
[10:11] <fabbione> but you may never know ;)
[10:11] <braddr> all we need is a good boot to see it in the boot output
[10:15] <fabbione> it's hanging again
[10:15] <fabbione> waits a few secs..
[10:15] <braddr> wait.. is this the -10 kernel?
[10:15] <fabbione> -9
[10:15] <fabbione> we didn't boot 10 at all
[10:15] <braddr> ok.. just checking
[10:15] <braddr> this is the symptom we used to see with bootmem I think.
[10:15] <fabbione> yeps it hangs
[10:16] <fabbione> probably because we are scraping the same page
[10:16] <fabbione> bootmem does basically clean all mem allocation
[10:16] <fabbione> oh neat
[10:17] <fabbione> sc> reset doesn't do the trick without poweroff/poweron
[10:18] <fabbione> now i am booting with 8179 in kbytes
[10:18] <fabbione> just to make sure we are converting it properly
[10:18] <braddr> hrm.. my notes are insufficient
[10:18] <fabbione> and guess what.. we are not
[10:18] <braddr> try it in m again
[10:18] <fabbione> yeps
[10:19] <fabbione> actually
[10:19] <braddr> I was seeing this sort of boot hang with my hand built kernels, and I don't have enough notes on the differences.  I did get one of them to boot and I want to say that bootmem was one of the differences, but I'm not confident enough in that memory.
[10:19] <fabbione> i think mem is in bytes..
[10:19] <fabbione> let me check
[10:20] <braddr> it was definitely hanging to the point of needing to reset it
[10:20] <braddr> just append a k
[10:20] <braddr> remove the doubt
[10:21] <fabbione> mem=nn[KMG] 
[10:21] <fabbione> it doesn't say the default
[10:21] <fabbione> so i guess it's bytes
[10:21] <fabbione> this is 8179m in k
[10:21] <fabbione> score
[10:23] <fabbione> so the last boots were just wrong
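Note: a minimal Python sketch of the `mem=nn[KMG]` convention as it is understood at this point in the log, assuming a value without a suffix is taken as bytes. `parse_mem` is a made-up helper for illustration, not the kernel's own parser:

```python
def parse_mem(arg):
    """Parse a mem= value such as '8179m' into bytes.

    Assumption (from the discussion above): no suffix means bytes.
    Suffixes are treated as powers of two: K = 2**10, M = 2**20, G = 2**30.
    """
    suffixes = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}
    unit = suffixes.get(arg[-1].lower())
    if unit is None:
        return int(arg)              # bare number: taken as bytes
    return int(arg[:-1]) * unit


# With the suffix, the kB value matches the MB value exactly.
print(parse_mem("8375296k") == parse_mem("8179m"))

# Without the suffix, the same digits are only about 8 MB of RAM.
print(parse_mem("8375296") / (1 << 20))
```

Under that assumption, the suffix-less boots asked the kernel to run in roughly 8 MB, which would explain hangs unrelated to the page being hunted.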
[10:23] <braddr> ok.. so there's insufficient checking for minimum memory somewhere in early bootup. :)
[10:23] <fabbione> 8375296 - 8376320
[10:23] <fabbione> 8375296 (8179) = good
[10:24] <fabbione> 8376320 (8180) = booting now....
[10:26] <fabbione> math is not an opinion in these cases ;)
[10:26] <braddr> sure sure.. but as I get older, I'm starting to believe more and more that hardware can be spiteful.
[10:26] <fabbione> hehe
[10:26] <fabbione> see.. boom
[10:27] <braddr> yup
[10:27] <fabbione> 8375808k (8179+512k) = booting now..
[10:28] <fabbione> ehhe
[10:28] <fabbione> yeah i will spend MAX another hour debugging
[10:28] <braddr> I was gonna grab a subway sandwich or something before they closed, but that was 90 minutes ago.. oops.
[10:28] <fabbione> it's sunday, i am tired and i want to get ready for the F1 race :P
[10:28] <fabbione> subway rocks!
[10:28] <braddr> they're not bad.. good for the sandwich category.
[10:29] <fabbione> yeps
[10:29] <braddr> back shortly
[10:29] <fabbione> sure
[10:30] <fabbione> bang
[10:30] <fabbione> new highmark :)
[10:30] <fabbione> lowmark= 8375296 (8179)
[10:30] <fabbione> highmark= 8375808k (8179+512k)
[10:34] <fabbione> 8375552 (8179+256k) = good
[10:38] <fabbione> 8375680 (8179+384k) = ahhhh new abort
[10:38] <fabbione> i think we got it
[10:39] <fabbione> that one just stops booting
[10:39] <fabbione> Booting Linux...
[10:39] <fabbione> Program terminated
[10:39] <fabbione> so it's hitting something
[10:42] <fabbione> the limit seems to be 8375552 (8179+256k) = good
[10:43] <fabbione> that + 128 = stops immediately
[10:43] <fabbione> that + 64 = init the console and stops
[10:43] <fabbione> clearly we don't catch the hypervisor error that early in the boot
[10:44] <fabbione> bang bang
[10:44] <fabbione> we are there
[10:50] <fabbione> summary:
[10:50] <fabbione> from mem=1024m to mem=8375552k (8179m+256k) is all good.
[10:50] <fabbione> 8375552k + 32k  = hangs hard at Booting Linux...
[10:50] <fabbione> 8375552k + 64k  = init the console and abort "Program terminated"
[10:50] <fabbione> 8375552k + 128k = "Booting Linux...\nProgram terminated"
[10:50] <fabbione> 8375552k + 256k = as above
[10:50] <fabbione> 8375552k + 512k = boots but stops at the hypervisor error we knew about.
[10:50] <fabbione> i am switching to -10
[11:00] <fabbione> http://people.ubuntu.com/~fabbione/t2k-memboot-results.txt
[11:11] <fabbione> ok i am done for today :)
[11:12] <braddr> good progress
[11:12] <fabbione> i guess we are pretty close now 
[11:12] <braddr> I look forward to hearing what davidm has to say about it
[11:13] <fabbione> yeah he went to sleep not too long ago
[11:13] <fabbione> so it will be not before tomorrow
[11:13] <fabbione> i think i will try booting the bootmem kernel on my machine to see the diff
[11:13] <fabbione> but later :)
[11:14] <braddr> have a good rest of the weekend
[11:14] <fabbione> thanks
[11:14] <fabbione> good night