[08:46] <braddr_> g'evening.
[08:46] <fabbione> hey brad
[08:46] <fabbione> that error message is really weird
[08:47] <fabbione> do you have a stock T2000 or did you add pci cards?
[08:47] <braddr_> stock
[08:47] <fabbione> same here
[08:47] <braddr_> and solaris boots fine
[08:48] <fabbione> ok.. just one sec..
[08:48] <braddr_> it's been over 10 years since I played with solaris though.. so I'm so lost in it. :)
[08:48] <fabbione> don't worry :)
[08:49] <fabbione> can you boot in solaris in the meanwhile?
[08:49] <braddr_> if you have something I should look at, sure.
[08:49] <fabbione> we need to figure out what is different on your box from the one david and I have
[08:49] <braddr_> righto.. booting it up
[08:49] <braddr_> well, once the extremely long POST finishes.
[08:50] <fabbione> yeah eheh
[08:50] <fabbione> at least i hope you are not sitting 3 feet from it
[08:50] <fabbione> because it's so damn noisy it's killing me
[08:50] <braddr_> oh, I am.
[08:50] <fabbione> you have ALL my understanding
[08:51] <braddr_> I have the rails in the rack, but I gotta move another box before I can actually rack this beast.
[08:51] <fabbione> once you are in solaris, can you please slam somewhere the output of prtconf -v -p
[08:51] <braddr_> roger
[08:51] <fabbione> i have it rack mounted..
[08:51] <fabbione> but there aren't that many people with a rack at home
[08:51] <fabbione> really..
[08:51] <braddr_> I'm rather tempted to yank 2 of the 3 front fans.
[08:52] <braddr_> count me among the unusual.
[08:52] <fabbione> lol
[08:52] <braddr_> but this box is easily the loudest of the machines I have.
[08:52] <fabbione> even when i boot up the entire SAN is less noisy!
[08:52] <braddr_> ok.. booted.  lemee grab that info
[08:53] <braddr_> I can give you an account if you'd like.. it's on the public net.
[08:53] <fabbione> let's see if we really need it first
[08:53] <braddr_> the sc-net part isn't though
[08:55] <fabbione> meh
[08:55] <fabbione> it's easier to pipe it to a file
[08:55] <fabbione> :)
[08:56] <braddr_> oh sure, point out me being stupid.. duh.
[08:56] <braddr_> ok.. it's at the same url, prtconf.txt
[08:57] <fabbione> thanks
[08:57] <braddr_> hrm.. only 16 cpu nodes?  This was supposed to be an 8 core model.
[08:58] <fabbione> mine was supposed to be 8 cores too and i got 6
[08:58] <fabbione> with 16 cpus you have 4 cores
[08:59] <braddr_> I'll send a nastygram to my sun contact tomorrow.
[09:00] <braddr_> another one.  They also didn't send the serial management cable.. had to... improvise.
[09:01] <fabbione> ah
[09:01] <fabbione> i had 2
[09:01] <braddr_> last sun box I had didn't _have_ the nifty management stuff.
[09:01] <fabbione> eheh
[09:02] <fabbione> i am still checking the output
[09:02] <fabbione> i am not as fast as david here :)
[09:02] <braddr_> s'ok.. I'm not in any hurry.
[09:02] <braddr_> I wish I was able to figure this stuff out on my own.. not used to being this handicapped.
[09:03] <fabbione> don't worry
[09:03] <fabbione> the only real difference i can see is that on your machine there is nvramrc in use
[09:03] <fabbione> +        use-nvramrc?:  'true'
[09:03] <fabbione> could you go back to the OBP and do:
[09:04] <fabbione> setenv use-nvramrc? false
[09:04] <fabbione> or
[09:04] <fabbione> setenv use-nvramrc?=false
[09:04] <fabbione> i can never remember
[09:04] <braddr_> not sure why that'd be. I didn't do any openboot config changes.  Sure.
[09:04] <braddr_> the former
[09:04] <fabbione> it was turned off on mine
[09:05] <fabbione> but it's the only immediate diff i can see
[09:05] <braddr_> hrm.. resume isn't getting me back into the os.
[09:05] <fabbione> go
[09:05] <braddr_> oh, right.
[09:06] <fabbione> ok> go
[09:06] <braddr_> ok.. it shows as false now.
[09:06] <braddr_> ... in prtconf
[09:06] <fabbione> perfect
[09:06] <fabbione> now you could try to netboot?
[09:06] <braddr_> reboot and.. roger.
[09:06] <fabbione> thanks
[09:07] <fabbione> i am trying to reproduce it the other way around here in the meantime
[09:07] <braddr_> btw, if you have a set, my range earmuffs do a _great_ job of cancelling the noise of those fans.
[09:07] <fabbione> eheh
[09:08] <fabbione> i use headphone set + metallica at 200% of the volume
[09:08] <braddr_> damaging your hearing further.
[09:08] <fabbione> with this treatment i can barely hear my own brain
[09:08] <fabbione> ;)
[09:08] <braddr_> ok.. it's into the boot sequence now.. at 9600
[09:09] <fabbione> i feel your pain
[09:09] <braddr_> no difference.
[09:09] <fabbione> ok we will need to wait for david
[09:09] <fabbione> he doesn't irc much..
[09:09] <fabbione> but i already sent him the info
[09:09] <braddr_> yeah.. kinda a popular guy
[09:09] <braddr_> he'd get swarmed.
[09:10] <fabbione> yeah i know
[09:12] <braddr_> based on that stack trace, that's probably the first disk access of some sort, no?
[09:14] <fabbione> there is no disk access at that time
[09:14] <fabbione> it's loading from the initramfs
[09:14] <fabbione> or ramdisk
[09:14] <fabbione> but it loads fine here
[09:14] <braddr_> [   17.001666]  checking if image is initramfs... it is
[09:14] <braddr_> [    3.293044]  Freeing initrd memory: 3832k freed
[09:15] <fabbione> yeps
[09:15] <fabbione> at that point where it fails it is starting up the installer
[09:16] <fabbione> it seems we found one relevant difference
[09:16] <fabbione> you have 8 cpus
[09:16] <fabbione> sorry 16
[09:16] <fabbione> i have 24
[09:16] <fabbione> david 32
[09:16] <braddr_> though this seems to be a non-smp kernel, since it only inits one
[09:16] <fabbione> the sparc kernel is still able to probe the amount of CPU's installed on the system
[09:17] <fabbione> the diff is that UP inits only 1
[09:17] <fabbione> SMP enables the ones you ask for :)
[09:17] <braddr_> right.. just pointing out another potential difference.. david's not likely booting a non-smp kernel.
[09:18] <fabbione> he did :)
[09:18] <braddr_> I'm sure he _has_, but not nearly as much as smp.
[09:20] <braddr_> is there an older image that'd be worth trying?
[09:20] <braddr_> david's email hints that there was/is
[09:20] <fabbione> yes just one second that i am collecting info for david
[09:21] <braddr_> okey
[09:22] <fabbione> http://ports.ubuntu.com/ubuntu-ports/dists/dapper/main/installer-sparc/20051026ubuntu26/images/sparc64/netboot/2.6/
[09:22] <fabbione> you can try this one
[09:22] <fabbione> but i can't guarantee it's "old" enough or that it will work
[09:23] <braddr_> oh, no worries, just looking to broaden the facts on hand
[09:28] <braddr_> no change
[09:28] <braddr_> ... other than the version number of the build
[09:28] <fabbione> ok
[09:28] <fabbione> we are looking at it
[09:29] <braddr_> can I bring you a coke.. maybe a pizza? :)
[09:29] <fabbione> i just woke up.. coffee would do :P
[09:29] <braddr_> hrm.. gonna guess the round trip might be kinda longish.
[09:29] <fabbione> ehhe
[09:31] <braddr_> for the sake of completeness, the hex number at the end is slightly different: Error at TPC[4ca2bc]  with -20, and 4c9f9c with -19
[09:31] <braddr_> probably meaningless
[09:32] <fabbione> ok thanks
[09:33] <fabbione> i need a few minutes to upgrade my box and do some debugging...
[09:34] <fabbione> are you using tftp booting i assume?
[09:35] <braddr_> you and dave talking via some im network?  I'd love to observe just to soak up a bit of background.
[09:35] <fabbione> braddr_: IM
[09:35] <fabbione> braddr_: can you boot with:
[09:36] <fabbione> boot net max_cpus=1
[09:36] <fabbione> ?
[09:36] <fabbione> if it errors the same way we have some clues :)
[09:36] <braddr_> on it.
[09:36] <fabbione> thanks
[09:37] <braddr_> I can easily move the sc-net port to a public ip address, too, if that'd help.
[09:37] <fabbione> nah
[09:37] <braddr_> this boot is with -19 still, whoops.
[09:37] <fabbione> it's ok
[09:37] <fabbione> eheh
[09:37] <fabbione> -20- is better
[09:37] <braddr_> close enough?
[09:37] <braddr_> ok.. I'll redo
[09:38] <braddr_> much different this time
[09:38] <fabbione> is it?
[09:38] <braddr_> I'm at the choose a language part of the installer
[09:39] <fabbione> AH
[09:39] <braddr_> with -19
[09:39] <fabbione> can you try with -20- please?
[09:39] <braddr_> it finished while I was moving the images around
[09:39] <fabbione> ok
[09:40] <braddr_> hrm.. having trouble getting back to openboot
[09:40] <fabbione> telnet to the alom
[09:40] <fabbione> and do a break
[09:41] <fabbione> you can login multiple times to alom
[09:41] <braddr_> thanks.
[09:42] <braddr_> does the linux kernel disable sending break to get to openboot?
[09:42] <fabbione> no afaik
[09:43] <fabbione> at sc> reset 
[09:43] <braddr_> odd.. with -20 it looks like the same failure.. lemee scroll up to make sure I booted it correctly
[09:44] <braddr_> Kernel command line: max_cpus=1
[09:44] <braddr_> so.. -19 works, but -20 doesn't.
[09:45] <braddr_> ... with 1 cpu.  neither does with all of 'em.
[09:45] <fabbione> ok we are checking some stuff..
[09:47] <fabbione> i suggest to try to reproduce that it works on -19- and fails on -20-.. but can you please poweroff in between?
[09:47] <fabbione> with max_cpus=1
[09:47] <braddr_> let me go back to -19 w/o a power off.  Doing that introduced a several minute wait cycle.
[09:48] <fabbione> yes waiting is not an issue
[09:48] <fabbione> i want to know if the boot with -19- is reproducible
[09:48] <braddr_> ok, then a powercycle then boot net w/ -19
[09:49] <fabbione> exactly
[09:49] <braddr_> 12:49:10am start
[09:49] <fabbione> ehhe
[09:51] <braddr_> ok.. I just checked, I've got an 11am meeting at work tomorrow, so I'm gonna put a cap of 2 hours on this debugging before I'll have to bug out and get some sleep.
[09:51] <fabbione> ok 
[09:52] <braddr_> on the other hand, that could change depending on how close we are. :)
[09:53] <fabbione> even if we get to narrow down the problem, it will take at least a few hours to get the fix in the kernel (assuming it is a kernel issue) and propagate it to an installer
[09:53] <fabbione> tho i could build a custom one for you
[09:54] <braddr_> assuming -19's successful boot is reproducible, that ought to give a fairly large hint I assume.  and I also assume we won't know for sure that the problem is fixed w/o a test kernel/image.
[09:55] <fabbione> yes to both
[09:55] <fabbione> we know what changed between 19 and 20
[09:56] <fabbione> but there is still the rootcause of the max_cpus=1
[09:56] <fabbione> specially because what you are booting is a UP kernel
[09:56] <fabbione> and theoretically max_cpus=1 has no meaning
[09:56] <braddr_> ok.. booting now
[09:57] <braddr_> Requesting Internet Address for 0:14:4f:f:10:60
[09:57] <braddr_> ERROR: /pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2: Last Trap: Fast Instruction Access MMU Miss
[09:57] <braddr_> [Exception handlers interrupted, please file a bug] 
[09:57] <braddr_> [type 'resume' to attempt a normal recovery] 
[09:57] <braddr_> {0} ok resume
[09:57] <braddr_> Error -256 
[09:57] <braddr_> odd.
[09:57] <fabbione> did you break while it was already booting?
[09:57] <braddr_> sorta.. I hit break after the no keyboard part of the early boot
[09:57] <fabbione> ok
[09:58] <fabbione> for your own sanity set:
[09:58] <braddr_> so that's sorta 'normal'?
[09:58] <fabbione> setenv auto-boot? false :)
[09:58] <fabbione> yes
[09:58] <braddr_> it's into the boot sequence now.. just a second more
[09:58] <fabbione> sure
[09:59] <braddr_> damn.. same __lookup_hash stacktrace
[09:59] <fabbione> ok thanks
[09:59] <braddr_> I might be tired, but I know I saw it boot once. :)
[09:59] <fabbione> want to recheck the boot parameters?
[09:59] <fabbione> yes i know.. it might be "lucky"
[09:59] <braddr_> already did.. lemee paste
[09:59] <braddr_> Kernel command line: max_cpus=1
[09:59] <fabbione> ok
[10:00] <fabbione> stay tuned.. :)
[10:00] <braddr_> and just checked, it's definitely the -19 image
[10:00] <braddr_> gonna set the auto-boot part for next time.
[10:00] <braddr_> ok.. and it's a 10min boot cycle. :)
[10:14] <fabbione> braddr_: i have some good news and some bad news...
[10:14] <fabbione> which one do you want first?
[10:14] <braddr_> how about in that order
[10:14] <fabbione> the good news is that David has a pretty good idea of what is going on.
[10:15] <braddr_> and the bad news is that he wants sleep
[10:15] <fabbione> the bad news is that we will have to wait until mid-week because he is busy pushing some .17 stuff to Linus
[10:15] <fabbione> he doesn't sleep.. we don't usually allow him...
[10:15] <braddr_> likely story. :P
[10:16] <braddr_> ok.. well, I'll tinker with solaris for a while then.
[10:16] <fabbione> anyway it looks like that there is some heavy memory corruption problem
[10:16] <braddr_> and see if I can convince the sun weenies to ship me the box I asked for.
[10:16] <fabbione> that happens early in the boot
[10:16] <fabbione> way before where we see the OOPS
[10:16] <fabbione> the strange things are:
[10:16] <fabbione> - we don't see it with 24/32 cpu's
[10:17] <fabbione>   so it might be specific to the 16 cpu's one
[10:17] <fabbione> perhaps a bit that overflows or something like that... go figure
[10:17] <fabbione> so that machine is gold now
[10:17] <braddr_> well, it's not going anywhere.
[10:17] <fabbione> perfect
[10:18] <braddr_> I'll see if I can find a reliable pattern to get it past the bootup problem, since it did work once.
[10:19] <fabbione> i am afraid that it was more a lucky boot than something you can actually reproduce
[10:19] <braddr_> just got 'lucky' again.
[10:19] <fabbione> with -20- ?
[10:19] <braddr_> no, -19
[10:20] <braddr_> and with max_cpus=1
[10:21] <fabbione> and the weird thing is that max_cpus is useless.... go figure...
[10:21] <fabbione> can you try -19- without max_cpus?
[10:21] <braddr_> after this round, sure.
[10:21] <fabbione> ok thanks
[10:23] <braddr_> failed this cycle (19 w/ max_cpus=1)
[10:24] <braddr_> trying w/o max_cpus
[10:25] <fabbione> ok
[10:26] <braddr_> failed
[10:26] <fabbione> i think there is no real pattern..
[10:26] <fabbione> it's just memory corruption happening for some odd reasons
[10:26] <fabbione> there might be 2 reasons:
[10:26] <fabbione> - hw is buggy
[10:27] <braddr_> is there any reason to believe that on the boots that it gets into user space that it'll be stable?
[10:27] <fabbione> - the specs we got are not complete and don't cover the 16 CPU's case properly
[10:27] <braddr_> assuming non-buggy hw
[10:27] <fabbione> it fails opening a file in /proc
[10:28] <fabbione> so it either contains garbage
[10:28] <fabbione> or /proc is busted
[10:28] <fabbione> perhaps it works once it boots, but well that won't help you much
[10:29] <fabbione> because you might install and not be able to boot after
[10:29] <braddr_> but it'd progress.
[10:29] <fabbione> not really..
[10:29] <braddr_> I've only got 60 days, if the hype is to be believed.
[10:29] <fabbione> yes we will fix it WAY before that
[10:30] <braddr_> unless it's hardware. :P
[10:30] <fabbione> because by that time Dapper must be stable and released :)
[10:30] <fabbione> well clearly..
[10:30] <braddr_> is niagara hardware a showstopper for dapper?
[10:30] <fabbione> nobody at SUN has been reporting such case... that's why HW failure is somehow floating in my mind
[10:30] <fabbione> no
[10:30] <braddr_> didn't think so.
[10:30] <fabbione> sparc is not supported by Ubuntu
[10:31] <fabbione> but if we can get to release at the same time, the better :)
[10:31] <braddr_> right.. so why the confidence that it'll be fixed before then?
[10:31] <fabbione> because i am a nasty bitch
[10:31] <braddr_> well, right.
[10:31] <fabbione> i want it fixed before that
[10:31] <braddr_> determination and pigheadedness can go a long way.
[10:32] <fabbione> trust me.. i can bitch and nag people enough to make them cry like little babies :P
[10:32] <braddr_> I know the type.. I have to do a lot of that sorta project management at work
[10:32] <fabbione> eheh
[10:32] <braddr_> though rarely does anyone have to actually cry
[10:32] <fabbione> :)
[10:34] <braddr_> anything else I or we can do until later in the week when david frees up?
[10:34] <fabbione> once he frees up, we will just have to be ready to test what he asks
[10:35] <fabbione> see another major issue is that we have "only" 3 machines in 3 different setups
[10:35] <fabbione> that doesn't make it easier to isolate
[10:36] <braddr_> agreed
[10:36] <braddr_> and unlike x86 hardware, can't just yank a cpu or 4 to normalize the configs. :)
[10:36] <fabbione> eheh
[10:38] <braddr_> I'm really looking forward to putting this hardware through its paces.  I'll be using it to do highly parallel compiles of gcc (with gdc, the d language frontend) and its test suite of 10's of thousands of nice independent tests.
[10:38] <fabbione> eheheh
[10:38] <braddr_> pretty much the perfect hardware for this sort of thing.
[10:39] <fabbione> if it doesn't use fpu yes
[10:39] <braddr_> well, some tests do, but most don't and the compiler certainly doesn't need it
[10:39] <fabbione> yeps...
[10:39] <fabbione> we did push all optimizations in for gcc already
[10:39] <fabbione> and glibc
[10:39] <fabbione> so it should be quite fast
[10:40] <fabbione> assuming we can get it to boot :P
[10:40] <braddr_> I assume only in gcc trunk though, right?
[10:40] <braddr_> I haven't tried getting the d frontend moved anywhere past 4.0.x
[10:40] <fabbione> we did backport all the optimization into ubuntu gcc as well
[10:40] <fabbione> so if you use the gcc4 shipped by default it will work
[10:41] <braddr_> I'm less concerned about the actual generated code speed, though faster is better.
[10:41] <braddr_> anything will be faster than the old athlon box I've been using.
[10:41] <fabbione> ehhe
[10:41] <fabbione> i can cut the kernel in 1 minute and 20 secs here
[10:42] <fabbione> with 24 CPU's
[10:42] <fabbione> and minimal config of course (but still bootable)
[10:42] <braddr_> the test suite for D takes about 8-10 hours on the athlon
[10:43] <fabbione> not too bad
[10:43] <braddr_> kinda high for iterative development, but a decent overnight check it tomorrow run.
[10:43] <braddr_> 15-20 minutes would be better.
[10:44] <fabbione> well..
[10:44] <fabbione> like we say in Italy: "You can't have a drunk wife and the jar full of wine"
[10:44] <braddr_> just need 2 jars.
[10:44] <fabbione> ahah
[10:45] <braddr_> a little creativity will solve most problems.
[10:45] <fabbione> sometimes it does
[10:53] <braddr_> hey, is there a way to disable cpu's in openboot or sc?  Maybe dropping it down to 8 or 4 would prove interesting?
[10:55] <fabbione> nope
[10:55] <fabbione> it's hardcoded in the CPU
[10:55] <braddr_> oh well, worth a shot.
[10:55] <fabbione> yeah
[10:56] <fabbione> time for breakfast and more coffee :)
[10:56] <fabbione> brb
[11:09] <braddr_> I've yet to get -20 to boot ever, but -19 boots periodically.
[11:09] <braddr_> no pattern, but I tried a lot with -20 and never.
[11:13] <fabbione> probably the fixes in -20- can trigger the error constantly
[11:14] <braddr_> hey.. I caught an oops on this 'successful' boot
[11:15] <braddr_> http://www.puremagic.com/~braddr/t2000/oops-1.txt
[11:18] <fabbione> checking..
[11:18] <fabbione> was that with -19- ?
[11:19] <braddr_> yes.. but double checking to be sure.
[11:19] <fabbione> ok
[11:19] <braddr_> yup
[11:20] <braddr_> and looks like the install has stalled after the language screen
[11:20] <braddr_> not super surprising.
[11:20] <fabbione> yes most likely
[11:20] <fabbione> udev is generating a crash
[11:20] <fabbione> probably parsing sysfs
[11:23] <braddr_> right, well, with the installer stalling, I'm gonna give up on the box for tonight
[11:24] <fabbione> yes.. get some good rest and get ready for some heavy debugging soon :)
[11:24] <braddr_> it seems silly to have a 2 middleman process between david and the hardware
[11:24] <fabbione> nah it's ok.. because i will need to push the changes in the Ubuntu kernel as well
[11:25] <fabbione> and trigger all the other bits to have the installer etc. etc.
[11:25] <braddr_> sure, but not until after it's fixed.
[11:25] <fabbione> it might be an Ubuntu specific bug..
[11:25] <braddr_> could be
[11:26] <braddr_> well, you've got my email address, and I'll hang out here during the usa/pacific evening/night hours.
[11:27] <braddr_> during work hours my availability is less predictable, but usually I can make myself available
[11:29] <fabbione> i will try to ping you with enough notice