#ubuntu-ports 2006-05-27
* Starting logfile irclogs/ubuntu-ports.log
<braddr> fabbione: have you ever built a linux/sparc kernel from solaris/sparc?  I'm gonna try to see if I can gather some more debugging info.  Dave doesn't believe he'll have time before I have to return the box.
<braddr> or from linux/x86, for that matter
<fabbione> braddr: well i did from linux arch foo to sparc
<fabbione> but not from solaris
<fabbione> google for crosstoolchain
* braddr nods.  compiling from solaris is proving to be rather painful
<fabbione> no wonder
<fabbione> if you have a patch i can build a kernel for you
<fabbione> it takes me only a few minutes here
<fabbione> or something you want to play with
<braddr> well, I want to turn on all the debug options and probably add a bunch of printfs around the point of failure.. so the ability to iterate and build the net boot images on demand
<fabbione> yeah i understand that
<fabbione> hmm
<fabbione> if you have a linux box, just build the cross compile tools
<fabbione> it's really easy to do
* braddr nods, "just never have before. :)"
<fabbione> braddr:  there is a set of scripts that do that for you. it's one of the first hits on google :)
<braddr> building gcc and glibc.. this might just take a little while.
<fabbione> yeah it doesn't take that long.. iirc it disables the test suite
<braddr> while this is running, if you feel like spinning a build with all the debug options turned on that might give something to go on.  If I recall correctly, we were suspecting memory corruption, so CONFIG_DEBUG_SLAB, CONFIG_DEBUG_VM, CONFIG_DEBUG_PAGEALLOC, and maybe CONFIG_DEBUG_BOOTMEM might be interesting.
<fabbione> yes.. i can do that
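For reference, the four options braddr lists would look roughly like this in a kernel .config. This is only a sketch: CONFIG_DEBUG_KERNEL is an assumption added here, since these debug options normally sit behind it, and whether each option is available depends on the kernel version and architecture.

```shell
# .config fragment (plain KEY=y lines, which also parse as shell assignments).
# Assumption: CONFIG_DEBUG_KERNEL gates the debug options below; it is not
# mentioned in the conversation.
CONFIG_DEBUG_KERNEL=y

# The options named in the conversation, suspected memory corruption in mind.
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_BOOTMEM=y
```

As the later part of the log shows, turning everything on at once can itself change boot behaviour, so enabling one option at a time may be easier to reason about.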
<braddr> have you seen other reports of boot problems or successes with 16 cpu models?
<fabbione> not one :(
<fabbione> but i did add you to the Release Notes
<braddr> that's unfortunate
<fabbione>  * SUN T2000 with 4 cores have been reported as not working.
<fabbione> ok give me a sec and i will build for you
<fabbione> more than a sec
<fabbione> i need to power up everything
<braddr> no hurry.. the scripts are running to build the sparc64 tool chain
<fabbione> braddr: i assume you did try already to boot the latest installer...
<braddr> I haven't tried anything since we last worked on this.. been sticking with solaris.
<fabbione> can you try just in case?
<braddr> sure
<fabbione> http://archive.ubuntu.com/ubuntu/dists/dapper/main/installer-sparc/current/images/sparc64/netboot/2.6/
<fabbione> may never know it was "auto-fixed"
<braddr> stranger things have certainly happened
<braddr> fetching now
<fabbione> yup
* fabbione hates the console with all his heart
* braddr had forgotten how quiet it was when the beast is off.
<fabbione> eehhe
<fabbione> i had it turned on for the last few days together with the SAN
<fabbione> to do test installs for release
<fabbione> i can't hear my wife anymore.. and that's good... ;)
<fabbione> not sure i will be able to hear anything anylonger..
<braddr> mine's been on for weeks
<fabbione> no i power it off when i don't need it
<fabbione> i did rebuild all of the 9000 pkgs of dapper in 36 hours flat on my T2000 :)
<fabbione> with 2 of them you could make a port of a distro in a day
<braddr> a few additional messages in the boot log, but dies with essentially the same error, but no stacktrace
<fabbione> ok
<braddr> before:
<braddr> [   11.338121]  NET: Registered protocol family 17 [   11.648468]  Badness in proc_get_inode at fs/proc/inode.c:157 [   11.821338]  Call Trace: [   11.898036]   [00000000004a1ee0]  __lookup_hash+0xe0/0x140 [   12.065575]   [00000000004a5128]  open_namei+0x128/0x640 [   12.227346]   [00000000004924c0]  filp_open+0x20/0x60 [   12.379717]   [0000000000492658]  do_sys_open+0x38/0xe0 [   12.543989]   [0000000000406a14]  linux_sparc_syscall32+0x34/0x40 [   12.739641]   [0
* braddr glares at the paste
* fabbione looks at braddr 
<fabbione> well i remember that
<braddr> in the 'after' picture.. after the net: registered protocol family 17 is some usb probing then it probes all 4 eth ports
<braddr> ending with:
<braddr> [   14.590307]  e1000: eth3: e1000_probe: Intel(R) PRO/1000 Network Connection
<braddr> Starting system log daemon: syslogd, klogd.
<braddr> [   16.902270]  SUN4V-DTLB: Error at TPC[41c400] , tl -1034272768
<braddr> [   17.143287]  SUN4V-DTLB: vaddr[ffffffffffff0000]  ctx[0]  pte[800007ffffff0743]  error[2] 
<fabbione> HMMMM
<fabbione> were you running slowlaris before rebooting in Linux?
<braddr> thought you might like the user space part. :)
<braddr> yes
<fabbione> can you please poweroff -> poweron -> boot ?
<braddr> roger
<braddr> what was the openboot command to disable auto booting?
<braddr> set auto-boot false?
<fabbione> like if i remember.... hold on :)
<fabbione> setenv auto-boot false?
<braddr> looks like it's already set that way from last time.
<braddr> powering back on.. 10 minutes to info. :)
<fabbione> take your time
<fabbione> i am still installing kernel b-d
<fabbione> the machine was in an interesting state today after a few tons of installs
<braddr> hrm.. looks like the crosstool build blew up while building the kernel
<fabbione> that's no surprise.. the kernel config is crap there iirc
<braddr> I'll poke at the script and see how to disable building the kernel and glibc.. all I need is gcc
<braddr> Starting system log daemon: syslogd, klogd.
<braddr> Killed
<braddr> /build/buildd/cdebconf-0.97ubuntu3/src/debconf.c:135 (main): Cannot initialize debconf template database
<braddr> /build/buildd/cdebconf-0.97ubuntu3/src/debconf.c:135 (main): Cannot initialize debconf template database
<braddr> FATAL: Module usbkbd not found.
<braddr> FATAL: Module usbhid not found.
<braddr> FATAL: Module usbserial not found.
<braddr> /build/buildd/cdebconf-0.97ubuntu3/src/debconf.c:135 (main): Cannot initialize debconf template database
<braddr> the last line repeats
<fabbione> ok
<fabbione> it's that "killed" before 
<fabbione> but at least it looks less scary
<fabbione> ok building the kernel now
<braddr> there isn't a SUN4V-DTLB part this time
<fabbione> yes
<fabbione> i can see that :)
<fabbione> it's probably being caught by syslog or klogd
* braddr nods, just confirming that I didn't leave it out.
<fabbione> that's why i told you to poweroff
<fabbione> sometimes slowlaris leaves the CPU in an "interesting" unresettable state (for linux)
<fabbione> i think somehow the main issue has been solved but there is still something fishy going on. otherwise you would have seen the same problem
<fabbione> TBH if i was you i would try yet another reboot.. just to enjoy the 4 secs of silence between reboots
<fabbione> no seriously.. i would like to see another reboot 
<fabbione> to see if the DTLB error comes up again
<fabbione> sometimes it's just a matter of timing with syslog starting
<braddr> oh.. you wanted it w/o the powercycle?
<fabbione> powercycle was perfectly fine
<fabbione> just do another one now
<braddr> ok.. in progress.
<fabbione> what we want is to try to see if syslog is hiding it or not
<fabbione> given that we can't access the logs once it crashes
<braddr> it might be interesting to see if two boots back to back w/o a powercycle gets the original error
<fabbione> up to you
<fabbione> the kernel here is almost done
<braddr> past post, into the linux boot now
<fabbione> :)
<braddr> crosstool past a stupid linux source bug and into building binutils
<braddr> Starting system log daemon: syslogd, klogd.
<braddr> [   10.285601]  SUN4V-DTLB: Error at TPC[41c400] , tl -1034272768
<braddr> [   10.521781]  SUN4V-DTLB: vaddr[ffffffffffff0000]  ctx[0]  pte[800007ffffff0743]  error[2] 
<fabbione> yup
<fabbione> it was just hidden somewhere
<fabbione> kernel is linking
<fabbione> ccache populated :)
<braddr> gcc is building
<braddr> I presume I'll need to grab more bits to build a boot.img file..
* braddr wanders back to google, the fount of knowledge
<fabbione> oh to build that one.. yes
<fabbione> i can give you the initrd
<fabbione> all you have to do is to slam it at the bottom of the a.out image
<braddr> just cat the two together?
<fabbione> do you have debian/ubuntu on the other linux box?
<braddr> debian,yes
<fabbione> apt-get source debian-installer
<fabbione> look for tftpboot.sh
<fabbione> that's the script that creates the boot.img
<braddr> roger
<fabbione> hmm i can tell you that even if we manage to boot this debug, you won't be able to install
<fabbione> but i can workaround that with a lot of patience
<braddr> s'ok.. step one, figure out what's breaking
<fabbione> yeah
<braddr> step two.. profit
<fabbione> :)
<braddr> that tftpboot.sh script relies on some binaries not part of the installer.. I believe I recognize piggyback from the kernel build, not sure about elf2aout
<fabbione> elf2aout just converts an elf exec to an a.out
<fabbione> i am sure there are sources somewhere
<fabbione> for sparc they are on sparc-utils pkg
* braddr nods
* braddr will tackle that after getting the crosscompiler built and the linux kernel itself built. :)
<fabbione> i am uploading the image..
<fabbione> http://people.ubuntu.com/~fabbione/braddr/
* braddr wgets
* braddr boot net's.
<braddr> probably should powercycle, but let's see if this gets anything interesting
<braddr> seems to stall here:
<braddr> Booting Linux...
<braddr> mem_init: Calling free_all_bootmem().
<fabbione> interesting
<braddr> any thoughts before I start the 10 minute power cycle bootup?
<fabbione> it might be just slow???
<braddr> been well over a minute now
* braddr starts the cycle
<fabbione> ok try again
<fabbione> i will build one that doesn't trigger that
* braddr will need to put a cap on this.. it's after 1am here.
<braddr> I'll go until 2
<fabbione> ok
<fabbione> #ifdef CONFIG_DEBUG_BOOTMEM
<fabbione>         prom_printf("mem_init: Calling free_all_bootmem().\n");
<fabbione> #endif
<fabbione>         totalram_pages = num_physpages = free_all_bootmem() - 1;
<braddr> seems pretty harmless
<fabbione> checking what that does exactly
<braddr> considering there's other printf's in the neighborhood, adding one more can't be a problem by itself.
<fabbione> yeps
<fabbione> slamming one immediately after
<fabbione> to see if that completes
<braddr> same stall.
* braddr will give it several minutes
<fabbione> i am almost done with the other printk
* braddr nods.. that's what I was figuring would be the next boot. :)
<fabbione> it takes a little bit to do things properly
<fabbione> in terms of change -> track change -> build proper images -> create boot.img -> blabla
<braddr> screw the properly part. :)
<fabbione> no i don't.. it's important for me that each bit is always tested properly...
<fabbione> we used to hack that way when we were adding the support to the kernel
<fabbione> but we realized only later that at cleanup time we lost bits here and there
* braddr eyes the crosstools build.. died again.
<fabbione> duplicating work etc.
* braddr nods.
<braddr> I tend to take a 2 pass approach.. hack and slash until the main bug goes away, then re-do things with better knowledge of the problem from the original source
<fabbione> yeah but this is more building the image trick to be consistent
<fabbione> the code is in git.. so it's easy for me to grab the diff and reapply it clean to the main trunk
<braddr> sparc64-unknown-linux-gnu-hello-static: ELF 64-bit MSB executable, SPARC V9, version 1 (SYSV), for GNU/Linux 2.4.3, statically linked, for GNU/Linux 2.4.3, not stripped
<braddr> yay
<fabbione> nice
<fabbione> uploading
<fabbione> ok it's there
<fabbione> same url
<braddr> getting...
<braddr> for what it's worth, still stalled there.
<fabbione> #ifdef CONFIG_DEBUG_BOOTMEM
<fabbione>         prom_printf("mem_init: Calling free_all_bootmem().\n");
<fabbione> #endif
<fabbione>         totalram_pages = num_physpages = free_all_bootmem() - 1;
<fabbione> #ifdef CONFIG_DEBUG_BOOTMEM
<fabbione>         prom_printf("mem_init: done Calling free_all_bootmem().\n");
<fabbione> #endif
<fabbione> this should tell you if it goes any further
<fabbione> the stuff that free_all_bootmem calls is voodoo to me
* braddr consults his magic eight ball to see what it predicts.
<fabbione> i predict it boots, prints my stuff and hangs
<braddr> that'd be the A answer.
<fabbione> if it goes further you should see:
<fabbione>         printk("Memory: %uk available (%ldk kernel code, %ldk data, %ldk init) [%016lx,%016lx] \n",
<braddr> wow.. break from sc isn't getting a prompt.  powercycling
<braddr> mem_init: done Calling free_all_bootmem().
<fabbione> and it hangs...
* fabbione sighs
<fabbione> ok let's try without BOOTMEM debugging
<braddr> at least the voodoo off in the boot mem freeing code can be avoided. :)
<fabbione> yeah
<fabbione> i should really poweron the SAN to do these builds
<fabbione> spending ages on I/O on these slow SAS disks
<braddr> heh
<ajmitch> hi fabbione 
<fabbione> hi aj
<fabbione> braddr: almost done...
<fabbione> got sucked with a customer :/
<braddr> no problem.. reading the docs on kbuild to learn how to do cross compiles
<fabbione> braddr: that's easy ;)
<fabbione> either edit arch/sparc64/Makefile
<fabbione> or export hmmm what envvar..
<fabbione> HOSTCC and CC
<fabbione> that should do
<braddr> and to trigger it as a cross build?
<fabbione> make ARCH=sparc64 ...
<fabbione> uploading new image
<fabbione> this one without DEBUG_BOOTMEM
<braddr> sweet
<fabbione> done
* braddr wgets the same url
<braddr> Remapping the kernel... done.
<braddr> Booting Linux...
<braddr> <stall...>
<fabbione> how much ram do you have there?
<braddr> 8G
<fabbione> ok letme try to boot that image here
<fabbione> i also have 8GB
<fabbione> at least we can exclude one thing
<fabbione> this one stalls for me too
<fabbione> so it's either scratching all the 8GB of ram very very slowly
<fabbione> or it is simply broken
* braddr votes the latter
<fabbione> i can try removing the ram
<fabbione> i will let it run for a bit while i get some lunch
<braddr> one of the boots had to have been sitting there for 10 minutes or so
<fabbione> and try to come back with an image for tomorrow
<fabbione> i have no idea how verbose these debugging things are
<fabbione> i think i will disable all of them and enable one at a time to see what breaks
<fabbione> it's probably easier
<fabbione> but it requires a few tons of reboots
* braddr nods.
<fabbione> go get some sleep
<fabbione> i hope to have an image by tomorrow or something
<braddr> I'm about to try the first kernel build
<fabbione> ok :)
<fabbione> but remember that netboot needs to be a.out or it won't work
* fabbione -> food
<braddr> thanks for the help.. see ya in 20ish hours
<fabbione> no problem.. sure
<fabbione> i should be here :)
<braddr> well, that failed fast. :)
<braddr> fyi: make ARCH=sparc64 CROSS_COMPILE=sparc64-unknown-linux-gnu- vmlinux.aout
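A note on the command braddr ended up with: kbuild derives the compiler name by prepending CROSS_COMPILE to gcc, so the prefix has to match the installed toolchain and keep its trailing dash. A minimal sketch of a wrapper that checks for the prefixed compiler before starting the build; the function name cross_build is hypothetical, and the prefix shown is simply the one from the log.

```shell
# Sketch: refuse to start a kernel cross build if the prefixed gcc is missing.
# cross_build is a hypothetical helper, not part of kbuild.
cross_build() {
    arch=$1      # target architecture, e.g. sparc64
    prefix=$2    # toolchain prefix with trailing dash, e.g. sparc64-unknown-linux-gnu-
    shift 2      # remaining arguments are make targets, e.g. vmlinux.aout

    # kbuild will invoke "${prefix}gcc", so check that it is on PATH first.
    if ! command -v "${prefix}gcc" >/dev/null 2>&1; then
        echo "cross compiler ${prefix}gcc not found on PATH" >&2
        return 1
    fi

    make ARCH="$arch" CROSS_COMPILE="$prefix" "$@"
}

# Example, run from the top of a kernel source tree (not executed here):
#   cross_build sparc64 sparc64-unknown-linux-gnu- vmlinux.aout
```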
* braddr grogs.
<fabbione> morning
<braddr> hiya.  before I went to sleep, I got a kernel build that worked enough to show the Booting linux... message
<fabbione> oh nice
<fabbione> i managed to add a couple of DEBUGGING options from none
<fabbione> but i had to stop 
<fabbione> i am going to add more now
<braddr> I didn't get the kernel + initrd image built.. I was more anxious to see the kernel working.  I'm gonna eat dinner, watch a bit of tv, then dive back in
<fabbione> sure
<fabbione> it will give me a little bit to build another couple of kernels
<braddr> you're trying to see what debugging options work and which ones cause your own box to die during bootup?
<fabbione> yeps
<fabbione> like we discussed yesterday
<braddr> just checking.  They're all 'supposed' to work, I assume.. something buggy with either sparc in general or t2000 specifically, potentially?
<braddr> I'm still running 2.4 everywhere, so I'm a tad behind the times
<fabbione> i assume they are all supposed to work individually
<fabbione> not sure in big combos
<braddr> I see references to turning some of 'em on in various l-k threads from time to time
<braddr> I should probably do a build with a .config that matches the installer v 35 build
<fabbione> yeps 
<fabbione> what sources are you using?
<fabbione> vanilla or ubuntu?
<braddr> vanilla 2.6.17-rc5
<fabbione> ok
<fabbione> the config might work
<fabbione> but a lot of drivers won't be there
<fabbione> i am booting up now
<fabbione> can give you a config in 2 minutes or so
<braddr> for a t2000, a lot of drivers aren't relevant.
<fabbione> i know
<fabbione> they are just there
<braddr> fyi.. a snippet from an email with my sun contact:
<braddr> > Thank you for your feedback. I have passed it along to the T2000 Try and Buy
<braddr> > Program people, as it is a new program, we appreciate the constructive
<braddr> > feedback. As a follow up to your first point, on the T2000 Linux is not
<braddr> > formally supported (yet) so any efforts Dave is making are best effort.
<braddr> the 'yet' part I like seeing.
<fabbione> ehhe
<braddr> my reply basically said I/we weren't looking for support, just time.
<gnu2it2> little help please.. E: Malformed line 3 in source list /etc/apt/sources.list (dist parse)
<gnu2it2> line 3 = deb http://ports.ubuntu.com/ubuntu-ports dapper
<gnu2it2> i dont see what is malformed
<gnu2it2> sure is quiet in here :)
<braddr> deb     http://http.us.debian.org/debian/ testing main non-free contrib
<braddr> you're missing section entries
<braddr> (making up the term, not sure what it is officially)
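What braddr calls "section entries" are the components that follow the suite name (dapper) on a deb line. A rough sketch of a check for this specific mistake; check_deb_line is a hypothetical helper, the component main in the example is an assumption, and the check deliberately ignores the form where the suite is an exact path ending in a slash and no components are given.

```shell
# Sketch: flag a sources.list line that names a suite but no component.
# check_deb_line is a hypothetical helper, not an apt tool.
check_deb_line() {
    # Split the line on whitespace: type, URI, suite, then components.
    # shellcheck disable=SC2086
    set -- $1
    case "${1:-}" in
        deb|deb-src) ;;      # only these two line types are handled
        *) return 2 ;;       # anything else is out of scope for this sketch
    esac
    # Need at least four fields: type, URI, suite, one component.
    [ "$#" -ge 4 ]
}

# Example (assumed component "main"):
#   check_deb_line "deb http://ports.ubuntu.com/ubuntu-ports dapper"       # fails
#   check_deb_line "deb http://ports.ubuntu.com/ubuntu-ports dapper main"  # passes
```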
<braddr> fabbione: hopefully you're still asleep, but for when you wake, not a single oops or other kernel message during the entire install except for during module loading, and there just a bunch of symbol size mismatch messages
<fabbione> braddr: nice...
<fabbione> braddr: did you try to boot the SMP kernel?
<braddr> not yet
<fabbione> could you please?
<braddr> just try the default installed smp kernel?
<fabbione> yeps
<braddr> ok.. I'll reboot after the rsync of the ubuntu.git tree is done
<fabbione> thanks
<braddr> without the debug options, the race or whatever we were hitting is likely to hit, but it'll be interesting regardless.
<fabbione> it might not
<braddr> rebooting
<fabbione> take your time. i am kind of busy anyway
<braddr> no problem.. just munching on some pizza and catching up on some tv
<braddr> ncpus probed    : 16
<braddr> ncpus active    : 16
<fabbione> :)
<fabbione> now just play with it :)
<fabbione> and see how bad it goes
<braddr> I much prefer reproducible, reliable, regular crashes.
<fabbione> i might be able to get the config changes in if they are not too heavy
<fabbione> in terms that we can work around it, but i need to verify how much heavier is the DEBUG_SLAB UP kernel
<braddr> well, this smp kernel has no config changes.. it's what came from the default install.
<fabbione> yeps
<fabbione> that means that it's only with UP kernel the problem
<braddr> or that the race condition hasn't triggered
<fabbione> probably
<fabbione> as i said.. play with it and see if you notice anything strange
<fabbione> i have 24h to upload a kernel
<fabbione> after that it will be a matter of custom installers
<braddr> I'll have it do a few make -j32 kernel compiles.. always good for a stress test
<fabbione> -j512 is good too
<fabbione> don't be shy
<fabbione> the machine won't yell at you
<braddr> the disk might.. it's not one of the snappiest ever created.
<fabbione> don't worry about that
<fabbione> trust me :)
<fabbione> davem didn't find bugs on the kernel doing -j32
<fabbione> i did when pushing to the edges of -j4096
<braddr> right.. make -j  <nothing> it is.
<fabbione> yeah
<fabbione> ok while you play.. i need to get ready to fly to London
<braddr> if it's gonna be stressed, let's throw it all at her
<braddr> whee
<fabbione> i have an airplane to catch in a few hours
<fabbione> stress her as much as you can
<fabbione> the more the better
<braddr> while /bin/true; do make clean; make -j; done
<fabbione> :)
<braddr> ouch.. about 1.5G into swap, about 1200 processes all fighting for some time
<fabbione> that's ok
<fabbione> don't worry
<braddr> oh, I know.. as long as I don't run outta swap it's all good
<fabbione> you won't
<fabbione> make -j will make sure not to go over machine resources
<fabbione> it's done on purpose
<braddr> yeah, looks like it's over the parallelism hump anyway
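A bounded variant of the make clean / make -j stress loop can make it easier to say how many cycles ran before something broke. The sketch below is an assumption about how one might structure that, with a hypothetical helper name; note that GNU make documents -j with no number as placing no limit on the number of simultaneous jobs, so an explicit job count (or make's -l load limit) is the way to cap it.

```shell
# Sketch: run a command a fixed number of rounds and count failed rounds.
# stress_loop is a hypothetical helper; it prints the failure count on stdout.
stress_loop() {
    rounds=$1
    shift                      # the rest is the command to run each round
    failures=0
    i=1
    while [ "$i" -le "$rounds" ]; do
        if ! "$@"; then
            failures=$((failures + 1))
            echo "round $i failed" >&2
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example, from the top of a kernel tree (not executed here):
#   stress_loop 32 sh -c 'make clean && make -j700'
```

Kernel oopses will not show up as make failures, so the console or kernel log still needs to be captured separately.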
<fabbione> ok gotta go for a bit now
<fabbione> ttyl
<braddr> have fun in London
<fabbione> nah still need to prepare the bag, shower and close here ;)
<fabbione> another 2/3 hours
<fabbione> just need to get started :D
<braddr> interesting that 2 make processes seem to be consuming 2 of the cores full time
<braddr> ooh.. an oops, and file-max limit reached
<fabbione> max file is a known bug upstream
<fabbione> log the OOPS
<braddr> already done
<fabbione> the file-max is due to SLAB not releasing cleared fd fast enough
<braddr> the oops is in one of the make's that's spinning.. which makes some sense.
<braddr> wait, no it's not.
<braddr> fun.. the two make's aren't kill -9able
<fabbione> add all to the logs :)
<fabbione> oh hold on
<fabbione> once you hit the file-max you need to reboot
<braddr> way ahead of you
<fabbione> there is no proper way to recover from that
<braddr> the oops came before the first file-max
<ajmitch> evening
<braddr> hiya aj
<ajmitch> sounds like you're stressing the box nicely :)
<braddr> just following suggestions
* ajmitch is still waiting to hear what gets sorted out with this one
<braddr> me too
<braddr> but progress has definitely been made
* braddr reboots
<fabbione> braddr: what's the url for the log again?
<braddr> http://www.puremagic.com/~braddr/t2000/boots.txt
<fabbione> thanks
<braddr> drop the file for a blog + a list of all the past files (from a month or so ago)
<fabbione> ?
<braddr> http://www.puremagic.com/~braddr/t2000/
<fabbione> oh ok :)
<braddr> drivers/block/aoe/aoemain.o -- there's an Age of Empires game embedded in the kernel? :P
<fabbione> hahah
<fabbione> -> London
<fabbione> later
<braddr> 32 cycles of make -j700 / make clean -- no oopses
* braddr has restarted the loop with -j900
#ubuntu-ports 2006-05-28
<braddr> first -j900 build hit the file-max bug
