[13:44] <nevrax> #ubuntu-classroom-chat
[15:42] <budiw> hello all
[15:48] <MarkAt2od> #ubuntu-classroom-chat
[15:53]  * koolhead17 kicks nigelb 
[15:54] <kim0> Hi folks .. Ubuntu Cloud Days starting in around 5mins here in #ubuntu-classroom .. You can discuss and ask questions in #ubuntu-classroom-chat .. Please feel free to tweet and share this info with your friends
[15:56] <kim0> Those unfamiliar with IRC can simply use this web page http://webchat.freenode.net/?channels=ubuntu-classroom%2Cubuntu-classroom-chat
[15:59] <smoser> hi all
[16:00] <smoser> ok...
[16:00] <smoser> so shall i start, mr kim0 ?
[16:00] <kim0> You might indeed
[16:01] <smoser> Hi, my name is Scott Moser.  I'm a member of the Ubuntu Server team.
[16:01] <smoser> For the past 18 months or so, I've been tasked with preparing and managing the "Official Ubuntu Images" that can be used on Ubuntu Enterprise Cloud (UEC) or on Amazon EC2.
[16:01] <ClassBot> Logs for this session will be available at http://irclogs.ubuntu.com/2011/03/24/%23ubuntu-classroom.html following the conclusion of the session.
[16:02] <smoser> Some links, for reference:
[16:02] <smoser> - https://help.ubuntu.com/community/UEC/Images
[16:02] <smoser> - http://uec-images.ubuntu.com/releases/
[16:03] <smoser> The first gives some general information about the images, the second gives access to download the images for use in UEC or Eucalyptus and AMI ids on Amazon.
[16:03] <smoser> The subject of my discussion here is "rebundling/re-using Ubuntu's UEC images".
[16:03] <smoser> "Rebundling" is the term used for taking an existing image (or instance), making some modifications to it, and creating a new image from it.
[16:03] <smoser> With the Ubuntu images on UEC or EC2, there are generally 3 ways to rebundle an image.
[16:03] <smoser>  * use a "bundle-vol" command from either euca2ools (euca-bundle-vol) or ec2-ami-tools (ec2-bundle-vol)
[16:03] <smoser>  * use the EC2 'CreateImage' interface.
[16:03] <smoser>  * make modifications to a pristine (unbooted) image via loop-back mounts
[16:04] <smoser> I'll talk a little bit about each one of these, and then open the floor to questions.
[16:04] <smoser> == General Re-Bundling Discussion ==
[16:04] <smoser> The "Official Ubuntu Images" are generally stock Ubuntu server installs.  As with any default install, they're not much use out of the box.
[16:05] <smoser> There are 2 basic ways to use Ubuntu images on EC2.
[16:05] <smoser>  * "rebundle" one and have your own private (or public) AMI. (as we're discussing here)
[16:05] <smoser>  * use the stock images, and customize the instance on first boot via scripted or manual methods.
[16:05] <smoser> We've gone to a fair amount of trouble to make them generally malleable so that you can use them without the need to rebundle.
[16:06] <smoser> Cloud-init (https://help.ubuntu.com/community/CloudInit) was developed to reduce the need for users to maintain their own AMIs.  Instead, the images are easily customizable on first boot.
[16:06] <smoser> kim0 has put together several blog posts about how to get cloud-init to do your bidding so it's ready to use once you attach to it.
[16:07] <smoser> There is some work in maintaining a rebundled image, and you can remove that effort by using a stock image.
[16:08] <smoser> http://foss-boss.blogspot.com/ is kim0's blog.
[16:08] <smoser> you should all have that bookmarked and available in your RSS reader of choice
[16:08] <smoser> :)
[16:09] <smoser> So, while there are some costs involved in rebundling, there are reasons to rebundle.  If you have a large stack that you're installing on top of the stock image, it may take some time for you to do so.
[16:09] <smoser> Rebundling generally allows you to have a more "ready-to-go", "custom" image.
[16:09] <smoser> By rebundling an image, you can add your stack and then reduce the bringup time of your custom AMI.
[16:10] <smoser> any questions ?
[16:10] <smoser> == Bundle Volume ==
[16:10] <smoser> Using 'ec2-bundle-vol' was the first way that I'm aware of to rebundle.  The euca2ools also provide a work-alike command 'euca-bundle-vol'.
[16:11] <smoser> The way most people use this tool is to
[16:11] <smoser>  * boot an instance that they want to start from
[16:11] <smoser>  * add some packages to it, make some changes ...
[16:11] <smoser>  * issue the rebundle command
[16:12] <smoser> [sorry, issue the 'ec2-bundle-vol' or 'euca-bundle-vol' command]
[16:12] <smoser> what this does is basically copy the contents of the root filesystem into a disk image
[16:12] <smoser> and then package that disk image up for uploading
[16:13] <smoser> as you can imagine, simply doing something like "cp -a / /mnt" (ignoring the recursive copy of /mnt) is not the most "clean" thing in the world.
[16:14] <smoser> the euca-bundle-vol and ec2-bundle-vol commands both include some OS-specific hacks, so they don't copy certain files.
[16:14] <smoser> and, inside the images themselves, we've made some changes to make this "just work".
[16:15] <smoser> once you've set up the euca2ools, you might rebundle with something like:
[16:15] <smoser>   sudo euca-bundle-vol --config /mnt/cred/credentials --size 2048 --destination /mnt/target-directory
[16:16] <smoser> !question
[16:16] <smoser> hm..
[16:17] <smoser> there was a question as to whether this applied to openstack
[16:17] <smoser> largely, openstack's EC2 compatibility should make this work.
[16:18] <smoser> i've not tried rebundling an image in openstack exactly, but i do know that the euca2ools interact with openstack fine. and copying a filesystem is not really cloud specific
[16:19] <ClassBot> koolhead17 asked: Is this class limited to eucalyptus image bundling or openstack as well ?
[16:19] <smoser> right. so my previous comments attempted to address that question
[16:19] <ClassBot> kim0 asked: is euca-bundle-vol only for running instances, can't I poweroff an instance and bundle its disk while powered off
[16:20] <smoser> euca-bundle-vol only runs in instances.
[16:20] <smoser> euca-bundle-image (and ec2-bundle-image) take a filesystem image as input.
[16:21] <smoser> after using euca-bundle-image (or ec2-bundle-image) you then have to upload and register the output
[16:21] <smoser> i generally would suggest using 'uec-publish-image' instead, which is a wrapper that does those three things (bundle, upload, register).  The most recent version of this tool in natty allows you to use either the ec2- tools or euca2ools under the covers.
[16:22] <ClassBot> koolhead17 asked: but openstack does not use the ramdisk part of the image
[16:22] <smoser> i might be missing something.
[16:22] <smoser> it is my understanding that the issue with the ubuntu images and openstack was that openstack was hard coded to expect a ramdisk
[16:23] <smoser> whereas the 10.04 and beyond images from Ubuntu do not use a ramdisk, so none was available in the tarball that you download.
[16:23] <smoser> i believe that a.) that bug is fixed
[16:24] <smoser> b.) there *is* in openstack a way to boot an instance with an internal kernel, ie not having a separate kernel/ramdisk at all, but relying on the bootloader installed in the disk image
[16:24] <smoser> boy... i'm getting loads of questions, and i'd like to kind of get back to my over all plan, and then i can take questions.
[16:24] <smoser> rather than sitting in interrupt mode for the whole hour
[16:25] <smoser> we will *definitely* have time to take questions, so please queue them up in #ubuntu-classroom-chat
[16:25] <smoser> now where was i...
[16:26] <smoser> so, after bundling, then you have to use euca-upload-bundle or ec2-upload-bundle and <prefix>-register to register your image.
[16:26] <smoser> I should have noted above, that this "bundle-vol" really is only for instance-store images.
[16:26] <smoser> Eucalyptus (in 2.0.X) only supports instance store images.
[16:27] <smoser> i believe that they plan to have EBS root images in the future.
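The bundle, upload and register steps described above can be sketched as a small shell function. This is only an illustration under assumptions: the bucket name, the image prefix and the use of the --prefix option are not from the session, and the manifest file name is assumed to follow the "<prefix>.manifest.xml" convention. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: bundle the running instance, upload the bundle, register it.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# rebundle BUCKET DEST_DIR PREFIX
#   BUCKET   - hypothetical Walrus/S3 bucket name
#   DEST_DIR - scratch directory with room for the bundle
#   PREFIX   - bundle file name prefix (assumes euca-bundle-vol --prefix)
rebundle() {
    bucket=$1
    dest=$2
    prefix=$3
    # Assumed manifest naming convention: <prefix>.manifest.xml
    manifest="$dest/$prefix.manifest.xml"

    run sudo euca-bundle-vol --config /mnt/cred/credentials \
        --size 2048 --destination "$dest" --prefix "$prefix" &&
    run euca-upload-bundle -b "$bucket" -m "$manifest" &&
    run euca-register "$bucket/$prefix.manifest.xml"
}

# Example (prints the three commands without running them):
#   DRY_RUN=1 rebundle my-bucket /mnt/target-directory my-image
```

Chaining the steps with && stops the sequence at the first failure, so a broken bundle is never uploaded or registered.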
[16:27] <smoser> So, that brings us to the second type of bundling
[16:27] <smoser> == CreateImage ==
[16:28] <smoser> When amazon began offering EBS root instances, they added an API call called 'CreateImage'
[16:28] <smoser> CreateImage is an AWS API call that basically does the following:
[16:28] <smoser>  * stop the instance if it is not stopped
[16:28] <smoser>  * snapshot its root volume
[16:28] <smoser>  * register an AMI based on that snapshot
[16:28] <smoser>  * start the instance back up.
[16:28] <smoser> The CreateImage api is exposed via a command line tool (http://docs.amazonwebservices.com/AWSEC2/latest/CommandLineReference/ApiReference-cmd-CreateImage.html) and also via the EC2 Web Console.
[16:29] <smoser> This feature makes it dramatically easier for anyone to create a custom AMI.  There is literally one button that you push in the EC2 Console, and then type in a name and description.
[16:29] <smoser> I would generally recommend using CreateImage if you're using an EBS based image.  It is an extremely useful wrapper, and will get you a consistent filesystem snapshot.
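For the command line route, a call to CreateImage might look like the sketch below. It assumes the EC2 API tools are installed and configured; the instance id, name and description are made-up placeholders. Setting DRY_RUN=1 prints the command instead of running it.

```shell
#!/bin/sh
# Sketch: create an AMI from an EBS-backed instance via CreateImage.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# create_image INSTANCE_ID NAME DESCRIPTION
# All three arguments are placeholders supplied by the caller.
create_image() {
    run ec2-create-image "$1" --name "$2" --description "$3"
}

# Example:
#   DRY_RUN=1 create_image i-12345678 my-stack "lucid plus my stack"
```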
[16:30] <smoser> Once you have a snapshot id of a filesystem, you could actually fairly easily upload an instance-store image based on that snapshot.
[16:30] <smoser> this is left as an exercise to the reader.
[16:31] <smoser> So, the final way of rebundling an image
[16:31] <smoser> == modify pristine download images ==
[16:31] <smoser> Few other image producers on EC2 make their images available as filesystem images for download.
[16:32] <smoser> Ubuntu does this so you can easily grab the image, make some changes to it, and then upload and register your modified image
[16:33] <smoser> This might be the most involved way of creating an image, but it is also the one that lends itself best to automation
[16:34] <smoser> For a simple example, say I wanted to add a user to an ubuntu image so that I could log in as that user on initial boot.
[16:34] <smoser> What I would do is:
[16:35] <smoser>  * launch a utility instance in EC2
[16:35] <smoser> I'd pick a lucid 64 bit image, possibly even use an EBS root image and a t1.micro size.  The size would largely depend on what I wanted to do.
[16:35] <smoser> once that instance was up, I'd ssh to it.
[16:36] <smoser> then, download an image tarball that I found a link to from https://uec-images.ubuntu.com/releases/lucid/
[16:36] <smoser> $ wget https://uec-images.ubuntu.com/releases/lucid/release/ubuntu-10.04-server-uec-amd64.tar.gz
[16:36] <smoser> then, extract the image, mount it loopback and make my modifications
[16:37] <smoser> $ tar -Sxvzf ubuntu-10.04-server-uec-amd64.tar.gz
[16:37] <smoser> $ sudo mkdir /target
[16:37] <smoser> $ sudo mount -o loop *.img /target
[16:37] <smoser> $ sudo chroot /target adduser foobar
[16:37] <smoser> ... follow some prompts ...
[16:37] <smoser> maybe make some other changes here
[16:38] <smoser> $ sudo umount /target
[16:38] <smoser> assuming you've also set up your credentials so that you can use euca-* or ec2-* tools, then you can do:
[16:39] <smoser> uec-publish-image x86_64 *.img my-s3-bucket
[16:39] <smoser> and out will pop an AMI-XXXXX id that you can then launch.
[16:40] <smoser> This process lends itself *very* well to scripting.  You can launch the instance, connect to it, and do all the modifications via a program and revision control them.
[16:40] <smoser> so you'll know exactly what you have
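As a rough illustration of how the loopback steps above could be scripted, here is a sketch that strings them together. The image file name and bucket are placeholders passed in by the caller, and the non-interactive adduser flags (--disabled-password, --gecos) are a substitution for the interactive prompts in the session, so the new user has no password and would still need an SSH key installed. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: download a UEC image tarball, add a user to the image via a
# loopback mount, then publish the modified image.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# customize_image TARBALL_URL IMAGE_FILE USER BUCKET
#   IMAGE_FILE - name of the .img file inside the tarball (placeholder)
#   BUCKET     - hypothetical S3/Walrus bucket to publish into
customize_image() {
    url=$1
    img=$2
    user=$3
    bucket=$4
    tarball=${url##*/}   # file name part of the URL

    # adduser runs with --disabled-password and an empty --gecos so that
    # it does not stop to ask questions inside the chroot.
    run wget "$url" &&
    run tar -Sxvzf "$tarball" &&
    run sudo mkdir -p /target &&
    run sudo mount -o loop "$img" /target &&
    run sudo chroot /target adduser --disabled-password --gecos "" "$user" &&
    run sudo umount /target &&
    run uec-publish-image x86_64 "$img" "$bucket"
}

# Example (image file name is an assumption, check the tarball contents):
#   DRY_RUN=1 customize_image \
#     https://uec-images.ubuntu.com/releases/lucid/release/ubuntu-10.04-server-uec-amd64.tar.gz \
#     lucid-server-uec-amd64.img foobar my-s3-bucket
```

Because every step is a plain command, the function can live in revision control next to whatever other changes are made inside the chroot.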
[16:40] <smoser> Also, we make machine consumable information about how to download the images available at https://uec-images.ubuntu.com/query/
[16:41] <smoser> For some things I was working on, I put together a script that does much of the above
[16:41] <smoser> It assumes the instance is launched, and you're on it with credentials, but then does the rest
[16:41] <smoser> http://bazaar.launchpad.net/~smoser/+junk/kexec-loader/view/head:/misc/publish-uec-image
[16:41] <smoser> so...
[16:42] <smoser> sorry for pushing through all that without taking interrupts, but I wanted to get through it.
[16:42] <ClassBot> obino asked: is there a "preferred" file system for instances?
[16:42] <smoser> 10.04 images I believe use ext3 filesystem.
[16:43] <smoser> that should have been ext4, as the images are really intended to be "stock ubuntu installs", and ext4 was the default filesystem in 10.04
[16:43] <smoser> the 10.10 images use ext4, and so does natty.
[16:44] <smoser> it is possible that Ubuntu will move to btrfs as the default in 11.10. if that's the case, I'd like to follow that in the images (btrfs has some *really* nice features).
[16:44] <smoser> so as to "preferred"....
[16:44] <smoser> I know people use xfs, as it offers snapshotting functionality not available in ext4
[16:45] <smoser> and Eric Hammond's "ec2-consistent-snapshot" is a popular tool that sits atop using xfs for your data partitions.
[16:45] <smoser> I wrote a blog entry on how you can rebundle the Ubuntu images into an xfs based image at
[16:45] <smoser> http://ubuntu-smoser.blogspot.com/2010/11/create-image-with-xfs-root-filesystem.html
[16:47] <smoser> navanjr, regarding "are you suggesting we should use the CreateImage method on EC2 to create an image for my UEC Private Cloud?"
[16:47] <smoser> i might have somewhat covered that.
[16:47] <smoser> but you really cannot use CreateImage on EC2 to create an image for UEC
[16:47] <smoser> one general approach that would work would be to get your instance into a state that you're happy with it
[16:47] <smoser> then stop the instance
[16:47] <smoser> snapshot its root volume
[16:48] <smoser> start the instance
[16:48] <smoser> attach that snapshot as another disk
[16:48] <smoser> then copy the filesystem contents of the second disk to a disk image.
[16:48] <smoser> that disk image then could be brought to UEC.
[16:49] <smoser> i'd have to think through that a bit more, but i believe the general path is correct.
[16:49] <smoser> semiosis pointed out: CreateImage will also snapshot any other (non-root) EBS volumes attached to the instance, and those snapshots are automatically restored & attached to new instances made from the AMI.
[16:49] <smoser> CreateImage is *really* a handy wrapper.
[16:49] <ClassBot> obino asked: thanks for cloud-init! Is there any plan to make cloud-init available for other distro?
[16:50] <smoser> well, amazon has taken cloud-init to their CentOS derived "Amazon Linux".
[16:50] <smoser> and I believe that they intend to continue doing so.
[16:50] <smoser> i'm definitely interested in helping them, and have worked with some of their engineers.
[16:51] <smoser> I'd also like to get cloud-init into debian.  I know of one person who was trying to do that, and one person who was interested in getting it into fedora.
[16:51] <ClassBot> There are 10 minutes remaining in the current session.
[16:52] <smoser> So, yes, I would like to see that.  I think consistent bootstrapping code across Linux distributions would be a general win.
[16:53] <smoser> navanjr, asked 'so there is no "createImage" similar for use on a running UEC instance?'
[16:54] <smoser> there is no createImage like functionality in UEC.  It relies upon EBS and block level snapshotting.  Eucalyptus does not have EBS root functionality in any released version that i'm aware of
[16:55] <smoser> however, navanjr you might be able to get some more information out of obino
[16:55] <smoser> (sorry, obino)
[16:55] <smoser> I suspect that question might have been planted.
[16:55] <smoser> it leads into the next session very well
[16:56] <smoser> TeTeT will talk about "UEC Persistency", which offers a way to get EBS root-like function on UEC, and even LTS 10.04.
[16:56] <smoser> i won't steal his thunder, though.
[16:56] <ClassBot> There are 5 minutes remaining in the current session.
[17:01] <TeTeT> Hello! Nice to have you in class today.
[17:02] <TeTeT> It's a bit weird, but I'm as nervous as before giving a live class :)
[17:02] <TeTeT> If you have any questions, ask them in #ubuntu-classroom-chat with the prefix
[17:02] <TeTeT> QUESTION
[17:02] <TeTeT> I am not very familiar with the classbot, but I'll do my best to check if there are any questions in the queue.
[17:02] <TeTeT> A brief introduction: I'm Torsten Spindler, been working for Canonical since December 2006 and I am part of the corporate services team, so unlike most other presenters here, I'm on the commercial side.
[17:03] <TeTeT> I bring this up as one of my responsibilities is to maintain and deliver the Ubuntu Enterprise Cloud classes.
[17:03] <TeTeT> This is directly related to this session, as I present one of the case study exercises found in the UEC class, it's the latest addition to the course material and brand new
[17:03] <TeTeT> So what do I want to present here?
[17:04] <TeTeT> During the UEC class I often got the question: 'How can I have an instance on UEC that stores all of its information on an EBS volume, so I can simply use it like a regular virtualized server?'
[17:04] <TeTeT> Why is that of interest? The data on an instance is volatile, i.e. if the instance dies, all of its data is gone.
[17:04] <TeTeT> Unless you store the data on a persistent cloud datastore; in UEC we have two of them: Walrus S3 and EBS volumes.
[17:05] <TeTeT> An EBS volume is a device in an instance that serves as a disk, very much like a USB stick you insert in your system.
[17:05] <TeTeT> you can see a graphic for this at http://people.canonical.com/~tspindler/UEC/attach-volume.png
[17:05] <TeTeT> keep in mind that the disk actually attaches to an instance running on the node controller, not on the node controller itself
[17:06] <TeTeT> the technology used in UEC is 'ATA over Ethernet - AoE', which means that the EBS volume is exported via network from a server, the EBS storage controller.
[17:06] <TeTeT> With Amazon Web Services (AWS) it is possible to boot from an EBS volume in the public cloud, with a technology named 'EBS root'. For more info on this, please see http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?Concepts_BootFromEBS.html
[17:06] <TeTeT> With UEC for the private cloud we use Eucalyptus as upstream technology. With Eucalyptus, booting an instance straight from an EBS volume is not available.
[17:07] <TeTeT> So the situation looks a bit like this: http://people.canonical.com/~tspindler/UEC/01-instance-volume.png
[17:07] <TeTeT> forgive my lack of design skills ;)
[17:07] <TeTeT> in UEC people usually have this: http://people.canonical.com/~tspindler/UEC/02-instance-volume-standard.png
[17:08] <TeTeT> an instance holds the kernel and /, while the data is saved on a persistent EBS volume
[17:08] <TeTeT> UEC persistency is about moving the kernel and / to the EBS volume
[17:08] <TeTeT> depicted in http://people.canonical.com/~tspindler/UEC/03-instance-volume-ebs-based-instance.png
[17:08] <TeTeT> that is, the kernel of the instance launches a kernel stored on an EBS volume
[17:09] <TeTeT> the kernel on the EBS volume will use / from the ebs volume and run completely from there
[17:09] <TeTeT> I asked the Ubuntu server team for advice on realizing such a service back in January 2011. It was motivated by the questions I received during the UEC class.
[17:10] <TeTeT> My initial thought was to have a specialized initrd that calls pivot_root on a root filesystem stored in an EBS volume.
[17:10] <TeTeT> But Scott Moser (smoser) had a much better idea: Why not use kexec to load the kernel and root from the EBS volume?
[17:10] <TeTeT> More information on kexec-loader can be found at http://www.solemnwarning.net/kexec-loader/
[17:10] <TeTeT> Scott went ahead and implemented this and my colleague Tom Ellis tested it.
[17:10] <TeTeT> Then I used the docs from Scott, tested them and created an exercise for the UEC class out of it.
[17:11] <TeTeT> We decided to publish this work in Launchpad and you can find the result in https://launchpad.net/uec-persistency.
[17:11] <TeTeT> The branch under lp:uec-persistency contains the needed code and the exercise in the docs directory.
[17:12] <TeTeT> smoser just told me that we use something like kexec-loader, not exactly the same tech
[17:12] <TeTeT> see the chat channel for more background info ;)
[17:12] <TeTeT> If you don't have bazaar installed right now, you can also take a look at the exercise PDF, found at http://people.canonical.com/~tspindler/UEC/ebs-based-instance.pdf
[17:12] <TeTeT> the odt file is also freely available
[17:12] <TeTeT> I will now cover this exercise step by step.
[17:12] <TeTeT> No questions so far on what the aim is?
[17:13] <TeTeT> to repeat, we want to use the kernel and root filesystem found on an EBS volume, not that in the instance itself
[17:13] <TeTeT> The steps during the exercise have to be conducted from two different systems, one I named the [cloud control host], the other the [utility instance].
[17:13] <TeTeT> it would be useful to have the PDF or ODT open for the exercise now
[17:14] <TeTeT> With cloud control host I mean any system that holds your UEC credentials, so you can issue commands to the front end.
[17:14] <TeTeT> The utility instance is created during the exercise and is used to populate an EBS volume with an Ubuntu image.
[17:15] <TeTeT> The first two steps are preparing your cloud with a loader emi. This loader emi will later be used to kexec the kernel on your EBS volume.
[17:15] <TeTeT> Steps three and four set up the utility instance. This is a regular UEC instance that is large enough to hold an Ubuntu image and store it on the attached EBS volume.
[17:15] <TeTeT> The EBS volume created and attached in step three will be the base for your EBS based instances later on.
[17:16] <TeTeT> Steps five to nine are needed to copy an Ubuntu image to the EBS volume and ensure it boots fine later on.
[17:16] <TeTeT> In step 11 a snapshot of the pristine Ubuntu EBS volume is made. While one could use the EBS volume right away, it's much nicer to clone it via the snapshot mechanism of EBS.
[17:17] <TeTeT> Just in case you later want another server based on that image.
[17:17] <TeTeT> Steps 12 and 13 are there to launch an instance based on an EBS volume.
[17:17] <TeTeT> The final step 14 describes how to ssh to the instance and check if it is really based on the EBS volume, e.g. /dev/sdb1.
[17:18] <TeTeT> That's really all there is to it, thanks to smoser's work. Perfectly doable by yourself within 2 - 4 hours, starting with two bare metal servers
[17:18] <TeTeT> Once you've been through all the steps and you want more EBS based instances of the same image, simply repeat from step 11 boot_vol.
[17:19] <TeTeT> With this you should have virtualized servers in your UEC within a few minutes, quite a nice time for provisioning.
[17:19] <TeTeT> Especially useful might be to assign permanent public addresses to those instances.
[17:19] <TeTeT> This can be done with help of euca-allocate-address and euca-associate-address.
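A minimal sketch of that address assignment follows. It assumes euca-allocate-address prints a line of the form "ADDRESS<tab>IP", which is not shown in the session; the instance id is a placeholder supplied by the caller.

```shell
#!/bin/sh
# Sketch: allocate a public address and bind it to an instance.

# Extract the IP from a line such as "ADDRESS<tab>192.168.1.100"
# (assumed output format of euca-allocate-address).
parse_address() {
    awk '$1 == "ADDRESS" { print $2; exit }'
}

# assign_public_address INSTANCE_ID
assign_public_address() {
    instance=$1
    ip=$(euca-allocate-address | parse_address)
    if [ -z "$ip" ]; then
        echo "no address allocated" >&2
        return 1
    fi
    euca-associate-address -i "$instance" "$ip"
}

# Example (instance id is hypothetical):
#   assign_public_address i-3FAA0744
```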
[17:20] <TeTeT> any questions so far? Everything crystal clear?
[17:21] <TeTeT> Well, it was a very short session then I fear...
[17:22] <TeTeT> Closing words: we're looking into automating the storage of the Ubuntu image on the EBS volume to make this step less work-intensive. So keep an eye on the Launchpad project.
[17:22] <TeTeT> In the end you should be able to use any Ubuntu image within UEC on Ubuntu 10.04 LTS on an EBS volume in a few minutes, rather than hours
[17:23] <TeTeT> if there's anyone interested in contributing to the project or UEC in general, please get in touch with kim0
[17:23] <TeTeT> we're looking for people with any skills, from coding to writing documentation
[17:24] <TeTeT> we'd also love to hear from you if you try UEC persistency and it works or doesn't work for you
[17:24] <TeTeT> I tested the exercise a few times, but you never know
[17:25] <TeTeT> you can touch base with us in #ubuntu-cloud and myself also in #ubuntu-training
[17:25] <ClassBot> obino asked: have you ever RAIDed the EBS volumes? Is there any advantage in doing so?
[17:26] <TeTeT> nope, I've never RAIDed the EBS volumes, but would think there might be a bit of a performance hit due to the ATA over Ethernet protocol
[17:26] <TeTeT> also keep in mind that while served via network, the EBS volumes are likely to come from the same Storage Controller (SC)
[17:27] <TeTeT> so not sure if this is a good approach
[17:28] <TeTeT> might be interesting to use drbd though, and maybe use the EBS volume as well as the ephemeral storage and see how that goes
[17:28] <TeTeT> I guess there's quite a bit of room for experimentation for EBS based instances
[17:30] <TeTeT> guess you now have 30 minutes left in the session, so enough time to actually do the exercise if you have a UEC ready :)
[17:32] <TeTeT> thanks for attending, catch me in #ubuntu-training if you run into problems with the exercise, bye now
[17:33] <kim0> So we finished a bit early on this session
[17:33] <kim0> Daviey starts in less than 30 mins with a puppet session
[17:34] <kim0> Time for a coffee break :)
[17:51] <ClassBot> There are 10 minutes remaining in the current session.
[17:56] <ClassBot> There are 5 minutes remaining in the current session.
[18:01] <Daviey> kim0, Are you managing the session?
[18:01] <kim0> It's pretty much self managing .. topic will be changed in a few seconds
[18:02] <Daviey> My name is Dave Walker (Daviey), and I am a member of the Ubuntu Server Team.
[18:02] <Daviey> Welcome to the puppet classroom session.  This session is mainly targeted at those that have had minimal or no exposure to puppet.
[18:03] <Daviey> It allows reproducible, consistent deployments, which is good for horizontal scaling and replacing machines which have malfunctioned.
[18:03] <Daviey> A good reference for more details about the project is at:
[18:03] <Daviey> http://projects.puppetlabs.com/projects/puppet/wiki/About_Puppet
[18:03] <Daviey> Please take a few moments to grok the content of that page, there is little point in my reproducing the content here.
[18:03]  * Daviey waits a few minutes.
[18:04] <Daviey> Now, some of that might sound a little complicated but it really is simple when you get started.
[18:05]  * Daviey continues.
[18:05] <Daviey> Puppet focuses on configuration management.  The initial operating system deployment is usually done by preseeding the installer, with cobbler or FAI, or simply by spawning a cloud machine, such as on EC2.
[18:05] <Daviey> In regards to EC2.. people tend to use user-scripts or increasingly cloud-init.
[18:06] <Daviey> Once the base operating system is installed, there are always some changes that need to be made to make the server usable for production.  These vary from performance tweaks and application configuration to custom versions of packages.  This could all be handled with scripts and such, but this is less than clean and near impossible to maintain.  This is where puppet provides a clean solution.
[18:06] <Daviey> Puppet generally acts on a client/server method, to manage multiple nodes.  However, it is also possible to use puppet on a single host.  For simplicity, this session will demonstrate a single host deployment example and some of the features of puppet via their configuration format - called a manifest.
[18:07] <Daviey> In this session, we will do the following:
[18:07] <Daviey> • Connect to an instance in the cloud
[18:07] <Daviey> • Install puppet
[18:07] <Daviey> • Initial configuration
[18:07] <Daviey> • Configure the same node to install and create a basic apache virtual host
[18:07] <Daviey> Firstly, i hope everyone will be able to look at a console window, and this IRC session concurrently.
[18:08] <Daviey> I'm going to invite everyone to connect via ssh to a cloud instance:
[18:08] <Daviey> $ ssh demo@demo.daviey.com
[18:08] <Daviey> You'll need to accept the host key
[18:08] <Daviey> I don't think it really requires verification in this instance.
[18:08] <Daviey> (Although, it's generally good practice to compare the fingerprint)
[18:09] <Daviey> The password is 'demo'
[18:09] <Daviey> (Secure huh?)
[18:10]  * Daviey waits for a confirmation.
[18:10] <Daviey> I will type in the IRC channel comments, so please multi-task by looking at both.. Thanks :)
[18:11] <Daviey> So, i just checked to see if we have apache2 installed... we do not!
[18:11] <Daviey> (You can check there is nothing running as a httpd on port 80, by visiting http://demo.daviey.com )
[18:12] <Daviey> (You should get a failure)
[18:12] <Daviey> I'm running sudo apt-get update, to make sure our indexes are updated
[18:12] <Daviey> The observant amongst you, will notice i'm running Natty
[18:12] <Daviey> The current development version
[18:12] <Daviey> (I must be crazy doing a demo on this! :)
[18:13] <Daviey> So, i just, sudo apt-get install puppet
[18:13] <Daviey> This installs the puppet application and its dependencies.
[18:14] <Daviey> This stage, would normally be done automatically during installation
[18:14] <Daviey> (if you preseed it such)
[18:14] <Daviey> You'll notice the output here:
[18:14] <Daviey> puppet not configured to start, please edit /etc/default/puppet to enable
[18:14] <Daviey> Did you all see the START=no option?
[18:15] <Daviey> This means that the puppet client agent will not run automatically
[18:15] <Daviey> My intention is to invoke puppet manually.. so i do not need the client to be running
[18:16] <Daviey> (one moment please)
[18:17] <Daviey> (slight technical issue, please hold)
[18:21] <Daviey> Okay!
[18:21] <Daviey> we are back
[18:21] <Daviey> okay, this is the directory structure we should see
[18:21] <Daviey> on a fresh installation
[18:22] <Daviey> Okay, i have just copied a manifest to /etc/puppet/manifests
[18:22] <Daviey> I hope everyone can see this
[18:22] <Daviey> It's quite a quick one i have thrown together
[18:22] <Daviey> It should:
[18:22] <Daviey> Install apache2
[18:22] <Daviey> add a virtual host, called demo.daviey.com
[18:23] <Daviey> and enable it
[18:23] <Daviey> (I'll make it available afterwards via a pastebin)
[18:23] <Daviey> The stanza towards the bottom mentions, ip-10-117-82-138
[18:23] <Daviey> (for the observant, you'll notice this is the hostname of the machine)
[18:24] <Daviey> I could equally, have put 'default' here... which would mean that it would do it for every machine connected
[18:24] <Daviey> (in this instance, i am only using one machine)
[18:24] <Daviey> Now, the actual virtual host needs a template...
[18:24] <Daviey> let's create it.
[18:25] <Daviey> puppet uses Ruby's ERB template system:
[18:25] <Daviey> You'll notice that there are parts which can be expanded.
[18:25] <Daviey> So, this is a generic apache virtual host template, that could be used for other virtualhosts
[18:25] <Daviey> other than demo.daviey.com
[18:26] <Daviey> Now... let's make puppet do its thing!
[18:27] <Daviey> I love it when a plan comes together.
[18:28] <Daviey> Essentially, i did a dry run with these configs before the session.. and didn't clean up properly!
[18:28] <Daviey> This is why i should have used puppet to clean up, as it would have done it better than me!
[18:29] <Daviey> So, puppet installed apache2 and enabled the virtual host
[18:29] <Daviey> puppet knows which package handler to use
[18:29] <Daviey> ie, apt, yum etc
[18:30] <Daviey> Now... if we check to see if apache started.. we'll see it failed... one moment
[18:31] <Daviey> So...
[18:31] <Daviey> (2)No such file or directory: apache2: could not open error log file /var/log/apache/demo.daviey.com-error.log.
[18:31] <Daviey> Unable to open logs
[18:31] <Daviey> This means i made a typo in my template... suggestions on how i should fix this?
[18:31] <Daviey> kim0, Is quite correct with:
 Daviey: should be "apache2" there
[18:31] <Daviey> But... How should i *fix* it?
[18:32] <Daviey> We edit the template of course!
[18:33] <Daviey> Now, we should be able to go to http://demo.daviey.com/
[18:33] <Daviey> (My simple Task didn't try to start apache if it wasn't already running!)
[18:33] <Daviey> notice: /Stage[main]//Node[ip-10-117-82-138]/Apache2::Simple-vhost[demo.daviey.com]/File[/etc/apache2/sites-available/demo.daviey.com]/content: content changed '{md5}5047b9f9a688c04e2697d9fd961960ed' to '{md5}2c32102fd06543c85967276eeee797e2'
[18:34] <Daviey> ^^ Puppet knew it should create a new virtual host, based on the template changing!
[18:34] <Daviey> How neat is that?!
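The notice above reports a change from one {md5} checksum to another. A minimal sketch of that idea, assuming the comparison is simply "checksum of what is on disk" against "checksum of the freshly rendered template" (puppet's actual internals are not shown in this session, and the function names here are hypothetical):

```python
import hashlib


def md5_tag(content):
    """Return a checksum string in the '{md5}...' form seen in the notice."""
    return "{md5}" + hashlib.md5(content.encode("utf-8")).hexdigest()


def sync_content(current, desired):
    """Return (content_to_keep, changed); rewrite only when checksums differ."""
    if md5_tag(current) == md5_tag(desired):
        return current, False
    return desired, True


if __name__ == "__main__":
    old = "ErrorLog /var/log/apache/demo.daviey.com-error.log\n"
    new = "ErrorLog /var/log/apache2/demo.daviey.com-error.log\n"
    content, changed = sync_content(old, new)
    print(md5_tag(old), "->", md5_tag(new), "changed:", changed)
```

Running the same comparison a second time with identical content reports no change, which is why repeated puppet runs leave an up-to-date file alone.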
[18:35] <Daviey> Now, in a real life example - puppet would also manage pulling in the website..
[18:35] <Daviey> puppet provides a FileBucket interface..
[18:35] <Daviey> This is similar to rsync, and allows files to be retrieved from there.
[18:35] <Daviey> However, for large files - people often use an external application which is configured via puppet.
[18:36] <Daviey> This could be anything from rsyncd, nfs or even torrent!
[18:36] <Daviey> facter is a really useful tool.  This is where the variables used in the templates are expanded from...  I think of it as lsb_release supercharged.
[18:37] <Daviey> Here is an example of the output, just generated:
[18:37] <Daviey> http://paste.ubuntu.com/584952/
[18:37] <Daviey> This is a list of 'facts' about the system
[18:38] <Daviey> One of the really nice things about the manifests... is that they can be conditional
[18:38] <Daviey> So, i could do a different task based on the virtual type (or lack of) for example.
[18:39] <Daviey> There is no point trying to use this machine as a virtual machine server, if it doesn't fit the requirements
[18:39] <Daviey> Usually bare metal - and amount of memory free
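The conditional decision described above, written as a rough Python sketch rather than as a manifest. The fact names ("virtual", "memoryfree_mb"), the 4096 MB threshold and the role names are assumptions made for illustration; check the facter output for the real names and formats.

```python
def choose_role(facts):
    """Pick a role from a dict of facts (hypothetical keys and threshold)."""
    # Assumption: a bare metal machine reports "physical" as its virtual type.
    is_bare_metal = facts.get("virtual") == "physical"
    # Assumption: free memory is available as a number of megabytes.
    has_memory = facts.get("memoryfree_mb", 0) >= 4096
    if is_bare_metal and has_memory:
        return "vm-host"
    return "plain-server"


if __name__ == "__main__":
    print(choose_role({"virtual": "physical", "memoryfree_mb": 8192}))
    print(choose_role({"virtual": "xen", "memoryfree_mb": 8192}))
```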
[18:39] <Daviey> The configuration files are largely inheritance based, which fully supports overriding of configurations from the base class.
[18:40] <Daviey> When puppet is installed on a client / server basis... it uses SSL for secure communication between the elements
[18:41] <Daviey> The server runs on port 8140. so make sure firewall is opened (or ec2 security group allows communication!)
[18:41] <Daviey> Client (Agent) - puppetd
[18:41] <Daviey> Server - puppetmasterd
[18:41] <Daviey> ^^ This is the name of the applications
[18:41] <Daviey> The puppetd runs on all the clients, and polls the Server with the default value of every 30 minutes looking for changes
[18:42] <Daviey> It defaults to looking for the dns hostname of 'puppet'
[18:42] <Daviey> So, it's a good idea for the puppet master to have that dns entry set for a local network
[18:42] <Daviey> Equally, i could have set puppet.mydomain.com
[18:43] <Daviey> This is probably a good place to stop the demo.  I will make my puppet configuration available for others to experiment with.
[18:43] <Daviey> It really is not as complicated as it seems to get started.
[18:43] <Daviey> When i first tried puppet, i found the 'getting started' docs to be somewhat complicated.
[18:44] <Daviey> I would recommend people start with a minimal example like this.. and then build from there.
[18:44] <Daviey> The puppet website has some excellent recipes to use as an example... but probably a good idea to start simple.
[18:44] <Daviey> I will now take questions, and answer them as best as i can
[18:45] <Daviey> Annnnd. classbot, i hate you
[18:45] <Daviey> classbot isn't +v
 sveiss asked: how large is 'large'? is there a rule of thumb as to how much data a FileBucket can cope with?
[18:46] <Daviey> sveiss, that is a good question.. I seem to remember reading that since 2.6... massive improvements have gone into increasing its efficiency
[18:46] <Daviey> However, it is still believed to be the likely bottleneck
[18:47] <Daviey> I haven't found the later versions to suffer too badly from this bottleneck
[18:47] <Daviey> but others have commented.
[18:47] <Daviey> I think it depends on load..
[18:48] <Daviey> I would ask that if you do try the filebucket that you report back to the ubuntu server team with your success.
[18:48] <Daviey> (We often don't get enough feedback)
[18:48] <ClassBot> sveiss asked: how large is 'large'? is there a rule of thumb as to how much data a FileBucket can cope with?
[18:49] <Daviey> kim0 asked: Wouldn't clients looking for dns name "puppet" and blindly following it .. be a security risk
[18:49] <Daviey> Well yes.. this is true.. This is one of the reasons SSL is used.
[18:49] <Daviey> Essentially, the puppet master usually has a self-signed key
[18:49] <Daviey> but the client needs to accept it.
[18:49] <Daviey> This would normally happen as part of the installation, or bootstrapping
[18:50] <Daviey> Which is an area before puppet works.
[18:50] <ClassBot> kim0 asked: Wouldn't clients looking for dns name "puppet" and blindly following it .. be a security risk
[18:51] <Daviey> Wow.. i now understand ClassBot
[18:51] <ClassBot> There are 10 minutes remaining in the current session.
[18:51] <ClassBot> kim0 asked: Do you reuse ready made recipes
[18:52] <Daviey> You would be foolish not to!
[18:52] <Daviey> There is a true gem of samples on the puppet wiki, and other locations.
[18:52] <Daviey> Additionally, there are additional modules
[18:52] <Daviey> Which allow you to reduce the burden of what you need to do
[18:53] <Daviey> http://forge.puppetlabs.com/
[18:53] <Daviey> If there are no more questions, i will end my session.
[18:54] <Daviey> I would like to thank everyone for coming
[18:54] <Daviey> Please do experiment with puppet, and report back to us.
[18:54] <Daviey> We are a friendly team that hangs around in #ubuntu-server
[18:54] <Daviey> Thank you for your time.
[18:55]  * kim0 claps
[18:55] <kim0> Thanks Daviey for the awesome session
[18:56] <kim0> Next up is Edulix .. Presenting "Hadoop" The ultimate hammer to bang on big data :)
[18:56] <ClassBot> There are 5 minutes remaining in the current session.
[19:00] <Edulix> hello people, thanks for attending. this is the session titled "Using hadoop, divide and conquer"
[19:01] <Edulix> kim0 told me about these ubuntu cloud sessions, and kindly asked me to do a talk about hadoop, so here I am  =)
[19:01] <Edulix> first I must say that I am in no way a hadoop expert, as I have been working with hadoop just for a bit over a month
[19:01] <ClassBot> Logs for this session will be available at http://irclogs.ubuntu.com/2011/03/24/%23ubuntu-classroom.html following the conclusion of the session.
[19:02] <Edulix> but I hope that I can help to show you a bit of hadoop and ease the learning curve for those who want to use it
[19:03] <Edulix> I'm going to base this talk on the hadoop tutorial available at http://developer.yahoo.com/hadoop/tutorial/ as it helped me a lot, but it's a bit dense, so I'll do a summarized version
[19:03] <Edulix> So what's hadoop anyway?
[19:03] <Edulix> it's a large-scale distributed batch processing infrastructure, designed to efficiently distribute large amounts of work across a set of machines
[19:03] <Edulix> here large amounts of work means really really large
[19:04] <Edulix> Hundreds of gigabytes of data is low end for hadoop!
[19:04] <Edulix> hadoop supports handling hundreds of petabytes... Normally the input data is not that big, but the intermediate data is or can be
[19:04] <Edulix> of course, all this does not fit on a single hard drive, much less in memory
[19:05] <Edulix> so hadoop comes with support for its own distributed file system: HDFS
[19:05] <Edulix> which breaks up input data and sends fractions  (blocks) of the original data to some machines in your cluster
[19:06] <Edulix> everyone that has tried will know that performing large-scale computation is difficult
[19:06] <Edulix> whenever multiple machines are used in cooperation with one another, the probability of failures rises: partial failures are expected and common
[19:07] <Edulix> Network failures, computers over heating, disks crashing, data corruption, maliciously modified data..
[19:07] <Edulix> shit happens, all the time (TM)
[19:07] <Edulix> In all these cases, the rest of the distributed system should be able to recover and continue to make progress. the show must go on
[19:08] <Edulix> Hadoop provides no security, and no defense to man in the middle attacks for example
[19:08] <Edulix> it assumes you control your computers so they are secure
[19:08] <Edulix> on the other hand, it is designed to handle hardware failure and data congestion issues very robustly
[19:09] <Edulix> to be successful, a large-scale distributed system must be able to manage resources efficiently
[19:09] <Edulix> CPU, RAM, Harddisk space, network bandwidth
[19:10] <Edulix> This includes allocating some of these resources toward maintaining the system as a whole
[19:10] <Edulix> ..... while devoting as much time as possible to the actual core computation
[19:10] <Edulix> So let's talk about the hadoop approach to things
[19:11] <Edulix> btw if you have any questions, just ask in #ubuntu-classroom-chat with QUESTION: your question
[19:11] <Edulix> Hadoop uses a  simplified programming model which allows the user to quickly write and test distributed systems
[19:12] <Edulix> and also to test its efficient & automatic distribution of data and work across machines
[19:13] <Edulix> and it also allows using the underlying parallelism of the CPU cores
[19:13] <Edulix> In a hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in
[19:13] <Edulix> HDFS will split large data files into blocks which are managed by different nodes in the cluster
[19:13] <Edulix> Also replicating data in different nodes, just in case
[19:14] <ClassBot> kim0 asked: Does hadoop require certain "problems" that fits its model ? can I throw random computations to it
[19:15] <Edulix> I'm going to answer that now =)
[19:16] <Edulix> basically, hadoop uses the mapreduce programming paradigm
[19:16] <Edulix> In hadoop, Data is conceptually record-oriented. Input files are split into input splits referring to a set of records.
[19:17] <Edulix> The strategy of the scheduler is moving the computation to the data, i.e. which data will be processed by a node is chosen based on its locality to the node, which results in high performance.
[19:17] <Edulix> Hadoop programs need to follow a particular programming model (MapReduce), which limits the amount of communication, as each individual record is processed by a task in isolation from one another
[19:18] <Edulix> In MapReduce, records are processed in isolation by tasks called Mappers
[19:18] <Edulix> The output from the Mappers is then brought together into a second set of tasks called Reducers
[19:18] <Edulix> where results from different mappers can be merged together
[19:18] <Edulix> Note that if you for example don't need the Reduce step, you can implement a Map-only processing.
[19:19] <Edulix> This simplification makes the Hadoop framework much more reliable, because if a node is slow or crashes, another node can simply replace it, taking the same inputsplit and processing it again
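The Mapper/Reducer model described above can be illustrated with a tiny word count in plain Python. This is a single-process sketch of the programming model only, not Hadoop's API: the mapper, shuffle, reducer and run_job names are made up for illustration, and a real job would run the mappers on different nodes.

```python
from collections import defaultdict


def mapper(record):
    """Process one record in isolation and emit (key, value) pairs."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group values by key so each key reaches a single reducer call."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Merge all values emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce over a list of records."""
    pairs = [pair for record in records for pair in mapper(record)]
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["the show must go on", "The show"]))
```

Since each mapper call only sees its own record, a failed or slow call can be repeated elsewhere without affecting the others.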
[19:19] <ClassBot> chadadavis asked: Is there any facility for automatically determining how to partition the data, i.e. based on how long one chunk of processing takes?
[19:21] <Edulix> to be able to partition the data,
[19:21] <Edulix> you need to have first a structure for that data. for example,
[19:22] <Edulix> if you have a png image that you need to process, then the input data is the image file. you might partition your image in chunks that start in a given position (x,y) and have a height and a width
[19:22] <Edulix> but the partitioning is usually done by you, the hadoop program developer
[19:23] <Edulix> though hadoop is in charge of selecting where to send to that partition, depending on data locality
[19:24] <Edulix> when you partition the input data, you don't send the data (input split) to the node that will process it: ideally it will already have that data!
[19:24] <Edulix> how is this possible?
[19:25] <Edulix> because when you do the partition, the InputSplit only defines this partition (so it might be in the image example 4 numbers: x,y, height, width) and depending on which nodes the file blocks of the input data reside, hadoop will send that split to that node
[19:26] <Edulix> and then the node will open the file in HDFS for reading starting (fseek) in there
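For the image example, a split that is just four numbers could be generated like this. It is a sketch in Python, not Hadoop's InputSplit class; the image_splits helper and the square tile size are assumptions for illustration.

```python
def image_splits(width, height, tile):
    """Return (x, y, w, h) tuples that together cover the whole image."""
    splits = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            # Tiles on the right and bottom edges may be smaller than `tile`.
            splits.append((x, y, min(tile, width - x), min(tile, height - y)))
    return splits


if __name__ == "__main__":
    for split in image_splits(250, 100, 100):
        print(split)
```

Each tuple describes a region without carrying any pixel data, which is what lets the scheduler hand it to a node that already holds the relevant file blocks.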
[19:26] <Edulix> ok, I continue =)
[19:26] <Edulix> separate nodes in a Hadoop cluster still communicate with one another, implicitly
[19:27] <Edulix> pieces of data can be tagged with key names which inform Hadoop how to send related bits of information to a common destination node
[19:27] <Edulix> Hadoop internally manages all of the data transfer and cluster topology issues
[19:27] <Edulix> One of the major benefits of using Hadoop in contrast to other distributed systems is its flat scalability curve
[19:28] <Edulix> Using other distributed programming paradigms, you might get better results for 2, 5, perhaps a dozen machines. But when you need to go really large scale, this is where hadoop excels
[19:29] <Edulix> After your program is written and functioning on perhaps ten nodes (to test that it can be used in multiple nodes with replication etc and not only in standalone mode),
[19:29] <Edulix> then  very little --if any-- work is required for that same program to run on a much larger amount of hardware efficiently
[19:29] <Edulix> == distributed file system ==
[19:29] <Edulix> a distributed file system is designed to hold a large amount of data and provide access to this data to many clients distributed across a network
[19:30] <Edulix> HDFS is designed to store a very large amount of information, across multiple machines, and also supports very large files
[19:30] <Edulix> some of its requirements are:
[19:30] <Edulix> it should store data reliably even if some machines fail
[19:30] <Edulix> it should provide fast, scalable access to this information
[19:31] <Edulix> And finally it should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible
[19:31] <Edulix> This last point is crucial. HDFS is optimized for MapReduce, and thus has made some decisions/tradeoffs:
[19:31] <Edulix> Applications that use HDFS are assumed to perform long sequential streaming reads from file because of MapReduce
[19:31] <Edulix> so HDFS is optimized to provide streaming read performance
[19:32] <Edulix> this comes at the expense of random seek times to arbitrary positions in files
[19:32] <Edulix> i.e. when a node reads, it might start reading in the middle of a file, but then it will read byte after byte, not jumping here and there
[19:32] <Edulix> Data will be written to the HDFS once and then read several times; AFAIK there is no file update support
[19:33] <Edulix> due to the large size of files, and the sequential nature of reads, the system does not provide a mechanism for local caching of data
[19:33] <Edulix> data replication strategies combat machines or even whole racks failing
[19:34] <Edulix> hadoop comes configured to have each file block stored in three nodes by default: two in the same rack, and the third in a different rack
[19:35] <Edulix> if the first rack fails, speed might degrade relatively but information wouldn't be lost
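A rough sketch of the placement just described: two copies in one rack and a third elsewhere, assuming "elsewhere" means a different rack, as the rack-failure remark implies. The place_replicas function and the rack/node names are hypothetical, and Hadoop's real placement policy has more rules than this.

```python
def place_replicas(racks, local_rack):
    """Choose three nodes: two in local_rack and one in another rack.

    racks maps a rack name to the list of node names in that rack.
    """
    local_nodes = racks[local_rack]
    if len(local_nodes) < 2:
        raise ValueError("need at least two nodes in the local rack")
    for name in sorted(racks):
        if name != local_rack and racks[name]:
            return [local_nodes[0], local_nodes[1], racks[name][0]]
    raise ValueError("need a second rack with at least one node")


if __name__ == "__main__":
    cluster = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
    print(place_replicas(cluster, "rack1"))
```

With this layout, losing all of rack1 still leaves one copy of the block on rack2.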
[19:35] <Edulix> BTW HDFS design is based on google file system (GFS)
[19:36] <Edulix> and as you have probably guessed, in HDFS data/files is/are split in blocks of equal size in DataNodes (machines in the cluster)
[19:36] <ClassBot> gaberlunzie asked: would this sequential access mean hadoop can work with tape?
[19:37] <Edulix> I haven't heard anyone doing such a thing,
[19:37] <Edulix> and I don't think it's a good idea
[19:38] <Edulix> why? because the reads are sequential, but you need to do the first seek to start reading at the point your inputsplit indicates
[19:38] <Edulix> doing this first seek might be too slow for a tape, but I might be completely wrong  here
[19:39] <Edulix> note that the data stored in HDFS is supposed to be temporary, mostly, just for working
[19:39] <Edulix> so you copy the data there, do your thing, then copy the output result back
[19:39] <Edulix> in contrast, tapes are mostly used for long term storage
[19:40] <Edulix> (continuing) default block size in HDFS is very large (64MB)
[19:40] <Edulix> This decreases metadata overhead and allows for fast streaming reads of data
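To get a feel for what 64MB blocks mean, here is a small Python sketch that cuts a file size into block-sized ranges. The split_into_blocks helper is hypothetical and only does the arithmetic; it assumes the last block is simply shorter when the file size is not a multiple of the block size.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the default mentioned above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return an (offset, length) pair for each block of the file."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


if __name__ == "__main__":
    mb = 1024 * 1024
    for offset, length in split_into_blocks(200 * mb):
        print(offset // mb, "MB +", length // mb, "MB")
```

A 200MB file needs only four entries of metadata this way, which is the overhead saving that large blocks buy.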
[19:41] <Edulix> Because HDFS stores files as a set of large blocks across several machines, these files are not part of the ordinary file system
[19:41] <Edulix> For each DataNode machine, the blocks it stores reside in a particular directory managed by the DataNode service, and these blocks are stored as files whose filenames are their blockid
[19:41] <Edulix> HDFS comes with its own utilities for file management equivalent to ls, cp, mv, rm, etc
[19:42] <Edulix> The metadata (names of files and dirs and where the blocks are stored) of the files can be modified by multiple clients concurrently. To orchestrate this, metadata is stored and handled by the NameNode, which usually keeps the metadata in memory (it's not much data), so that it's fast (because this data *will* be accessed randomly).
[19:43] <ClassBot> chadadavis asked: If I first have to copy the data (e.g. from a DB) to HDFS before splitting, couldn't the mappers just pull/query the data directly from the DB as well?
[19:43] <Edulix> yes you can =)
[19:43] <Edulix> and if the data is in a DB, you should
[19:45] <Edulix> input data is read from an InputFormat
[19:45] <Edulix> and there are different input formats provided by hadoop: FileInputFormat for example to read from a single file
[19:45] <Edulix> but there's also DBInputFormat, for example
[19:46] <Edulix> in my experience, you will probably create your own =)
[19:46] <Edulix> Deliberately I haven't explained any code, but I recommend that, if you're interested, you start playing with hadoop locally on your own machine
[19:47] <Edulix> just download hadoop from http://hadoop.apache.org/ and follow the quickstart http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
[19:47] <Edulix> for quickstart and for development, you typically use hadoop as standalone in your own machine
[19:47] <Edulix> in this case HDFS will simply refer to your own file system
[19:47] <Edulix> You just need to download hadoop, configure Java (because hadoop is written in java), and execute the example as mentioned in the quickstart page
[19:47] <Edulix> as mentioned earlier, with hadoop you usually operate as follows, because of its batching nature: you copy input data to HDFS, then request to launch the hadoop task with an output dir, and when it's done, the output dir will have the task results
[19:48] <Edulix> To start developing a hadoop app I used this tutorial, because it explains pretty much everything I needed: http://codedemigod.com/blog/?p=120
[19:48] <Edulix> but note that it's a bit old
[19:48] <Edulix> and one of the things that I found most frustrating in hadoop while developing was that there are duplicated classes i.e. org.apache.hadoop.mapreduce.Job and org.apache.hadoop.mapred.jobcontrol.Job
[19:49] <Edulix> In that case, always use org.apache.hadoop.mapreduce, because it is the new improved API
[19:49] <Edulix> be warned, the examples in http://codedemigod.com/blog/?p=120 use the old mapred api :P
[19:50] <Edulix> and hey, now I'm open to even more questions !
[19:51] <Edulix> and if you have questions later on, you can always join us in freenode.net, #hadoop, and hopefully someone will help you there =)
[19:51] <ClassBot> There are 10 minutes remaining in the current session.
[19:52] <ClassBot> gaberlunzie asked: does hadoop have formats to read video (e.g., EDLs and AAFs)?
[19:52] <Edulix> most probably.. not, but maybe someone has done that before
[19:52] <Edulix> anyway, creating a new input format is really easy
[19:53] <ClassBot> chadadavis asked: Mappers can presumably also be written in something other than Java? Are there APIs for other languages (e.g. Python?) Or is managed primarily at the shell level?
[19:54] <Edulix> good question!
[19:54] <Edulix> yes, there are examples in python and in C++
[19:55] <Edulix> I haven't used them though
[19:55] <ClassBot> kim0 asked: Can I use hadoop to crunch lots of data running on Amazon EC2 cloud ?
[19:55] <Edulix> heh I forgot to mention it =)
[19:56] <Edulix> answer is yes!
[19:56] <Edulix> more details in http://aws.amazon.com/es/elasticmapreduce/
[19:56] <ClassBot> There are 5 minutes remaining in the current session.
[19:56] <Edulix> that's one of the nice things about using hadoop: many big players in the industry use it. yahoo, for example, and amazon has support for it too
[19:57] <Edulix> so you don't really need to have lots of machines for doing large computation
[19:57] <Edulix> just use amazon =)
[19:57] <ClassBot> gaberlunzie asked: is there a hadoop format repository?
[19:58] <Edulix> I don't know huh
[19:58] <Edulix> :P
[19:58] <Edulix> I didn't investigate much about this because I needed to have my own
[19:58] <Edulix> but probably in contrib there is
[20:00] <Edulix> ok so that's it!
[20:00] <Edulix> Thanks for attending the talk, and thanks to the organizers
[20:01]  * kim0 claps .. Thanks Edulix 
[20:01] <ClassBot> Logs for this session will be available at http://irclogs.ubuntu.com/2011/03/24/%23ubuntu-classroom.html following the conclusion of the session.
[20:01] <obino> thanks Edulix
[20:01] <obino> very nice presentation
[20:02] <obino> I am graziano obertelli and I work at eucalyptus systems
[20:02] <obino> feel free to ask questions at any time
[20:02] <obino> if they are about eucalyptus I may be able to answer them :)
[20:03] <obino> Eucalyptus powers the UEC
[20:04] <obino> Ubuntu added a nice theme to Eucalyptus, the image store and a very nifty way to autoregister the components
[20:04] <obino> which makes it a breeze to install UEC on Ubuntu clusters
[20:05] <obino> at http://open.eucalyptus.com/learn/what-is-eucalyptus you can quickly check what is eucalyptus
[20:05] <obino> with it you can have your own private cloud
[20:05] <obino> currently Eucalyptus supports AWS EC2 and S3 API
[20:06] <obino> thus a lot of client tools written for EC2 should work with Eucalyptus
[20:06] <obino> minus minor changes like the endpoint URL
[20:07] <obino> Eucalyptus has a modular architecture: there are 5 main components
[20:07] <obino> the cloud controller (CLC)
[20:07] <obino> walrus (W)
[20:07] <obino> the cluster controller (CC)
[20:08] <obino> the storage controller (SC)
[20:08] <obino> and the node controller (NC)
[20:08] <obino> the CLC and W are the user facing components
[20:08] <obino> they are the endpoints for the client tools
[20:08] <obino> respectively for the EC2 API and for the S3 API
[20:09] <obino> there has to be 1 CLC and 1 W per installed cloud
[20:09] <obino> and they need to be publicly accessible
[20:09] <obino> the CC is the middle man
[20:10] <obino> it controls a set of NCs
[20:10] <obino> and reports them to the CLC
[20:10] <obino> it controls the network for the instances running on its NCs
[20:11] <obino> there can be multiple CCs in a cloud
[20:11] <obino> the SC usually sits with the CC
[20:11] <obino> there has to be one SC per CC
[20:11] <obino> otherwise EBS won't be available for that cluster
[20:12] <obino> SC and CC need to be able to reach (talk to) the CLC and W
[20:12] <obino> the NC is the real worker
[20:12] <obino> instances run on the machine hosting the NC
[20:13] <obino> the previous tutorials explained a great deal of the user interaction, so I'll talk a bit about what happens behind the scenes
[20:14] <obino> for example what happens when an instance is launched
[20:14] <obino> the CLC receives the request
[20:15] <obino> depending on the 'zone' the request asks for, it will select the corresponding CC
[20:15] <obino> after of course having checked that there is enough capacity left in that zone
[20:15] <obino> along with the request it sends information about the network to set up for the instance
[20:16] <obino> since every instance belongs to a security group and each security group has its own private network
[20:17] <obino> the CC will then decide which NC will run the instance
[20:17] <obino> based on the selected scheduler
[20:17] <obino> and it will setup the network for the security group
[20:18] <obino> this step is dependent on how Eucalyptus is configured, since there are 4 different Networking Modes
[20:18] <obino> once the NC receives the request it will need the emi file (the root fs of the future instance)
[20:19] <obino> the NC keeps a local cache of the emis it has seen before
[20:19] <obino> it's a LRU cache so the least used image will be evicted if the cache grows too big
[20:19] <obino> so the NC will check first to see if the emi is in the cache
[20:20] <obino> if not it will have to contact W to get it
[20:20] <obino> this is why W needs to be accessible by the NCs
[20:20] <obino> of course it's not only the emi that the NC downloads but the eki and the eri too
[20:21] <obino> once the image is transferred, it is copied into the cache first
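The cache behaviour described above (check the cache, otherwise fetch from W, evict the least recently used image when full) can be sketched in Python. This is not Eucalyptus code: the ImageCache class and the fetch callback are hypothetical, and capacity is counted in number of images here, whereas a real cache would more likely be bounded by disk space.

```python
from collections import OrderedDict


class ImageCache:
    """Least-recently-used cache of images, keyed by image id."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity      # maximum number of cached images
        self.fetch = fetch            # called on a miss, e.g. a download from W
        self.images = OrderedDict()   # oldest use first, newest use last

    def get(self, image_id):
        if image_id in self.images:
            # Cache hit: mark the image as most recently used.
            self.images.move_to_end(image_id)
            return self.images[image_id]
        # Cache miss: fetch the image, store it, then evict if over capacity.
        data = self.fetch(image_id)
        self.images[image_id] = data
        if len(self.images) > self.capacity:
            self.images.popitem(last=False)
        return data


if __name__ == "__main__":
    cache = ImageCache(2, lambda image_id: "contents of " + image_id)
    for image_id in ["emi-1", "emi-2", "emi-1", "emi-3"]:
        cache.get(image_id)
    print(list(cache.images))
```

In the usage example emi-2 is evicted, because emi-1 was used again more recently before emi-3 arrived.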
[20:22] <obino> after that the emi, eki and eri are assembled for the specific hypervisor the NC has access to
[20:22] <obino> so, in the case of KVM, a single disk is created
[20:22] <obino> the size of which depends on the configuration the cloud administrator gave to the instances
[20:23] <obino> and the emi is copied into the first partition
[20:23] <obino> the 3rd partition is populated with the swap
[20:23] <obino> and the second one will be ephemeral
[20:24] <obino> libvirt is finally instructed to start the instance
[20:25] <obino> and of course the NC will take all the steps to setup the network  for the instance
[20:25] <obino> from this quick rundown, you will see why the first boot of an instance on a given NC takes longer
[20:26] <obino> there is an extra network transfer (from W) and an extra disk copy (to populate the cache) that takes place
[20:27] <ClassBot> smoser asked: is Eucalyptus expecting to have EBS-root and accompaning API calls (StartInstances StopInstances ...) ?
[20:27] <obino> boot from EBS is expected to be in the next release
[20:28] <obino> at least that's what they told me :)
[20:28] <obino> I'm not sure about the start and stop instances call
[20:29] <obino> the above instance life cycle that I went through should be helpful to understand how to debug the most frequent problem on a Eucalyptus installation
[20:29] <obino> the instance won't reach running state
[20:30] <obino> from the above it is easy to see that working backward may be helpful
[20:30] <obino> so starting from the NC logs to see if the instance started correctly (or at least libvirt tried to start it)
[20:30] <obino> and if nothing is there, check the CC logs
[20:30] <obino> to finish with the CLC logs
[20:32] <obino> despite the complexity, eucalyptus is fairly easy to install
[20:32] <obino> and the UEC has taken this step even further
[20:32] <obino> so if you want to play with Eucalyptus or the UEC, you just need 2 machines available
[20:33] <obino> if instead you want to play with Eucalyptus before installing, to see what it can do and how good the EC2/S3 APIs are
[20:33] <obino> then you can try our community cloud http://open.eucalyptus.com/CommunityCloud
[20:33] <obino> called ECC
[20:34] <obino> the ECC is available to everybody
[20:34] <obino> the SLAs are designed to avoid abuse
[20:35] <obino> so your instance(s) will be terminated after 6 hours of running time
[20:35] <obino> you can of course re-run instances at will, but no more than 4 at any point in time
[20:35] <obino> same idea for the bucket, volumes and snapshots
[20:36] <obino> the ECC runs the latest stable version of Eucalyptus, currently 2.0.2
[20:37] <obino> if you are a developer and you are more interested in the code and architecture, we have assembled a few pages at http://open.eucalyptus.com/participate/contribute
[20:37] <obino> which may be useful
[20:38] <obino> starting from our launchpad home, and the API version we support
[20:39] <obino> we have 2 launchpad branches, for stable version and for the development of the next version
[20:39] <obino> both are accessible of course
[20:39] <obino> we provide also some 'nightly builds'
[20:40] <obino> they are actually produced on a weekly basis, but they kept the name
[20:40] <obino> finally we give some information on how to contribute back to eucalyptus
[20:41] <obino> and the final page is an assortment of various tips which may be of use to developers
[20:41] <obino> like debugging tricks, or using eclipse or partial compile/deploy
[20:41] <obino> we are hoping to expand this area soon
[20:42] <obino> finally under http://open.eucalyptus.com/participate you will see various ways to interact with us and Eucalyptus
[20:42] <obino> in particular the forum is quite active and it is quite a resource to solve issues
[20:43] <obino> as well as the community wiki
[20:43] <obino> is there anything in particular that you want to hear about eucalyptus?
[20:44] <obino> or questions?
[20:44] <obino> well then, it looks like I managed to put everyone to sleep! :)
[20:45] <obino> this http://open.eucalyptus.com/contact contains all the different ways to reach us in case you have questions
[20:46] <obino> and of course there is the UEC page https://help.ubuntu.com/community/UEC
[20:47] <obino> which contains very good information about Eucalyptus/UEC
[20:48] <ClassBot> tonyflint1 asked: are there any plans for addition tools/utilities/documentation for creating custom images?
[20:49] <obino> we currently have a few pages on our community wiki under the images tab http://open.eucalyptus.com/wiki/images
[20:49] <obino> which could be a chore at times
[20:50] <obino> but they are useful to understand how things work
[20:50] <obino> most of the EC2 images should work with UEC/Eucalyptus
[20:50] <obino> so any way you have to generate images should work
[20:51] <obino> the kernel/initrd combination depends of course on the hypervisor the instances will use
[20:51] <ClassBot> There are 10 minutes remaining in the current session.
[20:52] <obino> in short, we don't have short-term plans to generate new tools, but we are working with the current tools to make sure they are compatible with eucalyptus
[20:52] <obino> if you have a favorite tool to generate images, let us know :)
[20:52] <obino> tonyflint1: does it answer your question?
[20:55] <ClassBot> gholms|work asked: Boxgrinder seems to be a decent tool for building images.  Any idea if that works with Ubuntu?
[20:55] <obino> good question
[20:55] <obino> we are in contact with a developer (marek) of boxgrinder, who has been very helpful so we are hoping to have an official eucalyptus plugin soon
[20:56] <obino> as for the question itself it could be interpreted in 2 ways: will boxgrinder create ubuntu images? or will boxgrinder be packaged for ubuntu?
[20:56] <ClassBot> There are 5 minutes remaining in the current session.
[20:57] <obino> both are probably better answered by the boxgrinder developers
[20:57] <obino> I would hope yes
[20:57] <obino> boxgrinder should be fairly portable
[20:58] <obino> and it has a nice plugin structure
[21:00]  * kim0 claps 
[21:00] <kim0> Thanks a lot for this wonderful session
[21:00]  * obino bows
[21:01] <kim0> Alright everyone .. Thank you for attending Ubuntu Cloud Days
[21:01] <kim0> I hope it was fun and useful
[21:01] <kim0> You can find us at #ubuntu-cloud
[21:01] <ClassBot> Logs for this session will be available at http://irclogs.ubuntu.com/2011/03/24/%23ubuntu-classroom.html
[21:01] <kim0> and feel free to ping me personally later
[21:01] <kim0> Thanks .. best regards .. till next time