=== freeflying_away is now known as freeflying
[01:43] bigjools: maas-dhcp should be on the cluster right?
[01:43] roaksoax: yes
[01:43] you *just* caught me, about to go to lunch
[01:44] bigjools: enjoy
[01:44] :)
[03:05] bigjools:
[03:05] Failure: twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 111: Connection refused.
[03:05] 2013-09-20 12:03:12+0900 [Uninitialized] Stopping factory
roaksoax: I have never seen that
[03:13] bigjools: we are seeing this right now ;/
[03:13] how does pserv get that address?
[03:13] what address?
[03:13] bigjools: the address of the region
[03:14] DEFAULT_MAAS_URL
[03:14] has to be contactable by the nodes
[03:14] and clusters
[03:14] bigjools: it is
[03:14] but pserv is trying to contact without an address: http:///MAAS/api/1.0/pxeconfig/?cluster_uuid=5c11d0fa-6b41-4cd6-b278
[03:14] oh I see now!
[03:14] jtv1: ^
[03:14] halp
[03:15] bigjools: this is kind of critical btw
[03:16] roaksoax: what is in /etc/maas/maas_cluster.conf
[03:16] the MAAS_URL in there should be the region's
[03:16] it is
[03:17] so that Connection refused error is in the pserv log?
[03:18] roaksoax: ^
[03:18] roaksoax, seems the same
[03:20] roaksoax: can you sniff its tcp and see what it's trying to connect to
[03:20] it's just a bad config somewhere
[03:20] or perhaps a proxy getting in the way
[03:20] * jtv1 reads backscroll
[03:22] yes, pserv.log
[03:24] bigjools: this is multinode maas btw
[03:24] roaksoax: you mean multi cluster?
[03:25] ok hold on
[03:25] it seems squid is somehow messing things up
[03:25] but it shouldn't
=== jtv1 is now known as jtv
[03:28] Might be worth checking whether the start-cluster-controller command line has the right server URL.
[03:28] that's what I was juuuust about to say
[03:29] jtv: it comes from MAAS_URL in the maas_cluster.conf file right?
[03:29] If it doesn't, then either /etc/maas/maas_cluster.conf is messed up, or alternatively I'd say something changed in how upstart scripts work.
[03:29] I think that's what the packaging uses
[03:29] Yes.
[03:29] The upstart script sources that config file, then passes $MAAS_URL on the command line.
[03:30] Oh.
[03:30] * jtv checks something...
[03:30] roaksoax: /etc/init/maas-cluster-celery.conf starts it remember
[03:31] yeah looking at it
[03:31] but you said that config was already ok
[03:31] so squid getting in the way?
[03:31] How could squid get in this particular way?
[03:31] bigjools: yeah, otherwise it would not have registered the cluster controller in the region
[03:31] roaksoax: right
[03:31] jtv: it could be down :)
[03:32] Down, sure. But mangling URLs..?
[03:33] who knows what its config is
[03:34] maybe a dirty pyc file?
[03:35] unlikely
[03:35] Could happen with permissions problems and an upgrade, I suppose.
[03:36] jtv: clean system
[03:36] roaksoax: have you traced the tcp?
[03:36] ethereal or something
[03:36] tcpdump
[03:37] take it one step at a time and divide the problem into possible areas
[03:38] bigjools: https://pastebin.canonical.com/97785/
[03:40] roaksoax: stops short ...
[03:40] bigjools: repeats itself
[03:40] roaksoax: is it connecting to the right place?
[03:41] bigjools: yeah
[03:41] again, otherwise the cluster controller would not have registered itself
[03:41] roaksoax: ok if you trace on the region controller do you see traffic?
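The empty-host URL above (http:///MAAS/api/1.0/pxeconfig/...) is what pserv builds when its configured server URL has no host. A minimal way to sanity-check the cluster-side setting being discussed, sketched in Python 3 and assuming /etc/maas/maas_cluster.conf is the plain KEY=value file the upstart job sources (the path and variable name come from the conversation; the parsing itself is illustrative only):

    # Warn if MAAS_URL in /etc/maas/maas_cluster.conf is unset or has no host part.
    from urllib.parse import urlparse

    CONF = "/etc/maas/maas_cluster.conf"

    maas_url = None
    with open(CONF) as f:
        for line in f:
            line = line.strip()
            if line.startswith("MAAS_URL="):
                # Strip optional quoting around the value.
                maas_url = line.split("=", 1)[1].strip().strip("\"'")

    if not maas_url:
        print("MAAS_URL is not set in", CONF)
    elif not urlparse(maas_url).netloc:
        print("MAAS_URL has no host part:", repr(maas_url))
    else:
        print("MAAS_URL looks sane:", maas_url)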
[03:41] bigjools: hold on, I rebooted the cluster
[04:10] jtv: roaksoax: so the config actually comes from /etc/maas/pserv.yaml
[04:10] tftp:generator:
[04:11] roaksoax: so I suspect the bug is in the packaging and it doesn't set that file up properly sometimes
[04:11] maybe a race condition with another package
[04:11] and the wrong one got installed first
[06:04] where can I watch commissioning log?
=== CyberJacob|Away is now known as CyberJacob
[06:41] bigjools, now I can enlist a node, but have a problem commissioning it; I have apt_proxy in enlist_data and commissioning, any clue?
[06:43] jtv, ^^
[07:01] Hi freeflying
[07:01] I think a commissioning node will direct its syslog to either the region controller or the cluster controller...
[07:02] Look for an "rsyslog" log.
[07:03] jtv, no log there
[07:04] jtv, it says connecting to 169.xxx fails, after it gets an ip address
[07:05] The node says that on its console?
[07:05] yes
[07:05] Any idea what the 169.xxx address is?
[07:06] no, all addresses we're using are within 10.x.x.x
[07:07] looks like that is the ephemeral image?
[07:08] Yes, commissioning runs an ephemeral image.
[07:08] But I don't think the node should be talking to the internet at that point.
[07:09] Doesn't look as if it's the Ubuntu archive either, although I suppose it might be your local mirror.
[07:10] why is it calling 169.254.169.254/2009-04-04/meta
[07:11] So that's where it's trying to find its metadata service.
[07:11] Strange address...
[07:12] jtv, where shall I configure it, or is it because it can't access the maas regional controller
[07:12] That's a zeroconf address. Any chance there's a wifi interface being mistaken for your server?
[07:12] jtv, no
[07:13] there are 8 ports on the server, 1 of them is 10 gig, four of them are 1 gig
[07:14] At least, when I use "whois" on it, it says computers use 169.154.*.* (note 154, not 254!) when they don't have an IP address and don't get one from the network.
[07:14] from the web UI, the metadata_url was set to the maas region's
[07:14] And that's a 10.*.*.* address, right?
[07:15] And no 169.*.*.* networks at all?
[07:15] 169.254.169.254 is used in Amazon EC2 and other cloud computing platforms to distribute metadata to cloud instances.
[07:15] !
[07:15] So cloud-init is acting up.
[07:16] Thinking this is EC2...
[07:17] cloudinit/sources/DataSourceEc2.py:DEF_MD_URL = "http://169.254.169.254"
[07:17] jtv, yes, in the preseed generated for commissioning it is 10.209.13.204/MAAS
[07:17] (this is in the cloud-init source code)
[07:19] I'm looking at the cloud-init source now...
[07:23] I suspect the node can't reach the region and thus cloud-init falls back to using the EC2 metadata address.
[07:23] Ah, I was thinking it might not have received the right configuration...
[07:25] Well, I don't see how the EC2 IP could originate from MAAS.
=== CyberJacob is now known as CyberJacob|Away
[07:27] Neither do I... I was thinking that cloud-init might not have received the right configuration, and decided to try things the EC2 way.
[07:27] I'm trying to figure out how cloud-init chooses the DataSource to use.
[07:27] That's possible indeed. But from what freeflying was saying, it seems the configuration is right.
[07:28] freeflying: if cloud-init errors somehow (and uses the EC2 address as a somewhat crazy fallback), you will see errors on the node console while it is commissioning… do you have access to it?
[07:29] rvba, you mean the kvm?
[07:30] The screen, yes.
[07:32] freeflying: well, the node's screen if it is a physical machine.
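Going back to the /etc/maas/pserv.yaml lead at the top of this exchange: the tftp "generator" entry is the URL pserv uses to fetch pxeconfig, so a generator with an empty host would explain the broken http:/// request seen earlier. A rough check, sketched in Python 3 and assuming the tftp/generator layout named above (PyYAML assumed to be available; everything else is illustrative):

    # Inspect the tftp generator URL in /etc/maas/pserv.yaml (layout assumed from the discussion).
    import yaml
    from urllib.parse import urlparse

    with open("/etc/maas/pserv.yaml") as f:
        config = yaml.safe_load(f) or {}

    generator = (config.get("tftp") or {}).get("generator")
    print("tftp generator:", generator)
    if not generator or not urlparse(generator).netloc:
        print("generator URL is empty or missing its host; pserv would then request http:///MAAS/...")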
[07:33] let me post a screenshot
[07:34] ("kvm" can be either the mouse/keyboard switch on a physical machine, or a commonly used type of virtual machine)
[07:35] in this case, it means mouse/keyboard :)
[07:37] Well, it's the V in KVM that I'm interested in :)
[07:41] people.canonical.com/~zhengpenghou/20130920_162508.jpg
[07:42] uploading
[07:42] Thanks
[07:42] Yup, that's cloud-init running out of DataSource candidates.
[07:43] Yep
[07:43] AFAICS cloud-init tries to download data from various sources, until it finds one that works.
[07:44] It's not finding any.
[07:44] any suggestion?
[07:45] We'll have to find the root cause first... Do you have a way of verifying that the nodes can reach the given IP address?
[07:45] When I encountered that problem (cloud-init was using the EC2 IP), it was because the node could not reach the MAAS IP address.
[07:46] So it's worth checking, as jtv said.
[07:46] It would be ideal if you could access the full URL on http... if you get a 404, it's likely to be a problem in the URL configuration.
[07:46] If it's a permissions error, then it should just have worked and we have a mystery.
[07:46] If it's a networking error, then there's our problem. :)
[07:46] jtv, no, the node never fails to be commissioned
[07:47] I thought it failed during commissioning..?
[07:47] jtv, funny thing is I do have 1 node commissioned :)
[07:47] jtv, yes
[07:48] I don't understand... if it fails during commissioning, then the node fails to be commissioned, right?
[07:48] before cloud-init gives the error info, I did see the commissioning node get an ip
[07:49] So DHCP is working... did it get its IP from the right server?
[07:49] jtv, I enlisted 3 machines, 1 succeeded, not the other 2
[07:49] jtv, yes
[07:50] Is there anything special about these two machines network-wise?
[07:50] rvba, this could be an issue, but not sure
[07:50] Are all the nodes visible in the web UI, and similarly (correctly) configured? If something went wrong there, they might fail to get to the metadata service.
[07:51] jtv, all of them are listed in the web UI
[07:52] If we're very lucky, the metadata server log will show the nodes' requests...
[07:52] So one of them is "ready" and two of them stuck "commissioning", correct?
[07:52] rvba, exactly
[07:53] rvba, and have proxy set up in both commissioning/enlist_data
[07:54] jtv, rvba, would you like to have a check?
[07:55] Always worth a check... if you have a way to simulate an http request to that URL from the same node, that might tell us something too.
[08:12] (I should say: by "that URL" I mean the metadata URL)
[08:13] I see requests to the metadata service from 10.209.13.1{0,1,2,3}.
[08:14] freeflying: which node is the successful one?
[08:15] jtv: the errors in maas/maas.log are concerning
[08:16] The "PermissionDenied: Not authenticated as a known node." errors.
[08:16] And first, a problem registering the node group..?
[08:17] The two problems might be linked…
[08:17] Yup.
[08:18] Looks like the NodeGroupWithInterfacesForm.
[08:18] jtv, working node called xggt6
[08:18] freeflying: we need its IP or its uuid.
[08:18] freeflying: do you know the last number of its IP address? We see 4 machines making requests to the metadata service.
[08:19] rvba, 10.209.13.10
[08:19] Thanks!
[08:19] freeflying: you said you had a config with 2 clusters right?
[08:19] s/had/have/
[08:19] rvba, 1 cluster 1 regional
[08:20] freeflying: when you go to the settings page, do you see one or two cluster controllers? (because there is one cluster alongside the region by default)
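Following jtv's suggestion to simulate an HTTP request to the metadata URL from one of the stuck nodes, here is a small Python 3 sketch that classifies the outcome along the lines reasoned about above: a 404 points at URL configuration, a 403 at authentication, and a connection error at networking. The URL below is a placeholder built from the preseed value mentioned earlier; substitute the exact metadata URL from the node's preseed:

    # Probe the MAAS metadata URL from a node and classify the failure mode.
    from urllib import request, error

    URL = "http://10.209.13.204/MAAS/metadata/"  # placeholder; take the real URL from the preseed

    try:
        resp = request.urlopen(URL, timeout=10)
        print("reachable, HTTP", resp.status)
    except error.HTTPError as e:
        if e.code == 404:
            print("404: likely a URL configuration problem")
        elif e.code == 403:
            print("403: reachable, but not authenticated as a known node")
        else:
            print("HTTP error:", e.code)
    except error.URLError as e:
        print("networking problem reaching MAAS:", e.reason)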
[08:21] rvba, only 1
[08:21] Is it called 'master'?
[08:21] rvba, we don't have maas-cluster-controller installed on the regional
[08:22] rvba, the one I can see from the web UI is cluster-xxxxx
[08:23] freeflying: okay, I see. Not having the maas-cluster-controller installed on the region is an untested setup.
[08:24] rvba, hehe
[08:31] freeflying: the MAAS region has IP 10.208.11.203 right?
[08:31] Or is that the cluster machine?
[08:32] that is the cluster
[08:32] regional is 204
[08:50] The 403 errors match the "Not authenticated as a known node" errors in the maas log.
[08:56] A kernel download failed at one point. But that should either break commissioning completely, or leave it unaffected.
[08:56] freeflying: could you please run this in 'sudo maas shell': http://paste.ubuntu.com/6131745/ ?
[08:56] jtv, when did that error happen
[08:57] freeflying: it will just print information about the nodes and the clusters, to help us debug the problem.
[08:58] freeflying: that failure happened at 11:13:54 +0900
[08:59] rvba, on it, give me secs
[09:02] rvba, 1 nodegroup there, and 2 nodes
[09:03] freeflying: only two nodes total? I thought you said you had 3?
[09:03] jtv, during that time, we were still trying to configure them, and after that, roaksoax reconfigured the cluster
[09:03] Yeah, I think the 404 is harmless here or we'd have seen different failures.
[09:03] What IP addresses did you get from rvba's script?
[09:04] A paste of the output would be ideal.
[09:04] rvba, sorry, forgot to say I deleted the others stuck in commissioning
[09:05] freeflying: okay, can you try re-enlisting then re-commissioning the problematic nodes, then run that script again?
[09:07] rvba, ok, give me a few mins :)
[09:08] thanks
[09:09] freeflying: sorry but I'm a bit confused, you said you had 3 nodes total, but I can see 4 different IP addresses requesting the preseed…
[09:10] rvba, we actually have 31 machines, guess someone else powered it on
[09:10] Okay.
[09:14] rvba, http://paste.ubuntu.com/6131822/
[09:17] freeflying: can you run [(n.ip_addresses(), n.system_id, n.nodegroup, n.architecture, n.status, n.hardware_details) for n in Node.objects.all()]
[09:17] freeflying: the output will be large :)
[09:22] rvba, any module I shall import?
[09:23] freeflying: just 'from maasserver.models import Node'
[09:23] (nothing new compared to the previous commands)
[09:24] Traceback (most recent call last):
[09:24] File "<console>", line 1, in <module>
[09:24] AttributeError: 'Node' object has no attribute 'ip_addresses'
[09:25] freeflying: ah, right, you're using the precise package.
[09:26] freeflying: could you run this: [(n.system_id, n.nodegroup, n.architecture, n.status, n.hardware_details) for n in Node.objects.all()]
[09:31] Should MaaS and Juju get installed on one of my servers or on a client system? | http://askubuntu.com/q/347866
[09:32] freeflying: I also have a bit of python I'd like to see the output to.
[09:32] from metadataserver.models import NodeKey
[09:32] for nk in NodeKey.objects.all():
[09:32] print(nk.node.nodegroup.uuid, nk.node.system_id, len(nk.key))
[09:32] <- that should tell us a bit (not the actual keys of course) about which oauth keys are being sent to which nodes.
[09:32] Because we're seeing those nodes fail to authenticate with those keys.
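The ad-hoc queries above, gathered into one snippet that can be pasted into 'sudo maas shell'. It sticks to the attributes actually used in the conversation (so no ip_addresses(), which the precise package lacks), and it drops hardware_details only to keep the output readable:

    # Paste into 'sudo maas shell': summarise the nodes and the metadata keys issued to them.
    from maasserver.models import Node
    from metadataserver.models import NodeKey

    for n in Node.objects.all():
        print(n.system_id, n.nodegroup.uuid, n.architecture, n.status)

    for nk in NodeKey.objects.all():
        # Key length only, so no actual OAuth key material is printed.
        print(nk.node.nodegroup.uuid, nk.node.system_id, len(nk.key))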
[09:34] rvba, http://paste.ubuntu.com/6131873/
[09:35] jtv, how can I redirect from the python console to stdout
[09:36] Try:
[09:36] python <<EOF
[09:36] # Script code goes here
[09:36] EOF
[09:36] Ahem. Not python, of course, but "sudo maas shell <<EOF". The <<EOF feeds everything up to a line saying just "EOF" to the shell. To redirect: sudo maas shell >/tmp/my-output <<EOF
freeflying: okay, so everything seems fine so far, you've got one allocated node and 3 commissioning nodes… did they get out of commissioning or are they stuck, same as before?
[09:43] rvba, same
[09:44] freeflying: can you try removing the working node and re-enlist, re-commission it?
[09:45] freeflying: if that works, then there is definitely a difference between this node and the others (I suspect related to the network config).
[09:45] jtv, no output
[09:47] rvba, error: gomaasapi: got error back from server: 409 CONFLICT (Node cannot be released in its current state ('Commissioning').
[09:47] No output!?
[09:47] freeflying: wait, the node in question should not be commissioning.
[09:48] jtv, no
[09:49] That would explain why the nodes fail to authenticate with the metadata service, but...
[09:53] jtv, any config I need to change to get it fixed?
[09:53] * freeflying gonna go, will be back late, need grab some food
[09:54] freeflying: I think somehow either the input or the output must have been lost. It's got to print at least one entry for the working node, and if all is normal, one for each of the other ones as well.
[09:55] freeflying: you can manually get rid of the juju environment and all the nodes using this: http://paste.ubuntu.com/6131943/
[09:56] freeflying: again, if you get one node commissioned, this means you need to investigate how the other nodes (the ones stuck commissioning) differ from this one.
[09:57] freeflying: can these nodes download stuff from the internet for instance?
=== allenap` is now known as allenap
=== freeflying is now known as freeflying_away
[10:21] freeflying_away: could you also show us how the cluster controller is configured? (the network config on the cluster page)
[10:22] At 14:02:22 there's a POST from the _successful_ node to a broken URL: /MAAS/api/1.0/nodes//MAAS/api/1.0/nodes/
[10:22] Oh, not broken apparently -- I'm told there's a workaround on the MAAS side for that.
[10:28] The other nodes are signaling to the metadata service... their individual Node pages in the UI may show useful output.
[10:31] Later attempts to do that hit 403. But the attempts around 14:02:27--14:02:45 got OK responses.
=== freeflying_away is now known as freeflying
[10:55] rvba, ok, so I'll delete all nodes, and re-enlist
[10:56] Yep, let's see if what you've seen before can be reproduced.
[10:56] besides this, anything else I shall try
[10:56] jtv: ^ ?
[10:56] * jtv can't think of anything
[10:56] rvba, I left the office, so might not be able to watch the screen
[10:57] freeflying: if you can still reach the MAAS machine, then that will be enough.
[10:58] Like I said, I want to be sure that the odd behavior you've seen (one node fine, two nodes stuck) can be reproduced.
[11:29] rvba, http://paste.ubuntu.com/6132269/
[11:30] rvba, we use maas to manage another network, which is a 10 gig; all traffic will go through this one later on
[11:39] freeflying: there is a problem right there, the network defined here is 10.208.11.203/24 and the nodes connect to the region using IPs like 10.209.13.10.
[11:39] freeflying: did you get one node commissioned, same as before?
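The mismatch rvba points out at the end here can be shown with plain IP arithmetic: the cluster-managed network under discussion is 10.208.11.0/24, while the metadata requests come from 10.209.13.x. A tiny Python 3 sketch, using the addresses quoted in the conversation:

    # Is each node's source IP inside the network the cluster is configured to manage?
    import ipaddress

    cluster_network = ipaddress.ip_network("10.208.11.0/24")
    node_ips = ["10.209.13.10", "10.209.13.11", "10.209.13.12", "10.209.13.13"]

    for ip in node_ips:
        inside = ipaddress.ip_address(ip) in cluster_network
        print(ip, "is", "inside" if inside else "OUTSIDE", cluster_network)

None of the node addresses fall inside 10.208.11.0/24, which is exactly why the region cannot map them to the cluster's nodegroup.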
[11:40] 10.209.13.0/24 we use for pxe boot, because of a hw limitation, the 10 gig can't do pxe boot
[11:43] We're hitting the same problem we talked about before, MAAS can only deal with one interface right now, used for pxe booting and later on, the IP attached to that interface is the one juju services deployed on the node will use.
[11:43] Now I really wonder how you got one node commissioned successfully :).
[11:44] rvba, commissioning should be fine, we tested it last week, the difference is it was using a single server for maas
[11:45] rvba, we have no problem enlisting/commissioning nodes
[11:46] freeflying: hang on, the problem we've been trying to figure out today is that the nodes never get out of the commissioning phase.
[11:47] rvba, yes, despite the magical one :)
[11:51] freeflying: the whole problem revolves around the network setup. The nodes need to connect to the region using an IP in the network defined on a cluster controller.
[11:51] freeflying: that's how MAAS figures out to which cluster a node belongs.
[11:52] freeflying: now, why did it work with one node, that's what needs to be investigated (did you change the cluster configuration half-way through?).
=== freeflying is now known as freeflying_away
=== freeflying_away is now known as freeflying
[12:23] rvba, no, the only thing I have done is set up the proxy in the preseed
[12:23] rvba, btw, the second network (the 10 gig's 10.208.11.0/24, managed by maas) has never been used so far, no item in that dhcp leases file
[12:25] freeflying: the config of the cluster is what defines the DHCP config. The only option I can see is that the config changed when MAAS was running.
[12:26] rvba, no idea
[12:27] freeflying: what's in the DHCP config? cluster machine, file /etc/dhcp/dhcpd.conf.
[12:30] rvba, http://paste.ubuntu.com/6132485/
[12:31] freeflying: it defines 10.208.11.0/24. How come the nodes have ips in 10.209.11.0/24?
[12:32] rvba, because we use an external dhcp to provide pxe boot, so during this stage, it only uses the pxe boot network, which is 10.209.11.0/24
[12:35] freeflying: so the problem is there, the cluster is configured to manage DNS and DHCP (cf http://paste.ubuntu.com/6132269/). So it believes the nodes will have IP addresses in the range defined here, i.e. [10.208.11.10 - 10.208.11.250].
[12:36] rvba, as long as the node boots up, it will have the ip :) (after deploying)
[12:37] freeflying: yes, but like I said, MAAS uses the IP the node uses to connect to the region to figure out to which nodegroup it should belong.
[12:38] rvba, during our testing last week, it worked :)
[12:38] rvba, that's the thing confusing me now
[12:38] freeflying: what changed then?
[12:39] rvba, split regional and cluster onto two servers
[12:39] * freeflying is really tired and needs some rest
[12:40] rvba, anyway, I can't continue on it today, thanks to you guys
[12:41] freeflying: I understand, it's pretty late for you. This is indeed a networking problem: with the region and the cluster on different machines, the nodes use, to connect to the region, an IP address which is not recognized as belonging to the nodegroup.
[12:41] nn freeflying
[12:43] freeflying: Would you be able to summarise what's happened today and email it to us in Red?
[12:44] allenap, individually? or do you have a list?
=== kentb is now known as kentb-afk
=== CyberJacob|Away is now known as CyberJacob
=== kentb-afk is now known as kentb
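To see which networks each cluster controller actually claims to manage, the cluster interfaces can be listed from 'sudo maas shell'. The model name echoes the NodeGroupWithInterfacesForm mentioned earlier, but the field names used here (ip, subnet_mask, interface, management) are an assumption about that era's schema, so treat this as a sketch to adapt rather than something guaranteed to run as-is:

    # Sketch: list cluster interfaces and the networks they manage (field names assumed).
    import ipaddress
    from maasserver.models import NodeGroupInterface

    for iface in NodeGroupInterface.objects.all():
        if not iface.ip or not iface.subnet_mask:
            continue  # skip unconfigured interfaces
        # subnet_mask is a dotted netmask here, so build the network with strict=False.
        network = ipaddress.ip_network("%s/%s" % (iface.ip, iface.subnet_mask), strict=False)
        print(iface.nodegroup.uuid, iface.interface, network, "management=%s" % iface.management)

A node whose source IP is not inside one of these networks is exactly the case rvba describes: the region cannot tell which nodegroup it belongs to.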
[19:09] Where are the MAAS power templates stored?
[19:15] While on the subject of power, maas seems to pre-populate the IPMI power parameters with a maas username and random password. Is that even usable? I've been going in and changing those to a known-working username/password without trying the one maas gives me.
[19:29] marlinc: depends on the version
[19:29] kentb: they work
[19:29] I found out already :)
[19:29] kentb: maas creates them intentionally
[19:29] kentb: and doesn't prepopulate
[19:29] kentb: it accesses the BMC and adds them
[19:29] marlinc: cool :)
[19:48] roaksoax: ok. thanks for clarifying
=== CyberJacob is now known as CyberJacob|Away
=== kentb is now known as kentb-out
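The log never answers where the power templates live, but the per-node IPMI settings MAAS created (the "maas" username and generated password discussed above) can at least be inspected from 'sudo maas shell'. power_type and power_parameters are my understanding of that era's Node attributes, so verify them before relying on this sketch:

    # Sketch: show each node's power type and which power parameters are set (values withheld).
    from maasserver.models import Node

    for n in Node.objects.all():
        params = n.power_parameters or {}
        # Print only the parameter names, not the values, to avoid exposing BMC credentials.
        print(n.system_id, n.power_type, sorted(params.keys()))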