/srv/irclogs.ubuntu.com/2013/09/20/#maas.txt

=== freeflying_away is now known as freeflying
roaksoaxbigjools: maas-dhcp should be on the cluster right?01:43
bigjoolsroaksoax: yes01:43
bigjoolsyou *just* caught me, about to go to lunch01:43
roaksoaxbigjools: enjoy01:44
bigjools:)01:44
roaksoaxbigjools:03:05
roaksoaxFailure: twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 111: Connection refused.03:05
roaksoax2013-09-20 12:03:12+0900 [Uninitialized] Stopping factory <HTTPClientFactory: http:///MAAS/api/1.0/pxeconfig/?cluster_uuid=5c11d0fa-6b41-4cd6-b27803:05
bigjoolsroaksoax: I have never seen that03:12
roaksoaxbigjools: we are seeing this right now ;/03:13
roaksoaxhow does pserv get that address?03:13
bigjoolswhat address?03:13
roaksoaxbigjools: the address of the region03:13
bigjoolsDEFAULT_MAAS_URL03:14
bigjoolshas to be contactable by the nodes03:14
bigjoolsand clusters03:14
roaksoaxbigjools: it is03:14
roaksoaxbut pserv is trying to contact without address:"  http:///MAAS/api/1.0/pxeconfig/?cluster_uuid=5c11d0fa-6b41-4cd6-b27803:14
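The URL quoted from the log has an empty host part (`http:///MAAS/...`). A minimal sketch, assuming Python 3 and only the standard library, of how such a URL could be spotted before it is used. The helper name `has_host` is hypothetical and is not part of MAAS or pserv:

```python
from urllib.parse import urlsplit


def has_host(url):
    """Return True if the URL carries a non-empty host name."""
    return bool(urlsplit(url).hostname)


# URL as it appeared in the log, with the query string left off: no host.
print(has_host("http:///MAAS/api/1.0/pxeconfig/"))
# Hypothetical well-formed URL for comparison (192.0.2.10 is a documentation address).
print(has_host("http://192.0.2.10/MAAS/api/1.0/pxeconfig/"))
```

A check like this only shows that the configured value is malformed; it does not say which config file produced it.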
bigjoolsoh I see now!03:14
bigjoolsjtv1: ^03:14
bigjoolshalp03:14
roaksoaxbigjools: this is kind of critical btw03:15
bigjoolsroaksoax: what is in /etc/maas/maas_cluster.conf03:16
bigjoolsthe MAAS_URL in there should be the region's03:16
roaksoaxit is03:16
bigjoolsso that Connection refused error is in the pserv log?03:17
bigjoolsroaksoax: ^03:18
freeflyingroaksoax, seems the same03:18
bigjoolsroaksoax: can you sniff its tcp and see what it's trying to connect to03:20
bigjoolsit's just a bad config somewhere03:20
bigjoolsor perhaps a proxy getting in the way03:20
* jtv1 reads backscroll03:20
roaksoaxyes, pserv.log03:22
roaksoaxbigjools: this is multinode maas btw03:24
bigjoolsroaksoax: you mean multi cluster?03:24
roaksoaxok hold on03:25
roaksoaxit seems squid is somehow messing things up03:25
roaksoaxbut it shouldn't03:25
=== jtv1 is now known as jtv
jtvMight be worth checking whether the start-cluster-controller command line has the right server URL.03:28
bigjoolsthat's what I was juuuust about to say03:28
bigjoolsjtv: it comes from MAAS_URL in the maas_cluster.conf file right?03:29
jtvIf it doesn't, then either /etc/maas/maas_cluster.conf is messed up, or alternatively I'd say something changed in how upstart scripts work.03:29
bigjoolsI think that's what the packaging uses03:29
jtvYes.03:29
jtvThe upstart script sources that config file, then passes $MAAS_URL on the command line.03:29
jtvOh.03:30
* jtv checks something...03:30
bigjoolsroaksoax: /etc/init/maas-cluster-celery.conf starts it remember03:30
roaksoaxyeah looking at it03:31
bigjoolsbut you said that config was already ok03:31
bigjoolsso squid getting in the way?03:31
jtvHow could squid get in this particular way?03:31
roaksoaxbigjools: yeah, otherwise it would not have registered the cluster controller in the region03:31
bigjoolsroaksoax: right03:31
bigjoolsjtv: it could be down :)03:31
jtvDown, sure.  But mangling URLs..?03:32
bigjoolswho knows what its config is03:33
roaksoaxmaybe a dirty pyc file?03:34
bigjoolsunlikely03:35
jtvCould happen with permissions problems and an upgrade, I suppose.03:35
roaksoaxjtv: clean system03:36
bigjoolsroaksoax: have you traced the tcp?03:36
bigjoolsethereal or something03:36
bigjoolstcpdump03:36
bigjoolstake it one step at a time and divide the problem into possible areas03:37
roaksoaxbigjools: https://pastebin.canonical.com/97785/03:38
bigjoolsroaksoax: stops short ...03:40
roaksoaxbigjools: repeats itself03:40
bigjoolsroaksoax: is it connecting to the right place?03:40
roaksoaxbigjools: yeah03:41
roaksoaxagain, otherwise the cluster controller would not have registered itself03:41
bigjoolsroaksoax: ok if you trace on the region controller do you see traffic?03:41
roaksoaxbigjools: hold on, i rebooted the cluster03:41
bigjoolsjtv: roaksoax: so the config actually comes from /etc/maas/pserv.yaml04:10
bigjoolstftp:generator:04:10
bigjoolsroaksoax: so I suspect the bug is in the packaging and it doesn't set that file up properly sometimes04:11
bigjoolsmaybe a race condition with another package04:11
bigjoolsand the wrong one got installed first04:11
freeflyingwhere can I watch commissioning log?06:04
=== CyberJacob|Away is now known as CyberJacob
freeflyingbigjools, now I can  enlist node, but have problem to commission it, have apt_proxy in enlist_data and commission, any clue?06:41
freeflyingjtv, ^^06:43
jtvHi freeflying07:01
jtvI think a commissioning node will direct its syslog to either the region controller or the cluster controller...07:01
jtvLook for an "rsyslog" log.07:02
freeflyingjtv, no log there07:03
freeflyingjtv, it gives connect to 169.xxx fails, after it get ip address07:04
jtvThe node says that on its console?07:05
freeflyingyes07:05
jtvAny idea what the 169.xxx address is?07:05
freeflyingno, all address we're using within 10.x.x.x07:06
freeflyinglooks like that is the ephemeral image?07:07
jtvYes, commissioning runs an ephemeral image.07:08
jtvBut I don't think the node should be talking to the internet at that point.07:08
jtvDoesn't look as if it's the Ubuntu archive either, although I suppose it might be your local mirror.07:09
freeflyingwhy is it calling 169.254.169.254/2009-04-04/meta07:10
jtvSo that's where it's trying to find its metadata service.07:11
jtvStrange address...07:11
freeflyingjtv, where shall I configure it, or is it because it can't access the maas-regional controller07:12
jtvThat's a zeroconf address.  Any chance there's a wifi interface being mistaken for your server?07:12
freeflyingjtv, no07:12
freeflyingthere are 8 ports on the server, 1 of them is 10 gig, four of them are 1 gig07:13
jtvAt least, when I use "whois" on it, it says computers use 169.154.*.* (note 154, not 254!) when they don't have an IP address and don't get one from the network.07:14
freeflyingfrom webui, the metadata_url was set to maas regional's07:14
jtvAnd that's a 10.*.*.* address, right?07:14
jtvAnd no 169.*.*.* networks at all?07:15
rvba169.254.169.254 is used in Amazon EC2 and other cloud computing platforms to distribute metadata to cloud instances.07:15
jtv!07:15
rvbaSo cloud-init is acting up.07:15
jtvThinking this is EC2...07:16
rvbacloudinit/sources/DataSourceEc2.py:DEF_MD_URL = "http://169.254.169.254"07:17
freeflyingjtv, yes, in the preseed generated for commissioning is 10.209.13.204/MAAS07:17
rvba(this is in the cloud-init source code)07:17
jtvI'm looking at the cloud-init source now...07:19
rvbaI suspect the node can't reach the region and thus cloud-init falls back to using the EC2 metadata address.07:23
jtvAh, I was thinking it might not have received the right configuration...07:23
rvbaWell, I don't see how the EC2 IP could originate from MAAS.07:25
=== CyberJacob is now known as CyberJacob|Away
jtvNeither do I...  I was thinking that cloud-init might not have received the right configuration, and decided to try things the EC2 way.07:27
jtvI'm trying to figure out how cloud-init chooses the DataSource to use.07:27
rvbaThat's possible indeed.  But from what freeflying was saying, it seems the configuration is right.07:27
rvbafreeflying: if cloud-init errors somehow (and uses the EC2 address as a somewhat crazy fallback), you will see errors on the node console while it is commissioning… do you have access to it?07:28
freeflyingrvba, you mean the kvm?07:29
jtvThe screen, yes.07:30
rvbafreeflying: well, the node's screen if it is a physical machine.07:32
freeflyinglet me post a screenshot07:33
jtv("kvm" can be either the mouse/keyboard switch on a physical machine, or a commonly used type of virtual machine)07:34
freeflyingin this case, it mean mouse/keyboard :)07:35
rvbaWell, it's the V in KVM that I'm interested in :)07:37
freeflyingpeople.canonical.com/~zhengpenghou/20130920_162508.jpg07:41
freeflyinguploading07:42
jtvThanks07:42
jtvYup, that's cloud-init running out of DataSource candidates.07:42
rvbaYep07:43
jtvAFAICS cloud-init tries to download data from various sources, until it finds one that works.07:43
jtvIt's not finding any.07:44
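The "try each source until one works" behaviour jtv describes can be sketched roughly as below. This is an illustration only and not cloud-init's actual code; the function `first_working_source`, the source names, and the fetch callables are all hypothetical:

```python
def first_working_source(sources):
    """Try each (name, fetch) pair in order; return the first that yields data."""
    for name, fetch in sources:
        try:
            data = fetch()
        except OSError:
            # Treat network-style failures as "this source is unavailable".
            continue
        if data is not None:
            return name, data
    return None, None


def unreachable():
    """Stand-in for a metadata source that cannot be contacted."""
    raise OSError("connection refused")


# Hypothetical candidates: the first is unreachable, the second answers.
candidates = [
    ("maas", unreachable),
    ("ec2", lambda: {"instance-id": "example"}),
]
print(first_working_source(candidates))
```

When every candidate fails, the sketch returns `(None, None)`, which corresponds to the "running out of DataSource candidates" situation seen on the node console.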
freeflyingany suggestion?07:44
jtvWe'll have to find the root cause first...  Do you have a way of verifying that the nodes can reach the given IP address?07:45
rvbaWhen I encountered that problem (cloud-init was using the EC2 IP), it was because the node could not reach the MAAS IP address.07:45
rvbaSo it's worth checking, as jtv said.07:46
jtvIt would be ideal if you could access the full URL on http...  if you get a 404, it's likely to be a problem in the URL configuration.07:46
jtvIf it's a permissions error, then it should just have worked and we have a mystery.07:46
jtvIf it's a networking error, then there's our problem.  :)07:46
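The three outcomes jtv lists can be written down as a small decision helper. This is a sketch under assumptions: the function name `diagnose` is hypothetical, and treating HTTP 401 and 403 as "a permissions error" is an interpretation of what was said, not something MAAS defines:

```python
def diagnose(status=None, network_error=False):
    """Map the outcome of an HTTP request to the metadata URL onto the three cases."""
    if network_error:
        return "networking problem between node and server"
    if status == 404:
        return "likely a problem in the URL configuration"
    if status in (401, 403):
        return "URL is reachable; the permissions failure is the mystery"
    return "no conclusion from this check"


print(diagnose(status=404))
print(diagnose(status=403))
print(diagnose(network_error=True))
```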
freeflyingjtv, no, the node never fails to be commissioned07:46
jtvI thought it failed during commissioning..?07:47
freeflyingjtv, funny thing is I do have 1 node commissioned :)07:47
freeflyingjtv, yes07:47
jtvI don't understand... if it fails during commissioning, then the node fails to be commissioned, right?07:48
freeflyingbefore cloud-init gives the error info, I did see the commissioning node get ip07:48
jtvSo DHCP is working...  did it get its IP from the right server?07:49
freeflyingjtv, I enlist 3 machines, 1 succeeded, not the other 207:49
freeflyingjtv, yes07:49
rvbaIs there anything special about these two machines network-wise?07:50
freeflyingrvba, this could be a issue, but not sure07:50
jtvAre all the nodes visible in the web UI, and similarly (correctly) configured?  If something went wrong there, they might fail to get to the metadata service.07:50
freeflyingjtv, all of them are listed in webui07:51
jtvIf we're very lucky, the metadata server log will show the nodes' requests...07:52
rvbaSo one of them is "ready" and two of them stuck "commissioning", correct?07:52
freeflyingrvba, exactly07:52
freeflyingrvba, and have proxy set up in both commissioning/enlist_data07:53
freeflyingjtv, rvba would you like to have a check?07:54
jtvAlways worth a check...  if you have a way to simulate an http request to that URL from the same node, that might tell us something too.07:55
jtv(I should say: by "that URL" I mean the metadata URL)08:12
rvbaI see requests to the metadata service from 10.209.13.1{0,1,2,3}.08:13
jtvfreeflying: which node is the successful one?08:14
rvbajtv: the errors in maas/maas.log are concerning08:15
rvbaThe "PermissionDenied: Not authenticated as a known node." errors.08:16
jtvAnd first, a problem registering the node group..?08:16
rvbaThe two problems might be linked…08:17
jtvYup.08:17
jtvLooks like the NodeGroupWithInterfacesForm.08:18
freeflyingjtv, working node called xggt608:18
rvbafreeflying: we need its IP or its uuid.08:18
jtvfreeflying: do you know the last number of its IP address?  We see 4 machines making requests to the metadata service.08:18
freeflyingrvba, 10.209.13.1008:19
jtvThanks!08:19
rvbafreeflying: you said you have a config with 2 clusters right?08:19
freeflyingrvba, 1 cluster 1 regional08:19
rvbafreeflying: when you go to the settings page, do you see one or two cluster controllers?  (because there is one cluster alongside the region by default)08:20
freeflyingrvba, only 108:21
rvbaIs it called 'master'?08:21
freeflyingrvba, we don't have maas-cluster-controller installed on regional08:21
freeflyingrvba, the one I can from webui is cluster-xxxxx08:22
rvbafreeflying: okay, I see.  Not having the maas-cluster-controller installed on the region is an untested setup.08:23
freeflyingrvba, hehe08:24
rvbafreeflying: the MAAS region has IP 10.208.11.203 right?08:31
rvbaOr is that the cluster machine?08:31
freeflyingthat is cluster08:32
freeflyingregional is 20408:32
jtvThe 403 errors match the "Not authenticated as a known node" errors in the maas log.08:50
jtvA kernel download failed at one point.  But that should either break commissioning completely, or leave it unaffected.08:56
rvbafreeflying: could you please run this in 'sudo maas shell': http://paste.ubuntu.com/6131745/ ?08:56
freeflyingjtv, when did that error happen08:56
rvbafreeflying: it will just print information about the nodes and the clusters, to help us debug the problem.08:57
jtvfreeflying: that failure happened at 11:13:54 +090008:58
freeflyingrvba, on it, give me secs08:59
freeflyingrvba, 1 nodegroup there, and 2 node09:02
rvbafreeflying: only two nodes total?  I thought you said you had 3?09:03
freeflyingjtv, during that time, we're still trying to configure them, and after that, roaksoax has reconfigure cluster09:03
jtvYeah, I think the 404 is harmless here or we'd have seen different failures.09:03
jtvWhat IP addresses did you get from rvba's script?09:03
jtvA paste of the output would be ideal.09:04
freeflyingrvba, sorry, forgot to say I deleted the others stuck in commissioning09:04
rvbafreeflying: okay, can you try re-enlisting then re-commissioning the problematic nodes, then run that script again?09:05
freeflyingrvba, ok, give me mins :)09:07
freeflyingthanks09:08
rvbafreeflying: sorry but I'm a bit confused, you said you had 3 nodes total, but I can see 4 different IP addresses requesting the preseed…09:09
freeflyingrvba, we actually have 31 machines, guess some one else powered on it09:10
rvbaOkay.09:10
freeflyingrvba, http://paste.ubuntu.com/6131822/09:14
rvbafreeflying: can you run [(n.ip_addresses(), n.system_id, n.nodegroup, n.architecture, n.status, n.hardware_details) for n in Node.objects.all()]09:17
rvbafreeflying: the output will be large :)09:17
freeflyingrvba, any module I shall import?09:22
rvbafreeflying: just 'from maasserver.models import Node'09:23
rvba(nothing new compared to the previous commands)09:23
freeflyingTraceback (most recent call last):09:24
freeflying  File "<console>", line 1, in <module>09:24
freeflyingAttributeError: 'Node' object has no attribute 'ip_addresses'09:24
rvbafreeflying: ah, right, you're using the precise package.09:25
rvbafreeflying: could you run this: [(n.system_id, n.nodegroup, n.architecture, n.status, n.hardware_details) for n in Node.objects.all()]09:26
AskUbuntuShould MaaS and Juju get installed on one of my servers or on a client system? | http://askubuntu.com/q/34786609:31
jtvfreeflying: I also have a bit of python I'd like to see the output to.09:32
jtvfrom metadataserver.models import NodeKey09:32
jtvfor nk in NodeKey.objects.all():09:32
jtv    print(nk.node.nodegroup.uuid, nk.node.system_id, len(nk.key))09:32
jtv <- that should tell us a bit (not the actual keys of course) about which oauth keys are being sent to which nodes.09:32
jtvBecause we're seeing those nodes fail to authenticate with those keys.09:32
freeflyingrvba, http://paste.ubuntu.com/6131873/09:34
freeflyingjtv, how can I redirect from python console to stdout09:35
jtvTry:09:36
jtvpython <<EOF09:36
jtv# Script code goes here09:36
jtvEOF09:36
jtvAhem.  Not python, of course, but "sudo maas shell <<EOF"09:36
jtvThe <<EOF tells it to feed what comes next (until the EOF line) into stdin.09:37
jtvTo redirect: sudo maas shell >/tmp/my-output <<EOF09:37
rvbafreeflying: okay, so everything seems fine so far, you've got one allocated node and 3 commissioning nodes… did they get out of commissioning or are they stuck, same as before?09:41
freeflyingrvba, same09:43
rvbafreeflying: can you try removing the working node and re-enlist, re-commission it?09:44
rvbafreeflying: if that works, then there is definitely a difference between this node and the others (I suspect related to the network config).09:45
freeflyingjtv, no output09:45
freeflyingrvba, error: gomaasapi: got error back from server: 409 CONFLICT (Node cannot be released in its current state ('Commissioning').09:47
jtvNo output!?09:47
rvbafreeflying: wait, the node in question should not be commissioning.09:47
freeflyingjtv, no09:48
jtvThat would explain why the nodes fail to authenticate with the metadata service, but...09:49
freeflyingjtv, any configure I need change to get it fix?09:53
* freeflying gonna go, will be back late, need grab some food 09:53
jtvfreeflying: I think somehow either the input or the output must have been lost.  It's got to print at least one entry for the working node, and if all is normal, one for each of the other ones as well.09:54
rvbafreeflying: you can manually get rid of the juju environment and all the nodes using this: http://paste.ubuntu.com/6131943/09:55
rvbafreeflying: again, if you get one node commissioned, this means you need to investigate how the other nodes (the one stuck commissioning) differ from this one.09:56
rvbafreeflying: can these nodes download stuff from the internet for instance?09:57
=== allenap` is now known as allenap
=== freeflying is now known as freeflying_away
rvbafreeflying_away: could you also show us how the cluster controller is configured? (the network config on the cluster page)10:21
jtvAt 14:02:22 there's a POST from the _successful_ node to a broken URL: /MAAS/api/1.0/nodes//MAAS/api/1.0/nodes/10:22
jtvOh, not broken apparently -- I'm told there's a workaround on the MAAS side for that.10:22
jtvThe other nodes are signaling to the metadata service...  their individual Node pages in the UI may show useful output.10:28
jtvLater attempts to do that hit 403.  But the attempts around 14:02:27--14:02:45 got OK responses.10:31
=== freeflying_away is now known as freeflying
freeflyingrvba, ok, so I'll delete all nodes, and re-enlist10:55
rvbaYep, let's see if what you've seen before can be reproduced.10:56
freeflyingbesides this, anything else I shall try10:56
rvbajtv: ^ ?10:56
* jtv can't think of anything10:56
freeflyingrvba, I left there office, so might not be able to watch the screen10:56
rvbafreeflying: if you can still reach the MAAS machine, then that will be enough.10:57
rvbaLike I said, I want to be sure that the odd behavior you've seen (one node fine, two nodes stuck) can be reproduced.10:58
freeflyingrvba, http://paste.ubuntu.com/6132269/11:29
freeflyingrvba, we use maas to manage another network, which is 10 gig, and will have all later traffic go through this one11:30
rvbafreeflying: there is a problem right there, the network defined here is 10.208.11.203/24 and the nodes connect to the region using IPs like 10.209.13.10.11:39
rvbafreeflying: did you get one node commissioned, same as before?11:39
freeflying10.209.13.0/24 we use for pxe boot, because of hw limitation, 10 gig can't do pxe boot11:40
rvbaWe're hitting the same problem we talked about before, MAAS can only deal with one interface right now, used for pxe booting and later on, the IP attached to that interface is the one juju services deployed on the node will use.11:43
rvbaNow I really wonder how you got one node commissioned successfully :).11:43
freeflyingrvba, for commissioning should be fine, we have tested it last week, difference is it was using single server for maas11:44
freeflyingrvba, we have no problem to enlist/commission node11:45
rvbafreeflying: hang on, the problem we've been trying to figure out today is that the nodes never get out of the commissioning phase.11:46
freeflyingrvba, yes,  despite the magical one :)11:47
rvbafreeflying: the whole problem revolves around the network setup.  The nodes need to connect to the region using an IP in the network defined on a cluster controller.11:51
rvbafreeflying: that's how MAAS figures out to which cluster a node belongs.11:51
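The mismatch rvba points out can be checked with a few lines of standard-library Python. This is a sketch, assuming that a node belongs to a cluster when its address falls inside the cluster's configured network; MAAS's actual lookup code is not shown in this log, and the helper `find_cluster` is hypothetical:

```python
import ipaddress


def find_cluster(node_ip, cluster_networks):
    """Return the name of the first cluster whose network contains node_ip, else None."""
    address = ipaddress.ip_address(node_ip)
    for name, cidr in cluster_networks.items():
        # strict=False accepts a host address with a prefix, e.g. 10.208.11.203/24.
        if address in ipaddress.ip_network(cidr, strict=False):
            return name
    return None


# Network as defined on the cluster page, per the discussion above.
clusters = {"cluster-xxxxx": "10.208.11.203/24"}
print(find_cluster("10.209.13.10", clusters))  # address from the PXE network
print(find_cluster("10.208.11.50", clusters))  # hypothetical address inside the managed range
```

Under that assumption, a node arriving from 10.209.13.10 matches no cluster, while an address inside 10.208.11.0/24 does.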
rvbafreeflying: now, why did it work with one node, that's what needs to be investigated (did you change the cluster configuration half-way through?).11:52
=== freeflying is now known as freeflying_away
=== freeflying_away is now known as freeflying
freeflyingrvba, no, the only things has done is set up proxy in preseed12:23
freeflyingrvba, btw, the second network(10 gig's 10.208.11.0/24, managed by maas) never been used so far, no item in that dhcp leases file12:23
rvbafreeflying: the config of the cluster is what defines the DHCP config.  The only option I can see is that the config changed while MAAS was running.12:25
freeflyingrvba, no idea12:26
rvbafreeflying: what's in the DHCP config?  cluster machine, file /etc/dhcp/dhcpd.conf.12:27
freeflyingrvba, http://paste.ubuntu.com/6132485/12:30
rvbafreeflying: it defines 10.208.11.0/24.  How come the nodes have ips in 10.209.11.0/24?12:31
freeflyingrvba, because we use an external dhcp to provide pxe boot, so during this stage, its only use pxe boot network, which is 10.209.11.0/2412:32
rvbafreeflying: so the problem is there, the cluster is configured to manage DNS and DHCP (cf http://paste.ubuntu.com/6132269/).  So it believes the nodes will have IP addresses in the range defined here ie. [10.208.11.10 - 10.208.11.250].12:35
freeflyingrvba, as long as the node bootup, it will have the ip :) (after deploying)12:36
rvbafreeflying: yes, but like I said, MAAS uses the IP the node uses to connect to the region to figure out to which nodegroup it should belong.12:37
freeflyingrvba, during our testing last week, it worked :)12:38
freeflyingrvba, thats the thing confusing me now12:38
rvbafreeflying: what changed then?12:38
freeflyingrvba, split regional and cluster onto two servers12:39
* freeflying really tired, need some relax and rest12:39
freeflyingrvba, anyway, I can't continue on it today, thanks for you guys12:40
rvbafreeflying: I understand, it's pretty late for you.  This is indeed a networking problem: with the region and the cluster on different machines, the nodes connect to the region from an IP address which is not recognized as belonging to the nodegroup.12:41
rvbann freeflying12:41
allenapfreeflying: Would you be able to summarise what's happened today and email it with us in Red?12:43
freeflyingallenap, individually? or do you have a list?12:44
=== kentb is now known as kentb-afk
=== CyberJacob|Away is now known as CyberJacob
=== kentb-afk is now known as kentb
marlincWhere are the MAAS power templates stored?19:09
kentbWhile on the subject of power, maas seems to pre-populate the IPMI power parameters with a maas username and random password.  Is that even usable?  I've been going in and changing those to a known-working username/password without trying the one maas gives me.19:15
roaksoaxmarlinc: depends on the version19:29
roaksoaxkentb: they work19:29
marlincI found out already :)19:29
roaksoaxkentb: maas creates them intentionally19:29
roaksoaxkentb: and doesn't prepopulate19:29
roaksoaxkentb: it accesses the BMC and adds them19:29
roaksoaxmarlinc: cool :)19:29
kentbroaksoax: ok. thanks for clarifying19:48
=== CyberJacob is now known as CyberJacob|Away
=== kentb is now known as kentb-out
