[13:17] Morning
[13:20] mornin
[13:31] Man, surprise covers abound
[13:31] * cmaloney is listening to Deconbrio - Guilty
[13:31] cover of Gravity Kills - Guilty
[13:31] I like.
[13:31] rather it be a klute cover :p
[13:32] Not familiar
[13:38] leather strip side project
[13:38] very aggro
[13:44] Ah
[13:44] laeeaeaeaaeeeaeeeather strip?
[13:44] ;)
[17:36] weeee
[17:36] https://twitter.com/Wikimedia/status/563388375898411008
[17:36] "All Wikimedia sites are experiencing issues due to a network problem. We'll be back up shortly!"
[17:39] greg-g: Woo woo
[18:26] greg-g: Fix it!
[18:30] luckily, I don't have to
[18:30] poweroutage to an important switch
[18:54] Lovely
[18:55] yeah, I feel bad for the DC tech
[18:55] I'm sure his death will be swift.
[18:55] Though not honorable
[18:58] single switch?
[18:58] that is a shit spof design.
[18:58] bridging is your friend FFS! :p
[18:58] jrwren: Likely one switch with bad failover.
[18:59] yeah, I don't know all the details, but it sounds like it was a perfect shitty accidental storm
[19:00] yeah, there's a lot of instances where if one component fails then things are fine, but if one component disappears (or conversely doesn't disappear enough) then shit doesn't work
[19:01] "Hi, I know this is weird and all but I just powered up and I have no idea what a route is. Pleased to meet me"
[19:02] the best part is, it looks like our logging system kept us from coming back up in a timely manner, we had to disable logging for a $timespan
[19:02] "Hi other router. Apparently you're up, so here's all the traffic. Derp derp"
[19:02] :)
[19:02] greg-g: Those are the worst
[19:03] When your tracking is actively fucking you.
[19:03] yeah, which we also just beefed up a bit (and starting logging a lot more stuff by default)
[19:12] We've indeed had a total site outage for roughly 30 minutes. We're still
[19:12] collecting all data, but we've tracked down the cause to multiple cascading
[19:12] issues including loss of power to a critical SPOF network switch and HHVM
[19:12] MediaWiki application servers getting blocked due to multiple unoptimal
[19:12] timeout settings. We'll post a full incident report soon, and work to
[19:12] correct the underlying issues as soon as possible.
[21:44] hmmmm
[21:44] https://www.irccloud.com/pastebin/oz47PtCo
[21:49] Ah, it's docker.io, not docker
[21:57] https://wikitech.wikimedia.org/wiki/Incident_documentation/20150205-SiteOutage
[22:23] evening
[22:23] Hello from OCC
[22:24] party!
[22:25] W00t
[23:00] * DrDaemonEye waves at cmaloney
[23:14] Howdy
[23:17] how goes?
[23:18] Writing
[23:18] fun times
[23:18] Yeah