Server Problems



Andy Garton
03-05-2012, 08:55
We had a power cut in the London office yesterday and unfortunately when the servers came back up we discovered our version control database was corrupted. We've lost nothing, but it's taking a while to pull everything together again from the backup and local copies. This will have a knock-on effect on build deliveries this week: 206 hasn't been built at all yet as a result, we hope to do it later today but it's possible it won't happen at all. 207 will be built tomorrow morning hopefully so should still be published tomorrow afternoon, but it may be a bit late (or a lot late if we hit issues with it, not unlikely when so many files are being recovered and updated).

sic_kapkan
03-05-2012, 10:06
No problem. Several times you gave us builds early. Good boys can wait :)

Mickael Bertani
03-05-2012, 11:06
Yes sometimes in advance, sometimes late. Just take the time to do things properly. :)

Marc-Andre Denis
03-05-2012, 15:59
*insert Vader saying NOOOOOOOOOOOOOOOOOOOOOO* Alright then, I can wait, I'm patient because I know the new FFB adjustments should be pretty interesting to test :)

Simon Ashbourne
03-05-2012, 17:44
Hey my work PC crashes at least once a day, you guys have done well, keep up the good work and do what you have to do. Shit happens!

morfmedia
03-05-2012, 18:10
Let us know if you want us to spec out a UPS. Especially with the Olympics coming up, I'd plan for a few more power outages!

Andy Garton
03-05-2012, 18:12
We have a UPS ... it's a long story.

morfmedia
03-05-2012, 18:22
Hehe, it's never simple in IT! No worries, glad the recovery is going OK (hopefully).

Peter Ball
03-05-2012, 18:32
We have a UPS ... it's a long story.

Does it star an assistant producer/QA manager? :)

Tom Curtis
03-05-2012, 18:34
Does it star an assistant producer/QA manager? :)
QA manager??? I wish :D

Marc-Andre Denis
03-05-2012, 19:00
We have a UPS ... it's a long story.

Involving a hamster, a small wheel and not enough food for the week?

Steve Loader
03-05-2012, 19:11
OK, I guess as long as the membership data is triple backed up & in the cloud I can live with it. :distress:

Vittorio Rapa
03-05-2012, 20:21
Some inside story:

The (semi) long story is that someone in the London Underground (oh, they call it "The Tube") decided to cut a cable or whatever (they are working for the Olympics, I believe)... so the power went down for a whole city zone, and for a "long" time (surely too long for our UPSes to survive). The server we're talking about is hosted in-house, we have several other servers all around the globe (and universe), so if something goes wrong somewhere we have backups elsewhere. But bad news usually doesn't come alone, so we found the data on that server to be corrupted and we had to restore it. Getting the data synced takes some time, so we had to work a day and a half (and a night) to get it back online and meet our deadlines. Did we have any escape or a way to avoid it? Thinking about it after it happened and knowing the cause, I can say "yes", maybe not totally, but we would have reduced the downtime. But then it would have been too easy (you never stop learning).

Unluckily we cannot post the workflow here (about the people involved in solving the problem as fast as possible); you would have seen the dedication and passion that went into this job. That's why I love working for this company.

(btw everything is back to normal... almost)

KimKom
03-05-2012, 20:37
Ouch! Sounds nasty. Always a worrying time when this sort of thing happens.

We use a secure online backup service, which runs every evening but it would still be a pain to restore everything.

morfmedia
03-05-2012, 21:02
Not wanting to teach you guys stuff you probably already know, but most UPSes can do a graceful shutdown of the host for you depending on how much battery life is left in the UPS. We've been having fun recently as Crossrail has required all our DWDM private fibre optic cables to be moved. Also worth disabling write cache on the RAID controller/HDD if you don't trust the UPS. Not trying to patronise anyone here, just help out the project.

Vittorio Rapa
03-05-2012, 21:29
We don't have just a single server, and they run both Windows and Linux. We have now implemented a host system that monitors the UPS and takes care of shutting down (gracefully) all the networked clients. The RAID controllers also have their own battery. It was really a series of coincidences, the kind that seem so easy to prevent when analyzed afterwards: we had never really suffered any corrupted data before, at least not on such a large portion of data... but it happened. Luckily we have some good staff and everything came back fast enough.

PS: it would be boring not to have some thrills from time to time... no?
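
For anyone curious, the idea of a host that watches the UPS and decides when the networked clients should start shutting down can be sketched roughly like this. It is only an illustration under assumptions: `UpsStatus`, `Client`, the timings and the safety margin are all made-up names and numbers, not the actual setup described in this thread, and a real monitor would read the status from the UPS itself.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Hypothetical snapshot of what a UPS reports."""
    on_battery: bool
    runtime_left_s: float  # estimated seconds of battery remaining


@dataclass
class Client:
    """Hypothetical networked machine that needs a clean shutdown."""
    name: str
    shutdown_time_s: float  # assumed time a graceful shutdown takes


def clients_to_shut_down(status, clients, safety_margin_s=120.0):
    """Return the names of clients that should start shutting down now.

    A client is selected once the remaining battery runtime is no longer
    comfortably above the time it needs to shut down cleanly.
    """
    if not status.on_battery:
        # Mains power is present, so nothing needs to be stopped.
        return []
    return [
        c.name
        for c in clients
        if status.runtime_left_s <= c.shutdown_time_s + safety_margin_s
    ]


if __name__ == "__main__":
    clients = [Client("vcs-server", 240.0), Client("build-agent", 60.0)]
    status = UpsStatus(on_battery=True, runtime_left_s=300.0)
    print(clients_to_shut_down(status, clients))
```

The point of the margin is that slow machines (a database or version control server, say) get told to stop earlier than quick ones, so none of them is still writing to disk when the battery runs out.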

morfmedia
03-05-2012, 21:36
Hehe, been there, seen plenty of "things that should never happen" IT scenarios before. If it's APC they usually have a management module + Windows/Unix utils to do this for you, but you clearly have it under control anyway. Back to the normal routine now :)

ermo
03-05-2012, 22:07
Yeah, I know that sinking feeling when you realize that something has gone wrong and you've yet to understand the full scope of the incident. You know it's really bad when the Sys Admin suddenly mutters a quiet 'Uh Oh ...' and goes dead quiet for a minute or two and then starts tapping furiously on the keyboard after a hasty visit to the server room.

Good to hear that it worked out in the end, and that the various services could be easily -- if not quickly -- restored to normal operations. :a31:

Kostman22
03-05-2012, 22:37
i'm sure everyone understands... the way i see it there was no need for you guys to give us an explanation about this, the next build will be released when it's ready we can wait. as long as EA has nothing to do with this game i'm sure it will be done right... get her done boys (and girls?)

Remco Van Dijk
04-05-2012, 08:31
Maybe EA had something to do with the cut cable... :eek:

rocker_lx
04-05-2012, 09:45
I work in IT, and believe me: if you install a system with failover, fallback, backup, etc., the universe will find a way to get around those and take out the single point of failure you overlooked or considered improbable ;)

My title is "Head of Murphy's Law Application Labs"

Vic Kirby
04-05-2012, 11:47
Maybe EA had something to do with the cut cable... :eek:

Wow... you would give EA that much credit? If EA did it, the Tube would have been flooded. :D

Thanks for the information.

Fernando Horta
04-05-2012, 12:16
(...) This will have a knock-on effect on build deliveries this week: 206 hasn't been built at all yet as a result, we hope to do it later today but it's possible it won't happen at all. 207 will be built tomorrow morning hopefully so should still be published tomorrow afternoon, but it may be a bit late(...)
http://3.bp.blogspot.com/-5AWyewVUmzo/T3EgpxGvaDI/AAAAAAAAAiE/ErE98vXuF-w/s1600/boy+waiting.gif

jk :) As kapkan said, we got some builds ahead of time, so we can wait a bit more if needed, especially with Zondas coming.

Scott Coffey
04-05-2012, 13:42
The server we're talking about is hosted in-house, we have several other servers all around the globe (and universe)...

Is that where we picked up the wookie?

Vittorio Rapa
04-05-2012, 13:46
akrcoorhrarhanro rowoc...

Graham Hawkins
04-05-2012, 13:50
akrcoorhrarhanro rowoc...

Bless you !

Scott Coffey
04-05-2012, 14:01
akrcoorhrarhanro rowoc...

acoooh oarawh rooohu caorawhwa aoacwo cscwoanan?

cluck
04-05-2012, 15:07
acoooh oarawh rooohu caorawhwa aoacwo cscwoanan?

Oooh, I can answer that :)

huhui hash klioasdjkifj popqpwopoqowpqowp ol :).

Remco Van Dijk
04-05-2012, 16:04
kawawaa oogiphup burkabi

Sorry for my Wookish, I'm Dutch!

Juan Pablo Rguez
04-05-2012, 16:10
jakshdj k jkashdjkhasld jkakjajhdkjljhiewiu

Meaning: Is there any new problem on the server? I can't seem to download the torrent or the normal download.

EDIT: Just seen it's already discussed (http://forum.wmdportal.com/showthread.php?5842-Build-207-Discussion-(Junior-Member-)&p=144004&viewfull=1#post144004), sorry :)

Scott Ibbetson
04-05-2012, 16:16
Non-Wookie speaker here, but... Servers are hammered right now. So anyone want to go pick up the hammer so we can get it working again and/or drag the servers out of the pub? I know a new build is cause for celebration but... :P