The end is in sight

The ‘to do’ list has no significant items left.

There is no remaining tech debt.

In fact, undocumented here, a significant amount of quality has been added to the datacentre project.

I now have real-time replication between hosts.

Every VM update, every physical host change, is now replicated to a redundancy environment.

I can spin up a new VM in about 25 seconds (this includes a full LAMP stack, webmail, mobile device mail and FTP access).

For extra redundancy, in addition to RAID and real-time replication, every physical host is also backed up to another host.

This currently sits in a separate location, but within the same building.

I plan to relocate this feature to another location as soon as I can find one.

So all in all, this project has come a very long way.

And oh what a lot has been learned!

Cursing recursive permissions recursively

During a phase of system testing on server c1, in the new datacentre, an interesting problem was discovered.

Using WordPress as the template (so the results could be applied to Drupal installations, and to any other PHP-based content management system), we discovered that permissions on high-level directories were not being propagated down to low-level directories.

This meant a loss of function (where that function relied on scripts that are installed in those low-level directories by the vanilla application).

In WordPress and Drupal, for example, uploading media of any type wouldn’t work.

This is a significant barrier for a content management system.

The first workaround seemed to solve the problem, except that the uploaded files were owned by Apache (the webserver in the LAMP stack).

Unfortunately this took us to another permissions-based problem, which stopped the owner (the system user) from modifying those files – even through FTP.

If you think about it for a minute, it’s an interesting problem – where a vanilla software installation granted the higher-level directories one set of permissions, while the lower-level directories were granted different (and functionality-limiting) permissions.
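One way to see this kind of mismatch is to walk the web root and list every directory whose owner or permission bits differ from its parent's. The sketch below is illustrative only: the function name `find_permission_mismatches` is my own, and it is not the check that was actually run on server c1.

```python
import os
import stat


def find_permission_mismatches(root):
    """Return (path, parent_mode, child_mode) for each sub-directory whose
    permission bits or owner differ from those of its parent directory."""
    mismatches = []
    for parent, dirnames, _filenames in os.walk(root):
        parent_stat = os.stat(parent)
        parent_mode = stat.S_IMODE(parent_stat.st_mode)
        for name in dirnames:
            child = os.path.join(parent, name)
            child_stat = os.lstat(child)
            # Skip symlinks: their own mode says nothing about the target.
            if stat.S_ISLNK(child_stat.st_mode):
                continue
            child_mode = stat.S_IMODE(child_stat.st_mode)
            if (child_mode != parent_mode
                    or child_stat.st_uid != parent_stat.st_uid):
                mismatches.append((child, parent_mode, child_mode))
    return mismatches
```

Calling it on a WordPress install (for example `find_permission_mismatches("/var/www/html")`, a path assumed here purely for illustration) and printing each mode with `oct()` would show where a directory such as the uploads folder has drifted from the directories above it.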

Anyway.

The first attempt to fix the problem was to deploy suEXEC on the server (VM). Unfortunately suEXEC didn’t get us all the way out of the problem, so we needed to look for another solution.

The second attempt to make the problem go away was to use FastCGI.

Yesterday afternoon Manuel, our brilliant technical resource, deployed FastCGI on the VM we have been using as a test.

I used the standard WordPress admin control panel to upload an image into a test post and successfully published that.

Then I created another test post, uploaded the image into that and successfully published it.

Then I went back to both published posts and, using the standard WordPress screen, I modified the second image and republished the post.

All of these tests worked.

Manuel’s next job is to add the deployment of FastCGI to the VM creation template.

This will enable the datacentre to deploy a fully-functional LAMP-stack VM for a customer within a matter of seconds.

Well done Manuel!

reviewing datacentre servers and racks

Rack 1:

  • UPS #1
  • primary router
  • primary network switch
  • power distribution unit

Rack 2:

  • server c1
  • server c2
  • server c3

In its physical state server c1 is a RAID 6 platform that runs CentOS 6.6, DenyHosts and the KVM hypervisor, with iLO2 for out-of-band management.

In its virtual state server c1 hosts the full LAMP stack, an SFTP server and email servers, and runs some security functions.

Server c2 is an extended storage platform to complement/support server c1.

Server c3 is a real-time replica of the physical and virtual entities that are server c1.

I plan on moving server c3 out of the Nottingham datacentre and in to a secondary location. This would give me failover resilience, in the event of something cataclysmic happening to the datacentre.

Rack 3:
When it arrives, I plan on populating rack 3 with a secondary router, a secondary network switch, a secondary power distribution unit and UPS #2. And maybe some Blades.

spitting venom

I did a controlled shutdown and restart of server c1 in the datacentre today.

This shutdown and restart of the physical server meant that the primary LAMP server – and all of the hosted VMs – were also shut down and restarted in a controlled state.

The reason for this event was to apply the Venom fix that has been released (and that my servers downloaded during the week), which closes the VENOM vulnerability (CVE-2015-3456) in the virtual floppy disk controller of QEMU, the emulator that KVM relies on.

Nothing else to report.

I love how reliable and robust the infrastructure is proving to be.

datacentre

This is what DC1 looked like the first time I saw it:

[Image: Datacentre 1 first look]

[Image: Datacentre 1 first look 2]

A few weeks later, this is rack 1, server c1, after I had upped the RAM, installed all the disks, configured the disks to RAID 6 and installed the CentOS operating system. Here I’m part-way through installing the KVM hypervisor. You can see the top of server c2 below:

[Image: Rack 1 Server c1]

And this is a screenshot of me starting to configure eth0:

[Image: eth0]

And here’s a screenshot of me configuring iptables:

[Image: iptables]

For reasons of security, I’m not posting any other photos.

icing the climactic win

Today I learned an inelegant but very effective way to access iLO2 remotely.

From the safety and comfort of my own bed I was able to log in to the iLO2 function, access the full remote physical- and virtual-console for the server, and carry out the usual range of console-related management/admin functions.

Brilliant.

climactic win!

Meanwhile, back in the datacentre…

One of the early problems I encountered with the first DL380 out of the box (server c1, obv), was an inability to configure iLO2.

I’d racked the server up, plugged the three NICs in to the router (eth0, eth1, iLO2), installed the base CentOS operating system, and configured eth0 for static IP.

Then I spent ages trying to get to the bottom of why iLO2 wouldn’t work.

And I mean four to five weeks, off and on.

A couple of nights ago (while watching a YouTube video of someone configuring their iLO2 with annoying ease), I read a revelation.

iLO2 won’t work outside the host network (LAN).

So even if I got iLO2 working, I couldn’t use it remotely (in the pure sense of the word – remotely), unless I built and stretched a VLAN to wherever in the world I was working remotely from.

This is a massive pain in the backside.

Obv.

Armed with this new information I dismissed not having iLO2 as an inconvenience – a non-operational inconvenience – as not having iLO2 wasn’t actually stopping anything from proceeding.

So I set this slight disappointment aside and carried on rolling out the datacentre.

Last weekend I fitted the remaining disks to the second server in the rollout, server c2.

I configured the disks to RAID 6, installed CentOS, virtualised the physical environment, configured a static IP on the network on eth0, and brought the server online.

From server c2 I successfully pinged the internal IP address for server c1, to confirm everything was working on the LAN, and then pinged server c2 back from server c1. I left c2 unconfigured for WAN access for now, and left the datacentre feeling a bit pleased.

Over the last couple of weeks I’ve been reading and watching a lot of tutorials.

One or two have gone in to detail about how both eth0 and eth1 should be configured.

I had only configured eth0 on both of the servers I have brought up so far.

Having eth0 and eth1 configured is more to do with redundancy and resilience than with enhancing speed (because it won’t).

So that’s a new task on my list of things to do.

I’ve also spent a lot of time reading up on iLO2 problems, because I don’t like an unanswered query.

I kept these things in my mental pot, and mulled them over when I had lots of time (commuting!).

Last Friday afternoon I got a notification that one of the sshd components on server c1 – the public-facing physical host – had stopped working.

c1 was still ‘there’ and pinging away to my queries, but sshd wasn’t behaving exactly as it should.

Unfortunately I couldn’t access server c1 remotely, because whatever it was that was adversely affecting sshd, was blocking my attempt to remote on to the server.
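The gap between "the host answers ping" and "sshd accepts connections" can be checked from outside with a plain TCP connect. This is a minimal sketch using only the Python standard library; the function name `port_is_open` is hypothetical and it is not the monitoring that sent me the notification.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when nothing is accepting connections on that port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run against the server's address with port 22, a `False` result while ping still answers points at the service (or something in front of it) rather than at the host being down. It only tests that the port accepts a connection, not that sshd is healthy behind it.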

Saturday morning I rocked in to the datacentre and accessed server c1 on console.

I queried sshd (service sshd status) and got a ‘not found’ response.

Hmm, sshd stopped working and shut itself down?

What could cause that?

I checked eth0 status (ifconfig -a) and got the responses I expected (found: eth0 configured to 192.168.1.4, found eth1 unconfigured state, found lo unconfigured state).

So, as I already knew from my ping responses, the server was actually still online, just not 100% there.

I then checked the devices attached to the LAN via the router admin panel.

The attached devices query found:
ILOGB8724JETY on 192.168.1.4
(MAC address of c1 eth0 NIC) on 192.168.1.5
and – surprisingly –
ILOGB87505WE7 on 192.168.1.6
(MAC address of c2 eth0 NIC) on 192.168.1.7

This puzzled me.

All ports were forwarded to 192.168.1.4, but the router was telling me that the attached device on that IP address wasn’t (MAC address for eth0), but was the iLO2 address.

I checked the config on eth0 and it was definitely set to pick up a static address of 192.168.1.4.

Puzzling!

Where had ILOGB8724JETY on 192.168.1.4 come from?

And why hadn’t the static address config on eth0 over-ruled it?

I decided to leave the iLO2 address alone for now and go for simplicity.

I powered down server c2 and unplugged it from the mains, to remove all traces of it from the network.

On server c1 I configured eth1 to a static IP of 192.168.1.6 and reconfigured eth0 to a higher static IP of 192.168.1.5.

Rebooted the server.

Checked the attached devices in the router.

Sure enough, I suddenly – and for the first time – had a full house:
ILOGB8724JETY on 192.168.1.4
(MAC address for c1 eth0 NIC) on 192.168.1.5
(MAC address for c1 eth1 NIC) on 192.168.1.6

On a hunch I attempted to access and login to the iLO2 console.

Success!

And then I changed the port-forwarding rules to pick up 192.168.1.5, and saw that the sshd service was fully working.

Of course, it might have been the reboot that brought the sshd back online, but I thought there was more to it than that.

I wanted to test a hunch that was forming, so I reconfigured eth0 to 192.168.1.7 and eth1 to 192.168.1.8 and rebooted the server.

When I checked the attached devices in the router I still had a full house, but the addressing was updated:
ILOGB8724JETY on 192.168.1.6
(MAC address for c1 eth0 NIC) on 192.168.1.7
(MAC address for c1 eth1 NIC) on 192.168.1.8

Aha!

So iLO2 attaches to the LAN with an IP address one below that of eth0 (eth0 minus 1).
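The "eth0 minus 1" pattern can be written down as a small check against the two sets of addresses the router reported. This is a sketch under assumptions: the names `OBSERVATIONS`, `ilo_is_one_below_eth0` and `fits_pattern` are mine, and two observations being consistent with the pattern is not the same as proving a rule.

```python
import ipaddress

# (iLO2 address, eth0 address) pairs as listed by the router after each reboot.
OBSERVATIONS = [
    ("192.168.1.4", "192.168.1.5"),
    ("192.168.1.6", "192.168.1.7"),
]


def ilo_is_one_below_eth0(ilo, eth0):
    """True if the iLO2 address is exactly one below the eth0 address."""
    return ipaddress.ip_address(eth0) - 1 == ipaddress.ip_address(ilo)


def fits_pattern(observations):
    """True if every observed (iLO2, eth0) pair follows the eth0-minus-1 pattern."""
    return all(ilo_is_one_below_eth0(ilo, eth0) for ilo, eth0 in observations)
```

`fits_pattern(OBSERVATIONS)` holds for both reboots recorded here; a third reboot with a different eth0 address would be a cheap way to test the hunch further.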

Well that was a revelation (and thank you, HP, for probably burying this information in the reams of words about iLO2 and for not making it plain and obvious).

This discovery meant that all of the I/O traffic that server c1 had been processing for the last couple of weeks on 192.168.1.4 was actually being forced to the iLO2 NIC by virtue of the port-forwarding rules, and not being passed via the eth0 NIC.

I hadn’t noticed any speed issues, despite this misrouting, but I resolved to fix this.

I reset the IP addresses, rebooted the server and checked the attached devices in the router, and saw the (now expected) full house of:
ILOGB8724JETY on 192.168.1.4
(MAC address for c1 eth0 NIC) on 192.168.1.5
(MAC address for c1 eth1 NIC) on 192.168.1.6

Then I reset the router’s port-forwarding rules to pick up the eth0 NIC on 192.168.1.5.

I ran some WAN tests, just to be sure, and was pleasantly surprised by the speed responses.

The bottom line here is that it looks as though an internal IP address conflict between the iLO2 and the static 192.168.1.4 for eth0 is what stopped sshd from running.
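A conflict like this is easy to miss by eye but simple to detect once the addresses are written down side by side. The following is a minimal sketch, assuming the assignments are collected by hand from the router's attached-devices page and the server's own config; the function name `find_address_conflicts` and the device labels are illustrative.

```python
import ipaddress
from collections import defaultdict


def find_address_conflicts(assignments):
    """Given (device_name, ip_string) pairs, return {ip: [device_names]}
    for every address claimed by more than one device."""
    claimed = defaultdict(list)
    for device, ip in assignments:
        # Normalise through ipaddress so malformed addresses raise early.
        claimed[str(ipaddress.ip_address(ip))].append(device)
    return {ip: devices for ip, devices in claimed.items() if len(devices) > 1}


# The state before the fix: the router saw iLO2 on .4 while eth0 was
# statically configured to the same address.
before = [
    ("ILOGB8724JETY", "192.168.1.4"),
    ("c1 eth0 (static config)", "192.168.1.4"),
    ("c2 eth0", "192.168.1.7"),
]
conflicts = find_address_conflicts(before)
```

Here `conflicts` contains a single entry for 192.168.1.4, listing both the iLO2 and eth0. With eth0 moved to 192.168.1.5 the same check returns an empty dictionary.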

So this is a good result. I now have iLO2 running and I have detected and resolved the internal IP address conflict – and sshd is running normally.

anticlimactic win!

Server c1 is done.

The OS is installed.

The environment has been virtualised.

MySQL and Postfix are installed.

The environment has been pen/security tested.

Four (client-facing) VMs have been built and are being used by various people who are trying to break them.

Three layers of firewall have been implemented (2x physical, 1x software).

As far as hosting BaaS data goes, the environment feels very close to being absolutely right.

And I have to say that the environment is very fast.

I would like to put some time and effort into practising building VMs for FQDN hosting.

I guess that’s what I’ll be doing this week.

cheating at hardware fixes

Somewhere around Wednesday evening, about 72 hours after I fixed the remote SSH problem by changing the Plusnet-supplied Sagemcom router for a Netgear router, all port 80 and all port 22 calls to server c1 started being dropped.

There was nothing I could do, because I was down in Bristol and server c1 needed an onsite visit back at the Nottingham datacentre.

Frustrating!

Eventually the weekend rolled around and I tottered off my sickbed in to the datacentre to begin explorations.

Server c1 is an HP DL380/G5.

It had just one (500GB) disk, which contained all the CentOS 6.6 goodies that had been rolled out so far.

Which wasn’t much cop, because server c1 wouldn’t stay alive.

When I walked up to the cabinet, c1 was definitely receiving power, but was switched off.

I pushed the button and it whirred and whined, noisily, to life.

The console showed me the usual boot sequence.

Then server c1 just powered itself down.

I tried again; it booted up. This time it got as far as the CentOS login prompt.

And then powered down again.

Long story short, I removed the PSU from server c1, cleaned all the PSU and serverside contacts, and replaced it.

The server booted up and stayed up.

I logged in as root and performed some basic functions.

Everything looked fine.

Rather than leave things like that for the week, I decided I’d like to add some extra resilience to the situation.

I removed the PSU from server c2 (another HP DL380/G5), and slotted that in to the spare PSU bay in server c1 (the HP DL380/G5 servers have the capability for two independent PSUs running at the same time).

So server c1 is now running two PSUs, and I’ll keep an eye on the server logs to see if the original PSU drops out, or if there are any more powerdown problems.