I've got a MySQL server which has both InnoDB and MyISAM tables. The InnoDB tablespace is quite small, under 4 GB. MyISAM is big: ~250 GB in total, of which 50 GB is for indexes.
Our server has 32 GB of RAM, but it usually uses only ~8 GB. Our key_buffer_size is only 2 GB, yet our key cache hit ratio is ~95%. I find that hard to believe.
We have 52 GB of index data, and a 2 GB key cache is enough to get a 95% hit ratio? Is that possible? On the other hand, the data set is 200 GB, and since MyISAM relies on OS (CentOS) caching for data, I would expect it to use a lot more memory to cache accessed MyISAM data. But at this stage I see that the key_buffer is completely used, and our InnoDB buffer pool size is 4 GB and is also completely used; that adds up to 6 GB. Does that mean data is cached using just 1 GB?
My question is: how can I check where all the free memory is being used? And how can I check whether MyISAM data reads are hitting the OS cache instead of disk?
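A rough way to approximate both with standard Linux tools (this is the part I'm unsure about):

free -m        # the "cached" column is the OS page cache that would hold MyISAM data
vmstat 5       # "bi" (blocks in) shows reads actually hitting disk
iostat -x 5    # per-device read rates; low r/s under load suggests cache hits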
I have a PC with a 256 GB M.2 SSD drive used for Linux Mint root, swap and home partitions right now. I have a 2TB HDD for all my data (I keep that separate from home dir actually, because users' home folder has user configuration).
I would like to migrate that system to ZFS to take advantage of checksums, snapshots, transparent compression and the power-loss robustness.
Since I didn't build the hardware with ZFS in mind I wonder now what would be the best way to utilize this setup.
I do a lot of video production among other things, so I have big files (10 GB) being thrown around a lot. I thought about using the SSD as an SLOG, but wouldn't that unnecessarily wear the SSD out when I'm capturing and rendering video files? HDD speed is not a bottleneck for me in this regard anyway.
Should I create separate pools for SSD and HDD? Can I easily move datasets between the pools with send/receive commands?
Could I partition the SSD to use it both for root partition and SLOG for HDD? Wouldn't that defeat wear leveling and kill my SSD faster? Wouldn't that kill SSDs performance benefits?
As a side-note:
I might add a second 2 TB HDD later as a mirror to gain redundancy. I also have two 1 TB USB 3.0 drives that I used with ZFS in RAID 0 for a while. They seem to work pretty well and can handle ~130 MB/s write speed. I wonder if using them as a mirror vdev for my main HDD would be a good idea. They have proven stable for a few months (I know USB can be problematic with ZFS - I played with that quite a lot too).
I currently do rdiff-backup to an external 3TB USB 3.0 drive.
I have deployed ZFS a few times so far, once in a production environment for the system root and data storage on a 2-disk mirror. I have never used SSDs with it, though.
The problem we are facing is that session files inside the tmp folder are being sporadically deleted and we can't tell why. Sessions will last anything from about 10 minutes to 40 minutes, when they should be lasting 12 hours.
This is a virtual host environment, but none of the code we use in these sites overrides this setting (with ini_set, apache config PHP values or otherwise) so we can't see why they are being deleted. There are also no scheduled tasks deleting the files.
Is there a way to successfully figure out why gc_maxlifetime is being ignored? For the record, I changed one of our sites to use session_save_path('D:/wamp/tmptmp'); temporarily, just to double-check it was the garbage collection, and session files remain in there untouched - though admittedly this doesn't really give many more clues.
I made a stupid mistake. I have a small vhosted CentOS server, and while configuring git+gitosis I think I ran the gitosis SSH key init on my own user in addition to the git user. I didn't realize it at the time, but now when I try to SSH to the server with my user I get:

PTY allocation request failed on channel 0
ERROR:gitosis.serve.main:Need SSH_ORIGINAL_COMMAND in environment.
Connection to [servernamehere] closed.

Any idea how I can log in to the server again? Unfortunately I have disabled SSH root login.
debug2: channel 0: input open -> closed
debug3: channel 0: will not send data after close
debug2: channel 0: almost dead
debug2: channel 0: gc: notify user
debug2: channel 0: gc: user detached
debug2: channel 0: send close
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
  #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)
debug3: channel 0: close_fds r -1 w -1 e 7 c -1
Connection to myserver closed.
Answer
Sorry Cristian, no easy answer to this: your user can now only interact with git.
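For background (a sketch of the mechanism, assuming a standard gitosis install): gitosis works by prefixing the key in ~/.ssh/authorized_keys with a forced command, along the lines of:

command="gitosis-serve cristian@server",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAA...

Every login with that key is therefore captured by gitosis-serve. If you can reach the filesystem some other way (your VPS provider's console, another sudo-capable account), deleting the command="..." prefix and the surrounding options from your own user's key line should restore a normal shell.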
So far, in order to make it remotely functional I've had to add:
Internet Gateway
Route Table
When I deploy the template, it creates successfully. I can look through my console and see all of the components.
What I'm having a problem with is connecting to the instance afterward. Attempts to connect via ssh time out.
I think I've narrowed this problem down to the fact that when the stack is built, two route tables are deployed. One is marked as main and has no default route added to it; the other one I explicitly define in my template and add the default route to.
If I add the default route to the template-defined table after the fact, I can SSH in.
I guess my questions are:
how do I mark the table that I'm creating in the template as the main table, or
how do I tell CloudFormation to not create the default table that is being marked as main, or
how do I get the default route into the main table?
Template:
AWSTemplateFormatVersion: 2010-09-09
Resources:
  vpcCandidateEyMm7zuOcn:
    Type: 'AWS::EC2::VPC'
    Properties:
      CidrBlock: 192.168.0.0/16
      EnableDnsHostnames: 'true'
      EnableDnsSupport: 'true'
      InstanceTenancy: default
      Tags:
        - Key: Test
          Value: Test
    Metadata:
      'AWS::CloudFormation::Designer':
        id: 052446e9-ed29-4689-8eb2-2006482f7a65
  IgCandidateEyMm7zuOcn:
    Type: "AWS::EC2::InternetGateway"
    Properties:
      Tags:
        - Key: Test
          Value: Test
  AigCandidateEyMm7zuOcn:
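Not part of the template above, but for what it's worth: the usual way to sidestep the main table entirely is to associate each subnet explicitly with the route table defined in the template (resource names here are hypothetical):

  SubnetRtbAssociation:
    Type: 'AWS::EC2::SubnetRouteTableAssociation'
    Properties:
      SubnetId: !Ref subnetCandidate
      RouteTableId: !Ref rtbCandidate

A subnet with an explicit association no longer falls back to the main route table, so the default route you define takes effect.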
The workstation is running Windows 8.1 and joined to the domain. There are a few group policies that deploy software. The "Always wait for the network at computer startup and logon" setting is disabled. Software deployment used to take 1 or 2 logons to apply the changes, as expected.
But something happened, and now I can't get software deployment to apply the changes at all. rsop.msc says: "Software Installation did not complete policy processing because a system restart is required for the settings to be applied. Group Policy will attempt to apply the settings the next time the computer is restarted." Restarting has no effect: the group policies are still processed in asynchronous mode, and rsop.msc says just the same. Rejoining the domain doesn't help. No error is in the Event log.
The question is: what is the switch/flag that tells Windows to enable synchronous processing of the group policies?
Answer
The reason was the startup policy processing timeout: the default value of 30 s sometimes isn't enough. Setting the "Specify startup policy processing wait time" policy (Computer Configuration - Policies - Administrative Templates - System - Group Policy) to 120 s fixed the issue.
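If you need to push it outside of Group Policy, my understanding (worth verifying) is that this policy maps to the following registry value:

rem wait up to 120 seconds for the network at startup
reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows\System" /v GpNetworkStartTimeoutPolicyValue /t REG_DWORD /d 120 /f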
I have a Storage Spaces pool on a server that is not currently power protected. It consists only of mechanical hard drives.
I want to compare performance of write caching enabled vs disabled so I can make a decision about how much of a priority power protection should be.
Is it sufficient to disable write caching through Device Manager -> Policies for each physical hard drive, or do I need to disable it anywhere else through Storage Spaces?
For example, it is enabled on the Microsoft Storage Space Device entry in Device Manager, and cannot be disabled. I'm not sure if that is important at all.
Sorry, I am new to this DNS jargon, so the following question might sound obvious.
I have a VPS and I am using a third-party DNS service for managing DNS. My DNS provider had given me his own name servers, which I replaced with my own using vanity name servers.
Everything is working fine, and I have also updated the glue records and name servers at my domain registrar. But when I run a DNS test I get the following error:
Inconsistent glue for name server ns2.domain.com. The address of a name server differed from the child and the parent. This is a configuration error and should be corrected as soon as possible.
Error explanation.
Inconsistent glue for name server.
Pingdom: On at least one of the parent-side name servers listed, there was an A or AAAA record, and a record with the same name but different content was found on at least one of the child-side name servers.
I can't understand what's causing this error. Is the problem with my DNS provider or with my domain registrar?
Answer
It is very simple.
For the name ns2.domain.com, you have one A (or AAAA) record in your zone, and one glue record in the TLD's zone. They point to different addresses.
The glue record is set up by your registrar. The other one is set up by you in your DNS configuration. If the address you have set for the ns2 host yourself is right, let your registrar know they need to update the glue records for your domain (and give them the new IP). If it is wrong, change the A or AAAA record for your ns2 host to match what is in the glue.
You can find what is in the glue record by querying the TLD nameservers. Here is an example for com. on Linux:
dig @m.gtld-servers.net in ns google.com
This will typically return an additional section with the A records for the nameservers. You could also change the query to be more direct:
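For example, asking the TLD server for the nameserver's own address record (a guess at the intended query, using the same public names as above):

dig @m.gtld-servers.net in a ns1.google.com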
I am running a website with multiple subdomains, each with its own SSL certificate.
It's time to renew, and while the current certificate uses SAN entries, one for each subdomain, I am questioning the convenience of this approach, as I can find three individual SSL certificates for an apparently cheaper price than a single SSL certificate with three SAN entries. I am using IIS 8.5.
Are there any particular drawbacks to going with multiple individual SSL certificates?
Answer
Multiple SSL certificates on a single IP address require SNI. If your clients do not support SNI, they'll end up on whatever your default binding is. A SAN certificate is also potentially less maintenance, as it's just one cert to renew every year or two.
But you are correct in your assumption that SAN certificates are more expensive than three individual certs. This is because the certificate business is run like a cartel. So if you want to save a few bucks and are willing to sacrifice clients that don't support SNI, go for it. Get your three individual certs at a fraction of the price of a SAN certificate.
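For what it's worth, adding an SNI-enabled HTTPS binding in IIS 8.x from PowerShell looks roughly like this (site and host names are placeholders):

Import-Module WebAdministration
# -SslFlags 1 marks the binding as requiring SNI
New-WebBinding -Name "MySite" -Protocol https -Port 443 -HostHeader "sub1.example.com" -SslFlags 1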
My client's Active Directory database was corrupted; there was no way to recover it, and they didn't have any backups. So I had to remove and reinstall Active Directory.
Now I have to reconfigure the workstations for the new Active Directory. What is the fastest way to keep users' desktops while setting them up against the new AD?
From what I've read, it seems like DNS failover is not recommended simply because DNS wasn't designed for it. But if you have two webservers on different subnets hosting redundant content, what other methods are there to ensure that all traffic gets routed to the live server if one server goes down?
To me it seems like DNS failover is the only failover option here, but the consensus is it's not a good option. Yet services like DNSmadeeasy.com provide it, so there must be merit to it. Any comments?
This has always bugged me and I've never got around to figuring out why Apache does this, I always resorted to the mod_vhost plugin to work around the issue.
Basically, I have 2 vhosts in sites-enabled (Ubuntu server), their contents:
DocumentRoot "/var/www/vhosta.domain.com/" ServerName vhosta.domain.com allow from all Options +Indexes
And
DocumentRoot "/var/www/vhostb.domain.com/" ServerName vhostb.domain.com allow from all Options +Indexes
Now logically these two would be accessible separately; however, it seems that all requests to my server, no matter what vhosts I define on top of this, go to vhosta.domain.com.
Am I missing something incredibly obvious? I really don't get why it's doing this.
Answer
You are missing a NameVirtualHost directive; however:
DO NOT use VirtualHost *; use VirtualHost *:80 instead.
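Putting both together, a minimal sketch (paths and names follow the question; the rest is illustrative):

NameVirtualHost *:80

<VirtualHost *:80>
    ServerName vhosta.domain.com
    DocumentRoot "/var/www/vhosta.domain.com/"
</VirtualHost>

<VirtualHost *:80>
    ServerName vhostb.domain.com
    DocumentRoot "/var/www/vhostb.domain.com/"
</VirtualHost>

Without a NameVirtualHost matching the vhosts' address:port, Apache treats the first vhost as the default and sends every request there - which is exactly the symptom described.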
I have a storage system that contains 8 x 1TB drives that use the 4k sector size "Advanced Format". I'm planning to run NexentaStor on this hardware and want to ensure that I'm taking the 4k sector size into account. Is there anything special I need to keep in mind when creating the root pool and subsequent data pools with ZFS?
Answer
ZFS handles 4k sectors well as long as the drive advertises them correctly.
However, some drives have 4k sectors internally but present a logical 512 sector size to the operating system for backwards compatibility. If ZFS believes the drive, and writes in 512 byte chunks to 4k sectors, you'll suffer a heavy read-modify-write penalty.
Have a look at the Solarismen blog:
If your drive reports a sector size of 4k, you're fine. If your drive reports a sector size of 512, you may be able to work around it by using the modified zpool binary from the same site:
The modified binary hardcodes the sector size to 4k. Note that you only need to use it for the initial zpool creation. This may be a bit difficult for your root pool - you may need to slipstream the modified binary in to the NexentaStor ISO.
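As a sanity check (pool name is a placeholder), you can confirm what the pool was actually created with:

zdb -C tank | grep ashift    # ashift: 12 means 4k sectors, ashift: 9 means 512-byte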
I have 4 drives: 2x 640GB and 2x 1TB. My array is made up of four 640GB partitions at the beginning of each drive. I want to replace both 640GB drives with 1TB drives. I understand I need to: 1) fail a disk, 2) replace it with the new one, 3) partition, 4) add the disk to the array.
My question is, when I create the new partition on the new 1TB drive, do I create a 1TB "Raid Auto Detect" partition? Or do I create another 640GB partition and grow it later?
Or perhaps the same question could be worded: after I replace the drives, how do I grow the 640GB RAID partitions to fill the rest of the 1TB drive?
fdisk info:
Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xe3d0900f

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       77825   625129281   fd  Linux raid autodetect
/dev/sdb2           77826      121601   351630720   83  Linux

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xc0b23adf

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1       77825   625129281   fd  Linux raid autodetect
/dev/sdc2           77826      121601   351630720   83  Linux

Disk /dev/sdd: 640.1 GB, 640135028736 bytes
255 heads, 63 sectors/track, 77825 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x582c8b94

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1       77825   625129281   fd  Linux raid autodetect

Disk /dev/sde: 640.1 GB, 640135028736 bytes
255 heads, 63 sectors/track, 77825 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xbc33313a

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1               1       77825   625129281   fd  Linux raid autodetect

Disk /dev/md0: 1920.4 GB, 1920396951552 bytes
2 heads, 4 sectors/track, 468846912 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk identifier: 0x00000000
Answer
My question is, when I create the new partition on the new 1TB drive, do I create a 1TB "Raid Auto Detect" partition?
You can, but you're not going to gain anything immediately from that.
Or do I create another 640GB partition and grow it later?
Yes.
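A sketch of that replace-and-grow cycle with mdadm (device names are examples, not taken from your fdisk listing):

mdadm /dev/md0 --fail /dev/sdd1 --remove /dev/sdd1   # retire one old 640GB member
# ...swap in the new drive and partition it...
mdadm /dev/md0 --add /dev/sdd1                       # let the array rebuild
# once every member has been replaced and its partition enlarged:
mdadm --grow /dev/md0 --size=max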
RAID-on-partition has its uses, but when you're using the drives in a pseudo-storage-pool setup, you're sometimes better off using 'whole drive' instead of 'partition' RAID members. Designating the whole drive (i.e. /dev/sdc instead of /dev/sdc1) has the advantage of implicitly telling the RAID mechanism that the entire drive is to be used, and therefore no partition needs to be created/expanded/moved/what-have-you. This turns the hard drive into a 'storage brick' that is more-or-less interchangeable, with the caveat that the largest 'chunk size' in your 'set of bricks' will be the smallest drive in the set (i.e. if you have a 40gb, 80gb, and 2x 120gb, the RAID mechanism will use 4x 40gb because it can't obtain more space on the smallest drive). Note that this answer is for Linux software RAID (mdadm) and may or may not apply to other environments.
The downside is that if you need flexibility with your RAID configuration, you will lose that ability, because the entire drive will be claimed. You can, however, offset that loss through the use of LVM-on-RAID. Another issue with whole-drive RAID is that some recovery processes will require a little more thought, as they often assume the presence of a partition. If you use a tool that expects a partition table, it may balk at the drive.
Unsolicited Advice (and nothing more than that, if it breaks, you keep both pieces, etc.):
Your best bet is to set up your RAID array as you like, using the 'whole drive' technique, but then use LVM to manage your partitions. This gives you a smidgeon of fault-tolerance with RAID, plus the flexibility of dynamically sizable partitions. An added bonus: if you use Ext3 (and possibly Ext2 supports this, not sure), you can resize the 'partitions' while they are mounted. Being able to shift the size of mountpoints around while they are 'hot' is a wonderful feature, and I recommend considering it.
Additional follow-up:
I received a comment that Ext2 does not support hot resizing. In reality, it does, but only for increases in size. You can read more at this link here. Having done it a few times myself, I can say that it is possible, it does work, and can save you time.
We have a continuous integration server which runs a series of tests over several different client OSes (different versions of Windows and OS X). It runs on an Apple Xserve running OS X Leopard Server, and the clients run within VMware Fusion - this combination was chosen because Apple's licensing only allows OS X Server VMs to run on Apple hardware.
The CI system uses the VMware tools to communicate with the Windows clients, but this does not work properly with OS X, so it uses SSH in those cases. However, every so often, the network will drop out fairly quickly after VM startup.
The VMs are configured to use host-only networking, and the Windows VMs, while slow, don't seem to have any connection issues.
My Active Directory domain's name is "mywebsite.com" instead of "mywebsite.local". I had to do this as a workaround for other issues, and changing it would be a pain. When people inside my company visit "mywebsite.com", it resolves to our DC instead of our website. How can I make it go to our website instead?
Answer
You can't.
You can define any hostname or subdomain you want in your AD's main DNS zone, but for AD to work properly, the A records for the domain itself must point to your domain controllers.
So, having "www.mywebsite.com" pointing to your web site is fine, but having "mywebsite.com" do the same is not.
Addendum: hosting your web site on your DC would of course fix the issue, but I strongly advise you against that; DCs are definitely not meant to host web sites.
This is my first time configuring a VPS server and I'm having a few issues. We're running Wordpress on a 1GB Centos server configured per the internet (online research). No custom queries or anything crazy but closing in on 8K posts. At arbitrary intervals, the server just goes down. From the client side, it just says "Loading..." and will spin more or less indefinitely. On the server side, the shell will lock completely. We have to do a hard reboot from the control panel and then everything is fine.
Watching "top" I see it hovering between 35 - 55% memory usage generally and occasional spikes up to around 80%. When I saw it go down, there were about 30 - 40 Apache processes showing which pushed memory over the edge. "error_log" tells me that maxclients was reached right before each reboot instance. I've tried tinkering with that but to no avail.
I think we'll probably need to bump the server up to the next RAM level but with ~120K pageviews per month, it seems like that's a bit overkill since it was running fairly well on a shared server before.
Any ideas? httpd.conf and my.cnf values to add? I'll update this with the current ones if that helps.
Thanks in advance! This has been a fun and important learning experience but, overall, quite frustrating!
Answer
If your server cannot handle spinning up 30-40 httpd processes (it can't), then don't let it. I go into a lot of detail regarding LAMP configuration in my answer to this question. The examples I give are for a 512 MiB VPS, so don't just blindly copy the configuration "per the internet". :)
Short version: scale back your httpd MaxClients and ServerLimit variables to prevent 30+ httpd processes from spinning up. I'd start with something like 10 or 15 depending on the average size of your processes, and how much room you've given MySQL. Note that httpd's behavior will be to refuse requests when all client processes are busy.
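As a starting point, something like this in httpd.conf (a sketch for a 1 GB VPS; the numbers are illustrative, not tuned to your workload):

<IfModule mpm_prefork_module>
    StartServers           3
    MinSpareServers        3
    MaxSpareServers        5
    ServerLimit           12
    MaxClients            12
    MaxRequestsPerChild 1000
</IfModule>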
I used to have a RAID 0 setup but I ditched it and restarted from scratch with a plain non raid single disk system. The second disk is just used for general storage.
Well, today I booted the machine, and for whatever reason the BIOS booted from the secondary drive, which still had the old RAID 0 boot record, OS, and everything. It took me a few minutes to figure out what went wrong; I changed the boot order in my BIOS and my 'new' machine came back.
How do I blow away the bootability of the old secondary drive without destroying any of the partitions on it? The MBR holds the partition table, so I can't just wipe the whole thing, but I can't find any partition tool that will erase the MBR boot code while keeping the partition table.
This is an Ubuntu 10.10 setup.
Answer
The actual MBR boot loader is located on the first sector of the hard drive. This sector also contains the partition table for the drive, so care must be taken to remove the boot loader code without removing the partition table. The following command should do it:
dd if=/dev/zero of=/dev/sdX bs=446 count=1

(Here /dev/sdX is the old secondary drive. Reading from /dev/zero rather than /dev/null is required so that 446 zero bytes are actually written; the boot code occupies the first 446 bytes of the sector, while the partition table that follows is left untouched.)
Slightly more information is at this URL. WARNING: Before making this modification, have good backups available.
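For instance (my suggestion, not part of the command above), keep a copy of the full first sector before zeroing the boot code:

dd if=/dev/sdX of=/root/sdX-mbr-backup.bin bs=512 count=1   # restore by reversing if= and of=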
I have a VM config currently where I have 1 external IP pointing to a VM with an Nginx HTTP Reverse proxy serving HTTP Web Pages from several internal VMs without external IPs.
I need a similar setup to redirect SSH requests for certain hostnames to internal servers. This is for a hosted git repository which sits behind my proxy and thus has no external IP of its own, so I need some kind of iptables rule to let me reverse proxy/forward SSH requests to the corresponding VMs.
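One caveat: unlike HTTP, the SSH protocol carries no hostname a proxy could route on, so iptables can only distinguish by destination port or IP. A sketch of port-based forwarding (addresses and ports are examples):

sysctl -w net.ipv4.ip_forward=1
# SSH arriving on the proxy's port 2222 goes to the git VM
iptables -t nat -A PREROUTING -p tcp --dport 2222 -j DNAT --to-destination 10.0.0.5:22
iptables -A FORWARD -p tcp -d 10.0.0.5 --dport 22 -j ACCEPT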
I know that this question must have been asked (and answered) before, but I can't find a solution for my problem in any of those questions. It's a bit odd... The problem is that my PHP scripts (and my Apache server) cannot write to folders on my system. Not at all.
For example, I get the following error while running a script:
I have a VPS with CentOS 7, Apache 2.4, PHP 5.6 (which runs as the Apache PHP module), and some other stuff. Apache runs as user apache and group apache (as set in the httpd.conf file). I have set the session_path in both /etc/php.ini and /etc/httpd/conf.d/php.conf to /tmp/phpsessions and chown'd/chmod'd this folder to apache:apache with 777 permissions. The above example stores sessions in another folder (also chown'd/chmod'd to apache:apache with 777 permissions), but I get the same error for other folders.
So my Apache server runs as apache:apache, I chown'd the folders to apache:apache where needed, and even with 777 permissions Apache fails to write to these folders.
Have you ever seen something like this? I haven't before...
Answer
Assuming permissions and ownerships are OK, I believe this relates to SELinux.
Quick and dirty way: assuming getenforce returns Enforcing, try disabling SELinux by running setenforce 0 and hit your script again. If it works, then it was SELinux; from there you can either leave it disabled (not recommended) or turn it back on with setenforce 1, check /var/log/audit/audit.log, and work towards a proper solution.
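The usual targeted fix, rather than leaving SELinux off, is to label the directory writable for httpd (a sketch, using the session path from the question):

semanage fcontext -a -t httpd_sys_rw_content_t "/tmp/phpsessions(/.*)?"
restorecon -Rv /tmp/phpsessions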
I'm running Apache2 on Ubuntu 10.04 LTS. I've configured /etc/php5/apache2/php.ini with the following:
error_reporting = E_ALL & ~E_DEPRECATED
display_errors = Off
log_errors = On
error_log = /var/log/php_errors.log
I have a PHP application that I believe has an error in it. With those settings and the suspected error, no php_errors.log was created. So, I created a PHP file with syntax errors. Still no php_errors.log. What are some reasons why PHP would not log errors?
Could it be permissions? I see that /var/log is owned by root:root. I don't know what user the PHP process runs as. I believe root starts the Apache process, which then spawns new workers as www-data:www-data for serving files. Would that mean I need to change permissions on the /var/log folder?
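A quick test of that theory (just a guess at the cause): www-data cannot create new files in /var/log, so pre-create the log with the right owner and see whether it starts filling:

touch /var/log/php_errors.log
chown www-data:www-data /var/log/php_errors.log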
I am not very familiar with dynamic DNS, and was curious if I could get it to work for a certain use case.
I have a few Raspberry Pis I'm setting up to mock, at small scale, the server setups of the applications we use in our larger environment. They have wireless capability. I'd like to be able to throw them in a bag and work with them from my laptop in various settings. However, working out their IPs and addresses on every new network would be quite annoying (having to change the endpoints all of the applications/configurations refer to).
I can have normal DNS A Records point to internal IPs and they work great while on private networks. However this is less ideal for changing IPs. Would I be able to use Dynamic DNS to resolve the DNS records to internal addresses? (Such that connecting to a new wireless network all of the lookups would work after everything is connected without having to monkey with the router, custom dns server, etc.)
Initial research indicates dynamic DNS usually resolves to the external IP, whereas in this case I wish to resolve automatically to the address obtained on a specific interface of each client, e.g. eth0.
Answer
The simplest way forward would be to use mDNS to do "ad-hoc" DNS resolution amongst the machines on the same subnet. This is, basically, as simple as installing avahi-daemon and libnss-mdns (Debian package names; adjust as appropriate) and making sure your firewall isn't blocking 5353/udp. This will cover both forward and reverse DNS entries, and create resolvable names of the form hostname.local for all other machines on the local subnet.
If you need naming which is available beyond the local multicast domain, you'll probably want to set up a DNS server somewhere on the Internet which accepts TSIG-authenticated UPDATE queries, and then configure your client machines to send updates using nsupdate (or some other equivalent means).
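A client-side update would then look something like this (server, zone, and key are hypothetical):

nsupdate -k /etc/dyn.key <<'EOF'
server ns.example.net
zone dyn.example.net
update delete pi1.dyn.example.net A
update add pi1.dyn.example.net 60 A 192.168.1.50
send
EOF

Run from an if-up hook on each Pi, this keeps pi1.dyn.example.net pointed at whatever address eth0 just obtained.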
I'm running a backported KVM on Debian Squeeze. At the moment the KVM guests can't connect to the internet through the bridge I have set up. The guests can reach each other and the host, but nothing outside: I can't ping, nslookup, or do anything against a remote address. The guests are configured with static IPs. When I didn't have this bridge but the virtual bridge (the KVM default), the guests could connect fine. After setting up the bridge things broke, so I think the problem lies there.

# The loopback network interface
auto lo br0
iface lo inet loopback
I created an A record called dns.example.com and pointed it at a nameserver. On the nameserver I gave it the name ns1.dns.example.com and also dns.example.com.
Now I'm confused. Do I create glue records of ns1.dns.example.com or dns.example.com? Do I also still need the A records for my dns.example.com?
Answer
You need glue records if the NS records for a domain point to hostnames within that domain. For example, this delegation would not need glue records, because the nameservers' addresses can be resolved outside the delegated zone:

testing.example.com. IN NS ns1.example.net.
testing.example.com. IN NS ns2.example.net.

But this one would, and in that case the glue records go into the example.com zone:

example.com:
[...]
testing.example.com.     IN NS ns1.testing.example.com.
testing.example.com.     IN NS ns2.testing.example.com.
ns1.testing.example.com. IN A  172.16.202.152
I often hear people making statements such as "our MySQL server machine failed", which gives me the impression that they dedicate a single machine as their MySQL server (I guess they just install the OS and only MySQL on it). As a developer not a sysadmin, I'm used to MySQL being installed as part of a LAMP stack together with the web server and the PHP.
Can someone explain to me:
What's the point of installing MySQL on a separate server? It sounds like a waste of resources when I could put the entire LAMP stack there, and additional servers as well.
If the database is on a separate machine, how do the apps that need it connect to it?
Answer
When your application platform and your database are competing for resources, that's usually the first indication that you're ready for a dedicated database server.
Secondly, high-availability: setting up a database cluster (and usually in turn, a load-balanced Web/application server cluster).
I would also say security plays a large role in the move to separate servers as you can have different policies for network access for each server (for example, a DMZ'ed Web server with a database server on the LAN).
Access to the database server is over the network; i.e., where you usually specify "localhost" as your database host, you'd specify the host/IP address of your database server instead. Note: you usually need to modify the configuration of your database server to permit connections, i.e. enable listening on an interface other than the loopback interface.
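For example (hostnames are placeholders), the application just swaps "localhost" for the DB host:

mysql -h db1.internal.example.com -u appuser -p appdb
# and on the MySQL server, my.cnf must listen beyond loopback:
# bind-address = 0.0.0.0    (or the server's LAN IP, instead of 127.0.0.1)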
We're in the process of moving from one server to another.
I'm running the SQL Server database off our new server, with the old server now using it as a DB server, the aim being a smooth transition from one server to the other while DNS propagates - so no downtime for anyone.
Nothing else is currently running on the new server, it's a fresh install of everything.
The sqlservr.exe process seems to be increasing in memory requirements non-stop. Is this most likely to be connections not closing properly on my website? Or are there any known memory leaks in SQL Server? We receive around 40k page views a day, and probably around 5x that number in crawler page views.
Total DB MDF+LDF size is quite small, only 600 MB. The current commit (KB) of sqlservr.exe is 2,393,000, and it's increasing at a rate of about 0.5 MB every second or two.
If this is connections not closing properly on our website, is there any way to clean up old open connections? Apart from obviously fixing the root cause, what can we do in the meantime?
Answer
I'm not a SQL internals guru, but to my knowledge, SQL Server will use as much RAM as you've configured it to and as the OS will allow it to have. More is cached than the DB itself - for example, the execution plans: http://msdn.microsoft.com/en-us/library/ms181055.aspx
As a good rule of thumb your SQL memory should be configured in the following way:
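Whatever numbers you settle on, the cap itself is set via 'max server memory' (the value below is purely illustrative):

EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 4096; RECONFIGURE;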
I've inherited a Sympa mailing list server from a previous Admin and am not very familiar with the whole process. Recently, we've been getting some calls from users that their posts to the various mailing lists are being marked as failing fraud detection checks.
I've been reading up on SPF and suspect what is happening: when a user (bob@somewhere.org) posts to the list (my.mailinglist.com), the outbound message from the list server has the envelope sender set to "bob@somewhere.org". Our list server then relays the outgoing message to mail.gateway.com, which delivers it over the Internet. When the SMTP server at somewhere.org (or any other domain) receives the post, it sees that it was sent by our relay, mail.gateway.com (13.14.15.16), which does not have its IP address in the SPF record for somewhere.org.
In the mail headers of the outbound post sent from mail.gateway.com, I have an SPF line which reads:
Received-SPF: SoftFail (mail.gateway.com: domain of transitioning bob@somewhere.org discourages use of 13.14.15.16 as permitted sender)
We have many users from many different domains sending mail to our list server. Asking every domain to include the mail.gateway.com IP in their SPF record seems ridiculous, but that's what I gather is one way to fix this.
The other fix probably involves using a different envelope sender. I'm not sure how this would affect "Reply" and "Reply to" functionality. Right now it seems a bit wonky; Reply and Reply-to both go to the mailing list, which seems odd. I'm trying to figure out where that is configured.
Are there some other ways to work this out that I have missed?
We're a small consulting shop that hosts some public facing websites and web applications for clients (apps we've either written or inherited). Our strength lies in coding, not necessarily in server management. However a managed hosting solution is out of our budget (monthly costs would exceed any income we derive from hosting these applications).
Yesterday we experienced a double hard drive failure in one of our servers running RAID 5. It's rare that something like this happens. Luckily we had backups and simply migrated the affected databases and web applications to other servers. We got really lucky: only one drive failed 100%, the other was simply marked as pending failure, so we were able to live-move almost everything (one DB had to be restored from backup) and we only had about 5 minutes of downtime per client as we took their database offline and moved it.
However we're now worried that we've grown a bit... organically... and now we're attempting to figure out the best plan for us moving forward.
Our current infrastructure (all bare metal):
pfSense router [old repurposed hardware]
1U DC [no warranty, old hardware]
2U web & app server (server 2k8 R2, IIS, MSSql, 24gb ram, dual 4C8T Xeon) -- this had the drive failures -- [warranty good for another year, drives being replaced under the warranty]
4U inherited POS server (128gb ram, but 32bit OS only, server 2k3) [no warranty]
(2) 1U webservers (2k8, IIS, 4C8T Xeon, 4gb ram) in a load balanced cluster (via pfSense) [newish with warranty]
1U database server (2k8, MSSQL, 4C8T Xeon, 4gb ram) [new with warranty]
NAS running unRaid with 3TB storage (used for backups and file serving for webapps to the 2 load balanced web servers)
Our traffic is fairly light, however downtime is pretty much unacceptable. Looking at the CPU monitors throughout the day, we have very, very little CPU usage.
We've been playing with ESXi as a development server host and it's been working reasonably well. Well enough for me to suggest we run something like that in our production environment.
We'd run the majority of the VMs on the server that currently has the failed hard drives, as it is the most powerful; run other VMs on the 1U servers we've currently got load balanced; P2V the old out-of-warranty hardware (except pfSense - I prefer it on physical hardware); and continue running the unRaid for backups.
I've got a bunch of questions, but the infrastructure based ones are as such:
Is this a reasonable solution to mitigate physical hardware issues? I think the SAN becomes a massive SPOF, and the large server (that would be hosting the VMs) is another. I read that the paid versions of VMware support automatic failover of VMs, and that might be something we look into to alleviate the potential for VM host failure.
Is there another option that I'm missing? I've considered basically "thin provisioning" the applications where we'd use our cheaper 1U server model and run the db and the app on one box (no VM). In the case of hardware failure, we'd have a smaller segment of our client base affected. This increases our hardware costs, rackspace costs, and system management costs ("sunk" cost in employee time).
Would Xen be more cost effective if we did need the VM failover solution?
I'm new on here, but I'm sure my network problem isn't new, so I hope I can get a bit of help.
I have a router in the office with 3 pc's connected, and a length of ethernet cable, less than 100m, which runs to another part of the building straight into a 8 port switch, call it sw1. Connected to sw1 are 2 wifi access points and another length of cable going into another switch, call it sw2. Sw2 has a tv and another wifi access point connected to it.
My problem seems to be with the cable going into sw2. Before using sw2, I could connect a laptop to that cable and it worked fine, but the TV would sometimes work and sometimes wouldn't. Even connecting to the access point on sw2 is a bother, with intermittent connection problems on mobile phones.
All the access points can be seen from the router side of the network even when I can't get a connection on the sw2 side of the network.
Any insight into this problem would be much appreciated.
I take it that it is OK to set up the network the way I have, cascading the switches and connecting access points, or I think you would have mentioned that. I have used several different switches for sw2, and that part of the network always ends up the same after a time: no data.
I don't get why there are two different programs in a minimal install for installing software. Don't they do the same thing? Is there a big difference? I have read everywhere to use aptitude over apt-get, but I still don't know the difference.
Answer
aptitude is a wrapper for dpkg, just like apt-get/apt-cache, but it is a one-stop-shop tool for searching/installing/removing/querying. A few examples of what apt might not supply:
$ aptitude why libc6
i   w64codecs Depends libc6 (>= 2.3.2)

$ aptitude why-not libc6
Unable to find a reason to remove libc6.
Suggests: locales, glibc-doc
Conflicts: libterm-readline-gnu-perl (< 1.15-2), tzdata (< 2007k-1), tzdata-etch, nscd (< 2.9)
Replaces: belocs-locales-bin
Provides: glibc-2.9-1
Description: GNU C Library: Shared libraries
 Contains the standard libraries that are used by nearly all programs on the system. This package includes shared versions of the standard C library and the standard math library, as well as many others.
I work for an organization which runs several servers, mainly used for internet to schools and for Exchange on the corporate side.
One of my tasks during my training is to be able to understand the network and how it works.
Now I don't need to understand absolutely everything but I would like to map out the network in Visio to see how it connects together via a visual perspective.
We have 1 core room which holds the main servers and switches for http/mail/voice/storage etc. which are connected to several switches in other "sub" rooms which then connect that section of the building, i.e 1 Hop.
So my task is to be able to track all the connections leading from the main server room going to the sub switch rooms, so we know what connection does what.
Do you guys have any idea of how to get started with this task, how to go about it, etc.
I'm not a 100% guru with servers, as I'm just an MA technician, and I will need to learn how the servers interact with the switches and how it all comes together.
My team has a server pointing at the DNS supplied by Active Directory to ensure that it is able to reach any hosts managed by the domain. Unfortunately, my team also needs to run dig +trace frequently and we sporadically get strange results. I am a DNS admin but not a domain admin, but the team responsible for these servers isn't sure what is going on here either.
The problem seems to have shifted around between OS upgrades, but it's hard to say whether that's a characteristic of the OS version or other settings being changed during the upgrade process.
When the upstream servers were Windows Server 2003, the first step of dig +trace (request . IN NS from the first entry in /etc/resolv.conf) would occasionally return 0 byte responses.
When the upstream servers were upgraded to Windows Server 2012, the zero byte response problem went away but was replaced with an issue where we would sporadically get the list of forwarders configured on the DNS server.
Example of the second problem:
$ dig +trace -x 1.2.3.4
; <<>> DiG 9.8.2 <<>> +trace -x 1.2.3.4
;; global options: +cmd
.                  3600   IN   NS   dns2.ad.example.com.
.                  3600   IN   NS   dns1.ad.example.com.
;; Received 102 bytes from 192.0.2.11#53(192.0.2.11) in 22 ms

1.in-addr.arpa.    84981  IN   NS   ns1.apnic.net.
1.in-addr.arpa.    84981  IN   NS   tinnie.arin.net.
1.in-addr.arpa.    84981  IN   NS   sec1.authdns.ripe.net.
1.in-addr.arpa.    84981  IN   NS   ns2.lacnic.net.
1.in-addr.arpa.    84981  IN   NS   ns3.apnic.net.
1.in-addr.arpa.    84981  IN   NS   apnic1.dnsnode.net.
1.in-addr.arpa.    84981  IN   NS   ns4.apnic.net.
;; Received 507 bytes from 192.0.2.228#53(192.0.2.228) in 45 ms

1.in-addr.arpa.    172800 IN   SOA  ns1.apnic.net. read-txt-record-of-zone-first-dns-admin.apnic.net. 4827 7200 1800 604800 172800
;; Received 127 bytes from 202.12.28.131#53(202.12.28.131) in 167 ms
In most cases this isn't a problem, but it will cause dig +trace to follow the wrong path if we are tracing within a domain that AD has an internal view for.
Why is dig +trace losing its mind? And why do we seem to be the only ones complaining?
Answer
You are being trolled by root hints. This one is tricky to troubleshoot, and it hinges on understanding that the . IN NS query sent at the start of a trace does not set the RD (recursion desired) flag on the packet.
When Microsoft's DNS server receives a non-recursive request for the root nameservers, it's possible that it will return the configured root hints. As long as you do not add the RD flag to the request, the server will happily continue to return that same response, with a fixed TTL, all day long.
;; ANSWER SECTION:
.                     3600   IN   NS   dns2.ad.example.com.
.                     3600   IN   NS   dns1.ad.example.com.

;; ADDITIONAL SECTION:
dns2.ad.example.com.  3600   IN   A    192.0.2.228
dns1.ad.example.com.  3600   IN   A    192.0.2.229
This is where most troubleshooting efforts will break down, because the easy assumption to leap to is that dig @whatever . NS will reproduce the problem, which actually masks it completely. When the server gets a request for the root nameservers with the RD flag set, it will reach out and grab a copy of the real root nameservers, and all subsequent requests for . NS without the RD flag will magically start working as expected. This makes dig +trace happy again, and everyone can go back to scratching their heads until the problem reappears.
Your options are to either negotiate a different configuration with your domain admins, or to work around the problem. So long as the poisoned root hints are good enough in most circumstances (and you're aware of the circumstances where they're not: conflicting views, etc.), this isn't a huge inconvenience.
Some workarounds without changing the root hints are:
Run your traces on a machine that has a less nutty set of default resolvers.
Start your trace from a nameserver that returns root name servers for the internet in response to . NS:

dig @somethingelse +trace example.com

You can also hardwire this nameserver into ${HOME}/.digrc, but this may confuse others on a shared account, or be forgotten by you at some point.
Seed the root hints yourself prior to running your trace:

dig . NS
dig +trace example.com
UseCanonicalName On
ServerName somename
ServerAlias www.someothername.com
According to the docs:
With UseCanonicalName On Apache will use the hostname and port specified in the ServerName directive to construct the canonical name for the server. This name is used in all self-referential URLs, and for the values of SERVER_NAME and SERVER_PORT in CGIs.
So in my Tomcat/CFML application when I visit the URL www.someothername.com I would expect to see in the CGI scope:
The last two cause a DNS lookup on somename but still return www.someothername.com in the CGI.SERVER_NAME field.
I should point out that the only reason I'm doing this is that I'm doing mass virtual hosting with mod_cfml to automatically create Tomcat contexts, and I would like the context and application to use a short name derived from the vhost configuration. I guess I could just set a header (even rewrite the Host: header), but using ServerName seemed the most elegant solution.
UPDATE: There is something I noticed in the client headers that is probably relevant. There are 2 headers I haven't seen before:
I need to know what sets these headers and why. I'm assuming it was either Tomcat or mod_cfml. Can I rely on the x-forwarded-server value to always be the ServerName?
We are thinking about moving our virtual machines (Hyper-V VHDs) to Windows Azure but I haven't found much about what kind of fault tolerance that infrastructure provides. When I run VHD in Azure, I've got two questions:
Is my VHD and all the data in it safe? I think that uploaded VHDs use the "Storage" infrastructure so they should be automatically replicated to multiple disks and geographically distributed but should I still make a full-image backup just to be safe? (Note that of course I will be backing up the actual data inside VMs that I care about; I just want to know if there is a chance greater than 0.0000001% that one day I will receive an email from Microsoft telling me that my VM is gone and that I should create or restore it from scratch).
Do I need to worry about other things regarding the availability of my VMs? I mean, when I have an on-premise server I need to worry about the hardware itself, about the host operating system, what would happen if my router failed, if my Hyper-V's C: drive failed etc. Am I right in thinking that with Azure, their infrastructure takes care of all of this?
I have to buy a stand-alone server rack to be placed in a machine room. The temperature in the room can reach 30°C (I have no choice with regard to the location). To compensate, I will use active cooling with the used air being extracted.
I need to calculate the rating for the cooling cabinet. The cabinet will have 2 catalyst 2960 g switches and 4 HP ProLiant DL380p Gen8 E5-2609 2.4GHz 4-core 1P 4GB-R P420i SFF 460W PS Entry EU Servers as well as 3 USB extender receiver signal adapters.
The Catalyst figures I can look up, as well as the KVM, but I have a problem with the ProLiant, as I can't figure out what heat it generates. The only BTU specification is for the power supply, which in this case is rated at 1725 BTU/hr, circa 506 watts. Does this mean that for four of these units I only need a cabinet that can cater for 4 × 506 + 2 × 129 + 3 × 50 = 2,432 W, i.e. about 2.432 kW?
Can it be that easy?
Answer
This depends on your usage pattern, but I think your 2.432 kW number may be overkill.
In terms of your room's ambient temperature, the thresholds of the DL380p platform are:
Ambient temps warn at 42°C
Thermal shutdown occurs at 46°C.
This can be overridden/ignored, but that's a target to keep in mind. Here's the heat map and spot temperatures for a very busy DL380p Gen8 system described in detail below.
The BTU figure you see in the HP QuickSpecs is a maximum, but we know you won't be running these servers at full utilization. In an SMB scenario, it doesn't make sense to engineer for worst-case power/heat loads. These are extremely efficient servers, and I've had success running them in less-than-ideal environments (warehouses, closets, bathrooms...).
However, HP does say: BTU/hr = Volts × Amps × 3.41, so your calculation is correct.
Real-life power utilization figures...
For a DL380p with that spec (or slightly more powerful E5-2620 single-CPU configuration), with a light duty cycle I see:
System Information
    Manufacturer: HP
    Product Name: ProLiant DL380p Gen8

model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
cpu MHz    : 2000.000

hpasmcli> SHOW POWERMETER
Power Meter #1
    Power Reading : 62

hpasmcli> SHOW POWERSUPPLY
Power supply #1
    Present  : Yes
    Redundant: Yes
    Condition: Ok
    Hotplug  : Supported
    Power    : 35 Watts
Power supply #2
    Present  : Yes
    Redundant: Yes
    Condition: Ok
    Hotplug  : Supported
    Power    : 35 Watts
For an extremely busy DL380p Gen8 with TWO E5-2643 v2 3.5GHz CPUs, running with a 8 internal SAS disks, 192GB RAM, fully-populated PCIe slots and no power saving settings enabled, I see:
EntryType  : Warning
InstanceId : 5802
Message    : None of the IP addresses (192.168.254.17) of this Domain Controller map to the configured site 'North'. While this may be a temporary situation due to IP address changes, it is generally recommended that the IP address of the Domain Controller (accessible to machines in its domain) maps to the Site which it services. If the above list of IP addresses is stable, consider moving this server to a site (or create one if it does not already exist) such that the above IP address maps to the selected site. This may require the creation of a new subnet object (whose range includes the above IP address) which maps to the selected site object.
The event was being logged repeatedly by a domain controller whose IPv4 address is not associated with the site it serves, as configured in the Active Directory Sites and Services console. I suppressed it by creating a /32 subnet object that maps to the served site; however, I would like to know about the actual consequences.
Why should the IPv4 address of the domain controller map to the site it serves?
Why is such a test being performed by Netlogon? Why is the recommendation "generally recommended"?
Besides the event log, how would Active Directory infrastructure be impacted by such configuration mismatch?
Although the network infrastructure that links the sites consists of no more than a few meters of optic fiber and has low latency and high bandwidth, multiple sites were created in order to establish affinities between users and domain controllers while keeping IPv4 addresses unchanged. This serves a capacity management purpose.
Under a test environment, a few Windows PowerShell lines can reproduce the issue.
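Something along these lines (a sketch using the ActiveDirectory module; names and ranges are made up):

# create a site and a subnet that does NOT cover the DC's address
New-ADReplicationSite -Name "North"
New-ADReplicationSubnet -Name "10.1.0.0/16" -Site "North"
# move a DC whose IP lies outside 10.1.0.0/16 into that site
Move-ADDirectoryServer -Identity "DC1" -Site "North"
# Netlogon on DC1 should then start logging event 5802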
Can someone tell me why I have an ARP table entry outside the interface's network?
Server has 2 interfaces, both on the same network:

eth0 10.10.34.146/22
eth1 10.10.33.188/22

ip route:

default via 10.10.32.2 dev eth0 proto static metric 1024
10.10.32.0/22 dev eth0 proto kernel scope link src 10.10.34.146
10.10.32.0/22 dev eth1 proto kernel scope link src 10.10.33.188

arp -n:

Address        HWtype  HWaddress          Flags Mask  Iface
176.119.32.2   ether   d4:d7:48:b5:a3:c1  C           eth1
176.119.32.2   ether   d4:d7:48:b5:a3:c1  C           eth0
And then, when I ping 8.8.8.8 (from eth0) I get an ICMP reply but the ARP table does not change. But when I run ping -I eth1 8.8.8.8 I get an ICMP reply and there is a new entry in the ARP table:

8.8.8.8   ether   d4:d7:48:b5:a3:c1   C   eth1

But when I add ip route add default via 176.119.32.2 dev eth1 to the routing table and ping 8.8.8.8 from eth1, there is no new ARP table entry for 8.8.8.8. Why is that?
NOTE: Both interfaces are connected to a Cisco switch with ip local-proxy-arp on the SVI with IP 176.119.32.2/22, which is the default gateway, and both interfaces are in a private VLAN.
Answer
It sounds like the Cisco interface your eth0 is connected to is configured with ip local-proxy-arp but the Cisco interface your eth1 connects to is actually configured with ip proxy-arp instead. This would account for the switch taking responsibility over the data-link addressing of an IP outside of the locally configured LAN.
Can you please verify this on the switch configuration? If I am incorrect, please post the results of show run int [PORT_FOR_ETH0] and show run int [PORT_FOR_ETH1]?
I'm running a CentOS 7 VM on ESXi with almost 300 GB of RAM and 24 vCPUs.
The average load is 3 and apps almost never use more than 150GB of RAM. The rest of the available memory is used by Linux for cache.
The issue is that when the cache fills up the available RAM, two kswapd processes start using 100% of CPU, and suddenly I see that all CPUs are also showing 99% sys usage (it's not wait or user, it's mainly sys).
This will cause a high load ( 100+ ) for several minutes until the system recovers and load goes down to 3 again.
At the moment I don't have a swap partition, but this issue happened even when I had one.
One "solution" that I found is to execute the following command every day:
echo 3 > /proc/sys/vm/drop_caches
which drops buffers/caches. This will "fix" the issue since cache usage never reaches 100%.
My questions are:
Is there a real solution for this issue?
Shouldn't the Linux kernel be smart enough to simply evict old cache pages from memory instead of launching kswapd?
After all, from what I understand, the main function of RAM is to be used by apps. Caching is just a secondary function, which can be discarded/ignored if you don't have enough memory.
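For reference, the knobs most often suggested for this symptom (no guarantee any of them is the fix):

echo never > /sys/kernel/mm/transparent_hugepage/enabled   # THP compaction is a common kswapd culprit
sysctl -w vm.zone_reclaim_mode=0                           # avoid aggressive NUMA-local reclaim
sysctl -w vm.min_free_kbytes=1048576                       # keep a larger free-page reserve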
Can anyone tell me of an ESXi command line that can be used to list the different virtual hardware components assigned to VMware guests running on ESXi with vCenter?
E.g. I want to find out how many of our guests are running with the e1000 network adaptor or how many have 2 sockets and 2 cores.
I'd like to do this in ESXi/vSphere not in the guest OS.
Answer
In PowerCLI, the number of CPUs is accessible directly as a property of the VirtualMachine object returned by Get-VM, but in v5.0, the other virtual hardware objects have their own cmdlets, e.g. Get-HardDisk, Get-NetworkAdapter. So you'd have to do something like:
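Presumably something along these lines (an untested sketch; property names per PowerCLI 5.x):

Get-VM | Select-Object Name, NumCpu,
    @{N='NICType';   E={ (Get-NetworkAdapter -VM $_ | Select-Object -ExpandProperty Type) -join ',' }},
    @{N='DiskCount'; E={ (Get-HardDisk -VM $_ | Measure-Object).Count }}

Note that NumCpu is the total vCPU count; the sockets/cores split may require digging into $_.ExtensionData.Config.Hardware.NumCoresPerSocket.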
I am fairly new to Linux administration, so this may sound like quite a noob question.
I have a VPS account with a root access.
I need to install Tomcat and Java on it and later other open source applications as well. Installation for all of these is as simple as unzipping the .gz in a folder.
My questions are
A) Where should I keep all these programs? In Windows, I typically have a folder called programs under C:\ where I unzip all applications. I plan to have something similar here as well. Currently I have them all in an apps folder under /root - which I am guessing is a bad idea. What's wrong with always being root? Right now I am planning to put them under /opt.
B) To what group should Tom belong? I would need a user - say, Tom - who can simply execute these programs. Do I need to create a new group, or just add Tom to some existing group?
C) Finally, am I doing something really stupid by installing all these applications by simply unzipping them? An alternative would be to use yum or RPM or something like that to install them. Given my familiarity and (tight) budget, that seems too much to me. I feel uncomfortable running commands I don't understand too well. (A sketch of what I have in mind is below.)
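For what it's worth, a minimal sketch of the layout described in (A) and (B) - names and versions are illustrative:

# unpack third-party apps under /opt rather than /root
mkdir -p /opt/tomcat
tar xzf apache-tomcat-7.0.x.tar.gz -C /opt/tomcat --strip-components=1
# a dedicated unprivileged user/group that only runs the apps
groupadd appusers
useradd -m -g appusers -s /bin/bash tom
chown -R tom:appusers /opt/tomcat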
I have an Ubuntu 14.04 (64-bit) + KVM host with 2 NICs:

- eth0 connected to the "public" network
- eth1 connected to the br0 bridge with a private IP address range
From the host I can access the internet, ping the VM guest, and connect to it via SSH. From the VM guest I can only ping the host; I cannot access the internet and cannot ping google.com.
Please help me connect the VM guest to the internet in the setup described below:
auto br0
iface br0 inet static
    address 10.0.0.1
    netmask 255.255.255.0
    bridge_ports eth1
    bridge_stp off
    bridge_maxwait 0
    bridge_fd 0
    # Create and destroy the bridge automatically.
    pre-up brctl addbr br0
    pre-up ip link set dev br0 up
    post-up /usr/sbin/brctl setfd br0 0
    post-up brctl addif br0 eth1
    post-down brctl delbr br0
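Since br0 has no uplink of its own, I suspect the missing piece is forwarding/NAT on the host (a sketch, assuming the guests use 10.0.0.1 as their gateway):

sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth0 -j MASQUERADE
# guests also need a resolver, e.g. nameserver 8.8.8.8 in their /etc/resolv.conf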
I recently reserved an EC2 instance for 36 months. I plan on moving my web app over to it very soon, but I started doing some research: Amazon charges for data transfer out. Basically, the entire reason I'm making the move to EC2 is that my web app is growing in size, but the problem is that my web app gets DDoS attacks from time to time. What's to stop me from getting a $10k bill because of a DDoS attack? (Or is that data transfer in?)
So what I'm asking is: what uses data transfer out and what uses data transfer in? The web app consists of a few JS and CSS files as well as 50 images tops; every other image is hosted on imgur.
However, if I add a user with /bin/false (which I want, since this is to be something like shared hosting and I don't want users to have shell access), the script runs under the numeric '1001', '1002' user, which, as my Google searches showed, might be a security hole. My question is: is it possible to allow users to execute shell scripts but prevent them from logging in via SSH?
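One pattern that might fit (a sketch; I'm not sure it addresses the numeric-UID concern): give each user a real account with a disabled login shell, and launch their scripts with an explicit shell:

useradd -m -s /usr/sbin/nologin alice         # no interactive or SSH shell
su -s /bin/sh -c '/home/alice/run.sh' alice   # still runs the script as alice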
We recently purchased a few HP DL380p G8 servers to add some VM capacity, and decided to add a pool of SSDs to our standard build to create a "fast" RAID 1+0 array for some of our VMs that have higher performance requirements (e.g. log servers and DBs). Since the HP drives are super-duper expensive, we went with Plextor PX-512M5Pro SATA SSDs, since we had good luck with Plextor SSDs in our previous G7 servers.
However, in 3 out of 3 servers, 3 of the 4 drives have entered the failed state shortly after being configured, before we even attempted to put them to use. The consistency of the failures leads me to believe it's an incompatibility between the RAID controller and the drives, and when the RMA replacements arrive I'm assuming they'll fail too. Any tips or tricks that might help with this issue, besides just buying the official HP drives?
This happens after a user's mailbox is moved from one Exchange server to another (within one organization, but different domains).
Details:
From: postmaster
Sent: Saturday, April 03, 2010 8:43 AM
To: USER1
Subject: Undeliverable: subject of message
Delivery has failed to these recipients or distribution lists:
IMCEAex-_O=DOMAIN_OU=First+20Administrative+20Group_cn=Recipients_cn=USER1@domain
A problem occurred during the delivery of this message. Microsoft Exchange will not try to redeliver this message for you. Please try resending this message later, or provide the following diagnostic text to your system administrator.
Answer
You can't get far troubleshooting this with just the non-delivery report you posted above. There are two things you should do. First, get into Exchange System Manager and use the Message Tracking feature to track the message. Often it doesn't tell you much beyond the fact that an NDR was generated, but sometimes there are clues (for instance, did the user's mailbox server generate the NDR, or was the message forwarded to another server that then generated it?). Second, look in the Application log of the server that generated the NDR for any associated errors. If you don't find anything, turn diagnostic logging for Transport up to Maximum on that server and send a few messages. You'll certainly get a few applicable errors in the Application log at that point, and you'll then have something more useful to work with for troubleshooting purposes (including potentially posting an update here with the specific error).
For the purposes of a startup, I have one physical dedicated server (on loan) with several virtual machines inside it.
For now there are mainly two VMs on this server:
VM "tools", using ubuntu server 10.04 LTS
A source code repository using mercurial and hgserve A bunch of
JAVA app for Atlasian for bug
tracking, wiki...
PostgreSQL as the Database for the tools
Apache HTTPD as HTTPS front end.
VM "asterisk", using ubuntu server 10.04 LTS
with an asterisk server, functionnal,
but more for testing as of now than anything.
But in the future we will have a "production" VM with our Java application:
- Apache HTTPD front end
- PostgreSQL database
- Tomcat webapp (maybe clustered)
What I'm interested in is software that can monitor the availability of services, KVM VMs, applications, and the database, so I can react in case of problems.
I also have another use case where I'd like to monitor the performance of the application (requests, CPU, memory...) and gather usage statistics.
We have basically no money, and want a free tool, at least at first.
What would be an easy and simple tool for the job? I have heard of Nagios and Hyperic, but I don't know them, so I don't know whether they are suited to our needs.
EDIT:
The need is not only for server monitoring but also a way to investigate actual application performance and responsiveness, and if possible isolate bottlenecks.
From the links (not the same question, as it seems more generic, but quite informative) and the actual responses, Nagios + Munin should be a good fit. The problem is that Nagios seems a little complex (I don't know about Munin). Will the Nagios/Munin combo be able to gather detailed statistics and historical data for a Java application (requests/second, request latency, both with statistics by URL, hour, day, week...)?
Are there other (better?) alternatives?
Answer
Nagios. I was scared of the text configuration for a long time and tried all the other popular or remotely popular solutions out there, but was never satisfied. Once I eventually tried Nagios and actually went through the configuration, I loved it, and I actually find it much easier to configure and customize than GUI tools like Zenoss.
While I have not done this yet, you could combine this with Monit to automatically try to recover from problems, and with Munin to collect historical data.
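(For a feel of what that looks like, a hedged Monit sketch for the Apache front end mentioned in the question; the pid file and init script paths are assumptions for Ubuntu 10.04:)

check process apache with pidfile /var/run/apache2.pid
    start program = "/etc/init.d/apache2 start"
    stop program  = "/etc/init.d/apache2 stop"
    if failed host 127.0.0.1 port 80 protocol http then restart
    if 5 restarts within 5 cycles then timeout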
Edit:
Documentation for setting up Nagios and for Munin. It's Ubuntu-specific, but I actually followed it to configure Nagios on Red Hat.
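(To give a feel for the text configuration: a minimal host/service pair might look like the following. The host name and address are assumptions; the linux-server and generic-service templates and the check_http command ship with the sample configuration.)

define host {
    use        linux-server       ; inherit defaults from the sample template
    host_name  tools-vm
    alias      Tools VM
    address    10.0.0.10
}

define service {
    use                  generic-service
    host_name            tools-vm
    service_description  HTTP
    check_command        check_http
}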
The other day, we noticed a terrible burning smell coming out of the server room. Long story short, it ended up being one of the battery modules burning up in the UPS unit, but it took a good couple of hours to figure that out. The main reason we were able to figure it out was that the UPS display finally showed that the module needed to be replaced.
Here was the problem: the whole room was filled with the smell. Doing a sniff test was very difficult because the smell had infiltrated everything (not to mention it made us light-headed). We almost mistakenly took our production database server down because that's where the smell was strongest. The vitals appeared to be OK (CPU temps around 60 degrees C, fan speeds OK), but we weren't sure. It just so happened that the battery module that burnt up was at about the same height as the server in the rack and only 3 ft away. Had this been a real emergency, we would have failed miserably.
Realistically, actual server hardware burning up is fairly rare, and most of the time the UPS will be the culprit. But with several racks each holding several pieces of equipment, it can quickly become a guessing game. How does one quickly and accurately determine which piece of equipment is actually burning up? I realize this question depends heavily on environmental factors such as room size, ventilation, and location, but any input would be appreciated.
Answer
The general consensus seems to be that the answer to your question comes in two parts:
How do we find the source of the funny burning smell?
You've got the "How" pretty well nailed down:
The "Sniff Test"
Look for visible smoke/haze
Walk the room with a thermal (IR) camera to find hot spots
Check monitoring and device panels for alerts
You can improve your chances of finding the problem quickly in a number of ways; improved monitoring is often the easiest. Some questions to ask (a rough example check follows this list):
Do you get temperature and other health alerts from your equipment?
Are your UPS systems reporting faults to your monitoring system?
Do you get current-draw alarms from your power distribution equipment?
Are the room smoke detectors reporting to the monitoring system? (and can they?)
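(As a rough illustration of the first point, a host with IPMI can be polled from cron or from a monitoring check. A hedged sketch, where the "Ambient" sensor name, the 45 C threshold, and the mail recipient are all assumptions that vary by hardware:)

#!/bin/sh
# Pull the ambient temperature via IPMI and alert above a threshold.
TEMP=$(ipmitool sdr type Temperature | awk -F'|' '/Ambient/ {print $5}' | grep -o '[0-9]\+' | head -n1)
if [ -n "$TEMP" ] && [ "$TEMP" -gt 45 ]; then
    echo "WARNING: ambient temperature ${TEMP}C" | mail -s "Server room temperature alert" ops@example.com
fi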
When should we troubleshoot versus hitting the Big Red Switch?
This is a more interesting question. Hitting the big red switch can cost your company a huge amount of money in a hurry: clean-agent releases can run into the tens of thousands of dollars, and the outage/recovery costs after an emergency power-off (EPO, "dropping the room") can be devastating. You do not want to drop a datacenter because a capacitor in a power supply popped and made the room smell.
Conversely, a fire in a server room can cost your company its data/equipment, and more importantly your staff's lives. Troubleshooting "that funny burning smell" should never take precedence over safety, so it's important to have some clear rules about troubleshooting "pre-fire" conditions.
The guidelines that follow are my personal limitations that I apply in absence of (or in addition to) any other clearly defined procedure/rules - they've served me well and they may help you, but they could just as easily get me killed or fired tomorrow, so apply them at your own risk.
If you see smoke or fire, drop the room This should go without saying but let's say it anyway: If there is an active fire (or smoke indicating that there soon will be) you evacuate the room, cut the power, and discharge the fire suppression system. Exceptions may exist (exercise some common sense), but this is almost always the correct action.
If you're proceeding to troubleshoot, always have at least one other person involved This is for two reasons. First, you do not want to be wandering around in a datacenter and all of a sudden have a rack go up in the row you're walking down and nobody knows you're there. Second, the other person is your sanity check on troubleshooting versus dropping the room, and should you make the call to hit the Big Red Switch you have the benefit of having a second person concur with the decision (helps to avoid the career-limiting aspects of such a decision if someone questions it later).
Exercise prudent safety measures while troubleshooting Make sure you always have an escape path (an open end of a row and a clear path to an exit). Keep someone stationed at the EPO / fire suppression release. Carry a fire extinguisher with you (Halon or another clean agent, please). Remember rule #1 above. When in doubt, leave the room. Mind your breathing: use a respirator or an oxygen mask; this might save your health in case of a chemical fire.
Set a limit and stick to it More accurately, set two limits:
Condition ("How much worse will I let this get?"), and
Time ("How long will I keep trying to find the problem before its too risky?").
The limits you set can also be used to let your team begin an orderly shutdown of the affected area, so when you DO pull power you're not crashing a bunch of active machines, and your recovery time will be much shorter, but remember that if the orderly shutdown is taking too long you may have to let a few systems crash in the name of safety.
Trust your gut If you are concerned about safety at any time, call the troubleshooting off and clear the room. You may or may not drop the room based on a gut feeling, but regrouping outside the room in (relative) safety is prudent.
If there isn't imminent danger, you may elect to bring in the local fire department before taking any drastic actions like an EPO or clean-agent release. (They may tell you to do so anyway: their mandate is to protect people first, then property, but they're obviously the experts in dealing with fires, so you should do what they say!)
We've addressed this in the comments, but it may as well be summarized in an answer too. @DeerHunter, @Chris, @Sirex, and many others contributed to the discussion.