I've got a MySQL server which has both InnoDB and MyISAM tables. The InnoDB tablespace is quite small, under 4 GB. MyISAM is big: ~250 GB in total, of which 50 GB is for indexes.
Our server has 32 GB of RAM, but it usually uses only ~8 GB. Our key_buffer_size is only 2 GB, yet our key cache hit ratio is ~95%. I find that hard to believe.
We have 52 GB of index data, and a 2 GB key cache is enough to get a 95% hit ratio? Is that possible? On the other hand, the data set is 200 GB, and since MyISAM relies on OS (CentOS) caching for data, I would expect it to use a lot more memory to cache accessed MyISAM data. But at this stage I see that the key_buffer is completely used, and our InnoDB buffer pool size is 4 GB and is also completely used; that adds up to 6 GB. Does that mean data is cached using just 1 GB?
My question is: how can I check where all the free memory is being used? And how can I check whether MyISAM data reads are hitting the OS cache instead of disk?
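A rough way to approximate both with standard Linux tools (this is the part I'm unsure about):

free -m        # the "cached" column is the OS page cache that would hold MyISAM data
vmstat 5       # "bi" (blocks in) shows reads actually hitting disk
iostat -x 5    # per-device read rates; low r/s under load suggests cache hits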
I have a PC with a 256 GB M.2 SSD drive used for Linux Mint root, swap and home partitions right now. I have a 2TB HDD for all my data (I keep that separate from home dir actually, because users' home folder has user configuration).
I would like to migrate that system to ZFS to take advantage of checksums, snapshots, transparent compression and the power-loss robustness.
Since I didn't build the hardware with ZFS in mind I wonder now what would be the best way to utilize this setup.
I do a lot of video production among other things, so I have big files (10 GB) being thrown around a lot. I thought about using the SSD as an SLOG, but wouldn't that unnecessarily wear the SSD out when I'm capturing and rendering video files? HDD speed is not a bottleneck for me in this regard anyway.
Should I create separate pools for SSD and HDD? Can I easily move datasets between the pools with send/receive commands?
Could I partition the SSD to use it both for root partition and SLOG for HDD? Wouldn't that defeat wear leveling and kill my SSD faster? Wouldn't that kill SSDs performance benefits?
As a side-note:
I might add a second 2 TB HDD later as a mirror to gain redundancy. I also have two 1 TB USB 3.0 drives that I used with ZFS in RAID 0 for a while. They seem to work pretty well and can handle ~130 MB/s write speed. I wonder if using them as a mirror vdev for my main HDD would be a good idea. They have proven stable for a few months (I know USB can be problematic with ZFS - I played with that quite a lot too).
I currently do rdiff-backup to an external 3TB USB 3.0 drive.
I have deployed ZFS a few times so far, once in a production environment for the system root and data storage on a 2-disk mirror. I have never used SSDs with it, though.
The problem we are facing is that session files inside the tmp folder are being sporadically deleted and we can't tell why. Sessions will last anything from about 10 minutes to 40 minutes, when they should be lasting 12 hours.
This is a virtual host environment, but none of the code we use in these sites overrides this setting (with ini_set, apache config PHP values or otherwise) so we can't see why they are being deleted. There are also no scheduled tasks deleting the files.
Is there a way to successfully figure out why gc_maxlifetime is being ignored? For the record, I changed one of our sites to use session_save_path('D:/wamp/tmptmp'); temporarily, just to double-check it was the garbage collection, and session files remain in there untouched - though admittedly this doesn't really give many more clues.
I made a stupid mistake. I have a small vhosted CentOS server, and while configuring git+gitosis I think I ran the gitosis SSH key init on my own user in addition to the git user. I didn't realize it at the time, but now when I try to SSH to the server with my user I get:

PTY allocation request failed on channel 0
ERROR:gitosis.serve.main:Need SSH_ORIGINAL_COMMAND in environment.
Connection to [servernamehere] closed.

Any idea how I can log in to the server again? Unfortunately I have disabled SSH root login.
debug2: channel 0: input open -> closed
debug3: channel 0: will not send data after close
debug2: channel 0: almost dead
debug2: channel 0: gc: notify user
debug2: channel 0: gc: user detached
debug2: channel 0: send close
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
  #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)
debug3: channel 0: close_fds r -1 w -1 e 7 c -1
Connection to myserver closed.
Answer
Sorry Cristian, no easy answer to this: your user can now only interact with git.
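For background (a sketch of the mechanism, assuming a standard gitosis install): gitosis works by prefixing the key in ~/.ssh/authorized_keys with a forced command, along the lines of:

command="gitosis-serve cristian@server",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAA...

Every login with that key is therefore captured by gitosis-serve. If you can reach the filesystem some other way (your VPS provider's console, another sudo-capable account), deleting the command="..." prefix and the surrounding options from your own user's key line should restore a normal shell.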
So far, in order to make it remotely functional I've had to add:
Internet Gateway
Route Table
When I deploy the template, it creates successfully. I can look through my console and see all of the components.
What I'm having a problem with is connecting to the instance afterward. Attempts to connect via ssh time out.
I think I've narrowed this problem down to the fact that when the stack is built, two route tables are deployed. One is marked as main and has no default route added to it; the other one I explicitly define in my template and add the default route to.
If I add the default route to the template-defined table after the fact, I can SSH in.
I guess my questions are:
how do I mark the table that I'm creating in the template as the main table, or
how do I tell CloudFormation to not create the default table that is being marked as main, or
how do I get the default route into the main table?
Template:
AWSTemplateFormatVersion: 2010-09-09
Resources:
  vpcCandidateEyMm7zuOcn:
    Type: 'AWS::EC2::VPC'
    Properties:
      CidrBlock: 192.168.0.0/16
      EnableDnsHostnames: 'true'
      EnableDnsSupport: 'true'
      InstanceTenancy: default
      Tags:
        - Key: Test
          Value: Test
    Metadata:
      'AWS::CloudFormation::Designer':
        id: 052446e9-ed29-4689-8eb2-2006482f7a65
  IgCandidateEyMm7zuOcn:
    Type: "AWS::EC2::InternetGateway"
    Properties:
      Tags:
        - Key: Test
          Value: Test
  AigCandidateEyMm7zuOcn:
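Not part of the template above, but for what it's worth: the usual way to sidestep the main table entirely is to associate each subnet explicitly with the route table defined in the template (resource names here are hypothetical):

  SubnetRtbAssociation:
    Type: 'AWS::EC2::SubnetRouteTableAssociation'
    Properties:
      SubnetId: !Ref subnetCandidate
      RouteTableId: !Ref rtbCandidate

A subnet with an explicit association no longer falls back to the main route table, so the default route you define takes effect.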
The workstation is running Windows 8.1 and joined to the domain. There are a few group policies that deploy software. The "Always wait for the network at computer startup and logon" setting is disabled. Software deployment used to take 1 or 2 logons to apply the changes, as expected.
But something happened, and now I can't get software deployment to apply the changes at all. rsop.msc says: "Software Installation did not complete policy processing because a system restart is required for the settings to be applied. Group Policy will attempt to apply the settings the next time the computer is restarted." Restarting has no effect: the group policies are still processed in asynchronous mode, and rsop.msc says just the same. Rejoining the domain doesn't help. No error is in the Event log.
The question is: what is the switch/flag that tells Windows to enable synchronous processing of the group policies?
Answer
The reason was the startup policy processing timeout: the default value of 30 s sometimes isn't enough. Setting the "Specify startup policy processing wait time" policy (Computer Configuration - Policies - Administrative Templates - System - Group Policy) to 120 s fixed the issue.
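If you need to push it outside of Group Policy, my understanding (worth verifying) is that this policy maps to the following registry value:

rem wait up to 120 seconds for the network at startup
reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows\System" /v GpNetworkStartTimeoutPolicyValue /t REG_DWORD /d 120 /f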
I have a Storage Spaces pool on a server that is not currently power protected. It consists only of mechanical hard drives.
I want to compare performance of write caching enabled vs disabled so I can make a decision about how much of a priority power protection should be.
Is it sufficient to disable write caching through Device Manager -> Policies for each physical hard drive, or do I need to disable it anywhere else through Storage Spaces?
For example, it is enabled on the Microsoft Storage Space Device entry in Device Manager, and cannot be disabled. I'm not sure if that is important at all.
Sorry, I am new to this DNS jargon, so the following question might sound obvious.
I have a VPS and I am using a third-party DNS service for managing DNS. My DNS provider had given me his own name servers, which I replaced with my own using vanity name servers.
Everything is working fine, and I have also updated the glue records and name servers at my domain registrar. But when I run a DNS test I get the following error:
Inconsistent glue for name server ns2.domain.com. The address of a name server differed from the child and the parent. This is a configuration error and should be corrected as soon as possible.
Error explanation.
Inconsistent glue for name server.
Pingdom: On at least one of the parent-side name servers listed, there was an A or AAAA record, and a record with the same name but different content was found on at least one of the child-side name servers.
I can't understand what's causing this error. Is the problem with my DNS provider or with my domain registrar?
Answer
It is very simple.
For the name ns2.domain.com, you have one A (or AAAA) record in your zone, and one glue record in the TLD's zone. They point to different addresses.
The glue record is set up by your registrar. The other one is set up by you in your DNS configuration. If the address you have set for the ns2 host yourself is right, let your registrar know they need to update the glue records for your domain (and give them the new IP). If it is wrong, change the A or AAAA record for your ns2 host to match what is in the glue.
You can find what is in the glue record by querying the TLD nameservers. Here is an example for com. on Linux:
dig @m.gtld-servers.net in ns google.com
This will typically return an additional section with the A records for the nameservers. You could also change the query to be more direct:
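For example, asking the TLD server for the nameserver's own address record (a guess at the intended query, using the same public names as above):

dig @m.gtld-servers.net in a ns1.google.com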
I am running a website with multiple subdomains, each with its own SSL certificate.
It's time to renew, and while the current certificate uses SAN entries, one for each subdomain, I am questioning the convenience of this approach, as I can find three individual SSL certificates for an apparently cheaper price than a single SSL certificate with three SAN entries. I am using IIS 8.5.
Are there any particular drawbacks to going with multiple individual SSL certificates?
Answer
Multiple SSL certificates on a single IP address require SNI. If your clients do not support SNI, they'll end up on whatever your default binding is. A SAN certificate is also potentially less maintenance, as it's just one cert to renew every year or two.
But you are correct in your assumption that SAN certificates are more expensive than three individual certs. This is because the certificate business is run like a cartel. So if you want to save a few bucks and are willing to sacrifice clients that don't support SNI, go for it. Get your three individual certs at a fraction of the price of a SAN certificate.
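For what it's worth, adding an SNI-enabled HTTPS binding in IIS 8.x from PowerShell looks roughly like this (site and host names are placeholders):

Import-Module WebAdministration
# -SslFlags 1 marks the binding as requiring SNI
New-WebBinding -Name "MySite" -Protocol https -Port 443 -HostHeader "sub1.example.com" -SslFlags 1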
My client's Active Directory database was corrupted; there was no way to recover it, and they didn't have any backups. So I had to remove and reinstall Active Directory.
Now I have to reconfigure the workstations for the new Active Directory. What is the fastest way to keep users' desktops while setting them up against the new AD?
From what I've read, it seems like DNS failover is not recommended simply because DNS wasn't designed for it. But if you have two webservers on different subnets hosting redundant content, what other methods are there to ensure that all traffic gets routed to the live server if one server goes down?
To me it seems like DNS failover is the only failover option here, but the consensus is it's not a good option. Yet services like DNSmadeeasy.com provide it, so there must be merit to it. Any comments?
This has always bugged me and I've never got around to figuring out why Apache does this, I always resorted to the mod_vhost plugin to work around the issue.
Basically, I have 2 vhosts in sites-enabled (Ubuntu server), their contents:
DocumentRoot "/var/www/vhosta.domain.com/" ServerName vhosta.domain.com allow from all Options +Indexes
And
DocumentRoot "/var/www/vhostb.domain.com/" ServerName vhostb.domain.com allow from all Options +Indexes
Now logically these two would be accessible separately; however, it seems that all requests to my server, no matter what vhosts I define on top of this, go to vhosta.domain.com.
Am I missing something incredibly obvious? I really don't get why it's doing this.
Answer
You are missing a NameVirtualHost directive; however:
DO NOT use VirtualHost *; use VirtualHost *:80 instead.
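Putting both together, a minimal sketch (paths and names follow the question; the rest is illustrative):

NameVirtualHost *:80

<VirtualHost *:80>
    ServerName vhosta.domain.com
    DocumentRoot "/var/www/vhosta.domain.com/"
</VirtualHost>

<VirtualHost *:80>
    ServerName vhostb.domain.com
    DocumentRoot "/var/www/vhostb.domain.com/"
</VirtualHost>

Without a NameVirtualHost matching the vhosts' address:port, Apache treats the first vhost as the default and sends every request there - which is exactly the symptom described.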
I have a storage system that contains 8 x 1TB drives that use the 4k sector size "Advanced Format". I'm planning to run NexentaStor on this hardware and want to ensure that I'm taking the 4k sector size into account. Is there anything special I need to keep in mind when creating the root pool and subsequent data pools with ZFS?
Answer
ZFS handles 4k sectors well as long as the drive advertises them correctly.
However, some drives have 4k sectors internally but present a logical 512 sector size to the operating system for backwards compatibility. If ZFS believes the drive, and writes in 512 byte chunks to 4k sectors, you'll suffer a heavy read-modify-write penalty.
Have a look at the Solarismen blog:
If your drive reports a sector size of 4k, you're fine. If your drive reports a sector size of 512, you may be able to work around it by using the modified zpool binary from the same site:
The modified binary hardcodes the sector size to 4k. Note that you only need to use it for the initial zpool creation. This may be a bit difficult for your root pool - you may need to slipstream the modified binary in to the NexentaStor ISO.
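As a sanity check (pool name is a placeholder), you can confirm what the pool was actually created with:

zdb -C tank | grep ashift    # ashift: 12 means 4k sectors, ashift: 9 means 512-byte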
I have 4 drives: 2x 640GB and 2x 1TB. My array is made up of four 640GB partitions at the beginning of each drive. I want to replace both 640GB drives with 1TB drives. I understand I need to: 1) fail a disk, 2) replace it with the new one, 3) partition, 4) add the disk to the array.
My question is, when I create the new partition on the new 1TB drive, do I create a 1TB "Raid Auto Detect" partition? Or do I create another 640GB partition and grow it later?
Or perhaps the same question could be worded: after I replace the drives, how do I grow the 640GB RAID partitions to fill the rest of the 1TB drive?
fdisk info:
Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xe3d0900f

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       77825   625129281   fd  Linux raid autodetect
/dev/sdb2           77826      121601   351630720   83  Linux

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xc0b23adf

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1       77825   625129281   fd  Linux raid autodetect
/dev/sdc2           77826      121601   351630720   83  Linux

Disk /dev/sdd: 640.1 GB, 640135028736 bytes
255 heads, 63 sectors/track, 77825 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x582c8b94

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1       77825   625129281   fd  Linux raid autodetect

Disk /dev/sde: 640.1 GB, 640135028736 bytes
255 heads, 63 sectors/track, 77825 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0xbc33313a

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1               1       77825   625129281   fd  Linux raid autodetect

Disk /dev/md0: 1920.4 GB, 1920396951552 bytes
2 heads, 4 sectors/track, 468846912 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk identifier: 0x00000000
Answer
My question is, when I create the new partition on the new 1TB drive, do I create a 1TB "Raid Auto Detect" partition?
You can, but you're not going to gain anything immediately from that.
Or do I create another 640GB partition and grow it later?
Yes.
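A sketch of that replace-and-grow cycle with mdadm (device names are examples, not taken from your fdisk listing):

mdadm /dev/md0 --fail /dev/sdd1 --remove /dev/sdd1   # retire one old 640GB member
# ...swap in the new drive and partition it...
mdadm /dev/md0 --add /dev/sdd1                       # let the array rebuild
# once every member has been replaced and its partition enlarged:
mdadm --grow /dev/md0 --size=max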
RAID-on-partition has its uses, but when you're using the drives in a pseudo-storage-pool setup, you're sometimes better off using 'whole drive' instead of 'partition' RAID members. Designating the whole drive (i.e. /dev/sdc instead of /dev/sdc1) has the advantage of implicitly telling the RAID mechanism that the entire drive is to be used, and therefore no partition needs to be created/expanded/moved/what-have-you. This turns the hard drive into a 'storage brick' that is more-or-less interchangeable, with the caveat that the largest 'chunk size' in your 'set of bricks' will be the smallest drive in the set (i.e. if you have a 40gb, 80gb, and 2x 120gb, the RAID mechanism will use 4x 40gb because it can't obtain more space on the smallest drive). Note that this answer is for Linux software RAID (mdadm) and may or may not apply to other environments.
The downside is that if you need flexibility with your RAID configuration, you will lose that ability, because the entire drive will be claimed. You can, however, offset that loss through the use of LVM-on-RAID. Another issue with whole-drive RAID is that some recovery processes will require a little more thought, as they often assume the presence of a partition. If you use a tool that expects a partition table, it may balk at the drive.
Unsolicited Advice (and nothing more than that, if it breaks, you keep both pieces, etc.):
Your best bet is to set up your RAID array as you like, using the 'whole drive' technique, but then use LVM to manage your partitions. This gives you a smidgeon of fault-tolerance with RAID, plus the flexibility of dynamically sizable partitions. An added bonus: if you use Ext3 (and possibly Ext2 supports this, not sure), you can resize the 'partitions' while they are mounted. Being able to shift the size of mountpoints around while they are 'hot' is a wonderful feature, and I recommend considering it.
Additional follow-up:
I received a comment that Ext2 does not support hot resizing. In reality, it does, but only for increases in size. You can read more at this link here. Having done it a few times myself, I can say that it is possible, it does work, and can save you time.
We have a continuous integration server which runs a series of tests over several different client OSes (different versions of Windows and OS X). It runs on an Apple Xserve running OS X Leopard Server, and the clients run within VMware Fusion - this combination was chosen because Apple's licensing only allows OS X Server VMs to run on Apple hardware.
The CI system uses the VMware tools to communicate with the Windows clients, but this does not work properly with OS X, so it uses SSH in those cases. However, every so often, the network will drop out fairly quickly after VM startup.
The VMs are configured to use host-only networking, and the Windows VMs, while slow, don't seem to have any connection issues.
My Active Directory domain's name is "mywebsite.com" instead of "mywebsite.local". I had to do this as a workaround for other issues, and changing it would be a pain. When people inside my company visit "mywebsite.com", it resolves to our DC instead of our website. How can I make it go to our website instead?
Answer
You can't.
You can define any hostname or subdomain you want in your AD's main DNS zone, but for AD to work properly, the A records for the domain itself must point to your domain controllers.
So, having "www.mywebsite.com" pointing to your web site is fine, but having "mywebsite.com" do the same is not.
Addendum: hosting your web site on your DC would of course fix the issue, but I strongly advise you against that; DCs are definitely not meant to host web sites.
This is my first time configuring a VPS server and I'm having a few issues. We're running Wordpress on a 1GB Centos server configured per the internet (online research). No custom queries or anything crazy but closing in on 8K posts. At arbitrary intervals, the server just goes down. From the client side, it just says "Loading..." and will spin more or less indefinitely. On the server side, the shell will lock completely. We have to do a hard reboot from the control panel and then everything is fine.
Watching "top" I see it hovering between 35 - 55% memory usage generally and occasional spikes up to around 80%. When I saw it go down, there were about 30 - 40 Apache processes showing which pushed memory over the edge. "error_log" tells me that maxclients was reached right before each reboot instance. I've tried tinkering with that but to no avail.
I think we'll probably need to bump the server up to the next RAM level but with ~120K pageviews per month, it seems like that's a bit overkill since it was running fairly well on a shared server before.
Any ideas? httpd.conf and my.cnf values to add? I'll update this with the current ones if that helps.
Thanks in advance! This has been a fun and important learning experience but, overall, quite frustrating!
Answer
If your server cannot handle spinning up 30-40 httpd processes (it can't), then don't let it. I go into a lot of detail regarding LAMP configuration in my answer to this question. The examples I give are for a 512 MiB VPS, so don't just blindly copy the configuration "per the internet". :)
Short version: scale back your httpd MaxClients and ServerLimit variables to prevent 30+ httpd processes from spinning up. I'd start with something like 10 or 15 depending on the average size of your processes, and how much room you've given MySQL. Note that httpd's behavior will be to refuse requests when all client processes are busy.
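As a starting point, something like this in httpd.conf (a sketch for a 1 GB VPS; the numbers are illustrative, not tuned to your workload):

<IfModule mpm_prefork_module>
    StartServers           3
    MinSpareServers        3
    MaxSpareServers        5
    ServerLimit           12
    MaxClients            12
    MaxRequestsPerChild 1000
</IfModule>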
I used to have a RAID 0 setup but I ditched it and restarted from scratch with a plain non raid single disk system. The second disk is just used for general storage.
Well, today I booted the machine, and for whatever reason the BIOS booted from the secondary drive, which still had the old RAID 0 boot record, OS, and everything. It took me a few minutes to figure out what went wrong; I changed the boot order in my BIOS and my 'new' machine came back.
How do I blow away the bootability of the old secondary drive without destroying any of the partitions on it? The MBR holds the partition table, so I can't just wipe the whole thing, but I can't find any partition tool that will erase the MBR boot code while keeping the partition table.
This is an Ubuntu 10.10 setup.
Answer
The actual MBR boot loader is located on the first sector of the hard drive. This sector also contains the partition table for the drive, so care must be taken to remove the boot loader code without removing the partition table. The following command should do it:
dd if=/dev/zero of=/dev/sdX bs=446 count=1

(Here /dev/sdX is the old secondary drive. Reading from /dev/zero rather than /dev/null is required so that 446 zero bytes are actually written; the boot code occupies the first 446 bytes of the sector, while the partition table that follows is left untouched.)
Slightly more information is at this URL. WARNING: Before making this modification, have good backups available.
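For instance (my suggestion, not part of the command above), keep a copy of the full first sector before zeroing the boot code:

dd if=/dev/sdX of=/root/sdX-mbr-backup.bin bs=512 count=1   # restore by reversing if= and of=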
I have a VM config currently where I have 1 external IP pointing to a VM with an Nginx HTTP Reverse proxy serving HTTP Web Pages from several internal VMs without external IPs.
I need a similar setup to redirect SSH requests for certain hostnames to internal servers. This is for a hosted git repository which sits behind my proxy and thus has no external IP of its own, so I need some kind of iptables rule to let me reverse proxy/forward SSH requests to the corresponding VMs.
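One caveat: unlike HTTP, the SSH protocol carries no hostname a proxy could route on, so iptables can only distinguish by destination port or IP. A sketch of port-based forwarding (addresses and ports are examples):

sysctl -w net.ipv4.ip_forward=1
# SSH arriving on the proxy's port 2222 goes to the git VM
iptables -t nat -A PREROUTING -p tcp --dport 2222 -j DNAT --to-destination 10.0.0.5:22
iptables -A FORWARD -p tcp -d 10.0.0.5 --dport 22 -j ACCEPT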
I know that this question must have been asked (and answered) before, but I can't find a solution for my problem in any of those questions. It's a bit odd... The problem is that my PHP scripts (and my Apache server) cannot write to folders on my system. Not at all.
For example, I get the following error while running a script:
I have a VPS with CentOS 7, Apache 2.4, PHP 5.6 (which runs as the Apache PHP module), and some other stuff. Apache runs as user apache and group apache (as set in the httpd.conf file). I have set the session_path in both /etc/php.ini and /etc/httpd/conf.d/php.conf to /tmp/phpsessions and chown'd/chmod'd this folder to apache:apache with 777 permissions. The above example stores sessions in another folder (also chown'd/chmod'd to apache:apache with 777 permissions), but I get the same error for other folders.
So my Apache server runs as apache:apache, I chown'd the folders to apache:apache where needed, and even with 777 permissions Apache fails to write to these folders.
Have you ever seen something like this? I haven't before...
Answer
Assuming permissions and ownerships are OK, I believe this relates to SELinux.
Quick and dirty way: assuming getenforce returns Enforcing, try disabling SELinux by running setenforce 0 and hit your script again. If it works, then it was SELinux; from there you can either leave it disabled (not recommended) or turn it back on with setenforce 1, check /var/log/audit/audit.log, and work towards a proper solution.
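The usual targeted fix, rather than leaving SELinux off, is to label the directory writable for httpd (a sketch, using the session path from the question):

semanage fcontext -a -t httpd_sys_rw_content_t "/tmp/phpsessions(/.*)?"
restorecon -Rv /tmp/phpsessions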
I'm running Apache2 on Ubuntu 10.04 LTS. I've configured /etc/php5/apache2/php.ini with the following:
error_reporting = E_ALL & ~E_DEPRECATED
display_errors = Off
log_errors = On
error_log = /var/log/php_errors.log
I have a PHP application that I believe has an error in it. With those settings and the suspected error, no php_errors.log was created. So, I created a PHP file with syntax errors. Still no php_errors.log. What are some reasons why PHP would not log errors?
Could it be permissions? I see that /var/log is owned by root:root. I don't know what user the PHP process runs as. I believe root starts the Apache process, which then spawns new workers as www-data:www-data for serving files. Would that mean I need to change permissions on the /var/log folder?
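A quick test of that theory (just a guess at the cause): www-data cannot create new files in /var/log, so pre-create the log with the right owner and see whether it starts filling:

touch /var/log/php_errors.log
chown www-data:www-data /var/log/php_errors.log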
I am not very familiar with dynamic DNS, and was curious if I could get it to work for a certain use case.
I have a few Raspberry Pis I'm setting up to mock, at small scale, the server setups of the applications we use in our larger environment. They have wireless capability. I'd like to be able to throw them in a bag and work with them from my laptop in various settings. However, working out their IPs and addresses on every new network would be quite annoying (having to change the endpoints all of the applications/configurations refer to).
I can have normal DNS A Records point to internal IPs and they work great while on private networks. However this is less ideal for changing IPs. Would I be able to use Dynamic DNS to resolve the DNS records to internal addresses? (Such that connecting to a new wireless network all of the lookups would work after everything is connected without having to monkey with the router, custom dns server, etc.)
Initial research indicates dynamic DNS usually resolves to the external IP, whereas in this case I wish to resolve automatically to the address obtained on a specific interface of each client, e.g. eth0.
Answer
The simplest way forward would be to use mDNS to do "ad-hoc" DNS resolution amongst the machines on the same subnet. This is, basically, as simple as installing avahi-daemon and libnss-mdns (Debian package names; adjust as appropriate) and making sure your firewall isn't blocking 5353/udp. This will cover both forward and reverse DNS entries, and create resolvable names of the form hostname.local for all other machines on the local subnet.
If you need naming which is available beyond the local multicast domain, you'll probably want to set up a DNS server somewhere on the Internet which accepts TSIG-authenticated UPDATE queries, and then configure your client machines to send updates using nsupdate (or some other equivalent means).
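A client-side update would then look something like this (server, zone, and key are hypothetical):

nsupdate -k /etc/dyn.key <<'EOF'
server ns.example.net
zone dyn.example.net
update delete pi1.dyn.example.net A
update add pi1.dyn.example.net 60 A 192.168.1.50
send
EOF

Run from an if-up hook on each Pi, this keeps pi1.dyn.example.net pointed at whatever address eth0 just obtained.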
I'm running a backported KVM on Debian Squeeze. At the moment the KVM guests can't connect to the internet through the bridge I have set up. The guests can reach each other and the host, but nothing outside: I can't ping, nslookup, or do anything against a remote address. The guests are configured with static IPs. When I didn't have this bridge but the virtual bridge (the KVM default), the guests could connect fine. After setting up the bridge things broke, so I think the problem lies there.

# The loopback network interface
auto lo br0
iface lo inet loopback
I created an A record called dns.example.com and pointed it at a nameserver. On the nameserver I gave it the name ns1.dns.example.com and also dns.example.com.
Now I'm confused. Do I create glue records of ns1.dns.example.com or dns.example.com? Do I also still need the A records for my dns.example.com?
Answer
You need glue records if the NS records for a domain point to hostnames within that domain. For example, this delegation would not need glue records, because the nameservers' addresses can be resolved outside the delegated zone:

testing.example.com. IN NS ns1.example.net.
testing.example.com. IN NS ns2.example.net.

But this one would, and in that case the glue records go into the example.com zone:

example.com:
[...]
testing.example.com.     IN NS ns1.testing.example.com.
testing.example.com.     IN NS ns2.testing.example.com.
ns1.testing.example.com. IN A  172.16.202.152
I often hear people making statements such as "our MySQL server machine failed", which gives me the impression that they dedicate a single machine as their MySQL server (I guess they just install the OS and only MySQL on it). As a developer not a sysadmin, I'm used to MySQL being installed as part of a LAMP stack together with the web server and the PHP.
Can someone explain to me:
What's the point of installing MySQL on a separate server? It sounds like a waste of resources when I could put the entire LAMP stack there, and additional servers as well.
If the database is on a separate machine, how do the apps that need it connect to it?
Answer
When your application platform and your database are competing for resources, that's usually the first indication that you're ready for a dedicated database server.
Secondly, high-availability: setting up a database cluster (and usually in turn, a load-balanced Web/application server cluster).
I would also say security plays a large role in the move to separate servers as you can have different policies for network access for each server (for example, a DMZ'ed Web server with a database server on the LAN).
Access to the database server is over the network; i.e., where you usually specify "localhost" as your database host, you'd specify the host/IP address of your database server instead. Note: you usually need to modify the configuration of your database server to permit connections, i.e. enable listening on an interface other than the loopback interface.
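For example (hostnames are placeholders), the application just swaps "localhost" for the DB host:

mysql -h db1.internal.example.com -u appuser -p appdb
# and on the MySQL server, my.cnf must listen beyond loopback:
# bind-address = 0.0.0.0    (or the server's LAN IP, instead of 127.0.0.1)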
We're in the process of moving from one server to another.
I'm running the SQL Server database off our new server, with the old server now using it as a DB server, the aim being a smooth transition from one server to the other while DNS propagates - so no downtime for anyone.
Nothing else is currently running on the new server, it's a fresh install of everything.
The sqlservr.exe process seems to be increasing in memory requirements non-stop. Is this most likely to be connections not closing properly on my website? Or are there any known memory leaks in SQL Server? We receive around 40k page views a day, and probably around 5x that number in crawler page views.
Total DB MDF+LDF size is quite small, only 600 MB. The current commit (KB) of sqlservr.exe is 2,393,000, and it's increasing at a rate of about 0.5 MB every second or two.
If this is connections not closing properly on our website, is there any way to clean up old open connections? Apart from obviously fixing the root cause, what can we do in the meantime?
Answer
I'm not a SQL internals guru, but to my knowledge, SQL Server will use as much RAM as you've configured it to and as the OS will allow it to have. More is cached than the DB itself - for example, the execution plans: http://msdn.microsoft.com/en-us/library/ms181055.aspx
As a good rule of thumb your SQL memory should be configured in the following way:
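Whatever numbers you settle on, the cap itself is set via 'max server memory' (the value below is purely illustrative):

EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 4096; RECONFIGURE;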
I've inherited a Sympa mailing list server from a previous Admin and am not very familiar with the whole process. Recently, we've been getting some calls from users that their posts to the various mailing lists are being marked as failing fraud detection checks.
I've been reading up on SPF and suspect what is happening: when a user (bob@somewhere.org) posts to the list (my.mailinglist.com), the outbound message from the list server has the envelope sender set to "bob@somewhere.org". Our list server then relays the outgoing message to mail.gateway.com, which delivers it over the Internet. When the SMTP server at somewhere.org (or any other domain) receives the post, it sees that it was sent by our relay, mail.gateway.com (13.14.15.16), which does not have its IP address in the SPF record for somewhere.org.
In the mail headers of the outbound post sent from mail.gateway.com, I have an SPF line which reads:
Received-SPF: SoftFail (mail.gateway.com: domain of transitioning bob@somewhere.org discourages use of 13.14.15.16 as permitted sender)
We have many users from many different domains sending mail to our list server. Asking every domain to include the mail.gateway.com IP in their SPF record seems ridiculous, but that's what I gather is one way to fix this.
The other fix probably involves using a different envelope sender. I'm not sure how this would affect "Reply" and "Reply to" functionality. Right now it seems a bit wonky; Reply and Reply-to both go to the mailing list, which seems odd. I'm trying to figure out where that is configured.
Are there some other ways to work this out that I have missed?
We're a small consulting shop that hosts some public facing websites and web applications for clients (apps we've either written or inherited). Our strength lies in coding, not necessarily in server management. However a managed hosting solution is out of our budget (monthly costs would exceed any income we derive from hosting these applications).
Yesterday we experienced a double hard drive failure in one of our servers running RAID 5. It's rare that something like this happens. Luckily we had backups and simply migrated the affected databases and web applications to other servers. We got really lucky: only one drive failed 100%, the other was simply marked as pending failure, so we were able to live-move almost everything (one DB had to be restored from backup) and we only had about 5 minutes of downtime per client as we took their database offline and moved it.
However we're now worried that we've grown a bit... organically... and now we're attempting to figure out the best plan for us moving forward.
Our current infrastructure (all bare metal):
pfSense router [old repurposed hardware]
1U DC [no warranty, old hardware]
2U web & app server (server 2k8 R2, IIS, MSSql, 24gb ram, dual 4C8T Xeon) -- this had the drive failures -- [warranty good for another year, drives being replaced under the warranty]
4U inherited POS server (128gb ram, but 32bit OS only, server 2k3) [no warranty]
(2) 1U webservers (2k8, IIS, 4C8T Xeon, 4gb ram) in a load balanced cluster (via pfSense) [newish with warranty]
1U database server (2k8, MSSQL, 4C8T Xeon, 4gb ram) [new with warranty]
NAS running unRaid with 3TB storage (used for backups and file serving for webapps to the 2 load balanced web servers)
Our traffic is fairly light, however downtime is pretty much unacceptable. Looking at the CPU monitors throughout the day, we have very, very little CPU usage.
We've been playing with ESXi as a development server host and it's been working reasonably well. Well enough for me to suggest we run something like that in our production environment.
We'd run the majority of the VMs on the server that currently has the failed hard drives, as it is the most powerful; run other VMs on the 1U servers we've currently got load balanced; P2V the old out-of-warranty hardware (except pfSense - I prefer it on physical hardware); and continue running the unRaid for backups.
I've got a bunch of questions, but the infrastructure based ones are as such:
Is this a reasonable solution to mitigate physical hardware issues? I think the SAN becomes a massive SPOF, and the large server (that would be hosting the VMs) is another. I read that the paid versions of VMware support automatic failover of VMs, and that might be something we look into to alleviate the potential for VM host failure.
Is there another option that I'm missing? I've considered basically "thin provisioning" the applications where we'd use our cheaper 1U server model and run the db and the app on one box (no VM). In the case of hardware failure, we'd have a smaller segment of our client base affected. This increases our hardware costs, rackspace costs, and system management costs ("sunk" cost in employee time).
Would Xen be more cost effective if we did need the VM failover solution?
I'm new on here, but I'm sure my network problem isn't new, so I hope I can get a bit of help.
I have a router in the office with 3 pc's connected, and a length of ethernet cable, less than 100m, which runs to another part of the building straight into a 8 port switch, call it sw1. Connected to sw1 are 2 wifi access points and another length of cable going into another switch, call it sw2. Sw2 has a tv and another wifi access point connected to it.
My problem seems to be with the cable going into sw2. Before using sw2, I could connect a laptop to that cable and it worked fine, but the TV would sometimes work and sometimes wouldn't. Even connecting to the access point on sw2 is a bother, with intermittent connection problems on mobile phones.
All the access points can be seen from the router side of the network even when I can't get a connection on the sw2 side of the network.
Any insight into this problem would be much appreciated.
I take it that it is OK to set up the network the way I have, cascading the switches and connecting access points, or I think you would have mentioned that. I have used several different switches for sw2, and that part of the network always ends up the same after a time: no data.
I don't get why there are two different programs in a minimal install for installing software. Don't they do the same thing? Is there a big difference? I have read everywhere to use aptitude over apt-get, but I still don't know the difference.
Answer
aptitude is a wrapper for dpkg, just like apt-get/apt-cache, but it is a one-stop-shop tool for searching/installing/removing/querying. A few examples of what apt might not supply:
$ aptitude why libc6
i   w64codecs Depends libc6 (>= 2.3.2)

$ aptitude why-not libc6
Unable to find a reason to remove libc6.
Suggests: locales, glibc-doc
Conflicts: libterm-readline-gnu-perl (< 1.15-2), tzdata (< 2007k-1), tzdata-etch, nscd (< 2.9)
Replaces: belocs-locales-bin
Provides: glibc-2.9-1
Description: GNU C Library: Shared libraries
 Contains the standard libraries that are used by nearly all programs on the system. This package includes shared versions of the standard C library and the standard math library, as well as many others.
I work for an organization which runs several servers, mainly used for internet to schools and for Exchange on the corporate side.
One of my tasks during my training is to be able to understand the network and how it works.
Now I don't need to understand absolutely everything but I would like to map out the network in Visio to see how it connects together via a visual perspective.
We have 1 core room which holds the main servers and switches for http/mail/voice/storage etc. which are connected to several switches in other "sub" rooms which then connect that section of the building, i.e 1 Hop.
So my task is to be able to track all the connections leading from the main server room going to the sub switch rooms, so we know what connection does what.
Do you guys have any idea of how to get started with this task, how to go about it, etc.
I'm not a 100% guru with servers, as I'm just an MA technician, and I will need to learn how the servers interact with the switches and how it all comes together.
My team has a server pointing at the DNS supplied by Active Directory to ensure that it is able to reach any hosts managed by the domain. Unfortunately, my team also needs to run dig +trace frequently and we sporadically get strange results. I am a DNS admin but not a domain admin, but the team responsible for these servers isn't sure what is going on here either.
The problem seems to have shifted around between OS upgrades, but it's hard to say whether that's a characteristic of the OS version or other settings being changed during the upgrade process.
When the upstream servers were Windows Server 2003, the first step of dig +trace (request . IN NS from the first entry in /etc/resolv.conf) would occasionally return 0 byte responses.
When the upstream servers were upgraded to Windows Server 2012, the zero byte response problem went away but was replaced with an issue where we would sporadically get the list of forwarders configured on the DNS server.
Example of the second problem:
$ dig +trace -x 1.2.3.4
; <<>> DiG 9.8.2 <<>> +trace -x 1.2.3.4
;; global options: +cmd
.                  3600   IN   NS   dns2.ad.example.com.
.                  3600   IN   NS   dns1.ad.example.com.
;; Received 102 bytes from 192.0.2.11#53(192.0.2.11) in 22 ms

1.in-addr.arpa.    84981  IN   NS   ns1.apnic.net.
1.in-addr.arpa.    84981  IN   NS   tinnie.arin.net.
1.in-addr.arpa.    84981  IN   NS   sec1.authdns.ripe.net.
1.in-addr.arpa.    84981  IN   NS   ns2.lacnic.net.
1.in-addr.arpa.    84981  IN   NS   ns3.apnic.net.
1.in-addr.arpa.    84981  IN   NS   apnic1.dnsnode.net.
1.in-addr.arpa.    84981  IN   NS   ns4.apnic.net.
;; Received 507 bytes from 192.0.2.228#53(192.0.2.228) in 45 ms

1.in-addr.arpa.    172800 IN   SOA  ns1.apnic.net. read-txt-record-of-zone-first-dns-admin.apnic.net. 4827 7200 1800 604800 172800
;; Received 127 bytes from 202.12.28.131#53(202.12.28.131) in 167 ms
In most cases this isn't a problem, but it will cause dig +trace to follow the wrong path if we are tracing within a domain that AD has an internal view for.
Why is dig +trace losing its mind? And why do we seem to be the only ones complaining?
Answer
You are being trolled by root hints. This one is tricky to troubleshoot, and it hinges on understanding that the . IN NS query sent at the start of a trace does not set the RD (recursion desired) flag on the packet.
When Microsoft's DNS server receives a non-recursive request for the root nameservers, it's possible that it will return the configured root hints. As long as you do not add the RD flag to the request, the server will happily continue to return that same response, with a fixed TTL, all day long.
;; ANSWER SECTION:
.                     3600   IN   NS   dns2.ad.example.com.
.                     3600   IN   NS   dns1.ad.example.com.

;; ADDITIONAL SECTION:
dns2.ad.example.com.  3600   IN   A    192.0.2.228
dns1.ad.example.com.  3600   IN   A    192.0.2.229
This is where most troubleshooting efforts will break down, because the easy assumption to leap to is that dig @whatever . NS will reproduce the problem, which actually masks it completely. When the server gets a request for the root nameservers with the RD flag set, it will reach out and grab a copy of the real root nameservers, and all subsequent requests for . NS without the RD flag will magically start working as expected. This makes dig +trace happy again, and everyone can go back to scratching their heads until the problem reappears.
Your options are to either negotiate a different configuration with your domain admins, or to work around the problem. So long as the poisoned root hints are good enough in most circumstances (and you're aware of the circumstances where they're not: conflicting views, etc.), this isn't a huge inconvenience.
Some workarounds without changing the root hints are:
Run your traces on a machine that has a less nutty set of default resolvers.
Start your trace from a nameserver that returns root name servers for the internet in response to . NS:

dig @somethingelse +trace example.com

You can also hardwire this nameserver into ${HOME}/.digrc, but this may confuse others on a shared account, or be forgotten by you at some point.
Seed the root hints yourself prior to running your trace:

dig . NS
dig +trace example.com
UseCanonicalName On
ServerName somename
ServerAlias www.someothername.com
According to the docs:
With UseCanonicalName On Apache will use the hostname and port specified in the ServerName directive to construct the canonical name for the server. This name is used in all self-referential URLs, and for the values of SERVER_NAME and SERVER_PORT in CGIs.
So in my Tomcat/CFML application when I visit the URL www.someothername.com I would expect to see in the CGI scope:
The last two cause a DNS lookup on somename but still return www.someothername.com in the CGI.SERVER_NAME field.
I should point out that the only reason I'm doing this is that I'm doing mass virtual hosting with mod_cfml to automatically create Tomcat contexts, and I would like the context and application to use a short name derived from the vhost configuration. I guess I could just set a header (even rewrite the Host: header), but using ServerName seemed the most elegant solution.
UPDATE: There is something I noticed in the client headers that is probably relevant. There are 2 headers I haven't seen before:
I need to know what sets these headers and why. I'm assuming it was either Tomcat or mod_cfml. Can I rely on the x-forwarded-server value to always be the ServerName?
We are thinking about moving our virtual machines (Hyper-V VHDs) to Windows Azure but I haven't found much about what kind of fault tolerance that infrastructure provides. When I run VHD in Azure, I've got two questions:
Is my VHD and all the data in it safe? I think that uploaded VHDs use the "Storage" infrastructure so they should be automatically replicated to multiple disks and geographically distributed but should I still make a full-image backup just to be safe? (Note that of course I will be backing up the actual data inside VMs that I care about; I just want to know if there is a chance greater than 0.0000001% that one day I will receive an email from Microsoft telling me that my VM is gone and that I should create or restore it from scratch).
Do I need to worry about other things regarding the availability of my VMs? I mean, when I have an on-premise server I need to worry about the hardware itself, about the host operating system, what would happen if my router failed, if my Hyper-V's C: drive failed etc. Am I right in thinking that with Azure, their infrastructure takes care of all of this?
I have to buy a stand-alone server rack to be placed in a machine room. The temperature in the room can reach 30°C (I have no choice with regard to the location). To compensate, I will use active cooling with the used air being extracted.
I need to calculate the rating for the cooling cabinet. The cabinet will have 2 catalyst 2960 g switches and 4 HP ProLiant DL380p Gen8 E5-2609 2.4GHz 4-core 1P 4GB-R P420i SFF 460W PS Entry EU Servers as well as 3 USB extender receiver signal adapters.
The Catalyst figures I can look up, as well as the KVM, but I have a problem with the ProLiant, as I can't figure out what heat it generates. The only BTU specification is for the power supply, which in this case is rated at 1725 BTU/hr, circa 506 watts. Does this mean that for four of these units I only need a cabinet that can cater for 4 × 506 + 2 × 129 + 3 × 50 = 2,432 W, i.e. about 2.432 kW?
Can it be that easy?
Answer
This depends on your usage pattern, but I think your 2.432 kW number may be overkill.
In terms of your room's ambient temperature, the thresholds of the DL380p platform are:
Ambient temps warn at 42°C
Thermal shutdown occurs at 46°C.
This can be overridden/ignored, but that's a target to keep in mind. Here's the heat map and spot temperatures for a very busy DL380p Gen8 system described in detail below.
The BTU figure you see in the HP QuickSpecs is a maximum, but we know you won't be running these servers at full utilization. In an SMB scenario, it doesn't make sense to engineer for worst-case power/heat loads. These are extremely efficient servers, and I've had success running them in less-than-ideal environments (warehouses, closets, bathrooms...).
However, HP does say: BTU/hr = Volts × Amps × 3.41, so your calculation is correct.
Real-life power utilization figures...
For a DL380p with that spec (or slightly more powerful E5-2620 single-CPU configuration), with a light duty cycle I see:
System Information
    Manufacturer: HP
    Product Name: ProLiant DL380p Gen8

model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
cpu MHz    : 2000.000

hpasmcli> SHOW POWERMETER
Power Meter #1
    Power Reading : 62

hpasmcli> SHOW POWERSUPPLY
Power supply #1
    Present  : Yes
    Redundant: Yes
    Condition: Ok
    Hotplug  : Supported
    Power    : 35 Watts
Power supply #2
    Present  : Yes
    Redundant: Yes
    Condition: Ok
    Hotplug  : Supported
    Power    : 35 Watts
For an extremely busy DL380p Gen8 with TWO E5-2643 v2 3.5GHz CPUs, running with a 8 internal SAS disks, 192GB RAM, fully-populated PCIe slots and no power saving settings enabled, I see:
EntryType  : Warning
InstanceId : 5802
Message    : None of the IP addresses (192.168.254.17) of this Domain Controller map to the configured site 'North'. While this may be a temporary situation due to IP address changes, it is generally recommended that the IP address of the Domain Controller (accessible to machines in its domain) maps to the Site which it services. If the above list of IP addresses is stable, consider moving this server to a site (or create one if it does not already exist) such that the above IP address maps to the selected site. This may require the creation of a new subnet object (whose range includes the above IP address) which maps to the selected site object.
The event was being logged repeatedly by a domain controller whose IPv4 address is not associated with the site it serves, as configured in the Active Directory Sites and Services console. I suppressed it by creating a /32 subnet object that maps to the served site; however, I would like to know about the actual consequences.
Why should the IPv4 address of the domain controller map to the site it serves?
Why is such a test being performed by Netlogon? Why is the recommendation "generally recommended"?
Besides the event log, how would Active Directory infrastructure be impacted by such configuration mismatch?
Although the network infrastructure that links the sites consists of no more than a few meters of optic fiber and has low latency and high bandwidth, multiple sites were created in order to establish affinities between users and domain controllers while keeping IPv4 addresses unchanged. This serves a capacity management purpose.
Under a test environment, a few Windows PowerShell lines can reproduce the issue.
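Something along these lines (a sketch using the ActiveDirectory module; names and ranges are made up):

# create a site and a subnet that does NOT cover the DC's address
New-ADReplicationSite -Name "North"
New-ADReplicationSubnet -Name "10.1.0.0/16" -Site "North"
# move a DC whose IP lies outside 10.1.0.0/16 into that site
Move-ADDirectoryServer -Identity "DC1" -Site "North"
# Netlogon on DC1 should then start logging event 5802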
Can someone tell me why I have an ARP table entry outside the interface's network?
Server has 2 interfaces, both on the same network:

eth0 10.10.34.146/22
eth1 10.10.33.188/22

ip route:

default via 10.10.32.2 dev eth0 proto static metric 1024
10.10.32.0/22 dev eth0 proto kernel scope link src 10.10.34.146
10.10.32.0/22 dev eth1 proto kernel scope link src 10.10.33.188

arp -n:

Address        HWtype  HWaddress          Flags Mask  Iface
176.119.32.2   ether   d4:d7:48:b5:a3:c1  C           eth1
176.119.32.2   ether   d4:d7:48:b5:a3:c1  C           eth0
And then, when I ping 8.8.8.8 (from eth0) I get an ICMP reply but the ARP table does not change. But when I run ping -I eth1 8.8.8.8 I get an ICMP reply and there is a new entry in the ARP table:

8.8.8.8   ether   d4:d7:48:b5:a3:c1   C   eth1

But when I add ip route add default via 176.119.32.2 dev eth1 to the routing table and ping 8.8.8.8 from eth1, there is no new ARP table entry for 8.8.8.8. Why is that?
NOTE: Both interfaces are connected to a Cisco switch with ip local-proxy-arp on the SVI with IP 176.119.32.2/22, which is the default gateway, and both interfaces are in a private VLAN.
Answer
It sounds like the Cisco interface your eth0 is connected to is configured with ip local-proxy-arp but the Cisco interface your eth1 connects to is actually configured with ip proxy-arp instead. This would account for the switch taking responsibility over the data-link addressing of an IP outside of the locally configured LAN.
Can you please verify this on the switch configuration? If I am incorrect, please post the results of show run int [PORT_FOR_ETH0] and show run int [PORT_FOR_ETH1]?
I'm running a CentOS 7 VM on ESXi with almost 300 GB of RAM and 24 vCPUs.
The average load is 3 and apps almost never use more than 150GB of RAM. The rest of the available memory is used by Linux for cache.
The issue is that when the cache fills up the available RAM, two kswapd processes start using 100% of CPU, and suddenly I see that all CPUs are also showing 99% sys usage (it's not wait or user, it's mainly sys).
This will cause a high load ( 100+ ) for several minutes until the system recovers and load goes down to 3 again.
At the moment I don't have a swap partition, but this issue happened even when I had one.
One "solution" that I found is to execute the following command every day:
echo 3 > /proc/sys/vm/drop_caches
which drops buffers/caches. This will "fix" the issue since cache usage never reaches 100%.
My questions are:
Is there a real solution for this issue?
Shouldn't the Linux kernel be smart enough to simply evict old cache pages from memory instead of launching kswapd?
After all, from what I understand, the main function of RAM is to be used by apps. Caching is just a secondary function, which can be discarded/ignored if you don't have enough memory.
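For reference, the knobs most often suggested for this symptom (no guarantee any of them is the fix):

echo never > /sys/kernel/mm/transparent_hugepage/enabled   # THP compaction is a common kswapd culprit
sysctl -w vm.zone_reclaim_mode=0                           # avoid aggressive NUMA-local reclaim
sysctl -w vm.min_free_kbytes=1048576                       # keep a larger free-page reserve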
Can anyone tell me of an ESXi command line that can be used to list the different virtual hardware components assigned to VMware guests running on ESXi with vCenter?
E.g. I want to find out how many of our guests are running with the e1000 network adaptor or how many have 2 sockets and 2 cores.
I'd like to do this in ESXi/vSphere not in the guest OS.
Answer
In PowerCLI, the number of CPUs is accessible directly as a property of the VirtualMachine object returned by Get-VM, but in v5.0, the other virtual hardware objects have their own cmdlets, e.g. Get-HardDisk, Get-NetworkAdapter. So you'd have to do something like:
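Presumably something along these lines (an untested sketch; property names per PowerCLI 5.x):

Get-VM | Select-Object Name, NumCpu,
    @{N='NICType';   E={ (Get-NetworkAdapter -VM $_ | Select-Object -ExpandProperty Type) -join ',' }},
    @{N='DiskCount'; E={ (Get-HardDisk -VM $_ | Measure-Object).Count }}

Note that NumCpu is the total vCPU count; the sockets/cores split may require digging into $_.ExtensionData.Config.Hardware.NumCoresPerSocket.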
I am fairly new to Linux administration, so this may sound like quite a noob question.
I have a VPS account with a root access.
I need to install Tomcat and Java on it and later other open source applications as well. Installation for all of these is as simple as unzipping the .gz in a folder.
My questions are
A) Where should I keep all these programs? In Windows, I typically have a folder called programs under C:\ where I unzip all applications. I plan to have something similar here as well. Currently I have them all in an apps folder under /root - which I am guessing is a bad idea. What's wrong with always being root? Right now I am planning to put them under /opt.
B) To what group should Tom belong? I would need a user - say, Tom - who can simply execute these programs. Do I need to create a new group, or just add Tom to some existing group?
C) Finally, am I doing something really stupid by installing all these applications by simply unzipping them? An alternative would be to use yum or RPM or something like that to install them. Given my familiarity and (tight) budget, that seems too much to me. I feel uncomfortable running commands I don't understand too well. (A sketch of what I have in mind is below.)
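For what it's worth, a minimal sketch of the layout described in (A) and (B) - names and versions are illustrative:

# unpack third-party apps under /opt rather than /root
mkdir -p /opt/tomcat
tar xzf apache-tomcat-7.0.x.tar.gz -C /opt/tomcat --strip-components=1
# a dedicated unprivileged user/group that only runs the apps
groupadd appusers
useradd -m -g appusers -s /bin/bash tom
chown -R tom:appusers /opt/tomcat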
I have an Ubuntu 14.04 (64-bit) + KVM host with 2 NICs:

- eth0 connected to the "public" network
- eth1 connected to the br0 bridge with a private IP address range
From the host I can access the internet, ping the VM guest, and connect to it via SSH. From the VM guest I can only ping the host; I cannot access the internet and cannot ping google.com.
Please help me connect the VM guest to the internet in the setup described below:
auto br0
iface br0 inet static
    address 10.0.0.1
    netmask 255.255.255.0
    bridge_ports eth1
    bridge_stp off
    bridge_maxwait 0
    bridge_fd 0
    # Create and destroy the bridge automatically.
    pre-up brctl addbr br0
    pre-up ip link set dev br0 up
    post-up /usr/sbin/brctl setfd br0 0
    post-up brctl addif br0 eth1
    post-down brctl delbr br0
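Since br0 has no uplink of its own, I suspect the missing piece is forwarding/NAT on the host (a sketch, assuming the guests use 10.0.0.1 as their gateway):

sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth0 -j MASQUERADE
# guests also need a resolver, e.g. nameserver 8.8.8.8 in their /etc/resolv.conf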
I recently reserved an EC2 instance for 36 months. I plan on moving my web app over to it very soon, but I started doing some research: Amazon charges for data transfer out. Basically, the entire reason I'm making the move to EC2 is that my web app is growing in size, but the problem is that my web app gets DDoS attacks from time to time. What's to stop me from getting a $10k bill because of a DDoS attack? (Or is that data transfer in?)
So what I'm asking is: what uses data transfer out and what uses data transfer in? The web app consists of a few JS and CSS files as well as 50 images tops; every other image is hosted on imgur.
However, if I add a user with /bin/false (which I want, since this is to be something like shared hosting and I don't want users to have shell access), the script runs under the numeric '1001', '1002' user, which, as my Google searches showed, might be a security hole. My question is: is it possible to allow users to execute shell scripts but prevent them from logging in via SSH?
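One pattern that might fit (a sketch; I'm not sure it addresses the numeric-UID concern): give each user a real account with a disabled login shell, and launch their scripts with an explicit shell:

useradd -m -s /usr/sbin/nologin alice         # no interactive or SSH shell
su -s /bin/sh -c '/home/alice/run.sh' alice   # still runs the script as alice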
We recently purchased a few HP DL380p G8 servers to add some VM capacity, and decided to add a pool of SSDs to our standard build to create a "fast" RAID 1+0 array for some of our VMs that have higher performance requirements (e.g. log servers and DBs). Since the HP drives are super-duper expensive, we went with Plextor PX-512M5Pro SATA SSDs, since we had good luck with Plextor SSDs in our previous G7 servers.
However, in 3 out of 3 servers, 3 of the 4 drives have entered the failed state shortly after being configured, before we even attempted to put them to use. The consistency of the failures leads me to believe it's an incompatibility between the RAID controller and the drives, and when the RMA replacements arrive I'm assuming they'll fail too. Any tips or tricks that might help with this issue, besides just buying the official HP drives?
This happens after a user's mailbox is moved from one Exchange server to another (within one organization, but different domains).
Details:
From: postmaster
Sent: Saturday, April 03, 2010 8:43 AM
To: USER1
Subject: Undeliverable: subject of message
Delivery has failed to these recipients or distribution lists:
IMCEAex-_O=DOMAIN_OU=First+20Administrative+20Group_cn=Recipients_cn=USER1@domain
A problem occurred during the delivery of this message. Microsoft Exchange will not try to redeliver this message for you. Please try resending this message later, or provide the following diagnostic text to your system administrator.
Answer
You can't get far troubleshooting this with just the non-delivery report you posted above. There are two things you should do. First, get into Exchange System Manager and use the Message Tracking feature to track the message. Often it doesn't tell you much beyond the fact that an NDR was generated, but sometimes there are clues (for instance, did the user's mailbox server generate the NDR, or was the message forwarded to another server that then generated it?). Second, look in the Application log of the server that generated the NDR for any associated errors. If you don't find anything, turn diagnostic logging for Transport up to Maximum on that server and send a few messages. You'll certainly get a few applicable errors in the Application log at that point, and you'll then have something more useful to work with for troubleshooting purposes (including potentially posting an update here with the specific error).
For the purposes of a startup, I have one physical dedicated server (on loan) with several virtual machines inside it.
For now there are mainly two VMs on this server:
VM "tools", using ubuntu server 10.04 LTS
A source code repository using mercurial and hgserve A bunch of
JAVA app for Atlasian for bug
tracking, wiki...
PostgreSQL as the Database for the tools
Apache HTTPD as HTTPS front end.
VM "asterisk", using ubuntu server 10.04 LTS
with an asterisk server, functionnal,
but more for testing as of now than anything.
But in the future we will have a "production" VM with our Java application:
- Apache HTTPD front end
- PostgreSQL database
- Tomcat webapp (maybe clustered)
What I'm interested in is software that can monitor the availability of services, KVM VMs, applications, and the database, so I can react in case of problems.
I also have another use case where I'd like to monitor the performance of the application (requests, CPU, memory...) and gather usage statistics.
We have basically no money, and want a free tool, at least at first.
What would be an easy and simple tool for the job? I have heard of Nagios and Hyperic, but I don't know them, so I don't know whether they are suited to our needs.
EDIT:
The need is not only for server monitoring but also a way to investigate actual application performance and responsiveness, and if possible isolate bottlenecks.
From the links (not the same question, as it seems more generic, but quite informative) and the actual responses, Nagios + Munin should be a good fit. The problem is that Nagios seems a little complex (I don't know about Munin). Will the Nagios/Munin combo be able to gather detailed statistics and historical data for a Java application (requests/second, request latency, both with statistics by URL, hour, day, week...)?
Are there other (better?) alternatives?
Answer
Nagios. I was scared of the text configuration for a long time and tried all the other popular or remotely popular solutions out there, but was never satisfied. Once I eventually tried Nagios and actually went through the configuration, I loved it, and I actually find it much easier to configure and customize than GUI tools like Zenoss.
While I have not done this yet, you could combine this with Monit to automatically try to recover from problems, and with Munin to collect historical data.
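(For a feel of what that looks like, a hedged Monit sketch for the Apache front end mentioned in the question; the pid file and init script paths are assumptions for Ubuntu 10.04:)

check process apache with pidfile /var/run/apache2.pid
    start program = "/etc/init.d/apache2 start"
    stop program  = "/etc/init.d/apache2 stop"
    if failed host 127.0.0.1 port 80 protocol http then restart
    if 5 restarts within 5 cycles then timeout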
Edit:
Documentation for setting up Nagios and for Munin. It's Ubuntu-specific, but I actually followed it to configure Nagios on Red Hat.
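(To give a feel for the text configuration: a minimal host/service pair might look like the following. The host name and address are assumptions; the linux-server and generic-service templates and the check_http command ship with the sample configuration.)

define host {
    use        linux-server       ; inherit defaults from the sample template
    host_name  tools-vm
    alias      Tools VM
    address    10.0.0.10
}

define service {
    use                  generic-service
    host_name            tools-vm
    service_description  HTTP
    check_command        check_http
}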
The other day, we noticed a terrible burning smell coming out of the server room. Long story short, it ended up being one of the battery modules burning up in the UPS unit, but it took a good couple of hours to figure that out. The main reason we were able to figure it out was that the UPS display finally showed that the module needed to be replaced.
Here was the problem: the whole room was filled with the smell. Doing a sniff test was very difficult because the smell had infiltrated everything (not to mention it made us light-headed). We almost mistakenly took our production database server down because that's where the smell was strongest. The vitals appeared to be OK (CPU temps around 60 degrees C, fan speeds OK), but we weren't sure. It just so happened that the battery module that burnt up was at about the same height as the server in the rack and only 3 ft away. Had this been a real emergency, we would have failed miserably.
Realistically, actual server hardware burning up is fairly rare, and most of the time the UPS will be the culprit. But with several racks each holding several pieces of equipment, it can quickly become a guessing game. How does one quickly and accurately determine which piece of equipment is actually burning up? I realize this question depends heavily on environmental factors such as room size, ventilation, and location, but any input would be appreciated.
Answer
The general consensus seems to be that the answer to your question comes in two parts:
How do we find the source of the funny burning smell?
You've got the "How" pretty well nailed down:
The "Sniff Test"
Look for visible smoke/haze
Walk the room with a thermal (IR) camera to find hot spots
Check monitoring and device panels for alerts
You can improve your chances of finding the problem quickly in a number of ways; improved monitoring is often the easiest. Some questions to ask (a rough example check follows this list):
Do you get temperature and other health alerts from your equipment?
Are your UPS systems reporting faults to your monitoring system?
Do you get current-draw alarms from your power distribution equipment?
Are the room smoke detectors reporting to the monitoring system? (and can they?)
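(As a rough illustration of the first point, a host with IPMI can be polled from cron or from a monitoring check. A hedged sketch, where the "Ambient" sensor name, the 45 C threshold, and the mail recipient are all assumptions that vary by hardware:)

#!/bin/sh
# Pull the ambient temperature via IPMI and alert above a threshold.
TEMP=$(ipmitool sdr type Temperature | awk -F'|' '/Ambient/ {print $5}' | grep -o '[0-9]\+' | head -n1)
if [ -n "$TEMP" ] && [ "$TEMP" -gt 45 ]; then
    echo "WARNING: ambient temperature ${TEMP}C" | mail -s "Server room temperature alert" ops@example.com
fi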
When should we troubleshoot versus hitting the Big Red Switch?
This is a more interesting question. Hitting the big red switch can cost your company a huge amount of money in a hurry: clean-agent releases can run into the tens of thousands of dollars, and the outage/recovery costs after an emergency power-off (EPO, "dropping the room") can be devastating. You do not want to drop a datacenter because a capacitor in a power supply popped and made the room smell.
Conversely, a fire in a server room can cost your company its data/equipment, and more importantly your staff's lives. Troubleshooting "that funny burning smell" should never take precedence over safety, so it's important to have some clear rules about troubleshooting "pre-fire" conditions.
The guidelines that follow are my personal limitations that I apply in absence of (or in addition to) any other clearly defined procedure/rules - they've served me well and they may help you, but they could just as easily get me killed or fired tomorrow, so apply them at your own risk.
If you see smoke or fire, drop the room This should go without saying but let's say it anyway: If there is an active fire (or smoke indicating that there soon will be) you evacuate the room, cut the power, and discharge the fire suppression system. Exceptions may exist (exercise some common sense), but this is almost always the correct action.
If you're proceeding to troubleshoot, always have at least one other person involved This is for two reasons. First, you do not want to be wandering around in a datacenter and all of a sudden have a rack go up in the row you're walking down and nobody knows you're there. Second, the other person is your sanity check on troubleshooting versus dropping the room, and should you make the call to hit the Big Red Switch you have the benefit of having a second person concur with the decision (helps to avoid the career-limiting aspects of such a decision if someone questions it later).
Exercise prudent safety measures while troubleshooting Make sure you always have an escape path (an open end of a row and a clear path to an exit). Keep someone stationed at the EPO / fire suppression release. Carry a fire extinguisher with you (Halon or another clean agent, please). Remember rule #1 above. When in doubt, leave the room. Mind your breathing: use a respirator or an oxygen mask; this might save your health in case of a chemical fire.
Set a limit and stick to it More accurately, set two limits:
Condition ("How much worse will I let this get?"), and
Time ("How long will I keep trying to find the problem before its too risky?").
The limits you set can also be used to let your team begin an orderly shutdown of the affected area, so when you DO pull power you're not crashing a bunch of active machines, and your recovery time will be much shorter, but remember that if the orderly shutdown is taking too long you may have to let a few systems crash in the name of safety.
Trust your gut If you are concerned about safety at any time, call the troubleshooting off and clear the room. You may or may not drop the room based on a gut feeling, but regrouping outside the room in (relative) safety is prudent.
If there isn't imminent danger, you may elect to bring in the local fire department before taking any drastic actions like an EPO or clean-agent release. (They may tell you to do so anyway: their mandate is to protect people first, then property, but they're obviously the experts in dealing with fires, so you should do what they say!)
We've addressed this in the comments, but it may as well be summarized in an answer too. @DeerHunter, @Chris, @Sirex, and many others contributed to the discussion.