Tuesday, July 31, 2018

networking - Improvements for small business network?



My question is more or less a newer version of this 3-year-old question: Small Business Network Switches/General network configuration



Summary
Everything is Gigabit and we don't have any real complaints about network performance. My real question is: for the 4 or 5 of us here, is there a next step that makes sense?




Network information
My small business's informal network centers around one central 16-port Dell switch and our file server in a star layout. The internal server is a Debian box running Samba, sharing a RocketRAID hardware RAID 6 array, on a UPS with shutdown tested and working. Whenever I do dev work, I use the file server for HTTP or MySQL apps. We have our domain email hosted by Google Apps, which we've been using since it was in beta, and we love it. I like what I read in these two Q&As: guidelines for wiring an office and Best Practices for a Network File Share?.



Current network configuration



It wasn't until I started preparing for this question two days ago that I even realized a person can log in to the Dell switch to manage it. (facepalm)



Oh, and it gets better!! I wandered around the business taking pictures of everything with an ethernet cable on it. Turns out I had some legacy haunting me! 6+ years ago, we had a server with 2 NICs, and my friend who helps with IT talked me into putting a DMZ on the 2nd port. The switch for that trick is easily from the early 1990s: I remember getting it from the trash at a company that got bought by AOL in 1997! It's so old, you can't google it. So thanks to my recent server reinstall and reading around serverfault, I got rid of that sadness. Edit to add: I haven't noticed a difference at work yet, but scp'ing files from work to home (initiated from home) is remarkably faster now with that 10/100 switch gone!



20+ year old technology where it doesn't belong




With preparation and luck we have gotten through a few failures. Right now, I'm comfortable with everything but suspect I'll be replacing the firewall and switch soon...



Questions:
Are there any easy ways we can get performance increases?



Would it make sense to get a switch with a fiber port and then put in a fiber NIC on the server? From the reading I've been doing, it seems LANs are holding steady with Gigabit and nothing really new has happened since then for the small folks.



I haven't even googled how to log into the Dell switch, so I'm assuming it's unmanaged. I went looking for the switch's web interface, so I checked the DHCP server (on the firewall box), and the switch doesn't show up among the clients. I've only scratched the surface of reading up on all that: should the switch and RAID server be using large packets or something?



Network load is normally pretty mellow until the Office computer is working on our videos. Right now, they cannot be served live from the RAID and are copied back and forth. I do all my CAD from the RAID, but it uses a local scratch pad, and saves of 40 MB+ take 10 or so seconds.




Illustrations were made using Inkscape. I tried a few of the network diagramming tools, and it was easier to just draw everything by hand. SVGs available upon request.






Edit for Update
I was working in the rack, moved the firewall Acer computer, and its HDD died. It disappeared from the BIOS, so it's probably the controller. Yes, literally, touching the computer to move it from the back shelf to the front shelf killed it. For now, the Buffalo WHR-HP-G54 got reconfigured and pressed into firewall duty until the already-ordered dual-NIC firewall box shows up. SCPing from home seems a tad slower than the old firewall with the USB->eth adapter. I googled and found out that its WAN port is 10/100.



The observations made:
1) The legacy 10/100 link from the firewall Acer to the cable modem is slower than when the link from firewall to modem is Gigabit.
2) The Buffalo WHR-HP-G54 WAN port's 10/100 link to the modem is slower than when everything is Gigabit.
3) The TU2-ETG's USB 2.0 link is faster than 10/100.
4) Cox Biz cable upload is faster than 10/100.



Once the new switch shows up (with profiling), I'm going to review Evan's and Chris's answers, attempt Evan's suggested tests, and then choose between them for "The Accept".







Final result



Getting to "the answer" of this question has been an amazing three week journey. Thank you both Chris and Evan: it was tough to choose whose answer to accept.



Seeing how inexpensive a better switch is, I bought an HP ProCurve V1910-24G. It cost less than the lesser Dell did 5 years ago. That it is shiny is only half the reason I bought it; the Dell is at least 6 years old, and while it's still performing well, I'm going to have to set a rule about retiring hardware that's more than five years old.



Having said that, the ProCurve has spawned a new question (I would appreciate some thoughts about the functions of what I mention here), but I'm super happy to have eliminated all of the desktop switches. That sounds like another rule; maybe those go in the gun safe?




Below is the revised drawing. Of note is that I moved the Cox Cable coax "T" and the cable modem now resides in the rack. The two CAT5 cables going to the telco corner are now dedicated to feeding the Cisco VoIP boxes, so the telco corner now only serves telephones. Also, the cabling in the drawing reflects the physical reality of the infrastructure, including switch ports. Most cables are paired up and the newest "drop" I created to get rid of the office switch has the three CAT6 cables going to it.



Current updated network drawing











For the first time ever, the cabling associated with my switch and the firewall/router is something I'm happy with! Bottom left is the cable modem, top right is the pfsense mini-ITX firewall/router:



my rack



The switch is set back some: I didn't actually like it mounted flush in the rack, so I fabbed up some adapters to set the switch back about 10" from the front of the rack. The HP/Compaq server cabinet has the additional rails, so I took advantage of those. The cabinet has enough freedom to roll forward to allow access to the back doors. The WiFi AP is resting on top of the cabinet, as is the excess coiled-up network cable.



The yellow ethernet cabling is CAT6 I bought from StarTech on close-out at $7 per 75-foot crossover cable, and it's plenum-rated. That was such a bargain that I bought a couple dozen, and I am very good at fitting jacks (plus I have the T568B wire color order memorized).



This setup is noticeably faster than before! When I ssh -X from home and run a browser window from the server at work, it's faster than I remember 14.4k modems being, and seems about 3x as fast as before when I'd log in and browse the web from within the LAN. At work, opening files is as fast as if the drive weren't networked. If I already have Photoshop CS6 running, opening a 6 MB JPEG from the RAID is instantaneous.




Additionally, I realized the cable from the RAID to the switch was one of those CAT5 cables that come with wireless routers, etc., so I replaced it with a 2' CAT6 cable and could tell a before/after performance boost with my Photoshop experiment. Now it's all CAT6 from cable modem to firewall to switch to server. My desk has CAT5 from the switch for now, but I'll upgrade whenever I open up a wall.



Once I'm settled in and caught up with regular work, I'll try my hand at benchmarking the network's performance. Right now, I'm pretty sure it can't get much better than applying the best-practice advice of getting rid of extra switches and unnecessary hardware. The hardware RAID controller is 6+ years old, so getting a new one is on the horizon. Once that happens, this one will fall back to archival duty.


Answer




Are there any easy ways we can get performance increases?




You need to employ a systematic process of identifying bottlenecks and eliminating them. You can shovel money into new gear, services, etc, but if you're not being methodical about it there's really no point. I'll make some specific recommendations at the end of my answer for some things you should look at.





Would it make sense to get a switch with a fiber port and then put in
a fiber NIC on the server?




Nope. Your fiber-based Ethernet media choices are gigabit and 10 gigabit. Gigabit fiber and gigabit copper are the same speed, so there's no "win" to using fiber for gigabit speed (though, as @ChrisS says, fiber does excel in some specific use cases). You don't have a server in your office that can even begin to saturate 10 gigabit fiber, so there's no "win" with 10 gigabit either.




I haven't even Google'd how to log into the Dell switch, so I'm

assuming it's unmanaged. I was going to visit the switch for a
webserver so I checked the DHCP server (on firewall box) and the
switch doesn't show up among the clients. I've only scratched the
surface of reading up on all that: should the switch and RAID server
be using large packets or something?




The PowerConnect 2716 is a low-end "web managed" switch when it's set up in "Managed" mode (which, by default, it isn't, but it sounds like you've figured out you can enable web management). You can get a manual from Dell for that switch that will explain the management functionality. They aren't great performers. I've got a couple of them in little "backwater" places, and my experience has been that they won't even do wire-speed gigabit switching.



When you say "large packets" I believe you're referring to jumbo frames. You have no reason to use jumbo frames. Generally you'll only see jumbo frames in use in very specialized, isolated networks-- like between iSCSI targets and initiators (SANs and the servers that "connect" to them). You're not going to see any marked improvement in general file/print sharing performance on your LAN using jumbo frames. You'd likely have headaches and performance problems, actually, because all the devices would need to be configured for jumbo frame support-- and I would suspect that you have at least one device that doesn't have support (just based on the wide variety of gear you have).







Here are some things I'd look at doing if I wanted to isolate bottlenecks:




  • Enable web management on the PowerConnect 2716 switch so that you can see error and traffic counters. This switch doesn't have SNMP-based management so you're not going to get any fancy traffic graphing, but you'll at least be able to see if you're having errors.


  • Benchmark the server performance w/ a single client computer connected directly to the server's NIC (for which you should be able to use a regular straight-through patch cable, assuming the client computer you're using has a gigabit NIC). That will give you a feeling for the server's maximum possible I/O throughput with a real file sharing workload. (If I had to hazard a guess I'd bet that the server's I/O to/from the disks is your biggest bottleneck.)


  • Use a tool like iperf (ttcp, etc) to get a feeling for the network bandwidth available between various places in the network.
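For example, a quick classic-iperf check between a workstation and the file server might look like this (hostname is a placeholder; iperf must be installed on both ends):

# on the file server
iperf -s

# on a workstation
iperf -c fileserver.example.lan

A healthy gigabit path with decent NICs should report somewhere in the neighborhood of 900 Mbits/sec; much less than that points at cabling, a weak switch, or a 10/100 link hiding somewhere.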





The best single thing you can change, from a reliability perspective, is to eliminate all the little Ethernet switches and home-run all the cabling back to a single core switch. In a network as small as the one you've diagrammed there's no reason to have more than a single Ethernet switch (assuming all the nodes are within 100 meters of a single point).


apache 2.2 - SSL SNI security concerns



Just wondering if SNI is useful in segregating public content from private content. I managed to configure our server to serve /foo for every client but serve /bar only for clients from the intranet, by specifying a host name that resolves only from the intranet.



So the config goes like this (stripped to the essential parts):



NameVirtualHost *:443
# JkWorkersFile must be global so including it here
JkWorkersFile workers.properties

<VirtualHost *:443>
    ServerName public.foo.com
    JkMountFile uriworkermap-pub.properties
</VirtualHost>

<VirtualHost *:443>
    ServerName private-foo
    JkMountFile uriworkermap-priv.properties
</VirtualHost>

<VirtualHost *:443>
    ServerName 10.1.2.3
    JkMountFile uriworkermap-priv.properties
</VirtualHost>



The catch is, if you add that name to your hosts file so it resolves to the public IP, then SNI will actually handle the request the same way as if it were a valid request from the intranet.



I played around with the thought of using only a numeric IP instead of names (e.g. 10.1.2.3), but I presume the same trick works if the client has the same IP in their own subnet (e.g. a Linux host that forwards ports to the public IP of my web server).




The node sits behind a firewall on which I don't have influence. It has only one IP (the internal one) but if needed I can probably make it two.



Practical question is: how do you prevent such a leak? By means of htaccess for example? By specifying different IP addresses? Or is there no other way than creating a separate server instance and forgetting SNI?


Answer



If you need to restrict content based on the origin of the site visitors, you should use that information as the primary access control (and not just the name of the resource you are trying to protect).



With Apache 2.2 that would be the Allow Directive.




<VirtualHost *:443>
    ServerName private-foo

    <Location />
        Order Deny,Allow
        Deny from all
        # Allow from the internal subnet 10.1.2.0/24
        Allow from 10.1.2
    </Location>
    ...
</VirtualHost>


Often in your scenario a server would have both an internal and a public IP address, and since internal users would come in via the internal IP address only, you would bind the virtual host to just that internal IP rather than listening on all IPs.
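For illustration (a sketch based on the vhost from the question; SSL and JK directives omitted), binding the private site to the internal address only would look like:

<VirtualHost 10.1.2.3:443>
    ServerName private-foo
    JkMountFile uriworkermap-priv.properties
</VirtualHost>

Requests arriving on the public address then never match this virtual host, regardless of what name the client puts in its hosts file.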




Additionally, your remark regarding .htaccess triggered my pet peeve; quoting from the manual on .htaccess files:




You should avoid using .htaccess files completely if you have access to httpd main server config file. Using .htaccess files slows down your Apache http server. Any directive that you can include in a .htaccess file is better set in a Directory block in the main Apache configuration file(s), as it will have the same effect with better performance.
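As an illustration of that advice (the path here is a placeholder), the access control shown above could live in the main server config like this, with AllowOverride None so Apache does not even look for .htaccess files:

<Directory /var/www/private-foo>
    AllowOverride None
    Order Deny,Allow
    Deny from all
    Allow from 10.1.2
</Directory>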



Accidentally ran chmod 775 -R / (not ./) on linux server (RedHat)

I accidentally ran "chmod 775 -R /" instead of "chmod 775 -R ./" and changed the permissions on everything, and now the server is broken.



Anyone know how I can quickly fix this???

Apache Performance analysis

I am having a really difficult time with my web server. I have been tweaking things as per the suggestions on the web, yet I'm not able to find anything concrete.



My Apache process was eating 450MB under the Virtual Memory column when I ran htop. I searched on the internet, and people said that by installing eAccelerator the system would become faster and more efficient and would use less memory and CPU. Unfortunately, this turned out to be worse than before. Now my Apache processes show 1488MB under the Virtual Memory column.





  1. Although each process shows 1488MB of memory, I can see that total RAM consumption is just 7GB, and that's with 4GB taken by Varnish Cache (I am using it as a reverse proxy).


  2. I am not sure if I should worry about the Virtual Memory column or not.


  3. After installing eAccelerator, my server has not gone down from consuming the complete 18GB of RAM and 2GB of swap space. This used to happen before. But again, it's been just 1 day since I installed eAccelerator, so maybe the issues will start coming in a day or two.


  4. Please do not suggest APC... it's not installing on my server.


  5. I have checked the phpinfo page of my server and found that eAccelerator is caching the scripts. As of now it has used up some 80MB of memory (out of the 1GB I assigned) and has cached some 900 scripts.


  6. As of now my prefork settings are -




    StartServers 8
    MinSpareServers 5
    MaxSpareServers 20
    ServerLimit 256
    MaxClients 256
    MaxRequestsPerChild 100





Please find below the screenshot of htop.



FYI -
It's a dedicated server with an 8-core CPU.
As long as my server is up and running, site performance is excellent. It loads in around 8 seconds the first time, and the second view is 2.5 seconds. The site is image-heavy as it's an ecommerce site.



[htop screenshot]

Monday, July 30, 2018

security - Returning "200 OK" in Apache on HTTP OPTIONS requests




I'm attempting to implement cross-domain HTTP access control without touching any code.



I've got my Apache(2) server returning the correct Access Control headers with this block:



Header set Access-Control-Allow-Origin "*"                   
Header set Access-Control-Allow-Methods "POST, GET, OPTIONS"


I now need to prevent Apache from executing my code when the browser sends an HTTP OPTIONS request (the method is stored in the REQUEST_METHOD environment variable), and instead return 200 OK.




How can I configure Apache to respond "200 OK" when the request method is OPTIONS?



I've tried this mod_rewrite block, but the Access Control headers are lost.



RewriteEngine On                  
RewriteCond %{REQUEST_METHOD} OPTIONS
RewriteRule ^(.*)$ $1 [R=200,L]

Answer




You're adding a header to a non-success (non-2xx) response, such as a redirect, in which case only the table corresponding to always is used in the ultimate response.



Correct "Header set":



Header always set Access-Control-Allow-Origin "*"                   
Header always set Access-Control-Allow-Methods "POST, GET, OPTIONS"
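Putting the pieces together, a minimal sketch of the relevant config from the question with the corrected directives would be:

Header always set Access-Control-Allow-Origin "*"
Header always set Access-Control-Allow-Methods "POST, GET, OPTIONS"

RewriteEngine On
RewriteCond %{REQUEST_METHOD} OPTIONS
RewriteRule ^(.*)$ $1 [R=200,L]

With the headers in the always table, they survive the 200 response generated by the RewriteRule instead of being dropped.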

WebDAV Security and Hardening




What are the security ramifications that one should be aware of when considering using WebDAV? How does one go about securing it? What else should I know about it?


Answer



WebDAV by itself doesn't have any security. It'll let anyone touch anything. The standards docs say that this should be handled in the web-server layer (or the application, if that's what is providing the WebDAV service).



Authentication
WebDAV has no native auth service, so one needs to be put in front of it. Different webservers handle this differently, depending on what DAV module you're using. Server-specific modules (mod_dav) will behave differently than those based on app-servers (like Tomcat). This is the normal HTTP auth stuff: basic, digest, SASL, Kerberos, etc.



HTTPS
Use it: without HTTPS the authentication won't be encrypted (unless you're doing IIS-based WebDAV with NTLM), and the files won't be transferred encrypted either.



Local Auth
Depending on what's driving the WebDAV, pay attention to the actual OS user that drops the files. Sometimes the DAV server will impersonate the actual user; other times it's all one user dropping files, and it's up to the application to keep users away from files they shouldn't have access to.
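A minimal Apache 2.2 sketch tying these points together (assuming mod_dav, mod_dav_fs, mod_ssl and file-based basic auth are available; the path, realm and password file are placeholders):

<Location /dav>
    # enable WebDAV for this path
    Dav On
    # refuse plain-HTTP access so credentials and files are never sent in the clear
    SSLRequireSSL
    # put an auth layer in front of DAV
    AuthType Basic
    AuthName "WebDAV"
    AuthUserFile /etc/apache2/dav.passwd
    Require valid-user
</Location>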


Sunday, July 29, 2018

ssl - Apache https redirect only redirects port

I've been banging my head on this one for a while; for some reason Apache refuses to properly implement an https redirect. I have tried it using a permanent redirect as well as a mod_rewrite rule and everything in between. Currently I only have one virtual hosts file, as I was trying to remove any unnecessary convolution. I checked the status of the Apache config and it shows the virtual hosts file in question being used.



I am trying to renew a Let's Encrypt cert, and I can't renew since it accesses the site via http. Whenever I try to access my site via http, it gives me a 400 error stating that it can't deliver an http site using port 443. So basically Apache is redirecting http port 80 traffic to port 443, but it will not redirect http to https no matter what I try.



    

<VirtualHost *:80>
    ServerName mysite.net
    RewriteEngine On

    RewriteCond %{HTTPS} !=on
    RewriteRule ^(/(.*))?$ https://%{HTTP_HOST}/$1 [R=301,L]
</VirtualHost>

<VirtualHost *:443>
    ServerName mysite.net
    RewriteEngine On
    RewriteCond %{HTTPS} !=on
    RewriteRule ^(/(.*))?$ https://%{HTTP_HOST}/$1 [R=301,L]
    ServerAdmin webmaster@localhost
    DocumentRoot /var/www/html/

    SSLEngine on
    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined
    SSLCertificateFile /etc/letsencrypt/live/mysite.net/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/mysite.net/privkey.pem
    Include /etc/letsencrypt/options-ssl-apache.conf
</VirtualHost>

Saturday, July 28, 2018

centos - What are the major practical differences between OpenSolaris and Linux?



I currently use CentOS on my server, and I've been trying to figure out the practical differences between Linux and OpenSolaris. I'm not a Linux master; I merely know my way around the system and can generally install things if I need to (though I won't lie, I get tripped up on that sometimes).




If I switch to OpenSolaris, are there going to be major things that I am unable to do now or that at least won't work the same way? My stacks mainly just consist of PHP/MySQL or Node.js/MongoDB.


Answer



OpenSolaris is being forked to OpenIndiana, and I would highly recommend using the latter, as Oracle has a tendency to close up previously-open projects. Otherwise,



OpenIndiana/Solaris Pros:





Cons:





  • Slower on most commodity hardware

  • Supports much narrower set of hardware

  • Fewer applications are ported/maintained for OpenSolaris



Other differences include file system structure, command naming and syntax, etc. There are a few good articles on the differences if you google "linux vs opensolaris", e.g.: http://linuxhelp.blogspot.com/2009/09/open-solaris-vs-linux-comparison.html, http://tuxradar.com/content/opensolaris-vs-linux



SAMP (solaris, apache, mysql, php) stacks should run just fine, assuming your hardware is all supported.


Linux FHS: /srv vs /var ... where do I put stuff?




My web development experience has started with Fedora and RHEL but I'm transitioning to Ubuntu. In Fedora/RHEL, the default seems to be using the /var folder while Ubuntu uses /srv.



Is there any reason to use one over the other and where does the line split? (It confused me so much that until very recently, I thought /srv was /svr for server/service)



My main concern deals with two types of folders




  • default www and ftp directories

  • specific application folders like:



    1. samba shares (possibly grouped under a smb folder)

    2. web applications (should these go in the www folder, or can I do a symlink to its own directory, like "___/www/wordpress" -> "/srv/wordpress")




I'm looking for best practice, industry standards, and qualitative reasons for which approach is best (or at least why it's favored).


Answer



This stems from LSB which says "/var contains variable data files. This includes spool directories and files, administrative and logging data, and transient and temporary files." but says this for /srv: "/srv contains site-specific data which is served by this system."




SuSE was one of the first distros that I used that kept webroots in /srv - typically Debian/Ubuntu/RHEL use /var/www - but also be aware that if you install a web application using yum or apt, it will likely end up in /usr/share. Also, the packaging guidelines for Fedora say that a "package, once installed and configured by a user, can use /srv as a location for data. The package simply must not do this out of the box".



On balanced reflection I would say keep to /var/www - or you can do both by making /var/www a symlink to /srv/www. I know that on Oracle RDBMS systems that I build I often create /u01, /u02, etc. as symlinks to /home/oracle. The reason for this is that many DBAs expect to find things in /u01 and many others expect /home/oracle. The same can be said of sysadmins in general - some will instinctively look in /var/www and some in /srv/www, while others like myself will grep the Apache config for the DocumentRoot.
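For example, a rough sketch of the "do both" approach on a fresh box (adjust paths and test before trying this on a live server):

mkdir -p /srv/www
# copy any existing content across, then swap /var/www for a symlink
rsync -a /var/www/ /srv/www/
mv /var/www /var/www.old
ln -s /srv/www /var/www

Keep the /var/www.old copy around until you have verified every site still serves correctly.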



Hope this provides some guidance for you.


hp proliant - Replace HDD in "predictive failure" state with spare HDD in HP Smart Array P400 controller



I have an HP Smart Array P400 RAID controller with six physical disks configured in RAID 5. Four of the physical disks are in "OK" state, one is in "Predictive Failure" state, and one is in "spare" state:



logicaldrive 1 (3.6 TB, RAID 5, OK)


physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 1 TB, Predictive Failure)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 1 TB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 1 TB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 1 TB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA, 1 TB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA, 1 TB, OK, spare)


Is it safe to add (hpacucli controller slot=6 array A add drives=1:6) the "spare" drive to the RAID 5 and, once its drive type changes from "Spare Drive" to "Data Drive", then remove (either physically or maybe with hpacucli controller slot=6 array A remove drives=1:1) the HDD in the "Predictive Failure" state?



Answer



Wait until the drive fails or physically remove or replace it.



Seeing "Predictive Failure" is an indication to order a new drive.



The best course of action is to get a new drive and replace the failing drive with it (and not to rely on the spare rebuild).


Friday, July 27, 2018

raid - HP SmartArray P400: How to repair failed logical drive?



I have an HP server with a Smart Array P400 controller (incl. 256 MB cache with battery backup) and a logical drive whose failed physical drive has been replaced but which does not rebuild.



This is how it looked when I detected the error:





~# /usr/sbin/hpacucli ctrl slot=0 show config
Smart Array P400 in Slot 0 (Embedded) (sn: XXXX)

array A (SATA, Unused Space: 0 MB)
logicaldrive 1 (698.6 GB, RAID 1, OK)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 750 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 750 GB, OK)


array B (SATA, Unused Space: 0 MB)
logicaldrive 2 (2.7 TB, RAID 5, Failed)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 750 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 750 GB, OK)
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA, 750 GB, OK)
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA, 750 GB, Failed)
physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SATA, 750 GB, OK)

unassigned
physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SATA, 750 GB, OK)

~#


I thought that I had drive 2I:1:8 configured as a spare for array A and array B, but it seems this was not the case :-(. I noticed the problem due to I/O errors on the host, even though only 1 physical drive of the RAID 5 has failed.



Does someone know why this could happen? The logical drive should go into "Degraded" mode but still be fully accessible from the host OS!?



I first tried to add the unassigned drive 2I:1:8 as a spare to logicaldrive 2, but this was not possible:





~# /usr/sbin/hpacucli ctrl slot=0 array B add spares=2I:1:8
Error: This operation is not supported with the current configuration.
Use the "show" command on devices to show additional details
about the configuration.
~#


Interestingly, it is possible to add the unassigned drive to the first array without problems. I thought maybe the controller put the array into "failed" state due to the missing spare and protects failed arrays from modification. So the next thing I tried was to re-enable the logical drive (to add the spare afterwards):





~# /usr/sbin/hpacucli ctrl slot=0 ld 2 modify reenable
Warning: Any previously existing data on the logical drive may not
be valid or recoverable. Continue? (y/n) y

Error: This operation is not supported with the current configuration.
Use the "show" command on devices to show additional details
about the configuration.
~#



But as you can see, re-enabling the logical drive was not possible.



Now I replaced the failed drive by hotswapping it with the unassigned drive. The status now looks like this:




~# /usr/sbin/hpacucli ctrl slot=0 show config
Smart Array P400 in Slot 0 (Embedded) (sn: XXXX)

array A (SATA, Unused Space: 0 MB)
logicaldrive 1 (698.6 GB, RAID 1, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 750 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 750 GB, OK)

array B (SATA, Unused Space: 0 MB)
logicaldrive 2 (2.7 TB, RAID 5, Failed)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 750 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 750 GB, OK)
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA, 750 GB, OK)
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA, 750 GB, OK)
physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SATA, 750 GB, OK)

~#


The logical drive is still not accessible. Why is it not rebuilding?



What can I do?



FYI, this is the configuration of my controller:





~# /usr/sbin/hpacucli ctrl slot=0 show
Smart Array P400 in Slot 0 (Embedded)
Bus Interface: PCI
Slot: 0
Serial Number: XXXX
Cache Serial Number: XXXX
RAID 6 (ADG) Status: Enabled
Controller Status: OK
Chassis Slot:
Hardware Revision: Rev E

Firmware Version: 5.22
Rebuild Priority: Medium
Expand Priority: Medium
Surface Scan Delay: 15 secs
Surface Analysis Inconsistency Notification: Disabled
Raid1 Write Buffering: Disabled
Post Prompt Timeout: 0 secs
Cache Board Present: True
Cache Status: OK
Accelerator Ratio: 25% Read / 75% Write

Drive Write Cache: Disabled
Total Cache Size: 256 MB
No-Battery Write Cache: Disabled
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
SATA NCQ Supported: True
~#



Thanks for your help in advance.


Answer



The answer is not pleasant. There's a high probability that your array is in a "waiting for rebuild" state, where there's another failing disk in the RAID5 array set that's preventing the recovery from completing. This is why you should avoid RAID5 these days. It doesn't help that these are SATA drives... The likelihood of problems is even higher. Try powering the system off (letting the drives spin down) and powering back on. Follow the prompts at the BIOS array screen and choose the F2 option to "reenable all logical drives". This may kickstart the rebuild process.



Otherwise, it's a rebuild/recovery with new disks.


email - Received-SPF: neutral




When I send emails from my application I am getting an SPF neutral result. I have been working with Google and my hosting company, but neither of them can figure it out. Below is my SPF record.




"v=spf1 include:s920.tmd.cloud include:mx1.tmdhosting.com include:mx2.tmdhosting.com ip4:184.154.73.81 ip4:108.178.0.170 ip4:198.143.161.162 ip4: include:_spf.google.com ~all"


Below is a snip of the email meta data.



    ARC-Authentication-Results: i=1; mx.google.com;
dkim=temperror (no key for signature) header.i=@holyfirepublishing.com header.s=default header.b=HRuHEiL6;
spf=neutral (google.com: 108.178.0.170 is neither permitted nor denied by best guess record for domain of publisher@holyfirepublishing.com) smtp.mailfrom=publisher@holyfirepublishing.com
Return-Path:
Received: from mx1.tmdhosting.com (mx1.tmdhosting.com. [108.178.0.170])

by mx.google.com with ESMTPS id b67-v6si3713737ioj.9.2018.04.28.17.31.24
for
(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
Sat, 28 Apr 2018 17:31:24 -0700 (PDT)
Received-SPF: neutral (google.com: 108.178.0.170 is neither permitted nor denied by best guess record for domain of publisher@holyfirepublishing.com) client-ip=108.178.0.170;
Authentication-Results: mx.google.com;
dkim=temperror (no key for signature) header.i=@holyfirepublishing.com header.s=default header.b=HRuHEiL6;
spf=neutral (google.com: 108.178.0.170 is neither permitted nor denied by best guess record for domain of publisher@holyfirepublishing.com) smtp.mailfrom=publisher@holyfirepublishing.com
Received: from [184.154.73.81] (helo=s920.tmd.cloud) by mx1.tmdhosting.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89) (envelope-from ) id 1fCaFP-0005U5-6t for test@holyfirepublishing.com; Sat, 28 Apr 2018 19:31:24 -0500



I could really use some help building my SPF record.



Thanks in advance.


Answer



Your IN SPF "v=spf1 include:s920.tmd.cloud include:mx1.tmdhosting.com include:mx2.tmdhosting.com ip4:184.154.73.81 ip4:108.178.0.170 ip4:198.143.161.162 ip4: include:_spf.google.com ~all" has several problems.




  • Use TXT instead of SPF (RFC 7208, 3.1).

  • In general, you should avoid multiple includes, as there is a maximum number of DNS queries allowed per SPF check. For the same reason, direct ip4 and ip6 mechanisms are always best.


  • Here, you have includes that don't contain SPF records. They should probably use the a mechanism instead. Only "include" existing SPF records.

  • You should list a server only once, preferably using ip4. As s920.tmd.cloud A 184.154.73.81, mx1.tmdhosting.com A 108.178.0.170 & mx2.tmdhosting.com A 198.143.161.162, the a mechanisms from the previous point can be removed.

  • The empty ip4: is a syntax error.

  • While + for Pass is the default qualifier, I find it easier for beginners to use it explicitly, to avoid confusion with the exists/include mechanisms and the redirect/exp modifiers, which don't take qualifiers.



We can assume you have the rest as you desire:




  • The results suggest that at least the MX 108.178.0.170 is used for outgoing mail, so probably all three IP addresses are OK.


  • The last include allows Gmail. Let's just assume you are using it for this domain.

  • ~all: soft fail for the rest. I agree that you shouldn't use (hard) fail before you have more experience with SPF and can be sure it won't cause any problems.



Result:



IN TXT "v=spf1 +ip4:184.154.73.81 +ip4:108.178.0.170 +ip4:198.143.161.162 include:_spf.google.com ~all"
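Once the record is published, you can sanity-check what resolvers actually see with something like:

dig +short TXT holyfirepublishing.com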

Thursday, July 26, 2018

domain name system - Does anycast allow for alternate routes to be attempted?



A bit of a bodged up title but I don't know enough of the subject to come up with a more suitable one.




I've read time and time again that anycast is a great solution for load balancing and is the preferred solution for DNS load balancing. However, I am wondering: anycast only appears to offer the advantage of load balancing and provides no help with redundancy. Whereas a plain DNS solution with no load balancing (i.e. just multiple A records) doesn't offer any load balancing but does appear to offer better redundancy.



I have been taking a closer look at DNS services and noticed that in 2016 Dyn suffered an outage: https://en.wikipedia.org/wiki/2016_Dyn_cyberattack . But a few things:



1) If something goes wrong with the server behind a particular anycast announcement, are other routes automatically tried? If so, why did Dyn suffer such an outage - or is this due to DNS running on UDP?



For example, if we are trying to connect to a blue node, and follow the route 1-2-6, and find route 6 is broken (cannot connect to server or some error), will routes 1-2-5 or 1-3-4 automatically be tried?



[network route diagram]




2) Is there anything that a client could do to mitigate this problem?



3) It seems to me that anycast is more likely to sacrifice a particular region to keep other regions online, as opposed to more of a DNS round-robin affair that would not offer the same performance but would offer better cushioning of such an attack. So why is it (assuming my thoughts are correct) that there seems to be a big push for anycast and less of a push for round-robin DNS services that would return the order of servers relevant to the user?



I'm aware of this question Multiple data centers and HTTP traffic: DNS Round Robin is the ONLY way to assure instant fail-over? although I don't consider this a duplicate as I'm interested in the reasons why anycast can fail as it does.


Answer



So first off let's briefly review what's implied by using anycast for DNS:




  1. A given IP address a is the resolver that we wish to make more available. The a host is a member of the A /24 subnet. Anycast can be accomplished with specific host routes (i.e. a/32) but this is generally only seen within private networks, not on the general Internet.



  2. There is some mechanism in place such that the A subnet is dynamically announced only when the corresponding DNS service is operational. Please note (and this is really important) that the advertisement itself could be coming from a single host within a site that runs a resolver, or from an entire physical site containing multiple instances of said resolver (i.e. many hosts running resolvers, the site as a whole sharing a single route).


  3. The same route (A) will be advertised from multiple points on the public Internet. This might take the form of a large provider (read: points of presence dispersed across the globe) presenting the same route at each point of interconnection with foreign networks or the same route coming from points hosted within multiple carriers.




So - when an arbitrary client sends a packet toward the anycast IP, said packet will tend to find its way to the "closest" point of advertisement. I've put scare-quotes around closest because it's only close in the sense of how the routing topology has been laid out and what policies are in place for the routers along the way. It's entirely possible that the closest instance of the anycast address might actually be the furthest away physically.



If, in turn, the point at which this route is advertised fails (...which could be result of the service failing on the host and the route retracting or a more traditional network reachability issue) then packets bound to the anycast address will be routed to the next-closest (again - in routing protocol terms) instance of the route. During network reconvergence the client's resolution might fail and be re-attempted, with the re-attempt now following a longer path to reach what is - apparently - the same address. This is all transparent to both the client process and the user and is best thought of in network terms as following an alternate path to a given network.



It's sometimes helpful to think of an anycast network as a logical construct. It's a virtual subnet that contains the service you're interested in. That virtual subnet is reachable via many paths through the network.




That said, here are the major caveats to anycast designs:




  1. Since there's no guarantee that a given packet to the anycast IP will reach the same physical host, this approach really only maps to connectionless protocols.


  2. The reliability of the solution is only as good as the logic tying the correct operation of the service to the advertisement of the route. If the service dies and the route continues to be advertised then there will be a potentially significant black hole.


  3. Getting the anycast route advertisements well- and properly- distributed across the public Internet is not trivial. It's very easy to create hot-spots: a particular instance of an anycast route that happens to be preferable to most clients. This is still a potentially decent HA solution (for the easier types of failures) but it doesn't speak to load balancing.




Now - finally - with all of this laid out, your question is easier to answer:




There's nothing inherent to anycast that makes it more resistant to DDoS. Each of the potentially millions of flows of DDoS traffic will find its way to its nearest instance, likely making it unavailable to any other legitimate clients who would otherwise be routed to that point.



Now, if the vast majority of the hosts on the botnets in use happened to be in, say, Eastern Europe and one of the anycast routes happened to be originated in a nearby PoP (again - "nearby" in terms of routing topology) then this traffic would be sunk to one point while much of the rest of the world continued to resolve to the same route that was also hosted at convenient points on other continents. In this particular case anycast would arguably be one of the best mechanisms to minimize the damage of a DDoS attack. This is highly contingent on how the anycast routes have been distributed and how policy has been configured (see #3 above - not a trivial problem).



Clearly this use-case isn't as compelling in the case of a DDoS attack that's truly distributed. If properly engineered, though, the localization of the anycast routes means that the attack load can now be spread across an arbitrary number of geographically dispersed physical hosts. This will tend to dilute the effect of the attack on the target as well as potentially spreading the impact across a bigger chunk of the network. Again - a huge amount is contingent on how things have been engineered and configured.



Why is this considered a win over round-robin? Simply because it's possible to deploy an arbitrary number of hosts without the need for separate load-balancers on the individual IP's and there's also no reliance on the timeout value for particular clients deciding to move over to another resolver. One could literally deploy a thousand hosts within a single data center with the same IP and balance the traffic accordingly (nb - obviously massive practical limits based on size of ECMP tables, etc) or deploy a thousand geographically disparate sites each with a thousand hosts. All this could be accomplished without changing a client configuration, without the (admittedly usually clustered) point of failure of a load balancer, etc. In short - when properly engineered it scales as well as the Internet as a whole.


scheduled task - cron - How many times will the cron job run when given an asterisk (*) in all positions?




If we define a cron job with * * * * * /some/task/to/perform, how many times will the job be executed in 60 seconds?


Answer



The cron job runs every minute.



Unix cron is limited to minutes. If you want faster cronjob execution see How can I schedule a cron job that runs every 10 seconds in linux?


Wednesday, July 25, 2018

domain name system - Where does Active Directory-integrated DNS store its data?



This has been bugging me for a while.



We all know Active Directory is a LDAP database.




We also know that the Windows DNS service, when running on a domain controller, can store its data in AD instead of plain text zone files, thus taking advantage of AD automatic replication and removing the need for primary/secondary DNS servers.



The question: where and how are DNS data actually stored in Active Directory?



Can they be accessed using LDAP tools such as ADSIEdit?
Is any DNS entry an actual LDAP object?
An attribute in an object?
Something entirely different?


Answer



Here is an article I found that may get you started. I can never remember the path to the records off the top of my head.



As it mentions, you can basically find your DNS information in AD at this path:




DC=<zone name>,CN=MicrosoftDNS,CN=System,DC=<domain>,DC=<tld>


So if you had the domain example.org, you would see it at:



DC=example.org,CN=MicrosoftDNS,CN=System,DC=example,DC=org


Your questions:





Is any DNS entry an actual LDAP object?




Your zones will have an object class of dnsZone. Under the zone, all your records are stored as objects of the class dnsNode.




Can they be accessed using LDAP tools such as ADSIEdit?





Yes, fire up adsiedit or ldp and browse to the above location.


same SSL certificate in two servers



I want to add the same SSL certificate to two servers. They both use the same domain. On the first server it works OK, but on the second one everything appears fine until I close IIS and check the Server Certificates list again: the certificate has disappeared.



I have googled and read forums, but I cannot find a solution for why the certificate disappears.


Answer



Maybe I'm missing something. In my understanding, an SSL cert is FQDN-specific, not machine-specific (although maybe there is a type of SSL certificate that is machine-specific), and you can export the cert from one server in PFX format and import it to another server, which is what I've done with my primary and standby Exchange servers for the past 5 years.


Secure Email for Hosted Website?

A friend of mine has been running a small non-profit agency for some years now that assists refugees (shelter, food, medical supplies) displaced by war.



Due to current events, she has asked me whether she and her staff can secure their emails.



Their website is hosted on GoDaddy. Is there any service that can enhance their emails by adding a form of encryption that can be used by non-technical staff?



All the methods that I can think of would be above their technical skill level. I am looking for a service with a good trade-off between security and user-friendliness.

Tuesday, July 24, 2018

networking - CIDR for Dummies



I understand what CIDR is, and what it is used for, but I still can't figure out how to calculate it in my head. Can someone give a "for dummies" type explanation with examples?


Answer



CIDR (Classless Inter-Domain Routing, pronounced "kidder" or "cider" - add your own local variant to the comments!) is a system of defining the network part of an IP address (usually people think of this as a subnet mask). The reason it's "classless" is that it allows a way to break IP networks down more flexibly than their base class.



When IP networks were first defined, IPs had classes based on their binary prefix:



Class    Binary Prefix    Range                        Network Bits
A        0*               0.0.0.0-127.255.255.255      8
B        10*              128.0.0.0-191.255.255.255    16
C        110*             192.0.0.0-223.255.255.255    24
D        1110*            224.0.0.0-239.255.255.255
E        1111*            240.0.0.0-255.255.255.255


(Note that this is the source of people referring to a /24 as a "class C", although that's not a strictly true comparison because a class C needed to have a specific prefix)



These binary prefixes were used for routing large chunks of IP space around. This was inefficient because it resulted in large blocks being assigned to organizations who didn't necessarily need them, and also because Class Cs could only be assigned in 24 bit increments, meaning that routing tables could get unnecessarily large as multiple Class Cs were routed to the same location.




CIDR was defined to allow variable length subnet masks (VLSM) to be applied to networks. As the name implies, address groups, or networks, can be broken down into groups that have no direct relationship to the natural "class" they belong to.



The basic premise of VLSM is to provide the count of the number of network bits in a network. Since an IPv4 address is a 32-bit integer, the VLSM will always be between 0 and 32 (although I'm not sure in what instance you might have a 0-length mask).



The easiest way to start calculating VLSM/CIDR in your head is to understand the "natural" 8-bit boundaries:



CIDR    Dotted Quad
/8      255.0.0.0
/16     255.255.0.0
/24     255.255.255.0
/32     255.255.255.255


(By the way, it's perfectly legal, and fairly common in ACLs, to use a /32 mask. It simply means that you are referring to a single IP)



Once you grasp those, it's simple binary arithmetic to move up or down to get the number of hosts. For instance, if a /24 has 256 IPs (let's leave off network and broadcast addresses for now, that's a different networking theory question), increasing the subnet by one bit (to /25) will reduce the host space by one bit (to 7 bits), meaning there will be 128 IPs.
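If you want to sanity-check the arithmetic, the total address count is just 2^(32 - prefix); for example, in a shell:

prefix=26
echo $((1 << (32 - prefix)))
# prints 64 total addresses; subtract 2 for the network and broadcast addresses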



Here's a table of the last octet. This table can be shifted to any octet to get the dotted quad equivalent.



CIDR    Dotted Quad
/24     255.255.255.0
/25     255.255.255.128
/26     255.255.255.192
/27     255.255.255.224
/28     255.255.255.240
/29     255.255.255.248
/30     255.255.255.252
/31     255.255.255.254
/32     255.255.255.255



As an example of shifting these to another octet, /18 (which is /26 minus 8 bits, so shifted an octet) would be 255.255.192.0.


Monday, July 23, 2018

postfix - Best Practices for Open Relay Email Server



I have a scenario where I need to set up Postfix with no TLS, no SMTP authentication, and relaying allowed from only one remote IP address.
Emails from this remote IP may have spoofed "from" addresses as well.



I know, don't ask how I got to this point...




My concern is that my server will be blacklisted in the future.



What are the best practices for managing an open relay server so that it will not be blacklisted?



Thanks in advance.


Answer



It's not an open relay if you are merely accepting any mail from a single IP address. (Open relays accept any mail from anywhere.)



In this case, simply add the IP address to mynetworks in your Postfix main.cf.
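For example (a sketch only; 203.0.113.10 stands in for the remote IP you need to relay for), in /etc/postfix/main.cf:

mynetworks = 127.0.0.0/8 203.0.113.10/32

Then run postfix reload so the change takes effect. With Postfix's default restrictions, hosts listed in mynetworks are permitted to relay and everything else is rejected.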




Oh, and don't send spam.


How to create custom 404 error page for all users in apache?

I need to create a 404 error page, but it should work for all users. If I use ErrorDocument 404 /404.php, it displays "Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request." on users' web sites. How do I create it without placing 404.php in every user's DocumentRoot?

domain name system - Are nameservers with heterogeneous TLDs acceptable in DNS records?




I am trying to migrate my nameservers from GoDaddy to Amazon Route 53. I would like to edit my DNS record to first add the Amazon Route 53 servers, let that propagate, then remove the GoDaddy servers.



Is it acceptable for my DNS record to have multiple nameservers on different hosts with different top-level domains, provided they all return identical Zone files (ie they all return the same A, CNAME and MX records)? GoDaddy said it could break things, but couldn't explain how.



Thanks,



-Eric


Answer



You'll want the servers to present the same SOA information, too - particularly, the zone serial number. This may not be possible, as it's unlikely that you're able to control this with these providers.




It's not really buying you anything to have it split like that, though - might as well just change over all at once.



Set up all the records on the Amazon servers, then switch the domain to point to them. It will take some time for clients to switch completely off of the GoDaddy servers, but regardless of whether a resolver has cached the old GoDaddy delegation, or switches to the new Amazon delegation, they'll have working name resolution - and you won't be presenting potentially conflicting information from two different SOA records.
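After switching the delegation at the registrar, you can watch the change take effect with something like (the domain is a placeholder):

dig NS example.com +short

Old resolvers will keep returning the GoDaddy nameservers until their cached delegation expires; either way they resolve against a complete set of records.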


ubuntu - apt-get update can not connect

I run a dedicated Ubuntu 10.04 server.




I use kvm/libvirt/virsh to run a virtual machine that's also Ubuntu 10.04.



I bridged the network (1 of 2 IPs is routed via NAT to the LAN, where my VM (192.168.1.111) picks it up).



I can locally connect to the VM via SSH; from the VM I can ping any site outside my network.



I routed port 80 (and others) through using iptables, and I can connect from the outside to Apache on the VM.



Yet apt is not working at all, which kills me because I can't install anything...




$ apt-get update


leads to a series of errors like this:



W: Failed to fetch http://de.archive.ubuntu.com/ubuntu/dists/lucid-backports/multiverse/binary-amd64/Packages.gz  Unable to connect to de.archive.ubuntu.com:http: [IP: 141.30.13.30 80]


I can ping the domain and IP from the terminal without a problem.




I can resolveip the domain without a problem.



I tried all the /etc/apt/sources.list variations I found on the net: the one working from my dedicated machine, the default list, several hand-compiled lists. The result is always the same: unable to connect.



I think it is some kind of routing problem, but I am really puzzled, because I seem to have full network access from the VM. As the packages are not installed I can't wget or ftp from the VM terminal (and I can't compile them as no gcc is installed - I wanted to do all that using apt ;) ). Oh, aptitude is the same of course...



HELP!



P.S. here are my iptables settings:




iptables -t nat -L -v
Chain PREROUTING (policy ACCEPT 86 packets, 14254 bytes)
pkts bytes target prot opt in out source destination
0 0 DNAT tcp -- any any anywhere anywhere tcp dpt:https to:192.168.1.111:443
0 0 DNAT tcp -- any any anywhere anywhere tcp dpt:ftp to:192.168.1.111:21
13 780 DNAT tcp -- any any anywhere anywhere tcp dpt:www to:192.168.1.111:80

Chain POSTROUTING (policy ACCEPT 31 packets, 2236 bytes)
pkts bytes target prot opt in out source destination

0 0 MASQUERADE tcp -- any any 192.168.1.0/24 !192.168.1.0/24 masq ports: 1024-65535
1 76 MASQUERADE udp -- any any 192.168.1.0/24 !192.168.1.0/24 masq ports: 1024-65535
1 84 MASQUERADE all -- any any 192.168.1.0/24 !192.168.1.0/24

iptables -L -v
Chain INPUT (policy ACCEPT 1699 packets, 354K bytes)
pkts bytes target prot opt in out source destination
18 1179 ACCEPT udp -- virbr0 any anywhere anywhere udp dpt:domain
0 0 ACCEPT tcp -- virbr0 any anywhere anywhere tcp dpt:domain
2 656 ACCEPT udp -- virbr0 any anywhere anywhere udp dpt:bootps

0 0 ACCEPT tcp -- virbr0 any anywhere anywhere tcp dpt:bootps

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
2448 3146K ACCEPT all -- any any anywhere 192.168.1.0/24 state NEW,RELATED,ESTABLISHED
0 0 ACCEPT all -- any virbr0 anywhere 192.168.1.0/24 state RELATED,ESTABLISHED
1448 79657 ACCEPT all -- virbr0 any 192.168.1.0/24 anywhere
0 0 ACCEPT all -- virbr0 virbr0 anywhere anywhere
0 0 REJECT all -- any virbr0 anywhere anywhere reject-with icmp-port-unreachable
0 0 REJECT all -- virbr0 any anywhere anywhere reject-with icmp-port-unreachable



@g-bach



Okay, below are the filter rules (iptables -L -v -t filter).



About the architecture: the host has 2 IPs mapped to eth0 and eth1.
eth1 is bridged for libvirt and should route/masquerade to different VMs (usually we have no overlapping ports open for the VMs - at least not below 1024).



After playing around a bit more, I can specify the problem a bit better:




It's the firewall (iptables) rules. I obviously don't get how to set up iptables (I've never done that before).
When I played around with them wildly, I got different things to work and others not (connections to Ubuntu servers worked, but no incoming connections worked anymore, etc.).



Hence, you were right and the bridge etc. is okay. About the pinging and connecting from the VM to the outside: it's not working with wget and ssh/telnet. There is an initial connection, but then no data is sent (I routed 20, 21 and 22 through). Also, for example, I can install Apache and WordPress in the VM and connect to it from the outside, but then WordPress can't establish an FTP connection to fetch updates etc.



iptables -L -v -t filter



Chain INPUT (policy ACCEPT 19574 packets, 7015K bytes)
pkts bytes target prot opt in out source destination

27 1757 ACCEPT udp -- virbr0 any anywhere anywhere udp dpt:domain
0 0 ACCEPT tcp -- virbr0 any anywhere anywhere tcp dpt:domain
43 14104 ACCEPT udp -- virbr0 any anywhere anywhere udp dpt:bootps
0 0 ACCEPT tcp -- virbr0 any anywhere anywhere tcp dpt:bootps

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
1850 119K ACCEPT all -- any any anywhere 192.168.1.0/24 state NEW,RELATED,ESTABLISHED
0 0 ACCEPT all -- any virbr0 anywhere 192.168.1.0/24 state RELATED,ESTABLISHED
1538 308K ACCEPT all -- virbr0 any 192.168.1.0/24 anywhere

0 0 ACCEPT all -- virbr0 virbr0 anywhere anywhere
0 0 REJECT all -- any virbr0 anywhere anywhere reject-with icmp-port-unreachable
0 0 REJECT all -- virbr0 any anywhere anywhere reject-with icmp-port-unreachable

Chain OUTPUT (policy ACCEPT 5787 packets, 635K bytes)
pkts bytes target prot opt in out source destination


Thanks for your efforts!

Sunday, July 22, 2018

linux - server with high CPU and memory usage

I have a server with one WordPress website on it, but the server's resources are almost exhausted. Please check the image:



here



Where should I start to figure out the problem? Should I get a server with more resources?



The server has one website with a 600MB DB and about 900 visits per day.
Some details about the server:

1- mpm prefork



    

StartServers 100
#StartServers 10
MinSpareServers 100
MaxSpareServers 400
MaxRequestWorkers 800
MaxConnectionsPerChild 800

ServerLimit 800




2- my.cnf



 key_buffer      = 1G

key_buffer = 512M


max_allowed_packet= 512M

max_connections = 10000

max_connections = 1000

innodb_buffer_pool_size = 10G

innodb_log_file_size = 10G


innodb_file_per_table = 1

innodb_autoextend_increment=256

innodb_buffer_pool_size=10G

innodb_buffer_pool_instances=4

innodb_buffer_pool_instances=2


innodb_log_file_size = 104857600

innodb_log_files_in_group = 5

innodb_log_buffer_size = 268435456

innodb_io_capacity = 10000

innodb_io_capacity = 1000


thread_cache_size = 16

thread_cache_size = 8



key_buffer = 16M

max_allowed_packet = 16M


thread_stack = 192K

thread_cache_size = 16

thread_cache_size = 8

query_cache_type = 1

query_cache_limit = 20M


query_cache_limit = 10M

query_cache_size = 100M

query_cache_size = 50M



tmp_table_size = 512M


table_open_cache_instances = 16


slow_query_log = 1

slow_query_log_file = /var/log/mysql/slow.log


3- meminfo




MemTotal:       32459956 kB
MemFree:         6863712 kB
MemAvailable:   13162120 kB
Buffers:         1393928 kB
Cached:          5670744 kB
SwapCached:        39276 kB
Active:         20290788 kB
Inactive:        4120816 kB
Active(anon):   17336008 kB
Inactive(anon):   973836 kB
Active(file):    2954780 kB
Inactive(file):  3146980 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       1046520 kB
SwapFree:         240992 kB
Dirty:              9088 kB
Writeback:             0 kB
AnonPages:      17307716 kB
Mapped:           263800 kB
Shmem:            962912 kB
Slab:             742360 kB
SReclaimable:     601588 kB
SUnreclaim:       140772 kB
KernelStack:        9760 kB
PageTables:       179828 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    17276496 kB
Committed_AS:   33425764 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      329084 kB
VmallocChunk:   34358947836 kB
HardwareCorrupted:     0 kB
AnonHugePages:     20480 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      295140 kB
DirectMap2M:    15108096 kB
DirectMap1G:    17825792 kB
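One thing that stands out in this meminfo is that Committed_AS already exceeds MemTotal, so the prefork limits (up to 800 workers) combined with the 10 GB InnoDB buffer pool may be overcommitting the 32 GB of RAM. A rough, hedged way to measure the real per-child Apache footprint (assuming the processes are named apache2; substitute httpd on RHEL-style systems):

ps -ylC apache2 | awk 'NR>1 {sum+=$8; n++} END {if (n) printf "%d processes, %.1f MiB average RSS, %.1f GiB total\n", n, sum/n/1024, sum/1024/1024}'

Multiplying the average RSS by MaxRequestWorkers gives a worst-case Apache memory figure to compare against what is left after MySQL takes its share.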

debian - HP DL380 G7 + Smart Array P410i + sysbench -> poor raid 10 performance



I have a running system with low I/O utilization:




  1. HP DL380 G7 (24 GB RAM)
  2. Smart Array P410i with 512 MB battery-backed write cache
  3. 6x SAS 10k rpm 146 GB drives in RAID 10
  4. Debian Squeeze Linux, ext4 + LVM, hpacucli installed




iostat (cciss/c0d1 = RAID 10 array, dm-7 = 60G LVM partition used for testing):




Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
cciss/c0d0 0,00 101,20 0,00 6,20 0,00 0,42 138,58 0,00 0,00 0,00 0,00
cciss/c0d1 0,00 395,20 3,20 130,20 0,18 2,05 34,29 0,04 0,26 0,16 2,08
dm-0 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
dm-2 0,00 0,00 3,20 391,00 0,18 1,53 8,87 0,04 0,11 0,05 1,84

dm-3 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
dm-4 0,00 0,00 0,00 106,80 0,00 0,42 8,00 0,00 0,00 0,00 0,00
dm-5 0,00 0,00 0,00 0,60 0,00 0,00 8,00 0,00 0,00 0,00 0,00
dm-6 0,00 0,00 0,00 2,80 0,00 0,01 8,00 0,00 0,00 0,00 0,00
dm-1 0,00 0,00 0,00 132,00 0,00 0,52 8,00 0,00 0,02 0,01 0,16
dm-7 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
dm-8 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00


hpacucli "ctrl all show config"





Smart Array P410i in Slot 0 (Embedded) (sn: 5001438011FF14E0)

array A (SAS, Unused Space: 0 MB)


logicaldrive 1 (136.7 GB, RAID 1, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)

physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)

array B (SAS, Unused Space: 0 MB)


logicaldrive 2 (410.1 GB, RAID 1+0, OK)

physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 146 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 146 GB, OK)
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS, 146 GB, OK)

physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS, 146 GB, OK)
physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS, 146 GB, OK)
physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SAS, 146 GB, OK)

SEP (Vendor ID PMCSIERA, Model SRC 8x6G) 250 (WWID: 5001438011FF14EF)


hpacucli "ctrl all show status"





Smart Array P410i in Slot 0 (Embedded)
Controller Status: OK
Cache Status: OK
Battery/Capacitor Status: OK


Sysbench command




sysbench --init-rng=on --test=fileio --num-threads=16 --file-num=128 --file-block-size=4K --file-total-size=54G --file-test-mode=rndrd --file-fsync-freq=0 --file-fsync-end=off run --max-requests=30000



Sysbench results




sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 16
Initializing random number generator from timer.



Extra file open flags: 0
128 files, 432Mb each
54Gb total file size
Block size 4Kb
Number of random requests for random IO: 30000
Read/Write ratio for combined random IO test: 1.50
Using synchronous I/O mode
Doing random read test

Threads started!
Done.

Operations performed: 30000 Read, 0 Write, 0 Other = 30000 Total
Read 117.19Mb Written 0b Total transferred 117.19Mb (935.71Kb/sec)
233.93 Requests/sec executed

Test execution summary:
total time: 128.2455s
total number of events: 30000

total time taken by event execution: 2051.5525
per-request statistics:
min: 0.00ms
avg: 68.39ms
max: 2010.15ms
approx. 95 percentile: 660.40ms

Threads fairness:
events (avg/stddev): 1875.0000/111.75
execution time (avg/stddev): 128.2220/0.02



iostat during test




avg-cpu: %user %nice %system %iowait %steal %idle
0,00 0,01 0,10 31,03 0,00 68,86

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
cciss/c0d0 0,00 0,10 0,00 0,60 0,00 0,00 9,33 0,00 0,00 0,00 0,00

cciss/c0d1 0,00 46,30 208,50 1,30 0,82 0,10 8,99 29,03 119,75 4,77 100,00
dm-0 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
dm-2 0,00 0,00 0,00 51,60 0,00 0,20 8,00 49,72 877,26 19,38 100,00
dm-3 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
dm-4 0,00 0,00 0,00 0,70 0,00 0,00 8,00 0,00 0,00 0,00 0,00
dm-5 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
dm-6 0,00 0,00 0,00 0,00 0,00 0,00 0,00 7,00 0,00 0,00 100,00
dm-1 0,00 0,00 0,00 0,00 0,00 0,00 0,00 7,00 0,00 0,00 100,00
dm-7 0,00 0,00 208,50 0,00 0,82 0,00 8,04 25,00 75,29 4,80 100,00
dm-8 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00



Bonnie++ v1.96




cmd: /usr/sbin/bonnie++ -c 16 -n 0

Writing a byte at a time...done
Writing intelligently...done
Rewriting...done

Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 16 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
seo-db 48304M 819 99 188274 17 98395 8 2652 78 201280 8 265.2 1
Latency 14899us 726ms 15194ms 100ms 122ms 665ms

1.96,1.96,seo-db,16,1337541936,48304M,,819,99,188274,17,98395,8,2652,78,201280,8,265.2,1,,,,,,,,,,,,,,,,,,14899us,726ms,15194ms,100ms,122ms,665ms,,,,,,



Questions



So, sysbench showed 234 random reads per second; I expected at least 400.
What could the bottleneck be? LVM?
Another system with mdadm RAID 1 and 2x 7200 rpm drives shows over 200 random reads per second...



Thanks for any help!


Answer



Your system is definitely underperforming based on your hardware specifications. I loaded the sysbench utility on a couple of idle HP ProLiant DL380 G6/G7 servers running CentOS 5/6 to check their performance. These are normal fixed partitions instead of LVM. (I don't typically use LVM, because of the flexibility offered by HP Smart Array controllers)




The DL380 G6 has a 6-disk RAID 1+0 array on a Smart Array P410 controller with 512MB of battery-backed cache. The DL380 G7 has a 2-disk enterprise SLC SSD array. The filesystems are XFS. I used the same sysbench command line as you did:



sysbench --init-rng=on --test=fileio --num-threads=16 --file-num=128 --file-block-size=4K --file-total-size=54G --file-test-mode=rndrd --file-fsync-freq=0 --file-fsync-end=off --max-requests=30000 run


My results were 1595 random reads per second across the six disks.

On SSD, the result was 39047 random reads per second. Full results are at the end of this post...




  • As for your setup, the first thing that jumps out at me is the size of your test partition. You're nearly filling the 60GB partition with 54GB of test files. I'm not sure if ext4 has an issue performing at 90+%, but that's the quickest thing for you to modify and retest. (or use a smaller set of test data)



  • Even with LVM, there are some tuning options available for this controller/disk setup. Checking the read-ahead value and changing the I/O scheduler from the default cfq to deadline or noop is helpful (a sketch of these commands follows this list). Please see the question and answers at: Linux - real-world hardware RAID controller tuning (scsi and cciss)


  • What is your RAID controller cache ratio? I typically use a 75%/25% write/read balance. This should be a quick test. The 6-disk array completed in 18 seconds. Yours took over 2 minutes.


  • Can you run a bonnie++ or iozone test on the partition/array in question? It would be helpful to see if there are any other bottlenecks on the system. I wasn't familiar with sysbench, but I think these other tools will give you a better overview of the system's capabilities.


  • Filesystem mount options may make a small difference, but I think the problem could be deeper than that...
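As a reference for the read-ahead, scheduler and cache-ratio items above, here is a minimal sketch of the relevant commands, assuming the array device is cciss/c0d1 as in the question and the controller sits in slot 0 (adjust both to your system):

# read-ahead for the array device (value is in 512-byte sectors)
blockdev --setra 8192 /dev/cciss/c0d1

# switch the I/O scheduler from cfq to deadline for this device
echo deadline > '/sys/block/cciss!c0d1/queue/scheduler'

# inspect and adjust the controller's cache ratio, e.g. 75% write / 25% read
hpacucli ctrl slot=0 show detail | grep -i ratio
hpacucli ctrl slot=0 modify cacheratio=75/25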




hpacucli output...



Smart Array P410i in Slot 0 (Embedded)    (sn: 50123456789ABCDE)


array A (SAS, Unused Space: 0 MB)

logicaldrive 1 (838.1 GB, RAID 1+0, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 300 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 300 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS, 300 GB, OK)
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS, 300 GB, OK)


SEP (Vendor ID PMCSIERA, Model SRC 8x6G) 250 (WWID: 50123456789ABCED)


sysbench DL380 G6 6-disk results...



sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 16

Initializing random number generator from timer.

Extra file open flags: 0
128 files, 432Mb each
54Gb total file size
Block size 4Kb
Number of random requests for random IO: 30000
Read/Write ratio for combined random IO test: 1.50
Using synchronous I/O mode
Doing random read test

Threads started!
Done.

Operations performed: 30001 Read, 0 Write, 0 Other = 30001 Total
Read 117.19Mb Written 0b Total transferred 117.19Mb (6.2292Mb/sec)
1594.67 Requests/sec executed

Test execution summary:
total time: 18.8133s
total number of events: 30001

total time taken by event execution: 300.7545
per-request statistics:
min: 0.00ms
avg: 10.02ms
max: 277.41ms
approx. 95 percentile: 25.58ms

Threads fairness:
events (avg/stddev): 1875.0625/41.46
execution time (avg/stddev): 18.7972/0.01



sysbench DL380 G7 SSD results...



sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 16
Initializing random number generator from timer.



Extra file open flags: 0
128 files, 432Mb each
54Gb total file size
Block size 4Kb
Number of random requests for random IO: 30000
Read/Write ratio for combined random IO test: 1.50
Using synchronous I/O mode
Doing random read test
Threads started!

Done.

Operations performed: 30038 Read, 0 Write, 0 Other = 30038 Total
Read 117.34Mb Written 0b Total transferred 117.34Mb (152.53Mb/sec)
39046.89 Requests/sec executed

Test execution summary:
total time: 0.7693s
total number of events: 30038
total time taken by event execution: 12.2631

per-request statistics:
min: 0.00ms
avg: 0.41ms
max: 1.89ms
approx. 95 percentile: 0.57ms

Threads fairness:
events (avg/stddev): 1877.3750/15.59
execution time (avg/stddev): 0.7664/0.00


Friday, July 20, 2018

Zimbra 7.2 reverse proxy to arbitrary internal website?

I have ZCS 7.2 opensource installed at webmail.domain.com and mailman on mailman.domain.com/mailman.



I wanted to set up a proxy so that when someone goes to webmail.domain.com/mailman, the proxy would instead pull up the contents of mailman.domain.com/mailman.



With apache and mod_proxy I could do something like
ProxyPass /mailman https://mailman.domain.com/mailman
ProxyPassReverse /mailman https://mailman.domain.com/mailman




Given the amount of customization in Zimbra, is it possible (and advisable) to do the same with Zimbra's web server? Basically, a reverse proxy that forwards to an arbitrary internal website.
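For comparison, the nginx equivalent of the Apache directives above is a single location block like the sketch below. Note that Zimbra's bundled nginx generates its configuration from templates, so a hand-edited block may be overwritten when the proxy restarts; which template file to modify varies by version and is not something taken from this question:

location /mailman {
    proxy_pass https://mailman.domain.com/mailman;
}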

linux - OpenVPN and myhttpd on port 443



I'm trying to set up OpenVPN to listen on port 443 on my Asustor NAS, and then pass all HTTPS traffic to Apache, by using the port-share option based on:
OpenVPN port-share with Apache/SSL



However, I'm not getting it to work.
I think the problem is that port 443 is already being listened on by a process called myhttpd.
When I run # netstat -tulpn | grep LISTEN, I get this result:





tcp 0 0 0.0.0.0:443 0.0.0.0:* LISTEN 4475/myhttpd




When I change the OpenVPN port to 444 and run # netstat -tulpn | grep LISTEN again, I get the following result:




tcp 0 0 0.0.0.0:443 0.0.0.0:* LISTEN 4475/myhttpd




tcp 0 0 0.0.0.0:444 0.0.0.0:* LISTEN 1507/openvpn



tcp 0 0 127.0.0.1:1195 0.0.0.0:* LISTEN 1507/openvpn




I'm not sure how to solve this issue.
Does anyone have suggestions?


Answer



The issue is that your web server (myhttpd in the netstat output) is listening on 0.0.0.0:443; you need it to listen on localhost only. Then the two servers no longer fight over the port, and OpenVPN can bind to 443 and hand the HTTPS traffic off to the web server with port-share.
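A minimal sketch of what that looks like on the OpenVPN side, assuming the NAS web interface can be rebound to a localhost-only port such as 127.0.0.1:8443 (the port number here is an assumption, not taken from the question):

# server.conf - OpenVPN owns the public 443 and hands non-VPN TLS traffic to the local web server
port 443
proto tcp-server
port-share 127.0.0.1 8443

OpenVPN then answers on 0.0.0.0:443, speaks the VPN protocol with its own clients, and proxies everything else to the web server listening on localhost.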


Windows 7 Client Can't Join Server's Active Directory Domain

I am currently helping my company prototype automated Windows installation over the network. I am using Server 2008 R2 and Windows Deployment Services, with Windows 7 as the OS being installed on the client computer. Everything works fine EXCEPT joining the client PC to the domain. DNS is configured correctly, and the client computer is already prestaged in Active Directory Users and Computers as "Client1" with a password of "password". I have posted my unattend.xml file and the relevant sections of the Panther/UnattendGC setupact.log and setuperr.log files.



Setupact.log:




2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: Begin
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: Loading input parameters...
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: AccountData = [NULL]
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: UnsecureJoin = [NULL]
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: MachinePassword = [secret not logged]
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: JoinDomain = [master.local]
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: JoinWorkgroup = [NULL]
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: Domain = [master.local]
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: Username = [Client1]
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: Password = [secret not logged]
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: MachineObjectOU = [NULL]
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: DebugJoin = [false]
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: DebugJoinOnlyOnThisError = [NULL]
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: Checking that auto start services have started.
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: Joining domain [master.local]...
2017-06-29 09:25:04, Info [DJOIN.EXE] Unattended Join: Calling DsGetDcName for master.local...
2017-06-29 09:25:04, Warning [DJOIN.EXE] Unattended Join: DsGetDcName failed: 0x2746, last error is 0x0, will retry in 5 seconds...
[[[My personal note: At this point it retries and displays the above error many more times before finally quitting]]]
2017-06-29 09:32:04, Error [DJOIN.EXE] Unattended Join: NetJoinDomain failed error code is [10054]
2017-06-29 09:32:04, Error [DJOIN.EXE] Unattended Join: Unable to join; gdwError = 0x2746
2017-06-29 09:32:04, Info [DJOIN.EXE] Unattended Join: Exit, returning 0x0




Setuperr.log:



2017-06-29 09:32:04, Error [DJOIN.EXE] Unattended Join: NetJoinDomain failed error code is [10054]
2017-06-29 09:32:04, Error [DJOIN.EXE] Unattended Join: Unable to join; gdwError = 0x2746



Unattend.xml:









(The unattend.xml markup was stripped when this post was archived; only the element values survived. The recoverable settings include: domain master.local, join account Client1, organization MyCompany, time zone "eastern standard time", DNS server 122.45.36.1, interface "Local Area Connection", display 1280x1024 at 32-bit color / 96 DPI / 60 Hz, network location Work, locale en-us, group memberships Domain Users and Administrators, plus several values already redacted as *SENSITIVE*DATA*DELETED*.)

I have already tried turning that setting to true and it still didn't work. Notably, I didn't include credentials when I set UnsecureJoin to true, because you are NOT supposed to include credentials when performing an unsecure join. Additionally, I tried variations of UnsecureJoin=true with MachinePassword set to that machine's local admin account password, and also with the MachinePassword field blank, and it STILL did not work.



Can someone help me figure out why the client PC is not joining the domain at all? Additionally, DsGetDcName error code 0x2746 and NetJoinDomain error code 10054 seem to be undocumented, so any insight into these error codes would be greatly appreciated.
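For what it's worth, 0x2746 is 10054 in decimal, i.e. WSAECONNRESET ("connection reset by peer"), which suggests the client's connection to the domain controller is being reset rather than the unattend settings being rejected. Two hedged checks that can be run from the client (or a WinPE command prompt) once it has an address - the domain name below is taken from the logs above:

nslookup -type=SRV _ldap._tcp.dc._msdcs.master.local
nltest /dsgetdc:master.local

If the SRV lookup or the DC locator call also fails here, the problem is in DNS or in the network path to the DC rather than in the answer file.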

Thursday, July 19, 2018

networking - Intermittent slow MySQL Connections



We keep getting errors like this on our PHP sites.




Can't connect to MySQL server on '192.168.100.85' (4)





web is the web server 192.168.1.116
mysql is the mysql server 192.168.100.85



So I built a script on the web server that makes 10,000 MySQL connections in a row and times them. It will "sometimes" reproduce the error. Most of the time everything runs fine, with an average of 5-10 ms per mysql_connect call.



Some serious Googling showed that the (4) error is due to the connection being cut off by the timeout.




web# grep mysql.connect_timeout /etc/php.ini
mysql.connect_timeout = 1





So I modified the timeout in the script to 30 to see if it would work. The connection errors went away, but occasionally the connection would take 5 seconds.



After more Googling and some tcpdump I found that occasionally when the MySQL server is doing its reverse lookup of the IP the DNS server would fail to respond. So after 5 seconds it would give up and allow the connection.



I have since added skip-name-resolve to the server. But this did not solve the problem.



Now my test would show slow connections taking 3-4.5 seconds instead of the set 5 with the DNS issue.



So I ran my tests again with tcpdump running on both ends.





web# tcpdump -n -s 65535 -w web3-$(date +"%F_%H-%M-%S").pcap host 192.168.100.85 and port 3306
mysql# tcpdump -n -s 65535 -w master1-$(date +"%F_%H-%M-%S").pcap host 192.168.1.116 and port 3306




Here are the packets from the relevant slow connection.



Packets from web:



No.     Time                       Source                Destination           Protocol Info

13312 2010-10-13 10:01:01.201965 192.168.1.116 192.168.100.85 TCP 41560 > mysql [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=904829062 TSER=0 WS=2
13316 2010-10-13 10:01:04.201577 192.168.1.116 192.168.100.85 TCP 41560 > mysql [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=904832062 TSER=0 WS=2
13317 2010-10-13 10:01:04.204837 192.168.100.85 192.168.1.116 TCP mysql > 41560 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1380 TSV=562240314 TSER=904832062 WS=7
13318 2010-10-13 10:01:04.204853 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=904832065 TSER=562240314
13319 2010-10-13 10:01:04.205886 192.168.100.85 192.168.1.116 MySQL Server Greeting proto=10 version=5.0.77-log
13320 2010-10-13 10:01:04.205899 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=1 Ack=61 Win=5840 Len=0 TSV=904832066 TSER=562240316
13321 2010-10-13 10:01:04.205959 192.168.1.116 192.168.100.85 MySQL Login Request userexample
13322 2010-10-13 10:01:04.206800 192.168.100.85 192.168.1.116 TCP mysql > 41560 [ACK] Seq=61 Ack=71 Win=5888 Len=0 TSV=562240317 TSER=904832066
13323 2010-10-13 10:01:04.206874 192.168.100.85 192.168.1.116 MySQL Response OK
13324 2010-10-13 10:01:04.208823 192.168.1.116 192.168.100.85 MySQL Request Quit

13325 2010-10-13 10:01:04.208839 192.168.1.116 192.168.100.85 TCP 41560 > mysql [FIN, ACK] Seq=76 Ack=72 Win=5840 Len=0 TSV=904832069 TSER=562240317
13326 2010-10-13 10:01:04.210422 192.168.100.85 192.168.1.116 TCP mysql > 41560 [FIN, ACK] Seq=72 Ack=76 Win=5888 Len=0 TSV=562240320 TSER=904832069
13327 2010-10-13 10:01:04.210437 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=77 Ack=73 Win=5840 Len=0 TSV=904832071 TSER=562240320
13328 2010-10-13 10:01:04.210567 192.168.100.85 192.168.1.116 TCP mysql > 41560 [ACK] Seq=73 Ack=77 Win=5888 Len=0 TSV=562240320 TSER=904832069


Packets from mysql:



No.     Time                       Source                Destination           Protocol Info
13315 2010-10-13 10:01:04.204817 192.168.1.116 192.168.100.85 TCP 41560 > mysql [SYN] Seq=0 Win=5840 Len=0 MSS=1380 TSV=904832062 TSER=0 WS=2

13316 2010-10-13 10:01:04.204836 192.168.100.85 192.168.1.116 TCP mysql > 41560 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=562240314 TSER=904832062 WS=7
13317 2010-10-13 10:01:04.206611 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=904832065 TSER=562240314
13318 2010-10-13 10:01:04.206808 192.168.100.85 192.168.1.116 MySQL Server Greeting proto=10 version=5.0.77-log
13319 2010-10-13 10:01:04.207658 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=1 Ack=61 Win=5840 Len=0 TSV=904832066 TSER=562240316
13320 2010-10-13 10:01:04.207815 192.168.1.116 192.168.100.85 MySQL Login Request user=example
13321 2010-10-13 10:01:04.207872 192.168.100.85 192.168.1.116 TCP mysql > 41560 [ACK] Seq=61 Ack=71 Win=5888 Len=0 TSV=562240317 TSER=904832066
13322 2010-10-13 10:01:04.207910 192.168.100.85 192.168.1.116 MySQL Response OK
13323 2010-10-13 10:01:04.210817 192.168.1.116 192.168.100.85 MySQL Request Quit
13324 2010-10-13 10:01:04.210849 192.168.100.85 192.168.1.116 TCP mysql > 41560 [FIN, ACK] Seq=72 Ack=76 Win=5888 Len=0 TSV=562240320 TSER=904832069
13325 2010-10-13 10:01:04.211632 192.168.1.116 192.168.100.85 TCP 41560 > mysql [FIN, ACK] Seq=76 Ack=72 Win=5840 Len=0 TSV=904832069 TSER=562240317

13326 2010-10-13 10:01:04.211640 192.168.100.85 192.168.1.116 TCP mysql > 41560 [ACK] Seq=73 Ack=77 Win=5888 Len=0 TSV=562240320 TSER=904832069
13327 2010-10-13 10:01:04.213243 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=77 Ack=73 Win=5840 Len=0 TSV=904832071 TSER=562240320


As you can see, web re-sent the initial SYN after 3 seconds with no response, and mysql never even saw the first one.



I also tried to run a ping flood to check for dropped packets. If you leave it running long enough you will get dropped packets.



web3# ping -f 192.168.100.85
PING 192.168.100.85 (192.168.100.85) 56(84) bytes of data.

....................................................................
--- 192.168.100.85 ping statistics ---
38253 packets transmitted, 38185 received, 0% packet loss, time 460851ms
rtt min/avg/max/mdev = 0.880/3.430/66.904/8.015 ms, pipe 7, ipg/ewma 12.047/1.378 ms


This issue is intermittent but keeps recurring throughout the day. I do understand that simply increasing our timeout would help greatly, but I'd rather have the connections always be fast and never have a 3-4 second delay in serving a page.



I contacted our hosting provider and they say it's likely a result of a bad interaction between Nagle's Algorithm and Delayed ACK. http://www.stuartcheshire.org/papers/NagleDelayedAck/




It seems to me like packets are being dropped. Any ideas on a way I could better prove it to the hosting guys? Ping dropping 60 out of 38,000 packets does seem kind of minor. Is this something I should just live with?



Thank you for your time in looking into this!


Answer



I wouldn't test this with a ping flood; instead, set up a continuous stream and let it run for, say, two minutes. That's a more valid test of network "loss" - ping -f can sometimes overrun particular pieces of equipment and give unreliable results.
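A minimal sketch of such a steady stream, run from the web server toward the MySQL server (the interval and counts are arbitrary choices, not part of the answer above):

# about two minutes of pings at five per second, with per-packet timestamps for correlating against application logs
ping -D -i 0.2 -c 600 192.168.100.85

# or, if mtr is installed, a per-hop loss report over 300 cycles
mtr -n --report --report-cycles 300 192.168.100.85

Even a fraction of a percent of loss on the path will show up as occasional 3-second SYN retransmissions like the one in the capture above.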


linux - How to SSH to ec2 instance in VPC private subnet via NAT server

I have created a VPC in aws with a public subnet and a private subnet. The private subnet does not have direct access to external network. S...