I want to learn more about how to perform root-cause analysis. More often than not, our department tells the user to try rebooting (their Windows XP system), which actually "fixes" a good number of problems. When I am in a hurry (and sometimes getting paid hourly contributes to this) I might try to find a workaround in order to get the problem solved quickly instead of actually performing root-cause analysis.
Most of the time I am looking in log files or the event viewer for this information. Sometimes I will use the Sysinternals tools or occasionally run a packet sniffer. I probably don't use the Sysinternals programs as much as I should. Some specific insight on which of these tools you use, when, and why would also be helpful.
I know this is a wide open question but could you please briefly explain your methodology, tools, etc. that you use? It looks like a lot of admins on SF use a more in-depth process which I would like to learn more about. If this helps narrow down the question any, I would be most interested in tools, tips, tricks, etc. relevant to Windows servers & clients within an AD environment.
Answer
Figuring out the root cause of a problem depends on the problem. Your initial instinct to look at log files, the Sysinternals tools, and packet sniffers is generally correct.
I would add running the MS Malicious Software Removal Tool and a good AV program on Windows systems (and ensuring that they don't have something like CyberDefender or other AV-trojan-malware installed).
The folks are proponents of the "5 Whys" method (http://en.wikipedia.org/wiki/5_Whys, also this nice short PDF that shows it in action). It is a pretty valuable tool for doing root cause analysis.
Beyond that I'll paint two broad categories and some of the questions I usually ask/things I check:
Mysterious behavior not related to the network
e.g. "Word keeps crashing on me"
Basic questions to ask:
- What changed?
  (Don't take "nothing" for an answer. It is the first lie. New software, patches, etc. all count.)
- What were you doing when you had the problem?
  (Try to extract as much detail as possible here. In my example above: "I hit the hotkey for insert initials and the program crashed.")
- Did it ever work before?
  (If so, start looking at what changed, per the first question above.)
- Can you reproduce the problem on your system?
  (If so, that's a good sign: a tech support call to the vendor may help. If not, you'll need to look at the user's system for the rest of these questions.)
- What's different about the user's environment compared to your environment?
- Is the user's hardware suspect? (Run a memory test, look for SMART errors from the hard drive, etc.)
- If you've gotten this far (hardware checks out, software checks out, no viruses, no malware) go visit the user for a day. Observe their work habits.
My company once had a mysterious system lock-up that was related to clicking the mouse at a specific frequency. (We still don't know why, but we had to watch a user doing it and practice for a day in order to be able to reproduce it reliably.)
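For the "what's different about the user's environment" question, it can help to compare software inventories mechanically instead of eyeballing them. Here is a minimal sketch in Python, assuming you have already exported each machine's installed software as a name-to-version mapping. The inventories, package names, and the `diff_inventories` helper are all hypothetical and only there for illustration.

```python
# Hypothetical inventories: {package name: version}. In practice you would
# export these from each machine however you normally list installed software.
working_pc = {
    "Microsoft Word": "11.0",
    "Acrobat Reader": "9.1",
    "Initials Macro Add-in": "1.2",
}
users_pc = {
    "Microsoft Word": "11.0",
    "Acrobat Reader": "9.3",
    "Toolbar Helper": "2.0",
}


def diff_inventories(reference, suspect):
    """Return what is missing, extra, or version-mismatched on the suspect machine."""
    # Packages present on the working machine but absent on the user's machine.
    only_reference = sorted(set(reference) - set(suspect))
    # Packages present on the user's machine but absent on the working machine.
    only_suspect = sorted(set(suspect) - set(reference))
    # Packages installed on both machines at different versions.
    version_mismatch = {
        name: (reference[name], suspect[name])
        for name in sorted(set(reference) & set(suspect))
        if reference[name] != suspect[name]
    }
    return only_reference, only_suspect, version_mismatch


missing, extra, mismatched = diff_inventories(working_pc, users_pc)
print("Missing on user's PC:", missing)
print("Extra on user's PC:", extra)
print("Version mismatches:", mismatched)
```

Anything that shows up in one of the three buckets is a candidate answer to "What changed?" and is worth checking before you go sit with the user.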
Problems related to the network
A lot of this is similar, but with some more specific guidance.
- What changed?
  (Yeah, you always start there.)
- What is broken?
  - Can you reach web pages? Is it just one that's down? If so, is it down for everyone or just you?
  - Can you ping stuff on the internet by name? How about by IP? How far does the traceroute get?
- When is it broken?
  - Always the same time of the day?
  - For a brief period every N days?
  - Randomly? (Is it REALLY random? Plot it on a calendar...)
- Is there something odd about the remote site?
  - Look at DNS. If it's round-robin'd there could be remote-side breakage.
  - Are we talking about the other end of a VPN? What's up with the VPN (logs!)?
- Is there something odd about the local site?
  - Check your local firewall
  - Check any "filtering software"
  - Check with your ISP to see if there are any known issues
  - Check sites like http://www.internetpulse.net/ for known network-wide issues
- Check out the user's machine
  (TCP settings, etc. Usually not the problem, but sometimes.)
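For the "When is it broken?" questions, a low-tech way to plot outages on a calendar is to bucket the reported times by hour and weekday and look at the gaps between them. Below is a minimal sketch in Python; the outage timestamps and the `summarize_outages` helper are made up for illustration, and in real life the times would come from your own tickets or logs.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical outage reports (local time), e.g. copied out of help desk tickets.
outages = [
    datetime(2024, 3, 18, 2, 2),
    datetime(2024, 3, 4, 2, 5),
    datetime(2024, 3, 11, 2, 10),
]


def summarize_outages(timestamps):
    """Bucket outage times by hour and weekday, and list day gaps between them."""
    ordered = sorted(timestamps)
    # Same hour every time suggests a scheduled job (backup, AV scan, etc.).
    by_hour = Counter(ts.hour for ts in ordered)
    # Same weekday every time suggests a weekly task.
    by_weekday = Counter(WEEKDAYS[ts.weekday()] for ts in ordered)
    # A constant gap answers the "every N days" question.
    gaps_days = [(b.date() - a.date()).days for a, b in zip(ordered, ordered[1:])]
    return by_hour, by_weekday, gaps_days


by_hour, by_weekday, gaps_days = summarize_outages(outages)
print("By hour:", dict(by_hour))
print("By weekday:", dict(by_weekday))
print("Gaps between outages (days):", gaps_days)
```

If everything lands in one hour, one weekday, or at a fixed gap, the problem is probably not random, and the next step is to look for whatever is scheduled to run at that time.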