Troubleshooting like a Boss (not a real Boss tho)


Recently I have been running some informal training sessions with junior staff members, and realised that there are quite a few things we do in this industry that can be very hard to pick up without prompting from the people who have been around for a while. As such, I figured I would take a few hours and detail some of the processes I use to troubleshoot issues.

From a troubleshooting point of view, let’s review a scenario that I was recently involved in resolving.

Scenario:

After an ADMT migration, a few computers were returning the following error message whenever somebody attempted to log on with an account that hadn’t previously logged on to that computer:

User Profile Service Service Failed the logon. User Profile cannot be loaded

By the time this came to me, 2 techs had already looked into the error and were blaming it on the domain migration. We could log on to the machine as the local administrator, and I could connect to it remotely via the SCCM 2012 Remote Tools.

Troubleshooting Steps:

The first step in investigating this issue is to start in the event log. I know it sounds daunting to look in the event log. It’s huge and there is so much information in there, which is exactly why it is perfect for the task at hand.

We looked in the Security event log and there were no issues with the computer or the user authenticating. Right away, at a high level, we could start ruling out the ADMT component of the changes to the system, because if the migration were at fault we would expect to see authentication issues between the machine and the domain.

We then moved to the Application event log, and right away we started seeing warning events like these every time a new user attempted to log on:

Windows cannot copy file C:\Users\Default\AppData\Local\Microsoft\Windows Live\SqmApi\SqmData720896_00.sqm to location C:\Users\Guest\AppData\Local\Microsoft\Windows Live\SqmApi\SqmData720896_00.sqm. This error may be caused by network problems or insufficient security rights.

DETAIL – Access is denied.

And:

Windows cannot find the local profile and is logging you on with a temporary profile. Changes you make to this profile will be lost when you log off

And:

Windows cannot copy file C:\Users\Default\AppData\Local\Microsoft\Windows Live\SqmApi\SqmData720896_00.sqm to location C:\Users\TEMP\AppData\Local\Microsoft\Windows Live\SqmApi\SqmData720896_00.sqm. This error may be caused by network problems or insufficient security rights.

DETAIL – Access is denied.

And:

Windows cannot log you on because your profile cannot be loaded. Check that you are connected to the network, and that your network is functioning correctly.

DETAIL – Only part of a ReadProcessMemory or WriteProcessMemory request was completed.
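When there are a lot of these warnings, it can help to boil them down to the file that keeps failing. Here is a minimal sketch in Python, assuming you have already exported the event messages as plain text strings (how you export them is up to you). The pattern is based only on the message wording quoted above, and the function name `failed_copy_sources` is my own, not part of any Windows tooling.

```python
import re
from collections import Counter

# Pattern built from the "cannot copy file" warning text quoted above.
# It is an assumption that other machines word the message the same way.
COPY_FAILURE = re.compile(
    r"Windows cannot copy file (?P<src>.+?) to location (?P<dst>.+?)\. This error"
)


def failed_copy_sources(messages):
    """Count how often each source file appears in copy-failure warnings."""
    counts = Counter()
    for message in messages:
        match = COPY_FAILURE.search(message)
        if match:
            counts[match.group("src")] += 1
    return counts


if __name__ == "__main__":
    # Two of the warnings from this scenario, as raw strings to keep the backslashes.
    sample = [
        r"Windows cannot copy file C:\Users\Default\AppData\Local\Microsoft\Windows Live\SqmApi\SqmData720896_00.sqm to location C:\Users\Guest\AppData\Local\Microsoft\Windows Live\SqmApi\SqmData720896_00.sqm. This error may be caused by network problems or insufficient security rights.",
        r"Windows cannot copy file C:\Users\Default\AppData\Local\Microsoft\Windows Live\SqmApi\SqmData720896_00.sqm to location C:\Users\TEMP\AppData\Local\Microsoft\Windows Live\SqmApi\SqmData720896_00.sqm. This error may be caused by network problems or insufficient security rights.",
    ]
    for source, count in failed_copy_sources(sample).most_common():
        print(count, source)
```

If the same source file turns up against every destination profile, as it did here, that file is a good first place to look.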

So we browsed to “C:\Users\Default\AppData\Local\Microsoft\Windows Live\SqmApi”, and sure enough, when we looked at the permissions on SqmData720896_00.sqm we found that the user’s security group didn’t have access to the file. This was causing the User Profile Service to fail, as it couldn’t copy the file into either the new user profile or the Temp profile. Once the permissions were replicated from the parent folder, the issue was resolved. I know it sounds simple when you see it like this, but the whole troubleshooting exercise took around 20-30 minutes, with much searching around the internet and discussion with the techs on site to find out the extent of the issue and the like, while keeping them informed throughout the troubleshooting phase.
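If you wanted a rough first pass over the whole Default profile rather than checking one file by hand, something like the sketch below could work. It is only a sketch, with one big assumption: it tests whether the account running the script can read each file, which is not the same as inspecting the ACL for the affected user’s security group. You would need to run it as an affected account, or still check the ACLs yourself. The function name `find_unreadable_files` is hypothetical.

```python
import os


def find_unreadable_files(root):
    """Return paths under root that the current account cannot open for reading."""
    unreadable = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            try:
                # Reading a single byte is enough to trigger an access error.
                with open(path, "rb") as handle:
                    handle.read(1)
            except OSError:
                # PermissionError is a subclass of OSError, so "access denied"
                # lands here along with other read failures.
                unreadable.append(path)
    return unreadable
```

Calling `find_unreadable_files(r"C:\Users\Default")` from an affected account would list the candidates to look at. For the fix itself, `icacls <file> /reset` is one way to put inherited permissions back on a file, but confirm that suits your environment before running it.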

Wrap up:

The biggest piece of advice I can provide anybody just starting out and wanting to impress with their troubleshooting ability is to use the KISS method:

Keep

It

Simple

Stupid

Always start by assuming the simplest answer is the correct one. The same thinking applies when you create the fix. For the issue above, where fewer than 5 users were impacted, it doesn’t make sense to script or automate the fix. This is one where you hand the solution back to the support teams with the comment: if you see this issue, check for this event message, confirm it is the exact issue, then run the remediation steps. If the issue was impacting a large percentage of my fleet, I would look at creating a fix to remediate it proactively, be it a simple Group Policy, which would cover this one, or a compliance setting from SCCM. It can be automated if need be.

In the heat of an issue it can be very hard to keep calm, especially when you need to quickly and confidently rule out an idea even though everybody else working on the issue keeps pointing at it as the cause. I recommend setting your IM to Busy or Do Not Disturb so that you can control the flow of information coming in. Let’s face it, being told for the 10th time that there are users unable to log on to their computers gets a bit grating when you are trying to focus on how you are going to resolve the issue. In saying that, being able to bounce ideas off co-workers is just as invaluable, as they might have made a change to the system or the like that you are not aware of.

The next thing to start looking at is the log files, be it the event log or application-specific logs. Nowadays most good applications log almost everything, and this is where you will find out more about the goings-on of your system than by randomly clicking around the OS, which a lot of admins do these days with the mindset of “I just have to fix the issue, and if I try this it might fix the problem”. In some cases you can resolve or at least Band-Aid a solution by doing this, but it normally takes a lot longer to get to the root cause, and in most cases you never learn the root cause because you have just found the fix and moved on to the next fire. I can’t say that logs will provide the answer for every problem, but they are a fantastic place to start.

Another simple thing you can do if the machine is blue screening is to get the debugging toolkit from Microsoft for your OS and run the dump check application over the memory.dmp/mini.dmp file, which will typically return the offending component of the OS. Just confirm the date on the file before you do, as it might be from a blue screen 2 weeks/months/years earlier.
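The date check is easy to forget, so here is a small sketch of it in Python. It only looks at the file’s last-modified time, on the assumption that this reflects when the dump was written. The function names and the one-day default are mine, so pick whatever window matches when the crash was reported.

```python
import os
import time

SECONDS_PER_DAY = 86400


def dump_age_days(path, now=None):
    """Return how many days ago the file at path was last modified."""
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(path)) / SECONDS_PER_DAY


def is_recent_dump(path, max_age_days=1, now=None):
    """True if the dump was modified within the last max_age_days days."""
    return dump_age_days(path, now) <= max_age_days
```

If `is_recent_dump` comes back False for the dump you were handed, it probably belongs to an older blue screen and is not worth analysing for the current issue.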

I know a lot of what I’m saying is common sense to most of us, but the number of people I deal with nowadays who gloss over these troubleshooting steps is staggering. The other thing that makes a great troubleshooter is having the confidence to sit there, state your case and back it up. There is no point working out the problem and then raising the ticket to a senior resource without a handover because you are not quite sure about the answer. The senior resources have typically made it to those roles because they back themselves and ask the right questions to build trust with the management teams.

Good luck and happy troubleshooting,

Steve
