SQL Server stops processing for 20 seconds

I can't figure this out. I have a process on SQL Server that runs dozens of times per second (clients send data to the server). Normally it works fine: each request takes 50 ms to 200 ms. Then, roughly (but irregularly) every 1.5 minutes, all requests suddenly take 15,000 ms to 22,000 ms (15 to 22 seconds). At the same time, CPU load on the server drops sharply. Sometimes (about 70% of the time) the average disk queue length spikes just before the CPU drops and the requests slow down.
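To confirm the "roughly every 1.5 minutes" pattern from logged request timings rather than by eye, a minimal Python sketch like the following could help. It assumes each request's start time (in seconds) and duration (in ms) has been logged somewhere; the function names and the 10-second and 30-second thresholds are illustrative choices, not part of the original setup.

```python
from typing import Iterable, List, Tuple


def find_incidents(requests: Iterable[Tuple[float, float]],
                   slow_ms: float = 10_000,
                   gap_s: float = 30.0) -> List[Tuple[float, float]]:
    """Group slow requests into incidents.

    requests: (start_seconds, duration_ms) pairs, in any order.
    A request counts as slow if it took at least slow_ms; a slow request that
    starts within gap_s seconds of the previous slow one joins the same incident.
    Returns (first_start, last_start) for each incident.
    """
    slow_starts = sorted(t for t, ms in requests if ms >= slow_ms)
    incidents: List[Tuple[float, float]] = []
    for t in slow_starts:
        if incidents and t - incidents[-1][1] <= gap_s:
            # Extend the current incident to include this request.
            incidents[-1] = (incidents[-1][0], t)
        else:
            # Too far from the previous slow request: start a new incident.
            incidents.append((t, t))
    return incidents


def spacing(incidents: List[Tuple[float, float]]) -> List[float]:
    """Seconds between the starts of consecutive incidents."""
    return [b[0] - a[0] for a, b in zip(incidents, incidents[1:])]
```

For example, slow requests starting at 90, 92 and 95 seconds and again at 185 and 186 seconds collapse into two incidents whose starts are 95 seconds apart.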

I watch the CPU in perfmon: it usually bounces between 20% and 70%, averaging around 50%. When everything stalls, it drops to 0%, with a couple of 20% spikes, for about 20 seconds.

At the same time, I am watching SQL Activity Monitor. Typically between 1 and 4 EXECUTE transactions are listed, but when this happens the number of EXECUTE transactions grows to 20 or 30. Transactions keep coming in but are not processed.

I check for blocking with this query and never see any:

Select A.* From master.dbo.sysprocesses as A with (nolock) Where A.blocked <> 0 
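That query only lists the blocked sessions. If it ever does return rows, a hypothetical helper like this one (Python, assuming the spid and blocked columns have been copied out as pairs) works out which session sits at the head of each blocking chain.

```python
from typing import List, Tuple


def head_blockers(rows: List[Tuple[int, int]]) -> List[int]:
    """Return the spids at the head of blocking chains.

    rows: (spid, blocked) pairs as returned by master.dbo.sysprocesses, where
    blocked is the spid doing the blocking (0 means not blocked). A spid can
    appear in several rows (one per execution context), and a row may report
    a session as blocked by itself; neither counts as being blocked by another
    session. Sessions caught in a cycle (a deadlock) have no head and are not
    reported.
    """
    # Sessions that are waiting on some *other* session.
    blocked_by_other = {spid for spid, blk in rows if blk not in (0, spid)}
    # Sessions that some other session is waiting on.
    blockers = {blk for spid, blk in rows if blk not in (0, spid)}
    # A head blocker blocks someone but is not itself blocked by anyone else.
    return sorted(blockers - blocked_by_other)
```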

Note that I'm using snapshot isolation.

I have a system that records deadlocks in an error log, and it has not reported any.

I checked SQL Server Agent for other jobs; none are scheduled at the times these events occur.

I looked in SQL Profiler for other events and found nothing. I also watched for file growth events, and none were reported.

Even when the queries take 20,000 ms, SQL Profiler reports reads of about 2,000 and CPU of up to 50. The processes themselves do not seem to consume resources. However, Audit Logout events report high reads and CPU (I'm not sure whether that actually matters).

There is also nothing in my event log during these events.

Any ideas? Anywhere else I should look?

I'm running SQL Server 2005 Standard on Windows Server 2003 32-bit.

8 answers

The problem turned out to be automatic checkpoints. When SQL Server runs an automatic checkpoint, other transactions are delayed, probably because of the disk I/O the checkpoint generates.

sys.dm_exec_requests showing wait_type WRITELOG (wait_time 0) means the requests have completed the transaction and are waiting for the log to be hardened (written to disk). --Remus Rusanu

To verify this, I turned on checkpoint logging and recorded a perfmon session during several incidents. I then compared the log with the perfmon data and saw that the incidents always coincided with a checkpoint in one of my databases.

DBCC TRACEON (3502, -1) - turn on checkpoint logging

DBCC TRACEOFF (3502, -1) - turn off checkpoint logging

EXEC xp_readerrorlog - read the error log

SELECT DB_NAME(dbid) AS [Database name] - substitute the database ID reported in the log for dbid
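The comparison of the checkpoint log against perfmon can be partly automated. The sketch below is a hypothetical helper, not the author's actual method. It assumes the error log rows have been exported as (seconds, text) pairs, for example from the LogDate and Text columns returned by xp_readerrorlog, and it assumes the trace flag 3502 messages read "Ckpt dbid N started" and "Ckpt dbid N phase 1 ended"; check that wording against your own log. Incidents are (start, end) windows in seconds taken from the perfmon data.

```python
import re
from typing import Dict, List, Tuple

# ASSUMPTION: trace flag 3502 messages look like
#   "Ckpt dbid 7 started (8)" and "Ckpt dbid 7 phase 1 ended (8)".
# Verify the exact wording in your own error log before relying on this.
CKPT_RE = re.compile(r"Ckpt dbid (\d+) (started|phase 1 ended)")


def checkpoint_windows(lines: List[Tuple[float, str]]) -> List[Tuple[float, float, int]]:
    """Pair checkpoint start/end messages into (start, end, dbid) windows.

    lines: (timestamp_seconds, message_text) pairs in log order.
    Messages that do not match CKPT_RE are skipped.
    """
    open_ckpts: Dict[int, float] = {}
    windows: List[Tuple[float, float, int]] = []
    for ts, text in lines:
        m = CKPT_RE.search(text)
        if m is None:
            continue
        dbid = int(m.group(1))
        if m.group(2) == "started":
            open_ckpts[dbid] = ts
        elif dbid in open_ckpts:
            windows.append((open_ckpts.pop(dbid), ts, dbid))
    return windows


def match_incidents(incidents: List[Tuple[float, float]],
                    windows: List[Tuple[float, float, int]],
                    slack: float = 0.0) -> List[List[int]]:
    """For each (start, end) incident, list the dbids of overlapping checkpoints.

    slack widens each incident window (in seconds) to absorb clock skew
    between the perfmon log and the SQL Server error log.
    """
    matches = []
    for i_start, i_end in incidents:
        lo, hi = i_start - slack, i_end + slack
        # Two intervals overlap when each starts before the other ends.
        matches.append(sorted({dbid for c_start, c_end, dbid in windows
                               if c_start <= hi and lo <= c_end}))
    return matches
```

A checkpoint in dbid 7 running from 88 s to 104 s would be matched to an incident from 90 s to 110 s; the matched dbid is what gets passed to DB_NAME().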

One process in this particular database performs a large number of inserts and deletes. The solution is to rewrite that process to reduce the amount of data it writes. Another option is to add hardware.

Thanks to everyone who contributed.


Have you checked the drive for errors? It sounds like something is going on with the disk. If it is a RAID array, check that the array is healthy.


Are you using full-text search?

I think the index may be getting rebuilt from time to time.

Perhaps try scheduling a full rebuild of the indexes yourself, or switching to non-clustered indexes?


I would add a few more counters to your perfmon session, for example disk reads/sec and writes/sec. That will show whether there is an I/O problem. Also check the MSDN article on SQL Server performance; it gave me some good ideas about what to test, at least.
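If the perfmon log is saved as CSV (or converted with relog), a short script can line up the CPU and disk counters sample by sample. This is a sketch under assumptions: the first column holds the timestamp, the other headers are full counter paths such as \\SERVER\PhysicalDisk(_Total)\Disk Writes/sec, and the 5% CPU and queue-length-of-2 thresholds are arbitrary illustrative values.

```python
import csv
import io
import math
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, float]]


def load_counters(csv_text: str, suffixes: List[str]) -> List[Sample]:
    r"""Pick counters out of a perfmon CSV export.

    The first column is assumed to be the timestamp; every other header is a
    full counter path such as \\SERVER\PhysicalDisk(_Total)\Disk Writes/sec.
    Each requested suffix maps to the first column whose header ends with it.
    Blank or non-numeric cells become NaN.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols: Dict[str, int] = {}
    for suffix in suffixes:
        for idx, name in enumerate(header[1:], start=1):
            if name.endswith(suffix):
                cols[suffix] = idx
                break
    samples: List[Sample] = []
    for record in reader:
        if not record:  # skip blank lines
            continue
        values: Dict[str, float] = {}
        for suffix, idx in cols.items():
            try:
                values[suffix] = float(record[idx])
            except (ValueError, IndexError):
                values[suffix] = math.nan
        samples.append((record[0], values))
    return samples


def cpu_idle_disk_busy(samples: List[Sample], cpu_key: str, queue_key: str,
                       cpu_max: float = 5.0, queue_min: float = 2.0) -> List[str]:
    """Timestamps where CPU is near zero while the disk queue is long.

    Missing values are NaN, and NaN comparisons are False, so such samples
    are never flagged.
    """
    return [ts for ts, v in samples
            if v.get(cpu_key, math.nan) <= cpu_max
            and v.get(queue_key, math.nan) >= queue_min]
```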


What are the wait_type, wait_resource and wait_time values in sys.dm_exec_requests for the long-running queries (sample them periodically)? Do these queries spawn additional tasks (sys.dm_os_tasks)? What are those tasks doing?
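To act on the sampling suggestion, one option is to poll something like SELECT session_id, wait_type, wait_time, wait_resource FROM sys.dm_exec_requests every second or so during an incident and then tally the results. The Python below only does the tallying; the (sample_time, session_id, wait_type, wait_time_ms) row shape is an assumption about how the polled rows were saved.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Row = Tuple[float, int, Optional[str], int]


def dominant_waits(rows: Iterable[Row], top: int = 3) -> List[Tuple[str, int]]:
    """Count how often each wait_type shows up across all samples.

    rows: (sample_time, session_id, wait_type, wait_time_ms) tuples taken
    from repeated polls of sys.dm_exec_requests; wait_type is None when the
    request was running rather than waiting at that moment.
    Returns the `top` most frequent wait types with their sample counts.
    """
    counts = Counter(wt for _, _, wt, _ in rows if wt is not None)
    return counts.most_common(top)
```

If WRITELOG dominates the samples taken during a stall, that points at log flushes, which fits the WRITELOG observation quoted in the first answer.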


Have you checked your memory usage? Under heavy load, Windows Server 2003 R2 will sometimes page out essentially all memory allocations. When this happens, SQL Server's memory is trimmed down to a minimal amount (4 MB or so), and it then slowly regains memory until it returns to relatively normal levels. We have seen this happen when very large files are copied across our SAN. I've heard it can also be triggered by the transaction log backup process if the transaction logs are very large and the server is under extremely heavy use.


This isn't slow code, because the extra latency doesn't come with extra CPU time. It looks like the server is making a blocking call that doesn't succeed and eventually times out. You have ruled out deadlocks. If this were a problem with the hard drive, you would expect to see something in the event log.

Try installing a network sniffer, such as Wireshark, to find out whether anything interesting happens at the moment the pause begins.


One option: statistics updates. If you write often enough, you may be hitting the auto-update threshold for statistics.

Take a look at the "Index Statistics" article on MSDN and at the AUTO_UPDATE_STATISTICS_ASYNC database option.

Although every 90 seconds seems a bit frequent for that...
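For a rough sense of how quickly that threshold can be reached, here is a worked calculation using the commonly documented SQL Server 2005 rule: statistics on a table with more than 500 rows are considered stale after 500 + 20% of the row count in modifications, and the update itself happens when a query that uses those statistics is next compiled. The table size and write rate below are made-up numbers for illustration.

```python
def auto_stats_threshold(row_count: int) -> int:
    """Row modifications before auto-update statistics treats stats as stale.

    Classic SQL Server 2005 rule for permanent tables: 500 changes for tables
    of up to 500 rows, otherwise 500 + 20% of the rows (rounded down here).
    Special cases for tiny temporary tables are ignored.
    """
    if row_count <= 500:
        return 500
    return 500 + row_count // 5


def seconds_to_threshold(row_count: int, changes_per_second: float) -> float:
    """How long a steady write rate takes to reach the threshold."""
    return auto_stats_threshold(row_count) / changes_per_second
```

For a hypothetical 100,000-row table taking 200 modifications per second, the threshold is 20,500 changes, reached in about 102 seconds.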

