I'll try to be as concise as possible. Any assistance is much appreciated. The problem is a memory leak.
I have a 32-bit DLL written in C# that runs as a COM component on a Windows 2003 server. This component processes BizTalk 2002 messages. The process runs under 200 MB most of the time until it spikes above 1 GB, and it will run out of memory at around 2 GB unless it's recycled. The memory is never released, even when the process is idle for some time.
I used perfmon to verify that when the process is using 1 GB of memory, only 30-40 MB are on the managed heap, so I know it's not .NET memory. !address -summary shows the growth is in the native heap. The !heap command shows that one heap (the first one) contains 95% of the memory used.
Typically at this point I would use DebugDiag to give me a summary of allocations. When I ran the process under DebugDiag, there was no leak for a month. I detached DebugDiag and the process blew up within a day. So I tried umdh.exe this time. I turned on user-mode stack traces with "gflags -i dllhost +ust", restarted the process, and took an initial snapshot with umdh.exe. Again, the process did not leak for a week. I turned off user-mode stack traces and restarted the process 4 hours ago, and it is at 600 MB already. I grabbed a dump of it when it was around 500 MB.
So that's my first issue: why does it not leak with heap traces on? It's possible that the timing is a coincidence, but the odds of that seem low. I know that turning on heap traces and/or running under a debugger disables the LFH, turns on page heap, and tweaks some flags on the heap, so technically there is a difference in heap behavior. In case it helps: both with traces on and with traces off, the !heap -s command shows "L" rather than "LFH" in the fast heap column for the large problem heap.
I decided to take a look at the 500 MB dump I got without stack traces and poke around in the heap to see what was there. Heap 00090000 has 450+ MB of data.
!heap -stat -h 00090000
heap @ 00090000
group-by: TOTSIZE max-display: 20
size #blocks total ( %) (percent of total busy bytes)
21120 116 - 23e98c0 (13.73)
3c 769ef - 1bcd404 (10.63)
2e 66d0f - 12798b2 (7.06)
1c 667e8 - b35d60 (4.29)
24 4f942 - b30d48 (4.28)
2a 3a03c - 9849d8 (3.64)
3a 27c6d - 9030b2 (3.45)
7c80 116 - 873300 (3.23)
6c5c 116 - 75abe8 (2.81)
…
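As a sanity check on the listing above, here is a minimal Python sketch (not part of the debugger session) that parses the rows, confirms that size times block count equals the reported total, and prints the totals in decimal. The names HEAP_STAT and parse_heap_stat are made up for illustration, and the rows are copied by hand from the output, all values being hex.

```python
import re

# Rows copied from the "!heap -stat -h 00090000" output (hex values).
HEAP_STAT = """\
21120 116 - 23e98c0 (13.73)
3c 769ef - 1bcd404 (10.63)
2e 66d0f - 12798b2 (7.06)
1c 667e8 - b35d60 (4.29)
24 4f942 - b30d48 (4.28)
2a 3a03c - 9849d8 (3.64)
3a 27c6d - 9030b2 (3.45)
7c80 116 - 873300 (3.23)
6c5c 116 - 75abe8 (2.81)
"""

# size, #blocks, "-", total, "(percent)"
ROW = re.compile(r"^([0-9a-f]+) ([0-9a-f]+) - ([0-9a-f]+) \(([\d.]+)\)$")


def parse_heap_stat(text):
    """Turn each stat row into a dict of decimal integers."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # skip anything that is not a data row
        size, blocks, total = (int(g, 16) for g in m.groups()[:3])
        rows.append({
            "size": size,
            "blocks": blocks,
            "total": total,
            "pct": float(m.group(4)),
            # True when the row is internally consistent.
            "consistent": size * blocks == total,
        })
    return rows


rows = parse_heap_stat(HEAP_STAT)
for r in rows:
    print(f"{r['size']:>7} bytes x {r['blocks']:>7} blocks = "
          f"{r['total'] / 2**20:6.1f} MiB  consistent={r['consistent']}")

# Sizes that share a block count may be allocated together.
same_count = [hex(r["size"]) for r in rows if r["blocks"] == 0x116]
print("sizes with 0x116 blocks:", same_count)
```

One thing this makes easier to see is that three of the sizes (21120, 7c80 and 6c5c) share the same block count of 0x116, which might mean they are allocated as a group, though the listing alone does not prove that.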
I thought those large allocations of 21120 were odd, so I dumped those:
!heap -flt s 21120
_HEAP @ 90000
HEAP_ENTRY Size Prev Flags UserPtr UserSize - state
1bbd0040 4225 0000 [01] 1bbd0048 21120 - (busy)
? <Unloaded_elp.dll>+1b6dc7b7
1be12fc8 4225 4225 [01] 1be12fd0 21120 - (busy)
? <Unloaded_elp.dll>+1b6c21a7
1bf24188 4225 4225 [01] 1bf24190 21120 - (busy)
? <Unloaded_elp.dll>+1b5f68ef
1c011000 4225 4225 [01] 1c011008 21120 - (busy)
? <Unloaded_elp.dll>+1b72542f
1c074320 4225 4225 [01] 1c074328 21120 - (busy)
? <Unloaded_elp.dll>+1b84dc37
1c096fe0 4225 4225 [01] 1c096fe8 21120 - (busy)
? <Unloaded_elp.dll>+1bad1a57
1c0daf98 4225 4225 [01] 1c0dafa0 21120 - (busy)
? <Unloaded_elp.dll>+1b961f7f
1c0fc128 4225 4225 [01] 1c0fc130 21120 - (busy)
I did the same for the next largest allocation, 3c, and got similar results for 90% of the entries:
1b7d61e8 0009 0009 [01] 1b7d61f0 0003c - (busy)
? <Unloaded_dll>+2d0b22
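To help read the HEAP_ENTRY columns in the two listings above, here is a small Python sketch of the arithmetic. It assumes the usual 32-bit NT heap layout, with the Size column counted in 8-byte units and an 8-byte entry header in front of the user data; both constants are assumptions on my part rather than something read from the dump, and entry_overhead is a made-up helper name.

```python
# Assumptions: 32-bit heap, 8-byte allocation granularity, 8-byte entry header.
GRANULARITY = 8
HEADER = 8


def entry_overhead(entry_addr, size_units, user_ptr, user_size):
    """Relate the HEAP_ENTRY, Size, UserPtr and UserSize columns of one row."""
    total = size_units * GRANULARITY
    return {
        "total_bytes": total,
        # UserPtr should sit right after the header.
        "header_ok": user_ptr - entry_addr == HEADER,
        # Bytes left over after the header and the requested size.
        "padding": total - HEADER - user_size,
    }


# First row of the 21120 listing and the sample row of the 3c listing.
big = entry_overhead(0x1BBD0040, 0x4225, 0x1BBD0048, 0x21120)
small = entry_overhead(0x1B7D61E8, 0x0009, 0x1B7D61F0, 0x3C)

print("21120 block:", big)
print("3c block:   ", small)
```

Under those assumptions the columns are self-consistent: 0x4225 units is exactly the 0x21120 user bytes plus the header, and the 3c block is rounded up by 4 bytes. So the entries themselves look intact, and only the symbol lines are suspicious.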
The lm output shows this for unloaded modules:
Unloaded modules:
00320033 00960061 Unknown_Module_00320033
Missing image name, possible paged-out or corrupt data.
00680063 00cc00a8 Unknown_Module_00680063
0000f50f 0075f572 dll
00000001 45d70a37 elp.dll
71af0000 71b12000 ShimEng.dll
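One thing that can be checked from the listings alone is what raw values the "? <Unloaded_elp.dll>+offset" lines correspond to. The Python sketch below assumes (and this is only an assumption) that the debugger is symbolizing a value stored in each block against the elp.dll entry, so that the raw value is the listed start address plus the offset. The variable names are made up for illustration.

```python
# Range listed for elp.dll in the unloaded modules output.
ELP_BASE = 0x00000001
ELP_END = 0x45D70A37

# Offsets copied from the "? <Unloaded_elp.dll>+..." lines of the 21120 listing.
offsets = [
    0x1B6DC7B7, 0x1B6C21A7, 0x1B5F68EF, 0x1B72542F,
    0x1B84DC37, 0x1BAD1A57, 0x1B961F7F,
]

# Assumption: raw value = module start + offset.
raw = [ELP_BASE + off for off in offsets]

for off, value in zip(offsets, raw):
    print(f"+{off:08x} -> {value:08x}  8-byte aligned: {value % 8 == 0}")

# The bogus entry spans over 1 GB, so almost any user-mode pointer lands in it.
span_mib = (ELP_END - ELP_BASE) / 2**20
print(f"elp.dll entry spans about {span_mib:.0f} MiB")
```

If that assumption holds, every raw value is 8-byte aligned and falls in the same 1bxxxxxx region as the heap entries, so they would look more like pointers to other heap blocks than like code addresses in a real module.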
Any tips on next steps here? Spot checking those addresses with the db command doesn't show any strings or recognizable patterns. I cannot find any information on this elp.dll anywhere on the internet, and there is no such DLL on the server. When I randomly break into the live process with the debugger and dump modules, elp.dll is always in the unloaded modules area. And what's with the address for elp.dll, 00000001?