1,688 Commits over 427 Days - 0.16cph!
Update: prototype of continuous allocation tracking is working
- Started with profile.watchallocs [Name, default="Allocs"]
- Can be stopped with profile.stopwatchingallocs
Dumps [Name].json.gz with data about allocations and their associated callstacks. Still hardcoded to trigger an export every 3 frames. The exporter frequently gets lost in the bin stream and needs to be optimized (Craggy has noticeable stutter). Rough sketch of the start/stop/export flow below.
Tests: in CLIENT+SERVER editor on Craggy with allocation tracking.
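For reference, a rough sketch of the start/stop/export flow - the AllocTracker hooks here are hypothetical stand-ins for the native side, not the actual API; only the [Name].json.gz output shape matches the description above:
```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical stand-in for the native tracking hooks.
static class AllocTracker
{
    public static void BeginCapture() { /* native: start recording allocs + callstacks */ }
    public static string EndCaptureToJson() => "{}"; // native: flush captured records as JSON
}

public static class AllocWatch
{
    static string snapshotName = "Allocs"; // default from profile.watchallocs [Name]
    static bool running;

    // profile.watchallocs [Name]
    public static void Start(string name = "Allocs")
    {
        snapshotName = name;
        running = true;
        AllocTracker.BeginCapture();
    }

    // profile.stopwatchingallocs: dump [Name].json.gz
    public static void Stop()
    {
        if (!running) return;
        running = false;

        byte[] json = Encoding.UTF8.GetBytes(AllocTracker.EndCaptureToJson());
        using (var file = File.Create(snapshotName + ".json.gz"))
        using (var gzip = new GZipStream(file, CompressionLevel.Fastest))
            gzip.Write(json, 0, json.Length);
    }
}
```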
Bugfix: fix missing callstacks
- Updated the internal binary to 6440ec7, which contains the fix (still hardcoded to always capture callstacks) - the problem was an ABI mismatch
- Updated continuous profiler unit test to check for AllocsWithStack and a bit of the data set
- Partially updated ProfileExporter, and definitely borked ProfilerBinViewer
Got updated overhead numbers - always capturing a callstack costs ~9 micros per allocation (inflated by the test adding 38 calls per alloc).
Tests: unit tests + perf test
Update: initial stack gathering support for allocs in Continuous mode
- using release libs based on d48bcf49, with hardcoded stack gathering for now
Somehow it's 15% faster than mono_get_last_method, which doesn't make sense - need to update the exporter to figure out what's being generated.
Tests: none
Profiling shows
Test: added a profiler-allocation overhead estimate test
- Switched to release binaries of d340789f
Without profiler recording, allocs cost us ~0.3 micros; with recording, ~1 micro. Next, will see if we can afford gathering full callstacks for each alloc. The shape of the measurement is sketched below.
Tests: unit tests
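A sketch of the measurement shape only - the recording toggles are stubbed and the names are illustrative, not the actual test:
```csharp
using System;
using System.Diagnostics;

public static class AllocOverheadEstimate
{
    // Hypothetical stand-ins for the profiler's native recording toggles.
    static void StartRecording() { }
    static void StopRecording() { }

    // Returns the average cost of a tiny managed allocation, in microseconds.
    public static double MicrosPerAlloc(int count, bool record)
    {
        if (record) StartRecording();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            var o = new object(); // minimal managed allocation
            GC.KeepAlive(o);
        }
        sw.Stop();
        if (record) StopRecording();
        return sw.Elapsed.TotalMilliseconds * 1000.0 / count;
    }
}
```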
Merge from: main
Tests: none (no conflicts)
Bugfix: NotSupportedException when trying to use NetWrite.Read
- Fixed by going directly via underlying buffer of NetRead/NetWrite
- Removed generic Stream call path for recording of packets
Tests: ran a server-side client demo recording in editor - before exceptions, now clean
Update: Continuous profiling that only captures allocations (for now)
- using debug binaries built from d340789f, it triggers a snapshot every 3rd frame for testing
- added a test to validate the loop of capture-and-resume
- Native.StartRecording -> Native.TakeSnapshot
Pretty barebones for now; need to profile callstack gathering to see how expensive it is for continuous profiling. The capture-and-resume loop is sketched below.
Tests: unit test
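Shape of the capture-and-resume loop being validated - a sketch, not the actual test; the Native entry points named above are wrapped in no-op stubs here:
```csharp
public class ContinuousAllocCapture
{
    // Hypothetical stand-ins for the native profiler entry points.
    static class Native
    {
        public static void StartRecording() { }
        public static byte[] TakeSnapshot() => new byte[0]; // flush current buffers, keep recording
    }

    int frame;

    public void Begin() => Native.StartRecording();

    // Called once per frame; every 3rd frame grab a snapshot and keep recording.
    public void OnFrame()
    {
        if (++frame % 3 != 0) return;
        byte[] snapshot = Native.TakeSnapshot();
        // Export/queue the snapshot here; recording continues uninterrupted.
    }
}
```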
Merge: from parallel_validatemove
- Optim to reduce physics cast scheduling overhead
Tests: unit tests
Optim: reduce physics casts scheduling overhead when using batches
- Added a helper function that subdivides the workload into equal batches, with the potential for work stealing (see the sketch below)
On a 10-player test case (40 ticks total), this reduces parallel processing time from 0.2ms to 0.12ms. Still slower than serial execution (0.06ms).
Tests: unit tests
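A minimal sketch of the batching idea - names are illustrative, not the actual API: split the workload into equal batches and let workers pull the next unclaimed batch via an atomic cursor, which gives the work-stealing behaviour.
```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BatchScheduler
{
    // Runs body(start, end) over [0, count) in batches of batchSize across `workers` workers.
    public static void ForBatched(int count, int workers, int batchSize, Action<int, int> body)
    {
        int batchCount = (count + batchSize - 1) / batchSize;
        int cursor = -1;

        Parallel.For(0, workers, _ =>
        {
            int batch;
            // Each worker keeps claiming the next batch until none are left (work stealing).
            while ((batch = Interlocked.Increment(ref cursor)) < batchCount)
            {
                int start = batch * batchSize;
                int end = Math.Min(start + batchSize, count);
                body(start, end);
            }
        });
    }
}
```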
Merge: from profiling_improvements
- New: experimental perfsnapshot_stream [Name, MainCap(MB), WorkerCap(MB), Debug] server command that streams perf data into a user-defined buffer. Limited to 256MB per thread. Stable up to 32MB, past that might fail to export.
- New: profile.quiet persistent server-var - controls whether perfsnapshot commands notify chat of incoming stutters
- Bugfix: fix snapshots failing to export in standalone (introduced to staging yesterday)
Tests: build tests, unit tests, taking snapshots in editor, and snapshots in standalone server build
Update: updated binaries after merge of task branch
Tests: snapshot in editor on craggy
Bugfix: ProfileExport.JSON - correctly identify frame start when there's 0 or 1 callstack depth at the start of recording
This is a standalone-specific issue
Tests: built standalone, did a perfsnapshot there
Merge: from main
Tests: none, no conflicts
Bugfix: ServerProfiler - ensure worker threads that get created initialize with the right fixed buffer size
- built from 76319c30 commit
Previously they would initialize with the default 16KB.
Tests: perfsnapshot_stream in editor on craggy with varying main thread buffer sizes
Update: profile.quiet persistent server command to control whether perfsnapshot commands should post chat messages
Tests: ran perfsnapshot_stream with quiet set to true - no chat messages
Update: perfsnapshot_stream [Name, MainThreadBuffer, WorkerThreadBuffer, Debug] server command
- Fixed string buffer over-allocating (need to replace it with a memory stream)
- Fixed frame index wrapping now that it can be larger than byte.MaxValue
- Fixed invalid final mark reconstruction that would lead to 180d+ slices
Allows streaming of up to 128MB of performance data before generating a snapshot. Seems stable up to 64MB, but past that it's a bit of a dice-roll. Haven't caught where it's failing yet. The buffer-size handling is sketched below.
Tests: perf snapshot in editor on craggy.
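Roughly how the command's buffer arguments get validated and handed to the recorder - a sketch under assumptions; the cap matches the 128MB figure above, but the helper names are illustrative only:
```csharp
using System;

public static class PerfSnapshotStream
{
    const int MaxBufferMB = 128; // streaming cap mentioned above

    // Hypothetical stand-in for the native buffer setup + recording start.
    static void StartStreaming(int mainBytes, int workerBytes, bool debug) { }

    // perfsnapshot_stream [Name, MainThreadBuffer, WorkerThreadBuffer, Debug]
    public static void Run(string name, int mainMB, int workerMB, bool debug)
    {
        mainMB = Math.Max(1, Math.Min(mainMB, MaxBufferMB));
        workerMB = Math.Max(1, Math.Min(workerMB, MaxBufferMB));

        StartStreaming(mainMB << 20, workerMB << 20, debug); // MB -> bytes
        // Recording then runs until a buffer fills up, at which point the snapshot is exported.
    }
}
```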
Update: ServerProfiler - initial FixedStorage support
- Added test for FixedStorage snapshot recording
- Updated ProfileExporter to handle FixedStorage binary stream quirks
- Updated binaries based on 2ce19cfe
The new mode will allow us to stream profiling info until the buffer fills up (rough idea sketched below). RCon commands will come next.
Tests: unit tests
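Illustrative idea behind FixedStorage, not the actual native layout: append marks into a fixed-size buffer and refuse writes once it's full, so whatever was captured can still be exported.
```csharp
using System;

public sealed class FixedStorage
{
    readonly byte[] buffer;
    int used;

    public FixedStorage(int capacityBytes) { buffer = new byte[capacityBytes]; }

    public bool IsFull => used >= buffer.Length;

    // Returns false once the buffer can't hold the mark - the caller then snapshots/exports.
    public bool TryWrite(ReadOnlySpan<byte> mark)
    {
        if (used + mark.Length > buffer.Length)
            return false;
        mark.CopyTo(buffer.AsSpan(used));
        used += mark.Length;
        return true;
    }
}
```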
Merge: from profiling_improvements
- Buildfix for tests trying to use ServerProfiler in non-server env
Tests: scripts compile in editor CLIENT mode
Buildfix: exclude ServerProfiler tests from non-server builds
Tests: built CLIENT config in editor
Merge: from profiling_improvements
- Rewrote internal storage of profiling data to use 1 buffer per thread
- Bugfix for allocation graphs not properly resetting and having gaps at the start
- Bugfix for failing to generate export in rare cases where mono runtime allocates with no managed callstack
Tests: unit tests and generating snapshots in editor
Bugfix: ProfileExporter.JSON - don't spam allocs-0 counter for worker threads
Tests: snapshot in editor on craggy
Update: ServerProfiler.Core binaries
- Release bins built from 66537fcc
Couldn't remember whether I'd snuck in debug bins at one point, so updating them to be safe.
Tests: unit tests
Bugfix: ProfileExporter.JSON - reset allocation graphs
- Reset when a new frame starts
- Reset on worker threads if it allocates
Tests: snapshot in editor on craggy
Clean: ProfileExporter.JSON - don't cache per-frame callstack depths
Was never used, so don't waste allocs.
Tests: none, trivial change
Clean: ProfileExporter.JSON - remove debug logs
Tests: none, trivial change
Bugfix: ProfileExporter.JSON - gracefully handle managed allocations coming from native runtime
- Emit "<mono-native-runtime>" if we don't have a managed callstack
Finally caught it - this can happen when mono invokes a managed callback that requires a managed allocation (the callback accepts string[], for example) as the first method in managed code. Was able to repro in the editor thanks to its script compilation callbacks. The exporter-side fallback is sketched below.
Tests: triggered perfsnapshot 40 times without issues
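The fallback is essentially this - types and field names here are illustrative, not the exporter's actual data model:
```csharp
// Simplified stand-in for an exported allocation record.
struct AllocRecord
{
    public int CallstackDepth;
    public string TopMethodName;
}

static class AllocSiteResolver
{
    // Attribute an allocation with no managed frames to a synthetic runtime frame
    // instead of failing the export.
    public static string Resolve(in AllocRecord alloc)
        => alloc.CallstackDepth == 0 ? "<mono-native-runtime>" : alloc.TopMethodName;
}
```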
Bugfix: ProfileExporter - avoid reading allocs at the start of the frame as method-entries
- Added a bunch of temporary logging to help track down last issue
Rare, but legal due to how we filter instrumented code.
Tests: snapshotted a bunch of times in editor (there's still one issue with main thread export)
Update: ProfileBinViewer displays binary offsets for marks
Tests: opened a couple bin snapshots
Bugfix: avoid out-of-bounds access when scanning for alloc-only threads
Fixes perf snapshot export failing to generate while processing worker threads. There's still a rare case of main thread crashing - investigating
Tests: Exported a snapshot in editor
Update: ProfileBinViewer can now grok the new profile .bin files
Tests: opened bin profile from editor
Update: rewrote ServerProfiler exporters to work with the new data layout
- expanded unit test
- Core binary should be the same, but copied from the freshest release build
- purposefully borked ProfileBinViewer - will fix next
Tests: unit tests & took a snapshot on craggy
Update: Rewrote how ServerProfiler.Core stores profiling data
- Got a 2% improvement on the 1-million-empty-calls benchmark
It will be easier to add new features to the profiler (user scopes, user counters, continuous capture). Need to rewrite the exporters now. The per-thread storage idea is sketched below.
Tests: unit tests
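Conceptually the storage now looks like the sketch below: one append-only buffer per thread, registered centrally so the exporter can walk them after a snapshot. Mark and ThreadBuffer are simplified stand-ins; the real implementation lives in the native ServerProfiler.Core library.
```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public struct Mark { public long Timestamp; public int MethodId; public byte Kind; }

public sealed class ThreadBuffer
{
    public readonly int ThreadId;
    readonly List<Mark> marks = new List<Mark>();
    public ThreadBuffer(int threadId) { ThreadId = threadId; }
    public void Append(in Mark mark) { marks.Add(mark); } // hot path: thread-owned, no locks
}

public static class ProfileStorage
{
    // All per-thread buffers, so the exporter can walk them after a snapshot.
    static readonly ConcurrentBag<ThreadBuffer> all = new ConcurrentBag<ThreadBuffer>();

    [ThreadStatic] static ThreadBuffer current;

    public static void Record(in Mark mark)
    {
        var buf = current ?? (current = Register());
        buf.Append(mark);
    }

    static ThreadBuffer Register()
    {
        var buf = new ThreadBuffer(Thread.CurrentThread.ManagedThreadId);
        all.Add(buf); // registration is the only synchronized step
        return buf;
    }
}
```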
Tests: perf test to measure ServerProfiler overhead
Going to try a different internal storage approach to make future changes easier (and potentially faster)
Tests: ran the new tests
Merge: from parallel_validatemove
- Removes PlayerCache.ValidPlayers allocs
Tests: took a snapshot on Craggy in editor
Optim: PlayerCache.ValidPlayers no longer allocates garbage (approach sketched below)
Tests: took snapshot on Craggy in editor to confirm
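Hedged sketch of one way to get there - the actual PlayerCache internals and types may differ; PlayerInfo and the backing list here are made up for illustration. The idea is to reuse a scratch list instead of building a new one per call.
```csharp
using System.Collections.Generic;

public class PlayerCache
{
    public struct PlayerInfo { public ulong UserId; public bool IsValid; }

    readonly List<PlayerInfo> players = new List<PlayerInfo>();
    readonly List<PlayerInfo> validScratch = new List<PlayerInfo>();

    // Returns a reused list (zero allocations per call); callers must not cache it.
    public List<PlayerInfo> ValidPlayers()
    {
        validScratch.Clear();
        for (int i = 0; i < players.Count; i++)
            if (players[i].IsValid)
                validScratch.Add(players[i]);
        return validScratch;
    }
}
```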
Merge: from main
Tests: none, no conflicts
Merge: from profiling_improvements
- Reduces overhead of serializing to/from ProtoBuf by not tracking BufferStream calls
Tests: snapshot on Craggy in editor
Update: further filtering of methods
- Dropping BaseEntity.Is* methods that are just HasFlag wrappers
- Dropping new Rust.Data BufferStream and RangeHandle
Should reduce serialization overhead (illustrative filter below)
Tests: snapshot on Craggy in editor
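An illustrative filter along these lines - just the pattern, not the actual filtering code:
```csharp
static class MethodFilter
{
    // Skip instrumenting trivial wrappers and the Rust.Data stream types named above.
    public static bool ShouldInstrument(string typeName, string methodName)
    {
        if (typeName == "BaseEntity" && methodName.StartsWith("Is"))
            return false; // thin HasFlag wrappers
        if (typeName == "BufferStream" || typeName == "RangeHandle")
            return false; // Rust.Data serialization types
        return true;
    }
}
```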
Merge: from main
Tests: none, no conflicts
Merge: from relationshipmanager_leaks
- Server and Client-side bugfixes for pooling around RelationshipManager types
Tests: local 2-player session, flicking the contacts screen open and closed, with pooling tracking
Merge: from main
Tests: none, no conflicts
Bugfix: return ProtoBuf.PlayerRelationships back to pool after client rpc is executed
Think it was the only PlayerRelationships leak on the client (pool-return pattern sketched below)
Tests: local session on Craggy, opened & closed contacts - saw pool metrics stay stable
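The pattern being applied, sketched with generic callbacks rather than the game's actual pooling API: always hand the pooled message back once the RPC handler has consumed it, even if the handler throws.
```csharp
using System;

public static class RpcPooling
{
    // Runs the handler, then returns the pooled message even if the handler throws.
    public static void HandlePooledMessage<T>(T message, Action<T> handler, Action<T> returnToPool)
    {
        try
        {
            handler(message);
        }
        finally
        {
            returnToPool(message); // skipping this is exactly the kind of leak fixed above
        }
    }
}
```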
Update: using new RelationshipManager.ClearRelations to avoid potential pooling leaks
Tests: none
Bugfix: clear PlayerRelationshipInfo when returning PlayerRelationship to pool
Tests: none, since it turns out that we don't have any code that stresses this path - one more thing to fix then
Bugfix: return PlayerRelationshipInfo back to pool when forgetting a player
2 more places to fix
Tests: none
Bugfix: return to pool ProtoBuf.RelationshipManager.PlayerRelationships + nested types after client rpcs
There's still one more server leak and a client leak
Tests: local 2-player session. Made sure changed code is being stepped through.
Merge: from main
Tests: none, no conflicts
Merge: from reduce_appmarkersellorder_allocs
- Reduces pool spillage and misses of ProtoBuf.AppMarker.SellOrder
Tests: started server on procgen map