We are using SQL Server 2014 Enterprise with a transactional push publication. The distributor is on its own SQL Server instance. The subscriber is on SQL Server 2012 Enterprise. We run a 4-node cluster using AlwaysOn. Our replication process had been error-free for over two years until a couple of days ago. So far there have been two incidents in which the error alert reported "The row was not found at the subscriber when applying the replicated command". Replication was trying to update a record, but that record did not exist at the subscriber.
To troubleshoot, I checked the permissions of all users on that database, and they are all read-only. I checked the profiler trace (a server-side trace) and searched for any DELETE statement issued by a user or by replication, and found none. I then checked the insert sequence, which is based on an identity key. Over roughly a half-hour span, the publisher database inserted records #1001, 1002, 1003, and 1004, but the profiler trace on the subscriber showed inserts only for #1001, 1002, and 1004. For some reason, replication did not deliver the insert command for record #1003, so the later update to that record failed.
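The identity-key comparison described above can be sketched as follows. This is only a minimal illustration in Python, assuming the identity values have already been exported from the publisher and subscriber tables into two lists; the function name `find_missing_keys` and the sample values are hypothetical and simply mirror the example in this post.

```python
def find_missing_keys(publisher_keys, subscriber_keys):
    """Return identity values present at the publisher but absent
    at the subscriber, in ascending order."""
    return sorted(set(publisher_keys) - set(subscriber_keys))


# Hypothetical sample data mirroring the example above.
publisher_keys = [1001, 1002, 1003, 1004]
subscriber_keys = [1001, 1002, 1004]

missing = find_missing_keys(publisher_keys, subscriber_keys)
print(missing)  # prints [1003]
```

A set difference is used rather than looking for gaps in a single sequence, because identity columns can have legitimate gaps at the publisher (for example after a rolled-back insert); only keys that exist on one side and not the other are of interest here.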
In our replication setup, we don't use filters or any specialized custom settings. We basically use the default setup to keep things simple to manage. This is the very first time we have seen this kind of 'skipped insert' situation. This past week our publisher server has been very busy, and I did see significant latency during the daytime, while at night it was back to normal. Strangely enough, both 'skipped insert' incidents happened in the evening: one at about 8 pm CST and one at 11:40 pm. Has anyone experienced this type of problem, or does anyone know how to track down the root cause?
Thanks in advance,
OD
Ocean Deep