RESOLVED: Websites content no longer available (no impact on live chat)
May 3 4:14am Mountain time The issue has been resolved. All websites are back online.
May 3 3:26am Mountain time Our website content hosting provider posted an update on their service disruption: https://wpenginestatus.com/read-only-on-some-servers-in-one-us-datacenter/ As the issue is not yet resolved and there is no ETA available from our provider, we are redirecting all traffic from our marketing website to the SnapEngage application signin page until the issue is resolved.
May 3 3:05am Mountain time Our website content hosting provider is currently experiencing an outage causing our websites for product information, client self support, developer documentation and service status to be unavailable. There is no impact on the SnapEngage live chat service though. We are in communication with our provider to get the issue resolved.
RESOLVED: We are currently experiencing issues running queries to generate reports in the SnapEngage Analytics
Feb. 29 3:49pm Mountain time The issue has been resolved. As always we will continue to monitor things and will post any follow-up information as it becomes available.
Feb. 29 12:34pm Mountain time The issue with the analytics is subsiding but we’re continuing to monitor things. We will post another update as soon as we are certain the issue has been completely resolved.
Feb. 29 8:28am Mountain time We are currently experiencing issues running queries to generate reports in the SnapEngage Analytics. We are working with our service provider on a quick resolution.
RESOLVED: Occurrence of chat routing inconsistencies
Feb. 24 8:07am Mountain time Ten minutes ago, our hosting provider replaced the server component that was the root cause of the small bursts of errors. If you still experience any chat routing inconsistencies, please come chat with us. We are continuing to monitor things on our end, and everything is looking fine.
Feb. 24 6:29am Mountain time Our hosting provider has isolated a component which seems to have been causing small bursts of errors over the past couple of days. It is most likely the root cause of the chat routing inconsistencies that some clients have been experiencing (i.e. chats that expired from being idle were not closed promptly, causing some agents to have chat slots blocked for a few additional minutes). We have confirmation that they are actively working on the issue. We will post an update when the problem is confirmed to have been addressed, or if they provide an ETA for resolution.
Feb. 24 2:17am Mountain time We are no longer seeing any routing inconsistencies and have had no client reports for a few hours, but we are continuing to work with our hosting provider and are extracting logs to assist their investigation. We are adding a few safety measures to better handle underlying platform latency and to cope with potential performance degradation on our hosting platform. If you experience any issues, please come chat with us.
Feb. 23 4:02pm Mountain time We’ve been making some small adjustments to the system to help compensate for some of the issues that our hosting provider is experiencing.
Feb. 23 12:57pm Mountain time We’re still continuing to look into this issue with our hosting provider to determine the source of the problem. We’ll continue to update this post as we learn more.
Feb. 23 10:47am Mountain time Some customers have reported chat routing inconsistencies. We are currently investigating this issue with our hosting provider and are working to quickly identify and resolve this issue.
RESOLVED: Service disruption for the Visitor Chat API*
Postmortem: We have identified that our error monitoring did not detect the increased error rate on the API endpoints, requiring our customers to report the issue before we could escalate to our hosting provider. We have taken corrective action and reconfigured our alert policies so that we are notified as soon as the error rate increases on this component. If the error rate increases on the API again, our technical team will be notified right away. Google Cloud Platform is still working on a full resolution of the deployment process which introduced the configuration issue yesterday.
6:59 am Mountain time We have leveraged a work-around provided by Google to stop the Chat API error rate. A permanent solution is being worked on by Google. Customers using the Chat API in their mobile applications should see the API performing back at normal levels.
6:26 am Mountain time Google Cloud Platform, our hosting provider, has identified a configuration problem on their infrastructure that seems to be the root cause of the Chat API returning HTTP 500. Google’s system reliability engineers are working on a resolution to restore the proper configuration. We are waiting for a resolution or an ETA for the resolution from Google now.
6:03 am Mountain time We are still actively working on the issue, and our hosting provider is doing the same in parallel. We are trying a few actions to attempt a resolution, but it seems a correction from our hosting provider will be necessary. We will post an update as soon as we have additional feedback, or in an hour from now.
5:04 am Mountain time The API developers are still working on getting to the root cause of the elevated error rate to resolve the issue. We are working with our hosting provider to help localize the root cause. We will post an update in an hour or earlier.
4:00 am Mountain time We are seeing the Chat API reporting a high error rate (HTTP 500). This API is used by some of our clients to add the live chat functionality into their own mobile applications. The API developers are working on resolving this as soon as possible. We will post an update in an hour or before. The elevated error rate on the API endpoint started a few hours ago; we are still researching exactly when it started.
*Please note that normal chats inside web browsers, desktop or mobile, are not impacted.
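As an illustration of the kind of alert policy described in the postmortem above, here is a minimal sketch of a sliding-window error-rate check. It is not SnapEngage's actual monitoring configuration: the class name `ErrorRateMonitor`, the window size, and the threshold are all hypothetical values chosen for the example.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical monitor: tracks recent HTTP status codes and flags
    an elevated rate of server errors (status 500 and above)."""

    def __init__(self, window_size=100, threshold=0.05):
        # Keep only the most recent `window_size` responses.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status_code):
        # Store True for a server error, False otherwise.
        self.window.append(status_code >= 500)

    def error_rate(self):
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)

    def should_alert(self):
        # Alert once the share of server errors exceeds the threshold.
        return self.error_rate() > self.threshold


# Example values are made up for illustration.
monitor = ErrorRateMonitor(window_size=10, threshold=0.2)
for code in [200, 200, 500, 200, 500, 500, 200, 200, 200, 200]:
    monitor.record(code)

print(monitor.error_rate())
print(monitor.should_alert())
```

A check like this would notify the team when the share of HTTP 500 responses rises, rather than relying on customer reports.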
RESOLVED: Brief occurrence of chat routing inconsistencies
5:05pm Mountain time As of this time all issues have been fully resolved.
4:57pm Mountain time Our hosting provider, Google Cloud Platform, recently posted an update that some of its users were affected by latency. They stated that “The issue with persistent disks latency should have been resolved as of 15:20 US/Pacific.” The issue should now be resolved for us.
4:15pm Mountain time Some customers have reported chat routing inconsistencies. We are currently investigating this issue with our hosting provider and are working to quickly identify and resolve this issue.
RESOLVED: Service disruption for analytics, file upload, and some widget setting changes
Postmortem Google has posted a detailed explanation of the cause and resolution to this issue.
11:45 pm Mountain time We are currently experiencing some features not working as expected. Our core services are not impacted. The Chat Portal, the visitor side live chat, etc. are working as expected. The specific features currently not working are the analytics, the file upload (and download) feature, and changing settings of your widgets. We are actively investigating the issue to find the root cause, and are in contact with our hosting provider.
12:40 am Mountain time Google has acknowledged a service disruption impacting the components required for our analytics, our file upload feature and the API we use to update widgets. They confirmed they are working on the issue but they have not provided an ETA for resolution yet.
1:20 am Mountain time Google, our platform provider, is still actively working the issue. The SnapEngage team is looking into possible work-arounds as well. We apologize for this service disruption. We will post an update in 30 minutes.
1:52 am Mountain time Google, our platform provider, is still actively working the issue. No work-around has been found to bypass the Google Cloud Platform disruption. We will post an update in 30 minutes.
2:22 am Mountain time Google, our platform provider, is still actively working the issue. We are in regular contact with them trying to get a status and ETA. They are going to provide an update at 3:30 am. We will pass it here too, or will update as soon as we have something new.
3:30 am Mountain time Google, our platform provider, is still actively working the issue. They are going to provide an update at 4:30 am. We will pass it here too, or will update as soon as we have something new.
4:28 am Mountain time Google, our platform provider, is now rolling out a fix for the issue. They stated a few minutes ago that they “expect a full resolution in the near future”, with the next update from them at 9:00 am Mountain Time. We are not seeing much improvement with the analytics, file upload and widget style updates yet. We will post back here as soon as we see some significant improvements, or when we have an update within an hour.
7:30 am Mountain time Google, our platform provider, is still actively working the issue. They are going to provide an update at 8:30 am. We will pass it here too, or will update as soon as we have something new.
8:59 am Mountain time Google, our platform provider, is still actively working the issue. We are very sorry the full resolution is taking so much time. As documented earlier, there is no impact on chats. Analytics are unfortunately not accessible because of this platform outage, but no data will be lost. The file upload feature is currently not available, and updating widget style, proactive chat rules, and system messages is not possible. Google is reporting that their “engineering teams are working on a complete resolution at the highest priority”.
RESOLVED: Chat routing inconsistency
2:28 am Mountain time We started rolling out a change in our chat routing logic 10 minutes ago, and we are seeing this new code not performing as anticipated. Some chat agents are getting more chats than expected, and some chats in broadcast mode are not reaching all the agents. Only a subset of customers is impacted. We are initiating a rollback to return to the previous chat routing logic, without the change that is causing these issues.
2:44 am Mountain time We have rolled back the code change, and cleaned up the chat routing information. Customers who experienced some inconsistency in the last 20 minutes should no longer see any unexpected behavior.
We truly apologize for this incident. We push updates and improvements to the SnapEngage service daily, and we always run extensive testing to prevent such incidents. Our test coverage missed some scenarios, and we will correct that to prevent such an issue from happening in the future.
RESOLVED: Infrastructure issues
11:15 am Mountain time It appears that our hosting provider was having some issues that were causing some of our data to get out of sync. The issue seems to be resolved; however, we will be monitoring the situation and will update here if there is anything new to report.
10:40 am Mountain time We’re currently investigating an issue with chats that are getting marked as notified, or in the queue, but then not getting updated when they are picked up by agents. These chats are showing in the Dashboard as in the notified, or queued, status even though they are currently active with an agent. Along with this, because of the mis-marking, these chats are taking a little longer to get closed out completely. We are currently investigating and will post updates here.
RESOLVED: Infrastructure issues
October 23rd, 2015 5:02 pm Mountain Time Our hosting provider, Google Cloud Platform, is currently experiencing elevated errors in some of their services. We are working closely with Google to insulate our service from these underlying issues. However, as a result of these errors, some SnapEngage customers may be experiencing the following issues:
- Agent status may not be accurate
- Agents may get more chats than the maximum number of chats configured
- Chats may appear in the queue in the monitoring page while they have already been responded to (no impact for visitor or agent)
October 26th, 2015 9:15 am Mountain Time The problems we had been experiencing with our hosting provider have been resolved. As we learn more about the details of the issue we will post updates here. We have made changes to our service to help insulate ourselves from these types of issues in the future.
RESOLVED: Max chat per agent issue
2:00 am Mountain Time Our hosting provider, Google Cloud Platform, completed their analysis of the issue causing the max chat per agent problem last night: “The detailed analysis showed that 10% of memcache set calls were failing between 20:00 and 20:15” (time from Google in Pacific Time). We have been monitoring the chat assignment closely for the past few hours, and we haven’t seen the issue recur. The problem was therefore isolated to the 9:00pm to 9:15pm Mountain Time period, and impacted chat teams with high traffic, causing a few agents to receive more chats than specified in the team or agent max chat settings. No chat communication was impacted. Please contact us if you have further questions.
9:50 pm Mountain Time We are seeing that some customers have agents getting more chats assigned to them than the maximum defined in the widget configuration or the agent-specific configuration. We traced the issue back to a component of our hosting provider not fully behaving as expected. We have opened a ticket with our hosting provider to get this issue resolved as soon as possible. In the meantime, your agents might be impacted and receive slightly more chats than the maximum you have defined for them. We will post an update as soon as this issue has been resolved.
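For readers checking the time window quoted in the 2:00 am update, the Pacific-to-Mountain conversion can be sketched as follows. This is only an illustration: the calendar date used is a placeholder (the incident date is not stated above), and the fixed UTC offsets assume both zones are on standard time. The one-hour difference is the same when both observe daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assuming standard time in both zones (an assumption
# made for this example).
PACIFIC = timezone(timedelta(hours=-8), "PST")
MOUNTAIN = timezone(timedelta(hours=-7), "MST")


def pacific_to_mountain(naive_dt):
    """Interpret a naive datetime as Pacific time and convert it to Mountain."""
    return naive_dt.replace(tzinfo=PACIFIC).astimezone(MOUNTAIN)


# Placeholder date; only the time of day comes from the update above.
start = pacific_to_mountain(datetime(2015, 1, 1, 20, 0))
end = pacific_to_mountain(datetime(2015, 1, 1, 20, 15))

print(start.strftime("%H:%M"), "to", end.strftime("%H:%M"), "Mountain Time")
```

The 20:00 to 20:15 Pacific window reported by Google maps to 21:00 to 21:15 Mountain Time, matching the 9:00pm to 9:15pm period given in the update.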