Integration with HubSpot failing to create timeline events

RESOLVED: Integration with HubSpot failing to create timeline events
All systems are back to normal

 

July 26 12:14pm MDT The issue came back for some clients in the past couple of hours. HubSpot’s technical team reworked some permissions. They have resolved it again now and are ensuring that the problem does not come back. We have just finished the delivery of the impacted chats and offline messages.

July 26 8:39am MDT All chats and offline messages that were not delivered during the earlier HubSpot outage have been sent and delivered.

July 26 7:28am MDT HubSpot has addressed the issue on their end. The integration is properly sending cases from new chats, and we are in the process of sending a very large batch of unsent chats and offline messages from the past 10 hours. We will post an update here once all chats and offline messages have been sent to HubSpot.

July 26 5:30am MDT HubSpot acknowledged the issue and is working on a resolution. Our team is ready to resubmit all chats and offline messages once HubSpot has resolved the API issue.
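For readers curious how undelivered chats and offline messages can be held and resubmitted once a third-party API recovers, here is a minimal sketch. It is an illustration only, not the actual SnapEngage implementation: the `Outbox` class, its methods, and the `send` callable are hypothetical names, and the sketch assumes `send` either returns a truthy value on success or raises an exception on failure.

```python
from collections import deque
from typing import Callable, Deque, Dict


class Outbox:
    """Holds items that could not be delivered and replays them later.

    Hypothetical helper for illustration; `send` stands in for whatever
    call delivers one chat or offline message to the remote system.
    """

    def __init__(self, send: Callable[[Dict], bool]) -> None:
        self._send = send
        self._pending: Deque[Dict] = deque()

    def _try_send(self, item: Dict) -> bool:
        # Treat any exception from the remote call as a failed delivery.
        try:
            return bool(self._send(item))
        except Exception:
            return False

    def deliver(self, item: Dict) -> bool:
        """Try to deliver now; queue the item if delivery fails."""
        if self._try_send(item):
            return True
        self._pending.append(item)
        return False

    def flush(self) -> int:
        """Replay queued items oldest first; stop at the first failure.

        Stopping early keeps the original order and leaves the failed
        item (and everything after it) in the queue for the next attempt.
        """
        delivered = 0
        while self._pending:
            if not self._try_send(self._pending[0]):
                break
            self._pending.popleft()
            delivered += 1
        return delivered

    def pending_count(self) -> int:
        return len(self._pending)
```

In this sketch, nothing is dropped while the remote API is down: each failed item stays queued until a later `flush()` succeeds.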

July 26 3:30am MDT We have notified HubSpot, but are still awaiting a resolution time frame from their side. Our technical team is evaluating options on our end, possibly reverting our clients from the timeline event mode to the old form submission mode. We are working to ensure that no data is missing when the integration is restored. We will post an update in a couple of hours or as soon as we have more information.

July 26 1:00am MDT A breaking change in the HubSpot API is preventing SnapEngage clients from creating timeline events in their HubSpot accounts. The issue seems to have started around 5pm yesterday and was initially assessed as a transient HubSpot problem. We are escalating the issue with HubSpot now.

Integrations with CRMs, Help Desks, and other 3rd parties experiencing delays

RESOLVED: Integrations with CRMs, Help Desks, and other 3rd parties experiencing delays
All systems are back to normal

 

July 13 7:40am Mountain time The backlog of all chat transcripts and offline messages from the early hours of the incident has been delivered. If you are unable to locate a transcript in your Help Desk or CRM, please come chat with us so that we can assist you in locating it.

July 13 3:59am Mountain time Google confirmed they mitigated the issue at 3:50am (9 minutes ago). We are seeing integration communications return to normal. All chat transcripts and offline messages from the past hours are being delivered now. Chats and offline messages that started more than 2 hours ago and are not yet in your CRM or Help Desk system might take a few additional hours to be delivered. We will post an update once we can confirm that the backlog of all chat transcripts and offline messages has been delivered.

July 13 3:56am Mountain time Google communicated with us that their networking engineers are working on redirecting the Fetch traffic (used by 3rd party integrations) to a stable environment. They have not provided an ETA yet, but we anticipate a quick resolution now.

July 13 3:35am Mountain time Our hosting provider posted an update on the ongoing platform issue, saying they are still working on the issue, and will post another update at 03:30 US/Pacific.

July 13 2:50am Mountain time Our hosting provider published an update on the ongoing platform issue:

We are currently investigating an intermittent issue with Google App Engine URLFetch API service. Fetch requests to non-Google related services are failing with deadline exceeded errors.
For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 02:30 US/Pacific with current details.

We will post an update as soon as one is available from Google Cloud Platform.

July 13 2:38am Mountain time Our hosting provider, Google Cloud Platform, has localized the issue causing an elevated error rate with the infrastructure we rely on for our integrations. They are working on a fix. We will post an update as soon as one is available from them, or in an hour at the latest.

July 13 1:35am Mountain time Integrations with 3rd party providers are currently experiencing delays (social discovery, weather, help desks, CRMs, knowledge bases). We are investigating the issue with our hosting provider. We will post an update within an hour.

Website content issue

RESOLVED: Website content no longer available (no impact on live chat)
All systems are back to normal

May 3 4:14am Mountain time The issue has been resolved. All websites are back online.

May 3 3:26am Mountain time Our website content hosting provider posted an update on their service disruption: https://wpenginestatus.com/read-only-on-some-servers-in-one-us-datacenter/ As the issue is not yet resolved and there is no ETA available from our provider, we are redirecting all traffic from our marketing website to the SnapEngage application signin page until the issue is resolved.

May 3 3:05am Mountain time Our website content hosting provider is currently experiencing an outage causing our websites for product information, client self support, developer documentation and service status to be unavailable. There is no impact on the SnapEngage live chat service though. We are in communication with our provider to get the issue resolved.

Analytics issues

RESOLVED: We are currently experiencing issues running queries to generate reports in the SnapEngage Analytics
All systems are back to normal

Feb. 29 3:49pm Mountain time The issue has been resolved. As always we will continue to monitor things and will post any follow-up information as it becomes available.

Feb. 29 12:34pm Mountain time The issue with the analytics is subsiding but we’re continuing to monitor things. We will post another update as soon as we are certain the issue has been completely resolved.

Feb. 29 8:28am Mountain time We are currently experiencing issues running queries to generate reports in the SnapEngage Analytics. We are working with our service provider on a quick resolution.

Chat Routing Inconsistencies

RESOLVED: Occurrence of chat routing inconsistencies
All systems are back to normal

 

Feb. 24 8:07am Mountain time Ten minutes ago, our hosting provider replaced the server component that was the root cause of the small bursts of errors. If you still experience chat routing inconsistencies, please come chat with us. We are continuing to monitor things on our end, and everything is looking fine.

Feb. 24 6:29am Mountain time Our hosting provider has isolated a component which seems to have been causing small bursts of errors over the past couple of days. It is most likely the root cause of the chat routing inconsistencies that some clients have been experiencing (i.e. chats that expired due to being idle were not closed in a timely manner, causing some agents to have chat slots blocked for a few additional minutes). We have confirmation that they are actively working on the issue. We will post an update when the problem is confirmed to have been addressed, or if they provide an ETA for resolution.

Feb. 24 2:17am Mountain time We have not seen any routing inconsistencies and have not received any client reports for a few hours, but we are continuing to work with our hosting provider and are extracting logs to assist them in their investigation. We are adding a few safety measures to better handle underlying platform latency and cope with potential performance degradation on our hosting platform. If you experience any issues, please come chat with us.
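The update above does not detail the safety measures. One common way to cope with intermittent platform latency is to retry a timed-out operation with an increasing delay, and the sketch below illustrates that general idea only. The function name `call_with_retries` and its parameters are hypothetical and do not describe the actual SnapEngage changes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on TimeoutError with exponential backoff.

    Hypothetical helper for illustration. The delay doubles after each
    failed attempt (base_delay, 2 * base_delay, 4 * base_delay, ...).
    The last TimeoutError is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    # Unreachable: the loop always returns or raises.
    raise RuntimeError("unreachable")
```

The `sleep` parameter is injectable so the backoff can be exercised in tests without actually waiting.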

Feb. 23 4:02pm Mountain time We’ve been making some small adjustments to the system to help compensate for some of the issues that our hosting provider is experiencing.

Feb. 23 12:57pm Mountain time We’re still continuing to look into this issue with our hosting provider to determine the source of the problem. We’ll continue to update this post as we learn more.

Feb. 23 10:47am Mountain time Some customers have reported chat routing inconsistencies. We are currently investigating this issue with our hosting provider and are working to quickly identify and resolve this issue.

Service disruption for the Visitor Chat API

RESOLVED: Service disruption for the Visitor Chat API*
All systems are back to normal

 

Postmortem: We have identified that our error monitoring did not detect the increased error rate on the API endpoints, which meant our customers had to report the issue before we could escalate it to our hosting provider. We have taken corrective action and reconfigured our alert policies so that we are notified as soon as the error rate increases on this component. If the error rate increases on the API again, our technical team will be notified right away. Google Cloud Platform is still working on a full resolution of the deployment process which introduced the configuration issue yesterday.
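As an illustration of the kind of alert policy described in the postmortem, the sketch below tracks the share of HTTP 5xx responses over a sliding time window and signals when it crosses a threshold. This is a simplified example under stated assumptions, not our production monitoring configuration: the `ErrorRateMonitor` class, the window length, the threshold, and the minimum request count are all hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateMonitor:
    """Sliding-window error-rate check (hypothetical, for illustration)."""

    def __init__(
        self,
        window_seconds: float = 300.0,
        threshold: float = 0.05,
        min_requests: int = 20,
    ) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        # Require a minimum sample size so a single failed request
        # during a quiet period does not trigger an alert.
        self.min_requests = min_requests
        self._events: Deque[Tuple[float, bool]] = deque()

    def _evict(self, now: float) -> None:
        # Drop events that are older than the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def record(self, timestamp: float, status_code: int) -> None:
        """Record one response; status codes of 500 and above count as errors."""
        self._events.append((timestamp, status_code >= 500))
        self._evict(timestamp)

    def error_rate(self, now: float) -> float:
        self._evict(now)
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def should_alert(self, now: float) -> bool:
        self._evict(now)
        if len(self._events) < self.min_requests:
            return False
        return self.error_rate(now) > self.threshold
```

A check like `should_alert` would run periodically and page the on-call team when it returns true, rather than waiting for customer reports.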

6:59 am Mountain time We have leveraged a work-around provided by Google to stop the Chat API error rate. A permanent solution is being worked on by Google. Customers using the Chat API in their mobile applications should see the API performing back at normal levels.

6:26 am Mountain time Google Cloud Platform, our hosting provider, has identified a configuration problem on their infrastructure that seems to be the root cause of the Chat API returning HTTP 500. Google’s system reliability engineers are working on a resolution to restore the proper configuration. We are waiting for a resolution or an ETA for the resolution from Google now.

6:03 am Mountain time We are still actively working on the issue, and our hosting provider is doing the same in parallel. We are trying a few actions to attempt a resolution, but it seems that a correction from our hosting provider will be necessary. We will post an update as soon as we have additional feedback, or in an hour from now.

5:04 am Mountain time The API developers are still working on getting to the root cause of the elevated error rate to resolve the issue. We are working with our hosting provider to help localize the root cause. We will post an update in an hour or earlier.

4:00 am Mountain time We are seeing the Chat API reporting a high error rate (HTTP 500). This API is used by some of our clients to add the live chat functionality into their own mobile applications. The API developers are working on resolving this as soon as possible. We will post an update in an hour or before. The elevated error rate on the API endpoint started a few hours ago; we are still researching exactly when it started.

*Please note that normal chats inside web browsers, desktop or mobile, are not impacted.

Brief occurrence of chat routing inconsistencies

RESOLVED: Brief occurrence of chat routing inconsistencies
All systems are back to normal

5:05pm Mountain time  As of this time all issues have been fully resolved.

4:57pm Mountain time Our hosting provider, Google Cloud Platform, recently posted an update that some of its users were affected by latency. They stated that “The issue with persistent disks latency should have been resolved as of 15:20 US/Pacific.” The issue should now be resolved for us.

4:15pm Mountain time  Some customers have reported chat routing inconsistencies. We are currently investigating this issue with our hosting provider and are working to quickly identify and resolve this issue.

Service disruption for Analytics, File Upload, and Style change on widget

 

RESOLVED: Service disruption for analytics, file upload, and some widget setting changes
Partial service interruption

Postmortem: Google has posted a detailed explanation of the cause and resolution of this issue.

11:45 pm Mountain time We are currently experiencing some features not working as expected. Our core services are not impacted. The Chat Portal, the visitor side live chat, etc. are working as expected. The specific features currently not working are the analytics, the file upload (and download) feature, and changing settings of your widgets. We are actively investigating the issue to find the root cause, and are in contact with our hosting provider.

12:40 am Mountain time Google has acknowledged a service disruption impacting the components required for our analytics, our file upload feature and the API we use to update widgets. They confirmed they are working on the issue but they have not provided an ETA for resolution yet.

1:20 am Mountain time Google, our platform provider, is still actively working the issue. The SnapEngage team is looking into possible work-arounds as well. We apologize for this service disruption. We will post an update in 30 minutes.

1:52 am Mountain time Google, our platform provider, is still actively working the issue. No work-around has been found to bypass the Google Cloud Platform disruption. We will post an update in 30 minutes.

2:22 am Mountain time Google, our platform provider, is still actively working on the issue. We are in regular contact with them to get a status and an ETA. They are going to provide an update at 3:30 am. We will relay it here as well, or will update as soon as we have something new.

3:30 am Mountain time Google, our platform provider, is still actively working on the issue. They are going to provide an update at 4:30 am. We will relay it here as well, or will update as soon as we have something new.

4:28 am Mountain time Google, our platform provider, is now rolling out a fix for the issue. They stated a few minutes ago that they “expect a full resolution in the near future”, with the next update from them at 9:00 am Mountain Time. We are not seeing much improvement with the analytics, file upload and widget style updates yet. We will post back here as soon as we see significant improvement, or with an update within an hour.

7:30 am Mountain time Google, our platform provider, is still actively working on the issue. They are going to provide an update at 8:30 am. We will relay it here as well, or will update as soon as we have something new.

8:59 am Mountain time Google, our platform provider, is still actively working the issue. We are very sorry the full resolution is taking so much time. As documented earlier, there is no impact on chats. Analytics are unfortunately not accessible because of this platform outage, but no data will be lost. The file upload feature is currently not available, and updating widget style, proactive chat rules, and system messages is not possible. Google is reporting that their “engineering teams are working on a complete resolution at the highest priority”.

 

Chat routing inconsistency for some customers

RESOLVED: Chat routing inconsistency
All systems are back to normal

2:28 am Mountain time We started rolling out a change in our chat routing logic 10 minutes ago, and we are seeing this new code not performing as anticipated. Some chat agents are getting more chats than expected, and some chats in broadcast mode are not reaching all the agents. Only a subset of customers is impacted. We are initiating a rollback to return to the previous chat routing logic, without the change that is causing these issues.

2:44 am Mountain time We have rolled back the code change, and cleaned up the chat routing information. Customers who experienced some inconsistency in the last 20 minutes should not see any weirdness anymore.

We truly apologize for this incident. We push updates and improvements to the SnapEngage service daily, and we always run extensive testing to prevent such incidents. Our test coverage missed some scenarios, and we will correct that to prevent such an issue from happening in the future.

Chat Statuses Not Updating Correctly

RESOLVED: Infrastructure issues
All systems are back to normal

 

11:15 am Mountain time It appears that our hosting provider was having some issues that were causing some of our data to get out of sync. The issue seems to be resolved; however, we will be monitoring the situation and will update here if there is anything new to report.

10:40 am Mountain time We’re currently investigating an issue with chats that are getting marked as notified, or in the queue, but then not getting updated when they are picked up by agents.  These chats are showing in the Dashboard as in the notified, or queued, status even though they are currently active with an agent.  Along with this, because of the mis-marking, these chats are taking a little longer to get closed out completely.  We are currently investigating and will post updates here.