Prior Claimed Domains are reporting as Unclaimed against Atlassian Organizations
Incident Report for Atlassian Access
Postmortem

SUMMARY

From Feb 10th, 2021, at 3:15 AM UTC to Feb 11th, 2021, at 12:23 AM UTC, a subset of Atlassian customers using Trello, Jira, Opsgenie, Access, and Confluence were unable to log in. The event was caused by a faulty change to Atlassian Access that was deployed to production. The change caused Atlassian Access to verify domains and claim accounts associated with organizations, even though those organizations had not initiated the domain verifications or account claims. This did not have any impact on customer privacy. Customers in all regions were affected. The faulty change was activated, and the incident was triggered, when a scheduled job executed. The incident was detected after 118 minutes by customer support and mitigated by rolling back the faulty change and progressively restoring affected domains and accounts to a good state. The total time to resolution was about 21 hours and 8 minutes.
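The figures in this summary can be cross-checked with simple timestamp arithmetic. The sketch below is illustrative only: the variable names are ours, and it assumes the 118-minute detection delay is measured from the start of impact, which the summary does not state explicitly.

```python
from datetime import datetime, timedelta, timezone

# Impact window as stated in the summary (UTC).
start = datetime(2021, 2, 10, 3, 15, tzinfo=timezone.utc)
end = datetime(2021, 2, 11, 0, 23, tzinfo=timezone.utc)

# Total time to resolution, split into hours and minutes.
total = end - start
hours, remainder = divmod(int(total.total_seconds()), 3600)
minutes = remainder // 60

# Assumption: detection delay is counted from the start of impact.
detected_at = start + timedelta(minutes=118)

print(f"Time to resolution: {hours} h {minutes} min")
print(f"Approximate detection time: {detected_at.isoformat()}")
```

Under that assumption, detection falls at roughly 5:13 AM UTC on Feb 10th, and the window works out to the 21 hours and 8 minutes quoted above.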

The impact on the products affected is listed below.

IMPACT

The product-specific impact occurred between Feb 10th, 2021, 3:15 AM UTC and Feb 11th, 2021, 12:23 AM UTC.

Atlassian Access 

Confluence, Trello, Jira, Opsgenie

  • A subset of users were unable to log in to the products during this time.

ROOT CAUSE

The issue was caused by a faulty background job in Atlassian Access, which runs periodically to verify domain ownership and claim accounts for the verified domains. This resulted in some end-user accounts being locked out. As a result, the products listed above did not allow those end users to log in, and the users received login failure messages.
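To illustrate the kind of safeguard such a job depends on, here is a minimal sketch of a filter that only acts on domains whose organization actually initiated verification. It is not the Atlassian Access implementation: the `DomainRecord` type, its fields, and `domains_to_claim` are hypothetical names introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DomainRecord:
    """Hypothetical record for one domain/organization pair."""
    domain: str
    org_id: str
    verification_requested: bool  # True only if the organization initiated verification
    ownership_proven: bool        # True only if the ownership check passed


def domains_to_claim(records: Iterable[DomainRecord]) -> List[str]:
    """Return domains that were both requested by the organization and proven."""
    return [
        r.domain
        for r in records
        if r.verification_requested and r.ownership_proven
    ]


records = [
    DomainRecord("example.com", "org-1", True, True),
    DomainRecord("example.org", "org-2", False, True),   # never requested: skip
    DomainRecord("example.net", "org-3", True, False),   # not proven: skip
]
claimable = domains_to_claim(records)
```

In this sketch only `example.com` is eligible, because it is the only record where the organization requested verification and the ownership check succeeded.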

The faulty change was in one of the key services of our system, so it affected downstream systems, including the products mentioned above. Determining a good state took longer than anticipated.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. We deploy our changes progressively (by cloud region) to avoid broad impact. However, in this case our detection of the domain verification and accounts claim did not work as expected. Moving forward, to minimize the impact of breaking changes to our environments, we will implement preventative measures such as the ones listed below.

Prevention and Detection

  • While we have very good test coverage of the affected service, additional use cases are being identified and tests are being added. These additional tests will help us verify changes at various stages of deployment.
  • We are improving the deployment process for the affected service to increase confidence in our deployments by taking steps such as:

    • Progressive rollouts to production.
    • Increased level of scrutiny on changes to be deployed to sensitive services.
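
As a rough illustration of a progressive rollout, the sketch below deploys to one region at a time and halts at the first region that fails a health check, so later regions never receive the change. The function name, the `deploy` and `is_healthy` callbacks, and the region names are all hypothetical; this is a sketch under those assumptions, not a description of Atlassian's deployment tooling.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def progressive_rollout(
    regions: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Deploy region by region; stop at the first unhealthy region.

    Returns (regions completed successfully, failed region or None).
    """
    completed: List[str] = []
    for region in regions:
        deploy(region)
        if not is_healthy(region):
            # Halt here so the remaining regions are never touched.
            return completed, region
        completed.append(region)
    return completed, None


# Usage with stand-in callbacks: the second region fails its health check.
deployed: List[str] = []
done, failed = progressive_rollout(
    ["region-a", "region-b", "region-c"],
    deploy=deployed.append,
    is_healthy=lambda region: region != "region-b",
)
```

A gate like this only limits impact if the health check can observe the failure mode. A change that is activated later by a scheduled job would need a check that runs after that job has executed.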

Restoration Time

  • We are improving our end-to-end processes for recovering from such incidents and reducing the outage/degradation time by:

    • Introducing runbooks for identifying impact quickly and restoring the data to a good state.
    • Investigating the architecture between our Access and Identity systems to identify quick recovery opportunities.
  • We will be conducting a review of our architecture to identify any opportunities for faster recovery under such circumstances.

We have identified multiple improvement actions across the affected products to improve resilience to failures. At the time of writing, we are in the process of implementing some of these.

We apologize to customers who were impacted during this incident; we are taking immediate steps to improve the reliability of the domain verification and accounts claim services.

Thanks,

Atlassian Customer Support

Posted Feb 25, 2021 - 23:39 UTC

Resolved
This incident has been resolved.
Posted Feb 11, 2021 - 15:26 UTC
Update
This issue has been resolved and verified against the affected Orgs.
Posted Feb 11, 2021 - 04:29 UTC
Update
We have taken all the steps required to resolve the incident. We are completing the verification of the resolution on all affected Orgs to mark the incident as resolved.
Posted Feb 11, 2021 - 04:11 UTC
Update
We continue to work on resolving the issue. Currently, we have restored some of the organizations and are in the process of restoring the remaining Organizations to a good state. The work underway includes some Organizations whose SAML configuration may not be working for SSO. The team is working on corrective actions and we expect to fully recover shortly.

Additional updates will be posted when available.
Posted Feb 10, 2021 - 23:37 UTC
Identified
We continue to work on resolving the unclaimed domain issues for organizations that should otherwise have claimed domains. This includes domain claims that are marked as "superseded". The team has identified the root cause and has performed corrective actions on a handful of organizations' claimed domains, reverting them to the last valid state. The same team is working on corrective actions for the other impacted organizations, and we expect full recovery shortly.

Additional updates will be posted when available.
Posted Feb 10, 2021 - 21:12 UTC
Investigating
We are currently investigating an issue where a subset of domains (and thus organisations) that were impacted by the earlier incident are still facing issues: their prior domain claims are not being shown and, as a result, users may be locked out of associated products. Atlassian has identified that this outage is caused by the earlier incident and is not a security-related issue; there is no data loss, and none of the associated accounts have been compromised.

We are actively working to resolve this outage. We will post more information here in the next 2 hours.
Posted Feb 10, 2021 - 19:33 UTC
This incident affected: Domain Claims.