Failure Mode Effect Analysis (step 1)
Microsoft Entra ID
Failure Modes:
- Microsoft Entra ID isn’t available or can’t be reached due to a network problem. Redirection to the authentication endpoint fails.
System Impact: User can’t login.
Detection: Authentication middleware catches error
Higher Level Impact: System unusable for user
Mitigation: For now we’ll just trust Microsoft, but in the future we’ll add login by email, as backup, as well for users without Microsoft or Google accounts. - User can’t login.
System Impact: User can’t login.
Detection: Microsoft Entra ID handles this.
Higher Level Impact: System unusable for user
Mitigation: In the future, the backup login by email will be an option
Google Authentication
Failure Modes:
- Google Auth isn’t available or can’t be reached due to a network problem. Redirection to the authentication endpoint fails.
System Impact: User can’t login.
Detection: Authentication middleware catches error
Higher Level Impact: System unusable for user
Mitigation: For now we’ll just trust Google, but in the future we’ll add login by email, as backup, as well for users without Microsoft or Google accounts. - User can’t login.
System Impact: User can’t login.
Detection: Microsoft Entra ID handles this.
Higher Level Impact: System unusable for user
Mitigation: In the future, the backup login by email will be an option
Crawl Container Instance(s)
Failure Modes:
- The Container Instance exits unexpected
System Impact: The crawl session isn’t finished.
Detection: Azure Retry mechanism, and if all retries fails our scheduled Container Management Function detects Container with exit code not equal to 0.
Higher Level Impact: Very inconvenient since the crawl may take hours
Mitigation: Automatic retry on error, after 3 unsuccessful attempts, alert support/dev team, and display an error in the UI - The Container Instance application never finishes
System Impact: The crawl session isn’t finished.
Detection: Scheduled Container Management Function detects running Container Session with no results logged for a set time
Higher Level Impact: Very inconvenient since the crawl may take hours
Mitigation: Automatic retry on error, after 3 unsuccessful attempts, alert support/dev team, and display an error in the UI
Report Generator Container Instance(s)
Failure Modes:
- The Api (app inside container) is unresponsive
System Impact: Reports cannot be generated.
Detection: Client side error handling
Higher Level Impact:
Mitigation: Display an error in the UI, the user can try again. Log this, and figure out a way to restart the container instance - The Container Instance exits unexpected
System Impact: Reports cannot be generated.
Detection: Azure Retry mechanism.
Higher Level Impact:
Mitigation: Automatic retry on error, after 3 unsuccessful attempts, alert support/dev team and display an error in the UI, the user can try again.