The Client

MS Exchange environment with 11,000 mailboxes

This case study examines a recent project to analyse and update a Microsoft Exchange environment for a global company with 11,000 mailboxes. The initial task was to update a pair of edge servers. During the discovery phase, however, it became evident that the environment had suffered from a phenomenon known as System Creep: a system that has been modified and adjusted by different technical staff over time, resulting in a lack of uniformity and efficiency. The investigation revealed multiple challenges, including an inconsistently configured database availability group (DAG), inaccessible relays, inadequate documentation, and difficulties in maintaining mail flow during system issues. This case study documents the findings and provides recommendations for optimising the infrastructure.

Background

The company’s Microsoft Exchange environment supported 11,000 mailboxes across its global operations. With such a significant user base, minimising downtime was critical to ensure uninterrupted communication and productivity. The existing infrastructure had undergone modifications and additions over time, resulting in an environment plagued by System Creep. This situation necessitated a thorough analysis to identify weaknesses and formulate recommendations for improvement.

Discovery Phase

Database Availability Group (DAG)

The investigation revealed that several servers were part of a database availability group (DAG). Although a DAG provides redundancy and high availability, the configuration of this one lacked standardisation. The absence of uniform procedures and documentation made it difficult to understand and troubleshoot DAG-related issues effectively.

Inaccessible Relays

Technical staff encountered difficulties accessing some of the relays within the infrastructure. These relays played a crucial role in ensuring efficient mail flow, but their inaccessibility hampered proper management and maintenance. A lack of administrative access and knowledge of the relay configurations posed a significant risk to the stability and reliability of the system.

Inadequate Documentation

The absence of comprehensive and up-to-date documentation added to the complexities faced by the technical staff. Without readily available documentation, critical information about the system’s architecture, configurations, and procedures was either inaccessible or incomplete. This knowledge gap impeded efficient troubleshooting and hindered the ability to make informed decisions.

Maintenance Mode

Another significant challenge was the technical staff’s lack of familiarity with placing DAG members into maintenance mode. This mode is crucial for performing updates and maintenance tasks on a server without impacting mail flow. Without that expertise, system maintenance caused disruptions and downtime, resulting in inconvenience and lost productivity for the company’s users.

Analysis and Recommendations

Standardise and Document DAG Configuration

Develop and implement a standardised configuration for the database availability group (DAG). This involves defining consistent procedures, naming conventions, and documentation to ensure all DAG members are correctly configured and monitored. The documentation should include step-by-step instructions for troubleshooting common issues and performing routine maintenance tasks.
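
As an illustration of what a documented routine maintenance procedure might look like, the sketch below generates an ordered runbook for taking one DAG member out of service. It is a minimal example under stated assumptions, not the procedure used in this project: the server names `EX01` and `EX02.example.com` are hypothetical, the environment is assumed to run Exchange 2013 or later, and the Exchange Management Shell cmdlet sequence follows Microsoft's published guidance for DAG member maintenance as commonly documented. Verify it against the documentation for the Exchange version in use before relying on it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One runbook entry: what the step does and the shell command to run."""
    description: str
    command: str


def maintenance_runbook(server: str, redirect_target_fqdn: str) -> list[Step]:
    """Return the ordered steps for placing one DAG member into maintenance.

    `server` and `redirect_target_fqdn` are placeholders supplied by the caller;
    the commands are emitted as text only and are never executed here.
    """
    return [
        Step("Drain the transport queues on the server",
             f"Set-ServerComponentState {server} -Component HubTransport "
             f"-State Draining -Requester Maintenance"),
        Step("Redirect queued messages to another server",
             f"Redirect-Message -Server {server} -Target {redirect_target_fqdn}"),
        Step("Pause the cluster node",
             f"Suspend-ClusterNode {server}"),
        Step("Move active database copies off the server",
             f"Set-MailboxServer {server} "
             f"-DatabaseCopyActivationDisabledAndMoveNow $true"),
        Step("Block automatic activation of database copies",
             f"Set-MailboxServer {server} "
             f"-DatabaseCopyAutoActivationPolicy Blocked"),
        Step("Place the server into maintenance mode",
             f"Set-ServerComponentState {server} -Component ServerWideOffline "
             f"-State Inactive -Requester Maintenance"),
    ]


def render(steps: list[Step]) -> str:
    """Format the steps as a numbered, human-readable runbook."""
    return "\n".join(
        f"{number}. {step.description}\n   {step.command}"
        for number, step in enumerate(steps, start=1)
    )


if __name__ == "__main__":
    # Hypothetical server names, for illustration only.
    print(render(maintenance_runbook("EX01", "EX02.example.com")))
```

Generating the runbook from a single function keeps the step order identical for every DAG member, which is the kind of uniformity the standardised documentation is meant to enforce. A matching sketch for taking the server back out of maintenance would reverse these steps.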

Improve Relay Accessibility and Management

Ensure that all relays within the infrastructure are accessible to authorised technical staff. Review and consolidate relay configurations to minimise complexity and increase transparency. Establish a centralised management approach, providing comprehensive documentation on relay settings, access control, and routing policies. Regularly review and update relay configurations to maintain optimal performance and security.

Enhance Documentation Repository

Establish a centralised documentation repository accessible to all technical staff. Document critical information, including infrastructure architecture, configurations, network diagrams, and standard operating procedures. Encourage a culture of maintaining up-to-date documentation.

Conclusion

This global organisation has established a robust Exchange platform, supported by comprehensive documentation for its workflows. All prerequisites have been satisfied, and mail flow has been consolidated with a single third-party supplier. The environment has successfully met its goals during disaster recovery testing scenarios.