The importance of data centers is constantly increasing, with new large-scale facilities being announced almost every month. Operators and owners often direct their attention toward the design of data centers, but it is equally critical to focus on effective operations, which contribute to high availability levels.
On average, downtime costs more than $5,000 per minute, and human error is frequently named as the biggest cause of data center failure. That makes it even more important for operators to take note of the common mistakes that lead to costly incidents.
1. Poor communication between operations departments and design teams
Aligning teams with goals and plans is paramount for any business, and the data center industry is no exception. From an operations point of view, a poorly designed data center can undermine the ability to maintain and manage the facility efficiently and effectively. This could seriously affect operational cost, safety and security, cause additional unplanned and possibly costly downtime, or at a minimum add incremental expenses for correcting issues. It could also damage a company’s investment, ROI and reputation.
Data center operations teams are under continuous and increasing pressure to meet growing demands for flexibility, speed and capacity, as well as to enable infrastructure for cloud computing, mobile technology, and virtualization. To achieve this, operations teams must be involved in the design phase so that the data center is optimised for the requirements of advancing technologies and clients’ business needs.
Involving operations teams in the design process can ensure that the significant total cost of ownership from a maintenance and management perspective is taken into account, whilst at the same time optimizing resources and increasing safety and security.
2. Insufficient training and skill development for staff
Having a well-trained and skilled operations staff can help reduce overhead costs, turnover, and downtime incidents.
However, staffing and skills shortages are generally agreed to be a concern in the data center industry, with 22% of respondents to AFCOM’s State of the Data Center survey in 2019 indicating that they have difficulty filling roles for facility technicians, engineers and operators. These concerns continue to grow as the required skills advance along with rapidly evolving technology, society and business.
With human error being the most common cause of data center downtime, it is incredibly important for businesses to focus on providing effective training and skill development for operations staff. With such training, operations teams understand how to safely manage and maintain the data center and know how to react when there is an incident. More than that, they can identify weaknesses, prevent errors and mistakes from happening, and further optimise the data center environment.
3. Inadequate risk mitigation and management
Operations teams face a multitude of risks within the data center, including loss of power and cooling, natural disasters and fires, and cybersecurity threats.
The organisation should have appropriate risk management policies and procedures, which are updated on a regular, planned basis to ensure they remain fit for purpose. If the risk management plan is poorly written or out of date, it will give the data center operators/owners a false sense of security that everything is under management control.
Staff should be well trained and should be tested on their ability to deal with emergencies through emergency drills and ongoing training programs. Where real emergencies have been dealt with, reviews should be conducted and the lessons learned should be documented and used as input to further improve the risk management plans and emergency procedures.
4. Weak policies and lack of integration between processes
Non-integrated processes are a common issue in many data centers across the world. This occurs when different departments are not aligned on their objectives and business interests due to poor communication across teams.
For example, a facility management department may have generator and UPS maintenance scheduled while an IT department has planned a database migration from one system to another. If a UPS system went down during maintenance in the middle of this migration, the result could be disastrous.
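One way to catch this kind of clash early is to keep planned work from all departments in a shared schedule and check it for overlapping time windows. The sketch below is only an illustration under stated assumptions: the `PlannedWork` record, the `find_conflicts` helper, and the dates are hypothetical and are not taken from any particular standard or tool.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class PlannedWork:
    """A hypothetical record for one planned activity in a shared schedule."""
    department: str
    description: str
    start: datetime
    end: datetime


def overlaps(a: PlannedWork, b: PlannedWork) -> bool:
    # Two time windows overlap when each one starts before the other ends.
    return a.start < b.end and b.start < a.end


def find_conflicts(schedule):
    # Flag overlapping work only when it belongs to different departments,
    # on the assumption that a single department coordinates its own jobs.
    return [
        (a, b)
        for a, b in combinations(schedule, 2)
        if a.department != b.department and overlaps(a, b)
    ]


# Illustrative data only: the dates and activities are made up.
schedule = [
    PlannedWork("Facilities", "UPS maintenance",
                datetime(2024, 3, 9, 22, 0), datetime(2024, 3, 10, 2, 0)),
    PlannedWork("IT", "Database migration",
                datetime(2024, 3, 9, 23, 0), datetime(2024, 3, 10, 4, 0)),
    PlannedWork("Facilities", "Generator test",
                datetime(2024, 3, 12, 9, 0), datetime(2024, 3, 12, 11, 0)),
]

conflicts = find_conflicts(schedule)
for a, b in conflicts:
    print(f"Conflict: {a.department} '{a.description}' overlaps "
          f"{b.department} '{b.description}'")
```

A check like this does not replace change management. It simply surfaces the overlap so that the two departments can agree on which activity should move.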
Organizations and teams must align their policies, procedures and processes to maximize cohesion and harmony between departments. This could be achieved by setting up standard and emergency operating procedures, method-of-procedure libraries, and vendor management plans. These same procedures should be implemented for effective change management.
It is recommended that these processes match the criticality of the business and the maturity of the data centers.
5. Inefficient document management and change procedures
Without detailed and well-managed documentation, well-intentioned staff are at risk of making unintended mistakes. These mistakes can be compounded when there is no proper process to document lessons learned and implement the resulting changes.
Effective documentation should include the agreed operating procedures, fully detailed as-built design drawings, emergency response procedures, equipment lists, and more.
Ideally, these documents are digitally accessible or printed out. For printed documentation, however, adequate procedures must be in place to ensure that the copy in someone’s drawer always reflects the true current state.
Summary and conclusion
Improving operations management helps to minimise downtime. Success depends on two factors: ensuring staff have the right competences through training, and putting effective processes in place.
To help teams avoid these mistakes, the Data Center Operations Standard (DCOS) provides a full set of domains to improve data center operations management.
The Certified Data Center Facilities Operations Manager (CDFOM) training course is set up to fully prepare managers in their journey towards optimizing data center operations.