“Best Practices”: An analysis of engineered solutions used in UL colos to promote uptime
Here are some statistics that are the result of UnitedLayer’s “best practices” design:
- 99.999% electrical uptime over the last 36 months.
- 25–40% additional colo cooling capacity through added CRACs and efficiency improvements.
- Upgraded monitoring of colo subsystems, with live reporting of conditions via an emergency call system.
- High-reliability network operations and expertise, with very strong peering partnerships.
- 24/7 availability of equipment service personnel, with all expertise in-house.
UnitedLayer runs an efficient and cost-effective operation, using the “best practices” model of engineering. The facilities are outfitted primarily with Liebert CRACs and UPSes, and the network runs on workhorse Cisco routers; both vendors are long-standing industry leaders. This ensures ready availability of parts and trained service personnel, lowering repair and servicing costs.
The goal, as always in lowering operational costs, is to provide customers with a cost-effective product. Efficient operation requires careful diligence by the colo operator. Good tools and good craft ensure that we get top performance from people and systems while making good business sense.
Regarding the questions above about performance under maximum load, our answers are based on actualities: current field performance and characteristics under moderate and maximum loads. A colo site's performance changes over time after commissioning; in long-term operation, maintenance cycles become more frequent until systems undergo renewal. By design, our datacenters can also be upgraded, adding new systems to handle increased loads or to moderate existing ones.
As operators, we have to set the standards from which any of these values are derived. There are different parameters for HVAC and UPS equipment, naturally. Each system has different running characteristics: HVAC performance varies with outside conditions such as weather, while UPS loads are normally steady.
Reliability and Fault Analysis:
Our datacenters have operated at greater than 99.999% facilities uptime for the past 36 months. That number would be even higher but for one recent event on one of the eleven UPSes we utilize; the other ten have operated continuously at 100% uptime. All systems have performed well even during infrequent, short-duration power outages. Our uptime is outstanding.
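As a quick sanity check on what five-nines means in practice, the downtime budget implied by the figures above can be computed directly. This is a sketch; the 730-hour average month is an assumption, while the 36-month window and 99.999% figure come from the text.

```python
# Downtime budget implied by 99.999% uptime over a 36-month window.
HOURS_PER_MONTH = 730  # average month length in hours (assumption)

months = 36
uptime = 0.99999

total_hours = months * HOURS_PER_MONTH
allowed_downtime_min = total_hours * (1 - uptime) * 60

print(f"Total hours in window: {total_hours}")
print(f"Allowed downtime at 99.999%: {allowed_downtime_min:.1f} minutes")
```

The budget works out to roughly a quarter of an hour of total downtime across three years, which is why a single UPS event can dominate the statistic.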
The UL network uptime has also been quite good, excepting one major event in the past year caused by vandals cutting four major underground AT&T cables in San Jose. Many thousands of customers, locally and across the world, were affected. Through quick response actions, we returned to online status many hours before most affected systems and providers.
All of our customer-affecting outages have been due to external causes, not under our control. We have used these opportunities to strengthen our delivery and to make incremental improvements in network uptime by turning up additional redundant circuitry.
It is a constant concern at UL that, even with plans in effect to ensure high-uptime service delivery, problems can still occur. Experience teaches this. Our approach is, as always, to learn from these outages by analyzing the weaknesses they expose and completing the necessary upgrades as quickly as possible.
Our backup systems have been fully tested under the highest loads. All generators undergo at least monthly start-ups on testing schedules. 80% of our colos have backup UPSes supplementing the on-line UPSes, in 2+1N configurations. Loads in all our colos are not allowed to exceed 80% of capacity; this is a primary “best practice” precaution.
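The 80%-of-capacity rule above amounts to a simple capacity check. A minimal sketch follows; the 500 kW UPS and 420 kW load are hypothetical illustration values, not actual UL figures.

```python
def max_allowed_load_kw(ups_capacity_kw: float, cap_fraction: float = 0.80) -> float:
    """Maximum IT load permitted under the 80%-of-capacity best practice."""
    return ups_capacity_kw * cap_fraction

def load_ok(current_load_kw: float, ups_capacity_kw: float) -> bool:
    """True if the current load stays within the best-practice cap."""
    return current_load_kw <= max_allowed_load_kw(ups_capacity_kw)

# Hypothetical example: a 500 kW UPS may carry at most 400 kW of load.
print(max_allowed_load_kw(500))  # 400.0
print(load_ok(420, 500))         # False: 420 kW exceeds the 80% cap
```

The margin left by the cap is what absorbs inrush, growth, and the loss of a redundant unit without tripping the system.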
Some newer designs have not proven to be reliable. Because of design complexity, they can be subject to more points of failure. One highly rated colo hotel had a bleeding-edge design using flywheel UPSes and extensive control features, yet still suffered complete failures as a result of new, unforeseen points of failure and human error.
We have inherited and adopted the same strategies that the world’s largest colo company uses, namely: proven equipment run to safety-limited operational standards. Proven, sound, experiential engineering.
Follow the leader:
From a recent DRT white paper on data center operations:
“…commissioning is the most important component of the … process. In this phase, every aspect of the data center is tested while it is running at its maximum capacity, … (so that) resiliency is validated, including scenarios such as:
- How do the back-up systems perform in the event of a dropped utility line?
- How do the redundant units respond when a CRAC fails?
- Does the facility’s power architecture switch over upon the failure of a UPS?”
The straight answer is that our datacenters operate at 99.999% operational uptime, with all systems having performed very well. All of our outages have been due to external causes, not under our control. None were caused by internal faults, and very few had any customer impact. Backup systems have always worked. The systems have operated under a range of loads, from light to overload conditions.
Our systems are not complex, but they are durable. We have even performed the equivalent of open-heart surgery, very successfully, upgrading cooling and power systems while causing no downtime. During such operations, our backup systems have been fully tested.
More interesting info:
The cooling systems:
Our HVAC systems are sized according to certain standards. We run mid-level powered colos, averaging 3.74 kW per rack or cabinet (Cab). We design our colos to be flexible, allowing for higher-kW Cabs by giving those units more space for cooling. This design yields a couple of convenient numbers:
1 Ton of cooling can handle 1 Cab, which occupies about 16 sq ft of colo floor space, including 8 sq ft of access. Therefore, 100 Cabs roughly require 100 Tons of cooling.
Another metric of our colo capacity yields:
2–4 kW (2,000–4,000 W) / 16 sq ft = 125–250 W of IT power per sq ft.
Our basic kW/cooling design modules are 2,500–5,000 sq ft of usable IT floor space and 240–360 kW.
One more colo metric:
PUE (Power Usage Effectiveness): the ratio of total facility power, that is, the IT load plus the additional energy required to operate the colo, to the IT load itself.
PUE is another metric used by colo operators to gauge the efficiency of a colo cooling design. We have lowered our effective kW use by design in SF7 by investigating and implementing hot/cold row separation, which reduced the number of CRACs needed to cool the colo from six 20-ton units to five. This is roughly a 17% reduction in cooling units, and it leaves extra cooling available for emergency use. It also lowers the PUE.
Our method of calculating PUE is:
PUE = (IT Load (kW) + Cooling & Accessory Loads (kW)) / IT Load (kW) = 1.xx
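The formula above translates directly into code. The 300 kW IT load and 90 kW of cooling and accessory overhead in this sketch are hypothetical values chosen for illustration, not measured UL figures.

```python
def pue(it_load_kw: float, cooling_and_accessory_kw: float) -> float:
    """PUE = (IT load + cooling & accessory loads) / IT load."""
    return (it_load_kw + cooling_and_accessory_kw) / it_load_kw

# Hypothetical example: 300 kW of IT load carrying 90 kW of overhead.
print(round(pue(300, 90), 2))  # 1.3
```

A PUE of exactly 1.0 would mean zero overhead; every reduction in CRAC count, as in the SF7 example above, pushes the ratio closer to it.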
The UPS systems:
An Uninterruptible Power Supply is the primary uptime device that allows colos to advertise for business. Telephone systems operated on battery power in the past, and UPSes are an outgrowth of that system. Future technologies may provide inexhaustible power for colos, such as hydrogen-based power cells, and these could replace this older, defense-industry-developed backup system.
Most of the time, UPSes occupy very little of an operator's daily thought. They are steady workhorses, yet behind their plain steel doors are quite complex systems of mother- and daughter-boards, executing control and logic to monitor and convert AC voltage to DC and back, which allows a high-voltage battery system to always be the primary backup for IT power. Voltages and frequencies at inputs and outputs must be sampled and matched to guarantee as much as 8–16 minutes of backup power. This is enough time for a generator to start and replace lost utility power. We use the “best practices” UPS configuration of 2+1N in 80% of our colos.
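The battery bridge described above, 8–16 minutes of runtime covering a generator start, reduces to a margin check. This is a sketch; the one-minute generator start and the 2x safety factor are hypothetical assumptions, not UL specifications.

```python
def bridge_ok(battery_runtime_min: float, generator_start_min: float,
              safety_factor: float = 2.0) -> bool:
    """True if battery runtime covers generator start-up with margin.

    safety_factor (assumption): require runtime to exceed the start
    time by this multiple to absorb a failed first start attempt.
    """
    return battery_runtime_min >= generator_start_min * safety_factor

# Hypothetical: 8 minutes of battery vs. a generator that starts in ~1 minute.
print(bridge_ok(8, 1))  # True: ample margin
print(bridge_ok(8, 5))  # False: too little margin for a slow start
```

The point of the margin is the failure case: if the first start attempt fails, the batteries must still hold the load through a retry.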
Care for a UPS is normally minimal; however, like all mechanical devices, things break. The interior of a UPS carries high voltages and can deliver fatal shocks, and the trained technicians who service them are highly paid and increasingly rare. This workhorse requires timely servicing to prevent blowouts, which occur with statistical regularity. Please check back here for more colo-specific information that we will release shortly in white papers on topical and timely colo subjects.