Web Services Failures and Recovery Strategies: A Review

Objectives: Because of their heterogeneity, cross-boundary integration, and deployment over the Internet, Web services are highly vulnerable to a wide variety of failures. This study provides an overview of the different types of failures and of the recovery strategies for Web services. Method/findings: We reviewed several research studies to provide a concise, consolidated summary of the different types of failures and the possible recovery solutions for Web services. The study shows that a clear understanding of the different failure types and possible recovery solutions helps in developing services that are highly reliable and dependable. Applications: Highly reliable and dependable Web services are a key requirement of sensitive and mission-critical applications such as aircraft navigation systems, nuclear reactor systems, and robotics.


Introduction
With the growing use of the Internet and mobile technologies, Web services have gained much popularity over the last few years. 1,2 In one way or another, we use Web services in our daily lives, for example, to pay bills, book a taxi, or reserve a table in a restaurant. 3-5 A task performed by a Web service can be as simple as converting one currency to another, or it can be a complex task that requires multiple services to coordinate and collaborate. 6,7 Performing a complex task jointly requires services to interact over the unreliable Internet, beyond their organizational boundaries, and under heterogeneous environments. 8,9 This makes Web services vulnerable to a wide variety of failures, whose consequences may range from simple inconvenience to significant financial loss. A service may fail for many reasons, such as service unavailability or downtime, logic errors, and inconsistent or incompatible inputs. 10 However, because of their use in important and critical applications, services are required to be highly reliable. 11 Efforts to produce reliable Web services are under way, 12-15 but this is a very challenging task because of the unreliability of the Internet, heterogeneous and cross-boundary interaction, incompatible business logic, and so on. This study presents a survey of the different types of failures that affect the normal execution of Web services, together with the recovery strategies used to safeguard against such failures. A thorough understanding of the different types of failures and the corresponding recovery strategies should help in designing Web services that are resilient to failures.
The rest of the study is structured as follows: Section 2 presents an overview of the different types of failures that often occur during the execution of Web services. Section 3 gives an overview of the strategies used to recover from service failures. Section 4 presents the discussion, and Section 5 concludes the work.
Keywords: Web Services, Failures, Recovery Strategies, Fault-tolerance

Web Services Failures
Web services are software applications designed to perform specific tasks over the Internet. In addition to all the features of traditional software, Web services have some additional features such as autonomy, heterogeneity, and interoperability. Like traditional software, Web services suffer from errors and failures from development through execution. 16 Moreover, because of their heterogeneous and cross-boundary interaction, and their deployment over the Internet, which is an unreliable medium, Web services are more vulnerable to failures than their traditional counterparts. The failures that affect the execution of Web services fall into three general categories: development, physical, and interaction faults. 10,17 Any of these failure types can be transient or permanent, and can lead to service degradation, unavailability, or complete shutdown. The fault-types are described below:

Development Faults
Development faults are introduced during the development phase of Web services, but are exposed only when the services are actually executed. These faults are introduced by the environment, human developers, development tools, and production facilities. 10 Development faults are classified into parameter incompatibility and interface change faults, as defined below:
• Parameter incompatibility faults arise when a service receives input values that are incompatible with what it expects; for example, a service expects an integer value but is given a string constant. In that case, the service ends up returning an error or an invalid result message.
• Interface change failures (or inconsistency failures) occur when the interface or ontology of a service is changed (or updated) while service invocation requests are still forwarded to the old interface. This happens when users are unaware of the corresponding updates. In some cases, the interface of a service is changed, but the process (logic) is not updated accordingly. For example, in a hotel reservation service, a user requests a booking of four rooms, but only two rooms are available at that time.
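The parameter incompatibility case described above can be illustrated with a minimal sketch in Python. The names ParameterIncompatibilityFault, check_parameters, and BOOK_ROOMS_INPUTS are hypothetical and are introduced only for illustration; they are not taken from the reviewed literature or from any Web service framework.

```python
class ParameterIncompatibilityFault(Exception):
    """Raised when a supplied input does not match the expected type."""


def check_parameters(expected, supplied):
    """Validate supplied inputs against a mapping of name -> expected type."""
    for name, expected_type in expected.items():
        if name not in supplied:
            raise ParameterIncompatibilityFault(f"missing parameter: {name!r}")
        value = supplied[name]
        # bool is a subclass of int in Python, so reject it where an int is expected.
        wrong_type = not isinstance(value, expected_type) or (
            expected_type is int and isinstance(value, bool)
        )
        if wrong_type:
            raise ParameterIncompatibilityFault(
                f"{name!r} expects {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )


# Hypothetical input signature of a room-booking operation (an assumption).
BOOK_ROOMS_INPUTS = {"rooms": int}

check_parameters(BOOK_ROOMS_INPUTS, {"rooms": 4})  # compatible input, no fault

try:
    check_parameters(BOOK_ROOMS_INPUTS, {"rooms": "four"})  # string instead of integer
except ParameterIncompatibilityFault as fault:
    print("fault detected:", fault)
```

Checking inputs at the service boundary in this way turns a silent invalid result into an explicit fault that a recovery strategy can act on.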

Physical Faults
Physical faults (also known as system faults) occur when the server on which the requested service is deployed fails, or when a network connection fails. Physical faults result in service unavailability. Services become unavailable when the server is shut down or taken offline for maintenance and updates, or when the power supply to the server machines is interrupted by power breakdowns or natural faults.

Interaction Faults
Interaction faults are operational or external faults that appear during the execution, or use, phase of services. These fault-types are broadly classified into content and timing faults. Content faults, also referred to as corrupt service faults, are further classified into Service Level Agreement (SLA), Quality of Service (QoS), and incorrect service invocation faults, whereas timing faults are classified into semantic and timeout faults. All these fault-types are described below:
• SLA faults are violations of the non-functional properties of a service; that is, the service completes successfully but does not conform to the predefined service level agreement. For example, the expected completion time of an operation is 12 seconds, but the service takes 20 seconds to complete the task.
• QoS faults, which also include SLA faults, occur when the quality of a service degrades, for example through slow speed or delayed responses.
• Incorrect service invocation faults occur when a service is called with an incorrect name instead of its actual name.
• Semantic faults occur because of incompatibility between composed services that are requested to perform a joint task; for example, in a joint booking of a hotel and a taxi, the operation does not complete successfully because the two services use different time formats.
• Timeout faults arise when a component service fails to complete its execution within the allocated time frame. This happens when the service is overloaded with many requests at the same time. For example, too many requests for a cheap ticket may overload the booking service; this may result in excessive delays (timeouts) at the requester's end or even in the unavailability of the service.
All of the above fault-types can be further classified from different viewpoints over the lifetime of services. These fault classes can be viewed as development, operational, internal, external, hardware, software, functional, and non-functional faults. 10 A taxonomy of all fault-types with respect to the different viewpoints is summarized in Table 1.
As can be seen in Table 1, different fault-types may belong to different fault-classes and may overlap. For example, timeout faults of the interaction faults category can be viewed as operational, external, and hardware faults. The occurrence of any or all of the fault-types can leave the service in a failure mode in which it is incapable of providing the required functionality.
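The difference between an SLA fault (the service completes, but later than agreed) and a timeout fault (the service does not complete within its allocated time frame) can be made concrete with a small sketch. The function classify_outcome and the 30-second time frame are assumptions made for illustration; only the 12-second and 20-second figures come from the SLA example above.

```python
def classify_outcome(elapsed_s, completed, sla_limit_s, timeout_s):
    """Classify one invocation by its timing behaviour.

    elapsed_s   -- observed execution time in seconds
    completed   -- True if the service returned a result
    sla_limit_s -- completion time promised in the SLA
    timeout_s   -- allocated time frame after which the caller gives up
    """
    if not completed or elapsed_s >= timeout_s:
        # The service did not finish within its allocated time frame.
        return "timeout fault"
    if elapsed_s > sla_limit_s:
        # The service finished, but violated the agreed completion time.
        return "SLA fault"
    return "ok"


# SLA example from the text: 12 s expected, 20 s observed.
# The 30 s time frame is an assumed value.
print(classify_outcome(20, True, sla_limit_s=12, timeout_s=30))
print(classify_outcome(30, False, sla_limit_s=12, timeout_s=30))
print(classify_outcome(9, True, sla_limit_s=12, timeout_s=30))
```

Separating the two limits matters for recovery: under this sketch an SLA fault still yields a usable result, whereas a timeout fault yields none.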

Recovery Strategies
Fault-tolerance refers to the ability of a system to detect and recover from failures. 18 Because of their increasing use in sensitive and mission-critical applications, Web services are required to provide the desired functionality even when failures occur. To provide reliable services, various failure-recovery strategies have been proposed in the literature (see Refs. 10,11,19,20). The most commonly used recovery strategies for Web services are described below:
• Ignore: As its name suggests, this strategy ignores faults that do not affect the primary goal of the service. For example, in a sight-seeing booking service, the failure of the (optional) getSalesInfo service may be ignored, because the important tasks, such as booking the flight and the hotel, have completed successfully.
• Skip: Under this strategy, if a service deviates from its QoS and SLA logic, its successor services are skipped, provided that the skipped services do not affect the primary goal of the service composition. For example, if the computeDistance service of the sight-seeing scenario deviates from its expected execution time, say from 5 sec to 8 sec, then the getSalesInfo service is skipped in order to meet the promised execution time of the whole process.
• Retry: This strategy re-executes the faulty service a specified number of times or until the service completes successfully. Retry is used to recover from temporary failures caused by the hardware, the software, or the network.
• RetryUntil: This strategy extends the "retry" strategy with time-based re-invocation of the faulty service. That is, each re-invocation is constrained to a particular time-stamp. For example, RetryUntil(bookFlight,5,10) re-invokes the bookFlight service for a maximum of 5 retries, with each retry occurring after 10 time-stamps.
• Wait: This strategy delays the execution of a service until a specified time instant. For example, Wait(bookFlight, 8:00) is used to invoke the bookFlight service not before 8:00. This strategy is used to handle service-unavailable faults.
• Alternate: This strategy selects another, functionally equivalent service to perform a task when the first service encounters a failure. That is, the alternate action invokes a different service instead of the same service.
All of the above recovery strategies can be used individually or in combination with others to handle different types of failures. Table 2 gives a summary of the different types of failures and their possible recovery strategies.
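The Retry, RetryUntil, and Alternate strategies described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not an implementation from the reviewed literature: ServiceFault, retry, retry_until, and alternate are hypothetical names, a service is modelled as a plain callable, the first invocation is counted separately from the retries, and the "time-stamps" of RetryUntil are interpreted as a delay in seconds.

```python
import time


class ServiceFault(Exception):
    """Hypothetical exception signalling that a service invocation failed."""


def retry(service, max_retries):
    """Retry: re-invoke a failed service up to max_retries times."""
    last_fault = None
    for _ in range(max_retries + 1):  # first attempt plus the retries
        try:
            return service()
        except ServiceFault as fault:
            last_fault = fault
    raise last_fault


def retry_until(service, max_retries, delay, sleep=time.sleep):
    """RetryUntil: like retry, but wait `delay` time units before each retry."""
    last_fault = None
    for attempt in range(max_retries + 1):
        if attempt > 0:
            sleep(delay)  # assumption: one time-stamp equals one second
        try:
            return service()
        except ServiceFault as fault:
            last_fault = fault
    raise last_fault


def alternate(primary, substitute):
    """Alternate: fall back to a functionally equivalent service."""
    try:
        return primary()
    except ServiceFault:
        return substitute()


# Usage with a hypothetical bookFlight service that fails twice, then succeeds.
attempts = {"count": 0}


def book_flight():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ServiceFault("bookFlight temporarily unavailable")
    return "flight booked"


# Mirrors RetryUntil(bookFlight,5,10); sleeping is disabled to keep the demo fast.
print(retry_until(book_flight, 5, 10, sleep=lambda seconds: None))
```

Passing the sleep function as a parameter is a design choice that keeps the sketch testable without real waiting; a deployment would use the default.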

Discussion
Although much research has been conducted on the fault-tolerance of Web services, not all faults are avoidable. 21-23 Because of the dynamic, heterogeneous, and cross-boundary integration of Web services deployed over the unreliable Internet, faults are hard to predict and resolve. 4,9 Several faults may occur at the same time during the execution of services, and recovering from them may require more than one recovery strategy. However, determining which combination of recovery strategies provides the best solution, and in which order the strategies should be applied, is a very difficult problem. The field of fault-tolerance is still maturing, and the introduction of advanced heuristic, AI, and other state-of-the-art techniques may further improve the reliability of Web services. 24,25
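To make the ordering question concrete, the following sketch applies a list of recovery strategies one after another until one of them yields a result. It only illustrates the mechanism of combining strategies in a chosen order; it does not address how the best combination or order should be selected, which remains the open problem. All names (ServiceFault, retry_strategy, alternate_strategy, recover) are hypothetical.

```python
class ServiceFault(Exception):
    """Hypothetical exception signalling that a service invocation failed."""


def retry_strategy(max_retries):
    """Build a strategy that re-invokes the same service max_retries times."""
    def apply(service):
        last_fault = ServiceFault("no retries attempted")
        for _ in range(max_retries):
            try:
                return service()
            except ServiceFault as fault:
                last_fault = fault
        raise last_fault
    return apply


def alternate_strategy(substitute):
    """Build a strategy that invokes a functionally equivalent service."""
    def apply(_failed_service):
        return substitute()
    return apply


def recover(service, strategies):
    """Invoke the service; on failure, try each strategy in the given order."""
    try:
        return service()
    except ServiceFault as fault:
        last_fault = fault
    for strategy in strategies:
        try:
            return strategy(service)
        except ServiceFault as fault:
            last_fault = fault
    raise last_fault  # every strategy failed


def book_hotel():
    raise ServiceFault("hotel service is down")  # a permanent failure


# Retry twice first, then fall back to an assumed equivalent service.
plan = [retry_strategy(2), alternate_strategy(lambda: "booked via backup hotel service")]
print(recover(book_hotel, plan))
```

Reversing the list would invoke the substitute service before retrying the original one, which shows how the order alone changes both the cost and the outcome of recovery.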

Conclusion
A Web service offers its users coarse-grained, value-added functionality over the Internet. In addition to all the features of traditional software, Web services have some additional features such as autonomy, heterogeneity, and interoperability. Like their traditional counterparts, Web services may suffer from errors and failures throughout their entire life (from development to execution). The problem of failures grows when Web services are deployed over an unreliable medium and communicate under heterogeneous environments. Because of their use in important and critical applications, Web services are required to be highly available and reliable. Given the importance of service dependability, this study presented an overview of the different failure types that affect the execution of Web services. An overview of the different recovery strategies with respect to the different failure types has also been presented. Based on the discussion, with reference to the reviewed research, it is concluded that detecting and avoiding service failures is a difficult problem, especially when many faults occur at the same time. Furthermore, recovering from complex failures may require a combination of different recovery strategies applied at the same time; however, determining the best combination and the best order in which to execute these strategies is a very difficult problem. The field of fault-tolerance is still maturing, and more sophisticated, state-of-the-art recovery techniques enriched with AI and heuristics are needed to make services more reliable.