Analysis of Query Optimization Components in Distributed Database

Objectives: This paper brings to light different query optimization components and their optimizing functionalities which are helpful to improve the response time of query and the efficiency of distributed database. A cache based optimization is also analyzed to highlight the query optimization process. Methods: As data is the most valuable asset for any organization due to this they want to get access and use it efficiently and in a timely manner. To evaluate the efficiency of query optimization its different components e.g. search space, search strategy and cost model are evaluated with the help of examples, tables and diagrams. By comparing the different results, a cache based optimization technique is also evaluated. Findings: It is observed that in search space generated plans are equivalent in the sense they provide same results but their operation, implementation and performance is different. Different algorithms of search strategy are also examined to get the quicker and accurate results and notice that movement of search strategy is greatly depend upon join ordering and cost model. It is also observed that the cost model is helpful to select the best query execution plan but it depends upon the different parameters for example queue length, sever distance, server capacity and load. The latest cache based query optimization technique is also examined and noted that it is key to improve the response time of query as its computational cost is very low. It will be more helpful if it is placed at each site. Applications and Future Improvements: Currently cache based query optimization is applicable only for homogeneous distributed databases. In future this technique can also be implemented for heterogeneous type of databases.


Introduction
Databases allow users to efficiently store, retrieve and analyze data. Databases are very vital for business, research organizations and other fields where data has important role to play. Globally in business environment, organizations are to process more data than ever before. Data is most important asset for any organization 1 . All the crucial business decisions are made on the basis on that available data. Only a well-designed data management policy will make data more reliable to help make better decisions.
The volume of data is continuously increasing, the centralized databases are becoming bottleneck for organizations that are physically dispersed and have to access data remotely 2,3 . Data management is very easy in centralized databases because only a single database administrator can handle it 4 . Despite the fact that centralized databases have high communication cost and also most importantly these have high response time. In Principles for Distributed Databases in Telecom Environment it is suggested that there is an attractive option for such organization to switch on the concept of distributing the data over multiple sites, because it has benefits over centralized databases such as availability, reliability, reduced communication overhead, data localization, improved performance and an easier system expansion 2,5 .
The working and performance of a Distributed Database 6 heavily relies on the capability of the query optimizer 7 to implement suitable query processing plans. Implementation of a query processing plans by using query optimizer is very complicated in Distributed Database as compared to Centralized Database 8 . The only reason for the same is an enormous number of parameters which affect on working of queries in distributed database. In distributed databases relations are fragmented 9 and/or replicated to the different sites 10 and all sites want to get data from each other. Therefore, the response time of query is very high. There are different query execution plans for distributed databases 11 . The responsibility of the optimizer is to find out the indices and sequences in the operation of query performed. The role of optimizer is to indicate the alternatives plans and costs associated with them by using cost model, and select that plan which is cost effective 12 .
Joins is another important factor that affects the determination of suitable query execution plans, is to fetch the data from multiple sites 1 . In distributed databases the data comes from multiple sites, so optimization of join query in distributed databases is more complex than centralized databases. Simply a lot of effort is required on query joins in centralized databases and optimization in distributed database 13 . To solve the issue different algorithms are used to provide the results of user's query as soon as possible based on time and cost factor.
To improve the response time of the user's query a cache based query optimization model is used recently. In case, if the query of the user entertained from the cache it saves the huge computational cost and time. Of course, accessing the data placed in cache is faster than accessing the database 14 . Cache based query optimization model is also based on the main four factors i.e. server distance, server capacity, server load and current queue length 15 .

Distributed Database
"A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network" 6 . The distributed database is widely used with the extension of computer networks and database tools. Due to global business strategy, distributed database has become very trendy and they are mostly used worldwide 15 . Data stored in distributed environment on remote areas which are interconnected using different network topologies or it may be on separate computers which are located in the same geographical location 16 .
Distributed database is physically dispersed on multiple sites by fragmentation and replication of data 9,10 . Fragmentation can further be divided into two types horizontal (row wise) used for select operation and vertical (column wise) used for projection operation 17 . Each fragment can be placed on multiple locations so it can be easily used where necessary. This is the best when the same information is accessed from applications that keep running at various destinations 18 .
A distributed database is beneficial with the following advantages: • Improved Performance: Since the information is stored on different destinations, so the overhead on one machine diminishes which enhances the execution power. • Localization: Localization means the information is available nearby the site where it is required, in this manner information can be accessed in less time and data transfer time also decreases as well. • Availability and Reliability: The accessibility of the information increments because the multiple copies of the information is stored at various sites. Reliability of data is also increased in case of one site fizzles, because data can be obtained from the other nearby site where the backup of the data is available. In this way, in distributed environment, failure of one site does not cause any distractions or inconvenience. • Reduced Communication Overhead: Communication cost decreases in distributed environment because a relation is accessible at each site locally that contains the replicas of the data. • Easier System Expansion: The capacity of distributed database can be expanded effectively by adding the computers to the network.
No doubt Distributed database systems have so many advantages. However, there are some basic issues, which must be considered in distributed databases and researchers are working on these e.g. management of data, concurrency control, security issue, database administration and response time.
In this paper, major issue that is analyzed is 'response time' because distributed databases are geographically dispersed and users cannot wait for results after initiating the query. Users always assume that computer provides correct data, so their main concern is always efficiency those for relational algebra. These generated plans are equivalent, in the sense that they yield the same result, but their operation, implementation and performance is different. Concluding, they have different time and cost consumption. When we apply transformation rule on query we obtain a search space for given query. Then the cost model checks the cost of each execution plan and finds out the suitable one. Cost model can only provide accurate results if it has knowledge of distributed execution environment.
The main consideration of query optimizer is the join ordering. Query optimizer tries to utilize the join in best way due to the following reasons: • It is easy to understand the problem.
• Joins cost is most effective.
• Most frequent type of queries having joins, selection and projection operator. The equivalent join trees of the above query are given below by using the commutative and associative properties of binary operators.
of the system mean how quickly their queries are entertained. But in designers point of view the main concern is correctness than efficiency, because if the efficiency of the system is high and do not provide accurate result it will be useless for any user or organization. Therefore designer must design the database in a way that provides accurate result without too much delay and also focuses that query consume minimum resources of database. Designer can only overcome these issues if they consider the important aspects of query e.g. query processing and optimization.

Query Optimization
The term 'Query' means to search or to find. But in the context of database it is code that transfer to the database to get the information from the database. Queries are usually created using SQL (structured query language), which is similar to high-level programming language. In early stages of database, Codd created the relations. Initially query optimization was not in proper form. Later on the researchers made contributions to produce a query in proper form when networking came into use the distributed query optimizers were introduced 19,20 .
Query optimization is one of the key functions of DBMS in which different plans are examined and amongst them identify the suitable plan. Best query plan can be produced by applying different strategies, so the plan that comes after the query optimization may be accepted or rejected 21 . For selecting the best query plan from the given set is dependent upon the cost based query optimizer. In next process the code of selected query execution plan is generated andthe results of query are obtained. This code may be compiled or interpreted 22 . The different steps of Query Optimization Process are shows in Figure 1.
In distributed systems, Query Execution Plan cost is comprised of transmission cost and the local processing cost 6 . While query optimizer generates the best query plan by considering the 3 W's: • What are the number of alternative plans of query • What is the cost of every plan of query • Which plan is the cheapest of query

Search Space
The search space is the set of equivalent operator trees that can be produced by transformation rules 17 , such as All three strategies produce the accurate, same and exact results, but the cost occurs in their execution is different. The job of query optimizer is to select the best one from the given strategies having low cost. In Figure 2, operator tree (iii) cannot be considered the best plan of search space, because in this plan base table has no join and use Cartesian product which has high cost as compared to rest of the join trees. Table 1 shows that when numbers of relations in a query are increased then the cost is also increased, because every plan in the search space has its own cost 23 . The basic purpose of the query optimizer is to trace out the best joining method from the given set of relations 24 .

Heuristics for Search Space
Sometimes calculating the cost of each execution plan is prohibitive in query optimization because it might be more expensive than actual execution of query. Therefore query optimizer control the size of search space by applying some restrictions 6 .
• The first check is that it applies to control the size of relation is heuristics.
• When accessing the base relation, perform selection and projection first. • Avoid from the relation (having Cartesian product) which has no join relation. • Shapes of the tree may also vary in search space. An important restriction must be followed is the shape of tree.

Linear Tree
In this type of tree minimum one of the operand must be base relation. The volume of search space is reduced up to O(2 N ) in it. In the Figure 3 (i) given linear tree, there are four relations and at each stage at least one relation is base relation.

Bushy Tree
In this type of tree the operand may or may not be the base relation, it is possibility in this scenario that both operands are intermediate operator 25 . In Figure 3 (ii) bushy tree working is shown; this approach is useful to execute the queries in parallel way or relations which are partitioned on disjoint homes 26 .

Search Strategy
It can be described as how we "move" in the search space.
In this query optimization phase we scan or explore the search space and from which chose the most suitable Query Execution Plan by using cost model. Deterministic, Randomized and Genetic are three most common types of algorithms for join-ordering optimization 27 . There are basically three different algorithms to solve the query optimization problem.

Deterministic
It is work on building plan. It takes start from the base relation and keeps adding more relation at each step unless it can get the complete plan. The pruned plans are ignored in the initial stage and the low cost plans are kept to construct the complete plan. Table 2 show the comparison of Deterministic strategies.

• Dynamic Programming
In this approach all bad plans are ignored at the very initial stage of query optimization 28 . It is also known as pruning strategy, because the bad plans are removed as early as possible and consider theplans which provide the same result with low cost 7 .  • Greedy Strategy It makes plans by using depth-first searching technique 7 . It generates worse plans, but these plans are faster than dynamic programming 23 .

Randomized
For complex and complicated queries Randomized strategy is used 29 . Its main focus is to search the optimal solution. It does not usually generate an optimal access plans. But is save the cost of optimization in the context of storage and time used 30. Table 3 shows the comparison of Randomized Strategies with respect to their movement strategy and other factors.

Genetic
Genetic strategy 30 is also not usually used to generate an optimal access plan. However in it is observed that genetic and randomized at the end provide similar results on the given parameters, but doing some experiment it is observed that in some cases genetic strategy produced better result than random 6 .

Distributed Cost Model
This cost model includes the cost function to forecast the cost incurred in statistics, formulas, operators and the size of intermediate results. It can be measured in the context of time taken by the query during execution. Storage is also another factor that costs of the query execution plan can be calculated, but it is not recently considered a main factor due to low prices of storage devices. So in distributed cost model the cost functions can be described with respect to the total time, or the response time and some important parameters 25 . Time can further be calculated or translated into other units e.g. dollars.

Total Time
Total Time includes and calculates all the time taken during processes. It includes processing time of instruction, read/write operation of an instruction, size of message and transfer time of message from one site to another. In cost model these resources are utilized in a way that these take minimum possible time. The optimal use of resources helps to minimize the cost of every component and improve the throughput of the system. The TOTAL TIME (TT) is calculated on the basis of following parameters 31

Total Time = T CPU * # of instructions + T I/O * # of I/Os+ T MSG * No. of Messages+ T TR * # of bytes
The topology of the network enormously impacts the proportion between communication cost and cost of T CPU + T I/O . In WAN, for example, the Internet, the communication time is by and large the predominant variable. In LAN, however, there is more of balance in components. Earlier studies refer to proportions of communication time to I/O time for one page to be on the request of 20:1 for WAN while it is 1:1.6 for a run of early typical era Ethernet (10Mbps) shows in Table 4. Thus most of the researchers of early Distributed DBMSs (design for WAN) did not consider the cost of local processing. Their focus is to minimize the communication cost. On the other hand Distributed DBMSs (design for LAN) consider every one of the three cost factors. Newly introduced faster networks both WAN & LAN have improved the ratios. However, communication cost is still key factors in terms of WAN.

Response Time
The time taken by the query after its initiation to completion is known as response time. In case when the response time is the main consideration point then parallel local processing and communication are also considered. This time can be calculated by using given formula.

RT = T CPU * # of instructions+ T I/O *# of I/Os + T MSG * # of messages + T TR *# of bytes
Thus any processing and communication that are performed on the parallel basis is ignored here.

Difference between Total Time and Response Time by Example
With the help of Figure 4 it is shown that two different relations stored on two different sites e.g. Site 1 and Site 2 and then result of the given query is going to be calculated at Site 3. (Only communication cost is considered here). Some assumptions for given example: Time Unit = T MSG and T TR Data Units = x (transferring data from Site 1 to Site 3), y (transferring data from Site 2 to Site 3) Total time for the given query can be calculated as: TT = 2 * T MSG + T TR * (X + Y) Response time for the same query can be calculated as: RT = max {T MSG + T TR * X, T MSG + T TR * Y} RT is reduced by increasing the number of parallel executions. It does not guarantee that due to this TT is also reduced. The minimization of TT can be obtained by the utilization of the resources properly; due to this the system throughput also increases. So it is observed that there is some tradeoff between TT and RT 6 . Beside 32 in some cases it is focused on improve the total time and in some cases, it is focused on to improve the response time.

Parameters
For query process, cost optimizer is responsible for the suitable selection of node because the different sites are used for the replication of data 10 . It is the responsibility of the cost base optimizer to choose the best processing node as per the following parameters.
• Current Queue Length: It is the amount of requests that are executing or on the way to get processed on the server.

Cache Based Query Optimization
In distributed database data access is faster, but search complexity is rising due to the introduction of new applications for distributed databases. So there is a need of better algorithm that speeds up the queries results as compare to traditional databases. To solve the hurdle of query optimization in distributed database system, a cache based query optimization has been introduced. In this model cache is placed between database server and local optimizer at each site. The data stored in the cache might be the already calculated values or redundant values that stored on elsewhere. If the data is found from the cache then it is recorded as cache hit that is faster than accessing data without cache. If the data is not found from the cache then it is called cache miss. Performance of the

Algorithm's Working
This method totally stands on the reality that the cache must be deployed nigh the server. A user initiates the query by using the application interface. This query is broken down with the help of query optimizer in arranging the decision regarding the best node where these sub-inquires are analyzed for further handling 35,36 . It is the responsibility of the query divisor to distribute the sub-inquiry to the suitable node. At each node the local optimizer examines that the output of the query gets from cache or not. If the output of the query is same then it is repeated query else it is new one. If information is retrieved through database the cache is upgraded as needed and outputs are shown on application or user interface. But this technique is suitable for homogeneous databases.

Comparison of Cache Models
Comparison of two papers is shown with the help of table and diagram. In which Wen-Syan's model work on a network based cache 34 whereas Mantu's Model proposes the cache at each site 15 . Usually the result of query optimization by using cache is based upon two factors e.g. number of queries and their types. But here comparison is made on just one factor i.e. number of queries. The only reason to ignore query type is that cache has very little significance when queries receive from the users are continuously repeated. The relationship of number of queries and cache hit ratio percentage is shown in Figure 5 on the base of Table  5. Cache hit ratio is drastically increased when the number of queries are 20 and the curves are nearly balanced when the number of queries reach to 100. So it is concluded that Manu model works sharply when the number of queries are in the range of 1 to 60. But as the numbers of queries increase there is very less difference in their models.

Conclusion
In this paper the authors have worked to explore one of the concerns in distributed database, Query optimization technique, one of the major issues analyzed is response time. It is explored that how the response time is improved by using query optimization. In this analysis the working of distributed databases in terms of basic components of query optimization has been explored e.g. search space, search strategy and cost model and their effects on response time. To improve the optimization efficiency of the query these components areanalyzed with the help of tables, diagrams and examples. It is also discussed that the cost factors which play a key role in query optimization. At the end it is also examined that the improvement of response time by using cache based query optimization is suitable option because it saves huge computational cost and time, especially introducing cache at each site instead of implementation of one network based cache. It is observed that currently cache is implemented in homogeneous distributed databases. Concerning future work, the proposed model can be executed for heterogeneous distributed databases. Still there is a lot of room for researchers to do in this regard our paper should sever a guide and an initiative to work forward for others also.