Search Engine Introduction: Information retrieval is a technique for searching for information about a subject across an enormous number of resources relevant to the user’s information need. Information retrieval can be precisely defined as: “Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)”.

The World Wide Web (WWW) has grown to a large extent, and the sheer volume of available information makes it difficult to locate useful information. Retrieving the relevant information from the WWW is an unprecedentedly difficult task. With such a large collection of information, search engines are emerging as an important tool for finding the relevant information. Information seekers submit queries in the form of keywords to a search engine and receive the required information as a result. Thus, a search engine is an important information retrieval tool that returns a set of web pages matching the query keywords, ranked according to their relevance.

A search engine is a tool used to retrieve the information stored on the WWW. Typically, a search engine has the following main components:

1.2.1 Crawling: This is the first stage of a search engine, in which documents are downloaded from the web based on the URLs received from the URL Frontier Queue. The fetched web pages are sent for parsing so that further links can be extracted. The extracted links pass through a series of duplicate-content tests and duplicate-URL elimination, and are then sent to the URL Frontier Queue so that the pages behind them can be fetched.

1.2.2 Indexing: The crawled web pages are then indexed by the Indexer Module. The major steps in index construction are tokenization and linguistic pre-processing such as hyphenation, stop-word removal, stemming, lemmatization, and normalization. The resulting terms are sorted and maintained as posting lists, which record the frequency of each term and the documents in which it occurs. Different types of index are constructed depending on the type of content: Text Index, Structure Index, and Utility Index.

1.2.3 Searching: Query terms entered by the user are compared with the index, producing the results. When a user query is entered, its terms are matched against the terms in the index structure, and the documents containing the matching terms are returned to the user as the result.

1.2.4 Ranking: The web pages returned after matching with a query are ranked based on various factors. The most widely used ranking algorithms are PageRank and the Hyperlink-Induced Topic Search (HITS) algorithm. Search engines such as Google and Yahoo match the keywords in the query with the web pages that contain those keywords, which produces a result set containing both relevant and irrelevant pages.

1.3 Limitations of the traditional search engines: Retrieving the relevant information from the information available is an important research issue in search engines. Major search engines such as Google and Yahoo work on keyword-based matching. It is left to the user to extract the relevant information from a large set of results, and finding it among so many web pages is a very tedious task. Search engines based on keyword matching have the following problems. The main issue with the returned results is that they have high recall but low precision: the engine retrieves most of the relevant pages in its repository, but relevant pages make up only a small share of what is returned. Even if the main relevant pages are retrieved, they are of little use if large numbers of mildly relevant or irrelevant documents are retrieved along with them. It also often happens that users do not get any relevant answer to their request, or that important and relevant pages are not retrieved. The machine is unable to understand the provided information because there is no universal format. The information is held in HTML-based, free-format web pages, which are very suitable for direct human use but not appropriate for automated information exchange, retrieval, and processing by software agents (machines).

For a given query, the results are a large number of documents or web pages, and the user has to manually aggregate the partial information in them to get the complete information. Current web content is mostly represented in HTML, which is primarily a presentation language and therefore does not help machine interpretability. Hence, search engines return a lot of results, which have to be aggregated manually.
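
The indexing and searching stages and the precision/recall limitation described in this section can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any real search engine is implemented: the stop-word list, the sample documents, the relevance judgment, and all function names are hypothetical, pre-processing is reduced to lowercasing and stop-word removal (no stemming or lemmatization), and ranking is a simple sum of term frequencies rather than PageRank or HITS.

```python
from collections import Counter, defaultdict

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"a", "an", "the", "is", "of", "for", "and", "to", "in"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def build_index(docs):
    """Map each term to a posting list of the form {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(tokenize(text)).items():
            index[term][doc_id] = freq
    return index


def search(index, query):
    """Return ids of documents containing at least one query term,
    ordered by total matched term frequency (ties broken by doc id)."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    return [d for d, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical mini-collection: the user wants the animal, not the car.
DOCS = {
    1: "Jaguar is a big cat found in the Americas",
    2: "The Jaguar car company builds luxury vehicles",
    3: "Jaguar speed and car engine reviews",
    4: "Lions and tigers are big cats",
}

index = build_index(DOCS)
results = search(index, "jaguar cat")
precision, recall = precision_recall(results, relevant={1})
print(results, round(precision, 2), recall)
```

With this sample data, the keyword match returns documents 1, 2, and 3, although only document 1 is about the animal. Recall is 1.0 because the single relevant page was retrieved, while precision is about 0.33 because two of the three results are irrelevant, which is the high-recall, low-precision behaviour described above.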