Pattern Mining for Information Retrieval from Large Databases

Abstract
The widespread use of databases and the explosive growth in their sizes have made data mining attractive for retrieving useful information. Desktop search has been used by tens of millions of people, but over the past seven years user behaviour has changed, with many users moving to web-based applications. Despite the increasing amount of information available on the internet, storing files on a personal computer remains a common habit among internet users. Our motivation is to develop a local search engine that gives users instant access to their personal information. The quality of extracted features is the key issue in text mining because of the large number of terms, phrases, and noise. Most existing text mining methods are term-based: they extract terms from a training set to describe relevant information. However, the quality of the extracted terms may be low because text documents contain a great deal of noise. For many years researchers have used phrases, which carry more semantics than single words, to improve relevance, but many experiments do not support the effective use of phrases, since phrases have a low frequency of occurrence and include many redundant and noisy entries. In this paper, we propose a novel pattern discovery approach for text mining. To evaluate the proposed approach, we adopt a feature extraction method for information retrieval (IR).
Keywords – Pattern mining, Text mining, Information retrieval, Closed pattern.
Introduction
In the past decade, a significant number of data mining techniques have been presented for retrieving information from large databases, including association rule mining, sequential pattern mining, and closed pattern mining. These methods can discover patterns in a reasonable time frame, but it is difficult to use the discovered patterns in the field of text mining. Text mining is the process of discovering interesting information in text documents. Information retrieval provides many methods for finding accurate knowledge in text documents. The most commonly used methods are phrase-based approaches, but they suffer from several problems: phrases have a low frequency of occurrence, and there are a large number of noisy phrases among them. If the minimum support is decreased, many more noisy patterns are generated.

Pattern Classification Method
To find knowledge effectively without the problems of low frequency and misinterpretation, this paper presents a pattern-based approach: the pattern classification method. The approach first finds the common characteristics of patterns and evaluates term weights based on the distribution of terms in the discovered patterns, which addresses the misinterpretation problem. The low-frequency problem is reduced by also using patterns from the negatively trained examples. Many algorithms are used to discover patterns, such as the Apriori algorithm and the FP-tree algorithm, but these algorithms do not specify how to use the discovered patterns effectively. The pattern classification method uses closed sequential patterns to deal efficiently with the large number of discovered patterns; it applies the concept of the closed pattern to text mining.
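As a minimal illustration of the closed-pattern concept (on a hypothetical toy itemset database; mining closed sequential patterns in practice requires a dedicated algorithm, and all names below are illustrative):

```python
from itertools import combinations

# hypothetical toy "documents" represented as sets of terms (itemsets)
transactions = [
    {"pattern", "mining"},
    {"pattern", "mining"},
    {"pattern"},
]
min_support = 2

def support(itemset):
    """Number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# enumerate all frequent itemsets (brute force, fine for a toy example)
items = sorted(set().union(*transactions))
frequent = {}
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        s = support(frozenset(combo))
        if s >= min_support:
            frequent[frozenset(combo)] = s

# an itemset is CLOSED if no proper superset has the same support
closed = {
    iset: s
    for iset, s in frequent.items()
    if not any(iset < other and frequent[other] == s for other in frequent)
}
# {"mining"} is pruned here: its superset {"pattern", "mining"} has the
# same support (2), so the smaller pattern carries no extra information
```

Keeping only closed patterns discards redundant sub-patterns without losing any support information, which is what allows a pattern-based classifier to cope with the large number of patterns that mining algorithms generate.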
Preprocessing
The first step towards handling and analyzing textual data is to consider the text-based information available in free-format text documents. Real-world databases are highly susceptible to noisy, missing, and inconsistent data because of their huge size, and such low-quality data leads to low-quality mining results. Preprocessing is therefore applied to each text document as its content is stored on the desktop system. Traditionally, information was processed manually: human domain experts read each document thoroughly and decided whether it was good or bad (positive or negative), which is expensive in the time and effort required from the experts. The preprocessing method includes two processes.
Removing stop words and stemming
To begin the automated text classification process, the input data must be represented in a format suitable for the application of textual data mining techniques. The first step is to remove the unnecessary information that appears in the form of stop words. Stop words are words deemed irrelevant even though they may appear frequently in a document: verbs, conjunctions, disjunctions, pronouns, and so on (e.g. is, am, the, of, an, we, our). These words are removed because they contribute little to interpreting the meaning of the text.
Stemming is the process of conflating words to their original stem, base, or root. Many words are small syntactic variants of each other because they share a common word stem. In this paper simple stemming is applied, where words such as 'deliver', 'delivering', and 'delivered' are all stemmed to 'deliver'. Stemming helps capture the whole information-carrying term space and also reduces the dimensionality of the data, which ultimately benefits the classification task. Several algorithms implement stemming, including Snowball, Lancaster, and the Porter stemmer. Among these, the Porter stemmer is an efficient choice: it is a simple rule-based algorithm that replaces one suffix with another. Rules take the form (condition) s1 -> s2, where s1 and s2 are suffixes. Typical replacements include rewriting 'sses' as 'ss' and 'ies' as 'i', stripping past-tense and progressive endings, cleaning up, and replacing 'y' with 'i'.
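The two preprocessing steps can be sketched as follows. The stop-word list and suffix rules are a small illustrative subset, not the full Porter rule set, and the function names are ours:

```python
# hypothetical stop-word list and suffix rules, a small illustrative subset
STOP_WORDS = {"is", "am", "the", "of", "an", "we", "our", "and", "a"}

SUFFIX_RULES = [      # rules of the form s1 -> s2
    ("sses", "ss"),   # classes -> class
    ("ies", "i"),     # policies -> polici
    ("ing", ""),      # progressive form: delivering -> deliver
    ("ed", ""),       # past tense: delivered -> deliver
]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for s1, s2 in SUFFIX_RULES:
        if word.endswith(s1) and len(word) - len(s1) >= 3:
            return word[: -len(s1)] + s2
    return word

def preprocess(text):
    """Tokenize, drop stop words, then stem the surviving tokens."""
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, preprocess("We delivered the delivering of an essay") drops the stop words and conflates both verb forms, yielding ["deliver", "deliver", "essay"].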
Weight Calculation
The weight of each term is calculated by multiplying its term frequency by its inverse document frequency. Term frequency counts the occurrences of an individual term in a document. Inverse document frequency measures whether a term is common or rare across all documents.
Term frequency:
Tf(t,d) = 0.5 + 0.5 * f(t,d) / max{ f(w,d) : w in d }
where t is a term, d is a single document, and f(t,d) is the number of occurrences of t in d.
Inverse document frequency:
IDF(t,D) = log( N / n_t )
where D is the document collection, N is the total number of documents, and n_t is the number of documents containing t.
Weight:
Wt = Tf(t,d) * IDF(t,D)
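The weighting scheme above can be sketched directly from the formulas (the toy corpus and the names `docs`, `tf`, `idf`, `weight` are illustrative):

```python
import math

# hypothetical toy corpus: each document is a list of tokens
docs = {
    "d1": ["pattern", "mining", "pattern", "text"],
    "d2": ["text", "retrieval"],
    "d3": ["pattern", "retrieval", "retrieval"],
}

def tf(term, doc):
    # augmented frequency: Tf(t,d) = 0.5 + 0.5 * f(t,d) / max{f(w,d) : w in d}
    counts = {w: doc.count(w) for w in doc}
    return 0.5 + 0.5 * counts.get(term, 0) / max(counts.values())

def idf(term, docs):
    # IDF(t,D) = log(N / n_t), with n_t = number of documents containing t
    n_t = sum(1 for d in docs.values() if term in d)
    return math.log(len(docs) / n_t)

def weight(term, doc_id):
    # Wt = Tf * IDF
    return tf(term, docs[doc_id]) * idf(term, docs)
```

In "d1" the term "pattern" is the most frequent token, so its augmented term frequency is 1.0 and its weight reduces to IDF alone, log(3/2).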
Clustering
A cluster is a collection of data objects that are similar to one another within the same cluster. Cluster analysis finds similarities between data according to the characteristics found in the data and groups similar data objects into clusters. Clustering is the process of grouping data or information into groups of similar type using physical or quantitative measures; it is a form of unsupervised learning. Cluster analysis is used in many applications, such as pattern recognition, data analysis, and information discovery on the web. It supports many types of data: data matrices, interval-scaled variables, nominal variables, binary variables, and variables of mixed types. Many methods are used for clustering: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. In this paper a partitioning method is used for clustering.
Partitioning methods
This method classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Given a database of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n.
K-means algorithm
K-means is one of the simplest unsupervised learning algorithms. It takes an input parameter k and partitions a set of n objects into k clusters so that intra-cluster similarity is high while inter-cluster similarity is low. It is a centroid-based technique: cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output:
A set of k clusters.
Method:

Select an initial partition with k clusters containing randomly chosen samples, and compute the centroids of the clusters.
Generate a new partition by assigning each sample to the closest cluster center.
Compute new cluster centers as the centroids of the clusters.
Repeat steps 2 and 3 until an optimum value of the criterion function is found or until the cluster membership stabilizes.

This algorithm is faster than hierarchical clustering, but it is not suitable for discovering clusters with non-convex shapes.
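The four steps above can be sketched as a minimal Lloyd-style implementation on 2-D points (the function name, toy data, and default parameters are illustrative):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Partition 2-D points into k clusters, following steps 1-4 above."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # step 1: random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # step 2: assign each sample to the closest cluster centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # step 3: recompute centres as the centroids (means) of the clusters
        new_centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # step 4: stop when cluster membership (hence the centroids) stabilises
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# two well-separated toy blobs
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```

On this toy data the two blobs end up in separate clusters regardless of which samples are drawn as the initial centroids, illustrating the convergence of the membership criterion.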

Fig.1. K-Means Clustering
Classification
Classification predicts categorical class labels: it builds a model from a training set together with the class labels of the classifying attribute, and uses the model to classify new data. Data classification is a two-step process: (1) learning and (2) classification. Learning can be supervised or unsupervised. The accuracy of a classifier refers to its ability to correctly predict the class label of new or previously unseen data. Many classification methods are available, such as k-nearest neighbour, genetic algorithms, the rough set approach, and fuzzy set approaches. The technique used here measures nearness of occurrence. It assumes the training set includes not only the data in the set but also the desired classification for each item. The proposed approach finds the distance from a new or incoming instance to each training sample; only the k closest entries in the training set are considered, and the new item is placed into the class that contains the most items among those k. Here similar text documents are classified, and file indexing is performed so that files can be retrieved effectively.
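The majority-vote step described above can be sketched as a minimal k-nearest-neighbour classifier (the toy feature vectors, labels, and function name are hypothetical):

```python
def knn_classify(sample, training, k=3):
    """Place `sample` in the class holding the majority of its k nearest
    training samples. `training` is a list of (feature_vector, label) pairs."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    # sort the training set by distance to the sample and keep the k closest
    nearest = sorted(training, key=lambda item: dist(item[0], sample))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# hypothetical training vectors (e.g. tf*idf weights) with class labels
training = [
    ((0, 0), "sports"), ((0, 1), "sports"), ((1, 0), "sports"),
    ((5, 5), "finance"), ((5, 6), "finance"), ((6, 5), "finance"),
]
```

A new instance near the first group of vectors is assigned the "sports" label, since all three of its nearest neighbours carry that class.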
Result and Discussion
The input file is given and initial preprocessing is performed on it. To find a match with the training samples, the inverse document frequency is calculated. Clustering is performed to find similarities between documents, and classification then determines whether the input matches any of the clusters; if it does, the files in that cluster are listed. The classification technique also classifies the various file formats, and a report is generated showing the percentage of files available in each format; a graphical representation gives a clear view of the files available in the various formats. This method uses the least number of patterns for concept learning compared with other methods such as Rocchio, Prob, nGram, the concept-based models, and the BM25 and SVM models. The proposed model achieves high performance and retrieves the relevant information users want. It reduces the side effects of noisy patterns because term weights are based not only on the term space but also on the patterns. The proper use of discovered patterns overcomes the misinterpretation problem and provides a feasible solution for effectively exploiting the vast number of patterns generated by data mining algorithms.
Conclusion
Storing huge numbers of files on personal computers is a common habit among internet users, essentially justified for the following reasons:
1) Information on the web is not always permanent.
2) The information retrieved differs with each search query.
3) Remembering the locations of the same sites in order to retrieve information again is difficult.
4) Obtaining information is not always immediate.
This habit nevertheless has drawbacks: it is difficult to find the data when it is required. On the internet the use of searching techniques is now widespread, but for personal computers the tools are quite limited. The normal "Search" or "Find" options can take hours to produce a result, so the time consumed in reaching the desired result is high. The proposed system provides accurate results compared with a normal search: all files are indexed and clustered using the efficient k-means technique, so information is retrieved in an efficient manner.
The optimized clustering also provides faster search times, and downtime and power consumption are reduced.