TELUGU LANGUAGE TEXT MINING

Y. Sri Lalitha

Associate Professor, Department of IT, Gokaraju Rangaraju Institute of Engineering and Technology

Keywords - Text Mining, Decision Support System, Classification, C5.0, Machine Translation.

ABSTRACT

Text mining is crucial for extracting knowledge from texts that are available in a variety of formats and that contain information relevant to a user's needs. In this work we present a tourist decision support system that extracts data from Telugu text files on tourist locations in the Telugu-speaking states of Andhra Pradesh and Telangana, preprocesses the data, and classifies the locations into three categories using the C5.0 algorithm. The result is then used to help foreign tourists choose points of interest that suit their preferences. Telugu is the official language of the southern Indian states of Telangana and Andhra Pradesh; it is read and written by more than 75 million people, and the region's rich cultural legacy is preserved in Telugu-language texts in a variety of formats. We also give a brief overview of our ongoing and future work applying field force automation and opinion mining techniques to the same tourist datasets.
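The abstract names the C5.0 algorithm, whose tree induction (like C4.5 before it) greedily splits on the attribute with the best entropy-based score. A minimal pure-Python sketch of that splitting criterion on a hypothetical tourist-location dataset — the attribute names and the three category labels below are invented for illustration, not taken from the paper:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting `rows` (dicts) on attribute `attr`."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical preprocessed records extracted from Telugu text files;
# the labels stand in for the paper's three target categories.
rows = [
    {"type": "temple", "state": "Telangana"},
    {"type": "temple", "state": "Andhra Pradesh"},
    {"type": "fort",   "state": "Telangana"},
    {"type": "beach",  "state": "Andhra Pradesh"},
    {"type": "beach",  "state": "Andhra Pradesh"},
    {"type": "fort",   "state": "Telangana"},
]
labels = ["pilgrimage", "pilgrimage", "heritage", "leisure", "leisure", "heritage"]

gains = {a: information_gain(rows, labels, a) for a in ("type", "state")}
best = max(gains, key=gains.get)  # attribute the induction would split on first
```

Here "type" separates the categories perfectly, so it wins the first split; C5.0 itself refines this criterion (gain ratio, boosting, pruning), but the entropy computation is the core idea.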

Full PDF File

A STUDY ON SEED POINT SELECTION METHODS FOR TEXT DOCUMENT CLUSTERING USING K-MEANS

Y. Sri Lalitha

Associate Professor, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad

Keywords - Clustering, Seed Point Selection, Initial Centroid Selection.

ABSTRACT

The steady and remarkable progress of storage media has given a great boost to database and information technologies, producing huge repositories of structured (database) and unstructured (text) data available at a mouse click. Discovering the valuable information hidden in these repositories is not a trivial task. Partitional clustering algorithms perform best on high-dimensional data, and text documents are by nature sparse and high-dimensional. However, the technique suffers from two drawbacks: it converges to a local optimum, so clustering results are sensitive to the choice of seed documents, and the number of partitions must be specified before the clustering process begins. This work addresses the problem of determining the best seed points for efficient clustering. The study implements the sequence, random, and buckshot methods along with the proposed rank-based seed selection method, and examines their effect on clustering quality and accuracy.
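The abstract compares sequence, random, and buckshot seeding against a proposed rank-based method, but does not specify the rank-based method itself. As an illustration of how deterministic seed selection differs from random seeding in k-means, the sketch below uses farthest-first traversal, a common deterministic alternative; the data are toy 2-D "document vectors", not the paper's corpus:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_seeds(docs, k):
    """Deterministic seed selection: start from the first document, then
    repeatedly pick the document farthest from its nearest chosen seed."""
    seeds = [docs[0]]
    while len(seeds) < k:
        next_doc = max(docs, key=lambda d: min(sq_dist(d, s) for s in seeds))
        seeds.append(next_doc)
    return seeds

# Toy vectors forming three well-separated groups.
docs = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.8, 0.1), (10.0, 0.0)]
seeds = farthest_first_seeds(docs, 3)
```

On this toy data the method picks one seed from each group, which is exactly the property good seeding aims for: k-means started from such seeds cannot merge two groups under one centroid, avoiding the poor local optima random seeding risks.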

Full PDF File

OUTLIER DETECTION EFFICIENCY FOR HIGH DIMENSIONAL DATA

C. Jayaramulu1, P. Krishna Akhila2, M. Geetha Mounika3, SK. Jasmine4, B. Divya5

1Associate Professor, Krishna Chaitanya Institute of Technology & Sciences, Markapur, A.P, India
2,3,4,5Scholar, Krishna Chaitanya Institute of Technology & Sciences, Markapur, A.P, India

Keywords - Dimension reduction, high-dimensional data, k-nearest neighbors (kNN), low-rank approximation, outlier detection.

ABSTRACT

Handling high-dimensional data properly and efficiently remains a difficult problem in machine learning, and real-world applications that must identify abnormal items in data are numerous. Although many traditional outlier detection and ranking algorithms have appeared in recent years, the high-dimensionality issue and the choice of neighbourhood size in outlier detection have not yet received enough attention. The former can lead to the distance concentration problem, in which the distances between observations in high-dimensional space become indiscernible, while the latter requires suitable parameter values, making models complex and more sensitive. To partially overcome these issues, particularly high dimensionality, we propose a notion termed the local projection score (LPS) to express the degree to which an observation diverges from its neighbours. The LPS is obtained from neighbourhood information using a low-rank approximation method, and an observation with a high LPS is likely to be an anomaly. Based on this idea, we present an effective outlier detection technique that is also robust to the k-nearest-neighbour parameter. Extensive experiments against five well-known outlier detection methods on twelve publicly available real-world data sets show that the performance of the proposed technique is competitive and promising.
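The abstract defines LPS via a low-rank approximation of an observation's neighbourhood but gives no formula. The sketch below is a rank-1 stand-in for that idea: each point is scored by its residual after projecting onto the dominant direction of its centred k-nearest neighbours, found by power iteration. Everything beyond "kNN plus low-rank projection" — the rank, the scoring formula, the toy data — is an assumption for illustration:

```python
from math import sqrt

def lps_scores(points, k=3, iters=50):
    """Toy local projection score: residual of each point after a
    rank-1 approximation of its k-nearest-neighbour neighbourhood."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    dim = len(points[0])
    scores = []
    for p in points:
        nbrs = sorted((q for q in points if q is not p),
                      key=lambda q: sq_dist(p, q))[:k]
        mean = [sum(c) / k for c in zip(*nbrs)]
        centred = [[x - m for x, m in zip(q, mean)] for q in nbrs]
        # Power iteration for the neighbourhood's dominant direction
        # (the rank-1 piece of a low-rank approximation).
        v = [1.0] * dim
        for _ in range(iters):
            w = [0.0] * dim
            for row in centred:
                dot = sum(r * u for r, u in zip(row, v))
                for j in range(dim):
                    w[j] += dot * row[j]
            norm = sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        # Score: how much of the centred point lies outside that direction.
        d = [x - m for x, m in zip(p, mean)]
        proj = sum(x * u for x, u in zip(d, v))
        scores.append(sqrt(sum((x - proj * u) ** 2 for x, u in zip(d, v))))
    return scores

# Five points on a line plus one clear off-line outlier.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
scores = lps_scores(points, k=3)
```

The on-line points are fully explained by their neighbourhoods' dominant direction (residual near zero), while the off-line point keeps a large residual — the "high LPS means likely anomaly" behaviour the abstract describes.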

Full PDF File

IDENTIFYING IN-ARTICLE ATTRIBUTION AS A SUPERVISED LEARNING ESTIMATOR BY CLASSIFYING FAKE NEWS ARTICLES USING NATURAL LANGUAGE PROCESSING

A. Amrutavalli1, K. Pavitha2, Ch. Hemalatha3, G. Aparna4, A. Mallikarjuna5

1Associate Professor, Krishna Chaitanya Institute of Technology & Sciences, Markapur, A.P, India
2,3,4,5Scholar, Krishna Chaitanya Institute of Technology & Sciences, Markapur, A.P, India

Keywords - Fake News, Machine Learning, Natural Language Processing, Attribution Classification, Influence Mining.

ABSTRACT

The deliberate misrepresentation of news under the guise of reputable journalism is a global problem of information accuracy and integrity that affects people's decision-making, voting, and opinion-forming processes. Most allegedly "fake" news is first disseminated via social media platforms such as Facebook and Twitter before reaching established outlets such as traditional television and radio news. Key linguistic traits of the fake news reports originating on social media include an overuse of unsupported exaggeration and of unattributed quoted information. This study examines the performance of a fake news classifier built around these traits. Using TextBlob, the Natural Language Toolkit, and SciPy, we created a novel fake news detector that employs quoted attribution in a Bayesian machine learning system as its principal feature for assessing the risk that a news story is fraudulent. The resulting algorithm is 63.333 percent accurate in determining whether a quote-heavy article is likely to be fabricated. Influence mining, the process that enables the identification of propaganda and fake news, is presented as an innovative tool for this purpose. This paper describes the research procedure, the technical linguistics work, the technical analysis, and the classifier's performance and findings, and concludes by describing how the existing system will evolve into an influence mining system.
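The key feature in the abstract is quoted attribution: quoted material with no nearby attribution is a fabrication signal. The paper's exact feature set is not given; below is a hypothetical pure-Python extractor in that spirit — the verb list, the 10-word window, and the two sample sentences are all invented for illustration — whose counts could feed a Bayesian classifier of the kind the abstract describes:

```python
import re

# Hypothetical list of attribution verbs; the paper's lexicon is not given.
ATTRIBUTION_VERBS = {"said", "says", "stated", "according", "told", "reported"}

def attribution_features(text):
    """Count quoted spans and how many lack an attribution verb
    within the 10 words that follow the closing quote."""
    quote_ends = [m.end() for m in re.finditer(r'"[^"]+"', text)]
    attributed = 0
    for end in quote_ends:
        following = re.findall(r"[a-z']+", text[end:].lower())[:10]
        if any(w in ATTRIBUTION_VERBS for w in following):
            attributed += 1
    return {"quotes": len(quote_ends),
            "unattributed": len(quote_ends) - attributed}

# Invented examples: attributed quotes vs. unattributed exaggeration.
real = 'The minister "backed the plan", she said. "It works," officials told us.'
fake = 'Experts claim "a total disaster". Sources reveal "the worst fraud ever".'

f_real = attribution_features(real)
f_fake = attribution_features(fake)
```

Both snippets are quote-heavy, but only the second leaves its quotes unattributed; per-article counts like these are the sort of feature a Bayesian model can weigh when estimating the risk that an article is fabricated.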

Full PDF File