A Review of "Social Media Mining: An Introduction"

A great read on social media mining and text analytics is readily available online under the title Social Media Mining: An Introduction. The authors are Reza Zafarani, Mohammad Ali Abbasi, and Huan Liu, and the book is published by Cambridge University Press; the online draft is dated April 20, 2014. A link to the book is found at: http://dmml.asu.edu/smm/SMM.pdf

Summary:

The book begins with a brief introduction to and overview of data mining, social media, and the emergence of social media mining. Chapter 1 essentially describes how to use the book by laying out the dependencies of information from chapter to chapter. The book is geared toward computer science majors at the senior undergraduate or graduate level, with a bent toward some of the mathematical theory behind the algorithms and techniques it describes. Anyone with a decent background in discrete optimization would find the theory quite readable.

Part I of the book effectively begins in chapter 2 with the concepts and theory behind graphs and networks. This covers everything from basic terminology to algorithms for traversing and optimizing over networks given limited resources, e.g. minimum spanning trees, max-flow problems, etc. Chapter 3 moves on to network topics such as the common terminology for the structure and attributes of networks. Chapter 4 is all about the different models employed in evaluating and mapping such networks, and combines pieces from both chapters 2 and 3. Chapter 5 is the last chapter of Part I and the most relevant to the rest of the book, giving the basic terminology and algorithms used in data mining. It provides pseudo-code and the logic behind these algorithms, as well as brief examples of how they work in practice. It covers both supervised and unsupervised learning, with various methodologies and evaluation techniques.

Part II begins in chapter 6, which covers community identification and analysis using unsupervised learning algorithms, along with theoretical treatments of how networks evolve. Information dissemination in social media is found in chapter 7. Chapter 7 also goes into various diffusion models, employing differential equations, as well as techniques for measuring intervention across social media. Finally, Part III describes three common applications of social media mining: influence and gaining friends/followers (chapter 8), marketing applications for recommendations (chapter 9), and behavior analytics (chapter 10).

Out of personal interest, a summary of one selection from the book, chapter 5 in particular, follows:

Chapter 5 – Data Mining Essentials – Notes

Data mining is the process of extracting useful patterns from raw data, sometimes known as KDD (knowledge discovery in databases). The process is fairly standard across all data mining: raw data (structured or unstructured) is preprocessed in such a way that a data mining algorithm can process it and produce some analysis. This analysis must then be evaluated and interpreted in order to obtain usable and accurate knowledge about the raw data.

A general procedure is to identify a subset of the raw data which is of interest (target data) and preprocess this target data such that the data mining algorithm can properly understand and analyze it.

Two ways of obtaining social media data for analysis are using available repositories about the site or collecting the raw data directly. Some social media sites provide APIs (application programming interfaces) which can be used to this end; others require independently programmed web scrapers/parsers.

In either case, social media sites are generally networks of individuals, so one can perform graph traversal algorithms to collect information.
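As a rough illustration of collecting users by graph traversal, here is a minimal breadth-first sketch. The `get_neighbors` callback and the toy friendship graph are assumptions made for this example; in practice the callback would wrap whatever API or scraper a given site requires.

```python
from collections import deque

def crawl_bfs(get_neighbors, seed, max_nodes):
    """Breadth-first traversal from `seed`, collecting at most `max_nodes` users."""
    visited = [seed]          # users in the order they were discovered
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < max_nodes:
        user = queue.popleft()
        for friend in get_neighbors(user):
            if friend not in seen:
                seen.add(friend)
                visited.append(friend)
                queue.append(friend)
                if len(visited) >= max_nodes:
                    break
    return visited

# Hypothetical friendship graph standing in for API or scraper responses.
toy_graph = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}

order = crawl_bfs(lambda user: toy_graph.get(user, []), "ann", max_nodes=4)
print(order)  # ['ann', 'bob', 'cat', 'dan']
```

The `max_nodes` cap is there because real crawls are usually limited by rate limits or time, so the traversal stops once enough users have been collected.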

5.1 Data

In KDD, data is represented in tabular form. This tabular form typically has instances (data points/observations), features (measurements or attributes associated with an instance), and classes (given the features of an instance, one tries to predict which class it belongs to).

Class values can be labeled or unlabeled, which helps determine which data mining algorithm would be more effective.

Features can be discrete or continuous and include:

  • Nominal (Categorical) – e.g. a name or a string of text
  • Ordinal – a measure with relative size or order
  • Interval – a measure for which subtraction or addition makes sense but not division or multiplication (e.g. time)
  • Ratio – a measure with a meaningful zero, so that values can be compared by proportion (division and multiplication make sense)

Social media mining produces many types of non-tabular data, e.g. text, voice, video, etc. These must be converted to tabular data before data mining algorithms can process them.

There are many techniques for turning text into tabular data. One common approach is vectorization, which may employ a vector-space model or some related algorithm for preprocessing.
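To make the idea of vectorization concrete, here is a minimal sketch of a term-count version of the vector-space model. The whitespace tokenizer and the sample posts are assumptions for illustration; real pipelines usually add stop-word removal and a weighting scheme such as TF-IDF.

```python
from collections import Counter

def vectorize(documents):
    """Turn a list of strings into (vocabulary, rows of term counts)."""
    tokenized = [doc.lower().split() for doc in documents]
    # One column per distinct term, in alphabetical order.
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Made-up posts used only for this example.
posts = ["great book on mining", "mining social media", "great great read"]
vocabulary, rows = vectorize(posts)

print(vocabulary)  # ['book', 'great', 'media', 'mining', 'on', 'read', 'social']
for row in rows:
    print(row)
```

Each post becomes one instance (a row) and each term becomes one feature (a column), which is the tabular form that data mining algorithms expect.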

5.1.1 Data Quality

When preprocessing data for use in data mining algorithms, the following 4 data quality aspects must be considered:

  1. Noise – distortion of data
  2. Outliers – instances which are considerably different from other instances in the data
  3. Missing values – features missing from particular instances
  4. Duplicate data – multiple instances with the same exact features
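A few of these checks can be sketched in plain Python. The policies used here are assumptions chosen for illustration, not the book's prescriptions: rows with missing values are dropped (imputation is a common alternative), and outliers are flagged with a simple "more than k standard deviations from the mean" rule.

```python
import statistics

def clean(instances):
    """Drop exact duplicates and instances with a missing (None) feature."""
    seen = set()
    kept = []
    for row in instances:
        if any(value is None for value in row):
            continue  # missing value: dropped in this sketch
        if row in seen:
            continue  # duplicate of an earlier instance
        seen.add(row)
        kept.append(row)
    return kept

def flag_outliers(values, k=2.0):
    """Return values lying more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - mean) / spread > k]

# Hypothetical instances as (id, label) tuples.
raw = [(1, "a"), (1, "a"), (2, None), (3, "c")]
print(clean(raw))                               # [(1, 'a'), (3, 'c')]
print(flag_outliers([10, 11, 9, 10, 12, 50]))   # [50]
```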

5.2 Data Preprocessing

Typical data preprocessing tasks include:

  1. Aggregation – multiple features are combined or summarized
  2. Discretization – converting continuous features to discrete bins (optional)
  3. Feature Selection – choosing the minimum subset of features required for the data mining algorithm
  4. Feature Extraction – Converting current feature set to a new set of features via transformation or manipulation
  5. Sampling – using a subset of the data which is a good representative of the underlying population
    1. Random sampling
    2. Sampling w/ and w/o replacement
    3. Stratified Sampling – partitioning the data into class bins and sampling uniformly from each, which helps when the classes are imbalanced
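The sampling variants above can be sketched with Python's standard `random` module. The `stratified_sample` helper, the toy spam/ham data, and the fixed seed are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_sample(instances, label_of, per_class, rng):
    """Draw up to `per_class` instances from each class, without replacement."""
    strata = defaultdict(list)
    for inst in instances:
        strata[label_of(inst)].append(inst)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        sample.extend(rng.sample(members, min(per_class, len(members))))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Imbalanced toy data: 2 "spam" instances and 18 "ham" instances.
data = [("x%d" % i, "spam") for i in range(2)] + [("y%d" % i, "ham") for i in range(18)]

without_replacement = rng.sample(data, 5)   # each instance appears at most once
with_replacement = rng.choices(data, k=5)   # the same instance may be drawn again
balanced = stratified_sample(data, lambda inst: inst[1], per_class=2, rng=rng)

print([label for _, label in balanced])  # ['ham', 'ham', 'spam', 'spam']
```

A plain random sample of five instances could easily contain no "spam" at all, which is the situation stratified sampling is meant to avoid.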

5.4 Supervised Learning

Supervised learning applies to data sets for which the class attribute values are known before running an algorithm, i.e. the instances in a “training” set of data carry known class labels.

Instances form tuples (X, y) where X is generally a vector and y is a class attribute, commonly a scalar.

The goal is to build a model that maps X to y, i.e. to identify the mapping m(…) such that m(X) = y.

Supervised learning also uses an unlabeled data set, for which X is known but y is not, called a “test” set.

Common Supervised Learning Techniques:

  • Decision tree learning
  • Naïve Bayes Classifier
  • Nearest Neighbor Classifier
  • Classification w/ Network Information

Supervised learning uses a training-testing framework in which the training data set is used to train the model and the model is evaluated on a test data set.
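The training-testing framework can be sketched with a 1-nearest-neighbor classifier, one of the techniques listed above. The toy points and labels are assumptions for illustration; the test labels are hidden from the model during prediction and used only to score its answers afterwards.

```python
import math

def nearest_neighbor_predict(train, x):
    """Return the label of the training instance closest to x (1-NN)."""
    _, label = min(train, key=lambda pair: math.dist(pair[0], x))
    return label

def accuracy(train, test):
    """Fraction of test instances whose predicted label matches the held-out label."""
    hits = sum(1 for x, y in test if nearest_neighbor_predict(train, x) == y)
    return hits / len(test)

# Tuples (X, y): X is a feature vector, y is the class attribute.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((5.0, 5.0), "b"), ((5.5, 4.5), "b")]
test = [((0.9, 1.1), "a"), ((5.2, 5.1), "b")]

print(nearest_neighbor_predict(train, (0.9, 1.1)))  # a
print(accuracy(train, test))                         # 1.0
```

Here "training" is trivial, since nearest-neighbor methods simply store the training set; decision trees or naïve Bayes would fit parameters at that stage instead.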

5.5 Unsupervised Learning

  • Unsupervised division of instances into groups of similar objects
  • This book focuses on clustering techniques only
  • Data is often unlabeled, i.e. classes are not known prior to mining
  • All clustering algorithms require a measure of “distance”
  • Common measures include the Euclidean (L2) and taxi-cab (L1) norms, among others
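For reference, the two distance measures named above can be written in a few lines. The function names are chosen for this sketch; the L1 norm is the taxi-cab (Manhattan) distance and the L2 norm is the Euclidean distance.

```python
import math

def manhattan(p, q):
    """L1 (taxi-cab) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """L2 (Euclidean) distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(manhattan((0, 0), (3, 4)))  # 7
print(euclidean((0, 0), (3, 4)))  # 5.0
```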

5.5.1 Clustering algorithms

Partitioning algorithms such as k-means are well known and popular.

General framework of the k-means algorithm:

  1. Start with k selected candidates for the centroids (typically a few of the instances of the data set)
  2. Assign every instance to the cluster whose centroid is nearest to it
  3. Re-compute each centroid once all instances have been assigned
  4. Repeat from step 2 with the new centroids
  5. Stop when every centroid changes from its previous position by less than some ε > 0
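The steps above can be sketched as follows. This is a minimal version under a few assumptions of my own: Euclidean distance, initial centroids passed in by the caller, an empty cluster keeping its old centroid, and a `max_iter` safety cap that is not part of the framework as summarized here.

```python
import math

def k_means(points, centroids, eps=1e-6, max_iter=100):
    """Cluster `points` around the given initial `centroids` (Euclidean distance)."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Step 2: assign each instance to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: re-compute each centroid as the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # empty cluster keeps its centroid
        # Step 5: stop once no centroid has moved by eps or more.
        shift = max(math.dist(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < eps:
            break
    return centroids, clusters

# Toy data with two visually obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = k_means(points, [points[0], points[3]])
print(centroids)
print(clusters)
```

Seeding with two of the instances mirrors step 1; a different choice of starting centroids can lead to a different final clustering.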
