Big Data Analytics

Scalable and Interpretable Data Analytics

Interactive Visualization

User’s Interpretation and Control

Statistical Learning

Statistical Patterns and Models, i.e. Machine Learning

Database

Scalable Data Processing and Data Semantics Modelling, i.e. capturing the information behind the data

Parallel Processing

Massively Parallel Computation on Modern Hardware

What is Big Data

3Vs:

  • High-volume (Terabytes -> Zettabytes)
  • High-velocity (Batch -> Streaming data)
    Real-time analysis is time-sensitive: late decisions -> missed opportunities
  • High-variety (Structured -> Semi-structured & unstructured data)
    Various databases: relational, XML, transactional, spatial, graph, text, multimedia…
    Such data is better placed in HDFS or non-relational (NoSQL) databases

4V: Veracity. How accurate or trustworthy is the data?

5V: Value. The data contains value and knowledge.
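
The Batch -> Streaming shift under High-velocity can be illustrated with a small sketch. A batch job waits for all the data and then computes a statistic; a streaming job updates the statistic as each value arrives, so a decision can be made at any moment. The class name RunningStats and the sample readings are made up for illustration; the update rule is Welford's online algorithm.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Incrementally tracks count, mean and variance (Welford's algorithm)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # One pass, constant memory: no need to store past values
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 until at least one value has arrived
        return self.m2 / self.count if self.count else 0.0


readings = [3.0, 5.0, 7.0, 9.0]  # hypothetical sensor readings

# Streaming: the statistics are usable after every single update
stats = RunningStats()
for value in readings:
    stats.update(value)

# Batch: the same mean, but only available once all data is collected
batch_mean = sum(readings) / len(readings)
```

Both approaches agree on the final mean; the streaming version simply never has to wait for, or store, the full dataset.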

Big Data Analytics (Data Mining / Data Science)

raw data -> actionable information (patterns, correlations, trends, preferences)
Data needs to be Stored, Managed and Analyzed.
Discover patterns that are:

  • valid: hold on new data
  • useful: possible to act on the pattern
  • unexpected: not easy to tell directly from the data
  • understandable

Data Analytics Tasks

Descriptive methods: Clustering
Predictive methods: Recommendation Systems
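
As a minimal sketch of a descriptive method, the snippet below clusters one-dimensional data with k-means (Lloyd's algorithm). The function name kmeans_1d, the sample ages and the starting centroids are assumptions chosen for illustration; a real project would more likely use a library implementation.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data; returns (centroids, assignments)."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        assignments = [
            min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its points
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return centroids, assignments


ages = [18, 21, 19, 45, 50, 47]  # hypothetical customer ages
centroids, labels = kmeans_1d(ages, centroids=[18, 50])
```

The output describes structure already present in the data (two age groups) without predicting anything about unseen records, which is what separates a descriptive method from a predictive one.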

Data Analytics Pipeline

  • Data collection: acquire data from various sources
    different name representations, conflicting/incomplete information, ambiguous references
  • Data curation: clean, format, integrate with other datasets, store in database
    duplicates, conflicts, missing values, outliers, entity resolution
  • Data processing: run queries, plot graphs
  • Data analysis: examine trends and anomalies, understand results
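
The curation step can be sketched in plain Python as follows. The records, field names, helper functions and the 0 to 120 age range are all hypothetical; the sketch only shows one possible way to normalize name representations, drop duplicates, fill missing values and flag outliers.

```python
from statistics import median

raw_records = [
    {"name": "Alice Smith ", "city": "London", "age": 34},
    {"name": "alice smith", "city": "London", "age": 34},  # same person, different spelling
    {"name": "Bob Lee", "city": "Paris", "age": None},     # missing value
    {"name": "Carol Wu", "city": "Berlin", "age": 29},
    {"name": "Dan Roy", "city": "Madrid", "age": 290},     # implausible value
]


def normalize(record):
    # Unify name representation: collapse whitespace, use title case
    cleaned = dict(record)
    cleaned["name"] = " ".join(record["name"].split()).title()
    return cleaned


def deduplicate(records):
    # Very simple entity resolution: same (name, city) means same entity
    seen, unique = set(), []
    for r in records:
        key = (r["name"], r["city"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def fill_missing_age(records):
    # Replace missing ages with the median of the known ages
    known = [r["age"] for r in records if r["age"] is not None]
    fallback = median(known)
    return [dict(r, age=fallback if r["age"] is None else r["age"]) for r in records]


def flag_outliers(records, low=0, high=120):
    # Flag, rather than delete, ages outside an assumed plausible range
    return [dict(r, age_outlier=not (low <= r["age"] <= high)) for r in records]


normalized = [normalize(r) for r in raw_records]
curated = flag_outliers(fill_missing_age(deduplicate(normalized)))
```

Flagging outliers instead of deleting them keeps the decision with the analyst, who can inspect the flagged rows during the analysis step.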