Scalable and Interpretable Data Analytics
Interactive Visualization
User’s Interpretation and Control
Statistical Learning
Statistical Patterns and Models, i.e. Machine Learning
Database
Scalable Data Processing and Data Semantics Modelling, i.e. capturing the information behind the data
Parallel Processing
Massively Parallel Computation on Modern Hardware
What is Big Data
3Vs:
- High-volume (Terabytes->Zettabytes)
- High-velocity (Batch->Streaming data)
Real-time analysis: late decisions -> missed opportunities (time-sensitive)
- High-variety (Structured->Semi-structured & unstructured data)
Various databases: relational, XML, transactional, spatial, graph, text, multimedia…
Better placed in HDFS or non-relational (NoSQL) databases
4V: Veracity (how accurate or trustworthy the data is)
5V: Value (the data contains value and knowledge)
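The Batch->Streaming shift under High-velocity can be illustrated with a small sketch. It is not from the original notes: the class name `RunningMean` and the sample readings are hypothetical. The idea is that a batch computation needs the whole dataset before it can answer, while a streaming computation keeps a constant-size state and has an up-to-date answer after every record.

```python
class RunningMean:
    """Incrementally maintained mean (hypothetical helper for illustration)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental update: new_mean = old_mean + (x - old_mean) / n
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


def batch_mean(values):
    # Batch style: requires all values up front.
    return sum(values) / len(values)


readings = [4.0, 8.0, 6.0, 2.0]  # made-up sample data

stream = RunningMean()
for r in readings:
    current = stream.update(r)  # an answer is available after every record
```

Both approaches agree on the final value; the streaming version simply never has to store `readings`.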
Big Data Analytics (Data Mining / Data Science)
Raw data -> actionable information (patterns, correlations, trends, preferences)
Data needs to be Stored, Managed and Analyzed.
Discover patterns that are:
- valid: hold on new data
- useful: possible to act on the item
- unexpected: not easy to tell directly from the data
- understandable
Data Analytics Tasks
Descriptive methods: Clustering
Predictive methods: Recommendation Systems
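As a rough illustration of a descriptive method, the sketch below groups one-dimensional points into clusters. The notes only name clustering in general; k-means (Lloyd's iteration) is chosen here as one common example, and the function name `kmeans_1d`, the starting centroids, and the data are all assumptions made for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd-style k-means on 1-D data; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = sum(members) / len(members)
    return centroids, labels


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # made-up sample data
centroids, labels = kmeans_1d(points, centroids=[0.0, 5.0])
```

No labels are supplied in advance: the grouping is discovered from the data alone, which is what makes the method descriptive rather than predictive.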
Data Analytics Pipeline
- Data collection: acquire data from various sources
different name representations, conflicting/incomplete information, ambiguous references
- Data curation: clean, format, integrate with other datasets, store in database
duplicates, conflicts, missing values, outliers, entity resolution
- Data processing: run queries, plot graphs
- Data analysis: examine trends and anomalies, understand results
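The curation issues listed above (duplicates, missing values, outliers, entity resolution) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a recommended pipeline: the records, the field names, the `normalize` and `curate` helpers, and the 0 to 120 age range used as an outlier rule are all hypothetical.

```python
from statistics import median

# Made-up records; field names are hypothetical.
raw = [
    {"name": "Alice Smith", "age": 30},
    {"name": "alice smith ", "age": 30},   # duplicate under a different spelling
    {"name": "Bob Lee", "age": None},      # missing value
    {"name": "Carol Wu", "age": 28},
    {"name": "Dan Roe", "age": 32},
    {"name": "Eve Poe", "age": 290},       # likely a data-entry error
]


def normalize(name):
    # Crude entity resolution: lower-case and collapse whitespace.
    return " ".join(name.lower().split())


def curate(records, low=0, high=120):
    seen, cleaned = set(), []
    for rec in records:
        key = normalize(rec["name"])
        if key in seen:  # drop duplicates
            continue
        seen.add(key)
        cleaned.append({"name": key, "age": rec["age"]})

    # Treat out-of-range ages as missing (simple range-based outlier rule).
    for rec in cleaned:
        if rec["age"] is not None and not (low <= rec["age"] <= high):
            rec["age"] = None

    # Impute missing ages with the median of the remaining valid ones.
    valid = [rec["age"] for rec in cleaned if rec["age"] is not None]
    fill = median(valid)
    for rec in cleaned:
        if rec["age"] is None:
            rec["age"] = fill
    return cleaned


curated = curate(raw)
```

Real entity resolution usually needs fuzzier matching than exact comparison of normalized strings; the point here is only the order of the steps: resolve and deduplicate, handle outliers, then fill the gaps.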