AJA - Accomplish Joyful Adventures
Data Science with Spark and Scala
###Topics Explored
###Introduction to Scala
####Basics
- Scala foundation
- Features of Scala
- Set up Spark and Scala on Ubuntu and Windows
- Install IDEs for Scala
- Run Scala code in the Scala shell
- Understanding Data types in Scala
- Implementing Lazy Values
- Control Structures
- Looping Structures
- Functions
- Procedures
- Collections
- Arrays and Array Buffers
- Maps, Tuples and Lists
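
A few of the basics listed above can be sketched in plain Scala. This is a minimal illustration rather than course material: `BasicsSketch` and every name inside it are hypothetical, chosen only for this example.

```scala
import scala.collection.mutable.ArrayBuffer

object BasicsSketch {
  // Counts how many times the lazy initializer below has run.
  var initCount = 0

  // Lazy value: the right-hand side is evaluated on first access only.
  lazy val expensive: Int = { initCount += 1; 21 * 2 }

  // Control structure: `if` is an expression and yields a value.
  def parity(n: Int): String = if (n % 2 == 0) "even" else "odd"

  // Looping structure: a for-comprehension with a guard builds a new collection.
  val evenSquares: IndexedSeq[Int] = for (n <- 1 to 6 if n % 2 == 0) yield n * n

  // Array has a fixed length; ArrayBuffer can grow.
  val fixed: Array[Int] = Array(1, 2, 3)
  val growable: ArrayBuffer[Int] = ArrayBuffer(1, 2, 3)
  growable += 4

  // An immutable Map, a tuple, and a List.
  val ages: Map[String, Int] = Map("ann" -> 31, "bob" -> 27)
  val pair: (String, Int) = ("ann", 31)
  val doubled: List[Int] = List(1, 2, 3).map(_ * 2)
}
```

Pasting the object into the Scala shell and reading `BasicsSketch.expensive` twice should leave `initCount` at 1, which is the point of `lazy val`.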
####Object Oriented Programming in Scala
- Implementing Classes
- Implementing Getter & Setter
- Object & Object Private Fields
- Implementing Nested Classes
- Using Auxiliary Constructor
- Primary Constructor
- Companion Object
- Apply Method
- Understanding Packages
- Override Methods
- Type Checking
- Casting
- Abstract Classes
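
Several of the object-oriented topics above fit into one small sketch. The `Shape` and `Rectangle` names are hypothetical and exist only for illustration; the sketch covers primary and auxiliary constructors, a companion object with `apply`, overriding, an abstract class, and type checking with casting.

```scala
// Abstract class with one abstract member and one concrete method.
abstract class Shape {
  def area: Double
  def describe: String = s"shape with area $area"
}

// The primary constructor takes width and height.
class Rectangle(val width: Double, val height: Double) extends Shape {
  // Auxiliary constructor for a square; it must first call another constructor.
  def this(side: Double) = this(side, side)

  override def area: Double = width * height
  override def describe: String = "rectangle, " + super.describe
}

// Companion object: same name as the class, defined in the same file.
// Its apply method allows Rectangle(2, 3) without `new`.
object Rectangle {
  def apply(width: Double, height: Double): Rectangle = new Rectangle(width, height)
}

object OopSketch {
  val shape: Shape = Rectangle(2, 3)
  val square: Rectangle = new Rectangle(4)

  // Type checking and casting with isInstanceOf / asInstanceOf.
  val widthViaCast: Double =
    if (shape.isInstanceOf[Rectangle]) shape.asInstanceOf[Rectangle].width else 0.0

  // Pattern matching is usually the more idiomatic way to do the same thing.
  val widthViaMatch: Double = shape match {
    case r: Rectangle => r.width
    case _            => 0.0
  }
}
```

In the Scala shell the class and its companion object need to be entered together (for example with `:paste`) so that they are treated as companions.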
####Functional Programming in Scala
- Understanding Functional programming in Scala
- Implementing Traits
- Layered Traits
- Rich Traits
- Anonymous Functions
- Higher Order Functions
- Closures and Currying
- Performing File Processing
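
The trait and function topics above can be sketched as follows. All names (`Greeter`, `Loud`, `FpSketch`, and so on) are hypothetical, and the example is only one possible way to show layered traits, anonymous and higher-order functions, closures, and currying.

```scala
// A "rich" trait: one abstract member plus a concrete method built on it.
trait Greeter {
  def name: String
  def greet: String = s"Hello, $name"
}

// A trait that layers extra behaviour on top of Greeter through super.
trait Loud extends Greeter {
  override def greet: String = super.greet.toUpperCase + "!"
}

class Person(val name: String) extends Greeter

object FpSketch {
  // Traits can be mixed in at the point where an object is created.
  val loudAnn = new Person("Ann") with Loud

  // Anonymous function assigned to a value.
  val square: Int => Int = x => x * x

  // Higher-order function: takes a function and applies it twice.
  def twice(f: Int => Int)(x: Int): Int = f(f(x))

  // Closure: the returned function captures `factor` from its defining scope.
  def multiplier(factor: Int): Int => Int = x => x * factor

  // Curried method: arguments are supplied one parameter list at a time.
  def add(a: Int)(b: Int): Int = a + b
  val addTen: Int => Int = add(10)
}
```

Supplying only the first argument list of `add` yields a function that still expects the second, which is what makes `addTen` reusable.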
###Introduction to Spark
####What is Spark?
- Review: From Hadoop MapReduce to Spark
- Review: HDFS
- Review: YARN
- Spark Overview
####Spark Basics
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
####Working with RDDs in Spark
- Creating RDDs
- Other General RDD Operations
####Aggregating Data with Pair RDDs
- Key-Value Pair RDDs
- Map-Reduce
- Other Pair RDD Operations
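
The key-value and map-reduce items above are commonly introduced with a word count. The sketch below assumes Spark 1.3 or later on the classpath (earlier versions also need `import org.apache.spark.SparkContext._` for `reduceByKey`) and runs in local mode. `PairRddSketch`, `wordCounts`, and the sample sentences are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairRddSketch {
  // Counts word occurrences using a pair RDD; returns the result to the driver.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    val counts = sc.parallelize(lines)   // RDD[String]
      .flatMap(_.split("\\s+"))          // one element per word
      .filter(_.nonEmpty)
      .map(word => (word, 1))            // "map" step: key-value pairs
      .reduceByKey(_ + _)                // "reduce" step: sum the counts per key
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for experimenting on one machine.
    val conf = new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try wordCounts(sc, Seq("to be or", "not to be")).foreach(println)
    finally sc.stop()
  }
}
```

Collecting to the driver is fine for a small example like this; on real data the result would normally stay distributed or be written out.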
####Writing and Deploying Spark Applications
- Spark Applications vs. Spark Shell
- Creating the SparkContext
- Building a Spark Application (Scala and Java)
- Running a Spark Application
- The Spark Application Web UI
- Hands-On Exercise: Write and Run a Spark Application
- Configuring Spark Properties
- Logging
####Parallel Processing
- Review: Spark on a Cluster
- RDD Partitions
- Partitioning of File-based RDDs
- HDFS and Data Locality
- Executing Parallel Operations
- Stages and Tasks
####Spark RDD Persistence
- RDD Lineage
- RDD Persistence Overview
- Distributed Persistence
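
Lineage and persistence can be explored with a short sketch like the one below, again assuming Spark is on the classpath and a local master. `PersistenceSketch` and `cachedEvenSquares` are hypothetical names, and `MEMORY_AND_DISK` is just one of the available storage levels.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistenceSketch {
  // Builds a small chain of transformations and marks the result for caching.
  def cachedEvenSquares(sc: SparkContext): RDD[Long] = {
    val squares = sc.parallelize(1 to 1000).map(n => n.toLong * n)
    val evens = squares.filter(_ % 2 == 0)
    // Nothing has been computed yet; persist only records the storage level.
    evens.persist(StorageLevel.MEMORY_AND_DISK)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persistence-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val evens = cachedEvenSquares(sc)
      println(evens.toDebugString) // prints the lineage of the RDD
      println(evens.count())       // first action computes and caches the partitions
      println(evens.count())       // later actions can reuse the cached partitions
      evens.unpersist()
    } finally sc.stop()
  }
}
```

If a cached partition is lost, Spark can recompute it from the lineage that `toDebugString` displays, which is why the two topics are usually taught together.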
####Spark Streaming
- Spark Streaming Overview
- Example: Streaming Request Count
- DStreams
- Developing Spark Streaming Applications
- Multi-Batch Operations
- State Operations
- Sliding Window Operations
- Advanced Data Sources
####Common Patterns in Spark Data Processing
- Common Spark Use Cases
- Iterative Algorithms in Spark
- Graph Processing and Analysis
- Machine Learning
- Example: k-means
####Improving Spark Performance
- Shared Variables: Broadcast Variables
- Shared Variables: Accumulators
- Common Performance Issues
- Diagnosing Performance Problems
####Spark SQL
- Spark SQL and the SQL Context
- Creating DataFrames
- Transforming and Querying DataFrames
- Saving DataFrames
- DataFrames and RDDs
- Comparing Spark SQL, Impala and Hive-on-Spark
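
The DataFrame items above can be tried out with a sketch along these lines. It assumes the `spark-sql` module is on the classpath and uses `SQLContext`, matching the outline; in Spark 2.0 and later `SparkSession` is the preferred entry point and `new SQLContext(sc)` is deprecated. `DataFrameSketch`, `namesOlderThan`, and the sample rows are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameSketch {
  // Returns the names of people older than minAge, sorted alphabetically.
  def namesOlderThan(sqlContext: SQLContext, minAge: Int): Seq[String] = {
    import sqlContext.implicits._ // enables .toDF on local collections

    // Creating a DataFrame from a local collection of tuples.
    val people = Seq(("ann", 31), ("bob", 27), ("cy", 45)).toDF("name", "age")

    // Transforming and querying with the DataFrame API.
    val older = people.filter(people("age") > minAge).select("name")

    // Dropping down to the underlying RDD of Row objects.
    older.rdd.map(_.getString(0)).collect().toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dataframe-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(namesOlderThan(new SQLContext(sc), 30))
    finally sc.stop()
  }
}
```

Saving is left out here because it needs a writable output path; the DataFrame writer (`older.write`) is the usual starting point for that step.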
####Spark Machine Learning
####GraphX
##Data Science
Project Structure
- Android : Android + Scala integration!
- docs : All reference materials
- data : Datasets used in the implementation
##Build Environment Linux Ubuntu 12.04+
Git Links
##Wiki
Contribution
Let us begin our journey from here!
Contact: mageswaran1989@gmail.com