Mobius development is deprecated. It has been superseded by '.NET for Apache Spark' from Microsoft (Website | GitHub), which runs on Azure HDInsight Spark, Amazon EMR Spark, and Azure and AWS Databricks.
<img src='logo/mobius-star-200.png' width='125px' alt='Mobius logo' /> Mobius: C# API for Spark
Mobius provides a C# language binding to Apache Spark, so that Spark driver programs and data processing operations can be implemented in languages supported by the .NET Framework, such as C# or F#.
For example, the word count sample in Apache Spark can be implemented in C# as follows:
var lines = sparkContext.TextFile(@"hdfs://path/to/input.txt");
var words = lines.FlatMap(s => s.Split(' '));
var wordCounts = words.Map(w => new Tuple<string, int>(w.Trim(), 1))
                      .ReduceByKey((x, y) => x + y);
var wordCountCollection = wordCounts.Collect();
wordCounts.SaveAsTextFile(@"hdfs://path/to/wordcount.txt");
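To make the data flow of the word count concrete, here is a minimal sketch that runs the same steps on an in-memory sequence using LINQ. It does not use the Mobius API. The class name `LocalWordCount` and the sample input are hypothetical, and the mapping of `FlatMap`, `Map` and `ReduceByKey` onto `SelectMany`, `Select` and `GroupBy` plus `Aggregate` is only an analogy for what each stage computes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical local analogue of the word count pipeline; not part of Mobius.
public static class LocalWordCount
{
    public static Dictionary<string, int> Count(IEnumerable<string> lines)
    {
        return lines
            // FlatMap: one line becomes many words
            .SelectMany(s => s.Split(' '))
            // Map: pair each word with an initial count of 1
            .Select(w => Tuple.Create(w.Trim(), 1))
            // ReduceByKey: combine the counts of identical words
            .GroupBy(pair => pair.Item1, pair => pair.Item2)
            .ToDictionary(g => g.Key, g => g.Aggregate((x, y) => x + y));
    }

    public static void Main()
    {
        var counts = Count(new[] { "to be or", "not to be" });
        foreach (var kv in counts.OrderBy(kv => kv.Key))
            Console.WriteLine("{0}: {1}", kv.Key, kv.Value);
    }
}
```

As in the Spark version, splitting on a single space means consecutive spaces produce empty strings, which are counted as words unless they are filtered out first.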
A simple DataFrame application using TempTable may look like the following:
var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");
reqDataFrame.RegisterTempTable("requests");
metricDataFrame.RegisterTempTable("metrics");
// C0 - guid in requests DataFrame, C3 - guid in metrics DataFrame
var joinDataFrame = sqlContext.Sql(
    "SELECT joinedtable.datacenter" +
         ", MAX(joinedtable.latency) maxlatency" +
         ", AVG(joinedtable.latency) avglatency " +
    "FROM (" +
       "SELECT a.C1 as datacenter, b.C6 as latency " +
       "FROM requests a JOIN metrics b ON a.C0 = b.C3) joinedtable " +
    "GROUP BY datacenter");
joinDataFrame.ShowSchema();
joinDataFrame.Show();
A simple DataFrame application using the DataFrame DSL may look like the following:
// C0 - guid, C1 - datacenter
var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv")
                             .Select("C0", "C1");
// C3 - guid, C6 - latency
// The extra arguments override delimiter, hasHeader & inferSchema
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv", ",", false, true)
                                .Select("C3", "C6");
var joinDataFrame = reqDataFrame.Join(metricDataFrame, reqDataFrame["C0"] == metricDataFrame["C3"])
                                .GroupBy("C1");
var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "max" } });
maxLatencyByDcDataFrame.ShowSchema();
maxLatencyByDcDataFrame.Show();
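The join, group and aggregate steps above can be illustrated on in-memory data with LINQ. This is only a sketch of the intended result, not Mobius code. The class `LocalJoinAggregate`, the tuple layouts and the sample values are assumptions made for the example, and an inner join is assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical local analogue of the DataFrame join + GroupBy + max; not part of Mobius.
public static class LocalJoinAggregate
{
    public static Dictionary<string, double> MaxLatencyByDatacenter(
        IEnumerable<Tuple<string, string>> requests,  // (guid, datacenter), like C0 and C1
        IEnumerable<Tuple<string, double>> metrics)   // (guid, latency), like C3 and C6
    {
        return requests
            // Inner join on guid: rows without a match on both sides are dropped
            .Join(metrics, r => r.Item1, m => m.Item1,
                  (r, m) => new { Datacenter = r.Item2, Latency = m.Item2 })
            // Group by datacenter and keep the maximum latency per group
            .GroupBy(row => row.Datacenter)
            .ToDictionary(g => g.Key, g => g.Max(row => row.Latency));
    }

    public static void Main()
    {
        var requests = new[]
        {
            Tuple.Create("g1", "dc1"), Tuple.Create("g2", "dc1"), Tuple.Create("g3", "dc2")
        };
        var metrics = new[]
        {
            Tuple.Create("g1", 10.0), Tuple.Create("g2", 30.0),
            Tuple.Create("g3", 20.0), Tuple.Create("g4", 99.0) // g4 has no matching request
        };
        foreach (var kv in MaxLatencyByDatacenter(requests, metrics).OrderBy(kv => kv.Key))
            Console.WriteLine("{0}: {1}", kv.Key, kv.Value);
    }
}
```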
A simple Spark Streaming application that processes messages from Kafka using C# may be implemented using the following code:
StreamingContext sparkStreamingContext = StreamingContext.GetOrCreate(checkpointPath, () =>
{
    var ssc = new StreamingContext(sparkContext, slideDurationInMillis);
    ssc.Checkpoint(checkpointPath);
    var stream = KafkaUtils.CreateDirectStream(ssc, topicList, kafkaParams, perTopicPartitionKafkaOffsets);
    //message format: [timestamp],[loglevel],[logmessage]
    var countByLogLevelAndTime = stream
        .Map(kvp => Encoding.UTF8.GetString(kvp.Value))
        .Filter(line => line.Contains(","))
        .Map(line => line.Split(','))
        .Map(columns => new Tuple<string, int>(
            string.Format("{0},{1}", columns[0], columns[1]), 1))
        .ReduceByKeyAndWindow((x, y) => x + y, (x, y) => x - y,
            windowDurationInSecs, slideDurationInSecs, 3)
        // the reduced elements are Tuple<string, int>, so use Item1/Item2
        .Map(logLevelCountPair => string.Format("{0},{1}",
            logLevelCountPair.Item1, logLevelCountPair.Item2));
    countByLogLevelAndTime.ForeachRDD(countByLogLevel =>
    {
        foreach (var logCount in countByLogLevel.Collect())
            Console.WriteLine(logCount);
    });
    return ssc;
});
sparkStreamingContext.Start();
sparkStreamingContext.AwaitTermination();
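The two lambdas passed to `ReduceByKeyAndWindow` above are a reduce function and its inverse. In Spark Streaming, this form updates each window incrementally: data entering the window is reduced in, and data leaving the window is "inverse reduced" out. The following sketch models that idea for a single key, with a slide of one batch. It is a simplification and not Mobius code: the class `IncrementalWindow`, its parameters and the sample counts are hypothetical, and it assumes `0` is the identity value of the reduce function.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical single-key model of an incrementally maintained sliding window; not part of Mobius.
public static class IncrementalWindow
{
    // batchCounts[i] is the count seen in batch i; windowLength is measured in batches.
    public static List<int> WindowedSums(
        IList<int> batchCounts, int windowLength,
        Func<int, int, int> reduce, Func<int, int, int> invReduce)
    {
        var result = new List<int>();
        int current = 0; // assumes 0 is the identity of `reduce`
        for (int i = 0; i < batchCounts.Count; i++)
        {
            // Reduce in the batch that enters the window
            current = reduce(current, batchCounts[i]);
            // Inverse-reduce the batch that has just left the window
            if (i >= windowLength)
                current = invReduce(current, batchCounts[i - windowLength]);
            result.Add(current);
        }
        return result;
    }

    public static void Main()
    {
        var sums = WindowedSums(new[] { 1, 2, 3, 4, 5 }, 3, (x, y) => x + y, (x, y) => x - y);
        Console.WriteLine(string.Join(", ", sums)); // prints "1, 3, 6, 9, 12"
    }
}
```

Each window value is derived from the previous one with two operations, instead of re-reducing every batch in the window. This only works when the reduce function has an inverse, as addition does with subtraction.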
For more code samples, refer to the Mobius\examples directory or the Mobius\csharp\Samples directory.
API Documentation
Refer to the Mobius C# API documentation for the list of Spark's data processing operations supported in Mobius.
API Usage
Mobius API usage samples are available at:
- Examples folder which contains standalone C# and F# projects that can be used as templates to start developing Mobius applications
- Samples project which uses a comprehensive set of Mobius APIs to implement samples that are also used for functional validation of APIs
- Mobius performance test scenarios implemented in C# and Scala for side by side comparison of Spark driver code
Documents
Refer to the docs folder for a design overview and other information on Mobius.
Getting Started
| | Windows | Linux |
---|---|---|
Build & run unit tests | Build in Windows | Build in Linux |
Run samples (functional tests) in local mode | Samples in Windows | Samples in Linux |
Run examples in local mode | Examples in Windows | Examples in Linux |
Run Mobius app | <ul><li>Standalone cluster</li><li>YARN cluster</li></ul> | <ul><li>Linux cluster</li><li>Azure HDInsight Spark Cluster</li><li>AWS EMR Spark Cluster</li></ul> |
Run Mobius Shell | <ul><li>Local</li><li>YARN</li></ul> | Not supported yet |
Useful Links
- Configuration parameters in Mobius
- Troubleshoot errors in Mobius
- Debug Mobius apps
- Implementing Spark Apps in F# using Mobius
Supported Spark Versions
Mobius is built and tested with Apache Spark 1.4.1, 1.5.2, 1.6.* and 2.0.
Releases
Mobius releases are available at https://github.com/Microsoft/Mobius/releases. The references needed to build a C# Spark driver application using Mobius are also available on NuGet.
Refer to mobius-release-info.md for the details on versioning policy and the contents of the release.
License
Mobius is licensed under the MIT license. See the LICENSE file for full license information.
Community
- The Mobius project welcomes contributions. To contribute, follow the instructions in CONTRIBUTING.md.
- Options to ask your question to the Mobius community:
  - create an issue on GitHub
  - create a post with the "sparkclr" tag on Stack Overflow
  - join the chat in the Mobius room on Gitter
  - tweet @MobiusForSpark
Code of Conduct
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.