1st Workshop on Data Management for End-to-End Machine Learning

—————–
CALL FOR PAPERS:
—————–

DEEM’17
The 1st Workshop on Data Management for End-to-End Machine Learning, May 14, 2017.
http://deem-workshop.org

Held in conjunction with ACM SIGMOD 2017
Raleigh, NC, USA, May 14-19, 2017
http://sigmod2017.org/

———-
WORKSHOP
———-

Applying Machine Learning (ML) in real-world scenarios is a challenging task. In recent years, the main focus of the database community has been on creating systems and abstractions for the efficient training of ML models on large datasets. However, model training is only one of many steps in an end-to-end ML application, and a number of orthogonal data management problems arise from the large-scale use of ML, which require the attention of the data management community.

For example, data preprocessing and feature extraction workloads result in complex pipelines that often require the simultaneous execution of relational and linear algebraic operations. Next, the class of ML model to use needs to be chosen; this often requires trying out a set of popular approaches, such as linear models, decision trees, and deep neural networks, on the problem at hand. The prediction quality of such ML models depends heavily on the choice of features and hyperparameters, which are typically selected in a costly offline evaluation process that offers huge opportunities for parallelization and optimization. Afterwards, the resulting models must be deployed and integrated into existing business workflows in a way that enables fast and efficient predictions, while still allowing the lifecycle of models (which become stale over time) to be managed. As a further complication, the resulting systems need to take the target audience of ML applications into account. This audience is very heterogeneous, ranging from analysts without programming skills, who may prefer an easy-to-use cloud-based solution, to teams of data processing experts and statisticians who develop and deploy custom-tailored algorithms.

Therefore, DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management, and systems research, with the goal of discussing the data management issues that arise in ML application scenarios. The workshop solicits *regular research papers describing preliminary and ongoing research results*. In addition, the workshop encourages the submission of *industrial experience reports of end-to-end ML deployments*. Submissions can either be *short papers (4 pages)* or *long papers (up to 10 pages)* following the ACM proceedings format, as described in https://www.acm.org/publications/proceedings-template.

Areas of particular interest for the workshop include (but are not limited to):

– Data Management in Machine Learning Applications
– Definition, Execution, and Optimization of Complex ML Pipelines
– Systems for Managing the Lifecycle of Machine Learning Models
– Systems for Efficient Hyperparameter Search and Feature Selection
– Machine Learning Services in the Cloud
– Modeling, Storage, and Lineage of ML experimentation data
– Integration of Machine Learning and Dataflow Systems
– Integration of Machine Learning and ETL Processing
– Benchmarking of Machine Learning Applications
– Definition and Execution of Complex Ensemble Predictors
– Architectures for Streaming Machine Learning

—————-
IMPORTANT DATES
—————-

Papers submission deadline: February 1, 2017
Authors notification: March 1, 2017
Deadline for camera-ready copy: March 20, 2017
Workshop: Sunday May 14th, 2017

———————-
SUBMISSION GUIDELINES
———————-

The workshop will have two tracks: one for regular research papers (including research in progress) and one for industrial papers (e.g., industrial experience reports of end-to-end ML deployments). Submissions can either be *short papers (4 pages)* or *long papers (up to 10 pages)* following the ACM proceedings format, as described in https://www.acm.org/publications/proceedings-template.

—————-
PUBLICATION
—————-

The workshop proceedings will be published in the ACM Digital Library.

—————————
ORGANIZERS
—————————

– Sebastian Schelter (Amazon)
– Reza Zadeh (Stanford & Matroid)
– Markus Weimer (Microsoft)
– Rajeev Rastogi (Amazon)
– Volker Markl (TU Berlin)

—————————
PROGRAM COMMITTEE
—————————

– Sunita Sarawagi (IIT Bombay)
– Sudip Roy (Google)
– Rainer Gemulla (University of Mannheim)
– Matthias Boehm (IBM Research)
– Matthias Seeger (Amazon)
– Evan Sparks (UC Berkeley)
– Chris Ré (Stanford)
– Ted Dunning (MapR Technologies)
– Dionysios Logothetis (Facebook)
– Nedelina Teneva (University of Chicago)
– Vasia Kalavri (KTH Stockholm)
– Venu Satuluri (Twitter)
– Shannon Quinn (University of Georgia)
– Dmitriy Lyubimov (Apache Mahout)
– Tilmann Rabl (TU Berlin)
– Max Heimel (Snowflake)
– Felix Biessmann (Amazon)
– Arun Kumar (UC San Diego)

BeyondMR’17 – Call for papers

* Call for papers *

BEYONDMR’17
The 4th Workshop on Algorithms and Systems for MapReduce and Beyond, May 19, 2017.
https://sites.google.com/site/beyondmr2017/

Held in conjunction with SIGMOD 2017
Raleigh, NC, USA, May 14-19, 2017
http://sigmod2017.org/

—————-
WORKSHOP FOCUS
—————-

The fourth BeyondMR workshop aims to explore algorithms, computational
models, architectures, languages and interfaces for systems that need
large-scale parallelization and systems designed to support efficient
parallelization and fault tolerance. These include specialized programming
and data-management systems based on MapReduce and extensions, graph
processing systems, data-intensive workflow and dataflow systems.

We invite submissions on topics such as:

Frameworks for Large-Scale Analytical Processing:
– Models, architectures and languages for data processing pipelines,
data-intensive workflows, networks of operations/MapReduce jobs, dataflows,
and data-mashups.
– Analysis of programs for workflow systems, e.g., Spark.
– Expressing and parallelising iterations, incremental iterations, and
programs consisting of large networks of operations.
– Approaches to achieving fault tolerance and to recovering from failures.

Algorithms for Large-Scale Data Processing:
– Methods and techniques for designing efficient algorithms for MapReduce
and similar systems.
– Experiments and experience with new algorithms in these settings.

Cost Models and Optimization Techniques:
– Formal definitions of models that evaluate the efficiency of algorithms
in large-scale parallel processing systems taking into account the
requirements of such systems in different applications.
– Testing and benchmarking of MapReduce extensions and data-intensive
workflows.

Resource Management for Many-Task Computing:
– Scheduling of tasks and load-balancing techniques.
– Study of cases where automatic data distribution in MapReduce and
similar systems does not provide sufficient data balancing.
– Algorithms, methods and frameworks to address data skewness.

—————-
IMPORTANT DATES
—————-
Papers submission deadline: Wed Jan 27, 2017
Authors notification: Sun March 5, 2017
Deadline for camera-ready copy: Sun March 19, 2017
Workshop: Fri May 19, 2017

—————-
SUBMISSION GUIDELINES
—————-
We invite full research or experience papers (up to 10 pages), or short
papers (up to 4 pages) describing research in progress, formatted using
the ACM double-column style
(http://conferences.sigcomm.org/imc/2009/sig-alternate-10pt.cls)

—————-
PUBLICATION
—————-
The workshop proceedings will be published in ACM DL and the organizers
will prepare a SIGMOD Record report.

—————————
ORGANIZERS
—————————
– Foto Afrati (National Technical University of Athens, Greece)
– Jan Hidders (Vrije Universiteit Brussel, Belgium)
– Paris Koutris (University of Wisconsin-Madison, USA)
– Jacek Sroka (University of Warsaw, Poland)
– Jeffrey Ullman (Stanford University)

—————————
Program Committee
—————————

– Paris Koutris, University of Wisconsin-Madison (CHAIR)
– Foto Afrati, National Technical University of Athens
– Sourav S. Bhowmick, Nanyang Technological University
– Yingyi Bu, Couchbase
– Ahmed Eldawy, University of California, Riverside
– Todd Green, LogicBlox
– Jan Hidders, Vrije Universiteit Brussel
– Asterios Katsifodimos, Technical University of Berlin
– Paraschos Koutris, University of Wisconsin-Madison
– Nectarios Koziris, National Technical University of Athens
– Ulf Leser, Humboldt-Universität zu Berlin
– Dionysios Logothetis, Facebook
– Frank McSherry
– Frank Neven, Hasselt University
– Daniel de Oliveira, Fluminense Federal University
– Krzysztof Onak, IBM T.J. Watson Research Center
– Fabio Porto, National Laboratory of Scientific Computation
– Chris Re, Stanford University
– Krzysztof Rzadca, University of Warsaw
– Semih Salihoglu, University of Waterloo
– Mark Santcroos, Rutgers University
– Francesco Silvestri, IT Copenhagen
– Yogesh Simmhan, Indian Institute of Science, Bangalore
– Jacek Sroka, University of Warsaw
– Dan Suciu, University of Washington
– Jeffrey Ullman, Stanford University
– Theodore Vassilakis, Microsoft
– Jianwu Wang, University of Maryland, Baltimore County
– Zhengkui Wang, National University of Singapore
– Ke Yi, Hong Kong University of Science and Technology
– Eiko Yoneki, University of Cambridge
– Matei Zaharia, Stanford University

3rd Workshop on Algorithms and Systems for MapReduce and Beyond

* Call for papers *

BEYONDMR’16
3rd Workshop on Algorithms and Systems for MapReduce and Beyond, July 1, 2016.
https://sites.google.com/site/beyondmr2016/

Held in conjunction with SIGMOD 2016
San Francisco, USA, June 26th – July 1st, 2016
http://sigmod2016.org/

—————-
KEYNOTES
—————-

Author: Ion Stoica, AMPLab, University of California Berkeley

Title: Spark: Past, Present, and Future

Abstract: Almost six years ago we started the Spark project at UC Berkeley.
Spark is a cluster computing engine that is optimized for in-memory
processing, and unifies support for a variety of workloads, including
batch, interactive querying, streaming, and iterative computations. Spark
is now the most active big data project in the open source community, and
is already being used by over one thousand organizations. In this talk,
I’ll take a look back at Spark’s humble beginnings, discuss its current
status, and the new and exciting developments that are coming up.

Author: Carlos Guestrin, University of Washington

Title: Big Data, Small Cluster: Choosing “big memory” (RAM, disks, SSDs) over big clusters

Abstract: TBA

—————-
WORKSHOP FOCUS
—————-

The third BeyondMR workshop aims to explore algorithms, computational
models, architectures, languages and interfaces for systems that need
large-scale parallelization and systems designed to support efficient
parallelization and fault tolerance. These include specialized programming
and data-management systems based on MapReduce and extensions, graph
processing systems, data-intensive workflow and dataflow systems.

We invite submissions on topics such as:

Frameworks for Large-Scale Analytical Processing:
– Models, architectures and languages for data processing pipelines,
data-intensive workflows, DAGs of operations/MapReduce jobs, dataflows,
and data-mashups.
– Extensions of MapReduce with more fundamental functions other than Map
and Reduce and more complex dataflow connections between function inputs
and outputs.
– Expressing and parallelising iterations, incremental iterations, and
programs consisting of large DAGs of operations.
– Approaches to achieving fault tolerance and to recovering from failures.

Algorithms for Large-Scale Data Processing:
– Methods and techniques for designing efficient algorithms for MapReduce
and similar systems.
– Experiments and experience with new algorithms in these settings.

Cost Models and Optimization Techniques:
– Formal definitions of models that evaluate the efficiency of algorithms
in large-scale parallel processing systems taking into account the
requirements of such systems in different applications.
– Testing and benchmarking of MapReduce extensions and data-intensive
workflows.

Resource Management for Many-Task Computing:
– Scheduling of tasks and load-balancing techniques.
– Methods to tackle data skewness.
– Study of cases where automatic data distribution in MapReduce and
similar systems does not provide sufficient data balancing.
– Design of algorithms that avoid skewness.
– Extensions of MapReduce that automatically tackle data skewness.

—————-
IMPORTANT DATES
—————-
Papers submission deadline: Sun March 5, 2016
Authors notification: Sun April 11, 2016
Deadline for camera-ready copy: Sun May 1, 2016
Workshop: Fri July 1, 2016

—————-
SUBMISSION GUIDELINES
—————-
We invite full research or experience papers (up to 10 pages), or short
papers (up to 4 pages) describing research in progress, formatted using
the ACM double-column style
(http://conferences.sigcomm.org/imc/2009/sig-alternate-10pt.cls)

—————-
PUBLICATION
—————-
The workshop proceedings will be published in ACM DL and the organizers will prepare a SIGMOD Record report.

—————————
ORGANIZERS
—————————
Foto Afrati (National Technical University of Athens, Greece)
Jan Hidders (TU Delft, The Netherlands)
Christopher Re (Stanford, USA)
Jacek Sroka (University of Warsaw, Poland)
Jeffrey Ullman (Stanford University)

—————————
Program Committee (in progress)
—————————

– Chris Re, Stanford University (PC chair)
– Foto Afrati, National Technical University of Athens
– Jeffrey Ullman, Stanford University
– Jacek Sroka, University of Warsaw
– Jan Hidders, Delft University of Technology
– Zhengkui Wang, Singapore Institute of Technology
– Khalid Belhajjame, PSL, Universite Paris-Dauphine, LAMSADE
– Sourav Bhowmick, Nanyang Technological University
– Graham Cormode, University of Warwick
– Asterios Katsifodimos, Technical University of Berlin
– Paris Koutris, University of Washington
– Dionysios Logothetis, Facebook
– Frank McSherry, ETH Zurich
– Krzysztof Onak, IBM Research
– Mark Santcroos, Rutgers University
– Gautam Shroff, Tata Consultancy Services RD
– Dan Suciu, University of Washington
– Jianwu Wang, University of Maryland, Baltimore County
– Tim Kraska, Brown University
– Krzysztof Rzadca, University of Warsaw
– Semih Salihoglu, Stanford University
– Ulf Leser, Humboldt-Universität zu Berlin
– Fabio Porto, National Laboratory of Scientific Computation, Brazil
– Eiko Yoneki, University of Cambridge
– Umut Acar, Carnegie Mellon University
– Daniel De Oliveira, Fluminense Federal University
– Tamer Özsu, University of Waterloo
– Anthony Tung, National University of Singapore
– Sergei Vassilvitskii, Google
– Yogesh Simmhan, Indian Institute of Science, Bangalore

Paper accepted at NDSS’15

Our paper on identifying fake accounts in Online Social Networks has been accepted at the 2015 Network and Distributed System Security (NDSS’15) Symposium.

The paper makes the observation that victims, benign users with real accounts that have befriended fakes, form a distinct classification category that is useful for designing robust fake-account detection mechanisms.

You can find more information on the work here and a copy of the paper here.

Submit your work to ParLearning’15

4th International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics

CALL FOR PAPERS

Scaling up machine-learning (ML), data mining (DM) and reasoning algorithms from Artificial Intelligence (AI) for massive datasets is a major technical challenge in the era of “Big Data”. The past ten years have seen the rise of multi-core and GPU-based computing. In distributed computing, frameworks such as Mahout, GraphLab and Spark continue to appear, facilitating the scaling up of ML/DM/AI algorithms through higher levels of abstraction. We invite novel works that advance the three fields of ML/DM/AI through the development of scalable algorithms or computing frameworks. Ideal submissions would be characterized as scaling up X on Y, where potential choices for X and Y are provided below.

Scaling up

  • recommender systems
  • gradient descent algorithms
  • deep learning
  • sampling/sketching techniques
  • clustering (agglomerative techniques, graph clustering, clustering heterogeneous data)
  • classification (SVM and other classifiers)
  • SVD
  • probabilistic inference (Bayesian networks)
  • logical reasoning
  • graph algorithms and graph mining

On

  • Parallel architectures/frameworks (OpenMP, OpenCL, Intel TBB)
  • Distributed systems/frameworks (GraphLab, Hadoop, MPI, Spark etc.)

2nd Workshop on Algorithms and Systems for MapReduce and Beyond

* Call for papers *

BEYONDMR’15
2nd Workshop on Algorithms and Systems for MapReduce and Beyond, March 27, 2015.
https://sites.google.com/site/beyondmr2015/

Held in conjunction with EDBT/ICDT 2015
Brussels, Belgium, March 23-27, 2015
http://edbticdt2015.be

—————-
WORKSHOP FOCUS
—————-
The second BeyondMR workshop aims to explore algorithms, computational models, architectures, languages and interfaces for systems that need large-scale parallelization and systems designed to support efficient parallelization and fault tolerance. These include specialized programming and data-management systems based on MapReduce and extensions, graph processing systems, data-intensive workflow and dataflow systems.

We invite submissions on topics such as:

Frameworks for Large-Scale Analytical Processing:
– Models, architectures and languages for data processing pipelines, data-intensive workflows, DAGs of operations/MapReduce jobs, dataflows, and data-mashups.
– Extensions of MapReduce with more fundamental functions other than Map and Reduce and more complex dataflow connections between function inputs and outputs.
– Expressing and parallelising iterations, incremental iterations, and programs consisting of large DAGs of operations.
– Approaches to achieving fault tolerance and to recovering from failures.

Algorithms for Large-Scale Data Processing:
– Methods and techniques for designing efficient algorithms for MapReduce and similar systems.
– Experiments and experience with new algorithms in these settings.

Cost Models and Optimization Techniques:
– Formal definition of models that evaluate the efficiency of algorithms in large-scale parallel processing systems taking into account the requirements of such systems in different applications.
– Testing and benchmarking of MapReduce extensions and data-intensive workflows.

Resource Management for Many-Task Computing:
– Scheduling of tasks and load-balancing techniques.
– Methods to tackle data skewness.
– Study of cases where automatic data distribution in MapReduce and similar systems does not provide sufficient data balancing.
– Design of algorithms that avoid skewness.
– Extensions of MapReduce that automatically tackle data skewness.

—————-
IMPORTANT DATES
—————-
Papers submission deadline: Dec 11th, 2014
Authors notification: Jan 7th, 2015
Deadline for camera-ready copy: Jan 20, 2015
Workshop: March 27, 2015

—————-
SUBMISSION GUIDELINES
—————-
We invite full research or experience papers (up to 10 pages), or short papers (up to 4 pages) describing research in progress, formatted using the ACM double-column style (http://conferences.sigcomm.org/imc/2009/sig-alternate-10pt.cls)

—————-
PUBLICATION
—————-
The workshop proceedings will be published with EDBT/ICDT in the CEUR Workshop Proceedings (CEUR-WS.org).

—————————
ORGANIZERS
—————————
Foto Afrati     (National Technical University of Athens, Greece)
Jan Hidders     (TU Delft, The Netherlands)
Frank McSherry  (Microsoft Research, formerly)
Paolo Missier   (Newcastle University, UK)
Jacek Sroka     (University of Warsaw, Poland)
Jeffrey Ullman  (Stanford University)

—————————
Program Committee (in progress)
—————————

Umut Acar               (CMU)
Khalid Belhajjame       (University Paris-Dauphine)
Sarah Cohen-Boulakia    (Universite Paris-Sud)
Asterios Katsifodimos   (TU Berlin)
Christoph Koch          (EPFL)
Dionysios Logothetis    (Telefonica Research)
Marta Mattoso           (Federal University of Rio de Janeiro)
Frank McSherry (Chair)  (Microsoft Research, formerly)
Derek Murray            (Microsoft Research, formerly)
Jelena Pjesivac-Grbovic (Google)
Christopher Re          (Stanford)
Krzysztof Rzadca        (University of Warsaw)
Piotr Sankowski         (University of Warsaw)
Mark Santcroos          (Rutgers)
Sergei Vassilvitskii    (Google)
Jianwu Wang             (UCSD)