Friday, October 20, 2017

An open, crowdsourced, privacy-preserving virtual assistant (part 1)

Hello all,

I have been encouraged to blog more, and that is not hard because anything is more than nothing. So here I am, while waiting for some Mechanical Turk results, with part one of a series of posts in which I'll explain the work we've been doing in the Mobisocial Computing Laboratory here at Stanford.[1][2]

To put it quickly, we're building a Virtual Assistant called Almond[3]. Think Siri or Alexa.
But 1) open, 2) privacy-preserving, and, most importantly, 3) way more powerful.

An Open Assistant

Alexa has 15,000 skills. If you can think of an IoT device, Alexa supports it. If you can think of some major website or web service, Alexa supports it. Amazon has over 5,000 developers working on Alexa, and there are four students and one professor working on Almond. How do you compete with that?
In short, you don't. Instead, you open the platform up, and let people contribute. Amazon didn't write the 15,000 skills themselves: people did, when the Echo became popular. In a similar spirit, they didn't write all the cool NLP technology: they ran the Alexa Challenge, and let the best researchers compete.
Now, we don't have $1M to offer, but being a university, even if a private one, we're in an especially good position. So we decided to open source everything we do[4].
You can take our software, our ideas, our contributions, and make something cool with it. Our hope is that as people find it and start playing with it, they'll also contribute back, and the platform will grow.

A Privacy-Preserving Assistant

If you build an open assistant but don't care about privacy, the first thing people will do is take it, fork it, and make it care about privacy. The overlap between people who understand computers, people who care about software freedom, and people who care about privacy is remarkable. So we decided to do the work ourselves.
Our Assistant is built on two major principles to make it respect privacy:
  • Separate what is private from what is public: we collected the public information, such as the descriptions and glue code of the supported services, and the datasets and machine-learning models for semantic parsing, into an open repository called Thingpedia; everything else, such as your credentials, your data, and your history of commands, stays in a private system called ThingSystem.
  • Let you run your own Almond: your ThingSystem contains your stuff only, so we let you run it on the device you choose; there is no risk of privacy leaks, because the data never actually leaves the device. To make it easier to set up, we have a version of ThingSystem that works as a ready-to-install Android application (available on the Play Store); we also have a version for desktop computers (based on GNOME technologies)[5] and one for Echo-style home assistants.
The two sides of the world, Thingpedia and ThingSystem, communicate through a third Thing*: ThingTalk, which is in my opinion the coolest part of Almond[6]. This is a formal programming language (a domain-specific language, if you're into that) that describes exactly what the assistant should do. When you give a command to Almond, the command is translated into a ThingTalk program; this ThingTalk program makes use of Thingpedia the way a regular Java program would use Maven or a Node program would use npm. Your ThingSystem just takes this program, loads the libraries it needs, and runs it.
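To make the public/private split concrete, here is a minimal sketch in plain Python (not actual ThingTalk or ThingSystem code) of how a ThingSystem-style runtime might pull public glue code from a Thingpedia-like repository and run a program against credentials stored only on your device. All of the names below (fetch paths, `load_device`, `load_credentials`, the program format) are hypothetical and exist purely to illustrate the idea.

```python
# Illustrative sketch only: not the real ThingTalk or ThingSystem APIs.
# It models the split between a public repository of device "glue code"
# (Thingpedia) and a private store of credentials and data (ThingSystem).

import importlib.util
import json
from pathlib import Path

PUBLIC_REPO = Path("thingpedia")      # hypothetical local mirror of public glue code
PRIVATE_STORE = Path("thingsystem")   # hypothetical private store; never leaves the device

def load_device(device_id: str):
    """Load the public glue code for a device, like a package manager would."""
    spec = importlib.util.spec_from_file_location(device_id, PUBLIC_REPO / f"{device_id}.py")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

def load_credentials(device_id: str) -> dict:
    """Read this user's credentials from local, private storage."""
    with open(PRIVATE_STORE / "credentials.json") as f:
        return json.load(f).get(device_id, {})

def run_program(program: list):
    """Run a parsed program: each step names a device and a function to call."""
    for step in program:
        device = load_device(step["device"])       # public code, shared by everyone
        creds = load_credentials(step["device"])   # private data, stays on this device
        getattr(device, step["function"])(creds, **step.get("params", {}))
```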

A Way More Powerful Assistant

We're PhD students at a university, so unlike Amazon, Google, or even Red Hat and Canonical, building a useful open-source assistant that people will care about is not enough. You need to build something technologically new (or, as they say here, “novel”).
Our work checks that box by letting people do things that no other assistant can. Specifically, our assistant supports end-user programming, which is a fancy way of saying that the users, not the programmers, should decide what the assistant can do.
Now, an end user will never “program” in the sense of writing an app in a traditional language - it's not their job, they don't care, they want things done. Instead, we interpret this to mean connecting things that already exist, in new and interesting ways. At the core, our ThingTalk programming language supports a single construct, when - get - do, which lets you specify that when something happens, you get some data and do something with it.
For example, you can generate a meme and then send it to your friends in one go; you can look at your location and turn off the lights if you're not home; you can reply to emails automatically, and even have a different reply for your boss, for your family and for scammers; you can send cute pictures to your SO automatically (I do that).
Every portion of the program is a high-level primitive, roughly corresponding to a single Alexa command, but now you can combine three of them in one sentence, meaning you can take the output of one command and feed it into the next. The result can also run on its own, so you go through this rigmarole of setting up whens and gets and dos once, and then the assistant just does things for you. Additionally, you get to apply arbitrary predicates that limit when the action should run - like email filters, but for anything. Because this is a fully general system, any combination you can think of is allowed. You just say it in natural language, and Almond does it for you.
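As a rough sketch of the idea, a when - get - do rule can be thought of as a trigger stream, an optional query, a filter predicate, and an action. The toy Python below is not ThingTalk syntax; the `Rule` class and the fake inbox are hypothetical and only show the shape of the construct and how a filter limits when the action fires.

```python
# Illustrative only: a toy model of the when - get - do construct,
# not the actual ThingTalk language or runtime.
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass
class Rule:
    when: Iterable                              # stream of trigger events
    get: Optional[Callable]                     # optional query for extra data
    do: Callable                                # the action to perform
    filter: Callable = lambda event: True       # predicate, like an email filter

    def run(self):
        for event in self.when:
            if not self.filter(event):          # skip events the predicate rejects
                continue
            data = self.get(event) if self.get else event
            self.do(data)                       # do something with the data

# "When I receive an email from my boss, reply automatically."
rule = Rule(
    when=iter([{"from": "boss@example.com", "subject": "status?"}]),  # fake inbox
    get=None,
    do=lambda e: print(f"Auto-replying to {e['from']}"),
    filter=lambda e: e["from"] == "boss@example.com",
)
rule.run()
```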
Now, this raises some technical challenges, starting from the fact that the natural-language commands are a lot more complicated than what Alexa can understand. Alexa gets away with 15,000 skills because it forces you to use one of a few magic words (“ask”, “tell”, “open”, “play”) followed by the exact name of the thing you want; after that, you only have a few choices of commands. It would be totally uncool if we required you to do that: “Almond, connect GMail to GMail when I receive an email then send an email” just sounds weird, while “Almond, reply to my emails automatically” sounds a lot more natural. I'll go into more detail on semantic parsing in the next blog post in this series, but suffice it to say, this is hard and nobody has a magic bullet yet.
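To see why the magic-word approach is easier, here is a toy Python sketch of a rigid “magic word + skill name + command” matcher, roughly in the spirit of “ask <skill> to <command>”. It is purely hypothetical (not Alexa's actual grammar): the point is that a compound, free-form sentence like “reply to my emails automatically” simply does not fit this shape, so a pattern matcher cannot handle it and real semantic parsing is needed.

```python
import re

# Toy "magic word + skill name + command" grammar; not Alexa's real grammar.
PATTERN = re.compile(
    r"^(ask|tell|open|play)\s+(?P<skill>[\w ]+?)(?:\s+to\s+(?P<command>.+))?$",
    re.IGNORECASE,
)

def parse(utterance: str):
    m = PATTERN.match(utterance.strip())
    if not m:
        return None  # free-form sentences fall through
    return {"skill": m.group("skill"), "command": m.group("command")}

print(parse("ask My Weather to give me the forecast"))
# {'skill': 'My Weather', 'command': 'give me the forecast'}
print(parse("reply to my emails automatically"))
# None: no magic word and no skill name, so the pattern matcher gives up
```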
Additionally, when - get - do on your own stuff by voice is nice, but the sentences get long and repetitive real quick, and the programming is still quite limited. Can we do better for everyday repetitive tasks? Can we be more engaging than monotone serial voice output? Can we let you share your stuff with other people, so they can operate on it from their Almonds? Can we let you put any sort of restriction on who can touch your stuff, and where, when, and what they can do with it? Turns out we can, but that is material for the next posts in this series. Stay tuned!


PS: I changed the title of the blog because sadly I don't do as much GNOME stuff as I used to. I don't do that much C++ either, but heh.

[1] This is work done in collaboration with Rakesh Ramesh, Silei Xu and Michael Fischer, under the supervision of prof. Monica Lam.
[2] I should emphasize that the views expressed in this post are mine and do not reflect the view of Stanford, the Mobisocial Lab, or my colleagues.
[3] Yes, like the nut. It's because we're nuts.
[4] https://github.com/Stanford-Mobisocial-IoT-Lab.
[5] I should eventually turn around and package it as a flatpak. For now you need to build it from source. Help welcome!
[6] Also my thesis. But my judgement is totally unbiased, believe me.
