Generating Art & Music – with Douglas Eck | DEEPDREAM SYMPOSIUM


My name is Doug. First I'd like to thank the Gray Area organizers and the Cerebra Google organizers. I think it's awesome that you were able to come out here on a Saturday afternoon and geek out a little bit on machine learning and art. I had the chance this year to start what for me is a dream project, which is to build a team at Google Brain to work full-time on the problem of generation. As Mike said, I'm focusing on sequence generation. We call the project Magenta, for music and art generation, but really we're focusing on music and video and other types of sequences – storytelling. I want to spend a little time – I think there are some interesting questions behind this – highlighting some of those questions and giving an idea of where we might go with this. I also want to highlight the connection that Magenta has with TensorFlow. We're going to be working to open-source a number of tools for doing machine learning with media, and we hope to drive the field a little bit. Actually, one reason I think it's so exciting right now is that science moves with machinery; science moves with instrumentation. You invent a telescope and suddenly there are a bunch of new discoveries that come from that telescope. And this current renaissance in machine learning – the kind of scale at which we're doing machine learning – is, I think, a kind of telescope, and it's going to allow a new generation of researchers to ask questions that we haven't been able to ask before. So that's super exciting.

This slide was meant to talk about what deep learning is. That job has been done by some of the best communicators in the world, so I can move quickly through it. I'm particularly amused by Yann LeCun's slides showing the filters at different levels of a neural network trained to recognize images. It has been pointed out before, but I really love it: if you look carefully at the right-hand side, deeper in the network, if you excite a neuron and have it generate something, you see the filters are really capturing specific classes. So that's just another view of some of the stuff that we saw in the previous three talks.

Ok, so if there's a big goal
that I have for Magenta, it is, simply put, to generate media that is so good and so compelling that people want to come back to it week after week and interact with it. I think that's an astonishingly hard goal. Moving from generating a point of beauty to having a kind of long-term engagement with an algorithm – such that you care enough about it to let it generate the soundtrack for your job, or wallpaper your house with some sort of changing image, or tell you stories about the trip you took by editing your photo collection for you – is really, really hard. I wanted to focus on some of the underlying components that I think are important, namely attention, surprise, and interestingness, and see if we can get some ideas about how we might use neural networks to get us part of the way there. And I think it's clear we have to talk about getting only part of the way there. In the same way that a drum machine gets us part of the way there, I think we'll be providing tools for artists and musicians to work with. I personally am not thinking about sitting in an easy chair between my two loudspeakers listening to something that we composed – these are tools.
So I'm going to do a quick Cook's tour of some work that I think is interesting and try to highlight why it's interesting. First, in terms of attention: what is attention at the simplest level? Think of what your visual system is doing when you're looking at something. You're unconsciously, in fact uncontrollably, jumping around the image and sampling it. You are saccading from point to point, and there's nothing you can do to stop your brain and your visual system from doing that. At some level there's a kind of automatic attention happening in how we're sampling the space, sampling the world around us. That's one way to use the term attention. Another way to talk about attention is actual conscious attention: I'm going to think about a cup of coffee right now, or I'm not going to think about a cup of coffee right now. Right? Even harder. And then third – and I'm mixing this around a bit in terms of what we might want to do with art – can the creator of the art actually draw the user's attention in some interesting way? I think it's hard to disagree that that's definitely part of the process. Right? You're choosing something to draw out, to pull the user along, at the sacrifice of some other things. And it turns out there are some ways – pretty primitive by the standards of what we do as humans, but very interesting – in which you can use recurrent
neural networks to model attention. This is recent work from ICML 2015, from Yoshua Bengio's lab in Montreal. Actually, I was a professor in that lab before joining Google – that's not why I picked it, but still. The goal in that work is to generate a sentence that describes an image, and what the attention mechanism does is look at different parts of the image and focus on one part at a time as part of making the sentence. A simple one right here is "a giraffe standing in a forest with trees in the background." It may be hard to read the words, but they're printed on every image. If we look at the top row: "a giraffe standing." On the second row, "in a forest," you'll notice that the model is looking all around the giraffe but not at the giraffe – the attention has shifted what it's looking at, et cetera. "With trees in the background," and the model has pulled back completely. So what's happening is that a sequence model is being used to create a story around this picture. It is the most primitive story you can imagine, and if you look at the other examples from this paper they're quite primitive. But it still suggested to me that we might want something that generates amusing things, like stories that we think about for the sheer pleasure of it.

So I'm going to use this term LSTM
going forward. I think my only real claim to getting to run this project is that in 2002 I failed to produce compelling music with a recurrent neural network, and so I'm back to do it again with much larger networks – so maybe it will be a much larger failure, we'll see. We don't have time for a technical dive into recurrent neural networks, but the basic idea is that a recurrent neural network has a self-connection, and we can use that self-connection from time step to time step to try to remember something about the past. If you happen to know digital signal processing, it's an infinite impulse response (IIR) filter, not a finite impulse response filter. And when you connect something to itself and start playing with it, what you find is
these things really tend to blow up. They blow up quite easily. And in the nineties there was a really nice observation, with a nice analysis, that said a recurrent neural network can do one of two things: if you set the weights high enough, it's guaranteed to blow up, and if you set the weights low enough, it's guaranteed never to remember anything interesting. It's called the vanishing gradient problem: if you think about it in terms of the error flowing through these networks, you can't pass it back far enough in time to learn anything from it.
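You can see the problem in miniature: in a toy linear recurrence h_t = w * h_{t-1}, the gradient through T time steps scales as w^T. A purely illustrative sketch, not from the talk:

```python
# Toy illustration: in a linear recurrence h_t = w * h_{t-1},
# the gradient through T time steps scales as w**T.
for w in (1.1, 0.9):
    grad = 1.0
    for _ in range(100):      # backpropagate through 100 time steps
        grad *= w
    print(f"w = {w}: gradient factor after 100 steps = {grad:.3g}")
# w = 1.1 -> ~1.38e+04 (blows up); w = 0.9 -> ~2.66e-05 (vanishes)
```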
The fundamental breakthrough, long short-term memory, came up with this idea of protecting a neuron. The 'C' in the middle of the diagram – think of that as just a neuron. You can see that it has a little self-recurrent connection that it uses to remember something from the past as it's trying to solve the problem. All of the mechanism around it is what are called gates, and those gates are there to protect that cell, so that the cell can count – keep track of things over time – in a linear fashion, with the gradient flowing linearly back through the model, the input and the output protected by multiplicative gates.
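In code, that gating is compact. Here's a minimal sketch of a single LSTM step in plain Python, following the standard formulation (biases omitted for brevity; an illustration, not Magenta code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One step of a standard LSTM cell (biases omitted).

    The cell state c is the protected neuron: multiplicative gates
    decide what gets in (i), what is kept (f), and what gets out (o),
    so c is updated almost linearly and gradients can flow back
    through it.
    """
    z = np.concatenate([x, h_prev])   # current input + previous output
    i = sigmoid(W["i"] @ z)           # input gate
    f = sigmoid(W["f"] @ z)           # forget gate
    o = sigmoid(W["o"] @ z)           # output gate
    g = np.tanh(W["g"] @ z)           # candidate values
    c = f * c_prev + i * g            # protected, near-linear cell update
    h = o * np.tanh(c)                # gated output
    return h, c
```

How many people actually understood what I was saying?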
All right, I'm going to just move on here. I threw that LSTM cell in as an aside; we'll get back to these graphs when I get to the music examples in a minute. This attention model – you can stop the model in the middle of what it's doing and just look at what the model is looking at. It's looking at a girl, a Frisbee, a dog, et cetera. It's pretty amazing. You can even analyze errors. I don't want to spend too much time on this, but the one in the middle of the top row is really cool. The model actually generated the sentence "a woman holding a clock in her hand," and it's actually a sweatshirt with the logo of a college that looks like a clock. So it has managed… [audience comment] Which one? We have… there's work to be done here. Right? All right, this is a short talk. Just to close up this attention thing: for me there is something absolutely intriguing about the idea that you might have a second model that is moderating a first model's attention mechanism. We might be able to use that.
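The attention mechanism itself is small. Here's a minimal sketch of soft attention in the spirit of that ICML 2015 work ("Show, Attend and Tell"); the weight names and the scoring function are simplified illustrations, not the paper's exact parameterization:

```python
import numpy as np

def soft_attention(features, h, W_f, W_h, v):
    """Weight image regions by relevance to the decoder state h.

    features: (num_regions, d) array of CNN features, one per region
    h: the sentence decoder's current hidden state
    Returns the attention weights and the blended context vector.
    """
    scores = np.tanh(features @ W_f + h @ W_h) @ v  # one score per region
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                            # softmax: where to look
    context = alpha @ features                      # blended "glimpse"
    return alpha, context
```

At each word the decoder gets a different context vector, which is why the model looks at the giraffe for "giraffe" and around it for "forest."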
The second idea is surprise. Art is surprising; art should be surprising. It would also be surprising if this building fell down right now, and that's another kind of surprise that's less interesting. One thing I think we can do with deep networks that is interesting, vis-à-vis surprise, is play different semantic spaces off of each other. By that I mean, very abstractly, you can imagine a language model that knows something about the world and an image model that knows something about the world, and they think slightly different things about something that's happening, so there's some friction you can create. Very abstract. My favorite example of this is from a 2011 paper called "That's What She Said: Double Entendre Identification." It is, I think, the simplest example of how you might get a surprising effect from something very, very simple. The problem they tried to solve was: can they read Twitter feeds and figure out when they should apply the hashtag #thatsWhatSheSaid? That's cool, right? And they did it, and people laughed at it. Ok, so now just put your thinking hat on for a minute and imagine, "I'm going to solve that problem." So you're going to go mine Twitter, and maybe you'll get some positive examples and some negative examples or whatever, and you'll use TensorFlow and write something like a huge Inception graph to solve this problem. It's actually pretty hard, right? So here's what they did do – this is why I bring it up. They had two language models. One language model is trained on everyday language. By "language model" I mean, just think abstractly: I can give this model a string like "hello Doug, how are you," and the model will come back and say how probable that string is. How likely is that string? If you haven't read the paper – can anybody guess what the other model is trained on? If you've read the paper, don't give it away. [Audience comment] Bad jokes? No, no, that's good. No – erotica! Erotica. You have a second model, and I hope you just run this forward in your head. Now you can ask for the probability of a string: how likely was this in porn versus how likely is this in the real world? Right? You're laughing because it's funny! So all they did is get the probability of a string – a string like "hello Doug, how are you," or, you know, "that's a very big microphone," right?
And guess what? If it's probable in both corpora, what do you do? No, no, no – you don't do anything. That's the funny part. If it's probable in both, you don't do anything, because "hello there" is just probable everywhere. But if it's probable in porn and not all that probable in the real world, you just tune that threshold and – boom! That's what she said. That's all they do! The reason I'm spending time on this is that I think it's deeply interesting that you can get such an effect from such a simple bit of machinery: just two probability distributions, but the right probability distributions are chosen.
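A minimal sketch of that decision rule (the two models and their `logprob` scoring method are stand-ins for whatever language models you have, and the threshold value is illustrative):

```python
def thats_what_she_said(sentence, everyday_lm, erotica_lm, threshold=5.0):
    """Fire the tag only when the sentence is far more probable under
    the erotica model than under the everyday-language model.

    "hello there" is probable everywhere, so the ratio stays near zero
    and nothing happens; the joke lives in the asymmetry.
    """
    # Log-probability ratio between the two corpora.
    ratio = erotica_lm.logprob(sentence) - everyday_lm.logprob(sentence)
    return ratio > threshold  # tune the threshold and... boom
```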
So in terms of music, you might think of models that understand something about tonality versus models that understand something about rhythm, et cetera. And then, instead of trying to do everything with one hyper-smart model, actually play models against one another. I think this is also one way in which we generate tension. So, surprise. Now some demos. Oh, by the way, do you know what I mean when I say playing games with tonality? I have to do this because it makes people uncomfortable. [sings] Happy birthday to you, happy birthday to you, happy birthday dear Doug. [stops singing] That's very, very far out from the tonal center and generates some tension for some people. I'll sing the rest of it later, but we'll move on for now.

Ok, so we're going to talk about generating music
with a recurrent neural network, in this case from 2002. This is what didn't work very well, but I want to get across the basic idea. The basic paradigm is that we have some training data and we want the model to be a stable music generator. So let's just play this, hear it, get it over with. Three hours of this – get your sandwiches out. One of my goals was actually ongoing, stable music generation, so that the model can kind of keep playing, get perturbed, and come back and play some more. Think of it like a stable limit cycle in some space. That ends up being really hard; it's really hard to get stability. But let's see, if we move forward in the world here, how
much better are we doing after 12 years? Is Elliott here? No. Ok, so this is someone on the Magenta team. He's been working with models that, instead of just trying to learn from the previous note, are given the previous note plus a few notes having to do with the metrical structure of the piece – so, like, four notes ago and a measure ago.
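A sketch of what those inputs might look like (the offsets, names, and rest handling here are illustrative, not the team's actual feature set):

```python
def metrical_context(notes, t, steps_per_measure=16):
    """Inputs for predicting notes[t]: not just the previous note but
    also notes at metrically meaningful offsets (e.g. four steps ago,
    one measure ago). Offsets that reach before the start of the piece
    fall back to None, standing in for a rest.
    """
    offsets = [1, 4, steps_per_measure]
    return [notes[t - k] if t - k >= 0 else None for k in offsets]

# e.g. the model's input features for step 20 of a melody:
# context = metrical_context(melody, 20)
```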
I'll play a couple of these just to give you an idea of where we are. We're just starting this project – most of what we're doing right now is infrastructure – but I couldn't bear to come here with no demos, so we did something. You'll hear two different instruments: the first instrument is the network being primed, and then the network continues on. Oops. Or not. I wish we could see these. There's a second one where he just plays some sustained quarter notes jumping up and down, and it just falls apart. So that's where we are with recurrent neural networks.

There's also this idea that we had, which is… can you guys see these images? I'm not sure from this angle if you can see the grey dots. Can anybody tell me what those are? [audience responses] Yes – they're piano rolls of Bach music. It turns out you can also treat music like an image, and you can try to generate from the images.
So Dan, who is here on the team, was working with a generative model that we'll hear about later, trying to generate these pictures of music, and then we just turn them back into audio.
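A piano roll is just a time-by-pitch matrix, which is what lets you treat music like an image. A minimal sketch of the conversion, assuming notes come as (pitch, start_step, end_step) triples:

```python
import numpy as np

def to_piano_roll(notes, num_steps, num_pitches=128):
    """Render (pitch, start_step, end_step) notes as a binary image:
    one row per pitch, one column per time step."""
    roll = np.zeros((num_pitches, num_steps), dtype=np.uint8)
    for pitch, start, end in notes:
        roll[pitch, start:end] = 1
    return roll

# Reading the active pitches back off each column is how a generated
# "picture of music" gets turned back into notes, and then audio.
```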
Right now it's very patch-like, but it's still interesting to hear what it's doing. [music, audience laughter] I don't know how Dan feels about you laughing – just putting that out there. He's here, people. [music] It's like some soundtrack composer had a stroke.

So I'm closing up here. I could talk about this for hours, but there are some deep problems with music having to do with repetition and long-timescale structure. One thing you can observe is that with repetition comes structure. If you look back at forms of music that don't repeat, they tend to have flat structure, like Gregorian plainsong. When you start to add repetition – the repetition of motifs and then the modulation of motifs – you get meter for free, because I think the brain just likes to repeat things in chunks. There's something really nice about that.
With repetition, however, comes what I think is a puzzle for learning how to compose music with neural networks, which is that, at some level, any old melody will do as long as you repeat it. This is the jazz adage: if you make a mistake, make it four times. I'm going to prove that to you right now. You can read on screen what's happening. Your brain kind of slides over that, right? It's like elevator Bach – that's not fair to Bach; that's not fair to elevators. What we're going to do is just repeat it now. Note to yourselves: there's no melody there, per se. We're just going to repeat it. This interests me a lot, because when you train a neural network, it's trained on the notes, right? But if you can use any old notes – if what's important is how you repeat and modulate the notes – that's a second-order thing for a network to figure out, and it's actually quite a puzzle. It would be almost as if, with human language, it didn't really matter which syllables we used as long as we repeated them the right way, and that's how we ordered dinner. It's not quite how it works.
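You can demonstrate the adage with almost no machinery (a toy sketch: the pitches are random, so any structure you hear comes entirely from the repetition):

```python
import random

# Any old melody will do, as long as you repeat it: four random pitches...
motif = [random.choice(["C4", "D4", "E4", "G4", "A4"]) for _ in range(4)]

# ...but repeat the motif four times and it starts to sound like a phrase.
phrase = motif * 4
print(phrase)
```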
I'm probably out of time, right? Ok, one more and I'm done. This is just a last, final thought. One was about repetition; the other is about going back to, and tying in with, what Blaise had to say, which is that we use the world to support cognition in very interesting ways, and we use our bodies to support cognition in very interesting ways. If you think about music and video long enough, one thing you end up thinking about is: how do you generate audio? And the first thing I would observe is that brains don't generate audio. When I'm speaking to you, I'm using machinery – vocal cords and a filter, my vocal tract – to make noise. And when I play a musical instrument, I'm actually just hammering on things or strumming on things. In both cases my brain never actually has to solve the problem of generating 44.1 kHz audio, thankfully, because it's hard to do. There are domain experts in the room who will tell you that it's hard to do.

I wanted to play one last demo for you, from work I did at the University of Montreal before coming here, because I think it shows what the body brings to the table, and what performance brings to the table, vis-à-vis a composition. I'm going to play for you two performances of the same piece of music, done on the Bösendorfer Imperial Grand at the University of Montreal. In neither case was there a pianist sitting there, so it's the same mechanism, and all you're going to hear is the difference between a piece of music played robotically, just as the notes say it should be played, and then a performer coming in and playing the same piece. The only point I'm making with this is that there's a lot being brought to the table that has nothing to do with the logical or symbolic structure of the music. So here – this is fun.

The observation I would make about that: first of all, that was nicer, right? Going back to attention and storytelling, it's absolutely clear that the performer is telling us a story, and that the performer is doing that by drawing our attention to some things at the expense of other things. It's so clear that that is happening. There's all kinds of musicology behind it – voice leading, phrase-final lengthening, reasons why performers make the choices they make – but fundamentally they are telling a story, and that story is not in the score. There are hints to the story in the score, but the story really gets told when you combine the score with a performance of the score in the real world, and I think that's just fascinating and cool. Ok, so that's all I have. Thank you for your attention.
