Rajat Monga: TensorFlow

Lex Fridman Podcast XX


Full Transcription:

[0] The following is a conversation with Rajat Monga.

[1] He's an engineering director at Google, leading the TensorFlow team.

[2] TensorFlow is an open-source library at the center of much of the work going on in the world in deep learning, both the cutting-edge research and the large-scale application of learning-based approaches.

[3] But it's quickly becoming much more than a software library.

[4] It's now an ecosystem of tools for the deployment of machine learning in the cloud, on the phone, in the browser, on both generic and specialized hardware.

[5] GPU, TPU, and so on.

[6] Plus, there's a big emphasis on growing a passionate community of developers.

[7] Rajat, Jeff Dean, and a large team of engineers at Google Brain are working to define the future of machine learning with TensorFlow 2.0, which is now in alpha.

[8] I think the decision to open source TensorFlow is a definitive moment in the tech industry.

[9] It showed that open innovation can be successful and inspire many companies to open source their code, to publish, and in general engage in the open exchange of ideas.

[10] This conversation is part of the artificial intelligence podcast.

[11] If you enjoy it, subscribe on YouTube, iTunes, or simply connect with me on Twitter at Lex Fridman, spelled F-R-I-D.

[12] And now, here's my conversation with Rajat Monga.

[13] You were involved with Google Brain since its start in 2011 with Jeff Dean.

[14] It started with DistBelief, the proprietary machine learning library, and turned into TensorFlow in 2014, the open-source library.

[15] So what were the early days of Google Brain like?

[16] What were the goals, the missions?

[17] How do you even proceed forward when there are so many possibilities before you?

[18] It was interesting back then, you know, when I started, or even when we were just talking about it: the idea of deep learning was interesting and intriguing. In some ways it hadn't yet taken off, but it held some promise.

[19] It had shown some very promising early results.

[20] I think the idea where Andrew and Jeff had started was, what if we can take this, what people are doing in research, and scale it to what Google has in terms of the compute power, and also put that kind of data together.

[21] What does it mean?

[22] And so far, the results had been that if you scale the compute and scale the data, it does better. Would that work?

[23] And so that was the first year or two, can we prove that out, right?

[24] And with DistBelief, when we started, the first year we got some early wins, which is always great.

[25] What were the wins like?

[26] What were the wins, where you thought, there's something to this, this is going to be good?

[27] I think there were two early wins. One was speech, where we collaborated very closely with the speech research team, which was also getting interested in this.

[28] And the other one was on images, where, you know, the cat paper, as we call it, was covered by a lot of folks. And the birth of Google Brain was around neural networks, so it was deep learning from the very beginning. That was the whole mission. Yeah. So, in terms of scale, what was the sort of dream of what this could become? Were there echoes of this open-source TensorFlow community that might be brought in? Was there a sense of TPUs? Was there a sense that machine learning is now going to be at the core of the entire company, that it's going to grow in that direction?

[29] Yeah, I think, so that was interesting.

[30] Like if I think back to 2012 or 2011, the first question was, can we scale it?

[31] Within a year or so, we had started scaling it to hundreds and thousands of machines.

[32] In fact, we had some runs even going to 10,000 machines, and all of those showed great promise.

[33] In terms of machine learning at Google, the good thing was Google's been doing machine learning for a long time.

[34] Deep learning was new, but as we scaled this up, we showed that, yes, that was possible and it was going to impact lots of things.
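(An aside for readers: the "scale the compute, scale the data" recipe Rajat describes survives in today's TensorFlow as the tf.distribute API. What follows is a minimal sketch under that assumption, using modern TF 2.x illustration code with a toy model and synthetic data; it is not the DistBelief system itself, which was an internal, parameter-server design.)

    import tensorflow as tf

    # Data-parallel training across the local devices available;
    # each replica sees a slice of every batch and gradients are
    # aggregated across replicas.
    strategy = tf.distribute.MirroredStrategy()
    print("Replicas in sync:", strategy.num_replicas_in_sync)

    with strategy.scope():
        # Toy model; stands in for whatever is being scaled up.
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        model.compile(optimizer="sgd", loss="mse")

    # Synthetic data just to make the sketch runnable end to end.
    xs = tf.random.normal((256, 8))
    ys = tf.random.normal((256, 1))
    model.fit(xs, ys, epochs=1, batch_size=64)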

[35] Like, we started seeing real products wanting to use this.

[36] Again, speech was the first.

[37] There were image things that Photos came out of, and many other products as well.

[38] So that was exciting.

[39] As we went on with that for a couple of years, externally also, academia started to, you know, there was lots of push on, okay, deep learning is interesting, we should be doing more, and so on. And so by 2014 we were looking at, okay, this is a big thing, it's going to grow, and not just internally but externally as well. Yes, maybe Google's ahead of where everybody is, but there's a lot to do. So a lot of this started to make sense and come together. So the decision to open source, I was just chatting with Chris Lattner about this, the decision to go open source with TensorFlow, I would say, sort of for me personally, seems to be one of the big seminal moments in all of software engineering ever.

[40] I think that's when a large company like Google decides to take a large project, one that many lawyers might argue has a lot of IP, and just decides to open source it, and in so doing leads the entire world in saying, you know what, open innovation is a pretty powerful thing and it's okay to do.

[41] That was, I mean, that's an incredible moment in time.

[42] So do you remember those discussions happening, whether open source should be happening?

[43] What was that like?

[44] I would say, I think, so the initial idea came from Jeff, who was a big proponent of this.

[45] I think it came off of two big things.

[46] One was research wise.

[47] We were a research group.

[48] We were putting all our research out there.

[49] We were building on others' research and we wanted to push the state of the art forward.

[50] And part of that was to share the research.

[51] That's how I think deep learning and machine learning has really grown so fast.

[52] So the next step was, okay, now would software help with that?

[53] And it seemed like there were a few existing libraries out there, Theano being one, Torch being another, and a few others.

[54] But they were all done by academia, and so the level was significantly different.

[55] The other one was, from a software perspective, Google had done lots of software that we used internally, you know, and we published papers.

[56] Often there was an open source project that came out of that, that somebody else picked up that paper and implemented, and they were very successful.

[57] Back then it was like, okay, there's Hadoop, which has come off of tech that we've built.

[58] We know the tech we've built is way better for a number of different reasons.

[59] We've invested a lot of effort in that.

[60] And it turns out we have Google Cloud, and we are now not really providing our tech, but we are saying, okay, we have Bigtable, which is the original thing.

[61] We are going to now provide HBase APIs on top of that, which isn't as good, but that's what everybody's used to.

[62] So it's like, can we make something that is better and really just provide it?

[63] It helps the community in lots of ways, but also helps push a good standard forward.

[64] So how does cloud fit into that?

[65] There's the TensorFlow open-source library, and how does the fact that you can use so many of the resources that Google provides in the cloud fit into that strategy?

[66] So TensorFlow itself is open, and you can use it anywhere, right?

[67] And we want to make sure that continues to be the case.

[68] On Google Cloud, we do make sure that there's lots of integrations with everything else, and we want to make sure that it works really, really well there.

[69] So you're leading the TensorFlow effort.

[70] Can you tell me the history and the timeline of TensorFlow project in terms of major design decisions?

[71] So like the open source decision, but really, you know, what to include and not.

[72] There's this incredible ecosystem that I'd like to talk about.

[73] There's all these parts.

[74] But what are just some sample moments that defined what TensorFlow eventually became through its, I don't know if you...

[75] are allowed to say history when it's just a few years, but in deep learning everything moves so fast that just a few years is already history. Yes, yes. So looking back, we were building TensorFlow, I guess we open sourced it in November 2015. We started on it in the summer of 2014, I guess, and somewhere like three to six months in, by late 2014, we had decided that, okay, there's a high likelihood we'll open source it.

[76] So we started thinking about that and making sure we're heading down that path.

[77] By that point, we had seen, you know, lots of different use cases at Google.

[78] So there were things like, okay, yes, you want to run at large scale in the data center.

[79] Yes, we need to support different kinds of hardware.

[80] We had GPUs at that point.

[81] We had our first TPU at that point, or it was about to come out, you know, roughly around that time.

[82] So the design sort of included those.

[83] We had started to push on mobile, so we were running models on mobile.

[84] At that point, people were customizing code.

[85] So we wanted to make sure TensorFlow could support that as well, so that sort of became part of the overall design.

[86] When you say mobile, you mean like a pretty complicated algorithm running on the phone?

[87] That's correct.

[88] So when you have a model that you deploy on the phone and run it there, right?

[89] So already at that time, there was ideas of running machine learning on the phone.

[90] That's correct.

[91] We already had a couple of products that were doing that by then.

[92] Right.

[93] And in those cases, we had basically customized, handcrafted code or some internal libraries that we were using.
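(An aside for readers: the handcrafted on-device code described here predates TensorFlow Lite, but the deploy-a-model-to-the-phone step it solved looks roughly like this with today's TF Lite converter. A sketch only; the model below is a placeholder, not a real product model.)

    import tensorflow as tf

    # Placeholder model standing in for one trained elsewhere.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # Convert to the compact flatbuffer format that the on-device
    # TF Lite interpreter executes.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_model = converter.convert()

    # This file is what ships inside the mobile app.
    with open("model.tflite", "wb") as f:
        f.write(tflite_model)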

[94] So I was actually at Google during this time in a parallel, I guess, universe, but we were using Theano and Caffe.

[95] Yeah.

[96] Was there some degree to which you were bouncing off of them, like trying to see what Caffe was offering people, trying to see what Theano was offering, so that you'd make sure you were delivering on whatever that is, perhaps the Python part of things?

[97] Maybe that influenced any design decisions.

[98] Totally.

[99] So when we built DistBelief, some of that was in parallel with some of these libraries coming up. Theano itself is older, but we were building DistBelief focused on our internal thing, because our systems were very different.

[100] By the time we got to this, we looked at a number of libraries that were out there.

[101] Theano; there were folks in the group who had experience with Torch, with Lua.

[102] There were folks here who had seen Caffe.

[103] I mean, actually, Yangqing was here as well.

[104] What other libraries were there?

[105] I think we looked at a number of things. We might even have looked at Chainer back then; I'm trying to remember if it was there. In fact, we did discuss ideas around, should we have a graph or not? So putting all these together, there were definitely, you know, key decisions that we wanted to get right. We had seen limitations in our prior DistBelief things. A few of them were just that research was moving so fast, we wanted the flexibility.

[106] And the hardware was changing fast.

[107] We expected it to change, so those probably were the two things.

[108] And yeah, I think the flexibility in terms of being able to express all kinds of crazy things was definitely a big one then.

[109] So the graph decision: moving towards TensorFlow 2.0, there's more, by default, eager execution.

[110] So it's sort of hiding the graph a little bit, because it's less intuitive in terms of the way people develop and so on.

[111] What was that discussion like, in terms of using graphs?

[112] It's kind of the Theano way.

[113] Did it seem the obvious choice?

[114] So I think where it came from was that DistBelief had a graph-like thing as well.

[115] It wasn't a general graph.

[116] It was more like a straight line thing.

[117] More like what you might think of Caffe, I guess, in that sense.

[118] But the graph was, and we always cared about the production stuff.

[119] Like even with DistBelief, we deployed

[120] a whole bunch of stuff in production.

[121] So the graph did come from that. When we thought of, okay, should we do that in Python, we experimented with some ideas where it looked a lot simpler to use, but not having a graph meant, okay, how do you deploy now?

[122] So that was probably what tilted the balance for us and eventually we ended up with the graph.
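(An aside for readers: the trade-off being described, eager simplicity versus graph deployability, is visible directly in TensorFlow 2.x, where eager execution is the default and tf.function recovers a graph. A minimal sketch, not the code from the original design discussion:)

    import tensorflow as tf

    x = tf.constant([[1.0, 2.0]])

    # Eager: ops run immediately, like the "simpler to use in Python"
    # ideas the team experimented with.
    y = tf.matmul(x, tf.transpose(x))
    print(y.numpy())  # [[5.]]

    # Graph: tracing gives TensorFlow a graph it can optimize and
    # export, which is what kept deployment workable.
    @tf.function
    def squared_norm(t):
        return tf.reduce_sum(t * t)

    print(squared_norm(x).numpy())  # 5.0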

[123] And I guess the question there is, did you, I mean, so production seems to be the really good thing to focus on.

[124] But did you even anticipate the other side of it?

[125] where there could be, what is it, what are the numbers?

[126] Something crazy, 41 million downloads.

[127] Yep.

[128] I mean, was that even like a possibility in your mind that would be as popular as it became?

[129] So I think we did see a need for this a lot from the research perspective and like early days of deep learning in some ways.

[130] 41 million? No, I don't think I imagined this number then. It seemed like there was a potential future where lots more people would be doing this, and how do we enable that? I would say this kind of growth I probably started seeing somewhat after the open sourcing, where it was like, okay, you know, deep learning is actually growing way faster for a lot of different reasons, and we are in just the right place to push on that and leverage that and deliver on lots of things that people want.

[131] So what changed once you're open source?

[132] Like how, you know, with this incredible amount of attention from a global population of developers, how did the project start changing?

[133] I don't even actually remember what it was like during those times.

[134] I know looking now, there's really good documentation, there's an ecosystem of tools, there's a community blog, there's a YouTube channel now, right?

[135] Yeah.

[136] It's very community -driven.

[137] Back then, I guess, the 0.1 version?

[138] Is that the version?

[139] I think we called it 0.6 or 0.5, something like that.

[140] I forget about that.

[141] What changed leading into 1.0?

[142] It's interesting.

[143] I think we've gone through a few things there.

[144] When we started out, when we first came out, people loved the documentation we had, because it was just a huge step up from everything else, because all of those were academic projects by people who, you know, don't think about documentation.

[145] I think what that changed was instead of deep learning being a research thing, some people who were just developers could now suddenly take this out and do some interesting things with it, right, who had no clue what machine learning was before then.

[146] And that, I think, really changed how things started to scale up in some ways and pushed on it.

[147] Over the next few months, we looked at, you know, how do we stabilize things?

[148] We looked at not just researchers; now we wanted stability, people wanted to deploy things.

[149] That's how we started planning for 1.0.

[150] And there were certain needs from that perspective.

[151] And so, again, documentation comes up, designs, more kinds of things to put that together.

[152] And so that was exciting to get that to a stage where more and more enterprises wanted to buy in and really get behind that.

[153] And I think post-1.0 and, you know, with the next few releases, the enterprise adoption also started to take off.

[154] I would say between the initial release and 1.0, it was, okay, researchers, of course, then a lot of hobbyists and early interest, people excited about this who started to get on board, and then over the 1