We have recently set up a TensorFlow assessment function in AWS Lambda, and got very close to the maximum allowed size of a Lambda function (250MB): the trained model is currently 85MB, and the TensorFlow libraries and binaries take up another 140 or so megabytes by default (we trimmed that down a bit, but it's quite a hack).
I feel like Amazon could do some work in this area to support users who want to bring their own engines rather than being bound to AWS AI Platforms and Services.
This could be as simple as publicly documenting how long Lambdas stay 'warm' and retain data in /tmp across multiple invocations, or publishing some examples of how an AI workflow could be implemented with popular custom engines such as TensorFlow.
Does anybody else have any experience in this regard?
I've hacked around Lambda quite a bit (I think the compressed size of one function is a tad under the max allowed). The hacks I remember are:
- Run strip on all .so libraries -- many aren't fully stripped
- In Python I manually deleted sub packages of numpy/scipy I didn't need
- If you're loading large models at initialize, numpy load routines are _much_ faster than cPickle. Have it load at module initialization, not during each invocation.
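A minimal sketch of that last point, assuming the weights ship in the deployment package as an .npz file (the file name and array names here are made up):

    import numpy as np

    # Loaded once per container at module import time (i.e. on cold start),
    # then reused across warm invocations; np.load on an .npz is typically
    # much faster than unpickling the same arrays.
    WEIGHTS = np.load("/var/task/model_weights.npz")  # hypothetical file

    def handler(event, context):
        # The handler only uses the already-loaded arrays.
        w, b = WEIGHTS["w"], WEIGHTS["b"]
        x = np.asarray(event["features"], dtype=np.float32)
        return {"score": float(x @ w + b)}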
I should really write a blog post about my experience with it.
At a certain point I decided I was doing something that lambda really wasn't designed for -- I'm looking at migrating off, but the current implementation makes capacity planning super easy. Provisioning 1000 machines with 1GB of RAM for 15 minutes every day to read off a queue isn't a trivial problem.
(Also, if anyone from AWS is reading, being able to limit the max concurrency of a single function vs account level limits would be super useful).
Just to throw this out there as another possible optimization, if you find that you're putting a big fat library into every function, one possibility is to run the library as its own lambda function. You'll be slowed down a bit by the network but it might be made up for by not having to constantly initialize the same thing.
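To sketch what that could look like (the function name and payload shape below are made up), the 'fat' function wraps the library behind a small interface and other functions call it synchronously with boto3:

    import json
    import boto3

    lambda_client = boto3.client("lambda")

    def handler(event, context):
        # Delegate the heavy model work to a separate, fat Lambda function
        # instead of bundling the big library into this one.
        resp = lambda_client.invoke(
            FunctionName="model-scoring-function",   # hypothetical function name
            InvocationType="RequestResponse",        # synchronous call
            Payload=json.dumps({"features": event["features"]}),
        )
        return json.loads(resp["Payload"].read())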
No, we haven't looked at SWF at all - might do some reading up on that.
We considered running it on EC2, but the economics just didn't work out for our needs (hundreds of jobs processed in parallel with irregular spikes, <5s runtimes per invocation, among other factors).
The keep warm time isn't static AFAIK but some frameworks like Zappa have chosen 5 minutes (I work for AWS). If you raise an issue with support there may be other ways around this. There are also other reasons for not relying on functions being warm (spike in concurrent invocations, AZ outages, latency, etc.). If you open a support ticket and email me the # I'll see what I can find out: randhunt at amazon dot com
I'm curious what sort of things you did to shrink your model?
Have you considered pulling the model data from S3 outside of the main Lambda handler and seeing whether that negatively impacts performance -- with a CloudWatch event running every 5 minutes or so to keep the function warm?
Hadn't thought about a CloudWatch event to keep the function warm; I might suggest that to the team.
We haven't shrunk the model, we've deleted superfluous files from the TensorFlow python library and dependencies (we don't need tensorboard, for example).
It would be nice if you could package TensorFlow up into a minimal component just for assessment, without any of the 'learning' stuff or other added-on libraries, but we couldn't find a simple way of doing that - we're not pro C++ engineers and even our Python is not the greatest. We're managing for now, but if our model grows any bigger we'll run into issues; there have been some good suggestions in here, though.
We had considered the S3 store, but we ruled that out quite quickly on cost & performance at the number of invocations a month we're looking at - but that was before we knew more about the 'keeping warm', so that may be revisited, too.
We get around this by using a dependency injection framework (e.g. Guice) and initialising the module statically, so it only gets called once when the JVM loads (i.e. when the Lambda 'warms up'), and then injecting the dependencies into anything that uses them.
For example in one Lambda function, we use Guice to get resources from S3.
We're (sadly) a bit locked into using AWS. We've been eyeing the Google offerings for a while though (not just for this), so who knows what we may do in the future.
Knowing nothing about assessment and TensorFlow: could the trained model be downloaded via S3 at init instead of being included inside the function zip? I believe the memory limits are more lax than the actual function limits. And the /tmp limit is 512MB.
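A rough sketch of that idea (bucket, key and file names below are placeholders): the download runs at module import, so it only happens on cold starts, and the file sits in /tmp for as long as the container stays warm.

    import os
    import boto3
    import numpy as np

    MODEL_PATH = "/tmp/model_weights.npz"   # /tmp persists while the container is warm

    # Runs once per container (cold start), not on every invocation.
    if not os.path.exists(MODEL_PATH):
        boto3.client("s3").download_file("my-model-bucket", "model_weights.npz", MODEL_PATH)

    WEIGHTS = np.load(MODEL_PATH)

    def handler(event, context):
        # WEIGHTS is already in memory for warm invocations.
        return {"num_params": int(sum(WEIGHTS[name].size for name in WEIGHTS.files))}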
Quick slightly unrelated question: Does anyone have a comparison of using Google cloud services vs AWS for machine learning? I'm planning to pick one, and I was leaning towards Google Cloud Services because of the TensorFlow support and the fact that Google is big on ML, making it likely that it's something that Google will support and be good at. With this blog post, I'm not sure.
All three providers offer you raw VMs with GPUs and such, so you can run popular machine learning frameworks yourself by hand. After that they diverge a bit, and I've not seen a good writeup myself. Roughly:
- Google has both a hosted TensorFlow (Cloud ML) as well as specific, pre-trained models you can simply use (Cloud Vision, Cloud Speech, etc.). For an easy to use interface, we have direct TensorFlow (and more) integration in Datalab.
- AWS also has some pre-trained services (Rekognition, Polly, Lex) but for "obvious" reasons doesn't do hosted TensorFlow. Instead Amazon Machine Learning is a bit more like Azure's offering: "Put data in, wire up stuff in the console and hit go".
If you're really interested in ML, my biased opinion is that you'll be using TensorFlow. And as you surmised, we're committed to making TensorFlow the "best" ML framework and making sure it runs well on Google Cloud. Like Kubernetes, we're not going to handicap it elsewhere, but having it managed and accelerated for you is extremely convenient.
[Edit for formatting. I also should have mentioned there will be lots of ML-related talks at NEXT in San Francisco in two weeks!].
GM for AI at AWS here: actually - we like TensorFlow quite a bit, too.
We provide a machine image with TF, MXNet and others pre-installed, along with Keras, CPU and NVIDIA drivers, and other libraries for deep learning. We just added Ubuntu support too.
Sorry if that wasn't clear from my opening (both providers offer the DIY option, and are happy to support every ML framework). I think for folks looking for a hosted TF service, though, we wouldn't expect that (sadly) from AWS, particularly after the MXNet announcement.
ImageNet, GoogleNet, etc. are all image datasets for precisely this purpose. There's also the recently announced YouTube dataset and Kaggle challenge [1] and Google Research's datasets [2].
I agree though, the kind of artificial / play-against-yourself datasets that the folks at DeepMind created for, say, AlphaGo are an entirely different beast.
I've used both and both are great overall. Even if you are using TensorFlow, I would recommend AWS right now for someone just starting out because the documentation is currently more thorough, although that will probably change. The CloudML service looks really cool (and I think it's really what everyone will ultimately use), but I hit enough problems/bugs getting my model trained and running that I plan to wait for it to come out of beta before trying again.
On the GPU side of things, I can confirm that AWS p2.xlarge has worked well for me. It has one Tesla K80. Azure's offering is similarly priced. Back in November Google Cloud announced P100 GPUs would be available soon; that will be interesting.
Hmm. Cloud ML is "serverless" in a sense, but it's backed by VMs running CPUs or GPUs (up to you). Cloud ML is "hosted TensorFlow" and does a lot for you. The set of pre-trained model services (Cloud Vision, etc.) are quite a bit different.
If you are just learning deep learning, using Keras on top of Theano on a single GPU is a good option. AWS has the p2 instance that is used in Part 1 of Jeremy Howard's excellent fast.ai MOOC. TensorFlow becomes more useful when you have multiple GPUs.
I find it frustrating for all the power they want to give me... that some basic service design is lacking.
Polly is a standalone component, but the reverse (speech-to-text) is closely bound up in Lex, which is a conversational interface API.
Amazon has internally built an engine I could ask to convert an audio file in S3 into text representative of that audio... yet I can only use Lex to drive a conversation via text and audio.
If AWS really wants to give me the power of their AI tools, how about unbundling them?
Very impressive. I am working on a cognitive computing book, and I am going to add a chapter or appendix on Amazon AI. A little off topic, but even though I self-classify as a Google fan and very much enjoyed working there as a contractor, when a friend once asked which technology company impressed me more, I said Amazon.
A big win for Amazon is that so many companies already have huge data sets in S3. Having AI APIs 'close to' existing data makes it easier to get started.
I am a complete noob to the AI space but I was wondering whether the following is possible (in AWS).
I have a million scanned images of court documents. Some are briefs, some are motions, some are court orders, etc... Given that I have images and their types, could I "train" the AI with these million documents to recognize a new image that might come in?
People keep recommending things and approaches, but I'm really not clear on what it is you're actually trying to solve, so most of the responses might not help.
> Given that I have images and their types, could I "train" the AI with these million documents to recognize a new image that might come in?
Do I understand this right:
1. You have lots of documents as images, and their type (brief, motion, court order)
2. You get a new document, as an image.
3. You want to assign a type to this new document.
Is that right?
What's your goal in terms of quality? A key way of thinking about this is:
1. What's the risk/cost if you mis-classify a document?
2. What's the risk/cost if you fail to classify a document?
Are the documents typically very structurally different? Could I probably tell them apart without wearing glasses? Or are they largely the same, but with nuanced differences in the text?
As far as quality, if I classify the document wrong today (or fail to classify it), it's not that big of a deal to the system as a whole - but users will be "very" annoyed and have to correct it.
> Are the documents typically very structurally different? Could I probably tell them apart without wearing glasses? Or are they largely the same, but with nuanced differences in the text?
The documents are structurally reasonably different, but not "very" different. If a trained human looks at it, they would be able to tell them apart. There are exceptions, of course. For instance, if the attorney doesn't follow accepted convention, but those are rare.
There's probably a way to determine the document type based on the differing structures (would have to see it to be sure). Alternately, is it not possible for a user to specify what type the document is?
Principally, yes. However, the approach may be more nuanced than that. If I were you, I would first pick a character recognition engine (which has likely already been well trained) to convert the image to text. Once the text is there, it might serve as a better feature for classifying the content. Furthermore, I would recommend converting the words in the text to word embeddings/vectors using a suitable GloVe or word2vec dataset similar to your content.
While there are many benefits to end-to-end training, I don't think it is best suited for this case. This is because we already know that the only useful feature in the document is the text, not the contours or textures; wasting neurons in your neural network learning to ignore those features is just a waste of resources. Furthermore, you benefit from the even larger corpus of data that the character recognition engine has been trained on.
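A rough sketch of the embedding step (the vectors file path is a placeholder; any off-the-shelf classifier could sit on top of the resulting features):

    import numpy as np
    from gensim.models import KeyedVectors

    # Pre-trained word2vec/GloVe-style vectors in word2vec text format (placeholder path).
    vectors = KeyedVectors.load_word2vec_format("pretrained_vectors.txt", binary=False)

    def doc_vector(ocr_text):
        # Average the embeddings of the words we recognise; unknown
        # (often mis-OCR'd) tokens are simply skipped.
        words = [w for w in ocr_text.lower().split() if w in vectors]
        if not words:
            return np.zeros(vectors.vector_size)
        return np.mean([vectors[w] for w in words], axis=0)

    # doc_vector(text) is then the feature vector you feed to a classifier
    # (logistic regression, SVM, a small net) trained on your labelled documents.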
OCR is my current approach. I am not really happy with it. The quality of OCRing leaves much to be desired, probably due to the documents themselves being haphazardly handled by the court personnel. OCR itself is a pretty CPU intensive activity and takes a significant time to complete for many documents.
OCR is a better understood problem than a general neural net, so I think it's likely easier to improve its quality than to surpass it with image-based recognition.
Ideally, I would like to get all the information from the page. Phone numbers, who is suing whom, case caption, etc...
With OCR, you get bits and pieces of information, but because I don't know what the type of the document is, it is difficult to determine where, structurally speaking, this information resides on a page.
If I could use AI to determine the type of the doc, I would know the structure of the document and I could then use OCR to pinpoint specific information on the page.
If your OCR is unable to recognize the characters on a full page, I don't think it will do better when scanning a region either?
Unless using full resolution of a full image is somehow too much for the algorithm in use.
But then I'd just subdivide the entire image into regions, and scan them all independently. This is also a trivially parallelizable task, so you can throw many servers at it, if the time to get results is an issue.
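A small sketch of that region-splitting idea, assuming Tesseract via pytesseract (the grid size is arbitrary):

    from concurrent.futures import ProcessPoolExecutor
    from PIL import Image
    import pytesseract

    def ocr_region(args):
        # Crop one region out of the page image and OCR it independently.
        path, box = args
        return pytesseract.image_to_string(Image.open(path).crop(box))

    def ocr_page_in_regions(path, rows=4, cols=2):
        w, h = Image.open(path).size
        boxes = [(c * w // cols, r * h // rows, (c + 1) * w // cols, (r + 1) * h // rows)
                 for r in range(rows) for c in range(cols)]
        # Regions are independent, so this parallelises trivially across cores
        # (or, at larger scale, across machines).
        with ProcessPoolExecutor() as pool:
            return list(pool.map(ocr_region, [(path, b) for b in boxes]))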
I think you'd have a better time doing your OCR on AWS - spread the CPU intensive activity across multiple machines. Even if the resulting text has errors, if you're doing further document classification, it would be better to do it on the text including the errors, than on the original images.
> OCR itself is a pretty CPU intensive activity and takes a significant time to complete for many documents.
Leaving the quality part aside -- this job itself is easy to parallelize in that you can split it up by document or by page.
One option is to run each job in Lambda asynchronously, with the input being a URL to the page or the full document, and have the job call back to you with the text of the page (or put it on S3 as a text file, or add it to a message queue, or whatever works). Regarding splitting: we've been using a Python wrapper + pdfium for splitting PDFs into page images on Lambda, with excellent results.
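For the fan-out part, a hedged sketch of kicking off one asynchronous OCR invocation per page with boto3 (the worker function name and payload shape are made up; each worker would write its result to S3 or a queue rather than return it):

    import json
    import boto3

    lambda_client = boto3.client("lambda")

    def fan_out_ocr(page_urls):
        # Fire one asynchronous invocation per page.
        for i, url in enumerate(page_urls):
            lambda_client.invoke(
                FunctionName="ocr-page-worker",   # hypothetical worker function
                InvocationType="Event",           # asynchronous, don't wait for the result
                Payload=json.dumps({"page_url": url, "page_number": i}),
            )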
To make the Lambda function, you'll either have to build e.g. Tesseract such that it fits into a 50MB zip, or download it while the Lambda function executes. LambCI has a set of docker containers that they've made for simulating lambda, and the "lambda:build" container makes building things easy and repeatable: https://github.com/lambci/docker-lambda. In a pinch, you can build on an Amazon Linux EC2 instance and it should work on Lambda, but you will have to be more careful about dynamic linking.
As another option: I'm not sure if it's been mentioned, but you can also try a ready-made OCR service before packaging up Tesseract, like this one: https://algorithmia.com/algorithms/ocr/SmartOCR.
So anyway, the performance part has good solutions, at least.
For fixing the accuracy: I know next to nothing about approximate string matching, but perhaps it would then be possible to do a fuzzy search over the text using something like a Levenshtein automaton: https://en.wikipedia.org/wiki/Levenshtein_automaton.
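Not a Levenshtein automaton, but as a rough stand-in for the same idea, Python's difflib can already do fuzzy lookups of expected terms in noisy OCR output (the vocabulary below is just an example):

    import difflib

    # Terms we expect to show up in particular document types.
    VOCAB = ["plaintiff", "defendant", "motion", "order", "brief"]

    def fuzzy_hits(ocr_text, cutoff=0.8):
        # For each OCR'd token, look for close matches in the vocabulary,
        # tolerating a character or two of OCR error.
        hits = {}
        for token in ocr_text.lower().split():
            for match in difflib.get_close_matches(token, VOCAB, n=1, cutoff=cutoff):
                hits[match] = hits.get(match, 0) + 1
        return hits

    print(fuzzy_hits("the defendnat filed a m0tion to dismiss"))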
More broadly, I'm sure that there are text-based document classification methods that are robust against sloppy OCR. It may just take some research on the main approaches people take to document classification -- it's not my area, but my understanding is that this is typically approached with statistical methods. Otherwise your spam filter would get defeated by typos.
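One common statistical approach along those lines (a sketch with toy data, not a definitive pipeline): character n-gram TF-IDF features plus a linear classifier, which degrades gracefully when OCR mangles individual characters.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy stand-ins for OCR'd training documents and their known types.
    docs = ["m0tion to dism1ss the compla int",
            "order granting the rnotion",
            "appellate brief of the defendant"]
    labels = ["motion", "order", "brief"]

    # Character n-grams tolerate scattered OCR errors better than whole-word features.
    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
        LinearSVC(),
    )
    clf.fit(docs, labels)
    print(clf.predict(["ordr granting the motion"]))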
You don't need perfect character recognition. It's just gotta be good enough. The way you determine good enough is by completing the pipeline and measuring the result.
I don't know, maybe different approaches could be combined. Maybe the layout provides a clue for some types of court documents? You could calculate the probability of predicting a certain type correctly (or just use the outputs of the NN, that depends on the problem) as a confidence value and only do the OCR as a last resort.
That is exactly the approach I had in mind, except I would use the knowledge of the document type (as determined by AI) to guide OCR to specific sections of the page to get information from it.
But I know next to nothing about AI and ML - that's why I was asking this question.
Well, the problem is data. An NN needs a lot (depending on the problem, thousands or millions of samples). It will probably be very difficult to get enough data to teach an ML algorithm the location of the relevant OCR text.
ML problems are often way more experimental than "normal" coding. I would first try modelling it as a classification problem and do some cross-validation to check the performance of the model. If it's useful, go with it; if not, back to the drawing board. You will need some serious computing power, so either buy some GPUs or use the cloud.
You could train a random forest based on the outputs of the OCR and the NN, if you want to go full ML. You would gain some interpretability (I don't know whether that's important, but I would guess it might be).
I am sorry that I can't give you a more concrete answer; these are just ideas. They are probably wrong. Like I said, I am a beginner and also don't really know the problem.
Edit, another idea: if you know the location of the relevant OCR text, you could use the following approach: use the NN for classification. It will return probability-like values for every category. Take the top 2 (or 3, or however many until they add up to 70 percent... idk). Then do some OCR for every category you have to check. If one is positive you have your result; if not, run the others.
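A hedged sketch of that cascade; classify_image and ocr_check_for_type are hypothetical helpers standing in for the NN and the per-type OCR check:

    def classify_then_ocr(image, cumulative_cutoff=0.7):
        # classify_image returns {doc_type: probability} from the NN (hypothetical helper).
        probs = classify_image(image)
        covered = 0.0
        for doc_type, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
            # Run the (expensive) targeted OCR only for the most likely types.
            if ocr_check_for_type(image, doc_type):   # hypothetical helper
                return doc_type
            covered += p
            if covered >= cumulative_cutoff:
                break
        return None   # fall back to manual review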
AWS doesn't offer any high-level services for training your own custom model. You'd have to build the neural network yourself and deploy EC2 boxes to run it.
I've had success with Clarifai's [0] custom CV model API in the past. You basically upload batches of labeled images to train a model, and then you can submit new images for classification.
Of course, I have no idea how effective it would be for your documents. Obviously it depends on how visually distinct the different types are.
    // Clarifai JS client (v2-style); the key setup below is an assumption,
    // exact constructor args may differ by client version.
    const Clarifai = require('clarifai');
    const app = new Clarifai.App({apiKey: 'YOUR_API_KEY'});

    // predict the contents of an image by passing in a url
    app.models.predict(Clarifai.GENERAL_MODEL,
                       'https://samples.clarifai.com/metro-north.jpg').then(
      function(response) {
        // response contains the predicted concepts and their confidences
        console.log(response);
      },
      function(err) {
        console.error(err);
      }
    );
> Finally, we provide AI engines, a collection of open-source, deep learning frameworks for academics and data scientists who want to build cutting edge, sophisticated intelligent systems, pre-installed and configured on a convenient machine image.
Which is to say anything that TensorFlow can do, AWS AI can do. So yes. But is it possible without implementing it yourself? It looks like Amazon Rekognition may be able to do what you're looking for, but I'm not certain. You'll have to research that one.
You can make it more performant by downsizing the documents, but obviously this is incredibly GPU-intensive to train. I second the suggestions to just use OCR.
Production-ready AI services are few and far between, but Polly is up there. I am currently using it in a workflow as part of an IVR front end. Seeing good results.
I know ML is the big cheese right now, but doesn't it seem like a bad use case for the cloud? Consider:
1) Training ML models does not require network access, which is one of the biggest competitive advantages of the cloud.
2) Training ML models is typically a batch process, which benefits minimally from the scale-on-demand model of the cloud.
Since the cloud premium is a significant price to pay for the value it adds, I don't see this being a big win for cloud providers. I can't help but think that if I were making use of extensive machine learning with continuous training, I'd have it training models on a local bare-metal cluster statically scaled to my application's demand with minimal network connectivity needs, and then ship the serialized trained models to the cloud. The potential cost difference is huge.
Neither of these is true when dealing with terabytes of data (or more, if you're working with image/video corpora). Many AI/ML problems have stages that are trivially parallelizable - if you can divide your problem into iterations where a subgraph of nodes communicates internally, then sends/receives updates to other subgraphs, it's very similar to an iterative map-reduce algorithm, perfect for networked cloud systems.
And as you're tuning your hyperparameters, you don't know what the performance characteristics are, and you will absolutely want to run experiments in parallel, until you find the right settings that you'll use in production. You'd need to invest in a LOT of redundant bare metal to have that capability. As Netflix puts it in this presentation, the key to effective machine learning is iterating often, and that means having a lot of parallelism to bring to bear. https://www.infoq.com/presentations/machine-learning-netflix...
Good point. I've worked with lots of local exploratory machine learning (R or Spark on a single workstation), and lots of stable applications that use machine learning processes that are no longer fiddled with, but I've never witnessed the transition between the two.
I don't follow point 2, batch processes can definitely take advantage of scaling on demand. You can spin up a huge cluster, only pay for it for an hour while training, and then spin down to just what you need to serve requests for your already trained model.
Of course it depends on how much data you're training on, and how up-to-date you need your model to be.
Makes sense - though if you add some kind of validation to the trained model you could build a continually improving 'ground truth' and automate the training and improvement over time.
How should I interpret their picture? Can I get an AMI with Keras preconfigured on a p2 instance? Because that would be pretty useful. I currently have a p2 instance (smallest possible one) that I spin up for training and the like.
Thanks, somehow never saw this. I guess this can replace the scripts from the "Practical Deep Learning" series for me then. I'll just use the vanilla Amazon Deep Learning AMI + a p2 instance :)
Quick heads up in case anyone wants to do the same: the only European region that supports p2 at the moment is Ireland (I tried Frankfurt at first).
Has anyone used Rekognition? We're thinking about pumping traffic cam feeds into it in cities for vehicle counting but don't want to waste time if it's junk.
Hi Randall, long time no chat! Hope you're doing well. Good to know; it's not really our core product mission right now, so we're not at the point of building something ourselves (like a TensorFlow implementation). I figured if Rekognition had what we needed out of the box we might just clip it on and see how it performs, but I wouldn't bother if it's not there yet; I'd imagine it will improve over time. Thanks.
I've tried it, it's pretty good, gets expensive quick but you don't have to manage your own servers I guess. Plenty of options if you do want to do this on your own server too - hit me up if you want some help
Does your trail camera capture video or image sequence? You probably only need to analyse 1 frame every few seconds or less if you just want to get deer alerts.
I would go with a DIY solution, since Rekognition will get expensive quick ($1/1000 images). If you did 1fps running continuously, you'd pay $87/day.
Nearly any ML framework with an ImageNet pre-trained model will work. Accuracy will be better if augmented with more labeled deer images (ImageNet probably only has frontal poses) but should be alright straight out of the box.
You could run the whole thing locally on a cheap Android phone for free instead of paying AWS cloud pricing; 1fps will be fine with TensorFlow or MXNet on Android. Then just set up email alerts.
There was a recent post on HN about getting a tensorflow model running on a raspberry pi to detect trains. The blog post describing the process was quite well done. I bet it would apply quite well to detecting deer.
> wondering what makes him a quotable authority on the topic
His wealth.
The obsession with cyber-malthusianism among the owners of tech tells me more about their valuation of humanity than about the future of tech. Fortunately there's a more level-headed analysis that was published by the NYT recently, for those of us who aren't consciously or subconsciously all-in on the idea of making Snow Crash non-fiction: https://www.nytimes.com/2017/02/20/opinion/no-robots-arent-k...
I followed the link to his interview where he says his 2 biggest investments are Netflix and Amazon; maybe that's why?
TV personality known for investing -> he's invested in Amazon -> claims Amazon is one of the best tech companies, followed closely by Google and Facebook -> people will want to go with Amazon? Just making guesses based on that interview article.
I did see that, but that RSS feed (Atom?) - I don't think I've ever used one in my life. I'll Google and see what/how to use it. I see it everywhere though.
edit: I did see that they have a podcast, nice, something to listen to while I walk 2 hours in the middle of the night.
Ughh, you're making me feel old. RSS is (was) an amazing system for tracking website updates over time. It lost out to Twitter and Facebook (which don't remotely replace the use case), though it is still what makes podcasts work.
I highly recommend Inoreader as a great RSS reader.
Thanks for the recommendation. I tried one, but I'm not quite sure how it works yet, at least for that AWS feed. One page was a full article, another page was several articles in one; it was odd, but thank you.
Regarding Blogger, back in like 2011 or so I saw the RSS feed link all the time at the bottom (usually still do).
This is the one I tried, because it was an editors' choice with the most downloads. I'm not sure if my experience was due to the source (AWS) or Feedly's design. Guess I would just have to get used to it.
What I want is an offline daily download of specific stuff. Being lazy, I'd build scrapers and push the content to my phone, but I'm not an Android dev, only a web dev at this stage.