Neural Network Bedtime Stories: Things to Read While Your Neural Networks Train

Did I say training? I meant taking over the world… What do you think it’s actually doing with all that CPU and GPU time?

Today I’m going to focus on things that I’ve been reading to become a better ML practitioner.

Working with Data and Sharing Data

https://github.com/jtleek/datasharing

This is a helpful article I stumbled across that provides a great framework for working with data, both for yourself and with other people. It outlines a process that involves a little extra writing but, in exchange, leads to reproducible work that is more understandable and easier to share and show off. The page comes from the Leek group at Johns Hopkins, which does interesting work in public health and meta-research.

Machine Learning Self Help

https://www.kdnuggets.com/2016/12/4-reasons-machine-learning-model-wrong.html

This gave me a couple of insights and helped me think about what to track to better understand and evaluate the models I'm working on. The two biggest changes I made were tracking both training and test accuracy to evaluate variance (overfitting) and tracking precision-recall tradeoffs. The first is generally useful, but the second is even more important for me. I'm working on machine learning for a safety system, so balancing precision and recall amounts to balancing the safety of making sure a robot stops when it sees a moving obstacle against the practicality of not stopping every time it sees something with a slightly red hue.

This is a short read, but it can help you think through why your model isn't working (or proactively address potential issues).
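
As a hedged sketch of that tracking (toy data and a stand-in model, not the actual safety-system setup):

```python
# Illustrative only: synthetic data and a simple classifier stand in for the real system.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# A large gap between these two numbers suggests high variance (overfitting).
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))

# Precision vs. recall: for a safety system, recall is catching every real obstacle,
# precision is not stopping for every slightly-red blob.
y_pred = model.predict(X_test)
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
```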

Someone Else’s List of Awesome Statistics in 2017

https://simplystatistics.org/2017/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2017/

This is a list of good, interesting work in a variety of areas that I'm not super familiar with. I recommend it: I found the list very helpful for diving in, though I didn't actually get through all of it because I did more of a depth-first search through the first couple of items. The reading time here is bounded only by the number of tabs you'll open in your browser.

Scikit-Learn Docs

For example: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html

A lot of my knowledge of machine learning started in the classroom, but I'd like to think it's taken a turn for the more practical. One of the ways I've been able to keep building a more applied understanding is reading through the scikit-learn documentation (sounds like lots of fun, right?). I spent a lot of time reading about the different linear models and how they evolved, but more recently I've started looking into some of the more complex models. My personal favorites have been the neural networks (with their practical tips section), bagging (reduces variance) and boosting (increases accuracy by weighting missed samples higher). All three are techniques introduced in the classroom that I've been able to unpack, exploring how different parameters affect their performance (prediction latency, accuracy, precision, recall).
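
For a taste, here's a hedged sketch using the BaggingRegressor from the linked docs (the toy data and parameter choices are mine, not the documentation's):

```python
# Bagging fits many estimators on bootstrap samples and averages their predictions,
# which is where the variance reduction comes from.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)
model.fit(X, y)
print(model.score(X, y))  # R^2; compare against held-out data in real use
```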

There's way more to read here (even if you don't look at the code and just read the text) than you could get through while watching AlphaGo Zero train on a Raspberry Pi (is that even a thing that could happen?). Disclaimer: please don't try this at home; it has been known to cause extreme boredom and exploration into re-implementing wheels in brainfuck.

Useful – pandas describe() function

pandas-docs/pandas.DataFrame.describe

So I heard you want to:

Quickly summarize or understand a dataset

You're in luck! You've made it to the right place. I use this every time I load new data to check that the dimensions are right and the data's in the form I expect. It's also very helpful for checking distributions: you can verify the mean and standard deviation, and the quartiles help you spot data that isn't really Gaussian. One handy benefit I didn't realize was a feature is that it will compress similar repeating columns to make the output more readable. On my most recent computer vision project, the ORB descriptors for the keypoints are a bunch of 8-bit grouped bitmaps, so it's helpful to make sure the data has been transformed into the 0-255 range. For a different analysis, or for normalizing the data, it might make sense to shift the range to 0-1, and that's pretty easy with a single apply across the dataframe. One thing I'd like to see is a matplotlib integration that visually summarizes the data in an analogously compressed form.
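
A quick sketch of that workflow (the dataframe is illustrative, standing in for the ORB descriptor data):

```python
import numpy as np
import pandas as pd

# Fake 8-bit data standing in for ORB descriptor columns.
df = pd.DataFrame(np.random.randint(0, 256, size=(1000, 4)), columns=list("abcd"))

# Count, mean, std, min, quartiles and max for every numeric column in one call,
# which makes it easy to confirm the 0-255 range and eyeball the distribution.
print(df.describe())

# Shifting to a 0-1 range with a single apply across the dataframe.
normalized = df.apply(lambda col: col / 255.0)
print(normalized.describe())
```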

Let me know if you thought this was helpful @bebaskin

Quick plug: Get on the Python3 bandwagon. It’s the future!

Useful – Parallelized Machine Learning with Dask-ML

http://tomaugspurger.github.io/dask-ml-announce.html

So I heard you want to:

Use Scikit-Learn

Great! You're headed the right way, using a stable library to do the kind of work you probably don't need to duplicate for your specific task (yet). Charge ahead into this good night and learn your way to a better you.

But my code is so slow, you say.

Dask-ML is the answer. It helps parallelize your existing code with mostly drop-in replacements for existing scikit-learn functionality. They've aggregated some of the good stuff from other places, streamlined its use, and provided some new algorithms designed for parallelized machine learning from the start. The good stuff I know: distributed joblib, out-of-core scikit-learn pipelines and TensorFlow support for combining with dask. The new stuff (also good): KMeans and scikit-compatible parallel preprocessing. The stuff I don't know but is probably great too: dask + xgboost and distributed GLMs.
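
To show the drop-in feel, here's a minimal sketch with the KMeans mentioned in the announcement (the chunk sizes and data are arbitrary choices of mine):

```python
import dask.array as da
from dask_ml.cluster import KMeans

# A large array built lazily in chunks, so it never has to fit in memory at once.
X = da.random.random((1_000_000, 10), chunks=(100_000, 10))

# Same estimator style as scikit-learn, but fit works across the chunks in parallel.
km = KMeans(n_clusters=8)
km.fit(X)
print(km.cluster_centers_)
```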

Quick plug: Get on the Python3 bandwagon. It's the future! Dask-ML is a great library that supports Python3.

Today I’m reading about pandas optimization

Today I’m reading about writing optimized pandas. Here’s what I found:

Context if you’re interested

I'm currently working on an example machine learning project using a self-created, noisy dataset built from a computer vision task. During my current learning task (bulk reclassifying noisy data to create a cleaner set of features for classifying stop signs in images), I've started using the Python library pandas heavily. The library is great for expressing what I'm trying to do, and I've started reworking how I solve problems to use numpy and its associated tools with pandas more effectively, but I've run into some significant slowdowns on what feels like it should be a reasonable task. Along the way, I've come across a few resources I figured I'd share here.

Getting Started

https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6

It's a good resource for getting started, and honestly probably all I'll need. There are code examples and progressive steps, from how I'm doing things now (looping over the data) all the way to vectorization on numpy arrays (which I've done before, but I don't yet jump there for fun without some prodding). +1 because the challenge in its running example is a bulk calculation of distances over the curvature of the earth, which seems vaguely similar to something I need.
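
As a rough illustration of that loop-to-vectorization progression (my own toy example, not the article's):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.rand(100_000), "y": np.random.rand(100_000)})

# Starting point: an explicit Python-level loop over rows.
dists_loop = [np.hypot(row.x, row.y) for row in df.itertuples()]

# End point: the same computation vectorized over the underlying numpy arrays,
# typically orders of magnitude faster.
dists_vec = np.hypot(df["x"].values, df["y"].values)
```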

Going farther with a little help from the pandas project docs

https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html

When all else fails, write Python that compiles to C! With Cython and a dash of help from the pandas docs, one can statically compile pandas operations to C, or even use numba to JIT (just-in-time) compile the code. The JIT compiling might not help a ton, but it is less involved than Cython. At this point I'm not looking to do this level of optimization for a script that I want to run "once" in the data gathering and correction cycle. It also involves user input, so code speed is much less of a limiting factor.
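
A tiny sketch of the numba route (assuming numba is installed; the function is a stand-in, not my actual script):

```python
import numpy as np
from numba import jit

@jit(nopython=True)  # compiled to machine code on first call
def sum_of_squares(arr):
    total = 0.0
    for i in range(arr.shape[0]):
        total += arr[i] * arr[i]
    return total

print(sum_of_squares(np.arange(1_000_000, dtype=np.float64)))
```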

Idiomatic pandas and a Useful Idea

https://www.datacamp.com/community/tutorials/pandas-idiomatic

This article duplicates some of the first link while adding some interesting elements of visualization. In between those two, it covers the concept of groupby and the optimizations gained by moving to a pandas-style groupby. I think this will be very helpful for grouping by image. Right now, my code has to search the entire million-line dataset for vectors whose imageid field is set to the proper id, once for each of the 2000 images. A proper groupby should be a big improvement.
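
A hedged sketch of that change (the column names follow the post; the tiny frame stands in for the million-line dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "imageid": [0, 0, 1, 1, 2],
    "value":   [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Slow pattern: re-scan the whole frame once per image.
slow = {i: df[df["imageid"] == i]["value"].mean() for i in df["imageid"].unique()}

# groupby pattern: index the frame once and aggregate each group.
fast = df.groupby("imageid")["value"].mean()
print(fast)
```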

A good reference for when I want to learn how I really should do it

https://tomaugspurger.github.io/modern-4-performance.html

This link is probably a little more than I want right now in terms of learning how to do things correctly. It's a guide to the details of pandas performance, with links to more "writing idiomatic pandas" articles that look very helpful at first glance. The first step covers constructors and their effects, and I want to stay a little more hands-off with my pandas for now (see: me still approaching this script as a one-time data transformation).

Useful – Python API design advice with examples

http://python.apichecklist.com/

So I heard you want to:

Design a Python API for a library that's going to be hugely popular

Good luck! You're making the world a better place, and the checklist is a good place to start. I've learned some of my best practices by starting with existing projects like Requests; this checklist brings those practices to you in a convenient format. It'll also help you think through API considerations that you might have missed while focusing on the cool functionality!

Be a better Python programmer

The API checklist links to good example libraries so you can browse and learn from their source code. Reading source code can be a pain, but these links take you to good examples straight away. A bonus for Python specifics: there's information on how to make your code more Pythonic, plus examples of where existing Python code can be pulled in to make your life better.

Be a better programmer

Even if Python isn't your language of choice, the checklist covers object-oriented design considerations, such as encouraging classes with a single responsibility. Another helpful point: if your users need mock and patching, it might indicate that the API isn't flexible enough to be modified the way they're actually trying to use it. Another signal: if users are copying and pasting code to use your API or change its behavior, your code is actively working against DRY principles. A sketch of the mock/patch point follows below.
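
Here's a minimal sketch of that mock/patch point (the names are illustrative, not from the checklist):

```python
import urllib.request

class RigidFetcher:
    """Hard to test: the HTTP call is baked in, so users must reach for mock.patch."""
    def fetch(self, url):
        return urllib.request.urlopen(url).read()

class FlexibleFetcher:
    """Easier to test and extend: the transport is injected, so no patching is needed."""
    def __init__(self, opener=urllib.request.urlopen):
        self._opener = opener

    def fetch(self, url):
        return self._opener(url).read()

# A user can swap in a fake transport without touching mock at all.
class FakeResponse:
    def read(self):
        return b"hello"

fetcher = FlexibleFetcher(opener=lambda url: FakeResponse())
assert fetcher.fetch("https://example.com") == b"hello"
```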

Quick plug: Get on the Python3 bandwagon. It’s the future!

Alternative Reality: Augmented Reality in a Weekend of Firsts

This past weekend I participated in the MHacks X hackathon at the University of Michigan. The people there built a wide variety of things, from VR games to analytics engines that tracked memes via Twitter, and it was a generally great atmosphere for creating something new. I took the weekend as a chance to hit a couple of firsts, have a lot of fun and enjoy a relatively low-stress (as hackathons go) weekend while I built "Seeing Eye Droid", an augmented reality app for the blind. It uses accessibility features built into Android and Google's new ARCore to provide a better sense of the world through touch and audio. For the code, see GitHub – buckbaskin/eye.

My first first for the weekend was successfully building an Android app. I have, at various times, gone through the motions of looking up tutorials and attempting to compile my first app. I may even have gone so far as to create a page or two that I could click between, but those apps never really counted: I didn't have a strong motivation behind them or a unique idea, just the most basic mechanics of putting in every feature the tutorial talked about in order to learn how it worked. This past weekend, I launched directly past the Getting Started tutorial (it's probably worth looking at, honestly, but I didn't make time for it) and landed at a web page about installing Google's new ARCore software. Conveniently, ARCore comes with its own simple example app (simple as in minimal functionality, not actually simple code).

My second first was using Gradle. It was a small first and largely accidental: Android Studio projects deeply integrate with Gradle as far as I could tell, so I messed around with the build scripts until I got fewer compile-time errors and had all my dependencies installed. I'd been meaning to learn it for a while, but I'd never found a project with the right needs; Gradle is a hammer, and I spent a lot of time looking for nails to use it on before it fell by my mental wayside. The weirdest quirk (at least to me) was the syntax. In my head I was hoping it would just use Java or a Java-like syntax to specify how everything worked, but it turned out to be something different.

My third first of the weekend was using AR software as both a user and a developer, in this case ARCore. It turned out to be a pretty intuitive way to interact with the world, and it aligned with a lot of my robotics background. After spending some time reading through the documentation looking for interesting functionality, with mild success, I was able to get started pretty easily (and confirm that I hadn't wasted my time in classes in past years). My favorite find was the ray tracing capability (projected out from the screen), and the weirdest quirk was that the documentation didn't include units for the values I was getting back. They turned out to be something close to meters, but I was never quite sure. Go check it out; I'd be happy to help anyone figure it out in the process of making something cool!

To go along with the ARCore software, my fourth first was using OpenGL and doing graphics rendering. My previous minimal Android experience involved using default buttons without giving any thought to visuals, so that I could move forward with learning other meaningless incremental features. This time I jumped right into messing with shaders, projection matrices and rendering classes. I was fortunate that I could start with the example ARCore app; if I hadn't, I would have had no chance at all of figuring out how to make the AR display work. The display wasn't terribly critical for my intended user, but it did help me debug my code and share what the app was doing with others at the hackathon who could see the screen. A conversation I had during the demo time at the end of the hackathon tripled my understanding of what actually goes on inside OpenGL (see https://learnopengl.com/) and shaders (shadertoy.com), so even just for that I'm happy I pursued the project.

My fifth first of the weekend was using Android's accessibility services to make a disability-friendly app. I didn't do anything terribly custom, but I did figure out a way to let users drag a finger across the screen to explore the world in front of them, both estimating the distance to obstacles and using vibration to convey a sense of color. It was an interesting dive into the world of accessibility and everything that goes into making an app accessible. In the end, what I worked on was largely a single-page app where the changing elements changed in the real world instead of on screen, so it wasn't very complicated to manage an accessible user interface. If you're at all interested, consider reading some of what Google's put together about designing accessible apps, or watch a video about developing for visually impaired users by a visually impaired user.

My last first of the weekend turned into something I wish I'd started way before: I spent a lot of my time reaching out to other participants at the hackathon to try to help them solve their problems. For most of the problems I helped with, the solution came down to a fresh set of eyes spotting something that was hard to see in code that had been stared at a million times over 36 hours. For those kinds of problems, it was the little things that counted, and I'm happy to see that a number of the projects I helped with turned out well. I also had the chance to share some of my machine learning knowledge this past weekend, and it felt quite good to exercise what I'd picked up in and out of class. My initial impression of my own knowledge was that it was quite limited (and I almost certainly have tons to learn), but I found I was still able to help multiple teams get their projects off the ground. Something is better than nothing! I hope to find some way to exercise what I know in the near future, perhaps on a Kaggle challenge or something of that sort.

Cheers!

Insight – Rolling My Own Analytics for the Web-App

One of the more interesting parts of the Insight project I've been working on is that, in addition to the basics of the tutorial (delivering dynamic pages, unit testing, creating a Twitter-like microblogging service), I've begun rolling my own analytics package for the site. For now it has been server-side only, but I've already found some ways to gain useful data from the information coming in.

The data is collected by adding an @analytics decorator to the routing calls, which in turn uses a session (I call it a trace to avoid name conflicts with the cookie-backed sessions from imported packages) to save and track the pages loaded by a given user. There are a few types of analysis I think would be useful, ranging from page-load patterns to timing analysis, and future ideas to integrate such as client-side data collection and A/B testing. To read more, go to the Insight Analytics page.
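
A hedged sketch of what such a decorator could look like in Flask (the "trace" name follows the post; everything else is my assumption, not the project's actual code):

```python
import functools
from flask import request, session

def analytics(view):
    @functools.wraps(view)
    def wrapped(*args, **kwargs):
        # Append the requested path to this user's trace and write it back
        # so the session is marked as modified.
        trace = session.get("trace", [])
        trace.append(request.path)
        session["trace"] = trace
        return view(*args, **kwargs)
    return wrapped

# Usage on a routing call:
# @app.route("/index")
# @analytics
# def index():
#     return render_template("index.html")
```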


Insight – Moving from Global Ideas to Personal Focus

My previous passes at digging deeper into the data available from Twitter focused primarily on the directed graph of users created by Twitter's follow system. While this was interesting, and largely easy to do with a few simple calls to the API, in the last week or so I've looked into what other information is available. One of the key things I realized was that I could get a lot more information about a single user out of the API, and in turn use that to create a programmatic representation of the user to better aid my end goal of finding better accounts to follow.

Most of my work now focuses on using engagements on Twitter to better identify how a given user is using Twitter. I, for one, follow lots of people, but I don't necessarily care about all of them equally, and I don't interact with them equally. To move beyond the quick and shallow follow action, I've turned my focus to what I've started calling a user's activity profile. Using favorites, replies and retweets, I've started compiling a time-based record of activities that can be further associated with tweet-specific data. I'm hoping to find patterns and correlations across time, content, hashtags, engagement type and other quantitative factors that would quantify how I use Twitter, and to use that data to improve the content I follow and find.
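
A toy sketch of the profile idea (the field names are illustrative; the real records would come from the Twitter API):

```python
from collections import Counter

# Each record is one engagement: what kind, when, and with which hashtags.
engagements = [
    {"type": "retweet",  "hour": 20, "hashtags": ["drones"]},
    {"type": "favorite", "hour": 8,  "hashtags": ["python"]},
    {"type": "reply",    "hour": 21, "hashtags": ["drones"]},
]

# Simple marginal counts expose when, how and about what the user engages.
by_hour = Counter(e["hour"] for e in engagements)
by_type = Counter(e["type"] for e in engagements)
by_tag = Counter(tag for e in engagements for tag in e["hashtags"])

print(by_hour.most_common(3), by_type, by_tag)
```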

This profile is most interesting to me because it will allow me to programmatically find multiple ways of expanding the set of people I follow. In the style of standard collaborative recommendation systems, users who engage with Twitter the same way I do (possibly with more engagements, but with a similar time and content profile) can become sources of recommendations based on the accounts they engage with most. On the other hand, I can also look for content creators who don't necessarily follow and engage the way I do, but instead create content that matches what I engage with most. For example, if I'm retweeting a lot of information about drones in the evenings, then an account that posts the day's drone news in review around my active time could be recommended.

One focus that would be particularly interesting to users trying to increase their visibility is retweets. If a user is most active in the mornings and only somewhat active in the evenings, but only retweets in the evenings, then the most interesting time to post content relevant to them would be the evening. Further, if they are most likely to retweet content about a certain topic or hashtag in that evening window, the content creator could refine their tweet to match.

All in all, I'm very hopeful that a user-first approach, later expanded to look at how that user matches their network, will let me jump from an idea to a useful project in a short period of time.


Fractal – A Complex Motion Model

When I last left off, I was planning to develop a complex MotionType to highlight the power and versatility of the platform. To do this, I decided to create a stochastic, evolution-based Markov model that abstracts away the decision-making process and instead focuses on how the other robot moves. Every time it receives an update, it looks at the transition that happened and marks it down in its notebook of sorts. When it comes time to predict forward, it looks at the current state of the robot and, using a weighted random distribution based on past observations (more observations = more likely), picks the next state for the robot. It repeats this until it has predicted the required number of steps forward. It amounts to a Markov model of the other robot's behavior. This is particularly useful for the reverse-wave approach that I'm going to explain in the next post, because the distribution of the robot's location, based on multiple predictions at any given step, gives a best estimate of the direction and time to fire. I call this new, more complex MotionType ProbableAction.
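
Since the real MotionType lives in Java, here's a hedged Python sketch of the core idea (the state encoding and names are illustrative, not the actual ProbableAction code):

```python
import random
from collections import Counter, defaultdict

transitions = defaultdict(Counter)  # state -> counts of observed next states

def observe(prev_state, next_state):
    # "Mark it down in the notebook": count every observed transition.
    transitions[prev_state][next_state] += 1

def predict(state, steps):
    # Walk forward, sampling each next state weighted by how often it was seen.
    path = []
    for _ in range(steps):
        counts = transitions[state]
        if not counts:
            break  # no observations from this state yet
        state = random.choices(list(counts), weights=list(counts.values()))[0]
        path.append(state)
    return path

observe("straight", "straight")
observe("straight", "turn_left")
observe("turn_left", "straight")
print(predict("straight", steps=5))
```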

Fractal – MotionTypes and Projections

There is one more method that I've left off of the MotionModel, and it is perhaps the most important: the predict(TimeCapsule, int) method, where the TimeCapsule history is used to make a projection forward and predict where the robot in question is going to be in the next int turns. Beyond that, I'm going to need to fix the MotionType and MotionProjection classes, because both are focused on the old way of doing things, and if nothing else, Occam's Razor can probably clean out useless functionality. The code for where I'm starting can be found here.
