Update: Our current solution for this is working pretty well. Skip to it here.
Storemapper, like many, many B2B apps, allows our users to create, edit and export records in bulk via a CSV upload. I have spent a ton of time trying to find the best method for uploading and asynchronously processing very large (10,000+ row) CSV files. This post documents our work to date, is a plea for collective help from the internet, and will hopefully become the public home of the best solution we can come up with.
Why its own blog post? Well, I feel that:
A. This feels like a very common need for lots of Rails apps.
B. I’m surprised that I haven’t turned up a really optimal solution anywhere on the web.
Here are the user story details:
- A User uploads a large CSV file (say 20 columns of data by 10,000 rows or ~2MB file size).
- The CSV is immediately validated for correct file type and appropriate formatting and column headings. Appropriate error messages are rendered if not.
- The file is processed asynchronously. The exact nature of the processing is not critical here, but let’s say there is an ID column: the background job either finds the record by ID and updates it, or creates a new record if the ID is blank or not found.
- When all rows are processed the User receives an email notification of completion — other callbacks are triggered here too.
- If any rows do not pass validation, the User receives a single email with a CSV attached of the invalid rows with validation error messages.
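The find-or-create step in the user story above can be sketched in plain Ruby. This is only an illustration: the in-memory `RECORDS` hash and `upsert_row` helper are hypothetical stand-ins for what the real app would do with an ActiveRecord model (`Store.find_by(id: ...)` and friends).

```ruby
require 'csv'

# Minimal in-memory stand-in for the real ActiveRecord model -- names
# here are hypothetical, purely to illustrate the per-row logic.
RECORDS = {}

def upsert_row(row, store: RECORDS)
  id = row['id']
  if id && store.key?(id)
    store[id].merge!(row.to_h)                       # found: update
  else
    new_id = id || (store.keys.max || 0) + 1
    store[new_id] = row.to_h.merge('id' => new_id)   # blank or unknown: create
  end
end

csv_data = "id,name\n1,Store A\n,Store B\n"
CSV.parse(csv_data, headers: true).each do |row|
  row['id'] = row['id'] && row['id'].to_i  # blank id stays nil
  upsert_row(row)
end
```

Here the second row has a blank ID, so it gets a freshly assigned one; the first row would update record 1 in place if it already existed.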
Some technology constraints we are working with:
- Heroku 2x dynos (1GB)
- Limited to Heroku’s 30 second timeout
- Resque & RedisToGo
So here are few of the approaches we have tried and their flaws:
1. Naive processing CSV in controller
Like many developers I started with the Railscast solution. Basically just reading the CSV in the controller. This actually works reasonably well for files of up to about 1,000 rows but anything more than that, or if your app is even remotely busy at the time of upload, you will start getting H12 timeout errors when your controller action takes more than 30s to respond. If you’re a total noob at coding like I was at the time, it will be very frustrating to see that everything works fine on your local machine, which doesn’t have the same timeout constraints as Heroku dynos. Timeout errors on the controller action are one of the primary pinch points in this process.
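Stripped of the Rails plumbing, the naive approach boils down to something like the sketch below (the method and the `Record.create!` comment are illustrative, not our actual code). The point is that all parsing and processing happens synchronously inside the request, so the response can’t be sent until every row is done — which is exactly what trips the 30-second H12 timeout.

```ruby
require 'csv'

# Illustrative core of the naive approach: parse and process the whole
# file synchronously inside the controller action. The request cannot
# return until every row is handled.
def import_csv(file_contents)
  imported = 0
  CSV.parse(file_contents, headers: true) do |row|
    # the real app would do something like Record.create!(row.to_h) here
    imported += 1
  end
  imported
end

import_csv("id,name\n1,Alpha\n2,Beta\n")  # => 2
```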
Key flaw: controller action timeout around 700 rows
2. Simple Resque
So we knew that we needed to move the CSV processing to a background job. The first thing we tried was the simplest version we could think of. In the controller we would still use CSV.read (or CSV.parse, I don’t actually remember) to read the CSV file, but we would pass that data into Resque for the actual processing portion. This added a tiny bit of performance, but not much, and still led to lots of timeouts. Not to mention you have to pass the entire CSV data directly into the Resque worker, which is not best practice at all.
Key flaw: controller action timeout around 1,000 rows
3. Upload the raw text of the CSV to Postgres, process in Resque
So now we know that processing the file with Ruby CSV in the controller action is just not going to work. But you’ve got this params[:file] coming into your controller, and it’s not totally obvious how to get it from there to a Resque worker. Eventually the best solution we came up with was to add a text field to the Postgres db and throw the whole raw CSV file into the database. This actually works very well in terms of avoiding a controller timeout, as Postgres can write even a huge 10,000+ row file to the db quickly. Something as simple as the following was fairly performant:
user.file = params[:file]
user.save
A Resque worker is then queued up and it grabs the raw file from the database, parses it with Ruby CSV and processes it. Although this solves the timeout issue very well it has two key flaws:
- Because you are just accepting the raw file data and stashing it for later, you are not providing any feedback on the file to the user at the time of upload. Never underestimate the capacity of otherwise quite intelligent B2B users to upload an .XLS file, or a CSV file in a completely unrelated format to your template, or in one case (yes, this really happened) a .PNG screenshot of the data. No joke. This becomes a support headache and a bad experience for users.
- More critically, very large CSV uploads will still fail. Apparently Ruby CSV is not super memory efficient. Parsing and using the rows will load the entire processed CSV into memory. Above a few thousand rows this will cause your background worker to use up too much memory. Heroku will let you overspend on memory for a short time, but if your background workers are staying way over their memory allotment, Heroku will throw R14 and R15 errors and automatically start killing them off. I tried several gems that aim to improve on Ruby CSV (i.e. SmarterCSV) but none improved meaningfully enough for this to be a sufficient solution.
Key flaw(s): no file validation; worker memory usage = dead workers
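For context, the memory problem comes from materializing every parsed row at once. The stdlib’s CSV.foreach streams rows one at a time instead — the same basic idea that chunking gems like SmarterCSV build on. A minimal sketch of the streaming side:

```ruby
require 'csv'
require 'tempfile'

# CSV.read / CSV.parse build one big array of every parsed row, so
# memory grows with file size. CSV.foreach yields one row at a time,
# keeping memory use roughly constant regardless of row count.
def count_rows_streaming(path)
  count = 0
  CSV.foreach(path, headers: true) { |_row| count += 1 }
  count
end

file = Tempfile.new(['upload', '.csv'])
file.write("id,name\n" + (1..5).map { |i| "#{i},Store #{i}\n" }.join)
file.flush

count_rows_streaming(file.path)  # => 5
```

Streaming alone doesn’t fix everything (each row still has to be processed somewhere), but it keeps the worker’s memory footprint flat instead of proportional to file size.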
4. PapaParse: client-side CSV processing
4a. Pass processed data directly to Resque.
I know it’s not best practice, but it works fine even for fairly big uploads. Eventually, though, the controller will time out. It looks like this in the controller:
rows = params[:rows] # the row data from PapaParse
Resque.enqueue(CSVProcessor, user_id, rows)
4b. Serialize processed data into the db.
This is a variation on #3 above attempting to speed up the controller by quickly moving the processed CSV data to the database as a Ruby Hash.
# In the User model
serialize :csv_rows, Hash
# and in the controller
user.csv_rows = params[:rows]
But this will time out at about the same limit as #4a above.
Key flaw: for 4a and 4b above, the controller will time out no matter what you try to do with the data (even just writing it to the logs). It’s just too much data in the params for the Heroku dyno to handle.
4c. Stream the data row by row.
PapaParse has the option to stream the data row by row. We did try this, as it certainly avoids the H12 timeout issues above. But we use caching very aggressively and our overall app requests-per-minute rarely goes over 100. So sending 10,000 AJAX calls at high velocity is nearly guaranteed to crash the server.
Key flaw: waay too many AJAX calls
Here are a few other ideas that we haven’t yet tried in production.
5. Upload the file to Amazon S3
A lot of people have suggested this as a line of attack. But for me, I can’t see how it would be much of an improvement over #3, uploading to Postgres.
6. Use a giant Heroku dyno for the background worker
Our last resort right now, which we haven’t tried because it’s just going to be a lot of work, looks like this:
a) Pre-process the CSV file with PapaParse to verify formatting and file type.
b) Upload the unprocessed file directly to Postgres (or Amazon S3) and queue up a Resque worker.
c) Grab the raw file in a background worker and go back to using Ruby CSV, but this time on one of Heroku’s new giant production dynos.
So what next?
That’s where we are right now. Any suggestions? I would be happy to pay a consulting fee and/or to post the best solution here publicly. Feel free to comment here or on Twitter or email me directly.
Update: Our current solution
Thanks primarily to some hard work from Taufiq Muhammadi we have a pretty good solution that aggregates several approaches we were previously using.
The flow looks something like this:
- We are using the carrierwave_direct gem to upload the CSV files directly to Amazon S3.
- Use the “accept” parameter on Rails file_field to only allow ‘text/csv’ which literally won’t allow users to even click files that don’t have the .csv extension in the file picker.
- Before the form is submitted we use PapaParse to parse the entire CSV file client side (PP is incredibly fast and the user will barely notice even for a 100,000 row file) to check that it is in a valid CSV format and to verify a few things like correct headers. If not we halt the process and provide relevant error messages, explanations and links to support articles immediately.
- The un-parsed CSV is then uploaded directly to S3. Yes, this means we actually parse the same CSV file twice, but it’s worth it.
- A Resque worker is queued to fetch the CSV file and process it using the SmarterCSV gem.
- A notification email is sent to the user on success. If certain rows had errors, a new CSV is compiled from the error rows and attached to that email.
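Stripped of the Resque/S3 plumbing, the worker side of this flow looks roughly like the sketch below. The validation rule and CSV layout are made up for illustration — the real app validates against the actual model — but the shape is the same: process rows one at a time, and collect any failures into a second CSV for the notification email.

```ruby
require 'csv'

# Hypothetical per-row validation; returns an error message string,
# or nil when the row is fine.
def validate_row(row)
  'name is blank' if row['name'].nil? || row['name'].strip.empty?
end

# Process a CSV string row by row. Returns [processed_count, error_csv];
# error_csv is nil when every row passed, otherwise it is the CSV that
# would be attached to the notification email.
def process_csv(data)
  processed = 0
  errors = []
  CSV.parse(data, headers: true) do |row|
    if (message = validate_row(row))
      errors << row.to_h.merge('error' => message)
    else
      processed += 1 # real app: find-or-create the record here
    end
  end
  error_csv = unless errors.empty?
    CSV.generate do |out|
      out << errors.first.keys
      errors.each { |e| out << e.values }
    end
  end
  [processed, error_csv]
end

processed, error_csv = process_csv("id,name\n1,Alpha\n2,\n")
```

In the real flow this runs inside the Resque worker after fetching the file from S3, with SmarterCSV doing the chunked parsing instead of stdlib CSV.parse.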
This approach has several benefits:
- All the client-side parsing and immediate feedback has reduced our support tickets on this part of the process by at least 90%. Since the first bulk upload is a pretty critical part of our onboarding flow, this is huge.
- Sending the files directly to S3 brought a huge performance improvement. We used to live in constant fear of a user uploading some wildly mis-formatted 500,000 row CSV and taking down the whole app. But no more!
- An added bonus is that we now retain the actual uploaded CSV file. Our Admin > User view now exposes the most recently uploaded files, which can be directly downloaded and viewed to help users with more minor questions around formatting, what data goes in what column and so on. Almost all of our support tickets around bulk uploading used to require – Step 1: “Can you email me the file you are trying to upload please.” But now we can just directly download and view it, skipping a step.
Thanks for all the help via Twitter on this!