Long time back, I built an extension, by which you can take the openrefine mappings, and run on a hadoop cluster. The idea is you would run all transformations on a small dataset on your local machine. Once you are satisfied with the mappings, you would deploy the same on hadoop cluster. I have tested this on large datasets, and it works. Let me know how I can help.
You can check my github: https://github.com/rmalla1/OpenRefine-HD
Scanadu is a consumer medical devices company developing a human-centered suite of products using science and technology to empower everyday people to monitor and better understand their health — anytime, anywhere. This position is responsible for designing, building, deploying and supporting broadly used Cloud-based components used by a diverse set of products and outside partners. The person in this role will be responsible for creating/sharing/collaborating on materials describing the overall platform solution in the broadest and most open sense. This person will also develop a deep understanding of other technologies and potential competitive offerings and insight as to where the market is headed. You will have an enormous opportunity to make a large impact on the design, architecture, and implementation of cutting edge products used every day, by people you know. We are looking for a talented and passionate Senior Data Platform Architect to be part of our exciting team.
Yes, We do. We primarly use it as Job scheduling engine. I have multiple teams asking for data extracts, and I fashioned Jenkins to run extracts both adhoc as well as fixed schedule basis.