The Elasticiti team is looking forward to the ODSC (Open Data Science Conference), which begins on May 20 in Boston where we are a proud sponsor. We look forward to networking with peers and sharing ideas. And there are lots of topics we’re interested in hearing and learning more about when we’re there.
One of our top concerns: The complete and utter lack of a standardized nomenclature for digital advertising data. Whether you’re an analyst, data scientist, or data executive, you’re all too familiar with data munging. It’s the critical, yet painful process required to put any kind of data to use. And it takes up way too much of our time. And when it comes to video and working with datasets from numerous partners, that challenge is significantly amplified.
Without a standard nomenclature, data analysts are left with numerous questions, such as:
Value variation. Are 'pre-roll' and 'preroll' are the same thing? Ditto for 'P&G' and 'Proctor & Gamble.'
Dates. Are they in the European or American format? Is there a full date or just a month and year? Did someone switch to a YY-date format from at YYYY-format at some point? Excel does terrible things to dates once you use that data outside of Excel. Ensuring date fidelity is always a challenge.
Units. Is currency is included, or not included? If it is, how do we harmonize precision; are two decimals insufficient? For a sold CPM, only two is fine, but if we are working with an eCPM, is four decimals is preferred?
Missing data. If data is missing, do we keep it that way or use the last known good value or the average of the two values? What does the business user expect? Can it be reasonably inferred?
These questions must be answered, and all data analysts need a strategy for doing so. In the absence of a standardized nomenclature, we find some tools very useful, and we keep coming back to them. They are:
In R, we use gsub() all the time. This is grep substitution, and it finds a pattern and replaces it with something else. We frequently replace that something with nothing, as in the case of a $ sign in a CPM column when we want to use it as a numeric.
Hadley Wickham's lubridate package makes the guaranteed-to-have-them challenges of working with dates easy to resolve. Need to figure out what business quarter that date falls in? Lubridate makes it a snap.
Hadley's reshape2 package is also in many of our scripts. Data comes in wide format and bang! now it's in long format. Or vice versa. Try doing that in Excel.
In Python, not surprisingly, we use Pandas all the time, as well as dictionaries to categorize data into cleaner buckets. For use cases like the value variation above, we get great mileage out of FuzzyWuzzy for string matching. With a little training it is shockingly powerful. We use RE for regular expressions and datetime for, well, datetimes.
For uncharted and ad hoc territory, working through messy data is inevitable and to be expected. But when we work with the downstream data from many of the biggest web and ad-tech vendors, it seems like an opportunity for the industry to clarify and standardize along common definitions (e.g. date, like currency, how to handle taxonomies etc.). While our customers benefit from the source data knowledge we bring to a given project, we all would be better off spending our time on more analytic pursuits.
So those are some of the things on our minds going into the conference. What are your biggest concerns? Share them here.