Data Wrangling

The term 'data wrangling' usually refers to the repetitive, time-consuming tasks of data preparation, such as the acquisition, integration, manipulation, cleansing, enrichment and transformation of data.


We have created this catalog to be the first repository of data wrangling datasets. It gathers most of the datasets previously used by other data manipulation tools or presented in the literature. In addition, we have generated new datasets from newly collected data. Every dataset includes six examples of one particular problem, each with an input and the expected output.
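As a rough illustration of the structure described above, a dataset can be thought of as six input/output pairs against which a candidate transformation is checked. The sketch below is only an assumption about how one might represent and use such a dataset in Python; the example values, the names `email_domain_examples`, `accuracy` and `after_at`, and the in-memory layout are all hypothetical and do not reflect the actual file format of the catalog.

```python
from typing import Callable, List, Tuple

# One example is an (input, expected output) pair.
Example = Tuple[str, str]

# Hypothetical dataset for the problem "get the text after the '@' symbol".
# These six values are invented for illustration, not taken from the catalog.
email_domain_examples: List[Example] = [
    ("alice@example.com", "example.com"),
    ("bob.smith@mail.example.org", "mail.example.org"),
    ("carol@upv.es", "upv.es"),
    ("dave_01@example.net", "example.net"),
    ("eve@sub.example.co.uk", "sub.example.co.uk"),
    ("frank@localhost", "localhost"),
]


def accuracy(transform: Callable[[str], str], examples: List[Example]) -> float:
    """Return the fraction of examples where transform(input) == expected."""
    hits = sum(1 for inp, expected in examples if transform(inp) == expected)
    return hits / len(examples)


def after_at(text: str) -> str:
    """Candidate solution: everything after the first '@'."""
    return text.split("@", 1)[1]


if __name__ == "__main__":
    print(accuracy(after_at, email_domain_examples))
```

Keeping the examples as plain pairs makes it easy to score any candidate function, whether hand-written or produced by a tool, against the expected outputs.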

This data is made available under the Open Data Commons Attribution License: https://opendatacommons.org/licenses/by/1.0/ .

What domains are included in the catalog?

  • Dates: Related to date manipulation, such as detecting or extracting months or days from a substring, or extending 2-digit years to the full 4-digit format.
  • Emails: Related to email manipulation, such as getting the words after or before the '@' symbol, or appending the '@' symbol at the end of a string.
  • Names: Related to the manipulation of personal names, such as getting the initials of a name or creating a user login.
  • Phones: Related to the manipulation of phone numbers, such as setting the prefix from a country name or code, or detecting a phone number in a text.
  • Times: Related to strings containing times, such as converting between 24-hour and 12-hour formats or changing the time zone.
  • Units: Related to the conversion of units of length, mass, time, electric current, thermodynamic temperature, and others such as volume.
  • Freetext: Basic string manipulation problems.
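To make a few of the domains above more concrete, here is a minimal Python sketch of three such transformations: expanding a 2-digit year, getting the initials of a name, and converting from 24-hour to 12-hour format. The function names, the pivot rule for choosing the century, and the output formats are assumptions made for illustration; the catalog's datasets may expect different conventions.

```python
def expand_two_digit_year(yy: str, pivot: int = 50) -> str:
    """Expand a 2-digit year to 4 digits.

    Assumption: values >= pivot belong to the 1900s, the rest to the 2000s.
    """
    value = int(yy)
    century = 1900 if value >= pivot else 2000
    return str(century + value)


def initials(full_name: str) -> str:
    """Return the upper-case initials of a name, each followed by a dot."""
    return "".join(part[0].upper() + "." for part in full_name.split())


def to_12_hour(time_24: str) -> str:
    """Convert 'HH:MM' in 24-hour format to 'H:MM AM/PM'."""
    hours, minutes = map(int, time_24.split(":"))
    suffix = "AM" if hours < 12 else "PM"
    hours_12 = hours % 12 or 12  # 0 and 12 both display as 12
    return f"{hours_12}:{minutes:02d} {suffix}"


if __name__ == "__main__":
    print(expand_two_digit_year("99"))
    print(initials("ada lovelace"))
    print(to_12_hour("13:30"))
```

Each function maps one input string to one output string, which matches the input/expected-output shape of the examples in the catalog.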
Do you want to collaborate? Send a dataset!
2017 |   Data Wrangling [Dataset Repository] |   Universitat Politècnica de València |   DMiP Team