OpenRefine

Standardize Dates

I used the built-in OpenRefine common transform "To date" to convert the more than 140,000 date entries to a single format. This lets me visualize the entries as a continuous timeline.
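For readers working outside OpenRefine, the same idea can be sketched in Python. This is only an illustration, not what OpenRefine does internally: the function name `to_iso_date` and the list of candidate input formats are assumptions, since the formats actually present in the dataset are not listed here.

```python
from datetime import datetime
from typing import Optional

# Hypothetical input formats; the real dataset may use others.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def to_iso_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no candidate format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Entries that match none of the formats come back as `None`, so they can be reviewed by hand rather than silently guessed.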

Fingerprint Clustering

Standardize Lay

Because the entries for "Lay" are not fully standardized in format, I used fingerprint clustering, whose algorithm fits my needs. In particular, it normalizes whitespace, lowercases characters, and removes punctuation, so those features play no role in telling one fingerprint from another. These attributes carry the least meaning, yet they tend to be the parts of a string that vary the most, so removing them makes clusters emerge much more readily. If you would like to read more on how this clustering method could be useful to your applications, please see this article.
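The fingerprint steps described above can be sketched in Python as follows. This is a minimal approximation rather than OpenRefine's own code: it handles only ASCII punctuation, the function names are my own, and the sample lay values in the usage note are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def fingerprint(value: str) -> str:
    """Approximate a fingerprint key: trim, lowercase, strip punctuation, sort unique tokens."""
    text = value.strip().lower()
    # Fold accented characters to their unaccented base form.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop ASCII punctuation only (an assumption of this sketch).
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Splitting on whitespace also collapses repeated spaces.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster_by_fingerprint(values: Iterable[str]) -> Dict[str, List[str]]:
    """Group raw values by key; keep only groups with more than one distinct spelling."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return {key: vals for key, vals in groups.items() if len(set(vals)) > 1}
```

For example, `cluster_by_fingerprint(["1/175", " 1/175", "1/175.", "1/200"])` groups the first three spellings under one key. Note that removing punctuation also makes "1/175" and "11/75" collide, which is one reason proposed clusters still need a manual check before merging.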

For cluster edits that were not clear-cut, that is, not a simple matter of formatting, I opted to nullify the field. I considered this acceptable because it affected fewer than 100 of the populated lay fields, or less than 0.1% of my overall dataset. At the end of this process, I found that over half of my dataset (around 80,000 entries) has lay populated.

Proxy for Crew Ethnicity

I first used OpenRefine's clustering function to cluster and edit the existing entries for skin. I am using skin as the closest available proxy for ethnicity, because the hair and eye entries do not provide as much detail as the skin column. Using fingerprint clustering, for the reasons outlined above, I was able to quickly classify all the existing entries into seven categories: white, black, albino, portuguese, brown, Indian, and mulatto. I looked up descriptors that I did not understand in a dictionary and nullified fields when the best match was not clear. Note that "mulatto" is an outdated term for a person with one white parent and one black parent. From this first step alone, I found that half of my dataset had skin, and therefore ethnicity by proxy, populated.

Next, I duplicated the first name and last name fields so that I could cluster them phonetically using the Metaphone clustering algorithm, which is designed for the English language. Once the clustering was complete, I matched clustered entries that had a skin value to those without one, relying heavily on the assumption that names that sound the same come from the same background. I added this data in a separate column because I personally believe it to be inaccurate; I would rather conduct accurate analysis based on 80,000 records than use highly biased data to gain fewer than 10,000 additional entries.
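The matching step can be illustrated with a rough Python sketch. Several things here are assumptions and not part of the original workflow: `crude_sound_key` is a deliberately simple stand-in for a phonetic key and is NOT Metaphone, the record field names (`first`, `last`, `skin`, `skin_inferred`) are hypothetical, and the rule of copying a value only when every known entry in a name cluster agrees is my own conservative choice.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Set, Tuple

Record = Dict[str, Optional[str]]


def crude_sound_key(name: str) -> str:
    """Hypothetical stand-in for a phonetic key (NOT Metaphone).

    Keeps the first letter, drops later vowels plus h, w, y, and collapses repeats.
    """
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for ch in letters[1:]:
        if ch in "aeiouyhw":
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def infer_skin(records: List[Record],
               key_fn: Callable[[str], str] = crude_sound_key) -> List[Record]:
    """Copy records, adding a separate 'skin_inferred' column; 'skin' is never modified."""
    known: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for rec in records:
        if rec["skin"]:
            known[(key_fn(rec["first"] or ""), key_fn(rec["last"] or ""))].add(rec["skin"])

    result = []
    for rec in records:
        new_rec = dict(rec)
        new_rec["skin_inferred"] = None
        if not rec["skin"]:
            cluster_key = (key_fn(rec["first"] or ""), key_fn(rec["last"] or ""))
            candidates = known.get(cluster_key, set())
            # Only infer when all known entries in the cluster agree.
            if len(candidates) == 1:
                new_rec["skin_inferred"] = next(iter(candidates))
        result.append(new_rec)
    return result
```

Keeping the inferred values in their own column, as in this sketch, makes it easy to exclude them from any analysis that should rest only on recorded data.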