Processed Data
Statistics
File Type: CSV stored as OpenRefine Database
Entries: 140,907
Size: 41 MB
Data Diagnostic
The cleaned, final dataset has several notably underpopulated fields. Birthplace and height are the sparsest, with over 133,000 and 99,000 blank values respectively. Lay is also underpopulated, with around 85,000 blank values, but could still be useful. Ethnicity, by proxy, follows with 77,000 blank values, or around half the dataset. Residence and sperm haul are each about half populated. Oil hauls are well documented, with only 15,000 blank values. This is an interesting observation and could be analyzed further: how well each product was documented may indicate which products were more important at various points in the whaling timeline. Lastly, master, rig, vessel, tonnage, and voyage are completely populated.
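A diagnostic like the one above can be reproduced by counting blank values per column in the exported CSV. The following is a minimal sketch using only the Python standard library. The column names and the three sample rows are made up for illustration and are not taken from the dataset; the helper name count_blank_fields is likewise hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the real export's column names may differ.
SAMPLE = (
    "name,birthplace,height,vessel\n"
    "A. Sailor,,,Ship One\n"
    "B. Sailor,Nantucket,,Ship Two\n"
    "C. Sailor,  ,66,Ship Two\n"
)


def count_blank_fields(rows):
    """Return (row_count, {column: blank_count}) for an iterable of dict rows.

    A value counts as blank if it is missing (None) or contains only whitespace.
    """
    blanks = Counter()
    total = 0
    for row in rows:
        total += 1
        for field, value in row.items():
            if field is None:
                # csv.DictReader stores surplus cells under the key None; skip them.
                continue
            is_blank = value is None or not value.strip()
            # Adding 0 still registers the column, so fully populated
            # columns show up in the result with a count of 0.
            blanks[field] += 1 if is_blank else 0
    return total, dict(blanks)


if __name__ == "__main__":
    total, blanks = count_blank_fields(csv.DictReader(io.StringIO(SAMPLE)))
    # List the sparsest columns first.
    for field, n in sorted(blanks.items(), key=lambda kv: -kv[1]):
        print(f"{field}: {n} of {total} blank")
```

For the real file, the io.StringIO(SAMPLE) argument would be replaced by an open file handle on the exported CSV (opened with newline="" as the csv module recommends).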
Areas of High Bias To Note When Analyzing
Throughout my data cleaning walkthrough, I have noted the processes that introduced bias. I want to draw particular attention to using skin as a proxy for ethnicity. Both clustering to clean existing data and "recreating" data by clustering names phonetically introduce large amounts of bias. Conclusions drawn from these data points should be heavily scrutinized, especially as the assumption that people with similar-sounding names come from the same background does not necessarily hold.
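To make the concern about phonetic clustering concrete, here is a small sketch of how a phonetic key collapses different names into one cluster. It uses classic American Soundex purely as a stand-in; the walkthrough does not say which phonetic method was actually used, and the example names are illustrative, not drawn from the dataset.

```python
# Classic American Soundex, used only as an illustrative phonetic key.
_CODES = {}
for _letters, _digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character Soundex key for a name ("" if it has no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are ignored and do not separate repeated codes.
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        # Vowels carry no code, so they reset prev and allow a repeat.
        prev = code
    # Pad with zeros or truncate to one letter plus three digits.
    return (key + "000")[:4]


if __name__ == "__main__":
    for a, b in (("Robert", "Rupert"), ("Lee", "Li")):
        print(a, soundex(a), b, soundex(b))
    # prints:
    # Robert R163 Rupert R163
    # Lee L000 Li L000
```

Names such as "Lee" and "Li" receive the same key even though their bearers may come from entirely different backgrounds, so any attribute copied across such a cluster inherits that assumption.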