Overview

Crew List Dataset

My primary dataset is a Microsoft Excel spreadsheet containing 168,367 crew list entries for voyages departing from New England between 1809 and 1927. This dataset is accessible from the New Bedford Whaling Museum Whaling History Project. The first of the two worksheets in the file contains a number of fields with information on the crew members, including:

Crew Entry ID: unique identifier for each entry
Voyage ID: unique identifier for each voyage
Vessel ID: unique identifier for each vessel
Voyage Date
First Name and Last Name
Birthplace
Resident City, State, and Country
Citizenship
Age
Height in Feet and Inches
Skin, Hair, Eye Color
Rank and Lay

The second worksheet contains fields that, when linked to the first worksheet, can yield interesting insights:

Rig
Custom House

Luckily, the dataset was provided to me in a form that did not require scanning, OCRing, or web scraping. I have the New Bedford Whaling Museum volunteers to thank for that. While the dataset contains many fields, only the "rig" field is completely populated across all entries. The data is most complete, and most accurate according to the New Bedford Whaling Museum, for whaling trips registered at the New Bedford Custom House. To preserve historical and methodological accuracy in testing my hypotheses, that is, to introduce as little bias or "created" (inferred) data as possible, I decided to limit the scope of the project to whaling as portrayed in the New Bedford Custom House documents. This decision introduces a bias of its own, in that I am explicitly ignoring other available data points; however, drawing conclusions from those sparsely populated points would require heavy manipulation, and with it more bias, which I believe is less preferable than limiting the analysis. Other notable issues that need to be addressed in the dataset include:

CSV Conversion: Excel file needs to be converted to a CSV to be manipulated by digital history tools
Proxy for Crew Origin: The Birthplace and Resident City, State, and Country fields are sparsely populated, so a proxy needs to be constructed for at least the origin of crew members
Standardize Lay: Lay (a fraction of earnings) does not have a standard format throughout the dataset
Link Worksheets: Rig and Custom House need to be linked to the crew list dataset via Voyage ID
Expand Acronyms: The Rig field needs to have acronyms expanded using a legend table
Filter for Scope: Custom House needs to be filtered to adjust project scope to New Bedford
Proxy for Ethnicity: The Skin, Hair, and Eye Color fields are inconsistently populated and lack standard wording. A proxy needs to be constructed to map each crew member to an ethnicity
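
As a rough illustration of the linking, filtering, acronym expansion, and lay standardization steps listed above, here is a minimal Python sketch using only the standard library. It is not the actual cleaning code. The column names (voyage_id, rig, custom_house, lay), the two-entry rig legend, and the lay formats it recognizes ("1/180", "1-180", or a bare denominator such as "180") are all assumptions made for the example; the real headers, legend, and lay spellings should be taken from the dataset itself.

```python
import re
from fractions import Fraction

# Hypothetical legend for illustration only; the real one ships with the dataset.
RIG_LEGEND = {"Bk": "Bark", "Sh": "Ship"}


def parse_lay(raw):
    """Return the lay as a Fraction, or None if the text cannot be parsed."""
    text = raw.strip()
    # "1/180" or "1-180": explicit numerator and denominator (assumed formats).
    match = re.fullmatch(r"(\d+)\s*[/-]\s*(\d+)", text)
    if match:
        numerator, denominator = int(match.group(1)), int(match.group(2))
    elif re.fullmatch(r"\d+", text):
        # Assumption: a bare number is the denominator of a 1/N share.
        numerator, denominator = 1, int(text)
    else:
        return None
    return Fraction(numerator, denominator) if denominator else None


def link_and_filter(crew_rows, voyage_rows, custom_house="New Bedford"):
    """Attach rig and custom house to each crew row, keeping one custom house."""
    by_voyage = {row["voyage_id"]: row for row in voyage_rows}
    linked = []
    for crew in crew_rows:
        voyage = by_voyage.get(crew["voyage_id"])
        # Drop crew rows with no matching voyage or a different custom house.
        if voyage is None or voyage["custom_house"] != custom_house:
            continue
        rig = voyage["rig"]
        linked.append({
            **crew,
            "custom_house": voyage["custom_house"],
            "rig": RIG_LEGEND.get(rig, rig),  # unknown codes pass through unchanged
            "lay_fraction": parse_lay(crew["lay"]),
        })
    return linked
```

Lays that do not match any recognized pattern come back as None rather than a guessed value, which keeps inferred data out of the cleaned file and leaves those rows easy to review by hand.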

Voyage Dataset

My supporting dataset is a Microsoft Excel spreadsheet containing information on 16,650 whaling voyages between the 18th and 20th centuries. This dataset is accessible from the New Bedford Whaling Museum Whaling History Project. The file contains one worksheet with information relevant to the primary dataset analysis, including these particularly interesting fields:

Voyage ID: unique identifier for each voyage
Ground, or the waters the ship sailed
Year In and Year Out of the voyage
Bone, Sperm, and Oil counts for the voyage
Information on the master of the ship, including birth location and marital status if available
Vessel tonnage, build place, and build date
Ultimate fate of the vessel as short text commentary

Given that both datasets share a unique identifier field, they can be linked in a relational database or joined together in CSV form. While a relational database is the better choice for data processing once computer memory and database size are taken into account, I have opted to join the relevant fields of the two datasets into one CSV file. I acknowledge the computational tradeoff; however, undoing the join would take only a little code. Moreover, the graphical and visualization tools used in the latter half of this project, particularly the online ones, are significantly easier to use if all the data sits in one file. Lastly, reshaping the dataset to visualize along certain dimensions, just as a pivot table focuses analysis on chosen dimensions, is easy to do using online processing tools like Tableau. I hope to clean and use this dataset to complement my primary data by:

CSV Conversion: Excel file needs to be converted to a CSV to be manipulated by digital history tools
Joining Tables: Join the relevant fields of this dataset to the primary dataset. I will do this only after filtering the primary dataset, so there is no need to filter this dataset by year; its information will join only to the already filtered primary dataset entries.
Derive Voyage Length: Add a 'length of voyage' field computed from the 'year in' and 'year out' fields for further analysis
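
The join and the derived voyage length could look something like the following Python sketch, again using only the standard library. The field names (voyage_id, ground, year_out, year_in, voyage_length) are placeholders I chose for the example, not the dataset's real headers, and I am assuming that 'year out' is the departure year and 'year in' is the return year. Because the loop runs over the already filtered crew rows, voyages outside the project scope are never pulled in.

```python
import csv
import io


def voyage_length_years(year_out, year_in):
    """Years between departure (year out) and return (year in), or None."""
    try:
        length = int(year_in) - int(year_out)
    except (TypeError, ValueError):
        return None  # one of the years is missing or not a number
    return length if length >= 0 else None


def join_voyages(crew_rows, voyage_rows, fields=("ground", "year_out", "year_in")):
    """Copy the chosen voyage fields onto each crew row via the voyage ID."""
    by_voyage = {row["voyage_id"]: row for row in voyage_rows}
    joined = []
    for crew in crew_rows:
        voyage = by_voyage.get(crew["voyage_id"], {})
        merged = dict(crew)
        for field in fields:
            merged[field] = voyage.get(field, "")  # blank when no match exists
        merged["voyage_length"] = voyage_length_years(
            voyage.get("year_out"), voyage.get("year_in")
        )
        joined.append(merged)
    return joined


def rows_to_csv(rows):
    """Serialize the joined rows as one CSV document, returned as a string."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Since only whole years are compared, a voyage that leaves and returns in the same calendar year gets a length of 0, and rows with a missing or inverted pair of years are left blank rather than estimated.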

Please see this page to download the raw datasets in XLSX format.