Overview

Crew List Dataset

My primary dataset is a Microsoft Excel spreadsheet containing 168,367 crew list entries for voyages departing from New England between 1809 and 1927. This dataset is accessible from the New Bedford Whaling Museum Whaling History Project. The first of two worksheets in the file contains a number of fields with information on the crew members, including:

  • Crew Entry ID: Unique identifier for each entry
  • Voyage ID: Unique identifier for each voyage
  • Vessel ID: Unique identifier for each vessel
  • Voyage Date
  • First Name and Last Name
  • Birthplace
  • Resident City, State, and Country
  • Citizenship
  • Age
  • Height in Feet and Inches
  • Skin, Hair, Eye Color
  • Rank and Lay 

The second worksheet contains relevant information that, when linked to the first worksheet, can yield interesting insights:

  • Rig
  • Custom House

Luckily, the dataset was provided to me in a form that did not require scanning, OCRing, or web scraping. I have the New Bedford Whaling Museum volunteers to thank for that. While the dataset contains many fields, only the "rig" field is completely populated across all the data. According to the New Bedford Whaling Museum, the data is most complete and most accurate for whaling trips registered at the New Bedford Custom House. To maintain historical and methodological accuracy in testing my hypotheses, that is, to introduce the least amount of bias or "created" (inferred) data, I decided to bound the scope of the project to whaling as portrayed in the New Bedford Custom House documents. This decision introduces its own bias, in the sense that I am explicitly ignoring other available data points; however, drawing conclusions from those sparsely populated points would require heavy manipulation, and the bias that comes with it, which I believe is less preferable than limiting the analysis. Other notable issues that need to be addressed in the dataset include:

  • CSV Conversion: Excel file needs to be converted to a CSV to be manipulated by digital history tools
  • Proxy for Crew Origin: Birthplace and Resident City, State, and Country fields are sparsely populated, so a proxy needs to be constructed for at least the origin of crew members
  • Standardize Lay: Lay (a fraction of earnings) does not have a standard format throughout the dataset
  • Link Worksheets: Rig and Custom House need to be linked to the crew list dataset via Voyage ID
  • Expand Acronyms: The Rig field needs to have acronyms expanded using a legend table
  • Filter for Scope: Custom House needs to be filtered to adjust project scope to New Bedford
  • Proxy for Ethnicity: Skin, Hair, and Eye color fields are inconsistently populated without standard wording. A proxy needs to be constructed to map each crew member to an ethnicity
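
Several of these steps (standardizing lay, linking on Voyage ID, expanding rig acronyms, and filtering by custom house) can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the processing actually used for the project: the column names, the sample rows, the lay formats ("1/150", "1-16", or a bare "175" read as 1/175), and the rig legend entries are all assumptions made up for the example.

```python
import re
from fractions import Fraction

# Hypothetical sample rows; real column names and values may differ.
crew = [
    {"crew_id": "C1", "voyage_id": "V1", "lay": "1/150"},
    {"crew_id": "C2", "voyage_id": "V1", "lay": "1-16"},
    {"crew_id": "C3", "voyage_id": "V2", "lay": "175"},
    {"crew_id": "C4", "voyage_id": "V3", "lay": ""},
]
voyage_info = [
    {"voyage_id": "V1", "rig": "Sch", "custom_house": "New Bedford"},
    {"voyage_id": "V2", "rig": "Bk", "custom_house": "Nantucket"},
]
# Illustrative legend entries only; the real legend table should be used.
RIG_LEGEND = {"Sch": "Schooner", "Bk": "Bark"}

# Accepts "n/d", "n-d", or a bare "d" (assumed here to mean 1/d).
LAY_PATTERN = re.compile(r"^\s*(?:(\d+)\s*[/-]\s*)?(\d+)\s*$")


def parse_lay(raw):
    """Return the lay as a Fraction, or None if the text is not recognized."""
    match = LAY_PATTERN.match(raw or "")
    if not match:
        return None
    numerator = int(match.group(1) or 1)
    denominator = int(match.group(2))
    if denominator == 0:
        return None
    return Fraction(numerator, denominator)


def link_and_filter(crew_rows, voyage_rows, custom_house="New Bedford"):
    """Attach rig and custom house by voyage ID, keeping one custom house."""
    by_voyage = {row["voyage_id"]: row for row in voyage_rows}
    linked = []
    for row in crew_rows:
        info = by_voyage.get(row["voyage_id"])
        # Drop crew entries with no matching voyage or a different custom house.
        if info is None or info["custom_house"] != custom_house:
            continue
        linked.append({
            **row,
            "lay_fraction": parse_lay(row["lay"]),
            # Fall back to the raw code when the legend has no entry for it.
            "rig": RIG_LEGEND.get(info["rig"], info["rig"]),
            "custom_house": info["custom_house"],
        })
    return linked


cleaned = link_and_filter(crew, voyage_info)
```

Unrecognized lay values are kept as None rather than guessed, which is in keeping with the goal of introducing as little inferred data as possible.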

Voyage Dataset

My supporting dataset is a Microsoft Excel spreadsheet containing information on 16,650 whaling voyages between the 18th and 20th centuries. This dataset is accessible from the New Bedford Whaling Museum Whaling History Project. The file contains one worksheet with information relevant to the primary dataset analysis, including these particularly interesting fields:

  • Voyage ID: Unique identifier for each voyage
  • Ground, or the waters in which the vessel hunted
  • Year In and Year Out of the voyage
  • Bone, Sperm, and Oil counts for the voyage
  • Information on the master of the ship, including birth location and marital status if available
  • Vessel tonnage, build place, and build date
  • Ultimate fate of the vessel as short text commentary

Given that both datasets share a unique identifier field, they can be linked in a relational database or joined together in CSV form. A relational database is the more practical choice for data processing once computer memory and database size are taken into account, but I have opted to join the relevant fields of the two datasets into one CSV file instead. I acknowledge the computational tradeoff; however, the join could be undone with a short script if needed. Moreover, it is significantly easier to use graphical and visualization tools, particularly online ones, for the latter half of this project if all the data sits in one file. Lastly, reshaping the dataset to visualize it along certain dimensions, just as a pivot table focuses analysis on chosen dimensions, is easy to do in tools like Tableau. I hope to clean and use this dataset to complement my primary data by:

  • CSV Conversion: Excel file needs to be converted to a CSV to be manipulated by digital history tools
  • Joining Tables: Join the relevant fields of this dataset to the primary dataset. I will be doing this only after filtering the primary dataset, so there will be no need to filter for year in this dataset, as the information will join only to the already filtered primary dataset entries.
  • Derive Voyage Length: Add a 'length of voyage' field computed from the 'year in' and 'year out' fields for further analysis
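
The join and the derived voyage length could look roughly like the sketch below, again using only the Python standard library. The column names and sample rows are hypothetical, and the code assumes that 'year out' is the departure year and 'year in' the return year, so the length is 'year in' minus 'year out'. Because only years are used, the result is accurate to about a year at best.

```python
import csv
import io

# Hypothetical sample rows; real column names and values may differ.
crew_rows = [
    {"crew_id": "C1", "voyage_id": "V1", "rank": "Seaman"},
    {"crew_id": "C2", "voyage_id": "V2", "rank": "Cooper"},
    {"crew_id": "C3", "voyage_id": "V9", "rank": "Mate"},
]
voyages = [
    {"voyage_id": "V1", "ground": "Pacific", "year_out": "1850", "year_in": "1853"},
    {"voyage_id": "V2", "ground": "Atlantic", "year_out": "1861", "year_in": ""},
]


def voyage_length(year_out, year_in):
    """Return year_in - year_out in whole years, or None if it cannot be computed."""
    try:
        length = int(year_in) - int(year_out)
    except (TypeError, ValueError):
        return None
    # A negative length signals a data-entry problem, so leave it blank.
    return length if length >= 0 else None


def join_voyages(crew, voyage_rows, fields=("ground", "year_out", "year_in")):
    """Copy selected voyage fields onto each crew row, matched on voyage_id."""
    by_id = {v["voyage_id"]: v for v in voyage_rows}
    joined = []
    for row in crew:
        voyage = by_id.get(row["voyage_id"], {})
        merged = dict(row)
        for field in fields:
            merged[field] = voyage.get(field, "")
        merged["voyage_length_years"] = voyage_length(
            merged["year_out"], merged["year_in"]
        )
        joined.append(merged)
    return joined


def to_csv(rows):
    """Serialize the joined rows as CSV text (None is written as an empty cell)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


joined = join_voyages(crew_rows, voyages)
csv_text = to_csv(joined)
```

Since the crew rows drive the join, voyages that do not appear in the filtered crew list are simply never copied over, which is why no separate filter on the voyage dataset is needed.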

Please see this page to download the raw datasets in XLSX format.