Why I’m bad with names, but it turned out for the best.

Image for post
Nobody:
The credits: Tom Skerritt, Sigourney Weaver, Veronica Cartwright, Harry Dean Stanton, John Hurt, Ian Holm, Yaphet Kotto, Bolaji Badejo, Helen Horton, Eddie Powell, Gordon Carroll, David Giler, Walter Hill, Ivor…

Background

The credit sequence is a sad excuse for an effort to connect people who “touch the film” with those who watch it in its final form. It’s such a dull format that it has become customary for theaters to brighten the room for moviegoers to comfortably leave the theater. I want to curb this custom and get people invested in the countless names that scroll by. In my previous article, I explain why I ran with tagging names with nationality — put simply, strangers instantly connect when they find out they share a common origin story.

I want to make something like the sample credit sequence below. Let’s call this format “flag-tagged.”

Image for post
The flag-tags are randomly assigned and do not accurately represent ethnonationality.

This article is about how one developer’s quick-and-dirty tech solutions can’t change a culture but can instead act as a powerful tool to expose and correct naïveté. Whose? Mine, for starters.

Image for post
Situation ‘A’: I am the go-to credits-processing guy.
Situation ‘B’: Where I am now, developing a client-side post-production solution.

I will start off by addressing the hopelessness of my original project. Ideally, I would integrate nationality information into films themselves (flag-tagging) to make the credit sequence, hopefully, more enticing than the bathroom at the end of a movie. Since I don’t run an established editing operation and I’m not given footage as a matter of course for rapid credits-processing, I’m more inclined to write a script that automates flag-tagging by (1) programmatically finding names in streamed video and (2) inserting animated graphics to retroactively assign nationality information. This is above my pay grade.

Image for post
Over time, the credits grow longer and list more distinct roles. [1]

But I trudged onward, dual-wielding classification tools and public databases in hopes of revealing an ethnonational narrative in the long history of feature films. That history is well over 100 years old and involves roughly 300,000 credited names. Augmenting the name data was a huge process, and the challenges I came across led me to narrow my scope to a small, intriguing subset of people.

Please Onomastics Responsibly

From the name Mark Hamill, you cannot definitively gather ethnic nationality, but you can assign a probability that the name was given by parents of a certain ethnonational lineage. This kind of task is perfectly suited for a classification algorithm, which dissects inputs and uses machine learning models to sort them into the categories that most closely align with their attributes. Let’s see how NamSor predicts the heritage of the name of the famous American actor, who we know to be of English, Irish, Scottish, Welsh (paternal), and Swedish (maternal) descent.

Baby Name Wizard and Ancestry serve as the base-truth origin for the first and last names. For “Mark Hamill, U.S.A.,” NamSor’s most confident predictions are “British” and then “(German).”

See that it correctly guessed British? That’s 1 for 1! Only 299,999 to go!

Image for post
NamSor’s Diaspora endpoint drives the ethnonationality classification.

Yet we can’t rely upon NamSor to be 100% accurate, even if it were a magical API. That’s because NamSor assumes tradition as a rule: that parents name their children to reflect their heritage. NamSor has no idea that the names it’s asked to classify might not be names given at birth.

Call Me By That Name

Historically, Jews, Italians, and Poles in Hollywood changed their names to more pronounceable stage names, sometimes distancing themselves altogether from ethnic groups prone to discrimination by taking on a name with no trace of their born heritage.

I don’t care what anyone says, Herschlag has a nice ring to it!

I couldn’t go on labeling people with a presumed ethnonationality knowing that a large chunk of the names I had access to through The Movie Database weren’t given at birth. Lucky for me, IMDb had the birth and stage names of every person I was looking for. To correct for the “name-changers,” I just had to identify every credited person in the 5–7 most popular movies of every year since 1900 (n≈100,000) who changed their name for better marketability in the industry. I couldn’t know the motivation behind any individual’s name change, so to eliminate false positives I counted a change only if it happened for reasons other than the following:

  1. They changed their name for marriage. (It’s actually more professional for showbiz women to keep their maiden name after marriage. These women went against the grain.)
  2. They go by their initials or a nickname. (Consider writer-actor B.J. Novak.)
  3. They gained an honorific. (Consider Sir Patrick Stewart.)
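As a rough illustration, the three exclusions above could be sketched as a filter like the one below. This is a minimal sketch and not the code used for the project: the function name, the `HONORIFICS` and `NICKNAMES` tables, and the `spouse_surnames` argument are all hypothetical. Marriage in particular cannot be detected from the two names alone, so the sketch assumes spouse surnames come from some outside source.

```python
import re

# Hypothetical helper tables; the article does not list the ones actually used.
HONORIFICS = {"sir", "dame", "lord", "lady", "dr"}
NICKNAMES = {"bill": "william", "bob": "robert", "tom": "thomas"}


def tokens(name):
    """Lowercase a name and split it into tokens, dropping periods and honorifics."""
    parts = re.sub(r"\.", " ", name.lower()).split()
    return [p for p in parts if p not in HONORIFICS]


def is_significant_change(birth_name, stage_name, spouse_surnames=()):
    """Return True if the change is not explained by the three exclusions."""
    birth, stage = tokens(birth_name), tokens(stage_name)
    if not birth or not stage:
        return False
    # Exclusion 3: the names differ only by an honorific.
    if birth == stage:
        return False
    # Exclusion 1: the new surname matches a spouse's (needs outside data).
    if stage[-1] in {s.lower() for s in spouse_surnames}:
        return False
    if birth[-1] == stage[-1]:
        given_birth, given_stage = birth[:-1], stage[:-1]
        # Exclusion 2a: given names reduced to initials, e.g. "A.B.".
        if given_stage == [g[0] for g in given_birth]:
            return False
        # Exclusion 2b: a known nickname replaces the first given name.
        if given_birth and given_stage and NICKNAMES.get(given_stage[0]) == given_birth[0]:
            return False
    return True
```

With these assumptions, `is_significant_change("Patrick Stewart", "Sir Patrick Stewart")` is filtered out, while a made-up pair such as `("Jan Kowalski", "John Kent")` is kept as a significant name-change.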

What remained after that filter were ~7,000 significant name-changes, a substantial 7% of the sample and an interesting group to explore! I had some questions for the data before I dove in, including:

Can we observe anglicization as a trend in name-changes?

No

Hypothesis: since a HUGE portion of the names in the population ended up British, you can observe a pattern of anglicization among name-changes in the American film industry.

Analysis: Using NamSor to determine someone’s actual ethnonationality is dumb. Instead, I used NamSor to do what it’s built for; I got “impressions” of the ethnonationality associated with someone’s former and current names to look for patterns in ethnocultural importance to the industry.

Image for post
Too much?

Facts: As it turns out, there is a net outflux from British-sounding names to elsewhere-sounding names. My hypothesis is WRONG. Check it out.

We’re only looking at the eastern hemisphere because ethnonationality, even in melting pots like the USA, is categorized primarily by geographical ties developed over the Modern Era, when populations were concentrated outside of the Americas.

About the above visualization:

  • A red line means the name originates in Britain and terminates elsewhere.
  • A blue line means the name originates elsewhere and terminates in Britain.
  • A white line means there is no association with Britain.
  • The thickness of the line indicates the number of name-changes recorded in one year from one “ethnic nation” to another.
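The legend above boils down to a small mapping from one flow to a line style. Here is a minimal sketch of that mapping; the function name, the `px_per_change` scale, and the choice of a linear width are assumptions for illustration, not details taken from the actual visualization.

```python
def line_style(origin, destination, count, hub="British", px_per_change=0.5):
    """Map one year's flow between two name-origin labels to a color and width."""
    if origin == hub and destination != hub:
        color = "red"    # the name starts in Britain and ends elsewhere
    elif origin != hub and destination == hub:
        color = "blue"   # the name starts elsewhere and ends in Britain
    else:
        # Anything that does not cross in or out of Britain
        # (assumption: a British-to-British change lands here too).
        color = "white"
    # Assumption: thickness grows linearly with the number of name-changes.
    return {"color": color, "width": count * px_per_change}
```

For example, four British-to-Portuguese changes in one year would come back as a red line of width 2.0 under the default scale.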

There’s a lot to see here, but I want to point your attention to the red and blue lines, specifically between 1900 and 1970. Over this period, there appears to be a mass exodus from British-sounding names to Portuguese-sounding ones. Contrary to the hypothesis, this shows a tremendous de-anglicization effect underway. Maybe people aimed to distinguish themselves from the showbiz crowd with ethnic-sounding names?
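To make “net outflux” concrete, here is a minimal sketch of the tally, assuming each name-change has already been reduced to a pair of labels (the impression of the former name and the impression of the current name). The function name and the toy pairs are made up for illustration and are not drawn from the real dataset.

```python
from collections import Counter


def net_outflux(changes, label="British"):
    """Count changes leaving `label` minus changes arriving at it."""
    flows = Counter(changes)
    leaving = sum(n for (f, c), n in flows.items() if f == label and c != label)
    arriving = sum(n for (f, c), n in flows.items() if f != label and c == label)
    return leaving - arriving


# Made-up (former label, current label) pairs, for illustration only.
toy_changes = [
    ("British", "Portuguese"),
    ("British", "Portuguese"),
    ("Polish", "British"),
    ("Italian", "Italian"),
]
```

A positive result means more names left the label than arrived at it; `net_outflux(toy_changes)` is 2 − 1 = 1 for the toy list. The anglicization hypothesis would have predicted a negative number for “British.”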

Concluding points

After I saw my hypothesis was wrong, I felt that my project was at a dead end. My goal was to deliver a product that could change the way people watch credits, and I felt I had turned up with nothing. Even if I developed post-production software to change the visual content of the credit sequence, there are no guarantees that anyone would notice, let alone care that there is more to see after their movie has ended.

Instead, I developed an independent website as a platform to present my progress as I explored and understood my dataset. This, of course, meant that anything I published was disassociated from the industry I was poking at. While this made my impact negligible, it forced me to find real value in the data and the tools I had on hand before I exposed something people would find interesting. The fun of exploring an untouched dataset quickly drained away when I found out my hypothesis was wrong, but I believe there might be more to extract from this group of name-changers than I originally thought.

Could there be a Part III? Only time will tell… In the meantime, thanks for reading! If the population I observed or my analysis process is of any interest to you, please get in touch!

[1] The Movie Database (~5 most popular feature films of each year)

I took on this project with Jyotsna Pant, a graduate student, for Computing for Social Good at the University of Southern California.