Recently, for a brief period, I seriously considered quitting my academic career. This made me think long and hard about why, all of a sudden, I wanted to give up something I had believed I wanted for such a long time. The conclusion was that I was just unhappy in life, and I just didn't have the emotional strength to push myself further. That made me think about how much emotional strength it takes to get a Ph.D. and to continue on. I have now come out of that slump (at least partially), but I wanted to share the candid thoughts I had during this process.

 

As much as intelligence is required, I think getting a Ph.D. takes an incredible amount of emotional strength. I have seen many cases where students dropped out of their Ph.D. programs not because they didn't like their research, but because they were going through personal struggles.

 

Let me begin with why I think the process of getting a Ph.D. and continuing on as an academic requires extraordinary emotional strength.

 

First, the nature of research requires positivity, a lot of it. Doing research is often like running your own start-up: you face a high degree of uncertainty at many steps along the way. Often, you just have to convince yourself that this is going to work out, and that even if it doesn't, it is okay to spend months or years of your life on something that never makes it to publication, because that's.. research. Even after publication, you might not get any citations for years, and you just have to believe that one day this work will add value to the world. Even for an inherently optimistic person, keeping up the optimism with little to no positive feedback from the world requires a great deal of emotional strength.

 

This brings me to the second point: in research, short-term rewards are scarce. Depending on the field, you have to work for months or years to get one publication. Before that happens, there is almost no reward. It isn't like being a teacher and seeing students grow every day, or being a doctor and seeing your patients improve daily. In this world of instant gratification, where you get instant likes and responses on social media, academia sits at the far extreme of delayed gratification.

 

There is another issue that adds to this: the highly competitive, high-pressure culture. Ph.D. students at top academic institutions are overachievers who have excelled in their academic work pretty much throughout their lives, and a highly competitive culture naturally develops within this group of people. A no-work-life-balance culture is pervasive in academia. I've heard professors say time and again that "Ph.D. students don't (shouldn't) have a life" or "Ph.D. students don't (shouldn't) have weekends". Admittedly, I was one of the people who spread this culture. But again, for an academic, your work is your own enterprise and your own brand, just like a start-up. If you spend 12 hours a day, 7 days a week at the office, you get more done than people who spend 8 hours a day, Monday through Friday, and you'd have a stronger profile. No matter how much you do, there are always monsters who push out more publications than you. To get a tenure-track job at a good university, you have to beat them. It is hard to put a stop to this competitive mindset.

 

Finally.. there is the financial struggle. I'm mentioning this last as if it were the smallest problem. But no, the struggle is real. As a Ph.D. student, you really don't make much more than a full-time waitress. You don't need luxuries to be happy, but the tiny income leaves you with no financial cushion, and that can be a constant source of emotional stress. For instance, during my Ph.D., I lost my retainer while moving, and I was so stressed about it: replacing it would cost a few hundred dollars, which was a significant portion of my savings at the time. Recently, I lost my AirPods, which also cost a couple hundred bucks to replace, but because I have a slightly better income as a postdoc, I wasn't as stressed about it. I was still sad for about a day, but it wasn't like the time I beat myself up about the retainer for weeks. Just knowing that there is a margin for mistakes relieves so much stress. Besides, you know that with your intelligence and work ethic, whatever "real" job you get will pay at least 2-3 times more than a Ph.D. salary. You are essentially choosing to be poor every day for your intellectual and academic pursuits while you watch your friends buy houses. This requires a certain amount of determination, and it is especially hard for students who don't have a family that can provide financial support in case of an emergency.

 

Okay, so I have complained enough about how damn hard it is. Then who the hell does a Ph.D.? Because of all of this, I think only a very specific group of people come to, and remain in, Ph.D. programs and academia. (**WARNING** I am only speaking from my own experience with a few local samples..)

 

When I first started my Ph.D. at Carnegie Mellon, I realized that I had never encountered such a privileged group of people in my life. Most of my colleagues and friends had parents who were doctors or professors. It was even common to see friends whose grandparents were doctors and professors. In my first month at CMU, one of my friends asked me what my parents do. I said my dad is a police officer. He couldn't hide the surprise on his face, and I asked him why he was so surprised. He said he had expected me to say either doctor or professor, because that's what most students would say. Even though the student body comprises kids from all over the globe, if you dig into their backgrounds, they are surprisingly homogeneous.

 

Having a privileged background means that you have stable support from your family. First, you have fewer worries about financial trouble: in the back of your head, you know that family is your financial fall-back option. Furthermore, you likely have a family that supports you and shares your values. Your parents, or even grandparents, are well-educated and appreciate intellectual work. They've traveled and seen the world, have progressive views, and validate yours. In many cases, they can also provide helpful advice on the difficult academic journey. They understand your situation better and give you more tailored suggestions than, say, a grandma who spent her whole life on a rural farm. Finally, this often means that you are surrounded by friends with similar backgrounds as well.

 

Just as it takes a village to raise a child, it takes a rock-solid emotional support system to raise a successful academic, whether from family, friends, or significant others. Admittedly, the support network you're born with is not the only network you'll have in life. You can make new friends who share your values, and you can meet a significant other who becomes your best cheerleader and advisor.

 

However, if you are from a less privileged background, it is difficult to find the support you need within graduate school or within academia. If you have to run to a hospital in the middle of the night for a family member who lost consciousness from a drug overdose, if you have to work a second job to support your family's finances, if you have to give birth to and raise an out-of-wedlock baby, if you have to deal with a family member's mental health issues while you have nobody to lean on, or if you have family and friends who hold a completely different set of values and disapprove of your academic pursuit in every way possible, there are few people who can truly understand and empathize. Colleagues around you would rather pass judgment than provide support. The friends who would understand, who have gone through something similar, are not in academia. They've already chosen a more chill career path, since dealing with their personal issues is already overwhelming.

 

Getting emotional support can be especially hard in science and engineering, where a lack of human connection is prevalent and sometimes even encouraged. We all have that stereotype in our minds that a true scientific genius ought to be a sociopath.. right? For a long time in academia, I feared sharing emotions, as I thought it would undermine my professionalism as a scientist: we should only speak with facts, not with emotions.

 

Until you get tenure, academia is a really tough place to survive. It is a long journey, and if you don't find the right emotional support system along the way, you will likely get weeded out at some point. This really made me question.. With the current system and culture, are we left with the most talented at the end? Or the most motivated? Or is it the most privileged who remain at the top of the academic ivory tower?

 

As fair machine learning (ML) has become a booming area of research, many "fairness" metrics have been proposed. In this post, I want to summarize some of those metrics. I only started working on fairness in ML a few months ago, and I still get confused between the different terms. So, by writing this post, I hope to reinforce my own memory 🙂

 

Before I delve into fairness metrics, let me briefly introduce the field of fair ML. To do that, let me begin with a personal recounting of the field of ML. When I applied for my PhD in 2013, I had never heard the term "machine learning". It hadn't reached a clueless undergraduate student in Korea. When I started my PhD in 2014 at CMU, everybody was talking about machine learning. In fact, one of the first classes I took at CMU was the famous '10-701: Introduction to Machine Learning'. ML was still a somewhat vague term back then. Statistics professors used to claim that ML is just a dressed-up term for statistics that CS people came up with. At conferences, I heard scholars say that if you want to get a job in industry, you have to do ML, or at least connect your research to ML. The hype was definitely there in 2014. At the same time, people were wary because of the hype. Lots of theory researchers around me were flat-out against deep learning in particular.

 

Fast forward to 2020: the ML hype didn't die out as a fad. The term that was unfamiliar to a CS student in 2014 has become universal and well known to the general public. It has seeped into many industries and into our everyday lives. Even not-so-tech-savvy people know that Google Translate works so well because of deep learning. One of the domains that has benefited the most from recent ML advances, image processing, has created a huge self-driving car industry. It has long been claimed that ML technologies will upend the healthcare industry; for example, that ML-based image processing would replace the role of radiologists. ML is also becoming an essential tool in science and engineering research, ranging from astrophysics to biology.

 

With its prevalence and wide applicability comes greater liability. The application of ML in safety-critical areas such as self-driving cars or medicine has been a hot topic of debate since the beginning. Another facet of liability that has risen to the surface is "fairness". Does this new technology give more advantage to a certain group of people than to others? While writing this, I got sort of curious whether this kind of question is unique to ML or is commonly asked whenever a new technology comes about. Maybe I should defer that question to another blog post 😉

 

So, what would be an example of "unfair" ML? One example is an ML model trained to detect pedestrians for a self-driving car that performs more poorly on black people than on white people. This puts black people at a higher risk of a car accident. Similarly, an ML model trained to predict the risk of cancer might perform better on men than on women. An example of more direct unfair treatment: when an ML model is used to screen candidates for a tech-related job, it might admit more male candidates than female candidates, thus providing less opportunity for qualified women. These problems are not hypothetical; versions of these examples have been observed multiple times in practice. (Maybe I need to add references here..)

 

As these problems have been recognized and have gained public attention, research interest in fair ML has grown rapidly. See the figure below for the number of papers in this field over the past 7 years. (Is there any data that includes up to 2020?) There is also a new online book on this topic, which I referenced for this blog post [1].

Fig: The number of publications on fairness from 2011 to 2017. (Borrowed from a Towards Data Science post.)

 

Okay, the short prelude became too lengthy. Now, let's really talk about how to define fairness mathematically. To do this, I need to introduce a bit of notation. Let us define the following random variables:

  • A: Protected attribute. E.g., gender, race, … 
  • Y: Target variable (ground-truth outcome). E.g., hired/not hired
  • R: Prediction outcome. E.g., hired/not hired, or score (probability of being hired)

 

A commonly used family of non-discrimination criteria is based on "independence" relations between these variables. First, note that in many cases where fairness issues arise, A and Y are NOT independent. For instance, even if you hire purely based on merit, people from a certain group may end up being hired at a higher rate than others. As another example, in healthcare, people from a certain group may have a higher risk of the disease you want to predict.

 

There are three types of independence people consider for fairness:

 

1. Statistical Parity (Independence): R \perp A

Statistical parity is a very strong notion of fairness. What it says is that even if there is dependence between Y and A, we enforce that the model's prediction R does not depend on A. For instance, let's say that hiring purely based on merit would result in hiring 50% of the male candidates and 30% of the female candidates. An ML model with the statistical parity property would instead hire, say, 40% of the male candidates and 40% of the female candidates.
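As a concrete sketch, here is how you could estimate the statistical parity gap from a model's binary predictions (the toy data and the function name are my own illustration, not from any library):

```python
import numpy as np

def statistical_parity_gap(a, r):
    """Difference in positive-prediction rates P(R=1 | A=a) between two groups."""
    a, r = np.asarray(a), np.asarray(r)
    return abs(r[a == 0].mean() - r[a == 1].mean())

# Toy data: group 0 receives R=1 at a 50% rate, group 1 at a 30% rate.
a = np.array([0] * 10 + [1] * 10)                    # protected attribute
r = np.array([1] * 5 + [0] * 5 + [1] * 3 + [0] * 7)  # model predictions
print(statistical_parity_gap(a, r))  # 0.2 -> statistical parity is violated
```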


2. Equalized Odds (Separation): R \perp A \;|\; Y.

This looks similar to statistical parity, but it is conditioned on Y, so it is somewhat relaxed compared to statistical parity. Consider a binary-class case where Y \in \{ 0, 1\} and our prediction is also R \in \{ 0, 1 \}. Let's also assume that the protected attribute A has two classes: A \in \{ \triangle, \square \}. Equalized odds doesn't enforce that the ratio of Y=1 be the same in class \triangle and class \square, nor that the model predict R=1 at the same rate in both. Instead, what it enforces is the following:
\begin{align*} P(R = 1 | Y=1, A= \triangle) &= P(R = 1 | Y=1, A= \square), \\ P(R = 1 | Y=0, A= \triangle) &= P(R = 1 | Y=0, A= \square). \end{align*}
This means that both groups \triangle and \square have the same true positive rate (TPR) and the same false positive rate (FPR).
This is also referred to as avoiding "disparate mistreatment", because what this metric prevents is different misclassification rates between groups: it equalizes the false positive rate and the false negative rate.
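In code, checking equalized odds amounts to comparing the TPR and FPR of each group; here is a minimal sketch (again, the toy data and function name are mine):

```python
import numpy as np

def tpr_fpr(y, r, a, group):
    """Return (P(R=1 | Y=1, A=group), P(R=1 | Y=0, A=group)) for one group."""
    y, r, a = np.asarray(y), np.asarray(r), np.asarray(a)
    m = a == group
    return r[m & (y == 1)].mean(), r[m & (y == 0)].mean()

y = np.array([1, 1, 0, 0, 1, 1, 0, 0])                  # ground truth
r = np.array([1, 0, 0, 0, 1, 0, 0, 0])                  # predictions
a = np.array(['T', 'T', 'T', 'T', 'S', 'S', 'S', 'S'])  # protected attribute
print(tpr_fpr(y, r, a, 'T'))  # (0.5, 0.0)
print(tpr_fpr(y, r, a, 'S'))  # (0.5, 0.0) -> equalized odds holds here
```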

3. Predictive Rate Parity (Sufficiency): Y \perp A \;|\; R

This again looks similar to the previous definition, but with Y and R flipped. So, instead of equalizing P(R|Y) across groups, this equalizes the posterior distribution P(Y|R) across groups, i.e.,

\begin{align*} P(Y = 1 | R=1, A= \triangle) &= P(Y = 1 | R=1, A= \square), \\ P(Y = 1 | R=0, A= \triangle) &= P(Y = 1 | R=0, A= \square). \end{align*}

 

This implies equality in the positive predictive value (PPV) and the negative predictive value (NPV) across groups.

This is also called "conditional use accuracy equality". I think this term was coined in this paper [4]. I guess the term "conditional" comes from conditioning on the prediction R, and the authors state that the term "use" is there because "when risk is actually determined, the predicted outcome is employed; this is how risk assessments are used in the field."
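The analogous per-group check for predictive rate parity compares PPV and NPV; once more, just a toy sketch with made-up names:

```python
import numpy as np

def ppv_npv(y, r, a, group):
    """Return (P(Y=1 | R=1, A=group), P(Y=0 | R=0, A=group)) for one group."""
    y, r, a = np.asarray(y), np.asarray(r), np.asarray(a)
    m = a == group
    ppv = y[m & (r == 1)].mean()      # positive predictive value
    npv = 1 - y[m & (r == 0)].mean()  # negative predictive value
    return ppv, npv

# Predictive rate parity holds (approximately) when both values match
# across groups, e.g., ppv_npv(y, r, a, 'T') vs. ppv_npv(y, r, a, 'S').
```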

 

Now, let’s look at some examples to understand these three different criteria better. 

The table below shows the distribution of Y across the two classes \triangle and \square, together with the model's predicted R (counts per (Y, R) cell; each group has 50 people):

                 R=1    R=0
\triangle, Y=1:   25      5
\triangle, Y=0:    0     20
\square, Y=1:     20      0
\square, Y=0:      5     25

A model that assigns R as given in the table satisfies statistical parity: P(R=1|\triangle) = P(R=1|\square) = 25/50 = 0.5. However, this model satisfies neither equalized odds nor predictive rate parity.

  • Equalized Odds — unsatisfied: P(R=1 | Y=1, \triangle) = 25/30, but P(R=1 | Y=1, \square) = 20/20 = 1
  • Predictive Rate Parity — unsatisfied: P(Y=1 | R=1, \triangle) = 25/25 = 1, but P(Y=1 | R=1, \square) = 20/25

On the same dataset, we can train a different model that assigns R as below:

                 R=1    R=0
\triangle, Y=1:   27      3
\triangle, Y=0:    2     18
\square, Y=1:     18      2
\square, Y=0:      3     27

This satisfies equalized odds: P(R=1|Y=1, \triangle) = P(R=1|Y=1, \square) = 0.9 and P(R=0|Y=0, \triangle) = P(R=0|Y=0, \square) = 0.9. However, this model doesn’t satisfy statistical parity or predictive rate parity. 

  • Statistical Parity — unsatisfied: P(R=1 | \triangle) = 29/50, but P(R=1 | \square) = 21/50
  • Predictive Rate Parity — unsatisfied: P(Y=1 | R=1, \triangle) = 27/29 \approx 0.93, but P(Y=1 | R=1, \square) = 18/21 \approx 0.86
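As a sanity check, here is a small snippet that rebuilds the second model's counts from the table above and recomputes all three quantities per group; only P(R=1|Y=1) (and likewise P(R=1|Y=0)) matches across the groups:

```python
import numpy as np

def expand(n11, n10, n01, n00):
    """Build (y, r) arrays from the counts of the (Y=1,R=1), (1,0), (0,1), (0,0) cells."""
    y = np.array([1] * (n11 + n10) + [0] * (n01 + n00))
    r = np.array([1] * n11 + [0] * n10 + [1] * n01 + [0] * n00)
    return y, r

def rates(y, r):
    """P(R=1), P(R=1 | Y=1), and P(Y=1 | R=1) for one group."""
    return r.mean(), r[y == 1].mean(), y[r == 1].mean()

y_tri, r_tri = expand(27, 3, 2, 18)  # group triangle
y_sq, r_sq = expand(18, 2, 3, 27)    # group square
print(rates(y_tri, r_tri))  # (0.58, 0.9, 0.931...) = (29/50, 27/30, 27/29)
print(rates(y_sq, r_sq))    # (0.42, 0.9, 0.857...) = (21/50, 18/20, 18/21)
```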

 

Hopefully, these examples give you some idea of when these parity conditions are satisfied. Also, as you may notice in these examples, even though the conditions look very similar, it is hard to satisfy more than one of them at the same time. In fact, you can prove that, under mild assumptions, it is impossible to satisfy any two of these three conditions simultaneously (see Chapter 2 in [1]).

 

Finally, I want to mention that these metrics are all based on counting prediction results across groups and equalizing them. There exist completely different notions of fairness: individual fairness and counterfactual/causal fairness. I will cover them in a different post because I want to go for a run now 🏃🏻‍♀️☺️

 

References

[1] Barocas, Hardt & Narayanan, Fairness and Machine Learning (online book)

[2] Verma & Rubin, “Fairness Definitions Explained” , ACM/IEEE International Workshop on Software Fairness (2018)

[3] A Tutorial on Fairness in Machine Learning, a blog post

[4] Berk et al., "Fairness in Criminal Justice Risk Assessments: The State of the Art", Sociological Methods & Research (2018)

I recently attended an ICML tutorial on submodular optimization, and I enjoyed it so much that I went on to read a review/tutorial paper: "Submodularity in Action: From Machine Learning to Signal Processing Applications".

As I read the paper, I want to summarize what I'm learning on this blog. Initially, I was very ambitious and wanted to recreate something similar to the ICML tutorial. I soon realized that that would probably require an entire semester, not a week. So now I'm breaking things down into much smaller posts 🙂

Let me begin with the definition of submodular functions. The core concept behind submodular functions is diminishing returns:

    \[f(e|\mathcal{A}) \leq f(e|\mathcal{B}) \;\; \text{ if } \mathcal{B} \subseteq \mathcal{A},\]

where f(e|\mathcal{S}) is defined as:

    \[f(e|\mathcal{S}) = f(\mathcal{S} \cup \{ e \}) - f(\mathcal{S}) \;\; \text{ for } e \in \mathcal{S}^C.\]

So, if you consider f as some reward function, f(e|\mathcal{S}) is the additional reward you get by adding e to the set \mathcal{S}.

The best way to visualize this is through the "covering" problem, e.g., covering an area with Internet access. As you add more transmission towers, more area is covered, i.e., has Internet access. At first, every new tower you build creates an area that newly gains Internet access. However, once you have a few towers, the area covered by a new tower starts to overlap with already-covered areas, and the amount of newly covered area decreases. See the image below:

Image borrowed from: http://iid.yale.edu/icml/icml-20.md/
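To make the diminishing-returns property concrete in code, here is a tiny toy version of the covering example (the tower layout is made up):

```python
# Each tower covers a set of cells; f(S) = number of cells covered by S.
# Coverage functions like this are submodular.
towers = {
    'a': {1, 2, 3, 4},
    'b': {3, 4, 5, 6},
    'c': {5, 6, 7, 8},
}

def f(S):
    """Total number of cells covered by the set of towers S."""
    return len(set().union(*(towers[t] for t in S))) if S else 0

def gain(e, S):
    """Marginal gain f(e | S) = f(S + e) - f(S)."""
    return f(S | {e}) - f(S)

print(gain('b', set()))       # 4: alone, 'b' covers 4 new cells
print(gain('b', {'a'}))       # 2: cells 3 and 4 are already covered
print(gain('b', {'a', 'c'}))  # 0: everything 'b' covers is already covered
```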

The concept of submodularity is important for two reasons:

  1. The "diminishing returns" scenario occurs frequently in the real world. In particular, it has gained a lot of attention recently due to its applications in machine learning. For instance, feature selection can be modeled as a submodular optimization problem where you want to find the features that best explain the variance of the output. Clustering can also be formulated as a submodular optimization problem: choosing subsets that maximize a combinatorial dependence function.
  2. Submodularity is a discrete-function counterpart of convexity. Thanks to this similarity, some submodular optimization problems can be solved efficiently. For unconstrained submodular minimization, we can obtain an exact solution by utilizing convexity. For many constrained submodular optimization problems, there are greedy algorithms that provide approximation guarantees. (Will cover this in later posts, but see the sketch below!)
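As a teaser for those later posts, here is a minimal sketch of the classic greedy heuristic for maximizing a monotone submodular function under a cardinality constraint (in that setting it is guaranteed a (1 - 1/e)-approximation); greedy_max is my own toy naming:

```python
def greedy_max(f, ground_set, k):
    """Greedily pick k elements, each time adding the one with the largest marginal gain."""
    S = set()
    for _ in range(k):
        best = max(ground_set - S, key=lambda e: f(S | {e}) - f(S))
        S.add(best)
    return S

# With the coverage function f and the towers from the sketch above:
# greedy_max(f, {'a', 'b', 'c'}, 2) can return {'a', 'c'}, covering all 8 cells.
```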

In this post, I am just going to talk about one thing:

Any set function can be represented as a difference between two submodular functions.

This is a really useful property because it lets you use techniques for maximizing the difference of two convex functions to maximize an arbitrary set function.

Now let’s prove why this is true!

 

Theorem 1. For any set function f: 2^{\mathcal{N}} \rightarrow \mathbb{R} defined over a ground set \mathcal{N}, there exist submodular set functions g: 2^{\mathcal{N}} \rightarrow \mathbb{R} and h: 2^{\mathcal{N}} \rightarrow \mathbb{R} such that:

(1)   \begin{equation*} f(\mathcal{S}) = g(\mathcal{S}) - h(\mathcal{S}) \quad \forall \mathcal{S} \subseteq \mathcal{N}. \end{equation*}

(Proof of Theorem 1)

Suppose f is an arbitrary set function on \mathcal{N} and h is a submodular function on \mathcal{N}, and let g be such that:

    \[g(\mathcal{S}) = f(\mathcal{S}) + h(\mathcal{S}).\]

If

(2)   \begin{equation*}  g(e|\mathcal{A})  \leq g(e|\mathcal{B}) \end{equation*}

is satisfied for every e \in \mathcal{N} and every \mathcal{B} \subset \mathcal{A} \subset \mathcal{N}, we are all set!

If not, find e \in \mathcal{N} and \mathcal{B} \subset \mathcal{A} \subset \mathcal{N} that violate (2), i.e.,

(3)   \begin{equation*}  g(e|\mathcal{A}) > g(e|\mathcal{B}) \text{ for } \mathcal{B} \subset \mathcal{A}. \end{equation*}

We will show that we can change h in a way that removes the violating triple (e, \mathcal{A}, \mathcal{B}) while maintaining the submodularity of h.

To do that, we will first prove a small lemma.

Lemma 1. Let h be a submodular function on \mathcal{N}, let \mathcal{B} \subseteq \mathcal{N}, and let c \geq 0 be a constant. Let h' be another set function on \mathcal{N} such that:

    \[h'(\mathcal{S}) = \begin{cases} h(\mathcal{S}) - c & \text{ if } \mathcal{S} \subseteq \mathcal{B}, \\ h(\mathcal{S}) & \text{ otherwise.} \end{cases}\]

Then, h' is also a submodular function on \mathcal{N}.

(Proof of Lemma 1)

We have to show that for all e \in \mathcal{N} and for all \mathcal{C} \subset \mathcal{A} \subseteq \mathcal{N},

(4)   \begin{equation*}  h'(e|\mathcal{A}) \leq h'(e|\mathcal{C}). \end{equation*}

Let's divide this into three cases: i) \mathcal{B} \subset \mathcal{C} \subseteq \mathcal{A}, ii) \mathcal{C} \subseteq \mathcal{B} \subseteq \mathcal{A}, iii) \mathcal{C} \subseteq \mathcal{A} \subseteq \mathcal{B}. (In the remaining configurations, where \mathcal{B} is incomparable with \mathcal{C} or \mathcal{A}, one can check that h'(e|\mathcal{A}) = h(e|\mathcal{A}) while h'(e|\mathcal{C}) \geq h(e|\mathcal{C}), so (4) again follows from the submodularity of h.)

i) h'(e|\mathcal{C}) = h(e|\mathcal{C}) and h'(e|\mathcal{A}) = h(e|\mathcal{A}), since none of the sets involved is a subset of \mathcal{B}. Hence, (4) holds.

ii)

    \[h'(e|\mathcal{C}) = h'(\mathcal{C} \cup \{ e \}) - h'(\mathcal{C}) = h(\mathcal{C}') - h(\mathcal{C}) + c,\]

where \mathcal{C}' = \mathcal{C} \cup \{ e \} and \mathcal{A}' = \mathcal{A} \cup \{ e \} (note that e \notin \mathcal{B}, since e \notin \mathcal{A} and \mathcal{B} \subseteq \mathcal{A}, so \mathcal{C}' \not\subseteq \mathcal{B}). Also,

    \[h'(e|\mathcal{A}) = \begin{cases} h(\mathcal{A}') - h(\mathcal{A}) + c & \text{ if } \mathcal{A} = \mathcal{B}, \\ h(e|\mathcal{A}) & \text{ otherwise.} \end{cases}\]

Hence, h'(e|\mathcal{A}) - h'(e|\mathcal{C}) \leq h(e|\mathcal{A}) - h(e|\mathcal{C}) \leq 0.

iii)

    \[h'(e|\mathcal{C}) = h'(\mathcal{C} \cup \{ e \}) - h'(\mathcal{C}) = \begin{cases} h(\mathcal{C}') - h(\mathcal{C}) & \text{ if } e \in \mathcal{B} \setminus \mathcal{A}, \\ h(\mathcal{C}') - h(\mathcal{C}) + c & \text{ otherwise,} \end{cases}\]

and the same holds for \mathcal{A}, i.e.,

    \[h'(e|\mathcal{A}) = h'(\mathcal{A} \cup \{ e \}) - h'(\mathcal{A}) = \begin{cases} h(\mathcal{A}') - h(\mathcal{A}) & \text{ if } e \in \mathcal{B} \setminus \mathcal{A}, \\ h(\mathcal{A}') - h(\mathcal{A}) + c & \text{ otherwise.} \end{cases}\]


Hence, in either case, h'(e|\mathcal{A}) - h'(e|\mathcal{C}) = h(e|\mathcal{A}) - h(e|\mathcal{C}) \leq 0, where the last inequality is the submodularity of h.

\square

Now, let \delta(e, \mathcal{A}, \mathcal{B}) = g(e|\mathcal{A}) - g(e|\mathcal{B}), which is positive by (3), and let

    \[h'(\mathcal{S}) = \begin{cases} h(\mathcal{S}) - \delta(e, \mathcal{A}, \mathcal{B}) & \text{ if } \mathcal{S} \subseteq \mathcal{B}, \\ h(\mathcal{S}) & \text{ otherwise.} \end{cases}\]

Then, h' is still a submodular function by Lemma 1 (with c = \delta(e, \mathcal{A}, \mathcal{B}) > 0). Now let g' = f + h'. Then:

    \begin{align*} g'(e|\mathcal{A}) - g'(e|\mathcal{B}) &= g(e|\mathcal{A}) - (g'(\mathcal{B} \cup \{ e \}) - g'(\mathcal{B})) \\ &= g(e|\mathcal{A}) - (g(\mathcal{B} \cup \{ e \}) - (g(\mathcal{B}) - \delta(e, \mathcal{A}, \mathcal{B}))) \\ &= g(e|\mathcal{A}) - g(e|\mathcal{B}) - \delta(e, \mathcal{A}, \mathcal{B}) \\ &= \delta(e, \mathcal{A}, \mathcal{B}) - \delta(e, \mathcal{A}, \mathcal{B}) = 0. \end{align*}

 

We can repeat this process until there is no such pair (e, \mathcal{A}, \mathcal{B}) that violates (2).

\square

 

That's it for today, and I hope I can come back with another submodular optimization post or a quantum algorithm post soon 🙂