# 1 Introduction

Your thesis is the primary deliverable of your time in 702. It is a structured presentation of your learning and scientific achievement. Its structure should be similar to that of other papers in your scientific area. For most theses, the structure described in Section 3.2 is appropriate, but you may deviate from it with the permission of your advisor and me.

# 2 Guidance

The *Write Like a Scientist* page has excellent step-by-step instructions for scientific writing, and was put together by folks at Middlebury!

## 2.1 Find An Example

The best way to write well is to carefully read models of good writing. One kind of good model is a previous Middlebury thesis! You can find some examples of Middlebury CS theses here. Once you find a thesis, it is best to confer with your advisor about whether that thesis is a good one to emulate.

Another good approach is to find an academic paper in your area that is considered to be an especially useful, insightful, or seminal paper. Try structuring your writing like that!

## 2.2 Know Your Audience

Your audience is the Computer Science faculty of Middlebury College. You can assume that we are comfortable with math, algorithms, and computational experiments. However, you should also assume that most of us know very little about the topic you are working on. You will need to spend plenty of time explaining that, especially in your introduction (Section 3.2.1).

## 2.3 Good Writing is Dense

Excellent scientific writing is dense. It is concise, rich in meaning, and free of weasel words. Theses are not necessarily extremely long (we have seen good theses with as few as 35-40 double-spaced pages), but each page has **a lot** of meaning.

### Use Math and Algorithms

As scientists, part of our responsibility is to describe our work in the most precise way possible. The gold standard for precision in quantitative science is the language of mathematics and algorithms. Some tips:

- When describing a measure of quality or error, give a formula.
- When describing an algorithm, use an algorithm environment in \(\LaTeX\).
- Use tables to efficiently display the results of experiments that don’t fit neatly into figures.
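
As a minimal sketch of the second tip, the following compiles as a standalone document. It assumes the `algorithm` and `algpseudocode` packages, since this guide does not say which algorithm package to use. Gradient descent, the symbols, and the label `alg:gd` are illustrative choices only.

```latex
\documentclass{article}
\usepackage{amsmath}
\usepackage{algorithm}      % floating "Algorithm" environment with a caption
\usepackage{algpseudocode}  % \State, \For, \Require, ... (algorithmicx)

\begin{document}

\begin{algorithm}
\caption{Gradient descent (illustrative example)}
\label{alg:gd}
\begin{algorithmic}[1]
\Require differentiable objective $g$, initial point $\mathbf{w}_0$,
         step size $\alpha > 0$, number of steps $T$
\Ensure approximate minimizer $\mathbf{w}_T$
\For{$t = 0, 1, \ldots, T - 1$}
    \State $\mathbf{w}_{t+1} \gets \mathbf{w}_t - \alpha \nabla g(\mathbf{w}_t)$
\EndFor
\State \Return $\mathbf{w}_T$
\end{algorithmic}
\end{algorithm}

\end{document}
```

Note that every symbol used in the loop is declared in the `\Require` line, so the reader never has to guess what an input means.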

# 3 Deliverables

This section includes a complete set of deliverables related to your thesis document.

## 3.1 Outline

Your **outline** is a plan for the writing of your thesis. It’s a bird’s-eye view of the whole thing. Your plans may change and you may deviate from your outline as you go. That’s fine, but you should do so with intention and purpose.

- Please read the descriptions and scaffolding for each of the five sections described in Section 3.2.
- Your outline should include each of these sections, along with a list of 3-5 bullet-point sentences describing what you will communicate to your reader. Together, these sentences form an “abstract” of the section.
- These should be **complete sentences**, not phrases, each of which makes a claim that advances the arc of your thesis.
- Below each sentence, please include references to what you will use to support the claim. These could be literature references; references to a figure you will create; etc. It’s ok to skip these references for the methods and discussion sections.

Submit your outline as a PDF on Canvas.

### Example 1

- Although the question of fairness in machine learning is very important, there is no consensus on which definitions of fairness should be used in different application areas.
  - Barocas, Hardt, and Narayanan (2023)
- Furthermore, some of these definitions of fairness are incompatible with each other.
- One source of bias is failure to carefully specify the target variable and ensure that the target is appropriately reflective of the goal of the predictive model.
  - Obermeyer et al. (2019)
- …

### Example 2

- Across several different tasks, our proposed algorithm performs slightly better than existing state of the art (SOTA).
  - Table 1 showing task descriptions, our performance, SOTA performance, and timings.
- Our proposed algorithm also performs well on a broader range of tasks than are feasible with SOTA.
  - Table 2 showing our algorithm working on examples where SOTA can’t even be run.
- Although our algorithm generates overall higher-quality solutions, the resulting solutions are not as easily interpretable or intuitive as SOTA.
  - Figure 1 showing our algorithm output in comparison to SOTA.
- …

## 3.2 Sections

### Introduction

**4-6 double-spaced pages**, not including references.

Your introduction is your reader’s gateway into your thesis topic. By the end of your introduction, the reader should understand what your problem is, why it’s important, why **you** need to work on it, what your overall approach is, and what to expect as they continue reading.

### Literature Review

**6-8 double-spaced pages**, not including references. A fair target for a literature review is around **30** references discussed, and **20** would be a minimum. You may deviate from this with the consent of your thesis advisor. Literature review sections are often titled “Related Work.”

Your literature review is a survey of literature related to your topic. Your literature review should discuss a range of scholarly references and highlight their importance to your problem or related ideas.

One way to think about what you should include: your topic is probably an *intersection* of several ideas. For example, if my topic is “modified gradient descent for fairness-penalized classification models,” then my topic is the intersection of:

- Gradient descent algorithms
- Fairness in machine learning
- Penalty methods in optimization
- Classification models

It would be appropriate for my literature review to discuss sources in all of these areas, though not necessarily in that order.

It is often, but not always, appropriate for the introduction and literature review to contain illustrative diagrams to help your reader understand what it is that we are talking about and why so many people care about it. For the final thesis, these diagrams should be created **by you**, and not copied (even with attribution) from other sources.

#### Writing Style for Literature Reviews

- It’s easy to write sentences like “Barocas, Hardt, and Narayanan (2023) discuss the importance of fairness in machine learning” over and over. Avoid this! Instead, use active verbs and specific claims or ideas: “Barocas, Hardt, and Narayanan (2023) *argue* that fairness in machine learning is multifaceted and *demonstrate* how different formulations are appropriate in different contexts.” Some other good phrases are: *proves that*, *shows*, *formulates/invents*, *criticizes*, *improves ____ by doing ____*, etc.
- In \(\LaTeX\), you can use `\cite{source1, source2}` to cite multiple sources simultaneously.
- In scientific writing, we very rarely quote. Instead, paraphrase and cite.
- It’s most common to simply write a sentence about a matter of fact and then cite, e.g. “Fairness in machine learning is a multifaceted topic (Barocas, Hardt, and Narayanan 2023).” It’s also fine to use the authors as the subject of the sentence, e.g. “Barocas, Hardt, and Narayanan (2023) argue that fairness in machine learning is multifaceted.” It’s best to use this second form relatively rarely.
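
The two citation forms described above map onto two commands if you use the `natbib` package. That is an assumption: your template may use `biblatex` or another package instead. The key `barocas2023fairness` and the file name `refs` are hypothetical.

```latex
\documentclass{article}
\usepackage[round]{natbib}  % provides \citep (parenthetical) and \citet (textual)

\begin{document}

% Parenthetical: the claim is the subject; the source goes in parentheses.
Fairness in machine learning is a multifaceted topic \citep{barocas2023fairness}.

% Textual: the authors are the subject of the sentence.
\citet{barocas2023fairness} argue that fairness in machine learning is multifaceted.

\bibliographystyle{plainnat}
\bibliography{refs}  % expects a refs.bib file that contains the key above

\end{document}
```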

### Methods

**8-12 double-spaced pages**, not including references. The necessary length of this section can be highly variable and you should consult with your advisor about what target length might be right for you.

Your methods section is where you should describe the details of what exactly you are doing. This is the place for algorithms, equations, and descriptions of experimental protocols.

#### Math and Algorithms

Use them! Correctly incorporating mathematics and algorithms into your scientific writing is largely a matter of *defining your terms*. Here’s an example equation:

\[ \begin{aligned} R(\mathbf{X}, \mathbf{y}) = \frac{1}{n} \sum_{i = 1}^n \ell(f(\mathbf{x}_i), y_i) \end{aligned} \tag{1}\]

**This equation does not mean anything.** That’s because it has a bunch of symbols that haven’t been defined yet. Every mathematical symbol needs to be explicitly defined, usually in the body of the text. In this example, I would display the equation above and then immediately follow it with something like this:

Equation 1 defines the empirical risk function \(R\). In this equation, \(n\) is the number of distinct data points, \(p\) is the number of features, \(\mathbf{X}\in \mathbb{R}^{n\times p}\) is the matrix of predictor data with \(i\)th row \(\mathbf{x}_i\), and \(\mathbf{y}\) is a vector of targets with \(i\)th entry \(y_i\). The function \(\ell\) is the loss function and \(f\) is the predictor function.
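
To make these definitions concrete, here is a minimal Python sketch of Equation 1. The squared-error loss, the linear predictor, and the toy data are assumptions chosen for illustration; the equation itself leaves \(\ell\) and \(f\) unspecified.

```python
def empirical_risk(X, y, f, loss):
    """Average loss of predictor f over the n data points (Equation 1)."""
    n = len(X)
    # x_i is the i-th row of X; y_i is the i-th target.
    return sum(loss(f(x_i), y_i) for x_i, y_i in zip(X, y)) / n


def squared_error(y_hat, y_true):
    """Illustrative choice of the loss function ell (an assumption)."""
    return (y_hat - y_true) ** 2


# Illustrative data: n = 3 data points, p = 2 features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 2.0]

# Illustrative linear predictor f(x) = <w, x> with fixed weights w.
w = [1.0, 2.0]


def f(x):
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


risk = empirical_risk(X, y, f, squared_error)
print(risk)
```

Here the first two points are predicted exactly and the third is off by 1, so the risk is \((0 + 0 + 1)/3 = 1/3\).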

You should similarly use algorithms – with carefully-defined mathematical symbols – to express your ideas. Your aim should be that your reader could *with patience* reimplement exactly your own methods, without needing to “guess” at any of the mathematical or algorithmic choices you made along the way.

#### Describe Experiments

The methods section is also a good place to describe *how* you test your algorithm/model/implementation/protocol, but not *what* the results are. Things that should go in here may include:

- The conditions of your computational experiments: number of steps, size of problem, etc.
- A mathematical description of what you are measuring in those experiments; e.g. error, variance, mean, runtime, etc. In many cases it is useful to measure multiple things; e.g. error and runtime both.
- The specs of the machine on which you did those experiments.
- How you designed and administered a user survey; how you incorporated feedback from the survey into your product; etc.
- Any competitors (algorithms, products, etc.) that you compared to yours, and a description of how you made the comparison. If these are published, then you don’t usually need to go into detail – just the name of the competitor and a citation is fine.
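
As a hedged illustration of measuring error and runtime together, the sketch below runs repeated seeded trials of a placeholder method and summarizes both quantities, along with a description of the machine. The function `run_method`, the toy estimation task, and the trial count are hypothetical stand-ins for your own experiment.

```python
import platform
import random
import statistics
import time


def run_method(seed, n_samples=1000):
    """Hypothetical method under test: estimate the mean of Uniform(0, 1)."""
    rng = random.Random(seed)  # seeded so each trial is reproducible
    return sum(rng.random() for _ in range(n_samples)) / n_samples


def run_experiment(n_trials=20, true_value=0.5):
    """Record absolute error and wall-clock runtime for each trial."""
    errors, runtimes = [], []
    for seed in range(n_trials):
        start = time.perf_counter()
        estimate = run_method(seed)
        runtimes.append(time.perf_counter() - start)
        errors.append(abs(estimate - true_value))
    return {
        "machine": platform.platform(),  # coarse description of the test machine
        "n_trials": n_trials,
        "mean_error": statistics.mean(errors),
        "sd_error": statistics.stdev(errors),
        "mean_runtime_s": statistics.mean(runtimes),
        "sd_runtime_s": statistics.stdev(runtimes),
    }


results = run_experiment()
for name, value in results.items():
    print(f"{name}: {value}")
```

Reporting the spread across trials alongside the mean lets the reader judge whether a difference between methods is larger than trial-to-trial noise.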

### Results

**6-10 double-spaced pages**, not including references. The necessary length of this section can be highly variable and you should consult with your advisor about what target length might be right for you.

This is where we show that your innovation is Good, Actually. Your results section is the place to tell your reader about the outcomes of your experiments; the feedback that you received; etc. The exact structure of the results section depends on the kinds of results you have to share and how you present them, and so a major undertaking for the results section is simply the design of effective scientific graphics and tables. Figures are often best for showing big-picture patterns, while tables are most relevant when it is important for your reader to compare two specific numbers (for example, the error achieved by your algorithm vs. the error achieved by a competitor).

Remember that no figures or tables speak for themselves; it is your job to write captions and paragraph discussions. Captions are for ensuring that the reader is technically able to understand all the symbols and markings in the figures, while paragraph discussions are for telling the story.
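
A minimal LaTeX sketch of a results table whose caption explains its columns follows. It assumes the `booktabs` package. The task names, method names, and column choices are hypothetical, and the cells are left as placeholders rather than filled with invented numbers.

```latex
\documentclass{article}
\usepackage{booktabs}  % \toprule, \midrule, \bottomrule

\begin{document}

\begin{table}
\centering
\caption{Test error and runtime of our method and a competitor on each task.
Error is the empirical risk $R$ on held-out data; runtime is wall-clock
seconds, averaged over repeated trials. Lower is better in both columns.}
\label{tab:comparison}
\begin{tabular}{llrr}
\toprule
Task & Method & Error & Runtime (s) \\
\midrule
Task A & Ours       & -- & -- \\
Task A & Competitor & -- & -- \\
Task B & Ours       & -- & -- \\
Task B & Competitor & -- & -- \\
\bottomrule
\end{tabular}
\end{table}

\end{document}
```

The caption carries the technical definitions, which leaves the surrounding paragraphs free to tell the story of what the numbers show.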

Your results section is a good place to comment on comparisons (“Our algorithm was faster than the competitor by approximately 1 order of magnitude while achieving similar error.”). It’s not the right place to comment on the significance of this (“this is a big deal for users”); that’s for the Discussion section.

### Discussion

**5-7 double-spaced pages**, not including references. The necessary length of this section can be highly variable and you should consult with your advisor about what target length might be right for you.

The Discussion section is your closing interaction with your reader. It’s where you and they reflect on what you’ve done. The main questions you should address in the discussion are:

- Concisely, what exactly did we do again?
- In what ways did it work? In what ways did it not work?
- Why does it matter?
- What’s the next logical scientific step suggested by this work (i.e., what should your advisor’s *next* thesis student work on)?

## 3.3 (Optional): Appendices

Appendices are spaces for information that doesn’t fit within the natural flow of the thesis itself. Code implementations; large sets of repeated experiments; supplementary tables; etc. are all suitable for appendices.

© Phil Chodrow, 2023

## 3.4 References

*Fairness and Machine Learning: Limitations and Opportunities*. MIT Press.

*Big Data* 5 (2): 153–63.

*Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 797–806.

*arXiv Preprint arXiv:1609.05807*.

*Science* 366 (6464): 447–53.