The best practices for working with Google Sheets

at 2018.08.26

Here is a post of my own experiences of working with information in spreadsheets (seriously) for almost 3 years, together with the opinions of others I have worked with and reports and posts I've read online.

By no means is it meant to be exhaustive or the previous word about the subject, but if you follow these guidelines, you should have a comprehensive SOP (standard operating procedure) for data analysis.

Why bother?

In implementing these best practices for working with data, they will cause you/your staff to function more efficiently and reduce the probability of mistakes (computer or human) and drudgery (humans weren't made to do the work APIs can do.

It'll make your work easier to follow, comprehend, and add value to a team's or client's workflow process. It is a good habit to have, and it'll serve you well as you progress with your data career.

Contents

18 best practices for coping with information

1. Organize your info

Jobs are more likely to be successful when there is good communication and a good organizational structure. A significant piece of that is good data management, and that begins with implementing a well-organized, sensible, and efficient folder arrangement.

As your projects grow in size, it becomes increasingly critical to maintain your Google Sheets organized in a meaningful way.

At a minimum, you'll have a top-level project folder and within this, you will want separate folders for your raw datasets, for your analysis, and to the final deliverables. Files can move from one folder to the next (or be copied from one folder to the next) as the project progresses.

Here's a screenshot from my Drive, with folders for how I would set up an client account for an ROI dashboard, and sub-folders within these:

2. Keep a backup copy of your data

Surely that is redundant in the era of cloud data, right?

On the 1 hand, there shouldn't be a necessity to backup your Google Sheets.

Google has so much redundancy built in, your information should be protected from becoming lost or corrupted. And you can rely on variant history to go back in time to a previous version of your Sheet should you want to. You are able to restore files from trash, although as soon as you've deleted them out of trash they're gone for good (I believe there is an exception for admins of school G Suite accounts, however).

So that your data should be safe.

On the other hand, however, your account could still get hacked or a colleague could (accidentally) delete a crucial file, and you might lose data like that.

For anything that's crucial, it is well worth making a copy, either at another google account if you have one, or offline (as a CSV or excel document).

It's definitely not necessary for all, just your mission critical stuff.

Additionally, it is worth mentioning that creating a copy of a Sheet signifies the new copy doesn't have the model history, which can be an advantage or disadvantage depending on your circumstance. It's generally a fantastic idea if you are sharing the Sheet and don't want anyone to be able to see all your workings to get to the final result (maybe you deleted confidential data for example).

3. Document the steps you take

But there will come a day when you are glad of a few notes about where your data came out, what assumptions you made, what calculations you chose to do and how you did!

It doesn't have to be super long, only enough detail to enable you or someone else enough info to understand and recreate the investigation.

This example is intentionally comprehensive, so that I wouldn't suggest you have to create something similar to this every time:

If you are developing a dataset, then you should also produce a data dictionary, in which you describe what information each column stays. A data dictionary is only a list of column headings in a separate tab, with a notice describing each column, for example, what the components are, while it's been normalized, how it's been computed.

4. Go with wide-format data tables

As users, we generally utilize 2-dimensional, wide format tables.

The columns and rows represent categories or measures (by way of instance, regions vs. months). It's easy to understand this grid format and charts and calculations give themselves to this layout, as you may run calculations across rows or columns.

The Google Sheets graph tool expects data in a “wide-format” table rather than a conventional “tall-format” dining table (which explains how data is stored in databases).

There are no blank rows or columns. The X values (the Nations) have been at the first column and the show names (the Quarters) are at the very first row.

Contrast that with a “tall-table” structure (how a database typically stores info) and you're able to see the chart tool can't correctly parse and show the information.

Note, this is not a case of right and wrong, and you'll encounter data in the two formats and even in between. It is about using the right shape for the context of your own situation. Do you want to create a graph in the data table? Go wide! Maybe you will need to create a tall format table then.

For additional reading on this topic, check out this excellent post, Spreadsheet Thinking vs Database Believing, from Robert Kosara.

5. Use great, consistent names

As your projects grow in size and sophistication, it pays to create a consistent naming plan for your nickname, for tabs, named ranges, variables and column headers.

Historically, computer programmers have averted spaces in names, and though it's not strictly necessary in Sheets, it's nonetheless a good idea to steer clear of spaces and non-alphanumeric characters in tab titles, called ranges and column titles.

Why?

Certain functions (e.g. within the select statement of the QUERY function) and certain add-ons still adhere to this strict no-space rule, although not very common, it will save you a hassle.

The preferred strategy is to use camel case or underscore notation, and you may choose whichever you prefer, e.g.:

customerData

customer_data

6. Data Validation, baby

One of the most time-consuming tasks data practitioners confront is tidying up and cleaning data.

The problem is more intense wherever there is user-generated content. Invariably, everybody will use various notations (e.g. US, USA, United States, America, US of A,…), or misspell names, or enter dates with month first vs. day , etc..

Whatever you can do in order to preempt this can save you a lot of time around the back-end, once you're working together with the information.

One technique is to use Data Validation to control what a person may input into a mobile. As an instance, you could present them with a drop-down menu of options (rather than a free-form field), or limit the cell to numbers only, or to positive numbers only, or all sorts of other data validation options.

Here's a more detailed example, mixing the ISEMAIL function with data validation to make sure only legitimate email addresses can be entered into your Sheets, ensuring better data accuracy going forward.

From the cell in which an email address will be entered, such as A1 in the image above, right click and choose Data Validation… or go to Data > Data Validation… menu option, which opens this popup.

This prevents an individual from entering an invalid email, or anything other than an email, into that cell, also displays a warning signal and leaves the cell empty:

7. Better yet, use Google Forms for data entry

Google Types are a much more robust way to collect user inputs, because it separates the information collection from the data storage/analysis. This will prevent users unintentionally (or intentionally) overwriting data and/or seeing data from other users.

They are super easy to set up and pair seamlessly with Google Sheets. I use them for all the clients I work with and my own audience surveys, too.

I've written about using Google Forms earlier, within this article.

8. One cellphone = one piece of information

Each cell should include just one piece of info. Don't be tempted to put more than one data point into a mobile.

Cells with single data points can be utilized in formulas and charts without problems. People that have several data points, well, they can't.

For Instance, If your dataset has distinct money values, then you'll want to use two columns to record the data, one for the value, and also one for the currency, like this:

This is Far Better than having a column like this on your dataset:

Note, you often get information in this format, with multiple data points encoded in one cell, and it is your job to divide out that data into different columns, which means you can do your own analysis. That is covered in my Data Cleaning class.

9. Distinguish columns you add (e.g. by colors)

It is helpful to be able to differentiate the initial raw data from any columns you've included in the course of your investigation. You will quickly lose track of the original data columns in case you do not mark them, and I have discovered adding a subtle color to the total column is the quickest and most reliable way of indicating this.

Can you tell which is the first and which are the columns I've added in this case?

10. Do not use formatting to convey data

Do not use formatting only to communicate data, merely to augment underlying data.

Having just told you at the former point to distinguish columns by color, I am now telling you not to? What?!?

That is different.

In the past point, color was used to make your life simpler, so you can quickly identify exactly what was original information and what has been calculated columns you have added. It is not communicating information, it is simply facilitating a more efficient workflow.

In this case, suppose you would like to classify revenue numbers into ones which require follow up versus ones that don't. Should you use a cell highlighting to identify the ones you want to follow up with, you risk that not being known by other users or worse, being missing should you export your Sheet into CSV, or somebody decides to make wholesale changes to the formatting.

The color is only a tool to help comprehension, to make it easier and faster for the user to work with your data.

11. Insert an index column for resizing & assessing

Anytime you end up sorting your data, such as to rank largest to smallest for metric X, then you will want to bring an indicator column, which will allow you to always get back to the initial order.

You will notice that it is a colored column, which means it is one that I have added to the dataset.

I have been stuck before, unable to undo the several sorts I'd done and left using a dataset in an arrangement than no longer made sense. There are a number of reasons why you may want to contact the order the information came in, possibly to check for your sequence transactions happened in, perhaps to explain your steps to somebody else, or even for sharing the data with other users.

12. Format the header row

Center your column headings.

Wrap header text.

That's all!

13. Freeze the header row

This should be something you do automatically without considering it. It's essential once you have over 20 rows of data, otherwise you will eliminate an eye on your column headings.

This is how locked top rows behave when you scroll right:

14. Convert formulas into static worth after use

As soon as you've used a formulation, it's normally a fantastic idea to flip those “reside” formulas to static worth, as long as you don't need to keep them active.

There are times when you need to keep your formula active because your information is changing and you would like to get the formulas to update in real-time. In that case, it's perfectly fine to leave “reside” formulas in your dataset.

But if you have used a formula to derive a column, such as an IF based on a static date column, and it is not going to ever change again, then you should change the column into static values.

Highlight the Entire column and then copy the data (Cmd + C on a Mac or Ctrl + C on a PC/Chromebook).

Then over the top of the very same data collection, Paste-Special-As-Values.

This can help make your Sheets run faster, which is an important factor if you have large amounts of data.

15. Keep backups of your formulas

In the previous measure, I recommended turning formulas which you have used into static values, once you're finished with them.

But, you should definitely keep a copy of those!

You'll thank yourself later once you need to re-use this, and you've got it available, or when someone asks you how you derived a particular column.

Definitely, the most convenient way to maintain a copy of formulas would be to paste a live copy to a row above your dataset and leave it there. That way it'll keep all of the comparative references complete, which means you can quickly copy it in your dataset to use again.

Here's an illustration and you'll see that I have colored the formula green, to indicate that I have a live formulation in that cell:

Don't be worried whether the formula shows an #N/A! , A cell or perhaps a wrong answer, what matters is that it keeps the relative references intact.

You can simply copy the formula from your column into the row above and vice-versa, and also understand that it will just work:

Once you've copied your formulation to the row above your dataset, feel free to copy-paste-special-values and turn the column into static values according to the preceding best practice.

Important Note: Purists will argue that you ought to have a single header row and nothing more. I believe there are pros and cons for this header-row-only strategy and the “utility” row strategy I have advocated here.

I use both approaches but have a tendency to favor the approach of having a few “usefulness” rows since I think that the benefits of keeping copies of formulas above your data analysis workings are well worth the small price to pay not to begin in row one. And as soon as you've set up your named ranges, the problem falls away anyhow.

One scenario I would go with a single header row strategy is if I was creating a level dataset (no formulas) to be shared.

16. Create named ranges for your datasets

This is a great habit to get into, and saves you having to manually emphasize your information or type from the scope reference each time you consult with your own information in a formulation.

It is going to reduce errors from erroneous array references or mixing relative/absolute references, also contains the added benefit of making your formulas better to understand.

Locate it in the Information menu: Data > Named ranges…

It is the difference between this, clean simple syntax:

=QUERY(countries,"SELECT B, C, D ORDER BY D ASC",1)

And this, where you need to emphasize the range you want:

=QUERY(Sheet1!A1:D234,"SELECT B, C, D ORDER BY D ASC",1)

17. Avoid merged cells

Merged cells should be avoided in datasets.

They cause all sorts of issues in datasets: they break formulas, no one will know which column or column they link to, your data gets overwritten, respectively…

I'm not suggesting that you never utilize merged cells, since they can be useful when building front-end software for Google Sheets and you also want to get a certain layout, such as in the header of the Facebook dash :

Just not in your datasets, ok!

18. Tell the story of one row

This is perhaps my favorite of the 18 best practices for working with data, as it's so simple but so effective.

\The idea is to see (out loud if you prefer!) Across one row of your information and really see and understand what is in each column.

It'll allow you to know what information is in your dataset or prompt you to research further if you don't know exactly what a column shows. Potentially it'll help save you performing unnecessary work since you'll know just what you have.

Developers frequently keep yellow rubber duckies sitting on their desks to debug their lines of code, so go on and follow their case if you like!

Want to join my Carpal Tunnel Reduction Society?

(Stop doing manual things with Google Sheets and start working smarter to focus on being an Analyst).

Comments

comments

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

The best practices for working with Google Sheets

by Ryan time to read: 11 min