# <ins>Tutorial 3.2: Reading & Writing Data</ins>
*ASTR 211: Observational Astronomy, Spring 2021* \
*Written by Mason V. Tea*

Astronomical data often comes with zillions of datapoints, and we have to store them somewhere. Once we've stored them somewhere, we have to be able to read them to do science with them. That's what this tutorial is all about.

This is a very brief introduction to reading from and writing data to files, specifically CSVs (and TSVs). **CSV** files, which have the naming format `file.csv`, are documents containing **c**omma **s**eparated **v**alues; TSVs (tab-separated values) are less common but still relevant. You can think of a CSV as a spreadsheet, where each row is a new line and each entry in a row is separated by a comma. For example, a file `personal-info.csv` may look something like this:

```
firstname,lastname,age,birthmonth,birthyear
joe,smith,20,june,1998
kate,yang,23,march,2001
```

The first row of a CSV typically contains labels for the columns, while the rest of the rows populate those columns. You might wonder why astronomers prefer CSVs to using Excel spreadsheets like normal people, and that's because it's much easier to have a machine parse (read) and append (write to) a CSV file.
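
To make "parse" a bit more concrete, here is a minimal sketch of how Python's built-in `csv` module would split the sample above into a header and rows. The text is held in a string (`sample`, a name made up for this example) rather than a real file, so nothing needs to exist on disk. A TSV should work the same way if you pass `delimiter='\t'` to `csv.reader()`.

```python
import csv
import io

# The contents of the hypothetical personal-info.csv, held in a string
sample = """firstname,lastname,age,birthmonth,birthyear
joe,smith,20,june,1998
kate,yang,23,march,2001
"""

# io.StringIO lets csv.reader treat the string like an open file
reader = csv.reader(io.StringIO(sample))

header = next(reader)   # first row: the column labels
rows = list(reader)     # remaining rows: the data

print(header)
print(rows[0])

# Every value comes back as a string, so numbers must be converted by hand
ages = [int(row[2]) for row in rows]
print(ages)  # [20, 23]
```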

This tutorial will cover reading and writing CSV files to the extent that I use them every day as someone whose research and homework require it, i.e. the basics. We'll talk about manipulating other types of files (e.g. FITS, the true astronomical filetype of choice) when we talk about telescope data processing in a few weeks.

In [36]:
import numpy as np

## Writing data

Let's start by populating a file with some data for us to poke around with. First, we want to open our file, so we can start writing in it. To do this, we use the `open()` command, which is built into Python. The way I like to go about it is using a `with` statement, like so:

In [5]:
with open("example.csv") as file:
    pass

FileNotFoundError: [Errno 2] No such file or directory: 'example.csv'

If you run this cell, you'll see that it throws an error: `No such file or directory: 'example.csv'`. When you give the `open()` function only the name of the file, it assumes you want to read it (the default mode is `'r'`, "read only"). Since the file doesn't exist yet, there's nothing to read, and you get an error.
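
If you'd rather have your code notice a missing file than crash, one option is to catch the error. This is just a sketch: `missing_path` is a made-up name pointing into a fresh temporary folder, so the file is guaranteed not to exist.

```python
import os
import tempfile

# A path inside a brand-new temporary folder, so the file cannot exist yet
missing_path = os.path.join(tempfile.mkdtemp(), "does-not-exist.csv")

try:
    # Only the filename is given, so Python tries to read the file
    with open(missing_path) as file:
        contents = file.read()
    found = True
except FileNotFoundError:
    # Reading a file that isn't there raises this error
    found = False
    print("No file yet; it needs to be created before it can be read.")
```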

To create a file from thin air, add a second argument `'w'`, meaning "write only". This will create the file you named (or empty it out if it already exists) and let you write to it.

In [7]:
with open("example.csv", 'w') as file:
    pass

If you run the cell above, you should have a new file called `example.csv` in the same folder as this Jupyter file. Now, let's add some stuff to the file with the `write()` command. We've _aliased_ this file that we've opened as `file`, so we can type `file.write("<stuff>")` under the `with` statement to add lines. (`file` is a variable name, and can be whatever you want, though.)

In [10]:
with open("example.csv", 'w') as file:
    file.write("Hello")

Now, if you open the file, you should just see the line "Hello". To check this, here are two side notes:

1. If you start a line in a Jupyter code cell with a percent sign, you can run one of Jupyter's "magic" commands, which (on macOS and Linux) include shortcuts for a handful of common Unix commands like `cat` and `ls`.
2. If you type `cat <filename>` in the command line, it'll print the contents of the file.

Using this, let's check that our file writing worked.

In [12]:
%cat example.csv

Hello

Sure did. Let's add another line.

In [13]:
with open("example.csv", 'w') as file:
    file.write("Hello")
    file.write("Hello")

In [14]:
%cat example.csv

HelloHello

We don't seem to have gotten a new line. For this, we have to use a "special character" of sorts, which we talked about when we went over strings. The "new line" character is `"\n"` -- adding this to the end of a string means the next entry will be on the next line. (Alternatively, adding it to the beginning of a string means _it_ will be on the next line.) Let's add this character and give it another try:

In [19]:
with open("example.csv", 'w') as file:
    file.write("Hello\n")
    file.write("Hello")

In [20]:
%cat example.csv

Hello
Hello

Great! Now, something I've neglected to mention until now is that we've been overwriting our file every time, if it wasn't obvious. That's because of the `"w"` argument: if we open up our file again with `"w"` and write some new stuff, the old contents will be gone.

In [23]:
with open("example.csv", 'w') as file:
    file.write("Howdy\n")

In [24]:
%cat example.csv

Howdy


If, instead, you want to add onto the end of your file, you can swap in the `"a"` argument (append). If we open our file this way now and add a new line, we should retain "Howdy\n".

In [25]:
with open("example.csv", 'a') as file:
    file.write("Hola")

In [26]:
%cat example.csv

Howdy
Hola
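
To see the two modes side by side, here's a quick sketch. It writes to a throwaway file in a temporary folder (`demo_path` is a name invented for this example), so it won't touch `example.csv`. It also reads the file back with `'r'` and `read()` instead of `%cat`, purely so the check works outside of Jupyter.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "modes-demo.csv")

# 'w' starts from an empty file every time it is opened
with open(demo_path, 'w') as file:
    file.write("first\n")
with open(demo_path, 'w') as file:
    file.write("second\n")

with open(demo_path, 'r') as file:
    after_write = file.read()

# 'a' keeps what is there and adds to the end
with open(demo_path, 'a') as file:
    file.write("third\n")

with open(demo_path, 'r') as file:
    after_append = file.read()

print(repr(after_write))   # 'second\n'
print(repr(after_append))  # 'second\nthird\n'
```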

TL;DR: `"a"` preserves the file and adds onto the end of it, while `"w"` does not. Now, you can see how we can create something that lives up to the name of a CSV with this `open()` function: just add stuff line by line.

In [27]:
with open("example.csv", 'w') as file:
    file.write("firstname,lastname,birthmonth\n")
    file.write("joe,smith,june\n")
    file.write("kate,yang,march")

In [28]:
%cat example.csv

firstname,lastname,birthmonth
joe,smith,june
kate,yang,march

If you're adding a bunch of data to a CSV, you can totally use this `open()` function in a loop, in combination with some string manipulations. For example, let's add numbers in groups of three to our file.

In [30]:
for x in range(20):
    with open('example.csv', 'w') as file:
        file.write('{0},{1},{2}\n'.format(x,x+1,x+2))

In [31]:
%cat example.csv

19,20,21


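Remember, though: `"w"` overwrites the file. So, by using `"w"` in this loop, we just overwrote the first line over and over again, and only the last one survived. Let's try again with `"a"`, this time in a fresh file, `example2.csv`, so the leftover line above doesn't come along for the ride: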

In [33]:
for x in range(20):
    with open('example2.csv', 'a') as file:
        file.write('{0},{1},{2}\n'.format(x,x+1,x+2))

In [34]:
%cat example2.csv

0,1,2
1,2,3
2,3,4
3,4,5
4,5,6
5,6,7
6,7,8
7,8,9
8,9,10
9,10,11
10,11,12
11,12,13
12,13,14
13,14,15
14,15,16
15,16,17
16,17,18
17,18,19
18,19,20
19,20,21
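
Opening the file anew on every pass through the loop works, but you can also open it once and loop inside the `with` block. A minimal sketch of that alternative, writing to a temporary file (`loop_path` is a made-up name) so that it doesn't clobber the files above:

```python
import os
import tempfile

loop_path = os.path.join(tempfile.mkdtemp(), "example3.csv")

# Open once in 'w' mode; every write inside the block lands in the same file
with open(loop_path, 'w') as file:
    for x in range(20):
        file.write('{0},{1},{2}\n'.format(x, x+1, x+2))

# Read the file back to check the result
with open(loop_path) as file:
    lines = file.readlines()

print(len(lines))         # 20
print(lines[0].strip())   # 0,1,2
print(lines[-1].strip())  # 19,20,21
```

Because the file is opened with `"w"` only once, rerunning this cell gives the same 20 lines rather than piling on 20 more, which is what happens if you rerun the `"a"` version.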


That's about all there is to say about writing. Now let's talk about what to do once you have a file you want to use.

## Reading data

While you can definitely use the `open()` function to read lines of your file, it's much easier when working with data (rather than, say, text) to use a library called `pandas`. This library comes with a structure similar to a `numpy` array called a `DataFrame`, which you can just think of as a table. 
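
For comparison, here's roughly what reading a CSV with plain `open()` involves. This is a sketch on a tiny made-up file (`tiny_path` and its contents are invented for the example): you have to split each line on commas and convert the strings to numbers yourself.

```python
import os
import tempfile

# Make a small file to read, in a temporary folder
tiny_path = os.path.join(tempfile.mkdtemp(), "tiny.csv")
with open(tiny_path, 'w') as file:
    file.write("colA,colB\n1.5,2.0\n-0.5,4.0\n")

with open(tiny_path, 'r') as file:
    # First line holds the column names; strip() removes the trailing "\n"
    names = file.readline().strip().split(',')
    # Start one empty list per column
    columns = {name: [] for name in names}
    # The remaining lines hold the data
    for line in file:
        values = line.strip().split(',')
        for name, value in zip(names, values):
            columns[name].append(float(value))

print(columns)  # {'colA': [1.5, -0.5], 'colB': [2.0, 4.0]}
```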

Before we try it out, I'm going to make a CSV with some random values for us to test `pandas` on. First, the values:

In [38]:
val1 = np.random.randn(20)
val2 = np.random.randn(20)
val3 = np.random.randn(20)

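Then, the column names (using `"w"` here so that rerunning the cell starts from a clean file instead of adding a second header):

In [39]:
with open('random.csv','w') as file: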
    file.write('colA,colB,colC\n')

Finally, add the data:

In [40]:
for x in range(len(val1)):
    with open('random.csv','a') as file:
        file.write('{0},{1},{2}\n'.format(val1[x],val2[x],val3[x]))

In [41]:
%cat random.csv

colA,colB,colC
1.9836130420095421,0.04824535428206758,-0.6077963644263686
0.5247187259291767,-0.22430098876900806,0.011550230781106768
-0.28750600781964525,-0.5936154289298304,-0.037278140386934944
-0.1369824955136523,-0.6362942500144838,2.280356099225377
-0.3622079922244438,0.5423653139585406,-1.789900588307686
1.6814764704453473,-0.3988143723262242,0.20600481272680357
0.2868356023970937,1.0667656538878563,-0.16783808258082866
-1.414392612309825,1.3128734053396185,-0.5537060116875314
-0.715657384461026,-0.4024411883263483,-0.24237190738321882
1.2050861233402201,1.5754237740222052,1.264855520601279
-1.145545762480665,0.7546460222270381,-1.0089585012186073
0.7696035075040498,0.06501927544855607,-0.6390329741835039
0.30310712124667355,-0.8415151674906627,-0.8471543297316817
0.16467268906973545,1.9469985767758837,0.04096914210952014
-1.3366357249991512,2.1062706545730654,-0.21875363743539258
1.062100814405859,-0.5414327304651658,-0.18079044549951512
-0.7014849724923459,-0
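
As an aside, `numpy` can write a file like this in one call. The sketch below uses `np.savetxt()` on freshly generated random numbers and a temporary file (`savetxt_path` is a made-up name), so its values won't match the ones printed above.

```python
import os
import tempfile
import numpy as np

savetxt_path = os.path.join(tempfile.mkdtemp(), "random2.csv")

# Stack three 20-element arrays side by side into a 20x3 table
table = np.column_stack([np.random.randn(20),
                         np.random.randn(20),
                         np.random.randn(20)])

# comments='' stops numpy from putting a '#' in front of the header line
np.savetxt(savetxt_path, table, delimiter=',',
           header='colA,colB,colC', comments='')

# Read the file back to check the header and count the data rows
with open(savetxt_path) as file:
    first_line = file.readline().strip()
    n_data_lines = len(file.readlines())

print(first_line)     # colA,colB,colC
print(n_data_lines)   # 20
```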

Next, we'll import `pandas` (commonly aliased as `pd`) and get to work. To load the contents of the file into a `DataFrame`, we use the `read_csv()` function, the result of which we assign to a variable (in this case, we'll call it `data`).

In [42]:
import pandas as pd

data = pd.read_csv('random.csv')

Now, let's check out what's inside. The `data` variable is now a `DataFrame` type, with all of the contents of the CSV separated into the categories specified by the column titles. Think of each column as its own array -- in order to access that array, you have to index it using its _name_ rather than a number. 
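
If you're not sure which names are available, the `DataFrame` can tell you. A small sketch on a made-up three-row table (`small` is a name invented for this example, built from a string instead of a file):

```python
import io
import pandas as pd

# io.StringIO lets read_csv treat a string like a file
small = pd.read_csv(io.StringIO("colA,colB\n1.0,10.0\n2.0,20.0\n3.0,30.0\n"))

column_names = list(small.columns)  # the labels from the first row
n_rows = len(small)                 # number of data rows (header not counted)
first_two = small.head(2)           # a new DataFrame holding just the first 2 rows

print(column_names)   # ['colA', 'colB']
print(n_rows)         # 3
print(first_two)
```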

For example, let's look at the contents of column A (`colA`).

In [45]:
print(data['colA'])

0     1.983613
1     0.524719
2    -0.287506
3    -0.136982
4    -0.362208
5     1.681476
6     0.286836
7    -1.414393
8    -0.715657
9     1.205086
10   -1.145546
11    0.769604
12    0.303107
13    0.164673
14   -1.336636
15    1.062101
16   -0.701485
17    0.468505
18   -1.321258
19   -0.027720
Name: colA, dtype: float64


As you can see, it behaves a lot like a list (`pandas` calls it a `Series`), with the index of each element labeled when we print it. To access an individual value, we index it just like a normal list, after first selecting the column by name. Let's grab the 6th element, which has index 5 since Python counts from zero.

In [47]:
print(data['colA'][5])
print(type(data['colA'][5]))

1.681476470445347
<class 'numpy.float64'>


Note that reading in data like this means that you've loaded a copy into the Python program, i.e. you're not actually editing your CSV file.
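
If you do want your changes saved, you have to write them back out. One way is the `to_csv()` method of a `DataFrame`; the sketch below uses a made-up table and a temporary output file (`table` and `out_path` are invented names) rather than `random.csv`.

```python
import io
import os
import tempfile
import pandas as pd

table = pd.read_csv(io.StringIO("colA,colB\n1.0,10.0\n2.0,20.0\n"))

# Changing the DataFrame only changes the copy in memory
table['colA'] = table['colA'] * 10

out_path = os.path.join(tempfile.mkdtemp(), "scaled.csv")

# index=False leaves out the row labels (0, 1, ...) so only the columns are saved
table.to_csv(out_path, index=False)

# Read the new file back as plain text to see what was written
with open(out_path) as file:
    saved = file.read()
print(saved)
```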

Also, you will be happy to know that these named columns work the same way as `numpy` arrays, in that you can perform calculations on all of their values all at once. For example, let's make a copy of this column and multiply all the values by 10.

In [48]:
colA = data['colA']*10
print(colA)

0     19.836130
1      5.247187
2     -2.875060
3     -1.369825
4     -3.622080
5     16.814765
6      2.868356
7    -14.143926
8     -7.156574
9     12.050861
10   -11.455458
11     7.696035
12     3.031071
13     1.646727
14   -13.366357
15    10.621008
16    -7.014850
17     4.685047
18   -13.212578
19    -0.277204
Name: colA, dtype: float64


So, that means that you can load in arrays of data from a CSV and do math just like you would when defining the arrays yourself!
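
For instance, columns can be combined with each other, summarized, or handed to `numpy` directly. A short sketch on a made-up table (`toy` and the `total` column name are invented for this example):

```python
import io
import numpy as np
import pandas as pd

toy = pd.read_csv(io.StringIO("colA,colB\n1.0,3.0\n2.0,5.0\n3.0,7.0\n"))

# Element-by-element arithmetic between two columns, stored as a new column
toy['total'] = toy['colA'] + toy['colB']

# numpy functions accept a column directly
mean_a = np.mean(toy['colA'])

# to_numpy() gives a plain numpy array if you need one
as_array = toy['total'].to_numpy()

print(mean_a)             # 2.0
print(as_array.tolist())  # [4.0, 7.0, 10.0]
```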

That's the basics, and essentially all I've needed to get by in the last four years.