こじ研 (Introduction to Deep Learning)

Python for data science

This is a hands-on tutorial on Python, a high-level, general-purpose programming language.

Python is an interpreted language. There is no need to compile, and Python gives you an interactive programming environment.
Python is dynamically typed. There is no need to specify the types of variables and arguments. Python interpreters are equipped with an automatic garbage collector.
Python uses whitespace indentation to indicate code blocks. There is no need for curly braces or "begin"/"end".

Here is a simple example of a Python program to find the maximum value in a list of numbers. Click ▶ to run it. Try it out!
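The embedded console is not reproduced in this text. A minimal sketch of such a program might look like the following; the sample list is an assumption, while the names "n" and "maximum" follow the ones this tutorial uses when it comes back to the example.

```python
# Find the maximum value in a list of numbers.
numbers = [3, 1, 4, 1, 5, 9, 2, 6]   # sample data (an assumption)

maximum = numbers[0]                 # first candidate for the maximum
for n in numbers:
    if n > maximum:                  # found a larger value
        maximum = n                  # update the candidate

print(maximum)                       # prints 9
```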

I recommend using Python for learning "data science". This is because (1) Python is widely used in real-world applications, (2) Python has strong connections with existing frameworks and libraries for "data analysis" and "machine learning", and (3) Python is also strong in graphical data visualization, which will be one of the key technologies needed in future collaboration between AI and humans.

Here I assume that you have some experience in computer programming, say, in C/C++, JavaScript, or another programming language, but I will keep this tutorial as plain as possible.

Note that we use Python 3 here, which is becoming increasingly popular among data scientists.

Install Python (optional)

If you want to continue using web-based Python (like the one above), you can skip this section. Come back when you need to install Python yourself. This tutorial page includes embedded Python consoles (by trinket.io), where you can run sample code, modify it, and test it. So, for the time being, you do not have to install Python on your PC.

If you want to install Python for real-world practice, here are instructions for installing Anaconda, a Python distribution suited to data analysis. Download and install Anaconda from https://www.continuum.io/downloads. Choose your platform (Windows, macOS, Linux) and download the installer for the latest version of Python 3 (3.6 or later). For macOS, the "Graphical Installer" is the easiest to handle. For Windows, "64-bit" is appropriate in most cases. For Linux, choose "64-bit" for x86.

After installation, open "Terminal" or an equivalent shell application, and type:

python to use Python in the terminal.
jupyter notebook to use Python in an interactive "notebook" environment.

In "interactive mode", the Python interpreter runs each line as you type it in. In "editor mode", by contrast, you first edit a Python program and then run it, as we saw in the previous section. A number of IDEs are also available.

Hello, world!

Try the following example. print() outputs the value of its argument to the console. Python interprets and executes the commands line by line.

A string outputs itself.
You can choose "abc" or 'abc' for describing string constants. You may also write 'You said "Yes, we can".' and "I said 'How?'.".
You can put a number of arguments separated by commas. They are printed in a line with a space between them.
When you connect strings with "+", no space is added.
For math expressions, the values are evaluated and then printed.
Division "/" always results in a float ("2 / 2" gives "1.0" here).
Of course, strings and expressions can be arranged together.
But "+" cannot join a string and a number. You need to convert the number to a string using str() first.
Note again that division "/" results in a float, which is accurate to about 16 significant digits.
If you need "integer division" (like in C/C++), use "//" instead.
For the remainder of division, use "%" (like in C/C++).
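The points in the list above can be tried with a short sketch like this one. The particular strings and numbers are assumptions chosen for illustration, not the original embedded example.

```python
print("Hello, world!")              # a string outputs itself
print('You said "Yes, we can".')    # single quotes around double quotes
print("I said 'How?'.")             # double quotes around single quotes
print("abc", "def", 123)            # abc def 123 (separated by spaces)
print("abc" + "def")                # abcdef (no space is added)
print(1 + 2 * 3)                    # 7
print(2 / 2)                        # 1.0 ("/" always gives a float)
print("7 / 2 =", 7 / 2)             # 7 / 2 = 3.5
print("7 / 2 = " + str(7 / 2))      # 7 / 2 = 3.5 (number converted by str())
print(1 / 3)                        # 0.3333333333333333
print(7 // 2)                       # 3 (integer division)
print(7 % 2)                        # 1 (remainder)
```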

Variables

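Now try the following example. You can use any identifier other than a reserved keyword (like "if", "else", and "while") as a variable name. Built-in names such as "print" are not reserved in Python 3, but reusing them as variable names hides the original function, so avoid doing so. Python is case sensitive; "message", "Message", and "MESSAGE" are all different identifiers.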

In Python, no declaration is needed for variables. When you want to use a new variable, just name it and assign some value to it. That's it!

Python is a dynamically typed language. You do not need to specify the type of a variable. The type (int, float, str, bool, etc.) belongs to the value, and a variable simply takes on the type of whatever value is assigned to it. You can overwrite a variable with a new value of a different type.
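A small sketch of these rules follows. The variable names and values are assumptions for illustration only.

```python
message = "Hello"          # no declaration needed: just assign a value
Message = 42               # a different variable (names are case sensitive)
print(message, Message)    # Hello 42

x = 10                     # x holds an int
print(type(x))             # <class 'int'>
x = 2.5                    # the same name now holds a float
print(type(x))             # <class 'float'>
x = "ten"                  # ... now a str
print(type(x))             # <class 'str'>
x = True                   # ... and now a bool
print(type(x))             # <class 'bool'>
```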

Iterations (loops)

"for" loop

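Next, try out the following example. In the first half, the "for" loop computes the sum of 1, 2, ..., 10. Python often uses a list, like [1, 2, ..., 10], for such iterations.

A sequence of integers can also be generated by using "range([start,] stop[, step])". For example, list(range(3)) gives [0, 1, 2], and list(range(3, 9, 2)) gives [3, 5, 7]. In Python 3, range() itself returns a "range" object rather than a list; a "for" loop can iterate over it directly, and list() converts it to a list when needed. Note that the value of "stop" is not included in the sequence.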

And never forget to put ":" (colon) at the end of the "for" line. The colon indicates that a block starts from the next line. Take a look at the next example.

A block consists of consecutive lines sharing the same level of indentation (by leading spaces). This is sometimes called the "off-side rule". It is strongly recommended to use spaces rather than tabs for this indentation. A 4-space indentation is commonly used in the Python community.
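As a rough sketch of the "for" loop, the colon, and the indented block described above (the variable name "total" is an assumption):

```python
# Sum of 1, 2, ..., 10 using an explicit list.
total = 0
for n in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
    total += n              # indented by 4 spaces: this line is the loop body
print(total)                # 55 (not indented: runs once, after the loop)

# The same sum using range(); the stop value 11 is not included.
total = 0
for n in range(1, 11):
    total += n
print(total)                # 55

print(list(range(3)))        # [0, 1, 2]
print(list(range(3, 9, 2)))  # [3, 5, 7]
```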

"while" loop

Another way to make a loop is "while". Try this example. This program generates the "Fibonacci sequence" and computes the "golden ratio" (1.6180339887...).

Note that you need to count the control variable ("k" here) up or down yourself; otherwise, Python could be trapped in an infinite loop. "k += 1" is an abbreviated version of "k = k + 1". Python does not allow "k++" or "k--" for an increment or decrement. Use "k += 1" or "k -= 1" instead.
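One possible sketch of such a "while" loop is shown here. Only the control variable "k" comes from the text; the names "a", "b", "c" and the choice of 30 iterations are assumptions.

```python
a = 1                      # previous Fibonacci number
b = 1                      # current Fibonacci number
k = 0                      # control variable
while k < 30:
    print(b, b / a)        # the ratio approaches the golden ratio
    c = a + b              # next Fibonacci number
    a = b
    b = c
    k += 1                 # without this line the loop would never end
```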

The next example generates the same sequence, but it is designed to stop right after the value exceeds 10000. The "break" statement terminates the innermost loop that encloses it. Note that "while True: ..." works as an infinite loop; so, "break" has to be executed at some point inside the loop.
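A minimal sketch of the "while True" and "break" pattern, using assumed variable names "a", "b", and "c":

```python
a = 1
b = 1
while True:                # infinite loop: must be left with "break"
    print(b)
    if b > 10000:          # stop right after the value exceeds 10000
        break
    c = a + b
    a = b
    b = c
# The last value printed is 10946.
```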

Conditionals (if-elif-else)

Let us look again at the first example on this page.

We have already used a conditional, namely an if statement, when we checked whether the number (n) in the list exceeds our candidate for the maximum value (maximum).

An if statement may take several patterns of control flow. Here I summarize the typical patterns. Of course, you can embed another if statement inside any block.

if <condition>:
    <then-block>

if <condition>:
    <then-block>
else:
    <else-block>

if <condition1>:
    <then-block>
elif <condition2>:
    <elif-block>

if <condition1>:
    <block1>
elif <condition2>:
    <block2>
elif <condition3>:
    <block3>

if <condition1>:
    <block1>
elif <condition2>:
    <block2>
elif <condition3>:
    <block3>
else:
    <block4>

The example below shows typical usage of if statements. All three parts determine whether the given year (say, 2017) is a leap year (with 366 days) or not (with 365 days). As Wikipedia says, "Every year that is exactly divisible by four is a leap year, except for years that are exactly divisible by 100, but these centurial years are leap years if they are exactly divisible by 400."

The first, second, and third parts all do exactly the same thing: they compute the number of days in the given Gregorian year.
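The leap-year rule can be written in several equivalent ways. The following three parts are a sketch under assumptions: the variable names are hypothetical, and the third part uses the "and"/"or" operators, which the original example may not have used.

```python
year = 2017

# Part 1: nested if-else, following the wording of the rule.
if year % 4 == 0:
    if year % 100 == 0:
        if year % 400 == 0:
            days1 = 366
        else:
            days1 = 365
    else:
        days1 = 366
else:
    days1 = 365

# Part 2: if-elif-else, checking the most specific case first.
if year % 400 == 0:
    days2 = 366
elif year % 100 == 0:
    days2 = 365
elif year % 4 == 0:
    days2 = 366
else:
    days2 = 365

# Part 3: a single condition combined with "and" / "or".
if year % 4 == 0 and (year % 100 != 0 or year % 400 == 0):
    days3 = 366
else:
    days3 = 365

print(year, days1, days2, days3)    # 2017 365 365 365
```

Try changing "year" to 2000, 1900, or 2016 and check that all three parts still agree.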

Note that "elif" corresponds to "else if" or "elsif" in other programming languages. You could replace "elif ...:" with "else:" + "if ...:", but this makes the indentation deeper and the program harder to read. Please use "elif ...:" where applicable.

Importing "modules"

Modules are dynamically loadable packages of functions and constants. There are standard modules like "math" (exp, log, sin, etc.) and "random" (randint, randrange, etc.); you can also make your own modules.

Math

To incorporate a module (for example, "math") into your program, just write "import math". After that line, you can use any function or constant in the "math" module, like math.sqrt() and math.pi (3.14159...). Take a look at this example.
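A short sketch of using the "math" module; the particular calls shown are chosen for illustration.

```python
import math

print(math.pi)                  # 3.141592653589793
print(math.sqrt(2))             # 1.4142135623730951
print(math.sin(math.pi / 2))    # 1.0
print(math.exp(1))              # e, the base of natural logarithms
print(math.log(100, 10))        # logarithm of 100 to base 10, about 2
```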

Random

Sometimes the module name is a bit too long to prefix to its function and constant names. In such a case, you can "rename" it as in the following example, where "ran.randrange(0, 5)" returns 0, 1, 2, 3, or 4, randomly.

When importing the "random" module, we rename it as "ran". So, we can access the members of the module by, for example, ran.randrange(0, 10). The function "ran.randrange(start, end)" generates a random integer from "start" to "end - 1" (that is, "end" itself is excluded). "len(list)" returns the number of elements in "list".

Note that "subjects", "verbs", and "objects" are lists of strings. You can get the first element of "subjects" by "subjects[0]". The index always starts from 0, just like ordinary arrays in other programming languages. So, you can get the last element of "subjects" by "subjects[len(subjects) - 1]".
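A sketch of a random sentence generator along these lines is given below. The list names "subjects", "verbs", and "objects" come from the text; their contents and the names "s", "v", "o" are assumptions.

```python
import random as ran

subjects = ["I", "You", "We"]
verbs = ["like", "eat", "see"]
objects = ["apples", "cats", "Python"]

for k in range(5):
    # randrange(0, len(...)) picks a valid index: 0 .. len(...) - 1
    s = subjects[ran.randrange(0, len(subjects))]
    v = verbs[ran.randrange(0, len(verbs))]
    o = objects[ran.randrange(0, len(objects))]
    print(s, v, o)

print(subjects[0])                    # first element: I
print(subjects[len(subjects) - 1])    # last element: We
```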

Other modules...

There are tons of modules available. Some modules you will be using in the study of "data science" may include:

numpy : Numerical computation tools including n-dimensional arrays.
scipy : Works with numpy for computing integrals, differential equations, etc.
pandas : Data manipulation tools for statistical data analysis.
matplotlib : Plotting tools for data visualization.
theano : Mathematical computing tools using GPUs (by Université de Montréal).
tensorflow : Machine learning tools (by Google) for training multi-layer neural networks.

Lists and dictionaries

Lists

In the previous example, we glanced at lists of strings. A list is a sequence of objects enclosed in square brackets. The following example gives further hints on their usage.

Note that "remove()", "append()", and "pop()" modify the list in place. "pop(n)" removes (and returns) the element at index n in the list; "pop()" without an argument removes the last element.
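These list operations can be sketched as follows; the list contents are an assumption for illustration.

```python
fruits = ["apple", "banana", "cherry"]
print(len(fruits))          # 3

fruits.append("durian")     # add an element at the end
print(fruits)               # ['apple', 'banana', 'cherry', 'durian']

fruits.remove("banana")     # remove the first element equal to "banana"
print(fruits)               # ['apple', 'cherry', 'durian']

last = fruits.pop()         # remove and return the last element
print(last)                 # durian

first = fruits.pop(0)       # remove and return the element at index 0
print(first, fruits)        # apple ['cherry']
```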

Lists of numbers have some additional functions. Take a look at the next example. Note that "a.sort()" changes the order of the elements of the list "a" in place.
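Which functions the original example used is not recoverable here; as an assumption, the sketch below uses the built-in functions len(), sum(), max(), min(), and sorted(), together with the "a.sort()" method mentioned in the text.

```python
a = [3, 1, 4, 1, 5, 9, 2, 6]
print(len(a), sum(a), max(a), min(a))   # 8 31 9 1

b = sorted(a)       # returns a new sorted list; "a" is unchanged
print(b)            # [1, 1, 2, 3, 4, 5, 6, 9]
print(a)            # [3, 1, 4, 1, 5, 9, 2, 6]

a.sort()            # sorts "a" itself, in place
print(a)            # [1, 1, 2, 3, 4, 5, 6, 9]
```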

Dictionaries

A dictionary holds a number of key : value pairs, much like a JSON object in JavaScript. Keys are usually strings; values may be of different types, like numbers, strings, or nested dictionaries. Elements of a dictionary are accessed by key, not by position.

Note that print() here performs formatted output, where "{0}", "{1}", etc. are replaced with the arguments of the following "format()". You can specify the style of each output field. For example, "{0:4d}" outputs an integer in a field at least 4 characters wide, "{0:4.3f}" outputs a real value with 3 decimal places, as in "1234.567", and "{0:06x}" outputs an integer as 6 digits of hexadecimal (padded with leading zeros).
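The following sketch puts a dictionary and format() together. The dictionary "person" and its contents are hypothetical and are not taken from the original example.

```python
person = {"name": "Alice", "age": 20, "address": {"city": "Tokyo"}}

print(person["name"])               # Alice
print(person["address"]["city"])    # Tokyo (a nested dictionary)

person["age"] = 21                  # overwrite an existing value
person["email"] = "alice@example.com"   # add a new key : value pair

for key in person:                  # iterate over the keys
    print("{0}: {1}".format(key, person[key]))

print("{0:4d}".format(42))          # "  42" (padded to 4 characters)
print("{0:4.3f}".format(1234.567))  # 1234.567
print("{0:06x}".format(255))        # 0000ff
```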

Functions

A function in Python is, just like functions in other programming languages, a chunk of reusable code that performs a certain task. We have already used the "print()" function and others. Here we learn to make your own functions. Look at the two examples below. Both compute the factorial of n (n! = 1 × 2 × ... × n).

The first example is a straightforward implementation of the factorial, which uses a loop of multiplications.

The second example uses "recursion". The recursion stops at n = 1 (since 1! = 1); otherwise, it computes (n - 1)! by calling itself recursively, then computes n × (n - 1)! to get n!.
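The two approaches might be sketched as below. The function names are hypothetical, and the recursive version stops at n <= 1 (rather than exactly n = 1) so that 0! = 1 also works; that is a small assumption beyond the text.

```python
def factorial_loop(n):
    """Compute n! with a loop of multiplications."""
    result = 1
    for k in range(1, n + 1):
        result *= k
    return result


def factorial_rec(n):
    """Compute n! recursively: n! = n * (n - 1)!."""
    if n <= 1:                  # the recursion stops here
        return 1
    return n * factorial_rec(n - 1)


print(factorial_loop(5))        # 120
print(factorial_rec(10))        # 3628800
```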

A function can return not only a number, but also a string, list, or other types of data. Take a look at the examples below.

Functions can return multiple values. The following examples show how to do that. The first example uses "fib2()" which returns 2 values.

The second example uses multiple assignment. For example, when "x, y = a, b" is executed, "a" is assigned to "x" and "b" is assigned to "y" at the same time. Furthermore, "a, b = b, a" swaps the values of "a" and "b".
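The name "fib2()" comes from the text, but its original definition is not shown, so the version below is only an assumed sketch: it returns the n-th and (n+1)-th Fibonacci numbers, counting 1, 1, 2, 3, 5, ...

```python
def fib2(n):
    """Return two values: the n-th and (n+1)-th Fibonacci numbers."""
    a, b = 1, 1                 # multiple assignment
    for k in range(n - 1):
        a, b = b, a + b         # both right-hand values use the old a and b
    return a, b                 # two values, separated by a comma


f10, f11 = fib2(10)             # receive both values at once
print(f10, f11)                 # 55 89

x, y = 1, 2
x, y = y, x                     # swap
print(x, y)                     # 2 1
```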

NumPy

NumPy (numpy) is an external Python library for numerical computation, including tools for n-dimensional arrays like vectors and matrices. You need to "import" this library before use. Here we consistently refer to the NumPy library as "np".

Take a look at the example below. The first line imports NumPy and renames it as "np". From then on, we can access the functions and constants in NumPy as, say, np.array(), which generates a NumPy array (numpy.ndarray).

The operators "+", "-", "*", and "/" between two NumPy arrays (of the same "shape") perform "element-wise" calculations. Note that "*" here also performs "element-wise" multiplication, not a matrix product. When one operand is a scalar (like "2") and the other is a NumPy array, the scalar is "broadcast" to every element of the array, making the same "element-wise" calculation possible.
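A brief sketch of element-wise arithmetic, with assumed sample vectors "a" and "b":

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b)      # [5 7 9]
print(a - b)      # [-3 -3 -3]
print(a * b)      # [ 4 10 18]   (element-wise, not a dot product)
print(b / a)      # element-wise division: 4/1, 5/2, 6/3
print(a * 2)      # [2 4 6]      (the scalar 2 is broadcast)
```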

NumPy can efficiently handle not only simple vectors but also n-dimensional arrays. For example, we can make a 2-dimensional array (also known as a "matrix") in the following manner.

As you see here, you can make a matrix in a similar manner to vectors. "ndim" returns the number of dimensions ("1" for vectors, "2" for matrices, etc.). "dtype" returns the data type of the elements; all elements of a NumPy array have the same type. "shape" returns a tuple of the sizes along each dimension. By specifying a "shape", you can also make an array of "1"s or "0"s with "np.ones()" and "np.zeros()" respectively. "np.identity(n)" generates an identity matrix E of shape (n, n).
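These attributes and constructors can be tried with a sketch like this; the sample matrix is an assumption.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

print(A.ndim)     # 2
print(A.shape)    # (2, 3)  (2 rows, 3 columns)
print(A.dtype)    # an integer type such as int64 (platform dependent)

print(np.ones((2, 3)))     # 2 x 3 array filled with 1.0
print(np.zeros((2, 3)))    # 2 x 3 array filled with 0.0
print(np.identity(3))      # 3 x 3 identity matrix
```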

For calculations, NumPy automatically "broadcasts" the missing elements, so that it can perform "element-wise" calculation. For example, "[10, 20]" expands to "[[10, 20], [10, 20]]" when multiplied with a (2, 2) matrix. Take a look at the following example.
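A minimal sketch of broadcasting with a (2, 2) matrix follows. The names "A", "v", and "c" and their values are assumptions.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
v = np.array([10, 20])          # shape (2,): treated as one row
c = np.array([[10], [20]])      # shape (2, 1): one column

print(A * v)                    # rows are [10, 40] and [30, 80]
print(A * np.array([[10, 20],
                    [10, 20]])) # the same result, written out in full

print(A * c)                    # rows are [10, 20] and [60, 80]
print(A + 100)                  # a scalar is broadcast to every element
```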

The rule of "broadcast" is straightforward. Compare the code above and the way of "broadcast" illustrated below.

You can access any element of a NumPy array. Let us look at the following examples.

Just as for a list, you can pick out the i-th element of an array "A" by "A[i]", which is the i-th row vector. You can then pick out the j-th element of "A[i]" by "A[i][j]", which is a single number. You can also use "A[i, j]" for the same purpose as "A[i][j]".

Interestingly, "A[1, :]" gives us the same result as "A[1]", where ":" stands for "all". How about "A[:, 2]"? This gives us "[3, 6, 9]", the right-most column returned as a 1-dimensional array. Then, how about "A[:, [0, 2]]"? This form collects the 0th and 2nd elements of each row. Likewise, "A[:, [2]]" gives us the right-most column as a column vector "[[3], [6], [9]]".
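Assuming "A" is the 3 x 3 matrix of 1 to 9 (which is consistent with the results quoted above), the indexing forms can be sketched as:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(A[1])           # [4 5 6]   row 1
print(A[1][2])        # 6
print(A[1, 2])        # 6         same element as A[1][2]
print(A[1, :])        # [4 5 6]   same as A[1]
print(A[:, 2])        # [3 6 9]   right-most column as a 1-dimensional array
print(A[:, [0, 2]])   # columns 0 and 2 of every row, shape (3, 2)
print(A[:, [2]])      # right-most column as a (3, 1) column vector
```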

There is yet another interesting way of accessing array elements. Take a look at the following example.

Let A be a simple array "[10, 11, 12, ..., 19]" and E1 be "[1, 3, 5, 7, 9]"; then "A[E1]" gives us an array of only the elements selected by using "E1" as indices. In this case, this gives us "[11, 13, 15, 17, 19]".

When we evaluate "A < 15", imagine that "15" internally expands to "[15, 15, ..., 15]" by "broadcast". Then an "element-wise" comparison is performed, which gives us an array of True/False values. Similarly, "A % 2 == 0" performs an "element-wise" evaluation to check whether each element of A is an even number (True) or not (False). An array of True/False values can be used to pick out elements of an array, so "A[E3]", where E3 is the result of "A % 2 == 0", gives us a sub-array of the even elements. Note that writing "A[A % 2 == 0]" directly does exactly the same.
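A sketch of index-array and True/False selection is shown below. "A", "E1", and "E3" follow the text; the name "E2" for the result of "A < 15" is an assumption.

```python
import numpy as np

A = np.arange(10, 20)       # [10 11 12 ... 19]

E1 = [1, 3, 5, 7, 9]        # a list of indices
print(A[E1])                # [11 13 15 17 19]

E2 = A < 15                 # element-wise comparison: array of True/False
print(E2)
print(A[E2])                # [10 11 12 13 14]

E3 = A % 2 == 0             # True where the element is even
print(A[E3])                # [10 12 14 16 18]
print(A[A % 2 == 0])        # the same, without the intermediate variable
```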

PyPlot

PyPlot is a simple but powerful module for plotting graphs. PyPlot is part of the larger Matplotlib library for data visualization. Please look at the following example.

Line 2 imports the "pyplot" module from the "matplotlib" library and names it "plt". Line 4 generates a long NumPy array "[-3.0, -2.9, -2.8, ..., 2.8, 2.9]" (exclusive of "3.0") with a step of "0.1". Line 5 performs "element-wise" multiplication; in other words, each element of "x" is squared, generating a new array "y". Finally, lines 6 and 7 draw the graph of "y" against "x".
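A reconstruction of such a program, laid out so that its lines match the description above, might look like this. It is an assumed sketch, not the original embedded code.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-3.0, 3.0, 0.1)   # -3.0, -2.9, ..., 2.9 (3.0 is excluded)
y = x * x                       # element-wise: each element is squared
plt.plot(x, y)                  # draw y against x
plt.show()                      # display the graph
```

The result is a parabola. Because the step "0.1" cannot be represented exactly as a float, the elements of "x" are only approximately -3.0, -2.9, and so on.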

The next example demonstrates that "plt.plot(x, y)" actually draws a sequence of line segments "(x[0], y[0])" - "(x[1], y[1])", "(x[1], y[1])" - "(x[2], y[2])", and so on. If you want the circle in this example to look round rather than stretched, add "plt.axis('equal')" somewhere before "plt.show()".

The graph above shows "amplitude modulation (AM)" of "sin(x)" (blue) onto the carrier "sin(10*x)" (green). The resulting AM signal "sin(x) * sin(10*x)" is shown in red. Note that "np.sin()" takes an array as its argument and outputs an array of "element-wise" sine values, and that it takes its argument in radians.

You can name each plot by adding "label=..." to "plt.plot()". Do not forget to call "plt.legend()" to draw the legend on the graph. By adding "linewidth=...", you can set the thickness of the line in points (the default is "1.0" in Matplotlib 1.x and "1.5" since Matplotlib 2.0). The title at the top is given by "plt.title()".
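The AM graph with labels, a legend, and a title could be sketched as follows. The range of "x", the variable names, the title text, and the explicit "color=..." arguments are assumptions; the original code is not shown.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0.0, 2 * np.pi, 0.01)   # angles in radians
signal = np.sin(x)                    # element-wise sine
carrier = np.sin(10 * x)
am = signal * carrier                 # element-wise product: the AM signal

plt.plot(x, signal, label="sin(x)", color="blue")
plt.plot(x, carrier, label="sin(10*x)", color="green")
plt.plot(x, am, label="sin(x) * sin(10*x)", color="red", linewidth=2)
plt.title("Amplitude modulation")     # title at the top
plt.legend()                          # draw the legend using the labels
plt.show()
```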