Python for data science
This is a hands-on tutorial on Python, a high-level, general-purpose programming language.
- Python is an interpreted language. There is no need to compile, and Python gives you an interactive programming environment.
- Python is dynamically typed. No need to specify types of variables and arguments. Python interpreters are equipped with an automatic garbage collector.
- Python uses white-space indentation for indicating code blocks. No need for curly braces or "begin"/"end".
Here is a simple example of a Python program to find the maximum value in a list of numbers. Click ▶ to run it. Try it out!
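A minimal sketch of such a program is given here. The sample numbers are chosen only for illustration; the names "n" and "maximum" follow the ones used later in this tutorial.

```python
# Find the maximum value in a list of numbers.
numbers = [3, 8, 1, 9, 4]   # sample data, chosen only for illustration

maximum = numbers[0]        # take the first element as the initial candidate
for n in numbers:
    if n > maximum:         # found a value larger than the current candidate
        maximum = n

print("The maximum is", maximum)   # The maximum is 9
```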
I recommend using Python for learning "data science". This is because (1) Python is widely used in real-world applications, (2) Python has strong connections with existing frameworks and libraries for "data analysis" and "machine learning", and (3) Python also has advantages in graphical data visualization, which will be one of the key technologies needed in future collaboration between AI and humans.
I assume that you have some experience in computer programming, say, in C/C++, JavaScript, or other programming languages, but I will keep this tutorial as plain as possible.
Note that we use Python 3 here, which is becoming increasingly popular among data scientists.
Install Python (optional)
If you want to continue using web-based Python (like the one above), you can skip this section. Come back when you need to install Python yourself. This tutorial page includes embedded Python consoles (by trinket.io), where you can run the sample code, modify it, and test it. So, for the time being, you do not have to install Python on your PC.
If you want to install Python for real-world use, here are instructions for installing Anaconda, a Python distribution suited to data analysis. Download and install Anaconda from https://www.continuum.io/downloads. Choose your platform (Windows, macOS, Linux) and download the installer for the latest version of Python 3 (3.6 or later). For macOS, the "Graphical Installer" is the easiest to handle. For Windows, "64-bit" is appropriate in most cases. For Linux, choose "64-bit" for x86.
After installation, open "Terminal" or an equivalent shell-terminal application, and type:
- python to use Python in the terminal.
- jupyter notebook to use Python in an interactive "notebook" environment.
In "interactive mode", the Python interpreter runs each line as you type it in. In "editor mode", by contrast, you first edit a Python program and then run it, as we saw in the previous section. There are a number of IDEs available.
Hello, world!
Try the following example. print() writes the values of its arguments to the console. Python interprets and executes the commands line by line.
- A string outputs itself.
- You can choose "abc" or 'abc' for describing string constants. You may also write 'You said "Yes, we can".' and "I said 'How?'.".
- You can put a number of arguments separated by commas. They are printed in a line with a space between them.
- When you connect strings with "+", no space is added.
- For math expressions, the values are evaluated and then printed.
- Division "/" always results in a float ("2 / 2" gives "1.0" here).
- Of course, strings and expressions can be arranged together.
- However, "+" cannot join a string and a number directly. You need to convert the number to a string using str() first.
- Note again that division "/" results in a float, which is accurate to about 16 significant digits.
- If you need "integer division" (like in C/C++), use "//" instead.
- For the remainder of division, use "%" (like in C/C++). For non-negative operands, "//" and "%" behave as in C/C++; with negative operands, "//" rounds down and "%" takes the sign of the divisor.
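The points in the list above can be tried with a short sketch like the following. The strings and numbers are arbitrary choices made for illustration.

```python
print("Hello, world!")                 # a string outputs itself
print('You said "Yes, we can".')       # single quotes around double quotes
print("abc", "def", 123)               # abc def 123
print("abc" + "def")                   # abcdef (no space is added)
print(1 + 2 * 3)                       # 7
print(2 / 2)                           # 1.0 (always a float)
print("1 + 2 =", 1 + 2)                # 1 + 2 = 3
print("1 + 2 = " + str(1 + 2))         # 1 + 2 = 3 (number converted by str())
print(10 / 3)                          # 3.3333333333333335
print(10 // 3)                         # 3 (integer division)
print(10 % 3)                          # 1 (remainder)
```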
Variables
Now try the following example. You can use any identifier other than the reserved keywords (like "if", "else", and "for") for a variable name. Python is case sensitive; "message", "Message", and "MESSAGE" are all different identifiers.
In Python, no declaration is needed for variables. When you want to use a new variable, just name it and assign some value to it. That's it!
Python is a dynamically typed language. You do not need to specify the type of a variable. The type (int, float, str, bool, etc.) is determined by the value assigned to the variable. You can overwrite a variable with a new value of a different type.
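As a small sketch of these points (the variable names and values are made up for illustration):

```python
message = "Hello"
Message = "Bonjour"      # a different variable: names are case sensitive
print(message, Message)  # Hello Bonjour

x = 10                   # no declaration needed; x now holds an int
print(x, type(x))        # 10 <class 'int'>
x = 2.5                  # the same name now holds a float
print(x, type(x))        # 2.5 <class 'float'>
x = "ten"                # ... and now a str
print(x, type(x))        # ten <class 'str'>
```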
Iterations (loops)
"for" loop
Next, try out the following example. In the first half, the "for" loop computes the sum of 1, 2, ..., 10. Python often uses a list, like [1, 2, ..., 10], for such iterations.
A sequence of numbers can also be generated by using "range([start,] stop[, step])". For example, list(range(3)) gives [0, 1, 2], and list(range(3, 9, 2)) gives [3, 5, 7]. In Python 3, range() returns a lazy range object rather than a list, which is why list() is used here to show its values. Note that the value of "stop" is not included.
And never forget to put ":" (colon) at the end of the "for" line. The colon indicates that a block starts from the next line. Take a look at the next example.
A block consists of consecutive lines sharing the same level of indentation (by leading spaces). This is sometimes called the "off-side rule". It is strongly recommended to use spaces (not tabs) for this indentation. A 4-space indentation is commonly used in Python communities.
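A minimal sketch of both approaches follows. list() is used only to display the values that range() produces; the variable name "total" is an arbitrary choice.

```python
# Sum of 1, 2, ..., 10 using an explicit list
total = 0
for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
    total += i
print(total)                   # 55

# The same sum using range(); the stop value 11 is not included
total = 0
for i in range(1, 11):
    total += i
print(total)                   # 55

print(list(range(3)))          # [0, 1, 2]
print(list(range(3, 9, 2)))    # [3, 5, 7]
```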
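To illustrate how indentation delimits a block, here is a small sketch (the computation itself is arbitrary). The two indented lines form the loop body; the last line is outside the loop and runs only once.

```python
for i in range(3):
    square = i * i        # inside the block: runs for i = 0, 1, 2
    print(i, square)      # inside the block
print("done")             # outside the block: runs once, after the loop
```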
"while" loop
Another way to make a loop is "while". Try this example. This program generates the "Fibonacci sequence" and computes the "golden ratio" (1.6180339887...).
Note that you may need to count the control variable ("k" here) up or down yourself; otherwise, Python could be trapped in an infinite loop. "k += 1" is an abbreviated version of "k = k + 1". Python does not allow "k++" or "k--" for an increment or decrement. Use "k += 1" or "k -= 1" instead.
The next example also generates the same sequence, but it is designed to stop right after the value exceeds 10000. The "break" statement terminates the innermost loop and exits it. Note that "while True: ..." works as an infinite loop; so, "break" has to be reached at some point inside the loop.
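One way such a program could be written is sketched below. The names "previous", "current", and "next_value" and the limit of 20 iterations are assumptions made for illustration; only the control variable "k" comes from the text above.

```python
previous = 1
current = 1
k = 0                       # control variable, counted up by hand
while k < 20:
    # the ratio of consecutive terms approaches the golden ratio
    print(current, current / previous)
    next_value = previous + current
    previous = current
    current = next_value
    k += 1                  # without this line the loop would never end
```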
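A sketch of this variant, using the same illustrative variable names as assumptions:

```python
previous = 1
current = 1
while True:                 # infinite loop; relies on "break" to stop
    print(current)
    if current > 10000:     # stop right after the value exceeds 10000
        break
    next_value = previous + current
    previous = current
    current = next_value
```

The last value printed is 10946, the first term of the sequence greater than 10000.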
Conditionals (if-elif-else)
Let us look again at the first example on this page.
We have already used a conditional, namely an if-statement, when we checked whether the number (n) in the list exceeds our candidate for the maximum value (maximum).
if-statements may take several patterns of control flow. Here I summarize the typical patterns. Of course, you can embed another if-statement inside any block.
if <condition>:
    <then-block>

if <condition>:
    <then-block>
else:
    <else-block>

if <condition1>:
    <then-block>
elif <condition2>:
    <elif-block>

if <condition1>:
    <block1>
elif <condition2>:
    <block2>
elif <condition3>:
    <block3>

if <condition1>:
    <block1>
elif <condition2>:
    <block2>
elif <condition3>:
    <block3>
else:
    <block4>
The example below shows typical usage of if-statements. All three parts determine whether the given year (say, 2017) is a leap year (with 366 days) or not (with 365 days). As Wikipedia says, "Every year that is exactly divisible by four is a leap year, except for years that are exactly divisible by 100, but these centurial years are leap years if they are exactly divisible by 400."
The first, second, and third parts all do exactly the same thing: computing the number of days in the given Gregorian year.
Note that "elif" corresponds to "else if" or "elsif" in other programming languages. You could replace "elif ...:" with "else:" + "if ...:", but this makes the indentation deeper and the program harder to read. Please use "elif ...:" when applicable.
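The three parts might be arranged as in the following sketch. This is one possible arrangement, not necessarily the original one: the names "days1", "days2", and "days3" are hypothetical, and the third part assumes the logical operators "and" and "or", which this tutorial has not introduced.

```python
year = 2017

# Part 1: nested if-else, following the wording of the rule
if year % 4 == 0:
    if year % 100 == 0:
        if year % 400 == 0:
            days1 = 366
        else:
            days1 = 365
    else:
        days1 = 366
else:
    days1 = 365

# Part 2: a flat if-elif-else chain, most specific test first
if year % 400 == 0:
    days2 = 366
elif year % 100 == 0:
    days2 = 365
elif year % 4 == 0:
    days2 = 366
else:
    days2 = 365

# Part 3: a single condition combined with "and" / "or"
if year % 4 == 0 and (year % 100 != 0 or year % 400 == 0):
    days3 = 366
else:
    days3 = 365

print(year, days1, days2, days3)   # 2017 365 365 365
```

Changing "year" to 2000, 1900, or 2016 is a quick way to check that the three parts agree.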
Importing "modules"
Modules are dynamically loadable packages of functions and constants. There are standard modules such as "math" (exp, log, sin, etc.) and "random" (randint, randrange, etc.); you can also make your own modules.
Math
To incorporate a module (for example, "math") into your program, just say "import math". After that line, you can use any function or constant in the "math" module, like math.sqrt() and math.pi (3.14159...). Take a look at this example.
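A short sketch of importing and using the "math" module (the particular calls are chosen for illustration):

```python
import math

print(math.pi)                  # 3.14159...
print(math.sqrt(2))             # square root of 2
print(math.sin(math.pi / 2))    # sine of 90 degrees, given in radians
print(math.exp(1))              # e raised to the power 1
print(math.log(math.e))         # natural logarithm of e
```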
Random
Sometimes the module name is a bit too long to prefix to its function and constant names. In such a case, you can "rename" it as in the following example, where "ran.randrange(0, 5)" returns 0, 1, 2, 3, or 4, randomly.
When importing the "random" module, we rename it as "ran". So, we can access the members of the module by, for example, ran.randrange(0, 10). The function "ran.randrange(start, end)" generates a random integer from "start" to "end - 1" (that is, "end" itself is excluded). "len(list)" returns the number of elements in "list".
Note that "subjects", "verbs", and "objects" are lists of strings. You can get the first element of "subjects" by "subjects[0]". The index always starts from 0, just like ordinary arrays in other programming languages. So, you can get the last element of "subjects" by "subjects[len(subjects) - 1]".
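A sketch of such a program is given here. The list names "subjects", "verbs", and "objects" come from the text above; their contents and the names "s", "v", and "o" are made up for illustration.

```python
import random as ran   # "ran" is a shorter alias for the module

subjects = ["I", "You", "We"]
verbs = ["like", "see", "need"]
objects = ["Python", "data", "coffee"]

# randrange(0, n) returns a random integer from 0 to n - 1
s = subjects[ran.randrange(0, len(subjects))]
v = verbs[ran.randrange(0, len(verbs))]
o = objects[ran.randrange(0, len(objects))]
print(s, v, o)                        # a random three-word sentence

print(subjects[0])                    # first element: I
print(subjects[len(subjects) - 1])    # last element: We
```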
Other modules...
There are tons of modules available. Some modules you will be using in the study of "data science" may include:
- numpy : Numerical computation tools including n-dimensional arrays.
- scipy : Works with numpy for computing integrals, differential equations, etc.
- pandas : Data manipulation tools for statistical data analysis.
- matplotlib : Plotting tools for data visualization.
- theano : Mathematical computing tools using GPUs (by Université de Montréal).
- tensorflow : Machine learning tools (by Google) for training multi-layer neural networks.
Lists and dictionaries
Lists
In the previous example, we glanced at lists of strings. A list is a sequence of objects enclosed in square brackets. The following example gives further hints on their usage.
Note that "remove()", "append()", and "pop()" modify the list in place. "pop(n)" removes (and returns) the element at index n, counting from 0.
Lists of numbers have some additional functions. Take a look at the next example. Note that "a.sort()" changes the order of elements in the list "a" in place.
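These operations can be tried with a sketch like this one (the list contents and the names "last" and "first" are arbitrary):

```python
fruits = ["apple", "banana", "cherry"]
print(len(fruits))          # 3

fruits.append("durian")     # add an element at the end
print(fruits)               # ['apple', 'banana', 'cherry', 'durian']

fruits.remove("banana")     # remove the first element equal to "banana"
print(fruits)               # ['apple', 'cherry', 'durian']

last = fruits.pop()         # with no argument, pop() takes the last element
first = fruits.pop(0)       # pop(0) takes the element at index 0
print(last, first)          # durian apple
print(fruits)               # ['cherry']
```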
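For instance, with a list of sample numbers (chosen for illustration), the built-in functions sum(), min(), max(), and sorted() can be used alongside "a.sort()":

```python
a = [3, 1, 4, 1, 5, 9, 2, 6]

print(sum(a), min(a), max(a))   # 31 1 9
print(sorted(a))                # returns a new sorted list; "a" is unchanged
print(a)                        # [3, 1, 4, 1, 5, 9, 2, 6]

a.sort()                        # sorts "a" itself, in place
print(a)                        # [1, 1, 2, 3, 4, 5, 6, 9]
```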
Dictionaries
A dictionary holds a number of key : value pairs, much like an object in JavaScript or JSON. Keys are usually strings; values may be of different types, such as numbers, strings, or nested dictionaries. Elements are looked up by key rather than by position (since Python 3.7, dictionaries do preserve insertion order).
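A minimal sketch of creating and using a dictionary follows. The keys and values are invented for illustration.

```python
person = {
    "name": "Alice",
    "age": 30,
    "address": {"city": "Kyoto", "country": "Japan"},   # nested dictionary
}

print(person["name"])                  # Alice
print(person["address"]["city"])       # Kyoto

person["age"] = 31                     # overwrite an existing value
person["email"] = "alice@example.com"  # add a new key : value pair

for key in person:                     # iterate over the keys
    print(key, ":", person[key])

print(len(person))                     # 4
```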