NASSLLI2018-Corpus-Linguistics【Day 1】

Day 1: Jupyter Notebook interface, Python basics, text processing

The very basics

First code
Printing a string, using print().

Strings
String type objects are enclosed in quotation marks (" or ').
+ is a concatenation operator.
String methods such as .upper(), .lower() transform a string.
Rather than changing the original variable, the commands return a new string value.
Some string methods return a boolean value (True/False).
len() returns the length of a string in number of characters.
in tests whether one string is a substring of another.
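
The string points above can be tried in one small cell. This is only a sketch; the names `word`, `shouted` and `joined` are made up for illustration and do not come from the course notebook.

```python
# Illustrative names only; not taken from the course notebook.
word = "Corpus"
shouted = word.upper()                 # returns a NEW string
joined = word + " " + "Linguistics"    # + concatenates strings

print(word)                    # the original variable is unchanged
print(shouted)
print(joined)
print(word.startswith("Cor"))  # a string method that returns a boolean
print(len(word))               # length in characters
print("pus" in word)           # substring test
```

Note that `word` still holds its original value after `.upper()` is called; to keep the transformed string you have to assign the result to a variable.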

Numbers
Integers and floats are written without quotes.
You can use algebraic operations such as +, -, * and / with numbers.
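
A short sketch of arithmetic with integers and floats, assuming Python 3 (where / always gives a float). The variable names are hypothetical.

```python
# Illustrative names only; assumes Python 3 division semantics.
whole = 7         # an integer
fraction = 2.5    # a float

total = whole + fraction    # mixing int and float gives a float
quotient = whole / 2        # / returns a float, even for two integers
floored = whole // 2        # // is floor division

print(type(whole), type(fraction))
print(total, quotient, floored)
```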

Lists
Lists are enclosed in [ ], with elements separated with commas. Lists can contain strings, numbers, and more.
As with strings, you can use len() to get the size of a list.
As with strings, you can use in to see whether an element is in a list.
A list can be indexed through li[i]. Python indexing starts at 0.
A list can be sliced: li[3:5] returns a sub-list beginning with index 3 up to and not including index 5.
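
Indexing and slicing are easier to see on a list whose order is obvious. The list `days` below is a made-up example, not part of the course material.

```python
# Hypothetical example list, chosen so the positions are easy to follow.
days = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']

first = days[0]        # indexing starts at 0
last = days[-1]        # negative indexes count from the end
middle = days[3:5]     # index 3 up to, but not including, index 5
from_three = days[3:]  # index 3 through the end
up_to_five = days[:5]  # the first five elements

print(first, last)
print(middle)
print(from_three)
print(up_to_five)
```

A slice `days[a:b]` always has `b - a` elements when both indexes are within the list, which is one reason the end index is excluded.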

for loop
Using a for loop, you can loop through a list of items, applying the same set of operations to each element.
The embedded code block is marked with indentation.
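
A minimal sketch of how indentation marks the loop body. The list `words` and the accumulator `lengths` are illustrative names.

```python
# Illustrative names only.
words = ['call', 'me', 'Ishmael']
lengths = []

for w in words:
    # Both indented lines run once per element.
    lengths.append(len(w))
    print(w, len(w))

# Not indented, so this runs once, after the loop has finished.
print("Total characters:", sum(lengths))
```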

List comprehension
List comprehension builds a new list from an existing list.
You can filter to include only certain elements, and you can apply transformations in the process.
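
One way to read a list comprehension is as a compact for loop. The sketch below builds the same list both ways; the word list is made up for illustration.

```python
# Illustrative data; not from the course notebook.
words = ['call', 'me', 'Ishmael', 'some', 'years', 'ago']

# Filter (length of at least 4) and transform (uppercase) in one expression.
long_upper = [w.upper() for w in words if len(w) >= 4]

# The equivalent for loop, written out step by step.
long_upper_loop = []
for w in words:
    if len(w) >= 4:
        long_upper_loop.append(w.upper())

print(long_upper)
print(long_upper == long_upper_loop)
```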

Dictionaries
Dictionaries hold key:value mappings.
len() on a dictionary returns the number of keys.
Looping over a dictionary means looping over its keys.
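
A small sketch of the dictionary points above, using a made-up word-count dictionary. It assumes Python 3.7 or later, where dictionaries keep insertion order.

```python
# Hypothetical word counts, for illustration only.
counts = {'the': 3, 'whale': 1, 'sea': 2}

print(len(counts))          # number of keys

keys_seen = []
for word in counts:         # the loop variable is bound to each key
    keys_seen.append(word)
    print(word, counts[word])   # look up the value through the key
```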

Processing a piece of text
Visit this page and copy-paste the first passage of Moby Dick.
""" triple quotes have the special power of straddling line breaks.
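
A two-line sketch of what a triple-quoted string actually stores. The short text is a shortened stand-in for the Moby Dick passage, and `verse` and `lines` are illustrative names.

```python
# A shortened stand-in for the longer passage.
verse = """Call me Ishmael.
Some years ago."""

print(repr(verse))   # the line break is stored as the character '\n'
print(verse)         # print() renders '\n' as an actual line break

lines = verse.split('\n')   # split the text back into its lines
print(lines)
```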

exercise:

# In[6]:


print("hello, world!")


# In[7]:


greet = "Hello, world!"
greet = greet + " I come in peace." + " I'm called merklar."
greet


# In[8]:


greet2 = greet.upper().lower() 
greet2


# In[9]:


# try .isupper(), .isalnum(), .startswith('he')
'hello123'.isalnum()


# In[10]:


len(greet)


# In[11]:


'he' not in 'hello' or ''.endswith('')


# In[12]:


num1 = 5678
num2 = 3.141592
result = num1 / num2
print(num1, "divided by", num2, "is", result)  # can print multiple things!


# In[13]:


li = ['red', 'blue', 'green', 'black', 'white', 'pink']
len(li)


# In[14]:


# Try logical operators not, and, or
'mauve' in li and 'teal' in li
li.append('mauve')
print(li)


# In[15]:


# Try [0], [2], [-1], [3:5], [3:], [:5] 
li[2]


# In[16]:


li[-1]


# In[17]:


li[3:]


# In[18]:


li[:5]


# In[21]:


for x in li:
    print('"'+x.capitalize()+'" is', len(x), "characters long.")
    print('--')
print("Done!")


# In[22]:


# filter
[x for x in li if len(x)==4]


# In[23]:


# transform
[x.upper() for x in li]


# In[24]:


# filter and transform
[x.upper() for x in li if len(x)>=5]


# In[25]:


di = {'Homer':35, 'Marge':35, 'Bart':10, 'Lisa':8}
di['Lisa']


# In[26]:


# 20 years old or younger. x is bound to the keys.
[x for x in di if di[x] <= 20]


# In[27]:


len(di)


# In[28]:


moby = """Call me Ishmael. Some years ago--never mind how long precisely--having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation. Whenever I find myself growing grim about the mouth;
whenever it is a damp, drizzly November in my soul; whenever I find
myself involuntarily pausing before coffin warehouses, and bringing up
the rear of every funeral I meet; and especially whenever my hypos get
such an upper hand of me, that it requires a strong moral principle to
prevent me from deliberately stepping into the street, and methodically
knocking people's hats off--then, I account it high time to get to sea
as soon as I can. This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I quietly
take to the ship. There is nothing surprising in this. If they but knew
it, almost all men in their degree, some time or other, cherish very
nearly the same feelings towards the ocean with me."""


# In[29]:


# What is '\n'? 
moby


# In[30]:


print(moby)


# In[31]:


len(moby)
# But how many _words_?


# In[32]:


# .split() is a "poor-man's tokenizer". What problem do you see? 
# IPython/Jupyter only: the %pprint magic toggles pretty-printing of output
get_ipython().run_line_magic('pprint', '')
moby.split()


# In[33]:


import re


# In[34]:


sent = "You haven't seen Star Wars...?"
re.findall(r'\w+', sent)


# In[35]:


re.findall(r'\w+', moby)


# In[37]:


moby_toks = re.findall(r'\w+', moby)


# In[38]:


len(moby_toks)


# In[42]:


moby_types = set(moby_toks)


# In[43]:


len(moby_types)


# In[44]:


[w for w in moby_types if len(w)>=10]


# In[45]:


# lowercased version
moby_ltoks = [t.lower() for t in moby_toks]
moby_ltoks


# In[46]:


moby_ltypes = set(moby_ltoks)


# In[47]:


# sorted() takes a list/set/... and returns a sorted list
sorted(moby_ltypes)


# In[48]:


len(moby_ltypes)
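
The comment in the .split() cell asks what problem a whitespace tokenizer has. One way to see it is to run both tokenizers on the same short sentence, here the one from the regex cell above. This is only a sketch, and the names `by_space` and `by_regex` are made up.

```python
import re

# Sentence taken from the regex cell above.
sent = "You haven't seen Star Wars...?"

by_space = sent.split()               # splits on whitespace only
by_regex = re.findall(r'\w+', sent)   # runs of letters, digits and underscores

print(by_space)
print(by_regex)
```

.split() leaves punctuation glued to the neighboring word, while \w+ drops punctuation but also cuts the contraction at the apostrophe. Neither is a full tokenizer.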
