Python Notes: 2023.09

可爱收藏家

Category: bug log (bug记录). Tags: python, notes, machine learning

Article link: https://blog.csdn.net/weixin_43643881/article/details/134420162

1. iloc and loc

iloc: selects values from a DataFrame by integer position.
loc: selects values by row and column labels.

key=self.rna_feature_data.iloc[index][0][1:]
value=self.rna_feature_data.iloc[index+1][0].split()

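self.rna_feature_data.iloc[index][0][1:]: this line takes the row at position index and reads its first column ([0]). It then slices that value from the second character onward ([1:]) and assigns the result to the variable key.
self.rna_feature_data.iloc[index+1][0].split(): this line takes the next row (index+1) and reads its first column ([0]). It then splits that value on whitespace, which returns a list of substrings, and assigns the list to the variable value.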
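
As a small illustration of the difference between the two accessors, the sketch below builds two toy DataFrames. The data, the column name `seq`, and the row labels are made up for the example and do not come from the original dataset. It uses the `iloc[row, column]` form, which does the row and column lookup in one purely positional step.

```python
import pandas as pd

# Hypothetical table with a single text column: header lines and
# sequence lines alternate, as in the snippet above.
df = pd.DataFrame({0: [">rna1", "A C G U", ">rna2", "G G C A"]})

index = 0
key = df.iloc[index, 0][1:]             # drop the leading ">" from the header
value = df.iloc[index + 1, 0].split()   # split the next row on whitespace

# A labelled table (labels are assumptions) to contrast loc and iloc.
labeled = pd.DataFrame({"seq": ["ACGU", "GGCA"]}, index=["rna1", "rna2"])
by_label = labeled.loc["rna2", "seq"]   # row label, column label
by_position = labeled.iloc[1, 0]        # row position, column position

print(key, value, by_label, by_position)
```

Both lookups on `labeled` reach the same cell; `loc` keeps working if the rows are reordered, while `iloc` follows the position.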

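2. Using a trained model to generate sequence features

A trained Doc2Vec model can be used to generate feature representations of sequences. Each sequence is represented as a fixed-length vector, which can serve as the input feature for downstream tasks.

To generate sequence features with the trained model, follow these steps: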

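Load the saved model file with the gensim.models.doc2vec.Doc2Vec.load() method. For example:
import gensim
model = gensim.models.doc2vec.Doc2Vec.load(model_name + ".model")  # model_name: the path prefix used when the model was saved
For each sequence, call the trained model's infer_vector() method to generate its feature vector. For example:
sequence = ["word1", "word2", "word3", ...]  # the input sequence, as a list of tokens
vector = model.infer_vector(sequence)
infer_vector() converts the input sequence into a vector that represents the sequence's position in the trained model's vector space. This vector can be used as the feature representation of the sequence.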
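
The two steps above can be wrapped in a small helper when many sequences need features. This is only a sketch: `sequences_to_features` and the shape of `named_sequences` are hypothetical, and the helper assumes nothing about the model beyond an `infer_vector()` method that accepts a list of string tokens, as a gensim Doc2Vec model does.

```python
def sequences_to_features(model, named_sequences):
    """Map each sequence name to the vector the model infers for its tokens.

    model           -- any object exposing infer_vector(list_of_tokens)
    named_sequences -- dict of name -> list of string tokens (assumed layout)
    """
    features = {}
    for name, tokens in named_sequences.items():
        # infer_vector expects a list of tokens, not one raw string
        features[name] = model.infer_vector(tokens)
    return features
```

With a loaded model this would be called as `sequences_to_features(model, {"seq1": ["word1", "word2"]})`, and the resulting dictionary can be turned into a feature matrix for a downstream classifier.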

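3. Common Linux commands

A Linux command is a program or utility that runs on the command line. The command line is an interface that accepts lines of text and processes them as instructions for the computer.
Many graphical user interface (GUI) actions are abstractions over command-line programs. For example, when you close a window by clicking the "X", a command runs behind that action.
Flags are a way to pass options to the command you run. Most Linux commands have a help page, which can usually be shown with the --help flag (many commands also accept -h). In most cases flags are optional.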

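ls  list the contents of a directory
pwd  print the absolute path of the current directory
cd  change directory
cd /  go to the root directory
rm  remove files (rm -r removes directories)
mkdir  create a directory
wget  a utility for retrieving content from the internet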

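4. Processing FASTA files

4.1 Splitting a FASTA file into odd and even lines

Opening a file
Reading the lines of a file
Writing to a file
Basic slice expression: object[start_index : end_index : step]

A Python program that takes the odd-numbered lines of Sequence.fasta and writes them to a new FASTA file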
def extract_odd_lines(input_file, output_file):
    with open(input_file, 'r') as input_f:
        lines = input_f.readlines()

    odd_lines = lines[::2]  # take the odd-numbered lines (1st, 3rd, 5th, ...)
    # basic slice expression: object[start_index : end_index : step]


    with open(output_file, 'w') as output_f:
        output_f.writelines(odd_lines)

# input and output file paths
input_file = 'Sequence.fasta'
output_file = 'OddLines.fasta'

# call the function to extract the odd-numbered lines
extract_odd_lines(input_file, output_file)

def extract_even_lines(input_file, output_file):
    with open(input_file, 'r') as input_f:
        lines = input_f.readlines()

    even_lines = lines[1::2]  # take the even-numbered lines (2nd, 4th, 6th, ...)

    with open(output_file, 'w') as output_f:
        output_f.writelines(even_lines)

# input and output file paths
input_file = 'Sequence.fasta'
output_file = 'EvenLines.fasta'

# call the function to extract the even-numbered lines
extract_even_lines(input_file, output_file)
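
The two slices can be checked on a small in-memory list before touching a file. The sketch below assumes every record occupies exactly two lines (one header, one sequence), which is what makes the odd/even split meaningful; `split_odd_even` and the sample lines are made up for illustration.

```python
def split_odd_even(lines):
    """Return (odd-numbered lines, even-numbered lines), counting from 1."""
    return lines[::2], lines[1::2]

# Toy two-line-per-record FASTA content (hypothetical data).
lines = [">seq1\n", "ACGU\n", ">seq2\n", "GGCA\n"]
headers, sequences = split_odd_even(lines)

print(headers)    # the header lines
print(sequences)  # the sequence lines
```

If a sequence wraps over several lines, the split no longer separates headers from sequences, and a parser that checks for `>` is the safer choice.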

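4.2 Splitting a FASTA file into odd and even columns

Using strip()
Skipping header lines
Splitting each line by column: read a line, remove the trailing newline with strip(), take the odd-numbered characters (columns) of the resulting string with a slice, append the result to a new list, then write the new list to a new file line by line.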

def extract_odd_columns(input_file, output_file):
    with open(input_file, 'r') as input_f:
        lines = input_f.readlines()

    sequences = []
    for line in lines:
        if not line.startswith('>'):  # skip header lines
            sequence = line.strip()
            odd_columns = sequence[::2]  # take the odd-numbered columns (1st, 3rd, 5th, ... characters)
            sequences.append(odd_columns)

    with open(output_file, 'w') as output_f:
        for sequence in sequences:
            output_f.write(sequence + '\n')

# input and output file paths
input_file = 'sequence.fasta'
output_file = 'OddColumns.fasta'

# call the function to extract the odd-numbered columns
extract_odd_columns(input_file, output_file)
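
The same slice syntax applies to strings, so the column split can be tried on a single sequence first. The sample sequence below is made up for illustration.

```python
sequence = "ACGUACGU"

odd_columns = sequence[::2]    # characters 1, 3, 5, 7
even_columns = sequence[1::2]  # characters 2, 4, 6, 8

# For an even-length string, interleaving the halves restores the original.
restored = "".join(a + b for a, b in zip(odd_columns, even_columns))

print(odd_columns, even_columns, restored)
```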

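4.3 Extracting specific columns from a FASTA file

# Columns in sequence.fasta are separated by commas.
# A Python program that takes columns 0 and 2 of sequence.fasta and writes them to a new FASTA file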
def extract_columns(input_file, output_file):
    with open(input_file, 'r') as input_f:
        lines = input_f.readlines()

    sequences = []
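    for line in lines:
        if not line.startswith('>'):  # skip header lines
            columns = line.strip().split(',')  # split into columns on commas
            if len(columns) < 3:  # skip blank or short lines that have no column 2
                continue
            col0 = columns[0]  # column 0
            col2 = columns[2]  # column 2
            sequence = col0 + col2  # concatenate column 0 and column 2
            sequences.append(sequence)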

    with open(output_file, 'w') as output_f:
        for sequence in sequences:
            output_f.write(sequence + '\n')

# input and output file paths
input_file = 'sequence.fasta'
output_file = 'Extracted.fasta'

# call the function to extract the columns
extract_columns(input_file, output_file)
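
The column lookup can be made reusable for any set of indices. This is a sketch under assumptions: `pick_columns` is a hypothetical helper, the sample lines are invented, and returning `None` for a line that is too short is one possible policy (raising an error would be another).

```python
def pick_columns(line, indices, sep=","):
    """Return the requested fields of one delimited line.

    Returns None when the line has too few columns for the largest index.
    """
    columns = line.strip().split(sep)
    if max(indices) >= len(columns):
        return None
    return [columns[i] for i in indices]

picked = pick_columns("AUGC,12,GGCA,0.5\n", [0, 2])
too_short = pick_columns("AUGC,12\n", [0, 2])

print(picked, too_short)
```

Returning the fields as a list keeps the column boundaries, so the caller can decide whether to concatenate them or join them with a separator.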

4.4 Removing duplicates from a FASTA file

def read_fasta(file_path):
    """Read the records of a FASTA file into a list of dicts."""
    sequences = []
    current_seq = None

    with open(file_path, 'r') as file:
        for line in file:
            line = line.strip()
            if line.startswith('>'):  # start of a new record
                if current_seq is not None:
                    sequences.append(current_seq)
                current_seq = {'header': line, 'sequence': ''}
            elif current_seq is not None:  # sequence line (text before the first header is skipped)
                current_seq['sequence'] += line

        if current_seq is not None:
            sequences.append(current_seq)

    return sequences


def write_fasta(file_path, sequences):
    """Write records to a FASTA file, wrapping sequences at 80 characters."""
    with open(file_path, 'w') as file:
        for seq in sequences:
            file.write(seq['header'] + '\n')
            sequence = seq['sequence']
            for i in range(0, len(sequence), 80):
                file.write(sequence[i:i+80] + '\n')


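def remove_duplicates(file_path):
    """Remove duplicate records from a FASTA file, overwriting it in place."""
    sequences = read_fasta(file_path)
    unique_sequences = []
    seen = set()  # (header, sequence) pairs that have already been kept

    for seq in sequences:
        key = (seq['header'], seq['sequence'])
        if key not in seen:  # a duplicate must match on both header and sequence
            seen.add(key)
            unique_sequences.append(seq)

    write_fasta(file_path, unique_sequences)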


# call the function to remove duplicates
fasta_file = 'sequence.fasta'
remove_duplicates(fasta_file)
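
Whether two records count as duplicates depends on the key that is compared. The sketch below works on in-memory records and contrasts matching on header plus sequence with matching on the sequence alone; `dedupe_records`, its `by_sequence_only` parameter, and the sample records are assumptions made for the example.

```python
def dedupe_records(records, by_sequence_only=False):
    """Keep the first occurrence of each record, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        if by_sequence_only:
            key = record["sequence"]
        else:
            key = (record["header"], record["sequence"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

records = [
    {"header": ">seq1", "sequence": "ACGU"},
    {"header": ">seq1", "sequence": "ACGU"},  # exact duplicate
    {"header": ">seq2", "sequence": "ACGU"},  # same sequence, different name
]

exact = dedupe_records(records)
by_seq = dedupe_records(records, by_sequence_only=True)
print(len(exact), len(by_seq))
```

Using a set for the keys already seen keeps each membership check fast, which matters once a file holds many thousands of records.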

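4.5 Mapping a FASTA file to a dictionary

# A Python program that turns the contents of sequence.fasta into a dictionary whose keys are sequence names and whose values are the sequences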
def read_fasta(file_path):
    sequences = {}

    with open(file_path, 'r') as file:
        header = ''
        sequence = ''

        for line in file:
            line = line.strip()

            if line.startswith('>'):  # start of a new record
                if header != '':
                    sequences[header] = sequence
                header = line[1:]  # drop the ">" to get the sequence name
                sequence = ''
            else:  # sequence line
                sequence += line

        if header != '':
            sequences[header] = sequence

    return sequences


# read the FASTA file and build the dictionary
fasta_file = 'sequence.fasta'
sequences_dict = read_fasta(fasta_file)

# print the contents of the dictionary
for header, sequence in sequences_dict.items():
    print(f'{header}: {sequence}')
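
To see how a sequence that spans several lines ends up under one key, the same idea can be run on a string instead of a file. This is a sketch: `parse_fasta_text` is a hypothetical variant written for the example, and the sample text is invented.

```python
def parse_fasta_text(text):
    """Build a name -> sequence dict from FASTA-formatted text."""
    sequences = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]       # name without the leading ">"
            sequences[header] = ""
        elif header is not None:
            sequences[header] += line  # join wrapped sequence lines
    return sequences

example = ">seq1\nACGU\nGGCA\n>seq2\nUUAA\n"
parsed = parse_fasta_text(example)
print(parsed)
```

As with any dictionary keyed by name, a second record with the same header replaces the first one.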

Merging dictionaries

Accepting a variable number of arguments (*dicts)
Updating a dictionary: update()

def merge_dicts(*dicts):
    merged_dict = {}
    for dictionary in dicts:
        merged_dict.update(dictionary)
    return merged_dict

# example usage
dict1 = {'a': 1, 'b': 2}
dict2 = {'c': 3}
dict3 = {'d': 4, 'e': 5}

merged_dict = merge_dicts(dict1, dict2, dict3)
print(merged_dict)
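
One detail worth knowing about update(): when the same key appears in more than one dictionary, the value from the later dictionary wins. The sketch below shows this with made-up keys, alongside the equivalent unpacking form.

```python
defaults = {"k": 3, "window": 5}
overrides = {"window": 10}

# Same loop as merge_dicts: later dictionaries overwrite earlier keys.
merged_update = {}
for d in (defaults, overrides):
    merged_update.update(d)

# Dictionary unpacking builds the same result in one expression.
merged_unpack = {**defaults, **overrides}

print(merged_update, merged_unpack)
```

Both forms build a new dictionary, so `defaults` and `overrides` are left unchanged.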

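Iterating over several lists in one for loop

In Python, a for loop can walk through several lists at once with the zip() function. zip() takes multiple iterables as arguments and returns an iterator of tuples, where each tuple holds the elements at the same position in those iterables. Iteration stops when the shortest iterable is exhausted.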

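listsequence = ['>seq1', '>seq2']  # example values
listsequence2 = ['ACGU', 'GGCA']   # example values
for line1, line2 in zip(listsequence, listsequence2):
    # process line1 and line2 here
    print(line1, line2)
# Each iteration assigns the elements at the same position in listsequence and listsequence2 to line1 and line2.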
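
Because zip() stops at the shortest input, unequal lists silently lose their extra elements. The sketch below, with made-up names and data, shows that behaviour, the standard-library zip_longest alternative, and how zip() pairs two lists into a dictionary.

```python
from itertools import zip_longest

headers = ["seq1", "seq2", "seq3"]
sequences = ["ACGU", "GGCA"]  # one element shorter on purpose

pairs = list(zip(headers, sequences))                         # stops after two pairs
padded = list(zip_longest(headers, sequences, fillvalue=""))  # pads the shorter list
lookup = dict(zip(headers, sequences))                        # name -> sequence

print(pairs)
print(padded)
print(lookup)
```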