Linux Kernel VFS-Read(3)

2021SC@SDUSC

From VFS to FS

Looking back at the analyses in the past few posts, the final goal was never to understand every line of the code so that we could write a virtual file system ourselves. VFS is a set of functions that links Linux's user layer to the underlying file systems, and a layer of this kind is needed by every operating system, not only Linux. Each OS solves the problem in its own way; some are open source, while for others the details cannot be inspected.

While copying files from Ubuntu (in a virtual machine) to Windows, the copy sometimes fails because of differences between the file systems; symbolic links are one such difference. Though the developers of VMware have done what they could to handle copying between many different file systems through the different methods they provide, it is not perfect yet. On Linux, ext is likely to be used, and in such an indexed file system every file has an inode through which it is accessed. On Windows, NTFS and FAT are commonly used, and inodes do not exist there; instead, FAT locates file data through its file allocation table and NTFS through records in its Master File Table. Each approach has advantages and disadvantages, and it's hard to say which does things better.

This part may look like a summary of all the previous posts, but it is not. The next post will sum up what I have learnt, and I hope to continue this work in the future.

Let's go back a little. In the last post we reached the function that calls the file system's own read_iter, that is:

static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
				     struct iov_iter *iter)
{
	return file->f_op->read_iter(kio, iter);
}
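To make the indirection concrete, here is a minimal user-space sketch of the same dispatch pattern. It is not kernel code: every `demo_` name is hypothetical, and the iov_iter is replaced by a plain buffer and length. What it shows is only that the caller goes through `f_op` and never names the file system.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical, simplified stand-ins for struct file / file_operations. */
struct demo_file;

struct demo_file_operations {
    ssize_t (*read_iter)(struct demo_file *file, char *buf, size_t len);
};

struct demo_file {
    const struct demo_file_operations *f_op;
    const char *data;   /* backing "file contents" */
    size_t size;        /* length of data */
    size_t pos;         /* current read position */
};

/* One "file system" supplies its own read_iter implementation. */
static ssize_t demo_fs_read_iter(struct demo_file *file, char *buf, size_t len)
{
    size_t left = file->size - file->pos;
    size_t n = len < left ? len : left;

    memcpy(buf, file->data + file->pos, n);
    file->pos += n;
    return (ssize_t)n;
}

static const struct demo_file_operations demo_fs_fops = {
    .read_iter = demo_fs_read_iter,
};

/* Same shape as call_read_iter(): dispatch through the ops table. */
static ssize_t demo_call_read_iter(struct demo_file *file, char *buf, size_t len)
{
    assert(file && file->f_op && file->f_op->read_iter);
    return file->f_op->read_iter(file, buf, len);
}
```

Pointing `f_op` at a different table switches the implementation without touching the caller, which is how the same VFS code can serve ext4 and every other file system.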

As I said, we will use ext4's f_op as an example so that we can follow the read path to the end. So we go to ext4's read_iter, which is bound to the function ext4_file_read_iter. The function is in fs/ext4/file.c:

static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
		return -EIO;

	if (!iov_iter_count(to))
		return 0; /* skip atime */

#ifdef CONFIG_FS_DAX
	if (IS_DAX(inode))
		return ext4_dax_read_iter(iocb, to);
#endif
	if (iocb->ki_flags & IOCB_DIRECT)
		return ext4_dio_read_iter(iocb, to);

	return generic_file_read_iter(iocb, to);
}

First, ext4_forced_shutdown looks at ext4's superblock info to check whether the forced-shutdown flag is set (which is unlikely); if it is, the read fails with -EIO. If there is nothing to read, the function returns 0 right away. Then IS_DAX (DAX stands for Direct Access) checks whether the inode is accessed directly, bypassing the page cache; if so, ext4_dax_read_iter (ext4's read_iter for direct access, also in file.c) is called. Next, the IOCB_DIRECT flag is checked to decide whether this is a direct I/O request, which is handled by ext4_dio_read_iter. By the way, in earlier versions this check was not here but in generic_file_read_iter. If neither applies, generic_file_read_iter handles the request.
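The order of these checks matters, so here is a small user-space sketch of the decision only. The `demo_` names and the flag value are made up for illustration, and the real DAX branch is compiled only when CONFIG_FS_DAX is set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_IOCB_DIRECT 0x1u   /* hypothetical value, not the kernel's */

enum demo_read_path {
    DEMO_PATH_NONE,       /* returned early: error or nothing to read */
    DEMO_PATH_DAX,        /* would call ext4_dax_read_iter() */
    DEMO_PATH_DIRECT,     /* would call ext4_dio_read_iter() */
    DEMO_PATH_BUFFERED,   /* would call generic_file_read_iter() */
};

/* Returns 0 or a negative errno, and reports the chosen path in *path. */
static int demo_pick_read_path(bool fs_shut_down, size_t count, bool is_dax,
                               unsigned int ki_flags,
                               enum demo_read_path *path)
{
    assert(path != NULL);
    *path = DEMO_PATH_NONE;

    if (fs_shut_down)
        return -EIO;            /* checked before anything else */
    if (count == 0)
        return 0;               /* nothing to read, atime untouched */

    if (is_dax)
        *path = DEMO_PATH_DAX;  /* DAX wins even if DIRECT is also set */
    else if (ki_flags & DEMO_IOCB_DIRECT)
        *path = DEMO_PATH_DIRECT;
    else
        *path = DEMO_PATH_BUFFERED;
    return 0;
}
```

Note that a DAX inode takes the DAX path even when the caller asked for direct I/O, because the DAX check comes first.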

In the direct I/O path, the function takes a shared lock on the inode (with IOCB_NOWAIT it only tries, and returns -EAGAIN if the lock is not available), then checks whether direct I/O is supported for this inode. If not, it releases the lock, clears the IOCB_DIRECT flag and falls back to the ordinary generic_file_read_iter.

static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	ssize_t ret;
	struct inode *inode = file_inode(iocb->ki_filp);

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!inode_trylock_shared(inode))
			return -EAGAIN;
	} else {
		inode_lock_shared(inode);
	}

	if (!ext4_dio_supported(inode)) {
		inode_unlock_shared(inode);
		/*
		 * Fallback to buffered I/O if the operation being performed on
		 * the inode is not supported by direct I/O. The IOCB_DIRECT
		 * flag needs to be cleared here in order to ensure that the
		 * direct I/O path within generic_file_read_iter() is not
		 * taken.
		 */
		iocb->ki_flags &= ~IOCB_DIRECT;
		return generic_file_read_iter(iocb, to);
	}

	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0);
	inode_unlock_shared(inode);

	file_accessed(iocb->ki_filp);
	return ret;
}

If DIO is supported, iomap_dio_rw is called with ext4_iomap_ops to read the data, and the shared lock is released once it finishes. Different file systems provide different DIO implementations, and some may not support it at all, so a default path is provided for them.

In the end, the function generic_file_read_iter (which is in mm/filemap.c, and is hard to find) does the rest.
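Two details of ext4_dio_read_iter are easy to miss: the IOCB_NOWAIT case never sleeps on the lock, and the fallback clears only the IOCB_DIRECT bit. The sketch below models just those two points in user space. The flag values and all `demo_` names are assumptions for illustration, and the two booleans stand in for a failed inode_trylock_shared() and for ext4_dio_supported().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define DEMO_IOCB_DIRECT 0x1u
#define DEMO_IOCB_NOWAIT 0x2u

enum demo_dio_outcome { DEMO_DIO_DONE, DEMO_DIO_FELL_BACK };

/*
 * Returns 0 and sets *outcome, or -EAGAIN when the caller asked not to
 * wait and the shared lock is contended.
 */
static int demo_dio_read(unsigned int *ki_flags, bool lock_contended,
                         bool dio_supported, enum demo_dio_outcome *outcome)
{
    assert(ki_flags != NULL && outcome != NULL);

    if ((*ki_flags & DEMO_IOCB_NOWAIT) && lock_contended)
        return -EAGAIN;   /* do not sleep waiting for the lock */
    /* Without NOWAIT the real code would block in inode_lock_shared(). */

    if (!dio_supported) {
        /* Clear only the DIRECT bit; every other flag stays as it was. */
        *ki_flags &= ~DEMO_IOCB_DIRECT;
        *outcome = DEMO_DIO_FELL_BACK;
        return 0;
    }

    *outcome = DEMO_DIO_DONE;
    return 0;
}
```

Clearing the bit before falling back is what keeps generic_file_read_iter from trying its own direct path for the same request.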

ssize_t
generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
	size_t count = iov_iter_count(iter);
	ssize_t retval = 0;

	if (!count)
		return 0; /* skip atime */

	if (iocb->ki_flags & IOCB_DIRECT) {
		struct file *file = iocb->ki_filp;
		struct address_space *mapping = file->f_mapping;
		struct inode *inode = mapping->host;
		loff_t size;

		size = i_size_read(inode);
		if (iocb->ki_flags & IOCB_NOWAIT) {
			if (filemap_range_needs_writeback(mapping, iocb->ki_pos,
						iocb->ki_pos + count - 1))
				return -EAGAIN;
		} else {
			retval = filemap_write_and_wait_range(mapping,
						iocb->ki_pos,
					        iocb->ki_pos + count - 1);
			if (retval < 0)
				return retval;
		}

		file_accessed(file);

		retval = mapping->a_ops->direct_IO(iocb, iter);
		if (retval >= 0) {
			iocb->ki_pos += retval;
			count -= retval;
		}
		if (retval != -EIOCBQUEUED)
			iov_iter_revert(iter, count - iov_iter_count(iter));

		/*
		 * Btrfs can have a short DIO read if we encounter
		 * compressed extents, so if there was an error, or if
		 * we've already read everything we wanted to, or if
		 * there was a short read because we hit EOF, go ahead
		 * and return.  Otherwise fallthrough to buffered io for
		 * the rest of the read.  Buffered reads will not work for
		 * DAX files, so don't bother trying.
		 */
		if (retval < 0 || !count || iocb->ki_pos >= size ||
		    IS_DAX(inode))
			return retval;
	}

	return filemap_read(iocb, iter, retval);
}

This is how an IOCB_DIRECT read works in the generic path: being direct, it does not read data from the cache but straight from the file. This part looks like how the file system works in the Nachos experiment, except for filemap_write_and_wait_range, which makes sure that pending writes in the range have been written back, so the data we read is the newest version. If the request is not direct I/O, or a direct read came up short, control reaches filemap_read at the end. That function is in the same file, but it is too long, so we only describe what it does rather than walk through the code.
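The fall-through at the end of the direct branch is worth a closer look: the bytes already read are handed to filemap_read as its third argument, so the total comes out right. Here is a user-space sketch of that accounting under simplifying assumptions: `demo_src` and every `demo_` function are hypothetical, the "file" is a byte array, and `direct_limit` fakes a short direct read like the Btrfs case in the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

struct demo_src {
    const char *data;
    size_t size;
    size_t direct_limit;   /* max bytes one direct read hands back */
};

static size_t demo_min(size_t a, size_t b) { return a < b ? a : b; }

/* Stand-in for ->direct_IO(): may return fewer bytes than requested. */
static ssize_t demo_direct_read(const struct demo_src *src, char *buf,
                                size_t count, size_t pos)
{
    size_t n;

    if (pos >= src->size)
        return 0;
    n = demo_min(demo_min(count, src->size - pos), src->direct_limit);
    memcpy(buf, src->data + pos, n);
    return (ssize_t)n;
}

/* Stand-in for filemap_read(): continues after already_read bytes. */
static ssize_t demo_buffered_read(const struct demo_src *src, char *buf,
                                  size_t count, size_t *pos,
                                  ssize_t already_read)
{
    size_t n = 0;

    if (*pos < src->size) {
        n = demo_min(count, src->size - *pos);
        memcpy(buf, src->data + *pos, n);
        *pos += n;
    }
    return already_read + (ssize_t)n;
}

/* Mirrors the control flow of generic_file_read_iter(), minus locking. */
static ssize_t demo_generic_read(const struct demo_src *src, char *buf,
                                 size_t count, size_t *pos, bool direct)
{
    ssize_t retval = 0;

    assert(src && buf && pos);
    if (count == 0)
        return 0;

    if (direct) {
        retval = demo_direct_read(src, buf, count, *pos);
        if (retval >= 0) {
            *pos += (size_t)retval;
            count -= (size_t)retval;
            buf += retval;
        }
        /* Error, everything read, or EOF reached: stop here. */
        if (retval < 0 || count == 0 || *pos >= src->size)
            return retval;
    }
    return demo_buffered_read(src, buf, count, pos, retval);
}
```

With a 10-byte source and `direct_limit` set to 4, a direct request for 8 bytes gets 4 from the direct path and the other 4 from the buffered path, and the caller still sees a single return value of 8.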

The kernel reads in a buffered way (since this is not direct I/O) and first looks for the page in the page cache. On a hit, the data is taken from the cache and the job is done quickly. On a miss, functions are called to read the page from disk. The kernel also makes sure the cached page is up to date so that we do not get stale data; if it is not, readpage is called to fetch the current contents. If the page is not in the cache at all, page_cache_alloc allocates a page, which is then added to the page cache and its LRU list. Finally the data is copied to user space through copy_page_to_iter.

Here the analysis of read is finally done. We have had a glimpse of how Linux reads files; though it is not that detailed, and not as professional as a real kernel programmer's work, it does help a lot.
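Since filemap_read is too long to quote, here is a toy user-space model of the idea instead. It is only a sketch under heavy assumptions: the page size is 4 bytes, the "cache" is a fixed array indexed by page number, the "disk" is a string, and all `demo_` names are invented. The real function also deals with readahead, locking and errors, none of which appear here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

#define DEMO_PAGE_SIZE   4   /* tiny pages keep the example readable */
#define DEMO_CACHE_SLOTS 8   /* assumes the file fits in 8 pages */

struct demo_page {
    bool present;            /* page exists in the cache */
    bool uptodate;           /* page contents match the disk */
    char data[DEMO_PAGE_SIZE];
};

struct demo_mapping {
    const char *disk;        /* backing store */
    size_t size;             /* file size in bytes */
    struct demo_page pages[DEMO_CACHE_SLOTS];
    unsigned int disk_reads; /* how often we had to go to "disk" */
};

static size_t demo_min(size_t a, size_t b) { return a < b ? a : b; }

/* Look the page up; on a miss, "allocate" it and fill it from disk. */
static struct demo_page *demo_get_page(struct demo_mapping *m, size_t index)
{
    struct demo_page *page;

    assert(index < DEMO_CACHE_SLOTS);
    page = &m->pages[index];
    if (!page->present)
        page->present = true;          /* page_cache_alloc + insert */
    if (!page->uptodate) {             /* the readpage step */
        size_t start = index * DEMO_PAGE_SIZE;
        size_t bytes = demo_min(DEMO_PAGE_SIZE, m->size - start);

        memset(page->data, 0, DEMO_PAGE_SIZE);
        memcpy(page->data, m->disk + start, bytes);
        page->uptodate = true;
        m->disk_reads++;
    }
    return page;
}

/* Copy up to count bytes starting at pos, one page at a time. */
static ssize_t demo_filemap_read(struct demo_mapping *m, char *buf,
                                 size_t count, size_t pos)
{
    size_t total = 0;

    while (count > 0 && pos < m->size) {
        size_t index = pos / DEMO_PAGE_SIZE;
        size_t offset = pos % DEMO_PAGE_SIZE;
        size_t n = demo_min(DEMO_PAGE_SIZE - offset,
                            demo_min(m->size - pos, count));
        struct demo_page *page = demo_get_page(m, index);

        memcpy(buf + total, page->data + offset, n);  /* copy_page_to_iter */
        pos += n;
        count -= n;
        total += n;
    }
    return (ssize_t)total;
}
```

Reading the same range twice goes to "disk" only the first time; the second read is served entirely from the cached pages, which is the whole point of the buffered path.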
