Author: Li-Wen Wang

[Notes] Practical Machine Learning – Data II
Notes on Mu Li's Practical Machine Learning course: Data II.

1.3 Web Data Scraping

Web scraping is an important way to obtain training data. Unlike general web crawling, scraping targets specific information rather than harvesting entire pages.
📽️ Lecture video: 1.3 网页数据抓取【斯坦福21秋季:实用机器学习中文版】_哔哩哔哩_bilibili
📓 Slides: lecture notes

Tools: Websites usually have defenses against automated data collection, so we generally need a headless browser such as Selenium. We also need plenty of IP addresses: once an IP gets banned, we can switch to another and keep scraping.

Case study: scraping house prices (details omitted). Using a cloud platform makes it easy to rotate IPs; the hardware requirements are low, so the cost is modest.

Legal issues: Scraping itself is not illegal, but avoid sensitive data: (1) data that requires a login to access is generally sensitive; (2) do not scrape copyrighted content; (3) respect sites that explicitly forbid scraping. For commercial use, consult a lawyer.

1.4 Data Labeling

📽️ Lecture video: 1.4 数据标注【斯坦福21秋季:实用机器学习中文版】_哔哩哔哩_bilibili
📓 Slides: lecture notes
Flowchart.

Only partial labels, no budget, no appetite for manual labeling: semi-supervised learning. With only a small amount of labeled data, semi-supervised learning can help label the rest. The underlying assumptions are: continuity (samples with similar features share the same label); clustering (the data forms clusters, and samples in the same cluster share a label); and the manifold assumption (the intrinsic complexity of the data is lower than the input dimensionality).

Self-training: first train a model on the limited labels, then use it to predict the unlabeled data; fold the high-confidence predictions into the labeled set and retrain (a minimal sketch of this loop follows the excerpt). Deeper models or ensembles of several models can be used here, because these models are only used to produce labels and will never be deployed.

Only partial labels, budget available: crowdsourced labeling. Platforms such as Amazon Mechanical Turk can be used to label large amounts of data manually. Keep in mind: make the labeling UI as simple as possible; the simpler it is, the lower the bar for annotators, so you can recruit more of them at a lower price. For example, a 365-class classification problem can be reduced to 365 binary questions that annotators answer with yes or no. Watch the cost, and control quality: annotation quality varies across annotators.

Active learning: an upgrade of self-training that brings human annotators into the loop. Label the samples the model is least certain about (hard for the current model, hence informative) and the samples on which multiple models disagree (diverse, hence informative). The difference from self-training: self-training simply drops uncertain samples, while active learning picks the most representative uncertain samples and sends them out for labeling.

Quality control: annotators often make mistakes. The simplest remedy is to have the same sample labeled by several people, which costs more. Better: only send disputed samples (e.g., those on which two annotators disagree) for additional labeling, and hire high-quality annotators.

Only partial labels, no budget, but supervision signals wanted: weak supervision. Define heuristic rules (heuristic programs) that filter samples and generate labels. Such labels are usually worse than human labels but still good enough to train a model. For example, in a review classifier (human-written vs. machine-generated reviews), we can search large sets of reviews for keywords and run other models (e.g., a sentiment model) to find the reviews that look most "human" or most "machine".
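A minimal sketch of the self-training loop described above, assuming scikit-learn with a LogisticRegression stand-in for the (possibly much deeper) labeling model; the threshold and round count are illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_lab, y_lab, X_unlab, threshold=0.95, rounds=5):
    # Repeatedly: train, pseudo-label confident samples, merge, retrain.
    X, y = X_lab.copy(), y_lab.copy()
    model = LogisticRegression(max_iter=1000).fit(X, y)
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        proba = model.predict_proba(X_unlab)
        keep = proba.max(axis=1) >= threshold   # high-confidence predictions only
        if not keep.any():
            break
        X = np.vstack([X, X_unlab[keep]])
        y = np.concatenate([y, model.classes_[proba[keep].argmax(axis=1)]])
        X_unlab = X_unlab[~keep]                # the rest stays unlabeled
        model = LogisticRegression(max_iter=1000).fit(X, y)
    return model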
[Notes] Practical Machine Learning – Data I
Notes on Mu Li's Practical Machine Learning course: Data I.

1.1 Course Introduction

Industry has many machine learning applications; for example, in traditional manufacturing, sensors can be used to automatically identify failing equipment.
📽️ Lecture videos: 跟李沐学AI的个人空间_哔哩哔哩_Bilibili
📓 Slides: Syllabus - Practical Machine Learning

The machine learning workflow:
- Define the problem: find the most critical problem, the one in a project where results matter most.
- Data: collect high-quality data; privacy must be considered.
- Train the model: models are getting more and more complex, and training costs keep rising.
- Deploy the model: serve it in real time.
- Monitor: keep monitoring; bias issues may appear.

Roles in machine learning:
- Software engineer: develops and maintains data pipelines and the model training and serving pipelines.
- Domain expert: has business insight and spots the problems worth solving.
- Data scientist: full-stack; data mining, model training, and deployment.
- ML expert: model customization and tuning.

1.2 Data Acquisition

External datasets come in three types:
- Academic datasets: clean and simple, but the choice is limited and they are usually small.
- Competition datasets: closer to real machine learning applications; the downsides are that they are still simple and few in number.
- Raw data: the most flexible, but requires the most preprocessing.

Generated datasets: GANs, simulation, and data augmentation (a small augmentation sketch follows).
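As a concrete illustration of the data augmentation route (a hedged sketch, not from the course; it uses torchvision's standard transforms): each epoch sees a randomly cropped, flipped, and color-jittered variant of every image, effectively enlarging the dataset.

from torchvision import transforms

# A typical training-time augmentation pipeline for image data.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])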
TLSC (Test-time Local Statistics Converter)
Revisiting Global Statistics Aggregation for Improving Image Restoration (eliminating the "misalignment" in image restoration for a large performance boost)

Paper: Revisiting Global Statistics Aggregation for Improving Image Restoration (AAAI 2022)
arXiv: https://arxiv.org/pdf/2112.04491.pdf
Code: https://github.com/megvii-research/tlsc

Reference:
[1] 消除图像复原中的“misalignment”,性能大幅提升 https://mp.weixin.qq.com/s/HRm6wPDBeopsmLtilF5K-A
[2] https://xiaoqiangzhou.cn/post/chu_revisiting/

The problem: "Specifically, with the increasing size of patches for testing, the performance increases in the case of UNet while it increases and then decreases in UNet-IN and UNet-SE cases." For a trained UNet tested on patches, performance improves as the input patch size grows (comment: boundary effects shrink and border information is fused better). But if the UNet contains IN (Instance Norm, which normalizes over the spatial plane) or SE (channel attention / Squeeze-and-Excitation, which contains a global average pooling over the spatial plane), performance first rises and then drops as the input grows. This shows that for existing models the common strategy of "train on patches, test on full images" is problematic when IN or SE is used. The different global statistics aggregation between training and testing causes the "misalignment", i.e., the statistics follow inconsistent distributions [1]. The gap between patch-based statistics at training time and full-image statistics at test time leads to distribution shift and degrades restoration performance, a phenomenon that has been widely overlooked.

To solve this, the authors propose the simple TLSC (Test-time Local Statistics Converter) scheme, which replaces the global statistics aggregation at test time with a local one. Without retraining or fine-tuning, it substantially improves existing image restoration models [1].

The solution: Taking SE as an example, the original design uses global average pooling. In the improved local statistics calculation, each pixel is averaged within a local region whose size equals the training patch size. For border pixels, replicate padding is applied before the local statistics calculation. The paper also extends the method to Instance Norm.

Experimental results: The original HiNet contains InstanceNorm; with the proposed method its performance improves. The original MPRNet contains SE modules; with the proposed method its performance also improves. The statistics' distributions align better as well.

Another observation: "Full-image training causes severe performance loss in low-level vision task." This is explained by full-image training lacking cropping augmentation [2].

Code (truncated excerpt from the official repo):

# ------------------------------------------------------------------------
# Copyright (c) 2021 megvii-model. All Rights Reserved.
# ------------------------------------------------------------------------
"""
## Revisiting Global Statistics Aggregation for Improving Image Restoration
## Xiaojie Chu, Liangyu Chen, Chengpeng Chen, Xin Lu
"""
import torch
from torch import nn
from torch.nn import functional as F
from basicsr.models.archs.hinet_arch import HINet
from basicsr.models.archs.mprnet_arch import MPRNet

train_size = (1, 3, 256, 256)

class AvgPool2d(nn.Module):
…
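A hedged sketch of the core idea (my own illustration, not the official repo code; the window size is assumed to equal the 256-pixel training patch side): swap the global spatial mean inside SE/IN for a per-pixel local mean with replicate padding at the borders.

import torch
from torch.nn import functional as F

def local_mean(x, window=256):
    # Per-pixel mean over a local window instead of the whole feature map.
    # Border pixels are handled by replicate padding, as the paper describes;
    # `window` is assumed to match the training patch size.
    h, w = x.shape[-2:]
    kh, kw = min(window, h), min(window, w)
    ph, pw = kh - 1, kw - 1
    x = F.pad(x, (pw // 2, pw - pw // 2, ph // 2, ph - ph // 2), mode="replicate")
    return F.avg_pool2d(x, (kh, kw), stride=1)

# In an SE block, replace the global pooling
#     s = x.mean(dim=(-2, -1), keepdim=True)
# with the local statistics at test time:
#     s = local_mean(x, window=256)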
BoTNet (Bottleneck Transformers)
BoTNet (2021-01): embedding self-attention into ResNet

Paper title: Bottleneck Transformers for Visual Recognition
Paper: https://arxiv.org/abs/2101.11605

Abstract: We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and no other changes, our approach improves upon the baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency. Through the design of BoTNet, we also point out how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without any bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO Instance Segmentation benchmark using the Mask R-CNN framework; surpassing the previous best published single model and single scale results of ResNeSt evaluated on the COCO validation set. Finally, we present a simple adaptation of the BoTNet design for image classification, resulting in models that achieve a strong performance of 84.7% top-1 accuracy…
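A hedged sketch of the replacement (my own simplification: BoTNet's attention layer additionally uses 2D relative position encodings, which are omitted here): a global multi-head self-attention module over an H×W feature map that can stand in for the 3×3 convolution in the last three bottleneck blocks.

import torch
from torch import nn

class MHSA2d(nn.Module):
    # Simplified global multi-head self-attention on a 2D feature map.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Conv2d(dim, dim * 3, 1, bias=False)  # joint q,k,v projection
        self.scale = (dim // heads) ** -0.5

    def forward(self, x):
        b, c, h, w = x.shape
        # (b, 3c, h, w) -> three (b, heads, c/heads, h*w) tensors
        q, k, v = self.qkv(x).reshape(b, 3, self.heads, c // self.heads, h * w).unbind(1)
        attn = (q.transpose(-2, -1) @ k * self.scale).softmax(-1)  # (b, heads, n, n)
        out = (v @ attn.transpose(-2, -1)).reshape(b, c, h, w)     # weighted sum of values
        return out

# Usage sketch: in a bottleneck block, use MHSA2d(dim) where the 3x3 conv was.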
Python Code Formatting Tools
Project: https://pypi.org/project/autopep8/
Reference article: https://www.cnblogs.com/wuyongcong/p/9066531.html

Autopep8: autopep8 automatically formats Python code to conform to the PEP 8 style guide. It uses the pycodestyle tool to determine which parts of the code need reformatting, and it can fix most of the formatting issues pycodestyle reports.

Install:
pip install autopep8

Use on the command line:
autopep8 --in-place --aggressive --aggressive YOUR_PYTHON_FILE.py

Configure it in PyCharm:
Open the menu: File ---> Settings ---> Tools ---> External Tools, click the + button at the top left of the window, and fill in:
Name: autopep8  # or any other name
Program: autopep8  # autopep8 must already be installed
Arguments: --in-place --aggressive --aggressive $FilePath$
Working directory: $ProjectFileDir$
Advanced Options -> Output filters: $FILE_PATH$\:$LINE$\:$COLUMN$\:.*

Use it: Tools ---> External Tools ---> autopep8, one click and done.
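A small, hypothetical before/after illustration of the kind of issues autopep8 fixes (the exact output depends on the version and aggressiveness level):

# before: E401 (multiple imports on one line), E225 (missing whitespace
# around operator), E704 (statement on same line as def)
import os,sys
def f(x):return x+1

# after `autopep8 --in-place --aggressive --aggressive file.py` (approximate):
import os
import sys


def f(x):
    return x + 1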
Cheat Sheet
Unzip a set of "*.tar.gz" files:

for f in *.tar.gz; do tar -xvf "$f"; done

List folders under a path (note that os.path.isdir needs the full path, hence the join):

import os
path = "path"
list_dirs = [name for name in os.listdir(path) if os.path.isdir(os.path.join(path, name))]
HDF5 Python API
HDF5: Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data. Documentation for the HDF5 Python API: https://docs.h5py.org/en/stable/build.html

Installation with conda:
conda install h5py

Installation with pre-built wheels:
pip install h5py

Usage (writing; YOUR_DATA_LIST stands for your own list of arrays):

import time
import h5py

h5_file_name = "my_data.h5"
h5_writer = h5py.File(h5_file_name, 'a')  # open (or create) the file that stores the data
for index in range(len(YOUR_DATA_LIST)):
    h5_writer.create_dataset(f"{index:06}", data=YOUR_DATA_LIST[index])  # save one record
    if (index + 1) % 1500 == 0:  # force a flush to disk once every 1500 records
        print('Finish processing one section\n')
        h5_writer.close()
        time.sleep(1)
        h5_writer = h5py.File(h5_file_name, 'a')
        continue
# when finished
h5_writer.close()

References
[1] https://en.wikipedia.org/wiki/Hierarchical_Data_Format
[2] https://docs.h5py.org/en/stable/build.html

Acknowledgement: Codes are from Qiuliang Ye.
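For completeness, a minimal sketch of reading the records back (standard h5py API; dataset names follow the zero-padded scheme used above):

import h5py

with h5py.File("my_data.h5", "r") as f:
    for key in f.keys():      # dataset names, e.g. "000000"
        arr = f[key][()]      # load the dataset into a NumPy array
        print(key, arr.shape)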
[Reading Notes] Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization
Source
Paper: [ICCV 2017] https://arxiv.org/abs/1703.06868
Authors: Xun Huang, Serge Belongie
Code: https://github.com/xunhuang1995/AdaIN-style

Contributions
In this paper, the authors present a simple yet effective approach that for the first time enables arbitrary style transfer in real time. Arbitrary style transfer takes a content image $C$ and an arbitrary style image $S$ as inputs, and synthesizes an output image with the same content as $C$ and the same style as $S$.

Background
Batch Normalization
Given an input batch $x \in \mathbb{R}^{N \times C \times H \times W}$, batch normalization (BN) normalizes the mean and standard deviation of each individual feature channel:
$$ \mathrm{BN}(x)=\gamma\left(\frac{x-\mu(x)}{\sigma(x)}\right)+\beta $$
where $\gamma, \beta \in \mathbb{R}^{C}$ are affine parameters learned from data, and $\mu(x), \sigma(x) \in \mathbb{R}^{C}$ are the mean and standard deviation computed across the batch and spatial dimensions, independently for each channel:
$$ \mu_{c}(x)=\frac{1}{N H W} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{n c h w} $$
$$ \sigma_{c}(x)=\sqrt{\frac{1}{N H W} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W}\left(x_{n c h w}-\mu_{c}(x)\right)^{2}+\epsilon} $$

Instance Normalization
The original feed-forward stylization method [51] utilizes BN layers after the convolutional layers. Ulyanov et al. [52] found using Instance Normalization…
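For reference, the key operation from the paper's title: the AdaIN layer aligns the channel-wise mean and standard deviation of the content features $x$ to those of the style features $y$ (statistics computed per sample and per channel, as in IN), with no learnable affine parameters:
$$ \operatorname{AdaIN}(x, y)=\sigma(y)\left(\frac{x-\mu(x)}{\sigma(x)}\right)+\mu(y) $$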
[Reading Notes] Collaborative Distillation for Ultra-Resolution Universal Style Transfer
Source
Authors: Huan Wang, Yijun Li, Yuehai Wang, Haoji Hu, Ming-Hsuan Yang
Paper: [CVPR 2020] https://arxiv.org/abs/2003.08436
Code: https://github.com/mingsun-tse/collaborative-distillation

Contributions
- It proposes a new knowledge distillation method, "Collaborative Distillation", based on the exclusive collaborative relation between the encoder and its decoder.
- It proposes to restrict the student to learning a linear embedding of the teacher's outputs, which boosts its learning (a sketch of this idea follows the excerpt).
- Experimental work is done with different stylization frameworks, such as WCT and AdaIN.

Related Works
Style Transfer
WCT: Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., & Yang, M. H. (2017). Universal style transfer via feature transforms. arXiv preprint arXiv:1705.08086.
AdaIN: Huang, X., & Belongie, S. (2017). Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1501-1510).

Model Compression
- low-rank decomposition
- pruning
- quantization
- knowledge distillation
Knowledge distillation is a promising model compression method that transfers the knowledge of large networks (called the teacher) to small networks (called the student), where the knowledge can be softened probabilities (which reflect the inherent class similarity structure known as dark knowledge) or sample relations (which…
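A hedged sketch of the "linear embedding" idea as I read it (names and shapes are my own illustration, not the paper's code): the student's feature is mapped through a learned linear layer and regressed onto the frozen teacher's feature.

import torch
from torch import nn

# Illustrative shapes: teacher feature 512-dim, student feature 128-dim.
t_feat = torch.randn(8, 512)                       # teacher encoder output (frozen)
s_feat = torch.randn(8, 128, requires_grad=True)   # student encoder output

embed = nn.Linear(128, 512, bias=False)            # learned linear embedding
loss = nn.functional.mse_loss(embed(s_feat), t_feat.detach())
loss.backward()                                    # trains the student + embedding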
NVIDIA Xavier Deployment: ONNXRuntime and TensorRT
Installation

Virtual environment - archiconda. Install archiconda from the terminal:

wget https://github.com/Archiconda/build-tools/releases/download/0.2.3/Archiconda3-0.2.3-Linux-aarch64.sh
sh Archiconda3-0.2.3-Linux-aarch64.sh

Reference: https://blog.csdn.net/qq_40691868/article/details/114362278?spm=1001.2014.3001.5501

[Not necessary] To land in the system Python environment when opening a terminal, comment out the line "conda activate base" in ".bashrc":

# added by Archiconda3 0.2.3 installer
# >>> conda init >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$(CONDA_REPORT_ERRORS=false '/home/jetson/archiconda3/bin/conda' shell.bash hook 2> /dev/null)"
if [ $? -eq 0 ]; then
    \eval "$__conda_setup"
else
    if [ -f "/home/jetson/archiconda3/etc/profile.d/conda.sh" ]; then
        . "/home/jetson/archiconda3/etc/profile.d/conda.sh"
        CONDA_CHANGEPS1=false
        #conda activate base
    else
        \export PATH="/home/jetson/archiconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda init <<<

Create a new environment. It is better to keep the Python version consistent with the system's:

conda create --name mytest python=3.6.9
conda activate mytest

Connect the prebuilt packages to the virtual environment (did not work for me; needs further verification). In the Python interactive shell there is no opencv package, so the conda environment has to be allowed to see the global/user site packages again. Enter the virtual environment with:

export PYTHONNOUSERSITE=0
conda activate <YOUR_ENVIRONMENT>
…
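The excerpt cuts off before the ONNXRuntime part of the post; for context, a minimal inference sketch with onnxruntime (the file name model.onnx is a placeholder; the input name is queried from the session rather than assumed):

import numpy as np
import onnxruntime as ort

# Placeholder model file; substitute your own exported ONNX model.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name              # query the real input name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = sess.run(None, {input_name: x})           # None -> return all outputs
print(outputs[0].shape)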