Deep Learning – California Sunshine

Starting Apache Kafka + Zookeeper with Docker Compose

This YML file is handy: it starts Apache Kafka and a three-node Zookeeper ensemble in a Docker environment.

version: '3.1'

services:
  zoo1:
    image: zookeeper:3.6.2
    restart: always
    hostname: zoo1
    ports:
      - 2181:2181
    environment:
      ZOO_MY_ID: 1
      ZOO_SERVERS: server.1=0.0.0.0:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=zoo3:2888:3888;2181
  zoo2:
    image: zookeeper:3.6.2
    restart: always
    hostname: zoo2
    ports:
      - 2182:2181
    environment:
      ZOO_MY_ID: 2
      ZOO_SERVERS: server.1=zoo1:2888:3888;2181 server.2=0.0.0.0:2888:3888;2181 server.3=zoo3:2888:3888;2181
  zoo3:
    image: zookeeper:3.6.2
    restart: always
    hostname: zoo3
    ports:
      - 2183:2181
    environment:
      ZOO_MY_ID: 3
      ZOO_SERVERS: server.1=zoo1:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=0.0.0.0:2888:3888;2181
  kafka:
    image: wurstmeister/kafka:2.13-2.6.0
    ports:
     - "9092:9092"
    expose:
     - "9093"
    environment:
      KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9093,OUTSIDE://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_LISTENERS: INSIDE://0.0.0.0:9093,OUTSIDE://0.0.0.0:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_ZOOKEEPER_CONNECT: zoo1:2181,zoo2:2181,zoo3:2181
      KAFKA_CREATE_TOPICS: "topic_test:1:1"

Then run:

 docker-compose.exe -f .\zookeeper_stack.yml up

All services will be brought up!

Train a simple model to recognize my son

Here is a summary of a personal project: training a simple deep-learning model to recognize my son's face.

So far, the hardest part has been preparing the data.

How to Recognize Henry’s Face from Yongqiang Li

Use boost::asio to implement a simple thread pool in C++

For a personal machine learning project, I would like to train a model that can recognize my son.

The first step is to extract my son's face from about 20K pictures. OpenCV helps a lot here: it is very handy and already ships with a face cascade classifier. The remaining question is how to make the image processing faster. As a Java programmer, my first instinct was to process the images on multiple threads. After some searching, a lot of comments led me to boost::asio.

I was a C++ developer before C++11. I have to say that C++ has improved greatly over the last several years. With boost::asio, it is much easier to implement a simple thread pool in C++, similar to Java concurrency.

thread_pool.h

#ifndef THREAD_POOL_H
#define THREAD_POOL_H

#include <boost/asio.hpp>
#include <vector>
#include <memory>
#include <boost/thread.hpp>

class ThreadPool;

class Worker
{
public:
  Worker(ThreadPool&);
  void operator()();

private:
  ThreadPool& m_pool;
};

class ThreadPool
{
public:
  explicit ThreadPool(size_t);
  ~ThreadPool();

  template<typename F>
  void enqueue(F f);

private:
  std::vector<std::unique_ptr<boost::thread>> m_workThreads;
  boost::asio::io_service m_ioService;
  boost::asio::io_service::work m_work;

  friend class Worker;
};

template<typename F>
void ThreadPool::enqueue(F f)
{
  m_ioService.post(f);
}

#endif

thread_pool.cpp

#include "thread_pool.h"

Worker::Worker(ThreadPool& aPool) : m_pool(aPool)
{
}

// Each worker thread blocks in run() and executes the tasks posted to the io_service.
void Worker::operator()()
{
  m_pool.m_ioService.run();
}

// m_work keeps run() from returning while the task queue is empty.
ThreadPool::ThreadPool(size_t sizeOfWorkerThreads) : m_work(m_ioService)
{
  for (size_t i = 0; i < sizeOfWorkerThreads; ++i)
  {
    m_workThreads.push_back(
      std::unique_ptr<boost::thread>(new boost::thread(Worker(*this))));
  }
}

ThreadPool::~ThreadPool()
{
  // stop() makes run() return as soon as possible, so tasks that are still
  // queued at this point may never run.
  m_ioService.stop();

  for (auto& workThread : m_workThreads)
  {
    workThread->join();
  }
}
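
One caveat with the destructor above: io_service::stop() asks run() to return as soon as possible, so tasks still waiting in the queue when the pool is destroyed can be dropped. As a rough sketch of the alternative behaviour, here is a pool built only on the C++ standard library whose destructor drains the queue before joining its threads. The class name DrainingThreadPool and all of its members are made up for illustration; this is not the Boost-based pool from this post, just one possible way to get the "finish everything, then exit" semantics.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the Boost-based ThreadPool; standard library only.
class DrainingThreadPool
{
public:
  explicit DrainingThreadPool(std::size_t workerCount)
  {
    for (std::size_t i = 0; i < workerCount; ++i)
    {
      m_workers.emplace_back([this] { workerLoop(); });
    }
  }

  // Waits until every queued task has run, then joins the workers.
  ~DrainingThreadPool()
  {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_closing = true;
    }
    m_wakeUp.notify_all();
    for (auto& worker : m_workers)
    {
      worker.join();
    }
  }

  void enqueue(std::function<void()> task)
  {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_tasks.push(std::move(task));
    }
    m_wakeUp.notify_one();
  }

private:
  void workerLoop()
  {
    for (;;)
    {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_wakeUp.wait(lock, [this] { return m_closing || !m_tasks.empty(); });
        if (m_tasks.empty())
        {
          return;  // closing and nothing left to do
        }
        task = std::move(m_tasks.front());
        m_tasks.pop();
      }
      task();  // run outside the lock so other workers can proceed
    }
  }

  std::mutex m_mutex;
  std::condition_variable m_wakeUp;
  std::queue<std::function<void()>> m_tasks;
  bool m_closing = false;
  std::vector<std::thread> m_workers;
};
```

With this sketch, a pool that goes out of scope right after the last enqueue() call still runs every task it was given before the destructor returns.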

Image processing code to make pictures smaller.

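#include <filesystem>
#include <iostream>
#include <string>
#include <opencv2/opencv.hpp>
#include "thread_pool.h"

// Assumption: the original does not show what "fs" aliases;
// std::filesystem (C++17) is used here.
namespace fs = std::filesystem;

int doMain5(int, char**)
{
  // NOTE: the pool is destroyed when this function returns, and its destructor
  // calls io_service::stop(), so tasks still waiting in the queue may be dropped.
  ThreadPool threadPool(10);

  const fs::path imgFolderPath("H:\\export_files");

  auto index = 1;
  for (const auto& p : fs::directory_iterator(imgFolderPath))
  {
    // Skip files that were already resized.
    if (p.path().string().find("smaller") != std::string::npos)
    {
      std::cout << index << ": Skip the path of " << p.path() << std::endl;
      index++;
      continue;
    }

    // Capture by value: the task runs after this loop iteration has ended.
    threadPool.enqueue([p, imgFolderPath, index] {
      cv::Mat img = cv::imread(p.path().string());
      if (img.empty())
      {
        return;  // not a readable image; leave the file untouched
      }

      // Scale both dimensions to 40% of the original.
      cv::Mat imgSmallerOne;
      cv::resize(img, imgSmallerOne, cv::Size(), 0.4, 0.4);

      // stem().string() avoids the quotes that streaming a path would add.
      const std::string newName = p.path().stem().string() + "_smaller.jpg";

      std::cout << index << ": Write resized img into file of " << newName << std::endl;
      const fs::path newPath = imgFolderPath / newName;
      cv::imwrite(newPath.string(), imgSmallerOne);

      // Remove the original once the smaller copy has been written.
      fs::remove(p);
    });

    index++;
  }

  return 0;
}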

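A simple CPU vs. GPU performance comparison with ND4j

I have been learning Deep Learning recently. ND4j is a Java library similar to Python's NumPy that supports both CPU and GPU backends. I was curious how much the two differ in performance, so I ran a small test.

Installing CUDA Toolkit 8.0

The latest CUDA Toolkit version is 9.1, but the latest ND4j release (0.9.1) does not support it yet. (According to the discussion on the ND4j forum, the master branch already supports 9.1.) Release 0.9.1 supports only CUDA 7.5 and 8.0; I installed 8.0 for this experiment. Download the installer and the patch here.

After the installation finishes, the machine needs a reboot.

Maven configuration for ND4j

In Maven, switch the Nd4j artifactId to select the CPU or GPU backend.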

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <nd4j.version>0.9.1</nd4j.version>
    <!--<nd4j.backend>nd4j-native-platform</nd4j.backend>-->
    <nd4j.backend>nd4j-cuda-8.0-platform</nd4j.backend>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.nd4j</groupId>
      <artifactId>${nd4j.backend}</artifactId>
      <version>${nd4j.version}</version>
    </dependency>
    ...
  </dependencies>

nd4j-native-platform is the CPU backend and nd4j-cuda-8.0-platform is the GPU backend.

A simple test

Below is a simple test that multiplies two 10K by 10K matrices (a matrix product via mmul).

    final int dimension = 10000;
    System.out.println("Do something big...");

    Stopwatch sw = Stopwatch.createUnstarted();
    sw.start();

    INDArray bigND = Nd4j.rand(dimension, dimension);
    INDArray bigND2 = Nd4j.rand(dimension, dimension);
    INDArray bigND3 = bigND2.mmul(bigND);

    sw.stop();
    System.out.println("Spent " + sw.elapsed(TimeUnit.MILLISECONDS) + "ms");

My CPU is an i7-5820K and my GPU is a GTX 970 3.5GB. The result was really surprising: 497 ms on the GPU versus 7827 ms on the CPU, a difference of roughly 16x.
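
To make the arithmetic behind that ratio explicit, here is a small back-of-the-envelope sketch. The helper names speedup and impliedGflops are made up for illustration, and the 2*n^3 operation count is the usual textbook estimate for a dense n by n matrix product, not something measured here. Note that the timed region in the test also covers the two Nd4j.rand calls, so these figures are only a rough reading of the numbers, not a proper benchmark.

```cpp
// Ratio of two elapsed times, e.g. CPU time over GPU time.
double speedup(double slowMs, double fastMs)
{
  return slowMs / fastMs;
}

// Rough throughput in GFLOP/s for multiplying two n x n matrices,
// assuming about 2 * n^3 floating-point operations in total.
double impliedGflops(double n, double elapsedMs)
{
  const double operations = 2.0 * n * n * n;
  const double seconds = elapsedMs / 1000.0;
  return operations / seconds / 1e9;
}
```

Calling speedup(7827.0, 497.0) gives about 15.7, which is where the "roughly 16x" comes from. Under the same assumptions, impliedGflops(10000.0, 497.0) comes out near 4000 GFLOP/s and impliedGflops(10000.0, 7827.0) near 256 GFLOP/s.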

Now I finally understand why NVIDIA's stock price has risen so much! 😁