Building a Dataset (Part 2): Use Bing to Build a Cleaner Dataset, Faster!


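Background: the dataset-building method from the previous part works, but the resulting dataset is not clean (for example, it picks up a lot of advertisement images), so this time the data is collected through Bing. Note that although Bing can be reached without a VPN, you do need to register an account and fill in some information. It is not much trouble, though!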

Method: first, here is the original article by the author: https://www.pyimagesearch.com/2018/04/09/how-to-quickly-build-a-deep-learning-image-dataset/

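  • Register an account for Microsoft's Bing Image Search API, which is free for 7 days: https://azure.microsoft.com/en-us/services/cognitive-services/bing-image-search-api/
  • Once registration succeeds, you will see a page listing your API keys; note down Key 1 and Key 2
  • It is worth reading the two documents below, in particular the .json response format in the first one and the meaning of count and offset in the second one. They will help you understand the code that follows more quickly.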
    • https://docs.microsoft.com/en-us/azure/cognitive-services/bing-image-search/quickstarts/python
    • https://docs.microsoft.com/en-us/azure/cognitive-services/Bing-Web-Search/paging-search-results
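  • Next, install the requests package in your virtual environment (the script below also imports cv2, so OpenCV must be available there as well, for example through the opencv-python package)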
  • $ pip install requests
  • Create a new file named "search_bing_api.py" and copy the code below into it
    • # import the necessary packages
      from requests import exceptions
      import argparse
      import requests
      import cv2
      import os
      # construct the argument parser and parse the arguments
      ap = argparse.ArgumentParser()
      ap.add_argument("-q", "--query", required=True,
          help="search query to search Bing Image API for")    # the term you want to search for
      ap.add_argument("-o", "--output", required=True,
          help="path to output directory of images")
      args = vars(ap.parse_args())
      
      args['output'] = os.path.join(args['output'], args['query'])
      if not os.path.exists(args['output']):
          os.makedirs(args['output'])
      
      # set your Microsoft Cognitive Services API key along with (1) the
      # maximum number of results for a given search and (2) the group size
      # for results (the number of results requested per call)
      API_KEY = "YOUR_API_KEY_GOES_HERE"    # paste Key 1 or Key 2 from the registration step here
      MAX_RESULTS = 250
      GROUP_SIZE = 50    # the maximum is 150; its meaning is explained in the second documentation link above
      # set the endpoint API URL
      URL = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"
      
      # when attempting to download images from the web both the Python
      # programming language and the requests library have a number of
      # exceptions that can be thrown so let's build a list of them now
      # so we can filter on them
      EXCEPTIONS = set([IOError, FileNotFoundError,
          exceptions.RequestException, exceptions.HTTPError,
          exceptions.ConnectionError, exceptions.Timeout])
      
      # store the search term in a convenience variable then set the
      # headers and search parameters
      term = args["query"]
      headers = {"Ocp-Apim-Subscription-Key" : API_KEY}
      params = {"q": term, "offset": 0, "count": GROUP_SIZE}
      # make the search
      print("[INFO] searching Bing API for '{}'".format(term))
      search = requests.get(URL, headers=headers, params=params)
      search.raise_for_status()
      # grab the results from the search, including the total number of
      # estimated results returned by the Bing API
      results = search.json()
      estNumResults = min(results["totalEstimatedMatches"], MAX_RESULTS)
      print("[INFO] {} total results for '{}'".format(estNumResults,
          term))
      # initialize the total number of images downloaded thus far
      total = 0
      
      # loop over the estimated number of results in `GROUP_SIZE` groups
      for offset in range(0, estNumResults, GROUP_SIZE):
          # update the search parameters using the current offset, then
          # make the request to fetch the results
          print("[INFO] making request for group {}-{} of {}...".format(
              offset, offset + GROUP_SIZE, estNumResults))
          params["offset"] = offset
          search = requests.get(URL, headers=headers, params=params)
          search.raise_for_status()
          results = search.json()
          print("[INFO] saving images for group {}-{} of {}...".format(
              offset, offset + GROUP_SIZE, estNumResults))
      
          # loop over the results
          for v in results["value"]:
              # try to download the image
              try:
                  # make a request to download the image
                  print("[INFO] fetching: {}".format(v["contentUrl"]))
                  r = requests.get(v["contentUrl"], timeout=30)
                  # build the path to the output image
                  ext = v["contentUrl"][v["contentUrl"].rfind("."):]
                  p = os.path.sep.join([args["output"], "{}{}".format(
                      str(total).zfill(8), ext)])
                  # write the image to disk
                  with open(p, "wb") as f:
                      f.write(r.content)
              # catch any errors that would prevent us from downloading the
              # image
              except Exception as e:
                  # skip the image when the exception is an instance of one
                  # of the expected types (isinstance also matches subclasses
                  # such as SSLError or ReadTimeout); re-raise anything else
                  # so the code below never runs with an undefined or stale `p`
                  if isinstance(e, tuple(EXCEPTIONS)):
                      print("[INFO] skipping: {}".format(v["contentUrl"]))
                      continue
                  raise
              # try to load the image from disk
              image = cv2.imread(p)
              # if the image is `None` then we could not properly load the
              # image from disk (so it should be ignored)
              if image is None:
                  print("[INFO] deleting: {}".format(p))
                  os.remove(p)
                  continue
              # update the counter
              total += 1
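  • Create the output directory and run the script, passing the search term and the parent output directory (the script appends the query to this path and creates dataset/charmander itself):
    • $ mkdir dataset
      $ python search_bing_api.py --query "charmander" --output dataset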
      [INFO] searching Bing API for 'charmander'
      [INFO] 250 total results for 'charmander'
      [INFO] making request for group 0-50 of 250...
      [INFO] saving images for group 0-50 of 250...
      [INFO] fetching: https://fc06.deviantart.net/fs70/i/2012/355/8/2/0004_c___charmander_by_gaghiel1987-d5oqbts.png
      [INFO] fetching: https://th03.deviantart.net/fs71/PRE/f/2010/067/5/d/Charmander_by_Woodsman819.jpg
      [INFO] fetching: https://fc05.deviantart.net/fs70/f/2011/120/8/6/pokemon___charmander_by_lilnutta10-d2vr4ov.jpg
      ...
      [INFO] making request for group 50-100 of 250...
      [INFO] saving images for group 50-100 of 250...
      ...
      [INFO] fetching: https://38.media.tumblr.com/f0fdd67a86bc3eee31a5fd16a44c07af/tumblr_nbhf2vTtSH1qc9mvbo1_500.gif
      [INFO] deleting: dataset/charmander/00000174.gif
      ...
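
Two steps of the script above are pure logic and can be tried without an API key: turning count and offset into a list of requests, and picking a file extension from contentUrl. The sketch below isolates both. The helper names page_offsets and safe_extension are hypothetical (they do not appear in the original article), the example.com URL is a placeholder, and the ".jpg" fallback is an assumption. Unlike the rfind(".") approach in the script, safe_extension looks only at the path part of the URL, so a query string such as "?width=500" does not end up in the file name.

```python
import os
from urllib.parse import urlparse


def page_offsets(total_results, group_size):
    """Return the `offset` to send with each request.

    Each request asks for `group_size` results (the `count` parameter),
    so the offsets advance in steps of `group_size` until `total_results`
    is covered.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return list(range(0, total_results, group_size))


def safe_extension(url, default=".jpg"):
    """Return the lowercase file extension taken from the URL path only.

    The query string and fragment are ignored. `default` is an assumed
    fallback for URLs whose path has no extension at all.
    """
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lower()
    return ext if ext else default


if __name__ == "__main__":
    # Same numbers as the script: MAX_RESULTS = 250, GROUP_SIZE = 50
    print(page_offsets(250, 50))
    # Placeholder URL with a query string after the file name
    print(safe_extension("https://example.com/img/charmander.PNG?width=500"))
```

With MAX_RESULTS = 250 and GROUP_SIZE = 50 this yields five requests, which matches the "group 0-50 of 250", "group 50-100 of 250" lines in the log above.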