2
\$\begingroup\$

I wrote a script that uses an uploaded excel file to import data to the database. It checks for the image first if it can be downloaded without any error then insert the rest of the data. With ten or hundred thousands of rows this takes hours. Is it possible to make it faster somehow?

<?php

namespace App\Services\Import;

use App\Models\Tags;
use App\Models\Videos;
use GuzzleHttp\Client;
use App\Models\Stars;
use App\Models\Categories;
use Illuminate\Support\Str;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;
use Maatwebsite\Excel\Facades\Excel;
use Illuminate\Support\Facades\Storage;
use Spatie\MediaLibrary\MediaCollections\Models\Media;
use Intervention\Image\ImageManager;
use Illuminate\Support\Facades\File;

class DataImport
{
    public function checkURL($url) {
        $ch = curl_init($url);

        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_SSLVERSION,'all'); 
        curl_exec($ch);

        $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $ssl = curl_getinfo($ch, CURLINFO_SSL_VERIFYRESULT);
        curl_close($ch);

        if($http_code >= 400 && $http_code < 600 || $ssl === 1) {
            return false;
        }

        try {
            $stream = stream_context_create (array("ssl" => array("capture_peer_cert" => true)));
            if(!empty($url)){
                $read = @fopen($url, "rb", false, $stream);
            
            if ($read !== false) {
                $cont = @stream_context_get_params($read);
                $ssl = ($cont["options"]["ssl"]);
                fclose($read);
        
                return (isset($ssl['peer_certificate']) && !is_null($ssl['peer_certificate'])) ? true : false;
            } else {
                return false;
            }
        }
        }catch(ErrorException $e){
            error_log($e->getMessage());
            return false;
        }
    }


    public function import($path)
    {
        $tempDirectoryName = 'temp';
        $data = Excel::toArray([], $path);
        $dataChunks = array_chunk($data[0], 25);
        foreach ($dataChunks as $chunk) {
            DB::beginTransaction();
            foreach ($chunk as $row) {
                if (!empty($row[3])) {
                    $valid = $this->checkURL($row[3]);
                    if ($valid === true) {
                        $video = Videos::with(['categories', 'stars', 'tags'])->where('url', $row[0])->first();
                        if (!$video) {
                            $slug = Str::slug($row[1], '-');
                            $i = 1;
                            while (Videos::where('slug', $slug)->exists()) {
                                $slug = Str::slug($row[1], '-') . '-' . $i;
                                $i++;
                            }
                            
                            $video = Videos::create([
                                'url' => $row[0],
                                'title' => $row[1],
                                'slug' => $slug,
                                'length' => intval($row[2]),
                                'code' => intval($row[7]),
                                'embed_url' => $row[4],
                                'channel' => $row[9]
                            ]);
                        }

                        try {
                            $media = $video->addMediaFromUrl($row[3])->toMediaCollection('video_index_images', 'public');
                        } catch (\Exception $e) {
                            Log::error("Error importing media for video ID {$video->id}: {$e->getMessage()}");
                            continue;
                        }

                        if (!$media && !$media->exists()) {
                            Log::error("Error importing media for video ID {$video->id}: Invalid media file");
                            continue;
                        }

                            $title = str_replace('_', ' ', $row[8]);
                            $slug = strtolower(str_replace(' ', '-', $title));
                            $title = ucfirst(strtolower($title));
                            $category = Categories::firstOrCreate(['title' => $title], ['slug' => $slug]);
                            $video->categories()->syncWithoutDetaching([$category->id]);

                            $names = explode(',', $row[6]);
                            foreach ($names as $name) {
                                $name = trim($name);
                                $name = str_replace('_', '', $name);
                                $slug = strtolower(str_replace(' ', '-', $name));
                                $star = Stars::firstOrCreate(['name' => $name], ['slug' => $slug]);
                                $video->stars()->syncWithoutDetaching([$star->id]);
                            }

                            $tags = explode(',', $row[5]);
                            $tagIds = [];
                            foreach ($tags as $tagName) {
                                $tagName = str_replace('-', ' ', trim($tagName));
                                if (!empty(trim($tagName))) {
                                    $tag = Tags::firstOrCreate(['name' => $tagName, 'slug' => Str::slug($tagName)]);
                                    $tagIds[] = $tag->id;
                                }
                            }

                            $video->tags()->detach();
                            $video->tags()->attach($tagIds);

                            $tempDirectory = storage_path('media-library/temp/'.$tempDirectoryName);
                            if (file_exists($tempDirectory)) {
                                \File::deleteDirectory($tempDirectory);
                            }
                    }
                }
            }
            DB::commit();
        }
    }
} 

I know the code is doing a lot insede the loop but all these steps are necessary in order to work as intended.

The excel file consists of 10 columns:

  1. url to the video
  2. title of the video
  3. lenght in seconds
  4. url to an index image of the video
  5. embed code
  6. tags
  7. actors/actresses in the video
  8. unique code
  9. category
  10. name of the site where the video is from

The site will be a collection of videos, so the external media files are always needed at the time of the upload. Some of the url for the index images are missing etc., those videos wont get uploaded, thats why I first check for the images, and only then upload the rest.

The script is called in a filamentphp resource file using an action:

->actions([ Tables\Actions\Action::make('Run') 
   ->icon('heroicon-o-play') 
   ->requiresConfirmation() 
   ->form([ FileUpload::make('excel') 
   ->label('Excel')
   ->acceptedFileTypes(['application/vnd.openxmlformatsofficedocument.spreadsheetml.sheet']) 
   ->preserveFilenames(),]) 
   ->action(function ($data) { 
      $path = 'public/' . $data['excel'];
      $importScript = new DataImport();
      $importScript->import($path); 
      Storage::delete($path);
     })
  ])
\$\endgroup\$
8
  • 1
    \$\begingroup\$ The main task of your code is to import data from an Excel file, but we have no idea what that file contains. It is also exceedingly difficult, even if we have the Excel file, to test your code. That being say, why not first only import the Excel file data? That should be pretty fast. Then when you actually need any of the external media files, you can import, and save, them at that moment. \$\endgroup\$ Commented Apr 19, 2023 at 19:49
  • \$\begingroup\$ the excel file consists of 10 columns: 1. url to the video 2.title of the video 3. lenght in seconds 4. url to an index image of the video 5. embed code 6. tags 7. actors/actresses in the video 8. unique code 9. category 10. name of the site where the video is from The site will be a collection of videos, so the external media files are always needed at the time of the upload. Some of the url for the index images are missing etc., those videos wont get uploaded, thats why i first check for the images, and only then upload the rest. \$\endgroup\$ Commented Apr 19, 2023 at 20:25
  • \$\begingroup\$ The script is called in a filamentphp resource file using an action: ->actions([ Tables\Actions\Action::make('Run') ->icon('heroicon-o-play') ->requiresConfirmation() ->form([ FileUpload::make('excel') ->label('Excel') >acceptedFileTypes(['application/vnd.openxmlformatsofficedocument.spreadsheetml.sheet']) ->preserveFilenames(),]) ->action(function ($data) { $path = 'public/' . $data['excel']; $importScript = new DataImport(); $importScript->import($path); Storage::delete($path);})]) \$\endgroup\$ Commented Apr 19, 2023 at 20:29
  • 5
    \$\begingroup\$ Please add information, like you posted in the comments above, to your question. Code in comments is quite unreadable. I was thinking: Importing hundred thousands of videos, if that's actually what your code does, will always take a very long time, and use a lot of space, you cannot speed that up unless you implement some kind of parallel downloading. \$\endgroup\$ Commented Apr 19, 2023 at 20:32
  • \$\begingroup\$ A transaction per chunk? Why is that? \$\endgroup\$ Commented Apr 20, 2023 at 2:52

1 Answer 1

3
\$\begingroup\$

Some quick comments:

  • The function checkURL is called in a loop, but inside that function you close the Curl connection every time. Instead you could speed up things by reusing the Curl handle - opening a new connection adds some overhead. Plus, there is overhead in the SSL negotiation. Some background reading
  • This is not necessarily the biggest bottleneck in your code, but if you call the function often while fetching data from the same remote host, then you should be able to reduce run time significantly.
  • Better yet, you can fetch several URLs in parallel, for example using the curl_multi_init function. This requires some refactoring of your code. You'll also need to keep in mind that the target website may apply rate limiting. Too many connections will strain the server and possibly get you banned so be gentle.
  • But since you are not spoofing the user agent, it's already obvious you're a bot anyway
  • If you are fetching "ten or hundred thousands of rows" this is always going to take some time, especially if the files you're downloading are somewhat large. But again, parallel processing should help.
  • This code cannot easily be tested in proper live conditions, so as said above it would help if you could benchmark the different parts of your code and figure out which exactly is taking a long time.
\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.