Hi Guys,
Need some advice here - I've got a CSV file which is 6.5GB - financial data since 2008. I need to delete 2 columns and also change the format of one of the numbers so each line has 6 decimals (e.g. 1.278400). I've tried CSVed / UniCSVed and reCSVeditor - none of them can handle a file that size. Any suggestions? Thanks in advance =) |
Might be a long way around, but import to SQL/MYSQL/Access, perform manipulation via SQL, file export back to csv.
IMO importing into SQL will handle the large data better than trying to edit it in one go, but that's just my opinion. |
wow, a 6.5GB CSV file - that's pretty amazing!
|
Might be a long way around, but import to SQL/MYSQL/Access, perform manipulation via SQL, file export back to csv.
This. Working with this data in a SQL database will be much easier. That's a lot of data - exactly what SQL is designed for. |
I wouldn't be trying to open it in any sort of graphical editor. They aren't going to play nicely displaying a file of that size. I'd be surprised if you couldn't do this in a bash or perl script. Not my forte, so I won't volunteer one.
I don't know that dumping it all into an SQL database is necessary. Seems like an awkward fix, but I guess if you don't have any programmers around, it's an easy-to-implement solution. If you really get stuck, PM me the layout of the file and I can throw together a console app that will do it. |
Import it into SQL Server. You will need to use the BULK INSERT command.
So basically create a table that matches the current CSV's columns, with the data definitions you want and _all_ the columns to begin with. I wouldn't bother creating any indexes on the table unless you really need them for the queries, as they'll slow down the import unnecessarily. Then run this:

BULK INSERT Your_SQL_Server_Table
FROM 'c:\path_to_your_file.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
)
GO

That will handle a massive data import very quickly. Depending on the server you're running on, 6.5GB shouldn't take too long to BULK INSERT as it is very efficient. Then you can just drop the columns:

ALTER TABLE my_table DROP COLUMN column_a
GO
ALTER TABLE my_table DROP COLUMN column_b
GO

Then export the data using the standard SQL Server import/export task into CSV format. For more info on bulk import see: http://blog.sqlauthority.com/2008/02/06/sql-server-import-csv-file-into-sql-server-using-bulk-insert-load-comma-delimited-file-into-sql-server/ |
One character is one byte of plain text, so it's about 6.5 billion characters. :P
|
I need to delete 2 columns and also change the format of one of the numbers
Opec's solution is probably the go, but if you have SQL Server handy then you could also do something like this in SSIS - there's a graphical ETL package builder and it's designed to do exactly the kind of thing you're asking for. It will laugh at your puny 6.5GB CSV file and just stream the data in & out as necessary.
One character is one byte of plain text
Countries with non-Latin based languages would like a word with you kos ;) |
6 979 321 856 bytes at 1024 per kilobyte
that seems like a lot of plain text to be managing outside of a database |
Countries with non-latin based languages would like a word with you kos ;) Somehow I doubt that DoCZero is working in Melbourne with 5 years of financial data stored in a CSV written in kanji.. ;P |
there is an easier way, though it depends on how well-formed the data fields in the CSV are
e.g. take input.csv with all of its columns
and say you want to reduce it to just the ones you need
run it through a simple cut filter: cut -d, -f1,3,5 < input.csv > out.txt
breaking it down, -d, sets the field delimiter to a comma and -f1,3,5 keeps only fields 1, 3 and 5
that should shave it down to a manageable size so that you can load it into one of those other programs to reduce the decimal places of the number field that needs doing |
If you can wait until I get home, I can probably write a little Java app to do it for you. I've already written a really good CSV parser that both streams and handles encapsulated text properly, so it would be easy as hell to wrap it in an app where you just tell it to stream the file to an output file with only particular columns.
|
Hi Guys,
First of all - thanks for all the effort you put into answering! Man I love this community for giving ideas. (Also big thanks to raven for offering to make something to do this!) Regarding the content - it is every tick for the currency pair EURUSD from 2008 (in 2011 the average turnover was $250,000,000,000 traded per day. Ref: http://www.forexturtle.com/fxoptions/forextradingvolume.aspx). I'll be trying SQL tonight - if I get stuck I'll definitely be back for more advice! I've got a lot of this to do, so I figure I'll try to learn it myself =) Thanks again! |
Now that's not true - yesterday I was trying to use CSV editing software - today I'll be trying SQL and cut as per this thread. What I got from this thread was more direction than I had yesterday. Yes, I got offered some great help and I'm grateful - but really I want to learn it myself and not be a burden on the kind people here. So I'll give it a good shot using the advice in this thread.
|
Somehow I doubt that DoCZero is working in Melbourne with 5 years of financial data stored in a CSV written in kanji.. ;P hehe most likely correct, I was just being a smart arse. Murphy's law practically guarantees that somewhere in that 6.5GB is a character that will mess up an ETL process though. :/ |
God damn that's a huge file. But yeah, I agree with what some of the other guys have already said re: SQL
|
At least with SQL, if you need to manipulate it or grab chunks it's so much easier, and if you expect to be given this task again in the future, set up a cron job to import the data daily into SQL so you always have running forex data to work with.
I am working on a project atm where normal people just throw around CSV files full of data from dozens of systems, all in different formats, and it's like !@#@#@ DIAF. So we are data warehousing feeds of info into SQL and making it more easily available to people, via either predefined reporting and charting (Highcharts), analyst-based SQL access, etc. One of the systems we grab data from is by AT&T, and whoever the DBA was that came up with the structure is mad - the system spits out about 1-2GB a day, whereas a similar system providing similar information is 1/10th the size. |
Post a demo line for us here (strip anything confidential or just change it), say what you want done, and someone will drop the one minute's worth of work into a command line you can run.
Just make sure you keep a backed-up copy somewhere in case someone posts something that donkeys the data. I doubt that'll happen, but why take the chance for the sake of a few GB of disk space? |
(Manipulating CSV on a line-by-line basis like this is ridiculously easy, regardless of the file size. Importing it into any kind of DB program is huge overkill.)
|
If this stupid forum had 'likes' I would like trillion's post! I didn't know cut could talk CSV so easily; I would have written a little PHP script to do it.
I would assume (without knowing anything about it) that cut would operate on a line-by-line basis so using it should use (basically) no memory and take just as long as it takes to read 6.5GB and write it back out again. I agree with blahnana that importing would be overkill unless you're planning to manipulate the data with SQL later. |
if for some reason you don't have the will to install and directory-link a Linux VM to use cut in, there's also a nice simple way to do this with PowerShell
e.g. data1.csv is comma-delimited with a handful of columns, data2a, data5a and data7a among them
running this in a PowerShell prompt should do the business:

PS C:\Users\User> Import-Csv C:\Users\User\Data1.csv | Select data2a,data5a,data7a | Export-Csv -Path C:\Users\User\Data2.csv -NoTypeInformation

which, as you can probably guess, will write just those three columns out to Data2.csv
there's probably a way to treat the number column as an object and limit the decimal places to a specific value - something like the calculated property below might do it, but I'll leave the details with you
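a rough, untested sketch of that idea, assuming data7a is the numeric column you want forced to 6 decimal places (swap in your real column names):

PS C:\Users\User> Import-Csv C:\Users\User\Data1.csv | Select data2a,data5a,@{Name='data7a';Expression={ '{0:F6}' -f [double]$_.data7a }} | Export-Csv -Path C:\Users\User\Data2.csv -NoTypeInformation

the @{Name=...;Expression={...}} bit is a calculated property, and '{0:F6}' is just .NET numeric formatting to 6 decimal places |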
Powershell will most likely do the job, but I've found the performance of the Import-Csv cmdlet is pretty horrible on files even under 100 MB. I'd shudder to think how long that command would take to complete on a 6+GB file.
|
sure you can do it in pure .NET and have it parse and write out as fast as your hardware I/O can manage - just channel it through FileStream I/O to tear through the file, something like the stream reader/writer sketch below
it's the only way to be sure
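a very rough sketch of that streaming idea in PowerShell (untested - it assumes plain comma-separated fields with no quoted values and no header row, that you want to keep fields 1, 3 and 5 like the earlier cut example, and that field 3 is the price to force to 6 decimals - adjust the paths and indexes to suit):

$reader = New-Object System.IO.StreamReader 'C:\data\input.csv'
$writer = New-Object System.IO.StreamWriter 'C:\data\output.csv'
while (($line = $reader.ReadLine()) -ne $null) {
    $f = $line.Split(',')                    # naive split - assumes no quoted commas
    $price = '{0:F6}' -f [double]$f[2]       # force 6 decimal places on field 3
    $writer.WriteLine($f[0] + ',' + $price + ',' + $f[4])
}
$reader.Close()
$writer.Close()

because it streams one line at a time through the .NET readers it never holds more than a row in memory, which is what makes it so much quicker than Import-Csv on a file this size |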
So - job done, ended up using PowerShell - just let it go overnight - cmd:

Import-Csv eurusd.csv | Select time,ask,bid | Export-Csv -Path eurusd2.csv -NoTypeInformation

Trillion - cheers for the tip about using PowerShell. I also tried something called csvfix, which seemed to work, but then the file wasn't edited how I needed. =) |
yeah the cmdlet throughput speed is pokey slow
|
lol, Achievement Unlocked: Number Cruncher
|
Dahzel, don't tease. Achievements were the best innovation trog has ever made happen.
that's just how I remember it, cheers |